
Thesis on

Software and Hardware Reliability of Fault Tolerant Systems

Submitted for the award of DOCTOR OF PHILOSOPHY

Degree in Mathematics

By

Sulekha Rani

UNDER THE SUPERVISION OF

Dr. Madhu Jain

I.I.T. Roorkee, Roorkee

Prof. S. C. Agrawal

Shobhit University, Meerut

SHOBHIT INSTITUTE OF ENGINEERING & TECHNOLOGY

A DEEMED-TO-BE UNIVERSITY

MODIPURAM, MEERUT-250110 (INDIA) 2011

Certificate

This is to certify that the thesis entitled “Software and Hardware Reliability of

Fault Tolerant Systems” which is being submitted by Ms Sulekha Rani for the degree

of Doctor of Philosophy in Mathematics to School of Basic and Applied Sciences,

Shobhit University, Meerut, a deemed-to-be-University, established by GOI u/s 3 of

UGC Act 1956, is a record of bonafide investigations and extensions of the problems

carried out by her under my supervision and guidance.

To the best of my knowledge, the thesis embodies the work of the candidate

herself and has not been submitted to any other University or Institution for the award of

any degree or diploma.

It is further certified that she worked with me for the required period in the School

of Basic and Applied Sciences, Shobhit University, Meerut.

Date : Prof. S. C. Agrawal (Internal Supervisor)

Prof. S.C. Agrawal M.A., Ph.D.

Director School of Basic and Applied Sciences

Shobhit University, Meerut E-mail: [email protected]

Indian Institute of Technology Roorkee, Roorkee

Dr. Madhu Jain M.Phil.(Gold Med.), Ph. D, D.Sc. Department of Mathematics, I.I.T, Roorkee (India)

Office: (01332) 285521 Resi: (01332) 285506 Mob: 09412811021 E-mail: [email protected] [email protected]

Certificate

This is to certify that the thesis entitled “Software and Hardware Reliability of

Fault Tolerant Systems” which is being submitted by Ms Sulekha Rani for the degree

of Doctor of Philosophy in Mathematics to School of Basic and Applied Sciences,

Shobhit University, Meerut, a deemed-to-be-University, established by GOI u/s 3 of

UGC Act 1956, is a record of bonafide investigations and extensions of the problems

carried out by her under my supervision and guidance.

To the best of my knowledge, the matter embodied in this thesis is the original

work of the candidate and has not been submitted for the award of any other degree or

diploma.

It is further certified that she worked with me for the required period in the

Institute of Basic Science, Khandari, Agra. She continued to take my guidance at I. I. T.

Roorkee too.

Date : Dr. Madhu Jain (External Supervisor)


Declaration

I, hereby, declare that the work presented in this thesis entitled “Software and

Hardware Reliability of Fault Tolerant Systems” for the award of degree of Doctor of

Philosophy, submitted to School of Basic and Applied Sciences, Shobhit University,

Meerut, a deemed-to-be-University, established by GOI u/s 3 of UGC Act 1956, is an

authentic record of my own research work carried out under the supervision of Prof. S.

C. Agrawal and Dr. Madhu Jain.

I also declare that the work embodied in the present thesis

(i) is my original work and has not been copied from any Journal/Thesis/Book, and

(ii) has not been submitted by me for any other degree or diploma.

(Sulekha Rani)

Acknowledgements

First and foremost I would like to express my heartiest gratitude to Dr. Madhu

Jain, my external guide for giving me the opportunity to complete my doctoral

program under her scholarly and able guidance. She has set up a benchmark for me

not only as a mentor but also as a person in the society through selfless service of

imparting the knowledge in a distinguished manner. I gratefully thank Prof. S. C.

Agrawal, Director of School of Basic and Applied Sciences, Shobhit University,

Meerut, my internal guide not only for providing me the opportunity to carry out the

research for the degree of Ph. D. in Mathematics but also for keeping me up for new and

innovative ideas. He has always been a driving force behind my research activities.

I earnestly wish to express my deepest feeling of gratitude to Prof. G. C. Sharma,

former Pro-Vice Chancellor, Dr. B. R. A. University, Agra, who has always been willing

to provide advice, academic support and help whenever I needed during the period of

my research work. I have to attribute most of my success to his timely suggestions

throughout my years of stay while carrying out my research work at the Institute of Basic Science,

Agra.

I am thankful to the Chancellor Dr. Shobhit Kumar, Pro Chancellor Kunwar

Shekhar Vijendra, Vice-Chancellor Prof. R. P. Agarwal of Shobhit University for

providing congenial environment for conducting research in the University. I would

always be thankful to my reverend teachers for their affection, guidance, valuable

inputs and blessings bestowed upon me throughout the journey of my academic career.

It is difficult to thank my best friend Dr. Priyanka Agarwal, who like a guiding

beacon showed me the path towards achieving my academic milestone while she was

my room partner at Agra and Roorkee both. She stood by me, walked with me hand in

hand, and has played a vital role not only advising me from time to time but also in

compilation and completion of my research work. I am thankful to my senior

colleagues Ritu Gupta and Shweta Upadhahayay for their help and encouragement

during the entire processing of this work. Their discussions have always had a

substantial and motivating impact on me. I would like to extend my sincere thanks to

all fellow research workers who helped me in the completion of my thesis. They are

named alphabetically as Mrs Anuradha, Mr Naresh, Mrs Preeti, Mrs Ragni, Mr Ram

Singh, Mrs Richa, Mrs Sapna, Mr Satya Prakash, Mr Vivek and many others.

I am heartily thankful to Amma Ji (Mrs Kanak Lata Jain) for her blessings and

love bestowed upon me during the period of my research work at Agra and IIT

Roorkee. I extend my heartiest thanks to my parents Shri Hari Singh and Smt. Roshni

Devi for sowing the seedling of education, nurturing the plant of learning and being

there by my side throughout my academic career. Despite all the hardships in their

life they made their best efforts and gave support to make me capable of achieving my

academic and professional goals. I can proudly say that whatever I am, is the result of

their hard work, morals, values and blessings. I am highly obliged to my elder brother

Mr. Bhanu and my bhabhi Mrs. Usha for their heartiest love, attention, helpful attitude

and affection. I always received considerable encouragement and constant support

from them.

I also thank my other friends who have directly or indirectly contributed in my

research endeavor and without whom life at work and out of work would have been a

lot less enjoyable.

I would like to acknowledge the valuable contributions of librarians of IIT

Delhi, IIT Roorkee and Delhi University and of their staff, who helped me in getting

the information and data for the thesis, whenever required.

I thank the Almighty God, who had faithfully called, guided and brought me to

an expected end of this work. To Him I bow humbly.

(Sulekha Rani)

Research Papers Published/Accepted/ Communicated For Publication in Refereed Journals/Proceedings

1. Jain, M., Agrawal, S. C. and Rani, S. (2010): Reliability Analysis of Distributed

Software and Hardware System, International Journal of Information and

Computer Science (IJICS), 13, pp. 1-11.

2. Jain, M., Agrawal, S. C. and Rani, S. (2010): Reliability Modeling of Hardware

and Software Interactions, Journal, Mathematics Today, 26, pp. 46-57.

3. Jain, M., Agrawal, S. C. and Rani, S. (2010): Reliability Modeling of Hardware

and Software Interactions, Advances in Information Theory and Operations

Research, Om Parkash (Ed.), VDM Verlag, Germany, pp. 230-240.

4. Jain, M., Agrawal, S. C. and Rani, S. (2010): Availability Analysis of K-out-of-

N:G System with two types of Failure and Common Cause Failure, Accepted for

the publication in Journal of International Academy of Physical Sciences (In

Press).

5. Jain, M. and Rani, S. (2010): Reliability Modeling of Software Fault

Tolerance in A Clustered Architecture, International Journal of Information

and Computer Science (IJICS), 13 (In Press).

6. Jain, M. and Rani, S. (2011): Availability Analysis of Imperfect Fault Coverage

System with Reboot and Common Cause Failure, Electronic Proceedings

International Conference on Advances in Modeling, Optimization and

Computing (AMOC-2011), IIT Roorkee, 5-7 Dec. 2011, ISBN 81-86224-71-2,

pp. 433-442.

7. Jain, M., Agrawal, S. C. and Rani, S. (2011): Software Reliability Growth Model

with N-version Programming, Testing-Effort and Imperfect Debugging,

Accepted for the publication in Transactions of Physical and Life Sciences (In

Press).

8. Jain, M. and Rani, S. (2011): Availability Analysis for Repairable System with

Warm Standby, Switching Failure and Reboot Delay, Revised for publication in

International Journal of Mathematics in Operations Research.

9. Jain, M. and Rani, S. (2011): Transient Analysis of Hardware and Software

Systems with Warm Standbys and Switching Failures, Revised for publication in

International Journal of Mathematics in Operations Research.

10. Jain, M., Agrawal, S. C. and Rani, S. (2011): Quasi Renewal Processes for

Software and Hardware Systems with Common Cause Failure, Communicated

for publication in CSI Journal.

11. Jain, M. and Rani, S. (2011): Availability of Hardware-Software System with

Standbys, Switching Failure and Reboot Delay, Communicated for publication in

International Journal of Engineering.

12. Jain, M. and Rani, S. (2011): Availability Analysis of Imperfect Fault Coverage

System with Reboot and Common Cause Failure, Communicated for publication

in International Journal of Engineering.

Participation in the Conferences/Workshops/Seminars

1. Attended International Conference on ‘Soft Computing For Problem Solving

(SocProS-2011)’ held at IIT Roorkee, Roorkee, December 20-22, 2011.

2. International Conference on ‘Advances in Modeling, Optimization and Computing

(AMOC-2011)’ held at IIT Roorkee, Roorkee, December 5-7, 2011. Presented paper

entitled “Availability Analysis of Imperfect Fault Coverage System with Reboot and

Common Cause Failure”, Abstract published in Souvenir, pp. 91.

3. Annual Conference of ‘Vijnana Parishad of India & The Global Society of

Mathematical and Allied Science’ held at School of Basic and Applied Sciences,

Shobhit University, Meerut, March 24-26, 2011. Presented paper entitled

“Software Reliability Growth Model with N-version Programming, Testing-Effort

and Imperfect Debugging”, Abstract published in Souvenir, pp. 12.

4. 11th International Conference of ‘International Academy of Physical Sciences’ held

at Institute of Interdisciplinary Studies, University of Allahabad, Allahabad,

February 20-22, 2010. Presented paper entitled “Availability Analysis of K-out-of-

N: G system with Two Types of Failure and Common Cause Failure”, Abstract

published in Souvenir, pp. 25.

5. 2nd National Seminar on ‘Recent Trends in Advancement of Mathematical And

Physical Sciences’ held at D. N. College, Meerut, January 30-31, 2010. Presented

paper entitled “A Study Reliability Analysis of Distributed Software and Hardware

Systems with spares”, Abstract published in Souvenir, pp. Maths 6.

6. Attended Workshop on ‘Mathematical Modeling and Related Optimization

Techniques’ held at University of Delhi, Delhi, December 14-17, 2009.

7. Annual Conference of ‘Vijnana Parishad of India & National Symposium’ held at

Jaypee Institute of Engineering & Technology, Guna, December 4-6, 2009.

Presented paper entitled “Reliability Analysis of Distributed Software and Hardware

Systems”, Abstract published in Souvenir, pp. 53.

8. 11th National Conference of ‘Indian Society of Information Theory and Applications

(ISITA)’ held at Guru Govind Singh Khalsa College, Sarhali (Tarn Taran), Amritsar,

October 24-26, 2009. Presented paper entitled “Reliability Modeling of Hardware

and Software Interactions”, Abstract published in Souvenir, pp. 32.

9. Attended workshop on ‘Optimization Techniques and Their Applications to Various

Disciplines’ held at Guru Govind Singh Khalsa College, Sarhali (Tarn Taran),

Amritsar, October 19-23, 2009.

10. 14th Annual Conference of ‘Gwalior Academy Of Mathematical Sciences (GAMS)

and A Symposium on Computational Mathematics and its Application to

Engineering, Management and Biology’ held in Department of Applied Sciences

and Humanities, IPS College of Technology and Management, Gwalior, July 17-19,

2009. Presented paper entitled “Reliability Modelling of Software Fault Tolerance in

a Clustered Architecture”.

Contents

Preface

List of Tables

List of Figures

Chapter 1: General Introduction
1.1 Motivation
1.2 Fault Tolerance
1.3 Redundancy Issues
1.4 Reliability Analysis
1.5 Review of the Literature
1.6 Organization of the thesis
1.7 Concluding Remarks

Chapter 2: Fault Tolerance Software in A Clustered Architecture
2.1 Introduction
2.2 Reliability Models
2.3 Model for Software Fault Tolerance
2.4 The Analysis
2.5 Numerical Illustration
2.6 Conclusion

Chapter 3: Reliability Modeling of Hardware and Software Interactions
3.1 Introduction
3.2 Model Description
3.3 Governing Equations and Analysis
3.4 Numerical Results
3.5 Conclusion

Chapter 4: Distributed Software and Hardware Systems
4.1 Introduction
4.2 Model Description
4.3 The Equations and Analysis
4.4 Performance Indices
4.5 Numerical Results
4.6 Conclusion

Chapter 5: Transient Analysis of A Hardware-Software System

Section 5(A): Embedded Computer System with Two Types of Failure and Common Cause Failure
5A.1 Introduction
5A.2 Model Description
5A.3 The Analysis
5A.4 Illustration
5A.5 Performance Indices
5A.6 Numerical Results
5A.7 Conclusion

Section 5(B): Hardware and Software Systems with Warm Standbys and Switching Failures
5B.1 Introduction
5B.2 Model Description
5B.3 Governing Equations
5B.4 Special Case
5B.5 Performance Measures
5B.6 Numerical Results
5B.7 Conclusion

Chapter 6: Repairable Redundant System with Reboot Delay

Section 6(A): Availability Analysis of Hardware-Software System with Switching Failure
6A.1 Introduction
6A.2 Model Description and Governing Equations
6A.3 Availability Prediction
6A.4 Performance Measures
6A.5 Sensitivity Analysis
6A.6 Conclusion

Section 6(B): Availability Analysis of Repairable System with Warm Standby and Switching Failure
6B.1 Introduction
6B.2 Model Description
6B.3 Availability Prediction for Three Configuration
6B.4 Transient Solution
6B.5 Numerical Results
6B.6 Conclusion

Chapter 7: Warranty Policy for Hardware and Software with Common Cause Failure
7.1 Introduction
7.2 Model Description
7.3 The Analysis
7.4 Warranty Policy with Repair
7.5 Numerical Results
7.6 Conclusion

Chapter 8: Semi-Markov Models with Common Cause Failure

Section 8A: Redundant System with Rejuvenation
8A.1 Introduction
8A.2 Model Description
8A.3 Semi-Markov Analysis
8A.4 Performance Measures
8A.5 Total Expected Downtime Cost
8A.6 Conclusion

Section 8B: Imperfect Fault Coverage System with Reboot
8B.1 Introduction
8B.2 Model Description
8B.3 The Steady State Availability
8B.4 Special Cases
8B.5 Numerical Results
8B.6 Concluding Remarks

Chapter 9: Software Reliability Growth Model (SRGM) with N-version Programming
9.1 Introduction
9.2 NHPP Model
9.3 Mean Value Function
9.4 Reliability Estimation
9.5 Parameter Estimation
9.6 Total Expected Cost of Software
9.7 Numerical Results
9.8 Concluding Remarks

Future Scope

References


Preface

System managers must pay increasing attention to the interrelated fields of
reliability and fault tolerant systems, without compromising quality, if they are to
survive the competitive pressures of the twenty-first century. In today’s
technological world, nearly everyone depends upon the continued functioning of a wide
array of complex machinery and equipment for everyday safety, security, mobility
and economic welfare. We expect our electric appliances, lights, hospital monitoring
controls, next-generation aircraft, nuclear power plants, data exchange systems and
aerospace applications to function well whenever we need them. When they fail, the
results can be catastrophic: injury or even loss of life. Reliability and availability of
fault tolerant systems are now major design issues in massively parallel distributed
computing systems. Fault tolerant systems, which employ redundant subsystems or
components to safeguard against failure and thereby improve reliability, have been
used for many decades. Examples of systems in which fault tolerance is needed
include mission-critical systems, transaction systems such as banking,
computation-intensive systems, mobile/wireless computing systems and many more.
High performance in terms of speed and computing power is a major design objective
for such systems.

In this thesis our aim is to develop models that cover both fundamental and
theoretical work in the areas of reliability and fault tolerant systems, including system
redundancy, multi-state systems, optimization, component reliability, system reliability,
warranty, availability, etc. An important feature of our investigation is a set of reliability
assessments that can be used to ensure the desired reliability of fault tolerant systems.
The thesis emphasizes mathematical models for evaluating reliability/availability and
shows how these models can be applied to the development of reliable software systems.

The study of various hardware and software components of reliable/fault
tolerant systems is carried out analytically as well as numerically in different chapters.
The sensitivity to different parameters is examined by presenting the numerical results in
tables and graphs. Chapter 1 deals with the fundamental concepts involved in the
hardware and software components for the reliability of fault tolerant systems. Fault
tolerant systems, redundancy, reliability methodology, availability and their applications
in system characterization are described, along with some historical background and
recent developments on the subject. The outline of the thesis and concluding remarks
are given at the end of the chapter. The reliability of fault tolerant systems is the main
theme of chapter 2. The systems are characterized by a number of operating
components and levels based on some assumptions. In chapter 3, we discuss the
interactions of software and hardware failures and formulate a model for the reliability
analysis of software failure and hardware failure due to common cause failure.
Chapter 4 is devoted to a distributed system with operating components, standbys and
common cause failure, for which the reliability of the system is evaluated.

Transient analysis of a hardware-software system is studied in chapter 5. This
chapter is divided into two sections. In section 5(A), we analyze the Markov model for a
K-out-of-N:G system which has N non-identical components and Y warm standby
components with common cause failure. In section 5(B), we analyze a system having
operating components as well as warm standbys under the assumption of switching
failure. Chapter 6 focuses on a repairable system with reboot delay. This chapter is
arranged in two parts. In section 6(A), we study a repairable system with M hardware
and N software components. In section 6(B), we consider the redundant system by
incorporating the set-up time. A warranty policy for hardware and software systems is
proposed in chapter 7.

Chapter 8 deals with semi-Markov models of redundant systems with
common cause failure. This chapter is divided into two sections. In section 8A, we
determine the availability of a redundant system with common cause failure and
rejuvenation by using an embedded Markov chain approach. Section 8B is devoted to
the analysis of the imperfect fault coverage system with reboot delay and recovery using
the supplementary variable technique. Chapter 9 is concerned with a software reliability
growth model with N-version programming, testing effort and imperfect debugging.


We hope that the investigations made in the present study will help system
designers and practitioners deal with the various hardware and software components
involved in reliable/fault tolerant systems. Decision makers and industrial engineers
may be able to improve the quality of service on the basis of our findings. In addition,
the present study should benefit society in tackling the reliability and availability
issues of fault tolerant systems.

(Sulekha Rani)


List of Tables

Table No. Title

1.1 Comparison of Design diversity techniques
2.1 Probabilities of Software fault tolerance for different values of λ for data set I
2.2 Probabilities of Software fault tolerance for different values of λP for data set II
4.1(a) Performance indices for different values of λP
4.1(b) Performance indices for different values of μS
4.1(c) Performance indices for different values of α
5A.1(a) Performance indices for different values of (μ, μh) and (μC, μS)
5A.1(b) System Availability for different values of (λh, λ′h)
5B.1(a) Performance indices for different values of λh
5B.1(b) Performance indices for different values of λS
5B.1(c) Performance indices for different values of α
5B.1(d) Performance indices for different values of μh
5B.1(e) Performance indices for different values of μS
5B.1(f) Performance indices for different values of q
6A.1(a) Performance indices for different values of λd
6A.1(b) Performance indices for different values of λS
6A.1(c) Performance indices for different values of β
6A.1(d) Performance indices for different values of α
6B.1(a) Comparison of availability of three configurations for different values of α
6B.1(b) Comparison of availability of three configurations for different values of β
6B.1(c) Comparison of availability of three configurations for different values of q
8B.1(a) Effects of parameters C, μ and β on the availability for different distributions of repair time
8B.1(b) Effects of parameters θ, μ and β on the availability for different distributions of repair time


List of Figures

Fig. No. Title

1.1 The transition of fault, error and failure in a software life cycle
1.2 Recovery block
1.3 N-version programming
1.4 N-self checking programming
1.5 Triple modular redundancy
1.6 N-modular redundancy
1.7 Reliability of N component system in series
1.8 Reliability of N component system in parallel
1.9 Reliability of N component system in series-parallel
1.10 Reliability of N component system in parallel-series
2.1 State transition diagram for software fault tolerance
2.2 Probabilities of software fault tolerance
2.3(a) Profiles of reliability by varying λ for data set I
2.3(b) Profiles of reliability by varying λ for data set II
2.4(a) Profiles of reliability by varying λP for data set I
2.4(b) Profiles of reliability by varying λP for data set II
2.5(a) Profiles of reliability by varying c for data set I
2.5(b) Profiles of reliability by varying c for data set II
3.1 State transition diagram
3.2(a) Availability (A) vs λ by varying μm
3.2(b) Availability (A) vs μS by varying λ
3.2(c) Availability (A) vs λ by varying P1
3.2(d) Availability (A) vs λ by varying P2
3.2(c) Availability (A) vs λ by varying P′1
3.2(d) Availability (A) vs λ by varying P′2
4.1 State transition diagram
4.2(a) Reliability vs time by varying λP
4.2(b) Reliability vs time by varying μS
4.2(c) Reliability vs time by varying α
5A.1 State transition diagram
5A.2(a) Reliability vs time by varying λ
5A.2(b) Reliability vs time by varying λ′
5A.2(c) Reliability vs time by varying λS
5A.2(d) Reliability vs time by varying λC
5B.1 State transition diagram
5B.2(a) Reliability vs time by varying λh
5B.2(b) Reliability vs time by varying μh
5B.2(c) Reliability vs time by varying λS
5B.2(d) Reliability vs time by varying μS
5B.2(e) Reliability vs time by varying α
5B.2(f) Reliability vs time by varying q
6A.1 State transition diagram
6A.2(a) Availability vs λh by varying q
6A.2(b) Availability vs λh by varying β
6A.2(c) Availability vs λh by varying α
6A.2(d) Membership functions for input parameter λh
6A.3(a) Availability vs μS by varying q
6A.3(b) Availability vs μS by varying β
6A.3(c) Availability vs μS by varying α
6A.3(d) Membership functions for input parameter μS
6B.1 State transition diagram for model 1
6B.2 State transition diagram for model 2
6B.3 State transition diagram for model 3
6B.4 Availability vs time for model 1 by varying (a) μ1 (b) μ2 (c) η1 (d) η2
6B.5 Availability vs time for model 2 by varying (a) μ1 (b) μ2 (c) η1 (d) η2
6B.6(a) Availability vs time by varying μ1 for model 3
6B.6(b) Availability vs time by varying η1 for model 3
6B.6(c) Availability vs time by varying μ2 for model 3
6B.6(d) Availability vs time by varying η2 for model 3
6B.6(e) Availability vs time by varying μ3 for model 3
6B.6(f) Availability vs time by varying η3 for model 3
7.1 Expected cost vs warranty period by varying (i) β1h (ii) βS and (iii) λC
7.2 Standard deviation vs warranty period by varying (i) β1h (ii) βS and (iii) λC
7.2 Coefficient of variation vs warranty period by varying (i) β1h (ii) βS and (iii) λC
8A.1 Redundant system without rejuvenation
8A.2 Redundant system with rejuvenation
8A.3 Failed software rejuvenation model
8B.1 State transition diagram for two unit system
8B.2 Availability vs λ by varying (i) C (ii) β (iii) θ for exponential distribution repair time
8B.3 Availability vs λ by varying (i) C (ii) β (iii) θ for uniform distribution repair time
8B.4 For gamma distributed repair time, the availability vs λ by varying (i) C (ii) β (iii) θ (iv) λS
8B.5 For gamma distributed repair time, the availability vs (i) λ (ii) λS (iii) β (iv) µ by varying r
9.1(i) Mean time vs time by varying a1
9.1(ii) Mean time vs time by varying β11
9.2(i) Expected cost vs time by varying a1
9.2(i) Expected cost vs time by varying a2
9.3(i) Reliability vs time by varying a1
9.3(ii) Reliability vs time by varying b11
9.4 Membership functions for input parameter T

General Introduction

1.1 Motivation

1.2 Fault Tolerance

1.3 Redundancy Issues

1.4 Reliability Analysis

1.5 Review of the Literature

1.6 Outline of the Thesis

1.7 Concluding Remarks

Chapter-1

Chapter-1: General Introduction


1.1 Motivation

Computer hardware and software systems directly or indirectly have a
significant impact on human life. Nowadays, new-generation embedded computers
are demanded by all engineering systems, such as communication, production, atomic
power plants, aircraft, automobiles, etc. Embedded computer systems are
becoming complex in both their design and architecture. An embedded computer
system can be used in real-time systems which may consist of hundreds or even
thousands of interacting software and hardware components. Hardware products can
experience intermittent failures in processing the same input because of the wearing
out of components. Software products may appear to exhibit the same intermittent
failure characteristics, but their behavior is deterministic. With the advancement of
technology, software has become an essential part of various systems, such as
air traffic control, fighter aircraft, space shuttles, automated guided missiles,
telecommunications, process control in nuclear plants and factories, defense systems,
and many more. Even in our day-to-day life, many gadgets and automobiles
are software controlled. There are many well-known cases of tragic
consequences due to software failures. That is why a very high degree of reliability
is necessary in many popular software packages. To avoid failures and
faults, the reliability of the hardware and software needs to be predicted during the
design and development phases.

Fault tolerant systems play a major role in process control,
transportation, electronic equipment, space, communications and many other areas
that affect our lives. Due to the increase in dependency and demand, the size and
complexity of computer hardware and software systems have grown enormously. In
many systems, fault tolerance is achieved by applying a set of analysis and design
techniques to generate systems with dramatically improved dependability. In the early
days of fault-tolerant computing, it was possible to evaluate specific hardware and
software outcomes. The chips used in embedded systems contain complex,
highly integrated functions, and hardware and software must be capable of coping
with a variety of standards subject to techno-economic constraints.



Fault tolerant systems are often demanded for critical applications where loss
of human life or damage to property is of great concern. The reliability of fault
tolerant software presents special difficulties, since all the errors present in the system
are design errors that cannot be accurately modeled by the traditional models used for
hardware. In any system, fault tolerance is achieved through redundancy in one way or
another. Redundancy is a common approach to improving the reliability and
availability of a system. Adding redundancy increases the cost and complexity of a
system design, and with the high reliability of modern electrical and mechanical
components, many applications do not need redundancy in order to be successful.
However, if the cost of failure is high enough, redundancy may be an attractive
option.
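
The trade-off between the cost of redundancy and the cost of failure can be illustrated with a minimal sketch. It assumes independent, identical units in active parallel (the system works if at least one unit works) and a simple linear cost model; the function names, the cost model and the numbers are illustrative assumptions, not results from this thesis.

```python
def parallel_reliability(unit_reliability: float, n_units: int) -> float:
    """Reliability of n independent, identical units in active parallel.

    Assumption (not from the thesis): the system fails only when all
    units fail, and unit failures are statistically independent.
    """
    if not 0.0 <= unit_reliability <= 1.0:
        raise ValueError("unit_reliability must lie in [0, 1]")
    if n_units < 1:
        raise ValueError("n_units must be at least 1")
    return 1.0 - (1.0 - unit_reliability) ** n_units


def expected_cost(unit_cost: float, failure_cost: float,
                  unit_reliability: float, n_units: int) -> float:
    """Hypothetical cost model: purchase cost of all units plus the
    expected loss if the whole system fails."""
    p_fail = 1.0 - parallel_reliability(unit_reliability, n_units)
    return n_units * unit_cost + failure_cost * p_fail


if __name__ == "__main__":
    # Illustrative numbers only: a unit with reliability 0.9 and unit cost 1.
    for failure_cost in (10.0, 1000.0):
        for n in (1, 2, 3):
            cost = expected_cost(1.0, failure_cost, 0.9, n)
            print(f"failure cost {failure_cost:7.1f}, units {n}: "
                  f"expected cost {cost:.3f}")
```

Under these assumptions, adding a second unit raises the expected cost when the failure cost is low and lowers it when the failure cost is high, which is the qualitative point made above.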

Hardware reliability requirements provide an impetus to achieve high safety
margins in mechanical stresses, reduced variability and increased tolerances, input
impedance, breakdown voltage, etc. The cause of failure in hardware is usually
physical deterioration, a manufacturing defect or poor quality of materials. Computer
hardware failures could be due to failure of the central processing unit, power supply,
display terminal, printer, cooling system or memory disk, or simply to faulty
operation. For many functions of computer systems, the design process is a very
important part of hardware reliability. Sometimes a hardware component may need to
be replaced; that is why redundant components (standbys) are often kept on hand.
There may be some downtime if a part is not readily available, but the part itself
does not require a corrective process adding to the downtime. Hardware reliability
may change during certain periods, such as at initial use or at the end of useful life.

For a given input and initial conditions, software will always produce the
same results. Software reliability models in the 1970s were usually directed at
single-unit software systems and were called software reliability growth models
(SRGMs); later, models were developed to address multi-component or modular
systems. Software reliability is generally accepted as the key factor in software
quality since it quantifies software failures.

Combined hardware and software reliability models have been developed by many researchers. Such models require careful observation of their implementation for at least the next few years. In the literature, a few models deal with


combined software and hardware reliability. Hardware and software reliability engineering involve many concepts with unique terminology, as well as many mathematical and statistical expressions. The reliability of a combined hardware and software system is in many ways analogous to the reliability modeling of a purely hardware system. Individual hardware platforms and the software assigned to those platforms are independent of other hardware/software platforms.

It is worthwhile to discuss some concepts which are used specifically in the present thesis for modeling fault tolerant systems. The objective of our study in the present thesis is to develop various models for fault-tolerant systems to predict their reliability/availability. Our prime goal in the present chapter is to provide a brief account of the various quantitative approaches used in developing models for the performance evaluation of fault-tolerant systems based on reliability theory.

In the present chapter, we provide an overview of hardware and software reliability issues of fault tolerant systems. The issues related to our research topic are briefly reviewed, and the basic concepts and novel features of the work carried out in the present thesis are highlighted. The remainder of this chapter is organized as follows. Section 1.2 details various aspects of fault tolerant systems, whereas section 1.3 is devoted to redundancy issues. Reliability analysis is discussed in section 1.4. In section 1.5, we review various reliability models related to our work and developed by prominent researchers in different frameworks. The outline of the thesis is presented in section 1.6. Finally, the novel features and future scope of the work done are highlighted in section 1.7.

1.2 Fault Tolerance

Fault tolerance is the ability of a system to continue correct performance of its intended tasks after the occurrence of hardware and software faults. Fault tolerant system research covers a wide spectrum of applications, namely embedded real-time systems, commercial transaction systems, transportation systems, military/space systems, distribution and service systems, etc. A fault tolerance approach improves the efficiency and performance of a system. During the last few decades, many researchers have contributed to the development of


fault tolerance techniques that explore the performance of computer systems which

are prone to software or hardware failures. The idea of implementing fault tolerance

in separate layers of an embedded system helps in managing the complexity of the

derived architectural solutions. Integrating provisions for coping with both hardware

and software faults can reduce the overlapping of fault tolerance techniques. The reliability of fault tolerant software presents special difficulties since all the errors present are design errors that cannot be accurately modeled by the traditional models used for hardware. To tackle fault tolerance related issues, the following terms are commonly used:

Faults- A fault, sometimes called a bug, is the identified or hypothesized cause of a

software failure. Software faults can be classified as (i) design faults and (ii)

operational faults according to the phases of creation.

Design faults- A design fault is a fault occurring in the software design and development process. Design faults can be corrected with fault removal approaches by revising the design documentation and source code.

Operational faults- Such faults occur during the lifetime of the system and are invariably due to physical causes, such as processor failures or disk crashes, or arise in software operation due to timing, race conditions, workload-related stress and other environmental conditions.

Error- An error is the part of the system state which is liable to lead to a failure. It is an intermediate stage between faults and failures. Software errors are most often caused by design faults. Design faults occur when a designer either misunderstands a specification or simply makes a mistake.

Failure- A failure mode is an identifiable weakness in the system design and

manufacture. Failures can be classified into severity classes, e.g. critical, major,

minor. A failure occurs when the user perceives that a software program is unable to

deliver the expected service.

A fault-tolerant system may be able to tolerate one or more fault-types including:

• Transient, intermittent or permanent hardware faults

• Software and hardware design errors


• Operator errors

• Externally induced upsets or physical damage

Figure 1.1: The transition of fault, error and failure in a software life cycle

The “fault-error-failure” relationship in a software life cycle is depicted in figure 1.1. There are some common approaches which may be used to deal with design faults in software. Some basic approaches are as follows:

(a) Fault avoidance (prevention) during the software development process.

(b) Fault tolerance and fault/failure forecasting after the development process.

(c) Fault removal during the software development process.

(a) Fault avoidance- Fault avoidance is meant for preventing the introduction of faults during the development of the software. It includes all the techniques that examine the software development process, standards, methodologies, etc. Fault avoidance techniques employed to curb the occurrence of faults include quality control (design review, component screening, testing, etc.) and shielding from interference (radiation, humidity, heat, etc.).

(b) Fault/failure prediction (forecasting)- Forecasting can play a vital role in

reducing the existence of faults and the occurrences and consequences of failures. For

this purpose dependability-enhancing techniques based on reliability estimation and

reliability prediction are used.

(c) Fault removal- Fault removal is the approach of detecting and eliminating software faults during different phases of development of the software. Reviews, inspection, testing, verification, validation, etc. are some common techniques used for this purpose.

Fault tolerance can be divided into software and hardware fault tolerance. It is worthwhile to look at each of them in turn.

1.2.1 Software Fault Tolerance

A fault tolerant system is supposed to be able to tolerate not only faults in the

system itself but also faults in the application programs. In order to create a fault

tolerant system for a particular application, the fault tolerance demands of the target

application must first be identified. Then, the appropriate fault tolerance methods

must be used in order to meet the overall fault tolerance requirements. Performance measures such as reliability, availability, performability, mean time to failure, mean time to recovery and performance degradation due to fault tolerance can play an important role in the assessment of faults during the software life cycle. Individual fault tolerance methods must be refined to take the system functions into consideration in order to develop effective fault tolerance schemes.

The key issues in the software fault tolerant system are:

• Component reliability is an important quality measure for system level analysis. It is established that software reliability is hard to calculate, and the use of past-verification reliability evaluation is an open problem to be tackled.

• The multi-version techniques are based on the assumption that software made differently should fail differently. If one of the redundant versions fails, at least one of the others should provide an acceptable output.

• Probability models provide a formal conceptual structure to tackle the delicate issues of conditional independence involved in the failure processes of design diverse systems.

Software architectures, design techniques, static checks, dynamic tests, special libraries, and run-time routines help software engineers to develop fault tolerant software. There are two basic techniques for software fault tolerance, namely (i) recovery blocks and (ii) N-version programming.


(i) Recovery Blocks

Figure 1.2: Recovery block

The recovery blocks technique (Randell, 1975) combines the basics of the

checkpoint and restart approach with multiple versions of a software component such

that a different version is tried after an error is detected. Checkpoints are created before a version executes. They are required to recover the state after a version fails, so that the next version has a valid starting point when an error is detected.
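The checkpoint, acceptance test and rollback steps described above can be sketched as follows. This is only an illustrative sketch, not Randell's original formulation: the function `recovery_block`, the two square-root versions and the acceptance test are hypothetical names introduced here, and the checkpoint is assumed to be a simple deep copy of an in-memory state.

```python
import copy

def recovery_block(state, versions, acceptance_test):
    """Run the versions in order until one passes the acceptance test."""
    checkpoint = copy.deepcopy(state)          # checkpoint taken before any version runs
    for version in versions:
        working = copy.deepcopy(checkpoint)    # backward recovery: restart from the checkpoint
        try:
            result = version(working)
        except Exception:
            continue                           # a crash is treated as a detected error
        if acceptance_test(checkpoint, result):
            return result
    raise RuntimeError("every version failed the acceptance test")

# Hypothetical versions of a square-root routine (assumptions for illustration).
def primary_sqrt(state):
    state["x"] = -1.0                          # corrupts its working copy of the state
    return 0.0                                 # and returns a wrong answer

def alternate_sqrt(state):
    x = state["x"]
    guess = max(x, 1.0)
    for _ in range(60):                        # Newton iteration
        guess = 0.5 * (guess + x / guess)
    return guess

def sqrt_acceptance(state, result):
    return result >= 0 and abs(result * result - state["x"]) < 1e-9

answer = recovery_block({"x": 9.0}, [primary_sqrt, alternate_sqrt], sqrt_acceptance)
```

Because each version starts from a fresh copy of the checkpoint, the corruption caused by the faulty primary does not reach the alternate.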

(ii) N-Version Programming

Figure 1.3: N-version programming

The N-version concept attempts to tackle software and hardware fault tolerance using N-way redundant components. In an N-version software system, each module has up to N different implementations. Each version accomplishes the same task but in a different way. The N-version programming (NVP) approach to software fault tolerance is based on the design diversity conjecture. NVP, proposed by Avizienis and Chen (1977), considers the execution of N functionally equivalent software modules (called versions) that receive the same input and send their outputs to a voter. The voter produces an output if at least M out of the N outputs agree. Otherwise, the system fails. Generally, majority voting is used, in which N is odd and M = (N+1)/2.
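The M-out-of-N voting rule can be illustrated with a short sketch. The names `nvp_vote` and `run_nvp` are hypothetical, outputs are assumed to be exactly comparable (hashable) values, and the default threshold N // 2 + 1 coincides with M = (N+1)/2 when N is odd.

```python
from collections import Counter

def nvp_vote(outputs, m=None):
    """Return the value produced by at least m of the versions, else fail."""
    n = len(outputs)
    if m is None:
        m = n // 2 + 1                      # majority; equals (N+1)/2 for odd N
    value, count = Counter(outputs).most_common(1)[0]
    if count >= m:
        return value
    raise RuntimeError("fewer than M versions agree: system failure")

def run_nvp(versions, x, m=None):
    """Feed the same input to every version and vote on the outputs."""
    return nvp_vote([version(x) for version in versions], m)

# Three hypothetical versions of the same function, one of them faulty.
versions = [lambda x: x * x, lambda x: x ** 2, lambda x: x * x + 1]
voted = run_nvp(versions, 4)
```

In practice the outputs of independently developed versions may differ slightly (for instance in floating-point results), so a real voter usually needs a tolerance-based comparison rather than exact equality.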

Single and Multiple Version Software Techniques

Single-version techniques add to a single software module a number of

functional capabilities that are unnecessary in a fault-free environment. These

techniques are based on redundancy and are applied to a single version of software to

detect and recover from faults. Single-version software fault tolerance techniques

involve considerations on program structure and actions, error detection, exception

handling, checkpoint and restart, process pairs, and data diversity (cf. Lyu, 1995).

The multi-version fault-tolerant software technique is also called design

diversity approach. It is based on the use of two or more versions of a piece of

software executed either in sequence or in parallel. The fundamental reasoning for the

use of multiple versions is the expectation that components created differently (i.e., by different designers, with different algorithms, different design tools, etc.) should fail differently. Therefore, if one version fails in a particular situation, there is a good chance that at least one of the alternate versions is able to provide a suitable output. The versions can be applied as alternatives (with separate means of error detection), in pairs (to implement detection by replication checks) or in larger groups (to enable masking through voting).

N Self-Checking Programming

N self-checking programming (cf. Laprie, 1987, 1990, 1995) is the use of multiple

software versions combined with structural variations of the recovery blocks and N-

version programming. N self-checking programming using acceptance tests is shown

in Figure 1.4.


Figure 1.4: N-self checking programming

Here the versions and the acceptance tests are developed independently from

common requirements. The use of separate acceptance tests for each version is the

main difference of this N self-checking model from the recovery blocks approach.

Comparison between Recovery Block, N-Version Programming and N Self-Checking Programming

Each design diversity method, namely recovery block, N-version programming and N self-checking programming, has its own advantages and disadvantages compared with the others. A comparison of the features of the three is summarized in Table 1.1. In the design and implementation phase, specific fault-tolerant techniques are employed in developing reliable software systems for either single-version or multiple-version software.


Table 1.1: Comparison of design diversity techniques

Features                  Recovery Block      N-version Programming   N-self-checking Programming
Minimum No. of Versions   2                   3                       4
Output Mechanism          Acceptance Test     Decision Algorithm      Decision Algorithm and Acceptance Test
Execution Time            Primary Version     Slowest Version         Slowest Pair
Recovery Scheme           Backward Recovery   Forward Recovery        Forward and Backward Recovery

1.2.2 Hardware Fault Tolerance

The majority of fault tolerant designs have been directed towards developing computers that automatically recover from random faults occurring in hardware components. Hardware fault tolerance includes triple modular redundancy, duplication with comparison, standby sparing, watchdog timers, self-purging redundancy, and many other techniques, which are actively being researched. Such hardware methods typically have the advantage of speedily detecting and removing faults as they occur. The techniques employed for fault tolerant design involve partitioning a computing system into modules that act as fault containment areas. Each module is backed up with protective redundancy so that, if the module fails, others can assume its function.

The working stage has the following three types of hardware faults:

(i) Transient faults- Transient faults are temporary faults that are caused by external events or by the environment (Somani and Vaidya, 1997). For example, such faults can arise from energetic particles striking the chip, electrical surges, etc. Although these faults do not cause permanent damage, they may result in incorrect program execution by inadvertently altering processors’ states, signal transfer, or stored values in registers, etc.

(ii) Intermittent Faults- Intermittent faults enter the system, stay active for a very

small duration and then disappear, only to return again. The examples of such

faults can be encountered in heat-sensitive components which can produce

intermittent faults through their phases of heating and cooling.

(iii) Permanent Faults- Permanent faults are completely repeatable and always cause

an associated failure.

Some key concepts in the area of hardware fault tolerance are also described in the next section, which is devoted to redundancy issues.

1.3 Redundancy Issues

An extra critical component provided with the intention of increasing the reliability of the system is called a redundant component. In any system, fault tolerance is achieved through redundancy in one way or another. Redundancy in software consists of the introduction of extra elements such as instructions, parts of programs, programs, etc., to ensure that a failure can be tolerated through the substitution or masking of the faulty element. Fault masking is a structural redundancy technique that completely masks faults within a set of redundant modules. A number of identical modules execute the same functions, and their outputs are voted on to remove errors created by a faulty module. This technique is used to prevent faults from introducing errors, e.g. through error correcting codes, majority voting, etc.

There are two types of the redundancies possible:

(I) Space Redundancy and (II) Time Redundancy


(I) Space Redundancy

Space redundancy provides additional components, functions, or data items

that are unnecessary for a fault-free operation. It is classified into hardware

redundancy, information redundancy and software redundancy.

(a) Hardware Redundancy

The physical replication of hardware is perhaps the most common form of fault

tolerance used in the system. As semiconductor components have become smaller and

less expensive, the concept of hardware redundancy has become more common and

more practical. There are three basic forms of hardware redundancy:

(i) passive, (ii) active and (iii) hybrid redundancy.

(i) Passive Redundancy

Passive redundancy achieves fault tolerance by masking the fault that occurs

without requiring any action on the part of the system or an operator. Passive

techniques, in their most basic form, do not provide for the detection of the faults but

simply mask the faults. In the context of passive redundancy, it is worthwhile to

discuss the concept of triple modular redundancy and N-modular redundancy.

Triple Modular Redundancy

The most common form of passive hardware redundancy is triple modular

redundancy (TMR). In this type of redundancy the components are in triplicate to

perform the same computation in parallel. Majority voting is used to find out the

correct result. If one of the modules fails, the majority voter will mask the fault by

recognizing the result of the remaining two fault-free modules as correct.

A TMR system (see fig. 1.5) can mask only one module fault. A failure in either of

the remaining modules would cause the voter to produce an erroneous result. TMR is

usually used in applications where a substantial increase in reliability is required for a

short period.


Fig. 1.5: Triple modular redundancy

N-Modular Redundancy

The N-modular redundancy (NMR) approach is based on the same principle as TMR, but uses N modules. The number N is usually selected to be odd, to make majority voting possible. An NMR system (see fig. 1.6) can mask $\lfloor N/2 \rfloor$ module faults.

Fig.1.6: N-modular redundancy

(ii) Active Redundancy

Active redundancy method is also known as dynamic method. In this

approach, one achieves fault tolerance by first detecting the faults which occur and

then performing actions needed to recover the system back to the operational state.


Active hardware redundancy uses fault detection, fault location, and fault recovery in

an attempt to achieve fault tolerance.

(iii) Hybrid Redundancy

Hybrid techniques combine the features of both the passive and active

approaches. Fault masking is used in hybrid system to prevent erroneous results from

being generated. Fault detection, fault location and fault recovery are also used in the

hybrid approaches to improve the fault tolerance by removing faulty hardware and

replacing it with spares. Spares provisioning is one form of providing redundancy in a

system. Hybrid methods are most often used in the critical-computation applications

where fault masking is required to check momentary errors and to achieve high

reliability.

The basic techniques for hybrid redundancy include:

• Self-Purging Redundancy

• N-modular Redundancy with Spares

• Triplex-Double Redundancy

(b) Information Redundancy

Information redundancy is simply the addition of redundant information to

data to allow fault detection, fault masking or possibly fault tolerance. Examples of

information redundancy are error detecting and error correcting codes, formed by the

addition of redundant information to data words or by the mapping of data words into

new representations containing redundant information.
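As a minimal illustration of information redundancy, the sketch below appends a single even-parity bit to a data word. The function names are hypothetical, and a single parity bit is only an error detecting code: it detects any odd number of flipped bits but cannot locate or correct them.

```python
def add_even_parity(bits):
    """Append one redundant bit so that the total number of 1s is even."""
    return bits + [sum(bits) % 2]

def parity_ok(word):
    """Check a received word; False signals that an error was detected."""
    return sum(word) % 2 == 0

codeword = add_even_parity([1, 0, 1, 1])

# Simulate a single-bit fault during storage or transmission.
corrupted = list(codeword)
corrupted[2] ^= 1
```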

(c) Software Redundancy

Software redundancy refers to the use of extra code, small routines or possibly

complete programs, in order to check the correctness or the consistency of the results

produced by given software. The two types of diversity used in software redundancy techniques are:

(i) Design diversity and (ii) Data diversity


(i) Design Diversity

Design diversity is the provision of identical services through separate designs and implementations. It aims at making the modules as diverse and independent as possible.

(ii) Data diversity

The data diverse techniques are meant to complement, rather than replace, design diverse techniques. Data diversity consists of obtaining a related set of points in the program data space, executing the same software on those points, and then using a decision algorithm to determine the resulting output.

(II) Time Redundancy

Time redundancy techniques attempt to reduce the amount of extra hardware (weight, size, power consumption, etc.) at the expense of using additional time. In some applications, extra time is of less importance than extra hardware. The basic concept of time redundancy is the repetition of computations in ways that allow faults to be detected. If the computation is done twice and a transient fault has occurred, then the stored copy will differ from the re-computed result, so the fault will be detected. If the computation is done three or more times, a fault can be corrected.
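The detect-by-repeating and correct-by-repeating ideas can be sketched as below. The helper names are hypothetical, and the transient fault is simulated, as an assumption for illustration, by a function that returns a wrong value on its first call only.

```python
from collections import Counter

def detect_by_repetition(compute, x):
    """Compute twice; a mismatch between stored and recomputed results flags a fault."""
    stored = compute(x)
    recomputed = compute(x)
    return stored, stored == recomputed

def correct_by_repetition(compute, x, repeats=3):
    """Compute three or more times and keep the majority result."""
    results = [compute(x) for _ in range(repeats)]
    value, count = Counter(results).most_common(1)[0]
    if count > repeats // 2:
        return value
    raise RuntimeError("no majority among the repeated results")

def make_flaky_square():
    """Hypothetical computation hit by a transient fault on its first call only."""
    calls = {"n": 0}
    def square(x):
        calls["n"] += 1
        upset = 1 if calls["n"] == 1 else 0
        return x * x + upset
    return square

_, results_agree = detect_by_repetition(make_flaky_square(), 3)
corrected = correct_by_repetition(make_flaky_square(), 3)
```

Repetition of this kind only helps against transient faults; a permanent fault would corrupt every repetition in the same way.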

Standby Redundancy

Standby redundancy is one common form of active hardware redundancy for achieving fault tolerance. A standby system consists of a primary module and one or more modules that serve as standby spares. Standby components are classified as hot, cold or warm standbys. Hot standby has all the spares operational in synchrony with the on-line primary components and ready to take over when the primary components experience any fault. Cold standby has its spares unpowered until needed to replace a faulty component. Warm standbys are initially powered up with a reduced failure rate; they become subject to the regular full failure rate when they are used to replace faulty primary components.

1.4 Reliability Analysis

From a qualitative point of view, reliability can be defined as the ability of any item to remain functional. Quantitatively, reliability specifies the probability that no operational interruptions will occur during a stated time interval. The term reliability is used as a reliability characteristic denoting a probability of success or a success ratio. Reliability is also defined as “the ability of an item to perform a required function, under stated conditions, for a stated period of time.”

The commonly used structures of the system are the (i) series, (ii) parallel and

(iii) series-parallel/parallel-series configuration.

(i) Reliability for Series Configuration:

Figure 1.7: Reliability of N component system in series

The series system is best thought of as a system that contains no redundancy; that is, it is a non-redundant system in which all the equipment is connected in series. If any one piece of equipment fails, the series system fails. Let $R_i(t)$ $(i = 1, 2, \ldots, N)$ be the reliability of the ith component at time t. The reliability of the N-component series system at time t is given by

$$R_S(t) = \prod_{i=1}^{N} R_i(t) \qquad \ldots(1.1)$$

If the ith (i = 1, 2, …, N) component has an exponentially distributed lifetime with constant failure rate $\lambda_i$, then the reliability of this series system is given by

$$R_S(t) = e^{-\lambda_1 t}\, e^{-\lambda_2 t} \cdots e^{-\lambda_N t} = e^{-\sum_{i=1}^{N} \lambda_i t} \qquad \ldots(1.2)$$

(ii) Reliability for Parallel Configuration:

In parallel configuration of a redundant system all components of the system

are arranged in parallel form (see fig 1.8).

Figure 1.8: Reliability of an N component system in parallel

The reliability of the parallel system is given by

$$R_P(t) = 1 - \prod_{i=1}^{N} \left[1 - R_i(t)\right] \qquad \ldots(1.3)$$

(iii) Reliability for Series-Parallel/ Parallel-Series Configuration: In series-

parallel/parallel-series configuration, all components of the system are arranged in

both series and parallel form.

The reliability of a series-parallel system (see fig. 1.9) consisting of N series stages, the ith stage having $N_i$ parallel components, is given by

$$R_{SP}(t) = \prod_{i=1}^{N} \left[1 - \left(1 - R_i(t)\right)^{N_i}\right] \qquad \ldots(1.4)$$

Figure 1.9: System reliability in series-parallel configuration


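The reliability of a parallel-series system (see fig. 1.10) consisting of N parallel paths is given by

$$R_{PS}(t) = 1 - \prod_{i=1}^{N} \left[1 - R_S(t)\right] \qquad \ldots(1.5)$$

where $R_S(t)$ denotes the reliability of the ith series path, obtained as in (1.1).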

Figure 1.10: System reliability in parallel-series configuration
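Equations (1.1) to (1.5) translate directly into a few lines of code. The sketch below is illustrative only: the function names are hypothetical, component failures are assumed to be independent, and the numerical values are arbitrary.

```python
import math

def series(rels):
    """Equation (1.1): product of the component reliabilities."""
    return math.prod(rels)

def parallel(rels):
    """Equation (1.3): one minus the product of the component unreliabilities."""
    return 1.0 - math.prod(1.0 - r for r in rels)

def series_parallel(stage_rels, stage_sizes):
    """Equation (1.4): N stages in series, stage i holding N_i identical parallel units."""
    return math.prod(1.0 - (1.0 - r) ** n for r, n in zip(stage_rels, stage_sizes))

def parallel_series(paths):
    """Equation (1.5): parallel paths, each path being a series chain as in (1.1)."""
    return parallel([series(path) for path in paths])

def exponential_reliability(rate, t):
    """Reliability of a unit with constant failure rate, as used in (1.2)."""
    return math.exp(-rate * t)

# Check of (1.2): a series system of exponential units has failure rate sum(rates).
rates, t = [0.01, 0.02, 0.03], 10.0
series_of_exponentials = series([exponential_reliability(r, t) for r in rates])
closed_form = math.exp(-sum(rates) * t)
```

For two units of reliability 0.9, for example, `series` gives 0.81 while `parallel` gives 0.99, which shows the gain obtained from redundancy.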

1.4.1 Reliability Indices

Reliability- Reliability is the probability that an item will not fail by a given time t,

under a given set of operating conditions. The probability of failure by a given time t

is referred to as the unreliability of the item. Mathematically, the unreliability is

represented by

$$U(t) = \int_0^t f(t)\,dt, \quad t \ge 0$$

where U(t) is the unreliability of the system and f(t) is the probability density function of failure.

Then the reliability R(t) of an item at time t is

$$R(t) = 1 - U(t) \qquad \ldots(1.6)$$

Failure Rate- The probability of a system failure in a given time interval [t1, t2] can be defined in terms of the reliability function as

$$\int_{t_1}^{t_2} f(t)\,dt = \int_{t_1}^{\infty} f(t)\,dt - \int_{t_2}^{\infty} f(t)\,dt = R(t_1) - R(t_2)$$

In terms of the failure distribution function F(t), the same probability is given as

$$\int_{t_1}^{t_2} f(t)\,dt = \int_{-\infty}^{t_2} f(t)\,dt - \int_{-\infty}^{t_1} f(t)\,dt = F(t_2) - F(t_1)$$


The rate at which failures occur in a certain time interval [t1, t2] is called the failure rate. Thus the failure rate is obtained using

$$\frac{R(t_1) - R(t_2)}{(t_2 - t_1)\, R(t_1)}$$

If we consider the interval as $[t, t + \Delta t]$, then the failure rate is given by

$$\frac{R(t) - R(t + \Delta t)}{\Delta t\, R(t)} \qquad \ldots(1.7)$$

The rate in the above definitions is expressed as failures per unit time.

Hazard Function- The hazard function h(t) is defined as the limit of the failure rate as the interval approaches zero. The instantaneous failure rate h(t) is given as

$$h(t) = \lim_{\Delta t \to 0} \frac{R(t) - R(t + \Delta t)}{\Delta t\, R(t)} = \frac{1}{R(t)} \left[-\frac{d}{dt} R(t)\right] = \frac{f(t)}{R(t)} \qquad \ldots(1.8)$$

Mean time to failure (MTTF)- Suppose that the reliability function for a system is given by R(t). The expected time during which a component performs successfully, or the mean time to failure, is given by

$$MTTF = \int_0^{\infty} t f(t)\,dt \qquad \ldots(1.9)$$

Since $f(t) = -\frac{d}{dt}\left[R(t)\right]$, equation (1.9) yields

$$MTTF = -\int_0^{\infty} t\, d\left[R(t)\right] = \left[-t R(t)\right]_0^{\infty} + \int_0^{\infty} R(t)\,dt \qquad \ldots(1.10)$$

The bracketed term vanishes provided $t R(t) \to 0$ as $t \to \infty$, so equation (1.10) reduces to

$$MTTF = \int_0^{\infty} R(t)\,dt \qquad \ldots(1.11)$$

Thus, the MTTF is the definite integral of the reliability function over $[0, \infty)$.
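The equivalence of (1.9) and (1.11), and the hazard function (1.8), can be checked numerically. The sketch below assumes, purely for illustration, a Weibull lifetime with shape 2 and scale 2, for which $R(t) = e^{-(t/\eta)^{\beta}}$ and the MTTF is known to be $\eta\,\Gamma(1 + 1/\beta)$; the Weibull choice, the truncation point and the helper names are not part of the text above.

```python
import math

BETA, ETA = 2.0, 2.0          # assumed Weibull shape and scale (illustrative values)

def reliability(t):
    return math.exp(-((t / ETA) ** BETA))

def density(t):
    # f(t) = -dR/dt for the Weibull reliability above
    return (BETA / ETA) * (t / ETA) ** (BETA - 1) * reliability(t)

def hazard(t):
    # Equation (1.8): h(t) = f(t) / R(t)
    return density(t) / reliability(t)

def trapezoid(func, a, b, steps=100_000):
    """Composite trapezoidal rule on [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (func(a) + func(b))
    for k in range(1, steps):
        total += func(a + k * h)
    return total * h

T_MAX = 20.0                  # R(20) = exp(-100), so the truncated tail is negligible
mttf_from_density = trapezoid(lambda t: t * density(t), 0.0, T_MAX)   # (1.9)
mttf_from_reliability = trapezoid(reliability, 0.0, T_MAX)            # (1.11)
mttf_closed_form = ETA * math.gamma(1.0 + 1.0 / BETA)
```

With these values the hazard function grows linearly in t, which contrasts with the constant hazard of the exponential case treated later in this section.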


Mean time to repair (MTTR)- An important measure often used in maintenance studies is the mean time to repair. MTTR is the expected value of the random variable repair time, not failure time, and is given by

$$MTTR = \int_0^{\infty} t\, r(t)\,dt$$

where r(t) is the repair density function. If the repair time is exponentially distributed, i.e. $r(t) = \mu e^{-\mu t}$, then the MTTR is given as

$$MTTR = \frac{1}{\mu} \qquad \ldots(1.12)$$

Mean time between failures (MTBF)- This is a basic measure of reliability for repairable items. It applies when the system, having failed, is subject to repair. Now

MTBF = MTTF + MTTR ...(1.13)

The MTTR is usually a small fraction of the MTBF, so the approximation that the MTBF and MTTF are equal is often quite good.

Reliability indices for a system having an exponential lifetime distribution

The one-parameter exponential reliability function is given by

$$R(T) = \exp(-\lambda T) = \exp\left(-\frac{T}{m}\right) \qquad \ldots(1.14)$$

where $\lambda$ is the failure rate and $m = 1/\lambda$.

The mean time to failure (MTTF) of the one-parameter exponential distribution is given by

$$MTTF = \int_0^{\infty} t f(t)\,dt = \int_0^{\infty} t\, \lambda \exp(-\lambda t)\,dt = \frac{1}{\lambda} \qquad \ldots(1.15)$$

Failure Rate Function- The exponential failure rate function is given by

$$\lambda(T) = \frac{f(T)}{R(T)} = \frac{\lambda e^{-\lambda T}}{e^{-\lambda T}} = \lambda = \text{constant} \qquad \ldots(1.16)$$

1.4.2 Failure Issues

In practice, different types of failures can significantly reduce the reliability of systems. Common cause failure, degraded failure, and switching and failure detection are the failure issues of greatest concern in active systems. It is worthwhile to discuss these concepts as they are included in many models developed in the thesis.

Common Cause Failure- In this type of failure, all the units in the system fail due to the same cause. Examples arise in computer systems, wherein humidity, fires, power outages, etc. can cause the failure of several components simultaneously. When one device serves several functional units, its failure prevents each of the individual units from functioning. Examples of such situations are high voltage in the case of electric/electronic devices, high pressure in the case of hydraulic pumps, etc. Many researchers have considered the concept of common cause failure in their studies on hardware/software reliability systems.

Degraded Failure- With usage and the passage of time, all systems are subject to degradation. This degradation can result in high production costs and inferior product quality. So, to keep product costs low and maintain the pre-specified product quality, the provision of a repair facility and standby support is recommended, which also ensures the smooth and long run functioning of the system. The growing importance of maintenance has generated an increasing interest in the development of redundant repairable operating systems.

Switching and failure detection- The failure detection mechanism detects the failed units and the switching mechanism replaces them; it also sometimes brings a unit into the main system for working. When switching is perfect, its effect can be ignored, but imperfect switching influences the system performance.

1.4.3 System Configurations

The system configuration is among the most useful models for calculating the reliability of systems. The K-out-of-N:G configuration is a very popular type of redundancy in fault tolerant systems. There are many studies in this area, and we attempt to classify them. At first glance, they may be classified into some main groups, namely K-out-of-N:G configuration, K-out-of-N:F configuration, Consecutive K-out-of-N configuration, M-to-K-out-of-N configuration and Weighted K-out-of-N configuration.


K-out-of-N:G Configuration- The K-out-of-N configuration has received great attention from both practitioners and researchers. It is a widely adopted structure for partially redundant safety systems. K-out-of-N:G configurations are often encountered in industrial settings such as the electronics industry, telecommunication network systems, power generation and transmission systems, avionics, etc. In a K-out-of-N:G configuration, the system works as long as at least K of the given N units are good. The aggregate reliability of a K-out-of-N:G system, when all the units in the system have the same reliability R(t), is given by:

$$R_{K\text{-out-of-}N}(t) = \sum_{i=K}^{N} \binom{N}{i} \left(R(t)\right)^{i} \left(1 - R(t)\right)^{N-i}$$

The 1-out-of-N:G configuration is a parallel system, whereas the N-out-of-N:G configuration is a series system.

K-out-of-N:F configuration- A K-out-of-N:F configuration fails when K of the N available units fail. It is obvious that the K-out-of-N:G and (N-K+1)-out-of-N:F configurations are equivalent.
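The K-out-of-N:G formula and its equivalence with the (N-K+1)-out-of-N:F configuration can be checked with a short sketch. The function names are hypothetical, and the units are assumed to be identical and to fail independently.

```python
from math import comb

def k_out_of_n_good(k, n, r):
    """Reliability of a K-out-of-N:G system: at least k of n units work."""
    return sum(comb(n, i) * r**i * (1 - r) ** (n - i) for i in range(k, n + 1))

def k_out_of_n_failed(k, n, r):
    """Reliability of a K-out-of-N:F system: fewer than k of n units have failed."""
    q = 1 - r
    return sum(comb(n, j) * q**j * r ** (n - j) for j in range(k))

r = 0.9
parallel_case = k_out_of_n_good(1, 3, r)   # 1-out-of-3:G behaves as a parallel system
series_case = k_out_of_n_good(3, 3, r)     # 3-out-of-3:G behaves as a series system
tmr_case = k_out_of_n_good(2, 3, r)        # 2-out-of-3:G, the voting structure of TMR
```

For the 2-out-of-3 case the sum reduces to $3R^2 - 2R^3$, which is the module-level reliability of a TMR arrangement when the voter is assumed to be perfect.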

Consecutive K-out-of-N configuration- Consecutive K-out-of-N configuration is the

system configuration in which the units of the system are connected in such a way

that only the failure of K consecutive units out of N units causes system failure.

M-to-K-out-of-N configuration- An M-to-K-out-of-N configuration is a non-coherent system in which no fewer than M and no more than K out of the N units must function for the successful operation of the configuration.

Weighted K-out-of-N configuration- A weighted K-out-of-N configuration consists of N units, each of which is associated with a positive integer value called its weight. The system fails when the total weight of the working units is less than K. In mathematical terms, in a weighted K-out-of-N system, component i carries a weight w_i, w_i ≥ 0 for i = 1, 2, . . . , N, such that

\[
w = \sum_{i=1}^{N} w_i
\]

where w is the total weight of all the components. Thus, the K-out-of-N: G configuration can be seen as a special case of the weighted K-out-of-N: G configuration wherein each component has a weight of 1.
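The weighted case can be evaluated by building up the distribution of the total working weight one unit at a time. The following is a minimal sketch, assuming independent units with non-negative integer weights; the function name and the sample weights and reliabilities are hypothetical.

```python
def weighted_k_out_of_n_g_reliability(k, weights, reliabilities):
    """Probability that the total weight of working units is at least k.

    weights: non-negative integer weight of each unit.
    reliabilities: probability that each unit works (units assumed independent).
    """
    if len(weights) != len(reliabilities):
        raise ValueError("weights and reliabilities must have the same length")
    # dist[s] = probability that the working weight equals s, for s < k;
    # dist[k] accumulates the probability that the working weight is >= k.
    dist = [1.0] + [0.0] * k
    for w, p in zip(weights, reliabilities):
        new_dist = [0.0] * (k + 1)
        for s, prob in enumerate(dist):
            new_dist[s] += prob * (1.0 - p)       # unit failed: weight unchanged
            new_dist[min(k, s + w)] += prob * p   # unit works: add w, capped at k
        dist = new_dist
    return dist[k]


# Unit weights of 1 reproduce the ordinary 2-out-of-3: G system.
unit_weights = weighted_k_out_of_n_g_reliability(2, [1, 1, 1], [0.9, 0.9, 0.9])
# One heavy unit (weight 2) can meet the threshold K = 2 on its own.
mixed_weights = weighted_k_out_of_n_g_reliability(2, [2, 1, 1], [0.9, 0.9, 0.9])
```

Capping the accumulated weight at K keeps the table at K + 1 entries, so the cost is proportional to N times K rather than to the number of unit states.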

1.4.4 System Availability

The availability function of a system, denoted by A(t), is defined as the probability that the system is available at time t. Availability differs from reliability, which concerns a period of time during which the system is free of failures; availability concerns a time point at which the system is not in the failed state. Some commonly used measures of availability are as follows:

Instantaneous or Point Availability- Instantaneous or point availability is the probability that a system will be operational at a given time t.

The system is operational at time t in one of two ways. Either it has operated properly from 0 to t, which happens with probability R(t), or it was last renewed (repaired) at some time u, 0 < u < t, and has operated properly since then, which happens with probability \int_0^t R(t-u) m(u) du. The point availability is the sum of these two probabilities and is obtained as:

\[
A(t) = R(t) + \int_{0}^{t} R(t-u) \, m(u) \, du \tag{1.17}
\]

where m(u) is the renewal density function of the system.

Transient Availability- The transient availability is given as

\[
A(t) = P\{\text{system is working at time } t\} \tag{1.18}
\]

Average Availability- The average availability, A_{AVE}(t), over an interval [0, t] is obtained using

\[
A_{AVE}(t) = \frac{1}{t} \int_{0}^{t} A(u) \, du \tag{1.19}
\]

Steady State Availability- The availability function, which is a complex function

of time, has a simple steady-state or asymptotic expression. The steady state

availability is given by

\[
A = \lim_{t \to \infty} A(t) = \frac{MTTF}{MTTF + MTTR} \tag{1.20}
\]

where MTTF and MTTR stand for mean time to failure and mean time to repair, respectively, and A(t) is the transient availability at time t.

The average availability measures the fraction of time that the system is operational

over the interval of interest. One should not confuse the average availability with the

point availability.
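The difference between the three measures can be seen in a simple special case. The sketch below assumes a single repairable unit with exponentially distributed time to failure (rate lam) and time to repair (rate mu), starting in the working state; this model, the function names and the numerical rates are illustrative assumptions rather than a model taken from the thesis. For this two-state Markov model the point availability has the well-known closed form used in the code.

```python
from math import exp


def point_availability(t: float, lam: float, mu: float) -> float:
    """A(t) for a two-state Markov unit that is working at time 0."""
    s = lam + mu
    return mu / s + (lam / s) * exp(-s * t)


def average_availability(t: float, lam: float, mu: float) -> float:
    """(1/t) times the integral of A(u) over [0, t], evaluated in closed form."""
    if t == 0:
        return 1.0
    s = lam + mu
    return mu / s + lam / (s * s * t) * (1.0 - exp(-s * t))


def steady_state_availability(mttf: float, mttr: float) -> float:
    """Limiting availability MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


lam, mu = 0.01, 0.1  # illustrative failure and repair rates per hour (assumption)
a_point = point_availability(30.0, lam, mu)
a_average = average_availability(30.0, lam, mu)
a_steady = steady_state_availability(1.0 / lam, 1.0 / mu)
```

Because A(t) decreases from 1 towards its limit in this model, the average availability over [0, t] lies above the point availability at t, and both approach the steady-state value as t grows.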

1.5 Review of the Literature

In this section, we present a brief survey of the literature related to our research topic, the software and hardware reliability of fault tolerant systems. We trace the historical development of research on fault tolerant systems, N-version programming, recovery blocks, redundancy, reliability analysis, the K-out-of-N: G configuration, common cause failure, availability analysis, switching failure, degraded failure and Markov models for software reliability growth. This section gives an overview of past and recently developed software/hardware reliability models and the underlying mathematical concepts that profoundly influenced the work contained in Chapters 2-9 of the thesis.

1.5.1 Fault tolerant systems

The requirement for developing a unified method for tolerating both hardware and software faults has been identified in the recent past, and various works in this direction have already appeared in the literature (cf. Laprie et al., 1990; Dugan and Lyu, 1995). System structure for software fault tolerance was analyzed by Kant (1987), Lala and Alger (1988), and Belli and Jedrzeiowicz (1990). Carpenter (1990), Bobbio (1990) and Wu et al. (1994) carried out dependability analysis of software/hardware fault-tolerant systems. Leu et al. (1991) modeled fault-tolerant software reliability using Petri nets. McAllister and Scott (1991), Bondavalli et al. (1993) and Wu et al. (1996) calculated cost while modeling fault-tolerant software. Kanoun et al. (1993), Tai et al. (1993) and Chiaradonna et al. (1994) discussed the reliability growth of fault tolerant software. Laplante (1993)

provided an introduction to the single-event-upset phenomenon, to the software techniques and hardware selection considerations used to combat its effects, and to the application of these techniques in fault-tolerant systems. Zhuang and Xie

(1994) analyzed some fault-tolerance configurations based on a multipath principle.



Giandomenico et al. (1995) proposed a uniform approach to software and hardware

fault tolerances. Huang and Kintala (1995) studied architectural issues in software

fault tolerance.

Cheng et al. (2000) considered the problem of guaranteeing reliability

requirements with bounded recovery times on fail-stop processors in fault-tolerant

multiprocessor real-time systems. They classified tasks based on their recovery-time

requirements into (i) hard recovery, (ii) soft recovery, and (iii) best-effort recovery

tasks. Bondavalli et al. (2002) and Littlewood et al. (2002) used the adaptive

approach to achieve the reliability and hardware-software fault tolerance in distributed

computing environment. Rehage et al. (2005) considered the redundancy

management of fault tolerant aircraft system architectures. Saha (2006a) developed

the software tool for fault tolerance. The reliability and performance analysis of

hardware–software systems with fault-tolerant software components were presented

by Levitin (2006). Wattanapingskorn and Coit (2007) discussed the fault–tolerant

embedded system design and optimization by considering the reliability estimation

uncertainty. Dabney et al. (2008) considered a fault tolerant approach to test control

utilizing dual-redundant processors. Lim et al. (2008) proposed a fault avoidance scheme that increases system dependability by avoiding common faults on the remaining nodes when some of the nodes fail, and analyzed the resulting system dependability. Leach (2008) and Zhang and Qin (2008) analyzed an improved fault tolerant system and used checkpoints in legacy code to improve fault tolerance. Khan et al. (2010) proposed fault tolerance techniques for grid computing systems. Simeu-Abazi et al. (2011) suggested a methodology for alarm filtering using dynamic fault trees.

1.5.2 N-version Programming

Software/Hardware is a major source of reliability decay in dependable

systems. One of the classical remedies is to provide fault tolerance by using N-version

programming. Knight and Leveson (1986) did an experimental evaluation of the

assumption of independence in multi-version programming. Eckhardt and Lee (1988)

discussed the fundamental differences in the reliability of N-modular redundancy and

N-version programming. Pham (1994) and Teng and Pham (2002) proposed the

software reliability growth model for the optimal design of N-version software

systems subject to certain constraints. Chatterjee et al. (2004) studied the N-version


programming with imperfect debugging. Saha (2006b) proposed a model for a single-version scheme of fault tolerant computing. Yamachi et al. (2006) described a multi-objective genetic algorithm for solving the N-version program design problem.

Laval et al. (2011) gave the simultaneous version for software evaluation assessment

and computation time of model queries on large models.

1.5.3 Recovery Block

Relatively few studies in the fault tolerance literature focus specifically on the recovery block, although over the last few decades several researchers have treated it as a fault tolerance technique. Velardi and Ciciani (1983) studied recovery blocks for communication systems. Distributed execution of recovery blocks for uniform treatment of hardware and software faults in real-time applications was studied by Rossi and Simone (1984) and Kim and Welch (1989). Nicola and Goyal (1990)

modeled the correlated failures and community error recovery in multiversion

software. Randell and Xu (1995) employed the recovery block concept, in software

fault tolerance. Al-Saqabi et al. (1996) considered recovery from concurrent failures

in communication protocols. Optimization models for component based recovery

block technique were proposed by Berman and Kumar (1999) and Abulnaja (2005).

Li et al. (2006) discussed the design of correct and efficient checkpoint and recovery

strategies for distributed agent systems. Lv et al. (2010) gave a block orthogonal greedy algorithm for stable recovery of block-sparse signal representations. Abujarad and Kulkarni (2011) explored cases in which the structure of the recovery paths is too complex to permit existing heuristic-based approaches for adding recovery.

1.5.4 Redundancy Models

A vast literature can be found on various hardware and software redundancy

techniques and reliability modeling for redundant systems. Kumar et al. (1986)

considered the reliability analysis of a two-unit redundant system with critical human

error. Grosspietsch (1989) proposed schemes of dynamic redundancy for fault tolerance in random access memories. Fault tolerance via modular redundancy with comparison was implemented by Yinong and Chen (1992). Venkateswaran

et al. (2002) analyzed redundancy based fault detection of gyroscopes in spacecraft

applications. Valdes and Zequeira (2003, 2006) proposed the optimal allocation of an


active redundancy in a two-component series system. Bueno and Carmo (2007) defined an active redundancy allocation for a k-out-of-n: F system of dependent

components. Li and Hu (2008) gave some new stochastic comparisons for

redundancy allocations in series and parallel systems. Flammini et al. (2009)

presented a new modeling approach to the safety evaluation of n-modular redundant

computer system in the presence of imperfect maintenance. Lisnianski et al. (2000)

and Tian et al. (2009) studied the structure optimization of multi-state system with

time redundancy. The optimal task allocation and hardware redundancy policies in

distributed computing systems were considered by Hsieh (2003), Yang et al. (2009)

and Randles et al. (2011). Valdes et al. (2010) analyzed some stochastic comparisons in series systems with active redundancy. Smidt-Destombes et al. (2011) studied a spare parts model with cold-standby redundancy at the system level. Belzunce et al. (2011) studied the optimal allocation of redundant components for series and parallel structures of two dependent components.

1.5.5 Software/Hardware Reliability Models

Hardware and software play a key role in modern life, which has increased our dependence on machines and on their reliability. Unreliable hardware or software can be highly damaging. Most of the models in the literature discuss the calculation of system reliability from component reliabilities. Reliability estimates are defined as a function of user profiles: each user profile exercises different modules and hence results in a different reliability. Various researchers have worked on software reliability measurement, prediction and applications (cf. Shooman, 1983; Musa et al., 1987; Elsayed, 1996). Researchers have developed software reliability models from different points of view, including theoretical developments and applications (cf. Shooman, 1990; Malaiya and Srimani, 1990). Kapur et al. (1992), Sridharan and

Jayashree (1998), Pham et al. (1999) and Pham (2003a) obtained the transient

solutions of a software model with imperfect debugging and generation of errors by

two servers. Hoyland and Rausand (1994) described various aspects of system

reliability models and statistical methods.

There are important contributions of some notable researchers on the software

reliability engineering (cf. Lewis, 1994; Lyu, 1996; Sahner et al., 1996). Smidts and


Sova (1999) proposed an architectural model for software reliability quantification.

Xang and Xie (2000) examined operational and testing reliability to study software reliability growth. Sarhan (2002) analyzed reliability equivalence for basic series/parallel systems. Levitin (2004) gave an algorithm for evaluating reliability and expected execution time for systems consisting of fault-tolerant software components running on several hardware units. Ho et al. (2003) and Chang and Jeng (2005) studied connectionist models for software reliability prediction. Choi and Seong (2006) presented the reliability assessment of embedded digital systems using multi-state functions. Yu et al. (2007) gave the reliability optimization of a redundant system with failure dependency. Jha et al. (2009) and Yang et al. (2010) examined software reliability models with testing effort and cost analysis.

Yang and Meng (2011) proposed a warm standby repairable system consisting

of two dissimilar units and one repairman. Bichon et al. (2011) studied efficient surrogate models for reliability analysis of systems with multiple failure modes; their surrogate-based approach simultaneously addresses the issues of accuracy, efficiency, and unimportant failure modes.

1.5.6 k-out-of-n: G configuration

In k-out-of-n: G configuration the system works if at least k components work

out of available n components. A variety of applications of these systems are found in

reliability analysis especially when performance prediction of hardware and software

systems is considered. System reliability of k-out-of-n system has been discussed by

many researchers in different frameworks. Dhillon (1978) and Chung (1980)

proposed a k-out-of-N three state system with common-cause failures and

replacement policy. Shanthikumar (1982) worked on recursive algorithm to evaluate

the reliability of a consecutive K-out-of-N: F system. Hwang (1986) developed

reliability models for consecutive k-out-of-n: G systems. Vanderperre (1990) and

She and Pecht (1992) proposed a general closed-form equation for system reliability

of a k-out-of-n warm-standby system. The equation reduces to the hot and cold

standby cases under the appropriate restrictions. Moustafa (1997, 1998) suggested the

transient analysis to evaluate the reliability with and without repair for k-out of-n: G

systems with M failure modes and imperfect coverage. Amari (2000) carried out transient analysis to examine the reliability with and without repair for K-out-of-N: G systems with M failure modes. Moustafa (2001a) found the availability of K-out-of-N: G systems

with exponential failures and general repairs. Hong et al. (2002) considered joint

reliability importance of k-out-of-n-systems. Da Casta Bueno (2005) proposed

minimal standby redundancy allocation in a k-out-of-n:F system with dependent

components. Janab and Dhillon (2006) gave assessment of reversible multi-state k-

out-of-n: G/F load-sharing systems with flow-graph models. Rushdi and Alsulami

(2007) evaluated the cost elasticities of reliability and MTTF for k-out-of-n systems.

Da Casta Bueno and Do Carmo (2007) worked on active redundancy allocation for a

k-out-of-n:F system of dependent components. Yinghui and Jing (2008) proposed a

new model for load-sharing k-out-of-n: G systems with different components. The

parallel and k-out-of-n:G systems with non- identical components and their mean

residual life functions were studied by Gurler and Bairamov (2009). Beutner (2010)

developed a non parametric model for k-out-of-n systems. Habib et al. (2010)

discussed reliability of a consecutive (r, s)-out-of-(m, n):F lattice system with

conditions on the number of failed components in the system. Eryilmaz (2011)

analyzed the dynamic behavior of k-out-of-n: G systems.

1.5.7 Common Cause Failure

Common cause failure is an important phenomenon in reliability and fault tolerant systems, as common cause failures often dominate hardware failures. Typical common causes include impact, vibration, pressure, stress and temperature. Many researchers have incorporated the concept of common cause failure in their studies on reliability assessment. Hughes (1987) and Mosleh (1991) proposed new approaches to the quantification of common cause failure in the systems under consideration, producing a direct procedure for system common cause failure probability calculations that depends not on any modelling assumptions but only on the system structure. Sridharan and

Mohanavadivu (1997) did reliability and availability analysis for two non-identical

unit parallel systems with common cause failures and human errors. Levitin (2001)

incorporated common-cause failures for non repairable multistate series–parallel

systems analysis. Vaurio (2003) proposed the modelling and quantification of

common cause failures in redundant standby safety systems by incorporating the

assessment uncertainties in the estimation of multiple failure rates based on data from


many plants or systems. Salem and El-Damcese (2004) developed a model for the

analysis of systems subject to common-cause failures under the assumption of

Weibull distribution. The optimization of the system reliability in the presence of

common cause failure was considered by Ramirez-Marquez and Coit (2006). Xing et

al. (2007) proposed an efficient decomposition and aggregation approach for

incorporating common-cause failures into the reliability evaluation of hierarchical

computer-based systems. El-Damcese (2009) studied a k-out-of-(M+S): G warm

standby system with repair and time varying failure due to common cause failure to

formulate the reliability and availability. Li et al. (2009a) evaluated a warm standby

system with components having proportional hazard rates. Heterogeneous redundancy optimization for multi-state series-parallel systems subject to common cause failures was carried out by Li et al. (2010). Hajeeh (2011) discussed the reliability and

availability for series configurations having both warm and cold standbys with the

existence of common cause failure of the system at all states. The mean time to failure

(MTTF) and steady state availability are derived for all configurations.

1.5.8 Availability Analysis

In the reliability literature, several studies are devoted to the availability analysis of fault tolerant systems. Gupta and Tyagi (1986) suggested the MTTF and

availability evaluation of a two-unit, two-state, standby redundant complex system

with constant human failure. Chen (1992) gave the transient analysis to predict the

reliability and availability of k-to-l-out-of-n: G system. Tokuno and Yamada (1995)

considered a software availability model by incorporating a positive fault-correction

time and uncertainty of the fault-correction activities. They assumed that the hazard

rate for software-failure occurrence reduces geometrically with the progress in the

fault-removal process. Sarkar and Chaudhuri (1999) studied the availability of a

system with gamma life and exponential repair under a perfect repair policy.

The availability of a system maintained through several imperfect repairs

before a replacement or a perfect repair was studied by Biswas and Sarkar (2000),

Sarkar and Sarkar (2000), Sarkar and Sarkar (2001). De Smidt-Destombes et al.

(2004) considered the availability of a k-out-of-N system given limited spares and

repair capacity under a condition based maintenance strategy. Kharoufeh et al. (2006)

gave availability of periodically inspected systems with Markovian wear and shocks.


Kiureghian et al. (2007) discussed the availability of K-out-of-N systems. Failure-aware resource management for high availability in computing clusters with distributed virtual machines was proposed by Fu (2010). Moghaddass et al. (2010) discussed the availability of a general k-out-of-n: G system with non-identical components considering shut-off rules, using a quasi-birth-death process and a semi-Markov model. An improved delay time model with imperfect maintenance at inspection was developed by Wang et al. (2011a) based on the assumption of imperfect inspection maintenance and perfect failure maintenance.

1.5.9 Failure Analysis

Fault tolerant systems are always prone to failures, and these failures disrupt the services provided through such systems. The transfer of a spare unit from the standby state to the active state is called standby switching. There is always a possibility that the switching device itself fails with some probability, in which case an interruption, for example one caused by a power supply failure, is not resolved because of the standby switching failure. Various models have been developed by several researchers for analyzing standby redundant systems with switching failures. A notable contribution in this area is due to Chow (1971), who discussed the reliability of two

items in sequence with sensing and switching. Alidrisi (1992) defined the reliability

of a dynamic warm standby redundant system of n components with imperfect

switching. Reliability prediction of imperfect switching systems subject to multiple

stresses was suggested by Pan (1997). Xu et al. (2005) studied the asymptotic

stability of a repairable system with repair time of failed system that follows arbitrary

distribution along with imperfect switching failure. Jain et al. (2007) studied the

queueing system with mixed standbys; they assumed the life-time and repair time of

the units to be exponentially distributed. Wang et al. (2006a) and Wang and Chen

(2009) made a comparative analysis of availability between three systems with

general repair times, reboot delay and switching failures. Hsu et al. (2008, 2011b) studied the availability of a system with reboot delay, standby switching failures and an unreliable repair facility, taking the time-to-failure and the reboot time as exponentially distributed. The repair time of the service station and the time-to-repair of the components were assumed to be generally distributed.


A degraded failure may not stop the fundamental function of the system; there can be multiple stages of degradation, and the system may fail after a certain number of stages. When all standbys (warm and cold) are in use, the failure of units may occur in a degraded fashion. Significant work has been done on systems with degraded failure rates. Many researchers have studied the problem of degraded failure in different frameworks and suggested ways and means to tackle such situations.

Initially, the concept of degraded states and common-cause failures was studied by

Yamashiro (1982). Gupta and Sharma (1986) and Nahas et al. (2007) did reliability

analysis of two state repairable parallel redundant systems under human failure and

developed the algorithm to analyze the series - parallel system. Jain et al. (2004)

worked on the bilevel control of degraded machining system with warm standbys,

setup and vacation. Pham et al. (1997) evaluated full and degraded mission reliability

and mission dependability for intermittently operated, multi-functional systems and

obtained the availability and mean life time of multistage degraded system with

partial repairs. Soro et al. (2010) investigated multi-state degraded systems with

minimal repairs and imperfect preventive maintenance.

1.5.10 Software Reliability Growth Models

A reliability model of Markov structured software was developed by

Littlewood (1975). Trivedi and Shooman (1975) and Kemmeny and Snell (1976)

studied many state Markov models for the estimation and prediction of computer

software performance parameters. Whittaker and Poore (1993) did Markov analysis

of software specification. Whittaker and Thomason (1994) studied Markov chain

models for statistical software testing. A non - homogeneous Markov software

reliability model with imperfect repair was presented by Gokhale et al. (1996). Propp

and Wilson (1996) considered exact sampling with coupled Markov chains and

applications to statistical mechanics. Whittaker et al. (2000) studied a Markov chain

model to enable reliability prediction for future builds using testing data for the

software system. El-Gohary (2004) analyzed the estimations of parameters in a three

state reliability semi-Markov model. Prowell and Poore (2004) explored the

computing system reliability using Markov chain usage models. Lo et al. (2005) gave

the reliability assessment and sensitivity analysis of software reliability growth

modeling based on software module structure. Montoro-Cazorla and Perez-Ocon


(2006) studied the reliability of a system under two types of failure using a

Markovian arrival process. Chiquet et al. (2008) estimated the reliability of stochastic

dynamical systems with Markovian switching. Do Van et al. (2010) gave the

importance measures for Markov reliability models. Buchholz et al. (2010) described

the multi-class Markovian arrival processes and their parameter fitting. Meedeniya et

al. (2011) proposed the software and hardware architecture with necessary reliability-

relevant attributes to quantify the quality of individual deployment alternatives by

employing an evolutionary algorithm.

1.6 Organization of the Thesis

The objective of this thesis is to develop reliability/availability models of fault tolerant systems. Redundancy issues are taken into consideration to predict hardware and software reliability. The thesis is structured into 9 chapters dealing with the reliability/availability analysis of fault tolerant systems in different frameworks. In Chapters 2-9, we investigate some models of fault tolerant systems by stating the requisite assumptions and notations to formulate the mathematical model and to provide analytical and numerical results. The introductory section of each chapter includes a review of the relevant literature and the organization of the chapter. The mathematical analysis of the developed models is presented using appropriate methodology based on reliability theory. The numerical results are summarized in tables and also exhibited via graphs. At the end of the thesis, the relevant references are arranged alphabetically for the convenience of the readers. The chapter-wise organization of the thesis is as follows:


The present chapter, a general introduction, presents the research motivation and various conceptual aspects of the hardware and software reliability of fault tolerant systems. It provides background on the techniques used for the analysis of the reliability models of redundant systems developed later. The important contributions in the area of our research interest have been discussed. The outline of the thesis and the significance of the work investigated in the various chapters are also highlighted.



Chapter-2: Fault Tolerance in A Clustered Architecture

The collection of the software faults, error detection, diagnosis and error

recovery is considered to investigate the fault tolerance in a clustered system. In this

chapter we study the different levels of tolerance. The numerical results provided will

be helpful to examine the reliability and availability of the concerned system.

Chapter-3: Reliability Modeling of Hardware and Software Interactions

A reliable computer system delivers the normal level of service in the presence of both hardware and software faults. In this chapter, we deal with a reliability model for a computer system that considers three types of failure, namely software failure, hardware failure and failure due to the interaction of software and hardware, along with common cause failure. We derive the probabilities of the system being in different states by using the successive over relaxation (SOR) technique.

Chapter-4: Distributed Software and Hardware Systems

A distributed system is a collection of computers in which any member of the cluster is capable of supporting the processing functions of any other member. In this chapter we investigate a multi-host system with standbys. The

common cause failure which is an important factor to calculate the availability of the

realistic system is also taken into consideration. Some important indices to predict the

performance of the system are given.

Chapter-5: Transient Analysis of a Hardware-Software System

This chapter focuses on the performance prediction of hardware and software

system supported with standbys. This chapter is arranged into two sections as follows:

Section-5(A): Embedded Computer System with Two Types of Failures and Common Cause Failure

In this section we propose a Markov model for a K-out-of-N: G system with common cause failure. The hardware system consists of N non-identical components and Y warm standby components under the care of a single repairman. The system is subject to hardware errors, human errors and common cause failures. Numerical results obtained using the fourth-order Runge-Kutta method are provided to give insight into the sensitivity of the system parameters.

Section-5(B): Hardware and Software Systems with Warm Standbys and Switching Failures

This section is concerned with a Markovian model for a hardware and software system. Warm standby hardware units are provided, and switching to a standby unit may fail when it is brought into use. Numerical techniques based on the matrix method and the Runge-Kutta method are used to compute various system performance indices.

Chapter-6: Availability Analysis of Repairable Redundant System with Reboot Delay

In this chapter we discuss the availability indices of a hardware-software

system with switching failure and reboot delay. The chapter is further subdivided into

two sections, which are given as follows:

Section-6(A): Hardware-Software System with Switching Failure

In this section we consider the reliability and availability analysis of a system having M hardware components with warm standbys and N software components. The concepts of switching failure, common cause failure and reboot delay are taken into consideration. The Adaptive Network-based Fuzzy Inference System (ANFIS) approach and the successive over relaxation (SOR) technique are used to obtain the numerical results.

Section-6(B): Repairable System with Warm Standby and Switching Failure

We study the availability of a repairable system supported by warm standbys for
three kinds of configurations. To formulate the model, the assumptions of switching
failure and setup time are incorporated. Numerical results are provided to show how
the system descriptors affect the performance indices.

Chapter-7: Warranty Policy for Hardware and Software

Systems with Common Cause Failure

In this chapter we study the performance of a K-out-of-N system with common
cause failure. The system consists of hardware and software components. The
analysis is carried out along with the application of a warranty policy. The free
replacement warranty policy and repair are considered. The reliability and other
measures are obtained. The analytical results are supported by numerical results.

Chapter-8: Semi-Markov Models with Common Cause Failure

In many real time systems, it is practically impossible to build a perfect
system in which component failure never leads to system failure. Redundancy is a
widely used technique for building systems that continue to operate satisfactorily
in the presence of faults. In this chapter we develop semi-Markov models for
redundant systems with common cause failure. This chapter is arranged into two
sections as follows:

Section-8A: Redundant System with Rejuvenation

The principal objective of applying redundancy is to achieve availability goals

subject to techno-economic constraints. In this section we predict the availability of

redundant system with common cause failure and rejuvenation by using an embedded

Markov chain approach. A recursive procedure for generating the state-transition

probabilities is employed. The appropriate framework for finding the optimal

rejuvenation interval is discussed by considering the downtime cost factors.

Section-8B: Imperfect Fault Coverage System with Reboot

We develop a semi-Markov model for a two-unit system by using the
supplementary variable technique (SVT). The concepts of common cause failure,
reboot and imperfect fault coverage are taken into consideration. The model is
analyzed for three types of repair time distribution: exponential, gamma and
uniform. We derive explicit expressions for the availability and failure frequency
of the system.

Chapter-9: Software Reliability Growth Model (SRGM) with

N-version Programming

This investigation is concerned with a software reliability growth model
based on a non-homogeneous Poisson process with testing effort. N-version
programming is taken into consideration under the assumption of an imperfect
debugging process. We propose the estimation of the unknown parameters by the
maximum likelihood approach. The Adaptive Network-based Fuzzy Inference System
(ANFIS) approach is also employed to obtain the numerical results.


The main objective of this chapter has been to discuss some fundamental

concepts of fault tolerance and reliability analysis. The key issues related to the design

and analysis of fault-tolerant systems are also described. In this thesis, we have

studied software fault tolerance, hardware fault tolerance and other related issues in

different frameworks.

Our investigations focus on establishing reliability indices for the hardware
and software system. It is expected that the hardware and software fault tolerant
systems studied will benefit all concerned by enabling greater predictability of
the dependability of software. It is worthwhile to highlight the novel features of
the investigations carried out in the present thesis. Some key issues tackled with
suitable methodology are as follows:

The concepts of recovery blocks, N-version programming, common cause

failure, reboot delay, imperfect debugging, switching failures, standbys, etc.

are taken into account while developing reliability models for fault tolerance

systems.

When the number of operating components available is on the verge of the
minimal specified level, the workload on the operating components increases
and the system starts working in degraded mode. The concept of degraded
failure has been used for developing the models studied in chapters 3, 4, 5(A),
7 and 8(B).

The moment a standby unit is switched in place of a broken-down unit, it
should work; however, this is not always so. There is always a possibility that
the switching process may fail. The probabilistic nature of this situation is
important to incorporate in system repair models where standbys play a crucial
role in case of system failures. This concept has been incorporated in chapters
5(B) and 6(A)-(B) so that the models depict a more realistic and versatile
scenario.

In chapter 5(A), we have developed k-out-of-N system, wherein out of total N

components, k are required for smooth functioning of the system and (N-k)

units are kept as standbys.

Fault tolerance and software reliability growth models are widely used for the
performance modeling and analysis of communication systems and computer
networks. The concepts of N-version programming, testing effort and
imperfect debugging studied in chapter 9 may prove useful in real time fault
tolerant systems.

In chapter 7, warranty policies are employed in a warranty cost model to provide
analytical results which are difficult to obtain using the classical renewal
process.

The Runge-Kutta (R-K) method is a reasonably accurate and well behaved approach
that can be employed for a wide range of problems. The fourth-order Runge-Kutta
method is used to obtain the solution of the differential equations governing the
reliability models in chapters 4, 5(A), 5(B) and 8(B).

The successive over relaxation (SOR) method is used in chapters 3 and 6(A)
to compute the availability of the hardware-software model. This method
reproduces the trend of the availability quite well.

The supplementary variable technique (SVT) is used in chapter 8(B) to
analyze the availability and failure frequency for different repair time
distributions (exponential, gamma and uniform) for the imperfect fault coverage
system with common cause failure, recovery and reboot delay.

Capacity planners and system analysts may utilize the results of the models
developed in the present thesis to architect the fault tolerant systems of future
generations. It is hoped that our investigations may be beneficial to decision
makers, system designers, network administrators and system developers in achieving
the desired grade of service or in managing limited resources under techno-economic
constraints, as far as real time fault tolerant systems are concerned.

Chapter-2: Fault Tolerance in a Clustered Architecture

2.1 Introduction
2.2 Reliability Models
2.3 Model for Software Fault Tolerance
2.4 The Analysis
2.5 Numerical Illustrations
2.6 Conclusion

System architecture based on a cluster of computers has received
considerable attention recently. In a clustered system, applications
can be built with commercial hardware, operating systems and
application software to achieve high system availability. In this
chapter we describe various levels of fault tolerance in terms of
fault detection, fault recovery, volatile data consistency and
persistent data consistency. The application software is responsible
for the extent of the data backup, subsequent recovery and error
detection. Numerical results are provided to show the effect of the
system descriptors on the performance indices.

2.1 Introduction

Software reliability is one of the most important characteristics of system
software. Its measurement and management during the software life-cycle are
important to produce and maintain reliable software. Fault tolerance is the ability
of a system to continue correct performance of its tasks after the occurrence of
hardware or software faults. In recent years, many computer system failures have
been caused by software faults introduced during the software development process.
Events of system unavailability and data inconsistency are often caused by faults
that exist in the system and are eventually exposed. A fault is simply any physical
defect, imperfection or flaw that occurs in hardware or software. Some faults cannot
be tolerated and lead to an entire system failure, whereas in other cases the impact
is limited to a partial system failure.

The fault tolerant architecture includes the software design of error detection
and diagnosis as well as error recovery. The executing program is supervised by the
watchdog, which signals a failure condition of the software program whenever the
execution time of a subprogram exceeds its default value. In a cluster, the system
is built with commercial hardware, operating system and database system. The
systems are required to be available upon user demand and their data should be
consistent for the user's purpose. Carpenter (1990) defined the mechanism for
evaluating the effectiveness of software fault-tolerant structures. Siewiorek and
Swarz (1992) presented reliable design of computer systems. Hoeflin and Mendiratta
(1995) analyzed an elementary model for predicting switching system outage
durations. Huang and Kintala (1995) proposed software fault tolerance in the
application layer. Mendiratta (1996) discussed the reliability impacts of software
fault tolerance mechanisms. Sahner et al. (1996) proposed the performance and
reliability analysis of computer systems using the SHARPE software package.
Hughes-Fenchel (1997) studied a flexible clustered approach to achieve high
availability. Berman and Kumar (1999) presented optimization models for complex
recovery block schemes.

Cheng et al. (2000) analyzed a fault-tolerance model for multiprocessor
real-time systems. Littlewood et al. (2002) studied the reliability of diverse
fault-tolerant software based systems. Ho et al. (2003) studied various models for
software reliability prediction. Levitin (2004) presented the reliability and
performance analysis for fault-tolerant programs consisting of versions with
different characteristics. Zhang and Qin (2008) and Leach (2008) described
parametric analysis and checkpoints in legacy code to improve the fault tolerant
system. Santos et al. (2009) considered power saving and fault tolerance in
real-time critical embedded systems. The adaptation of grid applications to safety
using fault tolerant methods was studied by Shi et al. (2010). Rafe and Mahdian
(2011) explored the style based modeling and verification of fault tolerant service
oriented architecture. Shet et al. (2011) suggested various strategies for fault
tolerance in multicomponent applications.

In this investigation, fault tolerant software in a clustered architecture is
studied. The rest of the chapter is structured as follows. In section 2.2,
we explain the reliability models and levels. In section 2.3, we construct the
transient equations for the clustered fault tolerant system with the help of a
transition diagram. In section 2.4, the steady state probabilities are obtained. In
section 2.5, we provide numerical results, which are displayed with the help of
tables and graphs. Section 2.6 is devoted to the concluding remarks.

2.2 Reliability Models

The hardware and software models are used to predict the system
availability and other reliability indices. The data consistency model gives the
predictions of the defect rate. In other words, it deals with the units of load
(calls, messages, transactions, etc.) that are lost due to hardware and software
failures, as a proportion of the total offered load. The levels of reliability,
which are based on the definition of levels of software fault tolerance, are
considered to model the fault tolerant system. The reliability levels are described
in ascending order as follows:

Level 0 : Basic automatic fault detection by watchdog, no automatic fault
recovery, no data consistency.

The watchdog detects faults in the hardware and software. For a software
fault, the application process is restarted from its initial internal state. For a
hardware fault, the system is manually reconfigured and the faulty processor is
removed.

Level 1 : Basic automatic fault detection by watchdog, automatic fault recovery,
no data consistency.

The watchdog detects and recovers from a group of hardware and software faults.
When the watchdog detects a fault in the hardware, the system is automatically
reconfigured and recovered. For a software fault, the process is restarted from the
initial internal state. The restarted internal state does not reflect the previous
execution.

Level 2 : Basic automatic fault detection by watchdog, automatic fault recovery,
no data consistency; enhanced automatic fault detection by watchdog plus
periodic checkpointing, logging and recovery of internal state.

At this level a larger set of hardware and software faults is automatically
detected by the watchdog and the application. When a hardware failure is detected,
the system is reconfigured around the faulty unit. If both hardware and software
fail, the application is restarted and restored close to the state at which it
failed.

Level 3 : Level 2 and persistent data recovery.

The persistent data of the application is replicated on a backup disk connected
to a backup node with the capabilities of level 2. The persistent data is kept
consistent with the data on the primary node throughout the normal operation of the
application.

A higher level is more reliable than a lower level, so that level i is more
reliable than level i-1 (1 ≤ i ≤ 3).

2.3 Model for Software Fault Tolerance

Fig. 2.1: State transition diagram for software fault tolerance

The Markov model for systems having different levels of software fault
tolerance (see fig. 2.1) includes five states: (i) working (W), (ii) fault detection
and recovery (FDR), (iii) volatile data recovery (VDR), (iv) persistent data recovery
(PDR) and (v) failure (F).

The working state represents the normal execution state of the system. If an
error occurs in this state, the system moves to another state. If the error is
recoverable, the system enters the fault detection and recovery state and recovery
starts. If the error is recovered in this state, the system goes back to the working
state; otherwise the system either fails or enters another level of recovery. The
recovery process goes on to the volatile data recovery state and the persistent data
recovery state in a similar pattern.

The following notations are used to formulate the model:

$\lambda$ : The error rate.

$\lambda_1$ : The rate at which recovery cannot be completed in this state.

$\lambda_2$ : The rate at which volatile data recovery cannot be completed in this state.

$\lambda_3$ : The rate at which persistent data recovery cannot be completed in this state.

$\lambda_P$ : The failure rate of the power supply unit.

$\mu$ : The manual repair rate.

$\mu_1$ : The rate at which successful recovery is performed.

$\mu_2$ : The rate at which successful volatile data recovery is performed.

$\mu_3$ : The rate at which successful persistent data recovery is performed.

$C$ : Fault recovery coverage factor for the error.

$C_1$ : Volatile data recovery coverage factor for the error.

$C_2$ : Persistent data recovery coverage factor for the error.

$P_i(t)$ : Probability that the system is in the ith (i = 0,1,2,3,4) state at time t.

The equations governing the model are constructed as follows:

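$$\frac{dP_0}{dt} = -\left[\lambda(1-C) + \lambda_P + \lambda C\right]P_0(t) + \mu_1 P_1(t) + \mu_2 P_2(t) + \mu_3 P_3(t) + \mu P_4(t) \tag{2.1}$$

$$\frac{dP_1}{dt} = -\left[\lambda_1(1-C_1) + \lambda_P + \lambda_1 C_1 + \mu_1\right]P_1(t) + \lambda C\,P_0(t) \tag{2.2}$$

$$\frac{dP_2}{dt} = -\left[\lambda_2(1-C_2) + \lambda_P + \lambda_2 C_2 + \mu_2\right]P_2(t) + \lambda_1 C_1 P_1(t) \tag{2.3}$$

$$\frac{dP_3}{dt} = -\left[\lambda_3 + \lambda_P + \mu_3\right]P_3(t) + \lambda_2 C_2 P_2(t) \tag{2.4}$$

$$\frac{dP_4}{dt} = \left[\lambda(1-C) + \lambda_P\right]P_0(t) + \left[\lambda_1(1-C_1) + \lambda_P\right]P_1(t) + \left[\lambda_2(1-C_2) + \lambda_P\right]P_2(t) + \left[\lambda_3 + \lambda_P\right]P_3(t) - \mu P_4(t) \tag{2.5}$$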
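The transient system (2.1)-(2.5) can be integrated numerically. The following is a minimal sketch, not the thesis's own program: it uses a hand-written classical fourth-order Runge-Kutta step, the rates of data set II with $\lambda_P = 0.05$ (one of the values appearing in Table 2.2), and an assumed step size of 0.001. The function and variable names are illustrative assumptions.

```python
# Rates from data set II; lp = 0.05 is one of the lambda_P values of Table 2.2.
RATES = dict(lam=0.5, l1=4.0, l2=50.0, l3=5.0, lp=0.05,
             c=0.9, c1=0.95, c2=0.999,
             mu=0.2, m1=1.0, m2=5.0, m3=10.0)


def derivatives(p, r):
    """Right-hand sides of equations (2.1)-(2.5) for p = [P0, P1, P2, P3, P4]."""
    p0, p1, p2, p3, p4 = p
    dp0 = (-(r["lam"] * (1 - r["c"]) + r["lp"] + r["lam"] * r["c"]) * p0
           + r["m1"] * p1 + r["m2"] * p2 + r["m3"] * p3 + r["mu"] * p4)
    dp1 = (-(r["l1"] * (1 - r["c1"]) + r["lp"] + r["l1"] * r["c1"] + r["m1"]) * p1
           + r["lam"] * r["c"] * p0)
    dp2 = (-(r["l2"] * (1 - r["c2"]) + r["lp"] + r["l2"] * r["c2"] + r["m2"]) * p2
           + r["l1"] * r["c1"] * p1)
    dp3 = -(r["l3"] + r["lp"] + r["m3"]) * p3 + r["l2"] * r["c2"] * p2
    dp4 = ((r["lam"] * (1 - r["c"]) + r["lp"]) * p0
           + (r["l1"] * (1 - r["c1"]) + r["lp"]) * p1
           + (r["l2"] * (1 - r["c2"]) + r["lp"]) * p2
           + (r["l3"] + r["lp"]) * p3
           - r["mu"] * p4)
    return [dp0, dp1, dp2, dp3, dp4]


def rk4_step(p, h, r):
    """One classical fourth-order Runge-Kutta step of size h."""
    k1 = derivatives(p, r)
    k2 = derivatives([x + 0.5 * h * k for x, k in zip(p, k1)], r)
    k3 = derivatives([x + 0.5 * h * k for x, k in zip(p, k2)], r)
    k4 = derivatives([x + h * k for x, k in zip(p, k3)], r)
    return [x + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for x, a, b, c, d in zip(p, k1, k2, k3, k4)]


def solve(t_end, h, r):
    """Integrate from the fully working state P0(0) = 1 up to time t_end."""
    p = [1.0, 0.0, 0.0, 0.0, 0.0]
    for _ in range(round(t_end / h)):
        p = rk4_step(p, h, r)
    return p


p_at_2 = solve(2.0, 0.001, RATES)
p_at_10 = solve(10.0, 0.001, RATES)
print("t = 2 :", [round(x, 6) for x in p_at_2])
print("t = 10:", [round(x, 6) for x in p_at_10])
```

The step size is kept well below $1/(\lambda_2 + \lambda_P + \mu_2)$ because the volatile data recovery state is left much faster than the others. The printed rows can be compared with the $\lambda_P = 0.05$ block of Table 2.2.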



2.4 The Analysis

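For notational convenience, we denote:

$\Lambda_0 = \lambda + \lambda_P$ ; $\Lambda_1 = \lambda_1 + \lambda_P + \mu_1$ ; $\Lambda_2 = \lambda_2\bar{C}_2 + \lambda_P + \mu_2$ ; $\Lambda_3 = \lambda_3 + \lambda_P + \mu_3$

$\bar{C} = (1 - C)$ ; $\bar{C}_1 = (1 - C_1)$ ; $\bar{C}_2 = (1 - C_2)$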

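For the steady state, equations (2.1)-(2.5) can be written as

$$0 = \left[\lambda(1-C) + \lambda_P\right]P_0 + \left[\lambda_1(1-C_1) + \lambda_P\right]P_1 + \left[\lambda_2(1-C_2) + \lambda_P\right]P_2 + \left[\lambda_3 + \lambda_P\right]P_3 - \mu P_4 \tag{2.6}$$

$$0 = -\left[\lambda(1-C) + \lambda_P + \lambda C\right]P_0 + \mu_1 P_1 + \mu_2 P_2 + \mu_3 P_3 + \mu P_4 \tag{2.7}$$

$$0 = -\left[\lambda_1(1-C_1) + \lambda_P + \lambda_1 C_1 + \mu_1\right]P_1 + \lambda C\,P_0 \tag{2.8}$$

$$0 = -\left[\lambda_2(1-C_2) + \lambda_P + \lambda_2 C_2 + \mu_2\right]P_2 + \lambda_1 C_1 P_1 \tag{2.9}$$

$$0 = -\left[\lambda_3 + \lambda_P + \mu_3\right]P_3 + \lambda_2 C_2 P_2 \tag{2.10}$$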

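Solving equations (2.7)-(2.10), we obtain

$$P_0 = \frac{\mu_1 P_1 + \mu_2 P_2 + \mu_3 P_3 + \mu P_4}{\lambda + \lambda_P} \tag{2.11}$$

$$P_1 = \frac{\lambda C\,P_0}{\Lambda_1} \tag{2.12}$$

$$P_2 = \frac{\lambda C\,\lambda_1 C_1 P_0}{\Lambda_1(\lambda_2 C_2 + \Lambda_2)} \tag{2.13}$$

$$P_3 = \frac{\lambda C\,\lambda_1 C_1 \lambda_2 C_2 P_0}{\Lambda_1 \Lambda_3(\lambda_2 C_2 + \Lambda_2)} \tag{2.14}$$

From eq. (2.6) and the values of $P_1$, $P_2$ and $P_3$ from (2.12)-(2.14), we get

$$P_4 = \frac{P_0}{\mu}\cdot\frac{\Lambda_0\Lambda_1\Lambda_3(\lambda_2 C_2 + \Lambda_2) - \lambda C\,\mu_1\Lambda_3(\lambda_2 C_2 + \Lambda_2) - \lambda C\,\lambda_1 C_1\,\mu_2\Lambda_3 - \lambda C\,\lambda_1 C_1\lambda_2 C_2\,\mu_3}{\Lambda_1\Lambda_3(\lambda_2 C_2 + \Lambda_2)} \tag{2.15}$$

Now using the normalizing condition $\sum_{i=0}^{4} P_i = 1$, we obtain $P_0$ as

$$P_0 = \frac{\mu\Lambda_1\Lambda_3(\lambda_2 C_2 + \Lambda_2)}{(\mu + \Lambda_0)\Lambda_1\Lambda_3(\lambda_2 C_2 + \Lambda_2) + \lambda C\,\Lambda_3(\lambda_2 C_2 + \Lambda_2)(\mu - \mu_1) + \lambda C\,\lambda_1 C_1\Lambda_3(\mu - \mu_2) + \lambda C\,\lambda_1 C_1\lambda_2 C_2(\mu - \mu_3)} \tag{2.16}$$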
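As a consistency check on the steady-state solution, the balance equations (2.7)-(2.10) can be solved directly and compared with a closed form for $P_0$. The sketch below is illustrative only. It assumes that the quantity $\lambda_2 C_2 + \Lambda_2$ in (2.13)-(2.16) equals the total rate of leaving the volatile data recovery state, $\lambda_2 + \lambda_P + \mu_2$, and it uses data set II with $\lambda_P = 0.05$. The function names are hypothetical.

```python
PARAMS = dict(lam=0.5, l1=4.0, l2=50.0, l3=5.0, lp=0.05,
              c=0.9, c1=0.95, c2=0.999,
              mu=0.2, m1=1.0, m2=5.0, m3=10.0)


def steady_state(lam, l1, l2, l3, lp, c, c1, c2, mu, m1, m2, m3):
    """Solve the balance equations by expressing P1..P4 as multiples of P0."""
    a1 = lam * c / (l1 + lp + m1)            # from (2.8)
    a2 = l1 * c1 * a1 / (l2 + lp + m2)       # from (2.9)
    a3 = l2 * c2 * a2 / (l3 + lp + m3)       # from (2.10)
    a4 = ((lam + lp) - m1 * a1 - m2 * a2 - m3 * a3) / mu   # from (2.7)
    p0 = 1.0 / (1.0 + a1 + a2 + a3 + a4)     # normalizing condition
    return [p0, a1 * p0, a2 * p0, a3 * p0, a4 * p0]


def p0_closed_form(lam, l1, l2, l3, lp, c, c1, c2, mu, m1, m2, m3):
    """Closed form for P0 obtained by substituting (2.12)-(2.15) into sum(P_i) = 1."""
    big0 = lam + lp
    big1 = l1 + lp + m1
    big3 = l3 + lp + m3
    s2 = l2 * c2 + (l2 * (1 - c2) + lp + m2)   # assumed meaning of lambda_2*C_2 + Lambda_2
    lc = lam * c
    d = big1 * big3 * s2
    denominator = ((mu + big0) * d
                   + lc * big3 * s2 * (mu - m1)
                   + lc * l1 * c1 * big3 * (mu - m2)
                   + lc * l1 * c1 * l2 * c2 * (mu - m3))
    return mu * d / denominator


probabilities = steady_state(**PARAMS)
closed_p0 = p0_closed_form(**PARAMS)
print("steady state:", [round(x, 6) for x in probabilities])
print("closed-form P0:", round(closed_p0, 6))
```

Both routes give the same $P_0$, and the transient values in Table 2.2 for $\lambda_P = 0.05$ approach these limits as t grows.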

2.5 Numerical Illustrations

In this section, numerical illustrations are given for the reliability R(t),
the state probabilities $P_i(t)$ and the expected operational time by varying the
error rate ($\lambda$) and the failure rate of the power supply ($\lambda_P$). The
effects of these parameters on the reliability indices have been examined for two
sets of default parameters fixed as follows:

Data set I: $\lambda_1 = 4$, $\lambda_2 = 50$, $\lambda_3 = 5$, $\lambda_P = 0.05$, $C = 0.9$, $C_1 = 0.95$, $C_2 = 0.999$, $\mu = 0.2$, $\mu_1 = 1$, $\mu_2 = 5$, $\mu_3 = 10$

Data set II: $\lambda = 0.5$, $\lambda_1 = 4$, $\lambda_2 = 50$, $\lambda_3 = 5$, $C = 0.9$, $C_1 = 0.95$, $C_2 = 0.999$, $\mu = 0.2$, $\mu_1 = 1$, $\mu_2 = 5$, $\mu_3 = 10$

Table 2.1 gives the state probabilities of the software fault tolerance model
for different values of $\lambda$, with the other parameters fixed as in data set I.
It is seen that the probability $P_0(t)$ of the working state decreases as t and
$\lambda$ increase. Table 2.2 displays the corresponding numerical results for
varying values of $\lambda_P$ for data set II. We notice the same decreasing pattern
with both t and $\lambda_P$ as seen in table 2.1.

Fig. 2.2 illustrates the state probabilities for varying values of t with
$\lambda = 0.5$, $\lambda_1 = 4$, $\lambda_2 = 50$, $\lambda_3 = 5$, $C = 0.9$, $C_1 = 0.95$, $C_2 = 0.999$,
$\mu = 0.2$, $\mu_1 = 1$, $\mu_2 = 5$, $\mu_3 = 10$. It is observed that the probability
$P_0(t)$ decreases sharply for lower values of t and thereafter becomes almost
constant. The probabilities $P_1(t)$, $P_2(t)$ and $P_3(t)$ increase, then decrease
and later become asymptotically constant as time grows. On the contrary, $P_4(t)$
initially increases sharply and then attains an almost constant value for
increasing values of t.

For data sets I and II, figs 2.3-2.5 show the results for the reliability R(t) with
respect to time t for different values of the parameters $\lambda$, $\lambda_P$ and C,
respectively. Fig. 2.3(a) shows that the reliability decreases with time t as well
as with $\lambda$, while in fig. 2.3(b) the reliability decreases for lower values
of t and then becomes almost constant for higher t. Fig. 2.4(a) shows that the
reliability R(t) initially decreases sharply and thereafter attains an almost
constant value as t increases. It is also observed that R(t) decreases as
$\lambda_P$ increases. In fig. 2.4(b), R(t) shows an almost constant trend with t.
In fig. 2.5(a), we see a similar decreasing pattern followed by a constant trend
for R(t) with t; also R(t) decreases as C increases. In fig. 2.5(b), R(t) first
gradually decreases and then becomes almost constant as t increases. It is also
found that R(t) is not much affected by different values of C.

2.6 Conclusion

We have proposed a Markov chain based reliability model to describe
software fault tolerance. Error detection and recovery methods, including
process recovery, volatile data recovery and persistent data recovery, have been
explored. The several levels of recovery procedure incorporated make our model more
versatile and realistic for dealing with real time systems. It is noticed that
manual recovery, although time-consuming and costly, may help to improve the
reliability and availability of the system to a great extent.


λ      t     P0         P1         P2         P3         P4
0.1    0     1          0          0          0          0
       2     0.920896   0.016651   0.001151   0.003838   0.057464
       4     0.88213    0.015919   0.0011     0.003667   0.097184
       6     0.858149   0.015467   0.001069   0.00356    0.121755
       8     0.843315   0.015186   0.001049   0.003495   0.136955
       10    0.834138   0.015013   0.001037   0.003454   0.146358
0.5    0     1          0          0          0          0
       2     0.699654   0.064339   0.004454   0.01494    0.216613
       4     0.595017   0.054184   0.003748   0.01253    0.334522
       6     0.545007   0.049329   0.00341    0.011378   0.390876
       8     0.521106   0.047009   0.003249   0.010827   0.41781
       10    0.509682   0.0459     0.003172   0.010563   0.430682
0.9    0     1          0          0          0          0
       2     0.54885    0.092092   0.006383   0.021509   0.331165
       4     0.431503   0.071082   0.004918   0.016474   0.476023
       6     0.386685   0.063057   0.004359   0.014551   0.531349
       8     0.369567   0.059992   0.004146   0.013815   0.55248
       10    0.36303    0.058822   0.004064   0.013534   0.56055

Table 2.1: Probabilities of software fault tolerance for different values of λ for data set I


Table 2.2: Probabilities of software fault tolerance for different values of λp for data set II

λp     t     P0         P1         P2         P3         P4
0.01   0     1          0          0          0          0
       2     0.694165   0.06381    0.004417   0.014815   0.222793
       4     0.587974   0.053503   0.0037     0.012369   0.342454
       6     0.537726   0.048626   0.003361   0.011211   0.399076
       8     0.51395    0.046318   0.003201   0.010663   0.425869
       10    0.502699   0.045226   0.003125   0.010404   0.438547
0.05   0     1          0          0          0          0
       2     0.652029   0.059746   0.004135   0.013853   0.270238
       4     0.535607   0.048446   0.003349   0.011172   0.401427
       6     0.484754   0.04351    0.003005   0.01       0.458731
       8     0.462541   0.041354   0.002855   0.009488   0.483762
       10    0.452838   0.040412   0.00279    0.009264   0.494696
0.09   0     1          0          0          0          0
       2     0.612878   0.055971   0.003872   0.012961   0.314319
       4     0.48963    0.044009   0.00304    0.010122   0.4532
       6     0.439934   0.039185   0.002704   0.008978   0.509199
       8     0.419896   0.03724    0.00257    0.008515   0.53178
       10    0.411816   0.036455   0.002515   0.008328   0.540885


52

P0

P1

P2

P3

P4

0

0.2

0.4

0.6

0.8

1

0 1 2 3 4 5 6 7 8 9 10time

Prob

abili

ty

Fig. 2.2: Probabilities of software fault tolerance


53

00.10.20.30.40.50.60.70.80.9

1

0 1 2 3 4 5 6 7 8 9 10t

R(t

)

λ=0.1λ=0.5λ=0.9

0

0.2

0.4

0.6

0.8

1

0 1 2 3 4 5 6 7 8 9 10t

R(t

)

λ = 0.1λ = 0.5λ = 0.9

Fig. 2.3(a): Profiles of reliability by Fig. 2.3(b): Profiles of reliability by

varying λ for data set I varying pλ for data set II

λp = 0.1λp = 0.5λp = 0.9

00.10.20.30.40.50.60.70.80.9

1

0 1 2 3 4 5 6 7 8 9 10t

R(t

)

λp = 0.01λp = 0.05λp = 0.09

0

0.2

0.4

0.6

0.8

1

0 1 2 3 4 5 6 7 8 9 10t

R(t

)

Fig. 2.4(a): Profiles of reliability by Fig. 2.4(b): Profiles of reliability by

varying λ for data set I varying pλ for data set II

0

0.2

0.4

0.6

0.8

1

0 1 2 3 4 5 6 7 8 9 10t

R(t

)

c = 0.9c = 0.95c = 0.99

0.9

0.92

0.94

0.96

0.98

1

0 1 2 3 4 5 6 7 8 9 10t

R(t

)

c = 0.9c = 0.95c = 0.99

Fig. 2.5(a): Profiles of reliability by Fig. 2.5(b): Profiles of reliability by varying c for data set I varying c for data set II

Chapter-3: Reliability Modeling of Hardware and Software Interactions

3.1 Introduction
3.2 Model Description
3.3 Governing Equations and Analysis
3.4 Numerical Results
3.5 Conclusion

For the reliability prediction of a computer system, the

failures are divided into three categories namely software failures,

hardware failures and interaction of software-hardware (SW/HW)

failures. In this investigation we develop a reliability model for a

computer system that includes the failures of all the three kinds.

The Markov process is used to establish the system reliability

indices with the consideration of hardware, software and

hardware-software interaction failures. The case of common cause

failure is also discussed. Using successive over relaxation (SOR)

method, we obtain the probabilities of the system being in

different states. Some important indices to predict the

performance of the system such as reliability, mean time to failure,

etc. have been obtained.
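The successive over relaxation (SOR) method named above is an iterative scheme for linear systems. The following generic sketch illustrates the iteration on a small test system that is not taken from the thesis. The relaxation factor 1.1, the tolerance and the function name are illustrative assumptions, and applying the scheme to the state equations of this chapter would additionally require the normalizing condition on the probabilities.

```python
def sor_solve(a, b, omega=1.1, tol=1e-12, max_iter=10_000):
    """Solve a x = b by successive over relaxation.

    Each sweep updates the unknowns in place. With omega = 1 the update
    reduces to the Gauss-Seidel method.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(max_iter):
        max_change = 0.0
        for i in range(n):
            # Sum of off-diagonal terms using the newest available values.
            sigma = sum(a[i][j] * x[j] for j in range(n) if j != i)
            gauss_seidel_value = (b[i] - sigma) / a[i][i]
            new_value = (1.0 - omega) * x[i] + omega * gauss_seidel_value
            max_change = max(max_change, abs(new_value - x[i]))
            x[i] = new_value
        if max_change < tol:
            return x
    raise RuntimeError("SOR did not converge within max_iter sweeps")


# Small symmetric, diagonally dominant test system (hypothetical example).
A = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
B = [2.0, 4.0, 10.0]

solution = sor_solve(A, B)
print([round(v, 6) for v in solution])
```

For symmetric positive definite matrices the iteration converges for any relaxation factor strictly between 0 and 2; the best factor depends on the matrix and is usually found by experiment.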

3.1 Introduction

Reliability of a computer system generally refers to why, when and how
system hardware and software failures occur. A reliable computer system needs to
provide its normal level of service in the presence of both hardware and software
failures. Software is an integral part of many embedded systems and a major source
of reliability degradation in dependable systems. Software reliability modeling is a
generic term for a set of statistical methods which enable the reliability indices
of software to be predicted from observation of its failures during testing and
operational use. The basic hardware reliability model covers all hardware
elements of the system, so that the overall logistics support requirements for
spares, maintenance personnel, etc. can be determined from the failure rates of the
system. The systems have hardware and software components, and the hardware may
fail by some common cause such as power supply, humidity, temperature, design,
etc. The software component can fail either because of latent faults or because of
common cause failure. Sometimes the computer system fails because of interactions
between the hardware and the software. All such kinds of failure need the attention
of maintenance engineers. In view of this, reliability quantification has become an
important part of the performance modeling of such systems.

Various researchers have contributed significantly in the field of software
reliability. Hamlet (1995) discussed software quality, software process and
software testing. Linberg (1999) analyzed software developer perceptions about
software project failure through a case study. Lee (2002) gave a detailed account of
embedded software. Zalewski et al. (2003) examined various aspects of the
software of computer control systems. Goseva-Popstojanova and Trivedi (2003)
presented architecture based approaches to software reliability prediction. Munch
and Heidrich (2004) discussed various concepts and approaches for the software
project control center. Jeske and Zhang (2005) presented some successful approaches
to software reliability modeling in industry. Raj Kiran and Ravi (2008) studied
software reliability by soft computing techniques. Vinod et al. (2008) examined the
integration of safety critical software systems in probabilistic safety assessment.
Sofokleous and Andreou (2008) considered automatic, evolutionary test data
generation for dynamic software testing. Oltean and Diosan (2009) studied an
autonomous GP-based system for regression and classification problems. Fu (2010)
explored failure-aware resource management for high-availability computing
clusters with distributed virtual machines. Meedeniya et al. (2011) presented
reliability driven deployment optimization for embedded systems.

A few researchers have tried to determine the reliability and availability of
both hardware and software through a combined reliability model for the entire
system. Reussner et al. (2003) gave the reliability indices for component based
software architectures. Huang and Chang (2007) analyzed an improved decomposition
scheme for assessing the reliability of embedded systems by using dynamic fault
trees. Van et al. (2008) carried out the reliability analysis of Markovian systems
at steady state using perturbation analysis. Dominguez-Garcia et al. (2008)
described an integrated methodology for the dynamic performance and reliability
evaluation of fault tolerant systems. Recently Zio (2009) studied reliability
engineering from the viewpoint of old problems and new challenges. The analysis of
service availability for time-triggered rejuvenation policies was considered by
Salfner and Wolter (2010). Catelani et al. (2011) proposed a new approach to
automated software testing as an aid to maximize the test plan coverage within the
time available and also to increase software reliability and quality.

The purpose of this chapter is to find the availability of the system when its
failures are classified as hardware failures, software failures or hardware-software
interaction failures. A discrete state Markov chain is used to construct a set
of differential equations for the transient probabilities governing the model. With
the aid of the steady state equations, we obtain a complete solution of the
differential equations. The rest of the chapter is organized as follows. Section 3.2
deals with the model description by stating the requisite assumptions and notations.
Section 3.3 provides the governing equations and analysis. In section 3.4, numerical
results are provided. Finally, in section 3.5, the conclusion is drawn.


3.2 Model Description

We consider a unified reliability model that accounts for hardware failures,
software failures and software-hardware interaction failures. In general, for
modeling purposes it is assumed that the hardware subsystem and the software
subsystem work independently. In practice, however, the hardware and software
subsystems cannot operate independently of each other. In this investigation we
develop a model with the assumption that the hardware and software interact with
each other. We explore hardware failures, software failures and failures due to the
interaction of hardware and software, and consequently the reliability of the
system, which is affected by the hardware-software interaction failures. The
transition flow diagram for the Markov model is shown in figure 3.1. In our model
state (0, 0) is the fully working state. States b, bb, ab, cb indicate that faults
are detected by the software but recovery is not possible. States a, ba, aa, ca
indicate that faults are detected and also recovered by the software, while states
c, bc, ac, cc indicate that faults are not detected by the software. FT is the total
failure state.

The following assumptions are made to formulate the Markov model:

The whole system consists of one software and two hardware components. The
system fails when both hardware components fail or the software component fails.

The hardware failures, software failures and HW/SW failures are independent of
one another.

$2\lambda$ is the failure rate of the hardware component. The life times and repair
times of software and hardware components are assumed to be exponentially
distributed. $\lambda_b$, $\lambda_a$ and $\lambda_c$ are the failure rates of the
hardware component corresponding to states b, a and c, respectively.

$\mu_S$ is the repair rate of the hardware components when they are recovered by the
software, and $\mu_m$ is the manual repair rate of the hardware components when they
are not recovered by the software.

After one state, an undetected hardware degradation may cause a HW/SW failure
with ac, bc and cc states respectively, and a detected degradation may cause an
execution abortion with rates $\lambda'_b$, $\lambda'_{bb}$, $\lambda'_{ab}$ and
$\lambda'_{cb}$, respectively.

Partially failed hardware may further become totally failed with rates
$\lambda_{bb}$, $\lambda_{ba}$ and $\lambda_{bc}$ if degradation is detected but not
recovered by software methods, $\lambda_{ab}$, $\lambda_{aa}$ and $\lambda_{ac}$ if
degradation is detected and recovered by software methods, and $\lambda_{cb}$,
$\lambda_{ca}$ and $\lambda_{cc}$ if degradation is not detected.

Fig. 3.1: State transition diagram


59

Some other notations used for model formulation are as follows:

$\mu'_m$ : Manual repair rate at state b.

$\mu'_S$ : Repair rate at state a.

$P_1, P'_1$ : Probability that the hardware degradation is detected.

$q_1, q'_1$ : Probability that the hardware degradation is undetected.

$P_2, P'_2$ : Probability that the degradation is recovered by software methods.

$q_2, q'_2$ : Probability that the degradation is not recovered by software methods.

FT : Total failure state with aborted failure rate and hardware-related software failure rate.

For brevity we use some more notations as follows:

$\alpha_1 = P_1 q_2 P'_1 q'_2$, $\alpha_4 = P_1 P_2 P'_1 q'_2$, $\alpha_7 = q_1 P'_1 q'_2$,

$\alpha_2 = P_1 q_2 P'_1 P'_2$, $\alpha_5 = P_1 P_2 P'_1 P'_2$, $\alpha_8 = q_1 P'_1 P'_2$,

$\alpha_3 = P_1 q_2 q'_1$, $\alpha_6 = P_1 P_2 q'_1$, $\alpha_9 = q_1 q'_1$,

and

$\Lambda_{bb} = \lambda_{bb} + \lambda'_{bb}$, $\Lambda_{bc} = \lambda_{bc} + \lambda'_{bc}$, $\Lambda_{ab} = \lambda_{ab} + \lambda'_{ab}$,

$\Lambda_{ac} = \lambda_{ac} + \lambda'_{ac}$, $\Lambda_{cb} = \lambda_{cb} + \lambda'_{cb}$, $\Lambda_{cc} = \lambda_{cc} + \lambda'_{cc}$.

3.3 Governing Equations and Analysis

The set of differential difference equations governing the model, based on the transition diagram shown in fig. 3.1, is given as follows:

$Q'_{00}(t) = -2\lambda Q_{00}(t) + \mu_m Q_b(t) + \mu_S Q_a(t)$ …(3.1)

$Q'_b(t) = 2\lambda P_1 q_2 Q_{00}(t) + \mu'_m Q_{bb}(t) + \mu'_S Q_{ba}(t) - [\mu_m + \lambda'_b + \lambda_b(\alpha_1 + \alpha_2 + \alpha_3)] Q_b(t)$ …(3.2)

$Q'_a(t) = 2\lambda P_1 P_2 Q_{00}(t) + \mu'_m Q_{ab}(t) + \mu'_S Q_{aa}(t) - [\mu_S + \lambda_a(\alpha_4 + \alpha_5 + \alpha_6)] Q_a(t)$ …(3.3)

$Q'_c(t) = 2\lambda q_1 Q_{00}(t) + \mu'_m Q_{cb}(t) + \mu'_S Q_{ca}(t) - \lambda_c(\alpha_7 + \alpha_8 + \alpha_9) Q_c(t)$ …(3.4)

$Q'_{bb}(t) = \lambda_b \alpha_1 Q_b(t) - (\mu'_m + \Lambda_{bb}) Q_{bb}(t)$ …(3.5)



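$Q'_{ba}(t) = \lambda_b \alpha_2 Q_b(t) - (\mu'_S + \lambda_{ba}) Q_{ba}(t)$ …(3.6)

$Q'_{bc}(t) = \lambda_b \alpha_3 Q_b(t) - \Lambda_{bc} Q_{bc}(t)$ …(3.7)

$Q'_{ab}(t) = \lambda_a \alpha_4 Q_a(t) - (\mu'_m + \Lambda_{ab}) Q_{ab}(t)$ …(3.8)

$Q'_{aa}(t) = \lambda_a \alpha_5 Q_a(t) - (\mu'_S + \lambda_{aa}) Q_{aa}(t)$ …(3.9)

$Q'_{ac}(t) = \lambda_a \alpha_6 Q_a(t) - \Lambda_{ac} Q_{ac}(t)$ …(3.10)

$Q'_{cb}(t) = \lambda_c \alpha_7 Q_c(t) - (\mu'_m + \Lambda_{cb}) Q_{cb}(t)$ …(3.11)

$Q'_{ca}(t) = \lambda_c \alpha_8 Q_c(t) - (\mu'_S + \lambda_{ca}) Q_{ca}(t)$ …(3.12)

$Q'_{cc}(t) = \lambda_c \alpha_9 Q_c(t) - \Lambda_{cc} Q_{cc}(t)$ …(3.13)

$Q'_{FT}(t) = \lambda'_b Q_b(t) + \Lambda_{bb} Q_{bb}(t) + \lambda_{ba} Q_{ba}(t) + \Lambda_{bc} Q_{bc}(t) + \Lambda_{ab} Q_{ab}(t) + \lambda_{aa} Q_{aa}(t) + \Lambda_{ac} Q_{ac}(t) + \Lambda_{cb} Q_{cb}(t) + \lambda_{ca} Q_{ca}(t) + \Lambda_{cc} Q_{cc}(t)$ …(3.14)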

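As $t \to \infty$, the steady-state equations governing the model are obtained from (3.1)-(3.14) as follows:

$-2\lambda Q_{00} + \mu_m Q_b + \mu_S Q_a = 0$ …(3.15)

$2\lambda P_1 q_2 Q_{00} + \mu'_m Q_{bb} + \mu'_S Q_{ba} - [\mu_m + \lambda'_b + \lambda_b(\alpha_1 + \alpha_2 + \alpha_3)] Q_b = 0$ …(3.16)

$2\lambda P_1 P_2 Q_{00} + \mu'_m Q_{ab} + \mu'_S Q_{aa} - [\mu_S + \lambda_a(\alpha_4 + \alpha_5 + \alpha_6)] Q_a = 0$ …(3.17)

$2\lambda q_1 Q_{00} + \mu'_m Q_{cb} + \mu'_S Q_{ca} - \lambda_c(\alpha_7 + \alpha_8 + \alpha_9) Q_c = 0$ …(3.18)

$\lambda_b \alpha_1 Q_b - (\mu'_m + \Lambda_{bb}) Q_{bb} = 0$ …(3.19)

$\lambda_b \alpha_2 Q_b - (\mu'_S + \lambda_{ba}) Q_{ba} = 0$ …(3.20)

$\lambda_b \alpha_3 Q_b - \Lambda_{bc} Q_{bc} = 0$ …(3.21)

$\lambda_a \alpha_4 Q_a - (\mu'_m + \Lambda_{ab}) Q_{ab} = 0$ …(3.22)

$\lambda_a \alpha_5 Q_a - (\mu'_S + \lambda_{aa}) Q_{aa} = 0$ …(3.23)

$\lambda_a \alpha_6 Q_a - \Lambda_{ac} Q_{ac} = 0$ …(3.24)

$\lambda_c \alpha_7 Q_c - (\mu'_m + \Lambda_{cb}) Q_{cb} = 0$ …(3.25)

$\lambda_c \alpha_8 Q_c - (\mu'_S + \lambda_{ca}) Q_{ca} = 0$ …(3.26)

$\lambda_c \alpha_9 Q_c - \Lambda_{cc} Q_{cc} = 0$ …(3.27)

$\lambda'_b Q_b + \Lambda_{bb} Q_{bb} + \lambda_{ba} Q_{ba} + \Lambda_{bc} Q_{bc} + \Lambda_{ab} Q_{ab} + \lambda_{aa} Q_{aa} + \Lambda_{ac} Q_{ac} + \Lambda_{cb} Q_{cb} + \lambda_{ca} Q_{ca} + \Lambda_{cc} Q_{cc} = 0$ …(3.28)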



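With the help of equations (3.15)-(3.28), we get the matrix equation

$AQ = 0$ …(3.29)

where

$Q = [Q_{00}, Q_b, Q_a, Q_c, Q_{bb}, Q_{ba}, Q_{bc}, Q_{ab}, Q_{aa}, Q_{ac}, Q_{cb}, Q_{ca}, Q_{cc}]^T$

and

$A = \begin{bmatrix} A_1 & A_2 \\ A_3 & A_4 \end{bmatrix}$.

Here

$A_1 = \begin{bmatrix}
2\lambda & -\mu_m & -\mu_S & 0 & 0 & 0 & 0 \\
-2\lambda P_1 q_2 & \theta_1 & 0 & 0 & -\mu'_m & -\mu'_S & 0 \\
-2\lambda P_1 P_2 & 0 & \theta_2 & 0 & 0 & 0 & 0 \\
-2\lambda q_1 & 0 & 0 & \theta_3 & 0 & 0 & 0 \\
0 & -\lambda_b \alpha_1 & 0 & 0 & \mu'_m + \Lambda_{bb} & 0 & 0 \\
0 & -\lambda_b \alpha_2 & 0 & 0 & 0 & \mu'_S + \lambda_{ba} & 0 \\
0 & -\lambda_b \alpha_3 & 0 & 0 & 0 & 0 & \Lambda_{bc}
\end{bmatrix}$

$A_2 = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
-\mu'_m & -\mu'_S & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -\mu'_m & -\mu'_S & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}$

$A_3 = \begin{bmatrix}
0 & 0 & -\lambda_a \alpha_4 & 0 & 0 & 0 & 0 \\
0 & 0 & -\lambda_a \alpha_5 & 0 & 0 & 0 & 0 \\
0 & 0 & -\lambda_a \alpha_6 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -\lambda_c \alpha_7 & 0 & 0 & 0 \\
0 & 0 & 0 & -\lambda_c \alpha_8 & 0 & 0 & 0 \\
0 & 0 & 0 & -\lambda_c \alpha_9 & 0 & 0 & 0
\end{bmatrix}$

$A_4 = \begin{bmatrix}
\mu'_m + \Lambda_{ab} & 0 & 0 & 0 & 0 & 0 \\
0 & \mu'_S + \lambda_{aa} & 0 & 0 & 0 & 0 \\
0 & 0 & \Lambda_{ac} & 0 & 0 & 0 \\
0 & 0 & 0 & \mu'_m + \Lambda_{cb} & 0 & 0 \\
0 & 0 & 0 & 0 & \mu'_S + \lambda_{ca} & 0 \\
0 & 0 & 0 & 0 & 0 & \Lambda_{cc}
\end{bmatrix}$

where

$\theta_1 = \mu_m + \lambda'_b + \lambda_b(\alpha_1 + \alpha_2 + \alpha_3)$

$\theta_2 = \mu_S + \lambda_a(\alpha_4 + \alpha_5 + \alpha_6)$

$\theta_3 = \lambda_c(\alpha_7 + \alpha_8 + \alpha_9)$.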

Denote

$Q = [P_1, P_2, P_3]^T$

where

$P_1 = [Q_{00}, Q_b, Q_a, Q_c, Q_{bb}]$

$P_2 = [Q_{ba}, Q_{bc}, Q_{ab}, Q_{aa}, Q_{ac}]$

$P_3 = [Q_{cb}, Q_{ca}, Q_{cc}]$.

The steady state availability of the system can be obtained as:

$A = Q_{00} + Q_a + Q_b + Q_c + Q_{bb} + Q_{ba} + Q_{bc} + Q_{aa} + Q_{ab} + Q_{ac} + Q_{cb} + Q_{ca} + Q_{cc}$ …(3.30)

To solve the equations (3.15)-(3.28), we use the successive over relaxation (SOR) technique, which is a powerful numerical method for solving a linear system of equations. The SOR technique can be applied to any converging iterative process.
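The SOR iteration mentioned above can be sketched as follows. This is a minimal illustration on a generic system $Ax = b$, not the thesis implementation: the function name `sor_solve`, the relaxation factor `omega = 1.2`, the tolerance and the 3×3 test matrix are all assumptions chosen for the example. For the homogeneous system (3.29), one equation would first have to be replaced by a normalizing condition on the probabilities, which is not shown here.

```python
def sor_solve(A, b, omega=1.2, tol=1e-10, max_iter=10_000):
    """Solve A x = b by successive over relaxation (hypothetical helper).

    Each sweep blends the previous value of x[i] with its Gauss-Seidel
    update, using the relaxation factor omega (0 < omega < 2).
    """
    n = len(b)
    x = [0.0] * n
    for iteration in range(1, max_iter + 1):
        max_change = 0.0
        for i in range(n):
            # Sum over the off-diagonal terms, using already-updated values.
            sigma = sum(A[i][j] * x[j] for j in range(n) if j != i)
            gauss_seidel_value = (b[i] - sigma) / A[i][i]
            new_value = (1.0 - omega) * x[i] + omega * gauss_seidel_value
            max_change = max(max_change, abs(new_value - x[i]))
            x[i] = new_value
        if max_change < tol:
            return x, iteration
    raise RuntimeError("SOR did not converge within max_iter sweeps")


# Small symmetric, diagonally dominant test system (illustrative values only).
A = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
b = [2.0, 4.0, 10.0]

solution, sweeps = sor_solve(A, b)
print(solution, sweeps)
```

With `omega = 1` the loop reduces to the Gauss-Seidel method; values between 1 and 2 often, though not always, reduce the number of sweeps.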

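3.4 Numerical Results

Extensive numerical results have been obtained to examine the effect of various parameters on the system availability and are displayed in figures 3.2(a-f). In order to compute various performance indices, the default parameters are taken as follows:

$\lambda = 0.4$, $\lambda_b = 0.3$, $\lambda_a = 0.4$, $\lambda_c = 0.3$, $\lambda'_b = 0.5$, $\lambda'_a = 0.6$, $\lambda'_c = 0.5$, $\lambda_{bb} = 0.7$, $\lambda'_{bb} = 0.7$,

$\lambda_{ba} = 0.4$, $\lambda'_{ba} = 0.5$, $\lambda_{bc} = 0.6$, $\lambda'_{bc} = 0.5$, $\lambda_{ab} = 0.7$, $\lambda'_{ab} = 0.8$, $\lambda_{aa} = 0.6$, $\lambda'_{aa} = 0.7$,

$\lambda_{ac} = 0.5$, $\lambda'_{ac} = 0.4$, $\lambda_{cb} = 0.6$, $\lambda'_{cb} = 0.7$, $\lambda_{ca} = 0.7$, $\lambda'_{ca} = 0.6$, $\lambda_{cc} = 0.5$, $\lambda'_{cc} = 0.5$,

$\mu_m = 0.4$, $\mu'_m = 2.1$, $\mu_S = 1.5$, $\mu'_S = 2$, $P_1 = 0.25$, $P_2 = 0.35$, $P'_1 = 0.30$, $P'_2 = 0.35$.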

Figure 3.2(a) depicts the availability by varying the failure rate ($\lambda$) for different values of $\mu_m$. It is clear that the availability gradually decreases with the increase in failure rate. It is also observed that the system availability is higher for higher values of $\mu_m$. Figure 3.2(b) exhibits the system availability by varying the software repair rate $\mu_S$ under different values of $\lambda$. It shows a sharply increasing trend in availability as $\mu_S$ increases, as we expect. The availability is also greater for lower values of $\lambda$ than for higher values.

Figs 3.2(c-d) show the effect of the failure rate $\lambda$ on the system availability for different values of $P_1$ and $P_2$, respectively. A decreasing trend is seen for availability with the increase in failure rate $\lambda$ in both figures. As we increase the detection probability of hardware degradation ($P_1$), the availability decreases significantly, whereas an increasing trend of availability is found when we increase the probability of recovery from hardware degradation ($P_2$).

In figs 3.2(e)-3.2(f), we see the pattern of availability by varying the failure rate ($\lambda$) for different values of $P'_1$ and $P'_2$, respectively. From these figures it is clear that the availability decreases as $\lambda$ increases; the decrements are more pronounced for lower values of $\lambda$ than for higher values. Furthermore, the availability tends to be constant for higher values of $\lambda$. Varying $P'_1$ and $P'_2$ has no significant effect on the system availability.

3.5 Conclusion

A software/hardware reliability model that takes into account the interactions between the hardware and software subsystems has been developed. We have examined the effects of hardware/software failures on the availability. The availability evaluated for the whole computer system may be helpful to system designers and decision makers in the future design and upgradation of embedded systems.


Fig. 3.2(a): Availability (A) vs $\lambda$ by varying $\mu_m$ ($\mu_m$ = 1, 3, 5)

Fig. 3.2(b): Availability (A) vs $\mu_S$ by varying $\lambda$ ($\lambda$ = 0.4, 0.7, 1)

Fig. 3.2(c): Availability (A) vs $\lambda$ by varying $P_1$ ($P_1$ = 0.25, 0.45, 0.65)

Fig. 3.2(d): Availability (A) vs $\lambda$ by varying $P_2$ ($P_2$ = 0.35, 0.45, 0.55)

Fig. 3.2(e): Availability (A) vs $\lambda$ by varying $P'_1$ ($P'_1$ = 0.30, 0.40, 0.50)

Fig. 3.2(f): Availability (A) vs $\lambda$ by varying $P'_2$ ($P'_2$ = 0.35, 0.45, 0.55)

Chapter-4

Distributed Software and Hardware Systems

4.1 Introduction

4.2 Model Description

4.3 The Equations and Analysis

4.4 Performance Indices

4.5 Numerical Results

4.6 Conclusion



The reliability/availability issues are key ingredients of the

performance quantification of the distributed system for design

and development of a system. The present investigation is

concerned with a multi-host system with standbys. When all

standbys are used, the system begins to work in degraded mode.

Both software and hardware failures are taken into account along

with the assumption that the software faults are constantly being

identified and removed. The common cause failure which is an

important factor to predict the availability of realistic system is also

taken into consideration. A Markov model is developed by

constructing the governing transient equations in terms of

probabilities of various system states. These probabilities are also

employed to obtain some reliability indices. Numerical experiments have been performed using the Runge-Kutta method with the help of MATLAB.

4.1 Introduction

The reliability/availability prediction is a key concern of the system engineer

in any distributed system. In recent years, distributed computing systems have become more popular owing to low-cost processors that are shared among the hosts. A distributed system is a type of cluster system, that is, a collection of computers in which any member of the cluster is capable of supporting the processing functions of any other member. From time to time, various Markov models have been

developed by many researchers to analyze the component availabilities. The common

cause failure is an important factor that should be incorporated to predict the

availability of a system working in distributed environments. In the present

investigation, we analyze the reliability issues of a distributed system, which is

subject to degradation and may fail due to common cause. The repair facility is

provided to restore the partially failed system to original state. The system may fail



partially from a good state or from any degraded state. The system can also fail due to

common cause, for example due to electrical/mechanical fault or due to

humidity/voltage problem. We consider that the system may also fail partially at any

time. The availability, which is the probability that the system is operating

satisfactorily at time t, is evaluated by using reliability theory approach.

The reliability analysis becomes more complicated when components of

distributed system are subject to common cause failure during any phase of the

mission. Common cause failures are multiple dependent component failures within a

system that are a direct result of a common cause. Many researchers have described

the hardware and/or software reliability models in different frameworks under

common cause failures. Jankala and Vaurio (1993) made a residual common cause

failure analysis to predict a probabilistic safety assessment. Jain (1998) did the

reliability analysis of two-unit system with common cause failure. Goseva-

Popstojanova and Trivedi (2000) developed Markov model for failure correlation and

studied its effects on the software reliability measures. Zhang and Horigome (2001)

discussed the availability and reliability of the system with dependent components

and time-varying failure and repair rates. Lewis (2001) considered a load-capacity

interference model with common-mode failure in 1-out-of-2: G system. Kvam and

Miller (2002) provided the common cause failure prediction using data mapping.

Nakagawa and Yasui (2003) worked on the reliability of a system complexity.

Vaurio (2003) obtained the common cause failure probabilities in standby safety

system by considering the fault tree analysis with testing-scheme and timing

dependencies. Yadavalli et al. (2005) presented the Bayesian study of a two-

component system with common cause shock failure. Lu and Lewis (2006) have done

the reliability evaluation of standby safety systems due to independent and common

cause failures. Xing et al. (2007) discussed the reliability analysis of hierarchical

computer based systems subject to common cause failures. Van et al. (2008)

presented the perturbation method to estimate an importance factor in the framework

of steady-state sensitivity analysis of Markov processes in reliability studies. Atwood

and Kelly (2008) considered the binomial failure rate to analyze a common cause

model. A two-stage approach for multi-objective decision making with applications to



system reliability optimization was studied by Li et al. (2009b). Gamiz and Miranda

(2010) gave regression analysis of the structure function for reliability evaluation of

continuous-state system. Savage and Son (2011) proposed the set theory method for the system reliability of structures with degrading components.

The replacement of failed units by standbys in multi-component machining

system is useful in computer system working in distributed environment. The

provision of repair facility in addition to spares may be helpful in the smooth running

of the software and hardware systems. Park and Kim (2002) did an availability

analysis for the improvement of active/standby cluster systems using software

rejuvenation. Kim and Dshalalow (2002) analyzed the stochastic disaster recovery

systems with external resources. Zhang and Wang (2007) studied a deteriorating cold

standby repairable system with priority in use. Lim et al. (2008) explored the diversity

and fault avoidance for dependable replication systems. Ke et al. (2008) considered a

repairable system with detection, imperfect coverage and reboot. Chakravarthy and

Gomez-Corral (2009) discussed the influence of delivery times on repairable k-out-of-N systems with spares. Eryilmaz (2010) gave mixture representations for the reliability of consecutive-k systems. Arya et al. (2011) described a methodology for reliability enhancement of a radial distribution system by determining optimal values of repair times and failure rates of each section.

The purpose of investigation in this chapter is to find the availability/reliability

indices for distributed computer system having standbys and subject to common cause

failures. A Markov chain is used to construct a set of differential difference equations for the transient probabilities governing the model. With the help of the R-K method we evaluate the transient probabilities. The rest of the chapter is organized as follows. In section 4.2, we outline the model description along with some assumptions. In section 4.3, we construct the governing equations by using appropriate transition rates. In section 4.4, we establish some performance indices. In section 4.5, we present the numerical results, which are obtained with the help of the R-K method. Section 4.6 is devoted to the concluding remarks.




4.2 Model Description

Consider the redundant N+Y=K configuration for a distributed computer system having N operating hosts and Y spare processing hosts. If all of the N hosts fail, the system fails; otherwise, as long as at least one host is working, the system keeps working. The following assumptions are made to formulate the model.

The system consists of N operating components (i.e. hosts), Y spare components and two software components. When any operating host fails, it can be replaced by an available spare host.

All the operating (spare) hosts have the same hardware failure rate $\lambda_h$ ($\alpha$) arising from an exponential distribution.

Both software and hardware components have only two states, a working state and a failed state.

Repair is provided as and when a software or hardware malfunction occurs. The repair times are exponentially distributed with parameter $\mu_s$ ($\mu'_s$) for software failure and $\mu_h$ ($\mu'_h$) for hardware failure; here $\mu'_s$ is the faster repair rate for software failure of the second software and $\mu'_h$ is the faster repair rate for hardware failure of the second one.

All the failures involved are mutually independent.

$\lambda_s(t)$, $\lambda'_s(t)$ are the software failure rates caused by all software faults, which occur in Poisson fashion.

$P_i(t)$ is the probability of the system being in the ith (i = 0, 1, 2, ..., 15) state at time t.

The system may also fail due to common cause with failure rate $\lambda_P$.

When all standbys are exhausted, the remaining operating hosts fail with degraded failure rate $\lambda_d$.

4.3 The Equations and Analysis

The differential difference equations governing the model are constructed by using the transition rates $\lambda_n$ and $\mu_n$ for the failure and repair processes, respectively. When there are n failed operating hosts in the system, the state dependent failure rates are given by

$\lambda_n = \begin{cases} N\lambda + (Y-n)\alpha, & 0 \le n \le Y \\ (N+Y-n)\lambda_d, & Y < n \le K-1 \end{cases}$

For N operating components and Y spares, the diagram for the particular case N=5 and Y=2, depicting the transition flow among all states A = {0, 1, 2, ..., 15}, is shown in figure 4.1. Let $P_i(t)$ be the probability of the system being in the ith (i = 0, 1, 2, ..., 15) state at time t. $P_0(t)$, $P_1(t)$ and $P_2(t)$ denote the probabilities that both software and N operating components are functioning well at time t. Let $P_i(t)$ (i = 3, 4) be the probability that both software are working, all operating components have failed and the system is working in degraded mode, with the faster repair rate $\mu'_h$ adopted for hardware failure. $P_5(t)$, $P_6(t)$ and $P_7(t)$ are the probabilities that one software is in the failed state at time t and repair is rendered with the faster rate $\mu'_h$; $\lambda'_s$ is the failure rate for software when N operating components are working. $P_8(t)$ and $P_9(t)$ denote the probabilities that one software and all operating components have failed, while one software as well as spares are working; the repair is with the faster rate $\mu'_h$ for failed hardware components. $P_i(t)$ (i = 10, 11, 12, 13, 14) is the probability that at time t both software have failed and there is no repair process. $P_{15}(t)$ denotes the probability that at time t the entire system has failed.

Now the differential difference equations governing the model, developed with the help of the transition diagram, are given below:

$\frac{dP_0}{dt} = \mu_h P_1(t) + \mu_s P_5(t) - [2\lambda_s + 5\lambda + \lambda_P + 2\alpha] P_0(t)$ …(4.1)

$\frac{dP_1}{dt} = \mu_h P_2(t) + \mu_s P_6(t) + (5\lambda + 2\alpha) P_0(t) - [2\lambda_s + \lambda_P + \mu_h + 5\lambda + \alpha] P_1(t)$ …(4.2)

$\frac{dP_2}{dt} = \mu'_h P_3(t) + \mu_s P_7(t) + (5\lambda + \alpha) P_1(t) - [2\lambda_s + \lambda_P + \mu_h + 5\lambda] P_2(t)$ …(4.3)

$\frac{dP_3}{dt} = \mu'_h P_4(t) + \mu_s P_8(t) + 5\lambda P_2(t) - [2\lambda_s + \lambda_P + \mu'_h + 4\lambda_d] P_3(t)$ …(4.4)



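$\frac{dP_4}{dt} = \mu_s P_9(t) + 4\lambda_d P_3(t) - [2\lambda_s + \lambda_P + \mu'_h + 3\lambda_d] P_4(t)$ …(4.5)

$\frac{dP_5}{dt} = 2\lambda_s P_0(t) + \mu_h P_6(t) + \mu'_s P_{10}(t) - [\mu_s + \lambda'_s + \lambda_P + 5\lambda + 2\alpha] P_5(t)$ …(4.6)

$\frac{dP_6}{dt} = 2\lambda_s P_1(t) + \mu_h P_7(t) + \mu'_s P_{11}(t) + (5\lambda + 2\alpha) P_5(t) - [\mu_s + \lambda'_s + \lambda_P + \mu_h + 5\lambda + \alpha] P_6(t)$ …(4.7)

$\frac{dP_7}{dt} = 2\lambda_s P_2(t) + \mu'_h P_8(t) + \mu'_s P_{12}(t) + (5\lambda + \alpha) P_6(t) - [\mu_s + \lambda'_s + \lambda_P + \mu_h + 5\lambda] P_7(t)$ …(4.8)

$\frac{dP_8}{dt} = 2\lambda_s P_3(t) + \mu'_h P_9(t) + \mu'_s P_{13}(t) + 5\lambda P_7(t) - [\mu_s + \lambda'_s + \lambda_P + \mu'_h + 4\lambda_d] P_8(t)$ …(4.9)

$\frac{dP_9}{dt} = 2\lambda_s P_4(t) + \mu'_s P_{14}(t) + 4\lambda_d P_8(t) - [\mu_s + \lambda'_s + \lambda_P + \mu'_h + 3\lambda_d] P_9(t)$ …(4.10)

$\frac{dP_{10}}{dt} = \lambda'_s P_5(t) - [\mu'_s + \lambda_P] P_{10}(t)$ …(4.11)

$\frac{dP_{11}}{dt} = \lambda'_s P_6(t) - [\mu'_s + \lambda_P] P_{11}(t)$ …(4.12)

$\frac{dP_{12}}{dt} = \lambda'_s P_7(t) - [\mu'_s + \lambda_P] P_{12}(t)$ …(4.13)

$\frac{dP_{13}}{dt} = \lambda'_s P_8(t) - [\mu'_s + \lambda_P] P_{13}(t)$ …(4.14)

$\frac{dP_{14}}{dt} = \lambda'_s P_9(t) - [\mu'_s + \lambda_P] P_{14}(t)$ …(4.15)

$\frac{dP_{15}}{dt} = 3\lambda_d P_4(t) + 3\lambda_d P_9(t) + \lambda_P \sum_{i=0}^{14} P_i(t)$ …(4.16)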

In order to solve the above set of equations (4.1)-(4.16), we impose the initial condition $P_1(0) = 1$, $P_i(0) = 0$, $i \ne 1$. A numerical method based on the fourth order R-K scheme is used to obtain the transient probabilities $P_i(t)$, $0 \le i \le 15$.
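The transient solution described above can be sketched with a hand-written classical fourth order Runge-Kutta step. This is only an illustrative sketch for the case N=5, Y=2: it is not the MATLAB code used in the thesis, the names (`RATES`, `derivatives`, `rk4_step`, `integrate`) are hypothetical, the step size 0.001 is an assumption, and the parameter values are taken as read from the chapter's default list and should be treated as illustrative. The right-hand side is built as probability flows between states, which is intended to reproduce equations (4.1)-(4.16) term by term.

```python
# Illustrative rates (assumed values; key names are hypothetical).
RATES = {
    "lam": 0.09,      # hardware failure rate of an operating host
    "alpha": 0.3,     # failure rate of a spare host
    "lam_d": 0.06,    # degraded failure rate once the spares are exhausted
    "lam_s": 0.4,     # software failure rate while both software are working
    "lam_s2": 0.7,    # software failure rate when one software has failed
    "lam_p": 0.1,     # common cause failure rate
    "mu_h": 8.0, "mu_h_fast": 2.5,   # hardware repair rates (mu_h, mu_h')
    "mu_s": 3.0, "mu_s_fast": 2.9,   # software repair rates (mu_s, mu_s')
}


def derivatives(p, r):
    """Right-hand side of the 16-state system, written as flows src -> dst."""
    # Hardware failure rate out of hardware level n = 0..4.
    fail = [5 * r["lam"] + 2 * r["alpha"], 5 * r["lam"] + r["alpha"],
            5 * r["lam"], 4 * r["lam_d"], 3 * r["lam_d"]]
    # Hardware repair rate from level n back to level n - 1.
    repair = [0.0, r["mu_h"], r["mu_h"], r["mu_h_fast"], r["mu_h_fast"]]
    dp = [0.0] * 16

    def move(src, dst, rate):
        flow = rate * p[src]
        dp[src] -= flow
        dp[dst] += flow

    for n in range(5):
        for base in (0, 5):  # states 0-4: no software failed; 5-9: one failed
            move(base + n, base + n + 1 if n < 4 else 15, fail[n])
            if n > 0:
                move(base + n, base + n - 1, repair[n])
        move(n, 5 + n, 2 * r["lam_s"])        # first software failure
        move(5 + n, n, r["mu_s"])             # software repair
        move(5 + n, 10 + n, r["lam_s2"])      # second software failure
        move(10 + n, 5 + n, r["mu_s_fast"])   # faster software repair
    for i in range(15):
        move(i, 15, r["lam_p"])               # common cause failure
    return dp


def rk4_step(f, y, h, r):
    """One classical fourth order Runge-Kutta step of size h."""
    k1 = f(y, r)
    k2 = f([yi + 0.5 * h * ki for yi, ki in zip(y, k1)], r)
    k3 = f([yi + 0.5 * h * ki for yi, ki in zip(y, k2)], r)
    k4 = f([yi + h * ki for yi, ki in zip(y, k3)], r)
    return [yi + h / 6.0 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]


def integrate(p0, t_end, h, r):
    """Advance the state probabilities from t = 0 to t_end with fixed step h."""
    p = list(p0)
    for _ in range(round(t_end / h)):
        p = rk4_step(derivatives, p, h, r)
    return p


p_start = [0.0] * 16
p_start[1] = 1.0  # initial condition P_1(0) = 1
p_at_1 = integrate(p_start, 1.0, 0.001, RATES)
p_at_3 = integrate(p_start, 3.0, 0.001, RATES)
print("P_15(1) =", p_at_1[15], " P_15(3) =", p_at_3[15])
```

Because every transition is written as a flow out of one state and into another, the probabilities sum to one at every step up to rounding error, which gives a quick sanity check on the transcription of the equations.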



4.4 Performance Indices

Now we establish various indices by using probabilities obtained in previous

section as follows:

The reliability of the system is given by

$R(t) = \sum_{n=N+Y}^{K-3} P_{0,n}(t) + \sum_{n=N+Y}^{K-3} P_{1,n}(t)$ ...(4.17)

Expected number of operating hardware units in the system, when one software has failed

$EO_1(t) = \sum_{n=Y+1}^{K} (N - n - Y) P_{1,n}(t)$ …(4.18)

Expected number of operating hardware units in the system, when both software have failed

$EO_2(t) = \sum_{n=Y+1}^{K} (N - n - Y) P_{2,n}(t)$ …(4.19)

Expected number of spare hardware units in the system is obtained as

$ES(t) = \sum_{n=1}^{Y} (Y - n) \{ P_{0,n}(t) + P_{1,n}(t) + P_{2,n}(t) \}$ …(4.20)


4.5 Numerical Results

In this section, we perform a numerical experiment for the transient analysis by employing the Runge-Kutta technique (RKT) of fourth order to solve the system of differential equations. The R-K method is implemented by using MATLAB's 'ode45' function. A time span is taken with equal intervals. We set the default parameters as

$N = 5$, $Y = 2$, $m = 3$, $\lambda = 0.09$, $\lambda_d = 0.06$, $\lambda_S = 0.4$, $\lambda'_S = 0.7$, $\lambda_P = 0.1$, $\alpha = 0.3$, $\mu_h = 8$, $\mu_S = 3$, $\mu'_h = 2.5$, $\mu'_S = 2.9$.

The numerical results for different performance indices are summarized in tables 4.1(a)-4.1(c). These indices are the reliability, the expected number of operating units when one software has failed, the expected number of operating units when both software have failed and the expected number of spare hardware units in the system. Graphical presentations are given in figs 4.2(a)-4.2(c).



Tables 4.1(a)-4.1(c) display the numerical results to examine the effect of varying time t on the reliability R(t), the expected number of operating units when one and both software have failed, and the expected number of spare hardware units in the system for different values of $\lambda_P$, $\mu_S$ and $\alpha$, respectively. Table 4.1(a) reveals that R(t), EO1(t) and ES(t) decrease as time t increases, whereas EO2(t) changes only slightly at first and then tends to be constant with time t. From tables 4.1(b)-4.1(c), we observe that when the parameters $\mu_S$ and $\alpha$ increase, R(t), EO1(t) and ES(t) decrease but EO2(t) increases with time t.

From figs 4.2(a)-4.2(c), we see the variation of reliability with time for different values of $\lambda_P$, $\mu_S$ and $\alpha$, respectively. We notice that the reliability decreases as time increases, which is quite obvious. As $\lambda_P$ ($\mu_S$) increases, the reliability decreases (increases). The effect of the failure rate $\alpha$ of the spare host on reliability is not very significant.

4.6 Conclusion

In the present investigation, we have analyzed the performance of distributed

computer system by incorporating the common cause failure. The system can fail

either due to a software failure or due to a hardware failure. The Markov model developed for evaluating various performance measures such as reliability, expected number of operative hardware units in the system and expected number of spare units may be applied to many embedded systems including computer and communication systems, telecommunication, etc.



Fig. 4.1: State transition diagram



Table 4.1(a): Performance indices for different values of $\lambda_P$

        |       $\lambda_P = 0.1$        |       $\lambda_P = 0.5$        |       $\lambda_P = 0.9$
t       | R(t)  EO1(t)  EO2(t)  ES(t)  | R(t)  EO1(t)  EO2(t)  ES(t)  | R(t)  EO1(t)  EO2(t)  ES(t)
0.00    | 1.00  0.00    0.00    1.00   | 1.00  0.00    0.00    1.00   | 1.00  0.00    0.00    1.00
1.00    | 0.86  0.22    0.04    0.85   | 0.58  0.14    0.03    0.57   | 0.39  0.10    0.02    0.42
1.25    | 0.83  0.22    0.05    0.83   | 0.51  0.13    0.03    0.50   | 0.31  0.09    0.02    0.35
1.50    | 0.81  0.22    0.05    0.81   | 0.44  0.12    0.03    0.44   | 0.24  0.07    0.02    0.28
1.75    | 0.79  0.21    0.05    0.79   | 0.39  0.10    0.03    0.39   | 0.19  0.06    0.01    0.23
2.00    | 0.77  0.21    0.05    0.77   | 0.34  0.09    0.02    0.35   | 0.15  0.05    0.01    0.19
2.25    | 0.75  0.21    0.05    0.75   | 0.30  0.08    0.02    0.30   | 0.12  0.04    0.01    0.16
2.50    | 0.73  0.20    0.05    0.73   | 0.27  0.07    0.02    0.27   | 0.10  0.03    0.01    0.13
2.75    | 0.71  0.20    0.05    0.71   | 0.24  0.06    0.02    0.24   | 0.08  0.02    0.01    0.10
3.00    | 0.69  0.19    0.05    0.69   | 0.21  0.05    0.01    0.21   | 0.06  0.02    0.01    0.09



Table 4.1(b): Performance indices for different values of $\mu_S$

t | $\lambda_P = 0.1$ | $\lambda_P = 0.5$ | $\lambda_P = 0.9$




Table 4.1(c): Performance indices for different values of $\alpha$

t | $\alpha = 0.3$ | $\alpha = 0.6$ | $\alpha = 0.9$



Fig. 4.2(a): Reliability R(t) vs time t by varying $\lambda_P$ ($\lambda_P$ = 0.1, 0.5, 0.9)

Fig. 4.2(b): Reliability R(t) vs time t by varying $\mu_S$ ($\mu_S$ = 3, 6, 9)

Fig. 4.2(c): Reliability R(t) vs time t by varying $\alpha$ ($\alpha$ = 0.3, 0.6, 0.9)

Chapter-5

Transient Analysis of a Hardware-Software System

Section-5A Embedded Computer System with Two Types of Failure and Common Cause Failure

Section-5B Hardware and Software Systems with Warm Standbys and Switching Failures

Section-5A

Embedded Computer System with Two Types of Failure and Common Cause Failure

5A.1 Introduction

5A.2 Model Description

5A.3 The Analysis

5A.4 Illustration

5A.5 Performance Indices

5A.6 Numerical Results

5A.7 Conclusion

Redundancy of hardware components is generally required to

design highly reliable embedded computer systems. A common

form of redundancy is a K-out-of-N: G system in which at least K

out of N components must be good for the system to be good. This

investigation is concerned with a Markov model for K-out-of-N: G

system with common cause failure. The hardware system consists

of N non-identical components and Y warm standby components.

There is a single repairman who repairs the failed components on a

first-come-first-served basis. The developed probabilistic model

represents the redundant computer system with one software

component. The software/hardware system along with human

error and hardware error has been investigated in order to obtain the reliability indices under the assumption that each component may fail due to two types of failures (hardware and human) or common cause or software failure. Numerical results have been obtained with the help of the Runge-Kutta method of fourth order to validate the analytical results. The sensitivity of the system availability to the parameters has also been examined.

5A.1 Introduction

These days, embedded computers are used in day-to-day life as well as in different areas such as transportation, nuclear reactors, aircraft, hospital operation, etc. Reliability has become an important aspect of the planning, design and operation of all engineering systems. The aim of reliability analysis is to measure the probability that the designed equipment will perform its intended function in the hands of the customers. In the field of reliability, a large number of researchers have paid attention to the estimation as well as prediction of the reliability of a computer hardware or software system.



In various situations, systems are affected by environmental factors such as human errors or common cause failure. Human errors are important while predicting the reliability and safety measures of any engineering system. In real life, many faults are caused directly or indirectly by human errors such as wrong action, poor communication, wrong interpretation, poor handling, poor maintenance and operation procedure, etc. Further, common cause failure is also a key factor that should be incorporated to predict the system reliability in different frameworks. The common cause failure may occur due to equipment design deficiency, power supply, humidity, temperature, etc. An example of a common cause failure is a fire in a room where the redundant units are located. In this case the entire redundant system will fail, irrespective of whether one or more units were operating. Hardware failures occur due to flaws in the design and manufacturing processes, faulty operations, poor quality control, poor maintenance, etc. Hence a realistic reliability model must include the occurrence of human errors, hardware failure and common cause failure. The system reliability/availability can be quantified more accurately by the use of these concepts.

In recent years significant attention of researchers has been focused on

reliability issues by considering the common cause failure. Chari et al. (1991)

presented the reliability analysis in the presence of change common cause shock

failures. De-Almeida and Souza (1993) discussed the maintenance strategy for a 2-

unit redundant standby system. Subramanian and Anantharaman (1995) made the

reliability analysis of a complex standby redundant system. Rajamanickam and

Chandrasekar (1997) established the reliability measures for two unit systems with a

dependent structure for failure and repair times. Jain (1998) and Vaurio (1998, 2005)

provided the implicit method for incorporating common cause failure in two unit

system. Whittaker et al. (2000) developed a Markov chain model for predicting the

reliability of multi-build software. Kuo et al. (2001) suggested the framework for

modeling software reliability using various testing-effort and fault detection rates.

Yadavalli et al. (2002) analyzed the asymptotic confidence limits for the steady state

availability of a two unit parallel system with preparation time for the repair facility.

Ou and Bechta-Dugan (2003) did the approximate sensitivity analysis for acyclic

Markov reliability models. Azaron et al. (2005) studied the reliability function of a



class of time dependent systems with standby redundancy. Blokus (2006) presented

the reliability analysis of large systems with dependent components. Levitin (2007)

suggested a modification of the generalized reliability block diagram method for

evaluating reliability and performance distribution of complex multi-state series-

parallel system with uncovered failures. Hall and Mosleh (2008) analyzed the

framework for reliability growth of one-shot systems. El-damcese (2009) gave the

reliability and availability analysis of a k-out-of-(M+S): G warm standby system due

to common cause failure. Mahmoud and Moshref (2010) analyzed the hardware

failure, human error and preventive maintenance for a two unit cold standby system.

Xing et al. (2011) proposed exact combinatorial reliability analysis of dynamic systems with sequence-dependent failures.

It is common knowledge that redundancy can be used to increase the

reliability of a system without changing the reliability of the individual units that form

the system. k-out-of-n:G warm standby systems have found applications in various

fields including power plant, network design, redundant system testing, medical

diagnosis, etc. In a K-out-of-N: G system, K is the minimum number of components that must work for the whole system of N components to work.

Akhtar (1994), Amari et al. (2004) and Myers (2007) discussed the reliability of K-

out-of-N: G system with imperfect fault coverage. There are many factors such as

critical human error or high temperature of computer chips, etc. which could cause the

whole system to fail. El-Damcese (1997) gave the human error and common cause

failure modeling of a two unit multiple systems. Huang et al. (2000) studied the

generalized multi-state K-out-of-N: G systems. Dutuit and Rauzy (2001) and Smidt-Destombes et al. (2004) considered the assessment of K-out-of-N and related

systems. Arulmozhi (2002) presented the reliability of an M-out-of-N warm standby

system with R repair facilities. Zhang et al. (2006) obtained availability and

reliability of K-out-of (M+N): G warm standby systems. Lu and Lewis (2008) and

Chakravarthy and Gomez-Corral (2009) studied the configuration determination for

K-out-of-N partially redundant systems. Levitin and Amari (2010) established the

algorithm for evaluating the time-to failure distribution of k-out-of-n system with

shared standby elements. Ruiz-Castro and Li (2011) modeled a discrete k-out-n:G

system with multi state components by means of block-structure Markov chains.
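As a worked illustration of the K-out-of-N: G structure defined above, the sketch below computes the reliability of such a system in the simplest textbook setting. It assumes independent, identical components with a common working probability `p` and no common cause failure, standbys or repair, so it is far simpler than the Markov model developed in this section; the function name `k_out_of_n_reliability` and the numbers are hypothetical.

```python
from math import comb


def k_out_of_n_reliability(k, n, p):
    """Probability that at least k of n independent components work.

    Assumes every component works with the same probability p,
    independently of the others (binomial model).
    """
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    return sum(comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(k, n + 1))


# 2-out-of-3: G system whose components each work with probability 0.9.
r_2_of_3 = k_out_of_n_reliability(2, 3, 0.9)
# Limiting cases: 1-out-of-n is a parallel system, n-out-of-n is a series system.
r_parallel = k_out_of_n_reliability(1, 3, 0.9)
r_series = k_out_of_n_reliability(3, 3, 0.9)
print(r_2_of_3, r_parallel, r_series)
```

The two limiting cases show why redundancy helps: for the same components, lowering K relative to N raises the system reliability.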



In the present investigation, we develop a Markov model for the K-out-of-N:G system by incorporating the failures caused by human error and hardware problems for a multi-component system. The outline of the chapter is as follows. In section 5A.2, we describe the mathematical model along with the underlying Markov process. The transient analysis of the K-out-of-N: G system is presented in section 5A.3, and an illustration of the model is given in section 5A.4. In section 5A.5, several system performance measures are derived. Representative numerical results that bring out the quantitative nature of the model are discussed in section 5A.6. Finally, section 5A.7 is devoted to the concluding remarks.


5A.2 Model Description

We develop a Markov model for the multicomponent system, which is initially considered to be in a good state. The system or the components may fail due to hardware failure and human error. In addition, the system is subject to failure due to some common cause as well as due to software failure. The provision of warm standby hardware components is also taken into consideration. The following assumptions are made to formulate the model:

The system consists of M operating and Y warm standby hardware components.

The system functions successfully with at least K components.

The failed components are repaired in the order of their failure.

There is single repairman and he is always available to repair the failed components.

The life time and repair time of the hardware components are exponentially distributed.

The switching time from standby to operating component is assumed to be negligible.

The system may also fail due to common cause failure or software failure according to exponential distribution.

Notations

N : Total number of hardware components in the system, i.e. N = M + Y.

$\lambda$ : Failure rate of an operating hardware component due to hardware failure.

$\lambda'$ : Failure rate of a standby hardware component due to hardware failure.

$\lambda_h$ : Failure rate of an operating hardware component due to human failure.

$\lambda_h'$ : Failure rate of a standby hardware component due to human failure.

$\lambda_C$ : Failure rate of an operating hardware component due to common cause failure.

$\lambda_S$ : Failure rate of an operating hardware component due to software failure.

$\mu$ : Repair rate of a component failed due to hardware faults when at least one standby is available.

$\mu_h$ : Repair rate of a failed component due to human failure.

$\mu_C$ : Repair rate of a failed component due to common cause.

$\mu_S$ : Repair rate of a component failed due to software failure.

$P_{(0,0)}(t)$ : Probability that there is no failed component at time t.

$P_{(i,j)}(t)$ : Probability that i $(0 \le i \le N)$ and j $(0 \le j \le N)$ components, where $1 \le i + j \le N$, have failed due to hardware failure and human failure, respectively, at time t.

$P_{(Sf)}(t)$ : Probability that the system fails due to software failure at time t.

$P_{(Cf)}(t)$ : Probability that the system fails due to common cause at time t.
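As a rough check on the notation above, the state space can be enumerated in a few lines of Python. This is only a sketch: the function name `enumerate_states` and the labels `"Sf"` and `"Cf"` are illustrative choices, not part of the model description. It lists every pair (i, j) with 0 ≤ i + j ≤ N together with the two system-failure states.

```python
def enumerate_states(n_components):
    """List the states (i, j) with i + j <= N, plus the two failure states.

    i = components failed due to hardware failure,
    j = components failed due to human failure.
    'Sf' and 'Cf' are illustrative labels for software failure and
    common cause failure of the whole system.
    """
    states = [(i, j)
              for i in range(n_components + 1)
              for j in range(n_components + 1 - i)]
    return states + ["Sf", "Cf"]


states = enumerate_states(5)   # N = M + Y = 5
print(len(states))             # component states plus the 2 failure states
```

For N = 5 the count equals the number of equations (5A.11)-(5A.33) written out for the 2-out-of-5: G illustration.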



Fig. 5A.1: State transition diagram

5A.3 The Analysis

Using the appropriate in-flow and out-flow rates shown in the transition diagram (see fig. 5A.1), we construct the differential-difference equations governing the model as follows:



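$$\frac{dP_{(0,0)}(t)}{dt} = -\left[(M\lambda + Y\lambda') + (M\lambda_h + Y\lambda_h') + \lambda_C + \lambda_S\right] P_{(0,0)}(t) + \mu P_{(1,0)}(t) + \mu_h P_{(0,1)}(t) + \mu_S P_{(Sf)}(t) + \mu_C P_{(Cf)}(t) \tag{5A.1}$$

$$\frac{dP_{(i,0)}(t)}{dt} = -\left[\{M\lambda + (Y-i)\lambda'\} + \{M\lambda_h + (Y-i)\lambda_h'\} + \lambda_C + \lambda_S + \mu\right] P_{(i,0)}(t) + \left[M\lambda + (Y-i+1)\lambda'\right] P_{(i-1,0)}(t) + \mu P_{(i+1,0)}(t) + \mu_h P_{(i,1)}(t) + \mu_S P_{(Sf)}(t) + \mu_C P_{(Cf)}(t), \quad 1 \le i \le Y \tag{5A.2}$$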

( ) ( ) ( )[ ] ( )( ) ( )( )

( )( ) ( )( ) (5A.3)...1Ni1Y(t),Pμ(t)PμtμPtPμ

tMλλtPμλλλiYMλiYMdt

(t)dP

(Cf)C(Sf)S1,0ii,1h

1,0ii,0CShi,0

−≤≤+++++

++++−++−+−=

+

−

$$\frac{dP_{(N,0)}(t)}{dt} = -\mu P_{(N,0)}(t) + \lambda P_{(N-1,0)}(t) \tag{5A.4}$$

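$$\frac{dP_{(0,j)}(t)}{dt} = -\left[\{M\lambda + (Y-j)\lambda'\} + \{M\lambda_h + (Y-j)\lambda_h'\} + \lambda_C + \lambda_S + \mu_h\right] P_{(0,j)}(t) + \left[M\lambda_h + (Y-j+1)\lambda_h'\right] P_{(0,j-1)}(t) + \mu P_{(1,j)}(t) + \mu_h P_{(0,j+1)}(t) + \mu_S P_{(Sf)}(t) + \mu_C P_{(Cf)}(t), \quad 1 \le j \le Y \tag{5A.5}$$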

( ) ( ) ( )[ ] ( )( ) ( )( )

( )( ) ( )( ) (5A.6)...1Nj1Y,(t)Pμ(t)PμtPtP

tPMtPλλjYMjYMdt

)t(dP

(Cf)C(Sf)Sj,11j,0h

1j,0hj,0hSChj,0

−≤≤+++µ+µ+

λ+µ+µ+++λ−++λ−+−=

+

−

$$\frac{dP_{(0,N)}(t)}{dt} = -\mu_h P_{(0,N)}(t) + \lambda_h P_{(0,N-1)}(t) \tag{5A.7}$$

( ) { } { }[ ] ( )( )

( )( ) ( )( ) ( )( ) ( )( ) ( )( ) ( )( ))8.A5...(1Nji1Y,0j,i

,tPtPtPtPtPMtM

tPjiYMjiYMdt

(t)dP

CfCSfS1j,ih0,1i1j,ihj,1i

j,ihSChji,

−≤+≤+≠

µ+µ+µ+µ+′+λλ+

µ+µ+λ+λ+λ−−++λ−−+−=

++−−



( ) ( ){ } ( ){ }[ ] ( )( )

( ){ } ( ) ( ){ } ( )

( )( ) ( )( )...(5A.9) Yji20,ji,,

(t)Pμ(t)PμtPμtμP

)t(PjiYM)t(P1jiYM

tPμμλλλjiYMλλjiYMλdt

(t)dP

(Cf)C(Sf)S1ji,h1,1i

1j,i'hhj,1ih

ji,hSChhji,

≤+≤≠

++++

λ−−+λ+λ+−−+λ+

++++′−−++′−−+−=

++

−−

( ) ( ) ( )( ) ( )( ) ( )( )

)10.A5...(Nji0,ji,(t),Pμ(t)Pμ

tPλtλPtPλλμμdt

(t)dP

(Cf)C(Sf)S

1ji,hj1,iji,SChji,

=+≠++

+++++= −−

5A.4 Illustration

In this section, we present a 2-out-of-5: G system for illustration; the coefficients of equation (5A.11) correspond to M = 2 operating and Y = 3 warm standby components. The differential-difference equations associated with the system states are as follows:

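$$\frac{dP_{(0,0)}(t)}{dt} = -\left[(2\lambda + 3\lambda') + (2\lambda_h + 3\lambda_h') + \lambda_S + \lambda_C\right] P_{(0,0)}(t) + \mu P_{(1,0)}(t) + \mu_h P_{(0,1)}(t) + \mu_S P_{(Sf)}(t) + \mu_C P_{(Cf)}(t) \tag{5A.11}$$

$$\frac{dP_{(1,0)}(t)}{dt} = -\left[(2\lambda + 2\lambda') + (2\lambda_h + 2\lambda_h') + \lambda_S + \lambda_C + \mu\right] P_{(1,0)}(t) + (2\lambda + 3\lambda') P_{(0,0)}(t) + \mu P_{(2,0)}(t) + \mu_h P_{(1,1)}(t) + \mu_S P_{(Sf)}(t) + \mu_C P_{(Cf)}(t) \tag{5A.12}$$

$$\frac{dP_{(2,0)}(t)}{dt} = -\left[(2\lambda + \lambda') + (2\lambda_h + \lambda_h') + \lambda_S + \lambda_C + \mu\right] P_{(2,0)}(t) + (2\lambda + 2\lambda') P_{(1,0)}(t) + \mu P_{(3,0)}(t) + \mu_h P_{(2,1)}(t) + \mu_S P_{(Sf)}(t) + \mu_C P_{(Cf)}(t) \tag{5A.13}$$

$$\frac{dP_{(3,0)}(t)}{dt} = -\left[2\lambda + 2\lambda_h + \lambda_S + \lambda_C + \mu\right] P_{(3,0)}(t) + (2\lambda + \lambda') P_{(2,0)}(t) + \mu P_{(4,0)}(t) + \mu_h P_{(3,1)}(t) + \mu_S P_{(Sf)}(t) + \mu_C P_{(Cf)}(t) \tag{5A.14}$$

$$\frac{dP_{(4,0)}(t)}{dt} = -\left[\lambda + \lambda_h + \lambda_S + \lambda_C + \mu\right] P_{(4,0)}(t) + 2\lambda P_{(3,0)}(t) + \mu P_{(5,0)}(t) + \mu_h P_{(4,1)}(t) + \mu_S P_{(Sf)}(t) + \mu_C P_{(Cf)}(t) \tag{5A.15}$$

$$\frac{dP_{(5,0)}(t)}{dt} = -\mu P_{(5,0)}(t) + \lambda P_{(4,0)}(t) \tag{5A.16}$$

$$\frac{dP_{(0,1)}(t)}{dt} = -\left[(2\lambda + 2\lambda') + (2\lambda_h + 2\lambda_h') + \lambda_S + \lambda_C + \mu_h\right] P_{(0,1)}(t) + (2\lambda_h + 3\lambda_h') P_{(0,0)}(t) + \mu P_{(1,1)}(t) + \mu_h P_{(0,2)}(t) + \mu_S P_{(Sf)}(t) + \mu_C P_{(Cf)}(t) \tag{5A.17}$$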
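The hardware failure coefficients that feed the states (1,0) to (5,0) in equations (5A.12)-(5A.16) can be reproduced from the state (i, 0) alone. The sketch below is an illustration under stated assumptions, not the code used for the thesis: the helper name `hardware_failure_rate` is hypothetical, and it assumes that failed units are replaced from the warm standbys first, so that all M units keep operating until the Y standbys are exhausted.

```python
def hardware_failure_rate(i, m, y, lam, lam_standby):
    """Total hardware failure rate out of state (i, 0).

    Hypothetical helper. Assumes standbys replace failed units at once,
    so min(M, M + Y - i) units operate and max(Y - i, 0) remain on standby.
    """
    operating = min(m, m + y - i)
    standby = max(y - i, 0)
    return operating * lam + standby * lam_standby


M, Y = 2, 3                 # the 2-out-of-5: G illustration
LAM, LAM_SB = 0.1, 0.01     # arbitrary example values for lambda and lambda'
for i in range(M + Y):
    rate = hardware_failure_rate(i, M, Y, LAM, LAM_SB)
    print(f"({i},0) -> ({i + 1},0): {rate:.2f}")
```

Symbolically the five rates are 2λ + 3λ′, 2λ + 2λ′, 2λ + λ′, 2λ and λ, which are the factors multiplying the probabilities of the states (0,0) to (4,0) on the right-hand sides of (5A.12)-(5A.16).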



( )( ) ( ) ( )[ ] ( )( ) ( ) ( )

( ) ( ) ( )( ) ( )( ) (5A.18)...(t)Pμ(t)PμtPμtμPtPλ2λ

(t)Pλ22λtPμμλλλ2λλ2λdt

tdP

(Cf)c(Sf)S1,2h2,1(1,0)hh

0,1,11,hCShh1,1

++++′++

′++++++′++′+−=

( )( ) [ ] ( )( ) ( ) ( ) ( ) ( )

( )( ) ( )( ) (5A.19) (t)Pμ(t)PμtPμtμP

tPλ2λ(t)Pλ2λtPλλμμ2λ2λdt

tdP

(Cf)c(Sf)S2,2h3,1

(2,0)hh1,1,12,CShh2,1

…++++

′++′+++++++−=

( )( ) [ ] ( )( ) ( ) ( ) ( )( )

( ) (t)Pμ(t)PμPμ

tμPtP2λ(t)P2λtPμμλλλλdt

tdP

(Cf)c(Sf)S3,2h

4,1(3,0)h2,13,1hCSh3,1

+++

++++++++−= …(5A.20)

$$\frac{dP_{(4,1)}(t)}{dt} = -(\mu + \mu_h) P_{(4,1)}(t) + \lambda P_{(3,1)}(t) + \lambda_h P_{(4,0)}(t) \tag{5A.21}$$

( )( ) ( ) ( )[ ] ( )( ) ( ) ( ) ( )( )

( ) ...(5A.22) (t)Pμ(t)PμPμ

tμP(t)Pλ22λtPμλλ2λλ2λdt

tdP

(Cf)c(Sf)S0,3h

1,2,0,1hh0,2hCShh0,2

+++

+′+++++λ′++′+−=

( )( ) [ ] ( )( ) ( ) ( ) ( ) ( )

( )( ) ( )( ) (5A.23) (t)Pμ(t)PμtPμtμP

tPλ2λ(t)Pλ2λtPλλμμ2λ2λdt

tdP

(Cf)c(Sf)S1,3h2,2

(1,1)hh0,21,2CShh1,2

…++++

′++′+++++++−=

( )( ) [ ] ( )( ) ( ) ( ) ( )( )

( )( ) (5A.24)... (t)Pμ(t)PμtPμ

tμPtP2λ(t)P2λtPμμλλλλdt

tdP

(Cf)c(Sf)S2,3h

3,2(2,1)h1,22,2hCSh2,2

+++

++++++++−=


$$\frac{dP_{(3,2)}(t)}{dt} = -(\mu + \mu_h) P_{(3,2)}(t) + \lambda P_{(2,2)}(t) + \lambda_h P_{(3,1)}(t) \tag{5A.25}$$

( )( ) [ ] ( )( ) ( ) ( ) ( )( )

( ) (t)Pμ(t)PμPμ

tμP(t)Pλ2λtPμλλ2λ2λdt

tdP

(Cf)c(Sf)S0,4h

1,30,2hh0,3hCSh0,3

+++

+′++++++−= …(5A.26)

( )( ) [ ] ( )( ) ( ) ( )

( )( ) ( )( ) (t)Pμ(t)PμtPμtμP

tP2λ(t)P2tPλλμμλλdt

tdP

(Cf)c(Sf)S1,4h2,2

(1,2)h0,31,3CShh1,3

++++

+λ++++++−= …(5A.27)


$$\frac{dP_{(2,3)}(t)}{dt} = -(\mu + \mu_h) P_{(2,3)}(t) + \lambda P_{(1,3)}(t) + \lambda_h P_{(2,2)}(t) \tag{5A.28}$$

( )( ) [ ] ( )( ) ( ) ( )( ) ( )( )

(t)Pμ(t)Pμ

tPμtμPtP2λtPμλλdt

tdP

)(Cc(Sf)S

0,5h1,4(0,3)h0,4CShh0,4

f++

+++λ+λ+++−= …(5A.29)



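$$\frac{dP_{(1,4)}(t)}{dt} = -(\mu + \mu_h + \lambda_S + \lambda_C) P_{(1,4)}(t) + \lambda P_{(0,4)}(t) + \lambda_h P_{(1,3)}(t) + \mu_S P_{(Sf)}(t) + \mu_C P_{(Cf)}(t) \tag{5A.30}$$

$$\frac{dP_{(0,5)}(t)}{dt} = -\mu_h P_{(0,5)}(t) + \lambda_h P_{(0,4)}(t) \tag{5A.31}$$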

( )( )

)32.A5...()t(p)t(p)t(p)t(p

)t(p)t(p)t(p(t)Pμdt

tdP

1

0i)3,0(C

2

0i)2,0(C

3

0i)1,0(C

4

0i)0,i(C

1

0i)3,0(C

2

0i)2,0(C

3

0i)1,0(C

4

0i(i,0)C

fC,

∑∑∑∑

∑∑∑∑

====

====

λ+λ+λ+λ+

µ−µ−µ−−=

( )( )

)33.A5...()t(p)t(p)t(p)t(p

)t(p)t(p)t(p(t)Pμdt

tdP

1

0i)3,0(S

2

0i)2,0(S

3

0i)1,0(S

4

0i)0,i(S

1

0i)3,0(S

2

0i)2,0(S

3

0i)1,0(S

4

0i(i,0)S

fS,

∑∑∑∑

∑∑∑∑

====

====

λ+λ+λ+λ+

µ−µ−µ−−=

In order to solve the above set of equations (5A.11)-(5A.33), we impose the initial conditions $P_{(0,0)}(0) = 1$ and $P_{(i,j)}(0) = 0$ for $(i,j) \ne (0,0)$. A numerical method based on the fourth-order Runge-Kutta (R-K) technique is used to obtain the transient probabilities.
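To make the numerical step concrete, the following sketch applies the classical fourth-order Runge-Kutta scheme to a two-state toy chain (one "up" state, one "down" state) whose transient solution is known in closed form. It is not the 2-out-of-5: G model itself; the function names, the rates and the step count are assumptions chosen only for illustration.

```python
import math


def rk4_step(f, t, y, h):
    """One classical fourth-order Runge-Kutta step for y' = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + h, [yi + h * ki for yi, ki in zip(y, k3)])
    return [yi + h / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]


def two_state_rhs(lam, mu):
    """Kolmogorov equations of a toy chain: state 0 = up, state 1 = down."""
    def rhs(_t, p):
        return [-lam * p[0] + mu * p[1],
                lam * p[0] - mu * p[1]]
    return rhs


def solve(f, p0, t_end, steps):
    """Integrate from t = 0 to t_end with a fixed step size."""
    h = t_end / steps
    t, p = 0.0, list(p0)
    for _ in range(steps):
        p = rk4_step(f, t, p, h)
        t += h
    return p


lam, mu = 0.1, 0.45                      # illustrative rates only
p = solve(two_state_rhs(lam, mu), [1.0, 0.0], t_end=5.0, steps=500)
exact = mu / (lam + mu) + lam / (lam + mu) * math.exp(-(lam + mu) * 5.0)
print(p[0], exact)
```

The same `rk4_step` routine applies unchanged to a larger system once the right-hand side function returns one derivative per state.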

5A.5 Performance Indices

We obtain the performance indices by using probabilities obtained in previous

section as follows:

Expected number of failed components at time t due to hardware error is

$$E\{N_{hard}(t)\} = \sum_{i=1}^{N} \sum_{j=0}^{N-i} i\, P_{(i,j)}(t) \tag{5A.34}$$

Expected number of failed components at time t due to human error is

$$E\{N_{human}(t)\} = \sum_{j=1}^{N} \sum_{i=0}^{N-j} j\, P_{(i,j)}(t) \tag{5A.35}$$

Expected number of working components in the system at time t is

$$E\{N_{working}(t)\} = \sum_{i+j=Y+1}^{N} \{M - (i + j - Y)\}\, P_{(i,j)}(t) \tag{5A.36}$$

Expected number of standby components in the system at time t is

$$E\{N_{standby}(t)\} = \sum_{i+j=0}^{Y} (Y - i - j)\, P_{(i,j)}(t) \tag{5A.37}$$

Component availability at time t is

$$A_{comp}(t) = 1 - \frac{E\{N_{human}(t)\} + E\{N_{hard}(t)\}}{N} \tag{5A.38}$$

System unavailability at time t is

$$UA_{system}(t) = 1 - A_{comp}(t) \tag{5A.39}$$
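Once the transient probabilities are available, the indices (5A.34)-(5A.39) reduce to weighted sums. The sketch below evaluates them as written above; the function name, the dictionary layout `{(i, j): probability}` and the sample probabilities are assumptions made for illustration, not data from the thesis.

```python
def performance_indices(prob, m, y):
    """Evaluate (5A.34)-(5A.39) from a dict {(i, j): P_(i,j)(t)}."""
    n = m + y
    e_hard = sum(i * p for (i, j), p in prob.items())
    e_human = sum(j * p for (i, j), p in prob.items())
    # (5A.36): only states with Y + 1 <= i + j <= N contribute
    e_working = sum((m + y - i - j) * p
                    for (i, j), p in prob.items() if y + 1 <= i + j <= n)
    # (5A.37): only states with i + j <= Y contribute
    e_standby = sum((y - i - j) * p
                    for (i, j), p in prob.items() if i + j <= y)
    a_comp = 1.0 - (e_human + e_hard) / n
    return {"E_hard": e_hard, "E_human": e_human, "E_working": e_working,
            "E_standby": e_standby, "A_comp": a_comp,
            "UA_system": 1.0 - a_comp}


# Made-up probabilities at one time instant, for illustration only.
prob = {(0, 0): 0.7, (1, 0): 0.15, (0, 1): 0.1, (2, 2): 0.05}
indices = performance_indices(prob, m=2, y=3)
print(indices)
```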

5A.6 Numerical Results

In this section, numerical results for various performance indices are provided. We present a sensitivity analysis to illustrate how the system is affected by varying the failure rates, repair rates and other parameters. The system of differential equations is solved by the fourth-order Runge-Kutta (RK) technique, implemented using MATLAB's 'ode45' function. The time span is divided into equal intervals. The numerical results are displayed in tables 5A.1(a)-5A.1(b). The reliability R(t) is plotted in figs 5A.2(a)-5A.2(d) for varying parameters, with the default parameters chosen as follows:

$$\lambda = 0.1,\ \lambda' = 0.01,\ \lambda_h = 0.01,\ \lambda_h' = 0.15,\ \lambda_S = 0.14,\ \lambda_C = 0.29,\ \mu = 0.001,\ \mu_h = 0.002,\ \mu_S = 0.001,\ \mu_C = 0.001.$$

From table 5A.1(a) we notice the patterns of various performance indices, namely $E\{N_{hard}(t)\}$, $E\{N_{human}(t)\}$, $E\{N_{working}(t)\}$ and $E\{N_{standby}(t)\}$, obtained by varying the repair rates. It is observed that there is a decreasing trend in the values of $E\{N_{hard}(t)\}$, $E\{N_{human}(t)\}$, $E\{N_{working}(t)\}$ and $E\{N_{standby}(t)\}$ with increasing values of $\mu$, $\mu_h$, $\mu_C$ and $\mu_S$. In table 5A.1(b), we demonstrate the system availability for different values of the failure rates at the fixed time t = 5.

In figs 5A.2(a)-5A.2(d), we show the variation of reliability with time for different values of $\lambda$, $\lambda'$, $\lambda_S$ and $\lambda_C$, respectively. Fig. 5A.2(a) reveals the behavior of reliability with respect to time t and failure rate $\lambda$. It is found that the reliability decreases sharply for the initial values of t but shows a smooth decreasing pattern for larger values of t. Next, in figs 5A.2(b)-5A.2(d), we illustrate the behavior of reliability R(t) with respect to time t by varying the parameters $\lambda'$, $\lambda_S$ and $\lambda_C$, respectively. It is noticed that as the values of the failure rates ($\lambda'$, $\lambda_S$ and $\lambda_C$) increase, the reliability decreases in each case.

5A.7 Conclusion

A reliability model that ignores human error and common cause failure may not depict a realistic picture of the actual reliability/availability of a system. Therefore, reliability modeling of real-time systems must include the occurrence of common cause failures, hardware errors and human errors. A K-out-of-N: G system with warm standby components is studied in this chapter. The transient availability and other performance indices obtained may be helpful in improving the system availability, in particular when common cause failures and human errors are involved. The present study offers system designers and developers insight for producing more reliable embedded computer systems by correctly gauging fault generation.



Table 5A.1(a): Performance indices for different values of (μ, μ_h) and (μ_C, μ_S)

μ      μ_h    μ_C    μ_S    t    E{N_hard(t)}   E{N_human(t)}   E{N_work(t)}   E{N_stand(t)}
0.45   0.4    0.5    0.5    0    0              0               1              1
                            2    0.19999        0.134955        0.094451       0.006999
                            4    0.070383       0.043826        0.030845       0.000878
                            6    0.037462       0.021731        0.015971       0.00041
                            8    0.020852       0.011232        0.008658       0.000211
0.9    0.8    0.5    0.5    0    0              0               1              1
                            2    0.227216       0.151455        0.115752       0.019112
                            4    0.063309       0.037373        0.029252       0.002943
                            6    0.022243       0.011735        0.009802       0.000886
                            8    0.008314       0.003991        0.003536       0.000302
0.45   0.4    0.7    0.9    0    0              0               1              1
                            2    0.071508       0.047969        0.033341       0.002169
                            4    0.020423       0.012539        0.00882        0.000192
                            6    0.009112       0.005135        0.003813       0.000076
                            8    0.004276       0.002188        0.001731       0.000032

Table 5A.1(b): System availability A(t) for different values of (λ_h, λ_h′)

λ      λ′     λ_S    λ_C    t    A(t) for (λ_h = 0.3, λ_h′ = 0.4)   A(t) for (λ_h = 0.9, λ_h′ = 0.8)
0.4    0.5    0.5    0.5    5    0.988739                           0.977129
0.8    0.7    0.5    0.5    5    0.981654                           0.969133
0.4    0.5    0.9    0.9    5    0.997062                           1.000000



Fig. 5A.2(a): Reliability R(t) vs time t by varying λ (curves for λ = 0.1, 0.5, 0.9)

Fig. 5A.2(b): Reliability R(t) vs time t by varying λ′ (curves for λ′ = 0.5, 1.0, 1.5)

Fig. 5A.2(c): Reliability R(t) vs time t by varying λ_S (curves for λ_S = 0.20, 0.60, 1.0)

Fig. 5A.2(d): Reliability R(t) vs time t by varying λ_C (curves for λ_C = 0.30, 0.60, 0.90)

Hardware and Software Systems with Warm Standbys and Switching Failures

5B.1 Introduction

5B.2 Model Description

5B.3 Governing Equations

5B.4 Special Case

5B.5 Performance Measures

5B.6 Numerical Results

5B.7 Conclusion

Section-5B


Redundancy, or the provision of standbys, plays an important role in improving the reliability of engineering systems. This investigation deals with the reliability and sensitivity analysis of a repairable system (consisting of hardware and software components) with warm standbys and switching failures. Failure and repair times of the components are assumed to follow exponential distributions. Using the Markov property, a transient state model is developed to establish the system reliability and other performance measures. Numerical techniques based on the Runge-Kutta method and the matrix method are used. A numerical example is provided to illustrate the tractability of the proposed method.

5B.1 Introduction

It is evident that in most 'real world' multi-component machining systems, a failed component is intended to be repaired rather than replaced. In this chapter, we develop a Markovian model for a system consisting of primary and warm standby components which are considered to be repairable. Subramanian and Venkatakrishan (1975) investigated the reliability of a 2-unit standby redundant system with repair, maintenance and standby failure. Goel and

Shrivastava (1991) analyzed the profit of a two unit redundant system with test and

correlated failures and repairs. Hsieh and Wang (1995) computed reliability of a

repairable system with spares and a removable repairman. Yang and Xie (2000)

studied the operational and testing reliability in software reliability model. The

reliability and sensitivity analysis of a multi-component system with warm standbys

and a repairable service station was done by Wang et al. (2004). They assumed that

the life time and repair time of the units are exponentially distributed and the failed

units are repaired on an FCFS basis. Hsu et al. (2009) studied a repairable system with imperfect coverage and reboot with the help of Bayesian and asymptotic estimation. The two-unit repairable system was considered with different types of prior assumptions for the unknown parameters, in which the coverage factor for an operating unit failure is possible. Bieth et al. (2010) studied a standby system with two repair persons under arbitrary life and repair times. Zheng et al. (2011) established the well-posedness and stability of a repairable system with N failure modes and one standby unit.

The demand for products with more and more functionality has increased due to intense industrial competition and advances in embedded hardware and software technologies. To gain and maintain a competitive advantage, system designers require a high level of reliability from both hardware and software systems. A high level of reliability and availability is often an essential requirement for embedded systems. Lai et al. (2002) analyzed the availability of distributed software/hardware systems. Guo and Yang (2007) discussed simple reliability block diagram methods for safety integrity verification. Kornecki and Zalewski (2010) studied hardware certification for real-time safety-critical systems.

The standby redundant repairable systems have been studied extensively in

the past by Osaki and Nakagawa (1976), Kumar and Agarwal (1980) and many

others. A detailed bibliography on redundant repairable systems can be found in

Yearout et al. (1986). The reliability prediction of a system with redundancy plays an increasingly important role in power systems, manufacturing systems, industrial systems, etc. A warm standby component, which has a lower failure rate than the operating units, is recommended due to economic constraints. In many real-time systems, the standby might not be able to switch over successfully to act as a primary unit, and it might also need a longer warm-up time. Wang and Kuo (2000) gave a cost

and probabilistic analysis of series systems having mixed standby components. Jain

and Baghel (2001) and Bhuyan and Sarmah (2002) estimated reliability of a

repairable standby redundant system. Chandrasekhar et al. (2004) studied the two

unit standby system and obtained exact confidence limits for the availability of the

system, when the time-to-failure of an operative unit is constant and the time-to-repair

of the failed unit is governed by a two-stage Erlangian distribution. Wang et al. (2005) presented a cost-benefit analysis of series systems with warm standby components and general repair times. Wang and Liang (2006) performed a cost-benefit analysis of systems with warm standby units and imperfect coverage. For this system they assumed that the time-to-failure is exponentially distributed and the time-to-repair follows a general distribution. Xu et al. (2005) established the asymptotic stability of a repairable system with an imperfect switching mechanism. Jain et al. (2007) gave a transient analysis of an M/M/R machining system with mixed standbys, switching failures, balking, reneging and additional removable repairmen. Zhang et al. (2006) considered the availability and reliability of k-out-of-(M+N): G warm standby systems. El-Damcese (2009) analyzed the k-out-of-(M+S): G warm standby system with time-varying failure and repair rates in the presence of common cause failure. Yang and Meng (2011) discussed the reliability analysis of a warm standby repairable system with priority in use.

Most studies on system reliability assume that the switchover from warm standby units to primary units is always perfect and that there are no failures during switching. However, as stated above, in real-time practice there is always a possibility of failure during the switching from standby state to operating state. Chow (1971) studied an imperfect switching system which contains two identical

components. Alidrisi (1992) gave the recursive formula for the reliability of a

dynamic warm standby n components redundant system with imperfect switching and

constant failure rate. Pan (1997) predicted the reliability of imperfect switching

system subject to multiple stresses. Wang et al. (2006a) compared the availability and

reliability analysis for different system configurations with warm standby components

and standby switching failures; the repair time and the failure time for each of the

primary and warm standby components are assumed to follow the negative

exponential distribution. Ke et al. (2007) discussed the reliability and sensitivity analysis of a system with primary units, warm standby units, unreliable service stations and standby switching failures. They considered the time-to-failure, time-to-repair, breakdown time and service time of the failed units to be exponentially distributed. Hsu et al. (2008) investigated a redundant repairable system with switching failure using a Bayesian approach. Wang and Chen (2009) provided a comparative analysis of availability between three systems with general repair times, reboot delay and switching failures. In these systems, they considered the time-to-failure and the time-to-repair of the primary and standby units to be exponentially and generally distributed, respectively. Recently, Yun and Cha (2010) and Hsu et al. (2011b) considered a general warm standby system with switching time of the standby unit.

It is recognized that Markovian models are natural and suitable for many real-time multi-component systems. It is also of vital importance to perform the reliability and sensitivity analysis of a repairable system with warm standby units and switching failures. The present investigation is concerned with a Markovian model for a system which works with both hardware and software components, with a provision of warm standby hardware units which are likely to have switching failures when used. The rest of the chapter is arranged as follows. Section 5B.2 provides the mathematical description of the Markov model along with assumptions and notations. In section 5B.3, we construct the governing equations of the model. Section 5B.4 contains a special case as an illustration. Section 5B.5 is devoted to the performance measures of the system. The solution approach based on the Laplace transform and matrix method is also given. Numerical results and sensitivity analysis, obtained with the help of the R-K and matrix methods, are given in section 5B.6. The chapter is concluded in section 5B.7, which summarizes the work done and highlights the important features of our investigation.


5B.2 Model Description

We consider the transient analysis of an embedded hardware and software system. There is a provision of warm standbys to replace failed hardware units and cold standbys to replace failed software units. All the hardware and software units are subject to failure and repair. The standby units and their associated switching mechanisms are also subject to failure. The failure rate of a software standby component is zero. The state transition diagram of the system is depicted in fig. 5B.1. The assumptions and notations used to describe the system are as follows:

Assumptions:

The system consists of M operating and S warm standby units for hardware components.

Upon failure of an operating hardware unit, an available warm standby unit becomes operative instantaneously, and the failed unit is repaired immediately if a repairman is free; otherwise it waits in the queue for repair.

Once a standby unit replaces a failed operating hardware unit, its failure characteristics become the same as those of an operating unit.

The repair crew has R repairmen to repair failed hardware components. Each repairman can repair only one failed unit at a time; the repair discipline is first come first served (FCFS).

When the repair of a failed unit is completed, it is as good as new.

The switching from standby state to operating state may fail with probability q.

The switchover times from failure to repair, from repair to standby and from standby to operating states are negligible.

When the hardware components as well as the software components fail, the system becomes non-repairable.

The states of all components are mutually independent.

Notations:

M : The number of operating hardware/software units in the system.

S : The number of hardware warm standby units in the system.

$\lambda_h$ : Failure rate of hardware operating units.

$\lambda_S$ : Failure rate of software operating units.

$\alpha$ : Failure rate of standby hardware units.

q : Failure probability of switching of hardware standby units.

$\mu_h$ ($\mu_S$) : Repair rate of permanent repairmen when repairing failed hardware (software) units.

$P_{(i,j)}(t)$ : Probability that, at time t, i and j failed units are present in the system which failed due to hardware and software faults, respectively.

5B.3 Governing Equations

With the help of state dependent failure rates and repair rates, we construct the

Chapman-Kolmogorov equations governing the model as follows (See transition

diagram):



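$$\frac{dP_{(0,0)}(t)}{dt} = -\left[M\lambda_h + S\alpha + M\lambda_S\right] P_{(0,0)}(t) + \mu_h P_{(1,0)}(t) + \mu_S P_{(0,1)}(t) \tag{5B.1}$$

$$\frac{dP_{(1,0)}(t)}{dt} = -\left[M\lambda_h + (S-1)\alpha + M\lambda_S + \mu_h\right] P_{(1,0)}(t) + \left[M\lambda_h(1-q) + S\alpha\right] P_{(0,0)}(t) + 2\mu_h P_{(2,0)}(t) + \mu_S P_{(1,1)}(t) \tag{5B.2}$$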

( ) ( )[ ] ( )( ) ( ) ( )[ ] ( )

( ) ( ) ( )( )( ) ( )( )

(5B.3)Si2

,tPq1qM)t(P)t(PRi

)t(P1iSq1MtP)Ri(MiSMdt

)t(dP

0,n1ni

h

2i

0n1,iS0,1ih

0,1ih0,ihSh0,i

1

1

1

…≤≤

−λ+µ+µ∧+

α+−+−λ+µ∧+λ+α−+λ−=

−−−

=+

−

∑

( ) ( )[ ] ( )( ) ( )[ ] ( )

( ) ( )( )

( ) (5B.4)tPqM)t(PR)t(P

)t(P1iSMtPRMiSMdt

)t(dP

0,n

1R

0n

nSh0,2Sh1,1SS

0,Sh0,1ShSh0,1S

1

1

1 …λ+µ+µ+

λ+−++µ+λ+λ−+−=

∑−

=

−++

++

( )( ) ( )[ ] ( )( ) ( )[ ] ( )

( ) ( ) (5B.5)1Ni2S),t(PR)t(P

)t(P1iSMtPRMiSMdt

tdP

0,1ih1,iS

0,1ih0,ihSh0,i

…−≤≤+µ+µ+

λ+−++µ+λ+λ−+−=

+

−

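$$\frac{dP_{(N,0)}(t)}{dt} = \lambda_h P_{(N-1,0)}(t) - R\mu_h P_{(N,0)}(t) \tag{5B.6}$$

$$\frac{dP_{(0,j)}(t)}{dt} = -\left[M\lambda_h + (S-j)\alpha + M\lambda_S + \mu_S\right] P_{(0,j)}(t) + M\lambda_S P_{(0,j-1)}(t) + \mu_h P_{(1,j)}(t) + \mu_S P_{(0,j+1)}(t), \quad 1 \le j \le S \tag{5B.7}$$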

( )( ) ( ) ( )[ ] ( )( ) ( ) ( )

( ) ( ) (5B.8)1Nj1S),t(P)t(P

)t(P1jSMtPjSMjSMdt

tdP

1j,0Sj,ih

1j,0Sj,0SShj,0

…−≤≤+µ+µ+

λ+−++µ+λ−++λ−+−=

+

−

$$\frac{dP_{(0,N)}(t)}{dt} = \lambda_S P_{(0,N-1)}(t) - \mu_S P_{(0,N)}(t) \tag{5B.9}$$

( )( ) ( ) ( ) ( )[ ] ( )

( ) ( )[ ] ( )( )( )( ) ( )

( ) ( ) ( )[ ] ( ) ( )

(5B.10) Sji2,0j,i

),t(P)t(PjiSq1M)t(PR

)t(Pq1qM)t(PiSq1M

)t(PRiMjiSq1Mdt

tdP

1j.iS1j,ihj.1ih

n,n

2ji

1nn

1nnjihj,1ih

j,iShShj,i

21

21

21

…≤+≤≠

µ+α+−+−λ+µ+

−λ+α−+−λ+

µ+µ∧+λ+α+−+−λ−=

+−+

−+

=+

−+−+− ∑



( )( ) ( ) ( )[ ] ( )( )

( ) ( ) ( )( )

( )

( )( ) ( ) ( ) ( )

(5B.11)1Sji

),t(P)t(PRi)t(PqM

)t(PqM)t(PM)t(P1jiSM

tPRiMjiSMdt

tdP

1j,1Sj,1ihn,n1nnji

h

1S

Rnn

n,n1nnji

h

1R

1nn1j,iSj,1ih

j,iShShj,i

21

21

21

21

21

21

…+=+

µ+µ∧+λ+

λ+λ+λ+−−++

µ+µ∧+λ+λ+−+−=

++−+−+

−

=+

−+−+−

=+−−

∑

∑

( )( ) ( ) ( )[ ] ( )( )

( ) ( ) ( ) ( ) ( )

( ) (5B.12)1Nji2S),t(P

)t(PRi)t(PM)t(P1jiSM

tPRiMjiSMdt

tdP

1j,iS

j,1ih1j,iSj,1ih

j,iShShj,i

…−=+≤+µ+

µ∧+λ+λ++−++

µ+µ∧+λ+λ+−+−=

+

+−−

( )( ) ( ) ( )[ ] ( )

( ) ( )[ ] ( ) ( ) ( ) ( )

( ) (5B.13)1Sj1,1i),t(P

)t(PM)t(P1i)t(PjSq1M

)t(PM1jSq1Mdt

tdP

1j.iS

1j,iSj.1ihj,1ih

j,iShShj,i

…−≤≤=µ+

λ+µ++α−+−λ+

µ+µ+λ+α+−+−λ−=

+

−+−

$$\frac{dP_{(i,j)}(t)}{dt} = \lambda_h P_{(i-1,j)}(t) + M\lambda_S P_{(i,j-1)}(t) - \left[(R-1)\mu_h + \mu_S\right] P_{(i,j)}(t), \quad i + j = N \tag{5B.14}$$

In the construction of the above equations, we have used $i \wedge R$ for $\min(i, R)$.
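The quantity min(i, R) expresses that at most R repairmen can work at the same time, so the total hardware repair rate grows with the number of failed units only until every repairman is busy. A minimal sketch, with a hypothetical helper name and illustrative values:

```python
def hardware_repair_rate(i, r, mu_h):
    """Total hardware repair rate with i failed units and R repairmen."""
    return min(i, r) * mu_h


R, MU_H = 3, 1.0   # illustrative values only
rates = [hardware_repair_rate(i, R, MU_H) for i in range(8)]
print(rates)
```

With R = 3 the rate is μ_h, 2μ_h and 3μ_h for one, two and three failed units and stays at 3μ_h thereafter.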

Taking Laplace transforms of equations (5B.1) to (5B.14) with the initial conditions $P_{(0,0)}(0) = 1$ and $P_{(i,j)}(0) = 0$ for $(i,j) \ne (0,0)$, we get

$$\left[s + M\lambda_h + S\alpha + M\lambda_S\right] P^*_{(0,0)}(s) - \mu_h P^*_{(1,0)}(s) - \mu_S P^*_{(0,1)}(s) = 1 \tag{5B.15}$$

$$\left[s + M\lambda_h + (S-1)\alpha + M\lambda_S + \mu_h\right] P^*_{(1,0)}(s) - \left[M\lambda_h(1-q) + S\alpha\right] P^*_{(0,0)}(s) - 2\mu_h P^*_{(2,0)}(s) - \mu_S P^*_{(1,1)}(s) = 0 \tag{5B.16}$$

( )[ ] ( )( ) ( ) ( )[ ] ( )

( ) ( ) ( )( )( ) ( )( )

(5B.17)Si2

,0sPq1qM)s(P)s(PRi

)s(P1iSq1MsPs)Ri(MiSM

0,n*1ni

h

2i

0n1,i

*S0,1i

*h

0,1i*

h0,i*

hSh

11

1

…≤≤

=−λ−µ−µ∧−

α+−+−λ−+µ∧+λ+α−+λ

−−−

=

+

−

∑



( )[ ] ( )( ) ( )[ ] ( ) ( )

( )( )

( )( ) (5B.18)0sPqM)s(PR

)s(P)s(P1iSMsPsRMiSM

0,n*

1R

0n

nSh0,2S

*h

1,1S*

S0,S*

h0,1S*

hSh

1

1

1 …=λ−µ−

µ−λ+−+−+µ+λ+λ−+

∑−

=

−+

++

( )[ ] ( )( ) ( )[ ] ( ) ( )

( ) (5B.19)1Ni2S,0)s(PR

)s(P)s(P1iSMsPsRMiSM

0,1i*

h

1,i*

S0,1i*

h0,i*

hSh

…−≤≤+=µ−

µ−λ+−+−+µ+λ+λ−+

+

−

$$(s + R\mu_h) P^*_{(N,0)}(s) - \lambda_h P^*_{(N-1,0)}(s) = 0 \tag{5B.20}$$

( )[ ] ( )( ) ( ) ( )

( ) (5B.21)Sj1,0)s(P

)s(P)s(PMsPsMjSM

1j,0*

S

j,1*

h1j,0*

Sj,0*

SSh

…≤≤=µ−

µ−λ−+µ+λ+α−+λ

+

−

( ) ( )[ ] ( )( ) ( ) ( )

( ) ( ) (5B.22)1Nj1S,0)s(P)s(P

)s(P1jSMsPsjSMjSM

1j,0*

Sj,i*

h

1j,0*

Sj,0*

SSh

…−≤≤+=µ−µ−

λ+−+−+µ+λ−++λ−+

+

−

$$(s + \mu_S) P^*_{(0,N)}(s) - \lambda_S P^*_{(0,N-1)}(s) = 0 \tag{5B.23}$$

( ) ( ) ( )[ ] ( )

( ) ( )[ ] ( )( )( )( ) ( )

( ) ( ) ( )[ ] ( ) ( )

(5B.24)Sji2,0j,i,0)s(P)s(PjiSq1M)s(PR

)s(Pq1qM)s(PiSq1M

)s(PsRiMjiSq1M

1j.i*

S1j,i*

hj.1i*

h

n,n*

2ji

1nn

1nnjihj,1i

*h

j,i*

ShSh

21

21

21

…≤+≤≠=µ−α+−+−λ−µ−

−λ−α−+−λ−

+µ+µ∧+λ+α+−+−λ

+−+

−+

=+

−+−+− ∑

( ) ( )[ ] ( )( ) ( ) ( )

( )( )

( )( )

( )

( ) ( ) ( ) (5B.25)1Sji,0)s(P)s(PRi

)s(PqM)s(PqM)s(PM

)s(P1jiSMsPsRiMjiSM

1j,1*

Sj,1i*

h

n,n*1nnji

h

1S

Rnnn,n

*1nnjih

1R

1nn1j,i

*S

j,1i*

hj,i*

ShSh

2121

21

2121

21

…+=+=µ−µ∧−

λ−λ−λ−

λ+−−+−+µ+µ∧+λ+λ+−+

++

−+−+−

=+

−+−+−

=+

−

−

∑∑

( ) ( )[ ] ( )( ) ( ) ( )

( ) ( ) ( ) ( )

(5B.26)1Nji2S,0)s(P)s(PRi)s(PM

)s(P1jiSMsPsRiMjiSM

1j,i*

Sj,1i*

h1j,i*

S

j,1i*

hj,i*

ShSh

…−=+≤+=µ−µ∧−λ−

λ++−+−+µ+µ∧+λ+λ+−+

++−

−



( ) ( )[ ] ( ) ( ) ( )[ ] ( )

( ) ( ) ( ) ( )

(5B.27)1Sj1,1i,0)s(P)s(PM)s(P1i

)s(PjSq1M)s(PsM1jSq1M

1j.i*

S1j,i*

Sj.1i*

h

j,1i*

hj,i*

ShSh

…−≤≤==µ−λ−µ+−

α−+−λ−+µ+µ+λ+α+−+−λ

+−+

−

$$\left[s + (R-1)\mu_h + \mu_S\right] P^*_{(i,j)}(s) - \lambda_h P^*_{(i-1,j)}(s) - M\lambda_S P^*_{(i,j-1)}(s) = 0, \quad i + j = N \tag{5B.28}$$
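The transformed equations are linear in the unknowns P*(s), so for any fixed s they can be collected into a matrix equation A(s) P*(s) = b, where b holds the initial probabilities. The sketch below illustrates this matrix method on a two-state toy chain rather than on the model itself; the function names and the rates are assumptions chosen for illustration, and a small Gaussian elimination is used so that the example needs no external library.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def transformed_system(s, lam, mu):
    """Laplace-transformed equations of the toy chain:
    (s + lam) P0* - mu P1* = 1 and -lam P0* + (s + mu) P1* = 0."""
    a = [[s + lam, -mu],
         [-lam, s + mu]]
    b = [1.0, 0.0]          # P0(0) = 1, P1(0) = 0
    return a, b


s, lam, mu = 0.5, 0.1, 0.45              # illustrative values only
p_star = solve_linear(*transformed_system(s, lam, mu))
print(p_star)
```

For the toy chain the result can be compared with the closed form P0*(s) = (s + μ) / (s (s + λ + μ)); the transforms of all state probabilities sum to 1/s because the probabilities themselves sum to one.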

5B.4 Special Case

As a special case of the general model, we consider a system with 4 hardware components, 4 software components and 3 warm standby components for the hardware units. The differential-difference equations governing the model are as follows:

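$$\frac{dP_{(0,0)}(t)}{dt} = -(4\lambda_h + 3\alpha + 4\lambda_S) P_{(0,0)}(t) + \mu_h P_{(1,0)}(t) + \mu_S P_{(0,1)}(t) \tag{5B.29}$$

$$\frac{dP_{(1,0)}(t)}{dt} = -(4\lambda_h + 2\alpha + 4\lambda_S + \mu_h) P_{(1,0)}(t) + \left[4\lambda_h(1-q) + 3\alpha\right] P_{(0,0)}(t) + 2\mu_h P_{(2,0)}(t) + \mu_S P_{(1,1)}(t) \tag{5B.30}$$

$$\frac{dP_{(2,0)}(t)}{dt} = -(4\lambda_h + \alpha + 4\lambda_S + 2\mu_h) P_{(2,0)}(t) + \left[4\lambda_h(1-q) + 2\alpha\right] P_{(1,0)}(t) + 3\mu_h P_{(3,0)}(t) + \mu_S P_{(2,1)}(t) + 4\lambda_h q(1-q) P_{(0,0)}(t) \tag{5B.31}$$

$$\frac{dP_{(3,0)}(t)}{dt} = -(4\lambda_h + 4\lambda_S + 3\mu_h) P_{(3,0)}(t) + \left[4\lambda_h(1-q) + \alpha\right] P_{(2,0)}(t) + 3\mu_h P_{(4,0)}(t) + \mu_S P_{(3,1)}(t) + 4\lambda_h q^2(1-q) P_{(0,0)}(t) + 4\lambda_h q(1-q) P_{(1,0)}(t) \tag{5B.32}$$

$$\frac{dP_{(4,0)}(t)}{dt} = -(3\lambda_h + 4\lambda_S + 3\mu_h) P_{(4,0)}(t) + 4\lambda_h P_{(3,0)}(t) + 3\mu_h P_{(5,0)}(t) + \mu_S P_{(4,1)}(t) + 4\lambda_h q^3 P_{(0,0)}(t) + 4\lambda_h q^2 P_{(1,0)}(t) + 4\lambda_h q P_{(2,0)}(t) \tag{5B.33}$$

$$\frac{dP_{(5,0)}(t)}{dt} = -(2\lambda_h + 4\lambda_S + 3\mu_h) P_{(5,0)}(t) + 3\lambda_h P_{(4,0)}(t) + 3\mu_h P_{(6,0)}(t) + \mu_S P_{(5,1)}(t) \tag{5B.34}$$

$$\frac{dP_{(6,0)}(t)}{dt} = -(\lambda_h + 4\lambda_S + 3\mu_h) P_{(6,0)}(t) + 2\lambda_h P_{(5,0)}(t) + 3\mu_h P_{(7,0)}(t) + \mu_S P_{(6,1)}(t) \tag{5B.35}$$

$$\frac{dP_{(7,0)}(t)}{dt} = -3\mu_h P_{(7,0)}(t) + \lambda_h P_{(6,0)}(t) \tag{5B.36}$$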
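The powers of q in equations (5B.31)-(5B.33) can be read as the outcome of successive switching attempts after an operating unit fails: each available standby fails to switch with probability q, and the attempts stop at the first success or when no standby is left. The following sketch makes that reading explicit. It is an interpretation of the equations rather than a statement taken from the text, and the function name is hypothetical.

```python
def switching_outcomes(standbys_left, q):
    """Probability that k units end up failed after one operating-unit failure.

    Returns {k: probability} for k = 1 .. standbys_left + 1, where k counts
    the failed operating unit plus every standby that failed to switch.
    """
    outcomes = {}
    for k in range(1, standbys_left + 1):
        # k - 1 switching failures followed by one successful switch
        outcomes[k] = q ** (k - 1) * (1 - q)
    # every remaining standby failed to switch
    outcomes[standbys_left + 1] = q ** standbys_left
    return outcomes


q = 0.1                                   # illustrative value only
dist = switching_outcomes(3, q)           # starting from state (0,0), S = 3
print(dist)
```

For the state (0,0) with three standbys this gives the factors (1 − q), q(1 − q), q²(1 − q) and q³, which multiply 4λ_h P(0,0)(t) in the equations for the states (1,0), (2,0), (3,0) and (4,0). The factors sum to one, so every operating-unit failure is accounted for.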



( )( ) ( ) ( )( ) ( )( ) ( )( ) ( )( )tPtPtP4tP424dt

tdP2,0S1,1h0,0S1,0SSh

1,0 µ+µ+λ+µ+λ+α+λ−= …(5B.37)

( )( ) ( ) ( )( ) ( )[ ] ( )( )

( )( ) ( )( ) ( )( ) (5B.38)tPtP4tP2

tP2q14tP44dt

tdP

2,1S0,1S1,2h

1,0h1,1ShSh1,1

…µ+λ+µ+

α+−λ+µ+µ+λ+α+λ−=

( )( ) ( ) ( )( ) ( )[ ] ( )( ) ( )( )

( )( ) ( )( ) ( ) ( )( ) (5B.39)tPq14tPtP4

tP2tPq14tP244dt

tdP

1,0h2,2S0,2S

1,3h1,1h1,2ShSh1,2

…−λ+µ+λ+

µ+α+−λ+µ+µ+λ+λ−=

( )( ) ( ) ( )( ) ( )( ) ( )( ) ( )( )

( )( ) ( )( ) ( )( ) (5B.40)tqP4tPq4tP

tP4tP2tP4tP243dt

tdP

1,1h1,02

h2,3S

0,3S1,4h1,2h1,3ShSh1,3

…λ+λ+µ+

λ+µ+λ+µ+µ+λ+λ−=

( )( ) ( ) ( )( ) ( )( ) ( )( ) ( )( )

( )( ) (5B.41)tP

tP4tP2tP3tP242dt

tdP

2,4S

0,4S1,5h1,3h1,4ShSh1,4

…µ+


( )( ) ( ) ( )( ) ( )( ) ( )( ) ( )( )

( )( ) (5B.42)tP

tP4tP2tP2tP24dt

tdP

2,5S

0,5S1,6h1,4h1,5ShSh1,5

…µ+


( )( ) ( ) ( )( ) ( )( ) ( )( )tP4tPtP2dt

tdP0,6S1,5h1,6Sh

1,6 λ+λ+µ+µ−= …(5B.43)

( )( ) ( ) ( )( ) ( )( ) ( )( ) ( )( )tPtP4tPtP44dt

tdP3,0S1,0S2,1h2,0SSh

2,0 µ+λ+µ+µ+λ+α+λ−= …(5B.44)

( )( ) ( ) ( )( ) ( )[ ] ( )( )

( )( ) ( )( ) ( )( )tPtP4tP2

tPq14tP44dt

tdP

3,1S1,1S2,2h

2,0h2,1ShSh2,1

µ+λ+µ+

α+−λ+µ+µ+λ+λ−= …(5B.45)

( )( ) ( ) ( )( ) ( )( ) ( )( )

( )( ) ( )( ) ( )( ) (5B.46)tqP4tPtP4

tP2tP4tP243dt

tdP

2,0h3,2S1,2S

2,3h2,1h2,2ShSh2,2

…λ+µ+λ+

µ+λ+µ+µ+λ+λ−=



( )( ) ( ) ( )( ) ( )( ) ( )( ) ( )( )

( )( ) (5B.47)tP

tP4tP2tP3tP242dt

tdP

3,3S

1,3S2,4h2,2h2,3ShSh2,3

…µ+


( )( ) ( ) ( )( ) ( )( ) ( )( ) ( )( )

( )( ) (5B.48)tP

tP4tP2tP2tP24dt

tdP

3,4S

1,4S2,5h2,3h2,4ShSh2,4

…µ+


( )( ) ( ) ( )( ) ( )( ) ( )( )tP4tPtP2dt

tdP1,5S2,4h2,5Sh

2,5 λ+λ+µ+µ−= …(5B.49)

( )( ) ( ) ( )( ) ( )( ) ( )( ) ( )( )tPtP4tPtP44dt


3,0 µ+λ+µ+µ+λ+λ−= …(5B.50)

( )( ) ( ) ( )( ) ( )( ) ( )( ) ( )( )

( )( ) (5B.51)tP

tP4tP2tP4tP43dt

tdP

4,1S

2,1S3,2h3,0h3,1ShSh3,1

…µ+


( )( ) ( ) ( )( ) ( )( ) ( )( ) ( )( )

( )( ) (5B.52)tP

tP4tP2tP3tP242dt

tdP

4,2S

2,2S3,3h3,1h3,2ShSh3,2

…µ+


( )( ) ( ) ( )( ) ( )( ) ( )( ) ( )( )

( )( ) (5B.53)tP

tP4tP2tP2tP24dt

tdP

4,3S

2,3S3,4h3,2h3,3ShSh3,3

…µ+


( )( ) ( ) ( )( ) ( )( ) ( )( )tP4tPtP2dt

tdP2,4S3,3h3,4Sh

3,4 λ+λ+µ+µ−= …(5B.54)

( )( ) ( ) ( )( ) ( )( ) ( )( ) ( )( )tPtP4tPtP33dt


4,0 µ+λ+µ+µ+λ+λ−= …(5B.55)

( )( ) ( ) ( )( ) ( )( ) ( )( ) ( )( )

( )( ) (5B.56)tP2

tP4tPtP3tP32dt

tdP

4,2h

3,1S5,1S4,0h4,1ShSh4,1

…µ+




( )( ) ( ) ( )( ) ( )( ) ( )( ) ( )( )

( )( ) (5B.57)tP

tP2tP4tP2tP24dt

tdP

5,2S

4,3h3,2S4,1h4,2ShSh4,2

…µ+

µ+λ+λ+µ+µ+λ+λ−=

( )( ) ( ) ( )( ) ( )( ) ( )( )tP4tPtP2dt

tdP3,3S4,2h4,3Sh

4,3 λ+λ+µ+µ−= …(5B.58)

( )( ) ( ) ( )( ) ( )( ) ( )( ) ( )( )tPtP3tPtP22dt


5,0 µ+λ+µ+µ+λ+λ−= …(5B.59)

( )( ) ( ) ( )( ) ( )( ) ( )( ) ( )( )

( )( ) (5B.60)tP2

tPtP2tP3tP2dt

tdP

5,0h

6,1S5,2h4,1S5,1ShSh5,1

…λ+

µ+µ+λ+µ+µ+λ+λ−=

( )( ) ( ) ( )( ) ( )( ) ( )( )tP4tPtP2dt

tdP4,2S5,1h5,2Sh

5,2 λ+λ+µ+µ−= …(5B.61)

( )( ) ( ) ( )( ) ( )( ) ( )( ) ( )( )tPtP2tPtPdt


6,0 µ+λ+µ+µ+λ+λ−= …(5B.62)

( )( ) ( ) ( )( ) ( )( ) ( )( )tP2tPtPdt

tdP5,1S6,0h6,1Sh

6,1 λ+λ+µ+µ−= …(5B.63)

( )( ) ( ) ( )( ) ( )( )tPtPdt

tdP6,0S7,0S

7,0 λ+µ−= …(5B.64)

In order to solve the above set of equations (5B.29)-(5B.64), we impose the

initial conditions P(0,0) (0)=1 and P (i,j) (0)=0, i≠ 0, j≠ 0. Numerical method based on

R-K fourth order technique is used to obtain the transient probabilities.

To solve the set of equations governing the model analytically, we take Laplace
transforms of equations (5B.29)-(5B.64) and solve the resulting system by the matrix method with the initial
conditions $P_{0,0}(0)=1$ and $P_{i,j}(0)=0$ for $(i,j)\neq(0,0)$. Equations (5B.29)-(5B.64) then
become

\[ (4\lambda_h + 3\alpha + 4\lambda_S + s)P^*_{0,0}(s) - \mu_h P^*_{0,1}(s) - \mu_S P^*_{1,0}(s) = 1 \tag{5B.65} \]

\[ (4\lambda_h + 2\alpha + 4\lambda_S + \mu_h + s)P^*_{0,1}(s) - [4\lambda_h(1-q) + 3\alpha]P^*_{0,0}(s) - 2\mu_h P^*_{0,2}(s) - \mu_S P^*_{1,1}(s) = 0 \tag{5B.66} \]

\[ (4\lambda_h + \alpha + 4\lambda_S + 2\mu_h + s)P^*_{0,2}(s) - [4\lambda_h(1-q) + 2\alpha]P^*_{0,1}(s) - 3\mu_h P^*_{0,3}(s) - \mu_S P^*_{1,2}(s) = 0 \tag{5B.67} \]

\[ (4\lambda_h + 4\lambda_S + 3\mu_h + s)P^*_{0,3}(s) - [4\lambda_h(1-q) + \alpha]P^*_{0,2}(s) - 3\mu_h P^*_{0,4}(s) - \mu_S P^*_{1,3}(s) = 0 \tag{5B.68} \]

\[ (3\lambda_h + 4\lambda_S + 3\mu_h + s)P^*_{0,4}(s) - 4\lambda_h P^*_{0,3}(s) - 3\mu_h P^*_{0,5}(s) - \mu_S P^*_{1,4}(s) = 0 \tag{5B.69} \]

\[ (2\lambda_h + 4\lambda_S + 3\mu_h + s)P^*_{0,5}(s) - 3\lambda_h P^*_{0,4}(s) - 3\mu_h P^*_{0,6}(s) - \mu_S P^*_{1,5}(s) = 0 \tag{5B.70} \]

\[ (\lambda_h + 4\lambda_S + 3\mu_h + s)P^*_{0,6}(s) - 2\lambda_h P^*_{0,5}(s) - 3\mu_h P^*_{0,7}(s) - \mu_S P^*_{1,6}(s) = 0 \tag{5B.71} \]

\[ (3\mu_h + s)P^*_{0,7}(s) - \lambda_h P^*_{0,6}(s) = 0 \tag{5B.72} \]

\[ (4\lambda_h + 2\alpha + 4\lambda_S + \mu_S + s)P^*_{1,0}(s) - 4\lambda_S P^*_{0,0}(s) - \mu_h P^*_{1,1}(s) - \mu_S P^*_{2,0}(s) = 0 \tag{5B.73} \]

\[ (4\lambda_h + \alpha + 4\lambda_S + \mu_h + \mu_S + s)P^*_{1,1}(s) - [4\lambda_h(1-q) + 2\alpha]P^*_{1,0}(s) - 2\mu_h P^*_{1,2}(s) - 4\lambda_S P^*_{0,1}(s) - \mu_S P^*_{2,1}(s) = 0 \tag{5B.74} \]

\[ (4\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s)P^*_{1,2}(s) - [4\lambda_h(1-q) + \alpha]P^*_{1,1}(s) - 2\mu_h P^*_{1,3}(s) - 4\lambda_S P^*_{0,2}(s) - \mu_S P^*_{2,2}(s) = 0 \tag{5B.75} \]

\[ (3\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s)P^*_{1,3}(s) - 4\lambda_h P^*_{1,2}(s) - 2\mu_h P^*_{1,4}(s) - 4\lambda_S P^*_{0,3}(s) - \mu_S P^*_{2,3}(s) = 0 \tag{5B.76} \]

\[ (2\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s)P^*_{1,4}(s) - 3\lambda_h P^*_{1,3}(s) - 2\mu_h P^*_{1,5}(s) - 4\lambda_S P^*_{0,4}(s) - \mu_S P^*_{2,4}(s) = 0 \tag{5B.77} \]

\[ (\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s)P^*_{1,5}(s) - 2\lambda_h P^*_{1,4}(s) - 2\mu_h P^*_{1,6}(s) - 4\lambda_S P^*_{0,5}(s) - \mu_S P^*_{2,5}(s) = 0 \tag{5B.78} \]

\[ (2\mu_h + \mu_S + s)P^*_{1,6}(s) - \lambda_h P^*_{1,5}(s) - 4\lambda_S P^*_{0,6}(s) = 0 \tag{5B.79} \]

\[ (4\lambda_h + \alpha + 4\lambda_S + \mu_S + s)P^*_{2,0}(s) - \mu_h P^*_{2,1}(s) - 4\lambda_S P^*_{1,0}(s) - \mu_S P^*_{3,0}(s) = 0 \tag{5B.80} \]


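\[ (4\lambda_h + 4\lambda_S + \mu_h + \mu_S + s)P^*_{2,1}(s) - [4\lambda_h(1-q) + \alpha]P^*_{2,0}(s) - 2\mu_h P^*_{2,2}(s) - 4\lambda_S P^*_{1,1}(s) - \mu_S P^*_{3,1}(s) = 0 \tag{5B.81} \]

\[ (3\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s)P^*_{2,2}(s) - 4\lambda_h P^*_{2,1}(s) - 2\mu_h P^*_{2,3}(s) - 4\lambda_S P^*_{1,2}(s) - \mu_S P^*_{3,2}(s) = 0 \tag{5B.82} \]

\[ (2\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s)P^*_{2,3}(s) - 3\lambda_h P^*_{2,2}(s) - 2\mu_h P^*_{2,4}(s) - 4\lambda_S P^*_{1,3}(s) - \mu_S P^*_{3,3}(s) = 0 \tag{5B.83} \]

\[ (\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s)P^*_{2,4}(s) - 2\lambda_h P^*_{2,3}(s) - 2\mu_h P^*_{2,5}(s) - 4\lambda_S P^*_{1,4}(s) - \mu_S P^*_{3,4}(s) = 0 \tag{5B.84} \]

\[ (2\mu_h + \mu_S + s)P^*_{2,5}(s) - \lambda_h P^*_{2,4}(s) - 4\lambda_S P^*_{1,5}(s) = 0 \tag{5B.85} \]

\[ (4\lambda_h + 4\lambda_S + \mu_S + s)P^*_{3,0}(s) - \mu_h P^*_{3,1}(s) - 4\lambda_S P^*_{2,0}(s) - \mu_S P^*_{4,0}(s) = 0 \tag{5B.86} \]

\[ (3\lambda_h + 4\lambda_S + \mu_h + \mu_S + s)P^*_{3,1}(s) - 4\lambda_h P^*_{3,0}(s) - 2\mu_h P^*_{3,2}(s) - 4\lambda_S P^*_{2,1}(s) - \mu_S P^*_{4,1}(s) = 0 \tag{5B.87} \]

\[ (2\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s)P^*_{3,2}(s) - 3\lambda_h P^*_{3,1}(s) - 2\mu_h P^*_{3,3}(s) - 4\lambda_S P^*_{2,2}(s) - \mu_S P^*_{4,2}(s) = 0 \tag{5B.88} \]

\[ (\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s)P^*_{3,3}(s) - 2\lambda_h P^*_{3,2}(s) - 2\mu_h P^*_{3,4}(s) - 4\lambda_S P^*_{2,3}(s) - \mu_S P^*_{4,3}(s) = 0 \tag{5B.89} \]

\[ (2\mu_h + \mu_S + s)P^*_{3,4}(s) - \lambda_h P^*_{3,3}(s) - 4\lambda_S P^*_{2,4}(s) = 0 \tag{5B.90} \]

\[ (3\lambda_h + 3\lambda_S + \mu_S + s)P^*_{4,0}(s) - \mu_h P^*_{4,1}(s) - 4\lambda_S P^*_{3,0}(s) - \mu_S P^*_{5,0}(s) = 0 \tag{5B.91} \]

\[ (2\lambda_h + 3\lambda_S + \mu_h + \mu_S + s)P^*_{4,1}(s) - 3\lambda_h P^*_{4,0}(s) - \mu_S P^*_{5,1}(s) - 4\lambda_S P^*_{3,1}(s) - 2\mu_h P^*_{4,2}(s) = 0 \tag{5B.92} \]

\[ (\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s)P^*_{4,2}(s) - 2\lambda_h P^*_{4,1}(s) - 4\lambda_S P^*_{3,2}(s) - 2\mu_h P^*_{4,3}(s) - \mu_S P^*_{5,2}(s) = 0 \tag{5B.93} \]

\[ (2\mu_h + \mu_S + s)P^*_{4,3}(s) - \lambda_h P^*_{4,2}(s) - 4\lambda_S P^*_{3,3}(s) = 0 \tag{5B.94} \]

\[ (2\lambda_h + 2\lambda_S + \mu_S + s)P^*_{5,0}(s) - \mu_h P^*_{5,1}(s) - 3\lambda_S P^*_{4,0}(s) - \mu_S P^*_{6,0}(s) = 0 \tag{5B.95} \]

\[ (\lambda_h + 2\lambda_S + \mu_h + \mu_S + s)P^*_{5,1}(s) - 3\lambda_S P^*_{4,1}(s) - 2\mu_h P^*_{5,2}(s) - \mu_S P^*_{6,1}(s) - 2\lambda_h P^*_{5,0}(s) = 0 \tag{5B.96} \]
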


\[ (2\mu_h + \mu_S + s)P^*_{5,2}(s) - \lambda_h P^*_{5,1}(s) - 4\lambda_S P^*_{4,2}(s) = 0 \tag{5B.97} \]

\[ (\lambda_h + \lambda_S + \mu_S + s)P^*_{6,0}(s) - \mu_h P^*_{6,1}(s) - 2\lambda_S P^*_{5,0}(s) - \mu_S P^*_{7,0}(s) = 0 \tag{5B.98} \]

\[ (\mu_h + \mu_S + s)P^*_{6,1}(s) - \lambda_h P^*_{6,0}(s) - 2\lambda_S P^*_{5,1}(s) = 0 \tag{5B.99} \]

\[ (\mu_S + s)P^*_{7,0}(s) - \lambda_S P^*_{6,0}(s) = 0 \tag{5B.100} \]

For brevity, we denote the Laplace transforms $P^*_{i,j}(s)$ of the probabilities with one suffix,
i.e. by $P^*_k(s)$, as defined below:

\[ P^*_{0,i}(s) = P^*_{i+1}(s),\ 0 \le i \le 7; \quad P^*_{1,i}(s) = P^*_{8+i+1}(s),\ 0 \le i \le 6; \quad P^*_{2,i}(s) = P^*_{15+i+1}(s),\ 0 \le i \le 5; \]

\[ P^*_{3,i}(s) = P^*_{21+i+1}(s),\ 0 \le i \le 4; \quad P^*_{4,i}(s) = P^*_{26+i+1}(s),\ 0 \le i \le 3; \quad P^*_{5,i}(s) = P^*_{31+i}(s),\ 0 \le i \le 2; \]

\[ P^*_{6,i}(s) = P^*_{34+i}(s),\ 0 \le i \le 1; \quad P^*_{7,0}(s) = P^*_{36}(s). \]

The system of equations (5B.65)-(5B.100) can be written in matrix form as

\[ Q(s)\,P^*(s) = P(0) \tag{5B.101} \]

where $P^*(s) = [P^*_1(s), P^*_2(s), \ldots, P^*_{36}(s)]^T$ and $P(0) = [1, 0, 0, \ldots, 0]^T$.

Here Q(s) is a matrix of order 36×36 and can be written as a block tri-diagonal matrix as
follows:

\[ Q(s) = [A_{ij}]_{7 \times 7} \]

The sub-matrices $A_{ij}$ (i = 1, 2, …, 7 and j = 1, 2, …, 7) are constructed for this particular case
as follows:

\[ A_{12} = [-\mu_S I_{7 \times 7},\ 0]^T, \]

[ ][ ]

[ ]

µλ−µ−Λλ−

µ−Λλ−µ−Λλ−λ−λ−λ−

µ−Λα+−λ−−λ−−λ−µ−Λα+−λ−−λ−

µ−Λα+−λ−µ−Λ

=

hh

h7h

h6h

h5hh2

h3

h

h4hh2

h

h3hh

h2h

h1

11

3000000320000003300000034q4q4q40003)q1(4)q1(q4)q1(q4000032)q1(4)q1(q40000023)q1(000000

A

\[ A_{21} = [-4\lambda_S I_{7 \times 7},\ 0], \qquad A_{23} = [-\mu_S I_{6 \times 6},\ 0]^T, \]



[ ][ ]

µ−Λλ−µ−Λλ−

µ−Λλ−λ−λ−µ−Λα+−λ−−λ−


=

h

13h

h12h

h11hh2

h

h10hh

h9h

h8

22

20000000200000230000024q4q40002)q1(4)q1(q4000022)q1(400000

A

\[ A_{32} = [-4\lambda_S I_{6 \times 6},\ 0], \qquad A_{34} = [-\mu_S I_{5 \times 5},\ 0]^T, \]

[ ]

Λλ−µ−Λλ−

µ−Λλ−µ−Λλ−λ−


=

19h

h18h

h17h

h16hh

h15h

h14

33

000022000023000024q40002)q1(40000

A

\[ A_{43} = [-4\lambda_S I_{5 \times 5},\ 0], \qquad A_{45} = [-\mu_S I_{4 \times 4},\ 0]^T, \]

Λλ−µ−Λλ−

µ−Λλ−µ−Λλ−

µ−Λ

=

24h

h23h

h22h

h21h

h20

44

000220002300024000

A

\[ A_{54} = [-4\lambda_S I_{4 \times 4},\ 0], \qquad A_{56} = [-\mu_S I_{3 \times 3},\ 0]^T, \]

+µ+µλ−µ−+µ+µ+λ+λλ−

µ−+µ+µ+λ+λλ−µ−+µ+λ+λ

=

s2002s242002s32300s33

A

Shh

hShShh

hShShh

hSSh

55

\[ A_{67} = [-\mu_S I_{2 \times 2},\ 0]^T \]

λ−λ−

λ−=

040000300003

A

S

S

S

65



+µ+µλ−µ−+µ+µ+λ+λλ−

µ−+µ+λ+λ=

s202s220s22

A

Shh

hhSShh

hSSh

66

λ−+µ+µλ−

µ−+µ+λ+λ=

0s

sA

S

Shh

hSSh

76 ,

µ−

µ−=

S

S

77 0A

Other sub matrices Aij are zero matrices of appropriate size.

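In the sub-matrices, we have used the following notations:

\[ \Lambda_1 = 4\lambda_h + 3\alpha + 4\lambda_S + s, \qquad \Lambda_2 = 4\lambda_h + 2\alpha + 4\lambda_S + \mu_h + s, \]

\[ \Lambda_3 = 4\lambda_h + \alpha + 4\lambda_S + 2\mu_h + s, \qquad \Lambda_4 = 4\lambda_h + 4\lambda_S + 3\mu_h + s, \]

\[ \Lambda_5 = 3\lambda_h + 4\lambda_S + 3\mu_h + s, \qquad \Lambda_6 = 2\lambda_h + 4\lambda_S + 3\mu_h + s, \]

\[ \Lambda_7 = \lambda_h + 4\lambda_S + 3\mu_h + s, \qquad \Lambda_8 = 4\lambda_h + 2\alpha + 4\lambda_S + \mu_S + s, \]

\[ \Lambda_9 = 4\lambda_h + \alpha + 4\lambda_S + \mu_h + \mu_S + s, \qquad \Lambda_{10} = 4\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s, \]

\[ \Lambda_{11} = 3\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s, \qquad \Lambda_{12} = 2\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s, \]

\[ \Lambda_{13} = \lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s, \qquad \Lambda_{14} = 4\lambda_h + \alpha + 4\lambda_S + \mu_S + s, \]

\[ \Lambda_{15} = 4\lambda_h + 4\lambda_S + \mu_h + \mu_S + s, \qquad \Lambda_{16} = 3\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s, \]

\[ \Lambda_{17} = 2\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s, \qquad \Lambda_{18} = \lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s, \]

\[ \Lambda_{19} = 2\mu_h + \mu_S + s, \qquad \Lambda_{20} = 4\lambda_h + 4\lambda_S + \mu_S + s, \]

\[ \Lambda_{21} = 3\lambda_h + 4\lambda_S + \mu_h + \mu_S + s, \qquad \Lambda_{22} = 2\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s, \]

\[ \Lambda_{23} = \lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s, \qquad \Lambda_{24} = 2\mu_h + \mu_S + s. \]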

Using Cramer's rule, the Laplace transform $P^*_k(s)$ of the probability $P_k(t)$ can be obtained
as

\[ P^*_k(s) = \frac{|Q_{k+1}(s)|}{|Q(s)|}, \quad 0 \le k \le L \tag{5B.102} \]

where $|Q_{k+1}(s)|$ is the determinant obtained by replacing the (k+1)th column of the
determinant $|Q(s)|$ by the RHS vector P(0).
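
To make the column-replacement step concrete, the sketch below applies Cramer's rule to a deliberately small case. It is an illustration under assumptions, not the 36×36 system: the two-state chain (rates `lam`, `mu`) and the helper names `det`, `cramer` and `two_state_q` are hypothetical. For that chain the known transform is $P^*_0(s) = (s+\mu)/[s(s+\lambda+\mu)]$, which the code reproduces exactly with rational arithmetic.

```python
from fractions import Fraction


def det(m):
    """Determinant by Gaussian elimination with exact rational arithmetic."""
    a = [[Fraction(x) for x in row] for row in m]
    n = len(a)
    sign, result = 1, Fraction(1)
    for c in range(n):
        # Find a row at or below the diagonal with a non-zero pivot.
        p = next((r for r in range(c, n) if a[r][c] != 0), None)
        if p is None:
            return Fraction(0)
        if p != c:
            a[c], a[p] = a[p], a[c]
            sign = -sign
        result *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= factor * a[c][k]
    return sign * result


def cramer(q, rhs):
    """Solve q x = rhs by replacing one column of q at a time with rhs."""
    d = det(q)
    solution = []
    for k in range(len(q)):
        qk = [row[:k] + [rhs[i]] + row[k + 1:] for i, row in enumerate(q)]
        solution.append(det(qk) / d)
    return solution


def two_state_q(s, lam, mu):
    """Q(s) for dP0/dt = -lam*P0 + mu*P1 and dP1/dt = lam*P0 - mu*P1."""
    return [[s + lam, -mu], [-lam, s + mu]]


# Assumed values, chosen only for illustration.
s, lam, mu = Fraction(1, 2), Fraction(3, 10), Fraction(2)
p0_star, p1_star = cramer(two_state_q(s, lam, mu), [1, 0])
print(p0_star, p1_star)
```

The two transforms add up to 1/s, the transform of the constant 1, which mirrors the fact that the probabilities sum to one at every t.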

For calculating the characteristic roots of the matrix Q(s), we note that
s = 0 is one of the roots. Let s = -d, so that we get

\[ Q(-d) = Q - dI \tag{5B.103} \]

Now eq. (5B.101) becomes

\[ Q(-d)\,P^*(s) = (Q - dI)\,P^*(s) = P(0) \tag{5B.104} \]

It is observed that the eigenvalues of Q are real and distinct, and it is also
observed that Q is positive definite. So, all eigenvalues of Q are positive. Let
$\nu_k\ (1 \le k \le L)$ denote the eigenvalues of Q; then we get

\[ |Q(s)| = s\prod_{k=1}^{L}(s + \nu_k) \tag{5B.105} \]

\[ P^*_l(s) = \frac{|Q_{l+1}(s)|}{s\prod_{k=1}^{L}(s + \nu_k)}, \quad 1 \le l \le 36 \tag{5B.106} \]

We may write $P^*_l(s)$ in partial fractions form as

\[ P^*_0(s) = \frac{a_0}{s} + \sum_{k=1}^{L}\frac{a_{0k}}{s + \nu_k} \tag{5B.107} \]

\[ P^*_l(s) = \sum_{k=1}^{L}\frac{a_{lk}}{s + \nu_k} \tag{5B.108} \]

where $a_0$ and $a_{lk}$ are real numbers calculated as

\[ a_0 = \frac{|Q_1(0)|}{\prod_{j=1}^{L}\nu_j} \tag{5B.109} \]

and

\[ a_{lk} = -\frac{|Q_{l+1}(-\nu_k)|}{\nu_k\prod_{j=1,\,j\neq k}^{L}(\nu_j - \nu_k)}, \quad 1 \le l \le L,\ 2 \le k \le L \tag{5B.110} \]

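On taking the inverse Laplace transform of eqs (5B.107) and (5B.108), we get

\[ P_0(t) = \frac{|Q_1(0)|}{\prod_{k=1}^{L}\nu_k} - \sum_{k=1}^{L}\frac{|Q_1(-\nu_k)|\exp(-\nu_k t)}{\nu_k\prod_{j=1,\,j\neq k}^{L}(\nu_j - \nu_k)} \tag{5B.111} \]

\[ P_l(t) = -\sum_{k=1}^{L}\frac{|Q_{l+1}(-\nu_k)|\exp(-\nu_k t)}{\nu_k\prod_{j=1,\,j\neq k}^{L}(\nu_j - \nu_k)}, \quad \text{where } 1 \le l \le L \tag{5B.112} \]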
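
As a check on the structure of these expressions, the residue formulas can be evaluated by hand for a two-state chain, where everything is known in closed form. The sketch below is an assumption-laden illustration: the chain, the rates `lam` and `mu`, and the function name `p0_two_state` are hypothetical. Here $|Q(s)| = s(s+\nu)$ with the single non-zero root $\nu = \lambda + \mu$, and $|Q_1(s)| = s + \mu$, so the constant term and the exponential coefficient follow directly from the two coefficient formulas.

```python
import math


def p0_two_state(t, lam, mu):
    """P_0(t) for a two-state chain via the partial-fraction coefficients."""
    nu = lam + mu                 # the only non-zero root of |Q(s)| = s (s + nu)

    def q1(s):
        # |Q_1(s)|: first column of Q(s) replaced by P(0) = [1, 0]^T
        return s + mu

    a0 = q1(0.0) / nu             # constant term, |Q_1(0)| / product of roots
    a01 = -q1(-nu) / nu           # empty product over j != k equals 1
    return a0 + a01 * math.exp(-nu * t)


# Assumed rates, for illustration only.
lam, mu = 0.3, 2.0
print(p0_two_state(0.0, lam, mu), p0_two_state(50.0, lam, mu))
```

At t = 0 the expression returns 1, and for large t it tends to µ/(λ+µ), the long-run probability of the up state for this toy chain.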

5B.5 Performance Measures

The main objective of developing a mathematical model of a real-time system is to
quantify the performance of the system concerned. In this section, we obtain some
performance indices in terms of the probabilities obtained in the previous section, as follows:

Expected number of failed components at time t due to hardware failure is

\[ F_H(t) = \sum_{i=1}^{M+S}\sum_{j=0}^{M+S-i} i\,P_{i,j}(t) \tag{5B.113} \]

Expected number of failed components at time t due to software failure is

\[ F_S(t) = \sum_{j=1}^{M+S}\sum_{i=0}^{M+S-j} j\,P_{i,j}(t) \tag{5B.114} \]

Expected number of standby components in the system at time t is

\[ S_C(t) = \sum_{i+j=0}^{S}(S - i - j)\,P_{i,j}(t) \tag{5B.115} \]

Component availability at time t is

\[ A_C(t) = 1 - \frac{F_H(t) + F_S(t)}{M + S} \tag{5B.116} \]

Failure frequency at time t is

\[ \omega_F(t) = \lambda_d\,P_{M+S-1}(t) \tag{5B.117} \]

Reliability of the system is

\[ R(t) = \sum_{i=0}^{M+S-1}\sum_{j=0}^{M+S-1} P_{i,j}(t) \tag{5B.118} \]
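
Once the transient probabilities are available, the first four indices are simple weighted sums. The sketch below is a hedged illustration: it follows the index convention of the formulas as printed (weight i in F_H, weight j in F_S), stores the probabilities in a hypothetical dictionary keyed by (i, j), and assumes M = 4 and S = 3, values inferred from the 36-state illustration rather than stated in this section. The function name `performance_indices` is likewise an assumption.

```python
def performance_indices(p, M, S):
    """Return (F_H, F_S, S_C, A_C) from probabilities p[(i, j)] at one time point."""
    total = M + S
    # Expected numbers of failed components: weight i for F_H, weight j for F_S.
    f_h = sum(i * prob for (i, j), prob in p.items() if i + j <= total)
    f_s = sum(j * prob for (i, j), prob in p.items() if i + j <= total)
    # Expected number of standbys left, counted only while i + j <= S.
    s_c = sum((S - i - j) * prob for (i, j), prob in p.items() if i + j <= S)
    # Component availability.
    a_c = 1.0 - (f_h + f_s) / total
    return f_h, f_s, s_c, a_c


# At t = 0 all probability sits in state (0, 0).
print(performance_indices({(0, 0): 1.0}, M=4, S=3))
```

With all probability in state (0, 0), the sketch gives no failed components, three standbys and availability 1, which agrees with the t = 0 rows of tables 5B.1(a)-5B.1(f).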


In this section, we check the validity of the proposed model by employing the
Runge-Kutta (R-K) technique of fourth order and the matrix method to solve the system
of differential equations. The R-K method is implemented using MATLAB's
'ode45' function, an adaptive Runge-Kutta (4,5) solver. We consider a time span with equal intervals. For different values
of λh, λs, α, µh and µs, tables 5B.1(a)-5B.1(f) and figs 5B.2(a)-5B.2(f) depict various
performance measures and the reliability of the system. For illustration purposes, we
choose the default parameters as λ=0.9, λS=0.1, C=0.3, θ=0.6, β=1, µ=3.

Tables 5B.1(a)-5B.1(f) summarize various performance measures, namely the expected
number of failed components due to hardware and software failures, the expected number
of standby components, the availability and the failure frequency of the system.
The tables show that the expected number of failed components
due to software failures increases with time, whereas the expected number due to failures of
hardware components initially increases and after some time decreases gradually.
The expected number of standby components and the availability are decreasing functions
of time, while the failure frequency increases with
time. From tables 5B.1(a)-5B.1(f), it is noticed that FH(t) increases as λh, α, µS and q
increase but decreases as λs and µh increase. It is seen that FS(t) increases on
increasing λs and µh but decreases on increasing the values of λh and µS. By increasing
the values of α and q, FS(t) increases. For the other performance indices, viz. SC(t), AC(t)
and ωF(t), we see that as λh and λs increase, SC(t) and AC(t) decrease but ωF(t) increases.
When µh and µS increase, SC(t) and AC(t) show an increasing trend but
ωF(t) decreases. On increasing the probability of perfect switching q, it is found that
SC(t) and AC(t) decrease while ωF(t) increases. With respect to parameter α,
SC(t) decreases but AC(t) and ωF(t) remain almost constant.

In figs 5B.2(a)-5B.2(f), we plot the reliability against time t for
different system parameters. In figs 5B.2(a) and 5B.2(b), it is observed that reliability
decreases as time increases. Also, as λh increases, the reliability decreases; however,
we observe the reverse effect on increasing µh. The reliability
initially decreases with time and after some time becomes almost constant, as seen in figs
5B.2(c) and 5B.2(d). We also notice that the reliability decreases sharply on
increasing the values of λs; it increases as µs increases. From figs 5B.2(e) and 5B.2(f),
it can be observed that the reliability initially decreases sharply and thereafter decreases
slowly for the higher values of α and q, but after some time it becomes
constant.

Overall, from the tables and figures we can conclude that the availability AC(t)
and reliability R(t) decrease with time. As expected, the expected number of failed hardware
components increases as the failure rate of hardware components increases, and
the expected number of failed software components increases as the failure rate of
software components increases. On the basis of the
numerical results, it can be concluded that a system works more effectively with
the adequate support of standbys and a repair facility.

5B.7 Conclusion

In this chapter, explicit expressions for the reliability, availability and other
performance measures are provided. It is noticed that the system reliability can be
improved by providing sufficient standbys and sufficient repairmen for an embedded
system which contains both hardware and software components. Our numerical
results indicate that switching failure has a significant effect on the reliability;
incorporating switching failure therefore makes our model more realistic
and versatile for dealing with real-time systems. Our future research will look at the
reliability analysis of the system with an unreliable server and/or server vacation.



Fig. 5B.1: State transition diagram


Table 5B.1(a): Performance indices for different values of λh

     |              λs=0.2               |              λs=0.5               |              λs=0.8
  t  | FH(t)  FS(t)  SC(t)  AC(t)  ωF(t) | FH(t)  FS(t)  SC(t)  AC(t)  ωF(t) | FH(t)  FS(t)  SC(t)  AC(t)  ωF(t)
  0  | 0.00   0.00   3.00   1.00   0.000 | 0.00   0.00   3.00   1.00   0.000 | 0.00   0.00   3.00   1.00   0.000
  2  | 1.09   0.87   1.27   0.72   0.000 | 0.72   2.53   0.55   0.54   0.005 | 0.44   3.99   0.19   0.37   0.023
  4  | 0.97   1.28   1.09   0.68   0.001 | 0.43   3.79   0.25   0.40   0.019 | 0.19   5.27   0.04   0.22   0.066
  6  | 0.90   1.52   0.99   0.65   0.001 | 0.31   4.39   0.15   0.33   0.032 | 0.14   5.59   0.02   0.18   0.084
  8  | 0.86   1.67   0.93   0.64   0.002 | 0.26   4.66   0.10   0.30   0.039 | 0.13   5.67   0.01   0.17   0.090
 10  | 0.84   1.76   0.90   0.63   0.002 | 0.24   4.77   0.09   0.28   0.043 | 0.12   5.69   0.01   0.17   0.091
 12  | 0.82   1.82   0.88   0.62   0.002 | 0.23   4.83   0.08   0.28   0.044 | 0.12   5.69   0.01   0.17   0.091


Table 5B.1(b): Performance indices for different values of λS

     |              λh=0.3               |              λh=0.6               |              λh=0.9
  t  | FH(t)  FS(t)  SC(t)  AC(t)  ωF(t) | FH(t)  FS(t)  SC(t)  AC(t)  ωF(t) | FH(t)  FS(t)  SC(t)  AC(t)  ωF(t)
  0  | 0.00   0.00   3.00   1.00   0.000 | 0.00   0.00   3.00   1.00   0.000 | 0.00   0.00   3.00   1.00   0.000
  2  | 0.84   1.97   0.75   0.60   0.002 | 1.36   1.96   0.51   0.53   0.011 | 1.77   1.94   0.35   0.47   0.028
  4  | 0.57   3.03   0.44   0.49   0.009 | 0.96   3.01   0.30   0.43   0.026 | 1.28   2.97   0.21   0.39   0.054
  6  | 0.45   3.61   0.30   0.42   0.016 | 0.76   3.58   0.21   0.38   0.042 | 1.03   3.55   0.15   0.35   0.079
  8  | 0.38   3.92   0.23   0.38   0.022 | 0.66   3.90   0.16   0.35   0.053 | 0.90   3.86   0.11   0.32   0.096
 10  | 0.35   4.09   0.20   0.37   0.025 | 0.61   4.06   0.14   0.33   0.059 | 0.84   4.02   0.10   0.31   0.105
 12  | 0.33   4.18   0.18   0.36   0.026 | 0.58   4.15   0.13   0.32   0.063 | 0.80   4.11   0.09   0.30   0.111


Table 5B.1(c): Performance indices for different values of α

     |               α=0.1               |               α=0.2               |               α=0.3
  t  | FH(t)  FS(t)  SC(t)  AC(t)  ωF(t) | FH(t)  FS(t)  SC(t)  AC(t)  ωF(t) | FH(t)  FS(t)  SC(t)  AC(t)  ωF(t)
  0  | 0.00   0.00   3.00   1.00   0.000 | 0.00   0.00   3.00   1.00   0.000 | 0.00   0.00   3.00   1.00   0.000
  2  | 1.48   1.95   0.46   0.51   0.015 | 1.49   1.95   0.45   0.51   0.015 | 1.50   1.95   0.45   0.51   0.015
  4  | 1.06   3.00   0.27   0.42   0.034 | 1.06   3.00   0.27   0.42   0.034 | 1.07   3.00   0.26   0.42   0.034
  6  | 0.85   3.57   0.19   0.37   0.053 | 0.85   3.57   0.19   0.37   0.053 | 0.86   3.57   0.18   0.37   0.053
  8  | 0.74   3.88   0.15   0.34   0.066 | 0.74   3.88   0.15   0.34   0.066 | 0.75   3.88   0.14   0.34   0.066
 10  | 0.68   4.05   0.13   0.32   0.074 | 0.69   4.05   0.12   0.32   0.074 | 0.69   4.05   0.12   0.32   0.074
 12  | 0.65   4.14   0.12   0.32   0.078 | 0.66   4.14   0.11   0.32   0.078 | 0.66   4.14   0.11   0.31   0.078


Table 5B.1(d): Performance indices for different values of µh

     |               µh=3                |               µh=4                |               µh=5
  t  | FH(t)  FS(t)  SC(t)  AC(t)  ωF(t) | FH(t)  FS(t)  SC(t)  AC(t)  ωF(t) | FH(t)  FS(t)  SC(t)  AC(t)  ωF(t)
  0  | 0.00   0.00   3.00   1.00   0.000 | 0.00   0.00   3.00   1.00   0.000 | 0.00   0.00   3.00   1.00   0.000
  2  | 1.50   1.95   0.45   0.51   0.015 | 1.20   1.96   0.57   0.55   0.009 | 1.00   1.97   0.66   0.58   0.006
  4  | 1.07   3.00   0.26   0.42   0.034 | 0.85   3.02   0.33   0.45   0.027 | 0.71   3.02   0.38   0.47   0.023
  6  | 0.86   3.57   0.18   0.37   0.053 | 0.68   3.60   0.23   0.39   0.045 | 0.56   3.61   0.26   0.40   0.041
  8  | 0.75   3.88   0.14   0.34   0.066 | 0.59   3.91   0.18   0.36   0.058 | 0.49   3.92   0.20   0.37   0.054
 10  | 0.69   4.05   0.12   0.32   0.074 | 0.54   4.07   0.15   0.34   0.065 | 0.45   4.09   0.17   0.35   0.061
 12  | 0.66   4.14   0.11   0.31   0.078 | 0.52   4.16   0.14   0.33   0.069 | 0.43   4.17   0.16   0.34   0.065


Table 5B.1(e): Performance indices for different values of µS

     |               µS=2                |               µS=3                |               µS=4
  t  | FH(t)  FS(t)  SC(t)  AC(t)  ωF(t) | FH(t)  FS(t)  SC(t)  AC(t)  ωF(t) | FH(t)  FS(t)  SC(t)  AC(t)  ωF(t)
  0  | 0.00   0.00   3.00   1.00   0.000 | 0.00   0.00   3.00   1.00   0.000 | 0.00   0.00   3.00   1.00   0.0000
  2  | 1.75   1.26   0.66   0.57   0.009 | 1.90   0.85   0.80   0.61   0.006 | 1.99   0.60   0.89   0.63   0.0041
  4  | 1.56   1.64   0.57   0.54   0.012 | 1.84   0.96   0.77   0.60   0.006 | 1.97   0.63   0.88   0.63   0.0041
  6  | 1.49   1.80   0.54   0.53   0.014 | 1.82   0.98   0.77   0.60   0.006 | 1.97   0.63   0.88   0.63   0.0042
  8  | 1.46   1.86   0.53   0.52   0.014 | 1.82   0.99   0.77   0.60   0.006 | 1.97   0.63   0.88   0.63   0.0042
 10  | 1.45   1.89   0.52   0.52   0.015 | 1.82   0.99   0.77   0.60   0.006 | 1.97   0.63   0.88   0.63   0.0042
 12  | 1.45   1.90   0.52   0.52   0.015 | 1.82   0.99   0.77   0.60   0.006 | 1.97   0.63   0.88   0.63   0.0042


Table 5B.1(f): Performance indices for different values of q

     |               q=0.3               |               q=0.5               |               q=0.7
  t  | FH(t)  FS(t)  SC(t)  AC(t)  ωF(t) | FH(t)  FS(t)  SC(t)  AC(t)  ωF(t) | FH(t)  FS(t)  SC(t)  AC(t)  ωF(t)
  0  | 0.00   0.00   3.00   1.00   0.000 | 0.00   0.00   3.00   1.00   0.000 | 0.00   0.00   3.00   1.00   0.000
  2  | 1.15   1.96   0.62   0.56   0.012 | 1.27   1.96   0.56   0.54   0.013 | 1.39   1.96   0.50   0.52   0.014
  4  | 0.87   3.01   0.35   0.45   0.033 | 0.94   3.01   0.32   0.44   0.033 | 1.00   3.00   0.29   0.43   0.034
  6  | 0.72   3.58   0.24   0.38   0.052 | 0.77   3.58   0.22   0.38   0.053 | 0.81   3.58   0.20   0.37   0.053
  8  | 0.64   3.89   0.19   0.35   0.066 | 0.68   3.89   0.17   0.35   0.066 | 0.71   3.89   0.16   0.34   0.066
 10  | 0.60   4.06   0.16   0.33   0.073 | 0.63   4.06   0.15   0.33   0.073 | 0.66   4.05   0.13   0.33   0.073
 12  | 0.58   4.14   0.15   0.32   0.077 | 0.61   4.14   0.13   0.32   0.077 | 0.63   4.14   0.12   0.32   0.078


Fig. 5B.2(a): Reliability R(t) vs time t by varying λh (curves for λh = 0.3, 0.6, 0.9)

Fig. 5B.2(b): Reliability R(t) vs time t by varying µh (curves for µh = 3, 4, 5)

Fig. 5B.2(c): Reliability R(t) vs time t by varying λS (curves for λs = 0.4, 0.6, 0.8)

Fig. 5B.2(d): Reliability R(t) vs time t by varying µS (curves for µS = 2, 3, 4)

Fig. 5B.2(e): Reliability R(t) vs time t by varying α (curves for α = 0, 0.3, 0.5)

Fig. 5B.2(f): Reliability R(t) vs time t by varying q (curves for q = 0.1, 0.5, 0.9)

Chapter-6

Availability Analysis of Repairable Redundant System with Reboot Delay

Section-6A Hardware-Software System with Switching Failure

Section-6B Repairable System with Warm Standby and Switching Failure

Section-6A

Hardware-Software System with Switching Failure

6A.1 Introduction

6A.2 Model Description and Governing Equations

6A.3 Availability Prediction

6A.4 Performance Measures

6A.5 Sensitivity Analysis

6A.6 Conclusions


In this section, we present the availability analysis of a

system having N software components, M hardware components,

S warm standby units and two types of repairmen. The switching

failures, common cause failure and reboot delay are also

considered. The life time, repair time and delay time of reboot of

the software and hardware components are assumed to be

exponentially distributed. Numerical results are provided for

various performance indices. Sensitivity analysis is also carried

out to explore the changes in the availability characteristics for

the variation of different system parameters. The Adaptive Network-based
Fuzzy Inference System (ANFIS) approach is also

employed to exhibit the scope of soft computing for numerical

tractability.

6A.1 Introduction

System availability is an increasingly important issue in power plants,
manufacturing systems, industrial systems and standby systems. To maintain a
high degree of quality, availability prediction is often an essential requisite. No
production system can work continuously, because failures of components
during operation interrupt it.

Spare parts support has been widely used for improving the system reliability

and availability. There is enormous literature available on machine interference

problems with spares. The reliability analysis has been an area of interest for many

researchers due to its applications in many organizations working in machining

environments. Aven (1990) gave availability formulae for standby systems of
similar units that are preventively maintained. Subramanian and Anantharaman
(1995) and Azaron et al. (2005) carried out time-dependent reliability analyses of
complex standby redundancy systems. Jain et al. (2004) provided
performance prediction for a machine interference model with spares and two modes of failure. Cha et
al. (2008) modeled a general standby system and provided various indices for
predicting its performance. A standby system with human error failures,
hardware failures and preventive maintenance was studied by Mahmoud and Moshref
(2010). Wang et al. (2011a) proposed an availability model and a parameter
estimation method for the delay time model with imperfect maintenance at inspection.

In the past decade, many researchers have worked on the reliability and

availability of the machining system with warm standby. A system with warm

standby has been studied in detail by considering different conditions, such as

repairable system and human error (Guo and Hua, 2003), general repair times (Wang

et al. 2005) and components having proportional hazard rates (Li et al., 2009a).

Mokaddis et al. (1997) analyzed two unit warm standby system subject to

degradation. Wang et al. (2004) studied the reliability and sensitivity analysis of a

system with M operating machines, S warm standbys, and a repairable service station

and presented derivations for the system reliability and the mean time to system

failure. Recently, Yun and Cha (2010) discussed a standby system with two units and
determined the optimal switching time which maximizes the expected system life,
together with a related allocation problem. Stability analysis of a new kind of n-unit series
repairable system was carried out by Guo et al. (2011).

Most studies about the reliability of a system assume that the switchover from
a warm standby unit to a primary one is always perfect. In real-life situations, however, this
assumption merely simplifies the analysis of the problem, because a warm standby unit
might not be able to switch over to a primary unit successfully. The concept of

imperfect switching has been discussed by many researchers. Goel and Shrivastava

(1992) studied two unit standby systems with imperfect switch, preventive

maintenance and correlated failures and repairs. Hsieh and Hsieh (2003) discussed

the reliability and cost optimization in distributed computer systems. Huang et al.

(2006) suggested parametric nonlinear programming approach for a repairable system

with switching failure and fuzzy parameter. Wang et al. (2006) and Wang and Chen

(2009) compared the reliability/availability of the warm standby systems with general

repair times, reboot delay and switching failures. Levitin and Amari (2010) studied a

k-out-of-n system with shared standby elements along with algorithm for evaluating

the time-to-failure distribution. Kancev and Cepin (2011) evaluated risk and cost
using age-dependent unavailability modeling of test and maintenance for
standby components.

Redundant repairable systems have been studied extensively in the past, and a

detailed bibliography can be found in Zhang and Horigome (2001). They analyzed

the systems in which one or more components can fail simultaneously due to a

cumulative shock-damage process. They provided the reliability and availability

analysis of repairable systems, where failure and repair rates of components can be

varying with time. Yadavalli et al. (2002) studied the steady-state availability of a

two-unit parallel system with the introduction of preparation time for the repair

facility. Seo et al. (2003) proposed the life time and reliability estimation of

repairable redundant system subject to periodic alternation. Balsamo et al. (2004)

described model-based performance prediction in software development. Raj Kiran and
Ravi (2008) developed ensemble models to forecast software reliability accurately;
the ensembles comprise various statistical and intelligent techniques, such as a neuro-fuzzy inference system.
Reliability analysis for a k/n(F) system with repair

was done by Zhang and Wu (2009). Levitin and Xing (2010) analyzed the reliability

and performance of multi-state systems with propagated failures having selective

effect. Wang et al. (2011b) developed the availability model and proposed the

parameter estimation method for the delay time model with imperfect maintenance at

inspection.

Common cause failure, where a single failure event can propagate and cause

failure of more than one unit in the system, has drawn the attention of many

investigators working in the area of system reliability and availability. There are

several internal factors (e.g. designing deficiencies, fabrication, etc.) and external

factors (e.g. environmental conditions like temperature, dust and humidity, power

failure, fire, flood, earthquake, etc.) which can lead to common cause failure. The

study of systems that incorporate common cause failure, whether to predict
the behavior of new designs or to assess possible changes
to existing ones, is a challenging task. Hu (2006) discussed a repairable system with
warm standbys under common cause failure. Reliability of standby safety systems due
to independent and common cause failures was calculated by Lu and Lewis (2006). Li
et al. (2010) considered the heterogeneous redundancy optimization for multi-state
series-parallel systems subject to common cause failure.

Some researchers have analyzed the reliability of
repairable hardware systems with standbys, switching failures and other
operating conditions. However, to the best of our knowledge, the reliability and availability
of a system comprising both hardware and software components have not been analyzed under these
conditions. The present investigation is concerned with the reliability and availability
analysis of a system having both hardware and software components,
taking into account the realistic situations of warm standby provisioning, switching failures,
common cause failure and reboot delay. In section 6A.2, we describe the model and
provide the notations used throughout the chapter. In section 6A.3, the analysis is
provided along with an illustration of the proposed model. Performance measures are
derived in section 6A.4. Numerical results are given in section 6A.5.
Finally, conclusions highlighting the key features of our study are drawn in section
6A.6.

6A.2 Model Description and Governing Equations

Consider the performance analysis, based on availability measures, of a
hardware-software system. The embedded system consists of N software and M
hardware components, with the provisioning of S warm standby hardware
components. The concepts of switching failures and reboot delay are taken into
consideration. At time t=0, the system is in working state, i.e., there are no failed
components. The life-times of software components and hardware components are
exponentially distributed with means 1/λS and 1/λh, respectively. There is provision for
repair of failed components; the repair times are exponentially distributed with
parameters µS and µh for software and hardware components, respectively. When an
operating hardware component fails, a warm standby hardware unit is immediately
substituted in its place and the failed component is sent for repair. The model is
based on the following assumptions and notations:

Each of the operating units fails independently of the state of the others.

When the repair of a failed hardware/software component is completed, it is as good as a new one.

A repaired hardware component joins the operating group if fewer than M components are operating; otherwise it joins the standby group.

The standby hardware components fail independently of the state of all others and have an exponential life time distribution with parameter α.

The system may also fail due to failure of the power supply unit.

When all hardware standbys are exhausted, the remaining operating hardware components fail with degraded failure rate λd.

When a standby hardware component moves into an operating state, its characteristics are the same as those of an operating component.

The switching device, which is used to replace a failed hardware component by a standby hardware component, is subject to failure with probability q during the switching from standby state to operating state.

After the switching, a reboot delay takes place, which is exponentially distributed with mean time 1/β.
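
The switching-failure assumption can be made concrete with a short calculation. The sketch below shows one possible reading, which is an assumption rather than a statement from the model description: each switching attempt fails independently with probability q, and a failed attempt moves on to the next available standby. Under that reading, r failed attempts followed by a success has probability q^r (1 - q), and all k standbys fail to switch in with probability q^k, which matches the geometric weights that appear in the governing equations. The function name `switchover_distribution` is hypothetical.

```python
def switchover_distribution(q, standbys):
    """Return (success_probs, all_fail) for one hardware failure.

    success_probs[r] is the probability that exactly r switching attempts
    fail before a standby is switched in; all_fail is the probability that
    every available standby fails to switch in.
    """
    success_probs = [(q ** r) * (1.0 - q) for r in range(standbys)]
    all_fail = q ** standbys
    return success_probs, all_fail


# Assumed illustration values: q = 0.5 with two standbys available.
success_probs, all_fail = switchover_distribution(0.5, 2)
print(success_probs, all_fail)
```

The outcomes are exhaustive, so the listed probabilities always add up to one.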

Notations

Following notations have been used for formulating the mathematical model:

M (N) : Number of hardware (software) operating components.

S : Number of warm standby hardware components.

λh (λS) : Failure rate of a hardware (software) operating component.

λd : Degraded failure rate of a hardware operating component.

λp : Power supply failure rate of the system.

α : Failure rate of a warm standby hardware component.

µh (µS) : Repair rate of a failed hardware (software) component.

q : Switching failure probability of a hardware standby component.

1/β : Mean reboot delay when a hardware standby component is switched to an operating component.

P(i,j,k) : Steady state probability that there are i operating software components, j operating hardware components and k standby hardware components in the system, where i=0,1,2,…,N, j=0,1,2,…,M and k=0,1,2,…,S.

R(t) : Reliability function of the system at time t.

Fig. 6A.1: State transition diagram

Chapman-Kolmogorov equations governing the model are constructed by

using appropriate transition rates as follows (see transition diagram given in fig.

6A.1):

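\[ -(M\lambda_h + S\alpha + N\lambda_S + \lambda_p)P_{(N,M,S)} + \mu_h P_{(N,M,S-1)} + \mu_S P_{(N-1,M,S)} = 0 \tag{6A.1} \]

\[ -[M\lambda_h + (S-n)\alpha + N\lambda_S + \beta + \mu_h + \lambda_p]P_{(N,M,S-n)} + (S-n+1)\alpha P_{(N,M,S-n+1)} + \mu_h P_{(N,M,S-n-1)} + \mu_S P_{(N-1,M,S-n)} = 0, \quad 1 \le n \le S-1 \tag{6A.2} \]

\[ -[M\lambda_h + N\lambda_S + \beta + \mu_h + \lambda_p]P_{(N,M,0)} + \alpha P_{(N,M,1)} + \mu_h P_{(N,M-1,0)} + \mu_S P_{(N-1,M,0)} = 0 \tag{6A.3} \]

\[ -\lambda_p P_{(N,M-1,S)} + M\lambda_h(1-q)P_{(N,M,S)} + \beta P_{(N,M,S-1)} = 0 \tag{6A.4} \]

\[ -\lambda_p P_{(N,M-1,S-n)} + \beta P_{(N,M,0)} + M\lambda_h\sum_{r=0}^{n} q^{n-r}(1-q)P_{(N,M,S-r)} = 0, \quad 1 \le n \le S-1 \tag{6A.5} \]

\[ -[(M-1)\lambda_d + \mu_h + \lambda_p]P_{(N,M-1,0)} + M\lambda_h P_{(N,M,0)} + \mu_h P_{(N,M-2,0)} + M\lambda_h\sum_{n=0}^{S-1} q^{S-n}P_{(N,M,S-n)} = 0 \tag{6A.6} \]

\[ -\mu_h P_{(L_h)} + \lambda_d P_{(L_h-1)} = 0 \tag{6A.7} \]

\[ -[M\lambda_h + S\alpha + (N-i)\lambda_S + \mu_S + \lambda_p]P_{(N-i,M,S)} + \mu_h P_{(N-i,M,S-1)} + \mu_S P_{(N-i-1,M,S)} + (N-i+1)\lambda_S P_{(N-i+1,M,S)} = 0, \quad 1 \le i \le N-1 \tag{6A.8} \]

\[ -[M\lambda_h + (S-k)\alpha + (N-i)\lambda_S + \beta + \mu_h + \mu_S + \lambda_p]P_{(N-i,M,S-k)} + (S-k+1)\alpha P_{(N-i,M,S-k+1)} + \mu_h P_{(N-i,M,S-k-1)} + \mu_S P_{(N-i-1,M,S-k)} + (N-i+1)\lambda_S P_{(N-i+1,M,S-k)} = 0, \quad 1 \le i \le N-1,\ 1 \le k \le S-1 \tag{6A.9} \]

\[ -[M\lambda_h + (N-i)\lambda_S + \beta + \mu_h + \mu_S + \lambda_p]P_{(N-i,M,0)} + \alpha P_{(N-i,M,1)} + \mu_h P_{(N-i,M-1,0)} + \mu_S P_{(N-i-1,M,0)} + (N-i+1)\lambda_S P_{(N-i+1,M,0)} = 0, \quad 1 \le i \le N-1 \tag{6A.10} \]

\[ -\lambda_p P_{(N-i,M-1,S)} + M\lambda_h(1-q)P_{(N-i,M,S)} + \beta P_{(N-i,M,S-1)} = 0 \tag{6A.11} \]

\[ -\lambda_p P_{(N-i,M-1,S-k)} + \beta P_{(N-i,M,0)} + M\lambda_h\sum_{r=0}^{k} q^{k-r}(1-q)P_{(N-i,M,S-r)} = 0, \quad 1 \le i \le N-1,\ 1 \le k \le S-1 \tag{6A.12} \]

\[ -[(M-j)\lambda_d + \mu_h + \lambda_p]P_{(N-i,M-j,0)} + M\lambda_h P_{(N-i,M,0)} + \mu_h P_{(N-i,M-j-1,0)} + M\lambda_h\sum_{n=0}^{S-1} q^{S-n}P_{(N-i,M,S-n)} = 0, \quad 1 \le i \le N-1,\ 1 \le j \le M \tag{6A.13} \]

\[ -\mu_h P_{(L_S)} + \lambda_d P_{(L_S-1)} = 0 \tag{6A.14} \]

\[ -\mu_S P_{(0,M,S)} + \lambda_S P_{(1,M,S)} = 0 \tag{6A.15} \]

\[ -\mu_S P_{(0,M,S-k)} + \lambda_S P_{(1,M,S-k)} = 0, \quad 1 \le k \le S \tag{6A.16} \]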



To solve equations (6A.1)-(6A.16), we employ numerical method based on successive

over relaxation (SOR). To examine the tractability of proposed approach, we consider

an illustration and outline the solution procedure in the next section.
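
The idea of successive over relaxation for steady-state balance equations can be sketched on a small chain. This is only an illustrative sketch under assumptions: the three-state chain and its rates are hypothetical and are not the system (6A.1)-(6A.16), and the function name `sor_steady_state`, the relaxation factor `omega` and the stopping tolerance are choices made for this example. Each sweep updates one probability at a time from its balance equation, blends the update with the old value through `omega`, and then renormalizes so that the probabilities sum to one.

```python
def sor_steady_state(rates, n, omega=1.1, tol=1e-12, max_sweeps=10000):
    """Steady-state probabilities of an n-state Markov chain by SOR sweeps.

    rates[(u, v)] is the transition rate from state u to state v.
    """
    out_rate = [0.0] * n                 # total rate out of each state
    inflow = [[] for _ in range(n)]      # (source state, rate) pairs into each state
    for (u, v), r in rates.items():
        out_rate[u] += r
        inflow[v].append((u, r))

    p = [1.0 / n] * n                    # uniform starting guess
    for _ in range(max_sweeps):
        change = 0.0
        for i in range(n):
            # Balance equation for state i: p_i * out_rate_i = sum of inflows.
            gauss_seidel = sum(r * p[j] for j, r in inflow[i]) / out_rate[i]
            new = (1.0 - omega) * p[i] + omega * gauss_seidel
            change = max(change, abs(new - p[i]))
            p[i] = new
        total = sum(p)
        p = [x / total for x in p]       # enforce the normalizing condition
        if change < tol:
            break
    return p


# Hypothetical chain: failure rate 1.0 and repair rate 2.0 between neighbouring states.
rates = {(0, 1): 1.0, (1, 0): 2.0, (1, 2): 1.0, (2, 1): 2.0}
pi = sor_steady_state(rates, 3)
print(pi)
```

For this toy chain the exact steady state is (4/7, 2/7, 1/7), so the iteration can be checked directly; with `omega=1.0` the scheme reduces to Gauss-Seidel.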

6A.3 Availability Prediction

In this section, we consider an embedded computer system having 2 software

components, 3 hardware operating components and 2 standby hardware components.

The difference equations associated with the system states are constructed as follows:

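$\mu_h P_{(2,3,1)} + \mu_S P_{(1,3,2)} - (3\lambda_h + 2\alpha + 2\lambda_S + \lambda_p) P_{(2,3,2)} = 0$ …(6A.17)

$2\alpha P_{(2,3,2)} + \mu_h P_{(2,3,0)} + \mu_S P_{(1,3,1)} - (3\lambda_h + \alpha + 2\lambda_S + \beta + \mu_h + \lambda_p) P_{(2,3,1)} = 0$ …(6A.18)

$\alpha P_{(2,3,1)} + \mu_h P_{(2,2,0)} + \mu_S P_{(1,3,0)} - (3\lambda_h + 2\lambda_S + \beta + \mu_h + \lambda_p) P_{(2,3,0)} = 0$ …(6A.19)

$3\lambda_h P_{(2,3,0)} + \mu_h P_{(2,1,0)} + 3\lambda_h q^2 P_{(2,3,2)} + 3\lambda_h q P_{(2,3,1)} - (2\lambda_d + \mu_h + \lambda_p) P_{(2,2,0)} = 0$ …(6A.20)

$2\lambda_d P_{(2,2,0)} + \mu_h P_{(2,0,0)} - (\lambda_d + \mu_h + \lambda_p) P_{(2,1,0)} = 0$ …(6A.21)

$\lambda_d P_{(2,1,0)} - \mu_h P_{(2,0,0)} = 0$ …(6A.22)

$3\lambda_h (1-q) P_{(2,3,2)} + \beta P_{(2,3,1)} - \lambda_p P_{(2,2,2)} = 0$ …(6A.23)

$3\lambda_h (1-q) P_{(2,3,1)} + \beta P_{(2,3,0)} + 3\lambda_h q(1-q) P_{(2,3,2)} - \lambda_p P_{(2,2,1)} = 0$ …(6A.24)

$\mu_h P_{(1,3,1)} + 2\lambda_S P_{(2,3,2)} + \mu_S P_{(0,3,2)} - (3\lambda_h + 2\alpha + \lambda_S + \mu_S + \lambda_p) P_{(1,3,2)} = 0$ …(6A.25)

$2\alpha P_{(1,3,2)} + \mu_h P_{(1,3,0)} + 2\lambda_S P_{(2,3,1)} + \mu_S P_{(0,3,1)} - (3\lambda_h + \alpha + \lambda_S + \mu_S + \beta + \mu_h + \lambda_p) P_{(1,3,1)} = 0$ …(6A.26)

$\alpha P_{(1,3,1)} + \mu_h P_{(1,2,0)} + 2\lambda_S P_{(2,3,0)} + \mu_S P_{(0,3,0)} - (3\lambda_h + \lambda_S + \mu_S + \beta + \mu_h + \lambda_p) P_{(1,3,0)} = 0$ …(6A.27)

$3\lambda_h P_{(1,3,0)} + \mu_h P_{(1,1,0)} + 3\lambda_h q^2 P_{(1,3,2)} + 3\lambda_h q P_{(1,3,1)} - (2\lambda_d + \mu_h + \lambda_p) P_{(1,2,0)} = 0$ …(6A.28)

$2\lambda_d P_{(1,2,0)} + \mu_h P_{(1,0,0)} - (\lambda_d + \mu_h + \lambda_p) P_{(1,1,0)} = 0$ …(6A.29)

$\lambda_d P_{(1,1,0)} - \mu_h P_{(1,0,0)} = 0$ …(6A.30)

$3\lambda_h (1-q) P_{(1,3,2)} + \beta P_{(1,3,1)} - \lambda_p P_{(1,1,2)} = 0$ …(6A.31)

$3\lambda_h (1-q) P_{(1,3,1)} + 3\lambda_h q(1-q) P_{(1,3,2)} + \beta P_{(1,3,0)} - \lambda_p P_{(1,1,1)} = 0$ …(6A.32)

$\lambda_S P_{(1,3,2)} - \mu_S P_{(0,3,2)} = 0$ …(6A.33)

$\lambda_S P_{(1,3,1)} - \mu_S P_{(0,3,1)} = 0$ …(6A.34)

$\lambda_S P_{(1,3,0)} - \mu_S P_{(0,3,0)} = 0$ …(6A.35)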

The normalizing condition is

$\sum_{i=0}^{2} \sum_{j=0}^{3} \sum_{k=0}^{2} P_{(i,j,k)} = 1$ …(6A.36)

For equations (6A.17)-(6A.36), we get the matrix equation as

AP=0

and A1, A2, ….. etc. are given as follows:

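$A = \begin{bmatrix} A_1 & A_2 & A_3 \\ A_4 & A_5 & A_6 \\ A_7 & A_8 & A_9 \end{bmatrix}$

$A_1 = \begin{bmatrix} B_1 & B_2 \\ B_3 & B_4 \end{bmatrix}$, $A_2 = \begin{bmatrix} C_1 & C_2 \\ C_3 & C_4 \end{bmatrix}$, $A_3 = [0]_{8 \times 6}$,

$A_4 = \begin{bmatrix} D_1 & D_2 \\ D_3 & D_4 \end{bmatrix}$, $A_5 = \begin{bmatrix} E_1 & E_2 \\ E_3 & E_4 \end{bmatrix}$, $A_6 = \begin{bmatrix} F_1 & F_2 \\ F_3 & F_4 \end{bmatrix}$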

64ppppppp

7

0000000000000000000000000

A

×

λλλλλλλ

= ,

64pppppp

S

S

S

8 000000000000000

A

×

λλλλλλλ

λλ

=

64pp

S

S

S

9

0000000000000000000

A

×

λλµ−

µ−µ−

=



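$B_1 = \begin{bmatrix} -(3\lambda_h + 2\alpha + 2\lambda_S + \lambda_p) & \mu_h & 0 & 0 \\ 2\alpha & -(3\lambda_h + \alpha + 2\lambda_S + \beta + \mu_h + \lambda_p) & \mu_h & 0 \\ 0 & \alpha & -(3\lambda_h + 2\lambda_S + \beta + \mu_h + \lambda_p) & \mu_h \\ 3\lambda_h q^2 & 3\lambda_h q & 3\lambda_h & -(2\lambda_d + \mu_h + \lambda_p) \end{bmatrix}_{4 \times 4}$

$B_2 = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ \mu_h & 0 & 0 & 0 \end{bmatrix}_{4 \times 4}$, $B_3 = \begin{bmatrix} 0 & 0 & 0 & 2\lambda_d \\ 0 & 0 & 0 & 0 \\ 3\lambda_h(1-q) & \beta & 0 & 0 \\ 3\lambda_h q(1-q) & 3\lambda_h(1-q) & \beta & 0 \end{bmatrix}_{4 \times 4}$

$B_4 = \begin{bmatrix} -(\lambda_d + \mu_h + \lambda_p) & \mu_h & 0 & 0 \\ \lambda_d & -\mu_h & 0 & 0 \\ 0 & 0 & -\lambda_p & 0 \\ 0 & 0 & 0 & -\lambda_p \end{bmatrix}_{4 \times 4}$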

34

S

S

S

1

00000

0000

C

×

µµ

µ

= , [ ] 342 0C ×= , [ ] 343 0C ×= , [ ] 344 0C ×=

44

S

S

S

1

0000

000200

020002

D

×

λλ

λ

= , [ ] 442 0D ×= , [ ] 443 0D ×= , [ ] 444 0D ×=

34h3qh32qh3

hpSSh30

hShpSh32

0hpSS22h3

1E

×λλλ

µ+λ+β+λ+µ+λ−α

µµ+µ+λ+β+λ+α+λ−α

µλ+µ+λ+α+λ−

=

( )34hphd

h2

0200000000

E

×

µλ+µ+λ−µ

= , ( )( ) ( )

34hh

h3

q13q1q30q13000000

E

×

β−λ−λ−β−λ

=



( )

34

hd

hpdhd

4

000000

02

E

×

µ−λµλ+λ+µ−λ

=

34

S

1

000000000

00

F

×

µ

= ,

34

S

S2

0000000000

F

×

µµ

= ,

34p

p3

0000000000

F

×

λ−λ−

= , [ ] 344 0F ×=

The probability vector P is given by

$P = [P_1, P_2, P_3]^T$

where

$P_1 = [P_{(2,3,2)}, P_{(2,3,1)}, P_{(2,3,0)}, P_{(2,2,0)}, P_{(2,1,0)}, P_{(2,0,0)}, P_{(2,2,2)}, P_{(2,2,1)}]$

$P_2 = [P_{(1,3,1)}, P_{(1,3,0)}, P_{(1,2,0)}, P_{(1,1,0)}, P_{(1,0,0)}, P_{(1,1,2)}, P_{(1,1,1)}]$

$P_3 = [P_{(0,3,2)}, P_{(0,3,1)}, P_{(0,3,0)}]$

We have used the successive over relaxation (SOR) technique which is a

powerful numerical method for solving a linear system of equations.


In this section, we find the expressions for various performance measures in terms

of probabilities determined by SOR method as follows:

The availability of the system is given by

$A = 1 - \left[ \sum_{i=1}^{N} P_{(i,0,0)} + \sum_{k=0}^{S} P_{(0,M,k)} + \sum_{k=1}^{S} P_{(N,M-1,k)} + \sum_{k=1}^{S} P_{(N-1,M-2,k)} \right]$ …(6A.36)

The system availability, when both software are working, is given by

( ) ( )

+−= ∑ ∑

= =

S

0k

S

1kk1,-MN,kN,0,2S PP1A …(6A.36)

The system availability, when one software is working, is given by

( ) ( )

+−= ∑ ∑

= =

S

0k

S

1kk2,-M1,-Nk1,0,-N2S PP1A …(6A.36)



The frequency of the system failure is

( ) ( )( ) ( ) ( )

+−λ+

λ=ω ∑ ∑∑

= =−−−

=

S

1k

S

1kk,2M,1Nk,1M,N

N

1i0,0,idf PPq13P …(6A.36)

6A.5 Sensitivity Analysis

In this section, we present a computational experiment for exploring the system availability and other performance measures. The numerical results obtained from the SOR (successive over-relaxation) technique are compared with the neuro-fuzzy results obtained by building an Adaptive Network Based Fuzzy Inference System (ANFIS) in MATLAB 7.4. The neuro-fuzzy (NF) method is characterized by its membership functions, which define the tolerance limits. We use the Gaussian function for describing the membership function. For all approximations, the ANFIS is trained for 100 epochs. For illustration purposes, we fix the default parameters as β=0.8, μS=1, μh=2, α=0.5, q=0.5, λh=0.3, λd=0.7, λp=0.9 and λS=0.4.

Tables 6A.1(a)-6A.1(d) show various performance measures for the variation of different parameters. In table 6A.1(a), by varying the degradation failure rate λd and the switching failure probability q, we observe that as λd increases, A, AS2, AS1 and ωf decrease. Tables 6A.1(b)-6A.1(d) reveal similar observations for the variation in λS, β and α, respectively.

Figs 6A.2(a)-6A.2(c) are drawn for the availability with the variation in the failure rate of hardware components (λh). Figs 6A.3(a)-6A.3(c) exhibit the availability for varying values of the repair rate of the software components (μS). The fuzzy membership functions for λh and μS are shown in figs 6A.2(d) and 6A.3(d), respectively. The effect of the switching failure probability q on the availability can be examined from figs 6A.2(a) and 6A.3(a). It is noted that as λh, q and μS increase, the availability decreases. Further, we display the effect of reboot delay β with respect to λh and μS in figs 6A.2(b) and 6A.3(b), respectively. We observe that as λh and μS increase, the availability decreases, but as β increases, the availability increases in both cases. Figs 6A.2(c) and 6A.3(c) reveal that the availability decreases as λh and μS increase but hardly changes with the failure rate of standby α. The corresponding availability curves for various values of α are almost identical, which shows that the failure rate of the standby has very little effect on the system availability.



Based on sensitivity analysis, we conclude that the availability changes

significantly by changing the parameters λh, λS, q and β. The effect of failure rate (α)

of spare component on the availability is not significant; this may be due to choice of

other parameter values.

6A.6 Conclusion

In this chapter, we have provided the availability analysis of the embedded

system having both hardware and software components under the assumption that the

standby switching at the primary state might fail. The availability and other

performance indices obtained may be helpful to improve the availability of the

concerned system, in particular when reboot and switching failures are prevalent. Our study may be used in computer and communication networks, distributed computing systems, etc.


Table 6A.1(a): Performance indices for different values of λd

        λh=0.1                          λh=0.5                          λh=0.9
q       A       AS2     AS1     ωf      A       AS2     AS1     ωf      A       AS2     AS1     ωf
0       0.9112  0.8363  0.0749  0.7599  0.9043  0.8299  0.0744  0.7599  0.8939  0.8204  0.0735  0.7599
0.1     0.9069  0.8323  0.0746  0.6759  0.8999  0.8259  0.0740  0.6759  0.8894  0.8163  0.0732  0.6759
0.2     0.9026  0.8283  0.0743  0.5939  0.8955  0.8218  0.0737  0.5939  0.8849  0.8121  0.0728  0.5939
0.3     0.8982  0.8243  0.0739  0.5141  0.8911  0.8177  0.0733  0.5141  0.8804  0.8079  0.0725  0.5141
0.4     0.8939  0.8203  0.0736  0.4363  0.8867  0.8137  0.0730  0.4363  0.8759  0.8038  0.0721  0.4363
0.5     0.8895  0.8163  0.0733  0.3607  0.8822  0.8096  0.0727  0.3607  0.8714  0.7996  0.0718  0.3607
0.6     0.8852  0.8123  0.0729  0.2871  0.8778  0.8055  0.0723  0.2871  0.8669  0.7955  0.0714  0.2871
0.7     0.8809  0.8083  0.0726  0.2157  0.8734  0.8015  0.0720  0.2157  0.8624  0.7913  0.0711  0.2157
0.8     0.8765  0.8043  0.0722  0.1463  0.8690  0.7974  0.0716  0.1463  0.8578  0.7871  0.0707  0.1463
0.9     0.8722  0.8003  0.0719  0.0790  0.8646  0.7933  0.0713  0.0790  0.8533  0.7830  0.0704  0.0790


        λS=0.1                          λS=0.5                          λS=0.9
q       A       AS2     AS1     ωf      A       AS2     AS1     ωf      A       AS2     AS1     ωf
0       0.9329  0.9118  0.0211  0.7967  0.8887  0.8000  0.0886  0.7486  0.8488  0.7127  0.1361  0.7072
0.1     0.9283  0.9073  0.0210  0.7088  0.8843  0.7961  0.0882  0.6658  0.8445  0.7091  0.1354  0.6288
0.2     0.9237  0.9028  0.0209  0.6230  0.8798  0.7921  0.0878  0.5850  0.8403  0.7055  0.1348  0.5524
0.3     0.9191  0.8983  0.0208  0.5394  0.8754  0.7881  0.0874  0.5063  0.8361  0.7019  0.1341  0.4779
0.4     0.9145  0.8938  0.0207  0.4579  0.8710  0.7841  0.0870  0.4297  0.8318  0.6983  0.1335  0.4055
0.5     0.9099  0.8893  0.0206  0.3786  0.8666  0.7801  0.0865  0.3552  0.8276  0.6947  0.1328  0.3351
0.6     0.9053  0.8848  0.0205  0.3015  0.8622  0.7761  0.0861  0.2827  0.8233  0.6912  0.1322  0.2667
0.7     0.9007  0.8803  0.0205  0.2265  0.8578  0.7721  0.0857  0.2123  0.8191  0.6876  0.1315  0.2003
0.8     0.8961  0.8758  0.0204  0.1536  0.8534  0.7681  0.0853  0.1440  0.8149  0.6840  0.1309  0.1358
0.9     0.8915  0.8713  0.0203  0.0829  0.8490  0.7641  0.0849  0.0778  0.8106  0.6804  0.1302  0.0734

Table 6A.1(b): Performance indices for different values of λS


        β=0.1                           β=0.5                           β=0.9
q       A       AS2     AS1     ωf      A       AS2     AS1     ωf      A       AS2     AS1     ωf
0.1     0.7084  0.6457  0.0627  0.2735  0.8206  0.7510  0.0696  0.5156  0.9179  0.8432  0.0747  0.7257
0.2     0.7034  0.6411  0.0623  0.2355  0.8160  0.7467  0.0693  0.4512  0.9135  0.8391  0.0744  0.6384
0.3     0.6984  0.6365  0.0619  0.1998  0.8113  0.7424  0.0689  0.3889  0.9091  0.8351  0.0741  0.5531
0.4     0.6934  0.6319  0.0615  0.1665  0.8066  0.7381  0.0685  0.3289  0.9047  0.8310  0.0737  0.4698
0.5     0.6884  0.6273  0.0611  0.1355  0.8019  0.7338  0.0681  0.2710  0.9003  0.8270  0.0734  0.3886
0.6     0.6834  0.6228  0.0606  0.1068  0.7973  0.7295  0.0678  0.2153  0.8960  0.8229  0.0731  0.3095
0.7     0.6784  0.6182  0.0602  0.0805  0.7926  0.7252  0.0674  0.1618  0.8916  0.8188  0.0727  0.2324
0.8     0.6734  0.6136  0.0598  0.0565  0.7879  0.7209  0.0670  0.1105  0.8872  0.8148  0.0724  0.1574
0.9     0.6684  0.6090  0.0594  0.0349  0.7833  0.7166  0.0666  0.0614  0.8828  0.8107  0.0720  0.0845

Table 6A.1(c): Performance indices for different values of β


        α=0.1                           α=0.5                           α=0.9
q       A       AS2     AS1     ωf      A       AS2     AS1     ωf      A       AS2     AS1     ωf
0.1     0.8982  0.8241  0.0740  0.6823  0.8955  0.8218  0.0737  0.6771  0.8931  0.8197  0.0734  0.6724
0.2     0.8933  0.8197  0.0736  0.5990  0.8909  0.8176  0.0733  0.5949  0.8888  0.8157  0.0731  0.5912
0.3     0.8885  0.8152  0.0733  0.5179  0.8864  0.8134  0.0730  0.5148  0.8845  0.8118  0.0727  0.5121
0.4     0.8837  0.8108  0.0729  0.4391  0.8819  0.8093  0.0726  0.4369  0.8803  0.8079  0.0724  0.4349
0.5     0.8788  0.8064  0.0725  0.3625  0.8773  0.8051  0.0723  0.3610  0.8760  0.8039  0.0721  0.3597
0.6     0.8740  0.8020  0.0721  0.2883  0.8728  0.8009  0.0719  0.2873  0.8718  0.8000  0.0718  0.2865
0.7     0.8692  0.7975  0.0717  0.2162  0.8683  0.7967  0.0716  0.2158  0.8675  0.7960  0.0715  0.2153
0.8     0.8644  0.7931  0.0713  0.1465  0.8638  0.7926  0.0712  0.1463  0.8632  0.7921  0.0711  0.1462
0.9     0.8595  0.7887  0.0709  0.0790  0.8592  0.7884  0.0708  0.0790  0.8590  0.7882  0.0708  0.0790

Table 6A.1(d): Performance indices for different values of α


Fig. 6A.2(a): Availability vs λh by varying q

Fig. 6A.2(b): Availability vs λh by varying β

Fig. 6A.2(c): Availability vs λh by varying α

Fig. 6A.2(d): Membership functions for input parameter λh

Fig. 6A.3(a): Availability vs μS by varying q

Fig. 6A.3(b): Availability vs μS by varying β

Fig. 6A.3(c): Availability vs μS by varying α

Fig. 6A.3(d): Membership functions for input parameter μS

Repairable System with Warm Standby and Switching Failure

6B.1 Introduction


6B.3 Availability Prediction for Three Configurations

6B.4 Transient Solution


6B.6 Conclusion

Section-6B


By using Markov process, the system state transitions can be

modeled to predict the system availability in many realistic

applications wherein all components of the system cannot be

treated as identical because of their failure and repair

characteristics. In this chapter, the efforts have been made to

examine the availability characteristics for three different

configurations with warm standby, switching failure and delay of

reboot. For primary and warm standby components, the time-to-

failure, time-to-repair and time-to-delay are assumed to follow the

exponential distribution. The switching of warm standbys to

replace the failed component is subject to failure with probability

q. The numerical results using Runge-Kutta method have been

provided for supporting the analytical results. These results

validate the prediction capability of the proposed analytical

framework of the system incorporating standby, switching and

reboot.

6B.1 Introduction

Multi-component machining systems are used in every field of our life to perform different activities, and we depend heavily on them. As time passes, a system is subject to failure; these failures may result in loss of production, money, goodwill, etc. This situation can be handled by facilitating spare part/standby support as well as the maintenance provided by the repair crew. For ensuring the desired efficiency and availability of the system, many researchers have suggested the provision of a repair facility and standby components. The failure and repair are assumed to be a coupled event in a system working in a machining environment. The system availability is a very important aspect for both the system users and manufacturers because failures may cause much loss in time and cost. System failures are also unavoidable in many safety-critical systems such as banking systems, military systems, nuclear systems and so forth. As the complexity and competitive industrial pressure of the systems are increasing, there is a growing need to understand changes in the availability of a complex repairable system.

The steady-state availability is the probability that a system will be

operational at any random point of time and is expressed as the expected fraction of

time. The availability is more related to major life cycle costs in time and money.

Therefore being able to model availability accurately and use this performance

measure to make design decisions becomes crucial to the ultimate success of any

system working in machining environment. The availability analysis of various

complex systems under different types of failure modes has been taken up from time

to time by several researchers. Kapur and Garg (1990) and Aven (1990) presented

some simple approximation formulae for the compound availability of standby

redundant systems. Jie (1991) and Galikowsky et al. (1996) estimated the reliability

and availability of a series system. Wang and Sivazlian (1997) discussed the life

cycle cost analysis for the availability prediction of system with parallel components.

Abu-Salih et al. (1999) established asymptotic confidence limits for the steady-state

availability for the repair facility. Park and Kim (2002) studied software

rejuvenation that follows a proactive fault-tolerant approach to handle software-origin

system failure. They mapped the software rejuvenation and switchover states with a

semi-Markov process and obtained the mathematical steady-state solutions of the

chain. Many authors have used standby provisioning in their models for increasing the

operational availability of the system. Smidt-Destombesa et al. (2004) considered an

installed base of k-out-of-N systems, each consisting of identical, repairable

components. System maintenance consists of replacing all failed and degraded

components by spares. They focused on the downtime resulting from the lack of spare

parts and maintenance strategy. The analysis of reliability and the availability of

systems having warm standby components and switching failures was studied by

Wang et al. (2006). El-Damcese (2009) analyzed the warm standby system subject to

common cause failures with time varying failure and repair rates. Maheshwari et al.

(2010) studied machine repair problem with K-type warm spares, multiple vacations

for repairmen and reneging. Yuan and Xu (2011) considered an optimal replacement

policy for a repairable system with repairman vacations.



A switching device is used to connect a standby unit in place of a failed component, but the switchover from standby component to operational one may be perfect or imperfect. The standby redundant components support the demand of pre-specified

minimum reliability of the system. The provision of warm standbys with switching

failures has attracted many researchers working in the area of reliability prediction of

machining systems. In 1983, Gupta et al. studied the switching failure in a two-unit

standby redundant system. Goel and Gupta (1984) analyzed the availability of a two-

unit cold standby system with two switching failure modes. Labib (1991) and Alidrisi

(1992) proposed the stochastic analysis of a n-components redundant system with

two-unit warm standby system and switching devices. Dhillon and Yang (1992) and

Dhillon (1993) analyzed the reliability and availability of warm standby systems with

common-cause failures and human errors. Mokaddis et al. (1994) developed two

models for two dissimilar-unit standby redundant system with three types of repair

facilities and perfect or imperfect switch. Chung (1995), Singh and Goel (1995) and

Xu et al. (2005) predicted the reliability and availability of standby systems with

imperfect switching and multiple non-critical/critical errors. Reliability and sensitivity

analysis of a system with multiple unreliable service stations and standby switching

failures was considered by Ke et al. (2007). Ke and Lee (2007) suggested the

asymptotic confidence limits for a repairable system with standbys subject to

switching failures. Hsu et al. (2008) statistically explored the availability of a redundant system with reboot delay, standby switching failures and an unreliable repair facility, which consists of two active components and one warm standby. Wang and Chen (2009) considered general repair times, reboot delay and switching failures to evaluate the availability for different configurations. Other important contributions in the direction of standby provisioning in repairable systems are due to Cha et al. (2008) and Yun and Cha (2010). They optimized the design of a general warm

standby system. Mahmoud and Moshref (2010) studied a two-unit cold standby

system considering hardware, human error failures and preventive maintenance. Ke et

al. (2011) discussed the reliability measures of a repairable system with standby

switching failures and reboot delay.

Better maintenance of the system may result in better reliability and performance of the system. Different models and various availability characteristics of interest have been discussed and obtained by several researchers using the theory of Markov processes. In this chapter, reliability models have been developed to address the issue of availability prediction and to facilitate the comparison of different configurations. The purpose of the present investigation is to study the availability and failure frequency of three systems with switching devices, standby components, setup time for repair and delay of reboot. This chapter is arranged into the following sections. The present section 6B.1 reviews the previous studies on reliability/availability analysis of repairable systems having standby and switching failures. In order to construct the model mathematically, some notations and assumptions are stated in section 6B.2. In section 6B.3, we develop three models and construct the governing equations by using the appropriate transition rates in order to evaluate the availability for three configurations. The steady-state availability and failure frequency have also been obtained. In section 6B.4, numerical illustrations are provided to examine the availability indices. The concluding remarks are given in the last section 6B.5.


We develop a Markov model for a multi-component system consisting of identical operating primary units and warm standby units. Each operating unit fails independently of the state of the others according to an exponential failure time distribution with parameter λ. Whenever one of the operating units fails, it is instantly replaced by a warm standby if available. The failed component is sent immediately for repair. An available warm standby unit may also fail, with an exponentially distributed failure time with parameter α (0 ≤ α < λ). It is assumed that the switchover time is instantaneous. However, the switchover of a standby unit to replace the failed primary unit is imperfect; the switchover failure probability is q. If a warm standby unit fails to switch to a primary unit, the next available standby unit attempts to switch. This process continues until switching is successful or all the warm standby units have been exhausted. Whenever an operating unit or a warm standby fails, it is immediately sent to a repair facility where the failed unit is repaired by the repairman, who takes a setup time before starting the repair. The setup time and repair time are exponentially distributed with parameters η and μ, respectively. After repair, a unit works like a new one. The reboot delay for an operating unit and warm standby unit is assumed to be exponentially distributed with rate β.
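The repeated switching attempts described above can be made concrete with a short calculation. The sketch below, written under the stated assumption that successive attempts fail independently with probability q, lists the probability that the k-th standby is the first to switch successfully and the probability that all standbys fail to switch. The function name `switching_outcomes` is hypothetical and introduced only for this illustration.

```python
def switching_outcomes(standbys, q):
    """Return ([P(first success at attempt k) for k = 1..standbys], P(all fail)).

    Assumes each switching attempt fails independently with probability q.
    """
    # The k-th attempt succeeds only after k - 1 earlier attempts have failed.
    success = [q ** (k - 1) * (1.0 - q) for k in range(1, standbys + 1)]
    all_fail = q ** standbys
    return success, all_fail


success, all_fail = switching_outcomes(standbys=2, q=0.5)
print(success, all_fail)
```

With two standbys the three outcomes have probabilities (1 − q), q(1 − q) and q², which are the factors that multiply the failure rate in the balance equations of a configuration with two warm standbys.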

We use the index (i,j) to represent the system state in which there are ‘i’ operating units and ‘j’ standby units in the repairable system. We use the index ‘k’ for the system state in which the system is broken down and under repair; here k = i + j, i.e. k represents the total number of units in the system. The setup time and the repair time are both assumed to be exponentially distributed, with rates $\eta_k$ and $\mu_k$ (k = i + j), when the system is broken down and under repair and there are a total of ‘k’ units in the system.

To obtain the availability function A(t) and other performance indices, viz. failure frequency, probability of the system being in reboot state and probability of the system being broken down and under setup/repair, we obtain the transient probabilities $p_{i,j}(t)$ and $p_k(t)$ by using the Runge-Kutta method of fourth order. The steady-state probabilities have also been determined in explicit form by using a recursive approach.

6B.3 Availability Prediction for Three Configurations

Three models are developed by considering different configurations of multi-

component system with spare part support as described below:

6B.3.1 Model 1.

The system consists of two similar units, one is operating and other one acts as

a standby. Initially one unit is operative and other unit is kept as warm standby as

shown by node (1,1) in fig. 6B.1. When the operating unit fails, it is replaced by the

standby unit if available. If standby unit is consumed, i.e. not available for

replacement, the system fails. When operating unit fails, the standby unit is switched

to being operative by means of a switching device. The switch may be available,

when required with probability q. The system can reach a failed state denoted by node

(0, 1); then after rebooting, the system is available for work. The node (0,0) represents

the failure of both units. Whenever the server is broken down, immediately it is sent

for repair to the repairmen who needs setup time before starting the repair. After

completion of repair, the server becomes available for service. States (0) and (1) show that the server is down and in the setup state when one and both units, respectively, have failed. It is assumed that $\mu_1$ and $\mu_2$ are the repair rates and $\eta_1$ and $\eta_2$ are the setup rates for the repair while the system is in states (0) and (1), respectively.

Fig. 6B.1: State transition diagram for model 1

The following differential-difference equations are constructed by using appropriate

transition rates depicted in fig 6B.1:

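$\frac{dP_{1,1}(t)}{dt} = -(\lambda + \alpha) P_{1,1}(t) + \mu_2 P_1(t)$ … (6B.1)

$\frac{dP_{0,1}(t)}{dt} = -\beta P_{0,1}(t) + \lambda (1-q) P_{1,1}(t)$ … (6B.2)

$\frac{dP_{1,0}(t)}{dt} = -(\lambda + \eta_2) P_{1,0}(t) + \alpha P_{1,1}(t) + \beta P_{0,1}(t) + \mu_1 P_0(t)$ … (6B.3)

$\frac{dP_{0,0}(t)}{dt} = -\eta_1 P_{0,0}(t) + \lambda q P_{1,1}(t) + \lambda P_{1,0}(t)$ … (6B.4)

$\frac{dP_0(t)}{dt} = -\mu_1 P_0(t) + \eta_1 P_{0,0}(t)$ … (6B.5)

$\frac{dP_1(t)}{dt} = -\mu_2 P_1(t) + \eta_2 P_{1,0}(t)$ … (6B.6)

The initial condition is

$P_{1,1}(0) = 1,\; P_{0,1}(0) = P_{1,0}(0) = P_{0,0}(0) = P_0(0) = P_1(0) = 0$ …(6B.7)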
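To show how a system of this kind can be integrated in time, the following minimal sketch applies a hand-written classical fourth-order Runge-Kutta step to the six state probabilities of model 1. It is an illustration under assumptions, not the author's program: the state order (1,1), (0,1), (1,0), (0,0), 0, 1, the assignment of the rates η2 and μ1 to the (1,0) equation (chosen so that total probability is conserved), the parameter values, the step size and the helper names `model1_rhs` and `rk4_path` are all choices made here for the example.

```python
def model1_rhs(p, lam, alpha, beta, q, mu1, mu2, eta1, eta2):
    """Right-hand sides of the model 1 equations for the state vector p."""
    p11, p01, p10, p00, p0, p1 = p
    return [
        -(lam + alpha) * p11 + mu2 * p1,                            # state (1,1)
        -beta * p01 + lam * (1.0 - q) * p11,                        # state (0,1), reboot
        -(lam + eta2) * p10 + alpha * p11 + beta * p01 + mu1 * p0,  # state (1,0)
        -eta1 * p00 + lam * q * p11 + lam * p10,                    # state (0,0)
        -mu1 * p0 + eta1 * p00,                                     # repair state 0
        -mu2 * p1 + eta2 * p10,                                     # repair state 1
    ]


def rk4_path(rhs, y0, t_end, steps):
    """Integrate y' = rhs(y) from 0 to t_end with fixed-step classical RK4."""
    h = t_end / steps
    y = list(y0)
    path = [y]
    for _ in range(steps):
        k1 = rhs(y)
        k2 = rhs([yi + 0.5 * h * ki for yi, ki in zip(y, k1)])
        k3 = rhs([yi + 0.5 * h * ki for yi, ki in zip(y, k2)])
        k4 = rhs([yi + h * ki for yi, ki in zip(y, k3)])
        y = [yi + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
             for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
        path.append(y)
    return path


# Illustrative parameter values (assumptions for this sketch only).
params = dict(lam=0.3, alpha=0.3, beta=0.3, q=0.5,
              mu1=6.0, mu2=2.0, eta1=0.4, eta2=0.2)
path = rk4_path(lambda p: model1_rhs(p, **params),
                [1.0, 0.0, 0.0, 0.0, 0.0, 0.0], t_end=10.0, steps=1000)
final = path[-1]
# A_1(t) following the chapter's expression P_0 + P_1 + P_{1,0} + P_{1,1}.
availability = final[4] + final[5] + final[2] + final[0]
print(final, availability)
```

Because the right-hand sides sum to zero, the total probability should stay equal to one along the whole path, which is a convenient check on both the equations and the step size.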

Steady state probabilities

As t → ∞, the steady-state equations governing the model are obtained from (6B.1)-(6B.7) as follows:

$0 = -(\lambda + \alpha) P_{1,1} + \mu_2 P_1$ … (6B.8)

$0 = -\beta P_{0,1} + \lambda (1-q) P_{1,1}$ … (6B.9)

$0 = -(\lambda + \eta_2) P_{1,0} + \alpha P_{1,1} + \beta P_{0,1} + \mu_1 P_0$ … (6B.10)

$0 = -\eta_1 P_{0,0} + \lambda q P_{1,1} + \lambda P_{1,0}$ … (6B.11)

$0 = -\mu_1 P_0 + \eta_1 P_{0,0}$ … (6B.12)

$0 = -\mu_2 P_1 + \eta_2 P_{1,0}$ … (6B.13)

The normalizing condition is given by

$P_{1,1} + P_{1,0} + P_{0,1} + P_{0,0} + P_1 + P_0 = 1$ … (6B.14)
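The steady-state probabilities of model 1 can also be checked numerically by solving the linear system directly. The sketch below is a minimal illustration, assuming the balance equations in the probability-conserving form (outflow λ + η2 from state (1,0) and inflow μ1 P0), dropping the redundant equation (6B.13) in favour of the normalizing condition, and using illustrative parameter values; the helper `gauss_jordan` is a small elimination routine written only for this example.

```python
def gauss_jordan(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Illustrative parameter values (assumptions for this sketch only).
lam, alpha, beta, q = 0.3, 0.3, 0.3, 0.5
mu1, mu2, eta1, eta2 = 6.0, 2.0, 0.4, 0.2

# Unknowns in the order P11, P01, P10, P00, P0, P1.
A = [
    [-(lam + alpha), 0.0, 0.0, 0.0, 0.0, mu2],     # (6B.8)
    [lam * (1.0 - q), -beta, 0.0, 0.0, 0.0, 0.0],  # (6B.9)
    [alpha, beta, -(lam + eta2), 0.0, mu1, 0.0],   # (6B.10)
    [lam * q, 0.0, lam, -eta1, 0.0, 0.0],          # (6B.11)
    [0.0, 0.0, 0.0, eta1, -mu1, 0.0],              # (6B.12)
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],                # normalizing condition (6B.14)
]
b = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
p11, p01, p10, p00, p0, p1 = gauss_jordan(A, b)
print(p11, p01, p10, p00, p0, p1)
```

The omitted balance equation μ2 P1 = η2 P(1,0) should then hold automatically, which gives a quick consistency check on the computed probabilities.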

Solving equations (6B.8)-(6B.13) recursively and using (6B.14), we obtain

( ) ( )( )[ ]1121011

211,1 ημαλΛΛΛημ

μP++++++

η= … (6B.15a)

( )( ) ( )( )[ ]1121011


αλμP++++++

+= … (6B.15b)

( )( ) ( )( )[ ]1121011


q1μP++++++β

−λη= … (6B.15c)

( )( ) ( )( )[ ]11210111


qμP++++++η

η+λ+αλ= … (6B.15d)

( )( ) ( )( )[ ]11210112

211 ημαλΛΛΛημ

αλμP++++++µ

+η= … (6B.15e)

( )( ) ( )( )[ ]1121011

20 ημαλΛΛΛημ

qP++++++

η+λ+αλ= … (6B.15f)


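where

$\Lambda_0 = \frac{\lambda \eta_2 (1-q)}{\beta}$, $\Lambda_1 = \frac{\lambda(\alpha + \lambda + q\eta_2)}{\mu_1}$, $\Lambda_2 = \frac{\lambda(\alpha + \lambda + q\eta_2)}{\eta_1}$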

Performance Indices

For configuration 6B.1, the availability ( )∞1A is given by

( ) 010,11,11 PPPPA +++=∞

( )( )( ) ( )( )[ ]1121011

11

ημαλΛΛΛημαλημ

++++++++

= …(6B.16)

Other performance indices for this configuration are as follows:

Failure Frequency ( ) 0,11,1f PPF λ+α+λ=

( )( )( ) ( )( )[ ]1121011

11

ημαλΛΛΛημαλημ

+++++++λ+

= …(6B.17)

Probability of the system being in reboot state 1,0RB PP =

( )( ) ( )( )[ ]1121011

21

ημαλΛΛΛημq1μ

++++++β−λη

= …(6B.18)

Probability of the system being under setup/repair state 01D PPP +=

= ( )( ) ( )( )[ ]11210110

1110

ημαλΛΛΛημμqμ

++++++η+α+λλµ+η …(6B.19)

6B.3.2 Model 2.

In this configuration, we consider that the system consists of two operating

and one standby units. Other assumptions made are same as taken for configuration 1.

For proper functioning of the system two units are required. For the configuration 2,

the differential-difference equations are constructed by considering the appropriate

rates as shown in fig. 6B.2.




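$\frac{dP_{2,1}(t)}{dt} = -(2\lambda + \alpha) P_{2,1}(t) + \mu_2 P_2(t)$ … (6B.20)

$\frac{dP_{1,1}(t)}{dt} = -\beta P_{1,1}(t) + 2\lambda (1-q) P_{2,1}(t)$ … (6B.21)

$\frac{dP_{2,0}(t)}{dt} = -(2\lambda + \eta_2) P_{2,0}(t) + \alpha P_{2,1}(t) + \beta P_{1,1}(t) + \mu_1 P_1(t)$ … (6B.22)

$\frac{dP_{1,0}(t)}{dt} = -\eta_1 P_{1,0}(t) + 2\lambda q P_{2,1}(t) + 2\lambda P_{2,0}(t)$ … (6B.23)

$\frac{dP_1(t)}{dt} = -\mu_1 P_1(t) + \eta_1 P_{1,0}(t)$ … (6B.24)

$\frac{dP_2(t)}{dt} = -\mu_2 P_2(t) + \eta_2 P_{2,0}(t)$ … (6B.25)

The initial condition is

$P_{2,1}(0) = 1,\; P_{2,0}(0) = P_{1,1}(0) = P_{1,0}(0) = P_2(0) = P_1(0) = 0$ …(6B.26)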


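The steady-state equations governing the model are obtained from (6B.20)-(6B.26) as follows:

$0 = -(2\lambda + \alpha) P_{2,1} + \mu_2 P_2$ … (6B.27)

$0 = -\beta P_{1,1} + 2\lambda (1-q) P_{2,1}$ … (6B.28)

$0 = -(2\lambda + \eta_2) P_{2,0} + \alpha P_{2,1} + \beta P_{1,1} + \mu_1 P_1$ … (6B.29)

$0 = -\eta_1 P_{1,0} + 2\lambda q P_{2,1} + 2\lambda P_{2,0}$ … (6B.30)

$0 = -\mu_1 P_1 + \eta_1 P_{1,0}$ … (6B.31)

$0 = -\mu_2 P_2 + \eta_2 P_{2,0}$ … (6B.32)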

Solving equations (6B.27)-(6B.32) recursively and using the normalizing condition,

we obtain

( ) ( )( )[ ]2221022

222,1 ημα2λΛΛΛημ

μηP++++++

= … (6B.33a)

( )( ) ( )( )[ ]2221022

22,0 ημα2λΛΛΛημ

α2λμP++++++

+= … (6B.33b)

( )( )( ) ( ) ( )( )[ ]2221022

221,1 ημα2λΛΛΛημα2λ

α2λq1μ2P+++++++β

+−λη= … (6B.33c)

( )[ ]( ) ( ) ( )( )[ ]22210221

222

1,0 ημα2λΛΛΛημα2λα2λμq2P

+++++++η+λ+ηλ

= … (6B.33d)

( )( ) ( )( )[ ]2221022

22 ημα2λΛΛΛημ

α2ληP++++++

+= … (6B.33e)

( )( ) ( )( )[ ]( ) ( )( )[ ]22210221

2221 ημα2λΛΛΛημ

q1222μP++++++µ

−λ+αη−α+λη+λ= … (6B.33f)

where

( )β−λη

=Λq12 2

0 , ( )( ) ( )( )[ ]1

221

q1222µ

−λ+αη−α+λη+λ=Λ

( )[ ]1

22

2q2η

α+λλ+ηλ=Λ .



Performance Indices

The system availability ( )∞2A is given by

( ) 120,21,22 PPPPA +++=∞

( )( )( ) ( )( )[ ]2221022

22

ημα2λΛΛΛημα2λημ

++++++++

= …(6B.34)

Other performance indices for this model are obtained as

Failure Frequency ( ) 0,21,2f P2P2F λ+α+λ=

( )( )( ) ( )( )[ ]2221022

22

ημα2λΛΛΛημα2λ2ημ

+++++++λ+

= …(6B.35)

Probability of the system being in reboot state 1,1RB PP β=

( )( )( ) ( ) ( )( )[ ]2221022

22

ημα2λΛΛΛημα2λα2λq1μ2

++++++++−λη

= …(6B.36)

Probability of the system being under setup/repair state 12D PPP +=

( ) ( )( ) ( )[ ]( ) ( )( )[ ]22210221

222221

ημα2λΛΛΛημμq12α2λ2λ2μ

++++++η−λ−αη−+η+µ+α+λη

= …(6B.37)

6B.3.3 Model 3.

In this model, the system consists of 2 operating and two standby units and at

least two units are required for the functioning of the system. The states 1, 2, 3

represent that the server is broken down when the system has one operating unit, two

operating units, and two operating as well as one standby units respectively; the

corresponding set up and repair rates are (µ1, η1), (µ2, η2) and (µ3, η3).

For the configuration 3 as shown in fig. 6B.3, we construct the differential-

difference equations as follows:


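$\frac{dP_{2,2}(t)}{dt} = -(2\lambda + 2\alpha) P_{2,2}(t) + \mu_3 P_3(t)$ … (6B.38)

$\frac{dP_{2,1}(t)}{dt} = -(2\lambda + \alpha + \eta_3) P_{2,1}(t) + 2\alpha P_{2,2}(t) + \beta P_{1,2}(t) + \mu_2 P_2(t)$ … (6B.39)

$\frac{dP_{2,0}(t)}{dt} = -(2\lambda + \eta_2) P_{2,0}(t) + \alpha P_{2,1}(t) + \beta P_{1,1}(t) + \mu_1 P_1(t)$ … (6B.40)

$\frac{dP_{1,2}(t)}{dt} = -\beta P_{1,2}(t) + 2\lambda (1-q) P_{2,2}(t)$ … (6B.41)

$\frac{dP_{1,1}(t)}{dt} = -\beta P_{1,1}(t) + 2\lambda q(1-q) P_{2,2}(t) + 2\lambda (1-q) P_{2,1}(t)$ …(6B.42)

$\frac{dP_{1,0}(t)}{dt} = -\eta_1 P_{1,0}(t) + 2\lambda q^2 P_{2,2}(t) + 2\lambda q P_{2,1}(t) + 2\lambda P_{2,0}(t)$ … (6B.43)

$\frac{dP_3(t)}{dt} = -\mu_3 P_3(t) + \eta_3 P_{2,1}(t)$ … (6B.44)

$\frac{dP_2(t)}{dt} = -\mu_2 P_2(t) + \eta_2 P_{2,0}(t)$ … (6B.45)

$\frac{dP_1(t)}{dt} = -\mu_1 P_1(t) + \eta_1 P_{1,0}(t)$ … (6B.46)

The initial condition is

$P_{2,2}(0) = 1,\; P_{2,1}(0) = P_{2,0}(0) = P_{1,2}(0) = P_{1,1}(0) = P_{1,0}(0) = P_3(0) = P_2(0) = P_1(0) = 0$ …(6B.47)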
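One way to sanity-check equations of this type is to collect the transition rates into a generator matrix and confirm that the total outflow of each state matches the coefficient of that state in its own equation. The sketch below does this for model 3. The transition list is read off from (6B.38)-(6B.46); the string state labels, the helper names and the parameter values are assumptions made only for this illustration.

```python
def model3_rates(lam, alpha, beta, q, mu, eta):
    """Transition rates of model 3 as {(source, destination): rate}."""
    return {
        ("2,2", "2,1"): 2 * alpha,              # a standby fails
        ("2,2", "1,2"): 2 * lam * (1 - q),      # first switch succeeds
        ("2,2", "1,1"): 2 * lam * q * (1 - q),  # second switch succeeds
        ("2,2", "1,0"): 2 * lam * q * q,        # both switches fail
        ("2,1", "2,0"): alpha,
        ("2,1", "1,1"): 2 * lam * (1 - q),
        ("2,1", "1,0"): 2 * lam * q,
        ("2,1", "3"): eta[3],                   # setup with three units
        ("2,0", "1,0"): 2 * lam,
        ("2,0", "2"): eta[2],
        ("1,2", "2,1"): beta,                   # reboot completes
        ("1,1", "2,0"): beta,
        ("1,0", "1"): eta[1],
        ("3", "2,2"): mu[3],                    # repair completes
        ("2", "2,1"): mu[2],
        ("1", "2,0"): mu[1],
    }


def generator(rates):
    """Build the generator matrix Q (each row sums to zero) from the rates."""
    states = sorted({s for pair in rates for s in pair})
    index = {s: i for i, s in enumerate(states)}
    Q = [[0.0] * len(states) for _ in states]
    for (src, dst), rate in rates.items():
        Q[index[src]][index[dst]] += rate
        Q[index[src]][index[src]] -= rate
    return states, Q


# Illustrative parameter values (assumptions for this sketch only).
lam, alpha, beta, q = 0.3, 0.3, 0.3, 0.5
mu = {1: 6.0, 2: 2.0, 3: 4.0}
eta = {1: 0.4, 2: 0.2, 3: 0.3}
states, Q = generator(model3_rates(lam, alpha, beta, q, mu, eta))
i22, i21 = states.index("2,2"), states.index("2,1")
print(-Q[i22][i22], 2 * lam + 2 * alpha)  # total outflow of state (2,2)
```

The three switching branches out of state (2,2) add up to 2λ because (1 − q) + q(1 − q) + q² = 1, so the diagonal entry reproduces the coefficient (2λ + 2α) of (6B.38).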


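The steady-state equations corresponding to (6B.38)-(6B.46) are as follows:

$0 = -(2\lambda + 2\alpha) P_{2,2} + \mu_3 P_3$ … (6B.48)

$0 = -(2\lambda + \alpha + \eta_3) P_{2,1} + 2\alpha P_{2,2} + \beta P_{1,2} + \mu_2 P_2$ … (6B.49)

$0 = -(2\lambda + \eta_2) P_{2,0} + \alpha P_{2,1} + \beta P_{1,1} + \mu_1 P_1$ … (6B.50)

$0 = -\beta P_{1,2} + 2\lambda (1-q) P_{2,2}$ … (6B.51)

$0 = -\beta P_{1,1} + 2\lambda q(1-q) P_{2,2} + 2\lambda (1-q) P_{2,1}$ … (6B.52)

$0 = -\eta_1 P_{1,0} + 2\lambda q^2 P_{2,2} + 2\lambda q P_{2,1} + 2\lambda P_{2,0}$ … (6B.53)

$0 = -\mu_3 P_3 + \eta_3 P_{2,1}$ … (6B.54)

$0 = -\mu_2 P_2 + \eta_2 P_{2,0}$ … (6B.55)

$0 = -\mu_1 P_1 + \eta_1 P_{1,0}$ … (6B.56)

Solving (6B.48)-(6B.56) recursively and using the normalizing condition

$P_{2,2} + P_{2,1} + P_{2,0} + P_{1,2} + P_{1,1} + P_{1,0} + P_3 + P_2 + P_1 = 1$, … (6B.57)

we compute the probabilities as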

( )[ ]( ) ( )( )[ ]3354321033

31,2 22

22Pµ+ηα+λ+Λ+Λ+Λ+Λ+Λ+Λ+ηµ

α+λµ= … (6B.58a)

( ) ( )( )[ ]3354321033

332,2 22

Pµ+ηα+λ+Λ+Λ+Λ+Λ+Λ+Λ+ηµ

µη= … (6B.58b)

( ) ( )( )[ ]3354321033

330,2 22


µΛ= … (6B.58c)

( ) ( )( )[ ]3354321033

302,1 22


µΛ= … (6B.58d)



( ) ( )( )[ ]3354321033

311,1 22


µΛ= … (6B.58e)

( ) ( )( )[ ]3354321033

340,1 22


µΛ= … (6B.58f)

( )( ) ( )( )[ ]3354321033

33 22

22Pµ+ηα+λ+Λ+Λ+Λ+Λ+Λ+Λ+ηµ

α+λη= … (6B.58g)

( ) ( )( )[ ]3354321033

322 22


µΛ= … (6B.58h)

( ) ( )( )[ ]3354321033

351 22


µΛ= … (6B.58i)

where

( )β

η−λ=Λ 3

0q12 , ( )( )

βα+λ+η−λ

=Λ22qq12 3

1

( )( ) ( )2

3332

q122222µ

η−λ−αη−α+λη+α+λ=Λ

( )( ) ( )2

3333

q122222η

η−λ−αη−α+λη+α+λ=Λ

( ) ( )( ) ( )21

3332232

4q12222q222q2

ηηη−λ−αη−η+α+λλ+ηλα+λ+ηηλ

=Λ

( ) ( )( ) ( )21

3332232

5q12222q222q2

ηµη−λ−αη−η+α+λλ+ηλα+λ+ηηλ

=Λ

Performance indices

For configuration 3, the availability ( )∞3A is obtained as

( ) 1230,21,22,23 PPPPPPA +++++=∞

( )[ ]( ) ( )( )[ ]3354321033

333

2222

µ+ηα+λ+Λ+Λ+Λ+Λ+Λ+Λ+ηµΛ+η+α+λµ

= …(6B.59)



Other performance indices are given as follows:

Failure Frequency ( )( ) 0,11,12,1f P2PPq12F λ++−λ=

( ) ( )( )[ ]( ) ( )( )[ ]3354321033

333

222222

µ+ηα+λ+Λ+Λ+Λ+Λ+Λ+Λ+ηµΛλ+α+λ+ηα+λµ

= …(6B.60)

Probability of the system being in reboot state is

0,21,2RB PPP += ( )( ) ( )( )[ ]3354321033

103

22 µ+ηα+λ+Λ+Λ+Λ+Λ+Λ+Λ+ηµΛ+Λβµ

= …(6B.61)

Probability of the system being under setup/repair state

321D PPPP ++= ( ) ( )( ) ( )( )[ ]3354321033

5233

2222

µ+ηα+λ+Λ+Λ+Λ+Λ+Λ+Λ+ηµΛ+Λµ+α+λη

= …(6B.62)

6B.4 Transient Solution

The transient probabilities for the systems of equations (6B.1)-(6B.6), (6B.20)-(6B.26) and (6B.38)-(6B.46) for models 1, 2 and 3, respectively, cannot be obtained explicitly by analytical methods. However, numerical methods for solving a set of differential equations can easily be employed. To obtain the transient probabilities, we employ a numerical approach based on the fourth-order Runge-Kutta (RK) technique. Once the transient probabilities are evaluated, the availability for models 1, 2 and 3, respectively, is obtained using the following formulae:

$$A_1(t) = P_0(t) + P_1(t) + P_{1,0}(t) + P_{1,1}(t)$$ …(6B.63)

$$A_2(t) = P_1(t) + P_2(t) + P_{2,0}(t) + P_{2,1}(t)$$ …(6B.64)

$$A_3(t) = P_1(t) + P_2(t) + P_3(t) + P_{2,0}(t) + P_{2,1}(t) + P_{2,2}(t)$$ …(6B.65)

The failure frequencies $F_i(t)$ (i=1,2,3) for models 1, 2 and 3 are determined using

$$F_1(t) = \lambda P_{1,0}(t) + (\lambda+\alpha)P_{1,1}(t)$$ …(6B.66)

$$F_2(t) = 2\lambda P_{2,0}(t) + (2\lambda+\alpha)P_{2,1}(t)$$ …(6B.67)

$$F_3(t) = 2\lambda P_{1,0}(t) + 2\lambda(1-q)\left\{P_{1,2}(t) + P_{1,1}(t)\right\}$$ …(6B.68)




To illustrate the computational tractability of the transient analysis for all three configurations, we perform a computational experiment. The computer program has been coded using MATLAB's 'ode45' function. A time span with equal intervals is considered. For the numerical results, we choose the default parameters as

λ=0.3, α=0.3, µ1=6, µ2=2, µ3=4, β=0.3, q=0.5, η1=0.4, η2=0.2, η3=0.3.
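The numerical procedure described above can be sketched in a few lines. The following is only an illustrative sketch in Python (the thesis itself uses MATLAB's 'ode45'): it integrates the nine state equations of configuration 3 with a fixed-step classical fourth-order Runge-Kutta scheme. The transition rates are an assumed reading of the steady-state equations (6B.48)-(6B.56); the function names, the step size and the parameter values in `RATES` are hypothetical choices made for illustration.

```python
# Sketch only: fixed-step RK4 for the nine-state model of configuration 3.
# The rates below are an assumed reading of equations (6B.48)-(6B.56).

STATES = ["P22", "P21", "P20", "P12", "P11", "P10", "P3", "P2", "P1"]


def derivative(p, lam, alpha, beta, q, eta, mu):
    """Right-hand side dP/dt; eta = (eta1, eta2, eta3), mu = (mu1, mu2, mu3)."""
    p22, p21, p20, p12, p11, p10, p3, p2, p1 = p
    eta1, eta2, eta3 = eta
    mu1, mu2, mu3 = mu
    return [
        -(2 * lam + 2 * alpha) * p22 + mu3 * p3,
        -(2 * lam + alpha + eta3) * p21 + 2 * alpha * p22 + beta * p12 + mu2 * p2,
        -(2 * lam + eta2) * p20 + alpha * p21 + beta * p11 + mu1 * p1,
        -beta * p12 + 2 * lam * (1 - q) * p22,
        -beta * p11 + 2 * lam * q * (1 - q) * p22 + 2 * lam * (1 - q) * p21,
        -eta1 * p10 + 2 * lam * q * q * p22 + 2 * lam * q * p21 + 2 * lam * p20,
        -mu3 * p3 + eta3 * p21,
        -mu2 * p2 + eta2 * p20,
        -mu1 * p1 + eta1 * p10,
    ]


def rk4_step(p, h, **rates):
    """One classical fourth-order Runge-Kutta step of size h."""
    k1 = derivative(p, **rates)
    k2 = derivative([x + 0.5 * h * k for x, k in zip(p, k1)], **rates)
    k3 = derivative([x + 0.5 * h * k for x, k in zip(p, k2)], **rates)
    k4 = derivative([x + h * k for x, k in zip(p, k3)], **rates)
    return [x + h * (a + 2 * b + 2 * c + d) / 6
            for x, a, b, c, d in zip(p, k1, k2, k3, k4)]


def solve(t_end, h=0.01, **rates):
    """Integrate from t = 0, starting in state (2,2) as in (6B.47)."""
    p = [1.0] + [0.0] * 8
    for _ in range(int(round(t_end / h))):
        p = rk4_step(p, h, **rates)
    return dict(zip(STATES, p))


def availability3(p):
    """A3(t) summed over the states listed in (6B.65)."""
    return p["P1"] + p["P2"] + p["P3"] + p["P20"] + p["P21"] + p["P22"]


# Illustrative parameter values (assumed, not a reproduction of the tables).
RATES = dict(lam=0.3, alpha=0.3, beta=0.3, q=0.5,
             eta=(0.4, 0.2, 0.3), mu=(6.0, 2.0, 4.0))

if __name__ == "__main__":
    for t in (0, 2, 4, 6):
        print(t, round(availability3(solve(float(t), **RATES)), 3))
```

Because every outflow term reappears as an inflow in another equation, the state probabilities sum to one at every step, which is a convenient sanity check on any implementation.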

In tables 6B.1(a)-6B.1(c), we examine the effect of different parameters such

as failure rate of standbys (α), reboot rate (β ) and switching parameter (q)

respectively, on the system availability. It is observed that the availability decreases as

time grows and also as the values of parameters α and q increase, the availability

decreases for all cases. The availability increases by increasing the values of β for all

the three configurations.

To examine the effects of setup time and repair rate, the graphical presentation

of availability has been done in figures 6B.4(a-d), 6B.5(a-d) and 6B.6(a-f) for model

1, 2 and 3, respectively. From figs 6B.4(a-d), 6B.5(a-d) and 6B.6(a-f), we note that as

time increases, A(t) initially sharply decreases and after some time it becomes almost

constant. In figs 6B.4(a) and 6B.4(b) we note that as µ1 (µ2) increases, A(t) decreases.

From figs 6B.4(c) and 6B.4(d), it is found that as η1 and η2 are increasing, A(t) is

increasing.

In figs 6B.5(a)-6B.5(d), the graphs for A(t) have been plotted with respect to

time t for model 2. We analyze the effects of repair rate and setup time on the

availability. As repair rates µ1 and µ2 increase, A(t) decreases. As set up time η1

increases, A(t) increases however on increasing η2, the availability remains almost

constant.

For model 3, the availability is plotted against time in figs 6B.6(a), 6B.6(c) and 6B.6(e) to examine the effects of the repair rates (µ1, µ2 and µ3) and in figs 6B.6(b), 6B.6(d) and 6B.6(f) for the setup times (η1, η2 and η3). It is seen in figs 6B.6(a), 6B.6(c) and 6B.6(e) that A(t) remains almost the same for lower values of t on increasing the repair rates µ1, µ2 and µ3. For higher values of t, the effect of the repair rates µ2 and µ3 on the availability is insignificant; however, on increasing µ1, we observe a slight decrease in the availability. In figs 6B.6(b) and 6B.6(d), it is found that the availability A(t) increases on increasing the values of η1 and η2, but decreases on increasing η3, as can be seen in fig. 6B.6(f).

Finally, from the tables and graphs, we conclude that

• The availability decreases as time t increases but becomes almost constant after some time for all three configurations. The availability decreases as the failure rate of standbys (α) and the switching failure probability (q) increase, whereas it increases with the reboot rate (β). Tables 6B.1(a)-6B.1(c) reveal that the availability of the system is more affected by the reboot rate β than by the failure rate α of the standbys and the switching failure probability q.

• From the figures for models 1 and 2, it is seen that the availability decreases on increasing the repair rate, whereas it increases on increasing the setup time. Thus we conclude that the setup time has a significant effect on the availability and should be incorporated in the availability analysis.

6B.5 Conclusion

In this investigation we have addressed the issue of improving the availability of a multi-component system supported by warm standbys and a repair facility. We have included the setup time, switching failure and reboot delay so that the model developed may be closer to realistic situations. Various performance measures, including the system availability in the transient state, have been examined with the help of numerical results. It has been shown that the combination of standby support and maintainability is of great importance in many real time systems operating in a machining environment. We have explored some system characteristics, such as the failure frequency, the probability of the system being in the reboot state and the probability of the system being in the setup/repair state, which may be helpful to system designers during the design and development phases of the concerned system under optimal operating conditions.


         α=0.3                  α=0.6                  α=0.9
 t    A1(t)  A2(t)  A3(t)    A1(t)  A2(t)  A3(t)    A1(t)  A2(t)  A3(t)
 0    1.000  1.000  1.000    1.000  1.000  1.000    1.000  1.000  1.000
 2    0.623  0.472  0.443    0.623  0.477  0.440    0.623  0.482  0.439
 4    0.504  0.423  0.363    0.503  0.430  0.363    0.502  0.435  0.365
 6    0.471  0.430  0.351    0.470  0.437  0.351    0.469  0.441  0.352
 8    0.463  0.438  0.344    0.462  0.443  0.343    0.461  0.446  0.343
10    0.461  0.443  0.335    0.460  0.447  0.332    0.460  0.449  0.331
12    0.460  0.446  0.324    0.460  0.449  0.319    0.459  0.451  0.317
14    0.460  0.447  0.312    0.459  0.450  0.306    0.459  0.452  0.303

Table 6B.1(a): Comparison of availability of three configurations for different values of α


         β=0.2                  β=0.5                  β=0.8
 t    A1(t)  A2(t)  A3(t)    A1(t)  A2(t)  A3(t)    A1(t)  A2(t)  A3(t)
 0    1.000  1.000  1.000    1.000  1.000  1.000    1.000  1.000  1.000
 2    0.611  0.454  0.417    0.643  0.500  0.485    0.663  0.528  0.529
 4    0.484  0.398  0.326    0.530  0.453  0.410    0.549  0.473  0.442
 6    0.450  0.406  0.316    0.493  0.454  0.386    0.504  0.465  0.402
 8    0.443  0.416  0.315    0.479  0.456  0.368    0.486  0.463  0.376
10    0.443  0.424  0.311    0.474  0.457  0.351    0.480  0.463  0.355
12    0.444  0.428  0.305    0.471  0.458  0.335    0.477  0.463  0.337
14    0.445  0.432  0.297    0.471  0.458  0.319    0.476  0.463  0.320

Table 6B.1(b): Comparison of availability of three configurations for different values of β


         q=0.1                  q=0.5                  q=0.9
 t    A1(t)  A2(t)  A3(t)    A1(t)  A2(t)  A3(t)    A1(t)  A2(t)  A3(t)
 0    1.000  1.000  1.000    1.000  1.000  1.000    1.000  1.000  1.000
 2    0.640  0.460  0.433    0.623  0.472  0.443    0.607  0.484  0.462
 4    0.522  0.401  0.339    0.504  0.423  0.363    0.487  0.444  0.401
 6    0.483  0.411  0.323    0.471  0.430  0.351    0.460  0.450  0.388
 8    0.471  0.423  0.318    0.463  0.438  0.344    0.455  0.453  0.374
10    0.467  0.432  0.313    0.461  0.443  0.335    0.454  0.454  0.359
12    0.466  0.436  0.305    0.460  0.446  0.324    0.454  0.455  0.344
14    0.466  0.439  0.295    0.460  0.447  0.312    0.454  0.455  0.328

Table 6B.1(c): Comparison of availability of three configurations for different values of q


[Figure: four panels plotting availability A(t) against time t (0 to 14) for (a) µ1 = 2, 8; (b) µ2 = 3, 9; (c) η1 = 0.5, 0.8; (d) η2 = 0.6, 0.9]

Fig. 6B.4: Availability vs time for model 1 by varying (a) µ1 (b) µ2 (c) η1 (d) η2


[Figure: four panels plotting availability A(t) against time t (0 to 14) for (a) µ1 = 2, 8; (b) µ2 = 3, 9; (c) η1 = 0.5, 0.8; (d) η2 = 0.6, 0.9]

Fig. 6B.5: Availability vs time for model 2 by varying (a) µ1 (b) µ2 (c) η1 (d) η2


[Figure: six panels plotting availability A(t) against time t (0 to 14) for model 3, with curves for (a) µ1 = 2, 8; (b) η1 = 0.2, 0.5; (c) µ2 = 3, 9; (d) η2 = 0.6, 0.9; (e) µ3 = 2, 9; (f) η3 = 0.5, 0.8]

Fig. 6B.6(a): Availability vs time by varying µ1 for model 3
Fig. 6B.6(b): Availability vs time by varying η1 for model 3
Fig. 6B.6(c): Availability vs time by varying µ2 for model 3
Fig. 6B.6(d): Availability vs time by varying η2 for model 3
Fig. 6B.6(e): Availability vs time by varying µ3 for model 3
Fig. 6B.6(f): Availability vs time by varying η3 for model 3

Warranty Policy for Hardware and Software Systems with Common

Cause Failure

7.1 Introduction


7.3 The Analysis

7.4 Warranty Policy with Repair


7.6 Conclusion

Chapter-7


A software/hardware system consisting of one software component and N hardware components, which are subject to individual and common cause failures, is analyzed. Renewal theory is applied to determine the renewals that bring a component back into the system in an as-good-as-new condition. In this chapter, we investigate warranty policies of repairs and replacements for developing warranty cost models. The reliability and other measures are obtained for the performance evaluation of the system. A numerical example is given to illustrate the computational tractability of the model with the help of MATLAB software.

7.1 Introduction

A warranty is an assurance from a manufacturer to a consumer that the product sold is guaranteed to perform satisfactorily up to a specified period of time, i.e. the warranty period. When an item does not perform satisfactorily as specified, the dealer/manufacturer is responsible for repairing it or replacing it with a new one. The main objective of a warranty is to provide protection to both manufacturers and consumers. The terms of a warranty are not only a concern of the consumers but also increase the sales and reputation of the manufacturers. Nowadays, in the competitive global market, the warranty has become a more powerful tool to increase sales and revenue. Many different types of warranty policies, such as the free replacement repair policy and the pro-rata warranty policy, exist in the literature. In the free replacement repair policy, the manufacturer guarantees to repair or replace failed items free of cost up to a specified warranty period from the date of initial purchase. In the pro-rata warranty policy, the manufacturer or seller agrees to refund a fraction of the purchase price of the items which fail before a specified time from the initial purchase. Many researchers have studied warranty cost models in different frameworks. Ascher and Feingold (1984), Abdel-Hameed (1995) and Hunter (1996) gave the mathematical techniques for warranty analysis of software reliability. Vaurio (1999) suggested availability and cost functions for

periodically inspected preventively maintained units. Wang and Sheu (2001)

explored the effect of the warranty cost on the imperfect EMQ model with general

discrete shift distribution. Pham (2003b) analyzed the software reliability and cost

models. Attardi et al. (2005) presented a mixed-Weibull regression model for the

analysis of automotive warranty data. Rahman and Chattopadhyay (2006) gave the

review of long term warranty policies. Yun et al. (2008) studied warranty servicing

with imperfect repair. Wu et al. (2009) considered optimal price, warranty length and

production rate for free replacement policy in the static demand market. Yang et al.

(2009) considered cost-oriented task allocation and hardware redundancy policies in

heterogeneous distributed computing systems considering software reliability.

Srinivas et al. (2009) analyzed the influence of delivery times on repairable k-out-of-

N systems with spares. Yang et al. (2010) defined a generic data-driven software

reliability model using mining technique. Zhu et al. (2010) proposed the availability

optimization of the system subject to competing risk. A decision support model for

warranty servicing of repairable items was proposed by Rao (2011).

In reliability literature, renewal theory provides many variants of

replacement/repair policies for the maintainability of the system. Sheu (1991)

proposed a generalized block replacement policy with minimal repair and general

random repair costs for a multi-unit system. Shaked and Zhu (1992) gave some

results on the block replacement policies using renewal theory. Blischke and Murthy

(1994), Wang and Pham (1996) and Murthy et al. (2004) discussed the quasi-

renewal process and its applications in imperfect maintenance. Salameh and Jaber

(2000) developed the economic production quantity model for item with imperfect

quality. Pham and Wang (2001) proposed a quasi-renewal process for the software

reliability and testing costs. Yanez et al. (2002) discussed generalized renewal process

for the analysis of repairable systems with limited failure experience. Rai and Singh

(2005) gave a modeling framework for assessing the impact of new time/mileage

warranty limits on the number and cost of automotive warranty claims. Huang et al.

(2007b) proposed optimal reliability, warranty and price for new products. Noortwijk

and Weide (2008) discussed applications to continuous-time processes of

computational techniques for discrete-time renewal processes. Park and Pham (2008)


developed the warranty system-cost model using quasi-renewal processes. Samatli-

Pac and Taner (2009) discussed the role of repair strategy in warranty cost

minimization via quasi-renewal processes. Zhou et al. (2009) suggested the dynamic

pricing and warranty policies for products with fixed lifetime. Park and Pham (2010)

studied altered quasi-renewal concepts for modeling renewable warranty costs with

imperfect repairs. Kallen (2011) discussed the modeling of imperfect maintenance and the reliability of complex systems using superposed renewal processes.

The warranty cost depends on the terms of the warranty and is calculated by the manufacturer per claim serviced under warranty. In this chapter, we suggest a free replacement warranty policy according to which a failed item, being non-repairable, is replaced by a new one free of charge. The replacement occurs according to a renewal process. The number of failures during the warranty period is calculated on the basis of a quasi-renewal process. We derive representative cost functions to evaluate the effectiveness of the policies. The rest of the chapter is structured as follows. Section 7.2 deals with the model description by stating the requisite assumptions and nomenclature. Section 7.3 provides the mathematical analysis of the model. In section 7.4, we describe the warranty policy with repair. In section 7.5, numerical results are provided. Finally, in section 7.6, the conclusion is drawn.


We provide the quasi-renewal analysis of the distribution function of the number of product failures of a multi-component repairable system within a warranty period w. A replacement service is made possible during the warranty period by introducing two quasi-renewal concepts based on (i) the altered quasi-renewal process and (ii) the mixed quasi-renewal process. Applying the quasi-renewal process, the cost analysis is performed for a K-out-of-N system consisting of N hardware components and one software component. The system components may fail individually or due to common cause failure. For modeling purposes, the following assumptions are made.

Assumptions

• The repair and replacement do not happen simultaneously.

• The repair cost and replacement cost are constant. Also, the repair time and replacement service time are negligible.

• All warranty claims are executed and are valid.

• The repairs are imperfect and the repair process can be modeled by a quasi-renewal process.

• The time to perform an inspection, which determines whether the failed component needs a repair or a replacement, is negligible.

• The warranty period is renewable.

Abbreviations and Nomenclature

pmf, pdf, cdf : Probability mass function, probability density function, cumulative distribution function.

QRP : Quasi-renewal process.

CV : Coefficient of variation.

i.i.d. : Independent and identically distributed.

r.v. : Random variable.

w : Length of a warranty period.

T : r.v. denoting time.

$\lambda_C$ : Constant common cause failure rate.

β : Parameter for QRP.

$N_{system}$ : The number of system failures.

$R_C(w)$ : The inter-failure time function of common cause.

$N_h(t), N_s(t), N_c(t)$ : The number of hardware failures, software failures and failures due to common cause in (0,t], respectively.

$f_h(.), F_h(.), R_h(.)$ : pdf, cdf and reliability function of the hardware failure time within a warranty period w, respectively.

$f_s(.), F_s(.), R_s(.)$ : pdf, cdf and reliability function of the software failure time within a warranty period w, respectively.

$f_{ih}(.), F_{ih}(.), R_{ih}(.)$ : pdf, cdf and reliability function of the hardware component failure times after the (i-1)th repair/replacement within a warranty period w, respectively.

$f_{is}(.), F_{is}(.), R_{is}(.)$ : pdf, cdf and reliability function of the software component failure times after the (i-1)th repair/replacement within a warranty period w, respectively.

C(w) : Warranty cost of the system for a warranty period w.

$c_h, c_s, c_0$ : Warranty cost for repairs/replacements within a warranty period w, for hardware, software and common cause failure, respectively.

7.3 The Analysis

Following Park and Pham (2008), the probability mass functions (pmf) of $N_h$ and $N_s$ are given by

$$P[N_h = n_h] = \left(\prod_{i=1}^{n_h} F_{ih}(w)\right) R_{(n_h+1)h}(w)$$ …(7.1)

and

$$P[N_s = n_s] = \left(\prod_{j=1}^{n_s} F_{js}(w)\right) R_{(n_s+1)s}(w)$$ …(7.2)

In this section we evaluate the expected number of system failures due to hardware component, software component and common cause failures. Under imperfect repair, the reliability functions for the hardware and software, respectively, are obtained as

$$R_h(w) = \prod_{i=1}^{n_h} R_{ih}(w) = \prod_{i=1}^{n_h}\left(1-\int_0^{w}\beta_{ih} f(\beta_{ih}x)\,dx\right)$$ …(7.3)

$$R_s(w) = \prod_{j=1}^{n_s} R_{js}(w) = \prod_{j=1}^{n_s}\left(1-\int_0^{w}\beta_{js} f(\beta_{js}x)\,dx\right)$$ …(7.4)



We also define

$$R_C(w) = 1 - F_C(w) = 1-\int_0^{w}\lambda_C f(\lambda_C x)\,dx$$ …(7.5)

7.3.1 Series System

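The reliability function of the system when the N hardware components are arranged in series is given by

$$R(w) = \left[\prod_{i=1}^{n_h} R_{ih}(w)\right]^{N}\prod_{j=1}^{n_s} R_{js}(w)\,R_c(w)$$

$$= \left[\prod_{i=1}^{n_h}\left(1-\int_0^{w}\beta_{ih} f(\beta_{ih}x)\,dx\right)\right]^{N}\prod_{j=1}^{n_s}\left(1-\int_0^{w}\beta_{js} f(\beta_{js}x)\,dx\right)\left(1-\int_0^{w}\lambda_c f(\lambda_c x)\,dx\right)$$ …(7.6)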

7.3.2 K-out-of-N System

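When the hardware components are arranged in a K-out-of-N system along with one software component, the reliability function is obtained as the probability of having at least K functioning hardware units out of N and the software in a functioning state, along with no failure due to common cause. Thus, we obtain

$$R(w) = \sum_{k=K}^{N}\binom{N}{k}\left[R_h(w)\right]^{k}\left[1-R_h(w)\right]^{N-k} R_s(w)\,R_c(w)$$

$$= \sum_{k=K}^{N}\binom{N}{k}\left[\prod_{i=1}^{n_h}\left(1-\int_0^{w}\beta_{ih} f(\beta_{ih}x)\,dx\right)\right]^{k}\left[1-\prod_{i=1}^{n_h}\left(1-\int_0^{w}\beta_{ih} f(\beta_{ih}x)\,dx\right)\right]^{N-k}\prod_{j=1}^{n_s}\left(1-\int_0^{w}\beta_{js} f(\beta_{js}x)\,dx\right)\left(1-\int_0^{w}\lambda_c f(\lambda_c x)\,dx\right)$$ …(7.7)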
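As a small numerical illustration of the compact form of (7.7), the sketch below evaluates the K-out-of-N reliability once the three reliabilities at the warranty period w are known. It is a minimal sketch under the assumption that the values of R_h(w), R_s(w) and R_c(w) have already been computed elsewhere; the function name and argument names are hypothetical.

```python
from math import comb


def k_out_of_n_reliability(r_h, r_s, r_c, n, k):
    """Compact form of (7.7): at least k of n identical hardware units work,
    the software works, and no common cause failure occurs.

    r_h, r_s, r_c are the (precomputed) values of R_h(w), R_s(w), R_c(w).
    """
    hardware = sum(comb(n, m) * r_h ** m * (1.0 - r_h) ** (n - m)
                   for m in range(k, n + 1))
    return hardware * r_s * r_c


if __name__ == "__main__":
    # 3-out-of-4 hardware configuration with illustrative reliabilities
    print(k_out_of_n_reliability(0.9, 0.95, 0.98, n=4, k=3))
```

Setting k = n recovers the series case, since only the term with all n units working remains in the sum.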

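The probability that the system is not working is given by

$$P(w) = \text{Prob.}(\text{system is not working})$$

$$= \sum_{k=0}^{K-1}\binom{N}{k}\left\{R_h(w)\right\}^{k}\left\{1-R_h(w)\right\}^{N-k} R_s(w)\,R_c(w) + \sum_{k=K}^{N}\binom{N}{k}\left\{R_h(w)\right\}^{k}\left\{1-R_h(w)\right\}^{N-k}\left[\bar{R}_s(w)\,R_c(w) + R_s(w)\,\bar{R}_c(w)\right]$$ …(7.8)

where $\bar{R}_s(w) = 1 - R_s(w)$ and $\bar{R}_C(w) = 1 - R_C(w)$.

Thus

$$
\begin{aligned}
P(w) ={}& \sum_{k=0}^{K-1}\binom{N}{k}\left[\prod_{i=1}^{n_h}\left(1-\int_0^{w}\beta_{ih} f(\beta_{ih}x)\,dx\right)\right]^{k}\left[1-\prod_{i=1}^{n_h}\left(1-\int_0^{w}\beta_{ih} f(\beta_{ih}x)\,dx\right)\right]^{N-k}\prod_{j=1}^{n_s}\left(1-\int_0^{w}\beta_{js} f(\beta_{js}x)\,dx\right)\left(1-\int_0^{w}\lambda_c f(\lambda_c x)\,dx\right)\\
&+ \sum_{k=K}^{N}\binom{N}{k}\left[\prod_{i=1}^{n_h}\left(1-\int_0^{w}\beta_{ih} f(\beta_{ih}x)\,dx\right)\right]^{k}\left[1-\prod_{i=1}^{n_h}\left(1-\int_0^{w}\beta_{ih} f(\beta_{ih}x)\,dx\right)\right]^{N-k}\\
&\quad\times\left\{\left[1-\prod_{j=1}^{n_s}\left(1-\int_0^{w}\beta_{js} f(\beta_{js}x)\,dx\right)\right]\left(1-\int_0^{w}\lambda_c f(\lambda_c x)\,dx\right) + \prod_{j=1}^{n_s}\left(1-\int_0^{w}\beta_{js} f(\beta_{js}x)\,dx\right)\int_0^{w}\lambda_c f(\lambda_c x)\,dx\right\}
\end{aligned}
$$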

Hence, the expected number of system failures due to hardware component failure is given by

$$E(N_h) = \sum_{n_h=1}^{\infty} n_h \sum_{k=0}^{K-1}\binom{N}{k}\left[R_h(w)\right]^{k}\left[1-R_h(w)\right]^{N-k} R_s(w)\,R_c(w)$$ …(7.9)

The expected number of system failures due to software failure is given by

$$E(N_s) = \sum_{n_s=1}^{\infty} n_s \sum_{k=K}^{N}\binom{N}{k}\left[R_h(w)\right]^{k}\left[1-R_h(w)\right]^{N-k} \bar{R}_s(w)\,R_c(w)$$ …(7.10)

The expected number of system failures due to common cause failure is given by

$$E(N_C) = \sum_{n_C=1}^{\infty} n_C \sum_{k=K}^{N}\binom{N}{k}\left[R_h(w)\right]^{k}\left[1-R_h(w)\right]^{N-k} R_s(w)\,\bar{R}_c(w)$$ …(7.11)

The expected number of system failures is obtained as

$$E(N_{system}) = E(N_h) + E(N_s) + E(N_C)$$ …(7.12)

Now we derive the variance of the number of system failures due to hardware component failures as follows:

The second moment of the number of system failures due to hardware failure is

$$E(N_h^{2}) = \sum_{n_h=1}^{\infty} n_h^{2} \sum_{k=0}^{K-1}\binom{N}{k}\left[R_h(w)\right]^{k}\left[1-R_h(w)\right]^{N-k} R_s(w)\,R_c(w)$$ …(7.13)

Therefore, the variance of the number of system failures due to failure of the hardware components is given by

$$Var(N_h) = E(N_h^{2}) - \left[E(N_h)\right]^{2}$$ …(7.14)

Similarly, we can obtain the variance of the number of system failures due to software component failure as

$$Var(N_S) = E(N_S^{2}) - \left[E(N_S)\right]^{2}$$ …(7.15)

where

$$E(N_s^{2}) = \sum_{n_s=1}^{\infty} n_s^{2} \sum_{k=K}^{N}\binom{N}{k}\left[R_h(w)\right]^{k}\left[1-R_h(w)\right]^{N-k} \bar{R}_s(w)\,R_c(w)$$ …(7.16)

Also

$$Var(N_C) = E(N_C^{2}) - \left[E(N_C)\right]^{2}$$ …(7.17)

The variance of the number of repair services for the system is given by

$$Var(N_{system}) = E(N_{system}^{2}) - \left[E(N_{system})\right]^{2}$$ …(7.18)

Let $c_h$, $c_S$ and $c_0$ be the repair cost per failure due to hardware failure, software failure and common cause failure, respectively. The expected warranty cost is given by

$$E(C(w)) = c_h E(N_h) + c_S E(N_S) + c_0 E(N_C)$$ …(7.19)

The variance of the system warranty cost can be obtained as

$$Var(C(w)) = c_h^{2}\,Var(N_h) + c_S^{2}\,Var(N_S) + c_0^{2}\,Var(N_C)$$ …(7.20)
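The cost formulae (7.19) and (7.20) combine the per-cause failure counts linearly, so once the means and variances of the three counts are available, the expected cost, standard deviation and coefficient of variation follow directly. The sketch below is illustrative only: the function name and the numerical inputs are hypothetical, and, as in (7.20), the three failure counts are treated as uncorrelated.

```python
from math import sqrt


def warranty_cost_moments(costs, means, variances):
    """Return (expected cost, variance, standard deviation, CV).

    costs     = (c_h, c_s, c_0): cost per hardware, software, common cause failure
    means     = (E(N_h), E(N_s), E(N_C))
    variances = (Var(N_h), Var(N_s), Var(N_C))
    """
    expected = sum(c * m for c, m in zip(costs, means))           # as in (7.19)
    variance = sum(c * c * v for c, v in zip(costs, variances))   # as in (7.20)
    sd = sqrt(variance)
    cv = sd / expected if expected > 0 else float("nan")
    return expected, variance, sd, cv


if __name__ == "__main__":
    # Hypothetical inputs chosen only to show the arithmetic
    print(warranty_cost_moments((100, 50, 200), (0.2, 0.1, 0.05), (0.3, 0.2, 0.1)))
```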



7.4 Warranty Policy with Repair

In this section we consider the warranty repair service and do not take into consideration the warranty replacement service. For illustration, we consider a dissimilar hardware system having a 3-out-of-4 configuration of hardware units along with one software unit. Let $x_1, x_2, x_3, x_4$ be the indicators denoting that the hardware components 1 through 4 are working, respectively. Also, $x_S$ and $x_C$ indicate that there is no failure due to software and common cause, respectively. $\bar{x}_i$ denotes the complementary event, so that $\bar{x}_i = 1 - x_i$.

Then the reliability of the system is given by

$$R = \left[P(x_1x_2x_3x_4) + P(\bar{x}_1x_2x_3x_4) + P(x_1\bar{x}_2x_3x_4) + P(x_1x_2\bar{x}_3x_4) + P(x_1x_2x_3\bar{x}_4)\right]P(x_S)\,P(x_C)$$ …(7.21)
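The five terms in (7.21) are exactly the component states in which at least three of the four dissimilar hardware units work. The following sketch enumerates those states numerically. It is a hedged illustration: the function name is hypothetical, the working probabilities of the units are assumed to be given, and the units are assumed to be statistically independent so that the joint probabilities factorize.

```python
from itertools import product


def three_out_of_four(p, p_s, p_c):
    """Reliability in the spirit of (7.21) for four dissimilar hardware units.

    p   = (p1, p2, p3, p4): probability that each hardware unit is working
    p_s = P(x_S), p_c = P(x_C): no software failure / no common cause failure
    Units are assumed independent (an assumption of this sketch).
    """
    total = 0.0
    for states in product((0, 1), repeat=4):
        if sum(states) >= 3:  # all four working, or exactly one failed
            prob = 1.0
            for up, p_i in zip(states, p):
                prob *= p_i if up else (1.0 - p_i)
            total += prob
    return total * p_s * p_c


if __name__ == "__main__":
    print(three_out_of_four((0.95, 0.9, 0.85, 0.8), p_s=0.97, p_c=0.99))
```

With identical units the result reduces to the binomial 3-out-of-4 expression, which gives a quick consistency check.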

( )[ ( ) ( ) ( ) ( ) ( ) ( ) ( )( ) ( ) ( ) ( ) ( ) ( ) ( ) ( )( ) ( ) ( ) ( )] ( ) ( )CCSS44332211

4433221144332211

4433221144332211

nN.PnNPnNPnNPnNPnNPnNPnNPnNPnNPnNPnNPnNPnNP

nNPnNPnNPnNPnNPnNPnNPnNPR

======+====+====+

====+=====

…(7.22)

where

[ ] ( ) ( ) ( )( ) 4,3,2,1h),x)P(P(xdx.xβfβ1dxxβfβnNP CS

x

0 h1nh1n

n

1i

x

0 ihihhh hh

h

=

−

== ∫∏ ∫ ++

=

Also )w(R)w(R)x)P(P(x CSCS =

λλ−

−= ∫∏ ∫

=

x

0CC

n

1j

x

0jsjs x)dxf(1x)dxf(ββ1

s

The expected warranty cost is given by:

$$E(C) = E(cN)$$

where $c = c_h + c_s + c_0$



7.4(i) K-R-out-of-N System

Let us consider a K-R-out-of-N system. The probability that the system is working is given by

$$P_{(\text{system working})} = \sum_{l=K}^{R}\binom{N}{l}\left[R_h(w)\right]^{l}\left[1-R_h(w)\right]^{N-l} R_s(w)\,R_c(w)$$ …(7.23)

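Now the probability that the system is not working is given by

$$
\begin{aligned}
P_{(\text{system not working})} ={}& \sum_{l=0}^{K-1}\binom{N}{l}\left[R_h(w)\right]^{l}\left[1-R_h(w)\right]^{N-l} R_s(w)\,R_c(w)\\
&+ \sum_{l=R+1}^{N}\binom{N}{l}\left[R_h(w)\right]^{l}\left[1-R_h(w)\right]^{N-l} R_s(w)\,R_c(w)\\
&+ \sum_{l=K}^{R}\binom{N}{l}\left[R_h(w)\right]^{l}\left[1-R_h(w)\right]^{N-l}\left[\bar{R}_s(w)\,R_c(w) + R_s(w)\,\bar{R}_c(w)\right]
\end{aligned}
$$ …(7.24)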

The expected number of system failures due to hardware component failure is given by

$$E(N'_h) = \sum_{n_h=1}^{\infty} n_h\left[\sum_{l=0}^{K-1}\binom{N}{l}\left[R_h(w)\right]^{l}\left[1-R_h(w)\right]^{N-l} R_s(w)\,R_c(w) + \sum_{l=R+1}^{N}\binom{N}{l}\left[R_h(w)\right]^{l}\left[1-R_h(w)\right]^{N-l} R_s(w)\,R_c(w)\right]$$ …(7.25)

The expected number of system failures due to software component failure is given by

$$E(N'_s) = \sum_{n_s=1}^{\infty} n_s \sum_{l=K}^{R}\binom{N}{l}\left[R_h(w)\right]^{l}\left[1-R_h(w)\right]^{N-l} \bar{R}_s(w)\,R_c(w)$$ …(7.26)

The expected number of system failures due to common cause is given by

$$E(N'_C) = \sum_{n_C=1}^{\infty} n_C \sum_{l=K}^{R}\binom{N}{l}\left[R_h(w)\right]^{l}\left[1-R_h(w)\right]^{N-l} R_s(w)\,\bar{R}_c(w)$$ …(7.27)

The expected number of system failures is given by

$$E(N) = E(N'_h) + E(N'_s) + E(N'_C)$$ …(7.28)



Now, with the help of the above results, we can calculate the variance of the number of system failures, the variance of the number of repair services, the expected warranty cost, the variance of the system warranty cost, etc. for the K-out-of-N system.


In this section, we provide numerical results, obtained by coding a computer program in 'MATLAB' software, to examine the validity and tractability of the analytical results of the proposed model by means of an illustration. We consider a 3-out-of-4 system and compute numerical results for the total expected cost, standard deviation and coefficient of variation of the system. The lifetime of the components follows the exponential distribution. The warranty cost and other parameters are taken as β1h=0.7, β2h=0.3, β3h=0.4, β4h=0.2, β6h=0.1, β8h=0.2, β9h=0.8, β12h=0.3, β6h=0.4, β20h=0.2, βs=0.8, λc=0.9, c=1500.

For different values of β1h, βs and λc, the total expected cost is shown in figs 7.1(i)-(iii). Figs 7.2(i)-(iii) and 7.3(i)-(iii) show the standard deviation and the coefficient of variation, respectively, with respect to the warranty period. From figs 7.1(i)-(iii), we can see that the total expected cost decreases with the warranty period, because the reliability has been increased by replacing hardware/software components, for which the cost is incurred only initially. From figs 7.1(ii) and 7.1(iii), it is clear that as βs and λc increase, the expected cost decreases, as expected.

The standard deviation graphs for different parameters with respect to the warranty period are displayed in figs 7.2(i)-(iii). The value of the standard deviation initially increases sharply and after some time it decreases slightly. In these figures, there is no significant effect of increasing the values of β1h and βs; however, in fig. 7.2(iii), on increasing λc, there is initially no change but after some time the standard deviation decreases.

In figs 7.3(i)-(iii), we exhibit the coefficient of variation (CV) with respect to the warranty period. The value of CV initially increases sharply and after some time it increases slowly. It is also seen that CV remains almost constant as β1h increases. But as βs and λc increase, the value of CV shows an increasing trend with respect to the warranty period.



7.6 Conclusion

A warranty provides an assurance to the customers regarding product quality. When a repairable item fails under warranty, the manufacturer has the option of either repairing the failed item or replacing it with a new one. The cost depends on several factors such as the reliability of the product, the warranty terms, the maintenance actions of the customers and the servicing strategies used by the manufacturer. There is huge scope for future research in this area. The warranty cost models have been developed at the component level by using renewal, replacement and repair. The quasi-renewal processes used provide more realistic results for the warranty cost model of K-out-of-N systems. The results for the reliability and warranty cost functions of our study may easily be used in practice and would be helpful for designing optimal warranty policies.


[Figure: three panels plotting expected cost EC (0 to 600) against warranty period w (0 to 5) for (i) β1h = 0.3, 0.5, 0.7; (ii) βs = 0.2, 0.4, 0.6; (iii) λc = 0.2, 0.4, 0.6]

Fig. 7.1: Expected cost vs warranty period by varying (i) β1h (ii) βS and (iii) λC


[Figure: three panels plotting standard deviation SD (0 to 0.35) against warranty period w (0 to 5) for (i) β1h = 0.3, 0.5, 0.7; (ii) βs = 0.3, 0.5, 0.7; (iii) λc = 0.5, 0.6, 0.7]

Fig. 7.2: Standard deviation vs warranty period by varying (i) β1h (ii) βS and (iii) λC


[Figure: three panels plotting coefficient of variation CV (0 to 50) against warranty period w (0 to 5) for (i) β1h = 0.2, 0.5, 0.8; (ii) βs = 0.4, 0.5, 0.6; (iii) λc = 0.3, 0.5, 0.7]

Fig. 7.3: Coefficient of variation vs warranty period by varying (i) β1h (ii) βS and (iii) λC

Semi-Markov Models with Common Cause Failure

Section-8A Redundant System with Rejuvenation

Section-8B Imperfect Fault Coverage System with Reboot

Chapter-8

Imperfect Fault Coverage System with Reboot

8A.1 Introduction


8A.3 Semi-Markov Analysis


8A.5 Total Expected Downtime Cost

8A.6 Conclusion

Section-8A


Software rejuvenation is a preventive maintenance technique

to prevent failures in continuously running systems that

experience software aging. In this section, a stochastic model is

developed to study the availability evaluation problem of

rejuvenation system. The availability of redundant system with

common cause failure and rejuvenation is obtained by using an

embedded Markov chain approach. A recursive procedure for

generating the state-transition probabilities is employed. The

appropriate framework for finding the optimal rejuvenation

interval is discussed by considering the downtime cost factors.

8A.1 Introduction

Today redundancy plays a dominant role in increasing system availability, especially where the system availability must be greater than that of the components used. In classical availability modeling of redundant systems, the failures of the redundant units are assumed to be independent; the occurrence of common cause failures violates this assumption. This is because common mode failures are multiple failures which occur because of a single initiating factor or cause. When this cause occurs, all other failures are triggered, constituting a complete system failure.

Subramanian and Anantharaman (1995) considered the reliability analysis of a

complex standby redundant system. Vaurio (2005) evaluated the uncertainties and

quantification of common cause failure rates and probabilities for the redundant

system analyses. Lu and Lewis (2006) suggested the reliability evaluation of standby

safety systems having independent and common cause failures. Shen et al. (2008)

proposed the exponential asymptotic property of a parallel repairable system with

warm standby under common-cause failure. Cekyay and Ozekici (2010) discussed the

mean time to failure and availability of semi Markov missions with maximal repair.

Hajeeh (2011) studied the availability for series configurations having standbys with

the existence of common cause failure of the system at all states.



System reliability engineering has advanced to the point where it is beginning to separate into specialized areas such as life cycle costing, reliability growth modeling, reliability optimization and others. Many redundant systems, such as the machine repair standby system, can be modelled by using a semi-Markov process. An assumption of exponential distributions is rather unrealistic, and if the system under consideration is not highly reliable, no asymptotic methods can be applied. Adke and Manjunath (1984) studied finite Markov processes. Sadek and Limnios (2005) suggested nonparametric estimation of reliability and survival function for continuous-time finite Markov processes. Wen and Li (2009) proposed a networked control system model based on minimum packet drop sequences with an embedded Markov chain. An embedded Markov chain approach to stock rationing was employed by Fadiloglu and Bulut (2010). Grabski (2011) also considered semi-Markov failure rate processes.

Rejuvenation is an appropriate technique to prevent a computer system from

failures by periodically performing the maintenance of its software. Huang et al.

(1995) studied the software rejuvenation and its applications. Park and Kim (2002)

discussed the availability analysis and improvement of active/standby cluster systems

using software rejuvenation. Xie et al. (2005) analyzed a two-level software

rejuvenation policy. Rinsaka and Dohi (2007) discussed a faster estimation algorithm

for periodic preventive rejuvenation schedule maximizing system availability.

Iwamoto et al. (2008) considered periodic software rejuvenation schedules under

discrete-time operation circumstance. Methods and opportunities for rejuvenation in

aging distributed software systems were studied by Avritzer et al. (2010).

The main purpose of this investigation is to examine the optimal time to

perform software rejuvenation which improves the availability, the downtime cost and

the dependability measures. The rest of the section is organized as follows. In Section

8A.2, the model description is presented. In Section 8A.3, three semi-Markov models

are explicitly described. In Section 8A.4, we derive some performance measures.

Total expected downtime cost of the proposed model is presented in Section 8A.5.

Finally, we conclude this section with a short discussion in Section 8A.6.


8A.2. Model Description

Consider a computer system with one redundant node that provides both automatic and manual switching between the primary and the standby unit. If the automatic switching fails, the system is switched to the standby unit manually. The time to resource exhaustion for the primary unit is modeled by an increasing failure rate (IFR) distribution, because software resources are exhausted at an increasing rate the longer the primary unit has been serving the system. We consider three different models for a two-unit redundant system with common cause failure. In the first model, there is no rejuvenation action. The second model incorporates full rejuvenation action, whereas the third model incorporates the concept of rejuvenation failure.

8A.2.1 Model without Rejuvenation

A system of two identical components, one active and one standby, is considered. When the active unit fails, the service is restored by switching to the standby unit. Figure 8A.1 presents the state transition diagram of the two-component redundant system without rejuvenation. Initially, the system is in state (1, 1), where the active unit is responsible for the service. The time until the active unit fails follows an IFR Weibull distribution F1(t). In general, we denote the probability distribution of the time needed for a given state transition by Fi(t). Let γ be the probability of automatic switching success; then γF1(t) governs the transition from state (1, 1) to state (0, 1). With probability (1-γ) the automatic switching fails, the system enters state F, and the service is switched to the standby unit manually. The time to enter state (0, 1) from state F is exponentially distributed with parameter β. Hence, when the primary unit fails, the system enters state (0, 1) either automatically or manually. System control is then switched to the standby unit, while the failed unit is repaired. Both primary and standby units may fail simultaneously due to a common cause failure, in which case the system moves from state (1, 1) to state (0, 0) with distribution FC(t) with parameter FC.

Fig. 8A.1. Redundant system without rejuvenation

After repair, one active unit and one standby unit are available. In this case,

the system returns to state (1, 1) with distribution F2(t) which is assumed to be Erlang

with parameters (KE, λE). If during the repair of the failed unit the serving unit also

fails, the system experiences a total failure and enters state (0, 0) with distribution

F1(t); as soon as the failed unit is repaired, the system returns to state (0, 1) with

distribution F2(t).

8A.2.2 Rejuvenation Model

In this model, we make the same assumptions as stated for the model without rejuvenation. The additional assumptions are as follows. The time to software rejuvenation follows a probability distribution F3(t). The system moves to the rejuvenation state (R, 1) by switching to the standby node automatically with probability γ3; if the automation mechanism fails, the system enters state FR with probability γ4 and the service control is switched manually to the standby unit while the primary unit is being rejuvenated. The state transition diagram of the system is depicted in fig. 8A.2.

Fig. 8A.2. Redundant system with rejuvenation

If a failure occurs on the active unit before software rejuvenation is triggered, then the system behaves in the same manner as when no rejuvenation action is taken. The distribution of the time for recovering from rejuvenation is F4(t), which is assumed to be an exponential distribution with parameter βR. The time to trigger rejuvenation has a fixed duration, F3(t) = u(t-tr), where u(t) is the unit step function and tr is the time to trigger rejuvenation.

8A.2.3 Failed Rejuvenation Model

Fig. 8A.3. Failed software rejuvenation model


Failed rejuvenation models the case in which a rejuvenation action is not completed properly or is performed improperly. In detail, failed rejuvenation indicates an abnormal function during the rejuvenation process. When failed rejuvenation occurs, the system enters a failure state. In this case, software rejuvenation is being performed and the system state is (R, 1); the rejuvenation may fail to complete, that is, a failure occurs in the unit being rejuvenated, resulting in a transition from state (R, 1) to state (0, 1). The time needed for this transition is assumed to be exponentially distributed with parameter λR. Fig. 8A.3 shows the transition diagram of the failed rejuvenation model.

8A.3. Semi-Markov Analysis

A semi-Markov process {X(t): t ≥ 0} is a stochastic process in which changes

of state occur according to a Markov chain and in which the time interval between

two successive transitions is a random variable whose distribution depends on the

state from which the transition takes place as well as the state to which the next

transition takes place.

8A.3.1 Embedded Markov Model

The embedded Markov model provides computational advantages for steady-state calculations while preserving the nature of the decision problem. In this sub-section, embedded Markov chains are used for all three semi-Markov models. The relevant characteristics of the semi-Markov models are considered at those epochs when the system state changes, and the time spent in a particular state can follow an arbitrary probability distribution.

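The steady-state probabilities of each state of the semi-Markov process (SMP) and the mean sojourn time in state i are:

\[ \pi_i = \frac{v_i h_i}{\sum_{j \in E} v_j h_j} \quad \text{and} \quad h_i = \int_0^{\infty} \bigl(1 - H_i(t)\bigr)\,dt, \quad i \in E \]  …(8A.1)

where Hi(t) is the sojourn time distribution of state i, vi is the steady-state probability of state i in the embedded Markov chain and E is the state space.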
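Equation (8A.1) can be illustrated with a minimal Python sketch. The 3-state transition matrix and the mean sojourn times below are hypothetical values chosen only for illustration; they are not the matrices of the three models in this section, and the function names are assumptions of the sketch.

```python
def stationary_embedded(P, iters=5000):
    """Stationary vector v (v = vP) of a row-stochastic matrix P.

    Uses the 'lazy' update v <- (v + vP) / 2, which has the same fixed
    point as v <- vP and also converges for periodic chains.
    """
    n = len(P)
    v = [1.0 / n] * n
    for _ in range(iters):
        vP = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        v = [0.5 * (old + new) for old, new in zip(v, vP)]
    return v


def smp_steady_state(P, h):
    """pi_i = v_i h_i / sum_j v_j h_j, as in (8A.1)."""
    v = stationary_embedded(P)
    weights = [vi * hi for vi, hi in zip(v, h)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical embedded chain and mean sojourn times (illustrative only).
P_example = [[0.0, 0.9, 0.1],
             [0.7, 0.0, 0.3],
             [0.0, 1.0, 0.0]]
h_example = [10.0, 2.0, 1.0]
pi_example = smp_steady_state(P_example, h_example)
```

States with long sojourn times receive more weight than the embedded chain alone would give them, which is why the mean sojourn times hi have to be computed for every state before the availability can be evaluated.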


8A.3.2 Model without rejuvenation

Before the mean sojourn times, and consequently the limiting distribution of the SMP, can be computed, the sojourn time distributions have to be determined. We consider the sojourn times in states (1,1) and (0,1) as follows. Let H1,1 be the sojourn time distribution in state (1,1); the sojourn time is the minimum of three random variables X1, X2 and V, where X1 and X2 follow the distribution F1(t) and V follows FC(t). Thus

\[ H_{1,1}(t) = \Pr(X_{\min} \le t) = \Pr(\min\{X_1, X_2, V\} \le t) \]  …(8A.2)

\[ = 1 - \Pr(\min\{X_1, X_2, V\} > t) = 1 - \bigl(1 - F_1(t)\bigr)^2 \bigl(1 - F_C(t)\bigr) \]

The sojourn time distribution in state (0, 1) is obtained as

\[ H_{0,1}(t) = 1 - \bigl(1 - F_1(t)\bigr)\bigl(1 - F_2(t)\bigr) \]  …(8A.3)

The mean sojourn times, from which the steady-state probabilities of each state of the SMP follow, can be computed as

\[ h_{1,1} = \int_0^{\infty} \bigl(1 - F_1(t)\bigr)^2 \bigl(1 - F_C(t)\bigr)\,dt, \quad h_{0,1} = \int_0^{\infty} \bigl(1 - F_1(t)\bigr)\bigl(1 - F_2(t)\bigr)\,dt, \]

Also

\[ h_F = \int_0^{\infty} \bigl(1 - (1 - e^{-\beta t})\bigr)\,dt, \quad h_{0,0} = \int_0^{\infty} \bigl(1 - F_2(t)\bigr)\,dt \]  …(8A.4)
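The integrals above rarely have closed forms when F1(t) is Weibull, so they are usually evaluated numerically. The following Python sketch does this with a plain trapezoidal rule. All parameter values (Weibull shape and scale, Erlang parameters, an exponential common cause distribution with rate LAMBDA_C, and β) are hypothetical and serve only to show the computation; the exponential form of FC(t) is an assumption of the sketch.

```python
import math


def weibull_sf(t, shape, scale):
    """Survival function 1 - F1(t) of a Weibull distribution."""
    return math.exp(-((t / scale) ** shape))


def erlang_sf(t, k, rate):
    """Survival function 1 - F2(t) of an Erlang(k, rate) distribution."""
    return math.exp(-rate * t) * sum((rate * t) ** n / math.factorial(n) for n in range(k))


def exp_sf(t, rate):
    """Survival function of an exponential distribution."""
    return math.exp(-rate * t)


def integrate(f, upper, steps=20000):
    """Composite trapezoidal rule on [0, upper]."""
    dt = upper / steps
    inner = sum(f(i * dt) for i in range(1, steps))
    return dt * (0.5 * (f(0.0) + f(upper)) + inner)


# Hypothetical parameter values, chosen only for illustration.
SHAPE, SCALE = 2.0, 100.0      # Weibull F1 (shape > 1 gives an IFR law)
K_E, LAMBDA_E = 2, 0.5         # Erlang F2
LAMBDA_C = 0.001               # assumed exponential FC
BETA = 2.0                     # manual switching rate

h_11 = integrate(lambda t: weibull_sf(t, SHAPE, SCALE) ** 2 * exp_sf(t, LAMBDA_C), 1000.0)
h_01 = integrate(lambda t: weibull_sf(t, SHAPE, SCALE) * erlang_sf(t, K_E, LAMBDA_E), 200.0)
h_F = integrate(lambda t: exp_sf(t, BETA), 50.0)
h_00 = integrate(lambda t: erlang_sf(t, K_E, LAMBDA_E), 200.0)
```

The upper integration limits are truncation points beyond which the integrands are negligible for these parameter values; they would have to be adjusted for other parameters.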

8A.3.3 Rejuvenation Model

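Let X ~ F1(t), Y ~ F1(t), Z ~ F3(t), W ~ F3(t) and V ~ FC(t) be the random variables denoting the time required for the change of state (1,1) to states (0, 1), F, FR, (R, 1) and (0, 0), respectively. The next step in the SMP analysis is to compute the sojourn time at each state. For this model, the sojourn time distribution of state (1, 1) is given by

\[ H_{1,1}(t) = \Pr(X_{\min} \le t) = \Pr(\min\{X, Y, Z, W, V\} \le t) \]

\[ = 1 - \bigl(1 - F_1(t)\bigr)^2 \bigl(1 - F_3(t)\bigr)^2 \bigl(1 - F_C(t)\bigr) \]  …(8A.5)

H0,1 is the same as in equation (8A.3).

The mean sojourn time of state (1,1) is obtained as

\[ h_{1,1} = \int_0^{\infty} \bigl(1 - F_1(t)\bigr)^2 \bigl(1 - F_3(t)\bigr)^2 \bigl(1 - F_C(t)\bigr)\,dt \]

\[ = \int_0^{t_r} \bigl(1 - F_1(t)\bigr)^2 \bigl(1 - F_C(t)\bigr)\,dt \]

because the deterministic trigger F3(t) = u(t-tr) makes the factor (1 - F3(t))^2 equal to 1 for t < tr and 0 for t ≥ tr.

Similarly we obtain the mean sojourn times of states (0, 1), F, (0, 0), FR and (R, 1) as follows:

\[ h_{0,1} = \int_0^{\infty} \bigl(1 - F_1(t)\bigr)\bigl(1 - F_2(t)\bigr)\,dt \]  …(8A.6)

\[ h_F = \int_0^{\infty} \bigl(1 - (1 - e^{-\beta t})\bigr)\,dt \]  …(8A.7)

\[ h_{0,0} = \int_0^{\infty} \bigl(1 - F_2(t)\bigr)\,dt \]  …(8A.8)

\[ h_{FR} = \int_0^{\infty} \bigl(1 - (1 - e^{-a t})\bigr)\,dt \]  …(8A.9)

\[ h_{R,1} = \int_0^{\infty} \bigl(1 - (1 - e^{-\beta_R t})\bigr)\,dt \]  …(8A.10)

Using equation (8A.1), the steady-state distribution of the SMP can then be computed from that of the embedded Markov chain.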
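With a deterministic trigger F3(t) = u(t-tr), the factor (1 - F3(t))^2 vanishes for t ≥ tr, so the integral for h1,1 effectively runs over [0, tr] only. The sketch below, under the assumption of a Weibull F1(t) and an exponential FC(t) with hypothetical parameters, shows how h1,1 depends on the rejuvenation trigger time tr.

```python
import math


def h11_with_rejuvenation(t_r, shape, scale, lambda_c, steps=20000):
    """Mean sojourn time in state (1, 1) when rejuvenation is triggered at t_r.

    Integrates (1 - F1(t))^2 (1 - FC(t)) over [0, t_r] with the trapezoidal
    rule, for a Weibull F1 and an exponential FC (assumptions of this sketch).
    """
    def integrand(t):
        return math.exp(-2.0 * (t / scale) ** shape) * math.exp(-lambda_c * t)

    dt = t_r / steps
    inner = sum(integrand(i * dt) for i in range(1, steps))
    return dt * (0.5 * (integrand(0.0) + integrand(t_r)) + inner)


# Hypothetical parameters: Weibull shape 2, scale 100, common cause rate 0.001.
trigger_times = (10.0, 50.0, 100.0, 500.0)
values = [h11_with_rejuvenation(t_r, 2.0, 100.0, 0.001) for t_r in trigger_times]
```

A short trigger time keeps the system in state (1, 1) only briefly before each rejuvenation, while a long one approaches the no-rejuvenation value; this trade-off is what the optimal rejuvenation interval balances.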

8A.3.4 Failed Rejuvenation Model

In this model we assume that the rejuvenation is properly completed with probability q and fails with probability 1-q, leading the system to state (0, 1). The sojourn time distribution and the mean sojourn time in the rejuvenation state (R, 1) change, because there is one more transition, from state (R, 1) to state (0, 1); the time for this transition is assumed to be exponentially distributed with parameter λR. Thus HR,1(t) and hR,1 can be determined as:

\[ H_{R,1}(t) = 1 - \exp\bigl(-(\beta_R + \lambda_R)\,t\bigr) \]  …(8A.11)

and

\[ h_{R,1} = \frac{1}{\beta_R + \lambda_R + F_C} \]  …(8A.12)

8A.4. Performance Measures

Let U denote the set of up states, depicted by shaded nodes. The asymptotic availability (A) can be obtained using

\[ A = \sum_{i \in U} \pi_i \]  …(8A.13)

where πi is the steady-state probability of state i.

The availability for model 1 is given by:

\[ A_{\text{without rej}} = \pi_{1,1} + \pi_{0,1} \]  …(8A.14)

Furthermore, for model 2 and model 3, we obtain the availability by using

\[ A_{\text{with rej}} = \pi_{1,1} + \pi_{0,1} + \pi_{R,1} \]  …(8A.15)

8A.5. Total Expected Downtime Cost

The total expected downtime cost in the steady state in a time interval of L time units is computed by

\[ E\{TC(L)\} = \lim_{t \to \infty} E[g(X_t)] \times L = \lim_{t \to \infty} \sum_{i \in E} w(i)\,\Pr(X_t = i) \times L \]

\[ = \sum_{i \in E} w(i) \lim_{t \to \infty} \Pr(X_t = i) \times L = \sum_{i \in E} w(i)\,\pi_i \times L \]  …(8A.16)

where w(i) denotes the reward function for state i.

Let us assume that the average costs per unit of downtime for the automation mechanism failure and for the repair procedure are denoted by cA and cR, respectively. According to (8A.16), the expected total downtime costs for the three models are:

(i) Model without rejuvenation

\[ E\{TC_1(L)\} = \bigl(c_A\,\pi_F + c_R\,\pi_{0,0}\bigr) \times L \]  …(8A.17)

(ii) Model with rejuvenation

\[ E\{TC_2(L)\} = \bigl\{c_A\,(\pi_F + \pi_{FR}) + c_R\,\pi_{0,0}\bigr\} \times L \]  …(8A.18)

(iii) Failed rejuvenation model

\[ E\{TC_3(L)\} = \bigl\{c_A\,(\pi_F + \pi_{FR}) + c_R\,\pi_{0,0}\bigr\} \times L \]  …(8A.19)
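The cost expressions (8A.17)-(8A.19) share one structure, which the following small Python sketch makes explicit. The state labels used as dictionary keys and all numerical values are hypothetical; in practice the steady-state probabilities would come from equation (8A.1).

```python
def expected_downtime_cost(pi, c_A, c_R, L):
    """Total expected downtime cost over L time units, following (8A.17)-(8A.19).

    pi maps state labels to steady-state probabilities. The labels "F",
    "FR" and "(0,0)" are naming assumptions of this sketch; a model without
    rejuvenation simply has no "FR" entry.
    """
    switching_down = pi.get("F", 0.0) + pi.get("FR", 0.0)
    return (c_A * switching_down + c_R * pi["(0,0)"]) * L


# Hypothetical steady-state probabilities and unit costs.
pi_model1 = {"F": 0.010, "(0,0)": 0.002}
pi_model2 = {"F": 0.010, "FR": 0.005, "(0,0)": 0.002}
cost_model1 = expected_downtime_cost(pi_model1, c_A=100.0, c_R=500.0, L=1000.0)
cost_model2 = expected_downtime_cost(pi_model2, c_A=100.0, c_R=500.0, L=1000.0)
```

Evaluating such a function over a grid of trigger times tr, with the probabilities recomputed for each tr, is one simple way to search for a cost-minimizing rejuvenation interval.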

In practice, system analysts and decision makers may be interested in deriving the optimal rejuvenation interval tr that minimizes the total expected downtime cost for each of the presented rejuvenation models. By comparing the total expected downtime cost of all the models, one can decide on an appropriate rejuvenation schedule, in particular when dealing with large applications.

8A.6. Conclusion

We have presented an embedded Markov chain approach for the prediction of the asymptotic availability and the expected total downtime cost of a software rejuvenation model of a redundant computer system with one active and one standby unit. The performance indices established can be further used to determine the optimal rejuvenation policy, in particular when failed rejuvenation takes place.

Section-8B

Imperfect Fault Coverage System with Reboot

8B.1 Introduction

8B.2 Model Description

8B.3 The Steady State Availability

8B.4 Special Cases

8B.5 Numerical Results

8B.6 Concluding Remarks

The availability of a two-unit system is studied under different prior assumptions, such as imperfect fault coverage, reboot and common cause failure. The solution of the semi-Markov model is obtained by using the supplementary variable technique (SVT). Explicit expressions for the availability and failure frequency of the system are derived for some special distributions of repair time, namely the exponential, gamma and uniform distributions. A numerical simulation has been carried out to explore the effect of different distributions on the availability.

8B.1 Introduction

Availability is an important concept for the planning, design and operation

stages of various complex systems. It is defined as the percentage of time that a

system is available to perform its required functions. Redundancy, repair maintenance

and preventive maintenance are some of the well-known methods by which the

availability of a system can be enhanced. The stochastic models with standby units

have been widely studied by various researchers. Some important contributions in

this direction are due to Mahmoud et al. (1987) and Verma and Chari (1991).

Yadavalli et al. (2002) found the asymptotic confidence limits for the steady state

availability of a two unit parallel system, with the assumption that the repair facility is

not available for a random time after each repair completion. The availability of a K-

out-of-N system, given limited spares and repair capacity under a condition based

maintenance strategy with warm standby system was studied by Smidt-Destombesa et

al. (2004) and Zhang et al. (2006). Wang et al. (2006) compared different system

configurations with warm standby components and standby switching failures. Based

on reliability analysis, they developed the explicit expressions for the mean time-to-

failure (MTTF) and the steady-state availability for different configurations.

Kiureghian et al. (2007) considered the availability, reliability and downtime of a

system with repairable components. Cekyay and Ozekici (2010) obtained the mean

time to failure and availability of semi-Markov missions with maximal repair as these



are frequently used in modern technology. Recently, Moghaddass and Zuo (2011)

discussed the optimal design of a repairable k-out-of-n system considering

maintenance.

The subject of common cause failures has been receiving significant attention

of the researchers over the past two decades. It has been realized that in order to

predict realistic availability of standby systems, the occurrence of common-cause

failures must be considered. A common-cause failure is defined as any instance where

multiple units or elements fail due to a single cause. These types of failures could

occur owing to equipment design deficiencies, abnormal environment, external

catastrophe, common power source, common manufacturer, etc. Whether for predicting

the behavior of new designs or studying possible changes in existing ones, modeling

of redundant repairable systems with common-cause failures is an important topic for

investigation. Some notable works in this area are as follows.

Chung (1981) studied a k-out-of-N: G three-state unit redundant system with

common-cause failure and replacements. To increase the system availability, the

switching time of the warm standby unit scheduled with common cause failure was

analyzed by Singh (1989). Dhillon and Anudu (1993) performed the common-cause

failure analysis of a non-identical unit parallel system with arbitrarily distributed

repair times. Human error and common-cause failure modeling of a two-unit multiple

systems was explored by El-Damcese (1997) and Atwood and Kelly (2008). Xing

(2007) and El-Damcese (2009) analyzed warm standby system subject to common

cause failures with time varying failure and repair rates. Ke et al. (2010) studied

simulation inferences for an availability system with general repair distribution and

imperfect fault coverage. Hsu et al. (2011b) discussed standby system with general

repair, reboot delay, switching failure and unreliable repair facility.

For highly reliable systems, coverage has a significant effect on the system’s

availability. However, some failures can remain undetected or uncovered, which can

lead to the system failure. Examples of the effect of uncovered faults can be found in

computing systems, electrical power systems, distribution networks, pipelines

carrying dangerous materials, etc. Systems subject to imperfect fault-coverage may

fail even prior to the exhaustion of standbys due to uncovered component failures.

Therefore, it is important to consider the effects of imperfect fault-coverage in the



design and analysis of these systems. Further, the effects of fault-coverage also play a

key role in electrical power distribution, dangerous fluid transportation, and several

standby redundancy applications. The reliability analysis of K-out-of-M: G systems

with dependent failures and imperfect coverage was studied by Moustafa (1997),

Amari et al. (2004), Myers (2007). Xing and Dugan (2002) showed that the

reliability of the systems subject to imperfect fault-coverage decreases after a certain

level of active redundancy. The systems with imperfect fault coverage have been

intensively studied by Vieira and Madeira (2004). Tang and Lee (2005) described a

simple recovery strategy for economic lot scheduling problem. Chang et al. (2005)

evaluated the reliability and other important measures for multistate systems subject

to imperfect fault coverage. Wang and Chiu (2006b) developed the steady-state

availability systems with warm standby units and imperfect coverage along with cost

benefit analysis. Levitin (2007) and Levitin and Amari (2008) explored a block

diagram method for analyzing the multi-state systems with uncovered failures.

Therefore, it is important to consider the effects of imperfect fault coverage in

designing these systems. Ke et al. (2008) studied the system characteristics of a two-

unit repairable system with different types of priors assumed for detection,

recovery time, reboot delay and the coverage factor of an operating unit. Kuniewski

et al. (2009) considered a sampling inspection for the evaluation of time dependent

reliability of deteriorating systems under imperfect defect detection. The aggregated

semi-Markov repairable system with history-dependent up and down states was

examined by Wang and Cui (2011).

In this investigation, we present the availability analysis of a two-unit system with warm standbys and imperfect fault coverage, incorporating the concepts of reboot and common cause failures. This section is arranged in the following manner. The model under consideration is described in section 8B.2. In section 8B.3, we establish the steady state availability of the system. Some cases of specific repair time distributions are considered in section 8B.4. In section 8B.5, the proposed model is illustrated numerically, and the effects of various system parameters are examined through a sensitivity analysis. The findings and notable features of our investigation are summarized in section 8B.6.


8B.2 Model Description

We consider a redundant repairable system in which two components are active. If any unit fails, the system immediately performs a reconfiguration operation, which takes negligible time. The reconfiguration operation detects and removes the failed unit from the system, while the other operating unit continues to operate as before. The probability of a successful reconfiguration operation is defined as the coverage factor C. After recovery, the failed component is ready for repair. Active components are considered repairable. It is assumed that each of the active components fails independently of the others according to a Poisson process with rate λ. The system fails either due to a common cause failure or when all of its units fail. The inter-failure time of common cause failures and the recovery time are assumed to be exponentially distributed with rates λS and θ, respectively. The reboot time is also exponentially distributed with parameter β. A general distribution is considered for the repair time of the units.

In figure 8B.1, the state transition diagram of the redundant repairable system is shown, in which two active components are initially fully working in state (2). When one of the active components fails, with probability C a protection switch successfully restores service by switching to the other component, and the system enters state (1). With probability (1-C) the protection switch fails to cover the failure of the active component and the system enters state (4). We assume that the active component failure in state (4) is cleared by a reboot, and the reboot delay for an active unit follows an exponential distribution with rate β. The failure of the active component is detected immediately with probability C, and when this happens, the system enters state (1). If the failure of the active component is not detected, which happens with probability (1-C), the system enters state (3). There is a latent fault in the spare component when the system is in state (3). If a component failure occurs when the system is in state (1), the system fails and enters state (0). When a component switches over successfully, its failure characteristics become those of the active component. If a unit is inactive, it is immediately sent to the repair facility, where units are repaired one at a time in order of breakdown. The system can also fail with the common cause failure rate λS. Let the times-to-repair of the components be independent and identically distributed random variables following a general distribution with C.D.F. B(x), (x > 0), and hazard rate b(x). In this section, we provide a closed form solution for the availability and mean time to failure. In order to improve the system availability, we should not only add redundancy but also improve the coverage factor.

8B.3 The Steady State Availability

In this section we evaluate the availability of the redundant system having two

units. The state transition diagram is depicted in fig. 8B.1. Now we construct the

steady-state equations by balancing the flow rates as follows (see transition diagram

in fig. 8B.1):

Fig. 8B.1: State transition diagram for two unit system

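\[ 0 = -\beta P_4 + 2\lambda (1 - C) P_2 \]  …(8B.1)

\[ 0 = -\theta P_3 + 2\lambda C P_2 \]  …(8B.2)

\[ 0 = -(2\lambda + \lambda_S) P_2 + P_1(0) \]  …(8B.3)

\[ -\frac{dP_1(x)}{dx} = -\lambda P_1(x) + \theta\, b(x) P_3 + \beta\, b(x) P_4 + b(x) P_0(0) \]  …(8B.4)

\[ -\frac{dP_0(x)}{dx} = \lambda P_1(x) + \lambda_S P_2 \]  …(8B.5)

Solving (8B.1), (8B.2) and (8B.3), we get

\[ P_4 = \frac{2\lambda (1 - C)}{\beta} P_2 = b P_2 \]  …(8B.6)

\[ P_3 = \frac{2\lambda C}{\theta} P_2 = a P_2 \]  …(8B.7)

\[ P_1(0) = (2\lambda + \lambda_S) P_2 = \Lambda P_2 \]  …(8B.8)

where \( \Lambda = 2\lambda + \lambda_S \), \( a = \dfrac{2\lambda C}{\theta} \), \( b = \dfrac{2\lambda (1 - C)}{\beta} \).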

Taking Laplace transforms of (8B.4) and (8B.5) and using (8B.6) and (8B.7), we have

\[ (\lambda - s) P_1^*(s) = 2\lambda B^*(s) P_2 + B^*(s) P_0(0) - P_1(0) \]  …(8B.9)

\[ -s P_0^*(s) = \lambda P_1^*(s) + \lambda_S P_2^*(s) - P_0(0) \]  …(8B.10)

Substituting s = λ and s = 0 in (8B.9), we get

\[ 2\lambda B^*(\lambda) P_2 + B^*(\lambda) P_0(0) = P_1(0) \]  …(8B.11)

\[ \lambda P_1^*(0) = 2\lambda B^*(0) P_2 + B^*(0) P_0(0) - P_1(0) \]

\[ \lambda P_1 = 2\lambda P_2 + P_0(0) - P_1(0) \]  …(8B.12)

Differentiating (8B.9) w.r.t. 's' and then putting s=0, we obtain

\[ \lambda \frac{dP_1^*(0)}{ds} - P_1^*(0) = -2\lambda b_1 P_2 - b_1 P_0(0) \]

\[ \lambda \frac{dP_1^*(0)}{ds} = P_1 - 2\lambda b_1 P_2 - b_1 P_0(0) \]  …(8B.13)

where b1 is the mean repair time.

Substituting s=0 in (8B.10), we get

\[ \lambda P_1 + \lambda_S P_2 = P_0(0) \]  …(8B.14)

Differentiating (8B.10) w.r.t. 's' and then substituting s=0, we obtain

\[ \lambda \frac{dP_1^*(0)}{ds} = -P_0 + \lambda_S b_1 P_2 \]  …(8B.15)

Using (8B.13) and (8B.15), we get

\[ b_1 P_0(0) = P_0 + P_1 - \Lambda b_1 P_2 \]  …(8B.16)

Therefore

\[ P_0(0) = \frac{\Lambda - 2\lambda B^*(\lambda)}{B^*(\lambda)} P_2 \]  …(8B.17)

Now equation (8B.12) gives

\[ P_1 = \frac{\Lambda \bigl(1 - B^*(\lambda)\bigr)}{\lambda B^*(\lambda)} P_2 \]  …(8B.18)

Again equation (8B.16) yields

\[ P_0 = \frac{-\Lambda \bigl(1 - B^*(\lambda)\bigr) + \lambda b_1 \bigl(\Lambda + \lambda_S B^*(\lambda)\bigr)}{\lambda B^*(\lambda)} P_2 \]  …(8B.19)

Normalizing condition gives

\[ P_0 + P_1 + P_2 + P_3 + P_4 = 1 \]  …(8B.20)

Finally solving equations (8B.6), (8B.7), (8B.18) and (8B.19), we obtain the probabilities of the system states as

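\[ P_0 = \frac{\lambda b_1 \bigl\{\Lambda + \lambda_S B^*(\lambda)\bigr\} - \Lambda \bigl(1 - B^*(\lambda)\bigr)}{\lambda \bigl[\, b_1 \bigl\{\Lambda + \lambda_S B^*(\lambda)\bigr\} + (1 + a + b) B^*(\lambda) \bigr]} \]  …(8B.21a)

\[ P_1 = \frac{\Lambda \bigl(1 - B^*(\lambda)\bigr)}{\lambda \bigl[\, b_1 \bigl\{\Lambda + \lambda_S B^*(\lambda)\bigr\} + (1 + a + b) B^*(\lambda) \bigr]} \]  …(8B.21b)

\[ P_2 = \frac{B^*(\lambda)}{b_1 \bigl\{\Lambda + \lambda_S B^*(\lambda)\bigr\} + (1 + a + b) B^*(\lambda)} \]  …(8B.21c)

\[ P_3 = \frac{a\, B^*(\lambda)}{b_1 \bigl\{\Lambda + \lambda_S B^*(\lambda)\bigr\} + (1 + a + b) B^*(\lambda)} \]  …(8B.21d)

\[ P_4 = \frac{b\, B^*(\lambda)}{b_1 \bigl\{\Lambda + \lambda_S B^*(\lambda)\bigr\} + (1 + a + b) B^*(\lambda)} \]  …(8B.21e)

where \( \Lambda = 2\lambda + \lambda_S \) and a, b are the constants of (8B.7) and (8B.6).

The system availability is given by

\[ A(\infty) = P_1 + P_2 + P_3 \]

\[ = \frac{\Lambda + \bigl\{\lambda (1 + a) - \Lambda\bigr\} B^*(\lambda)}{\lambda \bigl[\, b_1 \bigl\{\Lambda + \lambda_S B^*(\lambda)\bigr\} + (1 + a + b) B^*(\lambda) \bigr]} \]  …(8B.22)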

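Failure frequency of the system is

\[ f = (\lambda + \lambda_S) P_0 + 2\lambda C P_4 \]

\[ = \frac{(\lambda + \lambda_S) \bigl[ \lambda b_1 \bigl\{\Lambda + \lambda_S B^*(\lambda)\bigr\} - \Lambda \bigl(1 - B^*(\lambda)\bigr) \bigr] + 2\lambda^2 C\, b\, B^*(\lambda)}{\lambda \bigl[\, b_1 \bigl\{\Lambda + \lambda_S B^*(\lambda)\bigr\} + (1 + a + b) B^*(\lambda) \bigr]} \]  …(8B.23)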

8B.4 Special Cases

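In this section we discuss special cases corresponding to three repair time distributions, viz. the exponential, gamma and uniform distributions. The system performance indices for each case are established by setting B*(λ) and b1 as given below:

(i) Exponential Distribution-

In this case, we have

\[ B^*(\lambda) = \frac{\mu}{\lambda + \mu}, \quad b_1 = \frac{1}{\mu} \]

Now equation (8B.22) yields

\[ A(\infty) = \frac{\Lambda \mu (\lambda + \mu) + \mu^2 \bigl[\lambda (1 + a) - \Lambda\bigr]}{\lambda \bigl\{ \Lambda (\lambda + \mu) + \mu \bigl[\lambda_S + \mu (1 + a + b)\bigr] \bigr\}} \]  …(8B.24)

Failure frequency of the system is obtained as

\[ f = (\lambda + \lambda_S) P_0 + 2\lambda C P_4 = \frac{(\lambda + \lambda_S)(\lambda \Lambda + \lambda_S \mu) + 2\lambda C\, b\, \mu^2}{\Lambda (\lambda + \mu) + \mu \bigl[\lambda_S + \mu (1 + a + b)\bigr]} \]  …(8B.25)

where a and b are the constants defined in (8B.7) and (8B.6).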

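(ii) Gamma Distribution-

For gamma distribution, we have

\[ B^*(\lambda) = \left( \frac{r\mu}{\lambda + r\mu} \right)^{r}, \quad b_1 = \frac{1}{\mu} \]

Then equation (8B.22) provides

\[ A(\infty) = \frac{\Lambda \mu (\lambda + r\mu)^r + \mu (r\mu)^r \bigl[\lambda (1 + a) - \Lambda\bigr]}{\lambda \bigl\{ \Lambda (\lambda + r\mu)^r + (r\mu)^r \bigl[\lambda_S + \mu (1 + a + b)\bigr] \bigr\}} \]  …(8B.26)

Failure frequency of the system is

\[ f = (\lambda + \lambda_S) P_0 + 2\lambda C P_4 \]

\[ = \frac{(\lambda + \lambda_S) \bigl[ \Lambda (\lambda - \mu)(\lambda + r\mu)^r + (\lambda \lambda_S + \mu \Lambda)(r\mu)^r \bigr] + 2\lambda^2 C\, b\, \mu (r\mu)^r}{\lambda \bigl\{ \Lambda (\lambda + r\mu)^r + (r\mu)^r \bigl[\lambda_S + \mu (1 + a + b)\bigr] \bigr\}} \]  …(8B.27)

where a and b are the constants defined in (8B.7) and (8B.6).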

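(iii) Uniform Distribution-

By setting

\[ B^*(\lambda) = \frac{e^{-\lambda a} - e^{-\lambda b}}{\lambda (b - a)}, \quad b_1 = \frac{a + b}{2} \]

in equation (8B.22), we get the availability for uniform distribution as

\[ A(\infty) = \frac{2\Lambda \lambda (b - a) + 2 \bigl(e^{-\lambda a} - e^{-\lambda b}\bigr) \bigl[\lambda (1 + a) - \Lambda\bigr]}{\lambda \bigl\{ (a + b) \bigl[\Lambda \lambda (b - a) + \lambda_S \bigl(e^{-\lambda a} - e^{-\lambda b}\bigr)\bigr] + 2 (1 + a + b) \bigl(e^{-\lambda a} - e^{-\lambda b}\bigr) \bigr\}} \]  …(8B.28)

Also the failure frequency is

\[ f = (\lambda + \lambda_S) P_0 + 2\lambda C P_4 \]

\[ = \frac{(\lambda + \lambda_S) \bigl[ \lambda (a + b) \bigl\{\Lambda \lambda (b - a) + \lambda_S \bigl(e^{-\lambda a} - e^{-\lambda b}\bigr)\bigr\} - 2\Lambda \bigl\{\lambda (b - a) - \bigl(e^{-\lambda a} - e^{-\lambda b}\bigr)\bigr\} \bigr] + 4\lambda^2 C\, b \bigl(e^{-\lambda a} - e^{-\lambda b}\bigr)}{\lambda \bigl\{ (a + b) \bigl[\Lambda \lambda (b - a) + \lambda_S \bigl(e^{-\lambda a} - e^{-\lambda b}\bigr)\bigr] + 2 (1 + a + b) \bigl(e^{-\lambda a} - e^{-\lambda b}\bigr) \bigr\}} \]  …(8B.29)

Here a and b in B*(λ), in b1 and in the factors (b - a) and (a + b) are the end points of the uniform repair time interval, whereas a and b in the factors (1 + a), (1 + a + b) and in the term containing C b are the constants defined in (8B.7) and (8B.6).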
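The three special cases differ only in the Laplace-Stieltjes transform B*(λ) and the mean repair time b1. A minimal Python sketch of these ingredients follows; the function names are assumptions of the sketch, and the interval end points of the uniform case are named lower and upper to keep them apart from the constants a and b of (8B.7) and (8B.6).

```python
import math


def lst_exponential(lam, mu):
    """B*(lam) for exponential repair times with rate mu (mean 1/mu)."""
    return mu / (lam + mu)


def lst_gamma(lam, mu, r):
    """B*(lam) for gamma repair times with shape r and mean 1/mu."""
    return (r * mu / (lam + r * mu)) ** r


def lst_uniform(lam, lower, upper):
    """B*(lam) for repair times uniform on (lower, upper)."""
    return (math.exp(-lam * lower) - math.exp(-lam * upper)) / (lam * (upper - lower))


def mean_uniform(lower, upper):
    """Mean repair time b1 for the uniform case."""
    return 0.5 * (lower + upper)


# Illustrative evaluations with hypothetical parameter values.
B_exp = lst_exponential(0.5, 3.0)
B_gam = lst_gamma(0.5, 3.0, 2)
B_uni = lst_uniform(1.0, 0.0, 1.0)
```

For r = 1 the gamma transform reduces to the exponential one, which gives a quick consistency check between cases (i) and (ii).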


8B.5 Numerical Results

In this section we provide numerical results for the availability and failure frequency of the repairable system by varying the parameters, namely the imperfect fault coverage (C), reboot rate (β) and common cause failure rate (λS). The program has been coded in MATLAB. A sensitivity analysis is carried out to check the validity of the analytical results. For computational purposes we set the default parameters as λS=0.1, C=0.3, θ=0.6, β=1, µ=3. The numerical results showing the effects of various parameters on availability are summarized in tables 8B.1(a)-8B.1(b) and 8B.2(a)-8B.2(b). The availability is presented graphically in figs 8B.1-8B.4 for the exponential, gamma and uniform distributions.

Tables 8B.1(a) and 8B.1(b) display the effects of parameters C, µ and β on the

availability for exponential, gamma and uniform distribution. Table 8B.1(a) reveals

that the availability increases as C and µ increase in case of exponential, or gamma

distribution. In case of uniform distribution, for the smaller values of β (i.e. β=1), the

availability decreases initially and becomes almost constant. It is observed that the

availability increases initially and becomes constant for β=2 and β=3. In table 8B.1(b),

we examine the effects of θ, µ and β on the availability and note that as β increases,

the availability increases for all distributions. The availability increases as µ increases

in case of exponential and gamma distribution but it remains almost constant for

uniform distribution. On increasing the values of C, the availability increases for

uniform distribution but decreases for exponential distribution. It is noticed that for

gamma distribution, the availability decreases only for β=1 but increases for β=2 and

β=3.

From figs 8B.1-8B.3, we see the variation of availability with respect to

parameter λ for different values of C, β, θ and λS for exponential, uniform and gamma

distributions, respectively. In figs 8B.1(i)-8B.1(iii) and 8B.3(i)-8B.3(iii), the

availability sharply decreases when λ increases, but in figs 8B.2(i)-8B.2(iii), the

availability initially decreases smoothly and thereafter decreases sharply.

Figs 8B.4(i)-8B.4(iv) show the variation of availability with respect to λ, λS, β

and µ respectively by varying r for gamma distribution. The availability slightly

decreases when r increases as clear from fig. 8B.4(i)-8B.4(ii). But in figs 8B.4(iii)-

8B.4(iv) for gamma distribution, the availability increases when r increases.

Overall, we conclude that for all distributions the availability decreases with the failure rate λ. A close study of the availability tables reveals that the availability is most affected by the parameters C and θ, because both factors help to prevent failure of the entire system by restoring failed elements to service.



8B.5 Concluding Remarks

In many real-time applications, highly available fault-tolerant systems are expensive and time-consuming to develop and deploy. In this section, we have developed steady-state results for the availability of redundant systems with imperfect fault coverage, reboot and common cause failure. The steady-state probabilities are obtained in terms of the system operating parameters and are then used to find the availability and failure frequency of the system. The proposed model has the advantage of being quite general and provides a useful performance evaluation tool for real-time fault-tolerant systems arising in practical applications, such as production systems, flexible manufacturing systems, computer and communication systems, transportation systems, inventory problems, and many other related systems.



                          Availability

             Uniform     Exponential Distribution     Gamma Distribution
 C     µ     β=1=2=3     β=1      β=2      β=3        β=1      β=2      β=3

 0.7   01    0.434       0.467    0.490    0.498      0.451    0.474    0.482
       03    0.409       0.667    0.711    0.728      0.662    0.706    0.722
       05    0.409       0.730    0.782    0.801      0.728    0.780    0.799
       07    0.409       0.761    0.817    0.838      0.760    0.816    0.836
       09    0.409       0.779    0.838    0.859      0.779    0.837    0.859
       11    0.409       0.792    0.852    0.874      0.791    0.851    0.873
       13    0.409       0.800    0.862    0.884      0.800    0.861    0.884
       15    0.409       0.807    0.869    0.892      0.806    0.869    0.891

 0.8   01    0.479       0.484    0.500    0.505      0.464    0.479    0.485
       03    0.444       0.701    0.731    0.741      0.694    0.724    0.734
       05    0.444       0.770    0.805    0.817      0.767    0.802    0.814
       07    0.444       0.804    0.842    0.855      0.802    0.840    0.853
       09    0.444       0.824    0.863    0.877      0.823    0.862    0.876
       11    0.444       0.838    0.878    0.892      0.837    0.877    0.891
       13    0.444       0.847    0.888    0.903      0.847    0.888    0.902
       15    0.444       0.855    0.896    0.911      0.854    0.896    0.910

 0.9   01    0.519       0.501    0.509    0.512      0.477    0.485    0.487
       03    0.491       0.733    0.748    0.753      0.725    0.740    0.745
       05    0.491       0.808    0.825    0.831      0.804    0.822    0.828
       07    0.491       0.845    0.863    0.870      0.843    0.861    0.868
       09    0.491       0.867    0.886    0.893      0.865    0.885    0.892
       11    0.491       0.881    0.901    0.908      0.880    0.900    0.907
       13    0.491       0.892    0.912    0.919      0.891    0.911    0.918
       15    0.491       0.900    0.920    0.927      0.899    0.920    0.927

Table 8B.1(a): Effects of parameters C, µ and β on the availability for different distributions of repair time.



                          Availability

             Uniform     Exponential Distribution     Gamma Distribution
 θ     µ     β=1=2=3     β=1      β=2      β=3        β=1      β=2      β=3

 0.7   01    0.429       0.399    0.449    0.469      0.395    0.445    0.464
       03    0.429       0.537    0.629    0.668      0.535    0.628    0.666
       05    0.429       0.577    0.684    0.730      0.576    0.684    0.729
       07    0.429       0.596    0.711    0.760      0.596    0.711    0.760
       09    0.429       0.608    0.727    0.778      0.607    0.727    0.778
       11    0.429       0.615    0.738    0.790      0.615    0.737    0.790
       13    0.429       0.620    0.745    0.798      0.620    0.745    0.798
       15    0.429       0.624    0.750    0.805      0.624    0.750    0.805

 0.8   01    0.502       0.388    0.442    0.463      0.391    0.446    0.467
       03    0.502       0.509    0.610    0.652      0.510    0.611    0.654
       05    0.502       0.543    0.659    0.710      0.543    0.660    0.711
       07    0.502       0.558    0.683    0.738      0.559    0.684    0.739
       09    0.502       0.567    0.697    0.755      0.568    0.698    0.755
       11    0.502       0.573    0.707    0.766      0.573    0.707    0.766
       13    0.502       0.577    0.713    0.774      0.578    0.713    0.774
       15    0.502       0.581    0.718    0.779      0.581    0.718    0.779

 0.9   01    0.469       0.381    0.437    0.459      0.389    0.446    0.469
       03    0.469       0.493    0.598    0.643      0.496    0.601    0.646
       05    0.469       0.522    0.644    0.698      0.523    0.645    0.700
       07    0.469       0.535    0.666    0.724      0.536    0.667    0.725
       09    0.469       0.543    0.678    0.740      0.544    0.679    0.741
       11    0.469       0.548    0.687    0.750      0.548    0.687    0.751
       13    0.469       0.551    0.693    0.757      0.552    0.693    0.758
       15    0.469       0.554    0.697    0.763      0.554    0.697    0.763

Table 8B.1(b): Effects of parameters θ, µ and β on the availability for different distributions of repair time


Fig. 8B.2: Availability vs λ by varying (i) C (ii) β (iii) θ for exponential distributed repair time

Fig. 8B.3: Availability vs λ by varying (i) C (ii) β (iii) θ for uniform distributed repair time

Fig. 8B.4: For gamma distributed repair time, the availability vs λ by varying (i) C (ii) β (iii) θ (iv) λS

Fig. 8B.5: For gamma distributed repair time, the availability vs (i) λ (ii) λs (iii) β (iv) µ by varying r

Software Reliability Growth Model (SRGM) with N-version Programming

9.1 Introduction

9.2 NHPP Model

9.3 Mean Value Function

9.4 Reliability Estimation

9.5 Parameter Estimation

9.6 Total Expected Cost of Software



Chapter-9


Software reliability models have received the attention of software designers in the evaluation of quality factors. In this chapter, we propose a software reliability growth model (SRGM) based on the non-homogeneous Poisson process (NHPP). The main aim of this investigation is to develop a software reliability growth model that incorporates both testing-effort and N-version programming under an imperfect debugging process. Software faults detected during testing are categorized into three types, namely minor, major and critical, depending on their severity. A modified approach is discussed to determine the delivery cost of the software at any time during testing. A parameter estimation approach is proposed to estimate the unknown parameters of the proposed model. Numerical results are also presented to demonstrate the validity of the analytical results. By using the Adaptive Network-based Fuzzy Inference System (ANFIS) approach for the SRGM, we explore the prediction capabilities of soft computing.

9.1 Introduction

Reliability is a basic requirement for both hardware and software systems. With the increasing demand for software, high quality software systems have become essential for achieving a high degree of system reliability. The quality of software is described by many metrics such as complexity, portability, maintainability, availability and reliability. In recent years, many fault tolerant systems have become software dependent for their correct functioning. There are two methods for achieving fault tolerance in software: (i) N-version programming and (ii) recovery blocks.


In N-version programming (NVP), N independently developed programs for the same application are executed in parallel, and a decision is obtained by voting on the outputs of the individual programs. The NVP technique can tolerate the design faults present in the software if the design diversity concept is implemented properly. Each version of the module should be implemented in as diverse a manner as possible, using different tool sets, different programming languages and possibly different environments. All N software versions are executed simultaneously and their results are sent to a voter which selects the correct output. Whenever a failure occurs, the causes of all detected faults are immediately and perfectly removed. The NVP technique was proposed by Avizienis and Chen (1977). Various researchers have contributed significantly to the field of N-version programming (cf. Brilliant et al., 1990; Dugan and Lyu, 1993). Pham (1994) presented the system reliability analysis of an N-version programming application. Teng and Pham (2002) and Bhaskar and Kumar (2006) discussed software reliability growth models for N-version programming with cost and imperfect debugging. Laval et al. (2011) discussed support for simultaneous versions in software evolution assessment.

SRGMs provide the relationship between the cumulative number of faults detected by software testing and the time expended. Various stochastic process models have been successfully used in studying hardware and software reliability problems, and several notable researchers have contributed to this field. Pham and Zhang (2003) and Cai et al. (2008) studied the cost analysis of reliability growth models based on NHPP for hardware and software systems with testing coverage. Yang and Xie (2000) and Rackwitz (2001) studied the operational and testing reliability of software systems. Pham (2007) and Castillo et al. (2008) analyzed optimization and reliability problems with imperfect debugging and fault-detection dependent parameters for SRGMs. Rebba and Mahadevan (2008) gave computational methods for model reliability assessment. Zio (2009) discussed the old and new challenges related to reliability prediction. Reliability growth modeling for in-service systems considering latent failure modes was discussed by Jin et al. (2010). Hsu et al. (2011a) proposed enhancing software reliability modeling and prediction through the introduction of a time-variable fault reduction factor.


In some applications, SRGMs also incorporate the time-dependent behavior of testing-effort expenditure. Testing effort includes the number of executed test cases, the manpower spent during testing and the number of CPU hours. Many researchers have described software reliability growth in different frameworks under testing effort. Software reliability growth models with a testing-effort function, fault detection rate and change point were studied by Yamada and Othera (1990), Kuo et al. (2001), Kapur and Bhardhan (2002), Huang and Kuo (2002) and Huang (2005). Catuneanu et al. (1991), Kuo (2005) and Huang and Lyu (2005) examined the software reliability release policy with testing effort by considering the cost and testing efficiency. The framework for developing testing-effort dependent software reliability growth models with a learning function for distributed systems was considered by Yamada et al. (1993), Inoue and Yamada (2004), Kapur et al. (2004), Huang et al. (2007a), Kapur et al. (2009) and many more. Huang and Hung (2010) presented software reliability analysis and assessment using queueing models with multiple change-points. The sensitivity analysis of the release time of software reliability models incorporating testing effort with multiple change-points was studied by Li et al. (2011).

The NHPP based models are among the most important models because of their simplicity, convenience and compatibility. This group of models provides an analytical framework for describing the software failure occurrence or fault-removal phenomenon during testing. These models are normally based upon different debugging scenarios and can quantitatively capture the typical reliability growth behaviour observed in the testing phase of software products. This chapter investigates N-version programming for an SRGM based on NHPP that incorporates imperfect debugging and testing effort. The rest of the chapter is organized as follows. In section 9.2, we describe the development of the SRGM based on NHPP to examine the software testing for N-version programming. In section 9.3, we define the mean value function, which describes the fault detection, isolation and removal phenomenon in software testing. In section 9.4, we describe the reliability estimation for the model. In section 9.5, we suggest the parameter estimation of the model, which is based on NHPP and NVP. The total software cost is evaluated in section 9.6. In section 9.7, we present numerical results by taking an illustration. Finally, section 9.8 contains the concluding remarks.

9.2 NHPP Model

We develop a testing-effort dependent reliability growth model which incorporates the testing-effort spent on software testing. Let us consider a counting process {N(t), t ≥ 0} describing the cumulative number of errors detected up to testing time t. The software reliability growth model (SRGM) based on NHPP can be formulated as a Poisson process given by:

\[
\Pr\{N(t) = n\} = \frac{\{m(t)\}^{n}}{n!}\,\exp\{-m(t)\}, \qquad n = 0, 1, 2, \ldots \tag{9.1}
\]

where m(t) is a mean value function which represents the expected cumulative number of errors detected in the time interval (0, t].
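As a minimal sketch of equation (9.1), the probability of observing exactly n detected errors by time t can be evaluated once a mean value function is chosen. The exponential form m(t) = a(1 − exp(−bt)) and the values a = 100, b = 0.05 used below are illustrative assumptions only; they are not the mean value function developed later in this chapter.

```python
import math


def mean_value_exponential(t, a=100.0, b=0.05):
    """Hypothetical mean value function m(t) = a * (1 - exp(-b * t)).

    The form and the default values of a and b are assumptions made
    for illustration; any non-decreasing m(t) with m(0) = 0 could be used.
    """
    return a * (1.0 - math.exp(-b * t))


def nhpp_pmf(n, t, m=mean_value_exponential):
    """Pr{N(t) = n} for an NHPP with mean value function m, as in (9.1)."""
    mt = m(t)
    if mt == 0.0:
        # With m(t) = 0 no errors can have been detected yet.
        return 1.0 if n == 0 else 0.0
    # Evaluate in log space so that n! does not overflow for large n.
    log_p = n * math.log(mt) - mt - math.lgamma(n + 1)
    return math.exp(log_p)


# Usage: distribution of the number of errors detected by t = 10.
probabilities = [nhpp_pmf(n, 10.0) for n in range(200)]
print("m(10) =", round(mean_value_exponential(10.0), 3))
print("total probability =", round(sum(probabilities), 6))
```

Working with logarithms and `math.lgamma` keeps the computation stable when the expected number of errors is large.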

9.2.1 Software Reliability Growth Model

In this section, we develop a software reliability growth model (SRGM) for N-version programming systems by considering the error removal efficiency and the error introduction rate during testing. The model describes how the observation of failures and the correction of the underlying faults, which take place while the software is being tested and debugged, affect the reliability of the software. For formulating the model, the following assumptions are made:

There are three types of errors (minor, major and critical) in the system. The minor, major and critical errors are considered for each of the N versions. The error introduction probabilities are assumed to be constant for each version.

The detection/isolation/fault removal phenomenon is modeled by a non-homogeneous Poisson process (NHPP).

The time-dependent behavior of the testing-effort is governed by an exponential distribution.

The mean number of faults/errors detected in the time interval (t, t+Δt] is proportional to the mean number of remaining errors in the system.

Imperfect debugging is considered for each of the N versions of the software.

The following notations are used for mathematical formulation and performance evaluation purposes:

pdf : Probability density function.

N : Total number of versions in the system.

M : Total amount of testing-effort eventually consumed.

i : Index representing the version number in the system, i=1,2,3,…,N.

j : Index representing the types of error, i.e. minor, major and complex faults, respectively, in the system for j=1,2,3.

$a_{i}$ : Number of errors to be detected eventually in version i, i=1,2,3,…,N.

$b_{ij}$ : Failure detection rate per unit testing-effort of version i for the jth type of fault, where $b_{1j} \neq b_{2j} \neq \cdots \neq b_{Nj}$.

$p_{ij}$ : Error content proportion of version i for the jth type of error, where $p_{1j} \neq p_{2j} \neq \cdots \neq p_{Nj}$.

$\beta_{ij}$ : Error introduction rate for the jth type of error of version i, where $\beta_{1j} \neq \beta_{2j} \neq \cdots \neq \beta_{Nj}$.

$m^{D}_{ij}(t)$ : Expected mean number of jth type faults detected in version i in time (0, t].

$m^{I}_{ij}(t)$ : Expected mean number of jth type faults isolated in version i in time (0, t].

$m^{R}_{ij}(t)$ : Expected mean number of jth type faults removed in version i in time (0, t].

$\lambda_{ij}(t)$ : Initial failure intensity of type j error for version i, i=1,2,3,…,N.

$\lambda'_{i}(t)$ : Increased initial failure intensity for version i, i=1,2,3,…,N.

$m'_{i}(t)$ : Increased mean value function for version i, i=1,2,3,…,N.

$\alpha_{i}$ : Correlating parameter for the ith version, $0 \le \alpha_{i} \le 1$.

$R_{i}$ : Reliability of version i, i=1,2,3,…,N.

9.2.2 Testing-Effort Function

The testing-effort function (TEF) can be measured by the number of test cases run, the human power spent, or the number of CPU hours. We consider an exponential TEF to describe the possible testing-effort patterns.

The total amount of testing-effort W(t) spent in the time interval (0, t] is

\[
W(t) = M\{1 - \exp(-\gamma t)\} \tag{9.2}
\]

The current testing-effort expenditure rate at testing time t is given by

\[
w(t) = \frac{d}{dt} W(t) \tag{9.3}
\]
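The exponential testing-effort function of equations (9.2) and (9.3) can be sketched in a few lines. Differentiating (9.2) gives w(t) = Mγ exp(−γt), which is the form coded below. The default values M = 80 and γ = 0.009 are the ones quoted in the numerical illustration of this chapter; the function names are hypothetical.

```python
import math


def cumulative_effort(t, M=80.0, gamma=0.009):
    """W(t) = M * (1 - exp(-gamma * t)), equation (9.2)."""
    return M * (1.0 - math.exp(-gamma * t))


def effort_rate(t, M=80.0, gamma=0.009):
    """w(t) = dW/dt = M * gamma * exp(-gamma * t), from equation (9.3)."""
    return M * gamma * math.exp(-gamma * t)


# Usage: effort consumed and current expenditure rate at a few testing times.
for t in (0.0, 50.0, 100.0, 300.0):
    print(t, round(cumulative_effort(t), 3), round(effort_rate(t), 4))
```

The cumulative effort starts at zero and approaches M as t grows, while the expenditure rate starts at Mγ and decays exponentially.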

9.3 Mean Value Function

We propose a software reliability growth model in which the software faults detected during the testing phase are isolated and removed, so that the reliability of the software tends to grow. The software faults are assumed to be of different severity, namely minor, major and complex. The mean value function of the software reliability growth model with a time dependent testing-effort function is established. The faults are removed in different stages according to their severity.

The mean value functions are governed by the following differential equations:

\[
\frac{d}{dt} m^{D}_{ij}(t) \times \frac{1}{w(t)} = b_{ij}\left[ n_{ij}(t) - m^{D}_{ij}(t) \right] \tag{9.4}
\]

where

\[
\frac{d}{dt} n_{ij}(t) = \beta_{ij}\,\frac{d}{dt} m^{D}_{ij}(t) \tag{9.5}
\]

\[
\frac{d m^{I}_{ij}(t)}{dt} \times \frac{1}{w(t)} = b_{ij}\left[ m^{D}_{ij}(t) - m^{I}_{ij}(t) \right] \tag{9.6}
\]

\[
\frac{d m^{R}_{ij}(t)}{dt} = b_{ij}\,w(t)\left[ m^{I}_{ij}(t) - m^{R}_{ij}(t) \right] \tag{9.7}
\]

Solving the above equations, for i=1, 2, 3,…,N; j=1, 2, 3 we get

\[
n_{ij}(t) = \frac{a_{i}\,p_{ij}}{1-\beta_{ij}} \left[ 1 - \beta_{ij} \exp\left\{ -(1-\beta_{ij})\, b_{ij}\, W(t) \right\} \right] \tag{9.8}
\]

and

( )

( ) ( )( ) ( ) ( ) ( )( )( )

β−−−β−β

−−−β−

= ijijij

ijij

ij

ijiRij 1exp1

tWb1tWbexp1

1pa

dttdm …(9.9)
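The detection stage can be cross-checked numerically. The sketch below integrates dm/dt = w(t) b (n − m) together with dn/dt = β dm/dt, which is how equations (9.4) and (9.5) read, by a simple forward Euler scheme and compares the result with the closed form (9.8) for n(t). The initial conditions m(0) = 0 and n(0) = a p are assumptions (they agree with (9.8) at t = 0), the closed form used for the detected faults follows from those assumptions rather than from the text, and the parameter values in the usage lines are illustrative.

```python
import math


def W(t, M=80.0, gamma=0.009):
    """Cumulative testing effort, equation (9.2)."""
    return M * (1.0 - math.exp(-gamma * t))


def w(t, M=80.0, gamma=0.009):
    """Testing-effort expenditure rate, the derivative of W(t)."""
    return M * gamma * math.exp(-gamma * t)


def n_closed_form(t, a, p, b, beta):
    """Error content n(t) as given by equation (9.8)."""
    return a * p / (1.0 - beta) * (1.0 - beta * math.exp(-(1.0 - beta) * b * W(t)))


def m_detected_closed_form(t, a, p, b, beta):
    """Detected faults implied by (9.4)-(9.5) with m(0) = 0 and n(0) = a * p (assumed)."""
    return a * p / (1.0 - beta) * (1.0 - math.exp(-(1.0 - beta) * b * W(t)))


def integrate_detection(t_end, a, p, b, beta, steps=20000):
    """Forward Euler integration of (9.4) and (9.5); returns (m(t_end), n(t_end))."""
    h = t_end / steps
    m, n, t = 0.0, a * p, 0.0
    for _ in range(steps):
        dm = w(t) * b * (n - m) * h   # equation (9.4) rearranged for dm
        m += dm
        n += beta * dm                # equation (9.5)
        t += h
    return m, n


# Usage with illustrative values a = 100, p = 0.2, b = 0.04, beta = 0.5.
m_num, n_num = integrate_detection(100.0, 100.0, 0.2, 0.04, 0.5)
print(round(n_num, 3), round(n_closed_form(100.0, 100.0, 0.2, 0.04, 0.5), 3))
```

Agreement between the numerical and closed-form values is a quick consistency check on the signs and indices of the reconstructed expressions.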


The failure intensity for i=1, 2, 3,…,N; j=1, 2, 3 is obtained as

\[
\lambda_{ij}(t) = \frac{d m^{R}_{ij}(t)}{dt} \quad \text{and} \quad \lambda_{i}(t) = \sum_{j=1}^{3} \lambda_{ij}(t)
\]

Therefore

( ) ( )( ) ( ) ( ) ( )( )( ) ( ) ( ) ( )( )

×−−−+−+−−

=λ ∑=

tWtwbtwβb

β1expβ1twbtWbexpβ1pa

)t( ijij

ijijijijij

3

1j ij

ijii

…(9.10)

9.4 Reliability Estimation

Now we evaluate the conditional reliability of the ith version and the reliability measures of the system with N versions by considering two cases: (i) the increased failure intensity of each version of the system and (ii) the increased mean value function of each version in the system. These cases are as follows:

Case (I):

The increased failure intensity for each version is determined by

\[
\lambda'_{i} = \sum_{i=1}^{n} \alpha_{i}\,\lambda_{i}
\]

With the help of equations (9.10) and (9.11), we evaluate the conditional reliability for a given time X for the ith version as

\[
R_{i}(X \mid T) = \exp\left[ -\left\{ m_{i}(T+X) - m_{i}(T) \right\} \right] \tag{9.11}
\]
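Equation (9.11) translates directly into code once a mean value function is available. In the sketch below the mean value function is passed in as a callable; the exponential stand-in `mean_value` and its parameters are assumptions for illustration and are not the model's own m_i(t).

```python
import math


def mean_value(t, a=100.0, b=0.05):
    """Hypothetical stand-in for m_i(t): a * (1 - exp(-b * t))."""
    return a * (1.0 - math.exp(-b * t))


def conditional_reliability(x, T, m=mean_value):
    """R(X | T) = exp[-{m(T + X) - m(T)}], equation (9.11).

    Probability that no failure occurs in (T, T + x], given testing up to T.
    """
    return math.exp(-(m(T + x) - m(T)))


# Usage: reliability over one further time unit after different testing times.
for T in (10.0, 50.0, 100.0):
    print(T, round(conditional_reliability(1.0, T), 4))
```

For a concave mean value function such as this stand-in, the conditional reliability over a fixed interval grows with the testing time T, which is the reliability growth the model is meant to describe.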

The reliability expression for a two-version software system is determined by using

\[
R_{sys} = R_{1}(X \mid T) + R_{2}(X \mid T) - R_{1}(X \mid T)\,R_{2}(X \mid T) \tag{9.12}
\]

For N versions, the reliability expression is given by

\[
R_{sys} = \sum_{i=1}^{N} R_{i}(X \mid T) - \sum_{i<j}^{N} R_{i}(X \mid T)\,R_{j}(X \mid T) + \sum_{i<j<k}^{N} R_{i}(X \mid T)\,R_{j}(X \mid T)\,R_{k}(X \mid T) - \cdots + (-1)^{N-1} \prod_{i=1}^{N} R_{i}(X \mid T) \tag{9.13}
\]
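The alternating sum in equation (9.13) is the inclusion-exclusion expansion over the N versions, and for N = 2 it reduces to equation (9.12). A short sketch, assuming the version reliabilities are supplied as plain numbers, evaluates the expansion term by term and compares it with the equivalent product form 1 − Π(1 − R_i); the product form is a standard identity and is not stated in the text.

```python
from itertools import combinations
from math import prod


def system_reliability_inclusion_exclusion(reliabilities):
    """Alternating sum over all subsets of versions, as in equation (9.13)."""
    n = len(reliabilities)
    total = 0.0
    for k in range(1, n + 1):
        sign = (-1) ** (k - 1)
        # Sum of products of reliabilities over every k-subset of versions.
        total += sign * sum(prod(subset) for subset in combinations(reliabilities, k))
    return total


def system_reliability_complement(reliabilities):
    """Equivalent closed form: 1 minus the product of the unreliabilities."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Usage with hypothetical version reliabilities.
versions = [0.9, 0.8, 0.7]
print(system_reliability_inclusion_exclusion(versions))
print(system_reliability_complement(versions))
```

The product form needs only N multiplications, whereas the expansion visits every subset of versions, so the former is preferable for computation when N is large.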

Case (II):

The increased mean value function for each version is given by

\[
m'_{i} = \sum_{i=1}^{n} \alpha_{i}\,m_{i}
\]

and

( ) ( ) ( )( ) ( ) ( ) ( )( )( )∑=

−−−−−−−−

=3

1jijij

ij

ijij

ij

ijiii β1expβ1

βtWb

1tWbexp1β1pa

t,λm

The conditional reliability of version i is given as below

\[
R'_{i}(X \mid T) = \exp\left[ -\left\{ m'_{i}(T+X) - m'_{i}(T) \right\} \right] \tag{9.14}
\]

The conditional reliability of two-version software is given as follows

\[
R'_{sys} = R'_{1}(X \mid T) + R'_{2}(X \mid T) - R'_{1}(X \mid T)\,R'_{2}(X \mid T) \tag{9.15}
\]

The conditional reliability of an N-version system is given by

\[
R'_{sys} = \sum_{i=1}^{N} R'_{i}(X \mid T) - \sum_{i<j}^{N} R'_{i}(X \mid T)\,R'_{j}(X \mid T) + \sum_{i<j<k}^{N} R'_{i}(X \mid T)\,R'_{j}(X \mid T)\,R'_{k}(X \mid T) - \cdots + (-1)^{N-1} \prod_{i=1}^{N} R'_{i}(X \mid T) \tag{9.16}
\]

9.5 Parameter Estimation

Parameter estimation is a basic requirement of software reliability prediction. It can be carried out by using the well established maximum likelihood estimation approach to evaluate the unknown parameters of the NHPP models.

The joint p.d.f. for the ith (i=1, 2, …, N) version in the system is given by

\[
f_{i}(t_{i1}, t_{i2}, \ldots, t_{in}) = \exp\left[ -m_{i}(t_{in}) \right] \prod_{s=1}^{n} \lambda_{i}(t_{is}) \tag{9.17}
\]

Let $(t_{i1}, t_{i2}, \ldots, t_{in})$ be the time between failures for the ith (i=1, 2, …, N) version. The joint likelihood function is defined as follows:

\[
L = \prod_{i=1}^{N} f_{i} = \exp\left( -\sum_{i=1}^{N} m_{i}(t_{in}) \right) \prod_{i=1}^{N} \prod_{s=1}^{n} \lambda'_{i}(t_{is}) \tag{9.18}
\]

Taking the logarithm of L, we get

\[
\log L = -\sum_{i=1}^{N} m_{i}(t_{in}) + \sum_{i=1}^{N} \sum_{s=1}^{n} \log \lambda'_{i}(t_{is}) \tag{9.19}
\]

Setting the partial derivatives of eq. (9.19) with respect to $a_{i}$, $b_{ij}$, $p_{ij}$, $\beta_{ij}$ $(1 \le i \le N)$ and $\alpha_{i}$ equal to zero gives the likelihood equations of the above parameters.
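The log-likelihood of equation (9.19) for a single version can be sketched as follows. To keep the example small, the mean value function is replaced by the simple exponential (Goel-Okumoto type) form m(t) = a(1 − exp(−bt)), which is a stand-in and not the model of this chapter, and the failure times are hypothetical, taken as cumulative times in increasing order. For this stand-in, setting the derivative with respect to a to zero gives a = n / (1 − exp(−b t_n)) for fixed b.

```python
import math


def goel_okumoto(a, b):
    """Return (m, lam) for the stand-in model m(t) = a * (1 - exp(-b * t))."""
    def m(t):
        return a * (1.0 - math.exp(-b * t))

    def lam(t):
        return a * b * math.exp(-b * t)   # lam(t) = dm/dt

    return m, lam


def log_likelihood(failure_times, m, lam):
    """Single-version form of (9.19): -m(t_n) + sum over s of log lam(t_s)."""
    t_n = failure_times[-1]
    return -m(t_n) + sum(math.log(lam(t)) for t in failure_times)


def a_hat_given_b(failure_times, b):
    """Root of d(log L)/da = 0 for the stand-in model with b held fixed."""
    n = len(failure_times)
    t_n = failure_times[-1]
    return n / (1.0 - math.exp(-b * t_n))


# Usage with hypothetical cumulative failure times and an assumed b.
times = [12.0, 30.0, 47.0, 71.0, 103.0, 150.0, 214.0, 290.0]
b = 0.01
a_hat = a_hat_given_b(times, b)
print(round(a_hat, 3), round(log_likelihood(times, *goel_okumoto(a_hat, b)), 3))
```

For the full model the likelihood equations generally have no closed-form solution, so a numerical optimizer applied to the same log-likelihood function would be the natural next step.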

9.6 Expected Cost of the Software

It is of vital importance for a software manufacturer to control the software development process in terms of cost, reliability and optimal testing time. The quality of the software system usually depends on the testing time and testing efforts. The total software testing cost incurred during the software life-cycle is measured from the time when the testing starts. The cost of testing before and after release is quantified in terms of various cost factors including the setup cost and the cost of removing errors during the testing and operational phases. The expected total cost function is given by:

\[
C(T) = C_{1} \times m(T) + C_{2} \times \left[ m(T_{LC}) - m(T) \right] + C_{3} \times \int_{0}^{T} w(x)\,dx \tag{9.20}
\]

where

$T_{LC}$ = Software life-cycle length.

$C_{1}$ = Cost of fixing a fault during the testing phase.

$C_{2}$ = Cost of fixing a fault during the operational phase $(C_{2} > C_{1} > 0)$.

$C_{3}$ = Cost per unit of testing-effort consumption during the testing.
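Equation (9.20) can be evaluated without numerical integration, because the integral of w(x) over (0, T] equals W(T) for the exponential testing-effort function of equation (9.2). The sketch below uses the default values quoted in the numerical illustration (C1 = 200, C2 = 500, C3 = 350, T_LC = 300, M = 80, γ = 0.009); the mean value function `mean_value` is a hypothetical stand-in supplied only so that the example runs.

```python
import math


def cumulative_effort(t, M=80.0, gamma=0.009):
    """W(t) = M * (1 - exp(-gamma * t)); equals the integral of w(x) over (0, t]."""
    return M * (1.0 - math.exp(-gamma * t))


def mean_value(t, a=100.0, b=0.05):
    """Hypothetical stand-in for m(t), used only for illustration."""
    return a * (1.0 - math.exp(-b * t))


def expected_total_cost(T, m=mean_value, T_LC=300.0,
                        C1=200.0, C2=500.0, C3=350.0, M=80.0, gamma=0.009):
    """C(T) = C1*m(T) + C2*[m(T_LC) - m(T)] + C3*W(T), equation (9.20)."""
    testing_fix_cost = C1 * m(T)                  # faults fixed during testing
    operational_fix_cost = C2 * (m(T_LC) - m(T))  # faults left for the field
    effort_cost = C3 * cumulative_effort(T, M, gamma)
    return testing_fix_cost + operational_fix_cost + effort_cost


# Usage: scan release times to see where the cost is lowest for the stand-in m(t).
candidates = [float(T) for T in range(0, 301, 10)]
best_T = min(candidates, key=expected_total_cost)
print(best_T, round(expected_total_cost(best_T), 2))
```

Scanning a grid of release times is a simple way to locate the cost-minimizing T; a finer grid or a one-dimensional optimizer could refine the estimate.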


9.7 Numerical Results

In this section, numerical results are obtained and compared with neuro-fuzzy results by building an Adaptive Network Based Fuzzy Inference System (ANFIS) in the software 'MATLAB 7.4'. The ANFIS is built by using the fuzzy toolbox of the MATLAB package. We use the Gaussian function to describe the membership function of the input variable. For all approximations, the ANFIS is trained for 50 epochs with 5 membership functions. The linguistic values of the input parameter are given in table 9.1 and the corresponding membership functions are shown in fig. 9.4. We consider T as the linguistic variable for the fuzzy system. For illustration purposes, we consider software consisting of simple faults and 3 versions, i.e. j=1 and i=3. For different values of a1 and b11, figs 9.1-9.3 depict the cumulative number of faults detected m(T), the total expected cost C(T) and the reliability R(T) of the software by taking the default parameters as C1=200, C2=500, C3=350, TLC=300, a1=100, a2=250, a3=500, p11=.20, p21=.30, p31=.50, β11=.5, β21=.6, β31=.7, γ=.009, M=80, x=1.

Table 9.1: Linguistic values of the membership functions for time t

Input Variables     No. of membership function     Linguistic Values
T                   5                              Low, Medium, High

From figs 9.1(i)-(ii), we notice that the expected number of detected faults increases as time passes. As a1 and β11 increase, the mean value function shows an increasing trend. The total expected cost is exhibited for different values of the parameters a1 and b11 with respect to time in figs 9.2(i)-(ii). It is seen that the expected cost initially decreases sharply but later on increases gradually. We note that as a1 increases, the expected cost increases, but on increasing the value of b11, the expected cost remains constant.

By varying the parameters a1 and b11, the reliability is depicted in figs 9.3(i)-(ii). We observe that the reliability initially increases with time but finally becomes almost constant. From fig. 9.3(i), it is clear that as the error content increases, the reliability decreases. From fig. 9.3(ii), it is also seen that as the error detection rate increases, the reliability decreases for some time and later on approaches a constant value.


9.8 Concluding Remarks

In this chapter, a software reliability growth model (SRGM) based on the non-homogeneous Poisson process (NHPP) for N-version programming with testing-effort has been developed. Our model takes care of imperfect debugging, which is a more realistic assumption for the software development process. It is worth mentioning that the proposed model will be helpful in calculating the number of various types of faults and their effect on the reliability growth in actual applications. The developed software reliability model is capable of providing the reliability for N versions, provided that the parameters are estimated as realistically as possible and the distribution assumed for the testing-effort suits the concrete situation. Various interesting quantities for software reliability measurement can be computed easily for the concerned software system, as validated by numerical simulation.

Fig. 9.1(i): Mean value function m(T) vs time by varying a1 (analytical and ANFIS results for a1=200 and a1=300)

Fig. 9.1(ii): Mean value function m(T) vs time by varying β11 (analytical and ANFIS results for β11=0.5 and β11=0.6)

Fig. 9.2(i): Expected cost C(T) vs time by varying a1

Fig. 9.2(ii): Expected cost C(T) vs time by varying a2


Fig. 9.3(i): Reliability Rsys vs time by varying a1

Fig. 9.3(ii): Reliability Rsys vs time by varying b11 (analytical and ANFIS results for b11=0.04 and b11=0.06)

Fig. 9.4: Membership functions for input parameter T

Future Scope

With the advances in computer technology, modern society relies almost entirely on gadgets which contain software. The investigation done in the present thesis is mainly concerned with the performance prediction of the hardware and software reliability of fault tolerant systems. We have indicated the open problems and the scope of the research work in the first chapter. However, there are some specific observations in this regard which we would like to mention. Some important issues related to our research investigations which can be addressed in future are as follows:

System reliability/availability is a key performance measure which is to be assessed for a concerned system during the design and development phase. Based on our investigation, a further interesting possibility is to study the optimal number of spare parts for different configurations.

Software fault tolerance techniques are gaining popularity. The analysis of cluster architectures with respect to fault tolerance and the reliability measures presented in chapter 2 will give insight to developers and system designers for improving the reliability of the concerned system. However, an optimal configuration can be suggested based upon pre-specified techno-economic constraints.

For the smooth and uninterrupted functioning of any system, standby units are used. The optimal allocation of redundant components based on a cost criterion can be further determined for the repairable system with warm standbys studied in chapters 5 and 6.

The failure of any hardware system may also take place due to some common cause. The concept of common cause has been incorporated in chapters 1, 4, 5(A), 7 and 8(B). It can be further included in other models also.

Embedded Markov chain models are gaining importance these days. The work presented in this thesis needs refinement to improve the analysis procedure by integrating economic factors.

N-version programming is one of the most common techniques for systems capable of fault tolerance. The method and analysis provided for the software reliability growth model (SRGM) with N-version programming in chapter 9 can be extended to a flexible environment.

References

1. Abdel-Hameed, M. (1995): Inspection, maintenance and replacement models. Computers and Operations Research, Vol. 22, No. 4, pp. 435-441.

2. Abujarad, F. and Kulkarni, S. S. (2011): Automated constraint-based addition of non masking and stabilizing fault tolerance. Theoretical Computer Science, Vol. 412, No. 33, pp. 4228-4246.

3. Abulnaja, O. A. (2005): Component-based recovery blocks technique. Artificial Intelligence & Machine Learning Journal, Vol. 5, No. 2, pp. 1-5.

4. Abu-Salih, M., Anakerh, N. and Ahmed, M. S. (1999): Confidence limits for steady state availability. Pakistan Journal of Statistics, Vol. 6, No. 2A, pp. 189-196.

5. Adke, S. R. and Manjunath, S. M. (1984): An Introduction to Finite Markov Processes. Wiley Eastern Limited, New York.

6. Akhtar, S. (1994): Reliability of K-out-of-N: G system with imperfect fault coverage. IEEE Transactions on Reliability, Vol. 43, pp. 101-106.

7. Alidrisi, M. M. (1992): The reliability of a dynamic warm standby redundant system of n components with imperfect switching. Microelectronics and Reliability, Vol. 32, No. 6, pp. 851-859.

8. Al-Saqabi, K., Saleh, K. and Ahmad, I. (1996): Recovery from concurrent failures in communication protocols. Journal of Systems and Software, Vol. 35, No. 1, pp. 55-65.

9. Amari, S. V. (2000): Transient analysis of reliability with and without repair for K-out-of-N: G systems with M failures modes. Reliability Engineering and System Safety, Vol. 67, No. 3, pp. 321-324.

10. Amari, S., Pham, H. and Dill, G. (2004): Optimal design of K-out-of-N: G subsystem subjected to imperfect fault coverage. IEEE Transactions on Reliability, Vol. 53, pp. 567-75.

11. Arulmozhi, G. (2002): Reliability of an M-out-of-N warm standby system with R repair facilities. OPSEARCH, Vol. 39, pp. 77-87.

12. Arya, L. D., Choube, S. C. and Arya, R. (2011): Differential evolution applied for reliability optimization of radial distribution systems. International Journal of Electrical Power and Energy Systems, Vol. 33, No. 2, pp. 271-277.

13. Ascher, H. and Feingold, H. (1984): Repairable System Reliability. Marcel Dekker Inc., New York.

14. Attardi, L., Guida, M. and Pulcini, G. (2005): A mixed-Weibull regression model for the analysis of automotive warranty data. Reliability Engineering and System Safety, Vol. 87, pp. 265-273.

15. Atwood, C. L. and Kelly, D. L. (2008): The binomial failure rate common-cause model with WinBUGS. Reliability Engineering and System Safety, Vol. 94, No. 5, pp. 990-999.

16. Aven, T. (1990): Availability formulae for standby systems of similar units that are preventively maintained. IEEE Transactions on Reliability, Vol. 39, pp. 603-6.


17. Avizienis, A. and Chen, L. (1977): On the implementation of N-version programming for software fault tolerance during program execution. In Proceedings of COMPSAC 77, pp. 149-155.

18. Avritzer, A., Cole, R. G. and Weyuker, E. J. (2010): Methods and opportunities for rejuvenation in aging distributed software systems. Journal of Systems and Software, Vol. 83, No. 9, pp. 1568-1578.

19. Azaron, A., Katagiri, H., Sakawa, M. and Modarres, M. (2005): Reliability function of a class of time-dependent systems with standby redundancy. European Journal of Operational Research, Vol. 164, No. 2, pp. 378-86.

20. Balsamo, S., Di Marco, A., Inverardi, P. and Simeoni, M. (2004): Model-based performance prediction in software development: A survey. IEEE Transactions on Software Engineering, Vol. 30, No. 5, pp. 295-310.

21. Belli, F. and Jedrzejowicz, P. (1990): Fault-tolerant programs and their reliability. IEEE Transactions on Reliability, Vol. 16, No. 3, pp. 184-92.

22. Belzunce, F., Martinez-Puertas, H. and Ruiz, J. M. (2011): On optimal allocation of redundant components for series and parallel systems of two dependent components. Journal of Statistical Planning and Inference, Vol. 141, No. 9, pp. 3094-3104.

23. Berman, O. and Kumar, U. D. (1999): Optimization models for recovery block schemes. European Journal of Operational Research, Vol. 115, No. 2, pp. 368-379.

24. Beutner, E. (2010): Non parametric model checking for k-out-of-n systems. Journal of Statistical Planning and Inference, Vol. 140, No. 3, pp. 626-639.

25. Bhaskar, T. and Kumar, U. D. (2006): A cost model for N-version programming with imperfect debugging. Journal of The Operational Research Society, Vol. 57, No. 8, pp. 986-994.

26. Bhuyan, P. and Sarmah, P. (2002): Reliability estimation of a repairable standby redundant system. Statistical Papers, Springer Berlin, Vol. 43, No. 3, pp. 323-336.

27. Bichon, B. J., McFarland, J. M. and Mahadevan, S. (2011): Efficient surrogate models for reliability analysis of systems with multiple failure modes. Reliability Engineering and System Safety, Vol. 96, No. 10, pp. 1386-1395.

28. Bieth, B., Hong, L. and Sarkar, J. (2010): A standby system with two repair persons under arbitrary life- and repair times. Mathematical and Computer Modelling, Vol. 51, pp. 756-767.

29. Biswas, A. and Sarkar, J. (2000): Availability of a system maintained through several imperfect repairs before a replacement or a perfect repair. Statistics & Probability Letters, Vol. 50, pp. 105-114.

30. Blischke, W. R. and Murthy, D. N. P. (1994): Warranty Cost Analysis. Marcel Dekker Inc., New York.




31. Blokus, A. (2006): Reliability analysis of large systems with dependent components. International Journal of Reliability, Quality and Safety Engineering, Vol. 13, No. 1, pp. 1-14.

32. Bobbio, A. (1990): Dependability analysis of fault-tolerant systems: A literature survey. Microprocessing and Microprogramming, Vol. 29, No. 1, pp. 13.

33. Bondavalli, A., Chiaradonna, S., Giandomenico, F. D. and Xu, J. (2002): An adaptive approach to achieving hardware and software fault tolerance in a distributed computing environment. Journal of System Architecture, Vol. 47, No. 9, pp. 763-781.

34. Bondavalli, A., Di Giandomenico, F. and Xu, J. (1993): A cost-effective and flexible scheme for software fault tolerance. Journal of Computer Systems Science and Engineering, Vol. 8, No. 4, pp. 234-244.

35. Brilliant, S. S., Knight, J. C. and Leveson, N. G. (1990): Analysis of faults in an N-version software experiment. IEEE Transactions on Software Engineering, Vol. 16, No. 2, pp. 238-247.

36. Buchholz, P., Kemper, P. and Kriege, J. (2010): Multi-class Markovian arrival processes and their parameter fitting. Performance Evaluation, Vol. 67, No. 11, pp. 1092-1106.

37. Bueno, V. D. C. and Carmo, I. M. D. (2007): Active redundancy allocation for a k-out-of-n: F system of dependent components. European Journal of Operational Research, Vol. 176, No. 2, pp. 1041-1051.

38. Cai, K. Y., Hu, D. B., Bai, C. G., Hu, H. and Jing, T. (2008): Does software reliability growth behavior follow a non-homogeneous Poisson process. Information and Software Technology, Vol. 50, No. 12, pp. 1232-1247.

39. Carpenter, G. F. (1990): Mechanism for evaluating the effectiveness of software fault-tolerant structures. Microprocessors and Microsystems, Vol. 14, No. 8, pp. 505-510.

40. Castillo, E., Minguez, R. and Castillo, C. (2008): Sensitivity analysis in optimization and reliability problems. Reliability Engineering and System Safety, Vol. 93, No. 12, pp. 1788-1800.

41. Catelani, M., Ciani, L., Scarano, V. L. and Bacioccola, A. (2011): Software automated testing: A solution to maximizing the test plan coverage and to increase software reliability and quality in use. Computer Standards and Interfaces, Vol. 33, No. 2, pp. 152-158.

42. Catuneanu, V. M., Moldovan, C., Popentiu, F. L. and Popovici, D. (1991): Software reliability release policy with testing effort. Microelectronics Reliability, Vol. 31, No. 5, pp. 895-899.

43. Cekyay, B. and Ozekici, S. (2010): Mean time to failure and availability of semi-Markov missions with maximal repair. European Journal of Operational Research, Vol. 207, pp. 1442-1454.

44. Cha, J. H., Mi, J. and Yun, W. Y. (2008): Modelling general standby system and evaluation of its performance. Applied Stochastic Models in Business and Industry, Vol. 24, pp. 159-169.

45. Chakravarthy, S. R. and Gomez-Corral, A. (2009): The influence of delivery times on repairable k-out-of-N systems with spares. Applied Mathematical Modelling, Vol. 33, No. 5, pp. 2368-2387.

46. Chandrasekhar, P., Natarajan, R. and Yadavalli, V. S. S. (2004): A study on a two unit standby system with Erlangian repair time. Asia-Pacific Journal of Operational Research, Vol. 21, No. 3, pp. 271-277.

47. Chang, W. K. and Jeng, S. L. (2005): Impartial evaluation in software reliability practice. Journal of Systems and Software, Vol. 76, No. 2, pp. 99-110.

48. Chang, Y., Amari, S. and Kuo, S. (2005): OBDD-based evaluation of reliability and importance measures for multistate systems subject to imperfect fault coverage. IEEE Transactions on Dependable and Secure Computing, Vol. 2, pp. 336-347.

49. Chari, A. A., Shatri, M. P. and Verma, S. M. (1991): Reliability analysis in the presence of change common cause shock failures. Microelectronics Reliability, Vol. 31, pp. 15-19.

50. Chatterjee, S., Misra, R. B. and Alam, S. S. (2004): N-version programming with imperfect debugging. Computers & Electrical Engineering, Vol. 30, No. 6, pp. 453-463.

51. Chen, Y. M. (1992): Transient analysis of reliability and availability in k-to-l-out-of-n: G system. Reliability Engineering and System Safety, Vol. 35, No. 3, pp. 179-82.

52. Cheng, S. T., Chen, C. M. and Tripathi, S. K. (2000): A fault-tolerant model for multiprocessor real-time systems. Journal of Computer and System Sciences, Vol. 61, No. 3, pp. 457-477.

53. Chiaradonna, S., Bondavalli, A. and Strigini, L. (1994): On performability modeling and evaluation of software fault tolerance structures. In Proceedings of EDCC-1, Berlin, Germany, pp. 97-114.

54. Chiquet, J., Eid, M. and Limnios, N. (2008): Modelling and estimating the reliability of stochastic dynamical systems with Markovian switching. Reliability Engineering & System Safety, Vol. 93, No. 12, pp. 1801-1808.

55. Choi, J. G. and Seong, P. H. (2006): Reliability assessment of embedded digital system using multi-state function. Reliability Engineering and System Safety, Vol. 91, No. 3, pp. 261-269.

56. Chow, D. K. (1971): Reliability of two items in sequence with sensing and switching. IEEE Transactions on Reliability, Vol. 20, pp. 254-256.

57. Chung, W. K. (1980): An availability calculation for K-out-of-N redundant system with common-cause failures and replacement. Microelectronics and Reliability, Vol. 20, pp. 517-519.

58. Chung, W. K. (1981): A k-out-of-N: G three-state unit redundant system with common-cause failure and replacements. Microelectronics and Reliability, Vol. 21, No. 4, pp. 589-591.

59. Chung, W. K. (1995): Reliability of imperfect switching of cold standby systems with multiple non-critical and critical errors. Microelectronics and Reliability, Vol. 35, No. 12, pp. 1479-1482.

60. Da Costa Bueno, V. (2005): Minimal standby redundancy allocation in a k-out-of-n: F system of dependent components. European Journal of Operational Research, Vol. 165, No. 3, pp. 786-793.

61. Da Costa Bueno, V. and Do Carmo, I. M. (2007): Active redundancy allocation for a k-out-of-n: F system of dependent components. European Journal of Operational Research, Vol. 176, No. 2, pp. 1041-1051.

62. Dabney, R. W., Etzkorn, L. and Cox, G. W. (2008): A fault tolerant approach to test control utilizing dual-redundant processors. Advances in Engineering Software, Vol. 39, No. 5, pp. 371-383.

63. De Smidt-Destombes, K. S., van der Heijden, M. C. and van Harten, A. (2004): On the availability of a k-out-of-N system given limited spares and repair capacity under a condition based maintenance strategy. Reliability Engineering and System Safety, Vol. 83, pp. 287-300.

64. De-Almeida, A. T. and Souza, C. F. M. (1993): Decision theory in maintenance strategy for a 2-unit redundant standby system. IEEE Transactions on Reliability, Vol. 42, No. 3, pp. 401-407.

65. Dhillon, B. S. (1978): A k-out-of-n three state devices system with common-cause failures. Microelectronics and Reliability, Vol. 18, pp. 447-448.

66. Dhillon, B. S. (1993): Reliability and availability analysis of a system with warm standby and common cause failures. Microelectronics and Reliability, Vol. 33, No. 9, pp. 1343-1349.

67. Dhillon, B. S. and Anudu, O. C. (1993): Common-cause failure analysis of a non-identical unit parallel system with arbitrarily distributed repair times. Microelectronics and Reliability, Vol. 33, pp. 87-103.

68. Dhillon, B. S. and Yang, N. (1992): Reliability and availability analysis of warm standby systems with common-cause failures and human errors. Microelectronics and Reliability, Vol. 32, pp. 561-575.

69. Do Van, P., Barros, A. and Berenguer, C. (2010): From differential to difference importance measure for Markov reliability models. European Journal of Operational Research, Vol. 204, No. 3, pp. 513-521.

70. Dominguez-Garcia, A. D., Kassakian, J. G., Schindall, J. E. and Zinchu, J. J. (2008): An integrated methodology for the dynamic performance and reliability evaluation of fault tolerant systems. Reliability Engineering and System Safety, Vol. 93, No. 11, pp. 1628-1649.

71. Dugan, J. B. and Lyu, M. R. (1993): System reliability analysis of N-version programming application. Proceedings of the 4th IEEE International Symposium on Software Reliability Engineering, pp. 103-111.

72. Dugan, J. B. and Lyu, M. R. (1995): Dependability modeling for fault-tolerant software and systems. John Wiley & Sons Ltd., pp. 109-138.

73. Dutuit, Y. and Rauzy, A. (2001): New insights in the assessment of K-out-of-N and related systems. Reliability Engineering and System Safety, Vol. 72, No. 3, pp. 303-314.

74. Eckhardt, D. E. and Lee, L. D. (1988): Fundamental differences in the reliability of N-modular redundancy and N-version programming. Journal of Systems and Software, Vol. 8, No. 4, pp. 313-318.

75. El-Damcese, M. A. (1997): Human error and common-cause failure modeling of a two-unit multiple system. Theoretical and Applied Mechanics, Vol. 26, pp. 117-127.

76. El-Damcese, M. A. (2009): Analysis of warm standby system subject to common cause failures with time varying failure and repair rates. Applied Mathematical Sciences, Vol. 3, No. 18, pp. 853-860.

77. El-Gohary, A. (2004): Estimations of parameters in a three state reliability semi-Markov model. Applied Mathematics and Computation, Vol. 154, No. 2, pp. 389-403.

78. Elsayed, E. (1996): Reliability Engineering. Addison Wesley Longman Reading, Mass.

79. Eryilmaz, S. (2010): Mixture representations for the reliability of consecutive-k systems. Mathematical and Computer Modelling, Vol. 51, No. 5-6, pp. 405-412.

80. Eryilmaz, S. (2011): Dynamic behaviour of k-out-of-n: G systems. Operations Research Letters, Vol. 39, No. 2, pp. 155-159.

81. Fadiloglu, M. M. and Bulut, O. (2010): An embedded Markov chain approach to stock rationing. Operations Research Letters, Vol. 38, pp. 510-515.

82. Flammini, F., Marrone, S., Mazzocca, N. and Vittorini, V. (2009): A new modeling approach to the safety evaluation of n-modular redundant computer system in presence of imperfect maintenance. Reliability Engineering and System Safety, Vol. 94, pp. 1422-1432.

83. Fu, S. (2010): Failure-aware resource management for high availability computing clusters with distributed virtual machines. Journal of Parallel and Distributed Computing, Vol. 70, No. 4, pp. 384-393.

84. Galikowsky, C., Sivazlian, B. D. and Chaovalitwongse, P. (1996): Optimal redundancies for reliability and availability of series system. Microelectronics and Reliability, Vol. 36, pp. 1537-1546.

85. Gamiz, M. L. and Miranda, M. D. M. (2010): Regression analysis of the structure function for reliability evaluation of continuous-state system. Reliability Engineering and System Safety, Vol. 95, No. 2, pp. 134-142.

86. Giandomenico, F. D., Bondavalli, A. and Xu, J. (1995): Hardware and software fault tolerance: adaptive architectures in distributed computing environments. Technical Report B4-15, IEI-CNR.

87. Goel, L. R. and Gupta, R. (1984): Availability analysis of a two-unit cold standby system with two switching failure modes. Microelectronics and Reliability, Vol. 24, No. 3, pp. 419-423.

88. Goel, L. R. De and Shrivastava, P. (1991): Profit analysis of a two unit redundant system with provision for test and correlated failures and repairs. Microelectronics and Reliability, Vol. 31, pp. 827-833.

89. Goel, L. R. De and Shrivastava, P. (1992): A two unit standby system with imperfect switch, preventive maintenance and correlated failures and repairs. Microelectronics and Reliability, Vol. 32, No. 12, pp. 1687-1691.

90. Gokhale, S. S., Philip, T. and Marinos, P. N. (1996): A non-homogeneous Markov software reliability model with imperfect repair. In Proceedings of International Performance and Dependability Symposium (IPDS), Urbana-Champaign, IL., pp. 262-270.

91. Goseva-Popstojanova, K. and Trivedi, K. S. (2000): Failure correlation in software reliability model. IEEE Transactions on Reliability, Vol. 49, pp. 37-48.

92. Goseva-Popstojanova, K. and Trivedi, K. S. (2003): Architecture-based approaches to software reliability prediction. Computers and Mathematics with Applications, Vol. 46, No. 7, pp. 1023-1036.

93. Gou, L., Xu, H., Gao, C. and Zhu, G. (2011): Stability analysis of a new kind n-unit series repairable system. Applied Mathematical Modelling, Vol. 35, No. 1, pp. 202-217.

94. Grabski, F. (2011): Semi-Markov failure rates processes. Applied Mathematics and Computation, Vol. 217, No. 24, pp. 9956-9965.

95. Grosspietsch, K. E. (1989): Schemes of dynamic redundancy for fault tolerance in random access memories. Microelectronics and Reliability, Vol. 29, No. 6, pp. 1098.

96. Guo, and Hua, W. (2003): Analysis of repairable, warm standby, human & machine systems with two identical units. Mathematics in Practice and Theory, Vol. 33, No. 7, pp. 88-95.

97. Guo, H. and Yang, X. (2007): A simple reliability block diagram method for safety integrity verification. Reliability Engineering and System Safety, Vol. 92, pp. 1267-1273.

98. Gupta, P. P. and Sharma, R. K. (1986): Reliability analysis of two state repairable parallel redundant system under human failure. Microelectronics and Reliability, Vol. 26, No. 2, pp. 221-224.

99. Gupta, P. P. and Tyagi, L. (1986): M.T.T.F. and availability evaluation of a two-unit, two-state, standby redundant complex system with constant human failure. Microelectronics and Reliability, Vol. 26, No. 4, pp. 647-650.

100. Gupta, S. M., Jaiswal, N. K. and Goel, L. R. (1983): Switch failure in a two-unit standby redundant system. Microelectronics and Reliability, Vol. 23, No. 1, pp. 129-132.

101. Gurler, S. and Bairamov, I. (2009): Parallel and k-out-of-n: G systems with non identical components and their mean residual life functions. Applied Mathematical Modelling, Vol. 33, No. 2, pp. 1116-1125.

102. Habib, A. S., Yuge, T., Al-Seedy, R. O. and Ammar, S. I. (2010): Reliability of a consecutive (r, s)-out-of-(m, n): F lattice system with conditions on the number of failed components in the system. Applied Mathematical Modelling, Vol. 34, No. 3, pp. 531-538.

103. Hajeeh, M. A. (2011): Reliability and availability of a standby system with common cause failure. International Journal of Operational Research, Vol. 11, No. 3, pp. 343-363.

104. Hall, B. J. and Mosleh, A. (2008): An analytical framework for reliability growth of one-shot systems. Reliability Engineering and System Safety, Vol. 93, pp. 1751-1760.

105. Hamlet, D. (1995): Software quality, software process and software testing. Advances in Computers, Vol. 41, pp. 191-229.

106. Ho, S. L., Xie, M. and Goh, T. N. (2003): A study of the connectionist models for software reliability prediction. Computers and Mathematics with Applications, Vol. 46, No. 7, pp. 1037-1045.

107. Hoeflin, D. A. and Mendiratta, V. B. (1995): An elementary model for predicting switching system outage durations. Proceedings of the XV International Switching Symposium, Berlin.

108. Hong, J. S., Koo, H. Y. and Lie, C. H. (2002): Joint reliability importance of k-out-of-n systems. European Journal of Operational Research, Vol. 142, pp. 539-547.

109. Hoyland, A. and Rausand, M. (1994): System Reliability Theory: Models and Statistical Methods. John Wiley and Sons.

110. Hsieh, C. C. (2003): Optimal task allocation and hardware redundancy policies in distributed computing systems. European Journal of Operational Research, Vol. 147, No. 2, pp. 430-447.

111. Hsieh, C. C., and Hsieh, Y. C. (2003): Reliability and cost optimization in distributed computer systems. Computers & Operations Research, Vol. 30, No. 8, pp. 1103-1119.

112. Hsieh, Y. C. and Wang, K. H. (1995): Reliability of a repairable system with spares and a removable repairman. Microelectronics and Reliability, Vol. 35, No. 2, pp. 197-208.

113. Hsu, C. J., Huang, C. Y. and Chang, J. R. (2011a): Enhancing software reliability modeling and prediction through the introduction of time-variable fault reduction factor. Applied Mathematical Modelling, Vol. 35, No. 1, pp. 506-521.

114. Hsu, Y. L., Ke, J. C. and Lee, S. L. (2008): On a redundant repairable system with switching failure: Bayesian approach. Journal of Statistical Computation and Simulation, Vol. 78, No. 12, pp. 1163-1180.

115. Hsu, Y. L., Ke, J. C. and Liu, T. H. (2011b): Standby system with general repair, reboot delay, switching failure and unreliable repair facility: A statistical standpoint. Mathematics and Computers in Simulation, Vol. 81, No. 11, pp. 2400-2413.






116. Hsu, Y. L., Lee, S. L. and Ke, J. C. (2009): A repairable system with imperfect coverage and reboot: Bayesian and asymptotic estimation. Mathematics and Computers in Simulation, Vol. 79, pp. 2227-2239.

117. Hu, W. W. (2006): Asymptotic stability of a parallel repairable system with warm standby under common cause failure. Journal of Mathematical Analysis and Applications, Vol. 8, No. 1, pp. 5-20.

118. Huang, C. Y. (2005): Performance analysis of software reliability growth models with testing-effort and change-point. Journal of Systems and Software, Vol. 76, No. 2, pp. 181-194.

119. Huang, C. Y. and Chang, Y. R. (2007): An improved decomposition scheme for assessing the reliability of embedded systems by using dynamic fault trees. Reliability Engineering and System Safety, Vol. 92, pp. 1403-1412.

120. Huang, C. Y. and Huang, T. Y. (2010): Software reliability analysis and assessment using queueing models with multiple change-points. Computers and Mathematics with Applications, Vol. 60, No. 7, pp. 2015-2030.

121. Huang, C. Y. and Kintala, C. (1995): Software Fault Tolerance in the Application Layer. In M. R. Lyu (Ed.), Software Fault Tolerance, John Wiley.

122. Huang, C. Y. and Kuo, S. (2002): Analysis of incorporating logistic testing effort function into software reliability modeling. IEEE Transactions on Reliability, Vol. 51, No. 3, pp. 261-270.

123. Huang, C. Y. and Lyu, M. (2005): Optimal release time for software systems considering cost, testing efforts and test efficiency. IEEE Transactions on Reliability, Vol. 54, No. 4, pp. 583-591.

124. Huang, C. Y., Kuo, S. and Luo, M. (2007a): An assessment of testing-effort development software reliability growth models. IEEE Transactions on Reliability, Vol. 56, No. 2, pp. 198-211.

125. Huang, H. I., Lin, C. H. and Ke, J. C. (2006): Parametric nonlinear programming approach for a repairable system with switching failure and fuzzy parameter. Applied Mathematics and Computation, Vol. 183, pp. 508-517.

126. Huang, H. Z., Liu, Z. J. and Murthy, D. N. P. (2007b): Optimal reliability, warranty and price for new products. IIE Transactions, Vol. 39, pp. 819-827.

127. Huang, J., Zuo, M. and Wu, Y. (2000): Generalized multi-state K-out-of-N: G systems. IEEE Transactions on Reliability, Vol. 49, pp. 105-11.

128. Huang, Y., Kintala, C., Kolettis, N. and Fulton, N. D. (1995): Software rejuvenation: analysis, module and applications. In Proceedings of 25th Symposium on Fault Tolerant Computing, pp. 381-390.

129. Hughes, R. P. (1987): A new approach to common cause failure. Reliability Engineering and System Safety, Vol. 17, pp. 211-236.

130. Hughes-Fenchel, G. (1997): A flexible clustered approach to high availability. Proceedings of the Twenty-Seventh Annual International Symposium on Fault Tolerant Computing, Seattle, WA.

131. Hunter, J. J. (1996): Mathematical techniques for warranty analysis. In W. R. Blischke & D. N. P. Murthy (Eds.), Product Warranty Handbook, New York: Marcel Dekker, pp. 157-190.

132. Hwang, F. K. (1986): Simplified reliabilities for consecutive k-out-of-n systems. Society for Industrial and Applied Mathematics Alg. Disc. Math., Vol. 7, pp. 258-264.

133. Inoue, S. and Yamada, S. (2004): Stochastic differential equation modeling for testing-effort dependent software reliability assessment. In Proceedings of ISSAT, pp. 256-260.

134. Iwamoto, K., Dohi, T. and Kaio, N. (2008): Estimating periodic software rejuvenation schedules under discrete-time operation circumstance. IEICE Transactions, 91-D, No. 1, pp. 23-31.

135. Jain, M. (1998): Reliability analysis of two-unit system with common cause failure. Indian Journal of Pure and Applied Mathematics, Vol. 29, No. 12, pp. 1-8.

136. Jain, M. and Baghel, K. P. S. (2001): A multi-components spare and state dependent rates. The Nepali Mathematical Science Report, Vol. 19, No. 2, pp. 81-92.

137. Jain, M., Baghel, K. P. S., and Jadown, M. (2004): Performance prediction of machine interference model with spare and two mode of failure. Operations Research, Information Technology and Industry, Eds M. Jain and G. C. Sharma, S.R.S Pub., Agra, pp. 197-208.

138. Jain, M., Rakhee and Singh, M. (2004): Bilevel control of degraded machining system with warm standbys, setup and vacation. Applied Mathematical Modelling, Vol. 28, No. 12, pp. 1015-1026.

139. Jain, M., Sharma, G. C. and Singh, N. (2007): Transient analysis of M/M/R machining system with mixed standbys, switching failures, balking, reneging and additional removable repairmen. IJE Transactions B: Basics, Vol. 20, No. 2, pp. 169-182.

140. Jenab, K. and Dhillon, B. S. (2006): Assessment of reversible multi-state k-out-of-n: G/F load-sharing systems with flow-graph models. Reliability Engineering and System Safety, Vol. 91, pp. 765-771.

141. Jankala, K. E. and Vaurio, J. K. (1993): Residual common cause failure analysis in a probabilistic safety assessment. In Proceedings of P.S.A., Vol. 2, pp. 804-810.

142. Jeske, D. R. and Zhang, X. (2005): Some successful approaches to software reliability modeling in industry. Journal of Systems and Software, Vol. 74, No. 1, pp. 85-99.

143. Jha, P. C., Gupta, D., Yang, B. and Kapur, P. K. (2009): Optimal testing resource allocation during module testing considering cost, testing effort and reliability. Computers & Industrial Engineering, Vol. 57, No. 3, pp. 1122-1130.

144. Jie, M. (1991): Interval estimation of availability of a series system. IEEE Transactions on Reliability, Vol. R-40, No. 5, pp. 541-546.

145. Jin, T., Liao, H. and Kilari, M. (2010): Reliability growth modeling for in-service electronic systems considering latent failure modes. Microelectronics Reliability, Vol. 50, No. 3, pp. 324-331.

146. Kallen, M. J. (2011): Modelling imperfect maintenance and the reliability of complex system using superposed renewal process. Reliability Engineering and System Safety, Vol. 96, No. 6, pp. 636-641.

147. Kancev, D. and Cepin, M. (2011): Evaluation of risk and cost using an age-dependent unavailability modeling of test and maintenance for standby components. Journal of Loss Prevention in the Process Industries, Vol. 24, No. 2, pp. 146-155.

148. Kanoun, K., Kaaniche, M., Beounes, C., Laprie, J. C. and Arlat, J. (1993): Reliability growth of fault tolerant software. IEEE Transactions on Reliability, Vol. 42, No. 2, pp. 205-18.

149. Kant, K. (1987): Software fault tolerance in real-time systems. Information Sciences, Vol. 42, No. 3, pp. 255-282.

150. Kapur, P. K. and Bardhan, A. (2002): Testing effort control through software reliability growth modeling. International Journal of Modelling Simulation, Vol. 22, No. 1, pp. 90-96.

151. Kapur, P. K. and Garg, R. B. (1990): Compound availability measures for a two-unit standby system. Microelectronics and Reliability, Vol. 30, No. 3, pp. 425-429.

152. Kapur, P. K., Goswami, D. and Gupta, A. (2004): A software reliability growth model with testing effort dependent learning function for distributed systems. International Journal of Reliability, Quality and Safety Engineering, Vol. 11, No. 4, pp. 365-377.

153. Kapur, P. K., Sharma, K. O., and Garg, R. B. (1992): Transient solutions of software reliability model with imperfect debugging and error-detection/generation. Microelectronics and Reliability, Vol. 32, No. 1, pp. 475-478.

154. Kapur, P. K., Shatnawi, O., Agarwal, A. G. and Kumar, R. (2009): Unified framework for developing testing effort dependent software reliability growth models. WSEAS Transactions on Systems, Vol. 8, No. 4, pp. 521-531.

155. Ke, J. B., Chen, J. W. and Wang, K. H. (2011): Reliability measures of a repairable system with standby switching failures and reboot delay. Quality Technology & Quantitative Management, Vol. 8, No. 1, pp. 15-26.

156. Ke, J. B., Lee, W.C. and Wang, K. H. (2007): Reliability and sensitivity analysis of a system with multiple unreliable service stations and standby switching failures. Physica A: Statistical Mechanics and its Applications, Vol. 380, No. 1, pp. 455-469.

157. Ke, J. C. and Lee, S. L. (2007): Asymptotic confidence limits for a repairable system with standbys subject to switching failures. American Journal of Applied Science, Vol. 4, No. 11, pp. 834-849.

158. Ke, J. C., Lee, S. L. and Hsu, Y. L. (2008): On a repairable system with detection, imperfect coverage and reboot: Bayesian approach. Simulation Modelling Practice and Theory, Vol. 16, No. 3, pp. 353-367.

159. Ke, J. C., Su, Z. L., Wang, K. H. and Hsu, Y. L. (2010): Simulation inferences for an availability system with general repair distribution and imperfect fault coverage. Simulation Modelling Practice and Theory, Vol. 18, No. 3, pp. 338-347.

160. Kemeny, J. G. and Snell, J. L. (1976): Finite Markov Chains. Springer-Verlag, New York, NY.

161. Khan, F. G., Qureshi, K. and Nazir, B. (2010): Performance evaluation of fault tolerance techniques in grid computing system. Computers & Electrical Engineering, Vol. 36, No. 6, pp. 1110-1122.

162. Kharoufeh, J. P., Finkelstein, D. and Mixon, D. (2006): Availability of periodically inspected systems with Markovian wear and shocks. Journal of Applied Probability, Vol. 43, pp. 303-317.

163. Kim, K. H. and Welch, H. O. (1989): Distributed execution of recovery blocks: an approach for uniform treatment of hardware and software faults in real-time applications. IEEE Transactions on Computers, Vol. C-38, No. 5, pp. 626-636.

164. Kim, S. K. and Dshalalow, J. H. (2002): Stochastic disaster recovery systems with external resources. Mathematical and Computer Modelling, Vol. 36, No. 11-13, pp. 1235-1257.

165. Kiureghian, A. D., Ditlevsen, O. D. and Song, J. (2007): Availability, reliability and downtime of systems with repairable components. Reliability Engineering and System Safety, Vol. 92, pp. 231-242.

166. Knight, J. C. and Leveson, N. G. (1986): An experimental evaluation of the assumption of independence in multi-version programming. IEEE Transactions on Software Engineering, Vol. 12, pp. 96-106.

167. Kornecki, A. J. and Zalewski, J. (2010): Hardware certification for real-time safety critical systems: State of the art. Annual Reviews in Control, Vol. 34, pp. 163-174.

168. Kumar, A. and Agarwal, M. (1980): A review of standby systems. IEEE Transactions on Reliability, Vol. R-29, pp. 290-294.

169. Kumar, A., Agarwal, M. L. and Garg, S. C. (1986): Reliability analysis of a two-unit redundant system with critical human error. Microelectronics and Reliability, Vol. 26, pp. 867-871.

170. Kuniewski, S. P., Weide, J. A. M. V. D. and Noortwijk, J. M. V. (2009): Sampling inspection for the evaluation of time dependent reliability of deteriorating systems under imperfect defect detection. Reliability Engineering and System Safety, Vol. 94, No. 9, pp. 1480-1490.

171. Kuo, L. (2005): Software Reliability. Handbook of Statistics, Vol. 25, pp. 929-963.




172. Kuo, S., Huang, C. and Lyu, M. (2001): Framework for modeling software reliability using various testing-efforts and fault detection rates. IEEE Transactions on Reliability, Vol. 50, No. 3, pp. 310-320.

173. Kvam, P. H. and Miller, J. G. (2002): Common cause failure prediction using data mapping. Reliability Engineering and System Safety, Vol. 76, No. 3, pp. 273-278.

174. Labib, S. W. (1991): Stochastic analysis of a two-unit warm standby system with two switching devices. Microelectronics and Reliability, Vol. 31, No. 6, pp. 1163-1173.

175. Lai, C. D., Xie, M., Poh, K. L., Dai, Y. S. and Yang, P. (2002): A model for availability analysis of distributed software/hardware systems. Information and Software Technology, Vol. 44, pp. 343-350.

176. Lala, J. H. and Alger, L. S. (1988): Hardware and software fault tolerance: a unified architectural approach. In Proceedings of the IEEE 18th International Symposium on Fault Tolerant Computing, pp. 240-245.

177. Laplante, P. A. (1993): Fault-tolerant control of real time systems in the presence of single event upsets. Control Engineering Practice, Vol. 1, No. 5, pp. 763-769.

178. Laprie, J. C. (1987): Hardware and software-fault tolerance: definition and analysis of architectural solutions. Digest of Papers, FTCS-17: The Seventeenth International Symposium on Fault-Tolerant Computing, pp. 116-121.

179. Laprie, J. C. (1990): Definition and analysis of hardware- and software-fault-tolerance architectures. IEEE Computer, pp. 39-51.

180. Laprie, J. C. (1995): Architectural Issues in Software Fault Tolerance. Michael R. Lyu, editor, Wiley, pp. 47-80.

181. Laprie, J. C., Arlat, J., Beounes, C. and Kanoun, K. (1990): Definition and analysis of hardware-and-software fault-tolerant architectures. IEEE Computer, Vol. 23, No. 7, pp. 39-51.

182. Laval, J., Denier, S., Ducasse, S. and Falleri, J. R. (2011): Supporting simultaneous versions for software evolution assessment. Science of Computer Programming, Vol. 76, No. 12, pp. 1177-1193.

183. Leach, R. J. (2008): Setting checkpoints in legacy code to improve fault-tolerance. Journal of Systems and Software, Vol. 81, No. 6, pp. 920-928.

184. Lee, E. A. (2002): Embedded software. Advances in Computers, Vol. 56, pp. 55-95.

185. Leu, S. W., Fernandez, E. B. and Khoshgoftaar, T. (1991): Fault-tolerant software reliability modeling using Petri nets. Microelectronics and Reliability, Vol. 31, No. 4, pp. 645-667.

186. Levitin, G. (2001): Incorporating common-cause failures into non repairable multistate series-parallel system analysis. IEEE Transactions on Reliability, Vol. 50, pp. 380-388.


187. Levitin, G. (2004): Reliability and performance analysis for fault tolerant programs consisting of versions with different characteristics. Reliability Engineering and System Safety, Vol. 86, No. 1, pp. 75-81.

188. Levitin, G. (2006): Reliability and performance analysis of hardware–software systems with fault-tolerant software components. Reliability Engineering and System Safety, Vol. 91, pp. 570-579.

189. Levitin, G. (2007): Block diagram method for analyzing multi-state systems with uncovered failures. Reliability Engineering and System Safety, Vol. 92, No. 6, pp. 727-734.

190. Levitin, G. and Amari, S. V. (2008): Multi-state systems with multi-fault coverage. Reliability Engineering and System Safety, Vol. 93, pp. 1730-1739.

191. Levitin, G. and Amari, S. V. (2010): Approximation algorithm for evaluating time-to-failure distribution of k-out-of-n system with shared standby elements. Reliability Engineering and System Safety, Vol. 95, pp. 396-401.

192. Levitin, G. and Xing, L. (2010): Reliability and performance of multi-state systems with propagated failures having selective effect. Reliability Engineering and System Safety, Vol. 95, pp. 655-661.

193. Lewis, E. E. (1994): Introduction to Reliability Engineering. Second edition. Illinois, USA: John.

194. Lewis, E. E. (2001): A load-capacity interference model for common mode failure in 1-out-of-2G system. IEEE Transactions on Reliability, Vol. 50, pp. 47-51.

195. Li, C. Y., Chen, X., Yi, X. S. and Tao, J. Y. (2010): Heterogeneous redundancy optimization for multi-state series–parallel systems subject to common cause failure. Reliability Engineering and System Safety, Vol. 95, No. 3, pp. 202-207.

196. Li, H. F., Wei, Z. and Goswami, D. (2006): Quasi-atomic recovery for distributed agents. Parallel Computing, Vol. 32, No. 10, pp. 733-758.

197. Li, X. and Hu, X. (2008): Some new stochastic comparisons for redundancy allocations in series and parallel systems. Statistics & Probability Letters, Vol. 78, No. 18, pp. 3388-3394.

198. Li, X., Xie, M. and Ng, S. H. (2011): Sensitivity analysis of release time of software reliability models incorporating testing effort with multiple change-points. Applied Mathematical Modelling, Vol. 34, No. 11, pp. 3560-3570.

199. Li, X., Yan, R. and Zuo, M. J. (2009a): Evaluating a warm standby system with components having proportional hazard rates. Operations Research Letters, Vol. 37, pp. 56-60.

200. Li, Z., Liao, H. and Coit, D. W. (2009b): A two-stage approach for multi-objective decision making with applications to system reliability optimization. Reliability Engineering and System Safety, Vol. 94, No. 10, pp. 1585-1592.


201. Lim, S. H., Lee, B. H. and Kim, J. H. (2008): Diversity and fault avoidance for dependable replication systems. Information Processing Letters, Vol. 108, No. 1, pp. 33-37.

202. Linberg, K. R. (1999): Software developer perceptions about software project failure: A case study. Journal of Systems and Software, Vol. 49, No. 2-3, pp. 177-192.

203. Lisnianski, A., Levitin, G. and Ben-Haim, H. (2000): Structure optimization of multi-state system with time redundancy. Reliability Engineering and System Safety, Vol. 67, No. 2, pp. 103-112.

204. Littlewood, B. (1975): A reliability model of Markov structured software. Proceedings of the International Conference on Reliable Software, pp. 204-207.

205. Littlewood, B., Popov, P. and Strigini, L. (2002): Assessing the reliability of diverse fault-tolerant software based systems. Safety Science, Vol. 40, No. 9, pp. 781-796.

206. Lo, J. H., Huang, C. Y., Chen, I. Y., Kuo, S. Y. and Lyu, M. R. (2005): The reliability assessment and sensitivity analysis of software reliability growth modeling based on software module structure. The Journal of Systems and Software, Vol. 76, pp. 3-13.

207. Lu, L. and Lewis, G. (2006): Reliability evaluation of standby safety systems due to independent and common cause failures. Proceedings of the 2006 IEEE International Conference on Automation Science and Engineering, Shanghai, China, pp. 264-269.

208. Lu, L. and Lewis, G. (2006): Reliability evaluation of standby safety systems due to independent and common cause failures. IEEE Conference on Automation Science and Engineering, pp. 274-279.

209. Lu, L. and Lewis, G. (2008): Configuration determination for K-out-of-N partially redundant systems. Reliability Engineering and System Safety, Vol. 93, No. 11, pp. 1594-1604.

210. Lv, X., Wan, C. and Bi, G. (2010): Block orthogonal greedy algorithm for stable recovery of block-sparse signal representations. Signal Processing, Vol. 90, No. 12, pp. 3265-3277.

211. Lyu, M. R. (1995): Software Fault Tolerance. John Wiley & Sons.

212. Lyu, M. R. (1996): Handbook of Software Reliability Engineering. IEEE Computer Society Press, McGraw-Hill.

213. Maheshwari, S., Sharma, P. and Jain, M. (2010): Machine repair problem with K-type warm spares, multiple vacations for repairmen and reneging. International Journal of Engineering and Technology, Vol. 2, No. 4, pp. 252-258.

214. Mahmoud, M. A. W. and Moshref, M. E. (2010): On a two unit cold standby system considering hardware, human error failures and preventive maintenance. Mathematical and Computer Modelling, Vol. 51, No. 5-6, pp. 736-745.




215. Mahmoud, M., Mokhles, M. A. and Saleh, E. H. (1987): Availability analysis of a repairable system with common cause failure and one standby unit. Microelectronics and Reliability, Vol. 27, pp. 741-754.

216. Malaiya, Y. K. and Srimani, P. K. (1990): Software reliability models: theoretical developments, evaluation and applications. IEEE Computer Society Press, Los Alamitos.

217. McAllister, D. F. and Scott, R. K. (1991): Cost modeling of fault-tolerant software. Information and Software Technology, Vol. 33, No. 8, pp. 594-603.

218. Meedeniya, I., Buhnova, B., Aleti, A. and Grunske, L. (2011): Reliability driven deployment optimization for embedded systems. Journal of Systems and Software, Vol. 84, No. 5, pp. 835-846.

219. Mendiratta, V. B. (1996): Assessing the reliability impacts of software fault tolerance mechanisms. Proceedings of 1996 International Symposium of Software Reliability Engineering, White Plains, NY.

220. Moghaddass, R. and Zuo, M. J. (2011): Optimal design of a repairable k-out-of-n system considering maintenance. Reliability and Maintainability Symposium (RAMS), Lake Buena Vista, pp. 1-6.

221. Moghaddass, R., Zuo, M. J. and Wang, W. (2010): Availability of a general k-out-of-n: G system with non-identical components considering shut-off rules using quasi-birth-death process. Reliability Engineering and System Safety, Vol. 96, pp. 489-496.

222. Mokaddis, G. S., Labib, S. W. and Ahmed, A. M. (1997): Analysis of a two unit warm standby system subject to degradation. Microelectronics and Reliability, Vol. 37, No. 4, pp. 641-647.

223. Mokaddis, G. S., Labib, S. W. and El-Said, K. M. (1994): Two models for two dissimilar-unit standby redundant system with three types of repair facilities and perfect or imperfect switch. Microelectronics and Reliability, Vol. 34, No. 7, pp. 1239-1247.

224. Montoro-Cazorla, D. and Perez-Ocon, R. (2006): Reliability of a system under two types of failure using a Markovian arrival process. Operations Research Letters, Vol. 34, No. 5, pp. 525-530.

225. Mosleh, A. (1991): Common cause failure: An analysis methodology and examples. Reliability Engineering and System Safety, Vol. 34, pp. 249-292.

226. Moustafa, M. S. (1997): Reliability analysis of K-out-of-M: G systems with dependent failures and imperfect coverage. Reliability Engineering and System Safety, Vol. 58, pp. 15-17.

227. Moustafa, M. S. (1998): Transient analysis of reliability with and without repair for k-out-of-n: G systems with M failure modes. Reliability Engineering and System Safety, Vol. 59, pp. 317-320.

228. Moustafa, M. S. (2001a): Availability of K-out-of-N: G systems with exponential failures and general repairs. Economic Quality Control, Vol. 16, No. 1, pp. 75-82.


229. Munch, J. and Heidrich, J. (2004): Software project control centers: Concepts and approaches. Journal of Systems and Software, Vol. 70, No. 1-2, pp. 3-19.

230. Murthy, D. N. P., Solem, O. and Roren, T. (2004): Product warranty logistics: Issues and challenges. European Journal of Operational Research, Vol. 156, No. 1, pp. 110-126.

231. Musa, J. D., Iannino, A. and Okumoto, K. (1987): Software Reliability: Measurement, Prediction, and Application. New York, McGraw-Hill.

232. Myers, A. (2007): k-out-of-n: G system reliability with imperfect fault coverage. IEEE Transactions on Reliability, Vol. 56, pp. 464-473.

233. Nahas, N., Nourelfath, M. and Ait-Kadi, D. (2007): Coupling ant colony and the degraded ceiling algorithm for the redundancy allocation problem of series-parallel systems. Reliability Engineering and System Safety, Vol. 92, No. 2, pp. 211-222.

234. Nakagawa, T. and Yasui, K. (2003): Note on reliability of a system complexity. Mathematical and Computer Modelling, Vol. 38, No. 11-13, pp. 1365-1371.

235. Nicola, V. F. and Goyal, A. (1990): Modeling of correlated failures and community error recovery in multi version software. IEEE Transactions on Software Engineering, Vol. 16, No. 3, pp. 350-359.

236. Noortwijk, V. J. M. and Weide, V. D. J. A. M. (2008): Applications to continuous-time processes of computational techniques for discrete-time renewal processes. Reliability Engineering and System Safety, Vol. 93, pp. 1853-1860.

237. Oltean, M. and Diosan, L. (2009): An autonomous GP-based system for regression and classification problems. Applied Soft Computing, Vol. 9, No. 1, pp. 49-60.

238. Osaki, S. and Nakagawa, T. (1976): Bibliography for reliability and availability of stochastic system. IEEE Transactions on Reliability, Vol. R-25, pp. 284-287.

239. Ou, Y. and Bechta-Dugan, J. (2003): Approximate sensitivity analysis for acyclic Markov reliability models. IEEE Transactions on Reliability, Vol. 52, No. 2, pp. 220-231.

240. Pan, J. N. (1997): Reliability prediction of imperfect switching systems subject to multiple stresses. Microelectronics and Reliability, Vol. 37, No. 3, pp. 439-445.

241. Park, K. and Kim, S. (2002): Availability analysis and improvement of active/standby cluster systems using software rejuvenation. Journal of Systems and Software, Vol. 61, No. 2, pp. 121-128.

242. Park, M. and Pham, H. (2008): Warranty system-cost analysis using quasi-renewal process. OPSEARCH, Vol. 45, No. 3, pp. 263-274.

243. Park, M. and Pham, H. (2010): Altered quasi-renewal concepts for modeling renewable warranty costs with imperfect repairs. Mathematical and Computer Modelling, Vol. 52, No. 9-10, pp. 1435-1450.


244. Pham, H. (1994): On the optimal design of N-version software systems subject to constraints. Journal of Systems and Software, Vol. 27, No. 1, pp. 55-61.

245. Pham, H. (2003a): A software cost model with imperfect debugging, random life cycle and penalty cost. International Journal of Systems Science, Vol. 27, pp. 455-463.

246. Pham, H. (2003b): Software reliability and cost models: Perspectives, comparison, and practice. European Journal of Operational Research, Vol. 149, No. 3, pp. 475-489.

247. Pham, H. (2007): An imperfect-debugging fault-detection dependent-parameter software. International Journal of Automation and Computing, Vol. 4, No. 4, pp. 325-328.

248. Pham, H. and Wang, H. (2001): A quasi-renewal process for software reliability and testing costs. IEEE Transactions on Systems, Man and Cybernetics-Part A: Systems and Humans, Vol. 31, pp. 623-631.

249. Pham, H. and Zhang, X. (2003): NHPP software reliability and cost models with testing coverage. European Journal of Operational Research, Vol. 145, No. 2, pp. 443-454.

250. Pham, H., Nordmann, L. and Zhang, X. (1999): A general imperfect software-debugging model with S-shaped fault detection rate. IEEE Transactions on Reliability, Vol. 48, No. 2, pp. 168-175.

251. Pham, H., Suprasad, A. and Misra, R. B. (1997): Availability and mean life time prediction of multistage degraded system with partial repairs. Reliability Engineering and System Safety, Vol. 56, No. 2, pp. 169-173.

252. Propp, J. G. and Wilson, D. B. (1996): Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Structures and Algorithms, Vol. 9, pp. 223-252.

253. Prowell, S. J. and Poore, J. H. (2004): Computing system reliability using Markov chain usage models. The Journal of Systems and Software, Vol. 73, pp. 219-225.

254. Rackwitz, R. (2001): Reliability analysis: A review and some perspectives. Structural Safety, Vol. 23, No. 4, pp. 365-395.

255. Rafe, V. and Mahdian, F. (2011): Style based modeling and verification of fault tolerance service oriented architecture. Procedia Computer Science, Vol. 3, pp. 972-976.

256. Rahman, A. and Chattopadhyay, G. N. (2006): Review of long term warranty policies. Asia Pacific Journal of Operational Research, Vol. 22, No. 4, pp. 453-473.

257. Rai, B. and Singh, N. (2005): A modeling framework for assessing the impact of new time/mileage warranty limits on the number and cost of automotive warranty claims. Reliability Engineering and System Safety, Vol. 88, No. 2, pp. 157-169.





258. Raj Kiran, N. and Ravi, V. (2008): Software reliability prediction by soft computing techniques. Journal of Systems and Software, Vol. 81, No. 4, pp. 576-583.

259. Rajamanickam, S. P. and Chandrasekar, B. (1997): Reliability measure for two unit systems with a dependent structure for failure and repair times. Microelectronics Reliability, Vol. 37, No. 5, pp. 829-833.

260. Ramirez-Marquez, J. E. and Coit, D. W. (2006): Optimization of system reliability in the presence of common cause failure. Reliability Engineering and System Safety, Vol. 92, No. 10, pp. 1421-1434.

261. Randell, B. (1975): System structure for software fault tolerance. IEEE Transactions on Software Engineering, Vol. SE-1, No. 2, pp. 220-232.

262. Randell, B. and Xu, J. (1995): The evolution of the recovery block concept, in Software Fault Tolerance, Wiley, pp. 1-21.

263. Randles, M., Lamb, D., Odat, E. and Taleb-Bendiab, A. (2011): Distributed redundancy and robustness in complex systems. Journal of Computer and System Sciences, Vol. 77, No. 2, pp. 293-304.

264. Rao, B. M. (2011): A decision support model for warranty servicing of repairable items. Computers and Operations Research, Vol. 38, No. 1, pp. 112-130.

265. Rebba, R. and Mahadevan, S. (2008): Computational methods for model reliability assessment. Reliability Engineering and System Safety, Vol. 93, No. 8, pp. 1197-1207.

266. Rehage, D., Carl, U. B. and Vahl, A. (2005): Redundancy management of fault tolerant aircraft system architectures: reliability synthesis and analysis of degraded system states. Aerospace Science and Technology, Vol. 9, No. 4, pp. 337-347.

267. Reussner, R. H., Schmidt, H. W. and Poernomo, I. H. (2003): Reliability prediction for component-based software architectures. Journal of Systems and Software, Vol. 66, No. 3, pp. 241-252.

268. Rinsaka, K. and Dohi, T. (2007): A faster estimation algorithm for periodic preventive rejuvenation schedule maximizing system availability. ISAS, pp. 94-109.

269. Rossi, G. P. and Simone, C. (1984): A multitasking operating system with explicit treatment of recovery points. Microprocessing and Microprogramming, Vol. 14, No. 2, pp. 55-66.

270. Ruiz-Castro, J. E. and Li, Q. L. (2011): Algorithm for a general discrete k-out-of-n: G systems subject to several types of failure with an indefinite number of repairpersons. European Journal of Operational Research, Vol. 211, No. 1, pp. 97-111.

271. Rushdi, A. M. and Alsulami, A. E. (2007): Cost elasticities of reliability and MTTF for k-out-of-n systems. Journal of Mathematics and Statistics, Vol. 3, No. 3, pp. 122-128.


272. Sadek, A. and Limnios, N. (2005): Nonparametric estimation of reliability and survival function for continuous-time finite Markov processes. Journal of Statistical Planning and Inference, Vol. 133, No. 1, pp. 1-21.

273. Saha, G. K. (2006a): A software tool for fault tolerance. Journal of Information Science & Engineering, Vol. 22, No. 4.

274. Saha, G. K. (2006b): A single-version scheme of fault tolerant computing. Journal of Computer Science & Technology, Vol. 6, No. 1, pp. 22-27.

275. Sahner, R. A., Trivedi, K. S. and Puliafito, A. (1996): Performance and Reliability Analysis of Computer Systems: An Example-Based Approach using the SHARPE Software Package. Kluwer Academic Publishers, Boston, MA.

276. Salameh, M. K. and Jaber, M. Y. (2000): Economic production quantity model for items with imperfect quality. International Journal of Production Economics, Vol. 64, pp. 59-64.

277. Salem, A. M. and El-Damcese, M. A. (2004): Reliability and systems subject to common cause hazards. Nuclear Engineering and Design, Vol. 227, No. 3, pp. 349-354.

278. Salfner, F. and Walter, K. (2010): Analysis of service availability for time-triggered rejuvenation policies. Journal of Systems and Software, Vol. 83, No. 9, pp. 1579-1590.

279. Samatlı-Paç, G. and Taner, M. R. (2009): The role of repair strategy in warranty cost minimization: An investigation via quasi-renewal process. European Journal of Operational Research, Vol. 197, No. 2, pp. 632-641.

280. Santos, R. M., Santos, J. and Orozco, J. D. (2009): Power saving and fault-tolerance in real-time critical embedded systems. Journal of Systems Architecture, Vol. 55, No. 2, pp. 90-101.

281. Sarhan, A. M. (2002): Reliability equivalence with basic series/parallel system. Applied Mathematics and Computation, Vol. 132, No. 1, pp. 115-133.

282. Sarkar, J. and Chaudhuri, G. (1999): Availability of a system with gamma life and exponential repair under a perfect repair policy. Statistics & Probability Letters, Vol. 43, pp. 189-196.

283. Sarkar, J. and Sarkar, S. (2000): Availability of a periodically inspected system under perfect repair. Journal of Statistical Planning and Inference, Vol. 91, pp. 77-90.

284. Sarkar, J. and Sarkar, S. (2001): Availability of a periodically inspected system supported by a spare unit, under perfect repair or perfect upgrade. Statistics & Probability Letters, Vol. 53, pp. 207-217.

285. Savage, G. J. and Son, Y. K. (2011): The set theory method for system reliability of structures with degrading components. Reliability Engineering and System Safety, Vol. 96, No. 1, pp. 108-116.

286. Seo, J. H., Jang, J. S. and Bai, D. S. (2003): Lifetime and reliability estimation of repairable redundant system subject to periodic alternation. Reliability Engineering and System Safety, Vol. 80, pp. 197-204.


287. Shaked, M. and Zhu, H. (1992): Some results on block replacement policies and renewal theory. Journal of Applied Probability, Vol. 29, pp. 932-946.

288. Shanthikumar, J. G. (1982): Recursive algorithm to evaluate the reliability of a consecutive K-out-of-N: F system. IEEE Transactions on Reliability, Vol. R-31, pp. 442-443.

289. She, J. and Pecht, M. G. (1992): Reliability of a k-out-of-n warm-standby system. IEEE Transactions on Reliability, Vol. 41, No. 1, pp. 72-75.

290. Shen, Z., Hu, X. and Fan, W. (2008): Exponential asymptotic property of a parallel repairable system with warm standby under common-cause failure. Journal of Mathematical Analysis and Applications, Vol. 341, No. 1, pp. 457-466.

291. Shet, A. G., Elwasif, W. R., Foley, S. S., Park, B. H., Bernholdt, D. E. and Bramley, R. (2011): Strategies for fault tolerance in multi component applications. Procedia Computer Science, Vol. 4, pp. 2287-2296.

292. Sheu, S. H. (1991): A generalized block replacement policy with minimal repair and general random repair costs for a multi-unit system. Journal of the Operational Research Society, Vol. 42, pp. 331-341.

293. Shi, X., Pazat, J. L., Rodriguez, E., Jin, H. and Jiang, H. (2010): Adapting grid applications to safety using fault-tolerant methods: Design, implementation and evaluation. Future Generation Computer Systems, Vol. 26, No. 2, pp. 236-244.

294. Shooman, M. L. (1983): Software Engineering: Design, Reliability, and Management, New York, McGraw-Hill.

295. Shooman, M. L. (1990): Probabilistic Reliability: An Engineering Approach, 2nd ed. Krieger, Melbourne.

296. Siewiorek, D. P. and Swarz, R. S. (1992): Reliable Computer Systems: Design and Evaluation, Digital Press, USA.

297. Simeu-Abazi, Z., Lefebvre, A. and Derain, J. P. (2011): A methodology of alarm filtering using dynamic fault tree. Reliability Engineering and System Safety, Vol. 96, No. 2, pp. 257-266.

298. Singh, J. (1989): A warm standby redundant system with common cause failure. Reliability Engineering and System Safety, Vol. 26, pp. 135-141.

299. Singh, J. and Goel, P. (1995): Availability analysis of a standby complex system having imperfect switch-over device. Microelectronics and Reliability, Vol. 35, No. 2, pp. 285-288.

300. Smidt-Destombes, K. S. D., Elst, N. P. V., Barros, A. I., Mulder, H. and Hontelez, J. A. M. (2011): Spare parts model with cold-standby redundancy on system level. Computers and Operations Research, Vol. 38, No. 7, pp. 985-991.

301. Smidt-Destombes, K. S., Heijden, M. C. and Harten, A. (2004): On the availability of a K-out-of-N system given limited spares and repair capacity under a condition based maintenance strategy. Reliability Engineering and System Safety, Vol. 83, No. 3, pp. 287-300.




302. Smidts, C. and Sova, D. (1999): An architectural model for software reliability quantification: Sources of data. Reliability Engineering and System Safety, Vol. 64, No. 2, pp. 279-290.

303. Sofokleous, A. A. and Andreou, S. A. (2008): Automatic, evolutionary test data generation for dynamic software testing. Journal of Systems and Software, Vol. 81, No. 11, pp. 1883-1898.

304. Somani, A. K. and Vaidya, N. H. (1997): Understanding fault tolerance and reliability. IEEE Computer, IEEE Computer Society Press, Los Alamitos, CA, USA, Vol. 30, No. 4, pp. 45-50.

305. Soro, I. W., Nourelfath, M. and Ait-Kadi, D. (2010): Performance evaluation of multi-state degraded systems with minimal repairs and imperfect preventive maintenance. Reliability Engineering and System Safety, Vol. 95, No. 2, pp. 65-69.

306. Sridharan, V. and Jayashree, P. R. (1998): Transient solutions of a software model with imperfect debugging and generation of errors by two servers. Mathematical and Computer Modelling, Vol. 27, No. 3, pp. 103-108.

307. Sridharan, V. and Mohanavadivu, P. (1997): Reliability and availability analysis for two non-identical unit parallel systems with common cause failures and human errors. Microelectronics Reliability, Vol. 37, No. 5, pp. 747-752.

308. Chakravarthy, S. R. and Gomez-Corral, A. (2009): The influence of delivery times on repairable k-out-of-N systems with spares. Applied Mathematical Modelling, Vol. 33, No. 5, pp. 2368-2387.

309. Subramanian, R. and Anantharaman, V. (1995): Reliability analysis of a complex standby redundant system. Reliability Engineering and System Safety, Vol. 48, pp. 57-70.

310. Subramanian, R. and Venkatakrishnan, K. S. (1975): Reliability of 2-unit standby redundant system with repair, maintenance and standby failure. IEEE Transactions on Reliability, Vol. R-24, pp. 139-142.

311. Tai, A. T., Avizienis, A. and Meyer, J. F. (1993): Performability enhancement of fault-tolerant software. IEEE Transactions on Reliability, Sp. Issue on Fault Tolerant Software, Vol. R-42, No. 2, pp. 227-237.

312. Tang, L. C. and Lee, L. H. (2005): A simple recovery strategy for economic lot scheduling problem: A two-product case. International Journal of Production Economics, Vol. 98, pp. 97-107.

313. Teng, X. and Pham, H. (2002): A software reliability growth model for N-version programming systems. IEEE Transactions on Reliability, Vol. 51, No. 3, pp. 311-321.

314. Tian, Z., Levitin, G. and Zuo, M. J. (2009): A joint reliability-redundancy optimization approach for multi-state series-parallel systems. Reliability Engineering and System Safety, Vol. 94, No. 10, pp. 1568-1576.

315. Tokuno, K. and Yamada, S. (1995): Markovian software availability modeling for performance evaluation. In Stochastic Modelling in Innovative Manufacturing: Proceedings, Cambridge, U.K., July 21-22, 1995, (Edited by A. H. Christer, S. Osaki and L. C. Thomas), pp. 246-256, Springer-Verlag, Berlin, (1997).

316. Trivedi, A. K. and Shooman, M. L. (1975): A many state Markov model for the estimation and prediction of computer software performance parameters. Proceedings of the International Conference on Reliable Software, pp. 208-220.

317. Valdes, J. E. and Zequeira, R. I. (2003): On the optimal allocation of an active redundancy in a two-component series system. Statistics & Probability Letters, Vol. 63, No. 3, pp. 325-332.

318. Valdes, J. E. and Zequeira, R. I. (2006): On the optimal allocation of two active redundancies in a two-component series system. Operations Research Letters, Vol. 34, No. 1, pp. 49-52.

319. Valdes, J. E., Arango, G., Zequeira, R. I. and Brito, G. (2010): Some stochastic comparisons in series systems with active redundancy. Statistics & Probability Letters, Vol. 80, No. 11-12, pp. 945-949.

320. Van, P. D., Barros, A. and Berenguer, C. (2008): Reliability importance analysis of Markovian systems at steady state using perturbation analysis. Reliability Engineering and System Safety, Vol. 93, No. 11, pp. 1605-1649.

321. Vanderperre, E. J. (1990): Reliability analysis of a warm standby system with general distributions. Microelectronics & Reliability, Vol. 30, No. 3, pp. 487-490.

322. Vaurio, J. K. (1998): An implicit method for incorporating common cause failure in system analysis. IEEE Transactions on Reliability, Vol. 47, No. 2, pp. 173-180.

323. Vaurio, J. K. (1999): Availability and cost functions for periodically inspected preventively maintained units. Reliability Engineering and System Safety, Vol. 63, pp. 133-140.

324. Vaurio, J. K. (2003): Common cause failure probabilities in standby safety system fault tree analysis with testing-scheme and timing dependencies. Reliability Engineering and System Safety, Vol. 79, pp. 43-57.

325. Vaurio, J. K. (2005): Uncertainties and quantification of common cause failure rates and probabilities for system analyses. Reliability Engineering and System Safety, Vol. 90, pp. 186-195.

326. Velardi, P. and Ciciani, B. (1983): Recovery blocks for communicating systems. Microprocessing and Microprogramming, Vol. 11, No. 5, pp. 287-294.

327. Venkateswaran, N., Siva, M. S. and Goel, P. S. (2002): Analytical redundancy based fault detection of gyroscopes in spacecraft applications. Acta Astronautica, Vol. 50, No. 9, pp. 535-545.

328. Verma, S. M. and Chari, A. A. (1991): Availability and frequency of failure of a system in the presence of chance common-cause shock failures. Microelectronics and Reliability, Vol. 31, No. 2/3, pp. 265-269.


329. Vieira, M. and Madeira, H. (2004): Joint evaluation of recovery and performance of a COTS DBMS in the presence of operator faults. Performance Evaluation, Vol. 56, pp. 187-212.

330. Vinod, G. V., Santosh, T. V., Saraf, R. K. and Ghosh, A. K. (2008): Integrating safety critical software system in probabilistic safety assessment. Nuclear Engineering and Design, Vol. 238, No. 9, pp. 2392-2399.

331. Wang, C. H. and Sheu, S. H. (2001): The effect of the warranty cost on the imperfect EMQ model with general discrete shift distribution. Production Planning and Control, Vol. 12, No. 6, pp. 621-628.

332. Wang, H. and Pham, H. (1996): A quasi-renewal process and its applications in imperfect maintenance. International Journal of Systems Science, Vol. 27, No. 10, pp. 1055-1062.

333. Wang, K. H. and Chen, Y. J. (2009): Comparative analysis of availability between three systems with general repair times, reboot delay and switching failures. Applied Mathematics and Computation, Vol. 215, No. 1, pp. 384-394.

334. Wang, K. H. and Kuo, C. C. (2000): Cost and probabilistic analysis of series systems and mixed standby components. Applied Mathematical Modelling, Vol. 24, pp. 957-967.

335. Wang, K. H. and Liang, L. W. (2006b): Cost benefit analysis of availability systems with warm standby units and imperfect coverage. Applied Mathematics and Computation, Vol. 172, pp. 1239-1256.

336. Wang, K. H. and Sivazlian, B. D. (1997): Life cycle cost analysis for availability system with parallel components. Computers and Industrial Engineering, Vol. 33, pp. 129-132.

337. Wang, K. H., Lai, Y. J. and Ke, J. B. (2004): Reliability and sensitivity analysis of a system with warm standbys and a repairable service station. International Journal of Operations Research, Vol. 1, No. 1, pp. 61-70.

338. Wang, K. H., Liou, Y. C. and Pearn, W. L. (2005): Cost benefit analysis of series systems with warm standby components and general repair times. Mathematical Methods of Operations Research, Vol. 61, pp. 329-343.

339. Wang, K., Dong, W. and Ke, J. (2006a): Comparison of reliability and the availability between four systems with warm standby components and standby switching failures. Applied Mathematics and Computation, Vol. 183, pp. 1310-1322.

340. Wang, L. and Cui, L. (2011): Aggregated semi-Markov repairable systems with history-dependent up and down states. Mathematical and Computer Modelling, Vol. 53, pp. 883-895.

341. Wang, L., Hu, H., Wang, Y., Wu, W. and He, P. (2011a): The availability model and parameters estimation method for the delay time model with imperfect maintenance at inspection. Applied Mathematical Modelling, Vol. 35, No. 6, pp. 2855-2863.

342. Wang, Z., Lam, J., Ma, L., Bo, Y. and Guo, Z. (2011b): Variance-constrained dissipative observer-based control for a class of nonlinear stochastic systems with degraded measurements. Journal of Mathematical Analysis and Applications, Vol. 377, No. 2, pp. 645-658.

343. Wattanapongsakorn, N. and Coit, D. W. (2007): Fault-tolerant embedded system design and optimization considering reliability estimation uncertainty. Reliability Engineering and System Safety, Vol. 92, No. 4, pp. 395-407.

344. Wen, P. and Li, Y. (2009): Minimum packet drop sequences based networked control system model with embedded Markov chain. Simulation Modelling Practice and Theory, Vol. 17, pp. 1635-1641.

345. Whittaker, J. A. and Poore, J. H. (1993): Markov analysis of software specification. ACM Transactions on Software Engineering and Methodology, Vol. 2, pp. 93-106.

346. Whittaker, J. A. and Thomason, M. G. (1994): A Markov chain model for statistical software testing. IEEE Transactions on Software Engineering, Vol. 20, No. 10, pp. 812-824.

347. Whittaker, J. A., Rekab, K. and Thomason, M. G. (2000): A Markov chain model for predicting the reliability of multi-build software. Information and Software Technology, Vol. 42, pp. 889-894.

348. Wie, X., Yiguang, H. and Trivedi, K. S. (2005): Analysis of a two-level software rejuvenation policy. Reliability Engineering and System Safety, Vol. 87, No. 1, pp. 13-22.

349. Wu, C. C., Chou, C. Y. and Huang, C. (2009): Optimal price, warranty length and production rate for free replacement policy in the static demand market. Omega, Vol. 37, No. 1, pp. 29-39.

350. Wu, J., Fernandez, E. B. and Zhang, M. (1996): Design and modeling of hybrid fault-tolerant software with cost constraints. Journal of Systems and Software, Vol. 35, No. 2, pp. 141-149.

351. Wu, J., Wang, Y. and Fernandez, E. B. (1994): A uniform approach to software and hardware fault tolerance. Journal of Systems and Software, Vol. 26, pp. 117-127.

352. Xang, B. and Xie, M. (2000): A study of operational and testing reliability in software analysis. Reliability Engineering and System Safety, Vol. 70, No. 32, pp. 323-329.

353. Xie, W., Hong, Y. and Trivedi, K. S. (2005): Analysis of a two-level software rejuvenation policy. Reliability Engineering and System Safety, Vol. 87, No. 1, pp. 13-22.

354. Xing, L. (2007): Reliability evaluation of phased-mission systems with imperfect fault coverage and common-cause failures. IEEE Transactions on Reliability, Vol. 56, pp. 58-68.

355. Xing, L. and Dugan, J. B. (2002): Generalized imperfect coverage phased-mission analysis. In: Proceedings of Annual Reliability and Maintainability Symposium, Virginia Univ. Charlottesville, VA, pp. 112-119.

356. Xing, L., Meshkat, L. and Donohue, S. K. (2007): Reliability analysis of hierarchical computer based systems subject to common cause failures. Reliability Engineering and System Safety, Vol. 92, No. 3, pp. 351-359.

357. Xing, L., Shrestha, A. and Dai, Y. (2011): Exact combinatorial reliability analysis of dynamic systems with sequence-dependent failures. Reliability Engineering and System Safety, Vol. 96, No. 10, pp. 1375-1385.

358. Xu, H., Guo, W., Yu, J. and Zhu, G. (2005): Asymptotic stability of a repairable system with imperfect switching mechanism. International Journal of Mathematics and Mathematical Sciences, Vol. 4, pp. 631-643.

359. Yadavalli, V. S. S., Botha, M. and Bekker, A. (2002): Asymptotic confidence limits for the steady state availability of a two unit parallel system with preparation time for the repair facility. Asia-Pacific Journal of Operational Research, Vol. 19, pp. 249-256.

360. Yadavalli, V. S. S., Bekker, A. and Pauw, J. (2005): Bayesian study of a two-component system with common cause shock failures. Asia-Pacific Journal of Operational Research, Vol. 22, No. 1, pp. 105-119.

361. Yamachi, H., Tsujimura, Y., Kambayashi, Y. and Yamamoto, H. (2006): Multi-objective genetic algorithm for solving N-version program design problem. Reliability Engineering and System Safety, Vol. 91, No. 9, pp. 1083-1094.

362. Yamada, S. and Ohtera, H. (1990): Software reliability growth models for testing-effort control. European Journal of Operational Research, Vol. 46, pp. 343-349.

363. Yamada, S., Hishitani, J. and Osaki, S. (1993): Software reliability growth model with Weibull testing effort: A model and application. IEEE Transactions on Reliability, Vol. 42, pp. 100-105.

364. Yamashiro, M. (1982): A repairable multistate system with several degraded states and common-cause failures. Microelectronics and Reliability, Vol. 22, No. 3, pp. 615-618.

365. Yanez, M., Joglar, F. and Modarres, M. (2002): Generalized renewal process for analysis of repairable systems with limited failure experience. Reliability Engineering and System Safety, Vol. 77, pp. 167-180.

366. Yang, B. and Xie, M. (2000): A study of operational and testing reliability in software reliability analysis. Reliability Engineering and System Safety, Vol. 70, pp. 323-329.

367. Yang, B., Hu, H. and Guo, S. (2009): Cost-oriented task allocation and hardware redundancy policies in heterogeneous distributed computing systems considering software reliability. Computers & Industrial Engineering, Vol. 56, No. 4, pp. 1687-1696.

368. Yang, B., Li, X., Xie, M. and Tan, F. (2010): A generic data-driven software reliability model with model mining technique. Reliability Engineering & System Safety, Vol. 95, No. 6, pp. 671-678.

369. Yang, L. and Meng, X. Y. (2011): Reliability analysis of a warm standby repairable system with priority in use. Applied Mathematical Modelling, Vol. 35, No. 9, pp. 4295-4303.

370. Yearout, R. D., Reddy, P. and Grosh, D. L. (1986): Standby redundancy in reliability-A review. IEEE Transactions on Reliability, Vol. R-35, pp. 285-292.

371. Yinghui, T. and Jing, Z. (2008): New model for load-sharing k-out-of-n: G systems with different components. Journal of Systems Engineering and Electronics, Vol. 19, No. 4, pp. 748-751.

372. Yinong, C. and Chen, T. (1992): Implementing fault-tolerance via modular redundancy with comparison. Microelectronics and Reliability, Vol. 32, No. 1-2, pp. 287-288.

373. Yu, H., Chu, C., Chatelet, E. and Yalaoui, F. (2007): Reliability optimization of a redundant system with failure dependencies. Reliability Engineering and System Safety, Vol. 92, No. 12, pp. 1627-1634.

374. Yuan, L. and Xu, J. (2011): An optimal replacement policy for a repairable system based on its repairman having vacations. Reliability Engineering and System Safety, Vol. 96, No. 7, pp. 868-875.

375. Yun, W. Y., Murthy, D. N. P. and Jack, N. (2008): Warranty servicing with imperfect repair. International Journal of Production Economics, Vol. 111, pp. 159-169.

376. Yun, W. Y. and Cha, J. H. (2010): Optimal design of a general warm standby system. Reliability Engineering & System Safety, Vol. 95, No. 8, pp. 880-886.

377. Zalewski, J., Ehrenberger, W., Saglietti, F., Gorski, J. and Kornecki, A. (2003): Safety of computer control systems: challenges and results in software development. Annual Reviews in Control, Vol. 27, No. 1, pp. 23-37.

378. Zhang, M. and Qin, W. (2008): Parametric Analysis of an improved fault tolerant system. Electronic Notes in Theoretical Computer Science, Vol. 207, No. 10, pp. 121-136.

379. Zhang, T. and Horigome, M. (2001): Availability and reliability of system with dependent components and time-varying failure and repair rates. IEEE Transactions on Reliability, Vol. 50, pp. 151-158.

380. Zhang, T., Xie, M., and Horigome, M. (2006): Availability and reliability of k-out-of-(M+N): G warm standby systems. Reliability Engineering and System Safety, Vol. 91, No. 4, pp. 381-387.

381. Zhang, Y. L. and Wang, G. J. (2007): A deteriorating cold standby repairable system with priority in use. European Journal of Operational Research, Vol. 183, No. 1, pp. 278-295.

382. Zhang, Y. L. and Wu, S. (2009): Reliability analysis for a k/n (F) system with repairable repair-equipment. Applied Mathematical Modelling, Vol. 33, No. 7, pp. 3052-3067.

383. Zheng, F., Zhu, G. and Gao, C. (2011): Well-posedness and stability of the repairable system with N failure modes and one standby unit. Journal of Mathematical Analysis and Applications, Vol. 375, No. 1, pp. 174-184.

384. Zhou, Z., Li, Y. and Tang, K. (2009): Dynamic pricing and warranty policies for products with fixed lifetime. European Journal of Operational Research, Vol. 196, No. 3, pp. 940-948.

385. Zhu, Y., Elsayed, E. A., Liao, H. and Chan, L. Y. (2010): Availability optimization of systems subject to competing risk. European Journal of Operational Research, Vol. 202, No. 3, pp. 781-788.

386. Zhuang, W. J. and Xie, M. (1994): Design and analysis of some fault-tolerance configurations based on a multipath principle. Journal of Systems and Software, Vol. 25, No. 1, pp. 101-108.

387. Zio, E. (2009): Reliability engineering: Old problems and new challenges. Reliability Engineering and System Safety, Vol. 94, No. 2, pp. 125-141.
