Thesis on
Software and Hardware Reliability of Fault Tolerant Systems
Submitted for the award of
DOCTOR OF PHILOSOPHY
Degree in Mathematics
By
Sulekha Rani
UNDER THE SUPERVISION OF
Dr. Madhu Jain, I.I.T. Roorkee, Roorkee
Prof. S. C. Agrawal, Shobhit University, Meerut
SHOBHIT INSTITUTE OF ENGINEERING & TECHNOLOGY
A DEEMED-TO-BE UNIVERSITY
MODIPURAM, MEERUT-250110 (INDIA)
2011
Certificate
This is to certify that the thesis entitled “Software and Hardware Reliability of
Fault Tolerant Systems” which is being submitted by Ms Sulekha Rani for the degree
of Doctor of Philosophy in Mathematics to School of Basic and Applied Sciences,
Shobhit University, Meerut, a deemed-to-be-University, established by GOI u/s 3 of
UGC Act 1956, is a record of bonafide investigations and extensions of the problems
carried out by her under my supervision and guidance.
To the best of my knowledge, the thesis embodies the work of the candidate
herself and has not been submitted to any other University or Institution for the award of
any degree or diploma.
It is further certified that she worked with me for the required period in the School
of Basic and Applied Sciences, Shobhit University, Meerut.
Date : Prof. S. C. Agrawal (Internal Supervisor)
Prof. S.C. Agrawal M.A., Ph.D.
Director School of Basic and Applied Sciences
Shobhit University, Meerut
Indian Institute of Technology Roorkee, Roorkee
Dr. Madhu Jain M.Phil.(Gold Med.), Ph. D, D.Sc. Department of Mathematics, I.I.T, Roorkee (India)
Office: (01332) 285521 Resi: (01332) 285506 Mob: 09412811021
Certificate
This is to certify that the thesis entitled “Software and Hardware Reliability of
Fault Tolerant Systems” which is being submitted by Ms Sulekha Rani for the degree
of Doctor of Philosophy in Mathematics to School of Basic and Applied Sciences,
Shobhit University, Meerut, a deemed-to-be-University, established by GOI u/s 3 of
UGC Act 1956, is a record of bonafide investigations and extensions of the problems
carried out by her under my supervision and guidance.
To the best of my knowledge, the matter embodied in this thesis is the original
work of the candidate and has not been submitted for the award of any other degree or
diploma.
It is further certified that she worked with me for the required period in the
Institute of Basic Science, Khandari, Agra. She continued to take my guidance at I. I. T.
Roorkee too.
Date : Dr. Madhu Jain (External Supervisor)
Declaration
I, hereby, declare that the work presented in this thesis entitled “Software and
Hardware Reliability of Fault Tolerant Systems” for the award of degree of Doctor of
Philosophy, submitted to School of Basic and Applied Sciences, Shobhit University,
Meerut, a deemed-to-be-University, established by GOI u/s 3 of UGC Act 1956, is an
authentic record of my own research work carried out under the supervision of Prof. S.
C. Agrawal and Dr. Madhu Jain.
I also declare that the work embodied in the present thesis
(i) is my original work and has not been copied from any Journal/Thesis/Book, and
(ii) has not been submitted by me for any other degree or diploma.
(Sulekha Rani)
Acknowledgements
First and foremost, I would like to express my heartfelt gratitude to Dr. Madhu
Jain, my external guide, for giving me the opportunity to complete my doctoral
program under her scholarly and able guidance. She has set a benchmark for me
not only as a mentor but also as a person in society, through her selfless service of
imparting knowledge in a distinguished manner. I gratefully thank Prof. S. C.
Agrawal, Director of the School of Basic and Applied Sciences, Shobhit University,
Meerut, my internal guide, not only for providing me the opportunity to carry out
research for the degree of Ph. D. in Mathematics but also for encouraging me towards
new and innovative ideas. He has always been a driving force behind my research activities.
I earnestly wish to express my deepest gratitude to Prof. G. C. Sharma,
former Pro-Vice Chancellor, Dr. B. R. A. University, Agra, who has always been willing
to provide advice, academic support and help whenever I needed it during the period of
my research work. I attribute most of my success to his timely suggestions
throughout the years I spent carrying out my research work at the Institute of Basic
Science, Agra.
I am thankful to the Chancellor Dr. Shobhit Kumar, Pro Chancellor Kunwar
Shekhar Vijendra and Vice-Chancellor Prof. R. P. Agarwal of Shobhit University for
providing a congenial environment for conducting research in the University. I will
always be thankful to my revered teachers for their affection, guidance, valuable
inputs and blessings bestowed upon me throughout the journey of my academic career.
It is difficult to thank adequately my best friend Dr. Priyanka Agarwal, who like a
guiding beacon showed me the path towards achieving my academic milestone while she
was my roommate at both Agra and Roorkee. She stood by me, walked with me hand in
hand, and played a vital role not only in advising me from time to time but also in
the compilation and completion of my research work. I am thankful to my senior
colleagues Ritu Gupta and Shweta Upadhahayay for their help and encouragement
during the entire processing of this work. Their discussions have always had a
substantial and motivating impact on me. I would like to extend my sincere thanks to
all fellow research workers who helped me in the completion of my thesis. They are
named alphabetically as Mrs Anuradha, Mr Naresh, Mrs Preeti, Mrs Ragni, Mr Ram
Singh, Mrs Richa, Mrs Sapna, Mr Satya Prakash, Mr Vivek and many others.
I am heartily thankful to Amma Ji (Mrs Kanak Lata Jain) for her blessings and
love bestowed upon me during the period of my research work at Agra and IIT
Roorkee. I extend my heartiest thanks to my parents Shri Hari Singh and Smt. Roshni
Devi for sowing the seedling of education, nurturing the plant of learning and being
there by my side throughout my academic career. Despite all the hardships in their
lives, they made their best efforts and gave me the support to become capable of
achieving my academic and professional goals. I can proudly say that whatever I am is
the result of their hard work, morals, values and blessings. I am highly obliged to my elder brother
Mr. Bhanu and my bhabhi Mrs. Usha for their heartiest love, attention, helpful attitude
and affection. I always received considerable encouragement and constant support
from them.
I also thank my other friends who have directly or indirectly contributed to my
research endeavor and without whom life at work and outside work would have been a
lot less enjoyable.
I would like to acknowledge the valuable contributions of the librarians of IIT
Delhi, IIT Roorkee and Delhi University and their staff, who helped me obtain
the information and data for the thesis whenever required.
I thank the Almighty God, who has faithfully called, guided and brought me to
the expected end of this work. To Him I bow humbly.
(Sulekha Rani)
Research Papers Published/Accepted/ Communicated For Publication in Refereed Journals/Proceedings
1. Jain, M., Agrawal, S. C. and Rani, S. (2010): Reliability Analysis of Distributed
Software and Hardware System, International Journal of Information and
Computer Science (IJICS), 13, pp. 1-11.
2. Jain, M., Agrawal, S. C. and Rani, S. (2010): Reliability Modeling of Hardware
and Software Interactions, Journal, Mathematics Today, 26, pp. 46-57.
3. Jain, M., Agrawal, S. C. and Rani, S. (2010): Reliability Modeling of Hardware
and Software Interactions, Advances in Information Theory and Operations
Research, Om Parkash (Ed.), VDM Verlag, Germany, pp. 230-240.
4. Jain, M., Agrawal, S. C. and Rani, S. (2010): Availability Analysis of K-out-of-
N:G System with two types of Failure and Common Cause Failure, Accepted for
the publication in Journal of International Academy of Physical Sciences (In
Press).
5. Jain, M. and Rani, S. (2010): Reliability Modeling of Software Fault
Tolerance in A Clustered Architecture, International Journal of Information
and Computer Science (IJICS), 13 (In Press).
6. Jain, M. and Rani, S. (2011): Availability Analysis of Imperfect Fault Coverage
System with Reboot and Common Cause Failure, Electronic Proceedings
International Conference on Advances in Modeling, Optimization and
Computing (AMOC-2011), IIT Roorkee, 5-7 Dec. 2011, ISBN 81-86224-71-2,
pp. 433-442.
7. Jain, M., Agrawal, S. C. and Rani, S. (2011): Software Reliability Growth Model
with N-version Programming, Testing-Effort and Imperfect Debugging,
Accepted for the publication in Transactions of Physical and Life Sciences (In
Press).
8. Jain, M. and Rani, S. (2011): Availability Analysis for Repairable System with
Warm Standby, Switching Failure and Reboot Delay, Revised for publication in
International Journal of Mathematics in Operations Research.
9. Jain, M. and Rani, S. (2011): Transient Analysis of Hardware and Software
Systems with Warm Standbys and Switching Failures, Revised for publication in
International Journal of Mathematics in Operations Research.
10. Jain, M., Agrawal, S. C. and Rani, S. (2011): Quasi Renewal Processes for
Software and Hardware Systems with Common Cause Failure, Communicated
for publication in CSI Journal.
11. Jain, M. and Rani, S. (2011): Availability of Hardware-Software System with
Standbys, Switching Failure and Reboot Delay, Communicated for publication in
International Journal of Engineering.
12. Jain, M. and Rani, S. (2011): Availability Analysis of Imperfect Fault Coverage
System with Reboot and Common Cause Failure, Communicated for publication
in International Journal of Engineering.
Participation in the Conferences/Workshops/Seminars
1. Attended International Conference on ‘Soft Computing For Problem Solving
(SocProS-2011)’ held at IIT Roorkee, Roorkee, December 20-22, 2011.
2. International Conference on ‘Advances in Modeling, Optimization and Computing
(AMOC-2011)’ held at IIT Roorkee, Roorkee, December 5-7, 2011. Presented paper
entitled “Availability Analysis of Imperfect Fault Coverage System with Reboot and
Common Cause Failure”, Abstract published in Souvenir, pp. 91.
3. Annual Conference of ‘Vijnana Parishad of India & The Global Society of
Mathematical and Allied Science’ held at School of Basic and Applied Sciences,
Shobhit University, Meerut, March 24-26, 2011. Presented paper entitled
“Software Reliability Growth Model with N-version Programming, Testing-Effort
and Imperfect Debugging”, Abstract published in Souvenir, pp. 12.
4. 11th International Conference of ‘International Academy of Physical Sciences’ held
at Institute of Interdisciplinary Studies, University of Allahabad, Allahabad,
February 20-22, 2010. Presented paper entitled “Availability Analysis of K-out-of-
N: G system with Two Types of Failure and Common Cause Failure”, Abstract
published in Souvenir, pp. 25.
5. 2nd National Seminar on ‘Recent Trends in Advancement of Mathematical And
Physical Sciences’ held at D. N. College, Meerut, January 30-31, 2010. Presented
paper entitled “A Study Reliability Analysis of Distributed Software and Hardware
Systems with spares”, Abstract published in Souvenir, pp. Maths 6.
6. Attended Workshop on ‘Mathematical Modeling and Related Optimization
Techniques’ held at University of Delhi, Delhi, December 14-17, 2009.
7. Annual Conference of ‘Vijnana Parishad of India & National Symposium’ held at
Jaypee Institute of Engineering & Technology, Guna, December 4-6, 2009.
Presented paper entitled “Reliability Analysis of Distributed Software and Hardware
Systems”, Abstract published in Souvenir, pp. 53.
8. 11th National Conference of ‘Indian Society of Information Theory and Applications
(ISITA)’ held at Guru Govind Singh Khalsa College, Sarhali (Tarn Taran), Amritsar,
October 24-26, 2009. Presented paper entitled “Reliability Modeling of Hardware
and Software Interactions”, Abstract published in Souvenir, pp. 32.
9. Attended workshop on ‘Optimization Techniques and Their Applications to Various
Disciplines’ held at Guru Govind Singh Khalsa College, Sarhali (Tarn Taran),
Amritsar, October 19-23, 2009.
10. 14th Annual Conference of ‘Gwalior Academy Of Mathematical Sciences (GAMS)
and A Symposium on Computational Mathematics and its Application to
Engineering, Management and Biology’ held in Department of Applied Sciences
and Humanities, IPS College of Technology and Management, Gwalior, July 17-19,
2009. Presented paper entitled “Reliability Modelling of Software Fault Tolerance in
a Clustered Architecture”.
Contents
Preface
List of Tables
List of Figures
Chapter 1: General Introduction
    1.1 Motivation
    1.2 Fault Tolerance
    1.3 Redundancy Issues
    1.4 Reliability Analysis
    1.5 Review of the Literature
    1.6 Organization of the thesis
    1.7 Concluding Remarks
Chapter 2: Fault Tolerance Software in A Clustered Architecture
    2.1 Introduction
    2.2 Reliability Models
    2.3 Model for Software Fault Tolerance
    2.4 The Analysis
    2.5 Numerical Illustration
    2.6 Conclusion
Chapter 3: Reliability Modeling of Hardware and Software Interactions
    3.1 Introduction
    3.2 Model Description
    3.3 Governing Equations and Analysis
    3.4 Numerical Results
    3.5 Conclusion
Chapter 4: Distributed Software and Hardware Systems
    4.1 Introduction
    4.2 Model Description
    4.3 The Equations and Analysis
    4.4 Performance Indices
    4.5 Numerical Results
    4.6 Conclusion
Chapter 5: Transient Analysis of A Hardware-Software System
  Section 5(A): Embedded Computer System with Two Types of Failure and Common Cause Failure
    5A.1 Introduction
    5A.2 Model Description
    5A.3 The Analysis
    5A.4 Illustration
    5A.5 Performance Indices
    5A.6 Numerical Results
    5A.7 Conclusion
  Section 5(B): Hardware and Software Systems with Warm Standbys and Switching Failures
    5B.1 Introduction
    5B.2 Model Description
    5B.3 Governing Equations
    5B.4 Special Case
    5B.5 Performance Measures
    5B.6 Numerical Results
    5B.7 Conclusion
Chapter 6: Repairable Redundant System with Reboot Delay
  Section 6(A): Availability Analysis of Hardware-Software System with Switching Failure
    6A.1 Introduction
    6A.2 Model Description and Governing Equations
    6A.3 Availability Prediction
    6A.4 Performance Measures
    6A.5 Sensitivity Analysis
    6A.6 Conclusion
  Section 6(B): Availability Analysis of Repairable System with Warm Standby and Switching Failure
    6B.1 Introduction
    6B.2 Model Description
    6B.3 Availability Prediction for Three Configuration
    6B.4 Transient Solution
    6B.5 Numerical Results
    6B.6 Conclusion
Chapter 7: Warranty Policy for Hardware and Software with Common Cause Failure
    7.1 Introduction
    7.2 Model Description
    7.3 The Analysis
    7.4 Warranty Policy with Repair
    7.5 Numerical Results
    7.6 Conclusion
Chapter 8: Semi-Markov Models with Common Cause Failure
  Section 8A: Redundant System with Rejuvenation
    8A.1 Introduction
    8A.2 Model Description
    8A.3 Semi-Markov Analysis
    8A.4 Performance Measures
    8A.5 Total Expected Downtime Cost
    8A.6 Conclusion
  Section 8B: Imperfect Fault Coverage System with Reboot
    8B.1 Introduction
    8B.2 Model Description
    8B.3 The Steady State Availability
    8B.4 Special Cases
    8B.5 Numerical Results
    8B.6 Concluding Remarks
Chapter 9: Software Reliability Growth Model (SRGM) with N-version Programming
    9.1 Introduction
    9.2 NHPP Model
    9.3 Mean Value Function
    9.4 Reliability Estimation
    9.5 Parameter Estimation
    9.6 Total Expected Cost of Software
    9.7 Numerical Results
    9.8 Concluding Remarks
Future Scope
References
Preface
System managers must pay increasing attention to the interrelated fields of
reliability and fault tolerance, without compromising quality, if their organizations
are to survive the competitive pressures of the twenty-first century. In today’s
technological world, nearly everyone depends on the continued functioning of a wide
array of complex machinery and equipment for everyday safety, security, mobility
and economic welfare. We expect electric appliances, lights, hospital monitoring and
control systems, next-generation aircraft, nuclear power plants, data exchange systems
and aerospace applications to function well whenever we need them. When they fail,
the results can be catastrophic, including injury or even loss of life. Reliability and
availability of fault tolerant systems have become major design issues in massively
parallel distributed computing systems. Fault tolerance, in which redundant subsystems
or components are provided to safeguard against failure and thereby improve
reliability, has been used for many decades. Examples of systems that need fault
tolerance include mission-critical systems, transaction systems such as banking,
computation-intensive systems, mobile/wireless computing systems and many more. High
performance in terms of speed and computing power is also a major design objective
for such systems.
In this thesis, our aim is to develop models that cover both fundamental and
theoretical work in the areas of reliability and fault tolerant systems, including system
redundancy, multi-state systems, optimization, component reliability, system reliability,
warranty and availability. An important feature of our investigation is a set of reliability
assessments that can be used to ensure the desired reliability of fault tolerant systems.
The thesis stresses mathematical models for evaluating reliability/availability and
shows how these models can be applied to the development of reliable software systems.
The reliability of fault tolerant systems with various hardware and software
components is studied analytically as well as numerically in the different chapters. The
sensitivity of the results to different parameters is examined by presenting numerical
results in tables and graphs. Chapter 1 deals with the fundamental concepts involved in
the hardware and software components for the reliability of fault tolerant systems. Fault
tolerant systems, redundancy, reliability methodology, availability and their applications
in system characterization are described, along with some historical background and
recent developments on the subject. The outline and concluding remarks
of the thesis are given at the end of the chapter. The reliability of fault tolerant
systems is the main theme of chapter 2 of the thesis. The systems are characterized by
having a number of operating components and levels based on some assumptions. In
chapter 3, we discuss the interactions of software and hardware failures and formulate
a model for reliability analysis that accounts for software failure and for hardware
failure due to common cause failure. Chapter 4 is devoted to a distributed system with
operating components, standbys and common cause failure, and evaluates the reliability
of the system.
Transient analysis of a hardware-software system is studied in chapter 5. This
chapter is divided into two sections. In section 5(A), we analyze the Markov model for a
K-out-of-N:G system which has N non-identical components and Y warm standby
components with common cause failure. In section 5(B), we analyze a system having
operating components as well as warm standbys under the assumption of switching
failure. Chapter 6 focuses on a repairable system with reboot delay. This chapter is
arranged into two parts. In section 6(A), we study a repairable system with M hardware
and N software components. In section 6(B), we consider the redundant system by
incorporating the setup time. A warranty policy for a hardware and software system is
proposed in chapter 7.
Chapter 8 deals with semi-Markov models of redundant systems with
common cause failure. This chapter is divided into two sections. In Section 8A, we
determine the availability of a redundant system with common cause failure and
rejuvenation by using an embedded Markov chain approach. Section 8B is devoted to the
analysis of an imperfect fault coverage system with reboot delay and recovery using the
supplementary variable technique. Chapter 9 is concerned with a software reliability
growth model with N-version programming, testing effort and imperfect debugging.
We hope that the investigations made in the present study will help system
designers and practitioners to deal with the various hardware and software
components involved in reliable/fault tolerant systems. Decision makers and
industrial engineers may be able to improve the quality of service on the basis of our
findings. In addition, the present study should benefit society in tackling the
reliability and availability issues of fault tolerant systems.
(Sulekha Rani)
List of Tables
Table No. Title
1.1 Comparison of Design diversity techniques
2.1 Probabilities of Software fault tolerance for different values of λ for data set I
2.2 Probabilities of Software fault tolerance for different values of λP for data set II
4.1(a) Performance indices for different values of λP
4.1(b) Performance indices for different values of μS
4.1(c) Performance indices for different values of α
5A.1(a) Performance indices for different values of (μ, μh) and (μC, μS)
5A.1(b) System Availability for different values of (λh, λ′h)
5B.1(a) Performance indices for different values of λh
5B.1(b) Performance indices for different values of λS
5B.1(c) Performance indices for different values of α
5B.1(d) Performance indices for different values of μh
5B.1(e) Performance indices for different values of μS
5B.1(f) Performance indices for different values of q
6A.1(a) Performance indices for different values of λd
6A.1(b) Performance indices for different values of λS
6A.1(c) Performance indices for different values of β
6A.1(d) Performance indices for different values of α
6B.1(a) Comparison of availability of three configuration for different values of α
6B.1(b) Comparison of availability of three configuration for different values of β
6B.1(c) Comparison of availability of three configuration for different values of q
8B.1(a) Effects of parameters C, μ and β on the availability for different distribution of repair time
8B.1(b) Effects of parameters θ, μ and β on the availability for different distribution of repair time
List of Figures
Fig. No. Title
1.1 The transition of fault, error and failure in a software life cycle
1.2 Recovery block
1.3 N-version programming
1.4 N-self checking programming
1.5 Triple modular redundancy
1.6 N-modular redundancy
1.7 Reliability of N component system in series
1.8 Reliability of N component system in parallel
1.9 Reliability of N component system in series-parallel
1.10 Reliability of N component system in parallel-series
2.1 State transition diagram for software fault tolerance
2.2 Probabilities of software fault tolerance
2.3(a) Profiles of reliability by varying λ for data set I
2.3(b) Profiles of reliability by varying λ for data set II
2.4(a) Profiles of reliability by varying λP for data set I
2.4(b) Profiles of reliability by varying λP for data set II
2.5(a) Profiles of reliability by varying c for data set I
2.5(b) Profiles of reliability by varying c for data set II
3.1 State transition diagram
3.2(a) Availability (A) vs λ by varying μm
3.2(b) Availability (A) vs μS by varying λ
3.2(c) Availability (A) vs λ by varying P1
3.2(d) Availability (A) vs λ by varying P2
3.2(c) Availability (A) vs λ by varying P′1
3.2(d) Availability (A) vs λ by varying P′2
4.1 State transition diagram
4.2(a) Reliability vs time by varying λP
4.2(b) Reliability vs time by varying μS
4.2(c) Reliability vs time by varying α
5A.1 State transition diagram
5A.2(a) Reliability vs time by varying λ
5A.2(b) Reliability vs time by varying λ′
5A.2(c) Reliability vs time by varying λS
5A.2(d) Reliability vs time by varying λC
5B.1 State transition diagram
5B.2(a) Reliability vs time by varying λh
5B.2(b) Reliability vs time by varying μh
5B.2(c) Reliability vs time by varying λS
5B.2(d) Reliability vs time by varying μS
5B.2(e) Reliability vs time by varying α
5B.2(f) Reliability vs time by varying q
6A.1 State transition diagram
6A.2(a) Availability vs λh by varying q
6A.2(b) Availability vs λh by varying β
6A.2(c) Availability vs λh by varying α
6A.2(d) Membership functions for input parameter λh
6A.3(a) Availability vs μS by varying q
6A.3(b) Availability vs μS by varying β
6A.3(c) Availability vs μS by varying α
6A.3(d) Membership functions for input parameter μS
6B.1 State transition diagram for model 1
6B.2 State transition diagram for model 2
6B.3 State transition diagram for model 3
6B.4 Availability vs time for model 1 by varying (a) μ1 (b) μ2 (c) η1 (d) η2
6B.5 Availability vs time for model 2 by varying (a) μ1 (b) μ2 (c) η1 (d) η2
6B.6(a) Availability vs time by varying μ1 for model 3
6B.6(b) Availability vs time by varying η1 for model 3
6B.6(c) Availability vs time by varying μ2 for model 3
6B.6(d) Availability vs time by varying η2 for model 3
6B.6(e) Availability vs time by varying μ3 for model 3
6B.6(f) Availability vs time by varying η3 for model 3
7.1 Expected cost vs warranty period by varying (i) β1h (ii) βS and (iii) λC
7.2 Standard deviation vs warranty period by varying (i) β1h (ii) βS and (iii) λC
7.2 Coefficient of variation vs warranty period by varying (i) β1h (ii) βS and (iii) λC
8A.1 Redundant system without rejuvenation
8A.2 Redundant system with rejuvenation
8A.3 Failed software rejuvenation model
8B.1 State transition diagram for two unit system
8B.2 Availability vs λ by varying (i) C (ii) β (iii) θ for exponential distribution repair time
8B.3 Availability vs λ by varying (i) C (ii) β (iii) θ for uniform distribution repair time
8B.4 For gamma distributed repair time, the availability vs λ by varying (i) C (ii) β (iii) θ (iv) λS
8B.5 For gamma distributed repair time, the availability vs (i) λ (ii) λS (iii) β (iv) µ by varying r
9.1(i) Mean time vs time by varying a1
9.1(ii) Mean time vs time by varying β11
9.2(i) Expected cost vs time by varying a1
9.2(i) Expected cost vs time by varying a2
9.3(i) Reliability vs time by varying a1
9.3(ii) Reliability vs time by varying b11
9.4 Membership functions for input parameter T
Chapter-1: General Introduction
1.1 Motivation
1.2 Fault Tolerance
1.3 Redundancy Issues
1.4 Reliability Analysis
1.5 Review of the Literature
1.6 Outline of the Thesis
1.7 Concluding Remarks
1.1 Motivation
Computer hardware and software systems have, directly or indirectly, a
significant impact on human life. Nowadays, new-generation embedded computers
are demanded in all kinds of engineering systems such as communication, production,
atomic power plants, aircraft and automobiles. Embedded computer systems are
becoming complex in both design and architecture. An embedded computer
system can be used in real-time systems, which may consist of hundreds or even
thousands of interacting software and hardware components. Hardware products can
experience intermittent failures when processing the same input because components
wear out. Software products may appear to exhibit the same intermittent
failure characteristics, but their behavior is deterministic. With the advancement of
technology, software has become an essential part of various systems such as
air traffic control, fighter aircraft, space shuttles, automated guided missiles,
telecommunications, process control in nuclear plants and factories, defense systems
and many more. Even in day-to-day life, many gadgets and automobiles
are software controlled. There are many well-known cases of tragic
consequences due to software failures. That is why a very high degree of reliability
is necessary in many popular software packages. To avoid failures and
faults, the reliability of the hardware and software needs to be predicted during the
design and development phases.
Fault-tolerant systems play a major role in process control,
transportation, electronic equipment, space, communications and many other areas
that affect our lives. Owing to increasing dependency and demand, the size and
complexity of computer hardware and software systems have grown enormously. In
many systems, fault tolerance is achieved by applying a set of analysis and design
techniques to produce systems with dramatically improved dependability. In the early
days of fault-tolerant computing, it was possible to evaluate specific hardware and
software outcomes. The chips used in embedded systems contain complex,
highly integrated functions, and hardware and software must be able to cope
with a variety of standards subject to techno-economic constraints.
Fault tolerant systems are often demanded for critical applications where loss
of human life or damage to property is of great concern. The reliability of fault
tolerant software presents special difficulties, since all the errors present in the
software are design errors that cannot be accurately modeled by the traditional models
used for hardware. In any system, fault tolerance is achieved through redundancy in one
way or another. Redundancy is a common approach to improving the reliability and
availability of a system. Adding redundancy increases the cost and complexity of a
system design and, given the high reliability of modern electrical and mechanical
components, many applications do not need redundancy in order to be successful.
However, if the cost of failure is high enough, redundancy may be an attractive
option.
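As a brief illustration of the redundancy trade-off described above, the following Python sketch computes how reliability grows as redundant components are added. It assumes independent, identical components with a common reliability r, which is a simplification introduced here for illustration and not a model taken from the thesis; the function names and the value r = 0.9 are hypothetical.

```python
from math import comb


def parallel_reliability(r: float, n: int) -> float:
    """Reliability of n independent, identical components in parallel.

    The parallel system fails only if all n components fail.
    """
    return 1.0 - (1.0 - r) ** n


def k_out_of_n_reliability(r: float, k: int, n: int) -> float:
    """Probability that at least k of n independent, identical components work."""
    return sum(comb(n, i) * r**i * (1.0 - r) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    r = 0.9  # hypothetical reliability of a single component
    for n in (1, 2, 3):
        print(f"{n} parallel unit(s): {parallel_reliability(r, n):.6f}")
    # Triple modular redundancy behaves as a 2-out-of-3 system
    # (a perfect voter is assumed here).
    print(f"2-out-of-3: {k_out_of_n_reliability(r, 2, 3):.6f}")
```

Under these assumptions, each added unit reduces the failure probability by a factor of (1 - r), so the gain per unit shrinks while the cost grows roughly linearly, which is why redundancy is attractive mainly when the cost of failure is high.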
Hardware reliability requirements provide an impetus to achieve high safety
margins in mechanical stresses, reduced variability, and increased tolerances for input
impedance, breakdown voltage, etc. The cause of failure in hardware has usually been
physical deterioration, a manufacturing defect or poor quality of materials. Computer
hardware failures could be due to failure of the central processing unit, power supply,
display terminal, printer, cooling system or memory disk, or simply to faulty
operation. For many functions of computer systems, the design process is a very
important part of hardware reliability. Sometimes a hardware component may need to
be replaced, which is why redundant components (standbys) are often kept on hand.
There may be some downtime if a part is not readily available, but the part itself
does not require a corrective process that adds to the downtime. Hardware reliability
may change during certain periods, such as at initial use or at the end of useful life.
For given input and initial conditions, software will always produce the
same results. The software reliability models of the 1970s were usually directed at
single-unit software systems and were called software reliability growth models (SRGMs);
later, models were developed to address multi-component or modular
systems. Software reliability is generally accepted as the key factor in software
quality since it quantifies software failures.
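To make the idea of a software reliability growth model concrete, the sketch below evaluates the classical Goel-Okumoto NHPP model, in which the expected number of faults detected by time t is m(t) = a(1 - exp(-bt)). This is only one well-known SRGM, chosen here for illustration; it is not claimed to be the model developed in this thesis, and the parameter values a = 100 and b = 0.05 are hypothetical.

```python
from math import exp


def mean_value(t: float, a: float, b: float) -> float:
    """Goel-Okumoto mean value function m(t) = a * (1 - exp(-b * t)).

    a: expected total number of faults, b: fault detection rate per unit time.
    """
    return a * (1.0 - exp(-b * t))


def conditional_reliability(x: float, t: float, a: float, b: float) -> float:
    """Probability of no failure in (t, t + x] for an NHPP with mean value m."""
    return exp(-(mean_value(t + x, a, b) - mean_value(t, a, b)))


if __name__ == "__main__":
    a, b = 100.0, 0.05  # hypothetical parameter values
    for t in (10.0, 50.0, 100.0):
        print(
            f"t={t:5.1f}  expected faults found={mean_value(t, a, b):7.2f}  "
            f"R(1 | t)={conditional_reliability(1.0, t, a, b):.4f}"
        )
```

The reliability over a fixed mission time x increases with the testing time t, which is the "growth" that the term SRGM refers to.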
The combined hardware and software reliability models have been developed
by many researchers. Such developed models require observing the implementation
carefully for the next few years, at least. In literature, a few models deal with
Chapter-1: General Introduction
4
combined software and hardware reliability. Hardware and software reliability
engineering involve many concepts with unique terminology and many mathematical
and statistical expressions. Reliability modeling of a combined hardware and software
system is in many ways analogous to reliability modeling of a purely hardware system.
Individual hardware platforms and the software assigned to those platforms are
independent of other hardware/software platforms.
It is worthwhile to discuss some concepts that are used specifically in
the present thesis for modeling fault tolerant systems. The objective of
the present thesis is to develop various models for fault-tolerant
systems in order to predict their reliability/availability. Our prime goal in the present
chapter is to provide a brief account of the quantitative approaches used in the
development of models for the performance evaluation of fault-tolerant
systems based on reliability theory.
In the present chapter, we provide an overview of hardware and software
reliability issues of fault tolerant systems. The issues related to our research topic are
briefly reviewed, and the basic concepts and novel features of the
work carried out in the present thesis are highlighted. The remainder of this chapter is
organized as follows. Section 1.2 details various aspects of fault tolerant systems,
whereas section 1.3 is devoted to redundancy issues. Reliability analysis is discussed in
section 1.4. In section 1.5, we review various reliability models related to our
work and developed by prominent researchers in different frameworks. The outline of
the thesis is presented in section 1.6. Finally, the novel features and future scope of the
work are highlighted in section 1.7.
1.2 Fault Tolerance
Fault tolerance is the ability of a system to continue correct performance of
its intended tasks after the occurrence of hardware and software faults. Fault tolerant
system research covers a wide spectrum of applications, namely embedded real-time
systems, commercial transaction systems, transportation systems, military/space
systems, distribution and service systems, etc. A fault tolerance approach
improves the efficiency and performance of a system.
During the last few decades many researchers have contributed to the development of
fault tolerance techniques that explore the performance of computer systems which
are prone to software or hardware failures. Implementing fault tolerance
in separate layers of an embedded system helps in managing the complexity of the
derived architectural solutions. Integrating provisions for coping with both hardware
and software faults can reduce the overlap among fault tolerance techniques. The
reliability of fault tolerant software presents special difficulties since all the errors
present are design errors, which cannot be accurately modeled by the traditional models
used for hardware. To tackle fault tolerance related issues, the following terms
are commonly used:
Faults- A fault, sometimes called a bug, is the identified or hypothesized cause of a
software failure. Software faults can be classified as (i) design faults and (ii)
operational faults according to the phase in which they are created.
Design faults- A design fault is a fault introduced during the software design and
development process. Design faults can be corrected with fault removal
approaches by revising the design documentation and source code.
Operational faults- Such faults occur during the lifetime of the system and are
invariably due to physical causes, such as processor failures or disk crashes, or to
software operation affected by timing, race conditions, workload-related stress and
other environmental conditions.
Error- An error is the part of the system state which is liable to lead to a failure. It is
an intermediate stage between faults and failures. Software faults are most often
design faults, which occur when a designer either misunderstands a
specification or simply makes a mistake.
Failure- A failure mode is an identifiable weakness in the system design and
manufacture. Failures can be classified into severity classes, e.g. critical, major and
minor. A failure occurs when the user perceives that a software program is unable to
deliver the expected service.
A fault-tolerant system may be able to tolerate one or more fault-types including:
• Transient, intermittent or permanent hardware faults
• Software and hardware design errors
• Operator errors
• Externally induced upsets or physical damage
Figure 1.1: The transition of fault, error and failure in a software life cycle
The “fault-error-failure” relationship in a software life cycle is depicted in
figure 1.1. Some common approaches may be used to deal with
design faults in software. The basic approaches are as follows:
(a) Fault avoidance (prevention) during the software development process.
(b) Fault tolerance and fault/failure forecasting after the development process.
(c) Fault removal during the software development process.
(a) Fault avoidance- Fault avoidance aims at preventing the introduction of
faults during the development of the software. It includes all the techniques that
examine the software development process, standards, methodologies, etc. Fault
avoidance techniques employed to check the occurrence of faults include quality
control (design review, component screening, testing, etc.) and shielding from
interference (radiation, humidity, heat, etc.).
(b) Fault/failure prediction (forecasting)- Forecasting can play a vital role in
reducing the existence of faults and the occurrences and consequences of failures. For
this purpose dependability-enhancing techniques based on reliability estimation and
reliability prediction are used.
(c) Fault removal- Fault removal is the approach of detecting and eliminating software
faults during the different phases of software development. Reviews,
inspection, testing, verification, validation, etc. are some common techniques used for
this purpose.
Fault tolerance can be divided into software and hardware fault tolerance. It is
worthwhile to look at each of them in turn.
1.2.1 Software Fault Tolerance
A fault tolerant system is supposed to be able to tolerate not only faults in the
system itself but also faults in the application programs. In order to create a fault
tolerant system for a particular application, the fault tolerance demands of the target
application must first be identified. Then, the appropriate fault tolerance methods
must be used in order to meet the overall fault tolerance requirements. Some
performance measures, namely reliability, availability, performability, mean time to
failure, mean time to recovery and performance degradation due to fault tolerance, can
play an important role in the assessment of faults during the software life cycle.
Individual fault tolerance methods must be refined, taking the system functions
into consideration, in order to develop effective fault tolerance schemes.
The key issues in software fault tolerant systems are:
• Component reliability is an important quality measure for system level
analysis. It is established that software reliability is hard to calculate, and the
use of past-verification reliability evaluation is an open problem to be tackled.
• The multi-version techniques are based on the assumption that software built
differently should fail differently. If one of the redundant versions fails, at least
one of the others should provide an acceptable output.
• Probability models provide a formal conceptual structure to tackle the delicate
issues of conditional independence involved in the failure processes of design
diverse systems.
Software architectures, design techniques, static checks, dynamic tests, special
libraries, and run-time routines help software engineers to develop fault tolerant
software. There are two basic techniques for software fault tolerance: (i) recovery
blocks and (ii) N-version programming.
(i) Recovery Blocks
Figure 1.2: Recovery block
The recovery blocks technique (Randell, 1975) combines the basics of the
checkpoint and restart approach with multiple versions of a software component, such
that a different version is tried after an error is detected. A checkpoint is created
before a version executes. If an error is detected, the checkpoint is used to recover the
state after the version fails, which provides a valid starting point for the next
version.
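The recovery block scheme described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the names recovery_block, faulty_sort, alternate_sort and is_sorted_permutation are hypothetical, the checkpoint is modeled as a deep copy of the input state, and an exception raised by a version is treated as a detected error.

```python
import copy


def recovery_block(state, versions, acceptance_test):
    """Try each version in turn; roll back to the checkpoint after a failure."""
    for version in versions:
        # Checkpoint: the version only ever works on a private copy of the state.
        checkpoint = copy.deepcopy(state)
        try:
            result = version(checkpoint)
        except Exception:
            continue  # assumption: an exception counts as a detected error
        if acceptance_test(result):
            return result
        # Acceptance test failed: discard the copy and try the next version.
    raise RuntimeError("all versions failed the acceptance test")


# Hypothetical versions of a sorting routine.
def faulty_sort(data):
    return data[:-1]  # design fault: silently drops the last element


def alternate_sort(data):
    return sorted(data)


def is_sorted_permutation(original):
    """Build an acceptance test for outputs computed from `original`."""
    def test(result):
        return list(result) == sorted(original)
    return test


data = [3, 1, 2]
output = recovery_block(data, [faulty_sort, alternate_sort],
                        is_sorted_permutation(data))
print(output)
```

In this sketch the primary version fails the acceptance test, so the alternate is executed from the unchanged original state, which corresponds to the backward recovery listed for recovery blocks in Table 1.1.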
(ii) N-Version Programming
Figure 1.3: N-version programming
The N-version concept attempts to tackle software and hardware fault
tolerance using N-way redundant components. In an N-version software
system, each module is implemented in up to N different versions. Each version
performs the same task but in a different way. The N-version programming (NVP)
approach to software fault tolerance is based on the design diversity conjecture. NVP,
proposed by Avizienis and Chen (1977), considers the execution of N functionally
equivalent software modules (called versions) that receive the same input and send
their outputs to a voter. The voter produces an output if at least M out of N outputs
agree. Otherwise, the system fails. Generally, majority voting is used, in which N is
odd and M = (N+1)/2.
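The M-out-of-N voting rule can be illustrated with a short Python sketch. The function names nvp_vote and run_nvp and the three example versions are hypothetical; the sketch assumes that version outputs are hashable and can be compared for exact equality, which real voters often relax (e.g. by tolerance-based comparison).

```python
from collections import Counter


def nvp_vote(outputs, m=None):
    """Return the value produced by at least m of the versions, else fail."""
    n = len(outputs)
    if m is None:
        m = n // 2 + 1  # majority; equals (N+1)/2 when N is odd
    value, count = Counter(outputs).most_common(1)[0]
    if count >= m:
        return value
    raise RuntimeError("fewer than M outputs agree: system failure")


def run_nvp(versions, x, m=None):
    """Feed the same input to every version and vote on the outputs."""
    return nvp_vote([version(x) for version in versions], m)


# Three hypothetical versions of a squaring routine; the third is faulty.
versions = [lambda x: x * x, lambda x: x ** 2, lambda x: x * x + 1]
result = run_nvp(versions, 4)
print(result)
```

With N = 3 and M = 2 the single faulty version is outvoted; if all three outputs differed, the voter would raise an error, which corresponds to the system failure case in the text.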
Single and Multiple Version Software Techniques
Single-version techniques add to a single software module a number of
functional capabilities that are unnecessary in a fault-free environment. These
techniques are based on redundancy applied to a single version of the software in order
to detect and recover from faults. Single-version software fault tolerance techniques
involve considerations of program structure and actions, error detection, exception
handling, checkpoint and restart, process pairs, and data diversity (cf. Lyu, 1995).
The multi-version fault-tolerant software technique is also called the design
diversity approach. It is based on the use of two or more versions of a piece of
software, executed either in sequence or in parallel. The fundamental reasoning for the
use of multiple versions is the expectation that components created differently (i.e.,
by different designers, with different algorithms, different design tools, etc.) should
fail differently. Therefore, if one version fails in a particular situation, there
is a good chance that at least one of the alternate versions is able to provide a suitable
output. The versions can be applied as alternatives (with separate means of error
detection), in pairs (to implement detection by replication checks) or in larger groups
(to enable masking through voting).
N Self-Checking Programming
N self-checking programming (cf. Laprie, 1987, 90, 95) is the use of multiple
software versions combined with structural variations of the recovery blocks and
N-version programming. N self-checking programming using acceptance tests is shown
in Figure 1.4.
Figure 1.4: N-self checking programming
Here the versions and the acceptance tests are developed independently from
common requirements. The use of separate acceptance tests for each version is the
main difference of this N self-checking model from the recovery blocks approach.
Comparison between Recovery Block, N-Version Programming and N- Self Checking Programming
Each design diversity method, namely recovery block, N-version programming and N
self-checking programming, has its own advantages and disadvantages compared with
the others. A comparison of the features of the three is summarized in
Table 1.1. In the design and implementation phase, specific fault-tolerant
techniques are employed in developing reliable software systems for either
single-version or multiple-version software.
Table 1.1: Comparison of design diversity techniques

Features                  Recovery Block      N-version Programming   N-self-checking Programming
Minimum No. of Versions   2                   3                       4
Output Mechanism          Acceptance Test     Decision Algorithm      Decision Algorithm and Acceptance Test
Execution Time            Primary Version     Slowest Version         Slowest Pair
Recovery Scheme           Backward Recovery   Forward Recovery        Forward and Backward Recovery

1.2.2 Hardware Fault Tolerance
The majority of fault tolerant designs have been directed toward
developing computers that automatically recover from random faults occurring in
hardware components. Hardware fault tolerance includes triple modular redundancy,
duplication with comparison, standby sparing, watchdog timers, self-purging
redundancy and many other techniques, which are actively being researched. Such
hardware methods typically have the advantage of speedily detecting and removing
faults as they occur. The techniques employed for fault tolerant design involve
partitioning a computing system into modules that act as fault containment areas.
Each module is backed up with protective redundancy so that, if the module fails,
others can take over its function.
The working stage has the following three types of hardware faults:
(i) Transient faults- Transient faults are temporary faults that are caused by
external events or by the environment (Somani and Vaidya, 1997). For example,
such faults arise from energetic particles striking the chip or from electrical surges.
Although these faults do not cause permanent damage, they may result in incorrect
program execution by inadvertently altering processors’ states, signal transfer, or
stored values in registers, etc.
(ii) Intermittent Faults- Intermittent faults enter the system, stay active for a very
short duration and then disappear, only to return again. Such faults are
encountered, for example, in heat-sensitive components, which can produce
intermittent faults through their phases of heating and cooling.
(iii) Permanent Faults- Permanent faults are completely repeatable and always cause
an associated failure.
Some key concepts in the area of hardware fault tolerance are described in the
next section, which is devoted to redundancy issues.
1.3 Redundancy Issues
An extra critical component provided with the intention of increasing the
reliability of the system is called a redundant
component. In any system, fault tolerance is achieved through redundancy in one way
or another. Redundancy in software consists of the introduction of extra
elements such as instructions, parts of programs, programs, etc., to ensure that a
failure can be tolerated by the substitution or masking of the faulty element. Fault
masking is a structural redundancy technique that completely masks faults within a set
of redundant modules. A number of identical modules execute the same functions,
and their outputs are voted on to remove errors created by a faulty module. This
technique is used to prevent faults from introducing errors, e.g. error correcting codes,
majority voting, etc.
There are two types of the redundancies possible:
(I) Space Redundancy and (II) Time Redundancy
(I) Space Redundancy
Space redundancy provides additional components, functions, or data items
that are unnecessary for a fault-free operation. It is classified into hardware
redundancy, information redundancy and software redundancy.
(a) Hardware Redundancy
The physical replication of hardware is perhaps the most common form of fault
tolerance used in the system. As semiconductor components have become smaller and
less expensive, the concept of hardware redundancy has become more common and
more practical. There are three basic forms of hardware redundancy:
(i) passive, (ii) active and (iii) hybrid redundancy.
(i) Passive Redundancy
Passive redundancy achieves fault tolerance by masking the fault that occurs
without requiring any action on the part of the system or an operator. Passive
techniques, in their most basic form, do not provide for the detection of the faults but
simply mask the faults. In the context of passive redundancy, it is worthwhile to
discuss the concept of triple modular redundancy and N-modular redundancy.
Triple Modular Redundancy
The most common form of passive hardware redundancy is triple modular
redundancy (TMR). In this type of redundancy the components are triplicated and
perform the same computation in parallel. Majority voting is used to determine the
correct result. If one of the modules fails, the majority voter will mask the fault by
recognizing the result of the remaining two fault-free modules as correct.
A TMR system (see fig. 1.5) can mask only one module fault. A failure in either of
the remaining modules would cause the voter to produce an erroneous result. TMR is
usually used in applications where a substantial increase in reliability is required for a
short period.
Fig. 1.5: Triple modular redundancy
N-Modular Redundancy
The N-modular redundancy (NMR) approach is based on the same principle as
TMR, but uses N modules. The number N is usually selected to be odd, to make
majority voting possible. An NMR system (see fig. 1.6) can mask ⌊N/2⌋ module faults.
Fig.1.6: N-modular redundancy
(ii) Active Redundancy
The active redundancy method is also known as the dynamic method. In this
approach, one achieves fault tolerance by first detecting the faults which occur and
then performing the actions needed to restore the system to the operational state.
Active hardware redundancy uses fault detection, fault location, and fault recovery in
an attempt to achieve fault tolerance.
(iii) Hybrid Redundancy
Hybrid techniques combine the features of both the passive and active
approaches. Fault masking is used in hybrid system to prevent erroneous results from
being generated. Fault detection, fault location and fault recovery are also used in the
hybrid approaches to improve the fault tolerance by removing faulty hardware and
replacing it with spares. Spares provisioning is one form of providing redundancy in a
system. Hybrid methods are most often used in the critical-computation applications
where fault masking is required to check momentary errors and to achieve high
reliability.
The basic techniques for hybrid redundancy include:
• Self-Purging Redundancy
• N-modular Redundancy with Spares
• Triplex-Double Redundancy
(b) Information Redundancy
Information redundancy is simply the addition of redundant information to
data to allow fault detection, fault masking or possibly fault tolerance. Examples of
information redundancy are error detecting and error correcting codes, formed by the
addition of redundant information to data words or by the mapping of data words into
new representations containing redundant information.
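As a simple illustration of an error detecting code, a single even-parity bit appended to a data word can be sketched in Python as follows. The function names add_even_parity and parity_ok are hypothetical, and a single parity bit is chosen here only because it is the simplest such code; it detects any odd number of bit errors but cannot locate or correct them.

```python
def add_even_parity(bits):
    """Append one redundant bit so that the total number of 1s is even."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """Check the redundant information: an odd number of 1s signals an error."""
    return sum(word) % 2 == 0


word = add_even_parity([1, 0, 1, 1])   # data word plus parity bit
corrupted = word.copy()
corrupted[2] ^= 1                      # simulate a single-bit fault

print(word, parity_ok(word))
print(corrupted, parity_ok(corrupted))
```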
(c) Software Redundancy
Software redundancy refers to the use of extra code, small routines or possibly
complete programs, in order to check the correctness or the consistency of the results
produced by given software. The two types of diversity of the software redundancy
techniques are:
(i) Design diversity and (ii) Data diversity
(i) Design Diversity
Design diversity provides identical services through separate designs and
implementations. It aims at making the modules as diverse and independent as
possible.
(ii) Data diversity
The data diverse techniques are meant to complement, rather than replace,
design diverse techniques. Data diversity consists of obtaining a related set of
points in the program data space, executing the same software on those points, and
then using a decision algorithm to determine the resulting output.
(II) Time Redundancy
Time redundancy techniques attempt to reduce the amount of extra hardware (and
hence the extra weight, size, power consumption, etc.) at the expense of additional
time. In some applications, extra time is of less importance
than extra hardware. The basic concept of time redundancy is the repetition of
computations in ways that allow faults to be detected. If the computation is performed
twice and a transient fault has occurred, then the stored copy will differ
from the re-computed result, so the fault will be detected. If the computation is
performed three or more times, a fault can be corrected.
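The detection and correction behaviour of time redundancy can be sketched in Python as follows. The names run_with_time_redundancy and make_flaky are hypothetical, and the transient fault is simulated as a bit flip on one chosen call; the sketch is meant only to show why two repetitions suffice for detection whereas three allow correction by majority.

```python
from collections import Counter


def run_with_time_redundancy(compute, x, repeats=3):
    """Repeat a computation and compare the stored results."""
    results = [compute(x) for _ in range(repeats)]
    value, count = Counter(results).most_common(1)[0]
    if count == repeats:
        return value, "ok"            # all repetitions agree
    if count > repeats // 2:
        return value, "corrected"     # a strict majority outvotes the fault
    return None, "detected"           # disagreement without a majority


def make_flaky(fault_on_call):
    """Hypothetical computation x + 1 with a transient bit flip on one call."""
    calls = {"n": 0}

    def compute(x):
        calls["n"] += 1
        result = x + 1
        if calls["n"] == fault_on_call:
            result ^= 0x4             # simulated transient fault
        return result

    return compute


twice = run_with_time_redundancy(make_flaky(2), 10, repeats=2)
thrice = run_with_time_redundancy(make_flaky(2), 10, repeats=3)
print(twice, thrice)
```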
Standby Redundancy
Standby redundancy is one common form of active hardware redundancy for
achieving fault tolerance. A standby system consists of a primary module and one or
more modules that serve as standby spares. Standby components are classified as
hot, cold or warm standbys. A hot standby has all the
spares operating in synchrony with the on-line primary components and ready to
take over when the primary components experience any fault. A cold standby has its
spares unpowered until they are needed to replace a faulty component. Warm standbys
are initially powered up and have a reduced failure rate; they become subject to the
regular full failure rate once they replace a faulty primary component.
1.4 Reliability Analysis
From a qualitative point of view, reliability can be defined as the ability of an
item to remain functional. Quantitatively, reliability specifies the probability that no
operational interruptions will occur during a stated time interval. “The term reliability
is used as a reliability characteristic denoting a probability of success or a success
ratio.” Reliability is also defined as “the ability of an item to perform a required
function, under stated conditions, for a stated period of time.”
The commonly used structures of the system are the (i) series, (ii) parallel and
(iii) series-parallel/parallel-series configuration.
(i) Reliability for Series Configuration:
Figure 1.7: Reliability of N component system in series
The series system is best thought of as a system that contains no redundancy; that
is, it is a non-redundant system in which all the equipment is connected in
series. If any one piece of equipment fails, the series system fails. Let $R_i(t)$
(i = 1, 2, ..., N) be the reliability of the ith component at time t. The reliability of
the N-component series system at time t is given by
$$R_S(t) = \prod_{i=1}^{N} R_i(t)$$ …(1.1)
If the ith (i = 1, 2, ..., N) component has an exponentially distributed lifetime with
constant failure rate $\lambda_i$, then the reliability of this series system is given by
$$R_S(t) = e^{-\lambda_1 t}\, e^{-\lambda_2 t} \cdots e^{-\lambda_N t} = e^{-\sum_{i=1}^{N} \lambda_i t}$$ …(1.2)
(ii) Reliability for Parallel Configuration:
In parallel configuration of a redundant system all components of the system
are arranged in parallel form (see fig 1.8).
Figure 1.8: Reliability of a N component system in parallel
The reliability of the parallel system is given by
$$R_P(t) = 1 - \prod_{i=1}^{N} \left[1 - R_i(t)\right]$$ …(1.3)
(iii) Reliability for Series-Parallel/ Parallel-Series Configuration: In a
series-parallel/parallel-series configuration, the components of the system are
arranged in both series and parallel form.
The reliability of a series-parallel system (see fig. 1.9) consisting of N series stages,
each with $N_i$ parallel components, is given by
$$R_{SP}(t) = \prod_{i=1}^{N} \left[1 - \left(1 - R_i(t)\right)^{N_i}\right]$$ …(1.4)
Figure 1.9: System reliability in series-parallel configuration
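The reliability of a parallel-series system (see fig. 1.10) consisting of N parallel
paths is given by
$$R_{PS}(t) = 1 - \prod_{i=1}^{N} \left[1 - R_{S_i}(t)\right]$$ …(1.5)
where $R_{S_i}(t)$ denotes the reliability of the ith series path, obtained from (1.1).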
Figure 1.10: System reliability in parallel-series configuration
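Equations (1.1) to (1.5) can be evaluated numerically with the following Python sketch. The function names and the numerical component reliabilities are hypothetical and are chosen only for illustration; independent component failures are assumed, as in the formulas themselves.

```python
import math


def series(rel):
    """Eq. (1.1): product of the component reliabilities."""
    return math.prod(rel)


def parallel(rel):
    """Eq. (1.3): one minus the probability that every component fails."""
    return 1.0 - math.prod(1.0 - r for r in rel)


def series_parallel(stage_rel, stage_sizes):
    """Eq. (1.4): stage i has stage_sizes[i] parallel units of reliability stage_rel[i]."""
    return math.prod(1.0 - (1.0 - r) ** n for r, n in zip(stage_rel, stage_sizes))


def parallel_series(paths):
    """Eq. (1.5): parallel combination of series paths (each a list of reliabilities)."""
    return parallel([series(path) for path in paths])


# Eq. (1.2): exponential components with hypothetical failure rates, at t = 100.
rates, t = [0.001, 0.002, 0.003], 100.0
r_series_exp = series([math.exp(-lam * t) for lam in rates])

print(series([0.9, 0.9]), parallel([0.9, 0.9]))
print(series_parallel([0.9, 0.8], [2, 3]))
print(parallel_series([[0.9, 0.9], [0.9, 0.9]]))
print(r_series_exp, math.exp(-sum(rates) * t))
```

The last line prints the two sides of equation (1.2), which agree up to rounding.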
1.4.1 Reliability Indices
Reliability- Reliability is the probability that an item will not fail by a given time t,
under a given set of operating conditions. The probability of failure by a given time t
is referred to as the unreliability of the item. Mathematically, the unreliability is
represented by
$$U(t) = \int_{0}^{t} f(t)\,dt, \quad t \ge 0$$
where U(t) is the unreliability of the system and f(t) is the probability density function
of failure.
Then the reliability R(t) of an item at time t is
$$R(t) = 1 - U(t)$$ …(1.6)
Failure Rate- The probability of a system failure in a given time interval [t1, t2] can
be expressed in terms of the reliability function as
$$\int_{t_1}^{t_2} f(t)\,dt = \int_{t_1}^{\infty} f(t)\,dt - \int_{t_2}^{\infty} f(t)\,dt = R(t_1) - R(t_2)$$
or, in terms of the failure distribution function F(t), as
$$\int_{t_1}^{t_2} f(t)\,dt = \int_{-\infty}^{t_2} f(t)\,dt - \int_{-\infty}^{t_1} f(t)\,dt = F(t_2) - F(t_1)$$
The rate at which failures occur in a certain time interval [t1, t2] is called the failure
rate. Thus the failure rate is obtained using
$$\frac{R(t_1) - R(t_2)}{(t_2 - t_1)\,R(t_1)}$$
If we consider the interval as $[t, t + \Delta t]$, then the failure rate is given by
$$\frac{R(t) - R(t + \Delta t)}{\Delta t\,R(t)}$$ …(1.7)
The rate in the above definitions is expressed as failure per unit time.
Hazard Function- The hazard function h(t) is defined as the limit of the failure rate as
the interval approaches zero. The instantaneous failure rate h(t) is given as
$$h(t) = \lim_{\Delta t \to 0} \frac{R(t) - R(t + \Delta t)}{\Delta t\,R(t)} = \frac{1}{R(t)}\left[-\frac{d}{dt}R(t)\right] = \frac{f(t)}{R(t)}$$ …(1.8)
Mean time to failure (MTTF)- Suppose that the reliability function for a system is
given by R(t). The expected time during which a component performs
successfully, or the mean time to failure, is given by
$$MTTF = \int_{0}^{\infty} t\,f(t)\,dt$$ …(1.9)
Since $f(t) = -\frac{d}{dt}\left[R(t)\right]$, equation (1.9) yields
$$MTTF = -\int_{0}^{\infty} t\,dR(t) = \left[-t\,R(t)\right]_{0}^{\infty} + \int_{0}^{\infty} R(t)\,dt$$ …(1.10)
Since $t\,R(t) \to 0$ as $t \to \infty$ for a lifetime with finite mean, the first term in
equation (1.10) vanishes and we obtain
$$MTTF = \int_{0}^{\infty} R(t)\,dt$$ …(1.11)
Thus, MTTF is the integral of the reliability function over $[0, \infty)$.
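Equations (1.7), (1.8) and (1.11) can be checked numerically with a short Python sketch. The function names, the failure rate 0.02 and the truncation of the infinite integral at a finite upper limit are assumptions made only for this illustration; the exponential reliability function is used because its hazard and MTTF are known in closed form.

```python
import math


def mttf_from_reliability(R, upper, steps=100_000):
    """Trapezoidal approximation of eq. (1.11), truncated at `upper`."""
    h = upper / steps
    total = 0.5 * (R(0.0) + R(upper))
    for k in range(1, steps):
        total += R(k * h)
    return total * h


def hazard(R, t, dt=1e-6):
    """Finite-difference form of eq. (1.7), approximating h(t) of eq. (1.8)."""
    return (R(t) - R(t + dt)) / (dt * R(t))


lam = 0.02                                # hypothetical constant failure rate


def R(t):
    return math.exp(-lam * t)             # exponential reliability function


mttf = mttf_from_reliability(R, upper=2000.0)
print(mttf, 1.0 / lam)                    # numerical MTTF against 1/lambda
print(hazard(R, 10.0), lam)               # numerical hazard against lambda
```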
Mean time to repair (MTTR)- An important measure often used in maintenance
studies is the mean time to repair. MTTR is the expected value of the random variable
repair time, not failure time, and is given by
$$MTTR = \int_{0}^{\infty} t\,r(t)\,dt$$
where r(t) is the repair density function. If the repair time is exponentially distributed,
i.e. $r(t) = \mu e^{-\mu t}$, then the MTTR is given as
$$MTTR = \frac{1}{\mu}$$ ...(1.12)
Mean time between failures (MTBF)- This is a basic measure of reliability for
repairable items. The term MTBF implies that the system, after failing, is subject to
repair. Now
MTBF=MTTF+MTTR ...(1.13)
The MTTR is usually a small fraction of the MTBF, so the approximation that the
MTBF and MTTF are equal is often quite good.
Reliability indices for a system having an exponential lifetime distribution:
The one parameter exponential reliability function is given by
$$R(T) = \exp(-\lambda T) = \exp\left(-\frac{T}{m}\right)$$ ...(1.14)
where $\lambda$ is the failure rate and $m = 1/\lambda$.
The mean time to failure (MTTF) of the one parameter exponential distribution is
given by
$$MTTF = \int_{0}^{\infty} t\,f(t)\,dt = \int_{0}^{\infty} t\,\lambda \exp(-\lambda t)\,dt = \frac{1}{\lambda}$$ …(1.15)
Failure Rate Function- The exponential failure rate function is given by
$$\lambda(T) = \frac{f(T)}{R(T)} = \frac{\lambda e^{-\lambda T}}{e^{-\lambda T}} = \lambda = \text{Constant}$$ …(1.16)
1.4.2 Failure Issues
In practice, different types of failures can significantly reduce the reliability of
systems. Common cause failure, degraded failure, switching and failure detection
are the failure issues of most concern in active systems. It is worthwhile to discuss
these concepts as they are included in many models developed in the thesis.
Common Cause Failure- In this type of failure, all the units in the system fail due to
the same cause. Examples of such failures occur in computer systems, wherein
humidity, fire, power outage, etc. can cause the failure of several components
simultaneously. When one device serves several functional units, its failure prevents
each of the individual units from functioning. Examples of such a situation are high
voltage in the case of electric/electronic devices and high pressure in the case of
hydraulic pumps. Many researchers have considered the concept of common cause
failure in their studies on hardware/software reliability systems.
Degraded Failure- With usage and the passage of time, all systems are subject to
degradation. This degradation can result in high production costs and inferior product
quality. So, to keep production costs low and maintain the pre-specified quality of the
product, the provision of a repair facility and standby support is recommended, which
also ensures the smooth and long-run functioning of the system. The growing
importance of maintenance has generated an increasing interest in the development of
redundant repairable operating systems.
Switching and failure detection- The failure detection mechanism detects the failed
units and the switching mechanism replaces them; it also sometimes brings a unit into
the main system for operation. When switching is perfect, its effect can be neglected,
but imperfect switching influences the system performance.
1.4.3 System Configurations
The system configuration is among the most useful models for calculating the
reliability of systems. The K-out-of-N:G configuration is a very popular
type of redundancy in fault tolerant systems, and there are many studies in this area.
They may be classified into the following
main groups: K-out-of-N:G configuration, K-out-of-N:F configuration,
consecutive K-out-of-N configuration, M-to-K-out-of-N configuration and weighted
K-out-of-N configuration.
K-out-of-N:G Configuration- The K-out-of-N configuration has received great attention
from both practitioners and researchers. It is a widely
adopted structure for partially redundant safety systems. K-out-of-N:G configurations
are often encountered in industrial settings, namely the electronics industry,
telecommunication network systems, power generation and transmission systems,
avionics, etc. In a K-out-of-N:G configuration, the system operates only if at least K
of the given N units are good. The aggregate reliability of a K-out-of-N:G system in
which all the units fail independently and have the same reliability R(t) is given by:
$$R_{K\text{-out-of-}N}(t) = \sum_{i=K}^{N} \binom{N}{i} \left[R(t)\right]^{i} \left[1 - R(t)\right]^{N-i}$$
The 1-out-of-N:G configuration is a parallel system whereas the N-out-of-N:G
configuration is a series system.
K-out-of-N:F configuration- A K-out-of-N:F configuration fails when at least K
of the available N units fail. It is obvious that the K-out-of-N:G and
(N-K+1)-out-of-N:F configurations are equivalent.
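The K-out-of-N:G reliability and its equivalence with the (N-K+1)-out-of-N:F configuration can be illustrated with the following Python sketch. The function names and the unit reliability 0.9 are hypothetical; identical and independent units are assumed.

```python
from math import comb


def k_out_of_n_g(k, n, r):
    """Probability that at least k of n identical, independent units work."""
    return sum(comb(n, i) * r ** i * (1 - r) ** (n - i) for i in range(k, n + 1))


def k_out_of_n_f(k, n, r):
    """Reliability of a system that fails when at least k of n units fail.

    Fewer than k failures means at least n - k + 1 working units.
    """
    return k_out_of_n_g(n - k + 1, n, r)


r = 0.9
print(k_out_of_n_g(1, 3, r))   # parallel system: 1 - (1 - r)^3
print(k_out_of_n_g(3, 3, r))   # series system: r^3
print(k_out_of_n_g(2, 3, r))   # 2-out-of-3 (as in TMR): 3r^2 - 2r^3
print(k_out_of_n_f(2, 3, r))   # equivalent to 2-out-of-3:G
```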
Consecutive K-out-of-N configuration- Consecutive K-out-of-N configuration is the
system configuration in which the units of the system are connected in such a way
that only the failure of K consecutive units out of N units causes system failure.
M-to-K-out-of-N configuration- The M-to-K-out-of-N configuration is a non-coherent
system in which no fewer than M and no more than K out of N units must function
for the successful operation of the configuration.
Weighted K-out-of-N configuration- A weighted K-out-of-N configuration is a
system configuration consisting of N units, each of which is
associated with a positive integer value called its weight. The system fails if the total
weight of the working units is less than K. Mathematically, in a weighted K-out-of-N
system, component i carries a weight $w_i$, $w_i > 0$ for i = 1, 2, ..., N, such that
$w = \sum_{i=1}^{N} w_i$, where w is the total weight of all the components. Thus, the
K-out-of-N:G configuration can be seen as a special case of the weighted K-out-of-N:G
configuration wherein each component has a weight of 1.
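The reliability of a weighted K-out-of-N:G system can be computed, for instance, by building the probability distribution of the total working weight one component at a time. The following Python sketch is one possible illustration and is not taken from the thesis; the function name weighted_k_out_of_n_g and the numerical values are hypothetical, and independent components with integer weights are assumed.

```python
def weighted_k_out_of_n_g(k, weights, rels):
    """Probability that the total weight of the working components is at least k."""
    total = sum(weights)
    # dist[w] = probability that the components processed so far
    # contribute a working weight of exactly w.
    dist = [1.0] + [0.0] * total
    for w_i, r_i in zip(weights, rels):
        new = [0.0] * (total + 1)
        for w, p in enumerate(dist):
            if p == 0.0:
                continue
            new[w] += p * (1.0 - r_i)    # component i has failed
            new[w + w_i] += p * r_i      # component i is working
        dist = new
    return sum(dist[k:])


# Unit weights reduce to the ordinary 2-out-of-3:G system.
print(weighted_k_out_of_n_g(2, [1, 1, 1], [0.9, 0.9, 0.9]))
# One heavy component of weight 2 and two light components of weight 1.
print(weighted_k_out_of_n_g(2, [2, 1, 1], [0.9, 0.8, 0.8]))
```

In the second example the system works if the heavy component works (probability 0.9) or if it fails and both light components work (0.1 × 0.64), giving 0.964.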
1.4.4 System Availability
The availability function of a system, denoted by A(t), is defined as the
probability that the system is available at time t. Availability differs from
reliability, which focuses on a period of time during which the system is free of
failures; availability concerns a time point at which the system is not in the failed
state. Some commonly used measures of availability are as follows:
Instantaneous or Point Availability- The instantaneous or point availability is the
probability that a system will be operational at any given time t.
The system is operational at time t in one of two ways: either it operates properly
from 0 to t, with probability R(t), or it operates properly since the last repair
(renewal) at time u, 0 < u < t, with probability $\int_0^t R(t-u)\,m(u)\,du$.
The point availability is the sum of these two probabilities and is obtained as:
$$A(t) = R(t) + \int_0^t R(t-u)\,m(u)\,du \qquad (1.17)$$
where m(u) is the renewal density function of the system.
Transient Availability- The transient availability is given as
$$A(t) = P\{\text{system is working at time } t\} \qquad (1.18)$$
Average Availability- The average availability, $A_{AVE}(t)$, over an interval [0, t] is
obtained using
$$A_{AVE}(t) = \frac{1}{t}\int_0^t A(u)\,du \qquad (1.19)$$
Steady State Availability- The availability function, which is a complex function
of time, has a simple steady-state or asymptotic expression. The steady state
availability is given by
$$A = \lim_{t \to \infty} A(t) = \frac{MTTF}{MTTF + MTTR} \qquad (1.20)$$
where MTTF and MTTR stand for mean time to failure and mean time to repair
respectively and A(t) is the transient availability at time t.
The average availability measures the fraction of time that the system is operational
over the interval of interest. One should not confuse the average availability with the
point availability.
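The difference between the point, average and steady state availability can be seen in the simplest repairable model. The sketch below assumes a single unit with exponentially distributed time to failure (rate lam) and time to repair (rate mu), starting in the working state; these distributions and all names are assumptions made for illustration and are not taken from the text.

```python
import math


def point_availability(t: float, lam: float, mu: float) -> float:
    """Point availability A(t) of a two-state (up/down) Markov model.

    Assumes exponential failure rate lam, exponential repair rate mu, and a
    unit that is working at t = 0.
    """
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)


def average_availability(t: float, lam: float, mu: float) -> float:
    """Average availability over [0, t]: (1/t) times the integral of A(u) du."""
    if t == 0.0:
        return 1.0
    s = lam + mu
    return mu / s + lam / (s * s * t) * (1.0 - math.exp(-s * t))


def steady_state_availability(mttf: float, mttr: float) -> float:
    """Steady state availability MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


if __name__ == "__main__":
    lam, mu = 0.01, 0.1  # hypothetical rates per hour (MTTF = 100 h, MTTR = 10 h)
    for t in (0.0, 10.0, 100.0, 1000.0):
        print(t, point_availability(t, lam, mu), average_availability(t, lam, mu))
    print(steady_state_availability(1.0 / lam, 1.0 / mu))
```

In this model the point availability decreases from 1 towards its limit, so the average availability over [0, t] stays above the point availability at t, and both approach the steady state value for large t.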
1.5 Review of the Literature
In this section, we present a brief survey of the literature related to our research topic, the
software and hardware reliability of fault tolerant systems. We trace the historical
development of research in the areas of fault tolerant systems, N-version
programming, recovery blocks, redundancy, reliability analysis, K-out-of-N: G
configurations, common cause failure, availability analysis, switching failure, degraded
failure and Markov models for software reliability growth. This section gives an
overview of past and recently developed software/hardware reliability models
and the underlying mathematical concepts that profoundly influenced the work
contained in Chapters 2-9 of the thesis.
1.5.1 Fault tolerant systems
The need for a unified method for tolerating both hardware
and software faults has been identified in the recent past, and various works in this
direction have already appeared in the literature (cf. Laprie et al., 1990; Dugan and
Lyu, 1995). The system structure for software fault tolerance was analyzed by Kant
(1987), Lala and Alger (1988) and Belli and Jedrzeiowicz (1990). Carpenter (1990),
Bobbio (1990) and Wu et al. (1994) carried out dependability analyses of
software/hardware fault-tolerant systems. Leu et al. (1991) modeled fault-tolerant
software reliability using Petri nets. McAllister and Scott (1991),
Bondavalli et al. (1993) and Wu et al. (1996) calculated cost while modeling
fault-tolerant software. Kanoun et al. (1993), Tai et al. (1993) and Chiaradonna et
al. (1994) discussed the reliability growth of fault-tolerant software. Laplante (1993)
introduced the single-event-upset phenomenon, described software techniques and hardware
selection considerations to combat its effects, and discussed
the application of these techniques in fault-tolerant systems. Zhuang and Xie
(1994) analyzed some fault-tolerance configurations based on a multipath principle.
Giandomenico et al. (1995) proposed a uniform approach to software and hardware
fault tolerance. Huang and Kintala (1995) studied architectural issues in software
fault tolerance.
Cheng et al. (2000) considered the problem of guaranteeing reliability
requirements with bounded recovery times on fail-stop processors in fault-tolerant
multiprocessor real-time systems. They classified tasks based on their recovery-time
requirements into (i) hard recovery, (ii) soft recovery, and (iii) best-effort recovery
tasks. Bondavalli et al. (2002) and Littlewood et al. (2002) used an adaptive
approach to achieve reliability and hardware-software fault tolerance in a distributed
computing environment. Rehage et al. (2005) considered the redundancy
management of fault tolerant aircraft system architectures. Saha (2006a) developed
a software tool for fault tolerance. The reliability and performance analysis of
hardware-software systems with fault-tolerant software components was presented
by Levitin (2006). Wattanapingskorn and Coit (2007) discussed fault-tolerant
embedded system design and optimization, taking the uncertainty of reliability
estimates into account. Dabney et al. (2008) considered a fault tolerant approach to test control
utilizing dual-redundant processors. Lim et al. (2008) proposed a fault avoidance
scheme which increases system dependability by avoiding common faults on the
remaining nodes when some of the nodes fail, and analyzed the system dependability.
Leach (2008) and Zhang and Qin (2008) analyzed an improved fault tolerant system
and used checkpoints in legacy code to improve fault tolerance. Khan et al. (2010)
proposed fault tolerance techniques for grid computing systems. Simeu-Abazi et al.
(2011) suggested a methodology for alarm filtering using dynamic fault trees.
1.5.2 N-version Programming
Software/Hardware is a major source of reliability decay in dependable
systems. One of the classical remedies is to provide fault tolerance by using N-version
programming. Knight and Leveson (1986) carried out an experimental evaluation of the
assumption of independence in multi-version programming. Eckhardt and Lee (1988)
discussed the fundamental differences in the reliability of N-modular redundancy and
N-version programming. Pham (1994) and Teng and Pham (2002) proposed
software reliability growth models for the optimal design of N-version software
systems subject to certain constraints. Chatterjee et al. (2004) studied N-version
programming with imperfect debugging. Saha (2006b) proposed a model for a
single-version scheme of fault tolerant computing. Yamachi et al. (2006) described a
multi-objective genetic algorithm for solving the N-version program design problem.
Laval et al. (2011) gave the simultaneous version for software evaluation assessment
and computation time of model queries on large models.
1.5.3 Recovery Block
Compared with other fault tolerance techniques, there are relatively few studies in the
literature on the recovery block, although over the last few decades it has been treated
by a number of researchers as a fault tolerance technique. Velardi and Ciciani (1983) studied recovery blocks for
communication systems. Distributed execution of recovery blocks for uniform
treatment of hardware and software faults in real-time applications was studied by
Rossi and Simone (1984) and Kim and Welch (1989). Nicola and Goyal (1990)
modeled correlated failures and community error recovery in multiversion
software. Randell and Xu (1995) employed the recovery block concept in software
fault tolerance. Al-Saqabi et al. (1996) considered recovery from concurrent failures
in communication protocols. Optimization models for the component based recovery
block technique were proposed by Berman and Kumar (1999) and Abulnaja (2005).
Li et al. (2006) discussed the design of correct and efficient checkpoint and recovery
strategies for distributed agent systems. Lv et al. (2010) gave a block orthogonal greedy
algorithm for stable recovery of block-sparse signal representations. Abujarad and
Kulkarni (2011) explored cases in which the structure of the recovery paths is too
complex to permit existing heuristic-based approaches for adding recovery.
1.5.4 Redundancy Models
A vast literature can be found on various hardware and software redundancy
techniques and on reliability modeling for redundant systems. Kumar et al. (1986)
considered the reliability analysis of a two-unit redundant system with critical human
error. Grosspietsch (1989) proposed schemes of dynamic redundancy for fault
tolerance in random access memories. Fault tolerance via modular
redundancy with comparison was implemented by Yinong and Chen (1992). Venkateswaran
et al. (2002) analyzed redundancy based fault detection of gyroscopes in spacecraft
applications. Valdes and Zequeira (2003, 2006) proposed the optimal allocation of an
active redundancy in a two-component series system. Bueno and Carmo (2007)
defined an active redundancy allocation for a k-out-of-n: F system of dependent
components. Li and Hu (2008) gave some new stochastic comparisons for
redundancy allocations in series and parallel systems. Flammini et al. (2009)
presented a new modeling approach to the safety evaluation of n-modular redundant
computer systems in the presence of imperfect maintenance. Lisnianski et al. (2000)
and Tian et al. (2009) studied the structure optimization of multi-state systems with
time redundancy. Optimal task allocation and hardware redundancy policies in
distributed computing systems were considered by Hsieh (2003), Yang et al. (2009)
and Randles et al. (2011). Valdes et al. (2010) analyzed some stochastic comparisons
in series systems with active redundancy. Smidt-Destombes et al. (2011) studied a
spare parts model with cold-standby redundancy at the system level. Belzunce et al.
(2011) addressed the optimal allocation of redundant components for series and
parallel structures of two dependent components.
1.5.5 Software/Hardware Reliability Models
Hardware/Software plays a key role in modern life. This has increased our
dependence on machines and their reliability. Unreliable hardware/software
can be highly damaging. Most of the models in the literature basically
discuss the calculation of system reliability using the component reliabilities.
Reliability estimates are defined as a function of different user profiles. Each user
profile uses different modules and hence results in a different reliability. Various
researchers have worked on software reliability to present the measurement, prediction and
applications of the systems (cf. Shooman, 1983; Musa et al., 1987 and Elsayed,
1996). In the literature, researchers have developed software reliability models from
different points of view, including theoretical developments and applications (cf.
Shooman, 1990; Malaiya and Srimani, 1990). Kapur et al. (1992), Sridharan and
Jayashree (1998), Pham et al. (1999) and Pham (2003a) obtained the transient
solutions of a software model with imperfect debugging and generation of errors by
two servers. Hoyland and Rausand (1994) described various aspects of system
reliability models and statistical methods.
Several notable researchers have made important contributions to software
reliability engineering (cf. Lewis, 1994; Lyu, 1996; Sahner et al., 1996). Smidts and
Sova (1999) proposed an architectural model for software reliability quantification.
Xang and Xie (2000) examined operational and testing reliability in order to study
software reliability growth. Sarhan (2002) analyzed reliability equivalence for a basic
series/parallel system. Levitin (2004) gave an algorithm for evaluating the reliability and
expected execution time of systems consisting of fault-tolerant software components
running on several hardware units. Ho et al. (2003) and Chang and Jeng (2005)
studied connectionist models for software reliability prediction. Choi and
Seong (2006) carried out the reliability assessment of an embedded digital system using a
multi-state function. Yu et al. (2007) gave the reliability optimization of a redundant
system with failure dependency. Jha et al. (2009) and Yang et al. (2010) examined
software reliability models with testing effort and cost analysis.
Yang and Meng (2011) proposed a warm standby repairable system consisting
of two dissimilar units and one repairman. Bichon et al. (2011) studied efficient
surrogate models for the reliability analysis of systems with multiple failure modes;
their surrogate-based approach simultaneously addresses the issues of accuracy,
efficiency, and unimportant failure modes.
1.5.6 k-out-of-n: G configuration
In a k-out-of-n: G configuration the system works if at least k components work
out of the available n components. A variety of applications of these systems are found in
reliability analysis, especially when the performance prediction of hardware and software
systems is considered. The system reliability of k-out-of-n systems has been discussed by
many researchers in different frameworks. Dhillon (1978) and Chung (1980)
proposed a k-out-of-N three-state system with common-cause failures and a
replacement policy. Shanthikumar (1982) worked on a recursive algorithm to evaluate
the reliability of a consecutive K-out-of-N: F system. Hwang (1986) developed
reliability models for consecutive k-out-of-n: G systems. Vanderperre (1990) and
She and Pecht (1992) proposed a general closed-form equation for the system reliability
of a k-out-of-n warm-standby system. The equation reduces to the hot and cold
standby cases under the appropriate restrictions. Moustafa (1997, 1998) suggested a
transient analysis to evaluate the reliability with and without repair of k-out-of-n: G
systems with M failure modes and imperfect coverage. Amari (2000) carried out a transient
analysis to examine the reliability with and without repair of K-out-of-N: G systems
with M failure modes. Moustafa (2001a) found the availability of K-out-of-N: G systems
with exponential failures and general repairs. Hong et al. (2002) considered the joint
reliability importance of k-out-of-n systems. Da Casta Bueno (2005) proposed
minimal standby redundancy allocation in a k-out-of-n: F system with dependent
components. Janab and Dhillon (2006) gave an assessment of reversible multi-state
k-out-of-n: G/F load-sharing systems with flow-graph models. Rushdi and Alsulami
(2007) evaluated the cost elasticities of reliability and MTTF for k-out-of-n systems.
Da Casta Bueno and Do Carmo (2007) worked on active redundancy allocation for a
k-out-of-n: F system of dependent components. Yinghui and Jing (2008) proposed a
new model for load-sharing k-out-of-n: G systems with different components.
Parallel and k-out-of-n: G systems with non-identical components and their mean
residual life functions were studied by Gurler and Bairamov (2009). Beutner (2010)
developed a nonparametric model for k-out-of-n systems. Habib et al. (2010)
discussed the reliability of a consecutive (r, s)-out-of-(m, n): F lattice system with
conditions on the number of failed components in the system. Eryilmaz (2011)
analyzed the dynamic behavior of k-out-of-n: G systems.
1.5.7 Common Cause Failure
Common cause failure analysis is an important topic in the study of reliability and
fault tolerant systems, as common cause failures often dominate hardware failures.
Some typical common causes include impact, vibration, pressure, stress and
temperature. There are many researchers who incorporated the concept of common
cause failure in their studies on reliability assessment. Hughes (1987) and Mosleh
(1991) proposed a new approach to the problem of quantifying common
cause failure in the systems under consideration. They produced a direct procedure for
system common cause failure probability calculations not dependent on any
modelling assumptions, but only on the system structure. Sridharan and
Mohanavadivu (1997) did reliability and availability analysis for two non-identical
unit parallel systems with common cause failures and human errors. Levitin (2001)
incorporated common-cause failures for non repairable multistate series–parallel
systems analysis. Vaurio (2003) proposed the modelling and quantification of
common cause failures in redundant standby safety systems by incorporating the
assessment uncertainties in the estimation of multiple failure rates based on data from
many plants or systems. Salem and El-Damcese (2004) developed a model for the
analysis of systems subject to common-cause failures under the assumption of
Weibull distribution. The optimization of the system reliability in the presence of
common cause failure was considered by Ramirez-Marquez and Coit (2006). Xing et
al. (2007) proposed an efficient decomposition and aggregation approach for
incorporating common-cause failures into the reliability evaluation of hierarchical
computer-based systems. El-Damcese (2009) studied a k-out-of-(M+S): G warm
standby system with repair and time-varying failure due to common cause failure, and
formulated its reliability and availability. Li et al. (2009a) evaluated a warm standby
system with components having proportional hazard rates. Heterogeneous
redundancy optimization for multi-state series-parallel systems subject to common
cause failures was carried out by Li et al. (2010). Hajeeh (2011) discussed the reliability and
availability of series configurations having both warm and cold standbys in the
presence of common cause failure of the system in all states. The mean time to failure
(MTTF) and steady state availability were derived for all configurations.
1.5.8 Availability Analysis
In the reliability literature, several studies are devoted to the availability analysis
of fault tolerant systems. Gupta and Tyagi (1986) suggested the MTTF and
availability evaluation of a two-unit, two-state, standby redundant complex system
with constant human failure. Chen (1992) gave the transient analysis to predict the
reliability and availability of k-to-l-out-of-n: G system. Tokuno and Yamada (1995)
considered a software availability model by incorporating a positive fault-correction
time and uncertainty of the fault-correction activities. They assumed that the hazard
rate for software-failure occurrence reduces geometrically with the progress in the
fault-removal process. Sarkar and Chaudhuri (1999) studied the availability of a
system with gamma life and exponential repair under a perfect repair policy.
The availability of a system maintained through several imperfect repairs
before a replacement or a perfect repair was studied by Biswas and Sarkar (2000)
and Sarkar and Sarkar (2000, 2001). De Smidt-Destombes et al.
(2004) considered the availability of a k-out-of-N system given limited spares and
repair capacity under a condition based maintenance strategy. Kharoufeh et al. (2006)
gave availability of periodically inspected systems with Markovian wear and shocks.
Kiureghian et al. (2007) discussed the availability of K-out-of-N systems.
Failure-aware resource management for high-availability computing clusters with
distributed virtual machines was proposed by Fu (2010). Moghaddass et al. (2010)
discussed the availability of a general k-out-of-n: G system with non-identical
components considering shut-off rules, using a quasi-birth-death process and a
semi-Markov model. An improved delay time model with imperfect maintenance at
inspection was developed by Wang et al. (2011a), based on the assumption of
imperfect inspection maintenance and perfect failure maintenance.
1.5.9 Failure Analysis
Fault tolerant systems are always prone to failures, and these failures disrupt the
services provided through such systems. The changeover of a spare unit from the standby
state to the active state is called standby switching. There is always a possibility
that the switching device itself fails with some probability, in which case the interruption
caused by a power supply failure is not resolved, owing to the standby
switching failure. Various models have been developed by several researchers for
analyzing standby redundant systems with switching failures. A notable
contribution in this area is due to Chow (1971), who discussed the reliability of two
items in sequence with sensing and switching. Alidrisi (1992) defined the reliability
of a dynamic warm standby redundant system of n components with imperfect
switching. Reliability prediction of imperfect switching systems subject to multiple
stresses was suggested by Pan (1997). Xu et al. (2005) studied the asymptotic
stability of a repairable system with imperfect switching in which the repair time of
the failed system follows an arbitrary distribution. Jain et al. (2007) studied a
queueing system with mixed standbys; they assumed the life-time and repair time of
the units to be exponentially distributed. Wang et al. (2006a) and Wang and Chen
(2009) made a comparative analysis of availability between three systems with
general repair times, reboot delay and switching failures. Hsu et al. (2008, 2011b)
studied the availability of a system with reboot delay, standby switching failures and an
unreliable repair facility by considering the time-to-failure and the reboot time as
exponentially distributed. The repair time of the service station and the time-to-repair
of a component were assumed to be generally distributed.
A degraded failure may not stop the fundamental function of the system; there can be
multiple stages of degradation, and the system may fail after a certain number of
stages. When all standbys (warm and cold) are being used, the failure of units may
occur in a degraded fashion. Significant work has been done on systems with a
degraded failure rate. Many researchers have studied the problem of degraded failure
in different frameworks and suggested ways and means to tackle such situations.
Initially, the concept of degraded states and common-cause failures was studied by
Yamashiro (1982). Gupta and Sharma (1986) and Nahas et al. (2007) carried out the reliability
analysis of two-state repairable parallel redundant systems under human failure and
developed an algorithm to analyze the series-parallel system. Jain et al. (2004)
worked on the bilevel control of a degraded machining system with warm standbys,
setup and vacation. Pham et al. (1997) evaluated full and degraded mission reliability
and mission dependability for intermittently operated, multi-functional systems and
obtained the availability and mean life time of a multistage degraded system with
partial repairs. Soro et al. (2010) investigated multi-state degraded systems with
minimal repairs and imperfect preventive maintenance.
1.5.10 Software Reliability Growth Models
A reliability model of Markov structured software was developed by
Littlewood (1975). Trivedi and Shooman (1975) and Kemmeny and Snell (1976)
studied many-state Markov models for the estimation and prediction of computer
software performance parameters. Whittaker and Poore (1993) carried out a Markov analysis
of software specifications. Whittaker and Thomason (1994) studied Markov chain
models for statistical software testing. A non-homogeneous Markov software
reliability model with imperfect repair was presented by Gokhale et al. (1996). Propp
and Wilson (1996) considered exact sampling with coupled Markov chains and
applications to statistical mechanics. Whittaker et al. (2000) studied a Markov chain
model to enable reliability prediction for future builds using testing data for the
software system. El-Gohary (2004) analyzed the estimation of parameters in a
three-state semi-Markov reliability model. Prowell and Poore (2004) explored the
computation of system reliability using Markov chain usage models. Lo et al. (2005) gave
the reliability assessment and sensitivity analysis of software reliability growth
modeling based on software module structure. Montoro-Cazorla and Perez-Ocon
(2006) studied the reliability of a system under two types of failure using a
Markovian arrival process. Chiquet et al. (2008) estimated the reliability of stochastic
dynamical systems with Markovian switching. Do Van et al. (2010) gave
importance measures for Markov reliability models. Buchholz et al. (2010) described
multi-class Markovian arrival processes and their parameter fitting. Meedeniya et
al. (2011) proposed a software and hardware architecture model with the necessary
reliability-relevant attributes to quantify the quality of individual deployment alternatives by
employing an evolutionary algorithm.
1.6 Organization of the Thesis
The objective of our study in this thesis is to develop reliability/availability
models of fault tolerant systems. Redundancy issues are taken into consideration to
predict the hardware and software reliability. The thesis is structured into 9
chapters dealing with the reliability/availability analysis of fault tolerant systems in
different frameworks. In Chapters 2-9, we investigate some models of fault tolerant
systems by stating the requisite assumptions and notations to formulate the
mathematical model and to provide analytical and numerical results. The introductory
section of each chapter includes a review of the relevant literature and the organization of
the chapter. The mathematical analysis of the developed models is presented
using appropriate methodology based on reliability theory. The numerical results are
summarized in tables and also exhibited via graphs. At the end of the thesis, the
relevant references are arranged alphabetically for the convenience of the
readers. The chapter-wise organization of the thesis is as follows:
Chapter-1: General Introduction
The present chapter on general introduction presents the research motivation
and various conceptual aspects of the hardware and software reliability of fault tolerant
systems. It provides the background on the techniques used for the analysis
of the developed reliability models of redundant systems. The important
contributions in the area of our research interest have been discussed. The outline of
the thesis and the significance of the work investigated in the various chapters are also
highlighted.
Chapter-2: Fault Tolerance in A Clustered Architecture
The collection of the software faults, error detection, diagnosis and error
recovery is considered to investigate the fault tolerance in a clustered system. In this
chapter we study the different levels of tolerance. The numerical results provided will
be helpful to examine the reliability and availability of the concerned system.
Chapter-3: Reliability Modeling of Hardware and Software
Interactions
A reliable computer system delivers the normal level of service in the presence
of both hardware and software faults. In this chapter, we deal with a reliability model for a
computer system that considers three types of failure, namely software failure,
hardware failure and failure due to the interaction of software and hardware, along with common
cause failure. We derive the probabilities of the system being in different states by
using the successive over relaxation (SOR) technique.
Chapter-4: Distributed Software and Hardware Systems
A distributed system is a collection of computers in which any
member of the cluster is capable of supporting the processing functions of any other
member. In this chapter we investigate a multi-host system with standbys.
Common cause failure, which is an important factor in calculating the availability of a
realistic system, is also taken into consideration. Some important indices to predict the
performance of the system are given.
Chapter-5: Transient Analysis of a Hardware-Software
System
This chapter focuses on the performance prediction of hardware and software
system supported with standbys. This chapter is arranged into two sections as follows:
Section-5(A): Embedded Computer System with Two Types of Failures and Common Cause Failure
In this section we propose a Markov model for a K-out-of-N: G system with
common cause failure. The hardware system consists of N non-identical components
and Y warm standby components under the care of a single repairman. The system
is subject to hardware errors, human errors and common cause failures. Numerical results,
obtained using the fourth-order Runge-Kutta method, are provided to give insight
into the sensitivity of the system to its parameters.
Section-5(B): Hardware and Software Systems with Warm Standbys and Switching Failures
This section is concerned with a Markovian model for a hardware and software
system. There is a provision of warm standby hardware units, which are subject to
switching failure when brought into use. Numerical techniques based on the matrix method and
the Runge-Kutta method are used to compute various system performance indices.
Chapter-6: Availability Analysis of Repairable Redundant
System with Reboot Delay
In this chapter we discuss the availability indices of a hardware-software
system with switching failure and reboot delay. The chapter is further subdivided into
two sections, which are given as follows:
Section-6(A): Hardware-Software System with Switching Failure
In this section we consider the reliability and availability analysis of a system
having M hardware components, supported by warm standbys, and N software components. The
concepts of switching failure, common cause failure and reboot delay are taken into
consideration. The Adaptive Network-based Fuzzy Inference System (ANFIS) approach and the
successive over relaxation (SOR) technique are used to obtain the numerical
results.
Section-6(B): Repairable System with Warm Standby and Switching Failure
We study the availability of a repairable system supported by warm standbys for
three kinds of configurations. To formulate the model, the assumptions of switching
failure and setup time are incorporated. Numerical results are provided to
give insight into the effect of the system descriptors on the performance
indices.
Chapter-7: Warranty Policy for Hardware and Software
Systems with Common Cause Failure
In this chapter we study the performance of a K-out-of-N system with common
cause failure. The system consists of hardware and software components. The
analysis is carried out along with the application of a warranty policy. The free replacement
warranty policy and repair are considered. The reliability and other measures are
obtained. The analytical results are supported by numerical results.
Chapter-8: Semi-Markov Models with Common Cause Failure
In many real time systems, it is practically impossible to build a perfect
system in which no component failure ever leads to system failure.
Redundancy is a widely used technique for building systems that continue to
operate satisfactorily in the presence of faults occurring in the system. In this chapter we
develop semi-Markov models for redundant systems with common cause
failure. This chapter is arranged into two sections as follows:
Section-8A: Redundant System with Rejuvenation
The principal objective of applying redundancy is to achieve availability goals
subject to techno-economic constraints. In this section we predict the availability of a
redundant system with common cause failure and rejuvenation by using an embedded
Markov chain approach. A recursive procedure for generating the state-transition
probabilities is employed. The appropriate framework for finding the optimal
rejuvenation interval is discussed by considering the downtime cost factors.
Section-8B: Imperfect Fault Coverage System with Reboot
We develop a semi-Markov model for a two-unit system by using the
supplementary variable technique (SVT). The concepts of common cause failure,
reboot and imperfect fault coverage are taken into consideration. The model is
analyzed for three types of repair time distribution, namely exponential, gamma
and uniform. We derive explicit expressions for the availability and failure frequency
of the systems.
Chapter-9: Software Reliability Growth Model (SRGM) with
N-version Programming
This investigation is concerned with a software reliability growth model
based on a non-homogeneous Poisson process with testing effort. N-version
programming is taken into consideration under the assumption of an imperfect debugging
process. We propose the estimation of unknown parameters by using the maximum
likelihood estimation approach. The Adaptive Network-based Fuzzy
Inference System (ANFIS) approach is also employed to support the numerical
results.
1.8 Concluding Remarks
The main objective of this chapter has been to discuss some fundamental
concepts of fault tolerance and reliability analysis. The key issues related to the design
and analysis of fault-tolerant systems are also described. In this thesis, we have
studied software fault tolerance, hardware fault tolerance and other related issues in
different frameworks.
Our investigations focus on establishing reliability indices for hardware
and software systems. It is expected that the hardware and software fault tolerant systems
studied will benefit all concerned by enabling greater predictability of the
dependability of software. It is worthwhile to highlight the novel features of the
investigations carried out in the present thesis. Some key issues tackled with
suitable methodology are as follows:
• The concepts of recovery blocks, N-version programming, common cause
failure, reboot delay, imperfect debugging, switching failures, standbys, etc.
are taken into account while developing reliability models for fault tolerant
systems.
• When the number of operating components available is on the verge of the
minimum specified level, the workload on the operating components increases
and the system starts working in degraded mode. The concept of degraded
failure has been used for developing the models studied in chapters 3, 4, 5(A),
7 and 8(B).
• The moment a standby unit is switched in place of a broken-down unit, it
should work; however, sometimes it does not. There is always a possibility that
the switching process may fail. The probabilistic nature of this situation is
important to incorporate in system repair models where standbys
play a crucial role in case of system failures. This concept has been
incorporated in chapters 5(B) and 6(A)-(B) to make the models depict a more
realistic and versatile scenario.
• In chapter 5(A), we have developed a k-out-of-N system, wherein out of a total of N
components, k are required for smooth functioning of the system and (N-k)
units are kept as standbys.
• Fault tolerance and software reliability growth models are widely used for the
performance modeling and analysis of communication systems and computer
networks. The concepts of N-version programming, testing effort and
imperfect debugging studied in chapter 9 may prove useful in real time fault
tolerant systems.
• In chapter 7, warranty policies are employed in the warranty cost model to provide
analytical results which are difficult to obtain using the classical renewal
process.
• The Runge-Kutta (R-K) method is a reasonably accurate and well-behaved approach and can
be employed for a wide range of problems. The fourth order Runge-Kutta
method is used to obtain the solution of the differential equations governing the
reliability models in chapters 4, 5(A)-(B) and 8(B).
• The successive over relaxation (SOR) method is used in chapters 3 and 6(A)
to deal with the hardware-software model to compute the availability. This
method captures the trend of the availability well.
• The supplementary variable technique (SVT) is used in chapter 8(B) to
analyze the availability and failure frequency for different repair time distributions, namely
exponential, gamma and uniform, for the imperfect fault coverage
system with common cause failure, recovery and reboot delay.
Capacity planners and system analysts may utilize the results of the models
developed in the present thesis to architect the fault tolerant systems of future
generations. It is hoped that our investigations may be beneficial to decision
makers, system designers, network administrators and system developers to achieve
the desired grade of service or to manage limited resources under techno-economic
constraints as far as real time fault tolerant systems are concerned.
Chapter-2
Fault Tolerance in A Clustered Architecture
2.1 Introduction
2.2 Reliability Models
2.3 Model for Software Fault Tolerance
2.4 The Analysis
2.5 Numerical Illustrations
2.6 Conclusion
System architecture based on a cluster of computers
has received considerable attention recently. In a clustered
system, applications can be built with commercial
hardware, operating systems and application software to achieve high
system availability. In this chapter we describe various levels of reliability in
terms of fault detection, fault recovery, volatile data consistency
and persistent data consistency. The application software is
responsible for the extent of the data backup, subsequent recovery
and error detection. Numerical results are provided to
give insight into the effect of the system descriptors on the performance indices.
2.1 Introduction
Software reliability is one of the most important characteristics of system
software. Its measurement and management during the software life-cycle are
important to produce and maintain reliable software. Fault tolerance is the ability of a
system to continue correct performance of its tasks after the occurrence of hardware
or software faults. In recent years, many computer system failures have been caused by
software faults introduced during the software development process.
System unavailability and data inconsistency are often caused by
faults that exist in the system and are disclosed late. A fault is simply any physical
defect, imperfection or flaw that occurs in hardware or software. Some faults cannot
be tolerated and lead to an entire system failure, whereas in other cases the impact
is limited to a partial system failure.
The fault tolerant architecture includes the software design of error detection
and diagnosis as well as error recovery. The executing program is supervised by the
watchdog, which signals a failure condition of the software program when the
execution time of a subprogram exceeds its default value. In a cluster, the system
is built with commercial hardware, operating system and database system. The
systems are required to be available on user demand and their data should be
consistent for the user's purpose. Carpenter (1990) defined a mechanism for
evaluating the effectiveness of software fault-tolerant structures. Siewiorek and
Swarz (1992) presented the reliable design of computer systems. Hoeflin and Mendiratta
(1995) analyzed an elementary model for predicting switching system outage
durations. Huang and Kintala (1995) proposed software fault tolerance in the
application layer. Mendiratta (1996) discussed the reliability impacts of software
fault tolerance mechanisms. Sahner et al. (1996) proposed the performance and
reliability analysis of computer systems using the SHARPE software package.
Hughes-Fenchel (1997) studied a flexible clustered approach to achieve high availability.
Berman and Kumar (1999) presented optimization models for complex recovery
block schemes.
Cheng et al. (2000) analyzed a fault-tolerance model for multiprocessor real-time
systems. Littlewood et al. (2002) studied the reliability of diverse fault-tolerant
software based systems. Ho et al. (2003) studied various models for software
reliability prediction. Levitin (2004) presented the reliability and performance
analysis for fault-tolerant programs consisting of versions with different
characteristics. Zhang and Qin (2008) and Leach (2008) described parametric
analysis and checkpoints in legacy code to improve fault tolerant systems.
Santos et al. (2009) considered power saving and fault tolerance in real-time critical
embedded systems. Adapting grid applications to safety using fault tolerant methods
was studied by Shi et al. (2010). Rafe and Mahdian (2011) explored the style based
modeling and verification of fault tolerant service oriented architecture. Shet et al.
(2011) suggested various strategies for fault tolerance in multicomponent
applications.
In this investigation, fault tolerant software in a clustered architecture is
studied. The rest of the chapter is structured as follows. In section 2.2,
we explain the reliability models and levels. In section 2.3, we construct the transient
equations for the clustered fault tolerant system with the help of a transition diagram. In
section 2.4, the steady state probabilities are obtained. In section 2.5, we provide
numerical results, which are displayed with the help of tables and graphs. Section 2.6
is devoted to the concluding remarks.
2.2 Reliability Models
The hardware and software models are used to predict the system
availability and other reliability indices. The data consistency model gives
predictions of the defect rate. In other words, it deals with the units of load (calls,
messages, transactions, etc.) that are lost due to hardware and software failures as a
proportion of the total offered load. The levels of reliability, which are based on the
definition of levels of software fault tolerance, are used to model the fault
tolerant system. The reliability levels are described in ascending order as follows:
Level 0 : Basic automatic fault detection by watchdog, no automatic fault
recovery, no data consistency.
The watchdog detects faults in the hardware and software. For a software
fault, the application process is restarted at its initial internal state. For a
hardware fault, the system is manually reconfigured and the faulty processor is
removed.
Level 1 : Basic automatic fault detection by watchdog, automatic fault recovery, no
data consistency.
The watchdog detects and recovers from a group of hardware and software faults.
When the watchdog detects a fault in the hardware, the system is automatically
reconfigured and recovered. For a software fault, the process is restarted at the initial
internal state. The restarted internal state does not reflect the previous execution.
Level 2 : Level 1 (basic automatic fault detection by watchdog, automatic fault recovery, no
data consistency), enhanced automatic fault detection by watchdog, plus
periodic checkpointing, logging and recovery of internal state.
In this level a larger set of hardware and software faults is automatically
detected by the watchdog and the application. When a hardware failure is detected, the system is reconfigured
around the faulty unit. If both hardware and software fail, the application is restarted
and returns close to the state at which it failed.
Level 3 : Level 2 and persistent data recovery.
The persistent data of the application is replicated on a backup disk connected
to a backup node with the capabilities of level 2. The persistent data is kept consistent
with the data on the primary node throughout the normal operation of the application.
A higher level is more reliable than a lower level, so that level i is more
reliable than level i-1 (1 ≤ i ≤ 3).
2.3 Model for Software Fault Tolerance
Fig. 2.1: State transition diagram for software fault tolerance
The Markov model for systems having different levels of
software fault tolerance (see fig. 2.1) includes five states: (i) working (W), (ii) fault detection and
recovery (FDR), (iii) volatile data recovery (VDR), (iv) persistent data recovery
(PDR) and (v) failure (F).
The working state represents the normal execution state of the system. If an
error occurs in this state, the system moves to another state. If the error is recoverable,
the system enters the fault detection and recovery state and the recovery starts. If
the error is recovered in this state, the system goes back to the working state; otherwise the system
either fails or enters another level of recovery. The recovery process proceeds to the
volatile data recovery state and the persistent data recovery state in a similar pattern.
The following notations are used to formulate the model:
$\lambda$ : The error rate.
$\lambda_1$ : The rate at which recovery cannot be completed in this state.
$\lambda_2$ : The rate at which volatile data recovery cannot be completed in this state.
$\lambda_3$ : The rate at which persistent data recovery cannot be completed in this state.
$\lambda_p$ : The failure rate of the power supply unit.
$\mu$ : The manual repair rate.
$\mu_1$ : The rate at which successful recovery is performed.
$\mu_2$ : The rate at which successful volatile data recovery is performed.
$\mu_3$ : The rate at which successful persistent data recovery is performed.
$C$ : Fault recovery coverage factor for the error.
$C_1$ : Volatile data recovery coverage factor for the error.
$C_2$ : Persistent data recovery coverage factor for the error.
$P_i(t)$ : Probability that the system is in the ith (i = 0, 1, 2, 3, 4) state at time t.
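The equations governing the model are constructed as follows:
\[ \frac{dP_0}{dt} = -\left[\lambda(1-C) + \lambda_p + \lambda_C\right] P_0(t) + \mu_1 P_1(t) + \mu_2 P_2(t) + \mu_3 P_3(t) + \mu P_4(t) \tag{2.1} \]
\[ \frac{dP_1}{dt} = -\left[\lambda_1(1-C_1) + \lambda_p + \lambda_1 C_1 + \mu_1\right] P_1(t) + \lambda_C P_0(t) \tag{2.2} \]
\[ \frac{dP_2}{dt} = -\left[\lambda_1(1-C_2) + \lambda_p + \lambda_2 C_2 + \mu_2\right] P_2(t) + \lambda_1 C_1 P_1(t) \tag{2.3} \]
\[ \frac{dP_3}{dt} = -\left[\lambda_3 + \lambda_p + \mu_3\right] P_3(t) + \lambda_2 C_2 P_2(t) \tag{2.4} \]
\[ \frac{dP_4}{dt} = \left[\lambda(1-C) + \lambda_p\right] P_0(t) + \left[\lambda_1(1-C_1) + \lambda_p\right] P_1(t) + \left[\lambda_1(1-C_2) + \lambda_p\right] P_2(t) + \left[\lambda_3 + \lambda_p\right] P_3(t) - \mu P_4(t) \tag{2.5} \]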
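The system (2.1)-(2.5) can be integrated numerically. The following is a minimal sketch, not the computation used in the thesis: it assumes that the symbol λ_C in (2.1)-(2.2) stands for the product λC, it keeps the term λ1(1-C2) exactly as printed in (2.3) and (2.5), and it uses a hand-written fourth order Runge-Kutta step. The function names `generator` and `rk4`, the step count and the parameter dictionary are illustrative choices; the parameter values follow data set I with λ = 0.5. The values it returns have not been checked against Tables 2.1 and 2.2.

```python
import numpy as np

def generator(lam, lam_p, lam1, lam2, lam3, mu, mu1, mu2, mu3, C, C1, C2):
    """Rate matrix G with G[i, j] = rate from state i to state j.
    States: 0 = W, 1 = FDR, 2 = VDR, 3 = PDR, 4 = F."""
    G = np.zeros((5, 5))
    G[0, 1] = lam * C                     # assumption: lambda_C = lambda * C
    G[0, 4] = lam * (1 - C) + lam_p
    G[1, 0] = mu1
    G[1, 2] = lam1 * C1
    G[1, 4] = lam1 * (1 - C1) + lam_p
    G[2, 0] = mu2
    G[2, 3] = lam2 * C2
    G[2, 4] = lam1 * (1 - C2) + lam_p     # lambda_1 kept as printed in (2.3), (2.5)
    G[3, 0] = mu3
    G[3, 4] = lam3 + lam_p
    G[4, 0] = mu                          # manual repair back to the working state
    np.fill_diagonal(G, -G.sum(axis=1))   # each row of a rate matrix sums to zero
    return G

def rk4(G, p0, t_end, steps):
    """Integrate dp/dt = p G from 0 to t_end with classical RK4."""
    h = t_end / steps
    p = np.array(p0, dtype=float)
    f = lambda v: v @ G
    for _ in range(steps):
        k1 = f(p)
        k2 = f(p + 0.5 * h * k1)
        k3 = f(p + 0.5 * h * k2)
        k4 = f(p + h * k3)
        p = p + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return p

# Illustrative parameters taken from data set I with lambda = 0.5.
PARAMS = dict(lam=0.5, lam_p=0.05, lam1=4, lam2=50, lam3=5,
              mu=0.2, mu1=1, mu2=5, mu3=10, C=0.9, C1=0.95, C2=0.999)

if __name__ == "__main__":
    G = generator(**PARAMS)
    for t in (2, 4, 6, 8, 10):
        print(t, rk4(G, [1, 0, 0, 0, 0], t, 1000 * t))
```

The step size is kept small (h = 0.001) because the largest exit rate in this parameter set is about 55, and an explicit method such as RK4 needs h times that rate to stay well inside its stability region.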
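2.4 The Analysis
For notational convenience, we denote:
\[ \Lambda_0 = \lambda + \lambda_p; \quad \Lambda_1 = \lambda_1 + \lambda_p + \mu_1; \quad \Lambda_2 = \lambda_2 C_2 + \lambda_p + \mu_2; \quad \Lambda_3 = \lambda_3 + \lambda_p + \mu_3 \]
\[ \bar{C} = (1-C); \quad \bar{C}_1 = (1-C_1); \quad \bar{C}_2 = (1-C_2) \]
For steady state, the equations (2.1)-(2.5) can be written as
\[ 0 = \left[\lambda(1-C) + \lambda_p\right] P_0 + \left[\lambda_1(1-C_1) + \lambda_p\right] P_1 + \left[\lambda_1(1-C_2) + \lambda_p\right] P_2 + \left[\lambda_3 + \lambda_p\right] P_3 - \mu P_4 \tag{2.6} \]
\[ 0 = -\left[\lambda(1-C) + \lambda_p + \lambda_C\right] P_0 + \mu_1 P_1 + \mu_2 P_2 + \mu_3 P_3 + \mu P_4 \tag{2.7} \]
\[ 0 = -\left[\lambda_1(1-C_1) + \lambda_p + \lambda_1 C_1 + \mu_1\right] P_1 + \lambda_C P_0 \tag{2.8} \]
\[ 0 = -\left[\lambda_1(1-C_2) + \lambda_p + \lambda_2 C_2 + \mu_2\right] P_2 + \lambda_1 C_1 P_1 \tag{2.9} \]
\[ 0 = -\left[\lambda_3 + \lambda_p + \mu_3\right] P_3 + \lambda_2 C_2 P_2 \tag{2.10} \]
Solving equations (2.7)-(2.10), we obtain
\[ P_0 = \frac{\mu_1 P_1 + \mu_2 P_2 + \mu_3 P_3 + \mu P_4}{\lambda + \lambda_p} \tag{2.11} \]
\[ P_1 = \frac{\lambda_C P_0}{\Lambda_1} \tag{2.12} \]
\[ P_2 = \frac{\lambda_C \lambda_1 C_1 P_0}{\Lambda_1 (\lambda_1 \bar{C}_2 + \Lambda_2)} \tag{2.13} \]
\[ P_3 = \frac{\lambda_C \lambda_1 C_1 \lambda_2 C_2 P_0}{\Lambda_1 \Lambda_3 (\lambda_1 \bar{C}_2 + \Lambda_2)} \tag{2.14} \]
From eq. (2.11) and the values of P1, P2 and P3 from (2.12)-(2.14), we get
\[ P_4 = \frac{1}{\mu} \cdot \frac{\Lambda_0 \Lambda_1 \Lambda_3 (\lambda_1 \bar{C}_2 + \Lambda_2) - \lambda_C \mu_1 \Lambda_3 (\lambda_1 \bar{C}_2 + \Lambda_2) - \lambda_C \lambda_1 C_1 \mu_2 \Lambda_3 - \lambda_C \lambda_1 C_1 \lambda_2 C_2 \mu_3}{\Lambda_1 \Lambda_3 (\lambda_1 \bar{C}_2 + \Lambda_2)} \, P_0 \tag{2.15} \]
Now using the normalizing condition $\sum_{i=0}^{4} P_i = 1$, we obtain P0 as
\[ P_0 = \frac{\mu \Lambda_1 \Lambda_3 (\lambda_1 \bar{C}_2 + \Lambda_2)}{(\mu + \Lambda_0) \Lambda_1 \Lambda_3 (\lambda_1 \bar{C}_2 + \Lambda_2) + \lambda_C \Lambda_3 (\lambda_1 \bar{C}_2 + \Lambda_2)(\mu - \mu_1) + \lambda_C \lambda_1 C_1 \Lambda_3 (\mu - \mu_2) + \lambda_C \lambda_1 C_1 \lambda_2 C_2 (\mu - \mu_3)} \tag{2.16} \]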
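As a cross-check on the steady state analysis, the chain of ratios in (2.11)-(2.14) together with the normalizing condition can be compared with a direct linear solve of the balance equations. The sketch below is illustrative only: it assumes λ_C = λC, keeps λ1(1-C2) as printed in (2.9), and the names `steady_state_ratios` and `steady_state_linear` are hypothetical helpers rather than anything defined in the thesis.

```python
import numpy as np

def steady_state_ratios(lam, lam_p, lam1, lam2, lam3, mu, mu1, mu2, mu3, C, C1, C2):
    """Steady state probabilities from P1/P0, P2/P0, P3/P0, P4/P0 and sum(P) = 1."""
    lam_C = lam * C                               # assumption: lambda_C = lambda * C
    L0 = lam + lam_p
    L1 = lam1 + lam_p + mu1
    L3 = lam3 + lam_p + mu3
    D = lam1 * (1 - C2) + lam_p + lam2 * C2 + mu2  # total exit rate of the VDR state
    r1 = lam_C / L1                               # P1 / P0, eq. (2.12)
    r2 = r1 * lam1 * C1 / D                       # P2 / P0, eq. (2.13)
    r3 = r2 * lam2 * C2 / L3                      # P3 / P0, eq. (2.14)
    r4 = (L0 - mu1 * r1 - mu2 * r2 - mu3 * r3) / mu  # P4 / P0, from eq. (2.11)
    P0 = 1.0 / (1.0 + r1 + r2 + r3 + r4)
    return P0 * np.array([1.0, r1, r2, r3, r4])

def steady_state_linear(lam, lam_p, lam1, lam2, lam3, mu, mu1, mu2, mu3, C, C1, C2):
    """Same probabilities from the balance equations, solved as a linear system."""
    G = np.zeros((5, 5))
    G[0, 1], G[0, 4] = lam * C, lam * (1 - C) + lam_p
    G[1, 0], G[1, 2], G[1, 4] = mu1, lam1 * C1, lam1 * (1 - C1) + lam_p
    G[2, 0], G[2, 3], G[2, 4] = mu2, lam2 * C2, lam1 * (1 - C2) + lam_p
    G[3, 0], G[3, 4] = mu3, lam3 + lam_p
    G[4, 0] = mu
    np.fill_diagonal(G, -G.sum(axis=1))
    A = G.T.copy()
    A[-1, :] = 1.0                                # replace one balance equation by sum(P) = 1
    b = np.zeros(5)
    b[-1] = 1.0
    return np.linalg.solve(A, b)

# Illustrative parameters taken from data set I with lambda = 0.5.
PARAMS = dict(lam=0.5, lam_p=0.05, lam1=4, lam2=50, lam3=5,
              mu=0.2, mu1=1, mu2=5, mu3=10, C=0.9, C1=0.95, C2=0.999)

if __name__ == "__main__":
    print(steady_state_ratios(**PARAMS))
    print(steady_state_linear(**PARAMS))
```

Replacing one balance equation by the normalizing condition is needed because the five balance equations are linearly dependent: they sum to zero.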
2.5 Numerical Illustrations
In this section, numerical illustrations are provided for the
reliability R(t), the state probabilities Pi(t) and expected operational time (t) by
varying the error rate (λ) and the failure rate of the power supply (λp). The effects of these
parameters on the reliability indices have been examined for two sets of default
parameters fixed as follows:
Data set I: λ1 = 4, λ2 = 50, λ3 = 5, λp = 0.05, C = 0.9, C1 = 0.95,
C2 = 0.999, μ = 0.2, μ1 = 1, μ2 = 5, μ3 = 10
Data set II: λ = 0.5, λ1 = 4, λ2 = 50, λ3 = 5, C = 0.9, C1 = 0.95,
C2 = 0.999, μ = 0.2, μ1 = 1, μ2 = 5, μ3 = 10
Table 2.1 gives the state probabilities of the software fault tolerance model
for different values of λ, with the other parameters set as in data set I. It is seen that
the probability P0(t) of the working state decreases as t and λ increase. Table 2.2 displays
the numerical results for the software fault tolerance probabilities for varying
values of λp for data set II. We notice the same decreasing pattern of P0(t)
with both t and λp as seen in table 2.1.
Fig. 2.2 illustrates the probabilities of fault tolerance for varying values of t for
λ=0.5, λ1=4, λ2=50, λ3=5, C=0.9, C1=0.95, C2=0.999, μ=0.2, μ1=1, μ2=5, μ3=10. It is
observed that the probability P0(t) decreases sharply for lower values of t and
thereafter becomes almost constant. The probabilities P1(t), P2(t) and P3(t) increase, then
decrease and later become asymptotically constant as time grows. On the
contrary, P4(t) initially increases sharply and then attains an almost constant value for
increasing values of t.
For data sets I and II, figs 2.3-2.5 show the results for the reliability R(t) with
respect to time t for different values of the parameters λ, λp and C, respectively. Fig.
2.3(a) shows that the reliability decreases with respect to time t as well as λ, while in
fig. 2.3(b) the reliability decreases for lower values of t and then becomes almost constant
for higher t. Fig. 2.4(a) shows that the reliability R(t) initially decreases sharply and
thereafter attains an almost constant value as t increases. It is also observed that R(t)
decreases as λp increases. In fig. 2.4(b), R(t) shows an almost constant trend with
t. In fig. 2.5(a), we see a similar decreasing pattern followed by a constant trend for
R(t) with t; also R(t) decreases as C increases. In fig. 2.5(b), R(t) first gradually
decreases and then becomes almost constant as t increases. It is also found that R(t) is
not much affected by different values of C.
2.6 Conclusion
We have proposed a Markov chain based reliability model to describe
software fault tolerance. Error detection and recovery methods, including
process recovery, volatile data recovery and persistent data recovery, have been
explored. The several levels of recovery procedures incorporated make our model more
versatile and realistic for dealing with real time systems. It is noticed that manual
recovery, which is time-consuming and costly, may be helpful in improving the reliability
and availability of the system to a great extent.
Table 2.1: Probabilities of software fault tolerance for different values of λ for data set I

λ     t    P0        P1        P2        P3        P4
0.1   0    1         0         0         0         0
      2    0.920896  0.016651  0.001151  0.003838  0.057464
      4    0.88213   0.015919  0.0011    0.003667  0.097184
      6    0.858149  0.015467  0.001069  0.00356   0.121755
      8    0.843315  0.015186  0.001049  0.003495  0.136955
      10   0.834138  0.015013  0.001037  0.003454  0.146358
0.5   0    1         0         0         0         0
      2    0.699654  0.064339  0.004454  0.01494   0.216613
      4    0.595017  0.054184  0.003748  0.01253   0.334522
      6    0.545007  0.049329  0.00341   0.011378  0.390876
      8    0.521106  0.047009  0.003249  0.010827  0.41781
      10   0.509682  0.0459    0.003172  0.010563  0.430682
0.9   0    1         0         0         0         0
      2    0.54885   0.092092  0.006383  0.021509  0.331165
      4    0.431503  0.071082  0.004918  0.016474  0.476023
      6    0.386685  0.063057  0.004359  0.014551  0.531349
      8    0.369567  0.059992  0.004146  0.013815  0.55248
      10   0.36303   0.058822  0.004064  0.013534  0.56055
Table 2.2: Probabilities of software fault tolerance for different values of λp for data set II

λp    t    P0        P1        P2        P3        P4
0.01  0    1         0         0         0         0
      2    0.694165  0.06381   0.004417  0.014815  0.222793
      4    0.587974  0.053503  0.0037    0.012369  0.342454
      6    0.537726  0.048626  0.003361  0.011211  0.399076
      8    0.51395   0.046318  0.003201  0.010663  0.425869
      10   0.502699  0.045226  0.003125  0.010404  0.438547
0.05  0    1         0         0         0         0
      2    0.652029  0.059746  0.004135  0.013853  0.270238
      4    0.535607  0.048446  0.003349  0.011172  0.401427
      6    0.484754  0.04351   0.003005  0.01      0.458731
      8    0.462541  0.041354  0.002855  0.009488  0.483762
      10   0.452838  0.040412  0.00279   0.009264  0.494696
0.09  0    1         0         0         0         0
      2    0.612878  0.055971  0.003872  0.012961  0.314319
      4    0.48963   0.044009  0.00304   0.010122  0.4532
      6    0.439934  0.039185  0.002704  0.008978  0.509199
      8    0.419896  0.03724   0.00257   0.008515  0.53178
      10   0.411816  0.036455  0.002515  0.008328  0.540885
Fig. 2.2: Probabilities of software fault tolerance (P0, P1, P2, P3 and P4 plotted against time)
Fig. 2.3(a): Profiles of reliability by varying λ for data set I (R(t) against t for λ = 0.1, 0.5, 0.9)
Fig. 2.3(b): Profiles of reliability by varying λp for data set II (R(t) against t)
Fig. 2.4(a): Profiles of reliability by varying λ for data set I (R(t) against t)
Fig. 2.4(b): Profiles of reliability by varying λp for data set II (R(t) against t)
(The legends of figs 2.4(a)-(b) list λp = 0.1, 0.5, 0.9 and λp = 0.01, 0.05, 0.09.)
Fig. 2.5(a): Profiles of reliability by varying c for data set I (R(t) against t for c = 0.9, 0.95, 0.99)
Fig. 2.5(b): Profiles of reliability by varying c for data set II (R(t) against t for c = 0.9, 0.95, 0.99)
Chapter-3
Reliability Modeling of Hardware and Software Interactions
3.1 Introduction
3.2 Model Description
3.3 Governing Equations and Analysis
3.4 Numerical Results
3.5 Conclusion
For the reliability prediction of a computer system, the
failures are divided into three categories, namely software failures,
hardware failures and software-hardware (SW/HW) interaction
failures. In this investigation we develop a reliability model for a
computer system that includes failures of all three kinds.
A Markov process is used to establish the system reliability
indices with the consideration of hardware, software and
hardware-software interaction failures. The case of common cause
failure is also discussed. Using the successive over relaxation (SOR)
method, we obtain the probabilities of the system being in
different states. Some important indices for predicting the
performance of the system, such as reliability, mean time to failure,
etc., have been obtained.
3.1 Introduction
Reliability of a computer system generally refers to why, when and how
system hardware and software failures occur. A reliable computer system needs to
provide its normal level of service in the presence of both hardware and software faults.
Software is an integral part of many embedded systems and a major source of reliability
degradation in dependable systems. Software reliability modeling is a generic term for
a set of statistical methods which enable the reliability indices
of software to be predicted from observation of its failures during testing and
operational use. The basic hardware reliability model consists of all hardware
elements of the system, so that the overall logistics support requirements for spares,
maintenance personnel, etc. can be easily determined based on the failure rates of the
system. The systems have hardware and software components, and the hardware may
fail due to some common cause such as power supply, humidity, temperature, design,
etc. The software component can fail either because of latent faults or common cause
failure. Sometimes, the computer system fails because of interactions between the
hardware and the software. All such kinds of failure need the attention of maintenance
engineers. In view of this, reliability quantification has become an important part
of the performance modeling of such systems.
Various researchers have contributed significantly to the field of software
reliability. Hamlet (1995) discussed software quality, software process and
software testing. Linberg (1999) analyzed software developers' perception of
software project failure through a case study. Lee (2002) gave a detailed account of
embedded software. Zalewski et al. (2003) examined various aspects of the
software of computer control systems. Goseva-Popstojanova and Trivedi (2003)
presented architecture based approaches to software reliability prediction. Munch
and Heidrich (2004) discussed various concepts and approaches for the software
project control center. Jeske and Zhang (2005) presented some successful approaches
to software reliability modeling in industry. Raj Kiran and Ravi (2008) studied
software reliability by soft computing techniques. Vinod et al. (2008) examined the
integration of safety critical software systems in probabilistic safety assessment.
Sofokleous and Andreou (2008) considered automatic, evolutionary test data
generation for dynamic software testing. Oltean and Diosan (2009) studied an
autonomous GP-based system for regression and classification problems.
Fu (2010) explored failure-aware resource management for high-availability computing
clusters with distributed virtual machines. Meedeniya et al. (2011) presented
reliability driven deployment optimization for embedded systems.
A few researchers have tried to find the reliability and availability of both hardware
and software, and a combined reliability model for the entire system. Reussner et al.
(2003) gave the reliability indices for component based software architectures. Huang
and Chang (2007) analyzed an improved decomposition scheme for assessing the
reliability of embedded systems by using dynamic fault trees. Van et al. (2008) carried out
the reliability analysis of Markovian systems at steady state using perturbation
analysis. Dominguez-Garcia et al. (2008) described an integrated methodology for
the dynamic performance and reliability evaluation of fault tolerant systems. Recently
Zio (2009) studied reliability engineering from the viewpoint of old problems and
new challenges. The analysis of service availability for time-triggered rejuvenation
policies was considered by Salfner and Wolter (2010). Catelani et al. (2011)
proposed a new approach to automated software testing as an aid to
maximize the test plan coverage within the time available and also to increase
software reliability and quality.
The purpose of this chapter is to find the availability of a system whose failures
may be hardware failures, software failures or hardware-software
interaction failures. A discrete state Markov chain is used to construct a set
of differential equations for the transient probabilities governing the model. With the aid
of the steady state equations, we obtain a complete solution of the differential equations.
The rest of the chapter is organized as follows. Section 3.2 deals with the model
description by stating the requisite assumptions and notations. Section 3.3 provides
the governing equations and analysis. In section 3.4, numerical results are provided.
Finally in section 3.5, the conclusion is drawn.
3.2 Model Description
We consider a unified reliability model that accounts for hardware failures,
software failures and software-hardware interaction failures. In general, for modeling
purposes it is assumed that the hardware subsystem and the software
subsystem work independently. However, the hardware and software subsystems
cannot operate independently of each other. In this investigation we develop a model
with the assumption that the hardware and software interact with each other. We
explore hardware failures, software failures and failures due to the interaction of
hardware and software, and consequently the reliability of the system, which is affected
by hardware-software interaction failures. The transition flow diagram for the
Markov model is shown in figure 3.1. In our model state (0, 0) is the fully working
state. States b, bb, ab, cb indicate that faults are detected by the software but
recovery is not possible. States a, ba, aa, ca indicate that faults are detected and
also recovered by the software, while states c, bc, ac, cc indicate that faults are not
detected by the software. FT is the total failure state.
The following assumptions are made to formulate the Markov model:
• The whole system consists of one software and two hardware components. The
system fails when both the hardware components or the software component fail.
• The hardware failures, software failures and HW/SW failures are independent of
one another.
• λ2 is the failure rate of the hardware component. The life times and repair times
of software and hardware components are assumed to be exponentially
distributed. $\lambda_b$, $\lambda_a$ and $\lambda_c$ are the failure rates of the hardware component
corresponding to states b, a and c.
• $\mu_S$ is the repair rate of the hardware components when recovered by the software
and $\mu_m$ is the manual repair rate of the hardware components when they are not
recovered by the software.
• After one state, an undetected hardware degradation may cause a HW/SW failure
with ac, bc and cc states respectively, and a detected degradation may cause an
execution abortion with rates $\lambda'_b$, $\lambda'_{bb}$, $\lambda'_{ab}$ and $\lambda'_{cb}$, respectively.
• Partially failed hardware may further become totally failed with
rates $\lambda_{bb}$, $\lambda_{ba}$ and $\lambda_{bc}$ if degradation is detected but not recovered by software
methods, $\lambda_{ab}$, $\lambda_{aa}$ and $\lambda_{ac}$ if degradation is detected and recovered by software
methods, and $\lambda_{cb}$, $\lambda_{ca}$ and $\lambda_{cc}$ if degradation is not detected.
Fig. 3.1: State transition diagram
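Some other notations used for model formulation are as follows:
$\mu'_m$ : Manual repair rate at state b.
$\mu'_S$ : Repair rate at state a.
$P_1, P'_1$ : Probability that the hardware degradation is detected.
$q_1, q'_1$ : Probability that the hardware degradation is undetected.
$P_2, P'_2$ : Probability that the degradation is recovered by software methods.
$q_2, q'_2$ : Probability that the degradation is not recovered by software methods.
FT : Total failure state with aborted failure rate and hardware-related software failure rate.
For brevity we use some more notations as follows:
\[ \alpha_1 = P_1 q_2 P'_1 q'_2, \quad \alpha_4 = P_1 P_2 P'_1 q'_2, \quad \alpha_7 = q_1 P'_1 q'_2, \]
\[ \alpha_2 = P_1 q_2 P'_1 P'_2, \quad \alpha_5 = P_1 P_2 P'_1 P'_2, \quad \alpha_8 = q_1 P'_1 P'_2, \]
\[ \alpha_3 = P_1 q_2 q'_1, \quad \alpha_6 = P_1 P_2 q'_1, \quad \alpha_9 = q_1 q'_1, \]
and
\[ \Lambda_{bb} = \lambda_{bb} + \lambda'_{bb}, \quad \Lambda_{bc} = \lambda_{bc} + \lambda'_{bc}, \quad \Lambda_{ab} = \lambda_{ab} + \lambda'_{ab}, \]
\[ \Lambda_{ac} = \lambda_{ac} + \lambda'_{ac}, \quad \Lambda_{cb} = \lambda_{cb} + \lambda'_{cb}, \quad \Lambda_{cc} = \lambda_{cc} + \lambda'_{cc}. \]
3.3 Governing Equations and Analysis
The set of differential difference equations governing the model based on the
transition diagram shown in fig. 3.1 is given as follows:
\[ Q'_{00}(t) = -2\lambda Q_{00}(t) + \mu_m Q_b(t) + \mu_S Q_a(t) \tag{3.1} \]
\[ Q'_b(t) = 2\lambda P_1 q_2 Q_{00}(t) + \mu'_m Q_{bb}(t) + \mu'_S Q_{ba}(t) - \left[\mu_m + \lambda'_b + \lambda_b(\alpha_1 + \alpha_2 + \alpha_3)\right] Q_b(t) \tag{3.2} \]
\[ Q'_a(t) = 2\lambda P_1 P_2 Q_{00}(t) + \mu'_m Q_{ab}(t) + \mu'_S Q_{aa}(t) - \left[\mu_S + \lambda_a(\alpha_4 + \alpha_5 + \alpha_6)\right] Q_a(t) \tag{3.3} \]
\[ Q'_c(t) = 2\lambda q_1 Q_{00}(t) + \mu'_m Q_{cb}(t) + \mu'_S Q_{ca}(t) - \lambda_c(\alpha_7 + \alpha_8 + \alpha_9) Q_c(t) \tag{3.4} \]
\[ Q'_{bb}(t) = \lambda_b \alpha_1 Q_b(t) - (\mu'_m + \Lambda_{bb}) Q_{bb}(t) \tag{3.5} \]
\[ Q'_{ba}(t) = \lambda_b \alpha_2 Q_b(t) - (\mu'_S + \lambda_{ba}) Q_{ba}(t) \tag{3.6} \]
\[ Q'_{bc}(t) = \lambda_b \alpha_3 Q_b(t) - \Lambda_{bc} Q_{bc}(t) \tag{3.7} \]
\[ Q'_{ab}(t) = \lambda_a \alpha_4 Q_a(t) - (\mu'_m + \Lambda_{ab}) Q_{ab}(t) \tag{3.8} \]
\[ Q'_{aa}(t) = \lambda_a \alpha_5 Q_a(t) - (\mu'_S + \lambda_{aa}) Q_{aa}(t) \tag{3.9} \]
\[ Q'_{ac}(t) = \lambda_a \alpha_6 Q_a(t) - \Lambda_{ac} Q_{ac}(t) \tag{3.10} \]
\[ Q'_{cb}(t) = \lambda_c \alpha_7 Q_c(t) - (\mu'_m + \Lambda_{cb}) Q_{cb}(t) \tag{3.11} \]
\[ Q'_{ca}(t) = \lambda_c \alpha_8 Q_c(t) - (\mu'_S + \lambda_{ca}) Q_{ca}(t) \tag{3.12} \]
\[ Q'_{cc}(t) = \lambda_c \alpha_9 Q_c(t) - \Lambda_{cc} Q_{cc}(t) \tag{3.13} \]
\[ Q'_{FT}(t) = \lambda'_b Q_b(t) + \Lambda_{bb} Q_{bb}(t) + \lambda_{ba} Q_{ba}(t) + \Lambda_{bc} Q_{bc}(t) + \Lambda_{ab} Q_{ab}(t) + \lambda_{aa} Q_{aa}(t) + \Lambda_{ac} Q_{ac}(t) + \Lambda_{cb} Q_{cb}(t) + \lambda_{ca} Q_{ca}(t) + \Lambda_{cc} Q_{cc}(t) \tag{3.14} \]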
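The coefficients α1 to α9 used in (3.2)-(3.13) can be evaluated and sanity-checked with a few lines of code. The sketch below assumes that each q is the complement of the corresponding P (q1 = 1 - P1 and so on), which the notation suggests but does not state explicitly; under that assumption the nine coefficients sum to one, and the three belonging to each first-level state sum to that state's branching probability. The function name and the numerical values are illustrative.

```python
def alpha_coefficients(P1, P2, P1p, P2p):
    """Return alpha_1..alpha_9 as a dict, with P1p and P2p standing for P'_1 and P'_2."""
    q1, q2 = 1 - P1, 1 - P2          # assumption: q is the complement of P
    q1p, q2p = 1 - P1p, 1 - P2p
    return {
        1: P1 * q2 * P1p * q2p, 2: P1 * q2 * P1p * P2p, 3: P1 * q2 * q1p,
        4: P1 * P2 * P1p * q2p, 5: P1 * P2 * P1p * P2p, 6: P1 * P2 * q1p,
        7: q1 * P1p * q2p,      8: q1 * P1p * P2p,      9: q1 * q1p,
    }

# Illustrative probabilities only.
alpha = alpha_coefficients(0.25, 0.35, 0.30, 0.35)

if __name__ == "__main__":
    for k in sorted(alpha):
        print(f"alpha_{k} = {alpha[k]:.6f}")
    print("sum =", sum(alpha.values()))
```

A check of this kind is a cheap way to catch a transposed subscript before the coefficients are fed into the differential equations.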
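When $t \to \infty$, the steady-state equations governing the model are obtained from (3.1)-
(3.14) as follows:
\[ -2\lambda Q_{00} + \mu_m Q_b + \mu_S Q_a = 0 \tag{3.15} \]
\[ 2\lambda P_1 q_2 Q_{00} + \mu'_m Q_{bb} + \mu'_S Q_{ba} - \left[\mu_m + \lambda'_b + \lambda_b(\alpha_1 + \alpha_2 + \alpha_3)\right] Q_b = 0 \tag{3.16} \]
\[ 2\lambda P_1 P_2 Q_{00} + \mu'_m Q_{ab} + \mu'_S Q_{aa} - \left[\mu_S + \lambda_a(\alpha_4 + \alpha_5 + \alpha_6)\right] Q_a = 0 \tag{3.17} \]
\[ 2\lambda q_1 Q_{00} + \mu'_m Q_{cb} + \mu'_S Q_{ca} - \lambda_c(\alpha_7 + \alpha_8 + \alpha_9) Q_c = 0 \tag{3.18} \]
\[ \lambda_b \alpha_1 Q_b - (\mu'_m + \Lambda_{bb}) Q_{bb} = 0 \tag{3.19} \]
\[ \lambda_b \alpha_2 Q_b - (\mu'_S + \lambda_{ba}) Q_{ba} = 0 \tag{3.20} \]
\[ \lambda_b \alpha_3 Q_b - \Lambda_{bc} Q_{bc} = 0 \tag{3.21} \]
\[ \lambda_a \alpha_4 Q_a - (\mu'_m + \Lambda_{ab}) Q_{ab} = 0 \tag{3.22} \]
\[ \lambda_a \alpha_5 Q_a - (\mu'_S + \lambda_{aa}) Q_{aa} = 0 \tag{3.23} \]
\[ \lambda_a \alpha_6 Q_a - \Lambda_{ac} Q_{ac} = 0 \tag{3.24} \]
\[ \lambda_c \alpha_7 Q_c - (\mu'_m + \Lambda_{cb}) Q_{cb} = 0 \tag{3.25} \]
\[ \lambda_c \alpha_8 Q_c - (\mu'_S + \lambda_{ca}) Q_{ca} = 0 \tag{3.26} \]
\[ \lambda_c \alpha_9 Q_c - \Lambda_{cc} Q_{cc} = 0 \tag{3.27} \]
\[ \lambda'_b Q_b + \Lambda_{bb} Q_{bb} + \lambda_{ba} Q_{ba} + \Lambda_{bc} Q_{bc} + \Lambda_{ab} Q_{ab} + \lambda_{aa} Q_{aa} + \Lambda_{ac} Q_{ac} + \Lambda_{cb} Q_{cb} + \lambda_{ca} Q_{ca} + \Lambda_{cc} Q_{cc} = 0 \tag{3.28} \]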
Chapter-3: Reliability Modeling of Hardware and Software Interaction
61
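With the help of equations (3.15)-(3.28), we get the matrix equation
\[ A Q = 0 \tag{3.29} \]
where
\[ Q = [Q_{00}, Q_b, Q_a, Q_c, Q_{bb}, Q_{ba}, Q_{bc}, Q_{ab}, Q_{aa}, Q_{ac}, Q_{cb}, Q_{ca}, Q_{cc}]^T, \qquad A = \begin{pmatrix} A_1 & A_2 \\ A_3 & A_4 \end{pmatrix}. \]
Here
\[ A_1 = \begin{pmatrix}
2\lambda & -\mu_m & -\mu_S & 0 & 0 & 0 & 0 \\
-2\lambda P_1 q_2 & \theta_1 & 0 & 0 & -\mu'_m & -\mu'_S & 0 \\
-2\lambda P_1 P_2 & 0 & \theta_2 & 0 & 0 & 0 & 0 \\
-2\lambda q_1 & 0 & 0 & \theta_3 & 0 & 0 & 0 \\
0 & -\lambda_b\alpha_1 & 0 & 0 & \mu'_m+\Lambda_{bb} & 0 & 0 \\
0 & -\lambda_b\alpha_2 & 0 & 0 & 0 & \mu'_S+\lambda_{ba} & 0 \\
0 & -\lambda_b\alpha_3 & 0 & 0 & 0 & 0 & \Lambda_{bc}
\end{pmatrix} \]
\[ A_2 = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
-\mu'_m & -\mu'_S & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -\mu'_m & -\mu'_S & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix} \]
\[ A_3 = \begin{pmatrix}
0 & 0 & -\lambda_a\alpha_4 & 0 & 0 & 0 & 0 \\
0 & 0 & -\lambda_a\alpha_5 & 0 & 0 & 0 & 0 \\
0 & 0 & -\lambda_a\alpha_6 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -\lambda_c\alpha_7 & 0 & 0 & 0 \\
0 & 0 & 0 & -\lambda_c\alpha_8 & 0 & 0 & 0 \\
0 & 0 & 0 & -\lambda_c\alpha_9 & 0 & 0 & 0
\end{pmatrix} \]
\[ A_4 = \begin{pmatrix}
\mu'_m+\Lambda_{ab} & 0 & 0 & 0 & 0 & 0 \\
0 & \mu'_S+\lambda_{aa} & 0 & 0 & 0 & 0 \\
0 & 0 & \Lambda_{ac} & 0 & 0 & 0 \\
0 & 0 & 0 & \mu'_m+\Lambda_{cb} & 0 & 0 \\
0 & 0 & 0 & 0 & \mu'_S+\lambda_{ca} & 0 \\
0 & 0 & 0 & 0 & 0 & \Lambda_{cc}
\end{pmatrix} \]
where
\[ \theta_1 = \mu_m + \lambda'_b + \lambda_b(\alpha_1+\alpha_2+\alpha_3) \]
\[ \theta_2 = \mu_S + \lambda_a(\alpha_4+\alpha_5+\alpha_6) \]
\[ \theta_3 = \lambda_c(\alpha_7+\alpha_8+\alpha_9). \]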
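Denote
\[ Q = [P_1, P_2, P_3]^T \]
where
\[ P_1 = [Q_{00}, Q_b, Q_a, Q_c, Q_{bb}] \]
\[ P_2 = [Q_{ba}, Q_{bc}, Q_{ab}, Q_{aa}, Q_{ac}] \]
\[ P_3 = [Q_{cb}, Q_{ca}, Q_{cc}]. \]
The steady state availability of the system can be obtained as:
\[ A = Q_{00} + Q_a + Q_b + Q_c + Q_{bb} + Q_{ba} + Q_{bc} + Q_{aa} + Q_{ab} + Q_{ac} + Q_{cb} + Q_{ca} + Q_{cc} \tag{3.30} \]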
To solve the equations (3.15)-(3.28), we use the successive over-relaxation (SOR)
technique, a numerical method for solving a linear system of equations. The SOR
technique can be applied to any convergent iterative process.
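The SOR iteration mentioned above can be sketched as follows. This is a minimal illustration in Python rather than the thesis's own implementation: the function name `sor_solve`, the relaxation factor `omega = 1.2`, the tolerance and the small 3x3 test system are all assumptions chosen for the example. For a homogeneous system such as (3.29), one common approach (again an assumption, not stated in the text) is to replace one balance equation with a normalizing condition before iterating.

```python
def sor_solve(A, b, omega=1.2, tol=1e-10, max_iter=10_000):
    """Solve A x = b by successive over-relaxation (SOR).

    A is a square matrix given as a list of rows with non-zero diagonal,
    b is the right-hand side, omega is the relaxation factor (0 < omega < 2).
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(max_iter):
        max_change = 0.0
        for i in range(n):
            # Gauss-Seidel value for x[i], using the newest available entries
            sigma = sum(A[i][j] * x[j] for j in range(n) if j != i)
            gauss_seidel = (b[i] - sigma) / A[i][i]
            # Relax between the old value and the Gauss-Seidel value
            new_value = (1.0 - omega) * x[i] + omega * gauss_seidel
            max_change = max(max_change, abs(new_value - x[i]))
            x[i] = new_value
        if max_change < tol:
            return x
    raise RuntimeError("SOR did not converge within max_iter iterations")


# Hypothetical test system (symmetric positive definite); exact solution is [1, 2, 3]
A = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
b = [2.0, 4.0, 10.0]
solution = sor_solve(A, b)
```

With `omega = 1` the update reduces to the Gauss-Seidel method; values between 1 and 2 over-relax the update, which is what gives the technique its name.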
3.4 Numerical Results
Extensive numerical results have been obtained to examine the effect of
various parameters on the system availability and are displayed in figures 3.2(a-f). In
order to compute various performance indices, the default parameters are taken as
follows:
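$\lambda = 0.4$, $\lambda_b = 0.3$, $\lambda_a = 0.4$, $\lambda_c = 0.3$, $\lambda'_b = 0.5$, $\lambda'_a = 0.6$, $\lambda'_c = 0.5$, $\lambda_{bb} = 0.7$, $\lambda'_{bb} = 0.7$,
$\lambda_{ba} = 0.4$, $\lambda'_{ba} = 0.5$, $\lambda_{bc} = 0.6$, $\lambda'_{bc} = 0.5$, $\lambda_{ab} = 0.7$, $\lambda'_{ab} = 0.8$, $\lambda_{aa} = 0.6$, $\lambda'_{aa} = 0.7$,
$\lambda_{ac} = 0.5$, $\lambda'_{ac} = 0.4$, $\lambda_{cb} = 0.6$, $\lambda'_{cb} = 0.7$, $\lambda_{ca} = 0.7$, $\lambda'_{ca} = 0.6$, $\lambda_{cc} = 0.5$, $\lambda'_{cc} = 0.5$,
$\mu_m = 0.4$, $\mu'_m = 2.1$, $\mu_S = 1.5$, $\mu'_S = 2$, $P_1 = 0.25$, $P_2 = 0.35$, $P'_1 = 0.30$, $P'_2 = 0.35$.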
Figure 3.2(a) depicts the availability for varying failure rate ($\lambda$) and different
values of $\mu_m$. The availability gradually decreases as the failure rate increases,
and it is higher for higher values of $\mu_m$. Figure 3.2(b) exhibits the system
availability for varying software repair rate $\mu_S$ under different values of $\lambda$.
The availability increases sharply with $\mu_S$, as expected, and it is greater for
lower values of $\lambda$ than for higher values.
Figs 3.2(c-d) show the effect of the failure rate $\lambda$ on system availability for
different values of $P_1$ and $P_2$, respectively. In both figures the availability
decreases as the failure rate $\lambda$ increases. As the detection probability of hardware
degradation ($P_1$) increases, the availability decreases significantly, whereas the
availability increases with the recovery probability from hardware degradation ($P_2$).
Figs 3.2(e)-3.2(f) show the pattern of availability for varying failure rate
($\lambda$) and different values of $P'_1$ and $P'_2$, respectively. The availability
decreases as $\lambda$ increases; the decrements are more pronounced for lower values
of $\lambda$ than for higher values, and the availability tends to a constant for higher
values of $\lambda$. Different values of $P'_1$ and $P'_2$ have no significant effect on
the system availability.
3.5 Conclusion
A software/hardware reliability model that takes into account the
interactions between hardware and software subsystems has been developed. We have
examined the effects of hardware/software failures on the availability. The availability
evaluated for the whole computer system may be helpful to system designers and
decision makers in the future design and upgrading of embedded systems.
Fig. 3.2(a): Availability (A) vs $\lambda$ by varying $\mu_m$ ($\mu_m$ = 1, 3, 5)
Fig. 3.2(b): Availability (A) vs $\mu_S$ by varying $\lambda$ ($\lambda$ = 0.4, 0.7, 1)
Fig. 3.2(c): Availability (A) vs $\lambda$ by varying $P_1$ ($P_1$ = 0.25, 0.45, 0.65)
Fig. 3.2(d): Availability (A) vs $\lambda$ by varying $P_2$ ($P_2$ = 0.35, 0.45, 0.55)
Fig. 3.2(e): Availability (A) vs $\lambda$ by varying $P'_1$ ($P'_1$ = 0.30, 0.40, 0.50)
Fig. 3.2(f): Availability (A) vs $\lambda$ by varying $P'_2$ ($P'_2$ = 0.35, 0.45, 0.55)
Chapter-4
Distributed Software and Hardware Systems
4.1 Introduction
4.2 Model Description
4.3 The Equations and Analysis
4.4 Performance Indices
4.5 Numerical Results
4.6 Conclusion
Chapter-4: Distributed Software and Hardware Systems
The reliability/availability issues are key ingredients of the
performance quantification of the distributed system for design
and development of a system. The present investigation is
concerned with a multi-host system with standbys. When all
standbys are used, the system begins to work in degraded mode.
Both software and hardware failures are taken into account along
with the assumption that the software faults are constantly being
identified and removed. The common cause failure, which is an
important factor in predicting the availability of a realistic system, is also
taken into consideration. A Markov model is developed by
constructing the governing transient equations in terms of the
probabilities of various system states. These probabilities are also
employed to obtain some reliability indices. A numerical experiment
has been performed by using the Runge-Kutta method with the help of
MATLAB.
4.1 Introduction
The prediction of reliability/availability is a key concern of the system engineer
in any distributed system. In recent years, distributed computing systems have
become more popular because low-cost processors can be shared among the hosts.
A distributed system is a type of cluster system, that is, a collection of
computers in which any member of the cluster is capable of supporting the processing
functions of any other member. Over the years, many researchers have
developed Markov models to analyze component availabilities. The common
cause failure is an important factor that should be incorporated to predict the
availability of a system working in distributed environments. In the present
investigation, we analyze the reliability issues of a distributed system, which is
subject to degradation and may fail due to common cause. The repair facility is
provided to restore the partially failed system to original state. The system may fail
partially from a good state or from any degraded state. The system can also fail due to
common cause, for example due to electrical/mechanical fault or due to
humidity/voltage problem. We consider that the system may also fail partially at any
time. The availability, which is the probability that the system is operating
satisfactorily at time t, is evaluated by using a reliability theory approach.
The reliability analysis becomes more complicated when components of
distributed system are subject to common cause failure during any phase of the
mission. Common cause failures are multiple dependent component failures within a
system that are a direct result of a common cause. Many researchers have described
the hardware and/or software reliability models in different frameworks under
common cause failures. Jankala and Vaurio (1993) made a residual common cause
failure analysis to predict a probabilistic safety assessment. Jain (1998) did the
reliability analysis of two-unit system with common cause failure. Goseva-
Popstojanova and Trivedi (2000) developed Markov model for failure correlation and
studied its effects on the software reliability measures. Zhang and Horigome (2001)
discussed the availability and reliability of the system with dependent components
and time-varying failure and repair rates. Lewis (2001) considered a load-capacity
interference model with common-mode failure in 1-out-of-2: G system. Kvam and
Miller (2002) provided the common cause failure prediction using data mapping.
Nakagawa and Yasui (2003) worked on the reliability of a system complexity.
Vaurio (2003) obtained the common cause failure probabilities in standby safety
system by considering the fault tree analysis with testing-scheme and timing
dependencies. Yadavalli et al. (2005) presented the Bayesian study of a two-
component system with common cause shock failure. Lu and Lewis (2006) have done
the reliability evaluation of standby safety systems due to independent and common
cause failures. Xing et al. (2007) discussed the reliability analysis of hierarchical
computer based systems subject to common cause failures. Van et al. (2008)
presented the perturbation method to estimate an importance factor in the framework
of steady-state sensitivity analysis of Markov processes in reliability studies. Atwood
and Kelly (2008) considered the binomial failure rate to analyze a common cause
model. A two-stage approach for multi-objective decision making with applications to
system reliability optimization was studied by Li et al. (2009b). Gamiz and Miranda
(2010) gave a regression analysis of the structure function for reliability evaluation of
continuous-state systems. Savage and Son (2011) proposed the set theory method
for the system reliability of structures with degrading components.
The replacement of failed units by standbys, as in multi-component machining
systems, is useful in computer systems working in a distributed environment. The
provision of a repair facility in addition to spares may be helpful in the smooth running
of the software and hardware systems. Park and Kim (2002) did an availability
analysis for the improvement of active/standby cluster systems using software
rejuvenation. Kim and Dshalalow (2002) analyzed the stochastic disaster recovery
systems with external resources. Zhang and Wang (2007) studied a deteriorating cold
standby repairable system with priority in use. Lim et al. (2008) explored the diversity
and fault avoidance for dependable replication systems. Ke et al. (2008) considered a
repairable system with detection, imperfect coverage and reboot. Chakravarthy and
Gomez-Corral (2009) discussed the influence of delivery times on repairable k-out-
of-N systems with spares. Erglmaz (2010) gave mixture representations for the
reliability of consecutive-k systems. Arya et al. (2011) described a methodology for
reliability enhancement of a radial distribution system by determining optimal values of
the repair times and failure rates of each section.
The purpose of the investigation in this chapter is to find the availability/reliability
indices for a distributed computer system having standbys and subject to common cause
failures. A Markov chain is used to construct a set of differential difference equations
for the transient probabilities governing the model. We evaluate the transient
probabilities with the help of the R-K method. The rest of the chapter is organized as
follows. In section 4.2, we outline the model description along with some assumptions.
In section 4.3, we construct the governing equations by using appropriate transition
rates. In section 4.4, we establish some performance indices. In section 4.5, we present
the numerical results, which are obtained with the help of the R-K method. Section 4.6 is
devoted to the concluding remarks.
4.2 Model Description
Consider the redundant N+Y=K configuration for a distributed computer system
having N operating hosts and Y processing hosts as spare hosts. If all of the N
hosts fail, the system fails; otherwise, as long as one host is working, the system is still
working. The following assumptions are made to formulate the model.
The system consists of N operating components (i.e. hosts), Y spare components and two software components. When any operating host fails, it can be replaced by an available spare host.
All the operating (spare) hosts have the same hardware failure rate $\lambda_h$ ($\alpha$) arising from an exponential distribution.
Both software and hardware components have only two states: working state and failed state.
There is repair provision as and when a software or hardware malfunction occurs.
The repair times are exponentially distributed with parameter $\mu_s$ ($\mu'_s$) for software failure and $\mu_h$ ($\mu'_h$) for hardware failure; here $\mu'_s$ is the faster repair rate for software failure of the second software and $\mu'_h$ is the faster repair rate for hardware failure of the second one.
All the failures involved are mutually independent.
$\lambda_s(t)$, $\lambda'_s(t)$ are the software failure rates caused by all software faults, which occur in Poisson fashion.
$P_i(t)$ is the probability of the system being in the ith (i = 0, 1, 2, ..., 15) state at time t.
The system may also fail due to common cause with failure rate $\lambda_P$.
When all standbys are exhausted, the remaining operating hosts fail with degraded failure rate $\lambda_d$.
4.3 The Equations and Analysis
The differential difference equations governing the model are constructed by
using the transition rates $\lambda_n$ and $\mu_n$ for the failure and repair processes, respectively.
When there are n failed operating hosts in the system, the state dependent failure rates
are given by
\[ \lambda_n = \begin{cases} N\lambda + (Y-n)\alpha, & 0 \le n \le Y \\ (N+Y-n)\lambda_d, & Y < n \le K-1 \end{cases} \]
where $\lambda$ denotes the hardware failure rate of an operating host.
For N operating components and Y spares, the transition flow among all states
A = {0, 1, 2, ..., 15} for the particular case N=5 and Y=2 is shown in figure 4.1.
Let $P_i(t)$ be the probability of the system being in the ith (i = 0, 1, 2, ..., 15) state
at time t. $P_0(t)$, $P_1(t)$ and $P_2(t)$ denote the probabilities that both software
and N operating components are functioning well at time t. Let $P_i(t)$ (i = 3, 4) be the
probability that both software are working, all operating components have failed and the
system is working in degraded mode, with the faster repair rate $\mu'_h$ adopted for
hardware failure. $P_5(t)$, $P_6(t)$ and $P_7(t)$ are the probabilities that one
software is in the failed state at time t and repair is rendered with the faster rate $\mu'_h$;
$\lambda'_s$ is the failure rate for software when N operating components are working.
$P_8(t)$ and $P_9(t)$ denote the probabilities that one software and all operating
components have failed while one software as well as the spares are working; the failed
hardware components are repaired with the faster rate $\mu'_h$. $P_i(t)$
(i = 10, 11, 12, 13, 14) is the probability that at time t both software have failed and
there is no repair process. $P_{15}(t)$ denotes the probability that at time t the
entire system has failed.
Now the differential difference equations governing the model, developed with the
help of the transition diagram, are given below:
\[ \frac{dP_0}{dt} = \mu_h P_1(t) + \mu_s P_5(t) - [2\lambda_s + \lambda_P + 5\lambda + 2\alpha]\, P_0(t) \tag{4.1} \]
\[ \frac{dP_1}{dt} = \mu_h P_2(t) + \mu_s P_6(t) + (5\lambda + 2\alpha) P_0(t) - [2\lambda_s + \lambda_P + \mu_h + 5\lambda + \alpha]\, P_1(t) \tag{4.2} \]
\[ \frac{dP_2}{dt} = \mu'_h P_3(t) + \mu_s P_7(t) + (5\lambda + \alpha) P_1(t) - [2\lambda_s + \lambda_P + \mu_h + 5\lambda]\, P_2(t) \tag{4.3} \]
\[ \frac{dP_3}{dt} = \mu'_h P_4(t) + \mu_s P_8(t) + 5\lambda P_2(t) - [2\lambda_s + \lambda_P + \mu'_h + 4\lambda_d]\, P_3(t) \tag{4.4} \]
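\[ \frac{dP_4}{dt} = \mu_s P_9(t) + 4\lambda_d P_3(t) - [2\lambda_s + \lambda_P + \mu'_h + 3\lambda_d]\, P_4(t) \tag{4.5} \]
\[ \frac{dP_5}{dt} = 2\lambda_s P_0(t) + \mu_h P_6(t) + \mu'_s P_{10}(t) - [\mu_s + \lambda'_s + \lambda_P + 5\lambda + 2\alpha]\, P_5(t) \tag{4.6} \]
\[ \frac{dP_6}{dt} = 2\lambda_s P_1(t) + \mu_h P_7(t) + \mu'_s P_{11}(t) + (5\lambda + 2\alpha) P_5(t) - [\mu_s + \lambda'_s + \lambda_P + \mu_h + 5\lambda + \alpha]\, P_6(t) \tag{4.7} \]
\[ \frac{dP_7}{dt} = 2\lambda_s P_2(t) + \mu'_h P_8(t) + \mu'_s P_{12}(t) + (5\lambda + \alpha) P_6(t) - [\mu_s + \lambda'_s + \lambda_P + \mu_h + 5\lambda]\, P_7(t) \tag{4.8} \]
\[ \frac{dP_8}{dt} = 2\lambda_s P_3(t) + \mu'_h P_9(t) + \mu'_s P_{13}(t) + 5\lambda P_7(t) - [\mu_s + \lambda'_s + \lambda_P + \mu'_h + 4\lambda_d]\, P_8(t) \tag{4.9} \]
\[ \frac{dP_9}{dt} = 2\lambda_s P_4(t) + \mu'_s P_{14}(t) + 4\lambda_d P_8(t) - [\mu_s + \lambda'_s + \lambda_P + \mu'_h + 3\lambda_d]\, P_9(t) \tag{4.10} \]
\[ \frac{dP_{10}}{dt} = \lambda'_s P_5(t) - [\mu'_s + \lambda_P]\, P_{10}(t) \tag{4.11} \]
\[ \frac{dP_{11}}{dt} = \lambda'_s P_6(t) - [\mu'_s + \lambda_P]\, P_{11}(t) \tag{4.12} \]
\[ \frac{dP_{12}}{dt} = \lambda'_s P_7(t) - [\mu'_s + \lambda_P]\, P_{12}(t) \tag{4.13} \]
\[ \frac{dP_{13}}{dt} = \lambda'_s P_8(t) - [\mu'_s + \lambda_P]\, P_{13}(t) \tag{4.14} \]
\[ \frac{dP_{14}}{dt} = \lambda'_s P_9(t) - [\mu'_s + \lambda_P]\, P_{14}(t) \tag{4.15} \]
\[ \frac{dP_{15}}{dt} = 3\lambda_d P_4(t) + 3\lambda_d P_9(t) + \lambda_P \sum_{i=0}^{14} P_i(t) \tag{4.16} \]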
In order to solve the above set of equations (4.1)-(4.16), we impose the initial
condition $P_1(0)=1$, $P_i(0)=0$, $i \ne 1$. A numerical method based on the fourth-order
R-K scheme is used to obtain the transient probabilities $P_i(t)$, $0 \le i \le 15$.
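The way a fourth-order Runge-Kutta scheme produces transient probabilities can be sketched on a reduced version of the model. The Python sketch below is an illustration under stated assumptions, not the thesis's MATLAB code: it keeps only the hardware part of the chain (n = 0, ..., 4 failed hosts plus one absorbing failure state), ignores software states, uses a single repair rate `mu_h`, starts from the state with no failed hosts, and uses illustrative parameter values. The function names are hypothetical.

```python
def failure_rate(n, N, Y, lam, alpha, lam_d):
    """State-dependent failure rate lambda_n with n failed hosts."""
    if n <= Y:
        return N * lam + (Y - n) * alpha   # spares still available
    return (N + Y - n) * lam_d             # degraded mode, spares exhausted


def derivatives(p, N=5, Y=2, lam=0.1, alpha=0.3, lam_d=0.09, mu_h=8.0, lam_p=0.4):
    """Right-hand side dp/dt of the reduced hardware-only chain.

    p[n] is the probability of n failed hosts for n = 0..len(p)-2 and
    p[-1] is the probability of the absorbing system-failure state.
    """
    failed = len(p) - 1
    dp = [0.0] * len(p)
    for n in range(failed):
        fail_flow = failure_rate(n, N, Y, lam, alpha, lam_d) * p[n]
        common_flow = lam_p * p[n]          # common cause failure
        dp[n] -= fail_flow + common_flow
        dp[n + 1] += fail_flow              # last working state feeds the failure state
        dp[failed] += common_flow
        if n >= 1:
            repair_flow = mu_h * p[n]       # repair of one failed host
            dp[n] -= repair_flow
            dp[n - 1] += repair_flow
    return dp


def rk4_step(f, p, h):
    """One classical fourth-order Runge-Kutta step for dp/dt = f(p)."""
    k1 = f(p)
    k2 = f([x + 0.5 * h * k for x, k in zip(p, k1)])
    k3 = f([x + 0.5 * h * k for x, k in zip(p, k2)])
    k4 = f([x + h * k for x, k in zip(p, k3)])
    return [x + h / 6.0 * (a + 2 * b + 2 * c + d)
            for x, a, b, c, d in zip(p, k1, k2, k3, k4)]


def solve(t_end=3.0, h=0.01, working_states=5):
    """Integrate from t = 0 to t_end with equal steps of size h."""
    p = [1.0] + [0.0] * working_states      # assumed start: no failed hosts
    history = [(0.0, p)]
    steps = int(round(t_end / h))
    for i in range(1, steps + 1):
        p = rk4_step(derivatives, p, h)
        history.append((i * h, p))
    return history


history = solve()
# Probability of not yet being in the absorbing failure state at each time point
reliability = [1.0 - p[-1] for _, p in history]
```

Because every flow leaving one state enters another, the derivatives sum to zero and the probabilities keep summing to one, which is a useful sanity check on any implementation of (4.1)-(4.16).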
4.4 Performance Indices
Now we establish various indices by using the probabilities obtained in the previous
section as follows:
The reliability of the system is given by
\[ R(t) = \sum_{n=N+Y}^{K-3} P_{0,n}(t) + \sum_{n=N+Y}^{K-3} P_{1,n}(t) \tag{4.17} \]
Expected number of operating hardware units in the system, when one software is
failed
\[ EO_1(t) = \sum_{n=Y+1}^{K} \{N-(n-Y)\}\, P_{1,n}(t) \tag{4.18} \]
Expected number of operating hardware units in the system, when both
softwares are failed
\[ EO_2(t) = \sum_{n=Y+1}^{K} \{N-(n-Y)\}\, P_{2,n}(t) \tag{4.19} \]
Expected number of spare hardware units in the system is obtained as
\[ ES(t) = \sum_{n=1}^{Y} (Y-n) \{ P_{0,n}(t) + P_{1,n}(t) + P_{2,n}(t) \} \tag{4.20} \]
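As a worked instance of the expected number of spare units, consider the case N = 5, Y = 2 used in the numerical section. The mapping between the double-index probabilities and the sixteen states, $P_{j,n}(t) = P_{5j+n}(t)$, is an assumption made here for illustration and is not stated in the source.

```latex
\begin{align*}
ES(t) &= \sum_{n=1}^{2} (2-n)\,\{P_{0,n}(t) + P_{1,n}(t) + P_{2,n}(t)\} \\
      &= (2-1)\,\{P_{0,1}(t) + P_{1,1}(t) + P_{2,1}(t)\} + (2-2)\,\{P_{0,2}(t) + P_{1,2}(t) + P_{2,2}(t)\} \\
      &= P_{1}(t) + P_{6}(t) + P_{11}(t).
\end{align*}
```

Under this assumed mapping, the initial condition $P_1(0) = 1$ gives $ES(0) = 1$, which agrees with the first row of Tables 4.1(a)-4.1(c).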
4.5 Numerical Results
In this section, we perform a numerical experiment for the transient analysis
by employing the Runge-Kutta technique (RKT) of fourth order to solve the system of
differential equations. The R-K method is implemented by using MATLAB's 'ode45'
function, an adaptive solver based on a Runge-Kutta (4,5) pair. The time span is
divided into equal intervals. We set the default parameters as
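$N = 5$, $Y = 2$, $m = 3$, $\lambda_d = 0.09$, $\lambda_s = 0.06$, $\lambda_P = 0.4$, $\lambda'_s = 0.7$, $\lambda = 0.1$, $\alpha = 0.3$, $\mu_h = 8$,
$\mu_s = 3$, $\mu'_h = 2.5$, $\mu'_s = 2.9$.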
The numerical results for different performance indices are summarized in
tables 4.1(a)-4.1(c). These indices are the reliability, the expected number of
operating units when one software has failed, the expected number of operating units
when both software have failed and the expected number of spare hardware units in
the system. The reliability is also presented graphically in figs 4.2(a)-4.2(c).
Tables 4.1(a)-4.1(c) display the numerical results to examine the effect of
varying time t on the reliability R(t), the expected number of operating units when one
and both software have failed, and the expected number of spare hardware units in the
system for different values of $\lambda_P$, $\mu_s$ and $\alpha$, respectively. Table 4.1(a)
reveals that R(t), EO1(t) and ES(t) decrease as time t increases, but EO2(t) initially
decreases slightly and after some time tends to be constant with time t. From tables
4.1(b)-4.1(c), we observe that when the parameters $\mu_s$ and $\alpha$ increase, R(t),
EO1(t) and ES(t) decrease but EO2(t) increases with time t.
From figs 4.2(a)-4.2(c), we see the variation of reliability with time for
different values of $\lambda_P$, $\mu_s$ and $\alpha$, respectively. We notice that the
reliability decreases as time increases, which is to be expected. As $\lambda_P$ ($\mu_s$)
increases, the reliability decreases (increases). The effect of the failure rate $\alpha$
of a spare host on the reliability is not very significant.
4.6 Conclusion
In the present investigation, we have analyzed the performance of a distributed
computer system by incorporating the common cause failure. The system can fail
due to either a software failure or a hardware failure. The Markov model developed
for evaluating various performance measures, such as the reliability, the expected number
of operative hardware units in the system and the expected number of spare units, may be
applied to many embedded systems, including computer, communication and
telecommunication systems.
Fig. 4.1: State transition diagram
Table 4.1(a): Performance indices for different values of λ_P

        λ_P = 0.1                      λ_P = 0.5                      λ_P = 0.9
t       R(t)  EO1(t) EO2(t) ES(t)     R(t)  EO1(t) EO2(t) ES(t)     R(t)  EO1(t) EO2(t) ES(t)
0.00    1.00  0.00   0.00   1.00      1.00  0.00   0.00   1.00      1.00  0.00   0.00   1.00
1.00    0.86  0.22   0.04   0.85      0.58  0.14   0.03   0.57      0.39  0.10   0.02   0.42
1.25    0.83  0.22   0.05   0.83      0.51  0.13   0.03   0.50      0.31  0.09   0.02   0.35
1.50    0.81  0.22   0.05   0.81      0.44  0.12   0.03   0.44      0.24  0.07   0.02   0.28
1.75    0.79  0.21   0.05   0.79      0.39  0.10   0.03   0.39      0.19  0.06   0.01   0.23
2.00    0.77  0.21   0.05   0.77      0.34  0.09   0.02   0.35      0.15  0.05   0.01   0.19
2.25    0.75  0.21   0.05   0.75      0.30  0.08   0.02   0.30      0.12  0.04   0.01   0.16
2.50    0.73  0.20   0.05   0.73      0.27  0.07   0.02   0.27      0.10  0.03   0.01   0.13
2.75    0.71  0.20   0.05   0.71      0.24  0.06   0.02   0.24      0.08  0.02   0.01   0.10
3.00    0.69  0.19   0.05   0.69      0.21  0.05   0.01   0.21      0.06  0.02   0.01   0.09
Table 4.1(b): Performance indices for different values of μ_S

        λ_P = 0.1                      λ_P = 0.5                      λ_P = 0.9
t       R(t)  EO1(t) EO2(t) ES(t)     R(t)  EO1(t) EO2(t) ES(t)     R(t)  EO1(t) EO2(t) ES(t)
0.00    1.00  0.00   0.00   1.00      1.00  0.00   0.00   1.00      1.00  0.00   0.00   1.00
1.00    0.86  0.22   0.04   0.85      0.58  0.14   0.03   0.57      0.39  0.10   0.02   0.42
1.25    0.83  0.22   0.05   0.83      0.51  0.13   0.03   0.50      0.31  0.09   0.02   0.35
1.50    0.81  0.22   0.05   0.81      0.44  0.12   0.03   0.44      0.24  0.07   0.02   0.28
1.75    0.79  0.21   0.05   0.79      0.39  0.10   0.03   0.39      0.19  0.06   0.01   0.23
2.00    0.77  0.21   0.05   0.77      0.34  0.09   0.02   0.35      0.15  0.05   0.01   0.19
2.25    0.75  0.21   0.05   0.75      0.30  0.08   0.02   0.30      0.12  0.04   0.01   0.16
2.50    0.73  0.20   0.05   0.73      0.27  0.07   0.02   0.27      0.10  0.03   0.01   0.13
2.75    0.71  0.20   0.05   0.71      0.24  0.06   0.02   0.24      0.08  0.02   0.01   0.10
3.00    0.69  0.19   0.05   0.69      0.21  0.05   0.01   0.21      0.06  0.02   0.01   0.09
Table 4.1(c): Performance indices for different values of α

        α = 0.3                        α = 0.6                        α = 0.9
t       R(t)  EO1(t) EO2(t) ES(t)     R(t)  EO1(t) EO2(t) ES(t)     R(t)  EO1(t) EO2(t) ES(t)
0.00    1.00  0.00   0.00   1.00      1.00  0.00   0.00   1.00      1.00  0.00   0.00   1.00
1.00    0.59  0.05   0.01   0.50      0.59  0.05   0.01   0.45      0.59  0.05   0.01   0.41
1.25    0.52  0.04   0.01   0.42      0.52  0.04   0.01   0.38      0.51  0.04   0.01   0.34
1.50    0.46  0.04   0.01   0.36      0.45  0.04   0.01   0.32      0.45  0.04   0.01   0.29
1.75    0.40  0.03   0.01   0.31      0.39  0.03   0.01   0.27      0.39  0.03   0.01   0.24
2.00    0.35  0.03   0.00   0.27      0.34  0.03   0.00   0.24      0.34  0.03   0.00   0.21
2.25    0.31  0.02   0.00   0.23      0.30  0.02   0.00   0.20      0.29  0.02   0.00   0.18
2.50    0.27  0.02   0.00   0.20      0.26  0.02   0.00   0.18      0.25  0.02   0.00   0.15
2.75    0.23  0.02   0.00   0.18      0.23  0.02   0.00   0.15      0.22  0.02   0.00   0.13
3.00    0.20  0.02   0.00   0.15      0.20  0.02   0.00   0.13      0.19  0.02   0.00   0.12
Fig. 4.2(a): Reliability R(t) vs time t by varying $\lambda_P$ ($\lambda_P$ = 0.1, 0.5, 0.9)
Fig. 4.2(b): Reliability R(t) vs time t by varying $\mu_s$ ($\mu_s$ = 3, 6, 9)
Fig. 4.2(c): Reliability R(t) vs time t by varying $\alpha$ ($\alpha$ = 0.3, 0.6, 0.9)
Chapter-5
Transient Analysis of a Hardware-Software System
Section-5A Embedded Computer System with Two Types of Failure and Common Cause Failure
Section-5B Hardware and Software Systems with Warm Standbys and Switching Failures
Section-5A
Embedded Computer System with Two Types of Failure and Common Cause Failure
5A.1 Introduction
5A.2 Model Description
5A.3 The Analysis
5A.4 Illustration
5A.5 Performance Indices
5A.6 Numerical Results
5A.7 Conclusion
Section-5A: Embedded Computer System with Two Types of Failures…
Redundancy of hardware components is generally required to
design highly reliable embedded computer systems. A common
form of redundancy is a K-out-of-N: G system in which at least K
out of N components must be good for the system to be good. This
investigation is concerned with a Markov model for K-out-of-N: G
system with common cause failure. The hardware system consists
of N non-identical components and Y warm standby components.
There is a single repairman who repairs the failed components on a
first-come-first-served basis. The developed probabilistic model
represents the redundant computer system with one software
component. The software/hardware system along with human
error and hardware error has been investigated in order to obtain
the reliability indices under the assumption that each component
may fail due to two types of failure (hardware and human),
common cause failure or software failure. Numerical results have been
obtained with the help of the fourth-order Runge-Kutta method to
validate the analytical results. The sensitivity of the system
availability to the parameters has also been examined.
5A.1 Introduction
These days, embedded computers are used in day-to-day life as well as in different
areas such as transportation, nuclear reactors, aircraft, hospital operation, etc. The
reliability has become an important aspect of the planning, design and operation of all
engineering systems. The aim of reliability analysis is to measure the probability that
the designed equipment will perform its intended function in the hands of the customers.
In the field of reliability, a large number of researchers have paid attention to the
estimation as well as the prediction of the reliability of computer hardware or software
systems.
In various situations, systems are affected by environmental
factors such as human errors or common cause failure. Human errors are important
while predicting the reliability and safety measures of any engineering system. In
real life, many faults are caused directly or indirectly by human errors
such as wrong action, poor communication, wrong interpretation, poor handling, poor
maintenance and operation procedures, etc. Further, common cause failure is also a key
factor that should be incorporated to predict the system reliability in different
frameworks. The common cause failure may occur due to equipment design
deficiency, power supply, humidity, temperature, etc. An example of a common cause
failure is a fire in a room where the redundant units are located. In this case the entire
redundant system will fail, irrespective of whether one or more units were operating.
Hardware failures occur due to flaws in the design and manufacturing processes,
faulty operations, poor quality control, poor maintenance, etc. Hence a realistic
reliability model must include the occurrence of human errors, hardware failure and
common cause failure. The system reliability/availability can be quantified more
accurately by the use of these concepts.
In recent years, researchers have paid significant attention to
reliability issues by considering the common cause failure. Chari et al. (1991)
presented the reliability analysis in the presence of chance common cause shock
failures. De-Almeida and Souza (1993) discussed the maintenance strategy for a
2-unit redundant standby system. Subramanian and Anantharaman (1995) made the
reliability analysis of a complex standby redundant system. Rajamanickam and
Chandrasekar (1997) established the reliability measures for two unit systems with a
dependent structure for failure and repair times. Jain (1998) and Vaurio (1998, 2005)
provided the implicit method for incorporating common cause failure in two unit
system. Whittaker et al. (2000) developed a Markov chain model for predicting the
reliability of multi-build software. Kuo et al. (2001) suggested the framework for
modeling software reliability using various testing-effort and fault detection rates.
Yadavalli et al. (2002) analyzed the asymptotic confidence limits for the steady state
availability of a two unit parallel system with preparation time for the repair facility.
Ou and Bechta-Dugan (2003) did the approximate sensitivity analysis for acyclic
Markov reliability models. Azaron et al. (2005) studied the reliability function of a
class of time dependent systems with standby redundancy. Blokus (2006) presented
the reliability analysis of large systems with dependent components. Levitin (2007)
suggested a modification of the generalized reliability block diagram method for
evaluating reliability and performance distribution of complex multi-state series-
parallel system with uncovered failures. Hall and Mosleh (2008) analyzed the
framework for reliability growth of one-shot systems. El-damcese (2009) gave the
reliability and availability analysis of a k-out-of-(M+S): G warm standby system due
to common cause failure. Mahmoud and Moshref (2010) analyzed the hardware
failure, human error and preventive maintenance for a two unit cold standby system.
Xing et al. (2011) proposed an exact combinatorial reliability analysis of dynamic
systems with sequence-dependent failures.
It is common knowledge that redundancy can be used to increase the
reliability of a system without changing the reliability of the individual units that form
the system. K-out-of-N: G warm standby systems have found applications in various
fields including power plants, network design, redundant system testing, medical
diagnosis, etc. In a K-out-of-N: G system, K is the minimum number of components
that must work if the whole system consisting of a total of N components is to work.
Akhtar (1994), Amari et al. (2004) and Myers (2007) discussed the reliability of K-
out-of-N: G system with imperfect fault coverage. There are many factors such as
critical human error or high temperature of computer chips, etc. which could cause the
whole system to fail. El-Damcese (1997) gave the human error and common cause
failure modeling of a two unit multiple systems. Huang et al. (2000) studied the
generalized multi-state K-out-of-N: G systems. Dutuit and Rauzy (2001) and Smidt-
Destombesa et al. (2004) considered the assessment of K-out-of-N and related
systems. Arulmozhi (2002) presented the reliability of an M-out-of-N warm standby
system with R repair facilities. Zhang et al. (2006) obtained the availability and
reliability of K-out-of-(M+N): G warm standby systems. Lu and Lewis (2008) and
Chakravarthy and Gomez-Corral (2009) studied the configuration determination for
K-out-of-N partially redundant systems. Levitin and Amari (2010) established an
algorithm for evaluating the time-to-failure distribution of K-out-of-N systems with
shared standby elements. Ruiz-Castro and Li (2011) modeled a discrete K-out-of-N: G
system with multi-state components by means of block-structured Markov chains.
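The K-out-of-N: G definition given above can be made concrete with the textbook special case of independent, identical components. The Python sketch below is only an illustration of the definition, not the Markov model developed in this section: it assumes each of the N components works independently with the same probability `p` and ignores standbys, repair, human error and common cause failure. The function name is hypothetical.

```python
from math import comb


def k_out_of_n_reliability(k, n, p):
    """Probability that at least k of n independent components work,
    when each component works with probability p."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


# A 1-out-of-2: G system is a parallel pair; a 3-out-of-3: G system is a series system
parallel_pair = k_out_of_n_reliability(1, 2, 0.9)   # 1 - 0.1**2
series_triple = k_out_of_n_reliability(3, 3, 0.9)   # 0.9**3
two_of_three = k_out_of_n_reliability(2, 3, 0.9)
```

The two limiting cases show why K matters: lowering K toward 1 moves the system toward a parallel arrangement, while K = N requires every component to work.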
In the present investigation, we develop a Markov model for a K-out-of-N: G
system by incorporating the failures caused by human error and hardware problems
for a multi-component system. The outline of the section is as follows. In section
5A.2, we describe the mathematical model along with the underlying Markov process.
The transient analysis of the K-out-of-N: G system is presented in section 5A.3, and
an illustration of the model is given in section 5A.4. In section 5A.5, several system
performance measures are derived. Representative numerical results that bring out the
quantitative nature of the model are discussed in section 5A.6. Finally, section 5A.7 is
devoted to the concluding remarks.
5A.2 Model Description
We develop a Markov model for the multi-component system, which is initially
considered to be in a good state. The system or its components may fail due to
hardware failure and human error. In addition, the system is subject to failure
due to some common cause as well as due to software failure. The provision of warm
standby hardware components is also taken into consideration. The following
assumptions are made to formulate the model:
The system consists of M operating and Y warm standby hardware components.
The system functions successfully with at least K components.
The failed components are repaired in the order of their failure.
There is single repairman and he is always available to repair the failed components.
The life time and repair time of the hardware components are exponentially distributed.
The switching time from standby to operating component is assumed to be negligible.
The system may also fail due to common cause failure or software failure according to exponential distribution.
Notations
N : Total number of hardware components in the system i.e. N=M+Y.
λ : Failure rate of an operating hardware component due to hardware failure.
Section-5A: Embedded Computer System with Two Types of Failures…
85
λ′ : Failure rate of a standby hardware component due to hardware failure.
λ_h : Failure rate of an operating hardware component due to human failure.
λ′_h : Failure rate of a standby hardware component due to human failure.
λ_C : Failure rate of an operating hardware component due to common cause failure.
λ_S : Failure rate of an operating hardware component due to software failure.
μ : Repair rate of a component failed due to hardware faults when at least one standby is available.
μ_h : Repair rate of a failed component due to human failure.
μ_C : Repair rate of a failed component due to common cause.
μ_S : Repair rate of a component failed due to software failure.
P_{0,0}(t) : Probability that there is no failed component at time t.
P_{i,j}(t) : Probability that there are i (0 ≤ i ≤ N) and j (0 ≤ j ≤ N) components, where 1 ≤ i+j ≤ N, failed due to hardware failure and human failure respectively, at time t.
P_{Sf}(t) : Probability that the system fails due to software failure at time t.
P_{Cf}(t) : Probability that the system fails due to common cause at time t.
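The state space implied by this notation can be enumerated mechanically. The following is a minimal sketch, assuming states are labelled by pairs (i, j) with i + j ≤ N plus the two system-level failure states; the function name `enumerate_states` and the labels "Sf" and "Cf" are illustrative choices, not part of the original model description.

```python
def enumerate_states(n_total):
    """List the states of the model for n_total hardware components.

    (i, j): i components failed by hardware faults, j by human error,
    with i + j <= n_total. "Sf" and "Cf" are hypothetical labels for the
    software-failure and common-cause-failure states.
    """
    component_states = [
        (i, j)
        for i in range(n_total + 1)
        for j in range(n_total + 1 - i)
    ]
    return component_states + ["Sf", "Cf"]


states = enumerate_states(5)
print(len(states))  # number of probability functions to track for N = 5
```

For N = 5 this gives 21 component states plus the two failure states, which agrees with the 23 equations (5A.11)-(5A.33) written for the 2-out-of-5: G illustration.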
Fig. 5A.1: State transition diagram
5A.3 The Analysis
Using the appropriate in-flow and out-flow rates shown in the transition
diagram (see fig. 5A.1), we construct the differential-difference equations governing
the model as follows:
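$$\frac{dP_{0,0}(t)}{dt} = -[M\lambda + Y\lambda' + M\lambda_h + Y\lambda'_h + \lambda_C + \lambda_S]P_{0,0}(t) + \mu P_{1,0}(t) + \mu_h P_{0,1}(t) + \mu_S P_{Sf}(t) + \mu_C P_{Cf}(t) \tag{5A.1}$$
$$\frac{dP_{i,0}(t)}{dt} = -[M\lambda + (Y-i)\lambda' + M\lambda_h + (Y-i)\lambda'_h + \lambda_C + \lambda_S + \mu]P_{i,0}(t) + [M\lambda + (Y-i+1)\lambda']P_{i-1,0}(t) + \mu P_{i+1,0}(t) + \mu_h P_{i,1}(t) + \mu_S P_{Sf}(t) + \mu_C P_{Cf}(t), \quad 1 \le i \le Y \tag{5A.2}$$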
( ) ( ) ( )[ ] ( )( ) ( )( )
( )( ) ( )( ) (5A.3)...1Ni1Y(t),Pμ(t)PμtμPtPμ
tMλλtPμλλλiYMλiYMdt
(t)dP
(Cf)C(Sf)S1,0ii,1h
1,0ii,0CShi,0
−≤≤+++++
++++−++−+−=
+
−
$$\frac{dP_{N,0}(t)}{dt} = -\mu P_{N,0}(t) + \lambda P_{N-1,0}(t) \tag{5A.4}$$
$$\frac{dP_{0,j}(t)}{dt} = -[M\lambda + (Y-j)\lambda' + M\lambda_h + (Y-j)\lambda'_h + \lambda_C + \lambda_S + \mu_h]P_{0,j}(t) + [M\lambda_h + (Y-j+1)\lambda'_h]P_{0,j-1}(t) + \mu P_{1,j}(t) + \mu_h P_{0,j+1}(t) + \mu_S P_{Sf}(t) + \mu_C P_{Cf}(t), \quad 1 \le j \le Y \tag{5A.5}$$
( ) ( ) ( )[ ] ( )( ) ( )( )
( )( ) ( )( ) (5A.6)...1Nj1Y,(t)Pμ(t)PμtPtP
tPMtPλλjYMjYMdt
)t(dP
(Cf)C(Sf)Sj,11j,0h
1j,0hj,0hSChj,0
−≤≤+++µ+µ+
λ+µ+µ+++λ−++λ−+−=
+
−
$$\frac{dP_{0,N}(t)}{dt} = -\mu_h P_{0,N}(t) + \lambda_h P_{0,N-1}(t) \tag{5A.7}$$
( ) { } { }[ ] ( )( )
( )( ) ( )( ) ( )( ) ( )( ) ( )( ) ( )( ))8.A5...(1Nji1Y,0j,i
,tPtPtPtPtPMtM
tPjiYMjiYMdt
(t)dP
CfCSfS1j,ih0,1i1j,ihj,1i
j,ihSChji,
−≤+≤+≠
µ+µ+µ+µ+′+λλ+
µ+µ+λ+λ+λ−−++λ−−+−=
++−−
( ) ( ){ } ( ){ }[ ] ( )( )
( ){ } ( ) ( ){ } ( )
( )( ) ( )( )...(5A.9) Yji20,ji,,
(t)Pμ(t)PμtPμtμP
)t(PjiYM)t(P1jiYM
tPμμλλλjiYMλλjiYMλdt
(t)dP
(Cf)C(Sf)S1ji,h1,1i
1j,i'hhj,1ih
ji,hSChhji,
≤+≤≠
++++
λ−−+λ+λ+−−+λ+
++++′−−++′−−+−=
++
−−
( ) ( ) ( )( ) ( )( ) ( )( )
)10.A5...(Nji0,ji,(t),Pμ(t)Pμ
tPλtλPtPλλμμdt
(t)dP
(Cf)C(Sf)S
1ji,hj1,iji,SChji,
=+≠++
+++++= −−
5A.4 Illustration
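In this section, we present a 2-out-of-5: G system (M = 2 operating and Y = 3 warm
standby components, as reflected in the coefficients) for illustration purposes. The
differential-difference equations associated with the system states are as follows:
$$\frac{dP_{0,0}(t)}{dt} = -[2\lambda + 3\lambda' + 2\lambda_h + 3\lambda'_h + \lambda_S + \lambda_C]P_{0,0}(t) + \mu P_{1,0}(t) + \mu_h P_{0,1}(t) + \mu_S P_{Sf}(t) + \mu_C P_{Cf}(t) \tag{5A.11}$$
$$\frac{dP_{1,0}(t)}{dt} = -[2\lambda + 2\lambda' + 2\lambda_h + 2\lambda'_h + \lambda_S + \lambda_C + \mu]P_{1,0}(t) + (2\lambda + 3\lambda')P_{0,0}(t) + \mu P_{2,0}(t) + \mu_h P_{1,1}(t) + \mu_S P_{Sf}(t) + \mu_C P_{Cf}(t) \tag{5A.12}$$
$$\frac{dP_{2,0}(t)}{dt} = -[2\lambda + \lambda' + 2\lambda_h + \lambda'_h + \lambda_S + \lambda_C + \mu]P_{2,0}(t) + (2\lambda + 2\lambda')P_{1,0}(t) + \mu P_{3,0}(t) + \mu_h P_{2,1}(t) + \mu_S P_{Sf}(t) + \mu_C P_{Cf}(t) \tag{5A.13}$$
$$\frac{dP_{3,0}(t)}{dt} = -[2\lambda + 2\lambda_h + \lambda_S + \lambda_C + \mu]P_{3,0}(t) + (2\lambda + \lambda')P_{2,0}(t) + \mu P_{4,0}(t) + \mu_h P_{3,1}(t) + \mu_S P_{Sf}(t) + \mu_C P_{Cf}(t) \tag{5A.14}$$
$$\frac{dP_{4,0}(t)}{dt} = -[\lambda + \lambda_h + \lambda_S + \lambda_C + \mu]P_{4,0}(t) + 2\lambda P_{3,0}(t) + \mu P_{5,0}(t) + \mu_h P_{4,1}(t) + \mu_S P_{Sf}(t) + \mu_C P_{Cf}(t) \tag{5A.15}$$
$$\frac{dP_{5,0}(t)}{dt} = -\mu P_{5,0}(t) + \lambda P_{4,0}(t) \tag{5A.16}$$
$$\frac{dP_{0,1}(t)}{dt} = -[2\lambda + 2\lambda' + 2\lambda_h + 2\lambda'_h + \lambda_S + \lambda_C + \mu_h]P_{0,1}(t) + (2\lambda_h + 3\lambda'_h)P_{0,0}(t) + \mu P_{1,1}(t) + \mu_h P_{0,2}(t) + \mu_S P_{Sf}(t) + \mu_C P_{Cf}(t) \tag{5A.17}$$
$$\frac{dP_{1,1}(t)}{dt} = -[2\lambda + \lambda' + 2\lambda_h + \lambda'_h + \lambda_S + \lambda_C + \mu + \mu_h]P_{1,1}(t) + (2\lambda + 2\lambda')P_{0,1}(t) + (2\lambda_h + \lambda'_h)P_{1,0}(t) + \mu P_{2,1}(t) + \mu_h P_{1,2}(t) + \mu_S P_{Sf}(t) + \mu_C P_{Cf}(t) \tag{5A.18}$$
$$\frac{dP_{2,1}(t)}{dt} = -[2\lambda + 2\lambda_h + \mu + \mu_h + \lambda_S + \lambda_C]P_{2,1}(t) + (2\lambda + \lambda')P_{1,1}(t) + (2\lambda_h + \lambda'_h)P_{2,0}(t) + \mu P_{3,1}(t) + \mu_h P_{2,2}(t) + \mu_S P_{Sf}(t) + \mu_C P_{Cf}(t) \tag{5A.19}$$
$$\frac{dP_{3,1}(t)}{dt} = -[\lambda + \lambda_h + \lambda_S + \lambda_C + \mu + \mu_h]P_{3,1}(t) + 2\lambda P_{2,1}(t) + 2\lambda_h P_{3,0}(t) + \mu P_{4,1}(t) + \mu_h P_{3,2}(t) + \mu_S P_{Sf}(t) + \mu_C P_{Cf}(t) \tag{5A.20}$$
$$\frac{dP_{4,1}(t)}{dt} = -(\mu + \mu_h)P_{4,1}(t) + \lambda P_{3,1}(t) + \lambda_h P_{4,0}(t) \tag{5A.21}$$
$$\frac{dP_{0,2}(t)}{dt} = -[2\lambda + \lambda' + 2\lambda_h + \lambda'_h + \lambda_S + \lambda_C + \mu_h]P_{0,2}(t) + (2\lambda_h + 2\lambda'_h)P_{0,1}(t) + \mu P_{1,2}(t) + \mu_h P_{0,3}(t) + \mu_S P_{Sf}(t) + \mu_C P_{Cf}(t) \tag{5A.22}$$
$$\frac{dP_{1,2}(t)}{dt} = -[2\lambda + 2\lambda_h + \mu + \mu_h + \lambda_S + \lambda_C]P_{1,2}(t) + (2\lambda + \lambda')P_{0,2}(t) + (2\lambda_h + \lambda'_h)P_{1,1}(t) + \mu P_{2,2}(t) + \mu_h P_{1,3}(t) + \mu_S P_{Sf}(t) + \mu_C P_{Cf}(t) \tag{5A.23}$$
$$\frac{dP_{2,2}(t)}{dt} = -[\lambda + \lambda_h + \lambda_S + \lambda_C + \mu + \mu_h]P_{2,2}(t) + 2\lambda P_{1,2}(t) + 2\lambda_h P_{2,1}(t) + \mu P_{3,2}(t) + \mu_h P_{2,3}(t) + \mu_S P_{Sf}(t) + \mu_C P_{Cf}(t) \tag{5A.24}$$
$$\frac{dP_{3,2}(t)}{dt} = -(\mu + \mu_h)P_{3,2}(t) + \lambda P_{2,2}(t) + \lambda_h P_{3,1}(t) \tag{5A.25}$$
$$\frac{dP_{0,3}(t)}{dt} = -[2\lambda + 2\lambda_h + \lambda_S + \lambda_C + \mu_h]P_{0,3}(t) + (2\lambda_h + \lambda'_h)P_{0,2}(t) + \mu P_{1,3}(t) + \mu_h P_{0,4}(t) + \mu_S P_{Sf}(t) + \mu_C P_{Cf}(t) \tag{5A.26}$$
$$\frac{dP_{1,3}(t)}{dt} = -[\lambda + \lambda_h + \mu + \mu_h + \lambda_S + \lambda_C]P_{1,3}(t) + 2\lambda P_{0,3}(t) + 2\lambda_h P_{1,2}(t) + \mu P_{2,3}(t) + \mu_h P_{1,4}(t) + \mu_S P_{Sf}(t) + \mu_C P_{Cf}(t) \tag{5A.27}$$
$$\frac{dP_{2,3}(t)}{dt} = -(\mu + \mu_h)P_{2,3}(t) + \lambda P_{1,3}(t) + \lambda_h P_{2,2}(t) \tag{5A.28}$$
$$\frac{dP_{0,4}(t)}{dt} = -[\lambda + \lambda_h + \lambda_S + \lambda_C + \mu_h]P_{0,4}(t) + 2\lambda_h P_{0,3}(t) + \mu P_{1,4}(t) + \mu_h P_{0,5}(t) + \mu_S P_{Sf}(t) + \mu_C P_{Cf}(t) \tag{5A.29}$$
$$\frac{dP_{1,4}(t)}{dt} = -(\mu + \mu_h + \lambda_S + \lambda_C)P_{1,4}(t) + \lambda P_{0,4}(t) + \lambda_h P_{1,3}(t) + \mu_S P_{Sf}(t) + \mu_C P_{Cf}(t) \tag{5A.30}$$
$$\frac{dP_{0,5}(t)}{dt} = -\mu_h P_{0,5}(t) + \lambda_h P_{0,4}(t) \tag{5A.31}$$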
( )( )
)32.A5...()t(p)t(p)t(p)t(p
)t(p)t(p)t(p(t)Pμdt
tdP
1
0i)3,0(C
2
0i)2,0(C
3
0i)1,0(C
4
0i)0,i(C
1
0i)3,0(C
2
0i)2,0(C
3
0i)1,0(C
4
0i(i,0)C
fC,
∑∑∑∑
∑∑∑∑
====
====
λ+λ+λ+λ+
µ−µ−µ−−=
( )( )
)33.A5...()t(p)t(p)t(p)t(p
)t(p)t(p)t(p(t)Pμdt
tdP
1
0i)3,0(S
2
0i)2,0(S
3
0i)1,0(S
4
0i)0,i(S
1
0i)3,0(S
2
0i)2,0(S
3
0i)1,0(S
4
0i(i,0)S
fS,
∑∑∑∑
∑∑∑∑
====
====
λ+λ+λ+λ+
µ−µ−µ−−=
In order to solve the above set of equations (5A.11)-(5A.33), we impose the
initial conditions P_{0,0}(0) = 1 and P_{i,j}(0) = 0 for (i, j) ≠ (0, 0). A numerical method
based on the fourth-order Runge-Kutta (R-K) technique is used to obtain the transient probabilities.
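To make the numerical step concrete, the following minimal sketch applies the classical fourth-order Runge-Kutta scheme to a deliberately small stand-in problem: a hypothetical two-state (up/down) Markov chain with failure rate `lam` and repair rate `mu`, for which the exact transient solution is known. The two-state chain, the rate values and all function names are assumptions made for illustration; the same stepping routine would apply unchanged to the right-hand sides of the 23-equation system of this section.

```python
import math


def rk4_step(f, t, y, h):
    """Advance y' = f(t, y) by one classical RK4 step of size h."""
    k1 = f(t, y)
    k2 = f(t + h / 2, [yi + h / 2 * k for yi, k in zip(y, k1)])
    k3 = f(t + h / 2, [yi + h / 2 * k for yi, k in zip(y, k2)])
    k4 = f(t + h, [yi + h * k for yi, k in zip(y, k3)])
    return [
        yi + h / 6 * (a + 2 * b + 2 * c + d)
        for yi, a, b, c, d in zip(y, k1, k2, k3, k4)
    ]


def two_state_rhs(lam, mu):
    """Kolmogorov equations of a two-state chain (illustrative stand-in)."""
    def f(t, p):
        p_up, p_down = p
        return [-lam * p_up + mu * p_down, lam * p_up - mu * p_down]
    return f


def integrate(f, y0, t_end, steps):
    """Integrate from t = 0 to t_end with equal step sizes."""
    h = t_end / steps
    t, y = 0.0, list(y0)
    for _ in range(steps):
        y = rk4_step(f, t, y, h)
        t += h
    return y


lam, mu = 0.1, 0.45  # illustrative rates only
p_up, p_down = integrate(two_state_rhs(lam, mu), [1.0, 0.0], 5.0, 500)
exact_up = mu / (lam + mu) + lam / (lam + mu) * math.exp(-(lam + mu) * 5.0)
print(p_up, exact_up)
```

Comparing the numerical value with the closed form is a cheap sanity check on the step size before the solver is applied to the full model, where no closed form is available.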
5A.5 Performance Indices
We obtain the performance indices by using the probabilities obtained in the previous
section as follows:
Expected number of failed components at time t due to hardware error is
$$E\{N_{hard}(t)\} = \sum_{i=1}^{N} \sum_{j=0}^{N-i} i\, P_{i,j}(t) \tag{5A.34}$$
Expected number of failed components at time t due to human error is
$$E\{N_{human}(t)\} = \sum_{j=1}^{N} \sum_{i=0}^{N-j} j\, P_{i,j}(t) \tag{5A.35}$$
Expected number of working components in the system at time t is
$$E\{N_{working}(t)\} = \sum_{i+j=Y+1}^{N} \left[M - (i + j - Y)\right] P_{i,j}(t) \tag{5A.36}$$
Expected number of standby components in the system at time t is
$$E\{N_{standby}(t)\} = \sum_{i+j=0}^{Y} (Y - i - j)\, P_{i,j}(t) \tag{5A.37}$$
Component availability at time t is
$$A_{comp}(t) = 1 - \frac{E\{N_{human}(t)\} + E\{N_{hard}(t)\}}{N} \tag{5A.38}$$
System unavailability at time t is
$$UA_{system}(t) = 1 - A_{comp}(t) \tag{5A.39}$$
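The indices (5A.34)-(5A.39) are weighted sums over the state probabilities at a fixed time t, so they can be evaluated directly once those probabilities are available. The sketch below assumes the probabilities are held in a dictionary keyed by (i, j); the function name and the sample probabilities are hypothetical and chosen only to exercise each formula.

```python
def performance_indices(probs, m, y):
    """Evaluate (5A.34)-(5A.39) from state probabilities at one time point.

    probs: dict mapping (i, j) -> P_{i,j}(t); m operating, y standby units.
    """
    n = m + y
    e_hard = sum(i * p for (i, j), p in probs.items())
    e_human = sum(j * p for (i, j), p in probs.items())
    # (5A.36) sums only over states in which all standbys are used up.
    e_working = sum(
        (m - (i + j - y)) * p
        for (i, j), p in probs.items()
        if y + 1 <= i + j <= n
    )
    # (5A.37) sums over states in which some standbys may remain.
    e_standby = sum(
        (y - i - j) * p for (i, j), p in probs.items() if i + j <= y
    )
    a_comp = 1.0 - (e_human + e_hard) / n
    return {
        "E_hard": e_hard,
        "E_human": e_human,
        "E_working": e_working,
        "E_standby": e_standby,
        "A_comp": a_comp,
        "UA_system": 1.0 - a_comp,
    }


# Made-up probabilities for a system with M = 2 and Y = 3 (not model output).
sample = {(0, 0): 0.6, (1, 0): 0.2, (0, 1): 0.1, (3, 1): 0.1}
indices = performance_indices(sample, m=2, y=3)
print(indices)
```

States with zero failures of a given type contribute nothing to the corresponding expectation, so summing over all states is equivalent to the index ranges written in (5A.34) and (5A.35).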
5A.6 Numerical Results
In this section, numerical results for various performance indices are provided.
We present a sensitivity analysis to illustrate how the system is affected by varying
the failure rates, repair rates and other parameters. The Runge-Kutta (RK) technique of fourth
order is used to solve the system of differential equations; it is implemented
using MATLAB's 'ode45' function. A time span is taken with
equal intervals. The numerical results are displayed in tables 5A.1(a)-5A.1(b). The
reliability R(t) is plotted in figs 5A.2(a)-5A.2(d) for
different values of the varying parameters, with the default parameters chosen as follows:
λ = 0.1, λ′ = 0.01, λ_h = 0.01, λ′_h = 0.15, λ_S = 0.14, λ_C = 0.29, μ = 0.001, μ_h = 0.002, μ_S = 0.001, μ_C = 0.001.
From table 5A.1(a) we notice the patterns of the performance indices
E{N_hard(t)}, E{N_human(t)}, E{N_working(t)} and E{N_standby(t)} as the
repair rates are varied. It is observed that there is a decreasing trend in the values of E{N_hard(t)},
E{N_human(t)}, E{N_working(t)} and E{N_standby(t)} with increasing values of μ, μ_h, μ_C and μ_S.
In table 5A.1(b), we demonstrate the system availability for different values of the
failure rates at fixed time t = 5.
In figs 5A.2(a)-5A.2(d), we show the variation of reliability with time for
different values of λ, λ′, λ_S and λ_C, respectively. Fig. 5A.2(a) reveals the behavior of
reliability with respect to time t and failure rate λ. It is found that the reliability
decreases sharply for small values of t and then decreases more gradually as t
increases further. Figs 5A.2(b)-5A.2(d) illustrate the
behavior of the reliability R(t) with respect to time t as the parameters
λ′, λ_S and λ_C are varied, respectively. It is noticed that as the failure rates
(λ′, λ_S and λ_C) increase, the reliability decreases in each case.
5A.7 Conclusion
Reliability models that ignore human error and common cause
failure may not give a realistic picture of the actual reliability/availability of a system.
Therefore, reliability modeling of real time systems must include the occurrence of
common cause failures, hardware errors and human errors. A K-out-of-N: G system
with warm standby components is studied in this chapter. The transient availability
and other performance indices obtained may be helpful in improving the system
availability, in particular when common cause failures and human errors
are involved. The present investigation offers system
designers and developers insight for producing more reliable embedded computer systems by
correctly assessing fault generation.
Table 5A.1(a): Performance indices for different values of (μ, μ_h) and (μ_C, μ_S)
μ      μ_h    μ_C    μ_S    t    E{N_hard(t)}    E{N_human(t)}    E{N_work(t)}    E{N_stand(t)}
0.45   0.4    0.5    0.5    0    0               0                1               1
                            2    0.19999         0.134955         0.094451        0.006999
                            4    0.070383        0.043826         0.030845        0.000878
                            6    0.037462        0.021731         0.015971        0.00041
                            8    0.020852        0.011232         0.008658        0.000211
0.9    0.8    0.5    0.5    0    0               0                1               1
                            2    0.227216        0.151455         0.115752        0.019112
                            4    0.063309        0.037373         0.029252        0.002943
                            6    0.022243        0.011735         0.009802        0.000886
                            8    0.008314        0.003991         0.003536        0.000302
0.45   0.4    0.7    0.9    0    0               0                1               1
                            2    0.071508        0.047969         0.033341        0.002169
                            4    0.020423        0.012539         0.00882         0.000192
                            6    0.009112        0.005135         0.003813        0.000076
                            8    0.004276        0.002188         0.001731        0.000032
Table 5A.1(b): System Availability for different values of (λ_h, λ′_h).
λ      λ′     λ_S    λ_C    t    A(t) for (λ_h = 0.3, λ′_h = 0.4)    A(t) for (λ_h = 0.9, λ′_h = 0.8)
0.4    0.5    0.5    0.5    5    0.988739                            0.977129
0.8    0.7    0.5    0.5    5    0.981654                            0.969133
0.4    0.5    0.9    0.9    5    0.997062                            1.000000
Fig. 5A.2(a): Reliability vs time by varying λ (curves for λ = 0.1, 0.5, 0.9)
Fig. 5A.2(b): Reliability vs time by varying λ′ (curves for λ′ = 0.5, 1.0, 1.5)
Fig. 5A.2(c): Reliability vs time by varying λ_S (curves for λ_S = 0.20, 0.60, 1.0)
Fig. 5A.2(d): Reliability vs time by varying λ_C (curves for λ_C = 0.30, 0.60, 0.90)
(In each panel the horizontal axis is time t and the vertical axis is R(t), ranging from 0 to 1.)
Section-5B
Hardware and Software Systems with Warm Standbys and Switching Failures
5B.1 Introduction
5B.2 Model Description
5B.3 Governing Equations
5B.4 Special Case
5B.5 Performance Measures
5B.6 Numerical Results
5B.7 Conclusion
Redundancy or standby is a provision which plays an
important role in improving the reliability of engineering systems.
This investigation deals with the reliability and sensitivity analysis
of a repairable system (consisting of hardware and software
components) with warm standbys and switching failures. Failure
and repair rates of the components are assumed to follow
exponential distributions. Using the Markov property, the
transient state model is developed to establish the system
reliability and other performance measures. Numerical techniques
based on the Runge-Kutta method and the matrix method are used. A
numerical example is provided to illustrate the tractability of the
proposed method.
5B.1 Introduction
It is evident that in most ‘real world’ multi-component machining
systems, a failed component is intended to be repaired
rather than replaced. In this chapter, we develop a Markovian model for a system
consisting of primary and warm standby components which are considered as
repairable. Subramanian and Venkatakrishan (1975) investigated the reliability of a 2-unit
standby redundant system with repair, maintenance and standby failure. Goel and
Shrivastava (1991) analyzed the profit of a two unit redundant system with test and
correlated failures and repairs. Hsieh and Wang (1995) computed reliability of a
repairable system with spares and a removable repairman. Yang and Xie (2000)
studied the operational and testing reliability in software reliability model. The
reliability and sensitivity analysis of a multi-component system with warm standbys
and a repairable service station was done by Wang et al. (2004). They assumed that
the life time and repair time of the units are exponentially distributed and the failed
units are repaired on an FCFS basis. Hsu et al. (2009) studied a repairable system with
imperfect coverage and reboot with the help of Bayesian and asymptotic estimation.
The two-unit repairable system was considered under different types of prior
assumptions for the unknown parameters, with a coverage factor associated with the
failure of an operating unit. Bieth et al. (2010) studied a standby system with two repair
persons under arbitrary life and repair times. Zheng et al. (2011) discussed the well-posedness
and stability of the repairable system with N failure modes and one standby
unit.
The demand for products with more and more functionalities has increased
due to the high industrial competition and the advances in embedded hardware and
software technologies. To gain and maintain competitive advantage, the system
designers require a high level of reliability of both hardware and software systems.
A high or required level of reliability and availability is often an essential requisite for
embedded systems. Lai et al. (2002) analyzed the availability of distributed
software/hardware systems. Guo and Yang (2007) discussed the methods of simple
reliability block diagram for safety integrity verification. In 2010, Kornecki and
Zalewski studied hardware certification for real-time safety critical systems.
The standby redundant repairable systems have been studied extensively in
the past by Osaki and Nakagawa (1976), Kumar and Agarwal (1980) and many
others. A detailed bibliography on redundant repairable systems can be found in
Yearout et al. (1986). The reliability prediction of a system with redundancy plays an
increasingly important role in power systems, manufacturing systems, industrial
systems, etc. The warm standby component with a lower failure rate than the
operating units is recommended due to economic constraints. In many real time
systems, the standby might not be able to switch over successfully to act as a primary
unit; and it might also need a longer warm-up time. Wang and Kuo (2000) gave cost
and probabilistic analysis of series systems having mixed standby components. Jain
and Baghel (2001) and Bhuyan and Sarmah (2002) estimated reliability of a
repairable standby redundant system. Chandrasekhar et al. (2004) studied the two
unit standby system and obtained exact confidence limits for the availability of the
system, when the time-to-failure of an operative unit is constant and the time-to-repair
of the failed unit is governed by a two stage Erlangian distribution. Wang et al.
(2005) suggested the cost benefit analysis of series systems with warm standby
components with general repair time. Wang and Liang (2006) did cost benefit
analysis of the systems with warm standby units and imperfect coverage. In this
system they assumed that the time-to-failure is exponentially distributed and the time-to-repair
follows a general distribution. Xu et al. (2005) established
asymptotic stability of a repairable system with imperfect switching mechanism. In
2007, Jain et al. gave transient analysis of M/M/R machining system with mixed
standbys, switching failures, balking, reneging and additional removable repairmen.
Zhang et al. (2006) considered the availability and reliability of a k-out-of-(M+N): G
warm standby system. In 2009, El-Damcese analyzed the k-out-of-(M+S): G warm
standby system with time varying failure and repair rates in the presence of common
cause failure. Yang and Meng (2011) discussed the reliability analysis of a warm
standby repairable system with priority in use.
Most studies about the reliability of a system assume that the switchover from
warm standby units to primary units is always perfect and there are no failures during
the switching. In practice, however, as noted above, there is always a possibility of
failure during the switching from the standby state to the operating state. Chow (1971)
studied an imperfect switching system which contains two identical
components. Alidrisi (1992) gave the recursive formula for the reliability of a
dynamic warm standby n components redundant system with imperfect switching and
constant failure rate. Pan (1997) predicted the reliability of imperfect switching
system subject to multiple stresses. Wang et al. (2006a) compared the availability and
reliability analysis for different system configurations with warm standby components
and standby switching failures; the repair time and the failure time for each of the
primary and warm standby components are assumed to follow the negative
exponential distribution. Ke et al. (2007) discussed the reliability and sensitivity
analysis of a system with primary units, warm standby units, unreliable service
stations and standby switching failures. They considered the time-to-failure, time-to-repair,
breakdown time and service time of the failed units to be governed by
exponential distributions. Hsu et al. (2008) investigated a redundant
repairable system with switching failure with the help of a Bayesian approach. Wang
and Chen (2009) provided a comparative analysis of availability between three
systems with general repair times, reboot delay and switching failures. In this system,
they considered that the time-to-failure and the time-to-repair of the primary and
standby units are exponentially and generally distributed, respectively. Recently, Yun
and Cha (2010) and Hsu et al. (2011b) considered a general warm standby system
with switching time of the standby unit.
It is realized in many real time multi-component systems that Markovian
models are more natural and suitable. It is also of vital importance to perform the
reliability and sensitivity analysis of a repairable system with warm standby units and
switching failures. The present investigation is concerned with a Markovian model for a
system which works with both hardware and software components, with a
provision of warm standby hardware units which are likely to have switching failures
when used. The section-wise arrangement of the rest of the chapter is as follows. Section
5B.2 provides the mathematical description of the Markov model along with assumptions
and notations. In section 5B.3, we construct the governing equations. Section 5B.4
contains a special case that illustrates the model. Section
5B.5 is devoted to the performance measures of the system. The solution approach
based on Laplace transform and matrix method is given. Numerical results and
sensitivity analysis are given in section 5B.6 with the help of R-K and matrix method.
The chapter is concluded in the final section 5B.7, which summarizes the work done
and highlights the important features of our investigation.
5B.2 Model Description
We consider the transient analysis of an embedded hardware and software
system. There is a provision of warm standbys to replace the failed hardware units, and
cold standbys to replace the software units. All the hardware and software units are
subject to failure and repair. The standby units and their associated switching
mechanisms are also subject to failures. The failure rate of a software standby
component is zero. The system state transition diagram is depicted in fig. 5B.1. The
assumptions and notations used to describe the system are as follows:
Assumptions:
The system consists of M operating and S warm standby units for hardware components.
Upon failure of an operative hardware unit, an available warm standby unit becomes operative instantaneously and the failed unit goes for repair if a repairman is free; otherwise it waits in the queue for repair.
Once a standby unit replaces a failed operating hardware unit, its failure characteristics become the same as those of an operating unit.
The repair crew has R repairmen to repair failed hardware components. Each repairman can repair only one failed unit at a time; the repair discipline is first come first served (FCFS).
When the repair of a failed unit is completed, it is as good as new.
The switching from the standby state to the operating state may fail with probability q.
The switchover times from failure to repair, from repair to standby and from standby to operating states are negligible.
When both the hardware components and the software components fail, the system becomes non-repairable.
The states of all components are mutually independent.
Notations:
M : The number of operating hardware/software units in the system.
S : The number of hardware warm standby units in the system.
λ_h : Failure rate of hardware operating units.
λ_S : Failure rate of software operating units.
α : Failure rate of standby hardware units.
q : Failure probability of switching of hardware standby units.
μ_h (μ_S) : Repair rate of permanent repairmen when repairing failed hardware (software) units.
P_{i,j}(t) : Probability that, at time t, i and j failed units are present in the system which failed due to hardware and software faults, respectively.
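The switching assumption has a simple probabilistic consequence that appears as the factors (1 − q), q(1 − q), q²(1 − q), ... in the governing equations. The sketch below is one reading of that assumption, stated here as an assumption: when an operating hardware unit fails and s standbys are available, each switching attempt fails independently with probability q, so the number of failed hardware units increases by n + 1 with probability q^n (1 − q) for n < s, and by s + 1 with probability q^s when every standby fails to switch. The function name is hypothetical.

```python
def switching_jump_distribution(q, standbys):
    """Distribution of the increase in failed hardware units after one
    operating-unit failure, given `standbys` available warm standbys.

    Key k is the jump size; value is its probability (assumed model:
    independent switching attempts, each failing with probability q).
    """
    dist = {n + 1: (q ** n) * (1 - q) for n in range(standbys)}
    dist[standbys + 1] = q ** standbys  # every standby failed to switch
    return dist


dist = switching_jump_distribution(q=0.1, standbys=3)
print(dist)
print(sum(dist.values()))
```

The probabilities form a truncated geometric distribution and sum to one, which is a quick consistency check on the in-flow terms that carry powers of q.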
5B.3 Governing Equations
With the help of state dependent failure rates and repair rates, we construct the
Chapman-Kolmogorov equations governing the model as follows (See transition
diagram):
$$\frac{dP_{0,0}(t)}{dt} = -[M\lambda_h + S\alpha + M\lambda_S]P_{0,0}(t) + \mu_h P_{1,0}(t) + \mu_S P_{0,1}(t) \tag{5B.1}$$
$$\frac{dP_{1,0}(t)}{dt} = -[M\lambda_h + (S-1)\alpha + M\lambda_S + \mu_h]P_{1,0}(t) + [M\lambda_h(1-q) + S\alpha]P_{0,0}(t) + 2\mu_h P_{2,0}(t) + \mu_S P_{1,1}(t) \tag{5B.2}$$
( ) ( )[ ] ( )( ) ( ) ( )[ ] ( )
( ) ( ) ( )( )( ) ( )( )
(5B.3)Si2
,tPq1qM)t(P)t(PRi
)t(P1iSq1MtP)Ri(MiSMdt
)t(dP
0,n1ni
h
2i
0n1,iS0,1ih
0,1ih0,ihSh0,i
1
1
1
…≤≤
−λ+µ+µ∧+
α+−+−λ+µ∧+λ+α−+λ−=
−−−
=+
−
∑
( ) ( )[ ] ( )( ) ( )[ ] ( )
( ) ( )( )
( ) (5B.4)tPqM)t(PR)t(P
)t(P1iSMtPRMiSMdt
)t(dP
0,n
1R
0n
nSh0,2Sh1,1SS
0,Sh0,1ShSh0,1S
1
1
1 …λ+µ+µ+
λ+−++µ+λ+λ−+−=
∑−
=
−++
++
( )( ) ( )[ ] ( )( ) ( )[ ] ( )
( ) ( ) (5B.5)1Ni2S),t(PR)t(P
)t(P1iSMtPRMiSMdt
tdP
0,1ih1,iS
0,1ih0,ihSh0,i
…−≤≤+µ+µ+
λ+−++µ+λ+λ−+−=
+
−
$$\frac{dP_{N,0}(t)}{dt} = \lambda_h P_{N-1,0}(t) - R\mu_h P_{N,0}(t) \tag{5B.6}$$
( )( ) ( )[ ] ( )( ) ( ) ( )
( ) (5B.7)Sj1),t(P
)t(P)t(PMtPMjSMdt
tdP
1j,0S
j,1h1j,0Sj,0SShj,0
…≤≤µ+
µ+λ+µ+λ+α−+λ−=
+
−
( )( ) ( ) ( )[ ] ( )( ) ( ) ( )
( ) ( ) (5B.8)1Nj1S),t(P)t(P
)t(P1jSMtPjSMjSMdt
tdP
1j,0Sj,ih
1j,0Sj,0SShj,0
…−≤≤+µ+µ+
λ+−++µ+λ−++λ−+−=
+
−
$$\frac{dP_{0,N}(t)}{dt} = \lambda_S P_{0,N-1}(t) - \mu_S P_{0,N}(t) \tag{5B.9}$$
( )( ) ( ) ( ) ( )[ ] ( )
( ) ( )[ ] ( )( )( )( ) ( )
( ) ( ) ( )[ ] ( ) ( )
(5B.10) Sji2,0j,i
),t(P)t(PjiSq1M)t(PR
)t(Pq1qM)t(PiSq1M
)t(PRiMjiSq1Mdt
tdP
1j.iS1j,ihj.1ih
n,n
2ji
1nn
1nnjihj,1ih
j,iShShj,i
21
21
21
…≤+≤≠
µ+α+−+−λ+µ+
−λ+α−+−λ+
µ+µ∧+λ+α+−+−λ−=
+−+
−+
=+
−+−+− ∑
( )( ) ( ) ( )[ ] ( )( )
( ) ( ) ( )( )
( )
( )( ) ( ) ( ) ( )
(5B.11)1Sji
),t(P)t(PRi)t(PqM
)t(PqM)t(PM)t(P1jiSM
tPRiMjiSMdt
tdP
1j,1Sj,1ihn,n1nnji
h
1S
Rnn
n,n1nnji
h
1R
1nn1j,iSj,1ih
j,iShShj,i
21
21
21
21
21
21
…+=+
µ+µ∧+λ+
λ+λ+λ+−−++
µ+µ∧+λ+λ+−+−=
++−+−+
−
=+
−+−+−
=+−−
∑
∑
( )( ) ( ) ( )[ ] ( )( )
( ) ( ) ( ) ( ) ( )
( ) (5B.12)1Nji2S),t(P
)t(PRi)t(PM)t(P1jiSM
tPRiMjiSMdt
tdP
1j,iS
j,1ih1j,iSj,1ih
j,iShShj,i
…−=+≤+µ+
µ∧+λ+λ++−++
µ+µ∧+λ+λ+−+−=
+
+−−
( )( ) ( ) ( )[ ] ( )
( ) ( )[ ] ( ) ( ) ( ) ( )
( ) (5B.13)1Sj1,1i),t(P
)t(PM)t(P1i)t(PjSq1M
)t(PM1jSq1Mdt
tdP
1j.iS
1j,iSj.1ihj,1ih
j,iShShj,i
…−≤≤=µ+
λ+µ++α−+−λ+
µ+µ+λ+α+−+−λ−=
+
−+−
( )( )( ) ( ) ( ) ( ) Nji),t(P)1R()t(PM)t(P
dttdP
j,iSh1j,iSj,1ihj,i =+µ+µ−−λ+λ= −− …(5B.14)
In the construction of the above equations, we have used i ∧ R for min (i, R).
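As a small illustration of how i ∧ R enters the coefficients, the sketch below computes the total out-flow rate of a state (i, 0) with 1 ≤ i ≤ S, read here as Mλ_h + (S − i)α + Mλ_S + (i ∧ R)μ_h. That reading of the bracket, the function names and the numerical rates are assumptions made for illustration only.

```python
def hardware_repair_rate(i, r, mu_h):
    """State-dependent repair rate (i ∧ R) * mu_h: at most R repairmen work."""
    return min(i, r) * mu_h


def outflow_rate_hardware_state(i, m, s, r, lam_h, lam_s, alpha, mu_h):
    """Total rate of leaving state (i, 0) for 1 <= i <= S (assumed form)."""
    if not 1 <= i <= s:
        raise ValueError("this sketch covers 1 <= i <= S only")
    return (
        m * lam_h            # failures of the M operating hardware units
        + (s - i) * alpha    # failures of the remaining warm standbys
        + m * lam_s          # failures of the M software units
        + hardware_repair_rate(i, r, mu_h)
    )


# Illustrative rates; M = 4, S = 3, R = 3 mirror the special case of section 5B.4.
rate = outflow_rate_hardware_state(
    i=2, m=4, s=3, r=3, lam_h=0.1, lam_s=0.2, alpha=0.05, mu_h=0.5
)
print(rate)
```

With M = 4, S = 3 and R = 3 the symbolic result for i = 1 and i = 2 is 4λ_h + 2α + 4λ_S + μ_h and 4λ_h + α + 4λ_S + 2μ_h, which are the leading coefficients of equations (5B.30) and (5B.31) in the special case.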
Taking Laplace transforms of equations (5B.1) to (5B.14) with initial conditions
P_{0,0}(0) = 1 and P_{i,j}(0) = 0 for (i, j) ≠ (0, 0), we get
( )[ ] ( ) ( ) ( ) 1)s(P)s(P)s(PsMSM 1,0*
S0,1*
h0,0*
Sh =µ−µ−+λ+α+λ …(5B.15)
( )[ ] ( )( ) ( )[ ] ( ) ( )
( ) (5B.16)0)s(P
)s(P2)s(PSq1MsPsM1SM
1,1*
S
0,2*
h0,0*
h0,1*
hSh
…=µ−
µ−α+−λ−+µ+λ+α−+λ
( )[ ] ( )( ) ( ) ( )[ ] ( )
( ) ( ) ( )( )( ) ( )( )
(5B.17)Si2
,0sPq1qM)s(P)s(PRi
)s(P1iSq1MsPs)Ri(MiSM
0,n*1ni
h
2i
0n1,i
*S0,1i
*h
0,1i*
h0,i*
hSh
11
1
…≤≤
=−λ−µ−µ∧−
α+−+−λ−+µ∧+λ+α−+λ
−−−
=
+
−
∑
( )[ ] ( )( ) ( )[ ] ( ) ( )
( )( )
( )( ) (5B.18)0sPqM)s(PR
)s(P)s(P1iSMsPsRMiSM
0,n*
1R
0n
nSh0,2S
*h
1,1S*
S0,S*
h0,1S*
hSh
1
1
1 …=λ−µ−
µ−λ+−+−+µ+λ+λ−+
∑−
=
−+
++
( )[ ] ( )( ) ( )[ ] ( ) ( )
( ) (5B.19)1Ni2S,0)s(PR
)s(P)s(P1iSMsPsRMiSM
0,1i*
h
1,i*
S0,1i*
h0,i*
hSh
…−≤≤+=µ−
µ−λ+−+−+µ+λ+λ−+
+
−
( ) ( ) ( ) )s(P)s(PsR 0,1N*
h0,N*
h −λ−+µ …(5B.20)
( )[ ] ( )( ) ( ) ( )
( ) (5B.21)Sj1,0)s(P
)s(P)s(PMsPsMjSM
1j,0*
S
j,1*
h1j,0*
Sj,0*
SSh
…≤≤=µ−
µ−λ−+µ+λ+α−+λ
+
−
( ) ( )[ ] ( )( ) ( ) ( )
( ) ( ) (5B.22)1Nj1S,0)s(P)s(P
)s(P1jSMsPsjSMjSM
1j,0*
Sj,i*
h
1j,0*
Sj,0*
SSh
…−≤≤+=µ−µ−
λ+−+−+µ+λ−++λ−+
+
−
( ) ( ) ( ) )s(P)s(Ps 1N,0*
SN,0*
S −λ−+µ …(5B.23)
( ) ( ) ( )[ ] ( )
( ) ( )[ ] ( )( )( )( ) ( )
( ) ( ) ( )[ ] ( ) ( )
(5B.24)Sji2,0j,i,0)s(P)s(PjiSq1M)s(PR
)s(Pq1qM)s(PiSq1M
)s(PsRiMjiSq1M
1j.i*
S1j,i*
hj.1i*
h
n,n*
2ji
1nn
1nnjihj,1i
*h
j,i*
ShSh
21
21
21
…≤+≤≠=µ−α+−+−λ−µ−
−λ−α−+−λ−
+µ+µ∧+λ+α+−+−λ
+−+
−+
=+
−+−+− ∑
( ) ( )[ ] ( )( ) ( ) ( )
( )( )
( )( )
( )
( ) ( ) ( ) (5B.25)1Sji,0)s(P)s(PRi
)s(PqM)s(PqM)s(PM
)s(P1jiSMsPsRiMjiSM
1j,1*
Sj,1i*
h
n,n*1nnji
h
1S
Rnnn,n
*1nnjih
1R
1nn1j,i
*S
j,1i*
hj,i*
ShSh
2121
21
2121
21
…+=+=µ−µ∧−
λ−λ−λ−
λ+−−+−+µ+µ∧+λ+λ+−+
++
−+−+−
=+
−+−+−
=+
−
−
∑∑
( ) ( )[ ] ( )( ) ( ) ( )
( ) ( ) ( ) ( )
(5B.26)1Nji2S,0)s(P)s(PRi)s(PM
)s(P1jiSMsPsRiMjiSM
1j,i*
Sj,1i*
h1j,i*
S
j,1i*
hj,i*
ShSh
…−=+≤+=µ−µ∧−λ−
λ++−+−+µ+µ∧+λ+λ+−+
++−
−
( ) ( )[ ] ( ) ( ) ( )[ ] ( )
( ) ( ) ( ) ( )
(5B.27)1Sj1,1i,0)s(P)s(PM)s(P1i
)s(PjSq1M)s(PsM1jSq1M
1j.i*
S1j,i*
Sj.1i*
h
j,1i*
hj,i*
ShSh
…−≤≤==µ−λ−µ+−
α−+−λ−+µ+µ+λ+α+−+−λ
+−+
−
( ) ( ) ( ) ( ) Nji,0)s(PM)s(P)s(Ps)1R( 1j,i*
Sj,1i*
hj,i*
Sh =+=λ−λ−+µ+µ− −− …(5B.28)
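After the transform, the differential equations become a linear algebraic system in the unknowns P*_{i,j}(s), which is what makes a matrix method applicable. The following minimal sketch shows the idea on a hypothetical two-state chain (failure rate `lam`, repair rate `mu`, start in the working state), not on the full model: the transformed equations are (s + λ)P*_0 − μP*_1 = 1 and −λP*_0 + (s + μ)P*_1 = 0, solved here by Cramer's rule. All names and rate values are illustrative assumptions.

```python
def solve_2x2(a, b):
    """Solve a 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if det == 0:
        raise ValueError("singular system")
    x0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    x1 = (a[0][0] * b[1] - a[1][0] * b[0]) / det
    return [x0, x1]


def transformed_probabilities(s, lam, mu):
    """Laplace-domain state probabilities of the two-state stand-in chain."""
    coefficient_matrix = [[s + lam, -mu], [-lam, s + mu]]
    right_hand_side = [1.0, 0.0]  # initial condition P_0(0) = 1, P_1(0) = 0
    return solve_2x2(coefficient_matrix, right_hand_side)


s, lam, mu = 0.5, 0.1, 0.45  # illustrative values only
p0_star, p1_star = transformed_probabilities(s, lam, mu)
print(p0_star, p1_star, p0_star + p1_star)
```

Because the probabilities sum to one at every time t, their transforms must sum to 1/s; checking this identity is a convenient test of the coefficient matrix before inverting the transform.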
5B.4 Special Case
In this investigation we consider a special case of the general model with
4 hardware components, 4 software components and 3 warm standby
components for the hardware units. The differential-difference equations
governing the model are then as follows:
$$\frac{dP_{0,0}(t)}{dt} = -(4\lambda_h + 3\alpha + 4\lambda_S)P_{0,0}(t) + \mu_h P_{1,0}(t) + \mu_S P_{0,1}(t) \tag{5B.29}$$
$$\frac{dP_{1,0}(t)}{dt} = -(4\lambda_h + 2\alpha + 4\lambda_S + \mu_h)P_{1,0}(t) + [4\lambda_h(1-q) + 3\alpha]P_{0,0}(t) + 2\mu_h P_{2,0}(t) + \mu_S P_{1,1}(t) \tag{5B.30}$$
$$\frac{dP_{2,0}(t)}{dt} = -(4\lambda_h + \alpha + 4\lambda_S + 2\mu_h)P_{2,0}(t) + [4\lambda_h(1-q) + 2\alpha]P_{1,0}(t) + 3\mu_h P_{3,0}(t) + \mu_S P_{2,1}(t) + 4\lambda_h q(1-q)P_{0,0}(t) \tag{5B.31}$$
$$\frac{dP_{3,0}(t)}{dt} = -(4\lambda_h + 4\lambda_S + 3\mu_h)P_{3,0}(t) + [4\lambda_h(1-q) + \alpha]P_{2,0}(t) + 3\mu_h P_{4,0}(t) + \mu_S P_{3,1}(t) + 4\lambda_h q^2(1-q)P_{0,0}(t) + 4\lambda_h q(1-q)P_{1,0}(t) \tag{5B.32}$$
$$\frac{dP_{4,0}(t)}{dt} = -(3\lambda_h + 4\lambda_S + 3\mu_h)P_{4,0}(t) + 4\lambda_h P_{3,0}(t) + 3\mu_h P_{5,0}(t) + \mu_S P_{4,1}(t) + 4\lambda_h q^3 P_{0,0}(t) + 4\lambda_h q^2 P_{1,0}(t) + 4\lambda_h q P_{2,0}(t) \tag{5B.33}$$
$$\frac{dP_{5,0}(t)}{dt} = -(2\lambda_h + 4\lambda_S + 3\mu_h)P_{5,0}(t) + 3\lambda_h P_{4,0}(t) + 3\mu_h P_{6,0}(t) + \mu_S P_{5,1}(t) \tag{5B.34}$$
$$\frac{dP_{6,0}(t)}{dt} = -(\lambda_h + 4\lambda_S + 3\mu_h)P_{6,0}(t) + 2\lambda_h P_{5,0}(t) + 3\mu_h P_{7,0}(t) + \mu_S P_{6,1}(t) \tag{5B.35}$$
$$\frac{dP_{7,0}(t)}{dt} = -3\mu_h P_{7,0}(t) + \lambda_h P_{6,0}(t) \tag{5B.36}$$
$$\frac{dP_{0,1}(t)}{dt} = -(4\lambda_h + 2\alpha + 4\lambda_S + \mu_S)P_{0,1}(t) + 4\lambda_S P_{0,0}(t) + \mu_h P_{1,1}(t) + \mu_S P_{0,2}(t) \tag{5B.37}$$
$$\frac{dP_{1,1}(t)}{dt} = -(4\lambda_h + \alpha + 4\lambda_S + \mu_h + \mu_S)P_{1,1}(t) + [4\lambda_h(1-q) + 2\alpha]P_{0,1}(t) + 2\mu_h P_{2,1}(t) + 4\lambda_S P_{1,0}(t) + \mu_S P_{1,2}(t) \tag{5B.38}$$
$$\frac{dP_{2,1}(t)}{dt} = -(4\lambda_h + 4\lambda_S + 2\mu_h + \mu_S)P_{2,1}(t) + [4\lambda_h(1-q) + \alpha]P_{1,1}(t) + 2\mu_h P_{3,1}(t) + 4\lambda_S P_{2,0}(t) + \mu_S P_{2,2}(t) + 4\lambda_h(1-q)P_{0,1}(t) \tag{5B.39}$$
$$\frac{dP_{3,1}(t)}{dt} = -(3\lambda_h + 4\lambda_S + 2\mu_h + \mu_S)P_{3,1}(t) + 4\lambda_h P_{2,1}(t) + 2\mu_h P_{4,1}(t) + 4\lambda_S P_{3,0}(t) + \mu_S P_{3,2}(t) + 4\lambda_h q^2 P_{0,1}(t) + 4\lambda_h q P_{1,1}(t) \tag{5B.40}$$
$$\frac{dP_{4,1}(t)}{dt} = -(2\lambda_h + 4\lambda_S + 2\mu_h + \mu_S)P_{4,1}(t) + 3\lambda_h P_{3,1}(t) + 2\mu_h P_{5,1}(t) + 4\lambda_S P_{4,0}(t) + \mu_S P_{4,2}(t) \tag{5B.41}$$
$$\frac{dP_{5,1}(t)}{dt} = -(\lambda_h + 4\lambda_S + 2\mu_h + \mu_S)P_{5,1}(t) + 2\lambda_h P_{4,1}(t) + 2\mu_h P_{6,1}(t) + 4\lambda_S P_{5,0}(t) + \mu_S P_{5,2}(t) \tag{5B.42}$$
$$\frac{dP_{6,1}(t)}{dt} = -(2\mu_h + \mu_S)P_{6,1}(t) + \lambda_h P_{5,1}(t) + 4\lambda_S P_{6,0}(t) \tag{5B.43}$$
$$\frac{dP_{0,2}(t)}{dt} = -(4\lambda_h + \alpha + 4\lambda_S + \mu_S)P_{0,2}(t) + \mu_h P_{1,2}(t) + 4\lambda_S P_{0,1}(t) + \mu_S P_{0,3}(t) \tag{5B.44}$$
$$\frac{dP_{1,2}(t)}{dt} = -(4\lambda_h + 4\lambda_S + \mu_h + \mu_S)P_{1,2}(t) + [4\lambda_h(1-q) + \alpha]P_{0,2}(t) + 2\mu_h P_{2,2}(t) + 4\lambda_S P_{1,1}(t) + \mu_S P_{1,3}(t) \tag{5B.45}$$
$$\frac{dP_{2,2}(t)}{dt} = -(3\lambda_h + 4\lambda_S + 2\mu_h + \mu_S)P_{2,2}(t) + 4\lambda_h P_{1,2}(t) + 2\mu_h P_{3,2}(t) + 4\lambda_S P_{2,1}(t) + \mu_S P_{2,3}(t) + 4\lambda_h q P_{0,2}(t) \tag{5B.46}$$
$$\frac{dP_{3,2}(t)}{dt} = -(2\lambda_h + 4\lambda_S + 2\mu_h + \mu_S)P_{3,2}(t) + 3\lambda_h P_{2,2}(t) + 2\mu_h P_{4,2}(t) + 4\lambda_S P_{3,1}(t) + \mu_S P_{3,3}(t) \tag{5B.47}$$
$$\frac{dP_{4,2}(t)}{dt} = -(\lambda_h + 4\lambda_S + 2\mu_h + \mu_S)P_{4,2}(t) + 2\lambda_h P_{3,2}(t) + 2\mu_h P_{5,2}(t) + 4\lambda_S P_{4,1}(t) + \mu_S P_{4,3}(t) \tag{5B.48}$$
$$\frac{dP_{5,2}(t)}{dt} = -(2\mu_h + \mu_S)P_{5,2}(t) + \lambda_h P_{4,2}(t) + 4\lambda_S P_{5,1}(t) \tag{5B.49}$$
$$\frac{dP_{0,3}(t)}{dt} = -(4\lambda_h + 4\lambda_S + \mu_S)P_{0,3}(t) + \mu_h P_{1,3}(t) + 4\lambda_S P_{0,2}(t) + \mu_S P_{0,4}(t) \tag{5B.50}$$
$$\frac{dP_{1,3}(t)}{dt} = -(3\lambda_h + 4\lambda_S + \mu_h + \mu_S)P_{1,3}(t) + 4\lambda_h P_{0,3}(t) + 2\mu_h P_{2,3}(t) + 4\lambda_S P_{1,2}(t) + \mu_S P_{1,4}(t) \tag{5B.51}$$
$$\frac{dP_{2,3}(t)}{dt} = -(2\lambda_h + 4\lambda_S + 2\mu_h + \mu_S)P_{2,3}(t) + 3\lambda_h P_{1,3}(t) + 2\mu_h P_{3,3}(t) + 4\lambda_S P_{2,2}(t) + \mu_S P_{2,4}(t) \tag{5B.52}$$
$$\frac{dP_{3,3}(t)}{dt} = -(\lambda_h + 4\lambda_S + 2\mu_h + \mu_S)P_{3,3}(t) + 2\lambda_h P_{2,3}(t) + 2\mu_h P_{4,3}(t) + 4\lambda_S P_{3,2}(t) + \mu_S P_{3,4}(t) \tag{5B.53}$$
$$\frac{dP_{4,3}(t)}{dt} = -(2\mu_h + \mu_S)P_{4,3}(t) + \lambda_h P_{3,3}(t) + 4\lambda_S P_{4,2}(t) \tag{5B.54}$$
$$\frac{dP_{0,4}(t)}{dt} = -(3\lambda_h + 3\lambda_S + \mu_S)P_{0,4}(t) + \mu_h P_{1,4}(t) + 4\lambda_S P_{0,3}(t) + \mu_S P_{0,5}(t) \tag{5B.55}$$
$$\frac{dP_{1,4}(t)}{dt} = -(2\lambda_h + 3\lambda_S + \mu_h + \mu_S)P_{1,4}(t) + 3\lambda_h P_{0,4}(t) + \mu_S P_{1,5}(t) + 4\lambda_S P_{1,3}(t) + 2\mu_h P_{2,4}(t) \tag{5B.56}$$
$$\frac{dP_{2,4}(t)}{dt} = -(\lambda_h + 4\lambda_S + 2\mu_h + \mu_S)P_{2,4}(t) + 2\lambda_h P_{1,4}(t) + 4\lambda_S P_{2,3}(t) + 2\mu_h P_{3,4}(t) + \mu_S P_{2,5}(t) \tag{5B.57}$$
$$\frac{dP_{3,4}(t)}{dt} = -(2\mu_h + \mu_S)P_{3,4}(t) + \lambda_h P_{2,4}(t) + 4\lambda_S P_{3,3}(t) \tag{5B.58}$$
$$\frac{dP_{0,5}(t)}{dt} = -(2\lambda_h + 2\lambda_S + \mu_S)P_{0,5}(t) + \mu_h P_{1,5}(t) + 3\lambda_S P_{0,4}(t) + \mu_S P_{0,6}(t) \tag{5B.59}$$
$$\frac{dP_{1,5}(t)}{dt} = -(\lambda_h + 2\lambda_S + \mu_h + \mu_S)P_{1,5}(t) + 3\lambda_S P_{1,4}(t) + 2\mu_h P_{2,5}(t) + \mu_S P_{1,6}(t) + 2\lambda_h P_{0,5}(t) \tag{5B.60}$$
$$\frac{dP_{2,5}(t)}{dt} = -(2\mu_h + \mu_S)P_{2,5}(t) + \lambda_h P_{1,5}(t) + 4\lambda_S P_{2,4}(t) \tag{5B.61}$$
$$\frac{dP_{0,6}(t)}{dt} = -(\lambda_h + \lambda_S + \mu_S)P_{0,6}(t) + \mu_h P_{1,6}(t) + 2\lambda_S P_{0,5}(t) + \mu_S P_{0,7}(t) \tag{5B.62}$$
$$\frac{dP_{1,6}(t)}{dt} = -(\mu_h + \mu_S)P_{1,6}(t) + \lambda_h P_{0,6}(t) + 2\lambda_S P_{1,5}(t) \tag{5B.63}$$
$$\frac{dP_{0,7}(t)}{dt} = -\mu_S P_{0,7}(t) + \lambda_S P_{0,6}(t) \tag{5B.64}$$
To solve the above set of equations (5B.29)-(5B.64), we impose the initial
conditions $P_{0,0}(0) = 1$ and $P_{i,j}(0) = 0$ for $(i,j) \neq (0,0)$. A numerical method based on
the fourth-order Runge-Kutta (R-K) technique is used to obtain the transient probabilities.
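The fourth-order R-K scheme mentioned above can be sketched in a few lines. The block below is only an illustration in Python and not the MATLAB code used for the results of this chapter; the three-state chain, the rates `LAM` and `MU`, and all function names are assumptions chosen to keep the example small. The same `transient` routine would apply to the 36 equations (5B.29)-(5B.64) once their transition rates are listed in the `rates` dictionary.

```python
def rk4_step(f, t, y, h):
    """One classical fourth-order Runge-Kutta step for y' = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + h, [yi + h * ki for yi, ki in zip(y, k3)])
    return [yi + h / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]


def kolmogorov_rhs(rates):
    """Build dP/dt from a dict {(from_state, to_state): rate}."""
    def f(t, p):
        dp = [0.0] * len(p)
        for (a, b), r in rates.items():
            flow = r * p[a]      # probability flow a -> b
            dp[a] -= flow
            dp[b] += flow
        return dp
    return f


def transient(rates, p0, t_end, h=0.01):
    """Integrate the Kolmogorov equations from 0 to t_end with step h."""
    f = kolmogorov_rhs(rates)
    p, t = list(p0), 0.0
    for _ in range(round(t_end / h)):
        p = rk4_step(f, t, p, h)
        t += h
    return p


# Hypothetical toy chain 0 <-> 1 <-> 2 (number of failed units), NOT the thesis model.
LAM, MU = 0.6, 3.0
TOY_RATES = {(0, 1): 2 * LAM, (1, 2): LAM, (1, 0): MU, (2, 1): MU}
p = transient(TOY_RATES, [1.0, 0.0, 0.0], 12.0)
```

Because every transition removes from one state exactly what it adds to another, the components of `p` keep summing to one, which gives a quick sanity check on any larger rate table.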
For solving the set of equations governing the model, we take Laplace
transforms of equations (5B.29)-(5B.64) and solve using the matrix method with the initial
conditions $P_{0,0}(0) = 1$ and $P_{i,j}(0) = 0$ for $(i,j) \neq (0,0)$. Equations (5B.29)-(5B.64) then
become
$(4\lambda_h + 3\alpha + 4\lambda_S + s)P^*_{0,0}(s) - \mu_h P^*_{1,0}(s) - \mu_S P^*_{0,1}(s) = 1$ …(5B.65)
$(4\lambda_h + 2\alpha + 4\lambda_S + \mu_h + s)P^*_{1,0}(s) - [4\lambda_h(1-q) + 3\alpha]P^*_{0,0}(s) - 2\mu_h P^*_{2,0}(s) - \mu_S P^*_{1,1}(s) = 0$ …(5B.66)
$(4\lambda_h + \alpha + 4\lambda_S + 2\mu_h + s)P^*_{2,0}(s) - [4\lambda_h(1-q) + 2\alpha]P^*_{1,0}(s) - 3\mu_h P^*_{3,0}(s) - \mu_S P^*_{2,1}(s) = 0$ …(5B.67)
$(4\lambda_h + 4\lambda_S + 3\mu_h + s)P^*_{3,0}(s) - [4\lambda_h(1-q) + \alpha]P^*_{2,0}(s) - 3\mu_h P^*_{4,0}(s) - \mu_S P^*_{3,1}(s) = 0$ …(5B.68)
$(3\lambda_h + 4\lambda_S + 3\mu_h + s)P^*_{4,0}(s) - 4\lambda_h P^*_{3,0}(s) - 3\mu_h P^*_{5,0}(s) - \mu_S P^*_{4,1}(s) = 0$ …(5B.69)
$(2\lambda_h + 4\lambda_S + 3\mu_h + s)P^*_{5,0}(s) - 3\lambda_h P^*_{4,0}(s) - 3\mu_h P^*_{6,0}(s) - \mu_S P^*_{5,1}(s) = 0$ …(5B.70)
$(\lambda_h + 4\lambda_S + 3\mu_h + s)P^*_{6,0}(s) - 2\lambda_h P^*_{5,0}(s) - 3\mu_h P^*_{7,0}(s) - \mu_S P^*_{6,1}(s) = 0$ …(5B.71)
$3\mu_h P^*_{7,0}(s) - \lambda_h P^*_{6,0}(s) = 0$ …(5B.72)
$(4\lambda_h + 2\alpha + 4\lambda_S + \mu_S + s)P^*_{0,1}(s) - 4\lambda_S P^*_{0,0}(s) - \mu_h P^*_{1,1}(s) - \mu_S P^*_{0,2}(s) = 0$ …(5B.73)
$(4\lambda_h + \alpha + 4\lambda_S + \mu_h + \mu_S + s)P^*_{1,1}(s) - [4\lambda_h(1-q) + 2\alpha]P^*_{0,1}(s) - 2\mu_h P^*_{2,1}(s) - 4\lambda_S P^*_{1,0}(s) - \mu_S P^*_{1,2}(s) = 0$ …(5B.74)
$(4\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s)P^*_{2,1}(s) - [4\lambda_h(1-q) + \alpha]P^*_{1,1}(s) - 2\mu_h P^*_{3,1}(s) - 4\lambda_S P^*_{2,0}(s) - \mu_S P^*_{2,2}(s) = 0$ …(5B.75)
$(3\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s)P^*_{3,1}(s) - 4\lambda_h P^*_{2,1}(s) - 2\mu_h P^*_{4,1}(s) - 4\lambda_S P^*_{3,0}(s) - \mu_S P^*_{3,2}(s) = 0$ …(5B.76)
$(2\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s)P^*_{4,1}(s) - 3\lambda_h P^*_{3,1}(s) - 2\mu_h P^*_{5,1}(s) - 4\lambda_S P^*_{4,0}(s) - \mu_S P^*_{4,2}(s) = 0$ …(5B.77)
$(\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s)P^*_{5,1}(s) - 2\lambda_h P^*_{4,1}(s) - 2\mu_h P^*_{6,1}(s) - 4\lambda_S P^*_{5,0}(s) - \mu_S P^*_{5,2}(s) = 0$ …(5B.78)
$(2\mu_h + \mu_S + s)P^*_{6,1}(s) - \lambda_h P^*_{5,1}(s) - 4\lambda_S P^*_{6,0}(s) = 0$ …(5B.79)
$(4\lambda_h + \alpha + 4\lambda_S + \mu_S + s)P^*_{0,2}(s) - \mu_h P^*_{1,2}(s) - 4\lambda_S P^*_{0,1}(s) - \mu_S P^*_{0,3}(s) = 0$ …(5B.80)
$(4\lambda_h + 4\lambda_S + \mu_h + \mu_S + s)P^*_{1,2}(s) - [4\lambda_h(1-q) + \alpha]P^*_{0,2}(s) - 2\mu_h P^*_{2,2}(s) - 4\lambda_S P^*_{1,1}(s) - \mu_S P^*_{1,3}(s) = 0$ …(5B.81)
$(3\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s)P^*_{2,2}(s) - 4\lambda_h P^*_{1,2}(s) - 2\mu_h P^*_{3,2}(s) - 4\lambda_S P^*_{2,1}(s) - \mu_S P^*_{2,3}(s) = 0$ …(5B.82)
$(2\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s)P^*_{3,2}(s) - 3\lambda_h P^*_{2,2}(s) - 2\mu_h P^*_{4,2}(s) - 4\lambda_S P^*_{3,1}(s) - \mu_S P^*_{3,3}(s) = 0$ …(5B.83)
$(\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s)P^*_{4,2}(s) - 2\lambda_h P^*_{3,2}(s) - 2\mu_h P^*_{5,2}(s) - 4\lambda_S P^*_{4,1}(s) - \mu_S P^*_{4,3}(s) = 0$ …(5B.84)
$(2\mu_h + \mu_S + s)P^*_{5,2}(s) - \lambda_h P^*_{4,2}(s) - 4\lambda_S P^*_{5,1}(s) = 0$ …(5B.85)
$(4\lambda_h + 4\lambda_S + \mu_S + s)P^*_{0,3}(s) - \mu_h P^*_{1,3}(s) - 4\lambda_S P^*_{0,2}(s) - \mu_S P^*_{0,4}(s) = 0$ …(5B.86)
$(3\lambda_h + 4\lambda_S + \mu_h + \mu_S + s)P^*_{1,3}(s) - 4\lambda_h P^*_{0,3}(s) - 2\mu_h P^*_{2,3}(s) - 4\lambda_S P^*_{1,2}(s) - \mu_S P^*_{1,4}(s) = 0$ …(5B.87)
$(2\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s)P^*_{2,3}(s) - 3\lambda_h P^*_{1,3}(s) - 2\mu_h P^*_{3,3}(s) - 4\lambda_S P^*_{2,2}(s) - \mu_S P^*_{2,4}(s) = 0$ …(5B.88)
$(\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s)P^*_{3,3}(s) - 2\lambda_h P^*_{2,3}(s) - 2\mu_h P^*_{4,3}(s) - 4\lambda_S P^*_{3,2}(s) - \mu_S P^*_{3,4}(s) = 0$ …(5B.89)
$(2\mu_h + \mu_S + s)P^*_{4,3}(s) - \lambda_h P^*_{3,3}(s) - 4\lambda_S P^*_{4,2}(s) = 0$ …(5B.90)
$(3\lambda_h + 3\lambda_S + \mu_S + s)P^*_{0,4}(s) - \mu_h P^*_{1,4}(s) - 4\lambda_S P^*_{0,3}(s) - \mu_S P^*_{0,5}(s) = 0$ …(5B.91)
$(2\lambda_h + 3\lambda_S + \mu_h + \mu_S + s)P^*_{1,4}(s) - 3\lambda_h P^*_{0,4}(s) - \mu_S P^*_{1,5}(s) - 4\lambda_S P^*_{1,3}(s) = 0$ …(5B.92)
$(\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s)P^*_{2,4}(s) - 2\lambda_h P^*_{1,4}(s) - 4\lambda_S P^*_{2,3}(s) - 2\mu_h P^*_{3,4}(s) - \mu_S P^*_{2,5}(s) = 0$ …(5B.93)
$(2\mu_h + \mu_S + s)P^*_{3,4}(s) - \lambda_h P^*_{2,4}(s) - 4\lambda_S P^*_{3,3}(s) = 0$ …(5B.94)
$(2\lambda_h + 2\lambda_S + \mu_S + s)P^*_{0,5}(s) - \mu_h P^*_{1,5}(s) - 3\lambda_S P^*_{0,4}(s) - \mu_S P^*_{0,6}(s) = 0$ …(5B.95)
$(\lambda_h + 2\lambda_S + \mu_h + \mu_S + s)P^*_{1,5}(s) - 3\lambda_S P^*_{1,4}(s) - 2\mu_h P^*_{2,5}(s) - \mu_S P^*_{1,6}(s) = 0$ …(5B.96)
$(2\mu_h + \mu_S + s)P^*_{2,5}(s) - \lambda_h P^*_{1,5}(s) - 4\lambda_S P^*_{2,4}(s) = 0$ …(5B.97)
$(\lambda_h + \lambda_S + \mu_S + s)P^*_{0,6}(s) - \mu_h P^*_{1,6}(s) - 2\lambda_S P^*_{0,5}(s) - \mu_S P^*_{0,7}(s) = 0$ …(5B.98)
$(\mu_h + \mu_S + s)P^*_{1,6}(s) - \lambda_h P^*_{0,6}(s) - 2\lambda_S P^*_{1,5}(s) = 0$ …(5B.99)
$\mu_S P^*_{0,7}(s) - \lambda_S P^*_{0,6}(s) = 0$ …(5B.100)
For brevity, we denote the Laplace transforms $P^*_{i,j}(s)$ of the probabilities with a single suffix,
i.e. by $P^*_k(s)$, as defined below:
$P^*_{i,0}(s) = P^*_{i+1}(s)$, $0 \le i \le 7$; $P^*_{i,1}(s) = P^*_{i+8+1}(s)$, $0 \le i \le 6$; $P^*_{i,2}(s) = P^*_{i+15+1}(s)$, $0 \le i \le 5$;
$P^*_{i,3}(s) = P^*_{i+21+1}(s)$, $0 \le i \le 4$; $P^*_{i,4}(s) = P^*_{i+26+1}(s)$, $0 \le i \le 3$; $P^*_{i,5}(s) = P^*_{i+31}(s)$, $0 \le i \le 2$;
$P^*_{i,6}(s) = P^*_{i+34}(s)$, $0 \le i \le 1$; $P^*_{0,7}(s) = P^*_{36}(s)$.
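The single-suffix numbering above follows one rule: states are listed level by level in j, and level j holds 8 - j states. A small sketch of that rule is given below; the function name `single_index` and the argument `total` (standing for M + S = 7 in this particular case) are illustrative assumptions and do not appear in the thesis.

```python
def single_index(i, j, total=7):
    """Map the state (i, j), with i + j <= total, to the single suffix k = 1..36."""
    if not (0 <= j <= total and 0 <= i <= total - j):
        raise ValueError("state (i, j) is outside the triangle i + j <= total")
    # Levels 0 .. j-1 contain (total + 1 - m) states each.
    offset = sum(total + 1 - m for m in range(j))
    return offset + i + 1


# Enumerate all 36 states in the order used for the vector P*(s).
ALL_STATES = [(i, j) for j in range(8) for i in range(8 - j)]
ALL_INDICES = [single_index(i, j) for (i, j) in ALL_STATES]
```

For example, `single_index(0, 1)` gives 9 and `single_index(0, 7)` gives 36, in agreement with the list above.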
The system of equations (5B.65)-(5B.100) can be written in matrix form as
$Q(s)\,P^*(s) = P(0)$ …(5B.101)
where $P^*(s) = [P^*_1(s), P^*_2(s), \ldots, P^*_{36}(s)]^T$ and $P(0) = [1, 0, 0, \ldots, 0]^T$.
Here Q(s) is a matrix of order $36 \times 36$ and can be written as a tri-diagonal block matrix as
follows:
$Q(s) = [A_{ij}]_{7 \times 7}$
The sub-matrices $A_{ij}$ (i = 1, 2, …, 7 and j = 1, 2, …, 7) are constructed for the particular case
as follows:
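$A_{12} = [-\mu_S I_{7 \times 7}, 0]^T$,
$A_{11} = \begin{bmatrix}
\Lambda_1 & -\mu_h & 0 & 0 & 0 & 0 & 0 & 0 \\
-[4\lambda_h(1-q)+3\alpha] & \Lambda_2 & -2\mu_h & 0 & 0 & 0 & 0 & 0 \\
-4\lambda_h q(1-q) & -[4\lambda_h(1-q)+2\alpha] & \Lambda_3 & -3\mu_h & 0 & 0 & 0 & 0 \\
-4\lambda_h q^2(1-q) & -4\lambda_h q(1-q) & -[4\lambda_h(1-q)+\alpha] & \Lambda_4 & -3\mu_h & 0 & 0 & 0 \\
-4\lambda_h q^3 & -4\lambda_h q^2 & -4\lambda_h q & -4\lambda_h & \Lambda_5 & -3\mu_h & 0 & 0 \\
0 & 0 & 0 & 0 & -3\lambda_h & \Lambda_6 & -3\mu_h & 0 \\
0 & 0 & 0 & 0 & 0 & -2\lambda_h & \Lambda_7 & -3\mu_h \\
0 & 0 & 0 & 0 & 0 & 0 & -\lambda_h & 3\mu_h
\end{bmatrix}$
$A_{21} = [-4\lambda_S I_{7 \times 7}, 0]$, $A_{23} = [-\mu_S I_{6 \times 6}, 0]^T$,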
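$A_{22} = \begin{bmatrix}
\Lambda_8 & -\mu_h & 0 & 0 & 0 & 0 & 0 \\
-[4\lambda_h(1-q)+2\alpha] & \Lambda_9 & -2\mu_h & 0 & 0 & 0 & 0 \\
-4\lambda_h q(1-q) & -[4\lambda_h(1-q)+\alpha] & \Lambda_{10} & -2\mu_h & 0 & 0 & 0 \\
-4\lambda_h q^2 & -4\lambda_h q & -4\lambda_h & \Lambda_{11} & -2\mu_h & 0 & 0 \\
0 & 0 & 0 & -3\lambda_h & \Lambda_{12} & -2\mu_h & 0 \\
0 & 0 & 0 & 0 & -2\lambda_h & \Lambda_{13} & -2\mu_h \\
0 & 0 & 0 & 0 & 0 & -\lambda_h & 2\mu_h + \mu_S + s
\end{bmatrix}$
$A_{32} = [-4\lambda_S I_{6 \times 6}, 0]$, $A_{34} = [-\mu_S I_{5 \times 5}, 0]^T$,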
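$A_{33} = \begin{bmatrix}
\Lambda_{14} & -\mu_h & 0 & 0 & 0 & 0 \\
-[4\lambda_h(1-q)+\alpha] & \Lambda_{15} & -2\mu_h & 0 & 0 & 0 \\
-4\lambda_h q & -4\lambda_h & \Lambda_{16} & -2\mu_h & 0 & 0 \\
0 & 0 & -3\lambda_h & \Lambda_{17} & -2\mu_h & 0 \\
0 & 0 & 0 & -2\lambda_h & \Lambda_{18} & -2\mu_h \\
0 & 0 & 0 & 0 & -\lambda_h & \Lambda_{19}
\end{bmatrix}$
$A_{43} = [-4\lambda_S I_{5 \times 5}, 0]$, $A_{45} = [-\mu_S I_{4 \times 4}, 0]^T$,
$A_{44} = \begin{bmatrix}
\Lambda_{20} & -\mu_h & 0 & 0 & 0 \\
-4\lambda_h & \Lambda_{21} & -2\mu_h & 0 & 0 \\
0 & -3\lambda_h & \Lambda_{22} & -2\mu_h & 0 \\
0 & 0 & -2\lambda_h & \Lambda_{23} & -2\mu_h \\
0 & 0 & 0 & -\lambda_h & \Lambda_{24}
\end{bmatrix}$
$A_{54} = [-4\lambda_S I_{4 \times 4}, 0]$, $A_{56} = [-\mu_S I_{3 \times 3}, 0]^T$,
$A_{55} = \begin{bmatrix}
3\lambda_h + 3\lambda_S + \mu_S + s & -\mu_h & 0 & 0 \\
-3\lambda_h & 2\lambda_h + 3\lambda_S + \mu_h + \mu_S + s & -2\mu_h & 0 \\
0 & -2\lambda_h & \lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s & -2\mu_h \\
0 & 0 & -\lambda_h & 2\mu_h + \mu_S + s
\end{bmatrix}$
$A_{67} = [-\mu_S I_{2 \times 2}, 0]^T$
$A_{65} = \begin{bmatrix}
-3\lambda_S & 0 & 0 & 0 \\
0 & -3\lambda_S & 0 & 0 \\
0 & 0 & -4\lambda_S & 0
\end{bmatrix}$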
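$A_{66} = \begin{bmatrix}
2\lambda_h + 2\lambda_S + \mu_S + s & -\mu_h & 0 \\
-2\lambda_h & \lambda_h + 2\lambda_S + \mu_h + \mu_S + s & -2\mu_h \\
0 & -\lambda_h & 2\mu_h + \mu_S + s
\end{bmatrix}$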
λ−+µ+µλ−
µ−+µ+λ+λ=
0s
sA
S
Shh
hSSh
76 ,
µ−
µ−=
S
S
77 0A
Other sub matrices Aij are zero matrices of appropriate size.
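In the sub-matrices, we have used the following notations:
$\Lambda_1 = 4\lambda_h + 3\alpha + 4\lambda_S + s$, $\Lambda_2 = 4\lambda_h + 2\alpha + 4\lambda_S + \mu_h + s$,
$\Lambda_3 = 4\lambda_h + \alpha + 4\lambda_S + 2\mu_h + s$, $\Lambda_4 = 4\lambda_h + 4\lambda_S + 3\mu_h + s$,
$\Lambda_5 = 3\lambda_h + 4\lambda_S + 3\mu_h + s$, $\Lambda_6 = 2\lambda_h + 4\lambda_S + 3\mu_h + s$,
$\Lambda_7 = \lambda_h + 4\lambda_S + 3\mu_h + s$, $\Lambda_8 = 4\lambda_h + 2\alpha + 4\lambda_S + \mu_S + s$,
$\Lambda_9 = 4\lambda_h + \alpha + 4\lambda_S + \mu_h + \mu_S + s$, $\Lambda_{10} = 4\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s$,
$\Lambda_{11} = 3\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s$, $\Lambda_{12} = 2\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s$,
$\Lambda_{13} = \lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s$, $\Lambda_{14} = 4\lambda_h + \alpha + 4\lambda_S + \mu_S + s$,
$\Lambda_{15} = 4\lambda_h + 4\lambda_S + \mu_h + \mu_S + s$, $\Lambda_{16} = 3\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s$,
$\Lambda_{17} = 2\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s$, $\Lambda_{18} = \lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s$,
$\Lambda_{19} = 2\mu_h + \mu_S + s$, $\Lambda_{20} = 4\lambda_h + 4\lambda_S + \mu_S + s$,
$\Lambda_{21} = 3\lambda_h + 4\lambda_S + \mu_h + \mu_S + s$, $\Lambda_{22} = 2\lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s$,
$\Lambda_{23} = \lambda_h + 4\lambda_S + 2\mu_h + \mu_S + s$, $\Lambda_{24} = 2\mu_h + \mu_S + s$.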
Using Cramer's rule, the Laplace transform $P^*_k(s)$ of the probabilities $P_k(t)$ can be obtained
as
$P^*_k(s) = \frac{|Q_{k+1}(s)|}{|Q(s)|}$, $0 \le k \le L$ …(5B.102)
where $|Q_{k+1}(s)|$ is the determinant obtained by replacing the (k+1)th column of the
determinant $|Q(s)|$ by the right-hand-side vector P(0).
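Cramer's rule as used in (5B.102) can be tried on a system small enough to check by hand. The sketch below is an illustration only: it uses a hypothetical two-state chain (failure rate `lam`, repair rate `mu`) instead of the 36-state matrix Q(s), evaluates everything at one fixed value of s, and works with exact fractions; the helper names `det` and `cramer` are assumptions.

```python
from fractions import Fraction


def det(m):
    """Determinant by cofactor expansion along the first row (fine for small m)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for c, a in enumerate(m[0]):
        minor = [row[:c] + row[c + 1:] for row in m[1:]]
        total += (-1) ** c * a * det(minor)
    return total


def cramer(q, rhs):
    """Solve q x = rhs: x_k = det(q with column k replaced by rhs) / det(q)."""
    d = det(q)
    solution = []
    for k in range(len(q)):
        qk = [row[:k] + [rhs[r]] + row[k + 1:] for r, row in enumerate(q)]
        solution.append(det(qk) / d)
    return solution


# Hypothetical two-state example: (s + lam) P0* - mu P1* = 1, -lam P0* + (s + mu) P1* = 0.
s, lam, mu = Fraction(1), Fraction(1, 2), Fraction(3)
Q = [[s + lam, -mu], [-lam, s + mu]]
P_star = cramer(Q, [Fraction(1), Fraction(0)])
```

For this toy system the result agrees with the closed forms $P^*_0(s) = (s+\mu)/[s(s+\lambda+\mu)]$ and $P^*_1(s) = \lambda/[s(s+\lambda+\mu)]$, and the two transforms add up to 1/s. Cofactor expansion grows factorially with the matrix size, so for a 36 × 36 matrix a factorisation-based determinant would be the practical choice.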
To calculate the characteristic roots of the matrix Q(s), we note that
s = 0 is one of the roots. Let s = -d, so that we get
$Q(-d) = Q - dI$ …(5B.103)
Now eq. (5B.101) becomes
$Q(-d)\,P^*(s) = (Q - dI)\,P^*(s) = P(0)$ …(5B.104)
It is observed that the eigenvalues of Q are real and distinct, and that Q is
positive definite, so all eigenvalues of Q are positive. Let
$\nu_k$ ($1 \le k \le L$) denote the eigenvalues of Q; then we get
$|Q(s)| = s \prod_{k=1}^{L} (s + \nu_k)$ …(5B.105)
$P^*_l(s) = \frac{|Q_{l+1}(s)|}{s \prod_{k=1}^{L} (s + \nu_k)}$, $1 \le l \le 36$ …(5B.106)
We may write $P^*_l(s)$ in partial-fraction form as
$P^*_0(s) = \frac{a_0}{s} + \sum_{k=1}^{L} \frac{a_{0k}}{s + \nu_k}$ …(5B.107)
$P^*_l(s) = \sum_{k=1}^{L} \frac{a_{lk}}{s + \nu_k}$ …(5B.108)
where $a_0$ and $a_{lk}$ are real numbers calculated as
$a_0 = \frac{|Q_1(0)|}{\prod_{j=1}^{L} \nu_j}$ …(5B.109)
and
$a_{lk} = -\frac{|Q_{l+1}(-\nu_k)|}{\nu_k \prod_{j=1, j \neq k}^{L} (\nu_j - \nu_k)}$, $1 \le l \le L$, $2 \le k \le L$ …(5B.110)
On taking the inverse Laplace transform of eqs (5B.107) and (5B.108), we get
$P_0(t) = \frac{|Q_1(0)|}{\prod_{k=1}^{L} \nu_k} - \sum_{k=1}^{L} \frac{|Q_1(-\nu_k)| \exp(-\nu_k t)}{\nu_k \prod_{j=1, j \neq k}^{L} (\nu_j - \nu_k)}$ …(5B.111)
$P_l(t) = -\sum_{k=1}^{L} \frac{|Q_{l+1}(-\nu_k)| \exp(-\nu_k t)}{\nu_k \prod_{j=1, j \neq k}^{L} (\nu_j - \nu_k)}$, where $1 \le l \le L$ …(5B.112)
5B.5 Performance Measures
The main objective of developing a mathematical model of a real-time system is
to quantify the performance of the system concerned. In this section, we obtain some
performance indices in terms of the probabilities obtained in the previous section as follows:
Expected number of failed components at time t due to hardware failure is
$F_H(t) = \sum_{i=1}^{M+S} \sum_{j=0}^{M+S-i} i\,P_{i,j}(t)$ …(5B.113)
Expected number of failed components at time t due to software failure is
$F_S(t) = \sum_{j=1}^{M+S} \sum_{i=0}^{M+S-j} j\,P_{i,j}(t)$ …(5B.114)
Expected number of standby components in the system at time t is
$S_C(t) = \sum_{i+j=0}^{S} (S - i - j)\,P_{i,j}(t)$ …(5B.115)
Component availability at time t is
$A_C(t) = 1 - \frac{F_H(t) + F_S(t)}{M+S}$ …(5B.116)
Failure frequency at time t is
$\omega_F(t) = \lambda_d P_{M+S-1}(t)$ …(5B.117)
Reliability of the system is
$R(t) = \sum_{i=0}^{M+S-1} \sum_{j=0}^{M+S-1} P_{i,j}(t)$ …(5B.118)
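As an illustration of how the indices (5B.113)-(5B.116) are evaluated once the state probabilities are known, the sketch below computes them from a dictionary keyed by (i, j). The function name `performance`, the dictionary layout and the sample probabilities are assumptions made for this example only; they are not values from the model.

```python
def performance(p, m, s):
    """Return (F_H, F_S, S_C, A_C) for state probabilities p[(i, j)].

    i = failed hardware components, j = failed software components,
    m = operating components, s = standbys, following (5B.113)-(5B.116).
    """
    f_h = sum(i * prob for (i, j), prob in p.items())
    f_s = sum(j * prob for (i, j), prob in p.items())
    # Standbys remain only while the total number of failures does not exceed s.
    s_c = sum((s - i - j) * prob for (i, j), prob in p.items() if i + j <= s)
    a_c = 1 - (f_h + f_s) / (m + s)
    return f_h, f_s, s_c, a_c


# Made-up distribution over four states, used only to exercise the formulas.
SAMPLE = {(0, 0): 0.5, (1, 0): 0.25, (0, 1): 0.125, (2, 2): 0.125}
F_H, F_S, S_C, A_C = performance(SAMPLE, m=4, s=3)
```

The availability formula can also be checked against the tabulated values: with M + S = 7, the entries F_H = 1.09 and F_S = 0.87 reported at t = 2 in the first column block of Table 5B.1(a) give 1 - 1.96/7 ≈ 0.72, which matches the tabulated A_C.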
5B.6 Numerical Results
In this section, we check the validity of the proposed model by employing the
Runge-Kutta (R-K) technique of fourth order and the matrix method to solve the system
of differential equations. The R-K method is implemented using MATLAB's
'ode45' function, an adaptive Runge-Kutta (4,5) solver. We consider a time span with
equal intervals. For different values of λh, λS, α, µh, µS and q, tables 5B.1(a)-5B.1(f) and
figs 5B.2(a)-5B.2(f) depict various performance measures and the reliability of the system.
For illustration purposes, we choose the default parameters as λ=0.9, λS=0.1, C=0.3, θ=0.6, β=1, µ=3.
In tables 5B.1(a)-5B.1(f), various performance measures such as expected
number of failed components due to hardware and software failure, expected number
of standby components, availability and failure frequency of the system are
summarized. From tables it is noticed that the expected number of failed components
due to software failures is increasing with respect to time but due to failures of
hardware components, it initially increases and after some time decreases gradually.
The expected number of standby components and the availability are also decreasing functions
of time, but the failure frequency increases with
time. From tables 5B.1(a)-5B.1(f), it is noticed that FH(t) increases as λh, α, µS, q
increase but decreases as λs and µh increase. It is seen that FS(t) increases on
increasing λs and µh but decreases on increasing the values of λh, µS. By increasing
the values of α and q, FS(t) increases. For other performance indices viz. SC(t), AC(t)
and ωF(t), we see that as λh, λs increase, SC(t) and AC(t) decrease but ωF(t) increases.
We see that when µh and µS increase, SC(t) and AC(t) show the increasing trend but
ωF(t) decreases. On increasing the switching failure probability q, it is found that
SC(t) and AC(t) decrease while ωF(t) increases. With respect to the parameter α,
SC(t) decreases but AC(t) and ωF(t) remain almost constant.
In figs 5B.2(a)-5B.2(f), we compute the reliability with respect to time t for
different system parameters. In figs 5B.2(a) and 5B.2(b), it is observed that reliability
decreases as time increases. Also as λh increases, the reliability decreases; however
we observe the reverse effect on increasing µh. The reliability with respect to time,
initially decreases and after some time it becomes almost constant as seen in figs
5B.2(c) and 5B.2(d). We also notice that the reliability decreases sharply on
increasing the values of λs; it increases as µs increases. From figs 5B.2(e) and 5B.2(f),
it can be observed that the reliability initially decreases sharply and thereafter decreases
slowly for higher values of α, q and time t, becoming almost constant after
some time.
Overall from the tables and figs we can conclude that the availability AC(t)
and reliability R(t) decrease with time. It is quite obvious to notice that as failure rate
of hardware components increases, the expected number of failed hardware
components increases whereas as the failure rate of software components increases,
the expected number of failed software components increases. Overall, on the basis of
numerical results, it can be concluded that a system would work more effectively with
the adequate support of standbys and repair facility.
5B.7 Conclusion
In this chapter, explicit expressions for the reliability, availability and other
performance measures are provided. It is noticed that the system reliability can be
improved by providing sufficient standbys and sufficient repairmen for an embedded
system which contains both hardware and software components. Our numerical
results indicate that switching failure has a significant effect on the reliability;
incorporating switching failure therefore makes our model more realistic
and versatile for dealing with real-time systems. Our future research will look at the
reliability analysis of the system with an unreliable server and/or server vacation.
Fig. 5B.1: State transition diagram
Table 5B.1(a): Performance indices for different values of λh
t λs=0.2 λs=0.5 λs=0.8
FH(t) FS(t) SC(t) AC(t) ωF(t) FH(t) FS(t) SC(t) AC(t) ωF(t) FH(t) FS(t) SC(t) AC(t) ωF(t)
0 0.00 0.00 3.00 1.00 0.000 0.00 0.00 3.00 1.00 0.000 0.00 0.00 3.00 1.00 0.000
2 1.09 0.87 1.27 0.72 0.000 0.72 2.53 0.55 0.54 0.005 0.44 3.99 0.19 0.37 0.023
4 0.97 1.28 1.09 0.68 0.001 0.43 3.79 0.25 0.40 0.019 0.19 5.27 0.04 0.22 0.066
6 0.90 1.52 0.99 0.65 0.001 0.31 4.39 0.15 0.33 0.032 0.14 5.59 0.02 0.18 0.084
8 0.86 1.67 0.93 0.64 0.002 0.26 4.66 0.10 0.30 0.039 0.13 5.67 0.01 0.17 0.090
10 0.84 1.76 0.90 0.63 0.002 0.24 4.77 0.09 0.28 0.043 0.12 5.69 0.01 0.17 0.091
12 0.82 1.82 0.88 0.62 0.002 0.23 4.83 0.08 0.28 0.044 0.12 5.69 0.01 0.17 0.091
t λh=0.3 λh=0.6 λh=0.9
FH(t) FS(t) SC(t) AC(t) ωF(t) FH(t) FS(t) SC(t) AC(t) ωF(t) FH(t) FS(t) SC(t) AC(t) ωF(t)
0 0.00 0.00 3.00 1.00 0.000 0.00 0.00 3.00 1.00 0.000 0.00 0.00 3.00 1.00 0.000
2 0.84 1.97 0.75 0.60 0.002 1.36 1.96 0.51 0.53 0.011 1.77 1.94 0.35 0.47 0.028
4 0.57 3.03 0.44 0.49 0.009 0.96 3.01 0.30 0.43 0.026 1.28 2.97 0.21 0.39 0.054
6 0.45 3.61 0.30 0.42 0.016 0.76 3.58 0.21 0.38 0.042 1.03 3.55 0.15 0.35 0.079
8 0.38 3.92 0.23 0.38 0.022 0.66 3.90 0.16 0.35 0.053 0.90 3.86 0.11 0.32 0.096
10 0.35 4.09 0.20 0.37 0.025 0.61 4.06 0.14 0.33 0.059 0.84 4.02 0.10 0.31 0.105
12 0.33 4.18 0.18 0.36 0.026 0.58 4.15 0.13 0.32 0.063 0.80 4.11 0.09 0.30 0.111
Table 5B.1(b): Performance indices for different values of λS
t α=0.1 α=0.2 α=0.3
FH(t) FS(t) SC(t) AC(t) ωF(t) FH(t) FS(t) SC(t) AC(t) ωF(t) FH(t) FS(t) SC(t) AC(t) ωF(t)
0 0.00 0.00 3.00 1.00 0.000 0.00 0.00 3.00 1.00 0.000 0.00 0.00 3.00 1.00 0.000
2 1.48 1.95 0.46 0.51 0.015 1.49 1.95 0.45 0.51 0.015 1.50 1.95 0.45 0.51 0.015
4 1.06 3.00 0.27 0.42 0.034 1.06 3.00 0.27 0.42 0.034 1.07 3.00 0.26 0.42 0.034
6 0.85 3.57 0.19 0.37 0.053 0.85 3.57 0.19 0.37 0.053 0.86 3.57 0.18 0.37 0.053
8 0.74 3.88 0.15 0.34 0.066 0.74 3.88 0.15 0.34 0.066 0.75 3.88 0.14 0.34 0.066
10 0.68 4.05 0.13 0.32 0.074 0.69 4.05 0.12 0.32 0.074 0.69 4.05 0.12 0.32 0.074
12 0.65 4.14 0.12 0.32 0.078 0.66 4.14 0.11 0.32 0.078 0.66 4.14 0.11 0.31 0.078
Table 5B.1(c): Performance indices for different values of α
t µh=3 µh=4 µh=5
FH(t) FS(t) SC(t) AC(t) ωF(t) FH(t) FS(t) SC(t) AC(t) ωF(t) FH(t) FS(t) SC(t) AC(t) ωF(t)
0 0.00 0.00 3.00 1.00 0.000 0.00 0.00 3.00 1.00 0.000 0.00 0.00 3.00 1.00 0.000
2 1.50 1.95 0.45 0.51 0.015 1.20 1.96 0.57 0.55 0.009 1.00 1.97 0.66 0.58 0.006
4 1.07 3.00 0.26 0.42 0.034 0.85 3.02 0.33 0.45 0.027 0.71 3.02 0.38 0.47 0.023
6 0.86 3.57 0.18 0.37 0.053 0.68 3.60 0.23 0.39 0.045 0.56 3.61 0.26 0.40 0.041
8 0.75 3.88 0.14 0.34 0.066 0.59 3.91 0.18 0.36 0.058 0.49 3.92 0.20 0.37 0.054
10 0.69 4.05 0.12 0.32 0.074 0.54 4.07 0.15 0.34 0.065 0.45 4.09 0.17 0.35 0.061
12 0.66 4.14 0.11 0.31 0.078 0.52 4.16 0.14 0.33 0.069 0.43 4.17 0.16 0.34 0.065
Table 5B.1(d): Performance indices for different values of µh
Table 5B.1(e): Performance indices for different values of µS
t µS=2 µS=3 µS=4
FH(t) FS(t) SC(t) AC(t) ωF(t) FH(t) FS(t) SC(t) AC(t) ωF(t) FH(t) FS(t) SC(t) AC(t) ωF(t)
0 0.00 0.00 3.00 1.00 0.000 0.00 0.00 3.00 1.00 0.000 0.00 0.00 3.00 1.00 0.0000
2 1.75 1.26 0.66 0.57 0.009 1.90 0.85 0.80 0.61 0.006 1.99 0.60 0.89 0.63 0.0041
4 1.56 1.64 0.57 0.54 0.012 1.84 0.96 0.77 0.60 0.006 1.97 0.63 0.88 0.63 0.0041
6 1.49 1.80 0.54 0.53 0.014 1.82 0.98 0.77 0.60 0.006 1.97 0.63 0.88 0.63 0.0042
8 1.46 1.86 0.53 0.52 0.014 1.82 0.99 0.77 0.60 0.006 1.97 0.63 0.88 0.63 0.0042
10 1.45 1.89 0.52 0.52 0.015 1.82 0.99 0.77 0.60 0.006 1.97 0.63 0.88 0.63 0.0042
12 1.45 1.90 0.52 0.52 0.015 1.82 0.99 0.77 0.60 0.006 1.97 0.63 0.88 0.63 0.0042
Table 5B.1(f): Performance indices for different values of q
t q=0.3 q=0.5 q=0.7
FH(t) FS(t) SC(t) AC(t) ωF(t) FH(t) FS(t) SC(t) AC(t) ωF(t) FH(t) FS(t) SC(t) AC(t) ωF(t)
0 0.00 0.00 3.00 1.00 0.000 0.00 0.00 3.00 1.00 0.000 0.00 0.00 3.00 1.00 0.000
2 1.15 1.96 0.62 0.56 0.012 1.27 1.96 0.56 0.54 0.013 1.39 1.96 0.50 0.52 0.014
4 0.87 3.01 0.35 0.45 0.033 0.94 3.01 0.32 0.44 0.033 1.00 3.00 0.29 0.43 0.034
6 0.72 3.58 0.24 0.38 0.052 0.77 3.58 0.22 0.38 0.053 0.81 3.58 0.20 0.37 0.053
8 0.64 3.89 0.19 0.35 0.066 0.68 3.89 0.17 0.35 0.066 0.71 3.89 0.16 0.34 0.066
10 0.60 4.06 0.16 0.33 0.073 0.63 4.06 0.15 0.33 0.073 0.66 4.05 0.13 0.33 0.073
12 0.58 4.14 0.15 0.32 0.077 0.61 4.14 0.13 0.32 0.077 0.63 4.14 0.12 0.32 0.078
Fig. 5B.2(a): Reliability R(t) vs time t by varying λh (curves for λh = 0.3, 0.6, 0.9)
Fig. 5B.2(b): Reliability R(t) vs time t by varying µh (curves for µh = 3, 4, 5)
Fig. 5B.2(c): Reliability R(t) vs time t by varying λS (curves for λS = 0.4, 0.6, 0.8)
Fig. 5B.2(d): Reliability R(t) vs time t by varying µS (curves for µS = 2, 3, 4)
Fig. 5B.2(e): Reliability R(t) vs time t by varying α (curves for α = 0, 0.3, 0.5)
Fig. 5B.2(f): Reliability R(t) vs time t by varying q (curves for q = 0.1, 0.5, 0.9)
Chapter-6
Availability Analysis of Repairable Redundant System with Reboot Delay
Section-6A Hardware-Software System with Switching Failure
Section-6B Repairable System with Warm Standby and Switching Failure
Hardware-Software System with Switching Failure
6A.1 Introduction
6A.2 Model Description and Governing Equations
6A.3 Availability Prediction
6A.4 Performance Measures
6A.5 Sensitivity Analysis
6A.6 Conclusions
Section-6A
Section-6A: Hardware-Software System with Switching Failure
128
In this section, we present the availability analysis of a
system having N software components, M hardware components,
S warm standby units and two types of repairmen. Switching
failures, common cause failure and reboot delay are also
considered. The life times, repair times and reboot delay times of
the software and hardware components are assumed to be
exponentially distributed. Numerical results are provided for
various performance indices. Sensitivity analysis is also carried
out to explore the changes in the availability characteristics with
the variation of different system parameters. The Adaptive
Network-based Fuzzy Inference System (ANFIS) approach is also
employed to exhibit the scope of soft computing for numerical
tractability.
6A.1 Introduction
System availability is an increasingly important issue in power plants,
manufacturing systems, industrial systems and standby systems. To maintain a
high degree of quality, availability prediction is often an essential requisite. No
production system can work continuously, because component failures during
operation interrupt it.
Spare parts support has been widely used for improving the system reliability
and availability. There is enormous literature available on machine interference
problems with spares. The reliability analysis has been an area of interest for many
researchers due to its applications in many organizations working in machining
environments. Aven (1990) gave availability formulae for the standby systems of
similar units that are preventively maintained. Subramanian and Anantharaman
(1995) and Azaron et al. (2005) carried out time-dependent reliability analyses of
complex standby redundancy systems. Jain et al. (2004) provided
prediction of machine interference model with spare and two modes of failure. Cha et
al. (2008) modeled a general standby system and provided various indices for
predicting its performance. A standby system by considering human error failures,
hardware and preventive maintenance was studied by Mahmoud and Moshref
(2010). Wang et al. (2011a) proposed the availability model and parameters
estimation method for the delay time model with imperfect maintenance at inspection.
In the past decade, many researchers have worked on the reliability and
availability of the machining system with warm standby. A system with warm
standby has been studied in detail by considering different conditions, such as
repairable system and human error (Guo and Hua, 2003), general repair times (Wang
et al. 2005) and components having proportional hazard rates (Li et al., 2009a).
Mokaddis et al. (1997) analyzed two unit warm standby system subject to
degradation. Wang et al. (2004) studied the reliability and sensitivity analysis of a
system with M operating machines, S warm standbys, and a repairable service station
and presented derivations for the system reliability and the mean time to system
failure. Recently, Yun and Cha (2010) discussed a standby system with two units and
determined the optimal switching time which maximizes the expected system life,
along with the related allocation problem. The stability analysis of a new kind of n-unit
series repairable system was studied by Guo et al. (2011).
Most studies about the reliability of a system assume that the switchover from
warm standby unit to primary one is always perfect. But in the real life situations, this
assumption only simplifies the analysis of the problem because a warm standby unit
might not be able to switchover to a primary unit successfully. The concept of
imperfect switching has been discussed by many researchers. Goel and Shrivastava
(1992) studied two unit standby systems with imperfect switch, preventive
maintenance and correlated failures and repairs. Hsieh and Hsieh (2003) discussed
the reliability and cost optimization in distributed computer systems. Huang et al.
(2006) suggested parametric nonlinear programming approach for a repairable system
with switching failure and fuzzy parameter. Wang et al. (2006) and Wang and Chen
(2009) compared the reliability/availability of the warm standby systems with general
repair times, reboot delay and switching failures. Levitin and Amari (2010) studied a
k-out-of-n system with shared standby elements along with an algorithm for evaluating
the time-to-failure distribution. Kancev and Cepin (2011) evaluated risk and cost
using age-dependent unavailability modeling of test and maintenance for the
standby components.
Redundant repairable systems have been studied extensively in the past, and a
detailed bibliography can be found in Zhang and Horigome (2001). They analyzed
the systems in which one or more components can fail simultaneously due to a
cumulative shock-damage process. They provided the reliability and availability
analysis of repairable systems, where failure and repair rates of components can be
varying with time. Yadavalli et al. (2002) studied the steady-state availability of a
two-unit parallel system with the introduction of preparation time for the repair
facility. Seo et al. (2003) proposed the life time and reliability estimation of
repairable redundant system subject to periodic alternation. Balsamo et al. (2004)
described the model-based performance in software development. Raj Kiran and
Ravi (2008) predicted the ensemble models to accurately forecast software reliability.
Various statistical and intelligent techniques such as neuro-fuzzy inference system
constitute the ensembles presented. Reliability analysis for a k/n(F) system with repair
was done by Zhang and Wu (2009). Levitin and Xing (2010) analyzed the reliability
and performance of multi-state systems with propagated failures having selective
effect. Wang et al. (2011b) developed the availability model and proposed the
parameter estimation method for the delay time model with imperfect maintenance at
inspection.
Common cause failure, where a single failure event can propagate and cause
failure of more than one unit in the system, has drawn the attention of many
investigators working in the area of system reliability and availability. There are
several internal factors (e.g. designing deficiencies, fabrication, etc.) and external
factors (e.g. environmental conditions like temperature, dust and humidity, power
failure, fire, flood, earthquake, etc.) which can lead to common cause failure. The
study of systems incorporating the common cause factor, whether for predicting
the behavior of new designs or for studying possible changes
to existing ones, is a challenging task. Hu (2006) discussed a repairable system with
warm standbys under common cause failure. Reliability of standby safety systems due
to independent and common cause failures was calculated by Lu and Lewis (2006). Li
et al. (2010) considered the heterogeneous redundancy optimization for multi-state
series-parallel systems subject to common cause failure.
Some researchers have worked in the field of reliability to analyze the
repairable components of hardware systems with standby, switching failures and other
different environments. But no one has considered the reliability and availability
analysis for both hardware and software components of any system under different
arguments. The present investigation is concerned with the reliability and availability
analysis for the system having both hardware and software components by
considering realistic situations of warm standby provisioning, switching failures,
common cause failure and reboot delay. In section 6A.2, we describe the model and
provide notations used throughout the chapter. In section 6A.3, the analysis is
provided along with illustrations of the proposed model. Performance measures have
been derived in section 6A.4. Numerical results have been given in section 6A.5.
Finally conclusions are made to highlight the key features of our study in section
6A.6.
6A.2 Model Description and Governing Equations
Consider the performance analysis, based on availability measures, of a
hardware-software system. The embedded system consists of N software and M
hardware components, with the provisioning of S warm standby hardware
components. The concepts of switching failures and reboot delay are taken into
consideration. At time t=0, the system is in the working state, i.e., there are no failed
components. The life-times of the software and hardware components are
exponentially distributed with means $1/\lambda_S$ and $1/\lambda_h$, respectively. There is provision for the
repair of failed components; the repair times are exponentially distributed with
parameters $\mu_S$ and $\mu_h$ for software and hardware components, respectively. When an
operating hardware component fails, a warm standby hardware unit is immediately
substituted in its place and the failed component is sent for repair. The model is
based on the following assumptions and notations:
Each of the operating units fails independent of the state of the others.
When the repair of a failed hardware/software component is completed, it is as good as a new one.
The repaired hardware component joins the operating group if that group is short of components; otherwise it joins the standby group.
The standby hardware components fail independently of the state of all others and have an exponential life time distribution with parameter α.
The system may also fail due to failure of the power supply unit.
When all hardware standbys are exhausted, the remaining operating hardware components fail with degraded failure rate λd.
When a standby hardware component moves into an operating state, its characteristics are the same as those of an operating component.
The switching device, which is used to replace a failed hardware component by a standby hardware component, is subject to failure with probability q during switching from the standby state to the operating state.
After switching, a reboot delay takes place, which is exponentially distributed with mean time 1/β.
Notations
Following notations have been used for formulating the mathematical model:
M (N) : Number of hardware (software) operating components.
S : Number of warm standby hardware components.
λh (λS) : Failure rate of a hardware (software) operating component.
λd : Degraded failure rate of a hardware operating component.
λp : Power supply failure rate of the system.
α : Failure rate of a warm standby hardware component.
µh (µS) : Repair rate of a failed hardware (software) component.
q : Switching failure probability of a hardware standby component.
1/β : Mean reboot delay when a hardware standby component becomes an operating component.
P(i,j,k) : Steady state probability that there are i operating software components, j operating hardware components and k standby hardware components in the system, where i=0,1,2,…,N, j=0,1,2,…,M and k=0,1,2,…,S.
R(t) : Reliability function of the system at time t.
Fig. 6A.1: State transition diagram
Chapman-Kolmogorov equations governing the model are constructed by
using appropriate transition rates as follows (see transition diagram given in fig.
6A.1):
$-(M\lambda_h + S\alpha + N\lambda_S + \lambda_p)P_{(N,M,S)} + \mu_h P_{(N,M,S-1)} + \mu_S P_{(N-1,M,S)} = 0$ …(6A.1)
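$-[M\lambda_h + (S-n)\alpha + N\lambda_S + \beta + \mu_h + \lambda_p]P_{(N,M,S-n)} + (S-n+1)\alpha P_{(N,M,S-n+1)} + \mu_h P_{(N,M,S-n-1)} + \mu_S P_{(N-1,M,S-n)} = 0$, $1 \le n \le S-1$ …(6A.2)
$-[M\lambda_h + N\lambda_S + \beta + \mu_h + \lambda_p]P_{(N,M,0)} + \alpha P_{(N,M,1)} + \mu_h P_{(N,M-1,0)} + \mu_S P_{(N-1,M,0)} = 0$ …(6A.3)
$-\lambda_p P_{(N,M-1,S)} + M\lambda_h(1-q)P_{(N,M,S)} + \beta P_{(N,M,S-1)} = 0$ …(6A.4)
$-\lambda_p P_{(N,M-1,S-n)} + \beta P_{(N,M,0)} + M\lambda_h \sum_{r=0}^{n} q^{n-r}(1-q)P_{(N,M,S-r)} = 0$, $1 \le n \le S-1$ …(6A.5)
$-[(M-1)\lambda_d + \mu_h + \lambda_p]P_{(N,M-1,0)} + M\lambda_h P_{(N,M,0)} + \mu_h P_{(N,M-2,0)} + M\lambda_h \sum_{n=0}^{S-1} q^{(S-n)} P_{(N,M,n)} = 0$, $1 \le n \le S-1$ …(6A.6)
$-\mu_h P_{(L_h)} + \lambda_d P_{(L_h-1)} = 0$ …(6A.7)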
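$-[M\lambda_h + S\alpha + (N-i)\lambda_S + \mu_S + \lambda_p]P_{(N-i,M,S)} + \mu_h P_{(N-i,M,S-1)} + \mu_S P_{(N-1-i,M,S)} + (N+1-i)\lambda_S P_{(N+1-i,M,S)} = 0$, $1 \le i \le N-1$ …(6A.8)
$-[M\lambda_h + (S-k)\alpha + (N-i)\lambda_S + \beta + \mu_h + \mu_S + \lambda_p]P_{(N-i,M,S-k)} + (S-k+1)\alpha P_{(N-i,M,S-k+1)} + \mu_h P_{(N-i,M,S-k-1)} + \mu_S P_{(N-1-i,M,S-k)} + (N+1-i)\lambda_S P_{(N+1-i,M,S-k)} = 0$, $1 \le i \le N-1$, $1 \le k \le S-1$ …(6A.9)
$-[M\lambda_h + (N-i)\lambda_S + \beta + \mu_h + \mu_S + \lambda_p]P_{(N-i,M,0)} + \alpha P_{(N-i,M,1)} + \mu_h P_{(N-i,M-1,0)} + \mu_S P_{(N-1-i,M,0)} + (N+1-i)\lambda_S P_{(N+1-i,M,0)} = 0$, $1 \le i \le N-1$ …(6A.10)
$-\lambda_p P_{(N-i,M-1,S)} + M\lambda_h(1-q)P_{(N-i,M,S)} + \beta P_{(N-i,M,S-1)} = 0$ …(6A.11)
$-\lambda_p P_{(N-i,M-1,S-k)} + \beta P_{(N-i,M,0)} + M\lambda_h \sum_{r=0}^{k} q^{k-r}(1-q)P_{(N-i,M,S-r)} = 0$, $1 \le i \le N-1$, $1 \le k \le S-1$ …(6A.12)
$-[(M-j)\lambda_d + \mu_h + \lambda_p]P_{(N-i,M-j,0)} + M\lambda_h P_{(N-i,M,0)} + \mu_h P_{(N-i,M-j-1,0)} + M\lambda_h \sum_{n=0}^{S-1} q^{(S-n)} P_{(N-i,M,n)} = 0$, $1 \le i \le N-1$, $1 \le j \le M$ …(6A.13)
$-\mu_h P_{(L_S)} + \lambda_d P_{(L_S-1)} = 0$ …(6A.14)
$-\mu_S P_{(0,M,S)} + \lambda_S P_{(0,M,S)} = 0$ …(6A.15)
$-\mu_S P_{(0,M,S-k)} + \lambda_S P_{(0,M,S-k)} = 0$, $1 \le i \le N-1$, $1 \le k \le S$ …(6A.16)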
To solve equations (6A.1)-(6A.16), we employ a numerical method based on successive
over-relaxation (SOR). To examine the tractability of the proposed approach, we consider
an illustration and outline the solution procedure in the next section.
6A.3 Availability Prediction
In this section, we consider an embedded computer system having 2 software
components, 3 hardware operating components and 2 standby hardware components.
The difference equations associated with the system states are constructed as follows:
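$\mu_h P_{(2,3,1)} + \mu_S P_{(1,3,2)} - (3\lambda_h + 2\alpha + 2\lambda_S + \lambda_p)P_{(2,3,2)} = 0$ …(6A.17)
$2\alpha P_{(2,3,2)} + \mu_h P_{(2,3,0)} + \mu_S P_{(1,3,1)} - (3\lambda_h + \alpha + 2\lambda_S + \beta + \mu_h + \lambda_p)P_{(2,3,1)} = 0$ …(6A.18)
$\alpha P_{(2,3,1)} + \mu_h P_{(2,2,0)} + \mu_S P_{(1,3,0)} - (3\lambda_h + 2\lambda_S + \beta + \mu_h + \lambda_p)P_{(2,3,0)} = 0$ …(6A.19)
$3\lambda_h P_{(2,3,0)} + \mu_h P_{(2,1,0)} + 3\lambda_h q^2 P_{(2,3,2)} + 3\lambda_h q P_{(2,3,1)} - (2\lambda_d + \mu_h + \lambda_p)P_{(2,2,0)} = 0$ …(6A.20)
$2\lambda_d P_{(2,2,0)} + \mu_h P_{(2,0,0)} - (\lambda_d + \mu_h + \lambda_p)P_{(2,1,0)} = 0$ …(6A.21)
$\lambda_d P_{(2,1,0)} - \mu_h P_{(2,0,0)} = 0$ …(6A.22)
$3\lambda_h(1-q)P_{(2,3,2)} + \beta P_{(2,3,1)} - \lambda_p P_{(2,2,2)} = 0$ …(6A.23)
$3\lambda_h(1-q)P_{(2,3,1)} + \beta P_{(2,3,0)} + 3\lambda_h q(1-q)P_{(2,3,2)} - \lambda_p P_{(2,2,1)} = 0$ …(6A.24)
$\mu_h P_{(1,3,1)} + 2\lambda_S P_{(2,3,2)} + \mu_S P_{(0,3,2)} - (3\lambda_h + 2\alpha + \lambda_S + \mu_S + \lambda_p)P_{(1,3,2)} = 0$ …(6A.25)
$2\alpha P_{(1,3,2)} + \mu_h P_{(1,3,0)} + 2\lambda_S P_{(2,3,1)} + \mu_S P_{(0,3,1)} - (3\lambda_h + \alpha + \lambda_S + \mu_S + \beta + \mu_h + \lambda_p)P_{(1,3,1)} = 0$ …(6A.26)
$\alpha P_{(1,3,1)} + \mu_h P_{(1,2,0)} + 2\lambda_S P_{(2,3,0)} + \mu_S P_{(0,3,0)} - (3\lambda_h + \lambda_S + \mu_S + \beta + \mu_h + \lambda_p)P_{(1,3,0)} = 0$ …(6A.27)
$3\lambda_h P_{(1,3,0)} + \mu_h P_{(1,1,0)} + 3\lambda_h q^2 P_{(1,3,2)} + 3\lambda_h q P_{(1,3,1)} - (2\lambda_d + \mu_h + \lambda_p)P_{(1,2,0)} = 0$ …(6A.28)
$2\lambda_d P_{(1,2,0)} + \mu_h P_{(1,0,0)} - (\lambda_d + \mu_h + \lambda_p)P_{(1,1,0)} = 0$ …(6A.29)
$\lambda_d P_{(1,1,0)} - \mu_h P_{(1,0,0)} = 0$ …(6A.30)
$3\lambda_h(1-q)P_{(1,3,2)} + \beta P_{(1,3,1)} - \lambda_p P_{(1,1,2)} = 0$ …(6A.31)
$3\lambda_h(1-q)P_{(1,3,1)} + 3\lambda_h q(1-q)P_{(1,3,2)} + \beta P_{(1,3,0)} - \lambda_p P_{(1,1,1)} = 0$ …(6A.32)
$\lambda_S P_{(1,3,2)} - \mu_S P_{(0,3,2)} = 0$ …(6A.33)
$\lambda_S P_{(1,3,1)} - \mu_S P_{(0,3,1)} = 0$ …(6A.34)
$\lambda_S P_{(1,3,0)} - \mu_S P_{(0,3,0)} = 0$ …(6A.35)
The normalizing condition is
$\sum_{i=0}^{2}\sum_{j=0}^{3}\sum_{k=0}^{2} P_{(i,j,k)} = 1$ …(6A.36)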
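For equations (6A.17)-(6A.36), we get the matrix equation as
AP=0
where the coefficient matrix A is partitioned as
$A = \begin{bmatrix} A_1 & A_2 & A_3 \\ A_4 & A_5 & A_6 \\ A_7 & A_8 & A_9 \end{bmatrix}$
and A1, A2, etc. are given as follows:
$A_1 = \begin{bmatrix} B_1 & B_2 \\ B_3 & B_4 \end{bmatrix}$, $A_2 = \begin{bmatrix} C_1 & C_2 \\ C_3 & C_4 \end{bmatrix}$, $A_3 = [0]_{8\times 6}$
$A_4 = \begin{bmatrix} D_1 & D_2 \\ D_3 & D_4 \end{bmatrix}$, $A_5 = \begin{bmatrix} E_1 & E_2 \\ E_3 & E_4 \end{bmatrix}$, $A_6 = \begin{bmatrix} F_1 & F_2 \\ F_3 & F_4 \end{bmatrix}$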
64ppppppp
7
0000000000000000000000000
A
×
λλλλλλλ
= ,
64pppppp
S
S
S
8 000000000000000
A
×
λλλλλλλ
λλ
=
64pp
S
S
S
9
0000000000000000000
A
×
λλµ−
µ−µ−
=
( )( )
( )( )
44hd2h3qh32qh3hhS2h30
0hhS2h32
00hS22h3
1B
p
p
p
p
×λ+µ+λ−λλλ
µµ+λ+β+λ+λ−α
µµ+λ+β+λ+α+λ−α
µλ+λ+α+λ−
=
44h
2
000000000000000
B
×
µ
= , ( )( ) ( )
44hh
h
d
3
0q13q1q300q130000
2000
B
×
β−λ−λβ−λ
λ
=
( )
44p
p
hd
hphd
4
0000000000
B
×
λλ−
µ−λµλ+µ+λ−
=
34
S
S
S
1
00000
0000
C
×
µµ
µ
= , [ ] 342 0C ×= , [ ] 343 0C ×= , [ ] 344 0C ×=
44
S
S
S
1
0000
000200
020002
D
×
λλ
λ
= , [ ] 442 0D ×= , [ ] 443 0D ×= , [ ] 444 0D ×=
34h3qh32qh3
hpSSh30
hShpSh32
0hpSS22h3
1E
×λλλ
µ+λ+β+λ+µ+λ−α
µµ+µ+λ+β+λ+α+λ−α
µλ+µ+λ+α+λ−
=
( )34hphd
h2
0200000000
E
×
µλ+µ+λ−µ
= , ( )( ) ( )
34hh
h3
q13q1q30q13000000
E
×
β−λ−λ−β−λ
=
( )
34
hd
hpdhd
4
000000
02
E
×
µ−λµλ+λ+µ−λ
=
34
S
1
000000000
00
F
×
µ
= ,
34
S
S2
0000000000
F
×
µµ
= ,
34p
p3
0000000000
F
×
λ−λ−
= , [ ] 344 0F ×=
The probability vector P is given by
[ ]T321 ,, PPPP =
where
( ) ( ) ( ) ( ) ( ) ( ) ( ) ( )[ ]2,2,12,2,22,0,02,1,02,2,02,3,02,3,12,3,2 PPPPPPPP +++++++=1P
( ) ( ) ( ) ( ) ( ) ( ) ( )[ ]1,1,11,1,21,0,01,1,01,2,01,3,01,3,1 PPPPPPP ++++++=2P
( ) ( ) ( )[ ]0,3,00,3,10,3,2 ppP ++=3P
We have used the successive over relaxation (SOR) technique which is a
powerful numerical method for solving a linear system of equations.
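The SOR iteration mentioned above can be illustrated on a small Markov chain. The following is a minimal sketch, not the implementation used in the thesis: the function name `sor_steady_state`, the generator argument `Q`, the relaxation parameter `omega` and the renormalization after every sweep are all assumptions made here for illustration. It solves the balance equations of a continuous-time Markov chain together with the normalizing condition.

```python
import numpy as np

def sor_steady_state(Q, omega=1.0, tol=1e-12, max_sweeps=10_000):
    """Steady-state vector p of a CTMC with generator Q (rows sum to zero).

    Hypothetical helper: applies SOR sweeps to the balance equations
    Q^T p = 0 and rescales p to sum to one after each sweep.
    omega = 1.0 reduces to the Gauss-Seidel method.
    """
    A = np.asarray(Q, dtype=float).T      # balance equations: A p = 0
    n = A.shape[0]
    p = np.full(n, 1.0 / n)               # uniform starting guess
    for _ in range(max_sweeps):
        p_old = p.copy()
        for i in range(n):
            # Sum of off-diagonal terms, using already-updated entries.
            off_diag = A[i] @ p - A[i, i] * p[i]
            gauss_seidel_value = -off_diag / A[i, i]
            p[i] = (1.0 - omega) * p[i] + omega * gauss_seidel_value
        p /= p.sum()                      # enforce the normalizing condition
        if np.max(np.abs(p - p_old)) < tol:
            return p
    raise RuntimeError("SOR did not converge within max_sweeps")
```

For the model of this section, `Q` would be assembled from the transition rates of fig. 6A.1; a suitable value of `omega` generally has to be found by experiment.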
6A.4 Performance Measures
In this section, we find the expressions for various performance measures in terms
of probabilities determined by SOR method as follows:
The availability of the system is given by
$A = 1 - \left[\sum_{i=1}^{N} P_{(i,0,0)} + \sum_{k=0}^{S} P_{(0,M,k)} + \sum_{k=1}^{S} P_{(N,M-1,k)} + \sum_{k=1}^{S} P_{(N-1,M-2,k)}\right]$ …(6A.36)
The system availability, when both software are working, is given by
( ) ( )
+−= ∑ ∑
= =
S
0k
S
1kk1,-MN,kN,0,2S PP1A …(6A.36)
The system availability, when one software is working, is given by
( ) ( )
+−= ∑ ∑
= =
S
0k
S
1kk2,-M1,-Nk1,0,-N2S PP1A …(6A.36)
The frequency of the system failure is
( ) ( )( ) ( ) ( )
+−λ+
λ=ω ∑ ∑∑
= =−−−
=
S
1k
S
1kk,2M,1Nk,1M,N
N
1i0,0,idf PPq13P …(6A.36)
6A.5 Sensitivity Analysis
In this section, we present a computational experiment to explore the system
availability and other performance measures. The numerical results obtained by the
SOR (successive over-relaxation) technique are compared with neuro-fuzzy
results obtained by building an Adaptive Network Based Fuzzy Inference System (ANFIS) in
MATLAB 7.4. The neuro-fuzzy (NF) method is characterized by its membership
functions, which are defined through a tolerance limit. We use the Gaussian
function to describe the membership function. For all approximations, ANFIS is
trained for 100 epochs. For illustration purposes, we fix the default parameters as β=0.8,
μS=1, μh=2, α=0.5, q=0.5, λh=0.3, λd=0.7, λp=0.9 and λS=0.4.
Tables 6A.1(a)-6A.1(d) show various performance measures for the variation of
different parameters. In table 6A.1(a), the degradation failure rate λd
and the switching failure probability q are varied; we observe that as λd increases, A, AS2 and AS1
decrease while ωf remains unchanged at the tabulated precision, and that all four indices decrease as q
increases. Tables 6A.1(b)-6A.1(d) give the corresponding results for the variation in
λS, β and α, respectively: A decreases as λS or α increases and increases with β.
Figs 6A.2(a)-6A.2(c) are drawn for availability with the variation in the failure rate
of hardware components (λh). Figs 6A.3(a)-6A.3(c) exhibit the availability on
changing the repair rate of the software components (μS). The fuzzy
membership functions for λh and μS are shown in figs 6A.2(d) and 6A.3(d),
respectively. The effect of the switching failure probability q on the availability can be
examined from figs 6A.2(a) and 6A.3(a). It is noted that as λh, q and μS increase,
the availability decreases. Further, we display the effect of the reboot rate β with
respect to λh and μS in figs 6A.2(b) and 6A.3(b), respectively. We observe that as
λh and μS increase, the availability decreases, but as β increases, the availability increases in both
cases. Figs 6A.2(c) and 6A.3(c) reveal that the availability hardly changes with the failure rate
of standby α, while it still decreases as λh and μS increase. The
corresponding availability curves for various values of α are almost identical, which
shows that the failure rate of the standby has very little effect on the system availability.
Based on sensitivity analysis, we conclude that the availability changes
significantly by changing the parameters λh, λS, q and β. The effect of failure rate (α)
of spare component on the availability is not significant; this may be due to choice of
other parameter values.
6A.6 Conclusion
In this chapter, we have provided the availability analysis of the embedded
system having both hardware and software components under the assumption that the
standby switching at the primary state might fail. The availability and other
performance indices obtained may be helpful to improve the availability of the
concerned system, in particular when reboot and switching failures are prevalent. Our
study may be used in computer and communication networks, distributed computing
systems, etc.
Table 6A.1(a): Performance indices for different values of λd
q    | λh=0.1                          | λh=0.5                          | λh=0.9
     | A       AS2     AS1     ωf      | A       AS2     AS1     ωf      | A       AS2     AS1     ωf
0    | 0.9112  0.8363  0.0749  0.7599  | 0.9043  0.8299  0.0744  0.7599  | 0.8939  0.8204  0.0735  0.7599
0.1  | 0.9069  0.8323  0.0746  0.6759  | 0.8999  0.8259  0.0740  0.6759  | 0.8894  0.8163  0.0732  0.6759
0.2  | 0.9026  0.8283  0.0743  0.5939  | 0.8955  0.8218  0.0737  0.5939  | 0.8849  0.8121  0.0728  0.5939
0.3  | 0.8982  0.8243  0.0739  0.5141  | 0.8911  0.8177  0.0733  0.5141  | 0.8804  0.8079  0.0725  0.5141
0.4  | 0.8939  0.8203  0.0736  0.4363  | 0.8867  0.8137  0.0730  0.4363  | 0.8759  0.8038  0.0721  0.4363
0.5  | 0.8895  0.8163  0.0733  0.3607  | 0.8822  0.8096  0.0727  0.3607  | 0.8714  0.7996  0.0718  0.3607
0.6  | 0.8852  0.8123  0.0729  0.2871  | 0.8778  0.8055  0.0723  0.2871  | 0.8669  0.7955  0.0714  0.2871
0.7  | 0.8809  0.8083  0.0726  0.2157  | 0.8734  0.8015  0.0720  0.2157  | 0.8624  0.7913  0.0711  0.2157
0.8  | 0.8765  0.8043  0.0722  0.1463  | 0.8690  0.7974  0.0716  0.1463  | 0.8578  0.7871  0.0707  0.1463
0.9  | 0.8722  0.8003  0.0719  0.0790  | 0.8646  0.7933  0.0713  0.0790  | 0.8533  0.7830  0.0704  0.0790
Table 6A.1(b): Performance indices for different values of λS
q    | λS=0.1                          | λS=0.5                          | λS=0.9
     | A       AS2     AS1     ωf      | A       AS2     AS1     ωf      | A       AS2     AS1     ωf
0    | 0.9329  0.9118  0.0211  0.7967  | 0.8887  0.8000  0.0886  0.7486  | 0.8488  0.7127  0.1361  0.7072
0.1  | 0.9283  0.9073  0.0210  0.7088  | 0.8843  0.7961  0.0882  0.6658  | 0.8445  0.7091  0.1354  0.6288
0.2  | 0.9237  0.9028  0.0209  0.6230  | 0.8798  0.7921  0.0878  0.5850  | 0.8403  0.7055  0.1348  0.5524
0.3  | 0.9191  0.8983  0.0208  0.5394  | 0.8754  0.7881  0.0874  0.5063  | 0.8361  0.7019  0.1341  0.4779
0.4  | 0.9145  0.8938  0.0207  0.4579  | 0.8710  0.7841  0.0870  0.4297  | 0.8318  0.6983  0.1335  0.4055
0.5  | 0.9099  0.8893  0.0206  0.3786  | 0.8666  0.7801  0.0865  0.3552  | 0.8276  0.6947  0.1328  0.3351
0.6  | 0.9053  0.8848  0.0205  0.3015  | 0.8622  0.7761  0.0861  0.2827  | 0.8233  0.6912  0.1322  0.2667
0.7  | 0.9007  0.8803  0.0205  0.2265  | 0.8578  0.7721  0.0857  0.2123  | 0.8191  0.6876  0.1315  0.2003
0.8  | 0.8961  0.8758  0.0204  0.1536  | 0.8534  0.7681  0.0853  0.1440  | 0.8149  0.6840  0.1309  0.1358
0.9  | 0.8915  0.8713  0.0203  0.0829  | 0.8490  0.7641  0.0849  0.0778  | 0.8106  0.6804  0.1302  0.0734
Table 6A.1(c): Performance indices for different values of β
q    | β=0.1                           | β=0.5                           | β=0.9
     | A       AS2     AS1     ωf      | A       AS2     AS1     ωf      | A       AS2     AS1     ωf
0    | 0.7134  0.6502  0.0631  0.3139  | 0.8253  0.7553  0.0700  0.5823  | 0.9223  0.8472  0.0751  0.8152
0.1  | 0.7084  0.6457  0.0627  0.2735  | 0.8206  0.7510  0.0696  0.5156  | 0.9179  0.8432  0.0747  0.7257
0.2  | 0.7034  0.6411  0.0623  0.2355  | 0.8160  0.7467  0.0693  0.4512  | 0.9135  0.8391  0.0744  0.6384
0.3  | 0.6984  0.6365  0.0619  0.1998  | 0.8113  0.7424  0.0689  0.3889  | 0.9091  0.8351  0.0741  0.5531
0.4  | 0.6934  0.6319  0.0615  0.1665  | 0.8066  0.7381  0.0685  0.3289  | 0.9047  0.8310  0.0737  0.4698
0.5  | 0.6884  0.6273  0.0611  0.1355  | 0.8019  0.7338  0.0681  0.2710  | 0.9003  0.8270  0.0734  0.3886
0.6  | 0.6834  0.6228  0.0606  0.1068  | 0.7973  0.7295  0.0678  0.2153  | 0.8960  0.8229  0.0731  0.3095
0.7  | 0.6784  0.6182  0.0602  0.0805  | 0.7926  0.7252  0.0674  0.1618  | 0.8916  0.8188  0.0727  0.2324
0.8  | 0.6734  0.6136  0.0598  0.0565  | 0.7879  0.7209  0.0670  0.1105  | 0.8872  0.8148  0.0724  0.1574
0.9  | 0.6684  0.6090  0.0594  0.0349  | 0.7833  0.7166  0.0666  0.0614  | 0.8828  0.8107  0.0720  0.0845
Table 6A.1(d): Performance indices for different values of α
q    | α=0.1                           | α=0.5                           | α=0.9
     | A       AS2     AS1     ωf      | A       AS2     AS1     ωf      | A       AS2     AS1     ωf
0    | 0.9030  0.8285  0.0744  0.7680  | 0.9000  0.8259  0.0740  0.7614  | 0.8973  0.8236  0.0737  0.7556
0.1  | 0.8982  0.8241  0.0740  0.6823  | 0.8955  0.8218  0.0737  0.6771  | 0.8931  0.8197  0.0734  0.6724
0.2  | 0.8933  0.8197  0.0736  0.5990  | 0.8909  0.8176  0.0733  0.5949  | 0.8888  0.8157  0.0731  0.5912
0.3  | 0.8885  0.8152  0.0733  0.5179  | 0.8864  0.8134  0.0730  0.5148  | 0.8845  0.8118  0.0727  0.5121
0.4  | 0.8837  0.8108  0.0729  0.4391  | 0.8819  0.8093  0.0726  0.4369  | 0.8803  0.8079  0.0724  0.4349
0.5  | 0.8788  0.8064  0.0725  0.3625  | 0.8773  0.8051  0.0723  0.3610  | 0.8760  0.8039  0.0721  0.3597
0.6  | 0.8740  0.8020  0.0721  0.2883  | 0.8728  0.8009  0.0719  0.2873  | 0.8718  0.8000  0.0718  0.2865
0.7  | 0.8692  0.7975  0.0717  0.2162  | 0.8683  0.7967  0.0716  0.2158  | 0.8675  0.7960  0.0715  0.2153
0.8  | 0.8644  0.7931  0.0713  0.1465  | 0.8638  0.7926  0.0712  0.1463  | 0.8632  0.7921  0.0711  0.1462
0.9  | 0.8595  0.7887  0.0709  0.0790  | 0.8592  0.7884  0.0708  0.0790  | 0.8590  0.7882  0.0708  0.0790
Fig. 6A.2(a): Availability vs λh by varying q (analytical and ANFIS results for q=0.2 and q=0.9)
Fig. 6A.2(b): Availability vs λh by varying β (analytical and ANFIS results for β=0.4 and β=0.8)
Fig. 6A.2(c): Availability vs λh by varying α (analytical and ANFIS results for α=0.2 and α=0.9)
Fig. 6A.2(d): Membership functions for input parameter λh
Fig. 6A.3(a): Availability vs μS by varying q (analytical and ANFIS results for q=0.2 and q=0.9)
Fig. 6A.3(b): Availability vs μS by varying β (analytical and ANFIS results for β=0.4 and β=0.8)
Fig. 6A.3(c): Availability vs μS by varying α (analytical and ANFIS results for α=0.2 and α=0.9)
Fig. 6A.3(d): Membership functions for input parameter μS
Repairable System with Warm Standby and Switching Failure
6B.1 Introduction
6B.2 Model Description
6B.3 Availability Prediction for Three Configurations
6B.4 Transient Solution
6B.5 Numerical Results
6B.6 Conclusion
Section-6B
By using a Markov process, the system state transitions can be
modeled to predict the system availability in many realistic
applications wherein all components of the system cannot be
treated as identical because of their failure and repair
characteristics. In this chapter, efforts have been made to
examine the availability characteristics for three different
configurations with warm standby, switching failure and delay of
reboot. For primary and warm standby components, the time-to-
failure, time-to-repair and time-to-delay are assumed to follow the
exponential distribution. The switching of warm standbys to
replace the failed component is subject to failure with probability
q. The numerical results using Runge-Kutta method have been
provided for supporting the analytical results. These results
validate the prediction capability of the proposed analytical
framework of the system incorporating standby, switching and
reboot.
6B.1 Introduction
Multi-component machining systems are used in every field of
life to perform different activities, and we depend heavily on them. As time passes,
a system is subject to failure; these failures may result in loss of production, money,
goodwill, etc. This situation can be handled by facilitating spare part/standby
support as well as maintenance provided by the repair crew. For ensuring the
desired efficiency and availability of the system, many researchers have suggested the
provision of a repair facility and standby components. Failure and repair are
assumed to be coupled events in a system working in a machining environment. The
system availability is a very important aspect for both system users and manufacturers
because failures may cause considerable loss of time and money. System failures are
also unavoidable in many safety-critical systems such as banking systems,
military systems, nuclear systems and so forth. As the complexity of systems and the competitive
industrial pressure on them increase, so does the need to understand changes in
the availability of a complex repairable system.
The steady-state availability is the probability that a system will be
operational at any random point of time and is expressed as the expected fraction of
time for which the system is operational. Availability is closely related to major life cycle costs in time and money.
Therefore, being able to model availability accurately and to use this performance
measure to make design decisions becomes crucial to the ultimate success of any
system working in a machining environment. The availability analysis of various
complex systems under different types of failure modes has been taken up from time
to time by several researchers. Kapur and Garg (1990) and Aven (1990) presented
some simple approximation formulae for the compound availability of standby
redundant systems. Jie (1991) and Galikowsky et al. (1996) estimated the reliability
and availability of a series system. Wang and Sivazlian (1997) discussed the life
cycle cost analysis for the availability prediction of system with parallel components.
Abu-Salih et al. (1999) established asymptotic confidence limits for the steady-state
availability for the repair facility. Park and Kim (2002) studied software
rejuvenation that follows a proactive fault-tolerant approach to handle software-origin
system failure. They mapped the software rejuvenation and switchover states with a
semi-Markov process and obtained the mathematical steady-state solutions of the
chain. Many authors have used standby provisioning in their models for increasing the
operational availability of the system. Smidt-Destombesa et al. (2004) considered an
installed base of k-out-of-N systems, each consisting of identical, repairable
components. System maintenance consists of replacing all failed and degraded
components by spares. They focused on the downtime resulting from the lack of spare
parts and maintenance strategy. The analysis of reliability and the availability of
systems having warm standby components and switching failures was studied by
Wang et al. (2006). El-Damcese (2009) analyzed the warm standby system subject to
common cause failures with time varying failure and repair rates. Maheshwari et al.
(2010) studied machine repair problem with K-type warm spares, multiple vacations
for repairmen and reneging. Yuan and Xu (2011) considered an optimal replacement
policy for a repairable system with repairman vacations.
A switching device is used to bring a standby unit into operation in place of a failed component,
but the switchover from standby to operational mode may be perfect or
imperfect. The standby redundant components support the demand of pre-specified
minimum reliability of the system. The provision of warm standbys with switching
failures has attracted many researchers working in the area of reliability prediction of
machining systems. In 1983, Gupta et al. studied the switching failure in a two-unit
standby redundant system. Goel and Gupta (1984) analyzed the availability of a two-
unit cold standby system with two switching failure modes. Labib (1991) and Alidrisi
(1992) proposed the stochastic analysis of an n-component redundant system with
two-unit warm standby system and switching devices. Dhillon and Yang (1992) and
Dhillon (1993) analyzed the reliability and availability of warm standby systems with
common-cause failures and human errors. Mokaddis et al. (1994) developed two
models for two dissimilar-unit standby redundant system with three types of repair
facilities and perfect or imperfect switch. Chung (1995), Singh and Goel (1995) and
Xu et al. (2005) predicted the reliability and availability of standby systems with
imperfect switching and multiple non-critical/critical errors. Reliability and sensitivity
analysis of a system with multiple unreliable service stations and standby switching
failures was considered by Ke et al. (2007). Ke and Lee (2007) suggested the
asymptotic confidence limits for a repairable system with standbys subject to
switching failures. Hsu et al. (2008) carried out a statistical study of the availability of a redundant
system with reboot delay, standby switching failures and an unreliable repair facility,
which consists of two active components and one warm standby. Wang and Chen
(2009) considered general repair times, reboot delay and switching failures to
evaluate the availability for different configurations. Other important contributions
in the direction of standby provisioning in repairable systems are due to Cha et al.
(2008) and Yun and Cha (2010). They optimized the design of a general warm
standby system. Mahmoud and Moshref (2010) studied a two-unit cold standby
system considering hardware, human error failures and preventive maintenance. Ke et
al. (2011) discussed the reliability measures of a repairable system with standby
switching failures and reboot delay.
Better maintenance of a system may result in better reliability and
performance. Different models and various availability characteristics
of interest have been discussed and obtained by several researchers using the theory
of Markov processes. In this chapter, reliability models are developed to
address the issue of availability prediction and to facilitate the comparison of different
configurations. The purpose of the present investigation is to study the availability
and failure frequency of three systems with switching devices, standby components,
setup time for repair and delay of reboot. The chapter is arranged as follows.
The present section 6B.1 reviews previous studies on the
reliability/availability analysis of repairable systems having standbys and switching
failures. The notations and assumptions needed to construct the model
mathematically are stated in section 6B.2. In section 6B.3, we develop three models and
construct the governing equations by using the appropriate transition rates in order to
evaluate the availability for three configurations. The steady state availability and
failure frequency are also obtained. The transient solution is taken up in section 6B.4, and
numerical illustrations are provided in section 6B.5 to examine the availability indices.
Concluding remarks are given in the last section 6B.6.
6B.2 Model Description
We develop a Markov model for a multi-component system consisting of
identical operating primary units and warm standby units. Each operating unit
fails independently of the state of the others according to an exponential failure time
distribution with parameter λ. Whenever an operating unit fails, it is instantly
replaced by a warm standby if one is available. An available warm standby unit may
also fail; its time to failure is exponentially distributed with
parameter α (0 ≤ α < λ). The switchover time is assumed to be instantaneous.
However, the switchover of a standby unit to replace the failed primary unit is
imperfect; the switchover failure probability is q. If a warm standby unit fails to
switch to a primary unit, the next available standby unit attempts to switch. This
process continues until switching is successful or all the warm standby units have been
exhausted. Whenever an operating unit or a warm standby fails, it is immediately sent
to a repair facility, where it is repaired by a repairman who takes a setup
time before starting the repair. The setup time and repair time are exponentially
distributed with parameters η and μ, respectively. After repair, a unit works like a new one. The
reboot delay for an operating unit and a warm standby unit is assumed to be
exponentially distributed with rate β.
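The sequential switching rule described above fixes the probability of each outcome after a primary unit fails. The sketch below is only an illustration; the function name `switching_outcomes` and its arguments are hypothetical. With `standbys` warm standbys available and switching failure probability `q`, the k-th standby is the one that takes over with probability q^(k-1)(1-q), and all attempts fail with probability q^standbys.

```python
def switching_outcomes(q, standbys):
    """Outcome probabilities of sequential switching attempts.

    Hypothetical helper: returns (success, all_fail), where success[k-1]
    is the probability that the first k-1 standbys fail to switch and
    the k-th one succeeds, and all_fail is the probability that every
    available standby fails to switch.
    """
    success = [q ** (k - 1) * (1.0 - q) for k in range(1, standbys + 1)]
    all_fail = q ** standbys
    return success, all_fail


# Example: two standbys with switching failure probability q = 0.3.
success, all_fail = switching_outcomes(0.3, 2)
```

The outcome probabilities always add up to one, which is why the total outflow from a state with operating-unit failure rate λ stays equal to λ however many standbys are present.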
We use the index (i,j) to represent the system state in which there are 'i' operating
units and 'j' standby units in the repairable system. We use the index 'k' for
the system state in which the system is broken down and is under repair; here k=i+j, i.e. k
represents the total number of units in the system. The setup time and the repair time
are both assumed to be exponentially distributed, with rates $\eta_k$ and $\mu_k$ $(k=i+j)$, when
the system is broken down and under repair and there are in total 'k' units in the system.
To obtain the availability function A(t) and other performance indices, viz.
failure frequency, probability of the system being in the reboot state and probability of
the system being broken down and under setup/repair, we obtain the transient
probabilities $P_{i,j}(t)$ and $P_k(t)$ by using the Runge-Kutta method of fourth order. The
steady state probabilities have also been determined in explicit form by using a
recursive approach.
6B.3 Availability Prediction for Three Configurations
Three models are developed by considering different configurations of multi-
component system with spare part support as described below:
6B.3.1 Model 1.
The system consists of two similar units: one is operating and the other acts as
a standby. Initially one unit is operative and the other unit is kept as a warm standby, as
shown by node (1,1) in fig. 6B.1. When the operating unit fails, it is replaced by the
standby unit if available. If the standby unit has been consumed, i.e. is not available for
replacement, the system fails. When the operating unit fails, the standby unit is switched
into operation by means of a switching device. The switching device may fail,
when required, with probability q. The system can reach a failed state denoted by node
(0,1); then, after rebooting, the system is available for work. The node (0,0) represents
the failure of both units. Whenever the server is broken down, it is immediately sent
for repair to the repairman, who needs a setup time before starting the repair. After
completion of repair, the server becomes available for service. States (0) and (1) show
that the server is down and in the setup state when one and both units, respectively, have
failed. It is assumed that μ1 and μ2 are the repair rates and η1 and η2 are the setup rates
for the repair while the system is in states (0) and (1), respectively.
Fig. 6B.1: State transition diagram for model 1
The following differential-difference equations are constructed by using appropriate
transition rates depicted in fig 6B.1:
$\frac{dP_{1,1}(t)}{dt} = -(\lambda+\alpha)P_{1,1}(t) + \mu_2 P_1(t)$ … (6B.1)
$\frac{dP_{0,1}(t)}{dt} = -\beta P_{0,1}(t) + \lambda(1-q)P_{1,1}(t)$ … (6B.2)
$\frac{dP_{1,0}(t)}{dt} = -(\lambda+\eta_2)P_{1,0}(t) + \alpha P_{1,1}(t) + \beta P_{0,1}(t) + \mu_1 P_0(t)$ … (6B.3)
$\frac{dP_{0,0}(t)}{dt} = -\eta_1 P_{0,0}(t) + \lambda q P_{1,1}(t) + \lambda P_{1,0}(t)$ … (6B.4)
$\frac{dP_0(t)}{dt} = -\mu_1 P_0(t) + \eta_1 P_{0,0}(t)$ … (6B.5)
$\frac{dP_1(t)}{dt} = -\mu_2 P_1(t) + \eta_2 P_{1,0}(t)$ … (6B.6)
The initial condition is
$P_{1,1}(0)=1,\quad P_{0,1}(0)=P_{1,0}(0)=P_{0,0}(0)=P_0(0)=P_1(0)=0$ …(6B.7)
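The transient probabilities of model 1 can be approximated with the classical fourth-order Runge-Kutta scheme. The code below is a minimal sketch under stated assumptions, not the program used for the thesis results: the function names, the state ordering and the parameter values are hypothetical, and equation (6B.3) is read with outflow rate λ+η2 and inflow μ1·P0(t), the choice for which total probability is conserved.

```python
import numpy as np

def model1_generator(lam, alpha, beta, q, eta1, eta2, mu1, mu2):
    """Rate matrix G with dP/dt = G @ P for model 1.

    Assumed state order (chosen here for illustration):
    [P11, P01, P10, P00, P0, P1].
    """
    s11, s01, s10, s00, d0, d1 = range(6)
    G = np.zeros((6, 6))
    G[s11, s11] = -(lam + alpha); G[s11, d1] = mu2               # (6B.1)
    G[s01, s01] = -beta;          G[s01, s11] = lam * (1 - q)    # (6B.2)
    G[s10, s10] = -(lam + eta2);  G[s10, s11] = alpha            # (6B.3)
    G[s10, s01] = beta;           G[s10, d0] = mu1
    G[s00, s00] = -eta1;          G[s00, s11] = lam * q          # (6B.4)
    G[s00, s10] = lam
    G[d0, d0] = -mu1;             G[d0, s00] = eta1              # (6B.5)
    G[d1, d1] = -mu2;             G[d1, s10] = eta2              # (6B.6)
    return G

def rk4_transient(G, p0, t_end, steps):
    """Integrate dP/dt = G @ P from 0 to t_end with fixed-step RK4."""
    h = t_end / steps
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        k1 = G @ p
        k2 = G @ (p + 0.5 * h * k1)
        k3 = G @ (p + 0.5 * h * k2)
        k4 = G @ (p + h * k3)
        p = p + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return p

# Illustrative (hypothetical) rates; the system starts in state (1,1).
G = model1_generator(lam=0.3, alpha=0.1, beta=0.8, q=0.5,
                     eta1=1.0, eta2=1.0, mu1=2.0, mu2=2.0)
p_at_5 = rk4_transient(G, [1, 0, 0, 0, 0, 0], t_end=5.0, steps=500)
```

Because every column of `G` sums to zero, the computed probabilities keep adding up to one at each step, which gives a quick consistency check on the transcribed equations.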
Steady state probabilities
When $t \to \infty$, the steady-state equations governing the model are obtained from
(6B.1)-(6B.6) as follows:
$0 = -(\lambda+\alpha)P_{1,1} + \mu_2 P_1$ … (6B.8)
$0 = -\beta P_{0,1} + \lambda(1-q)P_{1,1}$ … (6B.9)
$0 = -(\lambda+\eta_2)P_{1,0} + \alpha P_{1,1} + \beta P_{0,1} + \mu_1 P_0$ … (6B.10)
$0 = -\eta_1 P_{0,0} + \lambda q P_{1,1} + \lambda P_{1,0}$ … (6B.11)
$0 = -\mu_1 P_0 + \eta_1 P_{0,0}$ … (6B.12)
$0 = -\mu_2 P_1 + \eta_2 P_{1,0}$ … (6B.13)
The normalizing condition is given by
$P_{1,1} + P_{1,0} + P_{0,1} + P_{0,0} + P_1 + P_0 = 1$ … (6B.14)
Solving equations (6B.8)-(6B.13) recursively and using (6B.14), we obtain
( ) ( )( )[ ]1121011
211,1 ημαλΛΛΛημ
μP++++++
η= … (6B.15a)
( )( ) ( )( )[ ]1121011
11,0 ημαλΛΛΛημ
αλμP++++++
+= … (6B.15b)
( )( ) ( )( )[ ]1121011
211,0 ημαλΛΛΛημ
q1μP++++++β
−λη= … (6B.15c)
( )( ) ( )( )[ ]11210111
210,0 ημαλΛΛΛημ
qμP++++++η
η+λ+αλ= … (6B.15d)
( )( ) ( )( )[ ]11210112
211 ημαλΛΛΛημ
αλμP++++++µ
+η= … (6B.15e)
( )( ) ( )( )[ ]1121011
20 ημαλΛΛΛημ
qP++++++
η+λ+αλ= … (6B.15f)
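where
$\Lambda_0 = \frac{\lambda(1-q)\eta_2}{\beta}$, $\Lambda_1 = \frac{\lambda(\alpha+\lambda+q\eta_2)}{\mu_1}$, $\Lambda_2 = \frac{\lambda(\alpha+\lambda+q\eta_2)}{\eta_1}$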
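The closed-form steady-state probabilities can be cross-checked numerically by solving the balance equations directly. The following is a minimal sketch, assuming the reading of (6B.8)-(6B.13) in which state (1,0) is left at rate λ+η2 and entered from state (0) at rate μ1; the function name, state ordering and parameter values are hypothetical. One balance equation is replaced by the normalizing condition (6B.14) and the resulting linear system is solved with NumPy.

```python
import numpy as np

def model1_steady_state(lam, alpha, beta, q, eta1, eta2, mu1, mu2):
    """Steady-state probabilities [P11, P01, P10, P00, P0, P1] of model 1.

    Hypothetical helper: builds the balance equations G @ P = 0 and
    replaces the last one by the normalizing condition sum(P) = 1.
    """
    s11, s01, s10, s00, d0, d1 = range(6)
    G = np.zeros((6, 6))
    G[s11, s11] = -(lam + alpha); G[s11, d1] = mu2
    G[s01, s01] = -beta;          G[s01, s11] = lam * (1 - q)
    G[s10, s10] = -(lam + eta2);  G[s10, s11] = alpha
    G[s10, s01] = beta;           G[s10, d0] = mu1
    G[s00, s00] = -eta1;          G[s00, s11] = lam * q
    G[s00, s10] = lam
    G[d0, d0] = -mu1;             G[d0, s00] = eta1
    G[d1, d1] = -mu2;             G[d1, s10] = eta2

    M = G.copy()
    M[-1, :] = 1.0                 # normalizing condition replaces one equation
    rhs = np.zeros(6)
    rhs[-1] = 1.0
    return np.linalg.solve(M, rhs)

# Illustrative (hypothetical) parameter values.
p = model1_steady_state(lam=0.3, alpha=0.1, beta=0.8, q=0.5,
                        eta1=1.0, eta2=1.5, mu1=2.0, mu2=2.5)
```

Any performance index defined as a sum of state probabilities can then be evaluated from `p` and compared with its closed-form expression.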
Performance Indices
For configuration 6B.1, the availability ( )∞1A is given by
( ) 010,11,11 PPPPA +++=∞
( )( )( ) ( )( )[ ]1121011
11
ημαλΛΛΛημαλημ
++++++++
= …(6B.16)
Other performance indices for this configuration are as follows:
Failure Frequency ( ) 0,11,1f PPF λ+α+λ=
( )( )( ) ( )( )[ ]1121011
11
ημαλΛΛΛημαλημ
+++++++λ+
= …(6B.17)
Probability of the system being in reboot state 1,0RB PP =
( )( ) ( )( )[ ]1121011
21
ημαλΛΛΛημq1μ
++++++β−λη
= …(6B.18)
Probability of the system being under setup/repair state 01D PPP +=
= ( )( ) ( )( )[ ]11210110
1110
ημαλΛΛΛημμqμ
++++++η+α+λλµ+η …(6B.19)
6B.3.2 Model 2.
In this configuration, the system consists of two operating units
and one standby unit. The other assumptions are the same as for configuration 1.
Two units are required for proper functioning of the system. For configuration 2,
the differential-difference equations are constructed by considering the appropriate
rates as shown in fig. 6B.2.
Fig. 6B.2: State transition diagram for model 2
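$\frac{dP_{2,1}(t)}{dt} = -(2\lambda+\alpha)P_{2,1}(t) + \mu_2 P_2(t)$ … (6B.20)
$\frac{dP_{1,1}(t)}{dt} = -\beta P_{1,1}(t) + 2\lambda(1-q)P_{2,1}(t)$ … (6B.21)
$\frac{dP_{2,0}(t)}{dt} = -(2\lambda+\eta_2)P_{2,0}(t) + \alpha P_{2,1}(t) + \beta P_{1,1}(t) + \mu_1 P_1(t)$ … (6B.22)
$\frac{dP_{1,0}(t)}{dt} = -\eta_1 P_{1,0}(t) + 2\lambda q P_{2,1}(t) + 2\lambda P_{2,0}(t)$ … (6B.23)
$\frac{dP_1(t)}{dt} = -\mu_1 P_1(t) + \eta_1 P_{1,0}(t)$ … (6B.24)
$\frac{dP_2(t)}{dt} = -\mu_2 P_2(t) + \eta_2 P_{2,0}(t)$ … (6B.25)
The initial condition is
$P_{2,1}(0)=1,\quad P_{2,0}(0)=P_{1,1}(0)=P_{1,0}(0)=P_2(0)=P_1(0)=0$ …(6B.26)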
Steady state probabilities
The steady-state equations governing the model are obtained from (6B.20)-(6B.25) as
follows:
$0 = -(2\lambda+\alpha)P_{2,1} + \mu_2 P_2$ … (6B.27)
$0 = -\beta P_{1,1} + 2\lambda(1-q)P_{2,1}$ … (6B.28)
$0 = -(2\lambda+\eta_2)P_{2,0} + \alpha P_{2,1} + \beta P_{1,1} + \mu_1 P_1$ … (6B.29)
$0 = -\eta_1 P_{1,0} + 2\lambda q P_{2,1} + 2\lambda P_{2,0}$ … (6B.30)
$0 = -\mu_1 P_1 + \eta_1 P_{1,0}$ … (6B.31)
$0 = -\mu_2 P_2 + \eta_2 P_{2,0}$ … (6B.32)
Solving equations (6B.27)-(6B.32) recursively and using the normalizing condition,
we obtain
( ) ( )( )[ ]2221022
222,1 ημα2λΛΛΛημ
μηP++++++
= … (6B.33a)
( )( ) ( )( )[ ]2221022
22,0 ημα2λΛΛΛημ
α2λμP++++++
+= … (6B.33b)
( )( )( ) ( ) ( )( )[ ]2221022
221,1 ημα2λΛΛΛημα2λ
α2λq1μ2P+++++++β
+−λη= … (6B.33c)
( )[ ]( ) ( ) ( )( )[ ]22210221
222
1,0 ημα2λΛΛΛημα2λα2λμq2P
+++++++η+λ+ηλ
= … (6B.33d)
( )( ) ( )( )[ ]2221022
22 ημα2λΛΛΛημ
α2ληP++++++
+= … (6B.33e)
( )( ) ( )( )[ ]( ) ( )( )[ ]22210221
2221 ημα2λΛΛΛημ
q1222μP++++++µ
−λ+αη−α+λη+λ= … (6B.33f)
where
( )β−λη
=Λq12 2
0 , ( )( ) ( )( )[ ]1
221
q1222µ
−λ+αη−α+λη+λ=Λ
( )[ ]1
22
2q2η
α+λλ+ηλ=Λ .
Performance Indices
The system availability ( )∞2A is given by
( ) 120,21,22 PPPPA +++=∞
( )( )( ) ( )( )[ ]2221022
22
ημα2λΛΛΛημα2λημ
++++++++
= …(6B.34)
Other performance indices for this model are obtained as
Failure Frequency ( ) 0,21,2f P2P2F λ+α+λ=
( )( )( ) ( )( )[ ]2221022
22
ημα2λΛΛΛημα2λ2ημ
+++++++λ+
= …(6B.35)
Probability of the system being in reboot state 1,1RB PP β=
( )( )( ) ( ) ( )( )[ ]2221022
22
ημα2λΛΛΛημα2λα2λq1μ2
++++++++−λη
= …(6B.36)
Probability of the system being under setup/repair state 12D PPP +=
( ) ( )( ) ( )[ ]( ) ( )( )[ ]22210221
222221
ημα2λΛΛΛημμq12α2λ2λ2μ
++++++η−λ−αη−+η+µ+α+λη
= …(6B.37)
6B.3.3 Model 3.
In this model, the system consists of two operating and two standby units, and at
least two units are required for the functioning of the system. The states 1, 2, 3
represent that the server is broken down when the system has one operating unit, two
operating units, and two operating units as well as one standby unit, respectively; the
corresponding repair and setup rates are (µ1, η1), (µ2, η2) and (µ3, η3).
For configuration 3, as shown in fig. 6B.3, we construct the differential-
difference equations as follows:
Fig. 6B.3: State transition diagram for model 3
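$\dfrac{dP_{2,2}(t)}{dt} = -(2\lambda+2\alpha)P_{2,2}(t) + \mu_3 P_3(t)$ … (6B.38)
$\dfrac{dP_{2,1}(t)}{dt} = -(2\lambda+\alpha+\eta_3)P_{2,1}(t) + 2\alpha P_{2,2}(t) + \beta P_{1,2}(t) + \mu_2 P_2(t)$ … (6B.39)
$\dfrac{dP_{2,0}(t)}{dt} = -(2\lambda+\eta_2)P_{2,0}(t) + \alpha P_{2,1}(t) + \beta P_{1,1}(t) + \mu_1 P_1(t)$ … (6B.40)
$\dfrac{dP_{1,2}(t)}{dt} = -\beta P_{1,2}(t) + 2\lambda(1-q)P_{2,2}(t)$ … (6B.41)
$\dfrac{dP_{1,1}(t)}{dt} = -\beta P_{1,1}(t) + 2\lambda q(1-q)P_{2,2}(t) + 2\lambda(1-q)P_{2,1}(t)$ … (6B.42)
$\dfrac{dP_{1,0}(t)}{dt} = -\eta_1 P_{1,0}(t) + 2\lambda q^2 P_{2,2}(t) + 2\lambda q P_{2,1}(t) + 2\lambda P_{2,0}(t)$ … (6B.43)
$\dfrac{dP_3(t)}{dt} = -\mu_3 P_3(t) + \eta_3 P_{2,1}(t)$ … (6B.44)
$\dfrac{dP_2(t)}{dt} = -\mu_2 P_2(t) + \eta_2 P_{2,0}(t)$ … (6B.45)
$\dfrac{dP_1(t)}{dt} = -\mu_1 P_1(t) + \eta_1 P_{1,0}(t)$ … (6B.46)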
The initial condition is
$P_{2,2}(0)=1,\quad P_{2,1}(0)=P_{2,0}(0)=P_{1,2}(0)=P_{1,1}(0)=P_{1,0}(0)=P_3(0)=P_2(0)=P_1(0)=0$ … (6B.47)
Steady state probabilities
The steady-state equations corresponding to (6B.38)-(6B.46) are as follows:
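$0 = -(2\lambda+2\alpha)P_{2,2} + \mu_3 P_3$ … (6B.48)
$0 = -(2\lambda+\alpha+\eta_3)P_{2,1} + 2\alpha P_{2,2} + \beta P_{1,2} + \mu_2 P_2$ … (6B.49)
$0 = -(2\lambda+\eta_2)P_{2,0} + \alpha P_{2,1} + \beta P_{1,1} + \mu_1 P_1$ … (6B.50)
$0 = -\beta P_{1,2} + 2\lambda(1-q)P_{2,2}$ … (6B.51)
$0 = -\beta P_{1,1} + 2\lambda q(1-q)P_{2,2} + 2\lambda(1-q)P_{2,1}$ … (6B.52)
$0 = -\eta_1 P_{1,0} + 2\lambda q^2 P_{2,2} + 2\lambda q P_{2,1} + 2\lambda P_{2,0}$ … (6B.53)
$0 = -\mu_3 P_3 + \eta_3 P_{2,1}$ … (6B.54)
$0 = -\mu_2 P_2 + \eta_2 P_{2,0}$ … (6B.55)
$0 = -\mu_1 P_1 + \eta_1 P_{1,0}$ … (6B.56)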
Solving (6B.48)-(6B.56) recursively and using the normalizing condition
$P_{2,2}+P_{2,1}+P_{2,0}+P_{1,2}+P_{1,1}+P_{1,0}+P_3+P_2+P_1 = 1$, … (6B.57)
we compute the probabilities as
$P_{2,1} = \dfrac{\mu_3(2\lambda+2\alpha)}{D}$ … (6B.58a)
$P_{2,2} = \dfrac{\eta_3\mu_3}{D}$ … (6B.58b)
$P_{2,0} = \dfrac{\mu_3\Lambda_3}{D}$ … (6B.58c)
$P_{1,2} = \dfrac{\mu_3\Lambda_0}{D}$ … (6B.58d)
$P_{1,1} = \dfrac{\mu_3\Lambda_1}{D}$ … (6B.58e)
$P_{1,0} = \dfrac{\mu_3\Lambda_4}{D}$ … (6B.58f)
$P_3 = \dfrac{\eta_3(2\lambda+2\alpha)}{D}$ … (6B.58g)
$P_2 = \dfrac{\mu_3\Lambda_2}{D}$ … (6B.58h)
$P_1 = \dfrac{\mu_3\Lambda_5}{D}$ … (6B.58i)
where the common denominator is
$D = \mu_3(\eta_3+\Lambda_0+\Lambda_1+\Lambda_2+\Lambda_3+\Lambda_4+\Lambda_5) + (2\lambda+2\alpha)(\eta_3+\mu_3)$
and
$\Lambda_0 = \dfrac{2\lambda(1-q)\eta_3}{\beta}$, $\Lambda_1 = \dfrac{2\lambda(1-q)(q\eta_3+2\lambda+2\alpha)}{\beta}$
$\Lambda_2 = \dfrac{(2\lambda+2\alpha)(2\lambda+\alpha+\eta_3) - 2\alpha\eta_3 - 2\lambda(1-q)\eta_3}{\mu_2}$
$\Lambda_3 = \dfrac{(2\lambda+2\alpha)(2\lambda+\alpha+\eta_3) - 2\alpha\eta_3 - 2\lambda(1-q)\eta_3}{\eta_2}$
$\Lambda_4 = \dfrac{2\lambda q^2\eta_2\eta_3 + (2\lambda+2\alpha)\left[2\lambda q\eta_2 + 2\lambda(2\lambda+\alpha+\eta_3)\right] - 2\lambda\left[2\alpha\eta_3 + 2\lambda(1-q)\eta_3\right]}{\eta_1\eta_2}$
$\Lambda_5 = \dfrac{2\lambda q^2\eta_2\eta_3 + (2\lambda+2\alpha)\left[2\lambda q\eta_2 + 2\lambda(2\lambda+\alpha+\eta_3)\right] - 2\lambda\left[2\alpha\eta_3 + 2\lambda(1-q)\eta_3\right]}{\mu_1\eta_2}$
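As a numerical cross-check of the recursive solution, the balance equations (6B.48)-(6B.56) together with the normalizing condition (6B.57) can also be solved as a linear system. The following is only a minimal sketch in Python with NumPy, not the program used in this thesis; the function name, the state ordering and the parameter values in the usage line (read from the defaults listed in Section 6B.5, with η1=0.4 and η2=0.2 as an assumed reading) are illustrative assumptions.

```python
import numpy as np

def steady_state_model3(lam, alpha, beta, q, mu, eta):
    """Solve the balance equations (6B.48)-(6B.56) with normalization (6B.57).

    Assumed state order: P22, P21, P20, P12, P11, P10, P3, P2, P1.
    mu = (mu1, mu2, mu3), eta = (eta1, eta2, eta3).
    """
    mu1, mu2, mu3 = mu
    eta1, eta2, eta3 = eta
    A = np.array([
        [-(2*lam + 2*alpha), 0, 0, 0, 0, 0, mu3, 0, 0],                # (6B.48)
        [2*alpha, -(2*lam + alpha + eta3), 0, beta, 0, 0, 0, mu2, 0],  # (6B.49)
        [0, alpha, -(2*lam + eta2), 0, beta, 0, 0, 0, mu1],            # (6B.50)
        [2*lam*(1 - q), 0, 0, -beta, 0, 0, 0, 0, 0],                   # (6B.51)
        [2*lam*q*(1 - q), 2*lam*(1 - q), 0, 0, -beta, 0, 0, 0, 0],     # (6B.52)
        [2*lam*q**2, 2*lam*q, 2*lam, 0, 0, -eta1, 0, 0, 0],            # (6B.53)
        [0, eta3, 0, 0, 0, 0, -mu3, 0, 0],                             # (6B.54)
        [0, 0, eta2, 0, 0, 0, 0, -mu2, 0],                             # (6B.55)
        [0, 0, 0, 0, 0, eta1, 0, 0, -mu1],                             # (6B.56)
    ], dtype=float)
    # The nine balance equations are linearly dependent (they sum to zero),
    # so the last one is replaced by the normalizing condition (6B.57).
    A[-1, :] = 1.0
    b = np.zeros(9)
    b[-1] = 1.0
    return np.linalg.solve(A, b)

# Usage with the assumed default parameters.
p = steady_state_model3(0.3, 0.3, 0.3, 0.5, mu=(6.0, 2.0, 4.0), eta=(0.4, 0.2, 0.3))
```

The returned vector can be compared entry by entry with (6B.58a)-(6B.58i); the equation that was replaced by the normalization, (6B.56), should still be satisfied by the result.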
Performance indices
For configuration 3, the availability ( )∞3A is obtained as
( ) 1230,21,22,23 PPPPPPA +++++=∞
( )[ ]( ) ( )( )[ ]3354321033
333
2222
µ+ηα+λ+Λ+Λ+Λ+Λ+Λ+Λ+ηµΛ+η+α+λµ
= …(6B.59)
Other performance indices are given as follows:
Failure Frequency ( )( ) 0,11,12,1f P2PPq12F λ++−λ=
( ) ( )( )[ ]( ) ( )( )[ ]3354321033
333
222222
µ+ηα+λ+Λ+Λ+Λ+Λ+Λ+Λ+ηµΛλ+α+λ+ηα+λµ
= …(6B.60)
Probability of the system being in reboot state is
0,21,2RB PPP += ( )( ) ( )( )[ ]3354321033
103
22 µ+ηα+λ+Λ+Λ+Λ+Λ+Λ+Λ+ηµΛ+Λβµ
= …(6B.61)
Probability of the system being under setup/repair state
321D PPPP ++= ( ) ( )( ) ( )( )[ ]3354321033
5233
2222
µ+ηα+λ+Λ+Λ+Λ+Λ+Λ+Λ+ηµΛ+Λµ+α+λη
= …(6B.62)
6B.4 Transient Solution
The transient probabilities for the systems of equations (6B.1)-(6B.6), (6B.20)-
(6B.26) and (6B.38)-(6B.47) for models 1, 2 and 3, respectively, cannot be obtained
explicitly by analytical methods. However, numerical methods for solving sets of
differential equations can easily be employed. To obtain the transient probabilities, we
employ a numerical approach based on the fourth-order Runge-Kutta (RK) technique.
Once the transient probabilities are evaluated, we can obtain the availability for models 1, 2
and 3, respectively, using the following formulae:
$A_1(t) = P_0(t) + P_1(t) + P_{1,0}(t) + P_{1,1}(t)$ …(6B.63)
$A_2(t) = P_1(t) + P_2(t) + P_{2,0}(t) + P_{2,1}(t)$ …(6B.64)
$A_3(t) = P_1(t) + P_2(t) + P_3(t) + P_{2,0}(t) + P_{2,1}(t) + P_{2,2}(t)$ …(6B.65)
The failure frequencies Fi(t) (i=1,2,3) for models 1, 2 and 3 are determined using
$F_1(t) = \lambda P_{1,0}(t) + (\lambda+\alpha)P_{1,1}(t)$ …(6B.66)
$F_2(t) = 2\lambda P_{2,0}(t) + (2\lambda+\alpha)P_{2,1}(t)$ …(6B.67)
$F_3(t) = 2\lambda P_{1,0}(t) + 2\lambda(1-q)\{P_{1,2}(t) + P_{1,1}(t)\}$ …(6B.68)
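The fourth-order Runge-Kutta scheme for model 3 can be sketched as follows. This is an illustrative Python version under stated assumptions, not the MATLAB program used for the results of this section: the function names, the fixed step size h and the state ordering are choices made here, and the parameter values are the defaults of Section 6B.5 as read from the text (η1=0.4, η2=0.2 assumed).

```python
import numpy as np

def model3_rhs(p, lam, alpha, beta, q, mu, eta):
    """Right-hand sides of (6B.38)-(6B.46).

    Assumed state order: P22, P21, P20, P12, P11, P10, P3, P2, P1.
    """
    p22, p21, p20, p12, p11, p10, p3, p2, p1 = p
    mu1, mu2, mu3 = mu
    eta1, eta2, eta3 = eta
    return np.array([
        -(2*lam + 2*alpha)*p22 + mu3*p3,                                # (6B.38)
        -(2*lam + alpha + eta3)*p21 + 2*alpha*p22 + beta*p12 + mu2*p2,  # (6B.39)
        -(2*lam + eta2)*p20 + alpha*p21 + beta*p11 + mu1*p1,            # (6B.40)
        -beta*p12 + 2*lam*(1 - q)*p22,                                  # (6B.41)
        -beta*p11 + 2*lam*q*(1 - q)*p22 + 2*lam*(1 - q)*p21,            # (6B.42)
        -eta1*p10 + 2*lam*q**2*p22 + 2*lam*q*p21 + 2*lam*p20,           # (6B.43)
        -mu3*p3 + eta3*p21,                                             # (6B.44)
        -mu2*p2 + eta2*p20,                                             # (6B.45)
        -mu1*p1 + eta1*p10,                                             # (6B.46)
    ])

def rk4_transient(p0, t_end, h, **params):
    """Classical RK4 with fixed step h, started from the initial condition (6B.47)."""
    p = np.asarray(p0, dtype=float)
    steps = int(round(t_end / h))
    traj = np.empty((steps + 1, p.size))
    traj[0] = p
    for n in range(steps):
        k1 = model3_rhs(p, **params)
        k2 = model3_rhs(p + 0.5*h*k1, **params)
        k3 = model3_rhs(p + 0.5*h*k2, **params)
        k4 = model3_rhs(p + h*k3, **params)
        p = p + (h / 6.0)*(k1 + 2*k2 + 2*k3 + k4)
        traj[n + 1] = p
    return h*np.arange(steps + 1), traj

def availability3(traj):
    """A3(t) as defined in (6B.65)."""
    return traj[..., [0, 1, 2, 6, 7, 8]].sum(axis=-1)

def failure_frequency3(traj, lam, q):
    """F3(t) as defined in (6B.68)."""
    return 2*lam*traj[..., 5] + 2*lam*(1 - q)*(traj[..., 3] + traj[..., 4])

PARAMS = dict(lam=0.3, alpha=0.3, beta=0.3, q=0.5, mu=(6.0, 2.0, 4.0), eta=(0.4, 0.2, 0.3))
P0 = [1, 0, 0, 0, 0, 0, 0, 0, 0]
times, traj = rk4_transient(P0, 14.0, 0.01, **PARAMS)
A3 = availability3(traj)
```

A fixed step is used only for transparency; an adaptive solver such as MATLAB's ode45 chooses the step automatically. Because the right-hand sides sum to zero, the state probabilities should sum to one at every step, which gives a simple sanity check on the implementation.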
6B.5 Numerical Results
To illustrate the computational tractability of the transient analysis for all
three configurations, we perform a computational experiment. The computer
program has been implemented using MATLAB's 'ode45' function. A time span
with equal intervals is considered. For numerical results, we choose the default
parameters as
λ=0.3, α=0.3, µ1=6, µ2=2, µ3=4, β=0.3, q=0.5, η1=0.4, η1=0.4, η2=0.2,
η3=0.3.
In tables 6B.1(a)-6B.1(c), we examine the effect of different parameters such
as failure rate of standbys (α), reboot rate (β) and switching parameter (q),
respectively, on the system availability. In every case the availability falls from its
initial value as time grows. For model 1, the availability decreases slightly as α and q
increase; the tabulated values for models 2 and 3 do not decrease uniformly (for
instance, A2(t) increases with both α and q). The availability increases with β for all
the three configurations.
To examine the effects of setup time and repair rate, the availability is
presented graphically in figures 6B.4(a-d), 6B.5(a-d) and 6B.6(a-f) for models
1, 2 and 3, respectively. From figs 6B.4(a-d), 6B.5(a-d) and 6B.6(a-f), we note that as
time increases, A(t) initially decreases sharply and after some time becomes almost
constant. In figs 6B.4(a) and 6B.4(b) we note that as µ1 (µ2) increases, A(t) decreases.
From figs 6B.4(c) and 6B.4(d), it is found that as η1 and η2 increase, A(t)
increases.
In figs 6B.5(a)-6B.5(d), the graphs for A(t) have been plotted with respect to
time t for model 2. We analyze the effects of repair rate and setup time on the
availability. As repair rates µ1 and µ2 increase, A(t) decreases. As set up time η1
increases, A(t) increases; however, on increasing η2, the availability remains almost
constant.
For model 3, the availability graphs with respect to time are plotted in figs
6B.6(a), 6B.6(c) and 6B.6(e) to examine the effects of repair rates (µ1, µ2 and µ3) and
in figs 6B.6(b), 6B.6(d) and 6B.6(f) for setup time (η1, η2 and η3). It is seen in figs
6B.6(a), 6B.6(c) and 6B.6(e) that A(t) remains almost the same for lower values of t on
increasing the repair rates µ1, µ2 and µ3. For higher values of t, the effect of repair
rates µ2 and µ3 on the availability is insignificant; however, on increasing µ1, we
observe a slight decrement in the availability. In figs 6B.6(b) and 6B.6(d), it is found
that the availability A(t) increases on increasing the values of η1 and η2, but decreases
on increasing η3, as can be seen in fig. 6B.6(f).
Finally, from the tables and graphs, we conclude that
• The availability decreases as time t increases, but after some time it
becomes almost constant for the three configurations. The availability increases
with the reboot rate (β), whereas the failure rate of standbys (α) and the switching
parameter (q) change it only slightly. The tables 6B.1(a)-6B.1(c) reveal that the
availability of the system is more affected by β than by the failure rate α of
standbys and the switching parameter q.
• From the figs for models 1 and 2, it is demonstrated that on increasing the repair
rate, the availability decreases, whereas on increasing the set up time
the availability increases. Thus we conclude that the set up time has a significant
effect on the availability and should be incorporated in the availability analysis.
6B.6 Conclusion
In this investigation we have addressed the issue of improving the
availability of a multi-component system supported by warm standbys and a repair
facility. We have included the setup time, switching failure and reboot delay so that
the model developed may be closer to realistic situations. Various performance measures,
including the system availability in the transient state, have been examined with the help
of numerical results. It has been shown that the combination of standby and
maintainability is of great importance in many real time systems operating in the
machining environment. We have explored some system characteristics such as
failure frequency, probability of the system being in the reboot state and probability of the
system being under the setup/repair state, which may be helpful to system designers
during the design and development phases of the concerned system under optimal
operating conditions.
             α=0.3                    α=0.6                    α=0.9
 t    A1(t)  A2(t)  A3(t)     A1(t)  A2(t)  A3(t)     A1(t)  A2(t)  A3(t)
 0    1.000  1.000  1.000     1.000  1.000  1.000     1.000  1.000  1.000
 2    0.623  0.472  0.443     0.623  0.477  0.440     0.623  0.482  0.439
 4    0.504  0.423  0.363     0.503  0.430  0.363     0.502  0.435  0.365
 6    0.471  0.430  0.351     0.470  0.437  0.351     0.469  0.441  0.352
 8    0.463  0.438  0.344     0.462  0.443  0.343     0.461  0.446  0.343
10    0.461  0.443  0.335     0.460  0.447  0.332     0.460  0.449  0.331
12    0.460  0.446  0.324     0.460  0.449  0.319     0.459  0.451  0.317
14    0.460  0.447  0.312     0.459  0.450  0.306     0.459  0.452  0.303
Table 6B.1(a): Comparison of availability of three configurations for different values of α
             β=0.2                    β=0.5                    β=0.8
 t    A1(t)  A2(t)  A3(t)     A1(t)  A2(t)  A3(t)     A1(t)  A2(t)  A3(t)
 0    1.000  1.000  1.000     1.000  1.000  1.000     1.000  1.000  1.000
 2    0.611  0.454  0.417     0.643  0.500  0.485     0.663  0.528  0.529
 4    0.484  0.398  0.326     0.530  0.453  0.410     0.549  0.473  0.442
 6    0.450  0.406  0.316     0.493  0.454  0.386     0.504  0.465  0.402
 8    0.443  0.416  0.315     0.479  0.456  0.368     0.486  0.463  0.376
10    0.443  0.424  0.311     0.474  0.457  0.351     0.480  0.463  0.355
12    0.444  0.428  0.305     0.471  0.458  0.335     0.477  0.463  0.337
14    0.445  0.432  0.297     0.471  0.458  0.319     0.476  0.463  0.320
Table 6B.1(b): Comparison of availability of three configurations for different values of β
             q=0.1                    q=0.5                    q=0.9
 t    A1(t)  A2(t)  A3(t)     A1(t)  A2(t)  A3(t)     A1(t)  A2(t)  A3(t)
 0    1.000  1.000  1.000     1.000  1.000  1.000     1.000  1.000  1.000
 2    0.640  0.460  0.433     0.623  0.472  0.443     0.607  0.484  0.462
 4    0.522  0.401  0.339     0.504  0.423  0.363     0.487  0.444  0.401
 6    0.483  0.411  0.323     0.471  0.430  0.351     0.460  0.450  0.388
 8    0.471  0.423  0.318     0.463  0.438  0.344     0.455  0.453  0.374
10    0.467  0.432  0.313     0.461  0.443  0.335     0.454  0.454  0.359
12    0.466  0.436  0.305     0.460  0.446  0.324     0.454  0.455  0.344
14    0.466  0.439  0.295     0.460  0.447  0.312     0.454  0.455  0.328
Table 6B.1(c): Comparison of availability of three configurations for different values of q
Fig. 6B.4: Availability A(t) vs time t for model 1 by varying (a) µ1 (µ1=2, 8), (b) µ2 (µ2=3, 9), (c) η1 (η1=0.5, 0.8) and (d) η2 (η2=0.6, 0.9)
Fig. 6B.5: Availability A(t) vs time t for model 2 by varying (a) µ1 (µ1=2, 8), (b) µ2 (µ2=3, 9), (c) η1 (η1=0.5, 0.8) and (d) η2 (η2=0.6, 0.9)
Fig. 6B.6(a): Availability A(t) vs time t by varying µ1 for model 3 (µ1=2, 8)
Fig. 6B.6(b): Availability A(t) vs time t by varying η1 for model 3 (η1=0.2, 0.5)
Fig. 6B.6(c): Availability A(t) vs time t by varying µ2 for model 3 (µ2=3, 9)
Fig. 6B.6(d): Availability A(t) vs time t by varying η2 for model 3 (η2=0.6, 0.9)
Fig. 6B.6(e): Availability A(t) vs time t by varying µ3 for model 3 (µ3=2, 9)
Fig. 6B.6(f): Availability A(t) vs time t by varying η3 for model 3 (η3=0.5, 0.8)
Chapter-7
Warranty Policy for Hardware and Software Systems with Common
Cause Failure
7.1 Introduction
7.2 Model Description
7.3 The Analysis
7.4 Warranty Policy with Repair
7.5 Numerical Results
7.6 Conclusion
A software/hardware system consisting of one software
component and N hardware components, which are subject to
individual and common cause failures, is analyzed. Renewal
theory has been applied to determine renewals that bring a
component back into the system in an as-good-as-new condition. In this
chapter, we investigate the warranty policy of repairs and
replacements for developing the warranty cost models. The
reliability and other measures are obtained for performance
evaluation of the system. A numerical example is given to illustrate
the computational tractability of the model with the help of
MATLAB software.
7.1 Introduction
A warranty is an assurance from a manufacturer to a consumer that the
product sold is guaranteed to perform satisfactorily up to a specified period of time, i.e.
the warranty period. When an item does not perform satisfactorily as specified,
the dealer/manufacturer is responsible for repairing it or replacing it with a new one. The main
objective of a warranty is to provide protection to both manufacturers and
consumers. The terms of a warranty are not only a concern of the consumers but also
increase the sales and reputation of the manufacturers. Nowadays, in the
competitive global market, the warranty has become a more powerful tool to increase
sales and revenue. Many different types of warranty policies, such as the free
replacement repair policy, the pro-rata warranty policy, etc., exist in the literature. In the free
replacement repair policy, the manufacturer guarantees to repair or provide a
replacement for failed items free of cost up to a specified warranty period from the date
of initial purchase. In the pro-rata warranty policy, the manufacturer or seller agrees to
refund a fraction of the purchase price of the items which fail before a specified time
from the time of the initial purchase. Many researchers have studied the warranty cost
models in different frameworks. Ascher and Feingold (1984), Abdel-Hameed (1995)
and Hunter (1996) gave the mathematical techniques for warranty analysis of
software reliability. Vaurio (1999) suggested availability and cost functions for
periodically inspected preventively maintained units. Wang and Sheu (2001)
explored the effect of the warranty cost on the imperfect EMQ model with general
discrete shift distribution. Pham (2003b) analyzed the software reliability and cost
models. Attardi et al. (2005) presented a mixed-Weibull regression model for the
analysis of automotive warranty data. Rahman and Chattopadhyay (2006) gave the
review of long term warranty policies. Yun et al. (2008) studied warranty servicing
with imperfect repair. Wu et al. (2009) considered optimal price, warranty length and
production rate for free replacement policy in the static demand market. Yang et al.
(2009) considered cost-oriented task allocation and hardware redundancy policies in
heterogeneous distributed computing systems considering software reliability.
Srinivas et al. (2009) analyzed the influence of delivery times on repairable k-out-of-
N systems with spares. Yang et al. (2010) defined a generic data-driven software
reliability model using mining technique. Zhu et al. (2010) proposed the availability
optimization of the system subject to competing risk. A decision support model for
warranty servicing of repairable items was proposed by Rao (2011).
In reliability literature, renewal theory provides many variants of
replacement/repair policies for the maintainability of the system. Sheu (1991)
proposed a generalized block replacement policy with minimal repair and general
random repair costs for a multi-unit system. Shaked and Zhu (1992) gave some
results on the block replacement policies using renewal theory. Blischke and Murthy
(1994), Wang and Pham (1996) and Murthy et al. (2004) discussed the quasi-
renewal process and its applications in imperfect maintenance. Salameh and Jaber
(2000) developed the economic production quantity model for item with imperfect
quality. Pham and Wang (2001) proposed a quasi-renewal process for the software
reliability and testing costs. Yanez et al. (2002) discussed generalized renewal process
for the analysis of repairable systems with limited failure experience. Rai and Singh
(2005) gave a modeling framework for assessing the impact of new time/mileage
warranty limits on the number and cost of automotive warranty claims. Huang et al.
(2007b) proposed optimal reliability, warranty and price for new products. Noortwijk
and Weide (2008) discussed applications to continuous-time processes of
computational techniques for discrete-time renewal processes. Park and Pham (2008)
developed the warranty system-cost model using quasi-renewal processes. Samatli-
Pac and Taner (2009) discussed the role of repair strategy in warranty cost
minimization via quasi-renewal processes. Zhou et al. (2009) suggested the dynamic
pricing and warranty policies for products with fixed lifetime. Park and Pham (2010)
studied altered quasi-renewal concepts for modeling renewable warranty costs with
imperfect repairs. Kallen (2011) discussed the modeling of imperfect maintenance
and the reliability of complex systems using a superposed renewal process.
The warranty cost depends on the terms of the warranty and is calculated by
the manufacturer for each claim serviced under warranty. In this chapter, we suggest
a free replacement warranty policy according to which, if an item fails, it is replaced
by a new one without any cost to the customer (i.e. free of charge) because the item is non-
repairable. The replacement occurs according to a renewal process. The number of
failures during the warranty period is calculated mathematically based on a quasi-
renewal process. We derive representative cost functions to evaluate the
effectiveness of policies. The rest of the chapter is structured as follows. Section 7.2
deals with model description by stating the requisite assumptions and nomenclatures.
Section 7.3 provides the mathematical analysis of the model. In section 7.4, we
describe the warranty policy with repair. In section 7.5, numerical results are
provided. Finally in section 7.6, the conclusion is drawn.
7.2 Model Description
We provide the quasi-renewal analysis of the distribution function of the
number of product failures of a multi-component repairable system within a warranty
period w. A replacement service is made possible during the warranty period by
introducing two quasi-renewal concepts based on (i) the altered quasi-renewal process and
(ii) the mixed quasi-renewal process. Applying the quasi-renewal process, the cost analysis is
performed for a K-out-of-N system consisting of N hardware components and one
software component. The system components may fail individually or due to common
cause failure. For modeling purposes, the following assumptions are made.
Assumptions
• The repair and replacement do not happen simultaneously.
• The repair cost and replacement cost are constant. Also, repair time and
replacement service time are negligible.
• All warranty claims are executed and are valid.
• The repairs are imperfect and the repair process can be modeled by a quasi-
renewal process.
• The time to perform an inspection, in which it is determined whether the
failed component needs repair or a replacement, is negligible.
• The warranty period is renewable.
Abbreviations and Nomenclature
pmf, pdf, cdf : Probability mass function, probability density function,
cumulative distribution function.
QRP : Quasi-renewal process.
CV : Coefficient of variation.
i.i.d. : Independent and identically distributed.
r.v. : Random variable.
w : Length of a warranty period.
T : r.v. denoting time.
λC : Constant common cause failure rate.
β : Parameter for QRP.
Nsystem : The number of system failures.
RC(w) : The inter-failure time function of common cause.
Nh(t), Ns(t), Nc(t) : The number of hardware failures, software failures
and failures due to common cause in (0,t], respectively.
fh(.), Fh(.), Rh(.) : pdf, cdf and reliability function of hardware failure time within
a warranty period w, respectively.
fs(.), Fs(.), Rs(.) : pdf, cdf and reliability function of software failure time within
a warranty period w, respectively.
fih(.), Fih(.), Rih(.) : pdf, cdf and reliability function of the hardware component failure times after the (i-1)th repair/replacement within a warranty period w, respectively.
fis(.), Fis(.), Ris(.) : pdf, cdf and reliability function of the software component failure times after the (i-1)th repair/replacement within a warranty period w, respectively.
C(w) : Warranty cost of the system for a warranty period w.
ch, cs, c0 : Warranty cost for repairs/replacements within a warranty period w, for hardware, software and common cause failure, respectively.
7.3 The Analysis
Following Park and Pham (2008), the probability mass functions (pmf) of Nh and Ns
are given by
$P[N_h = n_h] = \left(\prod_{i=1}^{n_h} F_{ih}(w)\right) R_{(n_h+1)h}(w)$ …(7.1)
and
$P[N_s = n_s] = \left(\prod_{j=1}^{n_s} F_{js}(w)\right) R_{(n_s+1)s}(w)$ …(7.2)
In this section we evaluate the expected number of system failures due to
hardware components, software components and common cause failures. Under
imperfect repair, the reliability functions for the hardware and software, respectively, are
obtained as
$R_h(w) = \prod_{i=1}^{n_h} R_{ih}(w) = \prod_{i=1}^{n_h}\left(1 - \beta_{ih}\int_0^{w} f(\beta_{ih}x)\,dx\right)$ …(7.3)
$R_s(w) = \prod_{j=1}^{n_s} R_{js}(w) = \prod_{j=1}^{n_s}\left(1 - \beta_{js}\int_0^{w} f(\beta_{js}x)\,dx\right)$ …(7.4)
We also define
$R_C(w) = 1 - F_C(w) = 1 - \lambda_C\int_0^{w} f(\lambda_C x)\,dx$ …(7.5)
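To make the pmf of the form (7.1)-(7.2) concrete, the sketch below evaluates it for exponentially distributed lifetimes, the distribution assumed in the numerical illustration of Section 7.5. It is a hedged example rather than the program used in this chapter: the function name, the baseline rate `rate` and the list of scale parameters passed in the usage lines are assumptions. For an exponential baseline density f(x) = rate·exp(-rate·x), the cdf of the i-th inter-failure time is β_i ∫_0^w f(β_i x) dx = 1 - exp(-rate·β_i·w).

```python
import math

def quasi_renewal_pmf(n, w, betas, rate=1.0):
    """P[N = n] = prod_{i=1..n} F_i(w) * R_{n+1}(w), the form of (7.1)/(7.2).

    betas[i-1] is the scale parameter of the i-th inter-failure time
    (hypothetical input; at least n+1 values are needed).
    """
    if len(betas) < n + 1:
        raise ValueError("need at least n + 1 scale parameters")

    def cdf(i):
        # F_i(w) for an exponential baseline with the assumed rate
        return 1.0 - math.exp(-rate * betas[i - 1] * w)

    prob = 1.0
    for i in range(1, n + 1):
        prob *= cdf(i)            # failures 1..n all occur within w
    return prob * (1.0 - cdf(n + 1))  # the (n+1)-th failure does not

# Usage: scale parameters taken from the first four hardware values of Section 7.5.
betas_h = [0.7, 0.3, 0.4, 0.2]
pmf = [quasi_renewal_pmf(n, 2.0, betas_h) for n in range(3)]
```

The terms telescope, so the sum of the pmf over n = 0, ..., M equals one minus the product of the first M+1 cdf values; this offers a quick check on any implementation.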
7.3.1 Series System
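The reliability function of the system when N hardware components are arranged
in series is given by
$R(w) = \left[\prod_{i=1}^{n_h} R_{ih}(w)\right]^{N} \prod_{j=1}^{n_s} R_{js}(w)\; R_c(w)$
$= \left[\prod_{i=1}^{n_h}\left(1 - \beta_{ih}\int_0^{w} f(\beta_{ih}x)\,dx\right)\right]^{N} \prod_{j=1}^{n_s}\left(1 - \beta_{js}\int_0^{w} f(\beta_{js}x)\,dx\right)\left(1 - \lambda_c\int_0^{w} f(\lambda_c x)\,dx\right)$ …(7.6)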
7.3.2 K-out-of-N System
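When the hardware components are arranged in a K-out-of-N system along
with one software component, the reliability function is obtained as the probability of
having at least K functioning hardware units out of N and the software in a functioning
state, along with no failure due to common cause. Thus, we obtain
$R(w) = \sum_{k=K}^{N}\binom{N}{k}\left[R_h(w)\right]^{k}\left[1 - R_h(w)\right]^{N-k} R_s(w)\,R_c(w)$
$= \sum_{k=K}^{N}\binom{N}{k}\left[\prod_{i=1}^{n_h}\left(1 - \beta_{ih}\int_0^{w} f(\beta_{ih}x)\,dx\right)\right]^{k}\left[1 - \prod_{i=1}^{n_h}\left(1 - \beta_{ih}\int_0^{w} f(\beta_{ih}x)\,dx\right)\right]^{N-k} \prod_{j=1}^{n_s}\left(1 - \beta_{js}\int_0^{w} f(\beta_{js}x)\,dx\right)\left(1 - \lambda_c\int_0^{w} f(\lambda_c x)\,dx\right)$ …(7.7)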
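The first line of (7.7) is a binomial tail multiplied by the software and common-cause reliabilities, which is straightforward to evaluate once R_h(w), R_s(w) and R_c(w) are known. A minimal Python sketch follows; the function name and the numerical inputs in the usage line are illustrative assumptions, not values from this chapter.

```python
from math import comb

def k_out_of_n_reliability(K, N, r_h, r_s, r_c):
    """R(w) of (7.7): at least K of N hardware units work, the software works,
    and no common cause failure occurs. r_h, r_s, r_c are R_h(w), R_s(w), R_c(w)."""
    hardware_ok = sum(
        comb(N, k) * r_h**k * (1.0 - r_h)**(N - k) for k in range(K, N + 1)
    )
    return hardware_ok * r_s * r_c

# Usage: a 3-out-of-4 hardware configuration with hypothetical reliabilities.
r_system = k_out_of_n_reliability(3, 4, r_h=0.9, r_s=0.95, r_c=0.99)
```

Setting K = N recovers the series form, with the hardware contribution reducing to R_h(w) raised to the power N, and K = 0 leaves only the product R_s(w)R_c(w).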
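The probability that the system is not working is given by
$P(w) = \text{Prob.}(\text{system is not working})$
$= \sum_{k=0}^{K-1}\binom{N}{k}\{R_h(w)\}^{k}\left(1 - R_h(w)\right)^{N-k} R_s(w)\,R_c(w) + \sum_{k=K}^{N}\binom{N}{k}\{R_h(w)\}^{k}\left(1 - R_h(w)\right)^{N-k}\left[\bar{R}_s(w)\,R_c(w) + R_s(w)\,\bar{R}_c(w)\right]$ …(7.8)
where $\bar{R}_s(w) = 1 - R_s(w)$ and $\bar{R}_C(w) = 1 - R_C(w)$.
Thus
$P(w) = \sum_{k=0}^{K-1}\binom{N}{k}\left[\prod_{i=1}^{n_h}\left(1 - \beta_{ih}\int_0^{w} f(\beta_{ih}x)\,dx\right)\right]^{k}\left[1 - \prod_{i=1}^{n_h}\left(1 - \beta_{ih}\int_0^{w} f(\beta_{ih}x)\,dx\right)\right]^{N-k}\prod_{j=1}^{n_s}\left(1 - \beta_{js}\int_0^{w} f(\beta_{js}x)\,dx\right)\left(1 - \lambda_c\int_0^{w} f(\lambda_c x)\,dx\right)$
$+ \sum_{k=K}^{N}\binom{N}{k}\left[\prod_{i=1}^{n_h}\left(1 - \beta_{ih}\int_0^{w} f(\beta_{ih}x)\,dx\right)\right]^{k}\left[1 - \prod_{i=1}^{n_h}\left(1 - \beta_{ih}\int_0^{w} f(\beta_{ih}x)\,dx\right)\right]^{N-k}$
$\times\left\{\left[1 - \prod_{j=1}^{n_s}\left(1 - \beta_{js}\int_0^{w} f(\beta_{js}x)\,dx\right)\right]\left(1 - \lambda_c\int_0^{w} f(\lambda_c x)\,dx\right) + \prod_{j=1}^{n_s}\left(1 - \beta_{js}\int_0^{w} f(\beta_{js}x)\,dx\right)\lambda_c\int_0^{w} f(\lambda_c x)\,dx\right\}$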
Hence, the expected number of system failures due to hardware component failure is
given by
$E(N_h) = \sum_{n_h=1}^{\infty} n_h \sum_{k=0}^{K-1}\binom{N}{k}\left[R_h(w)\right]^{k}\left[1 - R_h(w)\right]^{N-k} R_s(w)\,R_c(w)$ …(7.9)
Expected number of system failures due to software failure is given by
$E(N_s) = \sum_{n_s=1}^{\infty} n_s \sum_{k=K}^{N}\binom{N}{k}\left[R_h(w)\right]^{k}\left[1 - R_h(w)\right]^{N-k} \bar{R}_s(w)\,R_c(w)$ …(7.10)
Expected number of system failures due to common cause failure is given by
$E(N_C) = \sum_{n_C=1}^{\infty} n_C \sum_{k=K}^{N}\binom{N}{k}\left[R_h(w)\right]^{k}\left[1 - R_h(w)\right]^{N-k} R_s(w)\,\bar{R}_c(w)$ …(7.11)
Expected number of system failures is obtained as
$E(N_{system}) = E(N_h) + E(N_s) + E(N_C)$ …(7.12)
Now we derive the variance of the number of system failures due to hardware
component failures as follows:
The second moment of the number of system failures due to hardware failure is
$E(N_h^2) = \sum_{n_h=1}^{\infty} n_h^2 \sum_{k=0}^{K-1}\binom{N}{k}\left[R_h(w)\right]^{k}\left[1 - R_h(w)\right]^{N-k} R_s(w)\,R_c(w)$ …(7.13)
Therefore, the variance of the number of system failures due to failure of hardware
components is given by
$Var(N_h) = E(N_h^2) - \left[E(N_h)\right]^2$ …(7.14)
Similarly, we can obtain the variance of the number of system failures due to software
component failure as
$Var(N_S) = E(N_S^2) - \left[E(N_S)\right]^2$ …(7.15)
where
$E(N_s^2) = \sum_{n_s=1}^{\infty} n_s^2 \sum_{k=K}^{N}\binom{N}{k}\left[R_h(w)\right]^{k}\left[1 - R_h(w)\right]^{N-k} \bar{R}_s(w)\,R_c(w)$ …(7.16)
Also
$Var(N_C) = E(N_C^2) - \left[E(N_C)\right]^2$ …(7.17)
The variance of the number of repair services for the system is given by
$Var(N_{system}) = E(N_{system}^2) - \left[E(N_{system})\right]^2$ …(7.18)
Let $c_h$, $c_S$ and $c_0$ be the repair cost per failure due to hardware failure,
software failure and common cause failure, respectively. The expected warranty cost
is given by
$E(C(w)) = c_h\,E(N_h) + c_S\,E(N_S) + c_0\,E(N_C)$ …(7.19)
The variance of the warranty cost of the system can be obtained as
$Var(C(w)) = c_h^2\,Var(N_h) + c_S^2\,Var(N_S) + c_0^2\,Var(N_C)$ …(7.20)
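Once the means and variances of the three failure counts are available, (7.19) and (7.20) reduce to simple weighted sums, and the standard deviation and coefficient of variation follow directly. The sketch below is an illustrative helper with hypothetical inputs, not the program used in this chapter; note that the additive variance in (7.20) corresponds to treating the three failure counts as uncorrelated.

```python
import math

def warranty_cost_moments(costs, means, variances):
    """Expected cost (7.19), variance (7.20), standard deviation and CV.

    costs     = (c_h, c_s, c_0)
    means     = (E(N_h), E(N_s), E(N_C))
    variances = (Var(N_h), Var(N_s), Var(N_C))
    """
    expected = sum(c * m for c, m in zip(costs, means))        # (7.19)
    variance = sum(c**2 * v for c, v in zip(costs, variances))  # (7.20)
    sd = math.sqrt(variance)
    # CV is the standard deviation relative to the mean; undefined for zero mean
    cv = sd / expected if expected > 0 else float("nan")
    return expected, variance, sd, cv

# Usage with hypothetical costs and moments.
expected, variance, sd, cv = warranty_cost_moments(
    costs=(2.0, 3.0, 4.0), means=(1.0, 1.0, 1.0), variances=(1.0, 1.0, 1.0)
)
```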
7.4 Warranty Policy with Repair
In this section we consider the warranty repair service and do not take into
consideration the warranty replacement service. For illustration, we consider a
dissimilar hardware system having a 3-out-of-4 configuration of hardware units along
with one software unit. Let $x_1, x_2, x_3, x_4$ be the indicators denoting that the hardware
components 1 through 4, respectively, are working. Also, $x_S$ and $x_C$ indicate that there
is no failure due to software and common cause, respectively. $\bar{x}_i$ denotes the
complementary event, so that $\bar{x}_i = 1 - x_i$.
Then the reliability of the system is given by
$R = \left[P(x_1x_2x_3x_4) + P(x_1x_2x_3\bar{x}_4) + P(x_1x_2\bar{x}_3x_4) + P(x_1\bar{x}_2x_3x_4) + P(\bar{x}_1x_2x_3x_4)\right]P(x_S)\,P(x_C)$
…(7.21)
( )[ ( ) ( ) ( ) ( ) ( ) ( ) ( )( ) ( ) ( ) ( ) ( ) ( ) ( ) ( )( ) ( ) ( ) ( )] ( ) ( )CCSS44332211
4433221144332211
4433221144332211
nN.PnNPnNPnNPnNPnNPnNPnNPnNPnNPnNPnNPnNPnNP
nNPnNPnNPnNPnNPnNPnNPnNPR
======+====+====+
====+=====
…(7.21)
where
[ ] ( ) ( ) ( )( ) 4,3,2,1h),x)P(P(xdx.xβfβ1dxxβfβnNP CS
x
0 h1nh1n
n
1i
x
0 ihihhh hh
h
=
−
== ∫∏ ∫ ++
=
Also )w(R)w(R)x)P(P(x CSCS =
λλ−
−= ∫∏ ∫
=
x
0CC
n
1j
x
0jsjs x)dxf(1x)dxf(ββ1
s
The expected warranty cost is given by
$E(C) = c\,E(N)$
where $c = c_h + c_s + c_0$.
7.4(i) K-R-out-of-N System
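Let us consider a K-R-out-of-N system. The probability that the system is working is
given by
$P_{(\text{system working})} = \sum_{\ell=K}^{R}\binom{N}{\ell}\left[R_h(w)\right]^{\ell}\left[1 - R_h(w)\right]^{N-\ell} R_s(w)\,R_c(w)$ …(7.23)
Now the probability that the system is not working is given by
$P_{(\text{system not working})} = \sum_{\ell=0}^{K-1}\binom{N}{\ell}\left[R_h(w)\right]^{\ell}\left[1 - R_h(w)\right]^{N-\ell} R_s(w)\,R_c(w)$
$+ \sum_{\ell=R+1}^{N}\binom{N}{\ell}\left[R_h(w)\right]^{\ell}\left[1 - R_h(w)\right]^{N-\ell} R_s(w)\,R_c(w)$
$+ \sum_{\ell=K}^{R}\binom{N}{\ell}\left[R_h(w)\right]^{\ell}\left[1 - R_h(w)\right]^{N-\ell}\left[\bar{R}_s(w)\,R_c(w) + R_s(w)\,\bar{R}_c(w)\right]$ …(7.24)
The expected number of system failures due to hardware component failure is given by
$E(N'_h) = \sum_{n_h=1}^{\infty} n_h\left[\sum_{\ell=0}^{K-1}\binom{N}{\ell}\left[R_h(w)\right]^{\ell}\left[1 - R_h(w)\right]^{N-\ell} R_s(w)\,R_c(w) + \sum_{\ell=R+1}^{N}\binom{N}{\ell}\left[R_h(w)\right]^{\ell}\left[1 - R_h(w)\right]^{N-\ell} R_s(w)\,R_c(w)\right]$ …(7.25)
The expected number of system failures due to software component failure is given
by
$E(N'_s) = \sum_{n_s=1}^{\infty} n_s\sum_{\ell=K}^{R}\binom{N}{\ell}\left[R_h(w)\right]^{\ell}\left[1 - R_h(w)\right]^{N-\ell} \bar{R}_s(w)\,R_c(w)$ …(7.26)
The expected number of system failures due to common cause is given by
$E(N'_C) = \sum_{n_C=1}^{\infty} n_C\sum_{\ell=K}^{R}\binom{N}{\ell}\left[R_h(w)\right]^{\ell}\left[1 - R_h(w)\right]^{N-\ell} R_s(w)\,\bar{R}_c(w)$ …(7.27)
The expected number of system failures is given by
$E(N) = E(N'_h) + E(N'_s) + E(N'_C)$ …(7.28)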
Now, with the help of the above results, we can calculate the variance of the
number of system failures, the variance of the number of repair services, the expected
warranty cost and the variance of the warranty system cost, etc. for the K-R-out-of-N system.
7.5 Numerical Results
In this section, we provide numerical results obtained by coding a computer program in
'MATLAB' software to examine the validity and tractability of the analytical results of
the proposed model by taking an illustration. We consider a 3-out-of-4 system and
compute numerical results for the total expected cost, standard deviation and coefficient
of variation of the system. The life times of the components follow the exponential
distribution. The warranty cost and other parameters are taken as β1h=0.7, β2h=0.3,
β3h=0.4, β4h=0.2, β6h=0.1, β8h=0.2, β9h=0.8, β12h=0.3, β6h=0.4, β20h=0.2, βs=0.8, λc=0.9,
c=1500.
For different values of β1h, βs and λc, the total expected cost is shown in figs 7.1(i)-
(iii). Figs 7.2(i)-(iii) and 7.3(i)-(iii) show the standard deviation and coefficient of
variation, respectively, with respect to the warranty period. From figs 7.1(i)-(iii), we
can see that the total expected cost decreases with the warranty period because reliability has
been increased by replacing hardware/software components, for which the cost is incurred
initially only. From figs 7.1(ii) and 7.1(iii), it is clear that as βs and λc increase, the
expected cost decreases, as expected.
The standard deviation graphs for different parameters with respect to the
warranty period are displayed in figs 7.2(i)-(iii). The value of the standard deviation
initially increases sharply and after some time it decreases slightly. In these figs, there
is no significant effect of increasing the values of β1h and βs; however, in fig. 7.2(iii),
on increasing λc, initially there is no change but after some time the standard deviation
decreases.
In figs 7.3(i)-(iii), we exhibit the coefficient of variation (CV) with respect to
the warranty period. The value of CV initially increases sharply and after some time it
increases slowly. It is also seen that CV remains almost constant as β1h increases. But
as βs and λc increase, the value of CV reveals an increasing trend with respect to the
warranty period.
7.6 Conclusion
A warranty provides an assurance to the customers regarding product quality
and warranty policies. When a repairable item fails under warranty, the
manufacturer has the option of either repairing the failed item or replacing it
with a new one. The cost depends on several factors such as the reliability of the
product, the warranty terms, the maintenance actions of the customers and the
servicing strategies used by the manufacturer. There is huge scope for future
research in this area. The warranty cost models have been developed at the component
level by using renewal, replacement and repair. The quasi-renewal processes used
provide more realistic results for the warranty cost model of K-out-of-N systems.
The results on reliability and warranty cost functions of our study may be easily
used in practice and would be helpful for designing optimal warranty
policies.
Fig. 7.1: Expected cost (EC) vs warranty period (w) by varying (i) β1h (β1h=0.3, 0.5, 0.7), (ii) βs (βs=0.2, 0.4, 0.6) and (iii) λc (λc=0.2, 0.4, 0.6)
Fig. 7.2: Standard deviation (SD) vs warranty period (w) by varying (i) β1h (β1h=0.3, 0.5, 0.7), (ii) βs (βs=0.3, 0.5, 0.7) and (iii) λc (λc=0.5, 0.6, 0.7)
Fig. 7.3: Coefficient of variation (CV) vs warranty period (w) by varying (i) β1h (β1h=0.2, 0.5, 0.8), (ii) βs (βs=0.4, 0.5, 0.6) and (iii) λc (λc=0.3, 0.5, 0.7)
Chapter-8
Semi-Markov Models with Common Cause Failure
Section-8A Redundant System with Rejuvenation
Section-8B Imperfect Fault Coverage System with Reboot
Section-8A
Redundant System with Rejuvenation
8A.1 Introduction
8A.2 Model Description
8A.3 Semi-Markov Analysis
8A.4 Performance Measures
8A.5 Total Expected Downtime Cost
8A.6 Conclusion
Software rejuvenation is a preventive maintenance technique
that prevents failures in continuously running systems that
experience software aging. In this section, a stochastic model is
developed to evaluate the availability of a system with
rejuvenation. The availability of a redundant system with
common cause failure and rejuvenation is obtained by using an
embedded Markov chain approach. A recursive procedure for
generating the state-transition probabilities is employed. An
appropriate framework for finding the optimal rejuvenation
interval is discussed by considering the downtime cost factors.
8A.1 Introduction
Today redundancy plays a dominant role in increasing system availability,
especially where the system availability must be greater than that of the components
used. Classical availability modeling of redundant systems assumes that the
redundant units fail independently, and the occurrence of common cause failures
violates this assumption. This is because common mode failures are
multiple failures which occur because of a single initiating factor or cause. When this
cause occurs, all other failures are triggered and constitute a complete system failure.
Subramanian and Anantharaman (1995) considered the reliability analysis of a
complex standby redundant system. Vaurio (2005) evaluated the uncertainties and
quantification of common cause failure rates and probabilities for the redundant
system analyses. Lu and Lewis (2006) suggested the reliability evaluation of standby
safety systems having independent and common cause failures. Shen et al. (2008)
proposed the exponential asymptotic property of a parallel repairable system with
warm standby under common-cause failure. Cekyay and Ozekici (2010) discussed the
mean time to failure and availability of semi Markov missions with maximal repair.
Hajeeh (2011) studied the availability for series configurations having standbys with
the existence of common cause failure of the system at all states.
System reliability engineering has advanced to the point where it is
beginning to separate into various specialized areas such as life cycle costing,
reliability growth modeling, reliability optimization and others. Many redundant
systems, such as the machine repair standby system, can be modelled by using a
semi-Markov process. The assumption of exponential distributions is rather unrealistic. If a
system under consideration is not highly reliable, no asymptotic methods can be
applied. Adke and Manjunath (1984) studied finite Markov processes. Sadek and
Limnios (2005) suggested nonparametric estimation of the reliability and survival
function for continuous-time finite Markov processes. Wen and Li (2009) proposed
a networked control system model based on minimum packet drop sequences with an
embedded Markov chain. An embedded Markov chain approach to stock rationing
was employed by Fadiloglu and Bulut (2010). Grabski (2011) also considered
semi-Markov failure rate processes.
Rejuvenation is an appropriate technique to prevent a computer system from
failures by periodically performing the maintenance of its software. Huang et al.
(1995) studied the software rejuvenation and its applications. Park and Kim (2002)
discussed the availability analysis and improvement of active/standby cluster systems
using software rejuvenation. Wie et al. (2005) gave the analysis of a two-level
software rejuvenation policy. Xie et al. (2005) examined a two-level software
rejuvenation policy. Rinsaka and Dohi (2007) discussed a faster estimation algorithm
for periodic preventive rejuvenation schedule maximizing system availability.
Iwamoto et al. (2008) considered periodic software rejuvenation schedules under
discrete-time operation circumstance. Methods and opportunities for rejuvenation in
aging distributed software systems were studied by Avritzer et al. (2010).
The main purpose of this investigation is to examine the optimal time to
perform software rejuvenation which improves the availability, the downtime cost and
the dependability measures. The rest of the section is organized as follows. In Section
8A.2, the model description is presented. In Section 8A.3, three semi-Markov models
are explicitly described. In Section 8A.4, we derive some performance measures.
The total expected downtime cost of the proposed model is presented in Section 8A.5.
Finally, we conclude this section with a short discussion in Section 8A.6.
8A.2. Model Description
Consider a computer system with one redundant node and the provision of
both automatic and manual switching procedures between the primary and the standby
unit. In case of a switching automation failure, the system is switched to the standby
unit manually. The time to resource exhaustion for the primary unit has to be modeled
by an increasing failure rate (IFR) distribution, as software resources are exhausted at
an increasing rate with the time for which the primary unit has been serving the system.
We consider three different models for the two unit redundant system with common cause
failure. In the first model, there is no rejuvenation action. The second model
incorporates full rejuvenation action, whereas the third model incorporates
rejuvenation failure.
8A.2.1 Model without Rejuvenation
A system of two identical components, one active and
one standby, is considered. When the active unit fails, the service is restored by
switching to the standby unit. In Figure 8A.1, the state transition diagram
of the two component redundant system without rejuvenation is presented. Initially,
the system is in state (1, 1) where the active unit is responsible for the service. The
time until the active unit fails follows an IFR Weibull distribution F1(t). The
distributions of the times needed for the various state transitions are denoted by F
with a subscript. Let γ be the
probability of automatic switching success; then γF1(t) denotes the probability of
entering state (0, 1) from state (1, 1). With probability (1-γ) the automatic switching
fails, the system enters state F, and the service is switched to the standby unit
manually. The time to enter state
(0, 1) from state F follows a distribution with parameter β. Hence, in the case of a
failure of the primary unit, the system enters state (0, 1) automatically or
manually. System control is then switched to the standby unit, while the failed unit is
repaired. Both primary and standby units may fail simultaneously due to a common
cause failure, in which case the system enters state (0, 0) from state (1, 1) with
distribution FC(t) with parameter FC.
Fig. 8A.1. Redundant system without rejuvenation
After repair, one active unit and one standby unit are available. In this case,
the system returns to state (1, 1) with distribution F2(t) which is assumed to be Erlang
with parameters (KE, λE). If during the repair of the failed unit the serving unit also
fails, the system experiences a total failure and enters state (0, 0) with distribution
F1(t); as soon as the failed unit is repaired, the system returns to state (0, 1) with
distribution F2(t).
8A.2.2 Rejuvenation Model
In this model, we make the same assumptions as stated for the model without
rejuvenation. The additional assumptions for this model are as follows. The time to
software rejuvenation follows a probability distribution F3(t). The system moves to state (R,
1), which is the rejuvenation state, by switching to the standby node
automatically with probability γ3; in the case of a failure in the automation
mechanism, the system enters state FR with probability γ4 and the service control is
switched manually to the standby unit while the primary unit is being rejuvenated. The
state transition diagram of the system is depicted in fig. 8A.2.
Fig. 8A.2. Redundant system with rejuvenation
If a failure occurs on the active unit before software rejuvenation is triggered,
then the system behaves in the same manner as in the case where no
rejuvenation action is taken. The distribution of the time for recovering from rejuvenation is
F4(t), which is assumed to be an exponential distribution with parameter βR. The time
to trigger rejuvenation has a fixed duration, so that F3(t) = u(t-tr), where u(t) is the unit step
function and tr is the time to trigger rejuvenation.
8A.2.3 Failed Rejuvenation Model
Fig. 8A.3. Failed software rejuvenation model
The case when a rejuvenation action is not completed properly or
is performed improperly is modeled by failed rejuvenation.
In detail, failed rejuvenation indicates an abnormal function during the rejuvenation
process. When failed rejuvenation occurs, the system enters a failure state. In this
case, software rejuvenation is being performed and the system state is (R, 1); the
rejuvenation may fail to be completed, i.e. a failure occurs in the rejuvenated unit in state
(R, 1), resulting in a transition to state (0, 1). The time needed for this transition is
assumed to be exponentially distributed with parameter λR. In fig. 8A.3, the transition
diagram of the failed rejuvenation model is shown.
8A.3. Semi-Markov Analysis
A semi-Markov process {X(t) : t ≥ 0} is a stochastic process in which changes
of state occur according to a Markov chain and in which the time interval between
two successive transitions is a random variable whose distribution depends on the
state from which the transition takes place as well as the state to which the next
transition takes place.
8A.3.1 Embedded Markov Model
The embedded Markov model provides computational advantages for steady-state
calculations while preserving the nature of the decision problem. In this sub-section,
embedded Markov chains are used for all the three equivalent semi-Markov
models. The relevant characteristics of the semi-Markov models are considered at those
epochs when the system state changes, and the time spent in a particular state can
follow an arbitrary probability distribution.
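The steady-state probability of each state i of the semi-Markov process (SMP) and the
mean sojourn time in state i are:
\[ \pi_i = \frac{v_i h_i}{\sum_{j \in E} v_j h_j} \quad \text{and} \quad h_i = \int_0^{\infty} \left(1 - H_i(t)\right) dt, \quad i \in E \] …(8A.1)
where H_i(t) is the sojourn time distribution of state i, v_i is the steady-state probability
of state i in the embedded Markov chain and E is the state space.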
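Equation (8A.1) can be evaluated directly once the embedded-chain probabilities and the mean sojourn times are available. The following Python lines are a minimal sketch of that weighting step; the function name, the state labels and all numeric values are hypothetical and serve only as an illustration.

```python
def smp_steady_state(v, h):
    """Return pi_i = v_i * h_i / sum_j v_j * h_j, as in equation (8A.1).

    v: embedded Markov chain steady-state probabilities, keyed by state
    h: mean sojourn times, keyed by the same states
    """
    weights = {state: v[state] * h[state] for state in v}
    total = sum(weights.values())
    return {state: weight / total for state, weight in weights.items()}

# Hypothetical inputs for the model without rejuvenation (illustration only).
v = {"(1,1)": 0.45, "(0,1)": 0.40, "F": 0.05, "(0,0)": 0.10}
h = {"(1,1)": 200.0, "(0,1)": 2.0, "F": 0.5, "(0,0)": 4.0}
pi = smp_steady_state(v, h)
print(pi)
```

States with long sojourn times dominate the result even when the embedded chain visits them only moderately often, which is why the two ingredients have to be combined.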
8A.3.2 Model without rejuvenation
Before the mean sojourn times, and consequently the limiting distribution of the SMP,
can be computed, the sojourn time distributions have to be obtained. We consider the
sojourn times in states (1,1) and (0,1) as follows. Let H1,1 be the sojourn time
distribution in state (1,1), which is the distribution of the minimum of three random
variables X1, X2 and V. Here X1 and X2, the times to leave state (1,1) towards states
(0, 1) and F, both follow F1(t), while V follows FC(t). Thus
\[ H_{1,1}(t) = \Pr(X_{\min} \le t) = \Pr(\min\{X_1, X_2, V\} \le t) = 1 - \Pr(\min\{X_1, X_2, V\} > t) = 1 - (1 - F_1(t))^2 (1 - F_C(t)) \] …(8A.2)
The sojourn time distribution in state (0, 1) is obtained as
\[ H_{0,1}(t) = 1 - (1 - F_1(t))(1 - F_2(t)) \] …(8A.3)
The mean sojourn times of the states of the SMP can then be computed as
\[ h_{1,1} = \int_0^{\infty} (1 - F_1(t))^2 (1 - F_C(t))\, dt, \qquad h_{0,1} = \int_0^{\infty} (1 - F_1(t))(1 - F_2(t))\, dt. \]
Also, taking the sojourn time in state F to be exponentially distributed with parameter β,
\[ h_F = \int_0^{\infty} \exp(-\beta t)\, dt = \frac{1}{\beta}, \qquad h_{0,0} = \int_0^{\infty} (1 - F_2(t))\, dt \] …(8A.4)
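The integrals for the mean sojourn times rarely have closed forms when F1(t) is Weibull, so in practice they are evaluated numerically. The sketch below uses a plain trapezoidal rule in Python. It assumes a Weibull F1(t) with hypothetical shape and scale parameters, an Erlang F2(t) with parameters (KE, λE), an exponential common cause distribution with a hypothetical rate `lam_c`, and an exponential sojourn in state F; the integration horizon and all numbers are illustrative choices, not values from the thesis.

```python
import math

def integrate(f, upper, steps=20_000):
    """Composite trapezoidal rule on [0, upper]."""
    dt = upper / steps
    inner = sum(f(n * dt) for n in range(1, steps))
    return (0.5 * (f(0.0) + f(upper)) + inner) * dt

def weibull_sf(t, shape, scale):
    """1 - F1(t) for a Weibull lifetime (IFR when shape > 1)."""
    return math.exp(-((t / scale) ** shape))

def erlang_sf(t, k, rate):
    """1 - F2(t) for an Erlang(k, rate) repair time."""
    return math.exp(-rate * t) * sum((rate * t) ** n / math.factorial(n) for n in range(k))

def mean_sojourn_times(shape, scale, k_e, lam_e, lam_c, beta, horizon):
    """Mean sojourn times of the model without rejuvenation, truncated at `horizon`."""
    h11 = integrate(lambda t: weibull_sf(t, shape, scale) ** 2 * math.exp(-lam_c * t), horizon)
    h01 = integrate(lambda t: weibull_sf(t, shape, scale) * erlang_sf(t, k_e, lam_e), horizon)
    h00 = integrate(lambda t: erlang_sf(t, k_e, lam_e), horizon)
    return {"(1,1)": h11, "(0,1)": h01, "F": 1.0 / beta, "(0,0)": h00}

# Hypothetical parameter values, for illustration only.
h = mean_sojourn_times(shape=2.0, scale=100.0, k_e=2, lam_e=1.0,
                       lam_c=0.001, beta=4.0, horizon=1000.0)
print(h)
```

The horizon has to be large enough for the survival functions to be negligible at the upper limit; with shape 1 the Weibull reduces to the exponential case, which gives a convenient check against closed-form means.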
8A.3.3 Rejuvenation Model
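Let X ~ F1(t), Y ~ F1(t), Z ~ F3(t), W ~ F3(t) and V ~ FC(t) be the random
variables denoting the time required for the change of the state (1,1) to states (0, 1),
F, FR, (R, 1) and (0, 0) respectively. The next step in the SMP analysis is to compute
the sojourn time at each state. For this model, the sojourn time distribution of state
(1, 1) is given by
\[ H_{1,1}(t) = \Pr(X_{\min} \le t) = \Pr(\min\{X, Y, Z, W, V\} \le t) = 1 - (1 - F_1(t))^2 (1 - F_3(t))^2 (1 - F_C(t)) \] …(8A.5)
while H0,1 is the same as in equation (8A.3).
The mean sojourn time of state (1,1) is obtained as
\[ h_{1,1} = \int_0^{\infty} (1 - F_1(t))^2 (1 - F_3(t))^2 (1 - F_C(t))\, dt = \int_0^{t_r} (1 - F_1(t))^2 (1 - F_C(t))\, dt \]
where the second equality holds because F3(t) = u(t-tr), so that the factor 1 - F3(t)
equals one for t < tr and vanishes for t ≥ tr.
Similarly we obtain the mean sojourn times of states (0, 1), F, (0, 0), FR and (R, 1) as follows:
\[ h_{0,1} = \int_0^{\infty} (1 - F_1(t))(1 - F_2(t))\, dt \] …(8A.6)
\[ h_F = \int_0^{\infty} \exp(-\beta t)\, dt = \frac{1}{\beta} \] …(8A.7)
\[ h_{0,0} = \int_0^{\infty} (1 - F_2(t))\, dt \] …(8A.8)
\[ h_{FR} = \int_0^{\infty} \exp(-a t)\, dt = \frac{1}{a} \] …(8A.9)
\[ h_{R,1} = \int_0^{\infty} \exp(-\beta_R t)\, dt = \frac{1}{\beta_R} \] …(8A.10)
where β, a and βR are the parameters of the exponentially distributed sojourn times
in states F, FR and (R, 1), respectively.
Using equation (8A.1), the steady-state distribution of the SMP can be computed from
these mean sojourn times and the steady-state distribution of the embedded Markov chain.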
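Since rejuvenation is triggered after a fixed time tr, i.e. F3(t) = u(t-tr), the factor 1 - F3(t) is one before tr and zero afterwards, so the mean sojourn time in state (1, 1) reduces to an integral over [0, tr]. The Python sketch below evaluates that truncated integral as a function of tr. It assumes a Weibull F1(t) and an exponential common cause distribution with a hypothetical rate `lam_c`; parameter names and values are illustrative.

```python
import math

def h11_with_rejuvenation(t_r, shape, scale, lam_c, steps=20_000):
    """Mean sojourn time in state (1, 1) when rejuvenation is triggered at t_r."""
    def integrand(t):
        survive_f1 = math.exp(-((t / scale) ** shape))  # 1 - F1(t), Weibull
        survive_fc = math.exp(-lam_c * t)               # 1 - FC(t), assumed exponential
        return survive_f1 ** 2 * survive_fc

    dt = t_r / steps
    inner = sum(integrand(n * dt) for n in range(1, steps))
    return (0.5 * (integrand(0.0) + integrand(t_r)) + inner) * dt

# Hypothetical values: the sojourn time grows with the rejuvenation trigger time.
for t_r in (10.0, 50.0, 100.0):
    print(t_r, h11_with_rejuvenation(t_r, shape=2.0, scale=100.0, lam_c=0.001))
```

Sweeping tr in this way is the first ingredient needed when the rejuvenation interval is later tuned against the downtime cost.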
8A.3.4 Failed Rejuvenation Model
In this model we assume that the rejuvenation is properly completed with
probability q and fails with probability 1-q, leading the system to state (0, 1). The
sojourn time distribution and the mean sojourn time in the rejuvenation state (R, 1)
change, because there is one more transition, from state (R, 1) to state (0, 1); the
time for this transition is assumed to be exponentially distributed with parameter λR.
Thus HR,1(t) and hR,1 can be determined as:
\[ H_{R,1}(t) = 1 - \exp\left(-(\beta_R + \lambda_R)\, t\right) \] …(8A.11)
and
\[ h_{R,1} = \frac{1}{\beta_R + \lambda_R + F_C} \] …(8A.12)
8A.4. Performance Measures
Let U denote the set of up states, depicted by shaded nodes. The asymptotic
availability (A) can be obtained using
\[ A = \sum_{i \in U} \pi_i \] …(8A.13)
where πi is the steady-state probability of state i.
The availability for model 1 is given by:
\[ A_{\text{without rej}} = \pi_{1,1} + \pi_{0,1} \] …(8A.14)
Furthermore, for models 2 and 3, we obtain the availability by using
\[ A_{\text{with rej}} = \pi_{1,1} + \pi_{0,1} + \pi_{R,1} \] …(8A.15)
8A.5. Total Expected Downtime Cost
The total expected downtime cost in the steady state over a time interval of L time
units is computed by
\[ E\{TC(L)\} = \lim_{t \to \infty} E[g(X_t)] \times L = \lim_{t \to \infty} \sum_{i \in E} w(i)\, \Pr(X_t = i) \times L = \sum_{i \in E} w(i) \lim_{t \to \infty} \Pr(X_t = i) \times L = \sum_{i \in E} w(i)\, \pi_i \times L \] …(8A.16)
where w(i) denotes the reward function for state i.
Let the average costs per unit of downtime for the automation mechanism failure and
for the repair procedure be denoted by cA and cR, respectively.
According to (8A.16), the expected total downtime costs for the three models are:
(i) Model without rejuvenation
\[ E\{TC_1(L)\} = \left(c_A \pi_F + c_R \pi_{0,0}\right) \times L \] …(8A.17)
(ii) Model with rejuvenation
\[ E\{TC_2(L)\} = \left\{c_A (\pi_F + \pi_{FR}) + c_R \pi_{0,0}\right\} \times L \] …(8A.18)
(iii) Failed rejuvenation model
\[ E\{TC_3(L)\} = \left\{c_A (\pi_F + \pi_{FR}) + c_R \pi_{0,0}\right\} \times L \] …(8A.19)
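The three cost expressions share one structure: the switching-failure states are charged at cA, the total failure state at cR, and the sum is scaled by the interval length L. A short Python sketch of this calculation follows; the function name, the state labels and the numeric values are hypothetical, and a state that does not exist in a given model simply contributes zero.

```python
def expected_downtime_cost(pi, c_a, c_r, length, switching_states=("F", "FR")):
    """Expected total downtime cost over `length` time units, as in (8A.17)-(8A.19).

    pi: steady-state probabilities keyed by state label
    c_a: cost per unit downtime caused by automation (switching) failure
    c_r: cost per unit downtime while the system is completely failed
    """
    switching = sum(pi.get(state, 0.0) for state in switching_states)
    return (c_a * switching + c_r * pi["(0,0)"]) * length

# Hypothetical steady-state probabilities for a model with rejuvenation.
pi = {"(1,1)": 0.95, "(0,1)": 0.03, "(R,1)": 0.007, "F": 0.001, "FR": 0.002, "(0,0)": 0.01}
print(expected_downtime_cost(pi, c_a=100.0, c_r=500.0, length=1000.0))
```

Evaluating this function for steady-state probabilities obtained at different trigger times tr gives a simple way to compare candidate rejuvenation schedules.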
In practice, system analysts and decision makers may be interested in
deriving the optimal rejuvenation interval tr that minimizes the total expected downtime
cost for each of the rejuvenation models presented. By comparing the total
expected downtime cost of all the models, one can decide on an appropriate
rejuvenation schedule, in particular when dealing with large applications.
8A.6. Conclusion
We have presented an embedded Markov chain approach for the prediction of the
asymptotic availability and the expected total downtime cost of a software rejuvenation
model of a redundant computer system with one active and one standby unit. The
performance indices established can be further used to determine the optimal
rejuvenation policy, in particular when failed rejuvenation takes place.
Section-8B
Imperfect Fault Coverage System with Reboot
8B.1 Introduction
8B.2 Model Description
8B.3 The Steady State Availability
8B.4 Special Cases
8B.5 Numerical Results
8B.6 Concluding Remarks
The availability of a two unit system is studied under different
types of prior assumptions for unknown parameters such as
imperfect fault coverage, reboot and common cause failure.
The solution of the semi-Markov model has been obtained by
using the supplementary variable technique (SVT). Explicit
expressions for the availability and failure frequency of the system
have been derived for some special distributions of repair time,
namely the exponential, gamma and uniform distributions. A numerical
simulation has been carried out to explore the effect of the different
distributions on the availability.
8B.1 Introduction
Availability is an important concept for the planning, design and operation
stages of various complex systems. It is defined as the percentage of time that a
system is available to perform its required functions. Redundancy, repair maintenance
and preventive maintenance are some of the well-known methods by which the
availability of a system can be enhanced. The stochastic models with standby units
have widely been studied by the various researchers. Some important contributions in
this direction are due to Mahmoud et al. (1987) and Verma and Chari (1991).
Yadavalli et al. (2002) found the asymptotic confidence limits for the steady state
availability of a two unit parallel system, with the assumption that the repair facility is
not available for a random time after each repair completion. The availability of a K-
out-of-N system, given limited spares and repair capacity under a condition based
maintenance strategy with warm standby system was studied by Smidt-Destombesa et
al. (2004) and Zhang et al. (2006). Wang et al. (2006) compared different system
configurations with warm standby components and standby switching failures. Based
on reliability analysis, they developed the explicit expressions for the mean time-to-
failure (MTTF) and the steady-state availability for different configurations.
Kiureghian et al. (2007) considered the availability, reliability and downtime of a
system with repairable components. Cekyay and Ozekici (2010) obtained the mean
time to failure and availability of semi-Markov missions with maximal repair as these
are frequently used in modern technology. Recently, Moghaddass and Zuo (2011)
discussed the optimal design of a repairable k-out-of-n system considering
maintenance.
The subject of common cause failures has been receiving significant attention
from researchers over the past two decades. It has been realized that in order to
predict realistic availability of standby systems, the occurrence of common-cause
failures must be considered. A common-cause failure is defined as any instance where
multiple units or elements fail due to a single cause. These types of failures could
occur owing to equipment design deficiencies, abnormal environment, external
catastrophe, common power source, common manufacturer, etc. Whether for predicting
the behavior of new designs or for studying possible changes in existing ones, modeling
of redundant repairable systems with common-cause failures is an important topic for
investigation. Some notable works in this area are as follows.
Chung (1981) studied a k-out-of-N: G three-state unit redundant system with
common-cause failure and replacements. To increase the system availability, the
switching time of the warm standby unit scheduled with common cause failure was
analyzed by Singh (1989). Dhillon and Anudu (1993) performed the common-cause
failure analysis of a non-identical unit parallel system with arbitrarily distributed
repair times. Human error and common-cause failure modeling of a two-unit multiple
systems was explored by El-Damcese (1997) and Atwood and Kelly (2008). Xing
(2007) and El-Damcese (2009) analyzed warm standby system subject to common
cause failures with time varying failure and repair rates. Ke et al. (2010) studied
simulation inferences for an availability system with general repair distribution and
imperfect fault coverage. Hsu et al. (2011b) discussed standby system with general
repair, reboot delay, switching failure and unreliable repair facility.
For highly reliable systems, coverage has a significant effect on the system’s
availability. However, some failures can remain undetected or uncovered, which can
lead to the system failure. Examples of the effect of uncovered faults can be found in
computing systems, electrical power systems, distribution networks, pipelines
carrying dangerous materials, etc. Systems subject to imperfect fault-coverage may
fail even prior to the exhaustion of standbys due to uncovered component failures.
Therefore, it is important to consider the effects of imperfect fault-coverage in the
design and analysis of these systems. Further, the effects of fault-coverage also play a
key role in electrical power distribution, dangerous fluid transportation, and several
standby redundancy applications. The reliability analysis of K-out-of-M: G systems
with dependent failures and imperfect coverage was studied by Moustafa (1997),
Amari et al. (2004), Myers (2007). Xing and Dugan (2002) showed that the
reliability of the systems subject to imperfect fault-coverage decreases after a certain
level of active redundancy. The systems with imperfect fault coverage have been
intensively studied by Vieira and Madeira (2004). Tang and Lee (2005) described a
simple recovery strategy for economic lot scheduling problem. Chang et al. (2005)
evaluated the reliability and other important measures for multistate systems subject
to imperfect fault coverage. Wang and Chiu (2006b) developed the steady-state
availability systems with warm standby units and imperfect coverage along with cost
benefit analysis. Levitin (2007) and Levitin and Amari (2008) explored a block
diagram method for analyzing the multi-state systems with uncovered failures.
Ke et al. (2008) studied the system characteristics of a two-unit repairable system
with different types of priors assumed for parameters such as detection,
recovery time, reboot delay and the coverage factor of an operating unit. Kuniewski
et al. (2009) considered a sampling inspection for the evaluation of time dependent
reliability of deteriorating systems under imperfect defect detection. The aggregated
semi-Markov repairable system with history-dependent up and down states was
examined by Wang and Cui (2011).
In this investigation, we present the availability analysis of a two unit system
with warm standbys and imperfect fault coverage by incorporating the concepts of reboot
and common cause failures. This section is arranged in the following manner. The
model under consideration is described in section 8B.2. In section 8B.3, we establish
the steady state availability of the system. Some cases of specific distributions of repair
time are considered in section 8B.4. In section 8B.5, the proposed model is illustrated
numerically. The effects of various system parameters are also examined by means of
a sensitivity analysis. The findings and notable features of our investigation are
summarized in section 8B.6.
8B.2 Model Description
We consider a redundant repairable system in which two components are
active. If any unit fails, the system immediately performs a reconfiguration
operation, which is assumed to take negligible time. The reconfiguration operation
detects and removes the failed unit from the system, while the other operating unit
continues to operate as before. The probability of a successful reconfiguration operation is
defined as the coverage factor C. After recovery, the failed component is ready for
repair. Active components are considered repairable. It is assumed that each of the
active components fails independently of the others according to a Poisson process
with rate λ. The system fails either due to a common cause failure or when all
of its units fail. The inter-failure time of common cause failures and the recovery time
are assumed to be exponentially distributed with rates λS and θ, respectively. The
reboot time is also exponentially distributed with parameter β. A general
distribution is considered for the repair time of the units.
In figure 8B.1, the state transition diagram of a redundant repairable system
is shown in which two active components are initially fully working in state (2).
When one of the active components fails, with probability C a protection switch
successfully restores service by switching to the other component, and the system
enters state (1). With probability (1-C) the protection switch fails to cover the failure
of the active component and the system enters state (4). We assume that the active
component failure in state (4) is cleared by a reboot, and the reboot delay for an active
unit follows an exponential distribution with rate β. The failure of the active
component is detected immediately with probability C, and when this happens, the
system enters state (1). If the failure of the active component is not detected, which
happens with probability (1-C), the system enters state (3). There is a latent fault in the spare
component when the system is in state (3). If a component failure occurs when the
system is in state (1), the system fails and enters state (0). When a component
switches over successfully, its failure characteristics become those of the active
component. If a unit is inactive, it is immediately sent to the repair facility, where units
are repaired one at a time in order of breakdown. The system can also fail with the common
cause failure rate λS. Let the times-to-repair of the components be independent and
identically distributed random variables following a general distribution with C.D.F.
B(x), (x > 0), and hazard rate b(x). In this section, we provide a closed form solution
for the availability and mean time to failure. In order to improve the system
availability, we should not only add redundancy but also improve the
coverage factor.
8B.3 The Steady State Availability
In this section we evaluate the availability of the redundant system having two
units. The state transition diagram is depicted in fig. 8B.1. We construct the
steady-state equations by balancing the flow rates as follows (see the transition diagram
in fig. 8B.1):
Fig. 8B.1: State transition diagram for two unit system
\[ 0 = -\beta P_4 + 2\lambda C P_2 \] …(8B.1)
\[ 0 = -\theta P_3 + 2\lambda C P_2 \] …(8B.2)
\[ 0 = -(2\lambda + \lambda_S) P_2 + P_1(0) \] …(8B.3)
\[ -\frac{dP_1(x)}{dx} = -\lambda P_1(x) + \theta\, b(x) P_3 + \beta\, b(x) P_4 + b(x) P_0(0) \] …(8B.4)
\[ -\frac{dP_0(x)}{dx} = \lambda P_1(x) + \lambda_S P_2 \] …(8B.5)
Solving (8B.1), (8B.2) and (8B.3), we get
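\[ P_4 = \frac{2\lambda C}{\beta} P_2 = b P_2 \] …(8B.6)
\[ P_3 = \frac{2\lambda C}{\theta} P_2 = a P_2 \] …(8B.7)
\[ P_1(0) = (2\lambda + \lambda_S) P_2 = \Lambda P_2 \] …(8B.8)
where \( \Lambda = 2\lambda + \lambda_S,\; a = \frac{2\lambda C}{\theta},\; b = \frac{2\lambda C}{\beta} \).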
Taking Laplace transforms of (8B.4) and (8B.5) and using (8B.6) and (8B.7), we have
\[ (\lambda - s) P_1^*(s) = 2\lambda B^*(s) P_2 + B^*(s) P_0(0) - P_1(0) \] …(8B.9)
\[ -s P_0^*(s) = \lambda P_1^*(s) + \lambda_S P_2^*(s) - P_0(0) \] …(8B.10)
Substituting s=λ and s=0 in (8B.9), we get
\[ 2\lambda B^*(\lambda) P_2 + B^*(\lambda) P_0(0) = P_1(0) \] …(8B.11)
\[ \lambda P_1^*(0) = 2\lambda B^*(0) P_2 + B^*(0) P_0(0) - P_1(0) \]
\[ \lambda P_1 = 2\lambda P_2 + P_0(0) - P_1(0) \] …(8B.12)
Differentiating (8B.9) w.r.t. 's' and then putting s=0, we obtain
\[ \lambda \frac{dP_1^*(0)}{ds} - P_1^*(0) = -2\lambda b_1 P_2 - b_1 P_0(0) \]
\[ \lambda \frac{dP_1^*(0)}{ds} - P_1 = -2\lambda b_1 P_2 - b_1 P_0(0) \] …(8B.13)
where b1 denotes the mean repair time.
Substituting s=0 in (8B.10), we get
\[ \lambda P_1 + \lambda_S P_2 = P_0(0) \] …(8B.14)
Differentiating (8B.10) w.r.t. 's' and then substituting s=0, we obtain
\[ \lambda \frac{dP_1^*(0)}{ds} = -P_0 + \lambda_S b_1 P_2 \] …(8B.15)
Using (8B.13) and (8B.15), we get
\[ b_1 P_0(0) = P_0 + P_1 - \Lambda b_1 P_2 \] …(8B.16)
Therefore
\[ P_0(0) = \frac{\Lambda - 2\lambda B^*(\lambda)}{B^*(\lambda)}\, P_2 \] …(8B.17)
Now equation (8B.12) gives
\[ P_1 = \frac{\Lambda \left(1 - B^*(\lambda)\right)}{\lambda B^*(\lambda)}\, P_2 \] …(8B.18)
Again equation (8B.16) yields
\[ P_0 = \frac{\lambda b_1 \left(\Lambda + \lambda_S B^*(\lambda)\right) - \Lambda \left(1 - B^*(\lambda)\right)}{\lambda B^*(\lambda)}\, P_2 \] …(8B.19)
The normalizing condition gives
\[ P_0 + P_1 + P_2 + P_3 + P_4 = 1 \] …(8B.20)
Finally solving equations (8B.6), (8B.7), (8B.18), (8B.19) and (8B.20), we obtain the
probabilities of the system states as
\[ P_0 = \frac{\theta\beta \left[ \lambda b_1 \{ (2\lambda + \lambda_S) + \lambda_S B^*(\lambda) \} - (2\lambda + \lambda_S)(1 - B^*(\lambda)) \right]}{\lambda D} \] …(8B.21a)
\[ P_1 = \frac{\theta\beta\, (2\lambda + \lambda_S)(1 - B^*(\lambda))}{\lambda D} \] …(8B.21b)
\[ P_2 = \frac{\theta\beta\, B^*(\lambda)}{D} \] …(8B.21c)
\[ P_3 = \frac{2\lambda C \beta\, B^*(\lambda)}{D} \] …(8B.21d)
\[ P_4 = \frac{2\lambda C \theta\, B^*(\lambda)}{D} \] …(8B.21e)
where the common denominator is
\[ D = \theta\beta \left\{ b_1 (2\lambda + \lambda_S) + (1 + b_1 \lambda_S) B^*(\lambda) \right\} + B^*(\lambda) \left( 2\lambda C \beta + 2\lambda C \theta \right). \]
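The system availability is given by
\[ A(\infty) = P_1 + P_2 + P_3 = \frac{\theta\beta \left[ (2\lambda + \lambda_S) + \{ \lambda (1 + a) - (2\lambda + \lambda_S) \} B^*(\lambda) \right]}{\lambda \left[ \theta\beta \{ b_1 (2\lambda + \lambda_S) + (1 + b_1 \lambda_S) B^*(\lambda) \} + B^*(\lambda)(2\lambda C \beta + 2\lambda C \theta) \right]} \] …(8B.22)
Failure frequency of the system is
\[ f = (\lambda + \lambda_S) P_0 + 2\lambda C P_4 = \frac{\theta\beta (\lambda + \lambda_S) \left[ \lambda b_1 \{ (2\lambda + \lambda_S) + \lambda_S B^*(\lambda) \} - (2\lambda + \lambda_S)(1 - B^*(\lambda)) \right] + 4\lambda^3 C^2 \theta B^*(\lambda)}{\lambda \left[ \theta\beta \{ b_1 (2\lambda + \lambda_S) + (1 + b_1 \lambda_S) B^*(\lambda) \} + B^*(\lambda)(2\lambda C \beta + 2\lambda C \theta) \right]} \] …(8B.23)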
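The state probabilities depend on the repair time distribution only through the transform value B*(λ) and the mean repair time b1, so a single routine covers every special case. The Python sketch below follows the relations P4 = bP2, P3 = aP2, (8B.18), (8B.19) and the normalizing condition (8B.20); the function and variable names are hypothetical, and the parameter values in the usage lines are illustrative.

```python
def steady_state(lam, lam_s, c, theta, beta, b_star, b1):
    """State probabilities, availability and failure frequency of the two unit system.

    b_star is the repair-time transform B*(lam); b1 is the mean repair time.
    """
    big_lambda = 2 * lam + lam_s
    a = 2 * lam * c / theta
    b = 2 * lam * c / beta
    # P2 follows from the normalizing condition P0 + P1 + P2 + P3 + P4 = 1.
    p2 = b_star / (b1 * (big_lambda + lam_s * b_star) + (1 + a + b) * b_star)
    p1 = big_lambda * (1 - b_star) / (lam * b_star) * p2
    p0 = (lam * b1 * (big_lambda + lam_s * b_star)
          - big_lambda * (1 - b_star)) / (lam * b_star) * p2
    p3 = a * p2
    p4 = b * p2
    return {
        "P": [p0, p1, p2, p3, p4],
        "availability": p1 + p2 + p3,
        "failure_frequency": (lam + lam_s) * p0 + 2 * lam * c * p4,
    }

# Exponential repair with rate mu: B*(lam) = mu / (lam + mu), b1 = 1 / mu.
lam, mu = 0.5, 3.0
result = steady_state(lam, lam_s=0.1, c=0.3, theta=0.6, beta=1.0,
                      b_star=mu / (lam + mu), b1=1.0 / mu)
print(result)
```

Checking that the five probabilities sum to one is a quick sanity test whenever a new repair time distribution is plugged in.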
8B.4 Special Cases
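In this section we discuss special cases corresponding to three repair time
distributions, viz. the exponential, gamma and uniform distributions. The system
performance indices for each case are established by setting
B*(λ) and b1 as given below:
(i) Exponential Distribution-
In this case, we have \( B^*(\lambda) = \frac{\mu}{\lambda + \mu},\; b_1 = \frac{1}{\mu} \).
Now equation (8B.22) yields
\[ A(\infty) = \frac{\mu \left[ \Lambda (\lambda + \mu) + \mu \{ \lambda (1 + a) - \Lambda \} \right]}{\lambda \left[ \Lambda (\lambda + \mu) + \mu \{ \lambda_S + \mu (1 + a + b) \} \right]} \] …(8B.24)
Failure frequency of the system is obtained as
\[ f = (\lambda + \lambda_S) P_0 + 2\lambda C P_4 = \frac{\theta\beta (\lambda + \lambda_S)(\Lambda\lambda + \lambda_S \mu) + 4\lambda^2 C^2 \mu^2 \theta}{\theta\beta \left[ \Lambda (\lambda + \mu) + \mu (\mu + \lambda_S) \right] + 2\lambda C \mu^2 (\beta + \theta)} \] …(8B.25)
where \( \Lambda = 2\lambda + \lambda_S,\; a = \frac{2\lambda C}{\theta},\; b = \frac{2\lambda C}{\beta} \).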
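(ii) Gamma Distribution-
For gamma distribution, we have
\( B^*(\lambda) = \left( \frac{r\mu}{\lambda + r\mu} \right)^r,\; b_1 = \frac{1}{\mu} \).
Then equation (8B.22) provides
\[ A(\infty) = \frac{\mu \left[ \Lambda (\lambda + r\mu)^r + (r\mu)^r \{ \lambda (1 + a) - \Lambda \} \right]}{\lambda \left[ \Lambda (\lambda + r\mu)^r + (r\mu)^r \{ \lambda_S + \mu (1 + a + b) \} \right]} \] …(8B.26)
Failure frequency of the system is
\[ f = (\lambda + \lambda_S) P_0 + 2\lambda C P_4 = \frac{\theta\beta (\lambda + \lambda_S) \left[ \lambda \{ \Lambda (\lambda + r\mu)^r + \lambda_S (r\mu)^r \} - \Lambda\mu \{ (\lambda + r\mu)^r - (r\mu)^r \} \right] + 4\lambda^3 C^2 \mu \theta (r\mu)^r}{\lambda \left[ \theta\beta \{ \Lambda (\lambda + r\mu)^r + (\lambda_S + \mu)(r\mu)^r \} + 2\lambda C \mu (\beta + \theta)(r\mu)^r \right]} \] …(8B.27)
where \( \Lambda = 2\lambda + \lambda_S,\; a = \frac{2\lambda C}{\theta},\; b = \frac{2\lambda C}{\beta} \).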
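(iii) Uniform Distribution-
For repair times uniformly distributed over the interval (a, b), we set
\( B^*(\lambda) = \frac{e^{-\lambda a} - e^{-\lambda b}}{\lambda (b - a)},\; b_1 = \frac{a + b}{2} \) in equation (8B.22).
To keep the interval limits a and b distinct from the ratios 2λC/θ and 2λC/β used
in the other cases, these ratios are written out explicitly here. With
\( E = e^{-\lambda a} - e^{-\lambda b} \), we get the availability for the uniform distribution as
\[ A(\infty) = \frac{2 \left[ \Lambda \{ \lambda (b - a) - E \} + \lambda \left( 1 + \frac{2\lambda C}{\theta} \right) E \right]}{\lambda \left[ (a + b) \{ \lambda \Lambda (b - a) + \lambda_S E \} + 2 \left( 1 + \frac{2\lambda C}{\theta} + \frac{2\lambda C}{\beta} \right) E \right]} \] …(8B.28)
Also the failure frequency is
\[ f = (\lambda + \lambda_S) P_0 + 2\lambda C P_4 = \frac{\theta\beta (\lambda + \lambda_S) \left[ \lambda (a + b) \{ \lambda \Lambda (b - a) + \lambda_S E \} - 2\Lambda \{ \lambda (b - a) - E \} \right] + 8\lambda^3 C^2 \theta E}{\lambda \left[ \theta\beta (a + b) \{ \lambda \Lambda (b - a) + \lambda_S E \} + 2 \{ \theta\beta + 2\lambda C (\beta + \theta) \} E \right]} \] …(8B.29)
where \( \Lambda = 2\lambda + \lambda_S \).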
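The three special cases differ only in the transform B*(λ) and the mean b1 that are substituted into the general expressions. The Python sketch below collects the three transforms and cross-checks the uniform one against a direct numerical integration of exp(-sx) times the density. The function names are hypothetical, the uniform transform is written in its standard form (with the factor s in the denominator), and the midpoint rule is only a verification aid.

```python
import math

def lst_exponential(s, mu):
    """Transform of an exponential repair time with rate mu (mean 1/mu)."""
    return mu / (s + mu)

def lst_gamma(s, mu, r):
    """Transform of a gamma repair time with shape r and mean 1/mu."""
    return (r * mu / (s + r * mu)) ** r

def lst_uniform(s, low, high):
    """Transform of a repair time uniform on (low, high), mean (low + high) / 2."""
    return (math.exp(-s * low) - math.exp(-s * high)) / (s * (high - low))

def lst_numeric(pdf, s, upper, steps=100_000):
    """Midpoint-rule approximation of the transform, used only as a cross-check."""
    dx = upper / steps
    return sum(math.exp(-s * (n + 0.5) * dx) * pdf((n + 0.5) * dx)
               for n in range(steps)) * dx

def uniform_pdf(x, low=0.2, high=0.6):
    return 1.0 / (high - low) if low <= x <= high else 0.0

print(lst_uniform(0.5, 0.2, 0.6), lst_numeric(uniform_pdf, 0.5, 1.0))
```

With shape r = 1 the gamma transform reduces to the exponential one, which is a convenient consistency check between cases (i) and (ii).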
8B.5 Numerical Results
In this section we provide numerical results for the availability and failure
frequency of the repairable system by varying the parameters, namely the
fault coverage (C), the reboot rate (β) and the common cause failure rate (λS). The
program has been coded in MATLAB. A sensitivity analysis is carried out to check the
validity of the analytical results. For computational purposes we set the default
parameters as λS=0.1, C=0.3, θ=0.6, β=1, µ=3. To analyze the effects of the various
parameters on the availability, numerical results are summarized in tables 8B.1(a)-
8B.1(b) and 8B.2(a)-8B.2(b). The graphical presentation of the availability is
given in figs 8B.1-8B.4 for the specific distributions, namely exponential,
gamma and uniform.
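A sweep of this kind is straightforward to reproduce. The sketch below is written in Python rather than MATLAB and is not the program used for the reported tables; it evaluates the availability for exponential repair, obtained from P1 + P2 + P3 with B*(λ) = µ/(λ+µ) and b1 = 1/µ and written in a simplified form, over a hypothetical grid of failure rates λ with the default parameters quoted above.

```python
def availability_exponential(lam, lam_s=0.1, c=0.3, theta=0.6, beta=1.0, mu=3.0):
    """Steady-state availability for exponentially distributed repair times."""
    big_lambda = 2 * lam + lam_s
    a = 2 * lam * c / theta
    b = 2 * lam * c / beta
    numerator = mu * (big_lambda + mu * (1 + a))
    denominator = big_lambda * (lam + mu) + mu * lam_s + mu ** 2 * (1 + a + b)
    return numerator / denominator

# Hypothetical grid of failure rates; the other parameters keep their defaults.
failure_rates = [0.1 * k for k in range(1, 11)]
curve = [(lam, availability_exponential(lam)) for lam in failure_rates]
for lam, avail in curve:
    print(f"lambda={lam:.1f}  A={avail:.4f}")
```

Repeating the loop with other values of C, β or λS gives the kind of parameter comparison that the tables and figures of this section summarize.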
Tables 8B.1(a) and 8B.1(b) display the effects of parameters C, µ and β on the
availability for exponential, gamma and uniform distribution. Table 8B.1(a) reveals
that the availability increases as C and µ increase in the case of the exponential or gamma
distribution. In the case of the uniform distribution, for the smaller values of β (i.e. β=1), the
availability decreases initially and then becomes almost constant. It is observed that the
availability increases initially and then becomes constant for β=2 and β=3. In table 8B.1(b),
we examine the effects of θ, µ and β on the availability and note that as β increases,
the availability increases for all distributions. The availability increases as µ increases
in case of exponential and gamma distribution but it remains almost constant for
uniform distribution. On increasing the values of C, the availability increases for
uniform distribution but decreases for exponential distribution. It is noticed that for
gamma distribution, the availability decreases only for β=1 but increases for β=2 and
β=3.
Figs 8B.1-8B.3 show the variation of the availability with respect to the
parameter λ for different values of C, β, θ and λS for the exponential, uniform and gamma
distributions, respectively. In figs 8B.1(i)-8B.1(iii) and 8B.3(i)-8B.3(iii), the
availability decreases sharply as λ increases, whereas in figs 8B.2(i)-8B.2(iii) the
availability initially decreases smoothly and thereafter decreases sharply.
Figs 8B.4(i)-8B.4(iv) show the variation of the availability with respect to λ, λS, β
and µ, respectively, by varying r for the gamma distribution. The availability decreases
slightly when r increases, as is clear from figs 8B.4(i)-8B.4(ii). In figs 8B.4(iii)-
8B.4(iv), however, the availability increases when r increases.
Overall we conclude that for all distributions, the availability decreases
with respect to the failure rate λ. A close study of the availability tables reveals that the
availability is most affected by the parameters C and θ, because both factors help to prevent
failure of the entire system by replacing the available elements with the recovered failed
elements.
8B.6 Concluding Remarks
In many real time applications, highly available fault-tolerant systems are
expensive and time consuming to develop and deploy. In this section, we have
developed steady-state results for the availability of redundant systems with imperfect
fault coverage, reboot and common cause failure. The steady-state probabilities are
obtained in terms of the system operating parameters and are then used to find
the availability and failure frequency of the system. The proposed model has the
advantage of being quite general and can provide a useful performance evaluation
tool for real time fault-tolerant systems arising in practical applications, such as
production systems, flexible manufacturing systems, computer and communication
systems, transportation systems, inventory problems, and many other related systems.
Availability
C     µ    Uniform      Exponential Distribution     Gamma Distribution
           β=1,2,3      β=1      β=2      β=3        β=1      β=2      β=3
0.7   1    0.434        0.467    0.490    0.498      0.451    0.474    0.482
      3    0.409        0.667    0.711    0.728      0.662    0.706    0.722
      5    0.409        0.730    0.782    0.801      0.728    0.780    0.799
      7    0.409        0.761    0.817    0.838      0.760    0.816    0.836
      9    0.409        0.779    0.838    0.859      0.779    0.837    0.859
      11   0.409        0.792    0.852    0.874      0.791    0.851    0.873
      13   0.409        0.800    0.862    0.884      0.800    0.861    0.884
      15   0.409        0.807    0.869    0.892      0.806    0.869    0.891
0.8   1    0.479        0.484    0.500    0.505      0.464    0.479    0.485
      3    0.444        0.701    0.731    0.741      0.694    0.724    0.734
      5    0.444        0.770    0.805    0.817      0.767    0.802    0.814
      7    0.444        0.804    0.842    0.855      0.802    0.840    0.853
      9    0.444        0.824    0.863    0.877      0.823    0.862    0.876
      11   0.444        0.838    0.878    0.892      0.837    0.877    0.891
      13   0.444        0.847    0.888    0.903      0.847    0.888    0.902
      15   0.444        0.855    0.896    0.911      0.854    0.896    0.910
0.9   1    0.519        0.501    0.509    0.512      0.477    0.485    0.487
      3    0.491        0.733    0.748    0.753      0.725    0.740    0.745
      5    0.491        0.808    0.825    0.831      0.804    0.822    0.828
      7    0.491        0.845    0.863    0.870      0.843    0.861    0.868
      9    0.491        0.867    0.886    0.893      0.865    0.885    0.892
      11   0.491        0.881    0.901    0.908      0.880    0.900    0.907
      13   0.491        0.892    0.912    0.919      0.891    0.911    0.918
      15   0.491        0.900    0.920    0.927      0.899    0.920    0.927
Table 8B.1(a): Effects of parameters C, µ and β on the availability for different distributions of repair time.
Availability
θ     µ    Uniform      Exponential Distribution     Gamma Distribution
           β=1,2,3      β=1      β=2      β=3        β=1      β=2      β=3
0.7   1    0.429        0.399    0.449    0.469      0.395    0.445    0.464
      3    0.429        0.537    0.629    0.668      0.535    0.628    0.666
      5    0.429        0.577    0.684    0.730      0.576    0.684    0.729
      7    0.429        0.596    0.711    0.760      0.596    0.711    0.760
      9    0.429        0.608    0.727    0.778      0.607    0.727    0.778
      11   0.429        0.615    0.738    0.790      0.615    0.737    0.790
      13   0.429        0.620    0.745    0.798      0.620    0.745    0.798
      15   0.429        0.624    0.750    0.805      0.624    0.750    0.805
0.8   1    0.502        0.388    0.442    0.463      0.391    0.446    0.467
      3    0.502        0.509    0.610    0.652      0.510    0.611    0.654
      5    0.502        0.543    0.659    0.710      0.543    0.660    0.711
      7    0.502        0.558    0.683    0.738      0.559    0.684    0.739
      9    0.502        0.567    0.697    0.755      0.568    0.698    0.755
      11   0.502        0.573    0.707    0.766      0.573    0.707    0.766
      13   0.502        0.577    0.713    0.774      0.578    0.713    0.774
      15   0.502        0.581    0.718    0.779      0.581    0.718    0.779
0.9   1    0.469        0.381    0.437    0.459      0.389    0.446    0.469
      3    0.469        0.493    0.598    0.643      0.496    0.601    0.646
      5    0.469        0.522    0.644    0.698      0.523    0.645    0.700
      7    0.469        0.535    0.666    0.724      0.536    0.667    0.725
      9    0.469        0.543    0.678    0.740      0.544    0.679    0.741
      11   0.469        0.548    0.687    0.750      0.548    0.687    0.751
      13   0.469        0.551    0.693    0.757      0.552    0.693    0.758
      15   0.469        0.554    0.697    0.763      0.554    0.697    0.763
Table 8B.1(b): Effects of parameters θ, µ and β on the availability for different
distributions of repair time
[Figure panels: Availability plotted against λ (0.1 to 1). Left column: (i) c=0.7, 0.8, 0.9 (ii) β=1, 2, 3 (iii) θ=0.4, 0.6, 0.8. Right column: (i) c=0.1, 0.2, 0.3 (ii) β=1, 2, 3 (iii) θ=0.7, 0.8, 0.9.]
Fig. 8B.2: Availability vs λ by varying (i) C (ii) β (iii) θ for exponential distributed repair time
Fig. 8B.3: Availability vs λ by varying (i) C (ii) β (iii) θ for uniform distributed repair time
[Figure panels: Availability plotted against λ (0.1 to 1) for (i) c=0.7, 0.8, 0.9 (ii) β=1, 2, 3 (iii) θ=0.4, 0.6, 0.8 (iv) λs=0.1, 0.5, 0.9.]
Fig. 8B.4: For gamma distributed repair time, the availability vs λ by varying (i) C (ii) β (iii) θ (iv) λS
[Figure panels: Availability plotted against (i) λ (ii) λs (iii) β (iv) µ, each for r=0.5, 2, 3.5.]
Fig. 8B.5: For gamma distributed repair time, the availability vs (i) λ (ii) λs (iii) β (iv) µ by varying r
Chapter-9
Software Reliability Growth Model (SRGM) with N-version Programming
9.1 Introduction
9.2 NHPP Model
9.3 Mean Value Function
9.4 Reliability Estimation
9.5 Parameter Estimation
9.6 Total Expected Cost of Software
9.7 Numerical Results
9.8 Concluding Remarks
Software reliability models have received the attention of
software designers for the evaluation of quality factors. In this
chapter, we propose a software reliability growth model
(SRGM) based on the non-homogeneous Poisson process (NHPP).
The main aim of this investigation is to develop a software
reliability growth model that incorporates both testing-effort
and N-version programming under an imperfect
debugging process. Software faults detected during testing are
categorized into three types, namely minor, major and
critical, depending on the severity of the faults. A modified
approach is discussed to determine the delivery cost of the
software at any time during testing. A parameter
estimation approach is proposed to estimate the unknown
parameters of the proposed model. Numerical results are also
presented to demonstrate the validity of the analytical results. By
using the Adaptive Network-based Fuzzy Inference System
(ANFIS) approach for the SRGM, we explore the prediction
capabilities of soft computing.
9.1 Introduction
Reliability is a basic requirement for both hardware and software systems. With
the increasing demand for software, high quality software systems have become
essential for a high degree of system reliability. The quality of software is
described by many metrics such as complexity, portability, maintainability,
availability and reliability. In recent years, many fault tolerant systems have
become software dependent for their correct functioning. There are two methods for
obtaining fault tolerance of software: (i) N-version programming and (ii) recovery block.
In N-version programming (NVP), N independently developed programs for the same
application are executed in parallel, and the decision is obtained by voting on the
outputs of the individual programs. The N-version programming technique can tolerate the
design faults present in
the software if the design diversity concept is implemented properly. Each version of
the module should be implemented in as diverse a manner as possible, including different
tool sets, different programming languages and possibly different environments. All N
software versions are executed simultaneously and their results are sent to a voter
which selects the correct output. Whenever a failure occurs, the causes of all faults
detected are immediately and perfectly removed. The N-version programming
technique was proposed by Avizienis and Chen (1977). Various researchers have
contributed significantly to the field of N-version programming (cf. Brilliant et al.,
1990; Dugan and Lyu, 1993). Pham (1994) presented the system reliability analysis
of an N-version programming application. Teng and Pham (2002) and Bhaskar and
Kumar (2006) discussed software reliability growth models for N-version
programming with cost and imperfect debugging. Laval et al. (2011) discussed
support for simultaneous versions in software evolution assessment.
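The voting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the voter uses a strict majority rule, the versions are run one after another rather than truly in parallel, and the three toy versions and the names `majority_vote` and `run_nvp` are hypothetical, not taken from the thesis.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the output shared by a strict majority of versions, or None if there is none."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None

def run_nvp(versions, x):
    """Run every version on the same input and vote on the collected results."""
    return majority_vote([version(x) for version in versions])

# Three hypothetical versions of the same specification (square the input);
# the third one contains a design fault.
versions = [lambda x: x * x, lambda x: x ** 2, lambda x: x * 2]
print(run_nvp(versions, 3))  # two of the three versions agree on 9
```

With N=3 the voter masks a fault in any single version; if no strict majority exists it returns None, which a real system would have to treat as a detected but untolerated failure.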
SRGMs describe the relationship between the cumulative number of faults
detected by software testing and the time expended. Various stochastic process
models have been successfully used in studying hardware and software reliability
problems. Some notable researchers have contributed to this field. Pham and
Zhang (2003) and Cai et al. (2008) studied the cost analysis of reliability growth
models based on NHPP for hardware and software systems with testing coverage.
Yang and Xie (2000) and Rackwitz (2001) studied the operational and testing
reliability of software systems. Pham (2007) and Castillo et al. (2008) analyzed
optimization and reliability problems with imperfect debugging and fault-detection
dependent parameters for SRGM. Rebba and Mahadevan (2008) gave
computational methods for model reliability assessment. Zio (2009) discussed
the old and new challenges related to reliability prediction. Reliability growth
modeling for in-service systems considering latent failure modes was discussed by
Jin et al. (2010). Hsu et al. (2011a) proposed enhancing software reliability
modeling and prediction through the introduction of a time-variable fault reduction
factor.
In some applications, SRGMs also incorporate the time-dependent behavior of
testing-effort expenditure. Testing effort includes the number of executed test cases,
the man power spent during testing and the number of CPU hours. Many researchers
have described software reliability growth in different frameworks under
testing effort. Software reliability growth models with testing-effort function, fault
detection rate and change point were studied by Yamada and Othera (1990), Kuo et
al. (2001), Kapur and Bhardhan (2002), Huang and Kuo (2002) and Huang (2005).
Catuneanu et al. (1991), Kuo (2005) and Huang and Lyu (2005) examined the
software reliability release policy with testing effort by considering the cost and
testing efficiency. The framework for developing testing-effort dependent software
reliability growth models with learning function for distributed systems was
considered by Yamada et al. (1993), Inoue and Yamada (2004), Kapur et al. (2004),
Huang et al. (2007a), Kapur et al. (2009) and many more. Huang and Hung (2010)
presented software reliability analysis and assessment using queueing models
with multiple change-points. The sensitivity analysis of the release time of software
reliability models incorporating testing effort with multiple change-points was studied
by Li et al. (2011).
The NHPP based models are the most important models because of their
simplicity, convenience and compatibility. This group of models provides an
analytical framework for describing the software failure occurrence or fault-removal
phenomenon during testing. These models are normally based upon different
debugging scenarios and can capture quantitatively the typical reliability growth
observed in the testing phase of software products. This chapter investigates N-version
programming for an SRGM based on NHPP that incorporates imperfect
debugging and testing effort. The rest of the chapter is organized as follows. In
section 9.2, we describe the development of the SRGM based on NHPP to examine
software testing for N-version programming. In section 9.3, we define the mean value
function which describes the fault detection, isolation and removal phenomenon in
software testing. In section 9.4, we describe the reliability estimation for the model. In
section 9.5, we suggest the parameter estimation of the model which is based on
NHPP and NVP. The total software cost is evaluated in section 9.6. In section 9.7, we
present numerical results by taking an illustration. Finally, section 9.8 contains the
concluding remarks.
9.2 NHPP Model
We develop a testing-effort dependent reliability growth model which
incorporates the testing-effort spent on software testing. Let us consider a counting
process $\{N(t), t \ge 0\}$ describing the cumulative number of errors detected up to testing
time t. The software reliability growth model (SRGM) based on NHPP can be
formulated as a Poisson process given by:
$$\Pr\{N(t) = n\} = \frac{\{m(t)\}^{n}}{n!}\exp\{-m(t)\}, \quad n = 0, 1, 2, \ldots$$ …(9.1)
where $m(t)$ is the mean value function which represents the expected cumulative
number of errors detected in the time interval (0, t].
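As a quick numerical illustration of (9.1), the sketch below evaluates the probability of observing n errors for a given value of the mean value function. The exponential form used for m(t) and the values a=100, b=0.05 are assumptions made only for this example; they are not the mean value function derived later in the chapter.

```python
import math

def nhpp_pmf(n, m_t):
    """Pr{N(t) = n} from (9.1), where m_t is the value of the mean value function at time t."""
    if n < 0:
        return 0.0
    if m_t == 0.0:
        return 1.0 if n == 0 else 0.0
    # Evaluate in log space so that n! does not overflow for large n.
    return math.exp(n * math.log(m_t) - m_t - math.lgamma(n + 1))

def example_mean(t, a=100.0, b=0.05):
    """Hypothetical mean value function m(t) = a(1 - exp(-b t)), for illustration only."""
    return a * (1.0 - math.exp(-b * t))

m10 = example_mean(10.0)
probs = [nhpp_pmf(n, m10) for n in range(200)]
expected = sum(n * p for n, p in enumerate(probs))
print(round(m10, 3), round(sum(probs), 6), round(expected, 3))
```

The probabilities sum to one and their mean reproduces m(t), which is the defining property of the Poisson formulation above.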
9.2.1 Software Reliability Growth Model
In this section, we develop a software reliability growth model (SRGM) for N-version
programming systems by considering the error removal efficiency and the
error introduction rate during testing. The model describes how the observation of
failures and the correction of the underlying faults, which take place while the
software is being tested and debugged, affect the reliability of
the software. To formulate the model, the following assumptions are made:
There are three types of errors (minor, major and critical) in the system. The minor, major and critical errors are considered for each of the N versions. The error introduction probabilities are assumed to be constant for each version.
The detection/isolation/fault removal phenomenon is modeled by a non-homogeneous Poisson process (NHPP).
The time-dependent behavior of the testing-effort is governed by an exponential distribution.
The mean number of faults/errors detected in the time interval $(t, t+\Delta t]$ is proportional to the mean number of remaining errors in the system.
Imperfect debugging is considered for each of the N versions of the software.
The following notations are used for mathematical formulation and performance
evaluation purposes:
pdf : Probability density function.
N : Total number of versions in the system.
M : Total amount of testing-effort eventually consumed.
i : Index representing the version number in the system, i=1,2,3,…,N.
j : Index representing the type of error, i.e. minor, major and complex faults, respectively, in the system for j=1,2,3.
$a_i$ : Number of errors to be detected eventually in version i, i=1,2,3,…,N.
$b_{ij}$ : Failure detection rate per unit testing effort of version i for jth type fault, where $b_{1j} \neq b_{2j} \neq \ldots \neq b_{Nj}$.
$p_{ij}$ : Error content proportion of version i for jth type error, where $p_{1j} \neq p_{2j} \neq \ldots \neq p_{Nj}$.
$\beta_{ij}$ : Error introduction rate for jth type error of version i, where $\beta_{1j} \neq \beta_{2j} \neq \ldots \neq \beta_{Nj}$.
$m^{D}_{ij}(t)$ : Expected mean number of jth type faults detected in version i in time (0, t].
$m^{I}_{ij}(t)$ : Expected mean number of jth type faults isolated in version i in time (0, t].
$m^{R}_{ij}(t)$ : Expected mean number of jth type faults removed in version i in time (0, t].
$\lambda_{ij}(t)$ : Initial failure intensity of type j error for version i, i=1,2,3,…,N.
$\lambda'_{i}(t)$ : Increased initial failure intensity for version i, i=1,2,3,…,N.
$m'_{i}(t)$ : Increased mean value function for version i, i=1,2,3,…,N.
$\alpha_i$ : Correlating parameter for ith version, $0 \le \alpha_i \le 1$.
$R_i$ : Reliability of version i, i=1,2,3,…,N.
9.2.2 Testing-Effort Function
The testing-effort function (TEF) can be evaluated by the number of test case
runs, human power, or the number of CPU hours. We consider an exponential TEF to
define the possible testing-effort patterns.
The total amount of testing-effort W(t) spent in the time interval (0,t] is
$$W(t) = M\{1 - \exp(-\gamma t)\}$$ …(9.2)
The current testing-effort expenditure rate at testing time t is given by
$$w(t) = \frac{d}{dt} W(t)$$ …(9.3)
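The two testing-effort quantities can be evaluated directly. The sketch below codes (9.2) and the rate obtained by differentiating it, w(t) = Mγ exp(−γt); the default values M=80 and γ=0.009 are the ones quoted in the numerical section, while the function names are chosen here for illustration.

```python
import math

def cumulative_effort(t, M=80.0, gamma=0.009):
    """W(t) = M{1 - exp(-gamma t)}: testing-effort consumed in (0, t], equation (9.2)."""
    return M * (1.0 - math.exp(-gamma * t))

def effort_rate(t, M=80.0, gamma=0.009):
    """w(t) = dW/dt = M gamma exp(-gamma t), obtained by differentiating (9.2)."""
    return M * gamma * math.exp(-gamma * t)

for t in (0, 50, 100, 300):
    print(t, round(cumulative_effort(t), 3), round(effort_rate(t), 4))
```

The expenditure rate is largest at t=0 and decays exponentially, so W(t) approaches the total effort M as testing continues.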
9.3 Mean Value Function
We propose a software reliability growth model in which the software faults
detected during the testing phase are isolated and removed, so that the reliability of
the software tends to grow. The software faults are assumed to be of different severity,
namely minor, major and
complex. The mean value function of the software reliability growth model with a time
dependent testing-effort function is established. The faults are removed in different
stages according to their severity.
The mean value functions are governed by the following differential equations:
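$$\frac{d m^{D}_{ij}(t)}{dt} \times \frac{1}{w(t)} = b_{ij}\left[n_{ij}(t) - m^{D}_{ij}(t)\right]$$ …(9.4)
where
$$\frac{d}{dt} n_{ij}(t) = \beta_{ij}\,\frac{d}{dt} m^{D}_{ij}(t)$$ …(9.5)
$$\frac{d m^{I}_{ij}(t)}{dt} \times \frac{1}{w(t)} = b_{ij}\left[m^{D}_{ij}(t) - m^{I}_{ij}(t)\right]$$ …(9.6)
$$\frac{d m^{R}_{ij}(t)}{dt} = b_{ij}\, w(t)\left[m^{I}_{ij}(t) - m^{R}_{ij}(t)\right]$$ …(9.7)
Solving the above equations, for i=1, 2, 3,…,N; j=1, 2, 3 we get
$$n_{ij}(t) = \frac{a_i\, p_{ij}}{1-\beta_{ij}}\left[1 - \beta_{ij}\exp\{-(1-\beta_{ij})\, b_{ij}\, W(t)\}\right]$$ …(9.8)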
and
( )
( ) ( )( ) ( ) ( ) ( )( )( )
β−−−β−β
−−−β−
= ijijij
ijij
ij
ijiRij 1exp1
tWb1tWbexp1
1pa
dttdm …(9.9)
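The closed form (9.8) can be checked numerically against the differential equations (9.4) and (9.5). The sketch below is a rough check only: it reads (9.4) as dm/dt = w(t) b [n − m], assumes the start values m(0)=0 and n(0)=a p (an assumption, since the initial conditions are not stated above), uses a plain Euler scheme, and takes illustrative parameter values for a single version and fault type.

```python
import math

# Illustrative values only; A, P, BETA, B stand for a_i, p_ij, beta_ij, b_ij.
A, P, BETA, B = 100.0, 0.2, 0.5, 0.05
M, GAMMA = 80.0, 0.009

def W(t):
    """Cumulative testing-effort, equation (9.2)."""
    return M * (1.0 - math.exp(-GAMMA * t))

def w(t):
    """Testing-effort rate, the derivative of W(t)."""
    return M * GAMMA * math.exp(-GAMMA * t)

def n_closed_form(t):
    """Equation (9.8) as read here: a p / (1 - beta) * [1 - beta exp(-(1 - beta) b W(t))]."""
    return A * P / (1.0 - BETA) * (1.0 - BETA * math.exp(-(1.0 - BETA) * B * W(t)))

def n_by_euler(t_end, dt=0.01):
    """Integrate (9.4)-(9.5) by Euler steps from the assumed start m(0) = 0, n(0) = a p."""
    m, n, t = 0.0, A * P, 0.0
    for _ in range(int(round(t_end / dt))):
        dm = w(t) * B * (n - m) * dt   # (9.4): dm/dt = w(t) b [n - m]
        m += dm
        n += BETA * dm                 # (9.5): dn/dt = beta dm/dt
        t += dt
    return n

print(round(n_closed_form(100.0), 4), round(n_by_euler(100.0), 4))
```

Under these assumptions the two values agree to within the discretisation error of the Euler scheme, and n(0) reduces to a p as expected.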
The failure intensity for i=1, 2, 3,…,N; j=1, 2, 3 is obtained as
$$\lambda_{ij}(t) = \frac{d m^{R}_{ij}(t)}{dt} \quad \text{and} \quad \lambda_{i}(t) = \sum_{j=1}^{3} \lambda_{ij}(t)$$
Therefore
( ) ( )( ) ( ) ( ) ( )( )( ) ( ) ( ) ( )( )
×−−−+−+−−
=λ ∑=
tWtwbtwβb
β1expβ1twbtWbexpβ1pa
)t( ijij
ijijijijij
3
1j ij
ijii
…(9.10)
9.4 Reliability Estimation
Now we evaluate the conditional reliability for the ith version and the reliability
measures of the N-version system by considering two cases: (i) the
increased failure intensity of each version of the system and (ii) the increased mean value
function of each version in the system. These cases are as follows:
Case (I):
The increased failure intensity for each version is determined by
$$\lambda'_{i} = \sum_{i=1}^{n} \alpha_i \lambda_i$$
With the help of equations (9.10) and (9.11), we evaluate the conditional reliability
for a given time X for the ith version as
$$R_i(X|T) = \exp\left[-\{m_i(T+X) - m_i(T)\}\right]$$ …(9.11)
The reliability expression for a two-version software system is determined by using
$$R_{sys} = R_1(X|T) + R_2(X|T) - R_1(X|T)\,R_2(X|T)$$ …(9.12)
For N versions, the reliability expression is given by
$$R_{sys} = \sum_{i=1}^{N} R_i(X|T) - \sum_{i<j}^{N} R_i(X|T)\,R_j(X|T) + \sum_{i<j<k}^{N} R_i(X|T)\,R_j(X|T)\,R_k(X|T) - \ldots + (-1)^{N-1}\prod_{i=1}^{N} R_i(X|T)$$
$$= \sum_{i=1}^{N} R_i(X|T) - \sum_{\substack{i<j<k<\ldots<l \\ i \neq j \neq k \neq \ldots \neq l}}^{N} (-1)^{l}\, R_i(X|T)\,R_j(X|T) \ldots R_l(X|T) + (-1)^{N-1}\prod_{i=1}^{N} R_i(X|T)$$ …(9.13)
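Equations (9.11) to (9.13) translate directly into code. The sketch below evaluates the conditional reliability of each version and then the alternating sum over all subsets of versions; the exponential mean value functions and their parameters are hypothetical stand-ins chosen only so that the example runs. As a sanity check, the alternating sum is algebraically equal to 1 − Π(1 − R_i), the probability that at least one version does not fail.

```python
import math
from itertools import combinations

def conditional_reliability(m, T, X):
    """R(X|T) = exp[-{m(T + X) - m(T)}], equation (9.11), for a mean value function m."""
    return math.exp(-(m(T + X) - m(T)))

def system_reliability(rels):
    """Alternating sum over all subsets of versions, as in (9.12) and (9.13)."""
    total = 0.0
    for size in range(1, len(rels) + 1):
        sign = (-1) ** (size - 1)
        for subset in combinations(rels, size):
            total += sign * math.prod(subset)
    return total

def make_mean(a, b):
    """Hypothetical mean value function m(t) = a(1 - exp(-b t)), for illustration only."""
    return lambda t: a * (1.0 - math.exp(-b * t))

versions = [make_mean(100, 0.05), make_mean(250, 0.04), make_mean(500, 0.03)]
rels = [conditional_reliability(m, T=150.0, X=1.0) for m in versions]
closed_form = 1.0 - math.prod(1.0 - r for r in rels)
print([round(r, 4) for r in rels], round(system_reliability(rels), 4), round(closed_form, 4))
```

The subset enumeration grows as 2^N, so for more than a handful of versions the product form is the cheaper way to obtain the same number.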
Case (II):
The increased mean value function for each version is given by
$$m'_{i} = \sum_{i=1}^{n} \alpha_i m_i$$
and
( ) ( ) ( )( ) ( ) ( ) ( )( )( )∑=
−−−−−−−−
=3
1jijij
ij
ijij
ij
ijiii β1expβ1
βtWb
1tWbexp1β1pa
t,λm
The conditional reliability of version i is given below:
$$R'_i(X|T) = \exp\left[-\{m'_i(T+X) - m'_i(T)\}\right]$$ …(9.14)
The conditional reliability of two-version software is given as follows:
$$R'_{sys} = R'_1(X|T) + R'_2(X|T) - R'_1(X|T)\,R'_2(X|T)$$ …(9.15)
The conditional reliability of N-version software is given by
$$R'_{sys} = \sum_{i=1}^{N} R'_i(X|T) - \sum_{i<j}^{N} R'_i(X|T)\,R'_j(X|T) + \sum_{i<j<k}^{N} R'_i(X|T)\,R'_j(X|T)\,R'_k(X|T) - \ldots + (-1)^{N-1}\prod_{i=1}^{N} R'_i(X|T)$$
$$= \sum_{i=1}^{N} R'_i(X|T) - \sum_{\substack{i<j<k<\ldots<l \\ i \neq j \neq k \neq \ldots \neq l}}^{N} (-1)^{l}\, R'_i(X|T)\,R'_j(X|T) \ldots R'_l(X|T) + (-1)^{N-1}\prod_{i=1}^{N} R'_i(X|T)$$ …(9.16)
9.5 Parameter Estimation
Parameter estimation is a basic requirement of software reliability
prediction. It can be done by using the well established maximum likelihood
estimation approach to evaluate the unknown parameters of the NHPP models.
The joint p.d.f. for the ith (i=1, 2, …, N) version in the system is given by
$$f_i(t_{i1}, t_{i2}, \ldots, t_{in}) = \exp\left[-m_i(t_{in})\right]\prod_{s=1}^{n} \lambda_i(t_{is})$$ …(9.17)
Let $(t_{i1}, t_{i2}, \ldots, t_{in})$ be the times between failures for the ith (i=1, 2, …, N) version. The joint
likelihood function is defined as follows:
$$L = \prod_{i=1}^{N} f_i = \exp\left[-\sum_{i=1}^{N} m'_i(t_{in})\right]\prod_{i=1}^{N}\prod_{s=1}^{n} \lambda_i(t_{is})$$ …(9.18)
Taking the logarithm of L, we get
$$\log L = -\sum_{i=1}^{N} m'_i(t_{in}) + \sum_{i=1}^{N}\sum_{s=1}^{n} \log \lambda_i(t_{is})$$ …(9.19)
Setting the partial derivatives of eq. (9.19) w. r. t. $a_i$, $b_{ij}$, $p_{ij}$, $\beta_{ij}$ and $\alpha_i$
$(1 \le i \le N)$ equal to zero gives the likelihood equations of the
above parameters.
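To show how a log-likelihood of the form (9.19) is maximised in practice, the sketch below treats the simplest possible case: one version, and a hypothetical two-parameter mean value function m(t) = a(1 − exp(−bt)) in place of the model of this chapter. For fixed b the likelihood equation in a has the closed-form solution a = n / (1 − exp(−b t_n)), so only b is searched, here by golden-section search under the assumption that the profile likelihood is unimodal. The observation times are made up and are treated as cumulative failure times.

```python
import math

def log_likelihood(times, a, b):
    """log L = -m(t_n) + sum_s log lambda(t_s), the single-version case of (9.19),
    with the assumed forms m(t) = a(1 - exp(-b t)) and lambda(t) = a b exp(-b t)."""
    t_n = times[-1]
    return -a * (1.0 - math.exp(-b * t_n)) + sum(math.log(a * b) - b * t for t in times)

def fit(times, b_lo=1e-4, b_hi=1.0, iters=200):
    """Maximise the profile likelihood in b by golden-section search."""
    n, t_n = len(times), times[-1]

    def a_of(b):
        # Solution of d(log L)/da = 0 for a fixed b.
        return n / (1.0 - math.exp(-b * t_n))

    def profile(b):
        return log_likelihood(times, a_of(b), b)

    g = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = b_lo, b_hi
    for _ in range(iters):
        x1, x2 = hi - g * (hi - lo), lo + g * (hi - lo)
        if profile(x1) < profile(x2):
            lo = x1
        else:
            hi = x2
    b = 0.5 * (lo + hi)
    return a_of(b), b

# Made-up cumulative failure times, for illustration only.
times = [2.0, 5.0, 9.0, 14.0, 20.0, 27.0, 35.0, 45.0, 58.0, 75.0]
a_hat, b_hat = fit(times)
print(round(a_hat, 3), round(b_hat, 5))
```

For the full model the same idea applies, but the likelihood equations in $a_i$, $b_{ij}$, $p_{ij}$, $\beta_{ij}$ and $\alpha_i$ have no closed form and would normally be solved with a general numerical optimiser.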
9.6 Expected Cost of the Software
It is of vital importance for a software manufacturer to control the software
development process in terms of cost, reliability and optimal testing time. The quality
of a software system usually depends on the testing time and testing effort. The
total software testing cost incurred during the software life-cycle is measured from the
time when testing starts. The cost of testing before and after release is quantified
in terms of various cost factors, including the setup cost and the cost of removing errors during
the testing and operational phases. The expected total cost function is given by:
$$C(T) = C_1 \times m(T) + C_2 \times \left[m(T_{LC}) - m(T)\right] + C_3 \times \int_{0}^{T} w(x)\,dx$$ …(9.20)
where
$T_{LC}$ = Software life-cycle length.
$C_1$ = Cost of fixing a fault during the testing phase.
$C_2$ = Cost of fixing a fault during the operational phase $(C_2 > C_1 > 0)$.
$C_3$ = Cost per unit of testing-effort consumption during the testing.
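The cost function (9.20) is easy to evaluate once m(T) is available. In the sketch below the integral of w(x) over (0, T] is replaced by W(T), which follows from (9.2) and (9.3) because W(0)=0. The mean value function used is an assumed stand-in (the single-stage solution of (9.4) and (9.5) with m(0)=0 for one version and one fault type), and b=0.05 is an illustrative value; the cost parameters are the defaults quoted in the numerical section.

```python
import math

C1, C2, C3, T_LC = 200.0, 500.0, 350.0, 300.0   # default cost parameters of section 9.7
M, GAMMA = 80.0, 0.009

def W(t):
    """Cumulative testing-effort (9.2); equals the integral of w(x) over (0, t]."""
    return M * (1.0 - math.exp(-GAMMA * t))

def mean_value(t, a=100.0, p=0.2, beta=0.5, b=0.05):
    """Assumed stand-in for m(t): a p / (1 - beta) * [1 - exp(-(1 - beta) b W(t))]."""
    return a * p / (1.0 - beta) * (1.0 - math.exp(-(1.0 - beta) * b * W(t)))

def expected_cost(T):
    """C(T) = C1 m(T) + C2 [m(T_LC) - m(T)] + C3 W(T), equation (9.20)."""
    return C1 * mean_value(T) + C2 * (mean_value(T_LC) - mean_value(T)) + C3 * W(T)

for T in (0, 25, 50, 100, 200, 300):
    print(T, round(expected_cost(T), 1))
```

Releasing at T=0 leaves every fault to be fixed at the higher operational cost C2, while testing up to T_LC pays only C1 per fault plus the full testing-effort cost, so the trade-off between the two determines the shape of C(T).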
9.7 Numerical Results
In this section, numerical results are obtained and compared with
neuro-fuzzy results obtained by building an Adaptive Network Based Fuzzy Inference System
(ANFIS) in MATLAB 7.4. The ANFIS is built by using the fuzzy toolbox of
the MATLAB package. We use Gaussian functions to describe the membership
functions of the input variable. For all approximations, the ANFIS is trained for 50 epochs
with 5 membership functions. The linguistic values of the input parameter are given in
table 9.1 and the corresponding membership functions are shown in fig. 9.4. We
consider T as the linguistic variable for the fuzzy system. For illustration purposes, we
consider software consisting of simple faults and 3 versions, i.e. j=1 and i=3.
For different values of a1 and b11, figs 9.1-9.3 depict the cumulative number of faults
detected m(T), the total expected cost C(T) and the reliability R(T) of the software by taking
the default parameters as C1=200, C2=500, C3=350, TLC=300, a1=100, a2=250,
a3=500, P11=.20, P21=.30, P31=.50, β11=.5, β21=.6, β31=.7, γ=.009, M=80, x=1.
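The Gaussian membership functions mentioned above can be sketched as follows. This only illustrates the shape of the functions, not the trained ANFIS: the evenly spaced centres, the common width, and the range 0 to 160 for T are assumptions made for the example.

```python
import math

def gaussian_mf(x, centre, sigma):
    """Gaussian membership grade exp(-(x - centre)^2 / (2 sigma^2))."""
    return math.exp(-((x - centre) ** 2) / (2.0 * sigma ** 2))

def make_partition(lo, hi, count=5):
    """Evenly spaced centres on [lo, hi]; sigma is chosen so that neighbouring
    curves cross at a membership grade of 0.5."""
    step = (hi - lo) / (count - 1)
    sigma = step / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    return [(lo + k * step, sigma) for k in range(count)]

# Five membership functions for the input T on an assumed range 0..160.
partition = make_partition(0.0, 160.0, count=5)
T = 50.0
grades = [round(gaussian_mf(T, c, s), 4) for c, s in partition]
print(grades)
```

During training, an ANFIS adjusts the centres and widths of such functions, so the initial partition only serves as a starting point.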
Table 9.1: Linguistic values of the membership functions for time t
Input Variables     No. of membership function     Linguistic Values
T                   5                              Low, Medium, High
From figs 9.1(i)-(ii), we notice that the expected number of detected faults
increases as time passes. As a and β increase, the mean value function shows an
increasing trend. The total expected cost is exhibited for different values of the parameters
a1 and b11 with respect to time in figs 9.2(i)-(ii). It is seen that the expected cost
initially decreases sharply but later increases gradually. We note that as a1
increases, the expected cost increases, but on increasing the value of b11, the expected
cost remains constant.
The reliability is depicted in figs 9.3(i)-(ii) by varying the parameters a1 and b11.
We observe that the reliability initially increases with time but finally becomes
almost constant. From fig. 9.3(i), it is clear that as the error content increases, the
reliability decreases. From fig. 9.3(ii), it is also seen that as the error detection rate
increases, the reliability decreases for some time and later approaches a constant
value.
9.8 Concluding Remarks
In this chapter, a software reliability growth model (SRGM) based on the
non-homogeneous Poisson process (NHPP) for N-version programming with
testing-effort has been developed. Our model takes care of imperfect debugging, which is a
more realistic assumption for the software development process. It is worth mentioning
that the proposed model will be helpful in calculating the number of various types of
faults and their effect on the reliability growth in actual applications. The
developed software reliability model is capable of providing the reliability of an
N-version system when the parameters can be estimated realistically and the
distribution of the testing-effort suits the concrete situation. Various interesting quantities
for software reliability measurement can be computed easily for the concerned
software system, as validated by numerical simulation.
[Figure panels: m(T) vs T and C(T) vs T, each showing analytical and ANFIS curves for Set 1 and Set 2.]
Fig. 9.1(i): Mean time vs time by varying a1 (a1=200, 300)
Fig. 9.1(ii): Mean time vs time by varying β11 (β11=0.5, 0.6)
Fig. 9.2(i): Expected cost vs time by varying a1 (a1=200, 300)
Fig. 9.2(ii): Expected cost vs time by varying a2 (a2=250, 350)
[Figure panels: Rsys vs T, each showing analytical and ANFIS curves for Set 1 and Set 2.]
Fig. 9.3(i): Reliability vs time by varying a1 (a1=60, 80)
Fig. 9.3(ii): Reliability vs time by varying b11 (b11=0.04, 0.06)
Fig. 9.4: Membership functions for input parameter T
Future Scope
With the advances in computer technology, modern society relies heavily on
gadgets which contain software. The investigation done in the present thesis is
mainly concerned with the performance prediction of hardware and software
reliability of fault tolerant systems. We have indicated the open problems and the scope
of the research work in the first chapter. However, there are some specific
observations in this regard which we would like to mention. Some important issues
related to our research investigations which can be addressed in future are as follows:
System reliability/availability is a key performance measure which is to be
assessed for a concerned system during the design and development phase. Based on
our investigation, a further interesting possibility is to study the optimal number of
spare parts for different configurations.
Software fault tolerance techniques are gaining popularity. The analysis of cluster
architectures with respect to fault tolerance and reliability measures presented in
chapter 2 will give insight to developers and system designers to improve the
reliability of the concerned system. However, an optimal configuration can be
suggested based upon pre-specified techno-economic constraints.
For the smooth and uninterrupted functioning of any system, standby units are
used. The optimal allocation of redundant components based on a cost criterion can
be further determined for the repairable system with warm standbys studied in
chapters 5 and 6.
The failure of any hardware system may also take place due to some common
cause. The concept of common cause has been incorporated in chapters 1, 4, 5(A),
7, 8(B). It can be further included in other models
also.
The models of embedded Markov chains are gaining importance these days. The
work presented in this thesis needs refinement to improve the
analysis procedure by integrating economic factors.
N-version programming is one of the most common techniques for systems
capable of fault tolerance. The method and analysis provided for the software
reliability growth model (SRGM) with N-version programming in chapter 9 can
be extended to a flexible environment.
References
1. Abdel-Hameed, M. (1995): Inspection, maintenance and replacement models. Computers and Operations Research, Vol. 22, No. 4, pp. 435-441.
2. Abujarad, F. and Kulkarni, S. S. (2011): Automated constraint-based addition of non masking and stabilizing fault tolerance. Theoretical Computer Science, Vol. 412, No. 33, pp. 4228-4246.
3. Abulnaja, O. A. (2005): Component-based recovery blocks technique. Artificial Intelligence & Machine Learning Journal, Vol. 5, No. 2, pp. 1-5.
4. Abu-Salih, M., Anakerh, N. and Ahmed, M. S. (1999): Confidence limits for steady state availability. Pakistan Journal of Statistics, Vol. 6, No. 2A, pp. 189-196.
5. Adke S. R. and Manjunath S. M. (1984): An Introduction to Finite Markov Processes. Wiley Eastern Limited, New York.
6. Akhtar, S. (1994): Reliability of K-out-of-N: G system with imperfect fault coverage. IEEE Transactions on Reliability, Vol. 43, pp. 101-106.
7. Alidrisi, M. M. (1992): The reliability of a dynamic warm standby redundant system of n components with imperfect switching. Microelectronics and Reliability, Vol. 32, No. 6, pp. 851-859.
8. Al-Saqabi, K., Saleh, K. and Ahmad, I. (1996): Recovery from concurrent failures in communication protocols. Journal of Systems and Software, Vol. 35, No. 1, pp. 55-65.
9. Amari, S. V. (2000): Transient analysis of reliability with and without repair for K-out-of-N: G systems with M failures modes. Reliability Engineering and System Safety, Vol. 67, No. 3, pp. 321-324.
10. Amari, S., Pham, H. and Dill, G. (2004): Optimal design of K-out-of-N: G subsystem subjected to imperfect fault coverage. IEEE Transactions on Reliability, Vol. 53, pp. 567-75.
11. Arulmozhi, G. (2002): Reliability of an M-out-of-N warm standby system with R repair facilities. OPSEARCH, Vol. 39, pp. 77-87.
12. Arya, L. D., Choube, S. C. and Arya, R. (2011): Differential evolution applied for reliability optimization of radial distribution systems. International Journal of Electrical Power and Energy Systems, Vol. 33, No. 2, pp. 271-277.
13. Ascher, H. and Feingold, H. (1984): Repairable System Reliability. Marcel Dekker Inc., New York.
14. Attardi, L., Guida, M. and Pulcini, G. (2005): A mixed-Weibull regression model for the analysis of automotive warranty data. Reliability Engineering and System Safety, Vol. 87, pp. 265-273.
15. Atwood, C. L. and Kelly, D. L. (2008): The binomial failure rate common-cause model with WinBUGS. Reliability Engineering and System Safety, Vol. 94, No. 5, pp. 990-999.
16. Aven, T. (1990): Availability formulae for standby systems of similar units that are preventively maintained. IEEE Transactions on Reliability, Vol. 39, pp. 603-6.
17. Avizienis, A. and Chen, L. (1977): On the implementation of N-Version programming for software fault tolerance during program execution. In Proceedings of COMPSAC 77, pp. 149-155.
18. Avritzer, A., Cole, R. G. and Weyuker, E. J. (2010): Methods and opportunities for rejuvenation in aging distributed software systems. Journal of Systems and Software, Vol. 83, No. 9, pp. 1568-1578.
19. Azaron, A., Katagiri, H., Sakawa, M. and Modarres, M. (2005): Reliability function of a class of time-dependent systems with standby redundancy. European Journal of Operational Research, Vol. 164, No. 2, pp. 378-86.
20. Balsamo, S., Di Marco, A., Inverardi, P. and Simeoni, M. (2004): Model-based performance prediction in software development: A survey. IEEE Transactions on Software Engineering, Vol. 30, No. 5, pp. 295-310.
21. Belli, F. and Jedrzejowicz, P. (1990): Fault-tolerant programs and their reliability. IEEE Transactions on Reliability, Vol. 16, No. 3, pp. 184-92.
22. Belzunce, F., Martinez-Puertas, H. and Ruiz, J. M. (2011): On optimal allocation of redundant components for series and parallel of two dependent components. Journal of Statistical Planning and Inference, Vol. 141, No. 9, pp. 3094-3104.
23. Berman, O. and Kumar, U. D. (1999): Optimization models for recovery block schemes. European Journal of Operational Research, Vol. 115, No. 2, pp. 368-379.
24. Beutner, E. (2010): Nonparametric model checking for k-out-of-n systems. Journal of Statistical Planning and Inference, Vol. 140, No. 3, pp. 626-639.
25. Bhaskar, T. and Kumar, U. D. (2006): A cost model for N-version programming with imperfect debugging. Journal of The Operational Research Society, Vol. 57, No. 8, pp. 986-994.
26. Bhuyan, P. and Sarmah, P. (2002): Reliability estimation of a repairable standby redundant system. Statistical Papers, Vol. 43, No. 3, pp. 323-336.
27. Bichon, B. J., McFarland, J. M. and Mahadevan, S. (2011): Efficient surrogate models for reliability analysis of systems with multiple failure modes. Reliability Engineering and System Safety, Vol. 96, No. 10, pp. 1386-1395.
28. Bieth, B., Hong, L. and Sarkar, J. (2010): A standby system with two repair persons under arbitrary life-and repair times. Mathematical and Computer Modelling, Vol. 51, pp. 756-767.
29. Biswas, A. and Sarkar, J. (2000): Availability of a system maintained through several imperfect repairs before a replacement or a perfect repair. Statistics & Probability Letters, Vol. 50, pp. 105-114.
30. Blischke, W. R. and Murthy, D. N. P. (1994): Warranty Cost Analysis. Marcel Dekker Inc., New York.
31. Blokus, A. (2006): Reliability analysis of large systems with dependent components. International Journal of Reliability, Quality and Safety Engineering, Vol. 13, No. 1, pp. 1-14.
32. Bobbio, A. (1990): Dependability analysis of fault-tolerant systems: A literature survey. Microprocessing and Microprogramming, Vol. 29, No. 1, pp. 13.
33. Bondavalli, A., Chiaradonna, S., Giandomenico, F. D. and Xu, J. (2002): An adaptive approach to achieving hardware and software fault tolerance in a distributed computing environment. Journal of Systems Architecture, Vol. 47, No. 9, pp. 763-781.
34. Bondavalli, A., Di Giandomenico, F. and Xu, J. (1993): A cost-effective and flexible scheme for software fault tolerance. Journal of Computer Systems Science and Engineering, Vol. 8, No. 4, pp. 234-244.
35. Brilliant, S. S., Knight, J. C. and Leveson, N. G. (1990): Analysis of faults in an N-version software experiment. IEEE Transactions on Software Engineering, Vol. 16, No. 2, pp. 238-247.
36. Buchholz, P., Kemper, P. and Kriege, J. (2010): Multi-class Markovian arrival processes and their parameter fitting. Performance Evaluation, Vol. 67, No. 11, pp. 1092-1106.
37. Bueno, V. D. C. and Carmo, I. M. D. (2007): Active redundancy allocation for a k-out-of-n: F system of dependent components. European Journal of Operational Research, Vol. 176, No. 2, pp. 1041-1051.
38. Cai, K. Y., Hu, D. B., Bai, C. G., Hu, H. and Jing, T. (2008): Does software reliability growth behavior follow a non-homogeneous Poisson process. Information and Software Technology, Vol. 50, No. 12, pp. 1232-1247.
39. Carpenter, G. F. (1990): Mechanism for evaluating the effectiveness of software fault-tolerant structures. Microprocessors and Microsystems, Vol. 14, No. 8, pp. 505-510.
40. Castillo, E., Minguez, R. and Castillo, C. (2008): Sensitivity analysis in optimization and reliability problems. Reliability Engineering and System Safety, Vol. 93, No. 12, pp. 1788-1800.
41. Catelani, M., Ciani, L., Scarano, V. L. and Bacioccola, A. (2011): Software automated testing: A solution to maximizing the test plan coverage and to increase software reliability and quality in use. Computer Standards and Interfaces, Vol. 33, No. 2, pp. 152-158.
42. Catuneanu, V. M., Moldovan, C., Popentiu, F. L. and Popovici, D. (1991): Software reliability release policy with testing effort. Microelectronics Reliability, Vol. 31, No. 5, pp. 895-899.
43. Cekyay, B. and Ozekici, S. (2010): Mean time to failure and availability of semi-Markov missions with maximal repair. European Journal of Operational Research, Vol. 207, pp. 1442-1454.
44. Cha, J. H., Mi, J. and Yun, W. Y. (2008): Modelling general standby system and evaluation of its performance. Applied Stochastic Models in Business and Industry, Vol. 24, pp. 159-169.
45. Chakravarthy, S. R. and Gomez-Corral, A. (2009): The influence of delivery times on repairable k-out-of-N systems with spares. Applied Mathematical Modelling, Vol. 33, No. 5, pp. 2368-2387.
46. Chandrasekhar, P., Natarajan, R. and Yadavalli, V. S. S. (2004): A study on a two unit standby system with Erlangian repair time. Asia-Pacific Journal of Operational Research, Vol. 21, No. 3, pp. 271-277.
47. Chang, W. K. and Jeng, S. L. (2005): Impartial evaluation in software reliability practice. Journal of Systems and Software, Vol. 76, No. 2, pp. 99-110.
48. Chang, Y., Amari, S. and Kuo, S. (2005): OBDD-based evaluation of reliability and importance measures for multistate systems subject to imperfect fault coverage. IEEE Transactions on Dependable and Secure Computing, Vol. 2, pp. 336-347.
49. Chari, A. A., Shatri, M. P. and Verma, S. M. (1991): Reliability analysis in the presence of chance common cause shock failures. Microelectronics Reliability, Vol. 31, pp. 15-19.
50. Chatterjee, S., Misra, R. B. and Alam, S. S. (2004): N-version programming with imperfect debugging. Computers & Electrical Engineering, Vol. 30, No. 6, pp. 453-463.
51. Chen, Y. M. (1992): Transient analysis of reliability and availability in k-to-l-out-of-n: G system. Reliability Engineering and System Safety, Vol. 35, No. 3, pp. 179-82.
52. Cheng, S. T., Chen, C. M. and Tripathi, S. K. (2000): A fault-tolerant model for multiprocessor real-time systems. Journal of Computer and System Sciences, Vol. 61, No. 3, pp. 457-477.
53. Chiaradonna, S., Bondavalli, A. and Strigini, L. (1994): On performability modeling and evaluation of software fault tolerance structures. In Proceeding EDCC-1, Berlin, Germany, pp. 97-114.
54. Chiquet, J., Eid, M. and Limnios, N. (2008): Modelling and estimating the reliability of stochastic dynamical systems with Markovian switching. Reliability Engineering & System Safety, Vol. 93, No. 12, pp. 1801-1808.
55. Choi, J. G. and Seong, P. H. (2006): Reliability assessment of embedded digital system using multi-state function. Reliability Engineering and System Safety, Vol. 91, No. 3, pp. 261-269.
56. Chow, D. K. (1971): Reliability of two items in sequence with sensing and switching. IEEE Transactions on Reliability, Vol. 20, pp. 254-256.
57. Chung, W. K. (1980): An availability calculation for K-out-of-N redundant system with common-cause failures and replacement. Microelectronics and Reliability, Vol. 20, pp. 517-519.
58. Chung, W. K. (1981): A k-out-of-N: G three-state unit redundant system with common-cause failure and replacements. Microelectronics and Reliability, Vol. 21, No. 4, pp. 589-591.
59. Chung, W. K. (1995): Reliability of imperfect switching of cold standby systems with multiple non-critical and critical errors. Microelectronics and Reliability, Vol. 35, No. 12, pp. 1479-1482.
60. Da Costa Bueno, V. (2005): Minimal standby redundancy allocation in a k-out-of-n: F system of dependent components. European Journal of Operational Research, Vol. 165, No. 3, pp. 786-793.
61. Da Costa Bueno, V. and Do Carmo, I. M. (2007): Active redundancy allocation for a k-out-of-n: F system of dependent components. European Journal of Operational Research, Vol. 176, No. 2, pp. 1041-1051.
62. Dabney, R. W., Etzkorn, L. and Cox, G. W. (2008): A fault tolerant approach to test control utilizing dual-redundant processors. Advances in Engineering Software, Vol. 39, No. 5, pp. 371-383.
63. De Smidt-Destombes, K. S., van der Heijden, M. C. and van Harten, A. (2004): On the availability of a k-out-of-N system given limited spares and repair capacity under a condition based maintenance strategy. Reliability Engineering and System Safety, Vol. 83, pp. 287-300.
64. De-Almeida, A. T. and Souza, C. F. M. (1993): Decision theory in maintenance strategy for a 2-unit redundant standby system. IEEE Transactions on Reliability, Vol. 42, No. 3, pp. 401-407.
65. Dhillon, B. S. (1978): A k-out-of-n three state devices system with common-cause failures. Microelectronics and Reliability, Vol. 18, pp. 447-448.
66. Dhillon, B. S. (1993): Reliability and availability analysis of a system with warm standby and common cause failures. Microelectronics and Reliability, Vol. 33, No. 9, pp. 1343-1349.
67. Dhillon, B. S. and Anudu, O. C. (1993): Common-cause failure analysis of a non-identical unit parallel system with arbitrarily distributed repair times. Microelectronics and Reliability, Vol. 33, pp. 87-103.
68. Dhillon, B. S. and Yang, N. (1992): Reliability and availability analysis of warm standby systems with common-cause failures and human errors. Microelectronics and Reliability, Vol. 32, pp. 561-575.
69. Do Van, P., Barros, A. and Berenguer, C. (2010): From differential to difference importance measure for Markov reliability models. European Journal of Operational Research, Vol. 204, No. 3, pp. 513-521.
70. Dominguez-Garcia, A. D., Kassakian, J. G., Schindall, J. E. and Zinchuk, J. J. (2008): An integrated methodology for the dynamic performance and reliability evaluation of fault tolerant systems. Reliability Engineering and System Safety, Vol. 93, No. 11, pp. 1628-1649.
71. Dugan, J. B. and Lyu, M. R. (1993): System reliability analysis of N-version programming application. Proceeding 4th IEEE International Symposium on Software Reliability Engineering, pp. 103-111.
72. Dugan, J. B. and Lyu, M. R. (1995): Dependability modeling for fault-tolerant software and systems. John Wiley & Sons Ltd., pp. 109-138.
73. Dutuit, Y. and Rauzy, A. (2001): New insights in the assessment of K-out-of-N and related systems. Reliability Engineering and System Safety, Vol. 72, No. 3, pp. 303-314.
74. Eckhardt, D. E. and Lee, L. D. (1988): Fundamental differences in the reliability of N-modular redundancy and N-version programming. Journal of Systems and Software, Vol. 8, No. 4, pp. 313-318.
75. El-Damcese, M. A. (1997): Human error and common-cause failure modeling of a two-unit multiple system. Theoretical and Applied Mechanics, Vol. 26, pp. 117-127.
76. El-Damcese, M. A. (2009): Analysis of warm standby system subject to common cause failures with time varying failure and repair rates. Applied Mathematical Sciences, Vol. 3, No. 18, pp. 853-860.
77. El-Gohary, A. (2004): Estimations of parameters in a three state reliability semi-Markov model. Applied Mathematics and Computation, Vol. 154, No. 2, pp. 389-403.
78. Elsayed, E. (1996): Reliability Engineering. Addison Wesley Longman Reading, Mass.
79. Eryilmaz, S. (2010): Mixture representations for the reliability of consecutive-k systems. Mathematical and Computer Modelling, Vol. 51, No. 5-6, pp. 405-412.
80. Eryilmaz, S. (2011): Dynamic behaviour of k-out-of-n: G systems. Operations Research Letters, Vol. 39, No. 2, pp. 155-159.
81. Fadiloglu, M. M. and Bulut, O. (2010): An embedded Markov chain approach to stock rationing. Operations Research Letters, Vol. 38, pp. 510-515.
82. Flammini, F., Marrone, S., Mazzocca, N. and Vittorini, V. (2009): A new modeling approach to the safety evaluation of N-modular redundant computer system in presence of imperfect maintenance. Reliability Engineering and System Safety, Vol. 94, pp. 1422-1432.
83. Fu, S. (2010): Failure-aware resource management for high availability computing clusters with distributed virtual machines. Journal of Parallel and Distributed Computing, Vol. 70, No. 4, pp. 384-393.
84. Galikowsky, C., Sivazlian, B. D. and Chaovalitwongse, P. (1996): Optimal redundancies for reliability and availability of series system. Microelectronics and Reliability, Vol. 36, pp. 1537-1546.
85. Gamiz, M. L. and Miranda, M. D. M. (2010): Regression analysis of the structure function for reliability evaluation of continuous-state system. Reliability Engineering and System Safety, Vol. 95, No. 2, pp. 134-142.
86. Giandomenico, F. D., Bondavalli, A. and Xu, J. (1995): Hardware and software fault tolerance: adaptive architectures in distributed computing environments. Technical Report B4-15, IEI-CNR.
87. Goel, L. R. and Gupta, R. (1984): Availability analysis of a two-unit cold standby system with two switching failure modes. Microelectronics and Reliability, Vol. 24, No. 3, pp. 419-423.
88. Goel, L. R. De and Shrivastava, P. (1991): Profit analysis of a two unit redundant system with provision for test and correlated failures and repairs. Microelectronics and Reliability, Vol. 31, pp. 827-833.
89. Goel, L. R. De and Shrivastava, P. (1992): A two unit standby system with imperfect switch, preventive maintenance and correlated failures and repairs. Microelectronics and Reliability, Vol. 32, No. 12, pp. 1687-1691.
90. Gokhale, S. S., Philip, T. and Marinos, P. N. (1996): A non-homogeneous Markov software reliability model with imperfect repair. In Proceedings of International Performance and Dependability Symposium (IPDS), Urbana-Champaign, IL, pp. 262-270.
91. Goseva-Popstojanova, K. and Trivedi, K. S. (2000): Failure correlation in software reliability model. IEEE Transactions on Reliability, Vol. 49, pp. 37-48.
92. Goseva-Popstojanova, K. and Trivedi, K. S. (2003): Architecture-based approaches to software reliability prediction. Computers and Mathematics with Applications, Vol. 46, No. 7, pp. 1023-1036.
93. Gou, L., Xu, H., Gao, C. and Zhu, G. (2011): Stability analysis of a new kind n-unit series repairable system. Applied Mathematical Modelling, Vol. 35, No. 1, pp. 202-217.
94. Grabski, F. (2011): Semi-Markov failure rates processes. Applied Mathematics and Computation, Vol. 217, No. 24, pp. 9956-9965.
95. Grosspietsch, K. E. (1989): Schemes of dynamic redundancy for fault tolerance in random access memories. Microelectronics and Reliability, Vol. 29, No. 6, pp. 1098.
96. Guo, and Hua, W. (2003): Analysis of repairable, warm standby, human & machine systems with two identical units. Mathematics in Practice and Theory, Vol. 33, No. 7, pp. 88-95.
97. Guo, H. and Yang, X. (2007): A simple reliability block diagram method for safety integrity verification. Reliability Engineering and System Safety, Vol. 92, pp. 1267-1273.
98. Gupta, P. P. and Sharma, R. K. (1986): Reliability analysis of two state repairable parallel redundant system under human failure. Microelectronics and Reliability, Vol. 26, No. 2, pp. 221-224.
99. Gupta, P. P. and Tyagi, L. (1986): M.T.T.F. and availability evaluation of a two-unit, two-state, standby redundant complex system with constant human failure. Microelectronics and Reliability, Vol. 26, No. 4, pp. 647-650.
100. Gupta, S. M., Jaiswal, N. K. and Goel, L. R. (1983): Switch failure in a two-unit standby redundant system. Microelectronics and Reliability, Vol. 23, No. 1, pp. 129-132.
101. Gurler, S. and Bairamov, I. (2009): Parallel and k-out-of-n: G systems with non identical components and their mean residual life functions. Applied Mathematical Modelling, Vol. 33, No. 2, pp. 1116-1125.
102. Habib, A. S., Yuge, T., Al-Seedy, R. O. and Ammar, S. I. (2010): Reliability of a consecutive (r, s)-out-of-(m, n): F lattice system with conditions on the number of failed components in the system. Applied Mathematical Modelling, Vol. 34, No. 3, pp. 531-538.
103. Hajeeh, M. A. (2011): Reliability and availability of a standby system with common cause failure. International Journal of Operational Research, Vol. 11, No. 3, pp. 343-363.
104. Hall, B. J. and Mosleh, A. (2008): An analytical framework for reliability growth of one-shot systems. Reliability Engineering and System Safety, Vol. 93, pp. 1751-1760.
105. Hamlet, D. (1995): Software quality, software process and software testing. Advances in Computers, Vol. 41, pp. 191-229.
106. Ho, S. L., Xie, M. and Goh, T. N. (2003): A study of the connectionist models for software reliability prediction. Computers and Mathematics with Applications, Vol. 46, No. 7, pp. 1037-1045.
107. Hoeflin, D. A. and Mendiratta, V. B. (1995): An elementary model for predicting switching system outage durations. Proceedings of the XV International Switching Symposium, Berlin.
108. Hong, J. S., Koo, H. Y. and Lie, C. H. (2002): Joint reliability importance of k-out-of-n systems. European Journal of Operational Research, Vol. 142, pp. 539-547.
109. Hoyland, A. and Rausand, M. (1994): System Reliability Theory: Models and Statistical Methods. John Wiley and Sons.
110. Hsieh, C. C. (2003): Optimal task allocation and hardware redundancy policies in distributed computing systems. European Journal of Operational Research, Vol. 147, No. 2, pp. 430-447.
111. Hsieh, C. C., and Hsieh, Y. C. (2003): Reliability and cost optimization in distributed computer systems. Computers & Operations Research, Vol. 30, No. 8, pp. 1103-1119.
112. Hsieh, Y. C. and Wang, K. H. (1995): Reliability of a repairable system with spares and a removable repairman. Microelectronics and Reliability, Vol. 35, No. 2, pp. 197-208.
113. Hsu, C. J., Huang, C. Y. and Chang, J. R. (2011a): Enhancing software reliability modeling and prediction through the introduction of time-variable fault reduction factor. Applied Mathematical Modelling, Vol. 35, No. 1, pp. 506-521.
114. Hsu, Y. L., Ke, J. C. and Lee, S. L. (2008): On a redundant repairable system with switching failure: Bayesian approach. Journal of Statistical Computation and Simulation, Vol. 78, No. 12, pp. 1163-1180.
115. Hsu, Y. L., Ke, J. C. and Liu, T. H. (2011b): Standby system with general repair, reboot delay, switching failure and unreliable repair facility: A statistical standpoint. Mathematics and Computers in Simulation, Vol. 81, No. 11, pp. 2400-2413.
116. Hsu, Y. L., Lee, S. L. and Ke, J. C. (2009): A repairable system with imperfect coverage and reboot: Bayesian and asymptotic estimation. Mathematics and Computers in Simulation, Vol. 79, pp. 2227-2239.
117. Hu, W. W. (2006): Asymptotic stability of a parallel repairable system with warm standby under common cause failure. Journal of Mathematical Analysis and Applications, Vol. 8, No. 1, pp. 5-20.
118. Huang, C. Y. (2005): Performance analysis of software reliability growth models with testing-effort and change-point. Journal of Systems and Software, Vol. 76, No. 2, pp. 181-194.
119. Huang, C. Y. and Chang, Y. R. (2007): An improved decomposition scheme for assessing the reliability of embedded systems by using dynamic fault trees. Reliability Engineering and System Safety, Vol. 92, pp. 1403-1412.
120. Huang, C. Y. and Huang, T. Y. (2010): Software reliability analysis and assessment using queueing models with multiple change-points. Computers and Mathematics with Applications, Vol. 60, No. 7, pp. 2015-2030.
121. Huang, C. Y. and Kintala, C. (1995): Software Fault Tolerance in the Application Layer. In M. R. Lyu (Ed.), Software Fault Tolerance, John Wiley.
122. Huang, C. Y. and Kuo, S. (2002): Analysis of incorporating logistic testing effort function into software reliability modeling. IEEE Transactions on Reliability, Vol. 51, No. 3, pp. 261-270.
123. Huang, C. Y. and Lyu, M. (2005): Optimal release time for software systems considering cost, testing efforts and test efficiency. IEEE Transactions on Reliability, Vol. 54, No. 4, pp. 583-591.
124. Huang, C. Y., Kuo, S. and Luo, M. (2007a): An assessment of testing-effort development software reliability growth models. IEEE Transactions on Reliability, Vol. 56, No. 2, pp. 198-211.
125. Huang, H. I., Lin, C. H. and Ke, J. C. (2006): Parametric nonlinear programming approach for a repairable system with switching failure and fuzzy parameter. Applied Mathematics and Computation, Vol. 183, pp. 508-517.
126. Huang, H. Z., Liu, Z. J. and Murthy, D. N. P. (2007b): Optimal reliability, warranty and price for new products. IIE Transactions, Vol. 39, pp. 819-827.
127. Huang, J., Zuo, M. and Wu, Y. (2000): Generalized multi-state K-out-of-N: G systems. IEEE Transactions on Reliability, Vol. 49, pp. 105-11.
128. Huang, Y., Kintala, C., Kolettis, N. and Fulton, N. D. (1995): Software rejuvenation, analysis, module and application. In Proceedings of 25th Symposium on Fault Tolerant Computing, pp. 381-390.
129. Hughes, R. P. (1987): A new approach to common cause failure. Reliability Engineering and System Safety, Vol. 17, pp. 211-236.
130. Hughes-Fenchel, G. (1997): A flexible clustered approach to high availability. Proceeding of the Twenty-Seventh Annual International Symposium on Fault Tolerant Computing, Seattle, WA.
131. Hunter, J. J. (1996): Mathematical techniques for warranty analysis. In W. R. Blischke & D. N. P. Murthy (Eds.), Product Warranty Handbook, New York: Marcel Dekker, pp. 157-190.
132. Hwang, F. K. (1986): Simplified reliabilities for consecutive k-out-of-n systems. Society for Industrial and Applied Mathematics Alg. Disc. Math., Vol. 7, pp. 258-264.
133. Inoue, S. and Yamada, S. (2004): Stochastic differential equation modeling for testing-effort dependent software reliability assessment. In Proceeding of ISSAT, pp. 256-260.
134. Iwamoto, K., Dohi, T. and Kaio, N. (2008): Estimating periodic software rejuvenation schedules under discrete-time operation circumstance. IEICE Transactions, Vol. 91-D, No. 1, pp. 23-31.
135. Jain, M. (1998): Reliability analysis of two-unit system with common cause failure. Indian Journal of Pure and Applied Mathematics, Vol. 29, No. 12, pp. 1-8.
136. Jain, M. and Baghel, K. P. S. (2001): A multi-components spare and state dependent rates. The Nepali Mathematical Science Report, Vol. 19, No. 2, pp. 81-92.
137. Jain, M., Baghel, K. P. S., and Jadown, M. (2004): Performance prediction of machine interference model with spare and two mode of failure. Operations Research, Information Technology and Industry, Eds M. Jain and G. C. Sharma, S.R.S Pub., Agra, pp. 197-208.
138. Jain, M., Rakhee and Singh, M. (2004): Bilevel control of degraded machining system with warm standbys, setup and vacation. Applied Mathematical Modelling, Vol. 28, No. 12, pp. 1015-1026.
139. Jain, M., Sharma, G. C. and Singh, N. (2007): Transient analysis of M/M/R machining system with mixed standbys, switching failures, balking, reneging and additional removable repairmen. IJE Transactions B: Basics, Vol. 20, No. 2, pp. 169-182.
140. Jenab, K. and Dhillon, B. S. (2006): Assessment of reversible multi-state k-out-of-n: G/F load-sharing systems with flow-graph models. Reliability Engineering and System Safety, Vol. 91, pp. 765-771.
141. Jankala, K. E. and Vaurio, J. K. (1993): Residual common cause failure analysis in a probabilistic safety assessment. In Proceeding P.S.A., Vol. 2, pp. 804-810.
142. Jeske, D. R. and Zhang, X. (2005): Some successful approaches to software reliability modeling in industry. Journal of Systems and Software, Vol. 74, No. 1, pp. 85-99.
143. Jha, P. C., Gupta, D., Yang, B. and Kapur, P. K. (2009): Optimal testing resource allocation during module testing considering cost, testing effort and reliability. Computers & Industrial Engineering, Vol. 57, No. 3, pp. 1122-1130.
144. Jie, M. (1991): Interval estimation of availability of a series system. IEEE Transactions on Reliability, Vol. R-40, No. 5, pp. 541-546.
145. Jin, T., Liao, H. and Kilari, M. (2010): Reliability growth modeling for in-service electronic systems considering latent failure modes. Microelectronics Reliability, Vol. 50, No. 3, pp. 324-331.
146. Kallen, M. J. (2011): Modelling imperfect maintenance and the reliability of complex system using superposed renewal process. Reliability Engineering and System Safety, Vol. 96, No. 6, pp. 636-641.
147. Kancev, D. and Cepin, M. (2011): Evaluation of risk and cost using an age-dependent unavailability modeling of test and maintenance for standby components. Journal of Loss Prevention in the Process Industries, Vol. 24, No. 2, pp. 146-155.
148. Kanoun, K., Kaaniche, M., Beounes, C., Laprie, J. C. and Arlat, J. (1993): Reliability growth of fault tolerant software. IEEE Transactions on Reliability, Vol. 42, No. 2, pp. 205-18.
149. Kant, K. (1987): Software fault tolerance in real-time systems. Information Sciences, Vol. 42, No. 3, pp. 255-282.
150. Kapur, P. K. and Bardhan, A. (2002): Testing effort control through software reliability growth modeling. International Journal of Modelling Simulation, Vol. 22, No. 1, pp. 90-96.
151. Kapur, P. K. and Garg, R. B. (1990): Compound availability measures for a two-unit standby system. Microelectronics and Reliability, Vol. 30, No. 3, pp. 425-429.
152. Kapur, P. K., Goswami, D. and Gupta, A. (2004): A software reliability growth model with testing effort dependent learning function for distributed systems. International Journal of Reliability, Quality and Safety Engineering, Vol. 11, No. 4, pp. 365-377.
153. Kapur, P. K., Sharma, K. O., and Garg, R. B. (1992): Transient solutions of software reliability model with imperfect debugging and error-detection/generation. Microelectronics and Reliability, Vol. 32, No. 1, pp. 475-478.
154. Kapur, P. K., Shatnawi, O., Agarwal, A. G. and Kumar, R. (2009): Unified framework for developing testing effort dependent software reliability growth models. WSEAS Transactions on Systems, Vol. 8, No. 4, pp. 521-531.
155. Ke, J. B., Chen, J. W. and Wang, K. H. (2011): Reliability measures of a repairable system with standby switching failures and reboot delay. Quality Technology & Quantitative Management, Vol. 8, No. 1, pp. 15-26.
156. Ke, J. B., Lee, W. C. and Wang, K. H. (2007): Reliability and sensitivity analysis of a system with multiple unreliable service stations and standby switching failures. Physica A: Statistical Mechanics and its Applications, Vol. 380, No. 1, pp. 455-469.
157. Ke, J. C. and Lee, S. L. (2007): Asymptotic confidence limits for a repairable system with standbys subject to switching failures. American Journal of Applied Science, Vol. 4, No. 11, pp. 834-849.
158. Ke, J. C., Lee, S. L. and Hsu, Y. L. (2008): On a repairable system with detection, imperfect coverage and reboot: Bayesian approach. Simulation Modelling Practice and Theory, Vol. 16, No. 3, pp. 353-367.
159. Ke, J. C., Su, Z. L., Wang, K. H. and Hsu, Y. L. (2010): Simulation inferences for an availability system with general repair distribution and imperfect fault coverage. Simulation Modelling Practice and Theory, Vol. 18, No. 3, pp. 338-347.
160. Kemeny, J. G. and Snell, J. L. (1976): Finite Markov Chains. Springer-Verlag, New York, NY.
161. Khan, F. G., Qureshi, K. and Nazir, B. (2010): Performance evaluation of fault tolerance techniques in grid computing system. Computers & Electrical Engineering, Vol. 36, No. 6, pp. 1110-1122.
162. Kharoufeh, J. P., Finkelstein, D. and Mixon, D. (2006): Availability of periodically inspected systems with Markovian wear and shocks. Journal of Applied Probability, Vol. 43, pp. 303-317.
163. Kim, K. H. and Welch, H. O. (1989): Distributed execution of recovery blocks: an approach for uniform treatment of hardware and software faults in real-time applications. IEEE Transactions on Computers, Vol. C-38, No. 5, pp. 626-636.
164. Kim, S. K. and Dshalalow, J. H. (2002): Stochastic disaster recovery systems with external resources. Mathematical and Computer Modelling, Vol. 36, No. 11-13, pp. 1235-1257.
165. Kiureghian, A. D., Ditlevsen, O. D. and Song, J. (2007): Availability, reliability and downtime of systems with repairable components. Reliability Engineering and System Safety, Vol. 92, pp. 231-242.
166. Knight, J. C. and Leveson, N. G. (1986): An experimental evaluation of the assumption of independence in multi-version programming. IEEE Transactions on Software Engineering, Vol. 12, pp. 96-106.
167. Kornecki, A. J. and Zalewski, J. (2010): Hardware certification for real-time safety critical systems: State of the art. Annual Reviews in Control, Vol. 34, pp. 163-174.
168. Kumar, A. and Agarwal, M. (1980): A review of standby systems. IEEE Transactions on Reliability, Vol. R-29, pp. 290-294.
169. Kumar, A., Agarwal, M. L. and Garg, S. C. (1986): Reliability analysis of a two-unit redundant system with critical human error. Microelectronics and Reliability, Vol. 26, pp. 867-871.
170. Kuniewski, S. P., Weide, J. A. M. V. D. and Noortwijk, J. M. V. (2009): Sampling inspection for the evaluation of time dependent reliability of deteriorating systems under imperfect defect detection. Reliability Engineering and System Safety, Vol. 94, No. 9, pp. 1480-1490.
171. Kuo, L. (2005): Software Reliability. Handbook of Statistics, Vol. 25, pp. 929-963.
172. Kuo, S., Huang, C. and Lyu, M. (2001): Framework for modeling software reliability using various testing-efforts and fault detection rates. IEEE Transactions on Reliability, Vol. 50, No. 3, pp. 310-320.
173. Kvam, P. H. and Miller, J. G. (2002): Common cause failure prediction using data mapping. Reliability Engineering and System Safety, Vol. 76, No. 3, pp. 273-278.
174. Labib, S. W. (1991): Stochastic analysis of a two-unit warm standby system with two switching devices. Microelectronics and Reliability, Vol. 31, No. 6, pp. 1163-1173.
175. Lai, C. D., Xie, M., Poh, K. L., Dai, Y. S. and Yang, P. (2002): A model for availability analysis of distributed software/hardware systems. Information and Software Technology, Vol. 44, pp. 343-350.
176. Lala, J. H. and Alger, L. S. (1988): Hardware and software fault tolerance: a unified architectural approach. In Proceeding IEEE 18th International Symposium on Fault Tolerant Computing, pp. 240-245.
177. Laplante, P. A. (1993): Fault-tolerant control of real time systems in the presence of single event upsets. Control Engineering Practice, Vol. 1, No. 5, pp. 763-769.
178. Laprie, J. C. (1987): Hardware and software-fault tolerance: definition and analysis of architectural solutions. Digest of Papers, FTCS-17: The Seventeenth International Symposium on Fault-Tolerant Computing, pp. 116-121.
179. Laprie, J. C. (1990): Definition and analysis of hardware- and software-fault-tolerance architectures. IEEE Computer, pp. 39-51.
180. Laprie, J. C. (1995): Architectural Issues in Software Fault Tolerance. Michael R. Lyu, editor, Wiley, pp. 47-80.
181. Laprie, J. C., Arlat, J., Beounes, C. and Kanoun, K. (1990): Definition and analysis of hardware-and-software fault-tolerant architectures. IEEE Computer, Vol. 23, No. 7, pp. 39-51.
182. Laval, J., Denier, S., Ducasse, S. and Falleri, J. R. (2011): Supporting simultaneous versions for software evaluation assessment. Science of Computer Programming, Vol. 76, No. 12, pp. 1177-1193.
183. Leach, R. J. (2008): Setting checkpoints in legacy code to improve fault-tolerance. Journal of Systems and Software, Vol. 81, No. 6, pp. 920-928.
184. Lee, E. A. (2002): Embedded software. Advances in Computers, Vol. 56, pp. 55-95.
185. Leu, S. W., Fernandez, E. B. and Khoshgoftaar, T. (1991): Fault-tolerant software reliability modeling using Petri nets. Microelectronics and Reliability, Vol. 31, No. 4, pp. 645-667.
186. Levitin, G. (2001): Incorporating common-cause failures into non repairable multistate series-parallel system analysis. IEEE Transactions on Reliability, Vol. 50, pp. 380-388.
187. Levitin, G. (2004): Reliability and performance analysis for fault tolerant programs consisting of versions with different characteristics. Reliability Engineering and System Safety, Vol. 86, No. 1, pp. 75-81.
188. Levitin, G. (2006): Reliability and performance analysis of hardware–software systems with fault-tolerant software components. Reliability Engineering and System Safety, Vol. 91, pp. 570-579.
189. Levitin, G. (2007): Block diagram method for analyzing multi-state systems with uncovered failures. Reliability Engineering and System Safety, Vol. 92, No. 6, pp. 727-734.
190. Levitin, G. and Amari, S. V. (2008): Multi-state systems with multi-fault coverage. Reliability Engineering and System Safety, Vol. 93, pp. 1730-1739.
191. Levitin, G. and Amari, S. V. (2010): Approximation algorithm for evaluating time-to-failure distribution of k-out-of-n system with shared standby elements. Reliability Engineering and System Safety, Vol. 95, pp. 396-401.
192. Levitin, G. and Xing, L. (2010): Reliability and performance of multi-state systems with propagated failures having selective effect. Reliability Engineering and System Safety, Vol. 95, pp. 655-661.
193. Lewis, E. E. (1994): Introduction to Reliability Engineering. Second edition. Illinois, USA: John.
194. Lewis, E. E. (2001): A load-capacity interference model for common mode failure in 1-out-of-2G system. IEEE Transactions on Reliability, Vol. 50, pp. 47-51.
195. Li, C. Y., Chen, X., Yi, X. S. and Tao, J. Y. (2010): Heterogeneous redundancy optimization for multi-state series–parallel systems subject to common cause failure. Reliability Engineering & System Safety, Vol. 95, No. 3, pp. 202-207.
196. Li, H. F., Wei, Z. and Goswami, D. (2006): Quasi-atomic recovery for distributed agents. Parallel Computing, Vol. 32, No. 10, pp. 733-758.
197. Li, X. and Hu, X. (2008): Some new stochastic comparisons for redundancy allocations in series and parallel systems. Statistics & Probability Letters, Vol. 78, No. 18, pp. 3388-3394.
198. Li, X., Xie, M. and Ng, S. H. (2011): Sensitivity analysis of release time of software reliability models incorporating testing effort with multiple change-points. Applied Mathematical Modelling, Vol. 34, No. 11, pp. 3560-3570.
199. Li, X., Yan, R. and Zuo, M. J. (2009a): Evaluating a warm standby system with components having proportional hazard rates. Operations Research Letters, Vol. 37, pp. 56-60.
200. Li, Z., Liao, H. and Coit, D. W. (2009b): A two-stage approach for multi-objective decision making with applications to system reliability optimization. Reliability Engineering and System Safety, Vol. 94, No. 10, pp. 1585-1592.
201. Lim, S. H., Lee, B. H. and Kim, J. H. (2008): Diversity and fault avoidance for dependable replication systems. Information Processing Letters, Vol. 108, No. 1, pp. 33-37.
202. Linberg, K. R. (1999): Software developer perceptions about software project failure: A case study. Journal of Systems and Software, Vol. 49, No. 2-3, pp. 177-192.
203. Lisnianski, A., Levitin, G. and Ben-Haim, H. (2000): Structure optimization of multi-state system with time redundancy. Reliability Engineering and System Safety, Vol. 67, No. 2, pp. 103-112.
204. Littlewood, B. (1975): A reliability model of Markov structured software. Proceedings of the International Conference on Reliable Software, pp. 204-207.
205. Littlewood, B., Popov, P. and Strigini, L. (2002): Assessing the reliability of diverse fault-tolerant software based systems. Safety Science, Vol. 40, No. 9, pp. 781-796.
206. Lo, J. H., Huang, C. Y., Chen, I. Y., Kuo, S. Y. and Lyu, M. R. (2005): The reliability assessment and sensitivity analysis of software reliability growth modeling based on software module structure. The Journal of Systems and Software, Vol. 76, pp. 3-13.
207. Lu, L. and Lewis, G. (2006): Reliability evaluation of standby safety systems due to independent and common cause failures. Proceedings of the 2006 IEEE International Conference on Automation Science and Engineering, Shanghai, China, pp. 264-269.
208. Lu, L. and Lewis, G. (2006): Reliability evaluation of standby safety systems due to independent and common cause failures. IEEE Conference on Automation Science and Engineering, pp. 274-279.
209. Lu, L. and Lewis, G. (2008): Configuration determination for K-out-of-N partially redundant systems. Reliability Engineering and System Safety, Vol. 93, No. 11, pp. 1594-1604.
210. Lv, X., Wan, C. and Bi, G. (2010): Block orthogonal greedy algorithm for stable recovery of block-sparse signal representations. Signal Processing, Vol. 90, No. 12, pp. 3265-3277.
211. Lyu, M. R. (1995): Software Fault Tolerance, John Wiley & Sons.
212. Lyu, M. R. (1996): Handbook of Software Reliability Engineering. IEEE Computer Society Press, McGraw-Hill.
213. Maheshwari, S., Sharma, P. and Jain, M. (2010): Machine repair problem with K-type warm spares, multiple vacations for repairmen and reneging. International Journal of Engineering and Technology, Vol. 2, No. 4, pp. 252-258.
214. Mahmoud, M. A. W. and Moshref, M. E. (2010): On a two unit cold standby system considering hardware, human error failures and preventive maintenance. Mathematical and Computer Modelling, Vol. 51, No. 5-6, pp. 736-745.
215. Mahmoud, M., Mokhles, M. A. and Saleh, E. H. (1987): Availability analysis of a repairable system with common cause failure and one standby unit. Microelectronics and Reliability, Vol. 27, pp. 741-754.
216. Malaiya, Y. K. and Srimani, P. K. (1990): Software reliability models: theoretical developments, evaluation and applications. Los Alamitos, IEEE Computer Society Press.
217. McAllister, D. F. and Scott, R. K. (1991): Cost modeling of fault-tolerant software. Information and Software Technology, Vol. 33, No. 8, pp. 594-603.
218. Meedeniya, I., Buhnova, B., Aleti, A. and Grunske, L. (2011): Reliability driven deployment optimization for embedded systems. Journal of Systems and Software, Vol. 84, No. 5, pp. 835-846.
219. Mendiratta, V. B. (1996): Assessing the reliability impacts of software fault tolerance mechanisms. Proceedings of 1996 International Symposium of Software Reliability Engineering, White Plains, NY.
220. Moghaddass, R. and Zuo, M. J. (2011): Optimal design of a repairable k-out-of-n system considering maintenance. Reliability and Maintainability Symposium (RAMS), Lake Buena Vista, pp. 1-6.
221. Moghaddass, R., Zuo, M. J. and Wang, W. (2010): Availability of a general k-out-of-n: G system with non-identical components considering shut-off rules using quasi-birth-death process. Reliability Engineering and System Safety, Vol. 96, pp. 489-496.
222. Mokaddis, G. S., Labib, S. W. and Ahmed, A. M. (1997): Analysis of a two unit warm standby system subject to degradation. Microelectronics and Reliability, Vol. 37, No. 4, pp. 641-647.
223. Mokaddis, G. S., Labib, S. W. and El-Said, K. M. (1994): Two models for two dissimilar-unit standby redundant system with three types of repair facilities and perfect or imperfect switch. Microelectronics and Reliability, Vol. 34, No. 7, pp. 1239-1247.
224. Montoro-Cazorla, D. and Perez-Ocon, R. (2006): Reliability of a system under two types of failure using a Markovian arrival process. Operations Research Letters, Vol. 34, No. 5, pp. 525-530.
225. Mosleh, A. (1991): Common cause failure: An analysis methodology and examples. Reliability Engineering, Vol. 34, pp. 249-292.
226. Moustafa, M. S. (1997): Reliability analysis of K-out-of-M: G systems with dependent failures and imperfect coverage. Reliability Engineering and System Safety, Vol. 58, pp. 15-17.
227. Moustafa, M. S. (1998): Transient analysis of reliability with and without repair for k-out-of-n: G systems with M failure modes. Reliability Engineering and System Safety, Vol. 59, pp. 317-320.
228. Moustafa, M. S. (2001a): Availability of K-out-of-N: G systems with exponential failures and general repairs. Economic Quality Control, Vol. 16, No. 1, pp. 75-82.
229. Munch, J. and Heidrich, J. (2004): Software project control centers: Concepts and approaches. Journal of Systems and Software, Vol. 70, No. 1-2, pp. 3-19.
230. Murthy, D. N. P., Solem, O. and Roren, T. (2004): Product warranty logistics: Issues and challenges. European Journal of Operational Research, Vol. 156, No. 1, pp. 110-126.
231. Musa, J. D., Iannino, A. and Okumoto, K. (1987): Software Reliability: Measurement, Prediction, and Application. New York, McGraw-Hill.
232. Myers, A. (2007): k-out-of-n: G system reliability with imperfect fault coverage. IEEE Transactions on Reliability, Vol. 56, pp. 464-473.
233. Nahas, N., Nourelfath, M. and Ait-Kadi, D. (2007): Coupling ant colony and the degraded ceiling algorithm for the redundancy allocation problem of series-parallel systems. Reliability Engineering and System Safety, Vol. 92, No. 2, pp. 211-222.
234. Nakagawa, T. and Yasui, K. (2003): Note on reliability of a system complexity. Mathematical and Computer Modelling, Vol. 38, No. 11-13, pp. 1365-1371.
235. Nicola, V. F. and Goyal, A. (1990): Modeling of correlated failures and community error recovery in multi version software. IEEE Transactions on Software Engineering, Vol. 16, No. 3, pp. 350-359.
236. van Noortwijk, J. M. and van der Weide, J. A. M. (2008): Applications to continuous-time processes of computational techniques for discrete-time renewal processes. Reliability Engineering and System Safety, Vol. 93, pp. 1853-1860.
237. Oltean, M. and Diosan, L. (2009): An autonomous GP-based system for regression and classification problems. Applied Soft Computing, Vol. 9, No. 1, pp. 49-60.
238. Osaki, S. and Nakagawa, T. (1976): Bibliography for reliability and availability of stochastic system. IEEE Transactions on Reliability, Vol. R-25, pp. 284-287.
239. Ou, Y. and Bechta-Dugan, J. (2003): Approximate sensitivity analysis for acyclic Markov reliability models. IEEE Transactions on Reliability, Vol. 52, No. 2, pp. 220-231.
240. Pan, J. N. (1997): Reliability prediction of imperfect switching systems subject to multiple stresses. Microelectronics and Reliability, Vol. 37, No. 3, pp. 439-445.
241. Park, K. and Kim, S. (2002): Availability analysis and improvement of active/standby cluster systems using software rejuvenation. Journal of Systems and Software, Vol. 61, No. 2, pp. 121-128.
242. Park, M. and Pham, H. (2008): Warranty system-cost analysis using quasi-renewal process. OPSEARCH, Vol. 45, No. 3, pp. 263-274.
243. Park, M. and Pham, H. (2010): Altered quasi-renewal concepts for modeling renewable warranty costs with imperfect repairs. Mathematical and Computer Modelling, Vol. 52, No. 9-10, pp. 1435-1450.
244. Pham, H. (1994): On the optimal design of N-version software systems subject to constraints. Journal of Systems and Software, Vol. 27, No. 1, pp. 55-61.
245. Pham, H. (2003a): A software cost model with imperfect debugging, random life cycle and penalty cost. International Journal of System Science, Vol. 27, pp. 455-463.
246. Pham, H. (2003b): Software reliability and cost models: Perspectives, comparison, and practice. European Journal of Operational Research, Vol. 49, No. 3, pp. 475-489.
247. Pham, H. (2007): An imperfect-debugging fault-detection dependent-parameter software. International Journal of Automation and Computing, Vol. 4, No. 4, pp. 325-328.
248. Pham, H. and Wang, H. (2001): A quasi-renewal process for software reliability and testing costs. IEEE Transactions on Systems, Man and Cybernetics-Part A: Systems and Humans, Vol. 31, pp. 623-631.
249. Pham, H. and Zhang, X. (2003): NHPP software reliability and cost models with testing coverage. European Journal of Operational Research, Vol. 145, No. 2, pp. 443-454.
250. Pham, H., Nordmann, L. and Zhang, X. (1999): A general imperfect software-debugging model with S-shaped fault detection rate. IEEE Transactions on Reliability, Vol. 48, No. 2, pp. 168-175.
251. Pham, H., Suprasad, A. and Misra, R. B. (1997): Availability and mean life time prediction of multistage degraded system with partial repairs. Reliability Engineering and System Safety, Vol. 56, No. 2, pp. 169-173.
252. Propp, J. G. and Wilson, D. B. (1996): Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Structures and Algorithms, Vol. 9, pp. 223-252.
253. Prowell, S. J. and Poore, J. H. (2004): Computing system reliability using Markov chain usage models. The Journal of Systems and Software, Vol. 73, pp. 219-225.
254. Rackwitz, R. (2001): Reliability analysis: A review and some perspectives. Structural Safety, Vol. 23, No. 4, pp. 365-395.
255. Rafe, V. and Mahdian, F. (2011): Style based modeling and verification of fault tolerance service oriented architecture. Procedia Computer Science, Vol. 3, pp. 972-976.
256. Rahman, A. and Chattopadhyay, G. N. (2006): Review of long term warranty policies. Asia Pacific Journal of Operational Research, Vol. 22, No. 4, pp. 453-473.
257. Rai, B. and Singh, N. (2005): A modeling framework for assessing the impact of new time/mileage warranty limits on the number and cost of automotive warranty claims. Reliability Engineering and System Safety, Vol. 88, No. 2, pp. 157-169.
258. Raj Kiran, N. and Ravi, V. (2008): Software reliability prediction by soft computing techniques. Journal of Systems and Software, Vol. 81, No. 4, pp. 576-583.
259. Rajamanickam, S. P. and Chandrasekar, B. (1997): Reliability measure for two unit systems with a dependent structure for failure and repair times. Microelectronics and Reliability, Vol. 37, No. 5, pp. 829-833.
260. Ramirez-Marquez, J. E. and Coit, D. W. (2006): Optimization of system reliability in the presence of common cause failure. Reliability Engineering and System Safety, Vol. 92, No. 10, pp. 1421-1434.
261. Randell, B. (1975): System structure for software fault tolerance. IEEE Transactions on Software Engineering, Vol. SE-1, No. 2, pp. 220-232.
262. Randell, B. and Xu, J. (1995): The evolution of the recovery block concept, in Software Fault Tolerance, Wiley, pp. 1-21.
263. Randles, M., Lamb, D., Odat, E. and Taleb-Bendiab, A. (2011): Distributed redundancy and robustness in complex systems. Journal of Computer and System Sciences, Vol. 77, No. 2, pp. 293-304.
264. Rao, B. M. (2011): A decision support model for warranty servicing of repairable items. Computers and Operations Research, Vol. 38, No. 1, pp. 112-130.
265. Rebba, R. and Mahadevan, S. (2008): Computational methods for model reliability assessment. Reliability Engineering and System Safety, Vol. 93, No. 8, pp. 1197-1207.
266. Rehage, D., Carl, U. B. and Vahl, A. (2005): Redundancy management of fault tolerant aircraft system architectures: reliability synthesis and analysis of degraded system states. Aerospace Science and Technology, Vol. 9, No. 4, pp. 337-347.
267. Reussner, R. H., Schmidt, H. W. and Poernomo, I. H. (2003): Reliability prediction for component-based software architectures. Journal of Systems and Software, Vol. 66, No. 3, pp. 241-252.
268. Rinsaka, K. and Dohi, T. (2007): A faster estimation algorithm for periodic preventive rejuvenation schedule maximizing system availability. ISAS, pp. 94-109.
269. Rossi, G. P. and Simone, C. (1984): A multitasking operating system with explicit treatment of recovery points. Microprocessing and Microprogramming, Vol. 14, No. 2, pp. 55-66.
270. Ruiz-Castro, J. E. and Li, Q. L. (2011): Algorithm for a general discrete k-out-of-n: G systems subject to several types of failure with an indefinite number of repairpersons. European Journal of Operational Research, Vol. 211, No. 1, pp. 97-111.
271. Rushdi, A. M. and Alsulami, A. E. (2007): Cost elasticities of reliability and MTTF for k-out-of-n systems. Journal of Mathematics and Statistics, Vol. 3, No. 3, pp. 122-128.
272. Sadek, A. and Limnios, N. (2005): Nonparametric estimation of reliability and survival function for continuous-time finite Markov processes. Journal of Statistical Planning and Inference, Vol. 133, No. 1, pp. 1-21.
273. Saha, G. K. (2006a): A software tool for fault tolerance. Journal of Information Science & Engineering, Vol. 22, No. 4.
274. Saha, G. K. (2006b): A single-version scheme of fault tolerant computing. Journal of Computer Science & Technology, Vol. 6, No. 1, pp. 22-27.
275. Sahner, R. A., Trivedi, K. S. and Puliafito, A. (1996): Performance and Reliability Analysis of Computer Systems: An Example-Based Approach using the SHARPE Software Package. Kluwer Academic Publishers, Boston, MA.
276. Salameh, M. K. and Jaber, M. Y. (2000): Economic production quantity model for items with imperfect quality. International Journal of Production Economics, Vol. 64, pp. 59-64.
277. Salem, A. M. and El-Damcese, M. A. (2004): Reliability and systems subject to common cause hazards. Nuclear Engineering and Design, Vol. 227, No. 3, pp. 349-354.
278. Salfner, F. and Walter, K. (2010): Analysis of service availability for time-triggered rejuvenation policies. Journal of Systems and Software, Vol. 83, No. 9, pp. 1579-1590.
279. Samatlı-Pac, G. and Taner, M. R. (2009): The role of repair strategy in warranty cost minimization: An investigation via quasi-renewal process. European Journal of Operational Research, Vol. 197, No. 2, pp. 632-641.
280. Santos, R. M., Santos, J. and Orozco, J. D. (2009): Power saving and fault-tolerance in real-time critical embedded systems. Journal of Systems Architecture, Vol. 55, No. 2, pp. 90-101.
281. Sarhan, A. M. (2002): Reliability equivalence with basic series/parallel system. Applied Mathematics and Computation, Vol. 132, No. 1, pp. 115-133.
282. Sarkar, J. and Chaudhuri, G. (1999): Availability of a system with gamma life and exponential repair under a perfect repair policy. Statistics & Probability Letters, Vol. 43, pp. 189-196.
283. Sarkar, J. and Sarkar, S. (2000): Availability of a periodically inspected system under perfect repair. Journal of Statistical Planning and Inference, Vol. 91, pp. 77-90.
284. Sarkar, J. and Sarkar, S. (2001): Availability of a periodically inspected system supported by a spare unit, under perfect repair or perfect upgrade. Statistics & Probability Letters, Vol. 53, pp. 207-217.
285. Savage, G. J. and Son, Y. K. (2011): The set theory method for system reliability of structures with degrading components. Reliability Engineering and System Safety, Vol. 96, No. 1, pp. 108-116.
286. Seo, J. H., Jang, J. S. and Bai, D. S. (2003): Lifetime and reliability estimation of repairable redundant system subject to periodic alternation. Reliability Engineering and System Safety, Vol. 80, pp. 197-204.
287. Shaked, M. and Zhu, H. (1992): Some results on block replacement policies and renewal theory. Journal of Applied Probability, Vol. 29, pp. 932-946.
288. Shanthikumar, J. G. (1982): Recursive algorithm to evaluate the reliability of a consecutive K-out-of-N: F system. IEEE Transactions on Reliability, Vol. R-31, pp. 442-443.
289. She, J. and Pecht, M. G. (1992): Reliability of a k-out-of-n warm-standby system. IEEE Transactions on Reliability, Vol. 41, No. 1, pp. 72-75.
290. Shen, Z., Hu, X. and Fan, W. (2008): Exponential asymptotic property of a parallel repairable system with warm standby under common-cause failure. Journal of Mathematical Analysis and Applications, Vol. 341, No. 1, pp. 457-466.
291. Shet, A. G., Elwasif, W. R., Foley, S. S., Park, B. H., Bernholdt, D. E. and Bramley (2011): Strategies for fault tolerance in multi component applications. Procedia Computer Science, Vol. 4, pp. 2287-2296.
292. Sheu, S. H. (1991): A generalized block replacement policy with minimal repair and general random repair costs for a multi-unit system. Journal of the Operational Research Society, Vol. 42, pp. 331-341.
293. Shi, X., Pazat, J. L., Rodriguez, E., Jin, H. and Jiang, H. (2010): Adapting grid applications to safety using fault-tolerant methods: Design, implementation and evaluation. Future Generation Computer Systems, Vol. 26, No. 2, pp. 236-244.
294. Shooman, M. L. (1983): Software Engineering: Design, Reliability, and Management, New York, McGraw-Hill.
295. Shooman, M. L. (1990): Probabilistic Reliability: An Engineering Approach, 2nd ed. Krieger, Melbourne.
296. Siewiorek, D. P. and Swarz, R. S. (1992): Reliable Computer Systems: Design and Evaluation, Digital Press, USA.
297. Simeu-Abazi, Z., Lefebvre, A. and Derain, J. P. (2011): A methodology of alarm filtering using dynamic fault tree. Reliability Engineering and System Safety, Vol. 96, No. 2, pp. 257-266.
298. Singh, J. (1989): A warm standby redundant system with common cause failure. Reliability Engineering and System Safety, Vol. 26, pp. 135-141.
299. Singh, J. and Goel, P. (1995): Availability analysis of a standby complex system having imperfect switch-over device. Microelectronics and Reliability, Vol. 35, No. 2, pp. 285-288.
300. Smidt-Destombes, K. S. D., Elst, N. P. V., Barros, A. I., Mulder, H. and Hontelez, J. A. M. (2011): Spare parts model with cold-standby redundancy on system level. Journal of Computers and Operations Research, Vol. 38, No. 7, pp. 985-991.
301. Smidt-Destombes, K. S., Heijden, M. C. and Harten, A. (2004): On the availability of a K-out-of-N system given limited spares and repair capacity under a condition based maintenance strategy. Reliability Engineering and System Safety, Vol. 83, No. 3, pp. 287-300.
302. Smidts, C. and Sova, D. (1999): An architectural model for software reliability quantification: Sources of data. Reliability Engineering and System Safety, Vol. 64, No. 2, pp. 279-290.
303. Sofokleous, A. A. and Andreou, S. A. (2008): Automatic, evolutionary test data generation for dynamic software testing. Journal of Systems and Software, Vol. 81, No. 11, pp. 1883-1898.
304. Somani, A. K. and Vaidya, N. H. (1997): Understanding fault tolerance and reliability. IEEE Computer, IEEE Computer Society Press, Los Alamitos, CA, USA, Vol. 30, No. 4, pp. 45-50.
305. Soro, I. W., Nourelfath, M. and Ait-Kadi, D. (2010): Performance evaluation of multi-state degraded systems with minimal repairs and imperfect preventive maintenance. Reliability Engineering and System Safety, Vol. 95, No. 2, pp. 65-69.
306. Sridharan, V. and Jayashree, P. R. (1998): Transient solutions of a software model with imperfect debugging and generation of errors by two servers. Mathematical and Computer Modelling, Vol. 27, No. 3, pp. 103-108.
307. Sridharan, V. and Mohanavadivu, P. (1997): Reliability and availability analysis for two non-identical unit parallel systems with common cause failures and human errors. Microelectronics and Reliability, Vol. 37, No. 5, pp. 747-752.
308. Chakravarthy, S. R. and Gomez-Corral, A. (2009): The influence of delivery times on repairable k-out-of-N systems with spares. Applied Mathematical Modelling, Vol. 33, No. 5, pp. 2368-2387.
309. Subramanian, R. and Anantharaman, V. (1995): Reliability analysis of a complex standby redundant system. Reliability Engineering and System Safety, Vol. 48, pp. 57-70.
310. Subramanian, R. and Venkatakrishan, K. S. (1975): Reliability of 2-unit standby redundant system with repair, maintenance and standby failure. IEEE Transactions on Reliability, Vol. R-24, pp. 139-142.
311. Tai, A. T., Avizienis, A. and Meyer, J. F. (1993): Performability enhancement of fault-tolerant software. IEEE Transactions on Reliability, Sp. Issue on Fault Tolerant Software, Vol. R-42, No. 2, pp. 227-237.
312. Tang, L. C. and Lee, L. H. (2005): A simple recovery strategy for economic lot scheduling problem: A two-product case. International Journal of Production Economics, Vol. 98, pp. 97-107.
313. Teng, X. and Pham, H. (2002): A software reliability growth model for N-version programming systems. IEEE Transactions on Reliability, Vol. 51, No. 3, pp. 311-321.
314. Tian, Z., Levitin, G. and Zuo, M. J. (2009): A joint reliability-redundancy optimization approach for multi-state series-parallel systems. Reliability Engineering and System Safety, Vol. 94, No. 10, pp. 1568-1576.
315. Tokuno, K. and Yamada, S. (1995): Markovian software availability modeling for performance evaluation. In Stochastic Modelling in Innovative Manufacturing: Proceedings, Cambridge, U.K., July 21-22, 1995 (Edited by A. H. Christer, S. Osaki and L. C. Thomas), pp. 246-256, Springer-Verlag, Berlin (1997).
316. Trivedi, A. K. and Shooman, M. L. (1975): A many state Markov model for the estimation and prediction of computer software performance parameters. Proceedings of the International Conference on Reliable Software, pp. 208-220.
317. Valdes, J. E. and Zequeira, R. I. (2003): On the optimal allocation of an active redundancy in a two-component series system. Statistics & Probability Letters, Vol. 63, No. 3, pp. 325-332.
318. Valdes, J. E. and Zequeira, R. I. (2006): On the optimal allocation of two active redundancies in a two-component series system. Operations Research Letters, Vol. 34, No. 1, pp. 49-52.
319. Valdes, J. E., Arango, G., Zequeira, R. I. and Brito, G. (2010): Some stochastic comparisons in series systems with active redundancy. Statistics & Probability Letters, Vol. 80, No. 11-12, pp. 945-949.
320. Van, P. D., Barros, A. and Berenguer, C. (2008): Reliability importance analysis of Markovian systems at steady state using perturbation analysis. Reliability Engineering and System Safety, Vol. 93, No. 11, pp. 1605-1649.
321. Vanderperre, E. J. (1990): Reliability analysis of a warm standby system with general distributions. Microelectronics & Reliability, Vol. 30, No. 3, pp. 487-490.
322. Vaurio, J. K. (1998): An implicit method for incorporating common cause failure in system analysis. IEEE Transactions on Reliability, Vol. 47, No. 2, pp. 173-180.
323. Vaurio, J. K. (1999): Availability and cost functions for periodically inspected preventively maintained units. Reliability Engineering and System Safety, Vol. 63, pp. 133-140.
324. Vaurio, J. K. (2003): Common cause failure probabilities in standby safety system fault tree analysis with testing-scheme and timing dependencies. Reliability Engineering and System Safety, Vol. 79, pp. 43-57.
325. Vaurio, J. K. (2005): Uncertainties and quantification of common cause failure rates and probabilities for system analyses. Reliability Engineering and System Safety, Vol. 90, pp. 186-195.
326. Velardi, P. and Ciciani, B. (1983): Recovery blocks for communicating systems. Microprocessing and Microprogramming, Vol. 11, No. 5, pp. 287-294.
327. Venkateswaran, N., Siva, M. S. and Goel, P. S. (2002): Analytical redundancy based fault detection of gyroscopes in spacecraft applications. Acta Astronautica, Vol. 50, No. 9, pp. 535-545.
328. Verma, S. M. and Chari, A. A. (1991): Availability and frequency of failure of a system in the presence of chance common-cause shock failures. Microelectronics and Reliability, Vol. 31, No. 2/3, pp. 265-269.
329. Vieira, M. and Madeira, H. (2004): Joint evaluation of recovery and performance of a COTS DBMS in the presence of operator faults. Performance Evaluation, Vol. 56, pp. 187-212.
330. Vinod, G. V., Santosh, T. V., Saraf, R. K. and Ghosh, A. K. (2008): Integrating safety critical software system in probabilistic safety assessment. Nuclear Engineering and Design, Vol. 238, No. 9, pp. 2392-2399.
331. Wang, C. H. and Sheu, S. H. (2001): The effect of the warranty cost on the imperfect EMQ model with general discrete shift distribution. Production Planning and Control, Vol. 12, No. 6, pp. 621-628.
332. Wang, H. and Pham, H. (1996): A quasi-renewal process and its applications in imperfect maintenance. International Journal of Systems Science, Vol. 27, No. 10, pp. 1055-1062.
333. Wang, K. H. and Chen, Y. J. (2009): Comparative analysis of availability between three systems with general repair times, reboot delay and switching failures. Applied Mathematics and Computation, Vol. 215, No. 1, pp. 384-394.
334. Wang, K. H. and Kuo, C. C. (2000): Cost and probabilistic analysis of series systems and mixed standby components. Applied Mathematical Modelling, Vol. 24, pp. 957-967.
335. Wang, K. H. and Liang, L. W. (2006b): Cost benefit analysis of availability systems with warm standby units and imperfect coverage. Applied Mathematics and Computation, Vol. 172, pp. 1239-1256.
336. Wang, K. H. and Sivazlian, B. D. (1997): Life cycle cost analysis for availability system with parallel components. Computers and Industrial Engineering, Vol. 33, pp. 129-132.
337. Wang, K. H., Lai, Y. J. and Ke, J. B. (2004): Reliability and sensitivity analysis of a system with warm standbys and a repairable service station. International Journal of Operations Research, Vol. 1, No. 1, pp. 61-70.
338. Wang, K. H., Liou, Y. C. and Pearn, W. L. (2005): Cost benefit analysis of series systems with warm standby components and general repair times. Mathematical Methods of Operations Research, Vol. 61, pp. 329-343.
339. Wang, K., Dong, W. and Ke, J. (2006a): Comparison of reliability and the availability between four systems with warm standby components and standby switching failures. Applied Mathematics and Computation, Vol. 183, pp. 1310-1322.
340. Wang, L. and Cui, L. (2011): Aggregated semi-Markov repairable systems with history-dependent up and down states. Mathematical and Computer Modelling, Vol. 53, pp. 883-895.
341. Wang, L., Hu, H., Wang, Y., Wu, W. and He, P. (2011a): The availability model and parameters estimation method for the delay time model with imperfect maintenance at inspection. Applied Mathematical Modelling, Vol. 35, No. 6, pp. 2855-2863.
342. Wang, Z., Lam, J., Ma, L., Bo, Y. and Guo, Z. (2011b): Variance-constrained dissipative observer-based control for a class of nonlinear stochastic systems with degraded measurements. Journal of Mathematical Analysis and Applications, Vol. 377, No. 2, pp. 645-658.
343. Wattanapongsakorn, N. and Coit, D. W. (2007): Fault-tolerant embedded system design and optimization considering reliability estimation uncertainty. Reliability Engineering and System Safety, Vol. 92, No. 4, pp. 395-407.
344. Wen, P. and Li, Y. (2009): Minimum packet drop sequences based networked control system model with embedded Markov chain. Simulation Modelling Practice and Theory, Vol. 17, pp. 1635-1641.
345. Whittaker, J. A. and Poore, J. H. (1993): Markov analysis of software specification. ACM Transactions on Software Engineering and Methodology, Vol. 2, pp. 93-106.
346. Whittaker, J. A. and Thomason, M. G. (1994): A Markov chain model for statistical software testing. IEEE Transactions on Software Engineering, Vol. 20, No. 10, pp. 812-824.
347. Whittaker, J. A., Rekeb, K. and Thomason, M. G. (2000): A Markov chain model for predicting the reliability of multi-build software. Information and Software Technology, Vol. 42, pp. 889-894.
348. Wie, X., Yiguang, H. and Trivedi, K. S. (2005): Analysis of a two-level software rejuvenation policy. Reliability Engineering and System Safety, Vol. 87, No. 1, pp. 13-22.
349. Wu, C. C., Chou, C. Y. and Huang, C. (2009): Optimal price, warranty length and production rate for free replacement policy in the static demand market. Omega, Vol. 37, No. 1, pp. 29-39.
350. Wu, J., Fernandez, E. B. and Zhang, M. (1996): Design and modeling of hybrid fault-tolerant software with cost constraints. Journal of Systems and Software, Vol. 35, No. 2, pp. 141-149.
351. Wu, J., Wang, Y. and Fernandez, E. B. (1994): A uniform approach to software and hardware fault tolerance. Journal of Systems and Software, Vol. 26, pp. 117-127.
352. Xang, B. and Xie, M. (2000): A study of operational and testing reliability in software analysis. Reliability Engineering and System Safety, Vol. 70, No. 3, pp. 323-329.
353. Xie, W., Hong, Y. and Trivedi, K. S. (2005): Analysis of a two-level software rejuvenation policy. Reliability Engineering and System Safety, Vol. 87, No. 1, pp. 13-22.
354. Xing, L. (2007): Reliability evaluation of phased-mission systems with imperfect fault coverage and common-cause failures. IEEE Transactions on Reliability, Vol. 56, pp. 58-68.
355. Xing, L. and Dugan, J. B. (2002): Generalized imperfect coverage phased-mission analysis. In: Proceedings of Annual Reliability and Maintainability Symposium, Virginia Univ. Charlottesville, VA, pp. 112-119.
356. Xing, L., Meshkat, L. and Donohue, S. K. (2007): Reliability analysis of hierarchical computer based systems subject to common cause failures. Reliability Engineering and System Safety, Vol. 92, No. 3, pp. 351-359.
357. Xing, L., Shrestha, A. and Dai, Y. (2011): Exact combinatorial reliability analysis of dynamic systems with sequence-dependent failures. Reliability Engineering and System Safety, Vol. 96, No. 10, pp. 1375-1385.
358. Xu, H., Guo, W., Yu, J. and Zhu, G. (2005): Asymptotic stability of a repairable system with imperfect switching mechanism. International Journal of Mathematics and Mathematical Sciences, Vol. 4, pp. 631-643.
359. Yadavalli, V. S. S., Botha, M. and Bekker, A. (2002): Asymptotic confidence limits for the steady state availability of a two unit parallel system with preparation time for the repair facility. Asia-Pacific Journal of Operational Research, Vol. 19, pp. 249-256.
360. Yadavalli, V. S. S., Bekker, A. and Pauw, J. (2005): Bayesian study of a two-component system with common cause shock failures. Asia-Pacific Journal of Operational Research, Vol. 22, No. 1, pp. 105-119.
361. Yamachi, H., Tsujimura, Y., Kambayashi, Y. and Yamamoto, H. (2006): Multi-objective genetic algorithm for solving N-version program design problem. Reliability Engineering and System Safety, Vol. 91, No. 9, pp. 1083-1094.
362. Yamada, S. and Ohtera, H. (1990): Software reliability growth models for testing-effort control. European Journal of Operational Research, Vol. 46, pp. 343-349.
363. Yamada, S., Hishitani, J. and Osaki, S. (1993): Software reliability growth model with Weibull testing effort: A model and application. IEEE Transactions on Reliability, Vol. 42, pp. 100-105.
364. Yamashiro, M. (1982): A repairable multistate system with several degraded states and common-cause failures. Microelectronics and Reliability, Vol. 22, No. 3, pp. 615-618.
365. Yanez, M., Joglar, F. and Modarres, M. (2002): Generalized renewal process for analysis of repairable systems with limited failure experience. Reliability Engineering and System Safety, Vol. 77, pp. 167-180.
366. Yang, B. and Xie, M. (2000): A study of operational and testing reliability in software reliability analysis. Reliability Engineering and System Safety, Vol. 70, pp. 323-329.
367. Yang, B., Hu, H. and Guo, S. (2009): Cost-oriented task allocation and hardware redundancy policies in heterogeneous distributed computing systems considering software reliability. Computers & Industrial Engineering, Vol. 56, No. 4, pp. 1687-1696.
368. Yang, B., Li, X., Xie, M. and Tan, F. (2010): A generic data-driven software reliability model with model mining technique. Reliability Engineering & System Safety, Vol. 95, No. 6, pp. 671-678.
369. Yang, L. and Meng, X. Y. (2011): Reliability analysis of a warm standby repairable system with priority in use. Applied Mathematical Modelling, Vol. 35, No. 9, pp. 4295-4303.
370. Yearout, R. D., Reddy, P. and Grosh, D. L. (1986): Standby redundancy in reliability-A review. IEEE Transactions on Reliability, Vol. R-35, pp. 285-292.
371. Yinghui, T. and Jing, Z. (2008): New model for load-sharing k-out-of-n: G systems with different components. Journal of Systems Engineering and Electronics, Vol. 19, No. 4, pp. 748-751.
372. Yinong, C. and Chen, T. (1992): Implementing fault-tolerance via modular redundancy with comparison. Microelectronics Reliability, Vol. 32, No. 1-2, pp. 287-288.
373. Yu, H., Chu, C., Chatelet, E. and Yalaoui, F. (2007): Reliability optimization of a redundant system with failure dependencies. Reliability Engineering and System Safety, Vol. 92, No. 12, pp. 1627-1634.
374. Yuan, L. and Xu, J. (2011): An optimal replacement policy for a repairable system based on its repairman having vacations. Reliability Engineering and System Safety, Vol. 96, No. 7, pp. 868-875.
375. Yun, W. Y., Murthy, D. N. P. and Jack, N. (2008): Warranty servicing with imperfect repair. International Journal of Production Economics, Vol. 111, pp. 159-169.
376. Yun, W. Y. and Cha, J. H. (2010): Optimal design of a general warm standby system. Reliability Engineering & System Safety, Vol. 95, No. 8, pp. 880-886.
377. Zalewski, J., Ehrenberger, W., Saglietti, F., Gorski, J. and Kornecki, A. (2003): Safety of computer control systems: challenges and results in software development. Annual Reviews in Control, Vol. 27, No. 1, pp. 23-37.
378. Zhang, M. and Qin, W. (2008): Parametric analysis of an improved fault tolerant system. Electronic Notes in Theoretical Computer Science, Vol. 207, No. 10, pp. 121-136.
379. Zhang, T. and Horigome, M. (2001): Availability and reliability of system with dependent components and time-varying failure and repair rates. IEEE Transactions on Reliability, Vol. 50, pp. 151-158.
380. Zhang, T., Xie, M. and Horigome, M. (2006): Availability and reliability of k-out-of-(M+N): G warm standby systems. Reliability Engineering and System Safety, Vol. 91, No. 4, pp. 381-387.
381. Zhang, Y. L. and Wang, G. J. (2007): A deteriorating cold standby repairable system with priority in use. European Journal of Operational Research, Vol. 183, No. 1, pp. 278-295.
382. Zhang, Y. L. and Wu, S. (2009): Reliability analysis for a k/n (F) system with repairable repair-equipment. Applied Mathematical Modelling, Vol. 33, No. 7, pp. 3052-3067.
383. Zheng, F., Zhu, G. and Gao, C. (2011): Well-posedness and stability of the repairable system with N failure modes and one standby unit. Journal of Mathematical Analysis and Applications, Vol. 375, No. 1, pp. 174-184.
384. Zhou, Z., Li, Y. and Tang, K. (2009): Dynamic pricing and warranty policies for products with fixed lifetime. European Journal of Operational Research, Vol. 196, No. 3, pp. 940-948.
385. Zhu, Y., Elsayed, E. A., Liao, H. and Chan, L. Y. (2010): Availability optimization of systems subject to competing risk. European Journal of Operational Research, Vol. 202, No. 3, pp. 781-788.
386. Zhuang, W. J. and Xie, M. (1994): Design and analysis of some fault-tolerance configurations based on a multipath principle. Journal of Systems and Software, Vol. 25, No. 1, pp. 101-108.
387. Zio, E. (2009): Reliability engineering: Old problems and new challenges. Reliability Engineering and System Safety, Vol. 94, No. 2, pp. 125-141.