
Optimal inspection intervals for safety systems with partial inspections

R. Pascual, D. Louit
Centro de Minería
Pontificia Universidad Católica de Chile, Santiago, Chile

A.K.S. Jardine
Department of Mechanical and Industrial Engineering
University of Toronto, Toronto, Ontario, Canada

July 27, 2010

Abstract

The introduction of International Standard IEC 61508 and its industry-specific derivatives sets demanding requirements for the definition and implementation of lifecycle strategies for safety systems. Compliance with the Standard is important from human safety and environmental perspectives, as well as for potential adverse economic effects (e.g. damage to critical downstream equipment, or clauses in insurance or warranty contracts). This situation encourages the use of reliability models to attain the recommended safety integrity levels using credible assumptions.

During the operation phase of the safety system life-cycle, a key decision is the definition of an inspection program, namely its frequency and the maintenance activities to be performed. These may vary from minimal checks to complete renewals. This work presents a model (which we call the ρβ model) to find optimal inspection intervals for a safety system, considering that it degrades in time even when it is inspected at regular intervals. This situation occurs because most inspections are partial, that is, not all potential failure modes are observable through inspections. Possible reasons for this are the nature and the extent of the inspection, or potential risks generated by the inspection itself. The optimization criterion considered here is the mean overall availability $A_o$, while also taking into account the requirements for the safety availability $A_s$. We consider several conditions that ensure coherent modeling for these systems: sub-system decomposition, k-out-of-n architectures, diagnostic coverage (ratio of observable to total failure modes), dependent and independent failures, and non-negligible inspection times. The model requires estimates of the coverage and dependent-failure ratios for each component, global failure rates, and inspection times. We illustrate its use through case studies and compare results with those obtained by applying previously published methodologies.


Keywords: safety system, inspection program, availability, IEC 61508, redundancy, coverage ratio, non-periodic inspection, partial inspection.

1 Introduction

Safety systems exist to take some action in the event of an emergency. Reliable operation of this type of system is of great importance for the integrity of plants and the safety of personnel and the environment. During the operation phase of the plant life-cycle, this imposes a requirement on maintainers, who have to inspect and repair these systems, to consider the balance between risk and economics. Any policy should take into account that safety systems are usually in dormant mode and are expected to operate when an emergency event occurs. It is necessary to perform regular inspections to reveal potential failure modes. The key questions are: (i) when to inspect, and (ii) what should be the extent of the actions taken at each inspection.

Safety systems are subject to random shocks and deterioration and eventually fail. The failure is not self-announced (i.e., it is a hidden failure), and can be found by inspections or when the system fails to respond to a demand. Costs are also relevant when setting inspection frequencies. There may be downtime costs associated with inspections and repairs, and a penalty cost or downtime associated with the safety system staying in a failed state without being detected. As such, the inspection timing can also be optimized to balance the effects of these two penalty costs.

Mathematical modeling of the situation generally includes three key components: inspection scheme, objective function, and decision variables. Inspection schemes may be periodic or non-periodic. For the objective function, three options are usually considered: maximizing (overall) availability, achieving a given (safety) availability level, and minimizing expected costs. Decision variables depend on the inspection scheme. There are many decision variables for the non-periodic inspection scheme. If a periodic scheme is selected, the decision variables correspond to the inspection interval and the extent of the inspection. These three key components can be combined in many ways, and hence many models are possible.

After selecting an appropriate mathematical model, one needs to estimate the values of the input parameters. When the reliability characteristics of the safety system are unknown, the engineer has to depend on educated subjective judgment and/or existing databases. Based on these, the initial inspection frequency can be determined and the program can be started. With the program in progress, more and more information is gained about the status of the inspected items. Such information can be used to update the failure model, and the inspection frequency can then be reevaluated.

In this work, we consider a situation where a safety system is partially inspected, periodically or non-periodically, and after some time is taken out of service to receive a full inspection. In the context of this paper, a partial inspection is one where only a few tests are performed on the system (e.g. a partial stroke test for a safety valve) and, if needed, repair work is performed. A full inspection implies complete testing of the system and major repairs if needed. We also make a distinction between the safety and overall availabilities of the safety system. Safety availability considers only the unknown downtime between inspections, that is, the period elapsing from the occurrence of a failure until it is detected by an inspection or by a demand of the safety function during a critical situation. Overall availability considers both known downtime (due to inspections and repairs) and unknown downtime. We also focus on safety systems where the sum of inspection and repair times affects the overall availability in a non-negligible manner. We propose a model that is easy to use and implement, yet captures the imperfect nature of inspections due to the partial coverage ratio during inspections.

The paper is organized as follows. After this introduction, we discuss the framework provided by IEC 61508 [1, 2] and related standards to provide practical methodologies for the assessment of safety systems. Then, we present a brief overview of the literature related to systems whose condition is known through inspections, as is the case for safety systems. After that, we propose a method whose innovations are the consideration of partial coverage during partial inspections and the integration with existing models to consider complex system architectures and common cause modes. Case studies from the literature are presented and compared, and conclusions are provided.

1.1 IEC 61508 as a framework for mathematical modeling

Until recently, only equipment-specific standards or company procedures existed to specify safety system maintenance programs. Given their limited goals, these standards imposed conditions on certain specific devices but not on the peripheral equipment, so system-level safety criteria and life-cycle analysis were lacking. The introduction of International Standard IEC 61508, Functional safety of electrical/electronic/programmable electronic safety-related systems, has served as a framework for other industry- and equipment-specific standards (see table 1). IEC 61508-centered standards provide a framework in which to handle all activities in the life cycle of safety systems. They propose quantitative risk metrics which allow objective decision making, e.g. for maintenance policy setting. They also provide design rules for system developers and designers, as well as for system certification entities [2].

There is increasing evidence of the wide application of IEC 61508 and related standards in several kinds of industries (e.g. the public transport industry and the manufacturing industry) [3, 4]. According to Goble [5], 70% of the petrochemical industry in the USA was already using IEC 61511 or its equivalent ANSI 84.01 by 2003. Compliance with safety standards is important for human, environmental and economic reasons: (i) compliance with national or international regulations, (ii) compliance with insurance or warranty contracts between stakeholders and the possible litigation that generally follows an accident, (iii) the delivery of a project from the original equipment manufacturer or third party to the client in turnkey contracts, and (iv) protection of critical (expensive) equipment.


Code              Industry    Equipment   Date
IEC 61508 [1]     Generic     Generic     2004
IEC 61511 [6]     Process     Generic     2003
IEC 61513 [7]     Nuclear     Generic     2001
ANSI 84.01 [8]    Process     Generic     2004
IEC 62061 [9]     Machinery   Generic     2005
EN 50126 [10]     Railway     Generic     1999
EN 50128 [11]     Railway     Software    2001
API 670 [12]      Process     Rotating    2000

Table 1: Some examples of safety standards

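Level   Probability of failure on demand (PFD)
1       $10^{-2} \le \mathrm{PFD} < 10^{-1}$
2       $10^{-3} \le \mathrm{PFD} < 10^{-2}$
3       $10^{-4} \le \mathrm{PFD} < 10^{-3}$
4       $10^{-5} \le \mathrm{PFD} < 10^{-4}$

Table 2: Safety integrity levels (SIL) according to IEC 61508 for low demand modes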

Regarding safety assessment, a central concept in IEC 61508 is the safety integrity, defined as the probability of a safety-related system satisfactorily performing the required safety functions under all the stated conditions within a specified period of time. Safety integrity is classified into 4 levels for safety systems with low demand, that is, when the expected demand rate is no greater than one per year and no greater than twice the inspection frequency (see table 2).
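The bands of table 2 lend themselves to a small lookup. The following Python sketch is illustrative only: the function names are hypothetical, and treating each band as closed on the left and open on the right is an assumption about how boundary values are assigned.

```python
# Illustrative lookup for the low-demand SIL bands of table 2.
# Function names are hypothetical; half-open bands [low, high) are an assumption.

SIL_BANDS = {1: (1e-2, 1e-1), 2: (1e-3, 1e-2), 3: (1e-4, 1e-3), 4: (1e-5, 1e-4)}


def sil_from_pfd(pfd):
    """Return the SIL (1 to 4) whose band contains pfd, or None if no band does."""
    for level, (low, high) in SIL_BANDS.items():
        if low <= pfd < high:
            return level
    return None


def max_pfd_for_sil(level):
    """Upper PFD bound of a band, i.e. the largest PFD tolerated for that SIL."""
    return SIL_BANDS[level][1]


if __name__ == "__main__":
    print(sil_from_pfd(5e-3))        # falls in the SIL 2 band
    print(1.0 - max_pfd_for_sil(2))  # availability target if availability = 1 - PFD
```

A PFD outside all four bands returns None rather than raising, so that a caller can decide how to treat systems that do not reach SIL 1.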

Given the criticality of safety systems in terms of human safety, the natural objective is to achieve safety availability, which may be estimated at an instant or over an interval. According to ISO 14224 [13]:

1. instantaneous availability: the probability of an item being in a state to perform a required function under given conditions at a given instant of time, assuming that the required external resources are provided,

2. mean availability over a given period of time: the average of the pointwise availability over the time period,

3. mean availability: the limit of the mean availability for a given mission when the time period goes to infinity.

When a renewal process occurs, definitions (2) and (3) are equal when the period of time in (2) corresponds to the renewal interval. A mean value is easier to handle than the instantaneous availability (1) and is often preferred in the context of fixing inspection intervals.

Unavailability of the safety system may be classified into known and unknown. The known unavailability corresponds to inspections and repairs. In this case the system is either not available or available with reduced redundancy, when redundancy exists. IEC 61508 does not provide explicit formulas for this kind of unavailability, as it is considered less critical than the unknown safety unavailability. As production may be interrupted or influenced by inspections and repairs, the mathematical model needs to include this situation. It is worth mentioning that the availability of safety systems may or may not be influenced by inspection and repair times. An example of the second case appears when the system has self-diagnosis capabilities. Inspection times are often negligible with respect to the interval between inspections. On the other hand, there are good examples where inspection may take weeks (e.g. subsea safety valves in offshore applications). Regarding the use of availability as a criterion to set inspection intervals, it appears that the Probability of Failure on Demand (PFD), according to IEC 61508, considers at the same time the expected unavailability due to unknown dangerous failures and the known unavailability due to repairs and inspections. Hauge et al. [14] separate both terms, as it seems natural to use alternative safety measures while the safety system is being inspected or repaired (provided that such a possibility exists), or to shut down the whole process using opportunistic strategies. In the context of this work, safety availability, $A_s$, corresponds to $1 - \mathrm{PFD}$, and overall availability, $A_o$, also includes the known downtime due to inspections and repairs.

An important modeling issue is the decision whether to consider the safety system as a single-component system or to decompose it into subsystems and components. The perceived safety system failure rates depend on several factors, such as its architecture, inspection policy and repair quality. Yet, if sufficient field failure data are available, the one-component approach may be a good approximation. On the other hand, failure rates are often only available at component level (in databases such as OREDA [15] or MIL-HDBK-217F [16]). This permits an improved assessment of system-level reliability, but requires more detailed and expensive models. In this work we consider both situations.

2 Safety assessment methodologies

Reliability modeling is a straightforward way to assess system reliability requirements and may be a helpful tool to decide the actions to be taken to attain the required reliability levels. From the point of view of a modeler, safety systems may be considered a special case of systems whose condition may be known through inspections. Other examples are systems in storage (e.g. spares and weapon systems) [17] and standby equipment [18]. All these systems may be viewed as protective devices (they exist to protect people, environment, equipment and/or products), as presented in Moubray [19]. Another related type of system is production equipment whose operational condition may only be assessed through quality control [20, 21].

Among the techniques to assess the reliability of safety systems we find models using the probability of failure on demand [4, 14], fault trees [22], and Markov models [23]. The most widely used are the PFD models; however, it should be noted that using different quantitative techniques to analyze the probability of failure on demand may lead to different results, as pointed out by Rouvroye and Brombacher [24].

Modeling attempts for safety systems date back to the sixties with the work of Barlow and Proschan [25], who consider non-periodic, minimal inspections and replacement when a failure is observed, with negligible intervention times. Following the same scheme, but improving numerical stability or reducing the design space, we may mention Nakagawa and Yasui [26] and Jiang and Jardine [21]. Jardine [20] proposes periodic inspections with renewal at each inspection and constant times to inspect and to repair. Vaurio [27] proposes a model where, after a number of minimal inspections, the system is renewed by a full inspection. In his case, the optimization criterion is the expected cost rate. Instantaneous safety availability has also been used in the optimization of inspection decisions. Ito and Nakagawa [17] consider this function as a constraint, but it has also been used as the objective in Sarkar and Sarkar [28] and Cui and Xie [29]. The expected cost rate is used as the objective in Barlow and Proschan [25] and Ito and Nakagawa [17]. Courtois and Delsarte [30] consider redundant safety systems with periodic, staggered inspections. Their objective is the maximization of the time between failures. In their case, component inspections are perfect, that is, they correspond to a full renewal. The model of Ito and Nakagawa minimizes the expected cost rate, requiring that the instantaneous availability stay above some threshold level $A_l$. It fixes a number $N$ of inspections (of negligible duration) with interval $T$ before a full inspection (with renewal, of negligible duration) occurs. They do not consider a penalty cost due to unavailability (as, e.g., Jiang and Jardine [21]) and allow the renewal cycle not to be an exact multiple of the inspection interval.

Current practical methods, like the formulas proposed in IEC 61508, ISO 14224 or Hauge et al. [1, 13, 14], consider that after an inspection the safety system is as good as new. This assumption does not seem credible in several situations, as inspections are often partial and repairs are imperfect. These situations lead to the system renewal decision problem (e.g., replacement or major inspection). Also, many papers do not consider that components of safety systems may or may not fail independently of each other (e.g. [24, 17]). IEC 61508 requires models to consider dependent failures. Further discussion on dependent failures is presented in Hokstad and Corneliussen [4], and Rausand and Hoyland [22].

Among the available models for dependent failures we may cite the square root method [31], the β model [32], the binomial failure rate (BFR) model [33], and the modified β models [14, 34]. We disregard the square root method because it does not consider that several degrees of coupling may exist between the components. We also disregard the BFR model, as it assumes that failures are immediately discovered and repaired, which is not applicable in our context. The β model is the most common in use today and is supported by IEC 61508. Its β factor corresponds to the ratio of component failures that may be considered as dependent failures (with respect to all component failures). In the context of nuclear power plants, the U.S. Nuclear Regulatory Commission [35] has estimated that the β model gives reasonably accurate results for redundancy levels up to about three or four items. As typical redundancy values are below 4, we adopt the β model to model dependent failures, although the extension of our method to the modified version is straightforward. The modified version of the β model considers that not necessarily all components fail when a common cause failure happens, and introduces correction factors. β factors in the process industry have been estimated, e.g., in Hokstad and Corneliussen [36], and are usually below 10% for sensors and actuators and below 5% for logic units.

It should be mentioned that an important source of discrepancy in availability computations is the source of the input data. Reliability databases (e.g. OREDA) may differ by orders of magnitude in their failure rate estimates [14], which also depend on local operating conditions.

Regarding the use of different distributions, Rausand and Vatn [37] discuss the general use of exponential and Weibull distributions in the reliability analysis of safety systems. Their case study considers subsea safety valves. They conclude that estimates of mean time to failure and safety unavailability are not robust with respect to variations of the model structure and of its parameters. This also encourages a more detailed assessment of the reliability characteristics of each safety system under study.

An important issue concerning the modeling of safety systems is the diagnostic coverage. This is the ratio of the failures that may be identified and repaired during inspections to all potential failures. According to Lundteigen and Rausand [38], most diagnostic coverages are in the region of 60%-90% for safety systems in the process industry. This situation may seriously affect the correct assessment of a safety system, as shown by the examples.

3 Model formulation

Let us consider a safety system composed of a set of subsystems in a series configuration, e.g., (i) sensing, (ii) decision logic and (iii) actuation (more detailed descriptions may be used, e.g. McCalley [39]). As a way to increase the safety availability ($A_s$) and decrease the likelihood of spurious trips, every subsystem has component redundancy and a voting logic. All system components are inspected every $T$ time units and the system is renewed (by a full inspection) after $N$ partial inspections. In our model, $N$ and $T$ represent the decision variables of the optimization problem. The objective in our case is to maximize the overall availability of the safety system while respecting a safety availability constraint (according to table 2). Inspections require the unavailability of the safety system, which is a known unavailability. The mean durations of partial and full inspections are $T_i$ and $T_o$ time units respectively; both times include any repairs that may be required after the inspections.


Each component, considered independently, has a set of detectable (d) failure modes that can be assessed and repaired when partial inspections occur. The remaining (u) modes are repaired during the full inspections, which leave the system in an as-good-as-new condition. The ratio of the d-type modes to the full set of modes is defined as the diagnostic coverage $\rho_i$ of a component:

$$\rho_i = \frac{\lambda_d}{\lambda_d + \lambda_u} \tag{1}$$

As such, each component may be modeled as a two-block series system: one block for the failure modes observed during partial inspections, and another for the failure modes observed during full inspections. We have introduced the sub-index i to indicate that this coverage ratio is related to independent component failures.

3.1 One component safety systems

To simplify the presentation of the model, let us first consider a one-component safety system. The instantaneous component reliability corresponds to the product:

$$R_c(t) = R_d(t)\,R_u(t) = e^{-H_d(t) - H_u(t)} \tag{2}$$

where $R$ corresponds to the reliability function and $H$ to the expected number of failures up to instant $t$, according to the model of Ito and Nakagawa [17] for repairable systems, or to the cumulative hazard for non-repairable systems. Just prior to any inspection, $T$ time units have passed since the last partial inspection:

$$R_c(t^-) = e^{-H_d(T) - H_u(t^-)}$$

where $t^-$ corresponds to the instant just before starting a partial inspection. After each partial inspection, block d is as good as new, so:

$$R_c(t^+) = e^{-H_u(t^+)}$$

where $t^+$ corresponds to the instant just after finishing a partial inspection.

These conditions determine an instantaneous availability profile similar to the one shown in figure 1. Let us note that the safety availability corresponds to the reliability function of the component. This is so because the probability of operation at any instant between inspections (safety availability) corresponds to the probability of survival up to that instant (reliability) [22].

The overall availability during a renewal cycle is given by:

$$A_o = \frac{1}{T_c} \int_0^{T_c} R_c(t)\,dt$$

Figure 1: Instantaneous and mean availabilities for a renewal cycle with N = 2 partial inspections. (Axes: availability versus time. The plot marks the inspection durations $T_i$ and $T_o$, the instants $t^-$ and $t^+$, the renewal cycle, and the overall and safety availability levels.)

where $T_c$ is the length of the renewal cycle:

$$T_c = N(T + T_i) + (T + T_o)$$

and for the safety availability,

$$A_s = \frac{1}{(N+1)T} \int_0^{T_c} R_c(t)\,dt$$

For convenience, we define the ratio:

$$\gamma = \frac{T_i}{T_o}$$

We have,

$$A_o(N, T) = \frac{1}{T_c} \sum_{j=1}^{N+1} \int_{t_j}^{t_j+T} e^{-H_d(t - t_j) - H_u(t)}\,dt \tag{3}$$

with

$$t_1 = 0, \qquad t_j = t_{j-1} + T + T_i \quad \text{for } j = 2, 3, \ldots, N+1$$

We note that $A_o$ considers the downtime associated with inspections. As such, it is a lower bound for $A_s$, which only considers the unknown downtime between inspections. Our initial optimization problem considers

$$\max_{N,T} A_o$$

subject to the constraints

$$A_s \ge 1 - \mathrm{PFD}_{SIL}, \qquad N \ge 0 \text{ integer}, \qquad T > 0$$

where $\mathrm{PFD}_{SIL}$ refers to the maximum probability of failure on demand for the fixed safety integrity level.

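Available databases assume exponential distributions for component failures. In this case:

$$R_d(t) = e^{-\lambda_d t}, \qquad R_u(t) = e^{-\lambda_u t}$$

We may define the component failure rate $\lambda$,

$$\lambda = \lambda_d + \lambda_u$$

and by definition (1):

$$\lambda_d = \rho_i \lambda, \qquad \lambda_u = (1 - \rho_i)\lambda$$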
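To make the resulting availability profile concrete, the sketch below evaluates $R_c(t)$ along one renewal cycle under the exponential assumption. It is a minimal illustration, not the authors' implementation: the function name and the convention of returning 0 while the system is under inspection are assumptions made here.

```python
import math


def component_reliability(t, lam, rho, T, Ti, N):
    """Instantaneous availability of one component at time t of a renewal cycle.

    The detectable modes (rate rho*lam) restart after each partial inspection,
    while the undetectable modes (rate (1-rho)*lam) age over the whole cycle.
    Returns 0.0 when t falls inside an inspection (assumed known downtime).
    """
    start = 0.0  # t_1 = 0
    for _ in range(N + 1):  # N + 1 operating intervals per cycle
        if start <= t <= start + T:
            age_d = t - start  # time since the last partial inspection
            return math.exp(-rho * lam * age_d - (1.0 - rho) * lam * t)
        start += T + Ti  # t_j = t_{j-1} + T + T_i
    return 0.0
```

Evaluating this function on a fine time grid gives a saw-tooth curve: the reliability jumps up after each partial inspection, but never back to 1, because the undetectable modes keep aging until the full inspection.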

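Considering equation (3),

$$\begin{aligned}
\int_{t_j}^{t_j+T} e^{-H_d(t-t_j) - H_u(t)}\,dt &= \int_{t_j}^{t_j+T} e^{-\lambda_d (t - t_j) - \lambda_u t}\,dt \\
&= \int_{t_j}^{t_j+T} e^{-\rho_i \lambda (t - t_j) - (1-\rho_i)\lambda t}\,dt \\
&= \frac{e^{(\rho_i \lambda t_j - \lambda t_j - \lambda T)}\left(e^{\lambda T} - 1\right)}{\lambda}
\end{aligned}$$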
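The closed form of this integral can be checked numerically. The short Python sketch below compares it with a midpoint-rule approximation of the integrand; the function names and the parameter values used in any call are illustrative assumptions, not values from the paper.

```python
import math


def interval_uptime_closed(tj, T, lam, rho):
    """Closed form of the integral over [tj, tj + T] for exponential failures."""
    return math.exp(rho * lam * tj - lam * tj - lam * T) * (math.exp(lam * T) - 1.0) / lam


def interval_uptime_numeric(tj, T, lam, rho, steps=2000):
    """Midpoint-rule approximation of the same integral."""
    h = T / steps
    total = 0.0
    for k in range(steps):
        t = tj + (k + 0.5) * h
        # detectable modes age from tj, undetectable modes age from the cycle start
        total += math.exp(-rho * lam * (t - tj) - (1.0 - rho) * lam * t)
    return total * h
```

The two functions should agree to within the discretization error of the midpoint rule for any non-negative $t_j$ and positive $T$.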

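Since $t_j = (j-1)(T + T_i)$, the sum over $j$ in equation (3) is a geometric series, and we obtain an explicit form for the objective in terms of $N$ and $T$:

$$A_o = \frac{\left(1 - e^{\lambda T}\right)\left(e^{\lambda T_i} - e^{-(1-\rho_i)\lambda T + \rho_i \lambda T_i - (1-\rho_i)(\lambda T + \lambda T_i)N}\right)}{\left((\lambda T + \lambda T_i)N + \lambda T + \lambda T_o\right)\left(e^{\rho_i(\lambda T + \lambda T_i)} - e^{\lambda T + \lambda T_i}\right)} \tag{4}$$

and for the safety availability:

$$A_s = \frac{\left(1 - e^{\lambda T}\right)\left(e^{\lambda T_i} - e^{-(1-\rho_i)\lambda T + \rho_i \lambda T_i - (1-\rho_i)(\lambda T + \lambda T_i)N}\right)}{\lambda T (N + 1)\left(e^{\rho_i(\lambda T + \lambda T_i)} - e^{\lambda T + \lambda T_i}\right)} \tag{5}$$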

Naturally, these formulas are valid for one-component safety systems. The model differs from that of Ito and Nakagawa in that we use mean availability instead of instantaneous availability, and in that we account for non-negligible inspection times in the aging of the component.
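For the exponential one-component case, both availabilities can be evaluated by summing the per-interval integrals of equation (3) as a geometric series. The sketch below is one possible way to code this under that assumption; the function name is hypothetical and the parameter values in the example are dimensionless choices ($\lambda = 1$) made for illustration.

```python
import math


def availabilities(N, T, lam, rho, Ti, To):
    """Return (A_o, A_s) for N partial inspections and inspection interval T."""
    cycle = N * (T + Ti) + (T + To)              # renewal cycle length T_c
    q = math.exp(-(1.0 - rho) * lam * (T + Ti))  # aging of undetected modes per interval
    # sum over the N + 1 operating intervals of the per-interval closed form
    uptime = (1.0 - math.exp(-lam * T)) / lam * sum(q ** j for j in range(N + 1))
    return uptime / cycle, uptime / ((N + 1) * T)


if __name__ == "__main__":
    # Dimensionless example with lam = 1, so T, Ti and To are in units of 1/lam.
    a_o, a_s = availabilities(N=0, T=0.3, lam=1.0, rho=0.5, Ti=0.025, To=0.05)
    print(round(a_o, 4), round(a_s, 4))
```

With $N = 0$, $\lambda T = 0.3$ and $\lambda T_o = 0.05$ the overall availability evaluates to about 0.7405, in line with the value reported for $\gamma = 0.5$ in table 4.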


3.2 Non-periodic inspections

Considering the degradation of the system during the interval between full inspections, it makes sense to prepare an inspection program that dynamically changes the inspection intervals to attain a higher overall availability. We now need to define a set $\{T^{(1)}, T^{(2)}, \ldots, T^{(N+1)}\}$, where $T^{(j)}$ corresponds to the interval between the end of the previous inspection and the beginning of the $j$-th inspection. Considering equation (3), we have:

$$T_c = \left(T^{(N+1)} + T_o\right) + \sum_{j=1}^{N} \left(T^{(j)} + T_i\right)$$

$$t_1 = 0, \qquad t_j = t_{j-1} + \left(T^{(j-1)} + T_i\right) \quad \text{for } j = 2, 3, \ldots, N+1$$

where each $T^{(j)}$ represents an independent decision variable. One way to reduce the design space is to consider a parametric law relating all the $T^{(j)}$. In real-world applications, this approach may be impractical, as it complicates the logistics of inspection scheduling.
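As an illustration of the non-periodic scheme, the following sketch evaluates the overall availability for an arbitrary list of intervals, assuming exponential failures as in the periodic case. The function name and the list-based interface are assumptions made for this example.

```python
import math


def overall_availability_nonperiodic(intervals, lam, rho, Ti, To):
    """A_o for a cycle whose operating intervals T^(1), ..., T^(N+1) are given as a list."""
    if not intervals:
        raise ValueError("at least one operating interval is required")
    uptime = 0.0
    start = 0.0  # t_1 = 0
    for Tj in intervals:
        # closed-form integral over [start, start + Tj] for exponential failures
        uptime += math.exp(-(1.0 - rho) * lam * start) * (1.0 - math.exp(-lam * Tj)) / lam
        start += Tj + Ti  # the next interval begins after a partial inspection
    cycle = start - Ti + To  # the last inspection of the cycle is the full one
    return uptime / cycle
```

When all intervals are equal, the result coincides with the periodic case, which gives a convenient consistency check before passing the list of intervals to a numerical optimizer.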

3.3 ρβ model

Modeling complex safety systems requires several considerations, such as system architecture, voting logic, dependent failures and partial inspection coverage. Figure 2 shows the logic diagram of the ρβ model proposed here: ρ stands for the partial coverage ratio during partial inspections and β for dependent failures in redundant systems. Building a model starts by deciding whether a detailed model is required and whether the safety system failure rate is available (1-2-3). If a complex model is required, one needs the safety system's architecture, component failure rates and coverage ratios, as well as dependent-failure ratios and dependent-failure coverage ratios (4). Once these are provided, the safety system reliability block diagram is built (5). Then, an inspection policy is set (6). This includes selecting a given safety integrity level, deciding between periodic and non-periodic inspection intervals, grouping component inspections, and defining how inspections and repairs affect the availability of the safety system (e.g. the inspection of one component may force the system to be unavailable or may reduce the redundancy from n to n−1). To consider the existence of dependent failures, we use the β model [32]. Figure 3 shows an example of the β model for a subsystem with a 2/3 architecture. The coverage ratio $\rho_d$ for dependent failures is:

$$\rho_d = \frac{\lambda_{di}}{\beta\lambda} \tag{6}$$

where $\lambda_{di}$ corresponds to the failure rate for dependent failures that are detected during partial inspections. Each subsystem is modeled by decomposing the dependent-failure block into two blocks in series. Each component block and dependent-failure block is thus decomposed into two series blocks, as explained for the one-component model. The optimization process (7) includes fixing an objective (overall availability in our case) in terms of the decision variables (e.g., $T$ and $N$) and evaluating the functions (8). If the safety constraint is not satisfied (9), new values for the decision variables are computed, or a redesign of the system architecture and component reliability is needed, which returns the flow to (1) for reassessment. Otherwise, the optimal values are adopted (10).

Figure 2: Flow diagram for the ρβ model. (Block labels recoverable from the diagram: complex system model; safety system failure rate available?; one-component model; system parameters, including sensing-logic-actuation decomposition, redundancy, voting logic, component failure rate, component coverage ratio, dependent-failure ratio and dependent-failure coverage ratio; reliability block diagram; inspection policy; optimization; computation of $A_o$, $A_s$, $C$; check of $A_s \ge 1 - \mathrm{PFD}_{SIL}$; end.)

Figure 3: Block diagram of a 2/3 sensing subsystem considering dependent failures. (The diagram shows the sensor pairs 1-2, 1-3 and 2-3, with independent failure rate $(1-\beta)\lambda$ per sensor, in series with an "all sensors" block with rate $\beta\lambda$.)
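The block decomposition described above can be sketched in a few lines of Python for a single k-out-of-n subsystem with exponential failures. This is an illustrative reading of the decomposition, not the authors' code: the function names are hypothetical, `t` is the time since the last full inspection and `age` the time since the last partial inspection.

```python
import math


def k_out_of_n(k, n, r):
    """Probability that at least k of n independent blocks with reliability r work."""
    return sum(math.comb(n, m) * r ** m * (1.0 - r) ** (n - m) for m in range(k, n + 1))


def subsystem_reliability(t, age, k, n, lam, beta, rho_i, rho_d):
    """Instantaneous reliability of a k-out-of-n subsystem under the beta model.

    Each component block and the dependent-failure block are split into a
    detectable part (renewed at partial inspections) and an undetectable part.
    """
    lam_ind = (1.0 - beta) * lam  # independent failure rate of one component
    lam_dep = beta * lam          # dependent (common cause) failure rate
    r_comp = math.exp(-rho_i * lam_ind * age - (1.0 - rho_i) * lam_ind * t)
    r_dep = math.exp(-rho_d * lam_dep * age - (1.0 - rho_d) * lam_dep * t)
    return k_out_of_n(k, n, r_comp) * r_dep
```

For a component reliability of 0.9, `k_out_of_n` gives 0.99 for a 1/2 architecture and 0.972 for a 2/3 architecture, which illustrates why redundancy with less demanding voting is more reliable when spurious trips are ignored.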

In practical cases, it may be difficult to estimate $\rho_i$ and $\rho_d$ separately, so we might use a single value:

$$\rho = \rho_i = \rho_d \tag{7}$$

Interaction with the design of the safety system is needed when the SIL constraint cannot be attained inside the feasible region of the inspection program. In this case it is necessary to reevaluate the architecture and component reliability parameters as well as the inspection policy.
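For the one-component exponential case, the search over $N$ and $T$, including the case where no feasible point exists, can be sketched as a plain grid search. This is a swapped-in technique chosen for simplicity, not the solver used in the paper; the function names, the grid resolution and the bound `n_max` are assumptions.

```python
import math


def availabilities(N, T, lam, rho, Ti, To):
    """(A_o, A_s) for one exponential component, N partial inspections, interval T."""
    cycle = N * (T + Ti) + (T + To)
    q = math.exp(-(1.0 - rho) * lam * (T + Ti))
    uptime = (1.0 - math.exp(-lam * T)) / lam * sum(q ** j for j in range(N + 1))
    return uptime / cycle, uptime / ((N + 1) * T)


def grid_search(lam, rho, Ti, To, as_min, n_max=20, t_grid=None):
    """Return (A_o, N, T, A_s) of the best feasible grid point, or None if none exists."""
    if t_grid is None:
        # assumed default grid: lam * T from 0.005 to 1.0 in steps of 0.005
        t_grid = [0.005 * k / lam for k in range(1, 201)]
    best = None
    for N in range(n_max + 1):
        for T in t_grid:
            a_o, a_s = availabilities(N, T, lam, rho, Ti, To)
            # keep only points that satisfy the safety availability constraint
            if a_s >= as_min and (best is None or a_o > best[0]):
                best = (a_o, N, T, a_s)
    return best
```

A return value of None signals that the safety constraint cannot be met on the grid, which corresponds to the situation where the architecture or the component reliability has to be revisited.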

4 Case studies

4.1 One-component

In order to illustrate the use of the ρβ model, we use an example from Nakagawa [40]. It considers a one-component safety system, but we also study the effect of changing to 1/2 and 2/3 configurations with no dependent failures ($\beta = 0$). As a reference value, the time for full inspections represents 5% of the mean time between failures of a component ($\lambda T_o = 0.05$).

Figure 4 shows the effect of changing the coverage ratio $\rho_i$. For all configurations, the optimal number of inspections tends to increase as $\rho_i$ increases.

Figure 5 presents a sensitivity analysis of the effect of the relative duration of partial and full inspections on the optimum number of partial inspections per renewal cycle. When $\gamma = 0.01$ and $\rho_i = 0.5$, the optimum is obtained at $N^* = 9$ partial inspections (the topology of the overall availability is quite insensitive to $N$; observe figure 5(a)). When the relative time to perform a partial inspection is extended, it becomes more attractive to perform full inspections after fewer partial inspections (observe figures 5(b) and 5(c)). In figure 5(c) we notice that it is better to fully inspect the system at every inspection epoch.

Figure 4: Effect of $\rho_i$ on the optimal number of inspections ($\gamma = 0.5$, $\lambda T_o = 0.05$, $\beta = 0$). (Panels: (a) $\rho_i = 0.3$, (b) $\rho_i = 0.5$, (c) $\rho_i = 0.9$; overall availability versus $N$ for the 1/1, 1/2 and 2/3 configurations.)

Figure 5: Effect of $\gamma$ on the optimal number of inspections ($\rho_i = 0.5$, $\lambda T_o = 0.05$, $\beta = 0$). (Panels: (a) $\gamma = 0.01$, (b) $\gamma = 0.1$, (c) $\gamma = 0.5$; overall availability versus $N$ for the 1/1, 1/2 and 2/3 configurations.)

As expected, adding redundancy increases the overall availability. The 1/2 architecture is the most reliable of the three configurations under test. Of course, here we have not considered the spurious trip rate, which usually makes the 2/3 configuration the most effective with respect to the frequency of spurious trips.

Compared with the cost-based model of Nakagawa, the number of inspections is consistently lower (see table 3). Of course, direct comparison is not straightforward, as his criterion is cost and not availability.

Regarding the use of non-periodic intervals, table 4 shows resulting optimaloverall availabilities for γ = {0.01, 0.1, 0.5} when using periodic and non-periodicintervals, for a one-component system. Only for γ = 0.1 we observe an increasein the achieved overall availability. Figure 6 shows a comparison of the intervals

14

0 2 4 6 8 100.99

0.995

1

1.005

1.01

j

λTj/T

Non-periodicPeriodic

Figure 6: Example with one component. Optimal intervals for non-periodicinspections (normalized with respect to the optimal T for periodic inspection).γ = 0.01, ρ = 0.5, λTo = 0.05, configuration 1/1.

κ ρi Amins (t) N∗[40]10 0.9 0.8 850 0.9 0.8 1910 0.5 0.8 210 0.9 0.9 8

Table 3: Example from reference [40]. One-component system. κ correspondsto the ratio of the cost of an full inspection to the cost of a partial inspection,Ci. (β = 0).

15

Periodic Non-periodicγ N∗ A∗o N∗ A∗o0.01 9 0.7892 9 0.78920.1 2 0.7609 2 0.77820.5 0 0.7405 0 0.7405

Table 4: Example with one component. Comparison of results using periodicand non-periodic inspections. The architecture is 1/1.

λ βPressure sensor 0.3 0.03Logic unit 0.1 0.02Valve 2.9 0.02

Table 5: HIPPS system parameters. Failure rate per 106 hours.

when γ = 0.01. The resulting value of overall availability is very similar in this case. Computations were performed using the Microsoft Excel standard solver for portability.

4.2 Complex system

We illustrate the ρβ model in a complex system by using a modified version of the example given in Hauge et al. [14]. In that reference, the model considers unitary component coverage ratios. The safety system under study is shown in figure 7 and corresponds to a High Integrity Pressure Protection System (HIPPS). The safety function uses a 2/3 voting logic on the pressure sensors, and a 1/2 voting logic for the logic units and also for the actuators. When high pressure is detected in the vessel, the pressure sensors send a signal to the logic units, and these send a signal to shut down the actuators. Table 5 shows the model parameters.
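The voting logic described above can be sketched numerically. The following minimal Python sketch assumes identical, independent components within each sub-system and a series arrangement of the three sub-systems. It ignores common cause failures and partial coverage, which the ρβ model does account for. The function names and the component availability of 0.99 are hypothetical and serve only as an illustration.

```python
from math import comb


def k_out_of_n(k: int, n: int, a: float) -> float:
    """Probability that at least k of n independent components,
    each available with probability a, are available."""
    return sum(comb(n, i) * a**i * (1.0 - a) ** (n - i) for i in range(k, n + 1))


def hipps_availability(a_sensor: float, a_logic: float, a_valve: float) -> float:
    """Series combination of 2/3 sensors, 1/2 logic units and 1/2 actuators.

    Assumption of this sketch: independent failures only (beta = 0)
    and full coverage, unlike the rho-beta model of the paper.
    """
    return (
        k_out_of_n(2, 3, a_sensor)
        * k_out_of_n(1, 2, a_logic)
        * k_out_of_n(1, 2, a_valve)
    )


if __name__ == "__main__":
    # Hypothetical component availability, for illustration only.
    print(hipps_availability(0.99, 0.99, 0.99))
```

Under these assumptions the 2/3 sensing sub-system is the least available of the three for equal component availabilities, since it tolerates only one failure among three components.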

We assume that inspections and repairs of components in the system render the safety system unavailable, a fact that is known to the operators, who may have other layers of protection.

The HIPPS is composed of three sub-systems: pressure sensors, logic units and the safety actuators. Figure 8 shows a simplified block diagram where common cause effects and partial coverages are not shown explicitly. As an example, the block decomposition for the sensing sub-system, showing these two considerations, is given in figure 9. We require the total failure rate of the component, λ; the fraction of common cause failures, β; the fraction of independent failures which can be inspected, ρi; and the fraction of dependent failures which can be inspected, ρd.
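As a hedged illustration of this decomposition, the sketch below splits a component failure rate λ into the four block rates labeled in figure 9. The function name `split_rate` and the dictionary keys are hypothetical. The values λ = 0.3 per 10^6 hours and β = 0.03 are the pressure sensor entries of table 5, and ρi = ρd = 0.7 is an assumption made here for illustration.

```python
def split_rate(lam: float, beta: float, rho_i: float, rho_d: float) -> dict:
    """Split a total failure rate into the four block rates of figure 9.

    lam   : total failure rate of the component
    beta  : fraction of common cause (dependent) failures
    rho_i : inspectable fraction of independent failures
    rho_d : inspectable fraction of dependent failures
    """
    return {
        "independent, inspectable": rho_i * (1.0 - beta) * lam,
        "independent, not inspectable": (1.0 - rho_i) * (1.0 - beta) * lam,
        "dependent, inspectable": rho_d * beta * lam,
        "dependent, not inspectable": (1.0 - rho_d) * beta * lam,
    }


if __name__ == "__main__":
    # Pressure sensor from table 5 (rate per 10^6 hours); rho values assumed.
    for name, rate in split_rate(0.3, 0.03, 0.7, 0.7).items():
        print(f"{name}: {rate:.4f}")
```

The four rates always add up to λ, so the decomposition redistributes the total failure rate without changing it.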

After integration of the three subsystems (figure 10), and considering a given interval T and a number of partial inspections N, we may obtain the instantaneous availability of each sub-system and the one for the full HIPPS as it is


[Figure 7 diagram: pressure vessel; sensors pt1, pt2, pt3; logic units lu1, lu2; actuators v1, v2]

Figure 7: Example with a complex system. High Integrity Pressure Protection System.

[Figure 8 diagram: blocks Sensor 1, Sensor 2, Sensor 3 (2/3); Logic 1, Logic 2 (1/2); Actuator 1, Actuator 2 (1/2)]

Figure 8: Simplified block diagram of the HIPPS.

[Figure 9 diagram: block failure rates shown: ρi(1-β)λ, (1-ρi)(1-β)λ, ρdβλ, (1-ρd)βλ]

Figure 9: Block diagram of the 2/3 sensing sub-system considering dependent failures and partial coverage. Some block failure rates are shown as examples.


[Figure 10 diagram: blocks labeled 'd' and 'u']

Figure 10: General block diagram of the HIPPS. Blocks 'd' renew after each partial inspection. Blocks 'u' renew after full inspections.

[Figure 11 plots: safety availability (0.994 to 1) versus time (0 to 60 months); curves: sensors, logic, actuators, system As, SIL3 limit; panels: (a) partial coverage, (b) full coverage]

Figure 11: Results for the HIPPS system when ρ = {0.7, 1} for all components, T = 6 months, N = 11. The safety integrity level is not attained with this inspection frequency. The main actors in the degradation are the actuators, as they have a larger failure rate.

shown in figure 11. Computations were made with an ad hoc program in Matlab. For this example figure, an interval T = 6 months and renewal after 5 years are considered. We observe that the actuators drive the reduction of the safety availability of the system. The mean safety availability is computed by averaging instantaneous values. It is also seen that the safety integrity level 3 required for the HIPPS safety function is no longer attained. Figure 11 shows the trend of the expected safety availability, which is valuable to the decision maker.
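The averaging step can be illustrated with a simplified single-component sketch. It is not the Matlab program used for figure 11. It assumes a 1/1 component with exponential failures, negligible inspection time, no common cause failures, and that the inspectable fraction ρ of the failure rate is renewed at every partial inspection while the remainder is renewed only at the full inspection after (N + 1)T. The function names, the conversion of 730 hours per month and the midpoint averaging are assumptions of this sketch; λ is the valve rate from table 5.

```python
from math import exp

HOURS_PER_MONTH = 730.0  # assumption: average month length in hours


def safety_availability(t: float, lam: float, rho: float, T: float, N: int) -> float:
    """Instantaneous availability of one component at time t (hours).

    The inspectable part (rate rho*lam) restarts at every inspection;
    the hidden part (rate (1-rho)*lam) restarts only at full inspections.
    """
    cycle = (N + 1) * T
    t_since_full = t % cycle
    t_since_partial = t_since_full % T
    return exp(-rho * lam * t_since_partial) * exp(-(1.0 - rho) * lam * t_since_full)


def mean_safety_availability(lam: float, rho: float, T: float, N: int,
                             steps: int = 20000) -> float:
    """Average of the instantaneous values over one renewal cycle (midpoint rule)."""
    dt = (N + 1) * T / steps
    total = sum(safety_availability((i + 0.5) * dt, lam, rho, T, N) for i in range(steps))
    return total / steps


if __name__ == "__main__":
    lam = 2.9e-6               # valve failure rate per hour (table 5)
    T = 6 * HOURS_PER_MONTH    # partial inspection interval
    for rho in (0.7, 1.0):
        print(rho, mean_safety_availability(lam, rho, T, N=11))
```

With ρ = 1 the curve is a pure sawtooth of period T, whereas with ρ < 1 the peaks after each partial inspection drift downward until the full inspection, which is the qualitative behavior described for the partial coverage case.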

Figure 12 is a graphic sensitivity analysis of overall and safety availabilities for N = {0, ..., 6} with Ti = 4 hours and γ = 0.25. For the case of partial coverage, the global optimum for the overall availability (point A in figure 12) is attained when there are 2 partial inspections prior to renewal, with a partial inspection interval of 10.9 months and a life cycle of 2.7 years. The associated safety availability (point B) is also displayed. The safety constraint is attained for safety integrity level 3 of IEC 61508. Table 6 lists the corresponding values.

In the case of perfect coverage ratios, all partial inspections become full inspections in terms of their effect on the reliability of the safety system once they are completed. If a full inspection is done at every epoch (N = 0), the optimal interval between them is T = 19.1 months (+75% with respect to the optimal


[Figure 12 plot: availability (0.996 to 1) versus T (0 to 25 months); curves Ao and As for N = 0, 1, 2, 3, 4, 5, 6; points A and B marked]

Figure 12: Study of overall and safety availabilities for N = {0, ..., 6}, Ti = 4 hours, γ = 0.25.

N    T* (months)    A*o        As         Life cycle (years)
0    19.1           0.99758    0.99903    1.6
1    13.6           0.99793    0.99916    2.3
2    10.9           0.99799    0.99918    2.7
3    9.3            0.99796    0.99917    3.1
4    8.1            0.99791    0.99916    3.4
5    7.3            0.99785    0.99913    3.7
6    6.6            0.99777    0.99911    3.9

Table 6: Results for the HIPPS example for different numbers of partial inspections in a renewal cycle (ρ = 0.7, Ti = 4 hours, γ = 0.25).

solution of partial coverage). The life cycle decreases to 1.6 years (-41% with respect to the optimal solution of 2.7 years).
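The figures quoted above can be cross-checked against table 6. The short sketch below assumes that the life cycle equals (N + 1)·T*, that is, N partial inspections followed by one full inspection at equal intervals; this relation is inferred from the tabulated values rather than stated explicitly here. It also assumes the usual low-demand reading of SIL 3 in IEC 61508 as a mean safety availability of at least 0.999. Variable and function names are illustrative.

```python
# Rows of table 6: (N, T* in months, A*o, As, life cycle in years)
TABLE_6 = [
    (0, 19.1, 0.99758, 0.99903, 1.6),
    (1, 13.6, 0.99793, 0.99916, 2.3),
    (2, 10.9, 0.99799, 0.99918, 2.7),
    (3, 9.3, 0.99796, 0.99917, 3.1),
    (4, 8.1, 0.99791, 0.99916, 3.4),
    (5, 7.3, 0.99785, 0.99913, 3.7),
    (6, 6.6, 0.99777, 0.99911, 3.9),
]

SIL3_MIN_AS = 0.999  # assumption: SIL 3 low-demand limit, PFD below 1e-3


def life_cycle_years(n: int, t_months: float) -> float:
    """Assumed relation: N partial inspections plus one full inspection."""
    return (n + 1) * t_months / 12.0


# Row maximizing the overall availability A*o among those meeting SIL 3.
feasible = [row for row in TABLE_6 if row[3] >= SIL3_MIN_AS]
best = max(feasible, key=lambda row: row[2])
full_only = TABLE_6[0]  # N = 0: full inspection at every epoch

interval_change = 100.0 * (full_only[1] / best[1] - 1.0)
life_change = 100.0 * (full_only[4] / best[4] - 1.0)

if __name__ == "__main__":
    print("optimal N:", best[0])
    print(f"interval change: {interval_change:+.0f}%")
    print(f"life cycle change: {life_change:+.0f}%")
```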

5 Conclusion

We have proposed a model to estimate optimal inspection intervals for safety systems which continue to degrade after each partial inspection, as component diagnostic coverages are not unitary. The proposed ρβ model considers periodic and non-periodic inspections. In the case of one-component systems we have avoided approximations when computing both the overall and the safety availability. Compared to Ito and Nakagawa [17], the ρβ model is aligned with current safety standards as it considers the safety availability constraint at system level. We have illustrated the model using two examples from recent literature. The first considers the safety system as a one-component system, while


the second considers the decomposition according to sub-systems and component redundancy. Both examples show the high sensitivity of the inspection intervals to the coverage ratios. The application of standard methods which consider perfect diagnostic coverage and negligible inspection and overhaul times may bias the estimation of the optimal policy and produce an overestimation of both safety and overall availability. In this regard, the ρβ model acknowledges that most inspections are partial and recommends fully testing and renewing the safety system after some period of time if overall availability is to be maximized while complying with the SIL constraint given by safety standards.

Possible extensions to the model may consider: (i) failure intensities that increase in time due to imperfect repairs; (ii) staggered inspections, since inspecting components in k-out-of-n configurations does not necessarily mean downtime of the safety system, and staggering may increase system availability; (iii) failures with non-exponential distributions; (iv) opportunistic strategies, which may further increase the availability of safety systems as the downtime associated with inspections and overhauls may be significantly reduced.

Acknowledgments

The authors would like to acknowledge Dr. Per Hokstad of the Department of Safety and Reliability, SINTEF Industrial Management, for kindly letting us use the example presented in reference [14]. The first author would like to acknowledge the financial support of Material and Manufacturing Ontario and member companies of the Centre for Maintenance Optimization and Reliability Engineering (C-MORE) Consortium that allowed his research visit to the University of Toronto.

References

[1] IEC 61508. Functional safety of electrical/electronic/programmable electronic (E/E/PE) safety related systems. Part 1-7, Edition 1.0 (various dates), International Electrotechnical Commission.

[2] Lundteigen, M.A., Rausand, M., Bouwer, I., Integrating RAMS engineering and management with the safety lifecycle of IEC 61508, Reliability Engineering and System Safety, 94, 1894-1903, 2009.

[3] Lytollis, B., Safety instrumentation systems: how much is enough?, Chemical Engineering, 109(12), 2002.

[4] Hokstad, P., Corneliussen, K., Loss of safety assessment and the IEC 61508 standard, Reliability Engineering and System Safety, 83(19), 111-120, 2004.

[5] Goble, W., Using the safety life cycle, Hydrocarbon Processing, 82(7), 2003.


[6] IEC 61511. Functional safety: safety instrumented systems for the process industry sector. Part 1-3, International Electrotechnical Commission.

[7] IEC 61513, Safety systems for nuclear industry, International Electrotechnical Commission.

[8] ANSI/ISA-84.01, Functional safety: Safety Instrumented Systems for the Process Industry sector, Instrument Society of America, Research Triangle Park, NC, 2004.

[9] IEC 62061, Safety of machinery: Functional safety of safety-related electrical, electronic and programmable electronic control systems, 2005.

[10] EN 50126, Railway Applications, The Specification and Demonstration of Reliability, Availability, Maintainability and Safety (RAMS), Cenelec, Brussels, 1999.

[11] EN 50128, Railway Applications, Software for Railway Control and Protection Systems, Cenelec, Brussels, 2001.

[12] API Standard 670, Machinery Protection Systems, Fourth Edition, American Petroleum Institute, Washington, D.C., 2000.

[13] ISO/DIN Standard 14224, Petroleum and natural gas industries, Collection and exchange of reliability and maintenance data for equipment, 2004.

[14] Hauge, S., Hokstad, P., Langseth, H., Oien, K., Reliability Prediction Method for Safety Instrumented Systems; PDS Method Handbook. SINTEF Report STF50 A06031, SINTEF, Trondheim, Norway, 2006.

[15] OREDA, Offshore Reliability Data, 4th ed. OREDA Participants. Available from: Det Norske Veritas, NO-1322 Høvik, Norway, 2002.

[16] MIL HDBK-217F, Reliability Prediction of Electronic Equipment. U.S. Department of Defense, Washington, DC, 1991.

[17] Ito, K., Nakagawa, T., Optimal inspection policies for a storage system with degradation at periodic tests. Math. Comput. Modelling, 31, 191-195, 2000.

[18] Sarkar, J., Sarkar, S., Availability of a periodically inspected system supported by a spare unit, under perfect repair or perfect upgrade, Statistics & Probability Letters, 53, 207-217, 2001.

[19] Moubray, J., Reliability-Centered Maintenance, 2nd ed., Butterworth-Heinemann, 1997.

[20] Jardine, A.K.S., Tsang, A., Maintenance, Replacement and Reliability, Taylor & Francis, 2006.


[21] Jiang, R., Jardine, A.K.S., Two optimization models of the optimum inspection problem, Journal of the Operational Research Society, 56, 1176-1183, 2005.

[22] Rausand, M., Hoyland, A., System Reliability Theory, 2nd ed., Wiley, New York, 2004.

[23] Bukowski, J.V., Goble, W., Defining mean time-to-failure in a particular failure-state for multi-failure-state systems, IEEE Transactions on Reliability, 50(2), 221-228, 2001.

[24] Rouvroye, J.L., Brombacher, A.C., New quantitative standards: different techniques, different results?, Reliability Engineering and System Safety, 66, 121-125, 1999.

[25] Barlow, R.E., Hunter, L., Proschan, F., Optimum checking procedures. Journal of the Society for Industrial and Applied Mathematics, 11, 1078-1095, 1963.

[26] Nakagawa, T., Yasui, K., Approximate calculation of optimal inspection times. Journal of the Operational Research Society, 31, 851-853, 1980.

[27] Vaurio, J.K., Availability and cost functions for periodically inspected preventively maintained units, Reliability Engineering and System Safety, 63, 133-140, 1999.

[28] Sarkar, J., Sarkar, S., Availability of a periodically inspected system under perfect repair, Journal of Statistical Planning and Inference, 91, 77-90, 2000.

[29] Cui, L., Xie, M., Availability of a periodically inspected system with random repair or replacement times, Journal of Statistical Planning and Inference, 131, 89-100, 2005.

[30] Courtois, P.J., Delsarte, Ph., On the optimal scheduling of periodic tests and maintenance for reliable redundant components, Reliability Engineering and System Safety, 91, 66-72, 2006.

[31] NUREG-75/014, Reactor Safety: An Assessment of Accident Risk in US Commercial Nuclear Power Plants, WASH-1400, U.S. Nuclear Regulatory Commission, Washington, DC, 1975.

[32] Fleming, K.N., A reliability model for common mode failures in redundant safety systems, General Atomic Report, GA-13284, Pittsburgh, PA, 1974.

[33] Vesely, W.E., Estimating common cause failure probabilities in reliability and risk analysis: Marshall-Olkin specializations. In Nuclear Systems Reliability Engineering and Risk Assessment. J.B. Fussell and G.R. Burdick, eds. SIAM, Philadelphia, 314-341, 1977.


[34] Guo, H., Yang, X., Simple Reliability Block Diagram Method For Safety Integrity Verification, Reliability Engineering and System Safety, 92, 1267-1273, 2007.

[35] NUREG/CR-4780 (A. Mosleh, K.N. Fleming, G.W. Parry, H.M. Paula, D.H. Worledge, and D.M. Rasmuson), Procedures for Treating Common Cause Failures in Safety and Reliability Studies, Vol. 1: Procedural Framework and Examples. U.S. Nuclear Regulatory Commission, Washington, DC, 1988.

[36] Hokstad, P., Corneliussen, K., Improved common cause failure model for IEC 61508. SINTEF report STF38 A00420, 2000.

[37] Rausand, M., Vatn, J., Reliability modeling of surface controlled subsurface safety valves, Reliability Engineering and System Safety, 61, 159-166, 1998.

[38] Lundteigen, M.A., Rausand, M., Partial stroke testing of process shutdown valves: How to determine the test coverage, Journal of Loss Prevention in the Process Industries, 21, 579-588, 2008.

[39] McCalley, J.D., Fu, W., Reliability of Special Protection Systems, IEEE Transactions on Power Systems, 14(4), 1400-1406, 1999.

[40] Nakagawa, T., Maintenance Theory of Reliability, Springer, 2005.
