
ISA France, 2012, Gruhn

The selection of logic solver and field device technologies and configurations to meet safety and availability requirements in the process industries

Paul Gruhn, P.E., ISA 84 Expert ICS Triplex | Rockwell Automation, Houston, Texas, USA

[email protected], 281-330-0393

Keywords

Safety Instrumented System, Safety Instrumented Function, Safety Integrity Level, Safe Failure Fraction, Fault Tolerance, ISA 84 (IEC 61511) standard

Abstract

When a restaurant has only two choices (e.g., hamburger or chicken sandwich), choosing is easy and fast. However, when a restaurant has a menu that is twelve pages long, choosing is neither easy nor fast. Designing a safety instrumented system is similarly problematic. The sheer number of choices available, such as configuration (e.g., single, dual, triple, quad), design options (e.g., certified vs. prior use, centralized vs. distributed), determining test intervals, and the multitude of different vendor products and technologies for both logic solvers (e.g., relays, solid state, programmable) and field devices (e.g., switches, transmitters) means that choosing and designing a system is no longer as easy and simple as it used to be back in the days of relays and discrete switches. This paper will review different design configurations and options for safety instrumented systems in the process industries and their impact on system performance.

Background Concepts

In order to compare logic solver and field device technologies and configurations, assess their impact on system performance, and understand the fault tolerance requirements listed in the standards, it is first necessary to review and define a few basic concepts: failure modes, safe failure fraction, hardware fault tolerance, and the real impact of redundancy.

Failure Modes

Safety system failures have long been categorized in two different modes: safe and dangerous. Safe failures result in nuisance trips and lost production. Dangerous failures are those where the system will not respond to an actual demand. SIL (Safety Integrity Level) is a measure of performance against dangerous failures only. In other words, knowing the SIL a function meets tells you nothing about its nuisance trip performance.


Safe Failure Fraction

Intelligent devices (i.e., those with microprocessors) are able to detect some failures. However, diagnostic coverage can never be 100%. Therefore, four failure categories may be defined, as shown in Figure 1.

Safe failure fraction (SFF), a term used in recent industry standards, is defined as safe failures (detected and undetected), plus dangerous detected failures, divided by the total of all failures. With the splits shown in Figure 1, which are merely an example for illustrative purposes, the safe failure fraction is 75%.
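The definition above can be written out as a short calculation. This is a minimal sketch: the function name is illustrative, and because the splits of Figure 1 are not reproduced in the text, the equal 25% split used below is an assumption chosen only because it reproduces the 75% figure quoted above.

```python
def safe_failure_fraction(safe_detected, safe_undetected,
                          dangerous_detected, dangerous_undetected):
    """SFF = (all safe failures + dangerous detected failures) / all failures.

    Inputs may be failure rates or percentages; only their ratios matter.
    """
    total = (safe_detected + safe_undetected
             + dangerous_detected + dangerous_undetected)
    if total <= 0:
        raise ValueError("total failure rate must be positive")
    return (safe_detected + safe_undetected + dangerous_detected) / total


# ASSUMED split (not taken from Figure 1): each category is 25% of failures.
sff = safe_failure_fraction(25, 25, 25, 25)
print(f"SFF = {sff:.0%}")
```

Only the dangerous undetected failures are excluded from the numerator, which is why improving diagnostics (moving failures from undetected to detected) raises the safe failure fraction.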

Hardware Fault Tolerance

Redundancy and fault tolerance are not the same. Redundancy is a vague term open to various interpretations. Anything more than one is redundant. The definition of fault tolerance is more specific.

A hardware fault tolerance of N (i.e., 0, 1 or 2) means that N+1 dangerous faults could cause a loss of the safety function. A non-redundant configuration (1oo1: one-out-of-one) has a fault tolerance of zero. A two-out-of-two (2oo2) configuration also has a fault tolerance of zero. One-out-of-two (1oo2) and two-out-of-three (2oo3) configurations have a fault tolerance of one. A fault tolerance of two requires either a 1oo3 or 2oo4 configuration.
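One way to check the configurations listed above is to note that an M-out-of-N (MooN) arrangement needs M working channels to trip, so it can tolerate N minus M dangerous faults. The sketch below uses that reading of MooN voting; the function name is illustrative and the formula is a common interpretation rather than one stated in this paper.

```python
def hardware_fault_tolerance(m, n):
    """Dangerous-fault tolerance of an M-out-of-N (MooN) voting arrangement.

    M of the N channels must respond to perform the shutdown, so the
    arrangement survives up to N - M dangerous channel faults.
    """
    if not 1 <= m <= n:
        raise ValueError("require 1 <= M <= N")
    return n - m


for m, n in [(1, 1), (2, 2), (1, 2), (2, 3), (1, 3), (2, 4)]:
    print(f"{m}oo{n}: hardware fault tolerance = {hardware_fault_tolerance(m, n)}")
```

The results agree with the values given in the text: 0 for 1oo1 and 2oo2, 1 for 1oo2 and 2oo3, and 2 for 1oo3 and 2oo4.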

Table 1 lists the fault tolerance requirements for software based devices in order to meet the different integrity levels based on the safe failure fraction, per IEC 61508.

Safe Failure Fraction    Hardware Fault Tolerance
                         0              1        2
< 60 %                   Not Allowed    SIL 1    SIL 2
60 % to < 90 %           SIL 1          SIL 2    SIL 3
90 % to < 99 %           SIL 2          SIL 3    SIL 4
> 99 %                   SIL 3          SIL 4    SIL 4

Table 1: Hardware Fault Tolerance Requirements for Type B Devices (per IEC 61508)

Devices are considered type B when:

a) the failure mode of at least one constituent component is not well defined, or
b) the behavior of the subsystem under fault conditions cannot be completely determined, or
c) there is insufficient dependable failure data from field experience to support claims for rates of failure for detected and undetected dangerous failures.

Figure 1: Failure Categories


Software based programmable devices/systems are considered type B. A typical safe failure fraction for a general purpose PLC (Programmable Logic Controller) would be in the range of 70-80%. Table 1 means that if your performance target were SIL 1, you could meet it with a 1oo1 or 2oo2 PLC configuration. If your target were SIL 2, you would need a 1oo2 or 2oo3 configuration.

However, if you had a safety PLC (one designed originally to meet the requirements of the IEC 61508 standard) that had a safe failure fraction of 95%, you could achieve SIL 2 with a 1oo1 configuration, and SIL 3 with a 2oo3 configuration. If the safe failure fraction were greater than 99%, you could achieve SIL 3 with a 1oo1 configuration. There are a variety of systems from different vendors that meet all of the above.
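The Table 1 lookups used in the two paragraphs above can be sketched as a small function. The names are illustrative, and treating a safe failure fraction of exactly 99% as belonging to the top band is an assumption about the boundary, since the table as printed only says "> 99 %".

```python
# Table 1 for Type B devices: (lower SFF bound, [max SIL at HFT 0, 1, 2]).
# None means "Not Allowed". Boundary handling at exactly 99% is an ASSUMPTION.
TYPE_B_MAX_SIL = [
    (0.99, [3, 4, 4]),
    (0.90, [2, 3, 4]),
    (0.60, [1, 2, 3]),
    (0.00, [None, 1, 2]),
]


def max_sil_type_b(sff, hft):
    """Highest SIL claimable for a Type B device per Table 1 (None = not allowed)."""
    if not 0.0 <= sff <= 1.0:
        raise ValueError("sff must be a fraction between 0 and 1")
    if hft not in (0, 1, 2):
        raise ValueError("Table 1 only covers a fault tolerance of 0, 1 or 2")
    for lower_bound, sils in TYPE_B_MAX_SIL:
        if sff >= lower_bound:
            return sils[hft]


print(max_sil_type_b(0.75, 0))   # general purpose PLC, 1oo1 or 2oo2
print(max_sil_type_b(0.75, 1))   # general purpose PLC, 1oo2 or 2oo3
print(max_sil_type_b(0.95, 0))   # safety PLC, 1oo1
print(max_sil_type_b(0.95, 1))   # safety PLC, 2oo3
print(max_sil_type_b(0.995, 0))  # SFF above 99%, 1oo1
```

The five calls correspond, in order, to the SIL 1, SIL 2, SIL 2, SIL 3 and SIL 3 cases described in the text.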

The Real Impact of Redundancy

Few terms are as misunderstood as redundancy, and its impact is just as misunderstood. Most people believe that if one is good, two must be better, three must be better than that, and since a couple of vendors offer quad, that must be the best. If marketers can do it with razor blades, why not safety systems? Strange as it may sound, dual is not always better than single, triple is not always better than dual, and what some offer as quad is not as quad as people might think. The impact of redundancy depends upon the failure mode.

Single (1oo1)

Start with a base case of a non-redundant one-out-of-one (1oo1) system. Assume a safe (nuisance trip) failure probability in one year of .04 (4%). You could think of it as 4 systems out of 100 causing a nuisance trip within a year, or 1 system in 25 causing a nuisance trip, or a mean time to fail safe (MTTFsafe) of 25 years (1/.04).

Assume a dangerous failure probability in one year of .02 (2%). You could think of it as 2 systems out of 100 not responding in a year, or 1 in 50 not responding in a year, or a mean time to fail dangerously (MTTFdanger) of 50 years (1/.02). These numbers are just for comparison purposes at this point.

Dual (1oo2)

A one-out-of-two (1oo2) configuration has the outputs wired in series (assuming closed and energized contacts, as shown in Figure 2). Either channel can shut the system down. Since there is twice as much hardware, there are twice as many nuisance trips. Therefore, the .04 of a single system doubles to .08. You could think of it as 8 systems out of 100 causing a nuisance trip within a year, or 1 system in 12.5 causing a nuisance trip, or a MTTFsafe of 12.5 years.

A 1oo2 configuration would fail to function in the dangerous mode only if both channels were to fail dangerously at the same time. If one were stuck, the other could still de-energize and shut down the system. What is the probability of two simultaneous failures? It is the probability of the single event squared (like two coins landing heads). So the probability of two channels failing at the same time is remote (0.02 x 0.02 = 0.0004). You could think of it as 4 systems out of 10,000 not responding in a year, or 1 in 2,500 not responding in a year, or a MTTFdanger of 2,500 years.


In other words, a 1oo2 configuration is very safe (the probability of a dangerous system failure is very small), but the system suffers twice as many nuisance trips as single, which is not desirable from a lost production standpoint.

Dual (2oo2)

A two-out-of-two (2oo2) configuration has the outputs wired in parallel. Both channels must de-energize in order to perform a shutdown. This system would fail dangerously if a single channel had a dangerous failure. Since this configuration has twice as much hardware compared to single, it has twice as many dangerous failures. Therefore the .02 of a single configuration doubles to .04. You could think of it as 4 systems out of 100 not responding in a year, or 1 in 25 not responding in a year, or a MTTFdanger of 25 years.

For this configuration to have a nuisance trip, both channels would have to suffer safe failures at the same time. As before, the probability of two simultaneous failures is the probability of a single event squared. Therefore, nuisance trip failures in this configuration are unlikely (0.04 x 0.04 = 0.0016). You could think of it as 16 systems out of 10,000 causing a nuisance trip within a year, or 1 system in 625 causing a nuisance trip, or a MTTFsafe of 625 years.

So a 2oo2 configuration protects against nuisance trips (i.e., the probability of safe failures is very small), but the system is less safe than single, which is not desirable from a safety standpoint. This is not to imply that 2oo2 configurations are bad or should not be designed. If the probability of failure on demand meets the overall safety integrity level requirements, then the configuration is safe enough.

Figure 2: The Impact of Redundancy
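The single and dual comparisons above follow from two rules: where either channel alone can cause the outcome, the yearly probability doubles, and where both channels must fail together, it is squared. The sketch below applies those rules to the base-case numbers from the text (0.04 safe, 0.02 dangerous). The function names are illustrative, and like the text the sketch ignores common cause and treats MTTF as simply 1/probability.

```python
def mttf_years(p_fail_per_year):
    """Mean time to failure in years, using the text's 1/p shorthand."""
    return 1.0 / p_fail_per_year


def compare_single_and_dual(p_safe, p_dangerous):
    """Yearly nuisance-trip and dangerous-failure probabilities per configuration."""
    return {
        # One channel: base case.
        "1oo1": {"nuisance": p_safe, "dangerous": p_dangerous},
        # Either channel can trip (doubles nuisance); both must fail dangerously.
        "1oo2": {"nuisance": 2 * p_safe, "dangerous": p_dangerous ** 2},
        # Both must trip (squares nuisance); either dangerous failure defeats it.
        "2oo2": {"nuisance": p_safe ** 2, "dangerous": 2 * p_dangerous},
    }


results = compare_single_and_dual(p_safe=0.04, p_dangerous=0.02)
for name, p in results.items():
    print(f"{name}: nuisance {p['nuisance']:.4f} "
          f"(MTTFsafe {mttf_years(p['nuisance']):,.1f} yr), "
          f"dangerous {p['dangerous']:.4f} "
          f"(MTTFdanger {mttf_years(p['dangerous']):,.1f} yr)")
```

Running it reproduces the MTTF values worked out in the text: 25 and 50 years for 1oo1, 12.5 and 2,500 years for 1oo2, and 625 and 25 years for 2oo2.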

Triple (2oo3)

Triple Modular Redundant (TMR) concepts were developed in the 1970s through research with NASA (National Aeronautics and Space Administration) and released as commercial products in the early to mid 1980s. The reason for triplication back then was very simple: early computer-based systems had limited diagnostics. For example, if there were only two signals and they disagreed, it was not always possible to determine which one was correct. Adding a third channel solved the problem. One can assume that a channel in disagreement has an error and can simply be outvoted by the other two. A two-out-of-three (2oo3) configuration is a majority voting system. Whatever two or more channels say, that is what the system does.

What initially surprises people is that a 2oo3 system has a greater nuisance trip rate than a 2oo2 system, and a greater probability of a dangerous failure than a 1oo2 system. (Refer to Figure 2 once again to compare the numbers.) Some people initially think, "Wait a minute, that can't be!" Actually, it is intuitively obvious; you just have to think about it for a moment.

How many simultaneous failures does a 1oo2 configuration need in order to have a dangerous failure? Two. How many simultaneous failures does a 2oo3 configuration need in order to have a dangerous failure? Two. A triplicated configuration has more hardware, hence three times as many dual failure combinations! (A+B, A+C, B+C)

How many simultaneous failures does a 2oo2 configuration need in order to suffer a nuisance trip? Two. How many simultaneous failures does a 2oo3 configuration need in order to suffer a nuisance trip? Two. Same thing, a triplicated configuration has three times as many dual failure combinations. A triplicated configuration is actually somewhat of a tradeoff. Overall, it is good in both modes, but not as good as the two different dual systems. However, a traditional dual system is either good in one mode or the other, not both.

Note: all of the above comparisons ignore common cause (i.e., a single stressor or failure that makes multiple components fail at the same time).
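The "three times as many dual failure combinations" argument can be checked by enumerating the channel pairs. This is a rough sketch under the same simplifications as the text (no common cause, triple failures neglected, base-case probabilities of 0.04 safe and 0.02 dangerous). The function name is illustrative, and since Figure 2 is not reproduced here, the printed values are the result of this approximation rather than numbers copied from the figure.

```python
from itertools import combinations


def two_out_of_three(p_safe, p_dangerous, channels=("A", "B", "C")):
    """Approximate yearly failure probabilities for a 2oo3 voting arrangement.

    Any pair of channels failing the same way outvotes the third, so each
    probability is (number of pairs) x (single-channel probability squared).
    """
    pairs = list(combinations(channels, 2))  # (A, B), (A, C), (B, C)
    p_nuisance = len(pairs) * p_safe ** 2
    p_danger = len(pairs) * p_dangerous ** 2
    return pairs, p_nuisance, p_danger


pairs, p_nuisance, p_danger = two_out_of_three(p_safe=0.04, p_dangerous=0.02)
print("dual failure combinations:", ["+".join(pair) for pair in pairs])
print(f"2oo3 nuisance:  {p_nuisance:.4f} per year (2oo2: {0.04 ** 2:.4f})")
print(f"2oo3 dangerous: {p_danger:.4f} per year (1oo2: {0.02 ** 2:.4f})")
```

Under these assumptions the 2oo3 arrangement comes out three times worse than 2oo2 for nuisance trips and three times worse than 1oo2 for dangerous failures, while still being far better than a single channel in both modes.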

1oo2D

If you look carefully at the numbers in Figure 2 you can see that the 1oo2 configuration is safer than 2oo3, and the 2oo2 configuration offers better nuisance trip performance than 2oo3. If a dual configuration could be designed to provide the best performance of both dual systems, such a system could outperform a triplicated system, at least in theory.

Improvements made in both hardware and software since the early 1980s mean that failures in dual redundant computer-based systems can now generally be diagnosed well enough to tell which of two channels is correct if they disagree. The industry refers to this relatively newer dual design as 1oo2D. The D stands for diagnostics (usually tied to some form of secondary outputs).

When Simplex is Not Really Simplex

So how can a simplex configuration get certified to SIL 3? Simple: diagnostics. As shown in Table 1, if the safe failure fraction can be shown to exceed 99%, the system can meet SIL 3 with a fault tolerance of 0. However, how does such a system achieve such a high level of diagnostics? Simple: redundancy.

What?! Yes, redundancy. Just because a system has a single processor or I/O module does not mean it is actually simplex. In fact, there are redundant circuits and/or processing, operating in either a 1oo2 or 2oo2 configuration. One such system utilizes diverse software compiled from the same source code and processed in a different manner. In effect, 1oo2 processing.

As shown earlier, while 1oo2 is safe (which is all the standards and certification agencies cover), such systems result in more nuisance trips. Whenever the channels disagree (and they will at some point), the system will shut down. Safe, but not very available. This may be fine for a machinery application (which is what some of these systems were originally designed for) where one of thirty punch presses going down will not have a major impact on overall operations. However, applying such a system to a refinery, where downtime can cost over $1,000,000/day, is a completely different matter. Uptime is often just as important as safety. The terms used for this type of performance are availability and/or MTTFsp (Mean Time To Failure, spurious).

When Quad is Not Really Quad

Some systems are promoted as quad (2oo4D) redundant. Prospective users are cautioned to investigate how much of these systems is actually quad. These systems were originally developed as 1oo2D. Due to limitations with their degraded run-time restrictions (i.e., how long they were allowed to continue operating with a detected failure, such as 72 hours), they ran into marketing and sales difficulties from their triplicated competitors who did not have such stringent restrictions. The dual vendors' solution for this problem was to make the processors more redundant (quad). I/O modules are still either simplex or dual redundant, not quad. The quad systems have the same safety rating and nuisance trip performance as they did with a 1oo2D configuration; only the degraded run-time restriction changed. Yet quad sounds like more and better than triplicated.

Centralized vs. Distributed Logic Solvers

Some systems (and most early triplicated systems) were designed for relatively large I/O (input and output) applications. These systems are relatively expensive and difficult to justify for low I/O count systems (e.g., < 100 I/O). In order to justify and implement such systems in the past, many users combined what were formerly separate, smaller systems into one larger centralized system. One monolithic centralized system may be economical and suitable for your needs, especially for larger I/O counts. (The same could be said for most early DCS (Distributed Control Systems).) However, a number of very economical systems are now available for small I/O counts, making them cost effective to implement in the originally desired distributed manner. Such a design can minimize the impact of a common cause failure (i.e., one that could affect the entire facility if everything were handled in one system), as well as lower the cost of field wiring (as the distributed systems can now be located much closer to the field devices). Such systems can utilize standard networks to pass safety critical data back and forth between controllers.


Fault Tolerance of Field Devices

Hardware fault tolerance was defined earlier in this paper. Table 2 (below) appears in the 84/61511 standard and describes the required level of fault tolerance for field devices for different SILs. However, there are exceptions to every rule and the standard describes cases where the fault tolerance numbers may be decreased by one in some circumstances, yet must be increased by one in other circumstances.

The financial impact of redundant field devices is significant. The installed cost of a second transmitter is approximately $10,000. The installed cost for a second valve is significantly higher. Therefore, simply going from SIL 1 to SIL 2 may increase the cost of a single function by $40,000. The additional costs of going from SIL 2 to SIL 3 are even greater.

Safety Integrity Level    Minimum Hardware Fault Tolerance Requirement
1                         0 (i.e., 1oo1, 2oo2)
2                         1 (i.e., 1oo2, 2oo3)
3                         2 (i.e., 1oo3)
4                         Special Requirements Apply - See IEC 61508

Table 2: Minimum Hardware Fault Tolerance Requirements for Field Devices

Logic & Field Device Design Options for Safety and Availability

Figures 3 through 5 graphically compare the nuisance trip (MTTFsp: Mean Time To Fail spurious) and safety performance (RRF: Risk Reduction Factor, which is 1/PFD (Probability of Failure on Demand)) of different sensor, logic and final element configurations. The exact numbers for each are shown in Table 3. However, quantification does not reveal everything. SIL 2 is usually the highest requirement in most process industry applications. A simplex (1oo1) system can be designed to meet this requirement, and is the lowest cost design. Yet a fault tolerant SIL 3 rated logic solver can still offer advantages. For example, if you live in a town with 100 people, and there is one central bank in town, and every family contributes $100,000, you expect a certain level of security from that bank. Yet if you live in a city with 1,000,000 people, and there is only one central bank in town, and every family contributes $100,000, the risk to each family is exactly the same, yet you would expect a different (higher) level of security from the larger centralized bank. The concept is similar if you are combining 1,000 functions in one logic solver (vs. only 10).


Figure 3: Performance of Sensors

Figure 4: Performance of Logic Solvers


Figure 5: Performance of Valves


Table 3: Performance of Different Sensor, Logic and Valve Configurations

Note that while a logic solver may have a RRF of 1,300, this does not mean it is suitable for use in SIL 3 applications (a Risk Reduction Factor of 1,000 to 10,000 for an entire Safety Instrumented Function). Logic solvers are usually allocated 10-15% of the total Probability of Failure on Demand for a function. (A function consists of a sensor, logic and final element.) A logic solver number of 1,300 therefore means it is suitable for use in SIL 2. This is graphically represented in Figure 4.
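The allocation argument above can be made concrete with a little arithmetic. This is a sketch of one reading of the paragraph: if the logic solver's PFD (1/RRF) is only 10-15% of the function's total PFD, the total can be back-calculated and compared against the SIL bands. The function names are illustrative, and the bands other than SIL 3 (which the text gives as 1,000 to 10,000) are the usual decade ranges for low demand mode, stated here as an assumption.

```python
def sil_band(rrf):
    """SIL whose Risk Reduction Factor band contains rrf (0 if below SIL 1).

    ASSUMED bands: SIL 1 = 10-100, SIL 2 = 100-1,000,
    SIL 3 = 1,000-10,000, SIL 4 = 10,000 and above.
    """
    for sil, lower_bound in ((4, 10_000), (3, 1_000), (2, 100), (1, 10)):
        if rrf >= lower_bound:
            return sil
    return 0


def implied_function_rrf(logic_solver_rrf, logic_share_of_pfd):
    """RRF of the whole function if the logic solver uses the given PFD share."""
    logic_pfd = 1.0 / logic_solver_rrf
    total_pfd = logic_pfd / logic_share_of_pfd
    return 1.0 / total_pfd


for share in (0.10, 0.15):
    rrf = implied_function_rrf(1300, share)
    print(f"logic solver at {share:.0%} of total PFD: "
          f"function RRF about {rrf:.0f}, SIL {sil_band(rrf)}")
```

With a logic solver RRF of 1,300, the implied RRF of the whole function is roughly 130 to 195, which falls in the SIL 2 band even though 1,300 taken on its own looks like a SIL 3 number.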


Assumptions:

1. Switches: 30 year safe & dangerous MTTF, 0% diagnostic coverage
2. Transmitters: 60 year safe & dangerous MTTF, 30% diagnostic coverage simplex, 90% dual, 99% triplicated
3. Safety transmitters: 60 year safe & dangerous, 95% diagnostic coverage simplex, 99% dual
4. Relays: 300 year MTTF, 95% safe, 0% diagnostic coverage
5. Standard PLC: 20 year CPU MTTF, 60% safe, 85% diagnostic coverage; 50 year I/O MTTF, 75% safe, 20% coverage
6. Safety PLC: Similar, but 95% diagnostic coverage for 1oo1D, 99% for 1oo2D and 2oo3
7. Valves: 40 year safe & dangerous MTTF, 0% diagnostic coverage, 80% w/ partial stroke test
8. 1 year manual proof test
9. 10% common cause Beta value (for redundant configurations)

Conclusions

1. Redundancy is not the magic answer for safety, diagnostics is.
2. Diagnostics of valves is obtained through partial stroke testing.
3. A properly designed simplex (1oo1) system can meet SIL 2 for the lowest cost. (SIL 2 is often the highest target selected in the process industries.)
4. Quantification of performance is a useful tool, but cannot account for everything. All models are wrong; some are just less wrong than others.

References

1. IEC 61508, Functional safety of electrical/electronic/programmable electronic safety-related systems, 2010 (2nd edition)
2. ANSI/ISA-84.00.01-2004 (IEC 61511 Mod), Functional Safety: Safety Instrumented Systems for the Process Sector, 2004
3. Safety Shutdown Systems: Design, Analysis and Justification, 2nd edition, Gruhn & Cheddie, ISA press, 2005
4. Things to consider when selecting a safety instrumented system, presented at the ISA Safety Division Symposium, April 2009
5. New trends for safety instrumented systems, Hydrocarbon Processing, April 2009
6. Not all safety integrity level 3 safety systems are the same, Hydrocarbon Processing, March 2006
7. Get full value from partial stroking, Chemical Processing, March 2007
8. Safety instrumented system design: valuable lessons learned, Hydrocarbon Processing, August 2000


Author Bio

Paul Gruhn is the Global Process Safety Consultant at ICS Triplex | Rockwell Automation in Houston, Texas. Paul is an ISA Fellow, a member of the ISA 84 standard committee, the developer and instructor of ISA courses on safety systems, and the primary author of the ISA textbook on the subject. Paul developed the first commercial safety system modeling program over 17 years ago. He has a B.S. degree in Mechanical Engineering from Illinois Institute of Technology, is a licensed Professional Engineer (P.E.) in Texas, and an ISA 84 Expert.
