design of fail-safe computer systems

EECE499 Computers and Nuclear Energy, Electrical and Computer Engineering, Howard University, Dr. Charles Kim, Fall 2013

Webpage: www.mwftr.com/CNE.html

Design of Fail-Safe Computer Systems

Yonatan Yilma

October 2, 2013

Intro

Previous presentations: the computer system was broken down into its basic hardware and software components, and into the ways in which these components could fail and produce a system mishap.

This presentation: the design steps that can be taken to make the basic computer system fail-safe.

Measures are discussed on a component-by-component basis.

Reliability and quality improvements.

DESIGNING THE FAIL-SAFE SYSTEM

Layers of Protection

Incorporate internal safety and warning devices

Incorporate external safety devices

Improve component reliability and quality

Safety devices: computer component vs. overall system safety

Fail-Safe Versus Fail-Operate

Both employ failure detection capability

Fail-safe system: in the event of failure, it will revert to a non-operating state that will not cause a mishap.

Fail-operate system: must also detect the existence of the fault or failure, but must additionally isolate the offending component and reconfigure itself so that safe operation will continue without noticeable interruption.

In practice, system designs usually employ a mix of the fail-safe and fail-operate approaches.

Fail-Safe System Design Guidelines

It is common practice to select and interconnect the components so that the failure of any component will automatically cause the system to revert to a fail-safe state.

Fail-Safe Design Approach

The approach is to make hardware and software modifications to the basic computer system so that the new, modified system can

1) detect the presence of faults or the occurrence of failures, and

2) upon detecting these faults or failures, reconfigure itself to a safe state.
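The two requirements above can be sketched as a small wrapper around a control output. This is only an illustration under stated assumptions, not a design taken from the presentation: the class name `FailSafeController`, the `SAFE_OUTPUT` value, and the idea of supplying fault checks as callables are all hypothetical.

```python
from typing import Callable, Iterable

SAFE_OUTPUT = 0.0  # hypothetical "non-operating" command, e.g. actuator power off


class FailSafeController:
    """Passes commands through until any fault check fires, then latches safe."""

    def __init__(self, fault_checks: Iterable[Callable[[], bool]]):
        self.fault_checks = list(fault_checks)
        self.safe_latched = False

    def output(self, command: float) -> float:
        # Requirement 1: detect the presence of faults or failures.
        if not self.safe_latched and any(check() for check in self.fault_checks):
            self.safe_latched = True
        # Requirement 2: once a fault has been seen, remain in the safe state.
        return SAFE_OUTPUT if self.safe_latched else command
```

The latch is a design choice made for this sketch: once a fault is detected, the output stays at the safe value even if the fault indication later clears, so an intermittent fault cannot silently return the system to operation.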

SIMPLEX ARCHITECTURE

It is a widely held belief that a computer system must employ redundant components to be fail-safe. This is false.

A system that does not employ redundancy is frequently referred to as a simplex system.

The overall approach is to tackle each component one at a time and show how to deal with its failure modes.

Application Failures

Prevention of application failures is the primary safety function of both computer control systems and computer safety systems.

Modifying the computer system so that it will prevent these application failures from occurring is done systematically through the following steps.

1. Define the physical measurements that can be made on the application which will indicate that it is approaching a failure condition.

2. Select the appropriate sensors for making these measurements and interface them to the computer.

3. Select effectors that can be commanded to eliminate or arrest the conditions leading to the application failure and interface them to the computer.

4. Design and install software which will continuously monitor the output of the sensors selected in step 2. If the software detects a fault or the onset of failure, it will signal one or more of the effectors selected in step 3 to arrest the failure onset.
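The four steps can be sketched as a data structure plus one pass of a monitoring loop. Everything named here (`MonitoredMeasurement`, `monitor_once`, the simple upper-limit test) is an assumption made for illustration. A real system would call the loop every frame and would likely use richer trip criteria.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MonitoredMeasurement:
    name: str                          # step 1: measurement indicating failure onset
    read_sensor: Callable[[], float]   # step 2: sensor interfaced to the computer
    limit: float                       # value beyond which failure is approaching
    effector: Callable[[], None]       # step 3: action that arrests the condition


def monitor_once(measurements: List[MonitoredMeasurement]) -> List[str]:
    """Step 4: one pass of the monitoring loop; returns names that tripped."""
    tripped = []
    for m in measurements:
        if m.read_sensor() > m.limit:
            m.effector()               # command the effector to arrest the onset
            tripped.append(m.name)
    return tripped
```

As a usage example, a hypothetical pressure measurement with a limit of 10.0 and an effector that opens a relief valve would trip when the sensor reads 12.0, while a temperature reading of 50.0 against a limit of 90.0 would not.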

Application Failure Control Examples

Operator Interface: Warnings and Advisories

When software has detected the onset of failure and automatically signaled the effectors to control the failure, it should also notify the operator, usually by visual and/or aural means, that the event was detected and controlled.

In addition to its failure detection and reconfiguration functions, software can cue the operator on additional safety steps not directly related to the system that may have to be taken.

Emergency Procedures and Detecting Non-application Faults and Failures

As a general statement, each potential failure event should have a corresponding emergency procedure for the operator(s) to follow if the event occurs.

The major challenge in the design of the real-time safety-critical system is not how to deal with application failures but how to handle failures in the hardware and software components that are monitoring and controlling the application in real time.

Sensor Failure Detection

The designer has to know, in advance, what the correct sensor output should be when the system is run in real time.

When there is no sensor failure, the actual value and the commanded value will match. If there is a sensor failure, there will be a mismatch and the implemented software will detect the failure.
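A minimal sketch of this comparison for a two-state (discrete) sensor is shown below. The function names and the requirement that the mismatch persist for several consecutive reads are assumptions added for the example. The source only states that a mismatch between the actual and commanded values reveals the failure.

```python
from typing import Callable


def mismatch(commanded: bool, reported: bool) -> bool:
    """True when the sensor reading disagrees with the commanded state."""
    return commanded != reported


def failure_detected(commanded: bool,
                     read_sensor: Callable[[], bool],
                     consecutive_reads: int = 3) -> bool:
    """Declare a failure only if every one of several reads in a row mismatches.

    The persistence count is a hypothetical guard against transient readings
    while the commanded action is still settling.
    """
    return all(mismatch(commanded, read_sensor()) for _ in range(consecutive_reads))
```

The result is deliberately named `failure_detected` rather than anything sensor-specific, since a mismatch alone does not show which component failed.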

Sensor Failure Isolation

When a mismatch is detected, the only justified conclusion is "a failure has occurred." One cannot say "a sensor failure has occurred" because other failures in the system can produce exactly the same symptom.

Example of sensor failure detection

Failure Isolation in the Simplex System

Software written to detect failure of a given component may also detect failures that might occur in other components within the system.

In a simplex system, software can detect the occurrence of a component's failure but cannot tie the event to that specific component.

Detecting Analog Sensor Failures

In principle the same approach can be applied to sensors that generate analog signals.

As part of the software design, it is therefore necessary to establish positive and negative levels called thresholds within which the error in the non-faulted system can be expected to lie.

The function of software then is to monitor this error in real time, comparing its value against the threshold levels.

If a major sensor failure occurs, the error will exceed the threshold limits, in which event software can detect the failure and issue a command to stop further motion.
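The threshold check for an analog sensor might look like the sketch below. The threshold values of ±0.5 and the function names are placeholders chosen for illustration. In practice the thresholds would come from analysis of the error in the non-faulted system.

```python
def error_within_thresholds(commanded, measured, lower, upper):
    """True when the error (commanded minus measured) lies inside the band."""
    error = commanded - measured
    return lower <= error <= upper


def first_violation(samples, lower=-0.5, upper=0.5):
    """Return the index of the first sample whose error leaves the band.

    `samples` is a sequence of (commanded, measured) pairs. At the returned
    index the software would issue its stop-motion command. Returns None if
    every sample stays inside the thresholds.
    """
    for i, (commanded, measured) in enumerate(samples):
        if not error_within_thresholds(commanded, measured, lower, upper):
            return i
    return None
```

For example, with samples `[(1.0, 0.9), (2.0, 1.8), (3.0, 0.0)]` the first two errors are small and the third (3.0) exceeds the upper threshold, so the violation is reported at index 2.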

State Estimator

The software scheme for detecting sensor failures can be refined considerably by employing a mathematical concept called a state estimator.

A state refers to a physical parameter such as a position, velocity, pressure, temperature, etc., existing within the application.

Use of State Estimator for Sensor Fault Detection

Sensor State Estimators

More important, control uncertainties, particularly external disturbances, may require that larger than desirable thresholds be set.

The planned use of a state estimator for sensor failure detection should always be qualified by analysis and testing with real sensor outputs.
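As a rough illustration of the idea, the sketch below runs a very simple first-order estimator alongside a position sensor and flags the first sample whose residual (sensor reading minus predicted state) exceeds a threshold. The plant model (position changes by command times `dt` each step), the correction `gain`, and the function names are all assumptions for the example, not the estimator used in the presentation.

```python
def estimator_residuals(commands, readings, dt=0.1, gain=0.2, x0=0.0):
    """Return sensor-minus-prediction residuals for a position sensor.

    Assumed plant model: position changes by command * dt each step.
    """
    x_hat = x0
    residuals = []
    for u, y in zip(commands, readings):
        x_pred = x_hat + u * dt      # predict the state from the model
        r = y - x_pred               # residual: sensor vs. prediction
        residuals.append(r)
        x_hat = x_pred + gain * r    # correct the estimate using the sensor
    return residuals


def first_fault_index(residuals, threshold):
    """Index of the first residual whose magnitude exceeds the threshold."""
    for i, r in enumerate(residuals):
        if abs(r) > threshold:
            return i
    return None
```

Because the estimate is corrected by the sensor itself, a fault that develops slowly can be partly absorbed into the estimate, and disturbances not captured by the model widen the residual. Both effects are reasons the thresholds need to be validated against real sensor outputs.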

Reasonableness Tests

When sensors fail, they sometimes report engineering values that are clearly inconsistent with the associated physical process.

Software tests the validity of sensor outputs by establishing beforehand a band of "reasonable" values that each sensor should generate and verifying that each sensor output lies within this band during real-time operation.

Note that this approach does not work for a sensor that fails to a constant or offset output value that still lies within the band.
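A reasonableness test reduces to a table of bands and a range check, as in the sketch below. The sensor names and limits are invented for the example.

```python
# Hypothetical sensors and "reasonable" bands, established before run time.
REASONABLE_BANDS = {
    "coolant_temp_C": (-40.0, 150.0),
    "tank_pressure_kPa": (0.0, 2000.0),
}


def is_reasonable(sensor: str, value: float) -> bool:
    """True when the reading lies inside the sensor's predefined band.

    A NaN reading fails both comparisons and is therefore rejected.
    """
    low, high = REASONABLE_BANDS[sensor]
    return low <= value <= high
```

A reading of 900.0 for `coolant_temp_C` is rejected, but a sensor stuck at 20.0 passes every check, which is exactly the limitation noted above.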

Informational Redundancy

Sometimes a sensor value can be directly related through physical laws to other sensor values.

Dependent Sensor Values
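One way to picture informational redundancy is with three sensors tied together by a physical law. The sketch assumes, purely for illustration, separate voltage, current, and power sensors on a DC load, related by P = V × I. The function name and the 5% tolerance are hypothetical.

```python
def power_consistent(voltage_v: float, current_a: float, power_w: float,
                     rel_tol: float = 0.05) -> bool:
    """Check the power reading against the value implied by voltage and current."""
    expected = voltage_v * current_a
    # The small floor avoids a zero tolerance when the expected power is zero.
    return abs(power_w - expected) <= rel_tol * max(abs(expected), 1e-9)
```

A failed check indicates that at least one of the three readings is wrong, but on its own it does not identify which one.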

Analytical Redundancy

This technique is similar to informational redundancy except that it is performed on pairs of sensors that measure physical parameters which are different but are analytically related to each other through a common parameter.

The common parameter is time.

Since software can measure time (using its internal timer or frame counter), it can verify that the two sensor readings are consistent over the time span.
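For a position sensor and a rate sensor, the consistency check over a time span can be sketched as follows. It assumes the rate is approximately constant over the span and that elapsed time is derived from a frame counter. The function name, parameters, and tolerance are illustrative.

```python
def consistent_over_span(pos_start: float, pos_end: float, rate: float,
                         frames: int, frame_period_s: float, tol: float) -> bool:
    """Compare the measured position change with rate * elapsed time."""
    elapsed_s = frames * frame_period_s      # time measured by the frame counter
    expected_change = rate * elapsed_s       # change implied by the rate sensor
    measured_change = pos_end - pos_start    # change seen by the position sensor
    return abs(measured_change - expected_change) <= tol
```

With 50 frames of 0.01 s and a rate reading of 4.0 units/s, the position should change by about 2.0 units. A position sensor that reports no change over that span would fail the check.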

Complementary Filter

In most systems, physical rates are not constant.

As a result, control system engineers can resort to a software scheme known as a complementary filter which allows the rate to be integrated over short time periods and compared with short term position changes.

Complementary Filter for Sensor Fault Detection
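The sketch below shows one simple discrete form of this idea, under assumptions of my own: both the position signal and the integrated rate are passed through the same first-order washout (high-pass) filter with time constant `tau`, so that only short-term changes are compared. The function name and the filter discretization are illustrative and are not taken from the presentation.

```python
def complementary_residuals(positions, rates, dt, tau):
    """Residual between short-term integrated rate and short-term position change.

    Both channels use the same washout coefficient, so consistent sensors give
    a residual near zero even when the rate is not constant.
    """
    a = tau / (tau + dt)        # discrete washout (high-pass) coefficient
    short_term_pos = 0.0        # washed-out position sensor
    short_term_int_rate = 0.0   # washed-out integral of the rate sensor
    residuals = []
    for k in range(1, len(positions)):
        short_term_pos = a * (short_term_pos + positions[k] - positions[k - 1])
        short_term_int_rate = a * (short_term_int_rate + rates[k] * dt)
        residuals.append(short_term_int_rate - short_term_pos)
    return residuals
```

In this form a constant bias `b` on the rate sensor drives the residual toward `b * tau`, so the choice of `tau` trades sensitivity to rate faults against tolerance of slow drift. The residual would then be compared with thresholds in the same way as any other error signal.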

Undetectable Sensor Failures

If failure of a sensor can lead to mishap conditions, then software must be able to positively identify that the sensor failure has occurred and effect a safety disconnect of the system.

Many times incorrect sensor behavior cannot be determined in real time. If the sensor is safety-critical, then redundant sensors will usually have to be employed.