Design of Fail-Safe Computer Systems

EECE499 Computers and Nuclear Energy Electrical and Computer Eng Howard University Dr. Charles Kim Fall 2013 Webpage: www.mwftr.com/CNE.html



Page 1: Design of  Fail-Safe Computer Systems

EECE499 Computers and Nuclear Energy
Electrical and Computer Eng
Howard University
Dr. Charles Kim
Fall 2013

Webpage: www.mwftr.com/CNE.html

Page 2: Design of  Fail-Safe Computer Systems

Design of Fail-Safe Computer Systems

Yonatan Yilma

October 2, 2013

Page 3: Design of  Fail-Safe Computer Systems

Intro

Previous presentations:
- The computer system was broken down into its basic hardware and software components.
- The ways in which these components could fail and produce a system mishap.

This presentation:
- The design steps that can be taken to make the basic computer system fail-safe.
- Measures are discussed on a component-by-component basis.
- Reliability and quality improvements.

Page 4: Design of  Fail-Safe Computer Systems

DESIGNING THE FAIL-SAFE SYSTEM

Layers of Protection
- Incorporate internal safety and warning devices
- Incorporate external safety devices
- Improve component reliability and quality

Safety devices
- Computer component vs. overall system safety

Page 5: Design of  Fail-Safe Computer Systems

Fail-Safe Versus Fail-Operate

Both employ failure detection capability.

Fail-safe system: In the event of failure, it will revert to a non-operating state that will not cause a mishap.

Fail-operate system: Must detect the existence of the fault or failure as well, but must additionally isolate the offending component and reconfigure itself so that safe operation will continue without noticeable interruption.

In practice, system designs usually employ a mix of the fail-safe and fail-operate approaches.
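The difference between the two responses can be sketched in a few lines of Python. This is only an illustration: the policy strings, the state names, and the idea of a list of spare components are assumptions made for the example, not part of the presentation.

```python
from typing import List, Optional, Tuple

def respond_to_failure(policy: str, spares: List[str]) -> Tuple[str, Optional[str]]:
    """Return (system_state, active_component) after a failure has been detected.

    `policy` and the state names are hypothetical labels for this sketch.
    """
    if policy == "fail-operate" and spares:
        # Fail-operate: isolate the failed component and continue on a spare.
        return ("OPERATING", spares[0])
    # Fail-safe (or fail-operate with nothing left to switch to):
    # revert to a non-operating state that cannot cause a mishap.
    return ("SAFE_SHUTDOWN", None)
```

A fail-operate design that has run out of spares falls back to the fail-safe response, which is one way the two approaches end up mixed in practice.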

Page 6: Design of  Fail-Safe Computer Systems

Fail-Safe System Design Guidelines

It is common practice to select and interconnect the components so that the failure of any component will automatically cause the system to revert to a fail-safe state.

Page 7: Design of  Fail-Safe Computer Systems

Fail-Safe Design Approach

The approach is to make hardware and software modifications to the basic computer system so that the new, modified system can

1) detect the presence of faults or the occurrence of failures and,

2) upon detecting these faults or failures, reconfigure itself to a safe state.

Page 8: Design of  Fail-Safe Computer Systems

SIMPLEX ARCHITECTURE

It is a widely held belief that a computer system must employ redundant components to be fail-safe. This is false.

A system that does not employ redundancy is frequently referred to as a simplex system.

The overall approach is to tackle each component one at a time and show how to deal with its failure modes.

Page 9: Design of  Fail-Safe Computer Systems

Application Failures

Prevention of application failures is the primary safety function of both computer control systems and computer safety systems.

Modifying the computer system so that it will prevent these application failures from occurring is done systematically through the following steps:

1. Define the physical measurements that can be made on the application which will indicate that it is approaching a failure condition.

2. Select the appropriate sensors for making these measurements and interface them to the computer.

3. Select effectors that can be commanded to eliminate or arrest the conditions leading to the application failure and interface them to the computer.

4. Design and install software which will continuously monitor the output of the sensors selected in Step 2. If the software detects a fault or the onset of failure, it will signal one or more of the effectors selected in Step 3 to arrest the failure onset.
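The four steps above can be sketched as a small monitoring routine. This is a minimal illustration, not the presenter's implementation: the `Monitor` record, the `scan` function, and the idea of a single upper limit per measurement are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Monitor:
    """One monitored measurement (all field names are hypothetical)."""
    name: str
    read_sensor: Callable[[], float]  # Step 2: sensor interfaced to the computer
    limit: float                      # Step 1: value indicating approach to failure
    effector: Callable[[], None]      # Step 3: action that arrests the condition

def scan(monitors: List[Monitor], notify: Callable[[str], None]) -> List[str]:
    """Step 4: one pass of the monitoring software over every sensor.

    Returns the names of the monitors whose limit was reached.
    """
    tripped = []
    for m in monitors:
        value = m.read_sensor()
        if value >= m.limit:
            m.effector()  # command the effector to arrest the failure onset
            notify(f"{m.name}: reading {value} reached limit {m.limit}")
            tripped.append(m.name)
    return tripped
```

In a real-time system `scan` would be called once per frame; the `notify` callback stands in for whatever warning channel the operator interface provides.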

Page 10: Design of  Fail-Safe Computer Systems

Application Failure Control Examples

Page 11: Design of  Fail-Safe Computer Systems

Operator Interface: Warnings and Advisories

When software has detected the onset of failure and automatically signaled the effectors to control it, software should also notify the operator, usually by visual and/or aural means, that the event was detected and controlled.

In addition to its failure detection and reconfiguration functions, software can cue the operator on additional safety steps not directly related to the system that may have to be taken.

Page 12: Design of  Fail-Safe Computer Systems

Emergency Procedures and Detecting Non-application Faults and Failures

As a general statement, each potential failure event should have a corresponding emergency procedure for the operator(s) to follow if the event occurs.

The major challenge in the design of the real-time safety-critical system is not how to deal with application failures but how to handle failures in the hardware and software components that are monitoring and controlling the application in real time.

Page 13: Design of  Fail-Safe Computer Systems

Sensor Failure Detection

The designer has to know, in advance, what the correct sensor output should be when the system is run in real time.

When there is no sensor failure, the actual value and the commanded value will match. If there is a sensor failure, there will be a mismatch and the implemented software will detect the failure.
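A minimal sketch of this check for a discrete sensor follows. The command names and the expected readings in the table are hypothetical; the point is only that the expected output is fixed at design time and compared with the actual output at run time.

```python
# Hypothetical design-time table: command issued -> sensor reading expected.
EXPECTED_READING = {"OPEN": 1, "CLOSE": 0}

def failure_detected(command: str, sensor_reading: int) -> bool:
    """True when the actual sensor output does not match the value
    the designer established in advance for this command."""
    return sensor_reading != EXPECTED_READING[command]
```

A mismatch only shows that something in the loop has failed; the check by itself does not say which component.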

Page 14: Design of  Fail-Safe Computer Systems

Sensor Failure Isolation.

The only conclusion that can be drawn is "a failure has occurred." One cannot say "a sensor failure has occurred" because other failures in the system can produce exactly the same symptom.

Example of sensor failure detection

Page 15: Design of  Fail-Safe Computer Systems

Failure Isolation in the Simplex System

Software written to detect failure of a given component may also detect failures that might occur in other components within the system.

In a simplex system, software can detect the occurrence of a component's failure but cannot tie the event to that specific component.

Page 16: Design of  Fail-Safe Computer Systems

Detecting Analog Sensor Failures

In principle the same approach can be applied to sensors that generate analog signals.

With analog signals, however, the commanded and measured values will not match exactly even when nothing has failed. As part of the software design, it is therefore necessary to establish positive and negative levels, called thresholds, within which the error in the non-faulted system can be expected to lie.

The function of software then is to monitor this error in real time, comparing its value relative to the threshold levels.

If a major sensor failure occurs, the error will exceed the threshold limits, in which event software can detect the failure and issue a command to stop further motion.
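The threshold check can be sketched as follows. The function names and the choice to report the index of the first out-of-band sample are assumptions made for the illustration; the threshold values themselves would come from analysis of the non-faulted system.

```python
from typing import Optional, Sequence

def error_within_thresholds(commanded: float, measured: float,
                            lower: float, upper: float) -> bool:
    """True when the error (commanded - measured) lies inside the band
    [lower, upper] expected for the non-faulted system."""
    error = commanded - measured
    return lower <= error <= upper

def first_violation(commanded: Sequence[float], measured: Sequence[float],
                    lower: float, upper: float) -> Optional[int]:
    """Return the index of the first sample whose error leaves the band,
    or None if every sample stays inside it."""
    for i, (c, m) in enumerate(zip(commanded, measured)):
        if not error_within_thresholds(c, m, lower, upper):
            return i  # the caller would command a stop at this point
    return None
```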

Page 17: Design of  Fail-Safe Computer Systems
Page 18: Design of  Fail-Safe Computer Systems

State Estimator

The software scheme for detecting sensor failures can be refined considerably by employing a mathematical concept called a state estimator.

A state refers to a physical parameter such as a position, velocity, pressure, temperature, etc., existing within the application.
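As a rough illustration, the sketch below estimates one state (a position) from the commanded velocity and compares each sensor reading with the prediction. The plant model (position changes by commanded velocity times the time step), the fixed correction gain, and the class and parameter names are all assumptions chosen to keep the example small; a real estimator would be derived from the actual application dynamics.

```python
class PositionEstimator:
    """Simple model-based estimator of one state, used for fault detection."""

    def __init__(self, initial_position: float, gain: float, threshold: float):
        self.x_hat = initial_position  # current estimate of the state
        self.gain = gain               # how strongly measurements correct the estimate
        self.threshold = threshold     # largest residual expected with no fault

    def step(self, commanded_velocity: float, measured_position: float,
             dt: float) -> bool:
        """Advance one frame; return True if the residual signals a failure."""
        predicted = self.x_hat + commanded_velocity * dt  # assumed plant model
        residual = measured_position - predicted
        fault = abs(residual) > self.threshold
        # Correct the estimate with the measurement only when it looks healthy.
        self.x_hat = predicted if fault else predicted + self.gain * residual
        return fault
```

The threshold has to be wide enough to absorb model error and disturbances, which is why it should be set from tests with real sensor outputs rather than chosen by guesswork.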

Page 19: Design of  Fail-Safe Computer Systems

Use of State Estimator for Sensor Fault Detection

Page 20: Design of  Fail-Safe Computer Systems

Sensor State Estimators

More important, control uncertainties, particularly external disturbances, may require that larger than desirable thresholds be set.

The planned use of a state estimator for sensor failure detection should always be qualified by analysis and testing with real sensor outputs.

Page 21: Design of  Fail-Safe Computer Systems

Reasonableness Tests

When sensors fail, they sometimes report engineering values that are clearly inconsistent with the associated physical process.

Software tests the validity of sensor outputs by establishing beforehand a band of "reasonable" values that each sensor should generate and verifying that each sensor output lies within this band during real-time operation.

Note that this approach does not work for the sensor that fails to a constant or offset output value.
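A reasonableness test reduces to a range check, as in this sketch. The function name and the temperature band used in the usage note are hypothetical.

```python
def is_reasonable(value: float, low: float, high: float) -> bool:
    """True when the sensor output lies inside the band of values
    established beforehand as physically plausible."""
    # A NaN reading fails both comparisons, so it is reported as unreasonable.
    return low <= value <= high
```

For a temperature sensor with an assumed plausible band of -40 to 150, a reading of 900 is rejected. A sensor stuck at 25 passes every check, which is the limitation noted above.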

Page 22: Design of  Fail-Safe Computer Systems

Informational Redundancy

Sometimes a sensor value can be directly related through physical laws to other sensor values.

Dependent Sensor Values
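As one hypothetical instance of such a relationship, suppose a system has separate voltage, current, and power sensors. The law P = V * I then ties the three readings together, and software can check that they agree within a tolerance. The choice of sensors, the function name, and the 5% tolerance are assumptions made for this sketch.

```python
def power_readings_consistent(voltage: float, current: float, power: float,
                              rel_tol: float = 0.05) -> bool:
    """True when the power reading agrees with voltage * current
    to within a relative tolerance."""
    expected = voltage * current
    # Guard against a zero expected value when scaling the tolerance.
    return abs(power - expected) <= rel_tol * max(abs(expected), 1e-9)
```

A disagreement shows that one of the three readings is wrong, but not which one.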

Page 23: Design of  Fail-Safe Computer Systems

Analytical Redundancy

This technique is similar to informational redundancy except that it is performed on pairs of sensors that measure physical parameters which are different but are analytically related to each other through a common parameter.

The common parameter is time.

Since software can measure time (using its internal timer or frame counter) it can verify that the two sensor readings are consistent over the time span.
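For a position sensor and a rate sensor, the check can be sketched as below, assuming the rate is constant over the interval. The function and parameter names are hypothetical; the elapsed time would come from the software timer or from the frame count multiplied by the frame period.

```python
def position_and_rate_consistent(pos_start: float, pos_end: float,
                                 rate: float, elapsed: float,
                                 tol: float) -> bool:
    """True when the observed change in position matches the change
    implied by the rate sensor over the elapsed time (constant rate assumed)."""
    observed_change = pos_end - pos_start
    implied_change = rate * elapsed
    return abs(observed_change - implied_change) <= tol
```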

Page 24: Design of  Fail-Safe Computer Systems

Complementary Filter

In most systems, physical rates are not constant.

As a result, control system engineers can resort to a software scheme known as a complementary filter, which allows the rate to be integrated over short time periods and compared with short-term position changes.
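The sketch below uses one common first-order form of the complementary filter: the estimate is propagated with the integrated rate over each short step and pulled slowly toward the position sensor, and the difference between the position reading and the estimate serves as the fault residual. The blend factor `alpha`, the threshold, and the class name are assumptions for the illustration, not values from the presentation.

```python
class ComplementaryMonitor:
    """First-order complementary filter used to cross-check a position
    sensor against a rate sensor."""

    def __init__(self, initial_position: float, alpha: float = 0.98,
                 threshold: float = 0.3):
        self.estimate = initial_position
        self.alpha = alpha          # weight on the integrated-rate path
        self.threshold = threshold  # largest residual expected with no fault

    def step(self, measured_position: float, measured_rate: float,
             dt: float) -> bool:
        """Advance one frame; return True if the two sensors disagree."""
        # Short term: trust the rate sensor, integrated over one step.
        propagated = self.estimate + measured_rate * dt
        # Long term: blend in the position sensor to stop drift.
        self.estimate = (self.alpha * propagated
                         + (1.0 - self.alpha) * measured_position)
        residual = measured_position - self.estimate
        return abs(residual) > self.threshold
```

Because the rate is integrated step by step, the check remains valid when the rate changes, which the constant-rate comparison cannot handle.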

Page 25: Design of  Fail-Safe Computer Systems

Complementary Filter for Sensor Fault Detection

Page 26: Design of  Fail-Safe Computer Systems

Undetectable Sensor Failures

If failure of a sensor can lead to mishap conditions, then software must be able to positively identify that the sensor failure has occurred and effect a safety disconnect of the system.

Many times incorrect sensor behavior cannot be determined in real time. If the sensor is safety-critical, then redundant sensors will usually have to be employed.