ece 753: fault-tolerant computing

ECE 753: FAULT-TOLERANT COMPUTING

Kewal K. Saluja
Department of Electrical and Computer Engineering

Low Level Fault-Tolerance: Watchdog and Re-execution

Overview

• Introduction
• Watchdog techniques
  – Timers, watchdog processors, error model, control flow checking, memory access and assertion checking
• Re-execution for fault-tolerance
  – Basic techniques: RESO concept, program re-execution, instruction re-execution
  – Case studies: Fine grain parallel architecture (CRAY), SMT architecture, multiscalar architecture, chip multiprocessor
• Summary


Introduction

• References
  • Watchdog - [mahm:88]
  • Re-execution - [rotenberg:99], [rashid:00], [subra:10], [kala:13]
  • Sohi, Franklin, and Saluja, “A study of time-redundant fault-tolerant techniques for high-performance pipelined computers,” Proceedings FTCS-19, June 1989, pp. 436-443.


Introduction (contd.)

• Somewhat higher level than ECC and masking at circuit level

• Bordering between hardware and software (hardware often assisted by software)

• These are some of the very first fault-tolerance methods


Watchdog techniques

• Key concept
  – The actions of a process or processor are checked by another unit, normally hardware. Checked actions include whether the process is still active (alive), whether it is following an incorrect execution path, etc.

[Figure: a processor monitored by a watchdog]


Watchdog: Timers

• Check for aliveness
  – Processor resets the timer at certain intervals or on certain conditions
  – Timer raises error flag if not reset before it overruns

[Figure: the processor resets the timer; the timer drives the Error output]
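The aliveness check can be sketched in a few lines of Python. This is only an illustrative simulation in discrete clock ticks; the class name WatchdogTimer, the helper run, and the tick counts are assumptions made for the example and are not taken from the slides.

```python
class WatchdogTimer:
    """Count-down timer that latches an error flag unless it is reset in time."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.error = False

    def reset(self):
        """Called by the processor to signal that it is still alive."""
        self.remaining = self.timeout_ticks

    def tick(self):
        """Advance one clock tick; latch the error flag when the count runs out."""
        if self.remaining > 0:
            self.remaining -= 1
        if self.remaining == 0:
            self.error = True


def run(timer, reset_ticks, total_ticks):
    """Simulate total_ticks ticks; the processor resets the timer at reset_ticks."""
    for t in range(total_ticks):
        if t in reset_ticks:
            timer.reset()
        timer.tick()
    return timer.error


# A healthy processor resets every second tick; a hung one stops after tick 2.
healthy = run(WatchdogTimer(3), reset_ticks={0, 2, 4, 6, 8}, total_ticks=10)
hung = run(WatchdogTimer(3), reset_ticks={0, 2}, total_ticks=10)
```

In this sketch the error flag is latched, so it stays set until some outside agent (for example a system reset) clears it.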


Watchdog: Timers (contd.)

• Check for timeout
  – Processor sends a message and starts a timer; the second processor must reply within this time (hardware/software implementation)

[Figure: Processor A, Processor B, and a Timer]
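A software version of the timeout check might look like the sketch below. It assumes the two processors are modelled as threads that exchange messages through Python queues; the function names and the timeout values are illustrative only.

```python
import queue
import threading


def processor_b(inbox, outbox):
    """Healthy peer: answers the request it receives."""
    request = inbox.get()
    outbox.put(("ack", request))


def request_with_timeout(inbox, outbox, message, timeout_s):
    """Processor A: send a message, then wait at most timeout_s for the reply."""
    inbox.put(message)
    try:
        return outbox.get(timeout=timeout_s)
    except queue.Empty:
        return None  # timer expired: treat processor B as faulty


# Case 1: processor B is alive and replies.
to_b, from_b = queue.Queue(), queue.Queue()
worker = threading.Thread(target=processor_b, args=(to_b, from_b))
worker.start()
reply_alive = request_with_timeout(to_b, from_b, "ping", timeout_s=1.0)
worker.join()

# Case 2: processor B is hung (nothing services the queue).
to_b, from_b = queue.Queue(), queue.Queue()
reply_hung = request_with_timeout(to_b, from_b, "ping", timeout_s=0.05)
```

A real system would normally retry or escalate when the timer expires; here the caller simply sees None.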


Watchdog: Timers (contd.)

• Applications
  – Processor control systems (chemical, mechanical and other control systems)
  – Switching systems – messages sent or received often await certain length of time before they are repeated
  – Networks – email messages often have timeouts associated with them


Watchdog: Processors

• Architecture – can be complex but let us consider the following simple architecture

[Figure: Processor and Memory connected by a bus (data, address, control); the Watchdog (observer) monitors the bus]


Watchdog: Processors (contd.)

• What can it achieve?
  – Observe the address bus
    • Can observe the data
    • Can observe instructions
    • Can check the flow of program control
  – Need to know what kind of errors can occur to determine the capability of this method


Watchdog: Error models

• Experimental setup to develop error models applicable at this level
  – Processor-memory architecture
  – Inject faults (random errors) - in I/O processor, within processor (register file, states), within memory
  – Simulate
  – Also hardware was designed to inject such faults and study the impact/behavior


Watchdog: Error models (contd.)

• Conclusions of the studies
  – Program flow could change (branch to no branch, or vice versa)
  – Instruction fetched from data space
  – Access to non-existent memory space
  – Data fetched from instruction space
  – Illegal instruction
  – Writing in protected area (ROM)
• 60% of all faults could be detected by monitoring control flow
  – Thus we need to develop methods that are good at monitoring control flow


Watchdog: Control flow checking

• Basic principle
  – Analyze the program and extract control information
    • Branch free intervals
    • Subroutine calls
  – Assign signatures to branch free intervals and provide these signatures to the watchdog processor to check these values


Watchdog: Control flow checking (contd.)

• A simple example

Program                     Watchdog
start --------------------  receive start
branch-free code            observe bus, continue to form signature
check sig X --------------  check X against collected sig
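The start / observe / check exchange in this example can be sketched as follows. The sketch assumes a 16-bit additive checksum as the signature and made-up instruction words; the class SignatureWatchdog and its method names are hypothetical and only mirror the three steps of the example.

```python
class SignatureWatchdog:
    """Hypothetical watchdog that accumulates a 16-bit additive checksum."""

    MASK = 0xFFFF

    def __init__(self):
        self.collected = None  # None means "not inside a checked interval"

    def receive_start(self):
        """Program signals the start of a branch-free interval."""
        self.collected = 0

    def observe_bus(self, word):
        """Fold one instruction word seen on the bus into the signature."""
        if self.collected is not None:
            self.collected = (self.collected + word) & self.MASK

    def check(self, expected):
        """Compare the collected signature with the one sent by the program."""
        ok = self.collected == expected
        self.collected = None
        return ok


def run_block(watchdog, fetched_words, compiled_signature):
    watchdog.receive_start()
    for word in fetched_words:
        watchdog.observe_bus(word)
    return watchdog.check(compiled_signature)


block = [0x1A2B, 0x3C4D, 0x0F0F]           # assumed instruction words
compiled_signature = sum(block) & 0xFFFF   # computed ahead of time

fault_free_ok = run_block(SignatureWatchdog(), block, compiled_signature)
corrupted = [block[0], block[1] ^ 0x0004, block[2]]  # one bit flipped in flight
corrupted_ok = run_block(SignatureWatchdog(), corrupted, compiled_signature)
```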



• Details and variations
  – Structural integrity checking
    • Analyze the program control flow – create a program control flow graph
    • Assign unique identifier to the nodes of the graph
    • Provide control flow graph to the watchdog along with the identifiers
    • In case of branches, watchdog expects one of the many possible identifiers
    • Limitations
      – Performance impact – insertion of special instructions
      – Inability to detect data processing variations (e.g., an add changed to a sub)
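Structural integrity checking can be illustrated with a small sketch: the watchdog holds the control flow graph and verifies that each node identifier it receives is a legal successor of the previous one. The graph, the node identifiers, and the function name check_path are invented for this example.

```python
# Control flow graph given to the watchdog: node id -> legal successor ids.
CFG = {
    "A": {"B", "C"},  # A ends in a branch, so either successor is acceptable
    "B": {"D"},
    "C": {"D"},
    "D": set(),       # exit node
}


def check_path(cfg, entry, observed_ids):
    """Return the index of the first illegal identifier, or None if all are legal."""
    if not observed_ids or observed_ids[0] != entry:
        return 0
    for i in range(1, len(observed_ids)):
        if observed_ids[i] not in cfg[observed_ids[i - 1]]:
            return i
    return None


legal = check_path(CFG, "A", ["A", "B", "D"])
skipped_node = check_path(CFG, "A", ["A", "D"])       # jumped over B/C
wrong_branch = check_path(CFG, "A", ["A", "C", "B"])  # C can only go to D
```

Note that a path such as A, C, D passes the check even if the branch at A should have gone to B, and an add executed as a sub inside a node goes unnoticed, which matches the limitation listed above.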



• Details and variations (contd.)
  – Derived signature checking
    • Compiler identifies branch free intervals and generates signatures (such as a checksum) for these intervals
    • At run time these signatures are provided to the watchdog using tag bits to differentiate between regular instructions and watchdog messages
    • Watchdog monitors the bus, generates the signatures, and compares them with the signatures captured from the bus (compiled signatures)
    • Example: associate two tag bits with every memory word to differentiate between instructions and compiled signatures – when a tag for signature appears on the bus, the watchdog captures the tag and forces a NOP on the bus for the regular processor
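A rough model of the tag-bit scheme is sketched below. The tag encodings, the NOP encoding, and the use of an XOR checksum as the signature are assumptions chosen for the example; the slides do not specify them.

```python
TAG_INSTRUCTION = 0b00  # hypothetical encodings for the two tag bits
TAG_SIGNATURE = 0b01
NOP = 0x0000            # hypothetical NOP encoding


def watch_bus(memory_words):
    """Model the bus: return (words seen by the processor, mismatch addresses)."""
    running = 0               # signature generated by the watchdog at run time
    seen_by_processor = []
    mismatches = []
    for address, (tag, word) in enumerate(memory_words):
        if tag == TAG_SIGNATURE:
            # Compiled signature: the watchdog consumes it, the processor gets a NOP.
            if word != running:
                mismatches.append(address)
            running = 0
            seen_by_processor.append(NOP)
        else:
            running ^= word
            seen_by_processor.append(word)
    return seen_by_processor, mismatches


program = [
    (TAG_INSTRUCTION, 0x1234),
    (TAG_INSTRUCTION, 0x00FF),
    (TAG_SIGNATURE, 0x1234 ^ 0x00FF),  # inserted by the compiler
    (TAG_INSTRUCTION, 0x0F0F),
    (TAG_SIGNATURE, 0x0F0F),
]
clean_stream, clean_mismatches = watch_bus(program)

faulty = list(program)
faulty[1] = (TAG_INSTRUCTION, 0x00FE)  # single-bit error in an instruction
_, faulty_mismatches = watch_bus(faulty)
```

The two NOPs in clean_stream are the performance cost of this scheme: every signature word takes a fetch slot that does no useful work for the processor.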



• Details and variations (contd.)
  – Derived signature checking (contd.)
    • Coverage
      – Can detect random errors in instructions in branch free intervals (but aliasing can occur)
    • Overheads
      – Memory width increase due to tag bits
      – Memory increase due to signature insertions
      – Performance impact due to NOPs
    • Solutions
      – Using path signature method – reduces the number of signatures needed
      – Branch address hashing – merge signature and branch address
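The aliasing caveat under Coverage can be made concrete with a toy signature. The sketch assumes an XOR checksum and made-up instruction words; other signature functions alias under different error patterns.

```python
from functools import reduce
from operator import xor


def xor_signature(words):
    """XOR checksum over the instruction words of a branch-free interval."""
    return reduce(xor, words, 0)


good = [0x1234, 0x00FF, 0x0F0F]
single_error = [0x1234 ^ 0x0001, 0x00FF, 0x0F0F]
# The same bit flipped in two different words cancels out in the XOR.
double_error = [0x1234 ^ 0x0001, 0x00FF ^ 0x0001, 0x0F0F]

detected_single = xor_signature(single_error) != xor_signature(good)
detected_double = xor_signature(double_error) != xor_signature(good)
```

Here the single error is detected, while the double error produces the same signature as the fault-free interval and escapes.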


Watchdog: Mem access and assertion checks

• What to do about memory/data errors
  – Use ECC
  – Few other methods using watchdog
    • Check for non-existent memory addresses
    • Check for out-of-range addresses
    • Capability based checking for objects is also possible
    • Assertion based checking and sanity checks using watchdog (independent hardware) is also possible
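A watchdog-style address check might be sketched as a lookup against a memory map. The address ranges and the access kinds ("fetch", "read", "write") below are assumptions for illustration; a real watchdog would be configured for the actual memory layout.

```python
# Hypothetical memory map: (start, end exclusive, access kinds allowed).
MEMORY_MAP = [
    (0x0000, 0x4000, {"fetch"}),          # instruction space (ROM)
    (0x4000, 0x8000, {"read", "write"}),  # data space (RAM)
]


def check_access(address, kind):
    """Classify one bus access as 'ok', 'illegal <kind>', or 'non-existent address'."""
    for start, end, allowed in MEMORY_MAP:
        if start <= address < end:
            return "ok" if kind in allowed else "illegal " + kind
    return "non-existent address"


results = [
    check_access(0x0100, "fetch"),  # normal instruction fetch
    check_access(0x0100, "write"),  # writing in protected area (ROM)
    check_access(0x5000, "fetch"),  # instruction fetched from data space
    check_access(0x0200, "read"),   # data fetched from instruction space
    check_access(0x9000, "read"),   # access to non-existent memory
]
```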


Re-execution for fault-tolerance

• Key concept

– Execute a program/instruction twice (or more times) and then compare the results.

– A time redundancy technique, but if multiple hardware platforms are available, it is a hardware redundancy technique

– Can detect transient faults. But it can also be employed to detect some permanent faults (see RESO next) even if the same hardware is used.


Re-execution: Basic Techniques

• RESO concept
  – Re-execution of an instruction with shifted operands
    • Already discussed early in the course
    • Can detect transient faults
    • Can also detect many permanent faults
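The reason shifted operands expose permanent faults can be seen in a small sketch. It assumes a deliberately simplified fault model (one output bit of an adder stuck at 0) and unbounded Python integers, so no bits are lost when shifting; the function names are illustrative.

```python
FAULT_BIT = 2  # hypothetical: sum bit 2 of the adder is stuck at 0


def good_add(a, b):
    return a + b


def faulty_add(a, b):
    """Adder whose output bit FAULT_BIT is permanently stuck at 0."""
    return (a + b) & ~(1 << FAULT_BIT)


def plain_reexecute(add, a, b):
    """Run the same operation twice on the same operands and compare."""
    first, second = add(a, b), add(a, b)
    return first, first == second


def reso(add, a, b, shift=1):
    """Second run uses operands shifted left; its result is shifted back."""
    first = add(a, b)
    second = add(a << shift, b << shift) >> shift
    return first, first == second


plain_result = plain_reexecute(faulty_add, 5, 1)  # both runs are wrong the same way
reso_faulty = reso(faulty_add, 5, 1)              # the stuck bit hits different sum bits
reso_good = reso(good_add, 5, 1)
```

With plain re-execution the permanent fault corrupts both runs identically, so the comparison passes. With shifted operands the stuck bit lands on a different bit of the true sum in the second run, so the two results disagree.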


Re-execution: Basic Techniques (contd.)

• Program Re-execution
  – Make two copies of the program
    • Execute them serially
      – Can use RESO if the hardware platform is same for both executions
    • Execute them in parallel if sufficient hardware redundancy is available
  – May take twice as long or twice the hardware
  – When/how to compare: impacts the system complexity
  – Performance impact
    • Serial computation: High latency
    • Parallel computation: Complex implementation, and hence possible loss of performance
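Serial program re-execution with a comparison at the end can be sketched as below. The workload, the fault-injection wrapper, and the retry step are assumptions added for illustration; the slides do not prescribe what to do after a mismatch.

```python
def checksum_program(data):
    """Stand-in workload: any deterministic function of its input would do."""
    total = 0
    for value in data:
        total = (total * 31 + value) % 65521
    return total


def with_transient_fault(program, faulty_call):
    """Wrap program so that only call number faulty_call returns a corrupted result."""
    calls = 0

    def wrapped(data):
        nonlocal calls
        calls += 1
        result = program(data)
        return result ^ 1 if calls == faulty_call else result

    return wrapped


def execute_and_compare(program, data):
    """Run the program twice, one after the other, and compare the results."""
    first = program(data)
    second = program(data)
    return first, first == second


data = [3, 1, 4, 1, 5]
clean_result, clean_ok = execute_and_compare(checksum_program, data)

flaky = with_transient_fault(checksum_program, faulty_call=1)
_, first_try_ok = execute_and_compare(flaky, data)        # transient hits run 1
retry_result, retry_ok = execute_and_compare(flaky, data)  # fault has gone away
```

The comparison is made once, at the end of the program, which keeps the mechanism simple but means an error is noticed only after both full executions have finished.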


Re-execution: Basic Techniques (contd.)

• Instruction Re-execution – fine grain parallelism
  – Re-execute every instruction on same or different hardware, depending upon the redundancy available
    • May use RESO if same hardware is used for instruction re-execution
  – If sufficient resources are available, this method may have little impact on the performance
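At instruction granularity, the same idea can be sketched with a toy interpreter that executes every instruction twice and commits the result only when both executions agree. The three-operand instruction format, the register names, and the fault-injection helper are invented for the example.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def fault_free_unit(op, x, y):
    return OPS[op](x, y)


def transient_unit(faulty_invocation):
    """Functional unit that corrupts the result of one invocation only."""
    count = 0

    def unit(op, x, y):
        nonlocal count
        count += 1
        result = OPS[op](x, y)
        return result + 1 if count == faulty_invocation else result

    return unit


def run(program, regs, unit=fault_free_unit):
    """Execute (op, dst, src1, src2) tuples; return (registers, failing index or None)."""
    regs = dict(regs)
    for index, (op, dst, src1, src2) in enumerate(program):
        first = unit(op, regs[src1], regs[src2])
        second = unit(op, regs[src1], regs[src2])  # re-execution
        if first != second:
            return regs, index  # detected before the result is committed
        regs[dst] = first
    return regs, None


program = [("add", "r2", "r0", "r1"), ("mul", "r3", "r2", "r2")]
initial = {"r0": 2, "r1": 3}

good_regs, good_error = run(program, initial)
bad_regs, bad_error = run(program, initial, unit=transient_unit(3))
```

Because the check happens per instruction, the error is caught before the corrupted value reaches the register file, unlike whole-program re-execution where it is seen only at the end.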


Re-execution: Case studies

• Introduction to case studies
  – CRAY
    • Instruction re-execution
  – SMT architecture
    • Two copies of the program are interleaved as two threads for simultaneous execution
  – Multiscalar architecture
    • Two copies of the program are executed on many processing elements simultaneously
  – Chip multiprocessor
    • With critical value forwarding (DSN-2010)


Re-execution: Case studies (contd.)

• CRAY
  • Instruction re-execution
  • Duplication of instruction in hardware
  • Sufficient resources and pipelining available for re-execution without doubling the execution time
  • Consider a generic fine grain parallel architecture (OH)
  • Consider executing a code segment (OH)
  • Now look at ways of duplicating instructions and executing original and duplicated instructions (OH)
  • Some experimental results



• AR-SMT
  – High level view of the technique (OH)
    • Concept of execution (Active) streams
    • Re-execution of the instruction stream – Redundant stream
  – Issue of delay buffer length and latency
  – Implementation issues and coverage
  – Performance impact



• Multiscalar
  – Concept of control flow graph (OH)
  – Basic architecture (OH)
  – Static division of PUs and performance impact (OH)
  – Dynamic division of PUs and performance impact (OH)



• Chip Multiprocessor (See slide set)
  – Intro
  – Design Overview and concept
  – Evaluation
  – Conclusion


Watchdog and Re-execution: Comments

• Concepts discussed here can be used to design high performance processors
  – Performance improvement via speculation
    • Have a very high performance speculative processor
    • Verify the control flow using watchdog or use a second processor to fully verify the executed stream by the speculative processor.
    • This will lead to a processor with high performance (throughput) albeit high latency


Summary

• Watchdog
  – Timer
  – Processor
  – Control flow checking
• Re-execution
  – Basic techniques
  – Case studies: CRAY, AR-SMT, Multiscalar
