ECE 753: FAULT-TOLERANT COMPUTING

Kewal K. Saluja
Department of Electrical and Computer Engineering

Low Level Fault-Tolerance: Watchdog and Re-execution

Overview

• Introduction
• Watchdog techniques
  – Timers, watchdog processors, error model, control flow checking, memory access and assertion checking
• Re-execution for fault-tolerance
  – Basic techniques: RESO concept, program re-execution, instruction re-execution
  – Case studies: Fine grain parallel architecture (CRAY), SMT architecture, multiscalar architecture, chip multiprocessor
• Summary

Introduction

• References
  • Watchdog - [mahm:88]
  • Re-execution - [rotenberg:99], [rashid:00], [subra:10], [kala:13]
  • Sohi, Franklin, and Saluja, “A study of time-redundant fault-tolerant techniques for high-performance pipelined computers,” Proceedings FTCS-19, June 1989, pp. 436-443.

Introduction (contd.)

• Somewhat higher level than ECC and masking at circuit level

• Bordering between hardware and software (hardware often assisted by software)

• These are some of the very first fault-tolerance methods

Watchdog techniques

• Key concept
  – The actions of a process or processor are checked by another unit, normally hardware. Checks include whether the process is still active (alive) and whether it is executing an incorrect path.

[Figure: a processor observed by a watchdog]

Watchdog: Timers

• Check for aliveness
  – Processor resets the timer at certain intervals or on certain conditions
  – Timer raises error flag if not reset before it overruns

[Figure: processor connected to a timer; the timer outputs Error]
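
The aliveness check can be illustrated with a small simulation. This is only a sketch under assumed names (`WatchdogTimer`, `kick`, `tick`, `run`) and an assumed tick-based clock; none of these come from the slides, and a real watchdog timer is a hardware counter.

```python
class WatchdogTimer:
    """Countdown timer that must be reset ("kicked") before it overruns."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.error = False

    def kick(self):
        """Called by the processor to show that it is still alive."""
        self.remaining = self.timeout_ticks

    def tick(self):
        """Advance one clock tick; latch the error flag on overrun."""
        if self.remaining > 0:
            self.remaining -= 1
        if self.remaining == 0:
            self.error = True


def run(kick_schedule, timeout_ticks, total_ticks):
    """Simulate total_ticks ticks; the processor kicks at the ticks in kick_schedule."""
    wdt = WatchdogTimer(timeout_ticks)
    for t in range(total_ticks):
        if t in kick_schedule:
            wdt.kick()
        wdt.tick()
    return wdt.error


# A healthy processor kicks every 3 ticks; a hung one stops after tick 3.
healthy = run(kick_schedule={0, 3, 6, 9}, timeout_ticks=4, total_ticks=12)
hung = run(kick_schedule={0, 3}, timeout_ticks=4, total_ticks=12)
print(healthy, hung)
```

With a timeout of 4 ticks, the healthy schedule never lets the counter reach zero, while the hung schedule raises the error flag a few ticks after the last kick.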

Watchdog: Timers (contd.)

• Check for timeout
  – Processor sends a message and starts a timer; the second processor must reply within this time (hardware/software implementation)

[Figure: Processor A and Processor B, with a timer]

Watchdog: Timers (contd.)

• Applications
  – Processor control systems (chemical, mechanical and other control systems)
  – Switching systems – messages sent or received often wait a certain length of time before they are repeated
  – Networks – email messages often have timeouts associated with them
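
The timeout check, together with the repeat-after-timeout behaviour mentioned for switching systems, might be sketched as follows. The function name `request_with_retries`, the tick-based delays, and the retry limit are illustrative assumptions, not details from the slides.

```python
def request_with_retries(reply_delays, timeout, max_attempts):
    """Model Processor A sending a message to Processor B.

    reply_delays[i] is the number of ticks B takes to answer attempt i
    (None means B never answers that attempt). A reply counts only if it
    arrives within `timeout` ticks; otherwise A repeats the message.
    Returns ("ok", attempts_used) or ("error", attempts_used).
    """
    for attempt in range(max_attempts):
        delay = reply_delays[attempt] if attempt < len(reply_delays) else None
        if delay is not None and delay <= timeout:
            return ("ok", attempt + 1)
    # Every attempt timed out: A reports B as faulty.
    return ("error", max_attempts)


prompt_reply = request_with_retries([2], timeout=5, max_attempts=3)
slow_then_ok = request_with_retries([9, 3], timeout=5, max_attempts=3)
dead_peer = request_with_retries([None, None, None], timeout=5, max_attempts=3)
print(prompt_reply, slow_then_ok, dead_peer)
```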

Watchdog: Processors

• Architecture – can be complex but let us consider the following simple architecture

[Figure: processor and memory connected by a bus (data, address, control); a watchdog (observer) monitors the bus]

Watchdog: Processors (contd.)

• What can it achieve?
  – Observe the address bus
    • Can observe the data
    • Can observe instructions
    • Can check the flow of program control
  – Need to know what kind of errors can occur to determine the capability of this method

Watchdog: Error models

• Experimental setup to develop error models applicable at this level
  – Processor-memory architecture
  – Inject faults (random errors) - in I/O processor, within processor (register file, states), within memory
  – Simulate
  – Also hardware was designed to inject such faults and study the impact/behavior

Watchdog: Error models (contd.)

• Conclusions of the studies
  – Program flow could change (branch to no branch, or vice versa)
  – Instruction fetched from data space
  – Access to non-existent memory space
  – Data fetched from instruction space
  – Illegal instruction
  – Writing in protected area (ROM)
• 60% of all faults could be detected by monitoring control flow
  – Thus we need to develop methods that are good at monitoring control flow

Watchdog: Control flow checking

• Basic principle
  – Analyze the program and extract control information
    • Branch free intervals
    • Subroutine calls
  – Assign signatures to branch free intervals and provide these signatures to the watchdog processor to check these values

Watchdog: Control flow checking (contd.)

• A simple example

Program                   Watchdog
start ------------------- receive start
branch free code          observe bus continuously to form signature
check sig X ------------- check X against collected sig
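
The exchange in this example can be sketched in code. The 8-bit additive checksum and the names `signature`, `Watchdog`, `receive_start`, `observe_bus`, and `run_interval` are assumptions chosen for illustration; the slides do not prescribe a particular signature function.

```python
def signature(words):
    """8-bit additive checksum over the instruction words of one interval."""
    total = 0
    for w in words:
        total = (total + w) & 0xFF
    return total


class Watchdog:
    def __init__(self):
        self.collected = 0

    def receive_start(self):
        """Program signals the start of a branch free interval."""
        self.collected = 0

    def observe_bus(self, word):
        """Fold each instruction word seen on the bus into the signature."""
        self.collected = (self.collected + word) & 0xFF

    def check(self, x):
        """Compare the precomputed signature X with the collected one."""
        return self.collected == x


def run_interval(fetched_words, x):
    wd = Watchdog()
    wd.receive_start()
    for w in fetched_words:
        wd.observe_bus(w)
    return wd.check(x)


block = [0x12, 0xA4, 0x3C, 0x07]   # made-up instruction words
X = signature(block)               # computed ahead of time
ok = run_interval(block, X)
corrupted = run_interval([0x12, 0xA4, 0x3D, 0x07], X)  # one bit flipped in flight
print(hex(X), ok, corrupted)
```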

Watchdog: Control flow checking (contd.)

• Details and variations
  – Structural integrity checking
    • Analyze the program control flow – create a program control flow graph
    • Assign unique identifier to the nodes of the graph
    • Provide control flow graph to the watchdog along with the identifiers
    • In case of branches, watchdog expects one of the many possible identifiers
    • Limitations
      – Performance impact – insertion of special instructions
      – Inability to detect data processing variations, e.g. an add changed to a sub
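
Structural integrity checking can be sketched as a successor check on a control flow graph. The graph `CFG`, its node identifiers, and the function `check_trace` are hypothetical; they only illustrate how a watchdog could validate the sequence of identifiers it receives.

```python
# Hypothetical control flow graph: node identifier -> legal successor identifiers.
CFG = {
    "A": {"B", "C"},   # A ends in a conditional branch
    "B": {"D"},
    "C": {"D"},
    "D": set(),        # exit node
}


def check_trace(cfg, trace):
    """Return the index of the first illegal transition, or None if the trace is legal."""
    for i in range(1, len(trace)):
        if trace[i] not in cfg.get(trace[i - 1], set()):
            return i
    return None


legal = check_trace(CFG, ["A", "B", "D"])
skipped_node = check_trace(CFG, ["A", "D"])            # control jumped past B and C
wrong_branch = check_trace(CFG, ["A", "C", "B", "D"])  # C may not be followed by B
print(legal, skipped_node, wrong_branch)
```

Because the watchdog sees only node identifiers, a fault that turns an add into a sub inside node B still produces the legal trace A, B, D, which is the data processing limitation listed above.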

Watchdog: Control flow checking (contd.)

• Details and variations (contd.)
  – Derived signature checking
    • Compiler identifies branch free intervals and generates signatures (such as a checksum) for these intervals
    • At run time these signatures are provided to the watchdog, using tag bits to differentiate between regular instructions and watchdog messages
    • Watchdog monitors the bus, generates the signatures, and compares them with the compiled signatures captured from the bus
    • Example: associate two tag bits with every memory word to differentiate between instructions and compiled signatures – when a signature tag appears on the bus, the watchdog captures the tagged word and forces a NOP on the bus for the regular processor
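
The tag-bit example might look like the following sketch. The tag encodings, the NOP value, the additive checksum, and the helper names `compile_intervals` and `run_bus` are all assumptions made for illustration.

```python
TAG_INSTR, TAG_SIG = 0b00, 0b01   # assumed encodings of the two tag bits
NOP = 0x00                        # assumed NOP opcode


def compile_intervals(intervals):
    """Compiler side: emit each branch free interval followed by its signature word."""
    memory = []
    for words in intervals:
        memory.extend((TAG_INSTR, w) for w in words)
        memory.append((TAG_SIG, sum(words) & 0xFF))
    return memory


def run_bus(memory):
    """Run-time side: return (words the processor sees, per-interval check results)."""
    running = 0            # signature the watchdog builds from observed instructions
    to_processor = []
    checks = []
    for tag, value in memory:
        if tag == TAG_SIG:
            checks.append(running == value)   # compare with the compiled signature
            running = 0
            to_processor.append(NOP)          # processor gets a NOP instead
        else:
            running = (running + value) & 0xFF
            to_processor.append(value)
    return to_processor, checks


memory = compile_intervals([[0x10, 0x22], [0x05, 0x06, 0x07]])
seen, checks = run_bus(memory)

faulty = list(memory)
faulty[1] = (TAG_INSTR, 0x23)     # single-bit error in the first interval
_, faulty_checks = run_bus(faulty)
print(seen, checks, faulty_checks)
```

Five instruction words occupy seven memory words here, and the processor executes two extra NOPs.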

Watchdog: Control flow checking (contd.)

• Details and variations (contd.)
  – Derived signature checking (contd.)
    • Coverage
      – Can detect random errors in instructions in branch free intervals (but aliasing can occur)
    • Overheads
      – Memory width increase due to tag bits
      – Memory increase due to signature insertions
      – Performance impact due to NOPs
    • Solutions
      – Using path signature method – reduces the number of signatures needed
      – Branch address hashing – merge signature and branch address
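
Aliasing is easy to demonstrate with a simple signature function. The sketch below assumes an 8-bit additive checksum (one possible choice, not necessarily the one used in the cited work) and made-up instruction words.

```python
def signature(words):
    """8-bit additive checksum, used here as the interval signature."""
    total = 0
    for w in words:
        total = (total + w) & 0xFF
    return total


good = [0x10, 0x22, 0x05]
single_error = [0x10, 0x23, 0x05]   # one word off by one: signature changes
aliased = [0x11, 0x21, 0x05]        # +1 and -1 cancel: signature is unchanged

detected = signature(single_error) != signature(good)
missed = signature(aliased) == signature(good)
print(detected, missed)
```

The second corrupted interval differs from the original in two words yet yields the same signature, so the watchdog would accept it.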

Watchdog: Mem access and assertion checks

• What to do about memory/data errors
  – Use ECC
  – A few other methods using watchdog
    • Check for non-existent memory addresses
    • Check for out of range addresses
    • Capability based checking for objects is also possible
    • Assertion based checking and sanity checks using watchdog (independent hardware) are also possible
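
A watchdog that checks memory accesses could be sketched as a lookup against a memory map. The map `MEMORY_MAP`, its address ranges, and its permission strings are hypothetical, and the sketch assumes that code space is execute-only so that data reads from it are flagged.

```python
# Hypothetical memory map: (start, end_exclusive, allowed access kinds).
MEMORY_MAP = [
    (0x0000, 0x4000, "x"),    # ROM holding code: instruction fetch only
    (0x8000, 0xC000, "rw"),   # RAM holding data: read and write
]


def check_access(addr, kind):
    """kind is 'r' (data read), 'w' (data write) or 'x' (instruction fetch).

    Returns None for a legal access, otherwise a short error description.
    """
    for start, end, perms in MEMORY_MAP:
        if start <= addr < end:
            if kind in perms:
                return None
            return f"illegal {kind!r} access at {addr:#06x}"
    return f"non-existent memory at {addr:#06x}"


print(check_access(0x0100, "x"))   # legal instruction fetch
print(check_access(0x0100, "w"))   # write to protected area (ROM)
print(check_access(0x9000, "x"))   # instruction fetched from data space
print(check_access(0x5000, "r"))   # address in an unmapped hole
```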

Re-execution for fault-tolerance

• Key concept
  – Execute a program/instruction twice (or more times) and then compare the results.
  – A time redundancy technique, but if multiple hardware platforms are available, it is a hardware redundancy technique
  – Can detect transient faults. But it can also be employed to detect some permanent faults (see RESO next) even if the same hardware is used.

Re-execution: Basic Techniques

• RESO concept
  – Re-execution of an instruction with shifted operands
    • Already discussed early in the course
    • Can detect transient faults
    • Can also detect many permanent faults
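
Why shifting the operands helps with permanent faults can be seen in a small model. The sketch assumes an adder with one output bit stuck at 0 and enough spare width to hold the shifted operands; the function names are illustrative only.

```python
def faulty_add(a, b, stuck_bit=None):
    """Adder whose output bit `stuck_bit` is stuck at 0 (None means fault free)."""
    s = a + b
    if stuck_bit is not None:
        s &= ~(1 << stuck_bit)
    return s


def plain_reexec_detects(a, b, stuck_bit):
    """Re-execute with the same operands: a permanent fault corrupts both runs alike."""
    return faulty_add(a, b, stuck_bit) != faulty_add(a, b, stuck_bit)


def reso_detects(a, b, stuck_bit):
    """Re-execute with operands shifted left by one, then shift the result back."""
    first = faulty_add(a, b, stuck_bit)
    second = faulty_add(a << 1, b << 1, stuck_bit) >> 1
    return first != second


# 5 + 6 = 11 (0b1011); output bit 1 is stuck at 0.
print(faulty_add(5, 6, stuck_bit=1))        # wrong sum from the faulty adder
print(plain_reexec_detects(5, 6, 1))        # same wrong sum twice: no mismatch
print(reso_detects(5, 6, 1))                # fault hits a different result bit: mismatch
```

In the shifted run the stuck bit lines up with a different bit of the true sum, so the two results disagree and the permanent fault is exposed.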

Re-execution: Basic Techniques (contd.)

• Program Re-execution
  – Make two copies of the program
    • Execute them serially
      – Can use RESO if the hardware platform is same for both executions
    • Execute them in parallel if sufficient hardware redundancy is available
  – May take twice as long or twice the hardware
  – When/how to compare: impacts the system complexity
  – Performance impact
    • Serial computation: High latency
    • Parallel computation: Complex implementation, and hence possible loss of performance
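
Serial program re-execution with a comparison at the end can be sketched as below. The example program (a sum of squares), the bit-flip fault injection, and the function names are assumptions for illustration; comparing only the final result is one of several possible comparison points.

```python
def program(data, flip_at=None):
    """Sum of squares. If flip_at is given, one accumulator bit is flipped
    after that step to model a transient fault."""
    acc = 0
    for i, x in enumerate(data):
        acc += x * x
        if i == flip_at:
            acc ^= 1 << 3
    return acc


def run_twice_and_compare(data, fault_in_first_run=None):
    """Execute the program twice serially and compare the two results."""
    r1 = program(data, flip_at=fault_in_first_run)
    r2 = program(data)   # the transient fault is assumed not to recur
    return r1 == r2, r1, r2


clean = run_twice_and_compare([1, 2, 3])
faulty = run_twice_and_compare([1, 2, 3], fault_in_first_run=1)
print(clean, faulty)
```

A mismatch only signals that one of the runs was wrong; deciding which one would need a third execution or some other check.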

Re-execution: Basic Techniques (contd.)

• Instruction Re-execution – fine grain parallelism
  – Re-execute every instruction on same or different hardware, depending upon the redundancy available
    • May use RESO if same hardware is used for instruction re-execution
  – If sufficient resources are available, this method may have little impact on the performance

Re-execution: Case studies

• Introduction to case studies
  – CRAY
    • Instruction re-execution
  – SMT architecture
    • Two copies of the program are interleaved as two threads for simultaneous execution
  – Multiscalar architecture
    • Two copies of the program are executed on many processing elements simultaneously
  – Chip multiprocessor
    • With critical value forwarding (DSN-2010)

Re-execution: Case studies (contd.)

• CRAY
  • Instruction re-execution
  • Duplication of instruction in hardware
  • Sufficient resources and pipelining available for re-execution without doubling the execution time
  • Consider a generic fine grain parallel architecture (OH)
  • Consider executing a code segment (OH)
  • Now look at ways of duplicating instructions and executing original and duplicated instructions (OH)
  • Some experimental results

Re-execution: Case studies (contd.)

• AR-SMT
  – High level view of the technique (OH)
    • Concept of execution (Active) streams
    • Re-execution of the instruction stream – Redundant stream
  – Issue of delay buffer length and latency
  – Implementation issues and coverage
  – Performance impact
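
The interplay between the two streams and the delay buffer can be sketched roughly as follows. This is a loose model, not the AR-SMT design itself: the toy instruction format, the function names, the injected single-bit fault, and the rule that the redundant stream runs only when the buffer is full or the active stream has finished are all assumptions. `buffer_len` must be at least 1.

```python
from collections import deque


def execute(instr):
    """Toy instruction set: ("add", a, b) or ("sub", a, b)."""
    op, a, b = instr
    return a + b if op == "add" else a - b


def ar_smt(program, buffer_len, fault_at=None):
    """Active stream runs ahead and pushes results into a delay buffer;
    the redundant stream re-executes each instruction and compares.

    Returns None if no mismatch is found, otherwise a tuple
    (index of mismatching instruction, how far ahead the active stream was).
    """
    buf = deque()
    a_pc = 0
    while a_pc < len(program) or buf:
        if a_pc < len(program) and len(buf) < buffer_len:
            result = execute(program[a_pc])
            if a_pc == fault_at:
                result ^= 1                  # transient fault in the active stream
            buf.append((a_pc, result))
            a_pc += 1
        else:
            idx, a_result = buf.popleft()    # redundant stream catches up
            if execute(program[idx]) != a_result:
                return idx, a_pc - idx
    return None


prog = [("add", 1, 2), ("sub", 9, 4), ("add", 7, 7),
        ("add", 0, 5), ("sub", 8, 8), ("add", 3, 3)]
no_fault = ar_smt(prog, buffer_len=3)
long_buffer = ar_smt(prog, buffer_len=3, fault_at=1)
short_buffer = ar_smt(prog, buffer_len=1, fault_at=1)
print(no_fault, long_buffer, short_buffer)
```

In this model a longer delay buffer lets the active stream run further ahead before the fault is noticed, which is one way to read the buffer length versus latency issue listed above.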

Re-execution: Case studies (contd.)

• Multiscalar
  – Concept of control flow graph (OH)
  – Basic architecture (OH)
  – Static division of PUs and performance impact (OH)
  – Dynamic division of PUs and performance impact (OH)

Re-execution: Case studies (contd.)

• Chip Multiprocessor (See slide set)
  – Intro
  – Design Overview and concept
  – Evaluation
  – Conclusion

Watchdog and Re-execution: Comments

• Concepts discussed here can be used to design high performance processors
  – Performance improvement via speculation
    • Have a very high performance speculative processor
    • Verify the control flow using a watchdog, or use a second processor to fully verify the stream executed by the speculative processor.
    • This will lead to a processor with high performance (throughput) albeit high latency

Summary

• Watchdog
  – Timer
  – Processor
  – Control flow checking
• Re-execution
  – Basic techniques
  – Case studies: CRAY, AR-SMT, Multiscalar