ECE 753: FAULT-TOLERANT COMPUTING
Kewal K. Saluja
Department of Electrical and Computer Engineering
Low Level Fault-Tolerance: Watchdog and Re-execution
ECE 753 Fault Tolerant Computing 2
Overview
• Introduction
• Watchdog techniques
  – Timers, watchdog processors, error model, control flow checking, memory access and assertion checking
• Re-execution for fault-tolerance
  – Basic techniques: RESO concept, program re-execution, instruction re-execution
  – Case studies: Fine grain parallel architecture (CRAY), SMT architecture, multiscalar architecture, chip multiprocessor
• Summary
Introduction
• References
• Watchdog - [mahm:88]
• Re-execution - [rotenberg:99], [rashid:00], [subra:10], [kala:13]
• Sohi, Franklin, and Saluja, “A study of time-redundant fault-tolerant techniques for high-performance pipelined computers,” Proceedings FTCS-19, June 1989, pp. 436-443.
Introduction (contd.)
• Somewhat higher level than ECC and masking at circuit level
• Bordering between hardware and software (hardware often assisted by software)
• These are some of the very first fault-tolerance methods
Watchdog techniques
• Key concept
  – The actions of a process or processor are checked by a separate unit, normally implemented in hardware. The checks include whether the process is still active (alive), whether it is following a correct execution path, and so on.
[Figure: a processor monitored by a watchdog]
Watchdog: Timers
• Check for aliveness
  – Processor resets the timer at certain intervals or on certain conditions
  – Timer raises error flag if not reset before it overruns
[Figure: the processor resets a timer; the timer raises an error signal]
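The aliveness check can be sketched as a tick-based simulation. This is a minimal illustration, not a model of any particular hardware timer; the class name `WatchdogTimer`, the helper `run`, and the limit of 4 ticks are assumptions made for the example.

```python
class WatchdogTimer:
    """Counts ticks since the last reset and latches an error flag on overrun."""

    def __init__(self, limit):
        self.limit = limit      # ticks allowed between resets (assumed value)
        self.count = 0
        self.error = False

    def reset(self):
        """Called by the monitored processor to signal that it is alive."""
        self.count = 0

    def tick(self):
        """Advance time by one tick; latch the error flag once the limit is reached."""
        self.count += 1
        if self.count >= self.limit:
            self.error = True


def run(limit, reset_ticks, total_ticks):
    """Simulate total_ticks of time; the processor resets the timer at reset_ticks."""
    wd = WatchdogTimer(limit)
    for t in range(total_ticks):
        if t in reset_ticks:
            wd.reset()
        wd.tick()
    return wd.error


healthy = run(limit=4, reset_ticks={0, 3, 6, 9}, total_ticks=12)  # resets keep arriving
hung = run(limit=4, reset_ticks={0, 3}, total_ticks=12)           # processor stops resetting
```

Here `healthy` stays `False`, while `hung` becomes `True` a few ticks after the last reset.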
Watchdog: Timers (contd.)
• Check for timeout
  – Processor sends a message and starts a timer; the second processor must reply within this time (hardware/software implementation)
[Figure: Processor A, Processor B, and a timer]
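A software version of the timeout check might look like the sketch below. It uses two threads and the standard `queue.Queue.get(timeout=...)` call as the timer; the function names, the message contents, and the delay values are all made up for illustration.

```python
import queue
import threading
import time


def processor_b(requests, replies, delay):
    """Hypothetical responder: waits `delay` seconds, then acknowledges the message."""
    msg = requests.get()
    time.sleep(delay)
    replies.put(("ack", msg))


def send_with_timeout(msg, delay_b, timeout):
    """Processor A: send msg, start the timer, report whether B replied in time."""
    requests, replies = queue.Queue(), queue.Queue()
    worker = threading.Thread(
        target=processor_b, args=(requests, replies, delay_b), daemon=True
    )
    worker.start()
    requests.put(msg)
    try:
        replies.get(timeout=timeout)   # the timer runs while A waits for the reply
        return "ok"
    except queue.Empty:                # no reply arrived before the timer expired
        return "timeout"


fast = send_with_timeout("ping", delay_b=0.0, timeout=1.0)
slow = send_with_timeout("ping", delay_b=1.0, timeout=0.05)
```

With these values `fast` is `"ok"` and `slow` is `"timeout"`. What Processor A does after a timeout (retry, switch to a spare, raise an alarm) is a separate design decision not covered here.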
Watchdog: Timers (contd.)
• Applications
  – Process control systems (chemical, mechanical and other control systems)
  – Switching systems – messages sent or received often wait a certain length of time before they are repeated
  – Networks – email messages often have timeouts associated with them
Watchdog: Processors
• Architecture – can be complex but let us consider the following simple architecture
[Figure: processor and memory connected by a bus (data, address, control); the watchdog (observer) monitors the bus]
Watchdog: Processors (contd.)
• What can it achieve?
  – Observe the address bus
    • Can observe the data
    • Can observe instructions
    • Can check the flow of program control
  – Need to know what kind of errors can occur to determine the capability of this method
Watchdog: Error models
• Experimental setup to develop error models applicable at this level
  – Processor-memory architecture
  – Inject faults (random errors) - in I/O processor, within processor (register file, states), within memory
  – Simulate
  – Also hardware was designed to inject such faults and study the impact/behavior
Watchdog: Error models (contd.)
• Conclusions of the studies
  – Program flow could change (branch to no branch, or vice versa)
  – Instruction fetched from data space
  – Access to nonexistent memory space
  – Data fetched from instruction space
  – Illegal instruction
  – Writing in protected area (ROM)
• 60% of all faults could be detected by monitoring control flow
  – Thus we need to develop methods that are good at monitoring control flow
Watchdog: Control flow checking
• Basic principle
  – Analyze the program and extract control information
    • Branch free intervals
    • Subroutine calls
  – Assign signatures to branch free intervals and provide these signatures to the watchdog processor, which checks them at run time
Watchdog: Control flow checking (contd.)
• A simple example
Program                     Watchdog
start  ------------------>  receive start
branch free code            observe bus cont. to form signature
check sig X  ------------>  check X against collected sig
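The start / observe / check exchange in the example can be sketched as follows. The rotate-and-XOR signature, the 16-bit word size, and the instruction words are assumptions chosen for the illustration; the slides do not specify a signature function.

```python
def signature(words):
    """Fold 16-bit words into a 16-bit signature (rotate left by one, then XOR)."""
    sig = 0
    for w in words:
        sig = ((sig << 1) | (sig >> 15)) & 0xFFFF   # rotate left by one bit
        sig ^= w & 0xFFFF
    return sig


class Watchdog:
    def __init__(self):
        self.collected = []
        self.active = False

    def start(self):
        """'receive start': begin collecting words for a new interval."""
        self.collected = []
        self.active = True

    def observe(self, word):
        """'observe bus': record each word fetched while the interval is active."""
        if self.active:
            self.collected.append(word)

    def check(self, expected):
        """'check X against collected sig': True when the signatures match."""
        self.active = False
        return signature(self.collected) == expected


block = [0x1A2B, 0x3C4D, 0x5E6F]   # made-up instruction words of a branch free interval
X = signature(block)               # reference signature computed ahead of time


def execute(fetched):
    """Run one interval: the program sends start, the words, then 'check sig X'."""
    wd = Watchdog()
    wd.start()
    for w in fetched:
        wd.observe(w)
    return wd.check(X)


clean = execute(block)
corrupted = execute([0x1A2B, 0x3C4F, 0x5E6F])   # one bit flipped in the second word
```

`clean` is `True` and `corrupted` is `False`, since a single flipped bit always changes this particular signature.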
Watchdog: Control flow checking (contd.)
• Details and variations
  – Structural integrity checking
    • Analyze the program control flow – create a program control flow graph
    • Assign a unique identifier to each node of the graph
    • Provide the control flow graph to the watchdog along with the identifiers
    • In case of branches, the watchdog expects one of the several possible identifiers
    • Limitations
      – Performance impact – insertion of special instructions
      – Inability to detect data processing errors – e.g. an add changed to a sub
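A minimal sketch of structural integrity checking, assuming the watchdog holds the control flow graph as a table of legal successors and the program reports a node identifier each time it enters a node. The graph, the identifiers, and the function name `check_trace` are hypothetical.

```python
# Hypothetical control flow graph: node identifier -> set of legal successor identifiers.
CFG = {
    "entry": {"loop"},
    "loop": {"body", "exit"},   # a branch: either successor is acceptable
    "body": {"loop"},
    "exit": set(),
}


def check_trace(cfg, trace):
    """Return the index of the first illegal transition, or None if the trace is legal."""
    for i in range(1, len(trace)):
        if trace[i] not in cfg[trace[i - 1]]:
            return i
    return None


good = check_trace(CFG, ["entry", "loop", "body", "loop", "exit"])
bad = check_trace(CFG, ["entry", "loop", "body", "exit"])   # body may only return to loop
```

`good` is `None` and `bad` is `3`. The check only sees node identifiers, so an error inside a node (an add executed as a sub) leaves the trace unchanged and goes undetected.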
Watchdog: Control flow checking (contd.)
• Details and variations (contd.)
  – Derived signature checking
    • Compiler identifies branch free intervals and generates signatures (such as a checksum) for these intervals
    • At run time these signatures are provided to the watchdog, using tag bits to differentiate between regular instructions and watchdog messages
    • Watchdog monitors the bus, generates its own signatures, and compares them with the compiled signatures captured from the bus
    • Example: associate two tag bits with every memory word to differentiate between instructions and compiled signatures – when a signature tag appears on the bus, the watchdog captures the tagged word and forces a NOP on the bus for the regular processor
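The tag-bit example can be sketched as a walk over tagged memory words. The tag values, the NOP encoding, the 8-bit additive checksum, and the instruction words are assumptions for the illustration only.

```python
INSTR, SIG = 0, 1      # assumed tag values (the slide allots two tag bits per word)
NOP = 0x00             # assumed NOP encoding


def checksum(words):
    """8-bit additive checksum used here as the interval signature."""
    return sum(words) & 0xFF


def run_bus(memory):
    """Walk tagged words; return (words seen by the processor, per-interval mismatch flags)."""
    to_processor, mismatches, interval = [], [], []
    for tag, word in memory:
        if tag == SIG:
            # Watchdog captures the compiled signature and compares it with its own.
            mismatches.append(checksum(interval) != word)
            interval = []
            to_processor.append(NOP)       # the processor sees a NOP instead
        else:
            interval.append(word)
            to_processor.append(word)
    return to_processor, mismatches


interval_a, interval_b = [0x12, 0x34], [0x56, 0x78, 0x9A]
memory = ([(INSTR, w) for w in interval_a] + [(SIG, checksum(interval_a))]
          + [(INSTR, w) for w in interval_b] + [(SIG, checksum(interval_b))])

stream, flags = run_bus(memory)

faulty = list(memory)
faulty[3] = (INSTR, 0x57)                  # corrupt the first word of the second interval
_, faulty_flags = run_bus(faulty)
```

In the fault-free run both flags are `False` and the processor stream contains a NOP in place of each signature word; in the faulty run the second flag is `True`.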
Watchdog: Control flow checking (contd.)
• Details and variations (contd.)
  – Derived signature checking (contd.)
    • Coverage
      – Can detect random errors in instructions in branch free intervals (but aliasing can occur)
    • Overheads
      – Memory width increase due to tag bits
      – Memory increase due to signature insertions
      – Performance impact due to NOPs
    • Solutions
      – Using path signature method – reduces the number of signatures needed
      – Branch address hashing – merge signature and branch address
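Aliasing is easy to see with a short signature. The sketch below assumes an 8-bit additive checksum (one possible signature, not necessarily the one used in the referenced work) and made-up instruction words.

```python
def checksum(words):
    """8-bit additive checksum over the words of a branch free interval."""
    return sum(words) & 0xFF


block = [0x12, 0x34, 0x56]
single_error = [0x12, 0x35, 0x56]   # one word off by one: the checksum changes
compensating = [0x13, 0x33, 0x56]   # two errors that cancel in the sum: aliased

detected = checksum(single_error) != checksum(block)
aliased = checksum(compensating) == checksum(block)
```

Both `detected` and `aliased` are `True`: the single error is caught, while the pair of compensating errors produces the same signature as the correct block.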
Watchdog: Mem access and assertion checks
• What to do about memory/data errors
  – Use ECC
  – A few other methods using a watchdog
    • Check for nonexistent memory addresses
    • Check for out of range addresses
    • Capability based checking for objects is also possible
    • Assertion based checking and sanity checks using a watchdog (independent hardware) are also possible
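The address checks could be sketched as a lookup against a memory map. The map below (a code region that allows only instruction fetches and a data region that allows reads and writes) is hypothetical, as are the function name and the returned strings.

```python
# Hypothetical memory map: (start, end_exclusive, allowed access kinds).
MEMORY_MAP = [
    (0x0000, 0x4000, "x"),    # code in ROM: instruction fetch only
    (0x8000, 0xC000, "rw"),   # data in RAM: read and write
]


def check_access(addr, kind):
    """kind is 'r' (data read), 'w' (write) or 'x' (instruction fetch)."""
    for start, end, allowed in MEMORY_MAP:
        if start <= addr < end:
            return "ok" if kind in allowed else "permission violation"
    return "nonexistent address"


results = [
    check_access(0x0100, "x"),   # normal instruction fetch
    check_access(0x0100, "w"),   # write into protected area (ROM)
    check_access(0x0100, "r"),   # data fetched from instruction space
    check_access(0x8000, "x"),   # instruction fetched from data space
    check_access(0x5000, "r"),   # address not backed by any memory
]
```

The list evaluates to `ok`, three permission violations, and one nonexistent address, matching several of the error classes listed in the error model.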
Re-execution for fault-tolerance
• Key concept
  – Execute a program/instruction twice (or more times) and then compare the results.
  – A time redundancy technique, but if multiple hardware platforms are available, it is a hardware redundancy technique
  – Can detect transient faults. But it can also be employed to detect some permanent faults (see RESO next) even if the same hardware is used.
Re-execution: Basic Techniques
• RESO concept
  – Re-execution of an instruction with shifted operands
    • Already discussed early in the course
    • Can detect transient faults
    • Can also detect many permanent faults
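A minimal sketch of why shifting the operands exposes a permanent fault that plain re-execution hides. It assumes an adder whose output bit can be stuck at 0 and a left shift by one position; the function names and the fault model are chosen for the illustration.

```python
def alu_add(a, b, stuck_at_zero_bit=None):
    """Adder model; optionally forces one output bit to 0 to mimic a permanent fault."""
    result = a + b
    if stuck_at_zero_bit is not None:
        result &= ~(1 << stuck_at_zero_bit)
    return result


def reso_add(a, b, stuck_at_zero_bit=None):
    """Return (result, error_detected) using recomputation with shifted operands."""
    first = alu_add(a, b, stuck_at_zero_bit)
    # Second pass: operands shifted left by one, result shifted back before comparing.
    second = alu_add(a << 1, b << 1, stuck_at_zero_bit) >> 1
    return first, first != second


fault_free = reso_add(5, 2)                          # no fault: both passes agree
with_fault = reso_add(5, 2, stuck_at_zero_bit=2)     # bit 2 of the adder output stuck at 0
plain_repeat_agrees = alu_add(5, 2, 2) == alu_add(5, 2, 2)
```

`fault_free` is `(7, False)` and `with_fault` is `(3, True)`: the stuck bit corrupts a different bit position of the sum in each pass, so the two results disagree. `plain_repeat_agrees` is `True`, showing that repeating the same computation without the shift yields two identical wrong answers. A real datapath needs extra bits of width to hold the shifted operands.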
Re-execution: Basic Techniques (contd.)
• Program re-execution
  – Make two copies of the program
    • Execute them serially
      – Can use RESO if the hardware platform is the same for both executions
    • Execute them in parallel if sufficient hardware redundancy is available
  – May take twice as long or twice the hardware
  – When/how to compare: impacts the system complexity
  – Performance impact
    • Serial computation: High latency
    • Parallel computation: Complex implementation, and hence possible loss of performance
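Serial program re-execution with a comparison at the end can be sketched as below. The toy workload and the `flip_at` fault-injection parameter are hypothetical; they exist only to show a transient error in one of the two runs.

```python
def program(data, flip_at=None):
    """Toy workload: running sum. flip_at injects a one-bit transient error at that step."""
    total = 0
    for i, x in enumerate(data):
        total += x
        if i == flip_at:
            total ^= 1 << 3        # transient fault: bit 3 of the accumulator flips
    return total


def run_twice(data, fault_in_first_run=None):
    """Execute the program twice in series and compare the two results."""
    first = program(data, flip_at=fault_in_first_run)
    second = program(data)
    return first, second, first == second


clean = run_twice([1, 2, 3, 4])
transient = run_twice([1, 2, 3, 4], fault_in_first_run=1)
```

`clean` is `(10, 10, True)` and `transient` is `(18, 10, False)`. Comparing only at the end keeps the mechanism simple but delays detection until both runs finish; comparing at intermediate points shortens that delay at the cost of more complexity.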
Re-execution: Basic Techniques (contd.)
• Instruction re-execution – fine grain parallelism
  – Re-execute every instruction on same or different hardware, depending upon the redundancy available
    • May use RESO if same hardware is used for instruction re-execution
  – If sufficient resources are available, this method may have little impact on the performance
Re-execution: Case studies
• Introduction to case studies
  – CRAY
    • Instruction re-execution
  – SMT architecture
    • Two copies of the program are interleaved as two threads for simultaneous execution
  – Multiscalar architecture
    • Two copies of the program are executed on many processing elements simultaneously
  – Chip multiprocessor
    • With critical value forwarding (DSN-2010)
Re-execution: Case studies (contd.)
• CRAY
• Instruction re-execution
• Duplication of instruction in hardware
• Sufficient resources and pipelining available for re-execution without doubling the execution time
• Consider a generic fine grain parallel architecture (OH)
• Consider executing a code segment (OH)
• Now look at ways of duplicating instructions and executing original and duplicated instructions (OH)
• Some experimental results
Re-execution: Case studies (contd.)
• AR-SMT
  – High level view of the technique (OH)
    • Concept of execution (Active) streams
    • Re-execution of the instruction stream – Redundant stream
  – Issue of delay buffer length and latency
  – Implementation issues and coverage
  – Performance impact
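The interplay between the active stream, the redundant stream, and the delay buffer can be sketched roughly as follows. This is a simplified serial model written for illustration, not the AR-SMT microarchitecture: the toy instruction format, the fault injection, and the scheduling rule (the redundant stream runs whenever the active stream is stalled or finished) are all assumptions.

```python
from collections import deque


def execute(instr, fault=False):
    """Toy instruction (op, a, b) with op 'add' or 'sub'; fault flips the low result bit."""
    op, a, b = instr
    result = a + b if op == "add" else a - b
    return result ^ 1 if fault else result


def ar_smt(program, buffer_len, faulty_index=None):
    """Return the index of the first instruction whose two results disagree, or None."""
    delay_buffer = deque()
    a_pc = 0
    while a_pc < len(program) or delay_buffer:
        if a_pc < len(program) and len(delay_buffer) < buffer_len:
            # Active stream: execute and place the result in the delay buffer.
            result = execute(program[a_pc], fault=(a_pc == faulty_index))
            delay_buffer.append((a_pc, result))
            a_pc += 1
        else:
            # Redundant stream: re-execute the oldest buffered instruction and compare.
            index, a_result = delay_buffer.popleft()
            if execute(program[index]) != a_result:
                return index
    return None


code = [("add", 1, 2), ("sub", 5, 3), ("add", 4, 4)]
no_fault = ar_smt(code, buffer_len=2)
detected_at = ar_smt(code, buffer_len=2, faulty_index=1)
```

`no_fault` is `None` and `detected_at` is `1`. In this model a longer delay buffer lets the active stream run further ahead before stalling, but an error then sits in the buffer longer before it is compared, which is the length versus latency trade-off noted above.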
Re-execution: Case studies (contd.)
• Multiscalar
  – Concept of control flow graph (OH)
  – Basic architecture (OH)
  – Static division of PUs and performance impact (OH)
  – Dynamic division of PUs and performance impact (OH)
Re-execution: Case studies (contd.)
• Chip Multiprocessor (See slide set)
  – Intro
  – Design overview and concept
  – Evaluation
  – Conclusion
Watchdog and Re-execution: Comments
• Concepts discussed here can be used to design high performance processors
  – Performance improvement via speculation
    • Have a very high performance speculative processor
    • Verify the control flow using a watchdog, or use a second processor to fully verify the stream executed by the speculative processor
    • This will lead to a processor with high performance (throughput), albeit with high latency
Summary
• Watchdog
  – Timer
  – Processor
  – Control flow checking
• Re-execution
  – Basic techniques
  – Case studies: CRAY, AR-SMT, Multiscalar