
Optimizing Design of Fault-tolerant Computing Systems

Milos Krstic

HDT 2017, 1st Workshop on Hardware Design and Theory,

www.ihp-microelectronics.com © 2015 - All rights reserved

Agenda

1 Motivation

2 Fault Tolerant Methods

3 Methods for reducing the overhead in Fault tolerant Systems

4 Static and dynamic methods: examples

5 System resilience

6 Conclusions


Motivation and goals

Fault tolerance is a traditional requirement of applications such as space and avionics.

Today's scaled technologies, because of their reliability issues, require more and more fault-tolerance measures even for mainstream applications.

Fault tolerance is always achieved at some (significant!) cost.

The open question is how to limit the overhead imposed by fault tolerance.

Two important strategies:

Advanced static low-overhead techniques, which provide a certain level of fault tolerance with limited overhead

Adaptivity techniques, which enable fault tolerance only when required



Example: Effects induced by Radiation

Irradiation effects are present not just in space applications

Radiation effects can be split into two general categories:

Cumulative Effects (can also be caused by aging effects): Ionization, Displacement, Enhanced Low-Dose-Rate Sensitivity (ELDRS)

Single Event Effects:

Soft errors: Neutron Single Event Upset (NSEU), Single Event Transient (SET), Single Event Upset (SEU), Single Event Functional Interrupt (SEFI)

Hard errors: Single Event Latchup (SEL), Single Event Gate Rupture (SEGR), Single Event Burnout (SEBR)

The effects need to be addressed by corresponding fault tolerant measures


Fault-tolerant Techniques

Almost all fault-tolerant techniques are based on redundancy

Hardware redundancy (N-modular, triple and double modular redundancy)

Information redundancy (error detection and correction)

Time redundancy

Software redundancy


Hardware Redundancy with TMR


The general approach is N-modular redundancy

Such systems are also known as M-of-N Systems

N=3, Triple Modular Redundancy (TMR)

N=2, Dual Modular Redundancy (DMR)

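How TMR works?

[Figure: three redundant modules feed a voter that drives the output D. Without a fault, all three modules deliver m and the voter outputs m.]

[Figure: a transient or permanent fault makes one module deliver x ≠ m. The other two modules still deliver m, so the voter outputs m.]

TMR is a 2-of-3 system!

[Figure: transient or permanent faults make two modules deliver x ≠ m and y ≠ m. Only one module delivers m, and the voter output is marked "?".]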
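The 2-of-3 behaviour can be illustrated with a small sketch. This is not taken from the slides: the function name `majority_vote` and the example bit patterns are assumptions chosen for illustration, and the voter is modelled as a bitwise majority over three equally wide words.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit follows at least two inputs."""
    return (a & b) | (a & c) | (b & c)


m = 0b1011            # fault-free module output (example value)
x = m ^ 0b0100        # one faulty module: bit 2 flipped
y = m ^ 0b0110        # second faulty module: bits 1 and 2 flipped

no_fault = majority_vote(m, m, m)
single_fault = majority_vote(x, m, m)
double_fault = majority_vote(x, y, m)

print("no fault masked:    ", no_fault == m)
print("single fault masked:", single_fault == m)
print("double fault masked:", double_fault == m)
```

In this sketch the single fault is masked, while the double fault corrupts the output because two modules agree on the wrong value of bit 2.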


Limits of the simple TMR-Circuit

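• Voter (V) is the single point of failure!

[Figure: two TMR stages. In each stage Module 1, Module 2 and Module 3 feed a single voter V followed by a flip-flop (D1/Q1 in the first stage, D2/Q2 in the second); a transient pulse is marked in the diagram.]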


Alternative TMR-Architecture

• Full TMR architecture, including TMR voter

[Figure: Module 1, Module 2 and Module 3 feed three voters V; each voter drives its own flip-flop (D1/Q1, D2/Q2, D3/Q3).]


Information Redundancy

The TMR approach could also be used (or seen) as information redundancy

Not a very efficient one

There are error correction codes (ECCs) which can reduce the overhead

Example: Hamming code (26 data bits, 5 parity bits, overhead 16% of the 31-bit codeword)

Single error correction; double error detection (SECDED) needs one additional overall parity bit

Easily applicable to homogeneous structures (for example SRAM) or to communication packets.
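The overhead figure for the Hamming example can be checked with a short calculation. The sketch below uses the standard Hamming condition 2^r >= k + r + 1 for k data bits and r parity bits; the function name `hamming_parity_bits` is an assumption for illustration, and the overhead is measured relative to the full codeword.

```python
def hamming_parity_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single error correction)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


k = 26                                 # data bits from the example
r = hamming_parity_bits(k)             # parity bits for single error correction
hamming_overhead = r / (k + r)         # share of redundant bits in the codeword
tmr_overhead = 2 / 3                   # TMR stores every bit three times

print(f"parity bits: {r}")
print(f"Hamming overhead: {hamming_overhead:.1%}")
print(f"TMR overhead:     {tmr_overhead:.1%}")
```

Under these assumptions the code needs 5 parity bits, which is about 16% of the 31-bit codeword, compared with two thirds of all stored bits for TMR used as information redundancy.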


Time Redundancy

Time redundancy is performed by executing the same task multiple times in order to detect an error

Advantage

Hardware overhead is low or even nonexistent

Disadvantage

Performance is reduced

Energy consumption overhead still exists
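A minimal sketch of time redundancy, assuming a task that can simply be re-executed and compared. The helper names `run_with_time_redundancy` and `make_flaky_square` are hypothetical, and the injected bit flip stands in for a transient fault.

```python
def run_with_time_redundancy(task, arg, runs=2):
    """Execute the same task several times and flag any disagreement."""
    results = [task(arg) for _ in range(runs)]
    error_detected = any(r != results[0] for r in results[1:])
    return results[0], error_detected


def make_flaky_square(faulty_call):
    """Return a squaring task whose result is corrupted on one chosen call."""
    calls = {"n": 0}

    def square(v):
        calls["n"] += 1
        result = v * v
        if calls["n"] == faulty_call:   # emulate a transient fault
            result ^= 1
        return result

    return square


clean = run_with_time_redundancy(make_flaky_square(faulty_call=0), 7)
faulty = run_with_time_redundancy(make_flaky_square(faulty_call=2), 7)
print(clean, faulty)
```

The second execution doubles the run time and the energy of the task, which is the performance and energy cost listed above; no extra hardware is modelled.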


Software Redundancy

• Redundancy can also be performed in software

• Execution of the algorithm several times and checking of the results

• Critical issue: if the software is executed in the same way each time, some errors will not be found

• The solution is to have different implementations of the target algorithms

• Overhead is similar to the time redundancy approach
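The point about different implementations can be sketched as follows. The two summation routines and the deliberately planted off-by-one bug are assumptions made up for this illustration; they only show why repeating the same code cannot reveal a systematic error.

```python
def sum_loop_buggy(n: int) -> int:
    """Sum 1..n with a deliberate off-by-one bug (n itself is skipped)."""
    return sum(range(1, n))


def sum_formula(n: int) -> int:
    """Independent implementation of the same task using the closed form."""
    return n * (n + 1) // 2


def results_agree(impl_a, impl_b, arg) -> bool:
    """Run two implementations on the same input and compare the results."""
    return impl_a(arg) == impl_b(arg)


same_code_twice = results_agree(sum_loop_buggy, sum_loop_buggy, 10)
diverse_code = results_agree(sum_loop_buggy, sum_formula, 10)
print("same implementation twice agrees:", same_code_twice)
print("diverse implementations agree:   ", diverse_code)
```

Running the same implementation twice produces two identical wrong results, so the check passes; only the diverse pair exposes the error.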


Overhead of Fault Tolerant Solutions

The overhead of TMR could be much higher than 3x

Example: TMR flip-flop

The TMR flip-flop is very effective against single event effects

On the other hand, the overhead is significant

[Figure: TMR Flip Flops. SEU cross section for SGB25 at VDD = 2.25 V, T = 27 °C. Y-axis: cross section [cm²], logarithmic from 1.0E-11 to 1.0E-06; x-axis: LET [MeVcm²/mg], 0 to 80. Curves: SGB25 Full-TMR, SGB25 Red-TMR, SGB25 Dlib-DICE; a minimal threshold is marked.]


Reducing the Overhead

Redundancy generates the overhead

This can increase the power/area/performance budget several times

Reducing the overhead is possible

The basic method is to combine different redundancy methods (hardware, software, information, time) and to trade off the achieved level of fault tolerance against performance and power consumption

Example: DMR

DMR significantly reduces the overhead introduced by TMR

However, it can usually offer only error detection; correction needs to be done by architectural replay, which reduces performance

Merging standard low-power techniques and fault tolerant methods?
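The DMR example (detection by comparison, correction by replay) can be sketched as below. The names `dmr_execute` and `make_module`, the replay limit and the injected fault are assumptions for illustration only.

```python
def make_module(faulty_calls=()):
    """Return a module computing v + 1 that is corrupted on the given calls."""
    count = {"n": 0}

    def module(v):
        count["n"] += 1
        result = v + 1
        # Flip the lowest bit on calls that are marked as faulty.
        return result ^ 1 if count["n"] in faulty_calls else result

    return module


def dmr_execute(module_a, module_b, arg, max_replays=3):
    """Run both modules; on a mismatch, replay the operation."""
    for replays in range(max_replays + 1):
        a, b = module_a(arg), module_b(arg)
        if a == b:                      # outputs agree: accept the result
            return a, replays
    raise RuntimeError("persistent mismatch, replay did not help")


fault_free = dmr_execute(make_module(), make_module(), 4)
transient = dmr_execute(make_module(faulty_calls={1}), make_module(), 4)
print(fault_free, transient)
```

The second element of each result is the number of replays, which is the performance penalty paid instead of the third module of TMR. A permanent fault in one module would never pass the comparison in this sketch.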


Applying Standard Low-Power Methods for Reliable Circuits

Standard low-power techniques could be applicable to fault-tolerant circuits

Clock gating

Could cause issues for space applications if the fault tolerance is based on hardware redundancy

Error accumulation is possible, since there is no clock-driven fault recovery while the clock is gated

State recovery needed after the clock gating phase

DVFS

Fault tolerant clock and voltage control needed

Voltage and frequency changes could affect the susceptibility to events

Characterization needed

Power gating

Fault tolerant power switches needed


Fault-Tolerance to achieve Optimized Design Operation

In modern scaled systems, power needs to be reduced beyond the worst-case corners

We need better than worst-case timing!

As a result => speculative execution is needed

The clock cycle is adaptive and could be shorter than the critical path delay

Such low-power circuits must be fault tolerant!

Important priorities are:

Overhead for fault tolerance must be minimized (POWER!, area, timing)

Focus on fault detection => correction by recovery

Target – timing faults


RAZOR Concept

RAZOR is a very popular approach to achieving fault tolerance against timing errors

It is based on a shadow latch that samples on the alternative clock edge

The outputs of the main flip-flop and the shadow latch are compared to identify a timing fault

Pros

Reduced complexity compared with TMR/DMR

Effective against timing errors

Cons

Effective ONLY against timing errors

Metastability

D. Ernst, et al, Razor: circuit-level correction of timing errors for low-power operation, Micro, IEEE 24 (6), 2004
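The comparison between main flip-flop and shadow latch can be illustrated with a simple timing model. This is a behavioural sketch, not the circuit from the cited paper: the function `razor_stage`, its parameters and the time values are assumptions, and metastability is not modelled.

```python
def razor_stage(old_value, new_value, arrival_time, clock_period, shadow_delay):
    """Model one register: the main flip-flop samples at the clock edge,
    the shadow latch samples shadow_delay later; a mismatch flags an error."""

    def sampled(at_time):
        # The new value is visible only if the data arrived before sampling.
        return new_value if arrival_time <= at_time else old_value

    main = sampled(clock_period)
    shadow = sampled(clock_period + shadow_delay)
    return main, shadow, main != shadow


on_time = razor_stage(0, 1, arrival_time=0.8, clock_period=1.0, shadow_delay=0.5)
late = razor_stage(0, 1, arrival_time=1.2, clock_period=1.0, shadow_delay=0.5)
too_late = razor_stage(0, 1, arrival_time=1.7, clock_period=1.0, shadow_delay=0.5)
print(on_time, late, too_late)
```

In this model a value that arrives after the clock edge but before the shadow sampling point is detected, and the shadow latch holds the correct data for recovery. A value that arrives even after the shadow sampling point goes unnoticed, so the shadow delay bounds how aggressively the clock period can be shortened.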


Bubble-RAZOR Concept

Bubble-RAZOR was recently proposed for processor-based architectures

When an error is detected, bubbles are pushed to the neighboring stages

This provides the time slots needed for error recovery

Pros

Further reduced complexity

Effective against timing errors

Cons

Effective only for processor-based systems

M. Fojtik, et al, Bubble razor: An architecture-independent approach to timing-error detection and correction, in: Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2012 IEEE International, 2012, pp. 488-490.


Consequences and Solutions

Redundancy = Power/Area increase and performance drop!

The overhead could be more than N times the baseline

The classical FT techniques could be performed in a more optimal way:

Static solutions: limiting the level of fault tolerance but significantly reducing overhead (partial FT, ECC codes etc.)

Dynamic (adaptive) solutions: enabling system adaptivity, using overhead only when it is needed

Adaptivity is the key requirement for complex system implementation in advanced technologies

Example solutions:

Static - partial and selective fault tolerance, EDPEC, FEDC

Adaptive - NMR power control, adaptive ECC, adaptive MPSoCs

Resilience: addressing different challenges with the same mechanisms

Faults, PVT variations, security


Static Solutions


Partial Fault Tolerance

We can reduce the overhead if we protect with redundancy only the most critical parts of the system (control logic, command/status registers)

This method is called partial fault tolerance

However, some parts of the system remain fully unprotected

Example: Design of a Digital Beamforming Network (DBFN) processor for synthetic aperture radar (EU Project DIFFERENT)

Digital baseband IC in IHP technology

DBFN chip tape-out Aug 2015 – 46 mm² in SGB25V

Tested on wafer and operational up to 250 MHz!

Optimized overhead (saving 20% in area and 33% in power)

Radiation hardened TMR flip-flops in control logic

Standard flip-flops in datapath

[Figure: DBFN Baseband Processor, 46 mm² in SGB25V, Sep 2015]


Selective Fault Tolerance Architecture

Fault tolerance is guaranteed only for input assignments of critical tasks, since it is not required for other signals

We can make the trade-off between the FT level and area increase

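[Figure: the input x feeds the system S1 (the original system S) and two further systems s2 and s3, both marked "optimize"; their outputs y1, y2 and y3 go to the voter V, which produces the output y.]

\[
s_2(x) =
\begin{cases}
S_1(x) & \text{if input is critical} \\
0 & \text{otherwise}
\end{cases}
\]

\[
s_3(x) =
\begin{cases}
S_1(x) & \text{if input is critical} \\
1 & \text{otherwise}
\end{cases}
\]

*Reducing the Area Overhead of TMR-Systems by Protecting Specific Signals, M. Augustin, M. Gössel, R. Kraemer, Proc. IEEE IOLTS 2010

*Eine neue Fehlertoleranzmethode zur Verringerung des Flächenaufwandes von TMR-Systemen, M. Augustin, M. Gössel, R. Kraemer, Proc. ZuE 2010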
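The definitions of s2 and s3 can be tried out on a toy example. The sketch below assumes a single-bit output, picks a 3-bit parity function as the system S and an arbitrary set of critical inputs; all of these choices, and the names `S`, `CRITICAL` and `voted_output`, are illustrative assumptions rather than the design from the cited papers.

```python
def S(x: int) -> int:
    """Example system: parity of a 3-bit input (stands in for S1)."""
    return bin(x).count("1") % 2


CRITICAL = {0b000, 0b111}      # assumed set of critical input assignments


def s2(x: int) -> int:
    return S(x) if x in CRITICAL else 0     # constant 0 for other inputs


def s3(x: int) -> int:
    return S(x) if x in CRITICAL else 1     # constant 1 for other inputs


def vote(a: int, b: int, c: int) -> int:
    return (a & b) | (a & c) | (b & c)


def voted_output(x: int, fault_in=None) -> int:
    """Voter output; fault_in (0, 1 or 2) flips the output of one system."""
    outputs = [S(x), s2(x), s3(x)]
    if fault_in is not None:
        outputs[fault_in] ^= 1
    return vote(*outputs)


for x in range(8):
    print(f"{x:03b}", "critical" if x in CRITICAL else "other   ",
          [voted_output(x, f) == S(x) for f in (None, 0, 1, 2)])
```

For non-critical inputs s2 and s3 deliver 0 and 1, so the voter simply follows S1 and the fault-free function is preserved. Single faults are masked only for the critical inputs, which is what allows s2 and s3 to be optimized into smaller circuits.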


Improvements of Selective Fault Tolerance

Methodology could be easily integrated in standard design flow

Selective Fault Tolerance applicable to real industrial designs

The reduction of area overhead compared to TMR is significant

Near to the computationally very intensive solution

The protection of 20% of all possible input/output assignments leads to an area reduction of one complete system compared to TMR


Improvements of Selective Fault Tolerance

The concept of Selective Fault Tolerance was also applied to sequential circuits (FSMs)

The protection of 20% of all possible state transitions leads to an area reduction of nearly one complete system compared to TMR




Error Detection and Partial Error Correction (EDPEC) Architecture

Circuit   Dynamic Power [mW]   Area Comb [mm²]   Area Sequential [mm²]   Area Total [mm²]
TMR       0.600                0.0421            0.0125                  0.0470
EDPEC     0.442                0.0344            0.0105                  0.0380

Architecture optimized for effective error detection

Most of the soft errors could be corrected as well

Self-checking based on prediction circuits, which are less complex than hardware multiplication

Hardware/power overhead reduced compared with TMR

Saves around 25% area/power

Critical errors appearing near to the clock edge

Milos Krstic, et al., Improved circuitry for soft error correction in combinational logic in pipelined designs. IOLTS 2014: 93-98


Full Error Detection and Correction (FEDC)

Circuit   Dynamic Power [mW]   Area Comb [mm²]   Area Sequential [mm²]   Area Total [mm²]
TMR       0.600                0.0421            0.0125                  0.0470
FEDC      0.432                0.035             0.0126                  0.0418

Focused on full error correction

All injected SETs could be corrected

Suitable for long transients as well

Effective against timing errors

Hardware/power overhead reduced compared with TMR

Saves around 28% power

Milos Krstic, et al., Enhanced Architectures for Soft Error Detection and Correction in Combinational and Sequential Circuits, Microelectronics Reliability, 2016
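The relative savings can be recomputed from the dynamic power and total area columns of the EDPEC and FEDC tables. The sketch below only repeats that arithmetic; the dictionary layout and the function name `relative_saving` are assumptions for illustration.

```python
# Values copied from the TMR, EDPEC and FEDC table rows.
TMR = {"power_mw": 0.600, "area_total_mm2": 0.0470}
ALTERNATIVES = {
    "EDPEC": {"power_mw": 0.442, "area_total_mm2": 0.0380},
    "FEDC": {"power_mw": 0.432, "area_total_mm2": 0.0418},
}


def relative_saving(baseline: float, value: float) -> float:
    """Fraction saved with respect to the baseline."""
    return (baseline - value) / baseline


savings = {
    name: {key: relative_saving(TMR[key], row[key]) for key in TMR}
    for name, row in ALTERNATIVES.items()
}

for name, row in savings.items():
    print(f"{name}: power {row['power_mw']:.1%}, "
          f"total area {row['area_total_mm2']:.1%}")
```

With these table values the FEDC power saving comes out at 28%, and for both architectures the saving in total area is smaller than the saving in dynamic power.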


Dynamic Methods



Protecting Memory

Error Correcting Codes are the usual approach for fault protection

Hamming Code, BCH, Hsiao Code

Satisfactory level of fault tolerance with limited overhead

The most challenging issue at smaller feature sizes will be leakage power, coming mostly from memories.

Non-volatile memories have zero or very small leakage power

Universal memories (PCM, RRAM) as the next big step towards low power

Variable-Strength Error-Correcting Codes

Even when the reliability of the system is sufficient (without ECC) at normal Vcc, it is usually lower at reduced Vcc.

ECC deals with memory cells which are unreliable at lower Vcc.

VS-ECC design achieves an 84% power reduction and a 50% energy reduction compared to SECDED ECC, and achieves a 26% power reduction and an 11% energy reduction *

*Alameldeen et al., “Energy-Efficient Cache Design Using Variable-Strength Error-Correcting Codes”



Dynamic Methods on the MPSoCs Level

Multiprocessors have varying application requirements

performance

dependability (fault-tolerance, lifetime, …)

power consumption

E.g. Earth observation satellite

image processing

orbit change

waiting for a new task

Target: NMR mechanisms, lifetime aspect

Self-repairable system based on configurable micro-operation units, selected by the programmer

Timing-critical applications are also considered

Dynamically adapting to the application requirements

Trade-offs between higher endurance, fault-tolerance, performance and power efficiency

A. Simevski, R. Kraemer, M. Krstic, Investigating Core-Level N-Modular Redundancy in Multiprocessors, IEEE MCSoC-14



Architectural Framework for Adaptable Multiprocessors

Framework addressing fault-tolerance and aging effects for space and automotive applications

Aging monitors

Automated HW/SW verification, design & test

Programmable NMR voters

Modes of operation

De-stress

Fault-tolerant

High-performance

Dynamic mode change depending on the application requirements

[Figure: Core 1, Core 2, Core 3 and Core 4 connected to a central Scheduler + Voter.]


De-stress mode

[Figure: animation over five slides. Core 1, Core 2, Core 3 and Core 4 are shown on a timeline divided into periods T; the active core changes from period to period while the other cores are inactive. Legend: active core, inactive core.]

De-stress mode (2 active cores)

[Figure: the same four cores and timeline with two active cores in a period T.]


Youngest-First Round Robin (YFRR)

The initial age may not be equal even between cores on the same die!

[Figure: animation over several slides. Core 1, Core 2, Core 3 and Core 4 start with different ages (Age 2, Age 3, Age 4, Age 5) and are activated over successive periods T on a timeline. Annotations in order of appearance: Core 1 reached age 3; Cores 1 and 2 reached age 4; Cores 1, 2 and 3 reached age 5.]

Extend lifetime by equalizing the age of the cores
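The youngest-first idea can be sketched as a small scheduling loop. The aging model here is an assumption made for illustration (a core ages by one unit per active period, one core is active at a time, ties are broken in round-robin order), and the function name `youngest_first_round_robin` is hypothetical; the slides do not specify these details.

```python
def youngest_first_round_robin(initial_ages, periods, wear=1):
    """Activate the youngest core in every period T.

    Ties are resolved in round-robin order, starting after the core that
    was active last. Returns the schedule (core indices) and final ages.
    """
    ages = list(initial_ages)
    schedule = []
    last = -1
    n = len(ages)
    for _ in range(periods):
        youngest_age = min(ages)
        for offset in range(1, n + 1):
            candidate = (last + offset) % n
            if ages[candidate] == youngest_age:
                break
        ages[candidate] += wear          # only the active core ages
        schedule.append(candidate)
        last = candidate
    return schedule, ages


schedule, ages = youngest_first_round_robin([2, 3, 4, 5], periods=6)
print("active core per period:", [core + 1 for core in schedule])
print("ages after 6 periods:  ", ages)
```

Under these assumptions the youngest cores take most of the early periods until all ages are equal; from then on the policy behaves like plain round robin.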


Fault-tolerant mode

Core-level NMR – voting in each clock cycle !

Masking faults, no need for instant recoveries for N>2

Dynamic reconfiguration – NMR on-demand

[Figure: Core 1, Core 2, Core 3 and Core 4 connected to a programmable NMR voter. Labeled signals: pb1, pb2, pb3, pb4, voting output, ISD, err.]
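A behavioural sketch of NMR on demand is given below. It assumes that the voter is programmed with one enable flag per core and that it reports both a voted value and an error flag; the function `programmable_vote` and this interface are hypothetical and do not describe the actual voter hardware or its signals.

```python
from collections import Counter


def programmable_vote(outputs, enabled):
    """Vote over the outputs of the cores that currently take part.

    Returns (value, err): value is the strict-majority result, or None if
    no strict majority exists; err is True whenever the cores disagree.
    """
    active = [out for out, on in zip(outputs, enabled) if on]
    if not active:
        raise ValueError("at least one core must be enabled")
    value, count = Counter(active).most_common(1)[0]
    err = count != len(active)               # any disagreement at all
    has_majority = count > len(active) / 2   # fault can be masked
    return (value if has_majority else None), err


tmr = programmable_vote([5, 5, 9, 5], [True, True, True, False])
dmr = programmable_vote([5, 9, 0, 0], [True, True, False, False])
all_four = programmable_vote([5, 5, 5, 5], [True, True, True, True])
print(tmr, dmr, all_four)
```

With three participating cores a single faulty output is masked and only reported, which matches the remark that no instant recovery is needed for N>2. With two cores the mismatch is detected but no value can be selected.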


Implementation and Results

Test ASIC named FMP implemented and tested

8-core system based on 32-bit internal processor

Fully functional in IHP 130 nm technology

Youngest-First Scheduling Methodology (YFSM) could increase the system lifetime by up to 31%


Resilience methods


What is Resilience?

Resilience - "an ability to recover from or adjust easily to change" (Merriam-Webster)

What can this change be in our digital systems?

Environmental changes: voltage/temperature variations

External effects (radiation (SEEs))

Ageing and manufacturing issues

Security threats (side channel attacks)

We need some measures to address those changes

We already know that addressing faults will cause some overhead

However, addressing the other aspects also leads to overhead

Example: Supply management unit and its overhead

Synergy needed in addressing those challenges!

This is how the overhead needs to be optimized


Example: PISA System

Power robust IC design for Space Applications

Leon-based multiprocessor using IHP's framework

De-stress (Power Gating, Clock Gating, Adaptive Voltage Scaling)

Fault-tolerant (Core-level NMR-on-Demand, ECC)

High-performance

Addressing at the same time:

Soft errors induced by particle hits

Voltage variation induced errors


Resilient Multiprocessor Architecture


Four LEON2 cores using the Waterbear framework + AVS (Adaptive Voltage Scaling)


Waterbear framework controller

Power management

Clock management

Framework control (e.g., modes: de-stress, fault-tolerant, hi-performance)

Error management (both from fault-tolerant mode and ECC)

Aging observation and control (aging monitors)

Temperature sensors

Other management functions (AHB priority, SRAM enable/disable, …)

Synergetic integration of the different fault detection and correction mechanisms

Overhead optimization


PISA Chip

Specifications

Voltage regulator with 20 discrete steps, controlling the core voltage in the range [0.8 V – 1.2 V]; power- and clock-gating of the domains

Size

7 mm x 7 mm

Power

(static) 173mW @ 0.2 input activity

Proc. core (dynamic): 130mV – 30mV

Tape-out September 2017

[Figure: PISA chip layout with Core 0, Core 1, Core 2 and Core 3.]


Conclusion

Fault tolerance always causes significant hardware & power overhead

Different methods exist to limit this overhead

This always presents a trade-off between achieved fault tolerance and performance

Two basic approaches presented:

Static techniques for reducing power overhead, but also reducing fault protection

Adaptive techniques enabling a dynamic power-reliability trade-off

Addressing the challenges in a synergetic way is of utmost importance

The ultimate target is an optimal resilient system

Thank you for your attention!

Milos Krstic

IHP – Innovations for High Performance Microelectronics
Im Technologiepark 25
15236 Frankfurt (Oder)
Germany

www.ihp-microelectronics.com

Phone: +49 (0) 335 5625 729
Fax: +49 (0) 335 5625 671
Email: [email protected]
