optimizing design of fault-tolerant computing systems · 2017. 10. 23. · optimizing design of...

Optimizing Design of Fault-tolerant Computing Systems Milos Krstic HDT 2017, 1st Workshop on Hardware Design and Theory,


Page 1: Optimizing Design of Fault-tolerant Computing Systems · 2017. 10. 23. · Optimizing Design of Fault-tolerant Computing Systems Milos Krstic HDT 2017, ... Fault tolerance is guaranteed

Optimizing Design of Fault-tolerant Computing Systems

Milos Krstic

HDT 2017, 1st Workshop on Hardware Design and Theory,


www.ihp-microelectronics.com © 2015 - All rights reserved

Agenda

1.10.2015 2

1 Motivation

2 Fault Tolerant Methods

3 Methods for reducing the overhead in Fault tolerant Systems

4 Static and dynamic methods: examples

5 System resilience

6 Conclusions


Motivation and goals

Fault tolerance is a traditional requirement of applications such as space or avionics.

Today's scaled technologies, because of their reliability issues, require more and more fault tolerance measures even for mainstream applications.

Fault tolerance is always achieved with some (significant!) cost.

The open question is how to limit the overhead imposed by fault tolerance.

Two important strategies:

Advanced static low-overhead techniques, which provide a certain level of fault tolerance with limited overhead

Adaptivity techniques, which enable fault tolerance only when required


Example: Effects induced by Radiation

Irradiation effects are present not just in space applications.

Radiation effects can be split into two general categories:

Cumulative Effects

• Ionization

• Displacement

(can also be caused by aging effects)

• Enhanced Low-Dose-Rate Sensitivity (ELDRS)

Single Event Effects

Soft errors:

• Neutron Single Event Upset (NSEU)

• Single Event Transient (SET)

• Single Event Upset (SEU)

• Single Event Functional Interrupt (SEFI)

Hard errors:

• Single Event Latchup (SEL)

• Single Event Gate Rupture (SEGR)

• Single Event Burnout (SEBR)

The effects need to be addressed by corresponding fault tolerant measures.


Fault-tolerant Techniques

Almost all fault-tolerant techniques are based on redundancy:

Hardware redundancy (N-modular, triple and double modular redundancy)

Information redundancy (error detection and correction)

Time redundancy

Software redundancy


Hardware Redundancy with TMR

The general approach is N-modular redundancy

Such systems are also known as M-of-N systems

N=3, Triple Modular Redundancy (TMR)

N=2, Dual Modular Redundancy (DMR)

How does TMR work?

[Figure, three frames: three redundant modules drive the output D. Frame 1: all three modules deliver m, and the output is m. Frame 2: a transient or permanent fault makes one module deliver x ≠ m; the other two still deliver m, and the output is m. Frame 3: faults make two modules deliver x ≠ m and y ≠ m, and the output is undetermined (?).]

TMR is a 2-of-3 system!
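The 2-of-3 behavior shown in the TMR frames can be sketched in a few lines of Python. This is only an illustration: the function names `majority_vote`, `tmr_vote` and `bitwise_tmr` are hypothetical and do not come from the slides, and returning `None` for "no majority" is an assumption about how the undetermined case is reported.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable], m_required: int) -> Optional[Hashable]:
    """M-of-N vote: return the value delivered by at least m_required modules.

    Returns None when no value reaches the threshold (assumed convention).
    """
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= m_required else None


def tmr_vote(a, b, c):
    """2-of-3 vote over the outputs of three redundant modules."""
    return majority_vote((a, b, c), m_required=2)


def bitwise_tmr(a: int, b: int, c: int) -> int:
    """Gate-level style voter: each output bit is the majority of the three input bits."""
    return (a & b) | (a & c) | (b & c)
```

With one faulty module, `tmr_vote(9, 5, 5)` still yields 5. With two modules delivering different wrong values the vote has no majority, and if two faulty modules happen to agree, their wrong value wins the vote.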


Limits of the simple TMR-Circuit

• Voter (V) is the single point of failure!

[Figure: two simple TMR stages, each with Modules 1 to 3 feeding a single voter V and a flip-flop (D1/Q1 and D2/Q2); a transient pulse is marked in the diagram.]


Alternative TMR-Architecture

• Full TMR-architecture, including TMR-Voter

[Figure: Modules 1 to 3, three voters (V), and three flip-flops with inputs D1 to D3 and outputs Q1 to Q3.]


Information Redundancy

The TMR approach can also be used (or seen) as information redundancy

It is not a very efficient one

There are error correction codes (ECCs) which can reduce the overhead

Example: Hamming code (26 data bits, 5 parity bits, overhead 16%, i.e. 5 of 31 bits)

Single error correction and double error detection

Easily applicable to homogeneous structures (for example SRAM) or to communication packets.
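To make the Hamming example concrete, the following is a minimal sketch of a (31,26) Hamming code with 5 parity bits, which corresponds to 5/31 ≈ 16% of the stored bits. The function names (`encode`, `syndrome`, `decode`) and the bit layout (parity bits at positions 1, 2, 4, 8, 16) are assumptions chosen for illustration, not taken from the slides. The sketch only corrects single-bit errors; detecting double errors on top of that usually takes one more overall parity bit, which is left out here.

```python
PARITY_POSITIONS = (1, 2, 4, 8, 16)
# The 26 remaining positions of the 31-bit word carry data.
DATA_POSITIONS = [p for p in range(1, 32) if p not in PARITY_POSITIONS]


def encode(data_bits):
    """Encode 26 data bits into a 31-bit codeword (returned as a list, position 1 first)."""
    if len(data_bits) != 26:
        raise ValueError("expected 26 data bits")
    word = [0] * 32  # index 0 is unused so that index == bit position
    for pos, bit in zip(DATA_POSITIONS, data_bits):
        word[pos] = bit
    for p in PARITY_POSITIONS:
        # Parity bit p covers every position whose index has bit p set.
        word[p] = sum(word[i] for i in range(1, 32) if i & p and i != p) % 2
    return word[1:]


def syndrome(codeword):
    """XOR of the positions of all 1 bits; 0 means no error was detected."""
    s = 0
    for pos, bit in enumerate(codeword, start=1):
        if bit:
            s ^= pos
    return s


def decode(codeword):
    """Correct at most one flipped bit; return (data_bits, error_position or 0)."""
    word = list(codeword)
    s = syndrome(word)
    if s:
        word[s - 1] ^= 1  # the syndrome equals the position of the flipped bit
    return [word[p - 1] for p in DATA_POSITIONS], s
```

Flipping any single bit of a codeword and calling `decode` returns the original data together with the position of the flipped bit.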


Time Redundancy

Time redundancy is performed by executing the same task multiple times in order to detect errors

Advantage

Hardware overhead is low or even nonexistent

Disadvantage

Performance is reduced

Energy consumption overhead still exists


Software Redundancy

• Redundancy can also be performed in software

• Execution of the algorithm several times and checking of the results

• Critical issue: if the software is executed in the same way each time, the repeated execution will not be able to find some errors

• The solution is to have different implementations of the target algorithm

• Overhead is similar to the time redundancy approach
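The idea of running different implementations of the same algorithm and checking their results can be sketched as follows. The example task (summing the integers 1 to n) and the names `sum_iterative`, `sum_closed_form` and `run_diverse` are hypothetical; they only illustrate the comparison step.

```python
def sum_iterative(n: int) -> int:
    """First implementation: explicit loop."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_closed_form(n: int) -> int:
    """Second, independently written implementation: closed-form formula."""
    return n * (n + 1) // 2


def run_diverse(n, variants=(sum_iterative, sum_closed_form)):
    """Run every variant and report (first result, whether all variants agree)."""
    results = [variant(n) for variant in variants]
    agreed = all(r == results[0] for r in results)
    return results[0], agreed
```

A disagreement only signals that an error occurred; deciding which variant is wrong needs a third variant or a re-execution.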


Overhead of Fault Tolerant Solutions

The overhead of TMR could be much higher than 3x

Example: TMR flip-flop

The TMR flip-flop is very effective against single event effects

On the other hand, the overhead is significant

[Figure: TMR Flip Flops. SEU cross section in SGB25 at VDD = 2.25 V, T = 27 °C. Y axis: cross section [cm²], logarithmic from 1.0E-11 to 1.0E-06. X axis: LET [MeV·cm²/mg], 0 to 80. Series: SGB25 Full-TMR, SGB25 Red-TMR, SGB25 Dlib-DICE. A minimal threshold is marked.]


Reducing the Overhead

Redundancy generates overhead

This can increase the power/area/performance budget several times

Reducing the overhead is possible

The basic method is to combine different redundancy methods (hardware, software, information, time) and to trade off the achieved level of fault tolerance against performance and power consumption

Example: DMR

DMR significantly reduces the overhead introduced by TMR

However, it can usually offer only error detection; correction needs to be done by architectural replay, which reduces performance

Merging standard low-power techniques and fault tolerant methods?
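A rough sketch of the DMR example: two copies are compared, and on a mismatch the computation is replayed. The names `dmr_with_replay`, `max_replays` and `TransientFault` are hypothetical, and the rule of giving up after a fixed number of replays is an assumption, not something the slides specify.

```python
class TransientFault:
    """Test helper: wraps a function and corrupts only its first result."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        y = self.fn(x)
        return y ^ 1 if self.calls == 1 else y  # flip the lowest bit once


def dmr_with_replay(module_a, module_b, x, max_replays=3):
    """Compare two redundant modules; on mismatch, replay until they agree.

    Returns (result, number_of_replays). A mismatch that survives all
    replays is treated as a permanent fault (assumed policy).
    """
    for attempt in range(max_replays + 1):
        ya, yb = module_a(x), module_b(x)
        if ya == yb:
            return ya, attempt
    raise RuntimeError("persistent mismatch: likely a permanent fault")
```

Each replay costs a full re-execution, which is the performance penalty the DMR example refers to.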


Applying Standard Low-Power Methods for Reliable Circuits

Standard low-power techniques could be applicable to fault-tolerant circuits

Clock gating

• Could cause issues for space applications if the fault tolerance is based on hardware redundancy

• Error accumulation is possible, since clock-driven fault recovery does not take place while the clock is gated

• State recovery is needed after the clock gating phase

DVFS

• Fault tolerant clock and voltage control needed

• Voltage and frequency changes could affect the susceptibility to events

• Characterization needed

Power gating

• Fault tolerant power switches needed


Fault-Tolerance to achieve Optimized Design Operation

In modern scaled systems, power needs to be reduced beyond the worst-case corners

We need better than worst-case timing!

As a result, speculative execution is needed

The clock cycle is adaptive and could be shorter than the critical path delay

Such low-power circuits must be fault tolerant!

Important priorities are:

Overhead for fault tolerance must be minimized (POWER!, area, timing)

Focus on fault detection => correction by recovery

Target: timing faults


RAZOR Concept

RAZOR is a very popular approach to achieve fault tolerance against timing errors

It is based on a shadow latch, which samples the data on the alternate clock edge

The responses of the main flip-flop and the shadow latch are compared, and a timing fault is identified when they differ

Pros

Reduced complexity compared with TMR/DMR

Effective against timing errors

Cons

Effective ONLY against timing errors

Metastability

D. Ernst et al., Razor: circuit-level correction of timing errors for low-power operation, IEEE Micro 24 (6), 2004
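The comparison between the main flip-flop and the shadow latch can be illustrated with a purely behavioral model. This is a simplified sketch, not the circuit from the cited paper: the names `razor_capture`, `arrival`, `period` and `shadow_delay` are hypothetical, setup/hold times and metastability are ignored, and the shadow latch is assumed to sample a fixed delay after the main clock edge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RazorSample:
    main: int    # value captured by the main flip-flop
    shadow: int  # value captured later by the shadow latch

    @property
    def error(self) -> bool:
        """A timing error is flagged when the two captured values differ."""
        return self.main != self.shadow


def razor_capture(old, new, arrival, period, shadow_delay):
    """Model one capture.

    The data input switches from `old` to `new` at time `arrival`.
    The main flip-flop samples at `period`, the shadow latch at
    `period + shadow_delay` (assumed timing model).
    """
    main = new if arrival <= period else old
    shadow = new if arrival <= period + shadow_delay else old
    return RazorSample(main, shadow)
```

In this model, data arriving after the main edge but before the shadow sample is detected as a timing error. Data arriving later than the shadow sample is missed by both elements, so the detection window is bounded by `shadow_delay`.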


Bubble-RAZOR Concept

Bubble-RAZOR was recently proposed for processor-based architectures

When an error is detected, bubbles are pushed to the neighboring stages

This provides the slots for error recovery

Pros

Further reduced complexity

Effective against timing errors

Cons

Effective only for processor-based systems

M. Fojtik et al., Bubble razor: An architecture-independent approach to timing-error detection and correction, in: Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2012 IEEE International, 2012, pp. 488-490.


Consequences and Solutions

Redundancy = power/area increase and performance drop!

The overhead could be more than N times the baseline

The classical FT techniques could be performed in a more optimal way:

Static solutions: limiting the level of fault tolerance but significantly reducing overhead (partial FT, ECC codes etc.)

Dynamic (adaptive) solutions: enabling system adaptivity, using overhead only when it is needed

Adaptivity is the key requirement for complex system implementation in advanced technologies

Example solutions:

Static: partial and selective fault tolerance, EDPEC, FEDC

Adaptive: NMR power control, adaptive ECC, adaptive MPSoCs

Resilience: addressing different challenges with the same mechanisms

Faults, PVT variations, security


Static Solutions


Partial Fault Tolerance

We can reduce the overhead if we protect with redundancy only the parts of the system which are most critical (control logic, command/status registers)

This method is called partial fault tolerance

However, some parts of the system remain fully unprotected

Example: Design of a Digital Beamforming Network (DBFN) processor for synthetic aperture radar (EU Project DIFFERENT)

Digital baseband IC in IHP technology

DBFN chip tape-out Aug 2015, 46 mm² in SGB25V

Tested on wafer and operational up to 250 MHz!

Optimized overhead (saving 20% in area and 33% in power)

Radiation-hardened TMR flip-flops in control logic

Standard flip-flops in datapath

[Figure: DBFN Baseband Processor, 46 mm² in SGB25V, Sep 2015]


Selective Fault Tolerance Architecture

Fault tolerance is guaranteed only for input assignments of critical tasks, since it is not required for other signals

We can make the trade-off between the FT level and area increase

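[Figure: selective fault tolerance architecture. The original system S (labeled S_1) and two systems s_2 and s_3, both marked "optimize", receive the same input x and produce the outputs y_1, y_2 and y_3, which a voter V combines into the output y. Bus widths n and m are annotated.]

$$ s_2(x) = \begin{cases} S_1(x) & \text{if input is critical} \\ 0 & \text{otherwise} \end{cases} $$

$$ s_3(x) = \begin{cases} S_1(x) & \text{if input is critical} \\ 1 & \text{otherwise} \end{cases} $$

*Reducing the Area Overhead of TMR-Systems by Protecting Specific Signals, M. Augustin, M. Gössel, R. Kraemer, Proc. IEEE IOLTS 2010

*Eine neue Fehlertoleranzmethode zur Verringerung des Flächenaufwandes von TMR-Systemen, M. Augustin, M. Gössel, R. Kraemer, Proc. ZuE 2010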
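The selective scheme on this slide, where the second system outputs 0 and the third outputs 1 for every non-critical input, can be sketched for a 1-bit output as follows. The helper names (`majority`, `make_selective_tmr`, `fault_in`) and the parity example are hypothetical and only serve to show the voting behavior.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority."""
    return (a & b) | (a & c) | (b & c)


def make_selective_tmr(S, is_critical):
    """Build s2, s3 and the voted system from an original 1-bit function S."""

    def s2(x):
        return S(x) if is_critical(x) else 0

    def s3(x):
        return S(x) if is_critical(x) else 1

    def voted(x, fault_in=None):
        """Evaluate all three systems; fault_in (1, 2 or 3) flips one output."""
        outputs = [S(x), s2(x), s3(x)]
        if fault_in is not None:
            outputs[fault_in - 1] ^= 1
        return majority(*outputs)

    return s2, s3, voted


# Hypothetical example: S is the parity of a 3-bit input, two inputs are critical.
parity = lambda x: bin(x).count("1") % 2
critical_inputs = {0b000, 0b111}
s2, s3, voted = make_selective_tmr(parity, lambda x: x in critical_inputs)
```

For non-critical inputs the constants 0 and 1 cancel in the vote, so the output simply follows the original system and a fault there is not masked. For critical inputs all three systems compute the same value and any single fault is outvoted. The constants presumably leave more room to shrink s2 and s3 during logic optimization, which would match the "optimize" label in the figure.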


Improvements of Selective Fault Tolerance

Methodology could be easily integrated in standard design flow

Selective Fault Tolerance applicable to real industrial designs

The reduction of area overhead compared to TMR is significant

Results are near to those of the computationally very intensive solution

The protection of 20% of all possible input/output assignments leads to an area reduction of one complete system compared to TMR


Improvements of Selective Fault Tolerance

The concept of Selective Fault Tolerance was also applied to sequential circuits (FSMs)

The protection of 20% of all possible state transitions leads to an area reduction of nearly one complete system compared to TMR


Error Detection and Partial Error Correction (EDPEC) Architecture

Circuit   Dynamic Power [mW]   Area Comb [mm²]   Area Sequential [mm²]   Area Total [mm²]
TMR       0.600                0.0421            0.0125                  0.0470
EDPEC     0.442                0.0344            0.0105                  0.0380

Milos Krstic, et al., Improved circuitry for soft error correction in combinational logic in pipelined designs. IOLTS 2014: 93-98

Architecture optimized for effective error detection

Most of the soft errors could be corrected as well

Self-checking based on prediction circuits, which are less complex than hardware multiplication

Hardware/power overhead reduced compared with TMR

Saves around 25% area/power

Errors appearing near the clock edge remain critical


Full Error Detection and Correction (FEDC)

Circuit   Dynamic Power [mW]   Area Comb [mm²]   Area Sequential [mm²]   Area Total [mm²]
TMR       0.600                0.0421            0.0125                  0.0470
FEDC      0.432                0.035             0.0126                  0.0418

Milos Krstic, et al., Enhanced Architectures for Soft Error Detection and Correction in Combinational and Sequential Circuits, Microelectronics Reliability, 2016

Focused on full error correction

All injected SETs could be corrected

Suitable for long transients as well

Effective against timing errors

Hardware/power overhead reduced compared with TMR

Saves around 28% power
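The savings quoted for EDPEC and FEDC can be checked against the table values with a small calculation. The dictionary layout and the function name `saving_percent` are hypothetical; the numbers are the dynamic power and total area values from the two tables.

```python
# Values copied from the EDPEC and FEDC tables (power in mW, total area in mm²).
TMR = {"power": 0.600, "area_total": 0.0470}
EDPEC = {"power": 0.442, "area_total": 0.0380}
FEDC = {"power": 0.432, "area_total": 0.0418}


def saving_percent(baseline: float, value: float) -> float:
    """Relative saving of `value` against `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


for name, circuit in (("EDPEC", EDPEC), ("FEDC", FEDC)):
    power = saving_percent(TMR["power"], circuit["power"])
    area = saving_percent(TMR["area_total"], circuit["area_total"])
    print(f"{name}: power saving {power:.1f}%, total area saving {area:.1f}%")
```

By this arithmetic, EDPEC saves about 26% power and about 19% total area against TMR, and FEDC saves 28% power and about 11% total area.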


Dynamic Methods


Protecting Memory

Error Correcting Codes are the usual approach for fault protection

Hamming Code, BCH, Hsiao Code

Satisfactory level of fault tolerance with limited overhead

The most challenging issue at smaller feature sizes will be leakage power, coming mostly from memories.

Non-volatile memories have zero or very small leakage power

Universal memories (PCM, RRAM) as the next big step towards low power

Variable-Strength Error-Correcting Codes

Even when the reliability of the system is sufficient (without ECC) at normal Vcc, it is usually lower at reduced Vcc.

ECC deals with the memory cells which are unreliable at lower Vcc.

VS-ECC design achieves an 84% power reduction and a 50% energy reduction compared to SECDED ECC, and achieves a 26% power reduction and an 11% energy reduction *

*Alameldeen et al., “Energy-Efficient Cache Design Using Variable-Strength Error-Correcting Codes”


Dynamic Methods on the MPSoCs Level

Multiprocessors have varying application requirements

performance

dependability (fault-tolerance, lifetime, …)

power consumption

E.g. Earth observation satellite

image processing

orbit change

waiting for a new task

Target: NMR mechanisms, lifetime aspect

Self-repairable system based on configurable micro-operation units, selected by the programmer

Timing-critical applications are also considered

Dynamically adapting to the application requirements

Trade-offs between higher endurance, fault-tolerance, performance and power efficiency

A. Simevski, R. Kraemer, M. Krstic, Investigating Core-Level N-Modular Redundancy in Multiprocessors, IEEE MCSoC-14


Architectural Framework for Adaptable Multiprocessors

Framework addressing fault-tolerance and aging effects for space and automotive applications

Aging monitors

Automated HW/SW verification, design & test

Programmable NMR voters

Modes of operation

• De-stress

• Fault-tolerant

• High-performance

Dynamic mode change depending on the application requirements

[Figure: Cores 1 to 4 connected to a Scheduler + Voter block.]


De-stress mode

[Figure: animation of de-stress mode on a four-core system (Cores 1 to 4). A timeline divided into periods T shows which cores are active and which are inactive in each period; the last frame shows de-stress mode with 2 active cores.]

Youngest-First Round Robin (YFRR)

The initial age may not be equal even between cores on the same die!

[Figure: animation of YFRR scheduling for Cores 1 to 4 over successive periods T, with age levels Age 2 to Age 5. Annotations appear as the animation advances: Core 1 reached age 3; Cores 1 and 2 reached age 4; Cores 1, 2 and 3 reached age 5.]

www.ihp-microelectronics.com © 2015 - All rights reserved

Age 2 Age 3 Age 4 Age 5

time

T T T T T T T T T T T T T T T T T T T T T

Core 1 reached age 3

Cores 1 and 2 reached age 4

Cores 1, 2 and 3 reached age 5

The initial age may not be equal even between cores on the same die!

Core 1 Core 2

Core 4 Core 3

Youngest-First Round Robin (YFRR)

58

Page 59: Optimizing Design of Fault-tolerant Computing Systems · 2017. 10. 23. · Optimizing Design of Fault-tolerant Computing Systems Milos Krstic HDT 2017, ... Fault tolerance is guaranteed

www.ihp-microelectronics.com © 2015 - All rights reserved

Age 2 Age 3 Age 4 Age 5

time

T T T T T

Core 1 Core 2

Core 4 Core 3

T T T T T T T T T T T T T T T T T

Core 1 reached age 3

Cores 1 and 2 reached age 4

Cores 1, 2 and 3 reached age 5

The initial age may not be equal even between cores on the same die!

Youngest-First Round Robin (YFRR)

59

Page 60: Optimizing Design of Fault-tolerant Computing Systems · 2017. 10. 23. · Optimizing Design of Fault-tolerant Computing Systems Milos Krstic HDT 2017, ... Fault tolerance is guaranteed

www.ihp-microelectronics.com © 2015 - All rights reserved

Age 2 Age 3 Age 4 Age 5

time

T T T T T T T T T T T T T T T T T T T T T T T

Core 1 reached age 3

Cores 1 and 2 reached age 4

Cores 1, 2 and 3 reached age 5

The initial age may not be equal even between cores on the same die!

Core 1 Core 2

Core 4 Core 3

Youngest-First Round Robin (YFRR)

60

Page 61: Optimizing Design of Fault-tolerant Computing Systems · 2017. 10. 23. · Optimizing Design of Fault-tolerant Computing Systems Milos Krstic HDT 2017, ... Fault tolerance is guaranteed

www.ihp-microelectronics.com © 2015 - All rights reserved

Age 2 Age 3 Age 4 Age 5

time

T T T T T T T T T T T T T T T T T T T T T T T T

Core 1 reached age 3

Cores 1 and 2 reached age 4

Cores 1, 2 and 3 reached age 5

The initial age may not be equal even between cores on the same die!

Core 1 Core 2

Core 4 Core 3

Youngest-First Round Robin (YFRR)

61

Page 62: Optimizing Design of Fault-tolerant Computing Systems · 2017. 10. 23. · Optimizing Design of Fault-tolerant Computing Systems Milos Krstic HDT 2017, ... Fault tolerance is guaranteed

www.ihp-microelectronics.com © 2015 - All rights reserved

Youngest-First Round Robin (YFRR)

[Figure: timeline of tasks (T) over time across Age 2, Age 3, Age 4 and Age 5 for Core 1, Core 2, Core 3 and Core 4. Annotations: Core 1 reached age 3; Cores 1 and 2 reached age 4; Cores 1, 2 and 3 reached age 5.]

The initial age may not be equal even between cores on the same die!

Extend lifetime by equalizing the age of the cores
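
The youngest-first idea can be illustrated with a minimal sketch. It is not the scheduler used on the chip: the integer age model, the fixed `wear_per_task` increment, and the function name `yfrr_schedule` are assumptions made purely for illustration. Each task goes to the core with the lowest age, and ties are broken in round-robin order.

```python
def yfrr_schedule(initial_ages, num_tasks, wear_per_task=1):
    """Assign each task to the currently youngest core (hypothetical model).

    Ties between equally young cores are broken round-robin, starting
    with the core after the one that was used most recently.
    Returns the list of chosen core indices and the final ages.
    """
    ages = list(initial_ages)
    n = len(ages)
    last = -1  # index of the most recently used core
    assignment = []
    for _ in range(num_tasks):
        youngest = min(ages)
        # Scan the cores in round-robin order and take the first youngest one.
        for offset in range(1, n + 1):
            core = (last + offset) % n
            if ages[core] == youngest:
                break
        ages[core] += wear_per_task  # assumed: every task ages the core equally
        assignment.append(core)
        last = core
    return assignment, ages


# Core 0 starts out older than the others, as can happen on a real die.
assignment, final_ages = yfrr_schedule([3, 2, 2, 2], num_tasks=6)
print(assignment, final_ages)
```

In this toy model the older core is skipped until the others have caught up, so the age spread across cores never exceeds one wear step.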


Fault-tolerant mode

Core-level NMR: voting in each clock cycle!

Masking faults, no need for instant recoveries for N>2

Dynamic reconfiguration – NMR on-demand

[Figure: Core 1, Core 2, Core 3 and Core 4 connected to a programmable NMR voter. Labels: pb1, pb2, pb3, pb4; Voting output; ISD; err.]
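
As a rough software illustration of core-level NMR voting, the sketch below compares whole output words from the participating cores in one cycle. It is an assumption-laden model, not the hardware voter from the slide: the `enabled` mask is a hypothetical stand-in for the voter's programmability, and the function name `nmr_vote` is invented here.

```python
from collections import Counter


def nmr_vote(outputs, enabled):
    """Majority-vote the outputs of the enabled cores for one cycle.

    outputs: one output word per core.
    enabled: one flag per core; only flagged cores take part in the vote
             (hypothetical model of a programmable voter).
    Returns (value, err): value is the majority word, or None if there is
    no strict majority; err is True whenever the active cores disagree.
    """
    active = [o for o, e in zip(outputs, enabled) if e]
    if not active:
        raise ValueError("at least one core must be enabled")
    counts = Counter(active)
    value, votes = counts.most_common(1)[0]
    has_majority = 2 * votes > len(active)
    err = len(counts) > 1
    return (value if has_majority else None), err


# Three cores vote (TMR); the faulty third core is outvoted, so the fault is masked.
print(nmr_vote([5, 5, 7, 0], [1, 1, 1, 0]))
# Two cores only (DMR): the mismatch is detected but cannot be masked.
print(nmr_vote([5, 7, 0, 0], [1, 1, 0, 0]))
```

The two calls mirror the slide's point: with more than two cores a single fault is masked without an immediate recovery, while with two cores it can only be detected.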


Implementation and Results

Test ASIC named FMP implemented and tested

8-core system based on 32-bit internal processor

Fully functional in IHP 130 nm technology

Youngest-First Scheduling Methodology (YFSM) could increase the system lifetime by up to 31%

A. Simevski, R. Kraemer, M. Krstic, Investigating Core-Level N-Modular Redundancy in Multiprocessors, IEEE MCSoC-14


Resilience methods


What is Resilience?

Resilience: "an ability to recover from or adjust easily to change" (Merriam-Webster)

What can this change be in our digital systems?

Environmental changes: voltage/temperature variations

External effects: radiation (SEEs)

Ageing and manufacturing issues

Security threats (side-channel attacks)

We need some measures to address those changes

We already know that addressing faults will cause some overhead

However, addressing the other aspects also leads to overhead

Example: Supply management unit and its overhead

Synergy needed in addressing those challenges!

This is how the overhead needs to be optimized


Example: PISA System

Power-robust IC design for Space Applications

LEON-based multiprocessor using IHP's framework

De-stress (Power Gating, Clock Gating, Adaptive Voltage Scaling)

Fault-tolerant (Core-level NMR-on-Demand, ECC)

High-performance

Addressing at the same time:

Soft errors induced by particle hits

Voltage-variation-induced errors


Resilient Multiprocessor Architecture


Four LEON2 cores using the Waterbear framework + AVS (Adaptive Voltage Scaling)


Waterbear framework controller

Power management

Clock management

Framework control (e.g., modes: de-stress, fault-tolerant, hi-performance)

Error management (both from fault-tolerant mode and ECC)

Aging observation and control (aging monitors)

Temperature sensors

Other management functions (AHB priority, SRAM enable/disable, …)

Synergetic integration of the different fault detection and correction mechanisms

Overhead optimization
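
To make the framework-control idea more concrete, here is a minimal sketch of how the three modes named above could map to a core configuration. Everything in it is an assumption for illustration: the names `Mode`, `CoreConfig` and `configure`, the choice of a single active core in de-stress mode, and the default NMR level of 3 are not taken from the Waterbear controller.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    DE_STRESS = "de-stress"
    FAULT_TOLERANT = "fault-tolerant"
    HIGH_PERFORMANCE = "high-performance"


@dataclass(frozen=True)
class CoreConfig:
    active_cores: int  # cores that are powered and clocked
    nmr_group: int     # cores voting on the same task (1 = no redundancy)


def configure(mode, total_cores=4, nmr_level=3):
    """Return a hypothetical core configuration for the requested mode."""
    if mode is Mode.DE_STRESS:
        # Assumed: unused cores are power- or clock-gated to reduce stress.
        return CoreConfig(active_cores=1, nmr_group=1)
    if mode is Mode.FAULT_TOLERANT:
        if not 2 <= nmr_level <= total_cores:
            raise ValueError("nmr_level must be between 2 and total_cores")
        # NMR on demand: nmr_level cores run the same task and are voted.
        return CoreConfig(active_cores=nmr_level, nmr_group=nmr_level)
    # High-performance: all cores run independent tasks, no redundancy.
    return CoreConfig(active_cores=total_cores, nmr_group=1)


print(configure(Mode.FAULT_TOLERANT))
```

The point of the sketch is the trade-off the controller has to manage: the same cores provide either throughput, redundancy, or reduced stress, depending on the selected mode.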


PISA Chip

Specifications: voltage regulator with 20 discrete steps, controlling the core voltage in the range 0.8 V to 1.2 V; power- and clock-gating of the domains

Size: 7 mm x 7 mm

Power: (static) 173 mW @ 0.2 input activity; proc. core (dynamic): 130mV – 30mV

Tape-out: September 2017

[Figure: chip layout showing Core 0, Core 1, Core 2 and Core 3]
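
As a small worked example of what 20 discrete regulator steps between 0.8 V and 1.2 V could look like, the sketch below assumes 20 uniformly spaced levels that include both end points, which gives a spacing of about 21 mV. The actual level spacing of the PISA regulator is not stated on the slide, and the helper names `avs_levels` and `lowest_safe_level` are hypothetical.

```python
def avs_levels(v_min=0.8, v_max=1.2, steps=20):
    """Return `steps` uniformly spaced voltage levels from v_min to v_max.

    Uniform spacing with both end points included is an assumption.
    """
    delta = (v_max - v_min) / (steps - 1)
    return [round(v_min + i * delta, 4) for i in range(steps)]


def lowest_safe_level(levels, v_required):
    """Pick the lowest available level that still meets v_required."""
    for v in levels:
        if v >= v_required:
            return v
    raise ValueError("required voltage exceeds the regulator range")


levels = avs_levels()
print(len(levels), levels[0], levels[-1])
print(lowest_safe_level(levels, 1.0))
```

Choosing the lowest level that still satisfies the requirement is the basic idea behind adaptive voltage scaling: supply only as much voltage as the current operating conditions need.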


Conclusion

Fault tolerance always causes significant hardware & power overhead

Different methods exist to limit this overhead

These always present a trade-off between the achieved fault tolerance and performance

Two basic approaches presented:

Static techniques for reducing power overhead, but also reducing fault protection

Adaptive techniques enabling a dynamic power-reliability trade-off

Addressing the challenges in a synergetic way is of utmost importance

The ultimate target is an optimal resilient system


Thank you for your attention!

Milos Krstic

IHP – Innovations for High Performance Microelectronics
Im Technologiepark 25
15236 Frankfurt (Oder)
Germany

Phone: +49 (0) 335 5625 729
Fax: +49 (0) 335 5625 671
Email: [email protected]

www.ihp-microelectronics.com