optimizing design of fault-tolerant computing systems · 2017. 10. 23. · optimizing design of...
Optimizing Design of Fault-tolerant Computing Systems
Milos Krstic
HDT 2017, 1st Workshop on Hardware Design and Theory,
www.ihp-microelectronics.com © 2015 - All rights reserved
Agenda
1.10.2015 2
1 Motivation
2 Fault Tolerant Methods
3 Methods for reducing the overhead in Fault tolerant Systems
4 Static and dynamic methods: examples
5 System resilience
6 Conclusions
Motivation and goals
Fault tolerance is a traditional requirement of applications such as space or avionics.
Today's scaled technologies, due to their reliability issues, require more and more fault tolerance measures even for mainstream applications
Fault tolerance is always achieved at some (significant!) cost
The open question is how to limit the overhead imposed by fault tolerance
Two important strategies:
  Advanced static low-overhead techniques, which provide a certain level of fault tolerance with limited overhead
  Adaptivity techniques, which enable fault tolerance only when required
Example: Effects induced by Radiation
Irradiation effects are present not just in space applications
Radiation effects can be split into two general categories:
Cumulative Effects
  Ionization
  Displacement (can also be caused by aging effects)
  Enhanced Low-Dose-Rate Sensitivity (ELDRS)
Single Event Effects
  Soft errors: Neutron Single Event Upset (NSEU), Single Event Transient (SET), Single Event Upset (SEU), Single Event Functional Interrupt (SEFI)
  Hard errors: Single Event Latchup (SEL), Single Event Gate Rupture (SEGR), Single Event Burnout (SEBR)
The effects need to be addressed by corresponding fault tolerant measures
Fault-tolerant Techniques
Almost all fault-tolerant techniques are based on redundancy
  Hardware redundancy (N-modular, triple and double modular redundancy)
  Information redundancy (error detection and correction)
  Time redundancy
  Software redundancy
Hardware Redundancy with TMR
The general approach is N-modular redundancy
Such systems are also known as M-of-N systems
  N=3, Triple Modular Redundancy (TMR)
  N=2, Dual Modular Redundancy (DMR)
How does TMR work?
[Figure: three redundant modules each output m; the voted output is m]
[Figure: a transient or permanent fault makes one module output x ≠ m; the other two modules output m and the voted output is still m]
TMR is a 2-of-3 system!
[Figure: transient or permanent faults make two modules output x ≠ m and y ≠ m; only one module outputs m and the voted output is undetermined (?)]
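The voting behaviour described on the TMR slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the circuit from the talk: the function name `tmr_vote` and the encoding of module outputs as integer words are hypothetical.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote over three module outputs (hypothetical helper)."""
    return (a & b) | (a & c) | (b & c)


m = 0b1011        # fault-free module output
x = m ^ 0b0100    # one module hit by a fault (bit 2 flipped)
y = m ^ 0b0100    # a second module hit by a fault in the same bit

print(tmr_vote(m, m, m) == m)   # True: no fault
print(tmr_vote(x, m, m) == m)   # True: a single faulty module is outvoted
print(tmr_vote(x, y, m) == m)   # False: two faulty modules outvote the correct one
```

The last call shows why TMR is a 2-of-3 system: at least two modules must deliver the correct value.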
Limits of the simple TMR-Circuit
• Voter (V) is the single point of failure!
[Figure: two TMR stages, each with Module 1, Module 2 and Module 3 feeding a single voter V and a flip-flop (D1/Q1, D2/Q2); a transient pulse is marked in the circuit]
Alternative TMR-Architecture
• Full TMR architecture, including a TMR voter
[Figure: Module 1, Module 2 and Module 3 feed three voters V, which drive three flip-flops (D1/Q1, D2/Q2, D3/Q3)]
Information Redundancy
The TMR approach could also be used (or seen) as information redundancy
  Not a very efficient one
There are error correction codes (ECCs) which can reduce the overhead
Example: Hamming code (26 data bits, 5 parity bits, overhead 16%)
  Single error correction; double error detection in addition requires one extra overall parity bit (SECDED)
Easily applicable to homogeneous structures (for example SRAM) or to communication packets.
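The Hamming example (26 data bits, 5 parity bits) can be illustrated with a short Python sketch. It assumes the textbook layout with parity bits at the power-of-two positions; the function names are hypothetical and the code is not taken from the talk. It shows single error correction only.

```python
def hamming_encode(data_bits):
    """Encode a list of 0/1 data bits; parity bits sit at positions 1, 2, 4, 8, ..."""
    m = len(data_bits)
    r = 0
    while (1 << r) < m + r + 1:      # smallest r with 2^r >= m + r + 1
        r += 1
    n = m + r
    code = [0] * (n + 1)             # index 0 unused, positions are 1-based
    it = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):          # not a power of two: data position
            code[pos] = next(it)
    for i in range(r):
        p = 1 << i
        parity = 0
        for pos in range(1, n + 1):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity
    return code[1:]


def hamming_correct(codeword):
    """Return (corrected codeword, syndrome); syndrome 0 means no error seen."""
    syndrome = 0
    for pos, bit in enumerate(codeword, start=1):
        if bit:
            syndrome ^= pos
    fixed = list(codeword)
    if 0 < syndrome <= len(fixed):
        fixed[syndrome - 1] ^= 1     # the syndrome is the position of the flipped bit
    return fixed, syndrome


def hamming_extract(codeword):
    """Drop the parity positions and return the data bits."""
    return [b for pos, b in enumerate(codeword, start=1) if pos & (pos - 1)]


data = [1, 0, 1, 1, 0] * 5 + [1]     # 26 data bits
word = hamming_encode(data)          # 31-bit codeword
word[10] ^= 1                        # inject a single bit flip at position 11
fixed, syndrome = hamming_correct(word)
print(len(word), syndrome, hamming_extract(fixed) == data)
```

The 5 parity bits are 5/31, or about 16%, of the stored codeword, compared with 200% extra storage for triplication.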
Time Redundancy
Time redundancy is performed by multiple executions of the same task in order to detect the error
Advantage
  Hardware overhead is low or even non-existent
Disadvantage
  Performance is reduced
  Energy consumption overhead still exists
Software Redundancy
• Redundancy can also be performed in software
• Execution of the algorithm several times and checking of the results
• Critical issue: if the software is executed in the same way every time, the redundancy will not be able to find some errors
• Solution is to have different implementations of the target algorithms
• Overhead is similar to the time redundancy approach
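The idea of using different implementations of the same algorithm can be sketched as follows. The two variants and the checker name `checked` are hypothetical examples chosen for brevity, not code from the talk.

```python
def sum_to_n_loop(n: int) -> int:
    """Variant A: iterative summation."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_to_n_formula(n: int) -> int:
    """Variant B: closed-form expression for the same result."""
    return n * (n + 1) // 2


def checked(n: int, variants=(sum_to_n_loop, sum_to_n_formula)) -> int:
    """Run every variant and accept the result only if all of them agree."""
    results = [f(n) for f in variants]
    if len(set(results)) != 1:
        raise RuntimeError(f"variants disagree: {results}")
    return results[0]


print(checked(100))
```

Because the variants exercise different instructions and data paths, a fault that corrupts both in exactly the same way is less likely than with two identical runs.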
Overhead of Fault Tolerant Solutions
Overhead of TMR could be much higher than 3×
Example: TMR flip-flop
  The TMR flip-flop is very effective against single event effects
  On the other hand, the overhead is significant
[Figure: TMR Flip Flops. SEU cross section [cm²] (log scale, 1.0E-11 to 1.0E-06) versus LET [MeV·cm²/mg] (0 to 80) for SGB25 at VDD = 2.25 V, T = 27 °C; curves: SGB25 Full-TMR, SGB25 Red-TMR, SGB25 Dlib-DICE; minimal threshold marked]
Reducing the Overhead
Redundancy generates overhead
  This can increase the power/area/performance budget several times
Reducing the overhead is possible
  Basic methods combine different redundancy methods (hardware, software, information, time) and trade off the achieved level of fault tolerance against performance and power consumption
Example: DMR
  DMR significantly reduces the overhead introduced by TMR
  However, it can usually offer only error detection; correction needs to be done by architectural replay, which reduces performance
Merging standard low-power techniques and fault tolerant methods?
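The DMR trade-off (detection by comparison, correction by replay) can be sketched in Python. Everything here is an assumption for illustration: `dmr_execute`, the replay limit and the fault-injecting step function are hypothetical.

```python
def dmr_execute(step, state, max_replays=3):
    """Run `step` on two copies and compare; replay the step on a mismatch.

    Returns (result, number_of_replays).
    """
    for attempt in range(max_replays + 1):
        a = step(state, 0)   # copy 0
        b = step(state, 1)   # copy 1
        if a == b:
            return a, attempt
    raise RuntimeError("persistent mismatch: fault is probably permanent")


def make_faulty_increment(fault_on_call):
    """Build a step function that suffers one transient bit flip on the given call."""
    calls = {"n": 0}

    def step(state, copy):
        calls["n"] += 1
        result = state + 1
        if calls["n"] == fault_on_call:
            result ^= 0b1000     # injected transient fault
        return result

    return step


print(dmr_execute(make_faulty_increment(0), 5))   # no fault injected
print(dmr_execute(make_faulty_increment(2), 5))   # one mismatch, one replay
```

Each replay costs a full extra execution of the step, which is the performance penalty mentioned for DMR.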
Applying Standard Low-Power Methods for Reliable Circuits
Standard low-power techniques could be applicable to fault-tolerant circuits
Clock gating
  Could cause issues for space applications if the fault tolerance is based on hardware redundancy
  Error accumulation is possible, since faults are not recovered at each clock edge while the clock is gated
  State recovery is needed after the clock gating phase
DVFS
  Fault tolerant clock and voltage control needed
  Voltage and frequency changes could affect the susceptibility to events
  Characterization needed
Power gating
  Fault tolerant power switches needed
Fault-Tolerance to achieve Optimized Design Operation
In modern scaled systems, power needs to be reduced beyond the worst-case corners
We need better than worst-case timing!
As a result => speculative execution is needed
The clock cycle is adaptive and could be shorter than the critical path delay
Such low-power circuits must be fault tolerant!
Important priorities are:
  Overhead for fault tolerance must be minimized (POWER!, area, timing)
  Focus on fault detection => correction by recovery
  Target: timing faults
RAZOR Concept
RAZOR is a very popular approach to achieve fault tolerance against timing errors
It is based on the use of a shadow latch, which is sampled on the alternative edge
The outputs of the main flip-flop and the shadow latch are compared to identify a timing fault
Pros
Reduced complexity compared with TMR/DMR
Effective against timing errors
Cons
Effective ONLY against timing errors
Metastability
D. Ernst, et al, Razor: circuit-level correction of timing errors for low-power operation, Micro, IEEE 24 (6), 2004
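The detection principle can be illustrated with a very small behavioural model. It is a sketch under simplifying assumptions: time is in arbitrary units, the shadow latch is assumed to sample `shadow_delay` after the main flip-flop, and metastability and short-path constraints are ignored. The function name and parameters are hypothetical.

```python
def razor_sample(arrival_time, new_value, old_value, clock_period, shadow_delay):
    """Model one Razor-style flip-flop for one cycle.

    The main flip-flop captures at `clock_period`; the shadow latch captures
    at `clock_period + shadow_delay`. Data arriving after a sampling point is
    missed and the old value is captured instead.
    Returns (main, shadow, error_flag).
    """
    main = new_value if arrival_time <= clock_period else old_value
    shadow = new_value if arrival_time <= clock_period + shadow_delay else old_value
    return main, shadow, main != shadow


print(razor_sample(0.8, 1, 0, clock_period=1.0, shadow_delay=0.5))  # data on time
print(razor_sample(1.2, 1, 0, clock_period=1.0, shadow_delay=0.5))  # late data: error flagged
```

In this model a late arrival is caught only if it still falls inside the shadow window, and only faults that show up as late data are detected at all.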
Bubble-RAZOR Concept
Bubble-RAZOR was recently proposed for processor-based architectures
When an error is detected, bubbles are pushed to the neighboring stages
This provides the slots needed for error recovery
Pros
Further reduced complexity
Effective against timing errors
Cons
Effective only for processor-based systems
M. Fojtik, et al, Bubble razor: An architecture-independent approach to timing-error detection and correction, in: Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2012 IEEE International, 2012, pp. 488-490.
Consequences and Solutions
Redundancy = power/area increase and performance drop!
The overhead could be more than N times the baseline
The classical FT techniques could be performed in a more optimal way:
  Static solutions: limiting the level of fault tolerance but significantly reducing overhead (partial FT, ECC codes etc.)
  Dynamic (adaptive) solutions: enabling system adaptivity, using overhead only when it is needed
Adaptivity is the key requirement for complex system implementation in advanced technologies
Example solutions:
  Static: partial and selective fault tolerance, EDPEC, FEDC
  Adaptive: NMR power control, adaptive ECC, adaptive MPSoCs
Resilience: addressing different challenges with the same mechanisms
  Faults, PVT variations, security
Static Solutions
Partial Fault Tolerance
We can reduce the overhead if we protect with redundancy only the parts of the system which are most critical (control logic, command/status registers)
This method is called partial fault tolerance
However, some part of the system remains fully unprotected
Example: Design of a Digital Beamforming Network (DBFN) processor for synthetic aperture radar (EU Project DIFFERENT)
  Digital baseband IC in IHP technology
  DBFN chip tape-out Aug 2015, 46 mm² in SGB25V
  Tested on wafer and operational up to 250 MHz!
  Optimized overhead (saving 20% in area and 33% in power)
    Radiation hardened TMR flip-flops in control logic
    Standard flip-flops in datapath
[Figure: DBFN Baseband Processor, 46 mm² in SGB25V, Sep 2015]
Selective Fault Tolerance Architecture
Fault tolerance is guaranteed only for input assignments of critical tasks, since it is not required for other signals
We can make the trade-off between the FT level and area increase
[Figure: input x feeds the original system S_1 = S and the two optimized systems s_2 and s_3; their outputs y_1, y_2, y_3 are combined by a voter V into the output y]
$$ s_2(x) = \begin{cases} S_1(x) & \text{if input is critical} \\ 0 & \text{otherwise} \end{cases} $$
$$ s_3(x) = \begin{cases} S_1(x) & \text{if input is critical} \\ 1 & \text{otherwise} \end{cases} $$
*Reducing the Area Overhead of TMR-Systems by Protecting Specific Signals, M. Augustin, M. Gössel, R. Kraemer, Proc. IEEE IOLTS 2010
*Eine neue Fehlertoleranzmethode zur Verringerung des Flächenaufwandes von TMR-Systemen, M. Augustin, M. Gössel, R. Kraemer, Proc. ZuE 2010
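The selective scheme can be tried out on a toy example. In this sketch the protected function `S` (parity of a 3-bit input), the set `CRITICAL` and the fault injection are hypothetical; only the structure (the original system plus two reduced copies that output the constants 0 and 1 for non-critical inputs, followed by a majority voter) follows the slide.

```python
def majority(a: int, b: int, c: int) -> int:
    """2-of-3 majority vote on single-bit values."""
    return (a & b) | (a & c) | (b & c)


def S(x: int) -> int:
    """Hypothetical system to protect: parity of a 3-bit input."""
    return bin(x).count("1") & 1


CRITICAL = {0b000, 0b111}            # hypothetical set of critical inputs


def s2(x: int) -> int:
    return S(x) if x in CRITICAL else 0


def s3(x: int) -> int:
    return S(x) if x in CRITICAL else 1


def selective_tmr(x: int, fault_in=None) -> int:
    """Vote over S, s2, s3; optionally flip the output of one module."""
    outs = [S(x), s2(x), s3(x)]
    if fault_in is not None:
        outs[fault_in] ^= 1          # injected single fault
    return majority(*outs)


# Fault-free operation is correct for every input.
print(all(selective_tmr(x) == S(x) for x in range(8)))
# Any single faulty module is masked for critical inputs.
print(all(selective_tmr(x, f) == S(x) for x in CRITICAL for f in range(3)))
# For a non-critical input a fault in the original system is not masked.
print(selective_tmr(0b001, fault_in=0) == S(0b001))
```

For non-critical inputs the constants 0 and 1 cancel in the voter, so the output follows the original system. This is why `s2` and `s3` can be optimized into much smaller circuits than full copies.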
Improvements of Selective Fault Tolerance
The methodology can be easily integrated into a standard design flow
Selective Fault Tolerance is applicable to real industrial designs
The reduction of area overhead compared to TMR is significant
  Near to the computationally very intensive solution
The protection of 20% of all possible input/output assignments leads to an area reduction of one complete system compared to TMR
*Reducing the Area Overhead of TMR-Systems by Protecting Specific Signals, M. Augustin, M. Gössel, R. Kraemer, Proc. IEEE IOLTS 2010
*Eine neue Fehlertoleranzmethode zur Verringerung des Flächenaufwandes von TMR-Systemen, M. Augustin, M. Gössel, R. Kraemer, Proc. ZuE 2010
Improvements of Selective Fault Tolerance
The concept of Selective Fault Tolerance was also applied to sequential circuits (FSMs)
The protection of 20% of all possible state transitions leads to an area reduction of nearly one complete system compared to TMR
*Reducing the Area Overhead of TMR-Systems by Protecting Specific Signals, M. Augustin, M. Gössel, R. Kraemer, Proc. IEEE IOLTS 2010
*Eine neue Fehlertoleranzmethode zur Verringerung des Flächenaufwandes von TMR-Systemen, M. Augustin, M. Gössel, R. Kraemer, Proc. ZuE 2010
Error Detection and Partial Error Correction (EDPEC) Architecture
Circuit   Dynamic Power [mW]   Area Comb [mm²]   Area Sequential [mm²]   Area Total [mm²]
TMR       0.600                0.0421            0.0125                  0.0470
EDPEC     0.442                0.0344            0.0105                  0.0380
Milos Krstic, et al., Improved circuitry for soft error correction in combinational logic in pipelined designs. IOLTS 2014: 93-98
Architecture optimized for effective error detection
  Most of the soft errors could be corrected as well
Self-checking based on prediction circuits, which are less complex than hardware multiplication
Hardware/power overhead reduced compared with TMR
  Saves around 25% area/power
Critical: errors appearing near the clock edge
Full Error Detection and Correction (FEDC)
Circuit   Dynamic Power [mW]   Area Comb [mm²]   Area Sequential [mm²]   Area Total [mm²]
TMR       0.600                0.0421            0.0125                  0.0470
FEDC      0.432                0.035             0.0126                  0.0418
Milos Krstic, et al., Enhanced Architectures for Soft Error Detection and Correction in Combinational and Sequential Circuits, Microelectronics Reliability, 2016
Focused on full error correction
  All injected SETs could be corrected
Suitable for long transients as well
Effective against timing errors
Hardware/power overhead reduced compared with TMR
  Saves around 28% power
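The relative savings can be recomputed directly from the TMR, EDPEC and FEDC rows of the two tables. The short script below is only a worked calculation; the dictionary layout and the helper name are hypothetical, and the baseline for each percentage is assumed to be the TMR value in the same column.

```python
# Dynamic power and total area as listed in the EDPEC and FEDC tables.
TMR = {"power_mW": 0.600, "area_total_mm2": 0.0470}
EDPEC = {"power_mW": 0.442, "area_total_mm2": 0.0380}
FEDC = {"power_mW": 0.432, "area_total_mm2": 0.0418}


def saving_percent(baseline: float, value: float) -> float:
    """Relative saving of `value` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


for name, row in (("EDPEC", EDPEC), ("FEDC", FEDC)):
    power = saving_percent(TMR["power_mW"], row["power_mW"])
    area = saving_percent(TMR["area_total_mm2"], row["area_total_mm2"])
    print(f"{name}: power saving {power:.1f}%, total-area saving {area:.1f}%")
```

With the listed values this gives about 26.3% power and 19.1% total-area saving for EDPEC, and 28.0% power and 11.1% total-area saving for FEDC.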
Dynamic Methods
Protecting Memory
Error correcting codes are the usual approach for fault protection
  Hamming code, BCH, Hsiao code
  Satisfactory level of fault tolerance with limited overhead
The most challenging issue at smaller feature sizes will be leakage power, coming mostly from memories.
  Non-volatile memories have zero or very small leakage power
  Universal memories (PCM, RRAM) as a next big step towards low power
Variable-Strength Error-Correcting Codes
  Even when the reliability of the system is sufficient without ECC at normal Vcc, it is usually lower at lower Vcc.
  ECC deals with memory cells which are unreliable at lower Vcc.
VS-ECC design achieves an 84% power reduction and a 50% energy reduction compared to SECDED ECC, and achieves a 26% power reduction and an 11% energy reduction *
*Alameldeen et al., “Energy-Efficient Cache Design Using Variable-Strength Error-Correcting Codes”
Dynamic Methods on the MPSoCs Level
Multiprocessors have varying application requirements
  performance
  dependability (fault-tolerance, lifetime, …)
  power consumption
E.g. Earth observation satellite
  image processing
  orbit change
  waiting for a new task
Target: NMR mechanisms, lifetime aspect
Self-repairable system based on configurable micro-operation units, selected by the programmer
Timing-critical applications are also considered
Dynamically adapting to the application requirements
Trade-offs between higher endurance, fault-tolerance, performance and power efficiency
A. Simevski, R. Kraemer, M. Krstic, Investigating Core-Level N-Modular Redundancy in Multiprocessors, IEEE MCSoC-14
Architectural Framework for Adaptable Multiprocessors
Framework addressing fault-tolerance and aging effects for space and automotive applications
  Aging monitors
  Automated HW/SW verification, design & test
  Programmable NMR voters
Modes of operation
  De-stress
  Fault-tolerant
  High-performance
Dynamic mode change depending on the application requirements
[Figure: Core 1, Core 2, Core 3 and Core 4 connected to a Scheduler + Voter block]
A. Simevski, R. Kraemer, M. Krstic, Investigating Core-Level N-Modular Redundancy in Multiprocessors, IEEE MCSoC-14
De-stress mode
[Figure, animated over five slides: Core 1, Core 2, Core 3 and Core 4 shown on a timeline divided into periods T, with one more period elapsing on each slide. Legend: active core, inactive core]
De-stress mode (2 active cores)
[Figure: Core 1, Core 2, Core 3 and Core 4 on the same timeline of periods T, with 2 cores active at a time]
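A plain rotation of the active cores, as suggested by the de-stress timelines, could look like the sketch below. The exact rotation order used on the chip is not given on the slides, so the function `destress_schedule` and its block-wise rotation are assumptions for illustration.

```python
def destress_schedule(num_cores: int, active_cores: int, periods: int):
    """Return, for each period T, the list of active core indices (0-based).

    The active window advances by `active_cores` positions each period, so
    every core is powered for the same share of time and rests otherwise.
    """
    schedule = []
    for t in range(periods):
        start = (t * active_cores) % num_cores
        schedule.append([(start + k) % num_cores for k in range(active_cores)])
    return schedule


print(destress_schedule(4, 1, 5))   # one active core at a time
print(destress_schedule(4, 2, 3))   # two active cores at a time
```

With one active core out of four, each core is stressed for a quarter of the time; with two active cores, for half of the time.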
Youngest-First Round Robin (YFRR)
[Figure: timeline of periods T for Core 1, Core 2, Core 3 and Core 4 against an age axis (Age 2, Age 3, Age 4, Age 5)]
Core 1 reached age 3
Cores 1 and 2 reached age 4
Cores 1, 2 and 3 reached age 5
The initial age may not be equal even between cores on the same die!
Extend lifetime by equalizing the age of the cores
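The youngest-first policy can be sketched as a small simulation. The initial ages 2, 3, 4 and 5 for Cores 1 to 4 are read off the milestones on the slide; the wear of one age unit per active period, the tie-breaking rule and the function name are assumptions made for this example.

```python
def yfrr_schedule(initial_ages, periods, wear_per_period=1.0):
    """Youngest-first round robin: in every period T activate the youngest core.

    Ties are broken in favour of the core that has been idle longest, which
    gives plain round robin among cores of equal age.
    Returns (activation order as 0-based core indices, final ages).
    """
    ages = list(initial_ages)
    last_used = [-1] * len(ages)
    order = []
    for t in range(periods):
        core = min(range(len(ages)), key=lambda i: (ages[i], last_used[i], i))
        ages[core] += wear_per_period   # only the active core ages (assumption)
        last_used[core] = t
        order.append(core)
    return order, ages


order, ages = yfrr_schedule([2, 3, 4, 5], periods=6)
print(order)   # Core 1 first, then Cores 1 and 2, then Cores 1, 2 and 3
print(ages)    # all cores end up at the same age
```

Under these assumptions the younger cores catch up one by one, in the order shown by the milestones on the slide, after which all four cores age at the same rate.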
Fault-tolerant mode
Core-level NMR: voting in each clock cycle!
Masking faults, no need for instant recoveries for N>2
Dynamic reconfiguration: NMR on demand
[Figure: Core 1, Core 2, Core 3 and Core 4 feed a programmable NMR voter; labelled signals: pb1, pb2, pb3, pb4, voting output, ISD, err]
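A programmable voter that can be reconfigured between DMR, TMR and higher N could behave like the following sketch. The enable mask, the return convention and the name `nmr_vote` are hypothetical; the signals of the actual voter are not specified on the slide.

```python
from collections import Counter


def nmr_vote(outputs, enabled):
    """Vote among the outputs of the enabled cores.

    Returns (value, err): `value` is the strict-majority result, or None if
    no strict majority exists; `err` is True if any enabled core disagrees.
    """
    active = [o for o, e in zip(outputs, enabled) if e]
    if not active:
        raise ValueError("no core enabled")
    value, count = Counter(active).most_common(1)[0]
    err = count != len(active)
    if 2 * count <= len(active):
        return None, True            # detection only, recovery needed
    return value, err


print(nmr_vote([7, 7, 7, 7], [1, 1, 1, 1]))   # 4 cores, all agree
print(nmr_vote([7, 7, 9, 7], [1, 1, 1, 0]))   # TMR: fault masked and flagged
print(nmr_vote([7, 9, 0, 0], [1, 1, 0, 0]))   # DMR: fault detected, not masked
```

The DMR case illustrates why masking without immediate recovery only works for N>2: with two cores a mismatch can be detected but not resolved.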
Implementation and Results
Test ASIC named FMP implemented and tested
  8-core system based on a 32-bit internal processor
  Fully functional in IHP 130 nm technology
Youngest-First Scheduling Methodology (YFSM) could increase the system lifetime by up to 31%
A. Simevski, R. Kraemer, M. Krstic, Investigating Core-Level N-Modular Redundancy in Multiprocessors, IEEE MCSoC-14
Resilience methods
What is Resilience?
Resilience: "an ability to recover from or adjust easily to change" (Merriam-Webster)
What can this change be in our digital systems?
  Environmental changes: voltage/temperature variations
  External effects: radiation (SEEs)
  Ageing and manufacturing issues
  Security threats (side channel attacks)
We need some measures to address those changes
We already know that addressing faults will cause some overhead
However, addressing the other aspects also leads to overhead
  Example: supply management unit and its overhead
Synergy is needed in addressing those challenges!
In this way the overhead needs to be optimized
Example: PISA System Power robust IC design for Space Applications
Leon-based multiprocessor using IHP's framework
  De-stress (Power Gating, Clock Gating, Adaptive Voltage Scaling)
  Fault-tolerant (Core-level NMR-on-Demand, ECC)
  High-performance
Addressing at the same time
  Soft errors induced by particle hits
  Voltage variation induced errors
Resilient Multiprocessor Architecture
25. July 2017 68
Four LEON2 cores using the Waterbear framework + AVS (Adaptive Voltage Scaling)
Waterbear framework controller
Power management
Clock management
Framework control (e.g., modes: de-stress, fault-tolerant, high-performance)
Error management (both from fault-tolerant mode and ECC)
Aging observation and control (aging monitors)
Temperature sensors
Other management functions (AHB priority, SRAM enable/disable, …)
Synergetic integration of the different fault detection and correction mechanisms
Overhead optimization
Slide: 16/20 25. May 2016 PISA steering committee meeting
PISA Chip
Specifications
  Voltage regulator with 20 discrete steps, controlling the core voltage in the range [0.8 V – 1.2 V]; power and clock gating of the domains
Size
  7 mm x 7 mm
Power
  (static) 173 mW @ 0.2 input activity
  Proc. core (dynamic): 130mV – 30mV
Tape-out September 2017
[Figure: chip layout with Core 0, Core 1, Core 2 and Core 3]
Conclusion
Fault tolerance always causes significant hardware and power overhead
Different methods exist to limit this overhead
  This always presents a trade-off between the achieved fault tolerance and performance
Two basic approaches were presented:
  Static techniques, which reduce the power overhead but also reduce fault protection
  Adaptive techniques, which enable a dynamic trade-off between power and reliability
Addressing the challenges in a synergetic way is of utmost importance
  The ultimate target is an optimal resilient system
IHP – Innovations for High Performance Microelectronics Im Technologiepark 25 15236 Frankfurt (Oder) Germany
www.ihp-microelectronics.com
Phone: +49 (0) 335 5625 Fax: +49 (0) 335 5625 Email:
Thank you for your attention! Milos Krstic
729 671