dark silicon phenomenon
Combining Error Detection and Transactional Memory for Energy-Efficient Computing below Safe Operation Margin
This project and the research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under grant agreement n° 318693
Gulay Yalcin, Anita Sobe, Alexey Voronin, Jons-Tobias Wamhoff, Derin Harmanci, Adrián Cristal, Osman Unsal, Pascal Felber, Christof Fetzer
PDP2014, Turin, Italy
13 February 2014
Combining Error Detection and Transactional Memory for Energy-Efficient Computing below Safe Operation Margin
Combining Error Detection and TM for Energy-Efficient Computing below Safe Operation Margin
Dark Silicon Phenomenon
The number of transistors per chip can keep increasing.
To stay within a chip’s power budget, some of them must remain “dark”.
One solution: Downscale the voltage.
How about Reliability?
When Vdd is reduced, the error rate increases exponentially [1].
[1] Dan Ernst et al. “Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation.” In Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture, pages 7–18, 2003
Our goal: investigate how far the voltage can be reduced while error recovery still leaves a net reduction in energy consumption.
Agenda / Overview
Motivation
Experiment: Scaling Vdd in a Real System
Basics of Reliability
Error Recovery with TM
Error Detection Schemes
Analysis
Conclusion
Reducing Vdd in a Real System
AMD FX-6100 6-core CPU
CPU-heavy execution
Every 10 seconds, reduce Vdd by 12.5 mV
Monitor:
Incorrect results
System crashes
Machine Check Architecture (MCA)
After only a 10% reduction in Vdd, the system already encounters errors that MCA cannot correct.
Errors are in instruction cache (37%), execution unit (61%) and others (less than 2%).
Basics of Reliability
Transactional Memory can provide lightweight Coordinated Local Checkpointing [2].
[2] Gulay Yalcin et al. “FaulTM: Fault Tolerance Using Hardware Transactional Memory”, DATE 2013
TM provides checkpointing/rollback
[Figure: processors P1, P2, P3, P4, ..., Pn, each with its own checkpoint (log area)]
TM write-sets log the tentative memory updates.
Synchronize checkpoints
Data versioning provides a synchronization mechanism between checkpoints.
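To illustrate how a write-set can double as a checkpoint, here is a minimal single-threaded sketch in Python. The `Transaction` class, `run_with_recovery`, and the injected fault are hypothetical names made up for this illustration; real hardware TM also performs conflict detection and the checkpoint synchronization mentioned above, which this sketch leaves out.

```python
class Transaction:
    """Toy transaction: buffers writes in a write-set until commit."""

    def __init__(self, memory):
        self.memory = memory   # committed state (a plain dict here)
        self.write_set = {}    # tentative updates, invisible until commit

    def read(self, key):
        # Prefer our own tentative value, then fall back to committed memory.
        if key in self.write_set:
            return self.write_set[key]
        return self.memory[key]

    def write(self, key, value):
        self.write_set[key] = value

    def commit(self):
        # Publish all tentative updates at once.
        self.memory.update(self.write_set)
        self.write_set.clear()

    def abort(self):
        # Rollback: drop the write-set; committed memory is the checkpoint.
        self.write_set.clear()


def run_with_recovery(memory, body, error_detected, max_retries=3):
    """Run body(tx) as a transaction; re-execute it if an error is detected."""
    for _ in range(max_retries):
        tx = Transaction(memory)
        body(tx)
        if error_detected(tx):
            tx.abort()
            continue
        tx.commit()
        return True
    return False


memory = {"x": 1}
attempts = []

def increment(tx):
    attempts.append(1)
    faulty = len(attempts) == 1  # hypothetical fault on the first attempt only
    tx.write("x", tx.read("x") + (100 if faulty else 1))

ok = run_with_recovery(memory, increment, lambda tx: tx.read("x") != 2)
```

Because the faulty value never leaves the write-set, recovery is just a matter of discarding the write-set and running the transaction body again.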
Error Detection Schemes - Replication
Execute instruction streams multiple times
Compare the results of the executions
Fewer comparisons are needed with TM
Dual/Triple Modular Redundancy
+ High error detection rate
- High energy overhead
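A minimal sketch of dual modular redundancy in Python, assuming the redundant executions are compared once per block of work (as they would be once per transaction) rather than after every instruction. `run_dmr` and `flaky_sum` are hypothetical names, and the bit flip is an injected fault for demonstration only.

```python
def run_dmr(compute, inputs, max_retries=3):
    """Execute compute twice and accept the result only if both runs agree."""
    for _ in range(max_retries):
        first = compute(inputs)
        second = compute(inputs)
        if first == second:       # one comparison per block of work
            return first
        # Mismatch: treat as a detected error and re-execute both replicas.
    raise RuntimeError("replicas kept disagreeing")


calls = {"n": 0}

def flaky_sum(values):
    calls["n"] += 1
    result = sum(values)
    # Hypothetical fault: the very first execution flips the lowest bit.
    return result ^ 1 if calls["n"] == 1 else result

result = run_dmr(flaky_sum, [1, 2, 3])
```

The energy overhead is visible in the call count: even the fault-free case costs two executions per result, and each detected mismatch costs two more.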
Error Detection Schemes - Assertions/Invariants
Assertions: conditions referring to the current and previous state of the program.
Check the state
Added manually or automatically
TM facilitates inserting invariants
Ex:
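A minimal sketch of an invariant checked at commit time, written in Python as an illustration (it is not the example from the original slide). The account-transfer scenario, the `transfer` function, and the injected bit flip are all assumptions chosen to show an invariant that relates the previous state to the current one.

```python
def transfer(accounts, src, dst, amount, corrupt=False):
    """Apply a transfer to a tentative copy; commit only if invariants hold."""
    tentative = dict(accounts)       # tentative state, like a write-set
    tentative[src] -= amount
    tentative[dst] += amount
    if corrupt:
        tentative[dst] ^= 1 << 4     # hypothetical injected bit flip

    # Invariants compare the previous state (accounts) with the new one.
    total_preserved = sum(tentative.values()) == sum(accounts.values())
    no_overdraft = all(balance >= 0 for balance in tentative.values())
    if total_preserved and no_overdraft:
        accounts.update(tentative)   # commit
        return True
    return False                     # abort: accounts is left unchanged


accounts = {"a": 100, "b": 50}
first_try = transfer(accounts, "a", "b", 30, corrupt=True)
after_fault = dict(accounts)
second_try = transfer(accounts, "a", "b", 30)
```

Checking at the commit point means a violated invariant can be handled by discarding the tentative state, so the committed data never reflects the corrupted value.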
Error Detection Schemes - Symptoms
Monitor program execution for symptoms of hardware faults.
Symptoms:
Mispredictions in high-confidence branches
High OS activity
Fatal traps (e.g. undefined instruction code)
Reliability at a low cost
Error Detection Schemes - Encoded Processing
Apply software coding (ECC-like) techniques
Redundancy is added by applying arithmetic codes to the values.
Arithmetic codes: AN, ANBDmem, etc.
With TM, the validation of a code word can be deferred until a TX commits.
Ex:
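A minimal sketch of AN encoding in Python, given as an illustration (it is not the example from the original slide). Each value x is stored as A·x, and a word is valid only if it is divisible by A. The constant `A`, the helper names, and the injected bit flip are assumptions. Validation happens once at the end, mirroring the idea of deferring the check until commit.

```python
A = 58659  # illustrative constant (assumption); odd, so no power of two is a multiple of it


def encode(value):
    return A * value


def is_valid(code_word):
    return code_word % A == 0


def decode(code_word):
    if not is_valid(code_word):
        raise ValueError("invalid code word: error detected")
    return code_word // A


def encoded_sum(values):
    """Add encoded values without validating intermediate results."""
    total = encode(0)
    for value in values:
        total += encode(value)   # A*x + A*y == A*(x + y), so sums stay code words
    return total


total = encoded_sum([20, 15, 7])
corrupted = total ^ (1 << 3)     # hypothetical single bit flip before commit
```

A single bit flip changes the word by plus or minus a power of two, which an odd A greater than 1 never divides, so `is_valid(corrupted)` is false and the deferred check still catches the error.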
Comparing Error Detection Schemes
Analysis
gem5 full-system simulator
1 GHz in-order cores
4 cores
x86 ISA
64 KB L1 data and instruction caches
Unified 2 MB L2 cache
SPLASH2 benchmark suite.
Energy Analysis
$E \approx C \times V_{dd}^{2}$
[Figure: energy analysis flow; labels: Vdd, Error-free Overhead, Recovery Overhead, Fault Injection, TX size, Error Detection Rate]
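A back-of-the-envelope sketch of the trade-off in Python, starting from the quadratic dependence of energy on Vdd. Treating the two overheads as multiplicative factors is an assumption made for this illustration, not the authors' model, and the numbers in the usage lines are hypothetical inputs rather than measured results.

```python
def relative_energy(vdd_ratio, error_free_overhead, recovery_overhead):
    """Energy relative to nominal-voltage execution without error detection.

    vdd_ratio:           reduced Vdd divided by nominal Vdd
    error_free_overhead: extra work for detection when no error occurs (0.10 = 10%)
    recovery_overhead:   extra work for re-executing aborted transactions
    """
    voltage_factor = vdd_ratio ** 2   # from E ~ C * Vdd^2
    return voltage_factor * (1 + error_free_overhead) * (1 + recovery_overhead)


def saves_energy(vdd_ratio, error_free_overhead, recovery_overhead):
    return relative_energy(vdd_ratio, error_free_overhead, recovery_overhead) < 1.0


# Hypothetical inputs: 10% lower Vdd, 10% detection overhead, 5% recovery overhead.
estimate = relative_energy(0.9, 0.10, 0.05)
# Same voltage, but full duplication (100% overhead) for detection.
with_duplication = relative_energy(0.9, 1.00, 0.05)
```

Under these assumptions the first configuration comes out slightly below 1.0 (a net saving), while the duplicated one does not, which shows why the detection scheme's error-free overhead matters as much as the voltage reduction itself.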
Energy Reduction
Reliability of the System
Conclusion
The energy consumption of CPUs can be reduced if we have efficient hardware support for Transactional Memory and for Error Detection.
Future Work: Combining DMR and Symptoms
Thanks!