dark silicon phenomenon

Combining Error Detection and Transactional Memory for Energy-Efficient Computing below Safe Operation Margin

Gulay Yalcin, Anita Sobe, Alexey Voronin, Jons-Tobias Wamhoff, Derin Harmanci, Adrián Cristal, Osman Unsal, Pascal Felber, Christof Fetzer

PDP2014, Turin, Italy, 13 February 2014

This project and the research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under grant agreement n° 318693.

Page 1: Dark Silicon Phenomenon

This project and the research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under grant agreement n° 318693

Gulay Yalcin, Anita Sobe, Alexey Voronin, Jons-Tobias Wamhoff, Derin Harmanci, Adrián Cristal, Osman Unsal, Pascal Felber, Christof Fetzer

PDP2014, Turin, Italy

13 February 2014

Combining Error Detection and Transactional Memory for Energy-Efficient Computing below Safe Operation Margin

Page 2: Dark Silicon Phenomenon

Combining Error Detection and TM for Energy-Efficient Computing below Safe Operation Margin

Dark Silicon Phenomenon

The number of transistors can be increased, but to stay within a chip’s power budget, some must remain “dark”.

One solution: Downscale the voltage.

Page 3: Dark Silicon Phenomenon

How about Reliability?

When Vdd is reduced, the error rate increases exponentially [1].

[1] Dan Ernst et al. “Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation.” In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, pages 7–18, 2003.

Our goal: investigate how far the voltage can be reduced while error recovery still leads to reduced energy consumption.

Page 4: Dark Silicon Phenomenon

Agenda / Overview

Motivation
Experiment: Scaling Vdd in a Real System
Basics of Reliability
Error Recovery with TM
Error Detection Schemes
Analysis
Conclusion

Page 5: Dark Silicon Phenomenon

Reducing Vdd in a Real System

AMD FX-6100 6-core CPU
CPU-heavy execution
Every 10 seconds, reduce Vdd by 12.5 mV
Monitor: incorrect result, system crash, Machine Check Architecture (MCA)

The system encounters errors that MCA cannot correct after only a 10% reduction in Vdd.

The errors are in the instruction cache (37%), the execution unit (61%), and others (less than 2%).
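
The stepping procedure can be sketched as a simple schedule. The 12.5 mV step and the 10-second period come from the slide; the nominal voltage of 1200 mV and the function names are assumptions made only for illustration, not values reported in the talk.

```python
import math

STEP_MV = 12.5       # from the slide: Vdd is lowered by 12.5 mV per step
STEP_PERIOD_S = 10   # from the slide: one step every 10 seconds


def steps_to_reach(nominal_mv, reduction_fraction):
    """Number of steps needed to lower Vdd by at least the given fraction."""
    target_drop_mv = nominal_mv * reduction_fraction
    return math.ceil(target_drop_mv / STEP_MV)


def schedule(nominal_mv, n_steps):
    """List of (elapsed seconds, Vdd in mV) pairs, starting at the nominal voltage."""
    return [(i * STEP_PERIOD_S, nominal_mv - i * STEP_MV) for i in range(n_steps + 1)]


NOMINAL_MV = 1200.0  # hypothetical nominal Vdd, not taken from the slides
n = steps_to_reach(NOMINAL_MV, 0.10)
for elapsed_s, vdd_mv in schedule(NOMINAL_MV, n):
    print(f"t={elapsed_s:4d} s  Vdd={vdd_mv:7.1f} mV")
```

Under that assumed nominal voltage, a 10% reduction is reached within a couple of minutes of stepping; a different nominal voltage changes the step count but not the procedure.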

Page 6: Dark Silicon Phenomenon

Basics of Reliability

Transactional Memory can provide lightweight coordinated local checkpointing [2].

[2] Gulay Yalcin et al. “FaulTM: Fault Tolerance Using Hardware Transactional Memory.” DATE 2013.

Page 7: Dark Silicon Phenomenon

TM provides checkpointing/rollback

[Figure: Processor 1, P2, P3, P4, ..., Pn, each with a Checkpoint (Log Area); label: Synchronize checkpoints]

TM write-sets log the tentative memory updates.

Data versioning provides a synchronization mechanism between checkpoints.
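
A minimal software sketch of the idea, assuming a toy model in which memory is a dictionary and the write-set buffers updates until commit. The class and function names are hypothetical; the talk relies on hardware TM, which this does not reproduce.

```python
class Transaction:
    """Toy model: the write-set logs tentative updates; memory is the checkpoint."""

    def __init__(self, memory):
        self.memory = memory   # committed state, untouched until commit
        self.write_set = {}    # tentative updates made inside the transaction

    def read(self, addr):
        # A transaction sees its own tentative writes first.
        if addr in self.write_set:
            return self.write_set[addr]
        return self.memory[addr]

    def write(self, addr, value):
        self.write_set[addr] = value

    def commit(self):
        # Publish the tentative updates; this becomes the new checkpoint.
        self.memory.update(self.write_set)
        self.write_set = {}

    def abort(self):
        # Roll back by discarding the log; memory still holds the old checkpoint.
        self.write_set = {}


def run_with_recovery(memory, body, error_detected, max_retries=3):
    """Run body in a transaction; abort and re-execute whenever an error is detected."""
    for attempt in range(max_retries + 1):
        tx = Transaction(memory)
        body(tx)
        if error_detected(tx):
            tx.abort()
        else:
            tx.commit()
            return attempt  # number of re-executions that were needed
    raise RuntimeError("transaction kept failing")
```

Because the updates stay in the write-set until commit, recovery amounts to dropping the log and running the transaction again.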

Page 8: Dark Silicon Phenomenon

Error Detection Schemes - Replication

Execute instruction streams multiple times
Compare the results of the executions
Fewer comparisons with TM
Dual/Triple Modular Redundancy
+ High Error Detection Rate
- High Energy Overhead
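
A rough sketch of dual modular redundancy with a single comparison at the end of the work unit, which is one way to read the "fewer comparisons with TM" point. The function name `run_dmr` and the retry limit are assumptions for illustration.

```python
def run_dmr(compute, inputs, max_retries=3):
    """Execute compute twice and compare only the final results.

    A mismatch is treated as a detected error and the pair is re-executed,
    mirroring an abort followed by a retry of the transaction.
    """
    for attempt in range(max_retries + 1):
        first = compute(inputs)
        second = compute(inputs)
        if first == second:
            return first, attempt
    raise RuntimeError("results still disagree after retries")
```

The energy cost is visible in the structure: every unit of work runs at least twice, even when no error occurs.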

Page 9: Dark Silicon Phenomenon

Error Detection Schemes - Assertions/Invariants

Assertions: conditions referring to the current and previous state of the program
Check the state
Added manually or automatically
TM facilitates inserting invariants
Ex: [example not preserved in the transcript]
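
As an illustration only (not the slide's own example), the sketch below checks an invariant that relates the previous and the current state before an update is accepted. The account-transfer scenario and all names are hypothetical.

```python
def transfer(accounts, src, dst, amount):
    """Return an updated copy; the original dict plays the role of the previous state."""
    updated = dict(accounts)
    updated[src] -= amount
    updated[dst] += amount
    return updated


def invariants_hold(previous, current):
    """Invariants: the total is conserved and no balance becomes negative."""
    total_conserved = sum(previous.values()) == sum(current.values())
    non_negative = all(balance >= 0 for balance in current.values())
    return total_conserved and non_negative


def commit_if_valid(previous, current):
    """Accept the new state only if the invariants hold; otherwise keep the old one."""
    return current if invariants_hold(previous, current) else previous
```

Placing the check at the commit point means a violated invariant simply turns into an abort, so no separate recovery path is needed.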

Page 10: Dark Silicon Phenomenon

Error Detection Schemes - Symptoms

Monitor program executions to inspect whether there is a symptom of hardware faults.

Symptoms:
Mispredictions in high-confidence branches
High OS activity
Fatal traps (e.g. undefined instruction code)

Reliability at a low cost

Page 11: Dark Silicon Phenomenon

Error Detection Schemes - Encoded Processing

Apply software coding (ECC-like) techniques
The redundancy is added by applying arithmetic codes to the values.
Arithmetic codes: AN, ANBDmem, etc.
With TM, the validation of a code word can be deferred until a TX commits.
Ex: [example not preserved in the transcript]
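
A small sketch of an AN code, given as an illustration rather than the slide's original example. A value x is stored as A * x, so every valid code word is a multiple of A, and addition of code words preserves that property. The constant `A = 58659` and the function names are assumptions; any odd A greater than 1 detects a single bit flip, because a power of two is never a multiple of such an A.

```python
A = 58659  # assumed encoding constant (odd, greater than 1)


def encode(x):
    return A * x


def is_valid(code_word):
    # Valid code words are exactly the multiples of A.
    return code_word % A == 0


def decode(code_word):
    if not is_valid(code_word):
        raise ValueError("corrupted code word")
    return code_word // A


def encoded_sum(code_words):
    """Add encoded operands without validating intermediate results.

    The single validation is left to the caller, which models deferring
    the check until the transaction commits.
    """
    acc = encode(0)
    for code_word in code_words:
        acc += code_word
    return acc


def flip_bit(code_word, k):
    """Simulate a hardware fault by flipping bit k."""
    return code_word ^ (1 << k)
```

Usage: `decode(encoded_sum([encode(3), encode(4)]))` validates once at the end, while a flipped bit in any operand or intermediate sum makes `is_valid` fail at that final check.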

Page 12: Dark Silicon Phenomenon

Comparing Error Detection Schemes

Page 13: Dark Silicon Phenomenon

Analysis

gem5 full-system simulator
4 in-order cores at 1 GHz
x86 ISA
64 KB L1 data and instruction caches
Unified 2 MB L2 cache

SPLASH2 benchmark suite.

Page 14: Dark Silicon Phenomenon

Energy Analysis

$E \approx C \times V_{dd}^{2}$

Vdd

Error-free Overhead

Recovery Overhead

Fault Injection

TX size

Error Detection Rate
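
A worked sketch of the quadratic relation between energy and Vdd. Only the quadratic dependence comes from the slide; treating the error-free and recovery overheads as additive fractions of the baseline work is a simplifying assumption made here, and the function names are hypothetical.

```python
def relative_energy(vdd_scale, error_free_overhead=0.0, recovery_overhead=0.0):
    """Energy relative to nominal operation with no detection or recovery.

    vdd_scale is Vdd divided by the nominal Vdd; overheads are fractions of
    the baseline work (assumed additive for this sketch).
    """
    return vdd_scale ** 2 * (1.0 + error_free_overhead + recovery_overhead)


def break_even_overhead(vdd_scale):
    """Largest total overhead for which scaled operation still saves energy."""
    return 1.0 / vdd_scale ** 2 - 1.0


print(relative_energy(0.9))        # energy at 90% of nominal Vdd, no overheads
print(break_even_overhead(0.9))    # overhead budget at 90% of nominal Vdd
```

Under this simplified model, a 10% voltage reduction brings energy down to about 81% of nominal, which leaves an overhead budget of roughly 23% for detection and recovery before the saving disappears.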

Page 15: Dark Silicon Phenomenon

Energy Reduction

Page 16: Dark Silicon Phenomenon

Reliability of the System

Page 17: Dark Silicon Phenomenon

Conclusion

The energy consumption of CPUs can be reduced if we have efficient hardware support for Transactional Memory and for Error Detection.

Page 18: Dark Silicon Phenomenon

Future Work: Combining DMR and Symptoms

Page 19: Dark Silicon Phenomenon

Thanks!