dark silicon phenomenon

Post on 05-Feb-2016

DESCRIPTION

Combining Error Detection and Transactional Memory for Energy-Efficient Computing below Safe Operation Margin

TRANSCRIPT

This project and the research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under grant agreement n° 318693

Gulay Yalcin, Anita Sobe, Alexey Voronin, Jons-Tobias Wamhoff, Derin Harmanci, Adrián Cristal, Osman Unsal, Pascal Felber, Christof Fetzer

PDP2014, Turin, Italy

13 February 2014

Combining Error Detection and Transactional Memory for Energy-Efficient Computing below Safe Operation Margin

Combining Error Detection and TM for Energy-Efficient Computing below Safe Operation Margin

Dark Silicon Phenomenon

The number of transistors on a chip can keep increasing. To stay within the chip’s power budget, some of them must remain “dark”.

One solution: Downscale the voltage.

How about Reliability?

When the Vdd is reduced, the error rate increases exponentially [1].

[1] Dan Ernst et al. “Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation.” In Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture, pages 7–18, 2003

Our goal: investigate how far Vdd can be reduced while error recovery still leads to a net reduction in energy consumption.

Agenda / Overview

Motivation
Experiment: Scaling Vdd in a Real System
Basics of Reliability
Error Recovery with TM
Error Detection Schemes
Analysis
Conclusion

Reducing Vdd in a Real System

AMD FX-6100, 6-core CPU
CPU-heavy execution
Every 10 seconds, reduce Vdd by 12.5 mV
Monitor:
  Incorrect result
  System crash
  Machine Check Architecture (MCA)

The system encounters errors that MCA cannot correct after only a 10% reduction in Vdd.

Errors occur in the instruction cache (37%), the execution unit (61%), and other units (less than 2%).
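
The undervolting procedure above can be sketched as a simple schedule. The 12.5 mV step, the 10-second interval and the 10% mark come from the slide; the nominal voltage of 1.2 V and the function name are assumptions made only for illustration, since the slide does not state the starting Vdd.

```python
def vdd_schedule(nominal_uv, step_uv=12_500, interval_s=10, max_reduction_pct=10):
    """Return (time_s, vdd_uv) pairs until the reduction first reaches the limit.

    Voltages are integer microvolts so that repeated subtraction stays exact.
    """
    floor_uv = nominal_uv * (100 - max_reduction_pct) // 100
    points = []
    t, vdd = 0, nominal_uv
    while True:
        points.append((t, vdd))
        if vdd <= floor_uv:
            break
        t += interval_s
        vdd -= step_uv
    return points


# Hypothetical nominal voltage of 1.2 V (not given in the source).
schedule = vdd_schedule(1_200_000)
```

Under that assumed starting point, the schedule crosses the 10% mark on the tenth step, that is, 100 seconds into the run.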

Basics of Reliability

Transactional Memory can provide lightweight Coordinated Local Checkpointing [2].

[2] Gulay Yalcin et al. “FaulTM: Fault Tolerance Using Hardware Transactional Memory.” DATE 2013.

TM provides checkpointing/rollback

[Figure: processors P1, P2, P3, P4, ..., Pn, each with its own checkpoint (log area)]

TM write-sets log the tentative memory updates.

Synchronize checkpoints

Data versioning provides a synchronization mechanism between checkpoints.
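
A minimal sketch of how a write-set can act as a checkpoint, assuming lazy versioning (writes are buffered and applied only at commit). The `Transaction` class and its method names are hypothetical and are not taken from the slides or from any TM library.

```python
class Transaction:
    """Buffers writes in a write-set; memory is untouched until commit."""

    def __init__(self, memory):
        self.memory = memory      # shared state, modelled as a dict
        self.write_set = {}       # tentative updates (the "log area")

    def read(self, addr):
        # A transaction sees its own tentative writes first.
        return self.write_set.get(addr, self.memory.get(addr))

    def write(self, addr, value):
        self.write_set[addr] = value

    def commit(self):
        # Make the tentative updates visible, then start a fresh log.
        self.memory.update(self.write_set)
        self.write_set.clear()

    def abort(self):
        # Rollback: drop the tentative updates; memory still holds the checkpoint.
        self.write_set.clear()


memory = {"x": 1}
tx = Transaction(memory)
tx.write("x", 99)            # e.g. a value corrupted by a fault
tx.abort()                   # error detected: roll back
after_abort = dict(memory)

tx.write("x", 2)
tx.commit()
```

Because memory is never modified before commit, rolling back costs no more than discarding the log, which is what makes the checkpoint lightweight in this model.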

Error Detection Schemes - Replication

Execute instruction streams multiple times.
Compare the results of the executions.
Fewer comparisons with TM.
Dual/Triple Modular Redundancy
+ High error detection rate
- High energy overhead
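
A rough software analogy of dual and triple modular redundancy, assuming results are hashable and that comparing once per call stands in for comparing once per transaction. The helper names and the fault-injecting `make_flaky` function are hypothetical.

```python
from collections import Counter


def run_dmr(fn, *args):
    """Dual modular redundancy: detects a mismatch but cannot correct it."""
    first, second = fn(*args), fn(*args)
    return first, first == second      # (result, results_agree)


def run_tmr(fn, *args):
    """Triple modular redundancy: majority vote over three executions."""
    results = [fn(*args) for _ in range(3)]
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority among the three executions")
    return value


def make_flaky(fault_on_call):
    """Build a squaring function that returns a wrong result on one call."""
    calls = {"n": 0}

    def square(x):
        calls["n"] += 1
        # Inject a wrong result on the chosen call to mimic a transient fault.
        return x * x + 1 if calls["n"] == fault_on_call else x * x

    return square


dmr_result = run_dmr(make_flaky(2), 4)
tmr_result = run_tmr(make_flaky(2), 4)
```

The sketch also shows the trade-off listed above: every extra execution improves detection or correction but repeats the full work, which is where the energy overhead comes from.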

Error Detection Schemes - Assertions/Invariants

Assertions: conditions referring to the current and previous state of the program.
Check the state.
Added manually or automatically.
TM facilitates inserting invariants.
Ex:
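
A minimal sketch of the idea, not the slide's original example: an invariant that refers to both the previous and the current state is checked once, just before a transaction commits. The account scenario and all names are assumptions chosen for illustration.

```python
class InvariantViolation(Exception):
    pass


def run_transaction(state, update, invariant):
    """Apply `update` to a copy of `state`; commit only if `invariant` holds."""
    tentative = dict(state)               # `state` itself is the checkpoint
    update(tentative)
    if not invariant(state, tentative):
        raise InvariantViolation("rolled back; state unchanged")
    state.clear()
    state.update(tentative)               # commit


def withdrawal_invariant(old, new):
    # Refers to both the previous and the current state of the program.
    return 0 <= new["balance"] <= old["balance"]


def withdraw_30(s):
    s["balance"] -= 30


def faulty_withdraw(s):
    s["balance"] = -5                     # stands in for a corrupted result


state = {"balance": 100}
run_transaction(state, withdraw_30, withdrawal_invariant)

violated = False
try:
    run_transaction(state, faulty_withdraw, withdrawal_invariant)
except InvariantViolation:
    violated = True
```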

Error Detection Schemes - Symptoms

Monitor program execution for symptoms of hardware faults.
Symptoms:
  Mispredictions in high-confidence branches
  High OS activity
  Fatal traps (e.g. undefined instruction code)

Reliability at a low cost
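
A sketch of a symptom check over one monitoring interval, assuming the three symptoms listed above are available as counters. The `IntervalStats` fields and both thresholds are hypothetical; the slides give no concrete values.

```python
from dataclasses import dataclass


@dataclass
class IntervalStats:
    high_conf_mispredictions: int    # mispredicted high-confidence branches
    os_instruction_fraction: float   # share of instructions executed in the OS
    fatal_traps: int                 # e.g. undefined instruction code


def symptoms(stats, max_mispredictions=0, max_os_fraction=0.5):
    """Return the names of the symptoms seen in one monitoring interval."""
    found = []
    if stats.high_conf_mispredictions > max_mispredictions:
        found.append("high-confidence branch misprediction")
    if stats.os_instruction_fraction > max_os_fraction:
        found.append("high OS activity")
    if stats.fatal_traps > 0:
        found.append("fatal trap")
    return found


clean = symptoms(IntervalStats(0, 0.1, 0))
suspicious = symptoms(IntervalStats(2, 0.8, 1))
```

A non-empty result would trigger a rollback to the last checkpoint. The check only inspects a few counters, which is why this scheme is cheap compared with replication.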

Error Detection Schemes - Encoded Processing

Apply software coding (ECC-like) techniques.
Redundancy is added by applying arithmetic codes to the values.
Arithmetic codes: AN, ANBDmem, etc.
With TM, the validation of a code word can be deferred until a TX commits.
Ex:
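
A minimal sketch of an AN code, not the slide's original example: every value is multiplied by a constant A, so valid code words are exactly the multiples of A. The constant below is an arbitrary odd number picked for illustration, and the function names are hypothetical.

```python
A = 58659  # arbitrary odd constant, chosen only for illustration


def encode(x):
    return A * x


def is_valid(code_word):
    # A fault that changes the value by a non-multiple of A is detected here.
    return code_word % A == 0


def decode(code_word):
    if not is_valid(code_word):
        raise ValueError("corrupted code word")
    return code_word // A


# Addition works directly on code words: A*a + A*b == A*(a + b),
# so validation can wait until the end of the computation.
total = encode(5) + encode(7)
corrupted = total ^ (1 << 3)   # flip one bit to mimic a hardware fault
```

Here `is_valid` plays the role of the check that TM allows deferring: intermediate sums are left unchecked, and a single validation before commit decides whether to keep or roll back the results.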

Comparing Error Detection Schemes

Analysis

gem5 full-system simulator
1 GHz in-order cores
4 cores
x86 ISA
64 KB L1 data and instruction caches
Unified 2 MB L2 cache

SPLASH-2 benchmark suite

Energy Analysis

E ≈ C × Vdd²

[Figure labels: Vdd, Error-free Overhead, Recovery Overhead, Fault Injection, TX size, Error Detection Rate]
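
A worked example of the relation E ≈ C × Vdd², assuming that detection and recovery add work in proportion to a single overhead fraction. That linear overhead model and both function names are assumptions for illustration and are not the authors' energy model.

```python
def relative_energy(vdd_scale, overhead=0.0):
    """Energy relative to nominal, using E ≈ C × Vdd².

    vdd_scale: reduced Vdd divided by nominal Vdd (0.9 means a 10% reduction).
    overhead:  extra work for detection and recovery as a fraction of the
               baseline work (assumed to scale energy linearly).
    """
    return vdd_scale ** 2 * (1.0 + overhead)


def break_even_overhead(vdd_scale):
    """Largest overhead for which the reduced-Vdd run still saves energy."""
    return 1.0 / vdd_scale ** 2 - 1.0


no_overhead = relative_energy(0.9)     # about 0.81 of nominal energy
limit = break_even_overhead(0.9)       # about 0.235
```

Under this simplified model, a 10% Vdd reduction saves about 19% energy before overheads, so the combined error-free and recovery overhead has to stay below roughly 23% for the reduction to pay off.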

Energy Reduction

Reliability of the System

Conclusion

The energy consumption of CPUs can be reduced if we have efficient hardware support for Transactional Memory and for Error Detection.

Future Work: Combining DMR and Symptoms

Thanks!
