
Upload: byron-martin

Post on 14-Dec-2015



[Figure: processor pipeline diagram. Labels: PC, I-Cache, IF/ID Stage, DEC 1, DEC 2, Instruction Scheduler, Enable Lines, ID/EXE Stage, Integer ALU2, EXE/MEM Stage, D-Cache, MEM/WB Stage, Reg File]

Diagnosing Intermittent Faults Using Software Techniques

Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan

The University of British Columbia

Intermittent Faults

Research Objective

Diagnosis Technique Goals

Overview of the Diagnosis Approach

Isolate Fault-Prone Unit

Intermittent hardware faults are bursts of errors that occur at the same location and last from a few cycles to a few seconds.

Intermittent faults will be a significant concern in future processors.

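[Figure: a transient fault contrasted with intermittent faults along the program execution time axis, ending in a failure. The instruction sequence shown in the figure (mov R1, #5 through mult R7, R5, R4) is the code fragment listed under "Modeling Intermittent Faults Impact on Programs - Example".]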

Research Motivation

Diagnosis is vital in guiding fine-grained recovery techniques (e.g., hardware reconfiguration), which in turn allow the processor to keep operating with degraded performance.

[Figure: three chip diagrams. One shows Cores 1 to 7 only; the other two show Cores 1 to 8.]

If core 8 malfunctions, two recovery options are available:

1. The whole of core 8 is disabled, without fine-grained diagnosis, or

2. Part of core 8 is disabled, with fine-grained diagnosis.

Goals of the diagnosis technique: requires no hardware support, provides formal guarantees of correctness and completeness, is scalable, and yields few false positives.

Modeling Intermittent Faults Impact on Programs - Example

Code Fragment              Node
mov R1, #5                 1
mov R2, #6                 2
mov R3, #7                 3
ld R4, R1, Array_Addr      4
ld R5, R2, Array_Addr      5
ld R6, R3, Array_Addr      6
mult R7, R5, R4            7
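As a rough illustration of how a dynamic dependency graph (DDG) could be derived from the seven-instruction fragment above, the sketch below records, for each node, which earlier nodes produced the registers it reads. The tuple encoding of the trace and the names `TRACE` and `build_ddg` are assumptions made for this example and are not taken from the poster; immediates and Array_Addr are treated as constants and left out of the graph.

```python
# Hypothetical encoding of the fragment: (opcode, destination, source registers).
# Node numbers are the 1-based positions in the trace, as in the listing above.
TRACE = [
    ("mov",  "R1", []),            # node 1: R1 <- #5
    ("mov",  "R2", []),            # node 2: R2 <- #6
    ("mov",  "R3", []),            # node 3: R3 <- #7
    ("ld",   "R4", ["R1"]),        # node 4: load indexed by R1
    ("ld",   "R5", ["R2"]),        # node 5: load indexed by R2
    ("ld",   "R6", ["R3"]),        # node 6: load indexed by R3
    ("mult", "R7", ["R5", "R4"]),  # node 7: R7 <- R5 * R4
]


def build_ddg(trace):
    """Return {node: [producer nodes]} by tracking the last writer of each register."""
    last_writer = {}
    ddg = {}
    for node, (_opcode, dest, sources) in enumerate(trace, start=1):
        # A node depends on whichever earlier node last wrote each source register.
        ddg[node] = [last_writer[reg] for reg in sources if reg in last_writer]
        last_writer[dest] = node
    return ddg


DDG = build_ddg(TRACE)
print(DDG)  # {1: [], 2: [], 3: [], 4: [1], 5: [2], 6: [3], 7: [5, 4]}
```

In this encoding node 7 depends on nodes 5 and 4, so an error in either load reaches the multiply.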

Modeling Intermittent Faults Impact on Programs - Results

The DDG model is more than two orders of magnitude faster than equivalent fault-injection experiments.

89 to 93% of the faults' crash distances are within 100 nodes.

[Figure: model validation flow. The Crash Model, Dynamic Dependency Graph and Fault Model give the expected IPS and CD; the SimpleScalar Simulator gives the actual IPS and CD.]

Intermittent Error

Program Crash/Error Detected

Crash Dump File (e.g., crash state and inputs)

• Run Fault-Free
• Construct DDG
• Diagnose Error

Faulty Instructions

Overview of the Diagnosis Approach - Example

Identify Erroneous Data

An intermittent fault affected nodes 14-18.

Crash instruction: node 27.

Erroneous data: nodes 14, 17, 16, 19 and 21.

Expected fault spans nodes 14-19.

Actual fault affected nodes 14-18.
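One possible reading of this example is that erroneous values which merely inherit an error from an erroneous operand are filtered out, and the remaining nodes bound the expected fault span. The sketch below follows that reading. It is not the authors' algorithm: the poster does not give the DDG edges for nodes 14 to 27, so the edges here (in particular node 21 reading the result of node 19) are hypothetical, as are the function names.

```python
# Hypothetical DDG fragment: node -> nodes whose results it reads.
# Only the node numbers come from the example; every edge is an assumption.
DDG = {
    14: [10],
    16: [12],
    17: [13],
    19: [15],
    21: [19, 20],
}
ERRONEOUS = {14, 16, 17, 19, 21}


def first_affected(ddg, erroneous):
    """Keep erroneous nodes that have no erroneous producer (not mere propagation)."""
    return sorted(
        node for node in erroneous
        if not any(producer in erroneous for producer in ddg.get(node, []))
    )


def expected_span(ddg, erroneous):
    """Return (first, last) node of the span bounded by the first-affected nodes."""
    roots = first_affected(ddg, erroneous)
    return (roots[0], roots[-1])


print(first_affected(DDG, ERRONEOUS))  # [14, 16, 17, 19]
print(expected_span(DDG, ERRONEOUS))   # (14, 19)
```

Under these assumed edges node 21 is dropped because its operand from node 19 is already erroneous, which reproduces the expected span of nodes 14-19.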

[Figure: dynamic dependency graph of the code fragment, with nodes 1 to 7, the constants #5, #6 and #7, Array_Addr, and an intermittent error marked.]

Identify Faulty Instructions

Filtering

Isolate Fault-Prone Unit

Potential Hardware Support

Operating Systems Directions

Contact Information

Layali Rashid

PhD Candidate
Department of Electrical and Computer Engineering
The University of British Columbia
[email protected]

1. Identify Erroneous Data

2. Identify Instructions that Change Erroneous Data

3. Isolate Instructions First Affected by the Fault

Of the intermittent faults that are non-benign, 95% result in a program crash.

91 to 95% of the faults cause the program to crash within 300 nodes of the fault's start.
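The poster does not define crash distance (CD) explicitly. Assuming it is the number of DDG nodes between the fault's start and the crashing node, a statistic of this kind could be computed as sketched below. The function names are hypothetical, and the sample pairs are made up for illustration only, apart from (14, 27), which reuses the node numbers from the diagnosis example.

```python
def crash_distance(fault_start, crash_node):
    """Assumed definition: nodes executed between fault start and crash."""
    return crash_node - fault_start


def fraction_within(pairs, limit):
    """Fraction of (fault_start, crash_node) pairs whose crash distance is <= limit."""
    distances = [crash_distance(start, crash) for start, crash in pairs]
    return sum(d <= limit for d in distances) / len(distances)


# Illustrative (fault_start, crash_node) pairs; not measured data.
SAMPLES = [(14, 27), (100, 350), (5, 905), (40, 120)]

print(crash_distance(14, 27))         # 13
print(fraction_within(SAMPLES, 300))  # 0.75
```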

Conclusions

Diagnosis is vital in guiding fine-grained recovery.

Diagnosing intermittent faults using software techniques is possible.

Most intermittent faults cause the program to crash shortly after the fault's start.

Use Dynamic Dependency Graph (DDG).

Operating systems directions: map tasks to cores based on each core's functioning units and each task's requirements, and modify a program on the fly to avoid using malfunctioning units.

Potential hardware support: provide feedback to the instruction scheduler about the malfunctioning units, so that the performance overhead is minimal.
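Mapping tasks to cores by functioning units, as suggested above, could look like the following sketch. The poster does not specify a mapping policy, so the first-fit strategy, the unit names, the core inventory and the task requirements are all hypothetical and chosen only to keep the example short.

```python
# Hypothetical inventory: core id -> set of units still functioning.
CORES = {
    7: {"int_alu1", "int_alu2"},
    8: {"int_alu1"},  # assume int_alu2 on core 8 was diagnosed as fault-prone
}

# Hypothetical task requirements: task name -> units the task needs.
TASKS = {
    "a": {"int_alu1"},
    "b": {"int_alu1", "int_alu2"},
}


def map_tasks(tasks, cores):
    """First-fit mapping: most demanding tasks first, one task per core."""
    free = dict(cores)
    placement = {}
    for name, needs in sorted(tasks.items(), key=lambda item: -len(item[1])):
        for core in sorted(free):
            if needs <= free[core]:  # the core's working units cover the task
                placement[name] = core
                del free[core]
                break
        else:
            placement[name] = None   # no suitable core is left
    return placement


print(map_tasks(TASKS, CORES))  # {'b': 7, 'a': 8}
```

Placing the most demanding task first keeps the fully working core for the task that needs it, so the partly disabled core 8 still does useful work.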

Integer ALU1

Backtrace the erroneous data in the DDG.