Managing Application Resilience: A Programming Language Approach Saurabh Hukerikar, Pedro C. Diniz and Robert F. Lucas {saurabh, pedro, [email protected]} Information Sciences Institute Presentation at the 2014 Mardi Gras Conference Louisiana State University Feb. 28, 2014


Managing Application Resilience: A Programming Language Approach

Saurabh Hukerikar, Pedro C. Diniz and Robert F. Lucas

{saurabh, pedro, [email protected]}

Information Sciences Institute

Presentation at the 2014 Mardi Gras Conference

Louisiana State University

Feb. 28, 2014


Why are We Here?

Increasing Soft or Transient Errors:

- Shrinking feature sizes

- Voltage level and noise margins decrease

- Hostile environments: Space or altitude

- Even sea-level!

- Processor ageing

- Delay errors

- Latches no longer function / Errors in logic

Single Event Upsets are a common cause of soft errors

Not permanent / no hardware damage


Errors are not New

• Basic Idea: Redundancy in Space and Time

- Software/Hardware Triple-Modular Redundancy (TMR)

- Other Variants: DWC, Redundant Threading...

• Software approaches are Slow

- Checkpoint and restart does not scale

- MTBF < Checkpoint cost

- Code bloat from fine-grained approaches

• Hardware approaches are Expensive

- TMR has high area/power cost

- Inflexible (“always on”)


Resilient Computing Techniques

[Figure: taxonomy of resilient computing techniques, divided into hardware and software approaches, each at coarse grain and fine grain. Techniques shown: EDDI, ThOR, etc.; TMR, TR, DWC-CED; Resiliency-aware Scheduling; Checkpoint and Restart; SOI, DICE, Hardening-by-process. Axis labels: "Less time", "Less hardware".]


Memory Errors? What Errors? (Data Courtesy of Al Geist, ORNL 2011)

What About ECC?

• Transient error bit flips in memory

• ECC fixes single bit errors, but not double bit errors.

How Serious?

• Jaguar has a lot of memory (362 TB)

• It endures a constant stream of single bit errors

• It has a double bit error about every 24 hours

• ChipKill allows the system to run through double bit errors, but DRAM cannot correct double bit errors


Memory Errors? What Errors? (Data Courtesy of Al Geist, ORNL 2006)

Is This a Problem at Exascale?

• Exascale system target is 128 PB of memory (354 times Jaguar)

• Translates into a double bit error about every 4 mins

• Frequent enough to need something better than ChipKill at Exascale
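The exascale estimate on this slide is proportional arithmetic. The following is a minimal sketch in C, assuming (as the slide implicitly does) that the double bit error rate grows linearly with memory capacity and that 1 PB is counted as 1000 TB. The function names are illustrative and do not come from the talk.

```c
#include <stdio.h>

/* Ratio of a target memory capacity to a reference capacity (same units). */
double memory_scale_factor(double target_tb, double reference_tb) {
    return target_tb / reference_tb;
}

/* Expected minutes between double bit errors on the larger system, under the
 * ASSUMPTION that the error rate scales linearly with memory capacity. */
double scaled_error_interval_minutes(double reference_interval_hours,
                                     double scale_factor) {
    return reference_interval_hours * 60.0 / scale_factor;
}

/* Prints the estimate for the figures quoted on the slides. */
void print_exascale_estimate(void) {
    const double jaguar_tb = 362.0;       /* Jaguar memory, from the slide */
    const double exascale_tb = 128000.0;  /* 128 PB, taking 1 PB = 1000 TB */
    double scale = memory_scale_factor(exascale_tb, jaguar_tb);
    double minutes = scaled_error_interval_minutes(24.0, scale);
    printf("scale factor: %.0f, interval: %.1f minutes\n", scale, minutes);
}
```

Calling print_exascale_estimate() prints a scale factor of 354 and an interval of 4.1 minutes, which matches the "354 times Jaguar" and "about every 4 mins" figures above.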

We need to be Smart about Errors

• Not all errors are Identical...

• Not all errors can be ignored...


System Level View

• Application: Where and When Do Errors Matter?
- Type declarations
- Code region selection
- Dynamic memory allocation

• Operating System: Reasons about the severity and significance of memory faults due to soft errors in the address space
- Crash?
- Ignore?
- Correct!


System Level View

• Introspective Engine: Observe and React to Trends:
- Avoid Specific Memory Pages and/or Cores
- Migrate Pages and Processes


Motivation for Our Work

Provide Linguistic Mechanisms to Convey Criticality of Errors

• Emphasis on keeping things simple
• Existing libraries and application code base need not be entirely rewritten
• Complementary:
- Checkpoint & Restart: increase checkpoint interval
- Redundant Computation: potential energy savings


Programming Model Extensions for Resilience

• Applications carry inherent algorithmic fault tolerance knowledge
- Still, there are no convenient mechanisms to convey this knowledge to the lower layers of system abstraction
• Engage all software layers, including the application
- Compiler infrastructure
- Operating system
• Fault Model: SECDED Failures (SECDED: single-error correction, double-error detection)


Programming Model Extensions: Type Qualifiers

• tolerant int rgb[XDIM][YDIM];
• tolerant<MAX.VALUE=...> unsigned int counter;
• tolerant<precision.6f> double low_precision;

[Figure: two 32-bit word layouts with bit positions 31 down to 0; fields labeled S (sign), Exponent, Fraction.]
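The bit-layout figure suggests why a precision-qualified tolerant type can be reasonable: a flipped low-order fraction bit barely changes a floating-point value, while a flipped exponent or sign bit changes it drastically. The sketch below illustrates this in standard C, assuming a 32-bit IEEE 754 float. The tolerant qualifier is not understood by a standard compiler, so it is defined away as an empty macro here; flip_float_bit is a hypothetical helper, not part of the proposed extensions.

```c
#include <stdint.h>
#include <string.h>

/* HYPOTHETICAL stand-in: a standard C compiler does not know the proposed
 * `tolerant` qualifier, so it expands to nothing in this sketch. */
#define tolerant

/* Returns `value` with one bit of its 32-bit encoding flipped.
 * Bit 0 is the least significant fraction bit, bits 23..30 hold the
 * exponent, and bit 31 is the sign (ASSUMES IEEE 754 single precision). */
float flip_float_bit(float value, unsigned bit) {
    uint32_t raw;
    memcpy(&raw, &value, sizeof raw);    /* well-defined type punning */
    raw ^= UINT32_C(1) << (bit & 31u);   /* flip exactly one bit */
    memcpy(&value, &raw, sizeof value);
    return value;
}

/* A declaration written in the style of the slide. */
tolerant float low_precision_sample = 1.0f;
```

Under that assumption, flipping bit 0 of 1.0f changes the value by roughly 1.2e-7, flipping bit 23 turns it into 0.5f, and flipping bit 31 turns it into -1.0f. Only the first kind of error is a plausible candidate for being ignored.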


Dynamic Memory Allocation

<type>* <var> = (<cast>) tolerant_malloc(sizeof(<type>));

<type>* <var> = (<cast>) tolerant_malloc(NUM * sizeof(<type><MAX.VALUE=..>));
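The talk treats tolerant_malloc as part of a language extension handled by the compiler and operating system. As a rough user-level emulation only, the sketch below wraps malloc and records each allocation in a small registry, so that an error handler could ask whether a faulting address belongs to tolerant storage. The registry, its fixed size, and is_tolerant_address are assumptions made for illustration, and the type-parameter form (MAX.VALUE) is not modeled.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_TOLERANT_REGIONS 64   /* arbitrary limit for this sketch */

struct tolerant_region {
    uintptr_t start;   /* first byte of the allocation */
    size_t size;       /* length in bytes */
};

static struct tolerant_region regions[MAX_TOLERANT_REGIONS];
static size_t region_count = 0;

/* Allocates `size` bytes and records the range as error-tolerant.
 * Returns NULL if the allocation fails or the registry is full. */
void *tolerant_malloc(size_t size) {
    if (region_count == MAX_TOLERANT_REGIONS) return NULL;
    void *p = malloc(size);
    if (p == NULL) return NULL;
    regions[region_count].start = (uintptr_t)p;
    regions[region_count].size = size;
    region_count++;
    return p;
}

/* Returns 1 if `addr` lies inside a registered tolerant region, else 0. */
int is_tolerant_address(const void *addr) {
    uintptr_t a = (uintptr_t)addr;
    for (size_t i = 0; i < region_count; i++) {
        if (a >= regions[i].start && a - regions[i].start < regions[i].size)
            return 1;
    }
    return 0;
}
```

A matching tolerant_free that removes the registry entry is omitted for brevity; a real mechanism would also need to be thread-safe.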


Detection and Recovery: Robust Code Blocks

#pragma robust sentinel <variable list>
{
<code>
}

#pragma robust sentinel <predicate list>
{
<code>
}

• Possible Uses of Variable/Predicate Lists:
- Use of Variables List as the only non-tolerant storage
- Use of Predicate List for “soft-error checkers”
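One way to read the predicate-list form is that the predicates act as a cheap validity check on the block's result, and the block is re-executed when the check fails. The sketch below emulates that reading with ordinary function pointers. It is not the code the compiler in the talk generates, and run_robust_block, sum_block, and sum_in_range are hypothetical names. The block must be idempotent (it recomputes its result from its inputs) for re-execution to be safe.

```c
#include <stddef.h>

typedef void (*robust_block_fn)(void *state);
typedef int (*sentinel_predicate_fn)(const void *state);

/* Runs `block`, then evaluates `predicate` on the resulting state.
 * Re-executes the block until the predicate holds or `max_attempts` is
 * reached. Returns the number of attempts used, or 0 if it never held. */
int run_robust_block(robust_block_fn block, sentinel_predicate_fn predicate,
                     void *state, int max_attempts) {
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        block(state);
        if (predicate(state)) return attempt;
    }
    return 0;
}

/* Example state: four inputs, each known to lie in [0, 100]. */
struct sum_state {
    int input[4];
    int sum;
    int inject_fault;   /* demo hook: corrupt the first execution only */
};

/* Example block: recomputes the sum from scratch on every execution. */
void sum_block(void *state) {
    struct sum_state *s = (struct sum_state *)state;
    s->sum = 0;
    for (size_t i = 0; i < 4; i++) s->sum += s->input[i];
    if (s->inject_fault) {   /* emulate a transient bit flip */
        s->sum ^= 1 << 20;
        s->inject_fault = 0;
    }
}

/* Example "soft-error checker": the sum of four values in [0, 100]. */
int sum_in_range(const void *state) {
    const struct sum_state *s = (const struct sum_state *)state;
    return s->sum >= 0 && s->sum <= 400;
}
```

A range predicate like this only catches errors that push the value outside its legal range; small corruptions that stay in range pass unnoticed, which is the trade-off of a checker compared with full redundancy.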

Detection and Recovery: Robust Code Blocks

• How to use this information?
- Dual Threading uses the predicates to select the correct execution
- Triple Threading uses the variables for voting, followed by checking of the predicates
- Can be Adaptive: Dual Threading for simple detection, then roll back and proceed with Triple Threading
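The adaptive scheme can be sketched sequentially, with repeated calls standing in for redundant threads. In this sketch, which is an interpretation rather than the implementation from the talk, two executions are compared; if they differ and exactly one satisfies the predicate, that one is selected; otherwise the work is rolled back and run three times with majority voting followed by a predicate check. All names, including the fault-injection hooks fault_on_call and fault_mask, are hypothetical.

```c
#include <stddef.h>

typedef long (*compute_fn)(const void *input);
typedef int (*predicate_fn)(long value);

enum robust_status { ROBUST_OK, ROBUST_CORRECTED, ROBUST_FAILED };

/* Majority vote over three values. Returns 1 and stores the winner if at
 * least two values agree, otherwise returns 0. */
int vote3(long a, long b, long c, long *winner) {
    if (a == b || a == c) { *winner = a; return 1; }
    if (b == c) { *winner = b; return 1; }
    return 0;
}

enum robust_status run_adaptive(compute_fn compute, predicate_fn predicate,
                                const void *input, long *result) {
    /* Dual execution: cheap detection by comparing the sentinel value. */
    long r0 = compute(input);
    long r1 = compute(input);
    int ok0 = predicate(r0), ok1 = predicate(r1);
    if (r0 == r1 && ok0) { *result = r0; return ROBUST_OK; }
    if (r0 != r1 && ok0 != ok1) {   /* predicate selects the good execution */
        *result = ok0 ? r0 : r1;
        return ROBUST_CORRECTED;
    }
    /* Still ambiguous: roll back and proceed with triple execution. */
    long t0 = compute(input);
    long t1 = compute(input);
    long t2 = compute(input);
    long winner;
    if (vote3(t0, t1, t2, &winner) && predicate(winner)) {
        *result = winner;
        return ROBUST_CORRECTED;
    }
    return ROBUST_FAILED;
}

/* Demo computation with a fault-injection hook (illustration only). */
int call_count = 0;      /* number of sum_array calls so far */
int fault_on_call = 0;   /* 1-based call to corrupt; 0 disables injection */
long fault_mask = 0;     /* bits XORed into the corrupted result */

long sum_array(const void *input) {
    const int *values = (const int *)input;
    long sum = 0;
    for (size_t i = 0; i < 4; i++) sum += values[i];
    if (++call_count == fault_on_call) sum ^= fault_mask;
    return sum;
}

int in_expected_range(long value) { return value >= 0 && value <= 400; }
```

The design choice mirrors the slide: dual execution is the common, cheaper case, and the cost of a third replica is paid only after an error has been detected and cannot be resolved by the predicate alone.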


Robust Program Elements: the ‘critical’ Pragma

• Source code #pragma recognized by ROSE Compiler

Input source:

#include <stdlib.h>   /* NULL, free */

int main (int argc, char** argv) {
  /* Request 3 replicas of the pointer declared on the next line. */
  #pragma critical 3
  int* A_j = NULL;
  int A_i[100];
  int y_data[100], x_data[100], A_data[100];
  int jj = 0, i = 0;
  /* Allocation and initialization are elided on the slide. */
  //A_j = (int *) malloc (sizeof(int) * 100);
  for (jj = A_i[i]; jj < A_i[i + 1]; ++jj) {
    y_data[i] += A_data[jj] * x_data[A_j[jj]];
  }
  free(A_j);
}

Source after the transformation:

#include "new_critical_var.h"

int main(int argc,char **argv)
{
  #pragma critical 3
  int *A_j = (int *)((int *)((void *)0));
  int *A_j0 = (int *)((int *)((void *)0));   /* replica 2 */
  int *A_j1 = (int *)((int *)((void *)0));   /* replica 3 */
  int A_i[100UL];
  int y_data[100UL];
  int x_data[100UL];
  int A_data[100UL];
  int jj = 0;
  int i = 0;
  //A_j = (int *) malloc (sizeof(int) * 100);
  for (jj = A_i[i]; jj < A_i[i + 1]; ++jj) {
    /* Every use of A_j goes through a vote over the three replicas. */
    y_data[i] += (A_data[jj] * x_data[((int *)(triplication(A_j,A_j0,A_j1)))[jj]]);
  }
  free(((int *)(triplication(A_j,A_j0,A_j1))));
  return 0;
}

/* Majority vote: returns the pointer value held by at least two replicas. */
void *triplication(void *p1, void *p2, void *p3) {
  if (p1 == p2 || p1 == p3) {
    return p1;
  } else if (p2 == p3) {
    return p2;
  } else {
    exit(1);   /* all three replicas disagree: no correction possible */
  }
}


Robust Program Elements: the ‘critical’ Pragma

• The argument of #pragma critical (3 in the example above) is the number of replicas of the pointer variable declared in the subsequent program text line


Simple Experiments

• Methodology:
- “Inject” errors in memory
- Check if storage has been labeled as tolerant
- If so ignore; else continue (and possibly crash later)

• Simple Codes:
- HPCC Random Access
- Molecular Dynamics
- Algebraic Multi-Grid Solver
- Graph BFS

• Amelioration:
- Ignore in many cases (AMG will naturally converge...)
- Remove offending data in other cases (MD code)
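The methodology can be sketched as two steps: corrupt a memory word, then decide what to do based on the storage label. The code below is a simplified illustration, not the injection framework used in the experiments. It flips two distinct bits of one 64-bit word (a double bit error, which SECDED ECC can detect but not correct) and returns the action implied by the label. The struct and function names are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

enum fault_action { FAULT_IGNORE, FAULT_PROPAGATE };

struct labeled_buffer {
    uint64_t *words;   /* storage under test */
    size_t count;      /* number of words, must be > 0 */
    int is_tolerant;   /* 1 if the storage was declared tolerant */
};

/* Flips two distinct bits of one 64-bit word. */
void inject_double_bit_fault(uint64_t *word, unsigned bit_a, unsigned bit_b) {
    bit_a &= 63u;
    bit_b &= 63u;
    if (bit_a == bit_b) bit_b = (bit_b + 1u) & 63u;   /* keep bits distinct */
    *word ^= (UINT64_C(1) << bit_a) | (UINT64_C(1) << bit_b);
}

/* Injects a fault into word `index` (wrapped into range) and returns the
 * action from the slide: ignore it for tolerant storage, otherwise let the
 * application continue with the error (and possibly crash later). */
enum fault_action inject_and_classify(struct labeled_buffer *buf, size_t index,
                                      unsigned bit_a, unsigned bit_b) {
    inject_double_bit_fault(&buf->words[index % buf->count], bit_a, bit_b);
    return buf->is_tolerant ? FAULT_IGNORE : FAULT_PROPAGATE;
}
```

In an experiment the word index and bit positions would be drawn at random at the chosen injection interval; they are explicit parameters here so the behavior stays deterministic.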


Simple Experiments:

System Workflow


Experimental Results

% Execution Runs to Completion

Fault Injection   Random    Molecular Dynamics   Algebraic Multigrid   Graph Breadth-First-
Rate (minutes)    Access    Simulation           Linear Solver         Search Traversal
15                99.5 %    66.2 %               96.1 %                61.4 %
10                99.2 %    61.3 %               92.7 %                32.8 %
5                 99.1 %    36.3 %               86.6 %                19.8 %
2                 97.4 %    7.1 %                81.2 %                2.3 %
1                 96.1 %    1.2 %                63.3 %                0.8 %

Fault injection: Multi-bit faults, non-recoverable by ECC


Vulnerability Analysis: Execution Lifetime


Safely Ignore Errors: Programmer knows best

• If a molecule is outside the box, ignore it (it is removed)
• Just keep computing (e.g. NaN): will converge, but slower
• Just crashed: nothing you can do about it for the time being
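The "molecule outside box" rule is an example of application knowledge replacing generic error correction. The sketch below assumes a cubic simulation box spanning [0, box_length) in each dimension and a plain array of particles; both are assumptions, since the slides do not show the molecular dynamics code. Treating non-finite coordinates as invalid is also an addition made here for illustration.

```c
#include <math.h>
#include <stddef.h>

struct particle {
    double x, y, z;
};

/* A particle is valid if every coordinate is finite and inside the box.
 * ASSUMES a cubic box spanning [0, box_length) in each dimension. */
int particle_is_valid(const struct particle *p, double box_length) {
    return isfinite(p->x) && isfinite(p->y) && isfinite(p->z) &&
           p->x >= 0.0 && p->x < box_length &&
           p->y >= 0.0 && p->y < box_length &&
           p->z >= 0.0 && p->z < box_length;
}

/* Removes invalid particles by compacting the array in place.
 * Returns the number of particles kept. */
size_t remove_invalid_particles(struct particle *particles, size_t count,
                                double box_length) {
    size_t kept = 0;
    for (size_t i = 0; i < count; i++) {
        if (particle_is_valid(&particles[i], box_length))
            particles[kept++] = particles[i];
    }
    return kept;
}
```

Dropping a corrupted particle changes the simulated system slightly, so whether this is acceptable is exactly the kind of decision the slide leaves to the programmer.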


Resiliency Techniques for Pointer-based Data Structures

• Duplicate/Triplicate Node Fields
- Duplication: allows for detection (and in some instances correction; see next slide)
- Triplication: allows for correction (TMR-like)
• Increases Storage Overhead
• Not Exclusive to Graphs...


Exploiting Alignment and Storage Bounds

• Field Duplication
- Traditionally only used for Detection of Errors
• Disambiguating 1-of-2 bad Pointers (rule out bad pointers):
- Exploiting Data Structure Properties
- Alignment of Node Storage
- Out-of-Bounds Pointers
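With only two copies of a pointer, a mismatch normally tells you that an error occurred but not which copy is wrong. The sketch below shows how alignment and storage bounds can rule out the bad copy. It assumes all nodes are allocated from one contiguous pool, so a valid pointer must fall inside the pool and on a node boundary. The pool layout and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct node {
    struct node *next;        /* primary copy of the link */
    struct node *next_copy;   /* duplicated field */
    int value;
};

/* ASSUMPTION: every node lives in one contiguous array of `capacity` nodes. */
struct node_pool {
    struct node *base;
    size_t capacity;
};

/* A pointer is plausible if it is NULL, or if it lies inside the pool
 * (bounds check) and on a node boundary (alignment check). */
int pointer_is_plausible(const struct node_pool *pool, const struct node *p) {
    if (p == NULL) return 1;
    uintptr_t addr = (uintptr_t)p;
    uintptr_t start = (uintptr_t)pool->base;
    uintptr_t end = start + pool->capacity * sizeof(struct node);
    if (addr < start || addr >= end) return 0;
    return (addr - start) % sizeof(struct node) == 0;
}

/* Resolves the successor of `n`. Returns 1 and stores it in `*out` when the
 * copies agree or exactly one copy is plausible; returns 0 when the error
 * cannot be disambiguated. */
int resolve_next(const struct node_pool *pool, const struct node *n,
                 struct node **out) {
    if (n->next == n->next_copy) { *out = n->next; return 1; }
    int a = pointer_is_plausible(pool, n->next);
    int b = pointer_is_plausible(pool, n->next_copy);
    if (a != b) { *out = a ? n->next : n->next_copy; return 1; }
    return 0;
}
```

A bit flip that happens to produce another valid node address defeats this check, so duplication with plausibility tests corrects only some errors, consistent with the "in some instances" qualifier on the previous slide.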


Coping with Irrecoverable Pointers

• Some Traversals do not require ‘exact’ Results
- User may be happy with partial searches
- Randomized Computations
• How to Exploit these ‘Acceptance’ Criteria?
- Detect Error & Skip Node
- Built-in Redundancy of Data Structure Helps
- Extra Pointers or Confluence of Paths
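"Detect Error & Skip Node" can be illustrated with a breadth-first search that ignores edge targets that are obviously corrupted. The sketch below assumes a graph stored in compressed sparse row form with vertex indices instead of pointers, and detects only out-of-range targets; the representation, the size limit, and the names are choices made for this example.

```c
#include <stddef.h>

#define MAX_VERTICES 64   /* arbitrary limit for this sketch */

/* Neighbors of vertex v are edges[offsets[v]] .. edges[offsets[v + 1] - 1]. */
struct graph {
    size_t vertex_count;
    const size_t *offsets;   /* vertex_count + 1 entries */
    const size_t *edges;
};

/* Breadth-first search from `source` that skips corrupted edge targets.
 * Returns the number of vertices reached; `*skipped` receives the number
 * of edges ignored because their target was out of range. */
size_t tolerant_bfs(const struct graph *g, size_t source, size_t *skipped) {
    unsigned char visited[MAX_VERTICES] = {0};
    size_t queue[MAX_VERTICES];
    size_t head = 0, tail = 0, reached = 0;
    *skipped = 0;
    if (g->vertex_count > MAX_VERTICES || source >= g->vertex_count) return 0;
    visited[source] = 1;
    queue[tail++] = source;
    while (head < tail) {
        size_t v = queue[head++];
        reached++;
        for (size_t e = g->offsets[v]; e < g->offsets[v + 1]; e++) {
            size_t target = g->edges[e];
            if (target >= g->vertex_count) {   /* detect error and skip */
                (*skipped)++;
                continue;
            }
            if (!visited[target]) {
                visited[target] = 1;
                queue[tail++] = target;
            }
        }
    }
    return reached;
}
```

For a diamond-shaped graph with edges 0→1, 0→2, 1→3 and 2→3, corrupting the 1→3 edge still lets the search reach all four vertices through 2→3, which is the "confluence of paths" effect named on the slide. Corruption that yields another in-range index is not detected by this check.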


Future Directions

• Application Specific Programming Extensions
• Resilience with Power and Energy
• Platform & System Heterogeneity


Future Dynamic Systems

[Figure: block diagram of a future dynamic system. Labels: Developer; Annotated Application Source Code (C, C++, UPC, Fortran); Graph500, SSCA2, max2sat; Compiler (ROSE); Vendor Compilers; LLVM; Adaptive Application Executable; Introspective Runtime System; Operating System; Exascale Hardware; OS/Hardware Infrastructure (Out of scope); Knowledge/Experience Database (KEX). Phases: Off-line (Compile-Time) and On-line (Run-Time). Bullet labels: Knowledge about Application Domain, System Properties, Strategic Goals, Execution Behavior; Monitoring, Analysis, Recovery/Optimization Feedback, Prediction.]


Conclusion & Perspectives

• The need for Reliability is not New:
- Hardware / Software TMR in Space-borne & Safety-critical systems
• HPC Systems have a ‘different’ constraint: all simultaneously available, and scalability issues
• Plenty of system and application knowledge about acceptable solutions, but no widely accepted mechanisms to convey it.
