Execution Frontiers: CnC support for highly adaptive execution

Kath Knobe, Intel (Software and Services Group), 12/07/12

Page 1:

Software and Services Group

Execution Frontiers: CnC support for highly adaptive execution

Kath Knobe, Intel

12/07/12

Page 2:

Warning

• This is all high-level conceptual thinking
• Many details to be determined
• Today: just the basic idea, without any concern for efficiency
• Lots of room for optimizing

Suggestions/comments more than welcome!

Page 3:

Motivation: Highly adaptive computing for exascale

Critical exascale issues (inspired by work on UHPC and X-Stack) require the ability to move currently executing parts of the app to another place in the platform or to a later time.

• Resilience
  − Fragile components
  − Lots of them
• Power management
  − Power components off
  − Power components down
• Self-aware computing
  − Modify mapping based on feedback
• Change of goals
  − Between power and time to solution, for example

Thesis: management of the execution frontiers in CnC is a mechanism supporting highly adaptive computing for exascale.

Page 4:

Checkpoint/restart; Hierarchical CnC

Hierarchical checkpoint/restart (for faults)

Hierarchical checkpoint/restart for adaptive execution

2 passes:
− Abstract: unlimited resources
− Actual: with resource constraints

Page 5:

Outline

• Abstract (platform has infinite memory and processors)
  − Semantic state
  − Checkpoint/restart
  − Hierarchical CnC
  − Hierarchical checkpoint/restart
• Actual (with resource constraints)
• Beyond faults

Page 6:

Outline

• Abstract
  − Semantic state
  − Checkpoint/restart
  − Hierarchical CnC
  − Hierarchical checkpoint/restart
• Actual
• Beyond faults

Page 8:

Semantics / execution model

[Diagram nodes: item (avail), tag (avail)]

Page 9:

Semantics / execution model

[Diagram nodes: item (avail), tag (avail), step (controlReady), step (dataReady)]

Page 10:

Semantics / execution model

[Diagram nodes: item (avail), tag (avail), step (controlReady), step (dataReady), step (ready)]

Page 13:

Semantics / execution model

[Diagram nodes: item (avail), tag (avail), step (controlReady), step (dataReady), step (ready), step (executed)]

Page 14:

Semantics / execution model

[Diagram nodes: item (avail), tag (avail), step (controlReady), step (dataReady), step (ready), step (executed)]

The primitive attributes come from below: available, executed.
The derived attributes propagate at this level: control_ready, data_ready, ready.

2 levels:
• Graph level (above)
• User serial code level (below)
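The split between primitive and derived attributes can be sketched in a few lines of Python. This is only an illustration: the class and field names are invented here, and the derivation rules (control-ready once the prescribing tag is available, data-ready once every input item is available, ready when both hold) are the usual reading of the diagram rather than something the slide spells out.

```python
from dataclasses import dataclass, field

@dataclass
class StepInstance:
    # Primitive attributes: set from below (tag/item puts, end of serial code).
    tag_available: bool = False
    inputs_available: dict = field(default_factory=dict)  # item key -> available?
    executed: bool = False

    # Derived attributes: computed at the graph level, never stored.
    @property
    def control_ready(self) -> bool:
        return self.tag_available

    @property
    def data_ready(self) -> bool:
        return all(self.inputs_available.values())

    @property
    def ready(self) -> bool:
        return self.control_ready and self.data_ready

# Hypothetical step instance with one input item.
s = StepInstance(inputs_available={"A[0,0]": False})
s.tag_available = True            # control tag arrives first
ready_before = s.ready            # item still missing
s.inputs_available["A[0,0]"] = True
ready_after = s.ready
```

Because the derived attributes are properties, saving only the primitive ones is enough to rebuild the rest.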

Page 15:

Execution frontier

• An execution frontier is a CnC program state:
  − The set of attributes of instances of steps, tags and items
  − The contents of available items
• CnC execution can proceed from an execution frontier
• Some examples of execution frontiers:
  − Normal program input (set of available items and tags)
  − Normal program output (set of available items and tags)
  − Any state during execution (more general)
• Perspective
  − Traditional focus:
    > Data structure is items; computation is step.
    > Step instance consumes and produces items.
  − Alternate view:
    > Data structure is execution frontier; computation is step, subgraph or full program.
    > Applying a computation to an execution frontier yields another execution frontier.
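The alternate view can be made concrete with a small sketch in which the frontier is an immutable value and a step is a function from one frontier to the next. The `Frontier` type, `apply_step` and the `double` step are all hypothetical names chosen for illustration, not part of any CnC API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Frontier:
    tags: frozenset        # available control tags
    items: tuple           # available items as sorted (key, value) pairs
    executed: frozenset    # step instances that have executed

def apply_step(frontier, step_tag, step_fn):
    """Apply one step instance to a frontier, yielding a new frontier."""
    items = dict(frontier.items)
    new_items, new_tags = step_fn(step_tag, items)
    items.update(new_items)
    return Frontier(
        tags=frontier.tags | frozenset(new_tags),
        items=tuple(sorted(items.items())),
        executed=frontier.executed | {step_tag},
    )

def double(tag, items):
    # Hypothetical step: gets item ("x", tag), puts item ("y", tag), puts no tags.
    return {("y", tag): 2 * items[("x", tag)]}, []

start = Frontier(tags=frozenset({1}), items=((("x", 1), 21),), executed=frozenset())
end = apply_step(start, 1, double)
```

The input frontier is left untouched, so `start` remains a valid point to proceed from again.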

Page 16:

Outline

• Abstract
  − Semantic state
  − Checkpoint/restart
  − Hierarchical CnC
  − Hierarchical checkpoint/restart
• Actual
• Beyond faults

Page 17:

Checkpoint/restart summary (abstract)

• Changes to the execution frontier are saved continuously as they occur
• Changes are saved in a less volatile “place”
• Asynchronous, no barriers
• No programmer involvement
• Saved state may not correspond to an actual state
• Can restart from any saved state
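One way to picture continuous, barrier-free checkpointing is an append-only log of frontier changes that can be replayed up to any point. The sketch below is an assumption about how this could look, not the design from the talk; `FrontierLog` and the change kinds `"tag"`, `"item"` and `"executed"` are invented names.

```python
class FrontierLog:
    """Append-only record of frontier changes, standing in for the less volatile place."""

    def __init__(self):
        self.changes = []

    def record(self, kind, key, value=None):
        # kind is one of "tag", "item", "executed" (names assumed for this sketch).
        self.changes.append((kind, key, value))

    def restore(self, upto=None):
        """Rebuild a frontier by replaying the first `upto` changes (all if None)."""
        frontier = {"tags": set(), "items": {}, "executed": set()}
        for kind, key, value in self.changes[:upto]:
            if kind == "tag":
                frontier["tags"].add(key)
            elif kind == "item":
                frontier["items"][key] = value
            elif kind == "executed":
                frontier["executed"].add(key)
        return frontier

log = FrontierLog()
log.record("tag", 0)
log.record("item", ("A", 0), 3.5)
log.record("executed", ("cholesky", 0))
partial = log.restore(upto=2)   # a saved state that misses the last change
full = log.restore()
```

A prefix such as `partial` may never have existed as an actual runtime state, yet execution can still proceed from it: CnC items are single-assignment, so redoing a step that was not recorded as executed puts the same values again.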

Page 18:

Outline

• Abstract
  − Semantic state
  − Checkpoint/restart
  − Hierarchical CnC
  − Hierarchical checkpoint/restart
• Actual
• Beyond faults

Page 19:

Cholesky domain spec

Control tags:
− CholeskyTag: iter
− TrisolveTag: row, iter
− UpdateTag: col, row, iter

Compute steps:
− Cholesky: iter
− Trisolve: row, iter
− Update: col, row, iter

Data item:
− Array: col, row, iter
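The domain spec can be written down as plain data, which makes simple consistency checks possible. In this sketch the collection names and tag components come from the slide, while the pairing of each tag collection with the step collection it prescribes is inferred from the names and should be read as an assumption.

```python
# Collections from the slide; each maps a collection name to its tag components.
TAGS = {
    "CholeskyTag": ("iter",),
    "TrisolveTag": ("row", "iter"),
    "UpdateTag": ("col", "row", "iter"),
}
STEPS = {
    "Cholesky": ("iter",),
    "Trisolve": ("row", "iter"),
    "Update": ("col", "row", "iter"),
}
ITEMS = {"Array": ("col", "row", "iter")}

# Assumed pairing (inferred from the names): each tag collection prescribes one step collection.
PRESCRIBES = {"CholeskyTag": "Cholesky", "TrisolveTag": "Trisolve", "UpdateTag": "Update"}

def check_prescriptions(tags, steps, prescribes):
    """Return the tag collections whose components differ from the step they prescribe."""
    return [t for t, s in prescribes.items() if tags[t] != steps[s]]

mismatches = check_prescriptions(TAGS, STEPS, PRESCRIBES)
```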

Page 20:

Looks like a CnC spec at each level

CONTROL TAG <iterTag: iter>
COMPUTE STEP (C: iter)

Page 21:

Looks like a CnC spec at each level

iterations
CONTROL TAG <iterTag: iter>
COMPUTE STEP (cholesky:)
COMPUTE STEP (C: iter)
COMPUTE STEP (TU:)

Page 22:

Looks like a CnC spec at each level

CONTROL TAG <iterTag: iter>
COMPUTE STEP (C: iter)
COMPUTE STEP (U:)
COMPUTE STEP (trisolve)
CONTROL TAG <rowTag: row>
COMPUTE STEP (cholesky:)
COMPUTE STEP (TU:)

Page 23:

Executed semantics: leaf

COMPUTE STEP (trisolve: row)

Serial code (schematic): get… get… … = .. + … * … / … = … if … put

Executed is a primitive attribute. It comes from below.
− Leaf: termination of the serial code below

Page 24:

Executed semantics: non-leaf

COMPUTE STEP (U:)
COMPUTE STEP (trisolve)
CONTROL TAG <rowTag: row>
COMPUTE STEP (TU:)

Executed is a primitive attribute. It comes from below.
− Leaf: termination of the serial code below
− Non-leaf: termination of the subgraph below
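The leaf and non-leaf cases can be folded into one recursive definition. The sketch below assumes a simplification: a subgraph counts as terminated once every step instance in it has executed, which ignores tags that might still arrive. The `Node` class and the small TU tree are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    serial_done: bool = False   # set when a leaf's serial code terminates

    def executed(self) -> bool:
        if not self.children:
            # Leaf: termination of the serial code below.
            return self.serial_done
        # Non-leaf: termination of the subgraph below (assumed: all children executed).
        return all(child.executed() for child in self.children)

# Hypothetical subgraph: TU is a non-leaf step with two leaf children.
trisolve = Node("trisolve")
u = Node("U")
tu = Node("TU", children=[trisolve, u])

trisolve.serial_done = True
before = tu.executed()
u.serial_done = True
after = tu.executed()
```

Seen from the level above, `tu.executed()` is just another primitive attribute; the parent does not need to know whether it came from serial code or from a subgraph.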

Page 25:

Hierarchical CnC application: execution is at the leaves only

Cholesky

trisolve

update

Page 26:

Hierarchical CnC application: intermediate nodes maintain state

State of each iteration

State of each row

Page 27:

Hierarchical view of the abstract platform tree

A node looks like a full machine at each level: a subtree of the memory hierarchy + the associated set of cores

Hierarchical platform node

Page 28:

Abstract platform: the depth and extent of the platform hierarchy correspond exactly to the depth and extent of the dynamic application

The mapping is direct

Page 29:

Outline

• Abstract
  − Semantic state
  − Checkpoint/restart
  − Hierarchical CnC
  − Hierarchical checkpoint/restart
• Actual
• Beyond faults

Page 30:

Hierarchical checkpoint/restart (abstract)

Hierarchical application node

Page 31:

Hierarchical checkpoint/restart (abstract)

Checkpoint for that application node

Hierarchical application node

Page 32:

Hierarchical checkpoint/restart (abstract)

Checkpoint for that application node resides at the parent place

Hierarchical application node

Page 33:

Hierarchical checkpoint/restart (abstract)

Checkpoint for that application node resides at the parent place

Hierarchical application node

Distinct checkpoints residing at a single place remain separate. We will see why later.

Page 34:

Abstract failure model

• The system knows if/when a node fails
  − We’re not talking about soft errors
• Abstract platform node fails temporarily then returns

Page 35:

Hierarchical checkpoint/restart (abstract)

1-level Checkpoint
• Fault
• Fullstop
• Restart

Page 39:

Hierarchical checkpoint/restart (abstract)

Checkpoint in hierarchy
• Fault
• Fullstop
• Restart

Page 43:

Hierarchical checkpoint/restart (abstract)

Checkpoint in hierarchy
• Fault
• Fullstop
• Restart

From above: step simply looks like it took longer than expected.

Checkpoint/fullstop at one node looks like checkpoint/continue for the whole program

Page 44:

Hierarchical checkpoint/restart: Summary

• Each node in a hierarchy has all the characteristics of a whole program checkpoint.
• Checkpoint/fullstop/restart at nodes in the hierarchy enables the application as a whole to adapt and continue through faults.
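The fault / fullstop / restart cycle at one node can be mimicked with a toy in which the checkpoint lives outside the failing child. Everything here is a stand-in invented for illustration: steps are integers, a step's output is its square, and a fault is a raised exception.

```python
def run_child(work, checkpoint, fail_at=None):
    """Run the steps in `work` that `checkpoint` lacks; raise at step `fail_at` to mimic a fault."""
    for step in work:
        if step in checkpoint:
            continue                      # already saved before the fault
        if step == fail_at:
            raise RuntimeError(f"fault at {step}")
        checkpoint[step] = step * step    # stand-in for the items the step produces

def run_with_restart(work, fail_at):
    checkpoint = {}                       # kept at the parent place, so it survives the child
    try:
        run_child(work, checkpoint, fail_at)
    except RuntimeError:
        run_child(work, checkpoint)       # fullstop, then restart from the saved frontier
    return checkpoint

result = run_with_restart([1, 2, 3, 4], fail_at=3)
no_fault = run_with_restart([1, 2, 3, 4], fail_at=None)
```

The caller of `run_with_restart` gets the same result with or without the fault, which matches the view from above: the step simply looks like it took longer than expected.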

Page 45:

Outline

• Abstract
• Actual: with resources and resource constraints
  − Semantic state
  − Checkpoint/restart
  − Hierarchical CnC
  − Hierarchical checkpoint/restart
• Beyond faults

Page 46:

Semantic state for execution (limited memory)

• Checkpointed information leaves the trailing edge of the execution frontier
  − Dead tags
  − Dead items
  − Dead steps
  This is the motivation for the term “execution frontier” as opposed to “execution state”. It’s only the relevant frontier of the state.
• Dead is a derived attribute. It doesn’t propagate up from the children. It is derived independently within each (sub)program.
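The slide does not say how dead is derived. One plausible rule, used here purely as an assumption, is that an item is dead once every step instance that gets it has executed. The helper names and the consumer table below are hypothetical.

```python
def dead_items(consumers, executed):
    """Items whose every consumer has executed (assumed definition of dead)."""
    return {item for item, steps in consumers.items() if steps <= executed}

def trim(frontier_items, consumers, executed):
    """Drop dead items from the frontier; items with no consumer entry are kept."""
    dead = dead_items(consumers, executed)
    return {k: v for k, v in frontier_items.items() if k not in dead}

# Hypothetical (sub)program: item "a" is read by s1 and s2, item "b" by s2 and s3.
consumers = {"a": {"s1", "s2"}, "b": {"s2", "s3"}}
live = trim({"a": 1, "b": 2, "out": 3}, consumers, executed={"s1", "s2"})
```

The derivation only looks at consumers inside one (sub)program, in line with the point that dead is derived independently at each level.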

Page 47:

Hierarchical CnC map to actual platform

Platform: limited depth / limited extent at each level

Platform hierarchy

Application hierarchy

Page 48:

Hierarchical CnC map to actual platform: flatten the depth

Platform hierarchy

Application hierarchy

Page 49:

Hierarchical CnC map to actual platform: fold extent

Platform hierarchy

Application hierarchy
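Flattening and folding can be illustrated with a tiny mapping function from an application-tree path to a platform-tree path. The policy shown (truncate the path to the platform depth, wrap each index modulo the fan-out) is just one simple choice made up for this sketch; the talk leaves the policy open.

```python
def map_to_platform(app_path, fanouts):
    """Map an application-tree path to a platform-tree path.

    app_path: child indices from the application root down to a node.
    fanouts:  fanouts[i] is the number of children at platform level i.
    """
    # Flatten the depth: application levels below the platform's depth share one platform leaf.
    flattened = app_path[:len(fanouts)]
    # Fold the extent: wrap each index onto the children that actually exist.
    return tuple(i % f for i, f in zip(flattened, fanouts))

# Hypothetical platform with 2 levels (4 children, then 2 children each).
place = map_to_platform((5, 2, 7), [4, 2])
```

Application nodes (5, 2, 7) and (1, 0, 3) land on the same platform node here, so their checkpoints would share a place while remaining separate.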

Page 50:

Actual failure model

• Platform node fails and may not return
  − or don’t want to wait until it returns
• Restart is at some other platform node

Page 51:

Remapping

[Diagram: application nodes A and B, with their map onto the platform]

Page 52:

Remapping

[Diagram: application nodes A and B, with their map onto the platform]

Page 53:

Remapping

[Diagram: application nodes A and B, with their map onto platform places X and Y]

Map:
− Original checkpoint of B is at X
− New checkpoint of B is at Y
− Follows the new platform location

Page 54:

Remapping

[Diagram: application nodes A and B, with their map onto platform places X and Y]

Map:
− Original checkpoint of B is at X
− New checkpoint of B is at Y
− Follows the new platform location

This is why we don’t want to merge checkpoints of the application children at the platform parent. We may want to relocate each child independently.
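A small sketch shows why separate per-child checkpoints make independent relocation easy. The `Platform` class, its two tables and the node names n1 to n3 are hypothetical; only A, B, X and Y come from the slide.

```python
class Platform:
    def __init__(self):
        self.location = {}      # application node -> (platform node, parent place)
        self.checkpoints = {}   # parent place -> {application node: checkpoint}

    def place(self, app, node, parent):
        self.location[app] = (node, parent)
        self.checkpoints.setdefault(parent, {})[app] = {}

    def remap(self, app, new_node, new_parent):
        _, old_parent = self.location[app]
        # Only this child's checkpoint moves; its siblings' checkpoints stay put.
        saved = self.checkpoints[old_parent].pop(app)
        self.checkpoints.setdefault(new_parent, {})[app] = saved
        self.location[app] = (new_node, new_parent)

p = Platform()
p.place("A", node="n1", parent="X")
p.place("B", node="n2", parent="X")
p.checkpoints["X"]["B"]["item"] = 42      # B has saved some state at X
p.remap("B", new_node="n3", new_parent="Y")
```

Had the checkpoints of A and B been merged at X, moving B would first require separating B's state from A's.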

Page 55:

What do we have?

• A way of maintaining the execution frontier of
  − A running application
  − A running subgraph of an application
• A mechanism for taking an execution frontier and moving it
  − To another place
  − To a later time
• Use of this to cope with faults

Page 56:

Outline

• Abstract
• Actual: with resources and resource constraints
• Beyond faults

Page 57:

Adaptive execution

• If we can checkpoint and continue elsewhere on a fault, we can checkpoint and continue elsewhere for our own reasons. Big relevant exascale issues:
  − Resilience
    > Actual/predicted failures
  − Power management
  − Self-aware computing
  − Changes in goals
• Mechanism, not policy!
• Status:
  − No staffing or funding yet.

Page 58:

Other uses of execution frontiers

• Mechanism for connecting reusable components
• Low priority app
  − Execute/checkpoint/restart one step at a time
  − Stop mid-step when high priority work arrives
• Long-lived app with very slowly arriving input
  − e.g., phylogenetic tree for SARS virus
• Debugging
  − View state
  − Reverse time (undo)
• Soft errors
  − Compute more than once. Compare
• Something like out-of-core computation but not baked into application

Page 59:

Potential: Forms & operations

Forms
• As executing
  − general, arrays, trees…
• Serialized
• Streaming
• Encrypted
• Compressed
• Database
• Excel
• Human readable

Operations
• Save/restore
• Partition/specialize
  − At fork into distinct large subgraphs
• Merge
  − At join of distinct large subgraphs
• Send
• Compare (e.g., for fault tolerance)
• Explicitly modify (e.g., debug)
• Rename collections (e.g., for composition)
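A few of these operations are easy to sketch once a frontier is treated as plain data. Below, a frontier is reduced to a dict from string item keys to values, with JSON as the human-readable serialized form; the function names and the key format are assumptions made for this example only.

```python
import json

def save(frontier):
    """Serialize a frontier (dict of item key -> value) to human-readable JSON."""
    return json.dumps(sorted(frontier.items()))

def restore(text):
    """Rebuild the frontier from its serialized form."""
    return {k: v for k, v in json.loads(text)}

def partition(frontier, belongs_left):
    """Split a frontier in two, e.g. at a fork into distinct subgraphs."""
    left = {k: v for k, v in frontier.items() if belongs_left(k)}
    right = {k: v for k, v in frontier.items() if not belongs_left(k)}
    return left, right

def merge(a, b):
    """Join two frontiers; single-assignment items must agree where they overlap."""
    clash = {k for k in a.keys() & b.keys() if a[k] != b[k]}
    if clash:
        raise ValueError(f"conflicting values for {sorted(clash)}")
    return {**a, **b}

f = {"x:0": 1, "x:1": 2, "y:0": 3}
left, right = partition(f, lambda k: k.startswith("x"))
round_trip = restore(save(f))
rejoined = merge(left, right)
```

Compare then falls out as ordinary equality on the restored values, for instance `round_trip == f`.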

Page 60:

Relook at motivation: Highly adaptive computing for exascale

Critical exascale issues require the ability to move currently executing parts of the app to another place in the platform or to a later time.

• Resilience
  − Fragile components
  − Lots of them
• Power management
  − Power components off
  − Power components down
• Self-aware computing
  − Modify mapping based on feedback
• Change of goals
  − Between power and time to solution, for example

Looking forward to:
• Lowering the design
• Implementation
• Experimenting

Looking for feedback and collaborators
