-
Towards Exascale
ROMA
June 2014
PhD Students' Day (Journée des doctorants)
-
The story starts with a box ...
... that contains lots of little boxes.
-
The Titan supercomputer:
• 404 m² (the big box)
• 299,008 processor cores (the small boxes)
• 17.59 PetaFlops
• 8.2 MW
• 693.6 TiB of RAM
• 240 GB/s transfer speed to RAM
Image courtesy of Oak Ridge National Laboratory, U.S. Dept. of Energy
-
Then what is Exascale?
×1000
But in the same box:
-
Linear algebra: problems get bigger and bigger
Code Aster, Carter (e.g., finite elements) → solution of sparse systems Ax = b
Often the most expensive part in numerical simulation codes.
Sparse direct methods to solve Ax = b:
• Decompose A under the form LU, LDL^t or LL^t
• Solve the triangular systems Ly = b, then Ux = y
3D example in earth science: acoustic wave propagation, 27-point finite difference grid.
Current goal [Seiscope project]: LU factorization on the complete Earth, n = N³ = 1000³
Extrapolation on a 1000 × 1000 × 1000 grid: 55 exaflops, 200 TBytes for the factors, 40 TBytes for the active memory!
-
Resilience
The main known problem for Exascale is Resilience.
[Figure: an execution timeline with periodic checkpoints, showing the time to checkpoint between failures]
What if there is 1000× the processing power? It gets worse.
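The slide's point can be made concrete with Young's first-order approximation of the optimal checkpoint period, T_opt ≈ √(2 · C · MTBF). This is a standard rule of thumb, not stated on the slide, and all numbers below are illustrative:

```python
import math

def optimal_period(C, mtbf):
    """Young's first-order approximation of the optimal checkpoint period."""
    return math.sqrt(2 * C * mtbf)

def waste(C, mtbf):
    """First-order fraction of time lost to checkpointing and re-execution
    when checkpointing every optimal_period(C, mtbf) seconds."""
    T = optimal_period(C, mtbf)
    return C / T + T / (2 * mtbf)

# Illustrative numbers: one node fails every 10 years, a checkpoint takes 600 s.
node_mtbf = 10 * 365 * 24 * 3600.0
for nodes in (10_000, 10_000_000):        # today vs. 1000x the parallelism
    platform_mtbf = node_mtbf / nodes     # failures arrive `nodes` times faster
    print(nodes, round(waste(600.0, platform_mtbf), 3))
```

The platform MTBF shrinks linearly with the number of nodes, so the wasted fraction blows up; a first-order value above 1 just means the platform spends more time on fault handling than on useful progress.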
-
Fault-tolerance techniques
• Rollback recovery strategies: all processors periodically stop computing and checkpoint (save the state of the parallel application onto resilient storage).
• Coordinated checkpointing
  + No need to log messages
  − All processors need to roll back
  − I/O congestion
• Uncoordinated checkpointing
  − Need to log messages
  − Slows down failure-free execution and increases checkpoint size/time
  + Faster re-execution with logged messages
• Hierarchical checkpointing
  − Need to log inter-group messages
  + Only processors from the failed group need to roll back
  + Faster re-execution with logged messages
  + Rumor: scales well to very large platforms
-
Replication
Model
• A parallel application comprising n (sequential) processes
• Each process is replicated g ≥ 2 times
• A processing element executes a single replica
• The application fails when all replicas in one replica group have been hit by failures
[Figure: the n replica groups 1, 2, ..., i, ..., n]
Objective
• Show when replication is beneficial compared to periodic checkpointing alone
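A minimal Monte Carlo sketch of this model (the uniform-failure assumption and all numbers are mine, not the slide's): with g = 2, it takes on the order of √n failures, not a single one, before some group loses both of its replicas:

```python
import random

def failures_before_crash(n, g=2, rng=random):
    """Strike uniform-random live replicas with failures until one of the n
    replica groups has lost all g of its copies (the application then fails)."""
    alive = [g] * n
    total = n * g
    hits = 0
    while True:
        k = rng.randrange(total)   # model assumption: failures hit live replicas uniformly
        group = 0
        while k >= alive[group]:   # locate the struck group
            k -= alive[group]
            group += 1
        alive[group] -= 1
        total -= 1
        hits += 1
        if alive[group] == 0:
            return hits

rng = random.Random(0)
trials = [failures_before_crash(1000, rng=rng) for _ in range(200)]
mean = sum(trials) / len(trials)   # grows like sqrt(n), not like 1
```

This is the birthday-problem effect that makes replication attractive despite halving the useful compute power.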
-
Prediction
• Predictor (recall, precision), window-based predictions
• Predictions must be provided at least Cp seconds in advance
[Figures: execution timelines for the regular mode, a prediction without failure, and a prediction with failure, built from checkpoints C, proactive checkpoints Cp, downtime D and recoveries R over periods of length T]
Objective
• Characterize when prediction is useful.
-
Kinds of errors
Hard errors
• Easy to detect
• Easy to localize and characterize
• Expensive to correct
Soft errors
• Hard to detect
• Hard to localize and characterize
• Easy to correct (sometimes)
-
Silent errors
How to spot them
• Add some redundancy
• Error detecting codes
• Selective reliability
How to face them
• Majority vote among the replicas
• Error correcting codes
• Checkpoint recovery
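As an illustration of the "error detecting codes" bullet, here is a minimal ABFT-style checksum sketch (the function name and tolerance are made up): a silent corruption of a matrix-vector product breaks a cheap invariant:

```python
def checksum_ok(A, x, y, tol=1e-9):
    """ABFT-style check for y = A x: the sum of y's entries must equal the dot
    product of A's column sums with x; a silent flip in y breaks the invariant."""
    col_sums = [sum(col) for col in zip(*A)]
    expected = sum(s * xi for s, xi in zip(col_sums, x))
    return abs(sum(y) - expected) <= tol * max(1.0, abs(expected))

A = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 1.0]
y = [sum(a * b for a, b in zip(row, x)) for row in A]   # y = [3.0, 7.0]
ok_before = checksum_ok(A, x, y)    # True: the computation is consistent
y[0] += 0.5                         # inject a silent corruption
ok_after = checksum_ok(A, x, y)     # False: the checksum spots it
```

The check costs one extra dot product, which is why such redundancy is cheap compared to full replication.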
-
Finding the best trade-off
Let us consider an iterative method.
• Correction at each step:
  • increases the cost of a single iteration
  • no time wasted on checkpoints
  • good for low error rates
• Checkpointing + detection at each step:
  • small overhead at each iteration (detection)
  • periodic time loss for checkpointing
  • the checkpoint interval can be tailored to the error rate
Solution: combine the two techniques.
-
Dealing with verifications
It is not always possible to use error detection/correction codes at each step. What if we still want to use checkpoints and recoveries?
Problem
• We don't know when the error occurred
• We don't know if the last checkpoint is valid
We need a verification mechanism to check that there were no silent errors in previous computations and that the checkpoints are correct. But this has a cost!
-
Checkpoints and Verifications
We assume there are no errors during checkpoints (fewer error sources when doing I/O).
Simple approach: perform a verification before each checkpoint to eliminate the risk of corrupted data:
w V C w V C w V C w V C
Is this better?
w C w V C w C w V C w C
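Under a simple model (my assumption, not the slide's: silent errors hit the work at Poisson rate λ, none during V or C), the expected time to obtain one verified checkpoint with the pattern w V C has a closed form, which a quick simulation confirms:

```python
import math, random

def expected_period_time(w, V, C, R, lam):
    """Expected time to get one verified checkpoint with the pattern  w V C,
    when silent errors hit the work at Poisson rate lam (none during V/C)."""
    p = math.exp(-lam * w)        # probability the work is error-free
    retries = 1 / p - 1           # expected number of failed attempts
    return retries * (w + V + R) + (w + V + C)

def simulate_period(w, V, C, R, lam, rng):
    t = 0.0
    while True:
        t += w + V                # do the work, then verify
        if rng.random() < math.exp(-lam * w):
            return t + C          # verification passed: checkpoint
        t += R                    # error detected: recover and redo

rng = random.Random(42)
w, V, C, R, lam = 100.0, 2.0, 5.0, 8.0, 0.002
mc = sum(simulate_period(w, V, C, R, lam, rng) for _ in range(20000)) / 20000
print(expected_period_time(w, V, C, R, lam), mc)   # the two should be close
```

Evaluating this closed form for different amounts of work per verification is exactly how the two patterns above can be compared.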
-
With k checkpoints and one verification
With multiple checkpoints, the problem is to find when the error occurred:
V C w C w C w C w C w V R V R V R V
(when the verification fails, we recover from checkpoints further and further back, re-verifying each time, until we find the last valid one)
Solution
• The problem is very similar with k verifications and one checkpoint
• With constant C, V and R, we can find an optimal solution to this problem (i.e., one that minimizes the expected execution time).
-
What about DAGs?
Let us consider a Directed Acyclic Graph (DAG) where:
• Nodes represent tasks
• Edges correspond to precedence constraints
We make several important assumptions on this model:
• All tasks are executed by all the p processors (which amounts to linearizing the task graph and executing all tasks sequentially)
• Each task has its own indivisible work of size w
Problem: where do we have to place the checkpoints and the verifications in order to minimize the expected time to execute all the tasks?
-
Starting with simple graphs
We have analytical formulas to compute the expected time to successfully execute each of these graphs.
• We can find the optimal expected time to successfully execute the fork graph and the linear chain using a polynomial dynamic programming algorithm.
• The join is probably NP-complete because of the combinatorial explosion of the possibilities.
[Figures: a fork graph, a join graph and a linear chain over tasks T0, T1, ..., Ti, ..., Tn]
Future work: investigate the optimal checkpointing and verification problem for general DAGs.
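For the linear chain, the polynomial dynamic program mentioned above can be sketched as follows. The segment-cost model is my assumption, consistent with the Poisson silent-error model: a segment of work W followed by V and C is retried until its verification passes:

```python
import math
from itertools import accumulate

def segment_cost(W, V, C, R, lam):
    """Expected time to push work W through one verified checkpoint
    (Poisson silent errors at rate lam on the work, none during V/C/R)."""
    e = math.exp(lam * W)
    return e * (W + V) + (e - 1) * R + C

def best_placement(w, V, C, R, lam):
    """O(n^2) DP over a linear chain: after which tasks to verify+checkpoint."""
    n = len(w)
    pre = [0.0] + list(accumulate(w))        # prefix sums of work
    best = [0.0] + [math.inf] * n
    cut = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            cand = best[i] + segment_cost(pre[j] - pre[i], V, C, R, lam)
            if cand < best[j]:
                best[j], cut[j] = cand, i
    pos, j = [], n                           # recover the checkpoint positions
    while j > 0:
        pos.append(j)
        j = cut[j]
    return best[n], sorted(pos)

cost, checkpoints = best_placement([10.0] * 12, 1.0, 3.0, 5.0, 0.01)
```

The optimum trades re-execution risk (long segments) against checkpoint overhead (short segments), just as in the trade-off slide.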
-
Memory
Another concern: bandwidth to memory.
240 GB/s
When the system grows 10 times, bandwidth to memory should grow 20 times!
Since we are not good at architecture, we focus on algorithms.
-
Pebble Game
[Figure: a DAG whose vertices carry pebble counts 0/3, 0/2, 0/4 and 0/1]
Two moves:
• Add a pebble on a vertex.
• Remove a pebble from a vertex.
One rule:
• To add a pebble on a vertex, each of its predecessors must hold a number of pebbles equal to its own weight.
One goal:
• Every vertex must be filled at least once, and the maximum number of pebbles in use must be minimized.
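The game is easy to mechanize; this sketch replays a move sequence and tracks the peak number of pebbles. The 4-vertex instance is a guess in the spirit of the slides (weights 3, 2, 4, 1), not the exact graph shown:

```python
def play(preds, weight, moves):
    """Replay a pebble-game move sequence and return the peak number of
    pebbles simultaneously in use.  preds maps a vertex to its predecessors;
    moves are ('add', v) or ('remove', v)."""
    pebbles = {v: 0 for v in weight}
    filled = set()
    peak = 0
    for op, v in moves:
        if op == 'add':
            # the rule: every predecessor must currently hold its full weight
            assert all(pebbles[u] == weight[u] for u in preds[v]), f"illegal add on {v}"
            pebbles[v] += 1
        else:
            pebbles[v] -= 1
        if pebbles[v] == weight[v]:
            filled.add(v)
        peak = max(peak, sum(pebbles.values()))
    assert filled == set(weight), "every vertex must be filled at least once"
    return peak

# A guessed instance: a (3) -> c (4), b (2) -> d (1)
preds = {'a': [], 'b': [], 'c': ['a'], 'd': ['b']}
weight = {'a': 3, 'b': 2, 'c': 4, 'd': 1}
moves = ([('add', 'a')] * 3 + [('add', 'c')] * 4   # fill a, then c: 7 pebbles in use
         + [('remove', 'c')] * 4                   # free c's pebbles
         + [('add', 'b')] * 2 + [('add', 'd')])    # fill b, then d
peak = play(preds, weight, moves)                  # peak == 7
```

On this instance the run peaks at 7 pebbles, matching the maximum reached in the slides' animation.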
-
[Animation: an example run reaches a peak of 7 pebbles in use: fill the weight-3 vertex (3 pebbles), then the weight-4 vertex (7 pebbles in total, the maximum), empty the weight-4 vertex, then fill the weight-2 vertex and finally the weight-1 vertex.]
-
Another model
Definition. Let G be a DAG with weighted edges and vertices, and π a topological order.
• We define Me(π, x) (memory edges) as the set of edges e_uv such that π(u) < π(x) ≤ π(v)
• We call the cost of π at vertex v the value
Cost(π, v) = w(v) + Σ_{u ∈ N+(v)} c(e_vu) + Σ_{e_ux ∈ Me(π, v)} c(e_ux)
• We define the cost of an order as:
Cost(π) = max{Cost(π, v), v ∈ G}
Our goal: minimize Cost(π)
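A direct transcription of this definition (the edge costs and the example diamond graph are made up for illustration):

```python
def cost(order, w, c):
    """Cost(pi) from the definition above: while v is processed we hold v's own
    weight w[v], the costs of v's outgoing edges, and the costs of the memory
    edges e_ux with pi(u) < pi(v) <= pi(x).  c maps (u, x) to the edge cost."""
    pos = {v: i for i, v in enumerate(order)}
    peak = 0
    for v in order:
        outputs = sum(cv for (u, x), cv in c.items() if u == v)
        crossing = sum(cv for (u, x), cv in c.items()
                       if pos[u] < pos[v] <= pos[x])
        peak = max(peak, w[v] + outputs + crossing)
    return peak

# A small diamond a -> {b, c} -> d with unit vertex weights (values are made up):
w = {'a': 1, 'b': 1, 'c': 1, 'd': 1}
c = {('a', 'b'): 2, ('a', 'c'): 3, ('b', 'd'): 1, ('c', 'd'): 1}
print(cost(['a', 'b', 'c', 'd'], w, c))   # 7: the peak is reached while processing b
```

Note that the edge a→c is charged while b is processed: it has been produced but not yet consumed, which is exactly what the memory edges Me(π, v) capture.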
-
Another model
[Figures: the sets of already-processed vertices and unprocessed vertices, shown before, during, and after the processing of v]
-
Energy
One last problem: Energy.
[Figures: power densities of 2 W/cm² and 80 W/cm²; Titan draws 8.2 MW]
Thermal wall: we cannot increase the clock frequency of a chip: it would melt.
-
Speed Scaling
One can modify the execution speed f of any task, f ∈ [fmin, fmax].
Let Ti, of weight wi, be executed on processor pj at speed fi:
[Figure: on pj's timeline, Ti occupies a slot of length Exe(wi, fi) at speed fi]
-
The energy consumption of the execution of task Ti at speed fi:
Ei(fi) = Exe(wi, fi) · fi³ = wi · fi²
→ (dynamic part of the classical energy model)
-
Unfortunately, there are some more drawbacks (reliability):
[Figure: the reliability Ri(fi) as a function of the speed fi, with the reference point Ri(frel) at speed frel]
Ri(fi) ≈ 1 − λ0 · e^(−d·fi) · Exe(wi, fi)
-
A solution: two executions!
Ri = 1 − (1 − Ri(fi^(1))) · (1 − Ri(fi^(2)))
[Figure: the two replicas Ti^(1) at speed fi^(1) on p1 and Ti^(2) at speed fi^(2) on p2, both completing by time ti]
Energy consumption with two executions:
Ei = wi · (fi^(1))² + wi · (fi^(2))²
[Figure: the energy curve Ei(fi), with Ei(frel) marked at frel, the point frel/√2, and the two-execution energy wi·fi² + wi·fi² = 2Ei(fi)]
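A quick numeric check of the frel/√2 trick (λ0 and d are illustrative values, not taken from the slides): two executions at frel/√2 cost exactly the energy of one execution at frel, with a higher combined reliability:

```python
import math

def energy(w, f):
    """Dynamic energy of one execution: E_i(f) = Exe(w, f) * f^3 = w * f^2."""
    return w * f * f

def reliability(w, f, lam0=1e-5, d=2.0):
    """Reliability model from the slides, R_i(f) ~ 1 - lam0 * e^(-d f) * Exe(w, f),
    with Exe(w, f) = w / f.  lam0 and d are illustrative values."""
    return 1.0 - lam0 * math.exp(-d * f) * (w / f)

w, f_rel = 100.0, 1.0
single = energy(w, f_rel)                       # one execution at f_rel
double = 2 * energy(w, f_rel / math.sqrt(2))    # two executions at f_rel / sqrt(2)
# 2 * w * (f/sqrt(2))^2 = w * f^2: the two slow copies cost exactly the same energy,
r1 = reliability(w, f_rel / math.sqrt(2))
combined = 1 - (1 - r1) ** 2                    # ...and the combined reliability is higher.
```

Each slow copy is individually less reliable (the e^(−d·f) term grows as f drops), but the product of the two failure probabilities more than compensates.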
-
To sum up
We need to find, for each task:
• the number of executions (one or two)
• their speeds
• their mapping (processor)
in order to minimize the energy consumption under the constraints:
• ∀i, ti ≤ D (bounded makespan)
• ∀i, Ri(Ti) ≥ Ri(frel) (minimum reliability)
-
Two kinds of results
Theoretical:
• an FPTAS for linear chains;
• an inapproximability result for independent tasks;
• with a relaxation β on the makespan constraint, we can approximate the optimal solution within 1 + 1/β², for all β ≥ max(2 − 3/(2p+1), 2 − (p+2)/(4p+2)).
But also simulations for general DAGs.
-
Sparse direct solution: main research issues
[Figures: Code Aster, EDF pump, nuclear backup circuit; frequency-domain seismic modeling, Helmholtz equations, SEISCOPE project: a velocity model with depth, dip and cross axes in km and velocities from 3000 to 6000 m/s]
Extrapolation on a 1000 × 1000 × 1000 grid: 55 exaflops, 200 TBytes for the factors, 40 TBytes for the active memory!
Main algorithmic issues
• Parallel algorithmic issues: synchronization avoidance, mapping irregular data structures, scheduling.
• Performance scalability: time, but also memory per processor, when increasing the number of processors (and the problem size).
• Numerical issues: numerical accuracy, hybrid iterative-direct solvers, application-specific (elliptic PDEs) solvers.
-
Execution of malleable task trees
• One of the problems raised by Exascale
• Motivation: linear algebra, sparse matrix factorizations...
• Principle: many processors are available → we can parallelize the tree, but also the tasks
• Difficulty: parallelization is not perfect; the more processors we allocate to a task, the more losses occur
[Figure: a task tree with root 0, children 1-4, and deeper subtrees]
-
• In the model developed (the time to complete a task of length L with p processors is L/p^α, for 0 < α < 1), the makespan-optimal processor allocation to the tree behaves like the distribution of electrical current → a nice structure to work with
• This model ignores some relevant constraints, such as memory limits or granularity: other models are designed to handle them
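For two sibling subtrees this "electrical" structure is explicit (a sketch under the slide's L/p^α model; all numbers are illustrative): equalizing the finish times fixes the ratio p1/p2, like current splitting between parallel branches:

```python
def split(p, L1, L2, alpha):
    """Optimal continuous split of p processors between two parallel subtrees
    when a task of length L on q processors takes L / q**alpha: equalizing the
    finish times L1/p1**alpha = L2/p2**alpha gives p1/p2 = (L1/L2)**(1/alpha)."""
    r = (L1 / L2) ** (1 / alpha)
    p1 = p * r / (1 + r)
    return p1, p - p1

p1, p2 = split(12, 8.0, 1.0, 0.5)
t1 = 8.0 / p1 ** 0.5      # finish time of the heavy subtree
t2 = 1.0 / p2 ** 0.5      # finish time of the light subtree: equal to t1
```

Since each subtree's finish time decreases in its own share, the makespan (the max of the two) is minimized exactly when the two times are equal, which is why the closed-form ratio is optimal for the continuous relaxation.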
-