adaptive runtime system for parallel computing · 2009-06-12 · • tasks only communicate through...

25
KAAPI : Adaptive Runtime System for Parallel Computing Thierry Gautier, [email protected] Bruno Raffin, bruno.raffi[email protected] MOAIS project, INRIA Grenoble Rhône-Alpes

Upload: others

Post on 02-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Adaptive Runtime System for Parallel Computing · 2009-06-12 · • Tasks only communicate through shared objects Automatic scheduling ... 8760 points. (2snd 1459 pts, 3rd 792 pts)

KAAPI :Adaptive Runtime System

for Parallel Computing

Thierry Gautier, [email protected] Raffin, [email protected]

MOAIS project, INRIA Grenoble Rhône-Alpes

Page 2: Adaptive Runtime System for Parallel Computing · 2009-06-12 · • Tasks only communicate through shared objects Automatic scheduling ... 8760 points. (2snd 1459 pts, 3rd 792 pts)

Workshop INRIA-Illinois, 2009/06/9-12 MOAIS project

Moais Projecthttp://moais.imag.fr

• Leader

• Jean-Louis Roch

• 10 Members

• Vincent Danjean, Pierre-François Dutot, Thierry Gautier, Guillaume Huard, Grégory Mounié, Clément Pernet, Bruno Raffin, Denis Trystram, Frédéric Wagner

• About 20 PhD students

Page 3: Adaptive Runtime System for Parallel Computing · 2009-06-12 · • Tasks only communicate through shared objects Automatic scheduling ... 8760 points. (2snd 1459 pts, 3rd 792 pts)

Workshop INRIA-Illinois, 2009/06/9-12 MOAIS project

To mutually adapt application and scheduling

Moais Positioning

GridCluster

MulticoreGPU

MPSoC

Page 4: Adaptive Runtime System for Parallel Computing · 2009-06-12 · • Tasks only communicate through shared objects Automatic scheduling ... 8760 points. (2snd 1459 pts, 3rd 792 pts)

Workshop INRIA-Illinois, 2009/06/9-12 MOAIS project

KAAPI Overview

Application

KAAPI middleware

system

Model: abstract representation

Algorithms: scheduling, fault tolerance protocol, ...

“causal connexions”

Perf

orm

ance

Page 5: Adaptive Runtime System for Parallel Computing · 2009-06-12 · • Tasks only communicate through shared objects Automatic scheduling ... 8760 points. (2snd 1459 pts, 3rd 792 pts)

API

Global address space• Creation of objects in a global address space with ‘shared’ type

Task• Creation with ‘Fork’ keyword (~!Cilk spawn)

• Tasks only communicate through shared objects

Automatic scheduling• work stealing or graph partitioning

‘Sequential’ semantics

similar to TBB/Cilk but with data flow dependencies

Page 6: Adaptive Runtime System for Parallel Computing · 2009-06-12 · • Tasks only communicate through shared objects Automatic scheduling ... 8760 points. (2snd 1459 pts, 3rd 792 pts)

C++ Elision

struct Fibonacci {

void operator()( int n, a1::Shared_w<int> result ) ! {

! ! if (n < 2) result.write( n ); ! ! else {

! ! ! a1::Shared<int> subresult1; ! ! ! a1::Shared<int> subresult2; ! ! ! a1::Fork<Fibonacci>()(n-1, subresult1);

! ! ! a1::Fork<Fibonacci>()(n-2, subresult2); ! ! ! a1::Fork<Sum>()(result, subresult1, subresult2); ! ! }

! }

};

struct Sum {

void operator()(! a1::Shared_w<int> result, ! ! ! ! a1::Shared_r<int> sr1, ! ! ! ! a1::Shared_r<int> sr2 )

! { result.write( sr1.read() + sr2.read() ); }

}

Page 7: Adaptive Runtime System for Parallel Computing · 2009-06-12 · • Tasks only communicate through shared objects Automatic scheduling ... 8760 points. (2snd 1459 pts, 3rd 792 pts)

struct Fibonacci {

void operator()( int n, a1::Shared_w<int&>result ) ! {

! ! if (n < 2) result = n ;

! ! else {

! ! ! a1::Shared<int> subresult1; ! ! ! a1::Shared<int> subresult2; ! ! ! a1::Fork<Fibonacci>()(n-1, subresult1);

! ! ! a1::Fork<Fibonacci>()(n-2, subresult2); ! ! ! a1::Fork<Sum>()(result, subresult1, subresult2); ! ! }

! }

};

struct Sum {

void operator()(! a1::Shared_w<int&>result, ! ! ! ! a1::Shared_w<int >sr1, ! ! ! ! a1::Shared_w<int >sr2 )

! { result.w=rite(sr1.read() + sr2.read() ); } }

C++ Elision

Page 8: Adaptive Runtime System for Parallel Computing · 2009-06-12 · • Tasks only communicate through shared objects Automatic scheduling ... 8760 points. (2snd 1459 pts, 3rd 792 pts)

Abstract Representation

result

Fibonacci

Time

result

Sum

Fibonacci

subres2

Fibonacci

subres1

result

Sum

Fibonacci

subres2

Sum

subres1

Fibonacci

subres1.1

Fibonacci

subres1.2

Page 9: Adaptive Runtime System for Parallel Computing · 2009-06-12 · • Tasks only communicate through shared objects Automatic scheduling ... 8760 points. (2snd 1459 pts, 3rd 792 pts)

Workshop INRIA-Illinois, 2009/06/9-12 MOAIS project

KAAPI Scheduler

2 Level SchedulingK-Thread

CPU

OS scheduler

CPUOS CPU

OS scheduler

K-Processor

processprocess

other process

Active Message over TCP/IP, Myrinet

and SSH

Page 10: Adaptive Runtime System for Parallel Computing · 2009-06-12 · • Tasks only communicate through shared objects Automatic scheduling ... 8760 points. (2snd 1459 pts, 3rd 792 pts)

Workshop INRIA-Illinois, 2009/06/9-12 MOAIS project

• Notations

• Ts : Sequential work, time of sequential execution

• T1 : Time of the parallel algorithm on 1 core

• D: Critical Path

• P: Number of cores

• Properties

• with high probability, number of steals is

O(P x D)

• with high probability, execution time is

Tp " T1 / P + O(D)

~ Also similar bound of Cilk’ extension with Rabin et al.

Performance Guarantee

Page 11: Adaptive Runtime System for Parallel Computing · 2009-06-12 · • Tasks only communicate through shared objects Automatic scheduling ... 8760 points. (2snd 1459 pts, 3rd 792 pts)

Workshop INRIA-Illinois, 2009/06/9-12 MOAIS project

Comparison with Cilk/TBB

• 8 processors NUMA machine• STL Transform, Ratio Tstl / Tlibrary on 8 cores

0

1

2

3

4

5

6

7

0 50000 100000 150000 200000 250000 300000

TS

TL /

TLib

rary

Size

STL transform

X-KaapiTBBCilk

Page 12: Adaptive Runtime System for Parallel Computing · 2009-06-12 · • Tasks only communicate through shared objects Automatic scheduling ... 8760 points. (2snd 1459 pts, 3rd 792 pts)

Workshop INRIA-Illinois, 2009/06/9-12 MOAIS project

Comparison with Cilk/TBB

• 8 processors NUMA machine• STL Merge, Ratio Tstl / Tlibrary on 8 cores

0

1

2

3

4

5

6

7

0 50000 100000 150000 200000 250000 300000

TS

TL /

TL

ibra

ry

Size

STL Merge

X-KaapiTBBCilk

Page 13: Adaptive Runtime System for Parallel Computing · 2009-06-12 · • Tasks only communicate through shared objects Automatic scheduling ... 8760 points. (2snd 1459 pts, 3rd 792 pts)

Workshop INRIA-Illinois, 2009/06/9-12 MOAIS project

Grid Experiments

• QAP, Q3AP, NQueens• well suited for work stealing scheduling

• Plugtest Contest Grid@Works• 2007: Grid 5000 (France)

• 1rst rank,

• NQueens N=23, 35 minutes 7s, 3654 cores

• 2008: Grid5000 (2709 cores) + Intrigger (Japan, 900 cores) • 1rst rank, 8760 points. (2snd 1459 pts, 3rd 792 pts)

• Super Quant Monte Carlo option pricing application

Page 14: Adaptive Runtime System for Parallel Computing · 2009-06-12 · • Tasks only communicate through shared objects Automatic scheduling ... 8760 points. (2snd 1459 pts, 3rd 792 pts)

Workshop INRIA-Illinois, 2009/06/9-12 MOAIS project

Iterative Application

• Scheduling by graph partitioning

• Metis / Scotch

Application

Page 15: Adaptive Runtime System for Parallel Computing · 2009-06-12 · • Tasks only communicate through shared objects Automatic scheduling ... 8760 points. (2snd 1459 pts, 3rd 792 pts)

Workshop INRIA-Illinois, 2009/06/9-12 MOAIS project

Experiments

• Finite Difference Kernel• Kaapi / C++ code versus Fortran MPI code

• Constant size sub domain D per processor

• Cluster : N processors on a cluster

• Grid : N/4 processors per cluster, 4 clusters

D=256^3 # processors Cluster (s) Grid (s) Overhead

KAAPI1 0.49 0.49 -64 0.55 0.84 0,53128 0.65 0.91 0,4

MPI1 0.44 0.44 -64 0.66 2.02 2,06128 0.68 1.57 1,31

Page 16: Adaptive Runtime System for Parallel Computing · 2009-06-12 · • Tasks only communicate through shared objects Automatic scheduling ... 8760 points. (2snd 1459 pts, 3rd 792 pts)

Workshop INRIA-Illinois, 2009/06/9-12 MOAIS project

Optimizing MPI code

• Overlapping communication by computation• At the cost of important code restructuring

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

16+16 32+32 64+64

Me

an

tim

e f

or

an

ite

ratio

n (

s)

Nb proc

256^3/proc between Rennes and Bordeaux

kaapi!optsendrecv!ompiirecvisend!ompiasync!ompi

KAAPI automatically reschedules computation and communication tasks

Page 17: Adaptive Runtime System for Parallel Computing · 2009-06-12 · • Tasks only communicate through shared objects Automatic scheduling ... 8760 points. (2snd 1459 pts, 3rd 792 pts)

Workshop INRIA-Illinois, 2009/06/9-12 MOAIS project

Fault Tolerance

• State of application = state of the data flow graph

• Two specialized protocols

• TIC: Theft Induced Checkpointin

• Periodic checkpoint + forced checkpoint on steal

• CCK: for iterative applications

• Coherent checkpoints

• only recovery of failed process + !application

Page 18: Adaptive Runtime System for Parallel Computing · 2009-06-12 · • Tasks only communicate through shared objects Automatic scheduling ... 8760 points. (2snd 1459 pts, 3rd 792 pts)

Workshop INRIA-Illinois, 2009/06/9-12 MOAIS project

• Implemented using distributed checkpoint services

• two checkpointing periods

• max overhead observed: 0.9%

• TIC: overhead increases as the number of processors increases

0

0,225

0,450

0,675

0,900

20 40 60 120

CIC (period=1s)CIC (period=20s)

Ove

rhea

d (

%)

Protocol Scalability

#Processors

Page 19: Adaptive Runtime System for Parallel Computing · 2009-06-12 · • Tasks only communicate through shared objects Automatic scheduling ... 8760 points. (2snd 1459 pts, 3rd 792 pts)

Workshop INRIA-Illinois, 2009/06/9-12 MOAIS project

Comparison with Satin

• 32 processors, synthetic recursive app.

Page 20: Adaptive Runtime System for Parallel Computing · 2009-06-12 · • Tasks only communicate through shared objects Automatic scheduling ... 8760 points. (2snd 1459 pts, 3rd 792 pts)

Workshop INRIA-Illinois, 2009/06/9-12 MOAIS project

Physics Simulation

• SOFA: real-time physics engine

• Strongly supported INRIA initiative

• Open Source:

http://www-sofa-framework.org

• Target application:

Surgery simulation

Page 21: Adaptive Runtime System for Parallel Computing · 2009-06-12 · • Tasks only communicate through shared objects Automatic scheduling ... 8760 points. (2snd 1459 pts, 3rd 792 pts)

Workshop INRIA-Illinois, 2009/06/9-12 MOAIS project

Multi CPU/GPU SOFA

• SOFA: 2 levels of parallelization

• KAAPI: graph partitioning and work stealing

• Nvidia Cuda

• On-going: work stealing between CPUs and GPUs

Page 22: Adaptive Runtime System for Parallel Computing · 2009-06-12 · • Tasks only communicate through shared objects Automatic scheduling ... 8760 points. (2snd 1459 pts, 3rd 792 pts)

Workshop INRIA-Illinois, 2009/06/9-12 MOAIS project

SOFA

Page 23: Adaptive Runtime System for Parallel Computing · 2009-06-12 · • Tasks only communicate through shared objects Automatic scheduling ... 8760 points. (2snd 1459 pts, 3rd 792 pts)

Workshop INRIA-Illinois, 2009/06/9-12 MOAIS project

Oblivious Algorithms

• Cache oblivious algorithms

• Irregular meshes: 2-20x on CPU, 1.2-2.7x on GPU

• On-going work: cache oblivious + adapted work stealing strategy

Page 24: Adaptive Runtime System for Parallel Computing · 2009-06-12 · • Tasks only communicate through shared objects Automatic scheduling ... 8760 points. (2snd 1459 pts, 3rd 792 pts)

Workshop INRIA-Illinois, 2009/06/9-12 MOAIS project

Conclusions

• KAAPI: flexible framework for parallel programming and fine scheduling control:

• work stealing : recursive computation or local scheduling

• graph partitioning : iterative application

• Data dependency graph:

• used for scheduling or fault tolerance protocols

• On going work on hybrid architectures and large scale machines (BlueGene)

Page 25: Adaptive Runtime System for Parallel Computing · 2009-06-12 · • Tasks only communicate through shared objects Automatic scheduling ... 8760 points. (2snd 1459 pts, 3rd 792 pts)

Workshop INRIA-Illinois, 2009/06/9-12 MOAIS project

Questions?

• http://kaapi.gforge.inria.fr

• http://www-sofa-framework.org

• http://moais.imag.fr