StarPU: a runtime system for Scheduling Tasks, or How to get portable performance on accelerator-based platforms without the agonizing pain

NVIDIA GPU Technology Conference – San Jose (USA) – September 2010

Cédric Augonnet, Samuel Thibault, Raymond Namyst
INRIA Bordeaux, LaBRI, University of Bordeaux

TRANSCRIPT

Page 1:

StarPU: a runtime system for Scheduling Tasks, or How to get portable performance on accelerator-based platforms without the agonizing pain

NVIDIA GPU Technology Conference – San Jose (USA) – September 2010

Cédric Augonnet, Samuel Thibault, Raymond Namyst

INRIA Bordeaux, LaBRI, University of Bordeaux

Page 2:

Once upon a time in computer architecture ...

Page 3:

Introduction: Programming accelerator-based machines

• Prehistory (< 2007)
  • SMP machines (1960s!)
  • NUMA architectures (1970s)
  • Vector machines (1980s-90s)
  • Multicore chips (2000s)
• GPGPU for the masses? (from 2007)
  • Are CPUs now deprecated?
  • Rewrite all codes for accelerators
  • Pure offloading model

Page 4:

Introduction: Programming accelerator-based machines

• Pure offloading model: CUDA 1.x (2007 – 2008)
  – Synchronous cards
• Ignore CPUs
• Concentrate on efficient kernels
  – Complex memory accesses
  – CUDA heroes (e.g. V. Volkov)
• Port standard libraries to CUDA
  – CUBLAS
  – CUFFT
  – GPUCV ...

[Diagram: the CPU pre-processes the input, uploads it to the GPU, the GPU computes, the result is downloaded, and the CPU post-processes it]

Page 5:

Introduction: Programming accelerator-based machines

• Multi-GPU era: CUDA 2.x (2008 – 2009)
  – Asynchronous transfers
  – S1070 servers
• Still ignore CPUs (in general)
• Suitable for regular applications
  – Massively parallel problems
  – Use previously written kernels
• New problems
  – Parallel programming for real
  – PCI bus = bottleneck
  – Pre/post-processing is costly

[Diagram: the CPU distributes work across several GPUs and gathers the results]

Page 6:

Introduction: Programming accelerator-based machines

• GPU computing era: CUDA 3.x (2009 – ?)
  – End of GPGPU
  – Hybrid machines
• Tightly coupled CPUs and GPUs
  – Take advantage of all resources
  – MUCH more complicated
• Load balancing
  – Who does what?
  – Heterogeneous capabilities
• Data management
  – Numerous data transfers
  – Fully asynchronous model

[Diagram: two hybrid nodes, each with four CPUs sharing main memory and two GPUs with their own memories]

Page 7:

Introduction: Challenging issues at all stages

• Applications
  • Programming paradigm
  • BLAS kernels, FFT, …
• Compilers
  • Languages
  • Code generation/optimization
• Runtime systems
  • Resource management
  • Task scheduling
• Architecture
  • Memory interconnect

[Diagram: the software stack; HPC applications on top of specific libraries and a compiling environment, then the runtime system, the operating system, and the hardware]

Page 8:

Introduction: Challenging issues at all stages

[Same bullets and stack diagram as the previous slide, now annotated with an expressive interface flowing down from the applications into the runtime system, and execution feedback flowing back up]

Page 9:

Outline

• The StarPU runtime system
• Task scheduling
  • Load balancing
  • Improving data locality
• Evaluation on dense linear algebra algorithms
  • Synthetic “LU” decomposition
  • Mixing PLASMA and MAGMA (Cholesky & QR)
• Scheduling parallel tasks
• Adding support for MPI in StarPU

Page 10:

The StarPU runtime system

Page 11:

The StarPU runtime system: The need for runtime systems

• “Do dynamically what can’t be done statically anymore”
• A library that provides
  • Task scheduling
  • Memory management
• Compilers and libraries generate (graphs of) parallel tasks
• Additional information is welcome!

[Diagram: HPC applications use parallel compilers and parallel libraries, which run on the runtime system, above the operating system, the CPUs and the GPUs]

Page 12:

The StarPU runtime system: Data management library

• StarPU provides a Virtual Shared Memory (VSM) subsystem
  • Weak consistency
  • Replication
  • Single writer
• High-level API
  – Partitioning filters
• Inputs & outputs of tasks = references to VSM data

[Diagram: the same stack, with StarPU between the applications/libraries and the drivers (CUDA, OpenCL) for the CPUs and GPUs]
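The single-writer replication scheme above can be sketched as a tiny coherence model. This is a toy illustration, not StarPU's API: the node names and the `fetch` interface are invented here. Reads create extra valid replicas; a read-write access invalidates every other copy.

```python
# Toy model of single-writer replication in a VSM: replicas of a piece of
# data may exist on several memory nodes; reads replicate on demand, and a
# write invalidates all other replicas so only one node holds a valid copy.

class DataHandle:
    def __init__(self, home_node):
        self.valid = {home_node}      # nodes holding an up-to-date copy
        self.transfers = 0            # count of simulated data movements

    def fetch(self, node, mode):
        if node not in self.valid:    # replicate on demand
            self.transfers += 1
            self.valid.add(node)
        if mode == "RW":              # single writer: drop other replicas
            self.valid = {node}

h = DataHandle("RAM")
h.fetch("GPU0", "R")    # read: RAM and GPU0 both hold valid copies
h.fetch("GPU0", "R")    # cached, no transfer needed
h.fetch("GPU1", "RW")   # write on GPU1 invalidates RAM and GPU0
```

A second read on the same node costs nothing, which is exactly the locality benefit the scheduler later exploits.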

Page 13:

The StarPU runtime system: Task scheduling

• Tasks =
  • Data inputs & outputs
    – References to VSM data
  • Multiple implementations
    – E.g. CUDA + CPU implementations
  • Dependencies with other tasks
  • Scheduling hints
• StarPU provides an open scheduling platform
  • Scheduling algorithm = plug-in

[Diagram: a task f with cpu/gpu/spu implementations accessing data A in read-write mode and B, C in read mode]

Page 14:

The StarPU runtime system: Task scheduling

• Who generates the code?
  • StarPU task = ~function pointers
  • StarPU doesn't generate code
• Programming heroes?
• The libraries era
  • PLASMA + MAGMA
  • FFTW + CUFFT ...
• Rely on compilers
  • PGI accelerators
  • CAPS HMPP ...
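The "task = ~function pointers" idea can be illustrated with plain callables. This is a hypothetical sketch of the concept, not StarPU's real task structure: a task bundles its arguments with one implementation per architecture, and the runtime, not the task, picks which one runs.

```python
# A task carries several implementations of the same kernel; the scheduler
# chooses one depending on the worker it assigns the task to.

def axpy_cpu(a, x, y):
    # CPU implementation of y <- a*x + y
    return [a * xi + yi for xi, yi in zip(x, y)]

def axpy_gpu(a, x, y):
    # stand-in for a CUDA kernel launch (same semantics)
    return [a * xi + yi for xi, yi in zip(x, y)]

task = {"funcs": {"cpu": axpy_cpu, "cuda": axpy_gpu},
        "args": (2, [1, 2], [10, 20])}

def execute(task, worker_arch):
    impl = task["funcs"][worker_arch]   # pick the matching implementation
    return impl(*task["args"])
```

Since both implementations compute the same function, the result is identical whichever worker the scheduler selects.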

Page 15:

The StarPU runtime system: Execution model

[Diagram: the application sits on top of StarPU, which contains the scheduling engine and the memory management (DSM) layer; below are per-core CPU drivers and a GPU driver; data A and B reside in RAM]

Page 16:

The StarPU runtime system: Execution model

• Submit task « A += B »

[Diagram: the application submits the task to StarPU's scheduling engine]

Page 17:

The StarPU runtime system: Execution model

• Schedule task

[Diagram: the scheduling engine assigns the task « A += B » to the GPU driver]

Page 18:

The StarPU runtime system: Execution model

• Fetch data

[Diagram: the memory management layer replicates B into the GPU memory]

Page 19:

The StarPU runtime system: Execution model

• Fetch data

[Diagram: A is replicated into the GPU memory as well]

Page 20:

The StarPU runtime system: Execution model

• Fetch data

[Build of the previous slide: both A and B are now valid in the GPU memory]

Page 21:

The StarPU runtime system: Execution model

• Offload computation

[Diagram: the GPU driver executes « A += B » on the GPU]

Page 22:

The StarPU runtime system: Execution model

[Diagram: the task has completed; A and B appear back in RAM]

Page 23:

Task Scheduling

Page 24:

Why do we need task scheduling? Blocked matrix multiplication

[Chart: speed of a blocked matrix multiplication on 2 Xeon cores, a Quadro FX5800 and a Quadro FX4600]

Things can go (really) wrong even on trivial problems!
• Static mapping?
  – Not portable, too hard for real-life problems
• Need dynamic task scheduling
  – Performance models

Page 25:

Predicting task duration: Load balancing

[Gantt chart: tasks scheduled over time on cpu #1–#3 and gpu #1–#2]

• Task completion time estimation
  • History-based
  • User-defined cost function
  • Parametric cost model
• Can be used to improve scheduling
  • E.g. Heterogeneous Earliest Finish Time (HEFT)
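The HEFT idea mentioned above can be sketched in a few lines: combine each worker's time-to-ready with the performance model's per-worker duration prediction, and assign the task to the worker with the earliest predicted finish. This is only the worker-selection step of HEFT, with made-up numbers, not a full implementation.

```python
# Earliest-finish-time worker selection: the fast GPU wins even when it
# becomes free later than an idle CPU, because its predicted duration for
# this task is much shorter.

def pick_worker(ready, predicted):
    # ready[w]: when worker w becomes free; predicted[w]: task duration on w
    finish = {w: ready[w] + predicted[w] for w in ready}
    return min(finish, key=finish.get)

ready     = {"cpu0": 5.0,  "cpu1": 0.0,  "gpu0": 8.0}
predicted = {"cpu0": 20.0, "cpu1": 20.0, "gpu0": 2.0}
```

Here the GPU finishes at 8 + 2 = 10 versus 20 for the idle CPU, so the task goes to the GPU; with a long enough GPU backlog the decision flips back to a CPU, which is how heterogeneity gets balanced.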

Page 26:

Predicting task duration: Load balancing

[Build of the previous slide: same bullets and Gantt chart]

Page 27:

Predicting task duration: Load balancing

[Build of the previous slide: same bullets and Gantt chart]

Page 28:

Predicting task duration: Load balancing

[Build of the previous slide: same bullets and Gantt chart]

Page 29:

Predicting data transfer overhead: Motivations

• Hybrid platforms
  • Multicore CPUs and GPUs
  • The PCI-e bus is a precious resource
• Data locality vs. load balancing
  • Cannot avoid all data transfers
  • Minimize them
• StarPU keeps track of
  • data replicates
  • on-going data movements

[Diagram: a hybrid node with data A and B replicated between the main memory and the GPU memories]

Page 30:

Predicting data transfer overhead: Offline bus benchmarking

• Offline bus benchmarking
  • Runs the first time StarPU is launched
  • Measures bandwidth and latency
    – Stored as files
  • Loaded when StarPU is initialized
• Detect CPU/GPU affinity
  • Control a GPU from the closest CPU
  • Significant impact on bus usage
• Straightforward cost prediction
  • Latency + size / bandwidth
  • Could be improved in many ways

[Same hybrid-node diagram as the previous slide]
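The straightforward cost prediction on this slide is an affine model in the data size, using the latency and bandwidth measured by the offline benchmark. A minimal sketch (the benchmark numbers below are made up for illustration):

```python
# Predicted transfer time = latency + size / bandwidth, with both
# parameters coming from the one-time offline bus benchmark.

def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    return latency_s + size_bytes / bandwidth_bytes_per_s

# e.g. 8 GB over a hypothetical 8 GB/s link with 10 us latency:
t = transfer_time(8e9, 1e-5, 8e9)   # latency is negligible for large data
```

The model is crude (it ignores bus contention, among other things), which is what "could be improved in many ways" refers to, but it is enough to make the scheduler transfer-aware.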

Page 31:

Impact of scheduling policy on a synthetic LU decomposition (without pivoting!)

Page 32:

Scheduling in a hybrid environment: Performance models

• LU without pivoting (16 GB input matrix)
• 8 CPUs (Nehalem) + 3 GPUs (FX5800)

[Bar charts: speed (GFlops, 0–800) and data transfers (GB, 0–60) for the greedy, task model, prefetch and data model scheduling policies]

Page 33:

Scheduling in a hybrid environment: Performance models

[Build of the previous slide: same charts]

Page 34:

Scheduling in a hybrid environment: Performance models

[Build of the previous slide: same charts]

Page 35:

Scheduling in a hybrid environment: Performance models

[Build of the previous slide: same charts]

Page 36:

Mixing PLASMA and MAGMA with StarPU

(in collaboration with UTK)

Cholesky & QR decompositions

Page 37:

Mixing PLASMA and MAGMA with StarPU

• State-of-the-art algorithms
  • PLASMA (multicore CPUs)
    – Dynamically scheduled with QUARK
  • MAGMA (multiple GPUs)
    – Hand-coded data transfers
    – Static task mapping
• General SPLAGMA design
  • Use the PLASMA algorithms with « magnum tiles »
  • PLASMA kernels on CPUs, MAGMA kernels on GPUs
  • Bypass the QUARK scheduler
• Programmability
  • Cholesky: ~half a week
  • QR: ~2 days of work
  • Quick algorithmic prototyping

Page 38:

Mixing PLASMA and MAGMA with StarPU

• Cholesky decomposition
  • 5 CPUs (Nehalem) + 3 GPUs (FX5800)
  • Efficiency > 100%

Page 39:

Mixing PLASMA and MAGMA with StarPU

[Build of the previous slide: same Cholesky results]

Page 40:

Mixing PLASMA and MAGMA with StarPU

• Memory transfers during the Cholesky decomposition

[Chart: ~2.5× fewer transfers]

Page 41:

Mixing PLASMA and MAGMA with StarPU

• QR decomposition
  • Mordor8 (UTK): 16 CPUs (AMD) + 4 GPUs (C1060)

Page 42:

Mixing PLASMA and MAGMA with StarPU

• QR decomposition
  • Mordor8 (UTK): 16 CPUs (AMD) + 4 GPUs (C1060)

[Chart annotated with the MAGMA reference curve]

Page 43:

Mixing PLASMA and MAGMA with StarPU

• QR decomposition
  • Mordor8 (UTK): 16 CPUs (AMD) + 4 GPUs (C1060)

[Chart annotation: adding 12 CPUs brings ~200 GFlops, whereas the peak of 12 cores alone is ~150 GFlops]

Page 44:

Mixing PLASMA and MAGMA with StarPU

• « Super-linear » efficiency in QR?
• Kernel efficiency
  – sgeqrt: CPU 9 GFlops, GPU 30 GFlops (speedup ~3)
  – stsqrt: CPU 12 GFlops, GPU 37 GFlops (speedup ~3)
  – somqr: CPU 8.5 GFlops, GPU 227 GFlops (speedup ~27)
  – sssmqr: CPU 10 GFlops, GPU 285 GFlops (speedup ~28)
• Task distribution observed with StarPU
  – sgeqrt: 20% of the tasks on GPUs
  – sssmqr: 92.5% of the tasks on GPUs
• Taking advantage of heterogeneity!
  – Only do what you are good at
  – Don't do what you are not good at

Page 45:

Scheduling parallel tasks

Page 46:

Parallel tasks: Why do we need parallel tasks?

• Take advantage of multicore architectures
• Task parallelism may not be suited to a fine grain
  • Use other parallel paradigms
    – e.g. OpenMP
  • Use existing parallel libraries
    – e.g. do not reimplement parallel BLAS …
• Alleviate granularity concerns
  • Fewer tasks
  • Tasks large enough for the GPU

Page 47:

Scheduling parallel tasks

• StarPU allocates processing units

[Gantt chart: tasks scheduled over time on cpu #0–#3 and gpu #1–#2]

Page 48:

Scheduling parallel tasks

• StarPU allocates processing units

[Build of the previous slide: a parallel task is spread across several processing units]

Page 49:

Scheduling parallel tasks

• StarPU allocates processing units

[Build of the previous slide]

Page 50:

Scheduling parallel tasks

• StarPU allocates processing units

[Build of the previous slide]

Page 51:

Adding support for MPI in StarPU

Page 52:

Accelerating MPI applications with StarPU

• Keep the MPI SPMD style
  • Static distribution of data
  • Scheduling within the node only
    – No load balancing between MPI processes
• Inter-process data dependencies
  • MPI communications triggered by StarPU data availability
  • Support from StarPU's memory management
    – Automatically construct MPI datatypes

Page 53:

Accelerating MPI applications with StarPU

• Provided API
  • starpu_mpi_{send,recv}
  • starpu_mpi_{isend,irecv}
  • starpu_mpi_{test,wait}
  • starpu_mpi_{send,recv}_detached
  • starpu_mpi_*_array
• Detached calls
  • No need to explicitly test/wait for the request
  • Automatic progression
• Automatic data dependencies
  • MPI transfers ~ StarPU tasks
  • Accelerating legacy codes

Page 54:

Accelerating LU/MPI with StarPU

• LU decomposition
  • MPI + multi-GPU
• Static MPI distribution
  • 2D block-cyclic
  • ~ScaLAPACK
  • No pivoting!
• Algorithmic work required
  • Collaboration with UTK
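The 2D block-cyclic distribution referred to above assigns each tile of the matrix to a rank on a P × Q process grid, so that neighbouring tiles land on different ranks. A sketch of the standard ScaLAPACK-style mapping with zero-based indices (the deck does not give its exact convention):

```python
# 2D block-cyclic ownership: tile (i, j) belongs to the process at grid
# position (i mod P, j mod Q), numbered row-major over a P x Q grid.

def owner(i, j, P, Q):
    return (i % P) * Q + (j % Q)
```

Cycling in both dimensions keeps the per-rank work balanced as the LU factorization shrinks the active trailing submatrix, which is why it approximates ScaLAPACK's distribution.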

Page 55:

Conclusion

Page 56:

Conclusion: Summary

• StarPU
  • Freely available under the LGPL
  • Available on Linux, OS X, Windows
  • Open to external contributors!
• Task scheduling
  • Required on hybrid platforms
  • Auto-tuned performance models
• Combined PLASMA and MAGMA
• Parallel tasks
• MPI extensions

[Software-stack diagram: HPC applications, parallel compilers and libraries, runtime system, operating system, CPUs and GPUs]

Page 57:

Conclusion: Future work

• Implement more algorithms
  • LU, Hessenberg
  • Communication-avoiding algorithms
  • Hybrid ScaLAPACK
• Provide higher-level constructs (e.g. reductions)
• Provide a back-end for compilers
  • StarSs, XcalableMP, HMPP
• Support new architectures
  • Intel SCC, Fermi cards, …
• Dynamically adapt granularity
  • Divisible tasks

[Software-stack diagram]

Page 58:

Conclusion: Future work

[Same slide as the previous one]

Thanks for your attention! Any questions?

Page 59:

Page 60:

Page 61:

Performance models: Our history-based proposition

• Hypothesis
  • Regular applications
  • Execution time independent of the data content
    – Static flow control
• Consequence
  • The data description fully characterizes a task
• Example: matrix-vector product
  – Unique signature: ((1024, 512), 1024, 1024)
  – Per-data signatures
    – CRC(1024, 512) = 0x951ef83b
  – Task signature
    – CRC(CRC(1024, 512), CRC(1024), CRC(1024)) = 0x79df36e2

[Diagram: a 1024 × 512 matrix multiplied by a vector of size 1024, giving a vector of size 1024]

Page 62:

Performance models: Our history-based proposition

• Generalization is easy
  • Task f(D1, … , Dn)
  • Data
    – Signature(Di) = CRC(p1, p2, … , pk)
  • Task ~ series of data
    – Signature(D1, ..., Dn) = CRC(sign(D1), ..., sign(Dn))
• Systematic method
  • Problem-independent
  • Transparent for the programmer
  • Efficient
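The nested-CRC signature scheme can be sketched with `zlib.crc32` standing in for whatever CRC StarPU actually uses, so the concrete hash values will differ from those on the slides; only the structure (hash each data description, then hash the series of hashes) is the point.

```python
import zlib

def data_signature(*params):
    # per-data signature: CRC of the data's descriptive parameters
    return zlib.crc32(repr(params).encode())

def task_signature(*data_sigs):
    # task signature: CRC of the series of per-data signatures
    return zlib.crc32(repr(data_sigs).encode())

# matrix-vector product described by ((1024, 512), 1024, 1024):
sig = task_signature(data_signature(1024, 512),
                     data_signature(1024),
                     data_signature(1024))
```

Two tasks over identically shaped data always get the same signature, so measured timings can be accumulated per signature without any help from the programmer.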

Page 63:

Evaluation: Example: LU decomposition

• Faster
• No code change!
• More stable

Speed (GFlop/s):

             (16k x 16k)     (30k x 30k)
ref.         89.98 ± .297    130.64 ± .166
1st iter     48.31           96.63
2nd iter     103.62          130.23
3rd iter     103.11          133.50
≥ 4 iter     103.92 ± .046   135.90 ± .000

• Dynamic calibration
• Simple, but accurate
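The dynamic calibration behind the table above can be modeled as a per-signature running mean fed with measured execution times: the first iteration runs with a cold model (hence the slow 48.31 GFlop/s), and predictions stabilize once enough samples have been seen. A hypothetical sketch of the mechanism, not StarPU's implementation; the signature string below is invented:

```python
# Per-signature running statistics: feedback() records a measured time,
# predict() returns the running mean, or None while uncalibrated.

class HistoryModel:
    def __init__(self):
        self.stats = {}                 # signature -> (count, mean)

    def feedback(self, sig, measured):
        n, mean = self.stats.get(sig, (0, 0.0))
        self.stats[sig] = (n + 1, mean + (measured - mean) / (n + 1))

    def predict(self, sig):
        entry = self.stats.get(sig)
        return None if entry is None else entry[1]

model = HistoryModel()
for t in (10.0, 20.0):                  # two measured runs of one kernel
    model.feedback("getrf_tile_960", t)
```

An uncalibrated signature returning `None` is what forces the scheduler to fall back to a simpler policy during the first iteration.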