
Challenge the future

Delft University of Technology

Programming Models for Multi-Cores
Ana Lucia Varbanescu, TU Delft / Vrije Universiteit Amsterdam

with acknowledgements to Maik Nijhuis @ VU,
Xavier Martorell @ UPC, and Rosa Badia @ BSC


Outline

An introduction

Programming the Cell/B.E.
  Available models … can we compare them?!

More processors, more models…? CUDA, Brook, TBB, Ct, Sun Studio, …

… or a single standard one? Is the OpenCL standard the solution?

Conclusions


An introduction


The Problem

Cell/B.E. = high performance
Cell/B.E. != programmability

Is there a way to match the two?


Cell/B.E.

1 x PPE: 64-bit PowerPC (L1: 32 KB I$ + 32 KB D$; L2: 512 KB)
8 x SPE cores (LS: 256 KB each; SIMD machines)
Hybrid memory model
Cell blades (QS20/21): 2 x Cell; PS3: 1 x Cell (6 SPEs only)

Thread-based model, push/pull data; thread scheduling done by the user (see the libspe2 sketch below)

Five layers of parallelism:
  Task parallelism (MPMD)
  Data parallelism (SPMD)
  Data streaming parallelism (DMA double buffering)
  Vector parallelism (SIMD, up to 16-way)
  Pipeline parallelism (dual-pipelined SPEs)
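To make the thread-based model concrete, here is a minimal sketch using the libspe2 API from the Cell SDK. The embedded SPE program handle spe_kernel is an assumption, standing in for any SPE binary linked into the executable; for true concurrency across the 8 SPEs, each spe_context_run call would live in its own POSIX thread.

#include <stdio.h>
#include <stdlib.h>
#include <libspe2.h>

extern spe_program_handle_t spe_kernel;  /* assumed embedded SPE binary */

int main(void)
{
    unsigned int entry = SPE_DEFAULT_ENTRY;

    /* Create an SPE context: one schedulable SPE "thread". */
    spe_context_ptr_t ctx = spe_context_create(0, NULL);
    if (ctx == NULL) { perror("spe_context_create"); return EXIT_FAILURE; }

    /* Load the SPE program into the context (destined for the 256 KB LS). */
    if (spe_program_load(ctx, &spe_kernel) != 0) {
        perror("spe_program_load"); return EXIT_FAILURE;
    }

    /* Run it; this PPE thread blocks until the SPE program stops.
       Scheduling contexts onto physical SPEs is the user's job. */
    if (spe_context_run(ctx, &entry, 0, NULL, NULL, NULL) < 0) {
        perror("spe_context_run"); return EXIT_FAILURE;
    }

    spe_context_destroy(ctx);
    return EXIT_SUCCESS;
}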


Programming the Cell/B.E.

A view from the application:
  High-level parallelization => application task graph
  Mapping/scheduling => mapped graph
  In-core optimizations => optimized code for each core

A high-level programming model should "capture" all three aspects of Cell applications!

(Diagram: High-level -> Mapping -> Core-level)


Expressing the task graph

Task definition
  A task is a tuple: <Inputs, Outputs, Computation[, Data-Par]> (a possible C encoding is sketched below)
Task interconnections
  Express top-level application parallelism and data dependencies
Task synchronization
  Allow for barriers and other mechanisms, external to the tasks
Task composition
  Tasks should be able to split/merge with other tasks

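A hypothetical C encoding of the task tuple, purely for illustration; the field names and the successor-edge representation are my assumptions, not taken from any of the models discussed below.

/* Illustrative encoding of <Inputs, Outputs, Computation[, Data-Par]>. */
typedef struct task {
    void  **inputs;                              /* data the task consumes   */
    void  **outputs;                             /* data the task produces   */
    void  (*computation)(void **in, void **out); /* sequential kernel        */
    int     data_par;                            /* optional data-parallel
                                                    degree; 0 = not splittable */
    struct task **successors;                    /* edges: data dependencies  */
    int           n_successors;
} task_t;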


Mapping/scheduling

Task-graph "expansion" (auto)
  Data parallelism and synchronization are transformed into nodes and edges
Application mapping (user-aided)
  All potential mappings should be considered
Mapping optimizations (user-aided)
  Merge/split tasks to fit the target core and to minimize communication
Scheduling (user)
  Establish how to deal with contention at the core level



Core-level

Computation (user)
  Allow the user to write the (sequential) computation code
Core optimizations (user/auto)
  Per-core optimizations (different on the PPE and the SPEs)
Memory access (auto)
  Hide explicit DMA
Optimize DMA (auto)
  Overlap computation with communication (see the double-buffering sketch below)

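This is the kind of code an automated back-end would have to generate for the "optimize DMA" item: a hand-written SPE-side sketch of DMA double buffering with the MFC intrinsics, fetching chunk i+1 while computing on chunk i. CHUNK and process() are illustrative assumptions.

#include <spu_mfcio.h>

#define CHUNK 4096
static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

void process(char *data, int n);  /* user computation, assumed elsewhere */

void stream(unsigned long long ea, int nchunks)
{
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);   /* prefetch first chunk */
    for (int i = 0; i < nchunks; i++) {
        int nxt = cur ^ 1;
        if (i + 1 < nchunks)                   /* start the next transfer */
            mfc_get(buf[nxt], ea + (i + 1) * (unsigned long long)CHUNK,
                    CHUNK, nxt, 0, 0);
        mfc_write_tag_mask(1 << cur);          /* wait only for the current DMA */
        mfc_read_tag_status_all();
        process((char *)buf[cur], CHUNK);      /* compute while nxt streams in */
        cur = nxt;
    }
}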


Extras

Performance estimation
  Application performance should be roughly predictable from the task graph
Warnings and hints
  Better warnings and hints, to replace the standard SDK messages



Available Cell/B.E. programming models

SDK-based models: IDL, ALF
Code-reuse models: MPI (micro-tasks), OpenMP, CellSS
Abstract models: Sequoia, Charm++ and the Offload API, SP@CE
Industry: PeakStream, RapidMind, the MultiCore Framework
Other approaches: MultiGrain Parallelism Scheduling, BlockLib, Sieve++


IDL and the Function-Offload Model

Offloads computation-intensive tasks onto the SPEs

The programmer provides:
  Sequential code to run on the PPE
  SPE implementations of the offloaded functions
  An IDL specification of each function's behaviour

Dynamic scheduling, based on distributed SPE queues


Accelerated Library Framework (ALF)

SPMD applications on a host-accelerator platform

The programmer provides:
  Accelerator libraries: collections of accelerated code
  The application's usage of the accelerator libraries

Runtime scheduling


MPI micro-tasks

An MPI front-end for the Cell/B.E.

The programmer provides:
  An MPI application (see the minimal example below)

A preprocessor generates the application graph with basic tasks
Basic tasks are merged together so that the graph is series-parallel (SP)
The SP graph is mapped automatically
Core-level communication optimizations are automatic
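The model's input is ordinary MPI code; a minimal example of the style the preprocessor would decompose into basic tasks (plain MPI, nothing Cell-specific):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, data = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        data = 42;
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* producer task */
    } else if (rank == 1) {
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                          /* consumer task */
        printf("received %d\n", data);
    }
    MPI_Finalize();
    return 0;
}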


OpenMP

Based on pragmas; enables code re-use (see the example below)

The programmer provides:
  The OpenMP application
  Core-level optimizations
  DMA optimizations

Mapping and scheduling: automated
Most of the work is on the compiler side
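A minimal example of the code-reuse idea: one standard pragma parallelizes an existing loop, and on Cell the compiler and runtime take care of distributing the iterations over the cores.

#include <omp.h>

void scale(float *a, const float *b, float s, int n)
{
    /* Iterations are split across cores; the annotated code is
       otherwise unchanged sequential C. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = s * b[i];
}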


Cell SuperScalar (CellSS)

Very good for quickly porting applications to the Cell/B.E.

The programmer provides:
  A sequential C application
  Pragmas marking the functions to be offloaded (sketched below)
  Additional data-distribution information

Based on a compiler and a run-time system:
  The compiler separates the annotated application into a PPE application and an SPE application
  The runtime system maintains a dynamic data-dependency graph of the active tasks, updating it each time a task starts or ends

Dynamic scheduling, based on the runtime's data-dependency graph
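A sketch in the spirit of CellSS annotations; the directive syntax is approximated from published CellSs examples, so treat the details as assumptions rather than exact SDK syntax. The annotated function is offloaded to an SPE, and the input/inout clauses drive the runtime's dependency tracking.

#define BS 64

/* Approximated CellSs-style annotation (assumption): */
#pragma css task input(a, b) inout(c)
void block_mmul(float a[BS][BS], float b[BS][BS], float c[BS][BS])
{
    /* Plain sequential block multiply; the runtime decides when and
       where it runs, based on the declared data dependencies. */
    for (int i = 0; i < BS; i++)
        for (int j = 0; j < BS; j++)
            for (int k = 0; k < BS; k++)
                c[i][j] += a[i][k] * b[k][j];
}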


Sequoia

High-level abstract model, suitable for divide-and-conquer applications
Uses memory locality as the primary parallelization criterion
Application = a hierarchy of parameterized, recursively decomposed tasks
Tasks run in isolation (data locality)

The programmer provides:
  The application's hierarchical graph
  A mapping of the graph onto the platform
  (Optimized) code for the leaf nodes

A flexible environment for tuning and testing application performance


SP@CE

Dedicated to streaming applications
An application is a collection of kernels that communicate only by data streaming

The programmer provides:
  The application's streaming graph (XML)
  (A library of) optimized kernels for the SPEs

Dynamic scheduling, based on a centralized job queue
A run-time system on the SPEs optimizes away (some of) the communication overhead


Charm++ and the Offload API

An application = a collection of chares
  Chares communicate through messages
  Chares are created and/or destroyed at runtime

A chare has a list of work requests to run on the SPEs
  PPE: uses the Offload API to manage the work requests (data flow, execution, completion)
  SPE: a small runtime system for local management and optimizations

The programmer provides:
  The Charm++ application
  The work requests and their SPE code


RapidMind

Based on "SPMD streaming": tasks are executed on parallelized streams of data

A kernel ("program") is a computation on the elements of a vector
An application is a combination of regular code and RapidMind code; the compiler translates it into PPE code and SPE code

The programmer provides:
  A C++ application
  The computation kernels inside the application

Kernels can execute asynchronously => task parallelism is achievable


MultiCore Framework SDK (Mercury)

A master-worker model focused on data parallelism and data distributions
An application = a manager (on the PPE) and workers (on the SPEs)

Data communication is based on:
  Virtual channels between the manager and the worker(s)
  Data objects, which specify data granularity and distribution
  The elements read/written may differ at the two ends of a channel

The programmer provides:
  C code for the kernels
  The channel interconnections, via read/write operations
  A data-distribution object for each channel

No parallelization support, no core optimizations, no application-level design.


Brief overview


Features - revisited


How to compare performance?

Implement one application from scratch?
  Impractical and very time-consuming
Use an existing benchmark?
  A matrix multiplication is available (the reference computation is sketched below)
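For reference, the computation behind the benchmark is the plain triple loop below; each model's version implements exactly this, differing only in how the work and the data are split across cores.

/* Reference n x n matrix multiplication, row-major layout. */
void mmul(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}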


Performance

See examples …


Are the results relevant?

Only partially!
  MMUL is NOT a good benchmark for high-level programming models
  The results mainly reveal how successful the low-level optimizations are
  The implementations are VERY different
  It is hard to measure computation only
  Data-distribution issues are addressed very differently

Overall, a better approach to performance comparison is needed:
  A benchmark application
  A set of metrics


Still …

Low-level optimizations are not among a programming model's targets => they can (and should) be designed separately and heavily reused

The performance overhead induced by designing and/or implementing in a high-level model decreases with the size of the application

The programming effort spent on SPE optimizations adds a constant factor to the overall implementation effort, independent of the chosen programming model.


Usability


The Answers [1/2]

High-level programming models cover enough features to support application design and implementation at all levels.

Low-level optimizations and high-level algorithm parallelization remain difficult tasks for the programmer.

No single Cell/B.E. programming model can address all application types.

(Diagram, coverage per layer: High-level > 90%; Mapping 0-100%; Core-level > 50%)


The Answers [2/2]

Do the models alleviate the programmability issue? 60%
Do they preserve the high Cell/B.E. performance? 90%
Are they easy to use? 10-90%
Do they allow for automation? 50%

Is there an ideal one? NO


GPU Models

GPGPU used to be fancy: OpenGL, Cg, RapidMind

NVIDIA GPUs
  CUDA is an original HW-SW co-design approach
  Extremely popular
  Considered easy to use

ATI/AMD GPUs
  Originally Brook
  Currently the ATI Stream SDK



OpenCL [1/4]

Currently up and running for: AMD/ATI, IBM, NVIDIA, Apple

Other members of the Khronos consortium to follow: ARM, Intel [?]

See examples …


OpenCL [2/4]

Language Specification
  C-based cross-platform programming interface
  Subset of ISO C99 with language extensions, familiar to developers
  Online or offline compilation and build of compute-kernel executables

Platform Layer API
  A hardware abstraction layer over diverse computational resources
  Query, select, and initialize compute devices
  Create compute contexts and work-queues

Runtime API
  Execute compute kernels
  Manage scheduling, compute, and memory resources

A minimal host-side sketch of all three layers follows.
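The sketch below touches the three layers in order: platform query, context/queue creation, online kernel build, and launch. Error handling is stripped for brevity, which a real program should not do.

#include <CL/cl.h>

static const char *src =
    "__kernel void inc(__global float *x) { x[get_global_id(0)] += 1.0f; }";

void demo(size_t n)
{
    cl_platform_id plat;  cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);               /* platform layer: query */
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL); /* online compile+build */
    cl_kernel k = clCreateKernel(prog, "inc", NULL);

    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                n * sizeof(float), NULL, NULL);
    clSetKernelArg(k, 0, sizeof(buf), &buf);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL,  /* runtime: execute */
                           0, NULL, NULL);
    clFinish(q);

    clReleaseMemObject(buf); clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
}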


OpenCL [3/4] – memory model

A multi-level memory model:
  Private memory is visible only to an individual work-item on the device
  Global memory is visible to all compute units on the device
  Depending on the HW, memory spaces can be collapsed together

Four memory spaces (illustrated in the kernel below):
  Private memory: a single work-item (think registers)
  Local memory: shared by the work-items of a work-group
  Constant memory: stores constant data for read-only access
  Global memory: used by all the compute units on the device
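The kernel below exercises all four address spaces in OpenCL C; the partial dot-product computation is only a vehicle for the qualifiers.

__kernel void dot_chunk(__global const float *a,   /* global: whole device  */
                        __global const float *b,
                        __constant float *scale,   /* constant: read-only   */
                        __local float *partial,    /* local: one work-group */
                        __global float *result)
{
    /* p lives in private memory: visible to this work-item only. */
    float p = a[get_global_id(0)] * b[get_global_id(0)] * scale[0];

    partial[get_local_id(0)] = p;
    barrier(CLK_LOCAL_MEM_FENCE);          /* sync within the work-group */

    if (get_local_id(0) == 0) {            /* one work-item reduces the group */
        float s = 0.0f;
        for (uint i = 0; i < get_local_size(0); i++)
            s += partial[i];
        result[get_group_id(0)] = s;
    }
}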


OpenCL [4/4] – execution model

Compute kernels can be thought of either as data-parallel (a good match for GPUs) or as task-parallel (well matched to the architecture of CPUs).
A compute kernel is the basic unit of executable code, similar to a C function.
Kernel execution can be in-order or out-of-order.
Events let the developer check the status of runtime requests.

The execution domain of a kernel is an N-dimensional computation domain:
  Each element in the execution domain is a work-item
  Work-items can be clustered into work-groups for synchronization and communication

A sketch of launching a 2-D execution domain follows.
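This sketch assumes a kernel k and queue q created as in the earlier host-side example; the 16 x 16 work-group shape is an arbitrary choice for illustration.

#include <CL/cl.h>

void launch_2d(cl_command_queue q, cl_kernel k, size_t w, size_t h)
{
    size_t global[2] = { w, h };   /* N-dimensional computation domain    */
    size_t local[2]  = { 16, 16 }; /* work-group: sync/communication unit */
    cl_event done;

    clEnqueueNDRangeKernel(q, k, 2, NULL, global, local, 0, NULL, &done);
    clWaitForEvents(1, &done);     /* events report request status */
    clReleaseEvent(done);
}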


Conclusions

A multitude of programming models:
  Abundant for the Cell/B.E., due to the original lack of high-level programming support
  Fewer for GPUs, due to CUDA

Simple programming models are key to platform adoption (e.g., CUDA)

Essential features are:
  Tackling *all* the parallelism layers of a platform, both automagically and with user intervention
  Portability
  Ease of use, or at least not a very steep learning curve (C-based works)
  (Control over) performance, and, most of the time, efficiency


Take-home messages

Application parallelization remains the programmer's task
Programming models should facilitate quick implementation and evaluation
Programming models are hard to compare:
  Application-specific or platform-specific
  Often user-specific
Low portability is considered worse than performance drops
Performance trade-offs are smaller than expected
OpenCL's portability is responsible (so far) for its appeal


Thank you!