
Page 1: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Towards the Transparent Execution of Compound OpenCL Computations in Multi-CPU/Multi-GPU Environments

Fábio Soldado, Fernando Alexandre, Hervé Paulino

CITI/Computer Science Department, Faculty of Science and Technology, NOVA University of Lisbon

HeteroPar 2014 @ Euro-Par 2014, Porto, Portugal, August 25

Page 2: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Motivation

Current computational systems are heterogeneous by nature: CPUs + GPUs

The GPU is increasingly being used in general-purpose computing

The programming and execution models for CPUs and GPUs are quite different

  The programmer is forced to direct the computation to one kind of processing unit

High-level programming of environments with multiple GPUs + multiple CPUs as a whole

Page 3: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Problem

OpenCL provides code portability but not performance portability

Low-level programming model – no composition support

[Diagram: a Host and a Device connected by a Bus]

Host: Resource management; Orchestration of data transfer and execution requests

Device: SPMD programming model; Memory organization

Page 4: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Problem

OpenCL provides code portability but not performance portability

Low-level programming model – no composition support

[Diagram: a Host and multiple Devices connected by a Bus]

Host:

⬆ Resource management

⬆ Orchestration of data transfer and execution requests

+ Decompose the computation among the CPUs and GPUs

+ Scheduling and load balancing

+ Device-type specific optimizations

Devices: SPMD programming model; Device-type specific memory organization

ALGORITHMIC SKELETONS

Page 5: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

The Marrow Framework

Distinguishing Features

C++ algorithmic skeleton framework for the orchestration of OpenCL computations [Euro-Par 2013]

Task- and data-parallel skeletons

  Task-parallel: Pipeline and Loop

  Data-parallel: Map(Reduce)

Skeleton nesting

GPU heterogeneity support

GPU-directed optimizations

Page 6: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

The Marrow Framework

Programming Example

Fast Fourier Transform (FFT) pipeline, adapted from the SHOC benchmark suite

  FFT kernel

  Inverse FFT kernel

[Diagram: a Pipeline with two stages, FFT and iFFT]

Executable FFT (new KernelWrapper(kernelFile,
                         kernelFunction, inInfo, outInfo));

Executable pipeline (new Pipeline(FFT, iFFT));

new Buffer<cl_float2>()

Page 7: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Proposal

Support the execution of compound OpenCL computations in multi-CPU/multi-GPU environments

Grow the Marrow algorithmic skeleton framework

Transparently

  Distribute the load of a Marrow computation across multiple CPUs and GPUs

  Adapt this distribution to different input data-sets and to the CPUs’ load fluctuations

Multiple (possibly heterogeneous) GPUs + multiple CPUs

Page 8: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Challenges

How to efficiently decompose a Marrow Computation Tree (CT) among the multiple CPU and GPU devices

How to efficiently distribute the workload among the available hardware resources

How to adapt this distribution to different input data-sets and to the CPUs’ load fluctuations

How to integrate these concepts in the programming model in a non-intrusive way

Page 9: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

CT Decomposition

Replicating the skeleton tree

Integrates seamlessly with the SPMD model

Avoids data migration between devices

Scales well with the increase of devices

Locality-aware domain decomposition

[Diagram: the input dataset is split between two replicas of the Pipeline(FFT, iFFT) tree]

Page 10: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

CT Decomposition

[Diagram: the data is partitioned among 4 CPU sub-devices ("Sub CPU", obtained through OpenCL fission, fission of 2) and 6 GPU overlap partitions (overlap of computation and communication, factor of 3)]

Best fission level?

Best overlap factor?

Page 11: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

CT Decomposition

[Diagram: the data is split into a fraction f and a fraction 1-f between the two device types, and then partitioned among 4 CPU sub-devices ("Sub CPU") and 6 GPU overlap partitions]

Evenly distributed

Distributed according to the relative performance of the devices [SAC 2014]

f?

Best fission level?

Best overlap factor?
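The split pictured on this slide can be sketched in C++ as follows. This is a minimal illustration under stated assumptions, not Marrow's implementation: the names `Distribution` and `partitionSizes` are hypothetical, `f` is taken to be the fraction given to the GPUs, the even split is assumed to apply to the CPU sub-devices, and the performance-weighted split is assumed to apply to the GPUs.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical result type: number of elements assigned to each unit.
struct Distribution {
    std::vector<std::size_t> cpuSubDevices;  // one entry per CPU sub-device
    std::vector<std::size_t> gpus;           // one entry per GPU
};

// Splits `total` elements: a fraction f goes to the GPUs (assumption),
// the remaining 1-f goes to the CPU sub-devices.
Distribution partitionSizes(std::size_t total, double f,
                            std::size_t cpuSubDevices,
                            const std::vector<double>& gpuPerformance) {
    assert(f >= 0.0 && f <= 1.0);
    assert(cpuSubDevices > 0 && !gpuPerformance.empty());

    Distribution d;
    const std::size_t gpuTotal = static_cast<std::size_t>(f * total);
    const std::size_t cpuTotal = total - gpuTotal;

    // CPU side: even split; the first (cpuTotal % cpuSubDevices) units get one extra.
    for (std::size_t i = 0; i < cpuSubDevices; ++i)
        d.cpuSubDevices.push_back(cpuTotal / cpuSubDevices +
                                  (i < cpuTotal % cpuSubDevices ? 1 : 0));

    // GPU side: proportional to relative performance;
    // the last GPU absorbs whatever rounding left unassigned.
    const double sum =
        std::accumulate(gpuPerformance.begin(), gpuPerformance.end(), 0.0);
    assert(sum > 0.0);
    std::size_t assigned = 0;
    for (std::size_t g = 0; g + 1 < gpuPerformance.size(); ++g) {
        const std::size_t share =
            static_cast<std::size_t>(gpuTotal * (gpuPerformance[g] / sum));
        d.gpus.push_back(share);
        assigned += share;
    }
    d.gpus.push_back(gpuTotal - assigned);
    return d;
}
```

For example, `partitionSizes(1000, 0.75, 4, {3.0, 1.0})` gives the GPUs 750 elements, split roughly 3:1, and the four CPU sub-devices the remaining 250. Splitting each GPU share further into overlap partitions is left out of the sketch.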

Page 12: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Work Distribution – CPUs + GPUs

We are particularly interested in recurrent applications of CTs upon possibly different data-sets with different sizes

Lightweight mechanism to derive a suitable configuration for a CT’s execution, given a particular parameterization

Profile-based self-adaptation

  Resort to a profile built from past executions and to the current CPU load information

Page 13: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Work Distribution – CPUs + GPUs

Decision Process

[Flowchart. Entry: Execution request. Decision nodes: New CT?, CT info?, Train flag? Actions: Perform training, Monitored execution, Compute lbt, Persist result]

Page 14: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Work Distribution – CPUs + GPUs

Training Process

Dimensions to consider

  Fission level

  Overlap factor

Compute the best workload distribution (f) for each considered fission/overlap configuration

Two approaches:

  50/50 split

  CPU-assisted GPU execution

Final result: the configuration with the best overall performance

Uniform search over the search space (to improve)
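The training process described on this slide amounts to walking every fission/overlap pair and keeping the fastest one. The sketch below illustrates that idea only; `Config`, `SplitFinder`, and `train` are hypothetical names, and the step that finds the best `f` for a given pair (by either of the two approaches) is abstracted behind a callback because the slides do not detail it.

```cpp
#include <cassert>
#include <functional>
#include <limits>
#include <vector>

// Hypothetical record of one trained configuration.
struct Config {
    int fissionLevel;
    int overlapFactor;
    double f;     // workload distribution found for this fission/overlap pair
    double time;  // execution time measured with this configuration
};

// Hypothetical callback: finds the best f for one fission/overlap pair
// and reports it together with the time it achieved.
using SplitFinder = std::function<Config(int fissionLevel, int overlapFactor)>;

// Visits every fission/overlap pair once and keeps the fastest configuration.
Config train(const std::vector<int>& fissionLevels,
             const std::vector<int>& overlapFactors,
             const SplitFinder& findBestSplit) {
    assert(!fissionLevels.empty() && !overlapFactors.empty());
    Config best{0, 0, 0.0, std::numeric_limits<double>::infinity()};
    for (int fission : fissionLevels) {
        for (int overlap : overlapFactors) {
            const Config candidate = findBestSplit(fission, overlap);
            if (candidate.time < best.time) best = candidate;
        }
    }
    return best;
}
```

The cost grows with the product of the two dimensions, which is consistent with the slide flagging the uniform search as something to improve.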

Page 15: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Work Distribution – CPUs + GPUs

Decision Process

[Flowchart. Entry: Execution request. Decision nodes: New CT?, CT info?, Train flag? Actions: Derive configuration, Monitored execution, Compute lbt, Persist result]

Page 16: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Distribution Adaptation

Derive an initial work distribution

  Interpolation from past executions – Nearest-neighbor
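Nearest-neighbor derivation from past executions can be sketched as a lookup in a sorted profile. The code below is an illustration under assumptions, not the framework's code: `Profile` and `deriveInitialFraction` are hypothetical names, and the profile is assumed to be keyed by data-set size alone and to store only the distribution `f`.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>

// Hypothetical profile: data-set size (in elements) -> f used in a past execution.
using Profile = std::map<std::size_t, double>;

// Returns the f recorded for the past data-set size closest to `size`.
double deriveInitialFraction(const Profile& profile, std::size_t size) {
    assert(!profile.empty());
    auto hi = profile.lower_bound(size);            // first entry with key >= size
    if (hi == profile.begin()) return hi->second;   // smaller than every known size
    auto lo = std::prev(hi);
    if (hi == profile.end()) return lo->second;     // larger than every known size
    // Pick whichever neighbour is closer; ties go to the smaller size.
    return (size - lo->first <= hi->first - size) ? lo->second : hi->second;
}
```

With a profile holding sizes 1024, 4096, and 16384, a request for 3000 elements reuses the entry recorded for 4096, since that is the closest known size.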

Page 17: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Work Distribution – CPUs + GPUs

Decision Process

[Flowchart. Entry: Execution request. Decision nodes: New CT?, CT info?, Train flag?, New data-set?, Must rebalance? Actions: Derive configuration, Retrieve lbt, Adjust distribution, Monitored execution, Compute lbt, Persist result]

Page 18: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Distribution Adaptation

Derive an initial work distribution

  Interpolation from past executions – Nearest-neighbor

Adjust work distribution

  When lbt(t) ≈ 1

  Two-level approach

    1. Transfer load from the worst performing computing unit type to the best performing

    2. Retrigger the process to find the best configuration for the current fission/overlap configuration
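The first level of the adjustment, moving load from the slower computing unit type to the faster one, might look like the sketch below. Everything in it is an assumption made for illustration: `adjustFraction` is a hypothetical name, `f` is taken to be the GPUs' share, and the fixed `step` is an invented tuning parameter. The lbt-based trigger and the second level are not modeled.

```cpp
#include <algorithm>
#include <cassert>

// f is the fraction of the workload given to the GPUs (assumption);
// cpuTime and gpuTime are the times monitored in the last execution.
// `step` is a hypothetical tuning parameter: the fraction moved per adjustment.
double adjustFraction(double f, double cpuTime, double gpuTime,
                      double step = 0.0625) {
    assert(f >= 0.0 && f <= 1.0 && step > 0.0);
    if (cpuTime > gpuTime) {
        f += step;  // the CPUs finished last: shift load to the GPUs
    } else if (gpuTime > cpuTime) {
        f -= step;  // the GPUs finished last: shift load to the CPUs
    }
    return std::max(0.0, std::min(1.0, f));  // keep f a valid fraction
}
```

A fixed step keeps the sketch short; a step proportional to the observed time gap would be an equally plausible choice.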

Page 19: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Evaluation

Metrics

Speed-up relative to GPU-only executions

Efficiency of the work distribution strategy

Efficiency of the load balancing strategy

Page 20: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Evaluation

Case Studies and Test Platforms

Case Studies

  Image Filter Pipeline: 3-stage pipeline

  FFT (Fast Fourier Transform): 2-stage pipeline

  N-Body (Direct-sum, O(N^2)): For loop

  Saxpy: Map

  Segmentation: Map

Test Platform

  CPU: Intel Core i7-3930K @ 3.20 GHz, 6 cores, 12 hardware threads, 6 L1 and L2 caches, 1 L3 cache

  GPUs: 2 AMD HD 7950 (2x PCIe bus)

Page 21: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Evaluation - Speedup

1 GPU + CPU vs 1 GPU

[Bar chart: speedup (y-axis, 0.5 to 3) per benchmark and input size, with two series: 50/50 split and CPU-assisted GPU execution. Image Pipeline: 1024x1024, 2048x2048, 4096x4096; FFT: 128MB, 256MB, 512MB; NBody: 16384, 32768, 65536; Saxpy: 1M, 10M, 15M; Segmentation: 1MB, 8MB, 60MB]

Page 22: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Evaluation - Speedup

2 GPUs + CPU vs 2 GPUs

[Bar chart: speedup (y-axis, 0.5 to 3) per benchmark and input size, with two series: 50/50 split and CPU-assisted GPU execution. Filter Pipeline: 1024x1024, 2048x2048, 4096x4096; FFT: 128MB, 256MB, 512MB; NBody: 16384, 32768, 65536; Saxpy: 1M, 10M, 15M; Segmentation: 1MB, 8MB, 60MB]

Page 23: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Evaluation – Config. Derivation

[Chart: fraction assigned to the GPUs (y-axis, 80 to 96) for Image 2 to Image 6, with two series: W/ Full Training and Derived Configuration]

[Chart: execution time (y-axis ticks 0.1, 1, 10, 100) for Image 1 to Image 6, with two series: W/ Full Training and Derived Configuration]

Page 24: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Evaluation – Load Balancing

[Chart: GPU percentage and CPU percentage (y-axis, 40% to 60%) across a sequence of runs labeled L1 and L2; the pattern of six L1 runs followed by one L2 run appears three times]

Page 25: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Conclusions

We are able to support the execution of nestable task-parallel skeletons in heterogeneous multi-CPU/multi-GPU environments

  With device-specific optimizations

    CPU – locality via fission

    GPU – overlap of communication and computation

  Transparent work distribution and load balancing in the presence of recurrent executions

The experimental results are promising

The program size is reduced more than 5x for a simple map example (Saxpy)

Page 26: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Future Work

Regarding CPU + GPU

  Optimize configuration derivation

  Conjoin the use of profiling with performance models

Regarding Marrow

  Other types of accelerators

  Cluster of multi-CPU/multi-GPU nodes

  Generate code for kernels and orchestration from higher-level representations

  More skeletons

Page 27: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Questions?

Page 28: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Work Distribution – CPUs + GPUs: 50/50 Split

Page 29: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Work Distribution – CPUs + GPUs: 50/50 Split

Page 30: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Work Distribution – CPUs + GPUs: 50/50 Split

Page 31: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

CPU-only Execution

[Bar chart: execution time (y-axis, 0.0 to 400.0) per benchmark and input size, with two series: with the best fission level and without fission. Image Pipeline: 1024x1024, 2048x2048, 4096x4096, 8192x8192; Saxpy: 1M, 10M, 50M; Segmentation: 1MB, 8MB, 60MB]

Page 32: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

FFT Training, 256 MB

[Bar chart: execution time (y-axis, 0.0 to 250.0) per fission level]

L1 cache: 60.7

L2 cache: 58.1

L3 cache: 82.2

none: 197.9

Page 33: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Online Monitoring

[Chart: execution time of the CPU and of the GPU in two cases, Balanced and Unbalanced]

Page 34: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Evaluation

Distribution Quality

Page 35: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Evaluation

Productivity – Lines of code

Saxpy: Z[i] = alpha * X[i] + Y[i]

            Initialization/Finalization   Orchestration   Total
OpenCL      104                           94              198
Marrow      18                            38              56
Reduction   5.7x                          2.5x            3.5x
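For reference, the Saxpy computation itself is a one-line element-wise operation. The function below is a plain sequential C++ rendering of the formula on this slide, included only to show what is being computed; it is neither the OpenCL kernel nor the Marrow program whose line counts are compared above, and the name `saxpy` and its signature are choices made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Z[i] = alpha * X[i] + Y[i] for every i.
std::vector<float> saxpy(float alpha,
                         const std::vector<float>& x,
                         const std::vector<float>& y) {
    assert(x.size() == y.size());
    std::vector<float> z(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        // Each element depends only on its own inputs, which is why
        // the computation fits a Map and can be partitioned freely.
        z[i] = alpha * x[i] + y[i];
    }
    return z;
}
```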

Page 36: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Decomposing Marrow Computations

The Loop Skeleton

[Diagram: the same flow is replicated for each GPU partition, #1 to #N]

Evaluate condition on the host

  True: Upload/Update partition to GPU, execute Body, Download data to host, Update loop state, then evaluate the condition again

  False: leave the loop
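The per-partition flow of the Loop skeleton can be mimicked on the host alone, as in the sketch below. It is a simplified stand-in, not Marrow code: `LoopState` and `runLoop` are hypothetical names, plain vector copies take the place of the upload and download transfers, and the body runs on the CPU rather than on a GPU.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical loop state kept on the host.
struct LoopState {
    int step;
    int maxSteps;
};

// Runs the loop for a single partition and returns the number of iterations.
int runLoop(LoopState& state, std::vector<float>& partition,
            const std::function<bool(const LoopState&)>& condition,
            const std::function<void(std::vector<float>&)>& body,
            const std::function<void(LoopState&, const std::vector<float>&)>&
                updateState) {
    int iterations = 0;
    while (condition(state)) {                    // evaluate condition on the host
        std::vector<float> onDevice = partition;  // upload (a copy stands in)
        body(onDevice);                           // execute the loop body
        partition = onDevice;                     // download data to the host
        updateState(state, partition);            // update loop state
        ++iterations;
    }
    return iterations;
}
```

With N partitions, one such loop would run per partition, as the diagram suggests; how the replicas synchronize their loop state is not shown on the slide and is not modeled here.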

Page 37: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Programming Interface

New Features

Control over what may and may not be partitioned

  PARTITIONABLE

  COPY

The elementary size of a partition

Merge functions

Page 38: Fábio Soldado, Fernando Alexandre,  Hervé Paulino CITI/Computer Science Department

Programming Example

FFT Pipeline Revisited

[Diagram: a Pipeline with two stages, FFT and iFFT]

shared_ptr<IWorkData> (new BufferData<cl_float2>());

shared_ptr<IWorkData> (new BufferData<cl_float2>(fftSize,
                         IWorkData::PARTITIONABLE));

[Callout: Partition elementary size]

unique_ptr<Executable> FFT (new KernelWrapper(kernelFile,
                         kernelFunction, inInfo, outInfo));

unique_ptr<Executable> pipeline (new Pipeline(FFT, iFFT));