Keeping Performance Portable In High Performance Kernels
Saman Amarasinghe, Una-May O’Reilly, Jason Ansel, Phitchaya Mangpo Phothilimthana, Jonathan Ragan-Kelley
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology


Page 1:

Keeping Performance Portable In High Performance Kernels

Saman Amarasinghe, Una-May O’Reilly
Jason Ansel, Phitchaya Mangpo Phothilimthana
Jonathan Ragan-Kelley

Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology

Page 2:

Example: Convolution

2D Convolution

[Figure: a 3x3 2D kernel, whose entries are the outer product of (k1, k2, k3) with itself, is slid over the input to produce each element of the output.]

Page 3:

Example: Convolution (Algorithmic Choices)

2D Convolution vs. Separable Convolution

[Figure: on the left, single-pass 2D convolution applies the 3x3 outer-product kernel directly to the input to produce the output. On the right, separable convolution first runs Convolve Row with the 1D kernel (k1, k2, k3) over the input into an intermediate buffer, then runs Convolve Column with the same 1D kernel over the intermediate to produce the output.]

Page 4:

ZettaBricks Language [PLDI’09]

transform SeparableConvolution
from In[w, h], Kernel[KWIDTH]
to Out[w - KWIDTH+1, h - KWIDTH+1]
{
  // Choice 1: single pass 2D convolution
  to(Out out) from(In in, Kernel kernel) {
    Convolve2D(out, in, kernel);
  }

  // Choice 2: two pass separable convolution
  to(Out out) from(In in, Kernel kernel)
    using(buffer[w - KWIDTH+1, h]) {
    ConvolveRows(buffer, in, kernel);
    ConvolveColumns(out, buffer, kernel);
  }
}
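For concreteness, here is a minimal NumPy sketch of the two algorithmic choices (illustrative only, not the code ZettaBricks generates); it assumes a 1D kernel k whose 2D counterpart is its outer product with itself, as in the figures above.

import numpy as np

def convolve2d(inp, k):
    """Choice 1: single-pass 2D convolution with the outer-product kernel."""
    kw = len(k)
    k2d = np.outer(k, k)                      # 2D kernel, entries k[i]*k[j]
    h, w = inp.shape
    out = np.empty((h - kw + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(inp[y:y+kw, x:x+kw] * k2d)
    return out

def separable_convolution(inp, k):
    """Choice 2: convolve rows into an intermediate buffer, then columns."""
    kw = len(k)
    h, w = inp.shape
    buf = np.empty((h, w - kw + 1))           # plays the role of buffer[w-KWIDTH+1, h]
    for y in range(h):
        for x in range(buf.shape[1]):
            buf[y, x] = np.dot(inp[y, x:x+kw], k)
    out = np.empty((h - kw + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.dot(buf[y:y+kw, x], k)
    return out

# Both choices compute the same Out (up to floating-point rounding):
# np.allclose(convolve2d(img, k), separable_convolution(img, k))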

Page 5:

ZettaBricks for Heterogeneous Systems

[Diagram: a ZettaBricks program goes through the Compiler to a C++ output program; the Autotuner exchanges training information with the Runtime System and produces a choice configuration.]

Original ZettaBricks:
- Compiler: dependency analysis, task creation, task scheduler, C++ code gen, etc.
- Autotuner: algorithmic choices, parallelization techniques, data distributions, transformations, etc.
- Runtime system: CPU work-stealing model.

ZettaBricks for heterogeneous systems [ASPLOS'13]:
- Compiler: dependency analysis, data movement analysis, CPU/GPU task creation, task scheduler, C++ code gen, OpenCL code gen, etc.
- Autotuner: algorithmic choices, parallelization techniques, data distributions, transformations, CPU/GPU choices, global/local memory, CPU-GPU workload ratio, GPU local work size, etc.
- Runtime system: CPU work-stealing model, GPU work-pushing model, memory management.

Page 6:

Scheduling Choices: Convolution

Before adding OpenCL:
Schedule 1: Convolve2D();
Schedule 2: ConvolveRows(); ConvolveColumns();

After adding OpenCL:
Schedule 1: Convolve2D();
Schedule 2: Convolve2D_opencl();
Schedule 3: ConvolveRows(); ConvolveColumns();
Schedule 4: ConvolveRows(); ConvolveColumns_opencl();
Schedule 5: ConvolveRows_opencl(); ConvolveColumns();
Schedule 6: ConvolveRows_opencl(); ConvolveColumns_opencl();

Page 7:

Scheduling Choices: Convolution

Original choices:
Schedule 1: Convolve2D();
Schedule 2: ConvolveRows(); ConvolveColumns();

After adding OpenCL:
Schedule 1: Convolve2D();
Schedule 2: Convolve2D_opencl();
Schedule 3: ConvolveRows(); ConvolveColumns();
Schedule 4: ConvolveRows(); ConvolveColumns_opencl();
Schedule 5: ConvolveRows_opencl(); ConvolveColumns();
Schedule 6: ConvolveRows_opencl(); ConvolveColumns_opencl();

After adding the local memory versions:
Schedule 1: Convolve2D();
Schedule 2: Convolve2D_opencl();
Schedule 3: Convolve2D_opencl_local();
Schedule 4: ConvolveRows(); ConvolveColumns();
Schedule 5: ConvolveRows(); ConvolveColumns_opencl();
Schedule 6: ConvolveRows(); ConvolveColumns_opencl_local();
Schedule 7: ConvolveRows_opencl(); ConvolveColumns();
Schedule 8: ConvolveRows_opencl_local(); ConvolveColumns();
Schedule 9: ConvolveRows_opencl(); ConvolveColumns_opencl();
Schedule 10: ConvolveRows_opencl(); ConvolveColumns_opencl_local();
Schedule 11: ConvolveRows_opencl_local(); ConvolveColumns_opencl();
Schedule 12: ConvolveRows_opencl_local(); ConvolveColumns_opencl_local();

Local memory is scratchpad memory shared by all work-items (GPU threads) in a work-group (block).

Page 8:

CPU-GPU Workload Balancing

The CPU/GPU ratio parameter statically defines how much of the data is computed on each device.
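Illustration only (not the ZettaBricks runtime): a static ratio split over the rows of the work, where run_on_gpu and run_on_cpu are hypothetical stand-ins for the generated OpenCL and C++ task bodies.

import threading

def run_split_by_ratio(data, gpu_ratio, run_on_gpu, run_on_cpu):
    """Statically split the rows of `data` between GPU and CPU.

    `gpu_ratio` is a fraction in [0, 1] chosen by the autotuner (e.g. 6/8);
    `run_on_gpu` and `run_on_cpu` are hypothetical stand-ins for the
    generated OpenCL and C++ task bodies.
    """
    n_rows = len(data)
    split = int(n_rows * gpu_ratio)           # rows [0, split) go to the GPU

    gpu_part, cpu_part = data[:split], data[split:]
    t = threading.Thread(target=run_on_gpu, args=(gpu_part,))
    t.start()                                 # GPU portion runs asynchronously
    run_on_cpu(cpu_part)                      # CPU portion runs on the host
    t.join()                                  # wait for both halves to finish

# Example: a 6/8 ratio sends three quarters of the rows to the GPU.
# run_split_by_ratio(rows, 6/8, launch_opencl_kernel, run_cpp_tasks)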


Page 9:

GPU Choice Representation

Schedule 1: Convolve2D();
Schedule 2: Convolve2D_opencl();
Schedule 3: Convolve2D_opencl_local();
Schedule 4: ConvolveRows(); ConvolveColumns();
Schedule 5: ConvolveRows(); ConvolveColumns_opencl();
Schedule 6: ConvolveRows(); ConvolveColumns_opencl_local();
Schedule 7: ConvolveRows_opencl(); ConvolveColumns();
Schedule 8: ConvolveRows_opencl_local(); ConvolveColumns();
Schedule 9: ConvolveRows_opencl(); ConvolveColumns_opencl();
Schedule 10: ConvolveRows_opencl(); ConvolveColumns_opencl_local();
Schedule 11: ConvolveRows_opencl_local(); ConvolveColumns_opencl();
Schedule 12: ConvolveRows_opencl_local(); ConvolveColumns_opencl_local();

Local work size: 4, 9, 16, 25, ...
GPU-CPU ratio: 1/8, 2/8, 3/8, ..., 8/8

Page 10:

GPU Choice Representation

Schedule 1: Convolve2D();
Schedule 2: Convolve2D_opencl();
Schedule 3: Convolve2D_opencl_local();
Schedule 4: ConvolveRows(); ConvolveColumns();
Schedule 5: ConvolveRows(); ConvolveColumns_opencl();
Schedule 6: ConvolveRows(); ConvolveColumns_opencl_local();
Schedule 7: ConvolveRows_opencl(); ConvolveColumns();
Schedule 8: ConvolveRows_opencl_local(); ConvolveColumns();
Schedule 9: ConvolveRows_opencl(); ConvolveColumns_opencl();
Schedule 10: ConvolveRows_opencl(); ConvolveColumns_opencl_local();
Schedule 11: ConvolveRows_opencl_local(); ConvolveColumns_opencl();
Schedule 12: ConvolveRows_opencl_local(); ConvolveColumns_opencl_local();

Local work size: 4, 9, 16, 25, ...
GPU-CPU ratio: 1/8, 2/8, 3/8, ..., 8/8
Other parameters ...

... a big search space: up to 10^1040 choices.
Tuned with a bottom-up evolutionary algorithm [GECCO'11].
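For intuition, a heavily simplified (1+1)-style mutate-and-select loop over this choice representation is sketched below. This is a generic illustration, not the GECCO'11 bottom-up algorithm, and measure_runtime is an assumed hook that times the program under one configuration.

import random

# A configuration is one point in the search space described above.
SCHEDULES = list(range(1, 13))           # schedules 1..12
LOCAL_WORK_SIZES = [4, 9, 16, 25]
RATIOS = [i / 8 for i in range(1, 9)]    # 1/8 .. 8/8

def random_config():
    return {
        "schedule": random.choice(SCHEDULES),
        "local_work_size": random.choice(LOCAL_WORK_SIZES),
        "gpu_cpu_ratio": random.choice(RATIOS),
    }

def mutate(cfg):
    """Change one randomly chosen parameter of the configuration."""
    new = dict(cfg)
    key = random.choice(list(new))
    choices = {"schedule": SCHEDULES,
               "local_work_size": LOCAL_WORK_SIZES,
               "gpu_cpu_ratio": RATIOS}[key]
    new[key] = random.choice(choices)
    return new

def tune(measure_runtime, generations=100):
    """`measure_runtime(cfg)` runs the program under `cfg`, returning seconds."""
    best = random_config()
    best_time = measure_runtime(best)
    for _ in range(generations):
        cand = mutate(best)
        t = measure_runtime(cand)
        if t < best_time:                # keep the faster configuration
            best, best_time = cand, t
    return best, best_time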

Page 11:

Experimental Results

Convolution, Black-Scholes, Poisson 2D SOR, Sort, Strassen, Tridiagonal Solver, Singular Value Decomposition

Page 12:

Experiment: Convolution

[Three test machines: Desktop, Server, Laptop.]

All choices are in OpenCL

Page 13:

Experiment: Convolution

• Autotune on each machine
• Test cross-run
• Normalize execution time by the best config

Desktop config: separable convolution with local memory on GPU
Server config: separable convolution on OpenCL
Laptop config: 2D convolution with local memory on GPU

[Chart: each config run on every machine, normalized to the best config; hand-coded OpenCL included for comparison. Lower is better.]

Page 14:

Experiment: Strassen (Matrix Multiply)

The right configuration can provide a huge performance improvement: 16.5x here.

Desktop config: data parallel on GPU
Server config: recursive decomposition, then LAPACK on CPU
Laptop config: LAPACK on CPU
Hand-coded OpenCL included for comparison.

Page 15:

Experiment: Poisson 2D SOR

The optimal placement on one machine is almost the opposite of the optimal placement on another.

Desktop config: split on CPU, compute on GPU
Server config: split on OpenCL, compute on CPU
Laptop config: split on CPU, compute on GPU

Page 16:

Experiment: Tridiagonal Solver

Algorithmic choice dramatically affects performance.

Desktop config: cyclic reduction on GPU
Server config: direct solve on CPU
Laptop config: direct solve on CPU

Page 17:

Experiment: Sort

It is not always best to use accelerators.

Desktop config: 2MS -> QS -> 4MS -> IS, on CPU
Server config: 4MS -> 2MS -> IS, on CPU
Laptop config: 4MS -> 2MS -> 4MS -> IS, on CPU
GPU-only config: bitonic sort
Hand-coded OpenCL: radix sort

(2MS/4MS = 2-way/4-way merge sort, QS = quicksort, IS = insertion sort)

Page 18:

Experiment: SVD

The best configuration uses GPU-CPU task parallelism on some machines.

Desktop config: task parallelism between CPU and GPU
Server config: all on CPU
Laptop config: all on CPU

Page 19:

Experiment: Black-Scholes

The best configuration divides the workload between CPU and GPU on some machines.

Desktop config: all on GPU
Server config: all on OpenCL
Laptop config: 25% on CPU, 75% on GPU

Page 20:

Choice Differences Across Machines

[Table: for each benchmark (Convolution, Strassen, SOR, Tridiagonal Solver, Sort, SVD, Black-Scholes), the dimensions along which the best configuration differs across machines: devices (C++/OpenCL), algorithms, GPU-CPU ratio, GPU/CPU task parallelism, and global/local memory.]

Page 21:

OpenTuner: Make Autotuning Available Beyond ZettaBricks

• Every high performance programmer can, and should, use autotuners
• But autotuning is sparsely used
  o Most still do exhaustive search!
• Taking advantage of what we have learned in the 5+ ZettaBricks autotuners
  o We use sophisticated machine learning techniques
• A general framework for building autotuners
  o A toolbox, not a one-size-fits-all autotuner
• Making it available to the community
  o Domain experts can put together a sophisticated autotuner (a minimal example follows)
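For a flavor of what putting together an autotuner looks like, here is a minimal sketch in the style of OpenTuner's published examples; the class and method names reflect the released OpenTuner API as best recalled and may differ slightly, and the ./convolution binary with its flags is hypothetical.

import opentuner
from opentuner import ConfigurationManipulator, IntegerParameter, EnumParameter
from opentuner import MeasurementInterface, Result

class FlagsTuner(MeasurementInterface):
    """Tune an optimization level and an algorithm choice for a program."""

    def manipulator(self):
        # Define the search space; the autotuner explores it for you.
        m = ConfigurationManipulator()
        m.add_parameter(IntegerParameter('opt_level', 0, 3))
        m.add_parameter(EnumParameter('algorithm', ['2d', 'separable']))
        return m

    def run(self, desired_result, input, limit):
        # Build the command line for one candidate configuration and time it.
        cfg = desired_result.configuration.data
        cmd = './convolution --opt=%d --algo=%s' % (cfg['opt_level'],
                                                    cfg['algorithm'])
        run_result = self.call_program(cmd)
        return Result(time=run_result['time'])

if __name__ == '__main__':
    args = opentuner.default_argparser().parse_args()
    FlagsTuner.main(args)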

Page 22:

Lessons from ZettaBricks #1

• Configuration representation is critical
  o Cartesian coordinates are often natural/useful
  o But they represent things like trees poorly

OpenTuner:
• Custom format with dual interfaces (sketched below):
  o Cartesian view: a point in a high dimensional space
  o Maze view: a dynamic number of "moves" can be taken from any current position
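A rough illustration of the two views using OpenTuner parameter classes (treat the exact import paths and signatures as assumptions): a numeric parameter behaves like one Cartesian axis, while a permutation parameter is explored through maze-like moves such as swaps.

from opentuner.search.manipulator import (ConfigurationManipulator,
                                          FloatParameter,
                                          PermutationParameter)

m = ConfigurationManipulator()
# Cartesian-style parameter: one axis of a high dimensional point.
m.add_parameter(FloatParameter('gpu_cpu_ratio', 0.0, 1.0))
# Complex parameter: an ordering that is explored through "moves"
# (swaps, reversals) rather than coordinate arithmetic.
m.add_parameter(PermutationParameter('loop_order', ['i', 'j', 'k']))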

Page 23:

Lessons from ZettaBricks #2

• There is no perfect search technique
• Techniques have differing strengths
  o Experience with many novel techniques
• Exploitation/exploration tradeoff

OpenTuner:
• Library of competing techniques (a toy sketch follows):
  o Ensembles of techniques run in parallel
  o Credit assignment gives larger testing budgets to successful techniques
  o Long-term (cross-run) performance informs which techniques are best for each problem
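As an illustration of the idea only (not OpenTuner's actual meta-technique), a toy credit-assignment loop might give more of the testing budget to whichever technique has produced the most recent improvements:

import random

def tune_with_ensemble(techniques, evaluate, budget=200, epsilon=0.1):
    """Run an ensemble of search techniques with simple credit assignment.

    `techniques` maps a name to a function that proposes the next
    configuration (given the current best, possibly None); `evaluate(cfg)`
    returns a runtime in seconds. This is a toy epsilon-greedy scheme.
    """
    credit = {name: 1.0 for name in techniques}    # improvements per technique
    trials = {name: 1 for name in techniques}
    best_cfg, best_time = None, float('inf')

    for _ in range(budget):
        if random.random() < epsilon:              # explore: any technique
            name = random.choice(list(techniques))
        else:                                      # exploit: best success rate
            name = max(techniques, key=lambda n: credit[n] / trials[n])
        cfg = techniques[name](best_cfg)           # technique proposes a config
        t = evaluate(cfg)
        trials[name] += 1
        if t < best_time:                          # improvement: assign credit
            best_cfg, best_time = cfg, t
            credit[name] += 1.0
    return best_cfg, best_time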

Page 24:

Lessons from ZettaBricks #3

• Usage, aggregation, and interpretation of results data varies widely
• Often accessed in different ways at different times

OpenTuner:
• Fully featured database of results (SQL):
  o Cross-cutting access and mining of results data
  o Supports transactional parallelism
  o Long-term knowledge sharing between runs

Page 25:

OpenTuner Modules/Processes

Page 26:

OpenTuner Status

• V1 is ready; ZettaBricks is now ported to OpenTuner
• Looking for users
• Come find us at the Technology Marketplace