applying opencl - iwocl4 © 2017 codeplay software ltd 1 1 2 4 8 16 32 64 128 256 512 1,024 2,048...

25
Applying OpenCL IWOCL, May 2017 Andrew Richards

Upload: others

Post on 05-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Applying OpenCL - IWOCL4 © 2017 Codeplay Software Ltd 1 1 2 4 8 16 32 64 128 256 512 1,024 2,048 4,096 8,192 16,384 32,768 65,536 Google target Desktop GPU Integrated GPU Smartphone

Applying OpenCLIWOCL, May 2017

Andrew Richards

Page 2: Applying OpenCL - IWOCL4 © 2017 Codeplay Software Ltd 1 1 2 4 8 16 32 64 128 256 512 1,024 2,048 4,096 8,192 16,384 32,768 65,536 Google target Desktop GPU Integrated GPU Smartphone

© 2017 Codeplay Software Ltd2

The next generation of software will not be built on CPUs

Page 3: Applying OpenCL - IWOCL4 © 2017 Codeplay Software Ltd 1 1 2 4 8 16 32 64 128 256 512 1,024 2,048 4,096 8,192 16,384 32,768 65,536 Google target Desktop GPU Integrated GPU Smartphone

© 2017 Codeplay Software Ltd3

“On a 100 millimetre-squared chip, Google needs something like 50 teraflops of performance”

- Daniel Rosenband (Google’s self-driving car project) at HotChips 2016

Page 4: Applying OpenCL - IWOCL4 © 2017 Codeplay Software Ltd 1 1 2 4 8 16 32 64 128 256 512 1,024 2,048 4,096 8,192 16,384 32,768 65,536 Google target Desktop GPU Integrated GPU Smartphone

© 2017 Codeplay Software Ltd4

1

1

2

4

8

16

32

64

128

256

512

1,024

2,048

4,096

8,192

16,384

32,768

65,536

Google target

Desktop GPU

Integrated GPU

Smartphone GPU

Smartphone CPU

Desktop CPU

Performance trends

GF

LO

PS

Year of introduction

Page 5: Applying OpenCL - IWOCL4 © 2017 Codeplay Software Ltd 1 1 2 4 8 16 32 64 128 256 512 1,024 2,048 4,096 8,192 16,384 32,768 65,536 Google target Desktop GPU Integrated GPU Smartphone

© 2017 Codeplay Software Ltd5

AI processor

Parallelism of GPU

Power efficiency

of GPU

Remove graphics

The rise of the AI processor

Page 6: Applying OpenCL - IWOCL4 © 2017 Codeplay Software Ltd 1 1 2 4 8 16 32 64 128 256 512 1,024 2,048 4,096 8,192 16,384 32,768 65,536 Google target Desktop GPU Integrated GPU Smartphone

© 2017 Codeplay Software Ltd6

How do we connect tomorrow’s software to tomorrow’s silicon?

Page 7: Applying OpenCL - IWOCL4 © 2017 Codeplay Software Ltd 1 1 2 4 8 16 32 64 128 256 512 1,024 2,048 4,096 8,192 16,384 32,768 65,536 Google target Desktop GPU Integrated GPU Smartphone

© 2017 Codeplay Software Ltd7

1. Make it fast

2. Make it safe

3. Make it ubiquitous

OpenCL: Our targets for 2017 and beyond

Page 8: Applying OpenCL - IWOCL4 © 2017 Codeplay Software Ltd 1 1 2 4 8 16 32 64 128 256 512 1,024 2,048 4,096 8,192 16,384 32,768 65,536 Google target Desktop GPU Integrated GPU Smartphone

© 2017 Codeplay Software Ltd8

Hand-optimized operations

Kernel fusion

Custom operations

How do we get performance on accelerators?

Page 9: Applying OpenCL - IWOCL4 © 2017 Codeplay Software Ltd 1 1 2 4 8 16 32 64 128 256 512 1,024 2,048 4,096 8,192 16,384 32,768 65,536 Google target Desktop GPU Integrated GPU Smartphone

© 2017 Codeplay Software Ltd9

Kernel fusion: some numbers

0

10

20

30

40

50

60

70

80

90

100

OpenCV (nodes) OpenCV (graph) Halide (nodes) Halide (graph) SYCL (nodes) SYCL (graph)

Effect of combining graph nodes on performance

Kernel time (ms) Overhead time (ms)

In this example,

we perform 3

image

processing

operations on an

accelerator and

compare 3

systems when

executing

individual nodes,

or a whole graph

Halide and

SYCL use kernel

fusion, whereas

OpenCV does

not. For all 3

systems, the

performance of

the whole graph

is significantly

better than

individual nodes

executed on

their own

The system is an

AMD APU and the

operations are:

RGB->HSV,

channel masking,

HSV->RGB

Page 10: Applying OpenCL - IWOCL4 © 2017 Codeplay Software Ltd 1 1 2 4 8 16 32 64 128 256 512 1,024 2,048 4,096 8,192 16,384 32,768 65,536 Google target Desktop GPU Integrated GPU Smartphone

© 2017 Codeplay Software Ltd10

0

0.2

0.4

0.6

0.8

1

1.2

10 Fused 80 Fused 640 Fused 4096 Fused

No

rmal

ized

tim

e (l

ow

er is

bet

ter)

TensorFlow Eigen Kernel Fusion

Kernel1 Kernel2 Kernel3 Fused

Applying fusion to TensorFlow Eigen

Spee

du

p b

y fu

sio

n

Spee

du

p b

y fu

sio

n

Spee

du

p b

y fu

sio

n

Spee

du

p b

y fu

sio

n

This is how

TensorFlow

uses Eigen to

achieve

kernel-fusion

CUDA does

this for NVIDIA

GPUs, SYCL

is used here

for AMD GPUs

-1x0x1x2x3x4x5x6x7x8x9x

10x11x12x13x14x15x16x17x18x

Performance improvement at size 4,000

Improvement at 4,000

Unfused

performance

improvement: AMD

GPU vs multi-core

Intel CPU

Total performance

improvement

delivered by SYCL is

both of these graphs

combined

Page 11: Applying OpenCL - IWOCL4 © 2017 Codeplay Software Ltd 1 1 2 4 8 16 32 64 128 256 512 1,024 2,048 4,096 8,192 16,384 32,768 65,536 Google target Desktop GPU Integrated GPU Smartphone

© 2017 Codeplay Software Ltd11

Hand-optimized operations

Kernel fusion

Custom operations ?

How do we combine our requirements?

Page 12: Applying OpenCL - IWOCL4 © 2017 Codeplay Software Ltd 1 1 2 4 8 16 32 64 128 256 512 1,024 2,048 4,096 8,192 16,384 32,768 65,536 Google target Desktop GPU Integrated GPU Smartphone

© 2017 Codeplay Software Ltd12

Kernel fusion system

Custom operations written in

fusable form

Hand-coded operations written in

fusable form

How do we fuse custom and hand-coded kernels?

Page 13: Applying OpenCL - IWOCL4 © 2017 Codeplay Software Ltd 1 1 2 4 8 16 32 64 128 256 512 1,024 2,048 4,096 8,192 16,384 32,768 65,536 Google target Desktop GPU Integrated GPU Smartphone

© 2017 Codeplay Software Ltd13

• We need a language and compiler that:

1. Lets users easily write custom operations

2. Lets hardware experts drill-down and write device-specific optimized code

3. Allows code to be efficiently fused

Page 14: Applying OpenCL - IWOCL4 © 2017 Codeplay Software Ltd 1 1 2 4 8 16 32 64 128 256 512 1,024 2,048 4,096 8,192 16,384 32,768 65,536 Google target Desktop GPU Integrated GPU Smartphone

© 2017 Codeplay Software Ltd14

• We need a language and compiler that:

1. Lets users easily write custom operationsC++ is a well-understood programming language that programmers can use

2. Lets hardware experts drill-down and write device-specific optimized codeC++ allows expert programmers to write low-level device-specific optimized code

3. Allows code to be efficiently fusedC++ single-source lets us fuse kernels

… and is already used

Page 15: Applying OpenCL - IWOCL4 © 2017 Codeplay Software Ltd 1 1 2 4 8 16 32 64 128 256 512 1,024 2,048 4,096 8,192 16,384 32,768 65,536 Google target Desktop GPU Integrated GPU Smartphone

© 2017 Codeplay Software Ltd15

• We need a language and compiler that:

1. Lets users easily write custom operations

2. Lets hardware experts drill-down and write device-specific optimized code

3. Allows code to be efficiently fused

But, now we have SPIR/SPIR-V, you could write your own compiler to solve this

Page 16: Applying OpenCL - IWOCL4 © 2017 Codeplay Software Ltd 1 1 2 4 8 16 32 64 128 256 512 1,024 2,048 4,096 8,192 16,384 32,768 65,536 Google target Desktop GPU Integrated GPU Smartphone

© 2017 Codeplay Software Ltd16

Make it safe

Page 17: Applying OpenCL - IWOCL4 © 2017 Codeplay Software Ltd 1 1 2 4 8 16 32 64 128 256 512 1,024 2,048 4,096 8,192 16,384 32,768 65,536 Google target Desktop GPU Integrated GPU Smartphone

© 2017 Codeplay Software Ltd17

• Our challenges:• We need all the tools to follow standard safety-critical processes

• We need predictable timing

• We need to test the OpenCL implementations thoroughly

• We need to test OpenCL code in extreme situations

• We need to be able to handle highly parallel errors and recovery

OpenCL SC

Page 18: Applying OpenCL - IWOCL4 © 2017 Codeplay Software Ltd 1 1 2 4 8 16 32 64 128 256 512 1,024 2,048 4,096 8,192 16,384 32,768 65,536 Google target Desktop GPU Integrated GPU Smartphone

© 2017 Codeplay Software Ltd18

• Each challenge is a massive challenge

• We need to come together to solve these challenges• Academics and industry

• Parallelism, safety, automotive, medical, formal methods, testing…

OpenCL SC: We need to work together

Page 19: Applying OpenCL - IWOCL4 © 2017 Codeplay Software Ltd 1 1 2 4 8 16 32 64 128 256 512 1,024 2,048 4,096 8,192 16,384 32,768 65,536 Google target Desktop GPU Integrated GPU Smartphone

© 2017 Codeplay Software Ltd19

Make it ubiquitous

Page 20: Applying OpenCL - IWOCL4 © 2017 Codeplay Software Ltd 1 1 2 4 8 16 32 64 128 256 512 1,024 2,048 4,096 8,192 16,384 32,768 65,536 Google target Desktop GPU Integrated GPU Smartphone

© 2017 Codeplay Software Ltd20

Making OpenCL ubiquitous

Jon Peddie Research: Feb

2017 report on the VPU

market

Page 21: Applying OpenCL - IWOCL4 © 2017 Codeplay Software Ltd 1 1 2 4 8 16 32 64 128 256 512 1,024 2,048 4,096 8,192 16,384 32,768 65,536 Google target Desktop GPU Integrated GPU Smartphone

© 2017 Codeplay Software Ltd21

• It’s royalty-free

• It’s programmable

• It’s very widely supported already

• Providing OpenCL brings in a wide ecosystem of software:• OpenCV, Halide, SYCL, OpenVX, clBLAS/clBLAST, TensorFlow, Caffe, ViennaCL,

Boost.compute, ….

Why OpenCL for new AI/vision processors?

Page 22: Applying OpenCL - IWOCL4 © 2017 Codeplay Software Ltd 1 1 2 4 8 16 32 64 128 256 512 1,024 2,048 4,096 8,192 16,384 32,768 65,536 Google target Desktop GPU Integrated GPU Smartphone

© 2017 Codeplay Software Ltd22

• We need to bring OpenCL to devices that are not GPUs• And so we need to focus on adding non-GPU features

• And removing GPU features

• While also still supporting the capabilities of GPUs• and CPUs, FPGAs, DSPS…

But, what do we need to solve?

Page 23: Applying OpenCL - IWOCL4 © 2017 Codeplay Software Ltd 1 1 2 4 8 16 32 64 128 256 512 1,024 2,048 4,096 8,192 16,384 32,768 65,536 Google target Desktop GPU Integrated GPU Smartphone

© 2017 Codeplay Software Ltd23

• We need to build out the ecosystem• Make it easier to bring OpenCL to new devices

• Make it easier to test OpenCL devices

• Make it easier to find all the existing OpenCL software

Page 24: Applying OpenCL - IWOCL4 © 2017 Codeplay Software Ltd 1 1 2 4 8 16 32 64 128 256 512 1,024 2,048 4,096 8,192 16,384 32,768 65,536 Google target Desktop GPU Integrated GPU Smartphone

© 2016 Codeplay Software Ltd. - CONFIDENTIAL24

• SYCL http://sycl.tech

• OpenCL https://www.khronos.org/opencl/

• OpenVX https://www.khronos.org/openvx/

• OpenCV http://opencv.org/

• Halide http://halide-lang.org/

• VisionCpp https://github.com/codeplaysoftware/visioncpp

• OpenCL org http://opencl.org/

• CLsmith http://multicore.doc.ic.ac.uk/tools/CLsmith/clsmith.php

• TensorFlow OpenCL http://ci.tensorflow.org/view/OpenCL/

Page 25: Applying OpenCL - IWOCL4 © 2017 Codeplay Software Ltd 1 1 2 4 8 16 32 64 128 256 512 1,024 2,048 4,096 8,192 16,384 32,768 65,536 Google target Desktop GPU Integrated GPU Smartphone

[email protected]@codeplaysoft codeplay.com

What do you want to accelerate?