applying opencl - iwocl4 © 2017 codeplay software ltd 1 1 2 4 8 16 32 64 128 256 512 1,024 2,048...
TRANSCRIPT
Applying OpenCLIWOCL, May 2017
Andrew Richards
© 2017 Codeplay Software Ltd2
The next generation of software will not be built on CPUs
© 2017 Codeplay Software Ltd3
“On a 100 millimetre-squared chip, Google needs something like 50 teraflops of performance”
- Daniel Rosenband (Google’s self-driving car project) at HotChips 2016
© 2017 Codeplay Software Ltd4
1
1
2
4
8
16
32
64
128
256
512
1,024
2,048
4,096
8,192
16,384
32,768
65,536
Google target
Desktop GPU
Integrated GPU
Smartphone GPU
Smartphone CPU
Desktop CPU
Performance trends
GF
LO
PS
Year of introduction
© 2017 Codeplay Software Ltd5
AI processor
Parallelism of GPU
Power efficiency
of GPU
Remove graphics
The rise of the AI processor
© 2017 Codeplay Software Ltd6
How do we connect tomorrow’s software to tomorrow’s silicon?
© 2017 Codeplay Software Ltd7
1. Make it fast
2. Make it safe
3. Make it ubiquitous
OpenCL: Our targets for 2017 and beyond
© 2017 Codeplay Software Ltd8
Hand-optimized operations
Kernel fusion
Custom operations
How do we get performance on accelerators?
© 2017 Codeplay Software Ltd9
Kernel fusion: some numbers
0
10
20
30
40
50
60
70
80
90
100
OpenCV (nodes) OpenCV (graph) Halide (nodes) Halide (graph) SYCL (nodes) SYCL (graph)
Effect of combining graph nodes on performance
Kernel time (ms) Overhead time (ms)
In this example,
we perform 3
image
processing
operations on an
accelerator and
compare 3
systems when
executing
individual nodes,
or a whole graph
Halide and
SYCL use kernel
fusion, whereas
OpenCV does
not. For all 3
systems, the
performance of
the whole graph
is significantly
better than
individual nodes
executed on
their own
The system is an
AMD APU and the
operations are:
RGB->HSV,
channel masking,
HSV->RGB
© 2017 Codeplay Software Ltd10
0
0.2
0.4
0.6
0.8
1
1.2
10 Fused 80 Fused 640 Fused 4096 Fused
No
rmal
ized
tim
e (l
ow
er is
bet
ter)
TensorFlow Eigen Kernel Fusion
Kernel1 Kernel2 Kernel3 Fused
Applying fusion to TensorFlow Eigen
Spee
du
p b
y fu
sio
n
Spee
du
p b
y fu
sio
n
Spee
du
p b
y fu
sio
n
Spee
du
p b
y fu
sio
n
This is how
TensorFlow
uses Eigen to
achieve
kernel-fusion
CUDA does
this for NVIDIA
GPUs, SYCL
is used here
for AMD GPUs
-1x0x1x2x3x4x5x6x7x8x9x
10x11x12x13x14x15x16x17x18x
Performance improvement at size 4,000
Improvement at 4,000
Unfused
performance
improvement: AMD
GPU vs multi-core
Intel CPU
Total performance
improvement
delivered by SYCL is
both of these graphs
combined
© 2017 Codeplay Software Ltd11
Hand-optimized operations
Kernel fusion
Custom operations ?
How do we combine our requirements?
© 2017 Codeplay Software Ltd12
Kernel fusion system
Custom operations written in
fusable form
Hand-coded operations written in
fusable form
How do we fuse custom and hand-coded kernels?
© 2017 Codeplay Software Ltd13
• We need a language and compiler that:
1. Lets users easily write custom operations
2. Lets hardware experts drill-down and write device-specific optimized code
3. Allows code to be efficiently fused
© 2017 Codeplay Software Ltd14
• We need a language and compiler that:
1. Lets users easily write custom operationsC++ is a well-understood programming language that programmers can use
2. Lets hardware experts drill-down and write device-specific optimized codeC++ allows expert programmers to write low-level device-specific optimized code
3. Allows code to be efficiently fusedC++ single-source lets us fuse kernels
… and is already used
© 2017 Codeplay Software Ltd15
• We need a language and compiler that:
1. Lets users easily write custom operations
2. Lets hardware experts drill-down and write device-specific optimized code
3. Allows code to be efficiently fused
But, now we have SPIR/SPIR-V, you could write your own compiler to solve this
© 2017 Codeplay Software Ltd16
Make it safe
© 2017 Codeplay Software Ltd17
• Our challenges:• We need all the tools to follow standard safety-critical processes
• We need predictable timing
• We need to test the OpenCL implementations thoroughly
• We need to test OpenCL code in extreme situations
• We need to be able to handle highly parallel errors and recovery
OpenCL SC
© 2017 Codeplay Software Ltd18
• Each challenge is a massive challenge
• We need to come together to solve these challenges• Academics and industry
• Parallelism, safety, automotive, medical, formal methods, testing…
OpenCL SC: We need to work together
© 2017 Codeplay Software Ltd19
Make it ubiquitous
© 2017 Codeplay Software Ltd20
Making OpenCL ubiquitous
Jon Peddie Research: Feb
2017 report on the VPU
market
© 2017 Codeplay Software Ltd21
• It’s royalty-free
• It’s programmable
• It’s very widely supported already
• Providing OpenCL brings in a wide ecosystem of software:• OpenCV, Halide, SYCL, OpenVX, clBLAS/clBLAST, TensorFlow, Caffe, ViennaCL,
Boost.compute, ….
Why OpenCL for new AI/vision processors?
© 2017 Codeplay Software Ltd22
• We need to bring OpenCL to devices that are not GPUs• And so we need to focus on adding non-GPU features
• And removing GPU features
• While also still supporting the capabilities of GPUs• and CPUs, FPGAs, DSPS…
But, what do we need to solve?
© 2017 Codeplay Software Ltd23
• We need to build out the ecosystem• Make it easier to bring OpenCL to new devices
• Make it easier to test OpenCL devices
• Make it easier to find all the existing OpenCL software
© 2016 Codeplay Software Ltd. - CONFIDENTIAL24
• SYCL http://sycl.tech
• OpenCL https://www.khronos.org/opencl/
• OpenVX https://www.khronos.org/openvx/
• OpenCV http://opencv.org/
• Halide http://halide-lang.org/
• VisionCpp https://github.com/codeplaysoftware/visioncpp
• OpenCL org http://opencl.org/
• CLsmith http://multicore.doc.ic.ac.uk/tools/CLsmith/clsmith.php
• TensorFlow OpenCL http://ci.tensorflow.org/view/OpenCL/
[email protected]@codeplaysoft codeplay.com
What do you want to accelerate?