
CLBlast: A Tuned BLAS Library for Faster Deep Learning

Cedric Nugteren

May 11, 2017

http://github.com/cnugteren/clblast

http://cnugteren.github.io/clblast


CLBlast: Tuned OpenCL BLAS, Cedric Nugteren, TomTom

The Heart of Deep Learning


GEMM is at the Heart of Deep Learning


So where are the Matrix-Multiplications?


Convolutions as Matrix Multiplication
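The slide gives only the title, so the following is a minimal sketch of the usual "im2col" lowering, assuming a single-channel image, stride 1 and no padding. All function names are hypothetical, and the plain triple-loop `matmul` merely stands in for a tuned GEMM routine.

```python
def im2col(image, kh, kw):
    """Unroll every kh x kw patch of a 2-D image into one column."""
    h, w = len(image), len(image[0])
    out_h, out_w = h - kh + 1, w - kw + 1
    # One row per kernel element, one column per output pixel.
    cols = [[0.0] * (out_h * out_w) for _ in range(kh * kw)]
    for y in range(out_h):
        for x in range(out_w):
            for dy in range(kh):
                for dx in range(kw):
                    cols[dy * kw + dx][y * out_w + x] = image[y + dy][x + dx]
    return cols, out_h, out_w


def matmul(a, b):
    """Plain triple-loop matrix product standing in for a tuned GEMM."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def conv2d_direct(image, kernel):
    """Reference: sliding-window cross-correlation, stride 1, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(kernel[dy][dx] * image[y + dy][x + dx]
                 for dy in range(kh) for dx in range(kw))
             for x in range(out_w)] for y in range(out_h)]


def conv2d_gemm(image, kernels):
    """Apply all filters with one product: (filters x kh*kw) times (kh*kw x pixels)."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    cols, out_h, out_w = im2col(image, kh, kw)
    # Flatten each filter into one row of the weight matrix.
    weights = [[k[dy][dx] for dy in range(kh) for dx in range(kw)] for k in kernels]
    flat = matmul(weights, cols)
    # Reshape each output row back into an out_h x out_w feature map.
    return [[row[y * out_w:(y + 1) * out_w] for y in range(out_h)] for row in flat]
```

The point of the lowering is that every filter is applied by a single matrix product, so the speed of the convolution layer is largely the speed of GEMM.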


GEMM is the Heart of Deep Learning


Does everyone agree?



Still true in 2017!


But why a new BLAS Library?

● NVIDIA’s cuBLAS is great, or is it?

– Not portable, not customisable, not open-source, ...


● Is AMD’s clBLAS great?

– Not performance portable, not well engineered, lack of new features, ...


Introducing CLBlast

● CLBlast: Modern C++11 OpenCL BLAS library

● Implements all BLAS routines for all precisions (S, D, C, Z)

● Accelerates all kinds of applications:

– Fluid dynamics, quantum chemistry, linear algebra, etc.

– Today’s focus: deep learning


● Already integrated into various projects:

– JOCLBlast (Java bindings)

– ArrayFire (GPU accelerated library and applications)

– OpenCL fork of Caffe (github.com/dividiti/ck-caffe)

– OpenCL fork of TF (github.com/hughperkins/tensorflow-cl)


[Figure annotations: CI and extensive testing; activity; community]


But… is it fast?

● All kernels are generic and tunable thanks to integration of the CLTune auto-tuner (presented at last year’s GTC)


● Tuned out-of-the-box for 40 common devices

– For new devices: run the auto-tuner when installing CLBlast
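The idea behind auto-tuning a generic kernel can be sketched as a search over parameter combinations that keeps the fastest one. This is an illustration only, not CLTune's interface: the function `tune`, the parameter names `TILE` and `VECTOR`, and the synthetic cost model are all assumptions, and the sketch searches exhaustively whereas a real tuner may sample the space instead.

```python
import itertools


def tune(search_space, benchmark):
    """Try every parameter combination and keep the one with the lowest cost.

    search_space: dict mapping a parameter name to its candidate values.
    benchmark:    callable taking one configuration dict and returning a cost
                  (lower is better), e.g. a measured kernel time in seconds.
    """
    names = sorted(search_space)
    best_config, best_cost = None, float("inf")
    for values in itertools.product(*(search_space[n] for n in names)):
        config = dict(zip(names, values))
        cost = benchmark(config)
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost


# Hypothetical search space and a made-up cost model standing in for a
# real timing run on the target device.
space = {"TILE": [8, 16, 32], "VECTOR": [1, 2, 4]}
best, cost = tune(space, lambda c: abs(c["TILE"] - 16) + abs(c["VECTOR"] - 4))
```

In practice `benchmark` would compile the kernel with the given parameters and time it on the device, and the winning configuration would be stored per device.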


CLBlast Benchmark Results

● Higher is better

● More results at http://cnugteren.github.io/clblast

[Figure: six benchmark panels: AXPY regular (in GB/s), AXPY odd (in GB/s), GEMV regular (in GB/s), GEMV odd (in GB/s), GEMM regular (in GFLOPS), GEMM odd (in GFLOPS)]


CLBlast on GeForce GTX750Ti

● On-par or better than clBLAS (especially for GEMM)


● ...but not as fast as NVIDIA’s cuBLAS


CLBlast on Radeon M370X

● On-par or better than clBLAS (especially for odd-sized GEMM)


CLBlast on Skylake ULT GT2

● On-par or better than clBLAS (especially for GEMM)


CLBlast on Core i5-6200U

● On-par or better than clBLAS (especially for AXPY & GEMV)


CLBlast for Deep Learning

● What can we do for the deep-learning community?

– Problem-specific tuning

– Half-precision floating-point (FP16)

– Batched routines


Tuning Only for a Single Size?

● Default GEMM tuning:

– 1024x1024 matrices

● Deep-learning:

– Various but fixed matrix sizes (dependent on network layout)

– Typically smaller and/or rectangular matrices


● Potential for optimal performance in CLBlast:

– Tuning for a custom size possible

– C++ API to change parameters at run-time
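One way to picture size-specific tuning is a table of parameter sets keyed by problem size, with the default tuning as a fallback and an optional run-time override on top. The sketch below is a hypothetical illustration in Python and does not show CLBlast's C++ API. The parameter names resemble GEMM tile-size parameters, but every value here is made up.

```python
# Fallback parameters, standing in for the default 1024x1024 tuning (made-up values).
DEFAULT_PARAMS = {"MWG": 64, "NWG": 64, "KWG": 32}

# Hypothetical results of tuning for specific (m, n, k) sizes used by a network.
TUNED_PARAMS = {
    (128, 128, 128): {"MWG": 32, "NWG": 32, "KWG": 16},
    (256, 64, 512): {"MWG": 64, "NWG": 16, "KWG": 32},
}


def select_params(m, n, k, overrides=None):
    """Pick kernel parameters for one GEMM size.

    Order of precedence: run-time overrides, then size-specific tuning,
    then the defaults.
    """
    params = dict(DEFAULT_PARAMS)
    params.update(TUNED_PARAMS.get((m, n, k), {}))
    if overrides:
        params.update(overrides)
    return params
```

Because a network's matrix sizes are fixed by its layout, such a table only needs one entry per distinct layer shape.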


Problem-Specific Tuning

● SGEMM tuning for Radeon M370X GPU


● Best on the diagonal

● >100% due to random tuning


● Gain of ~2x for some cases

[Figure label: default]


Half-precision floating-point (FP16)

● Double-precision (FP64) not needed for deep-learning

● Even FP32 is too much → introducing half-precision FP16

● Implemented in low-power devices (ARM Mali, Intel GPUs) and deep-learning specific GPUs (P100)


● Potential for 2x savings in: bandwidth, storage, compute, energy


● Current FP16 support for GPUs:

– cuBLAS: HGEMM only

– clBLAS: no FP16 at all

– CLBlast: all routines!
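The storage side of the 2x claim, and the precision given up for it, can be checked on the host with Python's `struct` module, whose `e` format is IEEE 754 half precision. This is only a sketch of the number format and involves no GPU or CLBlast call. The helper names and the sample values are assumptions.

```python
import struct


def pack_values(values, fmt):
    """Pack floats little-endian as 'e' (FP16), 'f' (FP32) or 'd' (FP64)."""
    return struct.pack("<%d%s" % (len(values), fmt), *values)


def roundtrip_fp16(x):
    """Store x as FP16 and read it back, exposing the rounding error."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


weights = [0.1, -0.5, 3.14159, 1024.0]

# Bytes needed to store the same four values in each precision.
sizes = {fmt: len(pack_values(weights, fmt)) for fmt in "efd"}

# -0.5 and 1024.0 survive exactly; 0.1 is rounded to the nearest FP16 value.
error = abs(roundtrip_fp16(0.1) - 0.1)
```

Each step down in precision halves the buffer size, which is also why memory traffic for the same number of elements is halved.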


Half-precision FP16 on Intel Skylake GPU

● FP16 ~1.8x faster across the board!

[Figure legend: FP32, FP16, clBLAS]


Batching BLAS routines

● Small-sized GEMM is super slow

– Not enough work-groups

– Not enough threads


● Let’s make it fast again:

– Combine multiple small GEMM operations into a single kernel

– Use offsets to indicate where the next matrices start
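The offset scheme can be sketched on the host as follows: all matrices of the batch live back to back in flat buffers, and one call walks the list of start offsets. This is a sequential reference in Python with a hypothetical signature, not CLBlast's batched routine. On a GPU the outer loop would instead be spread over additional work-groups inside a single kernel launch.

```python
def gemm_batched(m, n, k, a, a_offsets, b, b_offsets, c, c_offsets):
    """Compute C_i = A_i * B_i for every batch entry i in one call.

    a, b and c are flat row-major buffers holding many matrices back to back;
    the offset lists give the index at which each matrix starts.
    """
    for a_off, b_off, c_off in zip(a_offsets, b_offsets, c_offsets):
        for i in range(m):
            for j in range(n):
                acc = 0.0
                for p in range(k):
                    acc += a[a_off + i * k + p] * b[b_off + p * n + j]
                c[c_off + i * n + j] = acc
    return c


# Two 2x2 products stored in shared buffers: A0 * I and (2I) * B1.
a = [1.0, 2.0, 3.0, 4.0, 2.0, 0.0, 0.0, 2.0]
b = [1.0, 0.0, 0.0, 1.0, 1.0, 2.0, 3.0, 4.0]
c = gemm_batched(2, 2, 2, a, [0, 4], b, [0, 4], [0.0] * 8, [0, 4])
```

Passing offsets rather than separate buffers means the whole batch needs only three allocations, regardless of the batch size.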


Batched GEMM on GeForce GTX 750Ti

● SGEMM 128x128x128:

– Regular: ~40 GFLOPS

– Batched: ~10 GFLOPS (1 GEMM) up to ~500 GFLOPS (8K)!

[Figure: x-axis: batch size]
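To put these rates in perspective, a small back-of-the-envelope calculation helps. It assumes the common convention of counting 2·m·n·k floating-point operations per GEMM (one multiply and one add per inner-loop step), and the helper names are hypothetical.

```python
def gemm_flops(m, n, k):
    """Floating-point operations in one GEMM: a multiply and an add per (i, j, p)."""
    return 2 * m * n * k


def gemm_time_us(m, n, k, gflops):
    """Time in microseconds for one GEMM at a sustained rate given in GFLOPS."""
    return gemm_flops(m, n, k) / (gflops * 1e9) * 1e6


flops = gemm_flops(128, 128, 128)                  # 4,194,304 operations
regular_us = gemm_time_us(128, 128, 128, 40.0)     # about 105 microseconds
batched_us = gemm_time_us(128, 128, 128, 500.0)    # about 8.4 microseconds per GEMM
```

A single 128x128x128 product is only a few million operations, which is too little work to occupy the device on its own.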


Batched GEMM on GeForce GTX 750Ti

● Significant benefits for larger sizes as well

– Mostly beneficial in the range n=64 to 512

[Figure panels: 8 GEMMs, 64 GEMMs]


What’s next?

● More features for deep learning:

– ‘im2col’

– Winograd? FFT?

● Input-based auto-tuning using learned models

– Similar to S7150: The ISAAC library

● Integration into OpenCL deep-learning projects

– TensorFlow SYCL? LibDNN?

● Suggestions?


Why is BLAS Important for TomTom?

● HDMap making → Deep-learning

● Deep-learning → Fast BLAS libraries

● More info: S7809 - A Multi-Source, Multi-Sensor Approach to HDMap Creation

– Willem Strijbosch - Head of Autonomous Driving, TomTom

– Today at 10:30 AM in room 210D


Conclusion

● Introducing CLBlast: a modern C++11 OpenCL BLAS library

● Performance portable thanks to generic kernels and auto-tuning

● Especially targeted at accelerating deep-learning:

– Problem-size specific tuning:

● Up to 2x in an example experiment

– Half-precision FP16 support:

● Up to 2x benefit in speed and memory savings

– Batched GEMM routine:

● Order of magnitude benefit depending on the use-case
