clblast: a tuned opencl blas library - arxiv · clblast: a tuned opencl blas library cedric...

10
CLBlast: A Tuned OpenCL BLAS Library Cedric Nugteren TomTom Amsterdam, The Netherlands [email protected] ABSTRACT This work introduces CLBlast, an open-source BLAS library provid- ing optimized OpenCL routines to accelerate dense linear algebra for a wide variety of devices. It is targeted at machine learning and HPC applications and thus provides a fast matrix-multiplication routine (GEMM) to accelerate the core of many applications (e.g. deep learning, iterative solvers, astrophysics, computational fluid dynamics, quantum chemistry). CLBlast has five main advantages over other OpenCL BLAS libraries: 1) it is optimized for and tested on a large variety of OpenCL devices including less commonly used devices such as embedded and low-power GPUs, 2) it can be explic- itly tuned for specific problem-sizes on specific hardware platforms, 3) it can perform operations in half-precision floating-point FP16 saving bandwidth, time and energy, 4) it has an optional CUDA back-end, 5) and it can combine multiple operations in a single batched routine, accelerating smaller problems significantly. This paper describes the library and demonstrates the advantages of CLBlast experimentally for different use-cases on a wide variety of OpenCL hardware. CCS CONCEPTS General and reference Performance; Theory of computa- tion Parallel computing models; Computing methodologies Massively parallel algorithms; Parallel programming languages; KEYWORDS OpenCL, BLAS, GEMM, Auto-Tuning, GPU, FP16, CUDA, Batched GEMM, HPC, Deep Learning ACM Reference Format: Cedric Nugteren. 2018. CLBlast: A Tuned OpenCL BLAS Library. In IWOCL ’18: International Workshop on OpenCL (IWOCL ’18), May 14–16, 2018, Oxford, United Kingdom. ACM, New York, NY, USA, 10 pages. https://doi.org/10. 1145/3204919.3204924 1 INTRODUCTION Efficient and fast software has become more important than ever as transistor scaling benefits are diminishing [3], affecting all types of platforms: from embedded devices to desktops and supercom- puters. Most of such high performance software is built up around Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. IWOCL ’18, May 14–16, 2018, Oxford, United Kingdom © 2018 Copyright held by the owner/author(s). Publication rights licensed to the Association for Computing Machinery. ACM ISBN 978-1-4503-6439-3/18/05. . . $15.00 https://doi.org/10.1145/3204919.3204924 basic building blocks, of which the ‘Basic Linear Algebra Subrou- tines’ (BLAS) library is one of the most widely used. BLAS is the main dense linear algebra library, providing among others GEMV (generalized matrix-vector multiplication) and GEMM (generalized matrix-multiplication). These routines are nowadays even more important due to their widespread use in deep learning: the most common and compute intensive layers in neural networks are the convolution layers (which can be expressed as the GEMM routine) and the fully-connected layers (either GEMM or GEMV) [4, 19, 20]. Apart from its new use for deep learning, BLAS remains a pillar for many HPC application areas such as quantum chemistry and fluid dynamics, and for other domains such as machine learning in general, computer vision, and data analytics. Now, more than 30 years after the introduction of the origi- nal Netlib BLAS API, many highly optimized implementations are available for all kinds of purposes and platforms: ATLAS, BLIS, GotoBLAS, OpenBLAS, MKL and so on. However, for graphics pro- cessing units (GPUs) and other parallel processors there are fewer alternatives. The most well-known GPU BLAS implementation is NVIDIA’s cuBLAS. However, since it is written in CUDA, cuBLAS will not work on non-NVIDIA hardware. Furthermore, it is closed- source. The main alternative is the open-source clBLAS library, written in OpenCL and thus supporting many platforms. However, it is originally designed for AMD GPUs and does not perform well or sometimes does not work at all on other devices which support OpenCL, such as GPUs from NVIDIA and Intel, embedded devices (e.g. Mali, Adreno), FPGAs and CPUs. Moreover, features relevant for deep learning such as half-precision and batched operations are missing in clBLAS. This paper presents CLBlast, a BLAS library written in OpenCL targeting a wide variety of devices including GPUs. It is open-source, it is written in C++11 and OpenCL, it is well tested on different platforms, and it implements a superset of the BLAS routines. This paper introduces CLBlast and subsequently discusses its five main advantages: (1) All kernels in CLBlast are highly parameterized and are device-agnostic. That way, they can be auto-tuned for a given OpenCL device through integration of the CLTune auto- tuner [14]. This results in performance portability, which is demonstrated in this paper by showing matching or superior performance compared to clBLAS on NVIDIA, AMD, Intel and ARM devices. (2) Thanks to integration of an auto-tuner, users can also tune CLBlast for specific problem-sizes. For example, in deep learning, matrices will have a particular shape depending on the configuration of a neural network layer. Using the auto-tuner, performance of CLBlast can be maximized for a specific problem. We demonstrate the benefits of problem- specific tuning experimentally. arXiv:1705.05249v2 [cs.MS] 27 Apr 2018

Upload: danghuong

Post on 03-May-2018

253 views

Category:

Documents


2 download

TRANSCRIPT

CLBlast: A Tuned OpenCL BLAS LibraryCedric Nugteren

TomTomAmsterdam, The Netherlands

[email protected]

ABSTRACTThis work introduces CLBlast, an open-source BLAS library provid-ing optimized OpenCL routines to accelerate dense linear algebrafor a wide variety of devices. It is targeted at machine learning andHPC applications and thus provides a fast matrix-multiplicationroutine (GEMM) to accelerate the core of many applications (e.g.deep learning, iterative solvers, astrophysics, computational fluiddynamics, quantum chemistry). CLBlast has five main advantagesover other OpenCL BLAS libraries: 1) it is optimized for and testedon a large variety of OpenCL devices including less commonly useddevices such as embedded and low-power GPUs, 2) it can be explic-itly tuned for specific problem-sizes on specific hardware platforms,3) it can perform operations in half-precision floating-point FP16saving bandwidth, time and energy, 4) it has an optional CUDAback-end, 5) and it can combine multiple operations in a singlebatched routine, accelerating smaller problems significantly. Thispaper describes the library and demonstrates the advantages ofCLBlast experimentally for different use-cases on a wide variety ofOpenCL hardware.

CCS CONCEPTS•General and reference→ Performance; • Theory of computa-tion→ Parallel computing models; • Computing methodologies→ Massively parallel algorithms; Parallel programming languages;

KEYWORDSOpenCL, BLAS, GEMM, Auto-Tuning, GPU, FP16, CUDA, BatchedGEMM, HPC, Deep LearningACM Reference Format:Cedric Nugteren. 2018. CLBlast: A Tuned OpenCL BLAS Library. In IWOCL’18: International Workshop on OpenCL (IWOCL ’18), May 14–16, 2018, Oxford,United Kingdom. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3204919.3204924

1 INTRODUCTIONEfficient and fast software has become more important than everas transistor scaling benefits are diminishing [3], affecting all typesof platforms: from embedded devices to desktops and supercom-puters. Most of such high performance software is built up around

Permission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the full citationon the first page. Copyrights for components of this work owned by others than theauthor(s) must be honored. Abstracting with credit is permitted. To copy otherwise, orrepublish, to post on servers or to redistribute to lists, requires prior specific permissionand/or a fee. Request permissions from [email protected] ’18, May 14–16, 2018, Oxford, United Kingdom© 2018 Copyright held by the owner/author(s). Publication rights licensed to theAssociation for Computing Machinery.ACM ISBN 978-1-4503-6439-3/18/05. . . $15.00https://doi.org/10.1145/3204919.3204924

basic building blocks, of which the ‘Basic Linear Algebra Subrou-tines’ (BLAS) library is one of the most widely used. BLAS is themain dense linear algebra library, providing among others GEMV(generalized matrix-vector multiplication) and GEMM (generalizedmatrix-multiplication). These routines are nowadays even moreimportant due to their widespread use in deep learning: the mostcommon and compute intensive layers in neural networks are theconvolution layers (which can be expressed as the GEMM routine)and the fully-connected layers (either GEMM or GEMV) [4, 19, 20].Apart from its new use for deep learning, BLAS remains a pillarfor many HPC application areas such as quantum chemistry andfluid dynamics, and for other domains such as machine learning ingeneral, computer vision, and data analytics.

Now, more than 30 years after the introduction of the origi-nal Netlib BLAS API, many highly optimized implementations areavailable for all kinds of purposes and platforms: ATLAS, BLIS,GotoBLAS, OpenBLAS, MKL and so on. However, for graphics pro-cessing units (GPUs) and other parallel processors there are feweralternatives. The most well-known GPU BLAS implementation isNVIDIA’s cuBLAS. However, since it is written in CUDA, cuBLASwill not work on non-NVIDIA hardware. Furthermore, it is closed-source. The main alternative is the open-source clBLAS library,written in OpenCL and thus supporting many platforms. However,it is originally designed for AMD GPUs and does not perform wellor sometimes does not work at all on other devices which supportOpenCL, such as GPUs from NVIDIA and Intel, embedded devices(e.g. Mali, Adreno), FPGAs and CPUs. Moreover, features relevantfor deep learning such as half-precision and batched operations aremissing in clBLAS.

This paper presents CLBlast, a BLAS library written in OpenCLtargeting awide variety of devices includingGPUs. It is open-source,it is written in C++11 and OpenCL, it is well tested on differentplatforms, and it implements a superset of the BLAS routines. Thispaper introduces CLBlast and subsequently discusses its five mainadvantages:

(1) All kernels in CLBlast are highly parameterized and aredevice-agnostic. That way, they can be auto-tuned for a givenOpenCL device through integration of the CLTune auto-tuner [14]. This results in performance portability, which isdemonstrated in this paper by showing matching or superiorperformance compared to clBLAS on NVIDIA, AMD, Inteland ARM devices.

(2) Thanks to integration of an auto-tuner, users can also tuneCLBlast for specific problem-sizes. For example, in deeplearning, matrices will have a particular shape dependingon the configuration of a neural network layer. Using theauto-tuner, performance of CLBlast can be maximized for aspecific problem. We demonstrate the benefits of problem-specific tuning experimentally.

arX

iv:1

705.

0524

9v2

[cs

.MS]

27

Apr

201

8

IWOCL ’18, May 14–16, 2018, Oxford, United Kingdom Cedric Nugteren

(3) In contrast to existing OpenCL BLAS libraries, CLBlast alsoimplements half-precision routines using the 16-bit floating-point format (FP16). This reduces storage and bandwidthrequirements by a factor two, but also allows for much fasterandmore energy efficient computations (e.g. around 2x fasterGEMM on a Skylake GT2 GPU or Mali GPU).

(4) The library internally abstracts the OpenCL API behind thenew high-level CLCudaAPI. With this API, porting CLBlasthost-code to CUDA is trivial, requiring only a simple headerchange. CLCudaAPI also provides an OpenCL-to-CUDA ker-nel header making kernel porting easy as well. Thanks tothis, CLBlast has a CUDA back-end as well, making it the firstfully-featured CUDA BLAS library which is open-source.

(5) CLBlast provides a special interface for batching BLAS rou-tines. As we show in this work, this can yield up to an orderof magnitude better performance especially when processingmany small vectors or matrices. This is of special interest fordeep learning, as multiple smaller operations are typicallybatched.

2 RELATEDWORKFrom a technical perspective, AMD’s clBLAS is the most closely re-latedwork: it is also anOpenCL BLAS open-source library. However,it does not have CLBlast’s performance portability, problem-specifictuning, FP16 support and batched routines. Furthermore, from atechnical perspective it lacks proper testing on less-common de-vices, it has no C++ interface, it requires a newer version of OpenCL(1.2 instead of 1.1), and its OpenCL kernels are partly generatedas strings and thus not easily readable or editable. The originalauthors are no longer developing clBLAS, but they have re-focusedon rocBLAS1: a work-in-progress BLAS library written in HIP thatcurrently only supports a small subset of all BLAS routines anddata-types.

NVIDIA’s cuBLAS is also related, but it is closed source andCUDA-only and thus does not run on non-NVIDIA hardware. Thosetwo libraries (cuBLAS and clBLAS) are also combined togetherwith additional auto-tuned kernels in the ISAAC project2. However,ISAAC is not a full BLAS library yet, supporting only a few routinesso far. However, it does support input-size aware auto-tuning basedon a learned tuning parameters model [17].

Other related OpenCL libraries are clMAGMA, ArrayFire, andViennaCL, but they focus on higher-level routines such as LAPACKrather than BLAS. From a non-OpenCL perspective, ATLAS is themost relevant work, as it also includes a device-specific auto-tuner.

From a scientific perspective, several works have previously pub-lished auto-tuning and optimization approaches for dense matrix-matrix multiplications [7, 8, 10, 11, 17, 21]. In fact, the GEMM kernelin CLBlast is based on and evolved from the work by Matsumotoet al. [10]. There are also several recent publications on batchedGEMM operations and more in general on optimizing GEMM forsmall matrices [1, 5, 9]. Related to the batched operations in CLBlastis also a comparison article for possible standard interfaces [16].

1rocBLAS: http://github.com/ROCmSoftwarePlatform/rocBLAS2ISAAC project: http://github.com/ptillet/isaac

3 THE CLBLAST LIBRARYCLBlast is an APACHE 2.0 licensed open-source3 OpenCL imple-mentation of the BLAS API. The host code is written in C++11and the kernel code in OpenCL C, compatible with any device sup-porting the OpenCL 1.1 or newer standards. There are automatedbuild tests on Windows, macOS and Linux systems and there iscontinuous integration through automated correctness tests on sixdifferent devices from three different vendors. Furthermore, thereare Python bindings in the form of PyCLBlast. CLBlast has an ac-tive community: there are third party Java bindings (JOCLBlast)and Nim bindings (nimCLBlast), 11 contributors, 50+ forks, 100+resolved issues, it is the standard OpenCL BLAS back-end for Array-Fire, it is being used in experimental OpenCL versions of Caffe4 andTensorflow5 [15], and it is used in PyTorch through libgpuarray6.

The CLBlast project was created as a stand-alone project ratherthan by extending and improving the existing clBLAS project, be-cause the kernels in clBLAS are generated from C++ code, makingthem very difficult to read, extend and maintain. Furthermore, thelack of development by clBLAS’s authors and the use of good codingpractices contributed to this decision.

3.1 Library DesignImplementing the exact Netlib BLAS API would require internalhost-to-device and device-to-host OpenCL transfers in the library.This could be detrimental for performance, especially for O(n)and O(n2) BLAS routines. That is why clBLAS and cuBLAS takepointers to device memory in the API, leaving full control over datatransfers to the user. For the same reasons, CLBlast also providesthis as the main interface. There is a C, C++ and Java interfaceavailable. On top of this, there is also a fully compatible NetlibBLAS interface, but this is not recommended for performance sinceOpenCL data-transfers are done internally, leaving no control tothe user.

BLAS routines are divided into three levels. CLBlast implementsall of these routines plus a few extra, see table 1: 10 level-1 scalar,vector, and vector-vector routines, 23 level-2 matrix-vector routines,9 level-3 matrix-matrix routines, and 9 extra BLAS-like routines.For each of these routines, CLBlast has (if possible) an implemen-tation in 5 different precisions: half-precision FP16 (e.g. HGEMM),single-precision FP32 (e.g. SGEMM), double-precision FP64 (e.g.DGEMM), complex single-precision 2xFP32 (e.g. CGEMM), andcomplex double-precision 2xFP64 (e.g. ZGEMM).

Although there are 51 routines per precision in CLBlast, there arenot that many OpenCL kernels implemented. First of all, the kernelsare precision-agnostic. Although C++ templates aren’t supportedin OpenCL C 1.1, we can still use a type alias and at kernel-compile-time define the type as either half, single, double precision or one ofthe complex data-types. Second, there are several families of kernelswhich can be re-used for other routines. The kernels axpy, dot,gemv, ger and gemm span already most of the routines given somesupport kernels such as copying or padding vectors and matrices.For example, to implement the GBMV routine, CLBlast uses the

3CLBlast repository: http://github.com/CNugteren/CLBlast4Caffe with CLBlast: http://github.com/dividiti/ck-caffe5OpenCL Tensorflow: http://github.com/hughperkins/tensorflow-cl6Libgpuarray: http://deeplearning.net/software/libgpuarray

CLBlast: A Tuned OpenCL BLAS Library IWOCL ’18, May 14–16, 2018, Oxford, United Kingdom

Table 1: Routines available in CLBlast

level routines1 AXPY COPY SCAL SWAP

AMAX ASUM DOT DOTC DOTU NRM22 GBMV GEMV HBMV HEMV HPMV SBMV

SPMV SYMV TMBV TPMV TRMV TRSVGER GERC GERU HER HER2 HPR HPR2SPR SPR2 SYR SYR2

3 GEMM HEMM HER2K HERK SYMMSYR2K SYRK TRMM TRSM

extra SUM MAX MIN AMIN OMATCOPY IM2COLAXPYBATCHED GEMMBATCHEDGEMMSTRIDEDBATCHED

gemv OpenCL kernel but uses a pre-processor macro to change theloading of the input from a general matrix into a bandedmatrix. Theremainder of the kernel with all performance-critical optimizationscan be re-used, avoiding code duplication. This is done for almost allroutines, similarly to what is described in [10] for level-3 routines.

3.2 Parameterized KernelsAll kernel implementations in CLBlast are written in a highly pa-rameterized way to be tunable across devices: they are not writtenfor a specific device nor OpenCL implementation. This is achievedby creating pre-processor constants which can be changed withoutaffecting the correctness of the program. A simplified example ofthe axpy kernel in figure 1 illustrates this: the local work-group sizeis tunable (WGS), the amount of work-per-thread can vary (WPT),and the vector width is configurable (VW). In contrast to clBLAS,we rely on the target compiler to perform the low-level optimiza-tions such as loop-unrolling and pointer-arithmetic, increasing thereadability, portability, and maintainability of the kernels.

1 # d e f i n e WGS 64 / / The l o c a l work−group s i z e2 # d e f i n e WPT 4 / / The amount o f work−per−t h r e ad3 # d e f i n e VW 2 / / Width o f v e c t o r s X and Y45 typedef f l oa t dtype ; / / Example data−type6 # i f VW == 17 typedef f l oa t dtypeV ;8 # e l i f VW == 29 typedef f l o a t 2 dtypeV ;10 # e n d i f / / and s i m i l a r l y f o r VW == { 4 , 8 , 1 6 }1112 __ke rne l _ _ a t t r i b u t e _ ( r eqd_work_group_s i ze (WGS) )13 void Xaxpy ( const int n , const dtype alpha ,14 const _ _ g l o b a l dtypeV ∗ r e s t r i c t xgm ,15 _ _ g l o b a l dtypeV ∗ ygm ) {16 #pragma un r o l l17 for ( in t w = 0 ; w < WPT; ++w) {18 in t i = w ∗ g e t _ g l o b a l _ s i z e ( 0 ) + g e t _ g l o b a l _ i d ( 0 ) ;19 ygm[ i ] = ygm[ i ] + a lpha ∗ xgm[ i ] ;20 }21 }

Figure 1: Simplified example of a parameterized OpenCLkernel from CLBlast.

Although it is unfeasible to discuss all of CLBlast’s kernels andtheir parameters here, we believe that it is important to brieflyillustrate CLBlast’s generality with a larger parameterized kernel as

Nwi

Nwg

n

Mwi

Mwg

m

+=

C

Mwi

Mwg

m

Kwg

k X

A^T

Nwi

Nwg

n

B

M xM = Mwi dimC wg

K isunrolled by a factor Kwg wi

Perwork-group tiling

Perthread tiling

Figure 2: Matrix-multiplication and some of its tuning pa-rameters. The blue area indicates work done by a singlethread (‘work-item’), the orange area indicates work doneper work-group.

well: the gemm kernel. Similar to [11], we make many assumptionson the input arguments, which are handled by pre-processing andpost-processing kernels. These assumptions are e.g. matrix sizes area multiple of the work-group sizes, offsets are zero, and matrix B istransposed. This is a good solution for larger problem sizes sinceO(n2) data movement is typically cheaper than O(n3) computation,but the hidden constant starts to play a role for smaller n. Therefore,there is also a single-kernel ‘direct’ version available for those cases,but it shares most of the design and parameters as discussed below.

The gemm kernel has 14 different parameters, of which 6 areillustrated in figure 2. The parameters define among others thework-group sizes in 2 dimensions (Mwд ,Nwд ), the 2D register tilingconfiguration (Mwi ,Nwi ), the vector widths of both input matrices,loop unroll factors (Kwi ), and whether or not and how to use thelocal memory. For more details we refer to the CLTune paper whichdiscusses an earlier version of the kernel [14], and also to the workof Matsumoto et al. which served as inspiration for the design ofthe kernel [11].

3.3 Performance TuningThe parameterized kernels in CLBlast can be tuned for a specificdevice and/or specific problem size using an integrated version ofthe CLTune library7, which is an open-source CUDA and OpenCLauto-tuner written in C++. For details on the CLTune library, werefer to [14]. CLBlast provides binaries to interface with the auto-tuner for each kernel. By default, these tuners will find optimalkernel parameters for each kernel, for each precision, and for afixed problem size. Tuning results for previously unseen devicesare collected in a central tuning database8 from which CLBlasttakes its optimized parameters. The database contains timings ofeach kernel execution while tuning, not just the optimal. Thus, thedatabase can be used for other purposes as well, such as researchon performance modeling or optimal parameter prediction.

To illustrate how the tuners in CLBlast work, consider the gemmkernel discussed earlier. Although each of the parameters has atmost only 4 or 5 reasonable values to try, the total search-spaceexplodes quickly and has more than 100.000 possible combinations

7CLTune: http://github.com/CNugteren/CLTune8DB: http://github.com/CNugteren/CLBlast-database

IWOCL ’18, May 14–16, 2018, Oxford, United Kingdom Cedric Nugteren

to explore for the 14 parameters. This is even after filtering out non-valid parameter combinations due to device or software restrictions(e.g. maximum work-group size, local memory size). Dependingon the use-case the amount of combinations might be too much toexplore. Therefore, CLBlast defines two sets of tuning parameters:one set with the most likely combinations (e.g. 500) and one set withall combinations. The first set is explored exhaustively, while theother set is additionally explored by random sampling in the search-space. The tuner will thus always explore the basic kernel parametercombinations and on top of that an extra user-configurable amount.

All kernels in CLBlast have already been tuned for around 50different devices thanks to the community, and they can be tunedfor any new device. Nevertheless, the library can also perform de-cently on previously unseen devices: default parameters per devicevendor/type and even per device architecture are computed by tak-ing the average best performing parameters for similar devices. Forexample, kernel parameters for a new unseen AMD GPU will be setto the parameters corresponding to the average best performingcase across all existing AMD GPU devices in the database or acrossall device of that specific architecture if already present. (e.g Tahiti,Vega). Of course, there is no guarantee that performance on the newdevice will be good or that the parameters are legal at all. However,in case the device is significantly different, the user can alwaysstill run the tuners on his/her device to make sure performance isoptimized.

The default tuning results are only for a pre-defined problem sizeto limit the total run-time of the tuners. Nevertheless, users can stilltune for their specific problem size, e.g. a small rectangular matrixof size 279 by 32. The CLBlast API furthermore provides an interfaceto change the library’s default parameters at run-time. Thus, theuser can also programmatically provide optimized parameters for(multiple) custom problem sizes and/or devices.

4 RESULTSThis section contains several experimental results and details onthe setup used in this paper. All results and graphs shown are alsoavailable on-line9, including results for the same and other devices.

4.1 Experimental SetupThis section explains the set-up used in this paper. In this paper wereport results of single-precision and half-precision. We test on theOpenCL devices listed in table 2.

Table 2: Overview of the tested OpenCL devices

vendor and archi- library GFLOPSdevice name tecture and SDK and GB/sNVIDIA GTX 750Ti Maxwell CUDA 8.0 1305 – 88NVIDIA Titan X Pascal CUDA 9.0 10974 – 480ARM Mali T628 Midgard Mali r12p0 38 – 15AMD Radeon HD7970 Tahiti APP 3.0 3789 – 264AMD Radeon M370X GCN 1 Apple 2.4.2 1024 – 72Intel Skylake ULT GT2 Iris Beignet 1.3 384 – 30Intel Core i5-6200U Skylake Intel 73 – 30

9On-line appendix of this paper with more graphs at:http://cnugteren.github.io/clblast/

All experimental results presented in the following sections arefully reproducible: they are obtained through the official CLBlast‘clients’: binaries which compare run-time of CLBlast against otherlibraries. The clients perform a warm-up run first, followed by 10repeated timed runs, and output the final graphs as shown in thispaper directly. The reported results are based on the average timeover those runs. We report GB/s for routines which are typicallybandwidth-bound (level-1 and level-2) and GFLOPS for routineswhich are typically compute-bound (level-3). We test with CLBlastrelease 1.3.0 (latest as of January 2018) and compare against:

(1) AMD’s clBLAS 2.12 (latest version as of January 2018). Afterinstalling clBLAS, we run the included ‘clBLAS-tune’ to fine-tune performance. However, for some devices the tuner didnot complete successfully.

(2) NVIDIA’s cuBLAS, included as part of the CUDA installation(see table 2 for the version).

4.2 Performance Across DevicesCLBlast is performance-portable due to its tuning capabilities andthe generic OpenCL kernels. However, the level of performanceachieved is of course still limited by the design and flexibility of theimplemented kernels. The design of CLBlast has mainly focused onthe gemm kernel which is used for almost all level-3 BLAS operations.

In a first set of experiments in figure 3we demonstrate performance-portability across devices. We test on 6 very different devices (seetable 2) for 3 different types of routines: AXPY, GEMV, and GEMM.These routines are chosen for being the most representative foreach BLAS level and actually include kernels covering almost allroutines. We show results for both multiples of a power-of-2 andfor a variety of irregular vector and matrix sizes: multiples of someodd number. For the full results including more complete experi-ments we refer to the on-line appendix9. We draw the followingconclusions from figure 3:

• The AXPY results are for 4 of the 6 devices roughly on-par with clBLAS and cuBLAS. This is to be expected, as itis a simple bandwidth-bound operation. CLBlast tunes thework-group size, the amount of work-per-thread, and thevector width, but the first two don’t matter too much formost devices as long as they don’t take extreme values. Forthe CPU experiment this matters more, in which CLBlast ismuch closer to the peak memory bandwidth. On the Titan X,CLBlast is far ahead of clBLAS and almost matches cuBLAS.

• The GEMV results also show on-par or slightly better per-formance compared to clBLAS and cuBLAS for most de-vices. More elaborate tests in the on-line appendix9 showthat CLBlast’s GEMV routine is still sub-optimal: clBLASachieves better performance for certain cases and devices.Nevertheless, in certain cases CLBlast is much faster, suchas on the CPU.

• The GEMM results for CLBlast are much more stable acrossinput sizes compared to clBLAS. This is due to the ‘indirect’kernel design with the additional kernels to transform datain the expected format. The main benefit of this is manifestedfor irregular sizes.

• The GEMM results show significantly better overall resultsfor CLBlast compared to clBLAS. This is especially the case

CLBlast: A Tuned OpenCL BLAS Library IWOCL ’18, May 14–16, 2018, Oxford, United Kingdom25

6K76

8K12

80K

1792

K23

04K

2816

K33

28K

3840

K

sizes (n)0

25

50

75

100

125

150

GB/s

(hig

her i

s bet

ter)

SAXPY multiples of 256KCLBlastclBLAScuBLAS

2621

4578

6435

1310

725

1835

015

2359

305

2883

595

3407

885

3932

175

sizes (n)0

25

50

75

100

125

150 SAXPY multiples of 256K+1CLBlastclBLAScuBLAS

256

768

1280

1792

2304

2816

3328

3840

4352

4864

sizes (n=m)0

20

40

60

80

100

GB/s

(hig

her i

s bet

ter)

SGEMV multiples of 256CLBlastclBLAScuBLAS

257

771

1285

1799

2313

2827

3341

3855

4369

4883

sizes (n=m)0

20

40

60

80

100 SGEMV multiples of 257CLBlastclBLAScuBLAS

128

384

640

896

1152

1408

1664

1920

2176

2432

sizes (m=n=k)0

500

1000

1500

GFLO

PS (h

ighe

r is b

ette

r) SGEMM multiples of 128CLBlastclBLAScuBLAS

129

387

645

903

1161

1419

1677

1935

2193

2451

sizes (m=n=k)0

250

500

750

1000

1250

1500 SGEMM multiples of 129CLBlastclBLAScuBLAS

NVIDIA GeForce GTX 750Ti

256K

768K

1280

K17

92K

2304

K28

16K

3328

K38

40K

sizes (n)0

100

200

300

400

GB/s

(hig

her i

s bet

ter)

SAXPY multiples of 256KCLBlastclBLAScuBLAS

2621

4578

6435

1310

725

1835

015

2359

305

2883

595

3407

885

3932

175

sizes (n)0

100

200

300

400 SAXPY multiples of 256K+1CLBlastclBLAScuBLAS

256

768

1280

1792

2304

2816

3328

3840

4352

4864

sizes (n=m)0

100

200

300

400

GB/s

(hig

her i

s bet

ter)

SGEMV multiples of 256CLBlastclBLAScuBLAS

257

771

1285

1799

2313

2827

3341

3855

4369

4883

sizes (n=m)0

100

200

300

400 SGEMV multiples of 257CLBlastclBLAScuBLAS

128

384

640

896

1152

1408

1664

1920

2176

2432

sizes (m=n=k)0

2000

4000

6000

8000

10000

12000

GFLO

PS (h

ighe

r is b

ette

r) SGEMM multiples of 128CLBlastclBLAScuBLAS

129

387

645

903

1161

1419

1677

1935

2193

2451

sizes (m=n=k)0

2000

4000

6000

8000

10000

12000 SGEMM multiples of 129CLBlastclBLAScuBLAS

NVIDIA Titan X (Pascal)

256K

768K

1280

K17

92K

2304

K28

16K

3328

K38

40K

sizes (n)0

50

100

150

200

250

GB/s

(hig

her i

s bet

ter)

SAXPY multiples of 256KCLBlastclBLAS

2621

4578

6435

1310

725

1835

015

2359

305

2883

595

3407

885

3932

175

sizes (n)0

50

100

150

200

250

300 SAXPY multiples of 256K+1CLBlastclBLAS

256

768

1280

1792

2304

2816

3328

3840

4352

4864

sizes (n=m)0

50

100

150

200

250

300

GB/s

(hig

her i

s bet

ter)

SGEMV multiples of 256CLBlastclBLAS

257

771

1285

1799

2313

2827

3341

3855

4369

4883

sizes (n=m)0

50

100

150

200

250

300 SGEMV multiples of 257CLBlastclBLAS

128

384

640

896

1152

1408

1664

1920

2176

2432

sizes (m=n=k)0

1000

2000

3000

GFLO

PS (h

ighe

r is b

ette

r) SGEMM multiples of 128CLBlastclBLAS

129

387

645

903

1161

1419

1677

1935

2193

2451

sizes (m=n=k)0

500

1000

1500

2000

2500 SGEMM multiples of 129CLBlastclBLAS

AMD Radeon HD7970

256K

768K

1280

K17

92K

2304

K28

16K

3328

K38

40K

sizes (n)0

2

4

6

8

10

GB/s

(hig

her i

s bet

ter)

SAXPY multiples of 256KCLBlastclBLAS

2621

4578

6435

1310

725

1835

015

2359

305

2883

595

3407

885

3932

175

sizes (n)0

2

4

6

8

10 SAXPY multiples of 256K+1CLBlastclBLAS

256

768

1280

1792

2304

2816

3328

3840

4352

4864

sizes (n=m)0

2

4

6

8

10

GB/s

(hig

her i

s bet

ter)

SGEMV multiples of 256CLBlastclBLAS

257

771

1285

1799

2313

2827

3341

3855

4369

4883

sizes (n=m)0

2

4

6

8

10 SGEMV multiples of 257CLBlastclBLAS

128

256

384

512

640

768

896

1024

1152

1280

sizes (m=n=k)0

5

10

15

20

GFLO

PS (h

ighe

r is b

ette

r) SGEMM multiples of 128CLBlastclBLAS

129

258

387

516

645

774

903

1032

1161

1290

sizes (m=n=k)0

5

10

15

20 SGEMM multiples of 129CLBlastclBLAS

ARM Mali T628

256K

768K

1280

K17

92K

2304

K28

16K

3328

K38

40K

sizes (n)0

5

10

15

20

25

30

GB/s

(hig

her i

s bet

ter)

SAXPY multiples of 256KCLBlastclBLAS

2621

4578

6435

1310

725

1835

015

2359

305

2883

595

3407

885

3932

175

sizes (n)0

5

10

15

20

25

30 SAXPY multiples of 256K+1CLBlastclBLAS

256

768

1280

1792

2304

2816

3328

3840

4352

4864

sizes (n=m)0

5

10

15

20

25

30

GB/s

(hig

her i

s bet

ter)

SGEMV multiples of 256CLBlastclBLAS

257

771

1285

1799

2313

2827

3341

3855

4369

4883

sizes (n=m)0

5

10

15

20

25

30 SGEMV multiples of 257CLBlastclBLAS

128

384

640

896

1152

1408

1664

1920

2176

2432

sizes (m=n=k)0

25

50

75

100

125

150

GFLO

PS (h

ighe

r is b

ette

r) SGEMM multiples of 128CLBlastclBLAS

129

387

645

903

1161

1419

1677

1935

2193

2451

sizes (m=n=k)0

25

50

75

100

125

150 SGEMM multiples of 129CLBlastclBLAS

Intel Skylake ULT GT2

256K

768K

1280

K17

92K

2304

K28

16K

3328

K38

40K

sizes (n)0

5

10

15

20

25

30

GB/s

(hig

her i

s bet

ter)

SAXPY multiples of 256KCLBlastclBLAS

2621

4578

6435

1310

725

1835

015

2359

305

2883

595

3407

885

3932

175

sizes (n)0

5

10

15

20

25

30 SAXPY multiples of 256K+1CLBlastclBLAS

256

768

1280

1792

2304

2816

3328

3840

4352

4864

sizes (n=m)0

5

10

15

20

GB/s

(hig

her i

s bet

ter)

SGEMV multiples of 256CLBlastclBLAS

257

771

1285

1799

2313

2827

3341

3855

4369

4883

sizes (n=m)0

5

10

15

20 SGEMV multiples of 257CLBlastclBLAS

128

384

640

896

1152

1408

1664

1920

2176

2432

sizes (m=n=k)0

10

20

30

40

50

GFLO

PS (h

ighe

r is b

ette

r) SGEMM multiples of 128CLBlastclBLAS

129

387

645

903

1161

1419

1677

1935

2193

2451

sizes (m=n=k)0

10

20

30

40

50 SGEMM multiples of 129CLBlastclBLAS

Intel Core i5-6200U

Figure 3: Performance across 6 different devices. For each device we show three routines (one per row): SAXPY (measured inGB/s), SGEMV (measured in GB/s) and SGEMM (measured in GFLOPS). For each routine we show two graphs (one per column):left for multiples of a power-of-2, right for multiples of an odd number. Blue circles denote results of CLBlast, red crossesdenote results of clBLAS, and green dots denote results of NVIDIA’s cuBLAS where available. Best viewed on a computerscreen or in the on-line appendix of this paper at http://cnugteren.github.io/clblast/.

IWOCL ’18, May 14–16, 2018, Oxford, United Kingdom Cedric Nugteren

Figure 4: CLBlast performance of SGEMM for different matrix sizes (rows) relative to a matrix-size specific tuned version(diagonal) on two devices. Higher percentages represent faster kernels.

for irregular sizes and for certain devices: Skylake ULT GT2and the NVIDIA GPUs. However, we note that clBLAS onthe Skylake GPU couldn’t be tuned due to repeated crashes.Nevertheless, all other devices for which clBLAS worked asexpected also show results in favor of CLBlast. This is eventrue on the tested AMDGPU (for which clBLAS was created),as shown in the multiples-of-129 experiment (bottom-rightgraphs).

• NVIDIA’s cuBLAS is still superior over both OpenCL li-braries. Because cuBLAS is closed source, we can only for-mulate hypotheses. First, cuBLAS might be tuned at assem-bly/PTX level for specific hardware, whereas CLBlast relieson the compiler performing low-level optimizations. Second,specific instructions such as __ldg for L1 data caching areavailable from CUDA, but not from OpenCL C. A more in-depth analysis of CUDA vs OpenCL for GEMM can be foundon-line10.

In general, performance improvement over clBLAS can be attrib-uted to several factors. For example, clBLAS has more low-level op-timizations hard-coded (e.g. pointer arithmetic), leaving less roomfor the device’s compiler to optimize. Furthermore, the tuning spaceper kernel is limited compared to CLBlast. Examples for GEMM in-clude the lack of a tunable loop unroll factor, no support for stridedloading, and absence of the possibility to cache in local memory.Also for GEMM in particular, clBLAS does not pre-transpose the Bmatrix, resulting in non-subsequent memory accesses for certaintuning parameter configurations.

10GEMM tutorial: http://www.cedricnugteren.nl/tutorial/

4.3 Problem-specific TuningThanks to the included auto-tuners and the user community, CLBlastis already tuned for a wide variety of devices. However, tuning iscurrently only done for a default set of input arguments. For exam-ple, the GEMM routine is per-default tuned for squared matricesof dimensions m = 1024, n = 1024, k = 1024. To get the maxi-mum performance out of CLBlast, it is possible to tune for specificroutine arguments. It is even possible to tune for multiple cases:CLBlast provides an API to change the tuning parameters at run-time. Problem-specific tuning can be beneficial for example for deeplearning applications, in which matrices have a specific size basedon the neural network’s layout.

To illustrate the benefits of problem-specific tuning we tuned thegemm kernel for 9 different matrix sizes. We then benchmarked theSGEMM routine for each of the 9 different sets of tuning parameterson the same 9 problems. The results are shown as heat-maps infigure 4. Shown are the relative performances compared to thediagonal, i.e. the case for which the parameters were tuned. Thereare a few cases with performance slightly higher than the diagonal(i.e. higher than 100%), which can happen because the tuner exploresa random sub-set of the search space and might have been morefortunate in a particular case. Overall we see quite some potentialbenefit for problem-specific tuning, even up to a factor 2 for theRadeon M370X GPU: performance drops to ±50% if not properlytuned. For the Skylake ULT GT2 GPU we see less benefit: tuning forthe larger matrix dimensions seems to generalize towards smallermatrices (top left corner has high values), whereas the opposite isnot true (bottom right corner has low values). In conclusion, benefitsfrom problem-specific tuning vary per use-case and per device. Ingeneral, it seems definitely worth exploring this problem-specificperformance potential.

CLBlast: A Tuned OpenCL BLAS Library IWOCL ’18, May 14–16, 2018, Oxford, United Kingdom12

838

464

089

611

5214

0816

6419

2021

7624

32

sizes (m=n=k)0

50

100

150

200

250

300

GFLO

PS (h

ighe

r is b

ette

r)

HGEMM multiples of 128CLBlast FP16clBLAS FP32CLBlast FP32

129

387

645

903

1161

1419

1677

1935

2193

2451

sizes (m=n=k)0

50

100

150

200

250HGEMM multiples of 129CLBlast FP16clBLAS FP32CLBlast FP32

512

514

516

518

520

522

524

526

sizes (m=n=k)0

50

100

150

200HGEMM around 512

CLBlast FP16clBLAS FP32CLBlast FP32

2K20

5020

5220

5420

5620

5820

6020

62

sizes (m=n=k)0

50

100

150

200

250

300

GFLO

PS (h

ighe

r is b

ette

r)

HGEMM around 2048CLBlast FP16clBLAS FP32CLBlast FP32

101,

111,

111

101,

111,

112

101,

112,

111

101,

112,

112

102,

111,

111

102,

111,

112

102,

112,

111

102,

112,

112

layout, transA, transB0

50

100

150

200

250

300HGEMM layouts/transpose

CLBlast FP16clBLAS FP32CLBlast FP32

8 16 32 64 128

256

512

1024 2K 4K

sizes (m=n=k)0

50

100

150

200

250

300HGEMM powers of 2

CLBlast FP16clBLAS FP32CLBlast FP32

HGEMM: Intel Skylake ULT GT2

128

256

384

512

640

768

896

1024

1152

1280

sizes (m=n=k)0

10

20

30

40

GFLO

PS (h

ighe

r is b

ette

r)

HGEMM multiples of 128CLBlast FP16clBLAS FP32CLBlast FP32

129

258

387

516

645

774

903

1032

1161

1290

sizes (m=n=k)0

5

10

15

20

25

30HGEMM multiples of 129CLBlast FP16clBLAS FP32CLBlast FP32

512

514

516

518

520

522

524

526

sizes (m=n=k)0

5

10

15

20

25

30HGEMM around 512

CLBlast FP16clBLAS FP32CLBlast FP32

1024

1026

1028

1030

1032

1034

1036

1038

sizes (m=n=k)0

10

20

30

40

GFLO

PS (h

ighe

r is b

ette

r)

HGEMM around 2048CLBlast FP16clBLAS FP32CLBlast FP32

101,

111,

111

101,

111,

112

101,

112,

111

101,

112,

112

102,

111,

111

102,

111,

112

102,

112,

111

102,

112,

112

layout, transA, transB0

10

20

30

40HGEMM layouts/transpose

CLBlast FP16clBLAS FP32CLBlast FP32

8 16 32 64 128

256

512

1024

sizes (m=n=k)0

10

20

30

40HGEMM powers of 2

CLBlast FP16clBLAS FP32CLBlast FP32

HGEMM: ARM Mali T628

Figure 5: Performance of half-precision matrix-multiplication (HGEMM) on two low-power devices with native FP16 support.Shown are results for CLBlast’s FP16 mode (purple dots), CLBlast’s FP32 mode (blue circles) and clBLAS’s FP32 mode (redcrosses).

4.4 Half-precision Floating-pointWe also demonstrate the benefit of half-precision floating-point(FP16) in CLBlast. This mode trades-off precision for potential mem-ory savings, faster computation, and less energy consumption. Tra-ditionally used in computer graphics and image processing appli-cations, FP16 has seen renewed interest with the recent successesof deep-learning [12]. Devices with native FP16 at 2x FP32 speedcan be found in the embedded and low-power domain (e.g. IntelSkylake GPUs and ARM Mali GPUs) and the very high-end (e.g.NVIDIA Tesla P100 and V100). Recent AMD GPUs (Polaris andVega) also support FP16, but for memory and energy savings only:they run computations at FP32 speed.

CLBlast supports all routines in half-precision mode, while otherlibraries are lagging behind the recent hardware advances andsoftware requirements. For example, clBLAS has no FP16 supportat all. NVIDIA’s cuBLAS only supports the GEMM routine in half-precision. Intel does have a single special-purpose OpenCL FP16kernel integrated in the ISAAC library11, but there is no BLASlibrary or interface.

We have tested CLBlast’s half-precision mode on the two deviceswith FP16 support from table 2. Since there is no FP16 support inclBLAS, we test against FP32 versions of CLBlast and clBLAS toshow the advantage of FP16. We show results for half-precisionGEMM (HGEMM) in figure 5, chosen since it is FLOPS-bound ratherthan bandwidth-bound. From the figure, we can see that for boththe Skylake and Mali low-power GPUs achieve around 2x speed-upover CLBlast’s FP32 mode, benefiting fully from the hardware’scapabilities. The fact that this is sometimes beyond the theoretical 2xis explained by the randomness in the tuning parameter exploration.It is worth noting that on the Skylake ULT GT2 we now achieveover 200 GFLOPS, which is quite an achievement given that is on alaptop system-on-chip sharing 15Wwith a dual-core Core i5-6200U.11Intel HGEMM http://github.com/ptillet/isaac/pull/20

4.5 CUDA Back-end for CLBlastThe CLBlast library was originally written with OpenCL as a back-end (hence its name). However, all OpenCL library calls in the host-code were abstracted for convenience through CLCudaAPI, a smallheader-only C++11 project12. This API abstracts away low-leveldetails of OpenCL C calls behind modern C++ classes. Examples area ‘Buffer’ and a ‘Kernel’ class, with methods such as ‘Buffer.Write()’or ‘Kernel.Launch()’. This has several advantages, such as beingC++ rather than C, having a higher level interface, benefiting fromtype templates, and included error checking. However, the mainadvantage lies in the fact that there is also a CUDA version of CLCu-daAPI with an identical interface. Thus, any application writtenusing CLCudaAPI can switch its host-code from OpenCL to CUDAor vice-versa by simply changing a single header ‘#include’. This ismade possible with the CUDA driver API and NVRTC (since CUDA7.5) for run-time kernel compilation.

CLBlast includes a compile-time option to easily switch betweenOpenCL and CUDA host-code without much trouble. However, thekernel code is still written in OpenCL C dialect. Luckily, CLCudaAPIalso has a solution for kernel code: an OpenCL-to-CUDA header ofaround 50 lines of code translates OpenCL kernel code to CUDA us-ing several pre-processor defines and inline functions. This headerdoes not cover the full OpenCL kernel language specification, butit is good enough for the purposes of CLBlast.

Since CLBlast isn’t originally designed for CUDA and since mostCUDA devices can also run OpenCL applications, one might won-der about the advantages of a CUDA version of CLBlast? First ofall, the library can now be integrated into existing CUDA projects,taking CUDA buffers directly as input. Secondly, it can now runon platforms for which NVIDIA does not ship an OpenCL imple-mentation: all non-x86 systems such as the Jetson and Drive PXseries and IBM Power based supercomputers. Finally, performance12CLCudaAPI: http://github.com/CNugteren/CLCudaAPI

IWOCL ’18, May 14–16, 2018, Oxford, United Kingdom Cedric Nugteren8K 16K

32K

64K

128K

256K

512K

1024

K 2M 4M

sizes (n)0

102030405060708090

GB/s

(hig

her i

s bet

ter)

num AXPYs = 8CLBlastclBLAS (non-batched)cuBLAS (non-batched)

8K 16K

32K

64K

128K

256K

512K

1024

K

sizes (n)0

102030405060708090

num AXPYs = 64CLBlastclBLAS (non-batched)cuBLAS (non-batched)

1 2 4 8 16 32 64 128

num AXPYs0

102030405060708090

n=512KCLBlastclBLAS (non-batched)cuBLAS (non-batched)

SAXPYBATCHED: NVIDIA GeForce GTX 750Ti

8K 16K

32K

64K

128K

256K

512K

1024

K 2M 4M

sizes (n)0

5

10

15

20

25

30

GB/s

(hig

her i

s bet

ter)

num AXPYs = 8CLBlastclBLAS (non-batched)

8K 16K

32K

64K

128K

256K

512K

1024

K 2M 4M

sizes (n)0

5

10

15

20

25

30num AXPYs = 64

CLBlastclBLAS (non-batched)

1 2 4 8 16 32 64 128

256

num AXPYs0

5

10

15

20

25

30n=512K

CLBlastclBLAS (non-batched)

SAXPYBATCHED: Intel Skylake ULT GT2

32 96 160

224

288

352

416

480

544

608

sizes (m=n=k)0

200

400

600

800

1000

1200

1400

1600

GFLO

PS (h

ighe

r is b

ette

r)

num GEMMs = 8CLBlastclBLAS (non-batched)cuBLAS (non-batched)

32 96 160

224

288

352

416

480

544

608

sizes (m=n=k)0

200

400

600

800

1000

1200

1400num GEMMs = 64

CLBlastclBLAS (non-batched)cuBLAS (non-batched)

1 4 16 64 256

1024 4K

num GEMMs0

100

200

300

400

500

600

700

800m=n=k=128

CLBlastclBLAS (non-batched)cuBLAS (non-batched)

SGEMMBATCHED: NVIDIA GeForce GTX 750Ti

32 96 160

224

288

352

416

480

544

608

sizes (m=n=k)0

20

40

60

80

100

120

140

GFLO

PS (h

ighe

r is b

ette

r)

num GEMMs = 8CLBlastclBLAS (non-batched)

32 96 160

224

288

352

416

480

544

608

sizes (m=n=k)0

20

40

60

80

100

120

140 num GEMMs = 64CLBlastclBLAS (non-batched)

1 4 16 64 256

1024

num GEMMs0

5

10

15

20

25

30m=n=k=128

CLBlastclBLAS (non-batched)

SGEMMBATCHED: Intel Skylake ULT GT2

32 96 160

224

288

352

416

480

544

608

sizes (m=n=k)0

200

400

600

800

1000

1200

1400

1600

GFLO

PS (h

ighe

r is b

ette

r)

num GEMMs = 8CLBlastclBLAS (non-batched)cuBLAS (non-batched)

32 96 160

224

288

352

416

480

544

608

sizes (m=n=k)0

200

400

600

800

1000

1200

1400num GEMMs = 64

CLBlastclBLAS (non-batched)cuBLAS (non-batched)

1 4 16 64 256

1024 4K

num GEMMs0

100

200

300

400

500

600

700

800m=n=k=128

CLBlastclBLAS (non-batched)cuBLAS (non-batched)

SGEMMSTRIDEDBATCHED: NVIDIA GeForce GTX 750Ti

32 96 160

224

288

352

416

480

544

608

sizes (m=n=k)0

20

40

60

80

100

120

140

GFLO

PS (h

ighe

r is b

ette

r)

num GEMMs = 8CLBlastclBLAS (non-batched)

32 96 160

224

288

352

416

480

544

608

sizes (m=n=k)0

25

50

75

100

125

150

175

200num GEMMs = 64

CLBlastclBLAS (non-batched)

1 4 16 64 256

1024

num GEMMs0

20

40

60

80

100

120

140 m=n=k=128CLBlastclBLAS (non-batched)

SGEMMSTRIDEDBATCHED: Intel Skylake ULT GT2

Figure 6: Performance of batched AXPY (top), generic batched GEMM (middle), and strided batched GEMM (bottom) on twodifferent devices. Shown are results for the batched CLBlast routines (blue circles) and for running the non-batched routinesfrom clBLAS (red crosses) and cuBLAS (green dots) in a loop. Note that cuBLAS also has a batchedmode, but this is not includedin this comparison.

can be different from the OpenCL version. A first test showed thatout-of-the-box the CUDA version was actually around 3% slowerfor the SGEMM benchmarks on the GeForce GTX 750Ti test system.This could be due to different optimisation settings in NVIDIA’sOpenCL/CUDA compiler. Nevertheless, the CUDA kernels also of-fer new optimisation opportunities such as using __ldg or __shflintrinsics or mixed-precision tensor operations, all of which arenot (yet) available in OpenCL. In this work we did not performextensive CUDA versus OpenCL tests, but we leave this for futurework.

4.6 Batched BLAS RoutinesThis section discusses the benefits of batched routines: groupingmultiple similar traditional BLAS calls into a single routine forbetter efficiency. Although the overhead of the regular CLBlastroutines is minimal, performing multiple small operations comes ata cost: the OpenCL device can become underutilized when runningtoo few threads and work-groups. For example on an NVIDIA GTX750 Ti with 5 compute units, a 32 by 32 matrix-multiplication witha work-group size of 512 results in 2 work-groups, leaving 3 units

unoccupied. Even if the work-group size would be smaller, therewould be insufficient threads to hide the GPU’s memory latency.Batched BLAS routines can alleviate this issue by running unrelatedbut similar computations simultaneously.

CLBlast implements 3 batched routines: generic batched AXPY,generic batched GEMM, and strided batched GEMM. For the firsttwo, their interfaces are similar to the non-batched counterpart,with the following exceptions:

(1) An additional parameter specifies the size of a batch.(2) All offset arguments have become arrays, specifying the

starting points of the individual vectors or matrices with re-spect to an OpenCL memory object for each of the individualcomputations within the batch.

(3) The scalar arguments (alpha and beta) are now arrays suchthat they can be set differently within the batch.

The special strided batched GEMM is less generic: it doesn’t havean offset array as argument. Instead, it has a single stride parameterfor each of the matrices, specifying where the next data is. Also, itassumes alpha and beta to be equal across the whole batch. This

CLBlast: A Tuned OpenCL BLAS Library IWOCL ’18, May 14–16, 2018, Oxford, United Kingdom

variant is also implemented in cuBLAS and was shown by NVIDIAto reduce overhead significantly for small matrix sizes 13.

These batched routines are specifically beneficial for machinelearning applications, in which data is often processed in batchesfor training (minibatch stochastic gradient descent) and for in-ference (bulk-processing). In fact, deep-learning libraries such ascuBLAS/cuDNN and GreenTea libDNN [18] implement batchedGEMM routines as well.

Figure 6 presents results of running batched AXPY and batchedGEMM on two different devices. Additional results for batched rou-tines can be found in the on-line appendix9. We evaluate the benefitof batched routines compared to a non-batched clBLAS/cuBLASreference for a fixed batch-size but with varying data-size (leftand middle) and for a fixed data-size but with a varying batch-size(right). From the results, we conclude that the batched routinesperform significantly better for small problems compared to theirnon-batched counter-parts (see also non-batched CLBlast in fig-ure 3). In some cases this can even be an order of magnitude better:batched routines can bring the performance of operations on smallvectors and matrices to the level of much larger operations. Ofcourse, the advantage of batching diminishes as input sizes growbigger. We also note that the overhead due to loading the offsetsand scalars is visible when comparing the generic batched GEMMwith the strided version, which can be up to a factor 2 for the GTX750Ti GPU, and even higher for the Skylake GPU. In the latter casewe should note that these routines use the tuning parameters ofthe regular GEMM, which can be sub-optimal as shown for genericbatched GEMM for small sizes. Thus, for optimal performance, tun-ing needs to be performed separately on the batched versions ofthe kernels.

5 FUTUREWORKThe current version of CLBlast is already production-ready andperformswell on a variety of platforms. Nevertheless, we do identifytopics of future work to further improve the library and the tuning:

• The current out-of-the-box tuning parameters are optimizedfor specific routine arguments (e.g. matrix size). To get op-timal performance for a particular use-case, the user is cur-rently required to run the auto-tuner. However, if the tunerswould be run per-default for a mixed set of arguments (e.g.both small and large matrices), we could estimate tuning pa-rameters for every use-case. Selecting the argument mix andperforming the estimation is not trivial and might requireelaborate models of the kernels and the hardware, perhapsusing machine learning such as in [2, 6].

• The above can be applied in another dimension: predictingtuning parameters for unseen devices instead of for unseenarguments. Again, this might require sophisticated modelsof the kernels and hardware.

• The library already supports features useful for deep learningsuch as half-precision, batched GEMM, and an im2col imple-mentation. However, further modifications can be made fordeep learning, such as tensor-based convolutions and othercuDNN routines. Such additions would bring CLBlast in thescope of other auto-tuning work, such as [13, 18].

13Blog on batching: http://devblogs.nvidia.com/cublas-strided-batched-matrix-multiply

6 CONCLUSIONSThis paper discussed CLBlast, an OpenCL BLAS library writtenin C++11. Thanks to its integrated auto-tuning support and thegeneric OpenCL kernels it performswell on awide range of OpenCLdevices including mobile and low-power GPUs, offering a viablealternative to the de-facto standard clBLAS or the closed-sourcecuBLAS. Users of the library can further improve performanceby fine-tuning for their specific hardware or even for their spe-cific use-case (e.g. matrix size). Integration into CUDA applicationsis also possible since the library also has a CUDA interface andback-end. Furthermore, this paper demonstrated that CLBlast isequipped with features which go beyond the standard BLAS defini-tion: half-precision floating-point (FP16) and batched routines. Bothare highly beneficial for deep-learning where batched operationsare common and high precision is not always required. With theright hardware, CLBlast’s FP16 mode can give you a factor twoperformance gain and memory savings. Batching can improve per-formance up to an order of magnitude depending on the use-caseand hardware.

In conclusion, CLBlast is a production-ready and high-performancelibrary which can be used today to accelerate code on OpenCL hard-ware. In the future CLBlast will continue to support the needs of thehigh-performance computing and deep learning communities tomake the library even more tuned to the needs of the users. Readersare invited to contribute by sharing ideas for future work, tuningresults or patches on http://github.com/CNugteren/CLBlast, themain project website of CLBlast.

ACKNOWLEDGMENTSWe thank all the contributors on the CLBlast project. We also thankdividiti for providing access to ARM Mali hardware, Mark Wijtvlietfor access to an AMD GPU, and ArrayFire for making continuousintegration of CLBlast possible.

REFERENCES[1] Ahmad Abdelfattah, Azzam Haidar, Stanimire Tomov, and Jack Dongarra. 2016.

Performance, Design, and Autotuning of Batched GEMM for GPUs. In 31st Inter-national Conference on High Performance Computing. Springer.

[2] R. Ballester-Ripoll, E. G. Paredes, and R. Pajarola. 2017. Sobol Tensor Trains forGlobal Sensitivity Analysis. ArXiv e-prints (Dec. 2017). arXiv:1712.00233

[3] Shekhar Borkar and Andrew A. Chien. 2011. The Future of Microprocessors.Commun. ACM 54, 5 (May 2011), 67–77. https://doi.org/10.1145/1941487.1941507

[4] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, JohnTran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient Primitivesfor Deep Learning. (2014).

[5] Tingxing Dong, Azzam Haidar, Piotr Luszczek, Stanimire Tomov, Ahmad Ab-delfattah, and Jack Dongarra. 2016. MAGMA Batched: A Batched BLAS Approachfor Small Matrix Factorizations and Applications on GPUs. Technical Report.

[6] T. L. Falch and A. C. Elster. 2015. Machine Learning Based Auto-Tuning forEnhanced OpenCL Performance Portability. In IPDPSW: International Paralleland Distributed Processing Symposium Workshop. IEEE. https://doi.org/10.1109/IPDPSW.2015.85

[7] Junjie Lai and A. Seznec. 2013. Performance Upper Bound Analysis and Opti-mization of SGEMM on Fermi and Kepler GPUs. In CGO ’13: Code Generationand Optimization. IEEE. https://doi.org/10.1109/CGO.2013.6494986

[8] Yinan Li, Jack Dongarra, and Stanimire Tomov. 2009. A Note on Auto-tuningGEMM for GPUs. In ICCS: Int. Conf. on Computational Science. Springer. https://doi.org/10.1007/978-3-642-01970-8_89

[9] Ian Masliah, Ahmad Abdelfattah, A. Haidar, S. Tomov, Marc Baboulin, J. Falcou,and J. Dongarra. 2016. High-Performance Matrix-Matrix Multiplications ofVery Small Matrices. In Euro-Par: International Conf. on Parallel and DistributedComputing. Springer.

[10] K. Matsumoto, N. Nakasato, and S.G. Sedukhin. 2014. Implementing Level-3 BLASRoutines in OpenCL on Different Processing Units. Technical Report TR 2014-001.

IWOCL ’18, May 14–16, 2018, Oxford, United Kingdom Cedric Nugteren

The University of Aizu.[11] K. Matsumoto, N. Nakasato, and S. G. Sedukhin. 2012. Performance Tuning of

Matrix Multiplication in OpenCL on Different GPUs and CPUs. In SC Companion:High Performance Computing, Networking Storage and Analysis. IEEE, 396–405.https://doi.org/10.1109/SC.Companion.2012.59

[12] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory F. Diamos, ErichElsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, GaneshVenkatesh, and Hao Wu. 2017. Mixed Precision Training. arXiv abs/1710.03740(2017). arXiv:1710.03740 http://arxiv.org/abs/1710.03740

[13] Matthew W. Moskewicz, Ali Jannesari, and Kurt Keutzer. 2017. Boda: A Holis-tic Approach for Implementing Neural Network Computations. In ComputingFrontiers (CF’17). ACM. https://doi.org/10.1145/3075564.3077382

[14] Cedric Nugteren and Valeriu Codreanu. 2015. CLTune: A Generic Auto-Tunerfor OpenCL Kernels. In MCSOC ’15: International Symposium on EmbeddedMulticore/Many-core Systems-on-Chip. IEEE.

[15] Hugh Perkins. 2017. CUDA-on-CL: A Compiler and Runtime for Running NVIDIACUDA C++11 Applications on OpenCL 1.2 Devices. In Proceedings of the 5thInternational Workshop on OpenCL (IWOCL 2017). ACM, New York, NY, USA,Article 6, 4 pages. https://doi.org/10.1145/3078155.3078156

[16] Samuel D Relton, Pedro Valero-Lara, and Mawussi Zounon. 2016. A Comparisonof Potential Interfaces for Batched BLAS Computations. (2016).

[17] Philippe Tillet and David Cox. 2017. Input-aware Auto-tuning of Compute-boundHPC Kernels. In SC ’17: International Conf. for High Performance Computing,Networking, Storage and Analysis. ACM, Article 43, 12 pages. https://doi.org/10.1145/3126908.3126939

[18] Fabian Tschopp. 2015. Efficient Convolutional Neural Networks for PixelwiseClassification on Heterogeneous Hardware Systems. arXiv abs/1509.03371 (2015).arXiv:1509.03371 http://arxiv.org/abs/1509.03371

[19] Aravind Vasudevan, Andrew Anderson, and David Gregg. 2017. Parallel MultiChannel Convolution using General Matrix Multiplication. arXiv abs/1704.04428(2017).

[20] Pete Warden. 2015. Why GEMM is at the heart of deep learn-ing. https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning (2015).

[21] Rick Weber and Gregory D. Peterson. 2012. A Trip to Tahiti: Approaching a 5TFlop SGEMMUsing 3 AMDGPUs. In Proceedings of the 2012 Symposium on Appli-cation Accelerators in High Performance Computing (SAAHPC ’12). IEEE ComputerSociety, Washington, DC, USA, 19–25. https://doi.org/10.1109/SAAHPC.2012.19