CUDA 6.0
Performance Report
April 2014
CUDA 6 Performance Report
CUDART CUDA Runtime Library
cuFFT Fast Fourier Transforms Library
cuBLAS Complete BLAS Library
cuSPARSE Sparse Matrix Library
cuRAND Random Number Generation (RNG) Library
NPP Performance Primitives for Image & Video Processing
Thrust Templated Parallel Algorithms & Data Structures
math.h C99 floating-point Library
Included in the CUDA Toolkit (free download):
developer.nvidia.com/cuda-toolkit
For more information on CUDA libraries:
developer.nvidia.com/gpu-accelerated-libraries
CUDA 6: 2x Faster GPU Kernel Launches
[Bar chart: dynamic parallel kernel launch latency (usec), CUDA 5 vs CUDA 6]
Back-to-back launches (a<<<...>>>(); b<<<...>>>();): 1.5x faster
Launch and synchronize (a<<<...>>>(); cudaDeviceSynchronize();): 2.0x faster
Performance may vary based on OS version and motherboard configuration • CUDA 5.0 and CUDA 6.0 on Tesla K20
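The two launch patterns measured above can be sketched as follows; the kernel names and launch configuration are placeholders, not the benchmark's actual code.

```cuda
#include <cuda_runtime.h>

__global__ void a() { }  // placeholder kernels standing in for real work
__global__ void b() { }

int main() {
    // Pattern 1: back-to-back launches, queued without host-side waiting.
    a<<<1, 1>>>();
    b<<<1, 1>>>();

    // Pattern 2: launch and synchronize, blocking the host until done.
    a<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```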
cuFFT: Multi-dimensional FFTs
Real and complex
Single- and double-precision data types
1D, 2D and 3D batched transforms
Flexible input and output data layouts
New in CUDA 6: XT interface supports dual-GPU cards (Tesla K10, GeForce GTX690, …)
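A minimal sketch of a batched 1D single-precision complex transform, the configuration benchmarked on the following slides; the size and batch count are illustrative assumptions.

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

int main() {
    const int n = 1024, batch = 1000;  // assumed transform size and batch count
    cufftComplex *data;
    cudaMalloc(&data, sizeof(cufftComplex) * n * batch);

    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, batch);        // batched 1D C2C plan
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);  // in-place forward FFT

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}
```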
cuFFT: up to 700 GFLOPS
[Charts: GFLOPS vs log2(transform_size); single precision up to ~700 GFLOPS, double precision up to ~300 GFLOPS]
1D Complex, Batched FFTs, used in audio processing and as a foundation for 2D and 3D FFTs
• cuFFT 6.0 on K40c, ECC ON, 32M elements, input and output data on device
Performance may vary based on OS version and motherboard configuration
cuFFT: Consistently High Performance
[Charts: GFLOPS vs transform size (1 to 100,000,000); single precision up to ~700 GFLOPS, double precision up to ~300 GFLOPS]
1D Complex, Batched FFTs, used in audio processing and as a foundation for 2D and 3D FFTs
• cuFFT 6.0 on K40c, ECC ON, 28M-33M elements, input and output data on device
Performance may vary based on OS version and motherboard configuration
cuFFT-XT: Boosts Performance on K10 (New in CUDA 6)
[Bar charts: execution time (ms), cuFFT vs cuFFT-XT; 256x256x256: 25% faster, 512x512x512: 65% faster]
• cuFFT 6.0 on K10, ECC ON, input and output data on device
Performance may vary based on OS version and motherboard configuration
cuBLAS: Dense Linear Algebra on GPUs
Complete BLAS implementation plus useful extensions
Supports all 152 standard routines for single, double, complex, and double complex
Host- and device-callable interface
New in CUDA 6: XT Interface for Level 3 BLAS
Distributed computations across multiple GPUs
Out-of-core streaming to GPU, no upper limit on matrix size
“Drop-in” BLAS intercepts CPU BLAS calls, streams to GPU
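A minimal single-GPU SGEMM call through the standard cuBLAS host interface, matching the m=n=k=4096 configuration benchmarked on the next slide; initialization of the input matrices is omitted for brevity.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 4096;  // m = n = k, as in the benchmark
    float *A, *B, *C;
    cudaMalloc(&A, sizeof(float) * n * n);
    cudaMalloc(&B, sizeof(float) * n * n);
    cudaMalloc(&C, sizeof(float) * n * n);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C (column-major, no transpose)
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, A, n, B, n, &beta, C, n);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```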
cuBLAS: >3 TFLOPS single-precision, >1 TFLOPS double-precision
[Bar chart: GFLOPS for GEMM, SYMM, TRSM, and SYRK in single (S), single complex (C), double (D), and double complex (Z) precision]
• cuBLAS 6.0 on K40m, ECC ON, input and output data on device
• m=n=k=4096, transpose=no, side=right, fill=lower
Performance may vary based on OS version and motherboard configuration
cuBLAS: ZGEMM 5x Faster than MKL
[Line chart: GFLOPS vs matrix dimension (m=n=k, up to 4000), cuBLAS vs MKL]
• cuBLAS 6.0 on K40m, ECC ON, input and output data on device
• MKL 11.0.4 on Intel IvyBridge 12-core E5-2697 v2 @ 2.70GHz
Performance may vary based on OS version and motherboard configuration
cuBLAS-XT: Multi-GPU Performance Scaling (New in CUDA 6)
[Bar chart: 16K x 16K SGEMM on Tesla K10; 1 x K10: 2.2 TFLOPS, 2 x K10: 4.2 TFLOPS, 3 x K10: 6.0 TFLOPS, 4 x K10: 7.9 TFLOPS]
• cuBLAS-XT 6.0 on K10, ECC ON, input and output data on host
Performance may vary based on OS version and motherboard configuration
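A sketch of the XT interface for the same 16K x 16K SGEMM: the matrices stay in host memory and cuBLAS-XT streams tiles across the selected GPUs. The device IDs are assumptions; error checking is omitted.

```cuda
#include <cublasXt.h>
#include <stdlib.h>

int main() {
    const size_t n = 16384;  // 16K x 16K, as in the chart
    // Host-resident matrices: cuBLAS-XT streams tiles to the GPUs,
    // so matrix size is not limited by GPU memory.
    float *A = (float*)malloc(sizeof(float) * n * n);
    float *B = (float*)malloc(sizeof(float) * n * n);
    float *C = (float*)malloc(sizeof(float) * n * n);

    cublasXtHandle_t handle;
    cublasXtCreate(&handle);
    int devices[2] = {0, 1};                 // assumed GPU IDs (e.g. one K10)
    cublasXtDeviceSelect(handle, 2, devices);

    const float alpha = 1.0f, beta = 0.0f;
    cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                  n, n, n, &alpha, A, n, B, n, &beta, C, n);

    cublasXtDestroy(handle);
    free(A); free(B); free(C);
    return 0;
}
```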
cuSPARSE: Sparse Linear Algebra Routines
Optimized sparse BLAS routines: matrix-vector, matrix-matrix, triangular solve
Support for a variety of formats (CSR, COO, block variants)
[Figure: sparse matrix-vector multiply, y = alpha * A * x + beta * y]
New in CUDA 6: many improvements to triangular solvers, incomplete-LU, and Cholesky preconditioners
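The SpMV operation illustrated above (y = alpha*A*x + beta*y for a CSR matrix) looks roughly like this with the cuSPARSE API of the CUDA 6 era; the device arrays and sizes are assumed to be set up by the caller.

```cuda
#include <cusparse_v2.h>

// y = alpha * A * x + beta * y, with A stored in CSR format.
// All pointer arguments are device pointers prepared by the caller.
void spmv(cusparseHandle_t handle, int m, int n, int nnz,
          const double *csrVal, const int *csrRowPtr, const int *csrColInd,
          const double *x, double *y) {
    cusparseMatDescr_t descr;
    cusparseCreateMatDescr(&descr);  // defaults: general matrix, zero-based indexing

    const double alpha = 1.0, beta = 0.0;
    cusparseDcsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                   m, n, nnz, &alpha, descr,
                   csrVal, csrRowPtr, csrColInd, x, &beta, y);

    cusparseDestroyMatDescr(descr);
}
```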
cuSPARSE: 5x Faster than MKL
[Bar chart: speedup over MKL, sparse matrix x dense vector (SpMV), up to ~6x]
• Average of s/c/d/z routines
• cuSPARSE 6.0 on K40m, ECC ON, input and output data on device
• MKL 11.0.4 on Intel IvyBridge 12-core E5-2697 v2 @ 2.70GHz
• Matrices obtained from: http://www.cise.ufl.edu/research/sparse/matrices/
Performance may vary based on OS version and motherboard configuration
cuRAND: Random Number Generation
Generating high quality random numbers in parallel is hard
Don’t do it yourself, use a library!
Pseudo- and Quasi-RNGs
Supports several output distributions
Statistical test results in documentation
New in CUDA 6: Mersenne Twister 19937
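Using the library rather than rolling your own can be as simple as the host API: create a generator, seed it, and fill a device buffer. The sketch below uses the MT19937 generator added in CUDA 6; the buffer size and seed are illustrative.

```cuda
#include <curand.h>
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;  // assumed sample count
    double *d_out;
    cudaMalloc(&d_out, sizeof(double) * n);

    curandGenerator_t gen;
    // Host-API pseudo-RNG using the Mersenne Twister 19937 algorithm.
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MT19937);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
    curandGenerateUniformDouble(gen, d_out, n);  // uniform doubles on the device

    curandDestroyGenerator(gen);
    cudaFree(d_out);
    return 0;
}
```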
cuRAND: Up to 75x Faster vs. Intel MKL
[Bar chart: Gsamples/sec, cuRAND vs MKL, Sobol32 and MRG32k3a generators, for uniform, normal, and log-normal distributions]
• cuRAND 6.0 on K40c, ECC ON, double-precision input and output data on device
• MKL 11.0.1 on Intel SandyBridge 6-core E5-2620 @ 2.0 GHz
Performance may vary based on OS version and motherboard configuration
cuRAND: High Performance RNGs
[Bar chart: Gsamples/sec for pseudo-random generators (XORWOW, Philox, MRG32k3a, MTGP32) and quasi-random generators (Sobol32 Scrambled, Sobol64 Scrambled), for uniform, normal, and log-normal distributions]
• cuRAND 6.0 on K40m, ECC ON, double precision input and output data on device
Performance may vary based on OS version and motherboard configuration
NPP: NVIDIA Performance Primitives
Over 5000 image and signal processing routines:
color transforms, geometric transforms, move operations, linear
filters, image & signal statistics, image & signal arithmetic, JPEG
building blocks, image segmentation
New in CUDA 6: over 500 new routines, including:
median filter, BGR/YUV conversion, 3D LUT color conversion,
improvements to JPEG primitives, plus many more
NPP Speedup vs. Intel IPP
[Bar chart: Image Set (8-bit RGB): 5.7x; Image Set Channel (8-bit RGB): 14.4x; Image Resize (8-bit RGB): 17.8x; Image Gaussian Filter (32-bit float): 6.3x; Color Conversion 8-bit YUV422 to 8-bit RGB: 12.9x; JPEG 8x8 Forward DCT: 28.5x]
• NPP 6.0 on K40m, input and output data on device
• IPP 7.0 on Intel IvyBridge 12-core E5-2697 v2 @ 2.70GHz
Performance may vary based on OS version and motherboard configuration
Thrust: CUDA C++ Template Library
Template library for CUDA C++
Host and Device Containers that mimic the C++ STL
Optimized Algorithms for sort, reduce, scan, etc.
OpenMP Backend for portability
Also available on github: thrust.github.com
Allows applications and prototypes to be built quickly
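A minimal sketch of the container and algorithm style described above: fill a host vector, copy it to the device, then sort and reduce with Thrust's parallel algorithms. The element count is an illustrative assumption.

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdlib>

int main() {
    // STL-like host container, filled with random integers.
    thrust::host_vector<int> h(1 << 20);
    for (size_t i = 0; i < h.size(); ++i) h[i] = rand();

    thrust::device_vector<int> d = h;  // implicit host-to-device copy

    thrust::sort(d.begin(), d.end());                 // parallel sort on the GPU
    int sum = thrust::reduce(d.begin(), d.end(), 0);  // parallel reduction
    (void)sum;
    return 0;
}
```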
Thrust Performance vs. Intel TBB
[Bar charts: Thrust vs TBB speedup on 32M integers for reduce, transform, scan, and sort (up to ~14x); Thrust sort vs TBB on 32M samples: char 43.7x, short 24.8x, int 13.3x, long 5.2x, float 15.0x, double 5.8x]
• Thrust v1.7.1 on K40m, ECC ON, input and output data on device
• TBB 4.2 on Intel IvyBridge 12-core E5-2697 v2 @ 2.70GHz
Performance may vary based on OS version and motherboard configuration
math.h: C99 floating-point library + extras
CUDA math.h is industry proven, high performance, and accurate
• Basic: +, *, /, 1/x, sqrt, FMA (all IEEE-754 accurate for float and double, all rounding modes)
• Exponentials: exp, exp2, log, log2, log10, ...
• Trigonometry: sin, cos, tan, asin, acos, atan2, sinh, cosh, asinh, acosh, ...
• Special functions: lgamma, tgamma, erf, erfc
• Utility: fmod, remquo, modf, trunc, round, ceil, floor, fabs, ...
• Extras: rsqrt, rcbrt, exp10, sinpi, sincos[pi], cospi, erfinv, erfcinv, normcdf[inv], ...
New in CUDA 6:
• Over 80 new SIMD instructions, useful for video processing: __v*2, __v*4
• Cylindrical Bessel: cyl_i{0,1}
• 1/hypotenuse: rhypot
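The SIMD instructions operate on packed sub-word values inside a 32-bit register. A sketch of one of them, assuming `__vadd4` (per-byte addition of four packed unsigned bytes), which suits 8-bit image data:

```cuda
#include <cuda_runtime.h>

// Add two images of packed 8-bit pixels, four bytes per 32-bit word.
// __vadd4 performs four independent byte-wise additions per instruction.
__global__ void add_pixels(const unsigned int *a, const unsigned int *b,
                           unsigned int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __vadd4(a[i], b[i]);  // 4 byte-wise adds in one SIMD op
}
```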