CUDA 6.0
Performance Report
April 2014
CUDA 6 Performance Report
CUDART CUDA Runtime Library
cuFFT Fast Fourier Transforms Library
cuBLAS Complete BLAS Library
cuSPARSE Sparse Matrix Library
cuRAND Random Number Generation (RNG) Library
NPP Performance Primitives for Image & Video Processing
Thrust Templated Parallel Algorithms & Data Structures
math.h C99 floating-point Library
Included in the CUDA Toolkit (free download):
developer.nvidia.com/cuda-toolkit
For more information on CUDA libraries:
developer.nvidia.com/gpu-accelerated-libraries
CUDA 6: 2x Faster GPU Kernel Launches
[Bar chart: dynamic parallel kernel launch latency (usec), CUDA 5 vs CUDA 6]
Back-to-back launches (a<<<...>>>(); b<<<...>>>();): 1.5x faster
Launch and synchronize (a<<<...>>>(); cudaDeviceSynchronize();): 2.0x faster
Performance may vary based on OS version and motherboard configuration • CUDA 5.0 and CUDA 6.0 on Tesla K20
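The two launch patterns measured above can be sketched as follows; the kernel names and launch configuration are placeholders, not the benchmark's actual code.

```cuda
#include <cuda_runtime.h>

__global__ void a() { }  // placeholder kernels standing in for real work
__global__ void b() { }

int main() {
    // Pattern 1: back-to-back launches, queued without host-side waiting.
    a<<<1, 1>>>();
    b<<<1, 1>>>();

    // Pattern 2: launch and synchronize, blocking the host until done.
    a<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```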
cuFFT: Multi-dimensional FFTs
Real and complex
Single- and double-precision data types
1D, 2D and 3D batched transforms
Flexible input and output data layouts
New in CUDA 6: XT interface supports dual-GPU cards (Tesla K10, GeForce GTX690, …)
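A minimal sketch of a batched 1D single-precision complex transform, the configuration benchmarked on the following slides; the size and batch count are illustrative assumptions.

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

int main() {
    const int n = 1024, batch = 1000;  // assumed transform size and batch count
    cufftComplex *data;
    cudaMalloc(&data, sizeof(cufftComplex) * n * batch);

    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, batch);        // batched 1D C2C plan
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);  // in-place forward FFT

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}
```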
cuFFT: up to 700 GFLOPS
[Charts: GFLOPS vs log2(transform_size); single precision up to ~700 GFLOPS, double precision up to ~300 GFLOPS]
1D Complex, Batched FFTs, used in audio processing and as a foundation for 2D and 3D FFTs
• cuFFT 6.0 on K40c, ECC ON, 32M elements, input and output data on device
Performance may vary based on OS version and motherboard configuration
cuFFT: Consistently High Performance
[Charts: GFLOPS vs transform size (1 to 100,000,000); single precision up to ~700 GFLOPS, double precision up to ~300 GFLOPS]
1D Complex, Batched FFTs, used in audio processing and as a foundation for 2D and 3D FFTs
• cuFFT 6.0 on K40c, ECC ON, 28M-33M elements, input and output data on device
Performance may vary based on OS version and motherboard configuration
cuFFT-XT: Boosts Performance on K10 (New in CUDA 6)
[Bar charts: execution time (ms), cuFFT vs cuFFT-XT; 256x256x256: 25% faster, 512x512x512: 65% faster]
• cuFFT 6.0 on K10, ECC ON, input and output data on device
Performance may vary based on OS version and motherboard configuration
cuBLAS: Dense Linear Algebra on GPUs
Complete BLAS implementation plus useful extensions
Supports all 152 standard routines for single, double, complex, and double complex
Host- and device-callable interface
New in CUDA 6: XT Interface for Level 3 BLAS
Distributed computations across multiple GPUs
Out-of-core streaming to GPU, no upper limit on matrix size
“Drop-in” BLAS intercepts CPU BLAS calls, streams to GPU
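A minimal single-GPU SGEMM call through the standard cuBLAS host interface, matching the m=n=k=4096 configuration benchmarked on the next slide; initialization of the input matrices is omitted for brevity.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 4096;  // m = n = k, as in the benchmark
    float *A, *B, *C;
    cudaMalloc(&A, sizeof(float) * n * n);
    cudaMalloc(&B, sizeof(float) * n * n);
    cudaMalloc(&C, sizeof(float) * n * n);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C (column-major, no transpose)
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, A, n, B, n, &beta, C, n);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```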
cuBLAS: >3 TFLOPS single-precision, >1 TFLOPS double-precision
[Bar chart: GFLOPS for GEMM, SYMM, TRSM, and SYRK in single (S), single complex (C), double (D), and double complex (Z) precision]
• cuBLAS 6.0 on K40m, ECC ON, input and output data on device
• m=n=k=4096, transpose=no, side=right, fill=lower
Performance may vary based on OS version and motherboard configuration
cuBLAS: ZGEMM 5x Faster than MKL
[Line chart: GFLOPS vs matrix dimension (m=n=k, up to 4000), cuBLAS vs MKL]
• cuBLAS 6.0 on K40m, ECC ON, input and output data on device
• MKL 11.0.4 on Intel IvyBridge 12-core E5-2697 v2 @ 2.70GHz
Performance may vary based on OS version and motherboard configuration
cuBLAS-XT: Multi-GPU Performance Scaling (New in CUDA 6)
[Bar chart: 16K x 16K SGEMM on Tesla K10; 1 x K10: 2.2 TFLOPS, 2 x K10: 4.2 TFLOPS, 3 x K10: 6.0 TFLOPS, 4 x K10: 7.9 TFLOPS]
• cuBLAS-XT 6.0 on K10, ECC ON, input and output data on host
Performance may vary based on OS version and motherboard configuration
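A sketch of the XT interface for the same 16K x 16K SGEMM: the matrices stay in host memory and cuBLAS-XT streams tiles across the selected GPUs. The device IDs are assumptions; error checking is omitted.

```cuda
#include <cublasXt.h>
#include <stdlib.h>

int main() {
    const size_t n = 16384;  // 16K x 16K, as in the chart
    // Host-resident matrices: cuBLAS-XT streams tiles to the GPUs,
    // so matrix size is not limited by GPU memory.
    float *A = (float*)malloc(sizeof(float) * n * n);
    float *B = (float*)malloc(sizeof(float) * n * n);
    float *C = (float*)malloc(sizeof(float) * n * n);

    cublasXtHandle_t handle;
    cublasXtCreate(&handle);
    int devices[2] = {0, 1};                 // assumed GPU IDs (e.g. one K10)
    cublasXtDeviceSelect(handle, 2, devices);

    const float alpha = 1.0f, beta = 0.0f;
    cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                  n, n, n, &alpha, A, n, B, n, &beta, C, n);

    cublasXtDestroy(handle);
    free(A); free(B); free(C);
    return 0;
}
```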
cuSPARSE: Sparse Linear Algebra Routines
Optimized sparse BLAS routines: matrix-vector, matrix-matrix, triangular solve
Support for a variety of formats (CSR, COO, block variants)
[Figure: sparse matrix-vector multiply, y = alpha * A * x + beta * y]
New in CUDA 6: many improvements to triangular solvers, incomplete-LU, and Cholesky preconditioners
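The SpMV operation illustrated above (y = alpha*A*x + beta*y for a CSR matrix) looks roughly like this with the cuSPARSE API of the CUDA 6 era; the device arrays and sizes are assumed to be set up by the caller.

```cuda
#include <cusparse_v2.h>

// y = alpha * A * x + beta * y, with A stored in CSR format.
// All pointer arguments are device pointers prepared by the caller.
void spmv(cusparseHandle_t handle, int m, int n, int nnz,
          const double *csrVal, const int *csrRowPtr, const int *csrColInd,
          const double *x, double *y) {
    cusparseMatDescr_t descr;
    cusparseCreateMatDescr(&descr);  // defaults: general matrix, zero-based indexing

    const double alpha = 1.0, beta = 0.0;
    cusparseDcsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                   m, n, nnz, &alpha, descr,
                   csrVal, csrRowPtr, csrColInd, x, &beta, y);

    cusparseDestroyMatDescr(descr);
}
```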
cuSPARSE: 5x Faster than MKL
[Bar chart: speedup over MKL, sparse matrix x dense vector (SpMV), up to ~6x]
• Average of s/c/d/z routines
• cuSPARSE 6.0 on K40m, ECC ON, input and output data on device
• MKL 11.0.4 on Intel IvyBridge 12-core E5-2697 v2 @ 2.70GHz
• Matrices obtained from: http://www.cise.ufl.edu/research/sparse/matrices/
Performance may vary based on OS version and motherboard configuration
cuRAND: Random Number Generation
Generating high quality random numbers in parallel is hard
Don’t do it yourself, use a library!
Pseudo- and Quasi-RNGs
Supports several output distributions
Statistical test results in documentation
New in CUDA 6: Mersenne Twister 19937
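Using the library rather than rolling your own can be as simple as the host API: create a generator, seed it, and fill a device buffer. The sketch below uses the MT19937 generator added in CUDA 6; the buffer size and seed are illustrative.

```cuda
#include <curand.h>
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;  // assumed sample count
    double *d_out;
    cudaMalloc(&d_out, sizeof(double) * n);

    curandGenerator_t gen;
    // Host-API pseudo-RNG using the Mersenne Twister 19937 algorithm.
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MT19937);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
    curandGenerateUniformDouble(gen, d_out, n);  // uniform doubles on the device

    curandDestroyGenerator(gen);
    cudaFree(d_out);
    return 0;
}
```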
cuRAND: Up to 75x Faster vs. Intel MKL
[Bar chart: Gsamples/sec, cuRAND vs MKL, Sobol32 and MRG32k3a generators, for uniform, normal, and log-normal distributions]
• cuRAND 6.0 on K40c, ECC ON, double-precision input and output data on device
• MKL 11.0.1 on Intel SandyBridge 6-core E5-2620 @ 2.0 GHz
Performance may vary based on OS version and motherboard configuration
cuRAND: High Performance RNGs
[Bar chart: Gsamples/sec for pseudo-random generators (XORWOW, Philox, MRG32k3a, MTGP32) and quasi-random generators (Sobol32 Scrambled, Sobol64 Scrambled), for uniform, normal, and log-normal distributions]
• cuRAND 6.0 on K40m, ECC ON, double precision input and output data on device
Performance may vary based on OS version and motherboard configuration
NPP: NVIDIA Performance Primitives
Over 5000 image and signal processing routines:
color transforms, geometric transforms, move operations, linear
filters, image & signal statistics, image & signal arithmetic, JPEG
building blocks, image segmentation
New in CUDA 6: over 500 new routines, including:
median filter, BGR/YUV conversion, 3D LUT color conversion,
improvements to JPEG primitives, plus many more
NPP Speedup vs. Intel IPP
[Bar chart: Image Set (8-bit RGB): 5.7x; Image Set Channel (8-bit RGB): 14.4x; Image Resize (8-bit RGB): 17.8x; Image Gaussian Filter (32-bit float): 6.3x; Color Conversion 8-bit YUV422 to 8-bit RGB: 12.9x; JPEG 8x8 Forward DCT: 28.5x]
• NPP 6.0 on K40m, input and output data on device
• IPP 7.0 on Intel IvyBridge 12-core E5-2697 v2 @ 2.70GHz
Performance may vary based on OS version and motherboard configuration
Thrust: CUDA C++ Template Library
Template library for CUDA C++
Host and Device Containers that mimic the C++ STL
Optimized Algorithms for sort, reduce, scan, etc.
OpenMP Backend for portability
Also available on github: thrust.github.com
Allows applications and prototypes to be built quickly
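A minimal sketch of the container and algorithm style described above: fill a host vector, copy it to the device, then sort and reduce with Thrust's parallel algorithms. The element count is an illustrative assumption.

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdlib>

int main() {
    // STL-like host container, filled with random integers.
    thrust::host_vector<int> h(1 << 20);
    for (size_t i = 0; i < h.size(); ++i) h[i] = rand();

    thrust::device_vector<int> d = h;  // implicit host-to-device copy

    thrust::sort(d.begin(), d.end());                 // parallel sort on the GPU
    int sum = thrust::reduce(d.begin(), d.end(), 0);  // parallel reduction
    (void)sum;
    return 0;
}
```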
Thrust Performance vs. Intel TBB
[Bar charts: Thrust vs TBB speedup on 32M integers for reduce, transform, scan, and sort (up to ~14x); Thrust sort vs TBB on 32M samples: char 43.7x, short 24.8x, int 13.3x, long 5.2x, float 15.0x, double 5.8x]
• Thrust v1.7.1 on K40m, ECC ON, input and output data on device
• TBB 4.2 on Intel IvyBridge 12-core E5-2697 v2 @ 2.70GHz
Performance may vary based on OS version and motherboard configuration
math.h: C99 floating-point library + extras
CUDA math.h is industry proven, high performance, and accurate
• Basic: +, *, /, 1/x, sqrt, FMA (all IEEE-754 accurate for float and double, all rounding modes)
• Exponentials: exp, exp2, log, log2, log10, ...
• Trigonometry: sin, cos, tan, asin, acos, atan2, sinh, cosh, asinh, acosh, ...
• Special functions: lgamma, tgamma, erf, erfc
• Utility: fmod, remquo, modf, trunc, round, ceil, floor, fabs, ...
• Extras: rsqrt, rcbrt, exp10, sinpi, sincos[pi], cospi, erfinv, erfcinv, normcdf[inv], ...
New in CUDA 6:
• Over 80 new SIMD instructions, useful for video processing: __v*2, __v*4
• Cylindrical Bessel: cyl_i{0,1}
• 1/hypotenuse: rhypot
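The SIMD instructions operate on packed sub-word values inside a 32-bit register. A sketch of one of them, assuming `__vadd4` (per-byte addition of four packed unsigned bytes), which suits 8-bit image data:

```cuda
#include <cuda_runtime.h>

// Add two images of packed 8-bit pixels, four bytes per 32-bit word.
// __vadd4 performs four independent byte-wise additions per instruction.
__global__ void add_pixels(const unsigned int *a, const unsigned int *b,
                           unsigned int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __vadd4(a[i], b[i]);  // 4 byte-wise adds in one SIMD op
}
```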