TRANSCRIPT
Jeff Larkin, December 04, 2018
USING VOLTA TENSOR CORES
VOLTA TENSOR CORE
TENSOR CORE
Mixed Precision Matrix Math: 4x4 matrices

D = A B + C

A and B are FP16 4x4 matrices; the accumulator matrices C and D are 4x4 and may be FP16 or FP32.
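Element-wise, the 4x4 operation above is the usual multiply-accumulate (indices as in the pictured matrices):

```latex
D_{i,j} = \sum_{k=0}^{3} A_{i,k}\,B_{k,j} + C_{i,j}, \qquad i,j \in \{0,1,2,3\}
```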
TENSOR CORE COORDINATION
Warp-synchronizing operation for cooperative matrix math
The full warp performs 16x16 matrix math: an aggregate matrix multiply and accumulate on 16x16 matrices, with the result distributed across the warp
VOLTA TENSOR OPERATION
Operands are stored and input as FP16; each product is formed at full precision and summed into an FP32 accumulator together with further products; the result is converted to FP32.
Also supports an FP16 accumulator mode for inferencing.
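As a formula, the pipeline above rounds only at the FP32 accumulator, with each product formed at full precision (a sketch of the semantics, not the exact hardware rounding behavior):

```latex
d = \operatorname{round}_{\mathrm{FP32}}\!\Bigl(\,\sum_{k} a_k \times b_k \;+\; c\Bigr),
\qquad a_k,\, b_k \in \mathrm{FP16}
```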
A GIANT LEAP FOR DEEP LEARNING
[Chart: relative performance, cuBLAS mixed-precision matrix multiply (FP16 input, FP32 compute), M=N=K=2048]
V100 with Tensor Cores (CUDA 9): 9.3x faster than P100 (CUDA 8)
V100 measured on pre-production hardware.
USING TENSOR CORES
Volta Optimized Frameworks and Libraries: NVIDIA cuDNN, cuBLAS, TensorRT

CUDA C++ Warp-Level Matrix Operations:

__device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c)
{
    wmma::fragment<matrix_a, …> Amat;
    wmma::fragment<matrix_b, …> Bmat;
    wmma::fragment<matrix_c, …> Cmat;

    wmma::load_matrix_sync(Amat, a, 16);
    wmma::load_matrix_sync(Bmat, b, 16);
    wmma::fill_fragment(Cmat, 0.0f);

    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

    wmma::store_matrix_sync(d, Cmat, 16, wmma::row_major);
}
TENSOR CORES FROM CUBLAS
cuBLAS provides an extended BLAS interface for mixed precision.
cublasGemmEx(), cublasGemmBatchedEx(), and cublasGemmStridedBatchedEx() all accept FP16 data types for Tensor Core use.
Some solvers are also available in an Ex form.
See https://docs.nvidia.com/cuda/cublas/index.html#cublas-GemmEx
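A minimal host-side sketch of the Ex interface described above, assuming device buffers d_A and d_B (FP16) and d_C (FP32) are already allocated and filled; error checking is omitted for brevity:

```cuda
#include <cuda_fp16.h>
#include <cublas_v2.h>

// Computes C = alpha*A*B + beta*C for n x n matrices, with FP16 inputs
// and FP32 compute/output: the mixed-precision Tensor Core path.
void gemm_fp16_fp32(const half *d_A, const half *d_B, float *d_C, int n)
{
    cublasHandle_t handle;
    cublasCreate(&handle);
    // Opt in to Tensor Core math (CUDA 9/10-era API).
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);

    const float alpha = 1.0f, beta = 0.0f;
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 n, n, n,
                 &alpha,
                 d_A, CUDA_R_16F, n,   // A: FP16 input
                 d_B, CUDA_R_16F, n,   // B: FP16 input
                 &beta,
                 d_C, CUDA_R_32F, n,   // C: FP32 accumulation/output
                 CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);

    cublasDestroy(handle);
}
```

Whether Tensor Cores are actually used also depends on dimensions and leading-dimension alignment; see the cuBLAS documentation linked above.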
CUDA TENSOR CORE PROGRAMMING
New WMMA datatypes
wmma::fragment<matrix_a, …> Amat;
Per-Thread fragments to hold components of matrices for use with Tensor Cores
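The "…" on the slide stands for the remaining template parameters. For the 16x16x16 FP16 shape, the full declarations look roughly like the following, assuming the shipping nvcuda::wmma API (where the accumulator fragment is spelled wmma::accumulator rather than matrix_c):

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__device__ void declare_fragments()
{
    // Operand fragments carry the tile shape (M, N, K), the element type,
    // and a memory layout; the accumulator fragment omits the layout.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> Bmat;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;
    (void)Amat; (void)Bmat; (void)Cmat; // silence unused warnings in this sketch
}
```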
CUDA TENSOR CORE PROGRAMMING
New WMMA load and store operations

wmma::load_matrix_sync(Amat, a, stride);

Warp-level operation to fetch components of matrices into fragments
CUDA TENSOR CORE PROGRAMMING
New WMMA Matrix Multiply and Accumulate Operation
wmma::mma_sync(Dmat, Amat, Bmat, Cmat);
Warp-level operation to perform matrix multiply and accumulate
D = A B + C
CUDA TENSOR CORE PROGRAMMING
New WMMA load and store operations

wmma::store_matrix_sync(d, Dmat, stride);

Warp-level operation to store the result matrix from fragments back to memory
TENSOR CORE EXAMPLE
CUDA C++ Warp-Level Matrix Operations

__device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c)
{
    // Create fragments
    wmma::fragment<matrix_a, …> Amat;
    wmma::fragment<matrix_b, …> Bmat;
    wmma::fragment<matrix_c, …> Cmat;

    // Initialize fragments
    wmma::load_matrix_sync(Amat, a, 16);
    wmma::load_matrix_sync(Bmat, b, 16);
    wmma::fill_fragment(Cmat, 0.0f);

    // Perform matrix multiply
    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

    // Store results
    wmma::store_matrix_sync(d, Cmat, 16, wmma::row_major);
}
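Filling in the elided template parameters gives a complete, launchable version of the example for a single 16x16 tile. This is a sketch under the shipping nvcuda::wmma API, where the accumulator fragment is wmma::accumulator and the store layout argument is wmma::mem_row_major; tiling over larger matrices and error checking are omitted:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes D = A * B for a single 16x16 tile of FP16 inputs,
// accumulating in FP32. All 32 threads of the warp must reach each
// wmma call together.
__global__ void tensor_op_kernel(float *d, const half *a, const half *b)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> Bmat;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;

    wmma::load_matrix_sync(Amat, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(Bmat, b, 16);
    wmma::fill_fragment(Cmat, 0.0f);       // start from a zero accumulator

    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

    wmma::store_matrix_sync(d, Cmat, 16, wmma::mem_row_major);
}

// Launch with exactly one warp; d, a, b are device pointers:
// tensor_op_kernel<<<1, 32>>>(d, a, b);
```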