TRANSCRIPT
Jeff Larkin, December 04, 2018
USING VOLTA TENSOR CORES
VOLTA TENSOR CORE
TENSOR CORE
Mixed Precision Matrix Math: 4x4 matrices

D = A B + C

A and B are FP16 4x4 matrices; the accumulator matrices C and D are 4x4 and may be FP16 or FP32.
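Element-wise, the 4x4 operation above is the usual multiply-accumulate (indices as in the pictured matrices):

```latex
D_{i,j} = \sum_{k=0}^{3} A_{i,k}\,B_{k,j} + C_{i,j}, \qquad i,j \in \{0,1,2,3\}
```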
TENSOR CORE COORDINATION
Warp-synchronizing operation for cooperative matrix math
The full warp performs 16x16 matrix math: an aggregate matrix multiply and accumulate on 16x16 matrices, with the result distributed across the warp
VOLTA TENSOR OPERATION
Operands are stored and input as FP16; each product is formed at full precision and summed into an FP32 accumulator together with further products; the result is converted to FP32.
Also supports an FP16 accumulator mode for inferencing.
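As a formula, the pipeline above rounds only at the FP32 accumulator, with each product formed at full precision (a sketch of the semantics, not the exact hardware rounding behavior):

```latex
d = \operatorname{round}_{\mathrm{FP32}}\!\Bigl(\,\sum_{k} a_k \times b_k \;+\; c\Bigr),
\qquad a_k,\, b_k \in \mathrm{FP16}
```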
A GIANT LEAP FOR DEEP LEARNING
[Chart: relative performance, cuBLAS mixed-precision matrix multiply (FP16 input, FP32 compute), M=N=K=2048]
V100 with Tensor Cores (CUDA 9): 9.3x faster than P100 (CUDA 8)
V100 measured on pre-production hardware.
USING TENSOR CORES
Volta Optimized Frameworks and Libraries: NVIDIA cuDNN, cuBLAS, TensorRT

CUDA C++ Warp-Level Matrix Operations:

__device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c)
{
    wmma::fragment<matrix_a, …> Amat;
    wmma::fragment<matrix_b, …> Bmat;
    wmma::fragment<matrix_c, …> Cmat;

    wmma::load_matrix_sync(Amat, a, 16);
    wmma::load_matrix_sync(Bmat, b, 16);
    wmma::fill_fragment(Cmat, 0.0f);

    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

    wmma::store_matrix_sync(d, Cmat, 16, wmma::row_major);
}
TENSOR CORES FROM CUBLAS
cuBLAS provides an extended BLAS interface for mixed precision.
cublasGemmEx(), cublasGemmBatchedEx(), and cublasGemmStridedBatchedEx() all accept FP16 data types for Tensor Core use.
Some solvers are also available in an Ex form.
See https://docs.nvidia.com/cuda/cublas/index.html#cublas-GemmEx
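A minimal host-side sketch of the Ex interface described above, assuming device buffers d_A and d_B (FP16) and d_C (FP32) are already allocated and filled; error checking is omitted for brevity:

```cuda
#include <cuda_fp16.h>
#include <cublas_v2.h>

// Computes C = alpha*A*B + beta*C for n x n matrices, with FP16 inputs
// and FP32 compute/output: the mixed-precision Tensor Core path.
void gemm_fp16_fp32(const half *d_A, const half *d_B, float *d_C, int n)
{
    cublasHandle_t handle;
    cublasCreate(&handle);
    // Opt in to Tensor Core math (CUDA 9/10-era API).
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);

    const float alpha = 1.0f, beta = 0.0f;
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 n, n, n,
                 &alpha,
                 d_A, CUDA_R_16F, n,   // A: FP16 input
                 d_B, CUDA_R_16F, n,   // B: FP16 input
                 &beta,
                 d_C, CUDA_R_32F, n,   // C: FP32 accumulation/output
                 CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);

    cublasDestroy(handle);
}
```

Whether Tensor Cores are actually used also depends on dimensions and leading-dimension alignment; see the cuBLAS documentation linked above.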
CUDA TENSOR CORE PROGRAMMING
New WMMA datatypes
wmma::fragment<matrix_a, …> Amat;
Per-Thread fragments to hold components of matrices for use with Tensor Cores
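The "…" on the slide stands for the remaining template parameters. For the 16x16x16 FP16 shape, the full declarations look roughly like the following, assuming the shipping nvcuda::wmma API (where the accumulator fragment is spelled wmma::accumulator rather than matrix_c):

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__device__ void declare_fragments()
{
    // Operand fragments carry the tile shape (M, N, K), the element type,
    // and a memory layout; the accumulator fragment omits the layout.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> Bmat;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;
    (void)Amat; (void)Bmat; (void)Cmat; // silence unused warnings in this sketch
}
```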
CUDA TENSOR CORE PROGRAMMING
New WMMA load and store operations

wmma::load_matrix_sync(Amat, a, stride);

Warp-level operation to fetch components of matrices into fragments
CUDA TENSOR CORE PROGRAMMING
New WMMA Matrix Multiply and Accumulate Operation
wmma::mma_sync(Dmat, Amat, Bmat, Cmat);
Warp-level operation to perform matrix multiply and accumulate
D = A B + C
CUDA TENSOR CORE PROGRAMMING
New WMMA load and store operations

wmma::store_matrix_sync(d, Dmat, stride);

Warp-level operation to store the result matrix from fragments back to memory
TENSOR CORE EXAMPLE
CUDA C++ Warp-Level Matrix Operations

__device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c)
{
    // Create fragments
    wmma::fragment<matrix_a, …> Amat;
    wmma::fragment<matrix_b, …> Bmat;
    wmma::fragment<matrix_c, …> Cmat;

    // Initialize fragments
    wmma::load_matrix_sync(Amat, a, 16);
    wmma::load_matrix_sync(Bmat, b, 16);
    wmma::fill_fragment(Cmat, 0.0f);

    // Perform matrix multiply
    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

    // Store results
    wmma::store_matrix_sync(d, Cmat, 16, wmma::row_major);
}
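Filling in the elided template parameters gives a complete, launchable version of the example for a single 16x16 tile. This is a sketch under the shipping nvcuda::wmma API, where the accumulator fragment is wmma::accumulator and the store layout argument is wmma::mem_row_major; tiling over larger matrices and error checking are omitted:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes D = A * B for a single 16x16 tile of FP16 inputs,
// accumulating in FP32. All 32 threads of the warp must reach each
// wmma call together.
__global__ void tensor_op_kernel(float *d, const half *a, const half *b)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> Bmat;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;

    wmma::load_matrix_sync(Amat, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(Bmat, b, 16);
    wmma::fill_fragment(Cmat, 0.0f);       // start from a zero accumulator

    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

    wmma::store_matrix_sync(d, Cmat, 16, wmma::mem_row_major);
}

// Launch with exactly one warp; d, a, b are device pointers:
// tensor_op_kernel<<<1, 32>>>(d, a, b);
```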