TensorCore and Tensorization

Siyuan Feng
Dec 5, 2019


Page 1: TensorCore and Tensorization

TensorCore and Tensorization

Dec 5, 2019
Siyuan Feng

Page 2: TensorCore and Tensorization

Contents

1. TensorCore Introduction
2. TensorCore Support in TVM
3. Future Work


Page 4: TensorCore and Tensorization

What are TensorCores


Page 5: TensorCore and Tensorization

Warp-Level Operation

wmma::fill_fragment(Cmat, 0.0f);

A warp is 32 threads: every wmma operation is executed cooperatively by all 32 threads of a warp.

Page 6: TensorCore and Tensorization

Programming TensorCore

__device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c) {
    // Create fragments (full template parameters added here; the slide abbreviates them)
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> Bmat;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;

    // Load fragments
    wmma::load_matrix_sync(Amat, a, 16);
    wmma::load_matrix_sync(Bmat, b, 16);

    // Perform MatMul
    wmma::fill_fragment(Cmat, 0.0f);
    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

    // Store results
    wmma::store_matrix_sync(d, Cmat, 16, wmma::row_major);
}

A 16x16x16 MatMul in four steps: create fragments, load fragments, perform MatMul, store results.
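For context, a minimal sketch of how this device function can be driven; the kernel name, includes, and single-warp launch shape are illustrative assumptions, not from the slides:

#include <mma.h>
using namespace nvcuda;

__global__ void wmma_kernel(float *d, half *a, half *b, float *c) {
    // One warp (32 threads) cooperatively performs the whole 16x16x16 MMA.
    tensor_op_16_16_16(d, a, b, c);
}

// Launched with a single warp:
//   wmma_kernel<<<1, 32>>>(d, a, b, c);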

Page 7: TensorCore and Tensorization

TensorCore Summary

• TensorCores are dedicated hardware units that accelerate matrix multiply-accumulate

• They are programmed as a warp-level operation: all 32 threads of a warp take part

• They introduce a new memory scope, the fragment


Page 9: TensorCore and Tensorization

Steps for TensorCore Support in TVM

1. Memory Scope
2. Create Schedule
3. Tensorization

Page 10: TensorCore and Tensorization

Current Memory Scope

(Figure: the memory scopes currently available in a TVM GPU schedule)

Page 11: TensorCore and Tensorization

Special Memory Scope

(Figure: the special fragment memory scopes added for TensorCore, alongside the existing scopes)

Page 12: TensorCore and Tensorization

Traditional GPU Memory Scope Order

Global → Shared → Local → Global
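As a rough illustration (an assumed example, not code from the slides), a conventional CUDA kernel walks exactly this path through the scopes:

__global__ void traditional_scopes(float *out, const float *in) {
    __shared__ float tile[256];                 // Shared scope
    int idx = blockIdx.x * 256 + threadIdx.x;
    tile[threadIdx.x] = in[idx];                // Global -> Shared
    __syncthreads();
    float r = tile[threadIdx.x] * 2.0f;         // Shared -> Local (register), compute
    out[idx] = r;                               // Local -> Global
}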

Page 13: TensorCore and Tensorization

Enhanced TensorCore Memory Scope Order

Global → Shared → Fragment → Global

The fragment scope sits between shared memory and the TensorCore computation, taking the role that local (register) memory plays in the traditional pipeline.
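Again as an assumed sketch (simplified relative to real generated code), the enhanced path stages data in shared memory, loads it into fragments, computes there, and stores straight back to global:

#include <mma.h>
using namespace nvcuda;

__global__ void fragment_scopes(float *d, const half *a, const half *b) {
    __shared__ half a_sh[256], b_sh[256];
    // Global -> Shared: the warp's 32 lanes stage the 16x16 tiles
    for (int i = threadIdx.x; i < 256; i += 32) {
        a_sh[i] = a[i];
        b_sh[i] = b[i];
    }
    __syncthreads();

    // Shared -> Fragment
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> Bmat;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;
    wmma::load_matrix_sync(Amat, a_sh, 16);
    wmma::load_matrix_sync(Bmat, b_sh, 16);

    // Compute entirely in fragments
    wmma::fill_fragment(Cmat, 0.0f);
    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

    // Fragment -> Global
    wmma::store_matrix_sync(d, Cmat, 16, wmma::row_major);
}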

Page 14: TensorCore and Tensorization

Warp Level Schedule

blockDim.x = warp_size = 32

Page 15: TensorCore and Tensorization

Warp Level Schedule

blockDim.x = warp_size = 32

(Figure: the thread block forms a blockDim.y × blockDim.z grid of warps; blockDim.x is bound to the 32 lanes within each warp)
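A small assumed sketch of the indexing this layout implies: with blockDim.x fixed at 32, threadIdx.y and threadIdx.z select a warp and threadIdx.x selects the lane:

__global__ void warp_grid_kernel() {
    int lane   = threadIdx.x;  // 0..31: position within the warp
    int warp_m = threadIdx.y;  // warp's row in the block's warp grid
    int warp_n = threadIdx.z;  // warp's column in the block's warp grid
    // Each (warp_m, warp_n) warp then computes its own 16x16 output tiles.
}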

Page 16: TensorCore and Tensorization

Tensorization

for (i, 0, 16) {
  for (j, 0, 16) {
    for (k, 0, 16) {
      C[i*16 + j] = C[i*16 + j] + (float32(A[i*16 + k]) * float32(B[k*16 + j]))
    }
  }
}

Tensorization pattern-matches this 16x16x16 MatMul loop nest and replaces it with a single tensor intrinsic call:

tvm_mma_sync(C, 0, A, 0, B, 0, C, 0);
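For intuition, an assumed CUDA-level view of the rewrite (simplified; the actual generated code differs):

#include <cuda_fp16.h>

// Before tensorization: one scalar fused multiply-add per (i, j, k) step.
__device__ void matmul_scalar(float *C, const half *A, const half *B) {
    for (int i = 0; i < 16; ++i)
        for (int j = 0; j < 16; ++j)
            for (int k = 0; k < 16; ++k)
                C[i*16 + j] += __half2float(A[i*16 + k]) * __half2float(B[k*16 + j]);
}

// After tensorization: tvm_mma_sync(C, 0, A, 0, B, 0, C, 0) maps to a single
// warp-cooperative TensorCore instruction on the corresponding fragments:
//   wmma::mma_sync(Cmat, Amat, Bmat, Cmat);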

Page 17: TensorCore and Tensorization

Performance Improvements over non-TensorCore

Speedup of TVM w/ TensorCores, normalized to TVM w/o TensorCores (= 1x):

  Large MatMul   4.87x
  BatchConv      5.17x
  Small MatMul   5.02x
  BatchMatMul    4.97x

Page 18: TensorCore and Tensorization

Performance Comparison vs CuDNN

Speedup of TVM w/ TensorCores, normalized to CuDNN w/ TensorCores (= 1x):

  Large MatMul   0.76x
  BatchConv      0.83x
  Small MatMul   1.16x
  BatchMatMul    1.44x

Comparable on traditional workloads.

Page 19: TensorCore and Tensorization

Performance Comparison vs CuDNN (continued)

Same chart as the previous slide: 1.4x on emerging workloads (BERT).

Page 20: TensorCore and Tensorization

TVM TensorCore Support Summary

• Massive speedup over non-TensorCore code

• Competitive performance with CuDNN

• Built on TVM's tensor intrinsic mechanism (tensorization)


Page 22: TensorCore and Tensorization

Future Work

1. Use TensorCore in TOPI and Relay

2. Apply TensorCore to popular ML models, such as BERT

Page 23: TensorCore and Tensorization

Thank you

Dec 5, 2019
Siyuan Feng