TensorCore and Tensorization

Siyuan Feng
Dec 5, 2019


Page 1: TensorCore and Tensorization

TensorCore and Tensorization

Dec 5, 2019
Siyuan Feng

Page 2: TensorCore and Tensorization

Contents

1. TensorCore Introduction
2. TensorCore Support in TVM
3. Future Work


Page 4: TensorCore and Tensorization

What are TensorCores


Page 5: TensorCore and Tensorization

Warp-Level Operation

wmma::fill_fragment(Cmat, 0.0f);

A warp is 32 threads: every wmma operation is executed cooperatively by all 32 threads of a warp.

Page 6: TensorCore and Tensorization

Programming TensorCore

__device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c) {
    // Create fragments (full template parameters added here; the slide abbreviates them)
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> Bmat;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;

    // Load fragments
    wmma::load_matrix_sync(Amat, a, 16);
    wmma::load_matrix_sync(Bmat, b, 16);

    // Perform MatMul
    wmma::fill_fragment(Cmat, 0.0f);
    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

    // Store results
    wmma::store_matrix_sync(d, Cmat, 16, wmma::row_major);
}

A 16x16x16 MatMul in four steps: create fragments, load fragments, perform MatMul, store results.
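For context, a minimal sketch of how this device function can be driven; the kernel name, includes, and single-warp launch shape are illustrative assumptions, not from the slides:

#include <mma.h>
using namespace nvcuda;

__global__ void wmma_kernel(float *d, half *a, half *b, float *c) {
    // One warp (32 threads) cooperatively performs the whole 16x16x16 MMA.
    tensor_op_16_16_16(d, a, b, c);
}

// Launched with a single warp:
//   wmma_kernel<<<1, 32>>>(d, a, b, c);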

Page 7: TensorCore and Tensorization

TensorCore Summary

• TensorCores are dedicated hardware units that accelerate matrix multiply-accumulate

• They are programmed as a warp-level operation: all 32 threads of a warp take part

• They introduce a new memory scope, the fragment


Page 9: TensorCore and Tensorization

Steps for TensorCore Support in TVM

1. Memory Scope
2. Create Schedule
3. Tensorization

Page 10: TensorCore and Tensorization

Current Memory Scope

(Figure: the memory scopes currently available in a TVM GPU schedule)

Page 11: TensorCore and Tensorization

Special Memory Scope

(Figure: the special fragment memory scopes added for TensorCore, alongside the existing scopes)

Page 12: TensorCore and Tensorization

Traditional GPU Memory Scope Order

Global → Shared → Local → Global
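As a rough illustration (an assumed example, not code from the slides), a conventional CUDA kernel walks exactly this path through the scopes:

__global__ void traditional_scopes(float *out, const float *in) {
    __shared__ float tile[256];                 // Shared scope
    int idx = blockIdx.x * 256 + threadIdx.x;
    tile[threadIdx.x] = in[idx];                // Global -> Shared
    __syncthreads();
    float r = tile[threadIdx.x] * 2.0f;         // Shared -> Local (register), compute
    out[idx] = r;                               // Local -> Global
}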

Page 13: TensorCore and Tensorization

Enhanced TensorCore Memory Scope Order

Global → Shared → Fragment → Global

The fragment scope sits between shared memory and the TensorCore computation, taking the role that local (register) memory plays in the traditional pipeline.
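Again as an assumed sketch (simplified relative to real generated code), the enhanced path stages data in shared memory, loads it into fragments, computes there, and stores straight back to global:

#include <mma.h>
using namespace nvcuda;

__global__ void fragment_scopes(float *d, const half *a, const half *b) {
    __shared__ half a_sh[256], b_sh[256];
    // Global -> Shared: the warp's 32 lanes stage the 16x16 tiles
    for (int i = threadIdx.x; i < 256; i += 32) {
        a_sh[i] = a[i];
        b_sh[i] = b[i];
    }
    __syncthreads();

    // Shared -> Fragment
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> Bmat;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;
    wmma::load_matrix_sync(Amat, a_sh, 16);
    wmma::load_matrix_sync(Bmat, b_sh, 16);

    // Compute entirely in fragments
    wmma::fill_fragment(Cmat, 0.0f);
    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

    // Fragment -> Global
    wmma::store_matrix_sync(d, Cmat, 16, wmma::row_major);
}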

Page 14: TensorCore and Tensorization

Warp Level Schedule

blockDim.x = warp_size = 32

Page 15: TensorCore and Tensorization

Warp Level Schedule

blockDim.x = warp_size = 32

(Figure: the thread block forms a blockDim.y × blockDim.z grid of warps; blockDim.x is bound to the 32 lanes within each warp)
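A small assumed sketch of the indexing this layout implies: with blockDim.x fixed at 32, threadIdx.y and threadIdx.z select a warp and threadIdx.x selects the lane:

__global__ void warp_grid_kernel() {
    int lane   = threadIdx.x;  // 0..31: position within the warp
    int warp_m = threadIdx.y;  // warp's row in the block's warp grid
    int warp_n = threadIdx.z;  // warp's column in the block's warp grid
    // Each (warp_m, warp_n) warp then computes its own 16x16 output tiles.
}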

Page 16: TensorCore and Tensorization

Tensorization

for (i, 0, 16) {
  for (j, 0, 16) {
    for (k, 0, 16) {
      C[i*16 + j] = C[i*16 + j] + (float32(A[i*16 + k]) * float32(B[k*16 + j]))
    }
  }
}

Tensorization pattern-matches this 16x16x16 MatMul loop nest and replaces it with a single tensor intrinsic call:

tvm_mma_sync(C, 0, A, 0, B, 0, C, 0);
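For intuition, an assumed CUDA-level view of the rewrite (simplified; the actual generated code differs):

#include <cuda_fp16.h>

// Before tensorization: one scalar fused multiply-add per (i, j, k) step.
__device__ void matmul_scalar(float *C, const half *A, const half *B) {
    for (int i = 0; i < 16; ++i)
        for (int j = 0; j < 16; ++j)
            for (int k = 0; k < 16; ++k)
                C[i*16 + j] += __half2float(A[i*16 + k]) * __half2float(B[k*16 + j]);
}

// After tensorization: tvm_mma_sync(C, 0, A, 0, B, 0, C, 0) maps to a single
// warp-cooperative TensorCore instruction on the corresponding fragments:
//   wmma::mma_sync(Cmat, Amat, Bmat, Cmat);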

Page 17: TensorCore and Tensorization

Performance Improvements over non-TensorCore

Speedup of TVM w/ TensorCores, normalized to TVM w/o TensorCores (= 1x):

  Large MatMul   4.87x
  BatchConv      5.17x
  Small MatMul   5.02x
  BatchMatMul    4.97x

Page 18: TensorCore and Tensorization

Performance Comparison vs CuDNN

Speedup of TVM w/ TensorCores, normalized to CuDNN w/ TensorCores (= 1x):

  Large MatMul   0.76x
  BatchConv      0.83x
  Small MatMul   1.16x
  BatchMatMul    1.44x

Comparable on traditional workloads.

Page 19: TensorCore and Tensorization

Performance Comparison vs CuDNN (continued)

Same chart as the previous slide: 1.4x on emerging workloads (BERT).

Page 20: TensorCore and Tensorization

TVM TensorCore Support Summary

• Massive speedup over non-TensorCore code

• Competitive performance with CuDNN

• Built on TVM's tensor intrinsic mechanism (tensorization)


Page 22: TensorCore and Tensorization

Future Work

1. Use TensorCore in TOPI and Relay

2. Apply TensorCore to popular ML models, such as BERT

Page 23: TensorCore and Tensorization

Thank you

Dec 5, 2019
Siyuan Feng