CMSC5743 Lab 03 CUDA Tutorial Materials

Yang BAI
Department of Computer Science & Engineering
Chinese University of Hong Kong
[email protected]

October 29, 2021

Outline

1 Vector Addition

2 Tensor Core WMMA

3 General Matrix Multiplication

CUDA Programming Language

• Heterogeneous Computing

• Host and Device

• CUDA C/C++

Vector Addition

Coding Style and Organization

• Host Code Initialization

• Device Code Initialization

• Kernel Code

• Check Your Results

• Free Resources on Host

• Free Resources on Device

Host Code Initialization

Device Code Initialization

Kernel Code

Check Your Results

Free Resources on Host and Device
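
The code shown on the slides for these steps is not reproduced in this transcript. As a rough stand-in, the sketch below walks through the same organization in one file: host initialization, device initialization, kernel, result check, and freeing host and device memory. The names `vector_add` and `run_vector_add`, the block size of 256, and the input values are assumptions made for illustration and are not taken from the lab's vector_add.cu.

```cuda
#include <cmath>
#include <cstdlib>
#include <cuda_runtime.h>

// Kernel code: each thread adds one pair of elements.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];  // guard the last, partially filled block
}

// Hypothetical helper: returns true when the GPU result matches the CPU sum.
bool run_vector_add(int n) {
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    // Host code initialization
    float* h_a = static_cast<float*>(malloc(bytes));
    float* h_b = static_cast<float*>(malloc(bytes));
    float* h_c = static_cast<float*>(malloc(bytes));
    for (int i = 0; i < n; ++i) {
        h_a[i] = 1.0f * i;
        h_b[i] = 2.0f * i;
    }

    // Device code initialization
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Kernel launch: enough blocks of 256 threads to cover n elements
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // Check your results (the copy back also waits for the kernel to finish)
    bool ok = cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    for (int i = 0; i < n && ok; ++i) {
        if (fabsf(h_c[i] - (h_a[i] + h_b[i])) > 1e-5f) ok = false;
    }

    // Free resources on device, then on host
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    free(h_a);
    free(h_b);
    free(h_c);
    return ok;
}
```

Calling `run_vector_add(1 << 20)` from `main` and compiling with `nvcc` should exercise all of the steps. A production version would also check the return value of every CUDA call.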

Tensor Core WMMA

Tensor Core Overview

• Access to Tensor Core by cuBLAS/cuDNN

• Access to Tensor Core by CUTLASS

• Access to Tensor Core by TVM

Tensor Core History

• Volta Tensor Core (1st)
  • FP16 supported
  • 8 x 8 x 4 (M x N x K)

• Turing Tensor Core (2nd)
  • FP16 supported
  • 8 x 8 x 4, 16 x 8 x 8 (recommended)
  • INT8, INT4, INT1 supported
  • 8 x 8 x 16, 8 x 8 x 32, 8 x 8 x 128

• Ampere Tensor Core (3rd)
  • new bfloat16 supported, FP16
  • 16 x 8 x 8, 16 x 8 x 16
  • new TF32 supported, 16 x 8 x 4, 16 x 8 x 8
  • new Double supported, 8 x 8 x 4
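
The shapes listed for each generation appear to be per-instruction MMA shapes. The WMMA C++ API in `mma.h` instead works with warp-level tiles such as 16 x 16 x 16. The sketch below is a minimal illustration of that API with FP16 inputs and FP32 accumulation. It is not the lab's wmma.cu, and the names `wmma_16x16x16` and `run_wmma_tile` are made up for this example. It assumes a GPU with Tensor Cores and compilation for compute capability 7.0 or newer (for example `nvcc -arch=sm_75` on Turing).

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>

using namespace nvcuda;

// One warp computes C (16x16, float) = A (16x16, half) * B (16x16, half).
// A is read as row-major and B as column-major, both with leading dimension 16.
__global__ void wmma_16x16x16(const half* a, const half* b, float* c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // start from C = 0
    wmma::load_matrix_sync(a_frag, a, 16);
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C = A * B + C
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}

// Hypothetical helper: fills A and B with ones, so every element of C
// should equal K = 16. Returns true when that holds.
bool run_wmma_tile() {
    const int elems = 16 * 16;
    half h_a[elems], h_b[elems];
    float h_c[elems];
    for (int i = 0; i < elems; ++i) {
        h_a[i] = __float2half(1.0f);
        h_b[i] = __float2half(1.0f);
    }

    half *d_a, *d_b;
    float* d_c;
    cudaMalloc(&d_a, elems * sizeof(half));
    cudaMalloc(&d_b, elems * sizeof(half));
    cudaMalloc(&d_c, elems * sizeof(float));
    cudaMemcpy(d_a, h_a, elems * sizeof(half), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, elems * sizeof(half), cudaMemcpyHostToDevice);

    // WMMA operations are warp-wide, so launch exactly one warp (32 threads).
    wmma_16x16x16<<<1, 32>>>(d_a, d_b, d_c);

    bool ok = cudaMemcpy(h_c, d_c, elems * sizeof(float),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    for (int i = 0; i < elems && ok; ++i) {
        if (h_c[i] != 16.0f) ok = false;
    }

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return ok;
}
```

All 32 threads of the warp must execute each `wmma` call together. A full GEMM would loop this tile computation over K and assign one output tile per warp.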

Data types

Tesla T4 GPU Architecture

Turing 2080Ti Information

• Architecture: Turing

• SMs: 68

• CUDA Cores/SM: 64

• CUDA Cores/GPU: 4352

• Tensor Cores/SM: 8

• Tensor Cores/GPU: 544
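
The per-GPU totals follow from the per-SM figures: 68 x 64 = 4352 CUDA cores and 68 x 8 = 544 Tensor Cores. The sketch below shows how one might reproduce this arithmetic for whatever GPU is installed. The CUDA runtime reports the SM count through `cudaGetDeviceProperties`, but `cudaDeviceProp` has no field for cores per SM, so those figures are passed in by the caller. The helper names `total_units` and `print_device_summary` are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Total units on the GPU = number of SMs x units per SM.
int total_units(int sm_count, int units_per_sm) {
    return sm_count * units_per_sm;
}

// Prints a short summary for device 0. The per-SM core counts are supplied
// by the caller (e.g. 64 CUDA cores and 8 Tensor Cores per SM on Turing)
// because the runtime does not expose them.
void print_device_summary(int cuda_cores_per_sm, int tensor_cores_per_sm) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("No CUDA device found\n");
        return;
    }
    printf("%s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("CUDA cores/GPU: %d\n",
           total_units(prop.multiProcessorCount, cuda_cores_per_sm));
    printf("Tensor Cores/GPU: %d\n",
           total_units(prop.multiProcessorCount, tensor_cores_per_sm));
}
```

Calling `print_device_summary(64, 8)` on the GPU described in the list should report the same totals.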

Hierarchical Structure

• The basic triple loop nest computing matrix multiply may be blocked and tiled to match concurrency in hardware, memory locality, and parallel programming models.

• In CUTLASS, GEMM is mapped to NVIDIA GPUs with the structure illustrated by the following loop nest.

This tiled loop nest targets concurrency among

• threadblocks

• warps

• CUDA and Tensor Cores

and takes advantage of memory locality within

• shared memory

• registers
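
The loop-nest figure from the slide is not reproduced in this transcript. As a much simpler stand-in, the sketch below shows one level of the same idea: each threadblock computes one output tile, stages tiles of A and B in shared memory, and keeps its running sum in a register. It is not CUTLASS code. It has no warp-level or Tensor Core tiling, and the names `gemm_tiled` and `run_gemm_tiled` and the tile size `TILE = 16` are assumptions chosen for illustration.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // threadblock tile is TILE x TILE outputs

// C (M x N) = A (M x K) * B (K x N), all row-major.
__global__ void gemm_tiled(const float* A, const float* B, float* C,
                           int M, int N, int K) {
    __shared__ float As[TILE][TILE];  // tile of A reused by the whole block
    __shared__ float Bs[TILE][TILE];  // tile of B reused by the whole block

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;                 // per-thread accumulator in a register

    for (int k0 = 0; k0 < K; k0 += TILE) {
        int a_col = k0 + threadIdx.x;
        int b_row = k0 + threadIdx.y;
        // Out-of-range elements are loaded as zero so edge tiles stay correct.
        As[threadIdx.y][threadIdx.x] =
            (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();              // tiles fully loaded before use

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();              // finish reading before the next load
    }
    if (row < M && col < N) C[row * N + col] = acc;
}

// Hypothetical helper: runs the kernel and compares with a CPU triple loop.
bool run_gemm_tiled(int M, int N, int K) {
    std::vector<float> A(M * K), B(K * N), C(M * N, 0.0f);
    for (int i = 0; i < M * K; ++i) A[i] = (i % 7) * 0.5f;
    for (int i = 0; i < K * N; ++i) B[i] = (i % 5) - 2.0f;

    float *dA, *dB, *dC;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    gemm_tiled<<<grid, block>>>(dA, dB, dC, M, N, K);

    bool ok = cudaMemcpy(C.data(), dC, C.size() * sizeof(float),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    for (int i = 0; i < M && ok; ++i) {
        for (int j = 0; j < N && ok; ++j) {
            float ref = 0.0f;
            for (int k = 0; k < K; ++k) ref += A[i * K + k] * B[k * N + j];
            if (std::fabs(C[i * N + j] - ref) > 1e-3f) ok = false;
        }
    }
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return ok;
}
```

With this layout each element of A and B is read from global memory once per tile instead of once per multiply, which is the locality benefit that the shared-memory level is meant to provide.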

The flow of data in CUTLASS

General Matrix Multiplication

Q1 Study the code in ./Lab03-CUDA/code, which contains three folders (vector_add, gemm, wmma).

• Learn the code style and components of the vector_add.cu file

• Complete all of the code in the gemm folder

• Try to make your gemm kernel more efficient
  • shared memory
  • tiling size
  • block and thread size

Q2 Study wmma.cu in /Lab03-CUDA/code/wmma and run it successfully using the compile.sh script.

• Learn the different data types in the CUDA programming language, such as Float16 and Int8

• Learn the basics of Tensor Cores and WMMA in the CUDA programming language

• Learn the difference between FLOPs and FLOPS

• Change the tiling size in wmma.cu to obtain different TFLOPS
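
On the FLOPs versus FLOPS point in Q2: FLOPs is a count of floating-point operations, while FLOPS is a rate (operations per second). For an M x N x K GEMM the usual count is 2·M·N·K, one multiply and one add per inner-product term. The sketch below shows one way to turn a measured kernel time into TFLOPS using CUDA events. The helper names `gemm_flops`, `tflops`, and `time_ms` are hypothetical and are not part of the lab code.

```cuda
#include <cuda_runtime.h>

// FLOPs: number of floating-point operations in an M x N x K GEMM.
double gemm_flops(int M, int N, int K) {
    return 2.0 * M * N * K;
}

// FLOPS expressed in tera-operations per second:
// operation count divided by elapsed seconds, scaled by 1e12.
double tflops(double flop_count, double elapsed_ms) {
    return flop_count / (elapsed_ms * 1e-3) / 1e12;
}

// Times whatever GPU work `launch` enqueues on the default stream,
// in milliseconds, using CUDA events.
template <typename Launch>
float time_ms(Launch launch) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the recorded work has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

For example, a 1024 x 1024 x 1024 GEMM performs 2,147,483,648 FLOPs. If `time_ms` reported 1 ms for it, `tflops(gemm_flops(1024, 1024, 1024), 1.0)` would come to about 2.15 TFLOPS. Averaging over several launches after a warm-up run usually gives more stable numbers.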

THANK YOU!
