
Page 1:

GPGPU Programming Using NVIDIA CUDA

Prepared by

Lee Wai Kong

Email: wklee@utar.edu.my

Page 2:

Introduction

∗ GPU programming in the pre-CUDA age:

∗ Shader languages

∗ OpenGL

∗ Compute Unified Device Architecture (CUDA):

∗ C/C++

∗ FORTRAN

∗ Two leaders in the GPU computing world:

∗ AMD (ATI) and NVIDIA

Page 3:

Why GPU?

∗ Multicore vs. many-core

∗ 6-8 cores (CPU) vs. 16-2496 cores (GPU; the Tesla K20 has 2496)

Page 4:

Why GPU?

∗ Design philosophies:

∗ CPU: cache, sophisticated control logic, branch prediction; good for sequential and complex tasks.

∗ GPU: limited cache, simple control logic; good for simple tasks that can be run in many parallel threads.

∗ Best strategy:

∗ Heterogeneous programming model

∗ Marry both CPU and GPU.

Page 5:

Success Stories

Example applications:

Application | Contact | Speedup
Seismic Database | www.headwave.com | 66 to 100X
Mobile Phone Antenna Simulation | www.acceleware.com | 45X
Molecular Dynamics | http://www.ks.uiuc.edu/Research/vmd/ | 240X
Neuron Simulation | www.evolvedmachines.com | 100X
MRI Processing | http://bic-test.beckman.uiuc.edu/ | 245 to 415X
Atmospheric Cloud Simulation | www.cs.clemson.edu/~jesteel/clouds.html | 50X

Summary from http://www.nvidia.com/object/IO_43499.html

Page 6:

Tools and Supported Platforms

∗ What do you need to do GPU computing?

∗ A CUDA-enabled NVIDIA GPU, Visual Studio (C/C++), the CUDA SDK, and the display driver.

∗ Supported platforms:

∗ Windows

∗ Linux

∗ Mac

∗ NVIDIA provides useful profiling tools to check the bottlenecks of your applications!

Page 7:

∗ © David Kirk/NVIDIA and Wen-mei Hwu, 2007-2011

∗ ECE408/CS483, University of Illinois, Urbana-Champaign

Page 8:

Heterogeneous Programming Model

∗ Combination of CPU (host) and GPU (device) code.

∗ Generally, the CPU runs the serial code and the GPU runs the parallelizable code.

∗ The CPU can run multithreaded code as well.

Page 9:

Heterogeneous Programming Model

• The tiny program that runs in every GPU thread is called a 'kernel'.

• Multiple threads form a thread block, usually in multiples of 32 (why? see 'Warp' on the next page).

• Multiple thread blocks form a grid.
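As a quick illustration (not from the original slides), a kernel launched with many blocks usually derives a unique global index from the built-in blockIdx, blockDim, and threadIdx variables; Scale and n below are hypothetical names:

__global__ void Scale(float* data, int n)
{
    // Global index = block offset + thread offset within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // guard: the last block may have extra threads
        data[i] = 2.0f * data[i];
}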

Page 10:

Warp

∗ The multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps.

∗ When a multiprocessor is given one or more thread blocks to execute, it partitions them into warps, and each warp gets scheduled by a warp scheduler for execution.

∗ A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path.

∗ Rule of thumb: avoid divergence in kernels (see the sketch below).
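A minimal sketch (hypothetical kernels, not from the slides) contrasting a divergent branch with a warp-aligned one:

__global__ void Divergent(float* out)
{
    // Odd and even threads of the same warp take different paths,
    // so the warp executes both branches one after the other.
    if (threadIdx.x % 2 == 0)
        out[threadIdx.x] = 1.0f;
    else
        out[threadIdx.x] = 2.0f;
}

__global__ void WarpAligned(float* out)
{
    // The condition is uniform across each 32-thread warp, so every
    // warp follows a single path and no divergence occurs.
    if ((threadIdx.x / 32) % 2 == 0)
        out[threadIdx.x] = 1.0f;
    else
        out[threadIdx.x] = 2.0f;
}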

Page 11:

Single Instruction Multiple Data (SIMD)

Sequential CPU version:

void VecAdd()
{
    for (int i = 0; i < 32; i++)
    {
        C[i] = A[i] + B[i];
    }
}

CUDA version, where each of the 32 threads computes one element:

__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with 32 threads
    VecAdd<<<1, 32>>>(A, B, C);
    ...
}
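The slide elides the host-side setup; a minimal, self-contained sketch of the full workflow (assuming a vector length of 32 to match the launch above) could look like this:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    const int N = 32;
    const size_t size = N * sizeof(float);

    float hA[N], hB[N], hC[N];
    for (int i = 0; i < N; i++) { hA[i] = (float)i; hB[i] = 2.0f * i; }

    float *dA, *dB, *dC;                        // device buffers
    cudaMalloc(&dA, size);
    cudaMalloc(&dB, size);
    cudaMalloc(&dC, size);

    cudaMemcpy(dA, hA, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, size, cudaMemcpyHostToDevice);

    VecAdd<<<1, N>>>(dA, dB, dC);               // one block of 32 threads

    cudaMemcpy(hC, dC, size, cudaMemcpyDeviceToHost);
    printf("C[5] = %.1f\n", hC[5]);             // expect 15.0

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}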

Page 12:

Memory Hierarchy

∗ On-chip memories:

∗ Registers

∗ Shared memory

∗ Off-chip memories:

∗ Global memory

∗ Constant memory

∗ Texture memory

∗ Local memory

Page 13:

Memory Hierarchy

∗ Registers are the fastest memory but have limited capacity. The GTX 690 has 64K 32-bit registers per SM, with a maximum of 63 registers per thread.

∗ Register use affects the maximum number of threads that can run simultaneously. If an SM is running 2048 threads, only 32 registers per thread can be used.

∗ If a kernel uses more registers than its allowed limit, the compiler will spill the excess into "local memory" (one way to control this is sketched below).
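One way to influence this trade-off (an illustration, not from the slides) is the __launch_bounds__ qualifier, which tells the compiler the intended occupancy so it can budget registers accordingly; Saxpy is a hypothetical kernel:

// At most 256 threads per block, and at least 4 resident blocks per SM:
// the compiler limits register use per thread to make this fit.
__global__ void __launch_bounds__(256, 4) Saxpy(float* y, const float* x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

Alternatively, the nvcc flag --maxrregcount caps registers per thread for a whole compilation unit, and --ptxas-options=-v reports each kernel's actual register usage.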

Page 14:

Memory Hierarchy

∗ Shared memory is accessible by all threads within the same thread block. It is commonly used to hold temporary data so that threads within the same block can cooperate (see the sketch below).

∗ Shared memory is organized into banks that are 32 bits wide. If different threads access different addresses that fall in the same bank, a bank conflict occurs.

∗ Bank conflicts seriously degrade performance, because the memory accesses are then serialized.
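A minimal sketch of such cooperation (hypothetical kernel, assuming one block of 64 threads): each thread stages one element in shared memory, the block synchronizes, and then each thread reads back an element written by a different thread.

__global__ void ReverseBlock(float* d)
{
    __shared__ float s[64];        // visible to all 64 threads of the block
    int t = threadIdx.x;

    s[t] = d[t];                   // each thread stages one element
    __syncthreads();               // barrier: wait until every write is done

    d[t] = s[63 - t];              // read an element staged by another thread
}

// Launch: ReverseBlock<<<1, 64>>>(dData);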

Page 15:

Memory Hierarchy: Bank Conflicts

∗ For concurrent accesses, shared memory is divided into equally sized memory modules (banks) that can be accessed simultaneously.

∗ Any memory load or store of N addresses that spans N distinct memory banks can be serviced simultaneously.

∗ If multiple addresses of a memory request map to the same memory bank, the accesses are serialized: the hardware splits the request into as many separate conflict-free requests as necessary.

∗ One exception is when multiple threads in a warp address the same shared memory location, which is served as a broadcast.
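A classic fix (an illustrative sketch based on NVIDIA's matrix-transpose example; assumes a square matrix whose side is a multiple of 32 and a launch with 32x32 thread blocks) is to pad the shared-memory tile by one column, shifting each row into a different bank so that column reads no longer conflict:

__global__ void TransposeTile(const float* in, float* out, int width)
{
    // 33, not 32: the padding column staggers rows across banks,
    // making the column-wise read below conflict-free.
    __shared__ float tile[32][33];

    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced load
    __syncthreads();

    x = blockIdx.y * 32 + threadIdx.x;                    // swapped block coords
    y = blockIdx.x * 32 + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // conflict-free read
}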

Page 16:

Memory Hierarchy

∗ Global memory is the largest off-chip memory on the GPU, but it is also the slowest.

∗ It is used to store the data transferred from the host and is accessible by all threads in all SMs.

∗ Global memory needs to be accessed in a coalesced manner (128-byte transactions), or else performance degrades badly.

Page 17:

Memory Hierarchy: Coalesced Access

∗ Coalesced global memory access:

∗ When the threads of the same warp access adjacent 4-byte words (e.g., adjacent float values), a single 128 B L1 cache line, and therefore a single coalesced transaction, services that memory access.

Page 18:

Memory Hierarchy: Coalesced Access

∗ Un-coalesced global memory access:

∗ Adjacent threads access memory with a stride of 2, so two L1 cache line loads are needed to complete the access. This represents 50% efficiency in the memory load (both cases are sketched below).
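The two cases can be sketched as follows (hypothetical kernels, not from the slides):

__global__ void CopyCoalesced(const float* in, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];     // adjacent threads touch adjacent words:
                        // one 128 B transaction serves the whole warp
}

__global__ void CopyStrided(const float* in, float* out)
{
    int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    out[i] = in[i];     // stride-2: each warp spans two 128 B cache
                        // lines, so load efficiency drops to 50%
}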

Page 19:

Memory Hierarchy

∗ Constant memory is cached memory for read-only data. It is an ideal choice for storing read-only data and broadcasting it to all threads on the GPU (see the sketch below).

∗ Texture memory is bound to global memory and provides cache functionality. It is optimized for 2D spatial access patterns.

∗ Local memory resides in global memory but is cached in L1. Register spilling is determined by the compiler; the programmer has no explicit control over it.
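A minimal constant-memory sketch (illustrative; coeff and Filter are hypothetical names): all threads read the same word, which the constant cache serves as a single broadcast.

__constant__ float coeff[16];              // read-only, cached, broadcast

__global__ void Filter(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = coeff[0] * in[i];         // same word for every thread:
                                           // one broadcast per warp
}

// Host side, before the launch:
// cudaMemcpyToSymbol(coeff, hostCoeff, 16 * sizeof(float));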

Page 20:

PTX

∗ PTX (parallel thread execution) is a low-level virtual machine and instruction set architecture (ISA).

∗ PTX programs are translated at install time to the target hardware instruction set.

∗ PTX provides a stable programming model and instruction set for general-purpose parallel programming.

∗ It provides a stable ISA that spans multiple GPU generations.

Page 21:

PTX

∗ Can be treated as 'assembly' code for the GPU.

∗ Includes:

∗ Integer arithmetic (add, sub, mul, mad, min, max)

∗ Floating-point arithmetic

∗ Logic and shift operations (shl, shr, and, or, xor, not)

∗ Data movement (mov, shfl, prefetch, ld, st)

∗ Control flow (bra, call, exit)

∗ Video instructions (vadd, vadd2, vmul, vmul2, vshl, vshr, vmin, vmax)

Page 22:

PTX

∗ Seldom used to write kernel code directly, unless the kernel is very simple.

∗ Can exploit the instruction set for special purposes (e.g., combined instructions); see the sketch below.

∗ May affect how the compiler uses registers.
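For example (an illustrative sketch, not from the slides), CUDA's asm() statement embeds PTX directly in a kernel; here the mad.lo.u32 instruction fuses a multiply and an add:

__device__ unsigned int MulAdd(unsigned int a, unsigned int b, unsigned int c)
{
    unsigned int r;
    // r = low 32 bits of (a * b) + c, chosen explicitly as one instruction
    asm("mad.lo.u32 %0, %1, %2, %3;" : "=r"(r) : "r"(a), "r"(b), "r"(c));
    return r;
}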

Page 23:

Streams

∗ Overlap memory copies and kernel execution.

∗ General workflow (a sketch follows this list):

∗ Write the kernel code.

∗ Divide the input and output data into N parts (assuming you are using N streams).

∗ Start the independent memory copies first (host to device).

∗ Launch the kernels.

∗ Start the independent memory copies (device to host).
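A sketch of that workflow with four streams (illustrative; Kernel, hIn, hOut, dIn, dOut are hypothetical names, n is assumed divisible by the stream count and chunk by the block size, and the host buffers must be pinned via cudaMallocHost for the async copies to overlap):

const int NS = 4;                          // number of streams
cudaStream_t s[NS];
for (int i = 0; i < NS; i++) cudaStreamCreate(&s[i]);

int chunk = n / NS;                        // each stream handles one part
for (int i = 0; i < NS; i++) {
    int off = i * chunk;
    cudaMemcpyAsync(dIn + off, hIn + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, s[i]);      // H2D for part i
    Kernel<<<chunk / 256, 256, 0, s[i]>>>(dIn + off, dOut + off);
    cudaMemcpyAsync(hOut + off, dOut + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, s[i]);      // D2H for part i
}
for (int i = 0; i < NS; i++) cudaStreamSynchronize(s[i]);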

Page 24:

Streams

Page 25:

Generations of NVIDIA GPU

∗ Differentiated by 'compute capability':

∗ 1.0: Tesla architecture

∗ 2.0: Fermi architecture

∗ 3.0: Kepler architecture
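The compute capability of an installed card can be queried at run time (a minimal sketch using the CUDA runtime API):

#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);     // properties of device 0
    // major.minor is the compute capability, e.g., 3.0 for Kepler
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}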

Page 26:

Thank You!

Q & A