
Page 1:

Parallel Programming Basics

Things we need to consider:

Control

Synchronization

Communication

Parallel programming languages offer different ways of dealing with the above


Page 2:

CUDA Devices and Threads

A compute device
  Is a coprocessor to the CPU, or host
  Has its own DRAM (device memory)
  Runs many threads in parallel
  Is typically a GPU but can also be another type of parallel processing device

Data-parallel portions of an application are expressed as device kernels which run on many threads

Differences between GPU and CPU threads
  GPU threads are extremely lightweight
    Very little creation overhead
  GPU needs 1000s of threads for full efficiency
    Multi-core CPU needs only a few

Page 3:

Extended C

Type qualifiers: __global__, __device__, __shared__, __local__, __constant__

Keywords: threadIdx, blockIdx

Intrinsics: __syncthreads()

Runtime API: memory, symbol, and execution management; function launch

__device__ float filter[N];

__global__ void convolve (float *image) {
    __shared__ float region[M];
    ...
    region[threadIdx.x] = image[i];
    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory
float *myimage;
cudaMalloc((void **)&myimage, bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>>(myimage);
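To show the memory and execution management the snippet elides, here is a hedged host-side sketch of the full round trip; the image size, the use of malloc for the host copy, and the synchronization call are illustrative assumptions rather than part of the original example.

// Hedged sketch: host-side driver for the convolve kernel above.
#include <cuda_runtime.h>
#include <stdlib.h>

#define IMG_SIZE 1000                   // assumed: 100 blocks * 10 threads, one pixel each

int main() {
    size_t bytes = IMG_SIZE * sizeof(float);
    float *h_image = (float *)malloc(bytes);     // host copy of the image
    float *d_image = NULL;

    cudaMalloc((void **)&d_image, bytes);                         // allocate global memory
    cudaMemcpy(d_image, h_image, bytes, cudaMemcpyHostToDevice);  // host -> device

    convolve<<<100, 10>>>(d_image);                               // launch the kernel
    cudaDeviceSynchronize();                                      // wait for it to finish

    cudaMemcpy(h_image, d_image, bytes, cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(d_image);
    free(h_image);
    return 0;
}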

Page 4:


Arrays of Parallel Threads

• A CUDA kernel is executed by an array of threads
  – All threads run the same code (SPMD)
  – Each thread has an ID that it uses to compute memory addresses and make control decisions

[Figure: an array of eight threads, threadID 0 through 7, each executing:
  float x = input[threadID];
  float y = func(x);
  output[threadID] = y;]
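A runnable version of the figure's snippet might look like the hedged sketch below; it assumes a single block, so threadIdx.x serves directly as the threadID, and func is a placeholder for any per-element computation.

// Hedged sketch of the SPMD pattern in the figure: all threads run the same
// code, and each uses its own threadID to pick the element it works on.
__device__ float func(float x) { return 2.0f * x; }   // placeholder computation

__global__ void apply_func(const float *input, float *output) {
    int threadID = threadIdx.x;     // unique ID within the (single) block
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
}

// Matching the figure, launch one block of 8 threads:
//   apply_func<<<1, 8>>>(d_input, d_output);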

Page 5:

Thread Blocks: Scalable Cooperation

Divide the monolithic thread array into multiple blocks

Threads within a block cooperate via shared memory, atomic operations and barrier synchronization

Threads in different blocks cannot cooperate


[Figure: the thread array divided into Thread Block 0, Thread Block 1, ..., Thread Block N-1; within each block, threads with threadID 0 through 7 run the same code as in the previous figure.]
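As a concrete illustration of intra-block cooperation (not taken from the slides), the hedged sketch below stages data in shared memory and uses a barrier before any thread reads an element loaded by a neighbour; the kernel name and block size are assumptions.

// Hedged sketch: threads in one block cooperate via shared memory and a barrier.
#define BLOCK_SIZE 256                          // assumed threads per block

__global__ void shift_left(const float *input, float *output, int n) {
    __shared__ float tile[BLOCK_SIZE];          // visible only to this block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) tile[threadIdx.x] = input[i];    // each thread loads one element
    __syncthreads();                            // barrier: whole tile is now loaded

    // Safe to read an element that a neighbouring thread loaded.
    if (i < n && threadIdx.x + 1 < blockDim.x)
        output[i] = tile[threadIdx.x + 1];
}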

Page 6:

Block IDs and Thread IDs

Each thread uses IDs to decide what data to work on
  Block ID: 1D or 2D
  Thread ID: 1D, 2D, or 3D

Simplifies memory addressing when processing multidimensional data (as sketched below)
  Image processing
  Solving PDEs on volumes
  …

[Figure 3.2. An Example of CUDA Thread Organization. The Host launches Kernel 1 on Grid 1, a 2x2 arrangement of blocks (Block (0,0), Block (1,0), Block (0,1), Block (1,1)), and Kernel 2 on Grid 2 on the Device. Block (1,1) is expanded to show its threads, Thread (0,0,0) through Thread (3,1,1). Courtesy: NVIDIA]
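As an example of the image-processing case, a hedged sketch of 2D indexing (the kernel name and brightness operation are illustrative): each thread combines its 2D block and thread IDs into pixel coordinates.

// Hedged sketch: 2D block and thread IDs mapped to pixel coordinates.
__global__ void brighten(float *image, int width, int height, float delta) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < width && y < height)
        image[y * width + x] += delta;               // row-major addressing
}

// Possible launch: a 2D grid of 16x16-thread blocks covering the image.
//   dim3 block(16, 16);
//   dim3 grid((width + 15) / 16, (height + 15) / 16);
//   brighten<<<grid, block>>>(d_image, width, height, 0.1f);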


Page 7:

CUDA Memory Model Overview

Global memory

Main means of communicating R/W data between host and device

Contents visible to all threads

Long latency access

We will focus on global memory for now

Constant and texture memory will come later


[Figure: CUDA memory model. The Host reads and writes Global Memory on the device. Within the Grid, each Block (e.g. Block (0,0) and Block (1,0)) has its own Shared Memory, and each Thread (e.g. Thread (0,0) and Thread (1,0)) has its own Registers.]
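To tie the figure to code, a hedged sketch (names and sizes are illustrative) mapping each memory space in the diagram to a declaration:

// Hedged sketch: where each variable in the memory-model figure lives.
__global__ void memory_spaces(float *data) {    // data points into global memory
    __shared__ float buffer[128];               // shared memory: one copy per block
    float temp = data[threadIdx.x];             // temp lives in a per-thread register
    buffer[threadIdx.x] = temp;
    __syncthreads();
    data[threadIdx.x] = buffer[threadIdx.x];    // write back to long-latency global memory
}
// (Assumes a single block of at most 128 threads, to keep the sketch short.)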

Page 8:

CUDA Function Declarations

• __global__ defines a kernel function
  – Must return void

• __device__ and __host__ can be used together


                                   Executed on the:   Only callable from the:
__device__ float DeviceFunc()      device             device
__global__ void KernelFunc()       device             host
__host__   float HostFunc()        host               host
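A hedged sketch (not from the slides) showing each qualifier in use, including the __device__ __host__ combination, which compiles one version of the function for the CPU and one for the GPU; the function bodies are placeholders.

// Hedged illustration of the function-declaration qualifiers.
__host__ __device__ float square(float x) {     // callable from both host and device code
    return x * x;
}

__device__ float DeviceFunc(float x) {          // device only: called from other device code
    return square(x) + 1.0f;
}

__global__ void KernelFunc(float *out) {        // kernel: launched from the host, returns void
    out[threadIdx.x] = DeviceFunc((float)threadIdx.x);
}

__host__ float HostFunc(float x) {              // ordinary host function
    return square(x);
}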

Page 9:


Transparent Scalability

Hardware is free to assign blocks to any processor at any time, so a kernel scales across any number of parallel processors.

[Figure: the same kernel grid of Block 0 through Block 7 runs on a small Device two blocks at a time and on a larger Device four blocks at a time.]

Each block can execute in any order relative to other blocks.

Page 10:


Language Extensions: Built-in Variables

dim3 gridDim;

Dimensions of the grid in blocks (gridDim.z unused)

dim3 blockDim;

Dimensions of the block in threads

dim3 blockIdx;

Block index within the grid

dim3 threadIdx;

Thread index within the block
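These built-ins are usually combined to compute a global element index, as in the hedged sketch below (the kernel name and sizes are illustrative).

// Hedged sketch: combining blockIdx, blockDim, and threadIdx into a global index.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global 1D index
    if (i < n)                                       // guard the last, partially filled block
        data[i] *= factor;
}

// gridDim.x * blockDim.x threads run in total, e.g.:
//   scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);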

Page 11:


Common Runtime Component: Mathematical Functions

pow, sqrt, cbrt, hypot
exp, exp2, expm1
log, log2, log10, log1p
sin, cos, tan, asin, acos, atan, atan2
sinh, cosh, tanh, asinh, acosh, atanh
ceil, floor, trunc, round
etc.

When executed on the host, a given function uses the C runtime implementation if available

These functions are only supported for scalar types, not vector types
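Because these functions belong to the common runtime component, the same scalar call can appear in host and device code, as in this hedged sketch (the function names are illustrative):

// Hedged sketch: the same math function compiles for host and device code.
#include <math.h>

__global__ void distances(const float *x, const float *y, float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] = hypotf(x[i], y[i]);      // device implementation of hypot
}

float host_distance(float x, float y) {
    return hypotf(x, y);                // on the host this falls back to the C runtime
}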

Page 12:


Device Runtime Component: Mathematical Functions

Some mathematical functions (e.g. sin(x)) have a less accurate, but faster device-only version (e.g. __sin(x))

__pow

__log, __log2, __log10

__exp

__sin, __cos, __tan
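In the CUDA math headers the fast single-precision versions carry an f suffix (for example __sinf, __expf, __powf). A hedged sketch contrasting the accurate and fast paths (the kernel name is illustrative):

// Hedged sketch: accurate library call vs. faster, less accurate device intrinsic.
__global__ void waves(const float *x, float *accurate, float *fast, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accurate[i] = sinf(x[i]);    // standard single-precision sine
        fast[i]     = __sinf(x[i]);  // hardware intrinsic: faster, reduced accuracy
    }
}
// Compiling with nvcc --use_fast_math maps sinf() calls to __sinf() automatically.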