Programming with CUDA, WS09
Waqar Saleem, Jens Müller
Programming with CUDA and Parallel Algorithms
Waqar Saleem
Jens Müller
Recap
•Organization
•GPGPU motivation and platforms: CUDA, Stream, OpenCL
•Simple ray tracing
GPU wars
•Nvidia is far more popular
•ATI reports more powerful number crunchers (image courtesy of Udeepta Bordoloi, AMD)
GPU performance
• images courtesy of AnandTech, http://www.anandtech.com/video/showdoc.aspx?i=3643&p=8
CUDA or Stream?
•OpenCL is vendor independent
•OpenCL drivers provided by ATI and Nvidia for their cards
•OpenCL-like initiative for Windows: Microsoft DirectCompute
Why CUDA?
• Many of the concepts from CUDA carry over almost exactly to OpenCL
• CUDA has been around since Feb 2007 and is very well documented
• CUDA home,
http://www.nvidia.com/object/cuda_home.html
• links to programming guide, numerous university courses, multimedia presentations ...
• OpenCL v1.0 was released in Nov/Dec 2008
• OpenCL GPU drivers are less than 6 months old
• A lot of our material will borrow heavily from the above
Today
• Motivational videos
• CUDA hardware and programming models
• Threads, blocks and grids
• CUDA memory hierarchy
• Device compute capability
• Example kernel
• Thread IDs
• Memory overhead
Programming with CUDA
•The G80 architecture, e.g. GeForce 8800
Simpler than graphics mode
•G80 in graphics mode
Thinking CUDA
• Break down the problem into serial and parallel parts
• Serial parts execute on the host (few threads)
• Parallel parts execute on the device (massively parallel)
Program execution
• The host launches a C program
• Compute intensive, data parallel computations are written in special functions, kernels, in extended C
• The host launches a kernel on the compute device with a grid configuration
• The device starts threads according to the provided configuration
• All threads run in parallel on the device
• All threads execute the same kernel
Thread organization
• A kernel launch specifies a grid of thread blocks that execute the kernel
• A grid is a 1D or 2D array of blocks
• Each block is a 1D, 2D or 3D array of threads
• Blocks and threads have IDs
• Choose organization to suit your problem and optimize performance
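Such a launch configuration is declared with CUDA's dim3 type; the sketch below (hypothetical kernel name, array, and sizes) shows a 2D grid of 2D blocks:

```cuda
// Hypothetical configuration: a 4x3 grid of 16x16-thread blocks.
dim3 blockDim(16, 16);   // 256 threads per block, arranged 2D
dim3 gridDim(4, 3);      // 12 blocks, arranged 2D

// Each launched thread can read its own blockIdx and threadIdx.
myKernel<<<gridDim, blockDim>>>(d_data);
```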
Thread specific information
•Each thread may use its block and thread IDs to access its data and make control decisions
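As a sketch of such a control decision (hypothetical kernel, assuming a 1D launch), each thread can combine its block and thread IDs into a global index and guard against running past the end of the data:

```cuda
__global__ void scaleKernel(float* data, int n) {
    // Global index computed from block and thread IDs (1D case).
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)           // control decision: excess threads do nothing
        data[i] *= 2.0f;
}
```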
Thread execution
• Each thread block is assigned to a single multiprocessor (MP)
• If there are more blocks than MPs, multiple blocks are assigned to MPs
• An MP breaks assigned block(s) into warps
• The order of execution of blocks and warps is determined by the thread scheduler
• We can thus write code independent of our device specification
• Thread organization according to the problem rather than the hardware
• Caution: For optimum performance, device specification needs to be considered
Device-independent code
•This also allows for scalable code
•Execution on more MPs is faster
Memory Hierarchy
•Each block is mapped to an MP and each thread in the block to a processor in the MP
Thread communication
• Threads in a block cooperate via shared memory, barrier synchronization and atomic operations
• Threads from different blocks cannot cooperate
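A minimal sketch of in-block cooperation (hypothetical kernel that reverses each block's segment via shared memory; it assumes blockDim.x <= 256, and the __syncthreads() barrier ensures every write lands before any thread reads):

```cuda
__global__ void reverseBlock(float* data) {
    __shared__ float tile[256];              // visible to this block only
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    tile[t] = data[base + t];                // each thread stages one element
    __syncthreads();                         // barrier: wait for all writes
    data[base + t] = tile[blockDim.x - 1 - t];
}
```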
Host and Device communication
• Host can read/write all device memories except registers and shared memory
• Device can read/write global memory
• large but slow (600 clocks)
• Device can read texture memory
• large, slow but cached after first read
• Device can read constant memory
• small, cached
• optimized for certain memory accesses
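As one sketch of using the cached constant memory mentioned above (all names here are illustrative), a small table that every thread reads can be declared __constant__ and filled from the host with cudaMemcpyToSymbol:

```cuda
__constant__ float coeffs[16];   // small, cached, read-only on the device

// Host side, before the kernel launch:
void setupCoeffs() {
    float h_coeffs[16];
    for (int i = 0; i < 16; ++i)
        h_coeffs[i] = 1.0f / (i + 1);        // illustrative values
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));
}
```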
CUDA devices
•The compute capability of a device is a number <major_rev>.<minor_rev>
•The major revision number represents a fundamental change in card architecture
•The minor revision number represents incremental changes within the major revision
•CUDA ready devices have compute capability >= 1.0
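The revision numbers can be read at runtime with cudaGetDeviceProperties; a small sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("Compute capability %d.%d, %d multiprocessors, warp size %d\n",
           prop.major, prop.minor, prop.multiProcessorCount, prop.warpSize);
    return 0;
}
```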
Example kernel: Vector addition
•Time taken for the CPU version: N * (time for 1 addition), as the additions run sequentially
Example kernel: Vector addition
•Time taken for the GPU version: time for 1 addition, as all N additions run in parallel
Example kernel
• Host code
for ( int i = 0; i < N; i++ )
    C[i] = A[i] + B[i];
• Device kernel
__global__ void vAdd ( float* A, float* B, float* C ) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}
Kernel qualifier and thread ID
•Kernels MUST be qualified as __global__
•Kernels MUST be declared void
• threadIdx is the 3-dimensional index of the thread in its block
•A thread with threadIdx (x,y,z) in a block of blockDim (Dx,Dy,Dz) has thread ID x + y·Dx + z·Dx·Dy
•For missing dimensions, the blockDim entry is 1 and the threadIdx entry is 0
Kernel invocation
• Host program
void main() {
    // allocate h_A, h_B, h_C, size N
    // assign values to host vectors
    for ( int i = 0; i < N; ++i )
        h_C[i] = h_A[i] + h_B[i];
    // output h_C
    // free host variables
}
• Host + device program
void main() {
    // allocate h_A, h_B, h_C, size N
    // assign values to host vectors
    // initialize device
    // allocate d_A, d_B, d_C, size N
    // copy h_A, h_B to d_A, d_B
    vAdd<<<1,N>>>(d_A, d_B, d_C);
    // copy d_C to h_C
    // output h_C
    // free host variables
    // free device variables
}
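Filling in those comment placeholders, a self-contained sketch of the host + device program might look as follows (error checking omitted; N kept small so one block suffices):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void vAdd(float* A, float* B, float* C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main() {
    const int N = 256;                        // one block of N threads
    size_t bytes = N * sizeof(float);

    // allocate and fill host vectors
    float *h_A = (float*)malloc(bytes), *h_B = (float*)malloc(bytes),
          *h_C = (float*)malloc(bytes);
    for (int i = 0; i < N; ++i) { h_A[i] = i; h_B[i] = 2 * i; }

    // allocate device vectors and copy the inputs over
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes); cudaMalloc(&d_B, bytes); cudaMalloc(&d_C, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    vAdd<<<1, N>>>(d_A, d_B, d_C);            // 1 block, N threads
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    printf("h_C[1] = %f\n", h_C[1]);          // expect 1 + 2 = 3
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}
```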
Memory overhead
•Necessary evil: the device needs data in its own memory
•Overhead is justified if the kernel is compute intensive
•With multiple kernels, memory overhead can be overlaid with computation using streams
•Bandwidth between device and host is high
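Overlapping copies with computation uses streams and asynchronous copies; a minimal sketch (kernel and buffer names are hypothetical, and h_in is assumed to be pinned host memory from cudaMallocHost):

```cuda
cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

// While kernelA runs in stream s0, the copy feeding kernelB
// proceeds concurrently in stream s1.
kernelA<<<grid, block, 0, s0>>>(d_a);
cudaMemcpyAsync(d_b, h_in, bytes, cudaMemcpyHostToDevice, s1);
kernelB<<<grid, block, 0, s1>>>(d_b);

cudaStreamSynchronize(s0);
cudaStreamSynchronize(s1);
```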
Host-Device bandwidth
Memory usage
•Host data is typically copied to/from global memory on the device
•Threads have to fetch their data from global memory (slow)
•If the data is to be operated on several times, copy it from global memory to shared memory or registers (fast)
•Write result back to global memory at the end of computation
Memory usage
• __global__ void myKernel ( float* in1, float* in2, float* out ) {
      // initialize s_in1, s_in2, s_out in shared memory
      // copy in1, in2 to s_in1, s_in2
      // perform heavy computations on s_in1, s_in2
      // store result in s_out
      // copy s_out to out
  }
Other issues
•New time for exercises
•Some ray tracing issues: to be clarified in the next exercise session
Next time
•Copying memory between host and device
•Clarify grid parameters
•CUDA additions to C
•Memory limitations