
Multi-GPU and Stream Programming

Kishan Wimalawarne

Agenda

• Memory
• Stream programming
• Multi-GPU programming
• UVA & GPUDirect

Memory

• Page-locked memory (pinned memory)
  – Lets copies between host and device overlap with kernel execution
  – Use cudaHostAlloc() and cudaFreeHost() to allocate and free page-locked host memory
• Mapped memory
  – A block of page-locked host memory can also be mapped into the address space of the device by passing the flag cudaHostAllocMapped to cudaHostAlloc()
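The allocation calls named above can be sketched as follows. This is a minimal illustration, not code from the slides: the helper name allocPinned is hypothetical, and cudaHostAllocDefault is assumed as the flag for plain pinned memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: allocates n floats of page-locked host memory.
// Returns nullptr if the allocation fails.
float *allocPinned(size_t n) {
    float *h = nullptr;
    cudaError_t err = cudaHostAlloc((void **)&h, n * sizeof(float), cudaHostAllocDefault);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaHostAlloc failed: %s\n", cudaGetErrorString(err));
        return nullptr;
    }
    return h;
}
```

The returned pointer is used like ordinary host memory and must be released with cudaFreeHost(), not free().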

Zero-Copy

• Zero-Copy enables GPU threads to directly access host memory.

• Requires mapped pinned (non-pageable) memory.
• Zero-copy can be used in place of streams because kernel-originated data transfers automatically overlap kernel execution, without the overhead of setting up streams and determining their optimal number.
• Enable mapping by calling cudaSetDeviceFlags() with the cudaDeviceMapHost flag, then obtain the device-side pointer with cudaHostGetDevicePointer().
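A minimal zero-copy sketch, assuming a single GPU at device index 0 that reports canMapHostMemory. The kernel scale and the function zeroCopyDemo are hypothetical names chosen for illustration.

```cuda
#include <cuda_runtime.h>

// Doubles every element in place; data points at mapped host memory.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Returns true if the kernel updated the host buffer directly.
bool zeroCopyDemo(int n) {
    // Request host-memory mapping; on older toolkits this must precede context creation.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess || !prop.canMapHostMemory)
        return false;

    float *h = nullptr, *d = nullptr;
    if (cudaHostAlloc((void **)&h, n * sizeof(float), cudaHostAllocMapped) != cudaSuccess)
        return false;
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    // Device-side alias of the same host buffer; no cudaMemcpy is involved.
    if (cudaHostGetDevicePointer((void **)&d, h, 0) != cudaSuccess) {
        cudaFreeHost(h);
        return false;
    }
    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();  // make the kernel's writes visible to the host

    bool ok = true;
    for (int i = 0; i < n; ++i)
        if (h[i] != 2.0f * (float)i) ok = false;
    cudaFreeHost(h);
    return ok;
}
```

Every access to data in the kernel travels over the bus, so this pattern tends to suit data that is read or written once.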


Stream Programming

Introduction

• Stream programming (pipelining) is a useful parallel pattern.
• Data transfer from host to device is a major performance bottleneck in GPU programming.
• CUDA supports asynchronous data transfers and kernel executions.
• A stream is simply a sequence of operations that are performed in order on the device.
• Operations in different streams may run concurrently, which allows concurrent execution of kernels.
• At most 16 kernels can execute concurrently on a device of compute capability 2.x; the limit differs on other architectures.


Asynchronous memory Transfer

• Use cudaMemcpyAsync() instead of cudaMemcpy().

• cudaMemcpyAsync() is a non-blocking data transfer; the host buffer must be pinned for the copy to overlap with other work.

• cudaError_t cudaMemcpyAsync(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind, cudaStream_t stream)

Stream Structures

• cudaStream_t
  – Specifies a stream in a CUDA program
• cudaStreamCreate(cudaStream_t *stm)
  – Instantiates a stream

Streaming example
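The original slide content for this example is not available, so what follows is a sketch of one common pattern rather than the author's code. It splits a buffer into chunks and gives each chunk its own stream, so that the copy of one chunk can overlap the kernel of another. The kernel addOne, the function streamDemo, the stream count and the chunk size are all assumptions.

```cuda
#include <cuda_runtime.h>

__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Returns true if every element was incremented exactly once.
bool streamDemo() {
    const int nStreams = 4, chunk = 1 << 18, n = nStreams * chunk;
    float *h = nullptr, *d = nullptr;
    // Pinned host memory is needed for the async copies to overlap.
    if (cudaHostAlloc((void **)&h, n * sizeof(float), cudaHostAllocDefault) != cudaSuccess)
        return false;
    if (cudaMalloc((void **)&d, n * sizeof(float)) != cudaSuccess) {
        cudaFreeHost(h);
        return false;
    }
    for (int i = 0; i < n; ++i) h[i] = 0.0f;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Within one stream the three operations run in order;
    // operations in different streams may overlap.
    for (int s = 0; s < nStreams; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        addOne<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    for (int s = 0; s < nStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }

    bool ok = true;
    for (int i = 0; i < n; ++i)
        if (h[i] != 1.0f) ok = false;
    cudaFree(d);
    cudaFreeHost(h);
    return ok;
}
```

The fourth launch parameter of the kernel selects the stream; leaving it out places the kernel in the default stream.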

Event processing

• Events are used to
  – Monitor device progress
  – Time operations accurately
• cudaEvent_t e;
• cudaEventCreate(&e);
• cudaEventDestroy(e);
• cudaEventRecord() records an event in a stream.
• cudaEventElapsedTime() returns the time in milliseconds between two recorded events.
• cudaEventSynchronize() blocks until the event has actually been recorded.
• cudaEventQuery() checks the status of an event.
• cudaStreamWaitEvent() makes all future work submitted to a stream wait until the event reports completion before beginning execution.
• cudaEventCreateWithFlags() creates an event with flags, e.g. cudaEventDefault or cudaEventBlockingSync.
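The event calls listed above are typically combined to time a kernel. The sketch below assumes the default stream; the kernel busy and the function timeKernelMs are hypothetical names.

```cuda
#include <cuda_runtime.h>

__global__ void busy(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Returns the kernel's elapsed time in milliseconds, or -1.0f on failure.
float timeKernelMs(int n) {
    float *d = nullptr;
    if (cudaMalloc((void **)&d, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);            // enqueue start marker in stream 0
    busy<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop, 0);             // enqueue stop marker after the kernel
    cudaEventSynchronize(stop);           // block the host until stop is recorded

    float ms = -1.0f;
    if (cudaEventElapsedTime(&ms, start, stop) != cudaSuccess) ms = -1.0f;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return ms;
}
```

Because both markers are recorded on the device, the measurement excludes host-side launch latency that a CPU timer would include.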

Stream Synchronization

• cudaDeviceSynchronize() waits until all preceding commands in all streams of all host threads have completed.
• cudaStreamSynchronize() takes a stream as a parameter and waits until all preceding commands in the given stream have completed.
• cudaStreamWaitEvent() takes a stream and an event as parameters and makes all the commands added to the given stream after the call to cudaStreamWaitEvent() delay their execution until the given event has completed.
• cudaStreamQuery() provides applications with a way to know if all preceding commands in a stream have completed.
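A sketch of cross-stream ordering with these calls: a producer kernel in one stream, a consumer kernel in another, and an event that makes the consumer wait. The kernels fill and increment and the function waitEventDemo are hypothetical, and cudaEventDisableTiming is used because the event serves only for ordering.

```cuda
#include <vector>
#include <cuda_runtime.h>

__global__ void fill(int *d, int n, int v) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = v;
}

__global__ void increment(int *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1;
}

// Returns true if the consumer ran strictly after the producer.
bool waitEventDemo() {
    const int n = 1 << 16;
    int *d = nullptr;
    if (cudaMalloc((void **)&d, n * sizeof(int)) != cudaSuccess) return false;

    cudaStream_t producer, consumer;
    cudaStreamCreate(&producer);
    cudaStreamCreate(&consumer);
    cudaEvent_t filled;
    cudaEventCreateWithFlags(&filled, cudaEventDisableTiming);

    fill<<<(n + 255) / 256, 256, 0, producer>>>(d, n, 41);
    cudaEventRecord(filled, producer);         // marks the end of the producer's work
    cudaStreamWaitEvent(consumer, filled, 0);  // later work in consumer waits for it
    increment<<<(n + 255) / 256, 256, 0, consumer>>>(d, n);

    cudaStreamSynchronize(consumer);           // host waits for the consumer stream
    bool ok = (cudaStreamQuery(consumer) == cudaSuccess);  // stream is now idle

    std::vector<int> h(n);
    cudaMemcpy(h.data(), d, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        if (h[i] != 42) ok = false;

    cudaEventDestroy(filled);
    cudaStreamDestroy(producer);
    cudaStreamDestroy(consumer);
    cudaFree(d);
    return ok;
}
```

Unlike cudaStreamSynchronize(), cudaStreamWaitEvent() does not block the host; the wait happens on the device side.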

Multi GPU programming

Multiple device access

• cudaSetDevice(devID)
  – Selects the device with the given identifier; subsequent allocations and kernel launches from the calling host thread run on that GPU.

Peer to peer memory Access

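• Peer-to-Peer Memory Access
  – Only on Tesla-series devices of compute capability 2.0 or higher
  – cudaDeviceCanAccessPeer() checks whether peer access is possible; cudaDeviceEnablePeerAccess() enables it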

Peer to peer memory Copy

• Use cudaMemcpyPeer(); it works on the GeForce GTX 480 and other GPUs, including those without peer-to-peer access.
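The peer access check and the peer copy can be sketched together. This assumes devices 0 and 1 when at least two GPUs are present; the function name peerCopyDemo and its return convention are hypothetical.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Returns 1 on success, 0 if fewer than two GPUs are present, -1 on error.
int peerCopyDemo() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return -1;
    if (count < 2) return 0;

    const int n = 1 << 16;
    const size_t bytes = n * sizeof(int);
    std::vector<int> src(n, 7), dst(n, 0);
    int *d0 = nullptr, *d1 = nullptr;

    cudaSetDevice(0);
    if (cudaMalloc((void **)&d0, bytes) != cudaSuccess) return -1;
    cudaMemcpy(d0, src.data(), bytes, cudaMemcpyHostToDevice);

    cudaSetDevice(1);
    if (cudaMalloc((void **)&d1, bytes) != cudaSuccess) return -1;

    // Direct access is optional: cudaMemcpyPeer also works without it.
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 1, 0);
    if (canAccess) cudaDeviceEnablePeerAccess(0, 0);  // current device 1 -> peer 0

    // Copy from device 0 to device 1.
    if (cudaMemcpyPeer(d1, 1, d0, 0, bytes) != cudaSuccess) return -1;
    cudaMemcpy(dst.data(), d1, bytes, cudaMemcpyDeviceToHost);

    int result = 1;
    for (int i = 0; i < n; ++i)
        if (dst[i] != 7) result = -1;

    cudaFree(d1);
    cudaSetDevice(0);
    cudaFree(d0);
    return result;
}
```

When peer access is enabled the copy can go directly between the GPUs; otherwise the runtime stages it through host memory.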

Programming multiple GPUs

• An efficient way to use multiple GPUs is to run one host thread per GPU (e.g. with pthreads) and divide the work among them.

• This combines the parallelism of the multi-core processor with that of the multiple GPUs.

• In each thread, call cudaSetDevice() to specify the device to run on.

Multiple GPU

• For each computation on a GPU, create a separate thread and specify the device the CUDA kernel should run on.

• Synchronize both the CPU threads and the GPUs.

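Multiple GPU Example

/* p0, p1 (host buffers), p2, p3 (device buffers), size and the kernel
   test() are assumed to be defined elsewhere in the program. */
void *GPUprocess(void *id) {
    long tid = (long)id;
    if (tid == 0) {
        cudaSetDevice((int)tid);                           /* bind this thread to GPU 0 */
        cudaMalloc((void **)&p2, size);
        cudaMemcpy(p2, p0, size, cudaMemcpyHostToDevice);
        test<<<10 * 5024, 1024>>>(p2, tid + 2);
        cudaMemcpy(p0, p2, size, cudaMemcpyDeviceToHost);  /* waits for the kernel */
        cudaFree(p2);
    } else if (tid == 1) {
        cudaSetDevice((int)tid);                           /* bind this thread to GPU 1 */
        cudaMalloc((void **)&p3, size);
        cudaMemcpy(p3, p1, size, cudaMemcpyHostToDevice);
        test<<<10 * 5024, 1024>>>(p3, tid + 2);
        cudaMemcpy(p1, p3, size, cudaMemcpyDeviceToHost);  /* waits for the kernel */
        cudaFree(p3);
    }
    return NULL;
}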

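Multiple GPU Example

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_THREADS 2   /* one host thread per GPU */

int main(void) {
    pthread_t thread[NUM_THREADS];
    pthread_attr_t attr;
    long t;
    int rc;

    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    for (t = 0; t < NUM_THREADS; t++) {
        rc = pthread_create(&thread[t], &attr, GPUprocess, (void *)t);
        if (rc) {
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }
    pthread_attr_destroy(&attr);
    for (t = 0; t < NUM_THREADS; t++)
        pthread_join(thread[t], NULL);   /* wait for both GPU workers */
    return 0;
}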

Unified Virtual Address Space (UVA)

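• The host and all devices of compute capability 2.0 or higher share a single virtual address space.
• Requires a 64-bit process; on Windows Vista/7 the GPU must run in TCC mode (only on Tesla).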
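One visible effect of UVA is that the runtime can infer the direction of a copy from the pointer values, so cudaMemcpyDefault can replace the explicit direction flags. The sketch below assumes device 0 and checks the unifiedAddressing property first; the function name uvaCopyDemo and its return convention are hypothetical.

```cuda
#include <cuda_runtime.h>

// Returns 1 on success, 0 if device 0 does not support UVA, -1 on error.
int uvaCopyDemo() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;
    if (!prop.unifiedAddressing) return 0;

    const int n = 1024;
    const size_t bytes = n * sizeof(int);
    int *hIn = nullptr, *hOut = nullptr, *d = nullptr;
    if (cudaHostAlloc((void **)&hIn, bytes, cudaHostAllocDefault) != cudaSuccess) return -1;
    if (cudaHostAlloc((void **)&hOut, bytes, cudaHostAllocDefault) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&d, bytes) != cudaSuccess) return -1;
    for (int i = 0; i < n; ++i) { hIn[i] = i; hOut[i] = -1; }

    // The direction is deduced from the addresses, not from the flag.
    cudaMemcpy(d, hIn, bytes, cudaMemcpyDefault);   // host -> device
    cudaMemcpy(hOut, d, bytes, cudaMemcpyDefault);  // device -> host

    int result = 1;
    for (int i = 0; i < n; ++i)
        if (hOut[i] != i) result = -1;

    cudaFree(d);
    cudaFreeHost(hIn);
    cudaFreeHost(hOut);
    return result;
}
```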

GPUDirect

• Builds on UVA for Tesla (Fermi) products.
