Multi-GPU and Stream Programming Kishan Wimalawarne

Upload: masao

Post on 23-Feb-2016


Page 1: Multi-GPU and Stream Programming

Multi-GPU and Stream Programming

Kishan Wimalawarne

Agenda

• Memory
• Stream programming
• Multi-GPU programming
• UVA & GPUDirect

Memory

• Page-locked memory (pinned memory)
  – Useful for overlapping data transfers with kernel execution
  – Use cudaHostAlloc() and cudaFreeHost() to allocate and free page-locked host memory
• Mapped memory
  – A block of page-locked host memory can also be mapped into the address space of the device by passing the flag cudaHostAllocMapped to cudaHostAlloc()
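The allocation calls named above can be sketched as follows. This is a minimal illustration, not code from the original slides: the buffer size, the variable names and the use of the cudaHostAllocDefault flag are assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;                 // assumed element count
    const size_t bytes = n * sizeof(float);

    // Allocate page-locked (pinned) host memory.
    float *h_pinned = nullptr;
    cudaError_t err = cudaHostAlloc((void **)&h_pinned, bytes, cudaHostAllocDefault);
    if (err != cudaSuccess) {
        printf("cudaHostAlloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (size_t i = 0; i < n; ++i) h_pinned[i] = 1.0f;

    // Copy from the pinned buffer to ordinary device memory.
    float *d_buf = nullptr;
    cudaMalloc((void **)&d_buf, bytes);
    cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);                   // pinned memory is freed with cudaFreeHost()
    return 0;
}
```

Pinned memory is a limited resource, so it is usually reserved for buffers that are actually transferred to or from the device.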

Zero-Copy

• Zero-copy enables GPU threads to directly access host memory.
• Requires mapped pinned (non-pageable) memory.
• Zero-copy can be used in place of streams because kernel-originated data transfers automatically overlap kernel execution, without the overhead of setting up streams and determining their optimal number.
• Enable it by calling cudaSetDeviceFlags() with the cudaDeviceMapHost flag.
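A minimal zero-copy sketch under stated assumptions: the kernel scale(), the array size and the launch configuration are hypothetical, and cudaHostGetDevicePointer() is used to obtain the device-side alias of the mapped buffer (a call the slides do not mention).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: reads and writes host memory directly through the mapping.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 16;                    // assumed element count

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (!prop.canMapHostMemory) {
        printf("device 0 cannot map host memory\n");
        return 0;
    }

    // Request host-memory mapping before any allocation on the device.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Allocate mapped, page-locked host memory.
    float *h_data = nullptr;
    cudaHostAlloc((void **)&h_data, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    // Device pointer that refers to the same host buffer.
    float *d_alias = nullptr;
    cudaHostGetDevicePointer((void **)&d_alias, h_data, 0);

    scale<<<(n + 255) / 256, 256>>>(d_alias, n, 2.0f);
    cudaDeviceSynchronize();                  // results are in h_data, no cudaMemcpy needed

    printf("h_data[0] = %f\n", h_data[0]);
    cudaFreeHost(h_data);
    return 0;
}
```

No explicit copy appears anywhere: the kernel's loads and stores travel over the bus while it runs, which is the overlap the slide refers to.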

Stream Programming

Introduction

• Stream programming (pipelining) is a useful parallel pattern.
• Data transfer from host to device is a major performance bottleneck in GPU programming.
• CUDA provides support for asynchronous data transfers and kernel executions.
• A stream is simply a sequence of operations that are performed in order on the device.
• Streams allow concurrent execution of kernels.
• The maximum number of kernels that can execute concurrently is 16 (on Fermi-class devices; newer architectures allow more).

Asynchronous memory Transfer

• Use cudaMemcpyAsync() instead of cudaMemcpy().
• cudaMemcpyAsync() is a non-blocking data transfer; the host buffer must be pinned for the copy to be truly asynchronous.
• cudaError_t cudaMemcpyAsync(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind, cudaStream_t stream)

Stream Structures

• cudaStream_t
  – Specifies a stream in a CUDA program
• cudaStreamCreate(cudaStream_t *stm)
  – Instantiates a stream

Streaming example
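The code on the original slide did not survive extraction. The following is a minimal sketch of the usual pattern, not the author's example: the data is split into chunks, and each chunk is copied in, processed and copied back in its own stream. The kernel addOne(), the number of streams and the chunk size are assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel applied to each chunk.
__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int nStreams = 4;                   // assumed number of streams
    const int chunk = 1 << 18;                // assumed elements per stream
    const int n = nStreams * chunk;
    const size_t chunkBytes = chunk * sizeof(float);

    // Pinned host memory is needed for the copies to overlap with kernels.
    float *h_data = nullptr, *d_data = nullptr;
    cudaHostAlloc((void **)&h_data, n * sizeof(float), cudaHostAllocDefault);
    cudaMalloc((void **)&d_data, n * sizeof(float));
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Within a stream the three operations run in order;
    // operations in different streams may overlap.
    for (int s = 0; s < nStreams; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(d_data + off, h_data + off, chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        addOne<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + off, chunk);
        cudaMemcpyAsync(h_data + off, d_data + off, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamSynchronize(streams[s]);    // wait for this stream's work
        cudaStreamDestroy(streams[s]);
    }

    printf("h_data[0] = %f\n", h_data[0]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```

How much overlap is actually achieved depends on the device (for example, the number of copy engines), so the best number of streams is typically found by measurement.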

Event processing

• Events are used to
  – Monitor device behavior
  – Obtain accurate timing
• cudaEvent_t e
• cudaEventCreate(&e);
• cudaEventDestroy(e);
• cudaEventRecord() records an event in a stream.
• cudaEventElapsedTime() returns the time elapsed between two recorded events.
• cudaEventSynchronize() blocks until the event has actually been recorded.
• cudaEventQuery() checks the status of an event.
• cudaStreamWaitEvent() makes all future work submitted to a stream wait until the event reports completion before beginning execution.
• cudaEventCreateWithFlags() creates an event with flags, e.g. cudaEventDefault or cudaEventBlockingSync.
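As an illustration of the timing use case, the sketch below brackets a kernel launch with two events and reads the elapsed time in milliseconds. The kernel busyKernel() and the problem size are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel whose execution time is measured.
__global__ void busyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;                    // assumed element count
    float *d_data = nullptr;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                // record in the default stream
    busyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop, 0);

    cudaEventSynchronize(stop);               // block until 'stop' has been recorded
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
    printf("kernel took %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```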

Stream Synchronization

• cudaDeviceSynchronize() waits until all preceding commands in all streams of all host threads have completed.
• cudaStreamSynchronize() takes a stream as a parameter and waits until all preceding commands in the given stream have completed.

• cudaStreamWaitEvent() takes a stream and an event as parameters and makes all the commands added to the given stream after the call to cudaStreamWaitEvent() delay their execution until the given event has completed.

• cudaStreamQuery() provides applications with a way to know if all preceding commands in a stream have completed.
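The calls above can be combined to express a dependency between two streams. In this sketch, which rests on assumed names (the kernels produce() and consume() and both streams are hypothetical), the consumer stream waits on an event recorded in the producer stream while the host polls with cudaStreamQuery().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical producer and consumer kernels.
__global__ void produce(int *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = i;
}

__global__ void consume(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2;
}

int main() {
    const int n = 1 << 20;                    // assumed element count
    const int blocks = (n + 255) / 256;
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc((void **)&d_in, n * sizeof(int));
    cudaMalloc((void **)&d_out, n * sizeof(int));

    cudaStream_t producer, consumer;
    cudaStreamCreate(&producer);
    cudaStreamCreate(&consumer);

    cudaEvent_t ready;
    cudaEventCreateWithFlags(&ready, cudaEventDisableTiming);

    produce<<<blocks, 256, 0, producer>>>(d_in, n);
    cudaEventRecord(ready, producer);         // marks the end of produce()

    // Work submitted to 'consumer' after this call waits for 'ready'.
    cudaStreamWaitEvent(consumer, ready, 0);
    consume<<<blocks, 256, 0, consumer>>>(d_in, d_out, n);

    // Non-blocking check; the host could do other work inside the loop.
    while (cudaStreamQuery(consumer) == cudaErrorNotReady) {
    }

    cudaDeviceSynchronize();                  // everything on the device is done
    cudaEventDestroy(ready);
    cudaStreamDestroy(producer);
    cudaStreamDestroy(consumer);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```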

Multi GPU programming

Multiple device access

• cudaSetDevice(devID)
  – Selects a device within the code by its identifier; CUDA kernels subsequently launched from the calling host thread run on the selected GPU.
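A short sketch of device selection, assuming the number of devices is first obtained with cudaGetDeviceCount() (a call the slides do not mention). The buffer size is arbitrary.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("no CUDA devices found\n");
        return 0;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("device %d: %s\n", dev, prop.name);

        // Allocations and launches that follow target this device.
        cudaSetDevice(dev);
        float *d_buf = nullptr;
        cudaMalloc((void **)&d_buf, 1024 * sizeof(float));
        cudaFree(d_buf);
    }
    return 0;
}
```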

Peer to peer memory Access

• Peer-to-Peer Memory Access
  – Only on Tesla or above
  – cudaDeviceCanAccessPeer() to check for peer access; cudaDeviceEnablePeerAccess() to enable it

Peer to peer memory Copy

• Use cudaMemcpyPeer() to copy memory between devices; it also works on the GeForce GTX 480 and other GPUs.
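A minimal sketch of a peer copy between devices 0 and 1, assuming at least two GPUs are present. The buffer size and names are hypothetical. Direct peer access is enabled only if the runtime reports it as possible; cudaMemcpyPeer() is issued either way.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    if (count < 2) {
        printf("this sketch needs at least two GPUs\n");
        return 0;
    }

    const size_t bytes = (1 << 20) * sizeof(float);   // assumed buffer size
    float *d0 = nullptr, *d1 = nullptr;

    cudaSetDevice(0);
    cudaMalloc((void **)&d0, bytes);
    cudaMemset(d0, 0, bytes);

    cudaSetDevice(1);
    cudaMalloc((void **)&d1, bytes);

    // Can device 1 directly access memory on device 0?
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 1, 0);
    if (canAccess) {
        cudaDeviceEnablePeerAccess(0, 0);     // current device is 1; peer is 0
    }

    // Copy from device 0 to device 1.
    cudaMemcpyPeer(d1, 1, d0, 0, bytes);
    cudaDeviceSynchronize();

    cudaFree(d1);
    cudaSetDevice(0);
    cudaFree(d0);
    return 0;
}
```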

Programming multiple GPUs

• An efficient way to use multiple GPUs is to drive each GPU from its own host thread and divide the work among the threads.
  – E.g. pthreads
• This combines the parallelism of a multi-core processor with that of multiple GPUs.
• In each thread, use cudaSetDevice() to specify the device to run on.

Multiple GPU

• For each computation on a GPU, create a separate thread and specify the device on which the CUDA kernel should run.
• Synchronize both the CPU threads and the GPUs.

Multiple GPU Example

    // Thread entry point: thread 0 drives GPU 0, thread 1 drives GPU 1.
    // p0 and p1 (host buffers), p2 and p3 (device buffers), size and the
    // kernel test() are defined elsewhere in the program.
    void *GPUprocess(void *id) {
        long tid = (long)id;
        if (tid == 0) {
            cudaSetDevice(tid);
            cudaMalloc((void **)&p2, size);
            cudaMemcpy(p2, p0, size, cudaMemcpyHostToDevice);
            test<<<10*5024, 1024>>>(p2, tid + 2);
            cudaMemcpy(p0, p2, size, cudaMemcpyDeviceToHost);
        } else if (tid == 1) {
            cudaSetDevice(tid);
            cudaMalloc((void **)&p3, size);
            cudaMemcpy(p3, p1, size, cudaMemcpyHostToDevice);
            test<<<10*5024, 1024>>>(p3, tid + 2);
            cudaMemcpy(p1, p3, size, cudaMemcpyDeviceToHost);
        }
        return NULL;
    }

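Multiple GPU Example

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NUM_THREADS 2   /* one host thread per GPU */

    int main(void) {
        pthread_t thread[NUM_THREADS];
        pthread_attr_t attr;
        long t;
        int rc;

        pthread_attr_init(&attr);
        pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

        for (t = 0; t < NUM_THREADS; t++) {
            /* The thread index doubles as the device ID in GPUprocess(). */
            rc = pthread_create(&thread[t], &attr, GPUprocess, (void *)t);
            if (rc) {
                printf("ERROR; return code from pthread_create() is %d\n", rc);
                exit(-1);
            }
        }
        pthread_attr_destroy(&attr);

        /* Wait for both GPU worker threads to finish. */
        for (t = 0; t < NUM_THREADS; t++)
            pthread_join(thread[t], NULL);
        return 0;
    }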

Unified Virtual Address Space (UVA)

• Available to 64-bit processes; on Windows Vista/7 the GPU must be in TCC mode (only on Tesla).
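With UVA, host and device pointers share one address space, so the runtime can infer the direction of a copy from the pointers themselves. A minimal sketch, assuming device 0 and an arbitrary buffer size; the unifiedAddressing property, cudaMemcpyDefault and cudaPointerGetAttributes() are runtime API features not named on the slide.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (!prop.unifiedAddressing) {
        printf("device 0 does not support unified addressing\n");
        return 0;
    }

    const size_t bytes = 1024 * sizeof(float);        // assumed buffer size
    float *h_buf = nullptr, *d_buf = nullptr;
    cudaHostAlloc((void **)&h_buf, bytes, cudaHostAllocDefault);
    cudaMalloc((void **)&d_buf, bytes);
    for (int i = 0; i < 1024; ++i) h_buf[i] = (float)i;

    // The copy direction is inferred from the pointer values.
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyDefault);

    // Ask the runtime which device a pointer belongs to.
    cudaPointerAttributes attr;
    cudaPointerGetAttributes(&attr, d_buf);
    printf("d_buf belongs to device %d\n", attr.device);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```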

GPUDirect

• Built on UVA for Tesla (Fermi) products.
