CUDA Odds and Ends
Joseph Kider
University of Pennsylvania
CIS 565 - Fall 2011
Sources
Patrick Cozzi, Spring 2011
NVIDIA CUDA Programming Guide
CUDA by Example
Programming Massively Parallel Processors
Agenda
Atomic Functions
Page-Locked Host Memory
Streams
Graphics Interoperability
Atomic Functions
__device__ unsigned int count = 0;
// ...
++count;
What is the value of count if 8 threads execute ++count?
Atomic Functions
Read-modify-write atomic operation
Guaranteed no interference from other threads
No guarantee on order

Shared or global memory
Requires compute capability 1.1 (> G80)
See G.1 in the NVIDIA CUDA C Programming Guide for full compute capability requirements
Atomic Functions
What is the value of count if 8 threads execute the atomic increment below?

__device__ unsigned int count = 0;
// ...
// atomic ++count
atomicAdd(&count, 1);
Atomic Functions
How do you implement atomicAdd?

__device__ int atomicAdd(
  int *address, int val);
Atomic Functions
How do you implement atomicAdd?

__device__ int atomicAdd(int *address, int val)
{
  // Made-up keyword:
  __lock(address)
  {
    *address += val;
  }
}
Atomic Functions
How do you implement atomicAdd without locking?
Atomic Functions
How do you implement atomicAdd without locking?
What if you were given an atomic compare and swap?
int atomicCAS(int *address, int compare, int val);
Atomic Functions
atomicCAS pseudo-implementation

int atomicCAS(int *address, int compare, int val)
{
  // Made-up keyword
  __lock(address)
  {
    int old = *address;
    *address = (old == compare) ? val : old;
    return old;
  }
}
Atomic Functions

Example:

*addr = 1;

atomicCAS(addr, 1, 2); // returns 1, *addr = 2
atomicCAS(addr, 1, 3); // returns 2, *addr = 2
atomicCAS(addr, 2, 3); // returns 2, *addr = 3
Atomic Functions
Again, how do you implement atomicAdd given atomicCAS?

__device__ int atomicAdd(
  int *address, int val);
Atomic Functions

__device__ int atomicAdd(int *address, int val)
{
  int old = *address, assumed;  // Read the original value at *address

  do
  {
    assumed = old;
    // If the value at *address didn't change, add val to it;
    // otherwise, loop until atomicCAS succeeds
    old = atomicCAS(address, assumed, assumed + val);
  } while (assumed != old);

  return old;
}
The value of *address after this function returns is not necessarily the original value of *address + val, why?
Atomic Functions
Lots of atomics:
// Arithmetic      // Bitwise
atomicAdd()        atomicAnd()
atomicSub()        atomicOr()
atomicExch()       atomicXor()
atomicMin()
atomicMax()
atomicInc()
atomicDec()
atomicCAS()
See B.10 in the NVIDIA CUDA C Programming Guide
Atomic Functions
How can threads from different blocks work together?
Use atomics sparingly. Why?
Page-Locked Host Memory
Page-locked memory: host memory that is essentially removed from virtual memory
Also called pinned memory
Page-Locked Host Memory
Benefits
Overlap kernel execution and data transfers
See G.1 in the NVIDIA CUDA C Programming Guide for full compute capability requirements
Time →
Normally:     data transfer completes before kernel execution starts
Page-locked:  data transfer and kernel execution overlap
Page-Locked Host Memory
Benefits
Increased memory bandwidth for systems with a front-side bus
Up to ~2x throughput
Image from http://arstechnica.com/hardware/news/2009/10/day-of-nvidia-chipset-reckoning-arrives.ars
Page-Locked Host Memory
Benefits
Write-Combining Memory

Page-locked memory is cacheable by default
Allocate with cudaHostAllocWriteCombined to:
  Avoid polluting L1 and L2 caches
  Avoid snooping transfers across PCIe
  Improve transfer performance up to 40% - in theory

Reading from write-combining memory is slow!
  Only write to it from the host
Page-Locked Host Memory
Benefits
Page-locked host memory can be mapped into the address space of the device on some systems

  What systems allow this?
  What does this eliminate?
  What applications does this enable?

Call cudaGetDeviceProperties() and check canMapHostMemory
Page-Locked Host Memory
Usage:
cudaHostAlloc() / cudaMallocHost()
cudaFreeHost()
cudaMemcpyAsync()
See 3.2.5 in the NVIDIA CUDA C Programming Guide
Page-Locked Host Memory
DEMO
CUDA SDK Example: bandwidthTest
Page-Locked Host Memory
What's the catch?
Page-locked memory is scarce
  Allocations will start failing before allocations in pageable memory do
Reduces the amount of physical memory available to the OS for paging
  Allocating too much will hurt overall system performance
Streams

Stream: a sequence of commands that execute in order
Streams may execute their commands out of order or concurrently with respect to other streams

Stream A: Command 0 → Command 1 → Command 2
Stream B: Command 0 → Command 1 → Command 2
Streams

[Slide sequence: several candidate timelines interleaving Stream A's and Stream B's commands, each asking "Is this a possible order?"]

Within a stream, commands must execute in order (Command 0 → Command 1 → Command 2), so any interleaving of the two streams' commands is a possible order, but a timeline that reorders commands within one stream (e.g. Command 0 → Command 2 → Command 1) is not.
Streams
In CUDA, what commands go in a stream?
  Kernel launches
  Host ↔ device memory transfers
Streams
Code Example
1. Create two streams
2. For each stream:
   1. Copy page-locked memory to device
   2. Launch kernel
   3. Copy memory back to host
3. Destroy streams
Stream Example (Step 1 of 3)

cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
{
  cudaStreamCreate(&stream[i]); // Create two streams
}

float *hostPtr;
cudaMallocHost(&hostPtr, 2 * size); // Allocate two buffers in page-locked memory
Stream Example (Step 2 of 3)

for (int i = 0; i < 2; ++i)
{
  cudaMemcpyAsync(/* ... */, cudaMemcpyHostToDevice, stream[i]);
  kernel<<<100, 512, 0, stream[i]>>>(/* ... */);
  cudaMemcpyAsync(/* ... */, cudaMemcpyDeviceToHost, stream[i]);
}

Commands are assigned to, and executed by, streams
Stream Example (Step 3 of 3)

for (int i = 0; i < 2; ++i)
{
  // Blocks until commands complete
  cudaStreamDestroy(stream[i]);
}
Streams
Assume these compute capabilities:
  Overlap of data transfer and kernel execution
  Concurrent kernel execution
  Concurrent data transfer
How can the streams overlap?
See G.1 in the NVIDIA CUDA C Programming Guide for more on compute capabilities
Streams
Time →
[Timeline: Stream A and Stream B each run host-to-device copy, kernel execution, device-to-host copy, with only partial overlap between the streams.]

Can we have more overlap than this?
Streams
Time →
[Timeline: Stream A's and Stream B's host-to-device copies, kernel executions, and device-to-host copies fully overlapped with each other.]

Can we have this?
Streams
Implicit Synchronization
An operation that requires a dependency check to see if a kernel finished executing blocks all kernel launches from any stream until the checked kernel is finished
See 3.2.6.5.3 in the NVIDIA CUDA C Programming Guide for all limitations
Streams
Time →
[Timeline: Stream A runs host-to-device copy, kernel execution, device-to-host copy. Stream B's work is dependent on kernel completion, so its host-to-device copy is blocked until the kernel from Stream A completes.]

Can we have this?
Streams
Performance Advice
  Issue all independent commands before dependent ones
  Delay synchronization (implicit or explicit) as long as possible
Streams
for (int i = 0; i < 2; ++i)
{
  cudaMemcpyAsync(/* ... */, stream[i]);
  kernel<<< /* ... */ stream[i]>>>();
  cudaMemcpyAsync(/* ... */, stream[i]);
}
Rewrite this to allow concurrent kernel execution
Streams
for (int i = 0; i < 2; ++i) // to device
  cudaMemcpyAsync(/* ... */, stream[i]);

for (int i = 0; i < 2; ++i)
  kernel<<< /* ... */ stream[i]>>>();

for (int i = 0; i < 2; ++i) // to host
  cudaMemcpyAsync(/* ... */, stream[i]);
Streams
Explicit Synchronization
  cudaThreadSynchronize(): blocks until commands in all streams finish
  cudaStreamSynchronize(): blocks until commands in a given stream finish
See 3.2.6.5 in the NVIDIA CUDA C Programming Guide for more synchronization functions
Timing with Stream Events
Events can be added to a stream to monitor the device’s progress
An event is completed when all commands in the stream preceding it complete.
Timing with Stream Events

cudaEvent_t start, stop;
// Create two events; each will record the time
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Record events before and after each stream is assigned its work
cudaEventRecord(start, 0);
for (int i = 0; i < 2; ++i)
  // ...
cudaEventRecord(stop, 0);

// Delay additional commands until after the stop event
cudaEventSynchronize(stop);

// Compute elapsed time
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);
// cudaEventDestroy(...)
Graphics Interoperability
What applications use both CUDA and OpenGL/Direct3D?
  CUDA → GL
  GL → CUDA
If CUDA and GL cannot share resources, what is the performance implication?
Graphics Interoperability
Graphics Interop: map a GL resource into the CUDA address space
  Buffers
  Textures
  Renderbuffers
Graphics Interoperability
OpenGL Buffer Interop
1. Assign device with GL interop: cudaGLSetGLDevice()
2. Register GL resource with CUDA: cudaGraphicsGLRegisterBuffer()
3. Map it: cudaGraphicsMapResources()
4. Get mapped pointer: cudaGraphicsResourceGetMappedPointer()
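The four steps above might fit together as in this untested sketch (CUDA runtime API; it needs an existing OpenGL context, a GL buffer object, and a CUDA-capable GPU, so it is shown as an illustration only). myKernel and runInterop are made-up names, and error checking is omitted.

```cpp
#include <cuda_gl_interop.h>

__global__ void myKernel(float* data, size_t n) { /* write into the buffer */ }

void runInterop(GLuint vbo)
{
    cudaGLSetGLDevice(0);                       // 1. Assign device with GL interop

    cudaGraphicsResource* resource;
    cudaGraphicsGLRegisterBuffer(               // 2. Register GL resource with CUDA
        &resource, vbo, cudaGraphicsMapFlagsNone);

    cudaGraphicsMapResources(1, &resource, 0);  // 3. Map it

    float* devPtr;
    size_t numBytes;
    cudaGraphicsResourceGetMappedPointer(       // 4. Get mapped pointer
        (void**)&devPtr, &numBytes, resource);

    myKernel<<<1, 256>>>(devPtr, numBytes / sizeof(float));

    // Unmap before GL uses the buffer again
    cudaGraphicsUnmapResources(1, &resource, 0);
}
```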