Mixing Graphics & Compute with Multi-GPU | GTC 2013 · April 19, 2013
TRANSCRIPT
-
Mixing Graphics & Compute with multi-GPU
Wil Braithwaite - NVIDIA Applied Engineering
http://www.gputechconf.com/page/home.html
-
Talk Outline
Compute and Graphics API Interoperability.
Interoperability Methodologies.
Interoperability at a system level.
Application design considerations.
2
-
Compute & Visualize the same data
(Diagram: one application both computes and visualizes the same data.)
3
-
Compute/Graphics interoperability
Setup the objects in the graphics context.
Register objects with the compute context.
Map / Unmap the objects from the compute context.
(Diagram: the application's OpenGL/DX buffer object maps to a CUDA buffer, and its texture object maps to a CUDA array, via the interop API.)
4
-
Code Sample – Simple image interop
Setup and Registration of Texture Objects:
GLuint texId;
cudaGraphicsResource_t texRes;
// OpenGL buffer creation...
glGenTextures(1, &texId);
glBindTexture(GL_TEXTURE_2D, texId);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8UI_EXT, texWidth, texHeight, 0,
GL_RGBA_INTEGER_EXT, GL_UNSIGNED_BYTE, 0);
glBindTexture(GL_TEXTURE_2D, 0);
// Registration with CUDA.
cudaGraphicsGLRegisterImage(&texRes, texId, GL_TEXTURE_2D,
cudaGraphicsRegisterFlagsNone);
5
-
Code Sample – Simple image interop
Mapping between contexts:
cudaArray* texArray;
while (!done)
{
cudaGraphicsMapResources(1, &texRes);
cudaGraphicsSubResourceGetMappedArray(&texArray, texRes, 0, 0);
runCUDA(texArray);
cudaGraphicsUnmapResources(1, &texRes);
runGL(texId);
}
6
-
Code Sample – Simple buffer interop
Setup and Registration of Buffer Objects:
GLuint vboId;
cudaGraphicsResource_t vboRes;
// OpenGL buffer creation...
glGenBuffers(1, &vboId);
glBindBuffer(GL_ARRAY_BUFFER, vboId);
glBufferData(GL_ARRAY_BUFFER, vboSize, 0, GL_DYNAMIC_DRAW);
glBindBuffer(GL_ARRAY_BUFFER, 0);
// Registration with CUDA.
cudaGraphicsGLRegisterBuffer(&vboRes, vboId, cudaGraphicsRegisterFlagsNone);
7
-
Code Sample – Simple buffer interop
Mapping between contexts:
float* vboPtr;
while (!done)
{
cudaGraphicsMapResources(1, &vboRes, 0);
cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
runCUDA(vboPtr);
cudaGraphicsUnmapResources(1, &vboRes, 0);
runGL(vboId);
}
8
-
Resource Behavior: Single-GPU
The resource is shared.
Context switch is fast and independent of data size.
(Diagram: the CUDA and GL contexts share one copy of the data on the GPU through the interop API.)
9
-
Code Sample – Simple buffer interop
Mapping between contexts:
float* vboPtr;
while (!done)
{
cudaGraphicsMapResources(1, &vboRes, 0);
cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
runCUDA(vboPtr);
cudaGraphicsUnmapResources(1, &vboRes, 0);
runGL(vboId);
}
Context-switching happens when these commands (map/unmap) are processed.
10
-
Timeline: Single-GPU
Driver-interop
(Timeline diagram, frames 1-4: on the single GPU, each frame's map, CUDA kernel, unmap, and GL render are issued in order and share the device.)
11
-
Code Sample – Simple buffer interop
Adding synchronization for analysis:
float* vboPtr;
while (!done)
{
cudaGraphicsMapResources(1, &vboRes, 0);
cudaStreamSynchronize(0);
cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
runCUDA(vboPtr);
cudaGraphicsUnmapResources(1, &vboRes, 0);
cudaStreamSynchronize(0);
runGL(vboId);
}
12
-
Timeline: Single-GPU
Driver-interop, synchronous*
— (we synchronize after map and unmap calls)
(Timeline diagram, frames 1-4: CUDA kernels and GL renders alternate on the single GPU.)
Synchronize after “map” waits for GL to finish before the context-switch.
Synchronize after “unmap” waits for CUDA (& GL) to finish.
13
-
Resource Behavior: Multi-GPU
Each GPU has a copy of the resource.
Context-switch time depends on data size, because the driver must copy the data.
(Diagram, animated across slides 16-18: the CUDA context on the compute-GPU and the GL context on the render-GPU each hold a copy of the data; MAP copies render-GPU → compute-GPU, UNMAP copies compute-GPU → render-GPU.)
16
-
Timeline: Multi-GPU
Driver-interop, synchronous*
— SLOWER! (Tasks are serialized).
(Timeline diagram, frames 1-3: each frame runs GLtoCU copy → kernel → CUtoGL copy → GL render, fully serialized across the two GPUs.)
Resources are mirrored and synchronized across the GPUs.
“map” has to wait for GL to complete before it synchronizes the resource.
20
-
Interoperability Methodologies
READ-ONLY
— GL produces... and CUDA consumes.
e.g. Post-process the GL render in CUDA.
WRITE-DISCARD
— CUDA produces... and GL consumes.
e.g. CUDA simulates fluid, and GL renders result.
READ & WRITE
— Useful if you want to use the rasterization pipeline.
e.g. Feedback loop:
— runGL(texture) → framebuffer
— runCUDA(framebuffer) → texture
21
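These three patterns map directly onto the interop map-flag hints. A minimal sketch (assuming a resource `res` registered as in the earlier samples):

```cuda
// Hint the driver about how CUDA will use the resource, so it can
// skip unnecessary copies (especially in the multi-GPU mirrored case).

// READ-ONLY: GL produces, CUDA consumes.
// No need to copy CUDA-side changes back to GL on unmap.
cudaGraphicsResourceSetMapFlags(res, cudaGraphicsMapFlagsReadOnly);

// WRITE-DISCARD: CUDA produces, GL consumes.
// No need to copy GL's previous contents to CUDA on map.
cudaGraphicsResourceSetMapFlags(res, cudaGraphicsMapFlagsWriteDiscard);

// READ & WRITE (the default): the driver must keep both sides coherent.
cudaGraphicsResourceSetMapFlags(res, cudaGraphicsMapFlagsNone);
```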
-
Code Sample – WRITE-DISCARD
CUDA produces... and OpenGL consumes:
float* vboPtr;
// Hint that we do not care about the previous contents of the buffer.
cudaGraphicsResourceSetMapFlags(vboRes, cudaGraphicsMapFlagsWriteDiscard);
while (!done)
{
cudaGraphicsMapResources(1, &vboRes, 0);
cudaStreamSynchronize(0);
cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
runCUDA(vboPtr);
cudaGraphicsUnmapResources(1, &vboRes, 0);
cudaStreamSynchronize(0);
runGL(vboId);
}
22
-
Timeline: Single-GPU
Driver-interop, synchronous*, WRITE-DISCARD
(Timeline diagram, frames 1-4: kernel and GL render still alternate on the one GPU.)
Synchronize after “map” waits for GL to finish before the context-switch.
Synchronize after “unmap” waits for CUDA (& GL) to finish.
The context switch forces serialization.
23
-
Timeline: Multi-GPU
Driver-interop, synchronous*, WRITE-DISCARD
(Timeline diagram, frames 1-4: the kernel on the compute-GPU overlaps the GL render on the render-GPU; only the CUtoGL copy at unmap serializes them.)
When multi-GPU, “map” does nothing.
Compute & render can overlap as they are on different GPUs.
Synchronize after “unmap” waits for CUDA & GL to finish.
24
-
Timeline: Multi-GPU
Driver-interop, synchronous*, WRITE-DISCARD
— if render is long...
(Timeline diagram, frames 1-4: with a long GL render, the CUtoGL copy cannot start until the render finishes.)
“unmap” will wait for GL.
25
-
Driver-Interop: System View
(System diagrams: single-GPU shares one copy of the resource; multi-GPU mirrors it between the compute-GPU and render-GPU.)
26
-
Manual-Interop: System View
Multi-GPU
(Diagram: the application moves the data itself between the compute-GPU and the render-GPU, via the host.)
27
-
Code Sample – Manual-Interop
cudaMalloc((void**)&d_data, vboSize);
// Create a temporary buffer in pinned host-memory.
cudaHostAlloc((void**)&h_data, vboSize, cudaHostAllocPortable);
while (!done) {
// Compute data in temp buffer, and copy to host...
runCUDA(d_data);
cudaMemcpyAsync(h_data, d_data, vboSize, cudaMemcpyDeviceToHost, 0);
cudaStreamSynchronize(0);
// Map the render-GPU’s resource and upload the host buffer...
cudaSetDevice(renderGPU);
cudaGraphicsMapResources(1, &vboRes, 0);
cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
cudaMemcpy(vboPtr, h_data, size, cudaMemcpyHostToDevice);
cudaGraphicsUnmapResources(1, &vboRes, 0);
cudaSetDevice(computeGPU);
runGL(vboId);
}
28
-
Timeline: Multi-GPU
Manual-interop, synchronous*, WRITE-DISCARD
(Timeline diagram, frames 1-4: each frame runs kernel → CUtoH download → HtoGL upload → GL render; the synchronous copies serialize the frame.)
29
-
Code Sample - Manual Interop (Async)
Use asynchronous copy in the default stream.
cudaMalloc((void**)&d_data, vboSize);
cudaHostAlloc((void**)&h_data, vboSize, cudaHostAllocPortable);
while (!done) {
// Compute data in temp buffer, and copy to host...
runCUDA(d_data);
cudaMemcpyAsync(h_data, d_data, vboSize, cudaMemcpyDeviceToHost, 0);
cudaStreamSynchronize(0);
// Map the render-GPU’s resource and upload the host buffer...
// (all commands must be asynchronous.)
cudaSetDevice(renderGPU);
cudaGraphicsMapResources(1, &vboRes, 0);
cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
cudaMemcpyAsync(vboPtr, h_data, size, cudaMemcpyHostToDevice, 0);
cudaGraphicsUnmapResources(1, &vboRes, 0);
cudaSetDevice(computeGPU);
runGL(vboId);
}
30
-
Timeline: Multi-GPU
Manual-interop, asynchronous, WRITE-DISCARD
(Timeline diagram, frames 1-4: the HtoGL upload now overlaps the next frame's kernel and CUtoH download.)
31
-
Timeline: Multi-GPU
Manual-interop, asynchronous, WRITE-DISCARD
— if render is long...
(Timeline diagram, frames 1-4: with a long GL render, the next CUtoH download starts while the HtoGL upload is still in flight.)
We are downloading while uploading! Drifting out of sync!
32
-
Timeline: Multi-GPU (fixed Async)
Manual-interop, asynchronous, WRITE-DISCARD
— if render is long...
(Timeline diagram, frames 1-4: the download now waits on an “uploadFinished” event before overwriting the host buffer.)
Synchronization must also wait for HtoGL to finish
33
-
Code Sample - Manual Interop (fixed Async)
while (!done) {
// Compute the data in a temp buffer, and copy to a host buffer...
runCUDA(d_data);
cudaStreamWaitEvent(0, uploadFinished, 0);
cudaMemcpyAsync(h_data, d_data, vboSize, cudaMemcpyDeviceToHost, 0);
cudaStreamSynchronize(0);
// Map the render-GPU’s resource and upload the host buffer...
// (all commands must be asynchronous.)
cudaSetDevice(renderGPU);
cudaGraphicsMapResources(1, &vboRes, 0);
cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
cudaMemcpyAsync(vboPtr, h_data, size, cudaMemcpyHostToDevice, 0);
cudaGraphicsUnmapResources(1, &vboRes, 0);
cudaEventRecord(uploadFinished, 0);
cudaSetDevice(computeGPU);
runGL(vboId);
}
34
-
Timeline: Multi-GPU
Manual-interop, asynchronous, WRITE-DISCARD, with flipping
(Timeline diagram, frames 1-5: kernels, CUtoH downloads, and HtoGL uploads ping-pong between host buffers A and B, so download, upload, and GL render all overlap.)
35
-
Code Sample - Manual Interop (streams)
int read = 1, write = 0;
while (!done) {
// Compute the data in a temp buffer, and copy to a host buffer...
cudaStreamWaitEvent(custream[write], kernelFinished[read], 0);
runCUDA(d_data, custream[write]);
cudaEventRecord(kernelFinished[write], custream[write]);
cudaStreamWaitEvent(custream[write], uploadFinished[read], 0);
cudaMemcpyAsync(h_data[write], d_data, vboSize, cudaMemcpyDeviceToHost, custream[write]);
cudaEventRecord(downloadFinished[write], custream[write]);
// Map the renderGPU’s resource and upload the host buffer...
cudaSetDevice(renderGPU);
cudaGraphicsMapResources(1, &vboRes, glstream);
cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);
cudaStreamWaitEvent(glstream, downloadFinished[read], 0);
cudaMemcpyAsync(vboPtr, h_data[read], size, cudaMemcpyHostToDevice, glstream);
cudaGraphicsUnmapResources(1, &vboRes, glstream);
cudaEventRecord(uploadFinished[read], glstream);
cudaStreamSynchronize(glstream); // Sync for easier analysis!
cudaSetDevice(computeGPU);
runGL(vboId);
swap(&read, &write);
}
36
-
Timeline: Multi-GPU
Manual-interop, asynchronous, WRITE-DISCARD, + pingpong
— if render is long... then we must flip the resource too... etc. etc.
(Timeline diagram, frames 1-4: kernels, downloads, and uploads ping-pong between buffers A and B so that copies and renders overlap across frames.)
37
-
Benchmarks & Demo
— runCUDA (20ms)
— runGL (10ms)
— copy (10ms)
Single-GPU
— Driver-interop = 30ms
Multi-GPU
— Driver-interop = 36ms (too large a data size makes multi-GPU interop worse)
— Async Manual-interop = 32ms (overlapping the download helps us break even)
— Flipped Manual-interop = 22ms (using streams and flipping is a significant win!)
38
-
Scaling further
Ping-pong the renderGPU side too.
— The current example is bound by (upload + render).
A kernel might not depend on the previous kernel.
— e.g. Could you run two kernels simultaneously?
If your CUDA kernels are much more expensive than your GL render, then this could be a win.
Remember to use streams and events, AND consider all the dependencies.
39
-
Interoperability behavior: Multi-GPU
Similar considerations are applicable when OpenGL is the producer and CUDA is the consumer.
Use cudaGraphicsMapFlagsReadOnly
40
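A minimal READ-ONLY sketch of this direction, reusing the texture-interop names from the earlier samples (`texRes`, `texId`); `postProcessCUDA` is a hypothetical consumer:

```cuda
// GL renders first, then CUDA consumes the result read-only.
cudaGraphicsResourceSetMapFlags(texRes, cudaGraphicsMapFlagsReadOnly);
while (!done) {
    runGL(texId);                                 // GL produces.
    cudaGraphicsMapResources(1, &texRes, 0);
    cudaArray* texArray;
    cudaGraphicsSubResourceGetMappedArray(&texArray, texRes, 0, 0);
    postProcessCUDA(texArray);                    // CUDA consumes.
    cudaGraphicsUnmapResources(1, &texRes, 0);    // No CUDA-to-GL copy needed.
}
```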
-
Application Design Considerations
Avoid synchronized GPUs for CUDA.
— Watch out for Windows WDDM implicit synchronization on unmap!
Provision for multi-GPU environments:
— Let the user choose the GPUs.
— Use cudaD3D[9|10|11]GetDevices() / cudaGLGetDevices() to match CUDA and graphics device enumerations.
CUDA-OpenGL interoperability can perform slower if the OpenGL context spans multiple GPUs.
Context-switch performance varies with system configuration and OS.
41
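A hedged sketch of the device-matching step, assuming `computeGPU`/`renderGPU` ints as in the earlier manual-interop sample:

```cuda
// Find which CUDA device(s) back the current OpenGL context,
// so compute can be placed on a different GPU.
unsigned int glCount = 0;
int glDevices[8];
cudaGLGetDevices(&glCount, glDevices, 8, cudaGLDeviceListAll);

int deviceCount = 0;
cudaGetDeviceCount(&deviceCount);
for (int d = 0; d < deviceCount; ++d) {
    bool isRender = false;
    for (unsigned int i = 0; i < glCount; ++i)
        if (glDevices[i] == d) isRender = true;
    if (isRender) renderGPU  = d;   // backs the GL context
    else          computeGPU = d;   // candidate compute device
}
```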
-
Conclusions & Resources
The driver can do all the heavy-lifting, but scalability and final performance are up to the developer.
— For fine-grained control and optimization, you might want to move the data manually.
CUDA samples/documentation:
— http://developer.nvidia.com/cuda-downloads
OpenGL Insights, Patrick Cozzi, Christophe Riccio, 2012. ISBN 1439893764. www.openglinsights.com
42
-
Thank you.
Questions will be taken at the end of the full session.
43
-
Scaling Graphics and Compute on Multi-GPUs
Shalini Venkataraman - NVIDIA PSG Applied Engineering
-
Talk Outline
Recap on default behavior
— We need finer-grained control to manage scaling
Enumerating graphics & compute resources
— Supported hardware
— NUMA considerations
Different methods for scaling and communication
— Multiple CUDA GPUs + 1 graphics GPU
— Multiple graphics GPUs
Focus on single-node systems, CUDA and OpenGL
Nsight for profiling
-
Recap – Beyond API Interoperability
API interop hides all the complexity.
BUT, sometimes:
— You don't want to transfer all the data (a kernel may compute subregions, but the entire buffer object/texture is copied)
— You may have a complex system with multiple compute GPUs and/or render GPUs
— You need application-specific pipelining and multi-buffering
— You may have some CPU code in your algorithm between compute and graphics
-
Scaling Beyond Single Compute+Graphics
Scaling compute
— Divide tasks across multiple devices
— Distribute data when it does not fit into a single GPU's memory
Scaling graphics
— Multi-display, stereo
— Richer rendering, e.g. ray tracing, complex lighting models
Higher compute density
— Amortize host or server costs (CPU, memory, RAID shared across multiple GPUs)
-
Multi-GPU Compute+Graphics Use Cases
Image processing
— Multiple compute GPUs and a low-end display GPU for blitting
Mixing rasterization with compute
— Polygonal rendering done in OpenGL and fed to compute for further processing
Visualization for HPC simulation
— Numerical simulation distributed across multiple compute GPUs, possibly on a remote supercomputer
(Images: NVIDIA IndeX seismic interpretation; Morpheus Medical real-time CFD visualization; GTC 2013 exhibit hall.)
-
NVIDIA Maximus Initiative
Mixing Tesla and Quadro GPUs
— Tight integration with OEMs and system integrators
— Optimized driver paths for GPU-GPU communication
(Diagram: a traditional workstation visualizes locally and simulates on a cluster; an NVIDIA® Maximus™ workstation runs Simulate0, Simulate1+Visualize0, Simulate2+Visualize1, and Simulate3+Visualize2 on one machine.)
-
Supported Hardware
Boxx Systems:
— 3D Boxx 8950 - Quadro + 4 Tesla K20s
— 3D Boxx 4920/4920 Xtreme - Quadro + 3 Tesla K20s
— 3D Boxx 8920 - Quadro + 2 Tesla K20s
Supermicro Maximus Systems:
— SYS-7047GR-TRF - Quadro + 4 Tesla K20s
— SYS-7047A/7047A-73/7037A-i - Quadro + 3 Tesla K20s
-
NUMA/Topology Considerations
Memory access is non-uniform!
— Local GPU access is faster than remote (extra QPI hop)
— Affects PCIe transfer throughput
NUMA APIs
— Thread affinity considerations
Pitfalls of setting process affinity
— Does not work with graphics APIs
(Diagram: dual-socket system with CPU0/IOH0 driving GPU0 and CPU1/IOH1 driving GPU1, linked by QPI; local transfers reach ~6 GB/s vs. ~4 GB/s remote. Sandy Bridge integrates the IOH.)
Dale Southard. Designing and Managing GPU Clusters. Supercomputing 2011. http://nvidia.fullviewmedia.com/fb/nv-sc11/tabscontent/archive/401-thu-southard.html
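One way to reason about locality from code is to query each GPU's PCIe bus ID; a small sketch using the runtime API:

```cuda
#include <cstdio>

// Inspect where each GPU sits on the PCIe topology, so host threads
// and pinned buffers can be placed near the GPU they feed.
int n = 0;
cudaGetDeviceCount(&n);
for (int d = 0; d < n; ++d) {
    char busId[16];
    cudaDeviceGetPCIBusId(busId, sizeof(busId), d);  // e.g. "0000:02:00.0"
    printf("GPU %d is at PCI %s\n", d, busId);
}
```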
-
Unified Virtual Addressing
Easier to program with a single address space.
(Diagram: without UVA, the CPU and each GPU have separate address spaces; with UVA, system memory and GPU0/GPU1 memory share one virtual address range over PCIe.)
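A minimal sketch of what UVA buys: pointers self-describe their location, so a copy no longer needs an explicit direction (`ptr`, `dst`, `src` are placeholder pointers):

```cuda
// Under UVA, the runtime can tell where a pointer lives...
cudaPointerAttributes attr;
cudaPointerGetAttributes(&attr, ptr);   // attr.device = owning GPU for device memory

// ...so a generic copy can infer its direction from the addresses.
cudaMemcpy(dst, src, bytes, cudaMemcpyDefault);
```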
-
Peer-to-Peer (P2P) Communication
-
NUMA/Topology matters for P2P
P2P communication is supported between GPUs on the same IOH.
QPI is incompatible with the PCIe P2P specification, so P2P does not work between GPUs on different IOHs.
(Diagram: GPU0 and GPU1 on CPU0/IOH0, GPU2 and GPU3 on CPU1/IOH1, each on x16 PCIe; the IOHs are linked by QPI. Sandy Bridge sockets integrate the IOH.)
-
P2P disabled over QPI - copies are staged via the host (P2H2P).
Configuration for Compute & Graphics
(Diagram: CPU0/IOH0 hosts two x16 PCIe switches - Tesla GPU0/GPU1 on Switch0 and Tesla GPU2/GPU3 on Switch1; CPU1/IOH1 hosts Quadro GPU4 and GPU5.)
Best P2P performance is between GPUs on the same PCIe switch, e.g. a K10 dual-GPU card (~6.5 GB/s).
P2P communication is supported between GPUs on the same IOH (~5 GB/s).
P2P communication: Linux only - no WDDM.
-
Mapping Algorithms to Hardware Topology
(Diagram: GPU0 ↔ GPU1 - data reduction, multi-phase problems.)
Won-Ki Jeong. S3308 - Fast Compressive Sensing MRI Reconstruction on a Multi-GPU System, GTC 2013.
Paulius M. S0515 - Implementing 3D Finite Difference Code on GPUs, GTC 2009.
-
Programming - Enumerating Resources
CUDA enumeration - the Teslas are enumerated around the Quadro:
./deviceQuery -noprompt | egrep "^Device"
Device 0: "Tesla K20c"
Device 1: "Quadro K2000"
Device 2: "Tesla K20c"
cudaSetDevice() sets the current GPU
— All CUDA calls are issued to the current GPU, except p2p memcopies
— The current GPU can be changed while async calls are executing
— A single thread can drive multiple GPUs
cudaGLGetDevices() gets the GPU(s) for the current GL context
— We will touch multiple OpenGL devices later
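A sketch of one host thread driving two GPUs (assumes hypothetical `kernelA`/`kernelB` and streams `stream0`/`stream1` created on devices 0 and 1 respectively):

```cuda
// Each asynchronous launch goes to whichever device is current at call time.
cudaSetDevice(0);
kernelA<<<grid, block, 0, stream0>>>(d_a);   // runs on GPU 0
cudaSetDevice(1);
kernelB<<<grid, block, 0, stream1>>>(d_b);   // runs on GPU 1, concurrently

// Both launches were asynchronous; synchronize each device when needed.
cudaSetDevice(0); cudaStreamSynchronize(stream0);
cudaSetDevice(1); cudaStreamSynchronize(stream1);
```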
-
Communication across GPUs - via Host
while (!done) {
// Compute the data in a temp buffer, and copy to a host buffer...
runCUDA(d_data);
cudaStreamWaitEvent(0, uploadFinished, 0);
cudaMemcpyAsync(h_data, d_data, vboSize, cudaMemcpyDeviceToHost, 0);
cudaEventRecord(downloadFinished, 0);
// Map the render-GPU’s resource and upload the host buffer...
// (all commands must be asynchronous.)
cudaSetDevice(renderGPU);
doMap((void**)&d_mappedPtr, 0);
cudaStreamWaitEvent(0, downloadFinished, 0);
cudaMemcpyAsync(d_mappedPtr, h_data, vboSize, cudaMemcpyHostToDevice, 0);
doUnmap(0);
cudaEventRecord(uploadFinished, 0);
cudaSetDevice(computeGPU);
doRender(…);
}
-
Synchronization across GPUs
Streams and events are per device
— Determined by the GPU that is current at creation
Stream GPU must be set current for
— Launching kernel to a stream
— Recording events to a stream
Agnostic to the current GPU
— Memcpy can be launched on any stream
— Synchronize/Query of Events
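A sketch of these rules (hypothetical `produce`/`consume` kernels): streams and events belong to the device that was current at creation, but a stream may wait on another device's event:

```cuda
cudaStream_t s0, s1;
cudaEvent_t  done0;

cudaSetDevice(0);
cudaStreamCreate(&s0);
cudaEventCreate(&done0);               // stream and event live on GPU 0
cudaSetDevice(1);
cudaStreamCreate(&s1);                 // stream lives on GPU 1

// GPU 0 produces...
cudaSetDevice(0);
produce<<<grid, block, 0, s0>>>(d_src);
cudaEventRecord(done0, s0);

// ...and GPU 1's stream can wait on GPU 0's event (cross-device is allowed).
cudaSetDevice(1);
cudaStreamWaitEvent(s1, done0, 0);
consume<<<grid, block, 0, s1>>>(d_dst);
```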
-
Revisit – P2P Communication
-
Peer-to-Peer (P2P) Initialization
cudaDeviceCanAccessPeer(&isAccessible, srcGPU, dstGPU)
— Returns in 1st arg if srcGPU can access memory of dstGPU
— Need to do this bidirectionally
cudaDeviceEnablePeerAccess(peerDevice, 0)
— Enables current GPU to access peerDevice
— Note that this is asymmetric!
cudaDeviceDisablePeerAccess()
— P2P can be limited to a specific phase in order to reduce overhead and free resources
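A minimal initialization sketch, checking and enabling access in both directions for devices `gpuA` and `gpuB`:

```cuda
// P2P access must be checked and enabled in each direction separately.
int aToB = 0, bToA = 0;
cudaDeviceCanAccessPeer(&aToB, gpuA, gpuB);
cudaDeviceCanAccessPeer(&bToA, gpuB, gpuA);
if (aToB && bToA) {
    cudaSetDevice(gpuA);
    cudaDeviceEnablePeerAccess(gpuB, 0);   // gpuA may now access gpuB's memory
    cudaSetDevice(gpuB);
    cudaDeviceEnablePeerAccess(gpuA, 0);   // and vice versa
}
```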
-
Peer-to-Peer Copy
cudaMemcpyAsync / cudaMemcpyPeerAsync
— Can be called in a separate stream
— Falls back to staging copies via the host for unsupported configurations
NVIDIA CUDA Webinars – Multi-GPU Programming
http://developer.download.nvidia.com/CUDA/training/cuda_webinars_multi_gpu.pdf
-
P2P Requirements
Works on:
— 64-bit applications
— Fermi and above
— CUDA 4.0+
— Linux and Windows TCC
Will not work:
— Dual-IOH configs
— Mixing WDDM/TCC
— Across different chips, even in the same generation (e.g. K5000 (GK104) and K20 (GK110))
-
Code Sample - P2P communication
Events creation:
cudaSetDevice(0);
cudaEventCreate(&finishedKernelEvent);
cudaSetDevice(1);
cudaEventCreate(&finishedCopyEvent);
cudaEventRecord(finishedCopyEvent, 0); // Trigger compute.
Main loop:
while (!done) {
// Set CUDA device to COMPUTE DEVICE 0
cudaSetDevice(0);
cudaStreamWaitEvent(0, finishedCopyEvent, 0);
doKernel(d_buffer, 0);
cudaEventRecord(finishedKernelEvent, 0);
// ON GL RENDER DEVICE 1
cudaSetDevice(1); // GL Device
doMap((void**)&d_mappedPtr, 0);
cudaStreamWaitEvent(0, finishedKernelEvent, 0);
cudaMemcpyPeerAsync(d_mappedPtr, 1, d_buffer, 0, size, cuda2glStream);
cudaEventRecord(finishedCopyEvent, 0);
doUnmap(0);
doRender();
}
-
P2P : Staging via Host
Example: mixed WDDM/TCC configuration.
(Timeline diagram, frames 1-3: CUDA Dev0 computes, then downloads; the driver automatically pipelines the download and upload memcpys; CUDA Dev1 maps, uploads, and unmaps; GL Dev1 renders. Events order the compute against the copies. The compute engine is idle during transfers.)
-
Profiling with Nsight - Timeline
(Nsight timeline: Kernel0 and D2H0, then H2D0 and Kernel1; the copies are overlapped by the driver.)
S. Domine. S3377 - Seamless Compute and OpenGL Graphics Development in NVIDIA Nsight 3.0 Visual Studio Edition and Beyond. GTC 2013.
-
Code Sample - Overlapping Kernel With Transfer
void* d_buffer[2]; // Ping-pong the buffer the kernel writes to, so it can overlap with transfer.
while (!done) {
// Set CUDA device to COMPUTE DEVICE 0
cudaSetDevice(0);
cudaStreamWaitEvent(stream, finishedCopyEvent[cur], 0);
doKernel(d_buffer[cur], stream);
cudaEventRecord(finishedKernelEvent[cur], stream);
// ON GL RENDER DEVICE 1
cudaSetDevice(1); // GL Device
doMap((void**)&d_mappedPtr, cuda2glStream);
cudaStreamWaitEvent(0, finishedKernelEvent[prev], 0);
cudaMemcpyPeerAsync(d_mappedPtr, 1, d_buffer[prev], 0, size, cuda2glStream);
cudaEventRecord(finishedCopyEvent[prev], cuda2glStream);
doUnmap(cuda2glStream);
doRender();
prev = cur;
cur = 1 - cur;
}
-
Timeline
(Nsight timeline: Kernel0, D2H0, H2D0, and Kernel1 now overlap; kernel, download, and upload are all overlapped.)
-
Peer-to-Peer Memory Access
Sometimes we don't want to copy explicitly, but to have access to the entire space on all GPUs and the CPU.
Already possible with linear memory; the recent texture addition is useful for graphics.
Example - a large dataset that can't fit into one GPU's memory:
— Distribute the domain/data across GPUs
— Each GPU now needs to access the other GPUs' data for halo/boundary exchange
-
Sharing a texture across GPUs
Peer access is unidirectional.
cudaSetDevice(0);
cudaMalloc3DArray(&d0_volume, …);
cudaSetDevice(1);
cudaMalloc3DArray(&d1_volume, …);
while (!done) {
// Set CUDA device to COMPUTE DEVICE 0
cudaSetDevice(0);
cudaBindTextureToArray(tex0, d0_volume);
cudaBindTextureToArray(tex1, d1_volume);
Kernel<<<…>>>(…);
}
__global__ void Kernel(...) {
float voxel0 = tex3D(tex0, u, v, w); // accesses gpu0
float voxel1 = tex3D(tex1, u, v, w); // accesses gpu1 through p2p
}
The kernel runs on device 0 and accesses the texture from device 1.
-
Multiple Compute : Single Render GPU
(Diagram: CUDA contexts on compute GPUs 1 and 2; the render GPU 0 holds the OpenGL context plus an auxiliary CUDA context for map/unmap.)
Getting the data to the render GPU:
— P2H2P on Windows (WDDM/TCC): explicit copies via host
— cudaMemcpyPeer
— P2P memory access
-
Multi-GPU Image Processing - SagivTech
Live demo at Exhibit Hall, Booth 712: "A Middleware for Real Time Multi GPU", GTC 2013.
(Diagram: SagivTech multi-GPU processing pipeline feeding a renderer.)
-
Scaling Graphics - Multiple Quadro GPUs
Virtualized SLI case
— Multi-GPU Mosaic drives a large wall
Access each GPU separately
— Parallel rendering
— Multiple view-frustums, e.g. stereo, CAVEs
(Images: Hurricane Sandy simulation showing multiple computed tracks for different input params, image credits NOAA; Visible Human 14GB texture data - data distribution + render across GPU-0..GPU-3, then sort + alpha composite.)
-
CUDA With SLI Mosaic
CUDA enumeration: [0] = K20, [1] = K5000, [2] = K5000
OpenGL enumeration with SLI: [0] = K5000 (the OpenGL context spans 2 CUDA devices)
int glDeviceIndices[MAX_GPU];
unsigned int glDeviceCount;
cudaGLGetDevices(&glDeviceCount, glDeviceIndices, MAX_GPU, cudaGLDeviceListAll);
// glDeviceCount == 2
Use driver interop with the current CUDA device
— The driver handles the optimized data transfer (work in progress)
-
Specifying the Render GPU Explicitly
Linux - specify separate X screens using XOpenDisplay:
Display* dpy = XOpenDisplay(":0." + gpu);
GLXContext ctx = glXCreateContextAttribs(dpy, …);
Windows - use the WGL_NV_gpu_affinity extension:
BOOL wglEnumGpusNV(UINT iGpuIndex, HGPUNV *phGPU);
// For each GPU enumerated...
GpuMask[0] = hGPU[0];
GpuMask[1] = NULL;
// Get an affinity DC based on the GPU.
HDC affinityDC = wglCreateAffinityDCNV(GpuMask);
setPixelFormat(affinityDC);
HGLRC affinityGLRC = wglCreateContext(affinityDC);
-
Mapping CUDA and Affinity GL Devices
How to map an OpenGL device to CUDA:
cudaWGLGetDevice(int *cudaDevice, HGPUNV oglGPUHandle)
Don't expect the same order
— GL order is Windows-specific, while CUDA order is device-specific
(Diagram: CUDA enumeration [0]=K20, [1]=K5000, [2]=K5000; GL affinity enumeration [0]=K5000, [1]=K5000; with SLI, [0]=K5000. cudaGLGetDevices maps a GL context to CUDA devices; cudaWGLGetDevice maps an affinity GPU handle to a CUDA device.)
S3070 - Part 1 - Configuring, Programming and Debugging Applications for Compute and Graphics on Multi-GPUs
-
Compute & Multiple Render GPUs
(Diagram: a CUDA context on compute GPU 1; OpenGL contexts on render GPUs 0 and 2. Data reaches the render GPUs via cudaMemcpyPeer into an auxiliary CUDA context (map/unmap), or via wglCopyImageSubDataNV between the render GPUs.)
GL_NV_copy_image - copy across render GPUs
— Cross-platform: wglCopyImageSubDataNV and glXCopyImageSubDataNV
— Can copy subregions; textures only
-
Multithreading with OpenGL Contexts
Multi-level CPU and GPU sync primitives:
GLsync consumedFence[MAX_BUFFERS];
GLsync producedFence[MAX_BUFFERS];
HANDLE consumedFenceValid, producedFenceValid;

Thread - Source GPU0 (producer):
// Wait for the consumer to release the buffer.
CPUWait(consumedFenceValid);
glWaitSync(consumedFence[1]);
// Bind render target.
glFramebufferTexture2D(srcTex[1]);
// Draw here…
// Unbind.
glFramebufferTexture2D(0);
// Copy over to the consumer GPU.
wglCopyImageSubDataNV(srcCtx, srcTex[1], …, destCtx, destTex[1]);
// Signal that the producer has completed.
producedFence[1] = glFenceSync(…);
CPUSignal(producedFenceValid);

Thread - Dest GPU1 (consumer):
// Wait for the signal to start consuming.
CPUWait(producedFenceValid);
glWaitSync(producedFence[0]);
// Bind texture object.
glBindTexture(destTex[0]);
// Use as needed...
// Signal we are done with this texture.
consumedFence[0] = glFenceSync(…);
CPUSignal(consumedFenceValid);

(Diagram: srcTex[0..1] on GPU0 are copied via glCopyImage to destTex[0..1] on GPU1.)
Shalini V. S0353 - Programming Multi-GPUs for Scalable Rendering, GTC 2012.
-
Conclusions
Need for finer-grained programmability for compute+graphics.
Hardware architecture influences transfer and system throughput.
Peer-to-peer copies and access simplify multi-GPU programming.
Scaling graphics
— Device mapping between CUDA and OpenGL
GTC_S3072_Part1.pdf
GTC_S3071_Part2.pdf