Mixing Graphics & Compute with Multi-GPU | GTC 2013

  • Mixing Graphics & Compute with multi-GPU

    Wil Braithwaite - NVIDIA Applied Engineering

    http://www.gputechconf.com/page/home.html

  • Talk Outline

    Compute and Graphics API Interoperability.

    Interoperability Methodologies.

    Interoperability at a system level.

    Application design considerations.

    2

  • Compute & Visualize the same data

    [Diagram: one application both computes and visualizes the same data.]

    3

  • Compute/Graphics interoperability

    Setup the objects in the graphics context.

    Register the objects with the compute context.

    Map/unmap the objects from the compute context.

    [Diagram: the application bridges the CUDA and OpenGL/DX contexts through the interop API; a CUDA array maps to a texture object, a CUDA buffer to a buffer object.]

    4

  • Code Sample – Simple image interop

    Setup and Registration of Texture Objects:

    GLuint texId;

    cudaGraphicsResource_t texRes;

    // OpenGL buffer creation...

    glGenTextures(1, &texId);

    glBindTexture(GL_TEXTURE_2D, texId);

    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8UI_EXT, texWidth, texHeight, 0,

    GL_RGBA_INTEGER_EXT, GL_UNSIGNED_BYTE, 0);

    glBindTexture(GL_TEXTURE_2D, 0);

    // Registration with CUDA.

    cudaGraphicsGLRegisterImage(&texRes, texId, GL_TEXTURE_2D,

    cudaGraphicsRegisterFlagsNone);

    5

  • Code Sample – Simple image interop

    Mapping between contexts:

    cudaArray* texArray;

    while (!done)

    {

    cudaGraphicsMapResources(1, &texRes);

    cudaGraphicsSubResourceGetMappedArray(&texArray, texRes, 0, 0);

    runCUDA(texArray);

    cudaGraphicsUnmapResources(1, &texRes);

    runGL(texId);

    }

    6

  • Code Sample – Simple buffer interop

    Setup and Registration of Buffer Objects:

    GLuint vboId;

    cudaGraphicsResource_t vboRes;

    // OpenGL buffer creation...

    glGenBuffers(1, &vboId);

    glBindBuffer(GL_ARRAY_BUFFER, vboId);

    glBufferData(GL_ARRAY_BUFFER, vboSize, 0, GL_DYNAMIC_DRAW);

    glBindBuffer(GL_ARRAY_BUFFER, 0);

    // Registration with CUDA.

    cudaGraphicsGLRegisterBuffer(&vboRes, vboId, cudaGraphicsRegisterFlagsNone);

    7

  • Code Sample – Simple buffer interop

    Mapping between contexts:

    float* vboPtr;

    while (!done)

    {

    cudaGraphicsMapResources(1, &vboRes, 0);

    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);

    runCUDA(vboPtr);

    cudaGraphicsUnmapResources(1, &vboRes, 0);

    runGL(vboId);

    }

    8

  • Resource Behavior: Single-GPU

    The resource is shared.

    Context switch is fast and independent of data size.

    [Diagram: on a single GPU, the CUDA and GL contexts share one copy of the data through the interop API.]

    9

  • float* vboPtr;

    while (!done)

    {

    cudaGraphicsMapResources(1, &vboRes, 0);

    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);

    runCUDA(vboPtr);

    cudaGraphicsUnmapResources(1, &vboRes, 0);

    runGL(vboId);

    }

    Code Sample – Simple buffer interop

    Mapping between contexts:

    Context-switching happens

    when these commands are processed.

    10

  • Timeline: Single-GPU

    Driver-interop.

    [Timeline figure: frames 1-4 on one GPU; each frame runs map, the runCUDA kernel, unmap, then the runGL render, with CUDA and GL alternating on the same device.]

    11

  • float* vboPtr;

    while (!done)

    {

    cudaGraphicsMapResources(1, &vboRes, 0);

    cudaStreamSynchronize(0);

    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);

    runCUDA(vboPtr);

    cudaGraphicsUnmapResources(1, &vboRes, 0);

    cudaStreamSynchronize(0);

    runGL(vboId);

    }

    Code Sample – Simple buffer interop

    Adding synchronization for analysis:

    12

  • Timeline: Single-GPU

    Driver-interop, synchronous* — (we synchronize after the map and unmap calls).

    [Timeline figure: frames 1-4; the synchronize after “map” waits for GL to finish before the context switch, and the synchronize after “unmap” waits for CUDA (and GL) to finish.]

    13

  • Resource Behavior: Multi-GPU

    Each GPU has a copy of the resource.

    Context switch is dependent on data size, because the driver must copy the data.

    [Diagram, slides 16-18: the CUDA context on the compute-GPU and the GL context on the render-GPU each hold a copy of the data; MAP copies the data toward the compute-GPU, UNMAP copies it back toward the render-GPU.]

    16-18

  • Timeline: Multi-GPU

    Driver-interop, synchronous* — SLOWER! (Tasks are serialized.)

    [Timeline figure: frames 1-3; resources are mirrored and synchronized across the GPUs with GLtoCU and CUtoGL copies around each kernel, and “map” has to wait for GL to complete before it synchronizes the resource.]

    20

  • Interoperability Methodologies

    READ-ONLY

    — GL produces... and CUDA consumes.

    e.g. Post-process the GL render in CUDA.

    WRITE-DISCARD

    — CUDA produces... and GL consumes.

    e.g. CUDA simulates fluid, and GL renders result.

    READ & WRITE

    — Useful if you want to use the rasterization pipeline.

    e.g. Feedback loop:

    — runGL(texture) → framebuffer

    — runCUDA(framebuffer) → texture

    21
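The three patterns above correspond to interop map-flag hints. A minimal host-side sketch of that mapping (the enum names and values here only mirror CUDA's cudaGraphicsMapFlags for illustration; this is not the CUDA API itself):

```cpp
// Illustrative stand-ins for the cudaGraphicsMapFlags values.
enum MapFlag { FlagNone = 0, FlagReadOnly = 1, FlagWriteDiscard = 2 };

// Pick the hint that matches how CUDA will touch the mapped resource.
MapFlag flagFor(bool cudaReads, bool cudaWrites) {
    if (cudaReads && !cudaWrites) return FlagReadOnly;     // GL produces, CUDA consumes
    if (!cudaReads && cudaWrites) return FlagWriteDiscard; // CUDA produces, GL consumes
    return FlagNone;                                       // read & write: no hint possible
}
```

The hint matters because WRITE-DISCARD lets the driver skip synchronizing the old contents to the compute GPU on map, as the following slides show.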

  • float* vboPtr;

    cudaGraphicsResourceSetMapFlags(vboRes, cudaGraphicsMapFlagsWriteDiscard);

    while (!done)

    {

    cudaGraphicsMapResources(1, &vboRes, 0);

    cudaStreamSynchronize(0);

    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);

    runCUDA(vboPtr);

    cudaGraphicsUnmapResources(1, &vboRes, 0);

    cudaStreamSynchronize(0);

    runGL(vboId);

    }

    CUDA produces... and OpenGL consumes:

    Code Sample – WRITE-DISCARD

    Hint that we do not care about the previous contents of the buffer.

    22

  • Timeline: Single-GPU

    Driver-interop, synchronous*, WRITE-DISCARD.

    [Timeline figure: frames 1-4 on one GPU; the synchronize after “map” waits for GL to finish before the context switch, the synchronize after “unmap” waits for CUDA (and GL) to finish, and the context switch forces serialization.]

    23

  • Timeline: Multi-GPU

    Driver-interop, synchronous*, WRITE-DISCARD.

    [Timeline figure: frames 1-4; when multi-GPU, “map” does nothing for WRITE-DISCARD, compute and render can overlap as they are on different GPUs, and the synchronize after “unmap” waits for the CUtoGL copy and GL to finish.]

    24

  • Timeline: Multi-GPU

    Driver-interop, synchronous*, WRITE-DISCARD — if render is long...

    [Timeline figure: frames 1-4; when the GL render runs long, “unmap” will wait for GL, delaying the CUtoGL copy and the next kernel.]

    25

  • Driver-Interop: System View

    [System diagrams: the single-GPU and multi-GPU driver-interop data paths.]

    26

  • Manual-Interop: System View

    Multi-GPU

    [System diagram: data staged from the compute-GPU through host memory to the render-GPU.]

    27

  • cudaMalloc((void**)&d_data, vboSize);

    cudaHostAlloc((void**)&h_data, vboSize, cudaHostAllocPortable);

    while (!done) {

    // Compute data in temp buffer, and copy to host...

    runCUDA(d_data);

    cudaMemcpyAsync(h_data, d_data, vboSize, cudaMemcpyDeviceToHost, 0);

    cudaStreamSynchronize(0);

    // Map the render-GPU’s resource and upload the host buffer...

    cudaSetDevice(renderGPU);

    cudaGraphicsMapResources(1, &vboRes, 0);

    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);

    cudaMemcpy(vboPtr, h_data, size, cudaMemcpyHostToDevice);

    cudaGraphicsUnmapResources(1, &vboRes, 0);

    cudaSetDevice(computeGPU);

    runGL(vboId);

    }

    Code Sample – Manual-Interop

    Create a temporary buffer

    in pinned host-memory.

    28

  • Timeline: Multi-GPU

    Manual-interop, synchronous*, WRITE-DISCARD.

    [Timeline figure: frames 1-4; each frame runs the kernel, a synchronized CUtoH download, a synchronized HtoGL upload, then the GL render, serializing compute, transfers, and render.]

    29

  • cudaMalloc((void**)&d_data, vboSize);

    cudaHostAlloc((void**)&h_data, vboSize, cudaHostAllocPortable);

    while (!done) {

    // Compute data in temp buffer, and copy to host...

    runCUDA(d_data);

    cudaMemcpyAsync(h_data, d_data, vboSize, cudaMemcpyDeviceToHost, 0);

    cudaStreamSynchronize(0);

    // Map the render-GPU’s resource and upload the host buffer...

    // (all commands must be asynchronous.)

    cudaSetDevice(renderGPU);

    cudaGraphicsMapResources(1, &vboRes, 0);

    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);

    cudaMemcpyAsync(vboPtr, h_data, size, cudaMemcpyHostToDevice, 0);

    cudaGraphicsUnmapResources(1, &vboRes, 0);

    cudaSetDevice(computeGPU);

    runGL(vboId);

    }

    Code Sample - Manual Interop (Async)

    Use asynchronous copy in default stream.

    30

  • Timeline: Multi-GPU

    Manual-interop, asynchronous, WRITE-DISCARD.

    [Timeline figure: frames 1-4; with asynchronous copies, the HtoGL upload overlaps the next frame’s kernel and CUtoH download.]

    31

  • Timeline: Multi-GPU

    Manual-interop, asynchronous, WRITE-DISCARD — if render is long...

    [Timeline figure: frames 1-4; with a long render, the next CUtoH download overlaps the still-pending HtoGL upload of the same host buffer. We are downloading while uploading! Drifting out of sync!]

    32

  • Timeline: Multi-GPU (fixed Async)

    Manual-interop, asynchronous, WRITE-DISCARD — if render is long...

    [Timeline figure: frames 1-4, fixed async; the synchronization must also wait for the HtoGL upload to finish before the host buffer is reused.]

    33

  • while (!done) {

    // Compute the data in a temp buffer, and copy to a host buffer...

    runCUDA(d_data);

    cudaStreamWaitEvent(0, uploadFinished, 0);

    cudaMemcpyAsync(h_data, d_data, vboSize, cudaMemcpyDeviceToHost, 0);

    cudaStreamSynchronize(0);

    // Map the render-GPU’s resource and upload the host buffer...

    // (all commands must be asynchronous.)

    cudaSetDevice(renderGPU);

    cudaGraphicsMapResources(1, &vboRes, 0);

    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);

    cudaMemcpyAsync(vboPtr, h_data, size, cudaMemcpyHostToDevice, 0);

    cudaGraphicsUnmapResources(1, &vboRes, 0);

    cudaEventRecord(uploadFinished, 0);

    cudaSetDevice(computeGPU);

    runGL(vboId);

    }

    Code Sample - Manual Interop (fixed Async)

    34

  • Timeline: Multi-GPU

    Manual-interop, asynchronous, WRITE-DISCARD, with flipping.

    [Timeline figure: frames 1-5 with two host buffers; kernel[A] and CUtoH[A] overlap the H[B]toGL uploads and GL renders, the A and B buffers alternating each frame.]

    35

  • int read = 1, write = 0;

    while (!done) {

    // Compute the data in a temp buffer, and copy to a host buffer...

    cudaStreamWaitEvent(custream[write], kernelFinished[read]);

    runCUDA(d_data, custream[write]);

    cudaEventRecord(kernelFinished[write], custream[write]);

    cudaStreamWaitEvent(custream[write], uploadFinished[read]);

    cudaMemcpyAsync(h_data[write], d_data, vboSize, cudaMemcpyDeviceToHost, custream[write]);

    cudaEventRecord(downloadFinished[write], custream[write]);

    // Map the renderGPU’s resource and upload the host buffer...

    cudaSetDevice(renderGPU);

    cudaGraphicsMapResources(1, &vboRes, glstream);

    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);

    cudaStreamWaitEvent(glstream, downloadFinished[read]);

    cudaMemcpyAsync(vboPtr, h_data[read], size, cudaMemcpyHostToDevice, glstream);

    cudaGraphicsUnmapResources(1, &vboRes, glstream);

    cudaEventRecord(uploadFinished[read], glstream);

    cudaStreamSynchronize(glstream); // Sync for easier analysis!

    cudaSetDevice(computeGPU);

    runGL(vboId);

    swap(&read, &write);

    }

    Code Sample - Manual Interop (streams)

    36
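The read/write index dance in the loop above can be isolated as a tiny helper (a sketch only; the `PingPong` type is ours, not from the talk):

```cpp
#include <utility>

// Two-slot ping-pong: this frame's kernel fills slot `write` while the
// upload consumes slot `read` from the previous frame, then the roles flip.
struct PingPong {
    int read = 1, write = 0;
    void flip() { std::swap(read, write); }
};
```

Keeping the indices in one place makes it harder to wait on the wrong slot's event, which is the main hazard in the stream/event code above.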

  • Timeline: Multi-GPU

    Manual-interop, asynchronous, WRITE-DISCARD, + ping-pong — if render is long... then we must flip the resource too... etc. etc.

    [Timeline figure: as on slide 35, with the double-buffered host staging keeping the compute-GPU, the transfers, and the render-GPU busy simultaneously.]

    37

  • Benchmarks & Demo

    — runCUDA (20ms)

    — runGL (10ms)

    — copy (10ms)

    Single-GPU

    — Driver-interop = 30ms

    Multi-GPU

    — Driver-interop = 36ms

    — Async Manual-interop = 32ms

    — Flipped Manual-interop = 22ms

    Too large a data size makes multi-GPU interop worse. Overlapping the download helps us break even. But using streams and flipping is a significant win!

    38
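A crude latency model helps explain these numbers (the stage times are the ones assumed above; the model itself is ours, not from the talk): serialized stages add up, while a fully pipelined loop is bound by its slowest stage.

```cpp
#include <algorithm>
#include <initializer_list>

// Frame time when every stage runs back-to-back (serialized driver-interop).
int serializedMs(std::initializer_list<int> stages) {
    int sum = 0;
    for (int s : stages) sum += s;
    return sum;
}

// Steady-state frame time when all stages overlap (streams + flipping):
// bound by the slowest stage only.
int pipelinedMs(std::initializer_list<int> stages) {
    return *std::max_element(stages.begin(), stages.end());
}
```

With kernel = 20 ms, copy = 10 ms, render = 10 ms, the serialized bound is 40 ms and the pipelined bound is 20 ms, bracketing the measured 36 ms (driver-interop) and 22 ms (flipped manual-interop).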

  • Scaling further

    Ping-pong the renderGPU side too.

    — Current example is bound by (upload+render)

    Kernel might not be dependent on previous kernel.

    — e.g. Could you run two kernels simultaneously?

    If your CUDA kernels are much more expensive than your GL render then this could be a win.

    Remember to use streams and events, AND consider all the dependencies.

    39

  • Interoperability behavior: Multi-GPU

    Similar considerations are applicable when OpenGL is the producer and CUDA is the consumer.

    Use cudaGraphicsMapFlagsReadOnly.

    40

  • Application Design Considerations

    Avoid synchronized GPUs for CUDA.

    — Watch out for Windows’s WDDM implicit synchronization on unmap!

    Provision for multi-GPU environments:

    — Let the user choose the GPUs.

    — Use cudaD3D[9|10|11]GetDevices()/cudaGLGetDevices() to match CUDA and graphics device enumerations.

    CUDA-OpenGL interoperability can perform slower if OpenGL context spans multiple GPUs.

    Context switch performance varies with system config and OS.

    41

  • Conclusions & Resources

    The driver can do all the heavy-lifting but...

    Scalability and final performance are up to the developer.

    — For fine-grained control and optimization, you might want to move the data manually.

    CUDA samples/documentation: — http://developer.nvidia.com/cuda-downloads

    OpenGL Insights, Patrick Cozzi, Christophe Riccio, 2012. ISBN 1439893764.

    www.openglinsights.com

    42


  • Thank you.

    Questions will be taken at the end of the full session.

    43

  • Scaling Graphics and Compute on Multi-GPUs

    Shalini Venkataraman - PSG Applied Engineering

  • Talk Outline

    Recap on default behavior

    — We need finer grained control of managing scaling

    Enumerating Graphics & Compute Resources

    — Supported hardware

    — NUMA considerations

    Different methods for scaling and communication

    — Multiple CUDA GPUs + 1 Graphics GPU

    — Multiple graphics GPUs

    Focus on Single node system, CUDA and OpenGL

    Nsight for profiling

  • Recap – Beyond API Interoperability

    API Interop hides all the complexity.

    BUT, sometimes:

    — Don’t want to transfer all data.

    Kernel may compute subregions, but the entire buffer object/texture is copied.

    — May have a complex system with multiple compute GPUs and/or render GPUs.

    — Application-specific pipelining and multi-buffering.

    — May have some CPU code in your algorithm between compute and graphics.

  • Scaling Beyond Single Compute+Graphics

    Scaling compute

    — Divide tasks across multiple devices.

    — When data does not fit into a single GPU’s memory, distribute the data.

    Scaling graphics

    — Multi-displays, stereo.

    — Cooler rendering, e.g. raytracing, complex lighting models.

    Higher compute density

    — Amortize host or server costs, e.g. CPU, memory, RAID shared with multiple GPUs.

  • Multi-GPU Compute+Graphics Use Cases

    Image processing

    — Multiple compute GPUs and a low-end display GPU for blitting.

    Mixing rasterization with compute

    — Polygonal rendering done in OpenGL and input to compute for further processing.

    Visualization for HPC simulation

    — Numerical simulation distributed across multiple compute GPUs, possibly on a remote supercomputer.

    [Images: NVIDIA Index, seismic interpretation; Morpheus Medical, real-time CFD viz; GTC 2013 Exhibit Hall.]

  • Mixing Tesla and Quadro GPUs

    — Tight integration with OEMs and System Integrators

    — Optimized driver paths for GPU-GPU communication

    NVIDIA Maximus Initiative

    [Diagram: a traditional workstation visualizes locally while simulating on a cluster; an NVIDIA® MAXIMUS™ workstation runs Simulate0-3 alongside Visualize0-2 on GPUs in the same box.]

  • Supported Hardware

    Boxx Systems:

    — 3D Boxx 8950: Quadro + 4 Tesla K20s

    — 3D Boxx 4920/4920 Xtreme: Quadro + 3 Tesla K20s

    — 3D Boxx 8920: Quadro + 2 Tesla K20s

    Supermicro Maximus Systems:

    — SYS-7047GR-TRF: Quadro + 4 Tesla K20s

    — SYS-7047A/7047A-73/7037A-i: Quadro + 3 Tesla K20s

  • NUMA/Topology Considerations

    Memory access is non-uniform!

    — Local GPU access is faster than remote (extra QPI hop).

    — Affects PCIe transfer throughput.

    NUMA APIs

    — Thread affinity considerations.

    Pitfalls of setting process affinity

    — Does not work with graphics APIs.

    [Diagram: a dual-socket system with CPU0/IOH0 and CPU1/IOH1 linked by QPI, each with local memory and a GPU; access to the local GPU runs at ~6 GB/s versus ~4 GB/s over the extra QPI hop. Sandy Bridge integrates the IOH.]

    Dale Southard. Designing and Managing GPU Clusters. Supercomputing 2011 http://nvidia.fullviewmedia.com/fb/nv-sc11/tabscontent/archive/401-thu-southard.html

  • Unified Virtual Addressing

    Easier to program with a single address space.

    [Diagram: without UVA, the CPU and each GPU have separate memory spaces; with UVA, system memory and the GPU0/GPU1 memories occupy a single virtual address range across PCI-e.]

  • Peer-to-Peer (P2P) Communication

  • NUMA/Topology matters for P2P

    [Diagram: P2P communication is supported between GPUs on the same IOH (x16 PCI-e links). With dual IOHs, QPI is incompatible with the PCI-e P2P specification, so GPU0/GPU1 on IOH0 cannot do P2P with GPU2/GPU3 on IOH1. Sandy Bridge sockets integrate the IOH.]

  • P2P disabled over QPI – Copying staged via host (P2H2P)

    Configuration for Compute & Graphics

    [Diagram: four Tesla GPUs sit behind two PCI-e switches on IOH0, and two Quadro GPUs on IOH1. Best P2P performance is between GPUs on the same PCI-e switch (~6.5 GB/s, e.g. a K10 dual-GPU card); P2P between GPUs on the same IOH runs at ~5 GB/s. P2P communication is Linux only: no WDDM.]

  • Mapping Algorithms to Hardware Topology

    Won-Ki Jeong. S3308 - Fast Compressive Sensing MRI Reconstruction on a Multi-GPU System, GTC 2013.

    Paulius M. S0515 - Implementing 3D Finite Difference Code on GPUs, GTC 2009.

    [Diagram: mapping work onto GPU0/GPU1 for data reduction and multi-phase problems.]

  • Programming - Enumerating Resources

    CUDA enumeration: Tesla is enumerated above Quadro.

    ./deviceQuery -noprompt | egrep "^Device"

    Device 0: "Tesla K20c"

    Device 1: "Quadro K2000"

    Device 2: "Tesla K20c"

    cudaSetDevice() sets current GPU

    — All CUDA calls are issued to the current GPU, except P2P memcopies.

    — Current GPU can be changed while async calls are executed

    Single thread can drive multiple GPUs

    cudaGLGetDevices() gets the GPU(s) for current GL context

    — We will touch multiple OpenGL devices later

  • Communication across GPUs - via Host

    while (!done) {

    // Compute the data in a temp buffer, and copy to a host buffer...

    runCUDA(d_data);

    cudaStreamWaitEvent(0, uploadFinished, 0);

    cudaMemcpyAsync(h_data, d_data, vboSize, cudaMemcpyDeviceToHost, 0);

    cudaEventRecord(downloadFinished, 0);

    // Map the render-GPU’s resource and upload the host buffer...

    // (all commands must be asynchronous.)

    cudaSetDevice(renderGPU);

    doMap((void**)&d_mappedPtr, 0);

    cudaStreamWaitEvent(0, downloadFinished, 0);

    cudaMemcpyAsync(d_mappedPtr, h_data, size, cudaMemcpyHostToDevice, 0);

    doUnmap(0);

    cudaEventRecord(uploadFinished, 0);

    cudaSetDevice(computeGPU);

    doRender(…);

    }

  • Synchronization across GPUs

    Streams and events are per device

    — Determined by the GPU that is current at creation

    Stream GPU must be set current for

    — Launching kernel to a stream

    — Recording events to a stream

    Agnostic to the current GPU

    — Memcpy can be launched on any stream

    — Synchronize/Query of Events
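The device-affinity rules above can be summarized as a toy lookup (our own illustrative helper, not a CUDA API):

```cpp
#include <string>

// Per the rules above: kernel launches and event records require the
// stream's own GPU to be current; memcpys and event query/synchronize
// can be issued regardless of which GPU is current.
bool needsStreamDeviceCurrent(const std::string& op) {
    return op == "kernel_launch" || op == "event_record";
}
```

In practice this means a single host thread can drive all GPUs, inserting cudaSetDevice only before launches and event records.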

  • Revisit – P2P Communication

  • Peer-to-Peer (P2P) Initialization

    cudaDeviceCanAccessPeer(&isAccessible, srcGPU, dstGPU)

    — Returns in 1st arg if srcGPU can access memory of dstGPU

    — Need to do this bidirectionally

    cudaDeviceEnablePeerAccess(peerDevice, 0)

    — Enables current GPU to access peerDevice

    — Note that this is asymmetric!

    cudaDeviceDisablePeerAccess

    — P2P can be limited to a specific phase in order to reduce overhead and free resources.
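The bidirectional check and the per-direction enable can be sketched generically; the callables below stand in for cudaDeviceCanAccessPeer / cudaDeviceEnablePeerAccess so the sketch runs without a GPU:

```cpp
#include <functional>

// Enable peer access in both directions between GPUs a and b.
// `canAccess(src, dst)` and `enable(src, dst)` abstract the CUDA calls.
bool enablePeerPair(int a, int b,
                    const std::function<bool(int, int)>& canAccess,
                    const std::function<void(int, int)>& enable) {
    if (!canAccess(a, b) || !canAccess(b, a)) return false; // check both ways
    enable(a, b); // access is asymmetric: each direction is enabled separately
    enable(b, a);
    return true;
}
```

In real CUDA code, each `enable(src, dst)` corresponds to cudaSetDevice(src) followed by cudaDeviceEnablePeerAccess(dst, 0).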

  • Peer-to-Peer Copy

    cudaMemcpyAsync

    cudaMemcpyPeerAsync

    — Called in a separate stream

    — Falls back to staging copies via host for unsupported configurations.

    NVIDIA CUDA Webinars – Multi-GPU Programming

    http://developer.download.nvidia.com/CUDA/training/cuda_webinars_multi_gpu.pdf

  • P2P Requirements

    Works on

    — 64bit app

    — Fermi and above

    — Cuda 4.0 +

    — Linux and Windows TCC

    Will not work

    — Dual-IOH configs

    — Mixing WDDM/TCC

    — Across different chips even in same generation

    eg K5000 (GK104) and K20 (GK110)
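The checklist above boils down to a single predicate; a sketch with our own field names (not a CUDA API):

```cpp
// Conditions from the slide; all must hold for direct P2P.
struct P2PConfig {
    bool app64bit;       // 64-bit application
    bool fermiOrLater;   // both GPUs Fermi and above
    bool cuda40OrLater;  // CUDA 4.0+
    bool linuxOrTcc;     // Linux, or Windows TCC on both (no WDDM mix)
    bool sameIoh;        // not split across dual-IOH configs
    bool sameChip;       // same chip, e.g. not GK104 with GK110
};

bool p2pPossible(const P2PConfig& c) {
    return c.app64bit && c.fermiOrLater && c.cuda40OrLater &&
           c.linuxOrTcc && c.sameIoh && c.sameChip;
}
```

When the predicate fails, cudaMemcpyPeerAsync still works but stages the copy via host, as noted above.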

  • P2P communication

    Event creation:

    cudaSetDevice(0);
    cudaEventCreate(&finishedKernelEvent);
    cudaSetDevice(1);
    cudaEventCreate(&finishedCopyEvent);
    cudaEventRecord(finishedCopyEvent, 0); // Trigger compute

    while (!done) {
      // Set CUDA device to COMPUTE DEVICE 0
      cudaSetDevice(0);
      cudaStreamWaitEvent(0, finishedCopyEvent, 0);
      doKernel(d_buffer, 0);
      cudaEventRecord(finishedKernelEvent, 0);
      // ON GL RENDER DEVICE 1
      cudaSetDevice(1); // GL Device
      doMap((void**)&d_mappedPtr, 0);
      cudaStreamWaitEvent(0, finishedKernelEvent, 0);
      cudaMemcpyPeerAsync(d_mappedPtr, 1, d_buffer, 0, size, cuda2glStream);
      cudaEventRecord(finishedCopyEvent, 0);
      doUnmap(0);
      doRender();
    }

    [Diagram: GPUs attached to the same IOH communicate P2P over PCIe; IOH0 and IOH1 are linked by QPI.]

  • P2P : Staging via Host

    Example: mixed WDDM/TCC configuration.

    [Timeline figure: frames 1-3; CUDA-Dev0 waits for the event, computes, downloads, and records the event; the driver automatically pipelines the download and upload memcpys through the host; CUDA-Dev1/GL-Dev1 run map, upload, unmap, render. The compute engine is idle during transfers.]

  • Profiling with Nsight - Timeline

    [Nsight timeline: Kernel0 with its D2H0 download and H2D0 upload, overlapped by the driver with Kernel1.]

    S. Domine. S3377 - Seamless Compute and OpenGL Graphics Development in NVIDIA Nsight 3.0 Visual Studio Edition and Beyond. GTC 2013.

  • Overlapping Kernel With Transfer

    void* d_buffer[2]; // Ping-pong the buffer the kernel writes to, so it can overlap with the transfer.

    while (!done) {
      // Set CUDA device to COMPUTE DEVICE 0
      cudaSetDevice(0);
      cudaStreamWaitEvent(stream, finishedCopyEvent[cur], 0);
      doKernel(d_buffer[cur], stream);
      cudaEventRecord(finishedKernelEvent[cur], stream);
      // ON GL RENDER DEVICE 1
      cudaSetDevice(1); // GL Device
      doMap((void**)&d_mappedPtr, cuda2glStream);
      cudaStreamWaitEvent(0, finishedKernelEvent[prev], 0);
      cudaMemcpyPeerAsync(d_mappedPtr, 1, d_buffer[prev], 0, size, cuda2glStream);
      cudaEventRecord(finishedCopyEvent[prev], cuda2glStream);
      doUnmap(cuda2glStream);
      doRender();
      prev = cur;
      cur = 1 - cur;
    }

  • Timeline

    [Nsight timeline: the kernel, download, and upload are overlapped; Kernel0, D2H0, H2D0, and Kernel1 run concurrently.]

  • Peer-to-Peer Memory Access

    Sometimes we don’t want to explicitly copy, but instead have access to the entire space on all GPUs and the CPU.

    Already possible with linear memory; the recent texture addition is useful for graphics.

    Example: a large dataset that can’t fit into one GPU’s memory.

    — Distribute the domain/data across GPUs.

    — Each GPU now needs to access the other GPUs’ data for halo/boundary exchange.

  • Sharing texture across GPUs

    Peer access is unidirectional.

    cudaSetDevice(0);
    cudaMalloc3DArray(d0_volume, …);
    cudaSetDevice(1);
    cudaMalloc3DArray(d1_volume, …);

    while (!done) {
      // Set CUDA device to COMPUTE DEVICE 0
      cudaSetDevice(0);
      cudaBindTextureToArray(tex0, d0_volume);
      cudaBindTextureToArray(tex1, d1_volume);
      Kernel<<<…>>>(…);
    }

    __global__ void Kernel(…) {
      float voxel0 = tex3D(tex0, u, v, w); // accesses gpu0
      float voxel1 = tex3D(tex1, u, v, w); // accesses gpu1 through p2p
    }

    The kernel runs on device 0 and accesses the texture from device 1.

  • Multiple Compute : Single Render GPU

    [Diagram: CUDA contexts on compute GPUs 1 and 2 feed the render device, GPU 0, which hosts the OpenGL context and an auxiliary CUDA context for map/unmap. Transfer options: explicit copies via host (P2H2P on Windows WDDM/TCC), cudaMemcpyPeer, or P2P memory access.]

  • Multi-GPU Image Processing - SagivTech

    Live demo at the Exhibit Hall, Booth 712: “A middleware for Real Time Multi GPU”, GTC 2013.

    [Diagram: multiple compute GPUs feeding a renderer.]

  • Scaling Graphics - Multiple Quadro GPUs

    Virtualized SLI case

    — Multi-GPU mosaic drives a large wall

    Access each GPU separately

    — Parallel rendering

    — Multi view-frustum eg Stereo, CAVEs

    [Images: a Hurricane Sandy simulation showing multiple computed tracks for different input params (image credits: NOAA); a parallel-rendering pipeline with data distribution + render across GPU-0..GPU-3 followed by sort + alpha composite, shown on the Visible Human 14 GB texture data.]

  • CUDA With SLI Mosaic

    CUDA enumeration: [0] = K20, [1] = K5000, [2] = K5000.

    OpenGL enumeration with SLI: [0] = K5000; the OpenGL context spans 2 CUDA devices.

    cudaGLGetDevices maps the GL context to its CUDA devices:

    int glDeviceIndices[MAX_GPU];
    unsigned int glDeviceCount;
    cudaGLGetDevices(&glDeviceCount, glDeviceIndices, MAX_GPU, cudaGLDeviceListAll);
    // glDeviceCount == 2

    Use driver interop with the current cuda device.

    — The driver handles the optimized data transfer (work in progress).

  • Specifying Render GPU Explicitly

    Linux

    — Specify separate X screens using XOpenDisplay

    Windows

    — Using NV_GPU_AFFINITY extension

    Display* dpy = XOpenDisplay(“:0.”+gpu)

    GLXContext = glxCreateContextAttribs(dpy,…);

    BOOL wglEnumGpusNV(UINT iGpuIndex, HGPUNV *phGPU)

    For #GPUs enumerated {

    GpuMask[0]=hGPU[0];

    GpuMask[1]=NULL;

    //Get affinity DC based on GPU

    HDC affinityDC = wglCreateAffinityDCNV(GpuMask);

    setPixelFormat(affinityDC);

    HGLRC affinityGLRC = wglCreateContext(affinityDC);

    }

  • Mapping Cuda Affinity GL Devices

    How to map OpenGL device to CUDA

    cudaWGLGetDevice(int *cudadevice,HGPUNV oglGPUHandle)

    Don’t expect the same order

    — GL order is Windows-specific, while CUDA order is device-specific.

    [Diagram: CUDA enumeration gives [0] = K20, [1] = K5000, [2] = K5000; GL affinity enumeration gives [0] = K5000, [1] = K5000 (with SLI, a single [0] = K5000). cudaGLGetDevices and cudaWGLGetDevice map GL contexts and affinity handles to CUDA devices.]

    S3070 - Part 1 - Configuring, Programming and Debugging Applications for Compute and Graphics on Multi-GPUs.

  • Compute & Multiple Render

    [Diagram: the CUDA context on compute GPU 0 feeds OpenGL contexts on render GPUs 1 and 2 via an auxiliary CUDA context with map/unmap, cudaMemcpyPeer between devices, and wglCopyImageSubData between the render GPUs.]

    GL_NV_COPY_IMAGE - Copy Across Render GPUs

    — Cross-platform

    wglCopyImageSubDataNV and glXCopyImageSubDataNV

    — Can copy subregions, Texture only

  • Multithreading with OpenGL Contexts

    GLsync consumedFence[MAX_BUFFERS];
    GLsync producedFence[MAX_BUFFERS];
    HANDLE consumedFenceValid, producedFenceValid;

    Thread - Source GPU0:

    // Wait until the consumer is done with this texture
    CPUWait(consumedFenceValid);
    glWaitSync(consumedFence[1]);
    // Bind render target
    glFramebufferTexture2D(srcTex[1]);
    // Draw here…
    // Unbind
    glFramebufferTexture2D(0);
    // Copy over to consumer GPU
    wglCopyImageSubDataNV(srcCtx, srcTex[1], … destCtx, destTex[1]);
    // Signal that producer has completed
    producedFence[1] = glFenceSync(…);
    CPUSignal(producedFenceValid);

    Thread - Dest GPU1:

    // Wait for signal to start consuming
    CPUWait(producedFenceValid);
    glWaitSync(producedFence[0]);
    // Bind texture object
    glBindTexture(destTex[0]);
    // Use as needed
    // Signal we are done with this texture
    consumedFence[0] = glFenceSync(…);
    CPUSignal(consumedFenceValid);

    Multi-level CPU and GPU sync primitives.

    [Diagram: srcTex[0..1] on GPU0 copied via glCopyImage to destTex[0..1] on GPU1.]

    Shalini V. S0353 - Programming Multi-GPUs for Scalable Rendering, GTC 2012.

  • Conclusions

    Need for finer-grained programmability for compute+graphics.

    Hardware architecture influences transfer and system throughput.

    Peer-to-peer copies and access simplify multi-GPU programming.

    Scaling graphics:

    — device mapping between CUDA and OpenGL

    GTC_S3072_Part1.pdf

    GTC_S3071_Part2.pdf