Mixing Graphics & Compute with Multi-GPU | GTC 2013

  • Mixing Graphics & Compute with multi-GPU

    Wil Braithwaite - NVIDIA Applied Engineering

    http://www.gputechconf.com/page/home.html

  • Talk Outline

    Compute and Graphics API Interoperability.

    Interoperability Methodologies.

    Interoperability at a system level.

    Application design considerations.

    2

  • Compute & Visualize the same data

    [Diagram: one application both computes and visualizes the same data.]

    3

  • Compute/Graphics interoperability

    Setup the objects in the graphics context.

    Register the objects with the compute context.

    Map/unmap the objects from the compute context.

    [Diagram: the application bridges the CUDA and OpenGL/DX contexts through the interop API; a CUDA array maps to a texture object, a CUDA buffer to a buffer object.]

    4

  • Code Sample – Simple image interop

    Setup and Registration of Texture Objects:

    GLuint texId;

    cudaGraphicsResource_t texRes;

    // OpenGL buffer creation...

    glGenTextures(1, &texId);

    glBindTexture(GL_TEXTURE_2D, texId);

    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8UI_EXT, texWidth, texHeight, 0,

    GL_RGBA_INTEGER_EXT, GL_UNSIGNED_BYTE, 0);

    glBindTexture(GL_TEXTURE_2D, 0);

    // Registration with CUDA.

    cudaGraphicsGLRegisterImage(&texRes, texId, GL_TEXTURE_2D,

    cudaGraphicsRegisterFlagsNone);

    5

  • Code Sample – Simple image interop

    Mapping between contexts:

    cudaArray* texArray;

    while (!done)

    {

    cudaGraphicsMapResources(1, &texRes);

    cudaGraphicsSubResourceGetMappedArray(&texArray, texRes, 0, 0);

    runCUDA(texArray);

    cudaGraphicsUnmapResources(1, &texRes);

    runGL(texId);

    }

    6

  • Code Sample – Simple buffer interop

    Setup and Registration of Buffer Objects:

    GLuint vboId;

    cudaGraphicsResource_t vboRes;

    // OpenGL buffer creation...

    glGenBuffers(1, &vboId);

    glBindBuffer(GL_ARRAY_BUFFER, vboId);

    glBufferData(GL_ARRAY_BUFFER, vboSize, 0, GL_DYNAMIC_DRAW);

    glBindBuffer(GL_ARRAY_BUFFER, 0);

    // Registration with CUDA.

    cudaGraphicsGLRegisterBuffer(&vboRes, vboId, cudaGraphicsRegisterFlagsNone);

    7

  • Code Sample – Simple buffer interop

    Mapping between contexts:

    float* vboPtr;

    while (!done)

    {

    cudaGraphicsMapResources(1, &vboRes, 0);

    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);

    runCUDA(vboPtr);

    cudaGraphicsUnmapResources(1, &vboRes, 0);

    runGL(vboId);

    }

    8

  • Resource Behavior: Single-GPU

    The resource is shared.

    Context switch is fast and independent of data size.

    [Diagram: on a single GPU, the CUDA and GL contexts share one copy of the data through the interop API.]

    9

  • float* vboPtr;

    while (!done)

    {

    cudaGraphicsMapResources(1, &vboRes, 0);

    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);

    runCUDA(vboPtr);

    cudaGraphicsUnmapResources(1, &vboRes, 0);

    runGL(vboId);

    }

    Code Sample – Simple buffer interop

    Mapping between contexts:

    Context-switching happens

    when these commands are processed.

    10

  • Timeline: Single-GPU

    Driver-interop.

    [Timeline figure: frames 1-4 on one GPU; each frame runs map, the runCUDA kernel, unmap, then the runGL render, with CUDA and GL alternating on the same device.]

    11

  • float* vboPtr;

    while (!done)

    {

    cudaGraphicsMapResources(1, &vboRes, 0);

    cudaStreamSynchronize(0);

    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);

    runCUDA(vboPtr);

    cudaGraphicsUnmapResources(1, &vboRes, 0);

    cudaStreamSynchronize(0);

    runGL(vboId);

    }

    Code Sample – Simple buffer interop

    Adding synchronization for analysis:

    12

  • Timeline: Single-GPU

    Driver-interop, synchronous* — (we synchronize after the map and unmap calls).

    [Timeline figure: frames 1-4; the synchronize after “map” waits for GL to finish before the context switch, and the synchronize after “unmap” waits for CUDA (and GL) to finish.]

    13

  • Resource Behavior: Multi-GPU

    Each GPU has a copy of the resource.

    Context switch is dependent on data size, because the driver must copy the data.

    [Diagram, slides 16-18: the CUDA context on the compute-GPU and the GL context on the render-GPU each hold a copy of the data; MAP copies the data toward the compute-GPU, UNMAP copies it back toward the render-GPU.]

    16-18

  • Timeline: Multi-GPU

    Driver-interop, synchronous* — SLOWER! (Tasks are serialized.)

    [Timeline figure: frames 1-3; resources are mirrored and synchronized across the GPUs with GLtoCU and CUtoGL copies around each kernel, and “map” has to wait for GL to complete before it synchronizes the resource.]

    20

  • Interoperability Methodologies

    READ-ONLY

    — GL produces... and CUDA consumes.

    e.g. Post-process the GL render in CUDA.

    WRITE-DISCARD

    — CUDA produces... and GL consumes.

    e.g. CUDA simulates fluid, and GL renders result.

    READ & WRITE

    — Useful if you want to use the rasterization pipeline.

    e.g. Feedback loop:

    — runGL(texture) → framebuffer

    — runCUDA(framebuffer) → texture

    21
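The three patterns above correspond to interop map-flag hints. A minimal host-side sketch of that mapping (the enum names and values here only mirror CUDA's cudaGraphicsMapFlags for illustration; this is not the CUDA API itself):

```cpp
// Illustrative stand-ins for the cudaGraphicsMapFlags values.
enum MapFlag { FlagNone = 0, FlagReadOnly = 1, FlagWriteDiscard = 2 };

// Pick the hint that matches how CUDA will touch the mapped resource.
MapFlag flagFor(bool cudaReads, bool cudaWrites) {
    if (cudaReads && !cudaWrites) return FlagReadOnly;     // GL produces, CUDA consumes
    if (!cudaReads && cudaWrites) return FlagWriteDiscard; // CUDA produces, GL consumes
    return FlagNone;                                       // read & write: no hint possible
}
```

The hint matters because WRITE-DISCARD lets the driver skip synchronizing the old contents to the compute GPU on map, as the following slides show.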

  • float* vboPtr;

    cudaGraphicsResourceSetMapFlags(vboRes, cudaGraphicsMapFlagsWriteDiscard);

    while (!done)

    {

    cudaGraphicsMapResources(1, &vboRes, 0);

    cudaStreamSynchronize(0);

    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);

    runCUDA(vboPtr);

    cudaGraphicsUnmapResources(1, &vboRes, 0);

    cudaStreamSynchronize(0);

    runGL(vboId);

    }

    CUDA produces... and OpenGL consumes:

    Code Sample – WRITE-DISCARD

    Hint that we do not care about the previous contents of the buffer.

    22

  • Timeline: Single-GPU

    Driver-interop, synchronous*, WRITE-DISCARD.

    [Timeline figure: frames 1-4 on one GPU; the synchronize after “map” waits for GL to finish before the context switch, the synchronize after “unmap” waits for CUDA (and GL) to finish, and the context switch forces serialization.]

    23

  • Timeline: Multi-GPU

    Driver-interop, synchronous*, WRITE-DISCARD.

    [Timeline figure: frames 1-4; when multi-GPU, “map” does nothing for WRITE-DISCARD, compute and render can overlap as they are on different GPUs, and the synchronize after “unmap” waits for the CUtoGL copy and GL to finish.]

    24

  • Timeline: Multi-GPU

    Driver-interop, synchronous*, WRITE-DISCARD — if render is long...

    [Timeline figure: frames 1-4; when the GL render runs long, “unmap” will wait for GL, delaying the CUtoGL copy and the next kernel.]

    25

  • Driver-Interop: System View

    [System diagrams: the single-GPU and multi-GPU driver-interop data paths.]

    26

  • Manual-Interop: System View

    Multi-GPU

    [System diagram: data staged from the compute-GPU through host memory to the render-GPU.]

    27

  • cudaMalloc((void**)&d_data, vboSize);

    cudaHostAlloc((void**)&h_data, vboSize, cudaHostAllocPortable);

    while (!done) {

    // Compute data in temp buffer, and copy to host...

    runCUDA(d_data);

    cudaMemcpyAsync(h_data, d_data, vboSize, cudaMemcpyDeviceToHost, 0);

    cudaStreamSynchronize(0);

    // Map the render-GPU’s resource and upload the host buffer...

    cudaSetDevice(renderGPU);

    cudaGraphicsMapResources(1, &vboRes, 0);

    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);

    cudaMemcpy(vboPtr, h_data, size, cudaMemcpyHostToDevice);

    cudaGraphicsUnmapResources(1, &vboRes, 0);

    cudaSetDevice(computeGPU);

    runGL(vboId);

    }

    Code Sample – Manual-Interop

    Create a temporary buffer

    in pinned host-memory.

    28

  • Timeline: Multi-GPU

    Manual-interop, synchronous*, WRITE-DISCARD.

    [Timeline figure: frames 1-4; each frame runs the kernel, a synchronized CUtoH download, a synchronized HtoGL upload, then the GL render, serializing compute, transfers, and render.]

    29

  • cudaMalloc((void**)&d_data, vboSize);

    cudaHostAlloc((void**)&h_data, vboSize, cudaHostAllocPortable);

    while (!done) {

    // Compute data in temp buffer, and copy to host...

    runCUDA(d_data);

    cudaMemcpyAsync(h_data, d_data, vboSize, cudaMemcpyDeviceToHost, 0);

    cudaStreamSynchronize(0);

    // Map the render-GPU’s resource and upload the host buffer...

    // (all commands must be asynchronous.)

    cudaSetDevice(renderGPU);

    cudaGraphicsMapResources(1, &vboRes, 0);

    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);

    cudaMemcpyAsync(vboPtr, h_data, size, cudaMemcpyHostToDevice, 0);

    cudaGraphicsUnmapResources(1, &vboRes, 0);

    cudaSetDevice(computeGPU);

    runGL(vboId);

    }

    Code Sample - Manual Interop (Async)

    Use asynchronous copy in default stream.

    30

  • Timeline: Multi-GPU

    Manual-interop, asynchronous, WRITE-DISCARD.

    [Timeline figure: frames 1-4; with asynchronous copies, the HtoGL upload overlaps the next frame’s kernel and CUtoH download.]

    31

  • Timeline: Multi-GPU

    Manual-interop, asynchronous, WRITE-DISCARD — if render is long...

    [Timeline figure: frames 1-4; with a long render, the next CUtoH download overlaps the still-pending HtoGL upload of the same host buffer. We are downloading while uploading! Drifting out of sync!]

    32

  • Timeline: Multi-GPU (fixed Async)

    Manual-interop, asynchronous, WRITE-DISCARD — if render is long...

    [Timeline figure: frames 1-4, fixed async; the synchronization must also wait for the HtoGL upload to finish before the host buffer is reused.]

    33

  • while (!done) {

    // Compute the data in a temp buffer, and copy to a host buffer...

    runCUDA(d_data);

    cudaStreamWaitEvent(0, uploadFinished, 0);

    cudaMemcpyAsync(h_data, d_data, vboSize, cudaMemcpyDeviceToHost, 0);

    cudaStreamSynchronize(0);

    // Map the render-GPU’s resource and upload the host buffer...

    // (all commands must be asynchronous.)

    cudaSetDevice(renderGPU);

    cudaGraphicsMapResources(1, &vboRes, 0);

    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);

    cudaMemcpyAsync(vboPtr, h_data, size, cudaMemcpyHostToDevice, 0);

    cudaGraphicsUnmapResources(1, &vboRes, 0);

    cudaEventRecord(uploadFinished, 0);

    cudaSetDevice(computeGPU);

    runGL(vboId);

    }

    Code Sample - Manual Interop (fixed Async)

    34

  • Timeline: Multi-GPU

    Manual-interop, asynchronous, WRITE-DISCARD, with flipping.

    [Timeline figure: frames 1-5 with two host buffers; kernel[A] and CUtoH[A] overlap the H[B]toGL uploads and GL renders, the A and B buffers alternating each frame.]

    35

  • int read = 1, write = 0;

    while (!done) {

    // Compute the data in a temp buffer, and copy to a host buffer...

    cudaStreamWaitEvent(custream[write], kernelFinished[read]);

    runCUDA(d_data, custream[write]);

    cudaEventRecord(kernelFinished[write], custream[write]);

    cudaStreamWaitEvent(custream[write], uploadFinished[read]);

    cudaMemcpyAsync(h_data[write], d_data, vboSize, cudaMemcpyDeviceToHost, custream[write]);

    cudaEventRecord(downloadFinished[write], custream[write]);

    // Map the renderGPU’s resource and upload the host buffer...

    cudaSetDevice(renderGPU);

    cudaGraphicsMapResources(1, &vboRes, glstream);

    cudaGraphicsResourceGetMappedPointer((void**)&vboPtr, &size, vboRes);

    cudaStreamWaitEvent(glstream, downloadFinished[read]);

    cudaMemcpyAsync(vboPtr, h_data[read], size, cudaMemcpyHostToDevice, glstream);

    cudaGraphicsUnmapResources(1, &vboRes, glstream);

    cudaEventRecord(uploadFinished[read], glstream);

    cudaStreamSynchronize(glstream); // Sync for easier analysis!

    cudaSetDevice(computeGPU);

    runGL(vboId);

    swap(&read, &write);

    }

    Code Sample - Manual Interop (streams)

    36
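The read/write index dance in the loop above can be isolated as a tiny helper (a sketch only; the `PingPong` type is ours, not from the talk):

```cpp
#include <utility>

// Two-slot ping-pong: this frame's kernel fills slot `write` while the
// upload consumes slot `read` from the previous frame, then the roles flip.
struct PingPong {
    int read = 1, write = 0;
    void flip() { std::swap(read, write); }
};
```

Keeping the indices in one place makes it harder to wait on the wrong slot's event, which is the main hazard in the stream/event code above.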

  • Timeline: Multi-GPU

    Manual-interop, asynchronous, WRITE-DISCARD, + ping-pong — if render is long... then we must flip the resource too... etc. etc.

    [Timeline figure: as on slide 35, with the double-buffered host staging keeping the compute-GPU, the transfers, and the render-GPU busy simultaneously.]

    37

  • Benchmarks & Demo

    — runCUDA (20ms)

    — runGL (10ms)

    — copy (10ms)

    Single-GPU

    — Driver-interop = 30ms

    Multi-GPU

    — Driver-interop = 36ms

    — Async Manual-interop = 32ms

    — Flipped Manual-interop = 22ms

    Too large a data size makes multi-GPU interop worse. Overlapping the download helps us break even. But using streams and flipping is a significant win!

    38
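A crude latency model helps explain these numbers (the stage times are the ones assumed above; the model itself is ours, not from the talk): serialized stages add up, while a fully pipelined loop is bound by its slowest stage.

```cpp
#include <algorithm>
#include <initializer_list>

// Frame time when every stage runs back-to-back (serialized driver-interop).
int serializedMs(std::initializer_list<int> stages) {
    int sum = 0;
    for (int s : stages) sum += s;
    return sum;
}

// Steady-state frame time when all stages overlap (streams + flipping):
// bound by the slowest stage only.
int pipelinedMs(std::initializer_list<int> stages) {
    return *std::max_element(stages.begin(), stages.end());
}
```

With kernel = 20 ms, copy = 10 ms, render = 10 ms, the serialized bound is 40 ms and the pipelined bound is 20 ms, bracketing the measured 36 ms (driver-interop) and 22 ms (flipped manual-interop).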

  • Scaling further

    Ping-pong the renderGPU side too.

    — Current example is bound by (upload+render)

    Kernel might not be dependent on previous kernel.

    — e.g. Could you run two kernels simultaneously?

    If your CUDA kernels are much more expensive than your GL render then this could be a win.

    Remember to use streams and events, AND consider all the dependencies.

    39

  • Interoperability behavior: Multi-GPU

    Similar considerations are applicable when OpenGL is the producer and CUDA is the consumer.

    Use cudaGraphicsMapFlagsReadOnly.

    40

  • Application Design Considerations

    Avoid synchronized GPUs for CUDA.

    — Watch out for Windows’s WDDM implicit synchronization on unmap!

    Provision for multi-GPU environments:

    — Let the user choose the GPUs.

    — Use cudaD3D[9|10|11]GetDevices()/cudaGLGetDevices() to match CUDA and graphics device enumerations.

    CUDA-OpenGL interoperability can perform slower if OpenGL context spans multiple GPUs.

    Context switch performance varies with system config and OS.

    41

  • Conclusions & Resources

    The driver can do all the heavy-lifting but...

    Scalability and final performance are up to the developer.

    — For fine-grained control and optimization, you might want to move the data manually.

    CUDA samples/documentation: — http://developer.nvidia.com/cuda-downloads

    OpenGL Insights, Patrick Cozzi, Christophe Riccio, 2012. ISBN 1439893764.

    www.openglinsights.com

    42


  • Thank you.

    Questions will be taken at the end of the full session.

    43

  • Scaling Graphics and Compute on Multi-GPUs

    Shalini Venkataraman - PSG Applied Engineering

  • Talk Outline

    Recap on default behavior

    — We need finer grained control of managing scaling

    Enumerating Graphics & Compute Resources

    — Supported hardware

    — NUMA considerations

    Different methods for scaling and communication

    — Multiple CUDA GPUs + 1 Graphics GPU

    — Multiple graphics GPUs

    Focus on Single node system, CUDA and OpenGL

    Nsight for profiling

  • Recap – Beyond API Interoperability

    API Interop hides all the complexity.

    BUT, sometimes:

    — Don’t want to transfer all data.

    Kernel may compute subregions, but the entire buffer object/texture is copied.

    — May have a complex system with multiple compute GPUs and/or render GPUs.

    — Application-specific pipelining and multi-buffering.

    — May have some CPU code in your algorithm between compute and graphics.

  • Scaling Beyond Single Compute+Graphics

    Scaling compute

    — Divide tasks across multiple devices.

    — When data does not fit into a single GPU’s memory, distribute the data.

    Scaling graphics

    — Multi-displays, stereo.

    — Cooler rendering, e.g. raytracing, complex lighting models.

    Higher compute density

    — Amortize host or server costs, e.g. CPU, memory, RAID shared with multiple GPUs.

  • Multi-GPU Compute+Graphics Use Cases

    Image processing

    — Multiple compute GPUs and a low-end display GPU for blitting.

    Mixing rasterization with compute

    — Polygonal rendering done in OpenGL and input to compute for further processing.

    Visualization for HPC simulation

    — Numerical simulation distributed across multiple compute GPUs, possibly on a remote supercomputer.

    [Images: NVIDIA Index, seismic interpretation; Morpheus Medical, real-time CFD viz; GTC 2013 Exhibit Hall.]

  • Mixing Tesla and Quadro GPUs

    — Tight integration with OEMs and System Integrators

    — Optimized driver paths for GPU-GPU communication

    NVIDIA Maximus Initiative

    [Diagram: a traditional workstation visualizes locally while simulating on a cluster; an NVIDIA® MAXIMUS™ workstation runs Simulate0-3 alongside Visualize0-2 on GPUs in the same box.]

  • Supported Hardware

    Boxx Systems:

    — 3D Boxx 8950: Quadro + 4 Tesla K20s

    — 3D Boxx 4920/4920 Xtreme: Quadro + 3 Tesla K20s

    — 3D Boxx 8920: Quadro + 2 Tesla K20s

    Supermicro Maximus Systems:

    — SYS-7047GR-TRF: Quadro + 4 Tesla K20s

    — SYS-7047A/7047A-73/7037A-i: Quadro + 3 Tesla K20s

  • NUMA/Topology Considerations

    Memory access is non-uniform!

    — Local GPU access is faster than remote (extra QPI hop).

    — Affects PCIe transfer throughput.

    NUMA APIs

    — Thread affinity considerations.

    Pitfalls of setting process affinity

    — Does not work with graphics APIs.

    [Diagram: a dual-socket system with CPU0/IOH0 and CPU1/IOH1 linked by QPI, each with local memory and a GPU; access to the local GPU runs at ~6 GB/s versus ~4 GB/s over the extra QPI hop. Sandy Bridge integrates the IOH.]

    Dale Southard. Designing and Managing GPU Clusters. Supercomputing 2011 http://nvidia.fullviewmedia.com/fb/nv-sc11/tabscontent/archive/401-thu-southard.html

  • Unified Virtual Addressing

    Easier to program with a single address space.

    [Diagram: without UVA, the CPU and each GPU have separate memory spaces; with UVA, system memory and the GPU0/GPU1 memories occupy a single virtual address range across PCI-e.]

  • Peer-to-Peer (P2P) Communication

  • NUMA/Topology matters for P2P

    [Diagram: P2P communication is supported between GPUs on the same IOH (x16 PCI-e links). With dual IOHs, QPI is incompatible with the PCI-e P2P specification, so GPU0/GPU1 on IOH0 cannot do P2P with GPU2/GPU3 on IOH1. Sandy Bridge sockets integrate the IOH.]

  • P2P disabled over QPI – Copying staged via host (P2H2P)

    Configuration for Compute & Graphics

    [Diagram: four Tesla GPUs sit behind two PCI-e switches on IOH0, and two Quadro GPUs on IOH1. Best P2P performance is between GPUs on the same PCI-e switch (~6.5 GB/s, e.g. a K10 dual-GPU card); P2P between GPUs on the same IOH runs at ~5 GB/s. P2P communication is Linux only: no WDDM.]

  • Mapping Algorithms to Hardware Topology

    Won-Ki Jeong. S3308 - Fast Compressive Sensing MRI Reconstruction on a Multi-GPU System, GTC 2013.

    Paulius M. S0515 - Implementing 3D Finite Difference Code on GPUs, GTC 2009.

    [Diagram: mapping work onto GPU0/GPU1 for data reduction and multi-phase problems.]

  • Programming - Enumerating Resources

    CUDA enumeration: Tesla is enumerated above Quadro.

    ./deviceQuery -noprompt | egrep "^Device"

    Device 0: "Tesla K20c"

    Device 1: "Quadro K2000"

    Device 2: "Tesla K20c"

    cudaSetDevice() sets current GPU

    — All CUDA calls are issued to the current GPU, except P2P memcopies.

    — Current GPU can be changed while async calls are executed

    Single thread can drive multiple GPUs

    cudaGLGetDevices() gets the GPU(s) for current GL context

    — We will touch multiple OpenGL devices later

  • Communication across GPUs - via Host

    while (!done) {

    // Compute the data in a temp buffer, and copy to a host buffer...

    runCUDA(d_data);

    cudaStreamWaitEvent(0, uploadFinished, 0);

    cudaMemcpyAsync(h_data, d_data, vboSize, cudaMemcpyDeviceToHost, 0);

    cudaEventRecord(downloadFinished, 0);

    // Map the render-GPU’s resource and upload the host buffer...

    // (all commands must be asynchronous.)

    cudaSetDevice(renderGPU);

    doMap((void**)&d_mappedPtr, 0);

    cudaStreamWaitEvent(0, downloadFinished, 0);

    cudaMemcpyAsync(d_mappedPtr, h_data, size, cudaMemcpyHostToDevice, 0);

    doUnmap(0);

    cudaEventRecord(uploadFinished, 0);

    cudaSetDevice(computeGPU);

    doRender(…);

    }

  • Synchronization across GPUs

    Streams and events are per device

    — Determined by the GPU that is current at creation

    Stream GPU must be set current for

    — Launching kernel to a stream

    — Recording events to a stream

    Agnostic to the current GPU

    — Memcpy can be launched on any stream

    — Synchronize/Query of Events
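The device-affinity rules above can be summarized as a toy lookup (our own illustrative helper, not a CUDA API):

```cpp
#include <string>

// Per the rules above: kernel launches and event records require the
// stream's own GPU to be current; memcpys and event query/synchronize
// can be issued regardless of which GPU is current.
bool needsStreamDeviceCurrent(const std::string& op) {
    return op == "kernel_launch" || op == "event_record";
}
```

In practice this means a single host thread can drive all GPUs, inserting cudaSetDevice only before launches and event records.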

  • Revisit – P2P Communication

  • Peer-to-Peer (P2P) Initialization

    cudaDeviceCanAccessPeer(&isAccessible, srcGPU, dstGPU)

    — Returns in 1st arg if srcGPU can access memory of dstGPU

    — Need to do this bidirectionally

    cudaDeviceEnablePeerAccess(peerDevice, 0)

    — Enables current GPU to access peerDevice

    — Note that this is asymmetric!

    cudaDeviceDisablePeerAccess

    — P2P can be limited to a specific phase in order to reduce overhead and free resources.
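The bidirectional check and the per-direction enable can be sketched generically; the callables below stand in for cudaDeviceCanAccessPeer / cudaDeviceEnablePeerAccess so the sketch runs without a GPU:

```cpp
#include <functional>

// Enable peer access in both directions between GPUs a and b.
// `canAccess(src, dst)` and `enable(src, dst)` abstract the CUDA calls.
bool enablePeerPair(int a, int b,
                    const std::function<bool(int, int)>& canAccess,
                    const std::function<void(int, int)>& enable) {
    if (!canAccess(a, b) || !canAccess(b, a)) return false; // check both ways
    enable(a, b); // access is asymmetric: each direction is enabled separately
    enable(b, a);
    return true;
}
```

In real CUDA code, each `enable(src, dst)` corresponds to cudaSetDevice(src) followed by cudaDeviceEnablePeerAccess(dst, 0).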

  • Peer-to-Peer Copy

    cudaMemcpyAsync

    cudaMemcpyPeerAsync

    — Called in a separate stream

    — Falls back to staging copies via host for unsupported configurations.

    NVIDIA CUDA Webinars – Multi-GPU Programming

    http://developer.download.nvidia.com/CUDA/training/cuda_webinars_multi_gpu.pdf

  • P2P Requirements

    Works on

    — 64bit app

    — Fermi and above

    — Cuda 4.0 +

    — Linux and Windows TCC

    Will not work

    — Dual-IOH configs

    — Mixing WDDM/TCC

    — Across different chips even in same generation

    eg K5000 (GK104) and K20 (GK110)
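The checklist above boils down to a single predicate; a sketch with our own field names (not a CUDA API):

```cpp
// Conditions from the slide; all must hold for direct P2P.
struct P2PConfig {
    bool app64bit;       // 64-bit application
    bool fermiOrLater;   // both GPUs Fermi and above
    bool cuda40OrLater;  // CUDA 4.0+
    bool linuxOrTcc;     // Linux, or Windows TCC on both (no WDDM mix)
    bool sameIoh;        // not split across dual-IOH configs
    bool sameChip;       // same chip, e.g. not GK104 with GK110
};

bool p2pPossible(const P2PConfig& c) {
    return c.app64bit && c.fermiOrLater && c.cuda40OrLater &&
           c.linuxOrTcc && c.sameIoh && c.sameChip;
}
```

When the predicate fails, cudaMemcpyPeerAsync still works but stages the copy via host, as noted above.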

  • P2P communication

    Event creation:

    cudaSetDevice(0);
    cudaEventCreate(&finishedKernelEvent);
    cudaSetDevice(1);
    cudaEventCreate(&finishedCopyEvent);
    cudaEventRecord(finishedCopyEvent, 0); // Trigger compute

    while (!done) {
      // Set CUDA device to COMPUTE DEVICE 0
      cudaSetDevice(0);
      cudaStreamWaitEvent(0, finishedCopyEvent, 0);
      doKernel(d_buffer, 0);
      cudaEventRecord(finishedKernelEvent, 0);
      // ON GL RENDER DEVICE 1
      cudaSetDevice(1); // GL Device
      doMap((void**)&d_mappedPtr, 0);
      cudaStreamWaitEvent(0, finishedKernelEvent, 0);
      cudaMemcpyPeerAsync(d_mappedPtr, 1, d_buffer, 0, size, cuda2glStream);
      cudaEventRecord(finishedCopyEvent, 0);
      doUnmap(0);
      doRender();
    }

    [Diagram: GPUs attached to the same IOH communicate P2P over PCIe; IOH0 and IOH1 are linked by QPI.]

  • P2P : Staging via Host

    Example: mixed WDDM/TCC configuration.

    [Timeline figure: frames 1-3; CUDA-Dev0 waits for the event, computes, downloads, and records the event; the driver automatically pipelines the download and upload memcpys through the host; CUDA-Dev1/GL-Dev1 run map, upload, unmap, render. The compute engine is idle during transfers.]

  • Profiling with Nsight - Timeline

    [Nsight timeline: Kernel0 with its D2H0 download and H2D0 upload, overlapped by the driver with Kernel1.]

    S. Domine. S3377 - Seamless Compute and OpenGL Graphics Development in NVIDIA Nsight 3.0 Visual Studio Edition and Beyond. GTC 2013.

  • Overlapping Kernel With Transfer

    void* d_buffer[2]; // Ping-pong the buffer the kernel writes to, so it can overlap with the transfer.

    while (!done) {
      // Set CUDA device to COMPUTE DEVICE 0
      cudaSetDevice(0);
      cudaStreamWaitEvent(stream, finishedCopyEvent[cur], 0);
      doKernel(d_buffer[cur], stream);
      cudaEventRecord(finishedKernelEvent[cur], stream);
      // ON GL RENDER DEVICE 1
      cudaSetDevice(1); // GL Device
      doMap((void**)&d_mappedPtr, cuda2glStream);
      cudaStreamWaitEvent(0, finishedKernelEvent[prev], 0);
      cudaMemcpyPeerAsync(d_mappedPtr, 1, d_buffer[prev], 0, size, cuda2glStream);
      cudaEventRecord(finishedCopyEvent[prev], cuda2glStream);
      doUnmap(cuda2glStream);
      doRender();
      prev = cur;
      cur = 1 - cur;
    }

  • Timeline

    [Nsight timeline: the kernel, download, and upload are overlapped; Kernel0, D2H0, H2D0, and Kernel1 run concurrently.]

  • Peer-to-Peer Memory Access

    Sometimes we don’t want to explicitly copy, but instead have access to the entire space on all GPUs and the CPU.

    Already possible with linear memory; the recent texture addition is useful for graphics.

    Example: a large dataset that can’t fit into one GPU’s memory.

    — Distribute the domain/data across GPUs.

    — Each GPU now needs to access the other GPUs’ data for halo/boundary exchange.

  • Sharing texture across GPUs

    Peer access is unidirectional.

    cudaSetDevice(0);
    cudaMalloc3DArray(d0_volume, …);
    cudaSetDevice(1);
    cudaMalloc3DArray(d1_volume, …);

    while (!done) {
      // Set CUDA device to COMPUTE DEVICE 0
      cudaSetDevice(0);
      cudaBindTextureToArray(tex0, d0_volume);
      cudaBindTextureToArray(tex1, d1_volume);
      Kernel<<<…>>>(…);
    }

    __global__ void Kernel(…) {
      float voxel0 = tex3D(tex0, u, v, w); // accesses gpu0
      float voxel1 = tex3D(tex1, u, v, w); // accesses gpu1 through p2p
    }

    The kernel runs on device 0 and accesses the texture from device 1.

  • Multiple Compute : Single Render GPU

    [Diagram: CUDA contexts on compute GPUs 1 and 2 feed the render device, GPU 0, which hosts the OpenGL context and an auxiliary CUDA context for map/unmap. Transfer options: explicit copies via host (P2H2P on Windows WDDM/TCC), cudaMemcpyPeer, or P2P memory access.]

  • Multi-GPU Image Processing - SagivTech

    Live demo at the Exhibit Hall, Booth 712: “A middleware for Real Time Multi GPU”, GTC 2013.

    [Diagram: multiple compute GPUs feeding a renderer.]

  • Scaling Graphics - Multiple Quadro GPUs

    Virtualized SLI case

    — Multi-GPU mosaic drives a large wall

    Access each GPU separately

    — Parallel rendering

    — Multi view-frustum eg Stereo, CAVEs

    [Images: a Hurricane Sandy simulation showing multiple computed tracks for different input params (image credits: NOAA); a parallel-rendering pipeline with data distribution + render across GPU-0..GPU-3 followed by sort + alpha composite, shown on the Visible Human 14 GB texture data.]

  • CUDA With SLI Mosaic

    CUDA enumeration: [0] = K20, [1] = K5000, [2] = K5000.

    OpenGL enumeration with SLI: [0] = K5000; the OpenGL context spans 2 CUDA devices.

    cudaGLGetDevices maps the GL context to its CUDA devices:

    int glDeviceIndices[MAX_GPU];
    unsigned int glDeviceCount;
    cudaGLGetDevices(&glDeviceCount, glDeviceIndices, MAX_GPU, cudaGLDeviceListAll);
    // glDeviceCount == 2

    Use driver interop with the current cuda device.

    — The driver handles the optimized data transfer (work in progress).

  • Specifying Render GPU Explicitly

    Linux

    — Specify separate X screens using XOpenDisplay

    Windows

    — Using NV_GPU_AFFINITY extension

    Display* dpy = XOpenDisplay(“:0.”+gpu)

    GLXContext = glxCreateContextAttribs(dpy,…);

    BOOL wglEnumGpusNV(UINT iGpuIndex, HGPUNV *phGPU)

    For #GPUs enumerated {

    GpuMask[0]=hGPU[0];

    GpuMask[1]=NULL;

    //Get affinity DC based on GPU

    HDC affinityDC = wglCreateAffinityDCNV(GpuMask);

    setPixelFormat(affinityDC);

    HGLRC affinityGLRC = wglCreateContext(affinityDC);

    }

  • Mapping Cuda Affinity GL Devices

    How to map OpenGL device to CUDA

    cudaWGLGetDevice(int *cudadevice,HGPUNV oglGPUHandle)

    Don’t expect the same order

    — GL order is Windows-specific, while CUDA order is device-specific.

    [Diagram: CUDA enumeration gives [0] = K20, [1] = K5000, [2] = K5000; GL affinity enumeration gives [0] = K5000, [1] = K5000 (with SLI, a single [0] = K5000). cudaGLGetDevices and cudaWGLGetDevice map GL contexts and affinity handles to CUDA devices.]

    S3070 - Part 1 - Configuring, Programming and Debugging Applications for Compute and Graphics on Multi-GPUs.

  • Compute & Multiple Render

    [Diagram: the CUDA context on compute GPU 0 feeds OpenGL contexts on render GPUs 1 and 2 via an auxiliary CUDA context with map/unmap, cudaMemcpyPeer between devices, and wglCopyImageSubData between the render GPUs.]

    GL_NV_COPY_IMAGE - Copy Across Render GPUs

    — Cross-platform

    wglCopyImageSubDataNV and glXCopyImageSubDataNV

    — Can copy subregions, Texture only

  • Multithreading with OpenGL Contexts

    GLsync consumedFence[MAX_BUFFERS];
    GLsync producedFence[MAX_BUFFERS];
    HANDLE consumedFenceValid, producedFenceValid;

    Thread - Source GPU0:

    // Wait until the consumer is done with this texture
    CPUWait(consumedFenceValid);
    glWaitSync(consumedFence[1]);
    // Bind render target
    glFramebufferTexture2D(srcTex[1]);
    // Draw here…
    // Unbind
    glFramebufferTexture2D(0);
    // Copy over to consumer GPU
    wglCopyImageSubDataNV(srcCtx, srcTex[1], … destCtx, destTex[1]);
    // Signal that producer has completed
    producedFence[1] = glFenceSync(…);
    CPUSignal(producedFenceValid);

    Thread - Dest GPU1:

    // Wait for signal to start consuming
    CPUWait(producedFenceValid);
    glWaitSync(producedFence[0]);
    // Bind texture object
    glBindTexture(destTex[0]);
    // Use as needed
    // Signal we are done with this texture
    consumedFence[0] = glFenceSync(…);
    CPUSignal(consumedFenceValid);

    Multi-level CPU and GPU sync primitives.

    [Diagram: srcTex[0..1] on GPU0 copied via glCopyImage to destTex[0..1] on GPU1.]

    Shalini V. S0353 - Programming Multi-GPUs for Scalable Rendering, GTC 2012.

  • Conclusions

    Need for finer-grained programmability for compute+graphics.

    Hardware architecture influences transfer and system throughput.

    Peer-to-peer copies and access simplify multi-GPU programming.

    Scaling graphics:

    — device mapping between CUDA and OpenGL

    GTC_S3072_Part1.pdf

    GTC_S3071_Part2.pdf