optimizing texture transfers - nvidia · results – synchronous vs cpu async pbos synchronous 0...

Applications

Streaming videos/time varying geometry or volumes

— Broadcast, real-time fluid simulations etc

Level of detailing

— Out of core image viewers, terrain engines

— Bricks paged in as needed

Parallel rendering

— Fast communication between multiple GPUs for scaling data/render

Remoting Graphics

— Readback GPU results fast and stream over network

CPU

GPU

PCIe

8GB/s

100GB/s

5-10GB/s

RAM

Graphics Memory

OpenGL Graphics – Streaming Data

Previous approaches

— Synchronous – CPU and GPU idle during transfer

— CPU Asynchronous

GPU and CPU Asynchronous with Copy Engines

— Application layout

— Use cases

— Results

Synchronous Transfers

Straightforward

— Upload texture every frame

— Driver does all copy

Copy, download and draw are

sequential

…

pData

[nBricks]

Main

Memory

[0]

[1]

[2]

Graphics

Memory

texID

Disk

glTexSubImage

time

Upload Upload Upload

CPU

GPU Draw Draw Draw

Frame Draw

Copy Copy Copy

Bus

glTexSubImage

Frame Draw

Other work

CPU Asynchronous Transfers

Non CPU-blocking transfer using Pixel Buffer Objects (PBO)

— Ping-pong PBO’s for optimal throughput

— Data must be in GPU native format

OpenGL Controlled

Memory

Datacur: glTexSubImage

PBO0

PBO1

…

pData

[nBricks]

Main Memory

[0]

[1]

[2]

Graphics Memory

texID

Datanext memcpy

Textures

Disk

PBO0

PBO1

Example – 3D texture +Ping-Pong PBOs

Gluint pbo[2] ; //ping-pong pbo generate and initialize them ahead

unsigned int curPBO = 0;

//bind current pbo for app->pbo transfer

glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo[curPBO]); //bind pbo

GLubyte* ptr = (GLubyte*)glMapBufferRange(GL_PIXEL_UNPACK_BUFFER_ARB, 0, size,

GL_MAP_WRITE_BIT|GL_MAP_INVALIDATE_BUFFER_BIT);

memcpy(ptr,pData[curBrick],xdim*ydim*zdim);

glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER_ARB);

//Copy pixels from pbo to texture object

glBindTexture(GL_TEXTURE_3D,texId);

glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo[1-curPBO]); //bind pbo

glTexSubImage3D(GL_TEXTURE_3D,0,0,0,0,xdim,ydim,zdim,GL_LUMINANCE,GL_UNSIGNED_BYTE,0);

glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB,0);

glBindTexture(GL_TEXTURE_3D,0);

curPBO = 1-curPBO;

//Call drawing code here

CPU Async - Execution Timeline

time

Uploadt0:PBO0 Uploadt2:PBO0 Uploadt1:PBO1

CPU

GPU Drawt0 Drawt2 Drawt1

Frame Draw

Copyt0:PBO0 Copyt1:PBO1 Copyt2:PBO0

Bus

CPU Async

Analysis with GPUView

(http://graphics.stanford.edu/~mdfish

er/GPUView.html)

GLDriver

GPU

GLDriver

CPU

App

Results – Synchronous vs CPU Async

PBOs

Synchronous

0

500

1000

1500

2000

2500

3000

3500

4000

4500

16^3 (4KB) 32^3 (32KB) 64^3 (256KB) 128^3 (2MB) 256^3 (16MB)

PBO vs Synchronous uploads - Quadro 6000

PBO (MB/s) TexSubImage (MB/s)

- Transfers only

- Adding rendering will reduce bandwidth, GPU can’t do both

- Ideally – want to sustain bandwidth with render, need GPU overlap

Bandw

idth

(M

B/s)

Texture Size

Achieving Overlap - Copy Engines

Fermi+ have copy engines

— GeForce, low-end Quadro- 1 CE

— Quadro 4000+ - 2 CEs

Allows copy-to-host + compute +

copy-to-device to overlap

simultaneously

Graphics/OpenGL

— Using PBO’s in multiple threads

— Handle synchronization

GPU Asynchronous Transfers

Downloads/uploads in separate thread

— Using OpenGL PBOs

ARB_SYNC used for context

synchronization

Uploadt0:PBO0 Uploadt2:PBO0 Uploadt1:PBO1

CPU

GPU Drawt0 Drawt2 Drawt1

Frame Draw

Copyt0:PBO0 Copyt1:PBO1 Copyt2:PBO0

Bus

Using PBO

Using CE

Upload Draw

Init

Main App Thread

Shared textures

Readback

Upload–Render : Application Layout

Disk

OpenGL Controlled

Memory

PBO0

PBO1

…

pData

[nBricks]

Main Memory

[0]

[1]

[2]

Graphics Memory srcTex

[numTextures]

Render

Thread

glBindTexture

Upload Thread

Datacur: glTexSubImage

Datanext : memcpy

uploadGLRC

mainGLRC

Multi-threaded Context Creation

Sharing textures between multiple contexts

— Don’t use wglShareLists

— Use WGL/GLX_ARB_CREATE_CONTEXT instead

— Set OpenGL debug on

static const int contextAttribs[] =

{

WGL_CONTEXT_FLAGS_ARB, WGL_CONTEXT_DEBUG_BIT_ARB,

0

};

mainGLRC = wglCreateContextAttribsARB(winDC, 0, contextAttribs);

wglMakeCurrent(winDC, mainGLRC);

glGenTextures(numTextures, srcTex);

//uploadGLRC now shares all its textures with mainGLRC

uploadGLRC = wglCreateContextAttribsARB(winDC, mainGLRC, contextAttribs);

//Create Upload thread

//Do above for readback if using

Synchronization using ARB_SYNC

OpenGL commands are asynchronous

— When glDrawXXX returns, does not mean command is completed

Sync object glSync (ARB_SYNC) is used for multi-threaded

apps that need sync

— Eg rendering a texture waits for upload completion

Fence is inserted in a unsignaled state but when completed

changed to signaled.

//Upload //Render

glTexSubImage(texID,..) glWaitSync(fence);

GLSync fence = glFenceSync(..) glBindTexture(.., texID);

unsignaled

signaled

Upload-Render Sychronizaton

Need additional CPU event to coordinate waiting for GPU

sync!

WaitForSingleObject(startUploadValid)

glWaitSync(startUpload[2])

glBindTexture(srcTex[2])

glTexSubImage(..)

endUpload[2] = glFenceSync(…)

SetEvent(endUploadValid)

srcTex

Upload

WaitForSingleObject(endUploadValid)

glWaitSync(endUpload[0])

glBindTexture(srcTex[0])

//Draw

startUpload[0] = glFenceSync(…)

SetEvent(startUploadValid);

Render

…

[0]

[2]

GLsync startUpload[MAX_BUFFERS], endUpload[MAX_BUFFERS]; //GPU fence sync objects

HANDLE startUploadValid, endUploadValid; //cpu event to coordinate wait for GPU sync

Analysis with GPUView

Upload and Render in

separate threads

— Map to distinct

hardware queues on

GPU

— Executed concurrently

— Will serialize on pre-

Fermi hardware

Adding Readback

OpenGL Controlled

Memory

Images

[nFrames]

[0]

[1]

[2]

Framecur: glGetTexImage

Frameprev : memcpy

glFramebufferTexture

(GL_DRAW_FRAMEBUFFER

_TEXTURE,…)

DRAW

[0]

[1]

[2]

[3]

PBO0

PBO1

Use glGetTexImage, not glReadPixels between threads

mainGLRC

readbackGLRC

Render Thread Readback Thread

Main Memory

Graphics Memory

resultTex

[numTextures]

Render-Readback Synchronizaton

WaitForSingleObject(endReadbackValid)

glWaitSync(endReadback[2])

glFramebufferTexture(resultTex[2])

//Draw

startReadback[3] = glFenceSync(…)

SetEvent(startReadbackValid)

resultTex

Render

WaitForSingleObject(startReadbackValid)

glWaitSync(startReadback[0])

glGetTexImage(resultTex[0])

//Read pixels to png-pong pbo

endReadback[0] = glFenceSync(…)

SetEvent(endReadbackValid);

Readback

…

[0]

[2]

GLsync startReadback[MAX_BUFFERS],endReadback[MAX_BUFFERS]; //GPU fence sync objects

HANDLE startReadbackValid, endReadbackValid; //cpu event to coordinate wait for GPU

sync

GeForce vs Quadro Readbacks

Readbacks on GeForce are 3x slower than Quadro

0

1000

2000

3000

4000

5000

256K 1MB 8MB 32MB

PCI-

e b

andw

idth

(M

B/s)

Texture Size

Render-Download Bandwidth for Quadro vs GeForce

GeForce GTX 570 Quadro 6000

Upload-Render-Readback pipeline

// Wait for signal to start upload

CPUWait(startUploadValid);

glWaitSync(startUpload[2]);

// Bind texture object

BindTexture(capTex[2]);

// Upload

glTexSubImage(texID…);

// Signal upload complete

GLSync endUpload[2]= glFenceSync(…);

CPUSignal(endUploadValid);

// Wait for download to complete

CPUWait(endDownloadValid);

glWaitSync(endDownload[3]);

// Wait for upload to complete

CPUWait(endUploadValid);

glWaitSync(endUpload)[0]);

// Bind render target

glFramebufferTexture(playTex[3]);

// Bind video capture source texture

BindTexture(capTex[0]);

// Draw

// Signal next upload

startUpload[0] = glFenceSync(…);

CPUSignal(startUploadValid);

// Signal next download

startDownload[3] = glFenceSync(…);

CPUSignal(startDownloadValid);

// Playout thread

CPUWait(startDownloadValid);

glWaitSync(startDownload[2]);

// Readback

glGetTexImage(playTex[2]);

// Read pixels to PBO

// Signal download complete

endDownload[2] = glFenceSync(…);

CPUSignal(endDownloadValid);

Capture Thread Render Thread Playout Thread

True, S038 – Best Practices in GPU-based Video Processing, GTC 2012 Proceedings

[0]

[1]

[2]

[3]

[0]

[1]

[2]

[3]

GPUView trace showing 3-way overlap

Copy Engines

are idle

Frame time

Readback

Render

Upload

Readback

Render

Upload

Balanced render, upload

and readback times

Render time larger than

upload and readback

Debugging Transfers

Some OGL calls may not overlap between transfer/render

thread

— Eg non-transfer related OGL calls in transfer thread

— Driver generates debug message

“Pixel transfer is synchronized with 3D rendering”

— Application uses ARB_DEBUG_OUTPUT to check the OGL debug log

— OpenGL 4.0 and above

GL_ARB_debug_output -

http://www.opengl.org/registry/specs/ARB/debug_output.txt

Copy Engine Results – Best Case

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

256KB 1MB 8MB 32MB

Scaln

g F

acto

r

Texture Size

Performance Scaling from CPU Asynchronous Transfers

Upload-Render Scaling Render-Download Scalng

4.2 GB/s 3.2GB/s

1.4 GB/s

900 MB/s

Perfect Scaling

No Scaling

Quadro 6000

Conclusion

Presented different transfer methods

Keep the transfer method simple

— Look at your application transfer needs and render times

— Tradeoff in scaling vs application complexity

Future

— Debugging multi-threaded transfers made much easier with

Nsight Visual studio http://developer.nvidia.com/nvidia-nsight-

visual-studio-edition)

References

Venkataraman, Fermi Asynchronous Texture

Transfers, OpenGL Insights, 2012

— Source code (around SIGGRAPH 2012) –

https://github.com/organizations/OpenGLInsights

Related GTC Talks

— S0328, Thomas True, Best Practices in GPU-based

video processng

— S0049, Alina Alt &Tom True, Using the GPU Direct for

Video API

— S0353, S Venkataraman, Programming multi-gpus for

scalable rendering



optimizing texture transfers - nvidia · results – synchronous vs cpu async pbos synchronous 0...

Documents