optimizing texture transfers - nvidia · results – synchronous vs cpu async pbos synchronous 0...
TRANSCRIPT
Outline
Definitions
— Upload : Host (CPU) -> Device (GPU)
— Readback: Device (GPU) -> Host (CPU)
Focus on OpenGL graphics
— Implementing various transfer methods
— Multi-threading and Synchronization
— Debugging transfers
— Best Practices & Results
Applications
Streaming videos/time varying geometry or volumes
— Broadcast, real-time fluid simulations etc
Level of detailing
— Out of core image viewers, terrain engines
— Bricks paged in as needed
Parallel rendering
— Fast communication between multiple GPUs for scaling data/render
Remoting Graphics
— Readback GPU results fast and stream over network
CPU
GPU
PCIe
8GB/s
100GB/s
5-10GB/s
RAM
Graphics Memory
OpenGL Graphics – Streaming Data
Previous approaches
— Synchronous – CPU and GPU idle during transfer
— CPU Asynchronous
GPU and CPU Asynchronous with Copy Engines
— Application layout
— Use cases
— Results
Synchronous Transfers
Straightforward
— Upload texture every frame
— Driver does all copy
Copy, download and draw are
sequential
…
pData
[nBricks]
Main
Memory
[0]
[1]
[2]
Graphics
Memory
texID
Disk
glTexSubImage
time
Upload Upload Upload
CPU
GPU Draw Draw Draw
Frame Draw
Copy Copy Copy
Bus
glTexSubImage
Frame Draw
Other work
CPU Asynchronous Transfers
Non CPU-blocking transfer using Pixel Buffer Objects (PBO)
— Ping-pong PBO’s for optimal throughput
— Data must be in GPU native format
OpenGL Controlled
Memory
Datacur: glTexSubImage
PBO0
PBO1
…
pData
[nBricks]
Main Memory
[0]
[1]
[2]
Graphics Memory
texID
Datanext memcpy
Textures
Disk
PBO0
PBO1
Example – 3D texture +Ping-Pong PBOs
Gluint pbo[2] ; //ping-pong pbo generate and initialize them ahead
unsigned int curPBO = 0;
//bind current pbo for app->pbo transfer
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo[curPBO]); //bind pbo
GLubyte* ptr = (GLubyte*)glMapBufferRange(GL_PIXEL_UNPACK_BUFFER_ARB, 0, size,
GL_MAP_WRITE_BIT|GL_MAP_INVALIDATE_BUFFER_BIT);
memcpy(ptr,pData[curBrick],xdim*ydim*zdim);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER_ARB);
//Copy pixels from pbo to texture object
glBindTexture(GL_TEXTURE_3D,texId);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo[1-curPBO]); //bind pbo
glTexSubImage3D(GL_TEXTURE_3D,0,0,0,0,xdim,ydim,zdim,GL_LUMINANCE,GL_UNSIGNED_BYTE,0);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB,0);
glBindTexture(GL_TEXTURE_3D,0);
curPBO = 1-curPBO;
//Call drawing code here
CPU Async - Execution Timeline
time
Uploadt0:PBO0 Uploadt2:PBO0 Uploadt1:PBO1
CPU
GPU Drawt0 Drawt2 Drawt1
Frame Draw
Copyt0:PBO0 Copyt1:PBO1 Copyt2:PBO0
Bus
CPU Async
Analysis with GPUView
(http://graphics.stanford.edu/~mdfish
er/GPUView.html)
GLDriver
GPU
GLDriver
CPU
App
Results – Synchronous vs CPU Async
PBOs
Synchronous
0
500
1000
1500
2000
2500
3000
3500
4000
4500
16^3 (4KB) 32^3 (32KB) 64^3 (256KB) 128^3 (2MB) 256^3 (16MB)
PBO vs Synchronous uploads - Quadro 6000
PBO (MB/s) TexSubImage (MB/s)
- Transfers only
- Adding rendering will reduce bandwidth, GPU can’t do both
- Ideally – want to sustain bandwidth with render, need GPU overlap
Bandw
idth
(M
B/s)
Texture Size
Achieving Overlap - Copy Engines
Fermi+ have copy engines
— GeForce, low-end Quadro- 1 CE
— Quadro 4000+ - 2 CEs
Allows copy-to-host + compute +
copy-to-device to overlap
simultaneously
Graphics/OpenGL
— Using PBO’s in multiple threads
— Handle synchronization
GPU Asynchronous Transfers
Downloads/uploads in separate thread
— Using OpenGL PBOs
ARB_SYNC used for context
synchronization
Uploadt0:PBO0 Uploadt2:PBO0 Uploadt1:PBO1
CPU
GPU Drawt0 Drawt2 Drawt1
Frame Draw
Copyt0:PBO0 Copyt1:PBO1 Copyt2:PBO0
Bus
Using PBO
Using CE
Upload Draw
Init
Main App Thread
Shared textures
Readback
Upload–Render : Application Layout
Disk
OpenGL Controlled
Memory
PBO0
PBO1
…
pData
[nBricks]
Main Memory
[0]
[1]
[2]
Graphics Memory srcTex
[numTextures]
Render
Thread
glBindTexture
Upload Thread
Datacur: glTexSubImage
Datanext : memcpy
uploadGLRC
mainGLRC
Multi-threaded Context Creation
Sharing textures between multiple contexts
— Don’t use wglShareLists
— Use WGL/GLX_ARB_CREATE_CONTEXT instead
— Set OpenGL debug on
static const int contextAttribs[] =
{
WGL_CONTEXT_FLAGS_ARB, WGL_CONTEXT_DEBUG_BIT_ARB,
0
};
mainGLRC = wglCreateContextAttribsARB(winDC, 0, contextAttribs);
wglMakeCurrent(winDC, mainGLRC);
glGenTextures(numTextures, srcTex);
//uploadGLRC now shares all its textures with mainGLRC
uploadGLRC = wglCreateContextAttribsARB(winDC, mainGLRC, contextAttribs);
//Create Upload thread
//Do above for readback if using
Synchronization using ARB_SYNC
OpenGL commands are asynchronous
— When glDrawXXX returns, does not mean command is completed
Sync object glSync (ARB_SYNC) is used for multi-threaded
apps that need sync
— Eg rendering a texture waits for upload completion
Fence is inserted in a unsignaled state but when completed
changed to signaled.
//Upload //Render
glTexSubImage(texID,..) glWaitSync(fence);
GLSync fence = glFenceSync(..) glBindTexture(.., texID);
unsignaled
signaled
Upload-Render Sychronizaton
Need additional CPU event to coordinate waiting for GPU
sync!
WaitForSingleObject(startUploadValid)
glWaitSync(startUpload[2])
glBindTexture(srcTex[2])
glTexSubImage(..)
endUpload[2] = glFenceSync(…)
SetEvent(endUploadValid)
srcTex
Upload
WaitForSingleObject(endUploadValid)
glWaitSync(endUpload[0])
glBindTexture(srcTex[0])
//Draw
startUpload[0] = glFenceSync(…)
SetEvent(startUploadValid);
Render
…
[0]
[2]
GLsync startUpload[MAX_BUFFERS], endUpload[MAX_BUFFERS]; //GPU fence sync objects
HANDLE startUploadValid, endUploadValid; //cpu event to coordinate wait for GPU sync
Analysis with GPUView
Upload and Render in
separate threads
— Map to distinct
hardware queues on
GPU
— Executed concurrently
— Will serialize on pre-
Fermi hardware
Adding Readback
OpenGL Controlled
Memory
Images
[nFrames]
[0]
[1]
[2]
Framecur: glGetTexImage
Frameprev : memcpy
glFramebufferTexture
(GL_DRAW_FRAMEBUFFER
_TEXTURE,…)
DRAW
[0]
[1]
[2]
[3]
PBO0
PBO1
Use glGetTexImage, not glReadPixels between threads
mainGLRC
readbackGLRC
Render Thread Readback Thread
Main Memory
Graphics Memory
resultTex
[numTextures]
Render-Readback Synchronizaton
WaitForSingleObject(endReadbackValid)
glWaitSync(endReadback[2])
glFramebufferTexture(resultTex[2])
//Draw
startReadback[3] = glFenceSync(…)
SetEvent(startReadbackValid)
resultTex
Render
WaitForSingleObject(startReadbackValid)
glWaitSync(startReadback[0])
glGetTexImage(resultTex[0])
//Read pixels to png-pong pbo
endReadback[0] = glFenceSync(…)
SetEvent(endReadbackValid);
Readback
…
[0]
[2]
GLsync startReadback[MAX_BUFFERS],endReadback[MAX_BUFFERS]; //GPU fence sync objects
HANDLE startReadbackValid, endReadbackValid; //cpu event to coordinate wait for GPU
sync
GeForce vs Quadro Readbacks
Readbacks on GeForce are 3x slower than Quadro
0
1000
2000
3000
4000
5000
256K 1MB 8MB 32MB
PCI-
e b
andw
idth
(M
B/s)
Texture Size
Render-Download Bandwidth for Quadro vs GeForce
GeForce GTX 570 Quadro 6000
Upload-Render-Readback pipeline
// Wait for signal to start upload
CPUWait(startUploadValid);
glWaitSync(startUpload[2]);
// Bind texture object
BindTexture(capTex[2]);
// Upload
glTexSubImage(texID…);
// Signal upload complete
GLSync endUpload[2]= glFenceSync(…);
CPUSignal(endUploadValid);
// Wait for download to complete
CPUWait(endDownloadValid);
glWaitSync(endDownload[3]);
// Wait for upload to complete
CPUWait(endUploadValid);
glWaitSync(endUpload)[0]);
// Bind render target
glFramebufferTexture(playTex[3]);
// Bind video capture source texture
BindTexture(capTex[0]);
// Draw
// Signal next upload
startUpload[0] = glFenceSync(…);
CPUSignal(startUploadValid);
// Signal next download
startDownload[3] = glFenceSync(…);
CPUSignal(startDownloadValid);
// Playout thread
CPUWait(startDownloadValid);
glWaitSync(startDownload[2]);
// Readback
glGetTexImage(playTex[2]);
// Read pixels to PBO
// Signal download complete
endDownload[2] = glFenceSync(…);
CPUSignal(endDownloadValid);
Capture Thread Render Thread Playout Thread
True, S038 – Best Practices in GPU-based Video Processing, GTC 2012 Proceedings
[0]
[1]
[2]
[3]
[0]
[1]
[2]
[3]
GPUView trace showing 3-way overlap
Copy Engines
are idle
Frame time
Readback
Render
Upload
Readback
Render
Upload
Balanced render, upload
and readback times
Render time larger than
upload and readback
Debugging Transfers
Some OGL calls may not overlap between transfer/render
thread
— Eg non-transfer related OGL calls in transfer thread
— Driver generates debug message
“Pixel transfer is synchronized with 3D rendering”
— Application uses ARB_DEBUG_OUTPUT to check the OGL debug log
— OpenGL 4.0 and above
GL_ARB_debug_output -
http://www.opengl.org/registry/specs/ARB/debug_output.txt
Copy Engine Results – Best Case
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
256KB 1MB 8MB 32MB
Scaln
g F
acto
r
Texture Size
Performance Scaling from CPU Asynchronous Transfers
Upload-Render Scaling Render-Download Scalng
4.2 GB/s 3.2GB/s
1.4 GB/s
900 MB/s
Perfect Scaling
No Scaling
Quadro 6000
Conclusion
Presented different transfer methods
Keep the transfer method simple
— Look at your application transfer needs and render times
— Tradeoff in scaling vs application complexity
Future
— Debugging multi-threaded transfers made much easier with
Nsight Visual studio http://developer.nvidia.com/nvidia-nsight-
visual-studio-edition)
References
Venkataraman, Fermi Asynchronous Texture
Transfers, OpenGL Insights, 2012
— Source code (around SIGGRAPH 2012) –
https://github.com/organizations/OpenGLInsights
Related GTC Talks
— S0328, Thomas True, Best Practices in GPU-based
video processng
— S0049, Alina Alt &Tom True, Using the GPU Direct for
Video API
— S0353, S Venkataraman, Programming multi-gpus for
scalable rendering