GPU Data Formatting and Addressing
Aaron Lefohn, University of California, Davis
Overview
• GPU Memory Model
• GPU-Based Data Structures
• Performance Considerations
GPU Memory Model
• GPU Data Storage
  – Vertex data
  – Texture data
  – Frame buffer
[Figure: pipeline diagram with Vertex Data, Vertex Processor, Rasterizer, Fragment Processor, Texture Data, and Frame Buffer(s); one connection is labeled "PS3.0 GPUs"]
GPU Memory Model
• Read-Only
  – Traditional use of GPU memory
  – CPU writes, GPU reads
• Read/Write
  – Save frame buffer(s) for later use as texture or vertex array
  – Save up to 16 32-bit floating-point values per pixel
    • Multiple Render Targets (MRTs)
How to Save Render Result
1. Copy framebuffer result to “other GPU memory”
  – Copy-to-texture
  – Copy-to-vertex-array
2. Write directly to “other GPU memory”
  – Render-to-texture
  – Render-to-vertex-array
OpenGL GPU Memory Writes
• Texture
  1. Copy frame buffer to texture
  2. Render-to-texture
    • WGL_ARB_render_texture
    • GL_EXT_render_target
    • Superbuffers
• Vertex Array
  1. Copy frame buffer to vertex array
    • GL_EXT_pixel_buffer_object
    • Superbuffers
  2. Render-to-vertex-array
    • Superbuffers
Render-To-Texture: 1
• Copy-To-Texture
  – Good
    • Cross-platform texture writes
    • Flexible output
    • 2D output can be copied to a 1D, 2D, or 3D texture
  – Bad
    • Slow
    • Consumes internal GPU memory bandwidth
Render-To-Texture: 2
• WGL_ARB_render_texture
  – Render-to-texture (RTT) using pbuffers
    http://oss.sgi.com/projects/ogl-sample/registry/ARB/wgl_render_texture.txt
  – Good
    • Fast RTT
    • Current state of the art for RTT
  – Bad
    • Only works on Windows
    • Slow OpenGL context switches
    • Many hacks to avoid this bottleneck
Render-To-Texture: 3
• GL_EXT_render_target
  – Proposed extension for cross-platform RTT
    http://www.opengl.org/resources/features/GL_EXT_render_target.txt
  – Good
    • Cross-platform, efficient RTT solution
    • Lightweight, simple extension
  – Bad
    • Specification not approved (as of April 24, 2004)
    • No implementations exist (as of April 24, 2004)
Render-To-Texture: 4
• Superbuffers
  – Proposed new memory model for GPUs
    http://www.ati.com/developer/gdc/SuperBuffers.pdf
  – Good
    • Unified GPU memory model
    • Render to any GPU memory
    • Cross-platform (OpenGL owns memory, not OS)
    • Mix-and-match depth/stencil/color buffers
  – Bad
    • Large, complex extension
    • Specification not approved (as of April 24, 2004)
    • Only driver support is an alpha version (ATI)
Render-To-Texture Summary
• OpenGL RTT Currently Only Under Windows
  – Pbuffers
    • Complex and awkward RTT mechanism
    • Current state of the art
• Cross-Platform RTT Coming Soon…
Render-To-Vertex-Array: 1
• GL_EXT_pixel_buffer_object
  – Copy framebuffer to vertex buffer object
    http://developer.nvidia.com/object/nvidia_opengl_specs.html
  – Good
    • Only GPU/AGP memory bandwidth
    • Works with current drivers (NVIDIA)
  – Bad
    • No direct render-to-vertex-array (slower than true RTVA)
    • No ATI implementation
Render-To-Vertex-Array: 2
• Superbuffers
  – Write to “memory object” as render target
  – Read from “memory object” as vertex array
  – Good
    • Direct render-to-vertex-array (fast)
  – Bad
    • Can render results always be interpreted as vertex data?
    • Large, complex, unapproved extension, …
Render-To-Vertex-Array Summary
• Current OpenGL Support
  – NVIDIA: GL_EXT_pixel_buffer_object
  – ATI: Superbuffers
• Semantics Still Under Development…
Fbuffer: Capturing Fragments
• Idea
  – “Rasterization-Order FIFO Buffer”
  – Render results are fragment values instead of pixel values
  – Mark and Proudfoot, Graphics Hardware 2001
    http://graphics.stanford.edu/projects/shading/pubs/hwws2001-fbuffer/
• Uses
  – Designed for multi-pass rendering with transparent geometry
  – New possibilities for GPGPU?
    • Varying number of results per pixel
    • RTT and RTVA with an fbuffer?
Fbuffer: Capturing Fragments
• Implementations
  – ATI Radeon 9800 and newer ATI GPUs
  – Not yet exposed to user (ask for it!)
• Problems
  – Size of fbuffer is not known before rendering
  – GPUs cannot perform dynamic memory allocation
  – How to handle buffer overflow?
Overview
• GPU Memory Model
• GPU-Based Data Structures
• Performance Considerations
GPU-Based Data Structures
• Building Blocks
  – GPU memory addresses
    • Address Generation
    • Address Use
    • Pointers
  – Multi-dimensional arrays
  – Sparse representations
GPU Memory Addresses
• Where Are Addresses Generated?
  – CPU: vertex stream or textures
  – Vertex processor: input stream, ALU ops, or textures
  – Rasterizer: interpolation
  – Fragment processor: input stream, ALU ops, or textures
[Figure: pipeline diagram with CPU, Vertex Processor, Rasterizer, and Fragment Processor]
GPU Memory Addresses
• Where Are Addresses Used?
  – Vertex textures (PS3.0 GPUs)
  – Fragment textures
[Figure: pipeline diagram with CPU, Vertex Processor, Rasterizer, Fragment Processor, and Texture Data]
GPU Memory Addresses
• Pointers
  – Store addresses in texture
  – Dependent texture read
  – Example: See Tim Purcell’s ray tracing talk
    float2 addr = tex2D( addrTex, texCoord ).xy;  // first read: fetch the stored address
    float2 data = tex2D( dataTex, addr ).xy;      // dependent read: fetch data at that address
[Figure: Address Texture with texels 0 to 3 holding the values 3, 3, 1, 1; each value points to a texel of the Data Texture, whose texels 0 to 3 hold data]
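The dependent read can be mimicked on the CPU. The sketch below is not from the original slides: it assumes textures modelled as plain 1D arrays with integer addresses (on the GPU the addresses would be texture coordinates), and the names `dependent_read` and `gather` are hypothetical.

```c
#include <stddef.h>

/* CPU-side sketch of a dependent texture read (a "pointer" dereference).
 * Assumption: each texture is a plain 1D array indexed by an integer. */
float dependent_read(const int *addr_tex, const float *data_tex, size_t i)
{
    int addr = addr_tex[i];   /* first read: fetch the stored address */
    return data_tex[addr];    /* dependent read: fetch the data it points to */
}

/* Gather out[i] = data_tex[addr_tex[i]] for n elements. */
void gather(const int *addr_tex, const float *data_tex, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = dependent_read(addr_tex, data_tex, i);
}
```

With the address values 3, 3, 1, 1 from the figure, the gather reads data texels 3, 3, 1, 1 in that order.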
GPU-Based Data Structures
• Building Blocks
  – GPU memory addresses
    • Address Generation
    • Address Use
    • Pointers
  – Multi-dimensional arrays
  – Sparse representations
Multi-Dimensional Arrays
• Build Data Structures in 2D Memory
  – Read/write GPU memory is optimized for 2D images
• But Isn’t Physical Memory 1D?
  – GPU memory hierarchy is optimized to capture 2D locality
    • Rasterization
    • Texture filtering
    • Igehy, Eldridge, Proudfoot, “Prefetching in a Texture Cache Architecture,” Graphics Hardware, 1998
• Conclusion: Use illusion of 2D physical memory
GPU Arrays
• Large 1D Arrays
  – Current GPUs limit 1D array sizes to 2048 or 4096
  – Pack into 2D memory
  – 1D-to-2D address translation
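A minimal sketch of 1D-to-2D address translation, assuming a row-major layout and integer texel coordinates (a fragment program would work with float coordinates instead). The type `Addr2D` and the function names are hypothetical, chosen for illustration.

```c
#include <assert.h>

/* A 2D texel address (integer texel coordinates, not normalized). */
typedef struct { int x; int y; } Addr2D;

/* Map a 1D index into a 2D buffer that is `width` texels wide (row-major). */
Addr2D addr_1d_to_2d(int index, int width)
{
    assert(index >= 0 && width > 0);
    Addr2D a = { index % width, index / width };
    return a;
}

/* Inverse mapping: recover the 1D index from a 2D texel address. */
int addr_2d_to_1d(Addr2D a, int width)
{
    assert(a.x >= 0 && a.x < width && a.y >= 0);
    return a.y * width + a.x;
}

/* Number of rows needed to hold `count` elements in rows of `width`. */
int rows_needed(int count, int width)
{
    assert(count >= 0 && width > 0);
    return (count + width - 1) / width;
}
```

For example, with a width of 2048, element 5000 of the 1D array lands at texel (904, 2).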
GPU Arrays
• 3D Arrays
  – Problem
    • GPUs do not have 3D frame buffers
    • No RTT to slice of 3D texture (except Superbuffers)
  – Solutions
    1. Stack of 2D slices
    2. Multiple slices per 2D buffer
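Solution 2 can be sketched as a tiled layout in which each z-slice occupies one tile of a single 2D buffer. This is one possible layout, not necessarily the one used in the course; `FlatVolume`, `tiles_x`, and the function names are hypothetical.

```c
#include <assert.h>

typedef struct { int x; int y; } Addr2D;

/* Layout of a 3D array stored as tiles in one 2D buffer.
 * nx, ny, nz: volume dimensions; tiles_x: number of slices per row of tiles. */
typedef struct { int nx, ny, nz, tiles_x; } FlatVolume;

/* Translate a 3D address (x, y, z) into a texel of the flattened 2D buffer. */
Addr2D flat3d_address(const FlatVolume *v, int x, int y, int z)
{
    assert(x >= 0 && x < v->nx && y >= 0 && y < v->ny && z >= 0 && z < v->nz);
    Addr2D a;
    a.x = (z % v->tiles_x) * v->nx + x;  /* tile column, then offset in slice */
    a.y = (z / v->tiles_x) * v->ny + y;  /* tile row, then offset in slice */
    return a;
}

/* Width of the 2D buffer in texels. */
int flat3d_width(const FlatVolume *v)
{
    return v->tiles_x * v->nx;
}

/* Height of the 2D buffer in texels (rounds up to whole rows of tiles). */
int flat3d_height(const FlatVolume *v)
{
    int tiles_y = (v->nz + v->tiles_x - 1) / v->tiles_x;
    return tiles_y * v->ny;
}
```

A 64×64×64 volume with 8 tiles per row fits in a 512×512 buffer, and voxel (3, 5, 10) maps to texel (131, 69).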
GPU Arrays
• Problems With 3D Arrays for GPGPU
  – Cannot read stack of 2D slices as 3D texture
  – Must know which slices are needed in advance
  – Visualization of 3D data difficult
• Solutions
  – Need render-to-slice-of-3D-texture (Superbuffers)
  – Volume rendering of slice-based 3D data
    • Course 28, “Real-Time Volume Graphics”, Siggraph 2004
GPU Arrays
• Higher Dimensional Arrays
  – Pack into 2D buffers
  – N-D to 2D address translation
  – Same problems as 3D arrays if data does not fit in a single 2D texture
• Conclusions
  – Fundamental GPU memory primitive is a fixed-size 2D array
  – GPGPU needs more general memory model
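One way to sketch N-D to 2D address translation is to linearize the N-D index first and then wrap the linear index into rows of a fixed width. The function names and the choice that `index[0]` varies fastest are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int x; int y; } Addr2D;

/* Linearize an n-dimensional index; index[0] varies fastest. */
long nd_to_1d(const int *index, const int *dims, size_t n)
{
    long linear = 0;
    long stride = 1;
    for (size_t d = 0; d < n; ++d) {
        assert(index[d] >= 0 && index[d] < dims[d]);
        linear += index[d] * stride;
        stride *= dims[d];
    }
    return linear;
}

/* Wrap the linear index into a 2D buffer that is `width` texels wide. */
Addr2D nd_to_2d(const int *index, const int *dims, size_t n, int width)
{
    assert(width > 0);
    long linear = nd_to_1d(index, dims, n);
    Addr2D a = { (int)(linear % width), (int)(linear / width) };
    return a;
}
```

This linearization is simple but keeps neighbors close together only along the fastest-varying dimension; a tiled layout trades simpler arithmetic for better 2D locality.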
GPU-Based Data Structures
• Building Blocks
  – GPU memory addresses
    • Address Generation
    • Address Use
    • Pointers
  – Multi-dimensional arrays
  – Sparse representations
Sparse Data Structures
• Why Sparse Data Structures?
  – Reduce computational workload
  – Reduce memory pressure
• Examples
  – Sparse matrices
    • Krueger et al., Siggraph 2003
    • Bolz et al., Siggraph 2003
  – Implicit surface computations (sparse volumes)
    • Sherbondy et al., IEEE Visualization 2003
    • Lefohn et al., IEEE Visualization 2003
Premoze et al., Eurographics 2003
Sparse Computation
• Option 1: Store Complete Data Set on GPU
  – Cull unused data
  – Conditional execution tricks (discussed earlier)
• Option 2: Store Only Sparse Data on GPU
  – Saves memory
  – Potentially much faster than culling
  – Much more complicated (especially if time-varying)
Sparse Data Structures
• Basic Idea
  – Pack “active” data elements into GPU memory
  – For more information
    • Linear algebra section in this course: static structures
    • Level-set case study in this course: dynamic structures
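The packing idea can be sketched on the CPU as a compaction pass that also records where each element went. This is an illustrative sketch only, not the method of the linear algebra or level-set sections; `pack_active` and its parameters are hypothetical names, and a 1D domain is assumed for brevity.

```c
#include <stddef.h>

/* Pack the "active" elements of a dense array into a compact array.
 *   dense:   n input values
 *   active:  n flags (nonzero means active)
 *   packed:  output with capacity n; receives the active values in order
 *   address: output with n entries; packed address of each element,
 *            or -1 if the element is inactive
 * Returns the number of packed elements. */
size_t pack_active(const float *dense, const int *active, size_t n,
                   float *packed, int *address)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (active[i]) {
            address[i] = (int)count;
            packed[count++] = dense[i];
        } else {
            address[i] = -1;
        }
    }
    return count;
}
```

The `address` table plays the role of the pointer structure: it is what later lets a packed element find its neighbors.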
Sparse Data Structures
• Addressing Sparse Data
  – Neighborhoods no longer implicitly defined on grid
  – Use pointer-based data structures to locate neighbors
    • Pre-compute neighbor addresses if possible
      – Use CPU or vertex processor
      – Removes pointer dereference from fragment program
  – Separate common addressing case from boundary conditions
    • Common case must be cache coherent
    • See Harris and Lefohn case studies for “substream” technique
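Pre-computing neighbor addresses on the CPU might look like the sketch below. It assumes a 1D domain, a table `address` that maps each dense element to its packed address (or -1 if inactive), and a dedicated `boundary` element that stands in for any missing neighbor. All names are hypothetical, and this is not the “substream” technique itself.

```c
#include <stddef.h>

/* Packed addresses of the left and right neighbors of one packed element. */
typedef struct { int left; int right; } Neighbors;

/* Build the neighbor table for every active element.
 *   address:  n entries; packed address of dense element i, or -1
 *   boundary: packed address substituted for a missing neighbor
 *   out:      one entry per packed element, indexed by packed address */
void precompute_neighbors(const int *address, size_t n, int boundary,
                          Neighbors *out)
{
    for (size_t i = 0; i < n; ++i) {
        int self = address[i];
        if (self < 0)
            continue;  /* inactive elements are not stored */
        int l = (i > 0)     ? address[i - 1] : -1;
        int r = (i + 1 < n) ? address[i + 1] : -1;
        out[self].left  = (l >= 0) ? l : boundary;
        out[self].right = (r >= 0) ? r : boundary;
    }
}
```

Because the table is built ahead of time, the per-element kernel only reads addresses and never has to test whether a neighbor exists.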
Overview
• GPU Memory Model
• GPU-Based Data Structures
• Performance Considerations
Memory Performance Issues
• Pbuffer Survival Guide
• Dependent Texture Costs
• Computational Frequency
Pbuffer Survival Guide
• Pbuffers Give Us Render-To-Texture
  – Designed to create an environment map or two
  – Never intended to be used for GPGPU (100s of pbuffers)
  – Problem
    • Each pbuffer has its own OpenGL render context
    • Each pbuffer may have depth and/or stencil buffer
    • Changing OpenGL contexts is slow
  – Solution
    • Many optimizations to avoid this bottleneck…
Pbuffer Survival Guide
1. Pack Scalar Data Into RGBA
  – > 4x memory savings
  – 4x reduction in context switches
  – Be careful of read-modify-write hazard
[Figure: scalar data in 4 RGBA pbuffers packed into 1 RGBA pbuffer]
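The packing step can be sketched as interleaving four scalar fields into the four channels of one RGBA buffer. This is a CPU-side illustration under that assumption; `Texel`, `pack_rgba`, and `read_channel` are hypothetical names.

```c
#include <stddef.h>

/* One RGBA texel holding four independent scalar values. */
typedef struct { float r, g, b, a; } Texel;

/* Interleave four scalar fields of n elements into one RGBA buffer. */
void pack_rgba(const float *s0, const float *s1, const float *s2,
               const float *s3, Texel *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[i].r = s0[i];
        out[i].g = s1[i];
        out[i].b = s2[i];
        out[i].a = s3[i];
    }
}

/* Read scalar field `channel` (0 to 3) at texel i. */
float read_channel(const Texel *buf, size_t i, int channel)
{
    switch (channel) {
    case 0:  return buf[i].r;
    case 1:  return buf[i].g;
    case 2:  return buf[i].b;
    default: return buf[i].a;
    }
}
```

Since a render pass writes whole texels, updating one channel means rewriting all four, which is where the read-modify-write hazard comes from.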
Pbuffer Survival Guide
2. Use Multi-Surface Pbuffers
  – Each RGBA surface is its own render-texture
    • Front, Back, AuxN (N = 0, 1, 2, …)
  – Greatly reduces context switches
  – Technically illegal, but “blessed” by ATI. Works on NVIDIA.
[Figure: 5 pbuffers with 1 RGBA surface each, compared with 1 pbuffer with 5 RGBA surfaces]
Pbuffer Survival Guide
2. Using Multi-Surface Pbuffers
  a) Allocate a double-buffered pbuffer (and/or one with AUX buffers)
  b) Set render target to back buffer
       glDrawBuffer(GL_BACK)
  c) Bind front buffer as texture
       wglBindTexImageARB(hpbuffer, WGL_FRONT_LEFT_ARB)
  d) Render
  e) Switch buffers
       wglReleaseTexImageARB(hpbuffer, WGL_FRONT_LEFT_ARB)
       glDrawBuffer(GL_FRONT)
       wglBindTexImageARB(hpbuffer, WGL_BACK_LEFT_ARB)
Pbuffer Survival Guide
3. Pack 2D domains into large buffer
  – “Flat 3D textures”
  – Be careful of read-modify-write hazard
[Figure: a 3D volume and its flattened 2D layout]
Dependent Texture Costs
• Cache Coherency
  – Dependent reads are fast if they hit the cache
    • Even chained dependencies can be the same speed as non-dependent reads
  – Very slow if out of cache
    • Example: 3 levels of dependent cache misses can be >10x slower
  – More detail in “GPU Computation Strategies and Tricks”
Computational Frequency
• Compute Memory Addresses at Low Frequency
  – Compute memory addresses in vertex program
    • Let rasterizer interpolation create per-fragment addresses
    • Compute neighbor addresses this way
  – Avoid fragment-level address computation whenever possible
    • Consumes fragment instructions
    • Computation often redundant with neighboring fragments
    • May defeat texture pre-fetch
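Why vertex-level address computation is enough for constant neighbor offsets can be shown with a small CPU sketch. It assumes the rasterizer interpolates linearly along one span and ignores perspective correction; the function names are hypothetical.

```c
/* Linear interpolation, standing in for the rasterizer along one span. */
float interp(float a, float b, float t)
{
    return a + (b - a) * t;
}

/* Low frequency: add the neighbor offset once per vertex (span endpoint),
 * then let interpolation produce the per-fragment neighbor address. */
float neighbor_addr_per_vertex(float addr0, float addr1, float offset, float t)
{
    float n0 = addr0 + offset;  /* computed in the vertex program */
    float n1 = addr1 + offset;  /* computed in the vertex program */
    return interp(n0, n1, t);   /* rasterizer interpolation */
}

/* High frequency: interpolate the base address, then add the offset
 * again for every fragment. */
float neighbor_addr_per_fragment(float addr0, float addr1, float offset, float t)
{
    return interp(addr0, addr1, t) + offset;  /* extra fragment instruction */
}
```

Both functions return the same address up to floating-point rounding, because a constant offset commutes with linear interpolation; the per-vertex version simply moves the addition out of the fragment program.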
Conclusions
• GPU Memory Model Evolving
  – Writable GPU memory forms loop-back in an otherwise feed-forward streaming pipeline
  – Memory model will continue to evolve as GPUs become more general stream processors
• GPGPU Data Structures
  – Basic memory primitive is limited-size, 2D texture
  – Use address translation to fit all array dimensions into 2D
  – Maintain 2D cache locality
• Render-To-Texture
  – Use pbuffers with care and eagerly adopt their successor
Selected References
• J. Bolz, I. Farmer, E. Grinspun, P. Schröder, “Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid,” SIGGRAPH 2003
• N. Goodnight, C. Woolley, G. Lewin, D. Luebke, G. Humphreys, “A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware,” Graphics Hardware 2003
• M. Harris, W. Baxter, T. Scheuermann, A. Lastra, “Simulation of Cloud Dynamics on Graphics Hardware,” Graphics Hardware 2003
• H. Igehy, M. Eldridge, K. Proudfoot, “Prefetching in a Texture Cache Architecture,” Graphics Hardware 1998
• J. Krueger, R. Westermann, “Linear Algebra Operators for GPU Implementation of Numerical Algorithms,” SIGGRAPH 2003
• A. Lefohn, J. Kniss, C. Hansen, R. Whitaker, “A Streaming Narrow-Band Algorithm: Interactive Deformation and Visualization of Level Sets,” IEEE Transactions on Visualization and Computer Graphics 2004
Selected References
• A. Lefohn, J. Kniss, C. Hansen, R. Whitaker, “Interactive Deformation and Visualization of Level Set Surfaces Using Graphics Hardware,” IEEE Visualization 2003
• W. Mark, K. Proudfoot, “The F-Buffer: A Rasterization-Order FIFO Buffer for Multi-Pass Rendering,” Graphics Hardware 2001
• T. Purcell, C. Donner, M. Cammarano, H. W. Jensen, P. Hanrahan, “Photon Mapping on Programmable Graphics Hardware,” Graphics Hardware 2003
• A. Sherbondy, M. Houston, S. Napel, “Fast Volume Segmentation With Simultaneous Visualization Using Programmable Graphics Hardware,” IEEE Visualization 2003
OpenGL References
• GL_EXT_pixel_buffer_object
  http://www.nvidia.com/dev_content/nvopenglspecs/GL_EXT_pixel_buffer_object.txt
• GL_EXT_render_target
  http://www.opengl.org/resources/features/GL_EXT_render_target.txt
• OpenGL Extension Registry
  http://oss.sgi.com/projects/ogl-sample/registry/
• Superbuffers
  http://www.ati.com/developer/gdc/SuperBuffers.pdf
• WGL_ARB_render_texture
  http://oss.sgi.com/projects/ogl-sample/registry/ARB/wgl_render_texture.txt
  http://oss.sgi.com/projects/ogl-sample/registry/ARB/wgl_pbuffer.txt
Questions?
• Acknowledgements
  – Cass Everitt, Craig Kolb, Chris Seitz, and Jeff Juliano at NVIDIA
  – Mark Segal, Rob Mace, and Evan Hart at ATI
  – GPGPU Siggraph 2004 course presenters
  – Joe Kniss and Ross Whitaker
  – Brian Budge
  – John Owens
  – National Science Foundation Graduate Fellowship
  – Pixar Animation Studios