low-level graphics apis

Johan Andersson, Technical Director

Frostbite

LOW-LEVEL GRAPHICS
INTEL VCARB 2014

Email: [email protected]
http://frostbite.com
Twitter: @repi

Frostbite has 2 very different rendering use cases:

1. Rendering the world with huge amounts of objects
   - Tons of draw calls, with lots of different states & pipelines
   - Heavily CPU limited with high-level APIs
   - Read-only view of the world and resources (except for render targets)

2. Setting up rendering and doing lighting, post-fx, virtual texturing, compute, etc.
   - Tons of different types of complex operations, not a lot of draw calls
   - ~50 different rendering passes in Frostbite
   - Managing resource state, memory and running on different queues (graphics, compute, DMA)

Both are very important, and the low-level discussion & design targets apply to both!

RENDERING USE CASES

I consider low-level APIs to have 3 important design targets:

1. Solving the many draw calls CPU performance problem
   - CPU overhead
   - Parallel command buffer building
   - Binding model & pipeline objects
   - Explicit resource state handling

2. Enabling new GPU programmability
   - Bindless
   - CPU/GPU collaboration
   - GPU work creation

3. Improving GPU performance & memory usage
   - Utilize multiple hardware engines in parallel
   - Explicit memory management / virtual memory
   - Explicit resource state handling

I believe one needs a clean-slate API design, a new paradigm model, to target all of these
  - Hence Mantle & DX12: too much legacy is in the way in DX11, GL4 and GLES

LOW-LEVEL API TARGETS

1-2 orders of magnitude less CPU overhead for lots of draw calls
  - Thanks to explicit resource barriers, explicit memory management & few bind points
  - Problem for us is not "how to do 1 million similar draw calls with same state"
  - Problem is "how to do 10-100k draw calls with different state"
  - Should in parallel also try to reduce the amount of state (such as with bindless), but wary of GPU overhead

Stable & consistent performance
  - Major benefit for users, will only be more important going forward
  - Explicit submission to GPU, not at random commands
  - No runtime late binding or compilation of shaders
  - Can be a challenge for engines to know all pipeline state up front, but worth designing for!

Improved GPU performance
  - Engine has more high-level knowledge of how resources are used
  - Seen examples of advanced GPU driver optimizations that are easier to implement thanks to the low-level model
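The "no runtime late binding or compilation" point can be illustrated with a small pipeline cache. This is a minimal sketch, not Frostbite's or any real API's code: `PipelineDesc`, `PipelineHandle` and `PipelineCache` are hypothetical names, and the description is reduced to three integers. All pipelines are created up front, and the draw-time path only looks them up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>

// Hypothetical, heavily reduced pipeline description (real ones hold far more state).
struct PipelineDesc {
    std::uint32_t shaderId;
    std::uint32_t blendMode;
    std::uint32_t depthMode;
    bool operator==(const PipelineDesc& o) const {
        return shaderId == o.shaderId && blendMode == o.blendMode && depthMode == o.depthMode;
    }
};

// Simple hash combining the three fields.
struct PipelineDescHash {
    std::size_t operator()(const PipelineDesc& d) const {
        std::size_t h = std::hash<std::uint32_t>{}(d.shaderId);
        h = h * 31u + std::hash<std::uint32_t>{}(d.blendMode);
        h = h * 31u + std::hash<std::uint32_t>{}(d.depthMode);
        return h;
    }
};

// Stand-in for a compiled pipeline object handle.
using PipelineHandle = std::uint32_t;

class PipelineCache {
public:
    // Load-time path: "compiles" the pipeline (here it only hands out a handle).
    PipelineHandle precreate(const PipelineDesc& desc) {
        auto it = pipelines_.find(desc);
        if (it != pipelines_.end()) return it->second;
        const PipelineHandle handle = nextHandle_++;
        pipelines_.emplace(desc, handle);
        return handle;
    }

    // Draw-time path: lookup only. A miss is counted, never compiled on the spot.
    bool lookup(const PipelineDesc& desc, PipelineHandle& out) {
        auto it = pipelines_.find(desc);
        if (it == pipelines_.end()) {
            ++misses_;
            return false;
        }
        out = it->second;
        return true;
    }

    std::size_t misses() const { return misses_; }
    std::size_t size() const { return pipelines_.size(); }

private:
    std::unordered_map<PipelineDesc, PipelineHandle, PipelineDescHash> pipelines_;
    PipelineHandle nextHandle_ = 1;
    std::size_t misses_ = 0;
};
```

Counting misses instead of compiling on demand is one way to surface pipeline state the engine failed to declare up front; what to do on a miss (skip the draw, use a fallback) is left open here.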

PERFORMANCE

MANTLE

Key design to significantly lower driver overhead and complexity
  - Explicit hazard tracking
  - Hides architecture-specific caches

Can be a challenge in the apps/engines, but worth it
  - Esp. with multiple queues & out-of-order command buffers
  - Requires very clear specifications and a great validation layer!

In Frostbite we mostly track this per resource, instead of per subresource, for simplicity & performance
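Per-resource tracking can be sketched roughly as follows. This is an illustration under assumptions, not the Frostbite implementation: `ResourceState`, `Barrier` and `StateTracker` are hypothetical names, and the state list is far shorter than what a real API exposes. One state is stored per whole resource, and a barrier is emitted only when the required state differs from the tracked one.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical usage states; real APIs expose many more.
enum class ResourceState : std::uint8_t {
    Undefined,
    RenderTarget,
    ShaderRead,
    DepthWrite,
    CopyDest
};

using ResourceId = std::uint32_t;

// One explicit transition the app must record into a command buffer.
struct Barrier {
    ResourceId resource;
    ResourceState before;
    ResourceState after;
};

// Tracks a single state per whole resource (not per mip level or array slice).
class StateTracker {
public:
    // Declares the next usage of a resource; appends a barrier only on a state change.
    void require(ResourceId id, ResourceState needed, std::vector<Barrier>& out) {
        ResourceState& current = states_[id]; // new entries start as Undefined
        if (current == needed) return;
        out.push_back({id, current, needed});
        current = needed;
    }

private:
    std::unordered_map<ResourceId, ResourceState> states_;
};
```

The trade-off is that sub-resource specific cases, such as mipmap generation, need separate handling outside this simple tracker.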

RESOURCE TRANSITIONS

Example complex cases:
  - Read-only depth testing together with stencil writing (different state for depth & stencil)
  - Mipmap generation (sub-resource specific states)
  - Async compute offloading part of graphics pipeline

Critical to make sure future low-level APIs have the right states/usages exposed
  - Devil is in the details
  - Does the sw/hw require the transition to happen on the same queue that just used a resource?
  - Look at your concrete use cases early!

Would help if future hardware doesn’t need as many different barriers
  - But at what cost?

RESOURCE TRANSITIONS

Aka "descriptor sets" in Mantle

This new model has been working very well for us, even with very basic handling
  - Great to have as separate objects not connected to device context
  - Treated all resource references as dynamic and built every frame
  - ~15k resource entries in a single large table per frame in BF4 (heavy instancing)

Lots of opportunity going forward
  - Split out static resources to own persistent tables
  - Split out shared common resources to own table
  - Connect together with nested resource tables
  - Bindless for more complex cases
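The "single large table, rebuilt every frame" approach can be sketched as a linear allocator over resource entries. This is a hedged illustration only: `ResourceEntry` and `FrameResourceTable` are hypothetical names and do not correspond to Mantle's descriptor set API. Each draw appends its references and keeps the returned offset to bind later.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical resource reference (texture, buffer or sampler handle).
struct ResourceEntry {
    std::uint32_t handle;
};

// One large table of resource entries, rebuilt from scratch every frame.
class FrameResourceTable {
public:
    static constexpr std::size_t kInvalidOffset = static_cast<std::size_t>(-1);

    explicit FrameResourceTable(std::size_t capacity) : capacity_(capacity) {
        entries_.reserve(capacity);
    }

    // Drops last frame's entries but keeps the allocated storage.
    void beginFrame() { entries_.clear(); }

    // Appends the entries for one draw and returns the offset of the first one,
    // or kInvalidOffset if the table is full.
    std::size_t push(const std::vector<ResourceEntry>& refs) {
        if (refs.size() > capacity_ - entries_.size()) return kInvalidOffset;
        const std::size_t offset = entries_.size();
        entries_.insert(entries_.end(), refs.begin(), refs.end());
        return offset;
    }

    std::size_t used() const { return entries_.size(); }

private:
    std::size_t capacity_;
    std::vector<ResourceEntry> entries_;
};
```

Splitting static or shared resources into their own persistent tables, as listed above, would mean keeping some tables alive across `beginFrame()` calls instead of rebuilding everything.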

RESOURCE TABLES

Seen really good wins with both async DMAs and async compute
  - And it is an important target for us going forward

Additional opportunities
  - DMA in and out of embedded memory
  - Buffer/Image/Video compression/decompression
  - More?

What engines / queues does Intel have & would it be able to expose?

MULTIPLE GPU QUEUES

Kick CPU job from GPU
  - Possible now with explicit fences & events that CPU can poll/wait on
  - Enables filling in resources just in time

Want async command buffers
  - Kick CPU job from GPU and CPU builds & unlocks an already queued command buffer
  - We’ve been doing this on consoles: "just" a software limitation
  - Example use case: Sample Distribution Shadow Maps without stalling GPU pipeline

Major opportunity going forward
  - Needs support in OSes & driver models
  - Drive rendering pipeline based on data from the current GPU frame (such as the zbuffer)
  - Decide where to run code based on power efficiency
  - Important both for discrete & integrated GPUs
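The "kick CPU job from GPU" pattern with a pollable fence can be sketched in plain C++. Everything here is an assumption for illustration: `Fence` stands in for a GPU-written fence value, and `GpuKickedJobs` is a hypothetical helper, not a Mantle or DX12 type. In a real engine the GPU would write the fence value; here the caller does.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Stand-in for a GPU fence: a counter that is expected to only ever increase.
class Fence {
public:
    void signal(std::uint64_t value) { value_.store(value, std::memory_order_release); }
    bool reached(std::uint64_t value) const {
        return value_.load(std::memory_order_acquire) >= value;
    }

private:
    std::atomic<std::uint64_t> value_{0};
};

// CPU jobs that each wait for a specific fence value before running.
class GpuKickedJobs {
public:
    void add(std::uint64_t waitValue, std::function<void()> job) {
        pending_.push_back({waitValue, std::move(job)});
    }

    // Called from the CPU's polling loop. Runs and removes every job whose
    // fence value has been reached, and returns how many jobs ran.
    std::size_t poll(const Fence& fence) {
        std::size_t ran = 0;
        for (std::size_t i = 0; i < pending_.size();) {
            if (fence.reached(pending_[i].waitValue)) {
                std::function<void()> job = std::move(pending_[i].job);
                pending_.erase(pending_.begin() + static_cast<std::ptrdiff_t>(i));
                job();
                ++ran;
            } else {
                ++i;
            }
        }
        return ran;
    }

private:
    struct Pending {
        std::uint64_t waitValue;
        std::function<void()> job;
    };
    std::vector<Pending> pending_;
};
```

A job registered this way could, for example, fill in a resource or finish building a command buffer once the GPU has produced the data it depends on; unlocking an already queued command buffer is the part that needs API support.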

GPU/CPU COLLABORATION

For us, it has been both easier to work with & easier to get good performance from
  - What we are used to from working on consoles and have architecture for
  - Update buffers from any thread, not locked to a device context
  - Persistently map or pin buffer & image objects for easy reading & writing

Pool memory to reduce overhead

Alias objects to the same memory for significant reduction in memory
  - Esp. for render targets

Built-in virtual memory mapping
  - Easier & more flexible way to manage large amounts of memory
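Aliasing render targets to the same memory can be planned from their lifetimes. The following is a minimal sketch under assumptions: `TransientTarget`, `AliasPlan` and `planAliasing` are hypothetical names, lifetimes are given as inclusive pass index ranges, and a simple greedy first-fit is used rather than any particular engine's scheme. Targets whose pass ranges do not overlap share a memory slot.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical render target that is only live during passes [firstPass, lastPass].
struct TransientTarget {
    std::uint64_t size;
    std::uint32_t firstPass;
    std::uint32_t lastPass;
};

struct AliasPlan {
    std::vector<std::size_t> slotOfTarget; // memory slot index per input target
    std::vector<std::uint64_t> slotSizes;  // size of each slot (largest occupant)

    std::uint64_t totalBytes() const {
        return std::accumulate(slotSizes.begin(), slotSizes.end(), std::uint64_t{0});
    }
};

// Greedy first-fit: visit targets by first use and reuse the first slot whose
// previous occupant is no longer live.
inline AliasPlan planAliasing(const std::vector<TransientTarget>& targets) {
    std::vector<std::size_t> order(targets.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return targets[a].firstPass < targets[b].firstPass;
    });

    AliasPlan plan;
    plan.slotOfTarget.resize(targets.size());
    std::vector<std::uint32_t> busyUntil; // last pass using each slot

    for (std::size_t idx : order) {
        const TransientTarget& t = targets[idx];
        std::size_t slot = 0;
        while (slot < busyUntil.size() && busyUntil[slot] >= t.firstPass) ++slot;
        if (slot == busyUntil.size()) {
            busyUntil.push_back(t.lastPass);
            plan.slotSizes.push_back(t.size);
        } else {
            busyUntil[slot] = t.lastPass;
            plan.slotSizes[slot] = std::max(plan.slotSizes[slot], t.size);
        }
        plan.slotOfTarget[idx] = slot;
    }
    return plan;
}
```

A real implementation would also have to respect alignment and heap requirements, and issue the appropriate barriers when a slot changes occupant; this sketch only covers the lifetime bookkeeping.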

EXPLICIT MEMORY MANAGEMENT

Major issue for us during BF4: avoiding VidMM stalls
  - VidMM is a black box
  - Difficult to know what is going on & why
  - Explicitly track memory references for each command buffer
  - Tweak memory pools & chunk sizes
  - Force memory to different heaps
  - Set memory priorities

Going forward will redesign to strongly avoid overcommitting
  - Automatically balance streaming pool settings and cap graphics settings

Are there any other ways the app, OS and GPUs can handle this?
  - Page faulting GPUs?
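Tracking memory references per command buffer can be sketched as a set of unique allocations plus a running byte total that is checked against a budget before submission. This is an illustration only: `Allocation` and `CommandBufferMemoryRefs` are hypothetical names, and the budget is a number the app would have to obtain or estimate itself.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

// Hypothetical GPU memory allocation identified by a unique id.
struct Allocation {
    std::uint32_t id;
    std::uint64_t bytes;
};

// Collects the unique memory allocations one command buffer touches.
class CommandBufferMemoryRefs {
public:
    // Registers a reference; repeated references to the same allocation count once.
    void reference(const Allocation& a) {
        if (ids_.insert(a.id).second) totalBytes_ += a.bytes;
    }

    std::uint64_t totalBytes() const { return totalBytes_; }

    // True if everything this command buffer references fits in the given budget.
    bool fitsBudget(std::uint64_t budgetBytes) const { return totalBytes_ <= budgetBytes; }

private:
    std::unordered_set<std::uint32_t> ids_;
    std::uint64_t totalBytes_ = 0;
};
```

When `fitsBudget` fails, an engine could shrink its streaming pools or cap graphics settings rather than submit work that would overcommit video memory.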

OVERCOMMITTING VIDEO MEMORY

Extensions are a great fit for low-level APIs
  - Low-level extensions that expose new hardware functionality
  - Examples: PixelSync and EXT_shader_pixel_local_storage
  - No need for a huge amount of extensions like OGL, which is mostly a high-level API
  - Mantle has 5 extensions for the AMD-specific hardware functionality & Windows-specific integration

Potential challenge for DX, which has (officially) not had extensions before
  - Would like to see official DX extensions, including shader code extensions!
  - GL & Mantle have a strong advantage here
  - Other alternative would be rapid iterations on the DX API (small updates quarterly?)

EXTENSIONS

Discuss!

THANKS
