![Page 1: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/1.jpg)
Don’t Throw it all Away: Efficient Buffer Management
John McDonald
Developer Technology, NVIDIA Corporation
![Page 2: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/2.jpg)
What are we talking about?
● General Performance/Functional Guidance
● CPU-GPU Sync Points
● Buffer Usage Patterns
● Contention-Free Buffers
● Constant Buffers
● Performance Investigation
![Page 3: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/3.jpg)
“Buffers” is really generic…
● Vertex Buffers
● Index Buffers
● Constant Buffers
![Page 4: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/4.jpg)
General Guidance
![Page 5: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/5.jpg)
General Guidance
● D3D11 >> D3D9 (generally)
● It’s much harder to hit the ultra-slow path (aka CPU-GPU Sync Points)
● Reduce your API calls where possible
● Batch up buffer updates
● Alignment matters! (16-byte, please)
● Aligned copies can be ~30x faster
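A minimal sketch of the 16-byte alignment advice, assuming plain C++ with no D3D dependency. The helper names `AlignUp` and `IsAligned16` are hypothetical and do not come from the talk. They round a size or offset up to the next 16-byte boundary and check whether a pointer is already aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Round `value` up to the next multiple of `alignment`.
// `alignment` must be a power of two (16 in the guidance above).
constexpr std::size_t AlignUp(std::size_t value, std::size_t alignment) {
    return (value + alignment - 1) & ~(alignment - 1);
}

// True when the pointer sits on a 16-byte boundary.
inline bool IsAligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}
```

Rounding both the source and destination offsets of a copy this way is one way to stay on the aligned path; whether it yields the quoted speedup depends on the driver and the copy routine.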
![Page 6: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/6.jpg)
More General Guidance
● D3D11Device will grab a mutex for you, but each DeviceContext can only be called from one thread at a time
● This is the source of many crashes blamed on the driver
● UpdateSubresource requires more CPU time
● When possible, prefer Map/Unmap
● D3D11 Debug Runtime is awesome!
● Please use it, ensure you are running clean
![Page 7: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/7.jpg)
CPU-GPU Sync Points
![Page 8: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/8.jpg)
CPU-GPU Sync Points
● CPU-GPU Sync Points are caused when the CPU needs the GPU to complete work before an API call can return
● These make us sad
![Page 9: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/9.jpg)
CPU-GPU sync point examples
● Explicit
● Spin-lock waiting for query results
● Readback of Framebuffer you just rendered to
● Implicit (potential sync points)
● GPU Memory Allocation after Deallocation
● Buffer Rename operation (MAP_DISCARD) after deallocation
● Immediate update of a buffer still in use
![Page 10: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/10.jpg)
Why are they bad?
● Ideal frame time should be max(CPU time, GPU time)
● CPU-GPU Sync point turns this into CPU Time + GPU Time.
[Figure: CPU and GPU timelines between Presents, for the ideal case and with a sync point]
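The two frame-time formulas from this slide can be written down directly. This is only an arithmetic sketch: the function names are hypothetical, and `SerializedFrameMs` models the worst case in which a sync point removes all CPU/GPU overlap.

```cpp
#include <algorithm>
#include <cassert>

// Frame time when CPU and GPU work overlap fully (the ideal case).
constexpr double IdealFrameMs(double cpuMs, double gpuMs) {
    return std::max(cpuMs, gpuMs);
}

// Frame time when a sync point fully serializes CPU and GPU work.
constexpr double SerializedFrameMs(double cpuMs, double gpuMs) {
    return cpuMs + gpuMs;
}
```

With a balanced workload of 10 ms on each side, the ideal frame takes 10 ms and the serialized frame takes 20 ms, which is the halved frame rate case.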
![Page 11: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/11.jpg)
Really? That bad?
● One bad sync point can halve your frame rate
● Even worse: the more sync points you have, the harder they are to find.
● Performance will just seem generally slow
● The badness depends, in part, on where in the frame the sync-point occurs
● Generally, the later the sync point, the worse it is
● Early sync-points are also bad if your workload is very lopsided towards either the CPU or the GPU
![Page 12: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/12.jpg)
Check your middleware
● Middleware is generally written in a vacuum
● What works best in the small might not scale well
● Especially check for CPU-GPU sync points
![Page 13: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/13.jpg)
A quick D3D9 interlude
● CPU-GPU sync points are trivial to introduce in D3D9
● Locking any buffer in D3D9 with flags=0 is a virtually guaranteed CPU-GPU Sync point if that buffer is still in use.
![Page 14: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/14.jpg)
Buffer Usage Patterns
![Page 15: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/15.jpg)
Buffer Usage Patterns
Buffer classes, ordered from least to most frequently updated:
● “Forever”: Level BSPs
● Long Lived: Character Geometry
● Transient: UI, Text (New!)
● Temporary: Particle Systems (Streaming)
● Constants: Shader Parameters
![Page 16: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/16.jpg)
“Forever” Buffers
● Useful for geometry that is loaded once
● Ex: Level BSPs, loaded behind a load screen
● Don’t use this for streaming data
● Hitching during allocation is possible/likely
● IMMUTABLE flag at creation time
● Cannot update these!
![Page 17: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/17.jpg)
Long Lived Buffers
● Data that is streamed in from disk, but is expected to last for “a while”
● Ex: Character geometry
● Reuse these; stream into them
● DEFAULT flag at creation time
● UpdateSubresource to update
![Page 18: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/18.jpg)
Temporary buffers
● Fire-and-forget data
● E.g. Particle systems
● Almost certainly lives in system RAM
● DYNAMIC flag at create time
● Prefer Map/Unmap to update these
● UpdateSubresource involves an extra copy
![Page 19: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/19.jpg)
Constant Buffers
● These are different than other buffers in D3D11.
● The GPU can deal with many of them in flight at once
● Create with DYNAMIC
● Map/DISCARD to Update
● More on these in a bit
![Page 20: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/20.jpg)
We skipped one…
● Transient Buffers
● New informal class of Buffer
● Used for (e.g.) UI/Text
● Things that are dynamic, but few vertices each—and may need to be updated on odd schedules
● DYNAMIC flag at creation time
● Transient Buffers are part of a new class of buffer…
![Page 21: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/21.jpg)
Contention-Free Buffers
![Page 22: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/22.jpg)
Transient Buffer Overview
● Treat Buffer as a Memory Heap, with a twist
● On CPU, Freed memory available now
● On GPU, Freed memory is available when GPU is finished with it
● Assume memory is in use until told otherwise
● Determine when GPU must be finished with Freed memory, then return to the “really free” list
![Page 23: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/23.jpg)
CTransientBuffer
● On Alloc, walk a Free list looking for best fit
● Data is updated using Map/NO_OVERWRITE
● Return opaque, immutable handle
● On Free, record that chunk was freed (into RetiredFrames.back())
● Just after present, an “OnPresent” function is called
class CTransientBuffer {
    ID3D11Buffer* mBuffer;
    UINT mLengthBytes;
    ID3D11Device* mOwner;
    vector<CSubAlloc> mFreeList;
    list<RetiredFrame> mRetiredFrames;
public:
    CSubAlloc* Alloc(UINT, void*, ID3D11DeviceContext*);
    void Free(CSubAlloc*);
    void OnPresent(ID3D11DeviceContext*);
    ...
![Page 24: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/24.jpg)
CTransientBuffer Guts
class CTransientBuffer {
    ID3D11Buffer* mBuffer;
    UINT mLengthBytes;
    ID3D11Device* mOwner;
    vector<CSubAlloc> mFreeList;
    list<RetiredFrame> mRetiredFrames;
public:
    CSubAlloc* Alloc(UINT, void*, ID3D11DeviceContext*);
    void Free(CSubAlloc*);
    void OnPresent(ID3D11DeviceContext*);
    ...
struct RetiredFrame {
    list<CSubAlloc*> mPendingFrees;
    ID3D11Query* mFrameCompleteQuery;
};
class CSubAlloc {
    UINT mOffset;
    UINT mLength;
    ...
![Page 25: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/25.jpg)
CTransientBuffer::OnPresent
void CTransientBuffer::OnPresent(ID3D11DeviceContext* _dc)
{
    // First, deal with deletes from this frame
    RetiredFrame& retFrame = mRetiredFrames.back();
    if (!retFrame.mPendingFrees.empty()) {
        retFrame.mFrameCompleteQuery = CreateAndIssueEventQuery(_dc);
        // Append a new (empty) RetiredFrame to mRetiredFrames
        mRetiredFrames.push_back(RetiredFrame());
    }
    // Second, return pending frees to mFreeList
![Page 26: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/26.jpg)
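CTransientBuffer::OnPresent cont’d
    // Second, return pending frees to mFreeList
    FOREACH(frameIt, mRetiredFrames) {
        auto query = frameIt->mFrameCompleteQuery;
        // Stop at the first frame whose query is missing or not yet complete
        if (!(query && IsQueryComplete(query)))
            break;
        FOREACH(subAllocIt, frameIt->mPendingFrees) {
            ReallyFree(*subAllocIt);
        }
        // (Removing the completed frame from mRetiredFrames is not shown on the slide)
    }
}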
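The deferred-free idea can be exercised without a GPU. The sketch below is not the talk's `CTransientBuffer`: it is a simplified model in which the event query is replaced by a frame counter (`FakeGpu::completedFrame`), handles are plain offset/length ranges, and adjacent free ranges are never merged. All names in it are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Stand-in for the GPU: frames are assumed to complete in submission order.
struct FakeGpu { std::uint64_t completedFrame = 0; };

struct Range { std::uint32_t offset; std::uint32_t length; };

class TransientHeapSketch {
public:
    explicit TransientHeapSketch(std::uint32_t bytes) {
        mFree.push_back(Range{0, bytes});
        mRetired.emplace_back();            // open bucket for this frame's frees
    }

    // Best-fit walk of the free list; returns false when nothing fits.
    bool Alloc(std::uint32_t bytes, Range& out) {
        std::size_t best = mFree.size();    // mFree.size() means "none found"
        for (std::size_t i = 0; i < mFree.size(); ++i)
            if (mFree[i].length >= bytes &&
                (best == mFree.size() || mFree[i].length < mFree[best].length))
                best = i;
        if (best == mFree.size()) return false;
        out = Range{mFree[best].offset, bytes};
        mFree[best].offset += bytes;
        mFree[best].length -= bytes;
        if (mFree[best].length == 0) mFree.erase(mFree.begin() + best);
        return true;
    }

    // The CPU is done with the range, but the GPU may still read it: park it.
    void Free(Range r) { mRetired.back().pending.push_back(r); }

    // Call once per frame, just after Present.
    void OnPresent(std::uint64_t submittedFrame, const FakeGpu& gpu) {
        if (!mRetired.back().pending.empty()) {
            mRetired.back().fenceFrame = submittedFrame;  // models issuing a query
            mRetired.back().sealed = true;
            mRetired.emplace_back();
        }
        // Return ranges whose frame the GPU has finished, oldest first.
        while (mRetired.front().sealed &&
               mRetired.front().fenceFrame <= gpu.completedFrame) {
            for (const Range& r : mRetired.front().pending) mFree.push_back(r);
            mRetired.pop_front();
        }
    }

    std::uint32_t FreeBytes() const {
        std::uint32_t total = 0;
        for (const Range& r : mFree) total += r.length;
        return total;
    }

private:
    struct Retired {
        std::vector<Range> pending;
        std::uint64_t fenceFrame = 0;
        bool sealed = false;
    };
    std::vector<Range> mFree;
    std::deque<Retired> mRetired;
};
```

In this model a freed range only becomes allocatable again after `OnPresent` observes that the frame which last used it has completed, which is the "assume memory is in use until told otherwise" rule from the overview.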
![Page 27: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/27.jpg)
CTransientBuffer Visualized
[Figure: the Free List and the Retired Frames list]
![Page 28: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/28.jpg)
CTransientBuffer Visualized
[Figure: the Free List and the Retired Frames list. Step: allocating four buffers]
![Page 29: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/29.jpg)
CTransientBuffer Visualized
[Figure: the Free List and the Retired Frames list. Step: nothing]
![Page 30: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/30.jpg)
CTransientBuffer Visualized
[Figure: the Free List and the Retired Frames list. Step: deallocating Yellow and Green; the retired frame is marked EQ]
![Page 31: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/31.jpg)
CTransientBuffer Visualized
[Figure: the Free List and the Retired Frames list. Step: deallocating Yellow and Green; the retired frame is marked EQ]
![Page 32: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/32.jpg)
CTransientBuffer Visualized
[Figure: the Free List and the Retired Frames list. Step: EQ returns for the retired frame]
![Page 33: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/33.jpg)
CTransientBuffer: Handling OOM
● Ways to handle Out of Memory on Alloc:
● Spin-lock waiting for RetiredFrame Queries to return
● Allocate a new, larger buffer
● Release current buffer
● Requires a system memory copy to initially fill new buffer
● These will (probably) stall
● But the stall is in your code, so it can be easily logged and/or recorded to adjust and avoid on subsequent runs
![Page 34: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/34.jpg)
Transient Buffer Pattern
● Works in D3D9 as well
● Can be extended and simplified to contention-free Temporary Buffers, too!
● Let’s take a quick look at that.
![Page 35: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/35.jpg)
Discard-Free Temporary Buffers
● Allocate out of Buffer as a circular buffer
● No opaque handle needed
● Remember ending address of the last allocation
● Per frame: Assuming any allocations, issue query
● Later: When query returns, move the end pointer to indicate additional available space
● Credit: Blizzard’s StarCraft 2 Team (thanks!)
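A GPU-free sketch of the circular-buffer scheme described above. It is an illustration under stated assumptions, not the StarCraft 2 implementation: the per-frame query is modeled by a frame number passed to `Retire`, offsets are tracked with monotonically increasing byte counters, and an allocation that would straddle the end of the buffer skips the remaining tail and restarts at offset 0. The class and member names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

class RingAllocatorSketch {
public:
    explicit RingAllocatorSketch(std::uint64_t capacity) : mCapacity(capacity) {}

    // Returns the byte offset of a contiguous region, or -1 when the GPU
    // has not yet released enough space.
    std::int64_t Alloc(std::uint64_t bytes) {
        const std::uint64_t offset = mHead % mCapacity;
        // Padding needed to skip the tail if the request would straddle the end.
        const std::uint64_t pad =
            (offset + bytes > mCapacity) ? mCapacity - offset : 0;
        if ((mHead - mTail) + pad + bytes > mCapacity) return -1;
        mHead += pad + bytes;
        return static_cast<std::int64_t>(pad ? 0 : offset);
    }

    // After Present: remember where this frame's allocations ended,
    // but only if the frame allocated anything.
    void EndFrame(std::uint64_t frameId) {
        const std::uint64_t lastMark = mFences.empty() ? mTail : mFences.back().head;
        if (mHead != lastMark) mFences.push_back(Fence{frameId, mHead});
    }

    // When the GPU reports a finished frame, release everything up to its mark.
    void Retire(std::uint64_t completedFrame) {
        while (!mFences.empty() && mFences.front().frameId <= completedFrame) {
            mTail = mFences.front().head;
            mFences.pop_front();
        }
    }

private:
    struct Fence { std::uint64_t frameId; std::uint64_t head; };
    std::uint64_t mCapacity;
    std::uint64_t mHead = 0;   // total bytes handed out (including padding)
    std::uint64_t mTail = 0;   // total bytes the GPU has released
    std::deque<Fence> mFences;
};
```

Because space is reclaimed strictly in allocation order, no free list or handle is needed; the cost is that one long-lived allocation blocks everything behind it, which is why this variant suits fire-and-forget data.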
![Page 36: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/36.jpg)
Discard-Free Temp Buffer Visualized
[Figure: circular buffer with Start and End markers and a Retired Frames list. Step: start state]
![Page 37: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/37.jpg)
Discard-Free Temp Buffer Visualized
[Figure: circular buffer with Start, NextEnd, and End markers and a Retired Frames list. Step: allocate some stuff]
![Page 38: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/38.jpg)
Discard-Free Temp Buffer Visualized
[Figure: circular buffer with Start, NextEnd, and End markers and a Retired Frames list. Step: go on…]
![Page 39: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/39.jpg)
Discard-Free Temp Buffer Visualized
[Figure: circular buffer with Start, NextEnd, and End markers and a Retired Frames list. Step: queries start to return…]
![Page 40: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/40.jpg)
Discard-Free Temp Buffer Visualized
[Figure: circular buffer with Start, NextEnd, and End markers and a Retired Frames list. Step: etc…]
![Page 41: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/41.jpg)
Discard-Free Temp Buffer Visualized
[Figure: circular buffer with Start, NextEnd, and End markers and a Retired Frames list. Step: etc…]
![Page 42: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/42.jpg)
Constant Buffers
![Page 43: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/43.jpg)
Constant Buffer Organization
● Group by frequency of update
● The cheapest buffers are the ones you never update
● You can bind multiple buffers in one call (Reduce those API calls!)
![Page 44: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/44.jpg)
Proposed Buffer Grouping
● Assuming you are not vertex shading limited
● Don’t solve the travelling salesman in your VS
● Seriously: this isn’t common
![Page 45: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/45.jpg)
Multiple Constant Buffers
● One for per-frame constants (GI values, lights)
● One for per-camera constants (ViewProj matrix, camera position in world, RT dimensions)
Old HLSL: oPos = in.Position * cWorldViewPos;
New HLSL: oPos = in.Position * cWorld * cViewPos;
^ One extra 3x3 matrix multiply in the VS. No biggie.
![Page 46: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/46.jpg)
Multiple Constant Buffers cont’d
● One for per-object constants (World matrix, dynamic material properties, etc)
● One for per-material constants (if these are shared—if not then drop them in with per-object constants)
● Splitting constants this way eliminates constant updates for static objects.
![Page 47: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/47.jpg)
Constant Buffer Tricks
● Use shared structs to update when possible
● Struct can be included from both hlsl and C++
● Makes buffer updates trivial!
● Assign them to slots by convention:
● b0: Per-Frame, b1: Per-Camera, etc
● Slot assignment can live in shared header, too.
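A sketch of what the C++ side of such a shared header might look like. Everything here is illustrative: the `float4`/`float4x4` stand-ins, the struct layout, and the slot numbers beyond b0 and b1 are assumptions, not taken from the talk. Using only 16-byte members sidesteps differences between C++ struct layout and HLSL constant packing, and the `static_assert` guards the D3D11 requirement that a constant buffer's size be a multiple of 16 bytes.

```cpp
#include <cassert>
#include <cstddef>

// C++ stand-ins for the HLSL vector types (hypothetical; a real shared
// header would typedef these per language).
struct float4   { float x, y, z, w; };
struct float4x4 { float4 rows[4]; };

// Slot convention: b0 and b1 follow the slide; b2 and b3 are assumed.
enum ConstantSlot : unsigned {
    kSlotPerFrame    = 0,
    kSlotPerCamera   = 1,
    kSlotPerObject   = 2,
    kSlotPerMaterial = 3
};

// Per-camera constants, using only 16-byte members so the C++ layout
// matches HLSL register packing.
struct PerCameraConstants {
    float4x4 viewProj;
    float4   cameraPosWorld;    // xyz = position, w unused
    float4   renderTargetDims;  // xy = size in pixels, zw unused
};

static_assert(sizeof(PerCameraConstants) % 16 == 0,
              "constant buffer size must be a multiple of 16 bytes");
static_assert(offsetof(PerCameraConstants, cameraPosWorld) == 64,
              "viewProj occupies the first four 16-byte registers");
```

With a struct like this, an update is a single copy of `sizeof(PerCameraConstants)` bytes into the mapped buffer.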
![Page 48: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/48.jpg)
Performance Investigations
![Page 49: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/49.jpg)
Performance Investigation
● Scene from a Typical D3D11 Application (unreleased)
● 115 Dynamic Vertex Buffer Updates (particles) per frame
● Total Time: 4.36ms / frame

|  | Per-Call | Per-Frame |
|---|---|---|
| Map/Unmap | 0.036 ms | 3.79 ms |
| Memcpy | ~0.004 ms | 0.4 ms |
![Page 50: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/50.jpg)
Let’s buffer the updates
● All Dynamic Updates during one update
● 1 Map per frame (using MAP_DISCARD)
● Still 115 memcpys (I’m lazy)
● Total Time: 0.267ms / frame (savings: 4.1ms!)

|  | Per-Call | Per-Frame |
|---|---|---|
| Map/Unmap | 0.036 ms | 0.036 ms |
| Memcpy | ~0.002 ms | 0.231 ms |
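The batching step can be sketched independently of D3D11. In the hypothetical helper below, `mapped` stands for the pointer a single Map call would return in real code; each batch is still copied with its own memcpy, as on the slide, but into one shared region. The `Batch` struct, the function name, and the 16-byte alignment of each batch are assumptions of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

struct Batch { const void* data; std::size_t bytes; };

// Copies every batch into `mapped` and records each batch's byte offset,
// rounded up to 16 bytes. Returns false if the batches do not fit.
inline bool PackBatches(const std::vector<Batch>& batches,
                        unsigned char* mapped, std::size_t capacity,
                        std::vector<std::size_t>& offsetsOut) {
    offsetsOut.clear();
    std::size_t cursor = 0;
    for (const Batch& b : batches) {
        cursor = (cursor + 15) & ~static_cast<std::size_t>(15);
        if (cursor + b.bytes > capacity) return false;
        std::memcpy(mapped + cursor, b.data, b.bytes);
        offsetsOut.push_back(cursor);
        cursor += b.bytes;
    }
    return true;
}
```

The recorded offsets are what each draw would later use to locate its vertices inside the shared buffer.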
![Page 51: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/51.jpg)
Buffered update, no discards
● One update into a triple buffer
● 1 Map per frame (using MAP_NOOVERWRITE)
● Still 115 memcpys (I’m still lazy)
● Total Time: 0.217ms / frame (savings: 4.14ms)
● Bonus: No hitching ever
● Downside: 3x the memory

|  | Per-Call | Per-Frame |
|---|---|---|
| Map/Unmap | 0.031 ms | 0.031 ms |
| Memcpy | ~0.002 ms | 0.231 ms |
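One way to read "one update into a triple buffer" is a single buffer split into three equal regions, with each frame writing only its own region so that no discard is needed. The helper below is a sketch of that indexing only. The names are hypothetical, and it assumes the GPU never falls three or more frames behind the CPU, since frame N reuses the region written by frame N-3.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Region { std::size_t offset; std::size_t bytes; };

constexpr std::size_t kRegionCount = 3;

// Region of a buffer of `totalBytes` that frame `frameIndex` may write
// without overwriting data from the previous two frames.
constexpr Region RegionForFrame(std::uint64_t frameIndex, std::size_t totalBytes) {
    return Region{
        static_cast<std::size_t>(frameIndex % kRegionCount) * (totalBytes / kRegionCount),
        totalBytes / kRegionCount};
}
```

This is where the 3x memory cost comes from: each region must be large enough for a full frame of updates.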
![Page 52: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/52.jpg)
Performance Results
● Reducing API usage was a huge CPU-side savings (4.09 ms). GPU Perf Unaffected
● Discard-Free updates were marginally faster still—but would never hitch.

|  | Total Frame Time |
|---|---|
| Original | 4.360 ms |
| Buffered Updates | 0.267 ms |
| Discard-Free | 0.217 ms |
![Page 53: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/53.jpg)
GPUView
● Covered by Jon Story earlier today
● Hopefully you caught it!
● Great for finding CPU-GPU sync points
![Page 54: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/54.jpg)
Questions?
● jmcdonald at nvidia dot com
![Page 55: Don't Throw it all Away: Efficient Buffer Management](https://reader034.vdocument.in/reader034/viewer/2022051102/5873b6a81a28aba9328b45f2/html5/thumbnails/55.jpg)
Nifty Buffer Summary Table

| Type | Usage (e.g.) | Create Flag | Update Method |
|---|---|---|---|
| “Forever” | Level BSPs | IMMUTABLE | Cannot Update |
| Long-Lived | Characters | DEFAULT | UpdateSubresource |
| Transient | UI/Text | DYNAMIC | CTransientBuffer |
| Temporary | Particles | DYNAMIC | Map/NO_OVERWRITE |
| Constant | Material Props | DYNAMIC | Map/DISCARD |