
Page 1: Practical Parallel Rendering with DirectX 9 and 10 - AMDdeveloper.amd.com/.../2012/10/S2008-Scheib-ParallelRenderingSiggr… · Practical Parallel Rendering with DirectX 9 and 10

Practical Parallel Rendering with DirectX 9 and 10

Windows PC Command Buffers

Vincent Scheib

Architect, Gamebryo

Emergent Game Technologies


Foundational technology, over 200 shipped titles, more than 13 genres, and multiple platforms.

TITLES

Munch's Oddysee

Sid Meier's Pirates!

Barbie Digital Makeover

Dark Age of Camelot

Futurama

Tetris Worlds

Crash Racing

Elder Scrolls

Sim Patient

Zero Cup Soccer

Civilization 4

PLATFORMS

PC

Xbox 360

PS3

Wii

Xbox

PS2

GC

GENRES

Action

Adventure

Family

MMO

Platformer

Puzzle

Racing

RPG

Vis / Sim

Sports

Strategy


Customers


Introduction

• Take advantage of multiple cores with parallel rendering

• Performance should scale by number of cores

[Chart: performance ratio (axis 0 to 4) for single-core, dual-core, and quad-core machines. Observed data from this project; details follow.]


Presentation Outline

• Motivation and problem definition

• Command buffers

– Requirements

– Implementation

– Handling effects and resources

• Application models

• Integrating into existing code

• Prototype results

• Future work


Motivation

• Take advantage of multi-core machines

– 40% of machines have 2+ physical CPUs (Steam, Jul 2008)

• Rendering can have high CPU cost

• Direct3D 11 display lists are coming, but we want support for Direct3D 9 and 10 now

– Currently 81% DX9 HW, 9% DX10 HW (Steam, Jul 2008)

– Rough DX9 HW forecast: ~30% in 2011 (Emergent)

– Asia HW trends lag somewhat


Multithreaded DX Device?

• DirectX 9 and 10 are primarily designed for single-threaded game architectures

• Multithreaded mode incurs overhead

– Cuts FPS roughly in half on DX9 for an application bound by CPU render-call cost

• DX is stateful

– Requires additional synchronization for parallel rendering


Ideal Scenario

• One thread per hardware thread

• Application manages dispatching work to multiple threads

• Rendering data completely prepared, ready to be sent to single-threaded D3D device

– Function calls, conditionals, and final matrix multiplies are wasted time on a D3D device thread


Reality

• Update()

– Seldom generates coherent data in an API-specific format

• Render()

– Some work done between calls to DirectX API


Going Wide

[Diagram: Update and Render work distributed across the main thread and three worker threads.]


Command Buffers

• Record calls to D3D

– Store in a command buffer

– Can be done concurrently on multiple threads, to multiple command buffers

• Playback D3D commands

– Efficiently on main thread

– Exact data for DX API

– Coherent in memory

• Clean and modular point of integration into the application


Command Buffer Requirements

• Minimal modifications to rendering code

– Most code uses pointer to D3DDevice

– Parameters from stack, e.g., D3DRECT

– Support most of the device API

• Draw calls, setting state, constants, shaders, textures, stream source, and so on

– Support effects

• Playback does not modify buffer

• Playback runs at ideal performance


Command Buffer Allowances

• No support for:

– Create methods

– Get methods

– Miscellaneous other functions that return values

• QueryInterface, ShowCursor


Command Buffer: Nice to Have

• Buffers played back multiple times

• Optimization of buffers

– Remove redundant state calls

• Offload main thread by doing this on recorder threads

– Reordering of sort-independent draw calls
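The redundant state call removal listed above might look like the following minimal sketch. `StateCall` and `RemoveRedundantStateCalls` are hypothetical names, not part of the Emergent library, and a state is modeled as a plain integer pair. Only repeats within a single buffer are dropped, because a recorder cannot know the device state at the start of playback.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical model of one recorded SetRenderState-style call.
struct StateCall {
    int state;
    int value;
};

// Returns the calls with in-buffer repeats removed. The first write to each
// state is always kept, since the state before playback is unknown.
std::vector<StateCall> RemoveRedundantStateCalls(const std::vector<StateCall>& calls) {
    std::map<int, int> current;  // last value written to each state in this buffer
    std::vector<StateCall> kept;
    for (const StateCall& call : calls) {
        auto it = current.find(call.state);
        if (it != current.end() && it->second == call.value) {
            continue;  // same value already in effect: redundant
        }
        current[call.state] = call.value;
        kept.push_back(call);
    }
    assert(kept.size() <= calls.size());
    return kept;
}
```

Running this on the recorder threads, as the slide suggests, keeps the filtering cost off the playback thread.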


Design: Recording

• Wrap every API call

– Unsupported calls: return an error

– Supported calls

• Store enumeration for call into buffer

• Store parameters into buffer

• Make copies of non-reference-counted objects such as D3DMATRIX, D3DRECT, shader constants, and so on


Design: Playback

• Playback reads from the buffer and

– selects a function pointer from a table, given the token

– each playback function unpacks its parameters from the buffer


Recording Example

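// Recording device: serializes the call into the command buffer
// instead of issuing it to D3D.
virtual HRESULT STDMETHODCALLTYPE DrawPrimitive(
    D3DPRIMITIVETYPE PrimitiveType,
    UINT StartVertex,
    UINT PrimitiveCount)
{
    // Token first, so playback can select the matching function.
    m_pCommandBuffer->Put(CBD3D_COMMANDS::DrawPrimitive);
    // Parameters follow in declaration order; playback reads them back in the same order.
    m_pCommandBuffer->Put(PrimitiveType);
    m_pCommandBuffer->Put(StartVertex);
    m_pCommandBuffer->Put(PrimitiveCount);
    // Nothing has reached the device yet, so recording reports success.
    return D3D_OK;
}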


Playback Example

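// Playback function for the DrawPrimitive token: unpacks the parameters in the
// order they were recorded, then issues the real call on the D3D device.
void CBPlayer9::DoDrawPrimitive()
{
    D3DPRIMITIVETYPE arg1;
    m_pCommandBuffer->Get(&arg1);
    UINT arg2;
    m_pCommandBuffer->Get(&arg2);
    UINT arg3;
    m_pCommandBuffer->Get(&arg3);
    // Errors surface only here, at playback time. The literal concatenation
    // with __FUNCTION__ relies on MSVC treating it as a string literal.
    if(FAILED(m_pDevice->DrawPrimitive(arg1, arg2, arg3)))
        OutputDebugStringA(__FUNCTION__ " failed in playback\n");
}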
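The recording and playback examples above can be combined into a portable sketch that compiles without the DirectX headers. Everything here is an assumption made for illustration: `FakeDevice` stands in for the D3D device, `Command`, `Recorder`, and `Playback` are hypothetical names, and `Put`/`Get` borrow the slides' names but not necessarily their signatures. `Get` takes an explicit cursor, which is one assumed way to meet the requirement that playback does not modify the buffer.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for the real device; it only counts what it is asked to draw.
struct FakeDevice {
    unsigned drawCalls = 0;
    unsigned primitivesDrawn = 0;
    void DrawPrimitive(int /*type*/, unsigned /*startVertex*/, unsigned primitiveCount) {
        ++drawCalls;
        primitivesDrawn += primitiveCount;
    }
};

// One token per supported call (only one in this sketch).
enum class Command : std::uint32_t { DrawPrimitive = 0, Count };

class CommandBuffer {
public:
    // Appends the raw bytes of a value to the buffer.
    template <typename T>
    void Put(const T& value) {
        static_assert(std::is_trivially_copyable<T>::value, "raw byte copy only");
        const auto* bytes = reinterpret_cast<const unsigned char*>(&value);
        m_bytes.insert(m_bytes.end(), bytes, bytes + sizeof(T));
    }
    // Reads through a caller-owned cursor, so the buffer itself stays untouched.
    template <typename T>
    void Get(std::size_t& cursor, T* out) const {
        assert(cursor + sizeof(T) <= m_bytes.size());
        std::memcpy(out, m_bytes.data() + cursor, sizeof(T));
        cursor += sizeof(T);
    }
    std::size_t Size() const { return m_bytes.size(); }
private:
    std::vector<unsigned char> m_bytes;
};

// Recording side: token first, then parameters in declaration order.
struct Recorder {
    CommandBuffer* buffer;
    void DrawPrimitive(int type, unsigned startVertex, unsigned primitiveCount) {
        buffer->Put(Command::DrawPrimitive);
        buffer->Put(type);
        buffer->Put(startVertex);
        buffer->Put(primitiveCount);
    }
};

using PlaybackFn = void (*)(const CommandBuffer&, std::size_t&, FakeDevice&);

// Playback side: unpack in the same order, then issue the real call.
void PlayDrawPrimitive(const CommandBuffer& cb, std::size_t& cursor, FakeDevice& device) {
    int type = 0;
    unsigned startVertex = 0;
    unsigned primitiveCount = 0;
    cb.Get(cursor, &type);
    cb.Get(cursor, &startVertex);
    cb.Get(cursor, &primitiveCount);
    device.DrawPrimitive(type, startVertex, primitiveCount);
}

// Function table indexed by token.
const PlaybackFn kPlaybackTable[] = { &PlayDrawPrimitive };

void Playback(const CommandBuffer& cb, FakeDevice& device) {
    std::size_t cursor = 0;
    while (cursor < cb.Size()) {
        Command token;
        cb.Get(cursor, &token);
        assert(token < Command::Count);
        kPlaybackTable[static_cast<std::size_t>(token)](cb, cursor, device);
    }
}
```

Because `Playback` takes the buffer by const reference, the same buffer can be played back several times, which matches the "record once, play back several times" use mentioned later in the talk.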


Effects: Problem

• Effect takes pointer to device at creation

• Effect then creates resources

• At render, effect should use our recorder

• Our recording device cannot create resources


Effects: Solutions

1. Create FX with command buffer device

• Fails: needs real device for initialization

2. Wrap and record FX calls and play them back

• Inefficient

3. Give FX EffectStateManager class to redirect calls to command buffer, give it real device for initialization

• Disables FX use of state blocks

4. Create redirecting device

• Acts as real device at init, command buffer device at render time
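Solution 4 can be pictured with a small sketch. It is not the library's redirecting device: `IDevice`, `RealDevice`, `RecordingDevice`, and the `BeginRecording`/`EndRecording` switch are hypothetical, and the interface is reduced to two calls so the example compiles without DirectX. The idea shown is only that the effect holds one pointer whose target changes between initialization and render time.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical two-call slice of a device interface.
struct IDevice {
    virtual ~IDevice() = default;
    virtual bool CreateResource(const std::string& name) = 0;
    virtual bool SetState(int state, int value) = 0;
};

// Stand-in for the real device: can create resources.
struct RealDevice : IDevice {
    std::vector<std::string> resources;
    int stateCalls = 0;
    bool CreateResource(const std::string& name) override {
        resources.push_back(name);
        return true;
    }
    bool SetState(int, int) override { ++stateCalls; return true; }
};

// Stand-in for the command buffer recorder: Create methods are unsupported.
struct RecordingDevice : IDevice {
    std::vector<std::pair<int, int>> recorded;
    bool CreateResource(const std::string&) override { return false; }
    bool SetState(int state, int value) override {
        recorded.emplace_back(state, value);
        return true;
    }
};

// Forwards every call to whichever device is currently selected.
class RedirectingDevice : public IDevice {
public:
    RedirectingDevice(IDevice* real, IDevice* recorder)
        : m_real(real), m_recorder(recorder), m_target(real) {
        assert(real != nullptr && recorder != nullptr);
    }
    void BeginRecording() { m_target = m_recorder; }  // render time
    void EndRecording() { m_target = m_real; }        // back to init behavior
    bool CreateResource(const std::string& name) override {
        return m_target->CreateResource(name);
    }
    bool SetState(int state, int value) override {
        return m_target->SetState(state, value);
    }
private:
    IDevice* m_real;
    IDevice* m_recorder;
    IDevice* m_target;
};
```

An effect created against `RedirectingDevice` would create its resources on the real device, then have its render-time state calls captured once `BeginRecording` is called.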


Resource Management

• Multiple threads wish to:

– Create resources (e.g., background loading)

– Update resources (e.g., dynamic geometry)

• App must use playback thread only to modify resources

– App specific logic

• Deferred creation, double buffering

– Support in command buffers (next slide)


Resource Management (2)

• Command buffer library could encapsulate details

– (This is Future Work)

• Gamebryo Volatile Type Buffers

– D3DUSAGE: WRITEONLY | DYNAMIC; D3DLOCK: NOOVERWRITE, DISCARD

– Lock() is stored into command buffer

– Memory allocated from command buffer, returned from Lock()

– At playback, true lock is performed

• Gamebryo Mutable Type Buffers:

– CPU read and infrequent access

– Backing store required, copied on each Lock()
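The deferred Lock() described for volatile buffers can be sketched as follows. `FakeVertexBuffer`, `PendingLock`, and `LockRecorder` are hypothetical names, and staging memory comes from the heap here rather than from the command buffer itself, which is a simplification. The recording thread writes into staging memory; the true lock and the copy happen only at playback.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

// Hypothetical stand-in for a dynamic, write-only vertex buffer.
struct FakeVertexBuffer {
    std::vector<unsigned char> storage;
    int lockCount = 0;
    explicit FakeVertexBuffer(std::size_t size) : storage(size) {}
    void* Lock(std::size_t offset) { ++lockCount; return storage.data() + offset; }
    void Unlock() {}
};

// One recorded Lock(): where the data goes, plus the staged bytes.
struct PendingLock {
    FakeVertexBuffer* target;
    std::size_t offset;
    std::size_t size;
    std::unique_ptr<unsigned char[]> staging;
};

class LockRecorder {
public:
    // Record time: no device access; the caller fills the returned memory.
    void* Lock(FakeVertexBuffer* target, std::size_t offset, std::size_t size) {
        m_pending.push_back(PendingLock{
            target, offset, size,
            std::unique_ptr<unsigned char[]>(new unsigned char[size]())});
        return m_pending.back().staging.get();
    }
    // Playback thread: perform the true lock and copy the staged bytes in.
    void Playback() const {
        for (const PendingLock& lock : m_pending) {
            assert(lock.offset + lock.size <= lock.target->storage.size());
            void* destination = lock.target->Lock(lock.offset);
            std::memcpy(destination, lock.staging.get(), lock.size);
            lock.target->Unlock();
        }
    }
private:
    std::vector<PendingLock> m_pending;
};
```

This only works for write-only usage: the recording thread never sees the buffer's existing contents, which is why the slide treats CPU-readable (mutable) buffers separately with a backing store.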


Implementation Considerations

• Ease of changing implementation

– Macros provide implementation

– Preprocessor & Beautifier produce debuggable code

– Many macro permutations required (~40) for different argument count and return type

• Generated from Excel

– Function overloading to store non-ref-counted parameters

• Everything but shader constants then stored with the same function signature


Application Models

• Command buffers can be used in various ways by applications

– Fork and join

– Fork and join, frame deferred

– Work queue

– …

• Record once, play back several times


Fork & Join

[Diagram: Fork & Join timeline within one frame (Update, then Render). Main thread: signal threads to start, record command buffer, wait for command buffers, play back command buffers. Worker thread: wait for signal, record command buffer, signal command buffer complete. Annotation: "Starve!"]
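The fork and join model can be sketched with standard threads. This is an illustration under assumptions, not the stress test's code: `CommandList`, `RecordRange`, and `RenderFrameForkJoin` are hypothetical, each recorded draw call is reduced to an integer, and threads are created per frame instead of being signaled from a pool as in the diagram.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Each int stands in for one recorded draw call.
using CommandList = std::vector<int>;

// Records draws [first, last) into a buffer owned by exactly one thread,
// so recording needs no locks.
void RecordRange(int first, int last, CommandList* out) {
    for (int draw = first; draw < last; ++draw) {
        out->push_back(draw);
    }
}

// Returns the draw calls in the order the playback thread submits them.
std::vector<int> RenderFrameForkJoin(int drawCount, int threadCount) {
    assert(threadCount >= 1 && drawCount >= 0);
    std::vector<CommandList> buffers(threadCount);
    auto sliceBegin = [=](int t) { return drawCount * t / threadCount; };

    // Fork: workers record slices 1..N-1 while the caller records slice 0.
    std::vector<std::thread> workers;
    for (int t = 1; t < threadCount; ++t) {
        workers.emplace_back(RecordRange, sliceBegin(t), sliceBegin(t + 1), &buffers[t]);
    }
    RecordRange(sliceBegin(0), sliceBegin(1), &buffers[0]);

    // Join: wait until every command buffer is complete.
    for (std::thread& worker : workers) {
        worker.join();
    }

    // Playback on the calling thread only, in buffer order, so the device
    // sees the same sequence a single-threaded renderer would produce.
    std::vector<int> submitted;
    for (const CommandList& buffer : buffers) {
        submitted.insert(submitted.end(), buffer.begin(), buffer.end());
    }
    return submitted;
}
```

While the caller plays back, the worker threads in this sketch have nothing to do, which is the kind of idle time the frame-deferred variant is meant to reduce.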


Fork & Join, Frame Deferred

[Diagram: Fork & Join, Frame Deferred timeline. Main thread: Update, signal threads to start, record command buffer, play back command buffers, then Update for the next frame. Worker thread: wait for signal, record command buffer, signal command buffer complete.]
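A frame-deferred variant might double-buffer the command lists so that playback of one frame overlaps recording of the next. The sketch below assumes that reading of the diagram; `FrameDeferredRenderer` and `RunFrame` are hypothetical names, a single worker is used for brevity, and each draw call is encoded as `frameId * 1000 + index`.

```cpp
#include <cassert>
#include <thread>
#include <vector>

using CommandList = std::vector<int>;

class FrameDeferredRenderer {
public:
    // Records frame `frameId` on a worker while the caller plays back the
    // buffer recorded during the previous call.
    void RunFrame(int frameId, int drawCount) {
        assert(drawCount >= 0 && drawCount < 1000);
        CommandList& recording = m_buffers[m_recordIndex];
        const CommandList& previous = m_buffers[1 - m_recordIndex];
        recording.clear();

        std::thread worker([&recording, frameId, drawCount] {
            for (int draw = 0; draw < drawCount; ++draw) {
                recording.push_back(frameId * 1000 + draw);
            }
        });

        // The two buffers are distinct, so playback needs no lock.
        m_submitted.insert(m_submitted.end(), previous.begin(), previous.end());

        worker.join();
        m_recordIndex = 1 - m_recordIndex;  // swap roles for the next frame
    }
    const std::vector<int>& Submitted() const { return m_submitted; }
private:
    CommandList m_buffers[2];
    int m_recordIndex = 0;
    std::vector<int> m_submitted;
};
```

The cost of the overlap is one frame of latency: nothing is submitted during the first call, and each later call submits the frame recorded by the call before it.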


Work Queue

[Diagram: Work Queue timeline. Update, Record, and Play tasks are interleaved across the main thread and a worker thread.]


Adapting to an Existing Codebase

• Refactor code to take pointer to device that can be changed easily

– Easy if pointer passed on stack

– Thread local storage if used from heap

• Add ownership of recording devices, playback class, and pool of command buffers

• Determine application model, and add high-level logic to parcel out rendering work

• Manage resources over recording and playback
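For the thread local storage case in the first bullet, a sketch using C++11 `thread_local` is shown below. `Device`, `t_device`, `ScopedDevice`, and `RenderObject` are hypothetical; the point is only that rendering code which used to read a shared device pointer can read a per-thread one instead, so each recorder thread sees its own recording device.

```cpp
#include <cassert>
#include <thread>

// Hypothetical device; counts the calls it receives.
struct Device {
    int drawCalls = 0;
    void Draw() { ++drawCalls; }
};

// Per-thread "current device" that replaces a single shared pointer.
thread_local Device* t_device = nullptr;

// Binds a device to the calling thread for the lifetime of the scope.
class ScopedDevice {
public:
    explicit ScopedDevice(Device* device) : m_previous(t_device) { t_device = device; }
    ~ScopedDevice() { t_device = m_previous; }
    ScopedDevice(const ScopedDevice&) = delete;
    ScopedDevice& operator=(const ScopedDevice&) = delete;
private:
    Device* m_previous;
};

// Existing rendering code: unchanged apart from where it finds the device.
void RenderObject() {
    assert(t_device != nullptr);
    t_device->Draw();
}

// Renders `count` objects on a new thread against that thread's own device.
void RenderOnWorker(Device* recorder, int count) {
    std::thread worker([recorder, count] {
        ScopedDevice bind(recorder);
        for (int i = 0; i < count; ++i) {
            RenderObject();
        }
    });
    worker.join();
}
```

The binding made on the worker does not affect the main thread, so playback code can keep using the real device while recorders run.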


Integration into DX Samples

• Instancing

– Effects, shader constants

• Textures tutorial

– Simple, added multithreading

• Stress test

– Fork and join multithreading, with optional:

• Frame delay of playback

• Draw call count

• CPU and memory access

• Recorder thread count


Stress Test Information

• Render call contains:

– Matrices computed with D3DX calls * 3

– SetTransform * 3

– SetRenderState

– SetTexture

– SetTextureStageState * 8

– SetStreamSource

– SetFVF

– DrawPrimitive


CPU Busy Loops

• Draw call CPU cost varies in real applications

• Stress test simulates cost with CPU Busy Loops

– Scattered reads from a large buffer in memory

– Perform some logic, integer, and floating point operations

• Gamebryo render on DX9: 100-200 μs

– (on a Pentium 4, 3 GHz, NVIDIA 7800)

• Stress test can simulate Gamebryo render calls with 0-200 loops
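A busy loop of the kind described above might look like this sketch. It is not the stress test's code: `BusyLoops`, the pseudo-random index update, and the specific operations are assumptions chosen to combine scattered reads with some logic, integer, and floating point work. The time per loop depends on the machine, so the loop count would need calibrating against the target cost.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simulates per-draw-call CPU cost. Returns a checksum so the compiler
// cannot discard the work.
std::uint32_t BusyLoops(const std::vector<std::uint32_t>& memory, int loops,
                        std::uint32_t seed) {
    assert(!memory.empty());
    std::uint32_t index = seed;
    std::uint32_t checksum = 0;
    float accumulator = 1.0f;
    for (int i = 0; i < loops; ++i) {
        // Linear congruential step: scatters reads across the buffer.
        index = index * 1664525u + 1013904223u;
        std::uint32_t value = memory[index % memory.size()];
        // Some logic and integer work.
        if (value & 1u) {
            checksum ^= value;
        } else {
            checksum += value;
        }
        // Some floating point work.
        accumulator = accumulator * 1.0001f + 0.5f;
    }
    return checksum + static_cast<std::uint32_t>(accumulator);
}
```

A buffer much larger than the CPU caches makes the scattered reads behave more like real scene traversal than a tight arithmetic loop would.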


DX Sample Stress Test Demo


DX Call Cost vs. Recorder Cost

• A render call on the DirectX device costs about 13 times as much as on the command buffer recorder

– DX: 92 μs

– Recorder: 7 μs

– (on a Pentium 4, 3 GHz, NVIDIA 7800)


Thread Profiler Quadcore

1 Recorder Thread

• CPU Busy Loops: 110

[Thread profiler capture. Legend: Playback, Record]


Thread Profiler Quadcore

4 Recorder Threads


FPS by Threads and Computer

[Chart: sum of FPS (0 to 70) versus threads (0 to 5), with CPU Busy Loops = 150 and DrawPrimitives = 1936. Series, labeled cores - computer - GPU: 2 - XP-A - Intel G965 Express; 2 - XP-A - NVIDIA GeForce 7800 GTX; 2 - XP-B - NVIDIA GeForce 8800 GTS 512; 4 - XP-C - NVIDIA GeForce 8800 GT; 4 - Vista-A - NVIDIA GeForce 8800 GTX.]


Definition: Performance Ratio

• Charts that follow use Performance Ratio = FPS_test / FPS_baseline

• Normalized result

• Useful for comparisons while varying

– Number of draw calls

– CPU busy loops


Perf by Threads & Busy Loops

[Chart: average FPS performance ratio (0 to 4) versus threads (0 to 5), one series per CPU busy-loop count (0, 50, 100, 150, 200, 250). Computer XP-C, 4 cores, 1936 DrawPrimitives.]


Perf by Threads & Busy Loops

[Chart: average FPS performance ratio (0 to 4) versus threads (0 to 5), one series per combination of DrawPrimitive count and CPU busy-loop count (100, 196, and 289 draws at 0 to 250 loops in steps of 50; the extracted legend ends at "400 - 0"). Computer XP-C, 4 cores.]


Perf by Draws & Busy Loops

[Chart: average FPS performance ratio (0 to 4) versus DrawPrimitive count (100 to 1936), one series per CPU busy-loop count (0, 50, 100, 150, 200, 250). Computer XP-C, 4 cores, 4 threads.]


Perf by Busy Loops & Draws

[Chart: average FPS performance ratio (0 to 4) versus CPU busy loops (0 to 250), one series per DrawPrimitive count (100, 400, 784, 1156, 1600, 1936). Computer XP-C, 4 cores, 4 threads.]


Dual Core Results

[Chart: average FPS performance ratio (0 to 2) versus threads (0 to 3), one series per GPU and CPU busy-loop count: Intel G965 Express, NVIDIA GeForce 7800 GTX, and NVIDIA GeForce 8800 GTS 512, each at 50 and 150 loops. 2 cores, 1936 DrawPrimitives.]


Future Work

• Resource management facilitated through command buffer, instead of application logic

• Optimization of command buffers by reordering order-independent draw calls

• DirectX10


Open Source Library

• Emergent has open sourced the command buffer library

– Command buffer serialization

– Recording device

– Playback class

– Redirecting device

– EffectStateManager

– DX9 only so far


Thank You. Questions?

[email protected]

– Co-Developer: Bo Wilson

• For code & presentation

Google: parallel rendering scheib