Optimizing Game Architectures with Task-based Parallelism
Brad Werth, Intel Senior Software Engineer
Parallelism in games is no longer optional
The unending quest for realism in games is causing game content and gameplay to become increasingly complex.
More complicated scenes + more complicated behavior = increased computation.
CPUs and GPUs are no longer competing on clock speed, but on degree of parallelism.
High-end games require threading.
You can't go home again.
Threaded architectures for games are challenging to design
Techniques for threading individual computations/systems are well-known, but...
the techniques often have inefficient interactions.
games rely on middleware to provide some functionality – more potential conflict.
the moment-to-moment workload can change dramatically.
the variety of CPU topologies complicates tuning.
Task-based parallelism is a viable way out of this mess. But first, let's gaze into the abyss...
A threaded game architecture – full of pain and oversubscription
Particles array could be partitioned. One-off jobs run on "job threads". Physics threads are created by middleware. Sound mixing is on a dedicated thread. Bones/skinning is a Directed Acyclic Graph.
[Diagram: Particles, Jobs, Bones, Physics, and Sound each running on separate threads, contending unpredictably for cores]
Tasks are an efficient method for handling all of this parallel work
With a task scheduler, and with all of this work decomposed into tasks, then...
one thread pool can process all work.
oversubscription will be avoided by using the same threads for all parallel work.
the game will scale well to different core topologies without painful tuning.
Tasks can do it!
Task-based parallelism is agile threading
A thread is...
run on the OS.
able to be pre-empted.
expected to wait on events.
most efficient with some oversubscription.
optimized for a specific core topology.

A task is...
run on a thread pool.
run to completion.
heavily penalized for blocking.
efficient by avoiding oversubscription.
able to adapt to any number of threads/cores.
Tasks are the uninterrupted portions of threaded work
[Diagram: a thread's timeline – Setup, Texture Lookup, Data Parallelism, Processing – with each uninterrupted segment a potential task]
Tasks can be arranged in a dependency graph
[Diagram: the same segments – Setup, Texture Lookup, Data Parallelism, Processing – arranged as nodes in a dependency graph]
Dependency graph can be mapped to a thread pool
Lots of work means lots of tasks which fill in the gaps in the thread pool.
The decomposition of tasks and mapping to threads is the job of the task scheduler.
Task schedulers have similar ingredients but different flavors
The Cilk scheduler has been extremely influential.
Most have task queues per thread to avoid contention (often multiple queues per thread).
Cache-aware distribution of work is a key performance feature.
Most prevent direct manipulation of queues.
The APIs vary in some ways:
Constructive schedulers define tasks a priori.
Reductive schedulers subdivide tasks in flight.
Event-driven schedulers trigger off of I/O.
Computation schedulers are triggered manually.
Threading Building Blocks is Intel's open source task-based scheduler
TBB is a reductive, computation scheduler designed to...
be cross-platform (Windows*, OS X*, Linux, Xbox 360*).
simplify data parallelism coding.
provide scalability and high performance.
TBB has a high-level API for easy parallelism and a low-level API for control.
The API is not so low-level that it exposes threads or queues.
*Other names and brands may be claimed as the property of others.
Enough! Let's look at code
This talk shows code solutions to threaded game architecture problems.
Common threading patterns in games are decomposed into tasks, using the TBB API.
The code is available:
http://software.intel.com/file/14997
Start with the easy stuff – turn independent loops into tasks
The TBB high-level API provides parallel_for().
Behold, the humble for loop:
for(int i = 0; i < ELEMENT_MAX; ++i)
{
    doSomethingWith(element[i]);
}
Using parallel_for() is a 2-step process; step 1 is to objectify the loop
class DoSomethingContext
{
public:
    // parallel_for requires a public, const function-call operator
    void operator()(const tbb::blocked_range<int> &range) const
    {
        for(int i = range.begin(); i != range.end(); ++i)
        {
            doSomethingWith(element[i]);
        }
    }
};
parallel_for() step 2: invoke the objectified loop with a range
tbb::parallel_for(tbb::blocked_range<int>(0, ELEMENT_MAX),
                  *pDoSomethingContext);
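For comparison, here is a minimal standard-library sketch of what parallel_for() is doing underneath: partition the range and run the chunks on concurrent threads. `parallelForSketch`, `element`, and `doSomethingWith` are illustrative stand-ins for the slide's code, and a real parallel_for additionally subdivides recursively and load-balances.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Hypothetical element buffer and per-element work, standing in for
// element[] and doSomethingWith() from the slides.
static std::vector<int> element(1000, 1);
static void doSomethingWith(int &e) { e *= 2; }

// Split [0, count) into one contiguous chunk per hardware thread and
// run the chunks concurrently -- the shape parallel_for() produces.
void parallelForSketch(int count)
{
    unsigned nThreads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    int chunk = (count + nThreads - 1) / nThreads;
    for (unsigned t = 0; t < nThreads; ++t)
    {
        int begin = t * chunk;
        int end = std::min(count, begin + chunk);
        if (begin >= end) break;
        pool.emplace_back([begin, end]() {
            for (int i = begin; i < end; ++i)
                doSomethingWith(element[i]);
        });
    }
    for (auto &th : pool) th.join();
}
```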
For more general task decomposition problems, we need a low-level API...
TBB low-level API: work trees with blocking calls and/or continuations
[Diagram: two work trees of Root/Task nodes – one processed with spawn-and-wait, where blocking calls go down the tree; one processed with continuations, which go up the tree]
Work trees can implement common game threading patterns
The TBB low-level API creates and processes trees of work – each node is a task.
Work trees of tasks can be made to process:
Callbacks
Promises
Synchronized callbacks
Long, low priority operations
Directed acyclic graphs
We'll look at how these patterns can be decomposed into tasks using the TBB low-level API.
Callbacks – send it off and never wait
Callbacks are function pointers executed on another thread.
Execution begins immediately.
No waiting on individual callbacks – can wait in aggregate.
void doCallback(FunctionPointer fFunc, void *pParam);
Code and tree: Callback

void doCallback(FunctionPointer fCallback, void *pParam)
{
    // allocation with "placement new" syntax
    CallbackTask *pCallbackTask = new(
        s_pCallbackRoot->allocate_additional_child_of( *s_pCallbackRoot )
    ) CallbackTask(fCallback, pParam);
    s_pCallbackRoot->spawn(*pCallbackTask);
}
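The same fire-and-forget pattern can be sketched with only the standard library. `doCallbackSketch`, `waitForAllCallbacks`, and `s_outstanding` are illustrative names, and a detached thread stands in for a spawned task; the counter plays the role of the TBB root task's reference count.

```cpp
#include <atomic>
#include <functional>
#include <thread>

// Count of callbacks still in flight -- the "wait in aggregate" state.
static std::atomic<int> s_outstanding{0};

void doCallbackSketch(std::function<void()> fCallback)
{
    ++s_outstanding;                 // register before spawning
    std::thread([fCallback]() {
        fCallback();                 // runs on another thread, no reply
        --s_outstanding;             // completion reported in aggregate
    }).detach();
}

void waitForAllCallbacks()
{
    while (s_outstanding.load() > 0)
        std::this_thread::yield();   // spin politely until all are done
}
```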
Callbacks are simple and powerful, but have limits
No waiting! Callbacks are run on demand.
No waiting? A callback has to report its own completion.
No waiting?! Need special-case code to run on a 1-core system.
If this is a deal-breaker, there are other options...
Promises – come back for it later
Promises are an evolution of Callbacks.
Like Callbacks:
Promises are function pointers executed on another thread.
Execution begins immediately.
Unlike Callbacks:
Promises provide a method for efficient waiting.
Promise *doPromise(FunctionPointer fFunc, void *pParam);
Code and tree: Promise setup

void doPromise(FunctionPointer fCallback, void *pParam, Promise *pPromise)
{
    // allocation with "placement new" syntax
    tbb::task *pParentTask = new( tbb::task::allocate_root() ) tbb::empty_task();
    pPromise->setRoot(pParentTask);
    PromiseTask *pPromiseTask = new( pParentTask->allocate_child() )
        PromiseTask(fCallback, pParam, pPromise);
    pParentTask->set_ref_count(2);
    pParentTask->spawn(*pPromiseTask);
}
Code and tree: Promise execution

void Promise::waitUntilDone()
{
    if(m_pRoot != NULL)
    {
        // the lock must be a named object -- an unnamed temporary
        // would be destroyed (and unlocked) immediately
        tbb::spin_mutex::scoped_lock lock(m_tMutex);
        if(m_pRoot != NULL)
        {
            m_pRoot->wait_for_all();
            m_pRoot->destroy(*m_pRoot);
            m_pRoot = NULL;
        }
    }
}
Promises seem almost too good to be true
Blocking wait only if result is not available when requested.
If wait blocks, the current thread actively contributes to completion.
2 files, 3 classes, ~150 lines of code.
Robust Promise systems can also:
Cancel jobs in progress
Get partial progress updates
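The C++ standard library's std::async/std::future pair expresses essentially the same pattern; this sketch shows the shape (`slowSquare` and `doPromiseSketch` are illustrative names). One difference worth noting: a thread blocked in future::get() does not help execute other tasks the way a TBB thread blocked in wait_for_all() does.

```cpp
#include <future>

// Stand-in for a long-running job.
int slowSquare(int x)
{
    return x * x;
}

// std::async starts the work immediately on another thread; the
// returned future is the "come back for it later" handle, and
// future::get() is the efficient wait.
std::future<int> doPromiseSketch(int x)
{
    return std::async(std::launch::async, slowSquare, x);
}
```

Usage: `auto f = doPromiseSketch(7);` then do other work, then `f.get()`.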
Synchronized Call – wait until all threads call it exactly once
Synchronized Calls can be useful for:
Initialization of thread-specific data
Coordination with some middleware
Instrumentation and profiling
Trivial if you have direct access to threads, but trickier with a task-based system.
void doSynchronizedCallback( FunctionPointer fFunc, void *pParam);
Code and tree: Synchronized Call setup

void doSynchronizedCallback(FunctionPointer fCallback, void *pParam, int iThreads)
{
    tbb::atomic<int> tAtomicCount;
    tAtomicCount = iThreads;
    tbb::task *pRootTask = new( tbb::task::allocate_root() ) tbb::empty_task;
    tbb::task_list tList;
    for(int i = 0; i < iThreads; i++)
    {
        tbb::task *pSynchronizeTask = new( pRootTask->allocate_child() )
            SynchronizeTask(fCallback, pParam, &tAtomicCount);
        tList.push_back(*pSynchronizeTask);
    }
    pRootTask->set_ref_count(iThreads + 1);
    pRootTask->spawn_and_wait_for_all(tList);
    pRootTask->destroy(*pRootTask);
}
Code and tree: Synchronized Call execution

tbb::task *SynchronizeTask::execute()
{
    m_fCallback(m_pParam);
    m_pAtomicCount->fetch_and_decrement();
    while(*m_pAtomicCount > 0)
    {
        // yield while waiting
        tbb::this_tbb_thread::yield();
    }
    return NULL;
}
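The same atomic-countdown barrier, sketched with plain standard-library threads (`synchronizedCallSketch` is an illustrative name; a real scheduler would run these as tasks on its pool):

```cpp
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Every worker runs the callback exactly once, then spins until all
// workers have run it -- mirroring the SynchronizeTask countdown.
void synchronizedCallSketch(std::function<void()> fCallback, int iThreads)
{
    std::atomic<int> remaining{iThreads};
    std::vector<std::thread> workers;
    for (int i = 0; i < iThreads; ++i)
    {
        workers.emplace_back([&]() {
            fCallback();                       // each thread calls once
            --remaining;
            while (remaining.load() > 0)
                std::this_thread::yield();     // barrier: wait for the rest
        });
    }
    for (auto &w : workers) w.join();
}
```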
Synchronized Calls are useful, but not efficient
Don't make Synchronized Calls in the middle of other work.
Performance penalty is negated if work queue is empty.
Long, low priority operation – hide some time-slicing
Many games have long operations that run in parallel to the main computation:
Asset loading/decompression
Sound mixing
Texture tweaking
AI pathfinding
It's not necessary to create a new thread to handle these operations!
Use the time-honored technique of time-slicing.
Code and tree: Long, low priority operation

tbb::task *BaseTask::execute()
{
    if(s_tLowPriorityTaskFlag.compare_and_swap(false, true) == true)
    {
        // allocation with "placement new" syntax
        tbb::task *pLowPriorityTask = new(
            this->allocate_additional_child_of( *s_pLowPriorityRoot )
        ) LowPriorityTask();
        spawn(*pLowPriorityTask);
    }
    // spawn other children...
    return NULL;
}
Long, low priority operations are tricky to get right
Task-based schedulers won’t swap out a task that runs a long time.
A low-priority task can’t reschedule itself naively, or it will create an infinite loop.
Even if the scheduler is designed with priority in mind, priority only matters when a thread runs dry.
This approach doesn’t guarantee any minimum frequency of execution.
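One way to avoid both the long-running-task problem and the naive-reschedule loop is to expose the long operation as bounded slices. This sketch shows the shape; `LongOperation`, `runSlice`, and `processItem` are illustrative names, not from the talk.

```cpp
// Instead of a dedicated thread, the long operation exposes runSlice(),
// which does a bounded amount of work and returns, so a scheduler can
// interleave it with higher-priority tasks.
class LongOperation
{
public:
    explicit LongOperation(int totalItems) : m_next(0), m_total(totalItems) {}

    // Process up to sliceSize items; returns true while work remains,
    // so the caller knows whether to schedule another slice later.
    bool runSlice(int sliceSize)
    {
        int end = m_next + sliceSize;
        if (end > m_total) end = m_total;
        for (; m_next < end; ++m_next)
            processItem(m_next);      // one unit of the long operation
        return m_next < m_total;
    }

    int itemsDone() const { return m_next; }

private:
    void processItem(int) { /* e.g. decompress one asset block */ }
    int m_next, m_total;
};
```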
Directed Acyclic Graph – everyone's favorite paradigm
Directed Acyclic Graphs are popular for executing workflows and kernels in games.
Interface varies, but generally construct a DAG and then execute and wait.
How can work trees represent a DAG?
Tree: Directed Acyclic Graph

[Diagram: a chain of Root/More work trees, each spawning the next, together representing the DAG]
Directed Acyclic Graph gets the job done
The DAGs created by this approach are destroyed by waiting on them.
Persistent DAGs are possible, for re-use across several frames.
A scheduler could be DAG-based to begin with, making this trivial.
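The ref-count mechanics behind DAG execution can be sketched in a few lines: each node counts its unfinished predecessors, and completing a node releases any successor whose count reaches zero. `DagNode`, `addEdge`, and `runReady` are illustrative names, and ready nodes here run inline for clarity where a real scheduler would spawn them as tasks.

```cpp
#include <atomic>
#include <functional>
#include <vector>

// Each node's `pending` counter plays the role of a TBB task ref count.
struct DagNode
{
    std::function<void()> work;
    std::vector<DagNode*> successors;
    std::atomic<int> pending{0};
};

void addEdge(DagNode &from, DagNode &to)
{
    from.successors.push_back(&to);
    ++to.pending;                     // one more unfinished predecessor
}

void runReady(DagNode &node)
{
    node.work();
    for (DagNode *next : node.successors)
        if (--next->pending == 0)     // last predecessor just finished
            runReady(*next);          // a scheduler would spawn a task here
}

void runDag(std::vector<DagNode*> &roots)
{
    for (DagNode *root : roots)
        if (root->pending == 0)
            runReady(*root);
}
```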
Remember, get the code from: http://software.intel.com/file/14997
Soon, rendering may also be decomposable into tasks
DirectX* 11 is designed for use on multi-core CPUs.
Multiple threads can draw to local DirectX contexts ("devices"), and those draw calls are aggregated once per frame.
All those draw calls can be done as tasks!
All the threads can be initialized with a DirectX context using Synchronized Calls!
This is an extremely positive development; Intel will produce lots of samples to help promote it to the industry.
*Other names and brands may be claimed as the property of others.
Our sample architecture can be handled by tasks top-to-bottom
Particles partitioning handled by parallel_for(). One-off jobs using Callbacks or Promises. Physics uses job threads via Synchronized Calls. Sound mixing is time-sliced as Low-Priority job. Bones/skinning DAG uses the job threads, too.
[Diagram: the same architecture – Particles, Jobs, Bones, Physics, Sound – now all checked off as running on the task scheduler's thread pool]
TBB has other helpful features we didn't cover
Beyond the high-level and low-level threading APIs, TBB has:
Atomic variables
Scalable memory allocators
Efficient thread-safe containers (vector, hash, etc.)
High-precision time intervals
Core count detection
Tunable thread pool sizes
Hardware thread abstraction
Using task parallelism will ensure continued game performance
Task-based parallelism scales performance on varying architectures.
Break loops into tasks for the maximum performance benefit.
Use tasks to implement a game's preferred threading paradigms.