Optimizing Game Architectures with Task-based Parallelism
Brad Werth, Intel Senior Software Engineer
Parallelism in games is no longer optional
The unending quest for realism in games is causing game content and gameplay to become increasingly complex.
More complicated scenes + more complicated behavior = increased computation.
CPUs and GPUs are no longer competing on clock speed, but on degree of parallelism.
High-end games require threading.
You can't go home again.
Threaded architectures for games are challenging to design
Techniques for threading individual computations/systems are well-known, but...
the techniques often have inefficient interactions.
games rely on middleware to provide some functionality – more potential conflict.
the moment-to-moment workload can change dramatically.
the variety of CPU topologies complicates tuning.
Task-based parallelism is a viable way out of this mess. But first, let's gaze into the abyss...
A threaded game architecture – full of pain and oversubscription
Particles array could be partitioned. One-off jobs run on "job threads". Physics threads are created by middleware. Sound mixing is on a dedicated thread. Bones/skinning is a Directed Acyclic Graph.
[Diagram: Particles, Jobs, Bones, Physics, and Sound each running on separate threads, contending unpredictably for cores]
Tasks are an efficient method for handling all of this parallel work
With a task scheduler, and with all of this work decomposed into tasks, then...
one thread pool can process all work.
oversubscription will be avoided by using the same threads for all parallel work.
the game will scale well to different core topologies without painful tuning.
Tasks can do it!
Task-based parallelism is agile threading
A thread is...
run on the OS.
able to be pre-empted.
expected to wait on events.
most efficient with some oversubscription.
optimized for a specific core topology.

A task is...
run on a thread pool.
run to completion.
heavily penalized for blocking.
efficient by avoiding oversubscription.
able to adapt to any number of threads/cores.
Tasks are the uninterrupted portions of threaded work
[Diagram: a thread's timeline – Setup, Texture Lookup, Data Parallelism, Processing – with each uninterrupted segment a potential task]
Tasks can be arranged in a dependency graph
[Diagram: the same segments – Setup, Texture Lookup, Data Parallelism, Processing – arranged as nodes in a dependency graph]
Dependency graph can be mapped to a thread pool
Lots of work means lots of tasks which fill in the gaps in the thread pool.
The decomposition of tasks and mapping to threads is the job of the task scheduler.
Task schedulers have similar ingredients but different flavors
The Cilk scheduler has been extremely influential.
Most have task queues per thread to avoid contention (often multiple queues per thread).
Cache-aware distribution of work is a key performance feature.
Most prevent direct manipulation of queues.
The APIs vary in some ways:
Constructive schedulers define tasks a priori.
Reductive schedulers subdivide tasks in flight.
Event-driven schedulers trigger off of I/O.
Computation schedulers are triggered manually.
Threading Building Blocks is Intel's open source task-based scheduler
TBB is a reductive, computation scheduler designed to...
be cross-platform (Windows*, OS X*, Linux, Xbox 360*).
simplify data parallelism coding.
provide scalability and high performance.
TBB has a high-level API for easy parallelism and a low-level API for control.
The API is not so low-level that it exposes threads or queues.
*Other names and brands may be claimed as the property of others.
Enough! Let's look at code
This talk shows code solutions to threaded game architecture problems.
Common threading patterns in games are decomposed into tasks, using the TBB API.
The code is available:
http://software.intel.com/file/14997
Start with the easy stuff – turn independent loops into tasks
The TBB high-level API provides parallel_for().
Behold, the humble for loop:
for(int i = 0; i < ELEMENT_MAX; ++i)
{
    doSomethingWith(element[i]);
}
Using parallel_for() is a 2-step process; step 1 is to objectify the loop
class DoSomethingContext
{
public:
    // parallel_for requires a public, const function-call operator
    void operator()(const tbb::blocked_range<int> &range) const
    {
        for(int i = range.begin(); i != range.end(); ++i)
        {
            doSomethingWith(element[i]);
        }
    }
};
parallel_for() step 2: invoke the objectified loop with a range
tbb::parallel_for(tbb::blocked_range<int>(0, ELEMENT_MAX),
                  *pDoSomethingContext);
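For comparison, here is a minimal standard-library sketch of what parallel_for() is doing underneath: partition the range and run the chunks on concurrent threads. `parallelForSketch`, `element`, and `doSomethingWith` are illustrative stand-ins for the slide's code, and a real parallel_for additionally subdivides recursively and load-balances.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Hypothetical element buffer and per-element work, standing in for
// element[] and doSomethingWith() from the slides.
static std::vector<int> element(1000, 1);
static void doSomethingWith(int &e) { e *= 2; }

// Split [0, count) into one contiguous chunk per hardware thread and
// run the chunks concurrently -- the shape parallel_for() produces.
void parallelForSketch(int count)
{
    unsigned nThreads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    int chunk = (count + nThreads - 1) / nThreads;
    for (unsigned t = 0; t < nThreads; ++t)
    {
        int begin = t * chunk;
        int end = std::min(count, begin + chunk);
        if (begin >= end) break;
        pool.emplace_back([begin, end]() {
            for (int i = begin; i < end; ++i)
                doSomethingWith(element[i]);
        });
    }
    for (auto &th : pool) th.join();
}
```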
For more general task decomposition problems, we need a low-level API...
TBB low-level API: work trees with blocking calls and/or continuations
[Diagram: two work trees of Root/Task nodes – one processed with spawn-and-wait, where blocking calls go down the tree; one processed with continuations, which go up the tree]
Work trees can implement common game threading patterns
The TBB low-level API creates and processes trees of work – each node is a task.
Work trees of tasks can be made to process:
Callbacks
Promises
Synchronized callbacks
Long, low priority operations
Directed acyclic graphs
We'll look at how these patterns can be decomposed into tasks using the TBB low-level API.
Callbacks – send it off and never wait
Callbacks are function pointers executed on another thread.
Execution begins immediately.
No waiting on individual callbacks – can wait in aggregate.
void doCallback(FunctionPointer fFunc, void *pParam);
Code and tree: Callback

void doCallback(FunctionPointer fCallback, void *pParam)
{
    // allocation with "placement new" syntax
    CallbackTask *pCallbackTask = new(
        s_pCallbackRoot->allocate_additional_child_of( *s_pCallbackRoot )
    ) CallbackTask(fCallback, pParam);
    s_pCallbackRoot->spawn(*pCallbackTask);
}
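The same fire-and-forget pattern can be sketched with only the standard library. `doCallbackSketch`, `waitForAllCallbacks`, and `s_outstanding` are illustrative names, and a detached thread stands in for a spawned task; the counter plays the role of the TBB root task's reference count.

```cpp
#include <atomic>
#include <functional>
#include <thread>

// Count of callbacks still in flight -- the "wait in aggregate" state.
static std::atomic<int> s_outstanding{0};

void doCallbackSketch(std::function<void()> fCallback)
{
    ++s_outstanding;                 // register before spawning
    std::thread([fCallback]() {
        fCallback();                 // runs on another thread, no reply
        --s_outstanding;             // completion reported in aggregate
    }).detach();
}

void waitForAllCallbacks()
{
    while (s_outstanding.load() > 0)
        std::this_thread::yield();   // spin politely until all are done
}
```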
Callbacks are simple and powerful, but have limits
No waiting! Callbacks are run on demand.
No waiting? A callback has to report its own completion.
No waiting?! Need special-case code to run on a 1-core system.
If this is a deal-breaker, there are other options...
Promises – come back for it later
Promises are an evolution of Callbacks.
Like Callbacks:
Promises are function pointers executed on another thread.
Execution begins immediately.
Unlike Callbacks:
Promises provide a method for efficient waiting.
Promise *doPromise(FunctionPointer fFunc, void *pParam);
Code and tree: Promise setup

void doPromise(FunctionPointer fCallback, void *pParam, Promise *pPromise)
{
    // allocation with "placement new" syntax
    tbb::task *pParentTask = new( tbb::task::allocate_root() ) tbb::empty_task();
    pPromise->setRoot(pParentTask);
    PromiseTask *pPromiseTask = new( pParentTask->allocate_child() )
        PromiseTask(fCallback, pParam, pPromise);
    pParentTask->set_ref_count(2);
    pParentTask->spawn(*pPromiseTask);
}
Code and tree: Promise execution

void Promise::waitUntilDone()
{
    if(m_pRoot != NULL)
    {
        // the lock must be a named object -- an unnamed temporary
        // would be destroyed (and unlocked) immediately
        tbb::spin_mutex::scoped_lock lock(m_tMutex);
        if(m_pRoot != NULL)
        {
            m_pRoot->wait_for_all();
            m_pRoot->destroy(*m_pRoot);
            m_pRoot = NULL;
        }
    }
}
Promises seem almost too good to be true
Blocking wait only if result is not available when requested.
If wait blocks, the current thread actively contributes to completion.
2 files, 3 classes, ~150 lines of code.
Robust Promise systems can also:
Cancel jobs in progress
Get partial progress updates
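The C++ standard library's std::async/std::future pair expresses essentially the same pattern; this sketch shows the shape (`slowSquare` and `doPromiseSketch` are illustrative names). One difference worth noting: a thread blocked in future::get() does not help execute other tasks the way a TBB thread blocked in wait_for_all() does.

```cpp
#include <future>

// Stand-in for a long-running job.
int slowSquare(int x)
{
    return x * x;
}

// std::async starts the work immediately on another thread; the
// returned future is the "come back for it later" handle, and
// future::get() is the efficient wait.
std::future<int> doPromiseSketch(int x)
{
    return std::async(std::launch::async, slowSquare, x);
}
```

Usage: `auto f = doPromiseSketch(7);` then do other work, then `f.get()`.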
Synchronized Call – wait until all threads call it exactly once
Synchronized Calls can be useful for:
Initialization of thread-specific data
Coordination with some middleware
Instrumentation and profiling
Trivial if you have direct access to threads, but trickier with a task-based system.
void doSynchronizedCallback( FunctionPointer fFunc, void *pParam);
Code and tree: Synchronized Call setup

void doSynchronizedCallback(FunctionPointer fCallback, void *pParam, int iThreads)
{
    tbb::atomic<int> tAtomicCount;
    tAtomicCount = iThreads;
    tbb::task *pRootTask = new( tbb::task::allocate_root() ) tbb::empty_task;
    tbb::task_list tList;
    for(int i = 0; i < iThreads; i++)
    {
        tbb::task *pSynchronizeTask = new( pRootTask->allocate_child() )
            SynchronizeTask(fCallback, pParam, &tAtomicCount);
        tList.push_back(*pSynchronizeTask);
    }
    pRootTask->set_ref_count(iThreads + 1);
    pRootTask->spawn_and_wait_for_all(tList);
    pRootTask->destroy(*pRootTask);
}
Code and tree: Synchronized Call execution

tbb::task *SynchronizeTask::execute()
{
    m_fCallback(m_pParam);
    m_pAtomicCount->fetch_and_decrement();
    while(*m_pAtomicCount > 0)
    {
        // yield while waiting
        tbb::this_tbb_thread::yield();
    }
    return NULL;
}
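The same atomic-countdown barrier, sketched with plain standard-library threads (`synchronizedCallSketch` is an illustrative name; a real scheduler would run these as tasks on its pool):

```cpp
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Every worker runs the callback exactly once, then spins until all
// workers have run it -- mirroring the SynchronizeTask countdown.
void synchronizedCallSketch(std::function<void()> fCallback, int iThreads)
{
    std::atomic<int> remaining{iThreads};
    std::vector<std::thread> workers;
    for (int i = 0; i < iThreads; ++i)
    {
        workers.emplace_back([&]() {
            fCallback();                       // each thread calls once
            --remaining;
            while (remaining.load() > 0)
                std::this_thread::yield();     // barrier: wait for the rest
        });
    }
    for (auto &w : workers) w.join();
}
```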
Synchronized Calls are useful, but not efficient
Don't make Synchronized Calls in the middle of other work.
Performance penalty is negated if work queue is empty.
Long, low priority operation – hide some time-slicing
Many games have long operations that run in parallel to the main computation:
Asset loading/decompression
Sound mixing
Texture tweaking
AI pathfinding
It's not necessary to create a new thread to handle these operations!
Use the time-honored technique of time-slicing.
Code and tree: Long, low priority operation

tbb::task *BaseTask::execute()
{
    if(s_tLowPriorityTaskFlag.compare_and_swap(false, true) == true)
    {
        // allocation with "placement new" syntax
        tbb::task *pLowPriorityTask = new(
            this->allocate_additional_child_of( *s_pLowPriorityRoot )
        ) LowPriorityTask();
        spawn(*pLowPriorityTask);
    }
    // spawn other children...
    return NULL;
}
Long, low priority operations are tricky to get right
Task-based schedulers won’t swap out a task that runs a long time.
A low-priority task can’t reschedule itself naively, or it will create an infinite loop.
Even if the scheduler is designed with priority in mind, priority only matters when a thread runs dry.
This approach doesn’t guarantee any minimum frequency of execution.
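One way to avoid both the long-running-task problem and the naive-reschedule loop is to expose the long operation as bounded slices. This sketch shows the shape; `LongOperation`, `runSlice`, and `processItem` are illustrative names, not from the talk.

```cpp
// Instead of a dedicated thread, the long operation exposes runSlice(),
// which does a bounded amount of work and returns, so a scheduler can
// interleave it with higher-priority tasks.
class LongOperation
{
public:
    explicit LongOperation(int totalItems) : m_next(0), m_total(totalItems) {}

    // Process up to sliceSize items; returns true while work remains,
    // so the caller knows whether to schedule another slice later.
    bool runSlice(int sliceSize)
    {
        int end = m_next + sliceSize;
        if (end > m_total) end = m_total;
        for (; m_next < end; ++m_next)
            processItem(m_next);      // one unit of the long operation
        return m_next < m_total;
    }

    int itemsDone() const { return m_next; }

private:
    void processItem(int) { /* e.g. decompress one asset block */ }
    int m_next, m_total;
};
```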
Directed Acyclic Graph – everyone's favorite paradigm
Directed Acyclic Graphs are popular for executing workflows and kernels in games.
Interface varies, but generally construct a DAG and then execute and wait.
How can work trees represent a DAG?
Tree: Directed Acyclic Graph

[Diagram: a chain of Root/More work trees, each spawning the next, together representing the DAG]
Directed Acyclic Graph gets the job done
The DAGs created by this approach are destroyed by waiting on them.
Persistent DAGs are possible, for re-use across several frames.
A scheduler could be DAG-based to begin with, making this trivial.
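The ref-count mechanics behind DAG execution can be sketched in a few lines: each node counts its unfinished predecessors, and completing a node releases any successor whose count reaches zero. `DagNode`, `addEdge`, and `runReady` are illustrative names, and ready nodes here run inline for clarity where a real scheduler would spawn them as tasks.

```cpp
#include <atomic>
#include <functional>
#include <vector>

// Each node's `pending` counter plays the role of a TBB task ref count.
struct DagNode
{
    std::function<void()> work;
    std::vector<DagNode*> successors;
    std::atomic<int> pending{0};
};

void addEdge(DagNode &from, DagNode &to)
{
    from.successors.push_back(&to);
    ++to.pending;                     // one more unfinished predecessor
}

void runReady(DagNode &node)
{
    node.work();
    for (DagNode *next : node.successors)
        if (--next->pending == 0)     // last predecessor just finished
            runReady(*next);          // a scheduler would spawn a task here
}

void runDag(std::vector<DagNode*> &roots)
{
    for (DagNode *root : roots)
        if (root->pending == 0)
            runReady(*root);
}
```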
Remember, get the code from: http://software.intel.com/file/14997
Soon, rendering may also be decomposable into tasks
DirectX* 11 is designed for use on multi-core CPUs.
Multiple threads can draw to local DirectX contexts ("devices"), and those draw calls are aggregated once per frame.
All those draw calls can be done as tasks!
All the threads can be initialized with a DirectX context using Synchronized Calls!
This is an extremely positive development; Intel will produce lots of samples to help promote it to the industry.
*Other names and brands may be claimed as the property of others.
Our sample architecture can be handled by tasks top-to-bottom
Particles partitioning handled by parallel_for(). One-off jobs using Callbacks or Promises. Physics uses job threads via Synchronized Calls. Sound mixing is time-sliced as Low-Priority job. Bones/skinning DAG uses the job threads, too.
[Diagram: the same architecture – Particles, Jobs, Bones, Physics, Sound – now all checked off as running on the task scheduler's thread pool]
TBB has other helpful features we didn't cover
Beyond the high-level and low-level threading APIs, TBB has:
Atomic variables
Scalable memory allocators
Efficient thread-safe containers (vector, hash, etc.)
High-precision time intervals
Core count detection
Tunable thread pool sizes
Hardware thread abstraction
Using task parallelism will ensure continued game performance
Task-based parallelism scales performance on varying architectures.
Break loops into tasks for the maximum performance benefit.
Use tasks to implement a game's preferred threading paradigms.