18 multiprocessor game loops - cognitive science departmentdestem/gamearch/18.pdf · multiprocessor...

Game Architecture 4/8/16: Multiprocessor Game Loops

Monolithic

• Dead simple to set up, but it can get messy

• Flow-of-control can be complex

• Top-level may have “too much” knowledge of underlying systems (gross bubble-up effects like UT Actor)

• Tough to maintain

Cooperative Tasks

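#include <vector>

class Task {
public:
    virtual ~Task() {}
    virtual void Run() = 0;   // called once per pass by the TaskManager
};

class Renderer : public Task {
public:
    void Run() override;      // same signature as Task::Run(), so it really overrides
};

class TaskManager {
public:
    void RunTasks();
    void AddTask(Task* task);
private:
    std::vector<Task*> m_tasks;   // tasks run in the order they were added
};

void TaskManager::RunTasks() {
    for (Task* task : m_tasks)
        task->Run();
}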

Cooperative Tasks

• Flexible, but clarity suffers

• Can be too much flexibility

• What happens in what order is difficult to discern by examining the code

Pre-emptive

void InputThread() {
    while (1) input();
}

void SimulationThread() {
    while (1) simulate();
}

void RenderThread() {
    while (1) render();
}

void SoundThread() {
    while (1) sound();
}

Pre-emptive

• Tough to get right

• Complex interprocess communication

• Deadlocks, race conditions

• Questionable performance if used extensively

• But, increasingly parallel hardware makes this a major area for focus
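The race-condition hazard in the list above can be sketched with two threads updating one shared value. This is a minimal illustration using the standard `std::thread` and `std::mutex`; the names `SharedWorld`, `simulateSteps` and `runTwoSimThreads` are hypothetical and do not come from the slides.

```cpp
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical shared state touched by more than one thread.
struct SharedWorld {
    std::mutex lock;      // guards tickCount
    long tickCount = 0;
};

// Bumps the shared counter `steps` times. The lock is taken around every
// update so increments from different threads cannot interleave.
void simulateSteps(SharedWorld& world, int steps) {
    for (int i = 0; i < steps; ++i) {
        std::lock_guard<std::mutex> guard(world.lock);
        ++world.tickCount;
    }
}

// Runs two threads against the same world and returns the final count.
long runTwoSimThreads(int stepsPerThread) {
    SharedWorld world;
    std::thread a(simulateSteps, std::ref(world), stepsPerThread);
    std::thread b(simulateSteps, std::ref(world), stepsPerThread);
    a.join();
    b.join();
    return world.tickCount;
}
```

With the `std::lock_guard` in place, `runTwoSimThreads(n)` returns `2 * n`. Removing it turns the increment into a data race, and the total is no longer guaranteed. Taking a lock on every update is also one way the "questionable performance" bullet shows up in practice.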

Multiprocessor Game Loops

• In 2004, the microprocessor industry hit a brick wall due to heat dissipation problems

• Shifted focus to multicore processors

• Another painful shift (after all that graphics nonsense!) – multithreaded program design is much harder than single-threaded

• By 2008, most studios had completed the gradual transition

[Figure: System Block Diagram (Hot Chips 17). CPU: Core0, Core1 and Core2, each with L1I and L1D caches, sharing a 1MB L2. GPU: 3D Core, 10MB EDRAM, memory controllers MC0 and MC1, BIU/IO Intf, Video Out. Memory: 512 MB DRAM. I/O Chip: DVD (SATA), HDD port (SATA), Front controllers (2 USB), Wireless controllers, MU ports (2 USB), Rear Panel USB, Ethernet, IR, Audio Out, FLASH, System control. Analog Chip: Video Out.]

Memory Caches

• A cache is just a bank of memory that can be read from and written to by the CPU much more quickly than main RAM

• Cache memory typically utilizes the fastest (and most expensive) technology available

• Cache memory is located as physically close as possible to the CPU core, typically on the same die.

• Cache memory is usually quite a bit smaller in size than main RAM.

Memory Caches

• Improves memory access performance by keeping local copies in the cache of those chunks of data that are most frequently accessed by the program

• If the data requested by the CPU is already in the cache, it can be provided to the CPU very quickly – on the order of tens of cycles (hit)

• If the data is not already present in the cache, it must be fetched into the cache from main RAM (miss)

• Reading data from main RAM can take hundreds or even thousands of cycles, so the cost of a cache miss is very high indeed

I$ and D$

• Both instructions and data are cached

• The instruction cache (I$) is used to preload executable machine code before it runs

• The data cache (D$) is used to speed up reading and writing of data to main RAM

• The L1 I$ and D$ are physically distinct; an L2 cache typically holds both instructions and data

Multilevel Caches

• There is a fundamental trade-off between cache latency and hit rate

• Larger caches mean higher hit rates, but larger caches cannot be located as close to the CPU, so they tend to be slower than smaller ones.

• Most game consoles employ at least two levels of cache

• The CPU first tries to find the data it’s looking for in the level 1 (L1) cache. (small, but very low access latency)

• If the data isn’t there, it tries the larger but higher-latency level 2 (L2) cache

• Only if the data cannot be found in the L2 cache do we incur the full cost of a main memory access.
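The trade-off between latency and hit rate can be made concrete with a little arithmetic. The sketch below is a simplified model, not a measurement: it assumes each access is served by exactly one level and pays that level's full latency. The function name `averageAccessCycles` and every number passed to it are illustrative assumptions.

```cpp
#include <cassert>

// Expected cost, in cycles, of one memory access through two cache levels.
//   l1HitRate: fraction of all accesses served by L1
//   l2HitRate: fraction of the L1 misses that are served by L2
// The latencies are the total cost of an access served at each level.
double averageAccessCycles(double l1HitRate, double l2HitRate,
                           double l1Cycles, double l2Cycles,
                           double ramCycles) {
    assert(l1HitRate >= 0.0 && l1HitRate <= 1.0);
    assert(l2HitRate >= 0.0 && l2HitRate <= 1.0);
    const double l1Miss = 1.0 - l1HitRate;
    const double l2Miss = 1.0 - l2HitRate;
    return l1HitRate * l1Cycles            // served by L1
         + l1Miss * l2HitRate * l2Cycles   // missed L1, served by L2
         + l1Miss * l2Miss * ramCycles;    // missed both, served by main RAM
}
```

For example, with assumed hit rates of 90% (L1) and 80% (L2) and assumed latencies of 3, 30 and 220 cycles, the average works out to about 9.5 cycles. Nearly half of that comes from the 2% of accesses that reach main RAM.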

Minimizing Misses

• The best way to avoid D$ misses is to organize your data in contiguous blocks that are as small as possible and then access them sequentially

• For I$, keep your high-performance loops as small as possible in terms of code size, and avoid calling functions within your innermost loops. Keep the entire body of the loop in the cache the entire time the loop is running.
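One way to follow the D$ advice above is to keep the values a hot loop reads in their own tightly packed array, apart from data the loop never touches. The sketch below assumes a hypothetical `ParticleSystem`; the field names are invented for illustration.

```cpp
#include <vector>

// Hypothetical particle data, split by how often each field is read.
struct ParticleSystem {
    std::vector<float> speed;    // hot: read every frame by totalSpeed()
    std::vector<int>   debugId;  // cold: stored separately so it never
                                 // occupies cache lines the loop pulls in
};

// Walks one contiguous array front to back, so consecutive iterations
// touch neighboring addresses and every fetched cache line is fully used.
float totalSpeed(const ParticleSystem& particles) {
    float sum = 0.0f;
    for (float s : particles.speed) {
        sum += s;
    }
    return sum;
}
```

Had `speed` and `debugId` been interleaved in one struct per particle, half of every cache line fetched by this loop would hold data it never reads.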

I$ Misses

• Keep high-performance code as small as possible, in terms of number of machine language instructions

• The compiler and linker take care of keeping our functions contiguous in memory

• Avoid calling functions from within a performance-critical section of code

• If you must call a function, place it as close as possible to the calling function, preferably immediately before or after it and never in a different translation (compilation) unit

• Inlining? Inlining a small function can be a big performance boost. But too much inlining bloats the size of the code, which can cause a performance-critical section of code to no longer fit within the cache

[Figure: System Block Diagram (Hot Chips 17), repeated. Labeled: 360]

hUMA - heterogeneous unified memory architecture

PS4

PS4 Cache Architecture

[Figure: PS4 memory hierarchy with access latencies. Regs: FREE. L1 I$ (32 KiB) and L1 D$ (32 KiB): 3 CYCLES. L2 (2 MiB): 30+ CYCLES. MAIN RAM (8 GiB): 220+ CYCLES. Slide dated Tuesday, March 4, 14.]

[Figure: PS4 cache architecture, built up over several slides. Eight cores, C0 to C7, each with Regs (FREE), an L1 I$ (32 KiB) and an L1 D$ (32 KiB). The L2 (2 MiB) is shown first as a single block and then as two L2 (1 MiB) blocks in front of MAIN RAM (8 GiB). Latencies marked on the last two builds: 26 CYCLES and 190 CYCLES.]



[Figure: two columns of addresses labeled MAIN RAM and CACHE. One runs from 0x0000 to 0x0280 and the other from 0x5000 to 0x5280, both in steps of 0x40 (64 bytes).]


PS4 Optimization

PS4-specific: avoid cross-cluster L2 cache line sharing (190 cycles versus 26 cycles)!

U32 g_jobCount[6]; // one per core

PS4 Optimization

PS4-specific: avoid cross-cluster L2 cache line sharing (190 cycles versus 26 cycles)!

struct JobCount {
    U32 m_count;
    U8  m_padding[60];
};
JobCount g_jobCount[6]; // one per core
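The padded `JobCount` idea can also be written with `alignas`, which asks the compiler to insert the padding rather than spelling out a 60-byte array by hand. The sketch below is a variation on the slide, not the slide's own code. It assumes a 64-byte cache line, uses `std::uint32_t` in place of `U32`, and adds a hypothetical `countJobs` helper to exercise the array.

```cpp
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLineSize = 64;  // assumed line size in bytes
constexpr int kNumCores = 6;

// alignas rounds sizeof(JobCount) up to 64 and aligns every array element,
// so each counter sits alone in its own cache line.
struct alignas(kCacheLineSize) JobCount {
    std::uint32_t m_count = 0;
};
static_assert(sizeof(JobCount) == kCacheLineSize,
              "each counter should occupy exactly one cache line");

JobCount g_jobCount[kNumCores];  // one per core

// Hypothetical driver: one thread per slot, each incrementing only its own
// counter, so no lock is needed and no two threads write the same line.
void countJobs(int jobsPerCore) {
    std::vector<std::thread> workers;
    for (int core = 0; core < kNumCores; ++core) {
        workers.emplace_back([core, jobsPerCore] {
            for (int i = 0; i < jobsPerCore; ++i) {
                ++g_jobCount[core].m_count;
            }
        });
    }
    for (std::thread& t : workers) {
        t.join();
    }
}
```

The result is the same with or without the padding. What changes is how often the cores must pass ownership of a cache line back and forth while the counters are being updated.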


Xbox One

Subtle Differences

• Memory type

• The Xbox One utilizes DDR3 RAM, while the PS4 uses GDDR5, which gives the PS4 higher theoretical memory bandwidth. The Xbox One counteracts this to some degree by providing its GPU with a dedicated 32 MiB memory store, implemented as very high-speed eSRAM

Subtle Differences

• Bus speeds

• The buses in the Xbox One support higher-bandwidth data transfers than those of the PS4 (30 GB/s vs 20 GB/s)

• GPU

• The PS4’s GPU is roughly equivalent to an AMD Radeon 7870, with 1152 parallel stream processors, whereas the Xbox One’s GPU is closer to an AMD Radeon 7790, supporting only 768 stream processors

• The Xbox One’s GPU runs at 853 MHz vs 800 MHz for the PS4

[Figure: fork/join timeline. Labels: Main Thread; Update Game Objects; Fork and Join (twice each); Pose Blending (three parallel blocks); Post Animation Game Object Update; Simulate / Integrate; Ragdoll Physics; etc.]
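The fork/join pattern in the figure can be sketched with `std::async`: the main thread forks one task per pose blend, then joins by collecting every result before it moves on. The `Pose` type and the averaging done in `blendPose` are stand-ins invented for this example, not a real animation system.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical pose: one float per joint.
using Pose = std::vector<float>;

// Stand-in for real pose blending: averages two poses joint by joint.
Pose blendPose(const Pose& a, const Pose& b) {
    assert(a.size() == b.size());
    Pose out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = 0.5f * (a[i] + b[i]);
    }
    return out;
}

std::vector<Pose> blendAll(const std::vector<Pose>& from,
                           const std::vector<Pose>& to) {
    assert(from.size() == to.size());
    // Fork: launch one asynchronous blend per character.
    std::vector<std::future<Pose>> pending;
    for (std::size_t i = 0; i < from.size(); ++i) {
        pending.push_back(std::async(std::launch::async, [&from, &to, i] {
            return blendPose(from[i], to[i]);
        }));
    }
    // Join: block until every blend has finished, keeping input order.
    std::vector<Pose> results;
    for (std::future<Pose>& f : pending) {
        results.push_back(f.get());
    }
    return results;
}
```

The main thread does no useful work while it waits in the join loop, which is the main cost of this design.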

[Figure: one thread per subsystem. Main Thread: HID; Update Game Objects; Kick off Animation; Kick Dynamics Sim; Ragdoll Physics; Finalize Animation; Finalize Collision; Other Processing (AI Planning, Audio Work, etc.); Kick Rendering (for next frame). Animation Thread: Sleep; Pose Blending; Global Pose Calculation; Skin Matrix Palette Calculation; Ragdoll Skinning; Sleep. Dynamics Thread: Sleep; Simulate and Integrate; Broad Phase Coll.; Narrow Phase Coll.; Resolve Constraints; Sleep. Rendering Thread: Wait for V-Blank; Wait for GPU; Visibility Determination; Sort; Submit Primitives; Full-Screen Effects; Swap Buffers.]

[Figure: job model. PPU: HID; Update Game Objects; Kick Animation Jobs; Kick Dynamics Jobs; Ragdoll Physics; Finalize Animation; Finalize Collision; Other Processing (AI Planning, Audio Work, etc.); Kick Rendering (for next frame). SPU0, SPU1: interleaved jobs, including Visibility, Sort, Pose Blend, Global Pose, Matrix Palette, Ragdoll Skinning, Physics Simulation, Broad Phase, Narrow Phase, Collisions / Constraints and Submit Primitives.]
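The job model in the figure, where a main processor kicks small units of work and whichever worker is free picks them up, can be sketched on a general-purpose CPU with a shared queue. This is a deliberately minimal illustration built on standard threads. The `JobSystem` class and its `kick` and `waitAll` methods are hypothetical names, and a production job system would be considerably more elaborate.

```cpp
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class JobSystem {
public:
    explicit JobSystem(int workerCount) {
        for (int i = 0; i < workerCount; ++i) {
            m_workers.emplace_back([this] { workerLoop(); });
        }
    }
    ~JobSystem() {
        {
            std::lock_guard<std::mutex> guard(m_lock);
            m_quit = true;
        }
        m_wake.notify_all();
        for (std::thread& t : m_workers) t.join();
    }
    // "Kick" a job: queue it and wake one idle worker.
    void kick(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> guard(m_lock);
            m_jobs.push(std::move(job));
            ++m_unfinished;
        }
        m_wake.notify_one();
    }
    // Block until every job kicked so far has finished running.
    void waitAll() {
        std::unique_lock<std::mutex> guard(m_lock);
        m_done.wait(guard, [this] { return m_unfinished == 0; });
    }

private:
    void workerLoop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> guard(m_lock);
                m_wake.wait(guard, [this] { return m_quit || !m_jobs.empty(); });
                if (m_jobs.empty()) return;  // quitting and nothing left to run
                job = std::move(m_jobs.front());
                m_jobs.pop();
            }
            job();  // run outside the lock so other workers can proceed
            {
                std::lock_guard<std::mutex> guard(m_lock);
                --m_unfinished;
            }
            m_done.notify_all();
        }
    }

    std::mutex m_lock;  // guards m_jobs, m_unfinished and m_quit
    std::condition_variable m_wake, m_done;
    std::queue<std::function<void()>> m_jobs;
    int m_unfinished = 0;
    bool m_quit = false;
    std::vector<std::thread> m_workers;
};

// Usage sketch: sum 1..n with one job per term, on two workers.
int sumWithJobs(int n) {
    std::atomic<int> total{0};
    JobSystem jobs(2);
    for (int i = 1; i <= n; ++i) {
        jobs.kick([&total, i] { total += i; });
    }
    jobs.waitAll();
    return total;
}
```

Because jobs are small and independent, any free worker can take the next one, which is why the figure shows different kinds of work interleaved on the same processor.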

Async Design

while (true) { // main game loop
    // ...

    // Cast a ray to see if the player has line of sight
    // to the enemy.
    RayCastResult r = castRay(playerPos, enemyPos);

    // Now process the results...
    if (r.hitSomething() && isEnemy(r.getHitObject())) {
        // Player can see the enemy.
        // ...
    }

    // ...
}

Async Design

while (true) { // main game loop
    // ...

    // Cast a ray to see if the player has line of sight
    // to the enemy.
    RayCastResult r;
    requestRayCast(playerPos, enemyPos, &r);

    // Do other unrelated work while we wait for the
    // other CPU to perform the ray cast for us.
    // ...

    // OK, we can't do any more useful work. Wait for the
    // results of our ray cast job. If the job is
    // complete, this function will return immediately.
    // Otherwise, the main thread will idle until the
    // results are ready...
    waitForRayCastResults(&r);

    // Process results...
    if (r.hitSomething() && isEnemy(r.getHitObject())) {
        // Player can see the enemy.
        // ...
    }

    // ...
}

Async Design

RayCastResult r;
bool rayJobPending = false;

while (true) { // main game loop
    // ...

    // Wait for the results of last frame's ray cast job.
    if (rayJobPending) {
        waitForRayCastResults(&r);

        // Process results...
        if (r.hitSomething() && isEnemy(r.getHitObject())) {
            // Player can see the enemy.
            // ...
        }
    }

    // Cast a new ray for next frame.
    rayJobPending = true;
    requestRayCast(playerPos, enemyPos, &r);

    // Do other work...
    // ...
}
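The request-now, collect-next-frame loop above maps naturally onto `std::future`. The sketch below is one possible rendering under stated assumptions: `std::async` stands in for `requestRayCast`, `future::get` stands in for `waitForRayCastResults`, and `RayCastResult`, `castRay` and `runFrames` are simplified placeholders rather than the slide's actual types.

```cpp
#include <future>

// Simplified placeholder for the slide's result type.
struct RayCastResult {
    bool hit = false;
    bool hitSomething() const { return hit; }
};

// Pretend ray cast: reports a hit whenever the two positions differ.
// A real one would query the collision world.
RayCastResult castRay(float playerPos, float enemyPos) {
    RayCastResult r;
    r.hit = (playerPos != enemyPos);
    return r;
}

// Runs `frames` iterations of the loop and returns how many frames
// processed a ray cast result that reported a hit.
int runFrames(int frames) {
    std::future<RayCastResult> pending;  // invalid until the first request
    int processed = 0;
    for (int frame = 0; frame < frames; ++frame) {
        // Wait for the results of last frame's ray cast job.
        if (pending.valid()) {
            RayCastResult r = pending.get();
            if (r.hitSomething()) {
                ++processed;
            }
        }
        // Cast a new ray for next frame.
        pending = std::async(std::launch::async, castRay,
                             0.0f, static_cast<float>(frame + 1));
        // Do other work...
    }
    return processed;
}
```

Here `pending.valid()` plays the role of the `rayJobPending` flag: it is false before the first request and after each `get()`. The first frame has no result to process, so gameplay reacts to the ray cast one frame late, which is the price of never stalling the main thread on it.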
