Game Architecture 4/8/16: Multiprocessor Game Loops
Monolithic
• Dead simple to set up, but it can get messy
• Flow-of-control can be complex
• Top-level may have “too much” knowledge of underlying systems (gross bubble-up effects like UT Actor)
• Tough to maintain
Cooperative Tasks
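#include <vector>

class Task {
public:
    virtual ~Task() = default;
    virtual void Run(float time) = 0; // called once per frame
};
class Renderer : public Task {
public:
    void Run(float time) override;
};
class TaskManager {
public:
    void RunTasks(float time);
    void AddTask(Task* task);
private:
    std::vector<Task*> m_tasks;
};
void TaskManager::RunTasks(float time) {
    // Run every registered task once, in registration order.
    for (Task* task : m_tasks)
        task->Run(time);
}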
Cooperative Tasks
• Flexible, but clarity suffers
• Can be too much flexibility
• The order in which things happen is difficult to discern by examining the code
Pre-emptive
void InputThread(){
while(1) input();
}
void SimulationThread(){
while(1) simulate();
}
void RenderThread() {
while(1) render();
}
void SoundThread() {
while(1) sound();
}
Pre-emptive
• Tough to get right
• Complex interprocess communication
• Deadlocks, race conditions
• Questionable performance if used extensively
• But, increasingly parallel hardware makes this a major area for focus
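A minimal sketch of why shared state makes the pre-emptive model tough to get right. `SharedCounter`, `worker`, and `runWorkers` are hypothetical names, not part of the slides; the counter stands in for any data touched by more than one subsystem thread. Without the lock, the increments from different threads could interleave and the total would be unpredictable (a race condition).

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared state touched by several subsystem threads.
struct SharedCounter {
    std::mutex mutex;
    long value = 0;
};

// Each worker bumps the counter `iterations` times. The lock_guard makes the
// read-modify-write indivisible with respect to the other workers.
void worker(SharedCounter& counter, int iterations) {
    for (int i = 0; i < iterations; ++i) {
        std::lock_guard<std::mutex> lock(counter.mutex);
        ++counter.value;
    }
}

// Starts `threadCount` workers, waits for all of them, and returns the total.
long runWorkers(int threadCount, int iterations) {
    SharedCounter counter;
    std::vector<std::thread> threads;
    for (int t = 0; t < threadCount; ++t)
        threads.emplace_back(worker, std::ref(counter), iterations);
    for (std::thread& th : threads)
        th.join();
    // With the lock in place, no increment is lost.
    assert(counter.value == static_cast<long>(threadCount) * iterations);
    return counter.value;
}
```

Taking a lock on every increment is correct but slow, which is one reason performance becomes questionable when threads share data extensively.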
Multiprocessor Game Loops
• In 2004, the microprocessor industry hit a brick wall due to heat dissipation problems
• Shifted focus to multicore processors
• Another painful shift (after all that graphics nonsense!) – multithreaded program design is much harder than single-threaded
• By 2008, most studios had completed the gradual transition
System Block Diagram (Hot Chips 17)
[Figure: system block diagram. Recoverable labels: CPU with Core0, Core1, and Core2, each with L1I and L1D caches, and a shared 1 MB L2; GPU with 3D Core, 10 MB EDRAM, memory controllers MC0 and MC1, BIU/IO interface, and Video Out; 512 MB DRAM memory; I/O chip with DVD (SATA), HDD port (SATA), front controllers (2 USB), wireless controllers, MU ports (2 USB), rear panel USB, Ethernet, IR, Audio Out, FLASH, XMA Decoder, SMC, and System control; Analog Chip with Video Out.]
Memory Caches
• A cache is just a bank of memory that can be read from and written to by the CPU much more quickly than main RAM
• Cache memory typically utilizes the fastest (and most expensive) technology available
• Cache memory is located as physically close as possible to the CPU core, typically on the same die.
• Cache memory is usually quite a bit smaller in size than main RAM.
Memory Caches
• A cache improves memory access performance by keeping local copies of the chunks of data that the program accesses most frequently
• If the data requested by the CPU is already in the cache, it can be provided to the CPU very quickly, on the order of tens of cycles (hit)
• If the data is not already present in the cache, it must be fetched into the cache from main RAM (miss)
• Reading data from main RAM can take hundreds or even thousands of cycles, so the cost of a cache miss is very high indeed
I$ and D$
• Both instructions and data are cached
• The instruction cache (I$) is used to preload executable machine code before it runs
• The data cache (D$) is used to speed up reading data from and writing data to main RAM
• Physically distinct at the L1 level; larger caches such as L2 typically hold both instructions and data
Multilevel Caches
• There is a fundamental trade-off between cache latency and hit rate
• Larger caches mean higher hit rates, but larger caches cannot be located as close to the CPU, so they tend to be slower than smaller ones.
• Most game consoles employ at least two levels of cache
• The CPU first tries to find the data it’s looking for in the level 1 (L1) cache. (small, but very low access latency)
• If the data isn’t there, it tries the larger but higher-latency level 2 (L2) cache
• Only if the data cannot be found in the L2 cache do we incur the full cost of a main memory access.
Minimizing Misses
• The best way to avoid D$ misses is to organize your data in contiguous blocks that are as small as possible and then access them sequentially
• For I$, keep your high-performance loops as small as possible in terms of code size, and avoid calling functions within your innermost loops, so that the entire body of the loop stays in the cache for as long as the loop is running.
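A rough sketch of the D$ advice, assuming 64-byte cache lines. `FatParticle`, `totalMassFat`, `totalMassCompact`, and `compareLayouts` are hypothetical names used only for illustration. Both loops walk memory sequentially, but the compact layout packs 16 masses into each assumed 64-byte line, while the fat layout brings in a whole line for every single mass.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical "fat" object: the loop only needs `mass`, but every element
// drags 60 unrelated bytes through the cache with it.
struct FatParticle {
    float mass;
    char  otherState[60];
};
static_assert(sizeof(FatParticle) == 64, "one particle per assumed 64-byte line");

// Sequential access with a 64-byte stride: one useful float per line fetched.
float totalMassFat(const std::vector<FatParticle>& particles) {
    float total = 0.0f;
    for (const FatParticle& p : particles)
        total += p.mass;
    return total;
}

// Sequential access with a 4-byte stride: 16 useful floats per line fetched.
float totalMassCompact(const std::vector<float>& masses) {
    float total = 0.0f;
    for (float m : masses)
        total += m;
    return total;
}

// Builds both layouts with masses 1, 2, ..., n and checks that they agree.
float compareLayouts(std::size_t n) {
    std::vector<FatParticle> fat(n);
    std::vector<float> masses(n);
    for (std::size_t i = 0; i < n; ++i) {
        fat[i].mass = static_cast<float>(i + 1);
        masses[i] = static_cast<float>(i + 1);
    }
    const float a = totalMassFat(fat);
    const float b = totalMassCompact(masses);
    assert(a == b); // same values added in the same order
    return b;
}
```

The two functions compute the same result; only the number of cache lines they touch differs, which is what matters for miss counts.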
I$ Misses
• Keep high-performance code as small as possible, in terms of number of machine language instructions
• The compiler and linker take care of keeping our functions contiguous in memory
• Avoid calling functions from within a performance-critical section of code
• If you have to call a function, place it as close as possible to the calling function, preferably immediately before or after the calling function and never in a different translation (compilation) unit
• Inlining? Inlining a small function can be a big performance boost. But too much inlining bloats the size of the code, which can cause a performance-critical section of code to no longer fit within the cache
360
PS3
PS4
hUMA - heterogeneous unified memory architecture
PS4
PS4 Cache Architecture
[Figures: PS4 memory hierarchy, from slides dated Tuesday, March 4, 14.]
• Registers: free
• L1 I$ (32 KiB) and L1 D$ (32 KiB): 3 cycles
• L2 (2 MiB): 30+ cycles
• Main RAM (8 GiB): 220+ cycles
• The eight cores (C0 to C7) form two clusters of four, and the 2 MiB of L2 is split into one 1 MiB L2 per cluster
• Reaching the L2 of a core's own cluster takes 26 cycles; reaching the other cluster's L2 takes 190 cycles
[Figure: main RAM addresses 0x0000 to 0x0280 and 0x5000 to 0x5280, in steps of 0x40 (64 bytes), mapped onto lines of the cache.]
PS4 Optimization
PS4-specific: avoid cross-cluster L2 cache line sharing (190 cycles versus 26 cycles)!
Before:
U32 g_jobCount[6]; // one per core
After:
struct JobCount {
    U32 m_count;
    U8  m_padding[60];
};
JobCount g_jobCount[6]; // one per core
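A variation on the padding idea in standard C++, as a sketch only and assuming 64-byte cache lines. `PaddedJobCount`, `g_paddedJobCount`, `addJob`, and `jobCount` are hypothetical names, not the original code. Using `alignas` instead of a manual padding array also forces each counter to start on a line boundary, which manual padding alone does not guarantee.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLineSize = 64; // assumed line size
constexpr std::size_t kNumCores = 6;       // one counter per core, as on the slide

// alignas rounds both the size and the alignment of the struct up to a full
// line, so no two counters can ever share a cache line.
struct alignas(kCacheLineSize) PaddedJobCount {
    std::atomic<std::uint32_t> m_count{0};
};
static_assert(sizeof(PaddedJobCount) == kCacheLineSize, "one counter per line");
static_assert(alignof(PaddedJobCount) == kCacheLineSize, "starts on a line boundary");

PaddedJobCount g_paddedJobCount[kNumCores];

// Called by the core that owns the counter.
void addJob(std::size_t core) {
    assert(core < kNumCores);
    g_paddedJobCount[core].m_count.fetch_add(1, std::memory_order_relaxed);
}

std::uint32_t jobCount(std::size_t core) {
    assert(core < kNumCores);
    return g_paddedJobCount[core].m_count.load(std::memory_order_relaxed);
}
```

The cost is memory: six counters now occupy 384 bytes instead of 24, which is usually an acceptable trade for a handful of hot per-core variables.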
PS4
Xbox One
Subtle Differences
• Memory type
  • The Xbox One utilizes DDR3 RAM, while the PS4 uses GDDR5, which gives the PS4 higher theoretical memory bandwidth. The Xbox One counteracts this to some degree by providing its GPU with a dedicated 32 MiB memory store, implemented as very high-speed eSRAM
Subtle Differences
• Bus speeds
  • The buses in the Xbox One support higher bandwidth data transfers than those of the PS4 (30 GB/sec vs 20 GB/sec)
• GPU
  • The PS4’s GPU is roughly equivalent to an AMD Radeon 7870, with 1152 parallel stream processors, while the Xbox One’s GPU is closer to an AMD Radeon 7790, supporting only 768 stream processors
  • The Xbox One’s GPU runs at 853 MHz vs 800 MHz for the PS4
[Figure: fork/join on the Main Thread. Update Game Objects; Fork into three parallel Pose Blending tasks; Join; Post Animation Game Object Update; Fork into three parallel Simulate / Integrate tasks; Join; Ragdoll Physics; etc.]
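The fork/join pattern in the figure above can be sketched roughly as follows. `blendPoses` and `forkJoinBlend` are hypothetical names, and scaling each value by 0.5 is only a stand-in for real pose blending work. The main thread forks one worker per chunk of the data, then blocks at the join until every worker has finished.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for one batch of pose blending work.
void blendPoses(std::vector<float>& poses, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i)
        poses[i] *= 0.5f;
}

// Splits `poses` into `workerCount` disjoint ranges and processes them in
// parallel. Returns only after all workers are done (the join).
void forkJoinBlend(std::vector<float>& poses, std::size_t workerCount) {
    assert(workerCount > 0);
    const std::size_t chunk = (poses.size() + workerCount - 1) / workerCount;
    std::vector<std::thread> workers;

    // Fork: each worker gets its own range, so no locking is needed.
    for (std::size_t w = 0; w < workerCount; ++w) {
        const std::size_t begin = std::min(w * chunk, poses.size());
        const std::size_t end = std::min(begin + chunk, poses.size());
        workers.emplace_back(blendPoses, std::ref(poses), begin, end);
    }

    // Join: the main thread idles here until the slowest worker finishes.
    for (std::thread& t : workers)
        t.join();
}
```

Creating threads every frame is expensive; a real engine would more likely keep a pool of worker threads alive and hand them ranges to process.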
[Figure: one thread per subsystem.]
• Main Thread: HID; Update Game Objects; Kick off Animation; Post Animation Game Object Update; Kick Dynamics Sim; Ragdoll Physics; Finalize Animation; Finalize Collision; Other Processing (AI Planning, Audio Work, etc.); Kick Rendering (for next frame)
• Animation Thread: Pose Blending; Global Pose Calculation; Skin Matrix Palette Calculation; Ragdoll Skinning; sleeps whenever it has no work
• Dynamics Thread: Simulate and Integrate; Broad Phase Coll.; Narrow Phase Coll.; Resolve Constraints; sleeps whenever it has no work
• Rendering Thread: Visibility Determination; Sort; Submit Primitives; Full-Screen Effects; Swap Buffers; with waits for the GPU and for V-Blank
[Figure: jobs on the PPU and SPUs.]
• PPU: HID; Update Game Objects; Kick Animation Jobs; Post Animation Game Object Update; Kick Dynamics Jobs; Ragdoll Physics; Finalize Animation; Finalize Collision; Other Processing (AI Planning, Audio Work, etc.); Kick Rendering (for next frame)
• SPU0, SPU1, etc.: small jobs from all subsystems interleaved on whichever SPU is free: Visibility, Sort, Submit Primitives, Pose Blend, Global Pose, Matrix Palette, Ragdoll Skinning, Physics Simulation, Broad Phase, Narrow Phase, Collisions / Constraints
Async Design
while (true) { // main game loop
    // ...
    // Cast a ray to see if the player has line of sight
    // to the enemy.
    RayCastResult r = castRay(playerPos, enemyPos);
    // Now process the results...
    if (r.hitSomething() && isEnemy(r.getHitObject())) {
        // Player can see the enemy.
        // ...
    }
    // ...
}
Async Design
while (true) { // main game loop
    // ...
    // Cast a ray to see if the player has line of sight
    // to the enemy.
    RayCastResult r;
    requestRayCast(playerPos, enemyPos, &r);

    // Do other unrelated work while we wait for the
    // other CPU to perform the ray cast for us.
    // ...

    // OK, we can't do any more useful work. Wait for the
    // results of our ray cast job. If the job is
    // complete, this function will return immediately.
    // Otherwise, the main thread will idle until the
    // results are ready...
    waitForRayCastResults(&r);

    // Process results...
    if (r.hitSomething() && isEnemy(r.getHitObject())) {
        // Player can see the enemy.
        // ...
    }
    // ...
}
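One possible way to build the request/wait pair on top of the standard library, as a sketch only. The slides do not show how `requestRayCast` and `waitForRayCastResults` are implemented, and the signatures here differ from theirs: this version returns a `std::future` instead of filling in a result through a pointer. `Vec3`, the fields of `RayCastResult`, and the placeholder logic in `castRay` are all hypothetical.

```cpp
#include <cassert>
#include <future>

struct Vec3 { float x, y, z; };

// Hypothetical result type; the real one is not shown on the slides.
struct RayCastResult {
    bool hit = false;
    int hitObjectId = -1;
    bool hitSomething() const { return hit; }
};

// Placeholder for the expensive query: reports a hit on object 42 whenever
// the two points differ along x.
RayCastResult castRay(Vec3 from, Vec3 to) {
    RayCastResult r;
    r.hit = (from.x != to.x);
    r.hitObjectId = r.hit ? 42 : -1;
    return r;
}

// Starts the ray cast on another thread and returns immediately.
std::future<RayCastResult> requestRayCast(Vec3 from, Vec3 to) {
    return std::async(std::launch::async, castRay, from, to);
}

// Returns at once if the job has finished; otherwise blocks until it has.
RayCastResult waitForRayCastResults(std::future<RayCastResult>& job) {
    assert(job.valid()); // a job must have been requested first
    return job.get();
}
```

Between the two calls the main thread is free to do unrelated work, which is the whole point of the asynchronous version of the loop.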
Async Design
RayCastResult r;
bool rayJobPending = false;

while (true) { // main game loop
    // ...

    // Wait for the results of last frame's ray cast job.
    if (rayJobPending) {
        waitForRayCastResults(&r);
        // Process results...
        if (r.hitSomething() && isEnemy(r.getHitObject())) {
            // Player can see the enemy.
            // ...
        }
    }

    // Cast a new ray for next frame.
    rayJobPending = true;
    requestRayCast(playerPos, enemyPos, &r);

    // Do other work...
    // ...
}
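A small simulation of the one-frame-latency loop above, to make the timing visible. This is a sketch with hypothetical names: `rayCastJob` stands in for the ray cast and simply returns ten times the frame number that requested it, and the `valid()` state of the future plays the role of `rayJobPending`.

```cpp
#include <cassert>
#include <future>
#include <vector>

// Hypothetical job: tags its result with the frame that requested it.
int rayCastJob(int frame) { return frame * 10; }

// Runs `frameCount` iterations of the loop and returns the results in the
// order in which the main thread consumed them.
std::vector<int> runFrames(int frameCount) {
    assert(frameCount >= 0);
    std::vector<int> consumed;
    std::future<int> pending; // not valid until the first request is made

    for (int frame = 0; frame < frameCount; ++frame) {
        // Collect the result requested during the previous frame, if any.
        if (pending.valid())
            consumed.push_back(pending.get());

        // Request a new job; its result will be picked up next frame.
        pending = std::async(std::launch::async, rayCastJob, frame);

        // ... other per-frame work would go here ...
    }
    // The final request is never consumed; the future's destructor waits
    // for that job to finish before the function returns.
    return consumed;
}
```

After four frames, the main thread has consumed the results of frames 0, 1, and 2 only: every answer arrives one frame late, and the first frame has nothing to process. Game code written this way has to tolerate that lag.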