“The Slow Game Of Life”
Allan Murphy
Senior Software Development Engineer
XNA Developer Connection
Microsoft
Understanding Performance
Consumer Hardware And Performance Coding
Hello
So…
Who exactly am I?
And what am I doing here?
Firstly, hands up who…
Has heavily optimized an application
Hasn’t, and doesn’t care
Is actually here and alive
Is hungover and mainly hoping for the answers for the group assignment
Hello
Duncan let me speak today because…
Career spent on performance hardware
Experience with a variety of consoles
Have managed teams building game engines
Still have those photos of Duncan
With Doug, the West Highland Terrier
I did my degree at Strathclyde
Computer architecture
Low level programming
Will Optimize For Money
Previous Experience
Did database analysis, hated it
Worked in telecoms, hated it
Moved to 3 person game company…
“Until I could find a proper job”
It’s not all about me
Except this bit
Strathclyde
The Game Of Life assignment
Left Strathclyde
Immediately paid enormous fortune
Didn’t wear a suit, worked in games
Bought first Ferrari 3 months after Uni
Had more than 1 girlfriend
Previous Experience
2 years PC engine development
2D 640x480 bitmap graphics
C, C++, 80x86 (486, Pentium)
3 years at Sony
3 years PS1 3rd party support and game dev
C, C++, MIPS R3000
2 years at game developer in Glasgow
PS1 engine development
C, C++, MIPS R3000
Previous Experience
6 years owning own developer
PS1, PS2, GC, Xbox 1, PC development
C, C++, MIPS R4400, VU assembly, HLSL
2 years at Eurocom
PS3, 360, PC
C, C++, PowerPC, SPU assembly
2 years at Microsoft
Xbox 360, some Windows
C, C++, PowerPC, HLSL
Previous Experience
Fair amount of optimization experience
Part of XDC group at Microsoft
3rd party developer support group
Visited 60+ game developers
Performance reviews
Consultancy
Sample code
Bespoke coding
Previous Experience
“All this will go away soon”
1992
Multiplying by 320 in x86 assembler
Surely it should, because…
Processor power increasing
Processor cost reducing
Compilers getting better
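The "multiplying by 320" remark can be illustrated with a small sketch. It assumes the multiply was the usual row-offset calculation for a 320-pixel-wide, one-byte-per-pixel screen, which the slide does not actually state. The helper name PixelOffset320 is hypothetical. Because 320 = 256 + 64, the multiply can be replaced by two shifts and an add.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the talk): byte offset of pixel (x, y) in a
// 320-pixel-wide framebuffer with one byte per pixel.
// 320 = 256 + 64, so y * 320 == (y << 8) + (y << 6).
inline std::uint32_t PixelOffset320(std::uint32_t x, std::uint32_t y)
{
    assert(x < 320);                    // x must lie inside one row
    return (y << 8) + (y << 6) + x;     // two shifts and adds, no multiply
}
```

A modern compiler will usually make this substitution by itself when it sees a multiply by a constant, which is part of why the speaker expected such tricks to "go away".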
Console Hardware
Console Hardware
Console hardware is about…
Maximum performance…
…for minimum cost
Often CPUs are…
Cut down production processors
Have bespoke processing hardware added
Eg vector processing units
Attached to cheap memory and peripherals
Consoles are sold at a loss
80x86 PC (circa mid-90s)
[Diagram, somewhat abstracted: Pentium Pro at 200MHz with FPU + MMX and 8Kb L1; 512Kb L2 cache; main memory; AGP link to a graphics card with VRAM; output to monitor]
PS1
[Diagram: MIPS R3000 at 33.868MHz with I$ and D$, GTE and MDEC; 2Mb main memory; GPU with 1Mb VRAM; output to telly]
Xbox 1
[Diagram: Pentium III at 733MHz with FPU + MMX, SSE and L1; 128Kb L2 cache; 64Mb UMA main memory; nVidia NV2A; output to telly]
PS2
[Diagram: EE with MIPS R4400 at 294MHz, FPU + MMX, I$, D$ and S-Pad; VIF0 and VU0 with memory; VIF1 and VU1 with memory; GIF; 32Mb main memory; GS with 4Mb VRAM; output to telly]
Xbox 360
[Diagram: three PowerPC cores, each with L1 and FPU + VMX; 1Mb L2 cache; 512Mb UMA; ATI Xenos; output to telly]
PS3
[Diagram: Cell with PPE (L1, L2 cache), eight SPEs each with a local store (LS), and DMAC; 256Mb main memory; nVidia RSX with 256Mb VRAM; output to telly]
The Sad Truth About CPU Design
In which programmers have to do the hard work again
This Is What You Want
[Diagram: CPU (ridiculously fast) joined by a very wide, very fast link to main memory (very big, very fast)]
CPUs Not Getting Faster…
[Diagram: Core 0, Core 1, Core 2; main memory; "?" in between]
Fast Memory is Expensive…
[Diagram: Core 0, Core 1, Core 2; cache; main memory]
This Is What You Get…
[Diagram: Core 0, Core 1, Core 2, each with L1, an NCU, a store queue, a load queue and store gather; L2 cache; RC machines; main memory]
Multicore Strategy
Multicore is future of performance
Scenario forced on unwilling game developers
Not necessarily a happy marriage
Game systems often highly…
Temporally connected
Intertwined
Game devs often from single thread background
Some tasks easy to parallelize
Rendering, physics, effects, animation
Multicore Strategy
Single threaded
On Xbox360 and PS3, this is a bad plan
Two main threads
Game logic update
Renderer submission
Two main threads + fixed tasks
As above plus…
…fixed tasks in parallel
…eg streaming, effects, audio
Multicore Strategy
Truly multi-threaded
Usually a main game logic thread
Main tasks sliced into independent pieces
Rendering, physics, collision, effects…
Scheduler controls task execution
Tasks execute when preconditions met
Scheduler runs task on any available unit
Real trick is…
Balancing scheduling
Making sure tasks truly independent
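As a rough illustration of "tasks execute when preconditions met", here is a minimal sketch of a dependency-counting task graph. It is not the scheduler of any particular engine: the names Task, TaskGraph, Add, AddPrecondition and Run are all hypothetical, and for simplicity it runs ready tasks one after another on the calling thread, where a real scheduler would hand each ready task to any available hardware thread.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical task record: the work plus the tasks waiting on it.
struct Task
{
    std::string name;
    std::function<void()> work;
    std::vector<std::size_t> dependents;   // tasks unblocked when this one finishes
    std::size_t unmetPreconditions = 0;    // tasks that must finish before this one
};

class TaskGraph
{
public:
    std::size_t Add(std::string name, std::function<void()> work)
    {
        Task task;
        task.name = std::move(name);
        task.work = std::move(work);
        mTasks.push_back(std::move(task));
        return mTasks.size() - 1;
    }

    // Declare that 'after' may only run once 'before' has completed.
    void AddPrecondition(std::size_t before, std::size_t after)
    {
        assert(before < mTasks.size() && after < mTasks.size());
        mTasks[before].dependents.push_back(after);
        ++mTasks[after].unmetPreconditions;
    }

    // Runs each task once its preconditions are met; returns the order used.
    std::vector<std::string> Run()
    {
        std::vector<std::size_t> remaining;
        std::queue<std::size_t> ready;
        for (std::size_t i = 0; i < mTasks.size(); ++i)
        {
            remaining.push_back(mTasks[i].unmetPreconditions);
            if (remaining[i] == 0)
                ready.push(i);
        }

        std::vector<std::string> order;
        while (!ready.empty())
        {
            const std::size_t index = ready.front();
            ready.pop();
            if (mTasks[index].work)
                mTasks[index].work();
            order.push_back(mTasks[index].name);
            for (std::size_t dependent : mTasks[index].dependents)
                if (--remaining[dependent] == 0)
                    ready.push(dependent);   // last precondition just completed
        }
        return order;
    }

private:
    std::vector<Task> mTasks;
};
```

The sketch also shows where the "real trick" lies: the graph is only correct if every data dependency between tasks has been declared as a precondition, and an undeclared one shows up as a timing-dependent bug rather than a compile error.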
Multicore Strategy
Problems
Very hard to debug a task system…
…especially at sub millisecond resolution
Balancing tasks and scheduler can be hard
Slicing data and tasks into pieces tricky
Many conditions very hard to find…
…never mind debug
Side effects in code not always obvious
Game Engine Concerns
Game Engine Coding
Main concerns:
Speed
Feature set
Memory usage
Disc space for assets
But most importantly…
Speed
Because this dictates game content
Slow means fewer features
Game Engine Coding
Speed measured in…
Frames per second
Or equivalently ms per frame
33.33ms in a frame at 30fps
Game must perform update in this time
Update all of the game’s systems
Set up and submit all rendering for frame
Do all of the drawing for previous frame
Game Engine Coding
Critical choices for engine design
Algorithms
Sorting, searching, pruning calculations
Rendering policy
Data structuring
How you bend the above around hardware
Consoles have hardware acceleration…
…for certain tasks
…for certain data
…for certain data layouts
Game Engine Coding
Example: VMX instructions on Xbox360
SIMD instructions, operating on vectors
Vector can be 8, 16, 32 bit values
32 bit can be float or int
Multiply, add, shift, pack, unpack
Great! But…
No divide, sqrt, individual bit operations
Only aligned loading
Loading individual pieces to build expensive
Possible to lose improvement easily
The 360 Core
Remember, cheap hardware
Cut down PowerPC core
Missing out of order execution hardware
Missing store forwarding hardware
Ie, this is an in-order processor
Attached to slow memory
Means loading data is painful
Which in turn makes data layout critical
360 Core
Very commonly occurring penalties:
Load Hit Store
L2 cache miss
Expensive instructions
Branch mispredict
Load-Hit-Store (LHS)
What is it?
Storing to a memory location…
…then loading from it very shortly after
What causes LHS?
Type casts, changing register set, aliasing
Passing by value, or by reference
Why is it a problem?
On PC, bullet usually dodged by…
Instruction re-ordering
Store forwarding hardware
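One common shape of the aliasing and pass-by-reference cause listed above is accumulating into a value that lives behind a reference. The sketch below is illustrative only: Emitter and both function names are hypothetical, and whether a given compiler really emits a store followed by a reload on each iteration depends on the compiler and its settings.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical type, used only for illustration.
struct Emitter
{
    float totalEnergy;
};

// LHS-prone shape: emitter.totalEnergy might alias energies[], so the
// compiler may write it to memory and read it back on every iteration.
void AccumulateThroughMemory(Emitter& emitter, const float* energies, std::size_t count)
{
    assert(energies != nullptr || count == 0);
    emitter.totalEnergy = 0.0f;
    for (std::size_t i = 0; i < count; ++i)
        emitter.totalEnergy += energies[i];   // store, then load shortly after
}

// Friendlier shape: accumulate in a local the compiler can keep in a
// register, and store to memory exactly once at the end.
void AccumulateInRegister(Emitter& emitter, const float* energies, std::size_t count)
{
    assert(energies != nullptr || count == 0);
    float total = 0.0f;
    for (std::size_t i = 0; i < count; ++i)
        total += energies[i];
    emitter.totalEnergy = total;
}
```

Both functions compute the same result; only the pattern of stores and loads differs, which is why this kind of penalty tends to be found with a profiler rather than by reading the source.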
L2 Miss
What is it?
Loading from a location not already in cache
Why is it a problem?
Costs ~610 cycles to load a cache line
You can do a lot of work in 610 cycles
What can we do about it?
Hot/cold split
Reduce in-memory data size
Use cache coherent structures
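A hot/cold split can be sketched as follows. The field names and sizes are hypothetical, chosen only to show the idea: data touched every frame goes in one tightly packed array, and rarely used data moves to a parallel array with the same index, so each cache line loaded in the update loop carries more useful bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Before the split: every particle drags rarely used data through the cache.
struct FatParticle
{
    float position[4];
    float previousPosition[4];
    float lifetime;
    char  debugName[64];         // cold: hypothetical tools-only data
    float spawnParameters[16];   // cold: hypothetical spawn-time data
};

// After the split: only what the per-frame update loop reads.
struct HotParticle
{
    float position[4];
    float previousPosition[4];
    float lifetime;
};

// Everything else, stored at the same index as its hot counterpart.
struct ColdParticle
{
    char  debugName[64];
    float spawnParameters[16];
};

struct ParticleSystem
{
    std::vector<HotParticle>  hot;
    std::vector<ColdParticle> cold;
};

// A per-frame style pass that only ever touches the hot array.
float TotalLifetime(const ParticleSystem& system)
{
    float total = 0.0f;
    for (std::size_t i = 0; i < system.hot.size(); ++i)
        total += system.hot[i].lifetime;
    return total;
}
```

In this sketch the hot record is a fraction of the size of the original, so the same number of cache lines covers several times as many particles.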
Expensive Instructions
What is it?
Certain instructions not pipelined
No other instructions issued ‘til they complete
Stalls both hardware threads
High latency and low throughput
What can we do about it?
Know when those instructions are generated
Avoid or code round those situations
But only in critical places
Branch Mispredicts
What is it?
Mispredicting a branch causes…
…CPU to discard instructions it predicted it needed
…23-24 cycle delay as correct instructions fetched
Why is this a problem?
Misprediction penalty can…
…dominate total time in tight loops
…waste time fetching unneeded instructions
PIX for Xbox 360
PIX
Performance Investigator for Xbox
For analysing various kinds of performance
Rendering, file system, CPU
For CPU…
Several different mechanisms
Stochastic sampling
High level timers and counters
Instruction trace
CPU Instruction Trace
What is an instruction trace?
CPU core set to single step mode
Tools record instructions and load/store addrs
400x slower than normal execution
Trace (and code) affected by:
Compiler output – un-optimized / optimized
Some statistics are simulated
Eg cache statistics assume
Cache starts empty
No other threads run and evict data
CPU Instruction Trace
Instruction trace contains 5 tabs:
Summary tab
Top Issues tab
Memory Accesses tab
Source tab
Functions tab
CPU Instruction Trace
Summary tab
Instructions executed statistics
I-cache statistics
D-cache statistics
Very useful: cache line usage %
TLB statistics
Very useful: 4Kb and 64Kb page usage
Very useful: TLB miss rate exceeding 1024
Instruction type histogram
Summary Tab
Cache line efficiency – try for 35% minimum
Executed instructions – gives notion of possible maximum speed
Top Issues Tab
Major CPU penalties, by cycle cost order
Includes link to:
Address of instruction where penalty occurs
Function in source view
L2 miss and LHS normally dominate
Other common penalties:
Branch mispredict
fcmp
Expensive instructions (fdiv et al)
Top Issue Tab
Cache misses
Displays % of data used before eviction
Load-hit-stores
Displays store instruction addr, last data addr
Source / destination register types
Expensive instructions
Location of instruction
Branch mispredictions
Conditional or branch target mispredict
Memory Accesses Tab
Shows all memory accesses by…
Page type, address, and cache line
For each cache line shows…
Symbol that touched the cache line most
Right click gives all symbols touching the line
Source Tab
Annotated source and assembly
Columns show ‘penalty’ counts
With hot links to more details
Click here for load-hit-store details
Brings up this dialog, showing you all store instructions that this load hit
Functions Tab
Per-function values of six counters:
Instruction counts
L2 misses, LHS, fcmp, L1 D & I cache misses
All available as inclusive and exclusive
Exclusive – for this function only
Inclusive – this function and everything it calls
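The relationship between the two kinds of counter can be sketched in a few lines. This is not how PIX computes or stores its data; FunctionStats, InclusiveCount and the function names in the usage note are hypothetical, and the sketch assumes a simple call tree with no recursion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-function record: a counter value for the function's own
// code, plus the indices of the functions it calls.
struct FunctionStats
{
    std::string name;
    std::uint64_t exclusive;
    std::vector<std::size_t> callees;
};

// Inclusive = this function's exclusive count plus the inclusive count of
// everything it calls. Assumes the call structure is a tree.
std::uint64_t InclusiveCount(const std::vector<FunctionStats>& functions, std::size_t index)
{
    assert(index < functions.size());
    std::uint64_t total = functions[index].exclusive;
    for (std::size_t callee : functions[index].callees)
        total += InclusiveCount(functions, callee);
    return total;
}
```

For example, a function with an exclusive count of 100 that calls two functions with inclusive counts of 250 and 450 has an inclusive count of 800. A high inclusive but low exclusive value points at the callees rather than at the function itself.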
Optimization Example
Optimization Zen
Perspective is king
90% of time spent in 10% of code
Optimization is expensive, slow, error prone
Improvement to execution speed
Generality
Maintainability
Understandability
Speed of development
Optimization Zen
Ground rules for optimization
Have CPU budgets in place
Budget planning assists good performance
Measure twice, cut once
Optimize in an iterative pruning fashion
Remove easiest to tackle & worst culprits first
Re-evaluate timing and metrics
Stop as soon as budget achieved
Be sure to diagnose performance issues correctly
Optimization Example
class BaseParticle
{
public:
    …
    virtual Vector& Position() { return mPosition; }
    virtual Vector& PreviousPosition() { return mPreviousPosition; }
    float& Intensity() { return mIntensity; }
    float& Lifetime() { return mLifetime; }
    bool& Active() { return mActive; }
    …
private:
    …
    float mIntensity;
    float mLifetime;
    bool mActive;
    Vector mPosition;
    Vector mPreviousPosition;
    …
};
Optimization Example
// Boring old vector class
class Vector
{
    …
public:
    float x,y,z,w;
};

// Boring old generic linked list class
template <class T> class ListNode
{
public:
    ListNode(T* contents) : mNext(NULL), mContents(contents) {}
    void SetNext(ListNode* node) { mNext = node; }
    ListNode* NextNode() { return mNext; }
    T* Contents() { return mContents; }
private:
    ListNode<T>* mNext;
    T* mContents;
};
Optimization Example
// Run through list and update each active particle
for (ListNode<BaseParticle>* node = gParticles; node != NULL; node = node->NextNode())
    if (node->Contents()->Active())
    {
        Vector vel;
        vel.x = node->Contents()->Position().x - node->Contents()->PreviousPosition().x;
        vel.y = node->Contents()->Position().y - node->Contents()->PreviousPosition().y;
        vel.z = node->Contents()->Position().z - node->Contents()->PreviousPosition().z;
        const float length = __fsqrts((vel.x*vel.x) + (vel.y*vel.y) + (vel.z*vel.z));
        if (length > cLimitLength)
        {
            float newIntensity = cMaxIntensity - node->Contents()->Lifetime();
            if (newIntensity < 0.0f)
                newIntensity = 0.0f;
            node->Contents()->Intensity() = newIntensity;
        }
        else
            node->Contents()->Intensity() = 0.0f;
    }
Optimization Example
// Replacement for straight C vector work
// Build 360 friendly __vector4s
__vector4 position, prevPosition;
position.x = node->Contents()->Position().x;
position.y = node->Contents()->Position().y;
position.z = node->Contents()->Position().z;
prevPosition.x = node->Contents()->PreviousPosition().x;
prevPosition.y = node->Contents()->PreviousPosition().y;
prevPosition.z = node->Contents()->PreviousPosition().z;
// Use VMX to do the calculations
// (3-component dot product, since w is never initialized above)
__vector4 velocity = __vsubfp(position,prevPosition);
__vector4 velocitySqr = __vmsum3fp(velocity,velocity);
// Grab the length result from the vector
const float length = __fsqrts(velocitySqr.x);
Measure First
PIX Summary
704k instructions executed
40% L2 cache line usage
Top penalties
L2 cache miss @ 3m cycles
bctr mispredicts @ 1.14m cycles
__fsqrt @ 696k cycles
2x fcmp @ 490k cycles
Some 20.9m cycles of penalty overall
Takes 7.528ms
Improving Original Example
1) Avoid branch mispredict #1
Ditch the zealous use of virtual
Call functions just once
Gives 1.13x speedup
2) Improve L2 use #1
Refactoring list to contiguous array
Hot/cold split
Using bitfield for active flag
Gives 3.59x speedup
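The "bitfield for active flag" part of step 2 might look something like the sketch below. The talk does not show its implementation, so the class name ActiveFlags and its interface are hypothetical. The point is only that one bit per particle, packed into 32-bit words, replaces one bool per particle scattered through the particle records.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical packed flag array: one bit per particle instead of one bool.
class ActiveFlags
{
public:
    explicit ActiveFlags(std::size_t count)
        : mWords((count + 31) / 32, 0u), mCount(count) {}

    void Set(std::size_t index, bool active)
    {
        assert(index < mCount);
        const std::uint32_t mask = 1u << (index % 32);
        if (active)
            mWords[index / 32] |= mask;
        else
            mWords[index / 32] &= ~mask;
    }

    bool IsActive(std::size_t index) const
    {
        assert(index < mCount);
        return (mWords[index / 32] >> (index % 32)) & 1u;
    }

    // Storage used by the flags themselves.
    std::size_t SizeInBytes() const { return mWords.size() * sizeof(std::uint32_t); }

private:
    std::vector<std::uint32_t> mWords;
    std::size_t mCount;
};
```

With this layout the flags for 100 particles occupy 16 bytes, so testing them costs a small fraction of a cache line rather than a touch on every particle record.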
Improving Original Example
4) Remove expensive instructions
Ditch __fsqrts and compare with squares
Gives 4.05x speedup
5) Avoid fcmp pipeline flush
Insert __fsel() to select tail length
Gives 4.44x speedup
Insert 2nd fsel
Now only branch on active flag remains
Gives 5.0x speedup
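Steps 4 and 5 can be sketched in portable C++. FloatSelect below is a hypothetical stand-in with the same meaning as __fsel (return the second argument when the first is greater than or equal to zero, otherwise the third), and TailIntensity is a hypothetical name for the per-particle calculation from the original loop. Whether a compiler turns the portable select into branch-free code is not guaranteed; on the 360 the intrinsic would be used directly.

```cpp
#include <cassert>

// Portable stand-in for a floating point select:
// returns ifNonNegative when test >= 0, otherwise ifNegative.
inline float FloatSelect(float test, float ifNonNegative, float ifNegative)
{
    return test >= 0.0f ? ifNonNegative : ifNegative;
}

// Same result as the original loop body, but with no square root:
// length > limit is equivalent to length^2 > limit^2 when both are
// non-negative, so the squared values are compared instead.
inline float TailIntensity(float vx, float vy, float vz,
                           float lifetime, float maxIntensity, float limitLength)
{
    assert(limitLength >= 0.0f);
    const float lengthSqr = (vx * vx) + (vy * vy) + (vz * vz);
    const float limitSqr  = limitLength * limitLength;

    // First select: clamp (maxIntensity - lifetime) at zero.
    const float raw     = maxIntensity - lifetime;
    const float clamped = FloatSelect(raw, raw, 0.0f);

    // Second select: zero unless the particle moved further than the limit.
    return FloatSelect(limitSqr - lengthSqr, 0.0f, clamped);
}
```

In practice the squared limit would be computed once outside the loop, much as the final code in the talk uses a precomputed cLimitLengthSqrV.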
Improving Original Example
7) Use VMX
Use __vsubfp and __vmsum3fp for vector math
Gives 5.28x speedup
8) Avoid branching too often
Unroll the loop 4x
Sticks at 5.28x speedup
Improving Original Example
9) Avoid branch mispredict #2
Read vector4 of tail intensities
Build a __vector4 mask from active flags
__vsel tail lengths from existing and new
Write updated vector4 of tail intensities back
Gives 6.01x speedup
10) Improve L2 access #2
Add __dcbt on particle array
Gives 16.01x speedup
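The prefetching idea in step 10 can be tried on other platforms with a compiler builtin. The sketch below swaps __dcbt for __builtin_prefetch, which GCC and Clang provide; on other compilers the macro expands to nothing. The PREFETCH macro, the constants and SumWithPrefetch are hypothetical names. The 768-byte distance mirrors the offset used in the talk's final loop, and the 128-byte line size is that of the Xbox 360; both would need tuning elsewhere.

```cpp
#include <cassert>
#include <cstddef>

// Portable stand-in for __dcbt: a hint only, it never changes results.
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(address) __builtin_prefetch(address)
#else
#define PREFETCH(address) ((void)0)
#endif

const std::size_t kFloatsPerLine  = 32;    // 128-byte cache line / 4-byte float
const std::size_t kPrefetchFloats = 192;   // 768 bytes ahead of the current read

float SumWithPrefetch(const float* values, std::size_t count)
{
    assert(values != nullptr || count == 0);
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i)
    {
        // Once per cache line of input, request a line the loop will reach
        // a few iterations from now, so the load overlaps with useful work.
        if ((i % kFloatsPerLine) == 0 && i + kPrefetchFloats < count)
            PREFETCH(values + i + kPrefetchFloats);
        sum += values[i];
    }
    return sum;
}
```

The prefetch distance is the part that needs measuring: too short and the data has not arrived when it is needed, too long and it may be evicted before use.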
Improving Original Example
11) Improve L2 use #3
Move to short coordinates
Now loading ¼ the data for positions
Gives 21.23x speedup
12) Avoid unnecessary work
We are now writing tail lengths for every particle
Wait, we don’t care about inactive particles
Epiphany - don’t check active flag at all
Gives 23.2x speedup
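The talk does not say how the short coordinates in step 11 are encoded. One plausible scheme, shown here purely as an assumption, maps a known world-space range linearly onto the 16-bit integer range. QuantizedRange, Quantize and Dequantize are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical world-space range that all coordinates are known to lie in.
struct QuantizedRange
{
    float minValue;
    float maxValue;
};

// Map a float in [minValue, maxValue] onto a 16-bit signed integer.
inline std::int16_t Quantize(float value, const QuantizedRange& range)
{
    assert(range.maxValue > range.minValue);
    float t = (value - range.minValue) / (range.maxValue - range.minValue);
    if (t < 0.0f) t = 0.0f;   // clamp values outside the range
    if (t > 1.0f) t = 1.0f;
    return static_cast<std::int16_t>(std::lround(t * 65535.0f) - 32768);
}

// Recover an approximate float from the 16-bit value.
inline float Dequantize(std::int16_t quantized, const QuantizedRange& range)
{
    const float t = (static_cast<float>(quantized) + 32768.0f) / 65535.0f;
    return range.minValue + t * (range.maxValue - range.minValue);
}
```

The trade is precision for bandwidth: each coordinate takes 2 bytes instead of 4, at the cost of a step size of (maxValue - minValue) / 65535, which has to be acceptable for the effect being drawn.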
Improving Original Example
13) Improve L2 use #4
Remaining L2 misses on output array
__dcbt that too
Tweak __dcbt offsets and pre-load
39.01x speedup
Check it’s correct!
for (int loop = 0; loop < cParticleCount; loop+=4)
{
    __dcbt(768,&gParticles[loop]);
    __dcbt(768,&gParticleLifetime[loop]);

    __vector4 lifetimes = *(__vector4 *)&gParticleLifetime[loop];
    __vector4 newIntensity = __vsubfp(maxLifetime,lifetimes);

    const __vector4 velocity0 = gParticles[loop].Velocity();
    __vector4 lengthSqr0 = __vmsum3fp(velocity0,velocity0);

    // …calculate remaining lengths and concatenate into one __vector4

    lengths = __vsubfp(lengths,cLimitLengthSqrV);
    __vector4 lengthMask = __vcmpgtfp(lengths,zero);

    newIntensity = __vmaxfp(newIntensity,zero);
    __vector4 result = __vsel(zero,newIntensity,lengthMask);
    *(__vector4 *)&gParticleTailIntensity[loop] = result;
}
Improving Original Example
PIX Summary
259k instructions executed
99.4% L2 usage
Top penalties
ERAT Data Miss @ 14k cycles
1 LHS via 4kb aliasing
No mispredict penalties
71k cycles of penalty overall
Takes 0.193ms
Summary
Summary
Thanks for listening
Hopefully you gathered something about:
Cheap consumer hardware
Multicore strategies
What game engine programmers worry about
How games are profiled and optimized
Q&A
© 2008 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.
http://www.xna.com
Dawson’s Creek Figures
Clock rate = 3.2 GHz = 3,200,000,000 cycles per second
60 fps = 53,333,333 cycles per frame
30 fps = 106,666,666 cycles per frame
Dawson’s Law: average 0.2 IPC in a game title
Therefore…
at 60 fps, you can do 10,666,666 instructions ~= 10M
at 30 fps, you can do 21,333,333 instructions ~= 21M
Or put another way… how bad is a 1M-cycle penalty?
It’s approx 200K instructions of quality execution going missing.
1M cycles is roughly 1/50th (2%) of a frame at 60 fps, or roughly 1/100th (1%) of a frame at 30 fps.
1M cycles is ~0.31 ms.
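The arithmetic above can be reproduced with a few lines of code, which is handy when budgeting for a different clock rate or frame rate. The constants come straight from the slide (3.2 GHz, 0.2 instructions per cycle); the function names are hypothetical.

```cpp
#include <cassert>

const double kClockHz    = 3.2e9;   // clock rate from the slide
const double kAverageIpc = 0.2;     // "Dawson's Law" average instructions per cycle

// Cycles available in one frame at the given frame rate.
double CyclesPerFrame(double framesPerSecond)
{
    assert(framesPerSecond > 0.0);
    return kClockHz / framesPerSecond;
}

// Instructions that fit in one frame at the average IPC.
double InstructionsPerFrame(double framesPerSecond)
{
    return CyclesPerFrame(framesPerSecond) * kAverageIpc;
}

// Fraction of a frame consumed by a penalty of the given size.
double PenaltyAsFrameFraction(double penaltyCycles, double framesPerSecond)
{
    return penaltyCycles / CyclesPerFrame(framesPerSecond);
}

// Wall-clock cost of a number of cycles, in milliseconds.
double CyclesToMilliseconds(double cycles)
{
    return cycles / kClockHz * 1000.0;
}
```

With these figures a 1M-cycle penalty works out at about 1.9% of a 60 fps frame, about 0.9% of a 30 fps frame, and about 0.31 ms.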