“the slow game of life” allan murphy senior software development engineer xna developer...

“The Slow Game Of Life”

Allan Murphy
Senior Software Development Engineer
XNA Developer Connection
Microsoft

Understanding Performance
Consumer Hardware And Performance Coding

Hello

So…
Who exactly am I?
And what am I doing here?

Firstly, hands up who…
  Has heavily optimized an application
  Hasn’t, and doesn’t care
  Is actually here and alive
  Is hungover and mainly hoping for the answers for the group assignment

Hello

Duncan let me speak today because…

Career spent on performance hardware
Experience with a variety of consoles
Have managed teams building game engines
Still have those photos of Duncan
  With Doug, the West Highland Terrier

I did my degree at Strathclyde
  Computer architecture
  Low level programming

Will Optimize For Money

Previous Experience

Did database analysis, hated it
Worked in telecoms, hated it
Moved to 3 person game company…
  “Until I could find a proper job”

It’s not all about me
Except this bit

Strathclyde
  The Game Of Life assignment

Left Strathclyde
  Immediately paid enormous fortune
  Didn’t wear a suit, worked in games
  Bought first Ferrari 3 months after Uni
  Had more than 1 girlfriend

Previous Experience

2 years PC engine development
  2D 640x480 bitmap graphics
  C, C++, 80x86 (486, Pentium)

3 years at Sony
  3 years PS1 3rd party support and game dev
  C, C++, MIPS R3000

2 years at game developer in Glasgow
  PS1 engine development
  C, C++, MIPS R3000

Previous Experience

6 years owning own developer
  PS1, PS2, GC, Xbox 1, PC development
  C, C++, MIPS R4400, VU assembly, HLSL

2 years at Eurocom
  PS3, 360, PC
  C, C++, PowerPC, SPU assembly

2 years at Microsoft
  Xbox 360, some Windows
  C, C++, PowerPC, HLSL

Previous Experience

Fair amount of optimization experience
Part of XDC group at Microsoft
  3rd party developer support group
  Visited 60+ game developers
  Performance reviews
  Consultancy
  Sample code
  Bespoke coding

Previous Experience

“All this will go away soon”
  1992
  Multiplying by 320 in x86 assembler

Surely it should, because…
  Processor power increasing
  Processor cost reducing
  Compilers getting better
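The slide does not say which trick was used for the multiply by 320. A common one on 320-pixel-wide displays, assumed here, is to split 320 into 256 + 64 and replace the multiply with two shifts and an add. The helper name `PixelOffset320` is hypothetical, and this is a sketch in C++ rather than the original assembler:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: byte offset of pixel (x, y) in a 320-pixel-wide,
// 8-bit-per-pixel framebuffer.
// 320 = 256 + 64, so y * 320 == (y << 8) + (y << 6).
inline std::uint32_t PixelOffset320(std::uint32_t x, std::uint32_t y)
{
    assert(x < 320);  // x must lie inside one scanline
    return (y << 8) + (y << 6) + x;
}
```

A modern compiler will usually make this substitution itself when it pays off, which is part of the "compilers getting better" argument.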

Console Hardware

Console Hardware

Console hardware is about…
  Maximum performance…
  …for minimum cost
Often CPUs are…
  Cut down production processors
  Have bespoke processing hardware added
    Eg vector processing units
  Attached to cheap memory and peripherals
Consoles are sold at a loss

80x86 PC (circa mid-90s)

[Block diagram, somewhat abstracted: Pentium Pro 200MHz CPU (FPU + MMX, 8Kb L1, 512Kb L2 cache) connected to main memory and, over AGP, to a graphics card with VRAM that drives the monitor]

PS1

[Block diagram: MIPS R3000 33.868MHz CPU (I$, D$) with GTE and MDEC, 2Mb main memory, and a GPU with 1Mb VRAM driving the telly]

Xbox 1

[Block diagram: Pentium III 733MHz CPU (FPU + MMX, SSE, L1, 128Kb L2 cache), 64Mb UMA main memory, and an nVidia NV2A GPU driving the telly]

PS2

[Block diagram: EE containing a MIPS R4400 294MHz core (I$, D$, S-Pad, FPU + MMX), VIF0 and VU0 with memory, VU1 and VIF1 with memory, and the GIF; 32Mb main memory; GS with 4Mb VRAM driving the telly]

Xbox 360

[Block diagram: three PowerPC cores, each with L1 and FPU + VMX, sharing a 1Mb L2 cache; 512Mb UMA; ATI Xenos GPU driving the telly]

PS3

[Block diagram: Cell with one PPE (L1, L2 cache), SPEs each with a local store (LS), and a DMAC; 256Mb memory; nVidia RSX GPU with 256Mb VRAM driving the telly]

The Sad Truth About CPU Design
In which programmers have to do the hard work again

This Is What You Want

[Diagram: a ridiculously fast CPU joined by a very wide, very fast link to a very big, very fast main memory]

CPUs Not Getting Faster…

[Diagram: Core 0, Core 1 and Core 2, with a question mark between them and main memory]

Fast Memory is Expensive…

[Diagram: Core 0, Core 1 and Core 2 reach main memory through a cache]

This Is What You Get…

[Diagram: Core 0, Core 1 and Core 2, each with its own L1, store queue, load queue, store-gather buffer and NCU, sharing an L2 cache with RC machines in front of main memory]

Multicore Strategy

Multicore is future of performance
Scenario forced on unwilling game developers
Not necessarily a happy marriage
Game systems often highly…
  Temporally connected
  Intertwined
Game devs often from single thread background
Some tasks easy to parallelize
  Rendering, physics, effects, animation

Multicore Strategy

Single threaded
  On Xbox360 and PS3, this is a bad plan
Two main threads
  Game logic update
  Renderer submission
Two main threads + fixed tasks
  As above plus…
  …fixed tasks in parallel
  …eg streaming, effects, audio

Multicore Strategy

Truly multi-threaded
  Usually a main game logic thread
  Main tasks sliced into independent pieces
    Rendering, physics, collision, effects…
  Scheduler controls task execution
    Tasks execute when preconditions met
    Scheduler runs task on any available unit
  Real trick is…
    Balancing scheduling
    Making sure tasks truly independent
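As a rough illustration of "tasks execute when preconditions met", here is a minimal sketch of a dependency-counting scheduler. The `Task` struct and `RunTasks` function are hypothetical names, not taken from any engine, and the sketch runs tasks inline on one thread so the ordering logic stays visible. A real scheduler would hand each ready task to whichever core is idle.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical task record: a unit of work plus the tasks that must finish first.
struct Task
{
    std::function<void()> work;
    std::vector<std::size_t> dependsOn;  // indices of prerequisite tasks
};

// Runs each task once, only after all of its preconditions have completed.
// Returns the order in which tasks ran.
inline std::vector<std::size_t> RunTasks(const std::vector<Task>& tasks)
{
    std::vector<std::size_t> pending(tasks.size(), 0);  // unmet preconditions per task
    std::vector<std::vector<std::size_t>> dependents(tasks.size());
    for (std::size_t i = 0; i < tasks.size(); ++i)
    {
        pending[i] = tasks[i].dependsOn.size();
        for (std::size_t dep : tasks[i].dependsOn)
        {
            assert(dep < tasks.size());
            dependents[dep].push_back(i);
        }
    }

    // Tasks with no preconditions are ready immediately.
    std::queue<std::size_t> ready;
    for (std::size_t i = 0; i < tasks.size(); ++i)
        if (pending[i] == 0)
            ready.push(i);

    std::vector<std::size_t> order;
    while (!ready.empty())
    {
        const std::size_t current = ready.front();
        ready.pop();
        if (tasks[current].work)
            tasks[current].work();
        order.push_back(current);

        // Finishing a task may release the tasks that were waiting on it.
        for (std::size_t next : dependents[current])
            if (--pending[next] == 0)
                ready.push(next);
    }
    return order;  // shorter than tasks.size() if the dependencies form a cycle
}
```

The sketch only enforces ordering. It does nothing about the harder part the slide mentions: making sure two tasks that are allowed to run together really do not touch the same data.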

Multicore Strategy

Problems
  Very hard to debug a task system…
  …especially at sub millisecond resolution
  Balancing tasks and scheduler can be hard
  Slicing data and tasks into pieces tricky
  Many conditions very hard to find…
  …never mind debug
  Side effects in code not always obvious

Game Engine Concerns

Game Engine Coding

Main concerns:
  Speed
  Feature set
  Memory usage
  Disc space for assets
But most importantly…
  Speed
    Because this dictates game content
    Slow means fewer features

Game Engine Coding

Speed measured in…
  Frames per second
  Or equivalently ms per frame
  33.33ms in a frame at 30fps
Game must perform update in this time
  Update all of the game’s systems
  Set up and submit all rendering for frame
  Do all of the drawing for previous frame

Game Engine Coding

Critical choices for engine design
  Algorithms
    Sorting, searching, pruning calculations
  Rendering policy
  Data structuring
  How you bend the above around hardware
Consoles have hardware acceleration…
  …for certain tasks
  …for certain data
  …for certain data layouts

Game Engine Coding

Example: VMX instructions on Xbox360

SIMD instructions, operating on vectors
  Vector can be 8, 16, 32 bit values
  32 bit can be float or int
  Multiply, add, shift, pack, unpack
Great! But…
  No divide, sqrt, individual bit operations
  Only aligned loading
  Building a vector from individually loaded pieces is expensive
  Possible to lose improvement easily

The 360 Core

Remember, cheap hardware
  Cut down PowerPC core
  Missing out of order execution hardware
  Missing store forwarding hardware
  Ie, this is an in-order processor
Attached to slow memory
  Means loading data is painful
  Which in turn makes data layout critical

360 Core

Very commonly occurring penalties:
  Load Hit Store
  L2 cache miss
  Expensive instructions
  Branch mispredict

Load-Hit-Store (LHS)

What is it?
  Storing to a memory location…
  …then loading from it very shortly after
What causes LHS?
  Type casts, changing register set, aliasing
  Passing by value, or by reference
Why is it a problem?
  On PC, bullet usually dodged by…
    Instruction re-ordering
    Store forwarding hardware
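To make the "passing by reference" and "aliasing" causes concrete, here is a small sketch of a store followed closely by a load of the same address. The function names are hypothetical. Whether a given compiler really emits the store and reload depends on its optimization settings and on what it can prove about aliasing, so treat this as the shape of the problem rather than a guaranteed reproduction.

```cpp
#include <cassert>
#include <cstddef>

// LHS-prone shape: 'total' is a float reference and 'values' points at floats,
// so the compiler may have to assume they alias. That can force a store to
// 'total' and a reload of it on every iteration.
inline void SumThroughReference(const float* values, std::size_t count, float& total)
{
    assert(values != nullptr || count == 0);
    total = 0.0f;
    for (std::size_t i = 0; i < count; ++i)
        total += values[i];  // load total, add, store total, repeat
}

// Friendlier shape: accumulate in a local, which can stay in a register,
// and write the result out once at the end.
inline float SumInRegister(const float* values, std::size_t count)
{
    assert(values != nullptr || count == 0);
    float total = 0.0f;
    for (std::size_t i = 0; i < count; ++i)
        total += values[i];
    return total;
}
```

Both functions compute the same sum; only the memory traffic inside the loop differs.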

L2 Miss

What is it?
  Loading from a location not already in cache
Why is it a problem?
  Costs ~610 cycles to load a cache line
  You can do a lot of work in 610 cycles
What can we do about it?
  Hot/cold split
  Reduce in-memory data size
  Use cache coherent structures
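A hot/cold split can be sketched as follows. All type and field names here are hypothetical, chosen only to show the idea: data the per-frame loop reads goes in one tightly packed array, and rarely used data moves to a parallel array so it no longer occupies cache lines the loop has to pull in.

```cpp
#include <cassert>
#include <vector>

// Hypothetical "before" layout: the update loop only needs position and lifetime,
// but every particle it touches drags the rarely used fields into cache too.
struct FatParticle
{
    float position[3];
    float lifetime;
    char  debugName[32];     // cold: only read by tools
    float spawnPosition[3];  // cold: only read on respawn
};

// Hot part: exactly what the per-frame loop reads.
struct ParticleHot
{
    float position[3];
    float lifetime;
};

// Cold part: kept in a parallel array, indexed the same way.
struct ParticleCold
{
    char  debugName[32];
    float spawnPosition[3];
};

struct ParticleSystem
{
    std::vector<ParticleHot>  hot;
    std::vector<ParticleCold> cold;
};

// The per-frame loop now streams through the hot array only.
inline void AgeParticles(ParticleSystem& system, float dt)
{
    assert(system.hot.size() == system.cold.size());
    for (ParticleHot& particle : system.hot)
        particle.lifetime += dt;
}
```

With these illustrative fields the hot record is much smaller than the combined one, so each cache line fetched carries more particles the loop actually uses.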

Expensive Instructions

What is it?
  Certain instructions not pipelined
  No other instructions issued ‘til they complete
  Stalls both hardware threads
    High latency and low throughput
What can we do about it?
  Know when those instructions are generated
  Avoid or code round those situations
    But only in critical places
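One common way to "code round" an expensive square root, assumed here as an illustration, is to compare squared quantities instead. The `Vec3` type and function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Straightforward version: one square root per call.
inline bool IsLongerThan(const Vec3& v, float limit)
{
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z) > limit;
}

// Same comparison without the square root, valid when limit >= 0:
// sqrt(a) > b  is equivalent to  a > b * b  for non-negative a and b.
inline bool IsLongerThanSquared(const Vec3& v, float limit)
{
    assert(limit >= 0.0f);
    return (v.x * v.x + v.y * v.y + v.z * v.z) > limit * limit;
}
```

The two versions can disagree for values within rounding error of the limit, which rarely matters for a threshold test like this one but is worth knowing before swapping them.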

Branch Mispredicts

What is it?
  Mispredicting a branch causes…
    …CPU to discard instructions it predicted it needed
    …23-24 cycle delay as correct instructions fetched
Why is this a problem?
  Misprediction penalty can…
    …dominate total time in tight loops
    …waste time fetching unneeded instructions

PIX for Xbox 360

PIX

Performance Investigator for Xbox
  For analysing various kinds of performance
  Rendering, file system, CPU
For CPU…
  Several different mechanisms
    Stochastic sampling
    High level timers and counters
    Instruction trace

CPU Instruction Trace

What is an instruction trace?
  CPU core set to single step mode
  Tools record instructions and load/store addrs
  400x slower than normal execution
Trace (and code) affected by:
  Compiler output – un-optimized / optimized
Some statistics are simulated
  Eg cache statistics assume
    Cache starts empty
    No other threads run and evict data


Instruction trace contains 5 tabs:
  Summary tab
  Top Issues tab
  Memory Accesses tab
  Source tab
  Functions tab


Summary tab
  Instructions executed statistics
  I-cache statistics
  D-cache statistics
    Very useful: cache line usage %
  TLB statistics
    Very useful: 4Kb and 64Kb page usage
    Very useful: TLB miss rate exceeding 1024
  Instruction type histogram

Summary Tab

Cache line efficiency – try for 35% minimum

Executed instructions – gives notion of possible maximum speed

Top Issues Tab

Major CPU penalties, by cycle cost order
Includes link to:
  Address of instruction where penalty occurs
  Function in source view
L2 miss and LHS normally dominate
Other common penalties:
  Branch mispredict
  fcmp
  Expensive instructions (fdiv et al)

Top Issue Tab

Cache misses
  Displays % of data used before eviction
Load-hit-stores
  Displays store instruction addr, last data addr
  Source / destination register types
Expensive instructions
  Location of instruction
Branch mispredictions
  Conditional or branch target mispredict

Memory Accesses Tab

Shows all memory accesses by…
  Page type, address, and cache line
For each cache line, shows…
  Symbol that touched the cache line most
  Right click gives all symbols touching the line

Source Tab

Annotated source and assembly
Columns show ‘penalty’ counts
  With hot links to more details

Click here for load-hit-store details

Brings up this dialog, showing you all store instructions that this load hit

Functions Tab

Per-function values of six counters:
  Instruction counts
  L2 misses, LHS, fcmp, L1 D & I cache misses
All available as inclusive and exclusive
  Exclusive – for this function only
  Inclusive – this function and everything it calls
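The relationship between the two kinds of counter can be sketched with a toy call tree. The `FunctionNode` type and `Inclusive` function are hypothetical and are not how PIX stores its data; the sketch also assumes the calls form a tree, with no recursion and no shared callees.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical call-tree node: a function's own (exclusive) count plus its callees.
struct FunctionNode
{
    unsigned long exclusive;           // e.g. L2 misses inside this function only
    std::vector<std::size_t> callees;  // indices of the functions it calls
};

// Inclusive count = this function's exclusive count
//                 + the inclusive counts of everything it calls.
inline unsigned long Inclusive(const std::vector<FunctionNode>& nodes, std::size_t index)
{
    assert(index < nodes.size());
    unsigned long total = nodes[index].exclusive;
    for (std::size_t callee : nodes[index].callees)
        total += Inclusive(nodes, callee);
    return total;
}
```

A function with a large inclusive count but a small exclusive one is usually just a caller of the real culprit, which is why both views are useful.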

Optimization Example

Optimization Zen

Perspective is king
  90% of time spent in 10% of code
  Optimization is expensive, slow, error prone

[Diagram: improvement to execution speed set against generality, maintainability, understandability and speed of development]

Optimization Zen

Ground rules for optimization
  Have CPU budgets in place
    Budget planning assists good performance
  Measure twice, cut once
  Optimize in an iterative pruning fashion
    Remove easiest to tackle & worst culprits first
    Re-evaluate timing and metrics
    Stop as soon as budget achieved
  Be sure to identify performance issues correctly

Optimization Example

class BaseParticle
{
public:
    …
    virtual Vector& Position() { return mPosition; }
    virtual Vector& PrevPosition() { return mPreviousPosition; }  // name matches the calls in the update loop
    float& Intensity() { return mIntensity; }
    float& Lifetime() { return mLifetime; }
    bool& Active() { return mActive; }
    …

private:
    …
    float mIntensity;
    float mLifetime;
    bool mActive;
    Vector mPosition;
    Vector mPreviousPosition;
    …
};

Optimization Example

// Boring old vector class
class Vector
{
    …
public:
    float x,y,z,w;
};

// Boring old generic linked list class
template <class T> class ListNode
{
public:
    ListNode(T* contents) : mNext(NULL), mContents(contents) {}
    void SetNext(ListNode* node) { mNext = node; }
    ListNode* NextNode() { return mNext; }
    T* Contents() { return mContents; }

private:
    ListNode<T>* mNext;
    T* mContents;
};

Optimization Example

// Run through list and update each active particle
for (ListNode<BaseParticle>* node = gParticles; node != NULL; node = node->NextNode())
{
    if (node->Contents()->Active())
    {
        Vector vel;
        vel.x = node->Contents()->Position().x - node->Contents()->PrevPosition().x;
        vel.y = node->Contents()->Position().y - node->Contents()->PrevPosition().y;
        vel.z = node->Contents()->Position().z - node->Contents()->PrevPosition().z;
        const float length = __fsqrts((vel.x*vel.x) + (vel.y*vel.y) + (vel.z*vel.z));

        if (length > cLimitLength)
        {
            float newIntensity = cMaxIntensity - node->Contents()->Lifetime();
            if (newIntensity < 0.0f)
                newIntensity = 0.0f;
            node->Contents()->Intensity() = newIntensity;
        }
        else
            node->Contents()->Intensity() = 0.0f;
    }
}

Optimization Example

// Replacement for straight C vector work

// Build 360 friendly __vector4s
__vector4 position, prevPosition;
position.x = node->Contents()->Position().x;
position.y = node->Contents()->Position().y;
position.z = node->Contents()->Position().z;
prevPosition.x = node->Contents()->PrevPosition().x;
prevPosition.y = node->Contents()->PrevPosition().y;
prevPosition.z = node->Contents()->PrevPosition().z;

// Use VMX to do the calculations
__vector4 velocity = __vsubfp(position,prevPosition);
// 3-component sum: w was never set above, so it must not contribute
__vector4 velocitySqr = __vmsum3fp(velocity,velocity);

// Grab the length result from the vector
const float length = __fsqrts(velocitySqr.x);
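For readers without the Xbox 360 toolchain, the arithmetic of the VMX snippet can be sketched in portable C++. The `Vec4` type and the `Sub`, `Dot3` and `StepLengthSquared` functions are hypothetical stand-ins, not the XDK types or intrinsics. Returning the dot product in every lane is an assumption about how the hardware result is laid out, made so that `.x` can be read back as in the slide.

```cpp
#include <cassert>

// Portable stand-in for the hardware vector type (not the XDK __vector4).
struct Vec4 { float x, y, z, w; };

// What a lane-wise subtract such as __vsubfp computes.
inline Vec4 Sub(const Vec4& a, const Vec4& b)
{
    return Vec4{ a.x - b.x, a.y - b.y, a.z - b.z, a.w - b.w };
}

// What a 3-component dot product such as __vmsum3fp computes: w is ignored.
// Assumption: the result is replicated into every lane.
inline Vec4 Dot3(const Vec4& a, const Vec4& b)
{
    const float d = a.x * b.x + a.y * b.y + a.z * b.z;
    return Vec4{ d, d, d, d };
}

// Squared length of the step a particle took between two frames.
inline float StepLengthSquared(const Vec4& position, const Vec4& prevPosition)
{
    const Vec4 velocity = Sub(position, prevPosition);
    return Dot3(velocity, velocity).x;
}
```

This scalar version gains nothing in speed. It only shows what the two vector operations compute, and why junk in `w` is harmless for the 3-component form.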

Measure First

PIX Summary
  704k instructions executed
  40% L2 cache line usage
  Top penalties
    L2 cache miss @ 3m cycles
    bctr mispredicts @ 1.14m cycles
    __fsqrt @ 696k cycles
    2x fcmp @ 490k cycles
  Some 20.9m cycles of penalty overall
  Takes 7.528ms

Improving Original Example

1) Avoid branch mispredict #1
  Ditch the zealous use of virtual
  Call functions just once
  Gives 1.13x speedup

2) Improve L2 use #1
  Refactoring list to contiguous array
  Hot/cold split
  Using bitfield for active flag
  Gives 3.59x speedup


4) Remove expensive instructions
  Ditch __fsqrts and compare with squares
  Gives 4.05x speedup

5) Avoid fcmp pipeline flush
  Insert __fsel() to select tail length
  Gives 4.44x speedup
  Insert 2nd fsel
  Now only branch on active flag remains
  Gives 5.0x speedup


7) Use VMX
  Use __vsubfp and __vmsum3fp for vector math
  Gives 5.28x speedup

8) Avoid branching too often
  Unroll the loop 4x
  Sticks at 5.28x speedup
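The select-based rewrite in step 5 can be sketched in portable C++. `Select` is a hypothetical stand-in that models the result of a floating-point select (second argument when the first is greater than or equal to zero, third otherwise). On hardware with a select instruction no compare-and-branch is needed; here the ternary only reproduces the value. The other function names and parameters are also illustrative.

```cpp
#include <cassert>

// Portable model of a floating-point select: ifNonNegative when test >= 0,
// otherwise ifNegative.
inline float Select(float test, float ifNonNegative, float ifNegative)
{
    return test >= 0.0f ? ifNonNegative : ifNegative;
}

// Branchy original, following the slide's loop body (lengths already squared).
inline float IntensityBranchy(float lengthSqr, float limitSqr, float maxIntensity, float lifetime)
{
    if (lengthSqr > limitSqr)
    {
        float newIntensity = maxIntensity - lifetime;
        if (newIntensity < 0.0f)
            newIntensity = 0.0f;
        return newIntensity;
    }
    return 0.0f;
}

// Same result expressed as two selects and no if statements.
inline float IntensitySelect(float lengthSqr, float limitSqr, float maxIntensity, float lifetime)
{
    const float raw     = maxIntensity - lifetime;
    const float clamped = Select(raw, raw, 0.0f);  // max(raw, 0)
    // lengthSqr > limitSqr exactly when (limitSqr - lengthSqr) is negative.
    return Select(limitSqr - lengthSqr, 0.0f, clamped);
}
```

Keeping the branchy version around as a reference makes it easy to check that each optimization step still produces the same intensities.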

Improving Original Example

9) Avoid branch mispredict #2
  Read vector4 of tail intensities
  Build a __vector4 mask from active flags
  __vsel tail lengths from existing and new
  Write updated vector4 of tail intensities back
  Gives 6.01x speedup

10) Improve L2 access #2
  Add __dcbt on particle array
  Gives 16.01x speedup

Improving Original Example

11) Improve L2 use #3
  Move to short coordinates
  Now loading ¼ the data for positions
  Gives 21.23x speedup

12) Avoid unnecessary work
  We are now writing tail lengths for every particle
  Wait, we don’t care about inactive particles
  Epiphany - don’t check active flag at all
  Gives 23.2x speedup


13) Improve L2 use #4
  Remaining L2 misses on output array
  __dcbt that too
  Tweak __dcbt offsets and pre-load
  39.01x speedup

Check it’s correct!

for (int loop = 0; loop < cParticleCount; loop += 4)
{
    // Prefetch data the loop will need a few iterations from now
    __dcbt(768, &gParticles[loop]);
    __dcbt(768, &gParticleLifetime[loop]);

    __vector4 lifetimes = *(__vector4 *)&gParticleLifetime[loop];
    __vector4 newIntensity = __vsubfp(maxLifetime, lifetimes);

    const __vector4 velocity0 = gParticles[loop].Velocity();
    __vector4 lengthSqr0 = __vmsum3fp(velocity0, velocity0);

    // …calculate remaining lengths and concatenate into one __vector4

    lengths = __vsubfp(lengths, cLimitLengthSqrV);

    __vector4 lengthMask = __vcmpgtfp(lengths, zero);

    newIntensity = __vmaxfp(newIntensity, zero);
    __vector4 result = __vsel(zero, newIntensity, lengthMask);
    *(__vector4 *)&gParticleTailIntensity[loop] = result;
}


PIX Summary
  259k instructions executed
  99.4% L2 usage
  Top penalties
    ERAT Data Miss @ 14k cycles
    1 LHS via 4kb aliasing
    No mispredict penalties
  71k cycles of penalty overall
  Takes 0.193ms

Summary

Summary

Thanks for listening
Hopefully you gathered something about:
  Cheap consumer hardware
  Multicore strategies
  What game engine programmers worry about
  How games are profiled and optimized

© 2008 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.

http://www.xna.com


Dawson’s Creek Figures

Clock rate = 3.2 GHz = 3,200,000,000 cycles per second
  60 fps = 53,333,333 cycles per frame
  30 fps = 106,666,666 cycles per frame
Dawson’s Law: average 0.2 IPC in a game title
Therefore…
  at 60 fps, you can do 10,666,666 instructions ~= 10M
  at 30 fps, you can do 21,333,333 instructions ~= 21M

Or put another way… how bad is a 1M-cycle penalty?
  It’s approx 200K instructions of quality execution going missing.
  1M cycles is about 1/50th (2%) of a frame at 60 fps, or about 1/100th (1%) of a frame at 30 fps.
  1M cycles is ~0.31 ms.
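The figures above can be checked with a few lines of integer arithmetic. The function and constant names are hypothetical; the 3.2 GHz clock and the 0.2 IPC average come from the slide.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kClockHz = 3200000000ULL;  // 3.2 GHz, from the slide

// Cycles available in one frame at the given frame rate (rounded down).
constexpr std::uint64_t CyclesPerFrame(std::uint64_t fps)
{
    return kClockHz / fps;
}

// Instructions completed per frame at an average of 0.2 IPC,
// i.e. one instruction every five cycles.
constexpr std::uint64_t InstructionsPerFrame(std::uint64_t fps)
{
    return CyclesPerFrame(fps) / 5;
}

// Wall-clock cost of a penalty, in microseconds.
constexpr double PenaltyMicroseconds(std::uint64_t cycles)
{
    return static_cast<double>(cycles) * 1.0e6 / static_cast<double>(kClockHz);
}

static_assert(CyclesPerFrame(60) == 53333333ULL, "60 fps cycle budget");
static_assert(CyclesPerFrame(30) == 106666666ULL, "30 fps cycle budget");
```

For example, `PenaltyMicroseconds(1000000)` evaluates to 312.5, so a 1M-cycle penalty costs roughly a third of a millisecond out of a 16.67 ms frame at 60 fps.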
