spu assisted rendering

/* SPU Assisted Rendering. */ Steven Tovey & Stephen McAuley, Graphics Programmers, Bizarre Creations Ltd. http://www.bizarrecreations.com [email protected] [email protected]


Post on 09-Jun-2015


DESCRIPTION

Talk about SPU-accelerated rendering in Blur, that was given by Steven Tovey and Stephen McAuley @ Develop 2010.

TRANSCRIPT

Page 1: SPU Assisted Rendering

/*
 * SPU Assisted Rendering.
 */

Steven Tovey & Stephen McAuley

Graphics Programmers, Bizarre Creations Ltd.

http://www.bizarrecreations.com

[email protected]@bizarrecreations.com

Page 2: SPU Assisted Rendering

/* Welcome! */

- We have some copies of Blur to give away, stick around and fill out your evaluation sheets!

Page 3: SPU Assisted Rendering

- Part I (w/ Steven Tovey):
– What is SPU Assisted Rendering?
– Case Studies
• Car Damage
• Car Lighting

- Part II (w/ Stephen McAuley):
– Fragment Shading
– Parallelisation
– Case Study
• Pre-pass Lighting on SPUs

- Questions

/* Agenda */

Page 4: SPU Assisted Rendering

/*
 * Part I w/ Steven Tovey
 */

SPU Acceleration of Car Rendering in Blur

Page 5: SPU Assisted Rendering

- Assisting RSX™ with the SPUs (der!)

- Why do this?
– Free up RSX™ to do other things.
– Enable otherwise unfeasible techniques.
– Optimise rendering.

/* What is SPU AR? I */

Page 6: SPU Assisted Rendering

- Problems involved?
– Synchronisation.
– Optimising SPU modules.
– Memory considerations:
• Local store
• Resource allocation
– Etc.

/* What is SPU AR? II */

Page 7: SPU Assisted Rendering

- Original Xenon implementation:
– Totally GPU-based.
– 2xVTF (volume & 2D) for damage.
– Large amount of work in vertex shader, making cars in Blur heavily vertex-bound.
– All lighting in pixel shader.

/* Case Study: Cars I */

Page 8: SPU Assisted Rendering

- Loose fitting damage volume:

/* Case Study: Cars II */

Page 9: SPU Assisted Rendering

- Control points:

/* Case Study: Cars III */

Page 10: SPU Assisted Rendering

- Morph targets:

/* Case Study: Cars IV */

Page 11: SPU Assisted Rendering

- Scratch/dent textures:

/* Case Study: Cars V */

Page 12: SPU Assisted Rendering

- Challenges:
– Increase rendering speed of cars.
– Maintain same quality.

/* Case Study: Cars VI */

Page 13: SPU Assisted Rendering

- Our solution:
– Large parts are SPU based.
– On demand.
– Sync-free.
– Deferred.
– Work split between GPU/SPU.

/* Damage: Solution */

Page 14: SPU Assisted Rendering

- 2 vertex streams:
– Read-only car vertex data.
• Shared between similar cars.
– SPU-modified damage vertex data.
• Per instance.
• One-to-one mapping of vertices.

- Control points:
– Crude approximation of volume preservation.
– Dent/scratch blend levels.

/* Damage: Data I */
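
The two-stream layout above can be sketched in plain C. This is only an illustration: the struct and function names (`CarVertex`, `DamageVertex`, `apply_damage`), the field types, and the linear blend towards a stored offset are all assumptions, since the slides do not give the real vertex formats or the damage maths.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layouts: names and field types are assumptions. */
typedef struct { float x, y, z; } Vec3;

/* Stream 0: read-only, shared between similar cars. */
typedef struct {
    Vec3 position;
    Vec3 normal;
    Vec3 pos_offset;     /* assumed: offset towards the damaged shape */
    Vec3 normal_offset;
} CarVertex;

/* Stream 1: written per car instance, one entry per stream 0 vertex. */
typedef struct {
    Vec3 position;
    Vec3 normal;
} DamageVertex;

/* Returns a + b * t, component-wise. */
static Vec3 vec3_madd(Vec3 a, Vec3 b, float t) {
    Vec3 r = { a.x + b.x * t, a.y + b.y * t, a.z + b.z * t };
    return r;
}

/* Assumed blend: damaged = base + offset * amount, with amount in [0, 1].
 * Normals are not renormalised here. */
static void apply_damage(const CarVertex *in, DamageVertex *out,
                         size_t count, float amount) {
    assert(amount >= 0.0f && amount <= 1.0f);
    for (size_t i = 0; i < count; ++i) {  /* one-to-one vertex mapping */
        out[i].position = vec3_madd(in[i].position, in[i].pos_offset, amount);
        out[i].normal   = vec3_madd(in[i].normal, in[i].normal_offset, amount);
    }
}
```

Because the shared stream is never written, any number of car instances could read it while each owns a separate `DamageVertex` array.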

Page 15: SPU Assisted Rendering

/* Damage: Data II */

Stream0: Position, Normal, UV0, UV1, PosOffset, NormalOffset, AO, ControlPoints

Stream1: SPU_Position, SPU_Normal

Page 22: SPU Assisted Rendering

- MFC writes data atomically in 16 byte chunks...
– If the vertex format is exactly 16 bytes, a vertex can be changed atomically from the SPU.

- If you can live with the odd vertex being wrong for a frame, this could be a huge win!

/* Damage: Data III */
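
As a rough illustration of the 16-byte point, here is a hypothetical packed vertex whose size is checked at compile time. The field packing (four 16-bit floats plus four signed 16-bit values) and the names are assumptions, and `memcpy` merely stands in for a single 16-byte MFC transfer; the slides do not show the real format.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte vertex: this packing is an assumption. */
typedef struct {
    uint16_t position[4];  /* 4 x 16-bit float, 8 bytes */
    int16_t  normal[4];    /* 4 x signed normalised, 8 bytes */
} PackedVertex;

static_assert(sizeof(PackedVertex) == 16,
              "vertex must fill exactly one 16-byte chunk");

/* Stand-in for one 16-byte DMA put. On a host CPU memcpy gives no
 * atomicity guarantee; it only shows the size and alignment rules. */
static void write_vertex(PackedVertex *dst, const PackedVertex *src) {
    assert(((uintptr_t)dst & 15u) == 0);  /* destination is 16-byte aligned */
    memcpy(dst, src, sizeof *dst);
}
```

If the format grew past 16 bytes, the compile-time check would fail, which is a cheap way to protect the property the slide relies on.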

Page 23: SPU Assisted Rendering

/* Damage: Data IV */

[Figure labels: RSX, Local, Main, SPU, Write-only Vertices, Read-only Vertices.]

Page 24: SPU Assisted Rendering

- Damage events from game-side code are queued.
– Note: there is no link to the player's health; the damage is purely superficial.

/* Damage: Events */

[Figure: Game Code producing a queue of six Impact events.]
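
A minimal sketch of queuing damage events, assuming a fixed-capacity ring buffer. The `Impact` fields, the capacity, and the function names are hypothetical; the slides only say that events are queued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical impact event; the real fields are not shown in the slides. */
typedef struct { float position[3]; float strength; } Impact;

enum { QUEUE_CAPACITY = 8 };  /* assumed size, power of two */

typedef struct {
    Impact items[QUEUE_CAPACITY];
    size_t head;  /* count of events read so far */
    size_t tail;  /* count of events written so far */
} ImpactQueue;

/* Game-side code pushes; returns false if the queue is full. */
static bool impact_queue_push(ImpactQueue *q, Impact e) {
    if (q->tail - q->head == QUEUE_CAPACITY) return false;
    q->items[q->tail++ % QUEUE_CAPACITY] = e;
    return true;
}

/* The damage update pops; returns false if the queue is empty. */
static bool impact_queue_pop(ImpactQueue *q, Impact *out) {
    assert(out != NULL);
    if (q->head == q->tail) return false;
    *out = q->items[q->head++ % QUEUE_CAPACITY];
    return true;
}
```

This version is single-threaded; a real producer/consumer split across cores would need atomics or a lock, which is left out here.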

Page 25: SPU Assisted Rendering

/* Damage: Data V */

[Figure labels: GPU, SPU, Write-only Vertices*, Read-only Vertices*, Impact (x6), Constants.]

* w.r.t. the SPU

Page 26: SPU Assisted Rendering

/* Damage: Data VI */

[Figure labels: GPU, SPU, Write-only Vertices*, Read-only Vertices*.]

* w.r.t. the SPU

Page 27: SPU Assisted Rendering

- Fewer sync points should be the goal of any multi-core code:

/* Damage: Control */

[Figure: timeline of PPU and SPU work, built up over slides 27 to 33. Labels: Kick off SPU tasks; Other Work (1); Other Work (2); PPU Damage; Vertex Work (x4); Flag.]

Page 34: SPU Assisted Rendering

- Pretty easy to go from shaders to SPU intrinsics or asm.
– We favour si style for simplicity and ease.

/* de-code into IEEE754-ish 32bit float (meh): */
qword sign_bit = si_and(result, sign_bit_mask);
sign_bit = si_shli(sign_bit, 0x10); /* move 16 bits into correct place. */
qword significand = si_and(result, mant_bit_mask);
significand = si_shli(significand, 0xd);
qword is_zero_mask = si_cgti(significand, 0x0); /* all bits set if non-zero. */
expo_bias = si_and(is_zero_mask, expo_bias);
qword exponent_bias = si_a(significand, expo_bias); /* move expo up range, 0x07800000=>0x3f800000. */
exponent_bias = si_or(exponent_bias, sign_bit);

/* Damage: SPU I */
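
For readers without an SPU toolchain, the same decode can be written as scalar C. This is a sketch of my reading of the intrinsics above: the mask values (0x8000 and 0x7fff) and the bias constant 0x38000000 are assumptions inferred from the shifts and from the 0x07800000=>0x3f800000 comment, and the function name is hypothetical. Like the original ("IEEE754-ish"), it ignores denormals, infinities and NaNs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar equivalent of the si-style 16-bit to 32-bit float decode. */
static float half_to_float_ish(uint16_t half) {
    static_assert(sizeof(float) == sizeof(uint32_t), "expects 32-bit float");

    uint32_t sign = ((uint32_t)half & 0x8000u) << 16;  /* sign to bit 31 */
    uint32_t significand = ((uint32_t)half & 0x7fffu) << 13;
    /* Re-bias the exponent (15 -> 127) unless the value is zero. */
    uint32_t bias = significand != 0u ? 0x38000000u : 0u;
    uint32_t bits = (significand + bias) | sign;

    float result;
    memcpy(&result, &bits, sizeof result);  /* bit-cast without aliasing issues */
    return result;
}
```

For example, the half value 0x3c00 shifts to 0x07800000 and becomes 0x3f800000 after the bias, which is 1.0f.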

Page 35: SPU Assisted Rendering

- Problems:
– GPU version relied on bilinear filtering of volume texture to smooth damage.
– Filtering on SPU is a bit of a pain.
– Working out which events affect which vertices?

/* Damage: SPU II */

Page 36: SPU Assisted Rendering

- Simplest solution:
– Two-stage x-form:
• 1. Get data in volume texture-ish format.
• 2. Apply x-form to all vertices.

/* Damage: SPU III */

Page 37: SPU Assisted Rendering

- Filtering:
– Software bilinear filtering.
– Some interesting instructions in ISA will help here.

/* Damage: SPU IV */
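
Software bilinear filtering can be sketched in scalar C as follows. This is a generic 2D version over a grid of floats with clamped coordinates, not the SPU implementation from the talk; the function names and the texel-space coordinate convention are assumptions.

```c
#include <assert.h>
#include <stddef.h>

static float clampf(float v, float lo, float hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Bilinear sample of a width x height grid at texel coordinates (u, v),
 * where integer coordinates land exactly on texel centres. */
static float sample_bilinear(const float *texels, int width, int height,
                             float u, float v) {
    assert(texels != NULL && width > 0 && height > 0);
    u = clampf(u, 0.0f, (float)(width - 1));
    v = clampf(v, 0.0f, (float)(height - 1));

    int x0 = (int)u, y0 = (int)v;
    int x1 = x0 + 1 < width ? x0 + 1 : x0;    /* clamp at the right edge */
    int y1 = y0 + 1 < height ? y0 + 1 : y0;   /* clamp at the bottom edge */
    float fx = u - (float)x0, fy = v - (float)y0;

    /* Interpolate along x on both rows, then along y between the rows. */
    float top = texels[y0 * width + x0]
              + (texels[y0 * width + x1] - texels[y0 * width + x0]) * fx;
    float bottom = texels[y1 * width + x0]
                 + (texels[y1 * width + x1] - texels[y1 * width + x0]) * fx;
    return top + (bottom - top) * fy;
}
```

Extending this to a volume would add a third interpolation between two such slices.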

Page 38: SPU Assisted Rendering

- Data flow through SPU program is paramount to performance.
– Process in 16KB chunks.
– Multi-buffer input and output.

- If your system isn’t ‘mission critical’, align and lose double buffer.

/* Damage: Lessons I */

Page 39: SPU Assisted Rendering

- Make use of SoA mode data layout, liberated from rigidity of GPU programming model!

/* Damage: Lessons II */

x y z w        x x x x
x y z w   =>   y y y y
x y z w        z z z z
x y z w        w w w w
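
The transpose pictured above, written as a plain C sketch. The type names `Vec4` and `Vec4x4` and the function `aos_to_soa` are hypothetical; on the SPU this would be done with shuffles on 128-bit registers rather than scalar loads.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z, w; } Vec4;                /* one AoS element */
typedef struct { float x[4], y[4], z[4], w[4]; } Vec4x4;  /* SoA block of four */

/* Transpose four xyzw vectors into four component-wise lanes. */
static void aos_to_soa(const Vec4 in[4], Vec4x4 *out) {
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < 4; ++i) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
        out->w[i] = in[i].w;
    }
}
```

After the transpose, each array maps onto one SIMD register, so every instruction does useful work on four vertices at once.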

Page 40: SPU Assisted Rendering

- Add value to your SPU program for relatively small computational effort:
– We added some of the per-vertex lighting calculations for brake lights, for example.

/* Damage: Lessons III */

Page 41: SPU Assisted Rendering

/* Damage: Results */

Page 42: SPU Assisted Rendering

- Our solution:
– SPU-generated cube maps.
• 40 in total (accounting for double buffer).
• 8x8 per face.
– Deferred.
– Work split between GPU/SPU.

- Cars are lit with a mixture of things:
– SH (world + dynamic)
– Cube map lighting
– Vertex lighting

/* Lighting: Solution */

Page 43: SPU Assisted Rendering

- Input:
– Nearest 16 lights.

- Output:
– Cube map.

- Simples!

/* Lighting: Data */

[Figure: six Light inputs feeding the SPU, which outputs a Cube map.]

Page 44: SPU Assisted Rendering

- Each frame PPU kicks off SPU-tasks to build cube maps.
– Cube maps are double buffered to avoid artefacts and contention with GPU.

- Workload scalable.
– Number of cube maps per task can change dynamically if need be.

/* Lighting: Control */

Page 45: SPU Assisted Rendering

- On SPU we put some ‘Bizarre Creations Secret Sauce™’ into the cube maps:

/* Lighting: SPU */

Page 46: SPU Assisted Rendering

- On GPU, we sample with reflected view vector:

reflect(view_dir, normal);

/* Lighting: GPU */
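
For reference, a C sketch of what the shader `reflect` intrinsic computes, plus a major-axis cube face pick. The names `reflect3` and `cube_face` and the face numbering (+X, -X, +Y, -Y, +Z, -Z as 0 to 5) are assumptions for illustration; the GPU performs the face selection itself when sampling a cube map.

```c
#include <assert.h>

typedef struct { float x, y, z; } Vec3;

static float dot3(Vec3 a, Vec3 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* r = i - 2 * dot(n, i) * n, with n assumed to be unit length. */
static Vec3 reflect3(Vec3 incident, Vec3 normal) {
    float k = 2.0f * dot3(normal, incident);
    Vec3 r = { incident.x - k * normal.x,
               incident.y - k * normal.y,
               incident.z - k * normal.z };
    return r;
}

static float absf(float v) { return v < 0.0f ? -v : v; }

/* Pick the cube face by the largest-magnitude component of d.
 * Assumed numbering: 0..5 = +X, -X, +Y, -Y, +Z, -Z. */
static int cube_face(Vec3 d) {
    float ax = absf(d.x), ay = absf(d.y), az = absf(d.z);
    assert(ax + ay + az > 0.0f);  /* direction must be non-zero */
    if (ax >= ay && ax >= az) return d.x >= 0.0f ? 0 : 1;
    if (ay >= az)             return d.y >= 0.0f ? 2 : 3;
    return d.z >= 0.0f ? 4 : 5;
}
```

A view direction pointing straight down onto an upward-facing surface reflects straight back up, so it would sample the +Y face.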

Page 47: SPU Assisted Rendering

/* Lighting: Results I */

Page 48: SPU Assisted Rendering

/* Lighting: Results I */

Page 49: SPU Assisted Rendering

/* Lighting: Results I */

Page 50: SPU Assisted Rendering

/* Lighting: Results II */

Page 51: SPU Assisted Rendering

/* Lighting: Results II */

Page 52: SPU Assisted Rendering

/*
 * Part II w/ Steve McAuley
 */

SPU Acceleration of Fragment Shading

Page 53: SPU Assisted Rendering

- Problem:
– Our fragment programs are expensive.

- Solution:
– Let’s use the SPUs to help.

/* The Problem */

Page 54: SPU Assisted Rendering

/* The Pipeline */

[Figure: pipeline diagram. Stages: Vertex shader, Triangle setup, Rasterisation, Fragment shader, ROP. Inputs: Vertices, Textures.]

Page 55: SPU Assisted Rendering

- Solution:
– Make look-up textures on the SPUs to speed up our fragment programs.

- What could we look up?
– Lighting
– Shadows
– Ambient occlusion
– Fog

- Sounds like deferred rendering!

/* Look It Up! */

Page 56: SPU Assisted Rendering

Forward Rendering = FAIL

Page 57: SPU Assisted Rendering

- Goal:
– Move dynamic lighting into a look-up texture.

- Solution:
– Sounds like deferred rendering!

- In Blur, we used a light pre-pass renderer.

/* Case Study: Lighting */

Page 58: SPU Assisted Rendering

/* Light Pre-Pass */

[Figure: light pre-pass data flow. Labels: Geometry, Normals, Depth, Real-Time Lighting, Geometry, Final Colour.]

Page 59: SPU Assisted Rendering

/* A Frame of Blur */

[Figure: GPU frame timeline. Stages: Pre-Pass; Lights; Mirror, Cube Map & Reflection; Solid; Alpha; Post.]

Page 60: SPU Assisted Rendering

- Move the lights onto the SPUs:
– But there’s a gap!

/* A Frame of Blur */

[Figure: frame timeline. GPU row: Pre-Pass; Mirror, Cube Map & Reflection; Solid; Alpha; Post. SPU row: Lights.]

Page 61: SPU Assisted Rendering

- Option #1:
– Defer the lighting by a frame.

/* A Frame of Blur */

[Figure: frame timeline. GPU row: Pre-Pass; Mirror, Cube Map & Reflection; Solid; Alpha; Post. SPU row: Lights.]

Page 62: SPU Assisted Rendering

- Option #2:
– Parallelise with another part of the rendering.

/* A Frame of Blur */

[Figure: frame timeline. GPU row: Pre-Pass; Mirror, Cube Map & Reflection; Solid; Alpha; Post. SPU row: Lights.]

Page 63: SPU Assisted Rendering

- Option #2:
– Taking it further…

/* A Frame of Blur */

[Figure: frame timeline. GPU row: Pre-Pass; Mirror, Cube Map & Reflection; Solid; Alpha; Post. SPU row: Lights; Shadows; Blur.]

Page 64: SPU Assisted Rendering

- Key point: you must find something to parallelise with!
– Design your engine accordingly!
– Otherwise you risk a frame of latency.

- This is true multi-GPU.
– Two graphics processors, working on separate tasks, in parallel.

/* Parallelism */

Page 65: SPU Assisted Rendering

- Goal:
– Move the lighting stage of the light pre-pass onto the SPUs.

- There are just six easy steps to enlightenment…

/* Case Study: Lighting */

Page 66: SPU Assisted Rendering

/* Step #1: The Data */

Transform

Normals

Depth

Lights

Page 67: SPU Assisted Rendering

/* Step #1: The Data */

Transform

Lights

Normal X

Normal Y

Depth Hi

Depth Lo
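
One way to read the four channels above is as a single packed pixel. The sketch below assumes 8 bits per channel, a 16-bit depth split into high and low bytes, and normal components stored as unsigned bytes mapped to [-1, 1]. All of that, including the names, is an assumption; the slide only names the channels.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical packed pixel: 8 bits per channel is an assumption. */
typedef struct {
    uint8_t normal_x;
    uint8_t normal_y;
    uint8_t depth_hi;
    uint8_t depth_lo;
} PackedPixel;

static_assert(sizeof(PackedPixel) == 4, "one pixel should be 32 bits");

/* Rebuild an assumed 16-bit depth from its two bytes. */
static uint16_t unpack_depth(PackedPixel p) {
    return (uint16_t)(((unsigned)p.depth_hi << 8) | p.depth_lo);
}

/* Map an unsigned byte to [-1, 1] (assumed normal encoding). */
static float unpack_normal_component(uint8_t v) {
    return (float)v * (2.0f / 255.0f) - 1.0f;
}
```

With only X and Y stored, the third normal component would have to be reconstructed from the unit-length constraint, which is not shown here.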

Page 68: SPU Assisted Rendering

- We have six SPUs, and each of them wants a lighting job…

- Divide the frame buffer into tiles.

- Each tile is a unit of work.

/* Step #2: Jobs */

Page 69: SPU Assisted Rendering

- Keep working until they’re all gone!
– (Then hand out the P45s…)

/* Step #2: Jobs */

[Figure: six SPUs sharing one job Index, advanced by Atomic Increment.]
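
The shared-index scheme can be sketched with C11 atomics. On Cell the increment would go through the SPU atomic DMA facilities rather than `<stdatomic.h>`; the tile count and the names `TileQueue`, `claim_tile` and `worker_run` are assumptions made for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

enum { TILE_COUNT = 12 };  /* assumed number of tiles in the frame */

typedef struct {
    atomic_uint next_tile;            /* shared job index */
    unsigned processed[TILE_COUNT];   /* how often each tile was lit */
} TileQueue;

/* Atomically claim the next tile; returns -1 once all are taken. */
static int claim_tile(TileQueue *q) {
    assert(q != NULL);
    unsigned index = atomic_fetch_add(&q->next_tile, 1u);
    return index < TILE_COUNT ? (int)index : -1;
}

/* Worker loop: light up to max_tiles tiles, return how many were done. */
static unsigned worker_run(TileQueue *q, unsigned max_tiles) {
    unsigned done = 0;
    while (done < max_tiles) {
        int tile = claim_tile(q);
        if (tile < 0) break;      /* nothing left: all tiles are gone */
        q->processed[tile]++;     /* stand-in for lighting the tile */
        ++done;
    }
    return done;
}
```

Because each worker only touches tiles it has claimed, no other synchronisation is needed between workers, and faster workers naturally pick up more tiles.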

Page 70: SPU Assisted Rendering

- Can be a time sink if you’re not careful!
– Expect to find your worst bugs here.
– Best to get it right first time!

/* Step #3: Sync */

Page 71: SPU Assisted Rendering

/* Step #3: Sync */

[Figure, built up over slides 71 to 77: frame timeline. GPU row: Pre-Pass; Mirror, Cube Map & Reflection; Solid; Alpha; Post. SPU row: Lights. Sync annotations: Write Label, Wait on Label, Jump To Self.]

Page 78: SPU Assisted Rendering

- Build a view frustum for each tile.
– Remember, we have the depth buffer so can calculate the minimum and maximum depth!

- Gather only the lights that intersect this frustum.

- Cull an entire tile if:
– Depth min and max are both far clip.
– No lights intersect.

/* Step #4: Culling */
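
A reduced sketch of the culling step, covering only the depth part of the tile frustum: it finds the tile's depth range and gathers lights whose spheres overlap that range. The side planes of the frustum are omitted, and the types and names (`DepthBounds`, `LightSphere`, `gather_lights`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { float min_depth, max_depth; } DepthBounds;

/* Only the light's extent along the view axis is modelled here. */
typedef struct { float view_z; float radius; } LightSphere;

/* Scan the tile's depth samples for their minimum and maximum. */
static DepthBounds tile_depth_bounds(const float *depth, size_t count) {
    assert(depth != NULL && count > 0);
    DepthBounds b = { depth[0], depth[0] };
    for (size_t i = 1; i < count; ++i) {
        if (depth[i] < b.min_depth) b.min_depth = depth[i];
        if (depth[i] > b.max_depth) b.max_depth = depth[i];
    }
    return b;
}

/* If even the nearest sample is at the far clip, so is the furthest. */
static bool tile_is_empty(DepthBounds b, float far_clip) {
    return b.min_depth >= far_clip;
}

/* Write indices of lights overlapping the depth range; return the count. */
static size_t gather_lights(const LightSphere *lights, size_t light_count,
                            DepthBounds b, size_t *out_indices) {
    size_t gathered = 0;
    for (size_t i = 0; i < light_count; ++i) {
        bool overlaps = lights[i].view_z - lights[i].radius <= b.max_depth
                     && lights[i].view_z + lights[i].radius >= b.min_depth;
        if (overlaps) out_indices[gathered++] = i;
    }
    return gathered;
}
```

A tile can be skipped when `tile_is_empty` is true or when `gather_lights` returns zero, matching the two cull conditions listed above.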

Page 79: SPU Assisted Rendering

/* Step #5: Light! */

Page 80: SPU Assisted Rendering

- Multi-buffering:
– Do the following simultaneously:
• Load data for next job.
• Process data for the current job.
• Save data from the previous job.
– Costs local store but is usually worth it.

/* Step #6: Optimise! */
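
The buffer rotation behind multi-buffering can be sketched as below. Here `memcpy` stands in for DMA and is synchronous, so nothing actually overlaps; on an SPU the loads and stores would be asynchronous transfers that are waited on by tag before a buffer is reused. The chunk size, the doubling "work", and all names are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { CHUNK_FLOATS = 4 };  /* tiny for illustration only */

/* Stand-in for the real per-chunk work. */
static void process_chunk(float *data, size_t n) {
    for (size_t i = 0; i < n; ++i) data[i] *= 2.0f;
}

/* Stream count floats through two local buffers: while one chunk is
 * processed, the next chunk is already being fetched into the other. */
static void stream_process(const float *src, float *dst, size_t count) {
    float buffer[2][CHUNK_FLOATS];
    assert(count % CHUNK_FLOATS == 0);
    size_t chunks = count / CHUNK_FLOATS;
    if (chunks == 0) return;

    memcpy(buffer[0], src, sizeof buffer[0]);  /* prime: load first chunk */
    for (size_t i = 0; i < chunks; ++i) {
        size_t cur = i & 1u, next = cur ^ 1u;
        if (i + 1 < chunks)                    /* load data for next job */
            memcpy(buffer[next], src + (i + 1) * CHUNK_FLOATS,
                   sizeof buffer[next]);
        process_chunk(buffer[cur], CHUNK_FLOATS);  /* process current job */
        memcpy(dst + i * CHUNK_FLOATS, buffer[cur],
               sizeof buffer[cur]);            /* save the finished job */
    }
}
```

Overlapping the save of the previous job as well, as the slide describes, would take separate output buffers, which is where the extra local store cost comes from.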

Page 81: SPU Assisted Rendering

- Structure-of-arrays:
– Transpose your data for massive damage!
– e.g.

/* Step #6: Optimise! */

x y z w        x x x x
x y z w   =>   y y y y
x y z w        z z z z
x y z w        w w w w

Page 82: SPU Assisted Rendering

- Array-of-structures:
– 1 dot product, 23 cycles

qword d0 = si_fm(xyz0, abc0);
qword d1 = si_rotqbyi(d0, 0x4);
qword d2 = si_rotqbyi(d0, 0x8);
qword dot = si_fa(d0, d1);
dot = si_fa(dot, d2);

- Structure-of-arrays:
– 4 dot products, 18 cycles

qword dot0123 = si_fm(x0123, a0123);
dot0123 = si_fma(y0123, b0123, dot0123);
dot0123 = si_fma(z0123, c0123, dot0123);

/* Step #6: Optimise! */
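
A scalar C emulation of the structure-of-arrays version, for readers who want to check the arithmetic off-target. `Float4` stands in for a 128-bit register, and `f4_mul` and `f4_madd` mimic `si_fm` and `si_fma` lane by lane; these helper names are hypothetical and the sketch says nothing about cycle counts.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float v[4]; } Float4;  /* stand-in for one qword */

static_assert(sizeof(Float4) == 16, "Float4 models one 128-bit register");

/* Lane-wise a * b, like si_fm. */
static Float4 f4_mul(Float4 a, Float4 b) {
    Float4 r;
    for (size_t i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i];
    return r;
}

/* Lane-wise a * b + c, like si_fma. */
static Float4 f4_madd(Float4 a, Float4 b, Float4 c) {
    Float4 r;
    for (size_t i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i] + c.v[i];
    return r;
}

/* Four 3-component dot products at once: lane i holds
 * x[i]*a[i] + y[i]*b[i] + z[i]*c[i]. */
static Float4 dot3_x4(Float4 x, Float4 y, Float4 z,
                      Float4 a, Float4 b, Float4 c) {
    Float4 dot = f4_mul(x, a);
    dot = f4_madd(y, b, dot);
    dot = f4_madd(z, c, dot);
    return dot;
}
```

Every lane of every instruction contributes to a result here, whereas the array-of-structures version spends two rotates and two adds to combine the lanes of a single product.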

Page 83: SPU Assisted Rendering

- Batching:
– Light 16 pixels at a time.
• Minimises dependent instruction stalls.
• Helps compiler with even/odd pipeline balance.
– Use trial and error to find your ideal batch size!
• A balance between register spilling and setup cost.

/* Step #6: Optimise! */

Page 84: SPU Assisted Rendering

- Ran on 3 SPUs.
- Slightly faster than the RSX.
- An optimisation even if you have nothing to parallelise with!

/* Case Study: Lighting */

Page 85: SPU Assisted Rendering

/* Case Study: Lighting */

Page 86: SPU Assisted Rendering

• Lighting
• Damage
• Rendering
• Physics

/* The Complete Picture */

Page 87: SPU Assisted Rendering

- Use the SPUs to accelerate your rendering!
– Think about the data.
– Design your engine appropriately.
– Avoid frames of latency.
– Keep synchronisation simple.
– Add value.

- It’s actually really easy, try it!

/* Conclusion */

Page 88: SPU Assisted Rendering

- Steven Tovey & Stephen McAuley, “Parallelized Light Pre-Pass Rendering with the Cell Broadband Engine”, GPU Pro

- Stephen McAuley & Steven Tovey, “A Bizarre Way to do Real-Time Lighting”, Develop in Liverpool 2009

/* Further Reading */

Page 89: SPU Assisted Rendering

If you’re talented, then

we’re hiring ;)

[email protected]

Page 90: SPU Assisted Rendering

lqd $r1,question_count

stopd $r0,$r0,0x1

; thanks for listening! ;)

brnz $r1,questions