
/*
 * SPU Assisted Rendering.
 */

Steven Tovey & Stephen McAuley

Graphics Programmers, Bizarre Creations Ltd.

http://www.bizarrecreations.com


/* Welcome! */

- We have some copies of Blur to give away, stick around and fill out your evaluation sheets!

- Part I (w/ Steven Tovey):
  – What is SPU Assisted Rendering?
  – Case Studies
    • Car Damage
    • Car Lighting
- Part II (w/ Stephen McAuley):
  – Fragment Shading
  – Parallelisation
  – Case Study
    • Pre-pass Lighting on SPUs
- Questions

/* Agenda */

/*
 * Part I w/ Steven Tovey
 */

SPU Acceleration of Car Rendering in Blur

- Assisting RSX™ with the SPUs (der!)

- Why do this?
  – Free up RSX™ to do other things.
  – Enable otherwise unfeasible techniques.
  – Optimise rendering.

/* What is SPU AR? I */

- Problems involved?
  - Synchronisation.
  - Optimising SPU modules.
  - Memory considerations:
    - Local store
    - Resource allocation
  - Etc.

/* What is SPU AR? II */

- Original Xenon implementation:
  - Totally GPU-based.
  - 2x VTF (volume & 2D) for damage.
  - Large amount of work in vertex shader, making cars in Blur heavily vertex-bound.
  - All lighting in pixel shader.

/* Case Study: Cars I */

- Loose fitting damage volume:

/* Case Study: Cars II */

- Control points:

/* Case Study: Cars III */

- Morph targets:

/* Case Study: Cars IV */

- Scratch/dent textures:

/* Case Study: Cars V */

- Challenges:
  - Increase rendering speed of cars.
  - Maintain same quality.

/* Case Study: Cars VI */

- Our solution:
  - Large parts are SPU based.
  - On demand.
  - Sync-free.
  - Deferred.
  - Work split between GPU/SPU.

/* Damage: Solution */

- 2 vertex streams:
  - Read-only car vertex data.
    - Shared between similar cars.
  - SPU-modified damage vertex data.
    - Per instance.
  - One-to-one mapping of vertices.
- Control points:
  - Crude approximation of volume preservation.
- Dent/scratch blend levels.

/* Damage: Data I */

Stream0: Position, Normal, UV0, UV1, PosOffset, NormalOffset, AO, ControlPoints

Stream1: SPU_Position, SPU_Normal

/* Damage: Data II */

- MFC writes data atomically in 16 byte chunks...
  - If the vertex format is exactly 16 bytes, the SPU can change a vertex atomically.
- If you can live with the odd vertex being wrong for a frame, this could be a huge win!
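A minimal sketch of a vertex format that fits a single 16-byte chunk. The name `DamageVertex` and the field layout (four 16-bit position components followed by four 16-bit normal components) are assumptions for illustration; the slides only say that Stream1 carries SPU_Position and SPU_Normal.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-byte layout for the SPU-written stream (Stream1). */
typedef struct {
    alignas(16) uint16_t position[4]; /* x, y, z, w as 16-bit values */
    uint16_t normal[4];               /* x, y, z plus one spare slot */
} DamageVertex;

/* Each vertex must occupy exactly one naturally aligned 16-byte chunk. */
static_assert(sizeof(DamageVertex) == 16, "vertex must be 16 bytes");
static_assert(alignof(DamageVertex) == 16, "vertex must be 16-byte aligned");

/* Byte offset of vertex i within the stream; always a multiple of 16. */
static size_t damage_vertex_offset(size_t i) {
    return i * sizeof(DamageVertex);
}
```

With this layout no vertex straddles two chunks, so a reader never sees half of an old vertex and half of a new one.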

/* Damage: Data III */

/* Damage: Data IV */

[Diagram. Labels: RSX Local, Main, SPU, Read-only Vertices, Write-only Vertices.]

- Damage events from game-side code are queued.
  - Note: damage is purely superficial; there is no link to player health.

/* Damage: Events */

[Diagram: Game Code queues a series of Impact events.]

/* Damage: Data V */

[Diagram. Labels: queued Impact events, Constants, SPU, Read-only Vertices*, Write-only Vertices*, GPU.]

* With respect to the SPU.

/* Damage: Data VI */

[Diagram. Labels: Kick off SPU tasks, SPU, Read-only Vertices*, Write-only Vertices*, GPU.]

* With respect to the SPU.

- Fewer sync points should be the goal of any multi-core code:

/* Damage: Control */

[Timeline diagram, built up over several steps. Labels: Other Work (1), PPU Damage, Vertex Work (several tasks), Flag, Other Work (2).]

- Pretty easy to go from shaders to SPU intrinsics or asm.
  - We favour si style for simplicity and ease.

/* de-code into IEEE754-ish 32bit float (meh): */
qword sign_bit      = si_and(result, sign_bit_mask);
      sign_bit      = si_shli(sign_bit, 0x10);      /* move 16 bits into correct place. */
qword significand   = si_and(result, mant_bit_mask);
      significand   = si_shli(significand, 0xd);
qword is_zero_mask  = si_cgti(significand, 0x0);    /* all bits set if non-zero. */
      expo_bias     = si_and(is_zero_mask, expo_bias);
qword exponent_bias = si_a(significand, expo_bias); /* move expo up range, 0x07800000=>0x3f800000. */
      exponent_bias = si_or(exponent_bias, sign_bit);
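For readers without an SPU toolchain, here is a scalar sketch of the same decode in portable C. The mask values (0x8000 and 0x7fff) and the bias constant 0x38000000 are assumptions inferred from the "0x07800000=>0x3f800000" comment; the slide does not show them. The function name `half_to_float` is also ours.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "expects 32-bit float");

/* Scalar equivalent of the SPU decode. Denormals, infinities and NaNs are
   not handled, in keeping with the "IEEE754-ish" caveat. */
static float half_to_float(uint16_t h) {
    uint32_t sign = ((uint32_t)h & 0x8000u) << 16;   /* sign up to bit 31   */
    uint32_t mag  = ((uint32_t)h & 0x7fffu) << 13;   /* exponent + mantissa */
    uint32_t bias = (mag != 0u) ? 0x38000000u : 0u;  /* rebias 15 -> 127    */
    uint32_t bits = (mag + bias) | sign;
    float f;
    memcpy(&f, &bits, sizeof f);                     /* bit cast            */
    return f;
}
```

For example, the half value 0x3C00 shifts to 0x07800000 and, once the bias is added, becomes 0x3f800000, which is 1.0f.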

/* Damage: SPU I */

- Problems:
  - GPU version relied on bilinear filtering of volume texture to smooth damage.
  - Filtering on SPU is a bit of a pain.
  - Working out which events affect which vertices?

/* Damage: SPU II */

- Simplest solution:
  - Two-stage x-form:
    - 1. Get data in volume texture-ish format.
    - 2. Apply x-form to all vertices.

/* Damage: SPU III */

- Filtering:
  - Software bilinear filtering.
  - Some interesting instructions in ISA will help here.
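A minimal scalar sketch of software bilinear filtering, shown on a 2D grid of floats for brevity. The function name, the row-major layout and the unnormalised coordinates are assumptions; the talk does not show its SPU implementation, which would presumably work on vectors rather than single floats.

```c
#include <assert.h>
#include <stddef.h>

/* Bilinearly sample a row-major width x height grid of floats at (u, v),
   with u in [0, width - 1] and v in [0, height - 1]. */
static float sample_bilinear(const float *grid, size_t width, size_t height,
                             float u, float v) {
    assert(width >= 2 && height >= 2);
    assert(u >= 0.0f && v >= 0.0f);
    size_t x0 = (size_t)u;
    size_t y0 = (size_t)v;
    if (x0 > width - 2)  x0 = width - 2;   /* keep x0 + 1 inside the grid */
    if (y0 > height - 2) y0 = height - 2;  /* keep y0 + 1 inside the grid */
    float fx = u - (float)x0;
    float fy = v - (float)y0;
    const float *row0 = grid + y0 * width;
    const float *row1 = row0 + width;
    float top    = row0[x0] + fx * (row0[x0 + 1] - row0[x0]);
    float bottom = row1[x0] + fx * (row1[x0 + 1] - row1[x0]);
    return top + fy * (bottom - top);
}
```

The same pair of lerps extends to a third axis when the source data is a volume.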

/* Damage: SPU IV */

- Data flow through SPU program is paramount to performance.
  – Process in 16KB chunks.
  – Multi-buffer input and output.
- If your system isn’t ‘mission critical’, align and lose double buffer.

/* Damage: Lessons I */

- Make use of SoA mode data layout, liberated from rigidity of GPU programming model!

/* Damage: Lessons II */

[Diagram: four xyzw vectors (AoS) rearranged into the rows x x x x, y y y y, z z z z, w w w w (SoA).]

- Add value to your SPU program for relatively small computational effort:
  - We added some of the per-vertex lighting calculations for brake lights, for example.

/* Damage: Lessons III */

/* Damage: Results */

- Our solution:
  - SPU-generated cube maps.
    - 40 in total (accounting for double buffer).
    - 8x8 per face.
  - Deferred.
  - Work split between GPU/SPU.
- Cars are lit with a mixture of things:
  - SH (world + dynamic)
  - Cube map lighting
  - Vertex lighting

/* Lighting: Solution */

- Input:
  - Nearest 16 lights.
- Output:
  - Cube map.
- Simples!

/* Lighting: Data */

[Diagram: a set of Lights feeds the SPU, which outputs a Cube map.]

- Each frame PPU kicks off SPU-tasks to build cube maps.
  - Cube maps are double buffered to avoid artefacts and contention with GPU.
- Workload scalable.
  - Number of cube maps per task can change dynamically if need be.

/* Lighting: Control */

- On SPU we put some ‘Bizarre Creations Secret Sauce™’ into the cube maps:

/* Lighting: SPU */

- On GPU, we sample with reflected view vector:

reflect(view_dir, normal);
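For reference, a scalar C sketch of what the shader `reflect` intrinsic computes, r = i - 2 * dot(n, i) * n. The `Vec3` type and the function names are ours; only the formula is standard.

```c
#include <assert.h>

typedef struct { float x, y, z; } Vec3;

static float dot3(Vec3 a, Vec3 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Reflect incident vector i about unit-length normal n. */
static Vec3 reflect3(Vec3 i, Vec3 n) {
    float len_err = dot3(n, n) - 1.0f;
    assert(len_err < 1e-3f && len_err > -1e-3f);  /* n must be normalised */
    float k = 2.0f * dot3(n, i);
    Vec3 r = { i.x - k * n.x, i.y - k * n.y, i.z - k * n.z };
    return r;
}
```

The resulting vector is what indexes the cube map.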

/* Lighting: GPU */

/* Lighting: Results I */

/* Lighting: Results II */

/*
 * Part II w/ Steve McAuley
 */

SPU Acceleration of Fragment Shading

- Problem:
  – Our fragment programs are expensive.
- Solution:
  – Let’s use the SPUs to help.

/* The Problem */

[Diagram: Vertices → Vertex shader → Triangle setup → Rasterisation → Fragment shader → ROP, with Textures as a further input.]

/* The Pipeline */

- Solution:
  – Make look-up textures on the SPUs to speed up our fragment programs.
- What could we look up?
  – Lighting
  – Shadows
  – Ambient occlusion
  – Fog
- Sounds like deferred rendering!

/* Look It Up! */

Forward Rendering = FAIL

- Goal:
  – Move dynamic lighting into a look-up texture.
- Solution:
  – Sounds like deferred rendering!
- In Blur, we used a light pre-pass renderer.

/* Case Study: Lighting */

/* Light Pre-Pass */

[Diagram. Labels: Geometry, Normals, Depth, Real-Time Lighting, Geometry, Final Colour.]

/* A Frame of Blur */

[Timeline diagram. GPU row labels: Pre-Pass, Lights, Mirror, Cube Map & Reflection, Solid, Alpha, Post.]

- Move the lights onto the SPUs:
  – But there’s a gap!

[Timeline diagram. Labels: GPU (Pre-Pass, Lights, Mirror, Cube Map & Reflection, Solid, Alpha, Post) and SPUs.]

- Option #1:
  – Defer the lighting by a frame.

[Timeline diagram. Labels: GPU (Pre-Pass, Mirror, Solid, Alpha, Post) and SPUs (Lights).]

- Option #2:
  – Parallelise with another part of the rendering.

[Timeline diagram. SPUs row: Lights.]

- Option #2:
  – Taking it further…

[Timeline diagram. SPUs row: Lights, Shadows, Blur.]

- Key point: you must find something to parallelise with!
  – Design your engine accordingly!
  – Otherwise you risk a frame of latency.
- This is true multi-GPU.
  – Two graphics processors, working on separate tasks, in parallel.

/* Parallelism */

- Goal:
  – Move the lighting stage of the light pre-pass onto the SPUs.
- There are just six easy steps to enlightenment…


/* Step #1: The Data */

[Diagram. Labels: Transform, Normals, Depth, Lights.]

/* Step #1: The Data */

[Diagram. Labels: Transform, Lights, Normal X, Normal Y, Depth Hi, Depth Lo.]

- We have six SPUs, and each of them wants a lighting job…

- Divide the frame buffer into tiles.

- Each tile is a unit of work.

/* Step #2: Jobs */

- Keep working until they’re all gone!
  – (Then hand out the P45s…)

/* Step #2: Jobs */

[Diagram: six SPUs each take their next tile through an Atomic Increment of a shared Index.]
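The job hand-out can be sketched with a single shared counter. This is a portable illustration using C11 atomics as a stand-in for whatever atomic primitive the SPU code used; `TileQueue`, `run_worker` and `mark_tile` are hypothetical names, and `mark_tile` merely stands in for the real lighting work.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/* Shared counter: each worker claims the next unprocessed tile. */
typedef struct {
    atomic_uint next_tile;
    unsigned    tile_count;
} TileQueue;

typedef void (*TileFn)(unsigned tile, void *ctx);

static void tile_queue_init(TileQueue *q, unsigned tile_count) {
    assert(tile_count <= (unsigned)INT_MAX);
    atomic_init(&q->next_tile, 0u);
    q->tile_count = tile_count;
}

/* Returns the claimed tile index, or -1 once every tile has been taken. */
static int tile_queue_claim(TileQueue *q) {
    unsigned i = atomic_fetch_add(&q->next_tile, 1u);
    return (i < q->tile_count) ? (int)i : -1;
}

/* Worker loop: keep claiming tiles until none remain. Returns tiles done. */
static unsigned run_worker(TileQueue *q, TileFn fn, void *ctx) {
    unsigned done = 0;
    int tile;
    while ((tile = tile_queue_claim(q)) >= 0) {
        fn((unsigned)tile, ctx);
        ++done;
    }
    return done;
}

/* Stand-in for the per-tile lighting job: records that the tile was visited. */
static void mark_tile(unsigned tile, void *ctx) {
    ((unsigned char *)ctx)[tile] = 1;
}
```

Because workers pull tiles rather than being assigned a fixed range, a worker that lands on cheap tiles simply claims more of them.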

- Can be a time sink if you’re not careful!
  – Expect to find your worst bugs here.
  – Best to get it right first time!

/* Step #3: Sync */

/* Step #3: Sync */

[Timeline diagram, built up over five slides. GPU row: Pre-Pass, Solid, Alpha, Post. SPUs row: Lights. Labels added in turn: Write Label, Wait on Label, Jump To Self.]

- Build a view frustum for each tile.
  – Remember, we have the depth buffer so can calculate the minimum and maximum depth!
- Gather only the lights that intersect this frustum.
- Cull an entire tile if:
  – Depth min and max are both far clip.
  – No lights intersect.
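A reduced sketch of the depth part of this culling. It assumes linear view-space depths and point lights with a radius, and it only tests lights against the tile's depth range; a full version would also test the four side planes of the tile frustum. All names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float min_z, max_z; } DepthBounds;

/* Scan one tile's depths for its nearest and farthest values. */
static DepthBounds tile_depth_bounds(const float *depth, size_t count) {
    assert(count > 0);
    DepthBounds b = { depth[0], depth[0] };
    for (size_t i = 1; i < count; ++i) {
        if (depth[i] < b.min_z) b.min_z = depth[i];
        if (depth[i] > b.max_z) b.max_z = depth[i];
    }
    return b;
}

/* If even the nearest sample sits at the far clip, nothing needs lighting. */
static int tile_is_empty(DepthBounds b, float far_clip) {
    return b.min_z >= far_clip;
}

/* Light sphere against the tile's depth range only (side planes omitted). */
static int light_touches_depth_range(float light_z, float radius,
                                     DepthBounds b) {
    return (light_z + radius >= b.min_z) && (light_z - radius <= b.max_z);
}
```

A tile can be skipped when `tile_is_empty` holds or when no light passes the intersection test.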

/* Step #4: Culling */

/* Step #5: Light! */

- Multi-buffering:
  – Do the following simultaneously:
    • Load data for next job.
    • Process data for the current job.
    • Save data from the previous job.
  – Costs local store but is usually worth it.
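The buffer rotation can be sketched in portable C. Here `memcpy` stands in for asynchronous DMA gets and puts, so this version runs serially and only illustrates which buffer plays which role at each step. The 16 KB chunk size, the three-buffer rotation and every name below are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { CHUNK_FLOATS = 4096 };  /* 16 KB of floats per chunk */

/* Stand-in for the real per-chunk work. */
static void process_chunk(float *data, size_t n) {
    for (size_t i = 0; i < n; ++i) data[i] *= 2.0f;
}

/* Number of floats in the given chunk (the last one may be short). */
static size_t chunk_len(size_t total, size_t chunk) {
    size_t left = total - chunk * CHUNK_FLOATS;
    return left < CHUNK_FLOATS ? left : CHUNK_FLOATS;
}

/* Streams src to dst through three rotating local buffers: while chunk c is
   processed, chunk c + 1 is loading and chunk c - 1 is being saved. */
static void stream_process(const float *src, float *dst, size_t total) {
    static float local[3][CHUNK_FLOATS];
    size_t chunks = (total + CHUNK_FLOATS - 1) / CHUNK_FLOATS;
    if (chunks == 0) return;
    assert(src != NULL && dst != NULL);
    memcpy(local[0], src, chunk_len(total, 0) * sizeof(float)); /* prime */
    for (size_t c = 0; c < chunks; ++c) {
        float *cur = local[c % 3];
        if (c + 1 < chunks)                     /* "get" for the next job */
            memcpy(local[(c + 1) % 3], src + (c + 1) * CHUNK_FLOATS,
                   chunk_len(total, c + 1) * sizeof(float));
        process_chunk(cur, chunk_len(total, c));          /* current job  */
        memcpy(dst + c * CHUNK_FLOATS, cur,               /* "put" result */
               chunk_len(total, c) * sizeof(float));
    }
}
```

With real DMA the put for chunk c would still be in flight while chunk c + 1 is processed, which is why a third buffer is needed.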

/* Step #6: Optimise! */

- Structure-of-arrays:
  – Transpose your data for massive damage!
  – e.g.

[Diagram: four xyzw vectors transposed into the rows x x x x, y y y y, z z z z, w w w w.]

- Array-of-structures:
  – 1 dot product, 23 cycles

qword d0  = si_fm(xyz0, abc0);
qword d1  = si_rotqbyi(d0, 0x4);
qword d2  = si_rotqbyi(d0, 0x8);
qword dot = si_fa(d0, d1);
      dot = si_fa(dot, d2);

- Structure-of-arrays:
  – 4 dot products, 18 cycles

qword dot0123 = si_fm(x0123, a0123);
      dot0123 = si_fma(y0123, b0123, dot0123);
      dot0123 = si_fma(z0123, c0123, dot0123);
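The same comparison can be followed in portable scalar C, where each array index plays the role of one SIMD lane. The types `Vec4` and `Vec4x4` and the function names are hypothetical; the cycle counts above apply to the SPU code, not to this sketch.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z, w; } Vec4;               /* AoS element   */
typedef struct { float x[4], y[4], z[4], w[4]; } Vec4x4; /* SoA, 4 lanes  */

/* Transpose four xyzw vectors into xxxx / yyyy / zzzz / wwww lanes. */
static Vec4x4 transpose_to_soa(const Vec4 v[4]) {
    Vec4x4 s;
    for (size_t i = 0; i < 4; ++i) {
        s.x[i] = v[i].x;
        s.y[i] = v[i].y;
        s.z[i] = v[i].z;
        s.w[i] = v[i].w;
    }
    return s;
}

/* Four 3-component dot products: one multiply and two multiply-adds per
   lane, with no rotates or other cross-lane shuffles. */
static void dot3_soa(const Vec4x4 *a, const Vec4x4 *b, float out[4]) {
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t i = 0; i < 4; ++i) {
        float d = a->x[i] * b->x[i];
        d = a->y[i] * b->y[i] + d;
        d = a->z[i] * b->z[i] + d;
        out[i] = d;
    }
}
```

In the AoS version the rotates exist only to line up x, y and z within one register; after the transpose that work disappears and every lane produces a useful result.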


- Batching:
  – Light 16 pixels at a time.
    • Minimises dependent instruction stalls.
    • Helps compiler with even/odd pipeline balance.
  – Use trial and error to find your ideal batch size!
    • A balance between register spilling and setup cost.


- Ran on 3 SPUs.
- Slightly faster than the RSX.
- An optimisation even if you have nothing to parallelise with!


• Lighting
• Damage
• Rendering
• Physics

/* The Complete Picture */

- Use the SPUs to accelerate your rendering!
  – Think about the data.
  – Design your engine appropriately.
  – Avoid frames of latency.
  – Keep synchronisation simple.
  – Add value.
- It’s actually really easy, try it!

/* Conclusion */

- Steven Tovey & Stephen McAuley, “Parallelized Light Pre-Pass Rendering with the Cell Broadband Engine”, GPU Pro

- Stephen McAuley & Steven Tovey, “A Bizarre Way to do Real-Time Lighting”, Develop in Liverpool 2009

/* Further Reading */

If you’re talented, then we’re hiring ;)


lqd $r1,question_count
stopd $r0,$r0,0x1
; thanks for listening! ;)
brnz $r1,questions
