/*
 * SPU Assisted Rendering.
 */
Steven Tovey & Stephen McAuley
Graphics Programmers, Bizarre Creations Ltd.
http://www.bizarrecreations.com
[email protected]@bizarrecreations.com
/* Welcome! */
- We have some copies of Blur to give away, stick around and fill out your evaluation sheets!
- Part I (w/ Steven Tovey):
  – What is SPU Assisted Rendering?
  – Case Studies
    • Car Damage
    • Car Lighting
- Part II (w/ Stephen McAuley):
  – Fragment Shading
  – Parallelisation
  – Case Study
    • Pre-pass Lighting on SPUs
- Questions
/* Agenda */
/*
 * Part I w/ Steven Tovey
 */
SPU Acceleration of Car Rendering in Blur
- Assisting RSX™ with the SPUs (der!)
- Why do this?
  – Free up RSX™ to do other things.
  – Enable otherwise unfeasible techniques.
  – Optimise rendering.
/* What is SPU AR? I */
- Problems involved?
  - Synchronisation.
  - Optimising SPU modules.
  - Memory considerations:
    - Local store
    - Resource allocation
  - Etc.
/* What is SPU AR? II */
- Original Xenon implementation:
  - Totally GPU-based.
  - 2xVTF (volume & 2D) for damage.
  - Large amount of work in vertex shader, making cars in Blur heavily vertex-bound.
  - All lighting in pixel shader.
/* Case Study: Cars I */
- Loose fitting damage volume:
/* Case Study: Cars II */
- Control points:
/* Case Study: Cars III */
- Morph targets:
/* Case Study: Cars IV */
- Scratch/dent textures:
/* Case Study: Cars V */
- Challenges:
  - Increase rendering speed of cars.
  - Maintain same quality.
/* Case Study: Cars VI */
- Our solution:
  - Large parts are SPU based.
  - On demand.
  - Sync-free.
  - Deferred.
  - Work split between GPU/SPU.
/* Damage: Solution */
- 2 vertex streams:
  - Read-only car vertex data.
    - Shared between similar cars.
  - SPU-modified damage vertex data.
    - Per instance.
  - One-to-one mapping of vertices.
- Control points:
  - Crude approximation of volume preservation.
- Dent/scratch blend levels.
/* Damage: Data I */
/* Damage: Data II */
Stream0: Position, Normal, UV0, UV1, PosOffset, NormalOffset, AO, ControlPoints
Stream1: SPU_Position, SPU_Normal
- MFC writes data atomically in 16-byte chunks...
  - If the vertex format is exactly 16 bytes, a vertex can be changed atomically from the SPU.
- If you can live with the odd vertex being wrong for a frame, this could be a huge win!
/* Damage: Data III */
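As a rough illustration of the 16-byte point, here is a minimal sketch of what an SPU-written vertex could look like. The `SpuVertex` layout (four 16-bit values each for position and normal) and the helper name are assumptions made for this example, not the format used in Blur; the only thing taken from the slide is the 16-byte size.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for the SPU-written stream (SPU_Position, SPU_Normal).
 * The field encoding is an assumption; only the 16-byte total matters here. */
typedef struct SpuVertex {
    uint16_t position[4]; /* x, y, z as 16-bit values, fourth slot as padding */
    uint16_t normal[4];   /* x, y, z as 16-bit values, fourth slot as padding */
} SpuVertex;

/* Fail the build if the format ever grows past one quadword. */
_Static_assert(sizeof(SpuVertex) == 16, "SpuVertex must be exactly 16 bytes");

/* Returns nonzero when vertex `index` of an array starting at `base_address`
 * starts on a 16-byte boundary, so it occupies exactly one quadword. */
static int vertex_is_single_quadword(uintptr_t base_address, size_t index)
{
    uintptr_t address = base_address + index * sizeof(SpuVertex);
    return (address % 16u) == 0u;
}
```

With a 16-byte aligned base address every vertex passes the check, which is the condition under which a single quadword write covers a whole vertex.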
/* Damage: Data IV */
[Diagram: RSX, local memory, main memory and SPU; labels: write-only vertices, read-only vertices.]
- Damage events from game-side code are queued.
  - Note: There is no link to the player health; the damage is purely superficial.
/* Damage: Events */
[Diagram: game code and a queue of Impact events.]
/* Damage: Data V */
[Diagram: GPU, SPU, write-only vertices*, read-only vertices*, queued Impact events, constants.]
* with respect to the SPU
/* Damage: Data VI */
[Diagram: GPU, SPU, write-only vertices*, read-only vertices*; label: kick off SPU tasks.]
* with respect to the SPU
- Fewer sync points should be the goal of any multi-core code:
/* Damage: Control */
[Timeline diagram, built up over several slides; labels: Other Work(1), Other Work(2), PPU Damage, Vertex Work (several tasks), Flag.]
- Pretty easy to go from shaders to SPU intrinsics or asm.
  - We favour si style for simplicity and ease.
/* de-code into IEEE754-ish 32bit float (meh): */
qword sign_bit = si_and(result, sign_bit_mask);
sign_bit = si_shli(sign_bit, 0x10); /* move 16 bits into correct place. */
qword significand = si_and(result, mant_bit_mask);
significand = si_shli(significand, 0xd);
qword is_zero_mask = si_cgti(significand, 0x0); /* all bits set if non-zero. */
expo_bias = si_and(is_zero_mask, expo_bias);
qword exponent_bias = si_a(significand, expo_bias); /* move expo up range, 0x07800000=>0x3f800000. */
exponent_bias = si_or(exponent_bias, sign_bit);
/* Damage: SPU I */
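For readers without an SPU toolchain, the same decode can be sketched in portable scalar C. The mask values (`0x8000`, `0x7fff`) and the bias constant (`0x38000000`) are assumptions inferred from the shift amounts and the `0x07800000=>0x3f800000` comment on the slide; the function name is hypothetical. Like the slide's "IEEE754-ish" version, this sketch does not handle denormals, infinities or NaNs correctly.

```c
#include <stdint.h>
#include <string.h>

/* Decodes a 16-bit half float into a 32-bit float, mirroring the si_ steps:
 * mask and shift the sign, mask and shift exponent+mantissa, then add the
 * exponent bias difference unless the value is zero. */
static float half_to_float_approx(uint16_t half_bits)
{
    uint32_t h = half_bits;
    uint32_t sign = (h & 0x8000u) << 16;         /* sign bit into bit 31 */
    uint32_t significand = (h & 0x7fffu) << 13;  /* exponent and mantissa */
    /* (127 - 15) << 23: moves the exponent up the range, skipped for zero. */
    uint32_t bias = (significand != 0u) ? 0x38000000u : 0u;
    uint32_t bits = (significand + bias) | sign;

    float result;
    memcpy(&result, &bits, sizeof result);       /* bit-cast without aliasing issues */
    return result;
}
```

For example, the half value `0x3c00` becomes `0x07800000` after the shift and `0x3f800000` after the bias is added, which is 1.0f.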
- Problems:
  - GPU version relied on bilinear filtering of volume texture to smooth damage.
  - Filtering on SPU is a bit of a pain.
  - Working out which events affect which vertices?
/* Damage: SPU II */
- Simplest solution:
  - Two-stage x-form:
    - 1. Get data in volume texture-ish format.
    - 2. Apply x-form to all vertices.
/* Damage: SPU III */
- Filtering:
  - Software bilinear filtering.
  - Some interesting instructions in ISA will help here.
/* Damage: SPU IV */
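A minimal scalar sketch of software bilinear filtering follows. It samples a 2D grid of floats rather than the volume data described in the talk, uses clamp-to-edge addressing, and all names are hypothetical; an SPU version would presumably be vectorised and is not shown here.

```c
#include <stddef.h>

/* Linear interpolation between a and b by t in [0, 1]. */
static float mix_float(float a, float b, float t)
{
    return a + (b - a) * t;
}

/* Samples a row-major width x height grid at continuous texel coordinates
 * (u, v), where (0, 0) is the first texel. Coordinates are clamped to the grid. */
static float sample_bilinear(const float *texels, size_t width, size_t height,
                             float u, float v)
{
    float max_u = (float)(width - 1);
    float max_v = (float)(height - 1);
    if (u < 0.0f) u = 0.0f;
    if (v < 0.0f) v = 0.0f;
    if (u > max_u) u = max_u;
    if (v > max_v) v = max_v;

    size_t x0 = (size_t)u;
    size_t y0 = (size_t)v;
    size_t x1 = (x0 + 1 < width) ? x0 + 1 : x0;   /* clamp the right neighbour */
    size_t y1 = (y0 + 1 < height) ? y0 + 1 : y0;  /* clamp the lower neighbour */
    float fx = u - (float)x0;
    float fy = v - (float)y0;

    float top = mix_float(texels[y0 * width + x0], texels[y0 * width + x1], fx);
    float bottom = mix_float(texels[y1 * width + x0], texels[y1 * width + x1], fx);
    return mix_float(top, bottom, fy);
}
```

Sampling the centre of a 2x2 grid holding 0, 1, 2, 3 returns the average of the four texels.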
- Data flow through SPU program is paramount to performance.
  – Process in 16KB chunks.
  – Multi-buffer input and output.
- If your system isn’t ‘mission critical’, align and lose double buffer.
/* Damage: Lessons I */
- Make use of SoA mode data layout, liberated from rigidity of GPU programming model!
/* Damage: Lessons II */
[Diagram: four xyzw vectors rearranged so that each row holds one component of all four vectors:]
x x x x
y y y y
z z z z
w w w w
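The rearrangement in the diagram can be sketched in plain C as a transpose from array-of-structures to structure-of-arrays. The type and function names are hypothetical, and the scalar loop only stands in for what would be shuffle instructions on the SPU.

```c
/* One vector per element: x, y, z, w stored together (array-of-structures). */
typedef struct Vec4 {
    float x, y, z, w;
} Vec4;

/* Four vectors stored component-wise (structure-of-arrays): each array would
 * map onto one 128-bit register holding the same component of four vectors. */
typedef struct Vec4x4SoA {
    float x[4];
    float y[4];
    float z[4];
    float w[4];
} Vec4x4SoA;

/* Transposes four AoS vectors into SoA form. */
static Vec4x4SoA transpose_to_soa(const Vec4 in[4])
{
    Vec4x4SoA out;
    for (int lane = 0; lane < 4; ++lane) {
        out.x[lane] = in[lane].x;
        out.y[lane] = in[lane].y;
        out.z[lane] = in[lane].z;
        out.w[lane] = in[lane].w;
    }
    return out;
}
```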
- Add value to your SPU program for relatively small computational effort:
  - We added some of the per-vertex lighting calculations for brake lights, for example.
/* Damage: Lessons III */
/* Damage: Results */
- Our solution:
  - SPU-generated cube maps.
    - 40 in total (accounting for double buffer).
    - 8x8 per face.
  - Deferred.
  - Work split between GPU/SPU.
- Cars are lit with a mixture of things:
  - SH (world + dynamic)
  - Cube map lighting
  - Vertex lighting
/* Lighting: Solution */
- Input:
  - Nearest 16 lights.
- Output:
  - Cube map.
- Simples!
/* Lighting: Data */
[Diagram: several Light inputs, the SPU, and a cube map.]
- Each frame PPU kicks off SPU-tasks to build cube maps.
  - Cube maps are double buffered to avoid artefacts and contention with GPU.
- Workload scalable.
  - Number of cube maps per task can change dynamically if need be.
/* Lighting: Control */
- On SPU we put some ‘Bizarre Creations Secret Sauce™’ into the cube maps:
/* Lighting: SPU */
- On GPU, we sample with reflected view vector:
reflect(view_dir, normal);
/* Lighting: GPU */
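For reference, the shader `reflect` call computes `I - 2 * dot(N, I) * N` for an incident vector `I` and a unit-length normal `N`. A small C sketch of the same formula is given below; the `Vec3` type and function names are introduced only for this example.

```c
typedef struct Vec3 {
    float x, y, z;
} Vec3;

static float dot3(Vec3 a, Vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Reflects `incident` about `normal`, which is assumed to be unit length:
 * reflect(I, N) = I - 2 * dot(N, I) * N. */
static Vec3 reflect3(Vec3 incident, Vec3 normal)
{
    float scale = 2.0f * dot3(normal, incident);
    Vec3 reflected = {
        incident.x - scale * normal.x,
        incident.y - scale * normal.y,
        incident.z - scale * normal.z
    };
    return reflected;
}
```

A vector travelling down and to the right, (1, -1, 0), reflected about an upward normal (0, 1, 0) comes back as (1, 1, 0).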
/* Lighting: Results I */
/* Lighting: Results II */
/*
 * Part II w/ Steve McAuley
 */
SPU Acceleration of Fragment Shading
- Problem:
  – Our fragment programs are expensive.
- Solution:
  – Let’s use the SPUs to help.
/* The Problem */
/* The Pipeline */
[Diagram: Vertices → Vertex shader → Triangle setup → Rasterisation → Fragment shader → ROP, with Textures feeding the fragment shader.]
- Solution:
  – Make look-up textures on the SPUs to speed up our fragment programs.
- What could we look up?
  – Lighting
  – Shadows
  – Ambient occlusion
  – Fog
- Sounds like deferred rendering!
/* Look It Up! */
Forward Rendering = FAIL
- Goal:
  – Move dynamic lighting into a look-up texture.
- Solution:
  – Sounds like deferred rendering!
- In Blur, we used a light pre-pass renderer.
/* Case Study: Lighting */
/* Light Pre-Pass */
[Diagram: a first Geometry pass outputs Normals and Depth, which feed Real-Time Lighting; a second Geometry pass produces the Final Colour.]
/* A Frame of Blur */
[Timeline diagram. GPU stages shown: Pre-Pass, Lights, Mirror, Cube Map & Reflection, Solid, Alpha, Post.]
- Move the lights onto the SPUs:
  – But there’s a gap!
/* A Frame of Blur */
[Timeline diagram with GPU and SPUs rows. Stages shown: Pre-Pass, Lights, Mirror, Cube Map & Reflection, Solid, Alpha, Post.]
- Option #1:
  – Defer the lighting by a frame.
/* A Frame of Blur */
[Timeline diagram. GPU: Pre-Pass, Mirror, Cube Map & Reflection, Solid, Alpha, Post. SPUs: Lights.]
- Option #2:
  – Parallelise with another part of the rendering.
/* A Frame of Blur */
[Timeline diagram. GPU: Pre-Pass, Mirror, Cube Map & Reflection, Solid, Alpha, Post. SPUs: Lights.]
- Option #2:
  – Taking it further…
/* A Frame of Blur */
[Timeline diagram. GPU: Pre-Pass, Mirror, Cube Map & Reflection, Solid, Alpha, Post. SPUs: Lights, Shadows, Blur.]
- Key point: you must find something to parallelise with!
  – Design your engine accordingly!
  – Otherwise you risk a frame of latency.
- This is true multi-GPU.
  – Two graphics processors, working on separate tasks, in parallel.
/* Parallelism */
- Goal:
  – Move the lighting stage of the light pre-pass onto the SPUs.
- There are just six easy steps to enlightenment…
/* Case Study: Lighting */
/* Step #1: The Data */
[Diagram: inputs are Transform, Normals, Depth and Lights.]
/* Step #1: The Data */
[Diagram: inputs are Transform and Lights, plus four channels: Normal X, Normal Y, Depth Hi, Depth Lo.]
- We have six SPUs, and each of them wants a lighting job…
- Divide the frame buffer into tiles.
- Each tile is a unit of work.
/* Step #2: Jobs */
[Diagram: the frame buffer divided into tiles, with a tile Index.]
- Keep working until they’re all gone!
  – (Then hand out the P45s…)
/* Step #2: Jobs */
[Diagram: six SPUs sharing a tile index through an Atomic Increment.]
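The tile hand-out can be sketched as a shared counter that each worker atomically increments to claim the next tile. The sketch below uses C11 `<stdatomic.h>` purely as a portable stand-in; on the Cell the increment would go through the hardware's own atomic facilities, which are not shown. The `TileQueue` type, the function names and the tile size used in the note are assumptions for this example.

```c
#include <stdatomic.h>

/* Shared job state: the next tile to hand out and the total number of tiles. */
typedef struct TileQueue {
    atomic_uint next_tile;
    unsigned tile_count;
} TileQueue;

/* Number of tiles needed to cover `pixels` with tiles of `tile_size`. */
static unsigned tiles_for(unsigned pixels, unsigned tile_size)
{
    return (pixels + tile_size - 1u) / tile_size;
}

static void tile_queue_init(TileQueue *queue, unsigned width, unsigned height,
                            unsigned tile_size)
{
    atomic_init(&queue->next_tile, 0u);
    queue->tile_count = tiles_for(width, tile_size) * tiles_for(height, tile_size);
}

/* Claims the next tile. Returns 1 and writes its index, or returns 0 when
 * every tile has already been handed out. Safe to call from several workers. */
static int tile_queue_pop(TileQueue *queue, unsigned *tile_index)
{
    unsigned index = atomic_fetch_add(&queue->next_tile, 1u);
    if (index >= queue->tile_count) {
        return 0;
    }
    *tile_index = index;
    return 1;
}
```

Each worker simply loops on `tile_queue_pop` until it returns 0; for a 1280x720 buffer with hypothetical 64-pixel tiles that is 20 x 12 = 240 units of work.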
- Can be a time sink if you’re not careful!
  – Expect to find your worst bugs here.
  – Best to get it right first time!
/* Step #3: Sync */
/* Step #3: Sync */
[Timeline diagram, built up over several slides. GPU: Pre-Pass, Mirror, Cube Map & Reflection, Solid, Alpha, Post. SPUs: Lights. Sync markers added in turn: Write Label, Wait on Label, Jump To Self.]
- Build a view frustum for each tile.
  – Remember, we have the depth buffer so can calculate the minimum and maximum depth!
- Gather only the lights that intersect this frustum.
- Cull an entire tile if:
  – Depth min and max are both far clip.
  – No lights intersect.
/* Step #4: Culling */
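The depth-range part of this culling can be sketched as follows. This is a simplification: lights are treated as spheres and tested only against the tile's minimum and maximum depth, and the test against the side planes of the tile frustum is omitted. All types and names are hypothetical.

```c
#include <stddef.h>

typedef struct DepthRange {
    float min_depth;
    float max_depth;
} DepthRange;

/* A point light reduced to what the depth test needs: view-space depth of
 * its centre and its radius of influence. */
typedef struct Light {
    float view_z;
    float radius;
} Light;

/* Scans a tile's depth values (count must be at least 1) for min and max. */
static DepthRange tile_depth_range(const float *depth, size_t count)
{
    DepthRange range = { depth[0], depth[0] };
    for (size_t i = 1; i < count; ++i) {
        if (depth[i] < range.min_depth) range.min_depth = depth[i];
        if (depth[i] > range.max_depth) range.max_depth = depth[i];
    }
    return range;
}

/* Writes the indices of lights whose depth extent overlaps the tile's range
 * and returns how many were gathered. */
static size_t gather_lights(const DepthRange *range, const Light *lights,
                            size_t light_count, size_t *out_indices)
{
    size_t gathered = 0;
    for (size_t i = 0; i < light_count; ++i) {
        float nearest = lights[i].view_z - lights[i].radius;
        float farthest = lights[i].view_z + lights[i].radius;
        if (farthest >= range->min_depth && nearest <= range->max_depth) {
            out_indices[gathered++] = i;
        }
    }
    return gathered;
}

/* A tile is skipped when it only contains far-clip depth or no light hits it. */
static int tile_is_culled(const DepthRange *range, float far_clip, size_t gathered)
{
    int only_far_clip = range->min_depth >= far_clip && range->max_depth >= far_clip;
    return only_far_clip || gathered == 0;
}
```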
/* Step #5: Light! */
- Multi-buffering:
  – Do the following simultaneously:
    • Load data for next job.
    • Process data for the current job.
    • Save data from the previous job.
  – Costs local store but is usually worth it.
/* Step #6: Optimise! */
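The buffer rotation behind multi-buffering can be sketched in portable C. Here `memcpy` stands in for the DMA transfers, so the three stages run one after another rather than truly in parallel; on an SPU the load and save would be asynchronous transfers overlapping the processing. The chunk size, the doubling "work" and all names are placeholders for this example.

```c
#include <stddef.h>
#include <string.h>

enum { CHUNK_FLOATS = 4, BUFFER_COUNT = 3 };

/* Number of valid floats in chunk `chunk` of a stream of `total` floats. */
static size_t chunk_length(size_t total, size_t chunk)
{
    size_t remaining = total - chunk * CHUNK_FLOATS;
    return remaining < CHUNK_FLOATS ? remaining : CHUNK_FLOATS;
}

/* Streams `input` to `output`, doubling every value. At step s the loop loads
 * chunk s, processes chunk s-1 and saves chunk s-2, each in its own buffer. */
static void process_stream(const float *input, float *output, size_t total)
{
    float buffers[BUFFER_COUNT][CHUNK_FLOATS];
    size_t chunk_count = (total + CHUNK_FLOATS - 1) / CHUNK_FLOATS;

    for (size_t step = 0; step < chunk_count + 2; ++step) {
        if (step < chunk_count) {                      /* load the next job */
            memcpy(buffers[step % BUFFER_COUNT], input + step * CHUNK_FLOATS,
                   chunk_length(total, step) * sizeof(float));
        }
        if (step >= 1 && step - 1 < chunk_count) {     /* process the current job */
            size_t chunk = step - 1;
            float *data = buffers[chunk % BUFFER_COUNT];
            for (size_t i = 0; i < chunk_length(total, chunk); ++i) {
                data[i] *= 2.0f;
            }
        }
        if (step >= 2) {                               /* save the previous job */
            size_t chunk = step - 2;
            memcpy(output + chunk * CHUNK_FLOATS, buffers[chunk % BUFFER_COUNT],
                   chunk_length(total, chunk) * sizeof(float));
        }
    }
}
```

Three buffers are the local-store cost the slide mentions: at any step one is being filled, one worked on and one drained.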
- Structure-of-arrays:
  – Transpose your data for massive damage!
  – e.g.
/* Step #6: Optimise! */
[Diagram: four xyzw vectors transposed so that each row holds one component of all four vectors:]
x x x x
y y y y
z z z z
w w w w
- Array-of-structures:
  – 1 dot product, 23 cycles
qword d0 = si_fm(xyz0, abc0);
qword d1 = si_rotqbyi(d0, 0x4);
qword d2 = si_rotqbyi(d0, 0x8);
qword dot = si_fa(d0, d1);
dot = si_fa(dot, d2);
- Structure-of-arrays:
  – 4 dot products, 18 cycles
qword dot0123 = si_fm(x0123, a0123);
dot0123 = si_fma(y0123, b0123, dot0123);
dot0123 = si_fma(z0123, c0123, dot0123);
/* Step #6: Optimise! */
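The structure-of-arrays version maps onto scalar C as one multiply followed by two multiply-adds per lane, producing four dot products with no horizontal adds or rotates. The `Vec3x4` type and function name are assumptions for this sketch; each array stands in for one 128-bit register such as `x0123`.

```c
/* Four 3D vectors stored component-wise, one array per register. */
typedef struct Vec3x4 {
    float x[4];
    float y[4];
    float z[4];
} Vec3x4;

/* Computes four dot products at once, mirroring si_fm, si_fma, si_fma. */
static void dot_product_x4(const Vec3x4 *a, const Vec3x4 *b, float out[4])
{
    for (int lane = 0; lane < 4; ++lane) {
        float dot = a->x[lane] * b->x[lane];      /* si_fm  */
        dot = a->y[lane] * b->y[lane] + dot;      /* si_fma */
        dot = a->z[lane] * b->z[lane] + dot;      /* si_fma */
        out[lane] = dot;
    }
}
```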
- Batching:
  – Light 16 pixels at a time.
    • Minimises dependent instruction stalls.
    • Helps compiler with even/odd pipeline balance.
  – Use trial and error to find your ideal batch size!
    • A balance between register spilling and setup cost.
/* Step #6: Optimise! */
- Ran on 3 SPUs.
- Slightly faster than the RSX.
- An optimisation even if you have nothing to parallelise with!
/* Case Study: Lighting */
• Lighting
• Damage
• Rendering
• Physics
/* The Complete Picture */
- Use the SPUs to accelerate your rendering!
  – Think about the data.
  – Design your engine appropriately.
  – Avoid frames of latency.
  – Keep synchronisation simple.
  – Add value.
- It’s actually really easy, try it!
/* Conclusion */
- Steven Tovey & Stephen McAuley, “Parallelized Light Pre-Pass Rendering with the Cell Broadband Engine”, GPU Pro
- Stephen McAuley & Steven Tovey, “A Bizarre Way to do Real-Time Lighting”, Develop in Liverpool 2009
/* Further Reading */
lqd $r1,question_count
stopd $r0,$r0,0x1
; thanks for listening! ;)
brnz $r1,questions