
A Bizarre Way to do Real-Time Lighting

Stephen McAuley & Steven Tovey, Graphics Programmers, Bizarre Creations Ltd.

stephen.mcauley@bizarrecreations.com

steven.tovey@bizarrecreations.com

http://www.bizarrecreations.com/

“Welcome, I think not!” Let us start by wishing you a good bonfire night!

Agenda
A sneak preview of Blur
Light Pre-Pass Rendering
10 Step Guide to free Lighting on PS3
The Future...

Blur

Coming 2010 on X360, PS3 and PC.
Twenty cars on track for intense wheel-to-wheel racing.
Exciting power-ups bring depth and strategy to racing.
Real-world cars and locations, set between dusk and dawn.
Extensive multiplayer options.

Technical Analysis
So, we have twenty cars, racing around a track in the dark…

…they all have headlights, rear lights, brake lights…

…not to mention any other effects we might have going on around the track…

…therefore, we need some sort of real-time lighting solution.

Light Pre-Pass

Many people came up with this… so you know it’s good!

Given its name by [Engel08]. Credits also due to [Balestra08]. Half-way between traditional and deferred rendering.

Light Pre-Pass
[Diagram: geometry pass → normals + depth → real-time lighting → geometry pass → final colour.]

Light Pre-Pass in Blur

Final Image

Step #1: Render Pre-Pass
Render scene normals and depth.

We pack view space normals and depth into one RGBA8 surface:

This means all the info we need is in one texture, not two!

It’s also faster to calculate view space position than world space position.

R = normal x, G = normal y, B = depth hi, A = depth lo
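Only the X and Y of the view-space normal are stored, so Z has to be rebuilt in the lighting pass. A minimal sketch of that reconstruction, assuming Z is taken as positive (the assumption the later "Future" slide says it wants to fix); illustrative, not the deck's code:

/* Sketch: rebuild a view-space normal from the two packed channels.
   Assumes the normal's Z component is non-negative. */
#include <math.h>

typedef struct { float x, y, z; } float3;

static float3 unpack_normal(float r, float g)   /* r, g sampled from the RGBA8 surface */
{
    float x = r * 2.0f - 1.0f;                  /* remap [0,1] -> [-1,1] */
    float y = g * 2.0f - 1.0f;
    float z2 = 1.0f - x * x - y * y;
    float3 n = { x, y, z2 > 0.0f ? sqrtf(z2) : 0.0f };  /* positive-z assumption */
    return n;
}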

Step #1: Render Pre-Pass
Pack depth (note: here fDepth is in [0, 1] range):

half2 vPackedDepth = half2( floor(fDepth * 255.f) / 255.f,
                            frac(fDepth * 255.f) );

Unpack depth:

float fDepth = vPackedDepth.x + vPackedDepth.y * (1.f / 255.f);

Step #1: Render Pre-Pass
Get view space position from texture coordinates and depth:

float3 vPosition = float3(g_vScale.xy * vUV + g_vScale.zw, 1.f) * fDepth;

Here the depth is in the [0, FarClip] range.
g_vScale moves vUV into [-1, 1] range and scales by inverse projection matrix values.
In some circumstances, possible to move this to the vertex shader.
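How g_vScale is built on the CPU isn't shown in the deck; a possible derivation, assuming a standard perspective projection and a top-left UV origin (both assumptions), looks like this:

/* Sketch: deriving a g_vScale-style constant from the projection matrix.
   proj00 and proj11 are the usual perspective terms 1/(aspect*tan(fov/2))
   and 1/tan(fov/2); the y-flip assumes a top-left UV origin. */
typedef struct { float x, y, z, w; } float4;

static float4 make_view_ray_scale(float proj00, float proj11)
{
    float4 s;
    s.x =  2.0f / proj00;   /* UV.x in [0,1] -> view-space x at z = 1   */
    s.y = -2.0f / proj11;   /* UV.y flipped                             */
    s.z = -1.0f / proj00;   /* offsets so UV (0.5, 0.5) maps to centre  */
    s.w =  1.0f / proj11;
    return s;
}

With that constant, float3(g_vScale.xy * vUV + g_vScale.zw, 1.f) is a view-space ray through the pixel at z = 1, and multiplying by the linear depth gives the view-space position.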

Step #1: Render Pre-Pass
[Screenshots: the packed buffer channels visualised – Normals X & Y, Depth Hi & Lo.]

Step #1: Render Pre-Pass
Some good advice: at this stage, it's really best to render only what you need…
So don't render geometry that isn't affected by real-time lights!
Why not also try bringing in the far clip plane?
We also don't render the very, very vertex-heavy cars. They get their real-time lighting from a spherical harmonic. Doesn't look too bad!

Step #2: The Lighting

We render the lighting to an RGBA8 texture.

Lighting is in [0, 1] range. We just about got away with range and precision issues.
Two types of lights:
Point lights
Spot lights

Step #2: Point Lights
First up, it's the point lights' turn. Let's copy [Balestra08] and render them tiled. Split the screen into tiles:
Big savings! Save on fill rate. Minimise overhead of unpacking view space position and normal.

for each tile
    gather affecting lights
    select shader
    render tile
end
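A rough sketch of the "gather affecting lights" step: test each point light's projected bounding circle against the tile's screen rectangle. The tile size and the light representation here are illustrative assumptions, not values from the deck:

/* Sketch: gather the point lights whose projected bounding circle
   overlaps a screen-space tile. Tile size and the light struct are
   illustrative assumptions. */
#include <stddef.h>

#define TILE_SIZE 32

typedef struct {
    float screen_x, screen_y;   /* projected light centre, in pixels     */
    float screen_radius;        /* projected bounding radius, in pixels  */
} point_light;

static size_t gather_tile_lights(const point_light *lights, size_t num_lights,
                                 int tile_x, int tile_y,
                                 const point_light **out, size_t max_out)
{
    float min_x = (float)(tile_x * TILE_SIZE), max_x = min_x + TILE_SIZE;
    float min_y = (float)(tile_y * TILE_SIZE), max_y = min_y + TILE_SIZE;
    size_t count = 0;

    for (size_t i = 0; i < num_lights && count < max_out; ++i) {
        const point_light *l = &lights[i];
        /* closest point on the tile rectangle to the light centre */
        float cx = l->screen_x < min_x ? min_x : (l->screen_x > max_x ? max_x : l->screen_x);
        float cy = l->screen_y < min_y ? min_y : (l->screen_y > max_y ? max_y : l->screen_y);
        float dx = l->screen_x - cx, dy = l->screen_y - cy;
        if (dx * dx + dy * dy <= l->screen_radius * l->screen_radius)
            out[count++] = l;   /* light affects this tile */
    }
    return count;
}

The number of lights gathered per tile then drives the shader selection, e.g. a variant specialised for one or two lights, which matches the per-tile counts pictured on the next slide.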

Step #2: Point Lights

[Diagram: per-tile point-light counts – most tiles overlap one light; the centre tile overlaps two.]

Step #2: Point Lights
Optimise: mask out the sky in the stencil buffer.

Step #2: Point Lights

Real-Time Lighting (Point Lights)

Step #2: Spot Lights
Next, it's the spot lights. Three different types:
Bog standard.
2D projected texture.
Volume texture.
Render as volumes. A cone for the bog-standard and projected. A box for the volume textured.
If they're big enough on screen, do a stencil test.

Step #2: Spot Lights
Render back faces:
Colour write disabled
Depth test greater-equal
Stencil write enabled

Step #2: Spot Lights
Render front faces:
Colour write enabled
Depth test less-equal
Stencil test enabled
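A sketch of that two-pass stencil setup in pseudo-C; the set_* and draw_volume calls are hypothetical stand-ins for the real graphics API, declared here only to keep the sketch self-contained:

/* Sketch: stencil-tested spot-light volume in two passes.
   The set_*/draw_* functions are hypothetical API stand-ins. */
typedef struct light_volume light_volume;
enum depth_test   { DEPTH_GREATER_EQUAL, DEPTH_LESS_EQUAL };
enum stencil_mode { STENCIL_WRITE, STENCIL_TEST };
enum cull_mode    { CULL_FRONT, CULL_BACK };

void set_colour_write(int enabled);
void set_depth_test(enum depth_test t);
void set_stencil(enum stencil_mode m);
void set_cull_mode(enum cull_mode c);
void draw_volume(const light_volume *vol);

static void draw_spot_light_volume(const light_volume *vol)
{
    /* Pass 1: back faces, depth greater-equal, stencil write, no colour.
       Marks pixels whose scene depth lies in front of the volume's back face. */
    set_colour_write(0);
    set_depth_test(DEPTH_GREATER_EQUAL);
    set_stencil(STENCIL_WRITE);
    set_cull_mode(CULL_FRONT);
    draw_volume(vol);

    /* Pass 2: front faces, depth less-equal, stencil test, colour on.
       Only pixels tagged in pass 1 get shaded. */
    set_colour_write(1);
    set_depth_test(DEPTH_LESS_EQUAL);
    set_stencil(STENCIL_TEST);
    set_cull_mode(CULL_BACK);
    draw_volume(vol);
}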

Step #2: Spot Lights

Hold on a minute… what happens if the camera goes inside the light volume?

Rendering the front faces doesn’t work any more…

Step #2: Spot Lights

Worst case scenario! Not only does the light fill the whole screen, but…

You just have to bite your tongue and only render back faces. You lose your stencil test. And maybe even early-z too.

Step #2: Spot Lights

Step #2: The Lighting

Real-Time Lighting

Step #3: Render the Scene
Just do everything as you normally would…

Except that you now have a texture containing the real-time lighting for each pixel!

But remember to composite it properly…

Step #3: Render the Scene

half3 vDiffuseLighting = vStaticLighting.rgb + vDynamicLighting.rgb;
half3 vFinalColour = vDiffuseLighting * vAlbedoColour.rgb + vSpecularLighting;

vStaticLighting is from our lightmaps; vDynamicLighting is the real-time lighting from the texture.
You'd probably want to do something clever involving a Fresnel term here.

And Finally…

Real-Time Lighting in Blur

Point Lights: brake lights, rear lights

Real-Time Lighting in Blur

Point Lights: pick-ups

Real-Time Lighting in Blur

Point Lights: power-up effects

Real-Time Lighting in Blur

Spot Lights: headlights

Real-Time Lighting in Blur

Spot Lights: start line effects

Great, It Works!

But can we make it faster?

Deferred lighting is image processing – no rasterization required. See how we draw our point lights.

Seems like this suits the PLAYSTATION®3’s SPUs…

PLAYSTATION®3: In Brief
Time to switch gears a little bit...
So you've heard this stuff a million times before... Here are the important takeaway facts:
PS3 has 6 SPUs.
SPUs are fast! (...Given the right data!)

PLAYSTATION®3: In Brief
[Diagram: system overview – RSX™ and 6 SPUs, Main Memory (XDR – 256MB), Graphics Memory (GDDR3 – 256MB).]

PLAYSTATION®3: In Brief
[Diagram: one SPE – SXU plus 256KiB Local Store, with the MFC moving data between the Local Store, Main Memory (256MB) and Graphics Memory (256MB).]

Goals for PLAYSTATION®3
Reduce overall frame latency to an acceptable level (<33ms).
Preserve picture quality (and resolution). Blur runs @ 720p on X360 and PS3.
Preserve lighting accuracy. Lighting and main scene must match: cars move fast...
Deferring the lighting is simply not an option here; it works great in [Swoboda09] though.

Step #1: Look At The Data
Data is *really* important!
Trivially easy in this case as we're coming from a stream processing model, but never hurts to understand it anyway.
Kinda gives us a small glimpse of DX11 compute shaders.

Step #1: Look At The Data
[Diagrams: the input data – the light list goes through a transform (xform) stage before the lighting pass consumes it.]

Step #2: Parallelism

Stream processing highly suited to parallelisation and we have 6 x SPUs.

The obvious question arises: what size should a unit of work be?

Answer: Look at the data again!

Step #3: Look At The Data
Fun fact: frame buffers are not usually linear!
Many reasons for this (think filtering and RSX™ quads).
Our unit size is closely tied to the internal format of the frame buffer produced by the RSX™.
Not going to get into the exact formats here; it's dull and it's all in the Sony SDK docs – RTFM!
Recommend PhyreEngine for good reference examples.

Step #4: Arbitrating Work
Synchronisation points are fail. Keep them to an absolute minimum.
Solution: atomics are your friend! Target hardware has an ATO. Use it, <3 it...
Move through the data in tiles, the tile dictated by an index – DMA it into the local store for processing.

Step #4: Arbitrating Work
[Diagram: six SPUs each atomically advance a shared tile index to claim the next tile of the frame buffer.]
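A rough sketch of the arbitration loop: every SPU atomically bumps a shared index to claim the next tile, then DMAs it into local store. atomic_fetch_add_u32 and the dma_* functions are hypothetical stand-ins for the platform's ATO and MFC facilities, not actual SDK calls:

/* Sketch: SPUs claim tiles by atomically incrementing a shared index. */
#include <stdint.h>

uint32_t atomic_fetch_add_u32(uint64_t effective_address, uint32_t value);
void     dma_get(void *local_store_dst, uint64_t effective_address,
                 uint32_t size, uint32_t tag);
void     dma_wait(uint32_t tag);
void     process_tile(void *tile, uint32_t index);

enum { TILE_BYTES = 32 * 1024 };   /* illustrative tile size */

static void lighting_job(uint64_t ea_tile_index, uint64_t ea_framebuffer,
                         uint32_t num_tiles)
{
    static char tile_buffer[TILE_BYTES] __attribute__((aligned(128)));

    for (;;) {
        /* Claim the next unprocessed tile; no locks, one bus transaction. */
        uint32_t index = atomic_fetch_add_u32(ea_tile_index, 1);
        if (index >= num_tiles)
            break;

        dma_get(tile_buffer, ea_framebuffer + (uint64_t)index * TILE_BYTES,
                TILE_BYTES, /*tag=*/0);
        dma_wait(0);
        process_tile(tile_buffer, index);
    }
}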

Step #5: Multi-Buffering
Move data and process data at the same time. Costs local store, but usually worth it. Different tag group for each buffer.

Step #5: Multi-Buffering
We used triple-buffering, since we're decoding the normal/depth buffer.
[Diagram: the MFC streams the Normal/Depth Buffer in and the Lighting Buffer out of main memory while the SXU processes the current tile.]
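A simplified double-buffered version of the idea, with one DMA tag group per buffer so the fetch of tile n+1 overlaps the lighting of tile n; Blur actually triple-buffers because the normal/depth decode is a further stage. The dma_* functions are the same hypothetical stand-ins as in the previous sketch:

/* Sketch: double-buffered tile streaming with one DMA tag group per slot. */
#include <stdint.h>

void dma_get(void *dst, uint64_t ea, uint32_t size, uint32_t tag);
void dma_put(const void *src, uint64_t ea, uint32_t size, uint32_t tag);
void dma_wait(uint32_t tag);
void light_tile(const char *normal_depth_in, char *lighting_out);

enum { TILE_BYTES = 32 * 1024 };   /* illustrative tile size */
static char in_buf [2][TILE_BYTES] __attribute__((aligned(128)));
static char out_buf[2][TILE_BYTES] __attribute__((aligned(128)));

static void process_tiles(uint64_t ea_src, uint64_t ea_dst, uint32_t num_tiles)
{
    /* Kick off the first fetch before the loop. */
    dma_get(in_buf[0], ea_src, TILE_BYTES, 0);

    for (uint32_t i = 0; i < num_tiles; ++i) {
        uint32_t cur = i & 1, nxt = cur ^ 1;

        /* Start fetching the next tile on the other tag group. */
        if (i + 1 < num_tiles)
            dma_get(in_buf[nxt], ea_src + (uint64_t)(i + 1) * TILE_BYTES,
                    TILE_BYTES, nxt);

        /* Wait for this slot's transfers (its fetch, and any older writeback). */
        dma_wait(cur);
        light_tile(in_buf[cur], out_buf[cur]);
        dma_put(out_buf[cur], ea_dst + (uint64_t)i * TILE_BYTES, TILE_BYTES, cur);
    }
    dma_wait(0);
    dma_wait(1);
}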

Step #6: Lighting (SOA)
SOA is basically a transpose of the obvious layout:

1 x square length (~18 cycles):

qword dot_xx    = si_fm(v, v);
qword dot_xx_r4 = si_rotqbyi(dot_xx, 4);
dot_xx          = si_fa(dot_xx, dot_xx_r4);
qword dot_xx_r8 = si_rotqbyi(dot_xx, 8);
dot_xx          = si_fa(dot_xx, dot_xx_r8);
return si_to_float(dot_xx);

vs. 4 x square lengths (~12 cycles):

qword dot_x = si_fm(x, x);
qword dot_y = si_fma(y, y, dot_x);
qword dot_z = si_fma(z, z, dot_y);
return dot_z;

[Diagram: the SOA transpose – four XYZW vectors become one qword of Xs, one of Ys, one of Zs and one of Ws.]

Step #6: Lighting (SOA)
Pre-transpose lighting data, splat values across the entire qword. 16 byte aligned, single lqd.

struct light
{
    float m_x[4];              // 4 copies of world-space X, in each element of the array
    float m_y[4];
    float m_z[4];
    float m_inv_radius_sq[4];  // never actually used radius, pre-compute (1/radius)^2
    float m_colour_r[4];
    float m_colour_g[4];
    float m_colour_b[4];
};
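A minimal sketch of pre-transposing one light into the layout above; the AOS src_light structure is a hypothetical stand-in, not from the deck:

/* Sketch: splatting one light into the SOA record above, so each field
   loads with a single lqd and applies to a batch of four pixels. */
typedef struct {
    float x, y, z, radius;
    float r, g, b;
} src_light;

static void splat_light(const src_light *src, struct light *dst)
{
    float inv_r_sq = 1.0f / (src->radius * src->radius);
    for (int i = 0; i < 4; ++i) {
        dst->m_x[i] = src->x;   /* same value in every lane */
        dst->m_y[i] = src->y;
        dst->m_z[i] = src->z;
        dst->m_inv_radius_sq[i] = inv_r_sq;
        dst->m_colour_r[i] = src->r;
        dst->m_colour_g[i] = src->g;
        dst->m_colour_b[i] = src->b;
    }
}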

Step #6: Lighting (Batch I)
qword everywhere. Batch reads and writes into 16 byte chunks. Read 4 pixels from normal/depth. Write 4 pixels to the lighting buffer.

Read 4 pixels from the normal/depth buffer:

qword depth_addr = si_from_ptr(depth_buf);
qword depth0 = si_lqd(depth_addr, 0x00);
qword depth1 = si_lqd(depth_addr, 0x10);
qword depth2 = si_lqd(depth_addr, 0x20);
qword depth3 = si_lqd(depth_addr, 0x30);

Pack and write 4 lit pixels to the lighting buffer:

qword clmp0 = si_cfltu(diffuse0, 0x20);
qword clmp1 = si_cfltu(diffuse1, 0x20);
qword clmp2 = si_cfltu(diffuse2, 0x20);
qword clmp3 = si_cfltu(diffuse3, 0x20);
qword r   = si_ila(0x8000);
qword scl = si_ilh(0xff00);
dif0 = si_mpyhhau(clmp0, scl, r);
dif1 = si_mpyhhau(clmp1, scl, r);
dif2 = si_mpyhhau(clmp2, scl, r);
dif3 = si_mpyhhau(clmp3, scl, r);
const vector unsigned char _shuf_uint = { 0xc0, 0x00, 0x04, 0x08, 0xc0, 0x10, 0x14, 0x18,
                                          0xc0, 0x00, 0x04, 0x08, 0xc0, 0x10, 0x14, 0x18 };
qword shuf_    = (const qword)_shuf_uint;
qword base_add = si_from_ptr(pResult);
qword p0_1 = si_shufb(dif0, dif1, shuf_);
qword p0_2 = si_shufb(dif2, dif3, shuf_);
qword pix0 = si_selb(p0_1, p0_2, m_00ff);
si_stqd(pix0, base_add, 0x0);

Step #6: Lighting (Balance)
Lighting SPU program performance is limited by the number of instructions issued. Pipeline balance is vital!
The SPU dual-issues if:
Instructions are correctly aligned within a single fetch group.
There are no dependencies.
Instructions are for the correct pipelines.
Luckily, the compiler maintained balance quite well with nop/lnop insertion and some instruction re-ordering.
Lighting larger batches helps out balance at the cost of register file usage. Mileage may vary here again: how hard are you hammering the even pipe?

Step #6: Lighting (Batch II)
Fixed setup cost for a single line of our sub-tile size (32 pixels wide). Unfortunately, that's too many to process at once despite the SPU's massive register file. The loop is pipelined and there are lots of live variables to multiplex onto the register file.
Settled for 16 pixels, no spilling. Note: the first attempt worked on 4 pixel batches like RSX™. Lots of wasted cycles in the inner loop – less dual issue.
[Diagram: 32-pixel batches cause register spilling; 4-pixel batches waste cycles and increase setup overhead; 16 pixels is the happy medium.]

Step #7: Culling
Culling works on more granular sub-tiles. Allows us to potentially reject more tiles (of course, YMMV). (Note: the sub-tile layout pictured on the slide is an example; it's not our actual sub-tile size.)
Similar to the GPU, basically a sub-tile is culled if...
Max and min depth are both at the far clip.
No lights intersect the frustum constructed for the tile.
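A rough sketch of that rejection test; the sphere/plane representation of lights and sub-tiles is an illustrative assumption rather than the deck's actual data:

/* Sketch: cull a sub-tile before lighting it. A sub-tile is skipped if
   every pixel is at the far clip, or no light sphere touches the
   sub-tile's bounding planes. Structures here are illustrative. */
typedef struct { float x, y, z, radius; } sphere;  /* view-space point light      */
typedef struct { float nx, ny, nz, d; } plane;     /* nx*x + ny*y + nz*z + d >= 0 */

typedef struct {
    float min_depth, max_depth;  /* from the unpacked depth of the sub-tile */
    plane planes[4];             /* left/right/top/bottom of the sub-tile   */
} sub_tile;

static int sub_tile_is_lit(const sub_tile *t, const sphere *lights, int count,
                           float far_clip)
{
    /* All pixels at the far plane: nothing to light (e.g. sky). */
    if (t->min_depth >= far_clip && t->max_depth >= far_clip)
        return 0;

    for (int i = 0; i < count; ++i) {
        const sphere *s = &lights[i];
        int inside = 1;
        /* Depth extent test, then the four side planes of the sub-tile. */
        if (s->z + s->radius < t->min_depth || s->z - s->radius > t->max_depth)
            inside = 0;
        for (int p = 0; inside && p < 4; ++p) {
            const plane *pl = &t->planes[p];
            if (pl->nx * s->x + pl->ny * s->y + pl->nz * s->z + pl->d < -s->radius)
                inside = 0;
        }
        if (inside)
            return 1;   /* at least one light can touch this sub-tile */
    }
    return 0;           /* discard the whole sub-tile */
}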

Step #7: Culling
Remember, SPUs can execute general purpose code. Take advantage of high-level constructs where they are suitable – this means branches, early-outs, etc.
Note: branches generally suck. They're not suitable in the lighting inner loop; discard an entire sub-tile at once.

Step #8: Synchronisation
A custom SPURS policy module made RSX™-initiated jobs easy. Our jobs can optionally depend on a 128 byte line written by RSX™ (or PPU, whatever).
Non-blocking: freedom to run other scheduler tasks while waiting.
Really should investigate using the SPE's mailboxes to stop us from hammering the bus.
Physics team happy again! Not pre-emptive.

Step #8: Synchronisation
Can be painful! Expect hard-to-find bugs here. We had a couple, *ahem*, both were the other Steve's fault ;-)
Worth it in the end though! Keep an eye on overall timings.
Originally the lighting pushed out the physics. Very easy to forget the bigger picture. Impossible to predict up front.

Step #9: Slotting it in...
[Diagram: frame timeline. GPU: Pre-Pass, Mirror Reflection, Main Scene. SPU: Lighting #1–#3 interleaved with Physics, Car Damage, Audio, Command Buffer building, Audio Command Buffer and Scene Graph work.]

Step #9: Slotting it in...
Ended up running the lighting on 3 SPUs, still easily within our timeframe, and it no longer pushed the physics out.

Step #10: Profit!
SPU implementation faster than RSX™ even without parallelism (~2-3ms on 3 SPUs). Overall frame latency reduced by up to 25%! More benefits:
Blending in an alternative colour space becomes trivial.
Add value by outputting other useful stuff from the SPU program – a down-sampled Z buffer, anyone?
Lighting becomes free*.
* In the strictest computer science sense of the word ;-)

The Future...
MSAA – big challenge, but solvable...
Experiment with different colour spaces?
Remove the decoding step... It upsets my OCD as it's not really needed for the data transformation, but removing it also allows us to overlap input and output buffers.
Specular.
Better normals: ideally higher precision for use in the main pass, and fix the positive z-component sign assumption.
Stereographic Projection, Lambert Azimuthal Equal-area Projection, et al.
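For reference, a minimal sketch of one of those alternatives – the spheremap form of the Lambert azimuthal equal-area projection – which packs a unit normal into two channels without the positive-z assumption; a generic illustration, not the deck's code:

/* Sketch: Lambert azimuthal equal-area (spheremap) normal packing.
   Encodes a unit normal to two values in [0,1] and back.
   Degenerates only at z = -1 (a normal pointing straight away from
   the camera), which a view-space normal buffer rarely contains. */
#include <math.h>

typedef struct { float x, y, z; } vec3;
typedef struct { float x, y; } vec2;

static vec2 encode_normal(vec3 n)            /* n must be unit length */
{
    float f = sqrtf(8.0f * n.z + 8.0f);
    vec2 e = { n.x / f + 0.5f, n.y / f + 0.5f };
    return e;
}

static vec3 decode_normal(vec2 e)
{
    float fx = e.x * 4.0f - 2.0f;
    float fy = e.y * 4.0f - 2.0f;
    float f  = fx * fx + fy * fy;
    float g  = sqrtf(1.0f - f / 4.0f);
    vec3 n = { fx * g, fy * g, 1.0f - f / 2.0f };
    return n;
}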

References
[Engel08] W. Engel, “Light Pre-Pass Renderer”, http://diaryofagraphicsprogrammer.blogspot.com/2008/03/light-pre-pass-renderer.html, accessed on 4th July 2009.

[Balestra08] C. Balestra and P. Engstad, “The Technology of Uncharted: Drake’s Fortune”, GDC2008.

[Swoboda09] M. Swoboda, “Deferred Lighting and Post Processing on PLAYSTATION®3”, GDC2009.

Special Thanks!

Matt Swoboda and Colin Hughes (SCE R&D)

and

The Bizarre Creations Core Tech Team

Shameless Plug

Steve and I contributed to this book... It's out March 2010; you should buy it for your desk, studio library, etc.

http://gpupro.blogspot.com

Thanks for Listening! Questions?

Check out www.blurgame.com
