an optimized diffusion depth of field solver (ddof) 28th february 20112amds favorite effects holger...

40

Upload: clarissa-claudia-pryce

Post on 28-Mar-2015

219 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD
Page 2: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

An Optimized Diffusion Depth Of Field Solver (DDOF)

28th February 2011 2AMD‘s Favorite Effects

Holger Gruen – AMD

Page 3: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Agenda• Motivation• Recap of a high-level explanation of DDOF• Recap of earlier DDOF solvers• A Vanilla Cyclic Reduction(CR) DDOF solver• A DX11 optimized CR solver for DDOF• Results28th February 2011 AMD‘s Favorite Effects 3

Page 4: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Motivation• Solver presented at GDC 2010 [RS2010] has

some weaknesses• Great implementation but memory reqs and

runtime too high for many game developers• Looking for faster and memory efficient solver

28th February 2011 AMD‘s Favorite Effects 4

Page 5: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Diffusion DOF recap 1• DDOF is an enhanced way of blurring a picture

taking an arbitrary CoC at a pixel into account• Interprets input image as a heat distribution• Uses the CoC at a pixel to derive a per pixel

heat conductivity CoC=Circle of Confusion

28th February 2011 AMD‘s Favorite Effects 5

Page 6: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Diffusion DOF recap 2• Blurring is done by time stepping a differential

equation that models the diffusion of heat• ADI method used to arrive at a separable

solution for stepping• Need to solve tri-diagonal linear system for

each row and then each colum of the input28th February 2011 AMD‘s Favorite Effects 6

Page 7: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

DDOF Tri-diagonal system

28th February 2011 AMD‘s Favorite Effects 7

1 1 1 1

2 2 2 2 2

3 3 3 3 3

0

0 n n n n

b c y x

a b c y x

a b c y x

a b y x

• row/col of inputimage

• derived from CoC at each pixel of aninput row/col

• resulting blurred row/col

Page 8: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Solver recap 1• The GDC2010 solver [RS2010] is a ‚hybrid‘ solver

– Performs three PCR steps upfront– Performs serial ‚Sweep‘ algorithm to solve small

resulting systems– Check [ZCO2010] for details on other hybrid

solvers

28th February 2011 AMD‘s Favorite Effects 8

Page 9: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Solver recap 2• The GDC2010 solver [RS2010] has drawbacks

– It uses a large UAV as a RW scratch-pad to store the modified coefficients of the sweep algorithm

• GPUs without RW cache will suffer

– For high resolutions three PCR steps produce tri-diagonal system of substantial size

• This means a serial (sweep) algorithm is run on a ‚big‘ system

28th February 2011 AMD‘s Favorite Effects 9

Page 10: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Solver recap 3• Cyclic Reduction (CR) solver

– Used by [Kass2006] in the original DDOF paper– Runs in two phases

1. reduction phase2. backward substitution phase

28th February 2011 AMD‘s Favorite Effects 10

Page 11: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Solver recap 4• According to [ZCO2010]:

– CR solver has lowest computational complexity of all solvers

– It suffers from lack of parallelism though • At the end of the reduction phase• At the start of the backwards substitution phase

28th February 2011 AMD‘s Favorite Effects 11

Page 12: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Passes of a Vanilla CR Solver

28th February 2011 AMD‘s Favorite Effects 12

Input imageX

Pass 1: construct from CoC

abc

1 1 1 1

2 2 2 2 2

3 3 3 3 3

0

0 n n n n

b c y x

a b c y x

a b c y x

a b y x

reduce

reduce

reduce

reduce

Stop at size 1Solve for the first y

Y substitutesubstitute

Blurred image

Page 13: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Vanilla Solver Results• Higher performance than reported in

[Bavoil2010] (~6 ms vs. ~8ms at 1600x1200)

• Memory footprint prohibitively high – >200 MB at 1600x1200

• Need an answer to tackling the lack of parallelism problem – answer given in [ZCO2010]

28th February 2011 AMD‘s Favorite Effects 13

Page 14: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Vanilla CR Solver

28th February 2011 AMD‘s Favorite Effects 14

Input imageX

Pass 1: construct from CoC

abc

reduce

reduce

reduce

reduce

Stop at size 1Solve for the first y

Y substitutesubstitute

Blurred image

This is what kills

parallelism

Page 15: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Keeping the parallelism high

28th February 2011 AMD‘s Favorite Effects 15

Input imageX

Pass 1: construct from CoC

abc

reduce

reduce

reduce

reduce

Stop at a reasonable size

Solve for Y at that resolution to have a big enough parallel workload (e.g using PCR see [ZCO2010])

Y substitutesubstitute

Blurred image

Page 16: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Memory Optimizations 1

28th February 2011 AMD‘s Favorite Effects 16

Input imageX

Pass 1: construct from CoC

abc

reduce

reduce

reduce

reduce

Stop at a reasonable size

Solve for Y at that resolution

Y substitutesubstitute

Blurred image

Page 17: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Memory Optimizations 1

28th February 2011 AMD‘s Favorite Effects 17

rgab32fX

rgab32fabc

rgab32f

rgab32f

reduce

reduce

reduce

reduce

Stop at a reasonable size

Solve for Y at that resolution

Y substitutesubstitute

rgba32f rgab32fsubsti-tute

Page 18: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Memory Optimizations 1

28th February 2011 AMD‘s Favorite Effects 18

rgab16fX

rgab32fabc

rgab16f

rgab32f

reduce

reduce

reduce

reduce

Stop at a reasonable size

Solve for Y at that resolution

Y substitutesubstitute

rgba16f rgab16fsubsti-tute

This saves some significant amount of memory - We found no artifacts for going from rgba32f to rgba16f

Page 19: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Memory Optimizations 2

28th February 2011 AMD‘s Favorite Effects 19

rgab16fX

rgab32fabc

rgab16f

rgab32f

reduce

reduce

reduce

reduce

Stop at a reasonable size

Solve for Y at that resolution

Y substitutesubstitute

rgba16f rgab16fsubsti-tute

This does again save a significant amount of memory as this is the biggest surface used by the solver

Page 20: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Memory Optimizations 2

28th February 2011 AMD‘s Favorite Effects 20

rgab16fX

abc

rgab16f

rgab32f

reduce reduce

reduce

Stop at a reasonable size

Solve for Y at that resolution

Y substitutesubstitute

rgba16f rgab16fsubsti-tute

Skip abc construction pass and compute abc on-the-fly during 1. reduction pass

Page 21: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Intermediate Results 1600x1200

28th February 2011 AMD‘s Favorite Effects 21

Solver Time in ms Memory in MegabytesHD5870 GTX480

GDC2010 hybrid solver on GTX480 ~8.5 8.00 [Bavoil2010]

~117 (guesstimate)

Standard Solver (already skips high res abc construction)

3.66 3.33 ~132

Page 22: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Memory Optimizations 3

28th February 2011 AMD‘s Favorite Effects 22

rgab16fX

abc

rgab16f

rgab32f

reduce reduce

reduce

Stop at a reasonable size

Solve for Y at that resolution

Y substitutesubstitute

rgba16f rgab16fsubsti-tute

Skip abc construction pass compute abc during 1. reduction pass

Yet again this saves a significant amount of memory !

Page 23: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Memory Optimizations 3

28th February 2011 AMD‘s Favorite Effects 23

rgab16fX

abc

reduce4

Stop at a reasonable size

Solve for Y at that resolution

Y substitutesubstitute

rgba16f

substitute4

Skip abc construction pass compute abc during 1. reduction pass

Reduce 4-to-1in a special first reduction pass

Substitute 1-to-4 in a special substitution pass

Page 24: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Intermediate Results 1600x1200

28th February 2011 AMD‘s Favorite Effects 24

Solver Time in ms Memory in MegabytesHD5870 GTX480

GDC2010 hybrid solver on GTX480 ~8.5 8.00 [Bavoil2010]

~117 (guesstimate)

Standard Solver (already skips high res abc construction)

3.66 3.33 ~132

4–to-1 Reduction 2.87 3.32 ~73

Page 25: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

DX11 Memory Optimizations 1

28th February 2011 AMD‘s Favorite Effects 25

rgab16fX

abc

reduce4

Stop at a reasonable size

Solve for Y at that resolution

Y substitutesubstitute

rgba16f

substitute4

Skip abc construction pass compute abc during 1. reduction pass

Reduce 4-to-1in a special first reduction pass

Substitute 1-to-4 in a special substitution pass

Page 26: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

DX11 Memory Optimizations 1

28th February 2011 AMD‘s Favorite Effects 26

rgab16fX

abc

reduce4

Stop at a reasonable size

Solve for Y at that resolution

Y substitutesubstitute

rgba16f

substitute4

Skip abc construction pass compute abc during 1. reduction pass

Reduce 4-to-1in a special first reduction pass

Substitute 1-to-4 in a special substitution pass

Pack abc and X into one rgba_uint surface

Page 27: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Using SM5 for data packing

28th February 2011 AMD‘s Favorite Effects 27

rgab16fX

rgab32fabc

uint

uint

uint

uint

pack x,y channel

(f32tof16(X.x) + (f32tof16(X.y) << 16))

Page 28: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Using SM5 for data packing

28th February 2011 AMD‘s Favorite Effects 28

rgab16fX

rgab32fabc

uint

uint

uint

uint

lower 5 bits of z channel

higher 27 bits of x channel

pack

(asuint(abc.x) &0xFFFFFFC0) | (f32tof16(X.z) & 0x3F))

Steal 6 lowest mantissa bits of abc.x to store some bits of X.z

Page 29: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Using SM5 for data packing

28th February 2011 AMD‘s Favorite Effects 29

rgab16fX

rgab32fabc

uint

uint

uint

uint

central 5 bits of z channel

higher 27 bits of y channel

pack

(asuint(abc.y) &0xFFFFFFC0) | ((f32tof16(X.z) >>6 )& 0x3F))

Steal 6 lowest mantissa bits of abc.y to store some bits of X.z

Page 30: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

SM5 Memory Optimizations 1

28th February 2011 AMD‘s Favorite Effects 30

rgab16fX

rgab32fabc

uint

uint

uint

uint

higher 5 bits of z channel

higher 27 bits of z channel pack

(asuint(abc.z) &0xFFFFFFC0) | ((f32tof16(X.z) >>12 )& 0x3F))

Steal 6 lowest mantissa bits of abc.z to store some bits of X.z

Page 31: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Sample Screenshot

28th February 2011 AMD‘s Favorite Effects 31

Page 32: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Abs(Packed-Unpacked) x 255.0f

28th February 2011 AMD‘s Favorite Effects 32

Page 33: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

DX11 Memory Optimizations 2• Solver does a horizonal and vertical pass• Chain of lower res RTs needs to be there twice

– Horizontal reduction/substitution chain– Vertical reduction/substitution chain

• How can DX11 help?

28th February 2011 AMD‘s Favorite Effects 33

Page 34: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

DX11 Memory Optimizations 2• UAVs allow us to reuse data of the horizontal

chain for the vertical chain• A proof of concept implementation shows that this

works nicely but impacts the runtime significantly – ~40% lower fps

• Stayed with RTs as memory was already quite low• Use only if you are really concerned about memory

28th February 2011 AMD‘s Favorite Effects 34

Page 35: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Final Results 1600x1200

28th February 2011 AMD‘s Favorite Effects 35

Solver Time in ms Memory in MegabytesHD5870 GTX480

GDC2010 hybrid solver on GTX480 ~8.5 8.00 [Bavoil2010]

~117 (guesstimate,)

Standard Solver (already skips high res abc construction)

3.66 3.33 ~132

4–to-1 Reduction 2.87 3.32 ~73

4-to-1 Reduction + SM5 Packing 2.75 3.14 ~58

Page 36: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Future Work• Look into CS acceleration of the solver

– 4-to-1 reduction pass– 1-to-4 substitution pass

• Look into using heat diffusion for other effects– e.g. Motion blur

28th February 2011 AMD‘s Favorite Effects 36

Page 37: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Conclusion• Optimized CR solver is fast and mem-efficient

– Used in Dragon Age 2– 4aGames considering its use for new projects– Detailed description in ‚Game Engine Gems 2‘

• Mail me ([email protected]) if you want access to the sources

28th February 2011 AMD‘s Favorite Effects 37

Page 38: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

References• [Kass2006] “Interactive depth of field using simulated diffusion on a GPU”

Michael Kass, Pixar Animation studios, Pixar technical memo #06-01• [ZCO2010] “Fast Tridiagonal Solvers on the GPU” Y. Zhang, J. Cohen, J. D.

Owens, PPoPP 2010• [RS2010] “DX11 Effects in Metro 2033: The Last Refuge” A. Rege, O.

Shishkovtsov, GDC 2010• [Bavoil2010] „Modern Real-Time Rendering Techniques“, L. Bavoil,

FGO2010

28th February 2011 AMD‘s Favorite Effects 38

Page 39: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Backup

28th February 2011 AMD‘s Favorite Effects 39

Page 40: An Optimized Diffusion Depth Of Field Solver (DDOF) 28th February 20112AMDs Favorite Effects Holger Gruen – AMD

Results 1920x1200

28th February 2011 AMD‘s Favorite Effects 40

Solver Time in ms Memory in MegabytesHD5870 GTX480

Standard Solver (already skips high res abc construction)

4.31 4.03 ~158

4–to-1 Reduction 3.36 4.02 ~88

4-to-1 Reduction + SM5 Packing 3.23 3.79 ~70