![Page 1: DirectCompute Accelerated Separable Filtering](https://reader035.vdocument.in/reader035/viewer/2022070408/568143cf550346895db05c46/html5/thumbnails/1.jpg)
![Page 2: DirectCompute Accelerated Separable Filtering](https://reader035.vdocument.in/reader035/viewer/2022070408/568143cf550346895db05c46/html5/thumbnails/2.jpg)
DirectCompute Accelerated Separable Filtering
28th February 2011, AMD's Favorite Effects
![Page 3: DirectCompute Accelerated Separable Filtering](https://reader035.vdocument.in/reader035/viewer/2022070408/568143cf550346895db05c46/html5/thumbnails/3.jpg)
Separable Filters
• Much faster than executing a box filter
• Classically performed by the Pixel Shader
• Consists of a horizontal and vertical pass
• Source image over-sampling increases with kernel size
  – Shader is usually TEX instruction limited
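A minimal CPU-side sketch of why the two-pass form is cheaper, assuming a plain box kernel and clamp-to-edge addressing (the helper names are illustrative and not from the slides). A horizontal pass followed by a vertical pass gives the same sums as the full 2D kernel, while the per-pixel sample count drops from diameter² to 2 × diameter.

```python
def clamp(i, lo, hi):
    """Clamp-to-edge addressing for out-of-range coordinates."""
    return max(lo, min(i, hi))


def box_2d(img, radius):
    """Reference: full 2D box sum, (2r+1)^2 samples per pixel."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    total += img[clamp(y + dy, 0, h - 1)][clamp(x + dx, 0, w - 1)]
            out[y][x] = total
    return out


def box_1d(img, radius, horizontal):
    """One separable pass along rows (horizontal) or columns (vertical)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0
            for d in range(-radius, radius + 1):
                if horizontal:
                    total += img[y][clamp(x + d, 0, w - 1)]
                else:
                    total += img[clamp(y + d, 0, h - 1)][x]
            out[y][x] = total
    return out


def box_separable(img, radius):
    """Horizontal pass into an intermediate image, then a vertical pass."""
    intermediate = box_1d(img, radius, horizontal=True)
    return box_1d(intermediate, radius, horizontal=False)


def samples_per_pixel(radius):
    """Return (full 2D sample count, separable sample count) per pixel."""
    diameter = 2 * radius + 1
    return diameter * diameter, 2 * diameter
```

Integer sums are used here so the two results can be compared exactly; a real filter would normalize by the kernel weight total.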
![Page 4: DirectCompute Accelerated Separable Filtering](https://reader035.vdocument.in/reader035/viewer/2022070408/568143cf550346895db05c46/html5/thumbnails/4.jpg)
Separable? – Who Cares
• In many cases developers use this technique even though the filter may not actually be separable
  – Results are often still acceptable
  – Much faster than performing a real box filter
  – Accelerates many bilateral cases
![Page 5: DirectCompute Accelerated Separable Filtering](https://reader035.vdocument.in/reader035/viewer/2022070408/568143cf550346895db05c46/html5/thumbnails/5.jpg)
Typical Pipeline Steps
Source RT → Horizontal Pass → Intermediate RT → Vertical Pass → Destination RT
![Page 6: DirectCompute Accelerated Separable Filtering](https://reader035.vdocument.in/reader035/viewer/2022070408/568143cf550346895db05c46/html5/thumbnails/6.jpg)
Use Bilinear HW filtering?
• Bilinear filter HW can halve the number of ALU and TEX instructions
  – Just need to compute the correct sampling offsets
• Not possible with more advanced filters
  – Usually because weighting is a dynamic operation
  – Think about bilateral cases...
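One way to compute "the correct sampling offsets" is to merge each pair of adjacent taps into a single fetch placed between the two texels. The sketch below emulates linear filtering on the CPU to check that idea. It assumes static, same-sign weights and texel centers at integer coordinates (real hardware places centers at +0.5), and all function names are illustrative.

```python
import math


def merge_taps(offsets, weights):
    """Pair adjacent taps into single linear fetches: (offset, weight) tuples."""
    merged = []
    for i in range(0, len(offsets) - 1, 2):
        w = weights[i] + weights[i + 1]
        # Weighted average position between the two texels.
        o = (offsets[i] * weights[i] + offsets[i + 1] * weights[i + 1]) / w
        merged.append((o, w))
    if len(offsets) % 2:
        # Odd tap count: the last tap keeps its own fetch.
        merged.append((float(offsets[-1]), weights[-1]))
    return merged


def sample_linear(row, pos):
    """Emulate 1D linear filtering at a fractional texel position."""
    last = len(row) - 1
    i = math.floor(pos)
    f = pos - i
    a = row[min(max(i, 0), last)]
    b = row[min(max(i + 1, 0), last)]
    return a * (1.0 - f) + b * f


def filter_direct(row, x, offsets, weights):
    """One point fetch per tap."""
    last = len(row) - 1
    return sum(w * row[min(max(x + o, 0), last)] for o, w in zip(offsets, weights))


def filter_merged(row, x, taps):
    """One linear fetch per merged tap."""
    return sum(w * sample_linear(row, x + o) for o, w in taps)
```

With a 5-tap kernel this needs 3 fetches instead of 5. The trick breaks down as soon as a weight depends on the fetched data, which is the bilateral case the slide warns about.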
![Page 7: DirectCompute Accelerated Separable Filtering](https://reader035.vdocument.in/reader035/viewer/2022070408/568143cf550346895db05c46/html5/thumbnails/7.jpg)
Where to start with DirectCompute
• Is the Pixel Shader version TEX or ALU limited?
  – You need to know what to optimize for!
  – Use IHV tools to establish this
• Achieving peak performance is not easy, so write a highly configurable kernel
  – Will allow you to easily experiment and fine tune
![Page 8: DirectCompute Accelerated Separable Filtering](https://reader035.vdocument.in/reader035/viewer/2022070408/568143cf550346895db05c46/html5/thumbnails/8.jpg)
Thread Group Shared Memory (TGSM)
• TGSM can be used to reduce TEX ops
• TGSM can also be used to cache results
  – Thus saving ALU ops too
• Load a sensible run length, based on the HW wavefront/warp size (AMD = 64, NVIDIA = 32)
  – Choose a good common multiple (multiples of 64)
![Page 9: DirectCompute Accelerated Separable Filtering](https://reader035.vdocument.in/reader035/viewer/2022070408/568143cf550346895db05c46/html5/thumbnails/9.jpg)
Kernel #1
• 128 threads load 128 texels
• 128 – (Kernel Radius * 2) threads compute results
• Redundant compute threads
![Page 10: DirectCompute Accelerated Separable Filtering](https://reader035.vdocument.in/reader035/viewer/2022070408/568143cf550346895db05c46/html5/thumbnails/10.jpg)
Avoid Redundant Threads
• Should ensure that all threads in a group have useful work to do wherever possible
• Redundant threads will not be reassigned work from another group
• Kernel #1 leaves Kernel Radius * 2 threads idle per group, which is a lot of redundancy for a large kernel diameter
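The cost of the idle threads in Kernel #1 can be estimated with a little arithmetic. This is a rough sketch with hypothetical helper names. It assumes a group size of 128 and that each group only produces results for the texels that have a full neighbourhood loaded.

```python
def kernel1_compute_threads(group_size, radius):
    """Threads that produce a result when every thread loads exactly one texel."""
    return max(group_size - 2 * radius, 0)


def kernel1_idle_fraction(group_size, radius):
    """Fraction of the group that sits idle during the compute phase."""
    return 1.0 - kernel1_compute_threads(group_size, radius) / group_size


def groups_per_line(width, results_per_group):
    """Thread groups needed to cover one image line (integer ceiling division)."""
    if results_per_group <= 0:
        raise ValueError("kernel radius too large for this group size")
    return -(-width // results_per_group)
```

For example, with a radius of 16 a quarter of each 128-thread group does no filtering work, and a 1920-texel line needs 20 groups rather than the 15 it would need if all 128 threads produced a result.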
![Page 11: DirectCompute Accelerated Separable Filtering](https://reader035.vdocument.in/reader035/viewer/2022070408/568143cf550346895db05c46/html5/thumbnails/11.jpg)
Kernel #2
• 128 threads load 128 texels
• Kernel Radius * 2 threads load 1 extra texel each
• 128 threads compute results
• No redundant compute threads
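The Kernel #2 load pattern can be emulated on the CPU to check the indexing. In this sketch a Python list stands in for TGSM, the loops over `tid` stand in for the threads of one group, and clamp-to-edge addressing is an assumption. On the GPU a group barrier such as HLSL's `GroupMemoryBarrierWithGroupSync()` would separate the two phases.

```python
GROUP_SIZE = 128  # threads per group, as on the slide


def run_group(row, group_start, radius, weights):
    """Emulate one thread group filtering GROUP_SIZE texels of a single line."""
    assert 2 * radius <= GROUP_SIZE and len(weights) == 2 * radius + 1
    last = len(row) - 1
    # shared[j] holds texel (group_start - radius + j).
    shared = [0] * (GROUP_SIZE + 2 * radius)

    # Load phase: every thread loads one texel...
    for tid in range(GROUP_SIZE):
        src = group_start - radius + tid
        shared[tid] = row[min(max(src, 0), last)]
        # ...and the first 2 * radius threads load one extra texel each.
        if tid < 2 * radius:
            src = group_start - radius + GROUP_SIZE + tid
            shared[GROUP_SIZE + tid] = row[min(max(src, 0), last)]

    # (Group barrier would go here on the GPU.)

    # Compute phase: all GROUP_SIZE threads produce a result.
    out = []
    for tid in range(GROUP_SIZE):
        out.append(sum(weights[k] * shared[tid + k] for k in range(2 * radius + 1)))
    return out
```

Thread `tid` writes the result for texel `group_start + tid`, so no thread is left without compute work.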
![Page 12: DirectCompute Accelerated Separable Filtering](https://reader035.vdocument.in/reader035/viewer/2022070408/568143cf550346895db05c46/html5/thumbnails/12.jpg)
Multiple Pixels per Thread
• Allows for natural vectorization
  – 4 works well on AMD HW
  – Doesn't hurt performance on scalar HW
• Possible to cache TGSM reads in General Purpose Registers (GPRs)
  – Quartering TGSM reads: absolute winner!!
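A small sketch of where the saving in TGSM reads comes from, assuming each thread filters 4 adjacent pixels. Neighbouring pixels share most of their taps, so the thread can read the combined window once into locals (standing in for GPRs) and reuse it. The function names and the read counters are illustrative only.

```python
def filter_naive(shared, first, count, weights):
    """Each output re-reads its whole window from shared memory."""
    reads = 0
    out = []
    for p in range(count):
        acc = 0
        for k, w in enumerate(weights):
            acc += w * shared[first + p + k]
            reads += 1
        out.append(acc)
    return out, reads


def filter_cached(shared, first, count, weights):
    """Read the combined window once, then filter from the local copy."""
    window = [shared[first + i] for i in range(count + len(weights) - 1)]
    reads = len(window)
    out = [
        sum(w * window[p + k] for k, w in enumerate(weights))
        for p in range(count)
    ]
    return out, reads
```

For a radius of 8 and 4 pixels per thread this is 20 reads instead of 68, and the ratio approaches one quarter as the radius grows.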
![Page 13: DirectCompute Accelerated Separable Filtering](https://reader035.vdocument.in/reader035/viewer/2022070408/568143cf550346895db05c46/html5/thumbnails/13.jpg)
Kernel #3
• 32 threads load 128 texels
• Kernel Radius * 2 threads load 1 extra texel each
• 32 threads compute 128 results
• Compute threads not a multiple of 64
![Page 14: DirectCompute Accelerated Separable Filtering](https://reader035.vdocument.in/reader035/viewer/2022070408/568143cf550346895db05c46/html5/thumbnails/14.jpg)
Multiple Lines per Thread Group
• Process multiple lines per thread group
  – Better than one long line
  – 2 or 4 works well
• Improved texture cache efficiency
• Compute threads back to a multiple of 64
![Page 15: DirectCompute Accelerated Separable Filtering](https://reader035.vdocument.in/reader035/viewer/2022070408/568143cf550346895db05c46/html5/thumbnails/15.jpg)
Kernel #4
• 64 threads load 256 texels
• Kernel Radius * 4 threads load 1 extra texel each
• 64 threads compute 256 results
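The progression from Kernel #2 to Kernel #4 can be summarised as three tunable parameters. The sketch below derives the per-group numbers from them. The `KernelConfig` class and its field names are hypothetical, in the spirit of the "highly configurable kernel" advice rather than taken from the original source code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelConfig:
    run_length: int         # results produced per line, per group
    pixels_per_thread: int  # adjacent pixels filtered by each thread
    lines_per_group: int    # image lines handled by one thread group
    radius: int             # kernel radius in texels

    @property
    def threads_per_group(self):
        return (self.run_length // self.pixels_per_thread) * self.lines_per_group

    @property
    def results_per_group(self):
        return self.run_length * self.lines_per_group

    @property
    def extra_loads(self):
        """Border texels beyond the run length, summed over all lines."""
        return 2 * self.radius * self.lines_per_group

    @property
    def tgsm_texels(self):
        """Texels the group must hold in shared memory."""
        return (self.run_length + 2 * self.radius) * self.lines_per_group

    def fills_wavefronts(self, wavefront_size=64):
        """True when no hardware wavefront/warp is left partially empty."""
        return self.threads_per_group % wavefront_size == 0


KERNEL_2 = KernelConfig(run_length=128, pixels_per_thread=1, lines_per_group=1, radius=8)
KERNEL_3 = KernelConfig(run_length=128, pixels_per_thread=4, lines_per_group=1, radius=8)
KERNEL_4 = KernelConfig(run_length=128, pixels_per_thread=4, lines_per_group=2, radius=8)
```

The radius of 8 is an arbitrary example value. With these settings Kernel #3 drops to 32 threads, and Kernel #4 returns to 64 threads producing 256 results.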
![Page 16: DirectCompute Accelerated Separable Filtering](https://reader035.vdocument.in/reader035/viewer/2022070408/568143cf550346895db05c46/html5/thumbnails/16.jpg)
Kernel Diameter
• Kernel diameter needs to be > 7 to see a DirectCompute win
  – Otherwise the overhead cancels out the advantage
• The larger the kernel diameter the greater the win
![Page 17: DirectCompute Accelerated Separable Filtering](https://reader035.vdocument.in/reader035/viewer/2022070408/568143cf550346895db05c46/html5/thumbnails/17.jpg)
Use Packing in TGSM
• Use packing to reduce storage space required in TGSM
  – Only have 32 KB per SIMD
• Reduces reads/writes from TGSM
• Often a uint is sufficient for color filtering
• Use SM5.0 instructions f32tof16(), f16tof32()
![Page 18: DirectCompute Accelerated Separable Filtering](https://reader035.vdocument.in/reader035/viewer/2022070408/568143cf550346895db05c46/html5/thumbnails/18.jpg)
High Definition Ambient Occlusion
Depth + Normals → HDAO buffer
Original Scene * HDAO buffer = Final Scene
![Page 19: DirectCompute Accelerated Separable Filtering](https://reader035.vdocument.in/reader035/viewer/2022070408/568143cf550346895db05c46/html5/thumbnails/19.jpg)
Perform at Half Resolution
• HDAO at full resolution is expensive
• Running at half resolution captures more occlusion and is obviously much faster
• Problem: Artifacts are introduced when combined with the full resolution scene
![Page 20: DirectCompute Accelerated Separable Filtering](https://reader035.vdocument.in/reader035/viewer/2022070408/568143cf550346895db05c46/html5/thumbnails/20.jpg)
Bilateral Dilate & Blur
HDAO buffer doesn't match with scene
A bilateral dilate & blur fixes the issue
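The slides do not give the weighting function, so the following is only a sketch of the blur half of a bilateral pass under an assumed rule: a Gaussian spatial weight that is zeroed whenever the tap's depth differs from the center depth by more than a threshold. The dilate step is not covered, and the function and parameter names are hypothetical.

```python
import math


def bilateral_blur_1d(values, depths, radius, sigma_space, depth_threshold):
    """One pass of a depth-aware blur along a single line."""
    n = len(values)
    out = []
    for x in range(n):
        acc = 0.0
        weight_sum = 0.0
        for d in range(-radius, radius + 1):
            i = min(max(x + d, 0), n - 1)  # clamp-to-edge addressing
            w = math.exp(-(d * d) / (2.0 * sigma_space * sigma_space))
            # Dynamic part: reject taps that lie across a depth discontinuity.
            if abs(depths[i] - depths[x]) > depth_threshold:
                w = 0.0
            acc += w * values[i]
            weight_sum += w
        # The center tap always passes the depth test, so weight_sum > 0.
        out.append(acc / weight_sum)
    return out
```

Because the weights depend on the fetched depth values, the sum of weights has to be accumulated per pixel and cannot be baked into static sampling offsets.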
![Page 21: DirectCompute Accelerated Separable Filtering](https://reader035.vdocument.in/reader035/viewer/2022070408/568143cf550346895db05c46/html5/thumbnails/21.jpg)
New Pipeline...
½ Res → Bilinear Upsample → Horizontal Pass → Intermediate UAV → Vertical Pass → Dilated & Blurred
• Still much faster than performing at full res!
![Page 22: DirectCompute Accelerated Separable Filtering](https://reader035.vdocument.in/reader035/viewer/2022070408/568143cf550346895db05c46/html5/thumbnails/22.jpg)
Pixel Shader vs DirectCompute
*Tested on a range of AMD and NVIDIA DX11 HW, DirectCompute is between ~2.53x and ~3.17x faster than the Pixel Shader
![Page 23: DirectCompute Accelerated Separable Filtering](https://reader035.vdocument.in/reader035/viewer/2022070408/568143cf550346895db05c46/html5/thumbnails/23.jpg)
Depth of Field
• Many techniques exist to solve this problem
• A common technique is to figure out how blurry a pixel should be
  – Often called the Circle of Confusion (CoC)
• A Gaussian blur weighted by CoC is a pretty efficient way to implement this effect
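How the CoC enters the weights is not spelled out on the slide. The sketch below assumes one plausible scheme: each neighbouring tap's Gaussian weight is multiplied by that tap's CoC (taken to be in the range 0 to 1), while the center tap keeps its full weight. Names and parameters are hypothetical.

```python
import math


def coc_weighted_blur_1d(colors, coc, radius, sigma):
    """One pass of a Gaussian blur whose tap weights are scaled by per-texel CoC."""
    n = len(colors)
    out = []
    for x in range(n):
        acc = 0.0
        weight_sum = 0.0
        for d in range(-radius, radius + 1):
            i = min(max(x + d, 0), n - 1)  # clamp-to-edge addressing
            g = math.exp(-(d * d) / (2.0 * sigma * sigma))
            # Assumption: in-focus neighbours (CoC near 0) contribute little,
            # and the center tap always contributes with full weight.
            w = g if d == 0 else g * coc[i]
            acc += w * colors[i]
            weight_sum += w
        out.append(acc / weight_sum)
    return out
```

Under this assumption a line that is fully in focus passes through unchanged, and a line with CoC of 1 everywhere receives an ordinary Gaussian blur.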
![Page 24: DirectCompute Accelerated Separable Filtering](https://reader035.vdocument.in/reader035/viewer/2022070408/568143cf550346895db05c46/html5/thumbnails/24.jpg)
The Pipeline...
[Diagram: Horizontal Pass → Intermediate UAV → Vertical Pass, with CoC as an input]
![Page 25: DirectCompute Accelerated Separable Filtering](https://reader035.vdocument.in/reader035/viewer/2022070408/568143cf550346895db05c46/html5/thumbnails/25.jpg)
Shogun 2: DoF OFF
![Page 26: DirectCompute Accelerated Separable Filtering](https://reader035.vdocument.in/reader035/viewer/2022070408/568143cf550346895db05c46/html5/thumbnails/26.jpg)
Shogun 2: DoF ON
![Page 27: DirectCompute Accelerated Separable Filtering](https://reader035.vdocument.in/reader035/viewer/2022070408/568143cf550346895db05c46/html5/thumbnails/27.jpg)
Pixel Shader vs DirectCompute
*Tested on a range of AMD and NVIDIA DX11 HW, DirectCompute is between ~1.48x and ~1.86x faster than the Pixel Shader
![Page 28: DirectCompute Accelerated Separable Filtering](https://reader035.vdocument.in/reader035/viewer/2022070408/568143cf550346895db05c46/html5/thumbnails/28.jpg)
Summary
• DirectCompute greatly accelerates larger kernel diameter filters
• Allows for filtering at full resolution
• For access to source code:
  – HDAO11: [email protected]
  – DoF11: [email protected]
![Page 29: DirectCompute Accelerated Separable Filtering](https://reader035.vdocument.in/reader035/viewer/2022070408/568143cf550346895db05c46/html5/thumbnails/29.jpg)
[email protected]@amd.com
Please fill in the feedback forms!