High Performance Edge-Preserving Filter on GPU · 2014-04-18
![Page 1: High Performance Edge-Preserving Filter on GPU · 2014-04-18 · Title: High Performance Edge-Preserving Filter on GPU Author: Jonas Li Subject: The goal of this session is to show](https://reader034.vdocument.in/reader034/viewer/2022050409/5f8676d914373f611a7cc44c/html5/thumbnails/1.jpg)
HIGH PERFORMANCE EDGE-PRESERVING FILTER
ON GPU
Jonas Li
Compute Architect, NVIDIA
AGENDA
Algorithm basis
Optimization on GPU
Performance data
Demo
ALGORITHM BASIS
Based on the state-of-the-art “domain transform” [1]
— From R^5 (X,Y,R,G,B) to R^2 (XY,RGB), distances preserved
— (X,RGB) -> ct(x)
— (Y,RGB) -> ct(y)
— 1D smoothing on ct(x) and ct(y) ≡ edge-preserving smoothing of the 2D color image
$$ct(u) = \int_0^u \left(1 + |I'(x)|\right)\,dx$$
[1]: http://www.inf.ufrgs.br/~eslgastal/DomainTransform/
ct(x) accumulates the L1 distances between successive samples of the curve (x, I(x))
$ct : \Omega \to \Omega_w$
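As a rough illustration of the accumulation described above, the following C++ sketch computes ct along one row of an RGB image on the CPU. The `Rgb` struct, the function name `domain_transform_1d`, and the `ratio` parameter are assumptions made for this example, not names from the talk. With `ratio = 1` the recurrence is a discrete form of the slide's integral; the paper [1] additionally weights the color term by σs/σr, which is what `ratio` stands in for.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One RGB sample; channel values are assumed to be floats (e.g. in [0, 1]).
struct Rgb { float r, g, b; };

// Discrete domain transform along one row (or column):
//   ct[0] = 0
//   ct[i] = ct[i-1] + 1 + ratio * (|dR| + |dG| + |dB|)
// `ratio` is a hypothetical parameter standing in for sigma_s / sigma_r.
std::vector<float> domain_transform_1d(const std::vector<Rgb>& row,
                                       float ratio = 1.0f) {
    std::vector<float> ct(row.size(), 0.0f);
    for (std::size_t i = 1; i < row.size(); ++i) {
        // L1 distance between neighboring colors.
        float l1 = std::fabs(row[i].r - row[i - 1].r)
                 + std::fabs(row[i].g - row[i - 1].g)
                 + std::fabs(row[i].b - row[i - 1].b);
        ct[i] = ct[i - 1] + 1.0f + ratio * l1;
    }
    return ct;
}
```

In flat regions ct grows by 1 per pixel, while a strong edge adds a large jump, so samples on opposite sides of the edge end up far apart in the transformed domain.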
ALGORITHM BASIS
GPU implementation
— Normalized convolution is GPU-friendly
— Implemented on Tegra K1
— 2 main phases:
    — Integral image (calculate ct(x), ct(y))
    — Normalized convolution (binary search, matrix transposition)
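To make the interplay of the two phases concrete, here is a minimal CPU sketch of a 1D normalized convolution (a box filter in the transformed domain) for a single channel. It assumes `values` and `ct` have the same length and that `ct` is non-decreasing; the function name and the `radius` parameter are illustrative, and the GPU kernels in the talk are organized differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Averages, for every sample i, all samples j whose transformed coordinate
// satisfies ct[i] - radius <= ct[j] <= ct[i] + radius.
std::vector<float> normalized_convolution_1d(const std::vector<float>& values,
                                             const std::vector<float>& ct,
                                             float radius) {
    const std::size_t n = values.size();
    // Running sum: prefix[i] = values[0] + ... + values[i-1], so any window
    // sum is a difference of two entries.
    std::vector<double> prefix(n + 1, 0.0);
    for (std::size_t i = 0; i < n; ++i) prefix[i + 1] = prefix[i] + values[i];

    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        // Two binary searches find the window bounds in the sorted ct array.
        auto lo = std::lower_bound(ct.begin(), ct.end(), ct[i] - radius) - ct.begin();
        auto hi = std::upper_bound(ct.begin(), ct.end(), ct[i] + radius) - ct.begin();
        // The window always contains sample i itself, so hi > lo.
        out[i] = static_cast<float>((prefix[hi] - prefix[lo]) / double(hi - lo));
    }
    return out;
}
```

Because an edge inflates the distance between neighboring ct values, the window found by the searches stops at the edge, which is where the edge-preserving behavior comes from.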
1ST PHASE: INTEGRAL IMAGE
Naïve implementation
— Large thread block
— Warp-wide shuffle
— Shared memory
Issues
— Warp synchronization
partial_sum = integral(thread_value);                 // scan within each warp
SMEM <- partial_sum;
sync();
if (warp_id == 0) warp_sum = integral(partial_sum);   // scan the per-warp sums
SMEM <- warp_sum;
sync();
if (warp_id != 0) thread_value += warp_sum;           // add the sums of preceding warps
sync();
output();
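The three steps of the naïve scheme can be emulated sequentially on the CPU, which makes it easier to see where the block-wide barriers are needed. This is only a sketch: `block_inclusive_scan` and `warp_total` are hypothetical names, the inner loop stands in for a warp-shuffle scan, and the vector `warp_total` stands in for shared memory.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kWarpSize = 32;  // CUDA warp width

// Sequential emulation of the three-step block scan. Each "sync" comment
// marks a point where the GPU version needs a block-wide barrier.
std::vector<int> block_inclusive_scan(const std::vector<int>& in) {
    const std::size_t n = in.size();
    const std::size_t warps = (n + kWarpSize - 1) / kWarpSize;
    std::vector<int> out(in);
    std::vector<int> warp_total(warps, 0);  // stands in for shared memory

    // Step 1: every warp scans its own 32 values.
    for (std::size_t w = 0; w < warps; ++w) {
        int running = 0;
        for (std::size_t i = w * kWarpSize; i < n && i < (w + 1) * kWarpSize; ++i) {
            running += in[i];
            out[i] = running;
        }
        warp_total[w] = running;
    }
    // sync(): all warp totals must be visible before step 2.

    // Step 2: warp 0 turns the per-warp totals into exclusive offsets.
    int offset = 0;
    for (std::size_t w = 0; w < warps; ++w) {
        int total = warp_total[w];
        warp_total[w] = offset;
        offset += total;
    }
    // sync(): offsets must be visible before step 3.

    // Step 3: every warp except warp 0 adds the sum of all preceding warps.
    for (std::size_t i = kWarpSize; i < n; ++i) out[i] += warp_total[i / kWarpSize];
    return out;
}
```

Between the barriers most warps sit idle while warp 0 works, which is the synchronization cost listed under "Issues".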
1ST PHASE: INTEGRAL IMAGE
Optimized implementation
— Eliminate warp synchronization
— Data prefetching
— 2.7x speedup vs. naïve code
— 95% of peak performance
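The transcript does not show how the optimized kernel removes the barriers. One common approach, sketched here purely as an assumption, is to let a single warp walk along a whole row in 32-element chunks and carry the running total forward, so no other warp ever has to be waited on. The CPU emulation below uses the hypothetical names `row_scan_with_carry` and `kChunk`.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kChunk = 32;  // one warp-wide chunk

// Emulates one warp scanning a row chunk by chunk. The last value of each
// chunk becomes the carry for the next one, so no block-wide barrier and no
// inter-warp communication is required.
std::vector<float> row_scan_with_carry(const std::vector<float>& row) {
    std::vector<float> out(row.size());
    float carry = 0.0f;
    for (std::size_t base = 0; base < row.size(); base += kChunk) {
        // On a GPU, the loads for the next chunk could be issued here, before
        // the current chunk is scanned, to hide memory latency (prefetching).
        float running = carry;
        for (std::size_t i = base; i < row.size() && i < base + kChunk; ++i) {
            running += row[i];
            out[i] = running;
        }
        carry = running;
    }
    return out;
}
```

Whether this matches the presented kernel is not confirmed by the slides; it only illustrates that a scan can be organized so that each row depends on a single warp.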
2ND PHASE: NORMALIZED CONVOLUTION
Per-thread binary search
— Partially overlapped range
— Still highly divergent
— Data dependency
Matrix transposition
— Bank conflicts
while (right > left) {
    idx = (right + left) / 2;
    v = input(idx);
    if (v > val) {
        right = idx;
    } else {
        left = idx + 1;
    }
}
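Wrapped in a function, the loop above is an upper-bound search: it returns the index of the first element greater than `val`. The sketch below assumes the search covers the whole sorted array, whereas the slide's kernel starts from a narrower, partially overlapped range per thread; the function name `first_greater` is illustrative.

```cpp
#include <cstddef>
#include <vector>

// Returns the index of the first element greater than val in a sorted
// (non-decreasing) array, or input.size() if no such element exists.
std::size_t first_greater(const std::vector<float>& input, float val) {
    std::size_t left = 0;
    std::size_t right = input.size();
    while (right > left) {
        std::size_t idx = left + (right - left) / 2;  // midpoint without overflow
        if (input[idx] > val) {
            right = idx;     // the answer is at idx or to its left
        } else {
            left = idx + 1;  // the answer is strictly to the right
        }
    }
    return left;
}
```

Each iteration depends on the value loaded in the previous one, and neighboring threads may branch differently, which corresponds to the data dependency and divergence noted on the slide.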
BINARY SEARCH
Texture
— Seems natural to handle divergence
— Longer latency
— 2D cache locality
Shared memory
— Data copy/refresh overhead
— Limited size per thread
L1 Cache
— As fast as shared memory
— Hardware managed data refresh
[Chart: "Binary search: Texture/Smem/L1 (lower is better)"; x-axis: radius, 100 to 1000; y-axis: kernel time (ms), 0 to 50; series: tex, L1, Smem]
MATRIX TRANSPOSITION
Previous implementation[1]
— Transpose in shared memory
— Tile padding to avoid bank conflict
Our implementation
— Transpose in shared memory
— No bank conflict
— No tile padding
[1]: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#shared-memory-in-matrix-multiplication-c-aa
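The effect of padding on bank conflicts can be checked with a small host-side model. The sketch below assumes 32 banks of 4 bytes each and a 32 x 32 tile of 4-byte elements, and counts how many threads of a warp land in the same bank when reading one tile column. The `swizzled` layout is a generic XOR scheme included only to show that a conflict-free layout without padding exists; the transcript does not say which layout the talk uses, and its 4-way conflict figure comes from a tile configuration that is not reproduced here.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

constexpr std::size_t kBanks = 32;  // assumed: 32 banks, each 4 bytes wide
constexpr std::size_t kTile = 32;   // assumed: 32 x 32 tile of 4-byte elements

// Largest number of threads in a warp that hit the same bank when thread t
// accesses the word at word_index(t). A result of 1 means conflict-free.
template <typename IndexFn>
std::size_t conflict_degree(IndexFn word_index) {
    std::array<std::size_t, kBanks> hits{};
    for (std::size_t t = 0; t < kBanks; ++t) ++hits[word_index(t) % kBanks];
    return *std::max_element(hits.begin(), hits.end());
}

// Word offset of tile[row][col] in an unpadded tile.
inline std::size_t plain(std::size_t row, std::size_t col) {
    return row * kTile + col;
}
// Same element with one padding word per row.
inline std::size_t padded(std::size_t row, std::size_t col) {
    return row * (kTile + 1) + col;
}
// Same element with an XOR-swizzled column and no padding (hypothetical layout).
inline std::size_t swizzled(std::size_t row, std::size_t col) {
    return row * kTile + (col ^ row);
}
```

In this model a column read of the plain tile is a 32-way conflict, while both the padded and the swizzled layouts are conflict-free for row and column accesses; the swizzled one does so without spending extra shared memory.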
MATRIX TRANSPOSITION
4 warps per block
4 bytes per pixel
Naïve implementation:
4-way bank conflict in shared memory
MATRIX TRANSPOSITION
4 warps per block
4 bytes per pixel
Our implementation:
Store: No bank conflict
Load: No bank conflict
PERFORMANCE DATA
Performance comparison with naïve version
Tegra K1, 1920×1080, BGR color video
— Naïve version: 15.52 fps
— Optimized version: 32.59 fps
[Chart: relative speedup of the optimized version over the naïve version for Integral image, Normalized convolution, and Total; y-axis: speedup, 0 to 3]
Total speedup: ~2.1x
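The total speedup follows directly from the two frame rates reported above; converting them to per-frame times also shows how the optimized version relates to a 30 fps budget of about 33.3 ms per frame. The symbols below are introduced only for this worked calculation.

```latex
\[
\text{speedup} = \frac{32.59\ \text{fps}}{15.52\ \text{fps}} \approx 2.10,
\qquad
t_{\text{naive}} = \frac{1000\ \text{ms}}{15.52} \approx 64.4\ \text{ms},
\qquad
t_{\text{opt}} = \frac{1000\ \text{ms}}{32.59} \approx 30.7\ \text{ms}
\]
```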
SUMMARY
A real-time edge-preserving filter on GPU is presented
Fast integral without warp synchronization
L1-based in-place binary search
Efficient matrix transpose scheme
DEMO