comparative performance analysis of vulkan implementations ... · gaussian pyramid image pixel (g...
TRANSCRIPT
ECE DepartmentUniversity of Thessaly, Greece
05/14/2019 1
Comparative Performance Analysis of Vulkan Implementations of Computational
Applications
Maria Rafaela Gkeka, Nikolaos Bellas, Christos D. Antonopoulos
Computer Systems Lab
IWOCL 2019
Outline
→Motivation
• Background
• Local Laplacian Filters (LLF) algorithm
• LLF Implementations and Experimental Evaluation
• VO KinectFusion algorithm for SLAM (Simultaneous Localization and Mapping)
• VO KinectFusion Implementations and Experimental Evaluation
• Discussion
205/14/2019 IWOCL 2019
Motivation
➢GPUs widely used as compute and graphics accelerators
○ Multiple APIs used to program them (OpenCL, OpenGL, CUDA, Vulkan, …)
➢ Vulkan is a new API aiming (among others) at integrating compute and graphics pipelines
o As of today, almost exclusively used for 3D graphics
➢ There is a need to better understand Vulkan performance implications in realistic compute applications
305/14/2019 IWOCL 2019
Outline
• Motivation
→Background
• Local Laplacian Filters (LLF) algorithm
• LLF Implementations and Experimental Evaluation
• KinectFusion algorithm for SLAM (Simultaneous Localization and Mapping)
• KinectFusion Implementations and Experimental Evaluation
• Discussion
405/14/2019 IWOCL 2019
Background
➢ Vulkan: modern low-overhead cross-platform 3D graphics and compute API○ targets high-performance real time 3D graphics applications such as video
games
○ uses the Khronos SPIR-V intermediate representation with native support
for shader features
505/14/2019 IWOCL 2019
GeForce 256 (first GPU ever)
1992
OpenGL 1.0
1999
2004
2008
GLSL / OpenGL 2.0
OpenCL 1.0
2011
SPIR 1.0
2011
2015(Feb)
OpenCL 1.2
OpenCL 2.1 (load SPIR-V)Vulkan 1.0
SPIR-V
2017
OpenGL 4.6 (load SPIR-V)
2015(Nov)
2015(Mar)
Vulkan 1.1
2019(Mar)
Outline
• Motivation
• Background
→ Local Laplacian Filters (LLF) algorithm
• LLF Implementations and Experimental Evaluation
• KinectFusion algorithm for SLAM (Simultaneous Localization and Mapping)
• KinectFusion Implementations and Experimental Evaluation
• Discussion
605/14/2019 IWOCL 2019
Local Laplacian Filter
• Given an input image, the algorithm applies detail or tone enhancements.
• Edge-preserving image processing algorithm based on the direct manipulation of Gaussian/Laplacian pyramids. [1]
7
input image
Output image with(α,β,σr)=(4,1,0.2)
Output image with(α,β,σr)=(0.25,1,0.4)
05/14/2019 IWOCL 2019
[1] Sylvain Paris et al. Local Laplacian Filters: Edge-aware Image Processing with a Laplacian Pyramid
ACM Trans. Graph. 30.4 (2011)
LLF is a pyramid-based algorithm
Gaussian pyramidLi+1 image generated by blurring (5-by-5 Gaussian filter) the Li level image and downsample by 2
8IWOCL 201905/14/2019
0.0025 0.0125 0.02 0.0125 0.00250.0125 0.0625 0.1 0.0625 0.01250.02 0.1 0.16 0.1 0.02
0.0125 0.0625 0.1 0.0625 0.01250.0025 0.0125 0.02 0.0125 0.0025
Laplacian pyramidSaves the error between Gaussian images at intermediate levels.The algorithm adds in each level image the expanded image of the lower level.
level 0
level 1level 2
level 3 level 4
LLF algorithm
9
IWOCL 201905/14/2019
Gaussian pyramid generation
Remapping of region Rij around (xi,yj)
Convolution
Downsampling
Upsampling
Convolution
Subtraction
Upsampling & Addition
Remapping
Determines the kind (detail, edge) of each pixel and manipulates the neighbor region.
Creates a new sub-image for each Gaussian pyramid image pixel (g0).
∀(𝑥𝑖 , 𝑦𝑗 , 𝑙0 )
𝑙0+1 timesGaussian pyramid of Rij region
Laplacian for pixel(xi,yj)
9
level 0
level 1 level 2level 3 level 4
Outline
• Motivation
• Background
• Local Laplacian Filters (LLF) algorithm
→ LLF Implementations and Experimental Evaluation
• KinectFusion algorithm for SLAM (Simultaneous Localization and Mapping)
• KinectFusion Implementations and Experimental Evaluation
• Discussion
1005/14/2019 IWOCL 2019
LLF Sequential Implementation
11
Intel® Core i7-4820K @3.70GHz, 16GB DRAM
Single threaded C code O3 optimizations
800x533 input image
05/14/2019 IWOCL 2019
88142 ms
Where is parallelism?
+ Loop iterations are independent
+ Workload is balanced across pyramid levels
05/14/2019 IWOCL 2019 12
– But, parallel execution has very high memory requirements to store intermediate images
12
Gaussian pyramid generation
Remapping of region Rij around (xi,yj)
Convolution
Downsampling
Upsampling
Convolution
Subtraction
Upsampling & Addition
∀(𝑥𝑖 , 𝑦𝑗 , 𝑙0 )
𝑙0+1 times
Data parallel execution
Five OpenCL kernels executing in sequence.
• Remapping
• Convolution (blurring)
• Downsampling
• Upsampling
• Subtraction
Grid: processing all pixels of a single line
Work-items: processing single pixel of the sub-image
05/14/2019 IWOCL 2019 13
13
Gaussian pyramid generation
Remapping of region Rij around (xi,yj)
Convolution
Downsampling
Upsampling
Convolution
Subtraction
Upsampling & Addition
∀(𝑥𝑖 , 𝑦𝑗 , 𝑙0 )
𝑙0+1 times
Optimized OpenCL Implementation
14
Speedup = 19.3
05/14/2019 IWOCL 2019
88142 ms
4567 ms
nVIDIA GeForce GTX 770 @1.046 GHz, 1536 SP cores
2 GB GDDR5
OpenCL 1.2
Execution time per pyramid Level
15
Parallel calculations of each pixel. The ET of smaller levels decreases because depends only on pyramid image size.
The ET of smaller levels increases because of the bigger region used for calculations of one pixel.
05/14/2019 IWOCL 2019
46%
24%
16%
14%
LLF Vulkan Implementation and Optimization
In Vulkan API all shaders (kernels)
are presented in SPIR-V format.
SPIR-V is a binary intermediate
representation for graphical
shaders and compute kernels.
In our implementation we use the
glslangValidator for the compute
shaders compilation.
Vulkan API Version: 1.0.61
16
OpenGL kernel(.comp, .vert,
.frag)
SPIR-V bin shader(.spv)
Vulkan application
(.cpp)
glslangValidator
SPIR-V assembly
spirv-dis
load to
or glslc
05/14/2019 IWOCL 2019
0
10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
Sequential OpenCL Vulkan-v1
Exe
cuti
on
tim
e (
s)
LLF Implementation
Allocation-Transfer
Upsample and Add
Subtraction
Downsample
Upsample
Blurring
Remapping
Gaussian Pyramid
0
1000
2000
3000
4000
5000
6000
7000
OpenCL Vulkan-v1
Exec
uti
on
tim
e (s
)
LLF Implementation
Allocation-Transfer
Upsample and Add
Subtraction
Downsample
Upsample
Blurring
Remapping
Gaussian Pyramid
Baseline Vulkan Implementation (v1)
1705/14/2019 IWOCL 2019
Create a pipeline for each shader
Choose Memory Heap and Memory Type
Host or Device visible
Allocate buffers in memory
Execute each shader
Add each pipeline to a command buffer
Submit the command buffer to execution queue
SPIR-V does not support parameterizable work-group size
Vulkan-v1: different kernel versions for each work-group size
88142 ms
6561 ms4567 ms
Speedup = 19.3 Speedup = 5.2
0
10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
Sequential OpenCL Vulkan-v1 Vulkan-v2
Exe
cuti
on
tim
e (
s)
LLF Implementation
Allocation-Transfer
Upsample and Add
Subtraction
Downsample
Upsample
Blurring
Remapping
Gaussian Pyramid
Command Buffer optimization (v2)
1805/14/2019 IWOCL 2019
• We can record the work of all iterations in one command buffer and synchronize using memory barriers between iterations.
• Lower overhead due to kernel launching compared with OpenCL/OpenGL (and CUDA)
Speedup = 19.3Speedup = 13.4
4567 ms 6561 ms
Speedup = 17.4
88142 ms
5061 ms
0
1000
2000
3000
4000
5000
6000
7000
OpenCL Vulkan-v1 Vulkan-v2
Exec
uti
on
tim
e (s
)
LLF Implementation
Allocation-TransferUpsample andAddSubtraction
Downsample
Upsample
Blurring
Outline
• Motivation
• Background
• Local Laplacian Filters (LLF) algorithm
• LLF Implementations and Experimental Evaluation
→KinectFusion algorithm for SLAM (Simultaneous Localization and Mapping)
• KinectFusion Implementations and Experimental Evaluation
• Discussion
1905/14/2019 IWOCL 2019
Simultaneous Localization and Mapping (SLAM)
SLAM is used in robotics for autonomous movement:• Dynamically building a map of the environment (mapping)
• Navigating this environment using the map while keeping track of the robot’s relative position and orientation
• Used in many real systems and applications• Mobile robotics
• Driverless cars
• Many algorithms and implementations→ KinectFusion implementation in C++, OpenMP, OpenCL and CUDA [2]
05/14/2019 IWOCL 2019 20
[2] L.Nardi et al. Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM
ICRA 2015
Visual Odometry (VO) & Mapping
• VO: the robot localize itself in its environment based on the constructed maps• without any human input
• Mapping: build the map of the environment based on sensory information• e.g. RGB-D cameras
• Lidar
05/14/2019 IWOCL 2019 21
VO
Mapping
SLAM
KinectFusion
05/14/2019 IWOCL 2019 22
Outline
• Motivation
• Background
• Local Laplacian Filters (LLF) algorithm
• LLF Implementations and Experimental Evaluation
• KinectFusion algorithm for SLAM (Simultaneous Localization and Mapping)
→KinectFusion Implementations and Experimental Evaluation
• Discussion
2305/14/2019 IWOCL 2019
KinectFusion Sequential Implementation
Input
Video stream of 882 depth frames VGA resolution (640x480)
05/14/2019 IWOCL 2019 24
Volume initialization
Type conversion
Filtering
Pyramid generation
Track
∀frame
Depth to Vertex
Vertex to Normal
(L-1) times
L times
Reduce
kl0 timesfor eachlevel
Tracking/VO
Preprocessing
0.0513 s
0
0.01
0.02
0.03
0.04
0.05
0.06
Sequential
Exe
cuti
on
tim
e p
er
fram
e (
s)
VO Implementation
tracking
preprocessing
acquisition
0
0.01
0.02
0.03
0.04
0.05
0.06
Sequential OpenCL-v1 OpenCL-v2
Exe
cuti
on
tim
e p
er
fram
e (
s)
VO Implementation
tracking
preprocessing
acquisition
0.0513 s
0.0052 s 0.0013 s
KinectFusion OpenCL Implementation
OpenCL-v1: OpenCL compiler optimizations disabled
•“-cl-opt-disable”
OpenCL-v2: Default OpenCL compiler optimizations
nVIDIA GeForce GTX 770 @1.046 GHz, 1536 SP cores
2 GB GDDR5
OpenCL 1.2
05/14/2019 IWOCL 2019 25
Speedup = 9.9 Speedup = 38.5
0
0.001
0.002
0.003
0.004
0.005
0.006
OpenCL-v1 OpenCL-v2
Exec
uti
on
tim
e (s
)
VO Implementation
tracking
preprocessing
acquisition
0
0.01
0.02
0.03
0.04
0.05
0.06
Sequential OpenCL-v1 OpenCL-v2 Vulkan-v1 Vulkan-v2
Exe
cuti
on
tim
e p
er
fram
e (
s)
VO Implementation
tracking
preprocessing
acquisition
KinectFusion Vulkan Implementation
Vulkan-v1: Initial implementation
Vulkan-v2: Command buffers optimization
05/14/2019 IWOCL 2019 26
Speedup = 9.9
0.0513 s
0.0052 s 0.0013 s
Speedup = 38.5Speedup = 4.5
Speedup = 4.9
0.0113 s 0.0105 s
0
0.002
0.004
0.006
0.008
0.01
0.012
OpenCL-v1 OpenCL-v2 Vulkan-v1 Vulkan-v2
Exec
uti
on
tim
e (s
)
VO Implementation
tracking
preprocessing
acquisition
Outline
• Motivation
• Background
• Local Laplacian Filters (LLF) algorithm
• LLF Implementations and Experimental Evaluation
• KinectFusion algorithm for SLAM (Simultaneous Localization and Mapping)
• KinectFusion Implementations and Experimental Evaluation
→Discussion
2705/14/2019 IWOCL 2019
Discussion (I)
05/14/2019 IWOCL 2019 28
28
Gaussian pyramid generation
Remapping of region Rij around (xi,yj)
Convolution
Downsampling
Upsampling
Convolution
Subtraction
Upsampling & Addition
∀(𝒙𝒊, 𝒚𝒋 , 𝒍𝟎 )
𝑙0+1 times
Volume initialization
Type conversion
Filtering
Pyramid generation
Track
∀frame
Depth to Vertex
Vertex to Normal
(L-1) times
L times
Reduce
kl0 timesfor eachlevel
Tracking/VO
Preprocessing
Vulkan has widely different behavior in the two apps.
Vulkan in LLF is almost as fast as OpenCL in LLF
LLF: 1 command buffer invocation for each pyramid level
VO: up to 20 command buffer invocations per frame
At the end of each frame, the flow transfers data to CPU for checking. Processing of next frame depends on the checkingresults.
Therefore large invocation overhead compared with optimized LLF.
LLF flow: VO flow:
Cross-iteration dependences require that kernels are invoked
from different command buffers
Discussion (II)
05/14/2019IWOCL 2019 29
Compiler support for SPIR-V is still immature• spirv-opt does not improve performance
• Many data types not supported by the SPIR-V IR• char, short
• Array of structs in buffer elements
Discussion (III)
+ Programmer can leverage low level semantics of Vulkan (e.g. command buffer manipulation) to improve performance• Use a single command buffer and synchronize using memory barriers.
Important for iterative applications.
• Avoid frequent control and memory exchanges with CPU memory
– Additional complexity of Vulkan compute worth it?
– Vulkan compute is still work in progress!• Still cannot load an OpenCL kernel without intermediate compilation to
GLSL compute shader.
• SPIR-V Driver/compiler immaturity
05/14/2019 IWOCL 2019 30
Acknowledgements
05/14/2019 IWOCL 2019 31
«Acknowledgment: This research has been co‐financed by the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH – CREATE – INNOVATE Project VipGPU: Very Low Power GPUs for Mobile Robotics and Virtual Reality applications (project code: Τ1ΕDΚ-01149 )».
Thank you for your attentionQuestions?
05/14/2019 IWOCL 2019 32