comparative performance analysis of vulkan implementations ... · gaussian pyramid image pixel (g...

ECE DepartmentUniversity of Thessaly, Greece

05/14/2019 1

Comparative Performance Analysis of Vulkan Implementations of Computational

Applications

Maria Rafaela Gkeka, Nikolaos Bellas, Christos D. Antonopoulos

Computer Systems Lab

IWOCL 2019

Outline

→Motivation

• Background

• Local Laplacian Filters (LLF) algorithm

• LLF Implementations and Experimental Evaluation

• VO KinectFusion algorithm for SLAM (Simultaneous Localization and Mapping)

• VO KinectFusion Implementations and Experimental Evaluation

• Discussion

205/14/2019 IWOCL 2019

Motivation

➢GPUs widely used as compute and graphics accelerators

○ Multiple APIs used to program them (OpenCL, OpenGL, CUDA, Vulkan, …)

➢ Vulkan is a new API aiming (among others) at integrating compute and graphics pipelines

o As of today, almost exclusively used for 3D graphics

➢ There is a need to better understand Vulkan performance implications in realistic compute applications

305/14/2019 IWOCL 2019

Outline

• Motivation

→Background



• KinectFusion algorithm for SLAM (Simultaneous Localization and Mapping)

• KinectFusion Implementations and Experimental Evaluation

• Discussion

405/14/2019 IWOCL 2019

Background

➢ Vulkan: modern low-overhead cross-platform 3D graphics and compute API○ targets high-performance real time 3D graphics applications such as video

games

○ uses the Khronos SPIR-V intermediate representation with native support

for shader features

505/14/2019 IWOCL 2019

GeForce 256 (first GPU ever)

1992

OpenGL 1.0

1999

2004

2008

GLSL / OpenGL 2.0

OpenCL 1.0

2011

SPIR 1.0

2011

2015(Feb)

OpenCL 1.2

OpenCL 2.1 (load SPIR-V)Vulkan 1.0

SPIR-V

2017

OpenGL 4.6 (load SPIR-V)

2015(Nov)

2015(Mar)

Vulkan 1.1

2019(Mar)

Outline

• Motivation

• Background

→ Local Laplacian Filters (LLF) algorithm




• Discussion

605/14/2019 IWOCL 2019

Local Laplacian Filter

• Given an input image, the algorithm applies detail or tone enhancements.

• Edge-preserving image processing algorithm based on the direct manipulation of Gaussian/Laplacian pyramids. [1]

7

input image

Output image with(α,β,σr)=(4,1,0.2)

Output image with(α,β,σr)=(0.25,1,0.4)

05/14/2019 IWOCL 2019

[1] Sylvain Paris et al. Local Laplacian Filters: Edge-aware Image Processing with a Laplacian Pyramid

ACM Trans. Graph. 30.4 (2011)

LLF is a pyramid-based algorithm

Gaussian pyramidLi+1 image generated by blurring (5-by-5 Gaussian filter) the Li level image and downsample by 2

8IWOCL 201905/14/2019

0.0025 0.0125 0.02 0.0125 0.00250.0125 0.0625 0.1 0.0625 0.01250.02 0.1 0.16 0.1 0.02

0.0125 0.0625 0.1 0.0625 0.01250.0025 0.0125 0.02 0.0125 0.0025

Laplacian pyramidSaves the error between Gaussian images at intermediate levels.The algorithm adds in each level image the expanded image of the lower level.

level 0

level 1level 2

level 3 level 4

LLF algorithm

9

IWOCL 201905/14/2019

Gaussian pyramid generation

Remapping of region Rij around (xi,yj)

Convolution

Downsampling

Upsampling

Convolution

Subtraction

Upsampling & Addition

Remapping

Determines the kind (detail, edge) of each pixel and manipulates the neighbor region.

Creates a new sub-image for each Gaussian pyramid image pixel (g0).

∀(𝑥𝑖 , 𝑦𝑗 , 𝑙0 )

𝑙0+1 timesGaussian pyramid of Rij region

Laplacian for pixel(xi,yj)

9

level 0

level 1 level 2level 3 level 4

Outline

• Motivation

• Background


→ LLF Implementations and Experimental Evaluation



• Discussion

1005/14/2019 IWOCL 2019

LLF Sequential Implementation

11

Intel® Core i7-4820K @3.70GHz, 16GB DRAM

Single threaded C code O3 optimizations

800x533 input image

05/14/2019 IWOCL 2019

88142 ms

Where is parallelism?

+ Loop iterations are independent

+ Workload is balanced across pyramid levels

05/14/2019 IWOCL 2019 12

– But, parallel execution has very high memory requirements to store intermediate images

12



Convolution

Downsampling

Upsampling

Convolution

Subtraction


∀(𝑥𝑖 , 𝑦𝑗 , 𝑙0 )

𝑙0+1 times

Data parallel execution

Five OpenCL kernels executing in sequence.

• Remapping

• Convolution (blurring)

• Downsampling

• Upsampling

• Subtraction

Grid: processing all pixels of a single line

Work-items: processing single pixel of the sub-image

05/14/2019 IWOCL 2019 13

13



Convolution

Downsampling

Upsampling

Convolution

Subtraction


∀(𝑥𝑖 , 𝑦𝑗 , 𝑙0 )

𝑙0+1 times

Optimized OpenCL Implementation

14

Speedup = 19.3

05/14/2019 IWOCL 2019

88142 ms

4567 ms

nVIDIA GeForce GTX 770 @1.046 GHz, 1536 SP cores

2 GB GDDR5

OpenCL 1.2

Execution time per pyramid Level

15

Parallel calculations of each pixel. The ET of smaller levels decreases because depends only on pyramid image size.

The ET of smaller levels increases because of the bigger region used for calculations of one pixel.

05/14/2019 IWOCL 2019

46%

24%

16%

14%

LLF Vulkan Implementation and Optimization

In Vulkan API all shaders (kernels)

are presented in SPIR-V format.

SPIR-V is a binary intermediate

representation for graphical

shaders and compute kernels.

In our implementation we use the

glslangValidator for the compute

shaders compilation.

Vulkan API Version: 1.0.61

16

OpenGL kernel(.comp, .vert,

.frag)

SPIR-V bin shader(.spv)

Vulkan application

(.cpp)

glslangValidator

SPIR-V assembly

spirv-dis

load to

or glslc

05/14/2019 IWOCL 2019

0

10000

20000

30000

40000

50000

60000

70000

80000

90000

100000

Sequential OpenCL Vulkan-v1

Exe

cuti

on

tim

e (

s)

LLF Implementation

Allocation-Transfer

Upsample and Add

Subtraction

Downsample

Upsample

Blurring

Remapping

Gaussian Pyramid

0

1000

2000

3000

4000

5000

6000

7000

OpenCL Vulkan-v1

Exec

uti

on

tim

e (s

)

LLF Implementation

Allocation-Transfer

Upsample and Add

Subtraction

Downsample

Upsample

Blurring

Remapping

Gaussian Pyramid

Baseline Vulkan Implementation (v1)

1705/14/2019 IWOCL 2019

Create a pipeline for each shader

Choose Memory Heap and Memory Type

Host or Device visible

Allocate buffers in memory

Execute each shader

Add each pipeline to a command buffer

Submit the command buffer to execution queue

SPIR-V does not support parameterizable work-group size

Vulkan-v1: different kernel versions for each work-group size

88142 ms

6561 ms4567 ms

Speedup = 19.3 Speedup = 5.2

0

10000

20000

30000

40000

50000

60000

70000

80000

90000

100000

Sequential OpenCL Vulkan-v1 Vulkan-v2

Exe

cuti

on

tim

e (

s)

LLF Implementation

Allocation-Transfer

Upsample and Add

Subtraction

Downsample

Upsample

Blurring

Remapping

Gaussian Pyramid

Command Buffer optimization (v2)

1805/14/2019 IWOCL 2019

• We can record the work of all iterations in one command buffer and synchronize using memory barriers between iterations.

• Lower overhead due to kernel launching compared with OpenCL/OpenGL (and CUDA)

Speedup = 19.3Speedup = 13.4

4567 ms 6561 ms

Speedup = 17.4

88142 ms

5061 ms

0

1000

2000

3000

4000

5000

6000

7000

OpenCL Vulkan-v1 Vulkan-v2

Exec

uti

on

tim

e (s

)

LLF Implementation

Allocation-TransferUpsample andAddSubtraction

Downsample

Upsample

Blurring

Outline

• Motivation

• Background



→KinectFusion algorithm for SLAM (Simultaneous Localization and Mapping)


• Discussion

1905/14/2019 IWOCL 2019

Simultaneous Localization and Mapping (SLAM)

SLAM is used in robotics for autonomous movement:• Dynamically building a map of the environment (mapping)

• Navigating this environment using the map while keeping track of the robot’s relative position and orientation

• Used in many real systems and applications• Mobile robotics

• Driverless cars

• Many algorithms and implementations→ KinectFusion implementation in C++, OpenMP, OpenCL and CUDA [2]

05/14/2019 IWOCL 2019 20

[2] L.Nardi et al. Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM

ICRA 2015

Visual Odometry (VO) & Mapping

• VO: the robot localize itself in its environment based on the constructed maps• without any human input

• Mapping: build the map of the environment based on sensory information• e.g. RGB-D cameras

• Lidar

05/14/2019 IWOCL 2019 21

VO

Mapping

SLAM

KinectFusion

05/14/2019 IWOCL 2019 22

Outline

• Motivation

• Background




→KinectFusion Implementations and Experimental Evaluation

• Discussion

2305/14/2019 IWOCL 2019

KinectFusion Sequential Implementation

Input

Video stream of 882 depth frames VGA resolution (640x480)

05/14/2019 IWOCL 2019 24

Volume initialization

Type conversion

Filtering

Pyramid generation

Track

∀frame

Depth to Vertex

Vertex to Normal

(L-1) times

L times

Reduce

kl0 timesfor eachlevel

Tracking/VO

Preprocessing

0.0513 s

0

0.01

0.02

0.03

0.04

0.05

0.06

Sequential

Exe

cuti

on

tim

e p

er

fram

e (

s)

VO Implementation

tracking

preprocessing

acquisition

0

0.01

0.02

0.03

0.04

0.05

0.06

Sequential OpenCL-v1 OpenCL-v2

Exe

cuti

on

tim

e p

er

fram

e (

s)

VO Implementation

tracking

preprocessing

acquisition

0.0513 s

0.0052 s 0.0013 s

KinectFusion OpenCL Implementation

OpenCL-v1: OpenCL compiler optimizations disabled

•“-cl-opt-disable”

OpenCL-v2: Default OpenCL compiler optimizations

nVIDIA GeForce GTX 770 @1.046 GHz, 1536 SP cores

2 GB GDDR5

OpenCL 1.2

05/14/2019 IWOCL 2019 25

Speedup = 9.9 Speedup = 38.5

0

0.001

0.002

0.003

0.004

0.005

0.006

OpenCL-v1 OpenCL-v2

Exec

uti

on

tim

e (s

)

VO Implementation

tracking

preprocessing

acquisition

0

0.01

0.02

0.03

0.04

0.05

0.06

Sequential OpenCL-v1 OpenCL-v2 Vulkan-v1 Vulkan-v2

Exe

cuti

on

tim

e p

er

fram

e (

s)

VO Implementation

tracking

preprocessing

acquisition

KinectFusion Vulkan Implementation

Vulkan-v1: Initial implementation

Vulkan-v2: Command buffers optimization

05/14/2019 IWOCL 2019 26

Speedup = 9.9

0.0513 s

0.0052 s 0.0013 s

Speedup = 38.5Speedup = 4.5

Speedup = 4.9

0.0113 s 0.0105 s

0

0.002

0.004

0.006

0.008

0.01

0.012

OpenCL-v1 OpenCL-v2 Vulkan-v1 Vulkan-v2

Exec

uti

on

tim

e (s

)

VO Implementation

tracking

preprocessing

acquisition

Outline

• Motivation

• Background





→Discussion

2705/14/2019 IWOCL 2019

Discussion (I)

05/14/2019 IWOCL 2019 28

28



Convolution

Downsampling

Upsampling

Convolution

Subtraction


∀(𝒙𝒊, 𝒚𝒋 , 𝒍𝟎 )

𝑙0+1 times

Volume initialization

Type conversion

Filtering

Pyramid generation

Track

∀frame

Depth to Vertex

Vertex to Normal

(L-1) times

L times

Reduce

kl0 timesfor eachlevel

Tracking/VO

Preprocessing

Vulkan has widely different behavior in the two apps.

Vulkan in LLF is almost as fast as OpenCL in LLF

LLF: 1 command buffer invocation for each pyramid level

VO: up to 20 command buffer invocations per frame

At the end of each frame, the flow transfers data to CPU for checking. Processing of next frame depends on the checkingresults.

Therefore large invocation overhead compared with optimized LLF.

LLF flow: VO flow:

Cross-iteration dependences require that kernels are invoked

from different command buffers

Discussion (II)

05/14/2019IWOCL 2019 29

Compiler support for SPIR-V is still immature• spirv-opt does not improve performance

• Many data types not supported by the SPIR-V IR• char, short

• Array of structs in buffer elements

Discussion (III)

+ Programmer can leverage low level semantics of Vulkan (e.g. command buffer manipulation) to improve performance• Use a single command buffer and synchronize using memory barriers.

Important for iterative applications.

• Avoid frequent control and memory exchanges with CPU memory

– Additional complexity of Vulkan compute worth it?

– Vulkan compute is still work in progress!• Still cannot load an OpenCL kernel without intermediate compilation to

GLSL compute shader.

• SPIR-V Driver/compiler immaturity

05/14/2019 IWOCL 2019 30

Acknowledgements

05/14/2019 IWOCL 2019 31

«Acknowledgment: This research has been co‐financed by the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH – CREATE – INNOVATE Project VipGPU: Very Low Power GPUs for Mobile Robotics and Virtual Reality applications (project code: Τ1ΕDΚ-01149 )».

Thank you for your attentionQuestions?

05/14/2019 IWOCL 2019 32

comparative performance analysis of vulkan implementations ... · gaussian pyramid image pixel (g...

Documents