comparative performance analysis of vulkan implementations ... · gaussian pyramid image pixel (g...

32
ECE Department University of Thessaly, Greece 05/14/2019 1 Comparative Performance Analysis of Vulkan Implementations of Computational Applications Maria Rafaela Gkeka, Nikolaos Bellas, Christos D. Antonopoulos Computer Systems Lab IWOCL 2019

Upload: others

Post on 13-Jul-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

ECE DepartmentUniversity of Thessaly, Greece

05/14/2019 1

Comparative Performance Analysis of Vulkan Implementations of Computational

Applications

Maria Rafaela Gkeka, Nikolaos Bellas, Christos D. Antonopoulos

Computer Systems Lab

IWOCL 2019

Page 2: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

Outline

→Motivation

• Background

• Local Laplacian Filters (LLF) algorithm

• LLF Implementations and Experimental Evaluation

• VO KinectFusion algorithm for SLAM (Simultaneous Localization and Mapping)

• VO KinectFusion Implementations and Experimental Evaluation

• Discussion

205/14/2019 IWOCL 2019

Page 3: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

Motivation

➢GPUs widely used as compute and graphics accelerators

○ Multiple APIs used to program them (OpenCL, OpenGL, CUDA, Vulkan, …)

➢ Vulkan is a new API aiming (among others) at integrating compute and graphics pipelines

o As of today, almost exclusively used for 3D graphics

➢ There is a need to better understand Vulkan performance implications in realistic compute applications

305/14/2019 IWOCL 2019

Page 4: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

Outline

• Motivation

→Background

• Local Laplacian Filters (LLF) algorithm

• LLF Implementations and Experimental Evaluation

• KinectFusion algorithm for SLAM (Simultaneous Localization and Mapping)

• KinectFusion Implementations and Experimental Evaluation

• Discussion

405/14/2019 IWOCL 2019

Page 5: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

Background

➢ Vulkan: modern low-overhead cross-platform 3D graphics and compute API○ targets high-performance real time 3D graphics applications such as video

games

○ uses the Khronos SPIR-V intermediate representation with native support

for shader features

505/14/2019 IWOCL 2019

GeForce 256 (first GPU ever)

1992

OpenGL 1.0

1999

2004

2008

GLSL / OpenGL 2.0

OpenCL 1.0

2011

SPIR 1.0

2011

2015(Feb)

OpenCL 1.2

OpenCL 2.1 (load SPIR-V)Vulkan 1.0

SPIR-V

2017

OpenGL 4.6 (load SPIR-V)

2015(Nov)

2015(Mar)

Vulkan 1.1

2019(Mar)

Page 6: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

Outline

• Motivation

• Background

→ Local Laplacian Filters (LLF) algorithm

• LLF Implementations and Experimental Evaluation

• KinectFusion algorithm for SLAM (Simultaneous Localization and Mapping)

• KinectFusion Implementations and Experimental Evaluation

• Discussion

605/14/2019 IWOCL 2019

Page 7: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

Local Laplacian Filter

• Given an input image, the algorithm applies detail or tone enhancements.

• Edge-preserving image processing algorithm based on the direct manipulation of Gaussian/Laplacian pyramids. [1]

7

input image

Output image with(α,β,σr)=(4,1,0.2)

Output image with(α,β,σr)=(0.25,1,0.4)

05/14/2019 IWOCL 2019

[1] Sylvain Paris et al. Local Laplacian Filters: Edge-aware Image Processing with a Laplacian Pyramid

ACM Trans. Graph. 30.4 (2011)

Page 8: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

LLF is a pyramid-based algorithm

Gaussian pyramidLi+1 image generated by blurring (5-by-5 Gaussian filter) the Li level image and downsample by 2

8IWOCL 201905/14/2019

0.0025 0.0125 0.02 0.0125 0.00250.0125 0.0625 0.1 0.0625 0.01250.02 0.1 0.16 0.1 0.02

0.0125 0.0625 0.1 0.0625 0.01250.0025 0.0125 0.02 0.0125 0.0025

Laplacian pyramidSaves the error between Gaussian images at intermediate levels.The algorithm adds in each level image the expanded image of the lower level.

level 0

level 1level 2

level 3 level 4

Page 9: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

LLF algorithm

9

IWOCL 201905/14/2019

Gaussian pyramid generation

Remapping of region Rij around (xi,yj)

Convolution

Downsampling

Upsampling

Convolution

Subtraction

Upsampling & Addition

Remapping

Determines the kind (detail, edge) of each pixel and manipulates the neighbor region.

Creates a new sub-image for each Gaussian pyramid image pixel (g0).

∀(𝑥𝑖 , 𝑦𝑗 , 𝑙0 )

𝑙0+1 timesGaussian pyramid of Rij region

Laplacian for pixel(xi,yj)

9

level 0

level 1 level 2level 3 level 4

Page 10: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

Outline

• Motivation

• Background

• Local Laplacian Filters (LLF) algorithm

→ LLF Implementations and Experimental Evaluation

• KinectFusion algorithm for SLAM (Simultaneous Localization and Mapping)

• KinectFusion Implementations and Experimental Evaluation

• Discussion

1005/14/2019 IWOCL 2019

Page 11: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

LLF Sequential Implementation

11

Intel® Core i7-4820K @3.70GHz, 16GB DRAM

Single threaded C code O3 optimizations

800x533 input image

05/14/2019 IWOCL 2019

88142 ms

Page 12: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

Where is parallelism?

+ Loop iterations are independent

+ Workload is balanced across pyramid levels

05/14/2019 IWOCL 2019 12

– But, parallel execution has very high memory requirements to store intermediate images

12

Gaussian pyramid generation

Remapping of region Rij around (xi,yj)

Convolution

Downsampling

Upsampling

Convolution

Subtraction

Upsampling & Addition

∀(𝑥𝑖 , 𝑦𝑗 , 𝑙0 )

𝑙0+1 times

Page 13: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

Data parallel execution

Five OpenCL kernels executing in sequence.

• Remapping

• Convolution (blurring)

• Downsampling

• Upsampling

• Subtraction

Grid: processing all pixels of a single line

Work-items: processing single pixel of the sub-image

05/14/2019 IWOCL 2019 13

13

Gaussian pyramid generation

Remapping of region Rij around (xi,yj)

Convolution

Downsampling

Upsampling

Convolution

Subtraction

Upsampling & Addition

∀(𝑥𝑖 , 𝑦𝑗 , 𝑙0 )

𝑙0+1 times

Page 14: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

Optimized OpenCL Implementation

14

Speedup = 19.3

05/14/2019 IWOCL 2019

88142 ms

4567 ms

nVIDIA GeForce GTX 770 @1.046 GHz, 1536 SP cores

2 GB GDDR5

OpenCL 1.2

Page 15: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

Execution time per pyramid Level

15

Parallel calculations of each pixel. The ET of smaller levels decreases because depends only on pyramid image size.

The ET of smaller levels increases because of the bigger region used for calculations of one pixel.

05/14/2019 IWOCL 2019

46%

24%

16%

14%

Page 16: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

LLF Vulkan Implementation and Optimization

In Vulkan API all shaders (kernels)

are presented in SPIR-V format.

SPIR-V is a binary intermediate

representation for graphical

shaders and compute kernels.

In our implementation we use the

glslangValidator for the compute

shaders compilation.

Vulkan API Version: 1.0.61

16

OpenGL kernel(.comp, .vert,

.frag)

SPIR-V bin shader(.spv)

Vulkan application

(.cpp)

glslangValidator

SPIR-V assembly

spirv-dis

load to

or glslc

05/14/2019 IWOCL 2019

Page 17: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

0

10000

20000

30000

40000

50000

60000

70000

80000

90000

100000

Sequential OpenCL Vulkan-v1

Exe

cuti

on

tim

e (

s)

LLF Implementation

Allocation-Transfer

Upsample and Add

Subtraction

Downsample

Upsample

Blurring

Remapping

Gaussian Pyramid

0

1000

2000

3000

4000

5000

6000

7000

OpenCL Vulkan-v1

Exec

uti

on

tim

e (s

)

LLF Implementation

Allocation-Transfer

Upsample and Add

Subtraction

Downsample

Upsample

Blurring

Remapping

Gaussian Pyramid

Baseline Vulkan Implementation (v1)

1705/14/2019 IWOCL 2019

Create a pipeline for each shader

Choose Memory Heap and Memory Type

Host or Device visible

Allocate buffers in memory

Execute each shader

Add each pipeline to a command buffer

Submit the command buffer to execution queue

SPIR-V does not support parameterizable work-group size

Vulkan-v1: different kernel versions for each work-group size

88142 ms

6561 ms4567 ms

Speedup = 19.3 Speedup = 5.2

Page 18: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

0

10000

20000

30000

40000

50000

60000

70000

80000

90000

100000

Sequential OpenCL Vulkan-v1 Vulkan-v2

Exe

cuti

on

tim

e (

s)

LLF Implementation

Allocation-Transfer

Upsample and Add

Subtraction

Downsample

Upsample

Blurring

Remapping

Gaussian Pyramid

Command Buffer optimization (v2)

1805/14/2019 IWOCL 2019

• We can record the work of all iterations in one command buffer and synchronize using memory barriers between iterations.

• Lower overhead due to kernel launching compared with OpenCL/OpenGL (and CUDA)

Speedup = 19.3Speedup = 13.4

4567 ms 6561 ms

Speedup = 17.4

88142 ms

5061 ms

0

1000

2000

3000

4000

5000

6000

7000

OpenCL Vulkan-v1 Vulkan-v2

Exec

uti

on

tim

e (s

)

LLF Implementation

Allocation-TransferUpsample andAddSubtraction

Downsample

Upsample

Blurring

Page 19: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

Outline

• Motivation

• Background

• Local Laplacian Filters (LLF) algorithm

• LLF Implementations and Experimental Evaluation

→KinectFusion algorithm for SLAM (Simultaneous Localization and Mapping)

• KinectFusion Implementations and Experimental Evaluation

• Discussion

1905/14/2019 IWOCL 2019

Page 20: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

Simultaneous Localization and Mapping (SLAM)

SLAM is used in robotics for autonomous movement:• Dynamically building a map of the environment (mapping)

• Navigating this environment using the map while keeping track of the robot’s relative position and orientation

• Used in many real systems and applications• Mobile robotics

• Driverless cars

• Many algorithms and implementations→ KinectFusion implementation in C++, OpenMP, OpenCL and CUDA [2]

05/14/2019 IWOCL 2019 20

[2] L.Nardi et al. Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM

ICRA 2015

Page 21: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

Visual Odometry (VO) & Mapping

• VO: the robot localize itself in its environment based on the constructed maps• without any human input

• Mapping: build the map of the environment based on sensory information• e.g. RGB-D cameras

• Lidar

05/14/2019 IWOCL 2019 21

VO

Mapping

SLAM

Page 22: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

KinectFusion

05/14/2019 IWOCL 2019 22

Page 23: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

Outline

• Motivation

• Background

• Local Laplacian Filters (LLF) algorithm

• LLF Implementations and Experimental Evaluation

• KinectFusion algorithm for SLAM (Simultaneous Localization and Mapping)

→KinectFusion Implementations and Experimental Evaluation

• Discussion

2305/14/2019 IWOCL 2019

Page 24: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

KinectFusion Sequential Implementation

Input

Video stream of 882 depth frames VGA resolution (640x480)

05/14/2019 IWOCL 2019 24

Volume initialization

Type conversion

Filtering

Pyramid generation

Track

∀frame

Depth to Vertex

Vertex to Normal

(L-1) times

L times

Reduce

kl0 timesfor eachlevel

Tracking/VO

Preprocessing

0.0513 s

0

0.01

0.02

0.03

0.04

0.05

0.06

Sequential

Exe

cuti

on

tim

e p

er

fram

e (

s)

VO Implementation

tracking

preprocessing

acquisition

Page 25: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

0

0.01

0.02

0.03

0.04

0.05

0.06

Sequential OpenCL-v1 OpenCL-v2

Exe

cuti

on

tim

e p

er

fram

e (

s)

VO Implementation

tracking

preprocessing

acquisition

0.0513 s

0.0052 s 0.0013 s

KinectFusion OpenCL Implementation

OpenCL-v1: OpenCL compiler optimizations disabled

•“-cl-opt-disable”

OpenCL-v2: Default OpenCL compiler optimizations

nVIDIA GeForce GTX 770 @1.046 GHz, 1536 SP cores

2 GB GDDR5

OpenCL 1.2

05/14/2019 IWOCL 2019 25

Speedup = 9.9 Speedup = 38.5

0

0.001

0.002

0.003

0.004

0.005

0.006

OpenCL-v1 OpenCL-v2

Exec

uti

on

tim

e (s

)

VO Implementation

tracking

preprocessing

acquisition

Page 26: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

0

0.01

0.02

0.03

0.04

0.05

0.06

Sequential OpenCL-v1 OpenCL-v2 Vulkan-v1 Vulkan-v2

Exe

cuti

on

tim

e p

er

fram

e (

s)

VO Implementation

tracking

preprocessing

acquisition

KinectFusion Vulkan Implementation

Vulkan-v1: Initial implementation

Vulkan-v2: Command buffers optimization

05/14/2019 IWOCL 2019 26

Speedup = 9.9

0.0513 s

0.0052 s 0.0013 s

Speedup = 38.5Speedup = 4.5

Speedup = 4.9

0.0113 s 0.0105 s

0

0.002

0.004

0.006

0.008

0.01

0.012

OpenCL-v1 OpenCL-v2 Vulkan-v1 Vulkan-v2

Exec

uti

on

tim

e (s

)

VO Implementation

tracking

preprocessing

acquisition

Page 27: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

Outline

• Motivation

• Background

• Local Laplacian Filters (LLF) algorithm

• LLF Implementations and Experimental Evaluation

• KinectFusion algorithm for SLAM (Simultaneous Localization and Mapping)

• KinectFusion Implementations and Experimental Evaluation

→Discussion

2705/14/2019 IWOCL 2019

Page 28: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

Discussion (I)

05/14/2019 IWOCL 2019 28

28

Gaussian pyramid generation

Remapping of region Rij around (xi,yj)

Convolution

Downsampling

Upsampling

Convolution

Subtraction

Upsampling & Addition

∀(𝒙𝒊, 𝒚𝒋 , 𝒍𝟎 )

𝑙0+1 times

Volume initialization

Type conversion

Filtering

Pyramid generation

Track

∀frame

Depth to Vertex

Vertex to Normal

(L-1) times

L times

Reduce

kl0 timesfor eachlevel

Tracking/VO

Preprocessing

Vulkan has widely different behavior in the two apps.

Vulkan in LLF is almost as fast as OpenCL in LLF

LLF: 1 command buffer invocation for each pyramid level

VO: up to 20 command buffer invocations per frame

At the end of each frame, the flow transfers data to CPU for checking. Processing of next frame depends on the checkingresults.

Therefore large invocation overhead compared with optimized LLF.

LLF flow: VO flow:

Cross-iteration dependences require that kernels are invoked

from different command buffers

Page 29: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

Discussion (II)

05/14/2019IWOCL 2019 29

Compiler support for SPIR-V is still immature• spirv-opt does not improve performance

• Many data types not supported by the SPIR-V IR• char, short

• Array of structs in buffer elements

Page 30: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

Discussion (III)

+ Programmer can leverage low level semantics of Vulkan (e.g. command buffer manipulation) to improve performance• Use a single command buffer and synchronize using memory barriers.

Important for iterative applications.

• Avoid frequent control and memory exchanges with CPU memory

– Additional complexity of Vulkan compute worth it?

– Vulkan compute is still work in progress!• Still cannot load an OpenCL kernel without intermediate compilation to

GLSL compute shader.

• SPIR-V Driver/compiler immaturity

05/14/2019 IWOCL 2019 30

Page 31: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

Acknowledgements

05/14/2019 IWOCL 2019 31

«Acknowledgment: This research has been co‐financed by the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH – CREATE – INNOVATE Project VipGPU: Very Low Power GPUs for Mobile Robotics and Virtual Reality applications (project code: Τ1ΕDΚ-01149 )».

Page 32: Comparative Performance Analysis of Vulkan Implementations ... · Gaussian pyramid image pixel (g 0). ... Blurring Remapping Gaussian Pyramid 0 1000 2000 3000 4000 5000 6000 7000

Thank you for your attentionQuestions?

05/14/2019 IWOCL 2019 32