Efficient Video Processing on Embedded GPU

Tobias Kammacher, Armin Weiss, Matthias Frei
Institute of Embedded Systems, High Performance Multimedia Research Group
Zurich University of Applied Sciences (ZHAW)

Post on 01-Jun-2020


Zürcher Fachhochschule


Goals

1. Share Experiences

2. Benefits of Embedded GPU

3. Bottlenecks


Experience with GStreamer on Embedded Devices

[Diagram labels: Gbps, Mbps]

• Live Video Stream
– HW / SW
– Embedded + 4K
– Drivers


[Figure: Nvidia Jetson TX1 Development Board with 4K HDMI Capture Module]


• Multi Camera Capture

– Debayer on GPU

• GPU is powerful
– Realtime?


• Live Video Processing
– Computer Vision
– Deep Learning

[Figure: detection example with the labels "Person" and "Tree"]


Embedded: Nvidia TX1/TX2

Interfaces: CSI, PCIe, USB, Ethernet

Image: nvidia.com


Processing: GStreamer, MM API, CPU, GPU, DMAs

CODECs: H.264, H.265, VP8

Streaming: HLS, MPEG-TS, RT(S)P


Software Frameworks on TX1/TX2

• OS: Linux for Tegra (L4T) by Nvidia
– Kernel 4.4.15
– Video Input: V4L2 drivers (e.g. for CSI)
– Video Output: Xorg or proprietary framebuffer

• Multimedia APIs
– GStreamer
  • Hardware Scaling, CODECs (omx)
  • Video Input, Display
  • ISP hidden
– L4T Multimedia API (Nvidia)
  • Video input, V4L2 API, Buffer management
– OpenCV, Deep Learning Frameworks (TensorRT, YOLO, ..)

• GPU Integration
– CUDA
– OpenGL (ES) / EGL
– Vulkan

GStreamer is free software available under the terms of the LGPL license.
OpenGL® and the oval logo are trademarks or registered trademarks of Silicon Graphics, Inc.


Software Stack

[Figure: software stack on TX1/TX2, layered from hardware up to high-level frameworks]

HW: CPU, GPU, Video Source, VI (CSI), Display Ctrl, Eth PHY, PCIe Ctrl, CODECs (H.264/265/VP8)

Kernel Space:
– Linux Kernel (Frameworks): V4L2, videobuf2, ALSA, DRM/KMS/FB, TCP/IP/UDP
– Modules / Drivers: v4l2-subdev, Host1x / Graphics Host, GPU Driver, Eth Driver

Libraries: OpenGL, EGL, Vulkan, CUDA, OpenMAX (omx), X11, Sockets

User Space:
– GStreamer: Sources (v4l2, alsa, tcp/udp), Sinks (xvideo, overlay (omx), tcp/udp), Processing (mix, scale, convert, cuda, openGL), CODECs (omx h264/h265, libav, mp3), Stream (rtp, rtsp, hls, mpeg-ts)
– Multimedia API: libargus, V4L2 API, NVOSD, Buffer utility, VisionWorks, Convert (cuda, openGL), NvVideoEncoder, NvVideoDecoder

High Level: OpenCV (-> AI), TensorRT


Simple Video Streaming Pipeline: HLS

GStreamer pipeline: V4L2 Source -> Convert -> Encode H.265 -> MPEG-TS Mux -> HLS Sink -> WebServer (lighttpd)

$ gst-launch-1.0 v4l2src ! \
    videoconvert ! \
    omxh265enc bitrate=5000000 ! \
    mpegtsmux ! \
    hlssink \
      playlist-location=/var/www/playlist.m3u8 \
      location=/var/www/segment%05d.ts \
      playlist-root=http://192.168.0.1


Video Processing: Scaling, Mixing

Mixing two sources (4K and 1080p)

[Figure: GStreamer pipeline. Two V4L2 sources feed Scale and Mix (PiP), followed by Format Convert and Render (HDMI)]


Video Processing Example: Scaling, Mixing

[Figure: picture-in-picture composition of a 4K video, a 1080p video and a logo. Images: CC BY-SA Wikimedia]


• CPU: Using compositor element: 1.2 FPS

• OpenGL (glvideomixer & glimagesink): 6.8 FPS

• Need a solution with better performance => GPU
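The gap between the measured rates above and a real-time pipeline can be made concrete with a little arithmetic. The sketch below assumes a 30 FPS target (the frame rate requested in the pipeline caps); the function and variable names are made up for illustration.

```python
REQUIRED_FPS = 30.0  # assumed target, taken from the 30/1 framerate in the caps

# Frame rates quoted on the slide for the two mixing approaches
measured_fps = {"CPU (compositor)": 1.2, "OpenGL (glvideomixer)": 6.8}


def frame_time_ms(fps):
    """Average time spent per frame, in milliseconds."""
    return 1000.0 / fps


def required_speedup(fps, target_fps=REQUIRED_FPS):
    """Factor by which the pipeline must get faster to hit the target."""
    return target_fps / fps


for name, fps in measured_fps.items():
    print(f"{name}: {frame_time_ms(fps):.0f} ms per frame, "
          f"needs {required_speedup(fps):.1f}x to reach {REQUIRED_FPS:.0f} FPS")
```

At 30 FPS each frame has roughly 33 ms available, so both variants are far outside the budget.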


gst-launch-1.0 \
    v4l2src ! 'video/x-raw, format=UYVY, framerate=30/1, width=3840, height=2160' ! \
    compositor name=comp sink_0::alpha=1 sink_1::alpha=0.5 ! \
    xvimagesink sync=false \
    videotestsrc pattern=1 ! 'video/x-raw, format=UYVY, framerate=30/1, width=1000, height=1000' ! \
    comp.
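To put the caps in the pipeline above in perspective, the raw data rate of each input can be estimated as follows. This is a rough sketch: it assumes UYVY is packed 4:2:2 at 16 bits per pixel and ignores blanking and stride padding; the function name is hypothetical.

```python
def raw_video_rate_bps(width, height, fps, bits_per_pixel=16):
    """Raw video data rate in bit/s, assuming tightly packed frames."""
    return width * height * bits_per_pixel * fps


# Caps taken from the pipeline above
rate_4k = raw_video_rate_bps(3840, 2160, 30)
rate_test = raw_video_rate_bps(1000, 1000, 30)

print(f"4K UYVY @ 30 FPS:        {rate_4k / 1e9:.2f} Gbit/s")
print(f"1000x1000 UYVY @ 30 FPS: {rate_test / 1e9:.2f} Gbit/s")
```

The 4K input alone is close to 4 Gbit/s, which is why every additional copy of a frame between CPU and GPU matters.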

[Chart: PiP Pipeline FPS (axis 0 to 35) for CPU, OpenGL and the required rate; the way to reach the required rate is marked "?"]


Use GPU with GStreamer

• GStreamer Plugin

• From Nvidia: nvivafilter
– CUDA processing
– NVMM frame format (Nvidia internal)
– EGLImage type
– Only 1 input and 1 output pad

• Our own plugin (internal)
– CUDA processing
– Multiple input pads, 1 output pad
– Allocate managed memory from GPU and pass to src plugin
– Support Userptr io-mode

• Alternatives?

[Figure: GStreamer pipeline. Two V4L2 sources -> GPU Plugin -> Display Sink, with signals]


GPU Processing: GPU Memory Access Methods

[Figure: TX1 block diagram. CPU and GPU, each with an L2 cache, access 4 GB of DRAM through the memory controller]

• Unified Virtual Addressing: separate CPU buffer and GPU buffer


• Zero Copy: one shared buffer (same TX1 block diagram)


• Managed Memory: one shared buffer (same TX1 block diagram)


GPU Processing: PiP Test (GPU Data Transfer and Kernel Execution)

Unified Virtual Addressing
Step 1: cudaMemcpy() to GPU (*): 12.5 ms
Step 2: Execute kernel: 9-11 ms
Step 3: cudaMemcpy() to host (*): 7.2 ms
Total: 30 ms

(*) Upload 4K + 1080p, Download 4K

Zero Copy
Step 1: cudaMallocHost(): Allocate memory on host (one-time operation): -
Step 2: Execute kernel: 23.5-25.7 ms
Total: 25 ms

Managed Memory
Step 1: cudaMallocManaged(): Allocate shared memory (one-time operation): -
Step 2: Execute kernel: 9-11 ms
Step 3: Synchronize with CPU: 0.2 ms
Total: 10 ms
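The per-step timings above can be checked against the time available per frame. The sketch below sums the quoted step times for each memory access method and compares the worst case with a 30 FPS budget. It assumes the steps run strictly one after another and covers only the GPU transfer and kernel time, not the rest of the pipeline; all names are illustrative.

```python
FRAME_BUDGET_MS = 1000.0 / 30  # time available per frame at 30 FPS

# (best case, worst case) per-frame time in ms, summed from the quoted steps
methods = {
    "Unified Virtual Addressing": (12.5 + 9.0 + 7.2, 12.5 + 11.0 + 7.2),
    "Zero Copy": (23.5, 25.7),
    "Managed Memory": (9.0 + 0.2, 11.0 + 0.2),
}


def headroom_ms(worst_case_ms, budget_ms=FRAME_BUDGET_MS):
    """Time left in the frame budget after GPU transfer and kernel work."""
    return budget_ms - worst_case_ms


for name, (best, worst) in methods.items():
    print(f"{name}: {best:.1f}-{worst:.1f} ms per frame, "
          f"headroom {headroom_ms(worst):.1f} ms")
```

All three methods fit into the budget on paper, but only managed memory leaves substantial headroom for capture, format conversion and rendering.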


GPU Processing: Results

• PiP pipeline achieves 30 FPS
– Using managed memory

Additional:
• Consecutive kernels executed faster

[Chart: PiP Pipeline FPS (axis 0 to 35) for CPU, OpenGL and GPU]


Conclusion: Hardware Mapping

[Figure: processing chain from Video Input, 2nd Video Source and Audio to Ethernet Output. Blocks: Color Space Conversion, Scaling, Picture in Picture, H.264/H.265 Encoder, Audio/Video Mux, Encryption, Transport Protocol Packer, Forward Error Correction, Recorder. Legend: each block is mapped to GPU, HW Block or CPU. Data rate labels: Gbps, Mbps]


Conclusion

• Live 4K on Embedded

• GPU and HW-accelerated blocks

– Enable Desktop -> Embedded

• Bottlenecks and Solutions

– Allocate GPU Managed Memory for Capture

– Gst GPU Plugin


Get started with embedded GPU now!

Blog: https://blog.zhaw.ch/high-performance/

4K Drivers: https://github.com/ines-hpmm

Hardware Board: http://pender.ch/products_zhaw.shtml
