deep learning in matlab - gtc on-demand featured...

Post on 23-Jul-2018


© 2017 The MathWorks, Inc.

Deep learning in MATLAB: From Concept to CUDA Code

Roy Fahn

Applications Engineer

Systematics

royf@systematics.co.il

03-7660111

Ram Kokku

Principal Engineer

MathWorks

ram.kokku@mathworks.com

2

Talk Outline

Design Deep Learning & Vision Algorithms
• Manage large image sets
• Automate image labeling
• Easy access to models
• Pre-built training frameworks

Accelerate and Scale Training
• Acceleration with GPUs
• Scale to clusters

High Performance Deployment
• Automate compilation with GPU Coder
• On Titan XP: 7x faster than TensorFlow, 5x faster than pyCaffe2
• On Jetson: on par with TensorRT, 2x faster than C++-Caffe

3

Example: Transfer Learning Workflow

Transfer Learning

Load Reference Network
Modify Network Structure
Learn New Weights
New Classifier

Training Data: Images, Labels

Labels: Cars, Trucks, BigTrucks, SUVs, Vans

4

Example: Transfer Learning in MATLAB

Set up training dataset

• Split, shuffle, re-arrange images
• Read image, data augmentation (clip, rotate, resize, etc.)

Easily manage large sets of images

• Single line of code to access images
• Operates on disk, database, big-data file system
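The "split, shuffle" step above can be sketched in a few lines. This is a language-neutral Python illustration, not the MATLAB code shown in the talk: the helper name `split_each_label`, the 70/30 ratio, and the file names are all assumptions made up for the example; only the five labels come from the slides.

```python
import random
from collections import defaultdict

def split_each_label(items, train_fraction, seed=0):
    """Shuffle (path, label) pairs and split them label by label.

    Hypothetical helper for illustration; not a MATLAB or toolbox API.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for path, label in items:
        by_label[label].append(path)

    train, test = [], []
    for label in sorted(by_label):
        paths = by_label[label][:]          # copy so the input is not mutated
        rng.shuffle(paths)                  # shuffle within each label
        cut = round(len(paths) * train_fraction)
        train += [(p, label) for p in paths[:cut]]
        test += [(p, label) for p in paths[cut:]]
    return train, test

# Made-up file names; the labels are the ones listed on the workflow slide.
labels = ["Cars", "Trucks", "BigTrucks", "SUVs", "Vans"]
items = [(f"{lab}/img{i:03d}.jpg", lab) for lab in labels for i in range(10)]
train, test = split_each_label(items, train_fraction=0.7)
print(len(train), len(test))
```

Splitting per label rather than over the whole list keeps every class represented in both subsets, which matters when some classes have few images.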

5

Example: Transfer Learning in MATLAB

Set up training dataset
Load Reference Network

Create DNNs in MATLAB

1. Easy access to research models
2. Caffe Model importer
3. Build from scratch

6

Example: Transfer Learning in MATLAB

Set up training dataset
Load Reference Network
Modify Network Structure

8

Example: Transfer Learning in MATLAB

Set up training dataset
Load Reference Network
Modify Network Structure
Learn New Weights

Many more training options

9

Deep learning on CPU, GPU, multi-GPU and clusters

More GPUs

10

Deep learning on CPU, GPU, multi-GPU and clusters

More GPUs

More CPUs

13

Visualizing and Debugging Intermediate Results

Filters…

Activations

Deep Dream

Panels: Training Accuracy Visualization, Deep Dream, Layer Activations, Feature Visualization

• Many options for visualizations and debugging
• Examples to get started

14

GPU Coder for Deployment: New Product in R2017b

GPU Coder: Accelerated implementation of parallel algorithms on GPUs

• Neural Networks (deep learning, machine learning): 7x faster than state-of-art
• Image Processing and Computer Vision (image filtering, feature detection/extraction): 700x faster than CPUs for feature extraction
• Signal Processing and Communications (FFT, filtering, cross correlation): 20x faster than CPUs for FFTs

15

GPU Coder Compilation Flow

GPU Coder

CUDA Kernel creation
• Library function mapping
• Loop optimizations
• Dependence analysis

Memory allocation
• Data locality analysis
• GPU memory allocation

Data transfer minimization
• Data-dependence analysis
• Dynamic memcpy reduction

16

GPU Coder Generates CUDA from MATLAB: saxpy

Panels: Scalarized MATLAB, Vectorized MATLAB, CUDA

CUDA kernel for GPU parallelization

Loops and matrix operations are directly compiled into kernels
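For readers without the slide images, saxpy computes z = a*x + y element by element. The sketch below is plain Python, not the MATLAB or generated CUDA shown in the talk, and the function names are made up for illustration. It shows the two source forms the slide contrasts: an explicit loop ("scalarized") and a whole-array expression ("vectorized"). Every element is computed independently of the others, which is the property that typically lets such code map onto one GPU thread per element.

```python
def saxpy_scalarized(a, x, y):
    """Explicit loop: each iteration is independent of the others."""
    z = [0.0] * len(x)
    for i in range(len(x)):
        z[i] = a * x[i] + y[i]
    return z

def saxpy_vectorized(a, x, y):
    """Whole-array form of the same computation."""
    return [a * xi + yi for xi, yi in zip(x, y)]

a = 2.0
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]

# Both forms describe the same result.
assert saxpy_scalarized(a, x, y) == saxpy_vectorized(a, x, y)
print(saxpy_scalarized(a, x, y))  # [12.0, 24.0, 36.0, 48.0]
```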

17

Generated CUDA Optimized for Memory Performance

Panels: Mandelbrot space, CUDA

CUDA kernel for GPU parallelization

Kernel data allocation is automatically optimized
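The Mandelbrot computation named on this slide can be sketched as follows. This is a plain Python illustration under stated assumptions, not the MATLAB source or the generated CUDA from the talk: the function names, the grid bounds, and the iteration cap of 50 are all chosen for the example. Each grid point is iterated independently, which is why the problem is a common GPU demonstration.

```python
def mandelbrot_count(c, max_iter=50):
    """Number of z = z*z + c steps before |z| exceeds 2, capped at max_iter."""
    z = 0j
    for n in range(max_iter):
        if abs(z) > 2.0:
            return n
        z = z * z + c
    return max_iter

def mandelbrot_grid(x_min, x_max, y_min, y_max, width, height, max_iter=50):
    """Escape counts for a width-by-height grid of points in the complex plane."""
    grid = []
    for row in range(height):
        y = y_min + (y_max - y_min) * row / (height - 1)
        grid.append([
            mandelbrot_count(
                complex(x_min + (x_max - x_min) * col / (width - 1), y),
                max_iter,
            )
            for col in range(width)
        ])
    return grid

# Coarse text rendering: '#' marks points that never escaped.
grid = mandelbrot_grid(-2.0, 1.0, -1.0, 1.0, width=31, height=11)
for row in grid:
    print("".join("#" if n == 50 else "." for n in row))
```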

20

Algorithm Design to Embedded Deployment Workflow

MATLAB algorithm (functional reference)

1. Functional test
2. Deployment unit-test. Target: Desktop GPU, C++. Build type: .mex. Call CUDA from MATLAB directly.
3. Deployment integration-test. Target: Desktop GPU, C++. Build type: .lib. Call CUDA from (C++) hand-coded main().
4. Real-time test. Target: Embedded GPU. Build type: cross-compiled .lib. Call CUDA from (C++) hand-coded main().

21

Demo: Alexnet Deployment with ‘mex’ Code Generation

22

Algorithm Design to Embedded Deployment on Tegra GPU

MATLAB algorithm (functional reference)

1. Functional test (test in MATLAB on host)
2. Deployment unit-test (test generated code in MATLAB on host + GPU). Target: Tesla GPU, C++. Build type: .mex. Call CUDA from MATLAB directly.
3. Deployment integration-test (test generated code within C/C++ app on host + GPU). Target: Tesla GPU, C++. Build type: .lib. Call CUDA from (C++) hand-coded main().
4. Real-time test (test generated code within C/C++ app on Tegra target). Target: Tegra GPU. Build type: cross-compiled .lib. Call CUDA from (C++) hand-coded main(). Cross-compiled on host with Linaro toolchain.

23

Alexnet Deployment to Tegra: Cross-Compiled with ‘lib’

Two small changes

1. Change build-type to ‘lib’

2. Select cross-compile toolchain

24

End-to-End Application: Lane Detection

Pipeline:

Image
Lane detection CNN (transfer learning from Alexnet)
Left lane coefficients, right lane coefficients
Post-processing (find left/right lane points)
Image with marked lanes

Output of the CNN is the lane parabola coefficients (a, b, c), according to: y = ax^2 + bx + c
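To make the parabola model concrete, here is a minimal Python sketch of turning one lane's three coefficients into points. The function name `lane_points` and the coefficient values are made up for illustration; the slides do not give the actual post-processing code or real network outputs.

```python
def lane_points(coeffs, xs):
    """Evaluate y = a*x^2 + b*x + c at each x and return (x, y) pairs."""
    a, b, c = coeffs
    return [(x, a * x * x + b * x + c) for x in xs]

# Made-up coefficients standing in for one lane's CNN output.
left_lane = (0.5, -1.0, 2.0)
print(lane_points(left_lane, [0.0, 1.0, 2.0]))
# [(0.0, 2.0), (1.0, 1.5), (2.0, 2.0)]
```

Representing each lane by three numbers keeps the network output small and fixed in size; the points needed for drawing are recovered by evaluating the polynomial at whatever x positions the display requires.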

GPU Coder generates code for the whole application

https://tinyurl.com/ybaxnxjg

25

How Good is Generated Code Performance

Performance of image processing and computer vision

Performance of CNN inference (Alexnet) on Titan XP GPU

Performance of CNN inference (Alexnet) on Jetson (Tegra) TX2

26

GPU Coder for Image Processing and Computer Vision

Examples: distance transform, fog removal, SURF feature extraction, ray tracing, stereo disparity

Orders of magnitude speedup over CPU

27

Alexnet Inference on NVIDIA Titan XP

[Chart: frames per second vs. batch size for MATLAB GPU Coder (R2017b), MATLAB (R2017b), TensorFlow (1.2.0), Caffe2 (0.8.1), and mxNet (0.10); annotated speedups: 2x, 7x, 5x]

Testing platform
CPU: Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz
GPU: Pascal Titan Xp
cuDNN v5

28

Alexnet Inference on Jetson TX2: Frame-Rate Performance

[Chart: frames per second (axis 0 to 400) vs. batch size (1, 16, 32, 64, 128, 256) for MATLAB GPU Coder (R2017b), C++ Caffe (1.0.0-rc5), and TensorRT (2.1); annotations: 2x, 0.85x]

30

Alexnet Inference on Jetson TX2: Memory Performance

[Chart: peak memory (MB) vs. batch size for MATLAB GPU Coder (R2017b), C++ Caffe (1.0.0-rc5), and TensorRT 2.1 (using giexec wrapper)]

31

Design Your DNNs in MATLAB, Deploy with GPU Coder

Design Deep Learning & Vision Algorithms
• Manage large image sets
• Automate image labeling
• Easy access to models
• Pre-built training frameworks

Accelerate and Scale Training
• Acceleration with GPUs
• Scale to clusters

High Performance Deployment
• Automate compilation with GPU Coder
• On Titan XP: 7x faster than TensorFlow, 5x faster than pyCaffe2
• On Jetson TX2: on par with TensorRT, 2x faster than C++-Caffe

32

Check Out Deep Learning in MATLAB and GPU Coder

GPU Coder

Deep learning in MATLAB

systematics.co.il\mwevents
