
Posted on 20-May-2020


11/25/2019

Deep Learning Seminar on Xilinx SoCs

From Training to Deployment

Agenda

01 Xilinx Deep Learning Solutions

02 Keras / TensorFlow ResNet50 Training

Building a “Fruit Recognizer”

03 Integration of the Deep Learning Processing

Unit in Vivado

04 Xilinx DNNDK: From a TensorFlow net

to the DPU Firmware

05 Programming Model: The DPU API

06 Question and Answer


Xilinx Deep Learning Solutions

Xilinx Focuses on Inference

Xilinx AI Inference Solution

Deep Learning Applications: Cloud, On Premises, Edge

Cloud: Virtex UltraScale+ VU9P, the most powerful FPGA in the cloud

Edge: Zynq UltraScale+ MPSoC

Many Applications for Machine Learning

− Robotics
− IIoT Gateways & Edge Appliances
− Drives & Motor Control
− PLC/PAC/IPC
− I/O Modules & Smart Sensors
− Human Machine Interface
− Video Surveillance & Smart City
− Machine & Computer Vision
− Smart Grid
− 3D Printing & Additive Manufacturing

Xilinx AI Solution from Edge to Cloud

                 Edge                       Cloud
AI Platforms:    ZCU102, ZCU104, Ultra96    Xilinx U200, U250, U280
FPGA IP:         DPU                        xDNN
Software Stack:  DNNDK Runtime              xfDNN Runtime
                 DNNDK Compiler             xfDNN Compiler
                 DNNDK Quantizer            xfDNN Quantizer
                 DNNDK Pruning
Models:          20+ models, including LSTM

Xilinx Solution Stack for Edge AI

− Models: Public / Custom
− Framework
− Tools & IP
− Edge AI Platforms: ZCU102, ZCU104, Ultra96 / Custom

Why Xilinx for Edge AI?

− Xilinx offers the optimal tradeoff for Edge AI
  − Latency
  − Power
  − Cost
  − Flexibility
  − Scalability
  − Time-to-market

− Xilinx pruning technology
  − Up to 50x optimization
  − Increased performance
  − Reduced power

Xilinx Edge AI – Value Proposition

Whole Application Acceleration


Keras / TensorFlow Training

Building a “Fruit Recognizer”

ResNet50


ResNet was the winner of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2015.

Steps to your Xilinx ML application


1 Prepare your data

2 Train your network

3 Test your trained network on test data

4 Freeze your TensorFlow model

5 Create a Vivado project with the DPU and export the DPU configuration

6 Quantize the model

7 Compile the model with the DPU configuration

8 Link the compiled model against your C/C++/Python application

9 Deploy it on the PetaLinux system

Decide what you want to classify


The ImageNet Database:

A free database with ~20,000 classes

and at least 500 pictures per class.

Current size: ~166 GB

Docker: run different DNNDK/TensorFlow versions independently

of the host OS/Python/CUDA installation


Copy over what you need for your training/validation/test folders using Python
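A minimal sketch of what such a copy step could look like. The class name, split ratios, and folder layout are illustrative assumptions, not the seminar's actual script:

```python
import random
import shutil
from pathlib import Path

def split_class_folder(src_dir, dst_root, ratios=(0.7, 0.2, 0.1), seed=42):
    """Copy the images of one class into train/val/test subfolders.

    src_dir: folder holding the images of a single class (e.g. 'apple').
    dst_root: destination root; train/<class>, val/<class>, test/<class>
    are created below it. Ratios, seed, and layout are illustrative.
    """
    src = Path(src_dir)
    files = sorted(p for p in src.iterdir() if p.is_file())
    random.Random(seed).shuffle(files)  # reproducible shuffle
    n_train = int(len(files) * ratios[0])
    n_val = int(len(files) * ratios[1])
    splits = {
        "train": files[:n_train],
        "val": files[n_train:n_train + n_val],
        "test": files[n_train + n_val:],
    }
    for split, items in splits.items():
        out = Path(dst_root) / split / src.name
        out.mkdir(parents=True, exist_ok=True)
        for f in items:
            shutil.copy2(f, out / f.name)
    return {k: len(v) for k, v in splits.items()}
```

Calling `split_class_folder("fruits/apple", "dataset")` once per class folder would build the three-way split the deck refers to.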


Train your net using Keras
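The slide's training code is not reproduced in this transcript; a minimal transfer-learning sketch with tf.keras ResNet50 could look as follows. The class count, input size, and compile settings are assumptions, and `weights=None` keeps the sketch self-contained (in practice you would start from `weights='imagenet'`):

```python
import tensorflow as tf

NUM_CLASSES = 5  # illustrative; the seminar's fruit classes are not listed

def build_fruit_recognizer(num_classes=NUM_CLASSES):
    """ResNet50 backbone with a fresh classification head."""
    base = tf.keras.applications.ResNet50(
        weights=None, include_top=False, input_shape=(224, 224, 3))
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    out = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    model = tf.keras.Model(base.input, out)
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training would then be e.g.:
# model.fit(train_generator, validation_data=val_generator, epochs=10)
```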


Test your TF model with images NOT part of the training and validation folders
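The metric behind this slide is accuracy on held-out images. A small sketch of top-1 accuracy over model outputs, with synthetic arrays standing in for real `model.predict()` results:

```python
import numpy as np

def top1_accuracy(probs, labels):
    """Fraction of samples whose highest-probability class matches the label.

    probs: (N, num_classes) softmax outputs; labels: (N,) integer classes.
    """
    return float(np.mean(np.argmax(probs, axis=1) == labels))

# Illustrative stand-in for predictions on a held-out test folder:
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.6, 0.3, 0.1]])
labels = np.array([0, 1, 2, 1])
print(top1_accuracy(probs, labels))  # 3 of 4 correct -> 0.75
```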


Integration of the Deep Learning

Processing Unit in Vivado

DPU IP with High Efficiency

[Block diagram: the DPU attaches via a bus to the CPU, memory controller, and external memory. Internally it contains an instruction fetcher, decoder, and register map; a data mover with read/write schedulers for images, weights, and write-back; a smart memory fabric; a dispatcher feeding a PE array; and misc. calculation units (average, max, and ROI pooling, elementwise, ...).]

DPU – Supported Operations

− Operations supported by the DPU core(s)

− Operations supported by additional cores


• Conv

  • Dilation

• Pooling

  • Max

  • Average

• ReLU / Leaky ReLU / ReLU6

• Fully Connected (FC)

• Batch Normalization

• Concat

• Elementwise

• Deconv

• Depthwise conv

• Mean scale

• Upsampling

• Split

• Reorg

• Softmax

DPU – Interfaces and Parallelism

DPU B4096: slave-axi and master-axi-0 (32 bits each),
master-axi-1 and master-axi-2 (128 bits each)

DPU B1152: slave-axi and master-axi-0 (32 bits each),
master-axi-1 and master-axi-2 (64 bits each)

− 3-level parallelism is exploited

  − Pixel * input channel * output channel

− Small core - B1152

  − Parallelism: 4*12*12

  − Target: Z7020/ZU2/ZU3

− Big core - B4096

  − Parallelism: 8*16*16

  − Target: ZU5 and above
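The core names appear to encode peak operations per cycle: each multiply-accumulate counts as two operations, so parallelism pixel * input-channel * output-channel * 2 reproduces the B-number. A quick check:

```python
def peak_ops_per_cycle(pixel_par, in_ch_par, out_ch_par):
    """Ops/cycle = MACs/cycle * 2 (a multiply-accumulate counts as 2 ops)."""
    return pixel_par * in_ch_par * out_ch_par * 2

print(peak_ops_per_cycle(4, 12, 12))  # B1152 -> 1152
print(peak_ops_per_cycle(8, 16, 16))  # B4096 -> 4096
```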

Xilinx Vivado Integration


Xilinx DNNDK: From a TensorFlow

net to the DPU Firmware

DPU Runtime Engine

− Distributed with the DPU TRD

− DPU IP Product Guide (PG338)

− yocto recipes

− Can be used in Petalinux project

− Runtime N2Cube (Cube of Neural Network)

− DPU linux kernel driver

− DPU run-time libraries

− DPU utilities


DPU – Linux Kernel Driver

− Distributed as

− DPU yocto recipe

− Files

− dpudef.h

− dpucore.c/.h

− dpuext.c/.h

− Source Code

− today, distributed with DPU TRD

− will be pushed to github repository when mature


DPU – run-time libraries

− Distributed as

− DNNDK yocto recipe

− DPU run-time library (libn2cube.so)

− DPU Loader

− DPU Task Scheduling

− DPU Task monitoring

− DPU Task profiling

− DPU utility library (libdputils.so)

− Utility functions to load images into DPU


Flow

[Flow diagram, roughly: (1) Vivado: integrate the DPU, export the hardware description (.hdf). (2) DNNDK: decent quantizes the Caffe/TensorFlow model; dnnc compiles it into dpu_{model}.elf. (3) PetaLinux: from the .hdf, build BOOT.BIN, image.ub, the device tree, dpu.ko, the n2cube and dputils libraries, and the sysroot. (4) XSDK: compile main.cc and link it against dpu_{model}.elf and the sysroot into application.elf. (5) Copy BOOT.BIN, image.ub, and application.elf to the SD card.]

DNNDK – Deep Neural Network Development Kit

− Distributed with

− DNNDK User Guide (UG1327)

− Composed of:

− Model Compression (DECENT)

− Pruning

− Quantization

− Model Compilation (DNNC)

− Compiler

− Assembler


DECENT – DEEp CompressioN Tool

− Pruning

− available separately (Xilinx AI Optimizer)

− requires license

− Quantization

− available in free version of tools


Quantization – Flow

− Preprocess

− folds batchnorm layers

− removes useless nodes

− Quantize

− weights / biases

− activations

− Calibrate

− using calibration dataset

− without labels

− Generate

− deployable DPU model
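The "folds batchnorm layers" step merges each batch-norm into the preceding convolution's weights and bias. A numpy sketch of the standard folding formulas (not DECENT's actual code), verified here with a 1x1 "conv" expressed as a matmul:

```python
import numpy as np

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma*(conv(x)-mean)/sqrt(var+eps) + beta into the conv.

    w: conv weights (..., out_channels); b, gamma, beta, mean, var: (out_channels,).
    Returns (w_folded, b_folded) so conv(x, w_folded) + b_folded == bn(conv(x, w) + b).
    """
    scale = gamma / np.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta

# Check on random data:
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
w = rng.normal(size=(3, 2)); b = rng.normal(size=2)
gamma = rng.normal(size=2); beta = rng.normal(size=2)
mean = rng.normal(size=2); var = rng.uniform(0.5, 2.0, size=2)

bn_out = gamma * ((x @ w + b) - mean) / np.sqrt(var + 1e-5) + beta
wf, bf = fold_batchnorm(w, b, gamma, beta, mean, var)
assert np.allclose(x @ wf + bf, bn_out)
```

After folding, the batch-norm disappears from the graph and only the adjusted conv remains, which is why the quantizer's preprocess step runs before quantization.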


DECENT_Q flow: floating-point model → preprocess → quantize & calibrate
(using the calibration dataset, without labels) → generate DPU model →
fixed-point / deploy model

Quantization – Results

Networks              Float32 baseline     8-bit Quantization
                      Top1     Top5        Top1     ΔTop1    Top5     ΔTop5
Inception_v1          66.90%   87.68%      66.62%   -0.28%   87.58%   -0.10%
Inception_v2          72.78%   91.04%      72.40%   -0.38%   90.82%   -0.23%
Inception_v3          77.01%   93.29%      76.56%   -0.45%   93.00%   -0.29%
Inception_v4          79.74%   94.80%      79.42%   -0.32%   94.64%   -0.16%
ResNet-50             74.76%   92.09%      74.59%   -0.17%   91.95%   -0.14%
VGG16                 70.97%   89.85%      70.77%   -0.20%   89.76%   -0.09%
Inception-ResNet-v2   79.95%   95.13%      79.45%   -0.51%   94.97%   -0.16%

− Uniform Quantization

− 8-bit for both weights and activation

− A small set of images for calibration
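A sketch of what uniform symmetric 8-bit quantization with a power-of-two scale looks like (the fixed-point style these tools target; the exact scheme is DECENT's internals and is assumed here). The fraction-width search mimics the "non-overflow" method from the decent_q options:

```python
import numpy as np

def quantize_int8(x, frac_bits):
    """Symmetric fixed-point: q = clip(round(x * 2^frac_bits), -128, 127)."""
    return np.clip(np.round(x * (1 << frac_bits)), -128, 127).astype(np.int8)

def dequantize(q, frac_bits):
    return q.astype(np.float32) / (1 << frac_bits)

def choose_frac_bits(x):
    """Largest fraction width that avoids overflow ('non-overflow' style)."""
    max_abs = np.max(np.abs(x))
    fb = 0
    while max_abs * (1 << (fb + 1)) <= 127 and fb < 15:
        fb += 1
    return fb

w = np.array([0.5, -0.25, 0.8, -0.9], dtype=np.float32)
fb = choose_frac_bits(w)          # 7 fraction bits for |w| <= 0.9
q = quantize_int8(w, fb)
err = np.max(np.abs(dequantize(q, fb) - w))
assert err <= 0.5 / (1 << fb)     # error bounded by half an LSB
```

The half-LSB error bound is why the table above shows accuracy drops well under one percent.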


Quantization – Usage (TensorFlow)

− Input

− floating-point frozen graph (frozen_graph.pb)

− calibration dataset

− python pre-processing function

− Output

− quantized model for deployment (deploy_model.pb)

− quantized model for evaluation (quantize_eval_model.pb)

− Syntax

decent_q quantize \
  --input_frozen_graph frozen_graph.pb \
  --input_nodes {input node} \
  --input_shapes ?,28,28,1 \
  --output_nodes {output node} \
  --input_fn {python script} \
  --method {0=non-overflow, 1=min-diffs} \
  --gpu 0 \
  --calib_iter 200
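The `--input_fn` option names a Python function that decent_q calls once per calibration iteration to fetch a batch, returning a dict that maps input-node names to numpy arrays (per UG1327). A sketch, with random data standing in for a real image loader; the node name "input", the batch size, and the shape (matching the `--input_shapes ?,28,28,1` above) are assumptions:

```python
import numpy as np

CALIB_BATCH = 32  # illustrative batch size

def calib_input(iter):
    """Called by decent_q once per --calib_iter iteration.

    Returns {input_node_name: batch}; random data stands in for a
    loader that reads and pre-processes real calibration images.
    """
    batch = np.random.rand(CALIB_BATCH, 28, 28, 1).astype(np.float32)
    return {"input": batch}
```

Because calibration only measures activation ranges, the images need no labels, which is why an unlabeled calibration set suffices.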


DNNC – Deep Neural Network Compiler

− Compiler

− Specific to version of DPU core

− dnnc-dpu1.4.0 => DPU core with Low RAM usage

− dnnc-dpu1.4.0.1 => DPU core with High RAM usage

− Convert quantized network to micro-code for DPU

− Optimization via Fusion of layers

− Assembler

− not invoked directly, called by the Compiler


DNNC – Usage (TensorFlow)

− Input

− Quantized model (deploy_model.pb)

− Output

− ELF file(s) for DPU kernel(s)

− Syntax

dnnc --parser=tensorflow \
  --frozen_pb={path to quantized model} \
  --dcf={DPU configuration file} \
  --cpu_arch=arm32 \
  --mode=normal \
  --net_name={network name} \
  --output_dir={path to output directory}

Take Away – Compiling a network with DNNDK

− Pruning / AI Optimizer

− Available separately (license required)

− Significant compression with minimal loss in accuracy

− Increased FPS / Reduced power

− Quantization

− Available for free

− Requires a small calibration dataset (without labels)

− Quantization to 8 bits with minimal loss in accuracy


References / What Next?

The following tutorials provide additional examples based on TensorFlow

− https://github.com/Xilinx/Edge-AI-Platform-Tutorials

− CIFAR10 Classification with TensorFlow (UG1338)

− Freezing a Keras model for use with DNNDK (UG1380)

Deep Learning with custom GoogleNet and ResNet in Keras and Xilinx DNNDK TF 3.0 (UG1381)


Programming Model: The DPU API

Programming with DNNDK API, Makefile


Programming with DNNDK API, DPU Setup


Programming with DNNDK API, DPU Task

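The setup and task slides follow the N2Cube flow: open the DPU, load the compiled kernel, create a task, run it, fetch the output tensor, then tear everything down. A hedged sketch using the DNNDK Python bindings documented in UG1327; the kernel name (from dnnc's --net_name) and the output-node name are assumptions, and the DPU calls only work on the target board with the driver loaded, so they are kept inside a function. The softmax/top-k post-processing is plain numpy:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logits vector."""
    e = np.exp(logits - np.max(logits))
    return e / np.sum(e)

def top_k(probs, k=5):
    """Indices of the k highest probabilities, best first."""
    return [int(i) for i in np.argsort(probs)[::-1][:k]]

def run_on_dpu(image, kernel_name="resnet50", node="fc1000"):
    """DPU inference flow per UG1327; runs only on the target board.

    kernel_name and node are illustrative; they depend on how the
    model was compiled with dnnc.
    """
    from dnndk import n2cube  # available on the PetaLinux target only
    n2cube.dpuOpen()                            # open the DPU device
    kernel = n2cube.dpuLoadKernel(kernel_name)  # load dpu_{model}.elf
    task = n2cube.dpuCreateTask(kernel, 0)      # 0 = normal mode
    # ... set the input tensor from `image` via the dputils helpers ...
    n2cube.dpuRunTask(task)                     # run inference on the DPU
    size = n2cube.dpuGetOutputTensorSize(task, node)
    logits = n2cube.dpuGetOutputTensorInHWCFP32(task, node, size)
    n2cube.dpuDestroyTask(task)
    n2cube.dpuDestroyKernel(kernel)
    n2cube.dpuClose()
    return softmax(np.array(logits))

print(top_k(softmax(np.array([1.0, 3.0, 2.0])), k=2))  # [1, 2]
```

The same open/load/create/run/destroy sequence exists in the C API (dpuOpen, dpuLoadKernel, dpuCreateTask, dpuRunTask, ...), which is what the Makefile slide links against.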


Questions?

Contact


Software and Services team contact:

sas@avnet.eu


Thank you!

