TRANSCRIPT
11/25/2019
Deep Learning Seminar on Xilinx SoCs
From Training to Deployment
Agenda
01 Xilinx Deep Learning Solutions
02 Keras / TensorFlow ResNet50 Training
Building a “Fruit Recognizer”
03 Integration of the Deep Learning Processing
Unit in Vivado
04 Xilinx DNNDK: From a TensorFlow net
to the DPU Firmware
05 Programming Model: The DPU API
06 Question and Answer
Xilinx Deep Learning
Solutions
Xilinx Focuses on Inference
Xilinx AI Inference Solution
Deep Learning Applications: Cloud, On Premises, Edge
− Cloud: Virtex UltraScale+ VU9P (featuring the most powerful FPGA in the cloud)
− Edge: Zynq UltraScale+ MPSoC
Many Applications for Machine Learning
Robotics
IIoT Gateways & Edge Appliances
Drives & Motor Control
PLC/PAC/IPC
I/O Modules & Smart Sensors
Human Machine Interface
Video Surveillance & Smart City
Machine & Computer Vision
Smart Grid
3D Printing & Additive Manufacturing
Xilinx AI Solution from Edge to Cloud
− AI Platforms
  − Edge: ZCU102, ZCU104, Ultra96
  − Cloud: Xilinx U200, U250, U280
− FPGA IP
  − Edge: DPU
  − Cloud: xDNN
− Software Stack
  − Edge: DNNDK Runtime, Compiler, Quantizer, Pruning
  − Cloud: xfDNN Runtime, Compiler, Quantizer
− Models: 20+ models, including LSTM
Xilinx Solution Stack for Edge AI
− Models: Public, Custom
− Framework
− Tools & IP
− Edge AI Platforms: ZCU102, ZCU104, Ultra96, Custom
Why Xilinx for Edge AI?
− Xilinx offers the optimal tradeoff for Edge AI
  − Latency
  − Power
  − Cost
  − Flexibility
  − Scalability
  − Time-to-market
− Xilinx pruning technology
  − Up to 50x optimization
  − Increased performance
  − Reduced power
Xilinx Edge AI – Value Proposition
Whole Application Acceleration
Keras / TensorFlow Training
Building a “Fruit Recognizer”
ResNet50
ResNet was the Winner of the ImageNet
Large Scale Visual Recognition Challenge
(ILSVRC) in 2015
Steps to your Xilinx ML application
1 Prepare your data
2 Train your network
3 Test your trained network on test data
4 Freeze your TensorFlow model
5 Create a Vivado project with the DPU and export the DPU configuration
6 Quantize the model
7 Compile the model with the DPU configuration
8 Link the compiled model against your C/C++/Python application
9 Deploy it on the PetaLinux system
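The freezing step can be sketched as follows, using the TF 1.x-style API that the 2019 toolchain targets (via tf.compat.v1 so it also runs under TF 2.x). The tiny stand-in network and the node names are placeholders for your trained model, not the seminar's code:

```python
# Sketch of the freezing step: turn checkpointed variables into constants
# so decent_q can consume a single frozen_graph.pb. The one-layer network
# and the "input_1"/"logits" names are illustrative placeholders.
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()  # TF 1.x graph-mode semantics, as used in 2019

# stand-in for a trained model: input placeholder -> one dense layer
images = tf.placeholder(tf.float32, [None, 4], name="input_1")
weights = tf.get_variable("w", [4, 2])
logits = tf.identity(tf.matmul(images, weights), name="logits")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # replace every variable with a constant holding its trained value
    frozen = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, output_node_names=["logits"])

with tf.gfile.GFile("frozen_graph.pb", "wb") as f:
    f.write(frozen.SerializeToString())
```

The resulting frozen_graph.pb is what decent_q expects as --input_frozen_graph later in the flow.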
Decide what you want to classify
The ImageNet Database:
Free database with ~ 20000 classes
with at least 500 pictures per class.
Current size: 166 GByte
Docker: running different DNNDK/TensorFlow versions independently of the
Host OS/Python/CUDA installation
Copy over what you need for your training/validation/test folders
using Python
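The copy step can be sketched with nothing but the Python standard library; the 80/10/10 split, the folder names, and the one-subfolder-per-class layout are choices of this example, not prescribed by the seminar:

```python
# Sketch: split a per-class image dump (e.g. classes pulled from ImageNet)
# into train/validation/test folders. Assumes src_root contains one
# subfolder per class; split ratios are an assumed 80/10/10.
import os
import random
import shutil

def split_dataset(src_root, dst_root, splits=(0.8, 0.1, 0.1), seed=42):
    rng = random.Random(seed)  # fixed seed -> reproducible split
    for cls in sorted(os.listdir(src_root)):
        files = sorted(os.listdir(os.path.join(src_root, cls)))
        rng.shuffle(files)
        n_train = int(len(files) * splits[0])
        n_val = int(len(files) * splits[1])
        buckets = {"train": files[:n_train],
                   "validation": files[n_train:n_train + n_val],
                   "test": files[n_train + n_val:]}
        for name, members in buckets.items():
            dst = os.path.join(dst_root, name, cls)
            os.makedirs(dst, exist_ok=True)
            for f in members:
                shutil.copy(os.path.join(src_root, cls, f),
                            os.path.join(dst, f))
```

The resulting train/validation folders feed the Keras training, and the test folder holds images the network never sees during training.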
Train your net using Keras
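A minimal sketch of such a training setup, assuming a frozen ResNet50 backbone with a new softmax head for the fruit classes. The class count and image size are illustrative, and weights=None here merely avoids the ImageNet weight download; real transfer learning would pass weights="imagenet":

```python
# Sketch of a "fruit recognizer" built on ResNet50 with Keras/TensorFlow.
# NUM_CLASSES, the input size and the frozen-backbone choice are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

NUM_CLASSES = 3  # e.g. apple / banana / orange

base = ResNet50(weights=None, include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the backbone for transfer learning

# new classification head on top of the convolutional features
head = layers.GlobalAveragePooling2D()(base.output)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(head)
model = models.Model(inputs=base.input, outputs=outputs)

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(...) with the train/validation folders prepared earlier
```

In the 2019-era toolchain the train/validation folders would typically be fed via ImageDataGenerator.flow_from_directory, with the validation set used to monitor overfitting during fit().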
Test your TF model with images NOT part of the training and
validation folders
Integration of the Deep Learning
Processing Unit in Vivado
DPU IP with High Efficiency
[Block diagram] The CPU and memory controller connect to the DPU over the system bus and external memory. Inside the DPU: an instruction fetcher, decoder, register map and control signals; a data mover with image and weights read/write schedulers plus a writeback scheduler around a smart memory fabric; a PE array with dispatcher; and miscellaneous calculation units (average pool, max pool, ROI pool, elementwise, ...).
DPU – Supported Operations
− Operations supported by the DPU core(s)
− Operations supported by additional cores
• Conv
• Dilation
• Pooling
• Max
• Average
• ReLU / Leaky ReLU / ReLU6
• Fully Connected (FC)
• Batch Normalization
• Concat
• Elementwise
• Deconv
• Depthwise conv
• Mean scale
• Upsampling
• Split
• Reorg
• Softmax
DPU – Interfaces and Parallelism
[Diagram] Each DPU exposes one slave AXI port and three master AXI ports (master-axi-0/1/2); the ports are 32 and 128 bits wide on the B4096, and 32 and 64 bits wide on the B1152.
− 3-level parallelism is exploited
  − Pixel * input channel * output channel
− Small core – B1152
  − Parallelism: 4*12*12
  − Target: Z7020 / ZU2 / ZU3
− Big core – B4096
  − Parallelism: 8*16*16
  − Target: ZU5 and above
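The core names follow directly from this parallelism, assuming the usual convention that one multiply-accumulate counts as two operations:

```python
# The DPU's "B" number encodes its peak operations per clock cycle:
# parallelism is pixel x input-channel x output-channel MACs, and each
# MAC counts as two operations (one multiply, one add).
def dpu_peak_ops_per_cycle(pixel_par, in_ch_par, out_ch_par):
    macs = pixel_par * in_ch_par * out_ch_par
    return 2 * macs

print(dpu_peak_ops_per_cycle(4, 12, 12))  # -> 1152, the B1152 small core
print(dpu_peak_ops_per_cycle(8, 16, 16))  # -> 4096, the B4096 big core
```

Multiplying by the DPU clock frequency of a given build then gives the peak GOPS of that configuration.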
Xilinx Vivado Integration
Xilinx DNNDK: From a TensorFlow
net to the DPU Firmware
DPU Runtime Engine
− Distributed with the DPU TRD
− DPU IP Product Guide (PG338)
− Yocto recipes
− Can be used in a PetaLinux project
− Runtime N2Cube (Cube of Neural Networks)
− DPU Linux kernel driver
− DPU run-time libraries
− DPU utilities
DPU – Linux Kernel Driver
− Distributed as
− DPU yocto recipe
− Files
− dpudef.h
− dpucore.c/.h
− dpuext.c/.h
− Source Code
− today: distributed with the DPU TRD
− will be pushed to a GitHub repository when mature
DPU – run-time libraries
− Distributed as
− DNNDK yocto recipe
− DPU run-time library (libn2cube.so)
− DPU Loader
− DPU Task Scheduling
− DPU Task monitoring
− DPU Task profiling
− DPU utility library (libdputils.so)
− Utility functions to load images into DPU
Flow
[Toolflow diagram]
1 Vivado: integrate the DPU, generate the hardware description (.hdf)
2 PetaLinux: build BOOT.BIN and image.ub with the DPU device tree, the dpu.ko driver, the n2cube and dputils libraries, and a sysroot
3 DNNDK: quantize the Caffe/TensorFlow model with decent and compile it with dnnc into dpu_{model}.elf
4 XSDK: compile main.cc and link it with dpu_{model}.elf against the sysroot into application.elf
5 Copy everything onto the SD card
DNNDK – Deep Neural Network Development Kit
− Distributed with
− DNNDK User Guide (UG1327)
− Composed of:
− Model Compression (DECENT)
− Pruning
− Quantization
− Model Compilation (DNNC)
− Compiler
− Assembler
DECENT – DEEp CompressioN Tool
− Pruning
− available separately (Xilinx AI Optimizer)
− requires license
− Quantization
− available in free version of tools
Quantization – Flow
− Preprocess
− folds batchnorm layers
− removes useless nodes
− Quantize
− weights / biases
− activations
− Calibrate
− using calibration dataset
− without labels
− Generate
− deployable DPU model
[Diagram] DECENT_Q: floating-point model -> preprocess -> quantize & calibrate (using the calibration dataset, without labels) -> generate DPU model -> fixed-point / deploy model
Quantization – Results
Networks Float32 baseline 8-bit Quantization
Top1 Top5 Top1 ΔTop1 Top5 ΔTop5
Inception_v1 66.90% 87.68% 66.62% -0.28% 87.58% -0.10%
Inception_v2 72.78% 91.04% 72.40% -0.38% 90.82% -0.23%
Inception_v3 77.01% 93.29% 76.56% -0.45% 93.00% -0.29%
Inception_v4 79.74% 94.80% 79.42% -0.32% 94.64% -0.16%
ResNet-50 74.76% 92.09% 74.59% -0.17% 91.95% -0.14%
VGG16 70.97% 89.85% 70.77% -0.20% 89.76% -0.09%
Inception-ResNet-v2 79.95% 95.13% 79.45% -0.51% 94.97% -0.16%
− Uniform Quantization
− 8-bit for both weights and activation
− A small set of images for calibration
Quantization – Usage (TensorFlow)
− Input
− floating-point frozen graph (frozen_graph.pb)
− calibration dataset
− Python pre-processing function
− Output
− quantized model for deployment (deploy_model.pb)
− quantized model for evaluation (quantize_eval_model.pb)
− Syntax
decent_q quantize
--input_frozen_graph frozen_graph.pb
--input_nodes {input node}
--input_shapes ?,28,28,1
--output_nodes {output node}
--input_fn {python script}
--method {0=non-overflow, 1=min-diffs}
--gpu 0
--calib_iter 200
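The --input_fn argument names a Python function that, per the DNNDK User Guide (UG1327), is called once per calibration iteration with the iteration number and must return a dict mapping input-node names to numpy arrays. A minimal sketch, where the node name "input_1", the batch size, and the random stand-in data are assumptions; a real input_fn would load and pre-process actual calibration images:

```python
# Sketch of a decent_q --input_fn hook. Random data stands in for real
# calibration images; the 28x28x1 shape matches the --input_shapes
# example above, and "input_1" is a placeholder node name.
import numpy as np

CALIB_BATCH = 32  # images fed per calibration iteration (assumed)

def calib_input(iteration):
    # real flow: read images [iteration*CALIB_BATCH : (iteration+1)*CALIB_BATCH],
    # resize to the network input size and normalize as during training
    images = np.random.rand(CALIB_BATCH, 28, 28, 1).astype(np.float32)
    return {"input_1": images}
```

Because calibration only observes activation ranges, no labels are needed, which is why an unlabeled calibration dataset suffices.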
DNNC – Deep Neural Network Compiler
− Compiler
− Specific to version of DPU core
− dnnc-dpu1.4.0 => DPU core with Low RAM usage
− dnnc-dpu1.4.0.1 => DPU core with High RAM usage
− Convert quantized network to micro-code for DPU
− Optimization via Fusion of layers
− Assembler
− not invoked directly, called by the Compiler
DNNC – Usage (TensorFlow)
− Input
− Quantized model (deploy_model.pb)
− Output
− ELF file(s) for DPU kernel(s)
− Syntax
dnnc --parser=tensorflow
--frozen_pb={path to quantized model}
--dcf={DPU configuration file}
--cpu_arch=arm32
--mode=normal
--net_name={network name}
--output_dir={path to output directory}
Take Away – Compiling a network with DNNDK
− Pruning / AI Optimizer
− Available separately (license required)
− Significant compression with minimal loss in accuracy
− Increased FPS / Reduced power
− Quantization
− Available for free
− Requires a small calibration dataset (without labels)
− Quantization to 8 bits with minimal loss in accuracy
References / What Next ?
The following tutorials provide additional examples based on TensorFlow
− https://github.com/Xilinx/Edge-AI-Platform-Tutorials
− CIFAR10 Classification with TensorFlow (UG1338)
− Freezing a Keras model for use with DNNDK (UG1380)
− Deep Learning with custom GoogleNet and ResNet in Keras and Xilinx DNNDK TF 3.0
(UG1381)
Programming Model: The DPU API
Programming with DNNDK API: Makefile
Programming with DNNDK API, DPU Setup
Programming with DNNDK API, DPU Task
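These slides step through the application sources; as a hedged sketch (not the seminar's exact code), the C flow documented in UG1327 looks roughly like the following. The kernel name "resnet50" and the commented node names are per-network outputs of dnnc shown as placeholders, and the program only builds and runs on a DPU-enabled PetaLinux image, linked against libn2cube/libdputils and the dpu_{model}.elf from dnnc:

```c
/* Hedged sketch of the DNNDK C API flow (UG1327). */
#include <dnndk/dnndk.h>

int main(void) {
    dpuOpen();                                      /* attach to the DPU driver */
    DPUKernel *kernel = dpuLoadKernel("resnet50");  /* name given to dnnc       */
    DPUTask *task = dpuCreateTask(kernel, 0);

    /* feed pre-processed input and fetch the result around the run, e.g.:
     *   dpuSetInputTensorInHWCInt8(task, input_node, data, size);
     *   ... dpuRunTask ...
     *   dpuGetOutputTensorInHWCFP32(task, output_node, result, size); */
    dpuRunTask(task);

    dpuDestroyTask(task);
    dpuDestroyKernel(kernel);
    dpuClose();
    return 0;
}
```

The Makefile shown on the slides essentially cross-compiles this against the PetaLinux sysroot and links in the dnnc-generated ELF for the network.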
Questions?

Thank you!