
Introduction to machine learning on FPGAs

Arthur Ruder ¦ Enclustra GmbH ¦ AI seminar EPFL Lausanne & ZHAW Winterthur ¦ 19 & 21/11/2019

Quick reminder: neural network

[Figure: neural network with an input layer (e.g. pixels), hidden layer 1, hidden layer 2, and an output layer (e.g. probability). Detail of a single neuron: inputs x1, x2, x3 with weights w1, w2, w3 and output a.]
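As a minimal sketch of the single neuron in the figure, the snippet below forms the weighted sum of the inputs x1..x3 with the weights w1..w3 and passes it through an activation function to obtain a. The sigmoid activation, the bias term, and the example numbers are assumptions for illustration; the slide does not specify them.

```python
import math

def neuron(x, w, bias=0.0):
    """Return the output a of one neuron (sigmoid is an assumed activation)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + bias  # weighted sum of the inputs
    return 1.0 / (1.0 + math.exp(-z))                # squash the sum into (0, 1)

# Hypothetical inputs (e.g. normalized pixel values) and weights
x = [0.5, 0.1, 0.9]
w = [0.4, -0.2, 0.3]
a = neuron(x, w)
print(f"a = {a:.3f}")
```

A full layer repeats this computation for every neuron, which is why the multiply-accumulate (MAC) operation dominates the workload.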

Machine learning concepts: training phase

• Goal: obtain trained weights
• Inputs: training set
• Forward-propagation through the untrained network
• Outputs: classification probability, e.g. 40 % dog, 60 % cat
• But: label says 100 % dog
• Back-propagation adjusts the weights to reduce this error
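The training loop above can be sketched with a single-layer classifier standing in for the deep network. The snippet runs one forward-propagation, compares the result with the label "100 % dog", back-propagates the cross-entropy error, and updates the weights. The input vector, the weight initialization, and the learning rate are all assumed values chosen for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the maximum for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.random(4)                    # hypothetical input features ("pixels")
W = rng.normal(size=(2, 4)) * 0.1    # untrained weights, rows = [dog, cat]
y = np.array([1.0, 0.0])             # label: 100 % dog

p_before = softmax(W @ x)            # forward-propagation
grad_W = np.outer(p_before - y, x)   # back-propagation of the cross-entropy loss
W -= 0.5 * grad_W                    # gradient-descent step (assumed learning rate)
p_after = softmax(W @ x)             # forward-propagation with updated weights

print("P(dog) before:", round(float(p_before[0]), 3))
print("P(dog) after: ", round(float(p_after[0]), 3))
```

After the update the predicted probability for "dog" moves toward the label; real training repeats this step over the whole training set many times.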

Machine learning concepts: inference

• Inputs: e.g. photographs
• Forward-propagation through the trained network
• Outputs: classification probability, e.g. 99.07 % dog, 0.93 % cat

Quick reminder: Deep Learning

[Figure: classification error [%] (axis from 0 to 30) of the image recognition challenge winner per year, 2010 to 2015, with the human error level marked. The winning networks grow deeper over time: shallow, 8 layers (AlexNet), 19 layers (VGG), 22 layers (GoogLeNet), 152 layers (ResNet).]

Hardware platform


What hardware do we need for this?

CPUs, GPUs, FPGAs, ASICs?

• What are the requirements for…?
  a) training
  b) inference
• What type of hardware is best suited for each task?

Neural network training: computational complexity

For one picture (image classification):
• Labelled data passes through forward-propagation in the untrained neural network (ResNet50)
• Result: 50 % cat, 50 % dog; label: 100 % dog
• Back-propagation follows
• 7.7 billion operations
• ~35 MB parameter storage
• 23 billion operations
• * for forward propagation only, backward propagation similar

For the whole training process:
• ~380 MB for parameter storage
• ImageNet: 1.2 Million pictures
• 1 epoch: $1.2\,\mathrm{M} \times 30.7\,\mathrm{B} \approx 37 \times 10^{15}$ operations (majority MAC)
• ResNet50 needs 100 epochs for training…
• Result?
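The arithmetic behind the epoch estimate can be checked with a few lines. The 30.7 billion operations per picture used on the slide match the sum of the two per-picture figures (7.7 billion + 23 billion). The sustained throughput of 10 trillion operations per second at the end is a purely hypothetical figure, added only to give the total a sense of scale.

```python
images_per_epoch = 1.2e6         # ImageNet: 1.2 million pictures
ops_per_image = 7.7e9 + 23e9     # the two per-picture figures from the slides
epochs = 100                     # "ResNet50 needs 100 epochs for training"

ops_per_epoch = images_per_epoch * ops_per_image
ops_total = ops_per_epoch * epochs

print(f"per picture: {ops_per_image:.3g} operations")
print(f"per epoch:   {ops_per_epoch:.3g} operations")
print(f"{epochs} epochs:  {ops_total:.3g} operations")

# Hypothetical device sustaining 10e12 operations per second (assumed value)
sustained_ops_per_s = 10e12
hours = ops_total / sustained_ops_per_s / 3600
print(f"about {hours:.0f} hours at the assumed throughput")
```

The total lands in the range of 10^18 operations, which is why training is treated as a throughput problem rather than a latency problem.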


Requirements breakdown: training

• Typically not time-critical
• Compute billions of floating point calculations
• Handle large data sets (GBs to hundreds of GBs)
• Flexibility to train a wide variety of neural networks

Clear answer (for now): GPUs do the heavy lifting of neural network training


Requirements: inference

• Edge requirements
  • Low (deterministic) latency (e.g. real-time object detection)
  • Power efficiency (limited battery capacity)
  • Sensor fusion (e.g. industrial surveillance)
  • Robustness (e.g. temperature)
• Cloud requirements
  • Low latency (e.g. search engines)
  • Power efficiency (heat dissipation/cooling cost)

Resource requirements overview

[Figure: resource requirements of networks for image classification, object detection, semantic segmentation, OCR, and speech recognition.]

Main takeaway points:
• Inference is challenging
• Huge variation in compute and memory requirements (even within subgroups)
• Models typically don't fit into local memory


Inference Accelerator
Architectural challenges

[Block diagram: external memory, DMA, buffer, weight buffer, compute array, partial sums, activation functions, …; data flows from input to result.]

Challenges highlighted in the diagram:
• Huge amount of computations
• Memory bandwidth
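The roles of the blocks in the diagram can be illustrated with a small software model. The sketch below computes one fully connected layer tile by tile: a slice of the weights is fetched into a "weight buffer", the "compute array" multiplies it with the matching slice of the input, and the products are accumulated as partial sums before the activation function is applied. The tile size, the ReLU activation, and the function name `tiled_layer` are assumptions for illustration, not a description of any particular accelerator.

```python
import numpy as np

def tiled_layer(x, W, tile=4):
    """Compute relu(W @ x) one weight tile at a time, accumulating partial sums."""
    out_dim, in_dim = W.shape
    partial_sums = np.zeros(out_dim)
    for start in range(0, in_dim, tile):
        w_tile = W[:, start:start + tile]   # fetch one weight tile ("DMA" into the weight buffer)
        x_tile = x[start:start + tile]      # matching slice of the input buffer
        partial_sums += w_tile @ x_tile     # compute array: multiply-accumulate
    return np.maximum(partial_sums, 0.0)    # activation function (ReLU assumed)

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 10))   # hypothetical layer: 10 inputs, 8 outputs
x = rng.normal(size=10)
y = tiled_layer(x, W)
print(y)
```

In this sketch every weight is fetched once and used for a single multiply-accumulate, so the rate at which weights arrive from external memory can limit throughput as much as the number of compute units does.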

Qualitative hardware comparison

[Figure: hardware platforms placed on two axes, "Performance & Power Efficiency" and "Flexibility & Ease of Use".]


Requirements compared across GPU, FPGA, and ASIC:

• Low (deterministic) latency
• High throughput
• Power efficiency
• Sensor fusion
• Robustness
• Programmability
• Flexibility
• Ease-of-use
• (Development) cost

[Table: each requirement is rated per platform; the ratings are graphical.]

FPGA ML workflow

Challenge: efficient mapping of floating point model to FPGA implementation without losing accuracy

Workflow:
1. Trained network (FP32, floating point model)
2. Compression
   a) Pruning → pruned network
   b) Quantization
3. Compilation
4. FPGA implementation (fixed point)

Impact of compression

https://www.hotchips.org/hc30/0tutorials/T2_Part_2_Song_Hanv3.pdf

Compression allows using significantly fewer resources when deploying a neural network, with minimal impact on network accuracy.
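The two compression steps of the workflow, pruning and quantization, can be sketched on a random weight matrix. The snippet uses magnitude-based pruning and symmetric 8-bit quantization with one shared scale factor as a simple stand-in for a fixed-point format. Both techniques, the 50 % sparsity, the 8-bit width, and the function names are assumptions chosen for illustration; the slides do not say which methods a given toolchain applies.

```python
import numpy as np

def prune_by_magnitude(w, sparsity=0.5):
    """Zero out the smallest-magnitude weights (assumed pruning criterion)."""
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) >= threshold, w, 0.0)

def quantize_int8(w):
    """Symmetric linear quantization of FP32 weights to 8-bit integers."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(2)
w_fp32 = rng.normal(size=(64, 64)).astype(np.float32)  # hypothetical trained weights

w_pruned = prune_by_magnitude(w_fp32, sparsity=0.5)
q, scale = quantize_int8(w_pruned)
w_restored = q.astype(np.float32) * scale   # dequantize to inspect the error

print("fraction of zero weights:", float(np.mean(w_pruned == 0.0)))
print("max abs quantization error:", float(np.abs(w_restored - w_pruned).max()))
print("storage:", w_fp32.nbytes, "bytes ->", q.nbytes, "bytes")
```

The sketch only measures weight error and storage size. Whether accuracy is preserved has to be checked by evaluating the compressed network on validation data, typically followed by some retraining.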

Hardware implementation architectures

• Streaming architecture
  [Block diagram: host (CPU, memory) and an FPGA containing a chain of dedicated layer blocks: CONV, …, POOL, CONV, FC.]
• Single computation engine
  [Block diagram: host (CPU, memory) and an FPGA with a control unit, DMA, and a single engine with CONV/FC, NL, and POOL units; the network layers (CONV LAYER, ACTIVATION, POOL, CONV LAYER, ACTIVATION, FC) are processed on it.]

Properties compared for the streaming architecture and the single computation engine:
• Customizability
• Flexibility
• Power efficiency

[Table: each property is rated per architecture; the ratings are graphical.]

Toolchains for AI on FPGAs

Columns: Provider; Edge (computer vision, language processing); Cloud (computer vision, language processing)

• Xilinx: edge: DNNDK (Deep Neural Network Development Kit) for computer vision, - for language processing; cloud: ML (Machine Learning) Suite
• Intel: edge: -, -; cloud: OpenVINO
• Omnitek: DPU (Deep Learning Processing Unit) + software framework
• Lattice: edge: sensAI; cloud: -

Summary

• Neural network inference is viable on FPGAs
  • Low power (~mW – W)
  • Sensor integration
  • Flexibility
  • Low deterministic latency
• Edge examples
  • Xnor.ai: solar powered person detection
  • CERN: sensor data filtering and classification
• Cloud examples
  • Microsoft: Azure cloud AI
