
Page 1: Laboratory for Embedded and Programmable Systems

Laboratory for Embedded and Programmable Systems

Page 2: Laboratory for Embedded and Programmable Systems

Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks

Machine Vision: Past, Present and Future!

Feature Extraction Approaches

– Hand-crafted features such as HOG and SIFT

– Automated feature extraction using Convolutional Neural Networks [Krizhevsky 2012]

(image credit: dlib.net)

CNN-Based Feature Extraction

– Very effective across different vision tasks

– Very high computational complexity

Page 3

Precision – Depth Tradeoff

[Chart: top-5 error vs. number of convolutional layers (5–25) for AlexNet, VGG A, VGG B, VGG D, VGG E, and GoogleNet. Dataset: ImageNet 2012, top-5 error [Simonyan 2015].]

Higher depth yields higher precision, but also higher execution time and higher power consumption. Mobile devices therefore have to offload the computation to the cloud.

Page 4

Implementation Choices

An energy-efficient and fast implementation of DCNNs is very beneficial for mobile devices; it can be achieved through hardware-based acceleration of DCNNs.

GPU
• High power consumption
• Inflexibility (single and floating point only)

ASIC
• Initial cost
• Reusability
• Complexity of design

FPGA
• Complexity of design

Page 5

AlexNet

• Convolutional layers: over 90% of the computation time; they extract local and global features.
• Fully connected layers: act as the classifier.

[Krizhevsky 2012]

Page 6

2D Convolution

• The center element of the kernel is placed on each pixel of the Input Feature Map (IFM).

• Convolution result: 4×0 + 0×0 + 0×0 + 0×0 + 0×1 + 0×1 + 0×0 + 0×1 + (−2)×4 = −8

developer.apple.com
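The sliding-window computation above can be sketched in a few lines of NumPy. This is an illustrative model, not the accelerator's implementation, and like most CNN frameworks it computes cross-correlation (no kernel flip); the test values are our own, since the slide's IFM and kernel matrices are not fully recoverable from the transcript.

```python
import numpy as np

def conv2d(ifm, kernel):
    """Place the kernel's center on each IFM pixel (zero padding at the
    borders) and sum the element-wise products, as on the slide."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(ifm, ((ph, ph), (pw, pw)))
    ofm = np.zeros(ifm.shape)
    for r in range(ifm.shape[0]):
        for c in range(ifm.shape[1]):
            ofm[r, c] = np.sum(padded[r:r + kh, c:c + kw] * kernel)
    return ofm
```

The slide's worked example follows the same pattern: nine kernel-window products summed, here giving −8.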


Page 7

3D Convolution

• Each Output Feature Map (OFM) is the result of a 3D convolution of the Input Feature Map (IFM) with a kernel stack.

• Example: OFM_A = 3DConv(IFM, Kernel_A)

  ∀ i ∈ {0, 1, …, 255}: OFM_i = 3DConv(IFM, Kernel_i)
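One way to read the formula: each OFM accumulates a 2D convolution of every input channel with its slice of kernel stack i. A minimal NumPy sketch under assumed layouts (channels-first, valid convolution, unit stride — these conventions are ours, not stated on the slide):

```python
import numpy as np

def conv3d_layer(ifm, kernels):
    """ifm: (N, R, C) input feature maps; kernels: (M, N, K, K) kernel
    stacks. OFM_i = 3DConv(IFM, Kernel_i): each output map sums 2D
    convolutions of all input channels with kernel stack i."""
    m, n, k, _ = kernels.shape
    _, rows, cols = ifm.shape
    ofm = np.zeros((m, rows - k + 1, cols - k + 1))
    for i in range(m):                  # one OFM per kernel stack
        for ch in range(n):             # accumulate across input channels
            for r in range(ofm.shape[1]):
                for c in range(ofm.shape[2]):
                    ofm[i, r, c] += np.sum(
                        ifm[ch, r:r + k, c:c + k] * kernels[i, ch])
    return ofm
```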


Page 8

Parallelism Sources

• Inter-layer parallelism

• Inter-output parallelism
  – Compute different OFMs in parallel

• Inter-kernel parallelism
  – Compute one OFM in parallel

• Intra-kernel parallelism
  – Compute one convolution in parallel
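The four sources correspond to loops of the standard convolution nest. The sketch below is our annotation in plain Python, not the paper's code; the comments mark which loop each source would unroll in hardware:

```python
# Schematic loop nest for one convolutional layer (M OFMs, N IFMs,
# R x C outputs, K x K kernels). Each comment names the parallelism
# source obtained by unrolling that loop in hardware.
def conv_layer(ifm, kernels, M, N, R, C, K):
    ofm = [[[0.0] * C for _ in range(R)] for _ in range(M)]
    for m in range(M):          # inter-output: different OFMs in parallel
        for n in range(N):      # inter-kernel: one OFM's channels in parallel
            for r in range(R):
                for c in range(C):
                    acc = 0.0
                    for i in range(K):      # intra-kernel: one convolution's
                        for j in range(K):  # multiplications in parallel
                            acc += ifm[n][r + i][c + j] * kernels[m][n][i][j]
                    ofm[m][r][c] += acc
    return ofm
# Inter-layer parallelism would pipeline successive conv_layer calls.
```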


Page 9

Design Philosophy

• DCNNs can be either computation bound or communication bound

• Computation–communication balance is analyzed through:
  – a memory model
  – a computation model

[Diagram: improving computation and improving communication together balance computation and communication.]

Page 10

Tiling in Convolutional Layers

[Figure: N input feature maps (R × C each) convolved with M stacks of K × K kernels to produce M output feature maps; data streams in from the previous layer and out to the next layer.]

– IFM tile size: Tn
– OFM tile size: Tm
– Row tile size: Tr
– Column tile size: Tc
– Kernel tile sizes: Ti, Tj
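These tile sizes imply a tiled loop structure; the sketch below only counts tile "rounds", standing in for the HLS loop nest. Separate R and C tile loops are assumed here, which matches the slide's RC/(Tr·Tc) term when the sizes divide evenly:

```python
def tiled_rounds(M, N, R, C, K, Tm, Tn, Tr, Tc, Ti, Tj):
    """Count the tile rounds implied by the slide's tile sizes: each
    round loads Tn IFM tiles (Tr x Tc) and Tm*Tn kernel tiles (Ti x Tj),
    and updates partial sums for Tm OFM tiles."""
    rounds = 0
    for m0 in range(0, M, Tm):              # OFM tile
        for n0 in range(0, N, Tn):          # IFM tile
            for r0 in range(0, R, Tr):      # row tile
                for c0 in range(0, C, Tc):          # column tile
                    for i0 in range(0, K, Ti):      # kernel-row tile
                        for j0 in range(0, K, Tj):  # kernel-column tile
                            rounds += 1
    return rounds
```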


Page 11

The Architecture Template

• Intra-kernel parallelism
  – PCE: Parallel Convolution Engine
  – Here the number of parallel multiplications (Tk) is 4.

• Inter-kernel parallelism
  – Convolve different IFMs in parallel

• Inter-output parallelism
  – PCEs operating with different weights


Page 12

Computation Model

• Number of cycles in the tiled model:

  Cycles = #Rounds × #Operations per round

• #Rounds = M/Tm × N/Tn × RC/(Tr·Tc) × K/Ti × K/Tj

• #Ops per round = Tr·Tc × (Ti·Tj/Tk) + P


Page 13

Memory Model

• Computation-to-communication ratio:

  CTC = Total Computation / Total Communication
      = (2 × M × N × R × C × K × K) / (α_in·β_in + α_out·β_out + α_wght·β_wght)

• Weight buffer size: β_wght = Tm × Tn × Ti × Tj

• Number of loads and stores of weights:

  α_wght = M/Tm × N/Tn × R/Tr × C/Tc × K/Ti × K/Tj
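These terms can be sketched the same way (ceiling divisions assumed; the α_in/β_in and α_out/β_out terms for IFM and OFM traffic follow the same pattern and are passed in here rather than modeled):

```python
from math import ceil

def weight_buffer(Tm, Tn, Ti, Tj):
    """On-chip weight buffer size beta_wght (in words)."""
    return Tm * Tn * Ti * Tj

def weight_transfers(M, N, R, C, K, Tm, Tn, Tr, Tc, Ti, Tj):
    """Number of weight loads/stores alpha_wght over the whole layer."""
    return (ceil(M / Tm) * ceil(N / Tn) * ceil(R / Tr) * ceil(C / Tc)
            * ceil(K / Ti) * ceil(K / Tj))

def ctc_ratio(M, N, R, C, K, alpha_beta_pairs):
    """CTC = total operations (x2 for multiply and add per MAC) over
    total data moved; alpha_beta_pairs lists (transfers, buffer_size)
    for the IFM, OFM, and weight buffers."""
    total_computation = 2 * M * N * R * C * K * K
    total_communication = sum(a * b for a, b in alpha_beta_pairs)
    return total_computation / total_communication
```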

[Diagram: execution pipeline fed by a memory controller, with double-buffered IFM (IFM0/IFM1) and weight (Weight0/Weight1) buffers and an OFM buffer.]

Page 14

Design Space Exploration

• Goal
  – Maximize throughput and CTC

• Constraints
  – Memory bandwidth
  – On-chip memory
  – Area limit (computation)

• Approach
  – Explore the design space for different values of Tm, Tn, Tr, Tc, Ti, and Tj.

[Plot: design space for AlexNet CONV1.]
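The exploration itself amounts to a sweep over tile-size candidates. A toy sketch follows; the buffer and multiplier constraint checks are simplified placeholders of our own, not the paper's exact resource and bandwidth models:

```python
from itertools import product
from math import ceil

def explore(M, N, R, C, K, tile_choices, buffer_limit, mult_limit, P=5):
    """Sweep tile-size candidates, discard infeasible points, and keep
    the one with the fewest cycles (highest throughput). tile_choices
    is a 7-tuple of candidate lists for (Tm, Tn, Tr, Tc, Ti, Tj, Tk)."""
    best = None
    for Tm, Tn, Tr, Tc, Ti, Tj, Tk in product(*tile_choices):
        # Toy on-chip buffer model: IFM + OFM + weight tiles (words).
        on_chip = Tn * Tr * Tc + Tm * Tr * Tc + Tm * Tn * Ti * Tj
        if on_chip > buffer_limit or Tk > mult_limit:
            continue
        rounds = (ceil(M / Tm) * ceil(N / Tn) * ceil((R * C) / (Tr * Tc))
                  * ceil(K / Ti) * ceil(K / Tj))
        cyc = rounds * (Tr * Tc * ceil((Ti * Tj) / Tk) + P)
        if best is None or cyc < best[0]:
            best = (cyc, (Tm, Tn, Tr, Tc, Ti, Tj, Tk))
    return best
```

Ties between equally fast points would be broken by the higher CTC from the memory model.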

Page 15

Re-configurability Effects (1)

Layer | Dynamic Tm, Tn and Tk     | Dynamic Tm and Tn         | Fixed Tm, Tn and Tk
      | Tm  Tn  Tk  Cycles GFLOPS | Tm  Tn  Tk  Cycles GFLOPS | Tm  Tn  Tk  Cycles GFLOPS
  1   | 16   3  10  117975   86   | 48   3   3  124025   85   | 16   3   9  127050   83
  2   |  4  24   5  233280   96   | 10  16   3  255879   87   | 16   3   9  279936   80
  3   | 15  32   1   79092   95   | 16  10   3   79092   95   | 16   3   9   87204   86
  4   | 15  32   1  118638   95   | 32   5   3  118638   95   | 16   3   9  129792   86
  5   | 10  48   1   79092   95   | 10  16   3   79092   95   | 16   3   9   86528   86
 Sum  |             628077        |             656726        |             755642


Towards a static solution

Page 16

Re-configurability Effects (2)

– Dynamic re-configurability has a minimal effect on performance.

[Chart: performance of the three configurations (Dynamic Tm, Tn and Tk; Dynamic Tm and Tn; Fixed Tm, Tn and Tk), normalized to the fully dynamic case.]

Towards a static solution

Page 17

Performance Comparison (1)

            | ICCD2013      | PACT2010       | ISCA2010          | ISFPGA2015   | Proposed Sol.
            | [Peeman 2013] | [Cadambi 2010] | [Chakradhar 2010] | [Zhang 2015] |
Precision   | Fixed         | Fixed          | Fixed             | 32-bit float | 32-bit float
Frequency   | 150 MHz       | 125 MHz        | 200 MHz           | 100 MHz      | 100 MHz
FPGA Chip   | VLX240T       | SX240T         | SX240T            | VX485T       | VX485T
Performance | 17 GOPs       | 7.0 GOPs       | 16 GOPs           | 61.62 GFLOPs | 84.2 GFLOPs
GOPs/Slice  | 4.5E-04       | 1.9E-04        | 4.3E-04           | 8.12E-04     | 11.09E-04

Page 18

Performance Comparison (2)

[Chart: normalized performance of ISFPGA2015 vs. the proposed solution on a Xilinx Virtex-7 485T FPGA with 2x larger area and bandwidth.]

Page 19

Conclusion

• Template architecture for convolution acceleration

• Analytical characterization of performance and memory requirements

• Expanded design space to find the best architecture
  – Parallel Convolution Engines
  – Tiling scheme extended to the kernel level

• Simulation shows a speedup of 1.4x to 1.9x over existing accelerators

Page 20

Thank you!


Page 21

References

• [Krizhevsky 2012] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems. 2012.

• [Simonyan 2015] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).

• [Peeman 2013] Peemen, M., A. A. Setio, B. Mesman, and H. Corporaal. "Memory-centric accelerator design for convolutional neural networks." Computer Design (ICCD), 2013 IEEE 31st International Conference on, pages 13–19. IEEE, 2013.

• [Farabet 2009] Farabet, Clément, et al. "NeuFlow: A runtime reconfigurable dataflow processor for vision." Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on. IEEE, 2011.

• [Cadambi 2010] Cadambi, Srihari, et al. "A programmable parallel accelerator for learning and classification." Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques. ACM, 2010.

• [Chakradhar 2010] Chakradhar, Srimat, et al. "A dynamically configurable coprocessor for convolutional neural networks." ACM SIGARCH Computer Architecture News. Vol. 38, No. 3. ACM, 2010.

• [Zhang 2015] Zhang, Chen, et al. "Optimizing FPGA-based accelerator design for deep convolutional neural networks." Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2015.