efficient processing for deep learning: challenges and opportunies · high-dimensional cnn...

59
Efficient Processing for Deep Learning: Challenges and Opportuni:es Contact Info email: [email protected] website: www.rle.mit.edu/eems Vivienne Sze Massachuse@s Ins:tute of Technology In collabora*on with Yu-Hsin Chen, Joel Emer, Tien-Ju Yang

Upload: others

Post on 16-Oct-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

EfficientProcessingforDeepLearning:ChallengesandOpportuni:es

ContactInfoemail:[email protected]:www.rle.mit.edu/eems

VivienneSzeMassachuse@sIns:tuteofTechnology

Incollabora*onwithYu-HsinChen,JoelEmer,Tien-JuYang

Page 2: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

VideoistheBiggestBigData

Needenergy-efficientpixelprocessing!

Over70%oftoday’sInternettrafficisvideoOver300hoursofvideouploadedtoYouTubeeveryminute

Over500millionhoursofvideosurveillancecollectedeveryday

Energylimitedduetoba:erycapacity

Powerlimitedduetoheatdissipa?on

2

Page 3: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

DeepConvolu:onalNeuralNetworks

Classes FC Layers

Modern deep CNN: up to 1000 CONV layers

CONV Layer

CONV Layer

Low-level Features

High-level Features

3

Page 4: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

DeepConvolu:onalNeuralNetworks

CONV Layer

CONV Layer

Low-level Features

High-level Features

Classes FC Layers

1 – 3 layers

4

Page 5: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

DeepConvolu:onalNeuralNetworks

Classes CONV Layer

CONV Layer

FC Layers

Convolutions account for more than 90% of overall computation, dominating runtime and energy consumption

5

Page 6: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

High-DimensionalCNNConvolu:on

R

R

H

Input Image (Feature Map)

Filter

H

6

Page 7: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

R

Filter

High-DimensionalCNNConvolu:on

Input Image (Feature Map)

R

Element-wise Multiplication

H

H

7

Page 8: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

R

Filter

R

High-DimensionalCNNConvolu:on

E

E Partial Sum (psum)

Accumulation

Input Image (Feature Map) Output Image

Element-wise Multiplication

H

a pixel

H

8

Page 9: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

H R

Filter

R

High-DimensionalCNNConvolu:on

E

Sliding Window Processing

Input Image (Feature Map) a pixel

Output Image

H E

9

Page 10: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

H

High-DimensionalCNNConvolu:on

R

R

C

Input Image

Output Image C Filter

Many Input Channels (C)

E

H E

AlexNet:3–192Channels(C)

10

Page 11: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

High-DimensionalCNNConvolu:on

E

Output Image Many Filters (M)

Many Output Channels (M)

M

R

R 1

R

R

C

M

H

Input Image C

C

H E

AlexNet:96–384Filters(M)

11

Page 12: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

High-DimensionalCNNConvolu:on

M

Many Input Images (N) Many

Output Images (N) …

R

R

R

R

C

C

Filters

E

E

H

C

H

H

C

E 1 1

N N

H E

Imagebatchsize:1–256(N)

12

Page 13: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

LargeSizeswithVaryingShapes

Layer FilterSize(R) #Filters(M) #Channels(C) Stride1 11x11 96 3 42 5x5 256 48 13 3x3 384 256 14 3x3 384 192 15 3x3 256 192 1

AlexNet1Convolu:onalLayerConfigura:ons

34kParams 307kParams 885kParams

Layer1 Layer2 Layer3

1.[Krizhevsky,NIPS2012]

105MMACs 224MMACs 150MMACs

13

Page 14: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

•  LeNet(1998)•  AlexNet(2012)•  OverFeat(2013)•  VGGNet(2014)•  GoogleNet(2014)•  ResNet(2015)

PopularCNNs

0

2

4

6

8

10

12

14

16

18

2012 2013 2014 2015 Human

Acc

urac

y (T

op 5

err

or)

[O. Russakovsky et al., IJCV 2015]

AlexNet

OverFeat

GoogLeNet

ResNet

Clarifa

i

VGGNet

ImageNet: Large Scale Visual Recognition Challenge (ILSVRC)

14

Page 15: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

Metrics LeNet-5 AlexNet VGG-16 GoogLeNet (v1)

ResNet-50

Top-5 error n/a 16.4 7.4 6.7 5.3

Input Size 28x28 227x227 224x224 224x224 224x224 # of CONV Layers 2 5 16 21 (depth) 49 Filter Sizes 5 3, 5,11 3 1, 3 , 5, 7 1, 3, 7 # of Channels 1, 6 3 - 256 3 - 512 3 - 1024 3 - 2048 # of Filters 6, 16 96 - 384 64 - 512 64 - 384 64 - 2048 Stride 1 1, 4 1 1, 2 1, 2 # of Weights 2.6k 2.3M 14.7M 6.0M 23.5M # of MACs 283k 666M 15.3G 1.43G 3.86G # of FC layers 2 3 3 1 1 # of Weights 58k 58.6M 124M 1M 2M # of MACs 58k 58.6M 124M 1M 2M Total Weights 60k 61M 138M 7M 25.5M Total MACs 341k 724M 15.5G 1.43G 3.9G

SummaryofPopularCNNs

CONV Layers increasingly important!

15

Page 16: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

Trainingvs.Inference

Training (determine weights)

Weights Large Datasets

Inference (use weights)

16

Page 17: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

Challenges

17

Page 18: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

• Accuracy–  EvaluatehardwareusingtheappropriateDNNmodelanddataset

•  Programmability–  Supportmulapleapplicaaons–  Differentweights

•  Energy/Power–  Energyperoperaaon–  DRAMBandwidth

•  Throughput/Latency–  GOPS,framerate,delay

•  Cost–  Area(sizeofmemoryand#ofcores)

KeyMetrics

DRAM

Chip

ComputerVision

SpeechRecogni:on

18

[Sze et al., CICC 2017]

ImageNetMNIST

Page 19: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

Opportunities in Architecture

19

Page 20: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

GPUsandCPUsTarge:ngDeepLearning

Knights Mill: next gen Xeon Phi “optimized for deep

learning”

Intel Knights Landing (2016) Nvidia PASCAL GP100 (2016)

20

Use matrix multiplication libraries on CPUs and GPUs

Page 21: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

AccelerateMatrixMul:plica:on

•  Implementation: Matrix Multiplication (GEMM)

•  CPU: OpenBLAS, Intel MKL, etc •  GPU: cuBLAS, cuDNN, etc

•  Optimized by tiling to storage hierarchy

21

Page 22: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

MapDNNtoaMatrixMul:plica:on

•  Convert to matrix mult. using the Toeplitz Matrix

1 2 34 5 67 8 9

1 23 4

Filter Input Fmap Output Fmap

* = 1 23 4

1 2 3 41 2 4 52 3 5 64 5 7 85 6 8 9

1 2 3 4 × =

Toeplitz Matrix (w/ redundant data)

Convolution:

Matrix Mult:

22

1 2 4 52 3 5 64 5 7 85 6 8 9

Data is repeated Goal: Reduced number of operations to increase throughput

Page 23: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

• Goal:Bitwisesameresult,butreducenumberofopera:ons

•  Focusesmostlyoncompute

Computa:onTransforma:ons23

Page 24: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

Analogy:Gauss’sMul:plica:onAlgorithm

4 multiplications + 3 additions

3 multiplications + 5 additions

24

Reduce number of multiplications, but increase number of additions

Page 25: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

• Winograd[Lavin,CVPR2016]– Pro:2.25xspeedupfor3x3filter– Con:Specializedprocessingdependingonfiltersize

•  FastFourierTransform[Mathieu,ICLR2014]

– Pro:DirectconvoluaonO(No2Nf

2)toO(No2log2No)

– Con:Increasestoragerequirements

•  Strassen[Cong,ICANN2014]– Pro:O(N3)to(N2.807)– Con:Numericalstability

ReduceOpera:onsinMatrixMul:plica:on25

Page 26: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

cuDNN:SpeedupwithTransforma:ons

Source: Nvidia

26

Page 27: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

Specialized Hardware (Accelerators)

27

Page 28: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

Proper:esWeCanLeverage

•  Operaaonsexhibithighparallelismàhighthroughputpossible

•  MemoryAccessistheBojleneck

28

ALU

Memory Read Memory Write MAC*

* multiply-and-accumulate

filter weight image pixel partial sum updated

partial sum

•  Example: AlexNet [NIPS 2012] has 724M MACs à 2896M DRAM accesses required

Worst Case: all memory R/W are DRAM accesses

200x 1x

DRAM DRAM

Page 29: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

Proper:esWeCanLeverage

•  Operaaonsexhibithighparallelismàhighthroughputpossible

•  Inputdatareuseopportuniaes(upto500x) àexploitlow-costmemory

Convolu:onalReuse

(pixels,weights)

Filter Image

ImageReuse(pixels)

2

1

Filters

Image

FilterReuse

(weights)

Filter

Images

2

1

29

Page 30: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

Highly-ParallelComputeParadigms30

Temporal Architecture (SIMD/SIMT)

Register File

Memory Hierarchy

Spatial Architecture (Dataflow Processing)

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

Control

Memory Hierarchy

ALU ALU ALU ALU

ALU ALU ALU ALU

ALU ALU ALU ALU

ALU ALU ALU ALU

Page 31: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

AdvantagesofSpa:alArchitecture31

Temporal Architecture (SIMD/SIMT)

Register File

Memory Hierarchy

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

Control

Spatial Architecture (Dataflow Processing)

Memory Hierarchy

ALU ALU ALU ALU

ALU ALU ALU ALU

ALU ALU ALU ALU

ALU ALU ALU ALU

EfficientDataReuseDistributedlocalstorage(RF)

Inter-PECommunica:onSharingamongregionsofPEs

ProcessingElement(PE)

Control

Reg File 0.5 – 1.0 kB

Page 32: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

DataMovementisExpensive32

DRAM ALU

Buffer ALU

PE ALU

RF ALU

ALU

Data Movement Energy Cost

200×

1× (Reference)

Off-Chip DRAM ALU = PE

Processing Engine

Accelerator

GlobalBuffer

PE

PE PE

ALU

Maximizedatareuseatlowerlevelsofhierarchy

Page 33: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

WeightSta:onary(WS)

•  Minimize weight read energy consumption −  maximize convolutional and filter reuse of weights

•  Examples: [Chakradhar, ISCA 2010] [nn-X (NeuFlow), CVPRW 2014]

[Park, ISSCC 2015] [Origami, GLSVLSI 2015]

Global Buffer

W0 W1 W2 W3 W4 W5 W6 W7

Psum Pixel

PE Weight

33

Page 34: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

•  Minimize partial sum R/W energy consumption −  maximize local accumulation

•  Examples:

OutputSta:onary(OS)

[Gupta, ICML 2015] [ShiDianNao, ISCA 2015] [Peemen, ICCD 2013]

Global Buffer

P0 P1 P2 P3 P4 P5 P6 P7

Pixel Weight

PE Psum

34

Page 35: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

•  Use a large global buffer as shared storage −  Reduce DRAM access energy consumption

•  Examples:

NoLocalReuse(NLR)

[DianNao, ASPLOS 2014] [DaDianNao, MICRO 2014] [Zhang, FPGA 2015]

PE Pixel

Psum

Global Buffer Weight

35

Page 36: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

RowSta:onaryDataflow

PE 1

Row 1 Row 1

PE 2

Row 2 Row 2

PE 3

Row 3 Row 3

Row 1

= *

PE 4

Row 1 Row 2

PE 5

Row 2 Row 3

PE 6

Row 3 Row 4

Row 2

= *

PE 7

Row 1 Row 3

PE 8

Row 2 Row 4

PE 9

Row 3 Row 5

Row 3

= *

* * *

* * *

* * *

36

Page 37: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

DataflowComparison:CONVLayers

0

0.5

1

1.5

2

Normalized Energy/MAC

WS OSA OSB OSC NLR RS

psums

weights

pixels

CNN Dataflows

37

[Chen, ISCA 2016]

RS uses 1.4× – 2.5× lower energy than other dataflows

Page 38: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

EyerissDeepCNNAccelerator38

Off-Chip DRAM

… …

Decomp

Comp ReLU

Input Image

Output Image

Filter Filt

Img

Psum

Psum

Global Buffer SRAM

108KB

64 bits

DCNN Accelerator

14×12 PE Array

Link Clock Core Clock

[Chenetal.,ISSCC2016]

Page 39: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

ComparisonwithGPU39

Eyeriss NVIDIA TK1 (Jetson Kit) Technology 65nm 28nm Clock Rate 200MHz 852MHz

# Multipliers 168 192

On-Chip Storage Buffer: 108KB Spad: 75.3KB

Shared Mem: 64KB Reg File: 256KB

Word Bit-Width 16b Fixed 32b Float Throughput1 34.7 fps 68 fps

Measured Power 278 mW Idle/Active2: 3.7W/10.2W

DRAM Bandwidth 127 MB/s 1120 MB/s 3

1.  AlexNet Convolutional Layers Only 2.  Board Power 3.  Modeled from [Tan, SC11] http://eyeriss.mit.edu

Page 40: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

Features:Energyvs.Accuracy40

0.1

1

10

100

1000

10000

0 20 40 60 80

Accuracy(AveragePrecision)

Energy/Pixel(nJ)

VGG162

AlexNet2

HOG1

Measuredin65nm*1.   [Suleiman,VLSI2016]2.   [Chen,ISSCC2016]*Onlyfeatureextrac?on.Doesnotincludedata,augmenta?on,ensembleandclassifica?onenergy,etc.

MeasuredinonVOC2007Dataset1.   DPMv5[Girshick,2012]2.   FastR-CNN[Girshick,CVPR2015]

Exponen?al

Linear

VideoCompression

[Suleiman et al., ISCAS 2017]

Page 41: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

Opportunities in Joint Algorithm Hardware Design

41

Page 42: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

• Reducesizeofoperandsforstorage/compute–  FloaangpointàFixedpoint– Bit-widthreducaon– Non-linearquanazaaon

• Reducenumberofopera:onsforstorage/compute– ExploitAcavaaonStaasacs(Compression)– NetworkPruning– CompactNetworkArchitectures

Approaches42

Page 43: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

CommercialProductsusing8-bitInteger

Nvidia’s Pascal (2016) Google’s TPU (2016)

43

Page 44: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

•  Reducenumberofbits–  BinaryNets[Courbariaux,NIPS2015]

•  Reducenumberofuniqueweights–  TernaryWeightNets[Li,arXiv2016]–  XNOR-Net[Rategari,ECCV2016]

•  Non-LinearQuan:za:on–  LogNet[Lee,ICASSP2017]

ReducedPrecisioninResearch44

Binary Filters

Log Domain Quantization

Page 45: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

SparsityinFeatureMaps

9 -1 -3 1 -5 5 -2 6 -1

Many zeros in output fmaps after ReLU ReLU 9 0 0

1 0 5 0 6 0

0

0.2

0.4

0.6

0.8

1

1 2 3 4 5 CONV Layer

# of activations # of non-zero activations

(Normalized)

45

Page 46: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

ExploitSparsity46

[Chenetal.,ISSCC2016]

Method2:Compressdatatoreducestorageanddatamovement

0

1

2

3

4

5

6

1 2 3 4 5

DRAM

Access

(MB)

AlexNetConvLayer

Uncompressed Compressed

1 2 3 4 5 AlexNet Conv Layer

DRAM Access (MB)

0

2

4

6 1.2×

1.4× 1.7×

1.8× 1.9×

Uncompressed Fmaps + Weights

RLE Compressed Fmaps + Weights

== 0 Zero Buff

Scratch Pad

Enable

Zero Data Skipping

RegisterFile

NoR/W NoSwitching

Method1:Skipmemoryaccessandcomputa*on

45%energysavings

Page 47: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

Op:malBrainDamage

Pruning–MakeWeightsSparse

[Lecun et al., NIPS 1989]

retraining

47

SUXQLQJ�QHXURQV

SUXQLQJ�V\QDSVHV

DIWHU�SUXQLQJEHIRUH�SUXQLQJ

PruneDNNbasedonmagnitudeofweights

[Han et al., NIPS 2015]

Example: AlexNet Weight Reduction:

CONV layers 2.7x, FC layers 9.9x Overall Reduction:

Weights 9x, MACs 3x

Page 48: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

NetworkArchitectureDesign

5x5 filter Two 3x3 filters

decompose

Apply sequentially

decompose

5x5 filter 5x1 filter

1x5 filter

Apply sequentially GoogleNet/Inception v3

VGG-16

Build Network with series of Small Filters

separable filters

48

Page 49: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

1x1Bo@leneckinPopularDNNmodels

ResNet

GoogleNet

compress

expand

compress

49

SqueezeNet

Page 50: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

• AccuracyàMeasuredonDataset

•  SpeedàNumberofMACs

•  StorageFootprintàNumberofWeights

•  Energyà?

KeyMetricsforEmbeddedDNN50

Page 51: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

Energy-Evalua:onMethodology51

CNN Shape Configuration (# of channels, # of filters, etc.)

CNN Weights and Input Data

[0.3, 0, -0.4, 0.7, 0, 0, 0.1, …]

CNN Energy Consumption L1 L2 L3

Energy

Memory Accesses

Optimization

# of MACs Calculation

# acc. at mem. level 1 # acc. at mem. level 2

# acc. at mem. level n

# of MACs

Hardware Energy Costs of each MAC and Memory Access

Ecomp

Edata

Energy estimation tool available at http://eyeriss.mit.edu

[Yang et al., CVPR 2017]

Page 52: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

•  Numberofweightsaloneisnotagoodmetricforenergy

•  Alldatatypesshouldbeconsidered

KeyObserva:ons52

OutputFeatureMap43%

InputFeatureMap25%

Weights22%

Computa:on10%

EnergyConsump:onofGoogLeNet

[Yang et al., CVPR 2017]

Page 53: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

[Yang et al., CVPR 2017]

53

AlexNet SqueezeNet

GoogLeNet

ResNet-50VGG-16

77%

79%

81%

83%

85%

87%

89%

91%

93%

5E+08 5E+09 5E+10

Top-5Ac

curacy

NormalizedEnergyConsump:on

OriginalDNN

DeeperCNNswithfewerweightsdonotnecessarilyconsumelessenergythanshallowerCNNswithmoreweights

EnergyConsump:onofExis:ngDNNs

Page 54: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

54

AlexNet SqueezeNet

GoogLeNet

ResNet-50VGG-16

AlexNet

SqueezeNet

77%

79%

81%

83%

85%

87%

89%

91%

93%

5E+08 5E+09 5E+10

Top-5Ac

curacy

NormalizedEnergyConsump:on

OriginalDNN Magnitude-basedPruning[6][Hanetal.,NIPS2015]

Reducenumberofweightsbyremovingsmallmagnitudeweights

Magnitude-basedWeightPruning

Page 55: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

[Yang et al., CVPR 2017]

55

AlexNet SqueezeNet

GoogLeNet

ResNet-50VGG-16

AlexNet

SqueezeNet

AlexNet SqueezeNet

GoogLeNet

77%

79%

81%

83%

85%

87%

89%

91%

93%

5E+08 5E+09 5E+10

Top-5Ac

curacy

NormalizedEnergyConsump:on

OriginalDNN Magnitude-basedPruning[6] Energy-awarePruning(ThisWork)

Removeweightsfromlayersinorderofhighesttolowestenergy3.7xreduc:oninAlexNet/1.6xreduc:oninGoogLeNet

Energy-AwarePruning

1.74x

Page 56: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

•  Energy-EfficientApproaches– Minimizedatamovement– Balanceflexibilityandenergy-efficiency– Exploitsparsitywithjointalgorithmandhardwaredesign

•  Jointalgorithmandhardwaredesigncandeliveraddiaonalenergysavings(directlytargetenergy)

•  Linearincreaseinaccuracyrequiresexponen:alincreaseinenergy

Summary56

Acknowledgements:ThisworkisfundedbytheDARPAYFAgrant,MITCenterforIntegratedCircuits&Systems,andgirsfromIntel,NvidiaandGoogle.

Page 57: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

References

MoreinfoaboutEyerissandTutorialonDNNArchitectureshjp://eyeriss.mit.edu

Forupdateshttp://mailman.mit.edu/mailman/listinfo/eems-news

OverviewPaperV.Sze,Y.-H.Chen,T-J.Yang,J.Emer,“EfficientProcessingofDeepNeuralNetworks:ATutorialandSurvey”,arXiv,2017hjps://arxiv.org/pdf/1703.09039.pdf

57

MITProfessionalEducaaonCourseon“DesigningEfficientDeepLearningSystems”March26–27,2018inMountainView,CAhjp://professional-educaaon.mit.edu/deeplearning

Page 58: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

© 2016 Embedded Vision Alliance 58

The Embedded Vision Alliance (www.Embedded-Vision.com) is a partnership of ~70 leading embedded vision technology and services suppliers

Mission: Inspire and empower product creators to incorporate visual intelligence into their products

The Alliance provides low-cost, high-quality technical educational resources for product developers

Register for updates at www.Embedded-Vision.com

The Alliance enables vision technology providers to grow their businesses through leads, ecosystem partnerships, and insights

For membership, email us: [email protected]

Empowering Product Creators to Harness Embedded Vision

Page 59: Efficient Processing for Deep Learning: Challenges and Opportunies · High-Dimensional CNN Convoluon E Many Output Image Filters (M) Many Output Channels (M) M … R R 1 R R C M H

© 2016 Embedded Vision Alliance 59

The only industry event focused on enabling product creators to create “machines that see”

•  “Awesome! I was very inspired!”

•  “Fantastic. Learned a lot and met great people.”

•  “Wonderful speakers and informative exhibits!”

Embedded Vision Summit 2018 highlights: •  Inspiring keynotes by leading innovators

•  High-quality, practical technical, business and product talks

•  Exciting demos of the latest apps and technologies

Visit www.EmbeddedVisionSummit.com to sign up for updates

Join us at the Embedded Vision Summit May 22-24, 2018—Santa Clara, California