-
Processing and Training Deep Neural Networks on Chip
Ghouthi BOUKLI HACENE, Vincent GRIPON, Nicolas FARRUGIA, Matthieu ARZEL, Michel JEZEQUEL, Yoshua BENGIO
-
Context 2
1024 V100 GPUs during 1 day for training. 4 TPUs during 1 month for training.
1T FLOPs for one decision. 100M parameters to learn.
-
Challenges 3
Technical challenges: real-time applications; running deep learning on resource-limited embedded systems (1T FLOPs for one decision, 100M parameters to learn).
Scientific challenges: large architectures hinder visualization and interpretation; simulation time limits the progress of the field (4 TPUs during 1 month for training, 100M parameters to learn).
Societal challenges: large energy consumption; making deep learning accessible to everyone (1024 V100 GPUs during 1 day for training, 4 TPUs during 1 month for training).
-
Outline
1. Deep Learning
•1.1 Deep Learning
•1.2 Some DNN Architectures
•1.3 Importance of the Architecture
2. Efficient Inference
•2.1 Reducing DNNs Size
•2.2 Compression Methods
•2.3 Quantization
•2.4 Pruning
3. Conclusion
-
Deep Learning
-
Deep Learning 5
A deep learning architecture is basically an assembly of functions.
Each function can be represented by an entity called a layer.
The two most important layers:
Fully connected layer: each of the 3 inputs is connected to each of the 5 outputs through its own weight w_{i,j}:
\[
W = \begin{pmatrix}
w_{1,1} & w_{1,2} & w_{1,3} & w_{1,4} & w_{1,5}\\
w_{2,1} & w_{2,2} & w_{2,3} & w_{2,4} & w_{2,5}\\
w_{3,1} & w_{3,2} & w_{3,3} & w_{3,4} & w_{3,5}
\end{pmatrix}.
\]
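As a minimal sketch of what this layer computes (NumPy; the 3-input/5-output sizes come from the slide, the values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))  # w[i, j] connects input i to output j
x = rng.standard_normal(3)       # 3 input activations

y = x @ W                        # the fully connected layer: 5 outputs
print(y.shape)                   # (5,)
```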
Convolutional layer: a small kernel (here 3×3, with weights w_1, ..., w_9) slides over the input, so the same few weights are reused at every position. Written as a matrix product, each kernel row gives one Toeplitz block:
\[
\begin{pmatrix}
w_1 & w_2 & w_3 & 0 & 0\\
0 & w_1 & w_2 & w_3 & 0\\
0 & 0 & w_1 & w_2 & w_3
\end{pmatrix},\qquad
\begin{pmatrix}
w_4 & w_5 & w_6 & 0 & 0\\
0 & w_4 & w_5 & w_6 & 0\\
0 & 0 & w_4 & w_5 & w_6
\end{pmatrix},\qquad
\begin{pmatrix}
w_7 & w_8 & w_9 & 0 & 0\\
0 & w_7 & w_8 & w_9 & 0\\
0 & 0 & w_7 & w_8 & w_9
\end{pmatrix}.
\]
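A quick sketch (NumPy, hypothetical values) checking the equivalence these blocks encode in one dimension: sliding a 3-tap kernel over a length-5 input equals multiplying by the corresponding Toeplitz matrix.

```python
import numpy as np

w = np.array([1.0, 2.0, 3.0])     # kernel (w1, w2, w3)
x = np.arange(5, dtype=float)     # input of length 5

# Toeplitz matrix from the slide: each row is the kernel shifted by one.
T = np.array([
    [w[0], w[1], w[2], 0.0,  0.0 ],
    [0.0,  w[0], w[1], w[2], 0.0 ],
    [0.0,  0.0,  w[0], w[1], w[2]],
])

y_matmul = T @ x
y_slide  = np.convolve(x, w[::-1], mode="valid")  # sliding window (correlation)
print(np.allclose(y_matmul, y_slide))             # True
```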
-
Some DNN Architectures 6
AlexNet-like network: Input → Layer 1 → Layer 2 → Layer 3 → Output, a plain chain of layers.
Residual-like network: Input → Layer 1 → Layer 2 → Addition → Layer 3 → Output, where the Addition node sums the output of Layer 2 with an earlier activation carried by a skip connection.
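A minimal PyTorch sketch of the two patterns (layer types and sizes are hypothetical; the point is only the dataflow):

```python
import torch
import torch.nn as nn

class PlainBlock(nn.Module):            # AlexNet-like: a simple chain
    def __init__(self, dim=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                  nn.Linear(dim, dim), nn.ReLU())
    def forward(self, x):
        return self.body(x)

class ResidualBlock(nn.Module):         # Residual-like: skip + Addition node
    def __init__(self, dim=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                  nn.Linear(dim, dim))
    def forward(self, x):
        return torch.relu(self.body(x) + x)  # the "Addition" on the slide

x = torch.randn(8, 64)
print(PlainBlock()(x).shape, ResidualBlock()(x).shape)
```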
-
Importance of the architecture: memory 7
Figure: Memory (MB, 0-800) vs. Top-1 error (%, 20-45) for models trained on 2012 ILSVRC: alexnet, caffenet, squeezenet1-0/1-1, the vgg family (f, m, s, m-2048, m-1024, m-128, vd-16, vd-19), googlenet, resnet-18/34/50/101/152, resnext-50-32x4d, resnext-101-32x4d/64x4d, inception-v3, SE-ResNet-50/101/152, SE-ResNeXt-50/101-32x4d, SENet, SE-BN-Inception, densenet-121/161/169/201 and mcn-mobilenet.
-
Importance of the architecture: FLOPs 8
Figure: MFLOPs (0 to 2.2·10^4) vs. Top-1 error (%, 20-45) for the same set of models trained on 2012 ILSVRC as in the memory plot.
-
Efficient Inference
-
Reducing DNNs size (CIFAR100) 9
Scaling down proportionally the number of feature maps of each layer (a sketch of this width scaling follows the figures).
Figure (left): Memory (MB, 50-200) vs. test set accuracy (%, 65-80) for VGG19, MobileNetV2, Resnet34 and Densenet121.
Figure (right): GFLOPs (0-3) vs. test set accuracy (%) for the same four architectures.
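A hedged sketch of that width scaling, assuming a toy VGG-like stack (the channel counts and structure are illustrative, not the exact models behind the plots):

```python
import torch.nn as nn

def scaled_convnet(width_mult=0.5, num_classes=100):
    """Shrink every layer's feature maps by the same factor width_mult."""
    base = [64, 128, 256]                               # toy baseline widths
    widths = [max(1, int(c * width_mult)) for c in base]
    layers, in_ch = [], 3
    for out_ch in widths:
        layers += [nn.Conv2d(in_ch, out_ch, 3, padding=1),
                   nn.ReLU(), nn.MaxPool2d(2)]
        in_ch = out_ch
    return nn.Sequential(*layers, nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(in_ch, num_classes))

# Parameter count (and FLOPs) falls roughly quadratically with width_mult:
for m in (1.0, 0.5, 0.25):
    print(m, sum(p.numel() for p in scaled_convnet(m).parameters()))
```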
-
Compression Methods 10
Baseline: both layers store full-precision floating-point weights, e.g.
\[
W^{(1)} = \begin{pmatrix}
0.478 & 0.314 & 0.231 & 1.231 & -0.423\\
-1.987 & 1.332 & 0.977 & -0.541 & 1.230\\
0.322 & 0.431 & 0.221 & 0.112 & -0.445\\
-0.718 & 0.891 & -0.231 & -1.231 & -0.331\\
-1.412 & 0.490 & 0.791 & 0.901 & -1.002
\end{pmatrix},\qquad
W^{(2)} = \begin{pmatrix}
0.528 & 0.710 & 0.730 & 0.231 & -1.423\\
-1.087 & 0.132 & 1.797 & -1.041 & 1.131\\
1.220 & 0.321 & 0.341 & 1.912 & -1.445\\
-0.798 & 1.291 & 1.481 & -0.871 & -0.821\\
1.772 & -0.484 & 0.179 & -0.121 & -1.921
\end{pmatrix}.
\]
Quantization: the same weights rounded to one decimal; with far fewer distinct values, each weight can be stored on far fewer bits:
\[
W^{(1)} \approx \begin{pmatrix}
0.5 & 0.3 & 0.2 & 1.2 & -0.4\\
-2.0 & 1.3 & 1.0 & -0.5 & 1.2\\
0.3 & 0.4 & 0.2 & 0.1 & -0.4\\
-0.7 & 0.9 & -0.2 & -1.2 & -0.3\\
-1.4 & 0.5 & 0.8 & 0.9 & -1.0
\end{pmatrix},\qquad
W^{(2)} \approx \begin{pmatrix}
0.5 & 0.7 & 0.7 & 0.2 & -1.4\\
-1.1 & 0.1 & 1.8 & -1.0 & 1.1\\
1.2 & 0.3 & 0.3 & 1.9 & -1.4\\
-0.8 & 1.3 & 1.5 & -0.9 & -0.8\\
1.8 & -0.5 & 0.2 & -0.1 & -1.9
\end{pmatrix}.
\]
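A minimal sketch of one such scheme, uniform quantization on an n-bit grid; the grid and per-tensor range are assumptions, since the slide only shows rounding to one decimal:

```python
import torch

def quantize_uniform(w: torch.Tensor, n_bits: int) -> torch.Tensor:
    """Snap each weight to one of 2**n_bits evenly spaced values
    spanning [w.min(), w.max()] (assumed scheme, for illustration)."""
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / (2 ** n_bits - 1)
    return torch.round((w - lo) / scale) * scale + lo

w = torch.tensor([0.478, 0.314, 0.231, 1.231, -0.423])
print(quantize_uniform(w, 4))  # every value snapped to a 16-level grid
```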
-
How Many Bits Do We Need? 11
Figure (left): test set accuracy (%, 20-100) vs. value precision (in bits, 5-30) when quantizing weights only, for Resnet18, PreActResnet18, SeNet18 and MobileNetV2 on CIFAR10.
Figure (right): the same comparison when quantizing both weights and activations (CIFAR10).
-
Binary Connect (BC) 12
BC straight-through principle (sketched in code below):
1 Map weight values to their signs (+1 or −1).
2 Compute the feed-forward and back-propagation passes with these binary weights.
3 Update the initial floating-point values.
Binary Weight Network (BWN) outperforms BC by adding a scaling factor (−α or +α).
Comparison of accuracy between baseline, BC and BWN on CIFAR10:

                 Resnet34   Densenet121   MobilenetV2
Full-precision   95.0%      95.0%         93.8%
BC               93.6%      94.5%         93.0%
BWN              94.3%      94.7%         93.4%
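A minimal PyTorch sketch of those three steps using a straight-through estimator. The toy model, data and learning rate are made up, and the per-tensor α is a simplification (BWN scales per filter):

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign in the forward pass, identity gradient in the backward pass."""
    @staticmethod
    def forward(ctx, w):
        return torch.sign(w)            # step 1: map weights to +/-1
    @staticmethod
    def backward(ctx, grad_out):
        return grad_out                 # straight-through: pass grads as-is

def binarize(w, bwn=False):
    wb = BinarizeSTE.apply(w)
    return wb * w.abs().mean() if bwn else wb   # BWN: scale by alpha

w = torch.randn(5, requires_grad=True)          # float weights are kept
x, y = torch.randn(8, 5), torch.randn(8)
loss = ((x @ binarize(w, bwn=True)) - y).pow(2).mean()
loss.backward()                                 # step 2: forward/backward
with torch.no_grad():
    w -= 0.01 * w.grad                          # step 3: update float values
```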
-
Compression Methods 13
Baseline: two layers of 25 weights each, arranged as 5×5 matrices:
\[
W^{(\ell)} = \begin{pmatrix}
w^{\ell}_{1} & w^{\ell}_{2} & w^{\ell}_{3} & w^{\ell}_{4} & w^{\ell}_{5}\\
w^{\ell}_{6} & w^{\ell}_{7} & w^{\ell}_{8} & w^{\ell}_{9} & w^{\ell}_{10}\\
w^{\ell}_{11} & w^{\ell}_{12} & w^{\ell}_{13} & w^{\ell}_{14} & w^{\ell}_{15}\\
w^{\ell}_{16} & w^{\ell}_{17} & w^{\ell}_{18} & w^{\ell}_{19} & w^{\ell}_{20}\\
w^{\ell}_{21} & w^{\ell}_{22} & w^{\ell}_{23} & w^{\ell}_{24} & w^{\ell}_{25}
\end{pmatrix},\qquad \ell \in \{1, 2\}.
\]
Pruning: zeroing individual weights scattered through Layer 1 (non-structured) vs. zeroing entire rows of Layer 2 (structured):
\[
W^{(1)} = \begin{pmatrix}
0 & w^{1}_{2} & 0 & w^{1}_{4} & w^{1}_{5}\\
0 & 0 & w^{1}_{8} & w^{1}_{9} & w^{1}_{10}\\
w^{1}_{11} & 0 & w^{1}_{13} & w^{1}_{14} & 0\\
w^{1}_{16} & w^{1}_{17} & 0 & w^{1}_{19} & 0\\
w^{1}_{21} & w^{1}_{22} & 0 & w^{1}_{24} & w^{1}_{25}
\end{pmatrix},\qquad
W^{(2)} = \begin{pmatrix}
0 & 0 & 0 & 0 & 0\\
w^{2}_{6} & w^{2}_{7} & w^{2}_{8} & w^{2}_{9} & w^{2}_{10}\\
w^{2}_{11} & w^{2}_{12} & w^{2}_{13} & w^{2}_{14} & w^{2}_{15}\\
0 & 0 & 0 & 0 & 0\\
w^{2}_{21} & w^{2}_{22} & w^{2}_{23} & w^{2}_{24} & w^{2}_{25}
\end{pmatrix}.
\]
-
Pruning 14
Evaluate the importance of neurons and eliminate the least important ones to reduce the neural network's size.
Non-structured pruning: eliminates neurons independently; only exploitable at very high levels of sparsity.
Structured pruning: eliminates kernels, filters or even layers; exploitable even at low levels of sparsity (see the sketch below).
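A minimal PyTorch sketch of both variants; the magnitude and L1-norm importance criteria are assumptions, since the slide does not fix one:

```python
import torch

def unstructured_prune(w: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero the smallest-magnitude weights individually (non-structured)."""
    k = int(sparsity * w.numel())
    if k == 0:
        return w.clone()
    thresh = w.abs().flatten().kthvalue(k).values
    return w * (w.abs() > thresh)

def structured_prune(w: torch.Tensor, n_filters: int) -> torch.Tensor:
    """Zero whole output filters with the smallest L1 norm (structured)."""
    norms = w.flatten(1).abs().sum(dim=1)     # one importance score per filter
    drop = norms.argsort()[:n_filters]        # the least important filters
    w = w.clone()
    w[drop] = 0.0
    return w

w = torch.randn(8, 4, 3, 3)                   # conv weights: 8 filters
print((unstructured_prune(w, 0.5) == 0).float().mean())          # ~50% zeros
print((structured_prune(w, 2).flatten(1).abs().sum(1) == 0).sum())  # tensor(2)
```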
-
Structured pruning and shift layers 15
Figure: a standard convolutional layer (input X with C channels and spatial size L, kernels of size S, output Y of depth d) next to its shift-based counterpart, where a shifted X is combined through pointwise weights.
Shift Attention Layer (SAL), sketched below:
Simplified operations,
Reduced number of parameters,
Fully exploitable technique.
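A hedged PyTorch sketch of the underlying shift layer: each channel gets one fixed spatial shift, then a pointwise (1×1) convolution mixes channels. SAL itself learns which shift each channel receives through an attention mechanism; that selection is not reproduced here, and the round-robin shift assignment below is only illustrative:

```python
import torch
import torch.nn as nn

class ShiftLayer(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # One of the 9 offsets of a 3x3 neighbourhood per channel (fixed here).
        offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
        self.offsets = [offsets[c % 9] for c in range(in_ch)]
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        shifted = torch.empty_like(x)
        for c, (dy, dx) in enumerate(self.offsets):
            # torch.roll wraps around; real shift layers pad with zeros.
            shifted[:, c] = torch.roll(x[:, c], shifts=(dy, dx), dims=(-2, -1))
        return self.pointwise(shifted)  # all multiplies sit in the 1x1 conv

y = ShiftLayer(16, 32)(torch.randn(1, 16, 8, 8))
print(y.shape)  # torch.Size([1, 32, 8, 8])
```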
-
Experimental Results 16
Comparison of accuracy, number of parameters (NP, in millions) and number of floating-point operations (MFLOPs) using ResNet architectures.

              CIFAR10                       CIFAR100
Method        Accuracy  NP (M)  MFLOPs     Accuracy  NP (M)  MFLOPs
Pruned-B      93.06%    0.73    91         73.6%     7.83    616
NISP          93.01%    0.49    71         −         −       −
PCAS          93.58%    0.39    56         73.84%    4.02    475
SAL           93.6%     0.36    42         77.6%     3.9     251
-
Quantized Shift Layers 17
Shift Layers + BC or BWN:
Shift layer: replaces a convolution by a multiplication.
BC/BWN: replaces a multiplication by a low-cost multiplexer.
Shift layer + BC/BWN: replaces a convolution by a low-cost multiplexer.
Comparison of accuracy and memory usage between the Resnet-20 baseline, SAL, SAL with BC and SAL with BWN on CIFAR10:

            Accuracy (%)   Memory (Mb)
Baseline    94.66          39.04
SAL         95.52          31.36
SAL + BC    93.20          6.87
SAL + BWN   94.00          6.87
-
Comparison of Methods 18
Figure: evolution of test set accuracy (%, 90-96) vs. memory footprint (normalized, 0-1) when applying compression methods to different DNN architectures on the CIFAR10 dataset: MobileNetV2, SqueezeNet, Resnet18, Resnet18+BC, Resnet18+BWN and Resnet18+SAL.
-
Conclusion
-
Conclusion 19
Compression Methods
Different ways to reduce DNN size and complexity, and thus energy consumption.
Compression methods are only applicable to the inference part, not to the learning part.
Directions
Which combinations of quantization methods are efficient?
Can the training process decide the most efficient number of bits for quantizing values?
Can SAL perform well in domains other than classification?
Can compression methods be reconsidered to reduce training complexity?