"Techniques for Efficient Implementation of Deep Neural Networks," a presentation from Stanford

TRANSCRIPT
![Page 1: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/1.jpg)
Deep Compression and EIE: Deep Neural Network Model Compression and Efficient Inference Engine

Song Han
CVA group, Stanford University, Mar 1, 2015
![Page 2: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/2.jpg)
A few words about us
Song Han
• Fourth-year PhD student with Prof. Bill Dally at Stanford.
• Research interest: computer architecture for deep learning, improving the energy efficiency of neural networks running on mobile and embedded systems.
• Recent work on "Deep Compression" and "EIE: Efficient Inference Engine," covered by TheNextPlatform and O'Reilly.

Bill Dally
• Professor at Stanford University and former chairman of the CS department; leads the Concurrent VLSI Architecture Group.
• Chief Scientist of NVIDIA.
• Member of the National Academy of Engineering, Fellow of the American Academy of Arts & Sciences, Fellow of the IEEE, Fellow of the ACM…
![Page 3: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/3.jpg)
This Talk:
• Deep Compression: A Deep Neural Network Model Compression Pipeline.
• EIE Accelerator: Efficient Inference Engine that Accelerates the Compressed Deep Neural Network Model.
• => SW / HW co-design
[1] Han et al. "Learning both Weights and Connections for Efficient Neural Networks," NIPS 2015.
[2] Han et al. "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding," ICLR 2016.
[3] Han et al. "EIE: Efficient Inference Engine on Compressed Deep Neural Network," arXiv 2016.
[4] Iandola et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size."
![Page 4: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/4.jpg)
Deep Learning Next Wave of AI
Image Recognition
Speech Recognition
Natural Language Processing
![Page 5: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/5.jpg)
Applications
![Page 6: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/6.jpg)
App developers suffer from the model size

"At Baidu, our #1 motivation for compressing networks is to bring down the size of the binary file. As a mobile-first company, we frequently update various apps via different app stores. We're very sensitive to the size of our binary files, and a feature that increases the binary size by 100MB will receive much more scrutiny than one that increases it by 10MB." —Andrew Ng

The Problem: If Running DNN on Mobile…
![Page 7: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/7.jpg)
Hardware engineers suffer from the model size (embedded systems have limited resources)

The Problem: If Running DNN on Mobile…
![Page 8: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/8.jpg)
The Problem:
Intelligent but Inefficient
Network Delay

Power Budget
User Privacy
If Running DNN on the Cloud…
![Page 9: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/9.jpg)
Solution 1: Deep Compression
Deep Neural Network Model Compression
Smaller Size: compress mobile app size by 35x-50x

Accuracy: no loss of accuracy / improved accuracy

Speedup: make inference faster
![Page 10: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/10.jpg)
Solution 2: EIE Accelerator
ASIC accelerator: EIE (Efficient Inference Engine)
Offline: no dependency on network connection

Real Time: no network delay, high frame rate

Low Power: high energy efficiency that preserves battery
![Page 11: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/11.jpg)
Deep Compression
• AlexNet: 35×, 240MB => 6.9MB => 0.52MB
• VGG16: 49×, 552MB => 11.3MB
• Both with no loss of accuracy on ImageNet12
• Weights fit in on-chip SRAM, taking 120x less energy than DRAM
1. Han et al. "Learning both Weights and Connections for Efficient Neural Networks," NIPS 2015.
2. Han et al. "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding," ICLR 2016.
3. Iandola et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size."
![Page 12: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/12.jpg)
1. Pruning
Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
![Page 13: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/13.jpg)
Pruning: Motivation
• Trillions of synapses are generated in the human brain during the first few months after birth.
• At 1 year old, the synapse count peaks at 1,000 trillion.
• Pruning then begins to occur.
• At 10 years old, a child has nearly 500 trillion synapses.
• This "pruning" mechanism removes redundant connections in the brain.
[1] Christopher A. Walsh. "Peter Huttenlocher (1931-2013)." Nature, 502(7470):172, 2013.
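The same idea drives network pruning: remove small-magnitude connections, then retrain the survivors. A minimal NumPy sketch of magnitude-based pruning with a retraining mask (the function name and threshold rule are illustrative, not the paper's exact procedure):

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude weights so that `sparsity`
    fraction of them are removed; return pruned weights plus a
    binary mask that keeps them zero during retraining."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy(), np.ones_like(weights)
    threshold = np.partition(flat, k)[k]   # k-th smallest magnitude
    mask = (np.abs(weights) >= threshold).astype(weights.dtype)
    return weights * mask, mask

w = np.array([[0.9, -0.05], [0.02, -0.7]])
pruned, mask = prune_by_magnitude(w, 0.5)
# pruned == [[0.9, 0.0], [0.0, -0.7]]; during retraining, gradients
# are multiplied by `mask` so pruned connections stay at zero.
```

Iterating this prune-and-retrain loop, rather than pruning once, is what lets the highest sparsity levels keep full accuracy.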
![Page 14: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/14.jpg)
Retrain to Recover Accuracy
[Figure: accuracy loss (0.5% to -4.5%) vs. parameters pruned away (40%-100%). Legend: L2 regularization w/o retrain; L1 regularization w/o retrain; L1 regularization w/ retrain; L2 regularization w/ retrain; L2 regularization w/ iterative prune and retrain.]
Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
![Page 15: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/15.jpg)
Pruning: Results on 4 Convnets
Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
![Page 16: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/16.jpg)
AlexNet & VGGNet
Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
![Page 17: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/17.jpg)
Mask Visualization
Visualization of the first FC layer's sparsity pattern in LeNet-300-100. It has a banded structure repeated 28 times, which corresponds to the un-pruned parameters in the center of the images, since the digits are written in the center.
![Page 18: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/18.jpg)
Pruning NeuralTalk and LSTM
[1] Thanks to Shijian Tang for pruning NeuralTalk.
![Page 19: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/19.jpg)
• Original: a basketball player in a white uniform is playing with a ball
• Pruned 90%: a basketball player in a white uniform is playing with a basketball
• Original : a brown dog is running through a grassy field
• Pruned 90%: a brown dog is running through a grassy area
• Original : a soccer player in red is running in the field
• Pruned 95%: a man in a red shirt and black and white black shirt is running through a field
• Original : a man is riding a surfboard on a wave
• Pruned 90%: a man in a wetsuit is riding a wave on a beach
![Page 20: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/20.jpg)
Speedup (FC layer)
• Intel Core i7 5930K: MKL CBLAS GEMV, MKL SPBLAS CSRMV
• NVIDIA GeForce GTX Titan X: cuBLAS GEMV, cuSPARSE CSRMV
• NVIDIA Tegra K1: cuBLAS GEMV, cuSPARSE CSRMV
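CSRMV here means sparse matrix-vector multiplication on a matrix stored in compressed sparse row (CSR) format, which stores only the non-zero weights left after pruning. A minimal reference implementation of the kernel (a teaching sketch, far slower than MKL or cuSPARSE):

```python
import numpy as np

def csr_matvec(values, col_idx, row_ptr, x):
    """y = A @ x for A in CSR format: `values` holds the non-zeros,
    `col_idx` their column indices, and `row_ptr[i]:row_ptr[i+1]`
    the slice of non-zeros belonging to row i."""
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        for j in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[j] * x[col_idx[j]]
    return y

# A = [[0, 2, 0],
#      [1, 0, 3]]
values  = np.array([2.0, 1.0, 3.0])
col_idx = np.array([1, 0, 2])
row_ptr = np.array([0, 1, 3])
y = csr_matvec(values, col_idx, row_ptr, np.array([1.0, 1.0, 1.0]))
# y == [2.0, 4.0]
```

With 90% of the weights pruned, the inner loop touches 10x fewer elements than a dense GEMV, which is where the speedup comes from.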
![Page 21: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/21.jpg)
Energy Efficiency (FC layer)
• Intel Core i7 5930K: CPU socket and DRAM power are reported by pcm-power utility
• NVIDIA GeForce GTX Titan X: reported by nvidia-smi utility
• NVIDIA Tegra K1: measured the total power consumption with a power meter; assuming 15% AC-to-DC conversion loss, 85% regulator efficiency, and 15% of power consumed by peripheral components, 60% of the measured power is attributed to AP+DRAM
![Page 22: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/22.jpg)
2. Weight Sharing (Trained Quantization)
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
![Page 23: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/23.jpg)
Weight Sharing: Overview
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
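Weight sharing clusters each layer's weights into 2^b centroids, so every weight is stored as a b-bit index into a small codebook. A rough NumPy sketch of the k-means step with linear centroid initialization (the function name and iteration count are illustrative):

```python
import numpy as np

def kmeans_share(weights, bits=4, iters=20):
    """Cluster a layer's weights into 2**bits shared centroids
    (a sketch of trained quantization; linear initialization)."""
    w = weights.ravel()
    k = 2 ** bits
    centroids = np.linspace(w.min(), w.max(), k)   # linear init
    for _ in range(iters):
        # assign each weight to its nearest centroid, then recenter
        idx = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            if np.any(idx == c):
                centroids[c] = w[idx == c].mean()
    return centroids, idx.reshape(weights.shape)

w = np.random.randn(64, 64).astype(np.float32)
centroids, idx = kmeans_share(w, bits=4)
w_hat = centroids[idx]   # decoded layer: only 16 distinct values
```

After clustering, the centroids themselves are fine-tuned with gradient descent (gradients of weights sharing a centroid are summed), which is the "trained" part of trained quantization.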
![Page 24: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/24.jpg)
Finetune Centroids
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
![Page 25: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/25.jpg)
Accuracy vs. #Bits on 5 Conv Layers + 3 FC Layers
![Page 26: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/26.jpg)
Quantization: Result
• 16 million weight values => 2^4 = 16 shared values
• 8/5-bit quantization results in no accuracy loss
• 8/4-bit quantization results in no top-5 accuracy loss and 0.1% top-1 accuracy loss
• 4/2-bit quantization results in 1.99% top-1 and 2.60% top-5 accuracy loss, not that bad :-)
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
![Page 27: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/27.jpg)
Reduced Precision for Inference
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
![Page 28: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/28.jpg)
Pruning and Quantization Works Well Together
Figure 6: Accuracy vs. compression rate under different compression methods. Pruning and quantization work best when combined.

Figure 7: Pruning doesn't hurt quantization. Dashed: quantization on the unpruned network. Solid: quantization on the pruned network. Accuracy begins to drop at the same number of quantization bits whether or not the network has been pruned. Although pruning reduced the number of parameters, quantization still works as well as in the unpruned network, or even better (the 3-bit case in the left figure).

Figure 8: Accuracy of different initialization methods. Left: top-1 accuracy. Right: top-5 accuracy. Linear initialization gives the best result.
6.2 CENTROID INITIALIZATION
Figure 8 compares the accuracy of the three different initialization methods with respect to top-1 accuracy (left) and top-5 accuracy (right). The network is quantized to 2 ~ 8 bits as shown on the x-axis. Linear initialization outperforms density initialization and random initialization in all cases except at 3 bits.

The initial centroids of linear initialization spread equally across the x-axis, from the min value to the max value. That helps to maintain the large weights, as the large weights play a more important role than smaller ones, which is also shown in network pruning (Han et al., 2015). Neither random nor density-based initialization retains large centroids. With these initialization methods, large weights are clustered to small centroids because there are few large weights. In contrast, linear initialization gives large weights a better chance to form a large centroid.
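The three initialization schemes can be sketched as follows (a simplified illustration; the interior-quantile form of the density-based variant is an assumption, not the paper's exact formula):

```python
import numpy as np

def init_centroids(w, k, method):
    """Three centroid initializations compared in Figure 8 (sketch)."""
    w = w.ravel()
    if method == "linear":    # evenly spaced between min and max weight
        return np.linspace(w.min(), w.max(), k)
    if method == "random":    # k weights sampled at random
        return np.random.choice(w, k, replace=False)
    if method == "density":   # equal-probability interior CDF quantiles
        return np.quantile(w, np.linspace(0.0, 1.0, k + 2)[1:-1])
    raise ValueError(method)

w = np.random.randn(10000)
lin = init_centroids(w, 16, "linear")
# linear init reaches the extremes of the distribution, so the rare
# large-magnitude weights get centroids near them
```

Random and density-based initialization both sample where the distribution is dense (near zero), which is why they fail to reserve centroids for the few large weights.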
6.3 SPEEDUP AND ENERGY EFFICIENCY
Deep Compression targets extremely latency-focused applications running on mobile, which require real-time inference, such as pedestrian detection on an embedded processor inside an autonomous vehicle. Waiting for a batch to assemble significantly adds latency. So when benchmarking…
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
![Page 29: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/29.jpg)
3. Huffman Coding
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
![Page 30: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/30.jpg)
Huffman Coding

Huffman coding is a type of optimal prefix code that is commonly used for lossless data compression. It produces a variable-length code table for encoding source symbols. The table is derived from the occurrence probability of each symbol. As in other entropy encoding methods, more common symbols are represented with fewer bits than less common symbols, saving total space.
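A compact sketch of building such a code table over quantized weight indices with Python's heapq (illustrative, not the paper's encoder):

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code table from symbol frequencies.
    More frequent symbols receive shorter bit strings."""
    counts = Counter(symbols)
    if len(counts) == 1:                       # degenerate: one symbol
        return {next(iter(counts)): "0"}
    # heap entries: [frequency, tie-breaker, partial code table]
    heap = [[f, i, {s: ""}] for i, (s, f) in enumerate(counts.items())]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)        # two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, [f1 + f2, tick, merged])
        tick += 1
    return heap[0][2]

# Quantized weight indices are highly skewed, so Huffman coding helps:
table = huffman_code([0, 0, 0, 0, 1, 1, 2])
# the most common index (0) gets the shortest code
```

Applied after pruning and quantization, this skew is exactly why Huffman coding buys the final compression factor essentially for free: decoding is a table lookup with no accuracy impact.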
![Page 31: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/31.jpg)
Deep Compression Result on 4 Convnets
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
![Page 32: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/32.jpg)
Result: AlexNet
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
![Page 33: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/33.jpg)
AlexNet: Breakdown
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
![Page 34: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/34.jpg)
The Big Gun: New Network Topology + Deep Compression
Iandola et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size
![Page 35: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/35.jpg)
Iandola et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size
![Page 36: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/36.jpg)
520KB model, AlexNet-accuracy
(1) Smaller DNNs require less communication across servers during distributed training.
(2) Smaller DNNs require less bandwidth to export a new model from the cloud to an autonomous car.
(3) Smaller DNNs are more feasible to deploy on FPGAs and other hardware with limited memory.
(4) Smaller DNNs make it easier to download deep learning-powered apps from the App Store.
Iandola et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size
![Page 37: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/37.jpg)
Conclusion
• We have presented a method to compress neural networks without affecting accuracy by finding the right connections and quantizing the weights.
• Prune the unimportant connections => quantize the network and enforce weight sharing => apply Huffman encoding.
• We highlight our experiments on ImageNet: AlexNet's weight storage was reduced by 35× and VGG16's by 49×, without loss of accuracy.
• Now weights can fit in cache.
![Page 38: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/38.jpg)
Product: A Model Compression Tool for Deep Learning Developers
• Easy Version:
  ✓ No training needed
  ✓ Fast
  ✗ 5x-10x compression rate
  ✗ 1% loss of accuracy
• Advanced Version:
  ✓ 35x-50x compression rate
  ✓ No loss of accuracy
  ✗ Training is needed
  ✗ Slow
![Page 39: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/39.jpg)
EIE: Efficient Inference Engine on Compressed Deep Neural Network

Song Han
CVA group, Stanford University, Jan 6, 2015
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
![Page 40: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/40.jpg)
ASIC Accelerator that Runs DNN on Mobile
OfflineNo dependency on network connection
Real TimeNo network delayhigh frame rate
Low PowerHigh energy efficiencythat preserves battery
![Page 41: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/41.jpg)
Solution: Everything on Chip

• We present the sparse, indirectly indexed, weight-shared MxV accelerator.
• Large DNN models fit in on-chip SRAM, giving 120× energy savings.
• EIE exploits the sparsity of activations (30% non-zero).
• EIE works on the compressed model (30x model reduction).
• EIE distributes both storage and computation across multiple PEs, which achieves load balance and good scalability.
• We evaluated EIE on a wide range of deep learning models, including CNNs for object detection and LSTMs for natural language processing and image captioning. We also compare EIE to CPUs, GPUs, and other accelerators.
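The core computation these bullets describe, sparse weight-shared matrix-vector multiplication that skips zero activations, can be sketched in software (a functional model only; the real EIE does this in hardware, with relative row indices and the work distributed across PEs):

```python
import numpy as np

def eie_matvec(n_rows, codebook, w_idx, row_idx, col_ptr, a):
    """y = W @ a with W stored column-wise (CSC) as codebook
    indices rather than real values: weight sharing is the
    codebook lookup, activation sparsity is the zero-skip."""
    y = np.zeros(n_rows)
    for j, aj in enumerate(a):
        if aj == 0.0:          # zero activations trigger no work at all
            continue
        for p in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[p]] += codebook[w_idx[p]] * aj   # decode index
    return y

# W = [[0.5,  0.0, 1.0],
#      [0.0, -0.5, 0.0]]
codebook = np.array([0.0, 0.5, -0.5, 1.0])  # tiny 2-bit codebook for brevity
w_idx   = [1, 2, 3]      # per-nonzero codebook indices
row_idx = [0, 1, 0]      # row of each nonzero
col_ptr = [0, 1, 2, 3]   # column start offsets
y = eie_matvec(2, codebook, w_idx, row_idx, col_ptr, [2.0, 0.0, 1.0])
# y == [2.0, 0.0]; the middle column is skipped entirely
```

Storing columns rather than rows means a single non-zero activation can be broadcast once and consumed by every PE holding a piece of that column.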
![Page 42: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/42.jpg)
Distribute Storage and Processing
PE PE PE PE
PE PE PE PE
PE PE PE PE
PE PE PE PE
Central Control
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
![Page 43: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/43.jpg)
Inside each PE:
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
![Page 44: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/44.jpg)
Evaluation
1. Cycle-accurate C++ simulator. Two abstract methods: Propagate and Update. Used for design space exploration and verification.
2. RTL in Verilog; verified its output against the golden model in ModelSim.
3. Synthesized EIE using the Synopsys Design Compiler (DC) under the TSMC 45nm GP standard-VT library with worst-case PVT corner.
4. Placed and routed the PE using the Synopsys IC Compiler (ICC). We used CACTI to get SRAM area and energy numbers.
5. Annotated the toggle rate from the RTL simulation onto the gate-level netlist, dumped it to switching activity interchange format (SAIF), and estimated the power using PrimeTime PX.
![Page 45: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/45.jpg)
Baseline and Benchmark

• CPU: Intel Core i7 5930k
• GPU: NVIDIA Titan X GPU
• Mobile GPU: NVIDIA Jetson TK1
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
![Page 46: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/46.jpg)
Layout of an EIE PE
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
![Page 47: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/47.jpg)
Result: Speedup / Energy Efficiency
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
![Page 48: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/48.jpg)
Where are the savings from?
• Three factors for energy saving:
• The matrix is compressed by 35×: less work to do, fewer bricks to carry.
• DRAM => SRAM, no need to go off-chip: 120×. Like carrying bricks from SF to SJ instead of SF to NY.
• Sparse activations: 3×. Lighter bricks to carry.
![Page 49: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/49.jpg)
Result: Speedup
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
Unit: µs
![Page 50: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/50.jpg)
Scalability
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
![Page 51: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/51.jpg)
Useful Computation / Load Balance
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
![Page 52: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/52.jpg)
Load Balance
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
![Page 53: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/53.jpg)
Design Space Exploration
![Page 54: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/54.jpg)
Media Coverage: TheNextPlatform
http://www.nextplatform.com/2015/12/08/emergent-chip-vastly-accelerates-deep-neural-networks/
![Page 55: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/55.jpg)
Media Coverage: O’Reilly
https://www.oreilly.com/ideas/compressed-representations-in-the-age-of-big-data
![Page 56: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/56.jpg)
Media Coverage: Hacker News
https://news.ycombinator.com/item?id=10881683
![Page 57: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/57.jpg)
Conclusion
• We present EIE, an energy-efficient engine optimized to operate on compressed deep neural networks.
• By leveraging sparsity in both the activations and the weights, EIE reduces the energy needed to compute a typical FC layer by 3,000×.
• With wrapper logic on top of EIE, 1x1 convolution and 3x3 convolution are possible.
![Page 58: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/58.jpg)
Hardware for Deep Learning
PC Computation => Mobile Computation => Intelligent Mobile Computation => ?
![Page 59: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/59.jpg)
Thank You!
[email protected]
[1] Han et al. "Learning both Weights and Connections for Efficient Neural Networks," NIPS 2015.
[2] Han et al. "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding," ICLR 2016.
[3] Han et al. "EIE: Efficient Inference Engine on Compressed Deep Neural Network," arXiv 2016.
[4] Iandola et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size."
![Page 60: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/60.jpg)
backup slides
![Page 61: "Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford](https://reader031.vdocument.in/reader031/viewer/2022030305/587148311a28ab55588b5de1/html5/thumbnails/61.jpg)
A robot asking the cloud is like a child asking mom: that's not real intelligence.