Deep Learning and Automatic Differentiation from Theano to PyTorch
TRANSCRIPT
![Page 1: Deep Learning and Automatic Differentiation from Theano to PyTorch](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647b5f7f8b9a3b568b4967/html5/thumbnails/1.jpg)
About me
Postdoctoral researcher, University of Oxford
● Working with Frank Wood, Department of Engineering Science
● Working on probabilistic programming and its applications in science (high-energy physics and HPC)

Long-term interests:
● Automatic (algorithmic) differentiation (e.g., http://diffsharp.github.io )
● Evolutionary algorithms
● Computational physics
Deep learning
Deep learning
A reincarnation/rebranding of artificial neural networks, with roots in
● Threshold logic (McCulloch & Pitts, 1943)
● Hebbian learning (Hebb, 1949)
● Perceptron (Rosenblatt, 1957)
● Backpropagation in NNs (Werbos, 1975; Rumelhart, Hinton & Williams, 1986)
Frank Rosenblatt with the Mark I Perceptron, holding a set of neural network weights
Deep learning
State of the art in computer vision
● ImageNet classification with deep convolutional neural networks (Krizhevsky et al., 2012)
○ Halved the error rate achieved with pre-deep-learning methods
● Replacing hand-engineered features
● Modern systems surpass human performance

Faster R-CNN (Ren et al., 2015)
Top-5 error rate for ImageNet: https://devblogs.nvidia.com/parallelforall/mocha-jl-deep-learning-julia/
Deep learning
State of the art in speech recognition
● Seminal work by Hinton et al. (2012)
○ First major industrial application of deep learning
● Replacing HMM-GMM-based models

Recognition word error rates (Huang et al., 2014)
Deep learning
State of the art in machine translation
● Based on RNNs and CNNs (Bahdanau et al., 2015; Wu et al., 2016)
● Replacing statistical translation with engineered features
● Google (Sep 2016), Microsoft (Nov 2016), Facebook (Aug 2017) moved to neural machine translation
https://techcrunch.com/2017/08/03/facebook-finishes-its-move-to-neural-machine-translation/

Google Neural Machine Translation System (GNMT)
What makes deep learning tick?
Deep neural networks
An artificial “neuron” is loosely based on the biological one
● Receives a set of weighted inputs (dendrites)
● Integrates and transforms them (cell body)
● Passes the output onward (axon)
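A single neuron of this kind can be sketched in a few lines of Python; the sigmoid activation here is an illustrative choice, not one the talk prescribes:

```python
import math

def neuron(weights, inputs, bias):
    # Weighted inputs are summed with a bias (integration, cell body)
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # A nonlinear activation (sigmoid here) produces the output (axon)
    return 1.0 / (1.0 + math.exp(-z))
```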
Deep neural networks
Three main building blocks
● Feedforward (Rosenblatt, 1957)
● Convolutional (LeCun et al., 1989)
● Recurrent (Hopfield, 1982; Hochreiter & Schmidhuber, 1997)
Deep neural networks
Newer additions introduce (differentiable) algorithmic elements
● Neural Turing Machine (Graves et al., 2014)
○ Can infer algorithms: copy, sort, recall
● Stack-augmented RNN (Joulin & Mikolov, 2015)
● End-to-end memory network (Sukhbaatar et al., 2015)
● Stack, queue, deque (Grefenstette et al., 2015)
● Discrete interfaces (Zaremba & Sutskever, 2015)
Neural Turing Machine on copy task (Graves et al., 2014)
Deep neural networks
● Deep, distributed representations learned end-to-end (also called “representation learning”)
○ From raw pixels to object labels
○ From raw waveforms of speech to text

(Masuch, 2015)
Deep neural networks
Deeper is better (Bengio, 2009)
● Deep architectures consistently beat shallow ones
● Shallow architectures are exponentially inefficient for the same performance
● Hierarchical, distributed representations achieve non-local generalization
Data
● Deep learning needs massive amounts of data
○ Data need to be labeled if performing supervised learning
● A rough rule of thumb: deep learning “will generally match or exceed human performance with a dataset containing at least 10 million labeled examples” (the Deep Learning book, Goodfellow et al., 2016)
Dataset sizes (Goodfellow et al., 2016)
GPUs
Layers of NNs are conveniently expressed as series of matrix multiplications
One input vector, n neurons of k inputs
A batch of m input vectors
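The “n neurons of k inputs, batch of m input vectors” formulation amounts to a matrix product; a plain-Python sketch (a real framework would hand this to BLAS):

```python
def linear_layer(X, W, b):
    """One NN layer as a matrix multiplication.

    X: m x k batch of input vectors, W: k x n weights (n neurons of
    k inputs each), b: length-n bias. Returns the m x n output batch."""
    m, k, n = len(X), len(W), len(W[0])
    return [[sum(X[i][t] * W[t][j] for t in range(k)) + b[j]
             for j in range(n)]
            for i in range(m)]
```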
GPUs
● BLAS (mainly GEMM) is at the heart of mainstream deep learning, commonly running on off-the-shelf graphics processing units
● Rapid adoption after
○ Nvidia released CUDA (2007)
○ Raina et al. (2009) and Ciresan et al. (2010)
● ASICs such as tensor processing units (TPUs) are being introduced
○ As low as 8-bit precision, better power efficiency

Nvidia Titan Xp (2017); Google Cloud TPU server
Deep learning frameworks
● Modern tools make it extremely easy to implement / reuse models
● Off-the-shelf components
○ Simple: linear, convolutional, recurrent layers
○ Complex: compositions of complex models (e.g., CNN + RNN)
● Base frameworks: Torch (2002), Theano (2011), Caffe (2014), TensorFlow (2015), PyTorch (2016)
● Higher-level model-building libraries: Keras (Theano & TensorFlow), Lasagne (Theano)
● High-performance low-level bindings: BLAS, Intel MKL, CUDA, Magma
Learning: gradient-based optimization
Loss function

Parameter update: stochastic gradient descent (SGD)

In practice we use SGD or adaptive-learning-rate varieties such as Adam and RMSProp

(Ruder, 2017) http://ruder.io/optimizing-gradient-descent/
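The update rule can be sketched as follows, minimizing a toy quadratic loss; the learning rate and step count are arbitrary illustrative values:

```python
def sgd(grad, theta, lr=0.1, steps=100):
    # Repeated parameter update: theta <- theta - lr * dL/dtheta
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Toy loss L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
theta_star = sgd(lambda t: 2.0 * (t - 3.0), theta=0.0)
```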
Deep learning
Neural networks + Data + Gradient-based optimization
We need derivatives
How do we compute derivatives?
Manual
● Calculus 101, rules of differentiation
...
Manual
● Analytical derivatives are needed for theoretical insight
○ Analytic solutions, proofs
○ Mathematical analysis, e.g., stability of fixed points
● They are unnecessary when we just need numerical derivatives for optimization
● Until very recently, machine learning looked like this:
Symbolic derivatives
● Symbolic computation with Mathematica, Maple, Maxima, and also with deep learning frameworks such as Theano
● Main issue: expression swell
Symbolic derivatives
● Symbolic computation with, e.g., Mathematica, Maple, Maxima
● Main limitation: only applicable to closed-form expressions

You can find the derivative of:

But not of:

In deep learning, symbolic graph builders such as Theano and TensorFlow face issues with control flow, loops, recursion
Numerical differentiation
Numerical differentiation
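The idea (here a central finite difference, a standard choice) can be sketched as below; note the trade-off between truncation error, which shrinks with h, and floating-point round-off, which grows as h shrinks:

```python
def central_diff(f, x, h=1e-5):
    # f'(x) ~ (f(x + h) - f(x - h)) / (2h), with O(h^2) truncation error
    return (f(x + h) - f(x - h)) / (2.0 * h)
```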
Automatic differentiation
● Small but established subfield of scientific computing: http://www.autodiff.org/
● Traditional application domains:
○ Computational fluid dynamics
○ Atmospheric sciences
○ Engineering design optimization
○ Computational finance

AD has shared roots with the backpropagation algorithm for neural networks, but it is more general
Automatic differentiation
Automatic differentiation
Exact derivatives, not an approximation
Automatic differentiation
AD has two main modes:

● Forward mode: straightforward

● Reverse mode: slightly less intuitive on first encounter
Forward mode
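Forward mode can be sketched with dual numbers: each value carries a tangent that every operation updates by the chain rule. This is a minimal illustration supporting only +, *, and sin, not a full implementation:

```python
import math

class Dual:
    """A dual number a + b*eps with eps^2 = 0; .dot carries the derivative."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def sin(x):
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

# d/dx [x * sin(x) + x] at x = 2: seed the tangent with 1
x = Dual(2.0, 1.0)
y = x * sin(x) + x
```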
Reverse mode
More on this in the afternoon exercise session
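The mechanics can be sketched in a few lines: the forward pass records, for each result, its parents and the local partial derivatives; backward() then propagates adjoints in reverse. (A production implementation traverses a topologically sorted tape rather than recursing, but the gradients here are correct for small graphs.)

```python
class Var:
    """Minimal reverse-mode AD variable supporting + and * only."""
    def __init__(self, val, parents=()):
        self.val = val
        self.parents = parents   # (parent, local derivative) pairs
        self.grad = 0.0
    def __add__(self, other):
        return Var(self.val + other.val, [(self, 1.0), (other, 1.0)])
    def __mul__(self, other):
        return Var(self.val * other.val,
                   [(self, other.val), (other, self.val)])
    def backward(self, adjoint=1.0):
        # Chain rule: accumulate this node's adjoint, then push it
        # to each parent weighted by the local derivative
        self.grad += adjoint
        for parent, local in self.parents:
            parent.backward(adjoint * local)

x = Var(3.0)
y = Var(2.0)
z = x * y + x          # z = xy + x, so dz/dx = y + 1, dz/dy = x
z.backward()
```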
Forward vs reverse
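The trade-off between the two modes, stated here as the standard result from the AD literature (the slide's own figures did not survive extraction): for a function f from n inputs to m outputs,

```latex
\text{Forward mode: one sweep gives a Jacobian-vector product } \dot{y} = J_f\,\dot{x},
\qquad J_f \in \mathbb{R}^{m \times n},
\text{ so the full Jacobian needs } n \text{ sweeps.}
\text{Reverse mode: one sweep gives a vector-Jacobian product } \bar{x} = J_f^{\top}\bar{y},
\text{ so the full Jacobian needs } m \text{ sweeps.}
```

For a scalar loss (m = 1) with millions of parameters (large n), a single reverse sweep yields the whole gradient, which is why reverse mode, i.e. backpropagation, dominates deep learning.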
Derivatives and deep learning frameworks
Deep learning frameworks
Two main families:
● Symbolic graph builders
● Dynamic graph builders (general-purpose AD)
Symbolic graph builders
● Implement models using symbolic placeholders, in a mini-language
● Severely limited (and unintuitive) control flow and expressivity
● The graph gets “compiled” to take care of expression swell
Graph compilation in Theano
Symbolic graph builders
Theano, TensorFlow, CNTK
Dynamic graph builders (general-purpose AD)
● Use general-purpose automatic differentiation via operator overloading
● Implement models as regular programs, with full support for control flow, branching, loops, recursion, and procedure calls
Dynamic graph builders (general-purpose AD)
● autograd (Python) by Harvard Intelligent Probabilistic Systems Group: https://github.com/HIPS/autograd
● torch-autograd by Twitter Cortex: https://github.com/twitter/torch-autograd
● PyTorch: http://pytorch.org/ , a general-purpose AD system that allows you to implement models as regular Python programs (more on this in the exercise session)
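What “regular program” buys you can be sketched as follows (assuming a PyTorch installation; the data-dependent loop is a made-up example, not one from the talk):

```python
import torch

def f(x):
    # Ordinary Python control flow: the loop count depends on the data,
    # and the graph is rebuilt dynamically on every call
    y = x
    while y.norm() < 10.0:
        y = y * 2.0
    return y.sum()

x = torch.tensor([1.0, 2.0], requires_grad=True)
out = f(x)
out.backward()   # reverse-mode AD over the trace that actually executed
```

For this input the loop doubles three times, so out is 8 * (1 + 2) and x.grad is [8, 8].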
Summary
Summary
● Neural networks + data + gradient descent = deep learning
● General-purpose AD is the future of deep learning
● More distance to cover:
○ Implementations better than operator overloading exist in the AD literature (source transformation) but not in machine learning
○ Forward AD is not currently present in any mainstream machine learning framework
○ Nesting of forward and reverse AD enables efficient higher-order derivatives such as Hessian-vector products
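The Hessian-vector product point can be sketched by nesting one AD pass inside another; this hypothetical example (not code from the talk) nests reverse over reverse in PyTorch, where create_graph=True makes the first backward pass itself differentiable:

```python
import torch

def hvp(f, x, v):
    # First reverse pass: gradient, kept differentiable
    (g,) = torch.autograd.grad(f(x), x, create_graph=True)
    # Second reverse pass through g . v yields H v without forming H
    (hv,) = torch.autograd.grad(g @ v, x)
    return hv

x = torch.tensor([1.0, 2.0], requires_grad=True)
v = torch.tensor([1.0, 0.0])
# f(x) = x0^2 * x1 has Hessian [[2*x1, 2*x0], [2*x0, 0]]
result = hvp(lambda x: x[0] ** 2 * x[1], x, v)
```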
Thank you!
References

Baydin, A.G., Pearlmutter, B.A., Radul, A.A. and Siskind, J.M., 2015. Automatic differentiation in machine learning: a survey. arXiv preprint arXiv:1502.05767.

Baydin, A.G., Pearlmutter, B.A. and Siskind, J.M., 2016. Tricks from deep learning. In 7th International Conference on Algorithmic Differentiation, Christ Church Oxford, UK, September 12–15, 2016.

Baydin, A.G., Pearlmutter, B.A. and Siskind, J.M., 2016. DiffSharp: an AD library for .NET languages. In 7th International Conference on Algorithmic Differentiation, Christ Church Oxford, UK, September 12–15, 2016.

Maclaurin, D., Duvenaud, D. and Adams, R., 2015. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning (pp. 2113–2122).

Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W. et al., 2016. Google's neural machine translation system: bridging the gap between human and machine translation. Technical report.

Gehring, J., Auli, M., Grangier, D., Yarats, D. and Dauphin, Y.N., 2017. Convolutional sequence to sequence learning. arXiv preprint.

Bahdanau, D., Cho, K. and Bengio, Y., 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.

Huang, X., Baker, J. and Reddy, R., 2014. A historical perspective of speech recognition. Communications of the ACM, 57(1), pp. 94–103.

Bengio, Y., 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), pp. 1–127.