Optimization I - DLRL Summer School
TRANSCRIPT
Tutorial on:
Optimization I(from a deep learning perspective)
Jimmy Ba
Outline
● Random search vs. gradient descent
● Finding better search directions
● Design “white-box” optimization methods to improve computation efficiency
Neural networks
Consider an input and output pair (x,y) from a set of N training examples.
Neural networks are parameterized functions that map an input x to an output prediction ŷ using d weights w.
Compare the output prediction with the correct answer to obtain a loss for the current weights:
● input: x
● target: y
● prediction: ŷ = f(x, w)
● loss: ℓ(ŷ, y)
Averaged loss over the N examples (shorthand L(w)): L(w) = (1/N) Σᵢ ℓ(f(xᵢ, w), yᵢ).
Why is learning difficult
Neural networks are non-convex.
(Figure: convex vs. non-convex sets; convex vs. non-convex functions.)
Many local optima.
Why is learning difficult
Neural networks are non-convex.
Neural networks are not smooth.
(Figure: a Lipschitz-continuous function vs. one that is not.)
Let L be the Lipschitz constant: |f(w₁) − f(w₂)| ≤ L‖w₁ − w₂‖.
The change in f is smooth and bounded.
Why is learning difficult
Neural networks are non-convex.
Neural networks are not smooth.
Millions of trainable weights.
● ImageNet: ~5 million weights
● Neural Machine Translation: ~100 million weights
● AlphaGo: ~40 million weights
How to train neural networks with random search
Consider perturbing the weights of a neural network by a random vector ε.
Evaluate the perturbed averaged loss L(w + ε) over the training examples.
Keep the perturbation if the loss improves; discard it otherwise.
Repeat.
How to train neural networks with random search
Consider perturbing the weights of a neural network by a random vector ε.
Evaluate the perturbed averaged loss L(w + ε) over the training examples.
Add the perturbation, weighted by the perturbed loss, to the current weights.
Repeat.
How to train neural networks with random search
Consider perturbing the weights of a neural network by M random vectors ε_1, …, ε_M.
Evaluate the perturbed averaged losses over the training examples.
Add the perturbations, each weighted by its perturbed loss, to the current weights.
Repeat.
Sources: Salimans et al., 2017; Flaxman et al., 2005; Agarwal et al., 2010; Nesterov, 2011; Berahas et al., 2018
This family of algorithms has many names: evolution strategies, zeroth-order methods, derivative-free methods.
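The weighted-perturbation update described above can be sketched as a minimal evolution-strategy loop. A toy quadratic stands in for the network loss, and all names and hyperparameters here are illustrative:

```python
import numpy as np

def es_step(w, loss_fn, sigma=0.1, lr=0.05, M=50, rng=np.random):
    """One random-search update: average M random perturbations,
    each weighted by an antithetic finite-difference loss estimate."""
    grad_est = np.zeros_like(w)
    for _ in range(M):
        eps = rng.randn(*w.shape)
        # antithetic pair (w + sigma*eps, w - sigma*eps) reduces variance
        delta = (loss_fn(w + sigma * eps) - loss_fn(w - sigma * eps)) / (2 * sigma)
        grad_est += delta * eps
    grad_est /= M
    return w - lr * grad_est          # move against the estimated gradient

# toy problem: "train" 10 weights to minimize a quadratic loss
rng = np.random.RandomState(0)
loss = lambda w: float(np.sum(w ** 2))
w = rng.randn(10)
for _ in range(200):
    w = es_step(w, loss, rng=rng)
```

Each perturbation contributes a directional-derivative estimate, and averaging M of them aggregates a search direction, which is why this behaves like a stochastic finite-difference gradient method.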
How well does random search work?
Training a 784-512-512-10 neural network on MNIST (0.5 million weights):
● 1600 samples vs. 5000 samples vs. backprop
Source: Wen et al., 2018
Why does random search work?
(Figure: the aggregated search direction in the (w1, w2) plane.)
Random search approximates a finite-difference gradient with stochastic samples: each perturbation gives a directional derivative.
This seems really inefficient w.r.t. the number of weights. (We come back to this later.)
Gradient descent and back-propagation
Using random search, computing d directional gradients requires d forward propagations.
Although this can be done in parallel, it would be more efficient if we could directly query gradient information.
We can obtain the full gradient of a continuous loss function more efficiently by back-propagation.
Gradient descent
To justify our search direction choice, we formulate the following optimization problem:
Consider a small change Δw to the weights that minimizes the averaged loss.
The small change is formulated as a constraint in Euclidean space: ‖Δw‖₂² ≤ ε.
Linearize the loss function and solve the Lagrangian to get the update rule: w_{t+1} = w_t − η ∇L(w_t).
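Written out (a standard derivation, with the notation used above), the constrained problem and its solution are:

```latex
% Linearized objective with a Euclidean trust region:
\min_{\Delta w} \; L(w) + \nabla L(w)^\top \Delta w
\quad \text{s.t.} \quad \|\Delta w\|_2^2 \le \epsilon
% Lagrangian:
\mathcal{L}(\Delta w, \lambda) = L(w) + \nabla L(w)^\top \Delta w
  + \lambda \left( \|\Delta w\|_2^2 - \epsilon \right)
% Stationarity gives \nabla L(w) + 2\lambda \Delta w = 0, hence
\Delta w = -\tfrac{1}{2\lambda} \nabla L(w)
\;\Rightarrow\;
w_{t+1} = w_t - \eta \, \nabla L(w_t), \qquad \eta = \tfrac{1}{2\lambda}
```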
How well does gradient descent work?
(Figure: gradient descent path in the (w1, w2) plane.)
Let L be the Lipschitz constant of the gradient (i.e., the loss is L-smooth).
Assuming convexity, gradient descent converges at an O(1/T) rate.
How well does gradient descent work?
(Figure: a zig-zagging gradient descent path in the (w1, w2) plane.)
Gradient descent can be inefficient under ill-conditioned curvature.
How to improve gradient descent?
Momentum: smooth gradient with moving average
Keeping a running average of the gradient updates can smooth out the zig-zag behavior.
Nesterov's accelerated gradient (NAG) adds a lookahead trick: evaluate the gradient at the point momentum is about to carry the weights to.
Assuming convexity, NAG obtains an O(1/T²) rate.
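The lookahead trick can be sketched in a few lines; the quadratic test problem and all names here are illustrative:

```python
import numpy as np

def nag_step(w, v, grad_fn, lr=0.01, mu=0.9):
    """Nesterov's accelerated gradient: take the gradient at the
    lookahead point w + mu*v rather than at w (the lookahead trick)."""
    g = grad_fn(w + mu * v)
    v = mu * v - lr * g
    return w + v, v

# ill-conditioned quadratic: f(w) = 0.5 * (w1^2 + 100 * w2^2)
grad = lambda w: np.array([1.0, 100.0]) * w
w, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(300):
    w, v = nag_step(w, v, grad)
```

Plain gradient descent with this learning rate zig-zags along the high-curvature coordinate; the momentum buffer v smooths those oscillations out.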
Stochastic gradient descent: improve efficiency
Computing the averaged loss requires going through the entire training dataset.
Maybe we do not have to go over millions of images for a small weight update.
Averaged loss over the N data points: L(w) = (1/N) Σᵢ ℓ(f(xᵢ, w), yᵢ).
Mini-batch loss: sample B points from the whole training set and average the loss over just those B points.
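A minimal mini-batch SGD loop, sketched on a toy linear-regression problem (the data, model, and hyperparameters are stand-ins for a real network):

```python
import numpy as np

rng = np.random.RandomState(0)
N, d, B = 1000, 5, 32                 # dataset size, weights, batch size
X = rng.randn(N, d)
w_true = rng.randn(d)
y = X @ w_true                        # noiseless targets for the toy problem

w = np.zeros(d)
for t in range(500):
    idx = rng.choice(N, size=B, replace=False)   # sample B of the N points
    Xb, yb = X[idx], y[idx]
    grad = 2.0 / B * Xb.T @ (Xb @ w - yb)        # gradient of mini-batch MSE
    w -= 0.01 * grad                             # SGD step
```

Each step touches only B examples, yet the mini-batch gradient is an unbiased estimate of the full-dataset gradient.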
Stochastic gradient descent: improve efficiency
Mini-batch loss: sample B points from the whole training set.
Without loss of generality, assume strong convexity, B = 1, and a learning rate decayed as η_t ∝ 1/t.
SGD then obtains a convergence rate of O(1/T).
Are we making any improvement over random search at all?
Yes! Random search with two-point methods has a rate of O(√(d/T)) (Duchi et al., 2015): a factor of √d worse in the number of weights.
Constrained optimization problems for different gradient-based algorithms:
Revisit gradient descent
Objective: minimize the linearized loss ∇L(w)ᵀ Δw over the small change Δw.
Gradient descent: subject to the Euclidean constraint ‖Δw‖₂² ≤ ε.
Are there other constraints we can use here?
Natural gradient (Amari, 1998): subject to the KL constraint KL(p(y|x, w) ‖ p(y|x, w + Δw)) ≤ ε.
Natural gradient descent
Again, let us consider a small change to the weights that will minimize the averaged loss w.r.t. the weights.
The small change is formulated as a constraint in probability space using the KL divergence: KL(p(y|x, w) ‖ p(y|x, w + Δw)) ≤ ε.
Linearize the model and take the second-order Taylor expansion of the KL around the current weights.
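Written out (a standard derivation, paralleling the Euclidean case above):

```latex
% KL-constrained step:
\min_{\Delta w} \; \nabla L(w)^\top \Delta w
\quad \text{s.t.} \quad
\mathrm{KL}\!\left( p(y|x, w) \,\|\, p(y|x, w + \Delta w) \right) \le \epsilon
% Second-order Taylor expansion of the KL around w:
\mathrm{KL} \approx \tfrac{1}{2} \Delta w^\top F \, \Delta w,
\qquad
F = \mathbb{E}\!\left[ \nabla_w \log p(y|x,w) \, \nabla_w \log p(y|x,w)^\top \right]
% Solving the Lagrangian gives the natural-gradient update:
w_{t+1} = w_t - \eta \, F^{-1} \nabla L(w_t)
```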
Fisher information matrix: F = E[∇ log p(y|x, w) ∇ log p(y|x, w)ᵀ].
Geodesic perspective: distance is measured in weight space vs. in model output space, where the Fisher information matrix defines the metric.
When first-order methods fail
Better gradient estimation does not overcome correlated, ill-conditioned curvature in the weight space.
(Figure: SGD path in the (w1, w2) plane.)
Second-order optimization algorithms
(Figure: the corrected path in the (w1, w2) plane.)
Natural gradient corrects for the ill-conditioned curvature using a preconditioning matrix (Amari, 1998).
Question: how to find the best preconditioning matrix?
Constrained optimization problems for different gradient-based algorithms:
Second-order method algorithms
Objective: minimize the linearized loss ∇L(w)ᵀ Δw over the small change Δw.
Gradient descent: Euclidean constraint ‖Δw‖₂² ≤ ε.
Natural gradient: KL constraint, locally ½ Δwᵀ F Δw ≤ ε.
Family of second-order approximations: ½ Δwᵀ A Δw ≤ ε for a preconditioning matrix A.
Intuitively, we would like to find a preconditioning matrix A that focuses on the directions of weight space that have not been explored much by the previous updates.
Consider the following optimization problem to find the preconditioner:
Find a good preconditioning matrix
In practice, most algorithms of this kind (AdaGrad, Adam, RMSProp, TONGA) use a diagonal approximation to make the preconditioning tractable.
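The diagonal approximation keeps only per-weight statistics, so the preconditioner is O(d) to store and "invert". A sketch of this idea in an AdaGrad-style update (illustrative, not any paper's exact pseudocode):

```python
import numpy as np

def adagrad_step(w, g, hist, lr=0.5, eps=1e-8):
    """Diagonal preconditioning: scale each coordinate's gradient by the
    inverse root of its accumulated squared gradients (AdaGrad-style)."""
    hist = hist + g ** 2                    # per-weight statistics only: O(d)
    w = w - lr * g / (np.sqrt(hist) + eps)
    return w, hist

# toy ill-conditioned quadratic: the gradient is (1 * w1, 100 * w2)
curv = np.array([1.0, 100.0])
w, hist = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(500):
    w, hist = adagrad_step(w, curv * w, hist)
```

On this toy problem both coordinates converge at the same rate despite the 100x curvature gap: the diagonal preconditioner normalizes away each weight's gradient scale.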
When first-order methods work well
SGD takes noisy gradient steps.
(Figure: SGD path in the (w1, w2) plane; the gradient has small variance along some directions and large variance along others.)
Adam: adapt the learning rate for each parameter differently (Kingma and Ba, ICLR 2015).
(Figure: the Adam path in the (w1, w2) plane; Kingma and Ba, ICLR 2015.)
Adam speeds up learning for parameters with low-variance gradients and slows down learning along high-variance directions.
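A minimal Adam update, following the moment estimates and bias correction of Kingma and Ba (2015); the learning rate and the toy quadratic are illustrative choices:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step: a momentum-smoothed gradient m, rescaled per
    parameter by a running estimate v of the squared-gradient magnitude."""
    m = b1 * m + (1 - b1) * g            # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * g ** 2       # second moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)            # bias correction for zero init
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# the same ill-conditioned quadratic: the gradient is (1 * w1, 100 * w2)
curv = np.array([1.0, 100.0])
w, m, v = np.array([1.0, 1.0]), np.zeros(2), np.zeros(2)
for t in range(1, 3001):
    w, m, v = adam_step(w, curv * w, m, v, t)
```

Because each step is normalized by √v_hat, both coordinates take steps of roughly the same size despite the 100x difference in gradient magnitude.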
When first-order methods fail
We need the full curvature matrix to correct for the correlations among the weights.
Learning on a single machine
(Figure: a single machine draws a random mini-batch of batch size (Bz) B from the N data points, computes the noisy gradient of the loss on it, and applies SGD updates to the model.)
Commonly used SGD variants: AdaGrad (Duchi et al., COLT 2010), Adam (Kingma and Ba, ICLR 2015), and Nesterov's momentum (Sutskever et al., ICML 2013).
Distributed learning
(Figure: several workers each draw a mini-batch (Bz 256, 1024, 2048) from the N data points and contribute to the SGD updates. Plotted against the number of training examples consumed, larger batches give better gradient estimation but diminishing returns.)
Where does the speedup come from?
1. More accurate gradient estimation.
2. More frequent updates.
Amazon EC2 instance cost
Here is the training plot of a state-of-the-art ResNet trained on 8 GPUs:
Scalability of the "black-box" optimization algorithms:
● Scalability with additional computational resources: SGD poor, natural gradient great.
● Scalability with the number of weights: SGD great, natural gradient poor.
Two problems with the natural gradient:
1. Memory inefficient.
2. Inverse intractable.
Background: Natural gradient for neural networks
● Insight: The flattened gradient matrix in a layer has a Kronecker product structure.
Gradient computation: for a layer with input activations a and back-propagated derivatives g, the weight gradient is the outer product g aᵀ.
This is equivalent to a Kronecker product on the flattened gradient: vec(g aᵀ) = a ⊗ g.
Kronecker product definition: A ⊗ B replaces each entry a_{ij} of A with the block a_{ij} B.
● Tile the larger matrix with rescaled copies of the smaller matrix.
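The Kronecker structure rests on the vec identity vec(A X Bᵀ) = (A ⊗ B) vec(X) (stated here for NumPy's row-major flattening). A quick numerical check:

```python
import numpy as np

rng = np.random.RandomState(0)
A, X, B = rng.randn(3, 3), rng.randn(3, 4), rng.randn(4, 4)

# with NumPy's row-major reshape(-1), vec(A X B^T) == (A kron B) vec(X)
lhs = (A @ X @ B.T).reshape(-1)
rhs = np.kron(A, B) @ X.reshape(-1)
assert np.allclose(lhs, rhs)
```

Note the convention matters: with column-major vec the same identity reads vec(A X B) = (Bᵀ ⊗ A) vec(X).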
The Fisher information matrix of a layer inherits this Kronecker structure: F = E[(a ⊗ g)(a ⊗ g)ᵀ].
Background: Kronecker-factored natural gradient
Approximate the layer's Fisher information matrix as a Kronecker product of two small factors: E[(a ⊗ g)(a ⊗ g)ᵀ] ≈ E[aaᵀ] ⊗ E[ggᵀ] (Martens and Grosse, ICML 2015).
Storing the two small Kronecker factors E[aaᵀ] and E[ggᵀ] instead of the full Fisher matrix is memory efficient (Martens and Grosse, ICML 2015).
The inverse also becomes tractable: (A ⊗ S)⁻¹ = A⁻¹ ⊗ S⁻¹, so only the two small Kronecker factors need to be inverted.
Problem: it is still computationally expensive compared to SGD.
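The tractable inverse can be checked numerically; the factor matrices below are random SPD stand-ins for the activation and gradient covariances (row-major vec convention, as before):

```python
import numpy as np

rng = np.random.RandomState(0)
n, m = 4, 3
# symmetric positive-definite stand-ins for the two covariance factors
A = np.eye(n) + 0.1 * rng.randn(n, n); A = A @ A.T   # input-activation stats
S = np.eye(m) + 0.1 * rng.randn(m, m); S = S @ S.T   # output-gradient stats
G = rng.randn(n, m)                                  # a layer's gradient

# naive: build the full (n*m x n*m) Kronecker-factored Fisher and solve
naive = np.linalg.solve(np.kron(A, S), G.reshape(-1))

# factored trick: (A kron S)^-1 vec(G) = vec(A^-1 G S^-1), small inverses only
fast = (np.linalg.solve(A, G) @ np.linalg.inv(S)).reshape(-1)
assert np.allclose(naive, fast)
```

For a layer with n·m weights this replaces one (n·m)×(n·m) inverse with an n×n and an m×m one, which is what makes the preconditioner practical.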
Distributed K-FAC natural gradient
Ba et al., ICLR 2017
(Figure: gradient workers compute mini-batch gradients, stats workers accumulate the Kronecker-factor statistics, and inverse workers invert the factors; the natural-gradient K-FAC updates are then applied to the model.)
Scalability experiments (Ba et al., ICLR 2017)
Scientific question: how well does distributed K-FAC scale with more parallel computational resources compared to distributed SGD?
(Figure: loss vs. the number of training examples consumed at batch sizes 256, 1024, and 2048. Distributed SGD shows better gradient estimation but diminishing returns at larger batches; distributed K-FAC achieves an almost linear speedup.)
Thank you