Lecture 7: Training Neural Networks, Part I (cs231n.stanford.edu/slides/2020/lecture_7.pdf)
TRANSCRIPT
Lecture 7: Training Neural Networks, Part I
Administrative: Project Proposal
Due yesterday, 4/27 on GradeScope
1 person per group needs to submit, but tag all group members
Administrative: A2
A2 is out, due Wednesday 5/6
We recommend using Colab for the assignment, especially if your local machine uses Windows
Where we are now...
Computational graphs
[Figure: inputs x and W feed a multiply node (*) producing scores s; the scores feed a hinge loss, which is added (+) to a regularization term R to give the loss L]
Where we are now...
Neural Networks
Linear score function: f = Wx
2-layer Neural Network: f = W2 max(0, W1 x)
(x: 3072-dim input -> hidden layer h of size 100 via W1 -> scores s of size 10 via W2)
Where we are now...
Convolutional Neural Networks
[Illustration of LeCun et al. 1998, from CS231n 2017 Lecture 1]
Where we are now... Convolutional Layer
[Figure: a 32x32x3 image convolved with a 5x5x3 filter, slid over all spatial locations, produces a 28x28x1 activation map]
Where we are now... Convolutional Layer
For example, if we had 6 5x5 filters, we'll get 6 separate activation maps (each 28x28).
We stack these up to get a "new image" of size 28x28x6!
Where we are now...
Landscape image is CC0 1.0 public domain. Walking man image is CC0 1.0 public domain.
Learning network parameters through optimization
Where we are now...
Mini-batch SGD
Loop:
1. Sample a batch of data
2. Forward prop it through the graph (network), get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient
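A minimal sketch of this loop in numpy (illustrative only, not the course code): a linear softmax classifier trained on random stand-in data, with the four steps marked.

```python
# Mini-batch SGD sketch: a linear classifier on random data.
import numpy as np

np.random.seed(0)
N, D, C = 1000, 3072, 10                  # dataset size, input dim, classes
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)
W = 0.01 * np.random.randn(D, C)          # parameters
lr, batch_size = 1e-3, 64

for it in range(100):
    # 1. Sample a batch of data
    idx = np.random.choice(N, batch_size, replace=False)
    xb, yb = X[idx], y[idx]
    # 2. Forward prop: scores -> softmax cross-entropy loss
    scores = xb.dot(W)
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(batch_size), yb]).mean()
    # 3. Backprop to calculate the gradient dL/dW
    dscores = probs
    dscores[np.arange(batch_size), yb] -= 1
    dW = xb.T.dot(dscores) / batch_size
    # 4. Update the parameters using the gradient
    W -= lr * dW
```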
Where we are now...
Hardware + Software: PyTorch, TensorFlow
Next: Training Neural Networks
Overview
1. One time setup: activation functions, preprocessing, weight initialization, regularization, gradient checking
2. Training dynamics: transfer learning, babysitting the learning process, parameter updates, hyperparameter optimization
3. Evaluation: model ensembles, test-time augmentation
Part 1
- Activation Functions
- Data Preprocessing
- Weight Initialization
- Batch Normalization
- Transfer learning
Activation Functions
Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU
Sigmoid: σ(x) = 1 / (1 + e^-x)
- Squashes numbers to range [0,1]
- Historically popular since they have a nice interpretation as a saturating "firing rate" of a neuron
Sigmoid, problem 1 of 3: Saturated neurons "kill" the gradients
sigmoid gate: x -> σ(x), with the upstream gradient flowing back through the local gradient dσ/dx
What happens when x = -10? σ saturates near 0, the local gradient is ~0, and the gradient is "killed".
What happens when x = 0? The local gradient is at its maximum (1/4), and the gradient flows through.
What happens when x = 10? σ saturates near 1, the local gradient is again ~0, and the gradient is "killed".
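A quick numeric check of this (a sketch, not slide code): the local sigmoid gradient σ(x)(1 - σ(x)) at the three inputs above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in (-10.0, 0.0, 10.0):
    s = sigmoid(x)
    local_grad = s * (1 - s)          # d(sigma)/dx
    print(f"x = {x:6.1f}  sigma(x) = {s:.5f}  local gradient = {local_grad:.2e}")
```

At x = 0 the local gradient is 0.25; at x = ±10 it is on the order of 1e-5, so essentially no gradient flows back.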
Sigmoid, problem 2 of 3: Sigmoid outputs are not zero-centered
Consider what happens when the input to a neuron is always positive... f(Σi wi xi + b)
What can we say about the gradients on w?
We know that the local gradient of sigmoid is always positive.
We are assuming x is always positive.
So!! The sign of the gradient for all wi is the same as the sign of the upstream scalar gradient!
So the gradients on w are always all positive or all negative :(
(For a single element! Minibatches help)
[Figure: in 2D, the allowed gradient update directions cover only the all-positive and all-negative quadrants, so reaching a hypothetical optimal w vector requires an inefficient zig-zag path]
Sigmoid: 3 problems (summary)
1. Saturated neurons "kill" the gradients
2. Sigmoid outputs are not zero-centered
3. exp() is a bit compute expensive
tanh(x)  [LeCun et al., 1991]
- Squashes numbers to range [-1,1]
- zero centered (nice)
- still kills gradients when saturated :(
ReLU (Rectified Linear Unit)  [Krizhevsky et al., 2012]
- Computes f(x) = max(0,x)
- Does not saturate (in + region)
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice (e.g. 6x)
ReLU: two problems
- Not zero-centered output
- An annoyance: what is the gradient when x < 0?
ReLU gate: x -> max(0, x)
What happens when x = -10? What happens when x = 0? What happens when x = 10?
(For x < 0 the output is 0 and the local gradient is 0, so no gradient flows back.)
[Figure: a data cloud in input space. An "active ReLU" receives positive inputs for part of the data; a "dead ReLU" that never receives positive inputs will never activate => never update.]
=> people like to initialize ReLU neurons with slightly positive biases (e.g. 0.01)
Leaky ReLU: f(x) = max(0.01x, x)  [Maas et al., 2013] [He et al., 2015]
- Does not saturate
- Computationally efficient
- Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
- will not "die"
Parametric Rectifier (PReLU): f(x) = max(αx, x); backprop into α (parameter)
Exponential Linear Units (ELU)  [Clevert et al., 2015]
f(x) = x if x > 0, α(exp(x) - 1) if x <= 0  (α default = 1)
- All benefits of ReLU
- Closer to zero mean outputs
- Negative saturation regime compared with Leaky ReLU adds some robustness to noise
- Computation requires exp()
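A small numpy sketch (not slide code) of the rectifier-family activations just discussed; the alpha defaults follow the slides.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.linspace(-3, 3, 7)
print(relu(x))        # zero for x < 0
print(leaky_relu(x))  # small negative slope for x < 0
print(elu(x))         # saturates smoothly toward -alpha for x << 0
```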
Scaled Exponential Linear Units (SELU)  [Klambauer et al. ICLR 2017]
Scaled version of ELU: f(x) = λx for x > 0, λα(exp(x) - 1) for x <= 0, with α = 1.6733, λ = 1.0507
- Works better for deep networks
- "Self-normalizing" property: can train deep SELU networks without BatchNorm (will discuss more later)
Maxout "Neuron"  [Goodfellow et al., 2013]
max(w1^T x + b1, w2^T x + b2)
- Does not have the basic form of dot product -> nonlinearity
- Generalizes ReLU and Leaky ReLU
- Linear regime! Does not saturate! Does not die!
Problem: doubles the number of parameters/neuron :(
Swish  [Ramachandran et al., 2018]
- They trained a neural network to generate and test out different non-linearities.
- Swish outperformed all other options for CIFAR-10 accuracy
TLDR: In practice:
- Use ReLU. Be careful with your learning rates
- Try out Leaky ReLU / Maxout / ELU / SELU to squeeze out some marginal gains
- Don't use sigmoid or tanh
Data Preprocessing
Data Preprocessing
(Assume X [NxD] is data matrix, each example in a row)
Remember: Consider what happens when the input to a neuron is always positive...
What can we say about the gradients on w? Always all positive or all negative :(
(this is also why you want zero-mean data!)
Data Preprocessing
(Assume X [NxD] is data matrix, each example in a row)
Zero-center the data, then normalize it (see the sketch below).
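A numpy sketch of the zero-centering and normalization applied to X on this slide (the data here is illustrative):

```python
import numpy as np

X = np.random.randn(100, 3072) * 5.0 + 2.0   # fake data with nonzero mean, non-unit std
X -= np.mean(X, axis=0)                      # zero-center each feature
X /= np.std(X, axis=0)                       # normalize each feature to unit variance
```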
Data Preprocessing
In practice, you may also see PCA and Whitening of the data:
- decorrelated data (data has diagonal covariance matrix)
- whitened data (covariance matrix is the identity matrix)
Data Preprocessing
Before normalization: classification loss very sensitive to changes in weight matrix; hard to optimize
After normalization: less sensitive to small changes in weights; easier to optimize
TLDR: In practice for Images: center only
e.g. consider CIFAR-10 example with [32,32,3] images
- Subtract the mean image (e.g. AlexNet) (mean image = [32,32,3] array)
- Subtract per-channel mean (e.g. VGGNet) (mean along each channel = 3 numbers)
- Subtract per-channel mean and divide by per-channel std (e.g. ResNet) (mean along each channel = 3 numbers)
Not common to do PCA or whitening
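A sketch of the third option (per-channel mean and std), assuming images stored as an [N, 32, 32, 3] float array; the data here is a stand-in, not CIFAR-10 itself.

```python
import numpy as np

X = np.random.rand(5000, 32, 32, 3).astype(np.float32)  # stand-in for CIFAR-10 images
mean = X.mean(axis=(0, 1, 2))    # 3 numbers, one per channel
std = X.std(axis=(0, 1, 2))      # 3 numbers, one per channel
X = (X - mean) / std             # per-channel centering and scaling
```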
Weight Initialization
Q: what happens when W = constant init is used? (Every neuron computes the same output and receives the same gradient, so nothing breaks the symmetry between them.)
- First idea: Small random numbers (gaussian with zero mean and 1e-2 standard deviation)
Works ~okay for small networks, but problems with deeper networks.
Weight Initialization: Activation statistics
Forward pass for a 6-layer net with hidden size 4096.
What will happen to the activations for the last layer?
All activations tend to zero for deeper network layers.
Q: What do the gradients dL/dW look like?
A: All zero, no learning =(
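A minimal sketch of this experiment (illustrative, not the slide's exact code): a forward pass through a 6-layer tanh net with hidden size 4096, weights drawn as 0.01 * randn.

```python
import numpy as np

dims = [4096] * 7                      # input plus 6 hidden layers
x = np.random.randn(16, dims[0])       # a small batch of fake data
for Din, Dout in zip(dims[:-1], dims[1:]):
    W = 0.01 * np.random.randn(Din, Dout)   # small gaussian init
    x = np.tanh(x.dot(W))
    print(f"layer std: {x.std():.6f}")      # shrinks toward 0 layer by layer
```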
Weight Initialization: Activation statistics
Increase std of initial weights from 0.01 to 0.05.
What will happen to the activations for the last layer?
All activations saturate.
Q: What do the gradients look like?
A: Local gradients all zero, no learning =(
Weight Initialization: "Xavier" Initialization
"Xavier" initialization: std = 1/sqrt(Din)
Glorot and Bengio, "Understanding the difficulty of training deep feedforward neural networks", AISTATS 2010
"Just right": Activations are nicely scaled for all layers!
For conv layers, Din is kernel_size^2 * input_channels
Derivation:
y = Wx, h = f(y)
Var(yi) = Din * Var(xi wi)                            [Assume x, w are iid]
        = Din * (E[xi^2] E[wi^2] - E[xi]^2 E[wi]^2)   [Assume x, w independent]
        = Din * Var(xi) * Var(wi)                     [Assume x, w are zero-mean]
If Var(wi) = 1/Din then Var(yi) = Var(xi)
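The same activation-statistics sketch with Xavier initialization (std = 1/sqrt(Din)); the layer-to-layer standard deviation now stays roughly constant instead of collapsing.

```python
import numpy as np

dims = [4096] * 7
x = np.random.randn(16, dims[0])
for Din, Dout in zip(dims[:-1], dims[1:]):
    W = np.random.randn(Din, Dout) / np.sqrt(Din)   # Xavier: std = 1/sqrt(Din)
    x = np.tanh(x.dot(W))
    print(f"layer std: {x.std():.4f}")              # roughly constant across layers
```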
Weight Initialization: What about ReLU?
Change from tanh to ReLU.
Xavier assumes a zero-centered activation function.
Activations collapse to zero again, no learning =(
Weight Initialization: Kaiming / MSRA Initialization
ReLU correction: std = sqrt(2 / Din)
He et al, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”, ICCV 2015
“Just right”: Activations are nicely scaled for all layers!
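And the ReLU version with the Kaiming/MSRA correction, std = sqrt(2/Din), again as a sketch:

```python
import numpy as np

dims = [4096] * 7
x = np.random.randn(16, dims[0])
for Din, Dout in zip(dims[:-1], dims[1:]):
    W = np.random.randn(Din, Dout) * np.sqrt(2.0 / Din)  # ReLU correction
    x = np.maximum(0, x.dot(W))
    print(f"layer std: {x.std():.4f}")   # activations keep a healthy scale across layers
```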
Proper initialization is an active area of research…
Understanding the difficulty of training deep feedforward neural networks, by Glorot and Bengio, 2010
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks by Saxe et al, 2013
Random walk initialization for training very deep feedforward networks by Sussillo and Abbott, 2014
Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification by He et al., 2015
Data-dependent Initializations of Convolutional Neural Networks by Krähenbühl et al., 2015
All you need is a good init, Mishkin and Matas, 2015
Fixup Initialization: Residual Learning Without Normalization, Zhang et al, 2019
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, Frankle and Carbin, 2019
Batch Normalization
Batch Normalization  [Ioffe and Szegedy, 2015]
"you want zero-mean unit-variance activations? just make them so."
Consider a batch of activations at some layer. To make each dimension zero-mean unit-variance, apply:
x_hat = (x - E[x]) / sqrt(Var[x])
This is a vanilla differentiable function...
Input x: shape N x D
Per-channel mean μ (mean over the batch), shape is D
Per-channel var σ² (variance over the batch), shape is D
Normalized x_hat = (x - μ) / sqrt(σ² + ε), shape is N x D
Problem: What if zero-mean, unit variance is too hard of a constraint?
Learnable scale and shift parameters: γ, β, each of shape D
Output y = γ x_hat + β, shape is N x D
Learning γ = sqrt(Var[x]), β = E[x] will recover the identity function!
Batch Normalization: Test-Time
Estimates of the mean and variance depend on the minibatch; can't do this at test-time!
At test time, the per-channel mean and variance are replaced by (running) averages of the values seen during training.
During testing, batchnorm becomes a linear operator! It can be fused with the previous fully-connected or conv layer.
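A minimal sketch of a batchnorm forward pass for a fully-connected layer, showing the train/test difference above (the helper name, epsilon, and momentum value are illustrative, not the paper's or any particular library's API):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      train=True, eps=1e-5, momentum=0.9):
    if train:
        mu = x.mean(axis=0)                     # per-channel mean, shape D
        var = x.var(axis=0)                     # per-channel var, shape D
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var     # use training-time running averages
    x_hat = (x - mu) / np.sqrt(var + eps)       # normalize
    out = gamma * x_hat + beta                  # learnable scale and shift
    return out, running_mean, running_var

N, D = 32, 4
x = np.random.randn(N, D) * 3 + 1
gamma, beta = np.ones(D), np.zeros(D)
rm, rv = np.zeros(D), np.ones(D)
out, rm, rv = batchnorm_forward(x, gamma, beta, rm, rv, train=True)
print(out.mean(axis=0), out.std(axis=0))        # ~0 mean, ~1 std per channel
```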
Batch Normalization  [Ioffe and Szegedy, 2015]
FC -> BN -> tanh -> FC -> BN -> tanh -> ...
Usually inserted after Fully Connected or Convolutional layers, and before nonlinearity.
- Makes deep networks much easier to train!
- Improves gradient flow
- Allows higher learning rates, faster convergence
- Networks become more robust to initialization
- Acts as regularization during training
- Zero overhead at test-time: can be fused with conv!
- Behaves differently during training and testing: this is a very common source of bugs!
Batch Normalization for ConvNets
Batch Normalization for fully-connected networks:
  x: N × D
  𝞵,𝝈: 1 × D
  ɣ,β: 1 × D
  y = ɣ(x-𝞵)/𝝈 + β
Batch Normalization for convolutional networks (Spatial Batchnorm, BatchNorm2D):
  x: N×C×H×W
  𝞵,𝝈: 1×C×1×1
  ɣ,β: 1×C×1×1
  y = ɣ(x-𝞵)/𝝈 + β
Layer Normalization
Batch Normalization for fully-connected networks:
  x: N × D
  𝞵,𝝈: 1 × D
  ɣ,β: 1 × D
  y = ɣ(x-𝞵)/𝝈 + β
Layer Normalization for fully-connected networks (same behavior at train and test! Can be used in recurrent networks):
  x: N × D
  𝞵,𝝈: N × 1
  ɣ,β: 1 × D
  y = ɣ(x-𝞵)/𝝈 + β
Ba, Kiros, and Hinton, "Layer Normalization", arXiv 2016
Instance Normalization
Batch Normalization for convolutional networks:
  x: N×C×H×W
  𝞵,𝝈: 1×C×1×1
  ɣ,β: 1×C×1×1
  y = ɣ(x-𝞵)/𝝈 + β
Instance Normalization for convolutional networks (same behavior at train / test!):
  x: N×C×H×W
  𝞵,𝝈: N×C×1×1
  ɣ,β: 1×C×1×1
  y = ɣ(x-𝞵)/𝝈 + β
Ulyanov et al, "Improved Texture Networks: Maximizing Quality and Diversity in Feed-forward Stylization and Texture Synthesis", CVPR 2017
Comparison of Normalization Layers
Wu and He, “Group Normalization”, ECCV 2018
Group Normalization
Wu and He, “Group Normalization”, ECCV 2018
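A small numpy sketch of the reduction axes these normalization layers use on an [N, C, H, W] activation tensor (the tensor sizes and group count G are illustrative):

```python
import numpy as np

x = np.random.randn(8, 16, 4, 4)                  # [N, C, H, W]
N, C, H, W = x.shape
G = 4                                             # groups for GroupNorm

bn_mu = x.mean(axis=(0, 2, 3), keepdims=True)     # BatchNorm:    1 x C x 1 x 1
ln_mu = x.mean(axis=(1, 2, 3), keepdims=True)     # LayerNorm:    N x 1 x 1 x 1
in_mu = x.mean(axis=(2, 3), keepdims=True)        # InstanceNorm: N x C x 1 x 1
gn_mu = x.reshape(N, G, C // G, H, W).mean(axis=(2, 3, 4), keepdims=True)  # GroupNorm
print(bn_mu.shape, ln_mu.shape, in_mu.shape, gn_mu.shape)
```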
Transfer learning
"You need a lot of data if you want to train/use CNNs"
BUSTED
Transfer Learning with CNNs
1. Train on Imagenet
[VGG-style network: Image -> Conv-64, Conv-64, MaxPool -> Conv-128, Conv-128, MaxPool -> Conv-256, Conv-256, MaxPool -> Conv-512, Conv-512, MaxPool -> Conv-512, Conv-512, MaxPool -> FC-4096, FC-4096, FC-1000]
Donahue et al, "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition", ICML 2014
Razavian et al, "CNN Features Off-the-Shelf: An Astounding Baseline for Recognition", CVPR Workshops 2014
2. Small Dataset (C classes)
Keep the ImageNet-pretrained network, but replace the final FC-1000 with a new FC-C layer.
Freeze these: all earlier layers (the Conv blocks and the FC-4096 layers).
Reinitialize this and train: the new FC-C layer.
[Results figure: Donahue et al, "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition", ICML 2014; finetuned from AlexNet]
3. Bigger dataset
With a bigger dataset, train more layers: freeze only the earlier layers and train the later ones, including the new FC-C.
Lower learning rate when finetuning; 1/10 of the original LR is a good starting point.
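A hedged PyTorch/torchvision sketch of this recipe (the model choice, class count, and use of the legacy pretrained flag are illustrative, not the slide's code):

```python
import torch.nn as nn
from torchvision import models

C = 10                                            # number of classes in the new dataset
model = models.vgg16(pretrained=True)             # 1. trained on ImageNet
for param in model.parameters():                  # 2. freeze these
    param.requires_grad = False
model.classifier[6] = nn.Linear(4096, C)          # reinitialize this and train
# 3. With a bigger dataset, unfreeze some later layers as well and finetune them
#    with a lower learning rate (e.g. 1/10 of the original LR).
```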
Layers near the input learn more generic features; layers near the output are more specific to the pretraining dataset. What to do with them depends on how much target data you have and how similar it is to the source data:

|                     | very similar dataset                      | very different dataset                                            |
|---------------------|-------------------------------------------|--------------------------------------------------------------------|
| very little data    | Use a linear classifier on the top layer  | You're in trouble… try a linear classifier from different stages   |
| quite a lot of data | Finetune a few layers                     | Finetune a larger number of layers                                  |
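For the "very little data, very similar dataset" cell, one common realization is to treat the pretrained network as a fixed feature extractor and fit only a linear classifier on top. A minimal sketch, assuming a torchvision ResNet-18; `num_classes` and the dummy loader are placeholders for your own data:

```python
# Sketch: linear classifier on top of a frozen pretrained CNN
# (torchvision ResNet-18; dataset stand-ins are placeholders).
import torch
import torch.nn as nn
import torchvision.models as models

num_classes = 5                       # placeholder: C classes in the small dataset

cnn = models.resnet18(pretrained=True)
cnn.fc = nn.Identity()                # strip the ImageNet classifier; output 512-d features
cnn.eval()                            # the CNN stays frozen

linear = nn.Linear(512, num_classes)  # the only module we train
optimizer = torch.optim.Adam(linear.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Dummy stand-in for a real (small) labeled dataset / DataLoader.
loader = [(torch.randn(8, 3, 224, 224), torch.randint(0, num_classes, (8,)))]

for images, labels in loader:
    with torch.no_grad():             # no gradients through the frozen CNN
        feats = cnn(images)
    loss = criterion(linear(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```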
Transfer learning with CNNs is pervasive… (it's the norm, not an exception)

Object Detection (Fast R-CNN): the detection network uses a CNN pretrained on ImageNet.
Girshick, "Fast R-CNN", ICCV 2015. Figure copyright Ross Girshick, 2015. Reproduced with permission.

Image Captioning (CNN + RNN): the CNN is pretrained on ImageNet, and the word vectors are pretrained with word2vec.
Karpathy and Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions", CVPR 2015. Figure copyright IEEE, 2015. Reproduced for educational purposes.
Transfer learning with CNNs is pervasive… (it's the norm, not an exception). A recent example chains several pretrained components:

1. Train a CNN on ImageNet
2. Fine-tune (1) for object detection on Visual Genome
3. Train a BERT language model on lots of text
4. Combine (2) and (3), train for joint image/language modeling
5. Fine-tune (4) for image captioning, visual question answering, etc.

Zhou et al., "Unified Vision-Language Pre-Training for Image Captioning and VQA", CVPR 2020. Figure copyright Luowei Zhou, 2020. Reproduced with permission.
Krishna et al., "Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations", IJCV 2017
Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv 2018
Transfer learning with CNNs - Architecture matters

We will discuss different architectures in detail in two lectures.

Girshick, "The Generalized R-CNN Framework for Object Detection", ICCV 2017 Tutorial on Instance-Level Visual Recognition
Transfer learning with CNNs is pervasive… but recent results show it might not always be necessary!

He et al. find that training from scratch can work just as well as training from a pretrained ImageNet model for object detection, but it takes 2-3x as long to train. They also find that collecting more data is better than finetuning on a related task.

He et al., "Rethinking ImageNet Pre-training", ICCV 2019. Figure copyright Kaiming He, 2019. Reproduced with permission.
Takeaway for your projects and beyond: have some dataset of interest, but it has fewer than ~1M images?

1. Find a very large dataset that has similar data and train a big ConvNet there.
2. Transfer learn to your dataset.

Deep learning frameworks provide a "Model Zoo" of pretrained models, so you don't need to train your own:
TensorFlow: https://github.com/tensorflow/models
PyTorch: https://github.com/pytorch/vision
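For instance, pulling an ImageNet-pretrained network from the torchvision model zoo takes a single call; a small sanity-check sketch follows (the choice of ResNet-50 and the dummy input are only for illustration):

```python
# Sketch: loading a pretrained "Model Zoo" network as a starting point.
import torch
import torchvision.models as models

# Download ImageNet-pretrained weights instead of training from scratch.
model = models.resnet50(pretrained=True)
model.eval()

# Sanity check: a dummy 224x224 RGB batch produces 1000 ImageNet class scores.
with torch.no_grad():
    scores = model(torch.randn(1, 3, 224, 224))
print(scores.shape)   # torch.Size([1, 1000])
```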
Summary

We looked in detail at (TL;DRs in parentheses):
- Activation Functions (use ReLU)
- Data Preprocessing (images: subtract the mean)
- Weight Initialization (use Xavier/He init)
- Batch Normalization (use this!)
- Transfer learning (use this if you can!)
Next time: Training Neural Networks, Part 2
- Parameter update schemes
- Learning rate schedules
- Gradient checking
- Regularization (Dropout etc.)
- Babysitting learning
- Evaluation (Ensembles etc.)
- Hyperparameter Optimization
- Transfer learning / fine-tuning