Tips for Deep Learning - 國立臺灣大學
TRANSCRIPT
[Page 1]
Tips for Deep Learning
[Page 2]
Recipe of Deep Learning

Step 1: define a set of function
Step 2: goodness of function
Step 3: pick the best function
↓
Neural Network
↓
Good Results on Training Data? — NO → go back and improve the three steps
YES ↓
Good Results on Testing Data? — NO → Overfitting!
YES ↓ done
[Page 3]
Do not always blame Overfitting

Deep Residual Learning for Image Recognition
http://arxiv.org/abs/1512.03385

A deeper network has higher error on the testing data — overfitting? Check the training data first: the deeper network also has higher error there, so it is simply not well trained.
[Page 4]
Recipe of Deep Learning

Neural Network → Good Results on Training Data? YES → Good Results on Testing Data? YES

Different approaches for different problems,
e.g. dropout for good results on testing data.
[Page 5]
Recipe of Deep Learning

For good results on training data:
• New activation function
• Adaptive Learning Rate

For good results on testing data:
• Early Stopping
• Regularization
• Dropout
[Page 6]
Hard to get the power of Deep …
Deeper usually does not imply better — even on the training data.
[Page 7]
Vanishing Gradient Problem

[Figure: a deep network with inputs x1, x2, …, xN and outputs y1, y2, …, yM. Layers near the input have smaller gradients, learn very slowly, and are still almost random; layers near the output have larger gradients, learn very fast, and have already converged — based on nearly random features!?]
[Page 8]
Vanishing Gradient Problem

An intuitive way to compute the derivatives: perturb a weight by Δw and watch how the loss l changes,

∂l/∂w ≈ Δl/Δw

[Figure: the same deep network with inputs x1 … xN, outputs y1 … yM, targets ŷ1 … ŷM, and loss l. A change Δw at an early layer is squashed by each sigmoid (large input → small output), so by the output Δl is tiny: smaller gradients near the input.]
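The shrinking Δl can be made concrete with a toy calculation (a hypothetical sketch, not from the slides): the backward signal through each sigmoid layer is multiplied by σ′(z) ≤ 0.25, so with unit weights it decays geometrically with depth.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_attenuation(depth, w=1.0, z=0.0):
    """Factor multiplying the gradient after passing back through
    `depth` sigmoid layers, each with weight w and pre-activation z."""
    g = 1.0
    for _ in range(depth):
        s = sigmoid(z)
        g *= w * s * (1.0 - s)  # sigma'(z) = s * (1 - s) <= 0.25
    return g

# at z = 0 each layer scales the gradient by exactly 0.25,
# so 8 layers shrink it by 0.25**8 ~ 1.5e-05
shallow = grad_attenuation(1)
deep = grad_attenuation(8)
```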
[Page 9]
ReLU
• Rectified Linear Unit (ReLU): a = z for z > 0, a = 0 for z ≤ 0 (compare with the sigmoid σ(z))

Reasons:
1. Fast to compute
2. Biological reason
3. Equivalent to an infinite number of sigmoids with different biases
4. Handles the vanishing gradient problem

[Xavier Glorot, AISTATS'11][Andrew L. Maas, ICML'13][Kaiming He, arXiv'15]
[Page 10]
ReLU
[Figure: a network with inputs x1, x2 and outputs y1, y2. Neurons whose input z is negative output a = 0; the rest output a = z.]
[Page 11]
ReLU
[Figure: removing the neurons that output 0 leaves a thinner, linear network — and a linear network does not have smaller gradients.]
[Page 12]
ReLU - variant

Leaky ReLU: a = z for z > 0, a = 0.01z otherwise.
Parametric ReLU: a = z for z > 0, a = αz otherwise, where α is also learned by gradient descent.
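The three activations can be sketched in a few lines (a minimal sketch; the function names are illustrative, and in Parametric ReLU α would be a trained parameter rather than an argument):

```python
def relu(z):
    # a = z for z > 0, a = 0 otherwise
    return z if z > 0 else 0.0

def leaky_relu(z, slope=0.01):
    # a = z for z > 0, a = 0.01 * z otherwise
    return z if z > 0 else slope * z

def parametric_relu(z, alpha):
    # same shape as Leaky ReLU, but alpha is learned by gradient descent
    return z if z > 0 else alpha * z
```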
[Page 13]
Maxout
• Learnable activation function [Ian J. Goodfellow, ICML'13]

[Figure: inputs x1, x2. Each layer's pre-activations are grouped; each group's neuron outputs the max of its elements. E.g. a group with values 5 and 7 outputs 7, a group with −1 and 1 outputs 1; in the next layer, groups (1, 2) and (4, 3) output 2 and 4.]

ReLU is a special case of Maxout.
You can have more than 2 elements in a group.
[Page 14]
Maxout

ReLU: z = wx + b, a = max(z, 0).
Maxout with a group of two, where z1 = wx + b and z2 = 0 (the second element has zero weight and zero bias): a = max(z1, z2).
The two produce exactly the same output — ReLU is a special case of Maxout.
[Page 15]
Maxout

With learnable w′ and b′ instead of zeros: z1 = wx + b, z2 = w′x + b′, a = max(z1, z2).
This gives a learnable activation function — more than ReLU.
[Page 16]
Maxout
• Learnable activation function [Ian J. Goodfellow, ICML'13]
• The activation function of a maxout network can be any piecewise linear convex function.
• The number of pieces depends on the number of elements in a group (2 elements in a group → 2 pieces; 3 elements → 3 pieces).
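A maxout group can be sketched as follows (a hypothetical helper, assuming a group of linear units over the same input):

```python
def maxout_group(x, weights, biases):
    """Each element k computes z_k = w_k . x + b_k; the group outputs max_k z_k."""
    zs = [sum(wi * xi for wi, xi in zip(w, x)) + b
          for w, b in zip(weights, biases)]
    return max(zs)

# With the second element pinned to z2 = 0 (zero weight, zero bias),
# the group computes max(wx + b, 0) -- exactly ReLU:
a_neg = maxout_group([-1.0], weights=[[2.0], [0.0]], biases=[0.0, 0.0])  # max(-2, 0)
a_pos = maxout_group([1.0], weights=[[2.0], [0.0]], biases=[0.0, 0.0])   # max(2, 0)
```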
[Page 17]
Maxout - Training
• Given a training example x, we know which z in each group would be the max, e.g. a1¹ = max(z1¹, z2¹).

[Figure: inputs x1, x2; first-layer groups (z1¹, z2¹) and (z3¹, z4¹) produce a1¹ and a2¹; second-layer groups (z1², z2²) and (z3², z4²) produce a1² and a2².]
[Page 18]
Maxout - Training
• Given a training example x, we know which z would be the max.
• Train this thin and linear network: only the max element of each group passes its value on, so the gradient flows through it alone.
• Different examples give different thin and linear networks, so all the parameters get trained.
[Page 19]
Recipe of Deep Learning — good results on training data, continued: Adaptive Learning Rate.
[Page 20]
Review

Different parameters need different learning rates: on the error surface over w1 and w2, one direction may need a larger learning rate and the other a smaller one.

Adagrad:

w^{t+1} ← w^t − η / √(Σ_{i=0}^{t} (g^i)²) · g^t

Use the first derivatives to estimate the second derivative.
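The Adagrad update above, sketched for a single scalar weight (a toy sketch; the η value is illustrative):

```python
import math

def adagrad_step(w, grad_history, eta=0.1):
    """w <- w - eta / sqrt(sum_i (g^i)^2) * g^t, where g^t is the latest gradient."""
    g_t = grad_history[-1]
    denom = math.sqrt(sum(g * g for g in grad_history))
    return w - (eta / denom) * g_t

# with gradients 3 then 4, the denominator is sqrt(9 + 16) = 5,
# so the step is eta / 5 * 4
w_new = adagrad_step(1.0, [3.0, 4.0], eta=0.5)  # 1 - 0.5/5 * 4 = 0.6
```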
![Page 21: Tips for Deep Learning - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022020620/61e51adc0fda3a4ea069715d/html5/thumbnails/21.jpg)
RMSProp
𝑤1
𝑤2
Error Surface can be very complex when training NN.
Larger Learning Rate
Smaller Learning Rate
[Page 22]
RMSProp

w^1 ← w^0 − (η/σ^0) g^0,   σ^0 = g^0
w^2 ← w^1 − (η/σ^1) g^1,   σ^1 = √(α(σ^0)² + (1−α)(g^1)²)
w^3 ← w^2 − (η/σ^2) g^2,   σ^2 = √(α(σ^1)² + (1−α)(g^2)²)
……
w^{t+1} ← w^t − (η/σ^t) g^t,   σ^t = √(α(σ^{t−1})² + (1−α)(g^t)²)

σ^t is the root mean square of the gradients, with previous gradients being decayed.
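One RMSProp step for a scalar weight might look like this (a minimal sketch; the η and α values are illustrative):

```python
import math

def rmsprop_step(w, sigma_prev, g, eta=0.01, alpha=0.9):
    """sigma^t = sqrt(alpha*(sigma^{t-1})^2 + (1-alpha)*(g^t)^2);
    w <- w - eta / sigma^t * g^t.  (The slides set sigma^0 = g^0 at t = 0.)"""
    sigma = math.sqrt(alpha * sigma_prev ** 2 + (1.0 - alpha) * g ** 2)
    return w - (eta / sigma) * g, sigma

# if sigma_prev = 1 and g = 1, sigma stays 1 and the step is plain eta * g
w_new, sigma = rmsprop_step(0.0, 1.0, 1.0, eta=0.1, alpha=0.9)
```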
[Page 23]
Hard to find optimal network parameters

[Figure: total loss vs. the value of a network parameter w — very slow at a plateau (∂L/∂w ≈ 0), stuck at a saddle point (∂L/∂w = 0), stuck at a local minimum (∂L/∂w = 0).]
[Page 24]
In the physical world ……
• Momentum: a ball rolling down a slope does not stop at a plateau or a small dip — its momentum carries it onward.
How about putting this phenomenon into gradient descent?
[Page 25]
Review: Vanilla Gradient Descent

Start at position θ^0.
Compute gradient at θ^0; move to θ^1 = θ^0 − η∇L(θ^0).
Compute gradient at θ^1; move to θ^2 = θ^1 − η∇L(θ^1).
……
Stop when ∇L(θ^t) ≈ 0.

(Movement is always opposite to the gradient.)
[Page 26]
Momentum

Start at point θ^0, with movement v^0 = 0.
Compute gradient at θ^0; movement v^1 = λv^0 − η∇L(θ^0); move to θ^1 = θ^0 + v^1.
Compute gradient at θ^1; movement v^2 = λv^1 − η∇L(θ^1); move to θ^2 = θ^1 + v^2.
……

Movement is based not just on the gradient, but also on the previous movement:
movement = movement of last step minus gradient at present.
[Page 27]
Momentum

v^i is actually a weighted sum of all the previous gradients ∇L(θ^0), ∇L(θ^1), …, ∇L(θ^{i−1}):

v^0 = 0
v^1 = −η∇L(θ^0)
v^2 = −λη∇L(θ^0) − η∇L(θ^1)
……
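The momentum update, sketched per step for a scalar parameter (a toy version; the λ and η values are illustrative):

```python
def momentum_step(theta, v_prev, grad, eta=0.1, lam=0.9):
    """v^t = lambda * v^{t-1} - eta * grad(theta^{t-1}); theta <- theta + v^t."""
    v = lam * v_prev - eta * grad
    return theta + v, v

# unrolling two steps with constant gradient 1 shows the weighted sum:
# v^1 = -eta, v^2 = -lam*eta - eta
theta, v = momentum_step(0.0, 0.0, 1.0)   # v^1 = -0.1
theta, v = momentum_step(theta, v, 1.0)   # v^2 = 0.9*(-0.1) - 0.1 = -0.19
```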
[Page 28]
Momentum

Movement = negative of ∂L/∂w + momentum.

[Figure: on the cost curve, even at a point where ∂L/∂w = 0, the momentum term keeps the real movement going.]

Still no guarantee of reaching the global minimum, but it gives some hope ……
[Page 29]
Adam = RMSProp + Momentum
(one accumulator is for momentum, the other for RMSProp)
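Adam combines the two: a momentum-like first moment m and an RMSProp-like second moment v, each with bias correction (a sketch of the standard update; the hyperparameter values are the usual published defaults, not from the slides):

```python
import math

def adam_step(w, m, v, g, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g        # first moment: the momentum part
    v = beta2 * v + (1 - beta2) * g * g    # second moment: the RMSProp part
    m_hat = m / (1 - beta1 ** t)           # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    return w - eta * m_hat / (math.sqrt(v_hat) + eps), m, v

# on the first step, the bias-corrected update is close to a plain step of size eta
w_new, m, v = adam_step(0.0, 0.0, 0.0, g=1.0, t=1)
```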
[Page 30]
Recipe of Deep Learning — good results on testing data: Early Stopping.
[Page 31]
Early Stopping

[Figure: total loss vs. epochs — the training-set loss keeps decreasing, but the testing-set loss starts to rise; stop where the loss on the validation set is lowest.]

Keras: http://keras.io/getting-started/faq/#how-can-i-interrupt-training-when-the-validation-loss-isnt-decreasing-anymore
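The rule behind the Keras callback in that FAQ can be sketched as a patience loop (a hypothetical helper, not Keras code):

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch at which training stops: when the validation loss
    has not improved for `patience` consecutive epochs."""
    best, wait = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses) - 1

# the validation loss bottoms out at epoch 1, so with patience 3 we stop at epoch 4
stop = early_stopping_epoch([1.0, 0.8, 0.9, 0.95, 1.0, 1.1], patience=3)
```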
[Page 32]
Recipe of Deep Learning — good results on testing data: Regularization.
[Page 33]
Regularization
• New loss function to be minimized:

L′(θ) = L(θ) + λ · ½‖θ‖₂²

where L(θ) is the original loss (e.g. square error, cross entropy …), θ = {w1, w2, …} (usually not considering the biases), and the L2 regularization term is

‖θ‖₂² = (w1)² + (w2)² + ……

• Find a set of weights that not only minimizes the original cost but is also close to zero.
[Page 34]
Regularization

L2 regularization: L′(θ) = L(θ) + λ · ½‖θ‖₂², with ‖θ‖₂² = (w1)² + (w2)² + ……

Gradient: ∂L′/∂w = ∂L/∂w + λw

Update:
w^{t+1} ← w^t − η(∂L/∂w + λw^t) = (1 − ηλ)w^t − η ∂L/∂w

Since 1 − ηλ is slightly less than 1, every update first scales w^t toward zero — this is why L2 regularization is also called Weight Decay.
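The weight-decay form of the update, for a scalar weight (a minimal sketch; the η and λ values are illustrative):

```python
def l2_update(w, grad, eta=0.1, lam=0.01):
    """w^{t+1} = (1 - eta*lam) * w^t - eta * dL/dw."""
    return (1.0 - eta * lam) * w - eta * grad

# with a zero task gradient the weight simply decays toward zero
w_new = l2_update(1.0, grad=0.0)  # (1 - 0.001) * 1 = 0.999
```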
[Page 35]
Regularization

L1 regularization: L′(θ) = L(θ) + λ‖θ‖₁, with ‖θ‖₁ = |w1| + |w2| + ……

Gradient: ∂L′/∂w = ∂L/∂w + λ sgn(w)

Update:
w^{t+1} ← w^t − η(∂L/∂w + λ sgn(w^t)) = w^t − η ∂L/∂w − ηλ sgn(w^t)

The term −ηλ sgn(w^t) always deletes a fixed amount from |w^t|, whereas the L2 update w^{t+1} ← (1 − ηλ)w^t − η ∂L/∂w shrinks w^t proportionally.
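The L1 update, for comparison (same toy setting; η and λ values are illustrative):

```python
def sgn(w):
    # sign function: +1, -1, or 0
    return (w > 0) - (w < 0)

def l1_update(w, grad, eta=0.1, lam=0.01):
    """w^{t+1} = w^t - eta * dL/dw - eta*lam*sgn(w^t):
    always subtract a fixed amount from |w|."""
    return w - eta * grad - eta * lam * sgn(w)

# a large and a small weight shrink by the same absolute amount
# (L2 weight decay would instead shrink them proportionally)
big = l1_update(10.0, grad=0.0)    # 10 - 0.001 = 9.999
small = l1_update(0.1, grad=0.0)   # 0.1 - 0.001 = 0.099
```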
[Page 36]
Regularization - Weight Decay
• Our brain prunes the useless links between neurons. Doing the same thing to the machine's "brain" improves performance.
[Page 37]
Recipe of Deep Learning — good results on testing data: Dropout.
[Page 38]
Dropout — Training:
➢ Each time before updating the parameters, each neuron has probability p% of being dropped out.
[Page 39]
Dropout — Training:
➢ Each time before updating the parameters, each neuron has probability p% of being dropped out.
➢ Use the new, thinner network for training — the structure of the network is changed.
➢ For each mini-batch, we resample the dropped-out neurons.
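Train-time dropout can be sketched as a random mask resampled on every call (a hypothetical helper; the matching test-time scaling is included for contrast):

```python
import random

def dropout(values, p=0.5, training=True):
    """Training: drop each value with probability p (mask resampled every call).
    Testing: keep everything but scale by (1 - p)."""
    if training:
        return [0.0 if random.random() < p else v for v in values]
    return [(1.0 - p) * v for v in values]
```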
[Page 40]
Dropout — Testing:
➢ No dropout.
➢ If the dropout rate at training is p%, multiply all the weights by (1 − p%). E.g. if the dropout rate is 50% and a weight is w = 1 after training, set w = 0.5 for testing.
[Page 41]
Dropout - Intuitive Reason
Training with dropout is like practicing with weights tied to your legs; at testing time the weights come off, and you become much stronger.
[Page 42]
Dropout - Intuitive Reason
➢ In a team, if everyone expects their partner to do the work, nothing gets done in the end.
➢ However, if you know your partner will drop out, you will do better. ("My partner will slack off, so I have to do the job well myself.")
➢ When testing, no one actually drops out, so good results are obtained in the end.
[Page 43]
Dropout - Intuitive Reason
• Why should the weights be multiplied by (1 − p%) (p% = dropout rate) when testing?

Assume the dropout rate is 50%. During training, on average half of the inputs to a neuron are dropped, so z = w1x1 + w2x2 + w3x3 + w4x4 is computed with about half its terms. At testing time there is no dropout; using the weights from training directly gives z′ ≈ 2z. Multiplying each weight by 0.5 restores z′ ≈ z.
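The z′ ≈ 2z claim can be checked numerically (a toy experiment with made-up weights and inputs):

```python
import random

random.seed(0)
w = [0.3, -0.7, 0.5, 0.2]
x = [1.0, 2.0, -1.0, 0.5]
z_full = sum(wi * xi for wi, xi in zip(w, x))  # z' with raw training weights

# average z over many 50%-dropout samples of the inputs
n = 100_000
avg = sum(
    sum(wi * xi for wi, xi in zip(w, x) if random.random() >= 0.5)
    for _ in range(n)
) / n

# avg ~ 0.5 * z_full: halving the weights at test time
# matches the average seen during training
```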
[Page 44]
Dropout is a kind of ensemble.

Ensemble: split the Training Set into Set 1, Set 2, Set 3, Set 4, and train a bunch of networks (Network 1, 2, 3, 4) with different structures.
[Page 45]
Dropout is a kind of ensemble.

Ensemble: feed the testing data x to Network 1, 2, 3, 4, obtain y1, y2, y3, y4, and average them.
[Page 46]
Dropout is a kind of ensemble.

Training of Dropout:
➢ Each mini-batch (mini-batch 1, 2, 3, 4, ……) trains one sampled network.
➢ Some parameters in the networks are shared.
With M neurons, there are 2^M possible networks.
[Page 47]
Dropout is a kind of ensemble.

Testing of Dropout: ideally, feed the testing data x to all the sampled networks, obtain y1, y2, y3, ……, and average them — but that is infeasible. Surprisingly, using the full network with all the weights multiplied by (1 − p%) gives approximately the same y.
[Page 48]
Testing of Dropout

Consider a single neuron z = w1x1 + w2x2 with dropout on the two inputs. The four possible networks give

z = w1x1 + w2x2,   z = w2x2,   z = w1x1,   z = 0,

and their average is

z = ½ w1x1 + ½ w2x2

— exactly the output of the full network with the weights multiplied by ½.
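The averaging argument above can be verified exactly by enumerating all four dropout patterns (made-up numbers):

```python
from itertools import product

w1, w2, x1, x2 = 0.3, -0.7, 1.0, 2.0

# all four networks obtained by keeping (1) or dropping (0) each input
zs = [m1 * w1 * x1 + m2 * w2 * x2 for m1, m2 in product([0, 1], repeat=2)]
ensemble_avg = sum(zs) / len(zs)

# the full network with halved weights gives the same value
halved = 0.5 * w1 * x1 + 0.5 * w2 * x2
```

For this single linear neuron the equality is exact; in a deep nonlinear network the weight-scaling trick is only an approximation to the ensemble average.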
[Page 49]
Recipe of Deep Learning (the full recipe again, as on Page 2).
[Page 50]
Try another task

http://top-breaking-news.com/

The machine classifies documents into categories — politics (政治), sports (體育), economics (經濟). E.g. "president" in a document suggests politics; "stock" in a document suggests finance (財經).
[Page 51]
Try another task

[Page 52]
Live Demo