Deep Learning Tutorial - 國立臺灣大學
TRANSCRIPT
![Page 1: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/1.jpg)
Deep Learning Tutorial
李宏毅 Hung-yi Lee
![Page 2: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/2.jpg)
Deep learning attracts lots of attention.
• Google Trends
Deep learning obtains many exciting results.
(Google Trends interest over time, 2007-2015)
The other talks this afternoon cover the applications; this talk will focus on the technical part.
![Page 3: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/3.jpg)
Outline
Part I: Introduction of Deep Learning
Part II: Why Deep?
Part III: Tips for Training Deep Neural Network
Part IV: Neural Network with Memory
![Page 4: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/4.jpg)
Part I: Introduction of Deep Learning
What people already knew in the 1980s
![Page 5: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/5.jpg)
Example Application
• Handwriting Digit Recognition
Machine “2”
![Page 6: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/6.jpg)
Handwriting Digit Recognition
Input: a 16 x 16 = 256-dimensional vector x1, x2, ..., x256 (ink → 1, no ink → 0).
Output: a 10-dimensional vector y1, y2, ..., y10; each dimension represents the confidence that the image is the digit 1, 2, ..., 0.
Example: y1 = 0.1, y2 = 0.7, ..., y10 = 0.2 → the image is "2".
![Page 7: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/7.jpg)
Example Application
• Handwriting Digit Recognition
The machine implements a function f: R^256 → R^10 that maps the input image (x1, x2, ..., x256) to the output confidences (y1, y2, ..., y10), e.g. mapping the image to "2".
In deep learning, the function f is represented by a neural network.
![Page 8: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/8.jpg)
Element of Neural Network
A neuron is a function f: R^K → R:
z = a1 w1 + a2 w2 + ... + aK wK + b
a = σ(z)
where a1, ..., aK are the inputs, w1, ..., wK are the weights, b is the bias, σ is the activation function, and a is the output.
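For illustration (not part of the original slides), here is a minimal NumPy sketch of a single neuron; the inputs, weights, and bias reuse the example values from the next slides, with a sigmoid activation assumed.

```python
import numpy as np

def sigmoid(z):
    # Sigmoid activation: squashes z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron(a, w, b):
    # z = a1*w1 + ... + aK*wK + b, then apply the activation function
    z = np.dot(w, a) + b
    return sigmoid(z)

# Example: K = 2 inputs, weights (1, -2), bias 1
print(neuron(np.array([1.0, -1.0]), np.array([1.0, -2.0]), 1.0))  # sigmoid(4) ≈ 0.98
```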
![Page 9: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/9.jpg)
Neural Network
Input layer: x1, x2, ..., xN
Hidden layers: Layer 1, Layer 2, ..., Layer L, each made of neurons
Output layer: y1, y2, ..., yM
"Deep" means many hidden layers.
![Page 10: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/10.jpg)
Example of Neural Network
Sigmoid activation function: σ(z) = 1 / (1 + e^(−z))
With inputs (1, −1), first-layer weights (1, −2) and (−1, 1), and biases (1, 0): z = (4, −2), so the first layer outputs σ(z) = (0.98, 0.12).
![Page 11: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/11.jpg)
Example of Neural Network
Continuing the forward pass through the remaining layers, the second layer outputs (0.86, 0.11) and the output layer gives (0.62, 0.83) for the input (1, −1).
![Page 12: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/12.jpg)
Example of Neural Network
For the input (0, 0), the same network outputs (0.51, 0.85).
The network therefore defines a function f: R^2 → R^2 with f(1, −1) = (0.62, 0.83) and f(0, 0) = (0.51, 0.85).
Different parameters define different functions.
![Page 13: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/13.jpg)
Matrix Operation
The first layer can be written as a matrix operation:
σ( [1 −2; −1 1] [1; −1] + [1; 0] ) = σ( [4; −2] ) = [0.98; 0.12]
![Page 14: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/14.jpg)
Neural Network
Written layer by layer with weight matrices W1, W2, ..., WL and bias vectors b1, b2, ..., bL:
a1 = σ(W1 x + b1)
a2 = σ(W2 a1 + b2)
...
y = σ(WL aL−1 + bL)
![Page 15: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/15.jpg)
Neural Network
Composing the layers:
y = f(x) = σ(WL ⋯ σ(W2 σ(W1 x + b1) + b2) ⋯ + bL)
Parallel computing techniques can be used to speed up these matrix operations.
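The nested formula maps directly onto a loop over layers. Below is a minimal NumPy sketch of this forward pass (not from the slides), assuming sigmoid activations and randomly initialized parameters with made-up layer sizes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # y = f(x) = sigma(WL ... sigma(W2 sigma(W1 x + b1) + b2) ... + bL)
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

# Example: 256 -> 500 -> 500 -> 10 network with random parameters
sizes = [256, 500, 500, 10]
rng = np.random.default_rng(0)
weights = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
y = forward(rng.random(256), weights, biases)
print(y.shape)  # (10,)
```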
![Page 16: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/16.jpg)
Softmax
• Softmax layer as the output layer
Ordinary layer: y1 = σ(z1), y2 = σ(z2), y3 = σ(z3).
In general, the output of the network can be any value, which may not be easy to interpret.
![Page 17: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/17.jpg)
Softmax
• Softmax layer as the output layer
Softmax layer: y_i = e^(z_i) / Σ_{j=1..3} e^(z_j)
Example: z = (3, 1, −3) → (e^3, e^1, e^(−3)) ≈ (20, 2.7, 0.05) → y ≈ (0.88, 0.12, ≈0)
The outputs can be interpreted as probabilities: 1 > y_i > 0 and Σ_i y_i = 1.
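A small sketch of the softmax computation on the slide's example z = (3, 1, −3); subtracting max(z) before exponentiating is a standard numerical-stability trick not mentioned on the slide.

```python
import numpy as np

def softmax(z):
    # y_i = exp(z_i) / sum_j exp(z_j); shift by max(z) for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

y = softmax(np.array([3.0, 1.0, -3.0]))
print(y)        # ≈ [0.88, 0.12, 0.002]
print(y.sum())  # 1.0
```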
![Page 18: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/18.jpg)
How to set network parameters
θ = {W1, b1, W2, b2, ⋯, WL, bL}
Input: 16 x 16 = 256 pixels (ink → 1, no ink → 0); output: y1, ..., y10 through a softmax layer.
Set the network parameters θ such that:
Input image of "1" → y1 has the maximum value
Input image of "2" → y2 has the maximum value
……
How can we let the neural network achieve this?
![Page 19: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/19.jpg)
Training Data
• Preparing training data: images and their labels
Using the training data to find the network parameters.
Example labels: "5", "0", "4", "1", "3", "1", "2", "9"
![Page 20: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/20.jpg)
Cost
For an input image of "1", the network outputs (y1, y2, ..., y10) = (0.2, 0.3, ..., 0.5), while the target is (1, 0, ..., 0).
The cost L(θ) can be the Euclidean distance or the cross entropy between the network output and the target.
Given a set of network parameters θ, each training example has a cost value.
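A sketch of the two cost functions named on the slide (Euclidean distance and cross entropy), applied to a shortened 3-dimensional example output against a one-hot target; the epsilon guarding log(0) is my addition.

```python
import numpy as np

def euclidean_cost(y, target):
    # Squared Euclidean distance between network output and target
    return np.sum((y - target) ** 2)

def cross_entropy_cost(y, target, eps=1e-12):
    # Cross entropy between the target distribution and the network output
    return -np.sum(target * np.log(y + eps))

y = np.array([0.2, 0.3, 0.5])        # shortened example output
target = np.array([1.0, 0.0, 0.0])   # one-hot target for the digit "1"
print(euclidean_cost(y, target), cross_entropy_cost(y, target))
```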
![Page 21: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/21.jpg)
Total Cost
For all training data: each example x^r is fed to the network to produce y^r, which is compared with its target ŷ^r to give the cost L_r(θ).
Total cost: C(θ) = Σ_{r=1..R} L_r(θ)
This measures how bad the network parameters θ are on this task.
Find the network parameters θ* that minimize this value.
![Page 22: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/22.jpg)
Gradient Descent
Assume there are only two parameters w1 and w2 in the network: θ = (w1, w2). The colors of the error surface represent the value of C.
Randomly pick a starting point θ^0.
Compute the negative gradient at θ^0: −∇C(θ^0), where ∇C(θ^0) = (∂C(θ^0)/∂w1, ∂C(θ^0)/∂w2).
Multiply it by the learning rate η and move by −η∇C(θ^0), heading toward the minimum θ*.
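A minimal sketch of the update rule on a two-parameter example (not from the slides); the quadratic cost and the learning rate are arbitrary stand-ins for the error surface pictured on the slide.

```python
import numpy as np

def grad_C(theta):
    # Gradient of a stand-in cost C(w1, w2) = w1**2 + 10 * w2**2
    w1, w2 = theta
    return np.array([2 * w1, 20 * w2])

eta = 0.05                        # learning rate
theta = np.array([3.0, -2.0])     # randomly picked starting point theta^0
for _ in range(100):
    theta = theta - eta * grad_C(theta)   # theta^{t+1} = theta^t - eta * grad C(theta^t)
print(theta)   # approaches the minimum (0, 0)
```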
![Page 23: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/23.jpg)
Gradient Descent
Starting from the randomly picked θ^0: compute −∇C(θ^0), move by −η∇C(θ^0) to reach θ^1; compute −∇C(θ^1), move to θ^2; and so on.
Eventually, we reach a minimum.
![Page 24: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/24.jpg)
Local Minima
• Gradient descent never guarantees reaching the global minimum.
Different initial points θ^0 reach different minima, so the results differ.
Who is Afraid of Non-Convex Loss Functions? http://videolectures.net/eml07_lecun_wia/
![Page 25: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/25.jpg)
Besides local minima ……
On the cost surface over the parameter space, gradient descent can also be:
very slow at a plateau (∇C(θ) ≈ 0)
stuck at a saddle point (∇C(θ) = 0)
stuck at a local minimum (∇C(θ) = 0)
![Page 26: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/26.jpg)
In the physical world ……
• Momentum: a ball rolling down the cost surface does not stop the moment the slope becomes zero.
How about putting this phenomenon into gradient descent?
![Page 27: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/27.jpg)
Momentum
Movement = negative of gradient + momentum
Even where the gradient = 0, momentum keeps the parameters moving.
This still does not guarantee reaching the global minimum, but it gives some hope ……
(Real movement = negative of gradient + momentum carried over from the previous step)
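A sketch of the momentum idea (not from the slides): the movement is the negative gradient plus a fraction of the previous movement. The cost surface and the momentum coefficient 0.9 are assumed example values.

```python
import numpy as np

def grad_C(theta):
    # Same stand-in cost surface as before: C(w1, w2) = w1**2 + 10 * w2**2
    w1, w2 = theta
    return np.array([2 * w1, 20 * w2])

eta, beta = 0.05, 0.9             # learning rate and momentum coefficient (assumed)
theta = np.array([3.0, -2.0])
movement = np.zeros(2)
for _ in range(100):
    # movement = momentum (fraction of previous movement) + negative of gradient
    movement = beta * movement - eta * grad_C(theta)
    theta = theta + movement
print(theta)
```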
![Page 28: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/28.jpg)
Mini-batch
Randomly initialize θ^0.
Pick the 1st mini-batch (e.g. x^1, x^31, ...): C = L^1 + L^31 + ⋯, update θ^1 ← θ^0 − η∇C(θ^0).
Pick the 2nd mini-batch (e.g. x^2, x^16, ...): C = L^2 + L^16 + ⋯, update θ^2 ← θ^1 − η∇C(θ^1).
…
C is different each time we update the parameters!
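A sketch of the mini-batch loop just described (not from the slides): shuffle the data, split it into batches, and update once per batch. The gradient function and the toy fitting problem are placeholders for whatever backpropagation would return.

```python
import numpy as np

def minibatch_sgd(theta, data, grad_fn, eta=0.1, batch_size=2, epochs=1, seed=0):
    """One update per mini-batch; C is different each time we update the parameters."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):                      # one pass over all mini-batches = one epoch
        idx = rng.permutation(len(data))
        for start in range(0, len(data), batch_size):
            batch = [data[i] for i in idx[start:start + batch_size]]
            theta = theta - eta * grad_fn(theta, batch)   # theta^{t+1} = theta^t - eta * grad C
    return theta

# Toy example: fit a scalar theta to the mean of the data (gradient of 0.5*(theta - x)^2)
data = [1.0, 2.0, 3.0, 4.0]
grad_fn = lambda theta, batch: np.mean([theta - x for x in batch])
print(minibatch_sgd(0.0, data, grad_fn, epochs=50))   # ≈ 2.5
```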
![Page 29: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/29.jpg)
Mini-batch
Original gradient descent vs. gradient descent with mini-batches: the mini-batch path is unstable.
The colors represent the total C on all training data.
![Page 30: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/30.jpg)
Mini-batch
Randomly initialize θ^0.
Pick the 1st mini-batch: C = C^1 + C^31 + ⋯, θ^1 ← θ^0 − η∇C(θ^0).
Pick the 2nd mini-batch: C = C^2 + C^16 + ⋯, θ^2 ← θ^1 − η∇C(θ^1).
… until all mini-batches have been picked: that is one epoch.
Repeat the above process.
Mini-batch training is not only faster, it is also better!
![Page 31: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/31.jpg)
Backpropagation
• A network can have millions of parameters.
• Backpropagation is the way to compute the gradients efficiently (not covered today)
• Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html
• Many toolkits can compute the gradients automatically
Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Theano%20DNN.ecm.mp4/index.html
![Page 32: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/32.jpg)
Part II: Why Deep?
![Page 33: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/33.jpg)
Deeper is Better?

| Layer x Size | Word Error Rate (%) | Layer x Size | Word Error Rate (%) |
|---|---|---|---|
| 1 x 2k | 24.2 | | |
| 2 x 2k | 20.4 | | |
| 3 x 2k | 18.4 | | |
| 4 x 2k | 17.8 | | |
| 5 x 2k | 17.2 | 1 x 3772 | 22.5 |
| 7 x 2k | 17.1 | 1 x 4634 | 22.6 |
| | | 1 x 16k | 22.1 |

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech 2011.
Not surprising: more parameters, better performance.
![Page 34: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/34.jpg)
Universality Theorem
Any continuous function f: R^N → R^M can be realized by a network with one hidden layer (given enough hidden neurons).
Reference for the reason: http://neuralnetworksanddeeplearning.com/chap4.html
Why a "deep" neural network rather than a "fat" neural network?
![Page 35: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/35.jpg)
Fat + Short vs. Thin + Tall
A shallow network (one wide hidden layer over x1, x2, ..., xN) vs. a deep network (many narrower layers), with the same number of parameters: which one is better?
![Page 36: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/36.jpg)
Fat + Short vs. Thin + Tall
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech 2011.

| Layer x Size | Word Error Rate (%) | Layer x Size | Word Error Rate (%) |
|---|---|---|---|
| 1 x 2k | 24.2 | | |
| 2 x 2k | 20.4 | | |
| 3 x 2k | 18.4 | | |
| 4 x 2k | 17.8 | | |
| 5 x 2k | 17.2 | 1 x 3772 | 22.5 |
| 7 x 2k | 17.1 | 1 x 4634 | 22.6 |
| | | 1 x 16k | 22.1 |

With a comparable number of parameters, the deep networks (left) achieve lower word error rates than the shallow ones (right).
![Page 37: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/37.jpg)
Why Deep?
• Deep → Modularization
Train four classifiers directly on images: Classifier 1 for girls with long hair, Classifier 2 for boys with short hair, Classifier 3 for boys with long hair, Classifier 4 for girls with short hair.
There are plenty of example images for most classes, but only a few examples of boys with long hair, so that classifier is weak.
![Page 38: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/38.jpg)
Why Deep?
• Deep → Modularization
Instead, first train basic classifiers for the attributes: long or short hair? boy or girl?
For these attribute classifiers, the same images are regrouped (long hair vs. short hair, boys vs. girls), so each basic classifier can have sufficient training examples.
![Page 39: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/39.jpg)
Why Deep?
• Deep → Modularization
The attribute classifiers (long or short hair? boy or girl?) are shared as modules by the following classifiers: Classifier 1 (girls with long hair), Classifier 2 (boys with short hair), Classifier 3 (boys with long hair), Classifier 4 (girls with short hair).
Because each of these classifiers builds on the shared modules, it can be trained with little data and still work fine.
![Page 40: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/40.jpg)
Why Deep?
• Deep → Modularization
The first layer learns the most basic classifiers; the second layer uses the first layer as modules to build more complex classifiers, and so on.
The modularization is automatically learned from data.
→ Less training data?
Deep learning also works on small data sets like TIMIT.
![Page 41: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/41.jpg)
Hand-crafted kernel function → SVM: apply a simple classifier on top of a fixed feature transform.
Deep Learning: the hidden layers learn a feature transform φ(x) (a learnable kernel) from x1, x2, ..., xN, and the last layer applies a simple classifier.
Source of image: http://www.gipsa-lab.grenoble-inp.fr/transfert/seminaire/455_Kadri2013Gipsa-lab.pdf
![Page 42: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/42.jpg)
Hard to get the power of Deep …
Before 2006, going deeper usually did not give better results.
![Page 43: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/43.jpg)
Part III: Tips for Training DNN
![Page 44: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/44.jpg)
Recipe for Learning
http://www.gizmodo.com.au/2015/04/the-basic-recipe-for-machine-learning-explained-in-a-single-powerpoint-slide/
![Page 45: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/45.jpg)
Recipe for Learning
http://www.gizmodo.com.au/2015/04/the-basic-recipe-for-machine-learning-explained-in-a-single-powerpoint-slide/
Don't forget about overfitting!
Three directions: modify the network, use a better optimization strategy, and prevent overfitting.
![Page 46: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/46.jpg)
Recipe for Learning
• Modify the network: new activation functions, for example ReLU or Maxout
• Better optimization strategy: adaptive learning rates
• Prevent overfitting: Dropout
Only use the overfitting-prevention approach when you have already obtained good results on the training data.
![Page 47: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/47.jpg)
Part III: Tips for Training DNN - New Activation Function
![Page 48: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/48.jpg)
ReLU
• Rectified Linear Unit (ReLU): a = z for z > 0 and a = 0 for z ≤ 0 (compare with the sigmoid σ(z))
Reasons:
1. Fast to compute
2. Biological reason
3. Equivalent to an infinite number of sigmoids with different biases
4. Addresses the vanishing gradient problem
[Xavier Glorot, AISTATS'11][Andrew L. Maas, ICML'13][Kaiming He, arXiv'15]
![Page 49: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/49.jpg)
Vanishing Gradient Problem
In the layers close to the input, the gradients are smaller, learning is very slow, and the weights remain almost random; in the layers close to the output, the gradients are larger, learning is very fast, and those layers already converge, based on the nearly random earlier layers!?
In 2006, people used RBM pre-training. In 2015, people use ReLU.
![Page 50: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/50.jpg)
Vanishing Gradient Problem
An intuitive way to compute the gradient: ∂C/∂w ≈ ΔC/Δw, i.e. perturb a weight by +Δw and observe the resulting change +ΔC in the cost.
For a weight in an early layer, the perturbation is attenuated by each sigmoid it passes through (a large change of the input gives a small change of the output), so ΔC is small and the gradients in the early layers are smaller.
![Page 51: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/51.jpg)
ReLU
For a given input, the neurons whose ReLU output is 0 (a = 0) contribute nothing and can be removed from the network.
![Page 52: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/52.jpg)
ReLU
What remains is a thinner, linear network (a = z on the active neurons), so the gradients do not become smaller as they propagate back through the layers.
![Page 53: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/53.jpg)
Maxout
• Learnable activation function [Ian J. Goodfellow, ICML'13]
The pre-activations of each layer are grouped, and a max operation outputs the largest value in each group, e.g. max(5, 7) = 7, max(−1, 1) = 1, max(1, 2) = 2, max(4, 3) = 4.
ReLU is a special case of Maxout.
You can have more than 2 elements in a group.
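A sketch of a maxout unit following the slide's example (not from the slides): the pre-activations are grouped and the maximum in each group is output. Pairing each pre-activation with a constant 0 reduces it to ReLU.

```python
import numpy as np

def maxout(z, group_size=2):
    # z: vector of pre-activations; output the max within each group
    z = np.asarray(z, dtype=float).reshape(-1, group_size)
    return z.max(axis=1)

print(maxout([5, 7, -1, 1]))   # [7. 1.]  (the example groups from the slide)
print(maxout([1, 2, 4, 3]))    # [2. 4.]

# ReLU as a special case: pair each pre-activation with a constant 0 element
def relu_via_maxout(z):
    z = np.asarray(z, dtype=float)
    return maxout(np.stack([z, np.zeros_like(z)], axis=1).ravel())

print(relu_via_maxout([4.0, -2.0]))   # [4. 0.]
```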
![Page 54: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/54.jpg)
Maxout
• Learnable activation function [Ian J. Goodfellow, ICML'13]
• The activation function in a maxout network can be any piecewise linear convex function
• The number of pieces depends on the number of elements in a group (e.g. 2 elements vs. 3 elements per group)
ReLU is a special case of Maxout.
![Page 55: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/55.jpg)
Part III: Tips for Training DNN - Adaptive Learning Rate
![Page 56: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/56.jpg)
Learning Rate
If the learning rate is too large, the cost may not decrease after each update.
Set the learning rate η carefully.
![Page 57: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/57.jpg)
Learning Rate
If the learning rate is too large, the cost may not decrease after each update.
If the learning rate is too small, training is too slow.
Can we give different parameters different learning rates?
![Page 58: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/58.jpg)
Adagrad
Original gradient descent: θ^t ← θ^{t−1} − η∇C(θ^{t−1})
Adagrad uses a parameter-dependent learning rate; each parameter w is considered separately:
w^{t+1} ← w^t − η_w g^t, where g^t = ∂C(θ^t)/∂w
η_w = η / √( Σ_{i=0..t} (g^i)^2 )
η is a constant; the denominator is the summation of the squares of the previous derivatives.
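A sketch of the Adagrad update for a single parameter, following the formula above; the small epsilon in the denominator is a standard safeguard that is not on the slide.

```python
import numpy as np

def adagrad_updates(grads, w0=0.0, eta=0.1, eps=1e-8):
    """Apply w^{t+1} = w^t - eta / sqrt(sum_i (g^i)^2) * g^t for one parameter."""
    w, sum_sq = w0, 0.0
    for g in grads:
        sum_sq += g ** 2
        w -= eta / (np.sqrt(sum_sq) + eps) * g
    return w

# The per-step learning rates from the next slide's example (with eta = 0.1):
print(0.1 / np.sqrt(0.1**2), 0.1 / np.sqrt(0.1**2 + 0.2**2))     # eta/0.1,  eta/0.22
print(0.1 / np.sqrt(20.0**2), 0.1 / np.sqrt(20.0**2 + 10.0**2))  # eta/20,   eta/22.4
```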
![Page 59: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/59.jpg)
Adagrad
Example: parameter w1 has derivatives g^0 = 0.1, g^1 = 0.2, …; parameter w2 has derivatives g^0 = 20.0, g^1 = 10.0, …
Learning rate for w1: η/√(0.1²) = η/0.1, then η/√(0.1² + 0.2²) ≈ η/0.22, …
Learning rate for w2: η/√(20²) = η/20, then η/√(20² + 10²) ≈ η/22, …
Observations:
1. The learning rate becomes smaller and smaller for all parameters.
2. Smaller derivatives, larger learning rate, and vice versa. Why?
![Page 60: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/60.jpg)
Why do smaller derivatives get a larger learning rate, and vice versa?
In a direction with smaller derivatives (a flatter direction of the error surface), a larger learning rate is needed to make progress; in a direction with larger derivatives (a steeper direction), a smaller learning rate avoids overshooting.
![Page 61: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/61.jpg)
Not the whole story ……
• Adagrad [John Duchi, JMLR'11]
• RMSprop: https://www.youtube.com/watch?v=O3sxAc4hxZU
• Adadelta [Matthew D. Zeiler, arXiv'12]
• Adam [Diederik P. Kingma, ICLR'15]
• AdaSecant [Caglar Gulcehre, arXiv'14]
• "No more pesky learning rates" [Tom Schaul, arXiv'12]
![Page 62: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/62.jpg)
Part III: Tips for Training DNN - Dropout
![Page 63: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/63.jpg)
Dropout
Training: θ^t ← θ^{t−1} − η∇C(θ^{t−1})
Each time before computing the gradients: pick a mini-batch, and each neuron has a p% chance to drop out.
![Page 64: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/64.jpg)
Dropout
Training: each time before computing the gradients, pick a mini-batch and let each neuron drop out with probability p%.
The structure of the network is changed: training uses the new, thinner network.
For each mini-batch, we resample the dropout neurons.
![Page 65: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/65.jpg)
Dropout
Testing: no dropout.
If the dropout rate at training is p%, all the weights are multiplied by (1−p)%.
Assume the dropout rate is 50%: if a weight w = 1 after training, set w = 0.5 for testing.
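A sketch of dropout at training time (each input neuron dropped with probability p) and the corresponding weight scaling at test time; the bare linear-plus-sigmoid layer and the random values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_train(a, W, b, p=0.5):
    # Training: each neuron in a has probability p to drop out (resampled per mini-batch)
    mask = rng.random(a.shape) >= p
    return sigmoid(W @ (a * mask) + b)

def layer_test(a, W, b, p=0.5):
    # Testing: no dropout, but all the weights are multiplied by (1 - p)
    return sigmoid(((1.0 - p) * W) @ a + b)

a = rng.random(4)
W, b = rng.standard_normal((3, 4)), np.zeros(3)
print(layer_train(a, W, b))
print(layer_test(a, W, b))
```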
![Page 66: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/66.jpg)
Dropout - Intuitive Reason
When people team up, if everyone expects the partners to do the work, nothing gets done in the end.
However, if you know your partner may drop out, you will do better. ("My partner may slack off, so I have to do a good job.")
When testing, no one actually drops out, so good results are obtained.
![Page 67: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/67.jpg)
Dropout - Intuitive Reason
• Why should the weights be multiplied by (1−p)% when testing?
Training with dropout (dropout rate 50%): z = w1 x1 + w2 x2 + w3 x3 + w4 x4, but on average half of the inputs are dropped.
Testing without dropout, using the weights from training directly, gives z′ ≈ 2z.
Multiplying the weights by (1−p)% (here 0.5) gives z′ ≈ z.
![Page 68: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/68.jpg)
Dropout is a kind of ensemble.
Ensemble: train a bunch of networks with different structures (Network 1, Network 2, Network 3, Network 4), each on a different subset (Set 1, Set 2, Set 3, Set 4) of the training set.
![Page 69: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/69.jpg)
Dropout is a kind of ensemble.
Ensemble: at test time, feed the testing data x to Network 1, Network 2, Network 3, Network 4 and average their outputs y1, y2, y3, y4.
![Page 70: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/70.jpg)
Dropout is a kind of ensemble.
Training of dropout: each mini-batch (minibatch 1, minibatch 2, minibatch 3, minibatch 4, ...) trains one of the thinned networks.
Some parameters in these networks are shared.
With M neurons, there are 2^M possible thinned networks.
![Page 71: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/71.jpg)
Dropout is a kind of ensemble.
Testing of dropout: ideally, feed the testing data x to all the thinned networks and average their outputs y1, y2, y3, ...
Instead, use the full network with all the weights multiplied by (1−p)%; the result is approximately the same (≈ y).
![Page 72: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/72.jpg)
More about dropout
• More references for dropout: [Nitish Srivastava, JMLR'14][Pierre Baldi, NIPS'13][Geoffrey E. Hinton, arXiv'12]
• Dropout works better with Maxout [Ian J. Goodfellow, ICML'13]
• Dropconnect [Li Wan, ICML'13]: dropout deletes neurons; dropconnect deletes the connections between neurons
• Annealed dropout [S.J. Rennie, SLT'14]: the dropout rate decreases over epochs
• Standout [J. Ba, NIPS'13]: each neuron has its own dropout rate
![Page 73: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/73.jpg)
Part IV: Neural Network with Memory
![Page 74: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/74.jpg)
Neural Network needs Memory
• Named Entity Recognition: detecting named entities such as names of people, locations, organizations, etc. in a sentence.
The word "apple" is encoded as a 1-of-N vector (…, 0, 0, 0, 1, …) and fed into a DNN, which outputs a distribution over people / location / organization / none, e.g. (0.5, 0.3, 0.1, 0.1).
![Page 75: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/75.jpg)
Neural Network needs Memory
• Named Entity Recognition
In "the president of apple eats an apple" (x1 … x7), the target for "apple" at x4 is ORG but the target at x7 is NONE.
A DNN applied to each word independently produces the same output for both occurrences, so the DNN needs memory!
![Page 76: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/76.jpg)
Recurrent Neural Network (RNN)
The outputs of the hidden layer (a1, a2) are copied into the memory, and the memory can be considered as another input alongside x1, x2 when computing y1, y2.
![Page 77: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/77.jpg)
RNN
The same network (with weights Wi, Wh, Wo) is used again and again: at each time step the hidden state a^1, a^2, a^3, … is copied into the memory and combined with the next input x^1, x^2, x^3, ….
The output y^i depends on x^1, x^2, …, x^i.
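A minimal sketch of the recurrence drawn on the slide (not the slide's own code): the same weights Wi, Wh, Wo are reused at every time step, and the hidden state plays the role of the memory. The tanh hidden activation and the random weights are assumptions.

```python
import numpy as np

def rnn_forward(xs, Wi, Wh, Wo, b, c):
    """Run a simple RNN over a list of input vectors; y^i depends on x^1 ... x^i."""
    a = np.zeros(Wh.shape[0])              # memory, initialized to zero
    ys = []
    for x in xs:
        a = np.tanh(Wi @ x + Wh @ a + b)   # new hidden state (stored back into the memory)
        ys.append(Wo @ a + c)              # output at this time step
    return ys

rng = np.random.default_rng(0)
Wi, Wh, Wo = rng.standard_normal((3, 2)), rng.standard_normal((3, 3)), rng.standard_normal((2, 3))
b, c = np.zeros(3), np.zeros(2)
xs = [rng.random(2) for _ in range(4)]
print([y.round(2) for y in rnn_forward(xs, Wi, Wh, Wo, b, c)])
```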
![Page 78: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/78.jpg)
RNN - How to train?
Each output y^1, y^2, y^3, … is compared with its target ŷ^1, ŷ^2, ŷ^3, … to give the costs L^1, L^2, L^3, ….
Find the network parameters that minimize the total cost, using backpropagation through time (BPTT).
![Page 79: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/79.jpg)
Of course it can be deep …
Multiple recurrent hidden layers can be stacked between the inputs x^t, x^{t+1}, x^{t+2}, … and the outputs y^t, y^{t+1}, y^{t+2}, …
![Page 80: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/80.jpg)
Bidirectional RNN
One RNN reads the inputs x^t, x^{t+1}, x^{t+2} forward and another reads them backward; each output y^t, y^{t+1}, y^{t+2} is produced from the hidden states of both directions.
![Page 81: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/81.jpg)
Many to Many (Output is shorter)
• Both input and output are sequences, but the output is shorter.
• E.g. speech recognition: the input is a vector sequence (acoustic frames), the output is a character sequence.
Trimming the frame-level outputs "好 好 好 棒 棒 棒 棒 棒" gives "好棒".
Problem: then why can't the output be "好棒棒"?
![Page 82: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/82.jpg)
Many to Many (Output is shorter)
• Both input and output are sequences, but the output is shorter.
• Connectionist Temporal Classification (CTC) [Alex Graves, ICML'06][Alex Graves, ICML'14][Haşim Sak, Interspeech'15][Jie Li, Interspeech'15][Andrew Senior, ASRU'15]
Add an extra symbol "φ" representing "null": "好 φ φ 棒 φ φ φ φ" decodes to "好棒", while "好 φ φ 棒 φ 棒 φ φ" decodes to "好棒棒".
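A sketch of the decoding rule implied by the slide (not the full CTC training algorithm): collapse repeated labels and then remove the null symbol φ, so the two frame sequences above yield different outputs.

```python
def ctc_collapse(labels, blank="φ"):
    # Merge consecutive repeats, then drop the blank symbol
    out = []
    prev = None
    for s in labels:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return "".join(out)

print(ctc_collapse(list("好φφ棒φφφφ")))   # 好棒
print(ctc_collapse(list("好φφ棒φ棒φφ")))  # 好棒棒
```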
![Page 83: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/83.jpg)
Many to Many (No Limitation)
• Both input and output are sequences with different lengths. → Sequence-to-sequence learning
• E.g. machine translation (machine learning → 機器學習)
The RNN reads "machine learning"; its final state contains all the information about the input sequence.
![Page 84: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/84.jpg)
Many to Many (No Limitation)
• Both input and output are sequences with different lengths. → Sequence-to-sequence learning
• E.g. machine translation (machine learning → 機器學習)
The decoder then generates "機 器 學 習 …" one character at a time, but it doesn't know when to stop and may keep generating (e.g. "慣 性").
![Page 85: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/85.jpg)
Many to Many (No Limitation)
推 tlkagk: =========斷==========
Ref:http://zh.pttpedia.wikia.com/wiki/%E6%8E%A5%E9%BE%8D%E6%8E%A8%E6%96%87 (鄉民百科)
![Page 86: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/86.jpg)
Many to Many (No Limitation)
• Both input and output are sequences with different lengths. → Sequence-to-sequence learning
• E.g. machine translation (machine learning → 機器學習)
Add a symbol "===" (斷); the decoder outputs it after "機 器 學 習" to mark the end.
[Ilya Sutskever, NIPS'14][Dzmitry Bahdanau, arXiv'15]
![Page 87: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/87.jpg)
Unfortunately ……
• RNN-based networks are not always easy to learn
Real experiments on language modeling (experimental results courtesy of 曾柏翔): training succeeds only sometimes, when you are lucky.
![Page 88: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/88.jpg)
The error surface is rough.
The cost as a function of w1 and w2 is either very flat or very steep.
Remedy: clipping the gradients. [Razvan Pascanu, ICML'13]
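A sketch of gradient clipping as mentioned on the slide: if the norm of the gradient exceeds a threshold, rescale it; the threshold value here is arbitrary.

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    # Rescale the gradient if its norm exceeds the threshold
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

print(clip_gradient(np.array([30.0, 40.0])))   # rescaled to norm 5: [3. 4.]
print(clip_gradient(np.array([0.3, 0.4])))     # unchanged
```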
![Page 89: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/89.jpg)
Why?
Toy example: a 1000-step RNN with a single recurrent weight w and linear activations, so y^1000 = w^999.
w = 1 → y^1000 = 1
w = 1.01 → y^1000 ≈ 20000 (large gradient → use a small learning rate?)
w = 0.99 → y^1000 ≈ 0
w = 0.01 → y^1000 ≈ 0 (small gradient → use a large learning rate?)
![Page 90: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/90.jpg)
Helpful Techniques
• Nesterov's Accelerated Gradient (NAG): an advanced momentum method
• RMSProp: an advanced approach to giving each parameter a different learning rate, considering the change of the second derivatives
• Long Short-term Memory (LSTM): can deal with gradient vanishing (but not gradient explosion)
![Page 91: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/91.jpg)
Long Short-term Memory (LSTM)
A memory cell with three gates: an input gate, an output gate, and a forget gate; each gate is controlled by a signal coming from other parts of the network.
The LSTM is a special neuron with 4 inputs and 1 output.
![Page 92: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/92.jpg)
The four inputs are z, z_i, z_f, z_o. The gate activation function f is usually a sigmoid, so its value lies between 0 and 1, mimicking an open or closed gate.
Cell update: c′ = g(z) f(z_i) + c f(z_f)
Output: a = h(c′) f(z_o)
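A sketch of the cell update written above, with sigmoid gates for f and tanh for g and h (a common choice; the slide only specifies that the gate activation is a sigmoid). The input values are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(z, z_i, z_f, z_o, c):
    """One LSTM step: 4 inputs (z and the three gate signals), 1 output."""
    g, h = np.tanh, np.tanh                          # assumed activations for g and h
    c_new = g(z) * sigmoid(z_i) + c * sigmoid(z_f)   # c' = g(z) f(z_i) + c f(z_f)
    a = h(c_new) * sigmoid(z_o)                      # a  = h(c') f(z_o)
    return a, c_new

a, c = lstm_cell(z=0.5, z_i=2.0, z_f=-2.0, z_o=2.0, c=1.0)
print(round(a, 3), round(c, 3))
```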
![Page 93: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/93.jpg)
Original network: each neuron computes a pre-activation z1, z2 from the inputs x1, x2 and outputs a1, a2.
Simply replace the neurons with LSTM cells.
![Page 94: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/94.jpg)
Each LSTM cell now takes four weighted sums of the inputs x1, x2 (one each for z, z_i, z_f, z_o) to produce a1, a2, so the network has 4 times as many parameters.
![Page 95: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/95.jpg)
LSTM
Unrolled over time: at step t, the input x^t produces z, z_i, z_f, z_o, the cell state is updated from c^{t−1} to c^t, and the outputs y^t and h^t are produced; the same cell is applied again at step t+1.
Extension: "peephole" connections also feed the previous cell state c^{t−1} (together with h^{t−1}) into the computation of the gates.
![Page 96: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/96.jpg)
Other Simpler Alternatives
• Gated Recurrent Unit (GRU) [Cho, EMNLP'14]
• Structurally Constrained Recurrent Network (SCRN) [Tomas Mikolov, ICLR'15]
• Vanilla RNN initialized with the identity matrix + ReLU activation function [Quoc V. Le, arXiv'15]: outperforms or is comparable with LSTM on 4 different tasks
![Page 97: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/97.jpg)
What is the next wave?
• Attention-based Model
A DNN/LSTM takes the input x and produces the output y; a reading head controller and a writing head controller let it read from and write to an internal memory (or information from the output).
Already applied to speech recognition, caption generation, QA, and visual QA.
![Page 98: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/98.jpg)
What is the next wave?
• Attention-based Model references:
• End-To-End Memory Networks. S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus. arXiv Pre-Print, 2015.
• Neural Turing Machines. Alex Graves, Greg Wayne, Ivo Danihelka. arXiv Pre-Print, 2014.
• Ask Me Anything: Dynamic Memory Networks for Natural Language Processing. Kumar et al. arXiv Pre-Print, 2015.
• Neural Machine Translation by Jointly Learning to Align and Translate. D. Bahdanau, K. Cho, Y. Bengio. International Conference on Learning Representations, 2015.
• Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Kelvin Xu et al. arXiv Pre-Print, 2015.
• Attention-Based Models for Speech Recognition. Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio. arXiv Pre-Print, 2015.
• Recurrent Models of Visual Attention. V. Mnih, N. Heess, A. Graves, K. Kavukcuoglu. NIPS, 2014.
• A Neural Attention Model for Abstractive Sentence Summarization. A. M. Rush, S. Chopra, J. Weston. EMNLP 2015.
![Page 99: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/99.jpg)
Concluding Remarks
![Page 100: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/100.jpg)
Concluding Remarks
• Introduction of deep learning
• Discussing some reasons for using deep learning
• New techniques for deep learning
• ReLU, Maxout
• Giving all the parameters different learning rates
• Dropout
• Network with memory
• Recurrent neural network
• Long short-term memory (LSTM)
![Page 101: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/101.jpg)
Reading Materials
• “Neural Networks and Deep Learning”
• written by Michael Nielsen
• http://neuralnetworksanddeeplearning.com/
• “Deep Learning” (not finished yet)
• Written by Yoshua Bengio, Ian J. Goodfellow and Aaron Courville
• http://www.iro.umontreal.ca/~bengioy/dlbook/
![Page 102: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/102.jpg)
Thank you for your attention!
![Page 103: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/103.jpg)
Acknowledgement
• Thanks to Ryan Sun for writing in to point out typos in the slides.
![Page 104: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/104.jpg)
Appendix
![Page 105: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/105.jpg)
Matrix Operation
σ( [1 −2; −1 1] [1; −1] + [1; 0] ) = [0.98; 0.12]
i.e. σ( W x + b ) = a
![Page 106: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/106.jpg)
Why Deep? – Logic Circuits
• Two levels of basic logic gates can represent any Boolean function.
• However, no one uses two levels of logic gates to build computers.
• Using multiple layers of logic gates to build some functions is much simpler (fewer gates needed).
![Page 107: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/107.jpg)
Boosting: the input x is fed to weak classifiers, which are combined into a boosted weak classifier; boosted classifiers can in turn be combined further.
Deep Learning plays a similar role: starting from the input x1, x2, …, xN, each layer builds on the previous one to form increasingly powerful classifiers.
![Page 108: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/108.jpg)
Maxout
ReLU: z = wx + b and a = max(z, 0).
Maxout with two elements per group: z1 = wx + b, z2 = 0, and a = max(z1, z2), which is exactly the same input-output function.
ReLU is a special case of Maxout.
![Page 109: Deep Learning Tutorial - 國立臺灣大學](https://reader031.vdocument.in/reader031/viewer/2022012413/616caeb8d7ebad4e7f2d19e5/html5/thumbnails/109.jpg)
Maxout
More generally, z1 = wx + b and z2 = w′x + b′, and a = max(z1, z2): a learnable (piecewise linear) activation function.
ReLU is a special case of Maxout.