TRANSCRIPT
Tips for Training Neural Networks
(scratching the surface)
Two Concerns
• There are two things you have to be concerned about.
Optimization
• Can I find the “best” parameter set θ* in a limited amount of time?
Generalization
• Is the “best” parameter set θ* good for testing data as well?
Initialization
• For gradient descent, we need to pick an initial parameter set θ^0.
• Do not set all the parameters in θ^0 equal.
• Set the parameters in θ^0 randomly.
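Why the second bullet matters: if all parameters start equal, every unit computes the same output, receives the same gradient, and stays identical forever. A minimal sketch (the two-unit tanh network and all values here are illustrative, not from the slides):

```python
import numpy as np

# Toy hidden layer: two tanh units, output y_hat = sum of hidden activations,
# cost C = 0.5 * (y_hat - y)^2.  Returns dC/dW1.
def hidden_grads(W1, x, y):
    h = np.tanh(W1 @ x)
    y_hat = h.sum()
    # chain rule: dC/dW1[i, j] = (y_hat - y) * (1 - h_i^2) * x_j
    return ((y_hat - y) * (1.0 - h ** 2))[:, None] * x[None, :]

x, y = np.array([1.0, 2.0]), 1.0

W_equal = np.full((2, 2), 0.5)        # all parameters equal ...
g_eq = hidden_grads(W_equal, x, y)
print(np.allclose(g_eq[0], g_eq[1]))  # True: both units get identical gradients

rng = np.random.default_rng(0)
W_rand = 0.1 * rng.standard_normal((2, 2))  # random init breaks the symmetry
g_rd = hidden_grads(W_rand, x, y)
print(np.allclose(g_rd[0], g_rd[1]))
```

With equal rows the two units update in lockstep and can never learn different features; random initialization gives them different gradients from the first step.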
Learning Rate
• Toy Example
[Diagram: a single linear neuron with weight w and bias b; z = wx + b, y = z]
x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5]
y = [0.1, 0.4, 0.9, 1.6, 2.2, 2.5, 2.8, 3.5, 3.9, 4.7, 5.1, 5.3, 6.3, 6.5, 6.7, 7.5, 8.1, 8.5, 8.9, 9.5]
Training Data (20 examples)
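Assuming the toy model is y = wx + b with squared-error cost C(w, b) = (1/R) Σ_r (w·x^r + b − ŷ^r)², batch gradient descent on these 20 examples can be sketched as (initialization and learning rate are assumed values, not from the slides):

```python
import numpy as np

# The 20 toy training examples from the slide.
x = np.arange(0.0, 10.0, 0.5)
y = np.array([0.1, 0.4, 0.9, 1.6, 2.2, 2.5, 2.8, 3.5, 3.9, 4.7,
              5.1, 5.3, 6.3, 6.5, 6.7, 7.5, 8.1, 8.5, 8.9, 9.5])

w, b, eta = 0.0, 0.0, 0.01           # assumed start point and learning rate
for _ in range(1000):
    err = w * x + b - y              # residuals on all R = 20 examples
    w -= eta * 2 * np.mean(err * x)  # dC/dw
    b -= eta * 2 * np.mean(err)      # dC/db
print(round(w, 2), round(b, 2))      # w ends up close to 1, b close to 0
```

Since the data is nearly y = x, the fitted line should come out with slope ≈ 1 and a small intercept.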
[Figure: error surface C(w, b) over the (w, b) plane, with the starting point θ^0 and the optimum θ*]
θ^i = θ^{i−1} − η∇C(θ^{i−1})
• Set the learning rate η carefully
• Toy Example
Learning Rate
[Figure: error surface C(w, b), showing the gradient descent path from start to target]
θ^i = θ^{i−1} − η∇C(θ^{i−1})
• Toy Example
[Figure: cost vs. number of updates (in thousands) under different learning rates η]
Different learning rate η
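The effect in the figure can be reproduced numerically: count how many updates each η needs to drive the cost below a threshold on the toy problem (the η values, tolerance, and idealized targets below are illustrative assumptions):

```python
import numpy as np

x = np.arange(0.0, 10.0, 0.5)
y = x.copy()  # idealized targets y = x, so the optimum cost is exactly 0

def updates_needed(eta, tol=1e-3, max_updates=200_000):
    """Gradient descent on C(w, b); return updates until C < tol, or None."""
    w, b = 0.0, 0.0
    for i in range(max_updates):
        err = w * x + b - y
        cost = np.mean(err ** 2)
        if not np.isfinite(cost):
            return None          # learning rate too large: cost exploded
        if cost < tol:
            return i
        w -= eta * 2 * np.mean(err * x)
        b -= eta * 2 * np.mean(err)
    return None

for eta in (0.04, 0.01, 0.001):
    print(eta, updates_needed(eta))
```

Too large an η makes the cost blow up (the 0.04 run diverges and returns None); smaller values of η do converge, but each reduction in η costs roughly proportionally more updates.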
Gradient Descent
C(θ) = (1/R) Σ_{r=1}^{R} ‖f(x^r; θ) − ŷ^r‖² = (1/R) Σ_{r=1}^{R} C^r(θ)
Gradient Descent:
θ^i = θ^{i−1} − η∇C(θ^{i−1}), with ∇C(θ) = (1/R) Σ_{r=1}^{R} ∇C^r(θ)

Stochastic Gradient Descent:
Pick an example x^r: θ^i = θ^{i−1} − η∇C^r(θ^{i−1})

If all examples x^r have equal probability of being picked:
E[∇C^r(θ)] = (1/R) Σ_{r=1}^{R} ∇C^r(θ) = ∇C(θ)
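A sketch of stochastic gradient descent on the earlier toy data (model y = wx + b; learning rate and epoch count are assumed values): one update per example, reshuffling every epoch.

```python
import numpy as np

x = np.arange(0.0, 10.0, 0.5)
y = np.array([0.1, 0.4, 0.9, 1.6, 2.2, 2.5, 2.8, 3.5, 3.9, 4.7,
              5.1, 5.3, 6.3, 6.5, 6.7, 7.5, 8.1, 8.5, 8.9, 9.5])

rng = np.random.default_rng(0)
w, b, eta = 0.0, 0.0, 0.01
for epoch in range(50):
    for r in rng.permutation(len(x)):  # see every example once per epoch
        err = w * x[r] + b - y[r]      # gradient of C^r alone, not of C
        w -= eta * 2 * err * x[r]
        b -= eta * 2 * err
print(round(w, 1))                     # hovers near the solution w ≈ 1
```

Each single-example gradient is a noisy estimate of ∇C, but because its expectation equals ∇C (equal picking probabilities), the parameters still drift toward the same optimum, just along a jittery path.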
Gradient Descent vs. Stochastic Gradient Descent
Training Data: (x^1, ŷ^1), (x^2, ŷ^2), …, (x^r, ŷ^r), …, (x^R, ŷ^R)
Starting at θ^0:
Pick x^1: θ^1 = θ^0 − η∇C^1(θ^0)
Pick x^2: θ^2 = θ^1 − η∇C^2(θ^1)
……
Pick x^r: θ^r = θ^{r−1} − η∇C^r(θ^{r−1})
……
Pick x^R: θ^R = θ^{R−1} − η∇C^R(θ^{R−1})
Seen all the examples once: one epoch
Pick x^1 again: θ^{R+1} = θ^R − η∇C^1(θ^R)
Gradient Descent
• Toy Example
Gradient Descent vs. Stochastic Gradient Descent, 1 epoch:
• Gradient Descent: sees all examples for a single update
• Stochastic Gradient Descent: sees only one example per update, so it updates 20 times in an epoch
Gradient Descent
Shuffle your data
Stochastic Gradient Descent:
Pick an example x^r: θ^i = θ^{i−1} − η∇C^r(θ^{i−1})

Mini-Batch Gradient Descent:
Pick B examples as a batch b (B is the batch size)
θ^i = θ^{i−1} − η · (1/B) Σ_{x^r ∈ b} ∇C^r(θ^{i−1})
Average the gradients of the examples in the batch b.
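A sketch of the mini-batch update on the same toy data, with an assumed batch size B = 4: the per-example gradients are averaged over each batch before updating.

```python
import numpy as np

x = np.arange(0.0, 10.0, 0.5)
y = np.array([0.1, 0.4, 0.9, 1.6, 2.2, 2.5, 2.8, 3.5, 3.9, 4.7,
              5.1, 5.3, 6.3, 6.5, 6.7, 7.5, 8.1, 8.5, 8.9, 9.5])

rng = np.random.default_rng(0)
w, b, eta, B = 0.0, 0.0, 0.01, 4
for epoch in range(200):
    order = rng.permutation(len(x))        # shuffle, then cut into batches
    for s in range(0, len(x), B):          # 20 / 4 = 5 updates per epoch
        idx = order[s:s + B]
        err = w * x[idx] + b - y[idx]
        w -= eta * 2 * np.mean(err * x[idx])  # (1/B) * sum of per-example grads
        b -= eta * 2 * np.mean(err)
print(round(w, 1))
```

Mini-batches sit between the two extremes: fewer, less noisy updates than batch size 1, but far cheaper per update than full-batch gradient descent.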
Gradient Descent:
θ^i = θ^{i−1} − η∇C(θ^{i−1}), with ∇C(θ) = (1/R) Σ_{r=1}^{R} ∇C^r(θ)
Gradient Descent
• Real Example: Handwriting Digit Classification
[Figure: results with batch size = 1 (stochastic) vs. full-batch gradient descent]
Two Concerns
• There are two things you have to be concerned about.
Optimization
• Can I find the “best” parameter set θ* in a limited amount of time?
Generalization
• Is the “best” parameter set θ* good for testing data as well?
Generalization
• You pick a “best” parameter set θ*
Training Data: {(x^r, ŷ^r)}: f(x^r; θ*) ≈ ŷ^r for all r
Testing Data: {x^u}: hope that f(x^u; θ*) ≈ ŷ^u
However, training data and testing data can have different distributions.
Panacea
• Have more training data if possible ……
• Create more training data (?)
Handwriting recognition:
[Figure: original training data, and created training data obtained by shifting the originals by 15°]
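One way to "create" more data is to transform the examples you already have. The slide shifts digit images by 15°; as a simpler stand-in (pure NumPy, no interpolation), this sketch translates a 28×28 image by whole pixels; the image and offsets are made up for illustration:

```python
import numpy as np

def shift_image(img, dy, dx):
    """Translate a 2-D image by (dy, dx) pixels, zero-filling the border."""
    out = np.zeros_like(img)
    h, w = img.shape
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        img[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return out

img = np.zeros((28, 28))
img[10:18, 12:16] = 1.0        # a fake pen stroke
aug = shift_image(img, 2, -1)  # 2 px down, 1 px left: a "new" training example
print(aug.sum() == img.sum())  # True: the stroke moved intact, nothing clipped
```

The label of the shifted image is unchanged, so each transform of a training example yields another labeled example essentially for free.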
Reference
• Chapter 3 of Neural Networks and Deep Learning
• http://neuralnetworksanddeeplearning.com/chap3.html
Appendix
Overfitting
• The function that performs well on the training data does not necessarily perform well on the testing data.
Training Data: {(x^r, ŷ^r)}: f(x^r; θ) ≈ ŷ^r for all r
Testing Data: {x^u}: f(x^u; θ) ≉ ŷ^u
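The gap between the two lines above can be made concrete with a tiny regression experiment (all numbers are made up for illustration): a degree-9 polynomial has enough capacity to memorize 10 noisy training points exactly, driving training error to essentially zero without that carrying over to held-out points.

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda t: 2.0 * t + 0.3             # assumed "true" relationship
x_train = np.linspace(0.0, 1.0, 10)
x_test = np.linspace(0.05, 0.95, 10)         # held-out points between the nodes
y_train = true_f(x_train) + rng.normal(0.0, 0.1, 10)
y_test = true_f(x_test) + rng.normal(0.0, 0.1, 10)

def train_test_error(degree):
    coef = np.polyfit(x_train, y_train, degree)
    tr = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
    te = np.mean((np.polyval(coef, x_test) - y_test) ** 2)
    return tr, te

tr1, te1 = train_test_error(1)  # a sensibly simple model
tr9, te9 = train_test_error(9)  # enough capacity to memorize all 10 points
print(tr9 < tr1)                # higher degree: lower training error ...
print(te9 > tr9)                # ... but the training error does not transfer
```

The degree-9 fit interpolates the noise in the training targets, which is exactly the "memorize the answers" behavior the slide describes.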
Overfitting in our daily life: memorizing the answers to previous exams……
• A joke about overfitting:
• http://xkcd.com/1122/
Initialization
• For gradient descent, we need to pick an initial parameter set θ^0.
• Do not set all the parameters in θ^0 equal.
• Otherwise your parameters will always stay equal, no matter how many times you update them.
• Pick θ^0 randomly.
• If the previous layer has more neurons, the initialization values should be smaller.
• E.g., if layer l − 1 has N_{l−1} neurons:
w^l_{ij} ~ N(0, 1/N_{l−1})  or  w^l_{ij} ~ U(−1/N_{l−1}, 1/N_{l−1})
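A sketch of sampling such fan-in-scaled initial weights with NumPy (the layer widths are example values): with unit-variance inputs, the Gaussian choice keeps the pre-activations at O(1) variance regardless of how wide the previous layer is.

```python
import numpy as np

rng = np.random.default_rng(0)
n_prev, n_cur = 256, 128  # N_{l-1} and N_l (assumed example widths)

# Gaussian: zero mean, variance 1/N_{l-1}, i.e. std = 1/sqrt(N_{l-1})
W_gauss = rng.normal(0.0, 1.0 / np.sqrt(n_prev), size=(n_cur, n_prev))

# Uniform variant with the same fan-in-shrinking scale
W_unif = rng.uniform(-1.0 / n_prev, 1.0 / n_prev, size=(n_cur, n_prev))

# With a unit-variance input, pre-activations z = W x keep O(1) variance,
# so the units neither saturate nor collapse toward zero at the start.
x = rng.standard_normal(n_prev)
z = W_gauss @ x
print(0.5 < z.var() < 2.0)  # True
```

Without the 1/N_{l−1} scaling, a layer with a large fan-in would produce pre-activations with variance proportional to N_{l−1}, saturating activations like tanh from the very first forward pass.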
MNIST
• The MNIST data comes in two parts. The first part contains 60,000 images to be used as training data. These images are scanned handwriting samples from 250 people, half of whom were US Census Bureau employees, and half of whom were high school students. The images are greyscale and 28 by 28 pixels in size. The second part of the MNIST data set is 10,000 images to be used as test data. Again, these are 28 by 28 greyscale images.
• git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git
• http://yann.lecun.com/exdb/mnist/• http://www.deeplearning.net/tutorial/gettingstarted.html
MNIST
• The current (2013) record is classifying 9,979 of 10,000 images correctly. This was done by Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus.
• At that level the performance is close to human-equivalent, and is arguably better, since quite a few of the MNIST images are difficult even for humans to recognize with confidence.
Early Stopping
• For iteration
• Layer
Difficulty of Deep
• Lower layer cannot plan