TRANSCRIPT
Tips for Training Neural Networks
(scratching the surface)
Two Concerns
• There are two things you have to be concerned about.
Optimization
• Can I find the “best” parameter set θ* in a limited amount of time?
Generalization
• Is the “best” parameter set θ* good for testing data as well?
Initialization
• For gradient descent, we need to pick an initial parameter set θ^0.
• Do not set all the parameters in θ^0 equal.
• Set the parameters in θ^0 randomly.
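Why the second bullet matters: if all parameters start equal, every unit computes the same output, receives the same gradient, and stays identical forever. A minimal sketch (the two-unit tanh network and all values here are illustrative, not from the slides):

```python
import numpy as np

# Toy hidden layer: two tanh units, output y_hat = sum of hidden activations,
# cost C = 0.5 * (y_hat - y)^2.  Returns dC/dW1.
def hidden_grads(W1, x, y):
    h = np.tanh(W1 @ x)
    y_hat = h.sum()
    # chain rule: dC/dW1[i, j] = (y_hat - y) * (1 - h_i^2) * x_j
    return ((y_hat - y) * (1.0 - h ** 2))[:, None] * x[None, :]

x, y = np.array([1.0, 2.0]), 1.0

W_equal = np.full((2, 2), 0.5)        # all parameters equal ...
g_eq = hidden_grads(W_equal, x, y)
print(np.allclose(g_eq[0], g_eq[1]))  # True: both units get identical gradients

rng = np.random.default_rng(0)
W_rand = 0.1 * rng.standard_normal((2, 2))  # random init breaks the symmetry
g_rd = hidden_grads(W_rand, x, y)
print(np.allclose(g_rd[0], g_rd[1]))
```

With equal rows the two units update in lockstep and can never learn different features; random initialization gives them different gradients from the first step.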
Learning Rate
• Toy Example
[Diagram: a single linear neuron with weight w and bias b; z = wx + b, y = z]
x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5]
y = [0.1, 0.4, 0.9, 1.6, 2.2, 2.5, 2.8, 3.5, 3.9, 4.7, 5.1, 5.3, 6.3, 6.5, 6.7, 7.5, 8.1, 8.5, 8.9, 9.5]
Training Data (20 examples)
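Assuming the toy model is y = wx + b with squared-error cost C(w, b) = (1/R) Σ_r (w·x^r + b − ŷ^r)², batch gradient descent on these 20 examples can be sketched as (initialization and learning rate are assumed values, not from the slides):

```python
import numpy as np

# The 20 toy training examples from the slide.
x = np.arange(0.0, 10.0, 0.5)
y = np.array([0.1, 0.4, 0.9, 1.6, 2.2, 2.5, 2.8, 3.5, 3.9, 4.7,
              5.1, 5.3, 6.3, 6.5, 6.7, 7.5, 8.1, 8.5, 8.9, 9.5])

w, b, eta = 0.0, 0.0, 0.01           # assumed start point and learning rate
for _ in range(1000):
    err = w * x + b - y              # residuals on all R = 20 examples
    w -= eta * 2 * np.mean(err * x)  # dC/dw
    b -= eta * 2 * np.mean(err)      # dC/db
print(round(w, 2), round(b, 2))      # w ends up close to 1, b close to 0
```

Since the data is nearly y = x, the fitted line should come out with slope ≈ 1 and a small intercept.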
[Figure: error surface C(w, b) over the (w, b) plane, with the starting point θ^0 and the optimum θ*]
θ^i = θ^{i−1} − η∇C(θ^{i−1})
• Set the learning rate η carefully
• Toy Example
Learning Rate
[Figure: error surface C(w, b), showing the gradient descent path from start to target]
θ^i = θ^{i−1} − η∇C(θ^{i−1})
• Toy Example
[Figure: cost vs. number of updates (in thousands) under different learning rates η]
Different learning rate η
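The effect in the figure can be reproduced numerically: count how many updates each η needs to drive the cost below a threshold on the toy problem (the η values, tolerance, and idealized targets below are illustrative assumptions):

```python
import numpy as np

x = np.arange(0.0, 10.0, 0.5)
y = x.copy()  # idealized targets y = x, so the optimum cost is exactly 0

def updates_needed(eta, tol=1e-3, max_updates=200_000):
    """Gradient descent on C(w, b); return updates until C < tol, or None."""
    w, b = 0.0, 0.0
    for i in range(max_updates):
        err = w * x + b - y
        cost = np.mean(err ** 2)
        if not np.isfinite(cost):
            return None          # learning rate too large: cost exploded
        if cost < tol:
            return i
        w -= eta * 2 * np.mean(err * x)
        b -= eta * 2 * np.mean(err)
    return None

for eta in (0.04, 0.01, 0.001):
    print(eta, updates_needed(eta))
```

Too large an η makes the cost blow up (the 0.04 run diverges and returns None); smaller values of η do converge, but each reduction in η costs roughly proportionally more updates.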
Gradient Descent
C(θ) = (1/R) Σ_{r=1}^{R} ‖f(x^r; θ) − ŷ^r‖² = (1/R) Σ_{r=1}^{R} C^r(θ)
Gradient Descent:
θ^i = θ^{i−1} − η∇C(θ^{i−1}), with ∇C(θ) = (1/R) Σ_{r=1}^{R} ∇C^r(θ)

Stochastic Gradient Descent:
Pick an example x^r: θ^i = θ^{i−1} − η∇C^r(θ^{i−1})

If all examples x^r have equal probability of being picked:
E[∇C^r(θ)] = (1/R) Σ_{r=1}^{R} ∇C^r(θ) = ∇C(θ)
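A sketch of stochastic gradient descent on the earlier toy data (model y = wx + b; learning rate and epoch count are assumed values): one update per example, reshuffling every epoch.

```python
import numpy as np

x = np.arange(0.0, 10.0, 0.5)
y = np.array([0.1, 0.4, 0.9, 1.6, 2.2, 2.5, 2.8, 3.5, 3.9, 4.7,
              5.1, 5.3, 6.3, 6.5, 6.7, 7.5, 8.1, 8.5, 8.9, 9.5])

rng = np.random.default_rng(0)
w, b, eta = 0.0, 0.0, 0.01
for epoch in range(50):
    for r in rng.permutation(len(x)):  # see every example once per epoch
        err = w * x[r] + b - y[r]      # gradient of C^r alone, not of C
        w -= eta * 2 * err * x[r]
        b -= eta * 2 * err
print(round(w, 1))                     # hovers near the solution w ≈ 1
```

Each single-example gradient is a noisy estimate of ∇C, but because its expectation equals ∇C (equal picking probabilities), the parameters still drift toward the same optimum, just along a jittery path.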
Gradient Descent vs. Stochastic Gradient Descent
Training Data: (x^1, ŷ^1), (x^2, ŷ^2), …, (x^r, ŷ^r), …, (x^R, ŷ^R)
Starting at θ^0:
Pick x^1: θ^1 = θ^0 − η∇C^1(θ^0)
Pick x^2: θ^2 = θ^1 − η∇C^2(θ^1)
……
Pick x^r: θ^r = θ^{r−1} − η∇C^r(θ^{r−1})
……
Pick x^R: θ^R = θ^{R−1} − η∇C^R(θ^{R−1})
Seen all the examples once: one epoch
Pick x^1 again: θ^{R+1} = θ^R − η∇C^1(θ^R)
Gradient Descent
• Toy Example
Gradient Descent vs. Stochastic Gradient Descent, 1 epoch:
• Gradient Descent: sees all examples for a single update
• Stochastic Gradient Descent: sees only one example per update, so it updates 20 times in an epoch
Gradient Descent
Shuffle your data
Stochastic Gradient Descent:
Pick an example x^r: θ^i = θ^{i−1} − η∇C^r(θ^{i−1})

Mini-Batch Gradient Descent:
Pick B examples as a batch b (B is the batch size)
θ^i = θ^{i−1} − η · (1/B) Σ_{x^r ∈ b} ∇C^r(θ^{i−1})
Average the gradients of the examples in the batch b.
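A sketch of the mini-batch update on the same toy data, with an assumed batch size B = 4: the per-example gradients are averaged over each batch before updating.

```python
import numpy as np

x = np.arange(0.0, 10.0, 0.5)
y = np.array([0.1, 0.4, 0.9, 1.6, 2.2, 2.5, 2.8, 3.5, 3.9, 4.7,
              5.1, 5.3, 6.3, 6.5, 6.7, 7.5, 8.1, 8.5, 8.9, 9.5])

rng = np.random.default_rng(0)
w, b, eta, B = 0.0, 0.0, 0.01, 4
for epoch in range(200):
    order = rng.permutation(len(x))        # shuffle, then cut into batches
    for s in range(0, len(x), B):          # 20 / 4 = 5 updates per epoch
        idx = order[s:s + B]
        err = w * x[idx] + b - y[idx]
        w -= eta * 2 * np.mean(err * x[idx])  # (1/B) * sum of per-example grads
        b -= eta * 2 * np.mean(err)
print(round(w, 1))
```

Mini-batches sit between the two extremes: fewer, less noisy updates than batch size 1, but far cheaper per update than full-batch gradient descent.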
Gradient Descent:
θ^i = θ^{i−1} − η∇C(θ^{i−1}), with ∇C(θ) = (1/R) Σ_{r=1}^{R} ∇C^r(θ)
Gradient Descent
• Real Example: Handwriting Digit Classification
[Figure: results with batch size = 1 (stochastic) vs. full-batch gradient descent]
Two Concerns
• There are two things you have to be concerned about.
Optimization
• Can I find the “best” parameter set θ* in a limited amount of time?
Generalization
• Is the “best” parameter set θ* good for testing data as well?
Generalization
• You pick a “best” parameter set θ*
Training Data: {(x^r, ŷ^r)}: f(x^r; θ*) ≈ ŷ^r for all r
Testing Data: {x^u}: hope that f(x^u; θ*) ≈ ŷ^u
However, training data and testing data can have different distributions.
Panacea
• Have more training data if possible ……
• Create more training data (?)
Handwriting recognition:
[Figure: original training data, and created training data obtained by shifting the originals by 15°]
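One way to "create" more data is to transform the examples you already have. The slide shifts digit images by 15°; as a simpler stand-in (pure NumPy, no interpolation), this sketch translates a 28×28 image by whole pixels; the image and offsets are made up for illustration:

```python
import numpy as np

def shift_image(img, dy, dx):
    """Translate a 2-D image by (dy, dx) pixels, zero-filling the border."""
    out = np.zeros_like(img)
    h, w = img.shape
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        img[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return out

img = np.zeros((28, 28))
img[10:18, 12:16] = 1.0        # a fake pen stroke
aug = shift_image(img, 2, -1)  # 2 px down, 1 px left: a "new" training example
print(aug.sum() == img.sum())  # True: the stroke moved intact, nothing clipped
```

The label of the shifted image is unchanged, so each transform of a training example yields another labeled example essentially for free.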
Reference
• Chapter 3 of Neural Networks and Deep Learning
• http://neuralnetworksanddeeplearning.com/chap3.html
Appendix
Overfitting
• The function that performs well on the training data does not necessarily perform well on the testing data.
Training Data: {(x^r, ŷ^r)}: f(x^r; θ) ≈ ŷ^r for all r
Testing Data: {x^u}: f(x^u; θ) ≉ ŷ^u
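The gap between the two lines above can be made concrete with a tiny regression experiment (all numbers are made up for illustration): a degree-9 polynomial has enough capacity to memorize 10 noisy training points exactly, driving training error to essentially zero without that carrying over to held-out points.

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda t: 2.0 * t + 0.3             # assumed "true" relationship
x_train = np.linspace(0.0, 1.0, 10)
x_test = np.linspace(0.05, 0.95, 10)         # held-out points between the nodes
y_train = true_f(x_train) + rng.normal(0.0, 0.1, 10)
y_test = true_f(x_test) + rng.normal(0.0, 0.1, 10)

def train_test_error(degree):
    coef = np.polyfit(x_train, y_train, degree)
    tr = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
    te = np.mean((np.polyval(coef, x_test) - y_test) ** 2)
    return tr, te

tr1, te1 = train_test_error(1)  # a sensibly simple model
tr9, te9 = train_test_error(9)  # enough capacity to memorize all 10 points
print(tr9 < tr1)                # higher degree: lower training error ...
print(te9 > tr9)                # ... but the training error does not transfer
```

The degree-9 fit interpolates the noise in the training targets, which is exactly the "memorize the answers" behavior the slide describes.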
Overfitting in our daily life: memorizing the answers to previous exams……
• A joke about overfitting:
• http://xkcd.com/1122/
Initialization
• For gradient descent, we need to pick an initial parameter set θ^0.
• Do not set all the parameters in θ^0 equal.
• Otherwise your parameters will always stay equal, no matter how many times you update them.
• Pick θ^0 randomly.
• If the previous layer has more neurons, the initialization values should be smaller.
• E.g., if layer l − 1 has N_{l−1} neurons:
w^l_{ij} ~ N(0, 1/N_{l−1})  or  w^l_{ij} ~ U(−1/N_{l−1}, 1/N_{l−1})
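A sketch of sampling such fan-in-scaled initial weights with NumPy (the layer widths are example values): with unit-variance inputs, the Gaussian choice keeps the pre-activations at O(1) variance regardless of how wide the previous layer is.

```python
import numpy as np

rng = np.random.default_rng(0)
n_prev, n_cur = 256, 128  # N_{l-1} and N_l (assumed example widths)

# Gaussian: zero mean, variance 1/N_{l-1}, i.e. std = 1/sqrt(N_{l-1})
W_gauss = rng.normal(0.0, 1.0 / np.sqrt(n_prev), size=(n_cur, n_prev))

# Uniform variant with the same fan-in-shrinking scale
W_unif = rng.uniform(-1.0 / n_prev, 1.0 / n_prev, size=(n_cur, n_prev))

# With a unit-variance input, pre-activations z = W x keep O(1) variance,
# so the units neither saturate nor collapse toward zero at the start.
x = rng.standard_normal(n_prev)
z = W_gauss @ x
print(0.5 < z.var() < 2.0)  # True
```

Without the 1/N_{l−1} scaling, a layer with a large fan-in would produce pre-activations with variance proportional to N_{l−1}, saturating activations like tanh from the very first forward pass.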
MNIST
• The MNIST data comes in two parts. The first part contains 60,000 images to be used as training data. These images are scanned handwriting samples from 250 people, half of whom were US Census Bureau employees, and half of whom were high school students. The images are greyscale and 28 by 28 pixels in size. The second part of the MNIST data set is 10,000 images to be used as test data. Again, these are 28 by 28 greyscale images.
• git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git
• http://yann.lecun.com/exdb/mnist/• http://www.deeplearning.net/tutorial/gettingstarted.html
MNIST
• The current (2013) record is classifying 9,979 of 10,000 images correctly. This was done by Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus.
• At that level the performance is close to human-equivalent, and is arguably better, since quite a few of the MNIST images are difficult even for humans to recognize with confidence.
Early Stopping
• For iteration
• Layer
Difficulty of Deep
• Lower layer cannot plan