TRANSCRIPT
Fei-Fei Li & Andrej Karpathy, Lecture 6, 21 Jan 2015
Administrative
- A2 is out. It was released 2 days late, so the due date will be shifted by ~2 days.
- We updated the project page with many pointers to datasets.
Backpropagation (recursive chain rule)
Mini-batch Gradient Descent
Loop:
1. Sample a batch of data
2. Backprop to calculate the analytic gradient
3. Perform a parameter update
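The loop above can be sketched end to end. The dataset, the logistic-loss model, and every hyperparameter below are made-up stand-ins, not the lecture's code:

```python
import numpy as np

# Toy data: 1000 examples, 10 features; labels learnable from feature 0.
rng = np.random.default_rng(0)
X_train = rng.standard_normal((1000, 10))
y_train = (X_train[:, 0] > 0).astype(np.float64)
W = np.zeros(10)                                     # weights of a linear model

def loss_and_grad(W, X, y):
    """Logistic loss and its analytic gradient on a mini-batch."""
    p = 1.0 / (1.0 + np.exp(-(X @ W)))
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    grad = X.T @ (p - y) / len(y)
    return loss, grad

learning_rate, batch_size = 0.1, 64
for step in range(200):
    idx = rng.choice(len(X_train), batch_size, replace=False)  # 1. sample a batch
    loss, grad = loss_and_grad(W, X_train[idx], y_train[idx])  # 2. analytic gradient
    W -= learning_rate * grad                                  # 3. parameter update
```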
A bit of history
Widrow and Hoff, ~1960: Adaline
A bit of history
Rumelhart et al. 1986: First time back-propagation became popular
recognizable maths
A bit of history
[Hinton and Salakhutdinov 2006]
Reinvigorated research in Deep Learning
Training Neural Networks
Step 1: Preprocess the data
(Assume X [NxD] is the data matrix, with each example in a row.)
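For instance, zero-centering and normalizing such an [N x D] matrix (toy data; the exact preprocessing code is not shown in the transcript):

```python
import numpy as np

# Toy [N x D] data matrix, one example per row.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 4))

X_centered = X - X.mean(axis=0)             # zero-center: subtract per-feature mean
X_normalized = X_centered / X.std(axis=0)   # normalize: divide by per-feature std
```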
Step 1: Preprocess the data
In practice, you may also see PCA and whitening of the data:
- PCA (data has a diagonal covariance matrix)
- Whitening (covariance matrix is the identity matrix)
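A minimal sketch of both transforms, assuming zero-centered data; the 1e-5 term is a common fudge factor to avoid dividing by a near-zero eigenvalue:

```python
import numpy as np

# Correlated toy data (a random linear mix of Gaussians), zero-centered.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3)) @ rng.standard_normal((3, 3))
X -= X.mean(axis=0)

cov = X.T @ X / X.shape[0]           # covariance matrix of the data
U, S, _ = np.linalg.svd(cov)         # eigenbasis of the covariance
X_pca = X @ U                        # decorrelate: diagonal covariance matrix
X_white = X_pca / np.sqrt(S + 1e-5)  # whiten: covariance ~ identity matrix
```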
Step 2: Choose the architecture: say we start with one hidden layer of 50 neurons:
- input layer: CIFAR-10 images, 3072 numbers
- hidden layer: 50 hidden neurons
- output layer: 10 output neurons, one per class
Before we try training, let's initialize well:
- set weights to small random numbers
- set biases to zero
(Matrix of small numbers drawn randomly from a Gaussian.) Warning: this is not optimal, but it is the simplest option! (More on this later.)
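For the 3072 -> 50 -> 10 network above, this initialization might look like the following (the dict layout is illustrative; the course uses its own model class):

```python
import numpy as np

# Small random Gaussian weights (scale 0.01), zero biases.
rng = np.random.default_rng(0)
model = {
    "W1": 0.01 * rng.standard_normal((3072, 50)),  # input -> hidden
    "b1": np.zeros(50),
    "W2": 0.01 * rng.standard_normal((50, 10)),    # hidden -> output
    "b2": np.zeros(10),
}
```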
Double check that the loss is reasonable:
- return the loss and the gradient for all parameters
- disable regularization
- loss ~2.3: the "correct" value for 10 classes, since -log(1/10) ≈ 2.3
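The "correct" value comes from the softmax assigning roughly uniform probability to every class at initialization:

```python
import numpy as np

# With near-zero weights the softmax outputs ~1/10 per class, so the
# expected initial loss is -log(1/10).
num_classes = 10
expected_initial_loss = -np.log(1.0 / num_classes)
print(round(expected_initial_loss, 4))   # -> 2.3026
```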
Double check that the loss is reasonable:
- crank up regularization
- the loss went up: good. (sanity check)
Let's try to train now…
Tip: Make sure that you can overfit a very small portion of the data.
The above code:
- takes the first 20 examples from CIFAR-10
- turns off regularization (reg = 0.0)
- uses simple vanilla 'sgd'
Details:
- learning_rate_decay = 1 means no decay; the learning rate stays constant
- sample_batches = False means we're doing full gradient descent, not mini-batch SGD
- we'll perform 200 updates (epochs = 200)
"epoch": one full pass over the training set
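The course's trainer class isn't reproduced in this transcript, but the idea can be sketched with a minimal 2-layer ReLU net trained full-batch on a tiny random set; with no regularization, the loss should fall far below the initial ~2.3 (smaller input size here is a made-up stand-in for CIFAR-10):

```python
import numpy as np

# Memorize 20 random examples with a tiny 2-layer network (full-batch
# vanilla gradient descent, reg = 0). Illustrative, not the course code.
rng = np.random.default_rng(0)
N, D, H, C = 20, 40, 50, 10
X = rng.standard_normal((N, D))            # "first 20 examples" stand-in
y = rng.integers(0, C, N)                  # 10 class labels

W1 = 0.1 * rng.standard_normal((D, H)); b1 = np.zeros(H)
W2 = 0.1 * rng.standard_normal((H, C)); b2 = np.zeros(C)
lr = 0.5                                   # constant learning rate (no decay)

for epoch in range(500):                   # full-batch updates
    h = np.maximum(0, X @ W1 + b1)         # ReLU hidden layer
    scores = h @ W2 + b2
    scores -= scores.max(axis=1, keepdims=True)
    p = np.exp(scores); p /= p.sum(axis=1, keepdims=True)
    loss = -np.log(p[np.arange(N), y]).mean()
    dscores = p.copy(); dscores[np.arange(N), y] -= 1; dscores /= N
    dh = dscores @ W2.T; dh[h <= 0] = 0    # backprop through ReLU
    W2 -= lr * (h.T @ dscores); b2 -= lr * dscores.sum(0)
    W1 -= lr * (X.T @ dh);     b1 -= lr * dh.sum(0)

train_acc = (p.argmax(axis=1) == y).mean() # should approach 1.00
```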
Very small loss, train accuracy 1.00, nice!
I like to start with small regularization and find a learning rate that makes the loss go down.
- loss not going down: learning rate too low
- loss exploding: learning rate too high
Loss barely changing: the learning rate is probably too low. (Could also be that regularization is too high.)
Notice that train/val accuracy goes to 20% though; what's up with that? (Remember this is a softmax: the loss can stay nearly flat while the correct class's score slowly pulls ahead of the others.)
Okay, now let's try learning rate 1e6. What could possibly go wrong?
cost: NaN almost always means high learning rate...
3e-3 is still too high: the cost explodes.
=> The rough range for the learning rate we should be cross-validating is somewhere in [1e-3 … 1e-5]
Cross-validation strategy
I like to do coarse -> fine cross-validation in stages.
First stage: only a few epochs to get a rough idea of what params work.
Second stage: longer running time, finer search… (repeat as necessary)
Tip for detecting explosions in the solver: if the cost is ever > 3 * the original cost, break out early.
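That explosion check is easy to wire into a training loop; `step_and_return_cost` below is a hypothetical stand-in for one optimization step of your solver:

```python
# Abort a hyperparameter run early if the cost ever exceeds 3x the initial cost.
def run_with_early_abort(step_and_return_cost, num_steps):
    initial_cost = step_and_return_cost()
    for _ in range(num_steps - 1):
        cost = step_and_return_cost()
        if cost > 3 * initial_cost:        # exploding -> this setting is hopeless
            return "aborted"
    return "finished"

# Toy usage: a cost sequence that explodes triggers the early abort.
costs = iter([1.0, 1.5, 4.0, 8.0])
print(run_with_early_abort(lambda: next(costs), 4))  # -> aborted
```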
For example: run coarse search for 5 epochs
nice
note it’s best to optimize in log space
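Optimizing in log space means sampling the exponent uniformly and then exponentiating, for example (the learning-rate range follows the slide; the regularization range here is a made-up illustration):

```python
import numpy as np

# A uniform draw of the exponent covers 1e-5 .. 1e-3 evenly; a plain uniform
# draw over [1e-5, 1e-3] would almost never sample the small end of the range.
rng = np.random.default_rng(0)
learning_rates = 10 ** rng.uniform(-5, -3, size=100)  # log-uniform in [1e-5, 1e-3]
regs = 10 ** rng.uniform(-3, 3, size=100)             # same idea for regularization
```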
Now run finer search...adjust range
53% - relatively good for a 2-layer neural net with 50 hidden neurons.
But this best cross-validation result is worrying. Why? (Hint: check where the best values sit relative to the edges of the ranges you searched.)
Normally you can't afford a huge computational budget for expensive cross-validations. You need to rely more on intuitions and visualizations…
Visualizations to play with:
- loss function
- validation and training accuracy
- min, max, std for values and updates (and monitor their ratio)
- first-layer visualization of weights (if working with images)
Monitor and visualize the loss curve
- If the curve looks too linear: the learning rate is low.
- If it doesn't decrease much: the learning rate might be too high.
The "width" of the curve is related to the batch size. This one looks too wide (noisy) => might want to increase the batch size.
Monitor and visualize the accuracy:
- big gap = overfitting => increase regularization strength
- no gap => increase model capacity
Track the ratio of weight updates / weight magnitudes:
ratio between the updates and the values: ~0.0002 / 0.02 = 0.01 (about okay); you want this to be somewhere around 0.01 - 0.001 or so.
(The slide plots the max, min, and mean of these statistics over the course of training.)
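One way to compute this ratio for a single parameter tensor; the scales below (weights ~0.02, updates ~0.0002) mirror the slide's example numbers but are otherwise made up:

```python
import numpy as np

# Current weight values and the update about to be applied.
rng = np.random.default_rng(0)
W = 0.02 * rng.standard_normal((50, 10))     # weights of scale ~0.02
dW = rng.standard_normal((50, 10))           # gradient for this step
learning_rate = 2e-4

update = -learning_rate * dW                 # update of scale ~0.0002
ratio = np.linalg.norm(update) / np.linalg.norm(W)  # aim for ~1e-3 .. 1e-2
```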
Visualizing first-layer weights:
Noisy weights => Regularization maybe not strong enough
(Regarding your Assignment #1)
=> Regularization not strong enough
So far:
- We’ve seen the process for performing cross-validations
- There are several things we can track to get intuitions about how to do it more efficiently
Hyperparameters to play with:
- network architecture
- learning rate, its decay schedule, update type
- regularization (L2/L1/Maxnorm/Dropout)
- loss to use (e.g. SVM/Softmax)
- initialization
(Cartoon: the neural networks practitioner; music = loss function)
Initialization becomes more tricky and important in deeper networks. Usually approx. W ~ N(0, 0.01) works. If not:
consider what happens to the output distribution of neurons with a different number of inputs (low or high). (Figure: output histograms for 10 inputs vs. 100 inputs.)
Normalize by the square root of the number of incoming connections (fan-in) => this ensures equal variance of each neuron in the network. (Tricky, subtle, but very important topic. See notes for details.)
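The effect is easy to verify numerically (a sketch, not the course code): with unscaled Gaussian weights the output scale grows like sqrt(fan-in), while dividing by sqrt(fan-in) keeps it near 1:

```python
import numpy as np

# 5000 samples each of a 10-input and a 100-input neuron's pre-activation.
rng = np.random.default_rng(0)
x10 = rng.standard_normal((5000, 10))
x100 = rng.standard_normal((5000, 100))

naive10 = (x10 @ rng.standard_normal(10)).std()     # ~ sqrt(10)
naive100 = (x100 @ rng.standard_normal(100)).std()  # ~ sqrt(100): 10x the inputs, wider output

scaled10 = (x10 @ (rng.standard_normal(10) / np.sqrt(10))).std()
scaled100 = (x100 @ (rng.standard_normal(100) / np.sqrt(100))).std()
# scaled10 and scaled100 are both ~1: equal variance regardless of fan-in
```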
Regularization knobs:
- L2 regularization
- L1 regularization ("sparsity inducing": many weights become almost exactly zero)
- L1 + L2 can also be combined
- Max norm constraint (enforce a maximum L2 norm of the incoming weights)
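These knobs might be sketched as follows for a single weight matrix (the penalty terms get added to the data loss; the reg strength and max-norm cap are illustrative values):

```python
import numpy as np

# One weight matrix; each column holds the incoming weights of one neuron.
rng = np.random.default_rng(0)
W = rng.standard_normal((3072, 50))
reg = 1e-3

l2_penalty = 0.5 * reg * np.sum(W * W)   # L2: prefers small, diffuse weights
l1_penalty = reg * np.sum(np.abs(W))     # L1: sparsity inducing
elastic = l1_penalty + l2_penalty        # L1 + L2 combined

# Max norm constraint: after a parameter update, rescale any column whose
# L2 norm exceeds the cap.
max_norm = 3.0
norms = np.linalg.norm(W, axis=0)
W *= np.minimum(1.0, max_norm / norms)
```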
Seemingly unrelated: Model Ensembles
One way to always improve final accuracy: take several trained models and average their predictions.
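A sketch of prediction averaging, with random probability tables standing in for real model outputs:

```python
import numpy as np

# 3 models x 5 examples x 10 classes; each row is a valid probability vector.
rng = np.random.default_rng(0)
model_probs = rng.dirichlet(np.ones(10), size=(3, 5))
ensemble = model_probs.mean(axis=0)        # average the models' predictions
predictions = ensemble.argmax(axis=1)      # final class per example
```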
Regularization: Dropout
"randomly set some neurons to zero" [Srivastava et al.]
Example forward pass with a 3-layer network using dropout
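The slide's code isn't reproduced in this transcript, but the idea can be sketched like this (p is the probability of keeping a unit; all sizes and parameters are made-up stand-ins):

```python
import numpy as np

# 3-layer forward pass with dropout applied after each hidden ReLU.
rng = np.random.default_rng(0)
p = 0.5                                   # probability of keeping a unit
D, H = 20, 30
x = rng.standard_normal(D)
W1, b1 = 0.01 * rng.standard_normal((H, D)), np.zeros(H)
W2, b2 = 0.01 * rng.standard_normal((H, H)), np.zeros(H)
W3, b3 = 0.01 * rng.standard_normal((1, H)), np.zeros(1)

H1 = np.maximum(0, W1 @ x + b1)
U1 = rng.random(H1.shape) < p             # first dropout mask
H1 *= U1                                  # drop!
H2 = np.maximum(0, W2 @ H1 + b2)
U2 = rng.random(H2.shape) < p             # second dropout mask
H2 *= U2                                  # drop!
out = W3 @ H2 + b3
```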
At test time we don't drop, but we have to be careful:
At test time all neurons are always active => we must scale the activations so that, for each neuron: output at test time = expected output at training time.
If the output of a neuron is x but the probability of keeping it is only p, then the output of the neuron in expectation is: px + (1-p)·0 = px.
(This has the interpretation of an ensemble of all sub-networks.)
More common: “Inverted dropout”
test time is unchanged
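A sketch of the inverted version: divide by p while dropping at training time, so the expected activation already matches test time and the test-time code stays as-is:

```python
import numpy as np

# One layer of hidden activations, for illustration.
rng = np.random.default_rng(0)
p = 0.5
h = np.maximum(0, rng.standard_normal(10_000))

mask = (rng.random(h.shape) < p) / p   # train time: entries are 0 or 1/p
h_train = h * mask                     # drop and rescale in one step

h_test = h                             # test time: unchanged
# E[h_train] == h_test elementwise, so the means match up to sampling noise
```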
Learning rates and updates:
- SGD + Momentum > SGD
- Momentum 0.9 usually works well
- Decrease the learning rate over time (people use 1/t, exp(-t), or steps); simplest: learning_rate *= 0.97 every epoch (or so)
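A sketch of momentum plus the multiplicative decay, minimizing the toy quadratic f(w) = w^2 (so the gradient is simply 2w) as a stand-in for a real loss:

```python
# SGD with momentum and the simplest decay schedule.
w, v = 5.0, 0.0
learning_rate, mu = 0.1, 0.9          # momentum of 0.9 usually works well

for epoch in range(100):
    grad = 2 * w                      # gradient of f(w) = w^2
    v = mu * v - learning_rate * grad # momentum: velocity accumulates gradients
    w += v                            # parameter update
    learning_rate *= 0.97             # decay the learning rate every epoch
```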
Summary
- Properly preprocess the data
- Run cross-validations across many tips/tricks
- Use visualizations to guide the ranges and cross-val
- Ensemble multiple models and report test error
Next Lecture:
Convolutional Neural Networks