CS 189
Brian [email protected]
Slides at: brianchu.com/ml/
Office Hours: Cory 246, 6-7p Mon. (hackerspace lounge)
twitter: @brrrianchu
Agenda
• NEURAL NETS WOOOHOOO
Terminology
• Unit – each “neuron”
• 2-layer neural network: a neural network with one hidden layer (what you’re building)
• Epoch – one pass through entire training data
– For SGD, this is N iterations
– For mini-batch gradient descent (batch size of B), this is (N/B) iterations
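The epoch arithmetic above, as a tiny sketch. The function name is made up for illustration, and rounding up when B does not divide N (so the last, smaller batch still counts as an iteration) is an assumption; the slide only states N/B.

```python
import math


def iterations_per_epoch(num_examples, batch_size):
    """Number of parameter updates in one pass over the training data.

    Assumption: if batch_size does not divide num_examples evenly,
    the final smaller batch still counts as one iteration.
    """
    return math.ceil(num_examples / batch_size)


sgd_iterations = iterations_per_epoch(20, 1)         # plain SGD: B = 1
minibatch_iterations = iterations_per_epoch(20, 10)  # mini-batch: B = 10
```

With N = 20, plain SGD takes 20 iterations per epoch and a mini-batch size of 10 takes 2.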
First off…
• Many of you will struggle to even finish.
• In which case you can ignore my bells and whistles.
• My 2.6GHz quad core 16GB RAM Macbook takes ~1.5 hours to train to ~96-97%.
First off…
• Add a signal handler + snapshotting
• E.g. implement functionality where if you press Ctrl-C (on Unix systems, this is sending the interrupt signal), your code saves a snapshot of the state of the training (current iteration, decayed learning rate, momentum, current weights, anything else), then exits.
– Look into Python “signal” and “pickle” libraries.
Art of tuning
• Training neural nets is an art, not a science
• Cross-validation? Pfffft
• “I used to tune that parameter but I’m too lazy and I don’t bother any more” – grad student talking about weight decay hyperparameter.
• There are way too many hyperparameters for you to tune.
• Training is too slow for you to bother using cross-validation.
• Many hyperparameters: just use what is standard and spend your time elsewhere
Knobs
• Learning: SGD/mini-batch/full batch, momentum, RMSprop, Adagrad, NAG, etc.
– How to decay?
• ReLU, tanh, sigmoid activations
• Loss: MSE or cross-entropy (with softmax)
• L1, L2, Max-norm, Dropout, Dropconnect regularization
• Convolutional layers
• Initialization: Xavier, Gaussian, etc.
• When to stop? Early stop? Stopping rule? Or just run forever
I recommend
• Cross-entropy, softmax *
• Only decay per epoch (or more than 1 epoch) *
– (e.g. don’t just divide by # iterations)
– Epoch = one training pass thru entire data
– Only decay after a round of seeing every data point.
– Note: if your mini-batch size is 10, N = 20, then one epoch is 2 iterations
• Momentum (coefficient 0.7-0.9?) *
– Maybe RMSProp?
• Mini-batch (somewhere between 20-100) *
• No regularization.
• Gaussian initialization (mean 0, std. dev. 0.01) *
• Run forever, take a snapshot when you feel like stopping (seriously!)
* = What everyone in the literature, in practice, uses
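A minimal sketch of how the starred recommendations fit together: softmax with cross-entropy, mini-batches, momentum, Gaussian initialization, and a learning rate that decays once per epoch. To stay short it trains a single linear softmax layer rather than the 2-layer network. The toy data, the sizes, the starting learning rate, and the 0.9 decay factor are all assumptions for illustration, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): 200 points, 5 features, 3 classes.
N, D, C = 200, 5, 3
X = rng.normal(size=(N, D))
y = np.argmax(X @ rng.normal(size=(D, C)), axis=1)
Y = np.eye(C)[y]  # one-hot labels


def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)  # subtract row max for stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)


def cross_entropy(P, Y):
    return -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))


W = rng.normal(0.0, 0.01, size=(D, C))  # Gaussian init: mean 0, std 0.01
velocity = np.zeros_like(W)
learning_rate, momentum, batch_size, decay = 0.1, 0.9, 20, 0.9

initial_loss = cross_entropy(softmax(X @ W), Y)
for epoch in range(20):
    order = rng.permutation(N)  # visit every data point once per epoch
    for start in range(0, N, batch_size):
        idx = order[start:start + batch_size]
        P = softmax(X[idx] @ W)
        # Gradient of the mean cross-entropy w.r.t. W for a softmax output.
        grad = X[idx].T @ (P - Y[idx]) / len(idx)
        velocity = momentum * velocity - learning_rate * grad
        W += velocity
    learning_rate *= decay  # decay once per epoch, not once per iteration
final_loss = cross_entropy(softmax(X @ W), Y)
```

The initial loss sits near log(3), which is what a near-zero initialization should give for 3 classes; that makes a handy sanity check before training.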
Activation functions
• tanh >>> sigmoid
– (tanh is just a rescaled and shifted sigmoid anyways: tanh(x) = 2*sigmoid(2x) - 1)
• ReLU = stacked sigmoid
• ReLU is basically standard in computer vision
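A quick numerical check of the two relationships above. The tanh identity is exact. For "stacked sigmoid", this sketch assumes one common reading: a sum of sigmoids with shifted biases approximates softplus, log(1 + e^x), which is a smooth version of ReLU. The number of stacked units (50) and the input range are arbitrary choices.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def relu(x):
    return np.maximum(x, 0.0)


x = np.linspace(-5.0, 5.0, 101)

# tanh is a rescaled, shifted sigmoid: tanh(x) = 2*sigmoid(2x) - 1.
tanh_gap = np.max(np.abs(np.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)))

# "Stacked" sigmoids: sum of sigmoid(x - i + 0.5) for i = 1..50.
shifts = np.arange(1, 51)
stacked = sigmoid(x[:, None] - shifts[None, :] + 0.5).sum(axis=1)
softplus = np.log1p(np.exp(x))  # smooth approximation of relu(x)
stacked_gap = np.max(np.abs(stacked - softplus))
```

Here tanh_gap is at floating-point noise level, and stacked_gap stays small across the range, so the stacked sum tracks softplus closely.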
Almost certainly will improve accuracy but total overkill
• Considered “standard” today:
– Convolutional layers (with max-pooling)
– Dropout (Dropconnect?)
If using numpy
• Not a single for-loop should be in your code.
• Avoid unnecessary memory allocation:
• Use the “out=” keyword argument to re-use numpy arrays
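A small sketch of re-using a buffer with "out=". The array sizes are illustrative (a batch of 64 inputs with 784 features feeding 200 hidden units), and the short loop only stands in for training iterations. Both np.dot and ufuncs such as np.tanh accept an out argument; for np.dot the buffer must already have the right shape and dtype.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 784))              # one mini-batch of inputs
W = rng.normal(0.0, 0.01, size=(784, 200))  # first-layer weights

hidden = np.empty((64, 200))  # allocated once, before the training loop

for _ in range(3):
    np.dot(X, W, out=hidden)     # writes X @ W into the existing buffer
    np.tanh(hidden, out=hidden)  # activation applied in place, no new array
```

Without out=, each iteration would allocate two fresh (64, 200) arrays; with it, the same memory is overwritten every time.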
May want to consider
• Faster implementation than Python w/ numpy:
• Cython, Java, Go, Julia, etc.
Honestly, if you want to win…
• (if you have a compatible graphics card) Write a CUDA or OpenCL implementation, train for many days.
– (you might consider adding regularization in this case)
• I didn’t do this: I used other generic tricks that you can read in the literature.
Debugging
• Check your dimensions
• Check your numpy dtypes
• Check your derivatives – comment all your backprop steps
• Numerical gradient calculator:
– https://github.com/pbrod/numdifftools
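If you would rather not add a dependency, a hand-rolled central-difference check is only a few lines. This is a sketch written with plain numpy, not the numdifftools API. The single tanh unit with squared error is a stand-in for your real network, and the function names and eps value are assumptions. The explicit loop is fine here because this is debugging code, not the training path.

```python
import numpy as np


def numerical_gradient(f, w, eps=1e-5):
    """Central differences: perturb one weight at a time, restore it after."""
    grad = np.zeros_like(w)
    for i in np.ndindex(w.shape):
        old = w[i]
        w[i] = old + eps
        f_plus = f(w)
        w[i] = old - eps
        f_minus = f(w)
        w[i] = old
        grad[i] = (f_plus - f_minus) / (2.0 * eps)
    return grad


rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
t = rng.normal(size=8)
w = rng.normal(0.0, 0.01, size=3)  # must be a float dtype, not int


def loss(w):
    a = np.tanh(X @ w)
    return 0.5 * np.sum((a - t) ** 2)


def analytic_gradient(w):
    a = np.tanh(X @ w)
    # Chain rule: dL/da = (a - t), da/dz = 1 - a^2, dz/dw = X.
    return X.T @ ((a - t) * (1.0 - a ** 2))


numeric = numerical_gradient(loss, w)
analytic = analytic_gradient(w)
rel_error = np.linalg.norm(numeric - analytic) / (
    np.linalg.norm(numeric) + np.linalg.norm(analytic))
```

A relative error many orders of magnitude below 1 suggests the backprop step matches; an error near 1 usually means a sign, transpose, or missing-term bug.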
Connection with SVMs / linear classifiers with kernels
• Kernel SVM can be thought of as:
• 1st layer: |units| = |support vectors|
– Value of each unit i = K(query, train(i))
• 2nd layer: linear combo of first layer
• Simplest training for 1st layer: store all training points as templates.
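The two-layer view of a kernel machine, sketched in numpy. The RBF kernel, the gamma value, the three stored templates, and the second-layer coefficients and bias are all made-up illustration values; a real SVM would learn the coefficients and keep only the support vectors.

```python
import numpy as np


def rbf_kernel(query, templates, gamma=0.5):
    """K(query, train(i)) for every stored template i."""
    sq_dist = np.sum((templates - query) ** 2, axis=1)
    return np.exp(-gamma * sq_dist)


# "Training" the 1st layer: just store training points as templates.
templates = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
coeffs = np.array([1.0, -1.0, 0.5])  # hypothetical 2nd-layer weights
bias = 0.1


def kernel_machine(query):
    hidden = rbf_kernel(query, templates)  # 1st layer: one unit per template
    return hidden @ coeffs + bias          # 2nd layer: linear combination


score = kernel_machine(np.array([0.0, 0.0]))
```

The sign of the score would be the predicted class in a binary SVM; the hidden vector plays the role of the hidden-layer activations.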
http://www.kdnuggets.com/2014/02/exclusive-yann-lecun-deep-learning-facebook-ai-lab.html