
Page 1:

CS 189

Brian Chu
brian.c@berkeley.edu

Slides at: brianchu.com/ml/

Office Hours: Cory 246, 6-7p Mon. (hackerspace lounge)

twitter: @brrrianchu

Page 2:

Agenda

• NEURAL NETS WOOOHOOO

Page 3:

Terminology

• Unit – each “neuron”
• 2-layer neural network – a neural network with one hidden layer (what you’re building)
• Epoch – one pass through the entire training data
  – For SGD, this is N iterations
  – For mini-batch gradient descent (batch size of B), this is N/B iterations
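A quick sketch of that bookkeeping; the function name and the ceiling-division choice are illustrative (assuming a leftover partial batch still counts as an iteration):

    def iterations_per_epoch(N, B):
        # SGD is the B = 1 case: N iterations per epoch.
        # Mini-batch gradient descent with batch size B: about N/B iterations.
        return -(-N // B)  # ceiling division

    print(iterations_per_epoch(20, 10))  # 2, matching the example on the "I recommend" slide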

Page 4:

First off…

• Many of you will struggle to even finish.
• In which case, you can ignore my bells and whistles.
• My 2.6 GHz quad-core, 16 GB RAM MacBook takes ~1.5 hours to train to ~96-97%.

Page 5:

First off…

• Add a signal handler + snapshotting.
• E.g., implement functionality where, if you press Ctrl-C (on Unix systems, this sends the interrupt signal), your code saves a snapshot of the state of the training (current iteration, decayed learning rate, momentum, current weights, anything else), then exits.
  – Look into the Python “signal” and “pickle” libraries.
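A minimal sketch of that snapshotting idea using the signal and pickle libraries; the state dictionary, file name, and what goes in it are placeholders for whatever your trainer actually tracks:

    import pickle
    import signal
    import sys

    # Hypothetical training state: current iteration, decayed learning rate,
    # momentum velocities, current weights, anything else needed to resume.
    state = {"iteration": 0, "learning_rate": 0.01, "weights": None}

    def save_snapshot(signum, frame):
        # Dump the training state to disk, then exit cleanly.
        with open("snapshot.pkl", "wb") as f:
            pickle.dump(state, f)
        sys.exit(0)

    # Install the handler for SIGINT, which is what Ctrl-C sends on Unix systems.
    signal.signal(signal.SIGINT, save_snapshot)

    # ... training loop updates `state` every iteration ...

To resume, load the pickle at startup and continue from the saved iteration.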

Page 6:

Art of tuning

• Training neural nets is an art, not a science.
• Cross-validation? Pfffft.
• “I used to tune that parameter but I’m too lazy and I don’t bother any more” – a grad student, talking about the weight decay hyperparameter.
• There are way too many hyperparameters for you to tune.
• Training is too slow for you to bother using cross-validation.
• For many hyperparameters, just use what is standard and spend your time elsewhere.

Page 7:

Knobs

• Learning: SGD / mini-batch / full batch, momentum, RMSprop, Adagrad, NAG, etc.
  – How to decay?
• ReLU, tanh, sigmoid activations
• Loss: MSE or cross-entropy (with softmax)
• L1, L2, max-norm, Dropout, DropConnect regularization
• Convolutional layers
• Initialization: Xavier, Gaussian, etc.
• When to stop? Early stop? Stopping rule? Or just run forever?

Page 8:

I recommend

• Cross-entropy loss, softmax output *
• Only decay the learning rate per epoch (or even less often, e.g. every few epochs) *
  – (e.g. don’t just divide by the iteration count)
  – Epoch = one training pass through the entire data
  – Only decay after a round of seeing every data point.
  – Note: if your mini-batch size is 10 and N = 20, then one epoch is 2 iterations
• Momentum (coefficient 0.7-0.9?) *
  – Maybe RMSProp?
• Mini-batches (somewhere between 20 and 100) *
• No regularization.
• Gaussian initialization (mean 0, std. dev. 0.01) *
• Run forever; take a snapshot when you feel like stopping (seriously!)

* = What everyone in the literature, in practice, uses
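A sketch of what those starred defaults could look like in numpy; the layer sizes, the momentum value of 0.9, and the variable names are illustrative choices, not part of the assignment:

    import numpy as np

    n_in, n_hidden, n_out = 784, 200, 10    # example layer sizes
    batch_size = 50                         # somewhere in the 20-100 range
    mu = 0.9                                # momentum coefficient in the 0.7-0.9 range

    # Gaussian initialization: mean 0, standard deviation 0.01.
    W1 = 0.01 * np.random.randn(n_in, n_hidden)
    W2 = 0.01 * np.random.randn(n_hidden, n_out)

    # Momentum update for one weight matrix, given its gradient and a learning
    # rate that is only decayed once per epoch (or even less often).
    v1 = np.zeros_like(W1)
    def momentum_step(W, v, grad, lr):
        v[...] = mu * v - lr * grad   # update the velocity in place
        W += v                        # take the step
        return W, v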

Page 9:

Activation functions

• tanh >>> sigmoid
  – (tanh is just a shifted sigmoid anyways)
• ReLU = stacked sigmoid
• ReLU is basically standard in computer vision
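For reference, a tiny numpy sketch of the activations, including the identity behind “tanh is just a shifted sigmoid”: tanh(x) = 2·sigmoid(2x) − 1, i.e. a rescaled, recentered sigmoid:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def tanh(x):
        # Same as np.tanh(x); written this way to show the relationship
        # tanh(x) = 2 * sigmoid(2x) - 1.
        return 2.0 * sigmoid(2.0 * x) - 1.0

    def relu(x):
        return np.maximum(0.0, x)

    x = np.linspace(-3.0, 3.0, 7)
    print(np.allclose(tanh(x), np.tanh(x)))   # True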

Page 10:

Almost certainly will improve accuracy but total overkill

• Considered “standard” today:
  – Convolutional layers (with max-pooling)
  – Dropout (DropConnect?)

Page 11:

If using numpy

• Not a single for-loop should be in your code.
• Avoid unnecessary memory allocation:
  – Use the “out=” keyword argument to re-use numpy arrays.
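A small illustration of the “out=” idea; the shapes and buffer below are made up for the example, but the pattern (allocate once, write into the same array every iteration) is the point:

    import numpy as np

    X = np.random.randn(50, 784)            # a mini-batch (illustrative shapes)
    W = 0.01 * np.random.randn(784, 200)
    H = np.empty((50, 200))                 # buffer allocated once, reused every iteration

    np.dot(X, W, out=H)                     # writes into H instead of allocating a new array
    np.maximum(H, 0.0, out=H)               # in-place ReLU, again no new allocation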

Page 12:

May want to consider

• Faster implementations than Python w/ numpy:
  – Cython, Java, Go, Julia, etc.

Page 13:

Honestly, if you want to win…

• (If you have a compatible graphics card) Write a CUDA or OpenCL implementation and train for many days.
  – (You might consider adding regularization in this case.)
• I didn’t do this: I used other generic tricks that you can read about in the literature.

Page 14:

Debugging

• Check your dimensions
• Check your numpy dtypes
• Check your derivatives – comment all your backprop steps
• Numerical gradient calculator:
  – https://github.com/pbrod/numdifftools
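One way to use a numerical gradient as a check, whether via numdifftools or a hand-rolled central difference like the sketch below (not from the slides): compare it against your backprop gradient and expect agreement to many decimal places.

    import numpy as np

    def numerical_gradient(f, w, eps=1e-5):
        # Central-difference estimate of df/dw for a scalar-valued loss f(w).
        grad = np.zeros_like(w)
        for idx in np.ndindex(*w.shape):
            orig = w[idx]
            w[idx] = orig + eps
            f_plus = f(w)
            w[idx] = orig - eps
            f_minus = f(w)
            w[idx] = orig                          # restore the original value
            grad[idx] = (f_plus - f_minus) / (2 * eps)
        return grad

    # Example: for loss(w) = 0.5 * sum(w^2) the analytic gradient is just w.
    loss = lambda w: 0.5 * np.sum(w ** 2)
    w = np.random.randn(3, 2)
    print(np.max(np.abs(numerical_gradient(loss, w) - w)))   # ~1e-10 or smaller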

Page 15:

Connection with SVMs / linear classifiers with kernels

• Kernel SVM can be thought of as:
• 1st layer: |units| = |support vectors|
  – Value of each unit i = K(query, train(i))
• 2nd layer: linear combo of the first layer
• Simplest training for the 1st layer: store all training points as templates.
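A toy sketch of that view (the RBF kernel, the stored points, and the weights below are all made up for illustration): the “1st layer” evaluates the kernel between the query and each stored training point, and the “2nd layer” is a linear combination of those unit values.

    import numpy as np

    def rbf_kernel(x, z, gamma=1.0):
        return np.exp(-gamma * np.sum((x - z) ** 2))

    X_train = np.random.randn(5, 2)    # stored training points ("templates" / support vectors)
    alphas = np.random.randn(5)        # 2nd-layer weights the SVM would learn
    bias = 0.0

    def svm_as_two_layer_net(query):
        # "1st layer": one unit per stored point, value = K(query, train(i))
        hidden = np.array([rbf_kernel(query, x_i) for x_i in X_train])
        # "2nd layer": linear combination of the hidden units
        return hidden @ alphas + bias

    print(svm_as_two_layer_net(np.zeros(2)))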

http://www.kdnuggets.com/2014/02/exclusive-yann-lecun-deep-learning-facebook-ai-lab.html