
Page 1:

Neural Networks

Instructor: Yoav Artzi

CS5740: Natural Language Processing, Spring 2017

Slides adapted from Dan Klein, Dan Jurafsky, Chris Manning, Michael Collins, Luke Zettlemoyer, Yejin Choi, and Slav Petrov

Page 2:

Overview
• Introduction to Neural Networks
• Word representations
• NN optimization tricks

Page 3:

Some History
• Neural network algorithms date from the 80's
  – Originally inspired by early neuroscience
• Historically slow, complex, and unwieldy
• Now: the term is abstract enough to encompass almost any model – but useful!
• Dramatic shift in the last 2-3 years away from MaxEnt (linear, convex) to “neural net” (non-linear) architectures

Page 4:

The “Promise”
• Most ML works well because of human-designed representations and input features
• ML becomes just optimizing weights
• Representation learning attempts to automatically learn good features and representations
• Deep learning attempts to learn multiple levels of representation of increasing complexity/abstraction

[Figure: examples of tasks built on hand-designed representations – Named Entity Recognition, WordNet, Semantic Role Labeling]

Page 5:

Neuron
• Neural networks come with their own terminological baggage
• Parameters:
  – Weights: w_i and b
  – Activation function
• If we drop the activation function, does this remind you of something?
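As a concrete note on the terminology above, here is a minimal sketch of a single neuron in Python, with an assumed sigmoid activation and made-up weights (none of these values come from the slides):

```python
import numpy as np

def sigmoid(u):
    # Logistic activation: squashes any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-u))

def neuron(x, w, b):
    # Weighted sum of inputs plus bias, passed through the activation.
    # Dropping the activation leaves a plain linear score.
    return sigmoid(np.dot(w, x) + b)

# Illustrative parameters (not from the slides).
x = np.array([1.0, 2.0, -1.0])   # inputs
w = np.array([0.5, -0.3, 0.8])   # weights w_i
b = 0.1                          # bias
print(neuron(x, w, b))
```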

Page 6:

Biological “Inspiration”

Page 7:

Neural Network

Page 8:

Neural Network

Page 9:

Matrix Notation

$W''(W'a + b') + b''$

$h_1 = W'_{11}a_1 + W'_{12}a_2 + b'_1$
$h_2 = W'_{21}a_1 + W'_{22}a_2 + b'_2$

$o_1 = W''_{11}h_1 + W''_{12}h_2 + b''_1$
$o_2 = W''_{21}h_1 + W''_{22}h_2 + b''_2$
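The same two-layer computation in matrix form, as a small numpy sketch; the 2x2 weight matrices and biases are made-up values chosen only to mirror the notation above:

```python
import numpy as np

# Hidden layer h = W' a + b', output o = W'' h + b''.
W1 = np.array([[0.2, -0.5],     # W'  (rows: h1, h2)
               [0.7,  0.1]])
b1 = np.array([0.0, 0.1])       # b'
W2 = np.array([[1.0,  0.3],     # W'' (rows: o1, o2)
               [-0.4, 0.6]])
b2 = np.array([0.2, -0.1])      # b''

a = np.array([1.0, 2.0])        # input (a1, a2)
h = W1 @ a + b1                 # h_i = sum_j W'_{ij} a_j + b'_i
o = W2 @ h + b2                 # o_i = sum_j W''_{ij} h_j + b''_i
print(h, o)
```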

Page 10:

Neuron and Other Models
• A single neuron is a perceptron
• Strong connection to MaxEnt – how?

Page 11:

From MaxEnt to Neural Nets
• Vector-form MaxEnt:

  $P(y \mid x; w) = \frac{e^{w^\top \phi(x,y)}}{\sum_{y'} e^{w^\top \phi(x,y')}}$

• For two classes:

  $P(y_1 \mid x; w) = \frac{e^{w^\top \phi(x,y_1)}}{e^{w^\top \phi(x,y_1)} + e^{w^\top \phi(x,y_2)}}
  = \frac{e^{w^\top \phi(x,y_1)}}{e^{w^\top \phi(x,y_1)} + e^{w^\top \phi(x,y_2)}} \cdot \frac{e^{-w^\top \phi(x,y_1)}}{e^{-w^\top \phi(x,y_1)}}$

  $= \frac{1}{1 + e^{w^\top(\phi(x,y_2) - \phi(x,y_1))}} = \frac{1}{1 + e^{-w^\top z}} = f(w^\top z)$

  where $z = \phi(x,y_1) - \phi(x,y_2)$ and $f$ is the logistic function (sigmoid)
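A quick numerical check of this derivation (not from the slides): with made-up weights and feature vectors, the two-class MaxEnt probability and the sigmoid of $w^\top z$ coincide:

```python
import numpy as np

w = np.array([0.4, -1.2, 0.3])       # illustrative weights
phi1 = np.array([1.0, 0.0, 2.0])     # phi(x, y1), made up
phi2 = np.array([0.0, 1.0, 1.0])     # phi(x, y2), made up

# Two-class MaxEnt: softmax over the two class scores.
scores = np.array([w @ phi1, w @ phi2])
p_maxent = np.exp(scores[0]) / np.exp(scores).sum()

# Sigmoid of w^T z, where z = phi(x, y1) - phi(x, y2).
z = phi1 - phi2
p_sigmoid = 1.0 / (1.0 + np.exp(-(w @ z)))

print(p_maxent, p_sigmoid)   # identical up to floating-point error
```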

Page 12:

From MaxEnt to Neural Nets
• Vector-form MaxEnt:

  $P(y \mid x; w) = \frac{e^{w^\top \phi(x,y)}}{\sum_{y'} e^{w^\top \phi(x,y')}}$

• For two classes:

  $P(y_1 \mid x; w) = \frac{1}{1 + e^{-w^\top z}} = f(w^\top z)$

• Neuron:
  – Add an “always on” feature for the class prior → bias term (b)

  $h_{w,b}(z) = f(w^\top z + b), \quad f(u) = \frac{1}{1 + e^{-u}}$

Page 13:

From MaxEnt to Neural Nets
• Vector-form MaxEnt:

  $P(y \mid x; w) = \frac{e^{w^\top \phi(x,y)}}{\sum_{y'} e^{w^\top \phi(x,y')}}$

• Neuron:

  $h_{w,b}(z) = f(w^\top z + b), \quad f(u) = \frac{1}{1 + e^{-u}}$

• Neuron parameters: w, b

Page 14:

Neural Net = Several MaxEnt Models
• Feed a number of MaxEnt models → vector of outputs
• And repeat …

Page 15:

Neural Net = Several MaxEnt Models
• But: how do we tell the hidden layer what to do?
  – Learning will figure it out

Page 16:

How to Train?
• No hidden layer:
  – Supervised
  – Just like MaxEnt
• With hidden layers:
  – Latent units → not convex
  – What do we do?
    • Back-propagate the gradient (see the sketch below)
    • About the same, but no guarantees
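A minimal sketch of what back-propagating the gradient looks like for one hidden layer, using a made-up input, a squared-error loss, and sigmoid units (all assumptions for illustration; the slides do not fix a particular loss or layer size):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # one made-up training input
t = 1.0                         # its target output
W1 = rng.normal(size=(4, 3))    # input -> hidden weights
b1 = np.zeros(4)
w2 = rng.normal(size=4)         # hidden -> output weights
b2 = 0.0

# Forward pass.
h = sigmoid(W1 @ x + b1)        # hidden activations
y = sigmoid(w2 @ h + b2)        # scalar output
loss = 0.5 * (y - t) ** 2

# Backward pass: apply the chain rule layer by layer.
dz2 = (y - t) * y * (1 - y)     # dL/d(output pre-activation)
dw2 = dz2 * h                   # gradient for w2
db2 = dz2
dh = dz2 * w2                   # back-propagate into the hidden layer
dz1 = dh * h * (1 - h)          # through the hidden sigmoid
dW1 = np.outer(dz1, x)          # gradient for W1
db1 = dz1

# One gradient-descent step.
lr = 0.1
W1 -= lr * dW1; b1 -= lr * db1
w2 -= lr * dw2; b2 -= lr * db2
```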

Page 17:

Probabilistic Output from Neural Nets

• What if we want the output to be a probability distribution over possible outputs?
• Normalize the output activations using softmax:

  $y = \mathrm{softmax}(W \cdot z + b)$

  $\mathrm{softmax}(q)_i = \frac{e^{q_i}}{\sum_{j=1}^{k} e^{q_j}}$

  – Where $q$ is the output layer
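A sketch of this softmax normalization in Python; the max-subtraction is a standard numerical-stability trick added here, not something the slide specifies:

```python
import numpy as np

def softmax(q):
    # softmax(q)_i = exp(q_i) / sum_j exp(q_j).
    # Subtracting max(q) does not change the result but avoids overflow.
    e = np.exp(q - np.max(q))
    return e / e.sum()

q = np.array([2.0, 1.0, -1.0])   # illustrative output-layer activations
y = softmax(q)
print(y, y.sum())                # a probability distribution summing to 1
```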

Page 18:

Word Representations
• So far, atomic symbols: “hotel”, “conference”, “walking”, “___ing”
• But neural networks take vector input
• How can we bridge the gap?
• One-hot vectors
  – Dimensionality: size of vocabulary
    • 20K for speech
    • 500K for broad-coverage domains
    • 13M for Google corpora

hotel      = [0 0 0 0 … 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
conference = [0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
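A sketch of building such one-hot vectors over a tiny assumed vocabulary (the real vocabularies above run 20K to 13M words; this is only for illustration):

```python
import numpy as np

vocab = ["hotel", "conference", "walking", "hotels"]   # toy vocabulary
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # A vector the size of the vocabulary with a single 1
    # at the word's index and 0 everywhere else.
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

print(one_hot("hotel"))        # [1. 0. 0. 0.]
print(one_hot("conference"))   # [0. 1. 0. 0.]
```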

Page 19:

Word Representations
• One-hot vectors:
  – Problems?
  – Information sharing?
    • “hotel” vs. “hotels”

hotel      = [0 0 0 0 … 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
conference = [0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
hotels     = [0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
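To make the information-sharing problem concrete (a small demonstration, not slide content): any two distinct one-hot vectors are orthogonal, so “hotel” is exactly as unrelated to “hotels” as it is to “conference”:

```python
import numpy as np

vocab = ["hotel", "conference", "hotels"]   # toy vocabulary

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

# Dot product (and cosine similarity) between any two distinct
# one-hot vectors is 0: no information is shared between words.
print(np.dot(one_hot("hotel"), one_hot("hotels")))       # 0.0
print(np.dot(one_hot("hotel"), one_hot("conference")))   # 0.0
```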

Page 20:

Word Embeddings
• Each word is represented using a dense, low-dimensional vector
  – Low-dimensional << vocabulary size
• If trained well, similar words will have similar vectors
• How to train? What objective to maximize?
  – Soon …
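Operationally, an embedding table is just a dense matrix with one row per word; a sketch with assumed dimensions and random (untrained) values:

```python
import numpy as np

vocab = ["hotel", "conference", "walking", "hotels"]
word_to_id = {w: i for i, w in enumerate(vocab)}

d = 8                                    # embedding dimension << |V|
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), d))     # embedding matrix, one row per word

def embed(word):
    # Looking a word up is just selecting its row.
    return E[word_to_id[word]]

v_hotel, v_hotels = embed("hotel"), embed("hotels")
cos = v_hotel @ v_hotels / (np.linalg.norm(v_hotel) * np.linalg.norm(v_hotels))
print(cos)   # after good training, similar words would score high here
```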

Page 21:

Word Embeddings as Features
• Example: sentiment classification
  – very positive, positive, neutral, negative, very negative
• Feature-based models: bag of words
• Any good neural net architecture?
  – Concatenate all the vectors
    • Problem: different document → different length
  – Instead: sum, average, etc.
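A sketch of the averaging option: documents of any length map to a fixed-size vector by averaging their word embeddings (vocabulary, dimensions, and values are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["the", "hotel", "was", "very", "nice"]
word_to_id = {w: i for i, w in enumerate(vocab)}
E = rng.normal(size=(len(vocab), 16))     # toy embedding matrix

def document_vector(tokens):
    # Average the embeddings of the document's words: documents of
    # any length map to a vector of the same dimensionality.
    ids = [word_to_id[t] for t in tokens]
    return E[ids].mean(axis=0)

doc = "the hotel was very very nice".split()
print(document_vector(doc).shape)   # (16,) regardless of document length
```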

Page 22:

Neural Bag-of-Words / Deep Averaging Networks
[Iyyer et al. 2015; Wang and Manning 2012]

IMDB sentiment analysis:
  BOW + fancy smoothing + SVM: 88.23
  NBOW + DAN: 89.4
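Roughly, the deep averaging idea behind the NBOW + DAN number is to average the word embeddings and then apply a few non-linear layers before a softmax classifier; the sketch below uses assumed layer sizes and random weights and is not the authors' implementation:

```python
import numpy as np

def relu(u):
    return np.maximum(0.0, u)

def softmax(q):
    e = np.exp(q - np.max(q))
    return e / e.sum()

rng = np.random.default_rng(2)
d, h, n_classes = 16, 32, 5           # assumed sizes
W1, b1 = rng.normal(size=(h, d)), np.zeros(h)
W2, b2 = rng.normal(size=(h, h)), np.zeros(h)
Wo, bo = rng.normal(size=(n_classes, h)), np.zeros(n_classes)

def dan_forward(word_vectors):
    # 1) Neural bag of words: average the word embeddings.
    z = np.mean(word_vectors, axis=0)
    # 2) "Deep": a couple of non-linear layers on top of the average.
    z = relu(W1 @ z + b1)
    z = relu(W2 @ z + b2)
    # 3) Softmax over the sentiment classes.
    return softmax(Wo @ z + bo)

doc = rng.normal(size=(7, d))         # 7 made-up word vectors
print(dan_forward(doc))
```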

Page 23:

Practical Tips
• Select a network structure appropriate for the problem
  – Window vs. recurrent vs. recursive
  – Non-linearity function
• Gradient checks to identify bugs (see the sketch below)
• Parameter initialization
• Is the model powerful enough?
  – If not, make it larger
  – If yes, regularize, otherwise it will overfit
• Know your non-linearity function and its gradient
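A sketch of the gradient-check tip: compare the analytic gradient against a central finite-difference estimate on a toy model (a generic recipe with made-up values, not course code):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def loss(w, x, t):
    # Toy model: single sigmoid neuron with squared-error loss.
    y = sigmoid(w @ x)
    return 0.5 * (y - t) ** 2

def analytic_grad(w, x, t):
    y = sigmoid(w @ x)
    return (y - t) * y * (1 - y) * x

def numeric_grad(w, x, t, eps=1e-5):
    # Central finite differences, one coordinate at a time.
    g = np.zeros_like(w)
    for i in range(len(w)):
        wp, wm = w.copy(), w.copy()
        wp[i] += eps
        wm[i] -= eps
        g[i] = (loss(wp, x, t) - loss(wm, x, t)) / (2 * eps)
    return g

w = np.array([0.3, -0.7, 0.2])
x = np.array([1.0, 2.0, -1.0])
t = 1.0
# The two gradients should agree to many decimal places.
print(np.max(np.abs(analytic_grad(w, x, t) - numeric_grad(w, x, t))))
```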

Page 24:

Avoiding Overfitting
• Reduce model size (but not too much)
• L1 and L2 regularization
• Early stopping (e.g., patience)
• Dropout (Hinton et al. 2012)
  – Randomly set 50% of inputs in each layer to 0 (sketch below)
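A sketch of the dropout step in the last bullet: randomly zero half of a layer's inputs during training; the rescaling by the keep probability (“inverted dropout”) is a common convention added here as an assumption, not part of the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_drop=0.5, train=True):
    # Training: zero each unit independently with probability p_drop and
    # scale the survivors so the expected activation is unchanged
    # ("inverted dropout"). Test time: pass activations through unchanged.
    if not train:
        return h
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

h = np.ones(10)        # illustrative hidden-layer activations
print(dropout(h))      # about half the entries zeroed, the rest scaled to 2.0
```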