Neural network language models
Lecture, Feb 16 · CS 690N, Spring 2017
Advanced Natural Language Processing
http://people.cs.umass.edu/~brenocon/anlp2017/
Brendan O'Connor
College of Information and Computer Sciences
University of Massachusetts Amherst
Neural Language Models
Feed-forward network

h = g(Vx + c)
y = Wh + b

[Figure: feed-forward network mapping input x through hidden layer h to output y]
[Slide: Phil Blunsom]
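A minimal NumPy sketch of this computation (the names V, c, W, b follow the slide; the layer sizes are illustrative assumptions):

```python
import numpy as np

def feed_forward(x, V, c, W, b, g=np.tanh):
    """Feed-forward network from the slide: h = g(Vx + c), y = Wh + b."""
    h = g(V @ x + c)   # nonlinear hidden layer
    y = W @ h + b      # linear output layer
    return h, y

# Illustrative sizes: 4-dim input, 3-dim hidden layer, 2-dim output.
rng = np.random.default_rng(0)
V, c = rng.normal(size=(3, 4)), np.zeros(3)
W, b = rng.normal(size=(2, 3)), np.zeros(2)
h, y = feed_forward(rng.normal(size=4), V, c, W, b)
```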
Nonlinear activation functions
[Figure 5.2: Nonlinear activation functions for neural networks]

(Textbook excerpt, section 5.3, Recurrent neural network language models:)
recurrent neural network (RNN; Mikolov et al., 2010). The basic idea is to recurrently update the context vectors as we move through the sequence. Let us write h_m for the contextual information at position m in the sequence. RNNs employ the following recurrence:

x_m ≜ φ_{w_m}    (5.27)
h_m = g(Θ h_{m-1} + x_m)    (5.28)
p(w_{m+1} | w_1, w_2, ..., w_m) = exp(φ_{w_{m+1}} · h_m) / Σ_{w' ∈ V} exp(φ_{w'} · h_m)    (5.29)

where φ is a matrix of input word embeddings, and x_m denotes the embedding for word w_m. The function g is an element-wise nonlinear activation function. Typical choices are:

• tanh(x), the hyperbolic tangent;
• σ(x), the sigmoid function 1/(1 + exp(-x));
• (x)_+, the rectified linear unit, (x)_+ = max(x, 0), also called ReLU.
These activation functions are shown in Figure 5.2. The sigmoid and tanh functions "squash" their inputs into a fixed range: [0, 1] for the sigmoid, [-1, 1] for tanh. This makes it possible to chain together many iterations of these functions without numerical instability.
A key point about the RNN language model is that although each w_m depends only on the context vector h_{m-1}, this vector is in turn influenced by all previous tokens w_1, w_2, ..., w_{m-1} through the recurrence operation: w_1 affects h_1, which affects h_2, and so on, until the information is propagated all the way to h_{m-1}, and then on to w_m (see Figure 5.1). This is an important distinction from n-gram language models, where any information outside the n-word window is ignored. Thus, in principle, the RNN language model can ...
(c) Jacob Eisenstein 2014-2017. Work in progress.
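A sketch of one step of this recurrence in NumPy, assuming (as in the excerpt) that the same embedding matrix φ appears on both the input side (eq. 5.27) and the output side (eq. 5.29); all sizes are toy values:

```python
import numpy as np

def rnn_lm_step(w_m, h_prev, phi, Theta, g=np.tanh):
    """One RNN LM step: x_m = phi[w_m] (5.27); h_m = g(Theta h_{m-1} + x_m) (5.28);
    p(w_{m+1} | w_1..w_m) = softmax over phi_{w'} . h_m (5.29)."""
    x_m = phi[w_m]                     # embedding lookup for word w_m
    h_m = g(Theta @ h_prev + x_m)      # recurrent context update
    scores = phi @ h_m                 # phi_{w'} . h_m for every word w' in V
    p = np.exp(scores - scores.max())  # stable exponentiation
    return h_m, p / p.sum()            # normalized next-word distribution

vocab_size, d = 10, 5
rng = np.random.default_rng(0)
phi = rng.normal(size=(vocab_size, d))   # rows are word embeddings
Theta = rng.normal(size=(d, d))
h_m, p_next = rnn_lm_step(w_m=3, h_prev=np.zeros(d), phi=phi, Theta=Theta)
```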
sigmoid(x) = e^x / (1 + e^x)
tanh(x) = 2 · sigmoid(2x) - 1
(x)_+ = max(0, x), a.k.a. "ReLU"
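The same three activations as code, with a numeric check of the tanh/sigmoid identity above (a sketch; nothing here is specific to any library):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # equivalently e^x / (1 + e^x)

def relu(x):
    return np.maximum(x, 0.0)         # (x)_+ = max(0, x)

xs = np.linspace(-3.0, 3.0, 13)
assert np.allclose(np.tanh(xs), 2.0 * sigmoid(2.0 * xs) - 1.0)
```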
Neural Language Models
Trigram NN language model
h_n = g(V [w_{n-1}; w_{n-2}] + c)
p_n = softmax(W h_n + b)
softmax(u)_i = exp(u_i) / Σ_j exp(u_j)

• w_i are one-hot vectors and p_i are distributions,
• |w_i| = |p_i| = V (words in the vocabulary),
• V is usually very large (> 1e5).

[Figure: network diagram with inputs w_{n-2}, w_{n-1}, hidden layer h_n, and output distribution p_n]
[Slide: Phil Blunsom]
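A NumPy sketch of this forward pass. The slide overloads V as both the input weight matrix and the vocabulary size; below, the matrix is named Vmat and the size vocab_size, and all sizes are toy assumptions:

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())                 # subtract max for numerical stability
    return e / e.sum()

def trigram_nnlm(w_prev1, w_prev2, Vmat, c, W, b, g=np.tanh):
    """h_n = g(V [w_{n-1}; w_{n-2}] + c);  p_n = softmax(W h_n + b)."""
    x = np.concatenate([w_prev1, w_prev2])  # stacked one-hot context vectors
    h_n = g(Vmat @ x + c)
    return softmax(W @ h_n + b)

vocab_size, d = 8, 4                        # toy sizes; real vocabularies are > 1e5
rng = np.random.default_rng(0)
Vmat = rng.normal(size=(d, 2 * vocab_size))
W = rng.normal(size=(vocab_size, d))
one_hot = lambda i: np.eye(vocab_size)[i]
p_n = trigram_nnlm(one_hot(2), one_hot(5), Vmat, np.zeros(d), W, np.zeros(vocab_size))
assert np.isclose(p_n.sum(), 1.0)           # p_n is a distribution over the vocabulary
```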
[Annotation on a repeat of the previous slide: multiplying V by a one-hot vector selects a single column, so the columns of V act as learned word embeddings]
Neural Language Models: Sampling
w_n | w_{n-1}, w_{n-2} ~ p_n

[Figure: given the context w_{n-2} = "he", w_{n-1} = "built", the model computes h_n and the distribution p_n over the vocabulary (the, it, if, was, and, all, her, he, cat, rock, dog, yes, we, ten, sun, of, a, I, you, There, built, ..., aardvark), from which w_n = "a" is sampled]
[Slide: Phil Blunsom]
Neural Language Models: Sampling
w_n | w_{n-1}, w_{n-2} ~ p_n

[Figure: step 1 of sampling: from w_{-1} = <s>, w_0 = <s>, compute h_1 and p_1, and sample w_1 = "There"]
[Slide: Phil Blunsom]
Neural Language Models: Sampling
w_n | w_{n-1}, w_{n-2} ~ p_n

[Figure: step 2: from (w_0, w_1) = (<s>, "There"), compute h_2 and p_2, and sample w_2 = "he"]
[Slide: Phil Blunsom]
Neural Language Models: Sampling
w_n | w_{n-1}, w_{n-2} ~ p_n

[Figure: step 3: from (w_1, w_2) = ("There", "he"), compute h_3 and p_3, and sample w_3 = "built"]
[Slide: Phil Blunsom]
Neural Language Models: Sampling
w_n | w_{n-1}, w_{n-2} ~ p_n

[Figure: step 4: from (w_2, w_3) = ("he", "built"), compute h_4 and p_4, and sample w_4 = "a", giving the prefix "There he built a ..."]
[Slide: Phil Blunsom]
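A sketch of the left-to-right sampling loop these slides animate, reusing the trigram_nnlm function and toy parameters from the sketch above; the <s> handling and the toy vocabulary are assumptions:

```python
def sample_sequence(vocab, Vmat, c, W, b, n_words, bos_id=0, seed=1):
    """Sample w_1, w_2, ...: at each step draw w_n ~ p_n = p(. | w_{n-1}, w_{n-2})."""
    rng = np.random.default_rng(seed)
    one_hot = lambda i: np.eye(len(vocab))[i]
    context = [bos_id, bos_id]                # w_{-1} = w_0 = <s>
    words = []
    for _ in range(n_words):
        p_n = trigram_nnlm(one_hot(context[-1]), one_hot(context[-2]), Vmat, c, W, b)
        w_n = rng.choice(len(vocab), p=p_n)   # w_n | w_{n-1}, w_{n-2} ~ p_n
        words.append(vocab[w_n])
        context.append(w_n)
    return words

toy_vocab = ["<s>", "There", "he", "built", "a", "the", "cat", "aardvark"]
print(sample_sequence(toy_vocab, Vmat, np.zeros(d), W, np.zeros(vocab_size), n_words=5))
```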
Neural Language Models: Training
The usual training objective is the cross entropy of the data given the model (MLE):

F = -(1/N) Σ_n cost_n(w_n, p_n)

The cost function is simply the model's estimated log-probability of w_n:

cost(a, b) = aᵀ log b

(assuming w_i is a one-hot encoding of the word)

[Figure: the trigram network with a cost node cost_n comparing p_n against the observed word w_n]
[Slide: Phil Blunsom]
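The objective as a NumPy sketch: with one-hot targets, cost_n = w_nᵀ log p_n simply picks out the log-probability the model assigns to the observed word.

```python
import numpy as np

def cross_entropy_objective(W_onehot, P):
    """F = -(1/N) sum_n w_n^T log p_n.
    W_onehot: (N, V) one-hot observed words; P: (N, V) predicted distributions."""
    costs = np.sum(W_onehot * np.log(P), axis=1)   # log-prob of each observed word
    return -costs.mean()

P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
W_onehot = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
# Equivalent indexed form: -(log 0.7 + log 0.8) / 2
assert np.isclose(cross_entropy_objective(W_onehot, P),
                  -(np.log(0.7) + np.log(0.8)) / 2)
```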
Neural Language Models: Training
Calculating the gradients is straightforward with backpropagation:

∂F/∂W = -(1/N) Σ_n (∂cost_n/∂p_n) (∂p_n/∂W)
∂F/∂V = -(1/N) Σ_n (∂cost_n/∂p_n) (∂p_n/∂h_n) (∂h_n/∂V)

[Figure: the same trigram network, with gradients flowing back from cost_n through p_n and h_n]
[Slide: Phil Blunsom]
Neural Language Models: Training
Calculating the gradients is straightforward with backpropagation:

∂F/∂W = -(1/4) Σ_{n=1..4} (∂cost_n/∂p_n) (∂p_n/∂W),
∂F/∂V = -(1/4) Σ_{n=1..4} (∂cost_n/∂p_n) (∂p_n/∂h_n) (∂h_n/∂V)

[Figure: four unrolled time steps; contexts (w_{-1}, w_0), ..., (w_2, w_3) feed h_1..h_4 and p_1..p_4, each compared to the target w_1..w_4 by cost_1..cost_4, which sum into F]
Note that calculating the gradients for each time step n is independent of all other time steps, so they can be calculated in parallel and summed.
[Slide: Phil Blunsom]
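A sketch of the output-layer gradient, vectorized over time steps. It uses the standard softmax-plus-cross-entropy result that ∂(-cost_n)/∂(W h_n + b) = p_n - w_n, which the slide does not derive; the shapes are assumptions:

```python
import numpy as np

def grad_F_wrt_W(P, W_onehot, H):
    """dF/dW = (1/N) sum_n (p_n - w_n) h_n^T.
    P, W_onehot: (N, V); H: (N, d) hidden states h_1..h_N."""
    N = P.shape[0]
    # One matrix product computes every per-step outer product and sums them,
    # mirroring the slide's point that the time steps are independent.
    return (P - W_onehot).T @ H / N
```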
Comparison with Count-Based N-Gram LMs
Good
• Better generalisation on unseen n-grams, poorer on seen n-grams. Solution: direct (linear) n-gram features.
• Simple NLMs are often an order of magnitude smaller in memory footprint than their vanilla n-gram cousins (though not if you use the linear features suggested above!).
Bad
• The number of parameters in the model scales with the n-gram size and thus the length of the history captured.
• The n-gram history is finite, and thus there is a limit on the longest dependencies that can be captured.
• Mostly trained with Maximum Likelihood based objectives which do not encode the expected frequencies of words a priori.
[Slide: Phil Blunsom]
Training NNs
• Dropout (preferred regularization method; see the sketch after this list)
• Minibatching
• Parallelization (GPUs)
• Local optima?
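A sketch of the first three items applied to a trigram NN LM in PyTorch (a hypothetical model; all hyperparameters are illustrative assumptions):

```python
import torch
import torch.nn as nn

class TrigramNNLM(nn.Module):
    def __init__(self, vocab_size, d_embed=64, d_hidden=128, p_drop=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_embed)
        self.hidden = nn.Linear(2 * d_embed, d_hidden)
        self.drop = nn.Dropout(p_drop)          # dropout regularization
        self.out = nn.Linear(d_hidden, vocab_size)

    def forward(self, ctx):                     # ctx: (batch, 2) context word ids
        x = self.embed(ctx).flatten(1)          # concatenate the two embeddings
        h = self.drop(torch.tanh(self.hidden(x)))
        return self.out(h)                      # logits; softmax lives in the loss

model = TrigramNNLM(vocab_size=10_000)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
ctx = torch.randint(0, 10_000, (32, 2))         # a minibatch of 32 contexts
tgt = torch.randint(0, 10_000, (32,))
loss = nn.functional.cross_entropy(model(ctx), tgt)   # MLE objective
loss.backward(); opt.step(); opt.zero_grad()
# model.cuda() / ctx.cuda() would run the same computation on a GPU.
```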
Word/feature embeddings
• "Lookup layer": from discrete input features (words, n-grams, etc.) to continuous vectors
• Anything that was directly used in log-linear models, move to using vectors
• Learn or not? (see the sketch after this list)
  • Learn: they're just model parameters
  • Fixed: use pretrained embeddings
    • Use a faster-to-train model on a very large, perhaps different, dataset [e.g. word2vec, GloVe pretrained word vectors]
  • Both: initialize with pretrained, then learn
• Word at test but not training time?
• Shared representations for domain adaptation and multitask learning
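A sketch of the learn/fixed/both options for the lookup layer in PyTorch; `pretrained` stands in for e.g. word2vec or GloVe vectors:

```python
import torch
import torch.nn as nn

vocab_size, d = 10_000, 300
pretrained = torch.randn(vocab_size, d)   # stand-in for word2vec/GloVe vectors

# Learn from scratch: the lookup table is just more model parameters.
learned = nn.Embedding(vocab_size, d)

# Fixed: initialize from pretrained vectors and never update them.
fixed = nn.Embedding.from_pretrained(pretrained, freeze=True)

# Both: initialize with pretrained, then fine-tune during training.
finetuned = nn.Embedding.from_pretrained(pretrained, freeze=False)
```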
                                   Local models                Long-history models
                                   w_t | w_{t-2}, w_{t-1}      w_t | w_1, ..., w_{t-1}

Fully observed,
direct word models                 . . . . . . Log-linear models . . . . . .

Latent-class,
direct word models                 Markovian neural LM         Recurrent neural LM