Neural network language models
Lecture, Feb 16 · CS 690N, Spring 2017
Advanced Natural Language Processing
http://people.cs.umass.edu/~brenocon/anlp2017/
Brendan O'Connor
College of Information and Computer Sciences
University of Massachusetts Amherst
Neural Language Models
Feed-forward network

h = g(Vx + c)
y = Wh + b

[Figure: feed-forward network mapping input x through hidden layer h to output y]
[Slide: Phil Blunsom]
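A minimal NumPy sketch of this computation (the names V, c, W, b follow the slide; the layer sizes are illustrative assumptions):

```python
import numpy as np

def feed_forward(x, V, c, W, b, g=np.tanh):
    """Feed-forward network from the slide: h = g(Vx + c), y = Wh + b."""
    h = g(V @ x + c)   # nonlinear hidden layer
    y = W @ h + b      # linear output layer
    return h, y

# Illustrative sizes: 4-dim input, 3-dim hidden layer, 2-dim output.
rng = np.random.default_rng(0)
V, c = rng.normal(size=(3, 4)), np.zeros(3)
W, b = rng.normal(size=(2, 3)), np.zeros(2)
h, y = feed_forward(rng.normal(size=4), V, c, W, b)
```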
Nonlinear activation functions
[Figure 5.2: Nonlinear activation functions for neural networks]

(Textbook excerpt, section 5.3, Recurrent neural network language models:)
recurrent neural network (RNN; Mikolov et al., 2010). The basic idea is to recurrently update the context vectors as we move through the sequence. Let us write h_m for the contextual information at position m in the sequence. RNNs employ the following recurrence:

x_m ≜ φ_{w_m}    (5.27)
h_m = g(Θ h_{m-1} + x_m)    (5.28)
p(w_{m+1} | w_1, w_2, ..., w_m) = exp(φ_{w_{m+1}} · h_m) / Σ_{w' ∈ V} exp(φ_{w'} · h_m)    (5.29)

where φ is a matrix of input word embeddings, and x_m denotes the embedding for word w_m. The function g is an element-wise nonlinear activation function. Typical choices are:

• tanh(x), the hyperbolic tangent;
• σ(x), the sigmoid function 1/(1 + exp(-x));
• (x)_+, the rectified linear unit, (x)_+ = max(x, 0), also called ReLU.
These activation functions are shown in Figure 5.2. The sigmoid and tanh functions "squash" their inputs into a fixed range: [0, 1] for the sigmoid, [-1, 1] for tanh. This makes it possible to chain together many iterations of these functions without numerical instability.
A key point about the RNN language model is that although each w_m depends only on the context vector h_{m-1}, this vector is in turn influenced by all previous tokens w_1, w_2, ..., w_{m-1} through the recurrence operation: w_1 affects h_1, which affects h_2, and so on, until the information is propagated all the way to h_{m-1}, and then on to w_m (see Figure 5.1). This is an important distinction from n-gram language models, where any information outside the n-word window is ignored. Thus, in principle, the RNN language model can ...
(c) Jacob Eisenstein 2014-2017. Work in progress.
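A sketch of one step of this recurrence in NumPy, assuming (as in the excerpt) that the same embedding matrix φ appears on both the input side (eq. 5.27) and the output side (eq. 5.29); all sizes are toy values:

```python
import numpy as np

def rnn_lm_step(w_m, h_prev, phi, Theta, g=np.tanh):
    """One RNN LM step: x_m = phi[w_m] (5.27); h_m = g(Theta h_{m-1} + x_m) (5.28);
    p(w_{m+1} | w_1..w_m) = softmax over phi_{w'} . h_m (5.29)."""
    x_m = phi[w_m]                     # embedding lookup for word w_m
    h_m = g(Theta @ h_prev + x_m)      # recurrent context update
    scores = phi @ h_m                 # phi_{w'} . h_m for every word w' in V
    p = np.exp(scores - scores.max())  # stable exponentiation
    return h_m, p / p.sum()            # normalized next-word distribution

vocab_size, d = 10, 5
rng = np.random.default_rng(0)
phi = rng.normal(size=(vocab_size, d))   # rows are word embeddings
Theta = rng.normal(size=(d, d))
h_m, p_next = rnn_lm_step(w_m=3, h_prev=np.zeros(d), phi=phi, Theta=Theta)
```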
sigmoid(x) = e^x / (1 + e^x)
tanh(x) = 2 · sigmoid(2x) - 1
(x)_+ = max(0, x), a.k.a. "ReLU"
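The same three activations as code, with a numeric check of the tanh/sigmoid identity above (a sketch; nothing here is specific to any library):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # equivalently e^x / (1 + e^x)

def relu(x):
    return np.maximum(x, 0.0)         # (x)_+ = max(0, x)

xs = np.linspace(-3.0, 3.0, 13)
assert np.allclose(np.tanh(xs), 2.0 * sigmoid(2.0 * xs) - 1.0)
```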
Neural Language Models
Trigram NN language model
h_n = g(V [w_{n-1}; w_{n-2}] + c)
p_n = softmax(W h_n + b)
softmax(u)_i = exp(u_i) / Σ_j exp(u_j)

• w_i are one-hot vectors and p_i are distributions,
• |w_i| = |p_i| = V (words in the vocabulary),
• V is usually very large (> 1e5).

[Figure: network diagram with inputs w_{n-2}, w_{n-1}, hidden layer h_n, and output distribution p_n]
[Slide: Phil Blunsom]
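A NumPy sketch of this forward pass. The slide overloads V as both the input weight matrix and the vocabulary size; below, the matrix is named Vmat and the size vocab_size, and all sizes are toy assumptions:

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())                 # subtract max for numerical stability
    return e / e.sum()

def trigram_nnlm(w_prev1, w_prev2, Vmat, c, W, b, g=np.tanh):
    """h_n = g(V [w_{n-1}; w_{n-2}] + c);  p_n = softmax(W h_n + b)."""
    x = np.concatenate([w_prev1, w_prev2])  # stacked one-hot context vectors
    h_n = g(Vmat @ x + c)
    return softmax(W @ h_n + b)

vocab_size, d = 8, 4                        # toy sizes; real vocabularies are > 1e5
rng = np.random.default_rng(0)
Vmat = rng.normal(size=(d, 2 * vocab_size))
W = rng.normal(size=(vocab_size, d))
one_hot = lambda i: np.eye(vocab_size)[i]
p_n = trigram_nnlm(one_hot(2), one_hot(5), Vmat, np.zeros(d), W, np.zeros(vocab_size))
assert np.isclose(p_n.sum(), 1.0)           # p_n is a distribution over the vocabulary
```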
[Annotation on a repeat of the previous slide: multiplying V by a one-hot vector selects a single column, so the columns of V act as learned word embeddings]
Neural Language Models: Sampling
w_n | w_{n-1}, w_{n-2} ~ p_n

[Figure: given the context w_{n-2} = "he", w_{n-1} = "built", the model computes h_n and the distribution p_n over the vocabulary (the, it, if, was, and, all, her, he, cat, rock, dog, yes, we, ten, sun, of, a, I, you, There, built, ..., aardvark), from which w_n = "a" is sampled]
[Slide: Phil Blunsom]
Neural Language Models: Sampling
w_n | w_{n-1}, w_{n-2} ~ p_n

[Figure: step 1 of sampling: from w_{-1} = <s>, w_0 = <s>, compute h_1 and p_1, and sample w_1 = "There"]
[Slide: Phil Blunsom]
Neural Language Models: Sampling
w_n | w_{n-1}, w_{n-2} ~ p_n

[Figure: step 2: from (w_0, w_1) = (<s>, "There"), compute h_2 and p_2, and sample w_2 = "he"]
[Slide: Phil Blunsom]
Neural Language Models: Sampling
w_n | w_{n-1}, w_{n-2} ~ p_n

[Figure: step 3: from (w_1, w_2) = ("There", "he"), compute h_3 and p_3, and sample w_3 = "built"]
[Slide: Phil Blunsom]
Neural Language Models: Sampling
w_n | w_{n-1}, w_{n-2} ~ p_n

[Figure: step 4: from (w_2, w_3) = ("he", "built"), compute h_4 and p_4, and sample w_4 = "a", giving the prefix "There he built a ..."]
[Slide: Phil Blunsom]
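A sketch of the left-to-right sampling loop these slides animate, reusing the trigram_nnlm function and toy parameters from the sketch above; the <s> handling and the toy vocabulary are assumptions:

```python
def sample_sequence(vocab, Vmat, c, W, b, n_words, bos_id=0, seed=1):
    """Sample w_1, w_2, ...: at each step draw w_n ~ p_n = p(. | w_{n-1}, w_{n-2})."""
    rng = np.random.default_rng(seed)
    one_hot = lambda i: np.eye(len(vocab))[i]
    context = [bos_id, bos_id]                # w_{-1} = w_0 = <s>
    words = []
    for _ in range(n_words):
        p_n = trigram_nnlm(one_hot(context[-1]), one_hot(context[-2]), Vmat, c, W, b)
        w_n = rng.choice(len(vocab), p=p_n)   # w_n | w_{n-1}, w_{n-2} ~ p_n
        words.append(vocab[w_n])
        context.append(w_n)
    return words

toy_vocab = ["<s>", "There", "he", "built", "a", "the", "cat", "aardvark"]
print(sample_sequence(toy_vocab, Vmat, np.zeros(d), W, np.zeros(vocab_size), n_words=5))
```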
Neural Language Models: Training
The usual training objective is the cross entropy of the data given the model (MLE):

F = -(1/N) Σ_n cost_n(w_n, p_n)

The cost function is simply the model's estimated log-probability of w_n:

cost(a, b) = aᵀ log b

(assuming w_i is a one-hot encoding of the word)

[Figure: the trigram network with a cost node cost_n comparing p_n against the observed word w_n]
[Slide: Phil Blunsom]
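The objective as a NumPy sketch: with one-hot targets, cost_n = w_nᵀ log p_n simply picks out the log-probability the model assigns to the observed word.

```python
import numpy as np

def cross_entropy_objective(W_onehot, P):
    """F = -(1/N) sum_n w_n^T log p_n.
    W_onehot: (N, V) one-hot observed words; P: (N, V) predicted distributions."""
    costs = np.sum(W_onehot * np.log(P), axis=1)   # log-prob of each observed word
    return -costs.mean()

P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
W_onehot = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
# Equivalent indexed form: -(log 0.7 + log 0.8) / 2
assert np.isclose(cross_entropy_objective(W_onehot, P),
                  -(np.log(0.7) + np.log(0.8)) / 2)
```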
Neural Language Models: Training
Calculating the gradients is straightforward with backpropagation:

∂F/∂W = -(1/N) Σ_n (∂cost_n/∂p_n) (∂p_n/∂W)
∂F/∂V = -(1/N) Σ_n (∂cost_n/∂p_n) (∂p_n/∂h_n) (∂h_n/∂V)

[Figure: the same trigram network, with gradients flowing back from cost_n through p_n and h_n]
[Slide: Phil Blunsom]
Neural Language Models: Training
Calculating the gradients is straightforward with backpropagation:

∂F/∂W = -(1/4) Σ_{n=1..4} (∂cost_n/∂p_n) (∂p_n/∂W),
∂F/∂V = -(1/4) Σ_{n=1..4} (∂cost_n/∂p_n) (∂p_n/∂h_n) (∂h_n/∂V)

[Figure: four unrolled time steps; contexts (w_{-1}, w_0), ..., (w_2, w_3) feed h_1..h_4 and p_1..p_4, each compared to the target w_1..w_4 by cost_1..cost_4, which sum into F]
Note that calculating the gradients for each time step n is independent of all other time steps, so they can be calculated in parallel and summed.
[Slide: Phil Blunsom]
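A sketch of the output-layer gradient, vectorized over time steps. It uses the standard softmax-plus-cross-entropy result that ∂(-cost_n)/∂(W h_n + b) = p_n - w_n, which the slide does not derive; the shapes are assumptions:

```python
import numpy as np

def grad_F_wrt_W(P, W_onehot, H):
    """dF/dW = (1/N) sum_n (p_n - w_n) h_n^T.
    P, W_onehot: (N, V); H: (N, d) hidden states h_1..h_N."""
    N = P.shape[0]
    # One matrix product computes every per-step outer product and sums them,
    # mirroring the slide's point that the time steps are independent.
    return (P - W_onehot).T @ H / N
```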
Comparison with Count-Based N-Gram LMs
Good
• Better generalisation on unseen n-grams, poorer on seen n-grams. Solution: direct (linear) n-gram features.
• Simple NLMs are often an order of magnitude smaller in memory footprint than their vanilla n-gram cousins (though not if you use the linear features suggested above!).
Bad
• The number of parameters in the model scales with the n-gram size and thus the length of the history captured.
• The n-gram history is finite, and thus there is a limit on the longest dependencies that can be captured.
• Mostly trained with Maximum Likelihood based objectives which do not encode the expected frequencies of words a priori.
[Slide: Phil Blunsom]
Training NNs
• Dropout (preferred regularization method; see the sketch after this list)
• Minibatching
• Parallelization (GPUs)
• Local optima?
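A sketch of the first three items applied to a trigram NN LM in PyTorch (a hypothetical model; all hyperparameters are illustrative assumptions):

```python
import torch
import torch.nn as nn

class TrigramNNLM(nn.Module):
    def __init__(self, vocab_size, d_embed=64, d_hidden=128, p_drop=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_embed)
        self.hidden = nn.Linear(2 * d_embed, d_hidden)
        self.drop = nn.Dropout(p_drop)          # dropout regularization
        self.out = nn.Linear(d_hidden, vocab_size)

    def forward(self, ctx):                     # ctx: (batch, 2) context word ids
        x = self.embed(ctx).flatten(1)          # concatenate the two embeddings
        h = self.drop(torch.tanh(self.hidden(x)))
        return self.out(h)                      # logits; softmax lives in the loss

model = TrigramNNLM(vocab_size=10_000)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
ctx = torch.randint(0, 10_000, (32, 2))         # a minibatch of 32 contexts
tgt = torch.randint(0, 10_000, (32,))
loss = nn.functional.cross_entropy(model(ctx), tgt)   # MLE objective
loss.backward(); opt.step(); opt.zero_grad()
# model.cuda() / ctx.cuda() would run the same computation on a GPU.
```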
Word/feature embeddings
• "Lookup layer": from discrete input features (words, n-grams, etc.) to continuous vectors
• Anything that was directly used in log-linear models, move to using vectors
• Learn or not? (see the sketch after this list)
  • Learn: they're just model parameters
  • Fixed: use pretrained embeddings
    • Use a faster-to-train model on a very large, perhaps different, dataset [e.g. word2vec, GloVe pretrained word vectors]
  • Both: initialize with pretrained, then learn
• Word at test but not training time?
• Shared representations for domain adaptation and multitask learning
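A sketch of the learn/fixed/both options for the lookup layer in PyTorch; `pretrained` stands in for e.g. word2vec or GloVe vectors:

```python
import torch
import torch.nn as nn

vocab_size, d = 10_000, 300
pretrained = torch.randn(vocab_size, d)   # stand-in for word2vec/GloVe vectors

# Learn from scratch: the lookup table is just more model parameters.
learned = nn.Embedding(vocab_size, d)

# Fixed: initialize from pretrained vectors and never update them.
fixed = nn.Embedding.from_pretrained(pretrained, freeze=True)

# Both: initialize with pretrained, then fine-tune during training.
finetuned = nn.Embedding.from_pretrained(pretrained, freeze=False)
```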
                                   Local models                Long-history models
                                   w_t | w_{t-2}, w_{t-1}      w_t | w_1, ..., w_{t-1}

Fully observed,
direct word models                 . . . . . . Log-linear models . . . . . .

Latent-class,
direct word models                 Markovian neural LM         Recurrent neural LM