
Neural Networks Language Models

Philipp Koehn

10 October 2017


N-Gram Backoff Language Model

• Previously, we approximated

p(W) = p(w_1, w_2, ..., w_n)

• ... by applying the chain rule

p(W) = \prod_i p(w_i | w_1, ..., w_{i-1})

• ... and limiting the history (Markov order)

p(w_i | w_1, ..., w_{i-1}) \simeq p(w_i | w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1})

• Each p(w_i | w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1}) may not have enough statistics to estimate reliably

→ we back off to p(w_i | w_{i-3}, w_{i-2}, w_{i-1}), p(w_i | w_{i-2}, w_{i-1}), etc., all the way to p(w_i)

– exact details of backing off get complicated: "interpolated Kneser-Ney" (sketched below)
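For concreteness, here is a sketch of what such an interpolated back-off looks like for a bigram history (illustrative only; the full Kneser-Ney estimator also changes how the lower-order distribution is estimated): the higher-order count is discounted by a constant D, and the removed probability mass \lambda(w_{i-1}) is given to the lower-order continuation distribution P_cont.

```latex
% Interpolated back-off with absolute discounting (Kneser-Ney style), bigram case
P(w_i \mid w_{i-1}) =
  \frac{\max\!\big(c(w_{i-1}, w_i) - D,\; 0\big)}{c(w_{i-1})}
  + \lambda(w_{i-1})\, P_{\text{cont}}(w_i),
\qquad
\lambda(w_{i-1}) = \frac{D \cdot \big|\{\, w : c(w_{i-1}, w) > 0 \,\}\big|}{c(w_{i-1})}
```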


Refinements

• A whole family of back-off schemes

• Skip n-gram models that may back off to p(w_i | w_{i-2})

• Class-based models p(C(w_i) | C(w_{i-4}), C(w_{i-3}), C(w_{i-2}), C(w_{i-1}))

⇒ We are wrestling here with

– using as much relevant evidence as possible

– pooling evidence between words


First Sketch

[Figure: Words 1–4 are fed into a hidden layer that predicts Word 5]


Representing Words

• Words are represented with a one-hot vector, e.g.,

– dog = (0,0,0,0,1,0,0,0,0,....)
– cat = (0,0,0,0,0,0,0,1,0,....)
– eat = (0,1,0,0,0,0,0,0,0,....)

• That's a large vector!

• Remedies

– limit to, say, 20,000 most frequent words, rest are OTHER

– place words in √n classes, so each word is represented by

∗ 1 class label
∗ 1 word-in-class label (see the sketch below)
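A minimal Python sketch of these two representations on a toy vocabulary; the vocabulary, the OTHER handling, and the class assignment (simply by index) are made up for illustration.

```python
import numpy as np

# Toy vocabulary, assumed sorted by frequency; out-of-vocabulary words map to OTHER.
vocab = ["the", "eat", "of", "to", "dog", "a", "in", "cat", "OTHER"]
word2id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """One-hot vector over the (truncated) vocabulary."""
    v = np.zeros(len(vocab))
    v[word2id.get(word, word2id["OTHER"])] = 1.0
    return v

def two_hot(word, num_classes=3):
    """Two-hot representation: one one-hot vector for the class label,
    one for the word's position within its class."""
    class_size = int(np.ceil(len(vocab) / num_classes))
    cls, within = divmod(word2id.get(word, word2id["OTHER"]), class_size)
    c = np.zeros(num_classes);  c[cls] = 1.0
    w = np.zeros(class_size);   w[within] = 1.0
    return c, w

print(one_hot("dog"))        # a single 1 at position 4
print(two_hot("cat"))        # class vector + word-in-class vector
```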


Word Classes for Two-Hot Representations

• WordNet classes

• Brown clusters

• Frequency binning

– sort words by frequency
– place them in order into classes
– each class has the same token count
→ very frequent words have their own class
→ rare words share a class with many other words (see the sketch below)

• Anything goes: assign words randomly to classes
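A small sketch of the frequency-binning scheme described above (the function name and toy corpus are made up): words are sorted by frequency and filled into classes until each class has collected roughly the same number of corpus tokens.

```python
from collections import Counter

def frequency_binning(token_counts, num_classes):
    """Assign each word to a class such that every class covers roughly the
    same number of corpus tokens: very frequent words get a class of their
    own, rare words share a class with many others."""
    budget = sum(token_counts.values()) / num_classes
    classes, current, mass = {}, 0, 0.0
    for word, count in sorted(token_counts.items(), key=lambda kv: -kv[1]):
        classes[word] = current
        mass += count
        if mass >= budget and current < num_classes - 1:
            current, mass = current + 1, 0.0
    return classes

counts = Counter("the cat sat on the mat while the dog ate the fish".split())
print(frequency_binning(counts, num_classes=3))
```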


Second Sketch

[Figure: revised network sketch: Words 1–4 feed into a hidden layer that predicts Word 5]


word embeddings


Add a Hidden Layer

[Figure: Words 1–4 are each mapped by a shared matrix C into an embedding layer, which feeds a hidden layer that predicts Word 5]

• Map each word first into a lower-dimensional real-valued space

• Shared weight matrix C
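A tiny numpy sketch (sizes arbitrary) of why the shared matrix C behaves like a lookup table: multiplying a one-hot word vector by C just selects one row of C, i.e., the word's embedding.

```python
import numpy as np

vocab_size, embed_dim = 8, 3
C = np.random.randn(vocab_size, embed_dim)   # shared embedding matrix

one_hot = np.zeros(vocab_size)
one_hot[5] = 1.0                             # the word with index 5

# one_hot @ C picks out row 5 of C, so in practice the embedding layer
# is implemented as a table lookup rather than a matrix multiplication.
assert np.allclose(one_hot @ C, C[5])
```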


Details (Bengio et al., 2003)

• Add direct connections from embedding layer to output layer

• Activation functions

– input→embedding: none

– embedding→hidden: tanh

– hidden→output: softmax

• Training

– loop through the entire corpus

– update weights based on the error between the predicted probabilities and the one-hot vector for the output word (see the sketch below)
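A minimal numpy sketch of the forward pass of such a model, assuming a 4-word history; layer sizes are arbitrary, and bias terms as well as the direct embedding-to-output connections mentioned above are omitted for brevity. Training loops over the corpus and back-propagates the gradient of this cross-entropy loss.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, H, n = 1000, 30, 50, 4            # vocabulary, embedding size, hidden size, history length

C  = rng.normal(0, 0.1, (V, d))         # shared word embedding matrix
W1 = rng.normal(0, 0.1, (n * d, H))     # embedding -> hidden
W2 = rng.normal(0, 0.1, (H, V))         # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(history_ids):
    """Distribution over the next word given the ids of the n previous words."""
    x = np.concatenate([C[i] for i in history_ids])   # input -> embedding: no activation
    h = np.tanh(x @ W1)                               # embedding -> hidden: tanh
    return softmax(h @ W2)                            # hidden -> output: softmax

# One training example: cross-entropy between the predicted distribution
# and the one-hot vector of the observed next word.
history, next_word = [3, 17, 42, 7], 99
loss = -np.log(predict(history)[next_word])
```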


Word Embeddings

[Figure: multiplying a one-hot word vector by C yields the word embedding]

• By-product: embedding of word into continuous space

• Similar contexts → similar embedding

• Recall: distributional semantics


Word Embeddings


Word Embeddings


Are Word Embeddings Magic?

• Morphosyntactic regularities (Mikolov et al., 2013)

– adjectives: base form vs. comparative, e.g., good, better
– nouns: singular vs. plural, e.g., year, years
– verbs: present tense vs. past tense, e.g., see, saw

• Semantic regularities

– clothing is to shirt as dish is to bowl (see the sketch below)
– evaluated on human judgment data of semantic similarities
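A toy Python sketch of how such regularities are probed with vector arithmetic and cosine similarity; the hand-made embeddings exist only to make the computation concrete, real regularities emerge only from embeddings trained on large corpora.

```python
import numpy as np

# Hand-made toy embeddings; in practice these come from a trained model.
E = {
    "good":   np.array([1.0, 0.0, 0.2]),
    "better": np.array([1.0, 1.0, 0.2]),
    "year":   np.array([0.0, 0.0, 1.0]),
    "years":  np.array([0.0, 1.0, 1.0]),
}

def most_similar(query, embeddings, exclude=()):
    """Word whose embedding has the highest cosine similarity to the query vector."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    return max((w for w in embeddings if w not in exclude),
               key=lambda w: cos(embeddings[w], query))

# good : better = year : ?   ->   v(better) - v(good) + v(year) should land near v(years)
print(most_similar(E["better"] - E["good"] + E["year"], E,
                   exclude={"good", "better", "year"}))
```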


recurrent neural networks


Recurrent Neural Networks

[Figure: Word 1 passes through embedding E and hidden layer H to predict Word 2; a layer of nodes all set to 1 also feeds into H]

• Start: predict second word from first

• Mystery layer with nodes all with value 1


Recurrent Neural Networks

[Figure: the hidden layer values from the first step are copied in as additional input when predicting Word 3 from Word 2]


Recurrent Neural Networks

[Figure: the recurrence unrolled further: hidden layer values are copied forward at each step, predicting Word 3 from Word 2 and Word 4 from Word 3]
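A minimal numpy sketch of one step of this recurrence (sizes and the tanh/softmax choices are illustrative): the hidden layer values are carried over, combined with the embedding of the current word, and used to predict the next word. The initial hidden state of all 1s mirrors the "mystery layer" on the first recurrent slide.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, H = 1000, 30, 50                      # vocabulary, embedding and hidden sizes (arbitrary)

E  = rng.normal(0, 0.1, (V, d))             # word embeddings
Wx = rng.normal(0, 0.1, (d, H))             # embedding -> hidden
Wh = rng.normal(0, 0.1, (H, H))             # copied hidden values -> hidden
Wo = rng.normal(0, 0.1, (H, V))             # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(word_id, h_prev):
    """Combine the current word with the copied hidden state; return the
    new hidden state and a distribution over the next word."""
    h = np.tanh(E[word_id] @ Wx + h_prev @ Wh)
    return h, softmax(h @ Wo)

h = np.ones(H)                              # "mystery layer" of all 1s
for word_id in [3, 17, 42]:                 # Word 1, Word 2, Word 3
    h, p_next = rnn_step(word_id, h)        # p_next is the prediction for the following word
```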


Training

[Figure: first training example: predict Word 2 from Word 1]

• Process first training example

• Update weights with back-propagation


Training

[Figure: second training example: predict Word 3 from Word 2, with the copied hidden layer values as additional input]

• Process second training example

• Update weights with back-propagation

• And so on...

• But: no feedback to previous history


Back-Propagation Through Time

[Figure: the network unrolled over three time steps, predicting Word 2 from Word 1, Word 3 from Word 2, and Word 4 from Word 3]

• After processing a few training examples, update through the unfolded recurrent neural network


Back-Propagation Through Time

• Carry out back-propagation through time (BPTT) after each training example

– 5 time steps seems to be sufficient

– network learns to store information for more than 5 time steps

• Or: update in mini-batches

– process 10-20 training examples

– update backwards through all examples

– removes need for multiple steps for each training example
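A sketch of how the updates are organized, reusing the toy rnn_step, E, Wx, Wh, Wo and H from the recurrent network sketch above: unroll the network over a short window, sum the per-word losses, and update all weights from the gradient of that summed loss. The gradient computation itself, whether hand-derived or by automatic differentiation, is left out here.

```python
def unrolled_loss(word_ids, h0):
    """Forward pass over a window of words; the summed cross-entropy is the
    objective that back-propagation through time differentiates."""
    h, loss = h0, 0.0
    for current, following in zip(word_ids[:-1], word_ids[1:]):
        h, p_next = rnn_step(current, h)
        loss += -np.log(p_next[following])
    return loss, h

# Truncated BPTT: move through the corpus in windows of ~5 words (or in
# mini-batches of 10-20 examples), compute the summed loss for the window,
# update E, Wx, Wh, Wo from its gradient, and carry the hidden state forward.
window = [3, 17, 42, 7, 99, 12]
loss, h_carry = unrolled_loss(window, np.ones(H))
```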


long short term memory


Vanishing Gradients

• Error is propagated to previous steps

• Updates consider

– prediction at that time step
– impact on future time steps

• Vanishing gradient: propagated error disappears


Recent vs. Early History

• Hidden layer plays double duty

– memory of the network
– continuous space representation used to predict output words

• Sometimes only recent context important

After much economic progress over the years, the country → has

• Sometimes much earlier context important

The country which has made much economic progress over the years still → has


Long Short Term Memory (LSTM)

• Design quite elaborate, although not very complicated to use

• Basic building block: LSTM cell

– similar to a node in a hidden layer
– but: has an explicit memory state

• Output and memory state changes depend on gates

– input gate: how much new input changes memory state
– forget gate: how much of prior memory state is retained
– output gate: how strongly memory state is passed on to next layer

• Gates can not just be open (1) and closed (0), but also slightly ajar (e.g., 0.2)


LSTM Cell

[Figure: LSTM cell with input, forget, and output gates; the input X and the previous memory m are combined (⊗, ⊕) into the new memory, which passes through the output gate to the hidden value h; the cell connects the preceding layer, the LSTM layer at time t−1, and the next layer Y]


LSTM Cell (Math)

• Memory and output values at time step t

memory^t = gate_{input} \times input^t + gate_{forget} \times memory^{t-1}

output^t = gate_{output} \times memory^t

• Hidden node value h^t passed on to the next layer applies an activation function f

h^t = f(output^t)

• Input computed as the input to a recurrent neural network node

– given node values for the prior layer \vec{x}^t = (x^t_1, ..., x^t_X)

– given values for the hidden layer from the previous time step \vec{h}^{t-1} = (h^{t-1}_1, ..., h^{t-1}_H)

– input value is a combination of matrix multiplication with weights w^x and w^h and an activation function g

input^t = g\Big( \sum_{i=1}^{X} w^x_i x^t_i + \sum_{i=1}^{H} w^h_i h^{t-1}_i \Big)


Values for Gates

• Gates are very important

• How do we compute their value?

→ with a neural network layer!

• For each gate a ∈ {input, forget, output}

– weight matrix W^{xa} to consider node values in the previous layer \vec{x}^t

– weight matrix W^{ha} to consider the hidden layer \vec{h}^{t-1} at the previous time step

– weight matrix W^{ma} to consider the memory \vec{memory}^{t-1} at the previous time step

– activation function h

gate_a = h\Big( \sum_{i=1}^{X} w^{xa}_i x^t_i + \sum_{i=1}^{H} w^{ha}_i h^{t-1}_i + \sum_{i=1}^{H} w^{ma}_i\, memory^{t-1}_i \Big)
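A numpy sketch of one time step of an LSTM layer that follows the formulas above; the layer sizes, the sigmoid for the gate activation h, and tanh for g and f are assumptions, not prescribed by the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
X, H = 30, 50                                    # size of preceding layer and of the LSTM layer

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per term in the formulas: from the preceding layer (x),
# from the previous hidden values (h), and, for the gates, from the previous memory (m).
Wx = {k: rng.normal(0, 0.1, (X, H)) for k in ("input", "i", "f", "o")}
Wh = {k: rng.normal(0, 0.1, (H, H)) for k in ("input", "i", "f", "o")}
Wm = {k: rng.normal(0, 0.1, (H, H)) for k in ("i", "f", "o")}

def lstm_step(x_t, h_prev, mem_prev):
    """One LSTM step: gates from x^t, h^{t-1} and memory^{t-1}; new memory as
    input-gate * input + forget-gate * old memory; hidden value h^t = f(output^t)."""
    gate_i = sigmoid(x_t @ Wx["i"] + h_prev @ Wh["i"] + mem_prev @ Wm["i"])
    gate_f = sigmoid(x_t @ Wx["f"] + h_prev @ Wh["f"] + mem_prev @ Wm["f"])
    gate_o = sigmoid(x_t @ Wx["o"] + h_prev @ Wh["o"] + mem_prev @ Wm["o"])
    inp    = np.tanh(x_t @ Wx["input"] + h_prev @ Wh["input"])   # input^t
    memory = gate_i * inp + gate_f * mem_prev                    # memory^t
    h      = np.tanh(gate_o * memory)                            # h^t = f(gate_output * memory^t)
    return h, memory

h, mem = np.zeros(H), np.zeros(H)
h, mem = lstm_step(rng.normal(size=X), h, mem)
```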


Training

• LSTMs are trained the same way as recurrent neural networks

• Back-propagation through time

• This looks all very complex, but:

– all the operations are still based on
∗ matrix multiplications
∗ differentiable activation functions

→ we can compute gradients of the objective function with respect to all parameters

→ we can compute update functions


What is the Point?

(from Tran, Bisazza, Monz, 2016)

• Each node has a memory memory_i independent from its current output h_i

• Memory may be carried through unchanged (gate^i_{input} = 0, gate^i_{forget} = 1)

⇒ can remember important features over long time span

(capture long distance dependencies)


Visualizing Individual Cells

Karpathy et al. (2015): "Visualizing and Understanding Recurrent Networks"


Visualizing Individual Cells


Gated Recurrent Unit (GRU)

[Figure: GRU cell with update gate and reset gate, connecting the preceding layer X, the GRU layer at time t−1, and the next layer Y]


Gated Recurrent Unit (Math)

• Two Gates

update^t = g(W_{update}\, input^t + U_{update}\, state^{t-1} + bias_{update})

reset^t = g(W_{reset}\, input^t + U_{reset}\, state^{t-1} + bias_{reset})

• Combination of input and previous state (similar to a traditional recurrent neural network)

combination^t = f(W\, input^t + U(reset^t \circ state^{t-1}))

• Interpolation with previous state

state^t = (1 - update^t) \circ state^{t-1} + update^t \circ combination^t + bias
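The same kind of sketch for a GRU step, directly following the equations above; sigmoid for g and tanh for f are assumed, and the final bias term on the state update is dropped for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
X, H = 30, 50                                    # input size and state size (arbitrary)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W_u, U_u, b_u = rng.normal(0, 0.1, (X, H)), rng.normal(0, 0.1, (H, H)), np.zeros(H)
W_r, U_r, b_r = rng.normal(0, 0.1, (X, H)), rng.normal(0, 0.1, (H, H)), np.zeros(H)
W_c, U_c      = rng.normal(0, 0.1, (X, H)), rng.normal(0, 0.1, (H, H))

def gru_step(x_t, state_prev):
    """One GRU step: two gates, a reset-gated combination of input and previous
    state, and an interpolation between the previous state and that combination."""
    update = sigmoid(x_t @ W_u + state_prev @ U_u + b_u)
    reset  = sigmoid(x_t @ W_r + state_prev @ U_r + b_r)
    combination = np.tanh(x_t @ W_c + (reset * state_prev) @ U_c)
    return (1.0 - update) * state_prev + update * combination

state = np.zeros(H)
state = gru_step(rng.normal(size=X), state)
```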


deeper models


Deep Learning?

[Figure: network with an input layer, a single hidden layer, and an output layer]

• Not much deep learning so far

• Between input and output prediction: only 1 hidden layer

• How about more hidden layers?


Deep Models

[Figure: three architectures labeled Shallow, Deep Stacked, and Deep Transition, each mapping an input through one or more hidden layers (Hidden Layer 1–3) to an output]
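A small numpy sketch of the "deep stacked" variant (number of layers and sizes are arbitrary): each layer receives the output of the layer below at the current time step plus its own hidden state from the previous time step. A deep transition network would instead chain several feed-forward layers inside a single recurrent step.

```python
import numpy as np

rng = np.random.default_rng(0)
d, H, L = 30, 50, 3                               # input size, hidden size, number of stacked layers

Wx = [rng.normal(0, 0.1, (d if l == 0 else H, H)) for l in range(L)]
Wh = [rng.normal(0, 0.1, (H, H)) for _ in range(L)]

def deep_stacked_step(x_t, h_prev):
    """One time step of a deep stacked recurrent network: layer l reads the
    output of layer l-1 at the same time step and its own previous state."""
    h_new, inp = [], x_t
    for l in range(L):
        h_l = np.tanh(inp @ Wx[l] + h_prev[l] @ Wh[l])
        h_new.append(h_l)
        inp = h_l                                  # feeds the next layer up
    return h_new

h = [np.zeros(H) for _ in range(L)]
h = deep_stacked_step(rng.normal(size=d), h)
```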


questions?
