Neural Machine Translation
Spring 2020 · 2020-03-12
CMPT 825: Natural Language Processing
Adapted from slides from Danqi Chen, Karthik Narasimhan, and Jetic Gu. (with some content from slides from Abigail See, Graham Neubig)
Course Logistics
‣ Project proposal is due today
‣ What problem are you addressing? Why is it interesting?
‣ What specific aspects will your project be on?
‣ Re-implement paper? Compare different methods?
‣ What data do you plan to use?
‣ What is your method?
‣ How do you plan to evaluate? What metrics?
Last time

• Statistical MT
• Word-based
• Phrase-based
• Syntactic
Neural Machine Translation
‣ A single neural network is used to translate from source to target
‣ Architecture: Encoder-Decoder
‣ Two main components:
‣ Encoder: Convert source sentence (input) into a vector/matrix
‣ Decoder: Convert encoding into a sentence in target language (output)
Sequence to Sequence learning (Seq2seq)
• Encode entire input sequence into a single vector (using an RNN)
• Decode one word at a time (again, using an RNN!)
• Beam search for better inference
• Learning is not trivial! (vanishing/exploding gradients)
(Sutskever et al., 2014)
Encoder

Sentence: This cat is cute

[Figure: an RNN encoder unrolled over the sentence. Each word is mapped to a word embedding x_1, …, x_4, which feeds the recurrent hidden states h_1, …, h_4 (starting from the initial state h_0). The final hidden state is taken as the encoded representation h^enc.]
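The encoder recurrence above can be sketched in a few lines. This is a minimal illustration assuming a vanilla (Elman) RNN cell with tanh activation and toy, randomly initialized weights; it is not the course's reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_hid = 8, 16                         # toy embedding / hidden sizes
W_xh = rng.normal(0, 0.1, (d_hid, d_emb))    # input-to-hidden weights
W_hh = rng.normal(0, 0.1, (d_hid, d_hid))    # hidden-to-hidden weights
b_h = np.zeros(d_hid)

def encode(embeddings):
    """Run a vanilla RNN over word embeddings; return the final state h_enc."""
    h = np.zeros(d_hid)                      # h_0
    for x in embeddings:                     # x_1 ... x_4 for "This cat is cute"
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
    return h                                 # h_enc, the encoded representation

sentence = [rng.normal(size=d_emb) for _ in range(4)]  # stand-ins for embeddings
h_enc = encode(sentence)
print(h_enc.shape)  # (16,)
```

In practice the cell would be an LSTM or GRU to mitigate the vanishing/exploding-gradient issue noted above.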
Decoder

‣ A conditioned language model

[Figure: an RNN decoder initialized with the encoder representation h^enc. Starting from <s>, each step embeds the previous word, updates the hidden state z_t, and emits an output distribution o over the next word, producing "ce chat est mignon" and finally <e>. In later animation frames, the model's own predictions y_1, y_2, … are fed back as inputs for the following steps.]
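The conditioned-language-model view of the decoder can be sketched as a greedy generation loop. Everything here (the toy vocabulary, dimensions, and random weights) is hypothetical; the point is only the shape of the computation: initialize the state with h^enc, then repeatedly embed the previous word, update the state, and pick the most likely next word until the end token.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["<s>", "<e>", "ce", "chat", "est", "mignon"]  # toy target vocabulary
d = 16
E = rng.normal(0, 0.1, (len(vocab), d))      # word embeddings
W_xz = rng.normal(0, 0.1, (d, d))            # input-to-state weights
W_zz = rng.normal(0, 0.1, (d, d))            # state-to-state weights
W_out = rng.normal(0, 0.1, (len(vocab), d))  # output projection

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def greedy_decode(h_enc, max_len=10):
    """Conditioned LM: start from h_enc and <s>, emit one word per step."""
    z, word, out = h_enc, "<s>", []
    for _ in range(max_len):
        z = np.tanh(W_xz @ E[vocab.index(word)] + W_zz @ z)  # update state
        word = vocab[int(np.argmax(softmax(W_out @ z)))]     # pick likeliest word
        if word == "<e>":
            break
        out.append(word)
    return out

print(greedy_decode(rng.normal(size=d)))
```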
Seq2seq training

‣ Similar to training a language model!
‣ Minimize cross-entropy loss: −Σ_{t=1}^{T} log P(y_t | y_1, …, y_{t−1}, x_1, …, x_n)
‣ Back-propagate gradients through both decoder and encoder
‣ Need a really big corpus
English: Machine translation is cool!
Russian: Машинный перевод - это круто!
(36M sentence pairs)
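Given per-step output distributions and the gold target ids, the cross-entropy loss is just a sum of negative log-probabilities of the correct words, with the gold prefix fed to the decoder at each step (teacher forcing). A minimal sketch with made-up numbers:

```python
import numpy as np

def seq_nll(step_probs, target_ids):
    """Cross-entropy loss: -sum_t log P(y_t | y_<t, x), with the gold prefix
    fed to the decoder at each step (teacher forcing)."""
    return -sum(np.log(p[y]) for p, y in zip(step_probs, target_ids))

# three decoding steps over a 4-word toy vocabulary
probs = [np.array([0.70, 0.10, 0.10, 0.10]),
         np.array([0.10, 0.80, 0.05, 0.05]),
         np.array([0.25, 0.25, 0.25, 0.25])]
loss = seq_nll(probs, target_ids=[0, 1, 2])  # gold words: ids 0, 1, 2
print(round(float(loss), 3))  # 1.966
```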
Seq2seq training
(slide credit: Abigail See)
Remember masking

Use masking to help compute the loss for batched sequences; each row below is one sequence in the batch (1 = real token, 0 = padding):

1 1 1 1 0 0
1 0 0 0 0 0
1 1 1 1 1 1
1 1 1 0 0 0
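A minimal sketch of masked loss computation, using the mask rows from this slide (rows are batch items of lengths 4, 1, 6, and 3; the per-token losses are stand-in values):

```python
import numpy as np

# per-token losses for a batch of 4 sequences padded to length 6 (stand-in values)
token_loss = np.ones((4, 6))

# one row per sequence: 1 marks a real token, 0 marks padding
mask = np.array([[1, 1, 1, 1, 0, 0],
                 [1, 0, 0, 0, 0, 0],
                 [1, 1, 1, 1, 1, 1],
                 [1, 1, 1, 0, 0, 0]], dtype=float)

masked = token_loss * mask            # zero out the loss at padding positions
loss = masked.sum() / mask.sum()      # average over the 14 real tokens only
print(loss)  # 1.0, since every real token contributes loss 1 here
```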
Scheduled Sampling

(figure credit: Bengio et al., 2015)

Possible decay schedules (the probability of using the true y decays over training)
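One of the decay schedules from Bengio et al. (2015) is inverse-sigmoid decay, ε_i = k / (k + exp(i/k)): early in training the true previous word is used almost always, and later the model's own samples take over. A sketch (the constant k = 100 is an arbitrary choice here):

```python
import math
import random

def teacher_forcing_prob(step, k=100.0):
    """Inverse-sigmoid decay: eps = k / (k + exp(step / k))."""
    return k / (k + math.exp(step / k))

def pick_input(gold_word, model_word, step, rng=random):
    """With probability eps feed the true previous word, else the model's own."""
    return gold_word if rng.random() < teacher_forcing_prob(step) else model_word
```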
How seq2seq changed the MT landscape
MT Progress
(source: Rico Sennrich)
(Wu et al., 2016)
NMT vs SMT

Pros
‣ Better performance
‣ Fluency
‣ Longer context
‣ Single NN optimized end-to-end
‣ Less engineering
‣ Works out of the box for many language pairs

Cons
‣ Requires more data and compute
‣ Less interpretable
‣ Hard to debug
‣ Uncontrollable
‣ Heavily dependent on data, which could lead to unwanted biases
‣ More parameters
Seq2Seq for more than NMT

Task/Application: Input → Output
Machine Translation: French → English
Summarization: Document → Short summary
Dialogue: Utterance → Response
Parsing: Sentence → Parse tree (as sequence)
Question Answering: Context + Question → Answer
Cross-Modal Seq2Seq

Task/Application: Input → Output
Speech Recognition: Speech signal → Transcript
Image Captioning: Image → Text
Video Captioning: Video → Text
Issues with vanilla seq2seq

‣ A single encoding vector, h^enc, needs to capture all the information about the source sentence (bottleneck)
‣ Longer sequences can lead to vanishing gradients
‣ Overfitting
Remember alignments?
Attention

‣ The neural MT equivalent of alignment models
‣ Key idea: at each time step during decoding, focus on a particular part of the source sentence
‣ This depends on the decoder's current hidden state (i.e. the notion of what you are trying to decode)
‣ Usually implemented as a probability distribution over the hidden states of the encoder (h_i^enc)
![Page 28: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/28.jpg)
Seq2seq with attention
(slide credit: Abigail See)
‣ Can also use the predicted y_1 as input for the next time step
Computing attention

‣ Encoder hidden states: h_1^enc, …, h_n^enc
‣ Decoder hidden state at time t: h_t^dec
‣ First, get attention scores for this time step (we will see what g is soon!): e^t = [g(h_1^enc, h_t^dec), …, g(h_n^enc, h_t^dec)]
‣ Obtain the attention distribution using softmax: α^t = softmax(e^t) ∈ ℝ^n
‣ Compute the weighted sum of encoder hidden states: a_t = Σ_{i=1}^{n} α_i^t h_i^enc ∈ ℝ^h
‣ Finally, concatenate with the decoder state and pass on to the output layer: [a_t; h_t^dec] ∈ ℝ^{2h}
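The steps above can be sketched directly, here using the simple dot product as the scoring function g (other scoring functions are discussed below); the shapes and random inputs are toy stand-ins:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def attend(H_enc, h_dec):
    """Attend over encoder states H_enc (n x h) given decoder state h_dec (h,)."""
    e = H_enc @ h_dec                  # scores e^t, one per source position
    alpha = softmax(e)                 # attention distribution, in R^n
    a = alpha @ H_enc                  # weighted sum of encoder states, in R^h
    return np.concatenate([a, h_dec])  # [a_t ; h_t^dec], in R^{2h}

rng = np.random.default_rng(0)
out = attend(rng.normal(size=(5, 8)), rng.normal(size=8))
print(out.shape)  # (16,)
```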
Types of attention

‣ Assume encoder hidden states h_1, h_2, …, h_n and decoder hidden state z

1. Dot-product attention (assumes equal dimensions for h_i and z): e_i = g(h_i, z) = z^T h_i ∈ ℝ
2. Multiplicative attention: g(h_i, z) = z^T W h_i ∈ ℝ, where W is a weight matrix
3. Additive attention: g(h_i, z) = v^T tanh(W_1 h_i + W_2 z) ∈ ℝ, where W_1, W_2 are weight matrices and v is a weight vector
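The three scoring functions differ only in how the scalar score is computed from h_i and z; in this sketch W, W_1, W_2, and v are random stand-in parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
h_dim = 6
h_i = rng.normal(size=h_dim)            # one encoder hidden state
z = rng.normal(size=h_dim)              # decoder hidden state
W = rng.normal(size=(h_dim, h_dim))     # multiplicative weight matrix
W1 = rng.normal(size=(h_dim, h_dim))    # additive weight matrices ...
W2 = rng.normal(size=(h_dim, h_dim))
v = rng.normal(size=h_dim)              # ... and weight vector

dot_score = z @ h_i                          # 1. dot-product
mult_score = z @ W @ h_i                     # 2. multiplicative (bilinear)
add_score = v @ np.tanh(W1 @ h_i + W2 @ z)   # 3. additive
```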
Dropout

‣ Form of regularization for RNNs (and any NN in general)
‣ Idea: "handicap" the NN by removing hidden units stochastically
‣ set each hidden unit in a layer to 0 with probability p during training (p = 0.5 usually works well)
‣ scale outputs by 1/(1 − p)
‣ hidden units are forced to learn more general patterns
‣ Test time: use all activations (no need to rescale)
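A sketch of "inverted" dropout as described above: units are zeroed with probability p and survivors are scaled by 1/(1 − p) at training time, so test time needs no rescaling:

```python
import numpy as np

def dropout(h, p=0.5, train=True, rng=None):
    """Inverted dropout: zero each unit with probability p during training and
    scale survivors by 1/(1 - p), so no rescaling is needed at test time."""
    if not train:
        return h                              # test time: use all activations
    rng = rng or np.random.default_rng(0)
    keep = (rng.random(h.shape) >= p).astype(h.dtype)
    return h * keep / (1.0 - p)               # scaling keeps E[output] = h

h = np.ones(1000)
train_out = dropout(h, p=0.5, train=True)
test_out = dropout(h, p=0.5, train=False)
```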
Handling large vocabularies

‣ Softmax can be expensive for large vocabularies: P(y_i) = exp(w_i · h + b_i) / Σ_{j=1}^{|V|} exp(w_j · h + b_j)
‣ English vocabulary size: 10K to 100K
‣ The denominator, a sum over the whole vocabulary, is expensive to compute
Approximate Softmax
‣ Negative Sampling
‣ Structured softmax
‣ Embedding prediction
Negative Sampling
• Softmax is expensive when vocabulary size is large
(figure credit: Graham Neubig)
Negative Sampling

• Sample just a subset of the vocabulary as negative examples
• Saw simple negative sampling in word2vec (Mikolov et al., 2013)
• Other ways to sample: Importance Sampling (Bengio and Senecal 2003), Noise Contrastive Estimation (Mnih and Teh 2012)

(figure credit: Graham Neubig)
Hierarchical softmax (Morin and Bengio 2005)

(figure credit: Quora)
Class-based softmax

‣ Two-layer: cluster words into classes; predict the class, then predict the word within the class. (Goodman 2001, Mikolov et al. 2011)
‣ Clusters can be based on frequency, random assignment, or word contexts.

(figure credit: Graham Neubig)
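The two-layer factorization P(w | h) = P(class(w) | h) · P(w | class(w), h) replaces one softmax over |V| words with two much smaller ones. A toy sketch with 6 words in 2 classes and random stand-in weights:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# toy setup: 6 words partitioned into 2 classes, hidden state h of size 8
classes = {0: [0, 1, 2], 1: [3, 4, 5]}       # word ids belonging to each class
rng = np.random.default_rng(0)
h = rng.normal(size=8)
W_class = rng.normal(size=(2, 8))            # class softmax: O(#classes)
W_word = {c: rng.normal(size=(len(ws), 8)) for c, ws in classes.items()}

def word_prob(word_id, h):
    """P(w | h) = P(class(w) | h) * P(w | class(w), h): two small softmaxes
    instead of one softmax over the whole vocabulary."""
    c = 0 if word_id in classes[0] else 1
    p_class = softmax(W_class @ h)[c]
    p_word = softmax(W_word[c] @ h)[classes[c].index(word_id)]
    return p_class * p_word

total = sum(word_prob(w, h) for w in range(6))
print(round(float(total), 6))  # 1.0 (the factorization is a proper distribution)
```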
Embedding prediction

‣ Directly predict the embeddings of the outputs themselves (Kumar and Tsvetkov 2019)
‣ What loss to use?
‣ L2? Cosine?
‣ von Mises-Fisher distribution loss: make embeddings close on the unit ball

(slide credit: Graham Neubig)
Generation

How can we use our model (decoder) to generate sentences?

• Sampling: generate a random sentence according to the probability distribution
• Argmax: generate the best sentence, i.e. the sentence with the highest probability
Decoding Strategies
‣ Ancestral sampling
‣ Greedy decoding
‣ Exhaustive search
‣ Beam search
Ancestral Sampling
• Randomly sample words one by one
• Provides diverse output (high variance)
(figure credit: Luong, Cho, and Manning)
Greedy decoding
‣ Compute argmax at every step of decoder to generate word
‣ What’s wrong?
Exhaustive search?

‣ Find arg max_{y_1, …, y_T} P(y_1, …, y_T | x_1, …, x_n)
‣ Requires computing all possible sequences
‣ O(V^T) complexity!
‣ Too expensive
Recall: Beam search (a middle ground)

‣ Key idea: at every step, keep track of the k most probable partial translations (hypotheses)
‣ Score of each hypothesis = its log probability: Σ_{t=1}^{j} log P(y_t | y_1, …, y_{t−1}, x_1, …, x_n)
‣ Not guaranteed to be optimal
‣ More efficient than exhaustive search
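A minimal beam search sketch over a toy model (the fixed next-word distribution is made up; a real decoder would condition on the prefix and the source sentence). Finished hypotheses, those ending in the end token, are kept but not expanded; in practice their scores would also be length-normalized before the final comparison:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def beam_search(step_logprobs, k=2, max_len=3, eos=0):
    """Keep the k highest-scoring partial hypotheses; score = sum of log-probs."""
    beams = [([], 0.0)]                        # (tokens so far, log-prob so far)
    for _ in range(max_len):
        candidates = []
        for toks, score in beams:
            if toks and toks[-1] == eos:       # finished: keep, but stop expanding
                candidates.append((toks, score))
                continue
            for w, lp in enumerate(step_logprobs(toks)):
                candidates.append((toks + [w], score + lp))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]
    return beams

# toy "model" over vocabulary {0: <eos>, 1, 2} with a fixed distribution
def toy_model(prefix):
    return np.log(softmax(np.array([0.2, 1.0, 0.5])))

for toks, score in beam_search(toy_model, k=2):
    print(toks, round(float(score), 3))
```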
Beam decoding
(slide credit: Abigail See)
Backtrack
(slide credit: Abigail See)
Beam decoding

‣ Different hypotheses may produce the ⟨eos⟩ (end) token at different time steps
‣ When a hypothesis produces ⟨eos⟩, stop expanding it and place it aside
‣ Continue beam search until:
‣ all k hypotheses produce ⟨eos⟩, OR
‣ we hit the max decoding limit T
‣ Select top hypotheses using the normalized likelihood score: (1/T) Σ_{t=1}^{T} log P(y_t | y_1, …, y_{t−1}, x_1, …, x_n)
‣ Otherwise shorter hypotheses would have higher scores
Under-translation

• Crucial information is left untranslated; premature generation of <EOS>
• Approaches: beam search, ensembles, coverage models
• Out-of-Vocabulary (OOV) problem; rare word problem
• Frequent words are translated correctly, rare words are not
• In any corpus, word frequencies are exponentially unbalanced
• Under-translation
• Crucial information is left untranslated; premature generation of <EOS>

[Figure: an RNN decoder translating "nobody expects the spanish inquisition !". At the step where "inquisition" should be produced, the model assigns 'inquisition': 0.49 vs. <EOS>: 0.51, so it emits <EOS> prematurely and under-translates.]
Ensemble

• Similar to a voting mechanism, but with probabilities
• Multiple models with different parameters (usually from different checkpoints of the same training run)
• Use the output with the highest probability across all models
Ensemble demo

[Figure: three checkpoints of the same model (seq2seq_E047.pt, seq2seq_E048.pt, seq2seq_E049.pt) each decode "prof loves ___". Their next-word distributions are:
cp1: 'kebab': 0.9, 'cheese': 0.05, 'pizza': 0.05, …
cp2: 'kebab': 0.3, 'cheese': 0.55, 'pizza': 0.05, …
cp3: 'kebab': 0.3, 'cheese': 0.05, 'pizza': 0.55, …
Individually the checkpoints disagree (kebab / cheese / pizza), but averaging the three distributions gives 'kebab' the highest probability (0.5), so the ensemble outputs "prof loves kebab".]
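The demo's averaging step can be reproduced directly. The three distributions below are the ones shown on the slide (each lists only three words; the remaining probability mass is elided there):

```python
import numpy as np

# next-word distributions from the three checkpoints (as shown on the slide)
cp1 = {"kebab": 0.90, "cheese": 0.05, "pizza": 0.05}
cp2 = {"kebab": 0.30, "cheese": 0.55, "pizza": 0.05}
cp3 = {"kebab": 0.30, "cheese": 0.05, "pizza": 0.55}

words = ["kebab", "cheese", "pizza"]
avg = {w: float(np.mean([cp[w] for cp in (cp1, cp2, cp3)])) for w in words}
best = max(avg, key=avg.get)
print(best)  # kebab: averaging favors the word every checkpoint gives some mass
```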
Existing challenges with NMT
‣ Out-of-vocabulary words
‣ Low-resource languages
‣ Long-term context
‣ Common sense knowledge (e.g. hot dog, paper jam)
‣ Fairness and bias
‣ Uninterpretable
Out-of-vocabulary words
• Out-of-Vocabulary (OOV) problem, also called the rare word problem
• Frequent words are translated correctly; rare words are not
• In any corpus, word frequencies are exponentially unbalanced
![Page 68: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/68.jpg)
OOV
1. https://en.wikipedia.org/wiki/Zipf%27s_law
Zipf’s Law
[Plot: occurrence count (log scale, 1 to 1,000,000) against frequency rank (1 to 50,000).]
• rare words are exponentially less frequent than frequent words
• e.g. in HW4, 45% (source) / 65% (target) of the unique words occur only once
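The HW4 statistic above is easy to compute for any corpus. A minimal sketch (the function name and toy corpus are hypothetical) counts the fraction of unique words that are hapax legomena:

```python
from collections import Counter

def hapax_fraction(tokens):
    """Fraction of unique words that occur exactly once (hapax legomena)."""
    counts = Counter(tokens)
    return sum(1 for c in counts.values() if c == 1) / len(counts)

# Toy corpus: 4 of its 7 unique words occur only once
corpus = "the prof loves kebab the prof loves cheese and pizza".split()
frac = hapax_fraction(corpus)  # 4/7
```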
![Page 69: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/69.jpg)
Out-of-vocabulary words
• Copy mechanisms
• Subword / character models
![Page 70: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/70.jpg)
Copy Mechanism (Concept)
[Diagram: a seq2seq RNN with attention decoding "<SOS> prof loves cheese" into "prof loves cheese <EOS>".]
1. CL2015015 [Luong et al.] Addressing the Rare Word Problem in Neural Machine Translation
![Page 71: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/71.jpg)
Copy Mechanism (Concept)
[Diagram: the decoder emits <UNK> while attending over the source "prof liebt kebab" with weights αt,1 = 0.1, αt,2 = 0.2, αt,3 = 0.7; the most-attended source word is "kebab".]
1. CL2015015 [Luong et al.] Addressing the Rare Word Problem in Neural Machine Translation
![Page 73: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/73.jpg)
Copy Mechanism (Concept)
[Diagram: source "prof liebt kebab"; the decoder emits <UNK> with attention weights αt,1 = 0.1, αt,2 = 0.2, αt,3 = 0.7, so "kebab" is copied.]
• The decoder sees <UNK> at step t
• look at the attention weights αt
• replace <UNK> with the source word f_{argmax_i α_{t,i}}
1. CL2015015 [Luong et al.] Addressing the Rare Word Problem in Neural Machine Translation
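The replacement rule above can be sketched as a post-processing step over the decoder output (the function name is hypothetical; the attention matrix is a toy example):

```python
import numpy as np

def replace_unks(target_tokens, attention, source_tokens):
    """Copy step: replace each <UNK> the decoder emitted with the
    source word that received the most attention at that step.
    attention[t, i] is the weight on source word i at decoder step t."""
    return [
        source_tokens[int(np.argmax(attention[t]))] if tok == "<UNK>" else tok
        for t, tok in enumerate(target_tokens)
    ]

src = ["prof", "liebt", "kebab"]
out = ["prof", "loves", "<UNK>"]
att = np.array([[0.8, 0.1, 0.1],
                [0.1, 0.8, 0.1],
                [0.1, 0.2, 0.7]])  # the <UNK> step attends mostly to "kebab"
fixed = replace_unks(out, att, src)  # -> ['prof', 'loves', 'kebab']
```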
![Page 74: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/74.jpg)
*Pointer-Generator Copy Mechanism
[Diagram: the same seq2seq setup, with a learned gate pg inserted between the decoder output and the copy mechanism.]
• a binary classifier acts as a switch:
• either use the decoder output,
• or use the copy mechanism
1. CL2016032 [Gulcehre et al.] Pointing the Unknown Words
![Page 75: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/75.jpg)
*Pointer-Generator Copy Mechanism
• The decoder sees <UNK> at step t, or the gate pg([h_t^dec; context_t]) ≤ 0.5
• look at the attention weights αt
• replace <UNK> with the source word f_{argmax_i α_{t,i}}
1. CL2016032 [Gulcehre et al.] Pointing the Unknown Words
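A minimal sketch of one decoding step with such a gate, assuming a simple linear classifier over [h_dec; context] (all names and numbers are hypothetical, not the paper's exact parameterization):

```python
import numpy as np

def pointer_generator_step(h_dec, context, w, b, vocab_dist, vocab,
                           attention, source_tokens):
    """pg = sigmoid(w . [h_dec; context] + b). If pg > 0.5, emit the most
    likely vocabulary word; otherwise copy the most-attended source word."""
    pg = 1.0 / (1.0 + np.exp(-(w @ np.concatenate([h_dec, context]) + b)))
    if pg > 0.5:
        return vocab[int(np.argmax(vocab_dist))]
    return source_tokens[int(np.argmax(attention))]

# Toy numbers: the gate comes out below 0.5, so the model copies "kebab"
word = pointer_generator_step(
    h_dec=np.array([1.0, -1.0]),
    context=np.array([0.5, 0.5]),
    w=np.array([-1.0, -1.0, -1.0, -1.0]), b=0.0,
    vocab_dist=np.array([0.7, 0.1, 0.1, 0.1]),
    vocab=["<UNK>", "prof", "loves", "kebab"],
    attention=np.array([0.1, 0.2, 0.7]),
    source_tokens=["prof", "liebt", "kebab"],
)
```

Unlike the post-hoc <UNK> replacement, the gate is trained jointly with the model, so it can decide to copy even when the decoder would not have emitted <UNK>.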
![Page 76: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/76.jpg)
*Pointer-Based Dictionary Fusion
• The decoder sees <UNK> at step t, or the gate pg([h_t^dec; context_t]) ≤ 0.5
• look at the attention weights αt
• replace <UNK> with a dictionary translation of the source word: dict(f_{argmax_i α_{t,i}})
1. CL2019331 [Gū et al.] Pointer-based Fusion of Bilingual Lexicons into Neural Machine Translation
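The change relative to the plain copy mechanism is small: instead of copying the source word verbatim, look it up in a bilingual lexicon. A sketch (hypothetical function name and toy lexicon; fallback behavior is an assumption):

```python
def copy_with_dictionary(source_tokens, attention, lexicon):
    """Dictionary-fusion copy step: translate the most-attended source
    word via a bilingual lexicon; fall back to a verbatim copy if the
    word is not in the lexicon."""
    i = max(range(len(attention)), key=lambda k: attention[k])
    src_word = source_tokens[i]
    return lexicon.get(src_word, src_word)

lexicon = {"liebt": "loves"}  # toy German-English dictionary
src = ["prof", "liebt", "kebab"]
in_dict = copy_with_dictionary(src, [0.1, 0.7, 0.2], lexicon)   # -> 'loves'
fallback = copy_with_dictionary(src, [0.1, 0.2, 0.7], lexicon)  # -> 'kebab'
```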
![Page 77: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/77.jpg)
Out-of-vocabulary words
• Copy mechanisms
• Subword / character models
![Page 78: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/78.jpg)
OOV
![Page 83: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/83.jpg)
Subword / character models
• Character-based seq2seq models
• Byte pair encoding (BPE)
• Subword units help model morphology
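The BPE learning procedure fits in a short sketch (after Sennrich et al.'s subword algorithm; identifiers and the toy word list are hypothetical): start from characters plus an end-of-word marker, then greedily merge the most frequent adjacent symbol pair.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge operations: start from characters (plus an
    end-of-word marker) and repeatedly merge the most frequent
    adjacent symbol pair across the whole vocabulary."""
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

merges = learn_bpe(["low", "low", "lower", "newest", "newest"], 3)
# The first merges build up the frequent subword "low"
```

Frequent words end up as single symbols while rare words decompose into smaller, reusable subword units, which is exactly what makes BPE effective against the OOV problem.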
![Page 84: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/84.jpg)
Existing challenges with NMT
‣ Out-of-vocabulary words
‣ Low-resource languages
‣ Long-term context
‣ Common sense knowledge (e.g. hot dog, paper jam)
‣ Fairness and bias
‣ Uninterpretable
![Page 85: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/85.jpg)
Massively multilingual MT
(Arivazhagan et al., 2019)
‣ Train a single neural network on 103 languages paired with English (remember Interlingua?)
‣ Massive improvements on low-resource languages
![Page 87: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/87.jpg)
Next time
‣ Contextualized embeddings
‣ ELMO, BERT, and friends
‣ Transformers
‣ Multi-headed self-attention