Neural Machine Translation
Spring 2020 · 2020-03-12
CMPT 825: Natural Language Processing
Adapted from slides from Danqi Chen, Karthik Narasimhan, and Jetic Gu. (with some content from slides from Abigail See, Graham Neubig)
Course Logistics
‣ Project proposal is due today
‣ What problem are you addressing? Why is it interesting?
‣ What specific aspects will your project be on?
‣ Re-implement paper? Compare different methods?
‣ What data do you plan to use?
‣ What is your method?
‣ How do you plan to evaluate? What metrics?
Last time

• Statistical MT
• Word-based
• Phrase-based
• Syntactic
Neural Machine Translation
‣ A single neural network is used to translate from source to target
‣ Architecture: Encoder-Decoder
‣ Two main components:
‣ Encoder: Convert source sentence (input) into a vector/matrix
‣ Decoder: Convert encoding into a sentence in target language (output)
Sequence to Sequence learning (Seq2seq)
• Encode entire input sequence into a single vector (using an RNN)
• Decode one word at a time (again, using an RNN!)
• Beam search for better inference
• Learning is not trivial! (vanishing/exploding gradients)
(Sutskever et al., 2014)
Encoder

Sentence: This cat is cute

[Figure: an RNN encoder unrolled over the sentence. Each word is mapped to a word embedding x_1, …, x_4, which feeds the recurrent hidden states h_1, …, h_4 (starting from the initial state h_0). The final hidden state is taken as the encoded representation h^enc.]
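The encoder recurrence above can be sketched in a few lines. This is a minimal illustration assuming a vanilla (Elman) RNN cell with tanh activation and toy, randomly initialized weights; it is not the course's reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_hid = 8, 16                         # toy embedding / hidden sizes
W_xh = rng.normal(0, 0.1, (d_hid, d_emb))    # input-to-hidden weights
W_hh = rng.normal(0, 0.1, (d_hid, d_hid))    # hidden-to-hidden weights
b_h = np.zeros(d_hid)

def encode(embeddings):
    """Run a vanilla RNN over word embeddings; return the final state h_enc."""
    h = np.zeros(d_hid)                      # h_0
    for x in embeddings:                     # x_1 ... x_4 for "This cat is cute"
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
    return h                                 # h_enc, the encoded representation

sentence = [rng.normal(size=d_emb) for _ in range(4)]  # stand-ins for embeddings
h_enc = encode(sentence)
print(h_enc.shape)  # (16,)
```

In practice the cell would be an LSTM or GRU to mitigate the vanishing/exploding-gradient issue noted above.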
Decoder

‣ A conditioned language model

[Figure: an RNN decoder initialized with the encoder representation h^enc. Starting from <s>, each step embeds the previous word, updates the hidden state z_t, and emits an output distribution o over the next word, producing "ce chat est mignon" and finally <e>. In later animation frames, the model's own predictions y_1, y_2, … are fed back as inputs for the following steps.]
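The conditioned-language-model view of the decoder can be sketched as a greedy generation loop. Everything here (the toy vocabulary, dimensions, and random weights) is hypothetical; the point is only the shape of the computation: initialize the state with h^enc, then repeatedly embed the previous word, update the state, and pick the most likely next word until the end token.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["<s>", "<e>", "ce", "chat", "est", "mignon"]  # toy target vocabulary
d = 16
E = rng.normal(0, 0.1, (len(vocab), d))      # word embeddings
W_xz = rng.normal(0, 0.1, (d, d))            # input-to-state weights
W_zz = rng.normal(0, 0.1, (d, d))            # state-to-state weights
W_out = rng.normal(0, 0.1, (len(vocab), d))  # output projection

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def greedy_decode(h_enc, max_len=10):
    """Conditioned LM: start from h_enc and <s>, emit one word per step."""
    z, word, out = h_enc, "<s>", []
    for _ in range(max_len):
        z = np.tanh(W_xz @ E[vocab.index(word)] + W_zz @ z)  # update state
        word = vocab[int(np.argmax(softmax(W_out @ z)))]     # pick likeliest word
        if word == "<e>":
            break
        out.append(word)
    return out

print(greedy_decode(rng.normal(size=d)))
```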
Seq2seq training

‣ Similar to training a language model!
‣ Minimize cross-entropy loss: −Σ_{t=1}^{T} log P(y_t | y_1, …, y_{t−1}, x_1, …, x_n)
‣ Back-propagate gradients through both decoder and encoder
‣ Need a really big corpus
English: Machine translation is cool!
Russian: Машинный перевод - это круто!
(36M sentence pairs)
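Given per-step output distributions and the gold target ids, the cross-entropy loss is just a sum of negative log-probabilities of the correct words, with the gold prefix fed to the decoder at each step (teacher forcing). A minimal sketch with made-up numbers:

```python
import numpy as np

def seq_nll(step_probs, target_ids):
    """Cross-entropy loss: -sum_t log P(y_t | y_<t, x), with the gold prefix
    fed to the decoder at each step (teacher forcing)."""
    return -sum(np.log(p[y]) for p, y in zip(step_probs, target_ids))

# three decoding steps over a 4-word toy vocabulary
probs = [np.array([0.70, 0.10, 0.10, 0.10]),
         np.array([0.10, 0.80, 0.05, 0.05]),
         np.array([0.25, 0.25, 0.25, 0.25])]
loss = seq_nll(probs, target_ids=[0, 1, 2])  # gold words: ids 0, 1, 2
print(round(float(loss), 3))  # 1.966
```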
Seq2seq training
(slide credit: Abigail See)
Remember masking

Use masking to help compute the loss for batched sequences; each row below is one sequence in the batch (1 = real token, 0 = padding):

1 1 1 1 0 0
1 0 0 0 0 0
1 1 1 1 1 1
1 1 1 0 0 0
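A minimal sketch of masked loss computation, using the mask rows from this slide (rows are batch items of lengths 4, 1, 6, and 3; the per-token losses are stand-in values):

```python
import numpy as np

# per-token losses for a batch of 4 sequences padded to length 6 (stand-in values)
token_loss = np.ones((4, 6))

# one row per sequence: 1 marks a real token, 0 marks padding
mask = np.array([[1, 1, 1, 1, 0, 0],
                 [1, 0, 0, 0, 0, 0],
                 [1, 1, 1, 1, 1, 1],
                 [1, 1, 1, 0, 0, 0]], dtype=float)

masked = token_loss * mask            # zero out the loss at padding positions
loss = masked.sum() / mask.sum()      # average over the 14 real tokens only
print(loss)  # 1.0, since every real token contributes loss 1 here
```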
Scheduled Sampling

(figure credit: Bengio et al., 2015)

Possible decay schedules (the probability of using the true y decays over training)
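One of the decay schedules from Bengio et al. (2015) is inverse-sigmoid decay, ε_i = k / (k + exp(i/k)): early in training the true previous word is used almost always, and later the model's own samples take over. A sketch (the constant k = 100 is an arbitrary choice here):

```python
import math
import random

def teacher_forcing_prob(step, k=100.0):
    """Inverse-sigmoid decay: eps = k / (k + exp(step / k))."""
    return k / (k + math.exp(step / k))

def pick_input(gold_word, model_word, step, rng=random):
    """With probability eps feed the true previous word, else the model's own."""
    return gold_word if rng.random() < teacher_forcing_prob(step) else model_word
```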
How seq2seq changed the MT landscape
MT Progress
(source: Rico Sennrich)
(Wu et al., 2016)
NMT vs SMT

Pros
‣ Better performance
‣ Fluency
‣ Longer context
‣ Single NN optimized end-to-end
‣ Less engineering
‣ Works out of the box for many language pairs

Cons
‣ Requires more data and compute
‣ Less interpretable
‣ Hard to debug
‣ Uncontrollable
‣ Heavily dependent on data, which could lead to unwanted biases
‣ More parameters
Seq2Seq for more than NMT

Task/Application: Input → Output
Machine Translation: French → English
Summarization: Document → Short summary
Dialogue: Utterance → Response
Parsing: Sentence → Parse tree (as sequence)
Question Answering: Context + Question → Answer
Cross-Modal Seq2Seq

Task/Application: Input → Output
Speech Recognition: Speech signal → Transcript
Image Captioning: Image → Text
Video Captioning: Video → Text
Issues with vanilla seq2seq

‣ A single encoding vector, h^enc, needs to capture all the information about the source sentence (bottleneck)
‣ Longer sequences can lead to vanishing gradients
‣ Overfitting
Remember alignments?
Attention

‣ The neural MT equivalent of alignment models
‣ Key idea: at each time step during decoding, focus on a particular part of the source sentence
‣ This depends on the decoder's current hidden state (i.e. the notion of what you are trying to decode)
‣ Usually implemented as a probability distribution over the hidden states of the encoder (h_i^enc)
![Page 28: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/28.jpg)
Seq2seq with attention
(slide credit: Abigail See)
‣ Can also use the predicted y_1 as input for the next time step
Computing attention

‣ Encoder hidden states: h_1^enc, …, h_n^enc
‣ Decoder hidden state at time t: h_t^dec
‣ First, get attention scores for this time step (we will see what g is soon!): e^t = [g(h_1^enc, h_t^dec), …, g(h_n^enc, h_t^dec)]
‣ Obtain the attention distribution using softmax: α^t = softmax(e^t) ∈ ℝ^n
‣ Compute the weighted sum of encoder hidden states: a_t = Σ_{i=1}^{n} α_i^t h_i^enc ∈ ℝ^h
‣ Finally, concatenate with the decoder state and pass on to the output layer: [a_t; h_t^dec] ∈ ℝ^{2h}
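The steps above can be sketched directly, here using the simple dot product as the scoring function g (other scoring functions are discussed below); the shapes and random inputs are toy stand-ins:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def attend(H_enc, h_dec):
    """Attend over encoder states H_enc (n x h) given decoder state h_dec (h,)."""
    e = H_enc @ h_dec                  # scores e^t, one per source position
    alpha = softmax(e)                 # attention distribution, in R^n
    a = alpha @ H_enc                  # weighted sum of encoder states, in R^h
    return np.concatenate([a, h_dec])  # [a_t ; h_t^dec], in R^{2h}

rng = np.random.default_rng(0)
out = attend(rng.normal(size=(5, 8)), rng.normal(size=8))
print(out.shape)  # (16,)
```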
Types of attention

‣ Assume encoder hidden states h_1, h_2, …, h_n and decoder hidden state z

1. Dot-product attention (assumes equal dimensions for h_i and z): e_i = g(h_i, z) = z^T h_i ∈ ℝ
2. Multiplicative attention: g(h_i, z) = z^T W h_i ∈ ℝ, where W is a weight matrix
3. Additive attention: g(h_i, z) = v^T tanh(W_1 h_i + W_2 z) ∈ ℝ, where W_1, W_2 are weight matrices and v is a weight vector
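The three scoring functions differ only in how the scalar score is computed from h_i and z; in this sketch W, W_1, W_2, and v are random stand-in parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
h_dim = 6
h_i = rng.normal(size=h_dim)            # one encoder hidden state
z = rng.normal(size=h_dim)              # decoder hidden state
W = rng.normal(size=(h_dim, h_dim))     # multiplicative weight matrix
W1 = rng.normal(size=(h_dim, h_dim))    # additive weight matrices ...
W2 = rng.normal(size=(h_dim, h_dim))
v = rng.normal(size=h_dim)              # ... and weight vector

dot_score = z @ h_i                          # 1. dot-product
mult_score = z @ W @ h_i                     # 2. multiplicative (bilinear)
add_score = v @ np.tanh(W1 @ h_i + W2 @ z)   # 3. additive
```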
Dropout

‣ Form of regularization for RNNs (and any NN in general)
‣ Idea: "handicap" the NN by removing hidden units stochastically
‣ set each hidden unit in a layer to 0 with probability p during training (p = 0.5 usually works well)
‣ scale outputs by 1/(1 − p)
‣ hidden units are forced to learn more general patterns
‣ Test time: use all activations (no need to rescale)
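A sketch of "inverted" dropout as described above: units are zeroed with probability p and survivors are scaled by 1/(1 − p) at training time, so test time needs no rescaling:

```python
import numpy as np

def dropout(h, p=0.5, train=True, rng=None):
    """Inverted dropout: zero each unit with probability p during training and
    scale survivors by 1/(1 - p), so no rescaling is needed at test time."""
    if not train:
        return h                              # test time: use all activations
    rng = rng or np.random.default_rng(0)
    keep = (rng.random(h.shape) >= p).astype(h.dtype)
    return h * keep / (1.0 - p)               # scaling keeps E[output] = h

h = np.ones(1000)
train_out = dropout(h, p=0.5, train=True)
test_out = dropout(h, p=0.5, train=False)
```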
Handling large vocabularies

‣ Softmax can be expensive for large vocabularies: P(y_i) = exp(w_i · h + b_i) / Σ_{j=1}^{|V|} exp(w_j · h + b_j)
‣ English vocabulary size: 10K to 100K
‣ The denominator, a sum over the whole vocabulary, is expensive to compute
Approximate Softmax
‣ Negative Sampling
‣ Structured softmax
‣ Embedding prediction
Negative Sampling
• Softmax is expensive when vocabulary size is large
(figure credit: Graham Neubig)
Negative Sampling

• Sample just a subset of the vocabulary as negative examples
• Saw simple negative sampling in word2vec (Mikolov et al., 2013)
• Other ways to sample: Importance Sampling (Bengio and Senecal 2003), Noise Contrastive Estimation (Mnih and Teh 2012)

(figure credit: Graham Neubig)
Hierarchical softmax (Morin and Bengio 2005)

(figure credit: Quora)
Class-based softmax

‣ Two-layer: cluster words into classes; predict the class, then predict the word within the class. (Goodman 2001, Mikolov et al. 2011)
‣ Clusters can be based on frequency, random assignment, or word contexts.

(figure credit: Graham Neubig)
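The two-layer factorization P(w | h) = P(class(w) | h) · P(w | class(w), h) replaces one softmax over |V| words with two much smaller ones. A toy sketch with 6 words in 2 classes and random stand-in weights:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# toy setup: 6 words partitioned into 2 classes, hidden state h of size 8
classes = {0: [0, 1, 2], 1: [3, 4, 5]}       # word ids belonging to each class
rng = np.random.default_rng(0)
h = rng.normal(size=8)
W_class = rng.normal(size=(2, 8))            # class softmax: O(#classes)
W_word = {c: rng.normal(size=(len(ws), 8)) for c, ws in classes.items()}

def word_prob(word_id, h):
    """P(w | h) = P(class(w) | h) * P(w | class(w), h): two small softmaxes
    instead of one softmax over the whole vocabulary."""
    c = 0 if word_id in classes[0] else 1
    p_class = softmax(W_class @ h)[c]
    p_word = softmax(W_word[c] @ h)[classes[c].index(word_id)]
    return p_class * p_word

total = sum(word_prob(w, h) for w in range(6))
print(round(float(total), 6))  # 1.0 (the factorization is a proper distribution)
```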
Embedding prediction

‣ Directly predict the embeddings of the outputs themselves (Kumar and Tsvetkov 2019)
‣ What loss to use?
‣ L2? Cosine?
‣ von Mises-Fisher distribution loss: make embeddings close on the unit ball

(slide credit: Graham Neubig)
Generation

How can we use our model (decoder) to generate sentences?

• Sampling: generate a random sentence according to the probability distribution
• Argmax: generate the best sentence, i.e. the sentence with the highest probability
Decoding Strategies
‣ Ancestral sampling
‣ Greedy decoding
‣ Exhaustive search
‣ Beam search
Ancestral Sampling
• Randomly sample words one by one
• Provides diverse output (high variance)
(figure credit: Luong, Cho, and Manning)
Greedy decoding
‣ Compute argmax at every step of decoder to generate word
‣ What’s wrong?
Exhaustive search?

‣ Find arg max_{y_1, …, y_T} P(y_1, …, y_T | x_1, …, x_n)
‣ Requires computing all possible sequences
‣ O(V^T) complexity!
‣ Too expensive
Recall: Beam search (a middle ground)

‣ Key idea: at every step, keep track of the k most probable partial translations (hypotheses)
‣ Score of each hypothesis = its log probability: Σ_{t=1}^{j} log P(y_t | y_1, …, y_{t−1}, x_1, …, x_n)
‣ Not guaranteed to be optimal
‣ More efficient than exhaustive search
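A minimal beam search sketch over a toy model (the fixed next-word distribution is made up; a real decoder would condition on the prefix and the source sentence). Finished hypotheses, those ending in the end token, are kept but not expanded; in practice their scores would also be length-normalized before the final comparison:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def beam_search(step_logprobs, k=2, max_len=3, eos=0):
    """Keep the k highest-scoring partial hypotheses; score = sum of log-probs."""
    beams = [([], 0.0)]                        # (tokens so far, log-prob so far)
    for _ in range(max_len):
        candidates = []
        for toks, score in beams:
            if toks and toks[-1] == eos:       # finished: keep, but stop expanding
                candidates.append((toks, score))
                continue
            for w, lp in enumerate(step_logprobs(toks)):
                candidates.append((toks + [w], score + lp))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]
    return beams

# toy "model" over vocabulary {0: <eos>, 1, 2} with a fixed distribution
def toy_model(prefix):
    return np.log(softmax(np.array([0.2, 1.0, 0.5])))

for toks, score in beam_search(toy_model, k=2):
    print(toks, round(float(score), 3))
```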
Beam decoding
(slide credit: Abigail See)
Backtrack
(slide credit: Abigail See)
Beam decoding

‣ Different hypotheses may produce the ⟨eos⟩ (end) token at different time steps
‣ When a hypothesis produces ⟨eos⟩, stop expanding it and place it aside
‣ Continue beam search until:
‣ all k hypotheses produce ⟨eos⟩, OR
‣ we hit the max decoding limit T
‣ Select top hypotheses using the normalized likelihood score: (1/T) Σ_{t=1}^{T} log P(y_t | y_1, …, y_{t−1}, x_1, …, x_n)
‣ Otherwise shorter hypotheses would have higher scores
Under-translation

• Crucial information is left untranslated; premature generation of <EOS>
• Approaches: beam search, ensembles, coverage models
• Out-of-Vocabulary (OOV) problem; rare word problem
• Frequent words are translated correctly, rare words are not
• In any corpus, word frequencies are exponentially unbalanced
• Under-translation
• Crucial information is left untranslated; premature generation of <EOS>

[Figure: an RNN decoder translating "nobody expects the spanish inquisition !". At the step where "inquisition" should be produced, the model assigns 'inquisition': 0.49 vs. <EOS>: 0.51, so it emits <EOS> prematurely and under-translates.]
Ensemble

• Similar to a voting mechanism, but with probabilities
• Multiple models with different parameters (usually from different checkpoints of the same training run)
• Use the output with the highest probability across all models
Ensemble demo

[Figure: three checkpoints of the same model (seq2seq_E047.pt, seq2seq_E048.pt, seq2seq_E049.pt) each decode "prof loves ___". Their next-word distributions are:
cp1: 'kebab': 0.9, 'cheese': 0.05, 'pizza': 0.05, …
cp2: 'kebab': 0.3, 'cheese': 0.55, 'pizza': 0.05, …
cp3: 'kebab': 0.3, 'cheese': 0.05, 'pizza': 0.55, …
Individually the checkpoints disagree (kebab / cheese / pizza), but averaging the three distributions gives 'kebab' the highest probability (0.5), so the ensemble outputs "prof loves kebab".]
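The demo's averaging step can be reproduced directly. The three distributions below are the ones shown on the slide (each lists only three words; the remaining probability mass is elided there):

```python
import numpy as np

# next-word distributions from the three checkpoints (as shown on the slide)
cp1 = {"kebab": 0.90, "cheese": 0.05, "pizza": 0.05}
cp2 = {"kebab": 0.30, "cheese": 0.55, "pizza": 0.05}
cp3 = {"kebab": 0.30, "cheese": 0.05, "pizza": 0.55}

words = ["kebab", "cheese", "pizza"]
avg = {w: float(np.mean([cp[w] for cp in (cp1, cp2, cp3)])) for w in words}
best = max(avg, key=avg.get)
print(best)  # kebab: averaging favors the word every checkpoint gives some mass
```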
Existing challenges with NMT
‣ Out-of-vocabulary words
‣ Low-resource languages
‣ Long-term context
‣ Common sense knowledge (e.g. hot dog, paper jam)
‣ Fairness and bias
‣ Uninterpretable
Out-of-vocabulary words
• Out-of-Vocabulary (OOV) problem, also called the rare word problem
• Frequent words are translated correctly; rare words are not
• In any corpus, word frequencies are exponentially unbalanced
![Page 68: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/68.jpg)
OOV
1. https://en.wikipedia.org/wiki/Zipf%27s_law
Zipf’s Law
[Plot: occurrence count (log scale, 1 to 1,000,000) against frequency rank (1 to 50,000).]
• rare words are exponentially less frequent than frequent words
• e.g. in HW4, 45% (source) / 65% (target) of the unique words occur only once
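The HW4 statistic above is easy to compute for any corpus. A minimal sketch (the function name and toy corpus are hypothetical) counts the fraction of unique words that are hapax legomena:

```python
from collections import Counter

def hapax_fraction(tokens):
    """Fraction of unique words that occur exactly once (hapax legomena)."""
    counts = Counter(tokens)
    return sum(1 for c in counts.values() if c == 1) / len(counts)

# Toy corpus: 4 of its 7 unique words occur only once
corpus = "the prof loves kebab the prof loves cheese and pizza".split()
frac = hapax_fraction(corpus)  # 4/7
```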
![Page 69: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/69.jpg)
Out-of-vocabulary words
• Copy mechanisms
• Subword / character models
![Page 70: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/70.jpg)
Copy Mechanism (Concept)
[Diagram: a seq2seq RNN with attention decoding "<SOS> prof loves cheese" into "prof loves cheese <EOS>".]
1. CL2015015 [Luong et al.] Addressing the Rare Word Problem in Neural Machine Translation
![Page 71: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/71.jpg)
Copy Mechanism (Concept)
[Diagram: the decoder emits <UNK> while attending over the source "prof liebt kebab" with weights αt,1 = 0.1, αt,2 = 0.2, αt,3 = 0.7; the most-attended source word is "kebab".]
1. CL2015015 [Luong et al.] Addressing the Rare Word Problem in Neural Machine Translation
![Page 73: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/73.jpg)
Copy Mechanism (Concept)
[Diagram: source "prof liebt kebab"; the decoder emits <UNK> with attention weights αt,1 = 0.1, αt,2 = 0.2, αt,3 = 0.7, so "kebab" is copied.]
• The decoder sees <UNK> at step t
• look at the attention weights αt
• replace <UNK> with the source word f_{argmax_i α_{t,i}}
1. CL2015015 [Luong et al.] Addressing the Rare Word Problem in Neural Machine Translation
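The replacement rule above can be sketched as a post-processing step over the decoder output (the function name is hypothetical; the attention matrix is a toy example):

```python
import numpy as np

def replace_unks(target_tokens, attention, source_tokens):
    """Copy step: replace each <UNK> the decoder emitted with the
    source word that received the most attention at that step.
    attention[t, i] is the weight on source word i at decoder step t."""
    return [
        source_tokens[int(np.argmax(attention[t]))] if tok == "<UNK>" else tok
        for t, tok in enumerate(target_tokens)
    ]

src = ["prof", "liebt", "kebab"]
out = ["prof", "loves", "<UNK>"]
att = np.array([[0.8, 0.1, 0.1],
                [0.1, 0.8, 0.1],
                [0.1, 0.2, 0.7]])  # the <UNK> step attends mostly to "kebab"
fixed = replace_unks(out, att, src)  # -> ['prof', 'loves', 'kebab']
```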
![Page 74: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/74.jpg)
*Pointer-Generator Copy Mechanism
[Diagram: the same seq2seq setup, with a learned gate pg inserted between the decoder output and the copy mechanism.]
• a binary classifier acts as a switch:
• either use the decoder output,
• or use the copy mechanism
1. CL2016032 [Gulcehre et al.] Pointing the Unknown Words
![Page 75: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/75.jpg)
*Pointer-Generator Copy Mechanism
• The decoder sees <UNK> at step t, or the gate pg([h_t^dec; context_t]) ≤ 0.5
• look at the attention weights αt
• replace <UNK> with the source word f_{argmax_i α_{t,i}}
1. CL2016032 [Gulcehre et al.] Pointing the Unknown Words
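A minimal sketch of one decoding step with such a gate, assuming a simple linear classifier over [h_dec; context] (all names and numbers are hypothetical, not the paper's exact parameterization):

```python
import numpy as np

def pointer_generator_step(h_dec, context, w, b, vocab_dist, vocab,
                           attention, source_tokens):
    """pg = sigmoid(w . [h_dec; context] + b). If pg > 0.5, emit the most
    likely vocabulary word; otherwise copy the most-attended source word."""
    pg = 1.0 / (1.0 + np.exp(-(w @ np.concatenate([h_dec, context]) + b)))
    if pg > 0.5:
        return vocab[int(np.argmax(vocab_dist))]
    return source_tokens[int(np.argmax(attention))]

# Toy numbers: the gate comes out below 0.5, so the model copies "kebab"
word = pointer_generator_step(
    h_dec=np.array([1.0, -1.0]),
    context=np.array([0.5, 0.5]),
    w=np.array([-1.0, -1.0, -1.0, -1.0]), b=0.0,
    vocab_dist=np.array([0.7, 0.1, 0.1, 0.1]),
    vocab=["<UNK>", "prof", "loves", "kebab"],
    attention=np.array([0.1, 0.2, 0.7]),
    source_tokens=["prof", "liebt", "kebab"],
)
```

Unlike the post-hoc <UNK> replacement, the gate is trained jointly with the model, so it can decide to copy even when the decoder would not have emitted <UNK>.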
![Page 76: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/76.jpg)
*Pointer-Based Dictionary Fusion
• The decoder sees <UNK> at step t, or the gate pg([h_t^dec; context_t]) ≤ 0.5
• look at the attention weights αt
• replace <UNK> with a dictionary translation of the source word: dict(f_{argmax_i α_{t,i}})
1. CL2019331 [Gū et al.] Pointer-based Fusion of Bilingual Lexicons into Neural Machine Translation
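The change relative to the plain copy mechanism is small: instead of copying the source word verbatim, look it up in a bilingual lexicon. A sketch (hypothetical function name and toy lexicon; fallback behavior is an assumption):

```python
def copy_with_dictionary(source_tokens, attention, lexicon):
    """Dictionary-fusion copy step: translate the most-attended source
    word via a bilingual lexicon; fall back to a verbatim copy if the
    word is not in the lexicon."""
    i = max(range(len(attention)), key=lambda k: attention[k])
    src_word = source_tokens[i]
    return lexicon.get(src_word, src_word)

lexicon = {"liebt": "loves"}  # toy German-English dictionary
src = ["prof", "liebt", "kebab"]
in_dict = copy_with_dictionary(src, [0.1, 0.7, 0.2], lexicon)   # -> 'loves'
fallback = copy_with_dictionary(src, [0.1, 0.2, 0.7], lexicon)  # -> 'kebab'
```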
![Page 77: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/77.jpg)
Out-of-vocabulary words
• Copy mechanisms
• Subword / character models
![Page 78: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/78.jpg)
OOV
![Page 83: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/83.jpg)
Subword / character models
• Character-based seq2seq models
• Byte pair encoding (BPE)
• Subword units help model morphology
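The BPE learning procedure fits in a short sketch (after Sennrich et al.'s subword algorithm; identifiers and the toy word list are hypothetical): start from characters plus an end-of-word marker, then greedily merge the most frequent adjacent symbol pair.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge operations: start from characters (plus an
    end-of-word marker) and repeatedly merge the most frequent
    adjacent symbol pair across the whole vocabulary."""
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

merges = learn_bpe(["low", "low", "lower", "newest", "newest"], 3)
# The first merges build up the frequent subword "low"
```

Frequent words end up as single symbols while rare words decompose into smaller, reusable subword units, which is exactly what makes BPE effective against the OOV problem.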
![Page 84: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/84.jpg)
Existing challenges with NMT
‣ Out-of-vocabulary words
‣ Low-resource languages
‣ Long-term context
‣ Common sense knowledge (e.g. hot dog, paper jam)
‣ Fairness and bias
‣ Uninterpretable
![Page 85: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/85.jpg)
Massively multilingual MT
(Arivazhagan et al., 2019)
‣ Train a single neural network on 103 languages paired with English (remember Interlingua?)
‣ Massive improvements on low-resource languages
![Page 87: Neural Machine Translation · Neural Machine Translation Spring 2020 2020-03-12 CMPT 825: Natural Language Processing!"#!"#$"%&$"’ Adapted from slides from Danqi Chen, Karthik Narasimhan,](https://reader030.vdocument.in/reader030/viewer/2022040605/5eac2689e8a98469946becdd/html5/thumbnails/87.jpg)
Next time
‣ Contextualized embeddings
‣ ELMO, BERT, and friends
‣ Transformers
‣ Multi-headed self-attention