Lecture 13: Machine Translation II
Alan Ritter (many slides from Greg Durrett)
Neural MT Details
Encoder-Decoder MT
Sutskever et al. (2014)
‣ Sutskever seq2seq paper: first major application of LSTMs to NLP
‣ Basic encoder-decoder with beam search
‣ SOTA = 37.0 (not all that competitive…)
Encoder-Decoder MT
‣ Better model from seq2seq lectures: encoder-decoder with attention and copying for rare words
[Figure: one decoder step with attention: encoder states h1…h4 over "the movie was great", decoder state h̄1 after <s>, context vector c1, and a distribution over vocab + copying that produces "le"]
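A minimal sketch of the attention computation in this figure for a single decoder step; shapes and weights are illustrative, and the copy mechanism is omitted:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_step(H_enc, h_dec):
    """Return the context vector c = sum_j a_j h_j for one decoder state."""
    scores = H_enc @ h_dec           # dot-product attention scores
    weights = softmax(scores)        # distribution over source positions
    return weights @ H_enc           # context vector c1

H_enc = np.random.randn(4, 16)       # h1..h4 for "the movie was great"
h_dec = np.random.randn(16)          # decoder state after <s>
c1 = attention_step(H_enc, h_dec)    # combined with h_dec to predict "le"
print(c1.shape)
```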
Results: WMT English-French
‣ 12M sentence pairs
Classic phrase-based system: ~33 BLEU, uses additional target-language data
Rerank with LSTMs: 36.5 BLEU (long line of work here; Devlin+ 2014)
Sutskever+ (2014) seq2seq single: 30.6 BLEU
Sutskever+ (2014) seq2seq ensemble: 34.8 BLEU
Luong+ (2015) seq2seq ensemble with attention and rare word handling: 37.5 BLEU
‣ But English-French is a really easy language pair and there's tons of data for it! Does this approach work for anything harder?
Results: WMT English-German
‣ 4.5M sentence pairs
Classic phrase-based system: 20.7 BLEU
Luong+ (2014) seq2seq: 14 BLEU
Luong+ (2015) seq2seq ensemble with rare word handling: 23.0 BLEU
‣ Not nearly as good in absolute BLEU, but BLEU isn't really comparable across languages
‣ French, Spanish = easiest; German, Czech = harder; Japanese, Russian = hard (grammatically different, lots of morphology…)
MT Examples
Luong et al. (2015)
‣ NMT systems can hallucinate words, especially when not using attention; phrase-based MT doesn't do this
‣ best = with attention, base = no attention

MT Examples
Zhang et al. (2017)
‣ NMT can repeat itself if it gets confused (pH or pH)
‣ Phrase-based MT often gets chunks right, but may have more subtle ungrammaticalities
Rare Words: Word Piece Models
Wu et al. (2016)
‣ Use Huffman encoding on a corpus, keep most common k (~10,000) character sequences for source and target
‣ Captures common words and parts of rare words
Input: _the _eco tax _port i co _in _Po nt - de - Bu is …
Output: _le _port ique _éco taxe _de _Pont - de - Bui s
‣ Subword structure may make it easier to translate
‣ Model balances translating and transliterating without explicit switching
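At translation time the input has to be segmented into pieces like the example above. Below is a minimal sketch of greedy longest-match segmentation (the matching style later WordPiece tokenizers use; not necessarily the exact GNMT procedure), with a hypothetical toy vocabulary:

```python
def segment(word, vocab, marker="_"):
    """Split `word` into the longest vocabulary pieces, left to right."""
    pieces, token = [], marker + word  # "_" marks a word boundary
    while token:
        # Try the longest prefix of `token` that is in the vocabulary
        for end in range(len(token), 0, -1):
            if token[:end] in vocab:
                pieces.append(token[:end])
                token = token[end:]
                break
        else:
            return None  # unsegmentable without an <unk> fallback

    return pieces

vocab = {"_port", "ique", "_de", "_Pont", "-", "Bui", "s"}
print(segment("portique", vocab))  # ['_port', 'ique']
```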
Rare Words: Byte Pair Encoding
Sennrich et al. (2016)
‣ Simpler procedure, based only on the dictionary
‣ Input: a dictionary of words represented as characters
‣ Count bigram character cooccurrences
‣ Merge the most frequent pair of adjacent characters
‣ Final size = initial vocab + num merges. Often do 10k-30k merges
‣ Most SOTA NMT systems use this on both source + target
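The merge procedure in these bullets is short enough to sketch directly. A minimal sketch, with a toy dictionary in the style of the worked example from Sennrich et al. (2016); the tiny number of merges is just for illustration:

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite every occurrence of the pair as one merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words represented as space-separated characters, with frequencies
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
num_merges = 10  # real systems often do 10k-30k merges
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(best)  # e.g. ('e', 's') first: 6 + 3 = 9 occurrences
```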
Google's NMT System
Wu et al. (2016)
‣ 8-layer LSTM encoder-decoder with attention, word piece vocabulary of 8k-32k

English-French:
Google's phrase-based system: 37.0 BLEU
Luong+ (2015) seq2seq ensemble with rare word handling: 37.5 BLEU
Google's 32k word pieces: 38.95 BLEU

English-German:
Google's phrase-based system: 20.7 BLEU
Luong+ (2015) seq2seq ensemble with rare word handling: 23.0 BLEU
Google's 32k word pieces: 24.2 BLEU
Human Evaluation (En-Es)
Wu et al. (2016)
‣ Similar to human-level performance on English-Spanish
Google's NMT System
Wu et al. (2016)
‣ Gender is correct in GNMT but not in PBMT
[Example translations: "sled", "walker"]
Backtranslation
Sennrich et al. (2015)
‣ Classical MT methods used a bilingual corpus of sentences B = (S, T) and a large monolingual corpus T' to train a language model. Can neural MT do the same?
‣ Approach 1: force the system to generate T' as targets from null inputs:
s1, t1
s2, t2
[null], t'1
[null], t'2
…
‣ Approach 2: generate synthetic sources with a T->S machine translation system (backtranslation):
s1, t1
s2, t2
MT(t'1), t'1
MT(t'2), t'2
…
‣ parallelsynth: backtranslate training data; makes additional noisy source sentences which could be useful
‣ Gigaword: large monolingual English corpus
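A minimal sketch of Approach 2's data construction, assuming a hypothetical trained target-to-source model `backward_mt`:

```python
def build_training_data(bitext, mono_target, backward_mt):
    """Combine real parallel pairs with synthetic (source, target) pairs."""
    synthetic = [(backward_mt(t), t) for t in mono_target]  # MT(t'), t'
    return bitext + synthetic

bitext = [("the movie was great", "le film était génial")]  # (s, t) pairs
mono_target = ["le port de Pont-de-Buis"]                   # monolingual T'
backward_mt = lambda t: "<backtranslated " + t + ">"        # stand-in model
print(build_training_data(bitext, mono_target, backward_mt))
```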
Transformers for MT
Recall: Self-Attention
Vaswani et al. (2017)
‣ Each word forms a "query" which then computes attention over each word:
$\alpha_{i,j} = \mathrm{softmax}(x_i^\top x_j)$ (scalar)
$x_i' = \sum_{j=1}^{n} \alpha_{i,j} x_j$ (vector = sum of scalar * vector)
‣ Multiple "heads", analogous to different convolutional filters. Use parameters $W_k$ and $V_k$ to get different attention values + transform vectors:
$\alpha_{k,i,j} = \mathrm{softmax}(x_i^\top W_k x_j)$
$x_{k,i}' = \sum_{j=1}^{n} \alpha_{k,i,j} V_k x_j$
[Figure: computing $x_4'$ from $x_4$ over "the movie was great"]
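A minimal NumPy sketch of the two sets of equations above (single-head, then multi-head); dimensions and random weights are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X):
    """x'_i = sum_j softmax_j(x_i^T x_j) x_j, for all i at once."""
    alpha = softmax(X @ X.T)           # n x n attention weights
    return alpha @ X                   # n x d transformed vectors

def multi_head_self_attention(X, Ws, Vs):
    """Per head k: x'_{k,i} = sum_j softmax_j(x_i^T W_k x_j) V_k x_j."""
    outputs = []
    for W, V in zip(Ws, Vs):
        alpha = softmax(X @ W @ X.T)   # n x n weights for head k
        outputs.append(alpha @ X @ V.T)
    return np.concatenate(outputs, axis=-1)

n, d, heads = 4, 8, 2                  # e.g. "the movie was great"
X = np.random.randn(n, d)
Ws = [np.random.randn(d, d) for _ in range(heads)]
Vs = [np.random.randn(d, d) for _ in range(heads)]
print(self_attention(X).shape, multi_head_self_attention(X, Ws, Vs).shape)
```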
Transformers
Vaswani et al. (2017)
‣ Augment word embeddings with position embeddings; each dim is a sine/cosine wave of a different frequency. Closer points = higher dot products
‣ Works essentially as well as just encoding position as a one-hot vector
[Figure: "the movie was great" with emb(1), emb(2), emb(3), emb(4) added to the word embeddings]
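A minimal sketch of sinusoidal position embeddings following the formula from Vaswani et al. (2017); the sizes here are illustrative:

```python
import numpy as np

def position_embeddings(num_positions, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(num_positions)[:, None]      # (n, 1)
    dim = np.arange(0, d_model, 2)[None, :]      # (1, d/2), the 2i values
    angles = pos / (10000 ** (dim / d_model))    # wavelengths grow with dim
    emb = np.zeros((num_positions, d_model))
    emb[:, 0::2] = np.sin(angles)
    emb[:, 1::2] = np.cos(angles)
    return emb

emb = position_embeddings(4, 8)   # emb(1) … emb(4) for a 4-word input
print(emb.shape)                  # added elementwise to word embeddings
```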
Transformers
Vaswani et al. (2017)
‣ Encoder and decoder are both transformers
‣ Decoder consumes the previously generated token (and attends to the input), but has no recurrent state
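A minimal sketch of this encoder-decoder setup using PyTorch's built-in `nn.Transformer`; sizes are illustrative, and a real MT model would wrap this with token embeddings, position encodings, and an output softmax:

```python
import torch
import torch.nn as nn

model = nn.Transformer(d_model=64, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2)

src = torch.randn(5, 1, 64)   # (source length, batch, d_model), pre-embedded
tgt = torch.randn(4, 1, 64)   # previously generated target tokens
# Causal mask: each target position attends only to earlier positions,
# which stands in for the recurrent state of an LSTM decoder
tgt_mask = model.generate_square_subsequent_mask(4)
out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)              # torch.Size([4, 1, 64])
```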
Transformers
Vaswani et al. (2017)
‣ Big = 6 layers, 1000 dim for each token, 16 heads; base = 6 layers with the other parameters halved
Visualization
Vaswani et al. (2017)
[Attention visualizations from the paper]
Takeaways
‣ Can build MT systems with LSTM encoder-decoders, CNNs, or transformers
‣ Word piece / byte pair models are really effective and easy to use
‣ State-of-the-art systems are getting pretty good, but lots of challenges remain, especially for low-resource settings
‣ Next time: pre-trained transformer models (BERT), applied to other tasks