
Page 1:

Lecture 13: Machine Translation II

Alan Ritter (many slides from Greg Durrett)

Page 2:

Neural MT Details

Pages 3-6:

Encoder-Decoder MT

Sutskever et al. (2014)

‣ Sutskever seq2seq paper: first major application of LSTMs to NLP

‣ Basic encoder-decoder with beam search (a minimal sketch of beam search follows below)

‣ SOTA = 37.0 — not all that competitive…
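As a reference point for "encoder-decoder with beam search": the following is a minimal sketch of beam search decoding, assuming a hypothetical step(prefix) that returns (token, log-probability) pairs for the next token under the decoder; "<s>" and "</s>" are assumed sentence markers.

```python
import heapq

def beam_search(step, beam_size=5, max_len=50, eos="</s>"):
    beams = [(0.0, ["<s>"])]                      # (total log-prob, token sequence)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == eos:                    # finished hypotheses carry over
                candidates.append((score, seq))
                continue
            for token, logp in step(seq):         # expand each live hypothesis
                candidates.append((score + logp, seq + [token]))
        # keep only the k highest-scoring hypotheses
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
        if all(seq[-1] == eos for _, seq in beams):
            break
    return max(beams, key=lambda c: c[0])[1]
```

Greedy decoding is the beam_size=1 special case; larger beams trade decoding time for better global sequence scores.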

Page 7:

Encoder-Decoder MT

‣ Better model from seq2seq lectures: encoder-decoder with attention and copying for rare words (a sketch of the attention step follows below)

[Figure: encoder states h1…h4 over "the movie was great"; decoder state h̄1 with context vector c1 feeds a distribution over vocab + copying, emitting "le"]
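A minimal numpy sketch of the attention step in the figure, assuming encoder states H (rows h1..h4) and decoder state h̄1; dot-product scoring is one common choice (Luong et al. 2015 also explore bilinear scoring), and the copy mechanism is omitted here.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(H, h_bar):
    """H: (num_source_words, d) encoder states; h_bar: (d,) decoder state."""
    scores = H @ h_bar        # one scalar score per source position
    alpha = softmax(scores)   # attention distribution over the source
    return alpha @ H          # context vector c = sum_j alpha_j * h_j
```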

Pages 8-14:

Results: WMT English-French

‣ 12M sentence pairs

Classic phrase-based system: ~33 BLEU, uses additional target-language data

Rerank with LSTMs: 36.5 BLEU (long line of work here; Devlin+ 2014)

Sutskever+ (2014) seq2seq single: 30.6 BLEU

Sutskever+ (2014) seq2seq ensemble: 34.8 BLEU

Luong+ (2015) seq2seq ensemble with attention and rare word handling: 37.5 BLEU

‣ But English-French is a really easy language pair and there's tons of data for it! Does this approach work for anything harder?

Pages 15-16:

Results: WMT English-German

‣ 4.5M sentence pairs

Classic phrase-based system: 20.7 BLEU

Luong+ (2014) seq2seq: 14 BLEU

Luong+ (2015) seq2seq ensemble with rare word handling: 23.0 BLEU

‣ Not nearly as good in absolute BLEU, but not really comparable across languages

‣ French, Spanish = easiest; German, Czech = harder; Japanese, Russian = hard (grammatically different, lots of morphology…)

Pages 17-18:

MT Examples

Luong et al. (2015)

‣ NMT systems can hallucinate words, especially when not using attention — phrase-based doesn't do this

‣ best = with attention, base = no attention

Page 19:

MT Examples

Zhang et al. (2017)

‣ NMT can repeat itself if it gets confused (pH or pH)

‣ Phrase-based MT often gets chunks right, may have more subtle ungrammaticalities

Pages 20-23:

Rare Words: Word Piece Models

Wu et al. (2016)

‣ Use Huffman encoding on a corpus, keep most common k (~10,000) character sequences for source and target

‣ Captures common words and parts of rare words

Input: _the _eco tax _port i co _in _Po nt - de - Bu is …

Output: _le _port ique _éco taxe _de _Pont - de - Bui s

‣ Subword structure may make it easier to translate

‣ Model balances translating and transliterating without explicit switching

Pages 24-29:

Rare Words: Byte Pair Encoding

Sennrich et al. (2016)

‣ Simpler procedure, based only on the dictionary

‣ Input: a dictionary of words represented as characters

‣ Count bigram character cooccurrences

‣ Merge the most frequent pair of adjacent characters

‣ Final size = initial vocab + num merges. Often do 10k - 30k merges

‣ Most SOTA NMT systems use this on both source + target (a sketch of the merge loop follows below)
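The merge loop is essentially the reference pseudocode from Sennrich et al. (2016), shown here on a toy dictionary; words are space-separated characters with an end-of-word marker.

```python
import re
import collections

def get_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every occurrence of the pair with the merged symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
num_merges = 10
for _ in range(num_merges):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge_vocab(best, vocab)
    print(best)                        # e.g. ('e', 's'), then ('es', 't'), …
```

Real systems just run many more merges (the 10k-30k above) over the full training dictionary, then apply the learned merges to segment source and target text.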

Pages 30-33:

Google's NMT System

Wu et al. (2016)

‣ 8-layer LSTM encoder-decoder with attention, word piece vocabulary of 8k-32k

English-French:
Google's phrase-based system: 37.0 BLEU
Luong+ (2015) seq2seq ensemble with rare word handling: 37.5 BLEU
Google's 32k word pieces: 38.95 BLEU

English-German:
Google's phrase-based system: 20.7 BLEU
Luong+ (2015) seq2seq ensemble with rare word handling: 23.0 BLEU
Google's 32k word pieces: 24.2 BLEU

Page 34:

Human Evaluation (En-Es)

Wu et al. (2016)

‣ Similar to human-level performance on English-Spanish

Pages 35-39:

Google's NMT System

Wu et al. (2016)

‣ Gender is correct in GNMT but not in PBMT

[Example translations, including the words "sled" and "walker"]

Pages 40-44:

Backtranslation

Sennrich et al. (2015)

‣ Classical MT methods used a bilingual corpus of sentences B = (S, T) and a large monolingual corpus T' to train a language model. Can neural MT do the same?

‣ Approach 1: force the system to generate T' as targets from null inputs:

s1, t1
[null], t'1
[null], t'2
s2, t2
…

‣ Approach 2: generate synthetic sources with a T->S machine translation system (backtranslation), as in the sketch below:

s1, t1
MT(t'1), t'1
MT(t'2), t'2
s2, t2
…
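A minimal sketch of Approach 2 as data preparation, where reverse_mt is a hypothetical trained T->S model (a function from a target sentence to a synthetic source sentence):

```python
def backtranslate_augment(parallel_pairs, mono_targets, reverse_mt):
    """Build training data for the S->T model: real (s, t) pairs plus
    synthetic (MT(t'), t') pairs from monolingual target sentences."""
    synthetic_pairs = [(reverse_mt(t), t) for t in mono_targets]
    # Synthetic sources are noisy, but every target is a real sentence,
    # so the decoder still gets fluent target-language supervision.
    return parallel_pairs + synthetic_pairs
```

The real and synthetic pairs are then mixed and the S->T model is trained as usual; later work sometimes tags or down-weights the synthetic portion.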

Page 45:

Backtranslation

Sennrich et al. (2015)

‣ parallelsynth: backtranslate the training data; this makes additional noisy source sentences, which could be useful

‣ Gigaword: large monolingual English corpus

Page 46:

Transformers for MT

Pages 47-56:

Recall: Self-Attention

Vaswani et al. (2017)

‣ Each word forms a "query" which then computes attention over each word:

\alpha_{i,j} = \mathrm{softmax}(x_i^\top x_j)   (scalar)

x'_i = \sum_{j=1}^{n} \alpha_{i,j} x_j   (vector = sum of scalar * vector)

‣ Multiple "heads" analogous to different convolutional filters. Use parameters W_k and V_k to get different attention values + transform vectors (a numpy sketch of both follows below):

\alpha_{k,i,j} = \mathrm{softmax}(x_i^\top W_k x_j)

x'_{k,i} = \sum_{j=1}^{n} \alpha_{k,i,j} V_k x_j

[Figure: attention from x_4 over "the movie was great", producing x'_4]
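The following is a minimal numpy sketch of just these equations, not the full Transformer layer (no scaling, layer norm, or feedforward parts); x is assumed to be an (n, d) matrix with one row per word.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def self_attention(x):
    """Single head: alpha_{i,j} = softmax(x_i^T x_j), x'_i = sum_j alpha_{i,j} x_j."""
    n, _ = x.shape
    out = np.zeros_like(x)
    for i in range(n):
        alpha = softmax(x @ x[i])   # attention of word i over all words (scalars)
        out[i] = alpha @ x          # weighted sum of word vectors
    return out

def multi_head_attention(x, Ws, Vs):
    """Per head k: alpha_{k,i,j} = softmax(x_i^T W_k x_j),
    x'_{k,i} = sum_j alpha_{k,i,j} V_k x_j."""
    heads = []
    for W, V in zip(Ws, Vs):
        scores = x @ W @ x.T                           # scores[i, j] = x_i^T W x_j
        alpha = np.apply_along_axis(softmax, 1, scores)
        heads.append(alpha @ (x @ V.T))                # row i = x'_{k,i}
    return heads
```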

Page 57:

Transformers

Vaswani et al. (2017)

‣ Augment word embedding with position embeddings; each dim is a sine/cosine wave of a different frequency. Closer points = higher dot products (see the sketch below)

‣ Works essentially as well as just encoding position as a one-hot vector

[Figure: position embeddings emb(1)…emb(4) added to the word embeddings of "the movie was great"]
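A minimal numpy sketch of sinusoidal position embeddings; the 10000 base constant is the one used by Vaswani et al. (2017), and d_model is assumed even.

```python
import numpy as np

def position_embeddings(num_positions, d_model):
    """emb(pos) interleaves sin/cos waves of geometrically spaced frequencies."""
    pos = np.arange(num_positions)[:, None]      # (num_positions, 1)
    dim = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = pos / (10000 ** (dim / d_model))
    emb = np.zeros((num_positions, d_model))
    emb[:, 0::2] = np.sin(angles)                # even dims: sine
    emb[:, 1::2] = np.cos(angles)                # odd dims: cosine
    return emb

# Closer positions have higher dot products:
E = position_embeddings(100, 64)
print(E[10] @ E[11], E[10] @ E[50])  # first value is larger
```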

Page 58:

Transformers

Vaswani et al. (2017)

‣ Encoder and decoder are both transformers

‣ Decoder consumes the previous generated token (and attends to input), but has no recurrent state

Page 59:

Transformers

Vaswani et al. (2017)

‣ Big = 6 layers, 1000 dim for each token, 16 heads; base = 6 layers + other params halved

Pages 60-61:

Visualization

Vaswani et al. (2017)

Page 62:

Takeaways

‣ Can build MT systems with LSTM encoder-decoders, CNNs, or transformers

‣ Word piece / byte pair models are really effective and easy to use

‣ State-of-the-art systems are getting pretty good, but lots of challenges remain, especially for low-resource settings

‣ Next time: pre-trained transformer models (BERT), applied to other tasks