On Learning Form and Meaning in Neural Machine Translation Models
Yonatan Belinkov, May 2017
With: Nadir Durrani, Hassan Sajjad, Fahim Dalvi, Lluís Màrquez, James Glass
Motivation
• Neural machine translation (NMT) obtains state-of-the-art results
• Elegant and simple end-to-end architecture
• However, NMT models are difficult to interpret: what do they learn about the source and target languages?
• Recent interest in the community (e.g. Shi+16 on syntax)
• This work: analyzing morphology (and semantics) in NMT
Translation as Decoding
• Warren Weaver to Norbert Wiener, March 4, 1947:
"Also knowing nothing official about, but having guessed and inferred considerable about, powerful new mechanized methods in cryptography - methods which I believe succeed even when one does not know what language has been coded - one naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say 'This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.'"
Brief History of Machine Translation
• 1947: Initial ideas of MT (Weaver)
• 1950s: First MT systems
• 1960s: High-quality MT fails, cut in government funding
• 1970s-1980s: Rule-based systems, interlingua ideas
• 1990s: Statistical MT, IBM alignment models
• 2000s: Phrase-based MT, open-source toolkits
• 2014-2015: Neural MT: seq2seq + attention
Statistical Machine Translation
• Translate a source sentence F into a target sentence E
– Translation model
– Language model
[Word alignment example: "Maria no dió una bofetada a la bruja verde" ↔ "Mary did not slap the green witch"; from Jurafsky & Martin 2009]
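The translation model and language model bullets refer to the classic noisy-channel decomposition; a standard formulation (reconstructed here, not copied from the slide) is:

```latex
\hat{E} = \arg\max_E P(E \mid F)
        = \arg\max_E \underbrace{P(F \mid E)}_{\text{translation model}} \cdot \underbrace{P(E)}_{\text{language model}}
```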
Neural Machine Translation
[Diagram: input text → Encoder → Decoder → translated text]
Neural Machine Translation
• Encoder: computes a source hidden state for each input word
• Decoder: computes target hidden states and emits target words
• Loss: negative log-likelihood of the target sentence
[Slide annotates the equations with: source hidden state, target hidden state, summary vector]
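A minimal reconstruction of the standard sequence-to-sequence formulation behind these bullets (the slide's own equations did not survive extraction; the notation here is the conventional one):

```latex
h_i = f_{\mathrm{enc}}(x_i, h_{i-1})          % source hidden state
c   = h_{|x|}                                 % summary vector (final encoder state)
s_t = f_{\mathrm{dec}}(y_{t-1}, s_{t-1}, c)   % target hidden state
\mathcal{L} = -\textstyle\sum_t \log p(y_t \mid y_{<t}, x)   % negative log-likelihood loss
```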
Encoder-Decoder
[Diagram: "Maria no dió una bofetada a la bruja verde" is encoded into a single summary vector, from which "Mary did not slap the green witch <STOP>" is decoded]
The Problem with the Encoder-Decoder
• Raymond Mooney, June 26, 2016:
"You can't cram the meaning of a whole %&!$# sentence into a single $&!#* vector!"
Attention Mechanism
[Diagram: while generating "Mary did not slap the green witch <STOP>", the decoder attends to the encoder states of "Maria no dió una bofetada a la bruja verde"]
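For reference, the standard (Bahdanau-style) attention computation sketched by the diagram; this is the conventional formulation, not copied from the slide:

```latex
e_{ti}      = a(s_{t-1}, h_i)                          % alignment score for source word i at step t
\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_j \exp(e_{tj})} % soft alignment (attention) weights
c_t         = \textstyle\sum_i \alpha_{ti} h_i          % per-step context vector replacing the single summary
```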
Attention as Soft Alignment
[Figure: word alignments for "Maria no dió una bofetada a la bruja verde" ↔ "Mary did not slap the green witch"; left panel: phrase-based MT (hard alignments), right panel: neural MT (soft attention weights)]
Research Questions
• Which parts of the NMT architecture capture word structure? Which capture meaning?
• What is the division of labor between different components?
• How do different word representations help learn better morphology?
• How does the target language affect the learning of word structure?
Methodology
• Three-step procedure (sketched in code below):
1. Train a neural MT system
2. Extract feature representations using the trained model
3. Train a classifier using the extracted features and evaluate it on an extrinsic task
• Assumption: the performance of the classifier reflects the quality of the NMT representations for the given task
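A minimal sketch of steps 2-3 in Python; `encode` is a hypothetical helper returning one hidden-state vector per source word from the trained NMT model, and a logistic regression stands in for the classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Step 2: extract per-word feature representations from the trained model.
def extract_features(encode, tagged_sentences):
    X, y = [], []
    for words, tags in tagged_sentences:
        states = encode(words)          # shape: (len(words), hidden_dim)
        X.extend(states)
        y.extend(tags)
    return np.array(X), np.array(y)

# Step 3: train a classifier on the extracted features and evaluate it
# on the extrinsic task (e.g. POS or morphological tagging).
def probe(encode, train_data, test_data):
    X_train, y_train = extract_features(encode, train_data)
    X_test, y_test = extract_features(encode, test_data)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return clf.score(X_test, y_test)    # accuracy ~ representation quality
```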
Part A: Morphology
Experimental Setup
• Tasks
– Part-of-speech tagging
– Morphological tagging
• Languages
– Arabic-, German-, French-, and Czech-English
– Arabic-Hebrew (rich and similar)
– Arabic-German (rich but different)
Experimental Setup
• MT data: TED talks
• Annotated data
– Gold tags
– Predicted tags
Encoder
Effect of Word Representation
[Diagram: two representations of the word "running": a word embedding lookup vs. a character CNN over its characters]
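A minimal PyTorch sketch of the character-CNN alternative shown above; the layer sizes are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Builds a word vector from its characters: embed each character,
    convolve over the character sequence, then max-pool over positions."""
    def __init__(self, n_chars=100, char_dim=25, n_filters=500, width=3):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, n_filters, kernel_size=width, padding=1)

    def forward(self, char_ids):            # (batch, word_len) character ids
        x = self.char_emb(char_ids)         # (batch, word_len, char_dim)
        x = x.transpose(1, 2)               # (batch, char_dim, word_len)
        x = torch.relu(self.conv(x))        # (batch, n_filters, word_len)
        return x.max(dim=2).values          # (batch, n_filters): the word vector
```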
Effect of Word Representation

          POS Accuracy        BLEU
          Word     Char       Word    Char
Ar-En     89.62    95.35      24.7    28.4
Ar-He     88.33    94.66       9.9    10.7
De-En     93.54    94.63      29.6    30.4
Fr-En     94.61    95.55      37.8    38.8
Cz-En     75.71    79.10      23.2    25.4

• Character-based models generate better representations for POS tagging
– Especially with richer morphological systems
• Character-based models also improve translation quality
Impact of Word Frequency
[figure]
Impact of Tag Frequency
[figure]
Comparing Specific Tags
[Figure: word-based vs. char-based representations; tags shown include NN, NNP, Det]
Effect of Encoder Depth
• NMT models can be very deep
– Google Translate: 8 encoder/decoder layers
– Zhou+2016: 16 layers
• What kind of information is learned at each layer?
• We analyzed a 2-layer encoder
– Extract representations from different layers for training the classifier (see the sketch below)
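A sketch of where the per-layer representations come from, assuming a hypothetical 2-layer LSTM encoder in PyTorch; each returned tensor can be probed separately with the classifier from the methodology sketch:

```python
import torch.nn as nn

# Hypothetical 2-layer encoder; layer 0 = word embeddings,
# layers 1 and 2 = successive LSTM outputs.
class Encoder(nn.Module):
    def __init__(self, vocab=50_000, dim=500):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn1 = nn.LSTM(dim, dim, batch_first=True)
        self.rnn2 = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, word_ids):              # (batch, seq_len)
        layer0 = self.emb(word_ids)           # word embeddings
        layer1, _ = self.rnn1(layer0)         # first LSTM layer
        layer2, _ = self.rnn2(layer1)         # second LSTM layer
        return layer0, layer1, layer2         # probe each layer separately
```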
Effect of Encoder Depth
[Plot: POS tagging accuracy by encoder layer]
• Layer 1 > Layer 2 > Layer 0
• But deeper models translate better
• Is layer 2 learning more about semantics? More on that later…
Effect of Target Language
• How does the target language affect the learned source language representations?
• Experiment:
– Fix the source side and train NMT models on different target languages
– Compare the learned representations on POS/morphological tagging
Effect of Target Language
• Source language: Arabic
• Target languages: English, German, Hebrew, Arabic
• Poorer morphology on the target side, better source-side representations for morphology
• Higher BLEU ≠ better representations
Decoder
Encoder vs. Decoder

                     POS Accuracy
                     Encoder   Decoder
Arabic ↔ English     89.6      43.9
German ↔ English     93.5      53.6
Czech ↔ English      75.7      36.3

• The decoder learns very little about target language morphology
• Why?
Effect of Attention
[Diagram: attention between "Maria no dió una bofetada a la bruja verde" and "Mary did not slap the green witch <STOP>"]
Effect of Attention

                    With        Without     With most       Only most
                    attention   attention   attended word   attended word
English → German    44.55       50.26       60.34           43.43
English → Czech     36.35       42.09       48.64           36.36

• Removing attention improves decoder representations
– Attention removes burden from the decoder; the decoder does not need to learn as much about target words
• Concatenating the most attended word improves performance
– Encoder representations are helpful for target morphology
• But using only the encoder side is not as good
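A sketch of the "most attended word" variants, assuming decoder states, encoder states, and attention weights extracted from the trained model; the names and shapes here are illustrative:

```python
import numpy as np

def most_attended_features(dec_states, enc_states, attn, concat=True):
    """dec_states: (T, d) decoder states; enc_states: (S, d) encoder states;
    attn: (T, S) attention weights. Returns features for the target-side
    classifier: each decoder state concatenated with (or replaced by) the
    encoder state that received the most attention at that step."""
    best = attn.argmax(axis=1)              # most attended source position per step
    attended = enc_states[best]             # (T, d)
    if concat:                              # "with most attended word": (T, 2d)
        return np.concatenate([dec_states, attended], axis=1)
    return attended                         # "only most attended word": (T, d)
```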
Summary
• The NMT encoder learns good representations for morphology
– Character-based representations are much better than word-based
– The target language impacts source-side representations
– Layer 1 > Layer 2 > Layer 0
• The decoder learns poor target-side representations
– The attention model helps the decoder exploit source representations
Part B: Semantics
Recap
• We saw:
– NMT representations from layer 1 are better than layer 2 (and layer 0) for POS and morphological tagging
– Deeper networks lead to better translation performance
• Questions:
– What is captured in higher layers?
– How is semantic information represented?
• Let's apply a similar methodology to a semantic task
Semantic Tagging
• Lexical semantics
– Abstraction over POS tagging
– Language-neutral, aimed at multi-lingual semantic parsing
• Some examples:
– Determiners: every, no, some
– Comma as conjunction, disjunction, or apposition
– Role nouns vs. entity nouns
– Comparison adjectives: comparative, superlative, equative
Experimental Setup
• Semantic tagging data
– 66 fine-grained tags, 13 coarse categories

            Train     Dev       Test
Sentences   42.5K     6.1K      12.2K
Tokens      937.1K    132.3K    265.5K

• MT data: UN corpus
– Multi-parallel
– 11M sentences
– Arabic, Chinese, English, French, Spanish, Russian
Baselines

System                           Accuracy
Most frequent tag                82.0
Unsupervised embeddings          81.1
Word2Tag encoder-decoder         91.4
State of the art (Bjerva+16)     95.5
Effect of Network Depth
[Plot: semantic tagging accuracy by encoder layer; the most-frequent-tag baseline shown as a horizontal line]
• Layer 0 below baseline
• Layer 1 >> layer 0
• Layer 4 > layer 1
• Similar trends for coarse tags
Effect of Target Language
[Plot: semantic tagging accuracy by target language, with the most-frequent-tag baseline]
• No impact on semantic tagging
• But large impact on translation:

         BLEU
En-Ar    32.7
En-Es    49.1
En-Fr    38.5
En-Ru    34.2
En-Zh    32.1
Analyzing Specific Tags
• Layer 4 vs. layer 1
– Blue: distinguishing among coarse tags
– Red: distinguishing among fine-grained tags within a coarse category
• Layer 4 > layer 1, especially with:
– Discourse relations (DIS)
– Properties of nouns (ENT)
– Events, tenses (EVE, TNS)
– Logic relations and quantifiers (LOG)
– Comparative constructions (COM)
Analyzing Specific Tags
• Negative examples:
– Modality (MOD): a closed class ("no", "not", "should", "must", etc.)
– Named entities (NAM): OOVs? A neural MT limitation?
Semantic Tags vs. POS Tags

     Layer:   0      1      2      3      4
Uni  POS      87.9   92.0   91.7   91.8   91.9
     Sem      81.8   87.8   87.4   87.6   88.2
Bi   POS      87.9   93.3   92.9   93.2   92.8
     Sem      81.9   91.3   90.8   91.9   91.9

• Higher layers improve semantic tagging but not POS tagging
• Layer 1 is best for POS; layer 4 is best for semantic tagging
• Similar trends with a bidirectional encoder
Summary
• Neural MT representations contain useful information about word form and meaning
– Lower layers focus on POS/morphology
– Higher layers focus on (lexical) semantics
• The target language does not affect semantic tagging quality
Future Work
• Other neural MT architectures
– Word representations; multi-lingual models
• Other linguistic properties
– Syntactic and semantic relations, complex structures
• Improving neural MT
– Multi-task learning
• Analyzing representations in other neural models
– End-to-end speech recognition