On Learning Form and Meaning in Neural Machine Translation Models
Yonatan Belinkov, May 2017
With: Nadir Durrani, Hassan Sajjad, Fahim Dalvi, Lluís Màrquez, James Glass
Motivation
• Neural machine translation (NMT) obtains state-of-the-art results
• Elegant and simple end-to-end architecture
• However, NMT models are difficult to interpret: what do they learn about the source and target languages?
• Recent interest in the community (e.g. Shi+16 on syntax)
• This work: analyzing morphology (and semantics) in NMT
Translation as Decoding
• Warren Weaver to Norbert Wiener, March 4, 1947:
"Also knowing nothing official about, but having guessed and inferred considerable about, powerful new mechanized methods in cryptography - methods which I believe succeed even when one does not know what language has been coded - one naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say 'This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.'"
Brief History of Machine Translation
• 1947: Initial ideas of MT (Weaver)
• 1950s: First MT systems
• 1960s: High-quality MT fails, cut in government funding
• 1970s-1980s: Rule-based systems, interlingua ideas
• 1990s: Statistical MT, IBM alignment models
• 2000s: Phrase-based MT, open-source toolkits
• 2014-2015: Neural MT: seq2seq + attention
Statistical Machine Translation
• Translate a source sentence F into a target sentence E
– Translation model
– Language model
[Word alignment example: "Maria no dió una bofetada a la bruja verde" ↔ "Mary did not slap the green witch"; from Jurafsky & Martin 2009]
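The translation model and language model bullets refer to the classic noisy-channel decomposition; a standard formulation (reconstructed here, not copied from the slide) is:

```latex
\hat{E} = \arg\max_E P(E \mid F)
        = \arg\max_E \underbrace{P(F \mid E)}_{\text{translation model}} \cdot \underbrace{P(E)}_{\text{language model}}
```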
Neural Machine Translation
[Diagram: input text → Encoder → Decoder → translated text]
Neural Machine Translation
• Encoder: computes a source hidden state for each input word
• Decoder: computes target hidden states and emits target words
• Loss: negative log-likelihood of the target sentence
[Slide annotates the equations with: source hidden state, target hidden state, summary vector]
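A minimal reconstruction of the standard sequence-to-sequence formulation behind these bullets (the slide's own equations did not survive extraction; the notation here is the conventional one):

```latex
h_i = f_{\mathrm{enc}}(x_i, h_{i-1})          % source hidden state
c   = h_{|x|}                                 % summary vector (final encoder state)
s_t = f_{\mathrm{dec}}(y_{t-1}, s_{t-1}, c)   % target hidden state
\mathcal{L} = -\textstyle\sum_t \log p(y_t \mid y_{<t}, x)   % negative log-likelihood loss
```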
Encoder-Decoder
[Diagram: "Maria no dió una bofetada a la bruja verde" is encoded into a single summary vector, from which "Mary did not slap the green witch <STOP>" is decoded]
The Problem with the Encoder-Decoder
• Raymond Mooney, June 26, 2016:
"You can't cram the meaning of a whole %&!$# sentence into a single $&!#* vector!"
Attention Mechanism
[Diagram: while generating "Mary did not slap the green witch <STOP>", the decoder attends to the encoder states of "Maria no dió una bofetada a la bruja verde"]
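For reference, the standard (Bahdanau-style) attention computation sketched by the diagram; this is the conventional formulation, not copied from the slide:

```latex
e_{ti}      = a(s_{t-1}, h_i)                          % alignment score for source word i at step t
\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_j \exp(e_{tj})} % soft alignment (attention) weights
c_t         = \textstyle\sum_i \alpha_{ti} h_i          % per-step context vector replacing the single summary
```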
Attention as Soft Alignment
[Figure: word alignments for "Maria no dió una bofetada a la bruja verde" ↔ "Mary did not slap the green witch"; left panel: phrase-based MT (hard alignments), right panel: neural MT (soft attention weights)]
Research Questions
• Which parts of the NMT architecture capture word structure? Which capture meaning?
• What is the division of labor between different components?
• How do different word representations help learn better morphology?
• How does the target language affect the learning of word structure?
Methodology
• Three-step procedure (sketched in code below):
1. Train a neural MT system
2. Extract feature representations using the trained model
3. Train a classifier using the extracted features and evaluate it on an extrinsic task
• Assumption: the performance of the classifier reflects the quality of the NMT representations for the given task
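A minimal sketch of steps 2-3 in Python; `encode` is a hypothetical helper returning one hidden-state vector per source word from the trained NMT model, and a logistic regression stands in for the classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Step 2: extract per-word feature representations from the trained model.
def extract_features(encode, tagged_sentences):
    X, y = [], []
    for words, tags in tagged_sentences:
        states = encode(words)          # shape: (len(words), hidden_dim)
        X.extend(states)
        y.extend(tags)
    return np.array(X), np.array(y)

# Step 3: train a classifier on the extracted features and evaluate it
# on the extrinsic task (e.g. POS or morphological tagging).
def probe(encode, train_data, test_data):
    X_train, y_train = extract_features(encode, train_data)
    X_test, y_test = extract_features(encode, test_data)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return clf.score(X_test, y_test)    # accuracy ~ representation quality
```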
Part A: Morphology
Experimental Setup
• Tasks
– Part-of-speech tagging
– Morphological tagging
• Languages
– Arabic-, German-, French-, and Czech-English
– Arabic-Hebrew (rich and similar)
– Arabic-German (rich but different)
Experimental Setup
• MT data: TED talks
• Annotated data
– Gold tags
– Predicted tags
Encoder
Effect of Word Representation
[Diagram: two representations of the word "running": a word embedding lookup vs. a character CNN over its characters]
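A minimal PyTorch sketch of the character-CNN alternative shown above; the layer sizes are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Builds a word vector from its characters: embed each character,
    convolve over the character sequence, then max-pool over positions."""
    def __init__(self, n_chars=100, char_dim=25, n_filters=500, width=3):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, n_filters, kernel_size=width, padding=1)

    def forward(self, char_ids):            # (batch, word_len) character ids
        x = self.char_emb(char_ids)         # (batch, word_len, char_dim)
        x = x.transpose(1, 2)               # (batch, char_dim, word_len)
        x = torch.relu(self.conv(x))        # (batch, n_filters, word_len)
        return x.max(dim=2).values          # (batch, n_filters): the word vector
```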
Effect of Word Representation

          POS Accuracy        BLEU
          Word     Char       Word    Char
Ar-En     89.62    95.35      24.7    28.4
Ar-He     88.33    94.66       9.9    10.7
De-En     93.54    94.63      29.6    30.4
Fr-En     94.61    95.55      37.8    38.8
Cz-En     75.71    79.10      23.2    25.4

• Character-based models generate better representations for POS tagging
– Especially with richer morphological systems
• Character-based models also improve translation quality
Impact of Word Frequency
[figure]
Impact of Tag Frequency
[figure]
Comparing Specific Tags
[Figure: word-based vs. char-based representations; tags shown include NN, NNP, Det]
Effect of Encoder Depth
• NMT models can be very deep
– Google Translate: 8 encoder/decoder layers
– Zhou+2016: 16 layers
• What kind of information is learned at each layer?
• We analyzed a 2-layer encoder
– Extract representations from different layers for training the classifier (see the sketch below)
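A sketch of where the per-layer representations come from, assuming a hypothetical 2-layer LSTM encoder in PyTorch; each returned tensor can be probed separately with the classifier from the methodology sketch:

```python
import torch.nn as nn

# Hypothetical 2-layer encoder; layer 0 = word embeddings,
# layers 1 and 2 = successive LSTM outputs.
class Encoder(nn.Module):
    def __init__(self, vocab=50_000, dim=500):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn1 = nn.LSTM(dim, dim, batch_first=True)
        self.rnn2 = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, word_ids):              # (batch, seq_len)
        layer0 = self.emb(word_ids)           # word embeddings
        layer1, _ = self.rnn1(layer0)         # first LSTM layer
        layer2, _ = self.rnn2(layer1)         # second LSTM layer
        return layer0, layer1, layer2         # probe each layer separately
```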
Effect of Encoder Depth
[Plot: POS tagging accuracy by encoder layer]
• Layer 1 > Layer 2 > Layer 0
• But deeper models translate better
• Is layer 2 learning more about semantics? More on that later…
Effect of Target Language
• How does the target language affect the learned source language representations?
• Experiment:
– Fix the source side and train NMT models on different target languages
– Compare the learned representations on POS/morphological tagging
Effect of Target Language
• Source language: Arabic
• Target languages: English, German, Hebrew, Arabic
• Poorer morphology on the target side, better source-side representations for morphology
• Higher BLEU ≠ better representations
Decoder
Encoder vs. Decoder

                     POS Accuracy
                     Encoder   Decoder
Arabic ↔ English     89.6      43.9
German ↔ English     93.5      53.6
Czech ↔ English      75.7      36.3

• The decoder learns very little about target language morphology
• Why?
Effect of Attention
[Diagram: attention between "Maria no dió una bofetada a la bruja verde" and "Mary did not slap the green witch <STOP>"]
Effect of Attention

                    With        Without     With most       Only most
                    attention   attention   attended word   attended word
English → German    44.55       50.26       60.34           43.43
English → Czech     36.35       42.09       48.64           36.36

• Removing attention improves decoder representations
– Attention removes burden from the decoder; the decoder does not need to learn as much about target words
• Concatenating the most attended word improves performance
– Encoder representations are helpful for target morphology
• But using only the encoder side is not as good
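A sketch of the "most attended word" variants, assuming decoder states, encoder states, and attention weights extracted from the trained model; the names and shapes here are illustrative:

```python
import numpy as np

def most_attended_features(dec_states, enc_states, attn, concat=True):
    """dec_states: (T, d) decoder states; enc_states: (S, d) encoder states;
    attn: (T, S) attention weights. Returns features for the target-side
    classifier: each decoder state concatenated with (or replaced by) the
    encoder state that received the most attention at that step."""
    best = attn.argmax(axis=1)              # most attended source position per step
    attended = enc_states[best]             # (T, d)
    if concat:                              # "with most attended word": (T, 2d)
        return np.concatenate([dec_states, attended], axis=1)
    return attended                         # "only most attended word": (T, d)
```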
Summary
• The NMT encoder learns good representations for morphology
– Character-based representations are much better than word-based
– The target language impacts source-side representations
– Layer 1 > Layer 2 > Layer 0
• The decoder learns poor target-side representations
– The attention model helps the decoder exploit source representations
Part B: Semantics
Recap
• We saw:
– NMT representations from layer 1 are better than layer 2 (and layer 0) for POS and morphological tagging
– Deeper networks lead to better translation performance
• Questions:
– What is captured in higher layers?
– How is semantic information represented?
• Let's apply a similar methodology to a semantic task
Semantic Tagging
• Lexical semantics
– Abstraction over POS tagging
– Language-neutral, aimed at multi-lingual semantic parsing
• Some examples:
– Determiners: every, no, some
– Comma as conjunction, disjunction, or apposition
– Role nouns vs. entity nouns
– Comparison adjectives: comparative, superlative, equative
Experimental Setup
• Semantic tagging data
– 66 fine-grained tags, 13 coarse categories

            Train     Dev       Test
Sentences   42.5K     6.1K      12.2K
Tokens      937.1K    132.3K    265.5K

• MT data: UN corpus
– Multi-parallel
– 11M sentences
– Arabic, Chinese, English, French, Spanish, Russian
Baselines

System                           Accuracy
Most frequent tag                82.0
Unsupervised embeddings          81.1
Word2Tag encoder-decoder         91.4
State of the art (Bjerva+16)     95.5
Effect of Network Depth
[Plot: semantic tagging accuracy by encoder layer; the most-frequent-tag baseline shown as a horizontal line]
• Layer 0 below baseline
• Layer 1 >> layer 0
• Layer 4 > layer 1
• Similar trends for coarse tags
Effect of Target Language
[Plot: semantic tagging accuracy by target language, with the most-frequent-tag baseline]
• No impact on semantic tagging
• But large impact on translation:

         BLEU
En-Ar    32.7
En-Es    49.1
En-Fr    38.5
En-Ru    34.2
En-Zh    32.1
Analyzing Specific Tags
• Layer 4 vs. layer 1
– Blue: distinguishing among coarse tags
– Red: distinguishing among fine-grained tags within a coarse category
• Layer 4 > layer 1, especially with:
– Discourse relations (DIS)
– Properties of nouns (ENT)
– Events, tenses (EVE, TNS)
– Logic relations and quantifiers (LOG)
– Comparative constructions (COM)
Analyzing Specific Tags
• Negative examples:
– Modality (MOD): a closed class ("no", "not", "should", "must", etc.)
– Named entities (NAM): OOVs? A neural MT limitation?
Semantic Tags vs. POS Tags

     Layer:   0      1      2      3      4
Uni  POS      87.9   92.0   91.7   91.8   91.9
     Sem      81.8   87.8   87.4   87.6   88.2
Bi   POS      87.9   93.3   92.9   93.2   92.8
     Sem      81.9   91.3   90.8   91.9   91.9

• Higher layers improve semantic tagging but not POS tagging
• Layer 1 is best for POS; layer 4 is best for semantic tagging
• Similar trends with a bidirectional encoder
Summary
• Neural MT representations contain useful information about word form and meaning
– Lower layers focus on POS/morphology
– Higher layers focus on (lexical) semantics
• The target language does not affect semantic tagging quality
Future Work
• Other neural MT architectures
– Word representations; multi-lingual models
• Other linguistic properties
– Syntactic and semantic relations, complex structures
• Improving neural MT
– Multi-task learning
• Analyzing representations in other neural models
– End-to-end speech recognition