
Question Answering and Machine Comprehension with Neural Attention

Minjoon Seo
PhD Student, Computer Science & Engineering, University of Washington

Two End-to-End Question Answering Systems with Neural Attention

• Bidirectional Attention Flow (BiDAF)
  • On the Stanford Question Answering Dataset and the CNN/DailyMail Cloze Test

• Query-Reduction Networks (QRN)
  • On bAbI QA, bAbI dialog, and DSTC2 datasets

Two Question Answering Systems with Neural Attention

• Bidirectional Attention Flow (BiDAF)
  • On the Stanford Question Answering Dataset and the CNN/DailyMail Cloze Test

• Query-Reduction Networks (QRN)
  • On bAbI QA and dialog datasets

Question Answering Task (Stanford Question Answering Dataset, 2016)

Q: Which NFL team represented the AFC at Super Bowl 50?
A: Denver Broncos

Why Neural Attention?

Q: Which NFL team represented the AFC at Super Bowl 50?

Attention allows a deep learning architecture to focus on the phrase of the context that is most relevant to the query, in a differentiable manner.
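To make "differentiable" concrete, here is a minimal dot-product attention sketch in NumPy (an illustrative toy, not BiDAF's similarity function, which is described later): relevance scores become a softmax distribution over context positions, so the focus itself can be trained by gradient descent.

```python
# Toy soft attention: differentiable focus over context positions (illustrative only).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(context, query):
    """context: (T, d) context vectors; query: (d,) query vector."""
    scores = context @ query            # relevance score per context position
    weights = softmax(scores)           # differentiable "focus" over positions
    return weights @ context, weights   # weighted summary of the context

context = np.random.randn(5, 8)         # T=5 context words, d=8
query = np.random.randn(8)
summary, weights = attend(context, query)
print(weights)                           # peaks at the most query-relevant position
```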

Our Model: Bi-directional Attention Flow (BiDAF)

[Overview diagram: the context "Barack Obama is the president of the U.S." and the query "Who leads the United States?" pass through attention and modeling layers; an MLP + softmax predicts the answer-span indices, here start index 0 and end index 1, i.e. "Barack Obama".]

(Bidirectional) Attention Flow

[BiDAF architecture diagram. From bottom to top: Character Embed Layer (Char-CNN) and Word Embed Layer (GloVe) over context words x1 … xT and query words q1 … qJ; Phrase Embed Layer (LSTMs) producing h1 … hT and u1 … uJ; Attention Flow Layer with Context2Query (softmax) and Query2Context (max + softmax) attention over h and u, producing g1 … gT; Modeling Layer (LSTMs) producing m1 … mT; Output Layer (Dense + softmax for Start, LSTM + softmax for End).]

Char/Word Embedding Layers

[BiDAF architecture diagram repeated, highlighting the Character and Word Embed Layers (Char-CNN and GloVe).]

Character and Word Embedding

• Word embedding is fragile against unseen words
• Character embedding can't easily learn the semantics of words
• Use both!

• Character embedding as proposed by Kim (2015), sketched below

[Diagram: the characters of "Seattle" pass through a CNN and max pooling; the result is concatenated with the word embedding of "Seattle" to form the final embedding vector.]
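A minimal NumPy sketch of the character-CNN path described above, under assumed sizes (16-d character vectors, 50 filters of width 3, and a 100-d word vector standing in for GloVe); the lookup tables here are random stand-ins:

```python
# Char-CNN embedding sketch: convolve over character vectors, max-pool, concat with word vector.
import numpy as np

rng = np.random.default_rng(0)
char_emb = {c: rng.standard_normal(16) for c in "abcdefghijklmnopqrstuvwxyz"}
word_emb = {"seattle": rng.standard_normal(100)}            # stand-in for a GloVe vector
filters = rng.standard_normal((50, 3, 16))                  # 50 filters of width 3

def char_cnn(word):
    chars = np.stack([char_emb[c] for c in word.lower()])   # (len, 16)
    windows = np.stack([chars[i:i + 3] for i in range(len(chars) - 2)])  # (len-2, 3, 16)
    feats = np.einsum("wkc,fkc->wf", windows, filters)       # convolution responses
    return feats.max(axis=0)                                  # max-pool over positions -> (50,)

def embed(word):
    # The character path still produces a useful vector for words missing from word_emb.
    return np.concatenate([word_emb[word.lower()], char_cnn(word)])

print(embed("Seattle").shape)    # (150,)
```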

Phrase Embedding Layer

[BiDAF architecture diagram repeated, highlighting the Phrase Embed Layer.]

Phrase Embedding Layer

• Inputs: the char/word embeddings of the query and context words
• Outputs: word representations aware of their neighbors (phrase-aware words)

• Apply a bidirectional RNN (LSTM) to both the query and the context (see the sketch below)
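An illustrative sketch of this layer, assuming PyTorch and placeholder sizes: a single bidirectional LSTM applied to both the context and the query embeddings, so each output vector reflects its neighbors.

```python
# Phrase embedding sketch: a shared bidirectional LSTM over context and query word vectors.
import torch
import torch.nn as nn

d_in, d_hidden = 150, 100
phrase_lstm = nn.LSTM(d_in, d_hidden, bidirectional=True, batch_first=True)

context_words = torch.randn(1, 20, d_in)   # (batch, T, d): char+word embeddings of the context
query_words = torch.randn(1, 6, d_in)      # (batch, J, d): char+word embeddings of the query

H, _ = phrase_lstm(context_words)          # (1, T, 2*d_hidden) phrase-aware context vectors
U, _ = phrase_lstm(query_words)            # (1, J, 2*d_hidden) phrase-aware query vectors
print(H.shape, U.shape)
```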


Attention Layer

[BiDAF architecture diagram repeated, highlighting the Attention Flow Layer (Context2Query and Query2Context attention).]

Attention Layer

• Inputs: the phrase-aware context and query words
• Outputs: query-aware representations of the context words

• Context-to-query attention: for each (phrase-aware) context word, choose the most relevant word among the (phrase-aware) query words
• Query-to-context attention: choose the context words that are most relevant to any of the query words (both directions are sketched below)
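A minimal NumPy sketch of both directions, using a plain dot product for the similarity matrix (BiDAF's actual similarity is a trainable function of the context vector, the query vector, and their elementwise product); the final concatenation follows the query-aware context representation used in the paper.

```python
# C2Q and Q2C attention sketch over phrase-aware context H and query U.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

H = np.random.randn(20, 200)     # phrase-aware context, T x d
U = np.random.randn(6, 200)      # phrase-aware query,   J x d

S = H @ U.T                      # similarity matrix, T x J (dot product for illustration)

# Context-to-query: for each context word, a distribution over query words.
a = softmax(S, axis=1)           # T x J
U_att = a @ U                    # T x d: attended query vector per context word

# Query-to-context: attend to the context words most relevant to any query word.
b = softmax(S.max(axis=1))       # T
h_att = b @ H                    # d: single attended context vector
H_att = np.tile(h_att, (H.shape[0], 1))   # broadcast over context positions

G = np.concatenate([H, U_att, H * U_att, H * H_att], axis=1)   # query-aware context words
print(G.shape)                   # (20, 800)
```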


Context-to-Query Attention (C2Q)

Q: Who leads the United States?
C: Barack Obama is the president of the USA.

For each context word, find the most relevant query word.

Query-to-Context Attention (Q2C)

C: While Seattle's weather is very nice in summer, its weather is very rainy in winter, making it one of the most gloomy cities in the U.S. LA is …

Q: Which city is gloomy in winter?

Modeling Layer

[BiDAF architecture diagram repeated, highlighting the Modeling Layer.]

Modeling Layer

• Attention layer: models interactions between the query and the context
• Modeling layer: models interactions among the (query-aware) context words via an RNN (LSTM)

• Division of labor: the attention and modeling layers each focus solely on their own task
• We show experimentally that this leads to better results than intermixing attention and modeling

Output Layer

[BiDAF architecture diagram repeated, highlighting the Output Layer (Dense + softmax for the start index, LSTM + softmax for the end index).]

Training

• Minimize the negative log probabilities of the true start index and the true end index, averaged over all examples:

  L(θ) = −(1/N) Σᵢ [ log p¹(yᵢ¹) + log p²(yᵢ²) ]

  yᵢ¹ = true start index of example i
  yᵢ² = true end index of example i
  p¹ = probability distribution over the start index
  p² = probability distribution over the end index
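A sketch of this objective assuming PyTorch, with random placeholder logits standing in for the output layer's start/end scores:

```python
# Span loss sketch: cross-entropy on predicted start/end distributions vs. true indices.
import torch
import torch.nn.functional as F

T = 20                                    # context length
start_logits = torch.randn(4, T)          # placeholder start scores for a batch of 4 examples
end_logits = torch.randn(4, T)            # placeholder end scores
y_start = torch.tensor([0, 3, 7, 2])      # true start indices y_i^1
y_end = torch.tensor([1, 5, 7, 4])        # true end indices   y_i^2

# cross_entropy = softmax + negative log likelihood of the true index, averaged over the batch
loss = F.cross_entropy(start_logits, y_start) + F.cross_entropy(end_logits, y_end)
print(loss.item())
```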

Previous Work

• Using neural attention as a controller (Xiong et al., 2016)
• Using neural attention within an RNN (Wang & Jiang, 2016)
• Most of these attention mechanisms are uni-directional

• BiDAF (our model)
  • uses neural attention as a layer,
  • is separated from the modeling part (RNN),
  • and is bidirectional

Image Classifier and BiDAF

VGG-16 | BiDAF (ours)

[Side-by-side comparison of the VGG-16 image-classification architecture and the BiDAF architecture (repeated from above).]

Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016)

• The most popular articles from Wikipedia
• Questions and answers from Turkers
• 90k train, 10k dev, ? test (hidden)
• The answer must lie in the context
• Two metrics: Exact Match (EM) and F1

SQuAD Results (http://stanford-qa.com), as of 12pm today

SQuAD Results

System                                   EM     F1
Stanford [1] (baseline)                  40.4   51.0
IBM [2]                                  62.5   71.0
CMU [3]                                  62.5   73.3
Singapore Management [4] (ensemble)      67.9   77.0
IBM Research (ensemble)                  68.2   77.2
Salesforce Research [6] (ensemble)       71.6   80.4
Microsoft Research Asia (ensemble)       72.1   79.7
Ours (ensemble)                          73.3   81.1

[1] Rajpurkar et al. (2016)  [2] Yu et al. (2016)  [3] Yang et al. (2016)  [4] Wang & Jiang (2016)  [6] Xiong et al. (2016)

Ablations on dev data

[Bar chart of EM and F1 on the dev set for: No Char Embedding, No Word Embedding, No C2Q Attention, No Q2C Attention, Dynamic Attention, and the Full Model (y-axis 50–80).]

Interactive Demo

http://allenai.github.io/bi-att-flow/demo

Attention Visualizations

[Attention map for Q: "How many natural reserves are there in Warsaw?" over the passage: "There are 13 natural reserves in Warsaw – among others, Bielany Forest, Kabaty Woods, Czerniaków Lake. About 15 kilometres (9 miles) from Warsaw, the Vistula river's environment changes strikingly and features a perfectly preserved ecosystem, with a habitat of animals that includes the otter, beaver and hundreds of bird species. There are also several lakes in Warsaw – mainly the oxbow lakes, like Czerniaków Lake, the lakes in the Łazienki or Wilanów Parks, Kamionek Lake. There are lot of small lakes in the parks, but only a few are permanent – the majority are emptied before winter to clean them of plants and sediments." The most-attended context words per question word include "hundreds, few, among, 15, several, only, 13, 9", "natural, of", "reserves", "are, are, are, are, are, includes", "Warsaw, Warsaw, Warsaw", and "winter, species".]

[Attention map for Q: "Where did Super Bowl 50 take place?" over the passage: "Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50." The most-attended context words per question word include "at, the, at, Stadium, Levi, in, Santa, Ana", "Super, Super, Super, Super, Super", "Bowl, Bowl, Bowl, Bowl, Bowl", "50", and "initiatives".]

Embedding Visualization at Word vs Phrase Layers

[Visualization of embeddings of "May"/"may" and other months (January, September, August, July). At the word layer the month "May" and the modal "may" share one embedding; at the phrase layer, contexts such as "effect and may result in", "the state may not aid", "of these may be more" separate from "Opening in May 1852 at", "debut on May 5 ,", "from 28 January to 25", and "but by September had been".]

How does it compare with feature-based models?

CNN/DailyMail Cloze Test (Hermann et al., 2015)

• Cloze test (predicting missing words)
• Articles from CNN/DailyMail
• Human-written summaries
• Missing words are always entities
• CNN: 300k article–query pairs
• DailyMail: 1M article–query pairs

CNN/DailyMail Cloze Test Results

Some limitations of SQuAD

Two Question Answering Systems with Neural Attention

• Bidirectional Attention Flow (BiDAF)
  • On the Stanford Question Answering Dataset and the CNN/DailyMail Cloze Test

• Query-Reduction Networks (QRN)
  • On bAbI QA and dialog datasets

Reasoning Question Answering

Dialog System

U: Can you book a table in Rome in Italian cuisine?
S: How many people in your party?
U: For four people please.
S: What price range are you looking for?

Dialog Task vs QA

• A dialog system can be viewed as a QA system:
  • The last user utterance is the query
  • All previous conversation is the context for the query
  • The system's next response is the answer to the query

• It also poses a few unique challenges:
  • A dialog system requires state tracking
  • A dialog system needs to look at multiple sentences in the conversation
  • Building an end-to-end dialog system is more challenging

Our Approach: Query-Reduction

<START> Sandra got the apple there. Sandra dropped the apple. Daniel took the apple there. Sandra went to the hallway. Daniel journeyed to the garden.

Q: Where is the apple?

Reduced query: Where is the apple? → Where is Sandra? → Where is Sandra? → Where is Daniel? → Where is Daniel? → Where is Daniel? → garden

A: garden

Query-Reduction Networks

• Reduce the query into an easier-to-answer query over the sequence of state-changing triggers (sentences), in vector space

[QRN unrolled over the story "Sandra got the apple there." / "Sandra dropped the apple." / "Daniel took the apple there." / "Sandra went to the hallway." / "Daniel journeyed to the garden.", with input query "Where is the apple?". The hidden state at each step holds the reduced query ("Where is Sandra?", "Where is Sandra?", "Where is Daniel?", "Where is Daniel?", "Where is Daniel?"); intermediate outputs are ∅ and the final state is decoded to the answer "garden".]

QRN Cell

[Cell diagram: the sentence vector x_t and query vector q_t enter an update function α, producing the update gate z_t, and a reduction function ρ, producing the candidate reduced query h̃_t; the new reduced query (hidden state) is h_t = z_t × h̃_t + (1 − z_t) × h_{t−1}.]
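A hedged NumPy sketch of one QRN layer run over a story; the exact parameterizations of the update function α and reduction function ρ here are simplified assumptions, but the gating recurrence matches the cell above.

```python
# QRN cell sketch: a scalar update gate decides how much each sentence rewrites the reduced query.
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_alpha = rng.standard_normal((1, 2 * d))    # assumed parameters of the update function alpha
W_rho = rng.standard_normal((d, 2 * d))      # assumed parameters of the reduction function rho

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def qrn_cell(x_t, q_t, h_prev):
    xq = np.concatenate([x_t, q_t])
    z_t = sigmoid(W_alpha @ xq)              # update gate (local attention), scalar in [0, 1]
    h_cand = np.tanh(W_rho @ xq)             # candidate reduced query
    return z_t * h_cand + (1.0 - z_t) * h_prev

query = rng.standard_normal(d)               # e.g. "Where is the apple?" in vector space
h = np.zeros(d)                              # initial reduced query
for x_t in rng.standard_normal((5, d)):      # one vector per story sentence
    h = qrn_cell(x_t, query, h)              # the query is progressively reduced
print(h)                                     # final reduced query, decoded to the answer
```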

Characteristics of QRN

• The update gate can be considered local attention
  • The QRN chooses to consider or ignore each candidate reduced query
  • The decision is made locally (as opposed to global softmax attention)

• A subclass of recurrent neural networks (RNNs)
  • Two inputs, a hidden state, and a gating mechanism
  • Able to handle sequential dependency (attention cannot)

• The simpler recurrent update enables parallelization over time
  • The candidate hidden state (reduced query) is computed from the inputs only
  • The hidden state can be explicitly computed as a function of the inputs

Parallelization

• The candidate reduced queries are computed from the inputs only, so they can be computed in parallel
• The hidden state can be explicitly expressed as a geometric sum of the previous candidate hidden states (reconstructed below)
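A reconstruction of the closed form behind this claim, assuming the gating recurrence above with h_0 = 0 and a scalar gate z_i:

```latex
h_t \;=\; \sum_{i=1}^{t} \Big( \prod_{j=i+1}^{t} (1 - z_j) \Big)\, z_i \, \tilde{h}_i
```

Because each gate z_i and candidate reduced query h̃_i depends only on the inputs (x_i, q), every term can be computed independently and combined with cumulative products and sums, which is what allows parallelization over time.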


Characteristics of QRN

• The update gate can be considered local attention
• A subclass of recurrent neural networks (RNNs)
• The simpler recurrent update enables parallelization over time

QRN sits between neural attention mechanisms and recurrent neural networks, taking advantage of both paradigms.

bAbI QA Dataset

• 20 different tasks
• 1k story–question pairs for each task (10k also available)
• Synthetically generated
• Many questions require looking at multiple sentences
• For end-to-end systems supervised by answers only

What's different from SQuAD?

• Synthetic
• Requires more than lexical/syntactic understanding
• Different kinds of inference: induction, deduction, counting, path finding, etc.
• Reasoning over multiple sentences
• An interesting testbed towards developing complex QA (and dialog) systems

bAbI QA Results (1k)

[Bar chart of average error (%) for LSTM, DMN+, MemN2N, GMemN2N, and QRN (Ours); y-axis 0–60.]

bAbI QA Results (10k)

[Bar chart of average error (%) for MemN2N, DNC, GMemN2N, DMN+, and QRN (Ours); y-axis 0–4.5.]

Dialog Datasets

• bAbI Dialog Dataset
  • Synthetic
  • 5 different tasks
  • 1k dialogs for each task

• DSTC2* Dataset
  • Real data
  • Evaluation metric differs from the original DSTC2: response generation instead of state tracking
  • Each dialog is 800+ utterances
  • 2407 possible responses

bAbI Dialog Results (OOV)

[Bar chart of average error (%) for MemN2N, GMemN2N, and QRN (Ours); y-axis 0–35.]

DSTC2* Dialog Results

[Bar chart of average error (%) for MemN2N, GMemN2N, and QRN (Ours); y-axis 0–70.]

bAbI QA Visualization

[Visualization of zˡ, the local attention (update gate) at each layer l.]

DSTC2 (Dialog) Visualization

[Visualization of zˡ, the local attention (update gate) at each layer l.]

Conclusion

• Presented two novel approaches to QA tasks using neural attention

• Bidirectional Attention Flow: uses attention as a layer, in both directions (context-to-query and query-to-context)
• Query-Reduction Networks: a sequential model that takes advantage of both attention and RNNs to reason over multiple sentences

Thanks!

Why do we need attention?

• RNNs have a long-term dependency problem
  • Vanishing gradients (Pascanu et al., 2013)
  • Inherently unstable over a long period of time (Weston et al., 2016)

• Attention provides shortcut access to relevant information
  • It directly retrieves the context vector from a distant location

• Attention is critical to most modern sequence models
  • Machine translation
  • Question answering, machine comprehension

Neural Attention in Sequence Modeling (Bahdanau et al., 2015)

• Apply an RNN to the context vectors
• Apply an RNN to the query vectors
• At each time step, use neural attention to soft-select a single context vector
• Use the selected context vector, along with the current query vector and the current hidden state, to obtain the next hidden state (sketched below)
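An illustrative NumPy sketch of one such step (not Bahdanau et al.'s exact parameterization): attention soft-selects a context vector, which is combined with the current query vector and the previous hidden state to produce the next hidden state.

```python
# One attention-augmented recurrent step over encoded context vectors (illustrative toy).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 16
contexts = rng.standard_normal((10, d))        # RNN outputs over the context
W = rng.standard_normal((d, 3 * d))            # assumed recurrent parameters

def step(h_prev, q_t):
    scores = contexts @ h_prev                 # attention scores from the current state
    c_t = softmax(scores) @ contexts           # soft-selected context vector
    return np.tanh(W @ np.concatenate([c_t, q_t, h_prev]))   # next hidden state

h = np.zeros(d)
for q_t in rng.standard_normal((6, d)):        # RNN-style pass over the query vectors
    h = step(h, q_t)
print(h.shape)
```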
