From Simple to Complex QA
Eduard Hovy, CMU Language Technologies Institute
www.cs.cmu.edu/~hovy

Webclopedia QA, 2003
• Where are zebras most likely found? — in the dictionary
• Where do lobsters like to live? — on the table
• How many people live in Chile? — nine
Webclopedia (Hovy et al. 2001)
• What is an invertebrate? — Dukakis
Basic simple factoid QA
• Identify keywords from Q
• Build (Boolean) query for IR
• Retrieve texts using IR
• Rank texts/passages
• Find specified Q type
• Move A patterns over text and score each position
• Rank windows; return top N

[Pipeline diagram: Input Q → 1M documents → 3000 sentences → 50 candidates → 5 answers → A list; Corpus: 30%, +Web: add 10%]

Example answer patterns:
…X was born in <YEAR>… …X was born on <DATE>… …X (<YEAR>–<YEAR>)…
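The extraction step above ("move A patterns over text and score each position; rank windows; return top N") can be sketched in a few lines. This is a hedged illustration, not Webclopedia's actual code: the patterns, weights, and the `extract_answers` helper are invented for the example.

```python
import re

# Illustrative surface patterns for a "When was X born?" question type.
# "{x}" is filled with the question's focus entity; weights are made up.
PATTERNS = [
    (r"{x} was born in (\d{{4}})", 1.0),
    (r"{x} was born on (\d{{1,2}} \w+ \d{{4}})", 1.0),
    (r"{x} \((\d{{4}})[–-]\d{{4}}\)", 0.8),   # e.g. "X (1881–1973)"
]

def extract_answers(x, sentences, top_n=5):
    """Move each pattern over every candidate sentence, score matches,
    rank them, and return the top-N answer strings."""
    candidates = []
    for sent in sentences:
        for tmpl, weight in PATTERNS:
            pat = re.compile(tmpl.format(x=re.escape(x)))
            for m in pat.finditer(sent):
                candidates.append((weight, m.group(1)))
    candidates.sort(reverse=True)
    return [ans for _, ans in candidates[:top_n]]

print(extract_answers("Picasso", ["Picasso was born in 1881.",
                                  "Pablo Picasso (1881–1973) was a painter."]))
```

The ranking here is a bare weight sort; a real system would also fold in the IR passage score from the retrieval step.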
Where is the Answer? — Progress since 2003?
Typical QA format:
Question: Q
Context: “wwwww…w”
A:
Either the Q context provides the A
1. nearby (= n-word window) context
2. distant (= doc-level) context
Or not at all… so you have to use background info
3. from the training data
4. from logical derivation / reasoning rules / procedure
Either…
When all info needed to get the A is present in the Q context,
…then some form of surface and simple type matching + sub-A composition is enough
—> Ultimately, just do [nested] simple QA

Or…
But when getting the A requires information not in the Q context (like background info, calculation, etc.),
…then you are in trouble: this is not standardized, hence impossible to evaluate
—> No complex QA!?
Outline
1. A in the nearby context
2. A in the distant context
3. A hidden in the training data
4. A only by reasoning
Option 1: A in nearby context
• Build and use short patterns or a rich LM
• Tons of work since 2000 on pattern learning and generalization, QA typologies, etc.
• Numerous QA datasets (TREC, SQuAD, CNN…)
• Many QA competitions (SEMEVAL…)
So where’s the limit?
You can do a LOT with patterns. Did you know you are an expert on the Panama Canal?

Blah Panama Canal blah blah Panama blah Pres. Roosevelt blah USA blah blah blah blah blah 10 years blah until 1914 blah blah blah blah 51 miles blah blah blah blah blah blah blah blah blah blah blah blah blah 8 to 10 hours blah blah blah blah Gatun Lake blah

When was the Panama Canal completed? How long is the Panama Canal? How long did it take to build the Panama Canal? How long does it take to cross the Panama Canal? What is the lake in the Panama Canal called? Which US President enabled the Panama Canal? Which oceans does the Panama Canal connect? In your training data, you have surely seen “Panama Canal” with only two ocean names…
So where’s the limit?
A corpus to test the power of ngram/pattern QA models
• CLOTH (Xie, Lai, Dai, Hovy, EMNLP 2018)
 – Large-scale Cloze test dataset
 – Created by English teachers in China for English exams (Middle and High school levels)
 – After cleanup: 7k passages; 99k questions (2/3 removed)
 – Dropped words and word options carefully created by teachers: highly nuanced alternatives
 – Tests knowledge of grammar, vocabulary, reasoning
• How well do state-of-the-art computational models do compared to humans?
 – We test using a 1-billion-word language model
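The evaluation idea can be illustrated with a toy stand-in for the LM: score each cloze option by the probability the model assigns to the completed sentence, and pick the argmax. The tiny smoothed bigram model, its three-sentence "corpus", and the `answer_cloze` helper below are invented; the actual experiments used a 1-billion-word LM.

```python
from collections import Counter

# Toy "training corpus" and bigram/unigram counts standing in for a real LM.
corpus = "he walked to school . she walked to school . he ran home .".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def score(tokens):
    """Add-one-smoothed product of bigram probabilities (toy LM score)."""
    s = 1.0
    for a, b in zip(tokens, tokens[1:]):
        s *= (bigrams[(a, b)] + 1) / (unigrams[a] + len(unigrams))
    return s

def answer_cloze(template, options):
    """Pick the option whose filled-in sentence the LM scores highest."""
    return max(options, key=lambda o: score(template.replace("_", o).split()))

print(answer_cloze("he _ to school .", ["walked", "slept", "sang"]))
```

Short-range options like this fall to an n-gram model; the teacher-designed long-distance CLOTH questions are exactly the ones where such a model fails.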
[Table (Xie et al. 2018): percentages of test examples, Middle/High school levels, by question type — tense, voice, prepositions; local content words; copy/paraphrase words; content words with long-distance dependencies]
QA system results
• Even a 1B-LM still lags behind human performance
• Increasing the context length for the 1B-LM does not help
• However: human-created questions are different:
(Xie et al. 2018)
(AR: Attention Reader)
(This was pre-BERT!)
Conclusion for option 1
For factoid QA types that obey patterns, if the A is close enough, and you have enough training data…
…you will always learn good-enough word-combination patterns to connect Q parameters <–> Q context material <–> A
(If you haven’t seen the necessary word combinations, you won’t ever be able to answer the Q)
Option 2: A in distant context
• Still use some form of matching Q and A
• Need a more-sophisticated and longer-distance type of ‘pattern’
Making matching more complex: RACE: A better testbed
• RACE: ReAding Comprehension dataset from Examinations (Lai, Xie, Liu, Yang, Hovy, EMNLP 2017)
• Collected from Chinese middle and high school exams that evaluate human students’ English reading comprehension ability
 – Designed by human experts: ensures quality and broad topic coverage
 – Substantially more difficult than existing QA datasets (but RACE-M easier than RACE-H)
 – About 4/5 of source material filtered out to remove duplicates, incorrect format, etc.
• After cleaning: 27,933 passages; 97,687 questions
(Lai et al. 2017)
Toward ‘reasoning’: types of more-complex matching
• Paraphrasing Qs: test language ability
• Detail Qs: identify and match details of a thing
• Attitude Qs: find opinions/attitudes of the author towards something (‘sentiment’)
• Whole-picture Qs: understand the entire story (multi-sentence)
• Summarization Qs: understand the point (multi-sentence)
(Lai et al. 2017)
Increasing reasoning
Comparison with other QA datasets
• Reasoning questions: 59.2% of RACE; 20.5% of SQuAD
• Processing types:
 – Word matching: exact match
 – Paraphrasing: paraphrase or entailment
 – Single-sent reasoning: incomplete info or conceptual overlap
 – Multi-sent reasoning: synthesizing information from multiple sentences
 – Insufficient/Ambiguous: no A, or A is not unique
(Lai et al. 2017)
Comparing QA algorithms
• Baselines:
 – Sliding Window: TF-IDF based matching algorithm
 – Stanford Attention Reader (AR) and Gated Attention Reader (early-2018 state-of-the-art neural models)
• RACE has more ‘semantics’ (= requires more ‘reasoning’) than other corpora:
 – higher human ceiling
 – harder for neural models
(Lai et al. 2017)
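A minimal sketch of a Sliding Window baseline of the kind listed above: pick the option whose (question + option) words best overlap some window of the passage, with IDF-style weights favoring rare words. The window size, weighting formula, and helper names are assumptions for illustration, not the exact RACE baseline.

```python
import math
from collections import Counter

def sliding_window_score(passage_tokens, target_words, window=10):
    """Best overlap score between target_words and any passage window,
    weighting rare passage words higher (TF-IDF-like)."""
    counts = Counter(passage_tokens)
    n = len(passage_tokens)
    def weight(w):
        return math.log(1 + n / counts[w]) if counts[w] else 0.0
    best = 0.0
    for i in range(max(1, n - window + 1)):
        span = set(passage_tokens[i:i + window])
        best = max(best, sum(weight(w) for w in target_words & span))
    return best

def choose(passage, question, options):
    """Return the multiple-choice option with the best window score."""
    toks = passage.lower().split()
    qwords = set(question.lower().split())
    return max(options,
               key=lambda o: sliding_window_score(toks, qwords | set(o.lower().split())))

p = "the canal opened in 1914 after ten years of construction"
print(choose(p, "when did the canal open ?", ["1914", "1920"]))
```

A baseline like this handles the "word matching" question type and little else, which is why it separates RACE's reasoning questions from its matching questions.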
Matching type performance
• Turkers and Sliding Window are good at simple matching questions
• Surprisingly, Stanford AR does not have better performance on matching questions
(Lai et al. 2017)
Conclusion for option 2
When the A is distant, or requires more-sophisticated matching/‘reasoning’ (not just a simple word-string/language model),
then attention-based neural models can do some of it, but still fail with the harder parts
Option 3: A ‘hidden’ in training data
Sometimes the Q context does not contain the A at all… but you can STILL get the right A! (And even get it without the Q itself!)
• Corrupted ngrams and other SQuAD perturbations (Jia and Liang, EMNLP 2017)
• Necessity of Q context or even of Q itself (Kaushik and Lipton, EMNLP 2018, Best Short Paper award)

Example: Q only
Question: shin kanemaru, the gravel-voiced back-room boss who died on thursday aged 81, goes down in history as japan’s most corrupt post-war politician after ___________
Passage: ... glynis bc-nj-zimmer-profile-2 takes-ny trahane fumio yasuhiro dragneal hadon bjorkman/max ... seventh-largest embarrased jeopardy hilariously masahisa haibara bajram 8-to-24 duke/meredith acceding ... koidu iraqs 2:32:21 //www.ironmanlive.com/ sagawa kyubin dean internatinoal 90-meter kakuei tanaka seven-paragraph 577,610 wendover golf-lpga-jpn partner, un-appointed ue mazzei canada-u.s.
Answer: kakuei tanaka
(Kaushik and Lipton, EMNLP 2018)
Do you actually need the context?
• Research goal:
 – How strong are models that see the Q only?
 – What about models that see the Q context passage only?
 – How do we know models are really “reading” the whole passage?
• Question-only setting:
 – If the QA system needs the passage, randomize its words first
 – If just candidate As needed, place them in random spots, fill intervening text with gibberish
• Passage-only setting:
 – ‘Ignore’ the Qs: assign each Q to some random passage
(Kaushik and Lipton EMNLP 2018)
Experiments
• Datasets/tests:
 – Span selection: SQuAD, TriviaQA
 – Cloze queries: Children’s Book Test (CBT), CNN, CLOTH, Who-did-What, Daily Mail
 – Multi-class classification (implicit): bAbI (20 tasks)
 – Multiple-choice question answering: RACE, MCTest
 – Answer generation: MS MARCO
• Algorithms:
 – Key-Value Memory Networks (Miller et al. 2016: Key-Value Memory Networks for Directly Reading Documents. Proceedings of EMNLP)
 – Gated Attention Readers (Dhingra et al. 2017: Gated-Attention Readers for Text Comprehension. Proceedings of ACL)
 – QANet (Yu et al. 2018: QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. Proceedings of ICLR)
(Kaushik and Lipton EMNLP 2018)
Some results
[Result tables not reproduced:]
• SQuAD, using QANet
• bAbI, using Key-Value MemNets
• Who-did-What, using Gated-Attention Readers
• CBT, using Gated-Attention Readers
(Kaushik and Lipton EMNLP 2018)
Why? What’s going on??
(Same Q-only example as before:)
Question: shin kanemaru, the gravel-voiced back-room boss who died on thursday aged 81, goes down in history as japan’s most corrupt post-war politician after ___________
Passage: ... glynis bc-nj-zimmer-profile-2 takes-ny trahane fumio yasuhiro dragneal hadon bjorkman/max ... seventh-largest embarrased jeopardy hilariously masahisa haibara bajram 8-to-24 duke/meredith acceding ... koidu iraqs 2:32:21 //www.ironmanlive.com/ sagawa kyubin dean internatinoal 90-meter kakuei tanaka seven-paragraph 577,610 wendover golf-lpga-jpn partner, un-appointed ue mazzei canada-u.s.
Answer: kakuei tanaka
Callouts on passage tokens:
– Transportation company
– Kanemaru’s secretary
– Long-term politician
– Name not in Google
Conclusion for option 3
• Don’t trust QA datasets!
• Don’t trust QA system claims!
• First, check:
 – are there any pre-existing (= training data) dependencies among the Q and candidate As?
 – does the full context predict the A without even the Q?
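These checks can be automated as cheap ablations in the spirit of Kaushik and Lipton: re-evaluate with the passage's words shuffled, and with each question paired to a random passage. The `evaluate` harness and the toy question-only "model" below are invented for illustration; `model` is any callable from (passage, question) to an answer.

```python
import random

def evaluate(model, data):
    """Accuracy of model over (passage, question, answer) triples."""
    return sum(model(p, q) == a for p, q, a in data) / len(data)

def shuffled_passage(data, rng=random.Random(0)):
    """Ablation (a): randomize each passage's word order."""
    out = []
    for p, q, a in data:
        words = p.split()
        rng.shuffle(words)
        out.append((" ".join(words), q, a))
    return out

def random_passage(data, rng=random.Random(0)):
    """Ablation (b): pair each question with a random passage."""
    passages = [p for p, _, _ in data]
    return [(rng.choice(passages), q, a) for _, q, a in data]

# A toy model that only looks at the question. Its accuracy is unchanged
# by both ablations, which is exactly the red flag these checks expose.
toy = lambda p, q: "1914" if "when" in q else "unknown"
data = [("the canal opened in 1914", "when did it open", "1914")]
print(evaluate(toy, data), evaluate(toy, shuffled_passage(data)),
      evaluate(toy, random_passage(data)))
```

If accuracy barely drops under either ablation, the system is exploiting dataset artifacts rather than "reading" the passage.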
Option 4: A only through reasoning
For truly complex QA:
1. Identify the individual steps/pieces needed to derive the A
2. Figure out how to compute/find them
 – From the Q context and/or from elsewhere
3. Compose (and check?) them
 – Build an A-finding ‘script’
Possible sources of this knowledge
• External search:
 – Query something like the web and hope to be lucky
• Entailments: “sentence” –> “sentence”
 – Operate at surface form (in RTE formulation)
 – Allow one surface form to be stated when another is given
 – New surface form may provide Answer
 – Need: entailment rules + entailment applier
• Axioms: A ∨ B –> C
 – Operate at deeper level
 – Connect representation subgraphs, even providing new nodes
 – Expanded graph may provide Answer
 – Need: axioms/composition rules + theorem prover
Type 1: A popular task today: QA over structured data
• Data: database, table, etc.
• Task: ask Qs that require (1) finding various bits of data and (2) composing them to make the A
• The missing information is the script governing the sequence of access and composition
• Research: how to [learn to] build this script?
• Evaluation: did the system produce the right A?
• Examples:
 – U.S. geography database of 800 facts (Zelle & Mooney, 1996)
 – Wikitable questions (Pasupat and Liang, 2015; Dasigi 2018)
 – Other domains’ tables (several AI2 projects)
Wikitable dataset

Athlete           | Nation            | Olympics  | Medals
Gillis Grafström  | Sweden (SWE)      | 1920–1932 | 4
Kim Soo-Nyung     | South Korea (KOR) | 1988–2000 | 6
Evgeni Plushenko  | Russia (RUS)      | 2002–2014 | 4
Kim Yu-na         | South Korea (KOR) | 2010–2014 | 2
Patrick Chan      | Canada (CAN)      | 2014      | 2

Question: Which athlete was from South Korea after the year 2010?
Answer: Kim Yu-na
Reasoning:
1) Get rows where Nation column contains South Korea
2) Filter rows where Olympics has a value greater than 2010.
3) Get value from Athlete column from filtered rows.
Program:
((reverse athlete) (and (nation south_korea) (year ((reverse date) (>= 2010-mm-dd)))))
WikiTableQuestions, Pasupat and Liang, 2015
(Dasigi LTI PhD thesis, 2018)
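The three reasoning steps can be executed directly over the table. A sketch with plain Python filters standing in for the λ-DCS-style program (the hand-encoded rows and the `last_year` field are illustrative simplifications, not Pasupat and Liang's execution engine):

```python
# Table rows, with the Olympics range reduced to its final year.
rows = [
    {"athlete": "Gillis Grafström",  "nation": "Sweden",      "last_year": 1932, "medals": 4},
    {"athlete": "Kim Soo-Nyung",     "nation": "South Korea", "last_year": 2000, "medals": 6},
    {"athlete": "Evgeni Plushenko",  "nation": "Russia",      "last_year": 2014, "medals": 4},
    {"athlete": "Kim Yu-na",         "nation": "South Korea", "last_year": 2014, "medals": 2},
    {"athlete": "Patrick Chan",      "nation": "Canada",      "last_year": 2014, "medals": 2},
]

# 1) Get rows where the Nation column contains South Korea
korean = [r for r in rows if r["nation"] == "South Korea"]
# 2) Filter rows whose Olympics value extends past 2010
recent = [r for r in korean if r["last_year"] > 2010]
# 3) Get the value from the Athlete column of the filtered rows
print([r["athlete"] for r in recent])
```

Learning to *produce* this access-and-composition script from the question alone is exactly the research problem the slide names.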
Example: Dasigi
• Approach for learning to build access routines:
 1. Parse Q, build dependency tree
 2. Convert into Logical Form
 3. Translate into candidate table access routine
 4. (Try all kinds of mappings from words to query operators/structure)
 5. Test composition by repeated trial and error
• Essentially, learning is a search in ‘operator combination space’ to build the logical form
• Weak supervision is not enough. Speed up the learning/search by:
 – Learning to associate table access parameters with parts of the tree (Q variables)
 – Learning to associate nesting and access operators with parts of the tree (‘operator’ words: “the most”, “last”, etc.)
 – Predefining some lexicon-to-operation mappings
 – Paying attention to grammatical construction of the tree
 – Implementing heuristics to guide exploration (‘short Qs first’)
(Dasigi LTI PhD thesis, 2018)
Dasigi approach
• Strategies:
 – Incorporate knowledge of grammatical constraints
 – ‘Lucky’ examples: remove right A with wrong query logic
 – Question coverage: how many Q words mapped?
 – Complex queries (denotation): how large is the query?
 – Do iterative search, from simpler to more complex Qs
• Combine into single Objective: minimize expected value of cost (Goodman, 1996; Goel and Byrne, 2000; Smith and Eisner, 2005), with a linear combination of coverage and denotation costs
 (x: NL term; y: script term; d: denotation)
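A plausible way to write this objective, reconstructed from the slide's legend (x: NL term, y: script term, d: denotation) and the stated linear combination; the thesis's exact formulation may differ:

```latex
% Assumption: p(y | x; \theta) is the model's distribution over scripts y
% for question x, and the cost is a \lambda-weighted mix of a coverage
% cost (how many Q words y maps) and a denotation cost (whether/what
% executing y yields against d).
\min_\theta \; \mathbb{E}_{y \sim p(y \mid x;\, \theta)}
  \big[\, \lambda \, C_{\mathrm{cov}}(x, y)
        + (1 - \lambda) \, C_{\mathrm{den}}(y, d) \,\big]
```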
Empirical comparison on WikiTableQuestions
● Requires approximate set of logical forms during training
● Used output from Dynamic Programming on Denotations (Pasupat and Liang, 2016)
● Various models: strings, trees, etc.
● Efficient search followed by pruning using human annotations
(Krishnamurthy, Dasigi and Gardner, 2017)
Dasigi results using iterative search
● Similar trend in 2 domains
● Used functional query language (Liang et al., 2018)
(Dasigi, Gardner, Murty, Zettlemoyer, Hovy 2018)
[Result tables: NLVR | WikiTableQuestions]
Conclusion for option 4.1
Interesting idea to ‘operationalize’ the Q and test its ‘truth’ by running the script for the A
But works only with structured A sources where such operationalization is possible
Can we ‘operationalize’ other, typical kinds of Qs?

Type 2: A new QA task: Multi-domain knowledge
Q: What is the largest capital city south of Santiago de Chile?
 – Geographic knowledge (lat-long, population)
 – Numerical ability (sorting, etc.)
Q: Which of the leaders of the XYZ enterprise are well-liked, and why?
 – Discovery of social role by actions
 – Sentiment judgments attached to actions
Multi-domain knowledge
• Define N self-contained standardized ‘domain specialists’ (KBs + reasoners) that any QA engine can run
• At run-time, analyze the Q, build the A script, activate the specialists as needed, compute the A
[Diagram: specialists — Arithmetic, Geography, Psych: goals, Social customs, Physics]
Research needed
• For each domain specialist:
 – Define its ‘knowledge service’
 – Create the underlying knowledge
 – Define the I/O APIs for the QA engine to use
 – Build the specialist
• For each QA engine:
 – Analyze the Q —> determine parameters and need
 – Decompose need, build a script of specialist queries plus their result composition
 – Execute
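The specialist/engine split can be sketched as a small API. Everything here (the `Specialist` interface, the toy Geography and Arithmetic specialists, their data, and the "southernmost capital south of Santiago" stand-in question) is invented to illustrate the architecture, not an existing system.

```python
class Specialist:
    """The I/O API every domain specialist exposes to the QA engine."""
    def query(self, request: dict):
        raise NotImplementedError

class Geography(Specialist):
    # Tiny invented KB: capital city -> latitude.
    CAPITALS = {"Santiago": -33.4, "Wellington": -41.3, "Canberra": -35.3}
    def query(self, request):
        if request["op"] == "capitals_south_of":
            lat = self.CAPITALS[request["city"]]
            return {c: l for c, l in self.CAPITALS.items() if l < lat}

class Arithmetic(Specialist):
    def query(self, request):
        if request["op"] == "argmin":
            return min(request["values"], key=request["values"].get)

# QA engine side: a hand-written answer script that activates the two
# specialists in sequence and composes their results.
geo, arith = Geography(), Arithmetic()
south = geo.query({"op": "capitals_south_of", "city": "Santiago"})
print(arith.query({"op": "argmin", "values": south}))
```

The design point is that the engine knows only the request/response dicts, so specialists can be developed, shared, and swapped independently.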
Some specialist areas we are currently working on in my group
1. Arithmetic/numerical reasoning for entailment (Ravichander, Naik, Rosé, Hovy, CoNLL 2019, ACL 2019)
2. Psych goals for sentiment justification (Otani and Hovy, ACL 2019)
3. Social roles for group activity support (Yang, Kraut, Hovy, EMNLP, HCI, and others 2017–18)
Topic 1. Numerical calculation
• Task: Entailment problem
• Input: clauses containing numbers
• Output: entailed/not-entailed
• Results:
 – EQUATE dataset extracted from ~8 existing QA and Entailment resources, with As added
 – Baseline numerical reasoner scores on the dataset
(Ravichander, Naik, Rose, Hovy, 2019)

P: A bomb in a Hebrew University cafeteria killed five Americans and four Israelis
H: A bombing at Hebrew University in Jerusalem killed nine people, including five Americans
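For a P/H pair like this one, the core numeric check is: extract the quantities and test whether each hypothesis quantity is justified by a premise quantity or a simple combination (here, just sums). A toy sketch with invented helper names; EQUATE's baseline reasoner is far richer than this.

```python
import re

# Minimal number-word lexicon for the example; a real system needs more.
WORDS = {"four": 4, "five": 5, "nine": 9}

def nums(text):
    """Extract the numeric quantities mentioned in a clause."""
    toks = re.findall(r"[a-z]+|\d+", text.lower())
    vals = [WORDS.get(t, int(t) if t.isdigit() else None) for t in toks]
    return [v for v in vals if v is not None]

def sum_justified(premise, hypothesis):
    """Is every hypothesis quantity a premise quantity or a sum of two?"""
    p, h = nums(premise), nums(hypothesis)
    return all(x in p or x == sum(p) or any(x == a + b for a in p for b in p)
               for x in h)

P = "A bomb killed five Americans and four Israelis"
H = "A bombing killed nine people, including five Americans"
print(sum_justified(P, H))
```

Here "nine" is justified as 5 + 4 and "five" appears directly, so the pair passes; a hypothesis claiming "ten people" would not.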
EQUATE corpus

Dataset     | Size | Classes | Synthetic | Data Source              | Annotation Source | Quantitative Phenomena
Stress Test | 7500 | 3       | ✓         | AQuA-RAT                 | Automatic         | Quantifiers
RTE-Quant   | 166  | 2       | ✗         | RTE2-RTE4                | Expert            | Arithmetic, World knowledge, Ranges, Quantifiers
AwpNLI      | 722  | 2       | ✓         | Arithmetic Word Problems | Automatic         | Arithmetic
NewsNLI     | 1000 | 2       | ✗         | CNN                      | Crowd-sourced     | Ordinals, Quantifiers, Arithmetic, World Knowledge, Magnitude, Ratios
RedditNLI   | 250  | 3       | ✗         | Reddit                   | Expert            | Range, Arithmetic, Approximation, Verbal

(Ravichander et al., 2019)
Baselines (SOTA methods)
• Majority Class (MAJ): Simple baseline; always predicts the majority class in the test set.
• Hypothesis-Only (HYP): FastText classifier trained on only hypotheses to predict the entailment relation (Gururangan et al. 2018)
• ALIGN: A bag-of-words alignment model inspired by MacCartney (2009)
• NB (Nie and Bansal 2017): Sentence encoder consisting of stacked BiLSTM-RNNs with shortcut connections and fine-tuning of embeddings. Achieves top non-ensemble result in the RepEval-2017 shared task
• CH (Chen et al. 2017): Sentence encoder consisting of stacked BiLSTM-RNNs with shortcut connections, character-composition word embeddings learned via CNNs, intra-sentence gated attention, and ensembling. Achieves best overall result in the RepEval-2017 shared task
• RC (Balazs et al. 2017): Single-layer BiLSTM with mean pooling and intra-sentence attention
• IS (Conneau et al. 2017): Single-layer BiLSTM-RNN with max-pooling, shown to learn robust universal sentence representations that transfer well across inference tasks
• BiLSTM: We reimplement the simple BiLSTM baseline model of Nangia et al. (2017). Our reimplementation achieves slightly better results on the MultiNLI dev set
• CBOW: Bag-of-words sentence representation from word embeddings, passed through a tanh non-linearity and a softmax layer for classification.
(Ravichander et al., 2019)
Constructing entailment inferences
• Generate a report for each premise-hypothesis pair, consisting of:
 – Extracted NUMSETS for premise and hypothesis
 – Which NUMSETS were combined, and by what operation
 – Which NUMSETS were justified and which weren’t
• Combines neural and symbolic programs
 – Some submodules are neural; overall framework is symbolic
 – Lightweight supervision
(Ravichander et al., 2019)
Topic 2. Human goals
• Complex QA domain: human goal for sentiment
 – I loved the hotel’s price but the room was noisy —> [price +] [room −]
• Task: sentiment justification: WHY does the Holder have the sentiment value for the facet?
• Approach: Classify each clause into a list of human (psychological and social) goals
 – Initial set: Maslow hierarchy
 – Currently: ~110 human goals from USC (Talevich et al.)
• Data: Crowdsourced; κ ≈ 0.55
(Otani and Hovy, 2019)
(Talevich et al. 2017)
Topic 3. Social roles
• Complex QA domain: Human interactions in groups
• Task: Automated social role discovery
 – Input: Discussions in a social media platform
 – Output: Role list, and assignment for each user
• Data:
 – Wikipedia editors: our role taxonomy conforms to Wikipedia’s internal set
 – Cancer Survivor Network discussion groups
(Yang, Kraut, Hovy, 2018)
[Diagram: user edit history → role assignments. Each role is a distribution over edit actions, e.g. Information_insertion 0.4, Reference_insertion 0.2; Grammar 0.2, Markup_deletion 0.1, Rephrase 0.1; Wikilink_insertion 0.2, Wikilink_deletion 0.1]
(Yang, Kraut, Hovy, 2018)

Latent role model in Wikipedia
[Plate diagram: role = distribution of edit actions; role proportions; role assignment for user u and word n; edit actions]
(Yang, Kraut, Hovy, 2018)
Discovered editor roles (naming by expert)

Expert’s role name | Discovered representative behavior
Substantive Expert | Information insertion, wikilink insertion, reference insertion
Social Networker   | Main talk namespace, user namespace
Vandal Fighter     | Reverting, user talk namespace
Quality Assurance  | Wikilink insertion, wikipedia namespace, template namespace
Fact Checker       | Information deletion, wikilink deletion, reference deletion
Cleanup Worker     | Wikilink modification, template insertion, markup modification
Fact Updater       | Template modification, reference modification
Copy Editor        | Grammar, paraphrase, relocation

(Yang, Kraut, Hovy, 2018)
Topics 4–. Other inference specialists
• Geography and Time… (see (Allen, CACM 1983) and (Davis, JAIR 2017))
 – E.g.: north-of, area-included-in-region…
• Physics, Biology… (see the HALO project)
 – Recent work on aspects of Physics at AI2 (Clark et al.)
• Emotions
Physics: noun-noun compounds
Where is…
• …the kitchen table → LOC
• …the coffee table → FUNCTION → LOC
• …the wood table → MATERIAL
• …the teacher’s table → ?FUNCTION → LOC?
• …the data table → TYPES → CONTENT → LOC?
• Need to know the relation and the noun types to infer additional info
Conclusion for option 4.2: Where next with Complex QA?
• Identify and build the most useful domain specialists
 – Find basic knowledge primitives
 – Develop reasoning logics, models, and implementations
 – Develop/find QA datasets that exercise this sort of specialist knowledge and reasoning
• Great overview in (Davis, JAIR 2018)
• Create a common library for all to share
• Evaluate correctness AND Answer-production scripts (traces, as ‘explanation’)
• An open-source and general-purpose (not just scientific/political) version of Wolfram Alpha
THANK YOU