cs 6120/cs4120: natural language processing · 2017-11-20 · cs 6120/cs4120: natural language...
CS 6120/CS 4120: Natural Language Processing
Instructor: Prof. Lu Wang, College of Computer and Information Science
Northeastern University. Webpage: www.ccs.neu.edu/home/luwang
Question Answering
Question: What do worms eat?
Potential answers:
• Worms eat grass
• Grass is eaten by worms
• Birds eat worms
• Horses with worms eat grass
(Figure: word-relation graphs for the question and for each potential answer, built over the words "worms", "eat", "what", "grass", "birds", "horses", and "with".)
One of the oldest NLP tasks (punched card systems in 1961). Simmons, Klein, McConlogue. 1964. Indexing and Dependency Logic for Answering English Questions. American Documentation 15:30, 196-204
Question Answering: IBM’s Watson
• Won Jeopardy! on February 16, 2011
  Clue: WILLIAM WILKINSON’S “AN ACCOUNT OF THE PRINCIPALITIES OF WALLACHIA AND MOLDOVIA” INSPIRED THIS AUTHOR’S MOST FAMOUS NOVEL
  Answer: Bram Stoker
Apple’s Siri
Wolfram Alpha
Types of Questions in Modern Systems
• Factoid questions
  • Who wrote “The Universal Declaration of Human Rights”?
  • How many calories are there in two slices of apple pie?
  • What is the average age of the onset of autism?
  • Where is Apple Computer based?
• Complex (narrative) questions:
  • In children with an acute febrile illness, what is the efficacy of acetaminophen in reducing fever?
  • What do scholars think about Jefferson’s position on dealing with pirates?
Commercial systems: mainly factoid questions
Where is the Louvre Museum located? In Paris, France
What’s the abbreviation for limited partnership? L.P.
What are the names of Odin’s ravens? Huginn and Muninn
What currency is used in China? The yuan
What kind of nuts are used in marzipan? almonds
What instrument does Max Roach play? drums
What is the telephone number for Stanford University? 650-723-2300
Paradigms for QA
• Information Retrieval (IR)-based approaches
  • TREC; IBM Watson; Google
• Knowledge-based and Hybrid approaches
  • IBM Watson; Apple Siri; Wolfram Alpha; True Knowledge Evi
Many questions can already be answered by web search
IR-based Question Answering
IR-based Factoid QA
(Figure: pipeline. Question → Question Processing [Query Formulation, Answer Type Detection] → Document Retrieval over the indexed documents → relevant docs → Passage Retrieval → passages → Answer Processing → Answer)
IR-based Factoid QA
• QUESTION PROCESSING
  • Detect question type, answer type, focus, relations
  • Formulate queries to send to a search engine
• PASSAGE RETRIEVAL
  • Retrieve ranked documents
  • Break into suitable passages and rerank
• ANSWER PROCESSING
  • Extract candidate answers
  • Rank candidates using evidence from the text and external sources
Knowledge-based approaches (Siri)
• Build a semantic representation of the query
  • Times, dates, locations, entities, numeric quantities
• Map from this semantics to query structured data or resources
  • Geospatial databases
  • Ontologies (Wikipedia infoboxes, dbPedia, WordNet, Yago)
  • Restaurant review sources and reservation services
  • Scientific databases
Hybrid approaches (IBM Watson)
• Build a shallow semantic representation of the query
• Generate answer candidates using IR methods
  • Augmented with ontologies and semi-structured data
• Score each candidate using richer knowledge sources
  • Geospatial databases
  • Temporal reasoning
  • Taxonomical classification
Answer Types and Query Formulation
Question Processing
Things to extract from the question
• Answer Type Detection
  • Decide the named entity type (person, place) of the answer
• Query Formulation
  • Choose query keywords for the IR system
• Question Type classification
  • Is this a definition question, a math question, a list question?
• Focus Detection
  • Find the question words that are replaced by the answer
• Relation Extraction
  • Find relations between entities in the question
Question Processing
Jeopardy!: They’re the two states you could be reentering if you’re crossing Florida’s northern border
• Answer Type: US state
• Query: two states, border, Florida, north
• Focus: the two states
• Relations: borders(Florida, ?x, north)
Answer Type Detection: Named Entities
• Who founded Virgin Airlines?
  • PERSON
• What Canadian city has the largest population?
  • CITY
Answer Type Taxonomy
• 6 coarse classes
  • ABBREVIATION, ENTITY, DESCRIPTION, HUMAN, LOCATION, NUMERIC
• 50 finer classes
  • LOCATION: city, country, mountain…
  • HUMAN: group, individual, title, description
  • ENTITY: animal, body, color, currency…
Xin Li, Dan Roth. 2002. Learning Question Classifiers. COLING'02
Part of Li & Roth’s Answer Type Taxonomy
• LOCATION: country, city, state
• NUMERIC: date, percent, money, size, distance
• HUMAN: individual, title, group
• ENTITY: food, currency, animal
• DESCRIPTION: definition, reason
• ABBREVIATION: expression, abbreviation
Answer Types
More Answer Types
Answer types in Jeopardy
• 2500 answer types in 20,000 Jeopardy question sample
• The most frequent 200 answer types cover <50% of data
• The 40 most frequent Jeopardy answer types: he, country, city, man, film, state, she, author, group, here, company, president, capital, star, novel, character, woman, river, island, king, song, part, series, sport, singer, actor, play, team, show, actress, animal, presidential, composer, musical, nation, book, title, leader, game
Ferrucci et al. 2010. Building Watson: An Overview of the DeepQA Project. AI Magazine. Fall 2010. 59-79.
Answer Type Detection
• Hand-written rules
• Machine Learning
• Hybrids
Answer Type Detection
• Regular expression-based rules can get some cases:
  • Who {is|was|are|were} PERSON
  • PERSON (YEAR – YEAR)
• Other rules use the question headword (the headword of the first noun phrase after the wh-word):
  • Which city in China has the largest number of foreign financial companies?
  • What is the state flower of California?
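The two kinds of hand-written rules above could be sketched roughly as follows. This is only an illustration, not the rule set of any system named in these slides: the pattern list, the `HEADWORD_TYPES` lexicon (including the `FLOWER` label), and the function name `detect_answer_type` are assumptions, and the headword is approximated by the first lexicon word found after the wh-word instead of by a parser.

```python
import re

# Hypothetical question patterns mapped to answer types.
PATTERN_RULES = [
    (re.compile(r"^who\s+(is|was|are|were)\b", re.IGNORECASE), "PERSON"),
    (re.compile(r"^where\b", re.IGNORECASE), "LOCATION"),
]

# Hypothetical headword lexicon; a real system would use a parser
# to find the head of the first noun phrase after the wh-word.
HEADWORD_TYPES = {"city": "CITY", "country": "COUNTRY", "flower": "FLOWER"}

def detect_answer_type(question):
    """Return an answer type label, or "UNKNOWN" if no rule fires."""
    for pattern, answer_type in PATTERN_RULES:
        if pattern.search(question):
            return answer_type
    # Crude headword rule: scan the words after "which"/"what".
    match = re.match(r"^(which|what)\b(.*)", question, re.IGNORECASE)
    if match:
        for token in re.findall(r"[a-z]+", match.group(2).lower()):
            if token in HEADWORD_TYPES:
                return HEADWORD_TYPES[token]
    return "UNKNOWN"

print(detect_answer_type("Which city in China has the largest number of foreign financial companies?"))
print(detect_answer_type("What is the state flower of California?"))
```

Rules like these only cover the easy cases, which is why they usually end up as features in a learned classifier.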
Answer Type Detection
• Most often, we treat the problem as machine learning classification
• Define a taxonomy of question types
• Annotate training data for each question type
• Train classifiers for each question class using a rich set of features.
  • features include those hand-written rules!
Features for Answer Type Detection
• Question words and phrases
• Part-of-speech tags
• Parse features (headwords)
• Named Entities
• Semantically related words
Keyword Selection Algorithm
1. Select all non-stop words in quotations
2. Select all NNP words in recognized named entities
3. Select all complex nominals with their adjectival modifiers
4. Select all other complex nominals
5. Select all nouns with their adjectival modifiers
6. Select all other nouns
7. Select all verbs
8. Select all adverbs
9. Select the question focus word (skipped in all previous steps)
10. Select all other words
Dan Moldovan, Sanda Harabagiu, Marius Paşca, Rada Mihalcea, Richard Goodrum, Roxana Girju and Vasile Rus. 1999. Proceedings of TREC-8.
Choosing keywords from the query
Who coined the term “cyberspace” in his novel “Neuromancer”?
Selected keywords (keyword/priority): cyberspace/1 Neuromancer/1 term/4 novel/4 coined/7
Slide from Mihai Surdeanu
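A simplified sketch of priority-based keyword selection is given below. It assumes the tokens arrive already tagged (the POS tags and the quotation, entity, and focus flags are hand-assigned here), it skips steps 3 to 5 because they need a parser to find complex nominals and modifiers, and it numbers priorities by the ten-step list, so plain nouns get priority 6 and its numbers need not match the annotation on the slide. The names `keyword_priority`, `select_keywords`, and `STOPWORDS` are hypothetical.

```python
STOPWORDS = {"who", "the", "in", "his"}  # tiny illustrative stop list

def keyword_priority(token):
    """Return the step (1-10) that selects this token, or None for stop words."""
    word, pos, in_quotes, in_entity, is_focus = token
    if word.lower() in STOPWORDS:
        return None
    if is_focus:
        return 9   # the focus word is skipped by all earlier steps
    if in_quotes:
        return 1
    if pos == "NNP" and in_entity:
        return 2
    if pos.startswith("NN"):
        return 6   # steps 3-5 are not modelled in this sketch
    if pos.startswith("VB"):
        return 7
    if pos.startswith("RB"):
        return 8
    return 10

def select_keywords(tokens, max_priority=10):
    """Return (word, priority) pairs, best priority first, in question order."""
    scored = [(tok[0], keyword_priority(tok)) for tok in tokens]
    kept = [(w, p) for w, p in scored if p is not None and p <= max_priority]
    return sorted(kept, key=lambda pair: pair[1])  # sorted() is stable

# (word, POS tag, in quotation, in named entity, is focus word)
QUESTION = [
    ("Who", "WP", False, False, False),
    ("coined", "VBD", False, False, False),
    ("the", "DT", False, False, False),
    ("term", "NN", False, False, False),
    ("cyberspace", "NN", True, False, False),
    ("in", "IN", False, False, False),
    ("his", "PRP$", False, False, False),
    ("novel", "NN", False, False, False),
    ("Neuromancer", "NNP", True, True, False),
]

print(select_keywords(QUESTION))
```

Lowering `max_priority` gives a smaller, more precise query; raising it adds lower-priority words when the first query retrieves too little.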
Passage Retrieval and Answer Extraction
Passage Retrieval
• Step 1: IR engine retrieves documents using query terms
• Step 2: Segment the documents into shorter units
  • something like paragraphs
• Step 3: Passage ranking
  • Use answer type to help rerank passages
Features for Passage Ranking
• Number of Named Entities of the right type in passage
• Number of query words in passage
• Number of question N-grams also in passage
• Proximity of query keywords to each other in passage
• Longest sequence of question words
• Rank of the document containing passage
Either in rule-based classifiers or with supervised machine learning
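A few of these features can be computed with very little code. The sketch below assumes the passage is already tokenized and its named entities already tagged; the function name `passage_features` and the feature names are hypothetical, and "longest sequence of question words" is approximated as the longest run of consecutive passage tokens that are query words.

```python
def passage_features(query_terms, passage_tokens, passage_entities,
                     answer_type, doc_rank):
    """Compute a small feature dictionary for one candidate passage.

    passage_entities is a list of (text, entity_type) pairs produced by
    some upstream tagger (assumed to exist; not implemented here).
    """
    query = {t.lower() for t in query_terms}
    tokens = [t.lower() for t in passage_tokens]

    # Number of named entities whose type matches the expected answer type.
    right_type = sum(1 for _, etype in passage_entities if etype == answer_type)

    # Number of distinct query words that occur in the passage.
    hits = len(query & set(tokens))

    # Longest run of consecutive passage tokens that are query words.
    longest = run = 0
    for tok in tokens:
        run = run + 1 if tok in query else 0
        longest = max(longest, run)

    return {
        "right_type_entities": right_type,
        "query_word_hits": hits,
        "longest_query_run": longest,
        "doc_rank": doc_rank,
    }

features = passage_features(
    ["height", "Mount", "Everest"],
    "The official height of Mount Everest is 29035 feet".split(),
    [("29035 feet", "LENGTH")],
    "LENGTH",
    doc_rank=1,
)
print(features)
```

The resulting dictionary could feed either a hand-weighted scoring rule or a supervised ranker.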
Answer Extraction
• Run an answer-type named-entity tagger on the passages
  • Each answer type requires a named-entity tagger that detects it
  • If answer type is CITY, tagger has to tag CITY
  • Can be full NER, simple regular expressions, or hybrid
• Return the string with the right type:
  • Who is the prime minister of India? (PERSON)
    Manmohan Singh, Prime Minister of India, had told left leaders that the deal would not be renegotiated.
  • How tall is Mt. Everest? (LENGTH)
    The official height of Mount Everest is 29035 feet
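For answer types with a regular surface form, the "simple regular expressions" option might look like the sketch below. The patterns, the `EXTRACTORS` table, and the function name `extract_candidates` are assumptions made for illustration; types such as PERSON or CITY would normally need a full named-entity tagger instead.

```python
import re

# Hypothetical per-type extractors; only types with a regular surface form.
EXTRACTORS = {
    "LENGTH": re.compile(
        r"\b\d[\d,]*(?:\.\d+)?\s*(?:feet|foot|ft|meters|metres|km|miles|m)\b",
        re.IGNORECASE,
    ),
    "YEAR": re.compile(r"\b(?:1\d{3}|20\d{2})\b"),
}

def extract_candidates(passage, answer_type):
    """Return every substring of the passage that matches the answer type."""
    pattern = EXTRACTORS.get(answer_type)
    if pattern is None:
        return []  # no extractor for this type in the sketch
    return [m.group(0) for m in pattern.finditer(passage)]

passage = "The official height of Mount Everest is 29035 feet"
print(extract_candidates(passage, "LENGTH"))
```

When a passage yields several strings of the right type, the candidates still have to be ranked.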
Ranking Candidate Answers
• But what if there are multiple candidate answers!
Q: Who was Queen Victoria’s second son?
• Answer Type: Person
• Passage: The Marie biscuit is named after Marie Alexandrovna, the daughter of Czar Alexander II of Russia and wife of Alfred, the second son of Queen Victoria and Prince Albert
Use machine learning: Features for ranking candidate answers
• Answer type match: Candidate contains a phrase with the correct answer type.
• Pattern match: Regular expression pattern matches the candidate.
• Question keywords: # of question keywords in the candidate.
• Keyword distance: Distance in words between the candidate and query keywords
• Novelty factor: A word in the candidate is not in the query.
• Apposition features: The candidate is an appositive to question terms
• Punctuation location: The candidate is immediately followed by a comma, period, quotation marks, semicolon, or exclamation mark.
• Sequences of question terms: The length of the longest sequence of question terms that occurs in the candidate answer.
Candidate Answer Scoring in IBM Watson
• Each candidate answer gets scores from >50 components
  • (from unstructured text, semi-structured text, triple stores)
  • logical form (parse) match between question and candidate
  • passage source reliability
  • geospatial location
    • California is “southwest of Montana”
  • temporal relationships
  • taxonomic classification
CommonEvaluationMetrics
1. Accuracy (doesanswermatchgold-labeledanswer?)2. MeanReciprocalRank
• ForeachqueryreturnarankedlistofMcandidateanswers.• Queryscoreis1/Rankofthefirstcorrectanswer
• Iffirstansweriscorrect:1• elseifsecondansweriscorrect:½• elseifthirdansweriscorrect:⅓,etc.• Scoreis0ifnoneoftheManswersarecorrect
  • Take the mean over all N queries:
    $$\mathrm{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\mathrm{rank}_i}$$
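Mean Reciprocal Rank is short to compute. The sketch below assumes each query has a single gold answer string and that correctness is exact string equality, which is a simplification of how answers are judged in practice; the function name `mean_reciprocal_rank` is illustrative.

```python
def mean_reciprocal_rank(ranked_lists, gold_answers):
    """Average 1/rank of the first correct answer; 0 when none is correct."""
    total = 0.0
    for candidates, gold in zip(ranked_lists, gold_answers):
        for rank, candidate in enumerate(candidates, start=1):
            if candidate == gold:
                total += 1.0 / rank
                break  # only the first correct answer counts
    return total / len(ranked_lists)

ranked = [
    ["Alfred", "Prince Albert"],        # correct at rank 1 -> 1
    ["Prince Albert", "Alfred"],        # correct at rank 2 -> 1/2
    ["Marie Alexandrovna", "Albert"],   # no correct answer -> 0
]
gold = ["Alfred", "Alfred", "Alfred"]
print(mean_reciprocal_rank(ranked, gold))  # (1 + 0.5 + 0) / 3 = 0.5
```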
Knowledge in QA
Relation Extraction
• Answers: Databases of Relations
  • born-in(“Emma Goldman”, “June 27 1869”)
  • author-of(“Cao Xue Qin”, “Dream of the Red Chamber”)
  • Draw from Wikipedia infoboxes, DBpedia, FreeBase, etc.
• Questions: Extracting Relations in Questions
  Whose granddaughter starred in E.T.?
  (acted-in ?x “E.T.”)
  (granddaughter-of ?x ?y)
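Once the question has been turned into relations, answering it amounts to joining them against a relation database. The toy sketch below uses an in-memory set of triples; the `FACTS` contents are a hand-picked illustration (not drawn from any of the resources listed above), and the function name is hypothetical.

```python
# Toy triple store: (relation, subject, object).
FACTS = {
    ("acted-in", "Drew Barrymore", "E.T."),
    ("acted-in", "Henry Thomas", "E.T."),
    ("granddaughter-of", "Drew Barrymore", "John Barrymore"),
}

def whose_granddaughter_starred_in(film):
    """Solve (acted-in ?x film) and (granddaughter-of ?x ?y) for ?y."""
    # Bind ?x: everyone who acted in the film.
    actors = {subj for rel, subj, obj in FACTS
              if rel == "acted-in" and obj == film}
    # Bind ?y: grandparents of any ?x found above.
    return sorted(obj for rel, subj, obj in FACTS
                  if rel == "granddaughter-of" and subj in actors)

print(whose_granddaughter_starred_in("E.T."))
```

A production system would issue the same join as a query against a triple store instead of scanning a Python set.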
Temporal Reasoning
• Relation databases
  • (and obituaries, biographical dictionaries, etc.)
• IBM Watson: “In 1594 he took a job as a tax collector in Andalusia”
  Candidates:
  • Thoreau is a bad answer (born in 1817)
  • Cervantes is possible (was alive in 1594)
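The temporal check in this example can be sketched as a lifespan filter. The `LIFESPANS` table below is a hand-entered stand-in for a relation database, and the function names are hypothetical; this is not a description of how Watson implements temporal scoring.

```python
# Hand-entered (birth year, death year) pairs standing in for a relation database.
LIFESPANS = {
    "Henry David Thoreau": (1817, 1862),
    "Miguel de Cervantes": (1547, 1616),
}

def alive_in(person, year):
    """True if the year falls within the person's recorded lifespan."""
    born, died = LIFESPANS[person]
    return born <= year <= died

def filter_candidates(candidates, year):
    """Keep only candidates who were alive in the year given by the clue."""
    return [c for c in candidates if alive_in(c, year)]

print(filter_candidates(["Henry David Thoreau", "Miguel de Cervantes"], 1594))
```

In a scoring setting the same test would contribute a feature value to each candidate's score and would not act as a hard filter.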
Geospatial knowledge (containment, directionality, borders)
• Beijing is a good answer for “Asian city”
• California is “southwest of Montana”
• geonames.org
Context and Conversation in Virtual Assistants like Siri
• Coreference helps resolve ambiguities
  U: “Book a table at Il Fornaio at 7:00 with my mom”
  U: “Also send her an email reminder”
• Clarification questions:
  U: “Chicago pizza”
  S: “Did you mean pizza restaurants in Chicago or Chicago-style pizza?”