CS 6120/CS4120: Natural Language Processing Instructor: Prof. Lu Wang College of Computer and Information Science Northeastern University Webpage: www.ccs.neu.edu/home/luwang


Page 1: CS 6120/CS4120: Natural Language Processing · 2017-11-20 · CS 6120/CS4120: Natural Language Processing Instructor: Prof. Lu Wang ... •How many calories are there in two slices

CS 6120/CS4120: Natural Language Processing

Instructor: Prof. Lu Wang
College of Computer and Information Science
Northeastern University
Webpage: www.ccs.neu.edu/home/luwang

Page 2

Question Answering

Page 3

Question Answering

Question: What do worms eat?
  Dependency structure: eat(subject: worms, object: what)

Potential answers:
• Worms eat grass: eat(subject: worms, object: grass)
• Grass is eaten by worms: eat(subject: worms, object: grass)
• Birds eat worms: eat(subject: birds, object: worms)
• Horses with worms eat grass: eat(subject: horses, object: grass), with "with worms" attached to horses

One of the oldest NLP tasks (punched card systems in 1961)
Simmons, Klein, McConlogue. 1964. Indexing and Dependency Logic for Answering English Questions. American Documentation 15:30, 196-204
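
The matching idea on this slide, comparing the dependency structure of the question with that of each candidate sentence, can be sketched in a few lines. This is only an illustration, not the 1964 system: the (subject, verb, object) triples are written by hand (a real system would take them from a parser), and the function name `answers` is hypothetical.

```python
# Hand-written (subject, verb, object) triples for the slide's candidate
# sentences; a real system would obtain these from a dependency parser.
CANDIDATES = {
    "Worms eat grass": ("worms", "eat", "grass"),
    "Grass is eaten by worms": ("worms", "eat", "grass"),  # passive normalized
    "Birds eat worms": ("birds", "eat", "worms"),
    "Horses with worms eat grass": ("horses", "eat", "grass"),
}

def answers(question, candidates):
    """Return (sentence, filler) pairs whose triple matches the question.

    `question` is a triple in which the wh-slot is None.
    """
    hits = []
    for sentence, triple in candidates.items():
        filler = None
        matched = True
        for q_slot, c_slot in zip(question, triple):
            if q_slot is None:
                filler = c_slot          # this slot answers the wh-word
            elif q_slot != c_slot:
                matched = False
                break
        if matched:
            hits.append((sentence, filler))
    return hits

# "What do worms eat?" becomes (worms, eat, ?)
result = answers(("worms", "eat", None), CANDIDATES)
```

Only the first two sentences match, both with the filler "grass"; "Birds eat worms" and "Horses with worms eat grass" are rejected because their subject is not "worms".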

Page 4

Question Answering: IBM’s Watson

• Won Jeopardy! on February 16, 2011

WILLIAM WILKINSON’S “AN ACCOUNT OF THE PRINCIPALITIES OF WALLACHIA AND MOLDOVIA” INSPIRED THIS AUTHOR’S MOST FAMOUS NOVEL

Bram Stoker

Page 5

Apple’s Siri

Page 6

Wolfram Alpha

Page 7

Types of Questions in Modern Systems

• Factoid questions
  • Who wrote “The Universal Declaration of Human Rights”?
  • How many calories are there in two slices of apple pie?
  • What is the average age of the onset of autism?
  • Where is Apple Computer based?
• Complex (narrative) questions:
  • In children with an acute febrile illness, what is the efficacy of acetaminophen in reducing fever?
  • What do scholars think about Jefferson’s position on dealing with pirates?

Page 8

Commercial systems: mainly factoid questions

Question | Answer
Where is the Louvre Museum located? | In Paris, France
What’s the abbreviation for limited partnership? | L.P.
What are the names of Odin’s ravens? | Huginn and Muninn
What currency is used in China? | The yuan
What kind of nuts are used in marzipan? | almonds
What instrument does Max Roach play? | drums
What is the telephone number for Stanford University? | 650-723-2300

Page 9

Paradigms for QA

• Information Retrieval (IR)-based approaches
  • TREC; IBM Watson; Google
• Knowledge-based and Hybrid approaches
  • IBM Watson; Apple Siri; Wolfram Alpha; True Knowledge Evi

Page 10

Many questions can already be answered by web search

Page 11

IR-based Question Answering

Page 12

IR-based Factoid QA

[Figure: IR-based factoid QA pipeline. Boxes: Question; Question Processing (Query Formulation, Answer Type Detection); Indexing of the document collection; Passage Retrieval (Document Retrieval, Relevant Docs, Passage Retrieval, passages); Answer Processing; Answer]

Page 13

IR-based Factoid QA

• QUESTION PROCESSING
  • Detect question type, answer type, focus, relations
  • Formulate queries to send to a search engine
• PASSAGE RETRIEVAL
  • Retrieve ranked documents
  • Break into suitable passages and rerank
• ANSWER PROCESSING
  • Extract candidate answers
  • Rank candidates
    • using evidence from the text and external sources

Page 14

Knowledge-based approaches (Siri)

• Build a semantic representation of the query
  • Times, dates, locations, entities, numeric quantities
• Map from this semantics to query structured data or resources
  • Geospatial databases
  • Ontologies (Wikipedia infoboxes, DBpedia, WordNet, YAGO)
  • Restaurant review sources and reservation services
  • Scientific databases

Page 15

Hybrid approaches (IBM Watson)

• Build a shallow semantic representation of the query
• Generate answer candidates using IR methods
  • Augmented with ontologies and semi-structured data
• Score each candidate using richer knowledge sources
  • Geospatial databases
  • Temporal reasoning
  • Taxonomical classification

Page 16

Answer Types and Query Formulation

Page 17

Factoid Q/A

[Figure: the IR-based factoid QA pipeline diagram (Question Processing, Passage Retrieval, Answer Processing)]

Page 18

Question Processing
Things to extract from the question

• Answer Type Detection
  • Decide the named entity type (person, place) of the answer
• Query Formulation
  • Choose query keywords for the IR system
• Question Type classification
  • Is this a definition question, a math question, a list question?
• Focus Detection
  • Find the question words that are replaced by the answer
• Relation Extraction
  • Find relations between entities in the question

Page 19

Question Processing

Jeopardy!: They’re the two states you could be reentering if you’re crossing Florida’s northern border

• Answer Type: US state
• Query: two states, border, Florida, north
• Focus: the two states
• Relations: borders(Florida, ?x, north)

Page 20

Answer Type Detection: Named Entities

• Who founded Virgin Airlines?
  • PERSON
• What Canadian city has the largest population?
  • CITY

Page 21

Answer Type Taxonomy

• 6 coarse classes
  • ABBREVIATION, ENTITY, DESCRIPTION, HUMAN, LOCATION, NUMERIC
• 50 finer classes
  • LOCATION: city, country, mountain…
  • HUMAN: group, individual, title, description
  • ENTITY: animal, body, color, currency…

Xin Li, Dan Roth. 2002. Learning Question Classifiers. COLING'02

Page 22

Part of Li & Roth’s Answer Type Taxonomy

• LOCATION: country, city, state
• NUMERIC: date, percent, money, size, distance
• HUMAN: individual, title, group
• ENTITY: food, currency, animal
• DESCRIPTION: definition, reason
• ABBREVIATION: expression, abbreviation

Page 23

Answer Types

Page 24

More Answer Types

Page 25

Answer types in Jeopardy

• 2500 answer types in 20,000 Jeopardy question sample
• The most frequent 200 answer types cover <50% of data
• The 40 most frequent Jeopardy answer types: he, country, city, man, film, state, she, author, group, here, company, president, capital, star, novel, character, woman, river, island, king, song, part, series, sport, singer, actor, play, team, show, actress, animal, presidential, composer, musical, nation, book, title, leader, game

Ferrucci et al. 2010. Building Watson: An Overview of the DeepQA Project. AI Magazine. Fall 2010. 59-79.

Page 26

Answer Type Detection

• Hand-written rules
• Machine Learning
• Hybrids

Page 27

Answer Type Detection

• Regular expression-based rules can get some cases:
  • Who {is|was|are|were} PERSON
  • PERSON (YEAR – YEAR)
• Other rules use the question headword (the headword of the first noun phrase after the wh-word):
  • Which city in China has the largest number of foreign financial companies?
  • What is the state flower of California?
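
A rough sketch of the two kinds of hand-written rule follows: one regular-expression rule keyed on the wh-word, and one lookup on the question headword. Everything here is an assumption made for illustration: the type labels, the word lists `SKIP` and `NP_END`, and the `headword` heuristic, which stands in for a real parser by taking the last word of the first noun phrase after the wh-word.

```python
import re

# Illustrative headword-to-type table (labels are assumptions for this sketch).
HEADWORD_TYPES = {"city": "LOCATION:city", "state": "LOCATION:state",
                  "flower": "ENTITY:plant"}

# Words skipped before the noun phrase starts, and words that end it.
SKIP = {"is", "was", "are", "were", "the", "a", "an"}
NP_END = {"in", "of", "on", "at", "for", "has", "have", "had", "does", "did"}

def headword(question):
    """Crude stand-in for a parser: last word of the first noun phrase
    after the wh-word."""
    tokens = re.findall(r"[A-Za-z]+", question.lower())
    phrase = []
    for tok in tokens[1:]:            # tokens[0] is the wh-word
        if not phrase and tok in SKIP:
            continue                  # copula or determiner before the phrase
        if tok in NP_END:
            break                     # preposition or verb closes the phrase
        phrase.append(tok)
    return phrase[-1] if phrase else None

def answer_type(question):
    # Rule 1: a regular-expression rule on the wh-word.
    if re.match(r"(?i)who\b", question):
        return "HUMAN:individual"
    # Rule 2: look up the question headword.
    return HEADWORD_TYPES.get(headword(question), "UNKNOWN")
```

The second example on the slide shows why the headword matters: in "What is the state flower of California?" the word "state" appears first, but the headword is "flower", so the sketch returns the plant type rather than a location type.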

Page 28

Answer Type Detection

• Most often, we treat the problem as machine learning classification
  • Define a taxonomy of question types
  • Annotate training data for each question type
  • Train classifiers for each question class using a rich set of features.
    • features include those hand-written rules!

Page 29

Features for Answer Type Detection

• Question words and phrases
• Part-of-speech tags
• Parse features (headwords)
• Named Entities
• Semantically related words
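
As a minimal sketch of the first feature family only (question words and phrases), the function below turns a question into a sparse feature dictionary that a classifier could consume. The feature naming scheme and the function name `question_features` are assumptions; part-of-speech, parse, named-entity and semantic features need external tools and are left out.

```python
import re

def question_features(question):
    """Map a question to a sparse {feature_name: 1} dictionary."""
    tokens = re.findall(r"[a-z0-9']+", question.lower())
    feats = {}
    if tokens:
        feats["wh=" + tokens[0]] = 1                    # question word
    for tok in tokens:
        feats["word=" + tok] = 1                        # bag of words
    for first, second in zip(tokens, tokens[1:]):
        feats["bigram=" + first + "_" + second] = 1     # short phrases
    return feats

feats = question_features("Who founded Virgin Airlines?")
```

For this question the dictionary holds one wh-word feature, four word features and three bigram features.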

Page 30

Factoid Q/A

[Figure: the IR-based factoid QA pipeline diagram (Question Processing, Passage Retrieval, Answer Processing)]

Page 31

Keyword Selection Algorithm

1. Select all non-stop words in quotations
2. Select all NNP words in recognized named entities
3. Select all complex nominals with their adjectival modifiers
4. Select all other complex nominals
5. Select all nouns with their adjectival modifiers
6. Select all other nouns
7. Select all verbs
8. Select all adverbs
9. Select the question focus word (skipped in all previous steps)
10. Select all other words

Dan Moldovan, Sanda Harabagiu, Marius Paşca, Rada Mihalcea, Richard Goodrum, Roxana Girju and Vasile Rus. 1999. Proceedings of TREC-8.

Page 32

Choosing keywords from the query

Who coined the term “cyberspace” in his novel “Neuromancer”?

Selected keywords, each marked with the step that selects it: cyberspace/1 Neuromancer/1 term/4 novel/4 coined/7

Slide from Mihai Surdeanu
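
The ordering in this example can be reproduced with a small sketch. The rule numbers below are copied by hand from the slide rather than computed; a real system would derive them from part-of-speech tags, named entities and quotation marks. The function name `select_keywords` and its `max_rule` and `limit` parameters are assumptions that show one way a priority order could be used to shorten or widen a query.

```python
# (keyword, rule number) pairs in question order, copied from the slide's
# example; the rule numbers are supplied by hand here, not computed.
TAGGED = [("coined", 7), ("term", 4), ("cyberspace", 1),
          ("novel", 4), ("Neuromancer", 1)]

def select_keywords(tagged, max_rule=10, limit=None):
    """Return keywords picked by rules 1..max_rule, highest priority first.

    Ties keep their order in the question because sorted() is stable.
    """
    picked = [(rule, word) for word, rule in tagged if rule <= max_rule]
    ordered = [word for rule, word in sorted(picked, key=lambda p: p[0])]
    return ordered if limit is None else ordered[:limit]

all_keywords = select_keywords(TAGGED)
strict_query = select_keywords(TAGGED, max_rule=1)
```

With no restriction the keywords come out in the slide's order; with `max_rule=1` only the quoted terms remain.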

Page 33

Passage Retrieval and Answer Extraction

Page 34

Factoid Q/A

[Figure: the IR-based factoid QA pipeline diagram (Question Processing, Passage Retrieval, Answer Processing)]

Page 35

Passage Retrieval

• Step 1: IR engine retrieves documents using query terms
• Step 2: Segment the documents into shorter units
  • something like paragraphs
• Step 3: Passage ranking
  • Use answer type to help rerank passages

Page 36

Features for Passage Ranking

• Number of Named Entities of the right type in passage
• Number of query words in passage
• Number of question N-grams also in passage
• Proximity of query keywords to each other in passage
• Longest sequence of question words
• Rank of the document containing passage

Either in rule-based classifiers or with supervised machine learning
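
A few of these features can be computed with plain string operations, as in the sketch below. It is an approximation under stated assumptions: tokenization is a simple regular expression, no stop words are removed, N-grams are limited to bigrams, "longest sequence of question words" is approximated by the longest run of consecutive passage tokens that all occur in the question, and the named-entity and proximity features are omitted. The names `tokenize` and `passage_features` are hypothetical.

```python
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def passage_features(question, passage, doc_rank):
    q = tokenize(question)
    p = tokenize(passage)
    q_words = set(q)
    q_bigrams = set(zip(q, q[1:]))
    # Longest run of consecutive passage tokens that are all question words.
    longest = run = 0
    for tok in p:
        run = run + 1 if tok in q_words else 0
        longest = max(longest, run)
    return {
        "query_words": len(q_words & set(p)),                   # shared words
        "shared_bigrams": len(q_bigrams & set(zip(p, p[1:]))),  # shared 2-grams
        "longest_run": longest,
        "doc_rank": doc_rank,                                   # from the IR engine
    }

feats = passage_features(
    "How tall is Mt. Everest?",
    "The official height of Mount Everest is 29035 feet",
    doc_rank=1,
)
```

Here the passage shares only "everest" and "is" with the question, which illustrates why surface overlap alone is a weak signal and why the answer type is also used for reranking.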

Page 37

Factoid Q/A

[Figure: the IR-based factoid QA pipeline diagram (Question Processing, Passage Retrieval, Answer Processing)]

Page 38

Answer Extraction

• Run an answer-type named-entity tagger on the passages
  • Each answer type requires a named-entity tagger that detects it
  • If answer type is CITY, tagger has to tag CITY
  • Can be full NER, simple regular expressions, or hybrid
• Return the string with the right type:
  • Who is the prime minister of India (PERSON)
    Manmohan Singh, Prime Minister of India, had told left leaders that the deal would not be renegotiated.
  • How tall is Mt. Everest? (LENGTH)
    The official height of Mount Everest is 29035 feet
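
For the "simple regular expressions" option, a tagger for the LENGTH type might look like the sketch below. The pattern, the unit list and the names `TAGGERS` and `extract_candidates` are assumptions made for illustration; a type such as PERSON would normally need a full named-entity recognizer instead.

```python
import re

# Illustrative pattern for the LENGTH answer type: a number followed by a unit.
LENGTH = re.compile(r"\b\d[\d,.]*\s*(?:feet|foot|ft|meters?|metres?|km|miles?)\b")

# One tagger per answer type; only LENGTH is defined in this sketch.
TAGGERS = {"LENGTH": LENGTH}

def extract_candidates(answer_type, passage):
    """Return every substring of `passage` tagged with `answer_type`."""
    tagger = TAGGERS[answer_type]
    return [m.group(0) for m in tagger.finditer(passage)]

passage = "The official height of Mount Everest is 29035 feet"
candidates = extract_candidates("LENGTH", passage)
```

On the slide's passage the only LENGTH candidate is "29035 feet".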

Page 39

Ranking Candidate Answers

• But what if there are multiple candidate answers?

Q: Who was Queen Victoria’s second son?
• Answer Type: Person
• Passage: The Marie biscuit is named after Marie Alexandrovna, the daughter of Czar Alexander II of Russia and wife of Alfred, the second son of Queen Victoria and Prince Albert


Page 41

Use machine learning: Features for ranking candidate answers

• Answer type match: Candidate contains a phrase with the correct answer type.
• Pattern match: Regular expression pattern matches the candidate.
• Question keywords: # of question keywords in the candidate.
• Keyword distance: Distance in words between the candidate and query keywords.
• Novelty factor: A word in the candidate is not in the query.
• Apposition features: The candidate is an appositive to question terms.
• Punctuation location: The candidate is immediately followed by a comma, period, quotation marks, semicolon, or exclamation mark.
• Sequences of question terms: The length of the longest sequence of question terms that occurs in the candidate answer.
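
Four of these features (question keywords, keyword distance, novelty factor, punctuation location) can be sketched for the Queen Victoria passage as follows. This is an illustration under assumptions: the candidate strings and the keyword list are chosen by hand, distance is measured in word positions to the nearest keyword, and the function names are hypothetical. In a real system the resulting feature vectors would be fed to a trained ranker.

```python
import re

PASSAGE = ("The Marie biscuit is named after Marie Alexandrovna, the daughter "
           "of Czar Alexander II of Russia and wife of Alfred, the second son "
           "of Queen Victoria and Prince Albert")
QUESTION = "Who was Queen Victoria's second son?"
KEYWORDS = ["Queen", "Victoria", "second", "son"]   # hand-picked for this sketch

def words(text):
    return re.findall(r"\w+", text.lower())

def find_span(tokens, phrase):
    """Return (start, end) word indices of the first occurrence of `phrase`."""
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            return i, i + n - 1
    raise ValueError("candidate not found in passage")

def candidate_features(candidate, question, keywords, passage):
    p_tokens = words(passage)
    c_tokens = words(candidate)
    q_tokens = set(words(question))
    start, end = find_span(p_tokens, c_tokens)
    kw = {k.lower() for k in keywords}
    kw_positions = [i for i, tok in enumerate(p_tokens) if tok in kw]

    def gap(i):
        # Distance in words from keyword position i to the candidate span.
        return start - i if i < start else max(0, i - end)

    return {
        "question_keywords": sum(tok in kw for tok in c_tokens),
        "keyword_distance": min((gap(i) for i in kw_positions), default=None),
        "novelty": any(tok not in q_tokens for tok in c_tokens),
        "punctuation": re.search(re.escape(candidate) + r'[,.;!"]', passage) is not None,
    }

alfred = candidate_features("Alfred", QUESTION, KEYWORDS, PASSAGE)
victoria = candidate_features("Queen Victoria", QUESTION, KEYWORDS, PASSAGE)
```

"Alfred" sits two words from the nearest keyword, is followed by a comma and contributes a new word, whereas "Queen Victoria" merely repeats the question, which is what the novelty factor is meant to penalize.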

Page 42

Candidate Answer scoring in IBM Watson

• Each candidate answer gets scores from >50 components
  • (from unstructured text, semi-structured text, triple stores)
  • logical form (parse) match between question and candidate
  • passage source reliability
  • geospatial location
    • California is “southwest of Montana”
  • temporal relationships
  • taxonomic classification

Page 43

Common Evaluation Metrics

1. Accuracy (does answer match gold-labeled answer?)
2. Mean Reciprocal Rank
  • For each query return a ranked list of M candidate answers.
  • Query score is 1/Rank of the first correct answer
    • If first answer is correct: 1
    • else if second answer is correct: ½
    • else if third answer is correct: ⅓, etc.
    • Score is 0 if none of the M answers are correct
  • Take the mean over all N queries

$$\mathrm{MRR} = \frac{\sum_{i=1}^{N} \frac{1}{\mathrm{rank}_i}}{N}$$
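
A direct transcription of the Mean Reciprocal Rank definition is short enough to show in full. The function names and the toy data are assumptions made for illustration; a gold set per query is used so that more than one string can count as correct.

```python
def reciprocal_rank(ranked, gold):
    """1/rank of the first correct answer in `ranked`, or 0.0 if none is correct."""
    for rank, answer in enumerate(ranked, start=1):
        if answer in gold:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """`runs` is a list of (ranked_answers, gold_answers) pairs, one per query."""
    return sum(reciprocal_rank(ranked, gold) for ranked, gold in runs) / len(runs)

# Toy data: correct at rank 1, correct at rank 2, and no correct answer.
runs = [
    (["a", "b", "c"], {"a"}),
    (["a", "b", "c"], {"b"}),
    (["a", "b", "c"], {"z"}),
]
mrr = mean_reciprocal_rank(runs)
```

The three queries score 1, ½ and 0, so the mean is (1 + ½ + 0) / 3 = 0.5.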

Page 44

Knowledge in QA

Page 45

Relation Extraction

• Answers: Databases of Relations
  • born-in(“Emma Goldman”, “June 27 1869”)
  • author-of(“Cao Xue Qin”, “Dream of the Red Chamber”)
  • Draw from Wikipedia infoboxes, DBpedia, FreeBase, etc.
• Questions: Extracting Relations in Questions
  Whose granddaughter starred in E.T.?
  (acted-in ?x “E.T.”)
  (granddaughter-of ?x ?y)
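
Once both the answers and the question are expressed as relations, answering reduces to solving the two patterns jointly over the database. The sketch below does this over a toy in-memory set of triples; the facts are illustrative entries added for this example, and the function name is hypothetical.

```python
# Toy relation database; the facts below are illustrative entries of the kind
# one would draw from infoboxes or DBpedia.
FACTS = {
    ("acted-in", "Drew Barrymore", "E.T."),
    ("acted-in", "Henry Thomas", "E.T."),
    ("granddaughter-of", "Drew Barrymore", "John Barrymore"),
}

def whose_granddaughter_starred_in(film, facts):
    """Solve (acted-in ?x film) and (granddaughter-of ?x ?y); return each ?y."""
    # First pattern: bind ?x to everyone who acted in the film.
    actors = {x for rel, x, obj in facts if rel == "acted-in" and obj == film}
    # Second pattern: keep ?y only where ?x is one of those actors.
    return sorted(y for rel, x, y in facts
                  if rel == "granddaughter-of" and x in actors)

answer = whose_granddaughter_starred_in("E.T.", FACTS)
```

The shared variable ?x is what links the two relations: only an actor in the film who also appears as a granddaughter yields a binding for ?y.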

Page 46

Temporal Reasoning

• Relation databases
  • (and obituaries, biographical dictionaries, etc.)
• IBM Watson
  “In 1594 he took a job as a tax collector in Andalusia”
  Candidates:
  • Thoreau is a bad answer (born in 1817)
  • Cervantes is possible (was alive in 1594)
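
The temporal check in this example amounts to comparing the year in the clue with each candidate's lifespan. A minimal sketch, assuming a small hand-built lifespan table: Thoreau's birth year comes from the slide, while the remaining years are added here for illustration, and the name `alive_in` is hypothetical.

```python
# (birth year, death year); Thoreau's birth year is from the slide, the other
# years are added for this sketch.
LIFESPANS = {"Thoreau": (1817, 1862), "Cervantes": (1547, 1616)}

def alive_in(person, year, lifespans):
    """True if `year` falls within the person's recorded lifespan."""
    born, died = lifespans[person]
    return born <= year <= died

# Keep only candidates who could have taken a job in 1594.
plausible = [p for p in LIFESPANS if alive_in(p, 1594, LIFESPANS)]
```

Only Cervantes survives the filter; a scorer like Watson's would treat this as one piece of evidence among many rather than as a hard constraint.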

Page 47

Geospatial knowledge (containment, directionality, borders)

• Beijing is a good answer for “Asian city”
• California is “southwest of Montana”
• geonames.org:

Page 48

Context and Conversation in Virtual Assistants like Siri

• Coreference helps resolve ambiguities
  U: “Book a table at Il Fornaio at 7:00 with my mom”
  U: “Also send her an email reminder”
• Clarification questions:
  U: “Chicago pizza”
  S: “Did you mean pizza restaurants in Chicago or Chicago-style pizza?”