Information Extraction - Columbia University, kathy/nlp/2017/classslides/...
Information Extraction
Announcements
• Course evaluation: please fill out
• HW4 extended to noon, 12/4
• Thursday: bring your laptop!
• Poetry generation
• Final review
• Final exam: 12/21, final is cumulative
• What topics would you like to hear about again?
• Dan Jurafsky’s talk: 5pm today, CEPSR auditorium. Hope to see you all there! “Does this Vehicle Belong to You?”
Enabling Connections in a Hyperconnected World through Emotion AI
Taniya Mishra, Affectiva
• Date: Wednesday, December 6th
• Time: 7-8pm
• Location: Room 504 of the Diana Center
• Abstract:
• We live in a hyperconnected world powered by smart devices. A big part of building connections is recognizing and responding to emotions. But our smart devices still lack this fundamental aspect of social communication, rendering our interactions with or through them superficial and limited. Now imagine if we could empower our devices with intelligence to recognize human emotions. The results would be transformative, ranging from empathetic chatbots to personalized digital signage to smart cars that ensure the comfort and safety of passengers by recognizing their emotions. Emotion AI (emotion estimation via artificial intelligence) can make this possible.
• In this talk, I will present what emotion AI is; its practical and consumer applications; the process of developing algorithms and models to estimate emotions from different modalities with a particular focus on voice, which is my area of specialization. I will also briefly present my career trajectory and how I came to participate in this exciting field.
Clarifications from last time.
Clarification: This is phrase-based MT (top) versus neural MT (for identification of phrases) (bottom), as I presented last time.
Cho et al. 2014
Performance
• Without attention, LSTM works quite well until a sentence gets longer than 30 words
• Attention does better, however, even with shorter sentences
• Other tricks in WMT 2017:
• Improvements of 1.5–3 BLEU points (Edin)
• Layer normalization, deeper networks (encoder depth of 5, decoder depth of 8)
• Base Phrase Encodings (BPE)
• Correction: these are actually “byte pair encoding”. They are used to identify sub-words and produce only subwords seen in the training corpus. This restriction reduces the vocabulary.
• Reduced vocabulary improves memory efficiency
• Data: parallel, back-translated, duplicated monolingual
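The byte pair encoding idea above can be sketched roughly as follows. This is a minimal illustration, not the implementation used in the WMT 2017 systems: the function name `learn_bpe`, the toy word counts, and the number of merges are all assumptions made for the example.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a {word: count} dictionary."""
    # Each word starts as a tuple of single-character symbols.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab


# Hypothetical toy corpus counts, for illustration only.
merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 2)
```

Because merges are learned only from pairs observed in the training counts, the resulting subword inventory is limited to units seen in training, which is the vocabulary restriction the slide refers to.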
Information Extraction
• Extraction of concrete facts from text
• Named entities, relations, events
• Often used to create a structured knowledge base of facts
• Kathy McKeown, a professor from Columbia University in New York City, took a train yesterday to Washington DC.
Named Entities
• [Kathy McKeown]per, a professor from [Columbia University]org in [New York City]loc, took a train yesterday to [Washington DC]loc.
Named Entities, Relations
• Kathy McKeown from Columbia
• Columbia in New York City
Named Entities, Relations, Events
• Kathy McKeown took a train (yesterday)
Entity Discovery and Linking
• Kathy McKeown, a professor from Columbia University in New York City, took a train yesterday to Washington DC.
IE for Template Filling
Relation Detection
Given a set of documents and a domain of interest, fill a table of required fields.
• For example: number of car accidents per vehicle type and number of casualties in the accidents.
Never-Ending Language Learner (Tom Mitchell, CMU)
• Can computers learn to read?
• Browses the web and attempts to extract facts from hundreds of millions of web pages
• Attempts to improve its methods and accuracy
• To date, 50 million candidate facts at different levels of confidence
• http://rtw.ml.cmu.edu/rtw/
IE for Question Answering
Q: When was Gandhi born? A: October 2, 1869
Q: Where was Bill Clinton educated? A: Georgetown University in Washington, D.C.
Q: What was the education of Yassir Arafat? A: Civil Engineering
Q: What is the religion of Noam Chomsky? A: Jewish
State of the Art (English)
F-measure:
• Named entities (news): 89%
• Relations (slot filling): 59%
• Events (nuggets): 63%
Methods: sequence labeling (MEMM, CRF), neural nets, distant learning
Features: linguistic features, similarity, popularity, gazetteers, ontologies, verb triggers
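For reference, the F-measure reported above combines precision and recall. A minimal sketch of the standard computation follows; the function name `f_measure` and the example counts are illustrative assumptions, not figures from the evaluations cited on the slide.

```python
def f_measure(true_positives, false_positives, false_negatives, beta=1.0):
    """F-beta score from raw counts; beta=1 gives the usual F1."""
    predicted = true_positives + false_positives
    gold = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / gold if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 6 correct extractions, 2 spurious, 4 missed.
score = f_measure(6, 2, 4)
```

With these made-up counts, precision is 0.75 and recall is 0.6, so the harmonic mean lands between the two and closer to the lower value.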
Where Have You Been, Entity Discovery and Linking? Grow with DEFT (Heng Ji, RPI)
Aspect | 2006-2011 | 2012-2017
Mention Extraction | Human (most) | Automatic
NIL Clustering | None | 64 methods
Foreign Languages | Chinese (5%-10% lower than English) | System for 282 languages (Chinese/Spanish comparable to/outperform English); research toward 3,000 languages
Document Size | - | 500 → 90,000 documents
Genre | News, web blog | News, Discussion Forum, Web blog, Tweets
Entity Types | PER, GPE, ORG | PER, GPE, ORG, LOC, FAC, hundreds of fine-grained types for typing
Mention Types | Name or all concepts (most) | Name, Nominal, Pronoun (for BeST)
KB | Wikipedia | Freebase → List only
Training Data | 20,000 queries (entity mentions) | 500 → 0 documents; unsupervised linking comparable to supervised linking
# (Good) Papers | 62 | 110 (new KBP track at ACL); 6 tutorials at top conferences
Slide from Heng Ji
Approach for NER
• <PERSON>Alexander Mackenzie</PERSON>, (<TIMEX>January 28, 1822</TIMEX> - <TIMEX>April 17, 1892</TIMEX>), a building contractor and writer, was the second Prime Minister of <GPE>Canada</GPE> from ….
• Statistical sequence labeling techniques can be used, similar to POS tagging
• Word-by-word sequence labeling
• Example features: POS tags, syntactic constituents, shape features, presence in a named entity list
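The per-word features listed for NER might be computed along these lines before being handed to a sequence labeler such as a CRF. This is only a sketch: `word_shape`, `token_features`, the gazetteer set, and the neighboring-word features are assumptions added for illustration, and syntactic constituent features are left out because they would require a parser.

```python
import re


def word_shape(token):
    """Map characters to X/x/d and collapse runs, e.g. 'McKeown' -> 'XxXx'."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    return re.sub(r"(.)\1+", r"\1", shape)


def token_features(tokens, pos_tags, i, gazetteer):
    """Feature dictionary for the token at position i."""
    token = tokens[i]
    return {
        "word": token.lower(),
        "pos": pos_tags[i],
        "shape": word_shape(token),
        "in_gazetteer": token.lower() in gazetteer,
        # Neighboring words are an added assumption, not from the slide.
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }


tokens = ["Kathy", "McKeown", "took", "a", "train"]
pos_tags = ["NNP", "NNP", "VBD", "DT", "NN"]  # hand-assigned tags
features = token_features(tokens, pos_tags, 1, {"kathy", "mckeown"})
```

One such dictionary per token, paired with a BIO-style label, would form the training input for a word-by-word sequence labeler.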
Supervised Approach for Relation Detection
• Given a corpus of annotated relations between entities, train a classifier:
• A binary classifier: given a span of text and two entities, decide if there is a relationship between these two entities
• Features: types of the two named entities, bag of words, POS of words in between
• Example: A rented SUV went out of control on Sunday, causing the death of seven people in Brooklyn
• Relation: Type = Accident, Vehicle Type = SUV, casualty = 7, weather = ?
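The three feature families named for the supervised relation classifier could be extracted roughly as below. The function name `relation_features`, the `(start, end, type)` entity encoding, and the example sentence tags are assumptions for this sketch; the resulting dictionary could be fed to any binary classifier.

```python
def relation_features(tokens, pos_tags, entity1, entity2):
    """Features for a candidate entity pair.

    Each entity is (start, end, type) with an exclusive end index,
    and entity1 is assumed to precede entity2 in the sentence.
    """
    (s1, e1, t1), (s2, e2, t2) = entity1, entity2
    features = {"type_pair": f"{t1}-{t2}"}
    # Bag of words over the span covering both entities.
    for tok in tokens[s1:e2]:
        features[f"bow={tok.lower()}"] = 1
    # POS tags of the words strictly between the two entities.
    features["pos_between"] = " ".join(pos_tags[e1:s2])
    return features


tokens = ["Kathy", "McKeown", "works", "at", "Columbia", "University"]
pos_tags = ["NNP", "NNP", "VBZ", "IN", "NNP", "NNP"]  # hand-assigned tags
feats = relation_features(tokens, pos_tags, (0, 2, "PER"), (4, 6, "ORG"))
```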
Pattern Matching for Relation Detection
• Patterns:
• “[CAR_TYPE] went out of control on [TIMEX], causing the death of [NUM] people”
• “[PERSON] was born in [GPE]”
• “[PERSON] was graduated from [FAC]”
• “[PERSON] was killed by <X>”
• Matching techniques
• Exact matching: pros and cons?
• Flexible matching (e.g., [X] was .* killed .* by [Y]): pros and cons?
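The contrast between exact and flexible matching can be tried out with regular expressions. In this sketch a run of capitalized words stands in for a real entity tagger, which is an assumption, as are the pattern names and the example sentences.

```python
import re

# A run of capitalized words approximates an entity mention (assumption).
NAME = r"[A-Z]\w+(?: [A-Z]\w+)*"
EXACT = re.compile(rf"(?P<victim>{NAME}) was killed by (?P<agent>{NAME})")
FLEXIBLE = re.compile(rf"(?P<victim>{NAME}) was .*killed.*by (?P<agent>{NAME})")


def extract(pattern, sentence):
    """Return (victim, agent) if the pattern matches, else None."""
    match = pattern.search(sentence)
    return (match.group("victim"), match.group("agent")) if match else None


plain = "John Smith was killed by Jane Doe"
wordy = "John Smith was reportedly killed last night by Jane Doe"
tricky = "Alice Brown was not killed, but was helped by Bob Stone"

exact_hits = [extract(EXACT, s) for s in (plain, wordy, tricky)]
flexible_hits = [extract(FLEXIBLE, s) for s in (plain, wordy, tricky)]
```

On these sentences the exact pattern only fires on the first one, so it misses the reworded second sentence (lower recall). The flexible pattern recovers the second sentence but also fires on the third, where no killing took place (lower precision).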
Pattern Matching
• How can we come up with these patterns?
• Manually? Task- and domain-specific; tedious, time consuming, not scalable
• Machine learning, semi-supervised approaches
Task: Produce a biography of [person]
1. Name(s), aliases:
2. *Date of Birth or Current Age:
3. *Date of Death:
4. *Place of Birth:
5. *Place of Death:
6. Cause of Death:
7. Religion (Affiliations):
8. Known locations and dates:
9. Last known address:
10. Previous domiciles:
11. Ethnic or tribal affiliations:
12. Immediate family members
13. Native Language spoken:
14. Secondary Languages spoken:
15. Physical Characteristics
16. Passport number and country of issue:
17. Professional positions:
18. Education
19. Party or other organization affiliations:
20. Publications (titles and dates):
Biography: two approaches
• To obtain high precision, handle each slot independently, using bootstrapping to learn IE patterns.
• To improve recall, utilize a biographical sentence classifier.
Bouncing back and forth
• Worked well for fields such as education, publications, immediate family members, party, other organization activities
• Did not work well for other fields, including religion, ethnic or tribal affiliations, previous domiciles -> too much noise
• Why is the bouncing idea better than using only one corpus?
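A toy version of bootstrapping for a single slot (here, place of birth) might look like the following. It is a sketch under strong assumptions: patterns are just the literal text between a known pair, capitalized word runs stand in for entities, and a single sentence list is used, so the bouncing between corpora described above is not modeled. All function names and sentences are made up for the example.

```python
import re

NAME = r"([A-Z]\w+(?: [A-Z]\w+)*)"  # crude stand-in for an entity tagger


def learn_patterns(sentences, pairs):
    """Use the text between a known (person, value) pair as a pattern."""
    patterns = set()
    for sent in sentences:
        for person, value in pairs:
            i, j = sent.find(person), sent.find(value)
            if i != -1 and j != -1 and i + len(person) < j:
                patterns.add(sent[i + len(person):j])
    return patterns


def apply_patterns(sentences, patterns):
    """Find new (person, value) pairs connected by a learned pattern."""
    found = set()
    for middle in patterns:
        regex = re.compile(NAME + re.escape(middle) + NAME)
        for sent in sentences:
            found.update(regex.findall(sent))
    return found


def bootstrap(sentences, seeds, rounds=2):
    """Alternate between learning patterns and harvesting new pairs."""
    pairs = set(seeds)
    for _ in range(rounds):
        pairs |= apply_patterns(sentences, learn_patterns(sentences, pairs))
    return pairs


corpus = [
    "Mahatma Gandhi was born in Porbandar.",
    "Bill Clinton was born in Hope.",
    "Bill Clinton graduated from Georgetown University.",
]
birthplaces = bootstrap(corpus, {("Mahatma Gandhi", "Porbandar")})
```

Starting from one seed pair, the loop learns the context " was born in " and uses it to pick up a second birthplace, while the education sentence is left alone because no seed for that slot was given. Without any filtering of patterns, noisy contexts would propagate in the same way, which is one plausible reading of the "too much noise" remark.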
How are neural nets used for IE?
Organizing knowledge
It’s a version of Chicago – the standard classic Macintosh menu font, with that distinctive thick diagonal in the “N”.
Chicago was used by default for Mac menus through MacOS 7.6, and OS 8 was released mid-1997.
Chicago VIII was one of the early 70s-era Chicago albums to catch my ear, along with Chicago II.
Slide from Heng Ji
Cross-document co-reference resolution
Slide from Heng Ji
Reference resolution (disambiguation to Wikipedia)
Slide from Heng Ji
The “Reference” Collection has Structure
Relation labels shown in the figure: Used_In, Is_a, Is_a, Succeeded, Released
Slide from Heng Ji
Analysis of Information Networks
Slide from Heng Ji
Here: Wikipedia as a knowledge resource … but we can use other resources
Slide from Heng Ji
Wikification: The Reference Problem
Blumenthal (D) is a candidate for the U.S. Senate seat now held by Christopher Dodd (D), and he has held a commanding lead in the race since he entered it. But the Times report has the potential to fundamentally reshape the contest in the Nutmeg State.
Cycles of Knowledge: Grounding for/using Knowledge
Slide from Heng Ji
Task Definition
• A formal definition of the task consists of:
1. A definition of the mentions (concepts, entities) to highlight
2. Determining the target encyclopedic resource (KB)
3. Defining what to point to in the KB (title)
Slide from Heng Ji
Examples of Mentions (1)
Blumenthal (D) is a candidate for the U.S. Senate seat now held by Christopher Dodd (D), and he has held a commanding lead in the race since he entered it. But the Times report has the potential to fundamentally reshape the contest in the Nutmeg State.
Slide from Heng Ji
Neural Approach to Entity Linking (Wikification)
Gupta, Singh and Roth, EMNLP 2017
• Learns a dense, unified representation of entities
• Encodes semantic and background knowledge from multiple sources
• An encoder for each source of information
• Entity embeddings learned to be similar to encodings
• Only uses indirect supervision from Wikipedia/Freebase
• Can incorporate new entities without retraining existing representations
• http://cogcomp.org/papers/GuptaSiRo17.pdf
Jointly Embedding Entity Information
Look at Wikipedia
• Entity description: https://en.wikipedia.org/wiki/India_national_cricket_team
Encoding the mention context
In 1932, India played their first game in England.
• The example sentence contains two mentions: “India” and “England”
• Aim to disambiguate “India” to the team
• Local context: “played” and “match”
• Document context: to identify the sport
• Preserve the semantics: “England” should not match to a team
Local Context
• Given mention m in sentence: {w1, …, m, …, wN}
• Left LSTM applied to w1 … m
• Right LSTM applied to m … wN
• The two outputs are concatenated and passed through a single-layer feedforward network
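The input preparation for the two LSTMs can be sketched as a simple split of the sentence around the mention. Only the slicing is shown; the LSTMs and the feedforward layer are omitted, and the function name `split_context` and the whitespace tokenization are assumptions for illustration.

```python
def split_context(tokens, mention_start, mention_end):
    """Split a sentence into left and right contexts around a mention.

    The mention occupies tokens[mention_start:mention_end] and is
    included in both contexts, mirroring w1 ... m and m ... wN.
    """
    left = tokens[:mention_end]      # w1 ... m, input to the left LSTM
    right = tokens[mention_start:]   # m ... wN, input to the right LSTM
    return left, right


tokens = "In 1932 , India played their first game in England .".split()
left, right = split_context(tokens, 3, 4)  # the mention is "India"
```

Here the left context ends at "India" and the right context starts at it, so each encoder sees the mention together with one side of the sentence.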
Document Context Encoder
• Bag of mentions vector: USA, Pearl Jam, Nasser Hussain
• Compressed to a low-dimensional representation using a single-layer feedforward neural network
• Combine local and document representations to get a mention-level encoding, using concatenation and feeding the result through a single-layer feedforward network
Encoding Entity Description D
• Embed each word of the Wikipedia description as a d-dimensional vector
• Encode as a fixed vector using a CNN
Learning the Type Representation
• Embed type T in Freebase
• Each entity can have multiple types
• Jointly learn entity and type representations
Learning Unified Entity Representations
• Separate models for entity mentions, entity descriptions, type descriptions
• To learn the different entity representations and their parameters, jointly maximize the total objective, where v_e are the set of entity representations and θ are the parameters
Looking forward
• More languages: 3000!
• Multi-media
• Streaming mode
• No more training data
• Context-aware, living