tom finin: “from strings to things: populating knowledge bases from text”
TRANSCRIPT
FromStringstoThings:Popula4ngKnowledge
BasesfromText
TimFininUniversityofMaryland,Bal6moreCounty
IBMCogni6veSystemsIns6tuteSpeakerSeriesJuly14,2016
h"p://ebiq.org/r/369
Butitwasdesignedforpeople,notmachines• Itscontentismostlytext,spokenlanguage,imagesandvideos• Theseareeasyforpeopletounderstand• Buthardformachines
Machinesneedaccesstothisknowledgetoo
Weneedtoaddknowledgegraphs• Highqualitysemi-structuredinforma6onabouten66esandrela6ons• RepresentedandaccessedviaWebstandards• Easilyintegrated,fusedandreasonedwith
Almostallcommercialrecipesitesembedseman4cdataabouttherecipesinanRDF-compa6bleformusingtermsfromtheschema.orgontology.
SearchenginesreadandusethisdatatobeRerunder-standtheseman6csofthepagecontent
WheredoesKnowledgecomefrom• Exis6ngknowledgegraphsquicklydevelopedwithsemi-structureddatafromWikipedia,augmentedwith– Structureddatabases(Geonames)– Semi-structureddatacollec6ons(CIAFactbook)
• Maintainingandextendingthemandcrea6ngfocusedknowledgegraphswillincreasinglyrequireextrac4nginforma4onfromtext
NISTTextAnalysisConferenceYearlyevalua6onworkshopsonnaturallanguageprocessingandrelatedapplica6onswithlargetestcollec6onsandcommonevalua6onprocedures TACTracks 08 09 10 11 12 13 14 15
Ques6onAnswering X
TextualEntailment X X X X
Summariza6on X X X X X
KnowledgeBasePopula6on X X X X X X X
En6tyDiscovery&Linking X X
EventDetec6on X X
Valida6on X
hRp://www.nist.gov/tac
KnowledgeBasePopula4on
Tracksfocusedonbuildingaknowledgebasefromen66esandrela6onsextractedfromtext• ColdStartKBP:constructaKBfromtext• En4tydiscovery&linking:clusterandlinken6tymen6ons
• Slotfilling• Slotfillervalida6on• Sen6ment• Events:discoverandclustereventsintext
KnowledgeBasePopula4on*Buildasystemthatcan• Read90Kdocuments:newswirear6cles&socialmediapostsinEnglish,ChineseandSpanish
• Finden6tymen6ons,typesandrela6ons• Clusteren66eswithinandacrossdocumentsandlinktoareferenceKBwhenappropriate
• Removeerrors(ObamaborninIllinois),drawsoundinferences(MaliaandSashasisters)
• Generatetripleswithprovenancedata(Doc+stringoffsets)formen6onsandrela6ons *2016
KnowledgeBasePopula4on*Buildasystemthatcan• Read90Kdocuments:newswirear6cles&socialmediapostsinEnglish,ChineseandSpanish
• Finden6tymen6ons,typesandrela6ons• Clusteren66eswithinandacrossdocumentsandlinktoareferenceKBwhenappropriate
• Removeerrors(ObamaborninIllinois),drawsoundinferences(MaliaandSashasisters)
• Generatetripleswithprovenancedata(Doc+stringoffsets)formen6onsandrela6ons *2016
<DOCid="APW_ENG_20100325.0021"type="story"><HEADLINE>DivorceaRorneysaysDennisHopperisdying</HEADLINE><DATELINE>LOSANGELES2010-03-2500:15:51UTC</DATELINE><TEXT<P>DennisHopper'sdivorceaRorneysaysinacourtfilingthattheactorisdyingandcan'tundergochemotherapyashebaRlesprostatecancer.</P><P>ARorneyJosephMannisdescribedthe"EasyRider"star'sgravecondi6oninadeclara6onfiledWednesdayinLosAngelesSuperiorCourt.</P><P>MannisandaRorneysforHopper'swifeVictoriaarefigh6ngoverwhenandwhethertotaketheactor'sdeposi6on.</P>…
KnowledgeBasePopula4on*Buildasystemthatcan• Read90Kdocuments:newswirear6cles&socialmediapostsinEnglish,ChineseandSpanish
• Finden6tymen6ons,typesandrela6ons• Clusteren66eswithinandacrossdocumentsandlinktoareferenceKBwhenappropriate
• Removeerrors(ObamaborninIllinois),drawsoundinferences(MaliaandSashasisters)
• Generatetripleswithprovenancedata(Doc+stringoffsets)formen6onsandrela6ons *2016
<DOCid="APW_ENG_20100325.0021"type="story"><HEADLINE>DivorceaRorneysaysDennisHopperisdying</HEADLINE><DATELINE>LOSANGELES2010-03-2500:15:51UTC</DATELINE><TEXT<P>DennisHopper'sdivorceaRorneysaysinacourtfilingthattheactorisdyingandcan'tundergochemotherapyashebaRlesprostatecancer.</P><P>ARorneyJosephMannisdescribedthe"EasyRider"star'sgravecondi6oninadeclara6onfiledWednesdayinLosAngelesSuperiorCourt.</P><P>MannisandaRorneysforHopper'swifeVictoriaarefigh6ngoverwhenandwhethertotaketheactor'sdeposi6on.</P>…
…:e00211 type PER:e00211 link FB:m.02fn5:e00211 link WIKI:Dennis_Hopper:e00211 mention "Dennis Hopper" APW_021:185-197:e00211 mention "Hopper" APW_021:507-512:e00211 mention "Hopper" APW_021:618-623:e00211 mention "丹尼斯·霍珀” CMN_011:930-936:e00211 per:spouse :e00217 APW_021:521-528:e00217 per:spouse :e00211 APW_021:521-528:e00211 per:age "72" APW_021:521-528…
KnowledgeBasePopula4on*Buildasystemthatcan• Read90Kdocuments:newswirear6cles&socialmediapostsinEnglish,ChineseandSpanish
• Finden6tymen6ons,typesandrela6ons• Clusteren66eswithinandacrossdocumentsandlinktoareferenceKBwhenappropriate
• Removeerrors(ObamaborninIllinois),drawsoundinferences(MaliaandSashasisters)
• Generatetripleswithprovenancedata(Doc+stringoffsets)formen6onsandrela6ons *2016
<DOCid="APW_ENG_20100325.0021"type="story"><HEADLINE>DivorceaRorneysaysDennisHopperisdying</HEADLINE><DATELINE>LOSANGELES2010-03-2500:15:51UTC</DATELINE><TEXT<P>DennisHopper'sdivorceaRorneysaysinacourtfilingthattheactorisdyingandcan'tundergochemotherapyashebaRlesprostatecancer.</P><P>ARorneyJosephMannisdescribedthe"EasyRider"star'sgravecondi6oninadeclara6onfiledWednesdayinLosAngelesSuperiorCourt.</P><P>MannisandaRorneysforHopper'swifeVictoriaarefigh6ngoverwhenandwhethertotaketheactor'sdeposi6on.</P>…
…:e00211 type PER:e00211 link FB:m.02fn5:e00211 link WIKI:Dennis_Hopper:e00211 mention "Dennis Hopper" APW_021:185-197:e00211 mention "Hopper" APW_021:507-512:e00211 mention "Hopper" APW_021:618-623:e00211 mention "丹尼斯·霍珀” CMN_011:930-936:e00211 per:spouse :e00217 APW_021:521-528:e00217 per:spouse :e00211 APW_021:521-528:e00211 per:age "72" APW_021:521-528…
KBEvalua4onMethodology
• Evalua6ngKBsextractedfrom90Kdocumentsisnon-trivial
• TAC’sapproachissimplifiedby:– Fixingtheontologyofen6tytypesandrela6ons– Specifyingaserializa6onastriples+provenance– SamplingaKBusingasetofqueriesgroundedinanen#tymen#onfoundinadocument
• GivenaKB,wecanevaluateitsprecisionandrecallforasetofqueries
KBEvalua4onMethodology• EX:WhatarethenamesofschoolsaRendedbythechildrenoftheen6tymen6onedindocument#45611atcharacters401-412– Thatmen6onisGeorgeBushandthedocumentcontextsuggestsitreferstothe41stU.S.president– QuerygiveninstructuredformusingTAContology
• Assessorsdeterminegoodanswersgiventextcorpusandchecksubmissionresultsusingprovenanceasneeded– Yale,Harvard,UTAus6n,Tulane,Univ.ofVirginia,BostonCollege,...
TACOntology• Fivebasicen6tytypes– PER:people(JohnLennon)orgroups(Americans)– ORG:organiza4onslikeIBM,MITorUSSenate– GPE:geopoli4calen6tylikeBoston,BelgiumorEurope– LOC:loca4onslikeCentralParkortheRockies– FAC:facili4eslikeBWIortheEmpireStateBuilding
• En6tyMen6ons– Stringsreferencingen66esbyname(BarackObama)ordescrip6on(thePresident)
• ~65rela6ons– Rela6onsholdbetweentwoen66es:parent_of,spouse,employer,founded_by,city_of_birth,…
– Orbetweenanen6ty&string:age,website,6tle,cause_of_death,...
Kelvin• KELVIN:KnowledgeExtrac6on,Linking,Valida6onandInference
• Informa6onextrac6onsystemdevelopedattheHumanLanguageTechnologyCenterofExcellenceatJHU
• UsedintheTACKBPtracksin2012-2016andvariousprojects
• Documentsin,triplesout(withprovenance)
h"p://ebiq.org/p/730
Kelvin’sbasicpipeline
1. Documentlevelanalysisforen6ty&rela6onextrac6on
2. Cross-documentco-referencetocreateini6alKB
3. KBlinking,inferenceandadjudica6on
4. MaterializetheKB
Document-levelprocessing• Doneinparallelonagrid-basedsystem• Extractnameden66es,men6ons,events,rela6ons,valuesusingseveraltools(BBNSerif,StanfordCoreNLP,customsystems)• MapresultstoTAContology• Resolveinconsistenciesinen6tymen6onchains• Produces“singledocument”KBswithTACcomplianttriplesandprovenancedata• CS15:50Kdocs,1.9Mmen4ons,800Ken44es,2Mfacts
1
X-docCoref:crea4ngini4alKB• Cross-documentco-referencecreatesanini6alKBfromthecollec6onofsingle-documentKBs- Iden6fythatBarackObamaen6tyinDOC32issameindividualasObamainDOC342,etc.
• Usesagglomera#veclusteringonsimilarityofen4tymen4ons&co-men4oneden44es- Bothdocsalsomen6onWhiteHouse,Biden,D.C.
- Variousstringsimilaritymeasures,namefrequency• WorkswellonEnglish,Spanish,Chinese
2
Kripke• Kelvinclustered~1Mdocumenten##esinto~300KKBen##es• Conserva6ve,tendingtounder-ratherthanover-merge• TypicalpaRernisonelargeclusterandalongtailofsmallones• BarackObamahadoneclusterof8Kdoc.en66esand~100smallones
0
2000
4000
6000
8000
10000
72singletonObamaclusters+15withjusttwo
Huge8kcluster
2
Inferenceandadjudica4on
Inferenceandadjudica6onrulesareusedto• Deleterela6onsviola6ngKBconstraints– Apersoncan’tbeborninanorganiza#on• Addrela6onsfromsoundinferencerules– Twopeoplesharingaparentaresiblings– BorninplaceP1,P1part_ofP2=>BorninP2• Mergeen66esusingrules– Mergeci#esinsameGeo.regionwithsamenames
• Addrela6onsbydefaultorheuris6crules– Apersonisprobablyaci#zenoftheircountryofbirth
3
En4tyLinking• Simplestrategytolinktexten66estoreferenceKB• TACuseslastFreebaseKB,oursubsethas- ~4.5Men66esand~150Mtriples- NamesandtextinEnglish,SpanishandChinese
• En6tylinkingcomparesen6tymen#onswithFreebaseen6tynames&aliases,weightedby– Howoyeneachmen6onusedforen6ty– En6tysignificance(logofinlinks(1..20))
• Results:setofcandidatesandascoreforeach– Rejectcandidatesiftoomanyorscoreslow
3
KB-levelmergingrules• Mergeci6eswithsamenameinsamestate
• Highlydiscrimina6verela6onsgiveevidenceofsameness– per:spouseisfewtofew– org:top_level_empisfewtofew
• MergePERswithsimilarnameswhowere– bothCEOsofthesamecompany,or– Bothmarriedtothesameperson
• Mergeen66eslinkedtosameFreebaseen6ty
3
SlotValueConsolida4on• Problem:toomanyvaluesforsomeslots,especiallyfor‘popular’en66es,e.g.– Anen6tywithfourdifferentper:agevalues– Obamahas>100per:employee_ofvalues
• Strategy:rankvaluesandselectbest– RankaRestedvaluesby#ofaRes6ngdocs– RankinferredvaluesbelowaRestedvalues– Selectbestvalueforsingle-valuesslot– SelectbestNvaluesformul6-valuedslots
3
InferenceinColdStart• InCSwecanonlyusesoundinferencerules,e.g.– Peoplearesiblingsiftheyshareaparent– PeoplebornincityXwereborninregionYifXisinY• Defaultandheuris6crulesarenotaRested– Youprobablyresidedinacitywereyouwenttoschool• 2014baserunfound848aRestedsiblingrela6ons,inferenceadded132,16%more• Drawingsoundinferencesfrom{good|bad}factsyields{good|bad}results
3 ⊨
MaterializeKBversionsSomeworkisneededtoencodeournascentKBinanactualKBsystemorserializa6on• TACrequirecustomserializa6onwithspecificrequirements,e.g.,atmostfourprovenancestringsoflength≤150characters– Pickbestforslotswithmanyprovenancestrings
• RDFrequiresanOWLschemaandusingreifica6onformetadata,e.g.,men6onandrela6onprovenance
WeuseRDFforsubsequentuse
4
2015TACResults
• TACColdStartKnowledgeBasePopula6on– Eightteamspar6cipatedin2015ColdStartKBP– Weplaced2ndand3rddependingonthemetric
• TACEn6tyDiscoveryandlinking• SixteamssubmiRedrunsforallthreelanguages,8forENG,7forCMNand7forSPA• Weplacedthirdforalllanguages,secondforChineseandthirdforEnglish
LessonsLearned• Wealwayshavetomindprecision&recall• Extrac6nginforma6onfromtextisinherentlynoisy;readingmoretexthelpsboth
• Usingmachinelearningateverylevelisimportant
• Makingmoreuseofprobabili6eswillhelp• Extrac6nginforma6onaboutaeventsishard• Recognizingthetemporalextentofrela6onsisimportant,buts6llachallenge
Conclusion
• KBshelpinextrac6nginforma6onfromtext• Theinforma6onextractedcanupdatetheKBs• TheKBsprovidesupportfornewtasks,suchasques6onansweringandspeechinterfaces
• We’llseethisapproachgrowandevolveinthefuture
• NewmachinelearningframeworkswillresultinbeReraccuracy
Formoreinforma4on,[email protected]