Information Extraction (IE)
Fadi Biadsy, CS4705
Oct 30, 2008
Information Extraction (IE) – Task
• Idea: 'extract' or tag particular types of information from arbitrary text or transcribed speech
Named Entity Tagger
• Identify the types and boundaries of named entities
• For example:
– Alexander Mackenzie, (January 28, 1822 – April 17, 1892), a building contractor and writer, was the second Prime Minister of Canada from ….
<PERSON>Alexander Mackenzie</PERSON>, (<TIMEX>January 28, 1822</TIMEX> – <TIMEX>April 17, 1892</TIMEX>), a building contractor and writer, was the second Prime Minister of <GPE>Canada</GPE> from ….
IE for Template Filling / Relation Detection
• Given a set of documents and a domain of interest, fill a table of required fields.
• For example:
– Number of car accidents per vehicle type and number of casualties in the accidents.

Vehicle Type    #accidents    #casualties    Weather
SUV             1200          190            Rainy
Trucks          200           20             Sunny
IE for Question Answering
• Q: When was Gandhi born?
• A: October 2, 1869
• Q: Where was Bill Clinton educated?
• A: Georgetown University in Washington, D.C.
• Q: What was the education of Yassir Arafat?
• A: Civil Engineering
• Q: What is the religion of Noam Chomsky?
• A: Jewish
Approaches
1. Statistical Sequence Labeling
2. Supervised
3. Semi-Supervised and Bootstrapping
Approach for NER
• <PERSON>Alexander Mackenzie</PERSON>, (<TIMEX>January 28, 1822</TIMEX> – <TIMEX>April 17, 1892</TIMEX>), a building contractor and writer, was the second Prime Minister of <GPE>Canada</GPE> from ….
• Statistical sequence-labeling techniques can be used – similar to POS tagging.
– Word-by-word sequence labeling
– Example features (see the sketch below):
• POS tags
• Syntactic constituents
• Shape features
• Presence in a named entity list
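Here is a minimal sketch of per-token feature extraction for such a word-by-word tagger. It is not from the slides: the gazetteer, the POS tags, and the shape heuristic are toy stand-ins, and a real system would feed these features into a sequence model rather than print them.

# Sketch: features for word-by-word sequence labeling (toy assumptions).
GAZETTEER = {"canada", "alexander", "mackenzie"}  # toy named-entity list

def shape(token):
    """Map a token to a coarse orthographic shape, e.g. 'Canada' -> 'Xx'."""
    chars = ["X" if c.isupper() else "x" if c.islower() else
             "d" if c.isdigit() else c for c in token]
    out = []
    for c in chars:                 # collapse runs: 'Xxxxxx' -> 'Xx'
        if not out or out[-1] != c:
            out.append(c)
    return "".join(out)

def token_features(tokens, pos_tags, i):
    """Features for token i: identity, POS, shape, gazetteer, context."""
    return {
        "word": tokens[i].lower(),
        "pos": pos_tags[i],
        "shape": shape(tokens[i]),
        "in_gazetteer": tokens[i].lower() in GAZETTEER,
        "prev_word": tokens[i - 1].lower() if i > 0 else "<S>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</S>",
    }

tokens = ["Alexander", "Mackenzie", "was", "Prime", "Minister", "of", "Canada"]
pos = ["NNP", "NNP", "VBD", "NNP", "NNP", "IN", "NNP"]
for i in range(len(tokens)):
    print(tokens[i], token_features(tokens, pos, i))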
Supervised Approach for Relation Detection
• Given a corpus of annotated relations between entities, train two classifiers:
1. A binary classifier:
• Given a span of text and two entities
• Decide if there is a relationship between these two entities.
2. A classifier trained to determine the types of relations that exist between the entities
• Features:
– Types of the two named entities
– Bag-of-words
– …
• Example:
– A rented SUV went out of control on Sunday, causing the death of seven people in Brooklyn
– Relation: Type=Accident, VehicleType=SUV, casualties=7, weather=?
• Pros and cons?
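To make the two-classifier design concrete, here is a minimal sketch using scikit-learn (an assumption: the slides name no toolkit). The toy training tuples, the feature names, and the logistic-regression choice are illustrative stand-ins only.

# Sketch: stage 1 detects a relation, stage 2 labels its type.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(span, ent1_type, ent2_type):
    feats = {"type1=" + ent1_type: 1, "type2=" + ent2_type: 1}
    for w in span.lower().split():        # bag-of-words between the entities
        feats["bow=" + w] = 1
    return feats

train = [
    ("went out of control causing the death of", "VEH", "NUM", True, "Accident"),
    ("was born in", "PER", "GPE", True, "Birthplace"),
    ("and the weather in", "PER", "GPE", False, None),
]

vec = DictVectorizer()
X = vec.fit_transform(features(s, t1, t2) for s, t1, t2, _, _ in train)

# Stage 1: binary "is there a relation?" classifier
has_rel = LogisticRegression().fit(X, [y for *_, y, _ in train])

# Stage 2: relation-type classifier, trained only on positive examples
pos = [(s, t1, t2, lab) for s, t1, t2, y, lab in train if y]
Xp = vec.transform(features(s, t1, t2) for s, t1, t2, _ in pos)
rel_type = LogisticRegression().fit(Xp, [lab for *_, lab in pos])

test = features("was born in", "PER", "GPE")
if has_rel.predict(vec.transform([test]))[0]:
    print("relation type:", rel_type.predict(vec.transform([test]))[0])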
Pattern Matching for Relation Detection
• Patterns:
• "[CAR_TYPE] went out of control on [TIMEX], causing the death of [NUM] people"
• "[PERSON] was born in [GPE]"
• "[PERSON] was graduated from [FAC]"
• "[PERSON] was killed by <X>"
• Matching techniques
– Exact matching
• Pros and cons?
– Flexible matching (e.g., [X] was .* killed .* by [Y])
• Pros and cons?
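A minimal sketch contrasting the two matching techniques with Python's re module; the sentences are invented and the slot regexes are simplistic stand-ins for real NE-typed slots.

# Sketch: exact vs. flexible pattern matching.
import re

text1 = "Smith was killed by a falling tree."
text2 = "Smith was reportedly killed last night by an avalanche."

# Exact matching: the literal template must appear verbatim.
exact = re.compile(r"(\w+) was killed by (.+)")
print(exact.search(text1))   # matches
print(exact.search(text2))   # no match: intervening words break it

# Flexible matching ([X] was .* killed .* by [Y]): tolerates insertions,
# at the cost of potential false positives.
flexible = re.compile(r"(\w+) was .*?killed .*?by (.+)")
print(flexible.search(text1).groups())
print(flexible.search(text2).groups())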
Pattern Matching
• How can we come up with these patterns?
• Manually?
– Task- and domain-specific -- tedious, time-consuming, and not scalable.
Semi-Supervised Approach: AutoSlog-TS (Riloff, 1996)
• MUC-4 task: extract information about terrorist events in Latin America.
• Two corpora:
1) A domain-dependent corpus that contains relevant information
2) A set of irrelevant documents
• Algorithm:
1. Using some heuristic rules, all patterns are extracted from both corpora. For example:
Rule: <Subj> passive-verb
<Subj> was murdered
<Subj> was called
2. Pattern ranking: the output patterns are then ranked by the frequency of their occurrences in corpus 1 / corpus 2 (see the ranking sketch below).
3. Filter the patterns by hand
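Below is a minimal sketch of the ranking step, assuming patterns have already been extracted and counted in each corpus. Following the slide, patterns are scored by their frequency in the relevant corpus divided by their frequency in the irrelevant one; the counts and the +1 smoothing are my own invented illustrations.

# Sketch: AutoSlog-TS-style pattern ranking (toy counts).
from collections import Counter

relevant_counts = Counter({      # corpus 1: domain-relevant documents
    "<Subj> was murdered": 40,
    "<Subj> was called": 25,
    "<Subj> was kidnapped": 30,
})
irrelevant_counts = Counter({    # corpus 2: irrelevant documents
    "<Subj> was murdered": 2,
    "<Subj> was called": 50,
    "<Subj> was kidnapped": 1,
})

def score(pattern):
    # +1 smoothing so patterns absent from corpus 2 don't divide by zero
    return relevant_counts[pattern] / (irrelevant_counts[pattern] + 1)

for p in sorted(relevant_counts, key=score, reverse=True):
    print(f"{score(p):6.2f}  {p}")   # a human then filters this list by hand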
Bootstrapping
[Diagram: the bootstrapping loop – Seed Patterns and Seed Tuples feed a cycle of Pattern Search → Tuple Extraction → Tuple Set → Tuple Search → Pattern Extraction → Pattern Set → back to Pattern Search.
Example: the seed pattern "X was born in Y" matches "George W. Bush was born in Connecticut", yielding the tuple <George W. Bush, Connecticut>; searching for that tuple finds "Born in Connecticut on July 8, 1946, George was …", which yields the new pattern "Born in Y on Z, X was".]
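A skeletal sketch of the loop in the diagram above. The two-sentence corpus, the substring-based pattern induction, and the fixed round count are stand-ins; a real system would search a large corpus or the web, and would also abstract the date into a slot (as the slide's "Born in Y on Z, X was" does) rather than keep it literal.

# Sketch: pattern/tuple bootstrapping over a toy corpus.
import re

corpus = [
    "George W. Bush was born in Connecticut",
    "Born in Connecticut on July 8, 1946, George W. Bush was the eldest son",
]

seed_patterns = [r"(?P<X>[\w\. ]+) was born in (?P<Y>\w+)"]
tuples, patterns = set(), set(seed_patterns)

for _ in range(2):  # a fixed number of rounds instead of a convergence test
    # Pattern search -> tuple extraction
    for pat in list(patterns):
        for sent in corpus:
            m = re.search(pat, sent)
            if m:
                tuples.add((m.group("X").strip(), m.group("Y")))
    # Tuple search -> pattern extraction: abstract sentences containing a
    # known tuple into a new pattern by replacing its elements with slots
    for x, y in list(tuples):
        for sent in corpus:
            if x in sent and y in sent and not any(re.search(p, sent) for p in patterns):
                patterns.add(re.escape(sent)
                             .replace(re.escape(x), r"(?P<X>[\w\. ]+)")
                             .replace(re.escape(y), r"(?P<Y>\w+)"))

print(tuples)
print(patterns)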
TASK 12: (DARPA – GALE year 2) PRODUCE A BIOGRAPHY OF [PERSON].
1. Name(s), aliases:
2. *Date of Birth or Current Age:
3. *Date of Death:
4. *Place of Birth:
5. *Place of Death:
6. Cause of Death:
7. Religion (Affiliations):
8. Known locations and dates:
9. Last known address:
10. Previous domiciles:
11. Ethnic or tribal affiliations:
12. Immediate family members:
13. Native Language spoken:
14. Secondary Languages spoken:
15. Physical Characteristics:
16. Passport number and country of issue:
17. Professional positions:
18. Education:
19. Party or other organization affiliations:
20. Publications (titles and dates):
Biography – two approaches
• To obtain high precision, we handle each slot independently, using bootstrapping to learn IE patterns.
• To improve recall, we utilize a biographical-sentence classifier.
Biography patterns from Wikipedia
• Martin Luther King, Jr., (January 15, 1929 – April 4, 1968) was the most …
• Martin Luther King, Jr., was born on January 15, 1929, in Atlanta, Georgia.
Biography patterns from Wikipedia
Run NER on these sentences
• <Person>Martin Luther King, Jr.</Person>, (<Date>January 15, 1929</Date> – <Date>April 4, 1968</Date>) was the most …
• <Person>Martin Luther King, Jr.</Person>, was born on <Date>January 15, 1929</Date>, in <GPE>Atlanta, Georgia</GPE>.
• Take the token sequence that includes the tags of interest + some context (2 tokens before and 2 tokens after)
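A minimal sketch of pulling out the token span that covers the tags of interest plus two tokens of context on each side. The tokens are written out by hand here; in the pipeline above they would come from the NER output.

# Sketch: extract tags of interest + 2 tokens of context on each side.
tokens = ["<Person>", ",", "was", "born", "on", "<Date>", ",", "in", "<GPE>", "."]
interest = {"<Person>", "<Date>"}   # tags this slot's pattern cares about

idx = [i for i, t in enumerate(tokens) if t in interest]
start, end = max(0, min(idx) - 2), min(len(tokens), max(idx) + 3)
print(" ".join(tokens[start:end]))
# -> <Person> , was born on <Date> , in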
Convert to Patterns:
• <Target_Person> (<Target_Date> – <Date>) was the
• <Target_Person>, was born on <Target_Date>, in
• Remove more-specific patterns – if one pattern contains another, take the smallest one with > k tokens.
• <Target_Person>, was born on <Target_Date>
• <Target_Person> (<Target_Date> – <Date>)
• Finally, verify the patterns manually to remove irrelevant ones.
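Here is a minimal sketch of that pruning step: when one pattern contains another, keep only the shorter one, provided it still has more than k tokens. The containment test is plain substring matching and the value of k is my own choice, since the slide leaves it unspecified.

# Sketch: drop more-specific patterns, keep the smallest with > K tokens.
K = 4

patterns = [
    "<Target_Person> , was born on <Target_Date> , in",
    "<Target_Person> , was born on <Target_Date>",
    "<Target_Person> ( <Target_Date> – <Date> ) was the",
    "<Target_Person> ( <Target_Date> – <Date> )",
]

kept = []
for p in sorted(patterns, key=lambda p: len(p.split())):  # shortest first
    if len(p.split()) > K and not any(p in q or q in p for q in kept):
        kept.append(p)

for p in kept:   # a manual pass would still remove irrelevant patterns
    print(p)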
Examples of Patterns:
• 502 distinct place-of-birth patterns:
– 600  <Target_Person> was born in <Target_GPE>
– 169  <Target_Person> (born <Date> in <Target_GPE>)
– 44   Born in <Target_GPE>, <Target_Person>
– 10   <Target_Person> was a native <Target_GPE>
– 10   <Target_Person>'s hometown of <Target_GPE>
– 1    <Target_Person> was baptized in <Target_GPE>
– …
• 291 distinct date-of-death patterns:
– 770  <Target_Person> (<Date> – <Target_Date>)
– 92   <Target_Person> died on <Target_Date>
– 19   <Target_Person> <Date> – <Target_Date>
– 16   <Target_Person> died in <GPE> on <Target_Date>
– 3    <Target_Person> passed away on <Target_Date>
– 1    <Target_Person> committed suicide on <Target_Date>
– …
Biography as an IE task
• This approach is good for the consistently annotated fields in Wikipedia: place of birth, date of birth, place of death, date of death
• Not all fields of interest are annotated; a different approach is needed to cover the rest of the slots
Bouncing between Wikipedia and Google
• Use one seed tuple only:
– <Target_Person> and <Target_Field>
• Google: "Arafat" "civil engineering", we get:
• Use one seed tuple only:
• Google: "Arafat" "civil engineering", we get:
⇒ Arafat graduated with a bachelor's degree in civil engineering
⇒ Arafat studied civil engineering
⇒ Arafat, a civil engineering student
⇒ …
• Using these snippets, corresponding patterns are created, then filtered out.
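A minimal sketch of turning the retrieved snippets into patterns by replacing the seed person and seed field with slot placeholders. The snippets are the ones shown above; the placeholder names follow the slides, but the simple string replacement is my own stand-in for the real pattern-creation step.

# Sketch: snippet -> pattern via seed substitution.
person, field = "Arafat", "civil engineering"

snippets = [
    "Arafat graduated with a bachelor's degree in civil engineering",
    "Arafat studied civil engineering",
    "Arafat, a civil engineering student",
]

patterns = [
    s.replace(person, "<Target_Person>").replace(field, "<Target_Field>")
    for s in snippets
]
for p in patterns:   # these would then be filtered, per the slide
    print(p)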
Bouncing between Wikipedia and Google
• Use one seed tuple only:
• Google: "Arafat" "civil engineering", we get:
⇒ Arafat graduated with a bachelor's degree in civil engineering
⇒ Arafat studied civil engineering
⇒ Arafat, a civil engineering student
⇒ …
• Using these snippets, corresponding patterns are created, then filtered out manually
• Due to time limitations, the automatic filter was not completed.
– To get more seed tuples, go to Wikipedia biography pages only and search for:
– "graduated with a bachelor's degree in"
– We get:
Bouncing between Wikipedia and Google
• New seed tuples:
– "Burnie Thompson" "political science"
– "Henrey Luke" "Environment Studies"
– "Erin Crocker" "industrial and management engineering"
– "Denise Bode" "political science"
– …
• Go back to Google and repeat the process to get more seed patterns!
Bouncing between Wikipedia and Google
Bouncing between Wikipedia and Google
• This approach worked well for a few fields, such as: education, publication, immediate family members, and party or other organization affiliations
• It did not provide good patterns for some of the fields, such as religion, ethnic or tribal affiliations, and previous domiciles; we got a lot of noise
• Why is the bouncing idea better than using only one corpus?
• None of the patterns match? Back-off strategy…
Biographical-Sentence Classifier (Biadsy et al., 2008)
• Train a binary classifier to identify biographical sentences
• Manually annotating a large corpus of biographical and non-biographical information (e.g., Zhou et al., 2004) is labor-intensive
• Our approach: collect biographical and non-biographical corpora automatically
Training Data – Biographical Corpus from Wikipedia
• Utilize Wikipedia biographies
• Extract 17K biographies from the XML version of Wikipedia
• Apply simple text processing techniques to clean up the text
Constructing the Biographical Corpus
1. Identify the subject of each biography
2. Run NYU's ACE system to tag NEs and do coreference resolution (Grishman et al., 2005)
Constructing the Biographical Corpus
3. Replace each NE by its tag type and subtype
In September 1951, King began his doctoral studies in theology at Boston University.
In [TIMEX], [PER_Individual] began [TARGET_HIS] doctoral studies in theology at [ORG_Educational].
Constructing the Biographical Corpus
3. Replace each NE by its tag type and subtype
4. Non-pronominal referring expressions that are coreferential with the target person are replaced by [TARGET_PER]
In September 1951, King began his doctoral studies in theology at Boston University.
In [TIMEX], [TARGET_PER] began [TARGET_HIS] doctoral studies in theology at [ORG_Educational].
Constructing the Biographical Corpus
3. Replace each NE by its tag type and subtype
4. Non-pronominal referring expressions that are coreferential with the target person are replaced by [TARGET_PER]
5. Every pronoun P that refers to the target person is replaced by [TARGET_P], where P is the pronoun replaced
In September 1951, King began his doctoral studies in theology at Boston University.
In [TIMEX], [TARGET_PER] began [TARGET_HIS] doctoral studies in theology at [ORG_Educational].
Constructing the Biographical Corpus
3. Replace each NE by its tag type and subtype
4. Non-pronominal referring expressions that are coreferential with the target person are replaced by [TARGET_PER]
5. Every pronoun P that refers to the target person is replaced by [TARGET_P], where P is the pronoun replaced
6. Sentences containing no reference to the target person are removed
In September 1951, King began his doctoral studies in theology at Boston University.
In [TIMEX], [TARGET_PER] began [TARGET_HIS] doctoral studies in theology at [ORG_Educational].
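A minimal sketch of steps 3-5 above, assuming the NE spans and coreference links are already available (the real pipeline gets them from NYU's ACE system); the surface strings and slot names below were written by hand for the example sentence.

# Sketch: class-based replacement for the biographical corpus.
sentence = "In September 1951, King began his doctoral studies in theology at Boston University."

# (surface string, replacement) pairs, derived from NE tags + coreference:
replacements = [
    ("September 1951", "[TIMEX]"),           # step 3: NE -> type/subtype
    ("King", "[TARGET_PER]"),                # step 4: coreferential name
    ("his", "[TARGET_HIS]"),                 # step 5: pronoun -> TARGET_<pronoun>
    ("Boston University", "[ORG_Educational]"),
]

for surface, slot in replacements:
    sentence = sentence.replace(surface, slot, 1)
print(sentence)
# -> In [TIMEX], [TARGET_PER] began [TARGET_HIS] doctoral studies in
#    theology at [ORG_Educational].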
Constructing the Non-Biographical Corpus
• English newswire articles in TDT4 are used to represent non-biographical sentences
1. Run NYU's ACE system on each article
2. Select a PERSON NE mention at random from all NEs in the article to represent the target person
3. Exclude sentences with no reference to this target
4. Replace referring expressions and NEs as in the biography corpus
Biographical-Sentence Classifier
• Train a classifier on the biographical and non-biographical corpora
– Biographical corpus:
• 30,002 sentences from Wikipedia
• 2,108 sentences held out for testing
– Non-biographical corpus:
• 23,424 sentences from TDT4
• 2,108 sentences held out for testing
Biographical-Sentence Classifier
• Features:
– Frequency of class-based/lexical 1-, 2-, and 3-grams, e.g.:
• [TARGET_PER] was born
• [TARGET_HER] husband was
• [TARGET_PER] said
– Frequency of 1- and 2-grams of POS
• Chi-square for feature selection (see the sketch below)
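Here is a minimal sketch of this feature pipeline, word n-gram counts with chi-square selection, using scikit-learn (an assumption: the slides name no toolkit). The two training sentences are toy data, and POS n-grams are omitted for brevity.

# Sketch: 1-3-gram count features + chi-square feature selection.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

sentences = [
    "[TARGET_PER] was born in [GPE]",          # biographical
    "[TARGET_PER] said the talks had failed",  # non-biographical
]
labels = [1, 0]

# token_pattern kept permissive so bracketed class tokens survive intact
vec = CountVectorizer(ngram_range=(1, 3), token_pattern=r"\S+")
X = vec.fit_transform(sentences)

selector = SelectKBest(chi2, k=5)   # keep the 5 highest-scoring n-grams
X_sel = selector.fit_transform(X, labels)
print([vec.get_feature_names_out()[i]
       for i in selector.get_support(indices=True)])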
Classification Results
• Experimented with three types of classifiers:
• Note: classifiers provide a confidence score for each classified sample

Classifier              Accuracy   F-Measure
SVM                     87.6%      0.87
M. Naïve Bayes (MNB)    84.1%      0.84
C4.5                    81.8%      0.82
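To make the note about confidence scores concrete, here is a minimal sketch comparing classifiers in scikit-learn, using LinearSVC for the SVM, MultinomialNB for MNB, and DecisionTreeClassifier as a rough C4.5 analog. These are approximations of the classifiers in the table, not the original toolkit, and the two-sentence corpus is toy data.

# Sketch: three classifiers, each yielding a per-sample confidence score.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

texts = ["[TARGET_PER] was born in [GPE]", "the market fell sharply today"]
y = [1, 0]
X = CountVectorizer(token_pattern=r"\S+").fit_transform(texts)

for clf in (LinearSVC(), MultinomialNB(), DecisionTreeClassifier()):
    clf.fit(X, y)
    # Confidence: distance from the margin for the SVM, class probability
    # for the others (each classifier exposes some score, per the slide).
    if hasattr(clf, "decision_function"):
        conf = clf.decision_function(X)
    else:
        conf = clf.predict_proba(X).max(axis=1)
    print(type(clf).__name__, clf.predict(X), conf)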
Thank you