![Page 1: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/1.jpg)
Information Extraction
Mausam (Based on slides of Heng Ji, Dan Jurafsky, Chris Manning, Ray Mooney, Alan Ritter, Alex Yates)
![Page 2: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/2.jpg)
Extracting information from text
• Company report: “International Business Machines Corporation (IBM or the company) was incorporated in the State of New York on June 16, 1911, as the Computing-Tabulating-Recording Co. (C-T-R)…”
• Extracted Complex Relation: Company-Founding
Company: IBM
Location: New York
Date: June 16, 1911
Original-Name: Computing-Tabulating-Recording Co.
• But we will focus on the simpler task of extracting relation triples
Founding-year(IBM, 1911)
Founding-location(IBM, New York)
![Page 3: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/3.jpg)
Extracting Relation Triples from Text
The Leland Stanford Junior University, commonly referred to as Stanford University or Stanford, is an American private research university located in Stanford, California … near Palo Alto, California… Leland Stanford…founded the university in 1891
Stanford EQ Leland Stanford Junior University
Stanford LOC-IN California
Stanford IS-A research university
Stanford LOC-NEAR Palo Alto
Stanford FOUNDED-IN 1891
Stanford FOUNDER Leland Stanford
![Page 4: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/4.jpg)
Why Information Extraction?
• Create new structured knowledge bases, useful for any app
• Augment current knowledge bases
• Adding words to WordNet thesaurus, facts to FreeBase or DBPedia
• Support question answering
• The granddaughter of which actor starred in the movie “E.T.”?
(acted-in ?x “E.T.”) (is-a ?y actor) (granddaughter-of ?x ?y)
• But which relations should we extract?
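The question-answering query above can be read as a conjunction of triple patterns with shared variables. The following is a minimal sketch of that idea, not a real query engine: the `TRIPLES` set is a hand-written toy knowledge base, and the helper names `match` and `query` are hypothetical.

```python
# Toy knowledge base of (relation, arg1, arg2) triples, hand-written for illustration.
TRIPLES = {
    ("acted-in", "Drew Barrymore", "E.T."),
    ("acted-in", "Henry Thomas", "E.T."),
    ("is-a", "John Barrymore", "actor"),
    ("is-a", "Drew Barrymore", "actor"),
    ("granddaughter-of", "Drew Barrymore", "John Barrymore"),
}


def is_var(term):
    """Variables are written with a leading '?', as on the slide."""
    return term.startswith("?")


def match(pattern, triple, bindings):
    """Unify one pattern with one triple; return extended bindings or None."""
    new = dict(bindings)
    for p, t in zip(pattern, triple):
        if is_var(p):
            # Bind the variable, or fail if it is already bound to something else.
            if new.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return new


def query(patterns, triples=TRIPLES):
    """Return every variable binding that satisfies all patterns at once."""
    results = [{}]
    for pattern in patterns:
        results = [b2 for b in results for t in triples
                   if (b2 := match(pattern, t, b)) is not None]
    return results


answers = query([
    ("acted-in", "?x", "E.T."),
    ("is-a", "?y", "actor"),
    ("granddaughter-of", "?x", "?y"),
])
print(sorted({b["?y"] for b in answers}))
```

Each pattern narrows the set of candidate bindings, so the answer is only as good as the extracted triples in the knowledge base.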
![Page 5: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/5.jpg)
Early Information Extraction
• FRUMP (DeJong, 1979) was an early information extraction system that processed news stories and identified various types of events (e.g. earthquakes, terrorist attacks, floods).
• Used “sketchy scripts” of various events to identify specific pieces of information about such events.
• Able to summarize articles in multiple languages.
• Relied on “brittle” hand-built symbolic knowledge structures that were hard to build and not very robust.
![Page 6: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/6.jpg)
MUC
• DARPA funded significant efforts in IE in the early to mid 1990s.
• Message Understanding Conference (MUC) was an annual event/competition where results were presented.
• Focused on extracting information from news articles:
• Terrorist events
• Industrial joint ventures
• Company management changes
• Information extraction was of particular interest to the intelligence community (CIA, NSA).
• Established standard evaluation methodology using development (training) and test data and metrics: precision, recall, F-measure.
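The three metrics can be computed directly from the sets of predicted and gold-standard extractions. The sketch below assumes extractions are compared by exact match and uses the balanced F-measure (F1); the wrong "Armonk" prediction is an invented system error for illustration.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall and balanced F-measure over two sets."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {("IBM", "Founding-year", "1911"),
        ("IBM", "Founding-location", "New York")}
# Hypothetical system output: one correct triple, one wrong one.
predicted = {("IBM", "Founding-year", "1911"),
             ("IBM", "Founding-location", "Armonk")}

p, r, f = precision_recall_f1(predicted, gold)
print(p, r, f)
```

Precision penalizes the spurious triple and recall penalizes the missed one, which is why both are reported together.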
![Page 7: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/7.jpg)
Automated Content Extraction (ACE)
• ARTIFACT: User-Owner-Inventor-Manufacturer
• GENERAL AFFILIATION: Citizen-Resident-Ethnicity-Religion, Org-Location-Origin
• ORG AFFILIATION: Founder, Employment, Membership, Ownership, Student-Alum, Investor, Sports-Affiliation
• PART-WHOLE: Geographical, Subsidiary
• PERSON-SOCIAL: Business, Family, Lasting Personal
• PHYSICAL: Located, Near
17 relations from 2008 “Relation Extraction Task”
![Page 8: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/8.jpg)
Automated Content Extraction (ACE)
• Physical-Located PER-GPE
He was in Tennessee
• Part-Whole-Subsidiary ORG-ORG
XYZ, the parent company of ABC
• Person-Social-Family PER-PER
John’s wife Yoko
• Org-AFF-Founder PER-ORG
Steve Jobs, co-founder of Apple…
![Page 9: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/9.jpg)
UMLS: Unified Medical Language System
• 134 entity types, 54 relations
Injury disrupts Physiological Function
Bodily Location location-of Biologic Function
Anatomical Structure part-of Organism
Pharmacologic Substance causes Pathological Function
Pharmacologic Substance treats Pathologic Function
![Page 10: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/10.jpg)
Extracting UMLS relations from a sentence
Doppler echocardiography can be used to diagnose left anterior descending artery stenosis in patients with type 2 diabetes
Echocardiography, Doppler DIAGNOSES Acquired stenosis
![Page 11: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/11.jpg)
Databases of Wikipedia Relations
Relations extracted from Infobox
Stanford state California
Stanford motto “Die Luft der Freiheit weht”
…
Wikipedia Infobox
![Page 12: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/12.jpg)
Relation databases that draw from Wikipedia
• Resource Description Framework (RDF) triples
subject predicate object
Golden Gate Park location San Francisco
dbpedia:Golden_Gate_Park dbpedia-owl:location dbpedia:San_Francisco
• DBPedia: 1 billion RDF triples, 385 million from English Wikipedia
• Frequent Freebase relations:
people/person/nationality, location/location/contains
people/person/profession, people/person/place-of-birth
biology/organism_higher_classification, film/film/genre
![Page 13: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/13.jpg)
Ontological relations
• IS-A (hypernym): subsumption between classes
• Giraffe IS-A ruminant IS-A ungulate IS-A mammal IS-A vertebrate IS-A animal…
• Instance-of: relation between individual and class
• San Francisco instance-of city
Examples from the WordNet Thesaurus
![Page 14: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/14.jpg)
How to build relation extractors
1. Hand-written patterns
2. Supervised machine learning
3. Semi-supervised and unsupervised
• Bootstrapping (using seeds)
• Distant supervision
• Unsupervised learning from the web
![Page 15: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/15.jpg)
Hand Written Patterns
![Page 16: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/16.jpg)
Rules for extracting IS-A relation
Early intuition from Hearst (1992)
• “Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use”
• What does Gelidium mean?
• How do you know?
![Page 17: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/17.jpg)
![Page 18: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/18.jpg)
Hearst’s Patterns for extracting IS-A relations
(Hearst, 1992): Automatic Acquisition of Hyponyms
“Y such as X ((, X)* (, and|or) X)”
“such Y as X”
“X or other Y”
“X and other Y”
“Y including X”
“Y, especially X”
![Page 19: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/19.jpg)
Hearst’s Patterns for extracting IS-A relations

| Hearst pattern | Example occurrences |
|---|---|
| X and other Y | ...temples, treasuries, and other important civic buildings. |
| X or other Y | Bruises, wounds, broken bones or other injuries... |
| Y such as X | The bow lute, such as the Bambara ndang... |
| Such Y as X | ...such authors as Herrick, Goldsmith, and Shakespeare. |
| Y including X | ...common-law countries, including Canada and England... |
| Y, especially X | European countries, especially France, England, and Spain... |

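A few of these patterns can be approximated with regular expressions. This is a rough sketch under strong simplifying assumptions: each X is a single capitalized word, and Y is only the single word next to the cue phrase (so "European countries" yields "countries"). A real system would match noun-phrase chunks instead; the function name `extract_isa` is hypothetical.

```python
import re

# One hyponym: a single capitalized word (a deliberate simplification).
ITEM = r"[A-Z][a-z]+"
# A list of hyponyms: "A", "A and B", "A, B, and C", ...
LIST = rf"{ITEM}(?:, {ITEM})*(?:,? (?:and|or) {ITEM})?"

PATTERNS = [
    re.compile(rf"such (?P<hyper>\w+) as (?P<hypos>{LIST})"),
    re.compile(rf"(?P<hyper>\w+),? (?:including|especially) (?P<hypos>{LIST})"),
    re.compile(rf"(?P<hyper>\w+),? such as (?P<hypos>{LIST})"),
]


def extract_isa(sentence):
    """Return (hyponym, hypernym) pairs found by the simplified patterns."""
    pairs = []
    for pattern in PATTERNS:
        for m in pattern.finditer(sentence):
            # Split the matched list on ", ", " and ", " or ", ", and ", ", or ".
            hypos = re.split(r",? (?:and|or) |, ", m.group("hypos"))
            pairs.extend((h, m.group("hyper")) for h in hypos)
    return pairs


print(extract_isa("such authors as Herrick, Goldsmith, and Shakespeare."))
print(extract_isa("European countries, especially France, England, and Spain"))
```

Because X must be capitalized here, an occurrence such as "The bow lute, such as the Bambara ndang" is missed, which illustrates how brittle surface patterns are without noun-phrase chunking.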
![Page 20: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/20.jpg)
Extracting Richer Relations Using Rules
• Intuition: relations often hold between specific entities
• located-in (ORGANIZATION, LOCATION)
• founded (PERSON, ORGANIZATION)
• cures (DRUG, DISEASE)
• Start with Named Entity tags to help extract relation!
![Page 21: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/21.jpg)
Named Entities aren’t quite enough.
Which relations hold between 2 entities?
Drug, Disease: Cure? Prevent? Cause?
![Page 22: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/22.jpg)
What relations hold between 2 entities?
PERSON, ORGANIZATION: Founder? Investor? Member? Employee? President?
![Page 23: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/23.jpg)
Extracting Richer Relations Using Rules and Named Entities
Who holds what office in what organization?
PERSON, POSITION of ORG
• George Marshall, Secretary of State of the United States
PERSON (named|appointed|chose|etc.) PERSON Prep? POSITION
• Truman appointed Marshall Secretary of State
PERSON [be]? (named|appointed|etc.) Prep? ORG POSITION
• George Marshall was named US Secretary of State
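Rules of this kind can be applied to a sentence once it has been split into tagged chunks. The sketch below assumes an upstream tagger has already produced `(text, label)` pairs; the `POSITION` label, the chunk format, and both function names are assumptions made for illustration. It implements the first two rules only and ignores the optional preposition.

```python
APPOINT_VERBS = {"named", "appointed", "chose"}


def person_position_of_org(chunks):
    """Rule: PERSON , POSITION of ORG  ->  (person, position, org)."""
    out = []
    for i in range(len(chunks) - 4):
        (per, l1), (comma, _), (pos, l3), (of, _), (org, l5) = chunks[i:i + 5]
        if (l1, comma, l3, of, l5) == ("PERSON", ",", "POSITION", "of", "ORG"):
            out.append((per, pos, org))
    return out


def appointed_position(chunks):
    """Rule: PERSON (named|appointed|chose) PERSON POSITION -> (person, position)."""
    out = []
    for i in range(len(chunks) - 3):
        (_, l1), (verb, _), (per, l3), (pos, l4) = chunks[i:i + 4]
        if l1 == "PERSON" and verb in APPOINT_VERBS and (l3, l4) == ("PERSON", "POSITION"):
            out.append((per, pos))
    return out


# Hand-tagged versions of the slide's example sentences ("O" = no entity label).
s1 = [("George Marshall", "PERSON"), (",", "O"),
      ("Secretary of State", "POSITION"), ("of", "O"),
      ("the United States", "ORG")]
s2 = [("Truman", "PERSON"), ("appointed", "O"),
      ("Marshall", "PERSON"), ("Secretary of State", "POSITION")]

print(person_position_of_org(s1))
print(appointed_position(s2))
```

Matching on entity labels rather than raw words lets one rule cover every person, position, and organization the tagger recognizes.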
![Page 24: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/24.jpg)
Hand-built patterns for relations
• Plus:
• Human patterns tend to be high-precision
• Can be tailored to specific domains
• Minus:
• Human patterns are often low-recall
• A lot of work to think of all possible patterns!
• Don’t want to have to do this for every relation!
• We’d like better accuracy
![Page 25: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/25.jpg)
Supervised Algorithms
![Page 26: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/26.jpg)
Supervised machine learning for relations
• Choose a set of relations we’d like to extract
• Choose a set of relevant named entities
• Find and label data
• Choose a representative corpus
• Label the named entities in the corpus
• Hand-label the relations between these entities
• Break into training, development, and test
• Train a classifier on the training set
![Page 27: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/27.jpg)
How to do classification in supervised relation extraction
1. Find all pairs of named entities (usually in same sentence)
2. Decide if 2 entities are related
3. If yes, classify the relation
• Why the extra step?
• Faster classification training by eliminating most pairs
• Can use distinct feature-sets appropriate for each task.
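The three steps above can be sketched as a small pipeline that takes the two classifiers as arguments. The lookup-table "classifiers" below are hypothetical stand-ins for trained models, and the toy labels are illustrative, not gold annotations.

```python
from itertools import combinations


def extract_relations(entities, is_related, classify_relation):
    """Run the three steps on the entity mentions of one sentence."""
    relations = []
    for m1, m2 in combinations(entities, 2):        # step 1: all entity pairs
        if is_related(m1, m2):                      # step 2: related or not?
            label = classify_relation(m1, m2)       # step 3: which relation?
            relations.append((m1, label, m2))
    return relations


# Hypothetical stand-ins for two trained classifiers: a table of toy labels.
TOY_LABELS = {
    ("American Airlines", "AMR"): "SUBSIDIARY",
    ("American Airlines", "Tim Wagner"): "EMPLOYMENT",
}


def toy_is_related(m1, m2):
    return (m1, m2) in TOY_LABELS


def toy_classify(m1, m2):
    return TOY_LABELS[(m1, m2)]


entities = ["American Airlines", "AMR", "Tim Wagner"]
relations = extract_relations(entities, toy_is_related, toy_classify)
print(relations)
```

Keeping the binary filter separate means the multi-class classifier only ever sees pairs that passed step 2, and each step can use its own feature set.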
![Page 28: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/28.jpg)
Automated Content Extraction (ACE)
17 sub-relations of 6 relations from 2008 “Relation Extraction Task”
![Page 29: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/29.jpg)
Relation Extraction
Classify the relation between two entities in a sentence
American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.
SUBSIDIARY
FAMILY
EMPLOYMENT
NIL
FOUNDER
CITIZEN
INVENTOR…
![Page 30: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/30.jpg)
Word Features for Relation Extraction
American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said
Mention 1 (M1): American Airlines; Mention 2 (M2): Tim Wagner
• Headwords of M1 and M2, and combination
Airlines, Wagner, Airlines-Wagner
• Bag of words and bigrams in M1 and M2
{American, Airlines, Tim, Wagner, American Airlines, Tim Wagner}
• Words or bigrams in particular positions left and right of M1/M2
M2: -1 spokesman
M2: +1 said
• Bag of words or bigrams between the two entities
{a, AMR, of, immediately, matched, move, spokesman, the, unit}
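These word features can be computed from a tokenized sentence and the two mention spans. The sketch below makes two assumptions that are not on the slide: the headword is approximated by the last token of the mention, and punctuation is dropped from the bag of words. The function name and feature keys are hypothetical.

```python
def word_features(tokens, m1, m2):
    """tokens: list of words; m1, m2: (start, end) spans, end exclusive, m1 first."""
    def bigrams(words):
        return [" ".join(pair) for pair in zip(words, words[1:])]

    w1 = tokens[m1[0]:m1[1]]
    w2 = tokens[m2[0]:m2[1]]
    between = tokens[m1[1]:m2[0]]
    return {
        # Headword approximated as the last token of each mention.
        "head_m1": w1[-1],
        "head_m2": w2[-1],
        "head_pair": f"{w1[-1]}-{w2[-1]}",
        "bow_mentions": set(w1 + w2 + bigrams(w1) + bigrams(w2)),
        "m2_minus1": tokens[m2[0] - 1] if m2[0] > 0 else None,
        "m2_plus1": tokens[m2[1]] if m2[1] < len(tokens) else None,
        # Punctuation tokens are skipped so only words remain.
        "bow_between": {w for w in between if w.isalnum()},
    }


sentence = ("American Airlines , a unit of AMR , immediately matched "
            "the move , spokesman Tim Wagner said")
tokens = sentence.split()
feats = word_features(tokens, m1=(0, 2), m2=(14, 16))
for name, value in feats.items():
    print(name, value)
```

In a classifier, each of these values would typically be turned into a binary indicator feature such as `head_pair=Airlines-Wagner`.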
![Page 31: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/31.jpg)
Named Entity Type and Mention Level Features for Relation Extraction
American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said
Mention 1 (M1): American Airlines; Mention 2 (M2): Tim Wagner
• Named-entity types
• M1: ORG
• M2: PERSON
• Concatenation of the two named-entity types
• ORG-PERSON
• Entity Level of M1 and M2 (NAME, NOMINAL, PRONOUN)
• M1: NAME [it or he would be PRONOUN]
• M2: NAME [the company would be NOMINAL]
![Page 32: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/32.jpg)
Parse Features for Relation Extraction
American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said
Mention 1 (M1): American Airlines; Mention 2 (M2): Tim Wagner
• Base syntactic chunk sequence from one to the other
NP NP PP VP NP NP
• Constituent path through the tree from one to the other
NP NP S S NP
• Dependency path
Airlines matched Wagner said
![Page 33: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/33.jpg)
Gazetteer and trigger word features for relation extraction
• Trigger list for family: kinship terms
• parent, wife, husband, grandparent, etc. [from WordNet]
• Gazetteer:
• Lists of useful geo or geopolitical words
• Country name list
• Other sub-entities
![Page 34: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/34.jpg)
American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.
![Page 35: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/35.jpg)
Classifiers for supervised methods
• Now you can use any classifier you like
• MaxEnt
• Naïve Bayes
• SVM
• ...
• Train it on the training set, tune on the dev set, test on the test set
• Can also cast as sequence labeling – use CRFs
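As one concrete instance of "any classifier you like", here is a minimal multinomial Naïve Bayes with add-one smoothing in plain Python. The training examples and feature strings are invented toy data; a real setup would use features like those on the preceding slides, a labeled corpus, and a dev set for tuning, which this sketch omits.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over string features, with add-one smoothing."""

    def fit(self, examples):
        """examples: list of (feature_list, label) pairs."""
        self.label_counts = Counter(label for _, label in examples)
        self.feature_counts = defaultdict(Counter)
        for features, label in examples:
            self.feature_counts[label].update(features)
        self.vocab = {f for features, _ in examples for f in features}
        return self

    def predict(self, features):
        total = sum(self.label_counts.values())

        def log_score(label):
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.label_counts[label] / total)
            for f in features:
                if f in self.vocab:  # features never seen in training are ignored
                    score += math.log((counts[f] + 1) / denom)
            return score

        return max(self.label_counts, key=log_score)


# Invented toy training set, for illustration only.
train = [
    (["types=ORG-ORG", "between=unit", "between=of"], "SUBSIDIARY"),
    (["types=ORG-ORG", "between=parent", "between=company"], "SUBSIDIARY"),
    (["types=ORG-PERSON", "between=spokesman"], "EMPLOYMENT"),
    (["types=PERSON-ORG", "between=works", "between=for"], "EMPLOYMENT"),
    (["types=PERSON-PERSON", "between=wife"], "FAMILY"),
]
model = NaiveBayes().fit(train)
print(model.predict(["types=ORG-PERSON", "between=spokesman"]))
print(model.predict(["types=ORG-ORG", "between=unit"]))
```

The same `fit`/`predict` interface could be backed by MaxEnt or an SVM without changing the feature extraction.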
![Page 36: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/36.jpg)
Using Background Knowledge (Chan and Roth, 2010)
• Features employed are usually restricted to being defined on the various representations of the target sentences
• Humans rely on background knowledge to recognize relations
• Overall aim of this work
• Propose methods of using knowledge or resources that exist beyond the sentence
• Wikipedia, word clusters, hierarchy of relations, entity type constraints, coreference
• As additional features, or under the Constraint Conditional Model (CCM) framework with Integer Linear Programming (ILP)
![Page 37: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/37.jpg)
Using Background Knowledge
David Cone, a Kansas City native, was originally signed by the Royals and broke into the majors with the team
![Page 38: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/38.jpg)
![Page 39: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/39.jpg)
![Page 40: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/40.jpg)
![Page 41: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/41.jpg)
Using Background Knowledge
Wikipedia excerpt (David Cone):
David Brian Cone (born January 2, 1963) is a former Major League Baseball pitcher. He compiled an 8–3 postseason record over 21 postseason starts and was a part of five World Series championship teams (1992 with the Toronto Blue Jays and 1996, 1998, 1999 & 2000 with the New York Yankees). He had a career postseason ERA of 3.80. He is the subject of the book A Pitcher's Story: Innings With David Cone by Roger Angell. Fans of David are known as "Cone-Heads." Cone lives in Stamford, Connecticut, and is formerly a color commentator for the Yankees on the YES Network.
Contents: 1 Early years, 2 Kansas City Royals, 3 New York Mets
Wikipedia excerpt (mentioning David Cone and the Royals):
Partly because of the resulting lack of leadership, after the 1994 season the Royals decided to reduce payroll by trading pitcher David Cone and outfielder Brian McRae, then continued their salary dump in the 1995 season. In fact, the team payroll, which was always among the league's highest, was sliced in half from $40.5 million in 1994 (fourth-highest in the major leagues) to $18.5 million in 1996 (second-lowest in the major leagues)
![Page 42: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/42.jpg)
Using Background Knowledge

| fine-grained | score |
|---|---|
| Employment:Staff | 0.20 |
| Employment:Executive | 0.15 |
| Personal:Family | 0.10 |
| Personal:Business | 0.10 |
| Affiliation:Citizen | 0.20 |
| Affiliation:Based-in | 0.25 |

![Page 43: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/43.jpg)
Using Background Knowledge

| fine-grained | score | coarse-grained | score |
|---|---|---|---|
| Employment:Staff | 0.20 | Employment | 0.35 |
| Employment:Executive | 0.15 | | |
| Personal:Family | 0.10 | Personal | 0.40 |
| Personal:Business | 0.10 | | |
| Affiliation:Citizen | 0.20 | Affiliation | 0.25 |
| Affiliation:Based-in | 0.25 | | |

![Page 44: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/44.jpg)
![Page 45: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/45.jpg)
Using Background Knowledge
Employment:Staff (0.20) + Employment (0.35) = 0.55
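One way to read the 0.55 on this slide is as the best score among label pairs that respect the relation hierarchy. The sketch below assumes, as an interpretation rather than something the slide states, that a consistent (coarse, fine) pair is scored by adding the two classifier scores. It uses brute-force enumeration in place of an ILP solver.

```python
# Scores copied from the slide.
fine = {
    "Employment:Staff": 0.20,
    "Employment:Executive": 0.15,
    "Personal:Family": 0.10,
    "Personal:Business": 0.10,
    "Affiliation:Citizen": 0.20,
    "Affiliation:Based-in": 0.25,
}
coarse = {"Employment": 0.35, "Personal": 0.40, "Affiliation": 0.25}


def parent(fine_label):
    """The coarse-grained relation a fine-grained label belongs to."""
    return fine_label.split(":")[0]


# Independent decisions: each classifier picks its own best label.
independent = (max(coarse, key=coarse.get), max(fine, key=fine.get))

# Joint decision: best fine label when scored together with its parent.
joint_fine = max(fine, key=lambda f: fine[f] + coarse[parent(f)])
joint_score = fine[joint_fine] + coarse[parent(joint_fine)]

print("independent:", independent)
print("joint:", parent(joint_fine), joint_fine, round(joint_score, 2))
```

The independent picks (Personal and Affiliation:Based-in) contradict each other, while the joint choice is consistent by construction.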
![Page 46: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/46.jpg)
Knowledge1: Word Class Information (as additional feature)
• Supervised systems face an issue of data sparseness (of lexical features)
• Use class information of words to support better generalization: instantiated as word clusters in our work
• Automatically generated from unlabeled texts using the algorithm of Brown et al. (1992)
[Figure: binary cluster tree with branches labeled 0/1 and leaves apple, pear, Apple, IBM, bought, run, of, in]
![Page 47: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/47.jpg)
![Page 48: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/48.jpg)
Knowledge1: Word Class Information
[Figure: the same cluster tree, with IBM labeled by its bit string 011]
![Page 49: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/49.jpg)
Knowledge1: Word Class Information
• Every lexical feature consisting of a single word is duplicated with the word's corresponding bit-string representation
[Figure: the same cluster tree, with the labels 00, 01, 10, 11]
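The duplication of lexical features with bit strings might look like the sketch below. Only IBM's bit string (011) is given on the slides; the other codes are inferred from the left-to-right leaf order of the tree and should be treated as assumptions, as should the choice of prefix lengths and the function name.

```python
# Bit strings read off the cluster tree. Only IBM = 011 appears on the slides;
# the rest are inferred from the leaf order and are assumptions.
BIT_STRINGS = {
    "apple": "000", "pear": "001", "Apple": "010", "IBM": "011",
    "bought": "100", "run": "101", "of": "110", "in": "111",
}


def add_cluster_features(lexical_features, prefix_lengths=(2, 3)):
    """Duplicate each single-word feature with prefixes of its bit string.

    lexical_features maps a feature name to a single word.
    """
    features = dict(lexical_features)
    for name, word in lexical_features.items():
        bits = BIT_STRINGS.get(word)
        if bits is None:
            continue  # word not in any cluster: keep only the lexical feature
        for n in prefix_lengths:
            features[f"{name}_bits{n}"] = bits[:n]
    return features


ibm = add_cluster_features({"head_m1": "IBM"})
apple = add_cluster_features({"head_m1": "Apple"})
print(ibm)
print(apple)
```

IBM and Apple differ as words but share the two-bit prefix 01, so a classifier trained on one can generalize to the other through the prefix feature.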
![Page 50: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/50.jpg)
Knowledge2: Wikipedia (as additional feature)
• We use a Wikifier system (Ratinov et al., 2010) which performs context-sensitive mapping of mentions to Wikipedia pages
• Introduce a new feature based on:
$$
w_1(m_i, m_j) =
\begin{cases}
1, & \text{if } A_{m_i}(m_j) \text{ or } A_{m_j}(m_i) \\
0, & \text{otherwise}
\end{cases}
$$
• Introduce a new feature by combining the above with the coarse-grained entity types of $m_i$, $m_j$
[Figure: mentions $m_i$ and $m_j$ linked by an unknown relation $r$]
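The feature can be sketched in a few lines. The slide does not define $A$, so this example assumes that $A_m(m')$ is true when the Wikipedia article linked to mention $m$ contains the text of mention $m'$. The `articles` dictionary stands in for Wikifier output and holds only short snippets taken from the excerpts shown earlier.

```python
def w1(mention_i, mention_j, articles):
    """Return 1 if either mention's article text contains the other mention."""
    def a(m, other):
        # Assumed reading of A_m(other): simple case-insensitive string match.
        return other.lower() in articles.get(m, "").lower()

    return int(a(mention_i, mention_j) or a(mention_j, mention_i))


# Stand-in for Wikifier output: mention -> text of the linked article (snippets).
articles = {
    "David Cone": "Contents 1 Early years 2 Kansas City Royals 3 New York Mets",
    "Royals": "the Royals decided to reduce payroll by trading pitcher David Cone",
}

print(w1("David Cone", "Royals", articles))
print(w1("David Cone", "IBM", articles))
```

A mention with no linked article simply contributes an empty text, so the feature falls back to 0.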
![Page 51: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/51.jpg)
Constraint Conditional Models (CCMs) (Roth and Yih, 2007; Chang et al., 2008)
$$
\arg\max_{y} \; \mathbf{w}^{\top} \phi(x, y)
$$
• $\mathbf{w}$: weight vector for “local” models
• $\phi(x, y)$: collection of classifiers
![Page 52: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/52.jpg)
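Constraint Conditional Models (CCMs) (Roth and Yih, 2007; Chang et al., 2008)
$$
\arg\max_{y} \; \mathbf{w}^{\top} \phi(x, y) - \sum_{k} \rho_k \, d_{C_k}(x, y)
$$
• $\mathbf{w}$: weight vector for “local” models
• $\phi(x, y)$: collection of classifiers
• $\rho_k$: penalty for violating the constraint
• $d_{C_k}(x, y)$: how far y is from a “legal” assignment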
![Page 53: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/53.jpg)
Constraint Conditional Models (CCMs) (Roth and Yih, 2007; Chang et al., 2008)
• Wikipedia
• word clusters
• hierarchy of relations
• entity type constraints
• coreference
![Page 54: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/54.jpg)
Constraint Conditional Models (CCMs)
“David Cone, a Kansas City native, was originally signed by the Royals and broke into the majors with the team”

| fine-grained | probability | coarse-grained | probability |
|---|---|---|---|
| Employment:Staff | 0.20 | Employment | 0.35 |
| Employment:Executive | 0.15 | | |
| Personal:Family | 0.10 | Personal | 0.40 |
| Personal:Business | 0.10 | | |
| Affiliation:Citizen | 0.20 | Affiliation | 0.25 |
| Affiliation:Based-in | 0.25 | | |
![Page 55: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/55.jpg)
Constraint Conditional Models (CCMs) (Roth and Yih, 2007; Chang et al., 2008)
• Key steps
  • Write down a linear objective function
  • Write down constraints as linear inequalities
  • Solve using integer linear programming (ILP) packages
![Page 56: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/56.jpg)
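Knowledge3: Relations between our target relations
[Figure: relation hierarchy. Coarse-grained relations (…, personal, …, employment, …) branch into fine-grained relations: personal → family, biz; employment → executive, staff.]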
![Page 57: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/57.jpg)
Knowledge3: Hierarchy of Relations
[Figure: relation hierarchy (personal → family, biz; employment → executive, staff). A coarse-grained classifier predicts the top level and a fine-grained classifier predicts the bottom level.]
![Page 58: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/58.jpg)
Knowledge3: Hierarchy of Relations
[Figure: relation hierarchy (personal → family, biz; employment → executive, staff). For a mention pair (mi, mj): which coarse-grained label, and which fine-grained label?]
![Page 59: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/59.jpg)
![Page 60: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/60.jpg)
![Page 61: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/61.jpg)
![Page 62: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/62.jpg)
![Page 63: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/63.jpg)
![Page 64: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/64.jpg)
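Knowledge3: Hierarchy of Relations
Write down a linear objective function

$$
\max \; \sum_{R \in \mathcal{R}} \sum_{rc \in L_{Rc}} p(rc)\, x_{R,rc} \;+\; \sum_{R \in \mathcal{R}} \sum_{rf \in L_{Rf}} p(rf)\, y_{R,rf}
$$

• p(rc): coarse-grained prediction probabilities
• p(rf): fine-grained prediction probabilities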
![Page 65: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/65.jpg)
Knowledge3: Hierarchy of Relations
Write down a linear objective function

$$
\max \; \sum_{R \in \mathcal{R}} \sum_{rc \in L_{Rc}} p(rc)\, x_{R,rc} \;+\; \sum_{R \in \mathcal{R}} \sum_{rf \in L_{Rf}} p(rf)\, y_{R,rf}
$$

• p(rc): coarse-grained prediction probabilities
• p(rf): fine-grained prediction probabilities
• x_{R,rc}: coarse-grained indicator variable
• y_{R,rf}: fine-grained indicator variable
• indicator variable == relation assignment
![Page 66: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/66.jpg)
Knowledge3: Hierarchy of Relations
Write down constraints
• If a relation R is assigned a coarse-grained label rc, then we must also assign to R a fine-grained relation rf which is a child of rc (rf_1, …, rf_n are the children of rc):

$$
x_{R,rc} \le y_{R,rf_1} + y_{R,rf_2} + \cdots + y_{R,rf_n}
$$

• (Capturing the inverse relationship) If we assign rf to R, then we must also assign to R the parent of rf, which is the corresponding coarse-grained label:

$$
y_{R,rf} \le x_{R,parent(rf)}
$$
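The objective and the two hierarchy constraints can be tried out on the probabilities from the David Cone example. The sketch below enumerates all label pairs instead of calling an ILP package, and it assumes that each relation receives exactly one coarse-grained and one fine-grained label; both choices are simplifications for illustration, and the helper names are hypothetical.

```python
from itertools import product

# Classifier probabilities from the David Cone example slide.
p_coarse = {"Employment": 0.35, "Personal": 0.40, "Affiliation": 0.25}
p_fine = {
    "Employment:Staff": 0.20, "Employment:Executive": 0.15,
    "Personal:Family": 0.10, "Personal:Business": 0.10,
    "Affiliation:Citizen": 0.20, "Affiliation:Based-in": 0.25,
}


def parent(rf):
    # The coarse-grained parent is the part of the label before the colon.
    return rf.split(":")[0]


def best_assignment(p_coarse, p_fine):
    """Maximise p(rc) + p(rf) over label pairs that respect the hierarchy.

    Choosing one rc and one rf corresponds to setting one x and one y
    indicator to 1; requiring parent(rf) == rc satisfies both constraints.
    """
    feasible = [(rc, rf) for rc, rf in product(p_coarse, p_fine) if parent(rf) == rc]
    return max(feasible, key=lambda pair: p_coarse[pair[0]] + p_fine[pair[1]])


# Taking each classifier's top label separately gives an inconsistent pair.
independent = (max(p_coarse, key=p_coarse.get), max(p_fine, key=p_fine.get))
joint = best_assignment(p_coarse, p_fine)
print(independent)
print(joint)
```

Enumeration is only practical for a handful of labels; with many relations and additional constraints, the same objective would be handed to an ILP solver.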
![Page 67: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/67.jpg)
Knowledge4: Entity Type Constraints (Roth and Yih, 2004, 2007)
• Entity types are useful for constraining the possible labels that a relation R can assume
[Figure: mentions mi and mj with candidate relation labels Employment:Staff, Employment:Executive, Personal:Family, Personal:Business, Affiliation:Citizen, Affiliation:Based-in]
![Page 68: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/68.jpg)
Knowledge4: Entity Type Constraints (Roth and Yih, 2004, 2007)
• Entity types are useful for constraining the possible labels that a relation R can assume

| relation | type of mi | type of mj |
|---|---|---|
| Employment:Staff | per | org |
| Employment:Executive | per | org |
| Personal:Family | per | per |
| Personal:Business | per | per |
| Affiliation:Citizen | per | gpe |
| Affiliation:Based-in | org | gpe |

[Figure: a mention pair (mi, mj) whose entity types are (per, per)]
![Page 69: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/69.jpg)
Knowledge4: Entity Type Constraints (Roth and Yih, 2004, 2007)
• We gather information on entity type constraints from ACE-2004 documentation and impose them on the coarse-grained relations
  • By improving the coarse-grained predictions and combining with the hierarchical constraints defined earlier, the improvements would propagate to the fine-grained predictions
[Figure: the relation labels and argument entity types (per, org, gpe) shown on the previous slide]
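A small sketch of how entity types can prune coarse-grained labels. The type pairs in `ALLOWED_TYPES` are read off the slide's figure and should be treated as illustrative rather than as the full ACE-2004 specification; the function name is hypothetical.

```python
# Allowed (type of mi, type of mj) pairs per coarse-grained relation.
# Illustrative values taken from the figure, not from the ACE documentation.
ALLOWED_TYPES = {
    "Employment": {("per", "org")},
    "Personal": {("per", "per")},
    "Affiliation": {("per", "gpe"), ("org", "gpe")},
}


def allowed_coarse_labels(type_i, type_j):
    """Return the coarse-grained labels compatible with the two entity types."""
    return sorted(
        label for label, pairs in ALLOWED_TYPES.items() if (type_i, type_j) in pairs
    )


print(allowed_coarse_labels("per", "per"))
print(allowed_coarse_labels("org", "gpe"))
print(allowed_coarse_labels("gpe", "gpe"))
```

In the ILP formulation, the same information would appear as constraints that force the indicator of any incompatible coarse-grained label to 0.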
![Page 70: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/70.jpg)
Knowledge5: Coreference
[Figure: mentions mi and mj with candidate relation labels Employment:Staff, Employment:Executive, Personal:Family, Personal:Business, Affiliation:Citizen, Affiliation:Based-in]
![Page 71: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/71.jpg)
Knowledge5: Coreference
• In this work, we assume that we are given the coreference information, which is available from the ACE annotation.
[Figure: mentions mi and mj with candidate relation labels Employment:Staff, Employment:Executive, Personal:Family, Personal:Business, Affiliation:Citizen, Affiliation:Based-in, and null]
![Page 72: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/72.jpg)
Experiment Results
[Figure: F1% improvement from using each knowledge source]

| | All nwire | 10% of nwire |
|---|---|---|
| BasicRE | 50.5% | 31.0% |
![Page 73: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/73.jpg)
Summary: Supervised Relation Extraction
+ Can get high accuracies with enough hand-labeled training data, if the test data is similar enough to the training data
- Labeling a large training set is expensive
- Supervised models are brittle and don’t generalize well to different genres
![Page 74: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/74.jpg)
Bootstrapping Approaches
![Page 75: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/75.jpg)
Rule Learning
• Thinking up some patterns for hyponyms might not be too hard, but what about some new relationship?
  • E.g., enzymes and the molecular pathway(s) they’re involved in?
  • Cities and their mayors? Films and their directors?
• Can we automate the process of identifying patterns?
• Rule learning automates this process, if it is given some examples of the relationship of interest.
  • For instance, some example enzyme names and the names of the pathways they’re involved in.
![Page 76: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/76.jpg)
Seed-based or bootstrapping approaches to relation extraction
• No training set? Maybe you have:
• A few seed tuples or
• A few high-precision patterns
• Can you use those seeds to do something useful?
• Bootstrapping: use the seeds to directly learn to populate a relation
![Page 77: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/77.jpg)
Relation Bootstrapping (Hearst 1992)
• Gather a set of seed pairs that have relation R
• Iterate:
1. Find sentences with these pairs
2. Look at the context between or around the pair and generalize the context to create patterns
3. Use the patterns to grep for more pairs
![Page 78: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/78.jpg)
Bootstrapping for Relation Extraction
[Figure: loop of Initial Seed Tuples → Occurrences of Seed Tuples → Generate Extraction Patterns → Generate New Seed Tuples → Augment Table, after which the loop repeats]
![Page 79: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/79.jpg)
Bootstrapping
• <Mark Twain, Elmira> Seed tuple
  • Grep (google) for the environments of the seed tuple
    “Mark Twain is buried in Elmira, NY.”
      → X is buried in Y
    “The grave of Mark Twain is in Elmira”
      → The grave of X is in Y
    “Elmira is Mark Twain’s final resting place”
      → Y is X’s final resting place.
• Use those patterns to grep for new tuples
• Iterate
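The seed, pattern, grep, iterate loop can be sketched in a few lines. This is a toy version under stated assumptions: patterns are whole sentences with the seed entities replaced by slots, a slot matches a run of capitalized words, and the two non-seed sentences in `corpus` are invented for the example. None of the function names come from the slides.

```python
import re

# A slot matches one or more capitalized words, e.g. "Mark Twain".
SLOT = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"


def learn_patterns(sentences, tuples):
    """Replace a known (X, Y) pair in a sentence by slots to form a pattern."""
    patterns = set()
    for x, y in tuples:
        for s in sentences:
            if x in s and y in s:
                patterns.add(s.replace(x, "<X>").replace(y, "<Y>"))
    return patterns


def pattern_to_regex(pattern):
    """Compile a slot pattern into a regex with named groups x and y."""
    parts = []
    for piece in re.split(r"(<X>|<Y>)", pattern):
        if piece == "<X>":
            parts.append("(?P<x>" + SLOT + ")")
        elif piece == "<Y>":
            parts.append("(?P<y>" + SLOT + ")")
        else:
            parts.append(re.escape(piece))
    return re.compile("".join(parts))


def apply_patterns(sentences, patterns):
    """Return every (X, Y) pair captured by a pattern that matches a full sentence."""
    found = set()
    for regex in map(pattern_to_regex, patterns):
        for s in sentences:
            m = regex.fullmatch(s)
            if m:
                found.add((m.group("x"), m.group("y")))
    return found


def bootstrap(sentences, seeds, rounds=2):
    tuples = set(seeds)
    for _ in range(rounds):
        tuples |= apply_patterns(sentences, learn_patterns(sentences, tuples))
    return tuples


corpus = [
    "The grave of Mark Twain is in Elmira",
    "Elmira is Mark Twain's final resting place",
    "The grave of Emily Dickinson is in Amherst",      # invented example
    "Paris is Jim Morrison's final resting place",     # invented example
]
print(bootstrap(corpus, {("Mark Twain", "Elmira")}))
```

The sketch accepts every extraction; real systems score patterns and tuples and keep only high-confidence ones, since one bad pattern can pull in many wrong pairs.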
![Page 80: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/80.jpg)
Bootstrapping
[Figure: cycle of Seed Examples → Rule Learning → Extraction Rules → High-confidence Extractions → Seed Examples]
Seed Examples:
• Philadelphia – Michael Nutter
• New York – Michael Bloomberg
Extraction Rules:
• X is mayor of Y
• X, mayor of Y
• X runs City Hall in Y
![Page 81: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/81.jpg)
Bootstrapping
[Figure: cycle of Seed Examples → Rule Learning → Extraction Rules → High-confidence Extractions → Seed Examples]
Seed Examples:
• Philadelphia – Michael Nutter
• New York – Michael Bloomberg
• San Diego – Jerry Sanders
• Belgrade – Dragan Đilas
Extraction Rules:
• X is mayor of Y
• X, mayor of Y
• X runs City Hall in Y
• Social Democrat X is new mayor of Y
![Page 82: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/82.jpg)
Bootstrapping for Relation Extraction
[Figure: loop of Initial Seed Tuples → Occurrences of Seed Tuples → Generate Extraction Patterns → Generate New Seed Tuples → Augment Table]

Initial seed tuples:

| ORGANIZATION | LOCATION |
|---|---|
| MICROSOFT | REDMOND |
| IBM | ARMONK |
| BOEING | SEATTLE |
| INTEL | SANTA CLARA |

Occurrences of seed tuples:
• Computer servers at Microsoft’s headquarters in Redmond…
• In mid-afternoon trading, share of Redmond-based Microsoft fell…
• The Armonk-based IBM introduced a new line…
• The combined company will operate from Boeing’s headquarters in Seattle.
• Intel, Santa Clara, cut prices of its Pentium processor.
![Page 83: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/83.jpg)
Bootstrapping for Relation Extraction
[Figure: loop of Initial Seed Tuples → Occurrences of Seed Tuples → Generate Extraction Patterns → Generate New Seed Tuples → Augment Table]
Learned Patterns:
• <STRING1>’s headquarters in <STRING2>
• <STRING2>-based <STRING1>
• <STRING1>, <STRING2>
![Page 84: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/84.jpg)
Bootstrapping for Relation Extraction
[Figure: loop of Initial Seed Tuples → Occurrences of Seed Tuples → Generate Extraction Patterns → Generate New Seed Tuples → Augment Table]
Generate new seed tuples; start new iteration:

| ORGANIZATION | LOCATION |
|---|---|
| AG EDWARDS | ST LUIS |
| 157TH STREET | MANHATTAN |
| 7TH LEVEL | RICHARDSON |
| 3COM CORP | SANTA CLARA |
| 3DO | REDWOOD CITY |
| JELLIES | APPLE |
| MACWEEK | SAN FRANCISCO |
![Page 85: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/85.jpg)
Main Issue: Semantic Drift
• US Presidents → “presidents such as…” → Company presidents
• States → “states such as…” → liquid, solid, gas, Canada, …
  • “Microfinance institutions (MFIs) in the United States such as Accion”
• …
![Page 86: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/86.jpg)
Doubly Anchored Patterns [Kozareva, ACL 08]
• “X such as Y and Z”
  • Fix two, look for the third
• “presidents such as Clinton and *”
  • Will likely not drift to company presidents
![Page 87: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/87.jpg)
Pattern-Match Rule Learning
• Writing accurate patterns for each slot for each application requires laborious software engineering.
• Alternative is to use rule induction methods.
• RAPIER system (Califf & Mooney, 1999) learns three regex-style patterns for each slot:
  • Pre-filler pattern
  • Filler pattern
  • Post-filler pattern
• RAPIER allows use of POS and WordNet categories in patterns to generalize over lexical items.
![Page 88: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/88.jpg)
RAPIER Pattern Induction Example
• If the goal is to extract the name of the city in which a posted job is located, RAPIER constructs the least general generalization of the training examples:
  • Examples: “…located in Atlanta, Georgia…” and “…offices in Kansas City, Missouri…”
  • Induced pattern:
    Prefiller: “in” as Prep
    Filler: 1 to 2 PropNouns
    Postfiller: PropNoun which is a State
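The induced pattern can be applied to tagged text as follows. This sketch covers matching only, not RAPIER's induction step; the POS tags are assigned by hand, punctuation tokens are dropped, and the `STATES` set is a tiny stand-in for a WordNet "state" category. All of these are assumptions for illustration.

```python
# Tiny stand-in for a semantic class lookup such as WordNet.
STATES = {"Georgia", "Missouri"}


def extract_city(tagged):
    """Apply the prefiller / filler / postfiller pattern to (word, pos) tokens."""
    for i, (word, pos) in enumerate(tagged):
        # Prefiller: the word "in" tagged as a preposition.
        if word != "in" or pos != "Prep":
            continue
        # Filler: 1 to 2 proper nouns, tried shortest first.
        for n in (1, 2):
            filler = tagged[i + 1 : i + 1 + n]
            post = tagged[i + 1 + n : i + 2 + n]
            filler_ok = len(filler) == n and all(p == "PropNoun" for _, p in filler)
            # Postfiller: one proper noun that is a State.
            post_ok = bool(post) and post[0][1] == "PropNoun" and post[0][0] in STATES
            if filler_ok and post_ok:
                return " ".join(w for w, _ in filler)
    return None


atlanta = [("located", "Verb"), ("in", "Prep"), ("Atlanta", "PropNoun"), ("Georgia", "PropNoun")]
kansas = [("offices", "Noun"), ("in", "Prep"), ("Kansas", "PropNoun"),
          ("City", "PropNoun"), ("Missouri", "PropNoun")]
print(extract_city(atlanta))
print(extract_city(kansas))
```

The semantic-class test on the postfiller is what lets a single pattern cover both a one-word and a two-word city name.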
![Page 89: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/89.jpg)
Distant Supervision
![Page 90: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/90.jpg)
Distant Supervision
• Combine bootstrapping with supervised learning
  • Instead of 5 seeds, use a large database to get a huge number of seed examples
  • Create lots of features from all these examples
  • Combine in a supervised classifier

Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17
Fei Wu and Daniel S. Weld. 2007. Autonomously Semantifying Wikipedia. CIKM 2007
Mintz, Bills, Snow, Jurafsky. 2009. Distant supervision for relation extraction without labeled data. ACL09
![Page 91: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/91.jpg)
Distant supervision paradigm
• Like supervised classification:
  • Uses a classifier with lots of features
  • Supervised by detailed hand-created knowledge
  • Doesn’t require iteratively expanding patterns
• Like unsupervised classification:
  • Uses very large amounts of unlabeled data
  • Not sensitive to genre issues in training corpus
![Page 92: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/92.jpg)
Distantly supervised learning of relation extraction patterns
1. For each relation
   e.g. Born-In
2. For each tuple in big database
   e.g. <Edwin Hubble, Marshfield>, <Albert Einstein, Ulm>
3. Find sentences in large corpus with both entities
   e.g. “Hubble was born in Marshfield”, “Einstein, born (1879), Ulm”, “Hubble’s birthplace in Marshfield”
4. Extract frequent features (parse, words, etc.)
   e.g. “PER was born in LOC”, “PER, born (XXXX), LOC”, “PER’s birthplace in LOC”
5. Train supervised classifier using thousands of patterns
   P(born-in | f1, f2, f3, …, f70000)
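The first four steps can be sketched as a labeling function. It is a simplification under stated assumptions: entities are matched as plain substrings, the only feature is the sentence with entities replaced by PER and LOC and years masked, and the classifier training of the last step is left out. The function name is hypothetical.

```python
import re


def label_sentences(db, sentences):
    """Pair each sentence containing both entities of a DB tuple with its relation."""
    examples = []
    for relation, pairs in db.items():            # for each relation
        for person, place in pairs:               # for each tuple in the database
            for sentence in sentences:            # find sentences with both entities
                if person in sentence and place in sentence:
                    # Turn the sentence into a lexical pattern feature.
                    feature = sentence.replace(person, "PER").replace(place, "LOC")
                    feature = re.sub(r"\d{4}", "XXXX", feature)  # mask years
                    examples.append((feature, relation))
    return examples


db = {"Born-In": [("Hubble", "Marshfield"), ("Einstein", "Ulm")]}
corpus = [
    "Hubble was born in Marshfield",
    "Einstein, born (1879), Ulm",
    "Hubble's birthplace in Marshfield",
]
examples = label_sentences(db, corpus)
for feature, relation in examples:
    print(relation, "<-", feature)
```

The resulting (feature, relation) pairs would then serve as training data for a multi-class classifier over many thousands of such features.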
![Page 93: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/93.jpg)
Heuristics for Labeling Training Data
e.g. [Mintz et al. 2009]

| Person | Birth Location |
|---|---|
| Barack Obama | Honolulu |
| Mitt Romney | Detroit |
| Albert Einstein | Ulm |
| Nikola Tesla | Smiljan |
| … | … |

Sentences mentioning (Barack Obama, Honolulu):
• “Barack Obama was born on August 4, 1961 at … in the city of Honolulu ...”
• “… site between Downtown Honolulu and Waikiki is proposed site for Barack Obama's presidential library.”
• “Born in Honolulu, Barack Obama went on to become…”
• …

Tuples: (Barack Obama, Honolulu), (Mitt Romney, Detroit), (Albert Einstein, Ulm)

Problem 1: not all sentences are positive training data!
![Page 94: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/94.jpg)
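Multi-instance Learning (MultiR)
[Figure: graphical model for one entity pair, here (Barack Obama, Honolulu)]
• Sentences: s1, s2, s3, …, sn
• Relation mentions: z1, z2, z3, …, zn, predicted by local extractors:

$$
P(z_i = r \mid s_i) \propto \exp\big(\theta \cdot f(s_i, r)\big)
$$

• Aggregate relations (Born-In, Lived-In, children, etc.): d1, d2, …, dk, obtained from the zi by a deterministic OR
• Maximize conditional likelihood:

$$
\sum_{z} P(z, d \mid s; \theta)
$$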
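The two building blocks named on the slide, the local extractor and the deterministic OR, can be sketched directly. This is an illustration under assumptions: features are string names with weights in a dictionary `theta`, "NA" stands for "no relation", and learning of `theta` is not shown.

```python
import math


def local_distribution(theta, features):
    """P(z_i = r | s_i) proportional to exp(theta . f(s_i, r)).

    `features` maps each candidate relation r to the list of feature names
    that fire for (s_i, r); `theta` maps feature names to weights.
    """
    scores = {r: math.exp(sum(theta.get(f, 0.0) for f in fs)) for r, fs in features.items()}
    total = sum(scores.values())
    return {r: score / total for r, score in scores.items()}


def aggregate(z, relations):
    """Deterministic OR: d_r = 1 iff at least one sentence-level z_i equals r."""
    return {r: int(r in z) for r in relations}


dist = local_distribution({"born": 2.0}, {"Born-In": ["born"], "NA": []})
print(dist)
print(aggregate(["Born-In", "NA", "NA"], ["Born-In", "Lived-In"]))
```

Because the aggregate variable is an OR over sentence-level decisions, only one sentence needs to express the relation for the entity pair to be labeled with it; the other sentences are free to be labeled "NA".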
![Page 95: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/95.jpg)
Heuristics for Labeling Training Data
e.g. [Mintz et al. 2009]
[Figure: the same Person / Birth Location database, matching sentences, and tuples as on the earlier “Heuristics for Labeling Training Data” slide]
Problem 2: not all databases are complete (neither is text)
![Page 96: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/96.jpg)
Missing Data Problems…
• Two assumptions drive learning:
  • Not in DB -> not mentioned in text
  • In DB -> must be mentioned at least once
• Leads to errors in training data:
  • False positives
  • False negatives
![Page 97: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/97.jpg)
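Changes
[Figure: the MultiR graphical model, with sentences s1, s2, s3, …, sn; relation mentions z1, z2, z3, …, zn; and aggregate relations d1, d2, …, dk]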
![Page 98: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/98.jpg)
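Modeling Missing Data [Ritter et al. TACL 2013]
[Figure: graphical model with sentences s1, s2, s3, …, sn and relation mentions z1, z2, z3, …, zn. The aggregate layer is split into “mentioned in text” variables t1, t2, …, tk and “mentioned in DB” variables d1, d2, …, dk, and the model encourages agreement between the two.]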
![Page 99: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/99.jpg)
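Side Information
• Entity coverage in database
  • Popular entities
    • Good coverage in Freebase / Wikipedia
    • Unlikely to extract new facts
[Figure: graphical model with sentences s1, s2, s3, …, sn; relation mentions z1, z2, z3, …, zn; text variables t1, t2, …, tk; and DB variables d1, d2, …, dk]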
![Page 100: Information Extraction - Indian Institute of Technology Delhimausam/courses/csl772/autumn2014/lectures/12-ie.pdf · • Information extraction of particular interest to the intelligence](https://reader030.vdocument.in/reader030/viewer/2022041204/5d53392a88c993bb198b6478/html5/thumbnails/100.jpg)
Experiments
[Figure legend:]
• Red: MultiR [Hoffmann et al. 2011]
• Black: Soft Constraints
• Green: Missing Data Model