TRANSCRIPT
Information Extraction from Wikipedia:
Moving Down the Long Tail
Fei Wu, Raphael Hoffmann, Daniel S. Weld
Department of Computer Science & Engineering
University of Washington
Seattle, WA, USA
Intelligence in Wikipedia:
Fei Wu, Eytan Adar, Saleema Amershi, Oren Etzioni, James Fogarty, Raphael Hoffmann, Kayur Patel,
Stef Schoenmackers & Michael Skinner
Motivating Vision Next-Generation Search =
Information Extraction + Ontology + Inference
Which performing artists were born
in Chicago?
…Bob was born in Northwestern Memorial Hospital. …
…Bob Black is an active actor who was selected as this year’s …
…Northwestern Memorial Hospital is one of the country’s leading hospitals in Chicago…
Next-Generation Search
Information Extraction <Bob, Born-In, NMH> <Bob Black, ISA, actor> <NMH, in Chicago> …
Ontology Actor ISA Performing Artist …
Inference Born-In(x, A) ^ PartOf(A, B) => Born-In(x, B) …
…Bob was born in Northwestern Memorial Hospital. …
…Bob Black is an active actor who…
…Northwestern Memorial Hospital is one of the country’s leading hospitals in Chicago…
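A minimal sketch of the slide's inference step, assuming Born-In relates a person to a place; the forward-chaining loop and the triple representation are illustrative choices, not the system's actual reasoner.

```python
# Sketch: forward-chain the rule Born-In(x, A) ^ PartOf(A, B) => Born-In(x, B)
# over a set of extracted (subject, relation, object) triples.

def infer_born_in(triples):
    """Apply the Born-In/PartOf rule until no new facts appear."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        born = [(s, o) for (s, r, o) in facts if r == "Born-In"]
        part = [(s, o) for (s, r, o) in facts if r == "PartOf"]
        for person, place in born:
            for inner, outer in part:
                new = (person, "Born-In", outer)
                if inner == place and new not in facts:
                    facts.add(new)
                    changed = True
    return facts

triples = {
    ("Bob Black", "Born-In", "Northwestern Memorial Hospital"),
    ("Northwestern Memorial Hospital", "PartOf", "Chicago"),
    ("Bob Black", "ISA", "actor"),
}
derived = infer_born_in(triples)
```

With the slide's three extractions, the loop derives the answer to the motivating query: Bob Black was born in Chicago.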
Wikipedia – Bootstrap for the Web
Goal: search over the Web; now: search over Wikipedia
Comprehensive, high-quality, (semi-)structured data
Infoboxes are designed to present summary information about an article's subject, so that similar subjects have a uniform look and a common format.
An infobox is a generalization of a taxobox (from taxonomy), which summarizes information for an organism or group of organisms.
Infoboxes
Infobox examples
Basic infobox
Taxobox –Plant species
More examples
Infobox People: Actor; Infobox: Convention Center
Outline
Background: Kylin extraction
Long-tailed challenges: sparse infobox classes; incomplete articles
Moving down the long tail: shrinkage; retraining; extracting from the Web
Problems with information extraction
IWP (Intelligence in Wikipedia): CCC and IE; the virtuous cycle; IWP with shrinkage, retraining, and Web extraction
Multilingual extraction
Summary
Kylin: Autonomously Semantifying Wikipedia
Fully autonomous, with no additional human effort:
forms its training dataset from existing infoboxes, then
extracts semantic relations from Wikipedia articles.
Kylin: a mythical hooved Chinese chimerical creature that is said to appear in conjunction with the arrival of a sage. (Wikipedia)
Kylin is a prototype self-supervised machine-learning system:
it looks for classes of pages with similar infoboxes,
determines their common attributes, and
creates training examples.
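The self-supervised training-set construction can be sketched as follows; the infobox and sentences are simplified from the county example later in the talk, and the substring-containment heuristic is an illustrative assumption.

```python
# Sketch of Kylin-style self-supervised training-data construction: a
# sentence that contains an infobox attribute's value becomes a positive
# training example for that attribute's extractor.

def build_training_examples(infobox, sentences):
    """Pair each attribute with every sentence that mentions its value."""
    examples = []
    for attribute, value in infobox.items():
        for sentence in sentences:
            if value.lower() in sentence.lower():
                examples.append((attribute, sentence))
    return examples

infobox = {"county_seat": "Clearfield", "created": "1804"}
sentences = [
    "Its county seat is Clearfield.",
    "Clearfield County was created in 1804.",
    "As of 2005, the population density was 28.2/km2.",
]
examples = build_training_examples(infobox, sentences)
```

Note that "Clearfield" also matches the second sentence, illustrating why the harvested labels are noisy and why Kylin later applies bagging and relabeling.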
Kylin
Infobox Generation
Preprocessor
Schema refinement: free editing leads to schema drift.
Duplicate templates: U.S. County (1428), US County (574), Counties (50), County (19)
Low attribute usage; duplicate attributes: "Census Yr", "Census Estimate Yr", "Census Est.", "Census Year"
Kylin uses strict name matching, which covers only 15% of occurrences.
Its county seat is Clearfield.
As of 2005, the population density was 28.2/km².
Clearfield County was created in 1804 from parts of Huntingdon and Lycoming Counties but was administered as part of Centre County until 1812.
2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.
Preprocessor
• Training Dataset Construction
Document Classifier: uses lists and categories; fast; precision 98.5%, recall 68.8%.
Sentence Classifier: predicts which attribute values are contained in a given sentence, using a maximum entropy model. To cope with the noisy and incomplete training dataset, Kylin applies bagging.
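A minimal illustration of bagging over noisy labels (not Kylin's actual maximum entropy classifier; the keyword models and the leave-one-out bags, chosen here for determinism in place of random bootstrap resampling, are simplifications):

```python
# Sketch: train several weak keyword classifiers on different "bags" of a
# noisy training set and combine them by majority vote.

def train_keyword_classifier(samples):
    """Collect words that occur more often in positive than negative sentences."""
    pos, neg = {}, {}
    for sentence, label in samples:
        for word in sentence.lower().split():
            bucket = pos if label else neg
            bucket[word] = bucket.get(word, 0) + 1
    keywords = {w for w, c in pos.items() if c > neg.get(w, 0)}
    return lambda s: any(w in keywords for w in s.lower().split())

def bagged_classifier(training):
    # leave-one-out bags keep this example deterministic; real bagging
    # resamples the training set with replacement
    bags = [training[:i] + training[i + 1:] for i in range(len(training))]
    models = [train_keyword_classifier(bag) for bag in bags]
    return lambda s: sum(m(s) for m in models) > len(models) // 2

training = [
    ("born in Scotland", True),
    ("born in 1828", True),
    ("moved to Australia", False),
    ("population density was high", False),
]
classify = bagged_classifier(training)
```

Because each bag still contains a positive example mentioning "born", every model votes correctly on a held-out birth sentence, while none fires on unrelated text.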
Classifier
Conditional Random Fields Model
Attribute-value extraction is cast as sequential data labeling, with a CRF model trained for each attribute independently.
Relabeling filters false-negative training examples, e.g.: "2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water."
Preprocessor labels: Water_area
Classifier labels: Water_area; Land_area
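The sequential-labeling view can be illustrated with a toy Viterbi decoder; a real CRF learns feature weights from data, whereas the transition table and emission scores below are hand-set assumptions for illustration.

```python
import math

# Toy Viterbi decoder: label each token of a sentence as O (outside) or
# VALUE (part of an attribute value), as a linear-chain model would.

LABELS = ["O", "VALUE"]
TRANSITION = {("O", "O"): 0.8, ("O", "VALUE"): 0.2,
              ("VALUE", "O"): 0.6, ("VALUE", "VALUE"): 0.4}

def looks_like_area(token):
    t = token.replace(",", "")
    return t.isdigit() or t.endswith("km2")

def emission(token, label):
    hit = looks_like_area(token)
    if label == "VALUE":
        return 0.9 if hit else 0.1
    return 0.1 if hit else 0.9

def viterbi(tokens):
    # dynamic program over log-scores, with backpointers for decoding
    scores = [{lab: math.log(emission(tokens[0], lab)) for lab in LABELS}]
    back = []
    for tok in tokens[1:]:
        row, ptr = {}, {}
        for lab in LABELS:
            prev = max(LABELS,
                       key=lambda p: scores[-1][p] + math.log(TRANSITION[(p, lab)]))
            row[lab] = (scores[-1][prev] + math.log(TRANSITION[(prev, lab)])
                        + math.log(emission(tok, lab)))
            ptr[lab] = prev
        scores.append(row)
        back.append(ptr)
    lab = max(LABELS, key=lambda l: scores[-1][l])
    path = [lab]
    for ptr in reversed(back):
        lab = ptr[lab]
        path.append(lab)
    return list(reversed(path))

tokens = "2,972 km2 of it is land".split()
labels = viterbi(tokens)
```

On the land-area example, the decoder marks "2,972 km2" as the value span and the rest of the sentence as outside.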
Though Kylin is successful on popular classes, its performance decreases on sparse classes where there is insufficient training data.
CRF Extractor
Long-Tail 1: Sparse Infobox Class
Kylin performs well on popular classes: precision from the mid-70s to the high 90s (%), recall from the low 50s to the mid-90s (%).
Kylin flounders on sparse classes, which offer little training data: e.g., for the "US County" class Kylin achieves 97.3% precision and 95.9% recall, while many other classes, such as "Irish Newspaper", contain only a small number of infobox-containing articles.
Long-Tail 2: Incomplete Articles
Desired information is missing from Wikipedia: among its 1.8 million pages [July 2007], many are short articles, and almost 800,000 (44.2%) are marked as stub pages, indicating that much needed information is missing.
Attempt to improve Kylin's performance using shrinkage: when training an extractor for a sparse infobox class, aggregate training data from its parent and child classes.
Shrinkage
[Figure: infobox-class hierarchy with mapped attributes: performer (44) .location; its subclasses actor (8738) and comedian (106) with attribute variants .birthplace, .birth_place, .cityofbirth, .origin; its parent person (1201) .birth_place]
[McCallum et al., ICML98]
KOG (Kylin Ontology Generator) [Wu & Weld, WWW08] supplies the class hierarchy and attribute mappings used for shrinkage.
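A minimal sketch of shrinkage as training-data pooling: the class and attribute names follow the slide's hierarchy, but the weights and the example values are invented for illustration.

```python
# Sketch: a sparse target class borrows training values from mapped
# attributes of its parent and child classes, downweighted by a
# per-class weight (here hand-set; a real system would tune these).

def shrink_training_data(target_values, related, weights):
    """related maps a mapped attribute name -> list of its training values."""
    pooled = [(value, 1.0) for value in target_values]
    for attr, values in related.items():
        w = weights[attr]
        pooled.extend((value, w) for value in values)
    return pooled

performer_location = ["Chicago"]                  # sparse target: performer.location
related = {
    "person.birth_place": ["Boston", "Seattle"],  # parent class attribute
    "actor.birth_place": ["London"],              # child class attribute
}
weights = {"person.birth_place": 0.5, "actor.birth_place": 0.7}
pooled = shrink_training_data(performer_location, related, weights)
```

The extractor for the 44-instance performer class can then be trained on the weighted pool instead of its own tiny dataset.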
Retraining
Key question: how to identify relevant sentences in the sea of Web data.
Complementary to shrinkage: harvest extra training data from the broader Web.
TextRunner extraction: "Andrew Murray was born in Scotland in 1828 ..." yields
<Andrew Murray, was born in, Scotland> and <Andrew Murray, was born in, 1828>.
Retraining
Kylin extraction: t = <Ada Cambridge, location, "St Germans, Norfolk, England">
Query TextRunner for relevant sentences:
• r1 = <Ada Cambridge, was born in, England>: "Ada Cambridge was born in England in 1844 and moved to Australia with her curate husband in 1870."
• r2 = <Ada Cambridge, was born in, "Norfolk, England">: "Ada Cambridge was born in Norfolk, England, in 1844."
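The retraining match step might be sketched as follows; the criterion for deciding that a TextRunner tuple supports Kylin's extraction t (same subject, and the tuple's object contained in t's value) is an illustrative assumption.

```python
# Sketch: harvest sentences from TextRunner tuples that agree with an
# existing Kylin extraction, to use as extra training examples.

def supporting_sentences(t, textrunner_results):
    """Return sentences of tuples whose subject matches t and whose
    object is contained in t's attribute value."""
    subject, _attribute, value = t
    sentences = []
    for (r_subject, _pred, r_object), sentence in textrunner_results:
        if r_subject == subject and r_object in value:
            sentences.append(sentence)
    return sentences

t = ("Ada Cambridge", "location", "St Germans, Norfolk, England")
results = [
    (("Ada Cambridge", "was born in", "England"),
     "Ada Cambridge was born in England in 1844 and moved to Australia."),
    (("Ada Cambridge", "was born in", "Norfolk, England"),
     "Ada Cambridge was born in Norfolk, England, in 1844."),
    (("Andrew Murray", "was born in", "Scotland"),
     "Andrew Murray was born in Scotland in 1828."),
]
extra = supporting_sentences(t, results)
```

Here r1 and r2 are accepted as new training sentences for the location attribute, while the Andrew Murray tuple is rejected because the subjects differ.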
Effect of Shrinkage & Retraining
1755% improvement for a sparse class
13.7% improvement for a popular class
Extraction from the Web
Idea: apply Kylin extractors trained on Wikipedia to general Web pages.
Challenge: maintain high precision; general Web pages are noisy, and many describe multiple objects.
Key: retrieve relevant sentences.
Procedure: generate a set of search-engine queries; retrieve the top-k pages from Google; weight the extractions from these pages.
Choosing Queries
Example: get the birth date attribute for the article titled "Andrew Murray (minister)"
"andrew murray"
"andrew murray" birth date (attribute name)
"andrew murray" was born in (predicate from TextRunner)
"andrew murray" …
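Query generation can be sketched as below; the hard-coded predicate list stands in for real TextRunner output, and stripping the "(minister)" disambiguator from the title is assumed to have already happened.

```python
# Sketch: build search-engine queries from the article title, the
# attribute name, and TextRunner predicates for that attribute.

def make_queries(title, attribute, predicates):
    base = '"%s"' % title.lower()
    queries = [base, "%s %s" % (base, attribute.replace("_", " "))]
    queries += ["%s %s" % (base, p) for p in predicates]
    return queries

queries = make_queries("Andrew Murray", "birth_date", ["was born in"])
```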
Weighting Extractions
Which extractions are more relevant? Features:
• number of sentences between the extraction's sentence and the closest occurrence of the title ("andrew murray")
• rank of the page in Google's result list
• Kylin's extractor confidence
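One way to combine the three signals is a weighted linear score; the weights and the inverse transforms below are illustrative assumptions, not the system's learned combination.

```python
# Sketch: score an extraction by combining distance to the nearest title
# mention, search-result rank, and extractor confidence.

def score_extraction(distance_to_title, page_rank, confidence,
                     w=(0.4, 0.2, 0.4)):
    # turn "smaller is better" features into scores in (0, 1]
    closeness = 1.0 / (1 + distance_to_title)
    rank_score = 1.0 / page_rank
    return w[0] * closeness + w[1] * rank_score + w[2] * confidence

# an extraction near the title mention on the top-ranked page ...
near_top = score_extraction(distance_to_title=0, page_rank=1, confidence=0.9)
# ... outscores an equally confident one far from the title on page 8
far_low = score_extraction(distance_to_title=10, page_rank=8, confidence=0.9)
```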
Web Extraction Experiment
Extractor confidence alone performs poorly; the weighted combination performs best.
Combining Wikipedia & Web
Recall Benefit from Shrinkage / Retraining…
Combining Wikipedia & Web
Benefit from Shrinkage + Retraining + Web
Problem
Information extraction is imprecise, and Wikipedians don't want 90% precision.
How to improve precision? People!
Intelligence in Wikipedia
What is IWP? A project/system that aims to combine
IE (information extraction) and CCC (communal content creation).
Information Extraction
Examples: Zoominfo.com, FlipDog.com, Citeseer, Google
Advantage: autonomy. Disadvantage: expensive.
IE system contributors
Contributors in this room? Wikipedia.
IE systems: Citeseer, Rexa, DBlife.
Communal Content Creation
Examples: Wikipedia, eBay, Netflix
Advantage: more accurate than IE. Disadvantages: bootstrapping, incentives, and management.
Virtuous Cycle
Contributing as a Non-Primary Task
Encourage contributions without annoying or abusing readers.
Compared 5 different interfaces.
Results
• Contribution rate improved from 1.6% to 13%
• 90% of positive labels were correct
IWP and Shrinkage, Retraining, and Extracting from the Web
Shrinkage improves IWP's precision and recall.
Retraining improves the robustness of IWP's extractors.
Web extraction further improves IWP's performance.
Multi-Lingual Extraction
Idea: further leverage the virtuous feedback cycle.
Use IE methods to add or update missing information by copying it from one language edition to another.
Use CCC to validate and improve the updates.
Example (Spanish vs. English infobox): Nombre = "Jerry Seinfeld" and Name = "Jerry Seinfeld"; Cónyuge = "Jessica Sklar" and Spouse = "Jessica Seinfeld".
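The cross-language update idea can be sketched as an infobox comparison; the attribute mapping and values follow the slide's Spanish/English example, and the agree/flag policy is an illustrative assumption.

```python
# Sketch: compare mapped attributes across two language editions;
# agreeing values are validated, disagreeing ones are flagged for
# communal (CCC) review.

MAPPING = {"Nombre": "Name", "Cónyuge": "Spouse"}  # Spanish -> English

def compare_infoboxes(spanish, english):
    agree, conflict = [], []
    for es_attr, en_attr in MAPPING.items():
        es_val, en_val = spanish.get(es_attr), english.get(en_attr)
        if es_val == en_val:
            agree.append(en_attr)
        else:
            conflict.append((en_attr, es_val, en_val))
    return agree, conflict

spanish = {"Nombre": "Jerry Seinfeld", "Cónyuge": "Jessica Sklar"}
english = {"Name": "Jerry Seinfeld", "Spouse": "Jessica Seinfeld"}
agree, conflict = compare_infoboxes(spanish, english)
```

The matching names validate each other, while the differing spouse values (maiden vs. married name) would be surfaced to editors rather than silently copied.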
Summary
Kylin's initial performance is unacceptable; methods for increasing recall:
shrinkage, retraining, and extraction from the Web.
IWP: developing AI methods to facilitate the growth, operation, and use of Wikipedia.
Initial goal: extraction of a giant knowledge base of semantic triples, supporting
faceted browsing and input to a reasoning-based question-answering system.
How: IE + CCC.
Questions