TRANSCRIPT
Information Extraction from Wikipedia:
Moving Down the Long Tail
Fei Wu, Raphael Hoffmann, Daniel S. Weld
Department of Computer Science & Engineering
University of Washington
Seattle, WA, USA
Intelligence in Wikipedia:
Fei Wu, Eytan Adar, Saleema Amershi, Oren Etzioni, James Fogarty, Raphael Hoffmann, Kayur Patel,
Stef Schoenmackers & Michael Skinner
Motivating Vision Next-Generation Search =
Information Extraction + Ontology + Inference
Which performing artists were born
in Chicago?
…Bob was born in Northwestern Memorial Hospital. …
…Bob Black is an active actor who was selected as this year’s …
…Northwestern Memorial Hospital is one of the country’s leading hospitals in Chicago…
Next-Generation Search
Information Extraction <Bob, Born-In, NMH> <Bob Black, ISA, actor> <NMH, in Chicago> …
Ontology Actor ISA Performing Artist …
Inference Born-In(x, A) ^ PartOf(A, B) => Born-In(x, B) …
…Bob was born in Northwestern Memorial Hospital. …
…Bob Black is an active actor who…
…Northwestern Memorial Hospital is one of the country’s leading hospitals in Chicago…
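A minimal sketch of the slide's inference step, assuming Born-In relates a person to a place; the forward-chaining loop and the triple representation are illustrative choices, not the system's actual reasoner.

```python
# Sketch: forward-chain the rule Born-In(x, A) ^ PartOf(A, B) => Born-In(x, B)
# over a set of extracted (subject, relation, object) triples.

def infer_born_in(triples):
    """Apply the Born-In/PartOf rule until no new facts appear."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        born = [(s, o) for (s, r, o) in facts if r == "Born-In"]
        part = [(s, o) for (s, r, o) in facts if r == "PartOf"]
        for person, place in born:
            for inner, outer in part:
                new = (person, "Born-In", outer)
                if inner == place and new not in facts:
                    facts.add(new)
                    changed = True
    return facts

triples = {
    ("Bob Black", "Born-In", "Northwestern Memorial Hospital"),
    ("Northwestern Memorial Hospital", "PartOf", "Chicago"),
    ("Bob Black", "ISA", "actor"),
}
derived = infer_born_in(triples)
```

With the slide's three extractions, the loop derives the answer to the motivating query: Bob Black was born in Chicago.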
Wikipedia – Bootstrap for the Web
Goal: search over the Web; now: search over Wikipedia
Comprehensive, high-quality, (semi-)structured data
Infoboxes are designed to present summary information about an article's subject, so that similar subjects have a uniform look and a common format.
An infobox is a generalization of a taxobox (from taxonomy), which summarizes information for an organism or group of organisms.
Infoboxes
Infobox examples
Basic infobox
Taxobox –Plant species
More examples
Infobox People: Actor; Infobox: Convention Center
Outline
Background: Kylin extraction
Long-tailed challenges: sparse infobox classes; incomplete articles
Moving down the long tail: shrinkage; retraining; extracting from the Web
Problems with information extraction
IWP (Intelligence in Wikipedia): CCC and IE; the virtuous cycle; IWP with shrinkage, retraining, and Web extraction
Multilingual extraction
Summary
Kylin: Autonomously Semantifying Wikipedia
Fully autonomous, with no additional human effort:
forms its training dataset from existing infoboxes, then
extracts semantic relations from Wikipedia articles.
Kylin: a mythical hooved Chinese chimerical creature that is said to appear in conjunction with the arrival of a sage. (Wikipedia)
Kylin is a prototype self-supervised machine-learning system:
it looks for classes of pages with similar infoboxes,
determines their common attributes, and
creates training examples.
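The self-supervised training-set construction can be sketched as follows; the infobox and sentences are simplified from the county example later in the talk, and the substring-containment heuristic is an illustrative assumption.

```python
# Sketch of Kylin-style self-supervised training-data construction: a
# sentence that contains an infobox attribute's value becomes a positive
# training example for that attribute's extractor.

def build_training_examples(infobox, sentences):
    """Pair each attribute with every sentence that mentions its value."""
    examples = []
    for attribute, value in infobox.items():
        for sentence in sentences:
            if value.lower() in sentence.lower():
                examples.append((attribute, sentence))
    return examples

infobox = {"county_seat": "Clearfield", "created": "1804"}
sentences = [
    "Its county seat is Clearfield.",
    "Clearfield County was created in 1804.",
    "As of 2005, the population density was 28.2/km2.",
]
examples = build_training_examples(infobox, sentences)
```

Note that "Clearfield" also matches the second sentence, illustrating why the harvested labels are noisy and why Kylin later applies bagging and relabeling.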
Kylin
Infobox Generation
Preprocessor
Schema refinement: free editing leads to schema drift.
Duplicate templates: U.S. County (1428), US County (574), Counties (50), County (19)
Low attribute usage; duplicate attributes: "Census Yr", "Census Estimate Yr", "Census Est.", "Census Year"
Kylin uses strict name matching, which covers only 15% of occurrences.
Its county seat is Clearfield.
As of 2005, the population density was 28.2/km².
Clearfield County was created in 1804 from parts of Huntingdon and Lycoming Counties but was administered as part of Centre County until 1812.
2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.
Preprocessor
• Training Dataset Construction
Document Classifier: uses lists and categories; fast; precision 98.5%, recall 68.8%.
Sentence Classifier: predicts which attribute values are contained in a given sentence, using a maximum entropy model. To cope with the noisy and incomplete training dataset, Kylin applies bagging.
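A minimal illustration of bagging over noisy labels (not Kylin's actual maximum entropy classifier; the keyword models and the leave-one-out bags, chosen here for determinism in place of random bootstrap resampling, are simplifications):

```python
# Sketch: train several weak keyword classifiers on different "bags" of a
# noisy training set and combine them by majority vote.

def train_keyword_classifier(samples):
    """Collect words that occur more often in positive than negative sentences."""
    pos, neg = {}, {}
    for sentence, label in samples:
        for word in sentence.lower().split():
            bucket = pos if label else neg
            bucket[word] = bucket.get(word, 0) + 1
    keywords = {w for w, c in pos.items() if c > neg.get(w, 0)}
    return lambda s: any(w in keywords for w in s.lower().split())

def bagged_classifier(training):
    # leave-one-out bags keep this example deterministic; real bagging
    # resamples the training set with replacement
    bags = [training[:i] + training[i + 1:] for i in range(len(training))]
    models = [train_keyword_classifier(bag) for bag in bags]
    return lambda s: sum(m(s) for m in models) > len(models) // 2

training = [
    ("born in Scotland", True),
    ("born in 1828", True),
    ("moved to Australia", False),
    ("population density was high", False),
]
classify = bagged_classifier(training)
```

Because each bag still contains a positive example mentioning "born", every model votes correctly on a held-out birth sentence, while none fires on unrelated text.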
Classifier
Conditional Random Fields Model
Attribute-value extraction is cast as sequential data labeling, with a CRF model trained for each attribute independently.
Relabeling filters false-negative training examples, e.g.: "2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water."
Preprocessor labels: Water_area
Classifier labels: Water_area; Land_area
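The sequential-labeling view can be illustrated with a toy Viterbi decoder; a real CRF learns feature weights from data, whereas the transition table and emission scores below are hand-set assumptions for illustration.

```python
import math

# Toy Viterbi decoder: label each token of a sentence as O (outside) or
# VALUE (part of an attribute value), as a linear-chain model would.

LABELS = ["O", "VALUE"]
TRANSITION = {("O", "O"): 0.8, ("O", "VALUE"): 0.2,
              ("VALUE", "O"): 0.6, ("VALUE", "VALUE"): 0.4}

def looks_like_area(token):
    t = token.replace(",", "")
    return t.isdigit() or t.endswith("km2")

def emission(token, label):
    hit = looks_like_area(token)
    if label == "VALUE":
        return 0.9 if hit else 0.1
    return 0.1 if hit else 0.9

def viterbi(tokens):
    # dynamic program over log-scores, with backpointers for decoding
    scores = [{lab: math.log(emission(tokens[0], lab)) for lab in LABELS}]
    back = []
    for tok in tokens[1:]:
        row, ptr = {}, {}
        for lab in LABELS:
            prev = max(LABELS,
                       key=lambda p: scores[-1][p] + math.log(TRANSITION[(p, lab)]))
            row[lab] = (scores[-1][prev] + math.log(TRANSITION[(prev, lab)])
                        + math.log(emission(tok, lab)))
            ptr[lab] = prev
        scores.append(row)
        back.append(ptr)
    lab = max(LABELS, key=lambda l: scores[-1][l])
    path = [lab]
    for ptr in reversed(back):
        lab = ptr[lab]
        path.append(lab)
    return list(reversed(path))

tokens = "2,972 km2 of it is land".split()
labels = viterbi(tokens)
```

On the land-area example, the decoder marks "2,972 km2" as the value span and the rest of the sentence as outside.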
Though Kylin is successful on popular classes, its performance decreases on sparse classes where there is insufficient training data.
CRF Extractor
Long-Tail 1: Sparse Infobox Class
Kylin performs well on popular classes: precision from the mid-70s to the high 90s (%), recall from the low 50s to the mid-90s (%).
Kylin flounders on sparse classes, which offer little training data: e.g., for the "US County" class Kylin achieves 97.3% precision and 95.9% recall, while many other classes, such as "Irish Newspaper", contain only a small number of infobox-containing articles.
Long-Tail 2: Incomplete Articles
Desired information is missing from Wikipedia: among its 1.8 million pages [July 2007], many are short articles, and almost 800,000 (44.2%) are marked as stub pages, indicating that much needed information is missing.
Attempt to improve Kylin's performance using shrinkage: when training an extractor for a sparse infobox class, aggregate training data from its parent and child classes.
Shrinkage
[Figure: infobox-class hierarchy with mapped attributes: performer (44) .location; its subclasses actor (8738) and comedian (106) with attribute variants .birthplace, .birth_place, .cityofbirth, .origin; its parent person (1201) .birth_place]
[McCallum et al., ICML98]
KOG (Kylin Ontology Generator) [Wu & Weld, WWW08] supplies the class hierarchy and attribute mappings used for shrinkage.
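A minimal sketch of shrinkage as training-data pooling: the class and attribute names follow the slide's hierarchy, but the weights and the example values are invented for illustration.

```python
# Sketch: a sparse target class borrows training values from mapped
# attributes of its parent and child classes, downweighted by a
# per-class weight (here hand-set; a real system would tune these).

def shrink_training_data(target_values, related, weights):
    """related maps a mapped attribute name -> list of its training values."""
    pooled = [(value, 1.0) for value in target_values]
    for attr, values in related.items():
        w = weights[attr]
        pooled.extend((value, w) for value in values)
    return pooled

performer_location = ["Chicago"]                  # sparse target: performer.location
related = {
    "person.birth_place": ["Boston", "Seattle"],  # parent class attribute
    "actor.birth_place": ["London"],              # child class attribute
}
weights = {"person.birth_place": 0.5, "actor.birth_place": 0.7}
pooled = shrink_training_data(performer_location, related, weights)
```

The extractor for the 44-instance performer class can then be trained on the weighted pool instead of its own tiny dataset.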
Retraining
Key question: how to identify relevant sentences in the sea of Web data.
Complementary to shrinkage: harvest extra training data from the broader Web.
TextRunner extraction: "Andrew Murray was born in Scotland in 1828 ..." yields
<Andrew Murray, was born in, Scotland> and <Andrew Murray, was born in, 1828>.
Retraining
Kylin extraction: t = <Ada Cambridge, location, "St Germans, Norfolk, England">
Query TextRunner for relevant sentences:
• r1 = <Ada Cambridge, was born in, England>: "Ada Cambridge was born in England in 1844 and moved to Australia with her curate husband in 1870."
• r2 = <Ada Cambridge, was born in, "Norfolk, England">: "Ada Cambridge was born in Norfolk, England, in 1844."
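The retraining match step might be sketched as follows; the criterion for deciding that a TextRunner tuple supports Kylin's extraction t (same subject, and the tuple's object contained in t's value) is an illustrative assumption.

```python
# Sketch: harvest sentences from TextRunner tuples that agree with an
# existing Kylin extraction, to use as extra training examples.

def supporting_sentences(t, textrunner_results):
    """Return sentences of tuples whose subject matches t and whose
    object is contained in t's attribute value."""
    subject, _attribute, value = t
    sentences = []
    for (r_subject, _pred, r_object), sentence in textrunner_results:
        if r_subject == subject and r_object in value:
            sentences.append(sentence)
    return sentences

t = ("Ada Cambridge", "location", "St Germans, Norfolk, England")
results = [
    (("Ada Cambridge", "was born in", "England"),
     "Ada Cambridge was born in England in 1844 and moved to Australia."),
    (("Ada Cambridge", "was born in", "Norfolk, England"),
     "Ada Cambridge was born in Norfolk, England, in 1844."),
    (("Andrew Murray", "was born in", "Scotland"),
     "Andrew Murray was born in Scotland in 1828."),
]
extra = supporting_sentences(t, results)
```

Here r1 and r2 are accepted as new training sentences for the location attribute, while the Andrew Murray tuple is rejected because the subjects differ.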
Effect of Shrinkage & Retraining
1755% improvement for a sparse class
13.7% improvement for a popular class
Extraction from the Web
Idea: apply Kylin extractors trained on Wikipedia to general Web pages.
Challenge: maintain high precision; general Web pages are noisy, and many describe multiple objects.
Key: retrieve relevant sentences.
Procedure: generate a set of search-engine queries; retrieve the top-k pages from Google; weight the extractions from these pages.
Choosing Queries
Example: get the birth date attribute for the article titled "Andrew Murray (minister)"
"andrew murray"
"andrew murray" birth date (attribute name)
"andrew murray" was born in (predicate from TextRunner)
"andrew murray" …
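Query generation can be sketched as below; the hard-coded predicate list stands in for real TextRunner output, and stripping the "(minister)" disambiguator from the title is assumed to have already happened.

```python
# Sketch: build search-engine queries from the article title, the
# attribute name, and TextRunner predicates for that attribute.

def make_queries(title, attribute, predicates):
    base = '"%s"' % title.lower()
    queries = [base, "%s %s" % (base, attribute.replace("_", " "))]
    queries += ["%s %s" % (base, p) for p in predicates]
    return queries

queries = make_queries("Andrew Murray", "birth_date", ["was born in"])
```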
Weighting Extractions
Which extractions are more relevant? Features:
• number of sentences between the extraction's sentence and the closest occurrence of the title ("andrew murray")
• rank of the page in Google's result list
• Kylin's extractor confidence
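One way to combine the three signals is a weighted linear score; the weights and the inverse transforms below are illustrative assumptions, not the system's learned combination.

```python
# Sketch: score an extraction by combining distance to the nearest title
# mention, search-result rank, and extractor confidence.

def score_extraction(distance_to_title, page_rank, confidence,
                     w=(0.4, 0.2, 0.4)):
    # turn "smaller is better" features into scores in (0, 1]
    closeness = 1.0 / (1 + distance_to_title)
    rank_score = 1.0 / page_rank
    return w[0] * closeness + w[1] * rank_score + w[2] * confidence

# an extraction near the title mention on the top-ranked page ...
near_top = score_extraction(distance_to_title=0, page_rank=1, confidence=0.9)
# ... outscores an equally confident one far from the title on page 8
far_low = score_extraction(distance_to_title=10, page_rank=8, confidence=0.9)
```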
Web Extraction Experiment
Extractor confidence alone performs poorly; the weighted combination performs best.
Combining Wikipedia & Web
Recall Benefit from Shrinkage / Retraining…
Combining Wikipedia & Web
Benefit from Shrinkage + Retraining + Web
Problem
Information extraction is imprecise, and Wikipedians don't want 90% precision.
How to improve precision? People!
Intelligence in Wikipedia
What is IWP? A project/system that aims to combine
IE (information extraction) and CCC (communal content creation).
Information Extraction
Examples: Zoominfo.com, FlipDog.com, Citeseer, Google
Advantage: autonomy. Disadvantage: expensive.
IE system contributors
Contributors in this room? Wikipedia.
IE systems: Citeseer, Rexa, DBlife.
Communal Content Creation
Examples: Wikipedia, eBay, Netflix
Advantage: more accurate than IE. Disadvantages: bootstrapping, incentives, and management.
Virtuous Cycle
Contributing as a Non-Primary Task
Encourage contributions without annoying or abusing readers.
Compared 5 different interfaces.
Results
• Contribution rate improved from 1.6% to 13%
• 90% of positive labels were correct
IWP and Shrinkage, Retraining, and Extracting from the Web
Shrinkage improves IWP's precision and recall.
Retraining improves the robustness of IWP's extractors.
Web extraction further improves IWP's performance.
Multi-Lingual Extraction
Idea: further leverage the virtuous feedback cycle.
Use IE methods to add or update missing information by copying it from one language edition to another.
Use CCC to validate and improve the updates.
Example (Spanish vs. English infobox): Nombre = "Jerry Seinfeld" and Name = "Jerry Seinfeld"; Cónyuge = "Jessica Sklar" and Spouse = "Jessica Seinfeld".
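The cross-language update idea can be sketched as an infobox comparison; the attribute mapping and values follow the slide's Spanish/English example, and the agree/flag policy is an illustrative assumption.

```python
# Sketch: compare mapped attributes across two language editions;
# agreeing values are validated, disagreeing ones are flagged for
# communal (CCC) review.

MAPPING = {"Nombre": "Name", "Cónyuge": "Spouse"}  # Spanish -> English

def compare_infoboxes(spanish, english):
    agree, conflict = [], []
    for es_attr, en_attr in MAPPING.items():
        es_val, en_val = spanish.get(es_attr), english.get(en_attr)
        if es_val == en_val:
            agree.append(en_attr)
        else:
            conflict.append((en_attr, es_val, en_val))
    return agree, conflict

spanish = {"Nombre": "Jerry Seinfeld", "Cónyuge": "Jessica Sklar"}
english = {"Name": "Jerry Seinfeld", "Spouse": "Jessica Seinfeld"}
agree, conflict = compare_infoboxes(spanish, english)
```

The matching names validate each other, while the differing spouse values (maiden vs. married name) would be surfaced to editors rather than silently copied.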
Summary
Kylin's initial performance is unacceptable; methods for increasing recall:
shrinkage, retraining, and extraction from the Web.
IWP: developing AI methods to facilitate the growth, operation, and use of Wikipedia.
Initial goal: extraction of a giant knowledge base of semantic triples, supporting
faceted browsing and input to a reasoning-based question-answering system.
How: IE + CCC.
Questions