Page 1:

Information Extraction from Wikipedia:

Moving Down the Long Tail

Fei Wu, Raphael Hoffmann, Daniel S. Weld

Department of Computer Science & Engineering

University of Washington

Seattle, WA, USA

Intelligence in Wikipedia:

Fei Wu, Eytan Adar, Saleema Amershi, Oren Etzioni, James Fogarty, Raphael Hoffmann, Kayur Patel, Stef Schoenmackers & Michael Skinner

Page 2:

Motivating Vision

Next-Generation Search = Information Extraction + Ontology + Inference

Which performing artists were born in Chicago?

…Bob was born in Northwestern Memorial Hospital. …

…Bob Black is an active actor who was selected as this year’s …

…Northwestern Memorial Hospital is one of the country’s leading hospitals in Chicago…

Page 3:

Next-Generation Search

Information Extraction: <Bob, Born-In, NMH> <Bob Black, ISA, actor> <NMH, in, Chicago> …

Ontology: Actor ISA Performing Artist …

Inference: Born-In(A) ^ PartOf(A,B) => Born-In(B) …

…Bob was born in Northwestern Memorial Hospital. …

…Bob Black is an active actor who…

…Northwestern Memorial Hospital is one of the country’s leading hospitals in Chicago…
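
The three ingredients on this slide can be sketched in a few lines of Python. This is a toy illustration only, not the system's implementation: the tuple encoding, the relation names `born_in`, `isa` and `part_of`, and the helper names are assumptions made for the example, and "Bob" and "Bob Black" are assumed to have already been resolved to one entity.

```python
# Toy facts, as an extractor might emit them (entity resolution assumed done).
facts = {
    ("Bob Black", "born_in", "Northwestern Memorial Hospital"),
    ("Bob Black", "isa", "actor"),
    ("Northwestern Memorial Hospital", "part_of", "Chicago"),
}

# Toy ontology: subclass -> superclass.
ontology = {"actor": "performing artist"}


def superclasses(cls):
    """Yield cls and every ancestor of cls in the ontology."""
    while cls is not None:
        yield cls
        cls = ontology.get(cls)


def born_in_closure(all_facts):
    """Apply born_in(x, a) and part_of(a, b) => born_in(x, b) to a fixed point."""
    derived = set(all_facts)
    changed = True
    while changed:
        changed = False
        for (x, r1, a) in list(derived):
            if r1 != "born_in":
                continue
            for (a2, r2, b) in list(derived):
                if r2 == "part_of" and a2 == a and (x, "born_in", b) not in derived:
                    derived.add((x, "born_in", b))
                    changed = True
    return derived


def performing_artists_born_in(place):
    """Answer the motivating query over the toy facts."""
    closed = born_in_closure(facts)
    people = {x for (x, r, p) in closed if r == "born_in" and p == place}
    return sorted(
        x for x in people
        if any("performing artist" in superclasses(c)
               for (y, r, c) in closed if y == x and r == "isa")
    )


print(performing_artists_born_in("Chicago"))
```

None of the three source sentences answers the question alone; the answer only appears once extraction, the ontology and the inference rule are combined.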

Page 4:

Wikipedia – Bootstrap for the Web

Goal: search over the Web

Now: search over Wikipedia

• Comprehensive
• High-quality
• (Semi-)Structured data

Page 5:

Infoboxes

Infoboxes are designed to present summary information about an article's subject, so that similar subjects have a uniform look and a common format.

An infobox is a generalization of a taxobox (from taxonomy), which summarizes information for an organism or group of organisms.

Page 6:

Infobox examples

Basic infobox

Taxobox – Plant species

Page 7:

More examples

Infobox People - Actor

Infobox - Convention Center

Page 8:

Outline

• Background: Kylin Extraction
• Long-Tailed Challenges
› Sparse infobox classes
› Incomplete articles
• Moving Down the Long Tails
› Shrinkage
› Retraining
› Extracting from the Web
• Problem with Information Extraction
• IWP (Intelligence in Wikipedia)
• CCC and IE
• Virtuous Cycle
• IWP (Shrinkage, Retraining and Extracting from Web)
• Multilingual Extraction
• Summary

Page 9:

Kylin: Autonomously Semantifying Wikipedia

• Fully autonomous, with no additional human effort
• Forms a training dataset based on infoboxes
• Extracts semantic relations from Wikipedia articles

Kylin: a mythical hooved Chinese chimerical creature that is said to appear in conjunction with the arrival of a sage. (Wikipedia)

Page 10:

Kylin

• It is a prototype of a self-supervised machine learning system
• It looks for classes of pages with similar infoboxes
• It determines common attributes
• It creates training examples

Page 11:

Infobox Generation

Page 12:

Preprocessor

Schema Refinement

• Free edit -> schema drift
› Duplicate templates: U.S. County (1428), US County (574), Counties (50), County (19)
› Low usage of attributes
› Duplicate attributes: “Census Yr”, “Census Estimate Yr”, “Census Est.”, “Census Year”
• Kylin: strict name match; 15% occurrences
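
A rough sketch of what strict name matching combined with a usage cutoff could look like. Reading the "15% occurrences" figure as a minimum usage threshold is an assumption, as are the function name `frequent_attributes`, the `min_fraction` parameter and the representation of an infobox as a Python dict.

```python
from collections import Counter


def frequent_attributes(infoboxes, min_fraction=0.15):
    """Return attribute names that appear in at least min_fraction of the infoboxes.

    Names are compared by strict string equality, so "Census Yr" and
    "Census Year" count as two different attributes.
    """
    counts = Counter()
    for box in infoboxes:
        counts.update(set(box))  # count each attribute once per infobox
    cutoff = min_fraction * len(infoboxes)
    return {name for name, n in counts.items() if n >= cutoff}


# Ten toy infoboxes: "seat" is used everywhere, "Census Yr" only once.
boxes = [{"seat": "Clearfield", "Census Yr": "2005"}] + [{"seat": "x"}] * 9
print(frequent_attributes(boxes))
```

Because matching is strict, the four spellings of the census-year attribute listed above would each be counted separately and could each fall under the cutoff, which is the schema-drift problem the slide describes.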

Page 13:

Preprocessor

• Training Dataset Construction

Its county seat is Clearfield.

As of 2005, the population density was 28.2/km².

Clearfield County was created on 1804 from parts of Huntingdon and Lycoming Counties but was administered as part of Centre County until 1812.

2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.
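
Training dataset construction can be sketched as matching infobox values against article sentences. This is a simplified illustration under stated assumptions: the verbatim substring match, the naive sentence splitter, the attribute names `seat` and `density`, and the function names are all hypothetical, and a real preprocessor would need more robust matching heuristics.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def label_sentences(article_text, infobox):
    """Pair each sentence with the attributes whose value it contains verbatim."""
    labeled = []
    for sentence in split_sentences(article_text):
        attrs = sorted(a for a, v in infobox.items() if v in sentence)
        labeled.append((sentence, attrs))
    return labeled


article = ("Its county seat is Clearfield. "
           "As of 2005, the population density was 28.2/km².")
infobox = {"seat": "Clearfield", "density": "28.2/km²"}
for sentence, attrs in label_sentences(article, infobox):
    print(attrs, sentence)
```

Sentences with a non-empty label list become positive examples for those attributes; the rest are candidate negatives. Note that a plain substring match would also tag the "Clearfield County was created …" sentence with `seat`, which is one source of the noise the later stages have to tolerate.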

Page 14:

Classifier

Document Classifier

• List and Category
• Fast
• Precision (98.5%), Recall (68.8%)

Sentence Classifier

• Predicts which attribute values are contained in a given sentence.
• Uses a maximum entropy model.
• To reduce the effect of the noisy and incomplete training dataset, Kylin applies bagging.

Page 15:

CRF Extractor

Conditional Random Fields Model

• Attribute value extraction: sequential data labeling
• CRF model for each attribute independently
• Relabel – filter false negative training examples

2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.

Preprocessor: Water_area

Classifier: Water_area; Land_area

Though Kylin is successful on popular classes, its performance decreases on sparse classes where there is insufficient training data.
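
The relabeling step can be pictured as a filter over candidate negative examples. The sketch below is one plausible reading of the slide, not Kylin's code: the triple layout `(text, preprocessor_labels, classifier_labels)` and the function name `negatives_for` are assumptions.

```python
def negatives_for(attribute, sentences):
    """Select sentences that are safe to use as negative examples for `attribute`.

    Each item is (text, preprocessor_labels, classifier_labels). A sentence is
    kept as a negative only if neither source says it contains the attribute.
    """
    return [
        text
        for text, pre, clf in sentences
        if attribute not in pre and attribute not in clf
    ]


sentences = [
    ("2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.",
     {"Water_area"}, {"Water_area", "Land_area"}),
    ("Its county seat is Clearfield.", set(), set()),
]
print(negatives_for("Land_area", sentences))
```

In the slide's example the preprocessor tagged the sentence only with Water_area, so it would otherwise serve as a (false) negative for Land_area; the classifier's Land_area prediction removes it from that negative set.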

Page 17:

Long-Tail 1: Sparse Infobox Class

Kylin Performs Well on Popular Classes:

Precision: mid 70% ~ high 90%

Recall: low 50% ~ mid 90%

Kylin Flounders on Sparse Classes – Little Training Data

E.g., for the “US County” class Kylin has 97.3% precision and 95.9% recall, while many other classes, such as “Irish Newspaper”, contain only a small number of articles with infoboxes.

Page 18:

Long-Tail 2: Incomplete Articles

Desired Information Missing from Wikipedia

Among 1.8 million pages [Wikipedia, July 2007], many are short articles, and almost 800,000 (44.2%) are marked as stub pages, indicating that much-needed information is missing.

Page 20:

Shrinkage

Attempt to improve Kylin’s performance using shrinkage.

We use shrinkage when training an extractor for an instance-sparse infobox class, aggregating training data from its parent and children classes.

Page 21:

Shrinkage [McCallum et al., ICML98]

Diagram: class hierarchy with instance counts and birthplace-style attributes

person (1201): .birth_place

performer (44): .location

actor (8738), comedian (106): .birthplace, .birth_place, .cityofbirth, .origin
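
A minimal sketch of the data-pooling idea behind shrinkage, under stated assumptions: the attribute mapping between classes is taken as given, the single constant `related_weight` for all related classes is a placeholder rather than the weighting used in the work, and the function name and data layout are hypothetical.

```python
def shrinkage_training_set(target, examples, mapping, related_weight=0.5):
    """Pool training sentences for `target`, a (class, attribute) pair.

    `examples` maps (class, attribute) -> list of training sentences.
    `mapping` maps the target to equivalent (class, attribute) pairs in
    parent and child classes. Returns a list of (sentence, weight) pairs.
    """
    pooled = [(s, 1.0) for s in examples.get(target, [])]
    for related in mapping.get(target, []):
        pooled.extend((s, related_weight) for s in examples.get(related, []))
    return pooled


examples = {
    ("performer", "location"): ["sentence from a performer article"],
    ("person", "birth_place"): ["sentence from a person article"],
    ("actor", "birthplace"): ["sentence from an actor article"],
}
mapping = {
    ("performer", "location"): [("person", "birth_place"), ("actor", "birthplace")],
}
for sentence, weight in shrinkage_training_set(("performer", "location"), examples, mapping):
    print(weight, sentence)
```

With only 44 performer infoboxes against 8738 for actor, the pooled set is dominated by the related classes, which is why the examples from those classes are down-weighted here rather than treated as equal to the target's own data.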

Page 22:

Shrinkage

KOG (Kylin Ontology Generator) [Wu & Weld, WWW08]

Diagram: the same class hierarchy (person, performer, actor, comedian) with its birthplace-style attributes (.birth_place, .location, .birthplace, .cityofbirth, .origin)

Page 24:

Retraining

Complementary to Shrinkage: harvest extra training data from the broader Web

Key: how to identify relevant sentences given the sea of Web data?

Andrew Murray was born in Scotland in 1828 ……

<Andrew Murray, was born in, Scotland>

<Andrew Murray, was born in, 1828>

Page 25:

Retraining

Query TextRunner for relevant sentences:

Kylin Extraction:

t = <Ada Cambridge, location, “St Germans , Norfolk , England”>

TextRunner Extraction:

• r1 = <Ada Cambridge, was born in, England>
Ada Cambridge was born in England in 1844 and moved to Australia with her curate husband in 1870.

• r2 = <Ada Cambridge, was born in, “Norfolk , England”>
Ada Cambridge was born in Norfolk , England , in 1844 .
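
One way to picture the matching between a Kylin tuple and TextRunner results is sketched below. The matching rule (same subject, and the TextRunner object occurring inside the Kylin attribute value) is an assumption made for illustration, as are the function name and the data layout; no real TextRunner API is used.

```python
def matching_sentences(kylin_tuple, textrunner_results):
    """Return sentences whose extraction agrees with a Kylin tuple.

    kylin_tuple is (subject, attribute, value). Each TextRunner result is
    ((arg1, predicate, arg2), sentence). A result matches when arg1 equals
    the subject and arg2 occurs inside the attribute value.
    """
    def norm(s):
        # Lowercase and collapse whitespace so spacing differences do not matter.
        return " ".join(s.lower().split())

    subject, _attribute, value = kylin_tuple
    return [
        sentence
        for (arg1, _pred, arg2), sentence in textrunner_results
        if norm(arg1) == norm(subject) and norm(arg2) in norm(value)
    ]


t = ("Ada Cambridge", "location", "St Germans , Norfolk , England")
results = [
    (("Ada Cambridge", "was born in", "England"),
     "Ada Cambridge was born in England in 1844 and moved to Australia "
     "with her curate husband in 1870."),
    (("Ada Cambridge", "was born in", "Norfolk , England"),
     "Ada Cambridge was born in Norfolk , England , in 1844 ."),
]
for sentence in matching_sentences(t, results):
    print(sentence)
```

Both r1 and r2 match t here, so both sentences could be added as extra positive training examples for the location attribute.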

Page 26:

Effect of Shrinkage & Retraining

Page 27:

Effect of Shrinkage & Retraining

1755% improvement for a sparse class

13.7% improvement for a popular class

Page 29:

Extraction from the Web

Idea: apply Kylin extractors trained on Wikipedia to general Web pages

Challenge: maintain high precision
› General Web pages are noisy
› Many Web pages describe multiple objects

Key: retrieve relevant sentences

Procedure
› Generate a set of search engine queries
› Retrieve top-k pages from Google
› Weight extractions from these pages

Page 30:

Choosing Queries

Example: get the birth date attribute for the article titled “Andrew Murray (minister)”

“andrew murray”
“andrew murray” birth date (attribute name)
“andrew murray” was born in (predicate from TextRunner)
“andrew murray” … (predicates from TextRunner)
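
The query set in this example can be generated mechanically. The sketch below is an illustration under assumptions: the function name `build_queries`, the rule for stripping the parenthetical disambiguator and the lowercasing are inferred from the example, not taken from the system.

```python
import re


def build_queries(article_title, attribute_name, predicates):
    """Build search queries for one attribute of one article.

    The trailing parenthetical disambiguator in the title is dropped, and the
    quoted title is combined with the attribute name and with each predicate.
    """
    title = re.sub(r"\s*\([^)]*\)\s*$", "", article_title).lower()
    quoted = f'"{title}"'
    queries = [quoted, f"{quoted} {attribute_name}"]
    queries.extend(f"{quoted} {p}" for p in predicates)
    return queries


for q in build_queries("Andrew Murray (minister)", "birth date", ["was born in"]):
    print(q)
```

Further predicates associated with the attribute would simply extend the `predicates` list, one query per predicate.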

Page 31:

Weighting Extractions

Which extractions are more relevant?

Features
› Number of sentences between the sentence and the closest occurrence of the title (‘andrew murray’)
› Rank of the page in Google’s result list
› Kylin’s extractor confidence
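
A weighted combination of these three features might look like the following sketch. The linear form, the feature normalizations and the weight values are all assumptions chosen for illustration; the slides do not state how the combination was actually computed.

```python
def extraction_score(sentence_distance, result_rank, extractor_confidence,
                     weights=(0.3, 0.2, 0.5)):
    """Combine the three features into one score; higher is better.

    sentence_distance: sentences to the closest mention of the title (0 = same sentence).
    result_rank: 1-based position of the page in the search results.
    extractor_confidence: extractor confidence in [0, 1].
    """
    w_dist, w_rank, w_conf = weights
    closeness = 1.0 / (1.0 + sentence_distance)  # 1.0 when the title is in the sentence
    rank_score = 1.0 / result_rank               # 1.0 for the top result
    return w_dist * closeness + w_rank * rank_score + w_conf * extractor_confidence


# Same extractor confidence, different context: the nearby, top-ranked one wins.
print(extraction_score(0, 1, 0.9))
print(extraction_score(5, 10, 0.9))
```

The point of combining features is that extractor confidence cannot tell whether a sentence on a noisy page is about the right entity, while distance to the title and result rank carry exactly that signal.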

Page 32:

Web Extraction Experiment

• Extractor confidence alone performs poorly
• Weighted combination is the best

Page 33:

Combining Wikipedia & Web

Recall Benefit from Shrinkage / Retraining…

Page 34:

Combining Wikipedia & Web

Benefit from Shrinkage + Retraining + Web

Page 36:

Problem

Information Extraction is Imprecise
› Wikipedians Don’t Want 90% Precision

How to Improve Precision?
› People!

Page 38:

Intelligence in Wikipedia

What is IWP?
› A project/system that aims to combine
• IE (Information Extraction)
• CCC (Communal Content Creation)

Page 39:

Information Extraction

Examples:
› Zoominfo.com
› Flipdog.com
› Citeseer
› Google

Advantage: Autonomy

Disadvantage: Expensive

Page 40:

IE system contributors

Contributors in this room?
› Wikipedia

IE systems
› Citeseer
› Rexa
› DBLife

Page 41:

Communal Content Creation

Examples
› Wikipedia
› eBay
› Netflix

› Advantage: more accurate than IE
› Disadvantage: bootstrapping, incentives, and management

Page 43:

Virtuous Cycle

Page 44:

Contributing as a Non-Primary Task

Encourage contributions without annoying or abusing readers
› Compared 5 different interfaces

Page 45:

Page 46:

Results

• Contribution rate: 1.6% → 13%

• 90% of positive labels were correct

Page 48:

IWP and Shrinkage, Retraining, and Extracting from the Web

Shrinkage – improves IWP’s precision and recall

Retraining – improves the robustness of IWP’s extractors

Extraction from the Web – further helps IWP’s performance

Page 49:

Multi-Lingual Extraction

Idea: further leverage the virtuous feedback cycle
› Utilize IE methods to add or update missing information by copying from one language to another
› Utilize CCC to validate and improve updates

Example
› Nombre = “Jerry Seinfeld” and Name = “Jerry Seinfeld”
› Cónyuge = “Jessica Sklar” and Spouse = “Jessica Seinfeld”

Page 50:

Summary

Kylin’s initial performance is unacceptable

Methods for increasing recall
› Shrinkage
› Retraining
› Extraction from the Web

Page 51:

Summary

IWP – developing AI methods to facilitate the growth, operation and use of Wikipedia

Initial goal – extraction of a giant knowledge base of semantic triples
› Faceted browsing
› Input to a reasoning-based question-answering system

How
› IE
› CCC

Page 52:

Questions