

WebSets: Extracting Sets of Entities from the Web Using Unsupervised Information Extraction

Bhavana Dalvi, William W. Cohen and Jamie Callan

Language Technologies Institute, Carnegie Mellon University

Panels: Motivation, Experiments, WebSets Framework, Application

Acknowledgements

This work is supported by Google and the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA8650-10-C-7058.

WebSets Framework (diagram): HTML Table Corpus → Relational Table Identification → Entity-feature file <Entities, table-columns, domains> → Bottom-up Entity Clustering → Entity Clusters → Hypernym Recommendation (with the Hyponym Concept Dataset as a second input) → Labeled entity sets <Entities, hypernym>

Intelligence Domain

Religions: Buddhism, Christianity, Islam, Sikhism, Taoism, Zoroastrianism, Jainism, Bahai, Judaism, Hinduism, Confucianism, …

Government: Monarchy, Limited Democracy, Islamic Republic, Parliamentary Self Governing Territory, Parliamentary Republic, Constitutional Republic, Republic Presidential Multiparty System, …

International Organizations: United Nations Children Fund UNICEF, Southeast European Cooperative Initiative SECI, World Trade Organization WTO, Indian Ocean Commission INOC, Economic and Social Council ECOSOC, Caribbean Community and Common Market CARICOM, …

Languages: Hebrew, Portuguese, Danish, Brazilian, Surinamese, Burkinabe, Barbadian, Cuban, …

Music Domain

Instruments: Flute, Tuba, String Orchestra, Chimes, Harmonium, Bassoon, Woodwinds, Glockenspiel, French horn, Timpani, Piano, …

Intervals: Whole tone, Major sixth, Fifth, Perfect fifth, Seventh, Third, Diminished fifth, Whole step, …

Genres: Smooth jazz, Gothic, Metal rock, Rock, Pop, Hip hop, Rock n roll, Country, Folk, Punk rock, …

Audio Equipment: Audio editor, General midi synthesizer, Audio recorder, Multichannel digital audio workstation, Drum sequencer, Mixers, Music engraving system, Audio server, Mastering software, Soundfont sample player, …

Motivation

Many NLP tasks benefit from concept-instance pairs: summarization, co-reference resolution, named entity extraction.

Existing knowledge bases (NELL, Freebase, …) are incomplete.

The problem can be divided into two parts:

  Detecting coordinate terms to find term clusters (i ~ j)

  Using hyponym patterns (“X such as Y”) to name the terms

We worked on the problem of automatically harvesting concept-instance pairs from a corpus of HTML tables.

Hypothesis 1: Entities appearing in a table column probably belong to the same concept.

Hypothesis 2: Frequent co-occurrence of a set of entities in multiple table columns and distinct web domains indicates that they represent some meaningful concept.

Conclusions

We propose an unsupervised IE technique to extract concept-instance pairs from an HTML corpus. It is novel in that it relies solely on HTML tables to detect coordinate terms.

Our triplet-based data representation helps in disambiguating multiple senses of the same noun phrase.

The WebSets approach is corpus-driven, efficient and scalable. We presented a method that takes O(N log N) time to process HTML tables of size O(N) and extract named entity sets from them.

Labeled entity sets produced by WebSets can act as a summary of an HTML corpus.

Class-instance pairs thus produced are also being used to populate an existing knowledge base (NELL).

A future research direction is to extend this method to unsupervised relation extraction.

TableId=21, domain=“wikipedia.org”

Country | Capital City
India | Delhi
China | Beijing
Canada | Ottawa
France | Paris

TableId=34, domain=“aneki.com”

Country | Capital City
China | Beijing
Canada | Ottawa
France | Paris
England | London

Triplet records built from the two tables:

Entities | Table:Column | Domains
China, Canada, India | 21:1 | Wikipedia.org
Canada, China, France | 21:1, 34:1 | Wikipedia.org, aneki.com
Beijing, Delhi, Ottawa | 21:2 | Wikipedia.org
Beijing, Ottawa, Paris | 21:2, 34:2 | Wikipedia.org, aneki.com
Canada, England, France | 34:1 | aneki.com
London, Ottawa, Paris | 34:2 | aneki.com
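The triplet records above can be reproduced with a short sketch. It assumes that a triplet is formed from every three consecutive cells of a column and stored in sorted order, which is inferred from the example rather than stated on the poster. The names `build_triplet_store` and the dictionary layout of `tables` are hypothetical.

```python
from collections import defaultdict


def build_triplet_store(tables):
    """Map each entity triplet to the table columns and domains it occurs in."""
    store = defaultdict(lambda: {"table_columns": set(), "domains": set()})
    for table in tables:
        for col_idx, column in enumerate(table["columns"], start=1):
            # Assumption: one triplet per window of three consecutive cells.
            for i in range(len(column) - 2):
                key = tuple(sorted(column[i:i + 3]))
                store[key]["table_columns"].add(f"{table['id']}:{col_idx}")
                store[key]["domains"].add(table["domain"])
    return dict(store)


# The two example tables from the poster.
tables = [
    {"id": 21, "domain": "wikipedia.org",
     "columns": [["India", "China", "Canada", "France"],
                 ["Delhi", "Beijing", "Ottawa", "Paris"]]},
    {"id": 34, "domain": "aneki.com",
     "columns": [["China", "Canada", "France", "England"],
                 ["Beijing", "Ottawa", "Paris", "London"]]},
]
store = build_triplet_store(tables)
```

Each column of n cells yields n - 2 triplets, so the store stays linear in the size of the table corpus.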

Labeled entity sets:

Hypernym | Entities | Table:Column | Domains
Country | India, China, Canada, France, England | 21:1, 34:1 | Wikipedia.org, aneki.com
City, Destinations | Delhi, Beijing, Ottawa, London, Paris | 21:2, 34:2 | Wikipedia.org, aneki.com

Datasets

Table Identification

  Features: #rows, #non-link columns, HTML tags, length(cells), recursive or not

  % relational tables: 15-30% to 70-85%

Entity vs. Triplet record representation

  O(N) triplet records created for tables of size O(N)

  Can disambiguate different senses of entities: Toy_Apple dataset

Bottom-up clustering

  Number of clusters is unknown

  Gold standard #clusters: Toy_Apple (27) and Delicious_Sports (29)

Hypernym Recommendation

  Score(hypernym | cluster): co-occurrence counts of hypernym with entities in the cluster
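A minimal sketch of the hypernym scoring described above. The poster only says the score is based on co-occurrence counts, so summing the counts over the cluster's entities is an assumption, as are the function name `recommend_hypernyms` and the lower-casing of concept names. The counts are the ones shown in the poster's Hyponym Concept example.

```python
from collections import Counter


def recommend_hypernyms(cluster_entities, hyponym_concepts):
    """Rank candidate hypernyms for a cluster by summed co-occurrence count."""
    scores = Counter()
    for entity in cluster_entities:
        for concept, count in hyponym_concepts.get(entity, {}).items():
            # Assumption: concept names are compared case-insensitively.
            scores[concept.lower()] += count
    return scores.most_common()


# Counts taken from the poster's Hyponym Concept example.
hyponym_concepts = {
    "USA": {"Country": 1000},
    "Paris": {"City": 450, "destination": 100},
    "Monkey": {"Animal": 100, "mammal": 30},
    "Sparrow": {"Bird": 40},
}
city_cluster = {"Delhi", "Beijing", "Ottawa", "London", "Paris"}
ranked = recommend_hypernyms(city_cluster, hyponym_concepts)
```

For the city cluster this ranks "city" ahead of "destination", in line with the "City, Destinations" label in the example.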

Dataset | Method | K | Purity | NMI | RI | FM
Toy_Apple | K-Means | 40 | 0.96 | 0.71 | 0.98 | 0.41
Toy_Apple | WebSets | 25 | 0.99 | 0.99 | 1.00 | 0.99
Delicious_Sports | K-Means | 50 | 0.72 | 0.68 | 0.98 | 0.47
Delicious_Sports | WebSets | 32 | 0.83 | 0.64 | 1.00 | 0.85
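For readers unfamiliar with the columns, here is a small sketch of two of the reported measures, purity and the Rand index (RI), using their standard definitions. The poster does not show how the scores were computed, so the function names and the toy labels are illustrative only.

```python
from collections import Counter
from itertools import combinations


def purity(pred, gold):
    """Fraction of items that carry the majority gold label of their cluster."""
    by_cluster = {}
    for p, g in zip(pred, gold):
        by_cluster.setdefault(p, Counter())[g] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(pred)


def rand_index(pred, gold):
    """Fraction of item pairs on which the clustering and gold labels agree."""
    pairs = list(combinations(range(len(pred)), 2))
    agree = sum((pred[i] == pred[j]) == (gold[i] == gold[j]) for i, j in pairs)
    return agree / len(pairs)


# Toy example: four items, two predicted clusters, two gold classes.
pred = [0, 0, 1, 1]
gold = ["a", "a", "a", "b"]
```

Purity rewards many small clusters, which is one reason the table reports several measures side by side.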

Method | K | FM w/ Entity records | FM w/ Triplet records
WebSets | | 0.11 (K=25) | 0.85 (K=34)
K-Means | 30 | 0.09 | 0.35
K-Means | 25 | 0.08 | 0.38

Method | K | J | %Accuracy | Yield (#pairs produced) | #Correct pairs (predicted)
DPM | Inf | 0.0 | 34.6 | 88.6K | 30.7K
DPM | 5 | 0.2 | 50.0 | 0.8K | 0.4K
DPMExt | Inf | 0.0 | 21.9 | 100,828.0K | 22,081.3K
DPMExt | 5 | 0.2 | 44.0 | 2.8K | 1.2K
WS | - | - | 67.7 | 73.7K | 45.8K
WSExt | - | - | 78.8 | 64.8K | 51.1K

Evaluation of quality of entity sets produced:

Dataset | #Triplets | #Clusters | #Clusters with hypernyms | %Meaningful clusters | MRR of hypernym | %Precision of labeled sets
CSEAL_Useful | 165.2K | 1090 | 312 | 69.0 | 0.56 | 98.6%
ASIA_NELL | 11.4K | 448 | 266 | 73.0 | 0.59 | 98.5%
ASIA_INT | 15.1K | 395 | 218 | 63.0 | 0.58 | 97.4%
Clueweb_HPR | 516.0 | 47 | 34 | 70.5 | 0.56 | 99.0%
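The "MRR of hypernym" column is a mean reciprocal rank. The sketch below uses the standard definition: for each cluster, take 1/rank of the first correct hypernym in the ranked list, then average. How correctness was judged is not stated on the poster, so the `correct` mapping and the toy data are hypothetical.

```python
def mean_reciprocal_rank(ranked_lists, correct):
    """Average of 1/rank of the first correct hypernym per cluster."""
    if not ranked_lists:
        return 0.0
    total = 0.0
    for cluster_id, ranked in ranked_lists.items():
        for rank, hypernym in enumerate(ranked, start=1):
            if hypernym in correct.get(cluster_id, set()):
                total += 1.0 / rank
                break  # only the first correct hypernym counts
    return total / len(ranked_lists)


# Toy data: cluster c1 has its correct label at rank 1, c2 at rank 2.
ranked_lists = {"c1": ["city", "destination"], "c2": ["place", "country"]}
correct = {"c1": {"city"}, "c2": {"country"}}
mrr = mean_reciprocal_rank(ranked_lists, correct)
```

A cluster with no correct hypernym in its list contributes zero to the average.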

Hyponym Concept Dataset

Corpus Summary :

Hearst patterns, e.g. “X such as Y”:

  arg1 such as (w+ (and/or))? arg2

  arg1 (w+ )? (and/or) other arg2

  arg1 include (w+ (and/or))? arg2

  arg1 including (w+ (and/or))? arg2

ClueWeb09 dataset: 500M page sample of the Web

Noun-pair context dataset, e.g. “Obama is president of USA” gives (president of, Obama, USA)
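The four patterns above can be approximated with regular expressions. This is a simplified sketch: it assumes `w+` stands for `\w+`, restricts each argument to a single word, and does no plural stripping or noun-phrase chunking. The name `extract_pairs` is hypothetical. For "such as", "include" and "including" the hypernym comes first; for "and/or other" the hyponym comes first.

```python
import re

# Optional "<word> and/or " between the pattern keyword and arg2.
_OPT = r"(?:\w+ (?:and|or) )?"

# Patterns where arg1 is the hypernym and arg2 the hyponym.
HYPERNYM_FIRST = [
    re.compile(rf"(\w+) such as {_OPT}(\w+)"),
    re.compile(rf"(\w+) include {_OPT}(\w+)"),
    re.compile(rf"(\w+) including {_OPT}(\w+)"),
]
# Pattern where arg1 is the hyponym and arg2 the hypernym.
HYPONYM_FIRST = [re.compile(r"(\w+) (?:\w+ )?(?:and|or) other (\w+)")]


def extract_pairs(sentence):
    """Return (hyponym, hypernym) pairs matched by the simplified patterns."""
    pairs = []
    for pattern in HYPERNYM_FIRST:
        pairs += [(m.group(2), m.group(1)) for m in pattern.finditer(sentence)]
    for pattern in HYPONYM_FIRST:
        pairs += [(m.group(1), m.group(2)) for m in pattern.finditer(sentence)]
    return pairs
```

Counting how often each (hyponym, hypernym) pair is extracted over a large corpus would give Concept:count entries of the kind shown in the Hyponym Concept table.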

Dataset | Description | #HTML pages | #tables
Toy_Apple | Fruits + companies | 574 | 2.6K
Delicious_Sports | Links from Delicious w/ tag=sports | 21K | 146.3K
Delicious_Music | Links from Delicious w/ tag=music | 183K | 643.3K
CSEAL_Useful | Pages SEAL found NELL entities on | 30K | 322.8K
ASIA_NELL | ASIA run on NELL categories | 112K | 676.9K
ASIA_INT | ASIA run on intelligence domain | 121K | 621.3K
Clueweb_HPR | High pagerank sample of Clueweb | 100K | 586.9K

Hyponym | Concept:count
USA | Country:1000
Paris | City:450, destination:100
Monkey | Animal:100, mammal:30
Sparrow | Bird:40

In a noun-pair context record, X and Y are treated as (hyponym, hypernym) when the context is a Hearst pattern.

Bottom-Up Clustering Algorithm

Record/cluster: <entity+, tableColumn+, domain+>

Clusters = { }

Go through each triplet record t such that |t.domains| > threshold

    For each existing cluster C, check if t.entity overlaps with C.entity OR t.tableColumn overlaps with C.tableColumn

        If sufficient overlap, add t to C

    If no existing cluster C matches t

        Create new cluster C’ = t

        Add C’ to Clusters

Time complexity: O(N * log N)

Table corpus: O(N)

Triplet store: O(N)
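The clustering steps above can be sketched as follows. Several details are assumptions, since the poster leaves them open: "sufficient overlap" is taken to mean at least two shared entities or at least one shared table column, a record joins the first matching cluster, and records seen on more domains are visited first. The linear scan over clusters keeps the sketch short and does not reproduce the stated O(N * log N) bound.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    entities: set = field(default_factory=set)
    table_columns: set = field(default_factory=set)
    domains: set = field(default_factory=set)

    def add(self, record):
        entities, table_columns, domains = record
        self.entities |= entities
        self.table_columns |= table_columns
        self.domains |= domains


def matches(record, cluster, min_entities=2, min_columns=1):
    # Assumption: these overlap thresholds define "sufficient overlap".
    entities, table_columns, _ = record
    return (len(entities & cluster.entities) >= min_entities
            or len(table_columns & cluster.table_columns) >= min_columns)


def bottom_up_cluster(records, domain_threshold=0):
    clusters = []
    # Assumption: process records that occur on more domains first.
    for record in sorted(records, key=lambda r: len(r[2]), reverse=True):
        if len(record[2]) <= domain_threshold:
            continue  # keep only records with |t.domains| > threshold
        target = next((c for c in clusters if matches(record, c)), None)
        if target is None:
            target = Cluster()
            clusters.append(target)
        target.add(record)
    return clusters


# Triplet records from the poster's example: (entities, table columns, domains).
records = [
    ({"China", "Canada", "India"}, {"21:1"}, {"wikipedia.org"}),
    ({"Canada", "China", "France"}, {"21:1", "34:1"}, {"wikipedia.org", "aneki.com"}),
    ({"Beijing", "Delhi", "Ottawa"}, {"21:2"}, {"wikipedia.org"}),
    ({"Beijing", "Ottawa", "Paris"}, {"21:2", "34:2"}, {"wikipedia.org", "aneki.com"}),
    ({"Canada", "England", "France"}, {"34:1"}, {"aneki.com"}),
    ({"London", "Ottawa", "Paris"}, {"34:2"}, {"aneki.com"}),
]
clusters = bottom_up_cluster(records)
```

On the example records this yields one country cluster and one city cluster, matching the two labeled entity sets shown earlier.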