WebSets: Extracting Sets of Entities from the Web Using Unsupervised Information Extraction
Bhavana Dalvi, William W. Cohen and Jamie Callan
Language Technologies Institute, Carnegie Mellon University
Motivation
Experiments
WebSets Framework Application
Acknowledgements: This work is supported by Google and the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA8650-10-C-7058.
[Figure: WebSets framework. Components shown: HTML Table Corpus; Relational Table Identification; Entity-feature file <Entities, table-columns, domains>; Bottom-up Entity Clustering; Entity Clusters; Hyponym Concept Dataset; Hypernym Recommendation; Labeled entity sets <Entities, hypernym>]
Conclusions
Intelligence Domain
Religions: Buddhism, Christianity, Islam, Sikhism, Taoism, Zoroastrianism, Jainism, Bahai, Judaism, Hinduism, Confucianism, …
Government: Monarchy, Limited Democracy, Islamic Republic, Parliamentary Self Governing Territory, Parliamentary Republic, Constitutional Republic, Republic Presidential Multiparty System, …
International Organizations: United Nations Children Fund UNICEF, Southeast European Cooperative Initiative SECI, World Trade Organization WTO, Indian Ocean Commission INOC, Economic and Social Council ECOSOC, Caribbean Community and Common Market CARICOM, …
Languages: Hebrew, Portuguese, Danish, Brazilian, Surinamese, Burkinabe, Barbadian, Cuban, …
Music Domain
Instruments: Flute, Tuba, String Orchestra, Chimes, Harmonium, Bassoon, Woodwinds, Glockenspiel, French horn, Timpani, Piano, …
Intervals: Whole tone, Major sixth, Fifth, Perfect fifth, Seventh, Third, Diminished fifth, Whole step, …
Genres: Smooth jazz, Gothic, Metal rock, Rock, Pop, Hip hop, Rock n roll, Country, Folk, Punk rock, …
Audio Equipments: Audio editor, General midi synthesizer, Audio recorder, Multichannel digital audio workstation, Drum sequencer, Mixers, Music engraving system, Audio server, Mastering software, Soundfont sample player, …
Many NLP tasks benefit from concept-instance pairs:
  Summarization, co-reference resolution, named entity extraction
Existing knowledge bases (NELL, Freebase, …) are incomplete.
The problem can be divided into:
  Detecting coordinate terms to find term clusters (i ~ j)
  Using hyponym patterns (“X such as Y”) to name the term clusters
We address the problem of automatically harvesting concept-instance pairs from a corpus of HTML tables.
Hypothesis 1: Entities appearing in the same table column probably belong to the same concept.
Hypothesis 2: Frequent co-occurrence of a set of entities in multiple table columns and distinct web domains indicates that they represent some meaningful concept.
We propose an unsupervised IE technique to extract concept-instance pairs from an HTML corpus. It is novel in that it relies solely on HTML tables to detect coordinate terms.
Our triplet-based data representation helps in disambiguating multiple senses of the same noun phrase.
The WebSets approach is corpus-driven, efficient and scalable. We presented a method that takes O(N log N) time to process HTML tables of total size O(N) and extract named entity sets from them.
Labeled entity sets produced by WebSets can act as a summary of an HTML corpus.
The class-instance pairs produced are also being used to populate an existing knowledge base (NELL).
A future research direction is to extend this method to unsupervised relation extraction.
TableId=21, domain=“wikipedia.org”
Country | Capital City
India | Delhi
China | Beijing
Canada | Ottawa
France | Paris

TableId=34, domain=“aneki.com”
Country | Capital City
China | Beijing
Canada | Ottawa
France | Paris
England | London
Entities | Table:Column | Domains
China, Canada, India | 21:1 | wikipedia.org
Canada, China, France | 21:1, 34:1 | wikipedia.org, aneki.com
Beijing, Delhi, Ottawa | 21:2 | wikipedia.org
Beijing, Ottawa, Paris | 21:2, 34:2 | wikipedia.org, aneki.com
Canada, England, France | 34:1 | aneki.com
London, Ottawa, Paris | 34:2 | aneki.com
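The triplet records above can be reproduced with a short sketch. It assumes, based only on this example, that a record is formed from every three consecutive cells of a column, that the three entities are stored in sorted order so identical sets collide, and that matching records accumulate their table:column ids and domains. The function name `build_triplet_store` and the input layout are hypothetical.

```python
from collections import defaultdict

def build_triplet_store(tables):
    """Build triplet records from (table_id, domain, rows) tuples.

    rows is a list of equal-length lists of cell strings (header excluded).
    Returns {sorted entity triple: {"columns": set, "domains": set}}.
    """
    store = defaultdict(lambda: {"columns": set(), "domains": set()})
    for table_id, domain, rows in tables:
        for col in range(len(rows[0])):
            cells = [row[col] for row in rows]
            # Assumption: one record per window of three consecutive rows.
            for i in range(len(cells) - 2):
                key = tuple(sorted(cells[i:i + 3]))
                store[key]["columns"].add(f"{table_id}:{col + 1}")
                store[key]["domains"].add(domain)
    return store

tables = [
    (21, "wikipedia.org", [["India", "Delhi"], ["China", "Beijing"],
                           ["Canada", "Ottawa"], ["France", "Paris"]]),
    (34, "aneki.com", [["China", "Beijing"], ["Canada", "Ottawa"],
                       ["France", "Paris"], ["England", "London"]]),
]
store = build_triplet_store(tables)
```

With a fixed window size, a column of n cells yields n - 2 records, which is consistent with the O(N) triplet store size stated on the poster.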
Hypernym | Entities | Table:Column | Domains
Country | India, China, Canada, France, England | 21:1, 34:1 | wikipedia.org, aneki.com
City, Destinations | Delhi, Beijing, Ottawa, London, Paris | 21:2, 34:2 | wikipedia.org, aneki.com
Datasets
Table Identification
  Features: #rows, #non-link columns, HTML tags, length(cells), recursive or not
  % relational tables: 15-30% to 70-85%
Entity vs. Triplet record representation
  O(N) triplet records created for tables of size O(N)
  Can disambiguate different senses of entities: Toy_Apple dataset
Bottom-up clustering
  Number of clusters is unknown
  Gold standard #clusters: Toy_Apple (27) and Delicious_Sports (29)
Hypernym Recommendation
  Score(hypernym | cluster): co-occurrence counts of hypernym with entities in the cluster
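One possible reading of the hypernym score is sketched below: for each candidate hypernym, sum its co-occurrence counts over the entities in the cluster, then rank by that sum. The summation is an assumption (the poster only says the score is based on co-occurrence counts), and the function name `recommend_hypernyms` is hypothetical. The sample counts are the ones shown in the Hyponym Concept Dataset example on this poster.

```python
from collections import Counter

# Hyponym -> {concept: count}, copied from the poster's example rows.
HYPONYM_CONCEPTS = {
    "USA": {"Country": 1000},
    "Paris": {"City": 450, "destination": 100},
    "Monkey": {"Animal": 100, "mammal": 30},
    "Sparrow": {"Bird": 40},
}

def recommend_hypernyms(cluster_entities, hyponym_concepts=HYPONYM_CONCEPTS):
    """Rank candidate hypernyms for a cluster by summed co-occurrence count."""
    scores = Counter()
    for entity in cluster_entities:
        # Entities missing from the hyponym dataset contribute nothing.
        for concept, count in hyponym_concepts.get(entity, {}).items():
            scores[concept] += count
    return scores.most_common()

ranking = recommend_hypernyms(["Delhi", "Beijing", "Paris"])
```

A cluster whose entities never appear in the hyponym dataset gets an empty ranking, which would correspond to a cluster without a hypernym.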
Dataset | Method | K | Purity | NMI | RI | FM
Toy_Apple | K-Means | 40 | 0.96 | 0.71 | 0.98 | 0.41
Toy_Apple | WebSets | 25 | 0.99 | 0.99 | 1.00 | 0.99
Delicious_Sports | K-Means | 50 | 0.72 | 0.68 | 0.98 | 0.47
Delicious_Sports | WebSets | 32 | 0.83 | 0.64 | 1.00 | 0.85
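For readers unfamiliar with the Purity column, the following sketch computes it under the standard definition: each cluster is credited with its most frequent gold label, and the credited items are divided by the total number of items. The poster does not spell out its exact evaluation code, so this illustrates the metric only; the function name and the toy labels are hypothetical.

```python
from collections import Counter

def purity(cluster_ids, gold_labels):
    """Fraction of items that carry the majority gold label of their cluster."""
    by_cluster = {}
    for cluster_id, label in zip(cluster_ids, gold_labels):
        by_cluster.setdefault(cluster_id, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(gold_labels)

# Toy example: cluster 0 mixes two senses, cluster 1 is clean.
score = purity([0, 0, 0, 1, 1],
               ["fruit", "fruit", "company", "company", "company"])
```

Purity rises as clusters get smaller, so it is usually read alongside the other columns (NMI, RI, FM) rather than on its own.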
Method | K | FM w/ Entity records | FM w/ Triplet records
WebSets | | 0.11 (K=25) | 0.85 (K=34)
K-Means | 30 | 0.09 | 0.35
K-Means | 25 | 0.08 | 0.38
Method | K | J | %Accuracy | Yield (#pairs produced) | #Correct pairs (predicted)
DPM | Inf | 0.0 | 34.6 | 88.6K | 30.7K
DPM | 5 | 0.2 | 50.0 | 0.8K | 0.4K
DPMExt | Inf | 0.0 | 21.9 | 100,828.0K | 22,081.3K
DPMExt | 5 | 0.2 | 44.0 | 2.8K | 1.2K
WS | - | - | 67.7 | 73.7K | 45.8K
WSExt | - | - | 78.8 | 64.8K | 51.1K
Dataset | #Triplets | #Clusters | #Clusters with hypernyms | % Meaningful clusters | MRR of hypernym | % Precision of labeled sets
CSEAL_Useful | 165.2K | 1090 | 312 | 69.0 | 0.56 | 98.6%
ASIA_NELL | 11.4K | 448 | 266 | 73.0 | 0.59 | 98.5%
ASIA_INT | 15.1K | 395 | 218 | 63.0 | 0.58 | 97.4%
Clueweb_HPR | 516.0 | 47 | 34 | 70.5 | 0.56 | 99.0%
Evaluation of quality of entity sets produced
Hyponym Concept Dataset
Corpus Summary :
Hearst patterns e.g. “X such as Y”
  arg1 such as (w+ (and/or))? arg2
  arg1 (w+ )? (and/or) other arg2
  arg1 include (w+ (and/or))? arg2
  arg1 including (w+ (and/or))? arg2
ClueWeb09 dataset : 500M page sample of the Web
Noun-pair context dataset, e.g. “Obama is president of USA” gives (president of, Obama, USA)
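The Hearst patterns listed above can be approximated with regular expressions. This is a simplified sketch, not the extraction pipeline used for the poster: it assumes single-word arguments (real noun phrases would need chunking), reads `w+` as a single word, and omits the optional intervening word of the "and/or other" pattern. The names `PATTERNS` and `extract_pairs` are hypothetical.

```python
import re
from collections import Counter

# Each entry: (regex, group index of the hypernym, group index of the hyponym).
PATTERNS = [
    (re.compile(r"(\w+) such as (?:\w+ (?:and|or) )?(\w+)"), 1, 2),
    (re.compile(r"(\w+) (?:and|or) other (\w+)"), 2, 1),
    (re.compile(r"(\w+) includ(?:e|ing) (?:\w+ (?:and|or) )?(\w+)"), 1, 2),
]

def extract_pairs(sentences):
    """Count (hyponym, hypernym) pairs matched by the patterns."""
    counts = Counter()
    for sentence in sentences:
        for regex, hyper_group, hypo_group in PATTERNS:
            for match in regex.finditer(sentence):
                counts[(match.group(hypo_group), match.group(hyper_group))] += 1
    return counts

pairs = extract_pairs([
    "cities such as Paris",
    "Paris and other cities",
    "animals including monkeys",
])
```

Aggregating these counts per hyponym gives a table in the same shape as the Hyponym / Concept:count example on this poster.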
Dataset | Description | #HTML pages | #tables
Toy_Apple | Fruits + companies | 574 | 2.6K
Delicious_Sports | Links from Delicious w/ tag=sports | 21K | 146.3K
Delicious_Music | Links from Delicious w/ tag=music | 183K | 643.3K
CSEAL_Useful | Pages SEAL found NELL entities on | 30K | 322.8K
ASIA_NELL | ASIA run on NELL categories | 112K | 676.9K
ASIA_INT | ASIA run on intelligence domain | 121K | 621.3K
Clueweb_HPR | High pagerank sample of Clueweb | 100K | 586.9K
Hyponym | Concept:count
USA | Country:1000
Paris | City:450, destination:100
Monkey | Animal:100, mammal:30
Sparrow | Bird:40
X, Y are hyponym, hypernym when context = Hearst pattern
Bottom-Up Clustering Algorithm
Record/cluster: <entity+, tableColumn+, domain+>
Clusters = { }
For each triplet record t such that |t.domains| > threshold:
    For each existing cluster C:
        Check whether t.entity overlaps with C.entity OR t.tableColumn overlaps with C.tableColumn
        If sufficient overlap, add t to C
    If no existing cluster C matches t:
        Create new cluster C’ = t
        Add C’ to Clusters
Time complexity: O(N log N); Table corpus: O(N); Triplet store: O(N)
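The pseudocode above can be sketched in Python as follows. Several details are assumptions, since the poster does not define them: "sufficient overlap" is taken to mean at least two shared entities or at least one shared table column, a record joins only the first matching cluster, and the default domain threshold is 0. The sketch scans all clusters linearly for each record, so it does not reproduce the stated O(N log N) bound. The function name `cluster_triplets` is hypothetical.

```python
def cluster_triplets(records, domain_threshold=0,
                     min_entity_overlap=2, min_column_overlap=1):
    """Greedy single-pass clustering of triplet records.

    Each record is a dict with set-valued keys
    "entities", "columns" and "domains".
    """
    keys = ("entities", "columns", "domains")
    clusters = []
    for rec in records:
        # Keep only records seen on more than `domain_threshold` domains.
        if len(rec["domains"]) <= domain_threshold:
            continue
        for cluster in clusters:
            shared_entities = len(rec["entities"] & cluster["entities"])
            shared_columns = len(rec["columns"] & cluster["columns"])
            if (shared_entities >= min_entity_overlap
                    or shared_columns >= min_column_overlap):
                # Merge the record into the first matching cluster.
                for key in keys:
                    cluster[key] |= rec[key]
                break
        else:
            # No existing cluster matched: start a new one from this record.
            clusters.append({key: set(rec[key]) for key in keys})
    return clusters
```

Run on the six triplet records from the country/capital example on this poster, with the default thresholds, this sketch yields two clusters: one holding the five countries and one holding the five capital cities.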