extending tables with data from over a million websites
DESCRIPTION
The slideset describes the Mannheim Search Join Engine and was used to present our submission to the Semantic Web Challenge 2014. More information about the Semantic Web Challenge: http://challenge.semanticweb.org/ Paper about the Mannheim Search Join Engine: http://dws.informatik.uni-mannheim.de/fileadmin/lehrstuehle/ki/pub/Lehmberg-Ritze-Ristoski-Eckert-Paulheim-Bizer-TableExtension-SemanticWebChallenge-ISWC2014-Paper.pdf Abstract: This Big Data Track submission demonstrates how the BTC 2014 dataset, Microdata annotations from thousands of websites, as well as millions of HTML tables are used to extend local tables with additional columns. Table extension is a useful operation within a wide range of application scenarios: Image you are an analyst having a local table describing companies and you want to extend this table with the headquarter of each company. Or imagine you are a film lover and want to extend a table describing films with attributes like director, genre, and release date of each film. The Mannheim SearchJoin Engine automatically performs such table extension operations based on a large data corpus gathered from over a million websites that publish structured data in various formats. Given a local table, the SearchJoin Engine searches the corpus for additional data describing the entities of the input table. The discovered data are then joined with the local table and their content is consolidated using schema matching and data fusion methods. As result, the user is presented with an extended table and given the opportunity to examine the provenance of the added data. Our experiments show that the Mannheim SearchJoin Engine achieves a coverage close to 100% and a precision of around 90% within different application scenarios.TRANSCRIPT
Slide 1
International Semantic Web ConferenceRiva del Garda, Italy, 22.10.2014
Semantic Web Challenge – Big Data Track
Extending Tables with Data from over a Million Websites
Oliver Lehmberg, Dominique Ritze, Petar Ristoski,
Kai Eckert, Heiko Paulheim, Christian Bizer
Slide 2
Extend a local table with additional columns using different types of Web data.
RegionUn-
employmentAlsace 11 %Lorraine 12 %Guadeloupe 28 %Centre 10 %Martinique 25 %
GDPper Capita
PopulationGrowth
45.914 € 0,16 %51.233 € -0,05 %19.810 € 1,34 %59.502 € 1,76 %
NULL 2,64 %
+
Goal
Slide 3
Operation 1: Extend Local Table with Single Column
Given a local table and keywords describing the extension column, add the extension column to the table and fill it with data from the Web.
Region UnemploymentAlsace 11 %Lorraine 12 %Guadeloupe 28 %Centre 10 %Martinique 25 %… …
GDP per Capita45.914 €51.233 €19.810 €59.502 €21,527 €
…
+
„GDP per Capita“
Slide 4
Operation 2: Extend Local Table with Many Columns
Given a local table, add all columns to the table that can be filled beyond a density threshold.
Region Unemp.Rate
Alsace 11 %Lorraine 12 %Guadeloupe 28 %Centre 10 %Martinique 25 %… …
GDPper Capita
PopulationGrowth
Overseasdepartments
…
45.914 € 0,16 % No …51.233 € -0,05 % No …19.810 € 1,34 % Yes …59.502 € NULL NULL …
NULL 2,64 % Yes …… … …
+
density >= 0.8
Slide 5
Types of Web Data Used
Microdata(schema.org)
Wiki Tables
HTML Tables
Linked Data
Slide 6
Billion Triple Challenge Dataset 2014
4 billion triples crawled from 47,000 websites.
Slide 7
Extracted from Common Crawl 2013 web corpus 2.2 billion HTML pages from 12.8 million websites
Mostly using the schema.org vocabulary
Main topics Products
Reviews
Organisations / LocalBusiness
Events
Web Data Commons - Microdata Corpus
250 million triples from 463,000 websites.
Download: http://webdatacommons.org/structureddata/
Slide 8
Web Data Commons – Web Tables Corpus
we used 35 million English HTML tables. extracted from the Common Crawl 2012 web corpus
selected out of 11.2 billion raw tables
Around 1% of all HTML tables contain structured data.
Slide 9
Column Statistics
Web Data Commons – Web Tables Corpus
Column #Tablesname 4,600,000price 3,700,000date 2,700,000artist 2,100,000location 1,200,000year 1,000,000manufacturer 375,000counrty 340,000isbn 99,000area 95,000population 86,000
Subject Column Values
Value #Rowsusa 135,000germany 91,000greece 42,000new york 59,000london 37,000athens 11,000david beckham 3,000ronaldinho 1,200oliver kahn 710twist shout 2,000yellow submarine 1,400
Download: http://webdatacommons.org/webtables/
Slide 10
WikiTables
extracted by Northwestern University
from the 2013 Wikipedia XML dump
only tables, no infoboxes
1.4 million tables from English Wikipedia.
Download: http://downey-n1.cs.northwestern.edu/public/
Slide 11
Internal Data Model: Entity-Attributes-Tables
One entity per row
Subject Column = Name of the entity HTML tables: Most unique string column, break ties by taking leftmost.
Table generation from Linked Data and Microdata generate one table per class and website
subject column: rdfs:label, foaf:name, x:name
we exploit common vocabularies
Rank Film Studio Director Length1. Star Wars –Episode 1 Lucasfilm George Lucas 121 min2. Alien Brandwine Ridley Scott 117 min3. Black Moon NEF Louis Malle 100 min
Slide 12
Indexed Tables
Selection Conditions: 1. Minimum size of 3 columns and 5 rows
2. Subject column detection successful
Total # of tables: 36.3 million
Total # of PLDs: ~ 1.5 million
Total # of triples: 3.0 billion
Slide 13
The Mannheim Search Joins Engine (MSJE)
Collection of tables Table NormalizationTable Storage Table Index
1. Table Indexing
Input query table Table Preprocessing
Search
2. Table Search
3. Data Consolidation
Data collection
User Preferences
ConsolidationMultiJoin Top k Candidates
Slide 14
The Search Operator
Table Ranking subject column value overlap
extended Jaccard Similarity (FastJoin)
Select TopK Tables 1000 tables in the single column experiments
The Search operator determines the set of relevant Web tables.
Relevant
Slide 15
Multi-Join Operator
The MultiJoin operator performs a series of left-outer joins between the query table and all tables in the input set.
No. Region
1 Alsace
2 Lorraine
3 Guadeloupe
4 Centre
Unemploy
11 %
12 %
28 %
10 %
Unemploy
NULL
NULL
NULL
9.4 %
GDP
45.914 €
51.233 €
NULL
NULL
GDP per C
45.000 €
NULL
19.000 €
59.500 €
Slide 16
Consolidation Operator
Column Matching Combination of label- and instance-based techniques
Conflict Resolution Strings: majority vote
Numeric values: average,median, clustering and vote
The consolidation operator merges corresponding columns and fuses values in order to return a concise result table.
No Region Unemploy GDP
1 Alsace 11 % 45.914 €
2 Lorraine 12 % 51.233 €
3 Guadeloupe
28 % 19.000 €
4 Centre 10 % 59.500 €
Slide 17
http://searchjoins.webdatacommons.org
Slide 18
Result: Extend with Single Column
Slide 19
Provenance Summary
Slide 20
Provenance Details
Slide 21
Evaluation Results
Author Head‐quarter Industry Area Capital Code Currency Popu‐
lationIngre‐dient Cast Director Genre Year Artist Team
Book Company Country Drug Film Song SoccerPlayer
coverage 93% 94% 94% 100% 100% 100% 94% 100% 87% 94% 97% 97% 96% 99% 88%precision 96% 96% 94% 95% 100% 94% 96% 64% 89% 85% 97% 86% 97% 95% 67%
0%
20%
40%
60%
80%
100%
Coverage: Percentage of entities for which a value was found.Precision: Manually evaluated using Wikipedia, IMDB, Amazon.
Slide 22
Result: Extend with Many Columns
505 columns are addedand filled with data from 2071 tables.
Slide 23
Provenance Summary
Slide 24
Provenance Details for “area (sq. km)”
Slide 25
Conclusion
The prototype shows that simple queries are feasible.
The Web is one application domain for search joins, corporate intranets are the other.
The overlooked Big Data Vs: Variety and Veracity
Search Joins bring together Web Search and DB Joins.