Center for E-Business Technology, Seoul National University
Seoul, Korea
WebTables: Exploring the Power of Tables on the Web
Michael J. Cafarella, Alon Halevy, Zhe Daisy Wang, Eugene Wu, Yang Zhang
VLDB 2008
2009. 01. 08.
Summarized and Presented by {Name}, IDS Lab., Seoul National University
Copyright 2009 by CEBT
Introduction
Web is a corpus of unstructured data
Some structure is imposed by
Hierarchical URLs
Hyperlink Graph
Web pages generally contain
Text as paragraphs
Tabular data (Relations)
Text and tables have different characteristics
Tables have more structured data than raw text
Introduction (2)
Tables can give some hints about semantics
Headers
Tuples
Regular keyword query techniques are not very effective for tables
Motivation
Enable analysis and integration of data on the web
User demand for structured data
In 30 million queries, users clicked on results containing tables
This paper focuses on two fundamental questions
What are effective methods for searching within large collections of tables?
Is there additional power that can be derived by analyzing a large corpus of tables?
WebTables - Data
WebTables system considers HTML tables that are already surfaced and crawlable
Deep Web refers to the content that is made available through filling HTML forms
Corpus
14.1 Billion raw HTML tables
154 Million distinct relational databases
Relational databases form 1.1% of raw HTML tables
60% of data from non-deep-web sources
40% of data from parameterized URLs
Extracting Relations
Most HTML tables are used for page layouts
To separate relational from non-relational tables
Handwritten detectors
Statistically trained classifiers
Training & Test data generated by two independent judges
Scale of relational quality 1-5
Tables that received average score of 4 or above were considered as relational
Data Model
ℛ: Corpus of databases, where each database is a single relation
R: A relation, R ∈ ℛ; Ru and Ri together uniquely identify R
Ru: URL of the page from which the relation was extracted
Ri: Offset of the relation within the page
Rs: Schema of the relation
Rt: A list of tuples
A: Attribute Correlation Statistics Database (ACSDb)
Attribute Correlation Statistics Database (ACSDb)
For each unique schema Rs, the ACSDb contains a frequency count
A = {(Rs1,C1), (Rs2,C2), (Rs3,C3) … }
If a schema appears multiple times under the same domain name, it is counted only once
ACSDb contains
5.4M unique attribute names
2.6M unique schemas
ACSDb is simple but can be used to compute probabilities
For example, the conditional probability of finding the attribute ‘address’ in a schema, given that it contains the attribute ‘name’
P(address|name) = (count of schemas containing both address and name) / (count of schemas containing name)
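The counting rule and the conditional probability above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `build_acsdb` and `cond_prob`, the sample domains, and the choice to treat a schema as an unordered, lowercased set of attribute names are all assumptions made for the example.

```python
from collections import Counter

def build_acsdb(relations):
    """Map each unique schema to a frequency count.

    relations: iterable of (domain, schema) pairs, schema being a list of
    attribute names. A schema is counted once per domain (rule from the slide).
    Treating a schema as an unordered, lowercased set is an assumption.
    """
    seen = set()
    acsdb = Counter()
    for domain, schema in relations:
        key = (domain, frozenset(a.lower() for a in schema))
        if key in seen:
            continue  # same schema under the same domain: skip
        seen.add(key)
        acsdb[key[1]] += 1
    return acsdb

def cond_prob(acsdb, target, given):
    """P(target | given) = count(schemas with both) / count(schemas with given)."""
    with_given = sum(c for s, c in acsdb.items() if given in s)
    if with_given == 0:
        return 0.0
    with_both = sum(c for s, c in acsdb.items() if given in s and target in s)
    return with_both / with_given

# Hypothetical sample data
relations = [
    ("a.example", ["name", "address", "phone"]),
    ("a.example", ["name", "address", "phone"]),  # duplicate within one domain
    ("b.example", ["name", "address", "phone"]),
    ("c.example", ["name", "score"]),
    ("d.example", ["make", "model"]),
]
acsdb = build_acsdb(relations)
p = cond_prob(acsdb, "address", "name")
print(p)
```

Here three counted schemas contain ‘name’ and two of them also contain ‘address’, so the estimate is 2/3.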
ACSDb
Relation Search
The WebTables search engine returns relations ranked by relevance to the user's query
Query-appropriate visualizations can be created
Columns containing place names can be displayed on a map
Graphs can be generated from table data
Traditional structured operations can be applied over search results
Selection
Projection
Ranking
Keyword ranking for databases is a novel problem
Challenges
Relations do not exist in a domain-specific schema graph
Word frequencies apply ambiguously to tables (e.g., which table on the page is described by which frequent word?)
Attribute labels are extremely important
Attributes provide good summaries of the subject matter
Tuples may have a key-like element that summarizes the row
Ranking Functions
naïveRank
filterRank
featureRank
schemaRank
Ranking Function (1)
Naïve Rank
It simply uses the top k search engine result pages to generate relations.
If there are no relations in the top k search results, naïveRank will emit no relations.
Roughly simulates a modern search engine user
Ranking Function (2)
Filter Rank
Similar to naïve rank
It will go as far down the search result pages as necessary to find ‘k’ relations
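The difference between naïveRank and filterRank can be sketched as follows. This is an illustrative sketch only: `ranked_pages` stands for the page list returned by some search engine, `extract` is a hypothetical callable that returns the relations found on a page, and capping the output at k relations is an assumption.

```python
def naive_rank(ranked_pages, extract, k):
    """Look only at the top-k result pages and emit whatever relations they hold."""
    results = []
    for page in ranked_pages[:k]:
        results.extend(extract(page))  # relations in document order
    return results[:k]

def filter_rank(ranked_pages, extract, k):
    """Walk down the result list as far as needed to collect k relations."""
    results = []
    for page in ranked_pages:
        results.extend(extract(page))
        if len(results) >= k:
            break
    return results[:k]

# Hypothetical data: which relations each result page contains
pages = ["p1", "p2", "p3", "p4"]
tables_on_page = {"p1": [], "p2": ["r1"], "p3": [], "p4": ["r2", "r3"]}
extract = lambda page: tables_on_page.get(page, [])

print(naive_rank(pages, extract, 2))   # only p1 and p2 are inspected
print(filter_rank(pages, extract, 2))  # continues to p4 to reach 2 relations
```

With k = 2, naïveRank finds a single relation in the first two pages, while filterRank keeps going until it has two.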
Ranking Function (3)
Feature Rank
Does not rely on an existing search engine
Uses relation specific features to score each extracted relation in the Corpus
Sorts results by score
The individual feature scores were combined using a linear regression estimator
– trained on a thousand (q, relation) pairs, each scored by two human judges
Ranking Function (4)
Schema Rank
Same as featureRank
Additionally uses an ACSDb-based schema coherence score
A coherent schema is one whose attributes are strongly related
Make, Model (coherent)
Make, Zipcode (much less coherent)
PMI - Pointwise Mutual Information
Gives a sense of how strongly two items are related
Coherence score for a schema is the average of all possible attribute-pairwise PMI scores for the schema
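A minimal sketch of the coherence score, assuming the standard definition PMI(a, b) = log(p(a, b) / (p(a) p(b))) with probabilities estimated from ACSDb counts. The helper names, the toy counts, and the choice to return negative infinity for attribute pairs that never co-occur are assumptions for illustration.

```python
import math
from itertools import combinations

def prob(acsdb, *attrs):
    """Fraction of counted schemas that contain all the given attributes."""
    total = sum(acsdb.values())
    hits = sum(c for s, c in acsdb.items() if all(a in s for a in attrs))
    return hits / total

def pmi(acsdb, a, b):
    """Pointwise mutual information of two attributes."""
    joint = prob(acsdb, a, b)
    if joint == 0:
        return float("-inf")  # never co-occur (handling is an assumption)
    return math.log(joint / (prob(acsdb, a) * prob(acsdb, b)))

def coherence(acsdb, schema):
    """Average PMI over all attribute pairs of the schema."""
    pairs = list(combinations(sorted(schema), 2))
    if not pairs:
        return 0.0
    return sum(pmi(acsdb, a, b) for a, b in pairs) / len(pairs)

# Hypothetical ACSDb: schema -> frequency count
acsdb = {
    frozenset({"make", "model"}): 4,
    frozenset({"make", "model", "zipcode"}): 1,
    frozenset({"name", "zipcode"}): 5,
}
print(coherence(acsdb, {"make", "model"}))    # positive: attributes go together
print(coherence(acsdb, {"make", "zipcode"}))  # negative: rarely seen together
```

On this toy data the Make, Model schema scores higher than Make, Zipcode, matching the intuition on the slide.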
Indexing
Traditional Search Engines use Inverted Index
Inverted Index cannot retrieve relational features
Inverted Index
Term -> (docid, offset)
WebTables data exists in two dimensions
Term -> (docid, offset-X, offset-Y)
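The two-dimensional posting format can be illustrated with a small in-memory index. This is a sketch under stated assumptions: `build_index` and `in_header` are hypothetical names, cells are tokenized by whitespace, and row 0 of each table is assumed to be the header row.

```python
from collections import defaultdict

def build_index(tables):
    """Build term -> [(docid, x, y)] postings.

    tables: dict mapping docid -> list of rows, each row a list of cell strings.
    x is the column offset and y is the row offset of the cell.
    """
    index = defaultdict(list)
    for docid, rows in tables.items():
        for y, row in enumerate(rows):
            for x, cell in enumerate(row):
                for term in cell.lower().split():
                    index[term].append((docid, x, y))
    return index

def in_header(index, term):
    """Docids where the term occurs in row 0 (assumed to be the header)."""
    return {docid for docid, x, y in index.get(term.lower(), []) if y == 0}

# Hypothetical tables
tables = {
    "t1": [["City", "Population"], ["Seoul", "9700000"]],
    "t2": [["Name", "City"], ["Kim", "Seoul"]],
}
index = build_index(tables)
print(index["seoul"])
print(in_header(index, "city"))
```

Keeping both offsets lets a ranker ask relation-specific questions, such as whether a query term hits a header cell or a particular column, which a plain (docid, offset) posting cannot answer.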
ACSDb Application (1)
Schema Auto Complete
Designed to assist novice database designers when creating a relational schema
Schemas consisting of Single Relations
The user enters one or more domain-specific attributes and the auto-completer guesses the rest of the attributes
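One simple way to realize such an auto-completer is a greedy loop over ACSDb counts: repeatedly add the attribute that keeps the probability of the whole suggested schema, given the user's input, as high as possible, and stop once that probability falls below a cutoff. The sketch below is a simplified illustration and may differ from the algorithm in the paper; `suggest_schema`, the `threshold` parameter, and the sample counts are hypothetical.

```python
def count(acsdb, attrs):
    """Total frequency of schemas that contain every attribute in attrs."""
    return sum(c for s, c in acsdb.items() if attrs <= s)

def suggest_schema(acsdb, given, threshold=0.3):
    """Greedily extend the user's attributes using co-occurrence counts."""
    given = frozenset(given)
    base = count(acsdb, given)
    schema = set(given)
    if base == 0:
        return schema  # input never seen: nothing to suggest
    vocabulary = set().union(*acsdb.keys())
    while True:
        best_attr, best_p = None, 0.0
        for a in sorted(vocabulary - schema):
            # P(schema + a | given), estimated from counts
            p = count(acsdb, schema | {a}) / base
            if p > best_p:
                best_attr, best_p = a, p
        if best_attr is None or best_p < threshold:
            return schema
        schema.add(best_attr)

# Hypothetical ACSDb: schema -> frequency count
acsdb = {
    frozenset({"name", "address", "phone"}): 5,
    frozenset({"name", "address"}): 3,
    frozenset({"name", "score"}): 2,
    frozenset({"make", "model"}): 4,
}
print(sorted(suggest_schema(acsdb, {"name"})))
print(sorted(suggest_schema(acsdb, {"name"}, threshold=0.6)))
```

Raising the threshold yields shorter, more conservative suggestions; lowering it yields longer schemas that fewer real tables support.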
ACSDb Application (2)
Attribute Synonym-Finding
Automatically find synonyms between arbitrary attribute strings
Given a set of context attributes, generates candidate synonym attribute pairs
Assumptions
– Synonymous attributes will never appear together in the same schema
– Odds of synonymy are higher if p(a,b) = 0 despite a large value for p(a)p(b)
– Two synonyms will appear in similar contexts
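The three assumptions can be combined into a single score, sketched below. The particular formula is an illustrative choice, not necessarily the one used by the authors: the score is zero when the two attributes ever co-occur, grows with p(a)p(b), and shrinks as the two attributes' co-occurrence profiles with other attributes diverge. The names `syn_score` and `eps` (a smoothing constant) and the sample counts are hypothetical.

```python
def p(acsdb, attrs):
    """Probability that a counted schema contains every attribute in attrs."""
    total = sum(acsdb.values())
    return sum(c for s, c in acsdb.items() if attrs <= s) / total

def cond(acsdb, z, given):
    """P(z | given), or 0.0 if the conditioning set never occurs."""
    denom = p(acsdb, given)
    return p(acsdb, given | {z}) / denom if denom else 0.0

def syn_score(acsdb, a, b, context=frozenset(), eps=0.01):
    """Higher score = more likely synonyms (illustrative formula)."""
    if p(acsdb, {a, b}) > 0:
        return 0.0  # assumption 1: synonyms never share a schema
    vocab = set().union(*acsdb.keys()) - {a, b}
    # assumption 3: synonyms appear in similar contexts, so this distance is small
    dist = sum(
        (cond(acsdb, z, context | {a}) - cond(acsdb, z, context | {b})) ** 2
        for z in vocab
    )
    # assumption 2: a large p(a)p(b) with p(a,b) = 0 is strong evidence
    return p(acsdb, {a}) * p(acsdb, {b}) / (eps + dist)

# Hypothetical ACSDb: schema -> frequency count
acsdb = {
    frozenset({"name", "phone", "email"}): 4,
    frozenset({"name", "telephone", "email"}): 3,
    frozenset({"name", "score"}): 3,
}
print(syn_score(acsdb, "phone", "telephone"))  # similar contexts: high
print(syn_score(acsdb, "phone", "score"))      # different contexts: low
print(syn_score(acsdb, "name", "email"))       # co-occur: 0.0
```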
ACSDb Application (3)
Join Graph Traversal
Provide a useful way of navigating huge graph of 2.6M Schemas
Basic join graph
– Contains a node ‘N’ for each unique schema
– Undirected join link between any two schemas that share an attribute
Every schema that contains a ‘name’ field is linked to every other schema that contains ‘name’
Cluster together similar schemas to minimize graph clutter
Schema: X,Y
Shared Attribute: D
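The basic join graph described above can be sketched as follows. This is a minimal illustration with hypothetical names: schemas are represented as frozensets of attribute names, and each undirected edge is stored as a frozenset of two schemas mapped to the attributes they share.

```python
from collections import defaultdict
from itertools import combinations

def build_join_graph(schemas):
    """Return {frozenset({s1, s2}): shared_attributes} for schemas sharing an attribute."""
    by_attr = defaultdict(set)
    for schema in schemas:
        for attr in schema:
            by_attr[attr].add(schema)
    edges = defaultdict(set)
    for attr, group in by_attr.items():
        # every pair of schemas containing attr gets an undirected link
        for s1, s2 in combinations(sorted(group, key=sorted), 2):
            edges[frozenset({s1, s2})].add(attr)
    return edges

# Hypothetical schemas
s_contact = frozenset({"name", "address"})
s_grades = frozenset({"name", "score"})
s_cars = frozenset({"make", "model"})
graph = build_join_graph([s_contact, s_grades, s_cars])
for pair, shared in graph.items():
    print([sorted(s) for s in pair], "share", sorted(shared))
```

Because an attribute that appears in n schemas contributes n(n-1)/2 links, a common attribute such as ‘name’ quickly makes the graph cluttered, which is why similar schemas are clustered before display.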
Exp. Results – Relation Ranking
Rank-ACSDb beats Naïve (which simulates search engine users) by 78-100%
All of the non-Naïve solutions improve as k (number of results) increases
Exp. Results – Schema Auto Complete
Test Scenario
Six human designers created schemas from given attributes
The auto-complete tool got three tries
By its 3rd output, auto-complete was able to reproduce a large number of the schemas
No test designer recognized ‘ab’ as an abbreviation for ‘at-bats’ (baseball terminology)
Exp. Results – Synonym Finding
Ranked by quality
An ideal ranking would present a stream of only correct synonyms, followed by only incorrect ones
A poor ranking will mix them together
Exp. Results – Join Graph Traversal
Conclusion
WebTables is the first large-scale attempt to extract relational information embedded in HTML tables
Relation Ranking
ACSDb uses
Schema auto complete
Attribute Synonym Finding
Join Graph Traversing
Adding a signal for source-page quality, such as PageRank, is expected to improve overall quality
Discussion
Pros
Handling tables separately for search is a good idea
Cons
Most of the paper is focused on uses of ACSDb