Download - Data Integration for Relational Web
Michael Cafarella Alon HalevyNodira KhoussainovaUniversity of Washington Google, incUniversity of Washington
Data Integration for Relational Web
OVERVIEWINTRODUCTIONOCTOPUS AND ITS OPERATORSALGORITHMSIMPLEMENTATION AT SCALEEXPERIMENTSRELATED WORKCONCLUSIONSREFERENCES
INTRODUCTION OCTOPUS, a system that combines search, extraction, data
cleaning and integration, and enables the end user to create new data set from those found on the Web.
Drawbacks of traditional Data Integration tools:Locating Relevant Data: Difficult due to the large amount of data on
Web.System should Integrate well and present user with relevant data.
Data Sources Embedded in Web Pages: Data needs to be prepared before processing.
Offer Web Specific Data Extraction Tool
The semantic of Web data are implicitly tied to the Web Page.Eg: Multiple Tables of VLDB PC members available but the Year of the
Conference is in the Web Page text.Able to recover any Relevant and implicit column for each extracted data
source
Most of the Integration tool are tailored for the query posted against the stable database.
We do not concentrate on a certain data source, but we consider transient data source.
OCTOPUS AND ITS OPERATORS
DATA MODELINTEGRATING WEB SOURCESINTEGRATION OPERATORS
SEARCHCONTEXTEXTEND
DATA MODELManipulates the data Extracted from Web pagesCurrently it handles HTML tables and HTML lists.[3] &
[9]Can manage any manipulate data obtained from any
information extraction technique that emits relational data.
Extracted relation is a table T with ‘k’ columns.The table extracted has domain sensitive columns.
Eg: Column will contain strings which depict strings or integers which are drawn from same domain. (movie titles).
Relation T would also preserves it extraction lineage. Data Available: 3.9B HTML lists, 154M of 14B HTML
table contain high quality relational data (slightly over 1% of data available on Web).
INTEGRATING WEB SOURCESIn Traditional Data integrating TechniqueCreate a mediated schema which can be used
for query processing.For eg: To collect data about the programming
committee the schema would be PCMEMBER(name, institution, conference, year)
The query would be reformulated to
In OCTOPUS system, the description for data sources are not prepared in advance.
Because Data Integration task in OCTOPUS Transient Large Number of Data Sources.
Integral Part of data integration is finding the relevant data sources over the Web.
Search operator finds relevant data over the Web and then clusters the result. Each member table of the cluster is a concrete table
that contributes to the Clusters Schema Relation. The Context operator helps to discover the
selection predicates that apply to schematic mapping of source table and mediated table but which are not described explicitly. Context only requires single concrete relation to
operate on(linkage).
Search and Context operators are sufficient to express semantic mappings for sources to the mediated schema.
The Extend operator will help us to express joins between data sources.For eg: In the previous example if the mediated schema is
extended with an another attribute Adviser. We will have to join the tables VLDB08Page and
VLDB09Page with other relation on the Web that describe the adviser relationship.
However the above information may come from many different sources and hence we would have several set of inputs.
The OCTOPUS uses the ranking technique like the conventional Web Search Engine to decide on the output.
Data Cleaning operators such as Data Transformation, Entity Resolution can also be implemented in OCTOPUS.
INTEGRATION OPERATORS1. SEARCH OPERATORTakes Extracted Set of Relation S and a Users Query q
as input. Returns a Sorted List of Cluster of tables in S, ranked by
the relevance to q. Relevance ranking helps to find the useful source
relation and is evaluated as the traditional Web Search. Clustering: Finding Relations in S that are similar. Tables in a Single Cluster should be able unionable with
few or no modifications.( ie. They should be identical or very similar)
The Output of Search is List L of table sets. A single table may appear in multiple clusters C. It sorts the List L for relevancy and diversity of results.
INTEGRATION OPERATORS2. CONTEXT OPERATORTakes a single extracted Relation T as input and
modifies to contain additional columns using data derived from T’s Source Web Page.
The values generated by Context can be viewed as the selection conditions in semantic mapping created by SEARCH.
Data values that hold true for every tuple are generally projected out and added to surrounding text.
Hence it makes the implicit data that are embedded in the Web page available explicitly.
In the previous example, year is the implicit data which can me made available by CONTEXT operator.
INTEGARTION OPERATORS
3. EXTEND OPERATOR:Enables the user to add more columns to the
table by performing a join. Takes a column “c” of table T as input and a
topic keyword “k”.It returns 1or more columns whose values are
described by k.The new column added to T does not necessarily
come from aa single data source. It gathers data from large number of sources. It can also gather data from table with different
label from k or no label at all.
ALGORITHMSSEARCH
RankingClustering
CONTEXTEXTEND
SEARCH : Rank the Table by relevance to Users Query
Cluster other related tables around top ranking Search result.
SEARCH ALGORITHM1. RANKING: Simple Rank Algorithm: Transmits the users search query to Web Search engine
obtains the URL ordering and presents the data according to that order.
Drawbacks:Ranks Individual whole page and not the data on that
page.Eg: persons home page contains a HTML list that serve as navigation
list to other pages.When multiple data sets are present on the web page, SR
algorithm relies on in-page ordering. (ie. In the order of its appearance)
Any metadata about the HTML lists exists only in the surrounding text and not the table itself. Cannot count hits between the query and a specific tables
metadata.
SCPRanking Algorithm:Uses symmetric conditional probability to measure
correlation between cell in extracted database and query term. It is defined as:
How likely the term q and c appear together in a document.SCPRank scores the table and not the cell.It sends the query to the Search Engine, extracting a
candidate set of tables. Then it computes per-column scores, each of which is
sum of per-cell SCP score in the column. The tables overall score is the max of all of its per-column
scores.Finally it sorts the table in the order of their scores and
returns a ranked list. Time consuming. Compute score for first ‘r’ rows of every candidate table.Approximating SCP score on a small subset of Web
corpus.
SEARCH ALGORITHM
2.CLUSTERING:It computes the dist(t,t’) for every t’ € T-t. Then it applies a similarity score threshold that
limits the size of the cluster centered around t. Dist() for Text Cluster: computes tf-idf cosine dist
between texts of table a and text of table b.Dist() for Size Cluster: computes column to column
similarity score that measures the difference in mean string length between them.The overall table-to-able similarity score for a pair of
table is sum of per column score for best column-to-column matching.
Dist() for Column Cluster: Its similar to Size Cluster however it computes a tf-idf cosine distance using only the text found in the 2 columns.
2. CONTEXTSignificant Term Algorithm:
Examines the source page of the extracted table and returns the k terms with the highest tf-idf values and do not appear in the extracted data.
Related View Partners:Looks beyond the source page.Operating on the table T, it obtains a large number
of candidate related view tables, by using each value in T as parameter for a new Web Search
Then filters out tables that are unrelated to t’s source page, by removing all tables that do not contain atleast one value from ST(T)
It obtains all the data value in the remaining table and ranks them according to the frequency of occurrence, returns the k highest ranked values.
Hybrid Algorithm:It uses the fact that the above 2 algorithm are
complimentary in nature.ST finds the context terms that RVP misses and
RVP discovers the context terms that ST misses. Hybrid returns the context term that appear in
result of either algorithm.
3. EXTEND:The Algorithm used is shown below:
EXPERIMENTS:We now evaluate the quality of result generated
by each of the operators.The Queries used:
SEARCH OPERATORRanking:
Clustering:
CONTEXT:
EXTEND:
RELATED WORKData Integration on Web called as “MashUp” is
increasingly popular area of work.The Yahoo Pipes allows the user to graphically
describe the flow of data (structured data only)CIMPLE is data integration system for web use
designed to construct community websites.
CONCLUSIONOCTOPUS allows the user to integrate data from
many unstructured data source.It offers access to orders of magnitude of data
sources, frees the user from having to design or even know about the mediated schema.
REFERENCES: