Data Integration for Relational Web

Michael Cafarella (University of Washington), Alon Halevy (Google, Inc.), Nodira Khoussainova (University of Washington)

OVERVIEW

INTRODUCTION
OCTOPUS AND ITS OPERATORS
ALGORITHMS
IMPLEMENTATION AT SCALE
EXPERIMENTS
RELATED WORK
CONCLUSIONS
REFERENCES

INTRODUCTION

OCTOPUS is a system that combines search, extraction, data cleaning, and integration, and enables the end user to create new data sets from those found on the Web.

Drawbacks of traditional data integration tools, and what a Web-focused system should do instead:

Locating relevant data: this is difficult because of the large amount of data on the Web. The system should integrate search for data sources and present the user with relevant data.

Data sources embedded in Web pages: the data needs to be prepared before processing. The system should offer Web-specific data extraction tools.

Semantics tied to the page: the semantics of Web data are implicitly tied to the Web page. For example, multiple tables of VLDB PC members are available, but the year of the conference appears only in the Web page text. The system should be able to recover any relevant implicit columns for each extracted data source.

Stable databases: most integration tools are tailored to queries posed against a stable database. OCTOPUS does not concentrate on a fixed data source; it considers transient data sources.

OCTOPUS AND ITS OPERATORS

DATA MODEL
INTEGRATING WEB SOURCES
INTEGRATION OPERATORS: SEARCH, CONTEXT, EXTEND

DATA MODEL

OCTOPUS manipulates data extracted from Web pages. Currently it handles HTML tables and HTML lists [3], [9].

It can manipulate data obtained from any information extraction technique that emits relational data.

An extracted relation is a table T with k columns. Its columns are domain-sensitive: each column contains strings or integers drawn from the same domain (e.g., movie titles).

A relation T also preserves its extraction lineage.

Data available: 3.9B HTML lists; 154M of 14B HTML tables contain high-quality relational data (slightly over 1% of those tables).
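The data model above can be sketched as a small Python type. This is only an illustration: the field names `source_url` and `position_in_page` are assumptions about what "extraction lineage" records, and the sample rows are placeholders, not data from the slides.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

Cell = Union[str, int]


@dataclass
class ExtractedRelation:
    """A table T with k columns, plus a record of where it was extracted from."""
    rows: List[List[Cell]]
    source_url: str                             # lineage (assumed field): source page
    position_in_page: int = 0                   # lineage (assumed field): index of the table or list on the page
    column_labels: Optional[List[str]] = None   # Web tables may carry no labels

    @property
    def k(self) -> int:
        """Number of columns, taken from the first row."""
        return len(self.rows[0]) if self.rows else 0

    def column(self, i: int) -> List[Cell]:
        """All cells of column i, which should share one domain."""
        return [row[i] for row in self.rows]


# Placeholder example data.
t = ExtractedRelation(
    rows=[["Alice Example", "Example University"],
          ["Bob Sample", "Sample Labs"]],
    source_url="http://example.org/pc-members",
)
```

Keeping the lineage on the relation itself is what lets a later operator go back to the source page.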

INTEGRATING WEB SOURCES

In traditional data integration, a mediated schema is created and used for query processing.

For example, to collect data about program committees, the mediated schema would be PCMEMBER(name, institution, conference, year).

The query would then be reformulated over the source tables, VLDB08Page and VLDB09Page in this example.

In OCTOPUS, descriptions of the data sources are not prepared in advance, because its data integration tasks are transient and involve a large number of data sources.

An integral part of data integration is finding the relevant data sources on the Web.

The SEARCH operator finds relevant data on the Web and then clusters the results. Each member table of a cluster is a concrete table that contributes to the cluster's schema relation.

The CONTEXT operator helps discover selection predicates that apply to the semantic mapping between a source table and the mediated schema but are not stated explicitly in the table. CONTEXT requires only a single concrete relation (with its lineage) to operate on.

SEARCH and CONTEXT together are sufficient to express semantic mappings from sources to the mediated schema.

The EXTEND operator expresses joins between data sources. For example, suppose the mediated schema in the previous example is extended with another attribute, adviser. We would then have to join the tables VLDB08Page and VLDB09Page with other relations on the Web that describe the adviser relationship.

However, this information may come from many different sources, so there would be several sets of inputs.

Like a conventional Web search engine, OCTOPUS uses ranking to decide on the output.

Data cleaning operators, such as data transformation and entity resolution, can also be implemented in OCTOPUS.

INTEGRATION OPERATORS

1. SEARCH OPERATOR

Takes a set of extracted relations S and a user's query q as input.

Returns a sorted list of clusters of tables in S, ranked by relevance to q.

Relevance ranking helps find useful source relations and is evaluated as in traditional Web search.

Clustering finds relations in S that are similar. Tables in a single cluster should be unionable with few or no modifications (i.e., they should be identical or very similar).

The output of SEARCH is a list L of table sets (clusters C). A single table may appear in multiple clusters. SEARCH sorts L for both relevance and diversity of results.

INTEGRATION OPERATORS

2. CONTEXT OPERATOR

Takes a single extracted relation T as input and modifies it to contain additional columns, using data derived from T's source Web page.

The values generated by CONTEXT can be viewed as the selection conditions in the semantic mappings created by SEARCH.

Data values that hold for every tuple are generally projected out of the table and placed in the surrounding text.

CONTEXT therefore makes implicit data embedded in the Web page explicitly available.

In the previous example, the year is implicit data that the CONTEXT operator can make available.
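As a minimal sketch of the effect CONTEXT has on a table, the snippet below appends one recovered value (such as the conference year) to every tuple. The function name `apply_context` and the sample rows are hypothetical. How the value is discovered is a separate question that this sketch does not address.

```python
from typing import List


def apply_context(rows: List[List[str]], context_value: str) -> List[List[str]]:
    """Return a copy of the table with context_value appended to every tuple."""
    return [row + [context_value] for row in rows]


# Placeholder PC-member table whose year appears only in the page text.
pc_table = [["Alice Example", "Example University"],
            ["Bob Sample", "Sample Labs"]]
with_year = apply_context(pc_table, "2008")
```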

INTEGRATION OPERATORS

3. EXTEND OPERATOR

Enables the user to add more columns to a table by performing a join.

Takes a column c of table T and a topic keyword k as input.

Returns one or more columns whose values are described by k.

The new column added to T does not necessarily come from a single data source; EXTEND gathers data from a large number of sources.

It can also gather data from tables whose label differs from k, or that have no label at all.
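The join behavior described above can be illustrated with a simple voting scheme. This is not the algorithm from the slides: it assumes the candidate tables for keyword k have already been retrieved, matches join values by case-insensitive equality, and picks the most frequent co-occurring cell. The name `extend` and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional

Table = List[List[str]]


def extend(join_values: List[str], candidate_tables: List[Table]) -> List[Optional[str]]:
    """For each join value, vote over cells that share a row with it in any candidate table."""
    votes: Dict[str, Counter] = {v: Counter() for v in join_values}
    for table in candidate_tables:
        for row in table:
            normalized = [cell.strip().lower() for cell in row]
            for v in join_values:
                key = v.strip().lower()
                if key in normalized:
                    # Every other cell in the matching row is a candidate value.
                    for cell, norm in zip(row, normalized):
                        if norm != key:
                            votes[v][cell.strip()] += 1
    # None marks join values that no source could extend.
    return [votes[v].most_common(1)[0][0] if votes[v] else None for v in join_values]


# Placeholder data: two sources, neither covering every join value.
names = ["Alice Example", "Bob Sample", "Carol Placeholder"]
sources = [
    [["Alice Example", "Prof. X"]],
    [["alice example", "Prof. X"], ["Bob Sample", "Prof. Y"]],
]
new_column = extend(names, sources)
```

Voting across sources reflects the point that the new column need not come from a single table. A real system would need approximate matching and ranking.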

ALGORITHMS

SEARCH: Ranking, Clustering
CONTEXT
EXTEND

SEARCH: rank the tables by relevance to the user's query, then cluster other related tables around each top-ranking search result.

SEARCH ALGORITHM

1. RANKING

SimpleRank (SR) algorithm: transmits the user's search query to a Web search engine, obtains the URL ordering, and presents the extracted data in that order.

Drawbacks:

It ranks whole pages, not the data on each page. For example, a person's home page may contain an HTML list that merely serves as a navigation list to other pages.

When multiple data sets are present on one Web page, SR relies on in-page ordering (i.e., the order in which they appear).

Any metadata about an HTML list exists only in the surrounding text, not in the table itself, so SR cannot count hits between the query and a specific table's metadata.
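A minimal sketch of SimpleRank as described above, assuming the search engine's URL ordering and the per-page extracted tables are already available. The inputs `ranked_urls` and `tables_by_url` and the table identifiers are hypothetical.

```python
from typing import Dict, List, Tuple


def simple_rank(ranked_urls: List[str],
                tables_by_url: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Order extracted tables by the search engine's URL rank, then by in-page order."""
    ranked = []
    for url in ranked_urls:                          # search-engine ordering
        for table_id in tables_by_url.get(url, []):  # in-page ordering
            ranked.append((url, table_id))
    return ranked


# Placeholder inputs: the second page holds two tables, the third holds none.
urls = ["http://example.org/a", "http://example.org/b", "http://example.org/c"]
extracted = {
    "http://example.org/b": ["b-table-1", "b-table-2"],
    "http://example.org/a": ["a-table-1"],
}
ordering = simple_rank(urls, extracted)
```

The sketch makes the drawbacks visible: nothing in it looks at the table contents or at the query.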

SCPRank algorithm: uses symmetric conditional probability (SCP) to measure the correlation between a cell in the extracted database and a query term. SCP is defined in terms of how likely the terms q and c are to appear together in a document.

SCPRank scores the table, not the cell.

It sends the query to the search engine and extracts a candidate set of tables.

It then computes per-column scores, each of which is the sum of the per-cell SCP scores in the column.

A table's overall score is the maximum of its per-column scores.

Finally, it sorts the tables by score and returns a ranked list.

Computing these scores is time-consuming, so SCPRank computes scores only for the first r rows of every candidate table and approximates SCP scores on a small subset of the Web corpus.
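The scoring steps can be sketched as follows. The sketch assumes the usual definition of symmetric conditional probability, scp(q, c) = p(q, c)^2 / (p(q) p(c)), and estimates the probabilities by substring matching over a small in-memory list of documents. The corpus, function names, and sample tables are placeholders, not the system's actual implementation.

```python
from typing import List

Table = List[List[str]]


def scp(q: str, c: str, docs: List[str]) -> float:
    """Symmetric conditional probability of terms q and c over a document list."""
    n = len(docs)
    if n == 0:
        return 0.0
    q, c = q.lower(), c.lower()
    has_q = [q in d.lower() for d in docs]
    has_c = [c in d.lower() for d in docs]
    p_q = sum(has_q) / n
    p_c = sum(has_c) / n
    p_qc = sum(a and b for a, b in zip(has_q, has_c)) / n
    if p_q == 0 or p_c == 0:
        return 0.0
    return p_qc ** 2 / (p_q * p_c)


def scp_rank_score(query: str, table: Table, docs: List[str], r: int = 5) -> float:
    """Table score: max over columns of the summed per-cell SCP, first r rows only."""
    rows = table[:r]
    if not rows:
        return 0.0
    k = len(rows[0])  # assumes a rectangular table
    column_scores = [sum(scp(query, row[j], docs) for row in rows) for j in range(k)]
    return max(column_scores)


# Placeholder corpus and candidate tables.
corpus = ["alpha film review", "alpha film cast", "beta recipe"]
table_a = [["alpha", "x1"], ["beta", "x2"]]
table_b = [["beta", "zz"]]
ranked = sorted([table_a, table_b],
                key=lambda t: scp_rank_score("film", t, corpus),
                reverse=True)
```

The `r` parameter mirrors the first-r-rows shortcut, and the small `corpus` stands in for the subset of the Web corpus used to approximate SCP.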

SEARCH ALGORITHM

2. CLUSTERING

For a table t, the algorithm computes dist(t, t') for every t' ∈ T - {t}. It then applies a similarity score threshold that limits the size of the cluster centered on t.

dist() for Text Cluster: computes the tf-idf cosine distance between the text of table a and the text of table b.

dist() for Size Cluster: computes a column-to-column similarity score that measures the difference in mean string length between two columns. The overall table-to-table similarity score for a pair of tables is the sum of the per-column scores for the best column-to-column matching.

dist() for Column Cluster: similar to Size Cluster, but it computes a tf-idf cosine distance using only the text found in the two columns.
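A rough sketch of the Text Cluster variant, using a hand-rolled tf-idf with whitespace tokenization and a smoothed idf. Both choices are assumptions, as are the function names, the threshold value, and the sample texts. It works with cosine similarity, so a higher score means a closer table.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(texts: List[str]) -> List[Dict[str, float]]:
    """One sparse tf-idf vector per table text (smoothed idf is an assumption)."""
    token_lists = [text.lower().split() for text in texts]
    n = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        vectors.append({term: count * (math.log((1 + n) / (1 + df[term])) + 1.0)
                        for term, count in tf.items()})
    return vectors


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_around(center: int, texts: List[str], threshold: float) -> List[int]:
    """Indices of tables whose similarity to the center table meets the threshold."""
    vectors = tfidf_vectors(texts)
    return [i for i, v in enumerate(vectors)
            if i != center and cosine_similarity(vectors[center], v) >= threshold]


# Placeholder table texts: two PC-member tables and one unrelated table.
table_texts = [
    "alice example university bob sample labs",
    "alice example university carol placeholder institute",
    "tomato basil garlic olive oil",
]
cluster = cluster_around(0, table_texts, threshold=0.2)
```

Raising the threshold shrinks the cluster around the center table, which is how the threshold limits cluster size.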

2. CONTEXT

Significant Terms (ST) algorithm: examines the source page of the extracted table and returns the k terms that have the highest tf-idf values and do not appear in the extracted data.
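A small sketch of the Significant Terms idea. It assumes document frequencies come from a list of background pages and uses a simple regular-expression tokenizer. The function names and the sample page, table, and background pages are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def significant_terms(page_text: str, table_rows: List[List[str]],
                      background_pages: List[str], k: int = 3) -> List[str]:
    """Top-k tf-idf terms of the source page that do not occur in the table."""
    n = len(background_pages)
    df = Counter(t for page in background_pages for t in set(tokenize(page)))
    in_table = {t for row in table_rows for cell in row for t in tokenize(cell)}
    tf = Counter(t for t in tokenize(page_text) if t not in in_table)
    scores = {t: count * math.log((1 + n) / (1 + df[t])) for t, count in tf.items()}
    # Sort by score, breaking ties alphabetically for a stable result.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Placeholder source page, extracted table, and background pages.
page = ("VLDB 2008 program committee. The VLDB 2008 committee members are "
        "listed below. Alice Example, Example University")
table = [["Alice Example", "Example University"]]
background = [
    "the program committee members are listed below",
    "the committee members of a workshop",
    "the list of members",
]
context_terms = significant_terms(page, table, background, k=2)
```

With these inputs the terms that are frequent on the page but rare in the background pages surface first, which is the behavior the algorithm relies on to recover values such as the conference year.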

Related View Partners (RVP) algorithm: looks beyond the source page.

Operating on the table T, it obtains a large number of candidate related-view tables by using each value in T as the parameter for a new Web search.

It then filters out tables that are unrelated to T's source page by removing all tables that do not contain at least one value from ST(T).

Finally, it collects all the data values in the remaining tables, ranks them by frequency of occurrence, and returns the k highest-ranked values.
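The three RVP steps can be sketched with the Web search abstracted behind a callable. The parameter `search_tables`, the in-memory `fake_index`, and the exact-match filter on cell values are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Dict, List

Table = List[List[str]]


def related_view_partners(table: Table, st_terms: List[str],
                          search_tables: Callable[[str], List[Table]],
                          k: int = 3) -> List[str]:
    """Search per value, keep tables sharing a value with ST(T), rank values by frequency."""
    # Step 1: one search per value in T gathers candidate related-view tables.
    candidates: List[Table] = []
    for row in table:
        for value in row:
            candidates.extend(search_tables(value))
    # Step 2: drop candidates that contain no value from ST(T).
    wanted = {term.lower() for term in st_terms}
    related = [cand for cand in candidates
               if any(cell.lower() in wanted for r in cand for cell in r)]
    # Step 3: rank every remaining data value by how often it occurs.
    counts = Counter(cell for cand in related for r in cand for cell in r)
    return [value for value, _ in counts.most_common(k)]


# Placeholder search index standing in for a Web search over extracted tables.
fake_index: Dict[str, List[Table]] = {
    "Alice Example": [[["Alice Example", "VLDB", "2008"]],
                      [["Alice Example", "Gardening Club", "1999"]]],
    "Bob Sample": [[["Bob Sample", "VLDB", "2008"]]],
}
source_table = [["Alice Example"], ["Bob Sample"]]
rvp_terms = related_view_partners(source_table, ["vldb"],
                                  lambda v: fake_index.get(v, []), k=2)
```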

Hybrid algorithm: exploits the fact that the two algorithms above are complementary. ST finds context terms that RVP misses, and RVP discovers context terms that ST misses.

Hybrid returns the context terms that appear in the result of either algorithm.
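A minimal sketch of combining the two result lists. Returning terms from either algorithm follows the text above. Interleaving the lists and deduplicating case-insensitively are assumptions about ordering that the slides do not spell out.

```python
from itertools import zip_longest
from typing import List


def hybrid(st_terms: List[str], rvp_terms: List[str]) -> List[str]:
    """Union of both result lists, interleaved and deduplicated case-insensitively."""
    merged: List[str] = []
    seen = set()
    for pair in zip_longest(st_terms, rvp_terms):
        for term in pair:
            if term is not None and term.lower() not in seen:
                seen.add(term.lower())
                merged.append(term)
    return merged


# Placeholder outputs of the two algorithms.
combined = hybrid(["2008", "vldb"], ["VLDB", "conference"])
```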

3. EXTEND

[Figure: the EXTEND algorithm]

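EXPERIMENTS

We now evaluate the quality of the results generated by each of the operators.

[Figure: the queries used]

[Figure: SEARCH operator results, ranking]

[Figure: SEARCH operator results, clustering]

[Figure: CONTEXT results]

[Figure: EXTEND results]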

RELATED WORK

Data integration on the Web, often called a "mashup", is an increasingly popular area of work.

Yahoo Pipes allows the user to graphically describe a flow of data (structured data only).

CIMPLE is a data integration system for Web use, designed to construct community Web sites.

CONCLUSION

OCTOPUS allows the user to integrate data from many unstructured data sources.

It offers access to orders of magnitude more data sources and frees the user from having to design, or even know about, the mediated schema.

REFERENCES:
