FleDEx: Flexible Data Exchange Eli Cortez, Altigran Silva Federal University of Amazonas, Brazil WIDM’07 Filipe Mesquita, Denilson Barbosa University of Alberta, Canada


FleDEx: Flexible Data Exchange

Eli Cortez, Altigran Silva

Federal University of Amazonas, Brazil

WIDM’07

Filipe Mesquita, Denilson Barbosa

University of Alberta, Canada

The Data Exchange Problem is…

…translating data from a source schema to a target schema

[Diagram: original data (source schema) → translate → translated data (target schema)]

Existing Solutions are Complex for Non-Experts

Data are kept in databases

Tools such as Clio are used to help translate from one database schema to another

Non-experts have neither the skills nor the resources to set up databases and use mapping tools

[Diagram: Source DB → Target DB]

“Data Exchange for the Masses”

We propose a lightweight framework for non-experts to share data, where:

Data are kept in collections outside a DBMS

Schema is not always available

Users move only small portions of a data collection at a time

It is possible that two users may exchange data once and never again

A Motivating Application

Peer-to-peer data sharing systems

Several formats (XML, CSV, ...)

Casual connections

[Diagram: two users, each with “My Collection”, share data over the Internet]

Example

Source collection – CSV format

Artist, Instrument, Album, Price
M. Davis, Trumpet, Kind of Blue, $7.97
L. Armstrong, Trumpet, On the Road, $5.98
J. Coltrane, Saxophone, Giant Steps, $10.99

translate according to

Target collection – XML format

…
<artist name="Miles Davis">
  <CD title="Kind of Blue" style="Instrumental">
    <song title="So What"/>
    <song title="All Blues"/>
  </CD>
</artist>
<artist name="Norah Jones">
  <CD title="Not Too Late"/>
</artist>
…

Schema NOT provided!

Target collection’s data is available!

Data Exchange is NOT Data Integration

Fagin et al. [Theor. Comp. Sci.’05], who laid down the foundations of the data exchange problem, wrote:

“A more significant difference between data exchange and data integration is that […] we have to actually materialize a finite target instance that best reflects the given source instance. In data integration no such exchange of data is required”

Data Exchange and Schema Matching

Clio [VLDB’02] translates data between databases once schemas are matched

Unlike Clio, our approach requires neither setup investment nor user intervention

Several solutions for matching schemas are discussed in Rahm and Bernstein’s survey [SIGMOD’01]

Most of them exploit schema information (e.g. labels), which does not work well in our setting, as our experiments show


FleDEx Framework

FleDEx Data Model (FDM): A minimalist generic hierarchical data model that captures essential features of XML and tabular data

Data Fitting: An algorithm for restructuring instances of our data model according to a target schema


FDM Instance is a Tree

Entities are represented by rounded rectangles

Attributes are textual nodes stemming out of entities

Attribute values are shown in italics
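A minimal Python sketch of an FDM-style instance, assuming an entity can be modelled as a type name, a set of textual attributes, and a list of nested entities. The class name `Entity` and its `render` helper are illustrative only and are not part of FleDEx.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """One node of an FDM-like tree: a typed entity with text attributes."""
    type: str
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def render(self, depth=0):
        """Return an indented, line-per-node text rendering of the subtree."""
        pad = "  " * depth
        lines = [f"{pad}[{self.type}]"]
        for name, value in self.attributes.items():
            lines.append(f"{pad}  {name} = {value}")
        for child in self.children:
            lines.extend(child.render(depth + 1))
        return lines


# Sample instance: an artist with one CD and two songs.
norah = Entity("artist", {"name": "Norah Jones"}, [
    Entity("CD", {"title": "Not Too Late"}, [
        Entity("song", {"track": "1", "title": "Wish I could"}),
        Entity("song", {"track": "2", "title": "Sinkin' soon"}),
    ]),
])
print("\n".join(norah.render()))
```

Nesting is the only relationship between entities in this sketch, which mirrors the tree shape described on this slide.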


FDM Schema

Boxes represent entity types

Ovals represent attributes

The arrows indicate the attributes of the entities and the way they can be nested

Hollow arrows indicate optional attributes

Converting XML to FDM

<artist name="Norah Jones">
  <CD title="Not Too Late">
    <song track="1">
      <title> Wish I could </title>
    </song>
    <song track="2">
      <title> Sinkin’ soon </title>
    </song>
  </CD>
</artist>
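One possible way to perform such a conversion, sketched with Python's standard `xml.etree.ElementTree`. The rule used here is an assumption, not taken from the slides: XML attributes and text-only child elements become FDM attributes, and every other element becomes a nested entity. The function name `to_fdm` and the nested-dict layout are illustrative.

```python
import xml.etree.ElementTree as ET

XML = """
<artist name="Norah Jones">
  <CD title="Not Too Late">
    <song track="1"><title> Wish I could </title></song>
    <song track="2"><title> Sinkin' soon </title></song>
  </CD>
</artist>
"""


def to_fdm(elem):
    """Convert an XML element into a nested dict: type, attributes, children."""
    entity = {"type": elem.tag, "attributes": dict(elem.attrib), "children": []}
    for child in elem:
        is_text_leaf = len(child) == 0 and not child.attrib
        if is_text_leaf:
            # Assumed rule: a text-only child element is treated as an attribute.
            entity["attributes"][child.tag] = (child.text or "").strip()
        else:
            entity["children"].append(to_fdm(child))
    return entity


fdm = to_fdm(ET.fromstring(XML.strip()))
songs = fdm["children"][0]["children"]
print([song["attributes"] for song in songs])
```

Under this rule each `song` entity ends up with both `track` and `title` as attributes, even though one was an XML attribute and the other a child element.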


The Data Fitting Algorithm

1. Find a mapping of corresponding attributes in source and target schemas

2. Translate instances using such a mapping

[Figure: source schema and target schema]

Similarity Components

Keyword-based similarity: attribute vocabularies, e.g. {Davis, Norah, …} vs. {Miles, Davis, …}

Value-based similarity: shared values, e.g. “Kind of Blue” vs. “Kind of Blue”

Label similarity: names of entities and attributes, e.g. artist/name vs. album/artist
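The three components can be illustrated with a rough Python sketch. The slides do not give the actual formulas, so Jaccard overlap is used here purely as a stand-in, and all function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def vocabulary(values):
    """Set of lower-cased words appearing in a list of attribute values."""
    return {word.lower() for value in values for word in value.split()}


def keyword_similarity(source_values, target_values):
    """Compare the vocabularies of two attributes (stand-in: Jaccard)."""
    return jaccard(vocabulary(source_values), vocabulary(target_values))


def value_similarity(source_values, target_values):
    """Compare whole values shared by two attributes (stand-in: Jaccard)."""
    return jaccard(source_values, target_values)


def label_similarity(source_label, target_label):
    """Compare the entity and attribute names in two label paths."""
    return jaccard(source_label.lower().split("/"),
                   target_label.lower().split("/"))


source_artists = ["M. Davis", "Norah Jones"]
target_artists = ["Miles Davis", "Norah Jones"]
print(keyword_similarity(source_artists, target_artists))
print(value_similarity(source_artists, target_artists))
print(label_similarity("artist/name", "album/artist"))
```

In this toy run the keyword score is higher than the value score, because "M. Davis" and "Miles Davis" share a word but are not the same value.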

A Bayesian Network for Combining Components

The OR operator:

K – Keyword
V – Value
C – Content
L – Label
F – Final similarity
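The transcript does not preserve the formula or the network figure. The sketch below therefore assumes the usual probabilistic (noisy) OR, and assumes from the legend that K and V combine into C, and that C and L combine into F. Both the wiring and the function names are assumptions.

```python
def prob_or(*probabilities):
    """Probabilistic OR: one minus the product of the complements."""
    result = 1.0
    for p in probabilities:
        result *= 1.0 - p
    return 1.0 - result


def final_similarity(keyword, value, label):
    """Combine component scores into a final score F."""
    # Assumed wiring: K and V feed C (content); C and L feed F (final).
    content = prob_or(keyword, value)
    return prob_or(content, label)


print(round(final_similarity(keyword=0.6, value=0.5, label=0.0), 3))
```

With an OR-style combination, one strong component is enough to produce a high final score, and a weak component never lowers it.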

Avoiding Redundancy

Consider translating: “genre→album” and “song→track”…

We have to repeat all tracks for each album’s style! (Cartesian product)
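A small Python illustration of the blow-up, using made-up data: the style "Jazz" and the variable names are hypothetical. If every genre value produces a target album and every song produces a track under it, the tracks are duplicated once per genre.

```python
from itertools import product

# Hypothetical source album with two styles and two songs as sibling values.
genres = ["Instrumental", "Jazz"]
songs = ["So What", "All Blues"]

# Mapping genre→album and song→track nests every song under every genre.
rows = list(product(genres, songs))
for genre, song in rows:
    print(genre, "/", song)
print(len(rows), "track entries for", len(songs), "distinct songs")
```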

Conflicts

To avoid that, we say that genre→album has a conflict with song→track.

Consequently, (a) has a conflict with both (b) and (c)

Solution: remove (a), or remove both (b) and (c)

Thus, we are looking for the best mapping without conflicts, which is an NP-complete optimization problem

Solving Conflicts

Let G(V, E) be a graph where V contains the entity pairs and E contains an edge for each conflict between two pairs

We want to remove the low-scoring entity pairs that produce conflicts

This is equivalent to finding a minimum vertex cover in G

We use a heuristic for approximate results

[Example: genre→album (score 0.5) is in conflict with song→track (score 0.9)]
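The slides do not say which heuristic is used. As one plausible sketch, the Python function below greedily drops the lower-scoring endpoint of each unresolved conflict edge. The function name `resolve_conflicts` and its input format are assumptions.

```python
def resolve_conflicts(scores, conflicts):
    """Greedily drop the lower-scoring pair of each remaining conflict.

    scores: dict mapping a candidate pair (e.g. "genre→album") to its score.
    conflicts: iterable of 2-tuples of pairs that cannot both be kept.
    Returns the set of pairs that survive. This is a heuristic and does
    not guarantee an optimal (minimum-weight) vertex cover.
    """
    removed = set()
    # Visit conflicts whose weaker endpoint is cheapest to drop first.
    ordered = sorted(conflicts, key=lambda e: min(scores[e[0]], scores[e[1]]))
    for a, b in ordered:
        if a in removed or b in removed:
            continue  # this conflict is already resolved
        removed.add(a if scores[a] <= scores[b] else b)
    return set(scores) - removed


scores = {"genre→album": 0.5, "song→track": 0.9}
conflicts = [("genre→album", "song→track")]
print(resolve_conflicts(scores, conflicts))
```

On the two-node example from this slide the sketch keeps song→track and discards genre→album, since 0.5 is the lower score.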

Final Attribute Mapping

The final mapping is an injective function with no conflicts


Translating Instances

This does not entail only relabeling but may also involve structural changes (e.g. different nesting)

First step: flatten data into a relation, where there is no particular nesting

Second step: create entities and attributes according to the target structure for each tuple

Example

[Figure: the original instance is flattened to a relation, which is then restructured into the translated instance]

Flattened relation:

artist        album.title    track.title    num
Norah Jones   Not Too Late   Wish I could   1
Norah Jones   Not Too Late   Sinkin’ soon   2
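The two translation steps can be sketched in Python as follows. This is an illustration under assumptions, not the authors' implementation: instances are nested dicts, the attribute mapping is hand-written, and the target nesting (albums containing tracks) is hypothetical.

```python
def flatten(entity, inherited=None):
    """Yield one flat row per leaf entity, carrying ancestor attributes."""
    row = dict(inherited or {})
    for name, value in entity["attributes"].items():
        row[f'{entity["type"]}.{name}'] = value
    if not entity["children"]:
        yield row
    for child in entity["children"]:
        yield from flatten(child, row)


def nest(rows, levels):
    """Rebuild a tree from rows; levels lists (type, columns) top-down."""
    if not levels:
        return []
    (etype, columns), rest = levels[0], levels[1:]
    groups = {}
    for row in rows:
        groups.setdefault(tuple(row[c] for c in columns), []).append(row)
    return [{"type": etype,
             "attributes": dict(zip(columns, key)),
             "children": nest(group, rest)}
            for key, group in groups.items()]


source = {"type": "artist", "attributes": {"name": "Norah Jones"}, "children": [
    {"type": "CD", "attributes": {"title": "Not Too Late"}, "children": [
        {"type": "song", "children": [],
         "attributes": {"track": "1", "title": "Wish I could"}},
        {"type": "song", "children": [],
         "attributes": {"track": "2", "title": "Sinkin' soon"}},
    ]},
]}

# Hypothetical attribute mapping (source column -> target column).
mapping = {"artist.name": "artist", "CD.title": "album.title",
           "song.title": "track.title", "song.track": "num"}

# Step 1: flatten the source instance and relabel its columns.
rows = [{mapping[k]: v for k, v in row.items()} for row in flatten(source)]
# Step 2: regroup the rows following the (assumed) target nesting.
target = nest(rows, [("album", ["album.title", "artist"]),
                     ("track", ["track.title", "num"])])
print(rows)
print(target)
```

Grouping rows on each level's columns is what lets the artist move from the root of the source tree to an attribute of the album in the target.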

Our Translation Process…

Preserves the semantics of the source instance:

Preserves ancestor-descendant relationships between source entities

Preserves sibling relationships between source attributes

Is unambiguous – there is a unique way of restructuring instances, since our simplistic data model relates entities through nesting only

Experiments

Goal: produce good mappings

Metric: F-measure (harmonic mean of precision and recall)

Datasets: [table not preserved in the transcript]
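For reference, the F-measure named above can be computed as in the short Python sketch below. The example mappings are invented for illustration and do not come from the experiments.

```python
def f_measure(predicted, correct):
    """Harmonic mean of precision and recall over sets of attribute pairs."""
    predicted, correct = set(predicted), set(correct)
    hits = len(predicted & correct)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(correct)
    return 2 * precision * recall / (precision + recall)


# Hypothetical mappings, written as (source attribute, target attribute).
correct = {("Artist", "artist/name"), ("Album", "CD/title"),
           ("Price", "CD/price")}
predicted = {("Artist", "artist/name"), ("Album", "CD/title"),
             ("Instrument", "CD/style")}
print(round(f_measure(predicted, correct), 3))
```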


Effectiveness of the Data Fitting Score

The combined score outperformed all individual scores

50 runs with 10 source entities each


Impact of the Size of Source Instance

Movies data collection with 20 runs for each plot

The combined method again outperformed the others, especially for smaller source instances


Impact of the Size of Target Instance

5 runs with 10 source entities each

Scores for simple collections are high regardless of their size

For more complex collections, the curve improves as the size increases

Resilience to Noise

20 runs with 10 movies in the source instance each

The combined similarity suffers the smallest relative drop, remaining almost perfect even when only 1/3 of the attributes have a match


Conclusion

Our method is particularly attractive for non-expert and casual users

It does not require the data to be stored in database systems, nor the use of special-purpose schema mapping tools

Our data model is simple yet powerful enough for the setting considered

Finally, extensive experimental results with real Web data showed that our approach is effective and very promising


Thank You!