TRANSCRIPT
FleDEx: Flexible Data Exchange
Eli Cortez, Altigran Silva
Federal University of Amazonas, Brazil
WIDM’07
Filipe Mesquita, Denilson Barbosa
University of Alberta, Canada
The Data Exchange Problem is…
…translating data from a source schema to a target schema
[Diagram: original data under a source schema is translated into data under a target schema]
Existing Solutions are Complex for Non-Experts
Data are kept in databases
Tools such as Clio are used to help the translation from one database schema to another
Non-experts have neither the skills nor the resources to set up databases and use mapping tools
[Diagram: SourceDB → TargetDB]
“Data Exchange for the Masses”
We propose a lightweight framework for non-experts to share data, where:
Data are kept in collections outside a DBMS
Schema is not always available
Users move only small portions of a data collection at a time
It is possible that two users may exchange data once and never again
A Motivating Application
Peer-to-peer data sharing systems
Several formats (XML, CSV, ...)
Casual connections
[Diagram: users share their collections (“My Collection”) over the Internet]
Example
Source collection – CSV format:

  Artist, Instrument, Album, Price
  M. Davis, Trumpet, Kind of Blue, $7.97
  L. Armstrong, Trumpet, On the Road, $5.98
  J. Coltrane, Saxophone, Giant Steps, $10.99

Target collection – XML format:

  …
  <artist name="Miles Davis">
    <CD title="Kind of Blue" style="Instrumental">
      <song title="So What"/>
      <song title="All Blues"/>
    </CD>
  </artist>
  <artist name="Norah Jones">
    <CD title="Not Too Late"/>
  </artist>
  …

translate according to:
Schema NOT provided!
Target collection’s data is available!
Data Exchange is NOT Data Integration Fagin et al. [Theor. Comp. Sci.’05], who laid
down the foundations of the data exchange problem, wrote:
“A more significant difference between data exchange and data integration is that […] we have to actually materialize a finite target instance that best reflects the given source instance. In data integration no such exchange of data is required”
Data Exchange and Schema Matching
Clio [VLDB’02] translates data between databases once schemas are matched
Unlike Clio, our approach requires no setup investment or user intervention
Several solutions for matching schemas are discussed in Rahm and Bernstein’s survey [SIGMOD’01]
Most of them exploit schema information (e.g. labels), which does not work well in our setting, as our experiments show
FleDEx Framework
FleDEx Data Model (FDM): A minimalist generic hierarchical data model that captures essential features of XML and tabular data
Data Fitting: An algorithm for restructuring instances of our data model according to a target schema
FDM Instance is a Tree
Entities are represented by round rectangles
Attributes are textual nodes stemming out of entities
Attribute values are shown in italics
FDM Schema
Boxes represent entity types
Ovals represent attributes
The arrows indicate the attributes of the entities and the way they can be nested
Hollow arrows indicate optional attributes
Converting XML to FDM
<artist name="Norah Jones">
  <CD title="Not Too Late">
    <song track="1">
      <title>Wish I could</title>
    </song>
    <song track="2">
      <title>Sinkin’ soon</title>
    </song>
  </CD>
</artist>
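A conversion like this can be sketched in a few lines of Python. The dict-based tree below is an illustrative stand-in for an FDM instance (the field names `entity`, `attributes`, and `children` are my own, not the paper's): text-only children become attributes of the enclosing entity, and everything else becomes a nested entity.

```python
import xml.etree.ElementTree as ET

def xml_to_fdm(elem):
    """Convert an XML element into a generic FDM-like tree:
    an entity holds its attributes and its nested entities."""
    node = {"entity": elem.tag, "attributes": dict(elem.attrib), "children": []}
    for child in elem:
        # A childless, attribute-free element with only text is
        # treated as an attribute of this entity; anything else
        # becomes a nested entity.
        if len(child) == 0 and not child.attrib and child.text and child.text.strip():
            node["attributes"][child.tag] = child.text.strip()
        else:
            node["children"].append(xml_to_fdm(child))
    return node

xml = """<artist name="Norah Jones">
  <CD title="Not Too Late">
    <song track="1"><title>Wish I could</title></song>
    <song track="2"><title>Sinkin' soon</title></song>
  </CD>
</artist>"""

fdm = xml_to_fdm(ET.fromstring(xml))
```

Here `artist` and `CD` become entities, while each song's `<title>` is folded into the song entity as an attribute, mirroring the instance tree shown on the slide.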
The Data Fitting Algorithm
1. Find a mapping of corresponding attributes in source and target schemas
2. Translate instances using such a mapping
Source schema Target schema
Similarity Components
Keyword-based similarity: attribute vocabularies, e.g. {Davis, Norah, …} vs. {Miles, Davis, …}
Value-based similarity: shared values, e.g. “Kind of Blue” vs. “Kind of Blue”
Label similarity: names of entities and attributes, e.g. artist/name vs. album/artist
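Each of these components boils down to comparing sets of tokens or values. As a minimal sketch, set overlap can be measured with Jaccard similarity (an assumption for illustration; the transcript does not give the paper's exact scoring functions):

```python
def jaccard(a, b):
    """Jaccard similarity between two sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Keyword-based: compare attribute vocabularies.
kw_sim = jaccard({"davis", "norah", "jones"}, {"miles", "davis", "jones"})
# 2 shared tokens out of 4 distinct -> 0.5

# Value-based: compare full attribute values.
val_sim = jaccard({"Kind of Blue"}, {"Kind of Blue"})  # identical -> 1.0
```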
A Bayesian Network for Combining Components
The OR operator:
K – Keyword, V – Value, C – Content, L – Label, F – Final similarity
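The OR node's formula is not captured in this transcript. A common way a Bayesian network OR node combines independent evidence is noisy-OR, sketched below purely as an assumption, not as the paper's exact model: the final similarity is high if any component is high.

```python
def noisy_or(scores):
    """Noisy-OR combination of independent similarity scores:
    F = 1 - prod(1 - s_i)."""
    p = 1.0
    for s in scores:
        p *= (1.0 - s)
    return 1.0 - p

# Combining e.g. keyword, value, and label similarities:
f = noisy_or([0.5, 0.8, 0.2])  # 1 - 0.5*0.2*0.8 = 0.92
```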
Avoiding Redundancy
Consider translating “genre→album” and “song→track”…
We have to repeat all tracks for each album’s style! (Cartesian product)

Conflicts
To avoid that, we say that genre→album has a conflict with song→track.
Consequently, (a) has a conflict with both (b) and (c)
Solution: remove (a) or (b),(c)
Thus, we are looking for the best mapping without conflicts, which is an NP-complete optimization problem
Solving Conflicts
Let G(V,E) be a graph where V contains entity pairs and E contains the conflicts between them as edges
We want to remove entity pairs with low score that produce conflicts
This is equivalent to finding a minimum vertex cover in G
We use a heuristic for approximate results
[Diagram: genre→album (0.5) —conflict— song→track (0.9)]
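A minimal greedy sketch of this idea (illustrative only, not necessarily the paper's exact heuristic): while conflict edges remain among the surviving mappings, drop the lower-scoring endpoint of one, which covers that edge.

```python
def resolve_conflicts(scores, conflicts):
    """Greedy vertex-cover-style heuristic.
    scores: maps each candidate entity-pair mapping to its similarity.
    conflicts: list of (mapping, mapping) conflict edges.
    Returns the surviving, conflict-free set of mappings."""
    kept = set(scores)
    while True:
        live = [(a, b) for a, b in conflicts if a in kept and b in kept]
        if not live:
            return kept
        a, b = live[0]
        # Removing the cheaper endpoint covers this conflict edge.
        kept.discard(a if scores[a] <= scores[b] else b)

scores = {"genre->album": 0.5, "song->track": 0.9}
kept = resolve_conflicts(scores, [("genre->album", "song->track")])
# the higher-scoring mapping song->track survives
```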
Final Attribute Mapping
The final attribute mapping is an injective function with no conflicts
Translating Instances
This entails not only relabeling but may also involve structural changes (e.g. different nesting)
First step: flatten data into a relation, where there is no particular nesting
Second step: create entities and attributes according to the target structure for each tuple
Example
[Diagram: original instance → flattened relation → translated instance]
Flattened relation:

  artist        album.title    track.title     num
  Norah Jones   Not Too Late   Wish I could    1
  Norah Jones   Not Too Late   Sinkin’ soon    2
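The two steps above can be sketched as follows, using dict-based records as a stand-in for FDM instances (the field and column names follow the example; the helper functions are illustrative):

```python
def flatten(artist):
    """Step 1: flatten a nested source instance into a flat relation."""
    rows = []
    for cd in artist["CDs"]:
        for song in cd["songs"]:
            rows.append({"artist": artist["name"],
                         "album.title": cd["title"],
                         "track.title": song["title"],
                         "num": song["track"]})
    return rows

def restructure(rows):
    """Step 2: rebuild entities and attributes according to the
    target nesting (album -> tracks) from the flat relation."""
    albums = {}
    for r in rows:
        key = (r["artist"], r["album.title"])
        albums.setdefault(key, []).append(
            {"title": r["track.title"], "num": r["num"]})
    return [{"artist": a, "album": t, "tracks": tr}
            for (a, t), tr in albums.items()]

src = {"name": "Norah Jones",
       "CDs": [{"title": "Not Too Late",
                "songs": [{"track": "1", "title": "Wish I could"},
                          {"track": "2", "title": "Sinkin' soon"}]}]}
rows = flatten(src)
target = restructure(rows)
```

Grouping the flat tuples back up by the target's nesting keys is what makes the restructuring unambiguous here.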
Our Translation Process…
Preserves the semantics of the source instance:
  Preserves ancestor-descendant relationships between source entities
  Preserves sibling relationships between source attributes
Is unambiguous – there is a unique way of restructuring instances, since our simplistic data model relates entities through nesting only
Experiments
Goal: produce good mappings
Metric: F-measure (harmonic mean of precision and recall)
Datasets:
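The quality metric is the standard F-measure; as a quick reference:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f1 = f_measure(0.8, 0.5)  # 0.8 / 1.3, roughly 0.615
```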
Effectiveness of the Data Fitting Score
The combined score outperformed all individual scores
50 runs with 10 source entities each
Impact of the Size of Source Instance
Movies data collection with 20 runs for each plot
The combined method again outperformed the others, especially for smaller source instances
Impact of the Size of Target Instance
5 runs with 10 source entities each
Plots for simple collections are high regardless of their size
For more complex collections, the curve improves as the size increases
Resilience to noise
20 runs with 10 movies in the source instance each
The combined similarity suffers the least relative drop, remaining almost perfect even when only 1/3 of the attributes have a match
Conclusion
Our method is particularly attractive for non-expert and casual users
It does not require the data to be stored in database systems, nor the use of special-purpose schema mapping tools
Our data model is simple yet powerful enough for the setting considered
Finally, extensive experimental results with real Web data showed that our approach is effective and very promising
Thank You!