Bootstrapping Pay-As-You-Go Data Integration Systems
Anish Das Sarma, Xin Dong, Alon Halevy
Christan [email protected]
University of Florida
October 10, 2011
@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 1 / 27
very technical, don’t hate me
Outline
1 Background
2 Goal
3 Motivation
4 Example
5 Overview
6 Definitions
7 Generation
8 Consolidation
9 Experiments
10 Conclusion
11 Thank you
Background — What is Data Integration
Data integration systems offer a single-point interface over multiple, possibly heterogeneous, data sources.
- Represented by the triple {G, S, M}
  - G is the global schema
  - S is the set of source schemas
  - M is the mapping between G and S
Two main approaches:
1 Global As View (GAV): G is defined as a collection of mediated views over the sources
2 Local As View (LAV): each source in S is described as a view over G
Global As View
Simple interface that must be designed in advance
Mapping functions must be created for each source to global
Queries over the global schema are reformulated to reach the data sources
Results are retrieved and combined from data sources
Does not scale
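A minimal sketch of the GAV steps listed above, assuming each mapping is nothing more than an attribute rename per source. The source names, attribute names, and the `answer` helper are hypothetical and only illustrate the reformulate, retrieve, combine loop.

```python
# Toy data sources (all names and rows are illustrative assumptions).
SOURCES = {
    "s1": [{"name": "Alice", "hPhone": "123"}],
    "s2": [{"name": "Bob", "phone": "456"}],
}

# One hand-written mapping per source: global attribute -> source attribute.
MAPPINGS = {
    "s1": {"name": "name", "phone": "hPhone"},
    "s2": {"name": "name", "phone": "phone"},
}


def answer(select_attrs):
    """Reformulate a projection over the global schema for each source,
    retrieve the rows, and combine them into one result list."""
    results = []
    for source, rows in SOURCES.items():
        mapping = MAPPINGS[source]
        for row in rows:
            results.append(tuple(row[mapping[attr]] for attr in select_attrs))
    return results


print(answer(["name", "phone"]))
```

Every new source needs another hand-written entry in `MAPPINGS`, which is where the scaling problem shows up.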
Goal
Give best-effort answers over data sources while allowing the administrator to improve the system in a pay-as-you-go fashion.
Motivation
Many approaches perform the mappings automatically
Several plausible mappings may exist
Idea: Let's make the mappings probabilistic and choose the mapping with maximum entropy [1]
[1] The principle of maximum entropy is a postulate stating that the probability distribution which best represents the current state of knowledge is the one with the largest entropy.
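To make the maximum-entropy idea concrete, here is a small sketch that computes Shannon entropy for three made-up distributions over four candidate mappings. The distributions are illustrative assumptions, not numbers from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; terms with p == 0 contribute nothing."""
    return sum(-p * math.log(p) for p in probs if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]  # commits to nothing
skewed = [0.7, 0.1, 0.1, 0.1]       # favors one mapping
certain = [1.0, 0.0, 0.0, 0.0]      # a single deterministic mapping

for label, dist in [("uniform", uniform), ("skewed", skewed), ("certain", certain)]:
    print(label, round(entropy(dist), 3))
```

The uniform distribution has the largest entropy and the deterministic one the smallest, so picking the maximum-entropy distribution means adding no preference beyond what the constraints force.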
Example from overview
Consider source schemas S1 and S2, both describing people
- S1(name, hPhone, hAddr, oPhone, oAddr)
- S2(name, phone, addr)
- Possible mediated schemas (clusterings of the source attributes):
  M1({name}, {phone, hPhone, oPhone}, {addr, hAddr, oAddr})
  M2({name}, {phone, hPhone}, {oPhone}, {addr, oAddr}, {hAddr})
  M3({name}, {phone, hPhone}, {oPhone}, {addr, hAddr}, {oAddr})
  M4({name}, {phone, oPhone}, {hPhone}, {addr, oAddr}, {hAddr})
  M5({name}, {phone}, {hPhone}, {oPhone}, {addr}, {hAddr}, {oAddr})
No single mediated schema is good, even with probabilistic mappings.
We need both a probabilistic mediated schema and semantic mappings to that schema!
Steps
1 Construct a probabilistic mediated schema
2 Find best probabilistic schema mappings
3 Create a single mediated schema to expose to the user
The Architecture
Deterministic Schema Mappings
We have our source schemas $S = \{S_1, \ldots, S_n\}$
The attributes of $S_i$ are $attr(S_i)$
The set of all source attributes is $A = attr(S_1) \cup \ldots \cup attr(S_n)$
A mediated schema is a set of mediated attributes $M = \{A_1, \ldots, A_m\}$ [2], where each $A_i \subseteq A$ and, for all $i, j \in [1, m]$ with $i \neq j$, $A_i \cap A_j = \emptyset$
[2] Mediated attributes are not given 'names'
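A short sketch of the definition above: a mediated schema is a collection of pairwise disjoint clusters of source attributes. It assumes attributes are plain strings and that same-named attributes from different sources (here `name`) count as one attribute. The helper name `is_mediated_schema` is hypothetical.

```python
def is_mediated_schema(clusters, source_attrs):
    """True if the clusters are non-empty, pairwise disjoint subsets of source_attrs."""
    seen = set()
    for cluster in clusters:
        if not cluster or not cluster <= source_attrs or cluster & seen:
            return False
        seen |= cluster
    return True


S1 = {"name", "hPhone", "hAddr", "oPhone", "oAddr"}
S2 = {"name", "phone", "addr"}
A = S1 | S2  # the set of all source attributes

M3 = [{"name"}, {"phone", "hPhone"}, {"oPhone"}, {"addr", "hAddr"}, {"oAddr"}]
overlapping = [{"name", "phone"}, {"phone", "hPhone"}]  # "phone" appears twice

print(is_mediated_schema(M3, A))
print(is_mediated_schema(overlapping, A))
```

The clusters are not required to cover all of A here; that reading of the definition is an assumption of this sketch.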
Probabilistic Mediated Schema
Set of schemas $S = \{S_1, \ldots, S_n\}$
The p-med-schema for $S$ is $M = \{(M_1, Pr(M_1)), \ldots, (M_l, Pr(M_l))\}$ where:
- $M_i$ is a mediated schema for $S$
- each $M_i$ is unique
- $Pr(M_i) \in (0, 1]$, and $\sum_{i=1}^{l} Pr(M_i) = 1$
Probabilistic Schema Mapping
For a set $S$ and a mediated schema $M$, the probabilistic mapping is $pM = \{(m_1, Pr(m_1)), \ldots, (m_l, Pr(m_l))\}$
- each $m_i$ is a unique schema mapping between $S$ and $M$
- $Pr(m_i) \in (0, 1]$, and $\sum_{i=1}^{l} Pr(m_i) = 1$
Querying Mediated Schema
Given
- a source schema $S$
- a p-med-schema $M = \{(M_1, Pr(M_1)), \ldots, (M_l, Pr(M_l))\}$
- a set of probabilistic mappings $pM = \{pM(M_1), \ldots, pM(M_l)\}$, where each $pM(M_i)$ is a p-mapping between $S$ and $M_i$
Let $D$ be an instance of $S$, $Q$ be a query, and $t$ be a tuple
The probability that $t$ is an answer to $Q$ w.r.t. $M_i$ and $pM(M_i)$ is $Pr(t \mid M_i)$, $i \in [1, l]$
Let $p = \sum_{i=1}^{l} Pr(t \mid M_i) \cdot Pr(M_i)$
We say $(t, p)$ is a by-table answer w.r.t. $M$ and $pM$ if $p > 0$
The set of by-table answers is $Q_{M,pM}(D)$
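The by-table probability is a weighted sum over the candidate mediated schemas. The sketch below assumes the per-schema probabilities Pr(t | M_i) have already been computed from the p-mappings. The schema labels and numbers are made up for illustration.

```python
def by_table_probability(p_med_schema, pr_t_given_schema):
    """p = sum_i Pr(t | M_i) * Pr(M_i) for one candidate answer tuple t."""
    total = sum(pr for _, pr in p_med_schema)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("Pr(M_i) must sum to 1")
    return sum(pr_t_given_schema[label] * pr for label, pr in p_med_schema)


# Hypothetical p-med-schema with two equally likely mediated schemas.
p_med_schema = [("M3", 0.5), ("M4", 0.5)]
# Hypothetical probability that tuple t answers Q under each schema.
pr_t_given_schema = {"M3": 0.8, "M4": 0.2}

p = by_table_probability(p_med_schema, pr_t_given_schema)
print(round(p, 3))  # 0.8 * 0.5 + 0.2 * 0.5
```

Since p > 0 here, (t, p) would be reported as a by-table answer.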
Consistency
Consistency: Let $M$ be a mediated schema for sources $S_1, \ldots, S_n$. We say $M$ is consistent with a source schema $S_i$, $i \in [1, n]$, if there is no pair of attributes in $S_i$ that appear in the same cluster in $M$.
Consistent P-Mapping: A p-mapping $pM$ is consistent with a weighted correspondence $C_{i,j}$ between a pair of source and target attributes if the sum of the probabilities of all mappings $m \in pM$ containing correspondence $(i, j)$ equals $p_{i,j}$; that is,
$$p_{i,j} = \sum_{m \in pM,\ (i,j) \in m} Pr(m).$$
A p-mapping is consistent with a set of weighted correspondences $\mathbf{C}$ if it is consistent with each weighted correspondence $C \in \mathbf{C}$.
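Both consistency definitions translate into short checks. This is a sketch under simple assumptions: clusters are Python sets, a mapping is a set of (source attribute, mediated attribute) pairs, and the weights below are invented for illustration.

```python
def schema_consistent_with_source(clusters, source_attrs):
    """No cluster may contain two attributes of the same source."""
    return all(len(cluster & source_attrs) <= 1 for cluster in clusters)


def pmapping_consistent(p_mapping, weighted_corrs, tol=1e-9):
    """For each correspondence (i, j) with weight p_ij, the probabilities of
    the mappings that contain (i, j) must sum to p_ij."""
    for corr, weight in weighted_corrs.items():
        mass = sum(pr for mapping, pr in p_mapping if corr in mapping)
        if abs(mass - weight) > tol:
            return False
    return True


S1 = {"name", "hPhone", "hAddr", "oPhone", "oAddr"}
M1 = [{"name"}, {"phone", "hPhone", "oPhone"}, {"addr", "hAddr", "oAddr"}]
M3 = [{"name"}, {"phone", "hPhone"}, {"oPhone"}, {"addr", "hAddr"}, {"oAddr"}]
print(schema_consistent_with_source(M1, S1))  # hPhone and oPhone share a cluster
print(schema_consistent_with_source(M3, S1))

# Hypothetical weighted correspondences and two candidate p-mappings.
corrs = {("hPhone", "phone"): 0.6, ("oPhone", "phone"): 0.4}
good = [({("hPhone", "phone")}, 0.6), ({("oPhone", "phone")}, 0.4)]
bad = [({("hPhone", "phone")}, 0.5), ({("oPhone", "phone")}, 0.5)]
print(pmapping_consistent(good, corrs), pmapping_consistent(bad, corrs))
```

M1 fails the first check because it merges the home and office attributes of S1, while M3 passes.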
Generate all mediated schemas
Example p-med-schema
We want to choose the mapping that does not favor any of the events over the others
We choose the one that does not introduce new information
1 Enumerate all possible one-to-one mappings between $S$ and $M$ that contain a subset of the correspondences; denote them $m_1, \ldots, m_l$.
2 Assign probabilities $p_1, \ldots, p_l$ by solving
maximize $\sum_{k=1}^{l} -p_k \log p_k$
subject to
  1 $\forall k \in [1, l]$: $0 \le p_k \le 1$,
  2 $\sum_{k=1}^{l} p_k = 1$, and
  3 $\forall i, j$: $\sum_{k \in [1,l],\ (i,j) \in m_k} p_k = p_{i,j}$.
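A toy instance of the optimization above, small enough to solve by brute force. It assumes two weighted correspondences a and b that do not conflict, with made-up weights 0.6 and 0.5. A plain grid scan stands in for a real nonlinear solver, so this is an illustration of the problem rather than the authors' solver setup.

```python
import math


def entropy(probs):
    """Objective: sum of -p_k * log(p_k), skipping zero terms."""
    return sum(-p * math.log(p) for p in probs if p > 0)


# Hypothetical weights p_a and p_b of two non-conflicting correspondences.
P_A, P_B = 0.6, 0.5

# Step 1: the one-to-one mappings built from subsets of {a, b} are
#   m1 = {a, b}, m2 = {a}, m3 = {b}, m4 = {}.


def distribution(x):
    """Constraints 2 and 3 leave a single free parameter x = Pr(m1)."""
    return [x, P_A - x, P_B - x, 1.0 - P_A - P_B + x]


# Step 2: scan x on a grid, keep the feasible points (constraint 1),
# and pick the one with the largest entropy.
candidates = [i / 1000 for i in range(1001)]
feasible = [x for x in candidates if min(distribution(x)) >= -1e-12]
best_x = max(feasible, key=lambda x: entropy(distribution(x)))

print(best_x, [round(p, 3) for p in distribution(best_x)])
```

Setting the derivative of the objective to zero gives x = 0.3 analytically, which is the product of the two weights: with no conflicts, the maximum-entropy solution treats the correspondences as independent.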
P-Med-Schema Consolidation
Example: $M = \{M_1, M_2\}$, where $M_1$ contains three attributes $\{a_1, a_2, a_3\}$, $\{a_4\}$, and $\{a_5, a_6\}$, and $M_2$ contains two attributes $\{a_2, a_3, a_4\}$ and $\{a_1, a_5, a_6\}$. The target schema $T$ would then contain four attributes: $\{a_1\}$, $\{a_2, a_3\}$, $\{a_4\}$, and $\{a_5, a_6\}$.
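The example can be reproduced with a few lines of code, assuming the consolidation rule is that two source attributes share a cluster in T exactly when they share a cluster in every M_i. The `consolidate` helper is a hypothetical name for this sketch.

```python
from collections import defaultdict


def consolidate(mediated_schemas):
    """Group attributes that fall in the same cluster in every mediated schema."""
    attrs = set().union(*(c for schema in mediated_schemas for c in schema))
    groups = defaultdict(set)
    for attr in attrs:
        # Signature: index of the cluster holding attr in each schema
        # (None if the schema does not contain attr).
        signature = tuple(
            next((i for i, c in enumerate(schema) if attr in c), None)
            for schema in mediated_schemas
        )
        groups[signature].add(attr)
    return sorted(sorted(group) for group in groups.values())


M1 = [{"a1", "a2", "a3"}, {"a4"}, {"a5", "a6"}]
M2 = [{"a2", "a3", "a4"}, {"a1", "a5", "a6"}]

print(consolidate([M1, M2]))
# [['a1'], ['a2', 'a3'], ['a4'], ['a5', 'a6']]
```

The result is the coarsest clustering that refines both M1 and M2, matching the four attributes of T listed in the example.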
Consolidating P-Mappings
See paper ...
Experiments
The UDI system takes a set of data sources and creates a mediated schema and probabilistic schema mappings
UDI accepts select-project queries [3]
Each data source was stored in a MySQL table; the query processor was written in Java
SecondString for Jaro-Winkler pairwise attribute similarity
Knitro to solve the entropy maximization problem
Compared UDI with answers obtained through manual integration
[3] No joins, because it is one table. They did not specify self-joins.
UDI versus Manual Integration
Goal Compare UDI vs manual integration
Competing Automatic Approaches
MySQL Keyword: Consider all data sources as text documents and apply keyword search techniques. Used MySQL's Keyword Search Engine.
SOURCE: Answers the query on every data source with the appropriate attributes
TOPMAPPING: Consolidate schema but only take the highest probability mediated schema.
Quality of Mediated Schema
Domain   Precision  Recall  F-measure
Movie    .97        .62     .76
Car      .68        .88     .77
People   .76        .86     .81
Course   .83        .58     .68
Bib      .77        .81     .79
Avg      .802       .75     .762

Counted how many pairs of attributes were correctly clustered
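A sketch of how pairwise precision, recall, and F-measure can be computed for a clustering, based only on the one-line description above. The exact evaluation protocol is not given here, so the definitions, the empty-case conventions, and the gold clustering below are assumptions.

```python
from itertools import combinations


def clustered_pairs(clusters):
    """All unordered attribute pairs that share a cluster."""
    return {frozenset(pair) for c in clusters for pair in combinations(sorted(c), 2)}


def pairwise_scores(predicted, gold):
    """Precision, recall, and F-measure over co-clustered attribute pairs."""
    pred, truth = clustered_pairs(predicted), clustered_pairs(gold)
    correct = len(pred & truth)
    precision = correct / len(pred) if pred else 1.0
    recall = correct / len(truth) if truth else 1.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical gold standard and system output for the people example.
gold = [{"phone", "hPhone", "oPhone"}, {"addr", "hAddr"}]
predicted = [{"phone", "hPhone"}, {"oPhone"}, {"addr", "hAddr", "oAddr"}]

print(pairwise_scores(predicted, gold))
```

Here two of the four predicted pairs are correct and two of the four gold pairs are found, so all three scores come out at 0.5.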
Setup Efficiency
Setup time includes:
1 importing source schemas
2 creating p-med-schema
3 creating p-mapping between each source schema and each possible mediated schema
4 consolidating the p-med-schema and p-mappings
Conclusion
Demonstrated that it is possible to get accurate data integrationworking automatically
In the future, the authors will consider more than one table, normalization of mediated schemas, and modeling data integration over time. Also, adding user feedback into the loop (pay-as-you-go user feedback).
Thank You!