bootstrapping pay-as-you-go data integration systems€¦ · outline 1 background 2 goal 3...

34
Bootstrapping Pay-As-You-Go Data Integration Systems Anish Das Sarma, Xin Dong, Alon Halevy Christan Grant [email protected]fl.edu University of Florida October 10, 2011 @cegme (University of Florida) pay-as-you-go data integration October 10, 2011 1 / 27

Upload: others

Post on 20-Sep-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

Bootstrapping Pay-As-You-Go Data Integration SystemsAnish Das Sarma, Xin Dong, Alon Halevy

Christan [email protected]

University of Florida

October 10, 2011

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 1 / 27

Page 2: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

very technical, don’t hate me

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 2 / 27

Page 3: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

Outline

1 Background

2 Goal

3 Motivation

4 Example

5 Overview

6 Definitions7 Generation8 Consolidation9 Experiments10 Conclusion11 Thank you

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 3 / 27

Page 4: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

Background — What is Data Integration

Data integration systems offer a single-point interface over datasources and possibly heterogeneous data.

I Represented by the triple {G ,S ,M}F G is the Global schemaF S is the set of source schemasF M is the mapping between G and S

Two main approaches:1 Global As View – G is a collection of mediated views2 Local As View – All local views S are changed to look like G

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 4 / 27

Page 5: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

Global As View

Simple interface that must be designed in advance

Mapping functions must be created for each source to global

Queries are reformulated over query to reach data sources

Results are retrieved and combined from data sources

Does not scale

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 5 / 27

Page 6: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

Goal

Give best effort answers over data sources while allowing theadministrator improve system in a pay-as-you-go fashion.

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 6 / 27

Page 7: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

Motivation

Many approaches automatically performed the mappings

Several plausible mappings may exist /

Idea: Lets make the mappings probabilistic and choose the mappingwith maximum entropy 1

1principle of maximum entropy is a postulate which states that, the probabilitydistribution which best represents the current state of knowledge is the one with largestentropy

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 7 / 27

Page 8: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

Example from overview

Consider source schema S1 and S2 both describing peopleI S1(name, hPhone, hAddr , oPhone, oAddr)I S2(name, phone, addr)I Possible mappings:

M1({name}, {phone, hPhone, oPhone}, {addr , hAddress, oAddress})M2({name}, {phone, hPhone}, {oPhone}, {addr , oAddress}, {hAddress})M3({name}, {phone, hPhone}, {oPhone}, {addr , hAddress}, {oA})M4({name}, {phone, oPhone}, {hPhone}, {addr , oA}, {hAddress})M5({name}, {phone}, {hPhone}, {oPhone}, {addr}, {hAddress}, {oA})

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 8 / 27

Page 9: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

Example from overview

Consider source schema S1 and S2 both describing peopleI S1(name, hPhone, hAddr , oPhone, oAddr)I S2(name, phone, addr)I Possible mappings:

M1({name}, {phone, hPhone, oPhone}, {addr , hAddress, oAddress})M2({name}, {phone, hPhone}, {oPhone}, {addr , oAddress}, {hAddress})M3({name}, {phone, hPhone}, {oPhone}, {addr , hAddress}, {oA})M4({name}, {phone, oPhone}, {hPhone}, {addr , oA}, {hAddress})M5({name}, {phone}, {hPhone}, {oPhone}, {addr}, {hAddress}, {oA})

M1

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 8 / 27

Page 10: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

Example from overview

Consider source schema S1 and S2 both describing peopleI S1(name, hPhone, hAddr , oPhone, oAddr)I S2(name, phone, addr)I Possible mappings:

M1({name}, {phone, hPhone, oPhone}, {addr , hAddress, oAddress})M2({name}, {phone, hPhone}, {oPhone}, {addr , oAddress}, {hAddress})M3({name}, {phone, hPhone}, {oPhone}, {addr , hAddress}, {oA})M4({name}, {phone, oPhone}, {hPhone}, {addr , oA}, {hAddress})M5({name}, {phone}, {hPhone}, {oPhone}, {addr}, {hAddress}, {oA})

M2

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 8 / 27

Page 11: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

Example from overview

Consider source schema S1 and S2 both describing peopleI S1(name, hPhone, hAddr , oPhone, oAddr)I S2(name, phone, addr)I Possible mappings:

M1({name}, {phone, hPhone, oPhone}, {addr , hAddress, oAddress})M2({name}, {phone, hPhone}, {oPhone}, {addr , oAddress}, {hAddress})M3({name}, {phone, hPhone}, {oPhone}, {addr , hAddress}, {oA})M4({name}, {phone, oPhone}, {hPhone}, {addr , oA}, {hAddress})M5({name}, {phone}, {hPhone}, {oPhone}, {addr}, {hAddress}, {oA})

M3

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 8 / 27

Page 12: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

Example from overview

Consider source schema S1 and S2 both describing peopleI S1(name, hPhone, hAddr , oPhone, oAddr)I S2(name, phone, addr)I Possible mappings:

M1({name}, {phone, hPhone, oPhone}, {addr , hAddress, oAddress})M2({name}, {phone, hPhone}, {oPhone}, {addr , oAddress}, {hAddress})M3({name}, {phone, hPhone}, {oPhone}, {addr , hAddress}, {oA})M4({name}, {phone, oPhone}, {hPhone}, {addr , oA}, {hAddress})M5({name}, {phone}, {hPhone}, {oPhone}, {addr}, {hAddress}, {oA})

M4

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 8 / 27

Page 13: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

Example from overview

Consider source schema S1 and S2 both describing peopleI S1(name, hPhone, hAddr , oPhone, oAddr)I S2(name, phone, addr)I Possible mappings:

M1({name}, {phone, hPhone, oPhone}, {addr , hAddress, oAddress})M2({name}, {phone, hPhone}, {oPhone}, {addr , oAddress}, {hAddress})M3({name}, {phone, hPhone}, {oPhone}, {addr , hAddress}, {oA})M4({name}, {phone, oPhone}, {hPhone}, {addr , oA}, {hAddress})M5({name}, {phone}, {hPhone}, {oPhone}, {addr}, {hAddress}, {oA})

M5

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 8 / 27

Page 14: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

Example from overview

Consider source schema S1 and S2 both describing peopleI S1(name, hPhone, hAddr , oPhone, oAddr)I S2(name, phone, addr)I Possible mappings:

M1({name}, {phone, hPhone, oPhone}, {addr , hAddress, oAddress})M2({name}, {phone, hPhone}, {oPhone}, {addr , oAddress}, {hAddress})M3({name}, {phone, hPhone}, {oPhone}, {addr , hAddress}, {oA})M4({name}, {phone, oPhone}, {hPhone}, {addr , oA}, {hAddress})M5({name}, {phone}, {hPhone}, {oPhone}, {addr}, {hAddress}, {oA})

M6

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 8 / 27

Page 15: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

Example from overview

Consider source schema S1 and S2 both describing peopleI S1(name, hPhone, hAddr , oPhone, oAddr)I S2(name, phone, addr)I Possible mappings:

M1({name}, {phone, hPhone, oPhone}, {addr , hAddress, oAddress})M2({name}, {phone, hPhone}, {oPhone}, {addr , oAddress}, {hAddress})M3({name}, {phone, hPhone}, {oPhone}, {addr , hAddress}, {oA})M4({name}, {phone, oPhone}, {hPhone}, {addr , oA}, {hAddress})M5({name}, {phone}, {hPhone}, {oPhone}, {addr}, {hAddress}, {oA})

No good mapping exist, even with probabilistic mappings.

We need both probabilistic mediated schema and semantic mappingsto that schema!

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 8 / 27

Page 16: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

Steps

1 Construct a probabilistic mediated schema

2 Find best probabilistic schema mappings

3 Create a single mediated schema to expose to the user

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 9 / 27

Page 17: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

The Architecture

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 10 / 27

Page 18: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

Deterministic Schema Mappings

We have our source schemas S = {S1, . . . ,Sn}Attributes of Si is attr(Si )

The set of all source attributes A = attr(S1) ∪ . . . ∪ attr(Sn) 2

Mediated attributes M = {A1, . . . ,An} where i , j ∈ [1,m], i 6= j ,Ai ∩ Aj = ∅

2Mediated attributes are not given ‘names’@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 11 / 27

Page 19: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

Probablistic Mediated Schema

Set of schemas S = {S1, . . . ,Sn}The p-med-schema for S is M = {(M1,Pr(M1)), . . . , (Ml ,Pr(Ml))}where:

I Mi is a mediated schema for SI each Mi is uniqueI Pr(Mi ) ∈ (0, 1], and

∑li=1 Pr(Mi ) = 1

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 12 / 27

Page 20: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

Probabilistic Schema Mapping

For a set S and a mediated schema M the probabilistic mapping ispM = {(m1,Pr(m1)), ..., (ml ,Pr(ml))}

I each mi is a unique schema mapping between S and MI Pr(mi ) ∈ (0, 1], and

∑li=1 Pr(mi ) = 1

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 13 / 27

Page 21: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

Querying Mediated Schema

Givena source schema Sa p-med-schema M = {(M1,Pr(M1)), ..., (Ml ,Pr(Ml))}a set of probabilistic mappings pM = {pM(M1), ..., pM(Ml)} whereeach pM(Mi ) is a mapping between S and Mi .

Let D be an instance of S , Q be a query, and t be a tuple

The probability t is an answer for Q is Pr(t|Mi ), r ∈ [1, l ] w.r.t Mi

and pM(Mi ).

Let p =∑l

i=1 Pr(t|Mi ) ∗ Pr(Mi )

We can say (t, p) is a by-table answer w.r.t M and pM if p > 0

the set of table by table answers is QM,pM(D).

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 14 / 27

Page 22: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

Consistency

Consistency Let M be a mediated schema for sources S1, . . . ,Sn . We sayM is consistent with a source schema Si , i ∈ [1, n], if there isno pair of attributes in Si that appear in the same cluster inM .

Consistent P-Mapping A p-mapping pM is consistent with a weightedcorrespondence Ci ,j between a pair of source and targetattributes if the sum of the probabilities of all mappingsm ∈ pM containing correspondence (i , j) equals pi ,j ; that is,

pi ,j =∑

m∈pM,(i ,j)∈m

Pr(m).

A p-mapping is consistent with a set of weightedcorrespondences C if it is consistent with each weightedcorrespondence C ∈ C.

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 15 / 27

Page 23: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

Generate all mediated schema

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 16 / 27

Page 24: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

Example p-med-schema

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 17 / 27

Page 25: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

We want to choose the mapping that does not favor any of the eventsover the others

We chose the one that does not introduce new information1 Enumerate all possible one to one mappings between S and M that

have a subset of correspondences m1, ...,ml .2 Assign probabilities

maximizel∑

k=1

−pk ∗ logpk

subject to

1 ∀k ∈ [1, l ], 0 ≤ pk ≤ 1,2

∑lk=1 = 1 , and

3 ∀i , j :∑

k∈[i,j],(i,j)∈mkpk = pi,j .

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 18 / 27

Page 26: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

P-Med-Schema Consolidation

Example M = {M1,M2}, where M1 contains three attributes{a1, a2, a3}, {a4}, and {a5, a6}, and M2 contains twoattributes {a2, a3, a4} and {a1, a5, a6}. The target schema Twould then contain four at- tributes: {a1}, {a2, a3}, {a4},and {a5, a6}.

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 19 / 27

Page 27: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

Consolidating P-Mappings

See paper ...

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 20 / 27

Page 28: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

Experiments

UDI Systems tales a set of data sources and creates mediated schemaand probabilistic schema

UDI accepts select-project queries 3

Each data source was in a MySQL table, query processor was writtenin Java.

Second String for JaroWinkler and pairwise similarity

Knitro to solve entropy maximization problem

Compared UDI with answers obtained through manual integration

3no joins because it is one table. They did not specify self-joins@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 21 / 27

Page 29: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

UDI versus Manual Integration

Goal Compare UDI vs manual integration

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 22 / 27

Page 30: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

Competing Automatic Approaches

MySQL Keyword Consider all data sources as text documents and applykeyword search techniques. Used MySQL’s Keyword SearchEngine.

SOURCE Answers the query on every data source with the appropriateattributes

TOPMAPPING Consolidate schema but only take the highest probabilitymediated schema.

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 23 / 27

Page 31: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

Quality of Mediated Schema

Domain Precision Recall F-measureMovie .97 .62 .76

Car .68 .88 .77People .76 .86 .81Course .83 .58 .68

Bib .77 .81 .79

Avg .802 .75 .762

Counted how many pairs of attributed were correctlyclustered

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 24 / 27

Page 32: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

Setup Efficiency

Setup time includes:

1 importing sourceschemas

2 creating p-med-schema

3 creating p-mappingbetween each sourceschema and eachpossible mediatedschema

4 consolidating thep-med-schema andp-mappings

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 25 / 27

Page 33: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

Conclusion

Demonstrated that it is possible to get accurate data integrationworking automatically

In the future authors consider more than one table, normalization ofmediated schemas, modeling data integration over time. Also, addinguserfeed back into the loop (Pay-as-you-go User feedback)

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 26 / 27

Page 34: Bootstrapping Pay-As-You-Go Data Integration Systems€¦ · Outline 1 Background 2 Goal 3 Motivation 4 Example 5 Overview 6 De nitions 7 Generation 8 Consolidation 9 Experiments

Thank You!

@cegme (University of Florida) pay-as-you-go data integration October 10, 2011 27 / 27