paolo papotti, università roma tre. data exchange: a five minute introduction what are the...
Post on 21-Dec-2015
215 views
TRANSCRIPT
2
Data Exchange: A Five Minute IntroductionWhat are the problems?
Problem 1: To Generate Good SolutionsProblem 2: To Generate Mappings
Core QuestionsThe Revolution: A Rewriting Algorithm for
Core Mappings+Spicy Mapping Tool
Outline
3
Problemmoving data across repositoriesInput: one or more source databases and one
target schemaOutput: one instance of the target database
Why this is an important problem ?it is an enabling technology for a very large
number of applicationsETL, EII, EAI, data fusion, schema evolution, …
A Five Minute Introduction
Data Exchange
Data Exchange
4
Example: Books
Data Exchange: A Five Minute Introduction
Data Exchange: A Five Minute Introduction
Publisher [0..*]
idname
Target DBBook [0..*]
titlepubId
Source DB1: Internet Book DatabaseIBDBook [0..*]
titletitleThe HobbitThe Da Vinci CodeThe Lord of the Rings
Source DB2: Library of CongressLOC [0..*]
titlepublisher
title publisherThe Lord of the Rings Houghton The Catcher in the Rye Lb Books
Source DB3: Internet Book ListIBLBook [0..*]
titlepublisherId
IBLPublisher [0..*]
idname
id name245 Ballantine776 Lb Books
title pubIdThe Hobbit 245The Catcher in the Rye 776
5
Many possible solutionsprocedural code (es: Java program)ad hoc scriptETL script …
We need a more principled way to do that[Haas, ICDT 2007]
A clean way to do it: by using logic
Moving Data to the Target
Data Exchange: A Five Minute Introduction
Data Exchange: A Five Minute Introduction
6
Source-to-Target TGD (GLAV mappings)logical formula used to constrain how data should
appear in the target based on data in the sourceExamples
m1 : ∀ t, i : IBLBook (t, i) → Book(t, i)
m2 : ∀ i, p : IBLPublisher(i, p) → Publisher (i, p)
m3 : ∀ t: IBDBook(t) → ∃N: Book(t, N)
m4 : ∀ t, p: LOC(t, p) → ∃N: Book(t, N)Publisher (N,p)
Tuple Generating Dependencies
Data Exchange: A Five Minute Introduction
Data Exchange: A Five Minute Introduction
7
How to handle existentially quantified variables
Labeled nulls N1 N2 ... Nk ...
identified variables in the target instance to satisfy existential quantifiers
some are pure nulls, others correlate tuplese.g. LOC(t, p) → ∃N0: Book(t, N0) Publisher (N0, p)
Can be generated using Skolem functions [>>] N0: sk(Book(A:t),Publisher(B:p))
Labeled Nulls
Data Exchange: A Five Minute Introduction
Data Exchange: A Five Minute Introduction
8
Target TGDlogical formula used to enforce inclusion
constraints on the targetExample: ∀ t, i : Book (t, i) → ∃N:
Publisher(i, N)Target EGD
logical formula used to enforce functional-dependencies on the target
Example: ∀ t, i, i’ : Book (t, i), Book (t, i’)→ (i = i’)
How to model schema constraints?
Data Exchange: A Five Minute Tutorial
Data Exchange: A Five Minute Tutorial
9
M = <S, T, Sst, St >S : source schema, T: target schema Sst : source-to-target tgds St : target tgds and target egds
TGD : ∀ X: Φ(X) → Ǝ Y : Ψ(X, Y)EGD : ∀ X: Φ(X) → (xi = xj)Φ(X), Ψ(X, Y)
conjunctions of atoms of the form R(X), R(X, Y)∀ t, p: LOC(t, p) → ∃N: Book(t, N) Publisher (N, p)
Data Exchange Scenario
Data Exchange: A Five Minute Introduction
Data Exchange: A Five Minute Introduction
10
Given:a data exchange scenario, M = <S, T, Sst, St >an instance of the source, I
Generate:a solution, i.e.,an instance of the target, J, such that I and J
satisfy the constraints in Sst and J satisfies the constraints in St
Intuitionconstraints are not used to check properties,
but to generate or modify tuples
Data Exchange Problem
Data Exchange: A Five Minute Introduction
Data Exchange: A Five Minute Introduction
11
Solid theoretical backgroundClean semantics [>>]
Generalitytgds and egds can express all embedded
implicational dependencies [Fagin, JACM82] (essentially all constraints really useful in practice)
Bottom linedata exchange is a promising framework for
applications that exchange and integrate data
Advantages
Data Exchange: A Five Minute Introduction
Data Exchange: A Five Minute Introduction
12
Problem 1: To Generate Solutionsgiven a data exchange scenario and a source
instance, generate a solution (target instance)Problem 2: To Generate the TGDs
users are not willing to write down logical formulas
it is much more natural to provide a minimal, high-level specification of the mapping
What are the problems?
13
A key observationdependencies do not fully specify the solutioni.e., a scenario may have many solutions
Example ∀ x : R(x) → Ǝ y: S(x, y)I = { R(a) }J0 = { S(a, N1) }J1 = { S(a, b) }J2 = { S(a, N1), T(c, d) }
Semantics of a Scenario
Problem 1: To Generate Solutions
Problem 1: To Generate Solutions
Why is J0 better
than J1 and J2 ?
14
Given J, J’, an homomorphism h : J → J’ is a mapping of values such thatfor each constant c, h(c) = cfor each fact R(v1, ... vn) in J, R(h(v1), ... , h(vn))
is in J’Examples:
1. from J0 = { S(a, N1) } to J1 = { S(a, b) }
2. from J0 = { S(a, N1) } to J2 = { S(a, c) }
3. from J0 = { S(a, N1) } to J3 = { S(a, N2) }
4. from J4 = { S(a, N1, N1) } to J5 = { S(a, b, c) } [?]
Homomorphisms
Problem 1: To Generate Solutions
Problem 1: To Generate Solutions
15
A good solutioncontains sufficient information to satisfy the tgdsdoes not contain any extra information
Universal solution [Fagin et al, ICDT 2003, TCS 2005]a target instance J that is a solution for Iand such that, for any other solution J’ for I, there exist an
homomorphism h : J → J’Examples: J0 is universal, J1 and J2 are not
J0 = { S(a, N1) }J1 = { S(a, b) }J2 = { S(a, N1), T(c, d) }
How to generate universal solutions ?
Universal Solutions
Problem 1: To Generate Solutions
Problem 1: To Generate Solutions
16
Operational Approach: The chase procedureClassical chase procedure for tgds
J = emptysetrepeat until fixpoint
for each tgd m : X: ∀ Φ(X) → Ǝ Y : Ψ(X, Y)
for each tuple of constants A such that I ⊧ Φ(A)
if ! Ǝ B such that J ⊧ Ψ(A, B) take C as a vector of fresh nulls J = J U Ψ(A, C)
The Chase
Problem 1: To Generate Solutions
Problem 1: To Generate Solutions
17
Examplem1 : ∀ t,i : IBLBook (t, i) → Book(t, i)
m2 : ∀ i,p : IBLPublisher(i, p) → Publisher (i, p)
m3 : ∀ t: IBDBook(t) → ∃N: Book(t, N)
m4 : ∀ t, p: LOC(t, p) → ∃N: Book(t, N) Publisher (N, p) IBLBook(‘The Hobbit’, 245) → Book(‘The Hobbit’,
245) IBLPublisher(245, ‘Ballantine’) → Publisher(245,
‘Ballantine’) IBDBook(‘The Hobbit’) → Book(‘The Hobbit’, N1)
LOC(‘The Hobbit’, ‘Ballantine’) → Book(‘The Hobbit’, N2), Publisher(N2, ‘Ballantine’)
The Chase
Problem 1: To Generate Solutions
Problem 1: To Generate Solutions
18
A chase procedureexists for target tgds and for target egds as well
(the latter equates values)However, these may fail
target tgds may generate infinite solutionses: R(x, y) → ∃N : S(y, N)
S(x, y) → ∃N : R(y, N)R(a, b) → S(b, N1) → R(N1, N2) → S(N2, N3) …
target egds may lead to failureses: R(x, y), R(x, y’) → (y = y’)R(a, b), R(a, c)
The Chase
Core Questions
Core Questions
a notion of weak acyclicity hasbeen introduced to guarantee termination
19
Canonical solutionsolution generated by the chase
Some Nice propertieseach canonical solution is a universal solutionthe chase can be implemented in SQL + Skolem
functions to generate nulls
The Chase
Problem 1: To Generate Solutions
Problem 1: To Generate Solutions
20
Examplem3 : ∀ t: IBDBook(t) → ∃N: Book(t, N)
m4 : ∀ t, p: LOC(t, p) → ∃N1: Book(t, N1) Publisher (N1, p)
First step: skolemize to remove existential variablesm3 : ∀ t: IBDBook(t) → Book(t, f(t))
m4 : ∀ t, p: LOC(t, p) → Book(t, f(t,p)) Publisher (f(t,p), p)
Second step: non recursive Datalog with functionsBook(t, f(t)) ← IBDBook(t)Book(t, f(t,p)) ← LOC(t, p) Publisher (f(t,p), p) ← LOC(t, p)
From tgds to scripts
insert into Book select t, f(t) from IBDBook union select t, f(t,p) from LOC
Problem 1: To Generate Solutions
Problem 1: To Generate Solutions
very efficient [>>]
Target: Booktitle pubId
The Hobbit NULL
The Da Vinci Code NULL
The Lord of the Rings NULL
The Lord of the Rings I1
The Catcher in the Rye I2
The Hobbit 245
The Catcher in the Rye 901
21
IBDBooktitle
The Hobbit
The Da Vinci Code
The Lord of the Rings
IBLBook
title publisher
The Lord of the Rings Houghton
The Catcher in the Rye Lb Books
LOC
Title pubId
The Hobbit 245
The Catcher in the Rye 901
id name
245 Ballantine
901 Lb Books
IBLPublisher
Target: Publisher
id name
I1 Houghton
I2 Lb Books
245 Ballantine
901 Lb Books
Scripts generate universal solutions!m1 : ∀ t,i : IBLBook (t, i) → Book(t, i)
m2 : ∀ i,p : IBLPublisher(i, p) → Publisher (i, p)
m3 : ∀ t: IBDBook(t) → ∃N: Book(t, N)
m4 : ∀ t, p: LOC(t, p) → ∃N1: Book(t, N1) Publisher (N1, p)
Problem 1: To Generate Solutions
Problem 1: To Generate Solutions
Target: Booktitle pubId
The Hobbit NULL
The Da Vinci Code NULL
The Lord of the Rings NULL
The Lord of the Rings I1
The Catcher in the Rye I2
The Hobbit 245
The Catcher in the Rye 901
22
IBDBooktitle
The Hobbit
The Da Vinci Code
The Lord of the Rings
IBLBook
title publisher
The Lord of the Rings Houghton
The Catcher in the Rye Lb Books
LOC
Title pubId
The Hobbit 245
The Catcher in the Rye 901
id name
245 Ballantine
901 Lb Books
IBLPublisher
Target: Publisher
id name
I1 Houghton
I2 Lb Books
245 Ballantine
901 Lb Books
But who wants to write tgds?
title
Source
pubId
IBLBook [0..*]
Publisher [0..*]idname
Target
Book [0..*]titlepubId
IBDBook [0..*]title
LOC [0..*]titlepublisher
IBLPublisher [0..*]idname
23
A schema mapping systemtakes as input an abstract specification of the
mapping under the form of value correspondences among schema elements
generates the tgds and then executable transformations (SQL, XQuery, XSLT…) to run them
Problem 3: Gathering Correspondencesusers may visually specify them as linesor they may be suggested by a schema
matching tool
Problem 2: To Generate the TGDs
24
Schema Matchingtools and techniques for the generation of
"lines"Schema Mapping
practical tools and algorithms for the generation of mappings (tgds and executable scripts)
Data Exchangetheoretical studies about the “data exchange
problem”
Research Landscape
Problem 2: To Generate the TGDs
Problem 2: To Generate the TGDs
tons of papers
some papers
many papers
tons > many > some
25
Clio [Popa et al, Vldb 2000&2002]Data Exchange semantics
[Fagin et al, ICDT 2003&TCS 2005] “which came first, the chicken or the egg?”
(Quick) History
Problem 2: To Generate the TGDs
Problem 2: To Generate the TGDs
26
Clio [Popa et al, Vldb 2000&2002]
(Quick) History
Problem 2: To Generate the TGDs
Problem 2: To Generate the TGDs
Chase constraints: elegant support for foreign keys
27
Clio [Popa et al, Vldb 2000&2002]
HepTox [Bonifati et al, Vldb 2005 - VldbJ 2009]
(Quick) History
Problem 2: To Generate the TGDs
Problem 2: To Generate the TGDs
28
Clio [Popa et al, Vldb 2000&2002] HepTox [Bonifati et al, Vldb 2005 - VldbJ 2009]
Nested Mappings: schema mapping reloaded [Fuxman et al, Vldb 2006]
(Quick) History
Problem 2: To Generate the TGDs
Problem 2: To Generate the TGDs
Schema Mapping Generation
Step 1. Extraction of “concepts” (in each schema). Concept = one category of data that can exist in the schema
Step 2. Mapping generation Enumerate all non-redundant maps between pairs of
concepts
Source schema S
Target schema T
Schema Correspondences
Source Concepts(relational views)
Target Concepts(relational views)
Mappings
dept: Set [ dname budget emps: Set [ ename salary worksOn: Set [ pid ] ] projects: Set [ pid pname ] ]
m1: (p0 in proj) (d in dept) (p in d.projects) p0.dname = d.dname p0.pname = p.pname
m2: (p0 in proj) (e0 in p0.emps) (d in dept) (p in d.projects) (e in d.emps) (w in
e.worksOn) w.pid = p.pid p0.dname = d.dname p0.pname = p.pname e0.ename = e.ename e0.salary = e.salary
Two ‘basic’ mappings (source-to-target tgds)
proj: Set [ dname pname emps: Set [ ename salary ] ]
m1
m2
m2 maps proj-emps to
dept-emps-worksOn-projects
Example
expression for dept-emps-worksOn-projects
The concept of “project of a department”
The concept of “project of an employee of a department”
m1 maps proj to dept-projects
proj: Set of [ dname pname emps: Set of [ ename salary ] ]
dept: Set of [ dname budget emps: Set of [ ename salary worksOn: Set of [ pid ] ] projects: Set of [ pid pname ] ]
P
PE
D
DPDE
DEPW
X
PDP
PEDEPW
A DAG of basic mappingsfor p in proj exists d’ in dept, p’ in d’.projects where d’.dname=p.dname and p’.pname=p.pname and
( for e in p.emps exists e’ in d’.emps, w in e’.worksOn where w.pid=p’.pid and e’.ename=e.ename and e’.salary=e.salary )
Nesting Algorithm: Example
32
Clio [Popa et al, Vldb 2000&2002] HepTox [Bonifati et al, Vldb 2005 - VldbJ 2009]
Nested Mappings: schema mapping reloaded [Fuxman et al, Vldb 2006]
(Quick) History
Problem 2: To Generate the TGDs
Problem 2: To Generate the TGDs
33
Execution time for basic: 22 minutesExecution time for nested: 1.1 seconds1
10
100
1000
10000
1 2 3 4
Nesting Level
Ba
sic q
uery
ex
ecu
tio
n t
ime
/
Nest
ed
qu
ery
ex
ecu
tio
n t
ime
1020 KB
514 KB
312 KB
• Qbasic execution time /
Qnested execution time
Problem 2: To Generate the TGDs
Problem 2: To Generate the TGDs
Clio [Popa et al, Vldb 2000&2002] HepTox [Bonifati et al, Vldb 2005 - VldbJ 2009]
Nested Mappings: schema mapping reloaded [Fuxman et al, Vldb 2006]
(Quick) History
34
Clio [Popa et al, Vldb 2000&2002] HepTox [Bonifati et al, Vldb 2005 - VldbJ 2009] Nested Mappings: schema mapping reloaded
[Fuxman et al, Vldb 2006]
Clip [Raffio et al, Icde 2008]
(Quick) History
Problem 2: To Generate the TGDs
Problem 2: To Generate the TGDs
35
Clio [Popa et al, Vldb 2000&2002] HepTox [Bonifati et al, Vldb 2005 - VldbJ 2009] Nested Mappings: schema mapping reloaded
[Fuxman et al, Vldb 2006] Clip [Raffio et al, Icde 2008]
MAD Mappings [Hernandez et al, Vldb 2008]
(Quick) History
Problem 2: To Generate the TGDs
Problem 2: To Generate the TGDs
36
Clio [Popa et al, Vldb 2000&2002] HepTox [Bonifati et al, Vldb 2005 - VldbJ 2009] Nested Mappings: schema mapping reloaded
[Fuxman et al, Vldb 2006] Clip [Raffio et al, Icde 2008] MAD Mappings [Hernandez et al, Vldb 2008]
Altova MapForce, StylusStudio, Microsoft Ado.NET, Oracle Service Integrator…
(Quick) History
Problem 2: To Generate the TGDs
Problem 2: To Generate the TGDs
Target: Booktitle pubId
The Hobbit NULL
The Da Vinci Code NULL
The Lord of the Rings NULL
The Lord of the Rings I1
The Catcher in the Rye I2
The Hobbit 245
The Catcher in the Rye 901
37
IBDBooktitle
The Hobbit
The Da Vinci Code
The Lord of the Rings
IBLBook
title publisher
The Lord of the Rings Houghton
The Catcher in the Rye Lb Books
LOC
Title pubId
The Hobbit 245
The Catcher in the Rye 901
id name
245 Ballantine
901 Lb Books
IBLPublisher
Target: Publisher
id name
I1 Houghton
I2 Lb Books
245 Ballantine
901 Lb Books
Problems 1 and 2 are “solved”m1 : ∀ t,i : IBLBook (t, i) → Book(t, i)
m2 : ∀ i,p : IBLPublisher(i, p) → Publisher (i, p)
m3 : ∀ t: IBDBook(t) → ∃N: Book(t, N)
m4 : ∀ t, p: LOC(t, p) → ∃N1: Book(t, N1) Publisher (N1,
p)
title
Source
pubId
IBLBook [0..*]
Publisher [0..*]idname
Target
Book [0..*]titlepubId
IBDBook [0..*]title
LOC [0..*]titlepublisher
IBLPublisher [0..*]idname
but looking at the target instance…
1. Source semantics preserved in the target instance
2. Given a minimal abstract specification
Target: BookTitle pubId
The Hobbit NULL
The Da Vinci Code NULL
The Lord of the Rings NULL
The Lord of the Rings I1
The Catcher in the Rye I2
The Hobbit 245
The Catcher in the Rye 901
38
IBDBooktitle
The Hobbit
The Da Vinci Code
The Lord of the Rings
IBLBook
title publisher
The Lord of the Rings Houghton
The Catcher in the Rye Lb Books
LOC
Title pubId
The Hobbit 245
The Catcher in the Rye 901
id name
245 Ballantine
901 Lb Books
IBLPublisher
Target: Publisher
id name
I1 Houghton
I2 Lb Books
245 Ballantine
901 Lb Books
Redundancy m1 : ∀ t,i : IBLBook (t, i) → Book(t, i)
m2 : ∀ i,p : IBLPublisher(i, p) → Publisher (i, p)
m3 : ∀ t: IBDBook(t) → ∃N: Book(t, N)
m4 : ∀ t, p: LOC(t, p) → ∃N1: Book(t, N1) Publisher (N1,
p)
title
Source
pubId
IBLBook [0..*]
Publisher [0..*]idname
Target
Book [0..*]titlepubId
IBDBook [0..*]title
LOC [0..*]titlepublisher
IBLPublisher [0..*]idname
39
What does redundancy mean ?information that also appears elsewhere in the
same instancees: Book(‘The Hobbit’, 245), Book(‘The
Hobbit’, N1)
A nice way to characterize thisany tuple t such that there exist a tuple t’ and
an homomorphism h : t → t’ is redundantintuition: t’ contains at least the same
informationMinimizing solutions: removing redundancy
Redundancy
Core Questions
Core Questions
tt’
40
The core [Fagin et al, Pods 2003 - Tods 2005]does not contain any proper subset that is also
a universal solutionis unique (up to the renaming of nulls)originates in graph theory, exists for all finite
structuresclear notion of quality (quality = minimality)
Example
The Core: smallest universal solution
Core Questions
Core Questions
Target: Booktitle pubId
The Da Vinci Code NULL
The Lord of the Rings N1
The Hobbit 245
The Catcher in the Rye 901
The Hobbit NULL
The Lord of the Rings NULL
The Catcher in the Rye N1
Target: Publisher
id name
N1 Houghton
245 Ballantine
901 Lb Books
N2 Lb Books
title pubId
The Da Vinci Code NULL
The Lord of the Rings I1
The Hobbit 245
The Catcher in the Rye 776
id name
I1 Houghton
245 Ballantine
901 Lb Books
41
First generation mapping systemsgenerate core solutions only in special cases
How far do they go from the core in general?
Comparing Solutions
Core Questions
Core Questions
s1_n s1_2n s1_4n s2_n s2_2n s2_4n s3_n s3_2n s3_4n0%
10%
20%
30%
40%
50%
60%
70%Size Ratio vs Basic Mappings
Size Ratio vs Nested Mappings
42
How much does it cost to find the core ?for arbitrary instances, a lot: CORE-
Identification is NP-hardBut…
for universal solutions, it is a polynomial problem [Fagin et al, TODS 2005], [Gottlob, Nash JACM 2008]
algorithm:post-process the initial solution, look exhaustively
for endomorphisms, and progressively remove nulls
Getting to the Core
Core Questions
Core Questions
43
The problem is solved in theory for a very general settings (weakly acyclic tgds + egds)
In practicesimple scenario: 4 tables, 4 tgdssmall instance: 5000 source tuplesgenerating the canonical solution takes 1 seccomputing the core takes 8 hours *
Getting to the Core
Core Questions
Core Questions
* using an implementation [Pichler, Savenkov, LPAR 2008] of the algorithm in [Gottlob, Nash, JACM 2008] running on PostgreSQL
O(vm|dom(J)|b)v : number of nullsm : size of Jb : block size
44
Is it possible to compute a core solution using an executable script (e.g. SQL)?
Advantagesefficiency, modularity, reuse
The Core Question
Core Questions
Core Questions
Core computatio
n (on top of a
relational db)
Canonical solution computation
Core solution computation
46
Is it possible to compute a core solution using an executable script ?
The previous belief wasthis is not possible
Our answer isyes, it is possibleunder proper restrictions
Two independent algorithms[Mecca, Papotti, Raunich, Sigmod 2009 – Vldb
2009][ten Cate, Chiticariu, Kolaitis, Tan, Vldb 2009]
Core Answers
47
Given a mapping scenarioM = <S, T, Sst>
Generate a new scenarioM’ = <S, T, S‘st>such that, for any source instance I, chasing
S‘st yields core solutions for I under MIn essence
an algorithm that rewrites the original tgds
Idea
Core Answers
Core Answers
48
Assumption 1no target constraintsthere is a workaround for target tgds (foreign
key constraints can be rewritten into the s-t tgds [Clio])
there is yet no solution for target egds (in this case the post-processing algorithm must be used)A(x,y) → S(x, y, Z), B(x,z) → S(x, Y, z)
Assumption 2tgds are “source based”, i.e., some universal
variable appears both in Φ(X) and Ψ(X, Y)→ S(Y, Z) no, R(x, y) → S(Y)
Assumptions
Core Answers
Core Answers
49
The goalto prevent the generation of redundancy, i.e., of
homomorphisms at tuple leveles: Book(‘The Hobbit’, N1) vs Book(‘The
Hobbit’, 245)Idea
look at tgd conclusions (i.e., structures of fact blocks in the target), and identify homorphisms at the formula level
rewrite the tgds accordingly
The Key Intuition
Core Answers
Core Answers
50
Formula homomorphismmapping among variable occurrences that
maps universal occurrences into universal occurrences and preserves tgd conclusions
Example
m1 : ∀ t,i : IBLBook (t, i) → Book(t, i)
m3 : ∀ t’: IBDBook(t’) → ∃N’: Book(t’, N’)h: Book(t’, N’) → Book(t, i)
h(t’) → th(N) → i
Formula Homomorphism
Core Answers
Core Answers
51
Rewriting strategyfor each tgd m such that there is a formula
homomorphism into m’fire m’, the “more informative” mapping; then fire
m only when m’ does not fire for the same valuesIn practice
we introduce negation in tgd premisesi.e., differences in the SQL script
ExampleIBDBook(t’) ∧ ¬(IBLBook(t, i) ∧ t = t’ ) → ∃N’:
Book(t’, N’)
Tgd Rewriting
Core Answers
Core Answers
52
…→ Book(t1, i1)
…→ Publisher (i2, p2)
…→ Book(t4, N4) Publisher (N4, p4)
…→ Book(t1, i1)
…→ Book(t3, N3)
…→ Book(t4, N4) Publisher (N4, p4)
1. Subsumption“simple” formula homomorphism, from one tgd
conclusion into another tgd conclusion2. Coverage
formula homomorphism from one tgd conclusion into the union of several tgd conclusions
3. Self-Joins (coming soon)Examples for 1 and 2
Classification of the Scenarios
Core Answers
Core Answers
53
The original mappingsm1 : IBLBook (t1, i1) → Book(t1, i1)
m2 : IBLPublisher(i2, p2) → Publisher (i2, p2)
m3 : IBDBook(t3) → ∃N3: Book(t3, N3)
m4 : LOC(t4, p4) → ∃N4: Book(t4, N4) Publisher (N4, p4)
The rewritingm’1 : IBLBook (t1, i1) → Book(t1, i1)
m’2 : IBLPublisher(i2, p2) → Publisher (i2, p2)
m’3 : IBDBook(t3) ∧ ¬(IBLBook(t1, i1) ∧ t1 = t3) ∧ ¬ (LOC(t4, p4) ∧ t4 = t3)
→ ∃N3: Book(t3, N3)
m’4 : LOC(t4, p4) ∧ ¬(IBLBook (t1, i1) ∧ IBLPublisher(i2, p2) ∧ i1 = i2 ∧
t1 = t4 ∧ p4 = p2 ) → ∃N4: Book(t4, N4) Publisher (N4, p4)
Tgd Rewriting
Core Answers
Core Answers
intuition: it is not always necessary to fire a tgd to
guaranteethat the target
satisfies the corresponding constraint
NOTE: coverages require additional joins, usually on key/foreign keys
54
Supposetgd conclusions do not contain self-joins (i.e.,
each relation symbol appears at most once) Result n. 1
the only kind of formula homomorphisms that may exist are subsumptions and coverages
Result n. 2by rewriting tgds according to subsumptions
and coverages, it is possible to generate a new set of tgds with negation whose canonical solutions are core solutions for the original scenario
Some Interesting Results
Core Answers
Core Answers
Intuitions behind the complexitySubsumptions generate a DAG
worst-case: linear pathaverage-case: forest
Coverages are not very frequentm1 : ... → R(t1, i1)
m2 : ... → S (i2, p2)
m : ... → ∃N4: R(t4, N4) S(N4, p4)
Some Interesting Results
Core Answers
Core Answers
55
Supposen: number of tgdsd: max. number of tgds that write into a tablek: max. number of atoms in a tgd
Some Interesting Results
Core Answers
Core Answers
Homomorphism
Worst Case
Average Frequency
Subsumptions O(n2) O(n) Very high
Coverages O(dk) low frequency (d low)
56
In case of self-joinsthese results do not hold any longer
Example from [Fagin et al, Tods 2005]chasing a single tgd may generate redundancy,
i.e., non core tuplesm1 : R(a, b, c, d) → S (x5, b, x1, x2, a) ∧
S (x5, c, x3, x4, a) ∧ S (d, c, x3, x4, b)
m2 : R(a, b, c, d) → S (d, a, a, x1, b) ∧ S (x5, a, a, x1, a) ∧ S (x5, c, x2, x3, x4)
57
Self-Joins
Core Answers
Core Answers
In this casea more elaborate rewriting is needed
The same intuitions still holdseach tgd states a constraint that must be
satisfied by target instancesthere are many possible ways to satisfy the
constraintlet’s compare these ways and find the most
“concise” ones
However, there are important differences
58
Self-Joins
Core Answers
Core Answers
59
Self-Joins: Two-Phase
m’1 R(a, b, c, d) → S1 (x5, b, x1, x2, a) ∧ S2 (x5, c, x3, x4, a) ∧ S3 (d, c, x3, x4, b) m′2 R(e, f, g, h) → S4 (h, e, e, y1, f ) ∧ S5 (y5, e, e, y1, e) ∧ S6 (y5, g, y2, y3, y4)
RS2
S1 S
first exchange second exchange
S1 (x5 , b, x1, x2, a) … → S (…)S2 (x5, c, x3, x4, a) … → S (…) S (…)S3 (d, c, x3, x4, b) … → S (…)S4 (h, e, e, y1, f ) … → S (…)S5 (y5, e, e, y1, e) … → S (…)S6 (y5, g, y2, y3, y4) … → S (…)…
Expansions
1. we rewrite the tgds using distinct symbols and variables and do a first exchange
2. second exchange considers all possible ways to satisfy a tgd and use negation to avoid redundancy
Core Answers
Core Answers
R S
60
Self-joins: Expansions
Core Answers
Core Answers
Source instanceR(n, n, n, k)
Fire m3
S(N5, n, N1, N2, n), S(N5, n, N3, N4, n), S(k, n, N3, N4, n)
m’1 R(a, b, c, d) → S1 (x5, b, x1, x2, a) ∧ S2 (x5, c, x3, x4, a) ∧ S3 (d, c, x3, x4, b) m′2 R(e, f, g, h) → S4 (h, e, e, y1, f ) ∧ S5 (y5, e, e, y1, e) ∧ S6 (y5, g, y2, y3, y4)
m3 :S1 (x5, b, x1, x2, a) ∧ S2 (x5, c, x3, x4, a) ∧ S3 (d, c, x3, x4, b) → S (x5, b, x1, x2, a) ∧ S (x5, c, x3, x4, a) ∧ S (d, c, x3, x4, b)
61
Self-joins: Expansions
Core Answers
Core Answers
Source instanceR(n, n, n, k)
Fire m3
S(N5, n, N1, N2, n), S(N5, n, N3, N4, n), S(k, n, N3, N4, n)
But, if b=c, we have a more compact expansion and get m4
Fire m4
S(N5, n, N3, N4, n), S(k, n, N3, N4, n)
m’1 R(a, b, c, d) → S1 (x5, b, x1, x2, a) ∧ S2 (x5, c, x3, x4, a) ∧ S3 (d, c, x3, x4, b) m′2 R(e, f, g, h) → S4 (h, e, e, y1, f ) ∧ S5 (y5, e, e, y1, e) ∧ S6 (y5, g, y2, y3, y4)
m4 : S2(x5, c, x3, x4, a) ∧ S3(d, c, x3, x4, b) ∧ (S1(x5, b, x1, x2, a) b = c)∧ → S (x5, c, x3, x4, a) ∧ S (d, c, x3, x4, b)
62
Self-joins: Expansions
Core Answers
Core Answers
Source instanceR(n, n, n, k)
Fire m3
S(N5, n, N1, N2, n), S(N5, n, N3, N4, n), S(k, n, N3, N4, n)
There are many possible expansions: with many equalities we get m5
Fire m5
S(k, n, N3, N4, n)
m’1 R(a, b, c, d) → S1 (x5, b, x1, x2, a) ∧ S2 (x5, c, x3, x4, a) ∧ S3 (d, c, x3, x4, b) m′2 R(e, f, g, h) → S4 (h, e, e, y1, f ) ∧ S5 (y5, e, e, y1, e) ∧ S6 (y5, g, y2, y3, y4)
m5 : S4(h, e, e, y1, f) ∧ S4(h’, e’, e’, y1, f’) h = h’ ∧ ∧ (S1(x5, b, x1, x2, a) ∧ S2(x5, c, x3, x4, a) ∧ S3(d, c, x3, x4, b) ∧ e=b f=a e′=c f′=a h′=d f∧ ∧ ∧ ∧ ∧′=b) → S (d, c, x3, x4, b)
Expansionany subset of target symbols that generates
target instances that satisfy a tgdRewriting algorithm
1. find expansions of each tgd conclusion in ex 1
2. for each expansion, generate a tgd in ex2 using the expansion as a premise
3. use homomorphisms to identify more compact or more informative expansions
63
Self-joins: Rewriting Algorithm
e12 S2(x5, c, x3, x4, a) ∧ S3(d, c, x3, x4, b) ∧ (S1(x5, b, x1, x2, a) b = c) ∧ → S (…) S (…) e13 … → S (…) …
Core Answers
Core Answers
64
The double exchangeis not strictly needed [tenCate et al Vldb 2009,
Mecca et al submitted]in fact, the implementation performs a single
exchange with some additional viewsSome observations
for self-joins, worst-case expression complexity is O(2n)
large number of tgds in the second exchangelarge number of differences and joins in the SQL script
more complex skolemizationThe exponential explosion is observed only in
synthetic scenariosgreat part of the complexity of the algorithm is needed to
handle a small minority of cases (80-20)
Complexity
Core Answers
Core Answers
66
Algorithms implemented in +Spicyhttp://www.db.unibas.it/projects/spicy/
Scripts in SQL (and XQuery)PostgreSQL 8.3 on a Intel CoreDuo 2.4Ghz/4GB
Ram/LinuxScenarios from the literature
mostly from STBenchmark [Vldb 2008 – www.stbenchmark.org]
Each SQL test run with 10k, 100k, 250k, 500k, and 1M tuples in the
source instancetime limit = 1 hour
Custom engine exceeded the time limit in all scenarios
Experimental Results
67
Experiments results
10K 100K 250K 500K 1M
0
500
1000
1500
2000
2500
3000
3500
SJ1SJ2SJ3
10K 100K 250K 500K 1M
0
20
40
60
80
100
120
S1S2C1C2
#tuples in the source
Tim
es (
sec)
Tim
es (
sec)
#tuples in the source
Subsumption and coverages Self joins
Scalability experiments with up to 100 tables (82 tgds, 51 subsuptions, 12 coverages): rewriting algorithm ran in 6 secs
25 tables 50 tables 75 tables 100 tables
0
10
20
30
40
50
60
70
80
90
100
0
2
4
6
8
10
12
14
21
39
55
82
26
36
4551
28
12
# of TGDS # of Subsumptions # of CoveragesScript Gen Time (s) Execution Time (s)
69
A reference architecture
Schema Mappings in Practice
Schema Mappings: A One Minute Tutorial
Schema Mappings: A One Minute Tutorial
Schema Matching
source
target
correspondences
MappingGeneration
s.A t.E, 0.87s.B t.E, 0.90s.C t.F, 0.76s.D t.F, 0.98…
MappingExecution
(Data Exchange)
solution
mappings (TGDs)
QueryAnswering
(Data Integration) query results
70
Core Schema Mappingsalgorithms (and tool) to generate core solutions
efficientlydeclarative semanticsclear notion of quality
Two main resultsbridge the gap between the practice of
mapping generation and the theory of data exchange
enable schema mappings for more practical applications
Conclusions
72
Modular solution
canonical solution
subsumption-free solution
coverage-free solution
core (self-join-coverage-free solution)
Algorithms allow to produce approximations of the core for expensive computations: “reduced” rewriting wrt to a subset of homomorphisms
Theorem 5.2 (Core Computation) Given a data exchange setting (S, T, Σ ), suppose Σ is a set of source-
based tgds in normal form that do not contain self-joins in tgd conclusions. Call Σ′ the set of coverage rewritings of Σ. Given an instance I of S, call J, J’ the canonical solutions of Σ and Σ′ for I. Then J’ is the core of J.
Proof idea: whenever two tgds m1, m2 in Σ are fired to generate an
endomorphism, several homomorphisms must be in place. Call a1, a2 the variable assignments used to fire m1, m2 with an
homomorphism h from ψ1(a1, b1) to ψ2 (a2, b2). Then we know that there must be a formula homomorphism h′ from ψ1(x1, y1) to ψ2(x2, y2), and therefore a rewriting of m1 in which the premise of m2 is negated.
By composing the various homomorphism it is possible to show that the rewriting of m1 will not be fired on assignment a1. Therefore, the endomorphism will not be present in J’. 73
Core computation
Given a data exchange setting (S, T, Σ), with Σ:Φ(x) → Ψ(x,y)
for each vector of constants a s.t. I ⊧ Φ(a), we say witness block WB(a) a block of facts in J s.t. there exists an homomorphism
h: Ψ(a, b) -> WB(a)
Expansions generate all the witness blocks for a tgd. Each expansion is a contained rewriting of the query:
Ψ(x,y) ∩ Φ(x)because we are interested only in target tuples from Φ(x)1. in the first phase, expansions compute all witness blocks the RHS of
the original mapping M2. in the second phase subsumptions select only minimal witness
blocks
74
Core computation
To implement the computation of cores via an executable script (e.g. SQL) great care is needed in order to properly invent labeled nulls A Skolem function is usually an uninterpreted term of
the form f(v1, v2, . . . , vn), where each vi is either a constant or a term
Default chase semantics Skolem functions for a tgd m guarantee that, if
same/another tgd fires a block of facts identical to that generated by m up to nulls, the Skolem arguments are identical
75
Chase and Skolemization
R(a, b, c) → ∃N0, N1: S(a, N0), T(b, N0, N1), W(N1)
Skolem for N0 and N1 have three arguments: (i) the sequence of facts generated by firing the tgd (ii) the sequence of joins imposed by existential variables (iii) a reference to the specific variable for which the
function is used
N0: fsk({S(A:a),T(A:b),W()}, {S.B=T.B, T.C=W.A}, S.B=T.B)
N1: fsk({S(A:a),T(A:b),W()}, {S.B=T.B, T.C=W.A}, T.C=W.A)
76
Chase and Skolemization