Automatic Schema Matching
Seminar on Databases and the Internet
Yaron Naveh
January 2006
Automatic Schema Matching, SDBI, 2006
Articles
• A survey of approaches to automatic schema matching. Rahm & Bernstein (2001)
• Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach. He, Chen-Chuan Chang & Han (2004)
Contents
• Problem Definition
• Applications
• Classic Approaches
• Correlation Mining Approach
Match Definition
A match is a mapping between elements of two schemas that correspond semantically to each other.
Example: Authors (ID, Name, NumOfBooks) and Authors (AID, AName, ANumOfBooks)
ID ↔ AID, Name ↔ AName, NumOfBooks ↔ ANumOfBooks
Match Properties
Authors (ID, Name, NumOfBooks) and Authors (ID, FName, LName, YearOfBirth)
• (1:1): ID ↔ ID
• (1:n): Name ↔ FName, LName
• NumOfBooks ↔ ?
• YearOfBirth ↔ ?
• (n:m) matching also possible
Match Properties (cont’d)
Authors (ID, Name, Salary ($)) and Authors (ID, Name, Salary (NIS))
• Salary (NIS) = Salary ($) * 4.55
• A matcher finds only the corresponding attributes, not the conversion function
Match Properties (cont’d)
One relation is mapped to two others
Employees (EmpName, DeptName) ↔ Employees (EmpName, DeptID) Join Departments (DeptID, DeptName)
Match Properties (cont’d)
Lessons (Teacher, StartTime, EndTime) and Lessons (Teacher, Time??)
• Too hard for a computer to decide on its own!
• The computer should only suggest mappings to the user
Match Properties (cont’d)
So maybe it can all be done manually?
Two schemas, each with the fields Field1, Field2, Field3, Field4, Field5, Field6, Field7, Field8, Field9, Field10
An automated tool can be helpful here…
Match Generalization
We have defined a match for the relational model.
There are other interesting models:
…
<author>
<id>1</id>
<name>Calvino</name>
</author>
…
AuthorsBooks
ID
AuthorsName
Match Generalization (cont’d)
Define a Schema to be a set of elements connected by some structure
Use the natural correspondence:
• nodes and edges in graphs
• elements, subelements, and IDREFs in XML
…
Contents
• Problem Definition
• Applications
• Classic Approaches
• Correlation Mining Approach
Data Migration
Migrate data from old DB to new DB
Old Forum (Date, From, Message)
New Forum (Time, Writer, Message, IsVisible, ResponseTo)
Special case: Data warehouse
E-Commerce
Map between different message formats
Book Store:
<book>
<name>The Invisible Cities</name>
<price>50</price>
</book>
General Store:
<product>
<name>book</name>
<price>50</price>
</product>
Global Query Interface
<input name=search>
<select name=type>
MSN
<input name=q>
Yahoo
<input name=qry>
<input name=type>
You want to build a Meta-Querier. However…
Global Query Interface (cont’d)
Solution: Reduce the HTML form to its “schema”
Search, Type
q
Qry, Type
MSN, Yahoo
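Reducing a form to its “schema” can be sketched in Python as follows. This is only an illustration under stated assumptions, not the method of the cited articles: it uses the standard-library html.parser module, the names FormSchemaParser and form_schema are hypothetical, and it assumes the name attribute of each input, select, or textarea element is what we want to keep.

```python
from html.parser import HTMLParser


class FormSchemaParser(HTMLParser):
    """Collects the `name` attribute of every form field it encounters."""

    FIELD_TAGS = {"input", "select", "textarea"}

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute, value) pairs; tag names arrive lowercased.
        if tag in self.FIELD_TAGS:
            name = dict(attrs).get("name")
            if name:
                self.fields.append(name)


def form_schema(html):
    """Return the list of field names found in an HTML fragment (hypothetical helper)."""
    parser = FormSchemaParser()
    parser.feed(html)
    parser.close()
    return parser.fields


print(form_schema("<input name=search><select name=type></select>"))
print(form_schema("<input name=q>"))
```

Each form is thereby reduced to a plain list of attribute names, which is the input the matchers discussed later expect.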
Semantic Query Processing
Keywords search scenario
Authors (Id, Name)
Find: Author + Ram + Oren
SELECT * FROM Authors WHERE Id='Ram Oren' ?
SELECT * FROM Authors WHERE Name='Ram Oren' ?
How does this differ from previous examples?
Contents
• Problem Definition
• Applications
• Classic Approaches
• Correlation Mining Approach
Matchers
• Several algorithms exist for mapping the attributes of two schemas
• Define such an algorithm as a matcher
• Define a hybrid matcher as a matcher that combines results from other matchers
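One possible reading of a hybrid matcher is a weighted average of the scores returned by simpler matchers. The sketch below assumes that a matcher is a function from two attribute names to a score in [0, 1]; the two base matchers, the weighting scheme, and all function names are hypothetical choices made for illustration.

```python
def exact_name_matcher(a, b):
    """Score 1.0 when the names are equal ignoring case, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


def shared_prefix_matcher(a, b):
    """Length of the common prefix divided by the longer name's length."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / max(len(a), len(b))


def hybrid_matcher(matchers, weights):
    """Build a matcher that returns the weighted average of the given matchers."""
    total = sum(weights)

    def combined(a, b):
        return sum(w * m(a, b) for m, w in zip(matchers, weights)) / total

    return combined


match = hybrid_matcher([exact_name_matcher, shared_prefix_matcher], [1.0, 1.0])
print(match("Name", "name"))
print(match("Name", "NumOfBooks"))
```

Other combination rules (maximum, voting, learned weights) would fit the same interface.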
Schema-based Vs. Instance-based
Two ways to perform a match:
• Use schema data (field name, type, constraints…)
• Use data from the table
Instance-based
Two options for using data from the table:
• Build a schema from instance data, then use schema matchers
• Use the data directly. Example:
Books (BookID, TotPages, TotPrice): (1, 500, 50), (2, 400, 40), (3, 450, 90)
Books (BookID, TotalP): 1 6060
What is TotalP?
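Using the data directly could look like the sketch below, which picks the candidate column whose mean value is closest to that of the unknown column. The comparison by mean, the function name closest_column_by_mean, and the sample values for TotalP (two values of 60, one possible reading of the example) are all assumptions made for illustration.

```python
from statistics import mean


def closest_column_by_mean(target_values, candidate_columns):
    """Return the name of the candidate column whose mean is nearest the target's mean."""
    target_mean = mean(target_values)
    return min(
        candidate_columns,
        key=lambda name: abs(mean(candidate_columns[name]) - target_mean),
    )


# Instance data of the first Books table.
books = {"TotPages": [500, 400, 450], "TotPrice": [50, 40, 90]}
# Assumed sample values for the unknown TotalP column.
total_p = [60, 60]

print(closest_column_by_mean(total_p, books))
```

With these assumed values, TotalP looks like a price rather than a page count, which is a decision the column name alone cannot settle.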
Instance-based (cont’d)
When will we use/not use instance-based matchers?
• Useful when no schema data is available
• Not useful when no instance data is available…
Schema-Based
What useful data is there in the schema?
• Element’s name
• Description
• Data Type
• Relationships
• Constraints
Schema-Based: Name Matching
Map elements with similar names:
• String equality
• Common substrings (Birthday --> DayOfBirth)
• Canonical names (CName --> Customer Name)
• Synonyms (Car --> Automobile)
• Hypernyms (Book is-a Publication)
• Soundex (ShipTo --> Ship2)
• User provided (Issue --> Bug)
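A few of the name-matching techniques above (string equality, synonyms, common substrings) can be combined in a small scoring function. The sketch below is an assumption-laden illustration: the function name name_similarity, the tiny synonym table, and the scoring rule (longest common substring divided by the longer name) are hypothetical, and difflib.SequenceMatcher from the standard library is used to find the common substring.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table; in practice this could come from a thesaurus or the user.
SYNONYMS = {frozenset({"car", "automobile"}), frozenset({"issue", "bug"})}


def name_similarity(a, b):
    """Score two element names in [0, 1] using equality, synonyms, then common substrings."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 1.0
    if frozenset({a, b}) in SYNONYMS:
        return 1.0
    block = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return block.size / max(len(a), len(b))


print(name_similarity("Car", "Automobile"))
print(name_similarity("Birthday", "DayOfBirth"))
```

Canonical names, hypernyms, and Soundex would need extra dictionaries or a phonetic encoder and are left out of this sketch.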
Schema-Based: Description
Map elements based on description
Schema A: empn //employee name
Schema B: name //name of employee
Schema-Based: Constraint Based
Map elements based on Constraints:
• Data Types
• Unique, Primary, Foreign
[Figure: tables Employees (Name, PID), Permissions (ID, PLevel), Employees (Name, PID) and Payments (ID, Sum); a “?” marks the match in question]
Reuse Previous Matching
Schema A (Name, Salary)
Schema B (AName, Income)
Schema C (Author, Money)
• Derive the mapping A→C from the mappings A→B and B→C
• A partial reuse is also possible (e.g. on some of the attributes)
• Be aware of the domain: salary and income are not always the same!
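Reusing earlier matchings amounts to composing two mappings. The sketch below assumes (1:1) mappings stored as dictionaries; the function name compose is hypothetical, and the attribute pairs are read off the three example schemas on this slide.

```python
def compose(map_ab, map_bc):
    """Compose A->B and B->C; attributes of A whose image has no match in C are dropped."""
    return {a: map_bc[b] for a, b in map_ab.items() if b in map_bc}


a_to_b = {"Name": "AName", "Salary": "Income"}
b_to_c = {"AName": "Author", "Income": "Money"}

print(compose(a_to_b, b_to_c))

# Partial reuse: only some attributes of B are mapped to C.
print(compose(a_to_b, {"AName": "Author"}))
```

The composition is purely mechanical, so the domain caveat still applies: the result is a suggestion that someone should review.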
Complexity
• We must compare every subgroup of attributes in schema A to every subgroup in schema B
• Exponential in the number of attributes
• However, we can assume the number of attributes is bounded…
• Also check (n:m) matchings only for n,m<C for some constant C
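The effect of the bound C can be made concrete by counting candidate subgroup pairs. The sketch below is a back-of-the-envelope count, not part of any matcher; the function names are hypothetical, and it assumes non-empty subgroups and the strict bound n,m<C stated above.

```python
from math import comb


def all_candidates(attrs_a, attrs_b):
    """Pairs of non-empty subgroups when no size bound is applied."""
    return (2 ** attrs_a - 1) * (2 ** attrs_b - 1)


def bounded_candidates(attrs_a, attrs_b, c):
    """Pairs of non-empty subgroups whose sizes are both strictly below c."""
    groups_a = sum(comb(attrs_a, k) for k in range(1, c))
    groups_b = sum(comb(attrs_b, k) for k in range(1, c))
    return groups_a * groups_b


print(all_candidates(10, 10))
print(bounded_candidates(10, 10, 3))
```

For two schemas of ten attributes each, bounding the subgroup size at two reduces the count from over a million pairs to a few thousand.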
Contents
• Problem Definition
• Applications
• Classic Approaches
• Correlation Mining Approach
Data Mining
Data Mining is the process of discovering patterns in data, usually stored in a database.
Sells (TransID, Item): (1, Book), (1, Pencil), (2, Book), (2, Soap), (3, Book), (3, Soap)
Which items are likely to co-appear?
Data Mining (cont’d)
Sells (TransID, Item): (1, Book), (1, Pencil), (2, Book), (2, Soap), (3, Book), (3, Soap)
Support of an itemset: the fraction of transactions that contain all items in the itemset.
What is the support for {Book}? 1
And for {Book, Soap}? 0.666
The A-Priori property: the support for any subset of an itemset is at least as large as the support for the itemset
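The definition of support can be checked on the Sells table with a few lines of Python. This is a minimal sketch; the function name support and the representation of each transaction as a set of items are choices made here for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


# The Sells table, grouped by TransID.
sells = [{"Book", "Pencil"}, {"Book", "Soap"}, {"Book", "Soap"}]

print(support({"Book"}, sells))
print(support({"Book", "Soap"}, sells))

# A-Priori property: a subset is contained in at least as many transactions.
print(support({"Book"}, sells) >= support({"Book", "Soap"}, sells))
```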
Data Mining (cont’d)
Sells (TransID, Item): (1, Book), (1, Pencil), (2, Book), (2, Soap), (3, Book), (3, Soap)
Algorithm to find frequent itemsets:
1. Define a threshold minSupport for “frequent” itemsets
2. Calculate support for all itemsets of size 1
3. Calculate support for itemsets of size 2,3,4…
4. For each size k save the frequent itemsets
5. Stop when there are no frequent itemsets of size k.
Why can we stop?
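The steps above can be sketched as a level-wise search. This is a simplified illustration rather than an optimized implementation; the function name frequent_itemsets is hypothetical, and candidates of size k+1 are generated only from frequent itemsets of size k, which is where the A-Priori property is used.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset whose support reaches min_support."""
    n = len(transactions)

    def support(items):
        return sum(items <= t for t in transactions) / n

    # Level 1: every single item is a candidate.
    level = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
    result = {}
    k = 1
    while level:
        supports = {c: support(c) for c in level}
        frequent = {c: s for c, s in supports.items() if s >= min_support}
        result.update(frequent)
        # Candidates of size k+1: unions of frequent k-itemsets, kept only if
        # all of their k-subsets are frequent (A-Priori pruning).
        k += 1
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        level = [
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        ]
    return result


sells = [{"Book", "Pencil"}, {"Book", "Soap"}, {"Book", "Soap"}]
for itemset, s in frequent_itemsets(sells, 0.5).items():
    print(sorted(itemset), round(s, 3))
```

The loop ends as soon as a level has no frequent itemsets, because no larger itemset can then be frequent.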
Data Mining (cont’d)
Sells (TransID, Item): (1, Book), (1, Pencil), (2, Book), (2, Soap), (3, Book), (3, Soap)
Example:
1. Set minSupport = 0.5
2. S({Book})=1, S({Pencil})=0.33, S({Soap})=0.666
3. S({Book, Soap})=0.666
4. S({Book, Soap, Pencil})=0
Where is {Soap, Pencil}?
Back to Schema Matching…
Authors:
(Id, First, Last)
(Id, Salary, Name, Year)
(Id, AuthorFirst, AuthorLast, YearBirth)
(Id, Author)
(Id, FirstName, LastName, Income)
Goal: Map {Name} to {Author}, {Salary} to {Income}…
Idea: {Name} and {Author} are unlikely to appear together
Solution: go to the supermarket, but instead of food buy attributes!
What is the difference from the supermarket example?
The Algorithm
Input: set of m schemas
(Id, First, Last)
(Id, Salary, Name, Year)
(Id, AuthorFirst, AuthorLast, YearBirth)
(Id, Author)
(Id, FirstName, LastName, Income)
Output: set of n-ary mappings
{Name}:{Author}:{AuthorFirst, AuthorLast}:{First,Last}…
{Salary}:{Income}
{Year}:{YearBirth}
Algorithm
Naive Algorithm
1. Make a list L of all attributes from all schemas
   L = {Name, Salary, FirstName, LastName, Author, First, Last…}
2. For each pair of attributes, calculate their support (how often they appear together)
   S(Name, Salary) = 0.4
   S(First, Last) = 0.95
   S(Last, Name) = 0.1
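Steps 1 and 2 of the naive algorithm can be sketched as follows, treating each schema as a "transaction" of attributes. The function name pair_supports is hypothetical, and the input is the five example schemas shown with the algorithm's input, so the resulting numbers differ from the illustrative support values quoted on the slide.

```python
from itertools import combinations

schemas = [
    {"Id", "First", "Last"},
    {"Id", "Salary", "Name", "Year"},
    {"Id", "AuthorFirst", "AuthorLast", "YearBirth"},
    {"Id", "Author"},
    {"Id", "FirstName", "LastName", "Income"},
]


def pair_supports(schemas):
    """Map every unordered attribute pair to the fraction of schemas containing both."""
    attributes = sorted(set().union(*schemas))  # step 1: the list L
    n = len(schemas)
    return {
        frozenset({a, b}): sum(a in s and b in s for s in schemas) / n
        for a, b in combinations(attributes, 2)  # step 2: every pair
    }


supports = pair_supports(schemas)
print(supports[frozenset({"Name", "Author"})])
print(supports[frozenset({"First", "Last"})])
```

Pairs with support close to zero, such as Name and Author here, are the candidates for a match.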
Algorithm (Cont’d)
3. Choose groups with low support
   S(Name, Salary) = 0.4
   S(First, Last) = 0.95
   S(Last, Name) = 0.1
4. Using the A-Priori property, calculate support for groups of sizes 3,4,5…
   S(Name, LastName, Salary) = 0
   S(First, Last, Salary) = 0.1
5. Return all groups with low support
Algorithm (Cont’d)
The algorithm is naive.
Suppose we have negative correlation for this:
{name, author}
Then we also have negative correlation for this:
{name, author, salary}
{name, author, yearOfBirth}
Actually for any attribute X we have:
{name, author, X}
Improvement
Improvement: Define the support (s) of an itemset {a,b,c…} to be
MAX { s(a,b), s(b,c), s(a,c) … }
Example:
s(name, author)=0.1
s(name, salary)=0.5
s(salary, author)=0.6
s(name,author,salary)=MAX (0.1,0.5,0.6)=0.6
Now the support can go up, so checking it is not trivial
What is the logic behind this?
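The redefined support is easy to state in code. In this sketch the function name group_support is hypothetical, and the pair supports are the example values from the slide, stored under unordered pairs.

```python
from itertools import combinations


def group_support(attributes, pair_support):
    """Support of a group = the maximum support over all pairs inside it."""
    return max(pair_support[frozenset(p)] for p in combinations(attributes, 2))


pair_support = {
    frozenset({"name", "author"}): 0.1,
    frozenset({"name", "salary"}): 0.5,
    frozenset({"salary", "author"}): 0.6,
}

print(group_support(["name", "author"], pair_support))
print(group_support(["name", "author", "salary"], pair_support))
```

Adding salary to {name, author} raises the value from 0.1 to 0.6, so a group has low support only when every pair in it rarely co-occurs.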
Generalizing the algorithm
Now the algorithm finds all groups of attributes (a,b,c…) s.t. no pair in the group appears together.
Hopefully these are attributes with the same semantics:
{name, author}
{salary, payments}
…
But what about this?
({first,last}, {name})
Currently we find only (1:1) matchings
For (n:m) we need to preprocess…
Preprocess
Pre-Process for the algorithm:
1. Make a list L of all attributes from all schemas
   L = {Name, Salary, FirstName, LastName, Author, First, Last…}
2. Run the normal A-Priori algorithm (find all attributes that DO appear together)
   S(first, last)=0.9
   S(firstName,lastName)=0.85
Preprocess
3. For each schema S in the input:
   For each frequent attribute group A:
   If A intersects with S then add a new attribute “A” to S
   Example: S = (Id, First, Last), A = {First, Last}, S’ = (Id, First, Last, {First,Last})
4. Run the previous algorithm on S1’, S2’… to find negative correlation
Now we can find groups like:
({first,last}, {name})
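Step 3 of the preprocessing can be sketched as below. The function name add_group_attributes is hypothetical; the test for adding a group follows the wording of the slide (a non-empty intersection with the schema), and a group is represented as a frozenset so that it can sit next to ordinary attribute names.

```python
def add_group_attributes(schemas, frequent_groups):
    """Return copies of the schemas, each extended with the groups it intersects."""
    extended = []
    for schema in schemas:
        new_schema = set(schema)
        for group in frequent_groups:
            if schema & group:  # the group shares at least one attribute with the schema
                new_schema.add(frozenset(group))
        extended.append(new_schema)
    return extended


schemas = [{"Id", "First", "Last"}, {"Id", "Author"}]
groups = [frozenset({"First", "Last"})]

for s in add_group_attributes(schemas, groups):
    print(s)
```

After this step the group {First, Last} behaves like a single attribute, so the negative-correlation search can pair it with {name}.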
Still Not Perfect…
Suppose we found these mappings:
{first,last}:{name}:{author}
{first, yearOfBirth}:{birthDate}
{yearOfBirth, monthOfBirth}:{birthDate}
There is a contradiction!
Solution
Solution: rank the mappings according to the support of the lowest pair in each mapping
1. {first,last}:{name}:{author}
2. {first, yearOfBirth}:{birthDate}
3. {yearOfBirth, monthOfBirth}:{birthDate}
Add the top rank to the results:
1. {first,last}:{name}:{author}
Delete contradictions to this rank:
2. {first, yearOfBirth}:{birthDate} X
Process next mapping:
3. {yearOfBirth, monthOfBirth}:{birthDate}
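The greedy selection can be sketched as follows, taking a list of mappings that is already ranked best-first. The conflict test is an assumption made for this sketch (two mappings contradict each other when some attribute appears in both but inside different groups), and the function names conflicts and select_consistent are hypothetical.

```python
def conflicts(m1, m2):
    """True if two groups, one from each mapping, overlap without being identical."""
    return any(g1 & g2 and g1 != g2 for g1 in m1 for g2 in m2)


def select_consistent(ranked):
    """Repeatedly keep the best remaining mapping and drop those contradicting it."""
    selected = []
    remaining = list(ranked)
    while remaining:
        best = remaining.pop(0)
        selected.append(best)
        remaining = [m for m in remaining if not conflicts(best, m)]
    return selected


m1 = (frozenset({"first", "last"}), frozenset({"name"}), frozenset({"author"}))
m2 = (frozenset({"first", "yearOfBirth"}), frozenset({"birthDate"}))
m3 = (frozenset({"yearOfBirth", "monthOfBirth"}), frozenset({"birthDate"}))

for mapping in select_consistent([m1, m2, m3]):
    print([sorted(g) for g in mapping])
```

On the slide's example, the second mapping is dropped because it groups first differently from the top-ranked one, and the third mapping survives.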
Attributes with the same name
Step 1 of the algorithm (reminder):
Make a list S of all attributes from all schemas
S = {Name, Salary, FirstName, LastName, Author, First, Last…}
This means that two attributes with the same name are always considered the same.
Payment (longint)
Payment (datetime)?
Solution: add the type to the name
(Id, First, Last) → (Id_Int, First_String, Last_String)
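Qualifying attribute names with their types is a one-line transformation. In this sketch the function name qualify and the representation of a schema as a list of (name, type) pairs are assumptions made for illustration.

```python
def qualify(schema):
    """Turn (name, type) pairs into type-qualified attribute names."""
    return [f"{name}_{type_name}" for name, type_name in schema]


print(qualify([("Id", "Int"), ("First", "String"), ("Last", "String")]))

# Two attributes named Payment with different types no longer collide.
print(qualify([("Payment", "longint")]), qualify([("Payment", "datetime")]))
```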
Correlation Measure
The rare attribute problem:
(Id, First, Last)
(Id, Salary, Name, Year)
(Id, AuthorFirst, AuthorLast, YearBirth)
(Id, Author)
(Id, FirstName, LastName, Income)
s(Income, Id)=0.2
So Income=Id?
Correlation Measure (cont’d)
The sparseness problem:
(Id, First, Last)
(Id, Salary, Name, Year)
(Id, AuthorFirst, AuthorLast, YearBirth)
(Id, Author)
(Id, FirstName, LastName, Income)
s(Salary, Income)=0
If Salary=Income, then what are their equivalents in the other tables?
Correlation Measure (cont’d)
Support of an itemset: the fraction of transactions that contain all items in the itemset.
There are other ways to calculate support:
Let A,B be two attributes. Define
f11: the number of schemas where both A and B appear
f10: the number of schemas where only A appears
…
f1+: f11+f10
      B    ^B
A    f11  f10  f1+
^A   f01  f00  f0+
     f+1  f+0  f++
Correlation Measure (cont’d)
We used:
support = f11/f++
Lift:
(f00 · f11)/(f10 · f01)
H-measure:
(f01 · f10)/(f+1 · f1+)
      B    ^B
A    f11  f10  f1+
^A   f01  f00  f0+
     f+1  f+0  f++
Every measure fits a different situation
For example, in the matching problem we want to “punish” attributes that co-appear
(Id, Salary, Name, Year)
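The contingency counts and two of the measures can be computed directly from a list of schemas. This is a sketch under stated assumptions: the function names are hypothetical, f10 counts schemas with A but not B, the five example schemas from the earlier slides are used as input, and each attribute is assumed to occur in at least one schema so the H-measure denominator is non-zero.

```python
schemas = [
    {"Id", "First", "Last"},
    {"Id", "Salary", "Name", "Year"},
    {"Id", "AuthorFirst", "AuthorLast", "YearBirth"},
    {"Id", "Author"},
    {"Id", "FirstName", "LastName", "Income"},
]


def contingency(schemas, a, b):
    """Return (f11, f10, f01, f00) for attributes a and b."""
    f11 = sum(a in s and b in s for s in schemas)
    f10 = sum(a in s and b not in s for s in schemas)
    f01 = sum(a not in s and b in s for s in schemas)
    f00 = len(schemas) - f11 - f10 - f01
    return f11, f10, f01, f00


def support(schemas, a, b):
    f11, f10, f01, f00 = contingency(schemas, a, b)
    return f11 / (f11 + f10 + f01 + f00)  # f11 / f++


def h_measure(schemas, a, b):
    f11, f10, f01, f00 = contingency(schemas, a, b)
    f1_plus = f11 + f10  # schemas containing a
    f_plus_1 = f11 + f01  # schemas containing b
    return (f01 * f10) / (f_plus_1 * f1_plus)


print(support(schemas, "Income", "Id"), h_measure(schemas, "Income", "Id"))
print(support(schemas, "Salary", "Income"), h_measure(schemas, "Salary", "Income"))
```

Income and Id have low support only because Income is rare; the H-measure is 0 for that pair because Income never appears without Id, while Salary and Income, which never co-appear, score 1.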
Applications
This approach can only be used when we have many schemas
• Web query interfaces. Example:
El-Al.Com
• Adult
• Child
• Infant
Arkia.Com, American Airlines.Com
• Adult
• Child
• Destination
• Passengers
• To
• Data Migration?
Is it possible to use the algorithm for migration by running it on many random schemas?
Complexity
The A-Priori algorithm is O(2^n) in the worst case
In practice there are usually only a few correlations, so in step (k+1) we extend just a few of the groups of size k