A SURVEY OF APPROACHES TO AUTOMATIC SCHEMA MATCHING Sushant Vemparala Gaurang Telang

Upload: robert-matthews

Post on 30-Dec-2015

Motivating Example

Assume UTA needs to integrate 40 databases from its different schools with a total of 27,000 elements.

Done manually, the integration would take approximately 12 person-years.

How would you reduce the manual burden?

Schema Matching

Schema 1

<Schema name="Schema S">
  <ElementType name="AccountOwner">
    <element type="Name"/>
    <element type="Address"/>
    <element type="BirthDate"/>
  </ElementType>
  <ElementType name="Address">
    <element type="street"/>
    <element type="city"/>
    <element type="state"/>
    <element type="ZIP"/>
  </ElementType>
</Schema>

Schema 2

<Schema name="Schema T">
  <ElementType name="Customer">
    <element type="FName"/>
    <element type="LName"/>
    <element type="CAddress"/>
  </ElementType>
  <ElementType name="CAddress">
    <element type="street"/>
    <element type="city"/>
    <element type="province"/>
    <element type="code"/>
  </ElementType>
</Schema>

Schema Matching Definition

Schema matching is defined as the task of finding the semantic correspondences between elements of two schemas.

[Figure: the Match operation takes schemas S1 and S2, plus auxiliary information (user feedback, dictionaries, previous mappings), and produces the match result.]

Application Domains

Schema integration: developing a global view over a set of independently developed schemas

Comparing data schemas:
• Items from different shopping sites
• Merger between two corporations
• Preparation of data for data warehousing and analysis processes

Any other examples?

High Level Architecture of Generic Match

http://db18.informatik.uni-leipzig.de:8080/WebEdition/

Classification of Schema Matching Approaches

1) Schema-level matching
Granularity of schema-level matching: element level, structure level

2) Instance-level matching

3) Hybrid and composite matching

Schema Level Matching

Uses only schema-level information (no data content)

Properties? (Name, description, data type, is-a/part-of relationships, constraints, and structure)

Match finds match candidates, each with a similarity value

Granularity: Element Level

Granularity: Structure-Level

Structure-Level: Matches combinations of elements that appear together in S1 with “combinations” of elements that appear together in S2.

Full Structure Match vs Partial Structure Match

S1 Elements              S2 Elements
Address                  CustAddress
Street                   Street
City                     City
State                    USState
Zip                      PostalCode

S1 Elements              S2 Elements
AccountOwner (Finance)   Customer (Sales)
Name                     Cname
Address                  CAddress
Birthdate                CPhone
TaxExempt

Granularity: Structure-Level (Contd)

Equivalence patterns: structure matching can be enhanced by considering known equivalence patterns stored in a library.

Matching Cardinality

One or more S1 elements can match one or more S2 elements.

1:1, 1:n, n:1, (m:n)

Instance Level Matching

Gives insight into the contents and meaning of schema elements

Useful when schema information is limited or when semi-structured data is used

Can correct an incorrect interpretation of schema-level information

E.g., X is a match candidate for both CompanyName and Manufacturer; instance data can help decide between them

Techniques for Schema Level Matching

Linguistic approaches

Name based (equality of names)

equality of canonical name (Cust# = CustNo)

equality of synonyms (make = brand)

equality of hypernyms (book is-a publication and article is-a publication implies book = article)
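The name-based checks listed above can be sketched as follows. This is only an illustration: the abbreviation, synonym, and is-a tables are tiny hypothetical stand-ins for the dictionaries or thesauri a real matcher would consult, and the function names `canonical` and `name_match` are not taken from any particular tool.

```python
import re

# Hypothetical lookup tables, for illustration only.
ABBREVIATIONS = {"#": "no", "num": "no", "number": "no", "cust": "customer"}
SYNONYMS = [{"make", "brand"}]
IS_A = {"book": "publication", "article": "publication"}


def canonical(name):
    """Split a name into tokens, lowercase them, and expand known abbreviations."""
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", name)  # "CustNo" -> "Cust No"
    tokens = re.findall(r"[A-Za-z]+|#", spaced)
    return "".join(ABBREVIATIONS.get(t.lower(), t.lower()) for t in tokens)


def name_match(a, b):
    """Return the kind of name-based match between a and b, or None."""
    ca, cb = canonical(a), canonical(b)
    if ca == cb:
        return "canonical"
    if any(ca in group and cb in group for group in SYNONYMS):
        return "synonym"
    # Two names match by hypernym if they share the same is-a parent.
    if ca in IS_A and IS_A[ca] == IS_A.get(cb):
        return "hypernym"
    return None


for pair in [("Cust#", "CustNo"), ("make", "brand"),
             ("book", "article"), ("book", "street")]:
    print(pair, "->", name_match(*pair))
```

The checks run from strictest to loosest, so a pair that is equal after canonicalization is never reported as a mere synonym.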

Techniques for Schema Level Matching

Name Matching (Contd)

Similarity based on pronunciation or soundex (ship2 = ShipTo)

User-provided name matches (issue = bug)

Not limited to 1:1 matches (phone = {homePhone, officePhone})

Context based: payroll application (salary = income) vs. tax reporting application (salary != income)
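To make the pronunciation-based idea concrete, here is a small sketch of the classic Soundex code. Plain Soundex ignores digits, so this sketch assumes an extra preprocessing step, `spell_digits`, that spells digits out before encoding; that step is an assumption added for illustration, not something the slides specify.

```python
# Standard Soundex consonant groups.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def spell_digits(name):
    """Replace each ASCII digit with its English word (assumed preprocessing)."""
    return "".join(DIGIT_WORDS[int(c)] if c in "0123456789" else c for c in name)


def soundex(word):
    """Return the four-character Soundex code of a word ('' if it has no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    out = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "hw":      # h and w do not separate equal codes
            prev = code
    return (out + "000")[:4]


print(soundex("ship2"), soundex("ShipTo"))                 # codes differ
print(soundex(spell_digits("ship2")), soundex("ShipTo"))   # codes agree
```

With digits spelled out, "ship2" becomes "shiptwo", which receives the same code as "ShipTo"; without that step the two names get different codes.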

Techniques for Schema Level Matching

Description based

E.g., comments in schema elements

Techniques for Schema Level Matching

Constraint-based mapping

- E.g., data types and value ranges, optionality, relationship types, cardinalities, etc.

- Combined with other matchers to limit match candidates
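As a rough illustration of how constraints can limit the candidates produced by another matcher, the sketch below filters candidate pairs by data type and optionality. The `Element` class, the compatibility table, and the data types assigned to the example elements are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    name: str
    data_type: str
    optional: bool


# Hypothetical pairs of data types treated as compatible.
COMPATIBLE_TYPES = {
    frozenset({"int", "decimal"}),
    frozenset({"char", "varchar"}),
}


def types_compatible(a, b):
    """Two types are compatible if equal or listed as a compatible pair."""
    return a == b or frozenset({a, b}) in COMPATIBLE_TYPES


def filter_candidates(candidates):
    """Keep only candidate pairs whose constraints do not contradict each other."""
    return [(s, t) for s, t in candidates
            if types_compatible(s.data_type, t.data_type)
            and s.optional == t.optional]


# Candidate pairs as they might come from a name or structure matcher.
candidates = [
    (Element("ZIP", "varchar", False), Element("code", "char", False)),
    (Element("BirthDate", "date", True), Element("CPhone", "varchar", True)),
]
kept = filter_candidates(candidates)
print([(s.name, t.name) for s, t in kept])
```

Requiring identical optionality is a deliberately strict choice for the sketch; a real matcher would more likely lower a similarity score than discard the pair outright.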

Techniques for Schema Level Matching

Reusing Schema and Mapping Information

- Idea: schemas from the same domain are often very similar, e.g., address fields and name fields are repeated

- Create a schema library that schema editors can access (analogy: XML namespaces)

S->S2(known)

Goal:S1->S?

S1->S2?(easy to find)

Techniques for Instance Level Matching

IR techniques (measures such as the Jaccard coefficient)

Constraint-based characterization (EmpNo range vs. DeptNo range)

Auxiliary information

Learning (e.g., evaluate S1 contents to build Characterization 1, then evaluate S2 contents against Characterization 1)

Drawback of instance-based matching?
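The Jaccard coefficient mentioned among the IR techniques is the size of the intersection of two value sets divided by the size of their union. A minimal sketch, with made-up column values:

```python
def jaccard(values_a, values_b):
    """Jaccard coefficient of two collections of instance values."""
    a, b = set(values_a), set(values_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty columns
    return len(a & b) / len(a | b)


# Hypothetical instance values for two columns from different schemas.
state_values = ["TX", "FL", "MA"]
us_state_values = ["TX", "MA", "WA"]
zip_values = ["76019", "33101"]

print(jaccard(state_values, us_state_values))
print(jaccard(state_values, zip_values))
```

Columns with overlapping values score higher, which is the signal an instance-level matcher uses to propose that two elements correspond.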

Combining Matcher: Hybrid Matcher

Integrates multiple matching criteria

E.g., a matcher that combines name matching and constraint-based matching

Single pass

Matching criteria are hard-wired

Combining Matcher: Composite Matcher

Combines the results of several independently executed matchers

Iterative (the match result of the 1st matcher is consumed by the 2nd matcher)

Flexible ordering

Which is more efficient: hybrid or composite?
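One simple way to combine independently executed matchers is a weighted average of their similarity scores, sketched below. The two toy matchers, the weights, and the threshold are assumptions made for illustration; this shows the parallel style of combination, not the iterative one where one matcher consumes another's result.

```python
def name_matcher(s, t):
    """Similarity 1.0 if the names are equal ignoring case, else 0.0."""
    return 1.0 if s["name"].lower() == t["name"].lower() else 0.0


def type_matcher(s, t):
    """Similarity 1.0 if the data types are equal, else 0.0."""
    return 1.0 if s["type"] == t["type"] else 0.0


def composite_match(schema1, schema2, matchers, threshold=0.5):
    """Run every matcher on every element pair and average the weighted scores."""
    total_weight = sum(weight for _, weight in matchers)
    result = []
    for s in schema1:
        for t in schema2:
            score = sum(w * m(s, t) for m, w in matchers) / total_weight
            if score >= threshold:
                result.append((s["name"], t["name"], score))
    return result


# Hypothetical elements; the data types are not part of the original slides.
s1 = [{"name": "City", "type": "string"}, {"name": "ZIP", "type": "string"}]
s2 = [{"name": "city", "type": "string"}, {"name": "code", "type": "int"}]

matches = composite_match(s1, s2, [(name_matcher, 0.75), (type_matcher, 0.25)])
print(matches)
```

Because the matchers are passed in as a list, they can be added, removed, or reweighted without touching the combination step, which is the flexibility the composite approach is meant to offer.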

Summarization

How good is a Match?

Assessing match quality is difficult

Human verification and tuning of matching is often required

A useful metric would be to measure the amount of human work required to reach the perfect match

Recall: what fraction of the good matches did we show?

Precision: what fraction of the matches we show are good?
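Recall and precision can be computed by comparing the proposed matches against a manually verified set. The sketch below uses hypothetical match pairs built from the element names used earlier in the slides.

```python
def precision_recall(proposed, correct):
    """Return (precision, recall) of proposed matches against the correct ones."""
    proposed, correct = set(proposed), set(correct)
    true_positives = len(proposed & correct)
    precision = true_positives / len(proposed) if proposed else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    return precision, recall


# Hypothetical matcher output and hand-verified reference matches.
proposed = {("Name", "Cname"), ("Address", "CAddress"),
            ("Birthdate", "CPhone"), ("Street", "Street")}
correct = {("Name", "Cname"), ("Address", "CAddress"), ("Street", "Street"),
           ("City", "City"), ("State", "USState"), ("Zip", "PostalCode")}

precision, recall = precision_recall(proposed, correct)
print(precision, recall)
```

Here three of the four proposed matches are good, and three of the six good matches were found.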

Current Work

LSD
SKAT
Similarity Flooding

LSD (Learning Source Descriptions)

Produces 1:1 instance-level mappings

Suppose a user wants to integrate 100 data sources

1. User manually creates mappings for a few sources, say 3, and shows LSD these mappings

2. LSD learns from the mappings
“Multi-strategy” learning incorporates many types of info in a general way
Knowledge of constraints further helps

3. LSD proposes mappings for the remaining 97 sources

LSD: Example

Mediated schema: address, price, agent-phone, description

Schema of realestate.com: location, listed-price, phone, comments

realestate.com
location        listed-price    phone             comments
Miami, FL       $250,000        (305) 729 0831    Fantastic house
Boston, MA      $110,000        (617) 253 1429    Great location
...             ...             ...               ...

homes.com
price           contact-phone     extra-info
$550,000        (278) 345 7215    Beautiful yard
$320,000        (617) 335 2315    Great beach
...             ...               ...

Learned hypotheses
If “phone” occurs in the name => agent-phone
If “fantastic” & “great” occur frequently in data values => description

LSD: Training the Learners

Mediated schema: address, price, agent-phone, description

Schema of realestate.com: location, listed-price, phone, comments

Training data from realestate.com:
<location> Miami, FL </> <listed-price> $250,000</> <phone> (305) 729 0831</> <comments> Fantastic house </>
<location> Boston, MA </> <listed-price> $110,000</> <phone> (617) 253 1429</> <comments> Great location </>

Name Learner training examples:
(location, address) (listed-price, price) (phone, agent-phone) (comments, description) ...

Naive Bayes Learner training examples:
(“Miami, FL”, address) (“$ 250,000”, price) (“(305) 729 0831”, agent-phone) (“Fantastic house”, description) ...

LSD: Applying the Learners

Mediated schema: address, price, agent-phone, description

Schema of homes.com: area, day-phone, extra-info

Data from homes.com:
<area>Seattle, WA</> <area>Kent, WA</> <area>Austin, TX</>
<day-phone>(278) 345 7215</> <day-phone>(617) 335 2315</> <day-phone>(512) 427 1115</>
<extra-info>Beautiful yard</> <extra-info>Great beach</> <extra-info>Close to Seattle</>

Learners applied: Name Learner and Naive Bayes, combined by the Meta-Learner

Predictions shown on the slide:
(address,0.8), (description,0.2)
(address,0.6), (description,0.4)
(address,0.7), (description,0.3)

(address,0.6), (description,0.4)
(address,0.7), (description,0.3)
(agent-phone,0.9), (description,0.1)
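The step of combining several predictions into one can be sketched as a plain average of per-label confidence scores. This assumes the three score lists on the slide are predictions for the three area values and that the combined result is their average; the averaging rule is an assumption that happens to reproduce (address,0.7), (description,0.3), and LSD's actual meta-learner may weigh its base learners differently.

```python
def combine_predictions(predictions):
    """Average the confidence score of each label over all predictions."""
    labels = {label for prediction in predictions for label in prediction}
    return {
        label: sum(p.get(label, 0.0) for p in predictions) / len(predictions)
        for label in labels
    }


# Scores taken from the slide; reading them as predictions for the three
# <area> values is an assumption.
area_predictions = [
    {"address": 0.8, "description": 0.2},
    {"address": 0.6, "description": 0.4},
    {"address": 0.7, "description": 0.3},
]

combined = combine_predictions(area_predictions)
best_label = max(combined, key=combined.get)
print(best_label, combined)
```

The label with the highest combined score becomes the proposed mediated-schema element for the source column.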

SKAT (Semantic Knowledge Articulation Tool)

Expert supplies SKAT with a few initial rules
E.g.:
1) Match US.president US.chancellor
2) Mismatch human.nail factory.nail

SKAT articulates on the supplied matching rules

Expert approves or rejects the result

SKAT creates correct rules and computes an updated articulation

(Knowledge gained from irrelevant and rejected rules is stored)

Similarity Flooding

Intuition: whenever two elements in the graphs G1 and G2 are similar, their neighbors tend to be similar too.

Transform schemas into directed labeled graphs
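The intuition can be illustrated with a much simplified propagation loop over two labeled graphs. Everything here is a sketch under stated assumptions: the graphs are lists of (source, label, target) edges, the update rule (initial similarity plus current similarity plus similarity received from neighboring pairs, normalized by the maximum) is chosen for brevity, and a fixed iteration count replaces a convergence test. The published algorithm defines its own fixpoint formulas.

```python
from collections import defaultdict


def similarity_flooding(edges1, edges2, initial, iterations=20):
    """Propagate similarity between node pairs that are connected by
    equally labeled edges in both graphs (simplified sketch)."""
    # Pair up edges with the same label: (a, b) is linked to (a2, b2).
    pair_edges = [((a, b), label, (a2, b2))
                  for a, label, a2 in edges1
                  for b, label2, b2 in edges2
                  if label == label2]
    # Each pair splits its similarity evenly over its same-labeled edges.
    out_count, in_count = defaultdict(int), defaultdict(int)
    for src, label, dst in pair_edges:
        out_count[(src, label)] += 1
        in_count[(dst, label)] += 1

    nodes = {n for src, _, dst in pair_edges for n in (src, dst)}
    sigma = {n: initial.get(n, 0.0) for n in nodes}
    for _ in range(iterations):
        nxt = {n: initial.get(n, 0.0) + sigma[n] for n in nodes}
        for src, label, dst in pair_edges:
            nxt[dst] += sigma[src] / out_count[(src, label)]  # forward
            nxt[src] += sigma[dst] / in_count[(dst, label)]   # backward
        top = max(nxt.values(), default=0.0)
        sigma = {n: (v / top if top else 0.0) for n, v in nxt.items()}
    return sigma


g1 = [("AccountOwner", "child", "Name"), ("AccountOwner", "child", "Address")]
g2 = [("Customer", "child", "CName"), ("Customer", "child", "CAddress")]
# Hypothetical initial similarities, e.g. from a name matcher.
start = {("Name", "CName"): 1.0, ("Address", "CAddress"): 1.0}

result = similarity_flooding(g1, g2, start)
for pair, score in sorted(result.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

In this toy run the pair (AccountOwner, Customer) starts with no similarity of its own and gains it only because its children are similar, which is the neighbor effect the intuition describes.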

Similarity Flooding Example

Conclusion

User feedback and user interaction: minimize user input but maximize the impact of the feedback

If we require user acceptance for our matches, what happens if our matcher returns hundreds or thousands of matches?

The more configurable the matcher, the better

Problems with schema representation and data: dealing with inconsistent data values for a schema element; independence of schema representation

Mapping maintenance: what happens when you map between two schemas and then one changes?

Sophisticated techniques are required for n:m matches [current work is based on 1:1]

Conclusion

Areas needing more attention:
1) Re-use opportunities
2) Learning from user feedback

Any other issues to address?

THANK YOU!