HKU CSIS DB Seminar:
COMA - A system for flexible combination of schema matching approaches
- VLDB 2002 -
Hong-Hai Do and Erhard Rahm
Speaker: Eric Lo
http://www.csis.hku.hk/~dbgroup/seminar/seminar020927.htm
DB Seminar 2
What is Schema Matching?
• Finding semantic correspondences between elements of two schemas
• Input: 2 schemas
• Output: A set of mappings
DB Seminar 3
Why Schema Matching?
• Done by domain experts
• Time consuming
• Reduce user effort
• Semi-automatic
– Need user to verify
– Need user to modify
DB Seminar 4
Application domains
• E-commerce:
– E.g. a comparison shopping website
– Aggregates product offers from multiple independent online stores
– Match each product catalog against their combined catalog
  • [Amazon].product_code <-> [Combined].product_id
  • [Wrox].bookid <-> [Combined].product_id
DB Seminar 5
Application domains
• Data warehouses and data integration systems
– Preprocessing
• Data translation
– XML <-> relational data mapping
DB Seminar 6
Schema matching categories
• Goal: High match accuracy for a large variety of schemas
• A single technique is not enough for different schemas, so different approaches must be combined effectively
• Hybrid approach:
– Most common
– Different match criteria (e.g. name, data type, dictionary, thesaurus…) are used in a single algorithm
• Composite approach:
– High flexibility
– One match algorithm per match criterion
– Combine the independent results from the algorithms
DB Seminar 7
Outline
• Introduction
• COMA system
• Overview of different matchers
• Reuse matcher from COMA
• Evaluation
• Conclusions
• Discussions
• References
DB Seminar 8
COMA - COmbining MAtch algorithms
• Composite approach
• No previous work on composite generic matching
• A generic match system
• Supports multiple schema types (e.g. XML and relational)
DB Seminar 9
COMA
• Different match algorithms exist as an extensible matcher library in COMA
• Supports different ways of combining the results of the match algorithms in the library
• An evaluation platform to systematically examine and compare the effectiveness of different matchers and combination strategies
DB Seminar 10
COMA
• Interactive and iterative match process that allows user feedback
• Also proposes a new matcher that reuses previously obtained match results (the authors observed that many schemas to be matched are very similar to previously matched schemas)
DB Seminar 11
Matching Process
[Figure: Schema1 and Schema2 (S1, S2) are passed to Matcher1, Matcher2, Matcher3 and UserFeedback; the matcher results are stored in the similarity cube and then combined into the match result.]
Matcher library:
• Simple matchers: n-gram, synonym
• Hybrid: NamePath
Sample match result:
• <schema1.cname> <schema2.companyname> Sim = 0.95
• <schema1.cname> <schema2.businessname> Sim = 0.8
• <schema1.address> <schema2.address> Sim = 1
DB Seminar 12
5 Steps
• Step 1: Schema Representation
• Step 2: Schema Tree Distinct Elements
• Step 3: Matching Algorithm (Matcher)
• Step 4: Aggregation of k matcher values
• Step 5: Selection
DB Seminar 13
Step1: Schema Representation
DB Seminar 14
XML Schema Representation
DB Seminar 15
Step 2
• Traverse the schema tree
• Represent each schema element by its path
– Sequence of nodes from the root
– E.g. Address in PO2
– Multiple paths
  • PO2.DeliverTo.Address
  • PO2.BillTo.Address
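A minimal sketch of Step 2, assuming the schema is held as a dictionary from each element to its children. The graph below and the function name enumerate_paths are hypothetical; only the two Address paths come from the slide.

```python
# Hypothetical PO2 schema graph: element -> child elements.
# Address is a shared fragment reachable from DeliverTo and BillTo.
SCHEMA = {
    "PO2": ["DeliverTo", "BillTo"],
    "DeliverTo": ["Address"],
    "BillTo": ["Address"],
    "Address": ["Street", "City"],
    "Street": [],
    "City": [],
}


def enumerate_paths(graph, root):
    """Return every root-to-element path as a dotted string (depth-first)."""
    paths = []

    def visit(node, prefix):
        path = prefix + [node]
        paths.append(".".join(path))
        for child in graph[node]:
            visit(child, path)

    visit(root, [])
    return paths


paths = enumerate_paths(SCHEMA, "PO2")
address_paths = [p for p in paths if p.endswith(".Address")]
print(address_paths)  # ['PO2.DeliverTo.Address', 'PO2.BillTo.Address']
```

The shared Address fragment yields more paths (9) than schema nodes (6), which is why elements are identified by path rather than by node.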
DB Seminar 16
Step 3: Match algorithms
• Take in each schema element path
• Return a similarity value
• If human feedback is involved:
– User-approved match: similarity is 1 (rejected match: 0)
• Different matchers return similarity values between 0 and 1
• COMA currently supports simple, hybrid and reuse-oriented matchers (discussed later)
DB Seminar 17
Storing the results of k matchers in a similarity cube
• k matchers
• m schema 1 elements
• n schema 2 elements
• A cube of k x m x n similarity values is stored in the repository for the later combination and selection steps
[Figure: cube with axes k, m and n]
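A rough sketch of the similarity cube as a nested list indexed by matcher, schema 1 element and schema 2 element. The two matchers here (exact_name, same_depth) are toy placeholders invented for illustration, not COMA matchers.

```python
def build_cube(matchers, s1_elements, s2_elements):
    """Return cube[k][i][j]: similarity from matcher k for (s1[i], s2[j])."""
    return [
        [[matcher(a, b) for b in s2_elements] for a in s1_elements]
        for matcher in matchers
    ]


# Toy matchers (assumptions for this sketch only).
def exact_name(a, b):
    """1.0 if the last path components are equal ignoring case, else 0.0."""
    return 1.0 if a.split(".")[-1].lower() == b.split(".")[-1].lower() else 0.0


def same_depth(a, b):
    """1.0 if both paths have the same depth, else 0.5."""
    return 1.0 if a.count(".") == b.count(".") else 0.5


s1 = ["ShipTo.shipToCity", "ShipTo.shipToStreet", "ShipTo.Customer.custCity"]
s2 = ["DeliverTo.Address.City", "DeliverTo.Address.Street"]
cube = build_cube([exact_name, same_depth], s1, s2)
dims = (len(cube), len(cube[0]), len(cube[0][0]))
print(dims)  # (2, 3, 2), i.e. k x m x n
```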
DB Seminar 18
Some samples from similarity cube
Matcher              PO1 Elements              PO2 Elements            Sim
Matcher1: TypeName   ShipTo.shipToCity         DeliverTo.Address.City  0.65
                     ShipTo.shipToStreet                               0.3
                     ShipTo.Customer.custCity                          0.8
Matcher2: NamePath   ShipTo.shipToCity         DeliverTo.Address.City  0.78
                     ShipTo.shipToStreet                               0.73
                     ShipTo.Customer.custCity                          0.53
DB Seminar 19
Step 4 and 5: Combine match result
• Combine the k results from the similarity cube
• Step 4: Aggregation
– Aggregation of matcher-specific results
  • E.g. taking the average / max / min of the k values
  • ShipTo.shipToCity <-> DeliverTo.Address.City 0.72
  • ShipTo.shipToStreet 0.52
  • ShipTo.Customer.custCity 0.67
• Step 5: Selection
– Selection of match candidates
  • Select ShipTo.shipToCity <-> DeliverTo.Address.City (0.72)
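The aggregated values on this slide are consistent with averaging the two matcher results from the similarity cube samples. A small sketch, assuming Average as the aggregation and "pick the maximum" as the selection; the variable names are illustrative.

```python
# Matcher-specific similarities against DeliverTo.Address.City,
# taken from the similarity cube samples.
type_name = {
    "ShipTo.shipToCity": 0.65,
    "ShipTo.shipToStreet": 0.3,
    "ShipTo.Customer.custCity": 0.8,
}
name_path = {
    "ShipTo.shipToCity": 0.78,
    "ShipTo.shipToStreet": 0.73,
    "ShipTo.Customer.custCity": 0.53,
}

# Step 4: aggregate the k = 2 values per element pair (Average).
aggregated = {e: (type_name[e] + name_path[e]) / 2 for e in type_name}

# Step 5: select the candidate with the highest aggregated similarity.
best = max(aggregated, key=aggregated.get)

for element, sim in aggregated.items():
    print(f"{element}: {sim:.3f}")
print("selected:", best)
```

The averages are 0.715, 0.515 and 0.665, which the slide shows rounded to 0.72, 0.52 and 0.67.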
DB Seminar 20
How do the matchers work?
• Step 1: Schema Representation
• Step 2: Schema Tree Distinct Elements
• Step 3: Matching Algorithm (Matcher)
• Step 4: Aggregation of k matcher values
• Step 5: Selection
DB Seminar 21
COMA Matcher Library
Type            Name          Schema Info         Aux. Info
Simple          Affix         Element names       -
                N-gram        Element names       -
                Soundex       Element names       -
                EditDistance  Element names       -
                Synonym       Element names       External dictionaries
                DataType      Data types          Data type compatibility table
                UserFeedback  -                   User-specified (mis-)matches
Hybrid          NameMatcher   Element names       -
                NamePath      Names + paths       -
                TypeName      Data types + names  -
                Children      Child elements      -
                Leaves        Leaf elements       -
Reuse-oriented  Schema        -                   Existing schema-level match results
DB Seminar 22
Simple Matcher
• Use element names to compare
– Name string
– Name semantics
• Can use approximate string matching techniques (as applied in data cleansing)
• Affix: looks for common affixes (prefixes and suffixes) between name strings
• DataType: similarity = degree of compatibility of 2 data types (values are predefined)
– E.g. int and bit = 0.6, text and hex = 0.1
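A hedged sketch of the Affix and DataType matchers. The normalisation in affix_similarity (longest common prefix or suffix divided by the longer name) is an assumption of this sketch, since the slides do not give a formula. The compatibility table holds only the two example values from the slide.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def affix_similarity(a, b):
    """Longest common prefix or suffix, relative to the longer name (assumed)."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    prefix = common_prefix_len(a, b)
    suffix = common_prefix_len(a[::-1], b[::-1])
    return max(prefix, suffix) / max(len(a), len(b))


# Predefined compatibility values; the two entries are the slide's examples.
TYPE_COMPATIBILITY = {
    frozenset(["int", "bit"]): 0.6,
    frozenset(["text", "hex"]): 0.1,
}


def datatype_similarity(t1, t2):
    if t1 == t2:
        return 1.0  # assumption: identical types are fully compatible
    # Assumption: pairs missing from the table score 0.
    return TYPE_COMPATIBILITY.get(frozenset([t1, t2]), 0.0)


print(affix_similarity("shipToCity", "custCity"))  # shared suffix "city"
print(datatype_similarity("bit", "int"))
```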
DB Seminar 23
Hybrid Matcher
• Fixed combination of simple matchers
• E.g. EditDistance + DataType
• Hybrid Matcher 1 (Name Matcher):
– Tokenization (POShipTo -> PO, Ship, To)
– Expansion (PO -> Purchase, Order)
– Then use e.g. Affix + Trigram
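The tokenization and expansion steps of the Name matcher could look roughly like this. The regular expression and the ABBREVIATIONS table are assumptions of this sketch; only the POShipTo and PO examples come from the slide.

```python
import re

# Hypothetical abbreviation table; the PO entry is the slide's example.
ABBREVIATIONS = {"PO": ["Purchase", "Order"]}


def tokenize(name):
    """Split a camel-case name, keeping runs of capitals (e.g. PO) together."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)


def expand(tokens):
    """Replace known abbreviations by their expanded tokens."""
    expanded = []
    for token in tokens:
        expanded.extend(ABBREVIATIONS.get(token, [token]))
    return expanded


tokens = tokenize("POShipTo")
print(tokens)          # ['PO', 'Ship', 'To']
print(expand(tokens))  # ['Purchase', 'Order', 'Ship', 'To']
```

The resulting token lists would then be compared with simple matchers such as Affix or Trigram.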
DB Seminar 24
Another Hybrid Matcher
• NamePath Matcher:
– Name + Path (element + structure)
– Build a long string from the path
– Apply Name Matcher
– E.g. PurchaseOrder.ShipTo.Street and PurchaseOrder.shipToStreet
– Same in Name Matcher, but not in NamePath
DB Seminar 25
Outline
• Introduction
• COMA system
• Overview of different matchers [Step 3]
• Reuse matcher for COMA [Step 3]
• Evaluation
• Conclusions
• Discussions
• References
DB Seminar 26
Reuse of previous match result
• Based on the authors' observation:
– Many schemas to be matched are similar (or identical) to previously matched schemas
– Build a reuse-oriented matcher to save resources
– A was matched with B before (A <-> B) [Match 101]
– B was matched with C before (B <-> C) [Match 234]
– Now a new match task: A <-> C
• The MatchCompose operation combines previous match results to obtain a new match result
DB Seminar 27
MatchCompose operation
• Given 2 match results:
– match1: S1 <-> S2
– match2: S2 <-> S3
• MatchCompose derives a new match result S1 <-> S3
[Figure: PO1.Contact (Name, Email, Company) <-> PO2.Contact (name, email, company) <-> PO3.Contact (lastName, firstName, email); MatchCompose derives the mapping Match: S1 <-> S3]
DB Seminar 28
MatchCompose in relation
Match1
PO1    PO2     SIM12
Name   name    1.0
Email  e-mail  1.0

Match2
PO2     PO3        SIM23
name    lastName   0.6
name    firstName  0.6
e-mail  email      1.0

MatchCompose
PO1    PO3        SIM13
Name   lastName   0.8
Name   firstName  0.8
Email  email      1.0
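A small sketch of MatchCompose as a join over the shared PO2 elements. It assumes the composed similarity is the average of the two input similarities, which reproduces the 0.8 values in the tables above, and it treats the PO2 element as name in both inputs. The function name match_compose is illustrative.

```python
def match_compose(match12, match23):
    """Compose S1<->S2 and S2<->S3 results via their shared S2 elements."""
    composed = {}
    for (a, b), sim_ab in match12.items():
        for (b2, c), sim_bc in match23.items():
            if b == b2:
                # Assumption: composed similarity = average of both steps.
                composed[(a, c)] = (sim_ab + sim_bc) / 2
    return composed


match1 = {("Name", "name"): 1.0, ("Email", "e-mail"): 1.0}
match2 = {
    ("name", "lastName"): 0.6,
    ("name", "firstName"): 0.6,
    ("e-mail", "email"): 1.0,
}

match13 = match_compose(match1, match2)
for (po1, po3), sim in match13.items():
    print(po1, po3, round(sim, 2))
```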
DB Seminar 29
Re-use: Schema matcher
• All previous match results are stored in the repository
• A new match problem arrives, e.g. S1 to be matched with S2
• Find all match results involving a schema (Si, Sj, Sk, ...) that is related to BOTH S1 and S2, in any order
• Each such pair of match results undergoes MatchCompose
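One way to picture the lookup: scan the repository for intermediate schemas that have a stored match result with both S1 and S2, regardless of the order in which the pair was stored. The repository layout (a list of schema-name pairs) and the function name are assumptions of this sketch.

```python
def find_intermediate_schemas(repository, s1, s2):
    """Return schemas Si with stored results for both {s1, Si} and {Si, s2}."""
    stored = {frozenset(pair) for pair in repository}  # order-insensitive
    schemas = {s for pair in repository for s in pair}
    return sorted(
        si
        for si in schemas - {s1, s2}
        if frozenset((s1, si)) in stored and frozenset((si, s2)) in stored
    )


# Hypothetical repository of previously matched schema pairs.
repository = [("A", "B"), ("C", "B"), ("A", "D")]
print(find_intermediate_schemas(repository, "A", "C"))  # ['B']
```

Each schema returned identifies a pair of stored match results that can be passed to MatchCompose.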
DB Seminar 30
How to aggregate the results from k matchers?
• Step 1: Schema Representation
• Step 2: Schema Tree Distinct Elements
• Step 3: Matching Algorithm (Matcher)
• Step 4: Aggregation of k matcher values
• Step 5: Selection
DB Seminar 31
How to combine similarity values from different matchers?
• Aggregate the values from the different matchers into a single similarity value
• Max: returns the maximum value over all matchers
• Weighted sum: weights are assigned according to the expected importance of the matchers
• Average
• Min
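The four aggregation strategies can be sketched as one function. Dividing the weighted sum by the total weight is an assumption made here so that the result stays between 0 and 1; the slides do not say how weights are normalised.

```python
def aggregate(values, strategy="average", weights=None):
    """Combine the k matcher values for one element pair into one value."""
    if strategy == "max":
        return max(values)
    if strategy == "min":
        return min(values)
    if strategy == "average":
        return sum(values) / len(values)
    if strategy == "weighted":
        # Assumption: normalise by the total weight.
        return sum(w * v for w, v in zip(weights, values)) / sum(weights)
    raise ValueError(f"unknown strategy: {strategy}")


values = [0.65, 0.78]  # e.g. TypeName and NamePath for one element pair
for strategy in ("max", "min", "average"):
    print(strategy, aggregate(values, strategy))
print("weighted", aggregate(values, "weighted", weights=[1, 3]))
```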
DB Seminar 32
Among so many combinations, how to select the set of results returned to the user?
• Step 1: Schema Representation
• Step 2: Schema Tree Distinct Elements
• Step 3: Matching Algorithm (Matcher)
• Step 4: Aggregation of k matcher values
• Step 5: Selection
DB Seminar 33
Select candidates from combined cube
• Direction of match candidate selection
• Given 2 schemas S1 and S2 with |S2| <= |S1|
• 3 directions: LargeSmall, SmallLarge, Both
• LargeSmall: match the large schema S1 against the small target S2, i.e. elements from S1 are ranked and selected with respect to each S2 element
DB Seminar 34
3 directions
Aggregated similarities (rows: large schema, columns: small schema)
              DeliverToAddress  BillToAddress
shipToCity    0.72              0.71
custCity      0.67              0.68
shipToStreet  0.52              0.6

LargeSmall                           SmallLarge                           Both
(for each small schema element)      (for each large schema element)      (LargeSmall + SmallLarge)
DeliverToAddress chooses shipToCity  shipToCity chooses DeliverToAddress  YES
BillToAddress chooses shipToCity     custCity chooses BillToAddress       NO
                                     shipToStreet chooses BillToAddress   NO
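The three directions can be reproduced from the similarity values on this slide. The sketch assumes each element picks its single best partner and that Both keeps only pairs chosen in both directions; the function names are illustrative.

```python
# Aggregated similarities: (large schema element, small schema element).
SIM = {
    ("shipToCity", "DeliverToAddress"): 0.72,
    ("shipToCity", "BillToAddress"): 0.71,
    ("custCity", "DeliverToAddress"): 0.67,
    ("custCity", "BillToAddress"): 0.68,
    ("shipToStreet", "DeliverToAddress"): 0.52,
    ("shipToStreet", "BillToAddress"): 0.6,
}
LARGE = ["shipToCity", "custCity", "shipToStreet"]
SMALL = ["DeliverToAddress", "BillToAddress"]


def large_small(sim):
    """For each small schema element, pick the best large schema element."""
    return {(max(LARGE, key=lambda l: sim[(l, s)]), s) for s in SMALL}


def small_large(sim):
    """For each large schema element, pick the best small schema element."""
    return {(l, max(SMALL, key=lambda s: sim[(l, s)])) for l in LARGE}


def both(sim):
    """Keep only the pairs selected in both directions."""
    return large_small(sim) & small_large(sim)


print(sorted(large_small(SIM)))
print(sorted(small_large(SIM)))
print(sorted(both(SIM)))  # [('shipToCity', 'DeliverToAddress')]
```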
DB Seminar 35
Selecting candidates (cont)
• Along one direction, 3 ways to select:
– MaxN: select the n candidates with the top similarity values
  • If n = 1, 1-to-1 correspondence
– MaxDelta: select the candidate with the maximal similarity Max and, given a tolerance value d, also those candidates with similarity > Max – d
  • Selects those that are almost maximal
– Threshold: all candidates with similarity > threshold t
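A minimal sketch of the three selection strategies, applied to the candidates for DeliverToAddress from the previous slide. The function names and the dictionary layout are assumptions of this sketch.

```python
def max_n(candidates, n):
    """Select the n candidates with the highest similarity."""
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:n]]


def max_delta(candidates, d):
    """Select the best candidate plus all within tolerance d of it."""
    top = max(candidates.values())
    return [
        name
        for name, sim in candidates.items()
        if sim == top or sim > top - d
    ]


def threshold(candidates, t):
    """Select every candidate whose similarity exceeds t."""
    return [name for name, sim in candidates.items() if sim > t]


# Candidates for DeliverToAddress.
candidates = {"shipToCity": 0.72, "custCity": 0.67, "shipToStreet": 0.52}

print(max_n(candidates, 1))        # ['shipToCity']
print(max_delta(candidates, 0.1))  # ['shipToCity', 'custCity']
print(threshold(candidates, 0.5))  # all three candidates
```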
DB Seminar 36
Evaluation
• Tested on 5 real-world purchase order schemas
– CIDX, Excel, Noris, Paragon and Apertum (from www.biztalk.org)
– |Inner or leaf nodes| != |paths|, because the schemas share some fragments
DB Seminar 37
Data Sets
• 5 schemas, 10 match tasks
• Real matches determined manually by domain experts
• #Matches = number of correspondences to be identified
• Shows the problem sizes
• Schema Similarity = #MatchedPaths / #AllPaths
DB Seminar 38
Evaluation – match quality
• Automatic match returns the set of matches P
• I is the set of real matches (determined by domain experts)
• c is the set of correctly identified matches, i.e. the overlap of P and I
• Precision = |c|/|P|: reliability of the match predictions
• Recall = |c|/|I|: share of real matches found
• Accuracy = Recall*(2-1/Precision)
• Accuracy estimates the labour saved, taking into account the effort to remove incorrect matches and to add missed matches
[Figure: Venn diagram of P and I with intersection c]
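A short sketch of the three quality measures. Accuracy is computed in the algebraically equivalent form (2|c| - |P|)/|I|, which avoids dividing by a zero Precision; the example match sets are made up for illustration.

```python
def match_quality(predicted, real):
    """Return (precision, recall, accuracy) for predicted vs. real matches."""
    predicted, real = set(predicted), set(real)
    if not predicted or not real:
        raise ValueError("both match sets must be non-empty")
    correct = predicted & real  # the set c
    precision = len(correct) / len(predicted)
    recall = len(correct) / len(real)
    # Recall * (2 - 1/Precision) = (2|c| - |P|) / |I|
    accuracy = (2 * len(correct) - len(predicted)) / len(real)
    return precision, recall, accuracy


# Hypothetical example: 5 real matches, 4 predicted, 3 of them correct.
real = {1, 2, 3, 4, 5}
predicted = {1, 2, 3, 9}
precision, recall, accuracy = match_quality(predicted, real)
print(precision, recall, accuracy)

# With Precision below 0.5, Accuracy becomes negative.
print(match_quality({1, 2, 7, 8, 9}, real))
```

A negative value means that fixing the automatic result would take more effort than matching manually from scratch.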
DB Seminar 39
Experimental result
• Only in automatic mode
• Conducted 12,312 series of experiments
– Different choices of matchers
– Different choices of direction, etc.
• Each combination runs on the 10 schema matching tasks (1<->2, …)
DB Seminar 40
Distribution of no-reuse matchers
[Figure: distribution of the no-reuse series by Accuracy]
• 1 series = 1 combination
• Most (7077) no-reuse matchers with Accuracy < 0
DB Seminar 41
Distribution w.r.t. aggregation
[Figure: distribution of Accuracy for each aggregation strategy]
DB Seminar 42
Distribution w.r.t. direction
[Figure: distribution of Accuracy for each direction]
DB Seminar 43
Distribution w.r.t. selection
[Figure: distribution of Accuracy for each selection strategy]
DB Seminar 44
Outline
• Introduction
• COMA system
• Overview of different matchers
• Reuse matcher from COMA
• Evaluation
• Conclusions
• Discussions
• References
DB Seminar 45
Conclusions
• COMA provides a framework for combining different matchers for different purposes
• A new matcher: the reuse-oriented matcher
DB Seminar 46
Discussions
• Most matches are 1:1; what about n:1 and n:m?
• Accuracy metric
• Is time a problem?
• To match 2 schemas, is A <-> B a must?
– What if A maps to B to some extent and B maps to A to another extent?
[Figure: match cardinalities. a <-> c and b <-> c are each (1:1) local but (2:1) global; {a, b} <-> c is (2:1) local]
DB Seminar 47
References
• [VLDB02] COMA - A system for flexible combination of schema matching approaches
– By Hong-Hai Do, Erhard Rahm
– University of Leipzig
• [ICDE02] Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching
– By Sergey Melnik, Hector Garcia-Molina, Erhard Rahm
– Stanford and University of Leipzig
• [VLDB02] Translating Web Data
– By Lucian Popa, Yannis Velegrakis, Renee J. Miller, et al.
– IBM Almaden Research Center and University of Toronto
DB Seminar 48
eNd
DB Seminar 49
Interactive mode
• In contrast with automatic mode
• User interacts with COMA in each iteration (optional)
• E.g.
– Specify which matchers to use (simple / hybrid)
– Accept / reject match candidates
• Improves match quality
DB Seminar 50
Simple Matcher
• EditDistance: similarity is computed from the number of edit operations needed to transform one string into the other
• Synonym: looks up the terminological relationship in a specified dictionary
• N-gram: compares the sequences of n characters contained in the names
• Soundex: based on phonetic similarity
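A hedged sketch of the EditDistance and N-gram matchers. Turning the edit distance into a similarity by dividing by the longer string length, and scoring n-grams with the Dice coefficient, are assumptions of this sketch; the slides do not fix either formula.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


def ngrams(s, n=3):
    """Set of all character sequences of length n in s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Dice coefficient over the n-gram sets (assumed scoring)."""
    ga, gb = ngrams(a.lower(), n), ngrams(b.lower(), n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


print(edit_distance("kitten", "sitting"))  # 3
print(edit_similarity("city", "cty"))
print(ngram_similarity("shipToStreet", "Street"))
```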
DB Seminar 51
Hybrid Matcher
• TypeName Matcher:
– DataType + Name Matcher
• Children Matcher:
– Leaves are compared with the TypeName Matcher
– To compare two non-leaf elements A and B, compare A's children with B's children
DB Seminar 52
Hybrid Matcher
• Leaves Matcher:
– Similar to the Children Matcher, but only considers the leaves, compared with the TypeName Matcher
– PO1.ShipTo.shipToStreet
– PO1.ShipTo.shipToCity
– PO2.DeliverTo.Address.Street
– PO2.DeliverTo.Address.City
– Comparing ShipTo with DeliverTo using the Children Matcher would compare shipToStreet with Address!