TRANSCRIPT
AnHai Doan
Database & Data Mining Group
University of Washington, Seattle
Spring 2002
Learning to Map between Structured Representations of Data
2
New faculty member
Find houses with 2 bedrooms priced under 200K
realestate.com, homes.com, homeseekers.com
Data Integration Challenge
3
Architecture of Data Integration System
[Diagram: the query goes to the mediated schema, which maps to source schemas 1, 2, 3 over realestate.com, homes.com, homeseekers.com]
Find houses with 2 bedrooms priced under 200K
4
Semantic Mappings between Schemas
Mediated schema: price, agent-name, address
homes.com schema: listed-price, contact-name, city, state
320K | Jane Brown | Seattle | WA
240K | Mike Smith | Miami | FL
1-1 mapping (e.g., price = listed-price); complex mapping (e.g., address = concat(city, state))
5
Schema Matching is Ubiquitous!
Fundamental problem in numerous applications
Databases
– data integration
– data translation
– schema/view integration
– data warehousing
– semantic query processing
– model management
– peer data management
AI
– knowledge bases, ontology merging, information gathering agents, ...
Web
– e-commerce
– marking up data using ontologies (Semantic Web)
6
Why Schema Matching is Difficult
Schema & data never fully capture semantics!
– not adequately documented
Must rely on clues in schema & data
– using names, structures, types, data values, etc.
Such clues can be unreliable
– same names => different entities: area => location or square-feet
– different names => same entity: area & address => location
Intended semantics can be subjective
– house-style = house-description?
Cannot be fully automated, needs user feedback!
7
Current State of Affairs
Finding semantic mappings is now a key bottleneck!
– largely done by hand
– labor intensive & error prone
– data integration at GTE [Li&Clifton, 2000]
  – 40 databases, 27000 elements, estimated time: 12 years
Will only be exacerbated
– data sharing becomes pervasive
– translation of legacy data
Need semi-automatic approaches to scale up!
Many current research projects
– Databases: IBM Almaden, Microsoft Research, BYU, George Mason, U of Leipzig, ...
– AI: Stanford, Karlsruhe University, NEC Japan, ...
8
Goals and Contributions
Vision for schema-matching tools
– learn from previous matching activities
– exploit multiple types of information
– incorporate domain integrity constraints
– handle user feedback
My contributions: solution for semi-automatic schema matching
– can match relational schemas, DTDs, ontologies, ...
– discovers both 1-1 & complex mappings
– highly modular & extensible
– achieves high matching accuracy (66 -- 97%) on real-world data
9
Road Map
Introduction
Schema matching [SIGMOD-01]
– 1-1 mappings for data integration
– LSD (Learning Source Description) system
  – learns from previous matching activities
  – employs multi-strategy learning
  – exploits domain constraints & user feedback
Creating complex mappings [Tech. Report-02]
Ontology matching [WWW-02]
Conclusions
10
Schema Matching for Data Integration: the LSD Approach
Suppose user wants to integrate 100 data sources
1. User
– manually creates mappings for a few sources, say 3
– shows LSD these mappings
2. LSD learns from the mappings
3. LSD predicts mappings for remaining 97 sources
11
Learning from the Manual Mappings
Mediated schema: price, agent-name, agent-phone, office-phone, description
Schema of realestate.com: listed-price, contact-name, contact-phone, office, comments
$250K | James Smith | (305) 729 0831 | (305) 616 1822 | Fantastic house
$320K | Mike Doan | (617) 253 1429 | (617) 112 2315 | Great location
If “office” occurs in name => office-phone
If “fantastic” & “great” occur frequently in data instances => description
homes.com: sold-at, contact-agent, extra-info
$350K | (206) 634 9435 | Beautiful yard
$230K | (617) 335 4243 | Close to Seattle
12
Must Exploit Multiple Types of Information!
[Same example as the previous slide: mediated schema, realestate.com schema & data, homes.com schema & data, with the name-based clue (“office” in name => office-phone) and the data-based clue (“fantastic” & “great” in instances => description)]
13
Multi-Strategy Learning
Use a set of base learners
– each exploits well certain types of information
To match a schema element of a new source
– apply base learners
– combine their predictions using a meta-learner
Meta-learner
– uses training sources to measure base learner accuracy
– weighs each learner based on its accuracy
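The base-learner idea above can be made concrete with a small sketch (the class names, the train/predict interface, and the toy keyword learner are illustrative assumptions, not LSD's actual code): each learner trains on (value, mediated-schema label) pairs and returns per-label confidences that a meta-learner can later combine.

```python
class BaseLearner:
    """Illustrative interface: train on (value, label) pairs, then
    return {mediated-schema label: confidence} for a new value."""
    def train(self, examples):
        raise NotImplementedError
    def predict(self, value):
        raise NotImplementedError

class KeywordLearner(BaseLearner):
    """Toy learner: each token votes for the label it was seen with."""
    def __init__(self):
        self.keyword_label = {}

    def train(self, examples):
        for value, label in examples:
            for token in value.lower().split():
                self.keyword_label[token] = label

    def predict(self, value):
        tokens = value.lower().split()
        votes = {}
        for token in tokens:
            label = self.keyword_label.get(token)
            if label is not None:
                votes[label] = votes.get(label, 0) + 1
        # confidence = fraction of tokens voting for the label
        return {label: n / len(tokens) for label, n in votes.items()}

learner = KeywordLearner()
learner.train([("Fantastic house", "description"), ("Seattle WA", "address")])
print(learner.predict("fantastic view"))   # {'description': 0.5}
```

Any learner exposing this shape of output (labels weighted by confidence) can be plugged into the meta-learner described above.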
14
Base Learners
Name Learner
– training: (“location”, address), (“contact name”, name)
– matching: agent-name => (name, 0.7), (phone, 0.3)
Naive Bayes Learner
– training: (“Seattle, WA”, address), (“250K”, price)
– matching: “Kent, WA” => (address, 0.8), (name, 0.2)
[Diagram: training examples (X1, C1), (X2, C2), ..., (Xm, Cm) => classification model (hypothesis); observed object X => labels weighted by confidence score]
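A bare-bones version of the Naive Bayes base learner can be sketched as follows (an illustration of the technique, not LSD's implementation): tokenize data instances and score each mediated-schema label by P(label) · Π P(token | label), with add-one smoothing.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesLearner:
    """Toy word-level Naive Bayes over data instances (a sketch)."""
    def train(self, examples):                      # [(value, label), ...]
        self.label_counts = Counter(lbl for _, lbl in examples)
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for value, lbl in examples:
            tokens = value.lower().split()
            self.token_counts[lbl].update(tokens)
            self.vocab.update(tokens)

    def predict(self, value):
        tokens = value.lower().split()
        total = sum(self.label_counts.values())
        scores = {}
        for lbl, n in self.label_counts.items():
            logp = math.log(n / total)              # log P(label)
            denom = sum(self.token_counts[lbl].values()) + len(self.vocab)
            for t in tokens:                        # add-one smoothing
                logp += math.log((self.token_counts[lbl][t] + 1) / denom)
            scores[lbl] = logp
        # turn log-probabilities into normalized confidences
        m = max(scores.values())
        exp = {lbl: math.exp(s - m) for lbl, s in scores.items()}
        z = sum(exp.values())
        return {lbl: e / z for lbl, e in exp.items()}

nb = NaiveBayesLearner()
nb.train([("Seattle, WA", "address"), ("250K", "price")])
pred = nb.predict("Kent, WA")   # "WA" was seen with address => address wins
```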
15
The LSD Architecture
Training Phase: mediated schema + source schemas => training data for base learners => Base-Learner1 ... Base-Learnerk & Meta-Learner => Hypothesis1 ... Hypothesisk & weights for base learners
Matching Phase: Base-Learner1 ... Base-Learnerk => Meta-Learner => predictions for instances => Prediction Combiner => predictions for elements => Constraint Handler (+ domain constraints) => mappings
16
Training the Base Learners
Mediated schema: address, price, agent-name, agent-phone, office-phone, description
realestate.com: location, price, contact-name, contact-phone, office, comments
Miami, FL | $250K | James Smith | (305) 729 0831 | (305) 616 1822 | Fantastic house
Boston, MA | $320K | Mike Doan | (617) 253 1429 | (617) 112 2315 | Great location
Name Learner training examples: (“location”, address), (“price”, price), (“contact name”, agent-name), (“contact phone”, agent-phone), (“office”, office-phone), (“comments”, description)
Naive Bayes training examples: (“Miami, FL”, address), (“$250K”, price), (“James Smith”, agent-name), (“(305) 729 0831”, agent-phone), (“(305) 616 1822”, office-phone), (“Fantastic house”, description), (“Boston, MA”, address), ...
17
Meta-Learner: Stacking [Wolpert 92, Ting&Witten 99]
Training
– uses training data to learn weights, one for each (base-learner, mediated-schema element) pair
– e.g., weight(Name-Learner, address) = 0.2, weight(Naive-Bayes, address) = 0.8
Matching: combine predictions of base learners
– computes weighted average of base-learner confidence scores
Example: source element area, instances “Seattle, WA”, “Kent, WA”, “Bend, OR”
– Name Learner: (address, 0.4); Naive Bayes: (address, 0.9)
– Meta-Learner: (address, 0.4*0.2 + 0.9*0.8 = 0.8)
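The stacking computation above can be written out directly (a sketch; the function and dictionary names are illustrative):

```python
def combine(predictions, weights):
    """Weighted average of base-learner confidence scores.
    predictions: {learner: {label: confidence}}
    weights:     {(learner, label): weight}"""
    labels = {label for scores in predictions.values() for label in scores}
    return {
        label: sum(weights.get((learner, label), 0.0) * scores.get(label, 0.0)
                   for learner, scores in predictions.items())
        for label in labels
    }

# The slide's example: two base learners scoring the element "area".
predictions = {"Name-Learner": {"address": 0.4},
               "Naive-Bayes": {"address": 0.9}}
weights = {("Name-Learner", "address"): 0.2,
           ("Naive-Bayes", "address"): 0.8}
combined = combine(predictions, weights)   # address: 0.4*0.2 + 0.9*0.8 ≈ 0.8
```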
18
The LSD Architecture
[The training-phase / matching-phase architecture diagram from slide 15, shown again]
19
Applying the Learners
homes.com schema: area, sold-at, contact-agent, extra-info
For area (instances “Seattle, WA”, “Kent, WA”, “Bend, OR”):
– Name Learner & Naive Bayes => Meta-Learner, per instance: (address, 0.8), (description, 0.2); (address, 0.6), (description, 0.4); (address, 0.7), (description, 0.3)
– Prediction Combiner => (address, 0.7), (description, 0.3)
Similarly:
– sold-at => (price, 0.9), (agent-phone, 0.1)
– contact-agent => (agent-phone, 0.9), (description, 0.1)
– extra-info => (address, 0.6), (description, 0.4)
20
Domain Constraints
Encode user knowledge about domain
Specified by examining mediated schema
Examples
– at most one source-schema element can match address
– if a source-schema element matches house-id then it is a key
– avg-value(price) > avg-value(num-baths)
Given a mapping combination
– can verify if it satisfies a given constraint
Example combination: area = address, sold-at = price, contact-agent = agent-phone, extra-info = address
21
The Constraint Handler
Predictions from Prediction Combiner:
– area: (address, 0.7), (description, 0.3)
– sold-at: (price, 0.9), (agent-phone, 0.1)
– contact-agent: (agent-phone, 0.9), (description, 0.1)
– extra-info: (address, 0.6), (description, 0.4)
Domain constraint: at most one element matches address
Scoring mapping combinations (product of confidences):
– area = address, sold-at = price, contact-agent = agent-phone, extra-info = address => 0.7 * 0.9 * 0.9 * 0.6 = 0.3402 (violates the constraint)
– area = address, sold-at = price, contact-agent = agent-phone, extra-info = description => 0.7 * 0.9 * 0.9 * 0.4 = 0.2268
– area = description, sold-at = agent-phone, contact-agent = description, extra-info = description => 0.3 * 0.1 * 0.1 * 0.4 = 0.0012
Searches space of mapping combinations efficiently
Can handle arbitrary constraints
Also used to incorporate user feedback
– e.g., sold-at does not match price
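The handler's job on this example can be sketched as a brute-force version of the search (the real system searches the space efficiently rather than enumerating it; the names here are illustrative):

```python
from itertools import product

# Per-element predictions from the Prediction Combiner (slide's numbers).
predictions = {
    "area":          [("address", 0.7), ("description", 0.3)],
    "sold-at":       [("price", 0.9), ("agent-phone", 0.1)],
    "contact-agent": [("agent-phone", 0.9), ("description", 0.1)],
    "extra-info":    [("address", 0.6), ("description", 0.4)],
}

def at_most_one_address(mapping):
    """Domain constraint: at most one element matches address."""
    return sum(1 for lbl in mapping.values() if lbl == "address") <= 1

def best_mapping(predictions, constraints):
    """Pick the highest-scoring combination satisfying all constraints,
    scoring a combination as the product of its confidences."""
    elements = list(predictions)
    best, best_score = None, -1.0
    for choice in product(*(predictions[e] for e in elements)):
        mapping = {e: label for e, (label, _) in zip(elements, choice)}
        score = 1.0
        for _, conf in choice:
            score *= conf
        if score > best_score and all(c(mapping) for c in constraints):
            best, best_score = mapping, score
    return best, best_score

mapping, score = best_mapping(predictions, [at_most_one_address])
# The 0.3402 combination is rejected (two elements match address),
# so extra-info falls back to description, score 0.7*0.9*0.9*0.4 = 0.2268.
```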
22
The Current LSD System
Can also handle data in XML format
– matches XML DTDs
Base learners
– Naive Bayes [Duda&Hart-93, Domingos&Pazzani-97]
  – exploits frequencies of words & symbols
– WHIRL Nearest-Neighbor Classifier [Cohen&Hirsh KDD-98]
  – employs information-retrieval similarity metric
– Name Learner [SIGMOD-01]
  – matches elements based on their names
– County-Name Recognizer [SIGMOD-01]
  – stores all U.S. county names
– XML Learner [SIGMOD-01]
  – exploits hierarchical structure of XML data
23
Empirical Evaluation
Four domains
– Real Estate I & II, Course Offerings, Faculty Listings
For each domain
– created mediated schema & domain constraints
– chose five sources
– extracted & converted data into XML
– mediated schemas: 14 - 66 elements, source schemas: 13 - 48
Ten runs for each domain; in each run:
– manually provided 1-1 mappings for 3 sources
– asked LSD to propose mappings for remaining 2 sources
– accuracy = % of 1-1 mappings correctly identified
24
High Matching Accuracy
[Bar chart: average matching accuracy (%) for Real Estate I, Real Estate II, Course Offerings, Faculty Listings]
LSD’s accuracy: 71 - 92%
Best single base learner: 42 - 72%
+ Meta-learner: + 5 - 22%
+ Constraint handler: + 7 - 13%
+ XML learner: + 0.8 - 6%
25
Contribution of Schema vs. Data
[Bar chart: average matching accuracy (%) on Real Estate I, Real Estate II, Course Offerings, Faculty Listings, comparing LSD with only schema info., LSD with only data info., and the complete LSD]
More experiments in [Doan et al. SIGMOD-01]
26
LSD Summary
LSD
– learns from previous matching activities
– exploits multiple types of information
  – by employing multi-strategy learning
– incorporates domain constraints & user feedback
– achieves high matching accuracy
LSD focuses on 1-1 mappings
Next challenge: discover more complex mappings!
– COMAP (Complex Mapping) system
27
The COMAP Approach
Mediated schema: price, num-baths, address
homes.com: listed-price, agent-id, full-baths, half-baths, city, zipcode
320K | 53211 | 2 | 1 | Seattle | 98105
240K | 11578 | 1 | 1 | Miami | 23591
For each mediated-schema element
– searches space of all mappings
– finds a small set of likely mapping candidates
– uses LSD to evaluate them
To search efficiently
– employs a specialized searcher for each element type
– Text Searcher, Numeric Searcher, Category Searcher, ...
28
The COMAP Architecture [Doan et al., 02]
[Diagram: mediated schema + source schema & data => Searcher1, Searcher2, ..., Searcherk => mapping candidates => LSD (Base-Learner1 ... Base-Learnerk => Meta-Learner) => Prediction Combiner => Constraint Handler (+ domain constraints) => mappings]
29
An Example: Text Searcher
Beam search in space of all concatenation mappings
Example: find mapping candidates for address
Mediated schema: price, num-baths, address
homes.com: listed-price, agent-id, full-baths, half-baths, city, zipcode
320K | 532a | 2 | 1 | Seattle | 98105
240K | 115c | 1 | 1 | Miami | 23591
Candidate columns:
– concat(agent-id, zipcode): “532a 98105”, “115c 23591”
– concat(city, zipcode): “Seattle 98105”, “Miami 23591”
– concat(agent-id, city): “532a Seattle”, “115c Miami”
Best mapping candidates for address
– (agent-id, 0.7), (concat(agent-id, city), 0.75), (concat(city, zipcode), 0.9)
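A beam search over concatenation mappings might look like the following sketch (the scoring function and its token set are toy assumptions standing in for LSD's learner-based evaluation):

```python
def beam_search_concat(columns, score, beam_width=3, max_cols=2):
    """columns: {column name: list of values}. Grow concatenation
    candidates one column at a time, keeping only the top beam_width."""
    beam = [((name,), vals) for name, vals in columns.items()]
    beam.sort(key=lambda cand: score(cand[1]), reverse=True)
    beam = beam[:beam_width]
    candidates = list(beam)
    for _ in range(max_cols - 1):
        expanded = []
        for names, vals in beam:
            for name, extra in columns.items():
                if name in names:
                    continue
                merged = [a + " " + b for a, b in zip(vals, extra)]
                expanded.append((names + (name,), merged))
        expanded.sort(key=lambda cand: score(cand[1]), reverse=True)
        beam = expanded[:beam_width]
        candidates.extend(beam)
    return [(names, score(vals)) for names, vals in candidates]

# Toy score: fraction of tokens seen in known address instances.
address_tokens = {"seattle", "miami", "98105", "23591"}
def address_score(vals):
    tokens = [t.lower() for v in vals for t in v.split()]
    return sum(t in address_tokens for t in tokens) / len(tokens)

columns = {"agent-id": ["532a", "115c"],
           "city": ["Seattle", "Miami"],
           "zipcode": ["98105", "23591"]}
results = beam_search_concat(columns, address_score)
```

With this toy score, concat(city, zipcode) surfaces as a top candidate because every one of its tokens looks address-like.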
30
Empirical Evaluation
Current COMAP system
– eight searchers
Three real-world domains
– in real estate & product inventory
– mediated schema: 6 -- 26 elements, source schema: 16 -- 31
Accuracy: 62 -- 97%
Sample discovered mappings
– agent-name = concat(first-name, last-name)
– area = building-area / 43560
– discount-cost = (unit-price * quantity) * (1 - discount)
31
Road Map
Introduction
Schema matching
– LSD system
Creating complex mappings
– COMAP system
Ontology matching
– GLUE system
Conclusions
32
Ontology Matching
Increasingly critical for
– knowledge bases, Semantic Web
An ontology
– concepts organized into a taxonomy tree
– each concept has
  – a set of attributes
  – a set of instances
– relations among concepts
Matching
– concepts
– attributes
– relations
[Taxonomy, CS Dept. US: Entity => {Undergrad Courses, Grad Courses, People}; People => {Faculty, Staff}; Faculty => {Assistant Professor, Associate Professor, Professor}; example instance: name: Mike Burns, degree: Ph.D.]
33
Matching Taxonomies of Concepts
CS Dept. Australia: Entity => {Courses, Staff}; Staff => {Academic Staff, Technical Staff}; Academic Staff => {Lecturer, Senior Lecturer, Professor}
CS Dept. US: Entity => {Undergrad Courses, Grad Courses, People}; People => {Faculty, Staff}; Faculty => {Assistant Professor, Associate Professor, Professor}
34
Constraints in Taxonomy Matching
Domain-dependent
– at most one node matches department-chair
– a node that matches professor cannot be a child of a node that matches assistant-professor
Domain-independent
– two nodes match if parents & children match
– if all children of X match Y, then X also matches Y
– variations have been exploited in many restricted settings [Melnik&Garcia-Molina, ICDE-02], [Milo&Zohar, VLDB-98], [Noy et al., IJCAI-01], [Madhavan et al., VLDB-01]
Challenge: find a general & efficient approach
35
Solution: Relaxation Labeling
Relaxation labeling [Hummel&Zucker, 83]
– applied to graph labeling in vision, NLP, hypertext classification
– finds best label assignment, given a set of constraints
– starts with initial label assignment
– iteratively improves labels, using constraints
Standard relaxation labeling not applicable
– extended it in many ways [Doan et al., WWW-02]
Experiments
– three real-world domains in course catalog & company listings
– 30 -- 300 nodes / taxonomy
– accuracy 66 -- 97% vs. 52 -- 83% of best base learner
– relaxation labeling very fast (under a few seconds)
36
Related Work
Hand-crafted rules; exploit schema; 1-1 mapping:
– TRANSCM [Milo&Zohar 98], ARTEMIS [Castano&Antonellis 99], [Palopoli et al. 98], CUPID [Madhavan et al. 01], PROMPT [Noy et al. 00]
Single learner; exploit data; 1-1 mapping:
– SEMINT [Li&Clifton 94], ILA [Perkowitz&Etzioni 95], DELTA [Clifton et al. 97]
Rules; exploit data; 1-1 + complex mapping:
– CLIO [Miller et al., 00], [Yan et al. 01]
Learners + rules, multi-strategy learning; exploit schema + data; 1-1 + complex mapping; exploit domain constraints:
– LSD [Doan et al., SIGMOD-01], COMAP [Doan et al. 2002, submitted], GLUE [Doan et al., WWW-02]
37
Future Work
Learning source descriptions
– formal semantics for mapping
– query capabilities, source schema, scope, reliability of data, ...
Dealing with changes in source description
Matching objects across sources
More sophisticated user feedback
Focus on distributed information management systems
– data integration, web-service integration, peer data management
– goal: significantly reduce complexity of construction & maintenance
38
Conclusions
Efficiently creating semantic mappings is critical
Developed solution for semi-automatic schema matching
– learns from previous matching activities
– can match relational schemas, DTDs, ontologies, ...
– discovers both 1-1 & complex mappings
– highly modular & extensible
– achieves high matching accuracy
Made contributions to machine learning
– developed novel method to classify XML data
– extended relaxation labeling
39
Backup Slides
40
Training the Meta-Learner
Extracted XML instances: <location> Miami, FL </>, <listed-price> $250,000 </>, <area> Seattle, WA </>, <house-addr> Kent, WA </>, <num-baths> 3 </>, ...
For address:
Name Learner | Naive Bayes | True predictions
0.5 | 0.8 | 1
0.4 | 0.3 | 0
0.3 | 0.9 | 1
0.6 | 0.8 | 1
0.3 | 0.3 | 0
Least-squares linear regression =>
Weight(Name-Learner, address) = 0.1
Weight(Naive-Bayes, address) = 0.9
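The weight fitting can be reproduced with ordinary least squares on the table's numbers (a sketch; LSD's exact regression setup may differ, so the fitted values need not equal the slide's 0.1 / 0.9):

```python
import numpy as np

# Rows: training instances. Columns: [Name Learner, Naive Bayes]
# confidence scores for the label "address".
X = np.array([[0.5, 0.8],
              [0.4, 0.3],
              [0.3, 0.9],
              [0.6, 0.8],
              [0.3, 0.3]])
# 1 if the instance truly matches address, else 0.
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])

# Solve min ||Xw - y||^2: one weight per base learner.
weights, *_ = np.linalg.lstsq(X, y, rcond=None)
# weights[0] -> weight(Name-Learner, address)
# weights[1] -> weight(Naive-Bayes, address)
```

On this data the Naive Bayes column tracks the truth much better than the Name Learner column, so it receives the larger weight.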
41
Sensitivity to Amount of Available Data
[Line chart: average matching accuracy (%) vs. number of data listings per source, 0 - 500 (Real Estate I)]
42
Contribution of Each Component
[Bar chart: average matching accuracy (%) on Real Estate I, Course Offerings, Faculty Listings, Real Estate II, comparing: Without Name Learner, Without Naive Bayes, Without Whirl Learner, Without Constraint Handler, and the complete LSD system]
43
Exploiting Hierarchical Structure
Existing learners flatten out all structures
– <description> Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors. </description>
Developed XML learner
– similar to the Naive Bayes learner
  – input instance = bag of tokens
– differs in one crucial aspect
  – considers not only text tokens, but also structure tokens
– <contact> <name> Gail Murphy </name> <firm> MAX Realtors </firm> </contact>
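The "structure token" idea can be sketched as follows (an illustration, not the XML learner's actual tokenization): alongside the text tokens, emit one token per element path so the hierarchy is not flattened away.

```python
import xml.etree.ElementTree as ET

def xml_tokens(xml_string):
    """Return text tokens plus a path token for each element."""
    root = ET.fromstring(xml_string)
    tokens = []
    def walk(node, path):
        path = path + "/" + node.tag
        tokens.append(path)                      # structure token
        if node.text:
            tokens.extend(node.text.split())     # text tokens
        for child in node:
            walk(child, path)
    walk(root, "")
    return tokens

toks = xml_tokens(
    "<contact><name>Gail Murphy</name><firm>MAX Realtors</firm></contact>"
)
# ['/contact', '/contact/name', 'Gail', 'Murphy',
#  '/contact/firm', 'MAX', 'Realtors']
```

A Naive Bayes model over this token bag can then learn, e.g., that "Gail Murphy" under a /contact/name path signals a contact element rather than a free-text description.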
44
Reasons for Incorrect Matchings
Unfamiliarity
– suburb
– solution: add a suburb-name recognizer
Insufficient information
– correctly identified general type, failed to pinpoint exact type
– agent-name | phone: Richard Smith | (206) 234 5412
– solution: add a proximity learner
Subjectivity
– house-style = description?
– Victorian | Beautiful neo-gothic house; Mexican | Great location
45
Evaluate Mapping Candidates
For address, Text Searcher returns
– (agent-id, 0.7)
– (concat(agent-id, city), 0.8)
– (concat(city, zipcode), 0.75)
Employ multi-strategy learning to evaluate mappings
Example: (concat(agent-id, city), 0.8)
– Naive Bayes Learner: 0.8
– Name Learner: “address” vs. “agent id city” => 0.3
– Meta-Learner: 0.8 * 0.7 + 0.3 * 0.3 = 0.65
Meta-Learner returns
– (agent-id, 0.59)
– (concat(agent-id, city), 0.65)
– (concat(city, zipcode), 0.70)
46
Relaxation Labeling
[Diagram: assigning labels between the Dept Australia taxonomy (Courses, Staff, Acad. Staff, Tech. Staff) and the Dept U.S. taxonomy (Courses, People, Faculty, Staff)]
Applied to similar problems in
– vision, NLP, hypertext classification
47
Relaxation Labeling for Taxonomy Matching
Must define
– neighborhood of a node
– k features of the neighborhood
– how to combine influence of the features: P(N = L | f1, f2, ..., fk)
Algorithm
– init: for each pair <N, L>, compute P(N = L)
– loop: for each pair <N, L>, re-compute P(N = L | f1, f2, ..., fk) = Σ_M P(N = L | M) · P(M), where M ranges over neighborhood configurations
Example neighborhood configuration: Acad. Staff = Faculty, Tech. Staff = Staff, Staff = People
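The init/loop skeleton can be sketched generically (a plain relaxation-labeling update, not GLUE's extended version; the compatibility function is a stand-in for the feature-based probabilities above):

```python
def relaxation_labeling(nodes, labels, neighbors, init, compat, iters=20):
    """init: {(node, label): prob};
    compat(n, l, m, lm): support that neighbor m having label lm
    lends to node n having label l."""
    p = dict(init)
    for _ in range(iters):
        updated = {}
        for n in nodes:
            scores = {}
            for l in labels:
                support = sum(compat(n, l, m, lm) * p[(m, lm)]
                              for m in neighbors[n] for lm in labels)
                scores[l] = p[(n, l)] * (1.0 + support)
            z = sum(scores.values())
            for l in labels:                  # renormalize per node
                updated[(n, l)] = scores[l] / z
        p = updated
    return p

# Toy example: two sibling nodes, and a constraint that siblings
# should take distinct labels.
nodes = ["Acad. Staff", "Tech. Staff"]
labels = ["Faculty", "Staff"]
neighbors = {"Acad. Staff": ["Tech. Staff"], "Tech. Staff": ["Acad. Staff"]}
init = {("Acad. Staff", "Faculty"): 0.6, ("Acad. Staff", "Staff"): 0.4,
        ("Tech. Staff", "Faculty"): 0.5, ("Tech. Staff", "Staff"): 0.5}

def distinct_labels(n, l, m, lm):
    return 1.0 if l != lm else 0.0

p = relaxation_labeling(nodes, labels, neighbors, init, distinct_labels)
```

Starting from the slight initial preference of Acad. Staff for Faculty, the iterations push Acad. Staff toward Faculty and Tech. Staff toward Staff.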
48
Relaxation Labeling for Taxonomy Matching
Huge number of neighborhood configurations!
– typically neighborhood = immediate nodes
– here neighborhood can be entire graph
– 100 nodes, 10 labels => 10^100 configurations
Solution
– label abstraction + dynamic programming
– guarantees quadratic time for a broad range of domain constraints
Empirical evaluation
– GLUE system [Doan et al., WWW-02]
– three real-world domains
– 30 -- 300 nodes / taxonomy
– high accuracy 66 -- 97% vs. 52 -- 83% of best base learner
– relaxation labeling very fast, finished in several seconds