AnHai Doan, Database & Data Mining Group, University of Washington, Seattle, Spring 2002: Learning to Map between Structured Representations of Data


Page 1: Learning to Map between Structured Representations of Data

AnHai Doan

Database & Data Mining Group

University of Washington, Seattle

Spring 2002

Page 2: Data Integration Challenge

New faculty member

Find houses with 2 bedrooms priced under 200K

homes.com, realestate.com, homeseekers.com

Page 3: Architecture of Data Integration System

Find houses with 2 bedrooms priced under 200K

mediated schema

source schema 1, source schema 2, source schema 3

homes.com, realestate.com, homeseekers.com

Page 4: Semantic Mappings between Schemas

Mediated schema: price, agent-name, address

homes.com

listed-price   contact-name   city      state
320K           Jane Brown     Seattle   WA
240K           Mike Smith     Miami     FL

1-1 mapping, complex mapping

Page 5: Schema Matching is Ubiquitous!

Fundamental problem in numerous applications

Databases
– data integration
– data translation
– schema/view integration
– data warehousing
– semantic query processing
– model management
– peer data management

AI
– knowledge bases, ontology merging, information gathering agents, ...

Web
– e-commerce
– marking up data using ontologies (Semantic Web)

Page 6: Why Schema Matching is Difficult

Schema & data never fully capture semantics!
– not adequately documented

Must rely on clues in schema & data
– using names, structures, types, data values, etc.

Such clues can be unreliable
– same names => different entities: area => location or square-feet
– different names => same entity: area & address => location

Intended semantics can be subjective
– house-style = house-description?

Cannot be fully automated, needs user feedback!

Page 7: Current State of Affairs

Finding semantic mappings is now a key bottleneck!
– largely done by hand
– labor intensive & error prone
– data integration at GTE [Li&Clifton, 2000]
  – 40 databases, 27000 elements, estimated time: 12 years

Will only be exacerbated
– data sharing becomes pervasive
– translation of legacy data

Need semi-automatic approaches to scale up!

Many current research projects
– Databases: IBM Almaden, Microsoft Research, BYU, George Mason, U of Leipzig, ...
– AI: Stanford, Karlsruhe University, NEC Japan, ...

Page 8: Goals and Contributions

Vision for schema-matching tools
– learn from previous matching activities
– exploit multiple types of information
– incorporate domain integrity constraints
– handle user feedback

My contributions: solution for semi-automatic schema matching
– can match relational schemas, DTDs, ontologies, ...
– discovers both 1-1 & complex mappings
– highly modular & extensible
– achieves high matching accuracy (66 - 97%) on real-world data

Page 9: Road Map

Introduction

Schema matching [SIGMOD-01]
– 1-1 mappings for data integration
– LSD (Learning Source Description) system
  – learns from previous matching activities
  – employs multi-strategy learning
  – exploits domain constraints & user feedback

Creating complex mappings [Tech. Report-02]

Ontology matching [WWW-02]

Conclusions

Page 10: Schema Matching for Data Integration: the LSD Approach

Suppose user wants to integrate 100 data sources

1. User
– manually creates mappings for a few sources, say 3
– shows LSD these mappings

2. LSD learns from the mappings

3. LSD predicts mappings for remaining 97 sources

Page 11: Learning from the Manual Mappings

Mediated schema: price, agent-name, agent-phone, office-phone, description

Schema of realestate.com: listed-price, contact-name, contact-phone, office, comments

realestate.com

listed-price   contact-name   contact-phone    office           comments
$250K          James Smith    (305) 729 0831   (305) 616 1822   Fantastic house
$320K          Mike Doan      (617) 253 1429   (617) 112 2315   Great location

homes.com

sold-at   contact-agent    extra-info
$350K     (206) 634 9435   Beautiful yard
$230K     (617) 335 4243   Close to Seattle

If “fantastic” & “great” occur frequently in data instances => description

If “office” occurs in name => office-phone

Page 12: Must Exploit Multiple Types of Information!

Mediated schema: price, agent-name, agent-phone, office-phone, description

Schema of realestate.com: listed-price, contact-name, contact-phone, office, comments

realestate.com

listed-price   contact-name   contact-phone    office           comments
$250K          James Smith    (305) 729 0831   (305) 616 1822   Fantastic house
$320K          Mike Doan      (617) 253 1429   (617) 112 2315   Great location

homes.com

sold-at   contact-agent    extra-info
$350K     (206) 634 9435   Beautiful yard
$230K     (617) 335 4243   Close to Seattle

If “fantastic” & “great” occur frequently in data instances => description

If “office” occurs in name => office-phone

Page 13: Multi-Strategy Learning

Use a set of base learners
– each exploits certain types of information well

To match a schema element of a new source
– apply base learners
– combine their predictions using a meta-learner

Meta-learner
– uses training sources to measure base learner accuracy
– weighs each learner based on its accuracy

Page 14: Base Learners

Training
– training examples (X1,C1), (X2,C2), ..., (Xm,Cm), where each Xi is an object and Ci its observed label
– produces a classification model (hypothesis)

Matching
– object X => labels weighted by confidence score

Name Learner
– training: (“location”, address), (“contact name”, name)
– matching: agent-name => (name,0.7), (phone,0.3)

Naive Bayes Learner
– training: (“Seattle, WA”, address), (“250K”, price)
– matching: “Kent, WA” => (address,0.8), (name,0.2)

Page 15: The LSD Architecture

Training Phase
– mediated schema, source schemas
– training data for base learners
– Base-Learner1 ... Base-Learnerk => Hypothesis1 ... Hypothesisk
– Meta-Learner => weights for base learners

Matching Phase
– Base-Learner1 .... Base-Learnerk
– Meta-Learner
– Prediction Combiner: predictions for instances => predictions for elements
– Constraint Handler, with domain constraints => mappings

Page 16: Training the Base Learners

Mediated schema: address, price, agent-name, agent-phone, office-phone, description

realestate.com

location     price   contact-name   contact-phone    office           comments
Miami, FL    $250K   James Smith    (305) 729 0831   (305) 616 1822   Fantastic house
Boston, MA   $320K   Mike Doan      (617) 253 1429   (617) 112 2315   Great location

Name Learner
– (“location”, address)
– (“price”, price)
– (“contact name”, agent-name)
– (“contact phone”, agent-phone)
– (“office”, office-phone)
– (“comments”, description)

Naive Bayes Learner
– (“Miami, FL”, address)
– (“$250K”, price)
– (“James Smith”, agent-name)
– (“(305) 729 0831”, agent-phone)
– (“(305) 616 1822”, office-phone)
– (“Fantastic house”, description)
– (“Boston, MA”, address)
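
To make the Naive Bayes training pairs on this slide concrete, here is a minimal sketch of a token-based Naive Bayes classifier trained on them. The class name `NaiveBayesLearner`, the tokenizer, and the add-one smoothing are assumptions made for illustration; the slides only say that the learner exploits frequencies of words & symbols, so this is not LSD's actual implementation and its confidence scores will not match the numbers on the slides.

```python
import math
import re
from collections import Counter, defaultdict

# Training pairs copied from the slide: (data instance, mediated-schema label).
TRAINING = [
    ("Miami, FL", "address"),
    ("$250K", "price"),
    ("James Smith", "agent-name"),
    ("(305) 729 0831", "agent-phone"),
    ("(305) 616 1822", "office-phone"),
    ("Fantastic house", "description"),
    ("Boston, MA", "address"),
]


def tokenize(text):
    # Words, digit runs, and single symbols such as "$" or "," (assumed tokenizer).
    return re.findall(r"[a-z]+|\d+|[^\sa-z\d]", text.lower())


class NaiveBayesLearner:
    def fit(self, examples):
        self.label_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in examples:
            self.label_counts[label] += 1
            for tok in tokenize(text):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        """Return {label: confidence}, with confidences summing to 1."""
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            # Add-one (Laplace) smoothing over the training vocabulary.
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in tokenize(text):
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())
        unnorm = {l: math.exp(s - top) for l, s in log_scores.items()}
        z = sum(unnorm.values())
        return {l: v / z for l, v in unnorm.items()}


learner = NaiveBayesLearner().fit(TRAINING)
scores = learner.predict("Kent, WA")
print(max(scores, key=scores.get))
```

With so little training data the prediction for “Kent, WA” is driven mostly by the comma token shared with the two address examples, which illustrates why symbols are counted alongside words.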

Page 17: Meta-Learner: Stacking [Wolpert 92, Ting&Witten 99]

Training
– uses training data to learn weights
  – one for each (base-learner, mediated-schema element) pair
  – weight (Name-Learner, address) = 0.2
  – weight (Naive-Bayes, address) = 0.8

Matching: combine predictions of base learners
– computes weighted average of base-learner confidence scores

Example: source element area, with instances Seattle, WA; Kent, WA; Bend, OR
– Name Learner: (address,0.4)
– Naive Bayes: (address,0.9)
– Meta-Learner: (address, 0.4*0.2 + 0.9*0.8 = 0.8)
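
The weighted combination on this slide can be sketched in a few lines. The function name `combine` and the dictionary layout are illustrative assumptions; the weights and confidence scores are the ones shown on the slide.

```python
import math

# One weight per (base learner, mediated-schema element) pair, as on the slide.
WEIGHTS = {
    ("Name-Learner", "address"): 0.2,
    ("Naive-Bayes", "address"): 0.8,
}


def combine(predictions, weights):
    """predictions: {learner: {label: confidence}} -> {label: combined score}."""
    combined = {}
    for learner, scores in predictions.items():
        for label, confidence in scores.items():
            # Pairs without a learned weight contribute nothing (assumption).
            w = weights.get((learner, label), 0.0)
            combined[label] = combined.get(label, 0.0) + w * confidence
    return combined


result = combine(
    {"Name-Learner": {"address": 0.4}, "Naive-Bayes": {"address": 0.9}},
    WEIGHTS,
)
print(result)
```

Because the two weights for address sum to 1, the weighted sum is also a weighted average, which matches the slide's 0.4*0.2 + 0.9*0.8 = 0.8.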

Page 18: The LSD Architecture

Training Phase
– mediated schema, source schemas
– training data for base learners
– Base-Learner1 ... Base-Learnerk => Hypothesis1 ... Hypothesisk
– Meta-Learner => weights for base learners

Matching Phase
– Base-Learner1 .... Base-Learnerk
– Meta-Learner
– Prediction Combiner: predictions for instances => predictions for elements
– Constraint Handler, with domain constraints => mappings

Page 19: Applying the Learners

homes.com schema: area, sold-at, contact-agent, extra-info

area, with instances Seattle, WA; Kent, WA; Bend, OR
– Name Learner and Naive Bayes => Meta-Learner, one prediction per instance
  – (address,0.8), (description,0.2)
  – (address,0.6), (description,0.4)
  – (address,0.7), (description,0.3)
– Prediction-Combiner: (address,0.7), (description,0.3)

sold-at: (price,0.9), (agent-phone,0.1)

contact-agent: (agent-phone,0.9), (description,0.1)

extra-info: (address,0.6), (description,0.4)
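
The step from per-instance predictions to one prediction per schema element can be sketched as follows. Averaging the instance-level scores is an assumption here (the slide does not name the combination rule), although it does reproduce the slide's result for area; `combine_instances` is a hypothetical name.

```python
import math


def combine_instances(instance_predictions):
    """Average per-instance {label: confidence} dicts into one per-element dict."""
    labels = {label for pred in instance_predictions for label in pred}
    n = len(instance_predictions)
    return {
        label: sum(pred.get(label, 0.0) for pred in instance_predictions) / n
        for label in labels
    }


# Meta-learner output for the three instances of "area" shown on the slide.
area_instances = [
    {"address": 0.8, "description": 0.2},
    {"address": 0.6, "description": 0.4},
    {"address": 0.7, "description": 0.3},
]
area_prediction = combine_instances(area_instances)
print(area_prediction)
```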

Page 20: Domain Constraints

Encode user knowledge about domain

Specified by examining mediated schema

Examples
– at most one source-schema element can match address
– if a source-schema element matches house-id then it is a key
– avg-value(price) > avg-value(num-baths)

Given a mapping combination
– can verify if it satisfies a given constraint

Example mapping combination
– area: address
– sold-at: price
– contact-agent: agent-phone
– extra-info: address

Page 21: The Constraint Handler

Searches space of mapping combinations efficiently

Can handle arbitrary constraints

Also used to incorporate user feedback
– sold-at does not match price

Predictions from Prediction Combiner
– area: (address,0.7), (description,0.3)
– sold-at: (price,0.9), (agent-phone,0.1)
– contact-agent: (agent-phone,0.9), (description,0.1)
– extra-info: (address,0.6), (description,0.4)

Domain Constraints
– At most one element matches address

Mapping combinations and their scores
– area: address, sold-at: price, contact-agent: agent-phone, extra-info: address: 0.7 * 0.9 * 0.9 * 0.6 = 0.3402 (two elements match address)
– area: address, sold-at: price, contact-agent: agent-phone, extra-info: description: 0.7 * 0.9 * 0.9 * 0.4 = 0.2268
– lowest-scoring combination: 0.3 * 0.1 * 0.1 * 0.4 = 0.0012
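
The scores on this slide are products of the per-element confidences, and the handler picks the best combination that satisfies the constraints. The sketch below enumerates every combination exhaustively, which is an assumption made for brevity: the slide says the real handler searches the space efficiently, and this brute-force loop is only a stand-in for that search. The names `best_mapping` and `at_most_one_address` are hypothetical.

```python
import math
from itertools import product

# Predictions from the Prediction Combiner, as listed on the slide.
PREDICTIONS = {
    "area": {"address": 0.7, "description": 0.3},
    "sold-at": {"price": 0.9, "agent-phone": 0.1},
    "contact-agent": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}


def at_most_one_address(mapping):
    # Domain constraint from the slide: at most one element matches address.
    return sum(1 for label in mapping.values() if label == "address") <= 1


def best_mapping(predictions, constraints):
    """Return (mapping, score) with the highest product score that passes all constraints."""
    elements = list(predictions)
    best, best_score = None, -1.0
    for labels in product(*(predictions[e] for e in elements)):
        mapping = dict(zip(elements, labels))
        if not all(check(mapping) for check in constraints):
            continue
        score = math.prod(predictions[e][label] for e, label in mapping.items())
        if score > best_score:
            best, best_score = mapping, score
    return best, best_score


unconstrained, score_a = best_mapping(PREDICTIONS, [])
constrained, score_b = best_mapping(PREDICTIONS, [at_most_one_address])
print(unconstrained, score_a)
print(constrained, score_b)
```

User feedback such as "sold-at does not match price" fits the same interface as one more constraint function, for example `lambda m: m["sold-at"] != "price"`.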

Page 22: The Current LSD System

Can also handle data in XML format
– matches XML DTDs

Base learners
– Naive Bayes [Duda&Hart-93, Domingos&Pazzani-97]
  – exploits frequencies of words & symbols
– WHIRL Nearest-Neighbor Classifier [Cohen&Hirsh KDD-98]
  – employs information-retrieval similarity metric
– Name Learner [SIGMOD-01]
  – matches elements based on their names
– County-Name Recognizer [SIGMOD-01]
  – stores all U.S. county names
– XML Learner [SIGMOD-01]
  – exploits hierarchical structure of XML data

Page 23: Empirical Evaluation

Four domains
– Real Estate I & II, Course Offerings, Faculty Listings

For each domain
– created mediated schema & domain constraints
– chose five sources
– extracted & converted data into XML
– mediated schemas: 14 - 66 elements, source schemas: 13 - 48

Ten runs for each domain, in each run:
– manually provided 1-1 mappings for 3 sources
– asked LSD to propose mappings for remaining 2 sources
– accuracy = % of 1-1 mappings correctly identified

Page 24: High Matching Accuracy

[Bar chart. Vertical axis: average matching accuracy (%), 0 to 100. Horizontal axis: Real Estate I, Real Estate II, Course Offerings, Faculty Listings.]

LSD’s accuracy: 71 - 92%

Best single base learner: 42 - 72%

+ Meta-learner: + 5 - 22%

+ Constraint handler: + 7 - 13%

+ XML learner: + 0.8 - 6%

Page 25: Contribution of Schema vs. Data

[Bar chart. Vertical axis: average matching accuracy (%), 0 to 100. Horizontal axis: Real Estate I, Real Estate II, Course Offerings, Faculty Listings. Series: LSD with only schema info., LSD with only data info., Complete LSD.]

More experiments in [Doan et al. SIGMOD-01]

Page 26: LSD Summary

LSD
– learns from previous matching activities
– exploits multiple types of information
  – by employing multi-strategy learning
– incorporates domain constraints & user feedback
– achieves high matching accuracy

LSD focuses on 1-1 mappings

Next challenge: discover more complex mappings!
– COMAP (Complex Mapping) system

Page 27: The COMAP Approach

Mediated schema: price, num-baths, address

homes.com

listed-price   agent-id   full-baths   half-baths   city      zipcode
320K           53211      2            1            Seattle   98105
240K           11578      1            1            Miami     23591

For each mediated-schema element
– searches space of all mappings
– finds a small set of likely mapping candidates
– uses LSD to evaluate them

To search efficiently
– employs a specialized searcher for each element type
– Text Searcher, Numeric Searcher, Category Searcher, ...

Page 28: The COMAP Architecture [Doan et al., 02]

Components shown in the diagram
– mediated schema, source schema + data
– Searcher1, Searcher2, ..., Searcherk
– mapping candidates
– LSD: Base-Learner1 .... Base-Learnerk, Meta-Learner
– Prediction Combiner
– Constraint Handler, with domain constraints
– mappings

Page 29: An Example: Text Searcher

Beam search in space of all concatenation mappings

Example: find mapping candidates for address

Mediated schema: price, num-baths, address

homes.com

listed-price   agent-id   full-baths   half-baths   city      zipcode
320K           532a       2            1            Seattle   98105
240K           115c       1            1            Miami     23591

Concatenation mappings and their values
– concat(agent-id,zipcode): 532a 98105; 115c 23591
– concat(city,zipcode): Seattle 98105; Miami 23591
– concat(agent-id,city): 532a Seattle; 115c Miami

Best mapping candidates for address
– (agent-id,0.7), (concat(agent-id,city),0.75), (concat(city,zipcode),0.9)
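
A beam search over concatenation mappings can be sketched on the homes.com rows from this slide. The scoring function `address_score` below is a hand-written toy stand-in and purely an assumption: the slides do not show how the Text Searcher scores candidates, and its scores (0.7, 0.75, 0.9) are not reproduced here. The beam width and maximum number of columns are also illustrative parameters.

```python
import re

# Rows of homes.com as shown on the slide.
ROWS = [
    {"listed-price": "320K", "agent-id": "532a", "full-baths": "2",
     "half-baths": "1", "city": "Seattle", "zipcode": "98105"},
    {"listed-price": "240K", "agent-id": "115c", "full-baths": "1",
     "half-baths": "1", "city": "Miami", "zipcode": "23591"},
]

# Toy notion of "looks like an address": place name followed by a 5-digit code.
ADDRESS_PATTERN = re.compile(r"^[A-Za-z]+( [A-Za-z]+)* \d{5}$")


def address_score(value):
    if ADDRESS_PATTERN.match(value):
        return 1.0
    tokens = value.split()
    good = [t for t in tokens
            if (t.isalpha() and len(t) >= 3) or (t.isdigit() and len(t) == 5)]
    return 0.5 * len(good) / len(tokens)  # partial credit for address-like tokens


def score(columns, rows):
    # Score concat(columns) by averaging the toy score over all rows.
    values = [" ".join(row[c] for c in columns) for row in rows]
    return sum(address_score(v) for v in values) / len(values)


def beam_search(rows, beam_width=2, max_columns=2):
    """Return kept candidates as (score, columns) pairs, best first."""
    all_columns = list(rows[0])
    frontier = [(c,) for c in all_columns]
    kept_all = []
    for _ in range(max_columns):
        scored = sorted(((score(cand, rows), cand) for cand in frontier), reverse=True)
        kept = scored[:beam_width]          # prune to the beam
        kept_all.extend(kept)
        frontier = [cand + (c,) for _, cand in kept
                    for c in all_columns if c not in cand]
    return sorted(kept_all, reverse=True)


candidates = beam_search(ROWS)
print(candidates[0])
```

Only the best `beam_width` candidates at each level are extended with another column, so the search never enumerates all concatenations.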

Page 30: Empirical Evaluation

Current COMAP system
– eight searchers

Three real-world domains
– in real estate & product inventory
– mediated schema: 6 - 26 elements, source schema: 16 - 31

Accuracy: 62 - 97%

Sample discovered mappings
– agent-name = concat(first-name,last-name)
– area = building-area / 43560
– discount-cost = (unit-price * quantity) * (1 - discount)
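
The three sample mappings can be read as small functions from a source row to a mediated-schema value. The sketch below applies them to made-up rows; the row values are hypothetical, and joining the names with a single space is an assumption about what concat does. The constant 43560 is the number of square feet in an acre, so the area mapping presumably converts square feet to acres.

```python
import math


def agent_name(row):
    # agent-name = concat(first-name, last-name); the space separator is assumed.
    return f'{row["first-name"]} {row["last-name"]}'


def area(row):
    # area = building-area / 43560
    return row["building-area"] / 43560


def discount_cost(row):
    # discount-cost = (unit-price * quantity) * (1 - discount)
    return (row["unit-price"] * row["quantity"]) * (1 - row["discount"])


# Hypothetical source rows, for illustration only.
listing = {"first-name": "Jane", "last-name": "Brown", "building-area": 87120}
order = {"unit-price": 10.0, "quantity": 3, "discount": 0.1}

print(agent_name(listing), area(listing), discount_cost(order))
```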

Page 31: Road Map

Introduction

Schema matching
– LSD system

Creating complex mappings
– COMAP system

Ontology matching
– GLUE system

Conclusions

Page 32: Ontology Matching

Increasingly critical for
– knowledge bases, Semantic Web

An ontology
– concepts organized into a taxonomy tree
– each concept has
  – a set of attributes
  – a set of instances
– relations among concepts

Matching
– concepts
– attributes
– relations

CS Dept. US
– Entity
  – Undergrad Courses
  – Grad Courses
  – People
    – Staff
    – Faculty
      – Assistant Professor
      – Associate Professor
      – Professor

Example instance: name: Mike Burns, degree: Ph.D.

Page 33: Matching Taxonomies of Concepts

CS Dept. Australia
– Entity
  – Courses
  – Staff
    – Technical Staff
    – Academic Staff
      – Lecturer
      – Senior Lecturer
      – Professor

CS Dept. US
– Entity
  – Undergrad Courses
  – Grad Courses
  – People
    – Staff
    – Faculty
      – Assistant Professor
      – Associate Professor
      – Professor

Page 34: Constraints in Taxonomy Matching

Domain-dependent
– at most one node matches department-chair
– a node that matches professor cannot be a child of a node that matches assistant-professor

Domain-independent
– two nodes match if parents & children match
– if all children of X match Y, then X also matches Y
– variations have been exploited in many restricted settings [Melnik&Garcia-Molina, ICDE-02], [Milo&Zohar, VLDB-98], [Noy et al., IJCAI-01], [Madhavan et al., VLDB-01]

Challenge: find a general & efficient approach

Page 35: Solution: Relaxation Labeling

Relaxation labeling [Hummel&Zucker, 83]
– applied to graph labeling in vision, NLP, hypertext classification
– finds best label assignment, given a set of constraints
– starts with initial label assignment
– iteratively improves labels, using constraints

Standard relaxation labeling not applicable
– extended it in many ways [Doan et al., WWW-02]

Experiments
– three real-world domains in course catalog & company listings
– 30 - 300 nodes / taxonomy
– accuracy 66 - 97% vs. 52 - 83% of best base learner
– relaxation labeling very fast (under a few seconds)

Page 36: Related Work

Hand-crafted rules, exploit schema, 1-1 mapping
– TRANSCM [Milo&Zohar98], ARTEMIS [Castano&Antonellis99], [Palopoli et al. 98], CUPID [Madhavan et al. 01], PROMPT [Noy et al. 00]

Single learner, exploit data, 1-1 mapping
– SEMINT [Li&Clifton94], ILA [Perkowitz&Etzioni95], DELTA [Clifton et al. 97]

Rules, exploit data, 1-1 + complex mapping
– CLIO [Miller et al., 00], [Yan et al. 01]

Learners + rules, use multi-strategy learning, exploit schema + data, 1-1 + complex mapping, exploit domain constraints
– LSD [Doan et al., SIGMOD-01], COMAP [Doan et al. 2002, submitted], GLUE [Doan et al., WWW-02]

Page 37: Future Work

Learning source descriptions
– formal semantics for mapping
– query capabilities, source schema, scope, reliability of data, ...

Dealing with changes in source description

Matching objects across sources

More sophisticated user feedback

Focus on distributed information management systems
– data integration, web-service integration, peer data management
– goal: significantly reduce complexity of construction & maintenance

Page 38: Conclusions

Efficiently creating semantic mappings is critical

Developed solution for semi-automatic schema matching
– learns from previous matching activities
– can match relational schemas, DTDs, ontologies, ...
– discovers both 1-1 & complex mappings
– highly modular & extensible
– achieves high matching accuracy

Made contributions to machine learning
– developed novel method to classify XML data
– extended relaxation labeling

Page 39: Backup Slides

Page 40: Training the Meta-Learner

For address

Extracted XML Instances
– <location> Miami, FL</>
– <listed-price> $250,000</>
– <area> Seattle, WA </>
– <house-addr>Kent, WA</>
– <num-baths>3</>
– ...

Name Learner   Naive Bayes   True Predictions
0.5            0.8           1
0.4            0.3           0
0.3            0.9           1
0.6            0.8           1
0.3            0.3           0
...            ...           ...

Least-Squares Linear Regression
– Weight(Name-Learner,address) = 0.1
– Weight(Naive-Bayes,address) = 0.9
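
The regression step can be sketched with the five rows visible on this slide. Fitting two weights with no intercept by solving the 2x2 normal equations is an assumption about the exact setup, and the function name `least_squares_2` is hypothetical. The slide's table is elided ("..."), so fitting only these five rows is not expected to reproduce the weights 0.1 and 0.9 shown on the slide.

```python
# (Name Learner score, Naive Bayes score) for label "address", per instance.
SCORES = [(0.5, 0.8), (0.4, 0.3), (0.3, 0.9), (0.6, 0.8), (0.3, 0.3)]
# 1 if the instance truly maps to address, else 0.
TRUE = [1, 0, 1, 1, 0]


def least_squares_2(xs, ys):
    """Minimize sum((y - w1*x1 - w2*x2)**2) via the 2x2 normal equations."""
    a = sum(x1 * x1 for x1, _ in xs)
    b = sum(x1 * x2 for x1, x2 in xs)
    c = sum(x2 * x2 for _, x2 in xs)
    d1 = sum(x1 * y for (x1, _), y in zip(xs, ys))
    d2 = sum(x2 * y for (_, x2), y in zip(xs, ys))
    det = a * c - b * b  # nonzero when the two score columns are not proportional
    return (c * d1 - b * d2) / det, (a * d2 - b * d1) / det


w_name, w_bayes = least_squares_2(SCORES, TRUE)
print(w_name, w_bayes)
```

On these rows the Naive Bayes weight comes out larger than the Name Learner weight, which agrees in direction with the slide: for address, the data values are a better clue than the element name.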

Page 41: Sensitivity to Amount of Available Data

[Plot. Vertical axis: average matching accuracy (%), 40 to 100. Horizontal axis: number of data listings per source (Real Estate I), 0 to 500.]

Page 42: Contribution of Each Component

[Bar chart. Vertical axis: average matching accuracy (%), 0 to 100. Horizontal axis: Real Estate I, Course Offerings, Faculty Listings, Real Estate II. Series: Without Name Learner, Without Naive Bayes, Without Whirl Learner, Without Constraint Handler, The complete LSD system.]

Page 43: Exploiting Hierarchical Structure

Existing learners flatten out all structures

Developed XML learner
– similar to the Naive Bayes learner
  – input instance = bag of tokens
– differs in one crucial aspect
  – consider not only text tokens, but also structure tokens

<description> Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors.</description>

<contact> <name> Gail Murphy </name> <firm> MAX Realtors </firm></contact>
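
The idea of adding structure tokens to the bag of text tokens can be sketched with the two XML fragments from this slide. The particular token shapes used below (`<tag>` for a node and `parent->child` for an edge) are assumptions chosen for illustration; the slide only says that structure tokens are considered in addition to text tokens.

```python
import xml.etree.ElementTree as ET
from collections import Counter

FLAT = ("<description> Victorian house with a view. Name your price! To see it, "
        "contact Gail Murphy at MAX Realtors.</description>")
NESTED = "<contact> <name> Gail Murphy </name> <firm> MAX Realtors </firm></contact>"


def bag_of_tokens(xml_text):
    """Bag of lowercased text tokens plus node and edge tokens for the structure."""
    bag = Counter()

    def visit(elem):
        bag["<" + elem.tag + ">"] += 1                 # node token
        for word in (elem.text or "").split():
            bag[word.lower()] += 1                     # text token
        for child in elem:
            bag[elem.tag + "->" + child.tag] += 1      # edge token
            visit(child)
            for word in (child.tail or "").split():
                bag[word.lower()] += 1                 # text following the child

    visit(ET.fromstring(xml_text))
    return bag


flat_bag = bag_of_tokens(FLAT)
nested_bag = bag_of_tokens(NESTED)
print(sorted(t for t in nested_bag if "<" in t or "->" in t))
```

Both fragments contain the text tokens "gail" and "murphy", but only the nested one yields tokens such as `contact->name`, which is the extra signal a structure-aware learner can exploit.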

Page 44: Reasons for Incorrect Matchings

Unfamiliarity
– suburb
– solution: add a suburb-name recognizer

Insufficient information
– correctly identified general type, failed to pinpoint exact type
– example:

  agent-name      phone
  Richard Smith   (206) 234 5412

– solution: add a proximity learner

Subjectivity
– house-style = description?

  house-style   description
  Victorian     Beautiful neo-gothic house
  Mexican       Great location

Page 45: Evaluate Mapping Candidates

For address, Text Searcher returns
– (agent-id,0.7)
– (concat(agent-id,city),0.8)
– (concat(city,zipcode),0.75)

Employ multi-strategy learning to evaluate mappings

Example: (concat(agent-id,city),0.8)
– Naive Bayes Learner: 0.8
– Name Learner: “address” vs. “agent id city” 0.3
– Meta-Learner: 0.8 * 0.7 + 0.3 * 0.3 = 0.65

Meta-Learner returns
– (agent-id,0.59)
– (concat(agent-id,city),0.65)
– (concat(city,zipcode),0.70)

Page 46: Relaxation Labeling

Dept U.S.
– Courses
– People
  – Staff
  – Faculty

Dept Australia
– Courses
– Staff
  – Tech. Staff
  – Acad. Staff

Applied to similar problems in
– vision, NLP, hypertext classification

Page 47: Relaxation Labeling for Taxonomy Matching

Must define
– neighborhood of a node
– k features of neighborhood
– how to combine influence of features

Algorithm
– init: for each pair <N,L>, compute $P(N = L \mid \Delta)$
– loop: for each pair <N,L>, re-compute $P(N = L \mid \Delta) = \sum_{M} P(N = L \mid M, \Delta) \cdot P(M \mid \Delta)$
– $P(N = L \mid M, \Delta)$ is computed from the neighborhood features as $P(N = L \mid f_1, f_2, \ldots, f_k)$

Here $M$ ranges over neighborhood configurations and $\Delta$ denotes the available knowledge about the domain.

Neighborhood configuration, for example: Acad. Staff: Faculty, Tech. Staff: Staff, Staff = People
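
The init/loop scheme on this slide can be illustrated with a much simpler, classical-style relaxation labeling update on the Staff example. Everything below is an assumption made for illustration: the multiplicative update rule, the single compatibility feature ("a child's label should be a child of its parent's label"), and the initial probabilities are not taken from the slides, and this is not the extended algorithm of [Doan et al., WWW-02].

```python
# Label taxonomy (Dept U.S.): each label's parent label.
LABEL_PARENT = {"People": None, "Faculty": "People", "Staff": "People"}
# Node taxonomy (Dept Australia): each node's children.
CHILDREN = {"Staff": ["Academic Staff", "Technical Staff"],
            "Academic Staff": [], "Technical Staff": []}

# Hypothetical initial label probabilities, e.g. from base learners.
INITIAL = {
    "Staff": {"People": 0.5, "Faculty": 0.0, "Staff": 0.5},
    "Academic Staff": {"People": 0.05, "Faculty": 0.9, "Staff": 0.05},
    "Technical Staff": {"People": 0.05, "Faculty": 0.05, "Staff": 0.9},
}


def compat(parent_label, child_label):
    # 1 if the child's label sits directly below the parent's label, else 0.
    return 1.0 if LABEL_PARENT[child_label] == parent_label else 0.0


def relax(probs, iterations=10):
    parent_of = {c: p for p, cs in CHILDREN.items() for c in cs}
    for _ in range(iterations):
        updated = {}
        for node, dist in probs.items():
            n_neighbors = max(1, len(CHILDREN[node]) + (node in parent_of))
            new = {}
            for label, p in dist.items():
                support = 0.0
                for child in CHILDREN[node]:
                    support += sum(q * compat(label, cl)
                                   for cl, q in probs[child].items())
                if node in parent_of:
                    support += sum(q * compat(pl, label)
                                   for pl, q in probs[parent_of[node]].items())
                # Labels supported by the neighbors' current labels gain weight.
                new[label] = p * (1.0 + support / n_neighbors)
            z = sum(new.values())
            updated[node] = {label: v / z for label, v in new.items()}
        probs = updated
    return probs


final = relax(INITIAL)
best = {node: max(dist, key=dist.get) for node, dist in final.items()}
print(best)
```

The Australian node Staff starts out undecided between People and Staff; because its children are labeled Faculty and Staff, the iterations push it toward People, matching the slide's example configuration.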

Page 48: Relaxation Labeling for Taxonomy Matching

Huge number of neighborhood configurations!
– typically neighborhood = immediate nodes
– here neighborhood can be entire graph
– 100 nodes, 10 labels => $10^{100}$ configurations

Solution
– label abstraction + dynamic programming
– guarantee quadratic time for a broad range of domain constraints

Empirical evaluation
– GLUE system [Doan et al., WWW-02]
– three real-world domains
– 30 - 300 nodes / taxonomy
– high accuracy 66 - 97% vs. 52 - 83% of best base learner
– relaxation labeling very fast, finished in several seconds