TRANSCRIPT
1Berendt: Knowledge and the Web, 2014, http://www.cs.kuleuven.be/~berendt/teaching
Knowledge and the Web –
Schema, instance and ontology matching
Bettina Berendt
KU Leuven, Department of Computer Science
http://www.cs.kuleuven.be/~berendt/teaching/2014-15-1stsemester/kaw/
Last update: 22 October 2014
Until now ...
... we have looked into modelling
... we have seen how the languages RDF(S) and OWL allow us to combine different schemas and data
... we have seen how Linked Data on the Web uses HTTP as a connecting protocol/architecture
... we have assumed that such combinations can be done effortlessly (unique names etc.)
... we have looked at some interpretation problems associated with these procedures
Now we need to ask:
What are (further) challenges of such combinations? What are the approaches proposed to solve them?
– from the databases & the Semantic Web / ontologies fields
– from architectural and logical points of view
Motivation 1: Price comparison engines search & combine heterogeneous travel-agency DBs, which in turn search & combine heterogeneous airline DBs
Motivation 2a: Schemas coming from different languages
A river is a natural stream of water, usually freshwater, flowing toward an ocean, a lake, or another stream. In some cases a river flows into the ground or dries up completely before reaching another body of water. Usually larger streams are called rivers while smaller streams are called creeks, brooks, rivulets, rills, and many other terms, but there is no general rule that defines what can be called a river. Sometimes a river is said to be larger than a creek,[1] but this is not always the case.[2]
Une rivière est un cours d'eau qui s'écoule sous l'effet de la gravité et qui se jette dans une autre rivière ou dans un fleuve, contrairement au fleuve qui se jette, lui, dans la mer ou dans l'océan.
[English: A rivière is a watercourse that flows under the effect of gravity and empties into another rivière or into a fleuve; a fleuve, by contrast, empties into the sea or the ocean.]
Een rivier is een min of meer natuurlijke waterstroom. We onderscheiden oceanische rivieren (in België ook wel stroom genoemd) die in een zee of oceaan uitmonden, en continentale rivieren die in een meer, een moeras of woestijn uitmonden. Een beek is de aanduiding voor een kleine rivier. Tussen beek en rivier ligt meestal een bijrivier.
[English: A rivier is a more or less natural watercourse. We distinguish oceanic rivers (in Belgium also called stroom), which empty into a sea or ocean, and continental rivers, which empty into a lake, a marsh, or a desert. Beek is the term for a small river. Between a beek and a rivier usually lies a bijrivier (tributary).]
Motivation 2b: Information about “the same“ thing from different sources
Motivation 3a: Are these the same entity?
Motivation 3b: „Who is that?“ – Merging identities
Mickey Mouse
Motivation 3c: „Who was that?“ – Re-identification
High-level overview: Goals and approaches in data integration
Basic goal: Combine data/knowledge from different sources
Goal / emphasis can lie on finding correspondences between
- the models → schema matching, ontology matching
- the instances → record linkage
Techniques can leverage similarities between
- schema/ontology-level information
- instance information
(most of today)
An established problem in DB; a focus & challenge for LOD (“owl:sameAs“)
Agenda
The match problem & what info to use for matching
(Semi-)automated matching: Example CUPID
(Semi-)automated matching: Example iMAP
(Automated) matching of LOD and LOD ontologies
Evaluating matching
Involving the user: Explanations; mass collaboration
(If time permits, these 2 topics too, briefly)
The match problem (Running example 1)
Given two schemas S1 and S2, find a mapping between elements of S1 and S2 that correspond semantically to each other
Running example 2
Based on what information can the matchings/mappings be found?
(work on the two running examples)
The match operator
Match operator: f(S1, S2) = mapping between S1 and S2, for schemas S1, S2
Mapping: a set of mapping elements
Mapping element: elements of S1, elements of S2, a mapping expression
Mapping expression: different functions and relationships
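These definitions can be written down directly as data structures. A minimal sketch in Python (class and field names are illustrative, not from any particular system):

```python
from dataclasses import dataclass, field

@dataclass
class MappingElement:
    """One correspondence: elements of S1, elements of S2, and the
    expression relating them."""
    s1_elements: list
    s2_elements: list
    expression: str  # e.g. "=", or a function over the S1 elements

@dataclass
class Mapping:
    """The result of f(S1, S2): a set of mapping elements."""
    elements: list = field(default_factory=list)

m = Mapping([
    MappingElement(["S.HOUSES.location"], ["T.LISTINGS.area"], "="),
    MappingElement(["S.AGENTS.city", "S.AGENTS.state"],
                   ["T.LISTINGS.agent-address"], "concat"),
])
print(len(m.elements))  # 2
```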
Matching expressions: examples
Scalar relations (=, ≥, ...): S.HOUSES.location = T.LISTINGS.area
Functions: T.LISTINGS.list-price = S.HOUSES.price * (1 + S.AGENTS.fee-rate); T.LISTINGS.agent-address = concat(S.AGENTS.city, S.AGENTS.state)
ER-style relationships (is-a, part-of, ...)
Set-oriented relationships (overlaps, contains, ...)
Any other terms that are defined in the expression language used
Matching and mapping
1. Find the schema match („declarative“)
2. Create a procedure (e.g., a query expression) to enable automated data translation or exchange (mapping, „procedural“)
Example of result of step 2: To create T.LISTINGS from S (simplified notation):
area = SELECT location FROM HOUSES
agent-name = SELECT name FROM AGENTS
agent-address = SELECT concat(city,state) FROM AGENTS
list-price = SELECT price * (1+fee-rate)
FROM HOUSES, AGENTS
WHERE agent-id = id
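The step-2 data translation above can be executed directly. A minimal sketch using Python's sqlite3 (schema and sample rows are invented for illustration; SQLite's || operator stands in for concat):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE HOUSES (id INTEGER, location TEXT, price REAL, agent_id INTEGER);
CREATE TABLE AGENTS (id INTEGER, name TEXT, city TEXT, state TEXT, fee_rate REAL);
CREATE TABLE LISTINGS (area TEXT, agent_name TEXT, agent_address TEXT, list_price REAL);
INSERT INTO HOUSES VALUES (1, 'Atlanta', 100000, 7);
INSERT INTO AGENTS VALUES (7, 'Smith', 'Decatur', 'GA', 0.25);
""")
# Data translation: populate T.LISTINGS from S using the discovered mapping.
con.execute("""
INSERT INTO LISTINGS (area, agent_name, agent_address, list_price)
SELECT h.location, a.name, a.city || ', ' || a.state, h.price * (1 + a.fee_rate)
FROM HOUSES h JOIN AGENTS a ON h.agent_id = a.id
""")
row = con.execute("SELECT * FROM LISTINGS").fetchone()
print(row)  # ('Atlanta', 'Smith', 'Decatur, GA', 125000.0)
```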
Based on what information can the matchings/mappings be found?
Rahm & Bernstein‘s classification of schema matching approaches
Challenges
- Semantics of the involved elements often need to be inferred
- (Heuristic) solutions must often be based on cues in the schema and data, which are unreliable: e.g., homonyms (area), synonyms (area, location)
- Schema and data clues are often incomplete: e.g., date: the date of what?
- Global nature of matching: to choose one matching possibility, one must typically exclude all others as worse
- Matching is often subjective and/or context-dependent: e.g., does house-style match house-description or not?
- Extremely laborious and error-prone process: e.g., Li & Clifton (2000) report on a project at GTE telecommunications: 40 databases, 27K elements, no access to the original developers of the DBs; estimated time for just finding and documenting the matches: 12 person-years
- Ontologies are often even bigger: e.g., Cyc (as of 2012) has > 500,000 concepts, ~5,000,000 assertions, > 26,000 relations
Semi-automated schema matching (1)
Rule-based solutions: hand-crafted rules that exploit schema information
+ relatively inexpensive
+ do not require training
+ fast (operate only on the schema, not the data)
+ can work very well in certain types of applications & domains
+ rules can provide a quick & concise method of capturing user knowledge about the domain
– cannot exploit data instances effectively
– cannot exploit previous matching efforts (other than by re-use)
Semi-automated schema matching (2)
Learning-based solutions: rules/mappings are learned from attribute specifications and statistics of the data content (Rahm & Bernstein: „instance-level matching“)
Exploit schema information and data; some approaches also use external evidence:
- past matches
- a corpus of schemas and matches („matchings in real-estate applications will tend to be alike“)
- a corpus of users (more details later in this slide set)
+ can exploit data instances effectively
+ can exploit previous matching efforts
– relatively expensive
– require training
– slower (operate on the data)
– results may be opaque (e.g., neural network output) → explanation components! (more details later)
Agenda
The match problem & what info to use for matching
(Semi-)automated matching: Example CUPID
(Semi-)automated matching: Example iMAP
(Automated) matching of LOD and LOD ontologies
Evaluating matching
Involving the user: Explanations; mass collaboration
Overview (1)
Rule-based approach
Schema types: Relational, XML
Metadata representation: Extended ER
Match granularity: Element, structure
Match cardinality: 1:1, n:1
Overview (2)
Schema-level match:
- Name-based: name equality, synonyms, hypernyms, homonyms, abbreviations
- Constraint-based: data type and domain compatibility, referential constraints
Structure matching: matching subtrees, weighted by leaves
Re-use, auxiliary information used: Thesauri, glossaries
Combination of matchers: Hybrid
Manual work / user input: User can adjust threshold weights
Basic representation: Schema trees
Computation overview:
1. Compute similarity coefficients between elements of these graphs
2. Deduce a mapping from these coefficients
Computing similarity coefficients (1): Linguistic matching
Operates on schema element names (= nodes in the schema tree)
1. Normalization: tokenization (parse names into tokens based on punctuation, case, etc.), e.g., Product_ID → {Product, ID}; expansion (of abbreviations and acronyms); elimination (of prepositions, articles, etc.)
2. Categorization / clustering: based on data types, the schema hierarchy, and the linguistic content of names, e.g., „real-valued elements“, „money-related elements“
3. Comparison (within the categories): compute linguistic similarity coefficients (lsim) based on a thesaurus (synonymy, hypernymy)
Output: table of lsim coefficients (in [0,1]) between schema elements
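The normalization step can be sketched in a few lines of Python (the abbreviation table and stopword list are invented stand-ins for a real domain glossary):

```python
import re

# Hypothetical abbreviation table; a real system would use a domain glossary.
EXPANSIONS = {"id": "identifier", "qty": "quantity", "addr": "address"}
STOPWORDS = {"of", "the", "a", "an", "in", "on"}

def normalize(name: str) -> list:
    """Tokenize a schema element name, expand abbreviations, drop stopwords."""
    # Split on punctuation/underscores, then on CamelCase boundaries.
    parts = re.split(r"[_\W]+", name)
    tokens = []
    for part in parts:
        tokens += re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", part)
    tokens = [t.lower() for t in tokens if t]
    tokens = [EXPANSIONS.get(t, t) for t in tokens]
    return [t for t in tokens if t not in STOPWORDS]

print(normalize("Product_ID"))        # ['product', 'identifier']
print(normalize("listPriceOfHouse"))  # ['list', 'price', 'house']
```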
How to identify synonyms and homonyms: Example WordNet
How to identify hypernyms: Example WordNet
What if you had to match “statement“ and “bill“?
(Lately also done with Wikipedia rather than with WordNet: e.g. WikiMatch)
Computing similarity coefficients (2): Structure matching
Intuitions:
- Leaves are similar if they are linguistically and data-type similar, and if they have similar neighbourhoods
- Non-leaf elements are similar if they are linguistically similar and have similar subtrees (where the leaf sets are most important)
Procedure:
1. Initialize the structural similarity of leaves based on data types (identical data types: compatibility = 0.5; otherwise a value in [0, 0.5])
2. Process the tree in post-order
3. Stronglink(leaf1, leaf2) iff their weighted similarity ≥ threshold
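The „weighted similarity“ in step 3 combines the linguistic and structural coefficients linearly. A minimal sketch (the particular weight and threshold values here are illustrative assumptions, not CUPID's defaults):

```python
def wsim(lsim: float, ssim: float, w_struct: float = 0.5) -> float:
    """Weighted similarity: linear combination of the structural and
    linguistic similarity coefficients, both in [0, 1]."""
    return w_struct * ssim + (1 - w_struct) * lsim

def strong_link(lsim: float, ssim: float, threshold: float = 0.6) -> bool:
    """Two leaves are strongly linked iff their weighted similarity
    reaches the acceptance threshold."""
    return wsim(lsim, ssim) >= threshold

print(strong_link(lsim=0.9, ssim=0.5))  # 0.7 >= 0.6 -> True
```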
The structure matching algorithm
Output: a 1:n mapping for leaves
To generate non-leaf mappings: 2nd post-order traversal
Matching shared types
Solution: expand the schema into a schema tree, then proceed as before
Can help to generate context-dependent mappings
Fails if a cycle of containment and IsDerivedFrom relationships is present (e.g., recursive type definitions)
Agenda
The match problem & what info to use for matching
(Semi-)automated matching: Example CUPID
(Semi-)automated matching: Example iMAP
(Automated) matching of LOD and LOD ontologies
Evaluating matching
Involving the user: Explanations; mass collaboration
Main ideas
A learning-based approach
Main goal: discover complex matches; in particular, functions such as
T.LISTINGS.list-price = S.HOUSES.price * (1+S.AGENTS.fee-rate)
T.LISTINGS.agent-address = concat(S.AGENTS.city,S.AGENTS.state)
Works on relational schemas
Basic idea: reformulate schema matching as search
Architecture
Specialized searchers each discover certain types of complex matches → this makes the search more efficient
Overview of implemented searchers
Example: The textual searcher
For target attribute T.LISTINGS.agent-address:
- Examine attributes, and concatenations of attributes, from S
- Restrict the examined set by analyzing textual properties (data type information in the schema; heuristics such as the proportion of non-numeric characters)
- Evaluate match candidates based on data correspondences; prune inferior candidates
Example: The numerical searcher
For target attribute T.LISTINGS.list-price:
- Examine attributes, and arithmetic expressions over them, from S
- Restrict the examined set by analyzing numeric properties (data type information in the schema; heuristics)
- Evaluate match candidates based on data correspondences; prune inferior candidates
Search strategy (1): Example textual searcher
1. Learn a (Naive Bayes) classifier, text → class („agent-address“ or „other“), from the data instances in T.LISTINGS.agent-address
2. Apply this classifier to each match candidate (e.g., location, concat(city, state))
3. Score of a candidate = average over its instance probabilities
4. For expansion: beam search – keep only the k top-scoring candidates
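The beam-search loop in step 4 can be sketched generically; the scoring function and candidate-expansion rule below are toy placeholders for the searcher-specific parts described above:

```python
def beam_search(seeds, expand, score, k=3, rounds=2):
    """Generic beam search over match candidates: repeatedly expand the
    current beam and keep only the k top-scoring candidates."""
    beam = sorted(seeds, key=score, reverse=True)[:k]
    for _ in range(rounds):
        candidates = set(beam)
        for c in beam:
            candidates.update(expand(c))
        beam = sorted(candidates, key=score, reverse=True)[:k]
    return beam

# Toy illustration: candidates are tuples of attributes to concatenate;
# the score simply prefers short candidates containing city and state.
attrs = ["city", "state", "location", "name"]
expand = lambda c: [c + (a,) for a in attrs if a not in c]
score = lambda c: len(set(c) & {"city", "state"}) - 0.1 * len(c)
best = beam_search([()], expand, score, k=3, rounds=2)
print(best[0])  # a candidate containing exactly 'city' and 'state'
```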
Search strategy (2): Example numeric searcher
1. Get the value distributions of the target attribute and of each candidate
2. Compare the value distributions (Kullback-Leibler divergence)
3. Score of a candidate = the Kullback-Leibler divergence (lower = more similar)
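The Kullback-Leibler comparison in steps 2–3 can be sketched over empirical value distributions (the smoothing constant is an illustrative assumption to avoid division by zero):

```python
import math
from collections import Counter

def kl_divergence(p_samples, q_samples, smoothing=1e-9):
    """D(P||Q) between the empirical value distributions of two samples.
    Smoothing avoids division by zero for values unseen in q_samples."""
    support = set(p_samples) | set(q_samples)
    p_counts, q_counts = Counter(p_samples), Counter(q_samples)
    n_p, n_q = len(p_samples), len(q_samples)
    d = 0.0
    for v in support:
        p = p_counts[v] / n_p
        q = q_counts[v] / n_q + smoothing
        if p > 0:
            d += p * math.log(p / q)
    return d

# An identical distribution scores ~0; a disjoint one scores much higher.
target = [1, 1, 2, 3, 3, 3]
print(kl_divergence(target, [1, 1, 2, 3, 3, 3]))  # ~0
print(kl_divergence(target, [7, 8, 8, 9, 9, 9]))  # large
```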
Evaluation strategies of implemented searchers
Pruning by domain constraints
- Multiple attributes of S: „attributes name and beds are unrelated“ → do not generate match candidates combining these 2 attributes
- Properties of a single attribute of T: „the average value of num-rooms does not exceed 10“ → use in the evaluation of candidates
- Properties of multiple attributes of T: „lot-area and num-baths are unrelated“ → „clean up“ at the match-selector level
Example:
– T.num-baths = S.baths
– T.lot-area = (S.lot-sq-feet / 43560) + 1.3e-15 * S.baths
Based on the domain constraint, drop the term involving S.baths
Pruning by using knowledge from overlap data
When S and T share (some of) the same data:
Consider the fraction of the data for which a mapping is correct; e.g., for house locations:
S.HOUSES.location overlaps more with T.LISTINGS.area than with T.LISTINGS.agent-address
→ discard the candidate T.LISTINGS.agent-address = S.HOUSES.location and
keep only T.LISTINGS.agent-address = concat(S.AGENTS.city, S.AGENTS.state)
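The overlap comparison can be sketched as a simple fraction over shared values (the sample values below are invented in the spirit of the running example):

```python
def overlap_fraction(source_values, target_values):
    """Fraction of source values that also occur among the target values;
    a crude proxy for 'the mapping is correct for this fraction of data'."""
    if not source_values:
        return 0.0
    target = set(target_values)
    return sum(1 for v in source_values if v in target) / len(source_values)

# Invented sample columns:
location = ["Atlanta", "Decatur", "Marietta", "Smyrna"]
area = ["Atlanta", "Decatur", "Marietta", "Roswell"]
agent_address = ["Decatur, GA", "Atlanta, GA"]

print(overlap_fraction(location, area))           # 0.75
print(overlap_fraction(location, agent_address))  # 0.0 -> discard candidate
```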
Agenda
The match problem & what info to use for matching
(Semi-)automated matching: Example CUPID
(Semi-)automated matching: Example iMAP
(Automated) matching of LOD and LOD ontologies
Evaluating matching
Involving the user: Explanations; mass collaboration
What is ontology matching (relative to schema matching)?
Same basic idea, but:
- works on ontologies, which are conceptual models (not on logical schemas such as relational tables or XML trees)
- emphasizes that concepts and relations need to be matched and mapped, and may treat these differently
(Note: in the schema matching literature, it is not always clearly laid out whether the matched items come from a conceptual or a logical model; the toy examples above in particular are also conceptual.)
In practice, some ontology matching tasks in fact work on such simple models (or simple subparts of models) that they do not differ at all from what we have seen so far (example: the Anatomy task, see below under evaluation)
Terminology: also known as ontology alignment. See (Shvaiko & Euzenat, 2005) for more details
Recap: Rahm & Bernstein‘s classification of schema matching approaches
The methods that are important when the schema is in the foreground (which it is in ontologies!)
The extension by Shvaiko & Euzenat (2005) [Partial view]
(slide from last week)
Special challenges on LOD?!
Using the example of GeoNames and DBpedia:
1. Matching instances to generate owl:sameAs links
2. Discovering concepts that cover these instances to map between ontologies
What about matching/mapping instances and classes?
What can we infer from this ? (1)
<owl:Class rdf:ID="Boek"/>
<owl:Class rdf:ID="Book"/>

<owl:DatatypeProperty rdf:ID="ISBN">
  <rdf:type rdf:resource="&owl;FunctionalProperty"/>
  <rdfs:domain rdf:resource="#Book"/>
  <rdfs:range rdf:resource="&xsd;string"/>
</owl:DatatypeProperty>

<owl:DatatypeProperty rdf:ID="isbn">
  <rdf:type rdf:resource="&owl;FunctionalProperty"/>
  <rdfs:domain rdf:resource="#Boek"/>
  <rdfs:range rdf:resource="&xsd;string"/>
</owl:DatatypeProperty>
What can we infer from this ? (2)
<Book rdf:ID="mybook1">
  <ISBN rdf:datatype="&xsd;string">12345</ISBN>
</Book>
<Book rdf:ID="mybook2">
  <ISBN rdf:datatype="&xsd;string">12345</ISBN>
</Book>
<Book rdf:ID="mybook3">
  <ISBN rdf:datatype="&xsd;string">6789</ISBN>
</Book>
<Boek rdf:ID="mijnboek_3">
  <isbn rdf:datatype="&xsd;string">6789</isbn>
</Boek>
What about this? (DBpedia: 526K geographic places/features, GeoNames: 7.8 million geographic features)
How this matching was done (http://lists.w3.org/Archives/Public/semantic-web/2006Dec/0027.html)
>> Around 100,000 geonames place names now have wikipedia links.
> Very cool. I wonder how you link the articles? Can't be simple word matching, no?
Simple word matching would lead to an incredible mess. There are for example 53 places with the name London and 58 places with the name Paris in the geonames database. Place name disambiguation is a rather hard problem and for matching geonames places with wikipedia articles we use semantic information in the wikipedia dump together with the article title. The semantic information primarily is latitude and longitude, but also country, administrative division, feature type, population and categories …. We only consider articles where we are able to parse semantic information .... Unfortunately there is a proliferation of templates and a lot of wikipedia users have fun inventing new ones instead of reusing existing ones.
But what about the classes?
Concept covering: Motivation (Parundekar et al., 2012)
“The Web of Linked Data has grown significantly in the past few years – 31.6 billion triples as of September 2011. This includes a wide range of data sources from the government (42%), geographic (19.4%), life sciences (9.6%) and other domains.
A common way that the instances in these sources are linked to others is through the owl:sameAs property.
Though the size of Linked Data Cloud is increasing steadily (10% over the 28.5 billion triples in 2010), inspection of the sources at the ontology level reveals that only a few of them (15 out of the 190 sources) include mappings between their ontologies.
Since interoperability is crucial to the success of the Semantic Web, it is essential that these heterogeneous schemas, the result of a de-centralized approach to the generation of data and ontologies, also be linked.”
Challenges
- The problem of finding alignments in the ontologies of Linked Data sources is non-trivial, since there might not be one-to-one concept equivalences.
- In some sources the ontology is extremely rudimentary; for example, GeoNames has only one class: geonames:Feature
→ an alignment with a well-defined ontology such as DBpedia is not particularly useful
→ need to generate more expressive concepts. The information necessary to do this is often present in the properties and values of the instances in the sources.
For example, in GeoNames the values of the featureCode and featureClass properties provide useful concept constructors, which can be aligned with existing concepts in DBpedia:
the concept geonames:featureCode=P.PPL (populated place) aligns to dbpedia:City
Approach: explore the space of concepts defined by value restrictions (“restriction classes”)
Restriction classes
Basic expression defining a restriction class: p = v, where
- either p is an object property and v is a resource (ex.: rdf:type=City)
- or p is a data property and v is a literal (ex.: featureCode=P.PPL)
Two restriction classes are equal if their respective instance sets can be identified as equal after following the owl:sameAs links
Conjunctive and disjunctive restriction classes
Alignment algorithm for disjunctive restriction classes:
1. Find initial equivalence and subset relations
2. Discover concept coverings using disjunctions of restriction classes
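The equality test for restriction classes can be sketched as a set comparison over sameAs-canonicalized instances (the triples and sameAs links below are invented toy data, not from the actual sources):

```python
def instances(triples, p, v):
    """Instance set of the restriction class p = v."""
    return {s for (s, prop, val) in triples if prop == p and val == v}

def canonical(instance_set, same_as):
    """Map each instance to a canonical representative via owl:sameAs links."""
    return {same_as.get(i, i) for i in instance_set}

# Toy triples in the spirit of the GeoNames/DBpedia example:
geonames = [("gn:1", "featureCode", "P.PPL"), ("gn:2", "featureCode", "P.PPL")]
dbpedia = [("db:a", "rdf:type", "City"), ("db:b", "rdf:type", "City")]
same_as = {"gn:1": "db:a", "gn:2": "db:b"}  # owl:sameAs links

r1 = canonical(instances(geonames, "featureCode", "P.PPL"), same_as)
r2 = canonical(instances(dbpedia, "rdf:type", "City"), same_as)
print(r1 == r2)  # True: here the two restriction classes are equal
```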
Aligning atomic restriction classes (examples on the board)
Note: there are some typos in the paper; I switched the conclusions of the first 2 if-branches. Also, the cardinality of Img(r1) in the example on p. 4 should be 3918.
Identifying concept coverings
Results
Claim – can you comment?
„An interesting outcome of our algorithm is that it identifies inconsistencies and possible errors in the linked data, and provides a method for automatically curating the Linked Data Cloud.”
Part of the evaluation
Q: “Is this a publicly available tool?“
Not all schema/ontology matchers are available, for many reasons (proprietary, collaboration with a company, own start-up, ..., the PhD student left the institute and nobody understands the code ...)
Increasingly, though, it is seen as good practice for researchers to make their tools available. You can see how (some of) these tools perform by checking the Ontology Alignment Evaluation Initiative pages (see the part “Evaluating matching“)
Examples:
COMA (database schemas and ontologies) http://dbs.uni-leipzig.de/Research/coma.html
Falcon-OA (RDF(S) and OWL) http://ws.nju.edu.cn/falcon-ao/
LogMap (reasoning-based) http://www.cs.ox.ac.uk/isg/tools/LogMap/
“50 Ontology Mapping and Alignment Tools - More Than 20 Are Currently Active and Often in Open Source”: overview at http://www.mkbergman.com/1769/50-ontology-mapping-and-alignment-tools/
Outlook
The match problem & what info to use for matching
(Semi-)automated matching: Example CUPID
(Semi-)automated matching: Example iMAP
(Automated) matching of LOD and LOD ontologies
Evaluating matching
Involving the user: Explanations; mass collaboration
How to compare?
- Input: What kind of input data? (What languages? Only toy examples? What external information?)
- Output: A mapping between attributes or tables, nodes or paths? How much information does the system report?
- Quality measures: What metrics for accuracy and completeness?
- Effort: How much manual effort is saved, and how is this quantified?
  Pre-match effort (training of learners, dictionary preparation, ...)
  Post-match effort (correction and improvement of the match output)
  How are these measured?
Match quality measures
Need a „gold standard“ (the „true“ match)
Measures from information retrieval: precision, recall, F-measure (standard choice: F1, i.e., α = 0.5)
These measures quantify the post-match effort
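The standard IR measures, computed over a gold-standard set of correspondences and the set a matcher returns, can be sketched as follows (variable names and the sample sets are illustrative):

```python
def match_quality(gold: set, found: set, alpha: float = 0.5):
    """Precision, recall, and F-measure of a returned set of
    correspondences against a gold-standard match."""
    correct = gold & found
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    if precision == 0 or recall == 0:
        return precision, recall, 0.0
    # F_alpha generalizes F1; alpha = 0.5 gives the harmonic mean.
    f = (precision * recall) / (alpha * recall + (1 - alpha) * precision)
    return precision, recall, f

gold = {("location", "area"), ("name", "agent-name"), ("price", "list-price")}
found = {("location", "area"), ("name", "agent-name"), ("city", "area")}
p, r, f = match_quality(gold, found)
print(p, r, f)  # each is 2/3 here
```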
Benchmarking
Do, Melnik, and Rahm (2003) found that evaluation studies were not comparable
Need more standardized conditions (benchmarks)
Since 2004: competitions in ontology matching (more in the next session):
Test cases and contests at http://www.ontologymatching.org/evaluation.html
Example: Tasks 2009 (various are re-used; 2013 is just out)
(excerpt; from http://oaei.ontologymatching.org/2009/); latest completed run at http://oaei.ontologymatching.org/2013/
Expressive ontologies
- anatomy: The anatomy real world case is about matching the Adult Mouse Anatomy (2744 classes) and the NCI Thesaurus (3304 classes) describing the human anatomy.
- conference: Participants will be asked to find all correct correspondences (equivalence and/or subsumption correspondences) and/or 'interesting correspondences' within a collection of ontologies describing the domain of organising conferences (a domain well understandable for every researcher). Results will be evaluated a posteriori, in part manually and in part by data-mining and logical-reasoning techniques. There will also be evaluation against a reference mapping based on a subset of the whole collection.
Directories and thesauri
- fishery gears: Features four different classification schemes, expressed in OWL, adopted by different fishery information systems in the FIM division of FAO. An alignment performed on these 4 schemes should be able to spot equivalences, or a degree of similarity, between the fishing gear types and the groups of gears, so as to enable a future exercise of data aggregation across systems.
Oriented matching
- This track focuses on the evaluation of alignments that contain mapping relations other than equivalences.
Instance matching
- very large crosslingual resources: The purpose of this task (vlcr) is to match the Thesaurus of the Netherlands Institute for Sound and Vision (called GTAA, see below for more information) to two other resources: the English WordNet from Princeton University and DBpedia.
Mice and humans
The anatomy real world case is about matching the Adult Mouse Anatomy (2744 classes) and the NCI Thesaurus (3304 classes) describing the human anatomy.
(http://oaei.ontologymatching.org/2008/anatomy/)
Matching task and evaluation approach (http://oaei.ontologymatching.org/2007/anatomy/)
We would like to gratefully thank Martin Ringwald and Terry Hayamizu (Mouse Genome Informatics - http://www.informatics.jax.org/), who provided us with a reference mapping for these ontologies.
The reference mapping contains only equivalence correspondences between concepts of the ontologies. No correspondences between properties (roles) are specified.
If your system also creates correspondences between properties or correspondences that describe subsumption relations, these results will not influence the evaluation (but can nevertheless be part of your submitted results).
The results of your matching system will be compared to this reference alignment. Therefore, all of the results have to be delivered in the format specified here.
Matching task and evaluation approach (http://oaei.ontologymatching.org/2011/oriented/index.html)
“An increasing number of matchers are now capable of deriving mapping relations other than equivalence relations, such as subsumption, disjointness or named relations.
This is a necessity given that we need to compute alignments between ontologies at different granularity levels or between ontologies that elaborate on non-equivalent elements. The evaluation of such mappings was addressed already in OAEI (2009) Oriented Matching track. […]
The track aims also to report on evaluation methods and measures for subsumption mappings, in conjunction to the computation of equivalence mappings.
Targeting these goals, we have built new benchmark datasets that are described below.”
(Some) results (http://oaei.ontologymatching.org/2009/results/anatomy/)
(Some) results (http://oaei.ontologymatching.org/2013/results/anatomy/)
Outlook
The match problem & what info to use for matching
(Semi-)automated matching: Example CUPID
(Semi-)automated matching: Example iMAP
(Automated) matching of LOD and LOD ontologies
Evaluating matching
Involving the user: Explanations; mass collaboration
Example in iMAP
User sees ranked candidates:
1. list-price = price
2. list-price = price * (1 + fee-rate)
Explanation:
a) Both were generated by the numeric searcher; 2 is ranked higher than 1
b) But:
c) the match month-posted = fee-rate
d) and the domain constraint: matches for month-posted and price do not share attributes
e) ⇒ cannot match list-price to anything to do with fee-rate
f) Why c)?
g) The data instances of fee-rate were classified as being of type date
The user corrects this wrong step f); the rest is repaired accordingly
Background knowledge structure for explanation: dependency graph
MOBS: Using mass collaboration to automate data integration
1. Initialization: a correct but partial match (e.g. title = a1, title = b2, etc.)
2. Soliciting user feedback: when a user poses a query, they must first answer a simple question; only then do they get the answer to their initial query
3. Computing user weights (e.g., trustworthiness = fraction of correct answers on known mappings)
4. Combining user feedback (e.g., weighted majority count)
Important: "instant gratification" (e.g., include the new field in the results page after a user has given helpful input)
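Steps 3 and 4 can be sketched as a weighted vote; the user names, answer sets, and the candidate mapping below are made up, and plain majority count is the special case where every weight is 1:

```python
def user_weight(answers_on_known, truth):
    """Trustworthiness = fraction of correct answers on mappings
    whose truth is already known (step 3)."""
    if not answers_on_known:
        return 0.0
    correct = sum(1 for q, a in answers_on_known.items() if truth.get(q) == a)
    return correct / len(answers_on_known)

def combine_feedback(votes, weights):
    """Weighted majority over yes/no votes on a candidate mapping
    (step 4): accept if the weighted balance is positive."""
    score = sum(weights[u] * (1 if v else -1) for u, v in votes.items())
    return score > 0

# Known (seed) mappings from the initialization step:
truth = {"title = a1": True, "title = b2": True}
weights = {
    "alice": user_weight({"title = a1": True, "title = b2": True}, truth),
    "bob":   user_weight({"title = a1": True, "title = b2": False}, truth),
}
# Hypothetical candidate mapping "price = a3": alice votes yes, bob no.
accepted = combine_feedback({"alice": True, "bob": False}, weights)
```

With these answers alice gets weight 1.0 and bob 0.5, so the candidate is accepted despite the split vote.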
Task for next week (from http://opendefinition.org/)
Do you see a statement in this definition that does not appear substantiated?
Can you give 3 reasons why it may be true?
Can you give 3 reasons why it may be false?
.... which stands in some relation to these claims ...
"An interesting outcome of our algorithm is that it identifies inconsistencies and possible errors in the linked data, and provides a method for automatically curating the Linked Data Cloud."
Outlook
The match problem & what info to use for matching
(Semi-)automated matching: Example CUPID
(Semi-)automated matching: Example iMAP
(Automated) matching of LOD and LOD ontologies
Evaluating matching
Involving the user: Explanations; mass collaboration
Invited lecture Aad Versteden (Tenforce)
References / background reading; acknowledgements
Rahm, E. & Bernstein, P.A. (2001). A survey of approaches to automatic schema matching. The VLDB Journal, 10, 334-350.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.700
Doan, A. & Halevy, A.Y. (2004). Semantic Integration Research in the Database Community: A brief survey. AI Magazine.
http://dit.unitn.it/~p2p/RelatedWork/Matching/si-survey-db-community.pdf
Hertling, S. & Paulheim, H. (2012). WikiMatch - using Wikipedia for ontology matching. In Proc. of the Seventh International Workshop on Ontology Matching. http://www.dit.unitn.it/~p2p/OM-2012/om2012_Tpaper4.pdf
Madhavan, J., Bernstein, P.A., & Rahm, E. (2001). Generic Schema Matching with Cupid. In Proc. of the 27th VLDB Conference.
http://dbs.uni-leipzig.de/de/publication/title/generic_schema_matching_with_cupid
Dhamankar, R., Lee, Y., Doan, A., Halevy, A., & Domingos, P. (2004). iMAP: Discovering complex semantic matches between database schemas. In Proc. of SIGMOD 2004.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.5.4117
Shvaiko, P. & Euzenat, J. (2005). A Survey of Schema-based Matching Approaches. Journal on Data Semantics.
http://www.dit.unitn.it/~p2p/RelatedWork/Matching/JoDS-IV-2005_SurveyMatching-SE.pdf
pp. 50ff.: Bizer, C., Cyganiak, R., & Heath, T. (2007). How to Publish Linked Data on the Web. Chapter 6. How to set RDF Links to other Data Sources. http://wifo5-03.informatik.uni-mannheim.de/bizer/pub/LinkedDataTutorial/#links
pp. 55ff.: Parundekar, R., Knoblock, C.A., & Ambite, J.L. (2012). Discovering concept coverings in ontologies of linked data sources. In Proceedings of the 11th International Semantic Web Conference (ISWC 2012), pp. 427-443, Boston, Mass. http://iswc2012.semanticweb.org/sites/default/files/76490417.pdf
Do, H.-H., Melnik, S., & Rahm, E. (2003). Comparison of schema matching evaluations. In Web, Web-Services, and Database Systems: NODe 2002, Web- and Database-Related Workshops, Erfurt, Germany, October 7-10, 2002. Revised Papers (pp. 221-237). Springer.
http://dit.unitn.it/~p2p/RelatedWork/Comparison%20of%20Schema%20Matching%20Evaluations.pdf
McCann, R., Doan, A., Varadarajan, V., & Kramnik, A. (2003). Building data integration systems via mass collaboration. In Proc. International Workshop on the Web and Databases (WebDB).
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.4.9964
Please see the PowerPoint slide-specific "notes" for the URLs of the pictures and formulae used.