

WWW2012 Tutorial: Practical Cross-Dataset Queries on the Web of Data


Instance Matching

Robert Isele, Freie Universität Berlin


Outline

Motivation

Link Discovery Tools

Linking Workflow

Silk Workbench


Motivation

The Web of Data is a single global data space because data sources are connected by links

Over 31 billion triples published as Linked Open Data and growing

But:

● Less than 500 million links

● Most publishers only link to one other dataset


Use Case 1: Publishing a New Dataset

A data provider wants to publish a new dataset

The provider wants to interlink it with existing datasets from the same domain

Example:

● A data publisher wants to publish a new dataset about movies

● Interlink movies with LinkedMDB (Linked Movie Data Base)

● Interlink directors with DBpedia (Wikipedia)


Use Case 2: Linked Data Application

A Linked Data application integrates multiple data sources from the same domain

In the decentralized Web of Data, many data sources use different URIs for the same real-world object.

Identifying these URI aliases is a central problem in Linked Data.


Challenges for Link Discovery

The Web of Data is heterogeneous:

● Many different vocabularies are in use

● Different data formats

● Many different ways to represent the same information

[Figure: Distribution of the most widely used vocabularies]


Challenges for Link Discovery

Large range of domains:

● 256 data sources in the LOD cloud from a variety of domains

● Linkage rules are different in each domain

● Writing a linkage rule for each of these domains is usually not trivial

[Figure: Distribution of triples by domain]


Challenges for Link Discovery

Scalability:

● The current LOD cloud contains 277 datasets (August 2011)

● 30 billion triples in total

● Infeasible to compare every possible entity pair


Link Discovery Tools

Tools enable data publishers to set links

Most tools generate links based on user-defined linkage rules

A linkage rule specifies the conditions data items must fulfill in order to be interlinked

Popular link discovery tools:

● Silk Link Discovery Framework

● LIMES

● Others: http://esw.w3.org/TaskForces/CommunityProjects/LinkingOpenData/EquivalenceMining


Silk Link Discovery Framework

Tool for discovering links between data items within different Linked Data sources.

The Silk Link Specification Language (Silk-LSL) allows users to express complex linkage rules

Can be used to generate owl:sameAs links as well as other relationships

Scalability and high performance through efficient data handling
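To make the output of link generation concrete, the following Python sketch serializes discovered links as N-Triples statements using the owl:sameAs property. The helper name `to_ntriples` and the example URIs are assumptions made for illustration; this is not Silk's own output code.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

def to_ntriples(links, predicate=OWL_SAME_AS):
    """Serialize (source_uri, target_uri) pairs as N-Triples lines."""
    return [f"<{source}> <{predicate}> <{target}> ." for source, target in links]

# Hypothetical link between entities of two datasets
links = [("http://example.org/movies/1", "http://example.org/films/42")]
for line in to_ntriples(links):
    print(line)
```

Passing a different `predicate` yields other relationship types instead of identity links.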


Silk Versions

Silk Single Machine

● Generate links on a single machine

● Local or remote data sets

Silk MapReduce

● Generate RDF links using a cluster of multiple machines

● Based on Hadoop (can be run on Amazon Elastic MapReduce)

Silk Server

● Provides an HTTP API for matching instances from an incoming stream of RDF data while keeping track of known entities

● Can be used as an identity resolution component within applications that consume Linked Data from the Web


Silk Workbench

Silk Workbench is a web application which guides the user through the process of interlinking different data sources.

Enables the user to manage different sets of data sources and linking tasks.

Offers a graphical editor which enables the user to easily create and edit linkage rules

Offers tools to evaluate the current linkage rule

Includes experimental support for learning linkage rules


Linking Workflow


Typical linkage rule

Select the values to be compared

● Example: Select labels and dates of a music record

Normalize the values

● Example: Transform dates to a common format

Compare different values using similarity measures

● Example: Compare labels and dates of a music record

Aggregate the results of multiple comparisons

● Example: Compute the average of the label and date similarity
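The four steps can be sketched in Python for the music record example. This is a minimal illustration, not Silk's implementation: entities are plain dictionaries, the accepted date formats are assumptions, and `difflib.SequenceMatcher` from the standard library stands in for a proper string similarity measure.

```python
from datetime import datetime
from difflib import SequenceMatcher

def select(entity, prop):
    """Step 1: select the value of a property (entities are plain dicts here)."""
    return entity[prop]

def normalize_date(value):
    """Step 2: transform dates in a few assumed formats to ISO format."""
    for fmt in ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return value  # unknown format: leave the value unchanged

def label_similarity(a, b):
    """Step 3a: case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def date_similarity(a, b):
    """Step 3b: 1.0 if the normalized dates are equal, else 0.0."""
    return 1.0 if a == b else 0.0

def linkage_rule(e1, e2):
    """Step 4: aggregate both comparisons by their average."""
    label_sim = label_similarity(select(e1, "label"), select(e2, "label"))
    date_sim = date_similarity(normalize_date(select(e1, "date")),
                               normalize_date(select(e2, "date")))
    return (label_sim + date_sim) / 2

# Illustrative records describing the same album in two data sources
record_a = {"label": "Abbey Road", "date": "1969-09-26"}
record_b = {"label": "abbey road", "date": "26.09.1969"}
print(linkage_rule(record_a, record_b))  # 1.0
```

Two entities would then be linked if the aggregated value exceeds a chosen threshold.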


Value selectors

Values in the graph around the entities can be used for comparison

Property path languages have been developed for that purpose

Examples (SPARQL 1.1 Property Paths Language):

● Entity label: rdfs:label

● Movie director name: dbpedia-owl:director/foaf:name

● Labels of all movies of a director (inverse path): ^dbpedia-owl:director/rdfs:label
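How such paths select values can be sketched with a tiny evaluator over an in-memory list of triples. It supports only the subset used above (sequences with `/` and inverse steps with `^`, written with prefixed names); the function name `evaluate_path` and the example data are hypothetical, and a real application would use a SPARQL engine instead.

```python
def evaluate_path(triples, start, path):
    """Follow a property path from `start` and return the set of reached nodes.

    Supports sequences ('p1/p2') and inverse steps ('^p') over prefixed names.
    """
    nodes = {start}
    for step in path.split("/"):
        if step.startswith("^"):
            prop = step[1:]
            # Inverse step: move from objects back to subjects
            nodes = {s for (s, p, o) in triples if p == prop and o in nodes}
        else:
            # Forward step: move from subjects to objects
            nodes = {o for (s, p, o) in triples if p == step and s in nodes}
    return nodes

# Hypothetical example data
triples = [
    ("ex:film1", "rdfs:label", "Film One"),
    ("ex:film1", "dbpedia-owl:director", "ex:person1"),
    ("ex:film2", "rdfs:label", "Film Two"),
    ("ex:film2", "dbpedia-owl:director", "ex:person1"),
    ("ex:person1", "foaf:name", "Jane Example"),
]

print(evaluate_path(triples, "ex:film1", "dbpedia-owl:director/foaf:name"))
print(sorted(evaluate_path(triples, "ex:person1", "^dbpedia-owl:director/rdfs:label")))
```

The first call starts at a movie and returns the director's name; the second starts at the director and returns the labels of both movies.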


Data Transformations

Different data sets may use different data formats

Data sets may be noisy

⇒ Values must be normalized prior to comparison


Common Transformations

Case normalization

Structural transformation

Extract values from URIs
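Each of these transformations can be sketched in a few lines of Python. The function names and the specific rules (reordering "Last, First" names, using the last path segment of a URI) are assumptions chosen for illustration; which transformations are needed depends on the data sets at hand.

```python
from urllib.parse import unquote, urlparse

def lower_case(value):
    """Case normalization."""
    return value.lower()

def reorder_name(value):
    """Structural transformation: 'Doe, John' becomes 'John Doe'."""
    if "," in value:
        last, first = (part.strip() for part in value.split(",", 1))
        return f"{first} {last}"
    return value

def label_from_uri(uri):
    """Extract a readable value from the last path segment of a URI.

    URIs that identify the entity by a fragment ('#...') are not handled here.
    """
    segment = urlparse(uri).path.rstrip("/").rsplit("/", 1)[-1]
    return unquote(segment).replace("_", " ")

print(lower_case("BERLIN"))                                         # berlin
print(reorder_name("Doe, John"))                                    # John Doe
print(label_from_uri("http://dbpedia.org/resource/New_York_City"))  # New York City
```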


Similarity Measures

A similarity measure compares two values

It returns a value between 0 (no similarity) and 1 (equality)

Formally, a similarity measure is a function:

$$\mathrm{sim} : \Sigma^{*} \times \Sigma^{*} \to [0, 1]$$

Various similarity measures have been proposed:

● Character-based measures

● Token-based measures

● Domain-specific measures


Character-Based Similarity Measures

Usually rely on character edit operations

Often used for catching typographical errors

Most popular:

● Levenshtein

● Jaro/Jaro-Winkler


Levenshtein Distance

The minimum number of edits needed to transform one string into the other

Allowed edit operations:

● insert a character into the string

● delete a character from the string

● replace one character with a different character

Examples:

● levenshtein('Table', 'Cable') = 1 (1 substitution)

● levenshtein('Table', 'able') = 1 (1 deletion)
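The distance can be computed with the standard dynamic-programming approach, sketched below in Python. The second function turns the distance into a similarity in [0, 1] by dividing by the length of the longer string; that normalization is one common choice and an assumption here, not something the slides prescribe.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def levenshtein_similarity(a, b):
    """Similarity in [0, 1], normalized by the longer string (assumed choice)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

print(levenshtein("Table", "Cable"))  # 1
print(levenshtein("Table", "able"))   # 1
```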


Token-Based Similarity Measures

Character-based measures work well for typographical errors, but fail when word arrangements differ

Example: 'John Doe', 'Doe, John', 'Mr. John Doe'

Token-based measures split the values into tokens before computing the similarity

Example: tokenize('Mr. John Doe') = {'Mr.', 'John', 'Doe'}

Most popular: Jaccard, Dice
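A minimal Python sketch of tokenization and of the Dice coefficient follows. The tokenizer splits on whitespace and commas only, which reproduces the example above; real tokenizers often also strip punctuation or split on other separators. The Dice coefficient is computed as twice the number of shared tokens divided by the sum of both set sizes.

```python
import re

def tokenize(value):
    """Split on whitespace and commas; punctuation inside tokens is kept."""
    return {token for token in re.split(r"[\s,]+", value) if token}

def dice(a, b):
    """Dice coefficient of two token sets: 2 * |A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # assumption: two empty sets count as equal
    return 2 * len(a & b) / (len(a) + len(b))

print(tokenize("Mr. John Doe") == {"Mr.", "John", "Doe"})   # True
print(tokenize("John Doe") == tokenize("Doe, John"))        # True
print(dice(tokenize("John Doe"), tokenize("Mr. John Doe"))) # 0.8
```

Because token sets ignore order, 'John Doe' and 'Doe, John' become identical after tokenization.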


Jaccard coefficient

Intuition: Measure the fraction of the tokens which are shared by both strings

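Defined as the number of matching words divided by the total number of distinct words:

$$\mathrm{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Example:

$$\mathrm{Jaccard}(\{\text{Thomas}, \text{Sean}, \text{Connery}\}, \{\text{Sir}, \text{Sean}, \text{Connery}\}) = \frac{2}{4} = 0.5$$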
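In Python the coefficient maps directly onto set operations. The sketch below assumes the inputs are already tokenized; returning 1.0 for two empty sets is an assumed convention.

```python
def jaccard(a, b):
    """Jaccard coefficient: shared tokens divided by all distinct tokens."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as equal
    return len(a & b) / len(a | b)

print(jaccard({"Thomas", "Sean", "Connery"}, {"Sir", "Sean", "Connery"}))  # 0.5
```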


Domain-Specific Similarity Measures

Geographic distance

Date/Time

Numbers
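One way to obtain similarity values for these domains is to compute a domain-specific difference and map it linearly to [0, 1]. The Python sketch below does this with the haversine great-circle distance for coordinates, a day difference for dates, and an absolute difference for numbers. The linear mapping and the default cut-off values are assumptions for illustration.

```python
import math
from datetime import date

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def linear_similarity(difference, max_difference):
    """Map a difference to [0, 1]: 1 at zero, 0 at or beyond max_difference."""
    return max(0.0, 1.0 - abs(difference) / max_difference)

def geo_similarity(p1, p2, max_km=50.0):
    return linear_similarity(haversine_km(*p1, *p2), max_km)

def date_similarity(d1, d2, max_days=30):
    return linear_similarity((d1 - d2).days, max_days)

def number_similarity(x, y, max_difference):
    return linear_similarity(x - y, max_difference)

berlin, munich = (52.52, 13.405), (48.137, 11.575)
print(round(haversine_km(*berlin, *munich)))                 # roughly 500 km
print(geo_similarity(berlin, munich))                        # 0.0 (beyond 50 km)
print(date_similarity(date(2012, 4, 16), date(2012, 4, 1)))  # 0.5
```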


Aggregating Similarity Values

In order to determine if two entities are duplicates it is usually not sufficient to compare a single property

Aggregation Functions aggregate the similarity of multiple comparisons

Example: Interlinking geographical datasets

● Compare by label and geographic coordinates

● Aggregate similarity values


Popular Aggregation Functions

Minimum

● Choose the lowest value

● ⇒ All values must exceed the threshold

Maximum

● Choose the highest value

● ⇒ At least one value must exceed the threshold

Weighted Average

● Assign a weight to each comparison

● Compute the weighted mean
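The three aggregation functions can be sketched in Python as follows. The similarity values, weights and threshold in the example are made up for illustration.

```python
def minimum(similarities):
    """All comparisons must exceed the threshold for the result to exceed it."""
    return min(similarities)

def maximum(similarities):
    """A single comparison exceeding the threshold is enough."""
    return max(similarities)

def weighted_average(similarities, weights):
    """Weighted mean of the similarity values."""
    return sum(s * w for s, w in zip(similarities, weights)) / sum(weights)

# Illustrative values: label similarity 0.9, coordinate similarity 0.6
similarities = [0.9, 0.6]
threshold = 0.7

print(minimum(similarities) >= threshold)                   # False
print(maximum(similarities) >= threshold)                   # True
print(weighted_average(similarities, [2, 1]) >= threshold)  # True (about 0.8)
```

With these values the minimum rejects the pair because the coordinates disagree too much, while the maximum and the label-heavy weighted average accept it.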


Putting it all together


Example

Interlink cities in different data sources:


Evaluating Linkage Rules

Gold standard in the form of reference links:

● Positive links (definitive matches)

● Negative links (definitive non-matches)

Based on the reference links, we can determine the number of correct and incorrect matches

We distinguish between 4 cases:

| | Positive link | Negative link |

| match(a,b) = link | True positive | False positive |

| match(a,b) = nonlink | False negative | True negative |


Evaluating Linkage Rules

Recall: Ratio of correct links compared to all known links

$$\text{recall} = \frac{|\text{true positives}|}{|\text{true positives}| + |\text{false negatives}|}$$

Precision: Ratio of correct links compared to all found links

$$\text{precision} = \frac{|\text{true positives}|}{|\text{true positives}| + |\text{false positives}|}$$

F-measure: Harmonic mean of precision and recall

$$F = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
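A minimal Python sketch of this evaluation against reference links follows. It uses the standard definitions, with recall = TP / (TP + FN) and precision = TP / (TP + FP). Links are represented as pairs of identifiers, and generated links that appear in neither reference set are ignored; both are assumptions made for this example.

```python
def evaluate(generated, positive, negative):
    """Return (precision, recall, f_measure) of generated links against reference links."""
    generated, positive, negative = set(generated), set(positive), set(negative)
    tp = len(generated & positive)   # correct links that were found
    fp = len(generated & negative)   # found links that are known non-matches
    fn = len(positive - generated)   # known matches that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

# Hypothetical reference links and rule output
positive = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}
negative = {("a1", "b2"), ("a2", "b3")}
generated = {("a1", "b1"), ("a2", "b2"), ("a1", "b2")}

print(evaluate(generated, positive, negative))
```

Here two of three generated links are correct and two of three known matches are found, so precision, recall and F-measure are all about 0.67.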


Recall-Precision diagram

A recall-precision diagram visualizes the trade-off between maximizing the recall and maximizing the precision

From: Creating probabilistic databases from duplicated data, Oktie Hassanzadeh · Renée J. Miller (VLDBJ)
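The points of such a diagram can be obtained by sweeping the similarity threshold, as in the following Python sketch. It assumes that every scored pair not listed as a positive link is a non-match, and it reports a precision of 1.0 when no links are found; both are conventions chosen for this example. The scores are made up.

```python
def precision_recall_points(scored_pairs, positive, thresholds):
    """Return (threshold, recall, precision) for each threshold.

    scored_pairs maps an entity pair to its similarity; positive is the set of true matches.
    """
    points = []
    for threshold in thresholds:
        found = {pair for pair, score in scored_pairs.items() if score >= threshold}
        tp = len(found & positive)
        precision = tp / len(found) if found else 1.0  # convention for empty results
        recall = tp / len(positive) if positive else 0.0
        points.append((threshold, recall, precision))
    return points

# Hypothetical similarity scores and gold standard
scores = {("a1", "b1"): 0.95, ("a2", "b2"): 0.8, ("a1", "b2"): 0.75, ("a3", "b3"): 0.5}
positive = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}

for threshold, recall, precision in precision_recall_points(scores, positive, [0.9, 0.7, 0.4]):
    print(threshold, round(recall, 2), round(precision, 2))
```

Lowering the threshold raises recall from 0.33 to 1.0 while precision drops from 1.0 to 0.75, which is the trade-off the diagram visualizes.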


Outline

Motivation

Link Discovery Tools

Linking Workflow

Silk Workbench


Silk Workbench

Silk Workbench offers a GUI for:

● Managing different data sources and linkage rules

● Creating linkage rules

● Executing linkage rules

● Evaluating linkage rules

● Learning linkage rules


Workspace

The Workspace holds a set of projects consisting of:

Data Sources

● Each data source holds all information that Silk needs to retrieve entities from it

● Usually a file dump or a SPARQL endpoint

Linking Tasks

● Each linking task interlinks a type of entity between two data sources

● e.g. interlinking movies in DBpedia and LinkedMDB


Linkage Rule Editor

Lets the user view and edit linkage rules

Linkage Rules are shown as a tree

Editing using drag & drop.


Generating Links


Managing Reference Links


Conclusion

In order to publish a new data set or to consume an existing dataset we need to generate links

A linkage rule specifies the conditions which must hold true for two entities in order to be considered the same real-world object.

The Silk Workbench provides a graphical user interface to create and edit linking tasks

The hands-on session will cover a simple example interlinking musical artists in Freebase and DBpedia


Q & A