A Comparison of Supervised Learning Classifiers for Link Discovery Tommaso Soru and Axel-Cyrille Ngonga Ngomo Agile Knowledge Engineering and Semantic Web Department of Computer Science University of Leipzig Augustusplatz 10, 04109 Leipzig {tsoru,ngonga}@informatik.uni-leipzig.de http://aksw.org September 4, 2014

DESCRIPTION

Slides for the paper "A Comparison of Supervised Learning Classifiers for Link Discovery" by Tommaso Soru and Axel-Cyrille Ngonga Ngomo (AKSW, University of Leipzig), presented on September 4, 2014 at the 10th International Conference on Semantic Systems (SEMANTiCS) in Leipzig, Germany.

TRANSCRIPT

Page 1: A Comparison of Supervised Learning Classifiers for Link Discovery

A Comparison of Supervised Learning Classifiers for Link Discovery

Tommaso Soru and Axel-Cyrille Ngonga Ngomo

Agile Knowledge Engineering and Semantic Web
Department of Computer Science
University of Leipzig
Augustusplatz 10, 04109 Leipzig

{tsoru,ngonga}@informatik.uni-leipzig.de
http://aksw.org

September 4, 2014

Page 2: A Comparison of Supervised Learning Classifiers for Link Discovery

SEMANTiCS 2014 — The 10th International Conference on Semantic Systems

Introduction/1

The 4th Linked Data Web Principle.

“Include links to other URIs, so that they can discover more things.” – Tim Berners-Lee

31B triples in 2011, of which only ∼3% link different datasets

> 71B triples expected in 2014

T. Soru, A. Ngonga Ngomo September 4, 2014 A Comparison of Supervised Learning Classifiers for Link Discovery 2 / 18

Page 3: A Comparison of Supervised Learning Classifiers for Link Discovery

Introduction/2

Link Discovery

What? Discover new links among resources.

How? Using supervised and unsupervised methods.

Why? Links are important for data integration, question answering, knowledge extraction.

We will focus on supervised machine-learning algorithms.

Page 5: A Comparison of Supervised Learning Classifiers for Link Discovery

Preliminaries

Link Discovery.

Given two datasets S and T, the general aim of link discovery is to find the set of resource pairs (s, t) ∈ S × T such that R(s, t) holds, where R is a given relation such as owl:sameAs or dbp:near.

Link Specification.

A link specification is a rule composed of a complex similarity function sim and a threshold θ that defines which pairs (s, t) should be linked together:

sim(s, t) ≥ θ

Main problems

1 Naïve approaches demand quadratic time complexity.

2 Efficient algorithms do not guarantee accurate link specifications.
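
A minimal sketch of a link specification applied naïvely may make the quadratic cost concrete. The similarity used here (Jaccard overlap of character trigrams) and the names `trigrams`, `sim` and `naive_link_discovery` are illustrative assumptions, not the complex similarity function of the talk.

```python
from itertools import product


def trigrams(text):
    """Return the set of character trigrams of a padded, lower-cased string."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def sim(s, t):
    """Illustrative similarity (an assumption): Jaccard overlap of trigrams."""
    a, b = trigrams(s), trigrams(t)
    return len(a & b) / len(a | b)


def naive_link_discovery(source, target, theta):
    """Apply sim(s, t) >= theta to every pair in S x T (|S| * |T| comparisons)."""
    return [(s, t) for s, t in product(source, target) if sim(s, t) >= theta]


S = ["Leipzig", "Berlin"]
T = ["leipzig", "Bern"]
links = naive_link_discovery(S, T, theta=0.8)
print(links)
```

Every pair is compared, so the work grows with |S| · |T|; this is the cost that time-efficient approaches try to avoid.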

Page 8: A Comparison of Supervised Learning Classifiers for Link Discovery

Motivation

We want to answer these questions.

Q1: Which of the paradigms achieves the best F-measures?

Q2: Which of the paradigms is most robust against noise?

Q3: Which of the methods is the most time-efficient?

Page 11: A Comparison of Supervised Learning Classifiers for Link Discovery

Overview/1

Evaluation pipeline

Alignment between properties is carried out manually.

Perfect mapping (i.e., labels)

(s, t) is a positive example iff R(s, t) holds.
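
As a rough illustration of the labelling rule, the sketch below marks a pair as positive exactly when it appears in the perfect mapping. The function name `label_pairs` and the toy identifiers are hypothetical.

```python
from itertools import product


def label_pairs(source_ids, target_ids, perfect_mapping):
    """Label each pair in S x T: 1 if it is in the perfect mapping, else 0."""
    gold = set(perfect_mapping)
    return {(s, t): int((s, t) in gold) for s, t in product(source_ids, target_ids)}


# Toy identifiers (made up for illustration).
source_ids = ["s1", "s2"]
target_ids = ["t1", "t2"]
perfect_mapping = [("s1", "t1"), ("s2", "t2")]

labels = label_pairs(source_ids, target_ids, perfect_mapping)
print(labels)
```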

Page 12: A Comparison of Supervised Learning Classifiers for Link Discovery

Overview/2

Assumptions

The complex similarity function sim compares property values.

In case of

datatype properties: it uses text/numerical/date similarities.

object properties: it applies the similarities iteratively.

Graph structure has not been considered as a feature per se.

Cross-validation has been preferred over semi-supervised learning because it yields more accurate results.
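
For readers unfamiliar with cross-validation, a minimal sketch of a k-fold split follows. The slides do not state the number of folds, so `k`, the seed, and the function name `k_fold_splits` are assumptions made for illustration.

```python
import random


def k_fold_splits(n_examples, k, seed=0):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal parts
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_splits(n_examples=10, k=5))
for train, test in splits:
    print(sorted(test), "held out;", len(train), "used for training")
```

Each labelled pair is used for testing exactly once and for training in the remaining folds.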

Page 13: A Comparison of Supervised Learning Classifiers for Link Discovery

Evaluation Setup/1

Similarities

for string values:

Weighted trigram similarity, setting tf-idf scores as weights

Weighted edit distance, setting confusion matrices as weights

Cosine similarity

for numerical values:

Logarithmic similarity

for date values:

a day-based Date similarity
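
The sketch below shows how one similarity per aligned property could be turned into a feature vector for a classifier. Only cosine similarity follows its standard definition; the forms chosen for the logarithmic and the day-based date similarity are assumptions, since the slides do not give their formulas, and the weighted trigram and edit-distance measures are left out. All names and the toy records are hypothetical.

```python
import math
from collections import Counter
from datetime import date


def cosine_similarity(a, b):
    """Cosine of the angle between the token-count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[tok] * vb[tok] for tok in va)
    norm = math.sqrt(sum(c * c for c in va.values()) * sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def log_similarity(x, y):
    """Assumed form: 1 / (1 + |ln(1 + x) - ln(1 + y)|), for non-negative numbers."""
    return 1.0 / (1.0 + abs(math.log1p(x) - math.log1p(y)))


def date_similarity(d1, d2, max_days=365):
    """Assumed form: linear decay with the distance in days, clipped at 0."""
    return max(0.0, 1.0 - abs((d1 - d2).days) / max_days)


SIMILARITY_BY_TYPE = {
    "string": cosine_similarity,
    "number": log_similarity,
    "date": date_similarity,
}


def feature_vector(s, t, aligned_properties):
    """One similarity value per aligned property pair (p, q) of a given type."""
    return [SIMILARITY_BY_TYPE[kind](s[p], t[q]) for p, q, kind in aligned_properties]


# Toy records and a manual property alignment (made up for illustration).
s = {"name": "Mario Rossi", "age": 30, "born": date(1984, 1, 1)}
t = {"label": "mario rossi", "years": 30, "birth": date(1984, 1, 1)}
alignment = [("name", "label", "string"), ("age", "years", "number"), ("born", "birth", "date")]

vector = feature_vector(s, t, alignment)
print(vector)
```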

Page 14: A Comparison of Supervised Learning Classifiers for Link Discovery

Evaluation Setup/2

Linear non-probabilistic classifiers

Linear SVM*

Polynomial SVM*

Linear SVM with Sequential Minimal Optimization

Linear Regression

Probabilistic classifiers

Logistic Regression

Naïve Bayes

Random Tree

J48

Neural networks

Multilayer Perceptron

Rule-based classifiers

Decision Table

All classifiers come from the Weka library, except those marked (*), which come from LibSVM.
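
To show what such a classifier receives and returns, here is a small stand-in written in plain Python: a logistic regression trained by batch gradient descent on similarity vectors. It is not the Weka or LibSVM implementation used in the evaluation, and the hyperparameters, function names and toy data are assumptions.

```python
import math


def train_logistic_regression(X, y, epochs=2000, lr=0.5):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            logit = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-logit))  # estimated probability of a link
            for j, xj in enumerate(xi):
                grad_w[j] += (p - yi) * xj
            grad_b += p - yi
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(w, b, x):
    """Return 1 (link) if the estimated probability is at least 0.5, else 0."""
    return int(sum(wj * xj for wj, xj in zip(w, x)) + b >= 0.0)


# Toy similarity vectors (made-up values): high similarities for true links.
X = [[0.95, 0.90], [0.85, 1.00], [0.20, 0.10], [0.10, 0.30]]
y = [1, 1, 0, 0]

w, b = train_logistic_regression(X, y)
predictions = [predict(w, b, x) for x in X]
print(predictions)
```

Each input row is a vector of property similarities for one pair (s, t), and the output is the decision whether to link that pair.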

Page 15: A Comparison of Supervised Learning Classifiers for Link Discovery

Evaluation Setup/3

Datasets

D1-D3: synthetic datasets from the Ontology Alignment Evaluation Initiative (OAEI) 2010 Benchmark

D4-D6: real datasets from the Benchmark for Entity Resolution, DBS Leipzig

D5-D6: datasets having a high level of noise

#   dataset                 domain         size
D1  OAEI-Persons1           personal data  250k
D2  OAEI-Persons2           personal data  240k
D3  OAEI-Restaurants        places         72k
D4  DBLP–ACM                bibliographic  6M
D5  Amazon–GoogleProducts   e-commerce     10M
D6  ABT–Buy                 e-commerce     1M

Page 16: A Comparison of Supervised Learning Classifiers for Link Discovery

Results/1

F-measure

Classifier              D1       D2       D3       D4      D5      D6
Linear SVM              99.40%   98.99%   97.75%   97.81%  27.06%  39.18%
Linear SMO             100.00%   98.73%  100.00%   92.58%  46.63%  31.39%
Polynomial-3 SVM        99.40%   93.76%   98.29%   97.67%  37.28%  31.69%
Multilayer Perceptron   99.50%   99.50%  100.00%   97.43%  35.58%  43.49%
Logistic Regression     99.90%   98.12%   96.67%   97.71%  40.64%  41.92%
Linear Regression       99.30%   96.92%  100.00%   96.36%  37.06%  36.84%
Naïve Bayes             97.75%   35.05%   95.19%   29.47%   2.92%  11.90%
Decision Table          97.98%  100.00%  100.00%   97.66%  42.44%  29.66%
Random Tree             97.45%   99.24%   89.89%   96.82%  39.38%  41.03%
J48                     99.50%   95.56%   98.29%   97.66%  44.28%  31.53%

State of the Art       100.00%  100.00%  100.00%   98.20%  62.10%  71.30%

F-measure calculated on the class of positive examples.
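
For reference, a minimal sketch of precision, recall and F-measure computed on the positive class only; the function name and the toy label lists are illustrative.

```python
def f_measure(gold, predicted, positive=1):
    """Return (precision, recall, F1) for the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels (made up): 3 true links, of which 2 are found, plus 1 false alarm.
gold = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]

precision, recall, f1 = f_measure(gold, predicted)
print(precision, recall, f1)
```

Correctly rejected non-links do not enter the score, which matters because negative pairs vastly outnumber positive ones in S × T.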

Page 17: A Comparison of Supervised Learning Classifiers for Link Discovery

Results/2

Computation runtimes

Classifier                D1     D2    D3      D4        D5      D6
Linear SVM              7.16   6.93  2.67   63.94    484.29   75.44
Linear SMO             17.07  12.93  3.77  113.40    369.20   37.16
Polynomial-3 SVM        5.67   6.18  2.63  162.82  1,091.10  103.89
Multilayer Perceptron  15.13  16.10  3.40   96.96    376.26   41.68
Logistic Regression    16.11  14.91  4.61  110.12    275.94   38.48
Linear Regression      16.04  16.21  5.02  120.54    497.43   44.50
Naïve Bayes            17.34  17.09  4.39  105.31    375.91   43.79
Decision Table         16.68  16.44  3.78   90.99    389.35   48.87
Random Tree            12.02  11.16  2.24   53.67    347.36   34.11
J48                    21.31  15.96  6.99  131.57     98.27   38.46

All values in seconds.

Page 18: A Comparison of Supervised Learning Classifiers for Link Discovery

Results/3

Considerations

Some average trends can be suggested, yet no algorithm significantly outperforms all the others.

Multilayer Perceptrons performed best both including and excluding the noisy datasets.

Random Trees seem to be the fastest approach overall.

The different approaches seem complementary in their behaviour.

Naïve Bayes might fail because it treats all features as independent of each other.
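
One simple way to check the F-measure trends is to average the values from the Results/1 table per classifier. The unweighted mean is an assumption; the slides do not say how the trends were aggregated.

```python
from statistics import mean

# F-measures (%) copied from the Results/1 table, columns D1..D6.
F_MEASURE = {
    "Linear SVM": [99.40, 98.99, 97.75, 97.81, 27.06, 39.18],
    "Linear SMO": [100.00, 98.73, 100.00, 92.58, 46.63, 31.39],
    "Polynomial-3 SVM": [99.40, 93.76, 98.29, 97.67, 37.28, 31.69],
    "Multilayer Perceptron": [99.50, 99.50, 100.00, 97.43, 35.58, 43.49],
    "Logistic Regression": [99.90, 98.12, 96.67, 97.71, 40.64, 41.92],
    "Linear Regression": [99.30, 96.92, 100.00, 96.36, 37.06, 36.84],
    "Naive Bayes": [97.75, 35.05, 95.19, 29.47, 2.92, 11.90],
    "Decision Table": [97.98, 100.00, 100.00, 97.66, 42.44, 29.66],
    "Random Tree": [97.45, 99.24, 89.89, 96.82, 39.38, 41.03],
    "J48": [99.50, 95.56, 98.29, 97.66, 44.28, 31.53],
}


def ranking(columns):
    """Rank classifiers by unweighted mean F-measure over the given columns."""
    columns = list(columns)
    scores = {name: mean(values[i] for i in columns) for name, values in F_MEASURE.items()}
    return sorted(scores, key=scores.get, reverse=True)


all_datasets = ranking(range(6))  # D1-D6
clean_only = ranking(range(4))    # D1-D4, excluding the noisy datasets
noisy_only = ranking([4, 5])      # D5-D6

print("all:  ", all_datasets[:3])
print("clean:", clean_only[:3])
print("noisy:", noisy_only[:3])
```

Under this averaging, Multilayer Perceptron ranks first both on D1-D6 and on D1-D4, in line with the observation above, although the margin over Logistic Regression on D1-D6 is well under one percentage point.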

Page 19: A Comparison of Supervised Learning Classifiers for Link Discovery

Results/4

Answers

Q1: Which of the paradigms achieves the best F-measures?

A1: Multilayer Perceptrons, Linear SVMs, Decision Tables.

Q2: Which of the paradigms is most robust against noise?

A2: Logistic Regression, Random Trees, Multilayer Perceptrons.

Q3: Which of the methods is the most time-efficient?

A3: Random Trees; however, all approaches scale well.

Page 23: A Comparison of Supervised Learning Classifiers for Link Discovery

Related Work

Time-efficient deduplication algorithms (PPJoin+, EDJoin, PassJoin, TrieJoin)

LIMES – Link Discovery Framework for Metric Spaces

Approaches for learning link specifications (HYPPO, HR3, EAGLE, ACIDS)

Dedicated efficient methods (RDF-AI, REEDED)

LinkLion – A Link Repository for the Web of Data

The SAIM interface

Other link discovery frameworks (SILK, LDIF)

Other machine learning frameworks (MARLIN, FEBRL, RAVEN)

Other blocking techniques (MultiBlock, KnoFuss)

Page 24: A Comparison of Supervised Learning Classifiers for Link Discovery

Future Work

1 Integration of Multilayer Perceptrons into the LIMES framework.

2 Use of ensemble learning techniques.

3 Evaluation in a semi-supervised learning setting with little training data.

4 Evaluation using a larger number of similarity measures.

5 Incorporation of a component based on Statistical Relational Learning.

Page 25: A Comparison of Supervised Learning Classifiers for Link Discovery

Web resources

Source code – Batch Learners Evaluation for Link Discovery
http://github.com/mommi84/BALLAD

Technical report – Batch Learners Evaluation for Link Discovery
http://mommi84.github.io/BALLAD

The OAEI 2010 Benchmark
http://oaei.ontologymatching.org/2010/benchmarks

The Benchmark for Entity Resolution, DBS Leipzig
http://goo.gl/bvWBjA

Weka – Data Mining Software in Java
http://www.cs.waikato.ac.nz/ml/weka

LibSVM – A Library for Support Vector Machines
http://www.csie.ntu.edu.tw/~cjlin/libsvm

LIMES – Link Discovery Framework for Metric Spaces
http://aksw.org/Projects/LIMES

LinkLion – A Link Repository for the Web of Data
http://www.linklion.org

Page 26: A Comparison of Supervised Learning Classifiers for Link Discovery

Thank you for your attention.
