a comparison of supervised learning classifiers for link discovery
DESCRIPTION
Slides for the paper "A Comparison of Supervised Learning Classifiers for Link Discovery" by Tommaso Soru and Axel-Cyrille Ngonga Ngomo (AKSW, University of Leipzig), presented on September 4, 2014 at the 10th International Conference on Semantic Systems (SEMANTiCS) in Leipzig, Germany.
TRANSCRIPT
A Comparison of Supervised Learning Classifiers for Link Discovery
Tommaso Soru and Axel-Cyrille Ngonga Ngomo
Agile Knowledge Engineering and Semantic Web
Department of Computer Science
University of Leipzig
Augustusplatz 10, 04109 Leipzig
{tsoru,ngonga}@informatik.uni-leipzig.de
http://aksw.org
September 4, 2014
SEMANTiCS 2014 — The 10th International Conference on Semantic Systems
Introduction/1
The 4th Linked Data Web Principle.
“Include links to other URIs, so that they can discover more things.” – Tim Berners-Lee
31B triples in 2011, of which only ∼3% link different datasets
> 71B triples expected in 2014
T. Soru, A. Ngonga Ngomo September 4, 2014 A Comparison of Supervised Learning Classifiers for Link Discovery2 / 18
Introduction/2
Link Discovery
What? Discover new links among resources.
How? Using supervised and unsupervised methods.
Why? Links are important for data integration, question answering, and knowledge extraction.
We will focus on supervised machine-learning algorithms.
Preliminaries
Link Discovery.
Given two datasets S and T, the general aim of link discovery is to find the set of resource pairs (s, t) ∈ S × T such that R(s, t) holds, where R is a given relation such as owl:sameAs or dbp:near.
Link Specification.
A link specification is a rule composed of a complex similarity function sim and a threshold θ that defines which pairs (s, t) should be linked together:
sim(s, t) ≥ θ
Main problems
1 Naïve approaches demand quadratic time complexity.
2 Efficient algorithms ⇏ accurate link specifications.
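A minimal sketch of the naïve approach described above: every pair in S × T is compared and kept if sim(s, t) ≥ θ. The function names, the token-overlap similarity and the example strings are illustrative assumptions, not the similarity function used in the paper.

```python
from itertools import product


def token_jaccard(a: str, b: str) -> float:
    """Hypothetical stand-in for sim: Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def naive_link_discovery(source, target, sim, theta):
    """Return all pairs (s, t) in S x T with sim(s, t) >= theta.

    Every pair is compared, so the cost is |S| * |T| similarity calls.
    """
    return [(s, t) for s, t in product(source, target) if sim(s, t) >= theta]


# Toy data, invented for illustration only.
S = ["Leipzig University", "Graz University of Technology"]
T = ["University Leipzig", "TU Graz"]
links = naive_link_discovery(S, T, token_jaccard, theta=0.8)
print(links)
```

The second pair ("Graz University of Technology", "TU Graz") is missed by this simple specification, which illustrates problem 2: a fast comparison does not by itself give an accurate link specification.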
Motivation
We want to answer these questions.
Q1: Which of the paradigms achieves the best F-measures?
Q2: Which of the paradigms is most robust against noise?
Q3: Which of the methods is the most time-efficient?
Overview/1
Evaluation pipeline
Alignment between properties is carried out manually.
Perfect mapping (i.e., labels)
(s, t) is a positive example iff R(s, t) holds.
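One way to picture this step is sketched below, assuming a manual property alignment and a perfect mapping given as a set of (source id, target id) pairs. The property names, the records and the use of difflib as the string similarity are hypothetical choices for illustration; the slides do not show the actual pipeline code.

```python
from difflib import SequenceMatcher

# Hypothetical manual property alignment: (source property, target property).
ALIGNMENT = [("name", "label"), ("city", "location")]


def string_sim(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1] (difflib ratio, not a measure from the slides)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_examples(source, target, perfect_mapping):
    """Return one (features, label) tuple for every pair in S x T.

    features: one similarity value per aligned property pair.
    label: True iff the pair is in the perfect mapping, i.e. R(s, t) holds.
    """
    examples = []
    for s_id, s in source.items():
        for t_id, t in target.items():
            features = [string_sim(s[p], t[q]) for p, q in ALIGNMENT]
            label = (s_id, t_id) in perfect_mapping
            examples.append((features, label))
    return examples


# Toy records, invented for illustration only.
source = {
    "s1": {"name": "Cafe Roma", "city": "Leipzig"},
    "s2": {"name": "Sushi Bar", "city": "Berlin"},
}
target = {
    "t1": {"label": "Cafe Roma", "location": "Leipzig"},
    "t2": {"label": "Pizzeria Napoli", "location": "Dresden"},
}
perfect_mapping = {("s1", "t1")}
examples = build_examples(source, target, perfect_mapping)
```

Each resulting tuple can then be passed to any of the classifiers as a training or test instance.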
Overview/2
Assumptions
The complex similarity function sim compares property values.
In case of
  datatype properties: it uses text/numerical/date similarities.
  object properties: it applies the similarities iteratively.
Graph structure has not been considered as a feature per se.
Cross-validation has been preferred over semi-supervised learning because it yields more accurate results.
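As a rough illustration of the cross-validation setting, the sketch below splits example indices into k folds so that each example is used for testing exactly once. The number of folds and the seed are assumptions; the slides do not state the values used in the evaluation.

```python
import random


def k_fold_splits(n_examples: int, k: int, seed: int = 0):
    """Yield (train_indices, test_indices) for k-fold cross-validation."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# k=5 is a hypothetical choice for this example.
splits = list(k_fold_splits(n_examples=10, k=5))
```

A classifier would be trained on each train part and scored on the matching test part, and the scores averaged over the folds.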
Evaluation Setup/1
Similarities
for string values:
Weighted trigram similarity, setting tf-idf scores as weights
Weighted edit distance, setting confusion matrices as weights
Cosine similarity
for numerical values:
Logarithmic similarity
for date values:
a day-based Date similarity
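Two of the string similarities can be sketched in simplified form as follows. This is an unweighted version: the tf-idf weighting of trigrams mentioned above is left out, and the padding scheme and whitespace tokenisation are assumptions. The logarithmic and date similarities are not sketched because the slides do not give their formulas.

```python
import math
from collections import Counter


def trigrams(text: str) -> set:
    """Character trigrams of the lowercased string, padded with spaces."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Unweighted Jaccard overlap of the two trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the token-count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[tok] * vb[tok] for tok in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Both functions return values in [0, 1], so their outputs can be used directly as features alongside the other similarities.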
Evaluation Setup/2
Linear non-probabilistic classifiers
Linear SVM*
Polynomial SVM*
Linear SVM with Sequential Minimal Optimization
Linear Regression
Probabilistic classifiers
Logistic Regression
Naïve Bayes
Random Tree
J48
Neural networks
Multilayer Perceptron
Rule-based classifiers
Decision Table
We used classifiers from the Weka library, except (*) from LibSVM.
Evaluation Setup/3
Datasets
D1-D3: synthetic datasets from the Ontology Alignment Evaluation Initiative (OAEI) 2010 Benchmark
D4-D6: real datasets from the Benchmark for Entity Resolution, DBS Leipzig
D5-D6: datasets having a high level of noise
#   Dataset                Domain         Size
D1  OAEI-Persons1          personal data  250k
D2  OAEI-Persons2          personal data  240k
D3  OAEI-Restaurants       places         72k
D4  DBLP–ACM               bibliographic  6M
D5  Amazon–GoogleProducts  e-commerce     10M
D6  ABT–Buy                e-commerce     1M
Results/1
F-measure
Classifier             D1       D2       D3       D4      D5      D6
Linear SVM             99.40%   98.99%   97.75%   97.81%  27.06%  39.18%
Linear SMO             100.00%  98.73%   100.00%  92.58%  46.63%  31.39%
Polynomial-3 SVM       99.40%   93.76%   98.29%   97.67%  37.28%  31.69%
Multilayer Perceptron  99.50%   99.50%   100.00%  97.43%  35.58%  43.49%
Logistic Regression    99.90%   98.12%   96.67%   97.71%  40.64%  41.92%
Linear Regression      99.30%   96.92%   100.00%  96.36%  37.06%  36.84%
Naïve Bayes            97.75%   35.05%   95.19%   29.47%  2.92%   11.90%
Decision Table         97.98%   100.00%  100.00%  97.66%  42.44%  29.66%
Random Tree            97.45%   99.24%   89.89%   96.82%  39.38%  41.03%
J48                    99.50%   95.56%   98.29%   97.66%  44.28%  31.53%
State of the Art       100.00%  100.00%  100.00%  98.20%  62.10%  71.30%
F-measure calculated on the class of positive examples.
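For reference, an F-measure restricted to the positive class can be computed as sketched below. The balanced form (F1, the harmonic mean of precision and recall) is an assumption here, since the slides only say "F-measure"; the function name and the toy labels are illustrative.

```python
def f_measure_positive(predicted, actual):
    """F1 score of the positive class from two parallel lists of booleans."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy labels: one true positive, one false positive, one false negative.
score = f_measure_positive(
    predicted=[True, True, False, False],
    actual=[True, False, True, False],
)
```

True negatives do not enter the formula, so a classifier that rejects nearly every pair scores poorly even when most pairs are in fact non-links.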
Results/2
Computation runtimes
Classifier             D1     D2     D3    D4      D5        D6
Linear SVM             7.16   6.93   2.67  63.94   484.29    75.44
Linear SMO             17.07  12.93  3.77  113.40  369.20    37.16
Polynomial-3 SVM       5.67   6.18   2.63  162.82  1,091.10  103.89
Multilayer Perceptron  15.13  16.10  3.40  96.96   376.26    41.68
Logistic Regression    16.11  14.91  4.61  110.12  275.94    38.48
Linear Regression      16.04  16.21  5.02  120.54  497.43    44.50
Naïve Bayes            17.34  17.09  4.39  105.31  375.91    43.79
Decision Table         16.68  16.44  3.78  90.99   389.35    48.87
Random Tree            12.02  11.16  2.24  53.67   347.36    34.11
J48                    21.31  15.96  6.99  131.57  98.27     38.46
All values in seconds.
Results/3
Considerations
Some average trends can be suggested, yet no algorithm significantly outperforms all the others.
Multilayer Perceptrons performed best both with and without the noisy datasets.
Random Trees seem to be the fastest approach overall.
The different approaches seem complementary in their behaviour.
Naïve Bayes might fail because it treats all features as independent of each other.
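The claim about Multilayer Perceptrons can be checked with simple arithmetic on the F-measure table. The sketch below averages the reported values per classifier, once over D1-D6 and once over D1-D4 only (D5-D6 being the noisy datasets). Using an unweighted mean as the ranking criterion is an assumption; the slides do not say how the average trends were derived.

```python
# F-measure values (in %) copied from the F-measure table, columns D1..D6.
F_MEASURE = {
    "Linear SVM": [99.40, 98.99, 97.75, 97.81, 27.06, 39.18],
    "Linear SMO": [100.00, 98.73, 100.00, 92.58, 46.63, 31.39],
    "Polynomial-3 SVM": [99.40, 93.76, 98.29, 97.67, 37.28, 31.69],
    "Multilayer Perceptron": [99.50, 99.50, 100.00, 97.43, 35.58, 43.49],
    "Logistic Regression": [99.90, 98.12, 96.67, 97.71, 40.64, 41.92],
    "Linear Regression": [99.30, 96.92, 100.00, 96.36, 37.06, 36.84],
    "Naive Bayes": [97.75, 35.05, 95.19, 29.47, 2.92, 11.90],
    "Decision Table": [97.98, 100.00, 100.00, 97.66, 42.44, 29.66],
    "Random Tree": [97.45, 99.24, 89.89, 96.82, 39.38, 41.03],
    "J48": [99.50, 95.56, 98.29, 97.66, 44.28, 31.53],
}
NOISY = {4, 5}  # zero-based column positions of D5 and D6


def ranking(exclude_noisy: bool):
    """Classifier names sorted by mean F-measure, best first."""
    scores = {}
    for name, values in F_MEASURE.items():
        kept = [v for i, v in enumerate(values)
                if not (exclude_noisy and i in NOISY)]
        scores[name] = sum(kept) / len(kept)
    return sorted(scores, key=scores.get, reverse=True)


best_all = ranking(exclude_noisy=False)[0]
best_clean = ranking(exclude_noisy=True)[0]
```

Under this criterion the gaps between the top classifiers are small (well under one percentage point), which is consistent with the remark that no algorithm significantly outperforms all the others.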
Results/4
Answers
Q1: Which of the paradigms achieves the best F-measures?
A1: Multilayer Perceptrons, Linear SVMs, Decision Tables.
Q2: Which of the paradigms is most robust against noise?
A2: Logistic Regression, Random Trees, Multilayer Perceptrons.
Q3: Which of the methods is the most time-efficient?
A3: Random Trees; however, all approaches scale well.
Related Work
Time-efficient deduplication algorithms (PPJoin+, EDJoin, PassJoin, TrieJoin)
LIMES – Link Discovery Framework for Metric Spaces
Approaches for learning link specifications (HYPPO, HR3, EAGLE, ACIDS)
Dedicated efficient methods (RDF-AI, REEDED)
LinkLion – A Link Repository for the Web of Data
The SAIM interface
Other link discovery frameworks (SILK, LDIF)
Other machine learning frameworks (MARLIN, FEBRL, RAVEN)
Other blocking techniques (MultiBlock, KnoFuss)
Future Work
1 Integration of Multilayer Perceptrons into the LIMES framework.
2 Use of ensemble learning techniques.
3 Evaluation in a semi-supervised learning setting with little training data.
4 Evaluation using a larger number of similarity measures.
5 Incorporation of a component based on Statistical Relational Learning.
Web resources
Source code – Batch Learners Evaluation for Link Discovery
http://github.com/mommi84/BALLAD
Technical report – Batch Learners Evaluation for Link Discovery
http://mommi84.github.io/BALLAD
The OAEI 2010 Benchmark
http://oaei.ontologymatching.org/2010/benchmarks
The Benchmark for Entity Resolution, DBS Leipzig
http://goo.gl/bvWBjA
Weka – Data Mining Software in Java
http://www.cs.waikato.ac.nz/ml/weka
LibSVM – A Library for Support Vector Machines
http://www.csie.ntu.edu.tw/~cjlin/libsvm
LIMES – Link Discovery Framework for Metric Spaces
http://aksw.org/Projects/LIMES
LinkLion – A Link Repository for the Web of Data
http://www.linklion.org
Thank you for your attention.