A Comparison of Supervised Learning Classifiers for Link Discovery

Tommaso Soru and Axel-Cyrille Ngonga Ngomo

Agile Knowledge Engineering and Semantic Web

Department of Computer Science

University of Leipzig

Augustusplatz 10, 04109 Leipzig

{tsoru,ngonga}@informatik.uni-leipzig.de

http://aksw.org

September 4, 2014

SEMANTiCS 2014 — The 10th International Conference on Semantic Systems

Introduction/1

The 4th Linked Data Web Principle.

“Include links to other URIs, so that they can discover more things.” – Tim Berners-Lee

31B triples in 2011, of which only ∼3% link different datasets

> 71B triples expected in 2014


Introduction/2

Link Discovery

What? Discover new links among resources.

How? Using supervised and unsupervised methods.

Why? Links are important for data integration, question answering, knowledge extraction.

We will focus on supervised machine-learning algorithms.



Preliminaries

Link Discovery.

Given two datasets S and T, the general aim of link discovery is to find the set of resource pairs (s, t) ∈ S × T such that R(s, t) holds, where R is a given relation such as owl:sameAs or dbp:near.

Link Specification.

A link specification is a rule composed of a complex similarity function sim and a threshold θ that defines which pairs (s, t) should be linked together:

sim(s, t) ≥ θ

Main problems

1 Naïve approaches demand quadratic time complexity.

2 Efficient algorithms ⇝ accurate link specifications.
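
The definitions above can be illustrated with a minimal sketch of the naïve approach. The word-level Jaccard function, the example labels, and the threshold value are assumptions chosen for illustration; they are not the similarity function or data used in the evaluation. The sketch compares every pair in S × T, which is where the quadratic cost comes from.

```python
from itertools import product


def jaccard(a: str, b: str) -> float:
    """Toy similarity (an assumption): Jaccard overlap of lowercase word sets."""
    x, y = set(a.lower().split()), set(b.lower().split())
    return len(x & y) / len(x | y) if x | y else 1.0


def naive_link_discovery(source, target, sim, theta):
    """Return all pairs (s, t) with sim(s, t) >= theta.

    Every pair in S x T is compared, so sim is called |S| * |T| times.
    """
    return [(s, t) for s, t in product(source, target) if sim(s, t) >= theta]


# Hypothetical resource labels standing in for the datasets S and T.
S = ["Leipzig University", "Graz University of Technology"]
T = ["University Leipzig", "TU Wien"]
links = naive_link_discovery(S, T, jaccard, theta=0.8)
print(links)
```

With these toy inputs, only the pair whose labels share all words reaches the threshold.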



Motivation

We want to answer these questions.

Q1: Which of the paradigms achieves the best F-measures?

Q2: Which of the paradigms is most robust against noise?

Q3: Which of the methods is the most time-efficient?



Overview/1

Evaluation pipeline

Alignment between properties is carried out manually.

Perfect mapping (i.e., labels)

(s, t) is a positive example iff R(s, t) holds.



Overview/2

Assumptions

The complex similarity function sim compares property values.

In case of

datatype properties: it uses text/numerical/date similarities.

object properties: it applies the similarities iteratively.

Graph structure has not been considered as a feature per se.

Cross-validation has been preferred over semi-supervised learning because it yields more accurate results.
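
As a reminder of what cross-validation involves, here is a minimal sketch of how labelled pairs can be partitioned into folds. The function name, the round-robin fold assignment, and the values k = 5 and 10 items are assumptions for illustration; the slides do not state the number of folds or the splitting strategy used.

```python
def k_fold_indices(n_items: int, k: int):
    """Yield (train, test) index lists for k-fold cross-validation.

    Items are assigned to folds round-robin; each fold is the test set once.
    """
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test


# Hypothetical example: 10 labelled pairs, 5 folds.
splits = list(k_fold_indices(n_items=10, k=5))
for train, test in splits:
    print("train:", train, "test:", test)
```

Each labelled pair appears in exactly one test fold, so every example contributes to the reported score once.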



Evaluation Setup/1

Similarities

for string values:

Weighted trigram similarity, setting tf-idf scores as weights

Weighted edit distance, setting confusion matrices as weights

Cosine similarity

for numerical values:

Logarithmic similarity

for date values:

Day-based date similarity
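
To make the three value types concrete, the following sketch shows one simple similarity per type. The formulas are assumptions: the trigram similarity is unweighted (no tf-idf), and the logarithmic and day-based forms are plausible choices, not the definitions used in the evaluation, which the slides do not give. Each function returns a score in [0, 1] that could serve as one feature of a pair (s, t).

```python
import math
from datetime import date


def trigrams(text: str) -> set:
    """Character trigrams of the padded, lowercased text."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Unweighted Jaccard overlap of character trigrams (assumed form)."""
    x, y = trigrams(a), trigrams(b)
    return len(x & y) / len(x | y)


def logarithmic_similarity(a: float, b: float) -> float:
    """Assumed form: 1 / (1 + ln(1 + |a - b|)); equals 1.0 when a == b."""
    return 1.0 / (1.0 + math.log1p(abs(a - b)))


def date_similarity(a: date, b: date) -> float:
    """Assumed form: 1 / (1 + difference in days); equals 1.0 for equal dates."""
    return 1.0 / (1.0 + abs((a - b).days))


# Hypothetical property values of one resource pair, turned into a feature vector.
features = [
    trigram_similarity("Soru", "Soru"),
    logarithmic_similarity(2014, 2014),
    date_similarity(date(2014, 9, 4), date(2014, 9, 5)),
]
print(features)
```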



Evaluation Setup/2

Linear non-probabilistic classifiers

Linear SVM*

Polynomial SVM*

Linear SVM with Sequential Minimal Optimization

Linear Regression

Probabilistic classifiers

Logistic Regression

Naïve Bayes

Random Tree

J48

Neural networks

Multilayer Perceptron

Rule-based classifiers

Decision Table

We used classifiers from the Weka library, except those marked (*), which come from LibSVM.



Evaluation Setup/3

Datasets

D1-D3: synthetic datasets from the Ontology Alignment Evaluation Initiative (OAEI) 2010 Benchmark

D4-D6: real datasets from the Benchmark for Entity Resolution, DBS Leipzig

D5-D6: datasets having a high level of noise

#    dataset                 domain          size
D1   OAEI-Persons1           personal data   250k
D2   OAEI-Persons2           personal data   240k
D3   OAEI-Restaurants        places          72k
D4   DBLP–ACM                bibliographic   6M
D5   Amazon–GoogleProducts   e-commerce      10M
D6   ABT–Buy                 e-commerce      1M



Results/1

F-measure

Classifier              D1        D2        D3        D4       D5       D6
Linear SVM              99.40%    98.99%    97.75%    97.81%   27.06%   39.18%
Linear SMO              100.00%   98.73%    100.00%   92.58%   46.63%   31.39%
Polynomial-3 SVM        99.40%    93.76%    98.29%    97.67%   37.28%   31.69%
Multilayer Perceptron   99.50%    99.50%    100.00%   97.43%   35.58%   43.49%
Logistic Regression     99.90%    98.12%    96.67%    97.71%   40.64%   41.92%
Linear Regression       99.30%    96.92%    100.00%   96.36%   37.06%   36.84%
Naïve Bayes             97.75%    35.05%    95.19%    29.47%   2.92%    11.90%
Decision Table          97.98%    100.00%   100.00%   97.66%   42.44%   29.66%
Random Tree             97.45%    99.24%    89.89%    96.82%   39.38%   41.03%
J48                     99.50%    95.56%    98.29%    97.66%   44.28%   31.53%

State of the Art        100.00%   100.00%   100.00%   98.20%   62.10%   71.30%

F-measure calculated on the class of positive examples.
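
For reference, the F-measure on the positive class can be computed as in the sketch below. It assumes the balanced F1 variant (the slides do not state the β value), and the label vectors are made-up toy data, not results from the evaluation.

```python
def f_measure(predicted, actual, positive=True):
    """Balanced F-measure (F1) restricted to the positive class."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == positive and a == positive)
    fp = sum(1 for p, a in pairs if p == positive and a != positive)
    fn = sum(1 for p, a in pairs if p != positive and a == positive)
    if tp == 0:
        return 0.0  # no correctly predicted link: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: True means "R(s, t) holds" for the candidate pair.
actual = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False]
score = f_measure(predicted, actual)
print(f"{score:.2%}")
```

Because true negatives do not enter the formula, the many non-matching pairs in S × T cannot inflate the score.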



Results/2

Computation runtimes

Classifier              D1      D2      D3     D4       D5         D6
Linear SVM              7.16    6.93    2.67   63.94    484.29     75.44
Linear SMO              17.07   12.93   3.77   113.40   369.20     37.16
Polynomial-3 SVM        5.67    6.18    2.63   162.82   1,091.10   103.89
Multilayer Perceptron   15.13   16.10   3.40   96.96    376.26     41.68
Logistic Regression     16.11   14.91   4.61   110.12   275.94     38.48
Linear Regression       16.04   16.21   5.02   120.54   497.43     44.50
Naïve Bayes             17.34   17.09   4.39   105.31   375.91     43.79
Decision Table          16.68   16.44   3.78   90.99    389.35     48.87
Random Tree             12.02   11.16   2.24   53.67    347.36     34.11
J48                     21.31   15.96   6.99   131.57   98.27      38.46

All values in seconds.



Results/3

Considerations

Some average trends can be suggested, yet no algorithm outperforms all others significantly.

Multilayer Perceptrons performed best both including and excluding noisy datasets.

Random Trees seem the fastest approach overall.

The different approaches seem complementary in their behaviour.

Naïve Bayes might fail as it considers all features as independent from each other.
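
The independence assumption behind the last point can be written out as follows. Here x_1, ..., x_n stand for the similarity scores of a pair (s, t) and y for its class (link or no link); these symbols are introduced for illustration only.

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

Several similarity scores computed on the same property values (for instance, trigram and cosine similarity on the same strings) are likely to be correlated, so the product may count the same evidence more than once. This is one plausible reading of the low Naïve Bayes scores, not a result established in the evaluation.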



Results/4

Answers

Q1: Which of the paradigms achieves the best F-measures?

A1: Multilayer Perceptrons, Linear SVMs, Decision Tables.

Q2: Which of the paradigms is most robust against noise?

A2: Logistic Regression, Random Trees, Multilayer Perceptrons.

Q3: Which of the methods is the most time-efficient?

A3: Random Trees, however all approaches scale well.
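
One way to relate A1 and A2 to the F-measure table from Results/1 is to average the scores per classifier over the non-noisy datasets (D1-D4) and over the noisy ones (D5-D6). The sketch below does this; the averaging scheme and the helper name top3 are assumptions, since the slides do not state how the rankings were derived.

```python
# F-measures (%) for D1..D6, copied from the table in Results/1.
F_MEASURE = {
    "Linear SVM":            [99.40, 98.99, 97.75, 97.81, 27.06, 39.18],
    "Linear SMO":            [100.00, 98.73, 100.00, 92.58, 46.63, 31.39],
    "Polynomial-3 SVM":      [99.40, 93.76, 98.29, 97.67, 37.28, 31.69],
    "Multilayer Perceptron": [99.50, 99.50, 100.00, 97.43, 35.58, 43.49],
    "Logistic Regression":   [99.90, 98.12, 96.67, 97.71, 40.64, 41.92],
    "Linear Regression":     [99.30, 96.92, 100.00, 96.36, 37.06, 36.84],
    "Naive Bayes":           [97.75, 35.05, 95.19, 29.47, 2.92, 11.90],
    "Decision Table":        [97.98, 100.00, 100.00, 97.66, 42.44, 29.66],
    "Random Tree":           [97.45, 99.24, 89.89, 96.82, 39.38, 41.03],
    "J48":                   [99.50, 95.56, 98.29, 97.66, 44.28, 31.53],
}


def top3(columns):
    """Names of the three classifiers with the highest mean over the given columns."""
    columns = list(columns)

    def mean(name):
        return sum(F_MEASURE[name][c] for c in columns) / len(columns)

    return sorted(F_MEASURE, key=mean, reverse=True)[:3]


top_clean = top3(range(0, 4))  # D1-D4, the datasets without heavy noise
top_noisy = top3(range(4, 6))  # D5-D6, the datasets with a high level of noise
print(top_clean)
print(top_noisy)
```

Under this reading, the top three on D1-D4 are Multilayer Perceptron, Decision Table, and Linear SVM, and the top three on D5-D6 are Logistic Regression, Random Tree, and Multilayer Perceptron, which agrees with A1 and A2.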



Related Work

Time-efficient deduplication algorithms (PPJoin+, EDJoin, PassJoin, TrieJoin)

LIMES – Link Discovery Framework for Metric Spaces

Approaches for learning link specifications (HYPPO, HR3, EAGLE, ACIDS)

Dedicated efficient methods (RDF-AI, REEDED)

LinkLion – A Link Repository for the Web of Data

The SAIM interface

Other link discovery frameworks (SILK, LDIF)

Other machine learning frameworks (MARLIN, FEBRL, RAVEN)

Other blocking techniques (MultiBlock, KnoFuss)



Future Work

1 Integration of Multilayer Perceptrons into the LIMES framework.

2 Use of ensemble learning techniques.

3 Evaluation in a semi-supervised learning setting with little training data.

4 Evaluation using a larger number of similarity measures.

5 Incorporation of a component based on Statistical Relational Learning.



Web resources

Source code – Batch Learners Evaluation for Link Discovery
http://github.com/mommi84/BALLAD

Technical report – Batch Learners Evaluation for Link Discovery
http://mommi84.github.io/BALLAD

The OAEI 2010 Benchmark
http://oaei.ontologymatching.org/2010/benchmarks

The Benchmark for Entity Resolution, DBS Leipzig
http://goo.gl/bvWBjA

Weka – Data Mining Software in Java
http://www.cs.waikato.ac.nz/ml/weka

LibSVM – A Library for Support Vector Machines
http://www.csie.ntu.edu.tw/~cjlin/libsvm

LIMES – Link Discovery Framework for Metric Spaces
http://aksw.org/Projects/LIMES

LinkLion – A Link Repository for the Web of Data
http://www.linklion.org



Thank you for your attention.