
Post on 15-Apr-2017


Page 1: AINL 2016: Bastrakova, Ledesma, Millan, Zighed

Relational Machine Learning Author Disambiguation

E. Bastrakova†, R. Ledesma†, J. Millan†, F. Rico‡*, D. Zighed*

† Data Mining and Knowledge Management - Université Lumière Lyon II, Lyon, France
‡ Université Claude Bernard Lyon I, Lyon, France
* ERIC Lab - Université Lumière Lyon II, Lyon, France

Page 2: AINL 2016: Bastrakova, Ledesma, Millan, Zighed

Introduction

[Diagram: each article (Article 1, Article 2, ..., Article n) carries its article information and a list of signatures (Signature 1, Signature 2, ..., Signature n). Disambiguation groups signatures from different articles under one author, e.g. Author 1 = {A1 - Signature 2, A2 - Signature 1, ..., An - Signature 1}.]

Cluster = Disambiguated Author

Page 3: AINL 2016: Bastrakova, Ledesma, Millan, Zighed

Outline

1. Related Work

2. Source Data

3. Article / Complementing Features

4. Similarity Features

5. Author Disambiguation Workflow

6. Implementation

7. Results

8. Conclusions


Page 4: AINL 2016: Bastrakova, Ledesma, Millan, Zighed

Related Work

Manual Disambiguation

● ORCID (Open Researcher and Contributor ID) initiative: http://orcid.org/

Automatic methods

Author Assignment Techniques

Learn a model for a real author based on available information

● Assignment methods (large training set)

● Clustering methods (known number of authors)

Author Grouping Techniques

Use similarity measures to find close papers and estimate real authors

● Various similarity measures

● Graph-based similarity functions

● Ethnicity

Page 5: AINL 2016: Bastrakova, Ledesma, Millan, Zighed

Source Data

Articles’ information can be retrieved from any digital library, followed by an ETL process to fit our source data model.

Page 6: AINL 2016: Bastrakova, Ledesma, Millan, Zighed

Article Features

Features present in the source data and used in the disambiguation workflow

● Title

● Publication Year

● Keywords

● Referenced Journals

● Subjects

● Authors’ Names

Complementing Features

Complement the information of the source data by calculating extra features derived from it.

● Focus Name

● LDA Topic

● Ethnicity Information

Page 7: AINL 2016: Bastrakova, Ledesma, Millan, Zighed

Complementing Features - Focus Name

Simplified version of the last name of the author (Metaphone)

We avoid:

● Language complexity

● Spelling errors

● Transliteration errors

Further used to simplify clustering: each focus name is processed separately
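
A minimal sketch of the focus-name idea, assuming signatures arrive as (id, last name) tuples. The slide names Metaphone; the function below is a much cruder phonetic key written only for illustration (it is not Metaphone), and the names `focus_name` and `block_by_focus_name` are hypothetical.

```python
import unicodedata
from collections import defaultdict


def focus_name(last_name: str) -> str:
    """Crude phonetic key for a surname (an illustrative stand-in, NOT Metaphone)."""
    # Strip diacritics, e.g. "Millán" -> "Millan".
    ascii_name = (
        unicodedata.normalize("NFKD", last_name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    letters = [c for c in ascii_name.upper() if c.isalpha()]
    if not letters:
        return ""
    key = [letters[0]]
    for c in letters[1:]:
        if c in "AEIOUYHW":
            continue  # drop vowels and weak consonants after the first letter
        if c == key[-1]:
            continue  # collapse repeats left adjacent once vowels are gone
        key.append(c)
    return "".join(key)


def block_by_focus_name(signatures):
    """Group (signature_id, last_name) tuples by their focus name."""
    blocks = defaultdict(list)
    for signature_id, last_name in signatures:
        blocks[focus_name(last_name)].append(signature_id)
    return dict(blocks)


signatures = [(1, "Millán"), (2, "Millan"), (3, "Milan"), (4, "Ledesma")]
blocks = block_by_focus_name(signatures)
print(blocks)
```

Spelling and transliteration variants of one surname fall into the same block, so later steps only compare signatures inside a block.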

Page 8: AINL 2016: Bastrakova, Ledesma, Millan, Zighed

Complementing Features - LDA Topic

Latent Dirichlet Allocation (LDA)

Assigns each article to one of 8 topics

[Figure: common terms in the most frequent LDA topics: Medicine, Networking, Biology]

Page 9: AINL 2016: Bastrakova, Ledesma, Millan, Zighed

Complementing Features - Ethnicity

Estimated ethnicity based on the author’s surname.

Binary SVM classifier for each ethnicity (based on the US 2000 Census).

Ethnicities:

● White

● Black

● Asian

● Asian Pacific Islander

● 2 or more races

● Hispanic

G. Louppe, H. Al-Natsheh, M. Susik, and E. Maguire, “Ethnicity sensitive author disambiguation using semi-supervised learning.” CoRR, vol. abs/1508.07744, 2015.

Page 10: AINL 2016: Bastrakova, Ledesma, Millan, Zighed

Similarity Features

Description: compares the information between a pair of signatures in order to facilitate the disambiguation process.

Feature Name           Calculation
First Name Initial     Equality
Second Name Initial    Equality
LDA Topic              Equality
Publication Year       Absolute difference
Keywords               Jaccard Coefficient
References             Jaccard Coefficient
Subject                Jaccard Coefficient
Title                  Jaccard Coefficient
Coauthors              Jaccard Coefficient
Ethnicity              Jaccard Coefficient
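
The calculations listed for the similarity features can be sketched as follows. This is an illustration under assumptions, not the authors' code: the dictionary keys, the whitespace tokenisation of titles, and the choice to score two empty sets as 0 are all hypothetical.

```python
def jaccard(a, b) -> float:
    """Jaccard coefficient |A & B| / |A | B| of two collections."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty sets give no evidence of similarity
    return len(set_a & set_b) / len(union)


def tokens(text: str):
    """Lower-cased whitespace tokens (an assumed, very simple tokeniser)."""
    return text.lower().split()


def similarity_features(s1: dict, s2: dict) -> dict:
    """Compare two signatures field by field."""
    return {
        "eq_first_initial": int(s1["first_initial"] == s2["first_initial"]),
        "eq_second_initial": int(s1["second_initial"] == s2["second_initial"]),
        "eq_lda_topic": int(s1["lda_topic"] == s2["lda_topic"]),
        "diff_pub_year": abs(s1["year"] - s2["year"]),
        "keywords": jaccard(s1["keywords"], s2["keywords"]),
        "references": jaccard(s1["references"], s2["references"]),
        "subjects": jaccard(s1["subjects"], s2["subjects"]),
        "title": jaccard(tokens(s1["title"]), tokens(s2["title"])),
        "coauthors": jaccard(s1["coauthors"], s2["coauthors"]),
        "ethnicity": jaccard(s1["ethnicity"], s2["ethnicity"]),
    }


# Made-up signatures, for illustration only.
s1 = {
    "first_initial": "J", "second_initial": "A", "lda_topic": 3, "year": 2010,
    "keywords": {"clustering", "svm"}, "references": {"JMLR", "KDD"},
    "subjects": {"Computer Science"}, "title": "Author name disambiguation",
    "coauthors": {"Rico", "Zighed"}, "ethnicity": {"Hispanic"},
}
s2 = {
    "first_initial": "J", "second_initial": "", "lda_topic": 3, "year": 2013,
    "keywords": {"clustering", "lda", "svm"}, "references": {"KDD"},
    "subjects": {"Computer Science"},
    "title": "Author disambiguation in digital libraries",
    "coauthors": {"Zighed"}, "ethnicity": {"Hispanic"},
}
features = similarity_features(s1, s2)
print(features)
```

Each pair of signatures becomes one numeric feature vector, which is what a pairwise classifier consumes.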

Page 11: AINL 2016: Bastrakova, Ledesma, Millan, Zighed

Author Disambiguation Workflow

1. Signatures (+ Article Features: Title, Pub. Year, Keywords, Referenced Journals, Subjects, Authors’ Names): s1, s2, ..., sn.

2. Complementing Features: Focus Name, LDA Topic, Ethnicities.

3. Focus Name: each focus name is processed separately.

4. Cross Product Signatures: pairs of signatures (s1, s2).

5. Similarity Features: Eq. First Name, Eq. Middle Name, Eq. LDA Topic, Diff. Pub. Year, Dist. Keywords, Dist. References, Dist. Subject, Dist. Title, Dist. Coauthors, Dist. Ethnicities.

6. Machine Learning Models (SVM, LR, GB, RF): signatures belong to the same author or not; (s1, s2) = (%).

7. Hierarchical Clustering: signatures are clustered into disambiguated authors.

Cluster = Disambiguated Author
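
The pairing and clustering stages of the workflow can be sketched in a few lines. The slides do not state the linkage criterion or the stopping rule, so average linkage and a 0.5 probability threshold are assumptions here; the `prob` dictionary stands in for the output of a trained pairwise model, and all names and values are hypothetical.

```python
from itertools import combinations


def candidate_pairs(blocks):
    """Yield every unordered signature pair inside each focus-name block."""
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)


def cluster_block(ids, prob, threshold=0.5):
    """Average-linkage agglomerative clustering of one block.

    prob maps a pair (i, j) with i < j to P(same author).
    Merging stops once no two clusters reach the mean-probability threshold.
    """
    clusters = [[i] for i in sorted(ids)]

    def linkage(c1, c2):
        scores = [prob[(min(a, b), max(a, b))] for a in c1 for b in c2]
        return sum(scores) / len(scores)

    while len(clusters) > 1:
        # Find the two clusters with the highest mean pairwise probability.
        (x, y), best = max(
            (
                ((i, j), linkage(clusters[i], clusters[j]))
                for i, j in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]  # safe: x < y, so index x is unaffected
    return clusters


# Hypothetical blocks and model scores, for illustration only.
blocks = {"MLN": [1, 2, 3], "LDSM": [4]}
pairs = list(candidate_pairs(blocks))
prob = {(1, 2): 0.9, (1, 3): 0.2, (2, 3): 0.3}
authors = [c for ids in blocks.values() for c in cluster_block(ids, prob)]
print(pairs)
print(authors)
```

Restricting the cross product to one block at a time keeps the number of pairs manageable, since the pair count grows quadratically with block size.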

Page 12: AINL 2016: Bastrakova, Ledesma, Millan, Zighed

Implementation

Best of both worlds:

● RDBMS (PostgreSQL 9.5)

● Statistical software (R 3.2)

Source Data:

Articles retrieved from the Web of Science portal.

Manually Disambiguated Dataset:

1330 unique signatures (+ article information) of 236 real authors.

The implementation is publicly available at:

https://github.com/DMKM1517/author_disambiguation

Page 13: AINL 2016: Bastrakova, Ledesma, Millan, Zighed

Two-Step Validation Process

[Diagram: the Author Disambiguation Workflow, annotated with two validation points.]

● First Validation: the machine learning models (SVM, LR, GB, RF) that decide whether a pair of signatures belongs to the same author or not.

● Second Validation: the hierarchical clustering of signatures into disambiguated authors.

Page 14: AINL 2016: Bastrakova, Ledesma, Millan, Zighed

Results

First Validation: Pair of Signatures

Model  Accuracy  Precision  Recall  F1
RF     0.9671    0.9850     0.9772  0.9811
GB     0.9751    0.9938     0.9777  0.9857
LR     0.9756    0.9877     0.9843  0.9860
SVM    0.9447    0.9594     0.9782  0.9687

Second Validation: Clustering

Model  Method    Precision  Recall  F1
RF     Pairwise  0.9289     0.4645  0.6193
RF     B3        0.9274     0.7652  0.8385
GB     Pairwise  0.7218     0.9587  0.8236
GB     B3        0.7523     0.9496  0.8395
LR     Pairwise  0.7850     0.9231  0.8485
LR     B3        0.7913     0.9074  0.8454
SVM    Pairwise  0.5957     0.9128  0.7210
SVM    B3        0.6013     0.8795  0.7143
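
For readers unfamiliar with the two clustering measures, here is a sketch of pairwise and B3 precision, recall and F1, using the standard definitions. Representing a clustering as a mapping from signature id to label is an assumption, and the toy labels are made up; this is not the authors' evaluation code.

```python
from itertools import combinations


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 0.0 if total == 0 else 2 * precision * recall / total


def same_cluster_pairs(labels):
    """Set of unordered item pairs that share a label."""
    return {
        (a, b)
        for a, b in combinations(sorted(labels), 2)
        if labels[a] == labels[b]
    }


def pairwise_scores(predicted, truth):
    """Precision/recall over pairs placed in the same cluster."""
    pred_pairs = same_cluster_pairs(predicted)
    true_pairs = same_cluster_pairs(truth)
    hits = len(pred_pairs & true_pairs)
    p = hits / len(pred_pairs) if pred_pairs else 0.0
    r = hits / len(true_pairs) if true_pairs else 0.0
    return p, r, f1(p, r)


def b3_scores(predicted, truth):
    """B3: per-item precision/recall, averaged over all items."""

    def members(labels):
        groups = {}
        for item, label in labels.items():
            groups.setdefault(label, set()).add(item)
        return {item: groups[label] for item, label in labels.items()}

    pred, true = members(predicted), members(truth)
    items = list(truth)
    p = sum(len(pred[i] & true[i]) / len(pred[i]) for i in items) / len(items)
    r = sum(len(pred[i] & true[i]) / len(true[i]) for i in items) / len(items)
    return p, r, f1(p, r)


# Toy example: signatures 1-3 belong to author A, signature 4 to author B.
truth = {1: "A", 2: "A", 3: "A", 4: "B"}
predicted = {1: "c1", 2: "c1", 3: "c2", 4: "c2"}
pairwise = pairwise_scores(predicted, truth)
b3 = b3_scores(predicted, truth)
print(pairwise, b3)
```

The F1 values reported in the tables are consistent with the harmonic mean used in `f1`; for example, `f1(0.7850, 0.9231)` rounds to 0.8485, the pairwise F1 reported for LR.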

Page 15: AINL 2016: Bastrakova, Ledesma, Millan, Zighed

Results


Page 16: AINL 2016: Bastrakova, Ledesma, Millan, Zighed

Conclusions

Best model: Logistic Regression (first validation: F1 = 98.60%; second validation, pairwise: F1 = 84.85%).

Feature with the highest gain: Referenced Journals Distance.

The author's initials are also key features.

The LDA topic is highly relevant for classification (more so than the given subject).

Page 17: AINL 2016: Bastrakova, Ledesma, Millan, Zighed

Conclusions

Solution implemented on an RDBMS plus statistical software

Both are well accepted in the community and provide open source libraries.

Modular and scalable workflow.

Some ideas for future work

Calculate the distances with other methods (e.g. graphs for co-authorship/community detection)

Experiment with other algorithms and techniques (e.g. deep learning)

Include or discover new features (DOI, coauthors’ information, etc.)

Integrate user feedback to improve the solution

Page 18: AINL 2016: Bastrakova, Ledesma, Millan, Zighed

Thank you