AINL 2016: Bastrakova, Ledesma, Millan, Zighed

Relational Machine Learning Author Disambiguation

E. Bastrakova†, R. Ledesma†, J. Millan†, F. Rico‡*, D. Zighed*

† Data Mining and Knowledge Management - Université Lumière Lyon II, Lyon, France
‡ Université Claude Bernard Lyon I, Lyon, France
* ERIC Lab - Université Lumière Lyon II, Lyon, France

Introduction

[Figure: each article (Article 1, Article 2, ..., Article n) carries its Article Information and a list of signatures (Signature 1, Signature 2, ..., Signature n). Signatures from different articles (e.g. A1 - Signature 2, A2 - Signature 1, ..., An - Signature 1) are grouped under Author 1, Author 2, ..., Author n. Cluster = Disambiguated Author.]

Outline

1. Related Work

2. Source Data

3. Article / Complementing Features

4. Similarity Features

5. Author Disambiguation Workflow

6. Implementation

7. Results

8. Conclusions


Related Work

Manual Disambiguation

● ORCID (Open Researcher and Contributor ID) initiative: http://orcid.org/

Automatic methods

Author Assignment Techniques

Learn a model for a real author based on available information

● Assignment methods (large training set)

● Clustering methods (known number of authors)

Author Grouping Techniques

Use similarity measures to find close papers and estimate real authors

● Various similarity measures

● Graph-based similarity functions

● Ethnicity

Source Data

Articles’ information can be retrieved from any digital library, followed by an ETL process to fit our source data model.

Article Features

Features present in the source data used for the disambiguation workflow

● Title

● Publication Year

● Keywords

● Referenced Journals

● Subjects

● Authors’ Names

Complementing Features

Complement the information of the source data by calculating extra features derived from it.

● Focus Name

● LDA Topic

● Ethnicity Information

Complementing Features - Focus Name

Simplified version of the author’s last name

Phonetic encoding: Metaphone

We avoid:

● Language complexity

● Spelling errors

● Transliteration errors

Also used to simplify clustering: each focus name is processed separately
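As an illustration of the idea, here is a minimal Python sketch of a focus-name style key. It is not Metaphone and not the authors' R/PostgreSQL implementation: it only strips accents, case, and punctuation, which is an assumption made to keep the example self-contained. The function name `focus_name` is hypothetical.

```python
import re
import unicodedata


def focus_name(last_name: str) -> str:
    """Return a simplified ASCII key for a surname (a stand-in, not Metaphone)."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", last_name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Keep only the letters a-z so hyphens, spaces and apostrophes do not split keys.
    # Letters that do not decompose to a-z (e.g. "ø") are simply dropped here.
    return re.sub(r"[^a-z]", "", no_marks.lower())


print(focus_name("Millán"))          # millan
print(focus_name("O'Connor-Smith"))  # oconnorsmith
```

A real phonetic encoding such as Metaphone would additionally merge spellings that sound alike, which this sketch does not attempt.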

Complementing Features - LDA Topic

Latent Dirichlet Allocation

Classifies each article into one of 8 topics

[Figure: common terms in the most frequent LDA topics. Panels: Medicine, Networking, Biology.]

Complementing Features - Ethnicity

Estimated ethnicity based on the author’s surname.

Binary SVM classifier for each ethnicity (based on the US 2000 Census).

Ethnicities:

● White

● Black

● Asian

● Asian Pacific Islander

● 2 or more races

● Hispanic

G. Louppe, H. Al-Natsheh, M. Susik, and E. Maguire, “Ethnicity sensitive author disambiguation using semi-supervised learning,” CoRR, vol. abs/1508.07744, 2015.

Similarity Features

Compare the information between a pair of signatures in order to facilitate the disambiguation process.

Feature Name           Calculation
First Name Initial     Equality
Second Name Initial    Equality
LDA Topic              Equality
Publication Year       Absolute difference
Keywords               Jaccard Coefficient
References             Jaccard Coefficient
Subject                Jaccard Coefficient
Title                  Jaccard Coefficient
Coauthors              Jaccard Coefficient
Ethnicity              Jaccard Coefficient
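The three calculations in the table (equality, absolute difference, Jaccard coefficient) can be sketched in Python as follows. The dictionary keys (`first_initial`, `title_words`, and so on) and the choice to return 0.0 for two empty sets are assumptions made for this example; the slides do not specify them, and the authors' own implementation is in R and PostgreSQL.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient |a & b| / |a | b|; 0.0 for two empty sets (assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_features(s1: dict, s2: dict) -> Dict[str, float]:
    """Build the pairwise feature vector for two signatures (hypothetical field names)."""
    features = {
        # Equality features: 1.0 when the values match, 0.0 otherwise.
        "eq_first_initial": float(s1["first_initial"] == s2["first_initial"]),
        "eq_second_initial": float(s1["second_initial"] == s2["second_initial"]),
        "eq_lda_topic": float(s1["lda_topic"] == s2["lda_topic"]),
        # Absolute difference between publication years.
        "diff_year": float(abs(s1["year"] - s2["year"])),
    }
    # Set-valued fields are compared with the Jaccard coefficient.
    for field in ("keywords", "references", "subjects",
                  "title_words", "coauthors", "ethnicities"):
        features["jaccard_" + field] = jaccard(s1[field], s2[field])
    return features


sig_a = {"first_initial": "J", "second_initial": "", "lda_topic": 3, "year": 2012,
         "keywords": {"clustering", "svm"}, "references": {"BIOINFORMATICS"},
         "subjects": {"computer science"}, "title_words": {"author", "names"},
         "coauthors": {"ledesma r"}, "ethnicities": {"hispanic"}}
sig_b = {"first_initial": "J", "second_initial": "", "lda_topic": 3, "year": 2015,
         "keywords": {"clustering", "lda"}, "references": {"BIOINFORMATICS"},
         "subjects": {"computer science"}, "title_words": {"author", "topics"},
         "coauthors": set(), "ethnicities": {"hispanic"}}

for name, value in similarity_features(sig_a, sig_b).items():
    print(f"{name}: {value:.3f}")
```

Each pair of signatures yields one such vector, which is what a pairwise classifier would consume.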

Author Disambiguation Workflow

1. Signatures (+ Article Features): s1, s2, ..., sn. Article Features: Title, Pub. Year, Keywords, Referenced Journals, Subjects, Authors’ Names.

2. Complementing Features: Focus Name, LDA Topic, Ethnicities.

3. Focus Name: signatures are grouped by focus name.

4. Cross Product Signatures: (s1, s2, ..., sn) × (s1, s2, ..., sn) gives the Pairs of Signatures.

5. Similarity Features: Eq. First Name, Eq. Middle Name, Eq. LDA Topic, Diff. Pub. Year, Dist. Keywords, Dist. References, Dist. Subject, Dist. Title, Dist. Coauthors, Dist. Ethnicities.

6. Machine Learning Models (SVM, LR, GB, RF): signatures belong to the same author or not, s1 s2 = (%).

7. Hierarchical Clustering: signatures are clustered into disambiguated authors. Cluster = Disambiguated Author.
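The grouping, pairing, and clustering steps of the workflow could be sketched in Python roughly as below. Several things here are assumptions rather than the authors' method: pairs are generated as unordered combinations instead of a full cross product, the clustering is a single-linkage cut at a fixed threshold (the slides only say "hierarchical clustering"), and `same_author_prob` is a hypothetical stand-in for the trained classifier.

```python
from collections import defaultdict
from itertools import combinations
from typing import Callable, Dict, Iterable, List, Set, Tuple


def pairs_by_focus_name(
    signatures: Iterable[Tuple[str, str]]
) -> Dict[str, List[Tuple[str, str]]]:
    """Group (signature_id, focus_name) records and pair signatures within each group."""
    blocks: Dict[str, List[str]] = defaultdict(list)
    for sig_id, focus in signatures:
        blocks[focus].append(sig_id)
    # Only signatures sharing a focus name are ever compared.
    return {focus: list(combinations(sorted(ids), 2)) for focus, ids in blocks.items()}


def cluster(
    ids: List[str],
    same_author_prob: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[Set[str]]:
    """Single-linkage grouping: merge two signatures when their score reaches the threshold."""
    parent = {i: i for i in ids}

    def find(i: str) -> str:
        # Follow parent links to the group representative, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in combinations(ids, 2):
        if same_author_prob(a, b) >= threshold:
            parent[find(a)] = find(b)

    groups: Dict[str, Set[str]] = defaultdict(set)
    for i in ids:
        groups[find(i)].add(i)
    return sorted(groups.values(), key=lambda g: sorted(g))


records = [("s1", "millan"), ("s2", "millan"), ("s3", "millan"), ("s4", "zighed")]
pairs = pairs_by_focus_name(records)
print(pairs["millan"])

# Hypothetical classifier scores for the pairs inside the "millan" group.
scores = {("s1", "s2"): 0.93, ("s1", "s3"): 0.10, ("s2", "s3"): 0.20}
print(cluster(["s1", "s2", "s3"], lambda a, b: scores[(a, b)]))
```

Processing each focus name separately keeps the number of pairs quadratic only within a group rather than across the whole dataset.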

Implementation

Best of two worlds:

● RDBMS (PostgreSQL 9.5)

● Statistical Software (R 3.2)

Source Data:

Articles retrieved from the Web of Science portal.

Manually Disambiguated Dataset:

1330 unique signatures (+ article information) of 236 real authors.

The implementation is publicly available at:

https://github.com/DMKM1517/author_disambiguation

Two-Step Validation Process

[Figure: the Author Disambiguation Workflow diagram, annotated with two validation points.]

● First Validation: at the Machine Learning Models step (SVM, LR, GB, RF), where pairs of signatures are classified as belonging to the same author or not.

● Second Validation: at the Hierarchical Clustering step, where signatures are clustered into disambiguated authors.

Results

First Validation: Pair of Signatures

Model   Accuracy   Precision   Recall   F1
RF      0.9671     0.9850      0.9772   0.9811
GB      0.9751     0.9938      0.9777   0.9857
LR      0.9756     0.9877      0.9843   0.9860
SVM     0.9447     0.9594      0.9782   0.9687

Second Validation: Clustering

Model   Method     Precision   Recall   F1
RF      Pairwise   0.9289      0.4645   0.6193
RF      B3         0.9274      0.7652   0.8385
GB      Pairwise   0.7218      0.9587   0.8236
GB      B3         0.7523      0.9496   0.8395
LR      Pairwise   0.7850      0.9231   0.8485
LR      B3         0.7913      0.9074   0.8454
SVM     Pairwise   0.5957      0.9128   0.7210
SVM     B3         0.6013      0.8795   0.7143
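For readers unfamiliar with the two clustering scores, the following Python sketch computes pairwise and B3 (B-cubed) precision, recall, and F1 from a true and a predicted assignment of signatures to authors. It uses the standard textbook definitions, which the slides do not spell out, so it may differ in detail from the authors' evaluation code. The toy labels are invented for illustration.

```python
from itertools import combinations
from typing import Dict, Tuple


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def pairwise_scores(truth: Dict[str, str], pred: Dict[str, str]) -> Tuple[float, float, float]:
    """Score every unordered pair: is it placed together, and should it be?"""
    tp = fp = fn = 0
    for a, b in combinations(sorted(truth), 2):
        same_true = truth[a] == truth[b]
        same_pred = pred[a] == pred[b]
        tp += same_true and same_pred
        fp += same_pred and not same_true
        fn += same_true and not same_pred
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r, f1(p, r)


def b_cubed_scores(truth: Dict[str, str], pred: Dict[str, str]) -> Tuple[float, float, float]:
    """Average per-signature precision and recall over all signatures."""
    ids = sorted(truth)
    p_sum = r_sum = 0.0
    for a in ids:
        pred_cluster = {b for b in ids if pred[b] == pred[a]}
        true_cluster = {b for b in ids if truth[b] == truth[a]}
        overlap = len(pred_cluster & true_cluster)
        p_sum += overlap / len(pred_cluster)
        r_sum += overlap / len(true_cluster)
    p, r = p_sum / len(ids), r_sum / len(ids)
    return p, r, f1(p, r)


# Toy example: three signatures of author A, one of author B, split into two clusters.
truth = {"s1": "A", "s2": "A", "s3": "A", "s4": "B"}
pred = {"s1": "c1", "s2": "c1", "s3": "c2", "s4": "c2"}
print([round(x, 4) for x in pairwise_scores(truth, pred)])
print([round(x, 4) for x in b_cubed_scores(truth, pred)])

# The F1 column follows from precision and recall, e.g. the LR pairwise row.
print(round(f1(0.7850, 0.9231), 4))  # 0.8485
```

Pairwise scores weight large clusters heavily because the number of pairs grows quadratically with cluster size, whereas B3 averages per signature, which is one reason the two methods can disagree on the same clustering.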

Conclusions

● The best model: Logistic Regression (first validation: F1 = 98.60%; second validation: pairwise F1 = 84.85%).

● Feature with the highest gain: Referenced Journals Distance.

● The author’s initials are also key features.

● LDA topic is highly relevant for classification (more than the given subject).


● Solution implemented on RDBMS + Statistical Software. Both are well accepted in the community and provide open source libraries.

● Modular and scalable workflow.

Some ideas for future work

● Calculate the distances with other methods (e.g. graphs for co-authorship/community detection)

● Experiment with other algorithms and techniques (e.g. deep learning)

● Include or discover new features (DOI, coauthors’ information, etc.)

● Integrate user feedback to improve the solution


Thank you