AINL 2016: Bastrakova, Ledesma, Millan, Zighed
Relational Machine Learning Author Disambiguation
E. Bastrakova†, R. Ledesma†, J. Millan†, F. Rico‡*, D. Zighed*
† Data Mining and Knowledge Management - Université Lumière Lyon II, Lyon, France
‡ Université Claude Bernard Lyon I, Lyon, France
* ERIC Lab - Université Lumière Lyon II, Lyon, France
Introduction
[Figure: each article (Article 1, Article 2, ..., Article n) carries article information and a set of signatures (Signature 1, Signature 2, ..., Signature n). The signatures are grouped by real author (Author 1, Author 2, ..., Author n), so that each cluster corresponds to one disambiguated author.]
Outline
1. Related Work
2. Source Data
3. Article / Complementing Features
4. Similarity Features
5. Author Disambiguation Workflow
6. Implementation
7. Results
8. Conclusions
Related Work
Manual Disambiguation
● ORCID (Open Researcher and Contributor ID) initiative: http://orcid.org/
Automatic methods
Author Assignment Techniques
Learn a model for a real author based on available information
● Assignment methods (large training set)
● Clustering methods (known number of authors)
Author Grouping Techniques
Use similarity measures to find close papers and estimate real authors
● Various similarity measures
● Graph-based similarity functions
● Ethnicity
Source Data
Article information can be retrieved from any digital library and then passed through an ETL process to fit our source data model.
Complementing Features
Complement the information of the
source data by calculating extra
features derived from it.
● Focus Name
● LDA Topic
● Ethnicity Information
Article Features
Features present in the source data
used for the disambiguation workflow
● Title
● Publication Year
● Keywords
● Referenced Journals
● Subjects
● Authors’ Names
Complementing Features - Focus Name
Simplified version of the last name of
the author
We avoid:
● Language complexity
● Spelling errors
● Transliteration errors
Further used to simplify clustering: each focus name is processed separately
Phonetic encoding: Metaphone
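The idea of processing each focus name separately can be sketched as a blocking step. The snippet below is a minimal illustration in Python, not the authors' implementation: `focus_key` is a hypothetical, deliberately crude phonetic key used as a stand-in for Metaphone, and the `last_name` field is an assumed record layout.

```python
import unicodedata
from collections import defaultdict


def focus_key(last_name: str) -> str:
    """Crude phonetic key. NOT Metaphone: a simplified stand-in for illustration."""
    # Strip accents (e.g. "ü" -> "u") and keep ASCII letters only.
    ascii_name = (
        unicodedata.normalize("NFKD", last_name)
        .encode("ascii", "ignore")
        .decode()
    )
    letters = [c for c in ascii_name.upper() if c.isalpha()]
    if not letters:
        return ""
    key = [letters[0]]  # always keep the first letter
    for c in letters[1:]:
        if c in "AEIOUYHW":
            continue  # drop vowels and weak consonants after the first letter
        if c != key[-1]:
            key.append(c)  # skip a consonant identical to the last kept one
    return "".join(key)


def block_by_focus_name(signatures):
    """Group signatures so that each focus name can be processed separately."""
    blocks = defaultdict(list)
    for sig in signatures:
        blocks[focus_key(sig["last_name"])].append(sig)
    return dict(blocks)
```

With this stand-in key, spelling variants such as "Müller" and "Muller" fall into the same block, which is the property the focus name is meant to provide. A real run would swap in an actual Metaphone implementation.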
Complementing Features - LDA Topic
Latent Dirichlet Allocation (LDA)
Assigns each article to one of 8 topics
[Figure: common terms in the most frequent LDA topics (Medicine, Networking, Biology)]
Complementing Features - Ethnicity
Estimated ethnicity based on the author’s surname.
Binary SVM classifier for each ethnicity (based on the US 2000 Census).
Ethnicities:
● White
● Black
● Asian
● Asian Pacific Islander
● 2 or more races
● Hispanic
G. Louppe, H. Al-Natsheh, M. Susik, and E. Maguire, “Ethnicity sensitive author disambiguation using semi-supervised learning.” CoRR, vol. abs/1508.07744, 2015.
Similarity Features
Compares the information between a pair of signatures in order to facilitate the disambiguation process.
Feature Name          Calculation Description
First Name Initial    Equality
Second Name Initial   Equality
LDA Topic             Equality
Publication Year      Absolute difference
Keywords              Jaccard Coefficient
References            Jaccard Coefficient
Subject               Jaccard Coefficient
Title                 Jaccard Coefficient
Coauthors             Jaccard Coefficient
Ethnicity             Jaccard Coefficient
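As a rough sketch of how these ten features could be computed for one pair of signatures, the Python snippet below applies equality, absolute difference and the Jaccard coefficient as listed in the table. The field names, the whitespace tokenization of titles and the handling of two empty sets are assumptions made for illustration, not details taken from the slides.

```python
def jaccard(a, b) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumption: two empty sets carry no evidence of similarity
    return len(a & b) / len(a | b)


def similarity_features(s1: dict, s2: dict) -> dict:
    """Feature vector for one pair of signatures (field names are hypothetical)."""
    return {
        "eq_first_initial": int(s1["first_initial"] == s2["first_initial"]),
        "eq_second_initial": int(s1["second_initial"] == s2["second_initial"]),
        "eq_lda_topic": int(s1["lda_topic"] == s2["lda_topic"]),
        "diff_pub_year": abs(s1["year"] - s2["year"]),
        "keywords": jaccard(s1["keywords"], s2["keywords"]),
        "references": jaccard(s1["references"], s2["references"]),
        "subject": jaccard(s1["subjects"], s2["subjects"]),
        # Assumption: titles are compared as sets of lowercase whitespace tokens.
        "title": jaccard(s1["title"].lower().split(), s2["title"].lower().split()),
        "coauthors": jaccard(s1["coauthors"], s2["coauthors"]),
        "ethnicity": jaccard(s1["ethnicities"], s2["ethnicities"]),
    }
```

If a distance is preferred over a coefficient, `1 - jaccard(a, b)` gives the corresponding Jaccard distance.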
Author Disambiguation Workflow
1. Signatures (+ Article Features): s1, s2, ..., sn
   • Title • Pub. Year • Keywords • Referenced Journals • Subjects • Authors’ Names
2. Complementing Features: • Focus Name • LDA Topic • Ethnicities
3. Focus Name: within each focus name, the cross product of the signatures (s1, s2, ..., sn) × (s1, s2, ..., sn) yields the pairs of signatures.
4. Similarity Features: • Eq. First Name • Eq. Middle Name • Eq. LDA Topic • Diff. Pub. Year • Dist. Keywords • Dist. References • Dist. Subject • Dist. Title • Dist. Coauthors • Dist. Ethnicities
5. Machine Learning Models (SVM, LR, GB, RF): decide whether a pair of signatures belongs to the same author or not, (s1, s2) = (%).
6. Hierarchical Clustering: signatures are clustered into disambiguated authors. Cluster = Disambiguated Author
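The last stages of the workflow (pairing signatures inside one focus name, scoring each pair, then clustering) might look roughly like the Python sketch below. It is an illustration under stated assumptions: `same_author_prob` is a hypothetical callable standing in for a trained classifier, and the average linkage and the 0.5 cut-off are choices made here, since the slides do not specify either.

```python
from itertools import combinations


def candidate_pairs(block):
    """Unordered pairs inside one focus-name block.

    This is the cross product of the block with itself, without
    self-pairs and without mirrored duplicates.
    """
    return list(combinations(block, 2))


def cluster_block(block, same_author_prob, threshold=0.5):
    """Agglomerative clustering on distance = 1 - P(same author).

    Assumptions: average linkage, and merging stops once the closest
    two clusters are further apart than `threshold`.
    """
    clusters = [[s] for s in block]  # start with one cluster per signature

    def avg_dist(c1, c2):
        total = sum(1.0 - same_author_prob(a, b) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters (i < j always holds here).
        (i, j), dist = min(
            (
                ((i, j), avg_dist(clusters[i], clusters[j]))
                for i, j in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if dist > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters  # each cluster = one disambiguated author
```

Because every focus name is handled on its own, the quadratic number of pairs is bounded by the size of the largest block rather than by the size of the whole dataset.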
Implementation
Best of both worlds:
RDBMS (PostgreSQL 9.5)
Statistical Software (R 3.2)
Source Data:
Articles retrieved from the Web of Science portal.
Manually Disambiguated Dataset:
1330 unique signatures (+ article information) of 236 real authors.
The implementation is publicly available at:
https://github.com/DMKM1517/author_disambiguation
Two-Step Validation Process
[Figure: the author disambiguation workflow annotated with two validation points. First Validation: the classification of pairs of signatures. Second Validation: the clustering into disambiguated authors.]
Results
First Validation: Pair of Signatures
Model  Accuracy  Precision  Recall  F1
RF     0.9671    0.9850     0.9772  0.9811
GB     0.9751    0.9938     0.9777  0.9857
LR     0.9756    0.9877     0.9843  0.9860
SVM    0.9447    0.9594     0.9782  0.9687
Second Validation: Clustering
Model  Method    Precision  Recall  F1
RF     Pairwise  0.9289     0.4645  0.6193
RF     B3        0.9274     0.7652  0.8385
GB     Pairwise  0.7218     0.9587  0.8236
GB     B3        0.7523     0.9496  0.8395
LR     Pairwise  0.7850     0.9231  0.8485
LR     B3        0.7913     0.9074  0.8454
SVM    Pairwise  0.5957     0.9128  0.7210
SVM    B3        0.6013     0.8795  0.7143
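For readers unfamiliar with the two clustering measures, the sketch below shows one common way to compute pairwise and B3 precision, recall and F1 in Python. It follows the standard definitions of these metrics and is not taken from the authors' code; both inputs are assumed to be non-empty dictionaries mapping every signature id to a cluster label.

```python
from itertools import combinations


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def pairwise_scores(truth: dict, pred: dict):
    """Pairwise metrics: every pair of signatures is one decision."""
    tp = fp = fn = 0
    for a, b in combinations(sorted(truth), 2):
        same_true = truth[a] == truth[b]
        same_pred = pred[a] == pred[b]
        tp += same_true and same_pred
        fp += same_pred and not same_true
        fn += same_true and not same_pred
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r, f1(p, r)


def b3_scores(truth: dict, pred: dict):
    """B3 metrics: per-signature precision and recall, averaged over signatures."""
    ids = list(truth)  # assumed non-empty

    def members(labels, x):
        return {y for y in ids if labels[y] == labels[x]}

    p = r = 0.0
    for x in ids:
        true_cluster, pred_cluster = members(truth, x), members(pred, x)
        overlap = len(true_cluster & pred_cluster)
        p += overlap / len(pred_cluster)
        r += overlap / len(true_cluster)
    p, r = p / len(ids), r / len(ids)
    return p, r, f1(p, r)
```

The `f1` helper also reproduces the reported table values from their precision and recall; for example, `f1(0.9850, 0.9772)` rounds to 0.9811, the first-validation F1 of the RF model.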
Conclusions
Best model: Logistic Regression (first validation: F1 = 98.60%; second validation: F1 = 84.85%).
Feature with the highest gain: Referenced Journals Distance.
The author's initials are also key features.
The LDA topic is highly relevant for classification (more than the given subject).
Solution implemented on RDBMS + Statistical Software
Both of them are well-accepted in the community and provide open source libraries.
Modular and scalable Workflow.
Some ideas for future work:
Calculate the distances with other methods (e.g. graphs for co-authorship/community detection)
Experiment with other algorithms and techniques (e.g. deep learning)
Include or discover new features (DOI, coauthors’ information, etc.)
Integrate user feedback to improve the solution
Thank you