kolawole john adebayo, luigi di caro and guido boella | a supervised keyphrase extraction system

A SUPERVISED KEYPHRASE EXTRACTION SYSTEM

Semantics 2016, Leipzig

Kolawole J. Adebayo, Luigi Di Caro, Guido Boella

Outline

• Introduction
• Related Works
• Methodology
• Experiments
• Conclusions

Introduction

• Keyphrases
– What?
– Why?

• Document Indexing, Document Summarization, Clustering and Visualization

• Keyphrase Assignment vs Keyphrase Extraction
• Unsupervised vs Supervised
• Classification vs Ranking

Introduction

Semantic Features

Supervised KeyPhrase Extraction, Keyphrase, Keyword

Related Works

Work | Classification | Features | Algorithm
Witten et al (1999) | Statistical | TF, TFIDF, length (bi or tri), first occurrence, node degree etc. | Naïve Bayes
P. Turney (2000) | Statistical | phrase frequency, position, TF, TFIDF, n-gram overlap, etc. | C4.5, GenEx
Hulth (2003) | Linguistic | Lexical and syntactic features | -
Mihalcea and Tarau (2004) | Graph Based | Unsupervised | TextRank
Medelyan et al (2010) | Graph Based | Statistical, lexical, syntactic features | MAUI

Methodology

Training Document

Select Candidate

Extract Feature for Candidates

Combine features with

Classifier

Training Document

Select Candidate

Extract Feature for Candidates

Predictor

Extracted Keyphrases

Methodology

• Candidate Selection
– Extracts n-grams (range = 1-4) that do not start or end with a stopword
– Candidates should not be proper nouns
– Candidates should not end with an adjective
– Candidates may start with an abbreviation
– Verbs are down-weighted
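The selection rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stopword set, the toy POS lookup `TOY_POS`, the `VERB_WEIGHT` factor, and the reading of "should not be proper nouns" as "not made up entirely of proper nouns" are all hypothetical stand-ins, not details taken from the slides.

```python
from typing import Dict, List, Tuple

# Hypothetical toy resources; a real system would use a full stopword list
# and a trained POS tagger.
STOPWORDS = {"a", "an", "the", "of", "in", "for", "and", "is", "to", "on"}
TOY_POS: Dict[str, str] = {
    "leipzig": "PROPN", "extraction": "NOUN", "keyphrase": "NOUN",
    "supervised": "ADJ", "system": "NOUN", "extracts": "VERB",
}
VERB_WEIGHT = 0.5  # assumed down-weighting factor, not from the slides


def pos_of(token: str) -> str:
    """Look up a coarse POS tag; unknown words default to NOUN."""
    return TOY_POS.get(token.lower(), "NOUN")


def select_candidates(tokens: List[str], max_n: int = 4) -> List[Tuple[str, float]]:
    """Return (candidate, weight) pairs for n-grams of length 1..max_n."""
    candidates = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = [t.lower() for t in tokens[i:i + n]]
            # Rule: must not start or end with a stopword.
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            tags = [pos_of(t) for t in gram]
            # Rule (as interpreted here): reject pure proper-noun candidates.
            if all(tag == "PROPN" for tag in tags):
                continue
            # Rule: must not end with an adjective.
            if tags[-1] == "ADJ":
                continue
            # Rule: candidates containing a verb are down-weighted, not dropped.
            weight = VERB_WEIGHT if "VERB" in tags else 1.0
            candidates.append((" ".join(gram), weight))
    return candidates


tokens = "The supervised system extracts keyphrase in Leipzig".split()
selected = dict(select_candidates(tokens))
```

Here `selected` keeps "supervised system" at full weight, keeps "extracts" at the reduced weight, and drops "leipzig" and "supervised".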

Methodology

Category | Description
Statistical | TF, TFIDF, Keyphrase Length
Positional | First and last point of appearance; geographical spread, e.g., upper section, mid section and lower section; also key candidates' span
Lexical | NP, NE, N-grams
Semantic | Wikipedia lookup (frequency in Wikipedia), whether it has a Wikipedia page, in-/out-link frequency on the Wikipedia page
Semantic | LDA topic count (T=50)
Semantic | Candidate similarity to POS-filtered words (Proper Nouns, Verbs and Adjectives)
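As a rough illustration of the statistical and positional rows, the sketch below computes TF, a TF-IDF score, phrase length, relative first and last occurrence, and span for one candidate. The function names, the smoothed IDF formula, and the normalisation by document length are assumptions made for this example; the slides do not specify them.

```python
import math
from typing import Dict, List


def find_positions(tokens: List[str], phrase: List[str]) -> List[int]:
    """Token offsets at which the phrase occurs in the token list."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


def candidate_features(candidate: str, doc: List[str],
                       corpus: List[List[str]]) -> Dict[str, float]:
    """Statistical and positional features for one candidate phrase."""
    phrase = candidate.split()
    positions = find_positions(doc, phrase)
    tf = len(positions) / max(len(doc), 1)
    # Document frequency over the corpus; smoothed IDF is an assumed variant.
    df = sum(1 for d in corpus if find_positions(d, phrase))
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1.0
    # Relative positions in [0, 1]; 1.0 is used when the phrase is absent.
    first = positions[0] / len(doc) if positions else 1.0
    last = positions[-1] / len(doc) if positions else 1.0
    return {
        "tf": tf,
        "tfidf": tf * idf,
        "length": float(len(phrase)),
        "first_pos": first,
        "last_pos": last,
        "span": last - first,
    }


doc = "keyphrase extraction is useful because keyphrase extraction helps indexing".split()
corpus = [doc, "some other text".split()]
feats = candidate_features("keyphrase extraction", doc, corpus)
```

The resulting dictionary could be concatenated with lexical and semantic features to form the vector passed to the classifier.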

Methodology

POS filtered n-grams (2,3,4)

Candidate keyphrase

Embedding Similarity
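One plausible way to score a candidate against the POS-filtered n-grams is cosine similarity between averaged word vectors. The sketch below assumes this; the tiny `TOY_VECTORS` table, the averaging step, and taking the maximum over n-grams are illustrative choices, not details confirmed by the slides.

```python
import math
from typing import Dict, List

# Hypothetical 3-dimensional embeddings; a real system would load
# pre-trained word vectors instead.
TOY_VECTORS: Dict[str, List[float]] = {
    "keyphrase": [0.9, 0.1, 0.0],
    "extraction": [0.8, 0.2, 0.1],
    "supervised": [0.1, 0.9, 0.0],
    "learning": [0.2, 0.8, 0.1],
    "banana": [0.0, 0.1, 0.9],
}
DIM = 3


def phrase_vector(phrase: str) -> List[float]:
    """Average the vectors of known words; zero vector if none are known."""
    vecs = [TOY_VECTORS[w] for w in phrase.lower().split() if w in TOY_VECTORS]
    if not vecs:
        return [0.0] * DIM
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def embedding_similarity(candidate: str, ngrams: List[str]) -> float:
    """Highest cosine similarity between the candidate and any n-gram."""
    cand_vec = phrase_vector(candidate)
    return max((cosine(cand_vec, phrase_vector(g)) for g in ngrams), default=0.0)


close = embedding_similarity("keyphrase extraction", ["keyphrase extraction"])
related = embedding_similarity("keyphrase extraction", ["supervised learning"])
distant = embedding_similarity("keyphrase extraction", ["banana"])
```

With these toy vectors, `close` is higher than `related`, which in turn is higher than `distant`.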

Results

Features | Dataset | Precision | Recall | F-Measure
Medelyan et al (2010) | Marujo | 49.4 | - | -
Marujo et al (2013) | Marujo | 55.4 | - | -
All-features | Marujo | 58.3 | 42.0 | 48.8
Selected-features | Marujo | 48.7 | 36.5 | 41.7

Table 1: Evaluation result on Marujo dataset
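For reference, the F-Measure column is the harmonic mean of precision and recall. The sketch below shows the computation and checks it against the two full rows of Table 1; the exact-match comparison in `precision_recall_f` is an assumption for illustration, since the slides do not say how extracted and gold keyphrases were matched.

```python
from typing import Iterable, Tuple


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1); 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(predicted: Iterable[str],
                       gold: Iterable[str]) -> Tuple[float, float, float]:
    """Precision, recall and F1 using exact string match (an assumption)."""
    pred_set, gold_set = set(predicted), set(gold)
    true_pos = len(pred_set & gold_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    return precision, recall, f_measure(precision, recall)


# Rows of Table 1 that report both precision and recall.
all_features_f = round(f_measure(58.3, 42.0), 1)       # 48.8
selected_features_f = round(f_measure(48.7, 36.5), 1)  # 41.7

p, r, f = precision_recall_f(["a", "b", "c", "d"], ["a", "b", "e"])
```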


Features | Dataset | Precision | Recall | F-Measure
Selected Features | Combined | 29.9 | 20.3 | 16.9
Selected Features | Reader | 26.4 | 17.1 | 20.7
All-features | Combined | 32.7 | 21.0 | 25.5
All-features | Reader | 30.2 | 18.1 | 22.6

Table 2: Evaluation result on SemEval dataset


Features | Dataset | Precision | Recall | F-Measure
(2,5,6,7,8,9) | Combined | 32.1 | 20.6 | 25.0
(1,2,5,7,8,9) | Combined | 31.8 | 20.1 | 24.7
(2,4,5,7,8,9) | Combined | 30.2 | 17.7 | 22.3
(3,4,6,7,8,9) | Combined | 27.4 | 16.3 | 20.4

Table 3: Ablation test on SemEval dataset

Good or Bad?

Supervised Keyphrase Extraction, Keyphrase Extraction system, supervised machine learning, Random Forest algorithm, Feature Engineering, Candidate Word, Keyphrase Extraction, Behavioural sciences, supervised classification, Keyphrase overlap

References

• A. Hulth. Improved automatic keyword extraction given more linguistic knowledge. In Proceedings of the 2003 conference on Empirical methods in natural language processing, pages 216-223. Association for Computational Linguistics, 2003.
• S. N. Kim, O. Medelyan, M.-Y. Kan, and T. Baldwin. Semeval-2010 task 5: Automatic keyphrase extraction from scientific articles. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 21-26. Association for Computational Linguistics, 2010.
• L. Marujo, A. Gershman, J. Carbonell, R. Frederking, and J. P. Neto. Supervised topical key phrase extraction of news stories using crowdsourcing, light filtering and co-reference normalization. arXiv preprint arXiv:1306.4886, 2013.
• P. Turney. Learning to extract keyphrases from text. 1999.
• X. Jiang, Y. Hu, and H. Li. A ranking approach to keyphrase extraction. 2010.
• T. D. Nguyen and M.-Y. Kan. Keyphrase extraction in scientific publications. In Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers, pages 317-326. Springer, 2007.
• I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, and C. G. Nevill-Manning. Kea: Practical automatic keyphrase extraction. In Proceedings of the fourth ACM conference on Digital libraries, pages 254-255. ACM, 1999.
• R. Mihalcea and P. Tarau. Textrank: Bringing order into texts. Association for Computational Linguistics, 2004.

Conclusions

• Many Thanks For The Attention!!!
