Kolawole John Adebayo, Luigi Di Caro and Guido Boella | A Supervised Keyphrase Extraction System
A SUPERVISED KEYPHRASE EXTRACTION SYSTEM
Semantics 2016, Leipzig
Kolawole J. Adebayo, Luigi Di Caro, Guido Boella
Outline
• Introduction
• Related Works
• Methodology
• Experiments
• Conclusions
Introduction
• Keyphrases
– What?
– Why?
• Document Indexing, Document Summarization, Clustering and Visualization
• Keyphrase Assignment vs Keyphrase Extraction
• Unsupervised vs Supervised
• Classification vs Ranking
Introduction
Semantic Features
Supervised KeyPhrase Extraction, Keyphrase, Keyword
Related Works
Work | Classification | Features | Algorithm
Witten et al (1999) | Statistical | TF, TFIDF, length (bi or tri), first occurrence, node degree etc. | Naïve Bayes
P. Turney (2000) | Statistical | Phrase frequency, position, TF, TFIDF, n-gram overlap, etc. | C4.5, GenEx
Hulth (2003) | Linguistic | Lexical and syntactic features | -
Mihalcea and Tarau (2004) | Graph Based | Unsupervised | TextRank
Medelyan et al (2010) | Graph Based | Statistical, lexical, syntactic features | MAUI
Methodology
Training Document
Select Candidate
Extract Feature for Candidates
Combine features with
Classifier
Training Document
Select Candidate
Extract Feature for Candidates
Predictor
Extracted Keyphrases
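The two flows above (train a classifier on candidate features, then apply it to the candidates of a new document) can be sketched roughly as follows. This is only an illustration: the nearest-centroid classifier is a stand-in chosen because it needs no external library, not the classifier used in the talk, and the feature values, the labels and the names `train_centroids` and `predict` are all made up. The sketch also assumes that the features are combined with keyphrase labels during training, which the slide label does not state in full.

```python
def train_centroids(feature_rows, labels):
    """Fit a nearest-centroid classifier: one mean feature vector per label."""
    sums, counts = {}, {}
    for row, label in zip(feature_rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, row):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(centroids, key=lambda label: dist(centroids[label]))


# Training phase: one feature vector per candidate, plus a label (1 = keyphrase).
# The two columns stand for hypothetical [tf, first_position] features.
train_X = [[0.20, 0.05], [0.15, 0.10], [0.01, 0.80], [0.02, 0.90]]
train_y = [1, 1, 0, 0]
model = train_centroids(train_X, train_y)

# Prediction phase: the same feature extraction is applied to new candidates.
new_candidates = {
    "keyphrase extraction": [0.18, 0.08],
    "many thanks": [0.01, 0.95],
}
extracted = [c for c, feats in new_candidates.items() if predict(model, feats) == 1]
print(extracted)
```

Any classifier exposing a fit step and a predict step could take the place of the two helper functions without changing the surrounding flow.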
Methodology
• Candidate Selection
– Extracts n-grams (range = 1-4) that do not start or end with a stopword
– Candidates should not be proper nouns
– Candidates should not end with an adjective
– Candidates may start with an abbreviation
– Verbs are down-weighted
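A minimal sketch of the first selection rule (n-grams of length 1 to 4 that neither start nor end with a stopword) might look like this. The stopword list, the tokenizer and the function name `candidate_ngrams` are assumptions for illustration. The part-of-speech rules (proper nouns, adjectives, abbreviations, verb down-weighting) are left out because they need a POS tagger, and sentence boundaries are ignored for brevity.

```python
import re

# Tiny illustrative stopword list (an assumption; the slides do not name one).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "for", "and", "or", "to",
             "is", "are", "with"}


def candidate_ngrams(text, min_n=1, max_n=4):
    """Return unique n-grams (min_n..max_n tokens) that neither start nor
    end with a stopword, in order of first appearance per n-gram length."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    candidates, seen = [], set()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue  # rule: no stopword at either edge
            phrase = " ".join(gram)
            if phrase not in seen:
                seen.add(phrase)
                candidates.append(phrase)
    return candidates


cands = candidate_ngrams("Extraction of keyphrases is useful for document indexing")
print(cands)
```

Stopwords are still allowed inside a candidate, so "extraction of keyphrases" is kept while "extraction of" is dropped.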
Methodology
Category | Description
Statistical | TF, TFIDF, keyphrase length
Positional | First and last point of appearance; geographical spread, e.g., upper section, mid section and lower section; also key candidates' span
Lexical | NP, NE, n-grams
Semantic | Wikipedia lookup (frequency in Wikipedia), whether it has a Wikipedia page, in-/out-link frequency on the Wikipedia page
Semantic | LDA topic count (T=50)
Semantic | Candidate similarity to POS-filtered words (proper nouns, verbs and adjectives)
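The statistical and positional rows of the table could be computed along the following lines. This is a sketch under assumptions: the slides do not give exact formulas, so the smoothed IDF, the normalisation of positions by document length, the definition of span as last minus first position, and the split of the document into thirds are illustrative choices, as are all function and key names.

```python
import math


def occurrences(tokens, phrase_tokens):
    """Return the start indices at which phrase_tokens occurs in tokens."""
    n = len(phrase_tokens)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase_tokens]


def candidate_features(candidate, doc_tokens, corpus):
    """Toy statistical and positional features for one candidate.

    corpus is a list of token lists that includes the current document.
    """
    phrase = candidate.split()
    hits = occurrences(doc_tokens, phrase)
    size = max(len(doc_tokens), 1)
    tf = len(hits) / size
    df = sum(1 for doc in corpus if occurrences(doc, phrase))
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1.0  # smoothed IDF (assumption)
    first = hits[0] / size if hits else 1.0
    last = hits[-1] / size if hits else 1.0
    # Which third of the document (upper, mid, lower) contains the candidate.
    thirds = {min(2, 3 * i // size) for i in hits}
    return {
        "tf": tf,
        "tfidf": tf * idf,
        "length": len(phrase),
        "first_pos": first,
        "last_pos": last,
        "span": last - first,
        "in_thirds": tuple(k in thirds for k in range(3)),
    }


doc = "keyphrase extraction helps indexing and keyphrase extraction helps search".split()
corpus = [doc, "topic models for text".split()]
feats = candidate_features("keyphrase extraction", doc, corpus)
print(feats)
```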
Methodology
POS filtered n-grams (2,3,4)
Candidate keyphrase
Embedding Similarity
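One way to read this slide is that each candidate keyphrase is compared, in embedding space, with the POS-filtered n-grams of the document. The sketch below assumes averaged word vectors, cosine similarity and a mean over all n-grams. None of these choices is stated on the slide, and the three-dimensional vectors are invented for illustration; a real system would load pretrained embeddings.

```python
import math

# Toy 3-dimensional word vectors, made up for illustration only.
TOY_VECTORS = {
    "keyphrase": [0.9, 0.1, 0.0],
    "extraction": [0.8, 0.2, 0.1],
    "keyword": [0.85, 0.15, 0.05],
    "football": [0.0, 0.1, 0.9],
}


def phrase_vector(phrase, vectors):
    """Average the vectors of the known words; None if no word is known."""
    known = [vectors[w] for w in phrase.split() if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embedding_similarity(candidate, filtered_ngrams, vectors=TOY_VECTORS):
    """Mean cosine similarity between a candidate and the POS-filtered n-grams."""
    cand_vec = phrase_vector(candidate, vectors)
    if cand_vec is None:
        return 0.0
    sims = []
    for ngram in filtered_ngrams:
        vec = phrase_vector(ngram, vectors)
        if vec is not None:
            sims.append(cosine(cand_vec, vec))
    return sum(sims) / len(sims) if sims else 0.0


sim_related = embedding_similarity("keyphrase extraction", ["keyword extraction"])
sim_unrelated = embedding_similarity("football", ["keyword extraction"])
print(sim_related, sim_unrelated)
```

Taking the maximum instead of the mean would be an equally plausible aggregation; the slide does not say which is used.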
Results
Features | Dataset | Precision | Recall | F-Measure
Medelyan et al (2010) | Marujo | 49.4 | - | -
Marujo et al (2013) | Marujo | 55.4 | - | -
All-features | Marujo | 58.3 | 42.0 | 48.8
Selected-features | Marujo | 48.7 | 36.5 | 41.7
Table 1: Evaluation result on Marujo dataset
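For reference, the F-Measure column is the harmonic mean of precision and recall, F = 2PR / (P + R). A short sketch is given below; the `evaluate` helper assumes exact-match comparison between extracted and gold keyphrases, which the slides do not specify, and both function names are illustrative.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(extracted, gold):
    """Exact-match precision, recall and F-measure for one document."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


# All-features row of Table 1: P = 58.3, R = 42.0
print(round(f_measure(58.3, 42.0), 1))  # 48.8
```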
Results
Features | Dataset | Precision | Recall | F-Measure
Selected Features | Combined | 29.9 | 20.3 | 16.9
Selected Features | Reader | 26.4 | 17.1 | 20.7
All-features | Combined | 32.7 | 21.0 | 25.5
All-features | Reader | 30.2 | 18.1 | 22.6
Table 2: Evaluation result on Semeval dataset
Results
Features | Dataset | Precision | Recall | F-Measure
(2,5,6,7,8,9) | Combined | 32.1 | 20.6 | 25.0
(1,2,5,7,8,9) | Combined | 31.8 | 20.1 | 24.7
(2,4,5,7,8,9) | Combined | 30.2 | 17.7 | 22.3
(3,4,6,7,8,9) | Combined | 27.4 | 16.3 | 20.4
Table 3: Ablation test on Semeval dataset
Good or Bad?
Supervised Keyphrase Extraction, Keyphrase Extraction system, supervised machine learning, Random Forest algorithm, Feature Engineering, Candidate Word, Keyphrase Extraction, Behavioural sciences, supervised classification, Keyphrase overlap
References
• A. Hulth. Improved automatic keyword extraction given more linguistic knowledge. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 216-223. Association for Computational Linguistics, 2003.
• S. N. Kim, O. Medelyan, M.-Y. Kan, and T. Baldwin. SemEval-2010 task 5: Automatic keyphrase extraction from scientific articles. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 21-26. Association for Computational Linguistics, 2010.
• L. Marujo, A. Gershman, J. Carbonell, R. Frederking, and J. P. Neto. Supervised topical key phrase extraction of news stories using crowdsourcing, light filtering and co-reference normalization. arXiv preprint arXiv:1306.4886, 2013.
• P. Turney. Learning to extract keyphrases from text. 1999.
• X. Jiang, Y. Hu, and H. Li. A ranking approach to keyphrase extraction. 2010.
• T. D. Nguyen and M.-Y. Kan. Keyphrase extraction in scientific publications. In Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers, pages 317-326. Springer, 2007.
• I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, and C. G. Nevill-Manning. KEA: Practical automatic keyphrase extraction. In Proceedings of the Fourth ACM Conference on Digital Libraries, pages 254-255. ACM, 1999.
• R. Mihalcea and P. Tarau. TextRank: Bringing order into texts. Association for Computational Linguistics, 2004.
Conclusions
• Many Thanks For The Attention!!!