
Keyphrase Extraction in Scientific Documents

Thuy Dung Nguyen and Min-Yen Kan

School of Computing

National University of Singapore

Slides available at http://wing.comp.nus.edu.sg/downloads/keyphraseCorpus


ICADL 2007 (Hanoi, Vietnam)


Keyphrases!

To think about: Are tags keyphrases?

Credits: Amazon.com, ACM.org, IMDB.com



Using Keyphrases in DLs

• Navigation

– Searching: Better weighting for terms

– Browsing and Linking: Finding similar documents

• Reading

– Highlighting

– Key Concepts

Helping to make the transition seamless between the two

Why are keyphrases important to digital libraries?

Genex



Related Work

Generation

– Kim and Wilbur – statistical properties of distribution

– Tomokiyo and Hurst – Phraseness model

Selection

– GenEx (Turney)

– Kea (Frank et al.): just 3 features: TF×IDF, position, corpus frequency

– Turney: selection not independent, use PMI

Assignment

– From Ontology (Medelyan & Witten), use graph features



Architecture

Key difference from previous works:

• Centered on scientific publications

• As such, adds two modules to capitalize on this limited domain

[Architecture diagram. Labels: Scientific publication; Plain text; HTML formatted output; Generic header mapping model; Keyphrase selection model; Keyphrases. Module boxes:]

Preprocessing:
- Sentence delimiting
- POS tagging
- Stemming

Candidate Identification:
- Simplex noun phrase detection

Basic Features:
- TF×IDF
- Position

Morphological Features:
- Suffix sequence
- POS sequence
- Acronym

Structural Features:
- Section distribution vector
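The two basic features can be sketched as follows. This is a minimal illustration under assumptions: the slides do not give the exact formulas, so the TF×IDF variant (relative term frequency times log2 of inverse document frequency) and the position feature (relative offset of the first occurrence) are common formulations, and the function names are hypothetical.

```python
import math

def tfidf(phrase_count, doc_length, doc_freq, num_docs):
    """TF×IDF for one candidate phrase (assumed formulation, not from the slides)."""
    tf = phrase_count / doc_length          # relative frequency in this document
    idf = math.log2(num_docs / doc_freq)    # rarer across the corpus -> larger
    return tf * idf

def first_position(tokens, phrase_tokens):
    """Relative offset (0..1) of the first occurrence of the phrase, or None."""
    n = len(phrase_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase_tokens:
            return i / len(tokens)
    return None

score = tfidf(phrase_count=5, doc_length=1000, doc_freq=10, num_docs=80)
offset = first_position(["a", "b", "c", "d"], ["c", "d"])
```

Candidates that are frequent in the document, rare in the corpus, and appear early tend to score well on these two features.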



1) Morphological Features

• POS tags (used in previous work; e.g., Genex)

– Used to identify candidates for simplex noun phrases (i.e., matching regex “(JJ|NN)* IN? NN”)

– Noun modifiers seem to be more productive than adjectival ones (e.g. “Additive”/NN vs. “Additional”/JJ)

• Suffixes

– Sequences on modifiers and headwords (e.g., -ic, -al, -ive on modifiers; -ion, -ics, -ment on headword)

– More fine grained than POS tagging
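The candidate pattern above can be applied to a tagged sentence roughly as follows. This is a sketch, not the authors' code: it assumes Penn Treebank tags, treats every tag starting with NN or JJ (e.g. NNS, JJR) as NN or JJ, and encodes each tag as one character so that regex match offsets line up with token indices. The function name is hypothetical.

```python
import re

def simplex_np_candidates(tagged):
    """tagged: list of (word, pos_tag) pairs. Returns matching phrases."""
    def code(tag):
        if tag.startswith("JJ"):
            return "J"
        if tag.startswith("NN"):   # assumption: NNS, NNP count as NN
            return "N"
        if tag == "IN":
            return "I"
        return "O"                 # any other tag breaks a candidate
    codes = "".join(code(tag) for _, tag in tagged)
    # One-character version of "(JJ|NN)* IN? NN"
    return [" ".join(word for word, _ in tagged[m.start():m.end()])
            for m in re.finditer(r"[JN]*I?N", codes)]

sentence = [("The", "DT"), ("additive", "JJ"), ("noise", "NN"), ("model", "NN"),
            ("improves", "VBZ"), ("extraction", "NN"), ("of", "IN"),
            ("keyphrases", "NNS")]
print(simplex_np_candidates(sentence))
# ['additive noise model', 'extraction of keyphrases']
```

Note that the pattern allows only a single noun after the preposition, so longer post-preposition noun groups are split into separate candidates.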



Morphological Features

• Acronym candidate

– Binary feature: is the word an acronym?

– Using simple adjacent pattern matching of parenthesized text to candidates to their left / right

ICADL (Int’l Conf. on Asian Digital Libraries)

Int’l Conf. on Asian Digital Libraries (ICADL)

– Weakness:

- Not comparable to state-of-the-art algorithm, not meant to be

- Not yet evaluated as a separate component

- A finer-grained feature may be more useful
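A simple version of the adjacent pattern matching might look like this. It is a sketch under assumptions: an acronym is taken to be two or more capital letters, and it matches a long form when it equals the initials of that form's capitalized words (so "on" is skipped). The slides do not specify the matching rule, and the function names are hypothetical.

```python
import re

def initials(phrase):
    # First letters of capitalized words only; lowercase words are skipped.
    return "".join(w[0] for w in phrase.split() if w[0].isupper())

def acronym_pairs(text):
    pairs = set()
    # Pattern 1: ACRONYM (long form) -- parenthesized text to the right.
    for m in re.finditer(r"\b([A-Z]{2,}) \(([^)]+)\)", text):
        if initials(m.group(2)) == m.group(1):
            pairs.add((m.group(1), m.group(2)))
    # Pattern 2: long form (ACRONYM) -- grow a window of words to the left.
    for m in re.finditer(r"\(([A-Z]{2,})\)", text):
        acro = m.group(1)
        words = text[:m.start()].split()
        for n in range(1, min(len(words), 2 * len(acro)) + 1):
            long_form = " ".join(words[-n:])
            if initials(long_form) == acro:
                pairs.add((acro, long_form))
                break
    return pairs

def is_acronym(word, text):
    """Binary feature: does `word` appear as an acronym in `text`?"""
    return any(acro == word for acro, _ in acronym_pairs(text))

left = "ICADL (Int’l Conf. on Asian Digital Libraries)"
right = "Int’l Conf. on Asian Digital Libraries (ICADL)"
```

Both slide examples yield the same (acronym, long form) pair under this rule.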



Stemming

• After other processing, case folding and stemming conflate candidates to obtain accurate phrase counts

– Use Lovins iterated stemmer

– Represent all stems using the most frequent form

voxel (1)

Voxels (2)

voxelization (5)

Voxelization (8)
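The conflation step can be sketched with the counts from the example above. The Lovins stemmer is not reimplemented here; `toy_stem` is a hypothetical stand-in that strips only two suffixes, and picking the most frequent case-folded form as the representative is one reading of the slide.

```python
from collections import Counter, defaultdict

def toy_stem(word):
    # Hypothetical stand-in for the Lovins stemmer: strips two suffixes only.
    for suffix in ("ization", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def conflate(form_counts, stem=toy_stem):
    """Group surface forms by stem; return {most frequent form: total count}."""
    groups = defaultdict(Counter)
    for form, count in form_counts.items():
        folded = form.lower()                    # case folding
        groups[stem(folded)][folded] += count    # stemming conflates variants
    return {forms.most_common(1)[0][0]: sum(forms.values())
            for forms in groups.values()}

counts = {"voxel": 1, "Voxels": 2, "voxelization": 5, "Voxelization": 8}
print(conflate(counts))  # {'voxelization': 16}
```

All four variants collapse to one candidate with a combined count of 16, instead of four candidates with small counts.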



2) Structural Feature

Abstract

Introduction

Related Work

Methods

Evaluation

Conclusion


Learning which sections are more productive for keyphrases



Structural Features

Execution: create a feature vector of where a term logically appears

<Abstract, Introduction, Methods, …, Evaluation, Conclusion>

Stem A: <1, 2, 4, … 0, 0>

Stem B: <0, 0, 0, … 3, 0>

Caveat: Lots of unique headers in documents.

Not helpful to say candidate occurs in “Metadata Extraction Approaches”

Change it to “Related Work”
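A section distribution vector could be built as below. This is a sketch: the six section names are only the ones shown on the earlier slide (the full list of generic headers is not given here), and the input is assumed to be one (stem, generic section) pair per occurrence, after headers have already been mapped to generic ones.

```python
from collections import Counter

# Illustrative subset of generic headers; the slides elide the full list.
SECTIONS = ["Abstract", "Introduction", "Related Work",
            "Methods", "Evaluation", "Conclusion"]

def section_vectors(occurrences):
    """occurrences: iterable of (stem, generic_section), one per occurrence."""
    counts = Counter(occurrences)
    stems = {stem for stem, _ in counts}
    return {stem: [counts[(stem, section)] for section in SECTIONS]
            for stem in stems}

occurrences = [("voxel", "Abstract"),
               ("voxel", "Introduction"), ("voxel", "Introduction"),
               ("baseline", "Evaluation")]
vectors = section_vectors(occurrences)
```

Each stem gets one count per generic section, which lets the learner pick up on which sections are more productive for keyphrases.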



Mapping to Generic Section Headers

• Method: also supervised machine learning

• Map to 14 generic headers

1. Absolute section number (Section 3)

2. Relative position (Section 3 of 11 = 3 / (11-1) = .30)

3. Previous section header text

4. Current section header text

• Performance (on a corpus of 1020 headers)

– Maximum Entropy: 92% accuracy

– Hidden Markov Model: 36% accuracy
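The four items listed above can be computed per header as in this sketch. The relative position follows the slide's arithmetic (section number divided by total minus one); the "<START>" placeholder for the first header and the dictionary keys are assumptions, and the classifier itself is not shown.

```python
def header_features(headers):
    """headers: section header texts of one document, in order."""
    total = len(headers)
    features = []
    for number, text in enumerate(headers, start=1):
        features.append({
            "absolute": number,
            # Slide example: Section 3 of 11 -> 3 / (11 - 1) = 0.30
            "relative": number / (total - 1) if total > 1 else 0.0,
            "previous": headers[number - 2] if number > 1 else "<START>",
            "current": text,
        })
    return features

doc = ["Introduction", "Background", "Metadata Extraction Approaches"] + \
      [f"Section {i}" for i in range(4, 12)]   # 11 headers in total
feats = header_features(doc)
```

Each feature dictionary would then be paired with a generic header label to train the supervised mapping model.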



Evaluation - Corpus Collection

No publicly available corpus of keyphrase assignments for scientific documents*. What to do?

So we collected our own. Freely available at: http://wing.comp.nus.edu.sg/downloads/keyphraseCorpus

• 211 documents where text was extractable

– Superset of previous set

• Searched for “keywords general terms filetype:pdf”

* Consider citeulike.org?




Evaluation

• 120 documents with at least two sets of keyphrases

– One by original author

– One or more by student annotators

• Accuracy by matching top ten extracted keyphrases versus the gold standard

– Standard P/R/F1

– Weighted average: use frequency f of phrase in standard, 1 + ln(f)

• Tested Naïve Bayes and Maximum Entropy

• Using Kea features as the baseline
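The two scoring schemes might be computed as follows. This is a sketch under assumptions: f is taken to be the number of gold keyphrase sets that contain a phrase, each matched phrase contributes 1 + ln(f) to the weighted score, recall is measured against the union of all gold phrases, and extracted phrases are assumed unique and already normalized. The slides do not spell out these details.

```python
import math
from collections import Counter

def evaluate(extracted, gold_sets, k=10):
    """Return (precision, recall, f1, weighted_matches) for the top-k phrases."""
    top = extracted[:k]
    # f = number of gold sets (author, annotators) listing the phrase
    freq = Counter(p for gold in gold_sets for p in set(gold))
    matched = [p for p in top if p in freq]
    precision = len(matched) / len(top) if top else 0.0
    recall = len(matched) / len(freq) if freq else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if matched else 0.0
    weighted = sum(1 + math.log(freq[p]) for p in matched)
    return precision, recall, f1, weighted

extracted = ["soft handover", "3g network", "data"]
gold_sets = [["soft handover", "neural network"],
             ["soft handover", "3g network"]]
p, r, f1, w = evaluate(extracted, gold_sets)
```

Here "soft handover" appears in both gold sets, so it contributes 1 + ln(2) rather than 1 to the weighted score.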



Evaluation Results

• Maximum Entropy did not work as well as NB

• NB results show statistical significance at .05 level for both evaluation schemes

[Bar chart: number of keywords matched. Exact Matches: 3.03, 3.25. Weighted Matches: 3.61, 3.84.]



Discussion

Assigned Keyphrases       Kea Baseline     Our System
Neural network            Handover         Clusters
3G network                Soft handover    Soft handover
Soft handover (2)         3G               Data
Cluster analysis          Clusters         3G network
Self organizing map       3G network       Interesting clusters
Hierarchical clustering   Cell             Neural network

Errors:

• Still encourage longer phrase generation

• General words still appear (e.g., “data”, “cell”)



Conclusions

Current and Future Work

– Enlarge the keyphrase corpus

– Integrate tagging with keyphrases

– Deploy system into a scholarly digital library

Contributions: better keyphrase extraction:

– Developed features specifically for scientific documents

– Developed mapping model for headers

– Created a corpus for keyphrase testing

http://wing.comp.nus.edu.sg/downloads/keyphraseCorpus

Advertisement: We’re hiring postdocs in terminology extraction and semistructured document processing


End of Presentation

Backup slides follow



ICADL format

• 23-25 minutes for talk

• 5 minutes for questions

• 30 minutes in total
