
Keyphrase Extraction in Scientific Documents

Thuy Dung Nguyen and Min-Yen Kan

School of Computing

National University of Singapore

Slides available at http://wing.comp.nus.edu.sg/downloads/keyphraseCorpus

ICADL 2007 (Hanoi, Vietnam)

Thuy Dung Nguyen and Min-Yen Kan

Keyphrases!

To think about: Are tags keyphrases?

Credits: Amazon.com, ACM.org, IMDB.com


Using Keyphrases in DLs

• Navigation
  – Searching: Better weighting for terms
  – Browsing and Linking: Finding similar documents
• Reading
  – Highlighting
  – Key Concepts

Helping to make the transition seamless between the two

Why are keyphrases important to digital libraries?

Genex


Related Work

Generation
  – Kim and Wilbur: statistical properties of distribution
  – Tomokiyo and Hurst: phraseness model

Selection
  – GenEx (Turney)
  – Kea (Frank et al.): just 3 features (TF×IDF, position, corpus frequency)
  – Turney: selection not independent, use PMI

Assignment
  – From ontology (Medelyan & Witten), use graph features


Architecture

Key difference from previous works:
• Centered on scientific publications
• As such, adds two modules to capitalize on this limited domain

Components shown in the architecture diagram:
• Input: Scientific publication (plain text; HTML formatted output)
• Preprocessing
  - Sentence delimiting
  - POS tagging
  - Stemming
• Candidate Identification
  - Simplex noun phrase detection
• Basic Features
  - TF×IDF
  - Position
• Morphological Features
  - Suffix sequence
  - POS sequence
  - Acronym
• Structural Features
  - Section distribution vector
• Generic header mapping model
• Keyphrase selection model
• Output: Keyphrases
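The slides name the two basic features (TF×IDF and position) without defining them. A minimal sketch follows, assuming Kea-style definitions: term frequency normalized by document length, times the negative log of the document frequency ratio, and the relative offset of the first occurrence. The function names and the toy document are illustrative only, not taken from the authors' system.

```python
import math

def tfidf(term, doc_tokens, doc_freq, num_docs):
    """TF×IDF under an assumed Kea-style definition."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    idf = -math.log2(doc_freq / num_docs)
    return tf * idf

def first_position(term, doc_tokens):
    """Relative offset of the term's first occurrence (0.0 = document start)."""
    return doc_tokens.index(term) / len(doc_tokens)

# Toy document; doc_freq and num_docs are made-up corpus statistics.
doc = ["soft", "handover", "in", "3g", "network", "uses", "soft", "handover"]
score = tfidf("handover", doc, doc_freq=1, num_docs=4)
offset = first_position("handover", doc)
```

Single-token terms keep the sketch short; multi-word candidates would need phrase-level counting.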


1) Morphological Features

• POS tags (used in previous work; e.g., GenEx)
  – Used to identify candidates for simplex noun phrases (i.e., matching regex “(JJ|NN)* IN? NN”)
  – Noun modifiers seem to be more productive than adjectival ones (e.g., “Additive”/NN vs. “Additional”/JJ)
• Suffixes
  – Sequences on modifiers and headwords (e.g., -ic, -al, -ive on modifiers; -ion, -ics, -ment on headword)
  – More fine-grained than POS tagging
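The candidate pattern “(JJ|NN)* IN? NN” can be tried out on a tagged sentence. The sketch below is one possible reading, not the authors' code: each POS tag is collapsed to a single character so an ordinary regular expression can run over the tag sequence. Treating every tag that starts with NN (NNS, NNP, ...) as NN is an assumption, and the helper names are hypothetical.

```python
import re

def tag_symbol(tag):
    """Collapse a Penn Treebank tag to one character (assumed mapping)."""
    if tag.startswith("NN"):
        return "N"
    if tag.startswith("JJ"):
        return "J"
    if tag == "IN":
        return "I"
    return "O"  # any other tag breaks a candidate

# "(JJ|NN)* IN? NN" rewritten over the one-character symbols.
CANDIDATE = re.compile(r"[JN]*I?N")

def simplex_np_candidates(tagged):
    """Return candidate phrases from a list of (word, tag) pairs."""
    symbols = "".join(tag_symbol(tag) for _, tag in tagged)
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in CANDIDATE.finditer(symbols)
    ]

sentence = [("soft", "JJ"), ("handover", "NN"), ("improves", "VBZ"),
            ("3G", "NN"), ("network", "NN")]
candidates = simplex_np_candidates(sentence)
```

Matching is greedy and non-overlapping, so only the longest candidate at each position is returned; a real system might also keep the shorter sub-phrases.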


Morphological Features

• Acronym candidate
  – Binary feature: is the word an acronym?
  – Uses simple adjacent pattern matching of parenthesized text against candidates to its left / right:
      ICADL (Int’l Conf. on Asian Digital Libraries)
      Int’l Conf. on Asian Digital Libraries (ICADL)
  – Weaknesses:
      - Not comparable to state-of-the-art algorithms, and not meant to be
      - Not yet evaluated as a separate component
      - A finer-grained feature may be more useful
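The adjacent parenthesis matching can be illustrated as follows. This is a rough sketch under stated assumptions rather than the detector used in the talk: an acronym is taken to be a single all-uppercase token, and its expansion is accepted when the initials of the capitalized words spell the acronym (lowercase words such as "on" are skipped). All function names are hypothetical.

```python
import re

def initials(phrase):
    """Initial letters of the capitalized words in a phrase."""
    return "".join(w[0] for w in phrase.split() if w[0].isupper())

def find_acronyms(text):
    """Return a set of (acronym, expansion) pairs found next to parentheses."""
    pairs = set()
    for m in re.finditer(r"\(([^()]+)\)", text):
        inside = m.group(1).strip()
        left_words = text[:m.start()].split()
        if not left_words:
            continue
        if inside.isupper() and " " not in inside:
            # Pattern: expansion (ACRONYM); grow the window to the left.
            for k in range(1, len(left_words) + 1):
                candidate = " ".join(left_words[-k:])
                found = initials(candidate)
                if found == inside:
                    pairs.add((inside, candidate))
                    break
                if len(found) > len(inside):
                    break
        else:
            # Pattern: ACRONYM (expansion); check the word just before "(".
            acronym = left_words[-1]
            if acronym.isupper() and initials(inside) == acronym:
                pairs.add((acronym, inside))
    return pairs

a = find_acronyms("We met at ICADL (Int'l Conf. on Asian Digital Libraries) in Hanoi.")
b = find_acronyms("The Int'l Conf. on Asian Digital Libraries (ICADL) was held in Hanoi.")
```

Both orderings from the slide yield the same pair, which is the property the binary feature relies on.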


Stemming

• After other processing, case folding and stemming conflate candidates to obtain accurate phrase counts

– Use Lovins iterated stemmer

– Represent all stems using the most frequent form

voxel (1)

Voxels (2)

voxelization (5)

Voxelization (8)
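The voxel example can be reproduced with a small sketch. The Lovins stemmer itself is not implemented here; `toy_stem` is a deliberately crude stand-in (case folding plus two hard-coded suffixes) chosen only so that the four surface forms above fall into one group. The grouping and the "most frequent form" rule follow the slide; everything else is an assumption.

```python
from collections import Counter, defaultdict

SUFFIXES = ("ization", "s")  # toy suffix list, NOT the Lovins rule set

def toy_stem(word):
    """Case-fold and strip one known suffix (stand-in for a real stemmer)."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[:-len(suffix)]
    return w

def conflate(form_counts):
    """Map each stem to (most frequent surface form, total count)."""
    groups = defaultdict(Counter)
    for form, count in form_counts.items():
        groups[toy_stem(form)][form] += count
    return {
        stem: (forms.most_common(1)[0][0], sum(forms.values()))
        for stem, forms in groups.items()
    }

counts = {"voxel": 1, "Voxels": 2, "voxelization": 5, "Voxelization": 8}
merged = conflate(counts)
```

Here all four forms share the stem "voxel", are represented by "Voxelization" (the most frequent form), and carry a combined count of 16.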


2) Structural Feature

Abstract

Introduction

Related Work

Methods

Evaluation

Conclusion


Learning which sections are more productive for keyphrases


Structural Features

Execution: create a feature vector of where a term logically appears

<Abstract, Introduction, Methods, …, Evaluation, Conclusion>

Stem A: <1, 2, 4, … 0, 0>

Stem B: <0, 0, 0, … 3, 0>

Caveat: Lots of unique headers in documents. It is not helpful to say a candidate occurs in “Metadata Extraction Approaches”; change it to “Related Work”.
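A section distribution vector of this kind could be built as sketched below. The six generic section names are borrowed from the preceding slide as an assumption (the talk maps to 14 generic headers, which are not listed), and the input is assumed to be already stemmed and already mapped to generic headers.

```python
# Assumed generic sections; the real system uses 14 (not listed in the slides).
GENERIC_SECTIONS = ["Abstract", "Introduction", "Related Work",
                    "Methods", "Evaluation", "Conclusion"]

def section_vector(stem, sections):
    """Count occurrences of a stem per generic section.

    sections: list of (generic_header, list_of_stems) pairs.
    """
    counts = {name: 0 for name in GENERIC_SECTIONS}
    for header, stems in sections:
        if header in counts:
            counts[header] += stems.count(stem)
    return [counts[name] for name in GENERIC_SECTIONS]

# Toy document, made up for illustration.
document = [
    ("Abstract", ["voxel", "mesh"]),
    ("Introduction", ["voxel", "voxel"]),
    ("Methods", ["voxel"] * 4),
    ("Evaluation", ["mesh"] * 3),
]
vec_voxel = section_vector("voxel", document)
vec_mesh = section_vector("mesh", document)
```

A stem concentrated in the early sections produces a vector like Stem A, while one that only surfaces in the evaluation looks like Stem B.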


Mapping to Generic Section Headers

• Method: also supervised machine learning
• Map to 14 generic headers, using these features:
  1. Absolute section number (Section 3)
  2. Relative position (Section 3 of 11 = 3 / (11-1) = .30)
  3. Previous section header text
  4. Current section header text
• Performance (on a corpus of 1020 headers)
  – Maximum Entropy: 92% accuracy
  – Hidden Markov Model: 36% accuracy
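The four header features can be computed as in the sketch below, which covers feature extraction only and leaves out the classifier. The relative position follows the slide's arithmetic literally (section number divided by the number of sections minus one); the dictionary keys, the lowercasing, and the "<start>" placeholder are assumptions made for illustration.

```python
def header_features(headers, index):
    """Features for the header at a 0-based index in a document's header list."""
    total = len(headers)
    number = index + 1  # absolute section number, 1-based as on the slide
    relative = number / (total - 1) if total > 1 else 0.0
    return {
        "absolute_number": number,
        "relative_position": round(relative, 2),
        "previous_header": headers[index - 1].lower() if index > 0 else "<start>",
        "current_header": headers[index].lower(),
    }

# Toy header list with 11 sections, made up for illustration.
headers = ["Abstract", "Introduction", "Metadata Extraction Approaches",
           "System Overview", "Features", "Header Mapping", "Corpus",
           "Experiments", "Discussion", "Conclusion", "References"]
features = header_features(headers, 2)
```

For the third of eleven sections this gives a relative position of 0.3, matching the worked example on the slide.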


Evaluation - Corpus Collection

No publicly available corpus of keyphrase assignments for scientific documents*. What to do?

So we collected our own. Freely available at: http://wing.comp.nus.edu.sg/downloads/keyphraseCorpus

• 211 documents where text was extractable
  – Superset of previous set
• Searched for “keywords general terms filetype:pdf”

* Consider citeulike.org?


Evaluation

• 120 documents with at least two sets of keyphrases
  – One by original author
  – One or more by student annotators
• Accuracy by matching top ten extracted keyphrases versus the gold standard
  – Standard P/R/F1
  – Weighted average: weight each phrase by its frequency f in the standard, 1 + ln(f)
• Tested Naïve Bayes and Maximum Entropy
• Using Kea features as the baseline
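The two scoring schemes can be sketched as follows. This is an interpretation, not the authors' evaluation script: the gold standard is taken to be the union of all annotators' phrases, f is the number of annotators who assigned a phrase, and each matched phrase contributes 1 + ln(f) to the weighted score. Phrases are assumed to be normalized (case-folded, stemmed) beforehand.

```python
import math
from collections import Counter

def evaluate(extracted, gold_sets, top_n=10):
    """Score the top-N extracted phrases against several gold keyphrase sets."""
    top = extracted[:top_n]
    # f = number of annotators who assigned each gold phrase (assumption).
    freq = Counter(p for gold in gold_sets for p in set(gold))
    matched = [p for p in top if p in freq]
    exact = len(matched)
    weighted = sum(1 + math.log(freq[p]) for p in matched)
    precision = exact / len(top) if top else 0.0
    recall = exact / len(freq) if freq else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"exact": exact, "weighted": weighted,
            "precision": precision, "recall": recall, "f1": f1}

# Toy data, made up for illustration.
extracted = ["soft handover", "clusters", "3g network"]
gold_sets = [["soft handover", "neural network"],
             ["soft handover", "3g network"]]
result = evaluate(extracted, gold_sets)
```

A phrase chosen by two annotators counts 1 + ln(2) under the weighted scheme, so agreement among annotators is rewarded without letting a single phrase dominate.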


Evaluation Results

• Maximum Entropy did not work as well as NB
• NB results show statistical significance at .05 level for both evaluation schemes

[Bar chart: number of keywords matched, grouped by Exact Matches and Weighted Matches; bar values as extracted: 3.03, 3.25, 3.61, 3.84]


Discussion

Assigned Keyphrases      | Kea Baseline  | Our System
Neural network           | Handover      | Clusters
3G network               | Soft handover | Soft handover
Soft handover (2)        | 3G            | Data
Cluster analysis         | Clusters      | 3G network
Self organizing map      | 3G network    | Interesting clusters
Hierarchical clustering  | Cell          | Neural network

Errors:

• Still encourage longer phrase generation

• General words still appear (e.g., “data”, “cell”)


Conclusions

Current and Future Work
  – Enlarge the keyphrase corpus
  – Integrate tagging with keyphrases
  – Deploy system into a scholarly digital library

Contributions: better keyphrase extraction
  – Developed features specifically for scientific documents
  – Developed mapping model for headers
  – Created a corpus for keyphrase testing

http://wing.comp.nus.edu.sg/downloads/keyphraseCorpus

Advertisement: We’re hiring postdocs in terminology extraction and semistructured document processing

End of Presentation

Backup slides follow


ICADL format

• 23-25 minutes for talk
• 5 minutes for questions
• 30 minutes in total