Keyphrase Extraction in Scientific Documents
Thuy Dung Nguyen and Min-Yen Kan
School of Computing
National University of Singapore
Slides available at http://wing.comp.nus.edu.sg/downloads/keyphraseCorpus
ICADL 2007 (Hanoi, Vietnam)
Keyphrases!
To think about: Are tags keyphrases?
Credits: Amazon.com, ACM.org, IMDB.com
Using Keyphrases in DLs
• Navigation
  – Searching: Better weighting for terms
  – Browsing and Linking: Finding similar documents
• Reading
  – Highlighting
  – Key Concepts
Helping to make the transition seamless between the two
Why are keyphrases important to digital libraries?
Genex
Related Work
Generation
  – Kim and Wilbur: statistical properties of distribution
  – Tomokiyo and Hurst: Phraseness model
Selection
  – GenEx (Turney)
  – Kea (Frank et al.): just 3 features: TF×IDF, position, corpus frequency
  – Turney: selection not independent, use PMI
Assignment
  – From Ontology (Medelyan & Witten), use graph features
Architecture
Key difference from previous works:
• Centered on scientific publications
• As such, adds two modules to capitalize on this limited domain
[Architecture diagram. Labels: Scientific publication; Plain text; HTML formatted output; Generic header mapping model; Keyphrase selection model; Keyphrases. Module boxes:]
Preprocessing:
  - Sentence delimiting
  - POS tagging
  - Stemming
Candidate Identification:
  - Simplex noun phrase detection
Basic Features:
  - TF×IDF
  - Position
Morphological Features:
  - Suffix sequence
  - POS sequence
  - Acronym
Structural Features:
  - Section distribution vector
1) Morphological Features
• POS tags (used in previous work; e.g., GenEx)
  – Used to identify candidates for simplex noun phrases (i.e., matching regex “(JJ|NN)* IN? NN”)
  – Noun modifiers seem to be more productive than adjectival ones (e.g., “Additive”/NN vs. “Additional”/JJ)
• Suffixes
  – Sequences on modifiers and headwords (e.g., -ic, -al, -ive on modifiers; -ion, -ics, -ment on headword)
  – More fine-grained than POS tagging
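A minimal sketch of candidate identification with the POS pattern “(JJ|NN)* IN? NN” quoted above. It assumes the input is already tokenized and tagged with Penn Treebank tags, and that plural and proper noun tags (NNS, NNP) are folded into NN and JJR/JJS into JJ; that folding, and the names `coarsen` and `find_candidates`, are illustrative assumptions rather than details from the slides.

```python
import re

# The slide's pattern, applied to a string of tags where each tag is
# followed by one space. (?<!\S) forces a match to start on a tag boundary.
CANDIDATE = re.compile(r"(?<!\S)(?:(?:JJ|NN) )*(?:IN )?NN ")


def coarsen(tag):
    """Fold NNS/NNP/... into NN and JJR/JJS into JJ (an assumption)."""
    return tag[:2] if tag.startswith(("NN", "JJ")) else tag


def find_candidates(tagged):
    """tagged: list of (word, tag) pairs. Returns matched phrases."""
    tags = "".join(coarsen(tag) + " " for _, tag in tagged)
    phrases = []
    for m in CANDIDATE.finditer(tags):
        start = tags[:m.start()].count(" ")   # number of tags before the match
        length = m.group().count(" ")         # number of tags inside the match
        phrases.append(" ".join(word for word, _ in tagged[start:start + length]))
    return phrases


sentence = [("efficient", "JJ"), ("extraction", "NN"), ("of", "IN"),
            ("keyphrases", "NNS"), ("is", "VBZ"), ("hard", "JJ")]
print(find_candidates(sentence))
```

The trailing adjective “hard” is not returned because the pattern must end in a noun.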
Morphological Features
• Acronym candidate
  – Binary feature: is the word an acronym?
  – Using simple adjacent pattern matching of parenthesized text to candidates to their left / right
      ICADL (Int’l Conf. on Asian Digital Libraries)
      Int’l Conf. on Asian Digital Libraries (ICADL)
  – Weakness:
    - Not comparable to state-of-the-art algorithm, not meant to be
    - Not yet evaluated as a separate component
    - A finer-grained feature may be more useful
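One possible reading of the adjacent pattern matching described above, as a rough sketch. It pairs a parenthesized span with the text to its left and accepts the pair when the initials of the capitalized words spell the acronym. The initials test, the function names, and the use of straight apostrophes in the sample text are assumptions; the slides do not spell out the matching rule.

```python
import re


def initials(phrase):
    """Initial letters of the capitalized words in a phrase."""
    return "".join(w[0] for w in phrase.split() if w[0].isupper())


def find_acronyms(text):
    """Return {acronym: expansion} for both adjacent patterns."""
    pairs = {}
    # Pattern 1: ACRONYM (expansion)
    for m in re.finditer(r"\b([A-Z]{2,})\s*\(([^()]+)\)", text):
        acronym, expansion = m.group(1), m.group(2).strip()
        if initials(expansion) == acronym:
            pairs[acronym] = expansion
    # Pattern 2: expansion (ACRONYM); grow a window leftwards word by word
    for m in re.finditer(r"\(([A-Z]{2,})\)", text):
        acronym = m.group(1)
        words = text[:m.start()].split()
        for n in range(1, min(len(words), 2 * len(acronym)) + 1):
            phrase = " ".join(words[-n:])
            if words[-n][0].isupper() and initials(phrase) == acronym:
                pairs[acronym] = phrase
                break
    return pairs


left = "ICADL (Int'l Conf. on Asian Digital Libraries)"
right = "We met at the Int'l Conf. on Asian Digital Libraries (ICADL) in Hanoi."
print(find_acronyms(left))
print(find_acronyms(right))
```

The binary feature for a candidate could then be as simple as `candidate in find_acronyms(document_text)`.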
Stemming
• After other processing, case folding and stemming conflate candidates to obtain accurate phrase counts
– Use Lovins iterated stemmer
– Represent all stems using the most frequent form
voxel (1)
Voxels (2)
voxelization (5)
Voxelization (8)
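A small sketch of the conflation step, using the counts from the example above. The Lovins stemmer is not reimplemented here: `toy_stem` is a hypothetical stand-in that only lowercases and strips two suffixes, just enough to group these four forms. Choosing the representative among the original surface forms (before case folding) is also an assumption.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for the Lovins stemmer: case-fold, strip suffixes."""
    w = word.lower()
    for suffix in ("ization", "s"):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            w = w[:-len(suffix)]
    return w


def conflate(surface_counts, stem=toy_stem):
    """Group surface forms by stem; return {stem: (most frequent form, total)}."""
    groups = defaultdict(dict)
    for form, count in surface_counts.items():
        groups[stem(form)][form] = count
    return {s: (max(forms, key=forms.get), sum(forms.values()))
            for s, forms in groups.items()}


counts = {"voxel": 1, "Voxels": 2, "voxelization": 5, "Voxelization": 8}
print(conflate(counts))
```

All four forms share one stem here, so the phrase count becomes 16 instead of four smaller, separate counts.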
2) Structural Feature
Abstract
Introduction
Related Work
Methods
Evaluation
Conclusion
Learning which sections are more productive for keyphrases
Structural Features
Execution: create a feature vector of where a term logically appears
<Abstract, Introduction, Methods, …, Evaluation, Conclusion>
Stem A: <1, 2, 4, … 0, 0>
Stem B: <0, 0, 0, … 3, 0>
Caveat: Lots of unique headers in documents.
Not helpful to say candidate occurs in “Metadata Extraction Approaches”
Change it to “Related Work”
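A sketch of how such a section distribution vector might be built. It uses only the six generic headers named on the earlier slide (the full system maps to 14), and the lookup table `HEADER_MAP` is a hypothetical stand-in for the learned header mapping model.

```python
# Six generic headers named in the slides; the real inventory has 14.
GENERIC = ["Abstract", "Introduction", "Related Work",
           "Methods", "Evaluation", "Conclusion"]

# Hypothetical stand-in for the learned mapping from raw to generic headers.
HEADER_MAP = {"Metadata Extraction Approaches": "Related Work"}


def generic_header(raw_header):
    """Map a raw header to a generic one; known generic headers pass through."""
    return HEADER_MAP.get(raw_header, raw_header)


def section_vector(stem, sections):
    """sections: list of (raw_header, stems in that section).
    Returns occurrence counts of `stem`, one slot per generic header."""
    vector = [0] * len(GENERIC)
    for raw_header, stems in sections:
        header = generic_header(raw_header)
        if header in GENERIC:          # sections that map to nothing are skipped
            vector[GENERIC.index(header)] += stems.count(stem)
    return vector


doc = [("Abstract", ["voxel"]),
       ("Introduction", ["voxel", "voxel", "mesh"]),
       ("Metadata Extraction Approaches", ["mesh"]),
       ("Methods", ["voxel"] * 4),
       ("Evaluation", ["mesh"] * 3)]
print(section_vector("voxel", doc))
print(section_vector("mesh", doc))
```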
Mapping to Generic Section Headers
• Method: also supervised machine learning
• Map to 14 generic headers
1. Absolute section number (Section 3)
2. Relative position (Section 3 of 11 = 3 / (11-1) = .30)
3. Previous section header text
4. Current section header text
• Performance (on a corpus of 1020 headers)
  – Maximum Entropy: 92% accuracy
  – Hidden Markov Model: 36% accuracy
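A sketch of the four numbered items above, read here as the per-header inputs to the classifier (that reading is an assumption). The relative position follows the slide's arithmetic, 3 / (11 - 1) = .30. The function name, the dictionary keys, and the handling of single-section documents are illustrative choices.

```python
def header_features(section_number, total_sections, previous_header, current_header):
    """Build the four features for one section header."""
    if total_sections > 1:
        relative = section_number / (total_sections - 1)
    else:
        relative = 0.0  # assumption: avoid dividing by zero for one-section documents
    return {
        "absolute_section_number": section_number,
        "relative_position": relative,
        "previous_header": previous_header,
        "current_header": current_header,
    }


features = header_features(3, 11, "Introduction", "Metadata Extraction Approaches")
print(features)
```

A feature dictionary like this could be fed to any supervised classifier, for example a maximum entropy model as reported above.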
Evaluation - Corpus Collection
No publicly available corpus of keyphrase assignments for scientific documents*. What to do?
So we collected our own. Freely available at: http://wing.comp.nus.edu.sg/downloads/keyphraseCorpus
• 211 documents where text was extractable
  – Superset of previous set
• Searched for “keywords general terms filetype:pdf”
* Consider citeulike.org?
Evaluation
• 120 documents with at least two sets of keyphrases
  – One by original author
  – One or more by student annotators
• Accuracy by matching top ten extracted keyphrases versus the gold standard
  – Standard P/R/F1
  – Weighted average: use frequency f of the phrase in the standard, weight = 1 + ln(f)
• Tested Naïve Bayes and Maximum Entropy
• Using Kea features as the baseline
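A sketch of the two scoring schemes, under stated assumptions: phrases are already normalized (e.g., stemmed and case-folded) before comparison, f is the number of keyphrase sets in which a gold phrase appears, and the weighted score is the sum of 1 + ln(f) over matched phrases. The slides give only the weight formula, so the summation and the function names are illustrative.

```python
import math


def prf1(extracted, gold, top_n=10):
    """Precision, recall and F1 of the top-n extracted phrases against gold."""
    top = extracted[:top_n]
    matched = sum(1 for phrase in top if phrase in gold)
    precision = matched / len(top) if top else 0.0
    recall = matched / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def weighted_matches(extracted, gold_freq, top_n=10):
    """Sum of 1 + ln(f) over top-n phrases found in the gold standard.
    gold_freq maps each gold phrase to its frequency f (f >= 1)."""
    return sum(1 + math.log(gold_freq[p]) for p in extracted[:top_n] if p in gold_freq)


gold_freq = {"soft handover": 2, "3g network": 1, "neural network": 1}
extracted = ["soft handover", "3g network", "cell"]
print(prf1(extracted, set(gold_freq)))
print(weighted_matches(extracted, gold_freq))
```

A phrase listed by a single annotator has weight 1 + ln(1) = 1, so the weighted score reduces to the exact-match count when every gold phrase has f = 1.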
Evaluation Results
• Maximum Entropy did not work as well as NB
• NB results show statistical significance at .05 level for both evaluation schemes
[Bar chart: number of keywords matched, grouped by Exact Matches and Weighted Matches; bar values as extracted: 3.03, 3.25, 3.61, 3.84]
Discussion
Assigned Keyphrases     | Kea Baseline  | Our System
Neural network          | Handover      | Clusters
3G network              | Soft handover | Soft handover
Soft handover (2)       | 3G            | Data
Cluster analysis        | Clusters      | 3G network
Self organizing map     | 3G network    | Interesting clusters
Hierarchical clustering | Cell          | Neural network
Errors:
• Still encourage longer phrase generation
• General words still appear (e.g., “data”, “cell”)
Conclusions
Contributions: better keyphrase extraction
  – Developed features specifically for scientific documents
  – Developed mapping model for headers
  – Created a corpus for keyphrase testing
Current and Future Work
  – Enlarge the keyphrase corpus
  – Integrate tagging with keyphrases
  – Deploy system into a scholarly digital library
http://wing.comp.nus.edu.sg/downloads/keyphraseCorpus
Advertisement: We’re hiring postdocs in terminology extraction and semistructured document processing