Automatic Keyphrase Extraction by Bridging Vocabulary Gap
Xinxiong Chen, Tsinghua University
2013-04-26
04/19/2023 THUNLP, Tsinghua University http://nlp.csai.tsinghua.edu.cn
Main Idea
Vocabulary gap: appropriate keyphrases are not always statistically significant in the given document, and may not appear in it at all.
Approach: use word alignment models from statistical machine translation to learn translation probabilities between the words in documents and the words in keyphrases.
Introduction – Keyphrase
What are keyphrases?
A set of terms selected from a document as a short summary of the document.
Introduction – Keyphrase Extraction
Why keyphrase extraction?
  Digital libraries
  Information retrieval
Goal: automatically extract keyphrases from documents
  Unsupervised
Example
A news article (translated from Chinese):
Title: Israeli Military Claims Iran Can Produce Nuclear Bombs and Is Considering Military Action against Iran
Summary: …
Content: …
Keywords: Israeli, Iran, Nuclear bombs, Nuclear weapon
Example
Existing unsupervised methods:
TFIDF: Nuclear bombs, Iran, Israeli, enriched uranium, speech
TextRank: Iran, Israeli, chief, Nuclear bombs, Military
  Use a fixed-size window to build a word graph
  Use PageRank to decide which words are more important
TF: Israeli (6), Iran (6), intelligence (5), nuclear bombs (4), enriched uranium (3), …, nuclear weapon (1)
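The TextRank baseline described on this slide (a fixed-size window to build the word graph, then PageRank to rank the words) can be sketched roughly as follows. This is a minimal illustration, not the presenters' implementation: the function name `textrank`, the default window size, the damping factor of 0.85, and the iteration count are all assumptions.

```python
from collections import defaultdict


def textrank(words, window=3, damping=0.85, iterations=50):
    """Rank words by PageRank over an undirected co-occurrence graph.

    Two words are linked when they appear within `window` positions of
    each other in the token sequence.
    """
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        # Link w to the words that follow it inside the window.
        for v in words[i + 1 : i + window]:
            if v != w:
                neighbors[w].add(v)
                neighbors[v].add(w)

    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbor's score.
        scores = {
            w: (1 - damping)
            + damping * sum(scores[v] / len(neighbors[v]) for v in neighbors[w])
            for w in neighbors
        }
    return sorted(scores, key=scores.get, reverse=True)


tokens = "iran nuclear israeli iran bombs iran chief".split()
ranked = textrank(tokens, window=2)
```

Because the score depends only on graph structure, a word that is absent from the document can never be ranked, which is the vocabulary gap the talk addresses.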
Example
LDA: Iran, England, America, Nation, Speech
  Learn topics from documents
ExpandRank: Iran, enriched uranium, Israeli, atomic energy, Lebanon
  Find the k nearest neighbor documents to build word graphs
Idea - Association
When a word is mentioned, it reminds people of other words.
  iPhone – Apple
  Nuclear bombs – Nuclear Weapon
What is the association probability between "Nuclear bombs" and "Nuclear Weapon"?
[Diagram: "Nuclear bombs" linked to "Nuclear Weapon"]
Idea – SMT for Keyphrase Extraction
Both the content and the keyphrases are parallel summaries of a news article
Unsupervised: use the title or the summary instead of keyphrases
Estimate the translation probabilities between the words in the content and the words in the title with word alignment models
[Diagram: News. Content is translated into Title (Summarization).]
Translation Probability
Example: translation probabilities for "Nuclear bombs"
  Nuclear bombs: 0.515757
  Liquid: 0.0871815
  Nuclear Weapon: 0.0808868
  Military Action: 0.0239178
  Israeli Military: 0.0215988
  Miniaturization: 0.0118
  Possible: 0.0113688
  enriched uranium: 0.0100252
Keyphrase Extraction Using WAM
Given a news article, rank keyphrases by computing their scores
Iran, Israeli, chief, Nuclear bombs, Military …
Iran, Israeli, chief, Nuclear bombs, Nuclear weapon, Military, speech
Word Trigger Method (WTM)
Three steps:
  Preparing translation pairs
  Learning a translation model (IBM Model-1)
  Extracting keyphrases given a resource
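The second step, learning a translation model with IBM Model-1, can be illustrated with a small expectation-maximization loop. The sketch below is a simplified version under stated assumptions: it omits the NULL source word, uses a hypothetical function name `train_ibm_model1`, and treats each translation pair as (content words, title words). It is not the presenters' code.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate Pr(target_word | source_word) with EM (IBM Model 1, no NULL word).

    `pairs` is a list of (source_words, target_words) tuples.
    Returns a dict keyed by (target_word, source_word).
    """
    target_vocab = {t for _, tgt in pairs for t in tgt}
    uniform = 1.0 / len(target_vocab)
    prob = defaultdict(lambda: uniform)  # uniform initialization

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (target, source) links
        total = defaultdict(float)  # expected count of links per source word
        for src, tgt in pairs:
            for t in tgt:
                # E-step: distribute one unit of mass for t over the source words.
                norm = sum(prob[(t, s)] for s in src)
                for s in src:
                    frac = prob[(t, s)] / norm
                    count[(t, s)] += frac
                    total[s] += frac
        # M-step: renormalize the expected counts per source word.
        prob = defaultdict(
            float, {(t, s): c / total[s] for (t, s), c in count.items()}
        )
    return prob


pairs = [
    (["nuclear", "bombs", "iran"], ["nuclear", "iran"]),
    (["iran", "israeli"], ["iran"]),
    (["nuclear", "weapon"], ["nuclear"]),
]
prob = train_ibm_model1(pairs, iterations=5)
```

In this toy data, "weapon" only ever co-occurs with the title word "nuclear", so all of its probability mass goes to that word.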
Translation Pairs
Length imbalance problem
  Unable to list all tags on the annotation side
  Tags may have different importance for the resource
Content-Title Pairs
Length imbalance problem
  Unable to list all tags on the annotation side
  Tags may have different importance for the resource
Sampling method
  Tag weighting type: TF_t, TF-IRF_t
  Length ratio
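One way to read the sampling method above: shorten the content side by drawing words in proportion to their weight, so that the sampled content is a fixed multiple (the length ratio) of the title length. The sketch below assumes plain term frequency as the weight; the slides also name TF-IRF, whose definition does not appear in this transcript. The function name `sample_content` and its parameters are hypothetical.

```python
import random
from collections import Counter


def sample_content(content_words, title_words, length_ratio=1.0, seed=None):
    """Build a length-balanced (content, title) pair by TF-weighted sampling.

    The sampled content has about length_ratio * len(title_words) words,
    drawn with replacement in proportion to term frequency (an assumption).
    """
    if not content_words:
        raise ValueError("content_words must not be empty")
    rng = random.Random(seed)
    tf = Counter(content_words)
    vocab = list(tf)
    weights = [tf[w] for w in vocab]
    k = max(1, round(length_ratio * len(title_words)))
    sampled = rng.choices(vocab, weights=weights, k=k)
    return sampled, list(title_words)


content = ["iran"] * 6 + ["israeli"] * 6 + ["chief"]
title = ["iran", "nuclear"]
sampled, target = sample_content(content, title, length_ratio=2.0, seed=0)
```

Frequent content words are more likely to survive sampling, which keeps the pair short without discarding the words that matter most.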
Learning Translation Probabilities
IBM Model-1 as the WAM algorithm
  Asymmetric: Pr_d2a(t|w), Pr_a2d(t|w)
Linear combination of Pr_d2a(t|w) and Pr_a2d(t|w)
  When λ = 1 or λ = 0, it simply uses the model Pr_d2a(t|w) or Pr_a2d(t|w), respectively
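The combination formula itself is not preserved in this transcript. Assuming a plain linear interpolation, which matches the stated behavior at λ = 1 and λ = 0, the two directional models could be merged as in the sketch below. The function name `combine_models` and the dictionary layout keyed by (t, w) are illustrative assumptions.

```python
def combine_models(pr_d2a, pr_a2d, lam=0.5):
    """Interpolate two directional translation tables keyed by (t, w).

    Assumed form: lam * Pr_d2a(t|w) + (1 - lam) * Pr_a2d(t|w).
    Pairs missing from one table count as probability 0 in that table.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    keys = set(pr_d2a) | set(pr_a2d)
    return {
        k: lam * pr_d2a.get(k, 0.0) + (1.0 - lam) * pr_a2d.get(k, 0.0)
        for k in keys
    }


d2a = {("nuclear weapon", "nuclear bombs"): 0.5}
a2d = {("nuclear weapon", "nuclear bombs"): 0.25}
mixed = combine_models(d2a, a2d, lam=0.5)
```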
[Diagram: translation directions content → title and title → content]
Tag Suggestion Using Triggered Words
Given a description, rank tags by computing their scores
Trigger power of the word w in the content:
  TF-IRF_w
  TextRank
  Their product
Keyphrase Extraction Using Triggered Words
Given a description, rank tags by computing their scores
Translation probabilities from words in the description to keyphrases
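The scoring formula is not preserved in this transcript. A natural reading of the two ingredients named on these slides (the trigger power of each word w, and the translation probability from w to a candidate t) is a weighted sum over the words of the document. The sketch below assumes that form; `rank_keyphrases`, `doc_weights`, and `trans_prob` are hypothetical names, and the numbers are toy values.

```python
from collections import defaultdict


def rank_keyphrases(doc_weights, trans_prob, top_k=5):
    """Score candidates as sum over w of trigger_power(w) * Pr(t | w).

    doc_weights: {word: trigger power in this document}
    trans_prob:  {word: {candidate: Pr(candidate | word)}}
    """
    scores = defaultdict(float)
    for w, weight in doc_weights.items():
        for t, p in trans_prob.get(w, {}).items():
            scores[t] += weight * p
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


doc_weights = {"nuclear bombs": 1.0}
trans_prob = {"nuclear bombs": {"nuclear bombs": 0.5, "nuclear weapon": 0.25}}
ranked = rank_keyphrases(doc_weights, trans_prob)
```

Under this reading, "nuclear weapon" receives a non-zero score even when it occurs rarely or not at all in the document, because it is triggered by "nuclear bombs".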
Emphasize Tags Appearing in Content for WTM (EWTM)
Emphasize tags appearing in the description
I_t(w): indicator function to emphasize the tags appearing in the content
  1 when t = w
  0 when t ≠ w
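How the indicator enters the score is not shown in this transcript. One plausible form, assumed here purely for illustration, interpolates the indicator with the learned translation probability using a weight `gamma`. Both `gamma` and the function name `emphasized_prob` are hypothetical.

```python
def emphasized_prob(t, w, trans_prob, gamma=0.5):
    """Assumed form: gamma * I_t(w) + (1 - gamma) * Pr(t | w).

    trans_prob maps (t, w) to the learned translation probability.
    """
    indicator = 1.0 if t == w else 0.0  # I_t(w): 1 when t = w, else 0
    return gamma * indicator + (1.0 - gamma) * trans_prob.get((t, w), 0.0)


trans_prob = {
    ("iran", "iran"): 0.5,
    ("nuclear weapon", "nuclear bombs"): 0.25,
}
same = emphasized_prob("iran", "iran", trans_prob)
other = emphasized_prob("nuclear weapon", "nuclear bombs", trans_prob)
```

With this form, a candidate that literally appears in the content gets a boost, while triggered candidates still keep a share of the probability mass.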
Experiments
Datasets: 13,702 news articles from www.163.com
Evaluation metrics: precision, recall and F-measure
Dataset statistics:
  Words in documents: 72,900
  Words in keyphrases: 12,405
  Average document length: 971.7 words
  Average title length: 11.6 words
  Average summarization length: 45.8 words
  Average number of keyphrases: 2.4
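The evaluation metrics named above (precision, recall and F-measure) can be computed per document as in the sketch below. It uses the standard set-based definitions with exact string matching, which is an assumption about the matching criterion; the function name is illustrative. The sample inputs are the TextRank output and the annotated keywords from the news example earlier in the talk.

```python
def precision_recall_f1(extracted, gold):
    """Set-based precision, recall and F1 for one document."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


textrank_output = ["Iran", "Israeli", "chief", "Nuclear bombs", "Military"]
gold_keywords = ["Israeli", "Iran", "Nuclear bombs", "Nuclear weapon"]
p, r, f = precision_recall_f1(textrank_output, gold_keywords)
```

For this example, three of the five extracted terms are correct, and three of the four gold keywords are recovered.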
Experiment Results
Parameters – Length Ratio
The length ratio: content/title
Application
SINA APP (http://app.thunlp.org/weibo)
Now we have more than 2 million registered users
Thank you!
Q & A