Topical Phrases and Exploration of Text Corpora by Topics
CS 512 – Spring 2014
Professor: Jiawei Han
University of Illinois at Urbana-Champaign, Department of Computer Science
Ahmed El-Kishky
Mehrdad Biglari
Ramesh Kolavennu
Introduction
• This survey accompanies a system for text exploration based on the ToPMine framework – “Scalable Topical Phrase Mining from Large Text Corpora,” El-Kishky, Song, Wang, Voss, Han
Outline
• Unigram Topic Modeling
– Latent Dirichlet Allocation
• Post Topic Model Topical Phrase Construction
– Turbo Topics
– Keyphrase Extraction and Ranking by Topic (KERT)
– Constructing Topical Hierarchies
• Inferring Phrases and Topics
– Bigram Topic Model
– Topical N-grams
– Phrase-Discovering LDA
• Separation of Phrase and Topic
– ToPMine
• Constraining LDA
– Hidden Topic Markov Model
– Sentence LDA
• Frequent Pattern Mining and Topic Models
– Sentence LDA
• Phrase Extraction Literature
– Common Phrase Extraction Techniques
• Visualization
– Visualizing Topic Models
Latent Dirichlet Allocation
LDA is a generative topic model that posits that each document is a mixture of a small number of topics, and that each topic is a multinomial distribution over words.
1. Sample a distribution over topics from a Dirichlet prior.
2. For each word in the document:
– Randomly choose a topic from the distribution over topics from step #1.
– Randomly choose a word from the chosen topic’s distribution over the vocabulary.
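As a concrete illustration, here is a minimal sketch of this generative process in Python/NumPy. The sizes and hyperparameters (K, V, alpha, beta, doc_len) are arbitrary illustrative choices, not values prescribed by LDA.

import numpy as np

rng = np.random.default_rng(0)
K, V = 5, 1000           # number of topics, vocabulary size (illustrative)
alpha, beta = 0.1, 0.01  # Dirichlet hyperparameters (illustrative)

# Each topic is a multinomial (categorical) distribution over the vocabulary.
phi = rng.dirichlet(np.full(V, beta), size=K)

def generate_document(doc_len=50):
    # Step 1: sample this document's distribution over topics.
    theta = rng.dirichlet(np.full(K, alpha))
    words = []
    for _ in range(doc_len):
        z = rng.choice(K, p=theta)   # step 2a: choose a topic
        w = rng.choice(V, p=phi[z])  # step 2b: choose a word from that topic
        words.append(w)
    return words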
LDA Sample Topics
Post Topic Modeling Phrase Construction
• Turbo Topics
• Keyphrase Extraction and Ranking by Topic (KERT)
• Constructing Topical Hierarchies (CATHY)
Turbo Topics
Construction of topical phrases as a post-processing step to LDA:
1. Perform LDA topic modeling on the input corpus.
2. Find consecutive words that are assigned the same topic within the corpus.
3. Perform recursive permutation tests against a back-off language model to test the significance of candidate phrases.
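A simplified Python sketch of steps 2–3, assuming per-token topic assignments from LDA are already available. The real method runs recursive permutation tests against a back-off language model; this sketch only tests a single candidate bigram against random shuffles of the same tokens.

import random

def candidate_bigrams(tokens, assignments):
    # Step 2: consecutive words assigned the same topic.
    return [(tokens[i], tokens[i + 1])
            for i in range(len(tokens) - 1)
            if assignments[i] == assignments[i + 1]]

def bigram_count(tokens, pair):
    return sum(1 for i in range(len(tokens) - 1)
               if (tokens[i], tokens[i + 1]) == pair)

def permutation_pvalue(tokens, pair, trials=1000, seed=0):
    # Step 3 (simplified): how often does a random shuffle of the same
    # tokens produce the bigram at least as often as observed?
    rnd = random.Random(seed)
    observed = bigram_count(tokens, pair)
    hits = 0
    for _ in range(trials):
        shuffled = list(tokens)
        rnd.shuffle(shuffled)
        if bigram_count(shuffled, pair) >= observed:
            hits += 1
    return hits / trials  # small p-value -> significant phrase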
KERT
Construction of topical phrases as a post-processing step to LDA:
1. Perform LDA topic modeling on the input corpus.
2. Partition each document into K documents (K = #topics).
3. For each topic, mine frequent patterns on the new corpus.
4. Rank the output for each topic based on four heuristic measures:
• Purity – how often a phrase appears in a topic relative to other topics
• Completeness – threshold filtering of subphrases (e.g., “support vector” vs. “support vector machines”)
• Coverage – frequency of the phrase within its topic
• Phrase-ness – ratio of the expected occurrence of a phrase to its actual occurrence
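A Python sketch of the document partitioning (step 2) and two of the four measures. The contiguous n-gram counting below stands in for full frequent pattern mining, and the purity/coverage formulas are illustrative rather than the exact definitions from the KERT paper.

from collections import Counter, defaultdict

def partition_by_topic(doc_tokens, assignments, K):
    # Step 2: split one document into K bags of words, one per topic.
    parts = [[] for _ in range(K)]
    for word, z in zip(doc_tokens, assignments):
        parts[z].append(word)
    return parts

def phrase_counts(topic_docs, n=2):
    # Count contiguous n-grams per topic (stand-in for frequent
    # pattern mining on the partitioned corpus).
    counts = defaultdict(Counter)
    for z, docs in topic_docs.items():
        for doc in docs:
            for i in range(len(doc) - n + 1):
                counts[z][tuple(doc[i:i + n])] += 1
    return counts

def purity(counts, phrase, z):
    # How concentrated is the phrase in topic z versus all topics?
    total = sum(c[phrase] for c in counts.values())
    return counts[z][phrase] / total if total else 0.0

def coverage(counts, phrase, z):
    # Frequency of the phrase within its own topic.
    topic_total = sum(counts[z].values())
    return counts[z][phrase] / topic_total if topic_total else 0.0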
KERT: Sample Results
[Figure: top-ranked phrases for one topic, including learning, support vector machines, reinforcement learning, feature selection, conditional random fields, classification, decision trees, …]
CATHY
1. Create a term co-occurrence network.
2. Cluster the link weights into “topics”.
3. To find subtopics, recursively cluster the subgraphs.
[Figure: topic hierarchy on DBLP – “Data & Information system” with subtopics DB, DM, IR, AI, …]
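A minimal Python sketch of step 1, counting document-level co-occurrences as edge weights; the clustering in steps 2–3 is not shown.

from collections import Counter
from itertools import combinations

def cooccurrence_network(docs):
    # docs: list of token lists -> Counter mapping sorted term pairs
    # to document-level co-occurrence counts (the edge weights).
    edges = Counter()
    for doc in docs:
        for u, v in combinations(sorted(set(doc)), 2):
            edges[(u, v)] += 1
    return edges

docs = [["data", "mining", "pattern"], ["data", "pattern", "growth"]]
print(cooccurrence_network(docs)[("data", "pattern")])  # -> 2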
CATHY: Sample Results
Inferring Phrases and Topics
• Bigram Topic Model
• Topical N-grams
• Phrase-Discovering LDA
Bigram Topic Model
1. For each topic/previous-word pair, draw a discrete word distribution from a Dirichlet prior.
2. For each document, draw a discrete topic distribution from a Dirichlet prior. Then, for each word:
– Draw a topic from the document-topic distribution from step #2.
– Draw a word from the word distribution in step #1 for the drawn topic, conditioned on the previous word.
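A generative sketch in Python/NumPy; sizes and hyperparameters are illustrative choices. Note that phi holds one word distribution per (topic, previous word) pair, which is what makes every word depend on the word before it.

import numpy as np

rng = np.random.default_rng(0)
K, V = 5, 200
alpha, beta = 0.1, 0.01

# Step 1: one word distribution per (topic, previous word) pair.
phi = rng.dirichlet(np.full(V, beta), size=(K, V))

def generate_document(doc_len=30, start_word=0):
    # Step 2: per-document topic distribution.
    theta = rng.dirichlet(np.full(K, alpha))
    prev, words = start_word, []
    for _ in range(doc_len):
        z = rng.choice(K, p=theta)         # draw a topic
        w = rng.choice(V, p=phi[z, prev])  # word conditioned on topic AND previous word
        words.append(w)
        prev = w
    return words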
Topical N-Grams
1. Draw a discrete word distribution from a Dirichlet prior for each topic.
2. Draw a Bernoulli distribution from a Beta prior for each topic/word pair.
3. Draw a discrete word distribution from a Dirichlet prior for each topic/word pair.
4. For each document, draw a discrete document-topic distribution from a Dirichlet prior. Then, for each word in the document:
– Draw a value X from the Bernoulli distribution conditioned on the previous word and its topic.
– Draw a topic T from the document-topic distribution.
– If X = 1, draw the word from the previous word’s bigram word-topic distribution; otherwise, draw it from the unigram distribution of topic T.
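A generative sketch of one reading of these steps in Python/NumPy: a Bernoulli “switch” x, conditioned on the previous word and its topic, decides whether the token continues a bigram (x = 1) or is drawn from the plain unigram topic distribution (x = 0). Sizes and hyperparameters are illustrative, and published TNG variants differ in exactly which topic indexes the bigram distribution.

import numpy as np

rng = np.random.default_rng(0)
K, V = 5, 200
alpha, beta, gamma = 0.1, 0.01, 1.0

phi = rng.dirichlet(np.full(V, beta), size=K)         # step 1: unigram topic-word dists
psi = rng.beta(gamma, gamma, size=(K, V))             # step 2: Bernoulli switch params
sigma = rng.dirichlet(np.full(V, beta), size=(K, V))  # step 3: (topic, prev-word) bigram dists

def generate_document(doc_len=30, start_word=0, start_topic=0):
    theta = rng.dirichlet(np.full(K, alpha))          # step 4: doc-topic distribution
    prev_w, prev_z, words = start_word, start_topic, []
    for _ in range(doc_len):
        x = rng.random() < psi[prev_z, prev_w]        # bigram-status indicator X
        z = rng.choice(K, p=theta)                    # topic T for this token
        if x:  # continue a phrase: condition on the previous word
            w = rng.choice(V, p=sigma[z, prev_w])
        else:  # start fresh: draw from the unigram topic distribution
            w = rng.choice(V, p=phi[z])
        words.append(w)
        prev_w, prev_z = w, z
    return words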
TNG: Sample Results
Phrase-Discovering LDA
• Each sentence is viewed as a time series of words.
• PD-LDA posits that the generative parameter (the topic) changes periodically in accordance with changepoint indicators. As a result, topical phrases can be arbitrarily long, yet are always drawn from a single topic.
The process is as follows:
1. Each token is drawn from a distribution conditioned on its context.
2. The context consists of the phrase topic and the past m words. When m = 1, this is analogous to TNG; for larger contexts, the distributions are Pitman-Yor processes linked together in a hierarchical tree structure.
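A heavily simplified generative sketch of the changepoint idea in Python/NumPy: a per-token indicator u decides whether a new phrase, and thus a new topic, begins. Plain categorical distributions stand in for the hierarchical Pitman-Yor processes, and the constant changepoint probability p_change is purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
K, V = 5, 200
alpha, beta, p_change = 0.1, 0.01, 0.5

phi = rng.dirichlet(np.full(V, beta), size=K)  # stand-in for the Pitman-Yor word model

def generate_sentence(sent_len=15):
    theta = rng.dirichlet(np.full(K, alpha))
    z, words, phrases = None, [], []
    for i in range(sent_len):
        u = (i == 0) or (rng.random() < p_change)  # changepoint indicator
        if u:                        # new phrase: draw a fresh topic
            z = rng.choice(K, p=theta)
            phrases.append([])
        w = rng.choice(V, p=phi[z])  # the whole phrase is drawn from one topic
        words.append(w)
        phrases[-1].append(w)
    return words, phrases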
PD-LDA Results
Separating Phrases and Topics: ToPMine
• ToPMine addresses the computational and quality issues of other topical phrase extraction methods by separating phrase finding from topic modeling.
• First, the corpus is transformed from a bag of words into a bag of phrases via an efficient agglomerative grouping algorithm (sketched below).
• The resulting bag of phrases is then the input to PhraseLDA, a phrase-based topic model.
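A simplified Python sketch of the agglomerative grouping for one tokenized document: repeatedly merge the adjacent pair with the highest significance score until no pair passes a threshold. Here freq is assumed to map any space-joined word sequence to its corpus frequency (the real algorithm maintains such counts while merging), and the z-score-style significance measure may differ in detail from the paper’s.

def significance(a, b, freq, total_tokens):
    # Observed adjacent frequency vs. the frequency expected if the
    # two units occurred independently.
    f_ab = freq.get(a + " " + b, 0)
    if f_ab == 0:
        return 0.0
    expected = freq.get(a, 0) * freq.get(b, 0) / total_tokens
    return (f_ab - expected) / f_ab ** 0.5

def bag_of_phrases(tokens, freq, total_tokens, threshold=2.0):
    units = list(tokens)
    while len(units) > 1:
        scores = [significance(units[i], units[i + 1], freq, total_tokens)
                  for i in range(len(units) - 1)]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] < threshold:
            break  # no adjacent pair is significant enough to merge
        units[best:best + 2] = [units[best] + " " + units[best + 1]]
    return units  # the document as a bag of phrases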
ToPMine
[Figure: ToPMine pipeline – Step 1 mines good phrases from the corpus; Step 2 feeds the resulting bag of phrases to PhraseLDA]
Constraining LDA
• Constraining the topic assignments in LDA has been performed with positive results.
• Sentence LDA constrains all the words in a sentence to take on the same topic.
• This model can be adapted to work on mined phrases, where each phrase shares a single topic (a weaker assumption than one topic per sentence); a sketch of such a constraint follows.
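A Python sketch of how the phrase-level constraint can be enforced during sampling: a single topic is drawn per phrase, with probability proportional to the product of its words’ topic likelihoods, in the spirit of a PhraseLDA-style collapsed Gibbs update. The count matrices, word_ids mapping, and normalization here are simplified assumptions, not the paper’s exact update.

import numpy as np

def sample_phrase_topic(phrase, doc_topic, topic_word, word_ids,
                        alpha=0.1, beta=0.01, rng=None):
    # phrase: list of word strings; doc_topic: length-K topic counts for
    # this document; topic_word: K x V topic-word count matrix;
    # word_ids: word -> column index. All assumed precomputed.
    rng = rng or np.random.default_rng()
    K, V = topic_word.shape
    weights = doc_topic + alpha  # the document's preference for each topic
    for w in phrase:
        col = topic_word[:, word_ids[w]]
        # every word in the phrase must be explained by the SAME topic
        weights = weights * (col + beta) / (topic_word.sum(axis=1) + V * beta)
    return rng.choice(K, p=weights / weights.sum())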
Visualizing Topic Models
Q&A