Topical Phrases and Exploration of Text Corpora by Topics
CS 512 – Spring 2014
Professor: Jiawei Han
University of Illinois at Urbana-Champaign, Department of Computer Science
Ahmed El-Kishky
Mehrdad Biglari
Ramesh Kolavennu
Introduction
• This survey accompanies a system for text exploration based on the ToPMine framework – “Scalable Topical Phrase Mining from Large Text Corpora,” El-Kishky, Song, Wang, Voss, Han
Outline
• Unigram Topic Modeling
– Latent Dirichlet Allocation
• Post Topic Model Topical Phrase Construction
– Turbo Topics
– Keyphrase Extraction and Ranking by Topic (KERT)
– Constructing Topical Hierarchies
• Inferring Phrases and Topics
– Bigram Topic Model
– Topical N-grams
– Phrase-Discovering LDA
• Separation of Phrase and Topic
– ToPMine
• Constraining LDA
– Hidden Topic Markov Model
– Sentence LDA
• Frequent Pattern Mining and Topic Models
– Sentence LDA
• Phrase Extraction Literature
– Common Phrase Extraction Techniques
• Visualization
– Visualizing Topic Models
Latent Dirichlet Allocation
LDA is a generative topic model that posits that each document is a mixture of a small number of topics, and that each topic is a multinomial distribution over words.
1. Sample a distribution over topics from a Dirichlet prior.
2. For each word in the document:
– Randomly choose a topic from the distribution over topics from step #1.
– Randomly choose a word from the chosen topic’s distribution over the vocabulary.
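As a concrete illustration, here is a minimal sketch of this generative process in Python/NumPy. The sizes and hyperparameters (K, V, alpha, beta, doc_len) are arbitrary illustrative choices, not values prescribed by LDA.

import numpy as np

rng = np.random.default_rng(0)
K, V = 5, 1000           # number of topics, vocabulary size (illustrative)
alpha, beta = 0.1, 0.01  # Dirichlet hyperparameters (illustrative)

# Each topic is a multinomial (categorical) distribution over the vocabulary.
phi = rng.dirichlet(np.full(V, beta), size=K)

def generate_document(doc_len=50):
    # Step 1: sample this document's distribution over topics.
    theta = rng.dirichlet(np.full(K, alpha))
    words = []
    for _ in range(doc_len):
        z = rng.choice(K, p=theta)   # step 2a: choose a topic
        w = rng.choice(V, p=phi[z])  # step 2b: choose a word from that topic
        words.append(w)
    return words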
LDA Sample Topics
Post Topic Modeling Phrase Construction
• Turbo Topics
• Keyphrase Extraction and Ranking by Topic (KERT)
• Constructing Topical Hierarchies (CATHY)
Turbo Topics
Construction of topical phrases as a post-processing step to LDA:
1. Perform LDA topic modeling on the input corpus.
2. Find consecutive words that are assigned the same topic within the corpus.
3. Perform recursive permutation tests against a back-off language model to test the significance of candidate phrases.
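A simplified Python sketch of steps 2–3, assuming per-token topic assignments from LDA are already available. The real method runs recursive permutation tests against a back-off language model; this sketch only tests a single candidate bigram against random shuffles of the same tokens.

import random

def candidate_bigrams(tokens, assignments):
    # Step 2: consecutive words assigned the same topic.
    return [(tokens[i], tokens[i + 1])
            for i in range(len(tokens) - 1)
            if assignments[i] == assignments[i + 1]]

def bigram_count(tokens, pair):
    return sum(1 for i in range(len(tokens) - 1)
               if (tokens[i], tokens[i + 1]) == pair)

def permutation_pvalue(tokens, pair, trials=1000, seed=0):
    # Step 3 (simplified): how often does a random shuffle of the same
    # tokens produce the bigram at least as often as observed?
    rnd = random.Random(seed)
    observed = bigram_count(tokens, pair)
    hits = 0
    for _ in range(trials):
        shuffled = list(tokens)
        rnd.shuffle(shuffled)
        if bigram_count(shuffled, pair) >= observed:
            hits += 1
    return hits / trials  # small p-value -> significant phrase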
KERT
Construction of topical phrases as a post-processing step to LDA:
1. Perform LDA topic modeling on the input corpus.
2. Partition each document into K documents (K = #topics).
3. For each topic, mine frequent patterns on the new corpus.
4. Rank the output for each topic based on four heuristic measures:
• Purity – how often a phrase appears in a topic relative to other topics
• Completeness – threshold filtering of subphrases (e.g., “support vector” vs. “support vector machines”)
• Coverage – frequency of the phrase within its topic
• Phrase-ness – ratio of the expected occurrence of a phrase to its actual occurrence
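A Python sketch of the document partitioning (step 2) and two of the four measures. The contiguous n-gram counting below stands in for full frequent pattern mining, and the purity/coverage formulas are illustrative rather than the exact definitions from the KERT paper.

from collections import Counter, defaultdict

def partition_by_topic(doc_tokens, assignments, K):
    # Step 2: split one document into K bags of words, one per topic.
    parts = [[] for _ in range(K)]
    for word, z in zip(doc_tokens, assignments):
        parts[z].append(word)
    return parts

def phrase_counts(topic_docs, n=2):
    # Count contiguous n-grams per topic (stand-in for frequent
    # pattern mining on the partitioned corpus).
    counts = defaultdict(Counter)
    for z, docs in topic_docs.items():
        for doc in docs:
            for i in range(len(doc) - n + 1):
                counts[z][tuple(doc[i:i + n])] += 1
    return counts

def purity(counts, phrase, z):
    # How concentrated is the phrase in topic z versus all topics?
    total = sum(c[phrase] for c in counts.values())
    return counts[z][phrase] / total if total else 0.0

def coverage(counts, phrase, z):
    # Frequency of the phrase within its own topic.
    topic_total = sum(counts[z].values())
    return counts[z][phrase] / topic_total if topic_total else 0.0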
KERT: Sample Results
[Figure: top-ranked phrases for one topic, including learning, support vector machines, reinforcement learning, feature selection, conditional random fields, classification, decision trees, …]
CATHY
1. Create a term co-occurrence network.
2. Cluster the link weights into “topics”.
3. To find subtopics, recursively cluster the subgraphs.
[Figure: topic hierarchy on DBLP – “Data & Information system” with subtopics DB, DM, IR, AI, …]
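A minimal Python sketch of step 1, counting document-level co-occurrences as edge weights; the clustering in steps 2–3 is not shown.

from collections import Counter
from itertools import combinations

def cooccurrence_network(docs):
    # docs: list of token lists -> Counter mapping sorted term pairs
    # to document-level co-occurrence counts (the edge weights).
    edges = Counter()
    for doc in docs:
        for u, v in combinations(sorted(set(doc)), 2):
            edges[(u, v)] += 1
    return edges

docs = [["data", "mining", "pattern"], ["data", "pattern", "growth"]]
print(cooccurrence_network(docs)[("data", "pattern")])  # -> 2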
CATHY: Sample Results
Inferring Phrases and Topics
• Bigram Topic Model
• Topical N-grams
• Phrase-Discovering LDA
Bigram Topic Model
1. For each topic/previous-word pair, draw a discrete word distribution from a Dirichlet prior.
2. For each document, draw a discrete topic distribution from a Dirichlet prior. Then, for each word:
– Draw a topic from the document-topic distribution from step #2.
– Draw a word from the word distribution in step #1 for the drawn topic, conditioned on the previous word.
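A generative sketch in Python/NumPy; sizes and hyperparameters are illustrative choices. Note that phi holds one word distribution per (topic, previous word) pair, which is what makes every word depend on the word before it.

import numpy as np

rng = np.random.default_rng(0)
K, V = 5, 200
alpha, beta = 0.1, 0.01

# Step 1: one word distribution per (topic, previous word) pair.
phi = rng.dirichlet(np.full(V, beta), size=(K, V))

def generate_document(doc_len=30, start_word=0):
    # Step 2: per-document topic distribution.
    theta = rng.dirichlet(np.full(K, alpha))
    prev, words = start_word, []
    for _ in range(doc_len):
        z = rng.choice(K, p=theta)         # draw a topic
        w = rng.choice(V, p=phi[z, prev])  # word conditioned on topic AND previous word
        words.append(w)
        prev = w
    return words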
Topical N-Grams
1. Draw a discrete word distribution from a Dirichlet prior for each topic.
2. Draw a Bernoulli distribution from a Beta prior for each topic/word pair.
3. Draw a discrete word distribution from a Dirichlet prior for each topic/word pair.
4. For each document, draw a discrete document-topic distribution from a Dirichlet prior. Then, for each word in the document:
– Draw a value X from the Bernoulli distribution conditioned on the previous word and its topic.
– Draw a topic T from the document-topic distribution.
– If X = 1, draw the word from the previous word’s bigram word-topic distribution; otherwise, draw it from the unigram distribution of topic T.
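A generative sketch of one reading of these steps in Python/NumPy: a Bernoulli “switch” x, conditioned on the previous word and its topic, decides whether the token continues a bigram (x = 1) or is drawn from the plain unigram topic distribution (x = 0). Sizes and hyperparameters are illustrative, and published TNG variants differ in exactly which topic indexes the bigram distribution.

import numpy as np

rng = np.random.default_rng(0)
K, V = 5, 200
alpha, beta, gamma = 0.1, 0.01, 1.0

phi = rng.dirichlet(np.full(V, beta), size=K)         # step 1: unigram topic-word dists
psi = rng.beta(gamma, gamma, size=(K, V))             # step 2: Bernoulli switch params
sigma = rng.dirichlet(np.full(V, beta), size=(K, V))  # step 3: (topic, prev-word) bigram dists

def generate_document(doc_len=30, start_word=0, start_topic=0):
    theta = rng.dirichlet(np.full(K, alpha))          # step 4: doc-topic distribution
    prev_w, prev_z, words = start_word, start_topic, []
    for _ in range(doc_len):
        x = rng.random() < psi[prev_z, prev_w]        # bigram-status indicator X
        z = rng.choice(K, p=theta)                    # topic T for this token
        if x:  # continue a phrase: condition on the previous word
            w = rng.choice(V, p=sigma[z, prev_w])
        else:  # start fresh: draw from the unigram topic distribution
            w = rng.choice(V, p=phi[z])
        words.append(w)
        prev_w, prev_z = w, z
    return words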
TNG: Sample Results
Phrase-Discovering LDA
• Each sentence is viewed as a time series of words.
• PD-LDA posits that the generative parameter (the topic) changes periodically in accordance with changepoint indicators. As a result, topical phrases can be arbitrarily long, yet are always drawn from a single topic.
The process is as follows:
1. Each token is drawn from a distribution conditioned on its context.
2. The context consists of the phrase topic and the past m words. When m = 1, this is analogous to TNG; for larger contexts, the distributions are Pitman-Yor processes linked together in a hierarchical tree structure.
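A heavily simplified generative sketch of the changepoint idea in Python/NumPy: a per-token indicator u decides whether a new phrase, and thus a new topic, begins. Plain categorical distributions stand in for the hierarchical Pitman-Yor processes, and the constant changepoint probability p_change is purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
K, V = 5, 200
alpha, beta, p_change = 0.1, 0.01, 0.5

phi = rng.dirichlet(np.full(V, beta), size=K)  # stand-in for the Pitman-Yor word model

def generate_sentence(sent_len=15):
    theta = rng.dirichlet(np.full(K, alpha))
    z, words, phrases = None, [], []
    for i in range(sent_len):
        u = (i == 0) or (rng.random() < p_change)  # changepoint indicator
        if u:                        # new phrase: draw a fresh topic
            z = rng.choice(K, p=theta)
            phrases.append([])
        w = rng.choice(V, p=phi[z])  # the whole phrase is drawn from one topic
        words.append(w)
        phrases[-1].append(w)
    return words, phrases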
PD-LDA Results
Separating Phrases and Topics: ToPMine
• ToPMine addresses the computational and quality issues of other topical phrase extraction methods by separating phrase finding from topic modeling.
• First, the corpus is transformed from a bag of words into a bag of phrases via an efficient agglomerative grouping algorithm (sketched below).
• The resulting bag of phrases is then the input to PhraseLDA, a phrase-based topic model.
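A simplified Python sketch of the agglomerative grouping for one tokenized document: repeatedly merge the adjacent pair with the highest significance score until no pair passes a threshold. Here freq is assumed to map any space-joined word sequence to its corpus frequency (the real algorithm maintains such counts while merging), and the z-score-style significance measure may differ in detail from the paper’s.

def significance(a, b, freq, total_tokens):
    # Observed adjacent frequency vs. the frequency expected if the
    # two units occurred independently.
    f_ab = freq.get(a + " " + b, 0)
    if f_ab == 0:
        return 0.0
    expected = freq.get(a, 0) * freq.get(b, 0) / total_tokens
    return (f_ab - expected) / f_ab ** 0.5

def bag_of_phrases(tokens, freq, total_tokens, threshold=2.0):
    units = list(tokens)
    while len(units) > 1:
        scores = [significance(units[i], units[i + 1], freq, total_tokens)
                  for i in range(len(units) - 1)]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] < threshold:
            break  # no adjacent pair is significant enough to merge
        units[best:best + 2] = [units[best] + " " + units[best + 1]]
    return units  # the document as a bag of phrases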
ToPMine
[Figure: ToPMine pipeline – Step 1 mines good phrases from the corpus; Step 2 feeds the resulting bag of phrases to PhraseLDA]
Constraining LDA
• Constraining the topic assignments in LDA has been performed with positive results.
• Sentence LDA constrains all the words in a sentence to take on the same topic.
• This model can be adapted to work on mined phrases, where each phrase shares a single topic (a weaker assumption than one topic per sentence); a sketch of such a constraint follows.
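A Python sketch of how the phrase-level constraint can be enforced during sampling: a single topic is drawn per phrase, with probability proportional to the product of its words’ topic likelihoods, in the spirit of a PhraseLDA-style collapsed Gibbs update. The count matrices, word_ids mapping, and normalization here are simplified assumptions, not the paper’s exact update.

import numpy as np

def sample_phrase_topic(phrase, doc_topic, topic_word, word_ids,
                        alpha=0.1, beta=0.01, rng=None):
    # phrase: list of word strings; doc_topic: length-K topic counts for
    # this document; topic_word: K x V topic-word count matrix;
    # word_ids: word -> column index. All assumed precomputed.
    rng = rng or np.random.default_rng()
    K, V = topic_word.shape
    weights = doc_topic + alpha  # the document's preference for each topic
    for w in phrase:
        col = topic_word[:, word_ids[w]]
        # every word in the phrase must be explained by the SAME topic
        weights = weights * (col + beta) / (topic_word.sum(axis=1) + V * beta)
    return rng.choice(K, p=weights / weights.sum())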
Visualizing Topic Models
Q&A