Probabilistic Topic Models for Text Mining


Page 1: Probabilistic Topic Models for  Text Mining

2008 © ChengXiang Zhai, China-US-France Summer School, Lotus Hill Inst., 2008

Probabilistic Topic Models for

Text MiningChengXiang Zhai (翟成祥 )

Department of Computer Science

Graduate School of Library & Information Science

Institute for Genomic Biology, Statistics

University of Illinois, Urbana-Champaign

http://www-faculty.cs.uiuc.edu/~czhai, [email protected]

Page 2: Probabilistic Topic Models for  Text Mining


What Is Text Mining?

“The objective of Text Mining is to exploit information contained in textual documents in various ways, including …discovery of patterns and trends in data, associations among entities, predictive rules, etc.” (Grobelnik et al., 2001)

“Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known.” (Hearst, 1999)

(Slide from Rebecca Hwa’s “Intro to Text Mining”)

Page 3: Probabilistic Topic Models for  Text Mining


Two Different Views of Text Mining

• Data Mining View: Explore patterns in textual data

– Find latent topics

– Find topical trends

– Find outliers and other hidden patterns

• Natural Language Processing View: Make inferences based on a partial understanding of natural language text

– Information extraction

– Question answering

Shallow mining

Deep mining

Page 4: Probabilistic Topic Models for  Text Mining


Applications of Text Mining

• Direct applications: Go beyond search to find knowledge

– Question-driven (Bioinformatics, Business Intelligence, etc): We have specific questions; how can we exploit data mining to answer the questions?

– Data-driven (WWW, literature, email, customer reviews, etc): We have a lot of data; what can we do with it?

• Indirect applications

– Assist information access (e.g., discover latent topics to better summarize search results)

– Assist information organization (e.g., discover hidden structures)

Page 5: Probabilistic Topic Models for  Text Mining


Text Mining Methods

• Data Mining Style: View text as high-dimensional data

– Frequent pattern finding

– Association analysis

– Outlier detection

• Information Retrieval Style: Fine-granularity topical analysis

– Topic extraction

– Exploit term weighting and text similarity measures

• Natural Language Processing Style: Information extraction

– Entity extraction

– Relation extraction

– Sentiment analysis

– Question answering

• Machine Learning Style: Unsupervised or semi-supervised learning

– Mixture models

– Dimension reduction (the topic of this lecture)

Page 6: Probabilistic Topic Models for  Text Mining


Outline

• The Basic Topic Models:

– Probabilistic Latent Semantic Analysis (PLSA) [Hofmann 99]

– Latent Dirichlet Allocation (LDA) [Blei et al. 02]

• Extensions:

– Contextual Probabilistic Latent Semantic Analysis (CPLSA) [Mei & Zhai 06]

Page 7: Probabilistic Topic Models for  Text Mining


Basic Topic Model: PLSA

Page 8: Probabilistic Topic Models for  Text Mining


PLSA: Motivation

What did people say in their blog articles about "Hurricane Katrina"?

Query = "Hurricane Katrina"

Results (topics discovered from the matching blog articles, with top words and probabilities):

Government Response: bush 0.071, president 0.061, federal 0.051, government 0.047, fema 0.047, administrate 0.023, response 0.020, brown 0.019, blame 0.017, governor 0.014

New Orleans: city 0.063, orleans 0.054, new 0.034, louisiana 0.023, flood 0.022, evacuate 0.021, storm 0.017, resident 0.016, center 0.016, rescue 0.012

Oil Price: price 0.077, oil 0.064, gas 0.045, increase 0.020, product 0.020, fuel 0.018, company 0.018, energy 0.017, market 0.016, gasoline 0.012

Praying and Blessing: god 0.141, pray 0.047, prayer 0.041, love 0.030, life 0.025, bless 0.025, lord 0.017, jesus 0.016, will 0.013, faith 0.012

Aid and Donation: donate 0.120, relief 0.076, red 0.070, cross 0.065, help 0.050, victim 0.036, organize 0.022, effort 0.020, fund 0.019, volunteer 0.019

Personal: i 0.405, my 0.116, me 0.060, am 0.029, think 0.015, feel 0.012, know 0.011, something 0.007, guess 0.007, myself 0.006

Page 9: Probabilistic Topic Models for  Text Mining


Probabilistic Latent Semantic Analysis/Indexing (PLSA/PLSI) [Hofmann 99]

• Mix k multinomial distributions to generate a document

• Each document has a potentially different set of mixing weights which captures the topic coverage

• When generating words in a document, each word may be generated using a DIFFERENT multinomial distribution (this is in contrast with the document clustering model where, once a multinomial distribution is chosen, all the words in a document would be generated using the same model)

• We may add a background distribution to “attract” background words

Page 10: Probabilistic Topic Models for  Text Mining


PLSA as a Mixture Model

Example topics: {warning 0.3, system 0.2, …}, {aid 0.1, donation 0.05, support 0.02, …}, {statistics 0.2, loss 0.1, dead 0.05, …}; Background θB: {is 0.05, the 0.04, a 0.03, …}

"Generating" word w in doc d in the collection: with probability λB the word is drawn from the background θB; with probability 1 − λB, a topic j is first chosen with document-specific weight πd,j and the word is then drawn from θj.

Parameters: λB = noise level (manually set); the θ's and π's are estimated with Maximum Likelihood.

p_d(w) = \lambda_B\, p(w \mid \theta_B) + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j)

\log p(d) = \sum_{w \in V} c(w, d)\, \log \left[ \lambda_B\, p(w \mid \theta_B) + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j) \right]
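A small worked example with made-up numbers (not from the slides): suppose λB = 0.5, p(w | θB) = 0.04, and k = 2 with πd,1 = 0.7, p(w | θ1) = 0.1, πd,2 = 0.3, p(w | θ2) = 0.02. Then

p_d(w) = 0.5 \times 0.04 + 0.5 \times (0.7 \times 0.1 + 0.3 \times 0.02) = 0.02 + 0.5 \times 0.076 = 0.058,

and this word contributes c(w, d) \log 0.058 to the document's log-likelihood.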

Page 11: Probabilistic Topic Models for  Text Mining


How to Estimate θj: the EM Algorithm

The background model is known, p(w | θB): the 0.2, a 0.1, we 0.01, to 0.02, …

The topic models are unknown, p(w | θ1) = ? ("Text mining"): text = ?, mining = ?, association = ?, word = ?, …

p(w | θ2) = ? ("information retrieval"): information = ?, retrieval = ?, query = ?, document = ?, …

They are fit to the observed doc(s) with the ML estimator. Suppose we knew the identity (generating model) of each word: estimation would be trivial. Since we don't, EM treats the word identities as hidden variables.

Page 12: Probabilistic Topic Models for  Text Mining


How the Algorithm Works

(Illustration: a toy collection with two documents d1 and d2 and three words "aid", "price", "oil"; two topics θ1 and θ2; document-topic weights πd1,1 = P(θ1|d1), πd1,2 = P(θ2|d1), πd2,1 = P(θ1|d2), πd2,2 = P(θ2|d2).)

Initialize πd,j and P(w | θj) with random values.

Iteration 1, E-step: split the word counts among the topics (by computing the hidden variables z), i.e., distribute each count c(w,d) into a background part c(w,d) p(zd,w = B) and topic parts c(w,d) (1 − p(zd,w = B)) p(zd,w = j).

Iteration 1, M-step: re-estimate πd,j and P(w | θj) by pooling and normalizing the split word counts.

Iteration 2, E-step: split the word counts among the topics (by computing the z's).

Iteration 2, M-step: re-estimate πd,j and P(w | θj) by pooling and normalizing the split word counts.

Iterations 3, 4, 5, … until convergence.

Page 13: Probabilistic Topic Models for  Text Mining


Parameter Estimation (EM)

E-step: the probability that word w in doc d was generated from the background vs. from cluster (topic) j, by application of Bayes' rule to the current parameter estimates:

p(z_{d,w} = B) = \frac{\lambda_B\, p(w \mid \theta_B)}{\lambda_B\, p(w \mid \theta_B) + (1 - \lambda_B) \sum_{j'=1}^{k} \pi_{d,j'}^{(n)}\, p^{(n)}(w \mid \theta_{j'})}

p(z_{d,w} = j) = \frac{\pi_{d,j}^{(n)}\, p^{(n)}(w \mid \theta_j)}{\sum_{j'=1}^{k} \pi_{d,j'}^{(n)}\, p^{(n)}(w \mid \theta_{j'})}

M-step: re-estimate the mixing weights and the cluster language models from the fractional counts (the counts contributing to using cluster j in generating d, and to generating w from cluster j):

\pi_{d,j}^{(n+1)} = \frac{\sum_{w \in V} c(w, d)\,(1 - p(z_{d,w} = B))\, p(z_{d,w} = j)}{\sum_{j'=1}^{k} \sum_{w \in V} c(w, d)\,(1 - p(z_{d,w} = B))\, p(z_{d,w} = j')}

p^{(n+1)}(w \mid \theta_j) = \frac{\sum_{i=1}^{m} \sum_{d \in C_i} c(w, d)\,(1 - p(z_{d,w} = B))\, p(z_{d,w} = j)}{\sum_{w' \in V} \sum_{i=1}^{m} \sum_{d \in C_i} c(w', d)\,(1 - p(z_{d,w'} = B))\, p(z_{d,w'} = j)}

(The sums over d run over all docs in the collection(s); m = 1 if there is only one collection.)
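To make the E-step/M-step updates above concrete, here is a minimal Python/numpy sketch of the PLSA-with-background EM loop. It is illustrative code, not the lecture's implementation; the names (plsa_em, lambda_B, pi, p_wt) are my own, and the background model is simply set to the collection word frequencies.

    # Minimal sketch of PLSA with a background model, fit by EM (assumptions noted above).
    import numpy as np

    def plsa_em(counts, k, lambda_B, n_iter=50, seed=0):
        """counts: (n_docs, n_words) term-frequency matrix c(w, d)."""
        rng = np.random.default_rng(seed)
        n_docs, n_words = counts.shape
        p_bg = counts.sum(axis=0) / counts.sum()            # background p(w | theta_B)
        pi = rng.random((n_docs, k)); pi /= pi.sum(axis=1, keepdims=True)
        p_wt = rng.random((k, n_words)); p_wt /= p_wt.sum(axis=1, keepdims=True)
        for _ in range(n_iter):
            # E-step: p(z_{d,w} = j) and p(z_{d,w} = B)
            mix = np.einsum('dj,jw->djw', pi, p_wt)          # pi_{d,j} * p(w | theta_j)
            p_z_topic = mix / (mix.sum(axis=1, keepdims=True) + 1e-100)
            topic_part = (1 - lambda_B) * mix.sum(axis=1)    # shape (n_docs, n_words)
            p_z_bg = lambda_B * p_bg / (lambda_B * p_bg + topic_part + 1e-100)
            # M-step: pool and normalize the fractional counts
            frac = counts[:, None, :] * (1 - p_z_bg)[:, None, :] * p_z_topic
            pi = frac.sum(axis=2)
            pi /= pi.sum(axis=1, keepdims=True) + 1e-100
            p_wt = frac.sum(axis=0)
            p_wt /= p_wt.sum(axis=1, keepdims=True) + 1e-100
        return pi, p_wt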

Page 14: Probabilistic Topic Models for  Text Mining


PLSA with Prior Knowledge

• There are different ways of choosing aspects (topics)

– Google = Google News + Google Map + Google scholar, …

– Google = Google US + Google France + Google China, …

• Users have some domain knowledge in mind, e.g.,

– We expect to see “retrieval models” as a topic in IR.

– We want to show the aspects of “history” and “statistics” for Youtube

• A flexible way is to incorporate such knowledge as priors of the PLSA model

• In Bayesian terms, the prior encodes your "belief" about the topic distributions

Page 15: Probabilistic Topic Models for  Text Mining


Adding Prior

(Same mixture model as on Page 10: background θB plus topics θ1 … θk with document-specific weights πd,j, "generating" word w in doc d in the collection.)

Parameters: λB = noise level (manually set); the θ's and π's are estimated with Maximum Likelihood.

With a prior p(θ), we instead pick the most likely parameters a posteriori:

\theta^* = \arg\max_{\theta}\; p(\theta)\, p(Data \mid \theta)

Page 16: Probabilistic Topic Models for  Text Mining


Adding Prior as Pseudo Counts

The known background model p(w | θB): the 0.2, a 0.1, we 0.01, to 0.02, …

Unknown topic model p(w | θ1) = ? ("Text mining"): text = ?, mining = ?, association = ?, word = ?, …

Unknown topic model p(w | θ2) = ? ("information retrieval"): information = ?, retrieval = ?, query = ?, document = ?, …

The prior is encoded as a pseudo document of size μ containing the expected words (e.g., "text", "mining"), and the topic models are fit to the observed doc(s) plus this pseudo doc with the MAP estimator.

Page 17: Probabilistic Topic Models for  Text Mining


Maximum A Posteriori (MAP) Estimation

With the prior expressed as pseudo counts, the only change to the EM algorithm is in the M-step for p(w|θj): the prior θ'j contributes μ p(w|θ'j) pseudo counts of w, with μ the total number of pseudo counts:

p^{(n+1)}(w \mid \theta_j) = \frac{\sum_{i=1}^{m} \sum_{d \in C_i} c(w, d)\,(1 - p(z_{d,w} = B))\, p(z_{d,w} = j) \; + \; \mu\, p(w \mid \theta'_j)}{\sum_{w' \in V} \sum_{i=1}^{m} \sum_{d \in C_i} c(w', d)\,(1 - p(z_{d,w'} = B))\, p(z_{d,w'} = j) \; + \; \mu}

(The E-step and the update for πd,j are unchanged.)

What if μ = 0? What if μ = +∞?
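Continuing the sketch given earlier, the MAP version changes only the topic-word M-step: the prior's pseudo counts are added before normalizing. Again a hedged illustration with made-up names (m_step_with_prior), not the authors' code.

    import numpy as np

    def m_step_with_prior(frac, prior, mu):
        """frac: (n_docs, k, n_words) fractional counts from the E-step;
        prior: (k, n_words) prior word distributions p(w | theta'_j);
        mu: total pseudo counts contributed by the prior per topic."""
        pooled = frac.sum(axis=0) + mu * prior               # add pseudo counts
        return pooled / pooled.sum(axis=1, keepdims=True)

    # mu = 0 recovers plain ML estimation; as mu grows large, p(w | theta_j) approaches the prior.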

Page 18: Probabilistic Topic Models for  Text Mining


Basic Topic Model: LDA

The following slides about LDA are taken from Michael C. Mozer's course lecture: http://www.cs.colorado.edu/~mozer/courses/ProbabilisticModels/

Page 19: Probabilistic Topic Models for  Text Mining


LDA: Motivation

– In pLSI, "documents have no generative probabilistic semantics"

• i.e., a document is just a symbol

– The model has many parameters

• linear in the number of documents

• heuristic methods are needed to prevent overfitting

– It cannot generalize to new documents

Page 20: Probabilistic Topic Models for  Text Mining


Unigram Model

p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)

Page 21: Probabilistic Topic Models for  Text Mining


Mixture of Unigrams

p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)

Page 22: Probabilistic Topic Models for  Text Mining


Topic Model / Probabilistic LSI

p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)

•d is a localist representation of (trained) documents

•LDA provides a distributed representation

Page 23: Probabilistic Topic Models for  Text Mining


LDA

• Vocabulary of |V| words

• A document is a collection of words from the vocabulary, with N words per document: w = (w1, ..., wN)

• Latent topics: random variable z, with values 1, ..., k

• Like the topic model (pLSI), a document is generated by sampling a topic from a mixture and then sampling a word from that topic.

• But the topic model assumes a fixed mixture of topics (multinomial distribution) for each document.

• LDA instead assumes a random mixture of topics (drawn from a Dirichlet distribution) for each document.

Page 24: Probabilistic Topic Models for  Text Mining


Generative Model

• "Plates" indicate looping structure

• The outer plate is replicated for each document

• The inner plate is replicated for each word

• The same conditional distributions apply to each replicate

• Document probability:

p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta
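The plate diagram corresponds to a simple sampling procedure. Below is a rough sketch (my own, with assumed names such as generate_corpus) of the LDA generative process: draw per-document topic proportions from the Dirichlet, then a topic and a word for each token.

    import numpy as np

    def generate_corpus(alpha, beta, n_docs, doc_len, seed=0):
        """alpha: (k,) Dirichlet parameter; beta: (k, V) topic-word distributions."""
        rng = np.random.default_rng(seed)
        corpus = []
        for _ in range(n_docs):
            theta = rng.dirichlet(alpha)                       # per-document topic mixture
            z = rng.choice(len(alpha), size=doc_len, p=theta)  # topic of each word position
            words = [rng.choice(beta.shape[1], p=beta[t]) for t in z]
            corpus.append(words)
        return corpus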

Page 25: Probabilistic Topic Models for  Text Mining


Fancier Version

The Dirichlet prior on the topic proportions θ:

p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)}\; \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}

Page 26: Probabilistic Topic Models for  Text Mining


Inference

The joint distribution of topic proportions θ, topic assignments z, and words w:

p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)

Marginalizing out θ and z gives the document probability:

p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta

Inference requires the posterior:

p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}

Page 27: Probabilistic Topic Models for  Text Mining


Inference

• In general, this formula is intractable:

p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta

• Expanded version:

p(\mathbf{w} \mid \alpha, \beta) = \frac{\Gamma\left(\sum_i \alpha_i\right)}{\prod_i \Gamma(\alpha_i)} \int \left( \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \right) \left( \prod_{n=1}^{N} \prod_{i=1}^{k} \prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j} \right) d\theta

where w_n^j = 1 if w_n is the j'th vocabulary word (and 0 otherwise).

Page 28: Probabilistic Topic Models for  Text Mining


Variational Approximation

• Compute the log likelihood and apply Jensen's inequality, log(E[x]) ≥ E[log(x)], to obtain a lower bound

• Find a variational distribution q such that the bound is computable

– q is parameterized by γ and φn

– Maximize the bound with respect to γ and φn to obtain the best approximation to p(w | α, β)

– This leads to a variational EM algorithm

• Sampling algorithms (e.g., Gibbs sampling) are also common; a collapsed Gibbs sampler is sketched below
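For completeness, here is a compact sketch of the collapsed Gibbs sampling alternative mentioned in the last bullet. It is illustrative code of my own (names such as lda_gibbs, ndk, nkw are assumptions), using the standard full conditional for symmetric priors, not the course's implementation.

    import numpy as np

    def lda_gibbs(docs, K, V, alpha=0.1, eta=0.01, n_iter=200, seed=0):
        """docs: list of lists of word ids in [0, V); K topics; symmetric priors alpha, eta."""
        rng = np.random.default_rng(seed)
        ndk = np.zeros((len(docs), K)); nkw = np.zeros((K, V)); nk = np.zeros(K)
        z = [rng.integers(K, size=len(doc)) for doc in docs]   # random initial assignments
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                ndk[d, z[d][n]] += 1; nkw[z[d][n], w] += 1; nk[z[d][n]] += 1
        for _ in range(n_iter):
            for d, doc in enumerate(docs):
                for n, w in enumerate(doc):
                    k = z[d][n]
                    ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1     # remove current assignment
                    # full conditional: p(z = k | rest) is proportional to
                    # (ndk + alpha) * (nkw + eta) / (nk + V * eta)
                    p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + V * eta)
                    k = rng.choice(K, p=p / p.sum())
                    z[d][n] = k
                    ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
        theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)   # doc-topic
        beta = (nkw + eta) / (nkw + eta).sum(axis=1, keepdims=True)        # topic-word
        return theta, beta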

Page 29: Probabilistic Topic Models for  Text Mining


Data Sets

• C. Elegans Community abstracts: 5,225 abstracts, 28,414 unique terms

• TREC AP corpus (subset): 16,333 newswire articles, 23,075 unique terms

• Held-out data: 10%

• Removed terms: 50 stop words, and words appearing once

Page 30: Probabilistic Topic Models for  Text Mining


C. Elegans

Note: a "fold-in" hack is used for pLSI to allow it to handle novel documents. It involves refitting the p(z|d_new) parameters, which is sort of a cheat.

Page 31: Probabilistic Topic Models for  Text Mining


AP

Page 32: Probabilistic Topic Models for  Text Mining


Summary: PLSA vs. LDA

• LDA adds a Dirichlet distribution on top of PLSA to regularize the model

• Estimation of LDA is more complicated than PLSA

• LDA is a generative model, while PLSA isn’t

• PLSA is more likely to over-fit the data than LDA

• Which one to use?

– If you need generalization capacity, LDA

– If you want to mine topics from a collection, PLSA may be better (we want overfitting!)

Page 33: Probabilistic Topic Models for  Text Mining


Extension of PLSA: Contextual Probabilistic Latent Semantic Analysis (CPLSA)

Page 34: Probabilistic Topic Models for  Text Mining


A General Introduction to EM

Data: X (observed) + H (hidden);  Parameter: θ

"Incomplete" likelihood: L(θ) = log p(X | θ)
"Complete" likelihood: Lc(θ) = log p(X, H | θ)

EM tries to iteratively maximize the incomplete likelihood:

Starting with an initial guess θ(0),

1. E-step: compute the expectation of the complete likelihood

Q(\theta; \theta^{(n-1)}) = E\left[ L_c(\theta) \mid X, \theta^{(n-1)} \right] = \sum_{h_i} p(H = h_i \mid X, \theta^{(n-1)})\, \log p(X, h_i \mid \theta)

2. M-step: compute θ(n) by maximizing the Q-function

\theta^{(n)} = \arg\max_{\theta} Q(\theta; \theta^{(n-1)}) = \arg\max_{\theta} \sum_{h_i} p(H = h_i \mid X, \theta^{(n-1)})\, \log p(X, h_i \mid \theta)
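As a tiny end-to-end illustration of this E-step/M-step loop (my own example, not from the slides), consider a mixture of two biased coins: the hidden variable H is which coin produced each sequence of tosses, and θ holds the two coin biases (the mixing weight over the coins is fixed at 0.5 for simplicity).

    import numpy as np

    def coin_mixture_em(heads, tosses, n_iter=20):
        """heads[i], tosses[i]: number of heads / total tosses in the i-th sequence."""
        theta = np.array([0.6, 0.5])                           # initial guess theta^(0)
        for _ in range(n_iter):
            # E-step: posterior p(H = coin | data, theta^(n-1)) for each sequence
            ll = np.array([h * np.log(theta) + (t - h) * np.log(1 - theta)
                           for h, t in zip(heads, tosses)])
            post = np.exp(ll - ll.max(axis=1, keepdims=True))
            post /= post.sum(axis=1, keepdims=True)
            # M-step: maximize Q by re-estimating each coin's bias from expected counts
            theta = (post.T @ heads) / (post.T @ tosses)
        return theta

    # Example: the two estimates move toward the biases underlying the data.
    print(coin_mixture_em(np.array([9, 8, 2, 1, 7]), np.array([10, 10, 10, 10, 10])))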

Page 35: Probabilistic Topic Models for  Text Mining


Convergence Guarantee

Goal: maximize the "incomplete" likelihood L(θ) = log p(X | θ),

i.e., choose θ(n) so that L(θ(n)) − L(θ(n-1)) ≥ 0.

Note that, since p(X, H | θ) = p(H | X, θ) p(X | θ), we have L(θ) = Lc(θ) − log p(H | X, θ)  (the left side doesn't contain H), so

L(θ(n)) − L(θ(n-1)) = Lc(θ(n)) − Lc(θ(n-1)) + log [ p(H | X, θ(n-1)) / p(H | X, θ(n)) ]

Taking the expectation w.r.t. p(H | X, θ(n-1)),

L(θ(n)) − L(θ(n-1)) = Q(θ(n); θ(n-1)) − Q(θ(n-1); θ(n-1)) + D( p(H | X, θ(n-1)) || p(H | X, θ(n)) )

The last term is a KL-divergence, which is always non-negative, and EM chooses θ(n) to maximize Q.

Therefore, L(θ(n)) ≥ L(θ(n-1))!

Page 36: Probabilistic Topic Models for  Text Mining


Another way of looking at EM

(Figure: the likelihood p(X | θ) as a curve over θ, the current guess θ(n-1), the lower bound (Q function) touching the curve at the current guess, and the next guess at the maximum of the bound.)

E-step = computing the lower bound; M-step = maximizing the lower bound

L(\theta) = L(\theta^{(n-1)}) + Q(\theta; \theta^{(n-1)}) - Q(\theta^{(n-1)}; \theta^{(n-1)}) + D\left( p(H \mid X, \theta^{(n-1)}) \,\|\, p(H \mid X, \theta) \right)

L(\theta) \ge L(\theta^{(n-1)}) + Q(\theta; \theta^{(n-1)}) - Q(\theta^{(n-1)}; \theta^{(n-1)})

Page 37: Probabilistic Topic Models for  Text Mining


Why Contextual PLSA?

Page 38: Probabilistic Topic Models for  Text Mining


Motivating Example:Comparing Product Reviews

Common Themes    "IBM" specific       "APPLE" specific      "DELL" specific
Battery Life     Long, 4-3 hrs        Medium, 3-2 hrs       Short, 2-1 hrs
Hard disk        Large, 80-100 GB     Small, 5-10 GB        Medium, 20-50 GB
Speed            Slow, 100-200 MHz    Very Fast, 3-4 GHz    Moderate, 1-2 GHz

(Input: IBM laptop reviews, APPLE laptop reviews, DELL laptop reviews)

Unsupervised discovery of common topics and their variations

Page 39: Probabilistic Topic Models for  Text Mining


Motivating Example:Comparing News about Similar Topics

Common Themes      "Vietnam" specific   "Afghan" specific   "Iraq" specific
United nations     …                    …                   …
Death of people    …                    …                   …
…                  …                    …                   …

(Input: Vietnam War news, Afghan War news, Iraq War news)

Unsupervised discovery of common topics and their variations

Page 40: Probabilistic Topic Models for  Text Mining


Motivating Example:Discovering Topical Trends in Literature

Unsupervised discovery of topics and their temporal variations

(Figure: theme strength over time, 1980-2003, for themes such as "TF-IDF Retrieval", "IR Applications", "Language Model", and "Text Categorization".)

Page 41: Probabilistic Topic Models for  Text Mining


Motivating Example:Analyzing Spatial Topic Patterns

• How do blog writers in different states respond to topics such as "oil price increase during Hurricane Katrina"?

• Unsupervised discovery of topics and their variations in different locations

Page 42: Probabilistic Topic Models for  Text Mining


Motivating Example: Sentiment Summary

Unsupervised/Semi-supervised discovery of topics and different sentiments of the topics

Page 43: Probabilistic Topic Models for  Text Mining


Research Questions

• Can we model all these problems generally?

• Can we solve these problems with a unified approach?

• How can we bring human into the loop?

Page 44: Probabilistic Topic Models for  Text Mining


Contextual Text Mining

• Given collections of text with contextual information (meta-data)

• Discover themes/subtopics/topics (interesting word clusters)

• Compute variations of themes over contexts

• Applications:

– Summarizing search results

– Federation of text information

– Opinion analysis

– Social network analysis

– Business intelligence

– …

Page 45: Probabilistic Topic Models for  Text Mining


Context Features of Text (Meta-data)

(Example: a weblog article and its context features)

• Author

• Author's occupation

• Location

• Time

• Communities

• Source

Page 46: Probabilistic Topic Models for  Text Mining


Context = Partitioning of Text

(Figure: a document collection partitioned by context, e.g., by year (1998, 1999, …, 2005, 2006) and by venue (WWW, SIGIR, ACL, KDD, SIGMOD).)

Example partitions: papers written in 1998; papers written by authors in the US; papers about the Web.

Page 47: Probabilistic Topic Models for  Text Mining


Themes/Topics

• Uses of themes:

– Summarize topics/subtopics

– Navigate in a document space

– Retrieve documents

– Segment documents

– …

Theme examples: {government 0.3, response 0.2, …}, {donate 0.1, relief 0.05, help 0.02, …}, {city 0.2, new 0.1, orleans 0.05, …}; Background θB: {is 0.05, the 0.04, a 0.03, …}

Example of theme-annotated text: [Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response] to the [flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated] … [Over seventy countries pledged monetary donations or other assistance]. …

Page 48: Probabilistic Topic Models for  Text Mining


View of Themes: Context-Specific Version of Views

(Figure: context-specific views of two themes, Theme 1 "Retrieval Model" and Theme 2 "Feedback", under two contexts, "Before 1998" (traditional models) and "After 1998" (language models). Representative words include: vector, space, TF-IDF, Okapi, LSI, Rocchio, weighting, feedback, term, retrieval, relevance, document, query, language, model, smoothing, generation, mixture, estimate, EM, pseudo, judge, expansion.)

Page 49: Probabilistic Topic Models for  Text Mining


Coverage of Themes: Distribution over Themes

• Theme coverage can depend on context

(Figure: theme coverage distributions over "Background", "Oil Price", "Government Response", and "Aid and Donation" for two contexts, Texas and Louisiana.)

Example text: Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …

Page 50: Probabilistic Topic Models for  Text Mining


General Tasks of Contextual Text Mining

• Theme Extraction: Extract the global salient themes

– Common information shared over all contexts

• View Comparison: Compare a theme from different views

– Analyze the content variation of themes over contexts

• Coverage Comparison: Compare the theme coverage of different contexts

– Reveal how closely a theme is associated to a context

• Others:

– Causal analysis

– Correlation analysis

Page 51: Probabilistic Topic Models for  Text Mining


A General Solution: CPLSA

• CPLSA = Contextual Probabilistic Latent Semantic Analysis

• An extension of PLSA model ([Hofmann 99]) by

– Introducing context variables

– Modeling views of topics

– Modeling coverage variations of topics

• Process of contextual text mining

– Instantiation of CPLSA (context, views, coverage)

– Fit the model to text data (EM algorithm)

– Compute probabilistic topic patterns

Page 52: Probabilistic Topic Models for  Text Mining


"Generation" Process of CPLSA

Document context: Time = July 2005, Location = Texas, Author = xxx, Occupation = Sociologist, Age Group = 45+, …

Themes: government (government 0.3, response 0.2, …), donation (donate 0.1, relief 0.05, help 0.02, …), New Orleans (city 0.2, new 0.1, orleans 0.05, …)

Views: View 1, View 2, View 3 (e.g., Texas, July 2005, sociologist), each providing its own version of the themes

Theme coverages: one coverage distribution per context (e.g., Texas, July 2005, the document itself), …

To generate each word: choose a view, choose a coverage, choose a theme, then draw a word from the chosen theme θi (e.g., government, response, donate, aid, help, new, Orleans, …)

Example text: Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …

Page 53: Probabilistic Topic Models for  Text Mining


Probabilistic Model

• To generate a document D with context feature set C:

– Choose a view vi according to the view distribution p(vi | D, C)

– Choose a coverage κj according to the coverage distribution p(κj | D, C)

– Choose a theme θl according to the coverage κj, i.e., with probability p(θl | κj)

– Generate a word using the view-specific theme θl,i, i.e., p(w | θl,i)

– The likelihood of the document collection is:

\log p(\mathcal{C}) = \sum_{(D, \mathbf{C}) \in \mathcal{C}} \sum_{w \in V} c(w, D)\, \log \left( \sum_{i=1}^{n} \sum_{j=1}^{m} \sum_{l=1}^{k} p(v_i \mid D, \mathbf{C})\; p(\kappa_j \mid D, \mathbf{C})\; p(\theta_l \mid \kappa_j)\; p(w \mid \theta_{l,i}) \right)

Page 54: Probabilistic Topic Models for  Text Mining


Parameter Estimation: EM Algorithm

• Interesting patterns:

– Theme content variation for each view: p(w | θl,i)

– Theme strength variation for each context: p(θl | κj)

• A prior from a user can be incorporated using MAP estimation

E-step (the hidden variable zD,w indicates which view i, coverage j, and theme l generated word w in document D):

p(z_{D,w} = (i, j, l)) = \frac{p^{(t)}(v_i \mid D, \mathbf{C})\; p^{(t)}(\kappa_j \mid D, \mathbf{C})\; p^{(t)}(\theta_l \mid \kappa_j)\; p^{(t)}(w \mid \theta_{l,i})}{\sum_{i'=1}^{n} \sum_{j'=1}^{m} \sum_{l'=1}^{k} p^{(t)}(v_{i'} \mid D, \mathbf{C})\; p^{(t)}(\kappa_{j'} \mid D, \mathbf{C})\; p^{(t)}(\theta_{l'} \mid \kappa_{j'})\; p^{(t)}(w \mid \theta_{l',i'})}

M-step (each distribution is re-estimated from the pooled fractional counts and then normalized):

p^{(t+1)}(v_i \mid D, \mathbf{C}) \propto \sum_{w \in V} \sum_{j=1}^{m} \sum_{l=1}^{k} c(w, D)\, p(z_{D,w} = (i, j, l))

p^{(t+1)}(\kappa_j \mid D, \mathbf{C}) \propto \sum_{w \in V} \sum_{i=1}^{n} \sum_{l=1}^{k} c(w, D)\, p(z_{D,w} = (i, j, l))

p^{(t+1)}(\theta_l \mid \kappa_j) \propto \sum_{(D, \mathbf{C})} \sum_{w \in V} \sum_{i=1}^{n} c(w, D)\, p(z_{D,w} = (i, j, l))

p^{(t+1)}(w \mid \theta_{l,i}) \propto \sum_{(D, \mathbf{C})} \sum_{j=1}^{m} c(w, D)\, p(z_{D,w} = (i, j, l))

Page 55: Probabilistic Topic Models for  Text Mining


Regularization of the Model

• Why?

– Generality brings high complexity (inefficiency, multiple local maxima)

– Real applications have domain constraints/knowledge

• Two useful simplifications:

– Fixed-Coverage: Only analyze the content variation of themes (e.g., author-topic analysis, cross-collection comparative analysis)

– Fixed-View: Only analyze the coverage variation of themes (e.g., spatiotemporal theme analysis); see the sketch after this list

• In general

– Impose priors on model parameters

– Support the whole spectrum from unsupervised to supervised learning
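As a concrete illustration of the Fixed-View simplification, here is a rough sketch (my own simplification, with assumed names such as fixed_view_cplsa) in which a single view is used and theme coverage is tied to a document's context label (e.g., its state or week) rather than to the document itself.

    import numpy as np

    def fixed_view_cplsa(counts, context_of_doc, n_contexts, k, n_iter=50, seed=0):
        """counts: (n_docs, V) term counts; context_of_doc: (n_docs,) int array of context ids."""
        rng = np.random.default_rng(seed)
        n_docs, V = counts.shape
        cov = rng.random((n_contexts, k)); cov /= cov.sum(axis=1, keepdims=True)   # p(theta_l | kappa_j)
        p_wt = rng.random((k, V)); p_wt /= p_wt.sum(axis=1, keepdims=True)         # p(w | theta_l)
        for _ in range(n_iter):
            pi = cov[context_of_doc]                              # coverage of each doc's context
            mix = np.einsum('dk,kw->dkw', pi, p_wt)
            z = mix / (mix.sum(axis=1, keepdims=True) + 1e-100)   # E-step: p(z = l | D, w)
            frac = counts[:, None, :] * z                         # fractional counts
            new_cov = np.zeros_like(cov)                          # M-step: pool by context
            np.add.at(new_cov, context_of_doc, frac.sum(axis=2))
            cov = new_cov / (new_cov.sum(axis=1, keepdims=True) + 1e-100)
            p_wt = frac.sum(axis=0)
            p_wt /= p_wt.sum(axis=1, keepdims=True) + 1e-100
        return cov, p_wt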

Page 56: Probabilistic Topic Models for  Text Mining


Interpretation of Topics

Statistical (multinomial) topic models output word distributions, e.g.:

term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …

Labeling pipeline: an NLP chunker and n-gram statistics over the collection (context) produce a candidate label pool (e.g., database system, clustering algorithm, r tree, functional dependency, iceberg cube, concurrency control, index structure, …); candidates are scored for relevance and re-ranked for coverage and discrimination, yielding a ranked list of labels (e.g., clustering algorithm; distance measure; …).

Page 57: Probabilistic Topic Models for  Text Mining


Relevance: the Zero-Order Score

• Intuition: prefer phrases covering high-probability words of the topic

(Figure: a latent topic with p(w|θ) high for words such as "clustering", "dimensional", "algorithm", "birch", "shape"; a good label l1 = "clustering algorithm" covers high-probability words, while a bad label l2 = "body shape" does not.)

Score(l, \theta) = \frac{p(l \mid \theta)}{p(l)}

Page 58: Probabilistic Topic Models for  Text Mining


Relevance: the First-Order Score

• Intuition: prefer phrases with a similar context (word distribution)

(Figure: the topic distribution P(w|θ) is compared with the context distributions P(w|l1) and P(w|l2) of candidate labels, estimated from a reference collection C, e.g., the SIGMOD Proceedings; the good label l1 = "clustering algorithm" satisfies D(θ || l1) < D(θ || l2) for the bad label l2 = "hash join".)

Score(l, \theta) = \sum_{w} p(w \mid \theta)\, PMI(w, l \mid C)
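One possible instantiation of these two scores (my own sketch, not the authors' code; the names zero_order_score and first_order_score and the document representation are assumptions): the zero-order score treats the label as a bag of words under the topic, and the first-order score estimates PMI from document-level co-occurrence in the reference collection C.

    import math

    def zero_order_score(label_words, p_w_topic, p_w_corpus):
        """log p(l | theta) - log p(l), treating the label l as a bag of words."""
        return sum(math.log(p_w_topic.get(w, 1e-12)) - math.log(p_w_corpus.get(w, 1e-12))
                   for w in label_words)

    def first_order_score(label, p_w_topic, docs):
        """Sum_w p(w | theta) * PMI(w, label | C); docs is a list of sets containing
        each document's words and candidate phrases (the reference collection C)."""
        n = len(docs)
        has_label = [label in d for d in docs]
        p_label = sum(has_label) / n
        score = 0.0
        for w, p_w in p_w_topic.items():
            p_word = sum(w in d for d in docs) / n
            p_joint = sum((w in d) and has_label[i] for i, d in enumerate(docs)) / n
            if p_joint > 0:
                score += p_w * math.log(p_joint / (p_word * p_label))
        return score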

Page 59: Probabilistic Topic Models for  Text Mining


Sample Results

• Comparative text mining

• Spatiotemporal pattern mining

• Sentiment summary

• Event impact analysis

• Temporal author-topic analysis

Page 60: Probabilistic Topic Models for  Text Mining


Comparing News Articles Iraq War (30 articles) vs. Afghan War (26 articles)

Cluster 1: Common Theme (united 0.042, nations 0.04, …); Iraq Theme (n 0.03, weapons 0.024, inspections 0.023, …); Afghan Theme (northern 0.04, alliance 0.04, kabul 0.03, taleban 0.025, aid 0.02, …)

Cluster 2: Common Theme (killed 0.035, month 0.032, deaths 0.023, …); Iraq Theme (troops 0.016, hoon 0.015, sanches 0.012, …); Afghan Theme (taleban 0.026, rumsfeld 0.02, hotel 0.012, front 0.011, …)

Cluster 3: …

The common theme indicates that “United Nations” is involved in both wars

Collection-specific themes indicate different roles of “United Nations” in the two wars

Page 61: Probabilistic Topic Models for  Text Mining


Comparing Laptop Reviews

Top words serve as "labels" for common themes (e.g., [sound, speakers], [battery, hours], [cd, drive])

These word distributions can be used to segment text and add hyperlinks between documents

Page 62: Probabilistic Topic Models for  Text Mining


Spatiotemporal Patterns in Blog Articles

• Query= “Hurricane Katrina”

• Topics in the results:

• Spatiotemporal patterns

Government Response: bush 0.071, president 0.061, federal 0.051, government 0.047, fema 0.047, administrate 0.023, response 0.020, brown 0.019, blame 0.017, governor 0.014

New Orleans: city 0.063, orleans 0.054, new 0.034, louisiana 0.023, flood 0.022, evacuate 0.021, storm 0.017, resident 0.016, center 0.016, rescue 0.012

Oil Price: price 0.077, oil 0.064, gas 0.045, increase 0.020, product 0.020, fuel 0.018, company 0.018, energy 0.017, market 0.016, gasoline 0.012

Praying and Blessing: god 0.141, pray 0.047, prayer 0.041, love 0.030, life 0.025, bless 0.025, lord 0.017, jesus 0.016, will 0.013, faith 0.012

Aid and Donation: donate 0.120, relief 0.076, red 0.070, cross 0.065, help 0.050, victim 0.036, organize 0.022, effort 0.020, fund 0.019, volunteer 0.019

Personal: i 0.405, my 0.116, me 0.060, am 0.029, think 0.015, feel 0.012, know 0.011, something 0.007, guess 0.007, myself 0.006

Page 63: Probabilistic Topic Models for  Text Mining


Theme Life Cycles for Hurricane Katrina

New Orleans theme: city 0.0634, orleans 0.0541, new 0.0342, louisiana 0.0235, flood 0.0227, evacuate 0.0211, storm 0.0177, …

Oil Price theme: price 0.0772, oil 0.0643, gas 0.0454, increase 0.0210, product 0.0203, fuel 0.0188, company 0.0182, …

(Figure: the life cycles (strength over time) of these two themes.)

Page 64: Probabilistic Topic Models for  Text Mining


Theme Snapshots for Hurricane Katrina

Week 1: The theme is the strongest along the Gulf of Mexico

Week 2: The discussion moves towards the north and west

Week 3: The theme distributes more uniformly over the states

Week 4: The theme is again strong along the east coast and the Gulf of Mexico

Week 5: The theme fades out in most states

Page 65: Probabilistic Topic Models for  Text Mining


Theme Life Cycles: KDD

(Figure: global theme life cycles of KDD abstracts; normalized strength of theme over time, 1999-2004, for themes such as Biology Data, Web Information, Time Series, Classification, Association Rule, Clustering, and Business.)

Biology Data theme: gene 0.0173, expressions 0.0096, probability 0.0081, microarray 0.0038, …

Business theme: marketing 0.0087, customer 0.0086, model 0.0079, business 0.0048, …

Association Rule theme: rules 0.0142, association 0.0064, support 0.0053, …

Page 66: Probabilistic Topic Models for  Text Mining


Theme Evolution Graph: KDD

(Figure: themes evolving from 1999 to 2004; nodes are themes with word distributions such as the following, connected across years.)

SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, …

decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, …

classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, …

information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, …

web 0.009, classification 0.007, features 0.006, topic 0.005, …

mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, …

topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, …

Page 67: Probabilistic Topic Models for  Text Mining


Blog Sentiment Summary (query=“Da Vinci Code”)

Facet 1: Movie

– Neutral: "... Ron Howards selection of Tom Hanks to play Robert Langdon." / "Directed by: Ron Howard Writing credits: Akiva Goldsman ..." / "After watching the movie I went online and some research on ..."

– Positive: "Tom Hanks stars in the movie, who can be mad at that?" / "Tom Hanks, who is my favorite movie star act the leading role." / "Anybody is interested in it?"

– Negative: "But the movie might get delayed, and even killed off if he loses." / "protesting ... will lose your faith by ... watching the movie." / "... so sick of people making such a big deal about a FICTION book and movie."

Facet 2: Book

– Neutral: "I remembered when i first read the book, I finished the book in two days." / "I'm reading 'Da Vinci Code' now."

– Positive: "Awesome book." / "So still a good book to past time."

– Negative: "... so sick of people making such a big deal about a FICTION book and movie." / "This controversy book cause lots conflict in west society."

Page 68: Probabilistic Topic Models for  Text Mining


Results: Sentiment Dynamics

Facet: the book “ the da vinci code”. ( Bursts during the movie, Pos > Neg )

Facet: religious beliefs ( Bursts during the movie, Neg > Pos )

Page 69: Probabilistic Topic Models for  Text Mining


Event Impact Analysis: IR Research

(Figure: the impact of two events on the "retrieval models" theme in SIGIR papers: the start of the TREC conferences in 1992 and the publication of the paper "A language modeling approach to information retrieval" in 1998.)

Theme "retrieval models": term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …

Related theme variants before and after the events include:

vector 0.0514, concept 0.0298, extend 0.0297, model 0.0291, space 0.0236, boolean 0.0151, function 0.0123, feedback 0.0077, …

probabilist 0.0778, model 0.0432, logic 0.0404, ir 0.0338, boolean 0.0281, algebra 0.0200, estimate 0.0119, weight 0.0111, …

xml 0.0678, email 0.0197, model 0.0191, collect 0.0187, judgment 0.0102, rank 0.0097, subtopic 0.0079, …

model 0.1687, language 0.0753, estimate 0.0520, parameter 0.0281, distribution 0.0268, probable 0.0205, smooth 0.0198, markov 0.0137, likelihood 0.0059, …

Page 70: Probabilistic Topic Models for  Text Mining


Temporal-Author-Topic Analysis

(Figure: a global theme "frequent patterns" and its temporal variations for two authors, Rakesh Agrawal and Jiawei Han, around the year 2000.)

pattern 0.1107, frequent 0.0406, frequent-pattern 0.039, sequential 0.0360, pattern-growth 0.0203, constraint 0.0184, push 0.0138, …

project 0.0444, itemset 0.0433, intertransaction 0.0397, support 0.0264, associate 0.0258, frequent 0.0181, closet 0.0176, prefixspan 0.0170, …

research 0.0551, next 0.0308, transaction 0.0308, panel 0.0275, technical 0.0275, article 0.0258, revolution 0.0154, innovate 0.0154, …

close 0.0805, pattern 0.0720, sequential 0.0462, min_support 0.0353, threshold 0.0207, top-k 0.0176, fp-tree 0.0102, …

index 0.0440, graph 0.0343, web 0.0307, gspan 0.0273, substructure 0.0201, gindex 0.0164, bide 0.0115, xml 0.0109, …

Page 71: Probabilistic Topic Models for  Text Mining


Modeling Topical Communities (Mei et al. 08)


Community 1: Information Retrieval

Community 2: Data Mining

Community 3: Machine Learning

Page 72: Probabilistic Topic Models for  Text Mining


Other Extensions (LDA Extensions)

• Many extensions of LDA, mostly done by David Blei, Andrew McCallum and their co-authors

• Some examples:

– Hierarchical topic models [Blei et al. 03]

– Modeling annotated data [Blei & Jordan 03]

– Dynamic topic models [Blei & Lafferty 06]

– Pachinko allocation [Li & McCallum 06]

• Also, some specific context extensions of PLSA, e.g., the author-topic model [Steyvers et al. 04]

Page 73: Probabilistic Topic Models for  Text Mining


Future Research Directions

• Topic models for text mining

– Evaluation of topic models

– Improve the efficiency of estimation and inference

– Incorporate linguistic knowledge

– Applications in new domains and for new tasks

• Text mining in general

– Combination of NLP-style and DM-style mining algorithms

– Integrated mining of text (unstructured) and structured data (e.g., Text OLAP)

– Interactive mining:

• Incorporate user constraints and support iterative mining

• Design and implement mining languages

Page 74: Probabilistic Topic Models for  Text Mining


Lecture 5: Key Points

• Topic models coupled with topic labeling are quite useful for extracting and modeling subtopics in text

• Adding context variables significantly increases a topic model's capacity for text mining

– Enable interpretation of topics in context

– Accommodate variation analysis and correlation analysis of topics over context

• Users' preferences and domain knowledge can be added as priors or soft constraints

Page 75: Probabilistic Topic Models for  Text Mining


Readings

• PLSA:

– http://www.cs.brown.edu/~th/papers/Hofmann-UAI99.pdf

• LDA:

– http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf

– Many recent extensions, mostly done by David Blei and Andrew McCallum

• CPLSA:

– http://sifaka.cs.uiuc.edu/czhai/pub/kdd06-mix.pdf

– http://sifaka.cs.uiuc.edu/czhai/pub/www08-net.pdf

Page 76: Probabilistic Topic Models for  Text Mining


Discussion

• Topic models for mining multimedia data

– Simultaneous modeling of text and images

• Cross-media analysis

– Text provides context to analyze images and vice versa

Page 77: Probabilistic Topic Models for  Text Mining


Course Summary

(Figure: the scope of the course at the intersection of Statistics, Machine Learning, Natural Language Processing, Computer Vision, and Information Retrieval, spanning multimedia data and text data.)

Covered topics: retrieval models/framework, evaluation, feedback, contextual topic models.

Open directions: 1. Evaluation, 2. User modeling, 3. Ranking, 4. Learning with little supervision.

Integrated multimedia data analysis: mutual reinforcement (e.g., text and images), simultaneous mining of text + images + video, …

Looking forward to collaborations…

Page 78: Probabilistic Topic Models for  Text Mining


Thank You!