Cross-Cultural Analysis of Blogs and Forums with Mixed-Collection Topic Models

Michael Paul and Roxana Girju

Outline

• Overview of topic models
• Cross-Collection LDA
• Cross-cultural analysis with ccLDA
• Other applications of ccLDA
• Model evaluation
• An alternative cross-collection model

Outline

• Overview of topic models
  – PLSI and LDA
  – Some slides borrowed from CS410 (ChengXiang Zhai)
• Cross-Collection LDA
• Cross-cultural analysis with ccLDA
• Other applications of ccLDA
• Model evaluation
• An alternative cross-collection model

Probabilistic Topic Models

• Idea: each document is some mix of topics

• Each word in the document belongs to a topic

Document as a Sample of Mixed Topics

• Applications of topic models:
  – Summarize themes/aspects
  – Facilitate navigation/browsing
  – Retrieve documents
  – Segment documents
  – Many others

• How can we discover these topic word distributions?

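[Figure: a document drawn as a mixture of Topic 1, Topic 2, ..., Topic k and a background model B, each a word distribution. Example topic distributions: "government 0.3, response 0.2, ..."; "donate 0.1, relief 0.05, help 0.02, ..."; "city 0.2, new 0.1, orleans 0.05, ...". Background B: "is 0.05, the 0.04, a 0.03, ...". The bracketed spans in the sample text mark passages attributed to different topics.]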

[ Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response ] to the [ flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated ] …[ Over seventy countries pledged monetary donations or other assistance]. …

Probabilistic Latent Semantic Indexing [Hofmann, 1999]

• Each token in a document is associated with 2 variables:
  – a word w (observable)
  – a topic z (hidden)

• P(w,z|d) = P(z|d) P(w|z)

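PLSA as a Mixture Model

[Figure: "generating" word w in doc d in the collection. With probability $\lambda_B$ the word comes from the background model B; with probability $1 - \lambda_B$ it comes from one of the topics 1, 2, ..., k, chosen with document-specific weights $\pi_{d,1}, \pi_{d,2}, \ldots, \pi_{d,k}$. Example topic distributions: "warning 0.3, system 0.2, ..."; "aid 0.1, donation 0.05, support 0.02, ..."; "statistics 0.2, loss 0.1, dead 0.05, ...". Background B: "is 0.05, the 0.04, a 0.03, ...".]

$$p_d(w) = \lambda_B \, p(w \mid \theta_B) + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j} \, p(w \mid \theta_j)$$

$$\log p(d) = \sum_{w \in V} c(w,d) \, \log \Big[ \lambda_B \, p(w \mid \theta_B) + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j} \, p(w \mid \theta_j) \Big]$$

Parameters: $\lambda_B$ = noise level (manually set); the $\theta$'s and $\pi$'s are estimated with Maximum Likelihood.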
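As a rough illustration of the mixture likelihood on this slide, the sketch below evaluates log p(d) for one document from word counts. The function name, the dictionary-based representation, and the toy numbers are assumptions made for this example, not part of the original slides.

```python
import math

def doc_log_likelihood(counts, lambda_b, p_background, pi_d, topics):
    """log p(d) = sum_w c(w,d) * log[lambda_B p(w|B) + (1 - lambda_B) sum_j pi_dj p(w|topic_j)]."""
    total = 0.0
    for word, c in counts.items():
        # Probability of the word under the document's mixture of topics.
        topic_part = sum(pi * topic.get(word, 0.0) for pi, topic in zip(pi_d, topics))
        # Blend with the background model using the fixed noise level.
        p_w = lambda_b * p_background.get(word, 0.0) + (1.0 - lambda_b) * topic_part
        total += c * math.log(p_w)
    return total

# Toy numbers (hypothetical): two topics, a three-word vocabulary.
p_background = {"the": 0.5, "storm": 0.25, "aid": 0.25}
topics = [{"storm": 1.0}, {"aid": 1.0}]
log_lik = doc_log_likelihood({"the": 2, "storm": 1}, 0.5, p_background, [0.5, 0.5], topics)
print(log_lik)
```

With these toy values, p("the") = 0.25 and p("storm") = 0.375, so the result is 2 log 0.25 + log 0.375.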

How to Estimate Multiple Topics? (Expectation Maximization)

[Figure: EM for a mixture of a known background model and two unknown topic models, fit to the observed doc(s).]

• Known background p(w | B): the 0.2, a 0.1, we 0.01, to 0.02, …
• Unknown topic model p(w | θ1) = ? ("text mining"): text = ?, mining = ?, association = ?, word = ?, …
• Unknown topic model p(w | θ2) = ? ("information retrieval"): information = ?, retrieval = ?, query = ?, document = ?, …
• E-step: predict topic labels using Bayes' rule
• M-step: maximum likelihood estimator based on "fractional counts"
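The E-step and M-step named on this slide can be sketched as follows for a mixture with a fixed background model. This is a minimal sketch under stated assumptions: the function name `plsa_em`, the random initialization, the iteration count, and the toy documents are all hypothetical and are not taken from the slides.

```python
import random

def plsa_em(docs, vocab, k, p_bg, lambda_b, iters=50, seed=0):
    """EM for k topics plus a known background model; docs are word-count dicts."""
    rng = random.Random(seed)
    index = {w: i for i, w in enumerate(vocab)}
    topics = []
    for _ in range(k):
        raw = [rng.random() + 0.1 for _ in vocab]
        total = sum(raw)
        topics.append([r / total for r in raw])
    pi = [[1.0 / k] * k for _ in docs]
    for _ in range(iters):
        topic_counts = [[0.0] * len(vocab) for _ in range(k)]
        for d, counts in enumerate(docs):
            doc_counts = [0.0] * k
            for word, c in counts.items():
                wi = index[word]
                joint = [pi[d][j] * topics[j][wi] for j in range(k)]
                mix = sum(joint)
                if mix == 0.0:
                    continue
                # E-step (Bayes rule): probability the word did not come from the background...
                p_from_topics = (1.0 - lambda_b) * mix
                p_not_bg = p_from_topics / (lambda_b * p_bg[word] + p_from_topics)
                for j in range(k):
                    # ...times the posterior of topic j gives a fractional count.
                    frac = c * p_not_bg * joint[j] / mix
                    doc_counts[j] += frac
                    topic_counts[j][wi] += frac
            # M-step for the document's topic weights.
            total = sum(doc_counts)
            pi[d] = [x / total for x in doc_counts]
        # M-step for the topic word distributions.
        for j in range(k):
            total = sum(topic_counts[j])
            topics[j] = [x / total for x in topic_counts[j]]
    return pi, topics

vocab = ["text", "mining", "query", "retrieval", "the"]
docs = [{"text": 3, "mining": 2, "the": 4}, {"query": 3, "retrieval": 2, "the": 4}]
p_bg = {"text": 0.05, "mining": 0.05, "query": 0.05, "retrieval": 0.05, "the": 0.8}
pi, topics = plsa_em(docs, vocab, k=2, p_bg=p_bg, lambda_b=0.5)
print(pi)
```

Because the background model already explains "the", the fractional counts for that word are small, which is what lets the topics concentrate on content words.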

PLSI - Problems

• Each document is represented as a dummy variable d
  – Number of parameters grows linearly with corpus size
  – Overfitting
• Not fully generative
  – Not clear how to model previously unseen documents

Latent Dirichlet Allocation [Blei et al., 2003]

• Per-document topic mixtures and word multinomials come from Dirichlet priors

• Exact solution is intractable
  – Inference is more complicated
    • Variational methods
    • Monte Carlo

Dirichlet Distribution

• Conjugate prior of the multinomial distribution
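Conjugacy means that a Dirichlet prior combined with multinomial counts gives a Dirichlet posterior whose parameters are the prior parameters plus the observed counts. The helper names and numbers below are illustrative assumptions, not from the slides.

```python
def dirichlet_posterior(alpha, counts):
    """Posterior Dirichlet parameters after observing multinomial counts."""
    return [a + n for a, n in zip(alpha, counts)]

def dirichlet_mean(alpha):
    """Expected multinomial parameters under a Dirichlet(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]

prior = [1, 1, 1]        # symmetric prior over three outcomes
observed = [3, 0, 1]     # counts of each outcome
posterior = dirichlet_posterior(prior, observed)
print(posterior, dirichlet_mean(posterior))
```

Here the posterior parameters are [4, 1, 2], so the outcome that was never observed still keeps a nonzero expected probability of 1/7.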

Latent Dirichlet Allocation
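A minimal sketch of the standard LDA generative process, in which topic word distributions and per-document topic mixtures are both drawn from Dirichlet priors. The hyperparameter values, function names, and toy vocabulary are assumptions for illustration; this generates synthetic data and does not perform inference.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]

def generate_corpus(n_docs, doc_len, vocab, n_topics, alpha=0.5, beta=0.1, seed=0):
    rng = random.Random(seed)
    # One word distribution per topic, drawn from a Dirichlet prior.
    topics = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # One topic mixture per document, drawn from a Dirichlet prior.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        doc = []
        for _ in range(doc_len):
            z = rng.choices(range(n_topics), weights=theta)[0]   # topic for this token
            w = rng.choices(vocab, weights=topics[z])[0]          # word given the topic
            doc.append(w)
        corpus.append(doc)
    return corpus

corpus = generate_corpus(3, 5, ["rain", "hotel", "church", "music"], n_topics=2)
for doc in corpus:
    print(doc)
```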

Cross-Collection LDA (ccLDA)

• LDA extension for modeling multiple text collections

• Each topic has one word distribution that is shared among all collections, plus word distributions that are unique to each collection

• Automatically discovers differences between collections and organizes them by topic

Example

• Topic of weather and the outdoors in travel forums

Topic (shared): weather time day going rain summer month high days thanks
UK: wind waterproof ending rolling walkers rochdale layers snow footwear ankle
India: leh monsoon road manali ladakh trekking trek season rains monsoons
Singapore: hot humid humidity heat degree equator sweat bring rain umbrella

ccLDA

• Inference can be done with Gibbs sampling

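[Figure: graphical representation (plate diagram) and generative process of ccLDA. Symbols appearing in the diagram: α, φ, β, θ, z, w, c, x, ψ, σ, δ, γ0, γ1; plate labels: C, T, D, N, TC.]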
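The ccLDA generative story described in these slides (a shared word distribution per topic, a collection-specific word distribution per topic and collection, and a switch x whose probability depends on the topic and collection) might be sketched as below. The convention that x = 0 selects the shared distribution and x = 1 the collection-specific one, along with all names and toy parameters, is an assumption of this sketch rather than something stated in the transcript.

```python
import random

def sample_dirichlet(alpha, rng):
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]

def generate_cclda_doc(collection, doc_len, vocab, shared, specific, psi, alpha_c, rng):
    """Generate (word, topic, switch) triples for one document of a given collection."""
    # Document-topic mixture with a collection-dependent Dirichlet prior.
    theta = sample_dirichlet(alpha_c[collection], rng)
    tokens = []
    for _ in range(doc_len):
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # Switch: assumed x = 1 means "use the collection-specific distribution".
        x = 1 if rng.random() < psi[collection][z] else 0
        dist = specific[collection][z] if x == 1 else shared[z]
        w = rng.choices(vocab, weights=dist)[0]
        tokens.append((w, z, x))
    return tokens

# Toy parameters (hypothetical): one topic, two collections.
vocab = ["rain", "day", "monsoon", "humid"]
shared = [[0.5, 0.5, 0.0, 0.0]]
specific = {"India": [[0.0, 0.0, 1.0, 0.0]], "Singapore": [[0.0, 0.0, 0.0, 1.0]]}
psi = {"India": [0.5], "Singapore": [0.5]}
alpha_c = {"India": [1.0], "Singapore": [1.0]}
rng = random.Random(0)
print(generate_cclda_doc("India", 8, vocab, shared, specific, psi, alpha_c, rng))
```

In the full model the switch probabilities would themselves be drawn from a Beta prior; they are fixed here to keep the sketch short.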

Previous Work

• Comparative mixture model (ccMix)
  – ChengXiang Zhai, Atulya Velivelli, Bei Yu. A cross-collection mixture model for comparative text mining. Proceedings of ACM KDD 2004.

• Improvements in ccLDA:
  – Does not rely on user-defined parameters
  – Distributions have Dirichlet/Beta priors
  – Document-topic distributions have collection-dependent priors
  – P(x) depends on the topic and collection

ccMix (2004)
Common: cd drive rw combo dvd
Dell: apoint blah hook tug 2499
Apple: airport burn 4x reads schools
IBM: t20 ultrabay tells device number

ccLDA (2009)
Common: drive cd dvd hard rw
Dell: battery laptop bay inspiron media
Apple: itunes burn imovie burning minutes
IBM: 2000 ultrabay hot device swappable

Cross-Cultural Analysis

• Documents from or about 3 countries: United Kingdom, India, Singapore
• 3,266 forum discussions
  – collected from lonelyplanet.com
  – represent the perspective of tourists
• 7,388 English-language blogs
  – collected through blogcatalog.com
  – represent the perspective of locals

Cross-Cultural Analysis

• Topic of religion from the blogs

Topic (shared): god jesus lord life faith holy man christ church love
UK: church god john todd bentley christ luke bible christian sermon
India: krishna religion religious spiritual guru lord sri shri baba hindu
Singapore: god sin john spirit things lamb exodus suffering cross lives

Cross-Cultural Analysis

• Topic of entertainment from the blogs
• Compare against ccMix

ccLDA
Topic (shared): music song new songs like album dance comments rock guitar
UK: music band album dance festival sound bands remix tracks amp
India: movie film movies songs films director best bollywood indian awards
Singapore: band music american japanese mark world video sound idol week

ccMix
Topic (shared): comment posted like music just blog time labels post love
UK: music album band song songs new review track bands pop
India: kerala india tiger rajasthan birds water park city temple sanctuary
Singapore: kids baby cool desktop miss fun wallpaper love dont little

Cross-Cultural Analysis

• Topic of travel from the blogs
• Compare against LDA (on each collection individually)

ccLDA
Topic: travel hotel hotels city best place visit holiday trip world
UK: holiday holidays hotels spain london great surf breaks train ski
India: india delhi indian mumbai bangalore tour air dubai city mahindra
Singapore: singapore hong kong spa hotel beach chinese pictures restaurant bangkok

LDA
Topic: travel city hotel park holiday hotels place beach road visit
UK: travel holiday hotel city london park hotel place holidays hall
India: travel city beach place hotel temple road park hotels tourism
Singapore: travel hotel city park place beach trip hotels spa visit

Cross-Cultural Analysis

• Topic of food from both datasets
• Compare the view of tourists and locals

Perspective of Locals
Topic (shared): food add chicken recipe cooking taste rice recipes sugar soup
UK: food wine restaurant coffee cheese soup eat chef english drink
India: recipe recipes powder indian salt tsp rice masala oil coriander
Singapore: coffee cup oil comments fried add restaurant rice tea seafood

Perspective of Tourists
Topic (shared): food eat restaurant restaurants tea cheap meal eating cafe drink
UK: fish haggis chips respectability decent veggie pudding photoblog sausages sandwiches
India: cooking spices sick flour tomato batter ate cook olive recipe
Singapore: hawker satay stalls noodles roti stall seafood malay rochester noodle

Outline

• Overview of topic models
• Cross-Collection LDA
• Cross-cultural analysis with ccLDA
• Other applications of ccLDA
  – Scientific research/literature analysis
  – Media analysis and bias detection
• Model evaluation
• An alternative cross-collection model

Research Analysis

• 16,186 abstracts from computational linguistics and linguistics journals

• Interdisciplinary research topic discovery

• Topic evolution over time

Research Analysis

• Topic of communication

Topic (shared): speech spoken interaction human discourse paper understanding task context communication goal users
Comp Ling: dialogue user systems information utterances dialogues utterance agent plan recognition agents research multi
Linguistics: social communication verbal women speakers speaker relationship interaction ways means behavior face men

Research Analysis

• Topic of parsing/grammars across two time intervals

Topic (shared): parser grammar tree parsers grammars free context syntactic parse structure
Old (< 2000): number result corresponding networks known binding lr introduce consider recognition transformational ambiguous networks
New (>= 2000): dependency probabilistic stochastic treebank pcfg constraint lexicalized ccg projective robustness hpsg modeling treebanks

Media Analysis

• 623 news articles from msnbc.com and foxnews.com from August 2008
• Discover editorial differences within topics

Topic: percent economy prices market
MSNBC: stocks account trades tools spending consumers sales investors trading company
FOX News: oil drilling poverty offshore coverage insurance growing uninsured census congress

Topic: car vehicle cars fuel drive
MSNBC: diesel says autos camaro tax credit smaller mileage hybrid chevrolet
FOX News: mazda gallardo chrysler minivan horsepower lamborghini mph sports lp traffic

Model Evaluation

• Greater likelihood of held-out data than alternative models
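One common way to compare models on held-out data is to sum the log-probability of every held-out token and report perplexity (lower is better). The sketch below assumes the topic mixtures for the held-out documents are already available; the slides do not say which estimator was used, so the function names and toy values are illustrative assumptions only.

```python
import math

def heldout_log_likelihood(docs, theta, phi):
    """Sum of log p(w | d) over held-out tokens, with p(w | d) = sum_z theta[d][z] * phi[z][w]."""
    total, n_tokens = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            p = sum(theta[d][z] * phi[z].get(w, 0.0) for z in range(len(phi)))
            total += math.log(p)
            n_tokens += 1
    return total, n_tokens

def perplexity(log_lik, n_tokens):
    """Exponentiated negative average log-likelihood per token."""
    return math.exp(-log_lik / n_tokens)

# Toy check (hypothetical): one topic that is uniform over two words.
phi = [{"a": 0.5, "b": 0.5}]
theta = [[1.0]]
log_lik, n = heldout_log_likelihood([["a", "b", "a", "b"]], theta, phi)
print(perplexity(log_lik, n))
```

A model that assigns higher likelihood to the same held-out tokens gets a lower perplexity, which makes the two quantities interchangeable for ranking models.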

Model Evaluation

• Document classification: new vs. old

• Compare to Naive Bayes (NB) and SVM (linear kernel)

Alternative Model

• Similar to hierarchical Pachinko Allocation [Mimno et al, 2007]

• Model as 2-level hierarchy

Alternative Model

• Single, global set of “super-topics”

• One set of “sub-topics” for each collection

• Choose super-topic T from P(T|d)

• Choose sub-topic t from P(t|T,c)

• Choose hierarchy level l from P(l|t,T)

• If l = 0, choose word from P(w|T); else if l = 1, choose word from P(w|t)
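The four sampling steps listed on this slide can be sketched for a single token as follows. The data layout (nested lists and dicts keyed by collection) and every parameter value are assumptions made for illustration; only the order of the steps comes from the slide.

```python
import random

def generate_token(collection, p_super, p_sub, p_level1, word_super, word_sub, vocab, rng):
    """Sample one token following the super-topic / sub-topic process."""
    # Choose super-topic T from P(T|d).
    big_t = rng.choices(range(len(p_super)), weights=p_super)[0]
    # Choose sub-topic t from P(t|T,c); sub-topics belong to the collection.
    sub_weights = p_sub[collection][big_t]
    t = rng.choices(range(len(sub_weights)), weights=sub_weights)[0]
    # Choose hierarchy level l from P(l|t,T); p_level1 stores P(l = 1).
    level = 1 if rng.random() < p_level1[collection][big_t][t] else 0
    # l = 0: word from the super-topic; l = 1: word from the sub-topic.
    dist = word_sub[collection][t] if level == 1 else word_super[big_t]
    word = rng.choices(vocab, weights=dist)[0]
    return word, big_t, t, level

# Toy parameters (hypothetical): one super-topic, one sub-topic per collection.
vocab = ["god", "faith", "church", "krishna"]
word_super = [[0.5, 0.5, 0.0, 0.0]]
word_sub = {"UK": [[0.0, 0.0, 1.0, 0.0]], "India": [[0.0, 0.0, 0.0, 1.0]]}
p_sub = {"UK": [[1.0]], "India": [[1.0]]}
p_level1 = {"UK": [[0.5]], "India": [[0.5]]}
rng = random.Random(0)
print([generate_token("UK", [1.0], p_sub, p_level1, word_super, word_sub, vocab, rng) for _ in range(5)])
```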

Alternative Model

• This is just a generalization of ccLDA!

• ccLDA is the special case in which each super-topic T=j has exactly one sub-topic, that is, P(t=j|T=j)=1 and P(t=i|T=j)=0 for all i ≠ j
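To see why the constraint recovers ccLDA, it may help to write out the per-token word probability implied by the generative steps of the alternative model and then apply the constraint. This is a sketch derived from the steps listed on the slides, not a formula quoted from them; reading l as the ccLDA switch x is an interpretation.

```latex
% General model: marginalize over super-topic T, sub-topic t, and level l.
P(w \mid d, c) = \sum_{T} P(T \mid d) \sum_{t} P(t \mid T, c)
  \Big[ P(l = 0 \mid t, T)\, P(w \mid T) + P(l = 1 \mid t, T)\, P(w \mid t) \Big]

% Constraint: P(t = j \mid T = j) = 1, so the inner sum keeps only t = j.
P(w \mid d, c) = \sum_{j} P(T = j \mid d)
  \Big[ P(l = 0 \mid t = j, T = j)\, P(w \mid T = j) + P(l = 1 \mid t = j, T = j)\, P(w \mid t = j) \Big]
```

Under the constraint each topic j mixes one shared word distribution P(w|T=j) with one collection-specific distribution P(w|t=j), which is the structure ccLDA assigns to a topic.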

Alternative Model

• Topic of religion in the blogs

Super-Topic: god 0.046994, lord 0.015877, jesus 0.012076, life 0.01143, faith 0.010692, church 0.010185, holy 0.009189, man 0.00882, world 0.00869, people 0.007574

UK 1: church 0.030402, john 0.017007, todd 0.016154, jesus 0.015552, bentley 0.014348, luke 0.012693, religion 0.012592, christ 0.012091, cross 0.011388, neville 0.009482

Weight from the super-topic to UK 1: 0.970483

Alternative Model

• Topic of religion in the blogs

Super-Topic: god 0.046994, lord 0.015877, jesus 0.012076, life 0.01143, faith 0.010692, church 0.010185, holy 0.009189, man 0.00882, world 0.00869, people 0.007574

India 1: religion 0.021439, krishna 0.019062, spiritual 0.014765, hindu 0.012343, lord 0.01216, religious 0.012114, guru 0.011108, mother 0.01088, shri 0.010194, sri 0.009646

Weight from the super-topic to India 1: 0.984414

Alternative Model

• Topic of religion in the blogs

Super-Topic: god 0.046994, lord 0.015877, jesus 0.012076, life 0.01143, faith 0.010692, church 0.010185, holy 0.009189, man 0.00882, world 0.00869, people 0.007574

SG 1: god 0.032249, christ 0.018867, cross 0.015467, sin 0.012505, grace 0.012395, jesus 0.011957, john 0.011628, lamb 0.009982, mahendra 0.009489, good 0.009434

SG 2: daily 0.020028, free 0.016023, fast 0.014822, silent 0.014221, wait 0.012418, going 0.011818, sign 0.009414, friday 0.009214, health 0.008413, star 0.008413

Weights from the super-topic to the sub-topics: 0.851749, 0.102534

ccLDA

• Topic of religion from the blogs

Topic (shared): god jesus lord life faith holy man christ church love
UK: church god john todd bentley christ luke bible christian sermon
India: krishna religion religious spiritual guru lord sri shri baba hindu
Singapore: god sin john spirit things lamb exodus suffering cross lives

Alternative Model

• Topic of politics in the blogs

Super-Topic: people 0.021148, government 0.016807, world 0.010694, obama 0.009229, political 0.00902, media 0.008975, politics 0.008669, country 0.008534, state 0.007906, rights 0.007413

UK 1: labour 0.049547, british 0.041125, workers 0.029925, european 0.026252, bbc 0.024908, david 0.017203, crisis 0.016934, immigration 0.014694, left 0.014336, trade 0.011648

UK 2: war 0.023458, world 0.01909, wales 0.019002, welsh 0.017823, brown 0.014503, britain 0.013498, gordon 0.012188, london 0.011445, politics 0.010004, anti 0.009916

Weights from the super-topic to the sub-topics: 0.29108, 0.699227

Alternative Model

• Topic of politics in the blogs

Super-Topic: people 0.021148, government 0.016807, world 0.010694, obama 0.009229, political 0.00902, media 0.008975, politics 0.008669, country 0.008534, state 0.007906, rights 0.007413

India 1: pakistan 0.052105, india 0.038041, kashmir 0.037222, state 0.023186, muslims 0.017312, muslim 0.016634, political 0.010647, taliban 0.010647, jammu 0.009461, kashmiri 0.00932

Weight from the super-topic to India 1: 0.987059

Alternative Model

• Topic of politics in the blogs

Super-Topic: people 0.021148, government 0.016807, world 0.010694, obama 0.009229, political 0.00902, media 0.008975, politics 0.008669, country 0.008534, state 0.007906, rights 0.007413

SG 1: singapore 0.04263, world 0.027554, singaporeans 0.014817, people 0.013387, earth 0.012478, malaysia 0.011698, global 0.010398, say 0.010398, myanmar 0.009488, workers 0.008838

Weight from the super-topic to SG 1: 0.970675

ccLDA

• Topic of politics from the blogs

Topic (shared): people government war world state political human rights said country
UK: news politics london media post obama war labour world bbc
India: pakistan india kashmir indian pakistani muslims state muslim brigade taliban
Singapore: singapore comments singaporeans labels chinese ago news world joo posted

Questions?