Cross-Cultural Analysis of Blogs and Forums with Mixed-Collection Topic Models
Michael Paul and Roxana Girju
Outline
• Overview of topic models
• Cross-Collection LDA
• Cross-cultural analysis with ccLDA
• Other applications of ccLDA
• Model evaluation
• An alternative cross-collection model
Outline
• Overview of topic models
– PLSI and LDA
– Some slides borrowed from CS410 (ChengXiang Zhai)
• Cross-Collection LDA
• Cross-cultural analysis with ccLDA
• Other applications of ccLDA
• Model evaluation
• An alternative cross-collection model
Probabilistic Topic Models
• Idea: each document is some mix of topics
• Each word in the document belongs to a topic
Document as a Sample of Mixed Topics
• Applications of topic models:
– Summarize themes/aspects
– Facilitate navigation/browsing
– Retrieve documents
– Segment documents
– Many others
• How can we discover these topic word distributions?
[Figure: a document whose bracketed segments are drawn from different topics (Topic 1 … Topic k) and from a background model B]
Example topic word distributions:
• government 0.3, response 0.2, ...
• city 0.2, new 0.1, orleans 0.05, ...
• donate 0.1, relief 0.05, help 0.02, ...
Background B: is 0.05, the 0.04, a 0.03, ...
[ Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response ] to the [ flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated ] …[ Over seventy countries pledged monetary donations or other assistance]. …
Probabilistic Latent Semantic Indexing [Hofmann, 1999]
• Each token in a document is associated with 2 variables:
– a word w (observable)
– a topic z (hidden)
• P(w,z|d) = P(z|d) P(w|z)
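As a small numeric illustration of this factorization, the sketch below builds the joint P(w,z|d) and sums out the hidden topic to get the document's word distribution, P(w|d) = Σ_z P(z|d) P(w|z). The two topics, the three-word vocabulary, and every probability are made-up toy values, not figures from the slides.

```python
import numpy as np

# Hypothetical toy model: 2 topics over a 3-word vocabulary.
vocab = ["government", "city", "donate"]
p_w_given_z = np.array([
    [0.7, 0.2, 0.1],   # P(w | z = 0)
    [0.1, 0.3, 0.6],   # P(w | z = 1)
])
p_z_given_d = np.array([0.25, 0.75])   # P(z | d) for a single document d

# Joint P(w, z | d) = P(z | d) * P(w | z); rows are topics, columns are words.
joint = p_z_given_d[:, None] * p_w_given_z

# Marginal word distribution of the document: sum over the hidden topic z.
p_w_given_d = joint.sum(axis=0)

for word, prob in zip(vocab, p_w_given_d):
    print(f"P({word} | d) = {prob:.3f}")
```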
PLSA as a Mixture Model
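[Figure: “generating” word w in document d in the collection. With probability $\lambda_B$ the word is drawn from the background model $\theta_B$; otherwise a topic $\theta_j$ is chosen with probability $\pi_{d,j}$ and the word is drawn from that topic.]
Example topic word distributions (Topic 1 … Topic k):
• warning 0.3, system 0.2, ..
• aid 0.1, donation 0.05, support 0.02, ..
• statistics 0.2, loss 0.1, dead 0.05, ..
Background B: is 0.05, the 0.04, a 0.03, ..
$$p_d(w) = \lambda_B\, p(w \mid \theta_B) + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j)$$
$$\log p(d) = \sum_{w \in V} c(w,d) \log\Big[ \lambda_B\, p(w \mid \theta_B) + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j) \Big]$$
Parameters: $\lambda_B$ = noise level (manually set); the $\theta$'s and $\pi$'s are estimated with maximum likelihood.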
How to Estimate Multiple Topics? (Expectation Maximization)
[Figure: observed documents are explained by a known background model and two unknown topic models; EM alternates between the two steps listed below]
• Known background p(w | B): the 0.2, a 0.1, we 0.01, to 0.02, …
• Unknown topic model p(w|1) = ? (“text mining”): text = ?, mining = ?, association = ?, word = ?, …
• Unknown topic model p(w|2) = ? (“information retrieval”): information = ?, retrieval = ?, query = ?, document = ?, …
• E-step: predict topic labels using Bayes' rule
• M-step: maximum likelihood estimator based on “fractional counts”
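The E-step and M-step can be sketched in a few lines of NumPy. This is a minimal illustration for a k-topic mixture with a fixed background model, not the implementation behind the slides: the function name `plsa_em`, the toy count matrix, the uniform background, and the small smoothing constant `eps` are all assumptions introduced here.

```python
import numpy as np

def plsa_em(counts, p_bg, k, lam_b=0.5, iters=50, seed=0, eps=1e-12):
    """EM for a k-topic mixture with a fixed background model.

    counts: (D, V) word counts c(w, d)
    p_bg:   (V,) known background distribution p(w | B), all entries > 0
    lam_b:  background noise level, set by hand
    """
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    theta = rng.dirichlet(np.ones(V), size=k)   # p(w | topic j), shape (k, V)
    pi = rng.dirichlet(np.ones(k), size=D)      # topic weights per document, (D, k)

    for _ in range(iters):
        # E-step: Bayes' rule for the hidden label of every (document, word) pair.
        topic_mix = pi[:, :, None] * theta[None, :, :]           # (D, k, V)
        p_topics = topic_mix.sum(axis=1)                          # (D, V)
        denom = lam_b * p_bg[None, :] + (1 - lam_b) * p_topics
        p_not_bg = (1 - lam_b) * p_topics / denom                 # word came from a topic
        p_z = topic_mix / p_topics[:, None, :]                    # which topic, given not background

        # M-step: maximum likelihood re-estimation from fractional counts.
        frac = counts[:, None, :] * p_not_bg[:, None, :] * p_z    # (D, k, V)
        pi = frac.sum(axis=2) + eps
        pi /= pi.sum(axis=1, keepdims=True)
        theta = frac.sum(axis=0) + eps                            # eps avoids exact zeros
        theta /= theta.sum(axis=1, keepdims=True)
    return pi, theta

# Toy data: two documents over a five-word vocabulary (made-up counts).
counts = np.array([[5, 4, 0, 0, 3],
                   [0, 1, 6, 5, 2]], dtype=float)
p_bg = np.full(5, 0.2)
pi, theta = plsa_em(counts, p_bg, k=2)
print(np.round(pi, 3))
print(np.round(theta, 3))
```

The dense (D, k, V) arrays keep the two steps easy to read; a real corpus would need sparse counts.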
PLSI - Problems
• Each document is represented as a dummy variable d
• Number of parameters grows linearly with corpus size
• Overfitting
• Not fully generative
• Not clear how to model previously unseen documents
Latent Dirichlet Allocation [Blei et al, 2003]
• Per-document topic mixtures and word multinomials come from Dirichlet priors
• Exact solution is intractable
– Inference is more complicated
• Variational methods
• Monte Carlo
Dirichlet Distribution
• Conjugate prior of multinomial distribution
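Conjugacy means that a Dirichlet prior combined with multinomial counts gives a Dirichlet posterior whose parameters are the prior parameters plus the counts. The sketch below shows this with made-up numbers; the prior `alpha` and the observed `counts` are illustrative assumptions.

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # symmetric Dirichlet prior (assumed values)
counts = np.array([8, 1, 1])        # hypothetical observed multinomial counts

# Conjugacy: the posterior is Dirichlet(alpha + counts).
posterior_alpha = alpha + counts
posterior_mean = posterior_alpha / posterior_alpha.sum()

# The maximum likelihood estimate ignores the prior; the posterior mean is
# pulled toward the prior, which smooths rarely observed outcomes.
mle = counts / counts.sum()

rng = np.random.default_rng(0)
theta_samples = rng.dirichlet(posterior_alpha, size=5)   # draws from the posterior

print("posterior mean:", np.round(posterior_mean, 3))
print("MLE:           ", np.round(mle, 3))
```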
Latent Dirichlet Allocation
Cross-Collection LDA (ccLDA)
• LDA extension for modeling multiple text collections
• Each topic has a word distribution that is shared among all collections, as well as word distributions that are unique to each collection
• Automatically discovers differences between collections and organizes them by topic
Example
• Topic of weather and the outdoors in travel forums
Topic: weather time day going rain summer month high days thanks
UK: wind waterproof ending rolling walkers rochdale layers snow footwear ankle
India: leh monsoon road manali ladakh trekking trek season rains monsoons
Singapore: hot humid humidity heat degree equator sweat bring rain umbrella
ccLDA
• Inference can be done with Gibbs sampling
[Figure: graphical representation and generative process of ccLDA. Variables shown: α, θ, z, w, c, x, φ, β, σ, δ, ψ, γ0, γ1; plate labels: D, N, T, C]
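The generative story behind ccLDA, as described on these slides (a shared word distribution per topic, collection-specific word distributions, and a per-token switch x), can be sketched as follows. This is an illustrative reading, not the authors' code: the function `generate_document`, the array shapes, the hyperparameter values, and the convention that x = 1 selects the collection-specific distribution are assumptions.

```python
import numpy as np

def generate_document(c, n_words, theta_prior, phi, sigma, psi, rng):
    """Sample word ids for one document from collection c.

    theta_prior: (C, T) Dirichlet parameters for the document-topic mixture
    phi:         (T, V) word distributions shared by all collections
    sigma:       (C, T, V) collection-specific word distributions
    psi:         (C, T) probability of using the collection-specific distribution
    """
    theta = rng.dirichlet(theta_prior[c])         # document-topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choice(len(theta), p=theta)       # topic for this token
        x = rng.random() < psi[c, z]              # switch depends on collection and topic
        dist = sigma[c, z] if x else phi[z]       # specific vs. shared word distribution
        words.append(int(rng.choice(len(dist), p=dist)))
    return words

rng = np.random.default_rng(0)
C, T, V = 3, 4, 20                                # collections, topics, vocabulary size
theta_prior = np.full((C, T), 0.5)
phi = rng.dirichlet(np.full(V, 0.5), size=T)
sigma = rng.dirichlet(np.full(V, 0.5), size=(C, T))
psi = rng.beta(1.0, 1.0, size=(C, T))             # Beta-distributed switch probabilities

doc = generate_document(1, 50, theta_prior, phi, sigma, psi, rng)
print(doc[:10])
```

Gibbs sampling inverts this process: given the observed words and collections, it resamples z and x for each token.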
Previous Work
• Comparative mixture model (ccMix)
– ChengXiang Zhai, Atulya Velivelli, Bei Yu. A cross-collection mixture model for comparative text mining. Proceedings of ACM KDD 2004.
• Improvements in ccLDA:
– Does not rely on user-defined parameters
– Distributions have Dirichlet/Beta priors
– Document-topic distributions have collection-dependent priors
– P(x) depends on the topic and collection
ccMix (2004)
Common: cd drive rw combo dvd
Dell: apoint blah hook tug 2499
Apple: airport burn 4x reads schools
IBM: t20 ultrabay tells device number
ccLDA (2009)
Common: drive cd dvd hard rw
Dell: battery laptop bay inspiron media
Apple: itunes burn imovie burning minutes
IBM: 2000 ultrabay hot device swappable
Cross-Cultural Analysis
Documents from or about 3 countries:
• United Kingdom
• India
• Singapore
Two datasets:
• 3,266 forum discussions collected from lonelyplanet.com, representing the perspective of tourists
• 7,388 English-language blogs collected through blogcatalog.com, representing the perspective of locals
Cross-Cultural Analysis
• Topic of religion from the blogs
Topic: god jesus lord life faith holy man christ church love
UK: church god john todd bentley christ luke bible christian sermon
India: krishna religion religious spiritual guru lord sri shri baba hindu
Singapore: god sin john spirit things lamb exodus suffering cross lives
Cross-Cultural Analysis
• Topic of entertainment from the blogs
• Compare against ccMix
ccLDA
Topic: music song new songs like album dance comments rock guitar
UK: music band album dance festival sound bands remix tracks amp
India: movie film movies songs films director best bollywood indian awards
Singapore: band music american japanese mark world video sound idol week
ccMix
Topic: comment posted like music just blog time labels post love
UK: music album band song songs new review track bands pop
India: kerala india tiger rajasthan birds water park city temple sanctuary
Singapore: kids baby cool desktop miss fun wallpaper love dont little
Cross-Cultural Analysis
• Topic of travel from the blogs
• Compare against LDA (on each collection individually)
ccLDA
Topic: travel hotel hotels city best place visit holiday trip world
UK: holiday holidays hotels spain london great surf breaks train ski
India: india delhi indian mumbai bangalore tour air dubai city mahindra
Singapore: singapore hong kong spa hotel beach chinese pictures restaurant bangkok
LDA
Topic: travel city hotel park holiday hotels place beach road visit
UK: travel holiday hotel city london park hotel place holidays hall
India: travel city beach place hotel temple road park hotels tourism
Singapore: travel hotel city park place beach trip hotels spa visit
Cross-Cultural Analysis
• Topic of food from both datasets
• Compare the view of tourists and locals
Perspective of Locals
Topic: food add chicken recipe cooking taste rice recipes sugar soup
UK: food wine restaurant coffee cheese soup eat chef english drink
India: recipe recipes powder indian salt tsp rice masala oil coriander
Singapore: coffee cup oil comments fried add restaurant rice tea seafood
Perspective of Tourists
Topic: food eat restaurant restaurants tea cheap meal eating cafe drink
UK: fish haggis chips respectability decent veggie pudding photoblog sausages sandwiches
India: cooking spices sick flour tomato batter ate cook olive recipe
Singapore: hawker satay stalls noodles roti stall seafood malay rochester noodle
Outline
• Overview of topic models
• Cross-Collection LDA
• Cross-cultural analysis with ccLDA
• Other applications of ccLDA
– Scientific research/literature analysis
– Media analysis and bias detection
• Model evaluation
• An alternative cross-collection model
Research Analysis
• 16,186 abstracts from computational linguistics and linguistics journals
• Interdisciplinary research topic discovery
• Topic evolution over time
Research Analysis
• Topic of communication
Topic: speech spoken interaction human discourse paper understanding task context communication goal users
Comp Ling: dialogue user systems information utterances dialogues utterance agent plan recognition agents research multi
Linguistics: social communication verbal women speakers speaker relationship interaction ways means behavior face men
Research Analysis
• Topic of parsing/grammars across two time intervals
Topic: parser grammar tree parsers grammars free context syntactic parse structure
Old (< 2000): number result corresponding networks known binding lr introduce consider recognition transformational ambiguous networks
New (>= 2000): dependency probabilistic stochastic treebank pcfg constraint lexicalized ccg projective robustness hpsg modeling treebanks
Media Analysis
• 623 news articles from msnbc.com and foxnews.com from August 2008
• Discover editorial differences within topics
Topic: percent economy prices market
MSNBC: stocks account trades tools spending consumers sales investors trading company
FOX News: oil drilling poverty offshore coverage insurance growing uninsured census congress
Topic: car vehicle cars fuel drive
MSNBC: diesel says autos camaro tax credit smaller mileage hybrid chevrolet
FOX News: mazda gallardo chrysler minivan horsepower lamborghini mph sports lp traffic
Model Evaluation
• Greater likelihood of held-out data than alternative models
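One common way to compare topic models on held-out data is the log-likelihood of unseen word counts, often reported as perplexity (lower is better). The sketch below shows that computation for a plain topic mixture. It assumes the document-topic mixtures of the held-out documents have already been estimated, and it is not necessarily the exact protocol used for the results on this slide. The function names and toy numbers are illustrative.

```python
import numpy as np

def heldout_log_likelihood(counts, doc_topic, topic_word):
    """Sum over documents and words of c(w, d) * log sum_z P(z|d) P(w|z)."""
    p_w_given_d = doc_topic @ topic_word            # (D, V), all entries must be > 0
    return float((counts * np.log(p_w_given_d)).sum())

def perplexity(counts, doc_topic, topic_word):
    """exp of the negative average per-token log-likelihood."""
    ll = heldout_log_likelihood(counts, doc_topic, topic_word)
    return float(np.exp(-ll / counts.sum()))

# Toy held-out counts: 2 documents, 4-word vocabulary (made-up values).
counts = np.array([[2, 1, 0, 1],
                   [0, 3, 1, 0]], dtype=float)
doc_topic = np.array([[0.9, 0.1],
                      [0.2, 0.8]])

uniform = np.full((2, 4), 0.25)                     # uninformative baseline model
fitted = np.array([[0.50, 0.25, 0.05, 0.20],
                   [0.05, 0.65, 0.25, 0.05]])       # hypothetical better-fitting topics

ppl_uniform = perplexity(counts, doc_topic, uniform)
ppl_fitted = perplexity(counts, doc_topic, fitted)
print(ppl_uniform, ppl_fitted)
```

A uniform model over V words has perplexity V, which gives a useful sanity check for any implementation.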
Model Evaluation
• Document classification: new vs. old
• Compare to Naive Bayes (NB) and SVM (linear kernel)
Alternative Model
• Similar to hierarchical Pachinko Allocation [Mimno et al, 2007]
• Model as 2-level hierarchy
Alternative Model
• Single, global set of “super-topics”
• One set of “sub-topics” for each collection
• Choose super-topic T from P(T|d)
• Choose sub-topic t from P(t|T,c)
• Choose hierarchy level l from P(l|t,T)
• If l = 0, choose word from P(w|T); else if l = 1, choose word from P(w|t)
Alternative Model
• This is just a generalization of ccLDA!
• ccLDA is the special case in which each super-topic T=j has exactly one sub-topic t=j, so that P(t=j|T=j) = 1 and P(t=i|T=j) = 0 for all i ≠ j
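The per-token generative steps of the alternative model, and the constraint that reduces it to ccLDA, can be sketched as below. This is an illustration under assumptions: the function `sample_word`, the array shapes, and the random toy parameters are introduced here, and a single collection is used for brevity.

```python
import numpy as np

def sample_word(p_super, p_sub, p_level, w_super, w_sub, rng):
    """Generate one token: super-topic T, sub-topic t, level l, then the word.

    p_super: (S,)   P(T | d)
    p_sub:   (S, K) P(t | T, c) for the document's collection c
    p_level: (K, S) P(l = 1 | t, T)
    w_super: (S, V) P(w | T)
    w_sub:   (K, V) P(w | t)
    """
    T = rng.choice(len(p_super), p=p_super)
    t = rng.choice(p_sub.shape[1], p=p_sub[T])
    l = int(rng.random() < p_level[t, T])
    dist = w_sub[t] if l == 1 else w_super[T]      # l = 0: super-topic, l = 1: sub-topic
    return int(T), int(t), l, int(rng.choice(len(dist), p=dist))

rng = np.random.default_rng(0)
S, V = 3, 10                                       # super-topics, vocabulary size
p_super = rng.dirichlet(np.ones(S))
w_super = rng.dirichlet(np.ones(V), size=S)
w_sub = rng.dirichlet(np.ones(V), size=S)          # one sub-topic per super-topic
p_level = rng.random((S, S))

# ccLDA special case: P(t=j | T=j) = 1 and P(t=i | T=j) = 0 for i != j.
p_sub_cclda = np.eye(S)

draws = [sample_word(p_super, p_sub_cclda, p_level, w_super, w_sub, rng)
         for _ in range(200)]
print(draws[:5])
```

With the identity matrix for P(t|T,c), every draw has t equal to T, so the sub-topic carries no extra information and the model behaves like ccLDA.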
Alternative Model
• Topic of religion in the blogs
Super-Topic: god 0.046994, lord 0.015877, jesus 0.012076, life 0.01143, faith 0.010692, church 0.010185, holy 0.009189, man 0.00882, world 0.00869, people 0.007574
UK 1: church 0.030402, john 0.017007, todd 0.016154, jesus 0.015552, bentley 0.014348, luke 0.012693, religion 0.012592, christ 0.012091, cross 0.011388, neville 0.009482
Sub-topic weight: 0.970483
Alternative Model
• Topic of religion in the blogs
Super-Topic: god 0.046994, lord 0.015877, jesus 0.012076, life 0.01143, faith 0.010692, church 0.010185, holy 0.009189, man 0.00882, world 0.00869, people 0.007574
India 1: religion 0.021439, krishna 0.019062, spiritual 0.014765, hindu 0.012343, lord 0.01216, religious 0.012114, guru 0.011108, mother 0.01088, shri 0.010194, sri 0.009646
Sub-topic weight: 0.984414
Alternative Model
• Topic of religion in the blogs
Super-Topic: god 0.046994, lord 0.015877, jesus 0.012076, life 0.01143, faith 0.010692, church 0.010185, holy 0.009189, man 0.00882, world 0.00869, people 0.007574
SG 1: god 0.032249, christ 0.018867, cross 0.015467, sin 0.012505, grace 0.012395, jesus 0.011957, john 0.011628, lamb 0.009982, mahendra 0.009489, good 0.009434
SG 2: daily 0.020028, free 0.016023, fast 0.014822, silent 0.014221, wait 0.012418, going 0.011818, sign 0.009414, friday 0.009214, health 0.008413, star 0.008413
Sub-topic weights: 0.851749, 0.102534
ccLDA
• Topic of religion from the blogs
Topic: god jesus lord life faith holy man christ church love
UK: church god john todd bentley christ luke bible christian sermon
India: krishna religion religious spiritual guru lord sri shri baba hindu
Singapore: god sin john spirit things lamb exodus suffering cross lives
Alternative Model
• Topic of politics in the blogs
Super-Topic: people 0.021148, government 0.016807, world 0.010694, obama 0.009229, political 0.00902, media 0.008975, politics 0.008669, country 0.008534, state 0.007906, rights 0.007413
UK 1: labour 0.049547, british 0.041125, workers 0.029925, european 0.026252, bbc 0.024908, david 0.017203, crisis 0.016934, immigration 0.014694, left 0.014336, trade 0.011648
UK 2: war 0.023458, world 0.01909, wales 0.019002, welsh 0.017823, brown 0.014503, britain 0.013498, gordon 0.012188, london 0.011445, politics 0.010004, anti 0.009916
Sub-topic weights: 0.29108, 0.699227
Alternative Model
• Topic of politics in the blogs
Super-Topic: people 0.021148, government 0.016807, world 0.010694, obama 0.009229, political 0.00902, media 0.008975, politics 0.008669, country 0.008534, state 0.007906, rights 0.007413
India 1: pakistan 0.052105, india 0.038041, kashmir 0.037222, state 0.023186, muslims 0.017312, muslim 0.016634, political 0.010647, taliban 0.010647, jammu 0.009461, kashmiri 0.00932
Sub-topic weight: 0.987059
Alternative Model
• Topic of politics in the blogs
Super-Topic: people 0.021148, government 0.016807, world 0.010694, obama 0.009229, political 0.00902, media 0.008975, politics 0.008669, country 0.008534, state 0.007906, rights 0.007413
SG 1: singapore 0.04263, world 0.027554, singaporeans 0.014817, people 0.013387, earth 0.012478, malaysia 0.011698, global 0.010398, say 0.010398, myanmar 0.009488, workers 0.008838
Sub-topic weight: 0.970675
ccLDA
• Topic of politics from the blogs
Topic: people government war world state political human rights said country
UK: news politics london media post obama war labour world bbc
India: pakistan india kashmir indian pakistani muslims state muslim brigade taliban
Singapore: singapore comments singaporeans labels chinese ago news world joo posted
Questions?