Event Intensity Tracking in Weblog Collections
Viet Ha-Thuc, Yelena Mejova, Christopher Harris, Dr Padmini Srinivasan
ICWSM 2009 Data Challenge Workshop
Presented by: Yelena Mejova
1
Outline
• Motivation: Topic Tracking
• Explore the weblog collection
• Event tracking approach
• Related work
• Results
– Event tracking
– Sub-event tracking
• Future directions
2
Event Tracking
• People talk:
– What
– When
– How much
4
Data Set
• Published by Spinn3r.com
• 44 million blog posts
• August 1, 2008 – October 1, 2008
• No comments
6
Data Set
(Chart: Languages)
7
Data Set
(Chart: Document Length)
8
Data Set
(Chart: Document Distribution by Date)
9
Data Set
(Chart: Popular Categories)
10
Data Set
• Our subset:
– 1 million documents (4% of all English posts)
– English only
– Inlink threshold of 400
11
Tracking Approach
• Phase I: Estimate relevance-based topic models
• Phase II: Estimate topical intensity
(Diagram labels: training docs, topic models, docs)
13
Relevance-based Topical Model
14
Relevance-based Topical Model
(Plate diagram; legend:)
• b: background topic (e.g., common English words)
• e_k: event topic
• t_o(d): other document-specific topic
• w: observed word token
• N: training document
• D: training documents for an event
• K: all training sets for all K events
15
Relevance-based Topical Model
• Inference
– Given a training set for each event considered
– b: all documents
– e_k: event training documents, not the rest
– t_o(d): one document, not the rest
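The scoping above (b shared by all documents, e_k shared only by the training documents of event k, t_o(d) private to one document) can be illustrated with a small sketch. This is not the inference procedure from the talk, which the slides do not spell out. It is a simplified EM estimate of a three-component unigram mixture, and the function names, the fixed mixing weights, and the iteration count are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def normalize(counts):
    """Turn a word -> count mapping into a word -> probability mapping."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def estimate_topics(training_sets, weights=(0.6, 0.3, 0.1), iters=20):
    """training_sets: {event name: [token list, ...]}.
    weights: ASSUMED fixed mixing weights (background, event, document).
    Returns (background, {event: topic}) as word -> probability dicts."""
    docs = [(k, Counter(toks)) for k, ds in training_sets.items() for toks in ds]

    # Initialise every distribution from raw counts within its own scope:
    # b from all documents, e[k] from event k's documents, t[i] from document i.
    b_counts, e_counts = Counter(), defaultdict(Counter)
    for k, c in docs:
        b_counts.update(c)
        e_counts[k].update(c)
    b = normalize(b_counts)
    e = {k: normalize(c) for k, c in e_counts.items()}
    t = [normalize(c) for _, c in docs]

    lam_b, lam_e, lam_t = weights
    for _ in range(iters):
        nb, ne = Counter(), defaultdict(Counter)
        nt = [Counter() for _ in docs]
        for i, (k, c) in enumerate(docs):
            for w, n in c.items():
                # E-step: share each token's count among the three topics.
                pb, pe, pt = lam_b * b[w], lam_e * e[k][w], lam_t * t[i][w]
                z = pb + pe + pt
                nb[w] += n * pb / z
                ne[k][w] += n * pe / z
                nt[i][w] += n * pt / z
        # M-step: re-estimate each topic from its expected counts.
        b = normalize(nb)
        e = {k: normalize(c) for k, c in ne.items()}
        t = [normalize(c) for c in nt]
    return b, e
```

With toy training sets that share a common word across events, the background topic tends to absorb that word while each event topic keeps its event-specific terms, which is the effect the model is after.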
16
Estimating intensities
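• From a subset (slice)
• Window: 5 days
$$\mathrm{Intensity}(e_i, t) = \sum_{d \in [t,\, t+w]} \log p(d \mid e_i)$$
– \log p(d \mid e_i): log-likelihood of a document given an event
– d \in [t, t+w]: documents at a particular window in time (w is the window length)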
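A minimal sketch of the intensity sum, under stated assumptions: the slides do not say how p(d|e_i) is computed, so this treats a document as a bag of words and multiplies the per-word probabilities of the event topic, with a small floor probability for unseen words. The function names, the `floor` parameter, and the (date, tokens) document layout are hypothetical.

```python
import math
from datetime import date, timedelta

def doc_log_likelihood(tokens, topic, floor=1e-6):
    """log p(d|e) under an ASSUMED unigram model: sum of log p(w|e).
    Words missing from the topic get the assumed floor probability."""
    return sum(math.log(topic.get(w, floor)) for w in tokens)

def intensity(docs, topic, start, window_days=5):
    """Sum log p(d|e) over documents dated within [start, start + window_days].
    docs: iterable of (date, token list) pairs."""
    end = start + timedelta(days=window_days)
    return sum(doc_log_likelihood(tokens, topic)
               for day, tokens in docs
               if start <= day <= end)

# Toy usage: only the two posts inside the window contribute.
topic = {"olymp": 0.5, "beij": 0.25}
posts = [
    (date(2008, 8, 8), ["olymp", "beij"]),
    (date(2008, 8, 10), ["olymp"]),
    (date(2008, 9, 1), ["olymp"]),
]
score = intensity(posts, topic, date(2008, 8, 8))
```

Because each term is a log-probability, the raw sum is negative. The slides do not state which documents are included or whether the sum is normalised, so this sketch applies the formula literally.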
17
Related Work
• Topic Evolution Extraction
– Zhou et al. 2006, Mei & Zhai 2005
• Topic Detection and Tracking
– Allan 2002, Allan et al. 1998
• Blog Mining
– Attardi & Simi 2006, Aschenbrenner & Miksch 2005, Kumar et al. 2003, Glance, Hurst, Tomokiyo 2004
• Relevance Modeling
– Robertson & Sparck-Jones 1988, Lavrenko & Croft 2001
19
Event Tracking
• News Events
– sources: wikipedia.org + news sites
– training subsets: retrieved using Lucene [2]
US Presidential Election
Economic Financial Crisis
Hurricane Tropical Storms
US Open Tennis
Russia Georgia Conflict
Beijing Olympics
China Milk Powder Scandal
Thai Political Crisis
Delhi India Bomb Blast
Pakistan Impeachment
21
Event Tracking
• Topic Estimation
Beijing Olympics
word P(w|BO)
olymp 0.075
beij 0.071
phelp 0.043
china 0.041
game 0.040
gold 0.023
august 0.021
michael 0.021
US Presidential Election
word P(w|USPE)
obama 0.064
mccain 0.050
palin 0.041
democrat 0.034
republican 0.030
clinton 0.019
biden 0.018
convent 0.017
22
Event Tracking
(Plot annotations:)
• Running mate announcements, National Conventions
• Olympics: Aug 8-24
• Phelps’ Eighth Medal: Aug 17
• Impeachment launched: Aug 7
• Formal impeachment charges: Aug 17
• Musharraf’s formal resignation: Aug 18
• Several Hurricanes
23
Event Tracking
• Are the spikes due to the sampling process?
• Topic Latency
– How long does it take for discussion to start?
• What is the effect of topic interference?
– Ex: Beijing Olympics China / China Milk Scandal
• What kinds of subtopics contribute to the main topics?
24
Sub-Event Tracking
Training set: event-specific
26
Sub-Event Tracking
27
Sub-Event Tracking
• Sub-topic Estimation
Democratic Convention
word P(w|DC)
obama 0.041
dnc 0.040
democrat 0.038
clinton 0.034
biden 0.034
denver 0.027
barack 0.021
hillari 0.012
Republican Convention
word P(w|RC)
palin 0.073
republican 0.063
mccain 0.050
sarah 0.029
rnc 0.025
song 0.009
paul 0.009
gop 0.009
28
Sub-Event Tracking
Democratic Convention: August 25 - 28
Republican Convention: September 1 - 4
29
Sub-Event Tracking
• Named: August 15; Landfall: August 18
• Named: August 25; Landfall: September 1
• Named: September 1; Landfall: September 13
30
Sub-Event Tracking
• Deeper hierarchies
• Re-define sub-topics
– Opinion, locale, other demographics
31
(Hierarchy diagram; node labels:)
Financial Crisis
Federal Reserve Bailout
AIG Goldman Sachs
Taxpayer Reaction
Congressional Reaction
Conflicts of Interest
Taxpayer Reaction
Financial Market Reaction
Conclusions
• Topic modeling
– Excluding non-relevant background and document-specific terms
• Topic tracking
– Closely corresponds with real world
– Hierarchical
• Scalability
32
Future Directions
• Baseline
– standard ad hoc retrieval approaches?
• Evaluation
– gold standard?
• Dynamic Topic Tracking
– moving time window
• Community Dynamics
• Topical Sentiment Analysis
34
Thank You
35
Works Cited
[1] Blei, D., Ng, A., Jordan, M. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 2003.
[2] Apache Lucene. http://lucene.apache.org/java/docs/
36