Event Intensity Tracking in Weblog Collections

Viet Ha-Thuc, Yelena Mejova, Christopher Harris, Dr Padmini Srinivasan

ICWSM 2009 Data Challenge Workshop

Presented by: Yelena Mejova

Outline

• Motivation: Topic Tracking
• Explore the weblog collection
• Event tracking approach
• Related work
• Results
  – Event tracking
  – Sub-event tracking
• Future directions

Event Tracking

• People talk:
  – What
  – When
  – How much

Data Set

• Published by Spinn3r.com
• 44 million blog posts
• August 1, 2008 – October 1, 2008
• No comments

Figures (not reproduced in this transcript): Languages; Document Length; Document Distribution by Date; Popular Categories

• Our subset:
  – 1 million documents (4% of all English posts)
  – English only
  – Inlink threshold of 400

Tracking Approach

• Phase I: Estimate relevance-based topic models
• Phase II: Estimate topical intensity

Diagram labels (figure not reproduced): training docs, topic models, docs

Relevance-based Topical Model

Graphical model legend (figure not reproduced):

• b: background topic (ex: common English words)
• ek: event topic
• to(d): other document-specific topic
• w: observed word token
• N: training document
• D: training documents for an event
• K: all training sets for all K events

• Inference
  – Given a training set for each event considered
  – Each topic is shared across a different scope:
    b: all documents
    ek: the training documents of event k, not the rest
    to(d): one document d, not the rest
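
The slides do not say how the three topic types are inferred. As an illustration only, the sketch below fits a three-way unigram mixture with plain EM: every word token in a training document is attributed to the background topic, to its event topic, or to a document-specific topic. The function name `estimate_topics`, the fixed mixing `weights`, and the number of iterations are all assumptions made for this example, and EM with fixed weights is a simplification that stands in for whatever procedure the authors actually used.

```python
from collections import Counter, defaultdict


def normalise(counts):
    """Turn a word -> (possibly fractional) count map into probabilities."""
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}


def estimate_topics(training_sets, weights=(0.5, 0.3, 0.2), iters=20):
    """Hypothetical EM sketch, not the authors' stated method.

    training_sets: {event_name: [token_list, ...]}
    weights: assumed fixed mixing proportions (background, event, document).
    Returns (background, {event_name: event_topic}) as word -> probability.
    """
    lam_b, lam_e, lam_d = weights
    docs = [(k, Counter(toks)) for k, dl in training_sets.items() for toks in dl]

    # Initialise each topic from the scope it is shared across:
    # b from all documents, ek from event k's training set, to(d) from d alone.
    bg_counts, ev_counts = Counter(), defaultdict(Counter)
    for k, c in docs:
        bg_counts.update(c)
        ev_counts[k].update(c)
    background = normalise(bg_counts)
    events = {k: normalise(c) for k, c in ev_counts.items()}
    doc_topics = [normalise(c) for _, c in docs]

    for _ in range(iters):
        new_bg = Counter()
        new_ev = defaultdict(Counter)
        new_doc = [Counter() for _ in docs]
        for i, (k, c) in enumerate(docs):
            for w, n in c.items():
                # E-step: responsibility of each topic for word w in doc i.
                pb = lam_b * background[w]
                pe = lam_e * events[k][w]
                pd = lam_d * doc_topics[i][w]
                z = pb + pe + pd
                # Accumulate expected counts for the M-step.
                new_bg[w] += n * pb / z
                new_ev[k][w] += n * pe / z
                new_doc[i][w] += n * pd / z
        # M-step: re-estimate every topic from its expected counts.
        background = normalise(new_bg)
        events = {k: normalise(c) for k, c in new_ev.items()}
        doc_topics = [normalise(c) for c in new_doc]
    return background, events
```

On a toy input, words common to every training set (such as "the") drift toward the background topic, while words confined to one event's training set gain weight in that event's topic, which is the separation the legend above describes.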

Estimating intensities

• From a subset (slice)
• Window: 5 days

$$\mathrm{Intensity}(e_i, t) = \sum_{d \in [t,\, t+w]} \log p(d \mid e_i)$$

• log p(d | ei): log-likelihood of a document given an event
• d ∈ [t, t+w]: documents at a particular window in time
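
A minimal sketch of the windowed sum, assuming each event topic is a unigram distribution stored as a word-to-probability dictionary, so that log p(d | e) is the sum of per-token log-probabilities. The function names, the `(date, tokens)` document layout, and the `floor` probability for words missing from the topic are assumptions for illustration; the slides do not specify smoothing or any normalisation.

```python
import math
from collections import Counter
from datetime import date, timedelta


def doc_log_likelihood(tokens, topic, floor=1e-6):
    """log p(d | e) under a unigram topic.

    `floor` is an assumed stand-in probability for words absent from the topic.
    """
    counts = Counter(tokens)
    return sum(n * math.log(topic.get(w, floor)) for w, n in counts.items())


def intensity(docs, topic, start, window_days=5):
    """Sum log p(d | e) over documents dated within [start, start + window]."""
    end = start + timedelta(days=window_days)
    return sum(
        doc_log_likelihood(tokens, topic)
        for day, tokens in docs
        if start <= day <= end
    )
```

Usage: with `docs` as a list of `(date, tokens)` pairs, `intensity(docs, topic, date(2008, 8, 6))` covers August 6 through August 11. Note that a raw sum of log-likelihoods is never positive and grows more negative as more documents fall in the window, so comparing windows with different document counts would need some normalisation that the slides leave unstated.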

Related Work

• Topic Evolution Extraction: Zhou et al 2006, Mei & Zhai 2005
• Topic Detection and Tracking: Allan 2002, Allan et al 1998
• Blog Mining: Attardi & Simi 2006, Aschenbrenner & Miksch 2005, Kumar et al 2003, Glance, Hurst, Tomokiyo 2004
• Relevance Modeling: Robertson & Sparck-Jones 1988, Lavrenko & Croft 2001

Event Tracking

• News Events
  – Sources: wikipedia.org + news sites
  – Training subsets: retrieved using Lucene [2]

US Presidential Election
Economic Financial Crisis
Hurricane Tropical Storms
US Open Tennis
Russia Georgia Conflict
Beijing Olympics
China Milk Powder Scandal
Thai Political Crisis
Delhi India Bomb Blast
Pakistan Impeachment

Event Tracking

• Topic Estimation

Beijing Olympics

word        P(w|BO)
olymp       0.075
beij        0.071
phelp       0.043
china       0.041
game        0.040
gold        0.023
august      0.021
michael     0.021

US Presidential Election

word        P(w|USPE)
obama       0.064
mccain      0.050
palin       0.041
democrat    0.034
republican  0.030
clinton     0.019
biden       0.018
convent     0.017

Event Tracking

Annotations on the intensity plots (figures not reproduced):

• Running mate announcements, National Conventions
• Olympics: Aug 8-24
• Phelps’ eighth medal: Aug 17
• Impeachment launched: Aug 7
• Formal impeachment charges: Aug 17
• Musharraf’s formal resignation: Aug 18
• Several hurricanes

Event Tracking

• Are the spikes due to the sampling process?
• Topic latency
  – How long does it take for discussion to start?
• What is the effect of topic interference?
  – Ex: Beijing Olympics China / China Milk Scandal
• What kinds of subtopics contribute to the main topics?

Sub-Event Tracking

Training set: event-specific

• Sub-topic Estimation

Democratic Convention

word        P(w|DC)
obama       0.041
dnc         0.040
democrat    0.038
clinton     0.034
biden       0.034
denver      0.027
barack      0.021
hillari     0.012

Republican Convention

word        P(w|RC)
palin       0.073
republican  0.063
mccain      0.050
sarah       0.029
rnc         0.025
song        0.009
paul        0.009
gop         0.009

Sub-Event Tracking

• Democratic Convention: August 25 - 28
• Republican Convention: September 1 - 4

Sub-Event Tracking

Annotations on the hurricane intensity plots (figures not reproduced):

• Named: August 15; Landfall: August 18
• Named: August 25; Landfall: September 1
• Named: September 1; Landfall: September 13

Sub-Event Tracking

• Deeper hierarchies
• Re-define sub-topics
  – Opinion, locale, other demographics

Example hierarchy (diagram labels, top level to bottom level):

Financial Crisis
  Federal Reserve Bailout
    AIG, Goldman Sachs
      Taxpayer Reaction, Congressional Reaction, Conflicts of Interest, Taxpayer Reaction, Financial Market Reaction

Conclusions

• Topic modeling
  – Excluding non-relevant background and document-specific terms
• Topic tracking
  – Closely corresponds with real world
  – Hierarchical
• Scalability

Future Directions

• Baseline
  – Standard ad hoc retrieval approaches?
• Evaluation
  – Gold standard?
• Dynamic Topic Tracking
  – Moving time window
• Community Dynamics
• Topical Sentiment Analysis

Thank You

Works Cited

[1] Blei, D., Ng, A., Jordan, M. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 2003.

[2] Apache Lucene. http://lucene.apache.org/java/docs/
