
A Unified Model for Stable and Temporal Topic Detection from Social

Media Data

Hongzhi Yin†, Bin Cui†, Hua Lu‡, Yuxin Huang† and Junjie Yao†

†Peking University  ‡Aalborg University

2 / 38

Outline

Motivation
Problem Formulation
A Basic Solution: A User-Temporal Mixture Model
Enhancement of the Basic Solution: Regularization Technique; Burst-Weighted Boosting
Experiments
Q/A

3 / 38

Outline

Motivation
Problem Formulation
A Basic Solution: A User-Temporal Mixture Model
Enhancement of the Basic Solution: Regularization Technique; Burst-Weighted Boosting
Experiments
Q/A

4 / 38

Motivation

5 / 38

Motivation (Cont.)

Two different types of topics are mixed together on social media platforms such as Twitter, Weibo, and Delicious:

Temporal topics are temporally coherent, meaningful themes. They are time-sensitive and often concern popular real-life events or hot spots, i.e., breaking events in the real world.

Stable topics often concern users' regular interests and their daily routine discussions, e.g., their moods and statuses.

6 / 38

One Example in Twitter

Temporal Topic: Dead pigs in Shanghai. Stable Topic: Big Data.

7 / 38

Another Example in Twitter

Temporal Topic: Independence Day

Stable Topic: Animal Adoption

8 / 38

We can tell the difference between temporal and stable topics from their temporal distributions and their description words.

10 / 38

Outline

Motivation
Problem Formulation
A Basic Solution: A User-Temporal Mixture Model
Enhancement of the Basic Solution: Regularization Technique; Burst-Weighted Boosting
Experiments
Q/A

11 / 38

Problem Formulation

A user-time-associated document d is a text document associated with a time stamp and a user.

A temporal topic is a temporally coherent theme; in other words, words emerging close together in time are clustered into the same topic. An example of temporal topics: given a collection of user-time-associated tweets, the desired temporal topics are the events happening at different times.

Formally, a temporal/stable topic is represented by a word distribution p(w | z), where the probabilities sum to one over the vocabulary.

12 / 38

Problem Formulation (Cont.)

A topic distribution in the time dimension is the distribution of topics given a specific time interval. Formally, p(z | t) is the probability of temporal topic z given time interval t.

A topic distribution in the user space is the distribution of topics given a specific user. Formally, p(z' | u) is the probability of stable topic z' given user u.

13 / 38

Problem Formulation (Cont.)

A User-Time-Keyword Matrix M is a hyper-matrix whose three dimensions refer to user, time, and keyword. A cell M[u, t, w] stores the frequency of word w generated by user u within time interval t.

Given a collection of user-time-associated documents C, we first formulate the matrix M.

Task 1: Detecting temporal topics.

Task 2: Extracting stable topics.
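The construction of the user-time-keyword matrix M can be sketched as follows. This is a minimal illustration, not the authors' code; the function name, the daily discretization of timestamps, and whitespace tokenization are all our assumptions.

```python
from collections import Counter

def build_matrix(docs, interval_seconds=86400):
    """Build a sparse user-time-keyword count matrix M.

    docs: iterable of (user, unix_timestamp, text) triples.
    Returns a Counter mapping (user, time_interval, word) -> frequency,
    a sparse stand-in for the hyper-matrix M[u, t, w].
    """
    M = Counter()
    for user, ts, text in docs:
        t = int(ts // interval_seconds)      # discretize time into intervals
        for w in text.lower().split():       # naive whitespace tokenization
            M[(user, t, w)] += 1
    return M
```

In practice the matrix is extremely sparse, which is why a dictionary of nonzero cells is a more natural representation than a dense three-dimensional array.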

14 / 38

Problem Formulation (Cont.)

Detecting a set of temporal topics that are event-driven:
  Detecting bursty events, such as disasters (e.g., earthquakes), politics (e.g., elections), and public events (e.g., the Olympics)
  Analyzing topic trends

Extracting a set of stable topics that are interest-driven:
  Finding users' intrinsic interests and better modeling user preferences

15 / 38

Outline

Motivation
Problem Formulation
A Basic Solution: A User-Temporal Mixture Model
Enhancement of the Basic Solution: Regularization Technique; Burst-Weighted Boosting
Experiments
Q/A

16 / 38

A User-Time Mixture Model: Main Insights

To find both temporal and stable topics in a unified manner, we propose a topic model that simultaneously captures two observations:

● Words generated around the same time are more likely to have the same event-driven temporal topic.

● Words generated by the same user are more likely to have the same interest-driven stable topic.

The former helps find event-driven temporal topics, while the latter helps identify interest-driven stable topics.

17 / 38

Combining user and time information: We assume that when a user u generates a word w at time t, he/she is probably influenced by two factors: the breaking news/events occurring at time t and his/her intrinsic interests.

Breaking events are modeled by temporal topics, and user intrinsic interests are modeled by stable topics.

18 / 38

The likelihood that user u generates word w at time t is as follows:
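The equation image is missing from this transcript. Based on the distributions defined in the problem formulation, the mixture likelihood presumably takes a form like the following, where the mixing-weight symbols λ_TT and λ_TS and the temporal/stable topic sets Z_T and Z_S are our notation, not necessarily the authors':

```latex
p(w \mid u, t) \;=\; \lambda_{TT} \sum_{z \in Z_T} p(w \mid z)\, p(z \mid t)
\;+\; \lambda_{TS} \sum_{z' \in Z_S} p(w \mid z')\, p(z' \mid u),
\qquad \lambda_{TT} + \lambda_{TS} = 1.
```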

The two mixing-weight parameters control the choice of motivation factor, and they also denote the proportions of temporal and stable topics in the dataset. It is worth mentioning that they are learned automatically, instead of being fixed.

19 / 38

Parameter Estimation

The log-likelihood of the whole user-time-associated document collection C is

We use the E-M algorithm to estimate the parameters: the E-step computes the expectation Q(θ; θⁿ), and the M-step maximizes it with a closed-form solution.

Please refer to the details of E-M algorithm in Section 4.2

20 / 38

Parameter Estimation

E-step:

M-step:

21 / 38
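The E-step and M-step formula images are missing from this transcript. As an illustration only, the updates for a simplified version of the user-temporal mixture can be sketched as below; the variable names, the single global pair of mixing weights, and the random toy data are our assumptions, not the authors' exact formulation.

```python
# Minimal EM sketch for a simplified user-temporal mixture model.
import numpy as np

rng = np.random.default_rng(0)

U, T, W = 3, 4, 6        # users, time intervals, vocabulary size
KT, KS = 2, 2            # numbers of temporal and stable topics
M = rng.integers(0, 5, size=(U, T, W)).astype(float)  # toy user-time-keyword counts

def norm(a, axis):
    return a / a.sum(axis=axis, keepdims=True)

phi_T   = norm(rng.random((KT, W)), 1)  # p(w | temporal topic z)
phi_S   = norm(rng.random((KS, W)), 1)  # p(w | stable topic z')
theta_t = norm(rng.random((T, KT)), 1)  # p(z | t)
theta_u = norm(rng.random((U, KS)), 1)  # p(z' | u)
lam = np.array([0.5, 0.5])              # mixing weights: temporal vs. stable

def log_likelihood():
    pt = theta_t @ phi_T                # (T, W): sum_z  p(z | t)  p(w | z)
    ps = theta_u @ phi_S                # (U, W): sum_z' p(z' | u) p(w | z')
    p = lam[0] * pt[None, :, :] + lam[1] * ps[:, None, :]   # (U, T, W)
    return float((M * np.log(p + 1e-12)).sum())

ll_start = log_likelihood()
for _ in range(50):
    # E-step: unnormalized responsibility of every topic for each (u, t, w) cell.
    rT = lam[0] * theta_t[:, :, None] * phi_T[None, :, :]   # (T, KT, W)
    rS = lam[1] * theta_u[:, :, None] * phi_S[None, :, :]   # (U, KS, W)
    denom = rT.sum(1)[None, :, :] + rS.sum(1)[:, None, :] + 1e-12  # (U, T, W)
    # Expected counts: observed counts split according to responsibilities.
    nT = M[:, :, None, :] * rT[None, :, :, :] / denom[:, :, None, :]  # (U, T, KT, W)
    nS = M[:, :, None, :] * rS[:, None, :, :] / denom[:, :, None, :]  # (U, T, KS, W)
    # M-step: closed-form re-estimates from the expected counts.
    phi_T   = norm(nT.sum((0, 1)) + 1e-12, 1)
    theta_t = norm(nT.sum((0, 3)) + 1e-12, 1)
    phi_S   = norm(nS.sum((0, 1)) + 1e-12, 1)
    theta_u = norm(nS.sum((1, 3)) + 1e-12, 1)
    lam = np.array([nT.sum(), nS.sum()])
    lam = lam / lam.sum()
ll_end = log_likelihood()
```

Each iteration is guaranteed (up to the small smoothing constants) not to decrease the data log-likelihood, which is a useful sanity check when implementing EM for mixtures like this one.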

Outline

Motivation
Problem Formulation
A Basic Solution: A User-Temporal Mixture Model
Enhancement of the Basic Solution: Regularization Technique; Burst-Weighted Boosting
Experiments
Q/A

22 / 38

Spatial Regularization

Intuitions: If two users are connected in the social-network space, they are more likely to share the same or similar interests/topics. A topic is interest-coherent if the people interested in this topic are also close in the network space.

[Figure: a user connected mostly to DB researchers. Is he/she more likely to be a DB person or an IR person? Intuition: users' interests are similar to their neighbors'.]

23 / 38

Spatial Regularization

Topic Model with Spatial Regularization: a regularized data likelihood is defined by augmenting the data likelihood with a spatial regularizer.

The spatial regularizer plays the role of spatial smoothing for user interests.
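The regularized-likelihood formula image is missing from this transcript. A common graph-harmonic form consistent with "smoothing user interests by their neighbors" would be the following, where λ, the edge set E, and the stable topic set Z_S are our notation, not necessarily the authors':

```latex
\mathcal{L}_R(C) \;=\; \mathcal{L}(C) \;-\; \frac{\lambda}{2}
\sum_{(u,v) \in E} \sum_{z' \in Z_S}
\bigl( p(z' \mid u) - p(z' \mid v) \bigr)^2,
```

where L(C) is the data log-likelihood and E is the set of edges in the social network.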

24 / 38

Parameter Estimation

We again use the E-M algorithm, now on the regularized complete log-likelihood: the E-step computes the expectation Q(θ; θⁿ), and the M-step maximizes it using Newton-Raphson.

Smoothing is done through the spatial regularizer: in each iteration, a user's interest distribution is smoothed by his/her spatial neighbors.

25 / 38

Outline

Motivation
Problem Formulation
A Basic Solution: A User-Temporal Mixture Model
Enhancement of the Basic Solution: Regularization Technique; Burst-Weighted Boosting
Experiments
Q/A

26 / 38

Insights

In topic models, words with high occurrence rates, i.e., popular words, have high probabilities of appearing at the top positions of each discovered topic.

These popular words are mostly general words denoting abstract concepts. In stable topics, they can illustrate the domain of a topic at first glance.

In temporal topics, however, words with a notable bursty feature are superior at expressing temporal information, since users are more interested in bursty words than in abstract concepts when browsing a temporal topic.

27 / 38

Example: Michael Jackson’s Death

In this temporal topic, we expect the bursty words "mj", "michael jackson", and "moonwalk" to become the dominant words, rather than the general words "world", "news", and "death".

But they cannot be removed as stop words, since they can help illustrate the stable topics.

28 / 38

Burst-Weighted Boosting: We implement a bursty boosting step to escalate the probability of bursty words during the procedure of detecting temporal topics.

We first compute the bursty degree of each word in each time interval (Yao et al. ICDE'10). A boosting step is then taken after every few E-M iterations, as follows.

In this step, a word w will have its generation probability boosted in a temporal topic only if w’s bursty period overlaps with that of the topic.
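The boosting rule described above can be sketched as follows. The exact re-weighting formula is not shown in the transcript, so the multiplicative boost, the function name, and the burst-score representation here are our assumptions; only the overlap condition comes from the slide.

```python
def boost_bursty(phi_z, burst, topic_period, eta=1.0):
    """Re-weight the word distribution p(w | z) of one temporal topic z.

    phi_z: dict word -> probability.
    burst: dict word -> ((start, end), degree), the word's bursty period
           and bursty degree; words absent from `burst` are not boosted.
    topic_period: (start, end) active period of the temporal topic.
    """
    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    boosted = {}
    for w, p in phi_z.items():
        period, degree = burst.get(w, ((0, -1), 0.0))  # default: no burst
        # Boost only if the word's bursty period overlaps the topic's period.
        factor = 1.0 + eta * degree if overlaps(period, topic_period) else 1.0
        boosted[w] = p * factor
    z = sum(boosted.values())
    return {w: p / z for w, p in boosted.items()}   # renormalize
```

Renormalizing after the boost keeps p(w | z) a valid distribution, so the E-M iterations can continue from the boosted topic unchanged.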

29 / 38

Outline

Motivation
Problem Formulation
A Basic Solution: A User-Temporal Mixture Model
Enhancement of the Basic Solution: Regularization Technique; Burst-Weighted Boosting
Experiments
Q/A

30 / 38

Data Sets

Twitter data set (Mar. 2009 to Oct. 2009)
Delicious data set (Feb. 2008 to Dec. 2009)
Sina Weibo data set (2011)

31 / 38

Data Sets (Cont.)

Twitter: People on this platform often discuss social events as well as their daily lives. The data set contains 9,884,640 tweets posted by 456,024 users in the period of Mar. 2009 to Oct. 2009. Each user in this data set published at least 200 posts. We first removed all the stop words.

Delicious: Delicious is a collaborative tagging system on which users can upload and tag web pages. We collected 200,000 users and their tagging behaviors from the period of Feb. 2008 to Dec. 2009. The data set contains 7,103,622 tags. Topics on technology and electronics cover more than half of the tags. Breaking news also co-exists.

32 / 38

Compared Methods

Our models:
  BUT is the basic model.
  EUTS is the model enhanced with spatial regularization.
  EUTB is the model enhanced with both spatial regularization and burst-weighted boosting.

Baselines:
  PLSA Model on Time Slices (Mei et al. KDD'05)
  Individual Detection Method (Wang et al. KDD'07)
  Topic Over Time Model (TOT) (Wang et al. KDD'06)
  TimeUserLDA (Diao et al. ACL'12)

34 / 38

Time Stamp Prediction Comparison

[Bar chart: time stamp prediction accuracy (scale 0 to 0.8) for the compared methods EUTB, EUTS, BUT, TOT, TimeUserLDA, and Individual Detection.]

35 / 38

Topic Quality Comparison

Excellent: a nicely presented temporal topic;
Good: a topic containing bursty features;
Poor: a topic without obvious bursty features.

36 / 38

Stable Topics Detected in Delicious

T 10            T 16              T 27            T 55                T 8                T 33
windows 0.049   resources 0.034   news 0.107      u.s. 0.096          programming 0.028  food 0.034
tools 0.048     education 0.031   latest 0.102    news 0.081          python 0.019       recipe 0.033
freeware 0.038  interactive 0.020 current 0.099   politics 0.076      ruby 0.016         cooking 0.030
firefox 0.038   teaching 0.020    world 0.094     democrats 0.068     javascript 0.015   dessert 0.026
google 0.029    science 0.019     events 0.084    international 0.064 software 0.014     shopping 0.021
security 0.028  tools 0.015       newspaper 0.084 obama 0.061         tutorial 0.011     home 0.016

37 / 38

Temporal Topics Detected in Delicious

T77 T78 T 87 T89

1.12-1.31 6.15-6.27 4.24-5.6 5.27-6.6

obama 0.144 moon 0.090 flu 0.158 google 0.061

inauguration 0.106 Space 0.068 swineflu 0.078 googlewave 0.059

bush 0.059 apollo11 0.032 pandemic 0.062 wave 0.042

president 0.021 apollo 0.023 swine 0.050 bing 0.040

gaza 0.017 nasa 0.018 health 0.020 apps 0.040

whitehouse 0.012 competition 0.015 disease 0.010 realtime 0.038

38 / 38

Stable Topics Detected in Twitter

T 5           T 6           T 11          T 53            T 39          T 22
free 0.020    free 0.007    day 0.104     assassin 0.039  god 0.015     teeth 0.035
market 0.011  iphone 0.006  travel 0.009  attempt 0.034   day 0.013     white 0.027
money 0.010   video 0.006   hotel 0.008   wound 0.024     follow 0.010  mom 0.027
people 0.007  photo 0.006   check 0.006   level 0.020     free 0.009    yellow 0.023
check 0.007   camera 0.004  site 0.004    reach 0.016     look 0.008    trick 0.022
help 0.006    apple 0.004   golf 0.004    account 0.01    check 0.006   free 0.021

39 / 38

Temporal Topics Detected in Twitter

T63 T86 T 66 T70

7.6-7.15 7.1-7.6 10.7-10.15 6.24-6.30

july 0.012 july 0.035 free 0.012 michael 0.038

free 0.010 happy 0.020 nobel 0.012 jackson 0.036

summer 0.008 day 0.016 prize 0.011 rip 0.007

live 0.007 firework 0.009 peace 0.008 farrah 0.007

potter 0.006 independ 0.006 win 0.008 dead 0.005

harry 0.006 celebrate 0.005 obama 0.008 sad 0.005

43 / 38

Temporal Topic Trends Analysis

44 / 38

Temporal Topic Trends Analysis (Cont.)

45 / 38

Outline

Motivation
Problem Formulation
A Basic Solution: A User-Temporal Mixture Model
Enhancement of the Basic Solution: Regularization Technique; Burst-Weighted Boosting
Experiments
Q/A

Thank You! Any Questions?

Email: [email protected]