Trending Topics on Twitter Improve the Prediction of Google Hot Queries

Gabriele Tolomei, Università Ca’ Foscari Venezia, Italy
Federica Giummolè, Università Ca’ Foscari Venezia, Italy
Salvatore Orlando, Università Ca’ Foscari Venezia, Italy

2013 ASE/IEEE International Conference on Social Computing, September 8th-14th, 2013 - Washington D.C., USA

Monday, September 30, 13



Agenda: Social vs. Web Trends

• Introduction

• Methodology

• Experiments & Results

• Conclusion

2013 ASE/IEEE SocialCom, 09/09/2013, Washington DC, USA


Twitter

• The most popular real-time microblogging service

• ~ 500M users

• ~ 400M tweets per day on avg. (as of 2012)

• Tweets limited to 140 characters

• Social trends pushed by the social network via user-generated content

• hashtags (#)

• trending topics

Google

• The most popular Web search engine

• ~ 5B search queries per day on avg. (as of 2012)

• Web trends derived from search keywords issued by users

• Zeitgeist

• Google (Hot) Trends


Social vs. Web Trends

[Figure: example trending keywords, shown in three groups]

• 49ers, dow jones, nba, obama 2016, world war z, ...

• 50 cent, democrats, iphone 5, romney, windows 8, ...

• anne hathaway, barack obama, election, nyc marathon, veterans day, ...

Which Came First?

[Figure: Volume Index (0-100) over Timestamp (11-01 to 11-15) for the keyword "election"; series: Google, Twitter]

Our claim is that a trending topic on Twitter could later become a hot query on Google


Data Collection

• 15 consecutive days of crawling

• from 2012-11-01 00:00:00 UTC to 2012-11-15 23:59:59 UTC

• Google

• Hot Trends

• Twitter

• Trending Topics

• Public Timelines

[Diagram labels: Streaming API, Search API, Atom feed]

Google Hot Trends

[Diagram: top-20 hourly US queries; pre-processing & cleaning; keyword vocabulary VY with elements y, |VY| = 190]

Example keywords: 49ers, election, obama 2016, world war z, ...

Search Volume Index

Normalized integer score in [0,100]

Daily relative searches for a keyword limited to a specific country within a range of dates


Twitter Trending Topics

[Diagram: top-10 trending topics every 5 minutes; top-10 hourly aggregated; pre-processing & cleaning; keyword vocabulary VX with elements x, |VX| = 892]

Example keywords: 50 cent, iphone 5, election, windows 8, ...

Trend Volume Index

• Use the crawled public timelines: ~ 260M tweets (a 10% random sample)

• To be consistent with Google

• daily relative number of tweets mentioning a particular keyword (could be hourly!)

• normalized integer score in [0,100]

• limited to US and within a range of dates
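
A minimal sketch of how such an index could be computed, assuming per-day counts are already extracted from the sampled timelines. The slides do not give the exact formula: the function name `volume_index`, its inputs, and the choice to scale the busiest day to 100 are all assumptions made for illustration.

```python
def volume_index(keyword_counts, total_counts):
    """Return a normalized integer score in [0, 100] for each time unit.

    keyword_counts[i]: tweets mentioning the keyword on day i (hypothetical input)
    total_counts[i]:   all sampled tweets on day i (hypothetical input)
    """
    # Relative share of tweets mentioning the keyword in each time unit.
    shares = [k / t if t > 0 else 0.0 for k, t in zip(keyword_counts, total_counts)]
    peak = max(shares, default=0.0)
    if peak == 0.0:
        return [0 for _ in shares]
    # Assumption: scale so the busiest time unit gets 100, then round.
    return [round(100 * s / peak) for s in shares]


index = volume_index([5, 50, 20], [1000, 1000, 800])
print(index)
```

Using relative shares rather than raw counts keeps the score comparable across days with different overall tweet volume.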


Trend Time Series

• 15 daily observations T = <t1, ..., t15>

• Google

• Hot Trends + Search Volume Index

• e.g., Yt = election = <5,...,7,40,100,...,15,...>

• Twitter

• Trending Topics + Trend Volume Index

• e.g., Xt = election = <6,...,10,100,55,...,5,...>


Trend Pairing

• Not every pair of Google/Twitter trend time series is worth analyzing!

• anne hathaway vs. veterans day

• We focus only on trends that are “similar enough” to each other

• election vs. election

• election vs. barack obama


Trend Bipartite Graph

[Figure: bipartite graph between the vocabularies VX and VY; an edge between keywords x and y is weighted by their trend similarity]

• 49ers, dow jones, election, nba, obama 2016, world war z, ...

• 50 cent, democrats, iphone 5, election, romney, windows 8, ...

Trend Similarity

• Edge weighting scheme of the TBG

• string/lexical: e.g., Levenshtein, Jaccard, n-grams, etc.

• semantic: e.g., Wikipedia-based

• We use the normalized longest common subsequence (nlcs) between two keywords
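
A minimal sketch of a normalized longest common subsequence score between two keywords. The slides do not state how the LCS length is normalized; dividing by the length of the longer keyword is an assumption, chosen so that the score is 1.0 only for identical strings. The function names are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the best subsequence that ends before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one character from either string.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def nlcs(x: str, y: str) -> float:
    """LCS length divided by the longer keyword's length (assumed normalization)."""
    if not x or not y:
        return 0.0
    return lcs_length(x, y) / max(len(x), len(y))


print(nlcs("election", "election"))
print(nlcs("election", "election day"))
```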


Datasets

• 2 thresholds on nlcs, η1 = 1.0 and η2 = 0.6, lead to 2 TBGs

• D1 = {(Xt, Yt) | nlcs(x, y) = η1}, |D1| = 50

• D2 = {(Xt, Yt) | nlcs(x, y) ≥ η2}, |D2| = 69

• Aggregate and normalize the Twitter time series linked to the same Google keyword in the TBG

• |VX| > |VY|
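
A rough sketch of how a dataset of paired series could be assembled from a similarity threshold. The slides only say that the linked Twitter series are aggregated and normalized. Summing them day by day and rescaling the peak to 100 is an assumption, and `build_dataset` and its arguments are hypothetical names.

```python
def build_dataset(twitter, google, similarity, eta):
    """Pair each Google series with the aggregate of its similar Twitter series.

    twitter, google: dicts mapping keyword -> list of daily index values
    similarity:      function (x, y) -> score in [0, 1], e.g. an nlcs
    eta:             minimum similarity for an edge of the bipartite graph
    """
    dataset = {}
    for y, y_series in google.items():
        linked = [xs for x, xs in twitter.items() if similarity(x, y) >= eta]
        if not linked:
            continue  # this Google keyword has no similar Twitter trend
        # Assumed aggregation: sum the linked Twitter series day by day.
        summed = [sum(day) for day in zip(*linked)]
        peak = max(summed)
        # Rescale to integers in [0, 100] so both series share one range.
        scaled = [round(100 * v / peak) if peak else 0 for v in summed]
        dataset[y] = (scaled, y_series)
    return dataset


# Toy similarity standing in for nlcs: exact match or prefix match.
def toy_similarity(x, y):
    return 1.0 if x == y else (0.7 if x.startswith(y) else 0.0)


twitter = {"election": [10, 100, 50], "election day": [0, 100, 50], "iphone 5": [100, 0, 0]}
google = {"election": [5, 40, 100], "49ers": [1, 2, 3]}
d1 = build_dataset(twitter, google, toy_similarity, eta=1.0)
d2 = build_dataset(twitter, google, toy_similarity, eta=0.6)
print(d1)
print(d2)
```

Lowering the threshold links more Twitter keywords to the same Google keyword, which is why aggregation is needed when |VX| > |VY|.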


Research Questions

1) Is there any relation between a particular pair (Xt, Yt)?

• Cross-Correlation (lagged relationship)

2) Are variables from Twitter time series useful to forecast those from Google?

• Time series regression

... Why not the opposite?

Because in our data the same trend appears first on Twitter about 70% of the time


Cross-Correlation

• Measures the correlation between two time series Xt, Yt shifted by δ time units

• Xt refers to Twitter and Yt refers to Google

• min δ = 1 day

• Check for which δ the cross-correlation is maximum

• X leads Y if one or more Xt+δ are predictors of Yt and δ < 0

• X lags Y, otherwise
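
A small sketch of the lag search described above, assuming NumPy is available. It pairs X at time t+δ with Y at time t and computes a plain Pearson correlation on the overlapping part. This is a simplification of the usual sample cross-correlation function, and the names `cross_correlation` and `best_lag` are hypothetical.

```python
import numpy as np


def cross_correlation(x, y, delta):
    """Pearson correlation between x[t + delta] and y[t] on the overlap."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if delta < 0:
        xs, ys = x[:delta], y[-delta:]   # earlier X paired with later Y
    elif delta > 0:
        xs, ys = x[delta:], y[:-delta]   # later X paired with earlier Y
    else:
        xs, ys = x, y
    return float(np.corrcoef(xs, ys)[0, 1])


def best_lag(x, y, max_lag=3):
    """Lag in [-max_lag, max_lag] at which the cross-correlation is maximum."""
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda d: cross_correlation(x, y, d))


# Toy series: y repeats x one day later, so X should lead Y (delta = -1).
x = [0, 1, 0, 3, 0, 2, 0, 5, 1, 0]
y = [0, 0, 1, 0, 3, 0, 2, 0, 5, 1]
lag = best_lag(x, y)
print(lag)
```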


Lagged Relationship

Most pairs of time series exhibit their max cross-correlation at lag δ = 0

Nevertheless, some exceptions occur and cross-correlation at lag δ = -1 is still significant

Twitter as measured one day before could help explain Google

Time Series Regression

• Relate Y (dependent variable) to a parametric function of a set of explanatory variables X1,...,Xr

• The most widely used function is linear in the parameters

• Linear Regression

$$Y = X\beta + \varepsilon$$

where Y is a k×1 column vector, X is a k×r matrix of observed values for X1,...,Xr, β is the vector of parameters, and ε is a k×1 column vector of errors

Ordinary Least Squares

• Technique to estimate the real vector of coefficients β

• Choose β’ such that:

$$\beta' = \arg\min_{\beta} \{(Y - X\beta)^{T} (Y - X\beta)\}$$

$$\beta' = (X^{T} X)^{-1} X^{T} Y$$
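
The closed-form estimate can be checked numerically with a few lines of NumPy. This is only an illustrative sketch on made-up, noise-free data. The function name `ols` is hypothetical, and it solves the normal equations instead of forming the inverse explicitly, which gives the same β’ when X has full column rank.

```python
import numpy as np


def ols(X, Y):
    """Estimate beta by solving the normal equations (X^T X) beta = X^T Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    return np.linalg.solve(X.T @ X, X.T @ Y)


# Toy data: y = 2 + 3*x exactly; the column of ones models the intercept.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
Y = 2.0 + 3.0 * x
beta = ols(X, Y)
print(beta)
```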


Autoregressive: AR(p)

• The simplest time series regression model

• Relate a variable Yt to a linear combination of up to p of its previous values

$$Y_t = \alpha + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + \varepsilon_t$$

where α, φ1, ..., φp are the parameters and εt is random noise

Distributed Lag: DL(q)

• The dependent variable Yt is related only to the explanatory variable Xt and its q previous values (q+1 terms)

$$Y_t = \alpha + \psi_1 X_t + \psi_2 X_{t-1} + \dots + \psi_{q+1} X_{t-q} + \varepsilon_t$$

where α, ψ1, ..., ψq+1 are the parameters and εt is random noise

Autoregressive Distributed Lag: ADL(p,q)

• Relate the dependent variable Yt to lags of itself and of an explanatory variable Xt

$$Y_t = \alpha + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + \psi_1 X_t + \psi_2 X_{t-1} + \dots + \psi_{q+1} X_{t-q} + \varepsilon_t$$

where α, φ1, ..., φp, ψ1, ..., ψq+1 are the parameters and εt is random noise
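
A sketch of how an ADL(p,q) model could be fitted by least squares, assuming NumPy. Each row of the regression matrix holds a constant, the p lags of Y, and X at lags 0 to q. The helper names `adl_design` and `fit_adl` and the synthetic series are made up for illustration and are not taken from the authors' code or data. Setting p = 0 gives a DL(q) model.

```python
import numpy as np


def adl_design(y, x, p, q):
    """Build the regression matrix and target vector for ADL(p, q)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    start = max(p, q)  # first t for which every required lag exists
    rows = []
    for t in range(start, len(y)):
        y_lags = [y[t - i] for i in range(1, p + 1)]   # Y_{t-1} ... Y_{t-p}
        x_lags = [x[t - j] for j in range(0, q + 1)]   # X_t ... X_{t-q}
        rows.append([1.0] + y_lags + x_lags)           # leading 1 for alpha
    return np.array(rows), y[start:]


def fit_adl(y, x, p, q):
    """Return [alpha, phi_1..phi_p, psi_1..psi_{q+1}] estimated by least squares."""
    design, target = adl_design(y, x, p, q)
    coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coeffs


# Synthetic, noise-free series of 15 observations following an ADL(1,1) process.
x = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9], dtype=float)
y = np.empty_like(x)
y[0] = 1.0
for t in range(1, len(x)):
    y[t] = 1.0 + 0.5 * y[t - 1] + 2.0 * x[t] - 1.0 * x[t - 1]

coeffs = fit_adl(y, x, p=1, q=1)
print(np.round(coeffs, 3))
```

With only 15 daily observations, each extra lag removes one usable row, so small values of p and q are the practical choice.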

Model Comparison

• We measure how often each model, AR(p), DL(q) or ADL(p,q), retains its lagged component as significant

• Null hypothesis H0: “the lagged coefficient is not significant”

• Rejecting H0 means that the lagged coefficient is useful to fit the data

• H0 is rejected whenever the p-value is below a significance level α (e.g., α = .05)


Model Evaluation

• Compute both R² ∈ [0,1] and its adjusted variant, which penalizes models with too many explanatory terms

• Describes how well a regression line fits the observed data

• Provides a measure of how well future observations are likely to be predicted by the model
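
For reference, both scores can be computed as in the sketch below, which uses the standard textbook definitions and assumes NumPy. The function names are hypothetical, and `n_terms` counts explanatory terms excluding the intercept.

```python
import numpy as np


def r_squared(y, y_hat):
    """1 - (residual sum of squares) / (total sum of squares)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def adjusted_r_squared(r2, n_obs, n_terms):
    """Penalize R^2 for the number of explanatory terms (intercept excluded)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_terms - 1)


r2 = r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(r2, adjusted_r_squared(r2, n_obs=4, n_terms=1))
```

The adjusted score only rises when an added term improves the fit by more than would be expected by chance, which matters with as few as 15 observations.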


AR(p) vs. DL(q)

On both D1 and D2, DL(q) models retain their q-lagged coefficient much more often than AR(p) models

Twitter is actually useful to fit Google data!

ADL(p,q)

Slightly fewer cases where the lagged component of Twitter is significant to predict Google data...

But the adjusted R² is much better than for DL(q)

Wrap Up

ADL(1,1) is the best model

Reasonable! It mixes the autoregressive component of Google with the prediction of Twitter, captured one day before

Overcome Limitations

We might expect better results if finer-grained analysis (hourly) were possible...

Twitter vs. Wikipedia: Upcoming CIKM’13 Workshop


Conclusion

• Relate Twitter trending topics (social trends) with Google hot queries (web trends)

• Trend Bipartite Graph (TBG) links social and web trends

• Time Series Analysis

• maximum cross-correlation occurs at lag 0, but Twitter leads Google significantly (~ 60% of the time)

• the very best model to explain the data uses both Twitter and Google lagged coefficients

Thank You!

Questions?


Backup


Trend Vocabularies

[Figure: the trend vocabularies VX and VY, with example keywords in three groups]

• 49ers, dow jones, nba, obama 2016, world war z, ...

• 50 cent, democrats, iphone 5, romney, windows 8, ...

• anne hathaway, barack obama, election, nyc marathon, veterans day, ...

Trend Scores

• Given a discrete time interval T = <t1, ..., tT>

• Assign 2 scores (social and web) to each trending keyword during each time unit

• The score measures how strongly a keyword is trending at a given time

Trend Time Series

• Model each Twitter/Google trending keyword as a time series of tT random variables

• Each random variable evaluates to the trending score of the keyword

• The observed time series for a trend is the sequence of values of its trending score


Trend Bipartite Graph

• 2 disjoint sets of nodes are the vocabularies of Twitter and Google trends

• Weighted edges measure the pairwise trend similarity

• string/lexical: edit distance, LCS, n-grams

• semantic: Wikipedia-based

• TBG identifies a set of pairs of comparable time series associated with similar trends


(Weak) Stationarity

The autocorrelation of a stationary variable decays into “noise” and/or negative values within a few lags

[Figure: autocorrelation plots; panels: Google, Twitter]