Analysis of Causal Topics in Text Data and Time Series with Applications to Presidential Prediction Markets. Hyun Duk Kim, ChengXiang (“Cheng”) Zhai (UIUC); Thomas A. Rietz (Univ. of Iowa); Daniel Diermeier (Northwestern Univ.); Meichun Hsu, Malu Castellanos, and Carlos Ceja (HP Labs)


Post on 12-Jan-2016


Page 1: Analysis of Causal Topics in Text Data and Time Series with Applications to Presidential Prediction Markets Hyun Duk Kim, ChengXiang (“Cheng”) Zhai (UIUC)


Analysis of Causal Topics in Text Data and Time Series with Applications to Presidential Prediction Markets

Hyun Duk Kim, ChengXiang (“Cheng”) Zhai (UIUC) Thomas A. Rietz (Univ. of Iowa)

Daniel Diermeier (Northwestern Univ.) Meichun Hsu, Malu Castellanos, and Carlos Ceja (HP Labs)

Page 2: Text Mining for Understanding Time Series

[Figure: Dow Jones Industrial Average over time. Source: Yahoo Finance]

What might have caused the stock market crash?

Any clues in the companion news stream?

Sept 11 Attack!

Page 3: Analysis of Presidential Prediction Markets

[Figure: price of a candidate over time]

What might have caused the sudden drop of price for this candidate?

What “mattered” in this election?

Any clues in the companion news stream?

Tax cut?

Page 4: Joint Analysis of Text and Time Series to Discover “Causal Topics”

• Input:
  – Time series
  – Text data produced in a similar time period (text stream)
• Output:
  – Topics whose coverage in the text stream has strong correlations with the time series (“causal” topics)

[Figure labels: Tax cut; Gun control]

Page 5: Related Work

• Topic modeling (e.g., [Hofmann 99], [Blei et al. 03], …)
  – Extract topics from text data and reveal their patterns
  – No consideration of time series, so the topics extracted may not be correlated with the time series
• Stream data mining (e.g., [Agrawal 02])
  – Clustering & categorization of time series data
  – No topics being generated for text data
• Temporal text retrieval and prediction (e.g., [Efron 10], [Smith 10])
  – Incorporating time factor in retrieval or text-based prediction
  – No topics being generated

New Problem: Discover causal topics from text streams with time series data for supervision

Page 6: Background: Topic Models

• Topic = multinomial distribution over words (unigram language model)
• Text is assumed to be a sample of words drawn from a mixture of multiple (unknown) topics
• Parameter estimation and Bayesian inference “reveal”
  – All the unknown topics in a text collection
  – The coverage of each topic in each document
  – Prior can be imposed to bias the inference of both topics and topic coverage

Page 7: Document as a Sample of Mixed Topics

Topic 1: government 0.3, response 0.2, ...
Topic 2: city 0.2, new 0.1, orleans 0.05, ...
Topic k: donate 0.1, relief 0.05, help 0.02, ...
Background: is 0.05, the 0.04, a 0.03, ...

[ Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response ] to the [ flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated ] …[ Over seventy countries pledged monetary donations or other assistance]. …

Generative Topic Model

Inference/Estimation of topics

Prior can be added on them
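The generative view on this page can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word lists are the truncated ones shown above (renormalized by the sampler, since the slide elides the rest of each distribution), and the `coverage` values and the function name `sample_document` are hypothetical.

```python
import random

# Truncated word distributions copied from the slide; the "..." entries are
# unknown, so the sampler below effectively renormalizes what is listed.
TOPICS = {
    "topic_1": {"government": 0.3, "response": 0.2},
    "topic_2": {"city": 0.2, "new": 0.1, "orleans": 0.05},
    "topic_k": {"donate": 0.1, "relief": 0.05, "help": 0.02},
    "background": {"is": 0.05, "the": 0.04, "a": 0.03},
}


def sample_document(topics, coverage, length, rng):
    """Draw `length` words: pick a topic by its coverage, then a word from it."""
    topic_names = list(coverage)
    topic_weights = [coverage[name] for name in topic_names]
    words = []
    for _ in range(length):
        topic = rng.choices(topic_names, weights=topic_weights, k=1)[0]
        dist = topics[topic]
        word = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        words.append(word)
    return words


# Hypothetical topic coverage for one document (not from the slides).
coverage = {"topic_1": 0.4, "topic_2": 0.2, "topic_k": 0.1, "background": 0.3}
document = sample_document(TOPICS, coverage, length=20, rng=random.Random(0))
print(" ".join(document))
```

Inference runs this process in reverse: given only the words, it estimates the topic word distributions and the per-document coverage.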

Page 8: When a Topic Model Is Applied to a Text Stream

[Figure: text stream along a time axis, with the topics below]

Topic 1: government 0.3, response 0.2, ...
Topic 2: city 0.2, new 0.1, orleans 0.05, ...
Topic k: donate 0.1, relief 0.05, help 0.02, ...
Background: is 0.05, the 0.04, a 0.03, ...
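Once a topic model has assigned topic coverage to each dated document, one simple way to get a per-topic time series is to sum coverage by date. The sketch below assumes that choice (the slides do not say how coverage is aggregated); the function name, the input layout, and the toy values are all hypothetical.

```python
from collections import defaultdict


def topic_coverage_series(documents, n_topics):
    """Sum per-document topic coverage by date.

    `documents` is a list of (date_string, coverage) pairs, where `coverage`
    is a list of length `n_topics` giving each topic's share of the document,
    as estimated by some topic model (not shown here).
    Returns (dates, series) with series[k][i] = total coverage of topic k
    on dates[i]. ISO date strings are assumed so that sorting is chronological.
    """
    totals = defaultdict(lambda: [0.0] * n_topics)
    for date, coverage in documents:
        for k in range(n_topics):
            totals[date][k] += coverage[k]
    dates = sorted(totals)
    series = [[totals[d][k] for d in dates] for k in range(n_topics)]
    return dates, series


# Hypothetical toy input: four documents, two topics.
docs = [
    ("2001-09-10", [0.9, 0.1]),
    ("2001-09-11", [0.2, 0.8]),
    ("2001-09-11", [0.1, 0.9]),
    ("2001-09-12", [0.3, 0.7]),
]
dates, series = topic_coverage_series(docs, n_topics=2)
for k, values in enumerate(series):
    print(k, [round(v, 2) for v in values])
```

Each resulting series can then be compared against the external (non-text) time series with a causality or correlation measure.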

Page 9: New Text Mining Framework: Iterative Causal Topic Modeling

[Figure: iterative loop. Inputs: Text Stream and Non-text Time Series (Sep 2001, Oct 2001, …). Steps: Topic Modeling (Topic 1 to Topic 4) → Causal Topics → Zoom into Word Level → Causal Words (Topic 1: W1 +, W2 -, W3 +, W4 -, W5 …) → Split Words (Topic 1-1: W1 +, W3 +; Topic 1-2: W2 -, W4 -) → Feedback as Prior → Topic Modeling]

Page 10: Iterative Causal Topic Modeling Framework

[Figure: iterative loop. Inputs: Text Stream and Non-text Time Series (Sep 2001, Oct 2001, …). Steps: Topic Modeling (Topic 1 to Topic 4) → Causal Topics → Zoom into Word Level → Causal Words (Topic 1: W1 +, W2 -, W3 +, W4 -, W5 …) → Split Words (Topic 1-1: W1 +, W3 +; Topic 1-2: W2 -, W4 -) → Feedback as Prior → Topic Modeling]

• General framework for any topic modeling and any causality measure
• Naturally incorporate non-text time series in the process
• Topic level + word level: efficiency + granularity
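The loop in the diagram can be sketched as a higher-order function. This is a skeleton under stated assumptions, not the authors' implementation: every callable (`fit_topics`, `topic_causality`, `word_causality`, `build_prior`) is a hypothetical placeholder for whichever topic model and causality measure is plugged in, and the stubs at the bottom exist only to make the sketch runnable.

```python
def iterative_causal_topics(fit_topics, topic_causality, word_causality,
                            build_prior, n_iterations, threshold):
    """Skeleton of the iterative loop; all callables are supplied by the caller.

    fit_topics(prior)      -> list of topics (each a dict word -> probability)
    topic_causality(topic) -> confidence in [0, 1] against the time series
    word_causality(word)   -> (sign, confidence), sign being "+" or "-"
    build_prior(groups)    -> prior object accepted by fit_topics
    """
    prior = None
    causal_topics = []
    for _ in range(n_iterations):
        topics = fit_topics(prior)                       # topic modeling
        causal_topics = [t for t in topics               # causal topics
                         if topic_causality(t) >= threshold]
        groups = []
        for topic in causal_topics:                      # zoom into word level
            positive, negative = [], []
            for word in topic:
                sign, confidence = word_causality(word)
                if confidence < threshold:
                    continue                             # keep causal words only
                (positive if sign == "+" else negative).append(word)
            # split words by sign into sub-topics
            groups.extend(g for g in (positive, negative) if g)
        prior = build_prior(groups)                      # feedback as prior
    return causal_topics, prior


# Hypothetical stubs mirroring the W1..W5 example in the diagram.
WORD_SIGNS = {"w1": ("+", 0.99), "w2": ("-", 0.98),
              "w3": ("+", 0.97), "w4": ("-", 0.96)}


def fit_topics(prior):
    return [{"w1": 0.4, "w2": 0.3, "w3": 0.2, "w4": 0.1}, {"w5": 1.0}]


def topic_causality(topic):
    return 0.99 if "w1" in topic else 0.5


def word_causality(word):
    return WORD_SIGNS.get(word, ("+", 0.0))


topics, prior = iterative_causal_topics(
    fit_topics, topic_causality, word_causality,
    build_prior=lambda groups: groups, n_iterations=2, threshold=0.95)
print(prior)
```

With these stubs the prior groups are `w1, w3` and `w2, w4`, matching the Topic 1-1 / Topic 1-2 split in the diagram. A real `fit_topics` would use the prior to bias the next round of topic estimation.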

Page 11: Heuristic Optimization of Causality + Coherence

Page 12: Causality Measures

• Pearson correlation
  – Basic correlation
• Granger Test
  – For two time series x (topic), y (stock), time lag p
  – [Equation labels: auto-regression; lagged values]
  – Significance test if lagged x terms should be retained or not
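The equation on this page did not survive extraction. The standard textbook form of the Granger test, which the labels "auto-regression" and "lagged values" appear to refer to, is given below; the coefficient names $a_i$, $b_i$ and the noise term $\varepsilon_t$ are notation chosen here, not taken from the slide.

```latex
% Auto-regression of y on its own lagged values, extended with lagged values of x
y_t = a_0 + \sum_{i=1}^{p} a_i \, y_{t-i} + \sum_{i=1}^{p} b_i \, x_{t-i} + \varepsilon_t

% Null hypothesis: the lagged x terms add nothing and can be dropped
H_0 : \; b_1 = b_2 = \dots = b_p = 0
```

Rejecting $H_0$ (typically with an F-test comparing the model with and without the lagged $x$ terms) is read as "x Granger-causes y" at lag $p$.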

Page 13: Feedback Prior Generation

Significant words:

Topic  Word              Impact  Significance (%)
1      Social            +       99
1      Security          +       96
1      Gun               -       98
1      Control           -       96
5      September         -       99
5      Airline           -       99
5      Terrorism         -       97
5      … (5 more words)
5      Attack            -       96
5      Good              +       96

Generated prior:

Topic  Word        Prob
1      Social      0.8
1      security    0.2
2      Gun         0.75
2      Control     0.25
3      September   0.1
3      Airline     0.1
3      Terrorism   0.075
3      … (5 more)
3      Attack      0.05
3      Good        0.0
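The first two prior topics above are consistent with a simple rule: split a topic's significant words by impact sign, then give each word a probability proportional to its significance minus 95. The sketch below implements only that rule. The cutoff of 95 is an assumption inferred from the numbers, not stated on the slide, and the rule does not reproduce the third prior topic, where the single positive word (Good) gets probability 0.0 and the other values imply some further adjustment.

```python
def feedback_prior(significant_words, cutoff=95.0):
    """Turn the significant words of one topic into prior word distributions.

    `significant_words` maps word -> (impact, significance_percent), with
    impact "+" or "-". Words are split by impact; within each group the
    probability is proportional to (significance - cutoff).
    The cutoff value is an assumption, not taken from the slides.
    """
    groups = {"+": {}, "-": {}}
    for word, (impact, significance) in significant_words.items():
        weight = significance - cutoff
        if weight > 0:
            groups[impact][word] = weight
    priors = []
    for impact in ("+", "-"):
        total = sum(groups[impact].values())
        if total > 0:
            priors.append({w: v / total for w, v in groups[impact].items()})
    return priors


# Topic 1 from the table of significant words.
topic_1 = {
    "Social": ("+", 99), "Security": ("+", 96),
    "Gun": ("-", 98), "Control": ("-", 96),
}
for prior in feedback_prior(topic_1):
    print(prior)
```

For topic 1 this yields Social 0.8 / Security 0.2 and Gun 0.75 / Control 0.25, the values shown for prior topics 1 and 2.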

Page 14: Experiment Design 1: Stock Market Analysis

• Time: June 2000 – Dec. 2011
• Text data
  – New York Times
• Time series
  – American Airlines stock (AAMRQ)
  – Apple stock (AAPL)
• Question: any “causal topics” to explain fluctuation of the stocks of the two companies?

Page 15: Experiment Design 2: 2000 Presidential Election Campaign

• Time: May 2000 – Oct. 2000
• Text data
  – New York Times (use text mentioning Bush or Gore)
• Time series
  – Normalized “Gore stock price” in Iowa Electronic Markets (IEM), an online futures market
• Question: any “causal topics” to explain changes in opinions about Gore?

Page 16: Measuring Topic Quality

• Causality Confidence of a topic
  – Based on p-value of causality test (Granger, Pearson) for the topic
• Topic Purity
  – Consistency in the direction of “causal” relation with the time series (“are all words in the topic positively correlated with the time series?”)
  – Based on entropy of distribution of positive/negative words

Page 17: Topic Purity

Topic  Word       Impact  Significance (%)
1      Social     +       99
1      Security   +       96
1      Gun        -       98
1      Control    -       96
5      September  -       99
5      Airline    -       99
5      Terrorism  -       97
5      Attack     -       96
5      Good       +       96

$$P(T=\text{“pos”}) = \frac{\#\text{positiveWords}}{\#\text{positiveWords} + \#\text{negativeWords}}$$

$$\text{Entropy } H(T) = -\big[\, P(T=\text{“pos”}) \log_2 P(T=\text{“pos”}) + P(T=\text{“neg”}) \log_2 P(T=\text{“neg”}) \,\big]$$

$$\text{Purity}(T) = 100 \cdot (1 - H(T))$$

[Figure: H(T) plotted against P(T=“pos”) on [0, 1], peaking at H(T) = 1.0 for P(T=“pos”) = 0.5]

• P(T=“pos”) = P(T=“neg”) = 1/2: highest entropy, lowest purity (0)
• P(T=“pos”) = 1/5, P(T=“neg”) = 4/5: lower entropy, higher purity
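As a worked example, the purity measure on this page can be computed from the counts of positive and negative words. The sketch assumes a base-2 logarithm and the usual negative sign on the entropy, which is what the plot implies (entropy peaks at 1.0 when the two signs are balanced); the function name is hypothetical.

```python
import math


def topic_purity(n_positive, n_negative):
    """Purity = 100 * (1 - H), with H the base-2 entropy of the sign distribution."""
    total = n_positive + n_negative
    if total == 0:
        raise ValueError("topic has no significant words")
    entropy = 0.0
    for count in (n_positive, n_negative):
        p = count / total
        if p > 0:  # treat 0 * log(0) as 0
            entropy -= p * math.log2(p)
    return 100.0 * (1.0 - entropy)


# Topic 1 in the table: 2 positive and 2 negative words.
print(round(topic_purity(2, 2), 1))
# Topic 5 in the table: 1 positive and 4 negative words.
print(round(topic_purity(1, 4), 1))
# A topic whose words all share one sign.
print(round(topic_purity(0, 5), 1))
```

A balanced topic scores 0, the 1-versus-4 topic scores roughly 28, and a topic with a single sign scores 100.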

Page 18: Sample Result 1: Topics Discovered for AAMRQ vs. AAPL

AAMRQ                       AAPL
russia russian putin        paid notice st
europe european germany     russia russian europe
bush gore presidential      olympic games olympics
police court judge          she her ms
airlines airport air        oil ford prices
united trade terrorism      black fashion blacks
food foods cheese           computer technology software
nets scott basketball       internet com web
tennis williams open        football giants jets
awards gay boy              japan japanese plane
moss minnesota chechnya     …

• Significant topic lists for two different external time series
  – AAMRQ: airline, terrorism topics
  – AAPL: IT industry topics
• Topics discovered depend on the external time series

Page 19: Effect of Iterations on Causality Confidence & Purity

Page 20: Different Feedback Strength (µ)

[Figure: three panels plotted against iteration (Iter, 1 to 5), one curve each for µ = 10, 50, 100, 500, 1000. Panels: Number of Significant Topics; Average Purity; Average Confidence]

• Significant improvement in confidence, number of significant topics by feedback
  – Clear benefit of feedback
• Large µ guarantees topic purity improvement

Page 21: Sample Result 2: Major Topics in 2000 Presidential Election

• Revealed several important issues
  – E.g. tax cut, abortion, gun control, oil energy
  – Such topics are also cited in political science literature [Pomper `01] and Wikipedia [Link]

Top Three Words in Significant Topics

tax cut 1
screen pataki guiliani
enthusiasm door symbolic
oil energy prices
news w top
pres al vice
love tucker presented
partial abortion privatization
court supreme abortion
gun control nra

Page 23: Conclusions & Future Work

• Meaningful topics can be extracted from a text stream by using time series for supervision
• Such “causal” topics provide potential explanations for changes in the time series data
• Preliminary experiment results on 2000 presidential prediction markets are promising
• Future work (discussion)
  – Issues related to topic models (e.g., local maxima, # of topics, interpretation of topics)
  – Issues related to causality analysis (e.g., “local” causality)
  – Unified analysis model
  – System to support online interactive analysis of causal topics (time series can be derived from text too)

Page 24: Thank You!

Questions/Comments?