predicting document creation times in news citation networks · news citation network overview news...

32
Predicting Document Creation Times in News Citation Networks Andreas Spitz 1 , Jannik Strötgen 2 , and Michael Gertz 1 April 23, 2018 — TempWeb 2018, Lyon 1 Database Systems Research Group 2 Bosch Center for Artificial Intelligence Heidelberg University, Germany Germany

Upload: others

Post on 29-Sep-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

Predicting Document Creation Timesin News Citation Networks

Andreas Spitz1, Jannik Strötgen2, and Michael Gertz1

April 23, 2018 — TempWeb 2018, Lyon

1 Database Systems Research Group 2 Bosch Center for Artificial IntelligenceHeidelberg University, Germany Germany

Page 2: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

Hm, when did this happen again?

1

Page 3: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

News Citation Networks

Page 4: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

News Citation Network Extraction

2

Page 5: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

News Citation Network Overview

News articles from RSS feeds:

I Politics and business feeds

I 34 English news outlets(USA, UK, AUS, CAN, GER, CHN, QAT)

I 2 years (Nov 2015 - Oct 2017)

I 244.6 thousand articles

I 367.2 thousand edges

Used data:

I Hyperlinks in the article body

I Publication dates

I Temporal expressions3

Page 6: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

News Outlet Statistics (sample)

short news outlet days 〈articles〉 〈temp exp〉 otherin otherout

AT The Atlantic 334 7.2 10.5 16.7 50.6BBC British Bc. Corp. 730 8.1 6.5 19.1 8.0DW Deutsche Welle 334 1.2 6.1 48.1 5.9FOX Fox News 548 2.7 9.8 0.0 0.0NPR National Public Radio 334 0.4 8.4 63.6 58.5NY The New Yorker 548 3.0 13.2 33.5 30.6NYT New York Times 669 23.8 10.7 26.8 4.7SMH Sydney Morn. Herald 548 2.3 7.0 3.0 51.9WP Washington Post 548 62.7 9.4 13.7 5.1

4

Page 7: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

Evolution of Network Metrics

clustering coefficient average path length

average degree undirected diameter

2016−01 2016−07 2017−01 2017−07 2016−01 2016−07 2017−01 2017−07

0

20

40

60

5

10

15

1

2

3

0.0

0.2

0.4

0.6

days

mea

sure

val

ue

network aggregated politics business

5

Page 8: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

Exploring Citation Chains

6

Page 9: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

Article Publication Time Prediction

Page 10: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

Task Definition: Publication Time Prediction

7

Page 11: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

Available News Citation Network Data

Predict article publication times from:

I Citation network topology

I Publication dates of adjacent articles

I Temporal expressions in adjacent articles

I Not the metadata of the article itself

I Not the article content

8

Page 12: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

Available News Citation Network Data

Predict article publication times from:

I Citation network topology

I Publication dates of adjacent articles

I Temporal expressions in adjacent articles

I Not the metadata of the article itself

I Not the article content

8

Page 13: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

Feature Extraction

Page 14: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

Network Topology Features

Node degree-based features:

I Incoming degree

I Outgoing degree

I Undirected degree

Density-based features:

I Undirected local clustering coe�icient

Centrality-based features:

I Betweenness centrality

I Incoming closeness centrality

I Outgoing closeness centrality

I Page Rank centrality

9

Page 15: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

Network Topology Features

Node degree-based features:

I Incoming degree

I Outgoing degree

I Undirected degree

Density-based features:

I Undirected local clustering coe�icient

Centrality-based features:

I Betweenness centrality

I Incoming closeness centrality

I Outgoing closeness centrality

I Page Rank centrality

9

Page 16: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

Network Topology Features

Node degree-based features:

I Incoming degree

I Outgoing degree

I Undirected degree

Density-based features:

I Undirected local clustering coe�icient

Centrality-based features:

I Betweenness centrality

I Incoming closeness centrality

I Outgoing closeness centrality

I Page Rank centrality

9

Page 17: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

Temporal Network Features

10

Page 18: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

Temporal Expression Features

Correlation of temporal expressions:

I good with publication dates ofreferencing articles (incoming edges)

I bad with publication dates ofreferenced articles (outgoing edges)

11

Page 19: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

Temporal Expression Features

Correlation of temporal expressions:

I good with publication dates ofreferencing articles (incoming edges)

I bad with publication dates ofreferenced articles (outgoing edges)

11

Page 20: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

Missing Features and Imputation

Missing features

I 30.8% of feature values are missing

I 89.6% of articles are missing at least one feature

Imputation of missing values

I Column mean of the feature

12

Page 21: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

Missing Features and Imputation

Missing features

I 30.8% of feature values are missing

I 89.6% of articles are missing at least one feature

Imputation of missing values

I Column mean of the feature

12

Page 22: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

Evaluation

Page 23: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

Regression Methods

Used regression methods:

I BASE: Baseline (average publication date of adjacent articles)

I LR: Linear regression

I BAY: Bayesian ridge regression (Laplace model)

I RF: Random forest

I GB: Gradient boosting (Laplace distribution, decision trees)

I SVM: Support vector machine (radial kernel)

I NN: Neural network (feedforward, one hidden layer)

13

Page 24: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

Evaluation Results: Mean Absolute Error (days)

BASE LR BAY NN RF GB SVM

all 66.72 60.46 59.61 26.88 24.98 22.66 26.19in 88.88 66.48 87.55 34.03 32.25 27.49 32.29

out 87.32 59.54 40.24 32.52 30.10 26.68 30.77in+out 18.68 55.45 54.95 12.62 11.23 12.76 14.31

14

Page 25: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

Distribution of Absolute Errors

out in+out

all in

BASE LR BAY NN RF GB SVM BASE LR BAY NN RF GB SVM

0

50

100

150

200

250

0

50

100

150

200

250

regression method

abso

lute

err

or (

days

)

method BASE LR BAY NN RF GB SVM

15

Page 26: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

Recall by Varying Absolute Error

out in+out

all in

0 20 40 60 0 20 40 60

0

25

50

75

100

0

25

50

75

100

absolute error (days)

reca

ll (p

erce

ntag

e of

pre

dict

ions

< a

bsol

ute

erro

r)

method BASE LR BAY NN RF GB SVM

16

Page 27: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

Feature Importance: Random Forest

●●●

●●

Feature importance: random forestm

ax (T

out)

min

(Tin)

µ (T ou

t)µ

(T in)

min

(Tou

t)m

ax (T

in)

max

(Xin)

µ (X

in) c pr

σ (T ou

t)σ

(Xin)

c cl,o

utsp

an (T

out)

σ (T in

)sp

an (T

in)

min

(Xin)

span

(Xin)

c cl,in

min

(Dis

t)de

g out

µ (D

ist)

deg in

deg al

lm

ax (D

ist)

c btw cc

σ (D

ist)

10−3

10−2

10−1

100

rela

tive

impo

rtan

ce

Feature type: ● network topology temporal expression temporal network

17

Page 28: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

Feature Importance: Gradient Boosting

●●

Feature importance: gradient boostingm

ax (T

out)

min

(Tin)

deg ou

(T out)

min

(Dis

t)de

g in c prσ

(T out)

σ (T in

(T in)

deg al

lsp

an (T

in)

max

(Tin)

µ (X

in)

min

(Tou

t)µ

(Dis

t)c bt

wm

ax (X

in)

max

(Dis

t)sp

an (X

in)

σ (X

in)

span

(Tou

t)m

in (X

in)

c cl,o

utσ

(Dis

t)c cl

,in cc

10−5

10−4

10−3

10−2

10−1

100

rela

tive

impo

rtan

ce

Feature type: ● network topology temporal expression temporal network

18

Page 29: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

Summary & Resources

Page 30: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

Summary

News citation networks:

I Focus on anchored links inside the article body

I Constructed like a citation network between articles

Publication date prediction:

I Can be framed as a regression problem

I Average prediction error of 3 weeks

I Temporal network features are most discriminative

19

Page 31: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

Resources

Data and implementation are available online:

I [data] News citation network (including URLs)

I [data] Temporal annotations

I [code] Publication date prediction

https://dbs.ifi.uni-heidelberg.de/resources/data/

20

Page 32: Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News articles from RSS feeds: I Politics and business feeds I 34 English news outlets

Resources

Data and implementation are available online:

I [data] News citation network (including URLs)

I [data] Temporal annotations

I [code] Publication date prediction

https://dbs.ifi.uni-heidelberg.de/resources/data/

20