using social media to predict traffic flow under special ...qinghe/papers/conference/2014 ni... ·...

Ni, He, and Gao

1

Using Social Media to Predict Traffic Flow under Special Event 1

Conditions 2

Ming Ni 3

Department of Industrial and Systems Engineering 4

University at Buffalo, The State University of New York 5

429 Bell Hall, Buffalo, NY 14260 6

Phone: (716)645-2357 7

Email: [email protected] 8

9

Qing He* 10

Department of Civil, Structural and Environmental Engineering and 11

Department of Industrial and Systems Engineering 12


313 Bell Hall, Buffalo, NY 14260 14

Phone: (716)645-3470 15


17

Jing Gao 18

Department of Computer Science and Engineering 19


350 Davis Hall, Buffalo, NY 14260 21

Phone: (716)645-1586 22


24

Submitted to TRB 93rd Annual Meeting for Presentation and Publication 25

January 2014, Washington D.C. 26

November 15, 2013 27

28

Word counts (<7500): 7486 29

Abstract and Manuscript: 4486 30

Number of Tables and Figures: 12 (=250*12=3000) 31

* Corresponding Author

mailto:[email protected]



Ni, He, and Gao

2

Abstract 1

Social media is great resource of user-generated contents. Public attention, opinion 2

and hot topics can be captured in the social media, which provides the ability to 3

predict human related events. Since social media can be retrieved in real time with 4

relatively small building and maintenance costs, traffic operation authorities probably 5

identify the social media data as another type sensor for traffic demand. One of those 6

challenges is how to extract reliable traffic related features from big and noisy social 7

media data. The other challenge is how to locate a feasible traffic study that fits well 8

with social media data. In this paper, we aim to use social media information to assist 9

traffic flow prediction under special event conditions. Specially, a short-term traffic 10

flow prediction model, incorporated with tweet features, is developed to forecast the 11

incoming traffic flow prior to sport game events. Both tweet rate features and 12

semantic features are included in the prediction model. We examine and compare the 13

performance of four regression methods, respectively autoregressive model, neural 14

networks model, support vector regression, and k-nearest neighbor, with and without 15

social media features. To the end, we show the benefit gained by including social 16

media information in the prediction model and its computational efficiency for 17

potential practical applications. 18

19

Ni, He, and Gao

3

1. Introduction 1

Social media has become an indicator of modern people and lifestyle in the Internet 2

virtual community. Vast of user-generated contents strengthen linkage and interaction 3

between each individuals within the community circle, also provide large amount of 4

information related to various area. Examples include Facebook, Twitter, YouTube, 5

Google Plus and Wikipedia. The trend of easy accessing social media will 6

continuously grow with the development and commercialization of wearable 7

computer devices, like Google Glass and other smart watches. 8

On the other hand, Traffic congestion is one of the interesting and long-lasting 9

problems in the world. In the last few decades, the sophisticated traffic network has 10

been established around the world, but coming with both day-to-day recurrent traffic 11

congestion and event-caused non-recurrent traffic congestion, which significantly 12

affect the quality of life and impact the U.S. economy. Among different traffic 13

mitigation counter-measures, traffic flow prediction alerts traffic congestion 14

beforehand and facilitates pro-active traffic management and control. With an 15

accurate prediction of traffic flow, road users are able to choose their routes 16

accordingly, and traffic management agencies are capable to operate and control the 17

traffic to alleviate possible congestion. Thus traffic flow prediction is one of the key 18

methods to alleviate traffic congestion, especially under non-recurrent traffic 19

conditions. 20

Motivated by the potentially great value of the social media, in this paper, we intend 21

to incorporate social media data into traffic flow prediction models. Specially, the 22

following question have been raised: Will social media help to predict prior-event 23

traffic? Different with traffic data from traditional detectors, social media data is 24

unorganized, noisy, big, and most of times unrelated to traffic. One of those 25

challenges is how to extract reliable traffic related features from big and noisy social 26

Ni, He, and Gao

4

media data. The other challenge is how to locate a feasible traffic study that fits well 1

with social media data. 2

Due to data availability, the scope of traffic analysis in this paper is constrained in the 3

freeway. The scope of the social media data is restrained to the tweets, which created, 4

shared and modified by the Twitter users. As a basic introduction of tweets, each user 5

of Twitter can be subscribed by other users known as followers. The message posts or 6

status updates, called as tweets, is able to contain 140 characters or less, which 7

typically is one kind of user-generated contents. The "@" sign followed by a 8

username is used for mentioning or replying to other users. Hashtags of tweets are the 9

words or phrases prefixed with a "#" sign, which is used to group posts together or 10

indicated to a topic. 11

This paper will utilize those above components of Twitter massages, like content of 12

tweet, number of hashtags, number of users, number of tweets with URLs, and 13

number of tweets mentioned by other users. We focus on fusing social media data 14

with event traffic data, in order to predict the dynamics of prior-event traffic. In this 15

paper, we build a short-term traffic flow prediction model, incorporated with tweet 16

features to forecast the incoming traffic flow prior to sport game events. We examine 17

and compare the performance of four regression methods, respectively autoregressive 18

model, neural networks model, support vector regression, and k-nearest neighbor, 19

with and without social media features. It is observed that the prediction errors are 20

notably decreased with the aid of social media data. 21

2. Literature review 22

There has not been considerable published research involved to use social media to do 23

traffic prediction. Therefore, \this section investigates literature from two perspectives 24

seperately. One is related to the existing forecasting studies utilizing social media 25

data. The other comes from the prior short-term traffic prediction models under 26

atypical conditions. 27

Ni, He, and Gao

5

Social Media 1

Although the social media includes a variety of web services, the previous studies 2

generally fall into two parts: social relations and user-generated contents (1). Social 3

relations reflect the virtually social network and connections between people on the 4

web. The user-generated contents are created and published by the people, which 5

intend to strengthen linkage and interaction between each individual within the 6

community circle. In this paper, we focus on the previous studies with the user-7

generated contents. 8

Positive or negative emotions were able be extracted from Twitter data about certain 9

company brands and products (2), which can be an auxiliary survey data applying to 10

the company marketing strategy. Public opinion of election in US was studied by 11

some researches (3)(4), effectively using the social data as a public survey to predict 12

the election results with low cost comparing to the traditional telephone survey. As a 13

different idea, a model was built by the rate of Twitter messages, related to movie, to 14

predict movie box-office revenue and confirmed that the Twitter feeds can be 15

effective indicators of real-world performance(5). An evaluation system about stock 16

market was proposed to collect public mood and sentiment form Twitter service as an 17

economic indicator (6). Other research also used social media to predict stock market 18

indicator by analyzing Twitter posts. It is concluded that the correlation between stock 19

market and social media indeed existed by examining emotional Twitter posts and 20

market index such as Dow Jones, NASDAQ and S&P 500 (7). 21

Yu and Kak (1) identified three characteristics of event when social media data can 22

be treated as a good predictor. The three characteristics are, respectively, human 23

related event, masses of people involved, and the event should be easy to be talked in 24

public. 25

Traffic Prediction 26

Ni, He, and Gao

6

Since people intended to publish on social media corresponding to the non-recurrent 1

traffic conditions, such as traffic special events, we focus on the literatures related to 2

the traffic prediction under atypical conditions. 3

Neural Network was used to build the traffic volume prediction model which was 4

based on the time-series data (8). In a similar manner, a supervised statistical learning 5

technique called Online Support Vector machine for Regression, or OL-SVR, was 6

applied for the prediction of short-term freeway traffic flow under both typical and 7

atypical conditions (9). Guo et al. (10) pre-processed traffic data using Singular 8

Spectrum Analysis , and utilized k- nearest neighbor method to predict traffic. Since 9

pre-processing step reduced the noisy sensor inputs, this model can be used under 10

normal and special conditions. An Online boosting nonparametric regression (OBNR) 11

model also was used to perform traffic prediction, which consists of two major parts, 12

respectively, the base part and the boosting part. The base part was constructed under 13

normal conditions, while the gradient boosting part undertook special conditions into 14

account (11). 15

3. Prediction Analysis 16

The framework of prediction analysis is presented as Figure 1. Starting with data 17

preprocessing, both Twitter and traffic features are introduced and extracted by the 18

following four components, traffic data de-trending, game traffic impacted detector 19

identification, game tweet extraction and aggregation, and tweet semantics extraction. 20

21

Figure 1 The flow chart of the prediction analysis 22

Traffic Data

De-trending

Game Tweets

Extraction and

Aggregation

Game Traffic

Impacted Detector

Identification

Tweets Semantics

Extraction

Model Features

Selection

Game Traffic

PredictionModel Building

Ni, He, and Gao

7

Data Description 1

In this paper, traffic data were obtained from the Caltrans Performance Measurement 2

System (PeMS). PeMS is a system designed to maintain California freeway traffic 3

data and compute annual congestion for facilities with surveillance systems in place, 4

typically loop detectors spaced approximately 0.5 mile apart on each freeway lane 5

(12). The analysis uses 1-hour aggregated volume data, collected in four months from 6

February 1, 2013 to May 30, 2013. Some of the detectors may miss or report invalid 7

data in the practice. In order to compensate the missing or incorrect data samples, the 8

diagnostics algorithm and imputation regression models developed in (13) were 9

applied to detect the bad detectors and fill the missing value. This method generated 10

total 12,242,699 entries of hourly traffic flow records in the database. 11

Twitter data was collected by the same spatial and temporal window accordingly, 12

through the Twitter Streaming API with geo-location filter. Right after tweets 13

collected, the filtration of spam and commercial tweets is implemented by the list of 14

Twitter users. This results in a total number of 5,444,527 valid tweets. 15

The sport events are considered as good venues to perform both traffic and social 16

media analysis, since it can be observed by the tweet posts, namely quantity and 17

semantics. In addition, the traffic volume is dramatically influenced by the local sport 18

events due to its popularity. In the study, we specially consider game traffic impact on 19

I880 near Oracle Arena and O.co Coliseum in Oakland, California. Oracle Arena and 20

O.co Coliseum was home to the Oakland Athletics (Athletics) and Golden State 21

Warriors (Warriors) in 2013 game seasons. Oakland Athletics are a Major League 22

Baseball team and Golden State Warriors are an American professional basketball 23

team. There were 51 home games of the Athletics and Warrior from February to May 24

in 2013. Oracle Arena and O.co Coliseum is located right besides Interstate Highway 25

880 (I-880). Six traffic detectors, located before the exits of I-880 to the entrance of 26

Ni, He, and Gao

8

the Oracle Arena, were chosen to analyze incoming prior game traffic, shown as 1

Figure 2. 2

3

Figure 2 Detectors layouts around Oracle Arena and O.co Coliseum 4

Data Pre-processing 5

In this part, we study the tweet and traffic data sets in order to obtain the potential 6

features in the prediction modeling. 7

Game and Non-game Day Comparison 8

Twitter 9

The sport games, like Athletics’ and Warriors’, were considerably interested topics in 10

the Twitter, especially among the Twitter users in the Bay area in order to comment 11

and support their home teams. In order to identify the tweets which are relevant to 12

Ni, He, and Gao

9

each game and each team, the appropriate keywords, shown as Table 1, were used to 1

select the game-related tweets with the help of disambiguation checks. 2

Table 1 Keywords to select game-related tweets 3

Gold State Warriors Oakland Athletics

letsgowarriors

warrior

dubnation

letsgodubs

warriorsground

warriorsgame

greencollar

letsgooakland

athletics

astalk

The keywords are chosen by the related phrases about the teams and the hashtags 4

which are used to group the messages about the teams or related games. Those 5

keywords are tested by a fraction of tweet data to verify whether the relevant tweet 6

messages would be included or not. After iterative trials, those keywords in the Table 7

1 have been finalized. 8

The game-relevant tweets were aggregated into hourly time series in order to be 9

compatible with Traffic data. For each hour in the 4 months, we extracted 5 features 10

of tweet rates, shown as Table 2. 11

Table 2 Features for Tweet Rates 12

Features for Tweet rates Description

no_of_tws Number of tweets related to game in one hour

no_of_users Number of independent users sent those tweets

no_of_hashtags Number of hashtags in all those tweets

no_of_tws_mention Number of tweets mentioned other twitter user

no_of_tws_urls Number of tweets which has URLs

13

Ni, He, and Gao

10

1

(a) Number of tweets per hour at game days 2

3

(b) Number of tweets per hour at non-game days 4

Figure 3 Comparisons of tweet rates on game days and non-game days 5

Note the non-game days stand for the days which don’t carry on either home game or 6

away game, while game days denotes only for the days carrying on home game at 7

Oracle Arena or O.co Coliseum. 8

Figure 3 shows the number of tweets per hour for Golden State Warriors at their game 9

days and non-game days. One can see that there are much more tweets on the game 10

day than that on the non-game day. Therefore, we can use the tweet rate series on the 11

same day to indicate the total public attention for this game. 12

Traffic 13

0

50

100

150

200

250

300

2/2/20130:00

2/16/20130:00

3/2/20130:00

3/16/20130:00

3/30/20130:00

4/13/20130:00

4/27/20130:00

5/11/20130:00

no_of_tws

0

50

100

150

200

250

300

2/1/2013 0:00 3/1/2013 0:00 4/1/2013 0:00 5/1/2013 0:00

no_of_tws

Ni, He, and Gao

11

First of all, In order to eliminate the effects of day-to-day traffic fluctuations, we de-1

trended the traffic flow by subtracting the average hour-of-day and day-of-week 2

traffic volumes. For every record of traffic flow, the below equation was used to 3

reduce the periodic variance so that we can only concentrate on the increment of the 4

traffic flow. 5

wtwtwt yyy 6

Where y stands for the traffic flow; w indicates it is weekday or weekend; t denotes 7

the hour of day; wty is the average time-of-day and day-of-week traffic flow. 8

The time-of-day patterns of traffic flow are demonstrated in Figure 4 for both 9

weekday and weekend. Figure 4 shows the different periodic fluctuations of traffic 10

between weekday and weekend. In traffic prediction, it is common practice to exclude 11

such fluctuations. Therefore, the estimated variations of traffic flow, respectively 12

weekday and weekday, are used here to de-trend the hourly traffic flow data. 13

14

(a) Hourly traffic flow for weekdays (b) Hourly traffic flow for weekends

Figure 4 Hourly traffic flow for detectors 15

0

1000

2000

3000

4000

5000

6000

7000

8000

1 3 5 7 9 11 13 15 17 1921 23

Traf

fic

Flo

w

time stamp (hourly)

tf_400498 tf_400955

tf_400190 tf_400956

tf_400360 tf_400333

0

1000

2000

3000

4000

5000

6000

7000

1 3 5 7 9 1113 15 17 19 21 23

Traf

fic

Flo

w

time stamp (hourly)

tf_400498 tf_400955

tf_400190 tf_400956

tf_400360 tf_400333

Ni, He, and Gao

12

1

Figure 5 The de-trended prior-game traffic flow 2

The de-trended prior game traffic is plotted in Figure 5, which takes the last 2 hours 3

before the game event starting into account. The majority of de-trended game traffic 4

series is greater than zero, which indicates more traffic demand on the game day. 5

In order to test the impact of game on traffic data, a single-factor analysis of variance 6

(ANOVA) was used to differentiate between two populations of data sets through 7

factor “game day”. One dataset contains the de-trended traffic flows within 4 hours 8

before the game starting time. The other includes the binary game indicator during the 9

same time on every non-game day. 10

11

Table 3 ANOVA test results for factor “game day” 12

Detector P-value Significant or not

400498 0.0064 YES

400955 1.24e-13 YES

400190 6.3e-07 YES

400956 0.029 NO

400360 4.52e-13 YES

400333 7.45e-13 YES

-1300

-800

-300

200

700

1200

1700

2200

1 7

13

19

25

31

37

43

49

55

61

67

73

79

85

91

97

10

3

10

9

11

5

12

1

12

7

13

3

13

9

14

5

15

1

De

-tre

nd

ed

Tra

ffic

Flo

w

time stamp (hourly)

tf_400498 tf_400955 tf_400190

tf_400956 tf_400360 tf_400333

Ni, He, and Gao

13

Table 3 presents the results from ANOVA analysis. As one can see, sport games 1

generate significantly impact on the traffic volume of most surrounding detectors on 2

game days. Note that the traffic flow at detector 400956 does not demonstrate 3

significant differences caused by game events, since the detector is located at 4

downstream of the exit towards Oracle Arena. In addition, the distance between 5

detector 400333 and Oracle Arena is much larger comparing with other detectors. For 6

the purpose of simplification, these 2 traffic detectors are treated as only the 7

neighboring detectors rather than the analytical targets to be predicted in the following 8

analysis. 9

Tweet Semantics Extraction 10

One of important differences between tweets and other types of data is that tweets 11

consist of rich sentiments. For our analysis, we believe the public sentiment about one 12

particular game has influent on the attendance, just like the relationship between the 13

public opinion about a company and the stock price. Psychologically, more biased 14

emotion about one topic, more attention would be gained in the public. Therefore, 15

sentiment extraction is necessarily introduced to the analysis, which served as the 16

potential semantic features of tweet data to be investigated in the following study. 17

Sentiment analysis is an important part in machine learning. In here, sentiment results 18

are treated as potential tweet features. In this paper, the sentiment analysis is 19

performed based on an R (14) package, called “tm.plugin.sentiment”, which provides 20

the functions for natural language sentiment processing. The words of text in each 21

Tweet are labeled as positive, negative and neutral references. Before conducting the 22

sentiment analysis, the text part of tweets was initially preprocessed by the following 23

steps. 24

1. Make each letter lowercase, 25

2. Remove punctuation and stopwords. 26

3. Replace the abbreviations from the formal English words. 27

Ni, He, and Gao

14

Our study implemented a unified letter transformation approach (15) to normalize the 1

Tweet post, to replace the non-standard tokens to the Standard English words. This 2

method tends to alleviate the biased effect of abnormal text on the sentiment analysis. 3

In order to quantify the sentiments of tweets, a sentiment scores system was 4

implemented and based on the Lydia/TextMap system (16)-(17). There are 5 5

measures listed as follows. 6

polarity (p - n / p + n) : difference of positive and negative sentiment 7

references / total number of sentiment references 8

subjectivity (p + n / N) : total number of sentiment references / total number 9

of references 10

pos_refs_per_ref (p / N) : total number of positive sentiment references / 11

total number of references 12

neg_refs_per_ref (n / N) : total number of negative sentiment references / 13

total number of references 14

senti_diffs_per_ref (p - n / N) : difference of positive and negative 15

sentiment references / total number of references 16

Above 5 measures were calculated for every tweet post. 17

Prediction Analysis and Modeling 18

After the traffic and tweet data have been processed, there are many usable features of 19

those two, respectively the traffic flowing of neighboring detectors, and 5 tweet rate 20

features plus 5 tweet sentiment score features. This section, two kinds of features are 21

combined together to build the short-term traffic prediction model under the special 22

event conditions. 23

Prediction Features Selection 24

Because there are many potential features of traffic and tweet at hand, it is essential to 25

select the right significant features to build a parsimonious prediction model. 26

Ni, He, and Gao

15

It varies that the correlations between tweets features and traffic flow for each 1

detectors, as Figure 6 shows. Therefore, the features selection is an essential step and 2

very significantly affects the performance of the prediction modals. 3

4

(a) Correlation plot of twitter polarity

and 400190 traffic flow

(b) Correlation plot of No. of Twitter user

and 400955 traffic flow

Figure 6 Correlation analysis between tweet features and game traffic incremental 5

Where the “W” stands for games of Golden State Warriors. “A” stands for games of 6

Oakland Athletics. “AW” denotes the game time happened to have two teams 7

together at the Oracle Arena. “ALL” is for correlation for the all of games together. 8

Those traffic and tweet features were combined into an optimal matrix for each traffic 9

detector. Here, we used the least squares optimization with L1-norm regularization 10

(18) to solve this problem. 11

∑∑ ∑ ∑

∑

∑| |

Where K is the number of game; T stands for the time steps before game start to be 12

predicted and gT indicates the game starting time; M is total number of neighboring 13

detectors; L denotes the number of time lags of the neighboring detectors; P is the 14

number of tweet features; r represents the number of time lags of tweets relative to 15

Ni, He, and Gao

16

traffic data; ml is the coefficients for traffic variables; pV is the coefficients for tweet 1

variables; indicates the intercept of the model. 0 is the regularization 2

parameter that balances the L1 regularization term and least square term. 3

By applying different values of , we solved the above optimization problem by 4

CPLEX. Appropriate traffic and tweet features were selected with non-zero 5

coefficients. For the traffic features, the traffic flow variables of neighboring detectors 6

were always chosen by the optimization model. For the tweet features, every target 7

detector had different combination of tweet rate features and tweet sentiment features, 8

shown as Table 4. 9

Table 4 Selected tweet features for each detector when 4000 10

Traffic

Flow Selected Tweet Features

Traffic

Flow Selected Tweet Features

tf_400498_1 no_of_tws_a_sum4 no_of_tws_mentions_a_sum4 no_of_tws_mentions_w_sum4 no_of_tws_urls_w_sum4

tf_400190_1 no_of_hashtags_a_sum4

no_of_hashtags_w_sum4

no_of_tws_a_sum4

no_of_tws_mentions_a_sum4

no_of_tws_mentions_w_sum4

no_of_tws_urls_a_sum4

polarity_sum_w_sum4

tf_400955_1 no_of_hashtags_a_sum4 no_of_hashtags_w_sum4 no_of_tws_mentions_a_sum4 no_of_tws_mentions_w_sum4 no_of_tws_urls_a_sum4 no_of_tws_urls_w_sum4 no_of_tws_w_sum4 no_of_users_a_sum4 no_of_users_w_sum4 polarity_sum_w_sum4

tf_400360_1 no_of_hashtags_a_sum4

no_of_hashtags_w_sum4

no_of_tws_a_sum4

no_of_tws_mentions_w_sum4

no_of_tws_urls_a_sum4

no_of_users_a_sum4

no_of_users_w_sum4

In the Table 4, the independent variables are the volumes of traffic flow (tf) in last 1 11

hour prior to the games. Because there are 2 sport teams, in here, “w” stands for 12

Golden State Warriors and “a” stands for Oakland Athletics. “sum4” means that the 13

tweet feature is aggregated in last 4 hours before the predicting time. 14

Prediction Models 15

Ni, He, and Gao

17

There were many existing prediction approaches for short-term traffic flow, although 1

most of them focus on traffic features only. To compare with the performance under 2

different methods, we examine the following four popular methods to predict short-3

term event traffic (19) (20) (21): 4

Autoregressive Model (AR) 5

Neural Networks (NN) 6

Support Vector Regression (SVR) 7

K-Nearest Neighbor (KNN) 8

In order to test the performance of social media features in short-term traffic 9

prediction, two models is constructed for every target detector under each method. 10

The model 1 is typically based on the traffic features only, while the model 2 depends 11

on both traffic features and tweet features. By comparing the results between Model 1 12

and Model 2, we are able to tell whether the social media features improve the overall 13

performance of the accuracy of traffic prediction. 14

Model 1: Target traffic flow~ (Neighboring Traffic flow) 15

Model 2: Target traffic flow~ (Neighboring Traffic flow) + (Tweet Features) 16

The implementation of those methods is as following. AR model basically 17

implements the linear regression model. The architecture of NN model consists of the 18

input layer, a single hidden layer with 6 neurons and 1 output neuron. In terms of 19

SVR, we used typical linear kernel and epsilon-regression. KNN method takes the 20

straight average of the 10 nearest points in the training set in Euclidean distance to 21

obtain the prediction results. 22

23

4. Performance Results 24

The prediction performance is evaluated by two measures, namely Mean Absolute 25

Percentage Error (MAPE) and Root Mean Square Error (RMSE), defined as follows 26

Ni, He, and Gao

18

∑∑

|

|

√

∑∑

Where K stands for the total number of games; y denotes the estimated value of 1

traffic flow y .; T represents total number of time steps to be predicted before the 2

game starts, and gT indicates the game starting time; 3

Every prediction model is implemented with above four methods. The training data 4

occupies the 70% of entire dataset, while the remaining 30% was treated as test data. 5

We generated 100 instances of training and test data with 100 different random seeds. 6

Every model runs for 100 times. And the performance is evaluated by the average 7

MAPE and RMSE of 100 experiments. 8

Figure 7 shows the evaluation results of two models in four regression methods for 9

four target detectors. As one can see, although the MAPEs doesn’t show consistent 10

results across four regression methods and four detectors, the RMSEs of model 2 with 11

tweet features always outperform those of model 1 with only traffic features. 12

Therefore, it is very beneficial to incorporate tweet features in the prediction models. 13

In other words, the tweet features indeed improve the results of short-term traffic 14

prediction. More importantly, such performance improvement exists regardless of 15

forecasting methods. 16

17

Ni, He, and Gao

19

1

(a) MAPE of 400498 (b) RMSE of 400498

2

(c) MAPE of 400955 (d) RMSE of 400955

3

(e) MAPE of 400190 (f) RMSE of 400190

4

(g) MAPE of 400360 (h) RMSE of 400360

Figure 7 The MAPE and RSME for each traffic detector 5

0

0.02

0.04

0.06

0.08

AR NN SVR KNN

Traffic only Traffic Tweet

0

200

400

600

AR NN SVR KNN


0

0.05

0.1

0.15

AR NN SVR KNN


0

500

1000

AR NN SVR KNN


0

0.02

0.04

0.06

0.08

AR NN SVR KNN


0

200

400

600

AR NN SVR KNN


0

0.05

0.1

AR NN SVR KNN


0

200

400

600

800

AR NN SVR KNN


Ni, He, and Gao

20

As indicated from Figure 7, SVR shows overall best results among four methods for 1

game traffic prediction. The average of MAPE and RMSE of SVR are presented in 2

Table 5. 3

Table 5 Average of MAPE and RMSE for SVR 4

Model 1

Traffic only Model 2

Traffic Tweet Error reduction

(%)

RMSE 385.5252 311.0429 23.946%

MAPE 0.050437 0.047146 6.9804%

The percentage of error reduction in Table 5 indicates the promising value of social 5

media in event traffic prediction. Because it is very difficult to predict the short-term 6

traffic flow under abnormal conditions, any improvement is significant for traffic 7

management and operations, especially when using no-cost and real-time social media 8

data. 9

The computational efficiency of short term traffic prediction is critical for practical 10

applications, such as Advanced Traveler Information Systems (ATIS). The traffic 11

flow data usually reports every 30 seconds from detectors, which can be used directly 12

in real-time. The social media data comes from the Twitter Stream application 13

interface in real time but it requires additional data processing to get the tweet rate 14

features and tweet sentiment features. Considering the hourly aggregation and 15

keywords filtration to obtain sport game related tweet, it is feasible to prepare the 16

right social media data in order to get the desirable tweet features in an hour. 17

Estimation of processing time for an hour tweet feature extraction is twenty minutes. 18

As a result, the proposed prediction models can be directly applied for real-time 19

practical applications. 20

5. Conclusion 21

In this paper, social media data is incorporated with traffic flow data for with prior-22

event traffic prediction. The linkages of tweet data and the sport game traffic is 23

analyzed and recognized to build a short-term traffic flow prediction model. We 24

Ni, He, and Gao

21

identify the traffic flow of neighboring detectors as traffic features; the tweet rates and 1

tweet sentiments as social media features. Four popular forecasting methods, 2

respectively, autoregressive model, neural networks, support vector regression, and k-3

nearest neighbor, are implemented to build the short-term traffic flow prediction 4

models. We then find that the prediction results with both traffic and tweet features 5

outperformed those with only traditional traffic features across different detectors. 6

With support vector regression, average MAPE and RMSE was reduced by 6.98% 7

and 23.95% after including tweet features in the model. We also argue the efficiency 8

of computation time for social media data processing in practical real-time 9

applications. 10

At the same time, another purpose of research is to demonstrate one way to uncover 11

the potential great value of big data generated by the social media. In this case, it can 12

be considered as an effective indicator of public attention and opinions. This 13

possibility may transfer to other branches of transportation, such as travel behavior 14

analysis, transportation related choices, transit scheduling and operations, and so on. 15

Reference 16

1. Yu S, Kak S. A Survey of Prediction Using Social Media [Internet]. 2012 Mar. 17

Report No.: 1203.1647. Available from: http://arxiv.org/abs/1203.1647 18

2. Jansen BJ, Zhang M, Sobel K, Chowdury A. Twitter power: Tweets as electronic 19

word of mouth. Journal of the American Society for Information Science and 20

Technology. 2009;60(11):2169–88. 21

3. Tumasjan A, Sprenger T, Sandner P, Welpe I. Predicting elections with twitter: 22

What 140 characters reveal about political sentiment. 2010. p. 178–85. 23

4. Balasubramanyan R, Routledge BR, Smith NA. From Tweets to Polls : Linking 24

Text Sentiment to Public Opinion Time Series. 2010. 25

5. Asur S, Huberman BA. Predicting the Future with Social Media. 2010 26

IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent 27

Agent Technology (WI-IAT). 2010. p. 492–9. 28

Ni, He, and Gao

22

6. Bollen J, Mao H, Zeng X. Twitter mood predicts the stock market. Journal of 1

Computational Science. 2011 Mar;2(1):1–8. 2

7. Zhang X, Fuehres H, Gloor PA. Predicting Stock Market Indicators Through 3

Twitter “I hope it is not as bad as I fear.”Procedia - Social and Behavioral 4

Sciences. 2011;26:55–62. 5

8. Yasdi R. Prediction of Road Traffic using a Neural Network Approach. Neural 6

Computing & Applications. 1999 May 1;8(2):135–42. 7

9. Castro-Neto M, Jeong Y-S, Jeong M-K, Han LD. Online-SVR for short-term 8

traffic flow prediction under typical and atypical traffic conditions. Expert 9

Systems with Applications. 2009 Apr;36(3, Part 2):6164–73. 10

10. Guo F, Krishnan R, Polak JW. Short-term traffic prediction under normal and 11

incident conditions using singular spectrum analysis and the k-nearest neighbour 12

method. IET and ITS Conference on Road Transport Information and Control 13

(RTIC 2012). 2012. p. 1–6. 14

11. Wu T, Xie K, Xinpin D, Song G. A online boosting approach for traffic flow 15

forecasting under abnormal conditions. 2012 9th International Conference on 16

Fuzzy Systems and Knowledge Discovery (FSKD). 2012. p. 2555–9. 17

12. Choe T, Skabardonis A, Varaiya P. Freeway performance measurement system: 18

Operational analysis tool. Transportation research record. (1811):67–75. 19

13. Chen C, Kwon J, Rice J, Skabardonis A, Varaiya P. Detecting Errors and 20

Imputing Missing Data for Single-Loop Surveillance Systems. Transportation 21

Research Record: Journal of the Transportation Research Board. 2003 Jan 22

1;1855(-1):160–7. 23

14. R Development Core Team. R: A Language and Environment for Statistical 24

Computing. Vienna, Austria: R Foundation for Statistical Computing. 2009. 25

15. Liu F, Weng F, Wang B, Liu Y. Insertion, Deletion, or Substitution 26

Normalizing Text Messages without Pre-categorization nor Supervision. 27

Proceedings of the 49th Annual Meeting of the Association for Computational 28

Linguistics: Human Language Technologies. Portland, Oregon, USA: 29

Association for Computational Linguistics; 2011. p. 71–6. 30

16. Godbole N, Srinivasaiah M, Skiena S. Large-Scale Sentiment Analysis for News 31

and Blogs. Proceedings of the International Conference on Weblogs and Social 32

Media (ICWSM); 2007. 33

Ni, He, and Gao

23

17. Agarwal A, Xie B, Vovsha I, Rambow O, Passonneau R. Sentiment analysis of 1

Twitter data. Proceedings of the Workshop on Languages in Social Media. 2

Stroudsburg, PA, USA: Association for Computational Linguistics; 2011. p. 30–3

8. 4

18. Schmidt M. Least Squares Optimization with L1-Norm Regularization. 5

Technical report. 2005. 6

19. Smith BL, Demetsky MJ. Short-term traffic flow prediction: neural network 7

approach. Transportation Research Record. 1994;(1453). 8

20. Smola AJ, Schölkopf B. A tutorial on support vector regression. Statistics and 9

Computing. 2004 Aug 1;14(3):199–222. 10

21. R. K. Oswald, William T. Scherer, Brian Lee Smith. Traffic flow forecasting 11

using approximate nearest neighbor nonparametric regression. Center for 12

transportation studies at the University of Virginia. 2001;(Research Report No. 13

UVACTS-15-13-7). 14

15

using social media to predict traffic flow under special ...qinghe/papers/conference/2014 ni... ·...

Documents