using social media to predict traffic flow under special ...qinghe/papers/conference/2014 ni... ·...
TRANSCRIPT
Ni, He, and Gao
1
Using Social Media to Predict Traffic Flow under Special Event 1
Conditions 2
Ming Ni 3
Department of Industrial and Systems Engineering 4
University at Buffalo, The State University of New York 5
429 Bell Hall, Buffalo, NY 14260 6
Phone: (716)645-2357 7
Email: [email protected] 8
9
Qing He* 10
Department of Civil, Structural and Environmental Engineering and 11
Department of Industrial and Systems Engineering 12
University at Buffalo, The State University of New York 13
313 Bell Hall, Buffalo, NY 14260 14
Phone: (716)645-3470 15
Email: [email protected] 16
17
Jing Gao 18
Department of Computer Science and Engineering 19
University at Buffalo, The State University of New York 20
350 Davis Hall, Buffalo, NY 14260 21
Phone: (716)645-1586 22
Email: [email protected] 23
24
Submitted to TRB 93rd Annual Meeting for Presentation and Publication 25
January 2014, Washington D.C. 26
November 15, 2013 27
28
Word counts (<7500): 7486 29
Abstract and Manuscript: 4486 30
Number of Tables and Figures: 12 (=250*12=3000) 31
* Corresponding Author
Ni, He, and Gao
2
Abstract 1
Social media is great resource of user-generated contents. Public attention, opinion 2
and hot topics can be captured in the social media, which provides the ability to 3
predict human related events. Since social media can be retrieved in real time with 4
relatively small building and maintenance costs, traffic operation authorities probably 5
identify the social media data as another type sensor for traffic demand. One of those 6
challenges is how to extract reliable traffic related features from big and noisy social 7
media data. The other challenge is how to locate a feasible traffic study that fits well 8
with social media data. In this paper, we aim to use social media information to assist 9
traffic flow prediction under special event conditions. Specially, a short-term traffic 10
flow prediction model, incorporated with tweet features, is developed to forecast the 11
incoming traffic flow prior to sport game events. Both tweet rate features and 12
semantic features are included in the prediction model. We examine and compare the 13
performance of four regression methods, respectively autoregressive model, neural 14
networks model, support vector regression, and k-nearest neighbor, with and without 15
social media features. To the end, we show the benefit gained by including social 16
media information in the prediction model and its computational efficiency for 17
potential practical applications. 18
19
Ni, He, and Gao
3
1. Introduction 1
Social media has become an indicator of modern people and lifestyle in the Internet 2
virtual community. Vast of user-generated contents strengthen linkage and interaction 3
between each individuals within the community circle, also provide large amount of 4
information related to various area. Examples include Facebook, Twitter, YouTube, 5
Google Plus and Wikipedia. The trend of easy accessing social media will 6
continuously grow with the development and commercialization of wearable 7
computer devices, like Google Glass and other smart watches. 8
On the other hand, Traffic congestion is one of the interesting and long-lasting 9
problems in the world. In the last few decades, the sophisticated traffic network has 10
been established around the world, but coming with both day-to-day recurrent traffic 11
congestion and event-caused non-recurrent traffic congestion, which significantly 12
affect the quality of life and impact the U.S. economy. Among different traffic 13
mitigation counter-measures, traffic flow prediction alerts traffic congestion 14
beforehand and facilitates pro-active traffic management and control. With an 15
accurate prediction of traffic flow, road users are able to choose their routes 16
accordingly, and traffic management agencies are capable to operate and control the 17
traffic to alleviate possible congestion. Thus traffic flow prediction is one of the key 18
methods to alleviate traffic congestion, especially under non-recurrent traffic 19
conditions. 20
Motivated by the potentially great value of the social media, in this paper, we intend 21
to incorporate social media data into traffic flow prediction models. Specially, the 22
following question have been raised: Will social media help to predict prior-event 23
traffic? Different with traffic data from traditional detectors, social media data is 24
unorganized, noisy, big, and most of times unrelated to traffic. One of those 25
challenges is how to extract reliable traffic related features from big and noisy social 26
Ni, He, and Gao
4
media data. The other challenge is how to locate a feasible traffic study that fits well 1
with social media data. 2
Due to data availability, the scope of traffic analysis in this paper is constrained in the 3
freeway. The scope of the social media data is restrained to the tweets, which created, 4
shared and modified by the Twitter users. As a basic introduction of tweets, each user 5
of Twitter can be subscribed by other users known as followers. The message posts or 6
status updates, called as tweets, is able to contain 140 characters or less, which 7
typically is one kind of user-generated contents. The "@" sign followed by a 8
username is used for mentioning or replying to other users. Hashtags of tweets are the 9
words or phrases prefixed with a "#" sign, which is used to group posts together or 10
indicated to a topic. 11
This paper will utilize those above components of Twitter massages, like content of 12
tweet, number of hashtags, number of users, number of tweets with URLs, and 13
number of tweets mentioned by other users. We focus on fusing social media data 14
with event traffic data, in order to predict the dynamics of prior-event traffic. In this 15
paper, we build a short-term traffic flow prediction model, incorporated with tweet 16
features to forecast the incoming traffic flow prior to sport game events. We examine 17
and compare the performance of four regression methods, respectively autoregressive 18
model, neural networks model, support vector regression, and k-nearest neighbor, 19
with and without social media features. It is observed that the prediction errors are 20
notably decreased with the aid of social media data. 21
2. Literature review 22
There has not been considerable published research involved to use social media to do 23
traffic prediction. Therefore, \this section investigates literature from two perspectives 24
seperately. One is related to the existing forecasting studies utilizing social media 25
data. The other comes from the prior short-term traffic prediction models under 26
atypical conditions. 27
Ni, He, and Gao
5
Social Media 1
Although the social media includes a variety of web services, the previous studies 2
generally fall into two parts: social relations and user-generated contents (1). Social 3
relations reflect the virtually social network and connections between people on the 4
web. The user-generated contents are created and published by the people, which 5
intend to strengthen linkage and interaction between each individual within the 6
community circle. In this paper, we focus on the previous studies with the user-7
generated contents. 8
Positive or negative emotions were able be extracted from Twitter data about certain 9
company brands and products (2), which can be an auxiliary survey data applying to 10
the company marketing strategy. Public opinion of election in US was studied by 11
some researches (3)(4), effectively using the social data as a public survey to predict 12
the election results with low cost comparing to the traditional telephone survey. As a 13
different idea, a model was built by the rate of Twitter messages, related to movie, to 14
predict movie box-office revenue and confirmed that the Twitter feeds can be 15
effective indicators of real-world performance(5). An evaluation system about stock 16
market was proposed to collect public mood and sentiment form Twitter service as an 17
economic indicator (6). Other research also used social media to predict stock market 18
indicator by analyzing Twitter posts. It is concluded that the correlation between stock 19
market and social media indeed existed by examining emotional Twitter posts and 20
market index such as Dow Jones, NASDAQ and S&P 500 (7). 21
Yu and Kak (1) identified three characteristics of event when social media data can 22
be treated as a good predictor. The three characteristics are, respectively, human 23
related event, masses of people involved, and the event should be easy to be talked in 24
public. 25
Traffic Prediction 26
Ni, He, and Gao
6
Since people intended to publish on social media corresponding to the non-recurrent 1
traffic conditions, such as traffic special events, we focus on the literatures related to 2
the traffic prediction under atypical conditions. 3
Neural Network was used to build the traffic volume prediction model which was 4
based on the time-series data (8). In a similar manner, a supervised statistical learning 5
technique called Online Support Vector machine for Regression, or OL-SVR, was 6
applied for the prediction of short-term freeway traffic flow under both typical and 7
atypical conditions (9). Guo et al. (10) pre-processed traffic data using Singular 8
Spectrum Analysis , and utilized k- nearest neighbor method to predict traffic. Since 9
pre-processing step reduced the noisy sensor inputs, this model can be used under 10
normal and special conditions. An Online boosting nonparametric regression (OBNR) 11
model also was used to perform traffic prediction, which consists of two major parts, 12
respectively, the base part and the boosting part. The base part was constructed under 13
normal conditions, while the gradient boosting part undertook special conditions into 14
account (11). 15
3. Prediction Analysis 16
The framework of prediction analysis is presented as Figure 1. Starting with data 17
preprocessing, both Twitter and traffic features are introduced and extracted by the 18
following four components, traffic data de-trending, game traffic impacted detector 19
identification, game tweet extraction and aggregation, and tweet semantics extraction. 20
21
Figure 1 The flow chart of the prediction analysis 22
Traffic Data
De-trending
Game Tweets
Extraction and
Aggregation
Game Traffic
Impacted Detector
Identification
Tweets Semantics
Extraction
Model Features
Selection
Game Traffic
PredictionModel Building
Ni, He, and Gao
7
Data Description 1
In this paper, traffic data were obtained from the Caltrans Performance Measurement 2
System (PeMS). PeMS is a system designed to maintain California freeway traffic 3
data and compute annual congestion for facilities with surveillance systems in place, 4
typically loop detectors spaced approximately 0.5 mile apart on each freeway lane 5
(12). The analysis uses 1-hour aggregated volume data, collected in four months from 6
February 1, 2013 to May 30, 2013. Some of the detectors may miss or report invalid 7
data in the practice. In order to compensate the missing or incorrect data samples, the 8
diagnostics algorithm and imputation regression models developed in (13) were 9
applied to detect the bad detectors and fill the missing value. This method generated 10
total 12,242,699 entries of hourly traffic flow records in the database. 11
Twitter data was collected by the same spatial and temporal window accordingly, 12
through the Twitter Streaming API with geo-location filter. Right after tweets 13
collected, the filtration of spam and commercial tweets is implemented by the list of 14
Twitter users. This results in a total number of 5,444,527 valid tweets. 15
The sport events are considered as good venues to perform both traffic and social 16
media analysis, since it can be observed by the tweet posts, namely quantity and 17
semantics. In addition, the traffic volume is dramatically influenced by the local sport 18
events due to its popularity. In the study, we specially consider game traffic impact on 19
I880 near Oracle Arena and O.co Coliseum in Oakland, California. Oracle Arena and 20
O.co Coliseum was home to the Oakland Athletics (Athletics) and Golden State 21
Warriors (Warriors) in 2013 game seasons. Oakland Athletics are a Major League 22
Baseball team and Golden State Warriors are an American professional basketball 23
team. There were 51 home games of the Athletics and Warrior from February to May 24
in 2013. Oracle Arena and O.co Coliseum is located right besides Interstate Highway 25
880 (I-880). Six traffic detectors, located before the exits of I-880 to the entrance of 26
Ni, He, and Gao
8
the Oracle Arena, were chosen to analyze incoming prior game traffic, shown as 1
Figure 2. 2
3
Figure 2 Detectors layouts around Oracle Arena and O.co Coliseum 4
Data Pre-processing 5
In this part, we study the tweet and traffic data sets in order to obtain the potential 6
features in the prediction modeling. 7
Game and Non-game Day Comparison 8
Twitter 9
The sport games, like Athletics’ and Warriors’, were considerably interested topics in 10
the Twitter, especially among the Twitter users in the Bay area in order to comment 11
and support their home teams. In order to identify the tweets which are relevant to 12
Ni, He, and Gao
9
each game and each team, the appropriate keywords, shown as Table 1, were used to 1
select the game-related tweets with the help of disambiguation checks. 2
Table 1 Keywords to select game-related tweets 3
Gold State Warriors Oakland Athletics
letsgowarriors
warrior
dubnation
letsgodubs
warriorsground
warriorsgame
greencollar
letsgooakland
athletics
astalk
The keywords are chosen by the related phrases about the teams and the hashtags 4
which are used to group the messages about the teams or related games. Those 5
keywords are tested by a fraction of tweet data to verify whether the relevant tweet 6
messages would be included or not. After iterative trials, those keywords in the Table 7
1 have been finalized. 8
The game-relevant tweets were aggregated into hourly time series in order to be 9
compatible with Traffic data. For each hour in the 4 months, we extracted 5 features 10
of tweet rates, shown as Table 2. 11
Table 2 Features for Tweet Rates 12
Features for Tweet rates Description
no_of_tws Number of tweets related to game in one hour
no_of_users Number of independent users sent those tweets
no_of_hashtags Number of hashtags in all those tweets
no_of_tws_mention Number of tweets mentioned other twitter user
no_of_tws_urls Number of tweets which has URLs
13
Ni, He, and Gao
10
1
(a) Number of tweets per hour at game days 2
3
(b) Number of tweets per hour at non-game days 4
Figure 3 Comparisons of tweet rates on game days and non-game days 5
Note the non-game days stand for the days which don’t carry on either home game or 6
away game, while game days denotes only for the days carrying on home game at 7
Oracle Arena or O.co Coliseum. 8
Figure 3 shows the number of tweets per hour for Golden State Warriors at their game 9
days and non-game days. One can see that there are much more tweets on the game 10
day than that on the non-game day. Therefore, we can use the tweet rate series on the 11
same day to indicate the total public attention for this game. 12
Traffic 13
0
50
100
150
200
250
300
2/2/20130:00
2/16/20130:00
3/2/20130:00
3/16/20130:00
3/30/20130:00
4/13/20130:00
4/27/20130:00
5/11/20130:00
no_of_tws
0
50
100
150
200
250
300
2/1/2013 0:00 3/1/2013 0:00 4/1/2013 0:00 5/1/2013 0:00
no_of_tws
Ni, He, and Gao
11
First of all, In order to eliminate the effects of day-to-day traffic fluctuations, we de-1
trended the traffic flow by subtracting the average hour-of-day and day-of-week 2
traffic volumes. For every record of traffic flow, the below equation was used to 3
reduce the periodic variance so that we can only concentrate on the increment of the 4
traffic flow. 5
wtwtwt yyy 6
Where y stands for the traffic flow; w indicates it is weekday or weekend; t denotes 7
the hour of day; wty is the average time-of-day and day-of-week traffic flow. 8
The time-of-day patterns of traffic flow are demonstrated in Figure 4 for both 9
weekday and weekend. Figure 4 shows the different periodic fluctuations of traffic 10
between weekday and weekend. In traffic prediction, it is common practice to exclude 11
such fluctuations. Therefore, the estimated variations of traffic flow, respectively 12
weekday and weekday, are used here to de-trend the hourly traffic flow data. 13
14
(a) Hourly traffic flow for weekdays (b) Hourly traffic flow for weekends
Figure 4 Hourly traffic flow for detectors 15
0
1000
2000
3000
4000
5000
6000
7000
8000
1 3 5 7 9 11 13 15 17 1921 23
Traf
fic
Flo
w
time stamp (hourly)
tf_400498 tf_400955
tf_400190 tf_400956
tf_400360 tf_400333
0
1000
2000
3000
4000
5000
6000
7000
1 3 5 7 9 1113 15 17 19 21 23
Traf
fic
Flo
w
time stamp (hourly)
tf_400498 tf_400955
tf_400190 tf_400956
tf_400360 tf_400333
Ni, He, and Gao
12
1
Figure 5 The de-trended prior-game traffic flow 2
The de-trended prior game traffic is plotted in Figure 5, which takes the last 2 hours 3
before the game event starting into account. The majority of de-trended game traffic 4
series is greater than zero, which indicates more traffic demand on the game day. 5
In order to test the impact of game on traffic data, a single-factor analysis of variance 6
(ANOVA) was used to differentiate between two populations of data sets through 7
factor “game day”. One dataset contains the de-trended traffic flows within 4 hours 8
before the game starting time. The other includes the binary game indicator during the 9
same time on every non-game day. 10
11
Table 3 ANOVA test results for factor “game day” 12
Detector P-value Significant or not
400498 0.0064 YES
400955 1.24e-13 YES
400190 6.3e-07 YES
400956 0.029 NO
400360 4.52e-13 YES
400333 7.45e-13 YES
-1300
-800
-300
200
700
1200
1700
2200
1 7
13
19
25
31
37
43
49
55
61
67
73
79
85
91
97
10
3
10
9
11
5
12
1
12
7
13
3
13
9
14
5
15
1
De
-tre
nd
ed
Tra
ffic
Flo
w
time stamp (hourly)
tf_400498 tf_400955 tf_400190
tf_400956 tf_400360 tf_400333
Ni, He, and Gao
13
Table 3 presents the results from ANOVA analysis. As one can see, sport games 1
generate significantly impact on the traffic volume of most surrounding detectors on 2
game days. Note that the traffic flow at detector 400956 does not demonstrate 3
significant differences caused by game events, since the detector is located at 4
downstream of the exit towards Oracle Arena. In addition, the distance between 5
detector 400333 and Oracle Arena is much larger comparing with other detectors. For 6
the purpose of simplification, these 2 traffic detectors are treated as only the 7
neighboring detectors rather than the analytical targets to be predicted in the following 8
analysis. 9
Tweet Semantics Extraction 10
One of important differences between tweets and other types of data is that tweets 11
consist of rich sentiments. For our analysis, we believe the public sentiment about one 12
particular game has influent on the attendance, just like the relationship between the 13
public opinion about a company and the stock price. Psychologically, more biased 14
emotion about one topic, more attention would be gained in the public. Therefore, 15
sentiment extraction is necessarily introduced to the analysis, which served as the 16
potential semantic features of tweet data to be investigated in the following study. 17
Sentiment analysis is an important part in machine learning. In here, sentiment results 18
are treated as potential tweet features. In this paper, the sentiment analysis is 19
performed based on an R (14) package, called “tm.plugin.sentiment”, which provides 20
the functions for natural language sentiment processing. The words of text in each 21
Tweet are labeled as positive, negative and neutral references. Before conducting the 22
sentiment analysis, the text part of tweets was initially preprocessed by the following 23
steps. 24
1. Make each letter lowercase, 25
2. Remove punctuation and stopwords. 26
3. Replace the abbreviations from the formal English words. 27
Ni, He, and Gao
14
Our study implemented a unified letter transformation approach (15) to normalize the 1
Tweet post, to replace the non-standard tokens to the Standard English words. This 2
method tends to alleviate the biased effect of abnormal text on the sentiment analysis. 3
In order to quantify the sentiments of tweets, a sentiment scores system was 4
implemented and based on the Lydia/TextMap system (16)-(17). There are 5 5
measures listed as follows. 6
polarity (p - n / p + n) : difference of positive and negative sentiment 7
references / total number of sentiment references 8
subjectivity (p + n / N) : total number of sentiment references / total number 9
of references 10
pos_refs_per_ref (p / N) : total number of positive sentiment references / 11
total number of references 12
neg_refs_per_ref (n / N) : total number of negative sentiment references / 13
total number of references 14
senti_diffs_per_ref (p - n / N) : difference of positive and negative 15
sentiment references / total number of references 16
Above 5 measures were calculated for every tweet post. 17
Prediction Analysis and Modeling 18
After the traffic and tweet data have been processed, there are many usable features of 19
those two, respectively the traffic flowing of neighboring detectors, and 5 tweet rate 20
features plus 5 tweet sentiment score features. This section, two kinds of features are 21
combined together to build the short-term traffic prediction model under the special 22
event conditions. 23
Prediction Features Selection 24
Because there are many potential features of traffic and tweet at hand, it is essential to 25
select the right significant features to build a parsimonious prediction model. 26
Ni, He, and Gao
15
It varies that the correlations between tweets features and traffic flow for each 1
detectors, as Figure 6 shows. Therefore, the features selection is an essential step and 2
very significantly affects the performance of the prediction modals. 3
4
(a) Correlation plot of twitter polarity
and 400190 traffic flow
(b) Correlation plot of No. of Twitter user
and 400955 traffic flow
Figure 6 Correlation analysis between tweet features and game traffic incremental 5
Where the “W” stands for games of Golden State Warriors. “A” stands for games of 6
Oakland Athletics. “AW” denotes the game time happened to have two teams 7
together at the Oracle Arena. “ALL” is for correlation for the all of games together. 8
Those traffic and tweet features were combined into an optimal matrix for each traffic 9
detector. Here, we used the least squares optimization with L1-norm regularization 10
(18) to solve this problem. 11
∑∑ ∑ ∑
∑
∑| |
Where K is the number of game; T stands for the time steps before game start to be 12
predicted and gT indicates the game starting time; M is total number of neighboring 13
detectors; L denotes the number of time lags of the neighboring detectors; P is the 14
number of tweet features; r represents the number of time lags of tweets relative to 15
Ni, He, and Gao
16
traffic data; ml is the coefficients for traffic variables; pV is the coefficients for tweet 1
variables; indicates the intercept of the model. 0 is the regularization 2
parameter that balances the L1 regularization term and least square term. 3
By applying different values of , we solved the above optimization problem by 4
CPLEX. Appropriate traffic and tweet features were selected with non-zero 5
coefficients. For the traffic features, the traffic flow variables of neighboring detectors 6
were always chosen by the optimization model. For the tweet features, every target 7
detector had different combination of tweet rate features and tweet sentiment features, 8
shown as Table 4. 9
Table 4 Selected tweet features for each detector when 4000 10
Traffic
Flow Selected Tweet Features
Traffic
Flow Selected Tweet Features
tf_400498_1 no_of_tws_a_sum4 no_of_tws_mentions_a_sum4 no_of_tws_mentions_w_sum4 no_of_tws_urls_w_sum4
tf_400190_1 no_of_hashtags_a_sum4
no_of_hashtags_w_sum4
no_of_tws_a_sum4
no_of_tws_mentions_a_sum4
no_of_tws_mentions_w_sum4
no_of_tws_urls_a_sum4
polarity_sum_w_sum4
tf_400955_1 no_of_hashtags_a_sum4 no_of_hashtags_w_sum4 no_of_tws_mentions_a_sum4 no_of_tws_mentions_w_sum4 no_of_tws_urls_a_sum4 no_of_tws_urls_w_sum4 no_of_tws_w_sum4 no_of_users_a_sum4 no_of_users_w_sum4 polarity_sum_w_sum4
tf_400360_1 no_of_hashtags_a_sum4
no_of_hashtags_w_sum4
no_of_tws_a_sum4
no_of_tws_mentions_w_sum4
no_of_tws_urls_a_sum4
no_of_users_a_sum4
no_of_users_w_sum4
In the Table 4, the independent variables are the volumes of traffic flow (tf) in last 1 11
hour prior to the games. Because there are 2 sport teams, in here, “w” stands for 12
Golden State Warriors and “a” stands for Oakland Athletics. “sum4” means that the 13
tweet feature is aggregated in last 4 hours before the predicting time. 14
Prediction Models 15
Ni, He, and Gao
17
There were many existing prediction approaches for short-term traffic flow, although 1
most of them focus on traffic features only. To compare with the performance under 2
different methods, we examine the following four popular methods to predict short-3
term event traffic (19) (20) (21): 4
Autoregressive Model (AR) 5
Neural Networks (NN) 6
Support Vector Regression (SVR) 7
K-Nearest Neighbor (KNN) 8
In order to test the performance of social media features in short-term traffic 9
prediction, two models is constructed for every target detector under each method. 10
The model 1 is typically based on the traffic features only, while the model 2 depends 11
on both traffic features and tweet features. By comparing the results between Model 1 12
and Model 2, we are able to tell whether the social media features improve the overall 13
performance of the accuracy of traffic prediction. 14
Model 1: Target traffic flow~ (Neighboring Traffic flow) 15
Model 2: Target traffic flow~ (Neighboring Traffic flow) + (Tweet Features) 16
The implementation of those methods is as following. AR model basically 17
implements the linear regression model. The architecture of NN model consists of the 18
input layer, a single hidden layer with 6 neurons and 1 output neuron. In terms of 19
SVR, we used typical linear kernel and epsilon-regression. KNN method takes the 20
straight average of the 10 nearest points in the training set in Euclidean distance to 21
obtain the prediction results. 22
23
4. Performance Results 24
The prediction performance is evaluated by two measures, namely Mean Absolute 25
Percentage Error (MAPE) and Root Mean Square Error (RMSE), defined as follows 26
Ni, He, and Gao
18
∑∑
|
|
√
∑∑
Where K stands for the total number of games; y denotes the estimated value of 1
traffic flow y .; T represents total number of time steps to be predicted before the 2
game starts, and gT indicates the game starting time; 3
Every prediction model is implemented with above four methods. The training data 4
occupies the 70% of entire dataset, while the remaining 30% was treated as test data. 5
We generated 100 instances of training and test data with 100 different random seeds. 6
Every model runs for 100 times. And the performance is evaluated by the average 7
MAPE and RMSE of 100 experiments. 8
Figure 7 shows the evaluation results of two models in four regression methods for 9
four target detectors. As one can see, although the MAPEs doesn’t show consistent 10
results across four regression methods and four detectors, the RMSEs of model 2 with 11
tweet features always outperform those of model 1 with only traffic features. 12
Therefore, it is very beneficial to incorporate tweet features in the prediction models. 13
In other words, the tweet features indeed improve the results of short-term traffic 14
prediction. More importantly, such performance improvement exists regardless of 15
forecasting methods. 16
17
Ni, He, and Gao
19
1
(a) MAPE of 400498 (b) RMSE of 400498
2
(c) MAPE of 400955 (d) RMSE of 400955
3
(e) MAPE of 400190 (f) RMSE of 400190
4
(g) MAPE of 400360 (h) RMSE of 400360
Figure 7 The MAPE and RSME for each traffic detector 5
0
0.02
0.04
0.06
0.08
AR NN SVR KNN
Traffic only Traffic Tweet
0
200
400
600
AR NN SVR KNN
Traffic only Traffic Tweet
0
0.05
0.1
0.15
AR NN SVR KNN
Traffic only Traffic Tweet
0
500
1000
AR NN SVR KNN
Traffic only Traffic Tweet
0
0.02
0.04
0.06
0.08
AR NN SVR KNN
Traffic only Traffic Tweet
0
200
400
600
AR NN SVR KNN
Traffic only Traffic Tweet
0
0.05
0.1
AR NN SVR KNN
Traffic only Traffic Tweet
0
200
400
600
800
AR NN SVR KNN
Traffic only Traffic Tweet
Ni, He, and Gao
20
As indicated from Figure 7, SVR shows overall best results among four methods for 1
game traffic prediction. The average of MAPE and RMSE of SVR are presented in 2
Table 5. 3
Table 5 Average of MAPE and RMSE for SVR 4
Model 1
Traffic only Model 2
Traffic Tweet Error reduction
(%)
RMSE 385.5252 311.0429 23.946%
MAPE 0.050437 0.047146 6.9804%
The percentage of error reduction in Table 5 indicates the promising value of social 5
media in event traffic prediction. Because it is very difficult to predict the short-term 6
traffic flow under abnormal conditions, any improvement is significant for traffic 7
management and operations, especially when using no-cost and real-time social media 8
data. 9
The computational efficiency of short term traffic prediction is critical for practical 10
applications, such as Advanced Traveler Information Systems (ATIS). The traffic 11
flow data usually reports every 30 seconds from detectors, which can be used directly 12
in real-time. The social media data comes from the Twitter Stream application 13
interface in real time but it requires additional data processing to get the tweet rate 14
features and tweet sentiment features. Considering the hourly aggregation and 15
keywords filtration to obtain sport game related tweet, it is feasible to prepare the 16
right social media data in order to get the desirable tweet features in an hour. 17
Estimation of processing time for an hour tweet feature extraction is twenty minutes. 18
As a result, the proposed prediction models can be directly applied for real-time 19
practical applications. 20
5. Conclusion 21
In this paper, social media data is incorporated with traffic flow data for with prior-22
event traffic prediction. The linkages of tweet data and the sport game traffic is 23
analyzed and recognized to build a short-term traffic flow prediction model. We 24
Ni, He, and Gao
21
identify the traffic flow of neighboring detectors as traffic features; the tweet rates and 1
tweet sentiments as social media features. Four popular forecasting methods, 2
respectively, autoregressive model, neural networks, support vector regression, and k-3
nearest neighbor, are implemented to build the short-term traffic flow prediction 4
models. We then find that the prediction results with both traffic and tweet features 5
outperformed those with only traditional traffic features across different detectors. 6
With support vector regression, average MAPE and RMSE was reduced by 6.98% 7
and 23.95% after including tweet features in the model. We also argue the efficiency 8
of computation time for social media data processing in practical real-time 9
applications. 10
At the same time, another purpose of research is to demonstrate one way to uncover 11
the potential great value of big data generated by the social media. In this case, it can 12
be considered as an effective indicator of public attention and opinions. This 13
possibility may transfer to other branches of transportation, such as travel behavior 14
analysis, transportation related choices, transit scheduling and operations, and so on. 15
Reference 16
1. Yu S, Kak S. A Survey of Prediction Using Social Media [Internet]. 2012 Mar. 17
Report No.: 1203.1647. Available from: http://arxiv.org/abs/1203.1647 18
2. Jansen BJ, Zhang M, Sobel K, Chowdury A. Twitter power: Tweets as electronic 19
word of mouth. Journal of the American Society for Information Science and 20
Technology. 2009;60(11):2169–88. 21
3. Tumasjan A, Sprenger T, Sandner P, Welpe I. Predicting elections with twitter: 22
What 140 characters reveal about political sentiment. 2010. p. 178–85. 23
4. Balasubramanyan R, Routledge BR, Smith NA. From Tweets to Polls : Linking 24
Text Sentiment to Public Opinion Time Series. 2010. 25
5. Asur S, Huberman BA. Predicting the Future with Social Media. 2010 26
IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent 27
Agent Technology (WI-IAT). 2010. p. 492–9. 28
Ni, He, and Gao
22
6. Bollen J, Mao H, Zeng X. Twitter mood predicts the stock market. Journal of 1
Computational Science. 2011 Mar;2(1):1–8. 2
7. Zhang X, Fuehres H, Gloor PA. Predicting Stock Market Indicators Through 3
Twitter “I hope it is not as bad as I fear.”Procedia - Social and Behavioral 4
Sciences. 2011;26:55–62. 5
8. Yasdi R. Prediction of Road Traffic using a Neural Network Approach. Neural 6
Computing & Applications. 1999 May 1;8(2):135–42. 7
9. Castro-Neto M, Jeong Y-S, Jeong M-K, Han LD. Online-SVR for short-term 8
traffic flow prediction under typical and atypical traffic conditions. Expert 9
Systems with Applications. 2009 Apr;36(3, Part 2):6164–73. 10
10. Guo F, Krishnan R, Polak JW. Short-term traffic prediction under normal and 11
incident conditions using singular spectrum analysis and the k-nearest neighbour 12
method. IET and ITS Conference on Road Transport Information and Control 13
(RTIC 2012). 2012. p. 1–6. 14
11. Wu T, Xie K, Xinpin D, Song G. A online boosting approach for traffic flow 15
forecasting under abnormal conditions. 2012 9th International Conference on 16
Fuzzy Systems and Knowledge Discovery (FSKD). 2012. p. 2555–9. 17
12. Choe T, Skabardonis A, Varaiya P. Freeway performance measurement system: 18
Operational analysis tool. Transportation research record. (1811):67–75. 19
13. Chen C, Kwon J, Rice J, Skabardonis A, Varaiya P. Detecting Errors and 20
Imputing Missing Data for Single-Loop Surveillance Systems. Transportation 21
Research Record: Journal of the Transportation Research Board. 2003 Jan 22
1;1855(-1):160–7. 23
14. R Development Core Team. R: A Language and Environment for Statistical 24
Computing. Vienna, Austria: R Foundation for Statistical Computing. 2009. 25
15. Liu F, Weng F, Wang B, Liu Y. Insertion, Deletion, or Substitution 26
Normalizing Text Messages without Pre-categorization nor Supervision. 27
Proceedings of the 49th Annual Meeting of the Association for Computational 28
Linguistics: Human Language Technologies. Portland, Oregon, USA: 29
Association for Computational Linguistics; 2011. p. 71–6. 30
16. Godbole N, Srinivasaiah M, Skiena S. Large-Scale Sentiment Analysis for News 31
and Blogs. Proceedings of the International Conference on Weblogs and Social 32
Media (ICWSM); 2007. 33
Ni, He, and Gao
23
17. Agarwal A, Xie B, Vovsha I, Rambow O, Passonneau R. Sentiment analysis of 1
Twitter data. Proceedings of the Workshop on Languages in Social Media. 2
Stroudsburg, PA, USA: Association for Computational Linguistics; 2011. p. 30–3
8. 4
18. Schmidt M. Least Squares Optimization with L1-Norm Regularization. 5
Technical report. 2005. 6
19. Smith BL, Demetsky MJ. Short-term traffic flow prediction: neural network 7
approach. Transportation Research Record. 1994;(1453). 8
20. Smola AJ, Schölkopf B. A tutorial on support vector regression. Statistics and 9
Computing. 2004 Aug 1;14(3):199–222. 10
21. R. K. Oswald, William T. Scherer, Brian Lee Smith. Traffic flow forecasting 11
using approximate nearest neighbor nonparametric regression. Center for 12
transportation studies at the University of Virginia. 2001;(Research Report No. 13
UVACTS-15-13-7). 14
15