


Attention Based Spatial-Temporal Graph Convolutional Networks for Traffic Flow Forecasting

Shengnan Guo,1,2 Youfang Lin,1,2,3 Ning Feng,1,3 Chao Song,1,2 Huaiyu Wan1,2,3∗
1School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China

2Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing, China
3CAAC Key Laboratory of Intelligent Passenger Service of Civil Aviation, Beijing, China

{guoshn, yflin, fengning, chaosong, hywan}@bjtu.edu.cn

Abstract

Forecasting traffic flows is a critical issue for researchers and practitioners in the field of transportation. However, it is very challenging since traffic flows usually show high nonlinearities and complex patterns. Most existing traffic flow prediction methods lack the ability to model the dynamic spatial-temporal correlations of traffic data and thus cannot yield satisfactory prediction results. In this paper, we propose a novel attention based spatial-temporal graph convolutional network (ASTGCN) model to solve the traffic flow forecasting problem. ASTGCN mainly consists of three independent components that respectively model three temporal properties of traffic flows, i.e., recent, daily-periodic and weekly-periodic dependencies. More specifically, each component contains two major parts: 1) a spatial-temporal attention mechanism to effectively capture the dynamic spatial-temporal correlations in traffic data; 2) a spatial-temporal convolution that simultaneously employs graph convolutions to capture the spatial patterns and standard convolutions to describe the temporal features. The outputs of the three components are fused with learned weights to generate the final prediction results. Experiments on two real-world datasets from the Caltrans Performance Measurement System (PeMS) demonstrate that the proposed ASTGCN model outperforms the state-of-the-art baselines.

Introduction

Recently, many countries have committed to vigorously developing Intelligent Transportation Systems (ITS) (Zhang et al. 2011) to support efficient traffic management. Traffic forecasting is an indispensable part of ITS, especially on highways, which carry large traffic flows at high driving speeds. Since a highway is relatively closed, once congestion occurs it seriously affects the traffic capacity. Traffic flow is a fundamental measurement reflecting the state of the highway. If it can be predicted accurately in advance, traffic management authorities will be able to guide vehicles more reasonably and enhance the running efficiency of the highway network.

Highway traffic flow forecasting is a typical problem of spatial-temporal data forecasting. Traffic data are recorded at fixed points in time and at fixed locations distributed

∗Corresponding author: [email protected]
Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: The spatial-temporal correlation diagram of traffic flow. (a) Spatial influence of traffic flows at different times; (b) temporal influence between traffic flows.

in continuous space. Apparently, the observations made at neighboring locations and time stamps are not independent but dynamically correlated with each other. Therefore, the key to solving such problems is to effectively extract the spatial-temporal correlations of the data. Fig. 1 illustrates the spatial-temporal correlations of traffic flows (which can also be vehicle speed, lane occupancy, etc.). The bold line between two points represents their mutual influence strength; the darker the color of the line, the greater the influence. In the spatial dimension (Fig. 1(a)), we can see that different locations have different impacts on A, and even the same location has varying influence on A as time goes by. In the temporal dimension (Fig. 1(b)), the historical observations of different locations have varying impacts on A's traffic states at different times in the future. In conclusion, the correlations in traffic data on the highway network show strong dynamics in both the spatial and temporal dimensions. How to explore nonlinear and complex spatial-temporal data to discover its inherent spatial-temporal patterns and make accurate traffic flow predictions is a very challenging issue.

Fortunately, with the development of the transportation industry, many cameras, sensors and other information collection devices have been deployed on highways. Each


device is placed at a unique geospatial location, constantly generating time series data about traffic. These devices have accumulated a large amount of rich traffic time series data with geographic information, providing a solid data foundation for traffic forecasting. Many researchers have already made great efforts to solve such problems. Early on, time series analysis models were employed for traffic prediction problems, yet it is difficult for them to handle unstable and nonlinear data in practice. Later, traditional machine learning methods were developed to model more complex data, but it is still difficult for them to simultaneously consider the spatial-temporal correlations of high-dimensional traffic data. Moreover, the prediction performance of such methods relies heavily on feature engineering, which often requires extensive experience from domain experts. In recent years, many researchers have used deep learning methods to deal with high-dimensional spatial-temporal data: convolutional neural networks (CNN) are adopted to effectively extract the spatial features of grid-based data, and graph convolutional neural networks (GCN) are used to describe the spatial correlations of graph-based data. However, these methods still fail to simultaneously model the spatial-temporal features and dynamic correlations of traffic data.

In order to tackle the above challenges, we propose a novel deep learning model: the Attention based Spatial-Temporal Graph Convolutional Network (ASTGCN), which collectively predicts traffic flow at every location on the traffic network. This model can process the traffic data directly on the original graph-based traffic network and effectively capture the dynamic spatial-temporal features. The main contributions of this paper are summarized as follows:

• We develop a spatial-temporal attention mechanism to learn the dynamic spatial-temporal correlations of traffic data. Specifically, a spatial attention is applied to model the complex spatial correlations between different locations, and a temporal attention is applied to capture the dynamic temporal correlations between different times.

• A novel spatial-temporal convolution module is designed for modeling the spatial-temporal dependencies of traffic data. It consists of graph convolutions for capturing spatial features from the original graph-based traffic network structure and convolutions in the temporal dimension for describing dependencies across nearby time slices.

• Extensive experiments are carried out on real-world highway traffic datasets, which verify that our model achieves the best prediction performance compared to the existing baselines.

Related work

Traffic forecasting After years of continuous research and practice, many achievements have been made in the study of traffic forecasting. The statistical models used for traffic prediction include HA, ARIMA (Williams and Hoel 2003), VAR (Zivot and Wang 2006), etc. These approaches require the data to satisfy certain assumptions, but traffic data is too complex to satisfy them, so they usually perform poorly in practice. Machine learning methods such as KNN (Van Lint and Van Hinsbergen 2012) and

SVM (Jeong et al. 2013) can model more complex data, but they need careful feature engineering. Since deep learning has brought about breakthroughs in many domains, such as speech recognition and image processing, more and more researchers apply deep learning to spatial-temporal data prediction. Zhang et al. (2018) designed a ST-ResNet model based on the residual convolution unit to predict crowd flows. Yao et al. (2018b) proposed a method to predict traffic by integrating CNN and long short-term memory (LSTM) networks to jointly model both spatial and temporal dependencies. Yao et al. (2018a) further proposed a Spatial-Temporal Dynamic Network for taxi demand prediction which can learn the similarity between locations dynamically. Although the spatial-temporal features of traffic data can be extracted by these models, their limitation is that the input must be standard 2D or 3D grid data.

Convolutions on graphs The traditional convolution can effectively extract the local patterns of data, but it can only be applied to standard grid data. Recently, the graph convolution has generalized the traditional convolution to data with graph structures. The two mainstreams of graph convolution methods are the spatial methods and the spectral methods. The spatial methods directly perform convolution filters on a graph's nodes and their neighbors, so the core of this kind of method is to select the neighborhood of the nodes. Niepert, Ahmed, and Kutzkov (2016) proposed a heuristic linear method to select the neighborhood of every center node, which achieved good results in social network tasks. Li et al. (2018) introduced graph convolutions into human action recognition tasks, proposing several partitioning strategies to divide the neighborhood of each node into different subsets and to ensure the numbers of each node's subsets are equal. In the spectral methods, the locality of the graph convolution is considered by spectral analysis. A general graph convolution framework based on the graph Laplacian was proposed by Bruna et al. (2014); then Defferrard, Bresson, and Vandergheynst (2016) optimized the method by using a Chebyshev polynomial approximation to avoid the expensive eigenvalue decomposition. Yu, Yin, and Zhu (2018) proposed a gated graph convolution network for traffic prediction based on this method, but the model does not consider the dynamic spatial-temporal correlations of traffic data.

Attention mechanism Recently, attention mechanisms have been widely used in various tasks such as natural language processing, image captioning and speech recognition. The goal of the attention mechanism is to select the information that is relatively critical to the current task from all the input. Xu et al. (2015) proposed two attention mechanisms for the image description task and adopted a visualization method to intuitively show the effect of the attention mechanism. For classifying nodes of a graph, Velickovic et al. (2018) leveraged self-attentional layers to process graph-structured data with neural networks and achieved state-of-the-art results. To forecast time series, Liang et al. (2018) proposed a multi-level attention network to adaptively adjust the correlations among multiple geographic sensor time series. However, it is time-consuming in practice since a separate model needs to be trained for each time series.


Motivated by the studies mentioned above, and considering the graph structure of the traffic network and the dynamic spatio-temporal patterns of the traffic data, we simultaneously employ graph convolutions and attention mechanisms to model the network-structured traffic data.

Preliminaries

Traffic Networks

In this study, we define a traffic network as an undirected graph G = (V, E, A), as shown in Fig. 2(a), where V is a finite set of |V| = N nodes; E is a set of edges, indicating the connectivity between the nodes; and A ∈ R^{N×N} denotes the adjacency matrix of graph G. Each node on the traffic network G detects F measurements with the same sampling frequency, that is, each node generates a feature vector of length F at each time slice, as shown by the solid lines in Fig. 2(b).

Figure 2: (a) The spatial-temporal structure of traffic data, where the data at each time slice forms a graph; (b) three measurements (flow, occupancy, speed) are detected on a node and the future traffic flow is the forecasting target. Here, all measurements are normalized to [0, 1].

Traffic Flow Forecasting

Suppose the f-th time series recorded on each node in the traffic network G is the traffic flow sequence, where f ∈ (1, ..., F). We use x_t^{c,i} ∈ R to denote the value of the c-th feature of node i at time t, and x_t^i ∈ R^F to denote the values of all the features of node i at time t. X_t = (x_t^1, x_t^2, ..., x_t^N)^T ∈ R^{N×F} denotes the values of all the features of all the nodes at time t. X = (X_1, X_2, ..., X_τ)^T ∈ R^{N×F×τ} denotes the values of all the features of all the nodes over τ time slices. In addition, we set y_t^i = x_t^{f,i} ∈ R to represent the traffic flow of node i at time t in the future.

Problem. Given X, the historical measurements of all the nodes on the traffic network over the past τ time slices, predict the future traffic flow sequences Y = (y^1, y^2, ..., y^N)^T ∈ R^{N×T_p} of all the nodes on the whole traffic network over the next T_p time slices, where y^i = (y_{τ+1}^i, y_{τ+2}^i, ..., y_{τ+T_p}^i) ∈ R^{T_p} denotes the future traffic flow of node i starting from time τ + 1.

Attention Based Spatial-Temporal Graph Convolutional Networks

Fig. 3 presents the overall framework of the ASTGCN model proposed in this paper. It consists of three independent components with the same structure, which are designed to respectively model the recent, daily-periodic and weekly-periodic dependencies of the historical data.

Figure 3: The framework of ASTGCN. SAtt: Spatial Attention; TAtt: Temporal Attention; GCN: Graph Convolution; Conv: Convolution; FC: Fully-connected; ST block: Spatial-Temporal block.

Suppose the sampling frequency is q times per day. Assume that the current time is t_0 and the size of the predicting window is T_p. As shown in Fig. 4, we intercept three time series segments of length T_h, T_d and T_w along the time axis as the input of the recent, daily-period and weekly-period components respectively, where T_h, T_d and T_w are all integer multiples of T_p. Details about the three time series segments are as follows:

(1) The recent segment: X_h = (X_{t_0−T_h+1}, X_{t_0−T_h+2}, ..., X_{t_0}) ∈ R^{N×F×T_h}, a segment of the historical time series directly adjacent to the predicting period, as shown by the green part of Fig. 4. Intuitively, the formation and dispersion of traffic congestion are gradual, so the just-past traffic flows inevitably influence the future traffic flows.

(2) The daily-periodic segment: X_d = (X_{t_0−(T_d/T_p)·q+1}, ..., X_{t_0−(T_d/T_p)·q+T_p}, X_{t_0−(T_d/T_p−1)·q+1}, ..., X_{t_0−(T_d/T_p−1)·q+T_p}, ..., X_{t_0−q+1}, ..., X_{t_0−q+T_p}) ∈ R^{N×F×T_d} consists of the segments from the past few days at the same time period as the predicting period, as shown by the red part of Fig. 4. Due to the regular daily routine of people, traffic data may show repeated patterns, such as the daily morning peaks. The purpose of the daily-period component is to model the daily periodicity of traffic data.

(3) The weekly-periodic segment: X_w = (X_{t_0−7·(T_w/T_p)·q+1}, ..., X_{t_0−7·(T_w/T_p)·q+T_p}, X_{t_0−7·(T_w/T_p−1)·q+1}, ..., X_{t_0−7·(T_w/T_p−1)·q+T_p}, ..., X_{t_0−7·q+1}, ..., X_{t_0−7·q+T_p}) ∈ R^{N×F×T_w} is composed of the segments from the last few weeks, which have the same week attributes and time intervals as the forecasting period, as


shown by the blue part of Fig. 4. Usually, the traffic patterns on a Monday have a certain similarity with the traffic patterns on historical Mondays, but may be greatly different from those on weekends. Thus, the weekly-period component is designed to capture the weekly periodic features in traffic data.
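As a concrete illustration, the index arithmetic for the three segments can be sketched in Python (a hypothetical helper, not the authors' code; time slices are 0-indexed and `t0` is the index of the last observed slice):

```python
def segment_indices(t0, q, Tp, Th, Td, Tw):
    """Return the time-slice indices selected for the recent,
    daily-periodic and weekly-periodic input segments.

    t0: index of the current (last observed) time slice
    q : number of samples per day
    Tp: size of the predicting window
    Th, Td, Tw: segment lengths (integer multiples of Tp)
    """
    # Recent: the Th slices immediately before the prediction window.
    recent = list(range(t0 - Th + 1, t0 + 1))

    # Daily: for each of the past Td/Tp days, the Tp slices aligned
    # with the prediction window (offsets of 1, 2, ... days).
    daily = []
    for d in range(Td // Tp, 0, -1):
        start = t0 - d * q
        daily += list(range(start + 1, start + Tp + 1))

    # Weekly: the same alignment, but with offsets of whole weeks (7 days).
    weekly = []
    for w in range(Tw // Tp, 0, -1):
        start = t0 - 7 * w * q
        weekly += list(range(start + 1, start + Tp + 1))

    return recent, daily, weekly
```

With q = 288 (5-minute samples) and T_h = T_d = T_w = 2·T_p, this reproduces the green/red/blue selections of Fig. 4.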

Figure 4: An example of constructing the input of time series segments (suppose the size of the predicting window is 1 hour, and T_h, T_d and T_w are twice T_p).

The three components share the same network structure, and each of them consists of several spatial-temporal blocks and a fully-connected layer. Each spatial-temporal block contains a spatial-temporal attention module and a spatial-temporal convolution module. In order to optimize the training efficiency, we adopt a residual learning framework (He et al. 2016) in each component. In the end, the outputs of the three components are merged based on a parameter matrix to obtain the final prediction result. The overall network structure is elaborately designed to describe the dynamic spatial-temporal correlations of traffic flows.

Spatial-Temporal Attention

A novel spatial-temporal attention mechanism is proposed in our model to capture the dynamic spatial and temporal correlations on the traffic network (as described in Fig. 1). It contains two kinds of attention, i.e., spatial attention and temporal attention.

Spatial attention In the spatial dimension, the traffic conditions of different locations influence each other, and this mutual influence is highly dynamic. Here, we use an attention mechanism (Feng et al. 2017) to adaptively capture the dynamic correlations between nodes in the spatial dimension.

Take the spatial attention in the recent component as an example:

S = V_s · σ((X_h^{(r−1)} W_1) W_2 (W_3 X_h^{(r−1)})^T + b_s)  (1)

S′_{i,j} = exp(S_{i,j}) / Σ_{j=1}^{N} exp(S_{i,j})  (2)

where X_h^{(r−1)} = (X_1, X_2, ..., X_{T_{r−1}}) ∈ R^{N×C_{r−1}×T_{r−1}} is the input of the r-th spatial-temporal block. C_{r−1} is the number of channels of the input data in the r-th layer; when r = 1, C_0 = F. T_{r−1} is the length of the temporal dimension in the r-th layer; when r = 1, T_0 = T_h in the recent component (T_0 = T_d in the daily-period component and T_0 = T_w in the weekly-period component). V_s, b_s ∈ R^{N×N}, W_1 ∈ R^{T_{r−1}}, W_2 ∈ R^{C_{r−1}×T_{r−1}} and W_3 ∈ R^{C_{r−1}} are learnable parameters, and the sigmoid σ is used as the activation function. The attention matrix S is dynamically computed according to the current input of this layer. The value of an element S_{i,j} in S semantically represents the correlation strength between node i and node j. Then a softmax function is used to ensure the attention weights of each node sum to one. When performing the graph convolutions, we combine the adjacency matrix A with the spatial attention matrix S′ ∈ R^{N×N} to dynamically adjust the impacting weights between nodes.
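A minimal NumPy sketch of Eqs. (1)-(2) may help fix the shapes (the parameter values below are random placeholders, not trained weights; this is an illustration under the dimension definitions above, not the authors' implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention(X, W1, W2, W3, Vs, bs):
    """Compute the normalized spatial attention matrix S' of Eqs. (1)-(2).

    X : (N, C, T) input of one spatial-temporal block
    W1: (T,)   W2: (C, T)   W3: (C,)   Vs, bs: (N, N)
    """
    lhs = np.tensordot(X, W1, axes=([2], [0])) @ W2   # (N, C) @ (C, T) -> (N, T)
    rhs = np.tensordot(W3, X, axes=([0], [1]))        # contract C -> (N, T)
    S = Vs @ sigmoid(lhs @ rhs.T + bs)                # (N, T) @ (T, N) -> (N, N)
    # Row-wise softmax: each node's attention weights sum to one.
    S = S - S.max(axis=1, keepdims=True)              # numerical stability
    expS = np.exp(S)
    return expS / expS.sum(axis=1, keepdims=True)
```

The resulting S′ is an N × N matrix of positive weights whose rows sum to one.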

Temporal attention In the temporal dimension, there exist correlations between the traffic conditions in different time slices, and these correlations also vary under different situations. Likewise, we use an attention mechanism to adaptively attach different importance to the data:

E = V_e · σ(((X_h^{(r−1)})^T U_1) U_2 (U_3 X_h^{(r−1)}) + b_e)  (3)

E′_{i,j} = exp(E_{i,j}) / Σ_{j=1}^{T_{r−1}} exp(E_{i,j})  (4)

where V_e, b_e ∈ R^{T_{r−1}×T_{r−1}}, U_1 ∈ R^N, U_2 ∈ R^{C_{r−1}×N} and U_3 ∈ R^{C_{r−1}} are learnable parameters. The temporal correlation matrix E is determined by the varying inputs. The value of an element E_{i,j} in E semantically indicates the strength of the dependency between times i and j. Finally, E is normalized by the softmax function. We directly apply the normalized temporal attention matrix to the input and get X̂_h^{(r−1)} = (X̂_1, X̂_2, ..., X̂_{T_{r−1}}) = (X_1, X_2, ..., X_{T_{r−1}}) E′ ∈ R^{N×C_{r−1}×T_{r−1}}, dynamically adjusting the input by merging relevant information.
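Eqs. (3)-(4) and the application of E′ to the input can be sketched analogously (again with random placeholder parameters; an illustration under the stated shapes, not the authors' code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def temporal_attention(X, U1, U2, U3, Ve, be):
    """Compute E' of Eqs. (3)-(4) and the attention-adjusted input X_hat.

    X : (N, C, T)   U1: (N,)   U2: (C, N)   U3: (C,)   Ve, be: (T, T)
    """
    lhs = np.tensordot(X, U1, axes=([0], [0])).T @ U2   # (T, C) @ (C, N) -> (T, N)
    rhs = np.tensordot(U3, X, axes=([0], [1]))          # contract C -> (N, T)
    E = Ve @ sigmoid(lhs @ rhs + be)                    # (T, N) @ (N, T) -> (T, T)
    E = E - E.max(axis=1, keepdims=True)                # numerical stability
    expE = np.exp(E)
    Eprime = expE / expE.sum(axis=1, keepdims=True)
    # X_hat[:, :, t] = sum_s X[:, :, s] * E'[s, t]  (merge relevant time slices)
    X_hat = np.tensordot(X, Eprime, axes=([2], [0]))    # (N, C, T)
    return Eprime, X_hat
```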

Spatial-Temporal Convolution

The spatial-temporal attention module lets the network automatically pay relatively more attention to valuable information. The input adjusted by the attention mechanism is fed into the spatial-temporal convolution module, whose structure is presented in Fig. 5. The spatial-temporal convolution module proposed here consists of a graph convolution in the spatial dimension, capturing spatial dependencies from the neighborhood, and a convolution along the temporal dimension, exploiting temporal dependencies from nearby times.

Figure 5: The architecture of the spatial-temporal convolutions of ASTGCN (a graph convolution in the spatial dimension followed by a convolution in the temporal dimension).

Graph convolution in spatial dimension Spectral graph theory generalizes the convolution operation from grid-based data to graph-structured data. In this study, the traffic network is a graph structure by nature, and the features of


each node can be regarded as signals on the graph (Shuman et al. 2013). Hence, in order to make full use of the topological properties of the traffic network, at each time slice we adopt graph convolutions based on spectral graph theory to directly process the signals, exploiting signal correlations on the traffic network in the spatial dimension. The spectral method transforms a graph into an algebraic form to analyze the topological attributes of the graph, such as the connectivity of the graph structure.

In spectral graph analysis, a graph is represented by its corresponding Laplacian matrix. The properties of the graph structure can be obtained by analyzing the Laplacian matrix and its eigenvalues. The Laplacian matrix of a graph is defined as L = D − A, and its normalized form is L = I_N − D^{−1/2} A D^{−1/2} ∈ R^{N×N}, where A is the adjacency matrix, I_N is an identity matrix, and the degree matrix D ∈ R^{N×N} is a diagonal matrix of node degrees, D_{ii} = Σ_j A_{ij}.
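The normalized Laplacian can be computed in a few lines of NumPy (a sketch; the defensive handling of isolated nodes is an assumption beyond the text):

```python
import numpy as np

def normalized_laplacian(A):
    """L = I_N - D^{-1/2} A D^{-1/2} for an adjacency matrix A (N x N)."""
    d = A.sum(axis=1)                         # node degrees D_ii = sum_j A_ij
    d_inv_sqrt = np.where(d > 0, d, 1.0) ** -0.5
    d_inv_sqrt[d == 0] = 0.0                  # isolated nodes contribute nothing
    return np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
```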

The eigenvalue decomposition of the Laplacian matrix is L = U Λ U^T, where Λ = diag([λ_0, ..., λ_{N−1}]) ∈ R^{N×N} is a diagonal matrix and U is the Fourier basis. Taking the traffic flow at time t as an example, the signal over the whole graph is x = x_t^f ∈ R^N, and the graph Fourier transform of the signal is defined as x̂ = U^T x. According to the properties of the Laplacian matrix, U is an orthogonal matrix, so the corresponding inverse Fourier transform is x = U x̂. The graph convolution is a convolution operation implemented by using linear operators that diagonalize in the Fourier domain to replace the classical convolution operator (Henaff, Bruna, and LeCun 2015). Based on this, the signal x on the graph G is filtered by a kernel g_θ:

g_θ ∗_G x = g_θ(L) x = g_θ(U Λ U^T) x = U g_θ(Λ) U^T x  (5)

where ∗_G denotes a graph convolution operation. Since the convolution of a graph signal equals the product of the signals transformed into the spectral domain by the graph Fourier transform (Simonovsky and Komodakis 2017), the above formula can be understood as Fourier transforming g_θ and x respectively into the spectral domain, multiplying their transformed results, and performing the inverse Fourier transform to get the final result of the convolution. However, it is expensive to directly perform the eigenvalue decomposition on the Laplacian matrix when the scale of the graph is large. Therefore, Chebyshev polynomials are adopted in this paper to solve this problem approximately but efficiently (Simonovsky and Komodakis 2017):

g_θ ∗_G x = g_θ(L) x = Σ_{k=0}^{K−1} θ_k T_k(L̃) x  (6)

where the parameter θ ∈ R^K is a vector of polynomial coefficients, L̃ = (2/λ_max) L − I_N, and λ_max is the maximum eigenvalue of the Laplacian matrix. The recursive definition of the Chebyshev polynomial is T_k(x) = 2x T_{k−1}(x) − T_{k−2}(x), with T_0(x) = 1 and T_1(x) = x. Using the approximate expansion of the Chebyshev polynomial to solve this formulation corresponds to extracting information from the 0th- to (K−1)th-order neighbors centered on each node in the graph via the convolution kernel g_θ. The graph convolution module uses the Rectified Linear Unit (ReLU) as the final activation function, i.e., ReLU(g_θ ∗_G x).
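A single-channel sketch of Eq. (6) is given below (λ_max is computed by eigendecomposition here only for illustration on a small graph; the point of the approximation is precisely to avoid that on large graphs, where λ_max can be bounded or approximated cheaply):

```python
import numpy as np

def cheb_graph_conv(x, L, theta, lam_max=None):
    """K-order Chebyshev approximation of the graph convolution, Eq. (6).

    x    : (N,) graph signal
    L    : (N, N) normalized graph Laplacian
    theta: (K,) polynomial coefficients
    """
    N = L.shape[0]
    if lam_max is None:
        lam_max = np.linalg.eigvalsh(L).max()
    L_tilde = (2.0 / lam_max) * L - np.eye(N)   # rescale eigenvalues into [-1, 1]
    T_prev, T_curr = np.eye(N), L_tilde          # T_0 = I, T_1 = L_tilde
    out = theta[0] * (T_prev @ x)
    if len(theta) > 1:
        out += theta[1] * (T_curr @ x)
    for k in range(2, len(theta)):
        # Chebyshev recurrence: T_k = 2 * L_tilde * T_{k-1} - T_{k-2}
        T_prev, T_curr = T_curr, 2.0 * L_tilde @ T_curr - T_prev
        out += theta[k] * (T_curr @ x)
    return out
```

With K = 1 the operation reduces to a scaled identity; each additional order mixes in information from one further hop of neighbors.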

In order to dynamically adjust the correlations between nodes, for each term of the Chebyshev polynomial we combine T_k(L̃) with the spatial attention matrix S′ ∈ R^{N×N}, obtaining T_k(L̃) ⊙ S′, where ⊙ is the Hadamard product. Therefore, the above graph convolution formula changes to g_θ ∗_G x = g_θ(L) x = Σ_{k=0}^{K−1} θ_k (T_k(L̃) ⊙ S′) x.

We can generalize this definition to graph signals with multiple channels. For example, in the recent component the input is X̂_h^{(r−1)} = (X̂_1, X̂_2, ..., X̂_{T_{r−1}}) ∈ R^{N×C_{r−1}×T_{r−1}}, where the feature of each node has C_{r−1} channels. For each time slice t, performing C_r filters on the graph X̂_t, we get g_Θ ∗_G X̂_t, where Θ = (Θ_1, Θ_2, ..., Θ_{C_r}) ∈ R^{K×C_{r−1}×C_r} is the convolution kernel parameter (Kipf and Welling 2017). Therefore, each node is updated by the information of its 0th- to (K−1)th-order neighbors.

Convolution in temporal dimension After the graph convolution operations have captured neighboring information for each node in the spatial dimension, a standard convolution layer in the temporal dimension is further stacked to update the signal of a node by merging the information at the neighboring time slices, as shown by the right part of Fig. 5. Again taking the operation on the r-th layer in the recent component as an example:

X_h^{(r)} = ReLU(Φ ∗ (ReLU(g_θ ∗_G X̂_h^{(r−1)}))) ∈ R^{C_r×N×T_r}  (7)

where ∗ denotes a standard convolution operation, Φ is the parameter of the temporal-dimension convolution kernel, and the activation function is ReLU.

In conclusion, a spatial-temporal convolution module is able to capture the temporal and spatial features of traffic data well. A spatial-temporal attention module and a spatial-temporal convolution module form a spatial-temporal block. Multiple spatial-temporal blocks are stacked to further extract a larger range of dynamic spatial-temporal correlations. Finally, a fully-connected layer is appended to make sure the output of each component has the same dimension and shape as the forecasting target. The final fully-connected layer uses ReLU as the activation function.

Multi-Component Fusion

In this section, we discuss how to integrate the outputs of the three components. Take forecasting the traffic flow on the whole traffic network at 8:00 am on a Friday as an example. The traffic flows in some areas have obvious peak periods in the morning, so the outputs of the daily-period and weekly-period components are more crucial there. However, some other places have no distinct traffic cycle patterns, so the daily-period and weekly-period components may be unhelpful there. Consequently, when the outputs of the different components are fused, the impacting weights of the three components for each node are different, and they should be learned from the historical data. So the


final prediction result after the fusion is:

Y = W_h ⊙ Y_h + W_d ⊙ Y_d + W_w ⊙ Y_w    (8)

where ⊙ is the Hadamard product. W_h, W_d and W_w are learnable parameters, reflecting the degrees of influence of the three temporal-dimension components on the forecasting target.
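Since Eq. (8) is just a per-node weighted sum, it can be sketched in a few lines; the weight values in the test below are made up for illustration, whereas in the model they are learned.

```python
def fuse(Yh, Yd, Yw, Wh, Wd, Ww):
    """Y = Wh (.) Yh + Wd (.) Yd + Ww (.) Yw, element-wise per node (Eq. 8).

    Yh, Yd, Yw: component outputs flattened per node; Wh, Wd, Ww: per-node
    fusion weights of the same length.
    """
    return [Wh[i] * Yh[i] + Wd[i] * Yd[i] + Ww[i] * Yw[i] for i in range(len(Yh))]
```

A node with strong daily/weekly periodicity would learn larger W_d and W_w entries, while a node without cyclic patterns would lean on W_h.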

Experiments

In order to evaluate the performance of our model, we carried out comparative experiments on two real-world highway traffic datasets.

Datasets

We validate our model on two highway traffic datasets, PeMSD4 and PeMSD8, from California. The datasets are collected by the Caltrans Performance Measurement System (PeMS) (Chen et al. 2001) in real time every 30 seconds. The traffic data are aggregated into 5-minute intervals from the raw data. The system has more than 39,000 detectors deployed on the highways in the major metropolitan areas of California. Geographic information about the sensor stations is recorded in the datasets. Three kinds of traffic measurements are considered in our experiments: total flow, average speed, and average occupancy.

PeMSD4 It refers to the traffic data in the San Francisco Bay Area, containing 3848 detectors on 29 roads. The time span of this dataset is from January to February in 2018. We choose the data on the first 50 days as the training set, and the remainder as the test set.

PeMSD8 It is the traffic data in San Bernardino from July to August in 2016, which contains 1979 detectors on 8 roads. The data on the first 50 days are used as the training set and the data on the last 12 days as the test set.

Preprocessing

We remove some redundant detectors to ensure that the distance between any adjacent detectors is longer than 3.5 miles. After this, there are 307 detectors in PeMSD4 and 170 detectors in PeMSD8. The traffic data are aggregated every 5 minutes, so each detector contains 288 data points per day. Missing values are filled by linear interpolation. In addition, the data are transformed by zero-mean normalization x′ = x − mean(x) so that the average is 0.
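The two preprocessing steps described above can be sketched as follows; this is a minimal illustration (gaps marked with None, series assumed to start and end with observed values), not the authors' pipeline.

```python
def fill_linear(series):
    """Fill None gaps by linear interpolation between the nearest observed neighbors."""
    s = series[:]
    n = len(s)
    for i in range(n):
        if s[i] is None:
            lo = i - 1              # last observed value before the gap
            hi = i
            while hi < n and s[hi] is None:
                hi += 1             # first observed value after the gap
            step = (s[hi] - s[lo]) / (hi - lo)
            for j in range(lo + 1, hi):
                s[j] = s[lo] + step * (j - lo)
    return s

def zero_mean(series):
    """Zero-mean normalization x' = x - mean(x), as used in the paper."""
    m = sum(series) / len(series)
    return [x - m for x in series]
```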

Settings

We implemented the ASTGCN model based on the MXNet1 framework. Following Kipf and Welling (2017), we test the number of terms of the Chebyshev polynomial K ∈ {1, 2, 3}. As K becomes larger, the forecasting performance improves slightly; the same holds for the kernel size in the temporal dimension. Considering the computing efficiency and the degree of improvement of the forecasting performance, we set K = 3 and the kernel size along the temporal dimension to 3. In our model, all the graph convolution layers use 64

1https://mxnet.apache.org/

convolution kernels. All the temporal convolution layers use 64 convolution kernels, and the time span of the data is adjusted by controlling the stride of the temporal convolutions. For the lengths of the three segments, we set Th = 24, Td = 12, and Tw = 24. The size of the predicting window is Tp = 12; that is, we aim at predicting the traffic flow over the next hour. In this paper, the mean square error (MSE) between the estimator and the ground truth is used as the loss function and minimized by backpropagation. During the training phase, the batch size is 64 and the learning rate is 0.0001. In addition, in order to verify the impact of the spatial-temporal attention mechanism proposed here, we also design a degraded version of ASTGCN, named Multi-Component Spatial-Temporal Graph Convolution Networks (MSTGCN), which removes the spatial-temporal attention. The settings of MSTGCN are the same as those of ASTGCN, except that there is no spatial-temporal attention.
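Given 288 five-minute slices per day, the three input segments (Th = 24 recent, Td = 12 daily-periodic, Tw = 24 weekly-periodic) and the Tp = 12 target can be sliced from a detector's history roughly as below. The exact index conventions (e.g. how Tw is split across previous weeks) are illustrative assumptions, not the authors' precise definition.

```python
POINTS_PER_DAY = 288  # 5-minute aggregation, as stated above

def make_segments(history, t0, Th=24, Td=12, Tw=24, Tp=12):
    """history: one detector's full series; t0: index of the first slice to predict."""
    recent = history[t0 - Th:t0]
    # daily segment: the Td slices covering the same period on the previous day
    daily = history[t0 - POINTS_PER_DAY:t0 - POINTS_PER_DAY + Td]
    # weekly segment: split Tw across the same period in the two previous weeks
    week = 7 * POINTS_PER_DAY
    weekly = (history[t0 - 2 * week:t0 - 2 * week + Tw // 2]
              + history[t0 - week:t0 - week + Tw // 2])
    target = history[t0:t0 + Tp]
    return recent, daily, weekly, target
```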

Baselines

We compare our model with the following eight baselines:

• HA: Historical Average method. Here, we use the average value of the last 12 time slices to predict the next value.

• ARIMA (Williams and Hoel 2003): Auto-Regressive Integrated Moving Average, a well-known time series analysis method for predicting future values.

• VAR (Zivot and Wang 2006): Vector Auto-Regressive model, a more advanced time series model, which can capture the pairwise relationships among all traffic flow series.

• LSTM (Hochreiter and Schmidhuber 1997): Long Short-Term Memory network, a special RNN model.

• GRU (Chung et al. 2014): Gated Recurrent Unit network, a special RNN model.

• STGCN (Li et al. 2018): A spatial-temporal graph convolution model based on the spatial method.

• GLU-STGCN (Yu, Yin, and Zhu 2018): A graph convolution network with a gating mechanism, which is specially designed for traffic forecasting.

• GeoMAN (Liang et al. 2018): A multi-level attention-based recurrent neural network model proposed for the geo-sensory time series prediction problem.

Root mean square error (RMSE) and mean absolute error (MAE) are used as the evaluation metrics.
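For clarity, the two evaluation metrics written out as a minimal sketch:

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error: sqrt(mean((y - y_hat)^2))."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: mean(|y - y_hat|)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```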

Comparison and Result Analysis

We compare our models with the eight baseline methods on PeMSD4 and PeMSD8. Table 1 shows the average traffic flow prediction performance over the next hour.

It can be seen from Table 1 that our ASTGCN achieves the best performance on both datasets in terms of all evaluation metrics. We can observe that the prediction results of the traditional time series analysis methods are usually not ideal, demonstrating those methods' limited ability to model nonlinear and complex traffic data. By comparison, the methods based on deep learning generally obtain


Model             PeMSD4            PeMSD8
                  RMSE     MAE      RMSE     MAE
HA                54.14    36.76    44.03    29.52
ARIMA             68.13    32.11    43.30    24.04
VAR               51.73    33.76    31.21    21.41
LSTM              45.82    29.45    36.96    23.18
GRU               45.11    28.65    35.95    22.20
STGCN             38.29    25.15    27.87    18.88
GLU-STGCN         38.41    27.28    30.78    20.99
GeoMAN            37.84    23.64    28.91    17.84
MSTGCN (ours)     35.64    22.73    26.47    17.47
ASTGCN (ours)     32.82    21.80    25.27    16.63

Table 1: Average performance comparison of different approaches on PeMSD4 and PeMSD8.

better prediction results than the traditional time series analysis methods. Among them, the models which simultaneously take both the temporal and spatial correlations into account, including STGCN, GLU-STGCN, GeoMAN and the two versions of our model, are superior to the traditional deep learning models such as LSTM and GRU. Besides, GeoMAN performs better than STGCN and GLU-STGCN, indicating that the multi-level attention mechanisms applied in GeoMAN are effective in capturing the dynamic changes of traffic data. Our MSTGCN, without any attention mechanisms, achieves better results than the previous state-of-the-art models, proving the advantages of our model in describing the spatial-temporal features of highway traffic data. Combined with the spatial-temporal attention mechanisms, our ASTGCN further reduces the forecasting errors.

Figure 6: Performance changes of different methods as the forecasting interval increases. (a) The prediction results of different methods on PeMSD4; (b) the prediction results of different methods on PeMSD8.

Fig. 6 shows the changes in prediction performance of the various methods as the prediction interval increases. Overall, as the prediction interval becomes longer, the corresponding difficulty of prediction gets greater, hence the prediction errors also increase. As can be seen from the figure, the methods taking only the temporal correlation into account, such as HA, ARIMA, LSTM and GRU, can achieve good results in short-term prediction. However, as the prediction interval increases, their prediction accuracy drops dramatically. By comparison, the performance of VAR drops more slowly than that of those methods. This is mainly because VAR can simultaneously consider the spatial-temporal correlations, which are more important in long-term prediction. However, when the scale of the traffic network becomes larger, i.e., there are more time series considered in the model, the prediction error of VAR increases: as shown in Fig. 6, its performance on PeMSD4 is worse than that on PeMSD8. The errors of the deep learning methods increase slowly as the prediction interval increases, and their overall performance is good. Our ASTGCN model achieves the best prediction performance almost all the time. Especially in long-term prediction, the differences between ASTGCN and the other baselines are more significant, showing that the strategy of combining the attention mechanism with graph convolution can better mine the dynamic spatial-temporal patterns of traffic data.

Figure 7: The attention matrix obtained from the spatial attention mechanism.

In order to intuitively investigate the role of the attention mechanisms in our model, we perform a case study: picking a sub-graph with 10 detectors from PeMSD8 and showing the average spatial attention matrix among the detectors over the training set. As shown on the right side of Fig. 7, in the spatial attention matrix, the i-th row represents the correlation strength between each detector and the i-th detector. For instance, looking at the last row, we can see that the traffic flows at the 9th detector are closely related to those at the 3rd and 8th detectors. This is reasonable since these three detectors are close in space on the real traffic network, as shown on the left side of Fig. 7. Hence, our model not only achieves the best forecasting performance but also shows an advantage in interpretability.
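The reading of Fig. 7 amounts to ranking each row of the averaged attention matrix; a minimal sketch (the matrix in the test is a made-up toy, not the paper's data):

```python
def top_related(S, i, k=2):
    """Indices of the k detectors most strongly attended by detector i (excluding i itself).

    S: averaged spatial attention matrix; row i holds the correlation
    strengths between detector i and every detector.
    """
    row = [(s, j) for j, s in enumerate(S[i]) if j != i]
    row.sort(reverse=True)          # strongest attention weights first
    return [j for _, j in row[:k]]
```

Applied to the last row of the paper's matrix, this kind of ranking is what surfaces the 3rd and 8th detectors as most related to the 9th.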

Conclusion and Future Work

In this paper, a novel attention based spatial-temporal graph convolution model called ASTGCN is proposed and successfully applied to traffic flow forecasting. The model


combines the spatial-temporal attention mechanism and the spatial-temporal convolution, including graph convolutions in the spatial dimension and standard convolutions in the temporal dimension, to simultaneously capture the dynamic spatial-temporal characteristics of traffic data. Experiments on two real-world datasets show that the forecasting accuracy of the proposed model is superior to that of existing models. The code has been released at: https://github.com/wanhuaiyu/ASTGCN.

In fact, highway traffic flow is affected by many external factors, such as weather and social events. In the future, we will take some external influencing factors into account to further improve the forecasting accuracy. Since ASTGCN is a general spatial-temporal forecasting framework for graph-structured data, we can also apply it to other practical applications, such as estimating time of arrival.

Acknowledgments

This work was supported by the Natural Science Foundation of China (No. 61603028).

References

Bruna, J.; Zaremba, W.; Szlam, A.; and Lecun, Y. 2014. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations.

Chen, C.; Petty, K.; Skabardonis, A.; Varaiya, P.; and Jia, Z. 2001. Freeway performance measurement system: mining loop detector data. Transportation Research Record: Journal of the Transportation Research Board (1748):96–102.

Chung, J.; Gulcehre, C.; Cho, K.; and Bengio, Y. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning.

Defferrard, M.; Bresson, X.; and Vandergheynst, P. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, 3844–3852.

Feng, X.; Guo, J.; Qin, B.; Liu, T.; and Liu, Y. 2017. Effective deep memory networks for distant supervised relation extraction. In International Joint Conference on Artificial Intelligence, 19–25.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

Henaff, M.; Bruna, J.; and LeCun, Y. 2015. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163.

Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

Jeong, Y.-S.; Byon, Y.-J.; Castro-Neto, M. M.; and Easa, S. M. 2013. Supervised weighting-online learning algorithm for short-term traffic flow prediction. IEEE Transactions on Intelligent Transportation Systems 14(4):1700–1707.

Kipf, T. N., and Welling, M. 2017. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations.

Li, C.; Cui, Z.; Zheng, W.; Xu, C.; and Yang, J. 2018. Spatio-temporal graph convolution for skeleton based action recognition. In AAAI Conference on Artificial Intelligence, 3482–3489.

Liang, Y.; Ke, S.; Zhang, J.; Yi, X.; and Zheng, Y. 2018. GeoMAN: Multi-level attention networks for geo-sensory time series prediction. In International Joint Conference on Artificial Intelligence, 3428–3434.

Niepert, M.; Ahmed, M.; and Kutzkov, K. 2016. Learning convolutional neural networks for graphs. In International Conference on Machine Learning, 2014–2023.

Shuman, D. I.; Narang, S. K.; Frossard, P.; Ortega, A.; and Vandergheynst, P. 2013. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine 30(3):83–98.

Simonovsky, M., and Komodakis, N. 2017. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Computer Vision and Pattern Recognition, 3693–3702.

Van Lint, J., and Van Hinsbergen, C. 2012. Short-term traffic and travel time prediction models. Artificial Intelligence Applications to Critical Transportation Issues 22(1):22–41.

Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; and Bengio, Y. 2018. Graph attention networks. In International Conference on Learning Representations.

Williams, B. M., and Hoel, L. A. 2003. Modeling and forecasting vehicular traffic flow as a seasonal ARIMA process: Theoretical basis and empirical results. Journal of Transportation Engineering 129(6):664–672.

Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, 2048–2057.

Yao, H.; Tang, X.; Wei, H.; Zheng, G.; Yu, Y.; and Li, Z. 2018a. Modeling spatial-temporal dynamics for traffic prediction. arXiv preprint arXiv:1803.01254.

Yao, H.; Wu, F.; Ke, J.; Tang, X.; Jia, Y.; Lu, S.; Gong, P.; and Ye, J. 2018b. Deep multi-view spatial-temporal network for taxi demand prediction. In AAAI Conference on Artificial Intelligence, 2588–2595.

Yu, B.; Yin, H.; and Zhu, Z. 2018. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In International Joint Conference on Artificial Intelligence, 3634–3640.

Zhang, J.; Wang, F.-Y.; Wang, K.; Lin, W.-H.; Xu, X.; and Chen, C. 2011. Data-driven intelligent transportation systems: A survey. IEEE Transactions on Intelligent Transportation Systems 12(4):1624–1639.

Zhang, J.; Zheng, Y.; Qi, D.; Li, R.; Yi, X.; and Li, T. 2018. Predicting citywide crowd flows using deep spatio-temporal residual networks. Artificial Intelligence 259:147–166.

Zivot, E., and Wang, J. 2006. Vector autoregressive models for multivariate time series. Modeling Financial Time Series with S-PLUS, 385–429.