
Machine Learning for Spatiotemporal Sequence Forecasting: A Survey

Xingjian Shi, Dit-Yan Yeung, Senior Member, IEEE

Abstract—Spatiotemporal systems are common in the real world. Forecasting the multi-step future of these spatiotemporal systems based on past observations, or Spatiotemporal Sequence Forecasting (STSF), is a significant and challenging problem. Although many real-world problems can be viewed as STSF and many research works have proposed machine learning based methods for them, no existing work has summarized and compared these methods from a unified perspective. This survey aims to provide a systematic review of machine learning for STSF. In this survey, we define the STSF problem and classify it into three subcategories: Trajectory Forecasting of Moving Point Cloud (TF-MPC), STSF on Regular Grid (STSF-RG), and STSF on Irregular Grid (STSF-IG). We then introduce the two major challenges of STSF: 1) how to learn a model for multi-step forecasting and 2) how to adequately model the spatial and temporal structures. After that, we review the existing works for solving these challenges, including the general learning strategies for multi-step forecasting, the classical machine learning based methods for STSF, and the deep learning based methods for STSF. We also compare these methods and point out some potential research directions.

Index Terms—Spatiotemporal Sequence Forecasting, Machine Learning, Deep Learning, Data Mining.


1 INTRODUCTION

Many real-world phenomena are spatiotemporal, such as traffic flow, the diffusion of air pollutants, and regional rainfall. Correctly predicting the future of these spatiotemporal systems based on past observations is essential for a wide range of scientific studies and real-life applications like traffic management, precipitation nowcasting, and typhoon alert systems.

If there exists an accurate numerical model of the dynamical system, which is usually the case when we have fully understood its rules, forecasting can be achieved by first finding the initial condition of the model and then simulating it to get the future predictions [1]. However, for some complex spatiotemporal dynamical systems like the atmosphere, crowds, and natural videos, we are still not entirely clear about their inner mechanisms. In these situations, machine learning based methods, by which we can try to learn the systems' inner "laws" from historical data, have proven helpful for making accurate predictions [2], [3], [4], [5], [6]. Moreover, recent studies [7], [8] have shown that the "laws" inferred by machine learning algorithms can further be used to guide other tasks like classification and control. For example, controlling a robotic arm will be much easier once we can anticipate how it will move after performing specific actions [8].

• Xingjian Shi is with the Department of Computer Science and Engineering, the Hong Kong University of Science and Technology. E-mail: [email protected]

• Dit-Yan Yeung is with the Department of Computer Science and Engineering, the Hong Kong University of Science and Technology. E-mail: [email protected]

In this paper, we review these machine learning based methods for spatiotemporal sequence forecasting. We define a length-T spatiotemporal sequence as a series of matrices X_{1:T} = [X_1, X_2, ..., X_T]. Each X_t ∈ R^{K×(D+E)} contains a set of coordinates and their corresponding measurements. Here, K is the number of coordinates, D is the number of measurements, and E is the dimension of the coordinates. We further denote the measurement part and coordinate part of X_t as M_t ∈ R^{K×D} and C_t ∈ R^{K×E}. For many forecasting problems, we can use auxiliary information to enhance the prediction accuracy. For example, in wind speed forecasting, the latitudes and past temperatures are also helpful for predicting the future wind speeds. We denote the auxiliary information available at timestamp t as A_t. The Spatiotemporal Sequence Forecasting (STSF) problem is to predict the length-L (L > 1) sequence in the future given the previous observations plus the auxiliary information, which is allowed to be empty. The mathematical form is given in (1):

\hat{X}_{t+1:t+L} = \arg\max_{X_{t+1:t+L}} p(X_{t+1:t+L} \mid X_{1:t}, A_t).   (1)


TABLE 1
Types of spatiotemporal sequence forecasting problems.

Problem Name                                   | Coordinates           | Measurements
Trajectory Forecasting of Moving Point Cloud   | Changing              | Fixed/Changing
Spatiotemporal Forecasting on Regular Grid     | Fixed regular grid    | Changing
Spatiotemporal Forecasting on Irregular Grid   | Fixed irregular grid  | Changing

Here, we limit STSF to only contain problems where the input and output are both sequences with multiple elements. According to this definition, problems like video generation from a static image [9] and dense optical flow prediction [10], in which the input or the output contains a single element, will not be counted.

Based on the characteristics of the coordinate sequence C_{1:T} and the measurement sequence M_{1:T}, we classify STSF into three categories, as shown in Table 1. If the coordinates are changing, the STSF problem is called Trajectory Forecasting of Moving Point Cloud (TF-MPC). Problems like human trajectory prediction [4], [11] and human dynamics prediction [12], [13], [14] fall into this category. We need to emphasize here that the entities in a moving point cloud are not limited to humans; they can be arbitrary moving objects. If the coordinates are fixed, we only need to predict the future measurements. Based on whether the coordinates lie in a regular grid, we get the other two categories, called STSF on Regular Grid (STSF-RG) and STSF on Irregular Grid (STSF-IG). Problems like precipitation nowcasting [5], crowd density prediction [15], and video prediction [2], in which we have dense observations, are STSF-RG problems. Problems like air quality forecasting [3], influenza prediction [16], and traffic speed forecasting [17], in which the monitoring stations (or sensors) spread sparsely across the city, are STSF-IG problems. We need to emphasize that although the measurements in TF-MPC can also be time-variant due to factors like appearance change, most models for TF-MPC deal with the fixed-measurement case and only predict the coordinates. We thus focus on the fixed-measurement case in this survey.

Having both the characteristics of spatial data like images and temporal data like sentences and audio, spatiotemporal sequences contain information about "what", "when", and "where" and provide a comprehensive view of the underlying dynamical system. However, due to the complex spatial and temporal relationships within the data and the potentially long forecasting horizon, spatiotemporal sequence forecasting imposes new challenges to the machine learning community. The first challenge is how to learn a model for multi-step forecasting. Early research in time-series forecasting [18], [19] has investigated two approaches: Direct Multi-step (DMS) estimation and Iterated Multi-step (IMS) estimation. The DMS approach directly optimizes the multi-step forecasting objective, while the IMS approach learns a one-step-ahead forecaster and iteratively applies it to generate multi-step predictions. However, choosing between DMS and IMS involves a trade-off among the forecasting bias, the estimation variance, the length of the prediction horizon, and the model's nonlinearity [20]. Recent studies have tried to find the middle ground between these two approaches [20], [21]. The second challenge is how to adequately model the spatial and temporal structures within the data. In fact, the number of free parameters of a length-T spatiotemporal sequence can be up to O(K^T D^T), where K is usually above thousands for STSF-RG problems like video prediction. Forecasting in such a high-dimensional space is impossible if the model cannot well capture the data's underlying structure. Therefore, machine learning based methods for STSF, including both classical methods and deep learning based methods, have special designs in their model architectures for capturing these spatiotemporal correlations.

In this paper, we organize the literature about STSF following these two challenges. Section 2 covers the general learning strategies for multi-step forecasting, including IMS, DMS, the boosting strategy, and scheduled sampling. Section 3 reviews the classical methods for STSF, including the feature-based methods, state-space models, and Gaussian process models. Section 4 reviews the deep learning based methods for STSF, including deep temporal generative models and feedforward and recurrent neural network based methods. Section 5 summarizes the survey and discusses some future research directions.

2 LEARNING STRATEGIES FOR MULTI-STEP FORECASTING

Compared with single-step forecasting, learning a model for multi-step forecasting is more challenging because the forecasting output X_{t+1:t+L} is a sequence with non-i.i.d. elements. In this section, we first introduce and compare two basic strategies called IMS and DMS [19] and then review two extensions that bridge the gap between these two strategies. To simplify the notation, we keep the following notation rules in this and the following sections. We denote the information available at timestamp t as F_t = {X_{1:t}, A_t}. Learning an STSF model is thus to train a model, which can be either probabilistic or deterministic, to make the L-step-ahead prediction X̂_{t+1:t+L} based on F_t. We also denote the ground-truth at timestamp t as X_t and recursively applying the basic function f(x) for h times as f^{(h)}(x). Setting the kth subscript to ":" means selecting all elements at the kth dimension. Setting the kth subscript to a letter i means selecting the ith element at the kth dimension. For example, for a matrix X, X_{i,:} means the ith row, X_{:,j} means the jth column, and X_{i,j} means the (i, j)th element. The flattened version of matrices A, B, and C is denoted as vec(A, B, C).

2.1 Iterative Multi-step Estimation

The IMS strategy trains a one-step-ahead forecasting model and iteratively feeds the generated samples to the one-step-ahead forecaster to get the multi-step-ahead prediction. There are two types of IMS models: the deterministic one-step-ahead forecasting model X̂_{t+1} = m(F_t; φ) and the probabilistic one-step-ahead forecasting model p(X_{t+1} | F_t; φ). For the deterministic forecasting model, the forecasting process and the training objective of the deterministic forecaster [20] are shown in (2) and (3), respectively, where d(x, x̂) measures the distance between x and x̂.

\hat{X}_{t+h} = m^{(h)}(F_t; \phi), \quad \forall 1 \le h \le L.   (2)

\phi^\star = \arg\min_{\phi} \mathbb{E}\left[ d(X_{t+1}, m(F_t; \phi)) \right].   (3)

For the probabilistic forecasting model, we use the chain rule to get the probability of the length-L prediction, which is given in (4), and estimate the parameters by Maximum Likelihood Estimation (MLE):

p(X_{t+1:t+L} \mid F_t; \phi) = p(X_{t+1} \mid F_t; \phi) \prod_{i=2}^{L} p(X_{t+i} \mid F_t, X_{t+1:t+i-1}; \phi).   (4)

When the one-step-ahead forecaster is probabilistic and nonlinear, the MLE of (4) is generally intractable. In this case, we can draw samples from p(X_{t+1:t+L} | F_t; φ) by techniques like Sequential Monte Carlo (SMC) [22] and predict based on these samples.

There are two advantages of the IMS approach: 1) the one-step-ahead forecaster is easy to train because it only needs to consider the one-step-ahead forecasting error, and 2) we can generate predictions of an arbitrary length by recursively applying the basic forecaster. However, there is a discrepancy between training and testing in IMS. In the training process, we use the ground-truths of the previous L−1 steps to generate the Lth-step prediction, while in the testing process, we feed back the model predictions instead of the ground-truths to the forecaster. This makes the model prone to accumulated errors in the forecasting process [21]. For the deterministic forecaster example above, the optimal multi-step-ahead forecaster, which can be obtained by minimizing \mathbb{E}[\sum_{h=1}^{L} d(X_{t+h}, m^{(h)}(F_t; \phi))], is not the same as recursively applying the optimal one-step-ahead forecaster defined in (3) when m is nonlinear. This is because the forecasting error at earlier timestamps will propagate to later timestamps [23].
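To make the IMS procedure concrete, the sketch below (a minimal illustration under simplifying assumptions: flattened frames as feature vectors and a scikit-learn Ridge regressor as the one-step-ahead forecaster m) builds one-step training pairs and then rolls the fitted model forward to produce an L-step forecast by feeding its own predictions back in.

```python
import numpy as np
from sklearn.linear_model import Ridge

def make_pairs(frames, context_len):
    """Build (history window, next frame) training pairs from a sequence.

    frames: array of shape (T, K * D), one flattened frame per timestamp.
    """
    X, y = [], []
    for t in range(context_len, len(frames)):
        X.append(frames[t - context_len:t].ravel())
        y.append(frames[t])
    return np.stack(X), np.stack(y)

def forecast_ims(model, history, context_len, horizon):
    """IMS: recursively apply a one-step-ahead forecaster for `horizon` steps."""
    buf = [f for f in history]
    preds = []
    for _ in range(horizon):
        x = np.concatenate(buf[-context_len:])
        next_frame = model.predict(x[None, :])[0]
        preds.append(next_frame)
        buf.append(next_frame)                 # feed the prediction back in
    return np.stack(preds)

# Toy usage: 100 timestamps, 4 locations with 1 measurement each.
frames = np.random.randn(100, 4)
X_train, y_train = make_pairs(frames, context_len=5)
one_step = Ridge().fit(X_train, y_train)       # minimizes the one-step error, cf. (3)
future = forecast_ims(one_step, frames[-5:], context_len=5, horizon=10)
```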

2.2 Direct Multi-step Estimation

The main motivation behind DMS is to avoid the error-drifting problem in IMS by directly minimizing the long-term prediction error. Instead of training a single model, DMS trains a different model m_h for each forecasting horizon h. There can thus be L models in the DMS approach. When the forecasters are deterministic, the forecasting process and training objective are shown in (5) and (6), respectively, where d(·, ·) is the distance measure.

\hat{X}_{t+h} = m_h(F_t; \phi_h), \quad \forall 1 \le h \le L.   (5)

\phi_1^\star, \ldots, \phi_L^\star = \arg\min_{\phi_1, \ldots, \phi_L} \mathbb{E}\left[ d(X_{t+1:t+L}, \hat{X}_{t+1:t+L}) \right].   (6)

To disentangle the model size from the number of forecasting steps L, we can also construct m_h by recursively applying the one-step-ahead forecaster m_1. In this case, the model parameters {φ_1, ..., φ_L} are shared and the optimization problem turns into (7):

\phi^\star = \arg\min_{\phi} \mathbb{E}\left[ \sum_{h=1}^{L} d(X_{t+h}, m^{(h)}(F_t; \phi)) \right].   (7)

We need to emphasize here that (7), which optimizes the multi-step-ahead forecasting error, is different from (3), which only minimizes the one-step-ahead forecasting error. This construction method is widely adopted in deep learning based methods [7], [24].
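For comparison with the IMS sketch above, a rough sketch of the per-horizon variant of DMS follows (same simplifying assumptions: flattened frames and scikit-learn Ridge regressors; the data layout is illustrative and not taken from a specific reviewed work). One model m_h is fit independently for each horizon and maps the same history directly to its own target.

```python
import numpy as np
from sklearn.linear_model import Ridge

def train_dms(frames, context_len, horizon):
    """DMS: fit one regressor per forecasting horizon h = 1..L, cf. (5) and (6)."""
    models = []
    for h in range(1, horizon + 1):
        X, y = [], []
        for t in range(context_len, len(frames) - h + 1):
            X.append(frames[t - context_len:t].ravel())   # history window F_t
            y.append(frames[t + h - 1])                    # target X_{t+h}
        models.append(Ridge().fit(np.stack(X), np.stack(y)))
    return models

def forecast_dms(models, history_window):
    """Each model m_h directly maps the same history to its own horizon."""
    x = history_window.ravel()[None, :]
    return np.stack([m.predict(x)[0] for m in models])

frames = np.random.randn(100, 4)                     # toy data: T=100, K*D=4
models = train_dms(frames, context_len=5, horizon=10)
future = forecast_dms(models, frames[-5:])           # shape (10, 4)
```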

2.3 Comparison of IMS and DMS

Chevillon [19] compared the IMS and DMS strategies. According to the paper, DMS can lead to more accurate forecasts when 1) the model is misspecified, 2) the sequences are non-stationary, or 3) the training set is too small. However, DMS is more computationally expensive than IMS. For DMS, if the φ_h's are not shared, we need to store and train L models. If the φ_h's are shared, we need to recursively apply m_1 for O(L) steps [19], [21], [25]. Both cases require larger memory storage or longer running time than solving the IMS objective. On the other hand, the training process of IMS is easy to parallelize since each forecasting horizon can be trained in isolation from the others [26]. Moreover, directly optimizing an L-step-ahead forecasting loss may fail when the forecasting model is highly nonlinear or the parameters are not well-initialized [21].


Due to these trade-offs, models for STSF problems choose the strategies that best match the problems' characteristics. Generally speaking, DMS is preferred for short-term prediction while IMS is preferred for long-term forecasting. For example, the IMS strategy is widely adopted in solving TF-MPC problems, where the forecasting horizons are long, generally above 12 steps [4], [13] and sometimes as long as 100 steps [14]. The DMS strategy is used in STSF-RG problems, where the dimensionality of the data is usually very large and only short-term predictions are required [5], [6]. For STSF-IG problems, some works adopt DMS when the overall dimensionality of the forecasting sequence is acceptable for the direct approach [3], [27].

Overall, IMS is easier to train but less accurate for multi-step forecasting, while DMS is more difficult to train but more accurate. Since IMS and DMS are somewhat complementary, later studies have tried to bridge the gap between these two approaches. We will introduce two representative strategies in the following two sections.

2.4 Boosting Strategy

The boosting strategy proposed in [20] combines IMS and DMS for univariate time series forecasting. The strategy boosts the linear forecaster trained by IMS with several small, nonlinear adjustments trained by DMS. The model assumption is shown in (8):

x_{t+h} = m^{(h)}(F_t; \phi) + \sum_{i=1}^{I_h} \nu\, l_i(x_{t-j}, x_{t-k}; \psi_i) + e_{t,h}.   (8)

Here, m(F_t; φ) is the basic one-step-ahead forecasting model, l_i(x_{t−j}, x_{t−k}; ψ_i) is the weak learner chosen at step i, I_h is the number of boosting iterations, ν is the shrinkage factor, and e_{t,h} is the error term. The author set the base linear learner to be the autoregressive model and set the weak nonlinear learners to be penalized regression splines. Also, the weak learners were designed to only account for the interaction between two elements in the observation history. The training method is similar to the gradient boosting algorithm [28]: m(F_t; φ) is first trained by minimizing the IMS objective (3); after that, the weak learners are fit iteratively to the gradient residuals.

The author also performed a theoretical analysis of the bias and variance of models trained with DMS, IMS, and the boosting strategy for 2-step-ahead prediction. The result shows that the estimation bias of DMS can be arbitrarily small, while that of IMS is amplified as the forecasting horizon grows. The model estimated by the boosting strategy also has a bias, but it is much smaller than that of the IMS approach due to the compensation of the weak learners trained by DMS. Also, the variance of the DMS approach depends on the variability of the input history F_t and the model size, which will be very large for complex and nonlinear models. The variance of the boosting strategy is smaller because the base model is linear and the weak learners only consider the interaction between two elements.

Although the author only proposed the boosting strategy for the univariate case, the method, which lies in the middle ground between IMS and DMS, should apply to STSF problems with multiple measurements, which leads to potential future work as discussed in Section 5.
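A rough sketch of this hybrid is given below. It is a simplified univariate illustration under assumptions of my own: a linear AR(p) base model trained with the IMS objective, with plain scikit-learn regression trees standing in for the penalized regression splines and pairwise weak learners used in [20].

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

def boosted_multistep(series, p, h, n_rounds=20, shrinkage=0.1):
    """IMS-trained linear AR(p) base model plus DMS-trained weak learners, cf. (8)."""
    idx = range(p, len(series) - h + 1)
    X = np.stack([series[t - p:t] for t in idx])
    y1 = np.array([series[t] for t in idx])          # one-step-ahead targets (IMS)
    yh = np.array([series[t + h - 1] for t in idx])  # h-step-ahead targets (DMS)

    base = LinearRegression().fit(X, y1)             # base model: minimizes the IMS objective (3)

    def base_rollout(window):                        # m^(h): recursively apply the base model
        w = list(window)
        for _ in range(h):
            w.append(base.predict(np.array(w[-p:])[None, :])[0])
        return w[-1]

    pred_h = np.array([base_rollout(x) for x in X])
    learners = []
    for _ in range(n_rounds):                        # boosting: fit weak learners to residuals
        tree = DecisionTreeRegressor(max_depth=2).fit(X, yh - pred_h)
        learners.append(tree)
        pred_h = pred_h + shrinkage * tree.predict(X)
    return base, learners

# Toy usage on a noisy sine wave, forecasting 3 steps ahead with 5 lags.
series = np.sin(np.linspace(0, 20, 300)) + 0.1 * np.random.randn(300)
base, learners = boosted_multistep(series, p=5, h=3)
```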

2.5 Scheduled Sampling

The idea of Scheduled Sampling (SS) [21] is to first train the model with IMS and then gradually replace the ground-truths in the objective function with samples generated by the model. When all ground-truth samples are replaced with model-generated samples, the training objective becomes the DMS objective. The generation process of SS is described in (9):

\forall 1 \le h \le L:
\tilde{X}_{t+h} \sim p(X \mid F_t, \hat{X}_{t+1:t+h-1}; \phi),
\hat{X}_{t+h} = (1 - \tau_{t+h}) \tilde{X}_{t+h} + \tau_{t+h} X_{t+h},
\tau_{t+h} \sim B(1, \epsilon_k).   (9)

Here, X̃_{t+h} and X_{t+h} are the generated sample and the ground-truth at forecasting horizon h, respectively. p(X | F_t, X̂_{t+1:t+h−1}; φ) is the basic one-step-ahead forecasting model. τ_{t+h} is a random variable following the binomial distribution, which controls whether to use the ground-truth or the generated sample. ε_k is the probability of choosing the ground-truth at the kth iteration. In the training phase, SS minimizes the distance between X_{t+1:t+L} and X̂_{t+1:t+L}, which is shown in (10). In the testing phase, τ is fixed to 0, meaning that the model-generated samples are always used.

\mathbb{E}_{p(\hat{X}_{t+1:t+L} \mid F_t; \phi)}\left[ d(X_{t+1:t+L}, \hat{X}_{t+1:t+L}) \right]
= \mathbb{E}_{p(\tau_{t+1:t+L-1})}\left[ d(X_{t+1:t+L}, \hat{X}_{t+1:t+L}) \prod_{h=1}^{L} p(\hat{X}_{t+h} \mid F_t, \hat{X}_{t+1:t+h-1}; \phi) \right].   (10)

If ε_k equals 1, the ground-truth will always be chosen and the objective function will be the same as in the IMS strategy. If ε_k is 0, the generated sample will always be chosen and the optimization objective will be the same as in the DMS strategy. Based on this observation, the author proposed to gradually decay ε_k during training to make the optimization objective shift smoothly from IMS to DMS, which is a type of curriculum learning [29]. The paper also provided several schedules for changing the sampling ratio.
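The sketch below illustrates the sampling rule (9) inside a training loop (a minimal PyTorch-style illustration; the one-step forecaster interface, the mean-squared loss, and the inverse-sigmoid decay of ε_k are assumptions of this sketch rather than the exact setup of [21]).

```python
import math
import random
import torch

def scheduled_sampling_epoch(model, optimizer, sequences, context_len, horizon, k, decay=10.0):
    """One training epoch with scheduled sampling, cf. (9).

    model: maps a (1, context_len, dim) history to the next frame of shape (1, dim).
    sequences: tensor of shape (num_seq, T, dim) with ground-truth frames.
    k: global iteration counter used by the assumed inverse-sigmoid decay of eps_k.
    """
    total_loss = 0.0
    for seq in sequences:
        eps_k = decay / (decay + math.exp(k / decay))   # probability of feeding the ground-truth
        history = list(seq[:context_len])                # warm-up frames
        loss = 0.0
        for h in range(horizon):
            inp = torch.stack(history[-context_len:]).unsqueeze(0)
            pred = model(inp).squeeze(0)                 # generated sample at horizon h
            target = seq[context_len + h]
            loss = loss + torch.mean((pred - target) ** 2)
            use_truth = random.random() < eps_k          # tau_{t+h} ~ B(1, eps_k)
            # [21] treats the fed-back samples as constants, hence detach()
            history.append(target if use_truth else pred.detach())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        k += 1
    return total_loss / len(sequences), k
```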

One issue of SS is that the expectation over the τ_{t+h}'s in (10) makes the loss function non-differentiable. The original paper [21] sidesteps the problem by treating the X̂_{t+h}'s as constants, which does not optimize the true objective [25], [30].

The SS strategy is widely adopted in DL models for STSF. In [31], the author applied SS to video prediction and showed that the model trained via SS outperformed the model trained via IMS. In [24], the author proposed a training strategy similar to SS except that the training curriculum is adapted from IMS to DMS by controlling the number of prediction steps instead of the frequency of sampling the ground-truth. In [32], [33], the authors used SS to train models for traffic speed forecasting and showed that it outperformed IMS.

3 CLASSICAL METHODS FOR STSF

In this section, we will review the classical methods for STSF, including feature-based methods, state space models, and Gaussian process models. These methods are generally based on shallow models or spatiotemporal kernels. Since a spatiotemporal sequence can be viewed as a multivariate sequence if we flatten the X_t's into vectors or treat the observations at every location independently, methods designed for general multivariate time-series forecasting are naturally applicable to STSF. However, directly doing so ignores the spatiotemporal structure within the data. Therefore, most algorithms introduced in this section extend this basic approach by utilizing the domain knowledge of the specific task or considering the interactions among different measurements and locations.

3.1 Feature-based Methods

Feature-based methods solve STSF by training a regression model on top of human-engineered spatiotemporal features. To reduce the problem size, most feature-based methods treat the measurements at the K locations as independent given the extracted features. The high-level IMS and DMS models of the feature-based methods are given in (11) and (12), respectively:

\hat{X}_{t+1,k} = f_{\mathrm{IMS}}(\Psi_k(F_t); \phi),   (11)

\hat{X}_{t+1:t+L,k} = f_{\mathrm{DMS}}(\Psi_k(F_t); \phi).   (12)

Here, Ψ_k(F_t) represents the features at location k. Once the feature extraction method Ψ_k is defined, any regression model f can be applied. Feature-based methods are usually adopted for STSF-IG problems. Because the observations are sparsely and irregularly distributed in STSF-IG, it is sometimes more straightforward to extract human-engineered features than to develop a purely machine learning based model. Despite their straightforwardness, feature-based methods have a major drawback: the model's performance relies heavily on the quality of the human-engineered features. We will briefly introduce two representative feature-based methods for two STSF-IG problems to illustrate the approach.

Ohashi and Torgo [34] proposed a feature-based method for wind-speed forecasting. The authors designed a set of features called spatiotemporal indicators and applied several regression algorithms, including regression trees, support vector regression, and random forests. There are two kinds of spatiotemporal indicators defined in the paper: the average or weighted average of the observed historical values within a space-time neighborhood, and the ratio of two average values calculated with different distance thresholds. The regression model takes the spatiotemporal indicators as the input to generate the one-step-ahead prediction. Experimental results show that the spatiotemporal indicators are helpful for enhancing the forecasting accuracy.
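As a small illustration of this family of features (a sketch under simplifying assumptions of my own, not the exact indicators of [34]), the snippet below computes a space-time-neighborhood average and the ratio of averages at two distance thresholds, which can then be fed to any regressor such as a random forest.

```python
import numpy as np

def spatiotemporal_indicators(values, coords, k, t, radius_small, radius_large, window):
    """Simple spatiotemporal indicators for location k at timestamp t.

    values: array (T, N) of past measurements; coords: array (N, 2) of fixed coordinates.
    Returns: (neighborhood average, ratio of small-radius to large-radius averages).
    """
    dist = np.linalg.norm(coords - coords[k], axis=1)
    recent = values[max(0, t - window):t]                 # temporal sliding window
    avg_small = recent[:, dist <= radius_small].mean()
    avg_large = recent[:, dist <= radius_large].mean()
    return avg_small, avg_small / (avg_large + 1e-8)

# Usage: build a feature matrix for one location and train any regressor on it.
T, N = 200, 30
values = np.random.rand(T, N)
coords = np.random.rand(N, 2)
feats = np.array([spatiotemporal_indicators(values, coords, k=0, t=t,
                                            radius_small=0.1, radius_large=0.3, window=6)
                  for t in range(6, T)])
targets = values[6:, 0]            # one-step-ahead target at location 0
```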

Zheng et al. [3] proposed a feature-based method for air quality forecasting. The authors adopted the DMS approach and combined multiple predictors, including a temporal predictor, a spatial predictor, and an inflection predictor, to generate the features at location k and timestamp t. The temporal predictor focuses on the temporal side of the data and uses the measurements within a temporal sliding window to generate the features. The spatial predictor focuses on the spatial side of the data and uses the previous measurements within a nearby region to generate the features. Also, to deal with the sparsity and irregularity of the coordinates, the authors partitioned the neighborhood region into several groups and aggregated the measurements within each group. The inflection predictor extracts the sudden changes within the observations. For the regression models, the temporal predictor uses a linear regression model, the spatial predictor uses a 2-hidden-layer neural network, and the inflection predictor uses a regression tree model. A regression tree model further combines these predictors.

3.2 State Space Models

The State Space Model (SSM) adopts a generative viewpoint of the sequence. For the observed sequence X_{1:T}, SSM assumes that each element X_t is generated by a hidden state H_t and that these hidden states, which are allowed to be discrete in our definition, form a Markov process. The general form of the state space model [35] is given in (13), where g is the transition model, f is the observation model, H_t is the hidden state, ε_t is the noise of the transition model, and σ_t is the noise of the observation model. The SSM has a long history in forecasting [36], and many classical time-series models like the Markov Model (MM), Hidden Markov Model (HMM), Vector Autoregression (VAR), and Autoregressive Integrated Moving Average (ARIMA) model can be written as SSMs.

H_t = g(H_{t-1}, \varepsilon_t),
X_t = f(H_t, \sigma_t).   (13)

Given the underlying hidden state H_t, the observation X_t is independent of the historical information F_t. Thus, the posterior distribution of the sequence for the next L steps can be written as follows:

p(X_{t+1:t+L} \mid F_t) = \int p(H_t \mid F_t) \prod_{i=t+1}^{t+L} p(X_i \mid H_i)\, p(H_i \mid H_{i-1})\, dH_{t:t+L}.   (14)

The advantage of SSM for solving STSF is that it naturally models the uncertainty of the dynamical system due to its Bayesian formulation. The posterior distribution (14) encodes the probability of different future outcomes. We can not only get the most likely future but also know the confidence of our prediction by Bayesian inference. Common types of SSMs used in the STSF literature have either discrete hidden states or linear transition models and thus have limited representational power [37]. In this section, we will focus on these common types of SSMs and leave the review of more advanced SSMs, which are usually coupled with deep architectures, to Section 4.2. We will first briefly review the basics of three common classes of SSMs and then introduce their representative spatiotemporal extensions, including the group-based methods, Space-Time Autoregressive Integrated Moving Average (STARIMA), and low-rank tensor auto-regression.

3.2.1 Common Types of SSMs

In the STSF literature, the following three classes of SSMs are most common.

First-order Markov Model: When we set H_t to be X_{t−1} in (13) and constrain it to be discrete, the SSM reduces to the first-order MM. Assuming the X_t's have C choices, the first-order MM is determined by the transition matrix A ∈ R^{C×C}, where A_{i,j} = p(X_{t+1} = i | X_t = j), and the prior distribution π ∈ R^C, where π_i = p(X = i). The maximum posterior estimator can be obtained via dynamic programming and the optimal model parameters can be solved in closed form [35].

Hidden Markov Model: When the states are discrete, the SSM turns into the HMM. Assuming the H_t's have C choices, the HMM is determined by the transition matrix A ∈ R^{C×C}, the state prior π, and the class-conditional density p(X_t | H_t = i). For the HMM, the maximum posterior estimator and the optimal model parameters can be calculated via the forward algorithm and the Expectation Maximization (EM) algorithm, respectively [35]. It is worth emphasizing that the posterior of the HMM is tractable even for non-Gaussian observation models. This is because the states of the HMM are discrete, so it is possible to enumerate all the possibilities.

Linear-Gaussian SSM: When the states are continuous and both g and f in (13) are linear-Gaussian, the SSM becomes the Linear-Gaussian State Space Model (LG-SSM) [35]:

H_t = A_t H_{t-1} + \varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, Q_t),
X_t = C_t H_t + \sigma_t, \quad \sigma_t \sim \mathcal{N}(0, R_t),   (15)

where Q_t and R_t are the covariance matrices of the Gaussian noises. A large number of classical time-series models like VAR and ARIMA can be written in the form of an LG-SSM [35], [38]. As both the transition model and the observation model are linear-Gaussian, the posterior is also linear-Gaussian and we can use the Kalman Filtering (KF) algorithm [39] to calculate its mean and variance. Similar to the HMM, the model parameters can also be learned by EM.
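To make forecasting with an LG-SSM concrete, here is a minimal sketch (using the notation of (15), with time-invariant A, C, Q, R assumed for brevity) of one Kalman filtering update followed by an L-step-ahead prediction of the observation mean and covariance.

```python
import numpy as np

def kalman_filter_step(mu, P, x, A, C, Q, R):
    """One KF update: propagate the state belief N(mu, P) and absorb observation x."""
    mu_pred = A @ mu                        # predict step
    P_pred = A @ P @ A.T + Q
    S = C @ P_pred @ C.T + R                # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)     # Kalman gain
    mu_new = mu_pred + K @ (x - C @ mu_pred)
    P_new = (np.eye(len(mu)) - K @ C) @ P_pred
    return mu_new, P_new

def forecast_lgssm(mu, P, A, C, Q, R, horizon):
    """Roll the filtered belief forward to get the mean/covariance of X_{t+1:t+L}."""
    means, covs = [], []
    for _ in range(horizon):
        mu, P = A @ mu, A @ P @ A.T + Q
        means.append(C @ mu)
        covs.append(C @ P @ C.T + R)
    return means, covs
```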

3.2.2 Group-based Methods

Group-based methods divide the N locations into non-intersecting groups and train an independent SSM for each group. To be more specific, for the location set S = {1, 2, ..., N}, the group-based methods assume that it can be decomposed as S = ∪_{i=1}^{K} G_i, where G_i ∩ G_j = ∅ for all i ≠ j. The measurement sequences that belong to the same group, i.e., D_i = {{X_{1,j}, ..., X_{t,j}}, j ∈ G_i}, are used to fit a single SSM. If K is equal to 1, it reduces to the simplest case where the measurement sequences at all locations share the same model.

This technique has mainly been applied to TF-MPC problems. Asahara et al. [40] proposed a group-based extension of the MM called the Mixed-Markov-chain Model (MMM) for pedestrian-movement prediction. MMM assumes that the people belonging to a specific group follow the same first-order MM. To use the first-order MM, the authors manually discretized the coordinates of each person. The group assignments along with the model parameters are learned by EM. Experiments show that MMM outperforms the MM and HMM in this task. Mathew et al. [41] extended the work by using the HMM as the base model. Also, in [41], the groups are determined by offline clustering algorithms instead of EM.

Group-based methods capture the intra-group relationships by model sharing. However, they cannot capture the inter-group correlations. Also, only a single group assignment is considered, while there may be multiple grouping relationships in the spatiotemporal data. Moreover, the latent state transitions of SSMs already have some internal grouping effects, so it is often not necessary to explicitly divide the locations into different groups. Due to these limitations, group-based methods are less mentioned in later research works.

3.2.3 STARIMA

STARIMA [42], [43], [44] is a classical spatiotemporal extension of the univariate ARIMA model [45]. STARIMA emphasizes the impact of a location's spatial neighborhoods on its future values. The fundamental assumption is that the future measurements at a specific location i are not only related to the past measurements observed at this location but also to the past measurements observed at its local neighborhoods. STARIMA is designed for univariate STSF-RG and STSF-IG problems where there is a single measurement at each location. If we denote the measurements at timestamp t as x_t ∈ R^N, the model of STARIMA is the following:

\Delta^d x_t = \sum_{k=1}^{p} \sum_{l=0}^{\lambda_k} \phi_{kl} W^{(l)} \Delta^d x_{t-k} + \varepsilon_t - \sum_{k=1}^{p} \sum_{l=0}^{m_k} \theta_{kl} W^{(l)} \varepsilon_{t-k}.   (16)

Here, ∆ is the temporal difference operator, where ∆x_t = x_t − x_{t−1} and ∆^k x_t = ∆{∆^{k−1} x_t}. φ_{kl} and θ_{kl} are respectively the autoregressive parameter and the moving average parameter at temporal lag k and spatial lag l. λ_k and m_k are the maximum spatial orders. W^{(l)} is a predefined N × N matrix of spatial order l. ε_t is the normally distributed error vector at time t that satisfies the following relationships:

\mathbb{E}[\varepsilon_t] = 0,
\mathbb{E}[\varepsilon_t \varepsilon_{t+s}^T] = \begin{cases} \sigma^2 I, & s = 0 \\ 0, & \text{otherwise}, \end{cases}
\mathbb{E}[x_t \varepsilon_{t+s}^T] = 0, \quad \forall s > 0.   (17)

The spatial order matrix W^{(l)} is the key in STARIMA for modeling the spatiotemporal correlations. W^{(l)}_{i,j} is non-zero only if site j is an lth-order neighbor of site i. W^{(0)} is defined to be the identity matrix. The spatial order relationship in a two-dimensional coordinate system is illustrated in Figure 1. However, the W^{(l)}'s need to be predefined by the model builder and cannot vary over time. Also, STARIMA still belongs to the class of LG-SSMs and is thus not suitable for modeling nonlinear spatiotemporal systems. Another problem of STARIMA is that it is designed for problems where there is only one measurement at each location and does not consider the case of multiple correlated measurements.

Fig. 1. Illustration of the spatial order relationships in STARIMA. From left to right, the spatial order equals 1, 2, and 3. The blue circles are neighbors of the white circle in the middle.
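As an illustration of how the predefined spatial order matrices enter the model (a sketch restricted to the space-time autoregressive part of (16); differencing and the moving-average terms are omitted, and the chain-graph W^{(1)} below is a toy assumption), one forecasting step can be written as follows.

```python
import numpy as np

def star_forecast_step(history, W_list, phi):
    """One-step forecast of the space-time AR part of (16).

    history: list of past vectors [x_{t-1}, ..., x_{t-p}], each of shape (N,).
    W_list: predefined spatial order matrices [W^(0), ..., W^(lambda)], each (N, N).
    phi: array of shape (p, lambda+1) with the autoregressive parameters phi_{kl}.
    """
    p, n_orders = phi.shape
    x_next = np.zeros_like(history[0])
    for k in range(1, p + 1):                  # temporal lag
        for l in range(n_orders):              # spatial lag
            x_next += phi[k - 1, l] * (W_list[l] @ history[k - 1])
    return x_next

# Toy usage with a 4-location chain graph: W^(1) averages the immediate neighbors.
N = 4
W0 = np.eye(N)
adj = np.diag(np.ones(N - 1), 1) + np.diag(np.ones(N - 1), -1)
W1 = adj / adj.sum(axis=1, keepdims=True)
history = [np.random.randn(N) for _ in range(2)]    # [x_{t-1}, x_{t-2}]
phi = np.array([[0.5, 0.3], [0.2, 0.1]])            # phi_{kl}
x_pred = star_forecast_step(history, [W0, W1], phi)
```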

3.2.4 Low-rank Tensor Auto-regression

Bahadori et al. [46] proposed to train an auto-regression model with a low-rank constraint to better capture the latent structure of the spatiotemporal data. This method is designed for STSF-IG and STSF-RG problems where the coordinates are fixed. Assuming the observed measurements X_t ∈ R^{N×D} satisfy the VAR(J) model, the optimization problem of the low-rank tensor auto-regression is given in (18):

\min_{\mathcal{W}} \sum_{t=J+1}^{T} \| \hat{X}_t - X_t \|_F^2 + \mu \sum_{t=1}^{T} \mathrm{tr}(\hat{X}_t^T L \hat{X}_t),
\text{s.t. } r(\mathcal{W}) \le \rho, \quad \hat{X}_{t+1,k} = \mathcal{W}_k X_{t-J+1:t}, \quad \forall t, k.   (18)

Here, W ∈ R^{N×D×JD} is the three-dimensional weight tensor. L ∈ R^{N×N} is the spatial Laplacian matrix calculated from the fixed coordinate matrix C ∈ R^{N×E}. r(W) denotes the rank of the weight tensor and ρ is the maximum rank. The regularization term in the objective function constrains the predicted results to be spatially smooth. µ is the strength of this spatial similarity constraint.

Constraining the rank of the weight tensor imposes an implicit spatiotemporal structure on the data. When the weight tensor is low-rank, the observed sequences can be described with a few latent factors.

The optimization problem (18) is non-convex and NP-hard in general [47] due to the low-rank constraint. One standard approach is to relax the rank constraint to a nuclear norm constraint and optimize the relaxed problem. However, this approach is not computationally feasible for solving (18) because of the high dimensionality of the spatiotemporal data. Thus, researchers [46], [48], [49] proposed optimization algorithms to accelerate the learning process.

The primary limitation of the low-rank tensor auto-regression is that the regression model is still linear even if we have perfectly learned the weights. We cannot directly apply it to solve forecasting problems that have strong nonlinearities.

3.3 Gaussian Process

The Gaussian Process (GP) is a non-parametric Bayesian model. A GP defines a prior p(f) in function space and tries to infer the posterior p(f | D) given the observed data [35]. A GP assumes that the joint distribution of the function values at an arbitrary finite set {x_1, x_2, ..., x_n}, i.e., p(f(x_1), ..., f(x_n)), is a Gaussian distribution. A typical form of the GP prior is shown in (19):

f \sim \mathcal{GP}(m(x), \kappa(x, x')),   (19)

where m(x) is the mean function and κ(x, x′) is the kernel function.

In geostatistics, GP regression is also known as kriging [50], [51], which is used to interpolate the missing values of the data. When applied to STSF, GP is usually used in STSF-IG and STSF-RG problems where the coordinates are fixed. The general formula of GP based models is shown in (20):

f \sim \mathcal{GP}(m(X), \kappa_\theta),
Y \sim p(Y \mid f(X)).   (20)

Here, X contains the space-time positions of the samples, which are the concatenation of the spatial coordinates and the temporal index. For example, X_k = (t_k, i_k, j_k), where t_k is the timestamp and (i_k, j_k) is the spatial coordinate. κ_θ is the kernel function parameterized by θ. p(Y | f(X)) is the observation model. Like the SSM, the GP is also a generative model. The difference is that the GP directly generates the function describing the relationship between the observations and the future, while the SSM generates the future measurements one-by-one through the Markov assumption. We need to emphasize here that GP can also be applied to solve TF-MPC problems. A well-known example is the Interacting Gaussian Processes (IGP) [52], in which a GP models the movement of each person and a potential function defined over the positions of different people models their interactions. Nevertheless, the method for calculating the posterior of IGP is similar to that of (20), and we will stick to this model formulation.

To predict the measurements given a GP model, we are required to calculate the predictive posterior p(f_* | X_*, X_D, Y_D), where {X_D, Y_D} are the observed data points and X_* contains the candidate space-time coordinates at which to infer the measurements. If the observation model is a Gaussian distribution, i.e., p(Y | f(X)) = N(f(X), σ²), the posterior distribution is also a Gaussian distribution, as shown in (21):

p(f_* \mid X_*, X_D, Y_D) = \mathcal{N}(\mu_*, \Sigma_*),
K = \kappa(X_D, X_D) + \sigma^2 I,
\mu_* = \kappa(X_*, X_D) K^{-1} Y_D,
\Sigma_* = \kappa(X_*, X_*) - \kappa(X_*, X_D) K^{-1} \kappa(X_D, X_*).   (21)

If the observation model is non-Gaussian, which is common for STSF problems [53], the Laplace approximation can be utilized to approximate the posterior [35]. The inference process of GP is computationally expensive due to the involvement of K^{-1}: the naive approach requires O(|D|^3) computations and O(|D|^2) storage. There are many acceleration techniques to speed up the inference [53]. We will not investigate them in detail in this survey; readers can refer to [51], [53].
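As a minimal illustration of (21) (a sketch assuming a separable squared-exponential kernel over space and time; the kernel choice and hyper-parameters are assumptions, not tied to a specific reviewed work), the snippet below computes the posterior mean and covariance at query space-time points.

```python
import numpy as np

def rbf(a, b, lengthscale):
    """Squared-exponential kernel between two sets of points."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def st_kernel(A, B, ls_space=1.0, ls_time=3.0):
    """Separable spatiotemporal kernel: k_st = k_s(space) * k_t(time).

    A, B: arrays of shape (n, 3) with rows (t, i, j)."""
    return rbf(A[:, 1:], B[:, 1:], ls_space) * rbf(A[:, :1], B[:, :1], ls_time)

def gp_posterior(X_train, y_train, X_query, noise=0.1):
    """Posterior mean and covariance at X_query, cf. (21)."""
    K = st_kernel(X_train, X_train) + noise ** 2 * np.eye(len(X_train))
    K_qd = st_kernel(X_query, X_train)
    K_inv = np.linalg.inv(K)
    mean = K_qd @ K_inv @ y_train
    cov = st_kernel(X_query, X_query) - K_qd @ K_inv @ K_qd.T
    return mean, cov

# Toy usage: observations at (t, i, j) space-time points, queried one step ahead.
X_train = np.array([[t, i, j] for t in range(5) for i in range(3) for j in range(3)], float)
y_train = np.sin(X_train[:, 0]) + 0.1 * np.random.randn(len(X_train))
X_query = np.array([[5.0, i, j] for i in range(3) for j in range(3)])
mean, cov = gp_posterior(X_train, y_train, X_query)
```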

The key ingredient of GP is the kernel function. For STSF problems, the kernel function should take both the spatial and temporal correlations into account. The common strategy for defining complex kernels is to combine multiple basic kernels by operations like summation and multiplication. Thus, GP models for STSF generally use different kernels to model different aspects of the spatiotemporal sequence and ensemble them together to form the final kernel function. In [53], the author proposed to decompose the spatiotemporal kernel as κ_st((s, t), (s′, t′)) = κ_s(s, s′) κ_t(t, t′), where κ_s is the spatial kernel and κ_t is the temporal kernel [53]. If we denote the kernel matrices corresponding to κ_st, κ_s, κ_t as K_st, K_s, K_t, this decomposition results in K_st = K_s ⊗ K_t, where ⊗ is the Kronecker product. Based on this observation, the author utilized the computational benefits of Kronecker algebra to scale up the inference of the GP models [54]. In [16], the authors combined two spatial kernels and one temporal kernel to construct a GP model for seasonal influenza prediction. The authors defined the basic spatial kernels as the average of multiple RBF kernels. The temporal kernel was defined to be the sum of a periodic kernel, Paciorek's kernel, and an exponential kernel. The final kernel function used in the paper was κ_final = κ_space + κ_time + κ_time × κ_space2.
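The computational benefit of such a separable kernel comes from Kronecker algebra, e.g., (K_s ⊗ K_t)^{-1} = K_s^{-1} ⊗ K_t^{-1}, so one can work with the two small factors instead of the full spatiotemporal Gram matrix. The following lines (a generic numpy check, not code from [53], [54]) verify this identity on random kernel matrices.

```python
import numpy as np

ns, nt = 6, 8                                   # number of spatial sites and timestamps
A = np.random.randn(ns, ns); K_s = A @ A.T + ns * np.eye(ns)   # SPD spatial kernel matrix
B = np.random.randn(nt, nt); K_t = B @ B.T + nt * np.eye(nt)   # SPD temporal kernel matrix

K_st = np.kron(K_s, K_t)                        # full (ns*nt) x (ns*nt) Gram matrix
lhs = np.linalg.inv(K_st)
rhs = np.kron(np.linalg.inv(K_s), np.linalg.inv(K_t))
print(np.allclose(lhs, rhs))                    # True: invert two small matrices instead
```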

3.4 Remarks

In this section, we reviewed three types of classical methods for solving STSF problems: feature-based methods, SSMs, and GP models. The feature-based methods are simple to implement and can provide useful predictions in some practical situations. However, these methods rely heavily on human-engineered features and may not be applicable without the help of domain experts. The SSMs assume that the observations are generated by Markovian hidden states. We introduced three common types of SSMs: the first-order MM, the HMM, and the LG-SSM. Group-based methods extend SSMs by creating groups among the states or observations. STARIMA extends ARIMA by explicitly modeling the impact of nearby locations on the future. Low-rank tensor auto-regression extends the basic auto-regression model by adding low-rank constraints on the model parameters. The advantage of these common types of SSMs for STSF is that we can calculate the exact posterior distribution and conduct Bayesian inference. However, the overall nonlinearity of these models is limited, and they may not be suitable for some complex STSF problems like video prediction. GP models the inner characteristics of the spatiotemporal sequences by spatiotemporal kernels. Although very powerful and flexible, GP has high computational and storage complexity. Without any optimization, the inference of GP requires O(|D|^3) computations and O(|D|^2) storage, which makes it unsuitable for cases where a huge number of training instances are available.

4 DEEP LEARNING METHODS FOR STSF

Deep Learning (DL) is an emerging area in machine learning that studies the problem of how to learn a hierarchically-structured model to directly map the raw inputs to the desired outputs [26]. Usually, a DL model stacks basic learnable blocks, or layers, to form a deep architecture. The overall network is then trained end-to-end. In recent years, a growing number of researchers have proposed DL models for STSF. To better understand these models, we will start with an introduction of the relevant background of DL and then review the models for STSF.

4.1 Preliminaries

4.1.1 Basic Building Blocks

In this section, we will review the basic blocks that have been used to build DL models for STSF. Since there is a vast literature on DL, this part serves only as a brief introduction. Readers can refer to [26] for more details.

Restricted Boltzmann Machine: The Restricted Boltzmann Machine (RBM) [55] is the basic building block of Deep Generative Models (DGMs). RBMs are undirected graphical models that contain a layer of observable variables v and another layer of hidden variables h. These two layers are interconnected and form a bipartite graph. In its simplest form, h and v are both binary and contain n_h and n_v nodes. The joint distribution and the conditional distributions are defined in (22) [26]:

p(v, h) = \frac{1}{Z} \exp(-E(v, h)),
E(v, h) = -b^T v - c^T h - v^T W h,
p(h \mid v) = \prod_{j=1}^{n_h} \sigma\left( (2h - 1) \circ (c + W^T v) \right)_j,
p(v \mid h) = \prod_{j=1}^{n_v} \sigma\left( (2v - 1) \circ (b + W h) \right)_j.   (22)

Here, b is the bias of the visible state, c is the bias of the hidden state, W is the weight between the visible states and hidden states, σ is the sigmoid function defined in Table 2, and Z is the normalization constant for energy-based models, which is also known as the partition function [35]. RBM can be learned by Contrastive Divergence (CD) [56], which approximates the gradient of log Z by drawing samples from k cycles of MCMC.

The basic form of RBM has been extended in various ways. The Gaussian-Bernoulli RBM (GRBM) [57] assumes the visible variables are continuous and replaces the Bernoulli distribution with a Gaussian distribution. The Spike and Slab RBM (ssRBM) maintains two types of hidden variables to model the covariance.

TABLE 2
Activations that appear in models reviewed in this survey.

Name            | Formula
σ (sigmoid)     | σ(x) = 1 / (1 + e^{−x})
tanh            | tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})
ReLU [58]       | ReLU(x) = max(0, x)
Leaky ReLU [59] | LReLU(x) = max(αx, x)

Activations: Activations refer to the element-wise nonlinear mappings h = f(x). The activation functions used in the reviewed models are listed in Table 2.

Sigmoid Belief Network: The Sigmoid Belief Network (SBN) is another building block of DGMs. Unlike the RBM, which is undirected, the SBN is a directed graphical model. SBN contains a hidden binary state h and a visible binary state v. The basic generation process of SBN is the following:

p(v_m \mid h) = \sigma(w_m^T h + c_m),
p(h_j) = \sigma(b_j).   (23)

Here, b_j and c_m are biases and w_m is the weight. Because the posterior distribution is intractable, SBN is learned by methods like Neural Variational Inference and Learning (NVIL) [60] and the Wake-Sleep (WS) algorithm [61], which train an inference network along with the generation network.


Fully-connected Layer: The Fully-Connected (FC) layer is the most basic building block of DL models. It refers to the linear mapping from the input x to the hidden state h, i.e., h = Wx + b, where W is the weight and b is the bias.

Convolution and Pooling: The convolution layer and the pooling layer were first proposed to take advantage of the translation invariance property of image data. The convolution layer computes the output by scanning over the input and applying the same set of linear filters. Although the input can have an arbitrary dimensionality [62], we will only introduce 2D convolution and 2D pooling in this section. For an input X ∈ R^{C_i×H_i×W_i}, the output of the convolution layer H ∈ R^{C_o×H_o×W_o}, which is also known as the feature map, is calculated as follows:

H = W * X + b.   (24)

Here, W ∈ R^{C_i×C_o×K_h×K_w} is the weight, K_h and K_w are the kernel sizes, "∗" denotes convolution, and b is the bias. If we denote the feature vector at the (i, j) position as H_{:,i,j}, the flattened weight as W = vec(W), and the flattened version of the local region inputs as x_{N(i,j)} = vec([X_{:,s,t} | (s, t) ∈ N(i, j)]), then (24) can be rewritten as (25):

H_{:,i,j} = W x_{N(i,j)} + b.   (25)

This shows that the convolution layer can also be viewed as applying multiple FC layers with shared weight and bias on local regions of the input. Here, the ordered neighborhood set N(i, j) is determined by hyper-parameters like the kernel size, stride, padding, and dilation [63]. For example, if the kernel size of the convolution filter is 3 × 3, we have N(i, j) = {(i + s, j + t) | −1 ≤ s, t ≤ 1}. Readers can refer to [26], [63] for details about how to calculate the neighborhood set N(i, j) based on these hyper-parameters. Also, the out-of-boundary regions of the input are usually set to zero, which is known as zero-padding.
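The equivalence in (25) can be checked directly; the sketch below (plain numpy, assuming stride 1, zero-padding, a 3 × 3 kernel, and the weight stored in a common (C_o, C_i, K_h, K_w) layout for convenience) computes a 2D convolution by applying the same flattened weight matrix to every local neighborhood.

```python
import numpy as np

def conv2d_as_local_fc(X, W, b):
    """2D convolution written as a shared FC layer over local regions, cf. (25).

    X: input of shape (C_i, H, W); W: weights of shape (C_o, C_i, 3, 3); b: (C_o,).
    Stride 1 and zero-padding are assumed so the output keeps the spatial size.
    """
    C_o = W.shape[0]
    C_i, H, Wd = X.shape
    Xp = np.pad(X, ((0, 0), (1, 1), (1, 1)))           # zero-padding
    W_flat = W.reshape(C_o, -1)                         # shared weight matrix
    out = np.zeros((C_o, H, Wd))
    for i in range(H):
        for j in range(Wd):
            x_neigh = Xp[:, i:i + 3, j:j + 3].ravel()   # vec of the 3x3 local region
            out[:, i, j] = W_flat @ x_neigh + b         # same FC mapping at every (i, j)
    return out

feat = conv2d_as_local_fc(np.random.randn(3, 8, 8),
                          np.random.randn(16, 3, 3, 3),
                          np.zeros(16))                 # feature map of shape (16, 8, 8)
```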

Like the convolution layer, the pooling layer also scans over the input and applies the same mapping function. The pooling layer generally has no parameters and uses a reduction operation like max, sum, or average for the mapping. The formula of the pooling layer is the following:

H_{k,i,j} = g(\{X_{k,s,t} \mid (s, t) \in N(i, j)\}),   (26)

where g is the reduction operation.

Deconvolution and Unpooling: The deconvolution [64] and unpooling [65] layers were first proposed to serve as the "inverse" operations of the convolution and pooling layers. Computing the forward pass of these two layers is equivalent to computing the backward pass of the convolution and pooling layers [64].

Graph Convolution: Graph convolution generalizes the standard convolution, which operates over a regular grid topology, to convolution over graph structures [33], [66], [67], [68]. There are two interpretations of graph convolution: the spectral interpretation and the spatial interpretation. In the spectral interpretation [66], [69], graph convolution is defined by the convolution theorem, i.e., the Fourier transform of a convolution is the point-wise product of the Fourier transforms:

x *_G y = F^{-1}(F(x) \circ F(y)) = U(U^T x \circ U^T y).   (27)

Here, F(x) = U^T x is the graph Fourier transform. The transformation matrix U is defined based on the eigendecomposition of the graph Laplacian. Defferrard et al. [69] defined U as the eigenvectors of I − D^{−1/2} A D^{−1/2}, in which D is the diagonal degree matrix, A is the adjacency matrix, and I is the identity matrix.

In the spatial interpretation, graph convolution is defined by a localized parameter-sharing operator that aggregates the features of the neighboring nodes, which can also be termed a graph aggregator [33], [70]:

h_i = \gamma_\theta(x_i, \{z_{N_i}\}),   (28)

where h_i is the output feature vector of node i, γ_θ(·) is the mapping function parameterized by θ, x_i is the input feature of node i, and z_{N_i} contains the features of node i's neighbors. The mapping function needs to be permutation invariant and able to handle a varying number of neighbors. One example is the pooling aggregator defined in (29):

h_i = \phi_o\left(x_i \oplus \mathrm{pool}_{j \in N_i}(\phi_v(z_j))\right),   (29)

where φ_o and φ_v are mapping functions and ⊕ means concatenation. Compared with the spectral interpretation, the spatial interpretation does not require the expensive eigendecomposition and is more suitable for large graphs. Moreover, Defferrard et al. [69] and Kipf & Welling [68] proposed accelerated versions of the spectral graph convolution that can be interpreted from the spatial perspective.
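A minimal sketch of the pooling aggregator in (29) (plain numpy, with max pooling and single linear-plus-ReLU maps standing in for φ_v and φ_o; the exact parameterization varies across [33], [70]) is shown below.

```python
import numpy as np

def pooling_aggregator(x_i, z_neighbors, W_v, W_o):
    """Graph aggregator, cf. (29): h_i = phi_o(x_i concat max-pool_j(phi_v(z_j))).

    x_i: (d_in,) feature of node i; z_neighbors: (num_neighbors, d_in) neighbor features.
    W_v: (d_hid, d_in) weights of phi_v; W_o: (d_out, d_in + d_hid) weights of phi_o.
    """
    msgs = np.maximum(z_neighbors @ W_v.T, 0.0)    # phi_v: linear + ReLU on each neighbor
    pooled = msgs.max(axis=0)                       # permutation-invariant max pooling
    h_i = np.maximum(W_o @ np.concatenate([x_i, pooled]), 0.0)   # phi_o on concatenation
    return h_i

# Toy usage: node 0 has 4 neighbors with 8-dimensional features.
rng = np.random.default_rng(0)
h0 = pooling_aggregator(rng.normal(size=8), rng.normal(size=(4, 8)),
                        rng.normal(size=(16, 8)), rng.normal(size=(32, 24)))
```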

4.1.2 Types of DL Models for STSF

Based on the characteristics of the model formulation, we can categorize DL models for STSF into two major types: Deep Temporal Generative Models (DTGMs), and Feedforward Neural Network (FNN) and Recurrent Neural Network (RNN) based methods. In the following, we will briefly introduce the basics of DTGMs, FNNs, and RNNs.

DTGM: DTGMs are DGMs for temporal sequences. A DTGM adopts a generative viewpoint of the sequence and tries to define the probability distribution p(X_{1:T}). The aforementioned RBM and SBN are building blocks of DTGMs.


FNN: FNNs, also called deep feedforward networks [26], refer to deep learning models that construct the mapping y = f(x; θ) by stacking various of the aforementioned basic blocks, such as the FC layer, convolution layer, deconvolution layer, graph convolution layer, and activation layer. Common types of FNNs include the Multilayer Perceptron (MLP) [26], [71], which stacks multiple FC layers and nonlinear activations; the Convolutional Neural Network (CNN) [72], which stacks multiple convolution and pooling layers; and the graph CNN [69], which stacks multiple graph convolution layers. The parameters of an FNN are usually estimated by minimizing a loss function plus some regularization terms. Usually, the optimization problem is solved via stochastic gradient-based methods [26], in which the gradient is computed by backpropagation [73].

RNN: The RNN generalizes the structure of the FNN by allowing cyclic connections in the network. These recurrent connections make RNNs especially suitable for modeling sequential data. Here, we will only introduce the discrete version of the RNN, which updates the hidden states using h_t = f(h_{t−1}, x_t; θ) [26]. These hidden states can be further used to compute the output and define the loss function. One example of an RNN model is shown in (30). The model uses the tanh activation and a fully-connected layer for the recurrent connections. It uses another fully-connected layer to get the output:

h_t = \tanh(W_h h_{t-1} + W_x x_t + b),
o_t = W_o h_t + c.   (30)

Similar to FNNs, RNNs are also trained by stochastic gradient-based methods. To calculate the gradient, we can first unfold the RNN into an FNN and then perform backpropagation on the unfolded computational graph. This algorithm is called Backpropagation Through Time (BPTT) [26].

Directly training the RNN shown in (30) with BPTT will cause the vanishing and exploding gradient problems [74]. One way to solve the vanishing gradient problem is to use the Long Short-Term Memory (LSTM) [75], which has a cell unit that is designed to capture the long-term dependencies. The formula of LSTM is given in the following:

i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i),
f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f),
c_t = f_t \circ c_{t-1} + i_t \circ \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c),
o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o),
h_t = o_t \circ \tanh(c_t).   (31)

Here, c_t is the memory cell and i_t, f_t, o_t are the input gate, the forget gate, and the output gate, respectively. The memory cell is accessed by the next layer when o_t is turned on. Also, the cell can be updated or cleared when i_t or f_t is activated. By using the memory cell and gates to control the information flow, the gradient is trapped in the cell and will not vanish over time, which is also known as the Constant Error Carousel (CEC) [75]. LSTM belongs to a broader category of RNNs called gated RNNs [26]. We will not introduce other architectures like the Gated Recurrent Unit (GRU) in this survey because LSTM is the most representative example. To simplify the notation, we will also denote the transition rule of LSTM as h_t, c_t = LSTM(h_{t−1}, c_{t−1}, x_t; W).
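For concreteness, here is a direct numpy transcription of one LSTM transition as written in (31) (a didactic sketch; in practice one would use the fused implementations provided by DL frameworks).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_prev, c_prev, x_t, W):
    """One LSTM transition h_t, c_t = LSTM(h_{t-1}, c_{t-1}, x_t; W), cf. (31).

    W is a dict holding the weight matrices and biases named as in (31),
    e.g. W['xi'] has shape (hidden, input) and W['bi'] has shape (hidden,).
    """
    i = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + W['bi'])      # input gate
    f = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + W['bf'])      # forget gate
    c = f * c_prev + i * np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + W['bc'])
    o = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + W['bo'])      # output gate
    h = o * np.tanh(c)
    return h, c

# Toy usage with input size 4 and hidden size 8.
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(8, 4)) for k in ('xi', 'xf', 'xc', 'xo')}
W.update({k: rng.normal(size=(8, 8)) for k in ('hi', 'hf', 'hc', 'ho')})
W.update({k: np.zeros(8) for k in ('bi', 'bf', 'bc', 'bo')})
h, c = lstm_step(np.zeros(8), np.zeros(8), rng.normal(size=4), W)
```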

4.2 Deep Temporal Generative Models

In this section, we will review DTGMs for STSF. Similar to the SSM and the GP model, a DTGM also adopts a generative viewpoint of the sequences. The characteristic of DTGMs is that models belonging to this class can all be stacked to form a deep architecture, which enables them to model complex system dynamics. We will survey four DTGMs in the following. The general structures of these models are shown in Figure 2.

Temporal Restricted Boltzmann Machine: Sutskever & Hinton [76] proposed the Temporal Restricted Boltzmann Machine (TRBM), which is a temporal extension of RBM. In TRBM, the bias of the RBM at the current timestamp is determined by the previous m RBMs. The conditional probability of TRBM is given in the following:

p(v_t, h_t | ṽ_t, h̃_t) = (1/Z) exp(−E(v_t, h_t | ṽ_t, h̃_t)),
E(v_t, h_t | ṽ_t, h̃_t) = −f_{b_v}(ṽ_t)^T v_t − f_{b_h}(ṽ_t, h̃_t)^T h_t − v_t^T W h_t,
f_{b_v}(ṽ_t) = a + ∑_{i=1}^{m} A_i^T v_{t−i},
f_{b_h}(ṽ_t, h̃_t) = b + ∑_{i=1}^{m} B_i^T v_{t−i} + ∑_{i=1}^{m} C_i^T h_{t−i},    (32)

where ṽ_t = v_{t−m:t−1} and h̃_t = h_{t−m:t−1}. The joint distribution is modeled as the multiplication of the conditional probabilities, i.e., p(v_{1:T}, h_{1:T}) = ∏_{t=1}^{T} p(v_t, h_t | ṽ_t, h̃_t).

The inference of TRBM is difficult and involves evaluating the ratio of two partition functions [77]. The original paper adopted a heuristic inference procedure [76]. The results on a synthetic bouncing ball dataset show that a two-layer TRBM outperforms a single-layer TRBM in this task.

Recurrent Temporal Restricted Boltzmann Machine: Later, Sutskever et al. [77] proposed the Recurrent Temporal Restricted Boltzmann Machine (RTRBM) that extends the idea of TRBM by conditioning the parameters of the RBM at timestamp t on the output of an RNN. The model of RTRBM is described in the following:

p(v_t, h_t | r_{t−1}) = (1/Z) exp(−E(v_t, h_t | r_{t−1})),
E(v_t, h_t | r_{t−1}) = −h_t^T W v_t − c^T v_t − b^T h_t − h_t^T U r_{t−1},
r_t = σ(W v_t + U r_{t−1} + b), t > 1;    r_t = σ(W v_t + b_{init}), t = 1.    (33)

Here, r_t encodes the visible states v_{1:t}. The inference is simpler in RTRBM than in TRBM because the model is the same as the standard RBM once r_{t−1} is given. Similar to RBM, RTRBM can also be learned by CD. The author also extended RTRBM to use GRBM for modeling continuous time-series. Qualitative experiments on the synthetic bouncing ball dataset show that RTRBM can generate more realistic sequences than TRBM.

Structured Recurrent Temporal Restricted Boltzmann Machine: Mittelman et al. [78] proposed the Structured Recurrent Temporal Restricted Boltzmann Machine (SRTRBM) that improves upon RTRBM by learning the connection structure based on the spatiotemporal patterns of the input. SRTRBM uses two block-masking matrices to mask the weights between the hidden and visible nodes. The formula of SRTRBM is similar to that of RTRBM except for the usage of two masking matrices M_W and M_U:

E(v_t, h_t | r_{t−1}) = −h_t^T Ŵ v_t − c^T v_t − b^T h_t − h_t^T Û r_{t−1},
r_t = σ(Ŵ v_t + Û r_{t−1} + b), t > 1;    r_t = σ(Ŵ v_t + b_{init}), t = 1,
Ŵ = W ◦ M_W,
Û = U ◦ M_U.    (34)

Here, M_W and M_U are block-masking matrices generated by a graph G = (V, E). To construct the graph, the visible units and the hidden units are divided into |V| nonintersecting subsets. Each node ε in the graph is assigned to a subset of hidden units C_h^ε and a subset of visible units C_v^ε. The ways to separate and assign the units are based on the spatiotemporal coordinates of the data. If there is an edge between two nodes in the graph, the corresponding hidden and visible subsets are connected in SRTRBM.

If the masks are fixed, the learning process of SRTRBM is the same as RTRBM. To learn the mask, the paper proposed to use the sigmoid function σ(ν_ij) as a soft mask and learn the parameters ν_ij. Also, a sparsity regularizer is added to the ν_ijs to encourage sparse connections. The training process first fixes the mask while updating the other parameters and then updates the mask. The author also gives the ssRBM version of SRTRBM and RTRBM by replacing RBM with ssRBM while fixing the part of the energy function related to r_t to be unchanged. Experiments on human motion prediction, which is a TF-MPC task, and climate data prediction, which is an STSF-IG task, show that SRTRBM achieves smaller one-step-ahead prediction error than RTRBM and RNN. This shows that using a masked weight matrix to model the spatial and temporal correlations is useful for STSF problems.

Fig. 2. Structure of the surveyed four types of deep temporal generative models: (a) TRBM, (b) RTRBM, (c) SRTRBM, (d) TSBN (1-layer), and (e) TSBN (2-layers).

Temporal Sigmoid Belief Network: Gan et al. [79] proposed a temporal extension of SBN called the Temporal Sigmoid Belief Network (TSBN). TSBN connects a sequence of SBNs in a way that the bias of the SBN at timestamp t depends on the state of the previous SBNs. For a length-T binary sequence v_{1:T}, TSBN describes the following joint probability, in which we have omitted the biases for simplicity:

p_θ(v_{1:T}, h_{1:T}) = p(h_1) p(v_1 | h_1) ∏_{t=2}^{T} p(h_t | h_{t−1}, v_{t−1}) p(v_t | h_t, v_{t−1}),
p(h_{t,j} = 1 | h_{t−1}, v_{t−1}) = σ(A_j^T h_{t−1} + B_j^T v_{t−1}),
p(v_{t,j} = 1 | h_t, v_{t−1}) = σ(C_j^T h_t + D_j^T v_{t−1}).    (35)

Also, the author extended TSBN to the continuous domain by using a conditional Gaussian for the visible nodes:

p(v_t | h_t, v_{t−1}) = N(μ_t, diag(σ_t^2)),
μ_t = A^T h_{t−1} + B^T v_{t−1} + e,
log(σ_t^2) = A'^T h_{t−1} + B'^T v_{t−1} + e'.    (36)

Multiple single-layer TSBNs can be stacked to form a deep architecture. For a multi-layer TSBN, the hidden states of the lower layers are conditioned on the states of the higher layers. The connection structure of a multi-layer TSBN is given in Figure 2(e). The author investigated both the stochastic and deterministic conditioning for multi-layer models and showed that the deep model with stochastic conditioning works better. Experimental results on the human motion prediction task show that a multi-layer TSBN with stochastic conditioning outperforms variants of SRTRBM and RTRBM.

4.3 FNN and RNN based Models

One shortcoming of DTGMs for STSF is that the learning and inference processes are often complicated and time-consuming. Recently, there is a growing trend of using FNN and RNN based methods for STSF. Compared with DTGM, FNN and RNN based methods are simpler in the learning and inference processes. Moreover, these methods have achieved good performance in many STSF problems like video prediction [31], [80], [81], precipitation nowcasting [5], [82], traffic speed forecasting [17], [32], [33], human trajectory prediction [4], and human motion prediction [14]. In the following, we will review these methods.

4.3.1 Encoder-Forecaster Structure

Recall that the goal of STSF is to predict the length-L sequence in the future given the past observations. The Encoder-Forecaster (EF) structure [5], [7] first encodes the observations into a finite-dimensional vector or a probability distribution using the encoder and then generates the one-step-ahead prediction or multi-step predictions using the forecaster:

s = f(F_t; θ_1),
X_{t+1:t+L} = g(s; θ_2).    (37)

Here, f is the encoder and g is the forecaster. In our definition, we also allow a probabilistic encoder and forecaster, where s and X_{t+1:t+L} turn into random variables:

s ∼ f(F_t; θ_1),
X_{t+1:t+L} ∼ π_g(s; θ_2).    (38)

Using a probabilistic encoder and forecaster is a way to handle uncertainty and we will review the details in Section 4.3.5. All of the surveyed FNN and RNN based methods fit into the EF framework. In the following, we will review how previous works design the encoder and the forecaster for different types of STSF problems, i.e., STSF-RG, STSF-IG, and TF-MPC.
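The EF structure in (37) can be sketched as follows. The encoder f and forecaster g below are deliberately simple placeholder networks of our own (per-step fully-connected updates with random weights), intended only to show how the observations F_t are compressed into a state s and then rolled out for L steps:

```python
import numpy as np

rng = np.random.default_rng(2)
d_obs, d_state, L = 8, 16, 3             # observation size, state size, forecast horizon

# Placeholder parameters of the encoder f and the forecaster g (illustration only).
W_enc = rng.normal(scale=0.1, size=(d_state, d_state + d_obs))
W_dec = rng.normal(scale=0.1, size=(d_state, d_state))
W_out = rng.normal(scale=0.1, size=(d_obs, d_state))

def encoder(past_frames):
    """f in (37): summarize all past observations into a single state vector s."""
    s = np.zeros(d_state)
    for x in past_frames:
        s = np.tanh(W_enc @ np.concatenate([s, x]))
    return s

def forecaster(s, horizon):
    """g in (37): generate the multi-step predictions X_{t+1:t+L} from s."""
    preds = []
    for _ in range(horizon):
        s = np.tanh(W_dec @ s)
        preds.append(W_out @ s)
    return np.stack(preds)

past = rng.normal(size=(5, d_obs))       # F_t: an observed length-5 history
predictions = forecaster(encoder(past), L)
```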

4.3.2 Models for STSF-RG

For STSF-RG problems, the spatiotemporal sequences lie in a regular grid and can naturally be modeled by CNNs, which have proved effective for extracting features from images and videos [62], [72]. Therefore, researchers have used CNNs to build the encoder and the forecaster. In [83], two input frames are encoded independently to two state vectors by a 2D CNN. The future frame is generated by linearly extrapolating these two state vectors and then mapping the extrapolated vector back to the image space with a series of 2D deconvolution and unpooling layers. Also, to better capture the spatial features, the author proposed the phase pooling layer, which keeps both the pooled values and pooled indices in the feature maps. In [6], multi-scale 2D CNNs with ReLU activations were used as the encoder and the forecaster. Rather than encoding the frames independently, the author concatenated the input frames along the channel dimension to generate a single input tensor. The input tensor is rescaled to multiple resolutions and encoded by a multi-scale 2D CNN. To better capture the spatiotemporal correlations, Vondrick et al. [9] proposed to use 3D CNNs as the encoder and forecaster. Also, the author proposed to use a 2D CNN to predict the background image and only use the 3D CNN to generate the foreground. Later, Shi et al. [82] compared the performance of 2D CNN and 3D CNN for precipitation nowcasting and showed that 3D CNN outperformed 2D CNN in the experiments. Kalchbrenner et al. [84] proposed the Video Pixel Network (VPN) that uses multiple causal convolution layers [85] in the forecaster. Unlike the previous 2D CNN and 3D CNN models, of which the forecaster directly generates the multi-step predictions, VPN generates the elements in the spatiotemporal sequence one-by-one with the help of causal convolution. Experiments show that VPN outperforms the baseline which uses a 2D CNN to generate the predictions directly. Recently, Xu et al. [86] proposed the PredCNN network that stacks multiple dilated causal convolution layers to encode the input frames and predict the future. Experiments show that PredCNN achieves state-of-the-art performance in video prediction.

Besides using CNNs, researchers have also used RNNs to build models for STSF-RG. The advantage of using RNN for STSF problems is that it naturally models the temporal correlations within the data. Srivastava et al. [7] proposed to use multi-layer LSTM networks as the encoder and the forecaster. To use the LSTM, the author directly flattened the images into vectors. Oh et al. [24] proposed to use 2D CNNs to encode the image frames before feeding into LSTM layers. However, the state-state connection structure of Fully-Connected LSTM (FC-LSTM) is redundant for STSF and is not suitable for capturing the spatiotemporal correlations in the data [5]. To solve the problem, Shi et al. [5] proposed the Convolutional LSTM (ConvLSTM). ConvLSTM uses convolution instead of full-connection in both the feed-forward and recurrent connections of LSTM. The formula of ConvLSTM is given in the following:

I_t = σ(W_{xi} ∗ X_t + W_{hi} ∗ H_{t−1} + b_i),
F_t = σ(W_{xf} ∗ X_t + W_{hf} ∗ H_{t−1} + b_f),
C_t = F_t ◦ C_{t−1} + I_t ◦ tanh(W_{xc} ∗ X_t + W_{hc} ∗ H_{t−1} + b_c),
O_t = σ(W_{xo} ∗ X_t + W_{ho} ∗ H_{t−1} + b_o),
H_t = O_t ◦ tanh(C_t).    (39)

Here, all the states, cells, and gates are three-dimensional tensors. The author used two ConvLSTM layers as the encoder and the forecaster and showed that 1) ConvLSTM outperforms FC-LSTM when applied to the precipitation nowcasting problem and 2) the two-layer model performs better than the single-layer model. Similar to the way of extending LSTM to ConvLSTM, other types of RNNs, like GRU, can be extended to Convolutional RNNs (ConvRNNs) by using convolution in state-state transitions. Since ConvRNN combines the advantages of CNN and RNN, it is widely adopted in later research [31], [80], [87], [88], [89] as a basic block for building DL models for STSF-RG. There are also attempts to improve the structure of ConvRNN. Shi et al. [82] proposed the Trajectory GRU (TrajGRU) model that improves ConvGRU by actively learning the recurrent connection structure. Because the convolutional recurrence structure adopted in ConvRNNs is location-invariant, it is not suitable for modeling location-variant motions, e.g., rotation and scaling. Thus, TrajGRU uses a learned subnetwork to output the recurrent connection structures:

U_t, V_t = γ(X_t, H_{t−1}),
Z_t = σ(W_{xz} ∗ X_t + ∑_{l=1}^{L} W_{hz}^l ∗ H_{t−1,U_{t,l},V_{t,l}}),
R_t = σ(W_{xr} ∗ X_t + ∑_{l=1}^{L} W_{hr}^l ∗ H_{t−1,U_{t,l},V_{t,l}}),
H'_t = f(W_{xh} ∗ X_t + R_t ◦ (∑_{l=1}^{L} W_{hh}^l ∗ H_{t−1,U_{t,l},V_{t,l}})),
H_t = (1 − Z_t) ◦ H'_t + Z_t ◦ H_{t−1}.    (40)

Here, γ is a convolutional subnetwork that outputs L pairs of flow fields U_t, V_t ∈ R^{L×H×W}. H_{t−1,U_{t,l},V_{t,l}} means to shift the elements in H_{t−1} through the flow field U_{t,l}, V_{t,l} by bilinear warping [90]. The connection structures of ConvRNN and TrajRNN are illustrated in Figure 3. Experiments on both the synthetic dataset and the real-world HKO-7 benchmark show that TrajGRU outperforms 2D CNN, 3D CNN, and ConvGRU.

Fig. 3. Illustration of the connection structures of ConvRNN and TrajRNN. Links with the same color share the same transition weights. Source: [82].

Wang et al. [91] proposed the Predictive RNN (PredRNN), which extends ConvLSTM by adding a new type of spatiotemporal memory. Unlike ConvLSTM, of which the memory states are constrained inside each LSTM layer, PredRNN maintains another global memory state to store the spatiotemporal information in the lower layers and in the previous timestamps. In PredRNN, memory states are allowed to zigzag across the RNN layers. The structure of PredRNN is illustrated in Figure 4(b). Here, the M_t^l s are the global spatiotemporal memory states and are updated in a similar way as the cell states in ConvLSTM except that they are updated in a zigzag order. Based on PredRNN, Wang et al. [81] proposed the PredRNN++ structure which added more nonlinearities when updating the M_t^l s. Experiments show that PredRNN++ outperforms both PredRNN and other baseline models including TrajGRU and variants of ConvLSTM.
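Before moving on to other designs, a minimal single-channel sketch of the ConvLSTM transition in (39) may help make the recurrence concrete; practical implementations use multi-channel convolutions and learned kernels, whereas the 3 × 3 kernels below are random placeholders of our own:

```python
import numpy as np
from scipy.signal import convolve2d

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv_same(x, k):
    """2D convolution with zero padding; output has the same spatial size as x."""
    return convolve2d(x, k, mode="same", boundary="fill")

def convlstm_step(H_prev, C_prev, X_t, kernels, biases):
    """One ConvLSTM transition following (39), with one input and one hidden channel."""
    pre = {}
    for g in ("i", "f", "c", "o"):
        k_x, k_h = kernels[g]
        pre[g] = conv_same(X_t, k_x) + conv_same(H_prev, k_h) + biases[g]
    I_t = sigmoid(pre["i"])
    F_t = sigmoid(pre["f"])
    C_t = F_t * C_prev + I_t * np.tanh(pre["c"])
    O_t = sigmoid(pre["o"])
    H_t = O_t * np.tanh(C_t)
    return H_t, C_t

# Toy usage on an 8x8 grid with random 3x3 kernels (illustration only).
rng = np.random.default_rng(3)
kernels = {g: (rng.normal(scale=0.1, size=(3, 3)),
               rng.normal(scale=0.1, size=(3, 3))) for g in ("i", "f", "c", "o")}
biases = {g: 0.0 for g in ("i", "f", "c", "o")}
H = np.zeros((8, 8))
C = np.zeros((8, 8))
H, C = convlstm_step(H, C, rng.normal(size=(8, 8)), kernels, biases)
```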

Apart from studying the basic building blocks like CNNs and RNNs, another line of research tries to disentangle the motion and content of the spatiotemporal sequence to better forecast the future. Jia et al. [80] and Finn et al. [31] proposed to predict how the frames will transform instead of directly predicting the future frames. Finn et al. [31] proposed two transformation methods named Convolutional Dynamic Neural Advection (CDNA) and Spatial Transformer Predictors (STP). To predict the frame at timestamp t+1, CDNA generates N 2D convolutional filters {W_t^1, W_t^2, ..., W_t^N} and a masking tensor M ∈ R^{N×H×W}, where ∑_{c=1}^{N} M_{c,i,j} = 1. The previous frame X_t is transformed N times by these filters and these N transformed frames are linearly combined by the masking tensor to generate the final prediction. STP is similar to CDNA except that the W_t^n s become the parameters of different affine transformations [90]. Similar to CDNA and STP, Jia et al. [80] proposed the Dynamic Filter Network (DFN) that directly outputs the parameters of a K × K local filter for every location. The prediction is generated by applying these filters to the previous frame. Actually, DFN can be viewed as a special case of CDNA. If we manually fix the 2D convolutional filters in CDNA to have K^2 elements, of which only a single element is non-zero, CDNA turns into DFN. Villegas et al. [87] proposed the Motion-Content Network (MCnet), which uses two separate networks to encode the motion and the content. In MCnet, the input of the motion encoder is the sequence of difference images {X_2 − X_1, ..., X_t − X_{t−1}} and the input of the content encoder is the last observed image X_t. The encoded motion features and content features are fused together to generate the prediction. Experiments show that disentangling the motion and the content helps improve the performance. This architecture has been used in other papers [89], [92] to enhance the prediction performance.

Fig. 4. Comparison of multi-layer ConvLSTMs and the PredRNN: (a) structure of a stack of three ConvLSTM layers; (b) structure of PredRNN with three layers. M_t is the global spatiotemporal memory that is updated along the red links.

4.3.3 Models for STSF-IG

Unlike STSF-RG problems, the measurements in STSF-IG problems are observed in some sparsely distributed stations, which are more challenging to deal with. A simple strategy is to model the measurements observed at each station independently, which has been adopted in Yu et al. [17]. However, this strategy does not consider the correlation among different stations. A recent trend in solving this problem is to convert these sparsely distributed stations to a graph and utilize graph convolution to build the encoder and the forecaster. We will mainly introduce these graph convolution-based methods in this section.

Li et al. [32] proposed the Diffusion Convolutional Recurrent Neural Network (DCRNN) for traffic speed forecasting. To convert the traffic stations into a graph, the author generated the adjacency matrix based on the thresholded pairwise road network distances. After that, DCRNN is used to encode the sequence of graphical data and predict the future. The formula of DCRNN is similar to ConvRNN except that the Diffusion Convolution (DC) is used in the input-state and state-state transitions. DC is a type of graph convolution that considers the direction information in the graph, which has the following formula:

Y = ∑_{k=0}^{K−1} (θ_{k,1} (D_O^{−1} A)^k + θ_{k,2} (D_I^{−1} A^T)^k) X,    (41)

where X ∈ R^{N×C} is the input features, A is the adjacency matrix, D_O is the diagonal out-degree matrix, D_I is the diagonal in-degree matrix, θ_{k,1}, θ_{k,2} are the parameters, K is the number of diffusion steps, N is the number of nodes in the graph, and C is the number of features. Unlike other graph convolution operators like ChebNet [69], which operates on undirected graphs, DC operates on a directed graph and is more suitable for real-life scenarios. Experiments on two real-world traffic datasets show that DCRNN outperforms FC-LSTM and the ChebNet-based Graph Convolutional Recurrent Neural Network (GCRNN) [32]. Later, Zhang et al. [33] proposed the Graph GRU (GGRU) network that generalizes the idea of DCRNN. Instead of proposing a single model, the author proposed a unified method for constructing an RNN based on an arbitrary graph convolution operator. The formula of GGRU is given in the following:

U_t = σ(Γ_{Θ_{xu}}(X_t, X_t; G) + Γ_{Θ_{hu}}(X_t ⊕ H_{t−1}, H_{t−1}; G)),
R_t = σ(Γ_{Θ_{xr}}(X_t, X_t; G) + Γ_{Θ_{hr}}(X_t ⊕ H_{t−1}, H_{t−1}; G)),
H'_t = h(Γ_{Θ_{xh}}(X_t, X_t; G) + R_t ◦ Γ_{Θ_{hh}}(X_t ⊕ H_{t−1}, H_{t−1}; G)),
H_t = (1 − U_t) ◦ H'_t + U_t ◦ H_{t−1}.    (42)

Here, U_t, R_t are the update gate and reset gate that control the memory flow, H_t is the hidden state, X_t is the input, h(·) is the activation function, and Γ_Θ(·, ·; G) is an arbitrary graph convolution operator. The author also proposed a new multi-head gated graph attention aggregator that uses neural attention to aggregate the neighboring features. Experiments show that GGRU outperforms DCRNN if the gated attention aggregator is used. Yu et al. [93] proposed the Spatio-Temporal Graph Convolutional Network (STGCN) that applies graph convolution in a different manner. The author directly stacked multiple Spatio-Temporal Convolution (ST-Conv) blocks to build the network. The ST-Conv block is a concatenation of two temporal convolution layers and one graph convolution layer. For the graph convolution operator, the paper tested ChebNet and its first-order approximation [68] and found that ChebNet performed better. The major advantage of STGCN over RNN-based methods is that it is faster in the training phase.
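To illustrate the diffusion convolution in (41), the following NumPy sketch applies K diffusion steps over a directed graph; for readability we use scalar weights θ_{k,1}, θ_{k,2}, whereas DCRNN learns such weights for every input-output channel pair:

```python
import numpy as np

def diffusion_conv(X, A, theta_out, theta_in):
    """Diffusion convolution of (41) on a directed graph.
    X: (N, C) node features, A: (N, N) weighted adjacency matrix,
    theta_out/theta_in: length-K lists of scalar weights (one pair per diffusion step)."""
    K = len(theta_out)
    d_out = A.sum(axis=1)                                  # out-degrees
    d_in = A.sum(axis=0)                                   # in-degrees
    P_out = A / np.maximum(d_out, 1e-8)[:, None]           # D_O^{-1} A
    P_in = A.T / np.maximum(d_in, 1e-8)[:, None]           # D_I^{-1} A^T
    Y = np.zeros_like(X)
    S_out = np.eye(A.shape[0])                             # (D_O^{-1} A)^k, starting from k = 0
    S_in = np.eye(A.shape[0])
    for k in range(K):
        Y += theta_out[k] * (S_out @ X) + theta_in[k] * (S_in @ X)
        S_out = S_out @ P_out
        S_in = S_in @ P_in
    return Y

# Toy usage on a 4-node directed graph with 2 diffusion steps.
rng = np.random.default_rng(4)
A = (rng.random((4, 4)) < 0.5).astype(float)
np.fill_diagonal(A, 0.0)
X = rng.normal(size=(4, 3))
Y = diffusion_conv(X, A, theta_out=[0.5, 0.3], theta_in=[0.5, 0.3])
```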

4.3.4 Models for TF-MPC

The goal of TF-MPC is to predict the future trajectories of all objects in the moving point cloud. Fragkiadaki et al. [13] proposed a basic RNN based method for this problem. In the model, the coordinate sequence of each human joint is encoded by the same LSTM. The drawback of this basic method is that it does not model the correlation among the individuals. To solve the problem, Alahi et al. proposed the SocialLSTM model [4] that uses a set of interacting LSTMs to model the trajectory of the crowd. In SocialLSTM, the position of each person is modeled by an LSTM and a social pooling layer interconnects these LSTMs. Social pooling aggregates the hidden states of the nearby LSTMs at the previous timestamp and uses this aggregated state to help decide the current state. The transition rule of SocialLSTM is given in the following:

N_t^i = {j | j ∈ N_i, |x_t^j − x_t^i| ≤ m, |y_t^j − y_t^i| ≤ n},
H_t^i = ∑_{j ∈ N_t^i} h_{t−1}^j,
e_t^i = φ_1(x_t^i, y_t^i; W_e),
a_t^i = φ_2(H_t^i; W_l),
h_t^i, c_t^i = LSTM(h_{t−1}^i, c_{t−1}^i, vec(e_t^i, a_t^i); W_l).    (43)

Here, H_t^i is the aggregated state vector in the social pooling step, (x_t^j, y_t^j) is the coordinate of the jth person at timestamp t, φ_1, φ_2 are mapping functions, and m, n are distance thresholds. Experiments show that adding the social pooling layer is helpful for the TF-MPC task. In [94], the author proposed an improved version of social pooling called View Frustum Of Attention (VFOA) social pooling. This method defines the neighborhood set in the social pooling layer based on the calculated VFOA interest region. Jain et al. [14] proposed the Structural-RNN (S-RNN) network to solve the human motion prediction problem. S-RNN imposes structures in the recurrent connections based on a given spatiotemporal graph. For human motion prediction, this spatiotemporal graph is constructed by the relationship between different human joints. S-RNN maintains two types of RNNs, nodeRNN and edgeRNN. The outputs of edgeRNNs are fed to the corresponding nodeRNNs. Experiments show that S-RNN achieves the state-of-the-art result in the human motion prediction problem.
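A stripped-down illustration of the neighborhood aggregation H_t^i in (43) is given below; it simply sums the previous hidden states of the pedestrians that fall inside an m × n box around person i, whereas the full SocialLSTM additionally preserves the spatial layout of the neighbors on a grid, which we omit here:

```python
import numpy as np

def social_pool(positions, hidden_prev, i, m, n):
    """Sum the previous hidden states h_{t-1}^j of persons j within an m x n
    neighborhood of person i, as in the first two lines of (43)."""
    x_i, y_i = positions[i]
    H_i = np.zeros_like(hidden_prev[0])
    for j, (x_j, y_j) in enumerate(positions):
        if j == i:
            continue
        if abs(x_j - x_i) <= m and abs(y_j - y_i) <= n:
            H_i += hidden_prev[j]
    return H_i

# Toy usage: 4 pedestrians with 2D positions and 8-dimensional hidden states.
rng = np.random.default_rng(5)
positions = rng.uniform(0, 10, size=(4, 2))
hidden_prev = rng.normal(size=(4, 8))
H_0 = social_pool(positions, hidden_prev, i=0, m=2.0, n=2.0)
```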

4.3.5 Methods for Handling Uncertainty

Most real-world dynamical systems, e.g., the atmosphere, are inherently stochastic and unpredictable. Simply assuming the outcome is deterministic, which is common in FNN and RNN based models, will generate blurry predictions. Thus, recent studies [9], [83], [95] have proposed ways for handling uncertainty.

In [13], the author proposed to use a probabilistic forecaster that outputs the parameters of a Gaussian Mixture Model (GMM). This enables the network to generate stochastic predictions. The author found that the smallest forecasting error was always produced by the most probable sample, which was similar to the output of a model trained by the Euclidean loss. Apart from using a probabilistic forecaster, the author also proposed a curriculum learning strategy which gradually adds noise to the input during training. This can be viewed as regularizing the model not to be overconfident about its predictions. This technique has also been adopted in later works [14]. Goroshin et al. [83] proposed to add a latent variable δ in the network to represent the uncertainty of the prediction. The latent variable δ_i is not determined by the input X_i and can be adjusted in the learning process to minimize the loss. Before updating the parameters of the network, the algorithm first updates δ_i for k steps where k is a hyperparameter. At test time, the δ_i s are sampled based on their distribution on the training set.

Another approach to handle uncertainty is to utilize the recently popularized conditional Generative Adversarial Network (GAN) [96], [97]. Mathieu et al. [6] proposed to use conditional GAN as the loss function to train the multi-scale 2D CNN. The GAN loss used in the paper is given as follows:

min_{θ_G} max_{θ_D} E_{Y∼p_{data}(Y|X)}[log D_{θ_D}(X, Y)] + E_{z∼p_z}[log(1 − D_{θ_D}(X, G_{θ_G}(X, z)))].    (44)

Here, D_{θ_D} is the discriminator, G_{θ_G} is the forecasting model, X is the observed sequence, Y is the ground-truth sequence, and p_z is the noise distribution. The discriminator is trained to differentiate between the ground-truth output and the prediction generated by the forecaster. On the other hand, the forecasting model is trained to fool the discriminator. The final loss function in the paper combined the GAN loss, the L2 loss, and the image gradient difference loss. Experiments show that the model trained with GAN generates sharper predictions. Vondrick et al. [9] used a similar GAN loss to train the 3D CNN. Jang et al. [89] proposed to use two discriminators to respectively discriminate the motion and appearance parts of the generated samples.
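To make the roles of the two players in (44) concrete, the following sketch evaluates the discriminator and forecaster objectives for one example, given black-box D and G functions; it is a schematic of the adversarial loss only (with an optional L2 term), not the full multi-scale training procedure of [6], and the stand-in networks are random placeholders of our own:

```python
import numpy as np

def gan_losses(D, G, X, Y, z, lam=1.0, eps=1e-8):
    """Per-example losses derived from (44).
    D(X, Y) returns the probability that Y is the real future of X;
    G(X, z) returns a predicted future sequence."""
    Y_fake = G(X, z)
    # Discriminator: maximize log D(X, Y) + log(1 - D(X, G(X, z))), i.e., minimize its negation.
    loss_D = -(np.log(D(X, Y) + eps) + np.log(1.0 - D(X, Y_fake) + eps))
    # Forecaster (generator): minimize log(1 - D(X, G(X, z))) as in (44),
    # optionally combined with an L2 term; the non-saturating -log D variant is common in practice.
    loss_G = np.log(1.0 - D(X, Y_fake) + eps) + lam * np.mean((Y_fake - Y) ** 2)
    return loss_D, loss_G

# Toy usage with stand-in "networks" (illustration only).
rng = np.random.default_rng(6)
D = lambda X, Y: 1.0 / (1.0 + np.exp(-(X.sum() + Y.sum()) * 0.01))
G = lambda X, z: X + 0.1 * z
X, Y, z = rng.normal(size=(5, 4)), rng.normal(size=(5, 4)), rng.normal(size=(5, 4))
loss_D, loss_G = gan_losses(D, G, X, Y, z)
```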

Besides GAN, the Variational Auto-encoder (VAE) [98] has also been adopted for dealing with the uncertainty in STSF. Babaeizadeh et al. [95] proposed Stochastic Variational Video Prediction (SV2P), which uses the following objective function:

l = −E_{q_φ(Z|X_{1:T})}[log p_θ(X_{t:T} | X_{1:t}, Z)] + D_{KL}(q_φ(Z | X_{1:T}) || p(Z)),    (45)

where q_φ is the variational inference network that approximates the posterior of the latent variable Z and p_θ is an RNN generator. The experiments show that adding VAE greatly alleviates the blurriness problem of the predictions. Later, Denton & Fergus [99] proposed the Stochastic Video Generator with Learned Prior (SVG-LP), which uses a learned time-varying prior p_ψ(Z_t | X_{1:t−1}) to replace the fixed prior p(Z) in (45). The extension in their paper is similar to the Variational RNN (VRNN) [100]. Learning a time-variant prior can be interpreted as learning a predictive model of uncertainty. The author showed in the experiments that using a learned prior improves the forecasting performance.
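A schematic version of the objective in (45) is sketched below; the negative log-likelihood term is replaced by a mean squared error, which corresponds to assuming a fixed-variance Gaussian observation model, and the variational posterior is a diagonal Gaussian so that the KL term has a closed form:

```python
import numpy as np

def kl_diag_gaussian_std_normal(mu, logvar):
    """D_KL( N(mu, diag(exp(logvar))) || N(0, I) ) in closed form."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def sv2p_style_loss(pred, target, mu_q, logvar_q, beta=1.0):
    """Schematic version of (45): reconstruction term plus KL regularizer.
    The reconstruction term here is a mean squared error, a stand-in for the
    negative log-likelihood under a fixed-variance Gaussian."""
    recon = np.mean((pred - target) ** 2)
    kl = kl_diag_gaussian_std_normal(mu_q, logvar_q)
    return recon + beta * kl

# Toy usage with made-up shapes (4 future frames of size 8x8, 16-dim latent).
rng = np.random.default_rng(7)
pred, target = rng.normal(size=(4, 8, 8)), rng.normal(size=(4, 8, 8))
mu_q, logvar_q = rng.normal(size=16), rng.normal(size=16) * 0.1
loss = sv2p_style_loss(pred, target, mu_q, logvar_q)
```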

4.4 Remarks

In this section, we reviewed two types of DL based methods for STSF: DTGMs and FNN and RNN based methods. The majority of the reviewed DTGMs treat STSF as a general multivariate sequence forecasting problem and SRTRBM is the only model that gives special treatment to the spatiotemporal sequences. Also, learning a DTGM requires approximation and sampling techniques and is more complicated compared with the FNN and RNN based methods. However, DTGM can well capture the uncertainty in the data due to its stochastic generative process. On the other hand, the FNN and RNN based models are easy to train and fast for prediction. We reviewed the representative methods for STSF-RG, STSF-IG, and TF-MPC. Although these three types of STSF problems involve different tasks, the proposed methods for these problems have strong relationships. For example, ConvLSTM, DCRNN, and SocialLSTM, which are respectively proposed for STSF-RG, STSF-IG, and TF-MPC, improve upon RNN by introducing structured recurrent connections. To be more specific, ConvLSTM uses convolution in recurrent transitions, DCRNN uses the graph convolution, and SocialLSTM uses the social pooling layer. Therefore, the success of these models greatly relies on the appropriate design of the basic network architecture. Also, FNN and RNN based methods are usually deterministic and are not well-designed for capturing the uncertainty in the data. We thus reviewed some recently popularized techniques for handling uncertainty including GAN and VAE, which have improved the forecasting performance.

5 SUMMARY AND FUTURE WORKS

In this survey, we reviewed the machine learning based methods for STSF. We first examined the general strategies for multi-step sequence forecasting, which contain the IMS, DMS, boosting strategy, and scheduled sampling. We showed that IMS and DMS have their advantages and disadvantages. The boosting strategy and scheduled sampling can be viewed as the mid-grounds between these two approaches. Next, we reviewed the specific methods for STSF. We divided these methods into two major categories: classical methods and DL based methods. As mentioned in Sections 3.4 and 4.4, the existing classical methods and deep learning methods have their own trade-offs when solving the STSF problem. The overall summarization of the reviewed methods is given in Table 3.

TABLE 3
Summary of the reviewed methods for STSF. We put “√” under STSF-RG, STSF-IG, and TF-MPC to show the method has been extended to solve the problem. We put “√” under “Unc.” to show some of the methods have handled uncertainty.

Category  | Subcategory   | Methods                            | STSF-RG / STSF-IG / TF-MPC / Unc.
Classical | Feature-based | Spatiotemporal indicator [34]      |
Classical | Feature-based | Multiple predictors [3]            | √
Classical | SSM           | Group-based [40], [41]             | √ √
Classical | SSM           | STARIMA [42], [43], [44]           | √ √ √
Classical | SSM           | Low-rankness [46], [48], [49]      | √ √ √
Classical | GP            | GP [16], [52], [53], [54]          | √ √ √ √
DL        | DTGM          | TRBM [76]                          | √ √ √
DL        | DTGM          | RTRBM [77]                         | √ √ √
DL        | DTGM          | SRTRBM [78]                        | √ √ √ √
DL        | DTGM          | TSBN [79]                          | √ √ √
DL        | FNN & RNN     | FC-LSTM                            | √ √ √ √
DL        | FNN & RNN     | 2D CNN [6], [24] & 3D CNN [9]      | √ √
DL        | FNN & RNN     | ConvLSTM [5] & TrajGRU [82]        | √
DL        | FNN & RNN     | PredRNN [81], [91]                 | √
DL        | FNN & RNN     | VPN [84] & PredCNN [86]            | √
DL        | FNN & RNN     | Learn transformation [31], [80]    | √
DL        | FNN & RNN     | Motion + Content [87], [89], [92]  | √ √
DL        | FNN & RNN     | DCRNN [32] & GGRU [33]             | √
DL        | FNN & RNN     | STGCN [93]                         | √
DL        | FNN & RNN     | Social pooling [4], [94]           | √ √
DL        | FNN & RNN     | S-RNN [14]                         | √ √
DL        | FNN & RNN     | δ-gradient [83]                    | √ √
DL        | FNN & RNN     | SV2P [95] & SVG-LP [99]            | √ √

There are many future research directions for machine learning for STSF. 1) We can enhance the strategies for multi-step sequence forecasting. One direction is to design new training curricula to shift from IMS to DMS. For example, we can extend the forecasting horizon exponentially from 1 to N to better model longer sequences, which is common in TF-MPC problems. Also, we can generalize the boosting method described in Section 2.4 to the general STSF problems. In addition, we can reduce the complexity of DMS from O(h) to O(log h) by training O(log h) models that generate the predictions of horizons 2^0, 2^1, ..., 2^{log h − 1}. We can use these models to make predictions for any horizon up to h in the spirit of binary coding (a toy illustration is given below). Moreover, the objective function mentioned in Section 2.5 will not be differentiable due to the sampling process and the original paper has just ignored this problem. This problem can be solved by approximating the gradient through techniques in reinforcement learning [101]. 2) We can propose new FNN and RNN based methods for STSF-IG problems. Compared with STSF-RG, DL for STSF-IG is less mature and we can bring the ideas in STSF-RG, like decomposing motion and content, adding a global memory structure, and using GAN/VAE to handle uncertainty, to STSF-IG. 3) We can combine DTGMs with FNN and RNN based methods to get the best of both worlds. The generative flavor of DTGMs has its role to play in FNN and RNN based models, particularly for handling uncertainty. One feasible approach is to use the idea of Bayesian Deep Learning (BDL) [102], in which FNN and RNN based models are used for perception and DTGMs are used for inference. 4) We can study the type of TF-MPC problem where both the measurements and coordinates are changing. We can consider to solve the “video object prediction” problem where we have the bounding boxes of the objects at the previous frames and the task is to predict both the objects’ future appearances and bounding boxes. 5) We can study new learning scenarios like online learning and transfer learning for STSF. For the STSF problem, the online learning scenario is essential because spatiotemporal data, like regional rainfall, usually arrive in an online fashion [82]. The transfer learning scenario is also essential because STSF can act as a pretraining method. The model trained by STSF can be transferred to solve other tasks like classification and control [8], [31]. Designing a better transfer learning method for STSF will improve the performance of these tasks.
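As a toy illustration of the binary-coding idea above, suppose separate direct forecasters have been trained for horizons 1, 2, 4, and so on; a prediction at an arbitrary horizon h can then be composed according to the binary expansion of h. The per-horizon “models” below are trivial stand-ins that merely shift a scalar state, purely to show the composition logic; in practice the composition order and the accumulation of errors would need further study:

```python
def compose_horizon(x, h, models):
    """Compose direct forecasters of horizons 1, 2, 4, ... (models[k] jumps 2**k steps)
    to produce a prediction h steps ahead, following the binary expansion of h."""
    k = 0
    while h > 0:
        if h & 1:                 # bit k of h is set: apply the 2**k-step model
            x = models[k](x)
        h >>= 1
        k += 1
    return x

# Stand-in "models": the 2**k-step forecaster simply adds 2**k (illustration only).
models = [lambda x, step=2 ** k: x + step for k in range(4)]
print(compose_horizon(0.0, 13, models))   # 13 = 8 + 4 + 1 -> prints 13.0
```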

REFERENCES

[1] B. R. Hunt, E. J. Kostelich, and I. Szunyogh, “Efficient data assimilation for spatiotemporal chaos: A local ensemble transform Kalman filter,” Physica D: Nonlinear Phenomena, vol. 230, no. 1, pp. 112–126, 2007.

[2] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra, “Video (language) modeling: A baseline for generative models of natural videos,” CoRR, vol. abs/1412.6604, 2014.

[3] Y. Zheng, X. Yi, M. Li, R. Li, Z. Shan, E. Chang, and T. Li, “Forecasting fine-grained air quality based on big data,” in KDD, 2015.

[4] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social LSTM: Human trajectory prediction in crowded spaces,” in CVPR, 2016.

[5] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-k. Wong, and W.-c. Woo, “Convolutional LSTM network: A machine learning approach for precipitation nowcasting,” in NIPS, 2015.

[6] M. Mathieu, C. Couprie, and Y. LeCun, “Deep multi-scale video prediction beyond mean square error,” in ICLR, 2016.

[7] N. Srivastava, E. Mansimov, and R. Salakhutdinov, “Unsupervised learning of video representations using LSTMs,” in ICML, 2015.

[8] C. Finn and S. Levine, “Deep visual foresight for planning robot motion,” in ICRA, 2017.

[9] C. Vondrick, H. Pirsiavash, and A. Torralba, “Generating videos with scene dynamics,” in NIPS, 2016.

[10] J. Walker, A. Gupta, and M. Hebert, “Dense optical flow prediction from a static image,” in ICCV, 2015.

[11] K. Yamaguchi, A. C. Berg, L. E. Ortiz, and T. L. Berg, “Who are you with and where are you going?” in CVPR, 2011.

[12] G. W. Taylor, G. E. Hinton, and S. T. Roweis, “Modeling human motion using binary latent variables,” in NIPS, 2006.

[13] K. Fragkiadaki, S. Levine, P. Felsen, and J. Malik, “Recurrent network models for human dynamics,” in ICCV, 2015.

[14] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena, “Structural-RNN: Deep learning on spatio-temporal graphs,” in CVPR, 2016.

[15] A. Zonoozi, J. Kim, X. Li, and G. Cong, “Periodic-CRN: A convolutional recurrent model for crowd density prediction with recurring periodic patterns,” in IJCAI, 2018.

[16] R. Senanayake, S. O'Callaghan, and F. Ramos, “Predicting spatio-temporal propagation of seasonal influenza using variational gaussian process regression,” in AAAI, 2016.

[17] R. Yu, Y. Li, C. Shahabi, U. Demiryurek, and Y. Liu, “Deep learning: A generic approach for extreme condition traffic forecasting,” in SDM, 2017.

[18] D. R. Cox, “Prediction by exponentially weighted moving averages and related methods,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 414–422, 1961.

[19] G. Chevillon, “Direct multi-step estimation and forecasting,” Journal of Economic Surveys, vol. 21, no. 4, pp. 746–785, 2007.

[20] S. B. Taieb and R. Hyndman, “Boosting multi-step autoregressive forecasts,” in ICML, 2014.

[21] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled sampling for sequence prediction with recurrent neural networks,” in NIPS, 2015.

[22] P. Del Moral, A. Doucet, and A. Jasra, “Sequential monte carlo samplers,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 68, no. 3, pp. 411–436, 2006.

[23] J.-L. Lin and C. W. Granger, “Forecasting from non-linear models in practice,” Journal of Forecasting, vol. 13, no. 1, pp. 1–9, 1994.

[24] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh, “Action-conditional video prediction using deep networks in Atari games,” in NIPS, 2015.

[25] A. M. Lamb, A. G. Alias Parth Goyal, Y. Zhang, S. Zhang, A. C. Courville, and Y. Bengio, “Professor forcing: A new algorithm for training recurrent networks,” in NIPS, 2016.

[26] I. Goodfellow, Y. Bengio, and A. Courville, “Deep learning,” 2016, book in preparation for MIT Press. [Online]. Available: http://www.deeplearningbook.org

[27] M. Wytock and J. Z. Kolter, “Sparse gaussian conditional random fields: Algorithms, theory, and application to energy forecasting,” in ICML, 2013.

[28] J. Friedman, T. Hastie, and R. Tibshirani, The elements of statistical learning. Springer series in statistics, Springer, Berlin, 2001, vol. 1.

[29] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in ICML, 2009.

[30] F. Huszar, “How (not) to train your generative model: Scheduled sampling, likelihood, adversary?” CoRR, vol. abs/1511.05101, 2015.

[31] C. Finn, I. Goodfellow, and S. Levine, “Unsupervised learning for physical interaction through video prediction,” in NIPS, 2016.

[32] Y. Li, R. Yu, C. Shahabi, and Y. Liu, “Diffusion convolutional recurrent neural network: Data-driven traffic forecasting,” in ICLR, 2018.

[33] J. Zhang, X. Shi, J. Xie, H. Ma, I. King, and D.-Y. Yeung, “GaAN: Gated attention networks for learning on large and spatiotemporal graphs,” in UAI, 2018.

[34] O. Ohashi and L. Torgo, “Wind speed forecasting using spatio-temporal indicators,” in ECAI, 2012.

[35] K. P. Murphy, Machine learning: A probabilistic perspective. MIT Press, 2012.

[36] J. G. De Gooijer and R. J. Hyndman, “25 years of time series forecasting,” International Journal of Forecasting, vol. 22, no. 3, pp. 443–473, 2006.

[37] M. Fraccaro, S. K. Sønderby, U. Paquet, and O. Winther, “Sequential neural models with stochastic layers,” in NIPS, 2016.

[38] K. H. Brodersen, F. Gallusser, J. Koehler, N. Remy, S. L. Scott et al., “Inferring causal impact using bayesian structural time-series models,” The Annals of Applied Statistics, vol. 9, no. 1, pp. 247–274, 2015.

[39] R. E. Kalman, “A new approach to linear filtering and prediction problems,” Journal of Basic Engineering, vol. 82, no. 1, pp. 35–45, 1960.

[40] A. Asahara, K. Maruyama, A. Sato, and K. Seto, “Pedestrian-movement prediction based on mixed markov-chain model,” in SIGSPATIAL, 2011.

[41] W. Mathew, R. Raposo, and B. Martins, “Predicting future locations with hidden markov models,” in UbiComp. ACM, 2012.

[42] A. Cliff and J. Ord, “Model building and the analysis of spatial pattern in human geography,” Journal of the Royal Statistical Society. Series B (Methodological), vol. 37, no. 3, pp. 297–348, 1975.

[43] A. Cliff and J. K. Ord, “Space-time modelling with an application to regional forecasting,” Transactions of the Institute of British Geographers, pp. 119–128, 1975.

[44] P. E. Pfeifer and S. J. Deutsch, “A STARIMA model-building procedure with application to description and regional forecasting,” Transactions of the Institute of British Geographers, pp. 330–349, 1980.

[45] G. E. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung, Time series analysis: Forecasting and control. John Wiley & Sons, 2015.

[46] M. T. Bahadori, Q. R. Yu, and Y. Liu, “Fast multivariate spatio-temporal analysis via low rank tensor learning,” in NIPS, 2014.

[47] B. Recht, M. Fazel, and P. A. Parrilo, “Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization,” SIAM Review, vol. 52, no. 3, pp. 471–501, 2010.

[48] R. Yu, D. Cheng, and Y. Liu, “Accelerated online low-rank tensor learning for multivariate spatio-temporal streams,” in ICML, 2015.

[49] R. Yu and Y. Liu, “Learning from multiway data: Simple and efficient tensor regression,” in ICML, 2016.

[50] A. G. Journel and C. J. Huijbregts, Mining geostatistics. Academic Press, 1978.

[51] C. E. Rasmussen, “Gaussian processes for machine learning,” 2006.

[52] P. Trautman and A. Krause, “Unfreezing the robot: Navigation in dense, interacting crowds,” in IROS, 2010.

[53] S. Flaxman, “Machine learning in space and time,” PhD dissertation, Carnegie Mellon University, 2015.

[54] S. Flaxman, A. G. Wilson, D. B. Neill, H. Nickisch, and A. J. Smola, “Fast Kronecker inference in Gaussian processes with non-Gaussian likelihoods,” in ICML, 2015.

[55] P. Smolensky, “Information processing in dynamical systems: Foundations of harmony theory,” DTIC Document, Tech. Rep., 1986.

[56] G. E. Hinton, “Training products of experts by minimizing contrastive divergence,” Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.

[57] M. Welling, M. Rosen-Zvi, and G. E. Hinton, “Exponential family harmoniums with an application to information retrieval,” in NIPS, 2004.

[58] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in AISTATS, 2011.

[59] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in ICML, 2013.

[60] A. Mnih and K. Gregor, “Neural variational inference and learning in belief networks,” in ICML, 2014.

[61] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal, “The ”wake-sleep” algorithm for unsupervised neural networks,” Science, vol. 268, no. 5214, p. 1158, 1995.

[62] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D convolutional networks,” in ICCV, 2015.

[63] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” in ICLR, 2016.

[64] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, “Deconvolutional networks,” in CVPR, 2010.

[65] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in ECCV, 2014.

[66] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and locally connected networks on graphs,” in ICLR, 2014.

[67] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, “Convolutional networks on graphs for learning molecular fingerprints,” in NIPS, 2015.

[68] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in ICLR, 2017.

[69] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in NIPS, 2016.

[70] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in NIPS, 2017.

[71] F. Rosenblatt, Perceptrons and the theory of brain mechanisms. Spartan Books, 1962.

[72] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in NIPS, 2012.

[73] J. Nocedal and S. Wright, Numerical optimization. Springer Science & Business Media, 2006.

[74] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in ICML, 2013.

[75] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[76] I. Sutskever and G. E. Hinton, “Learning multilevel distributed representations for high-dimensional sequences,” in AISTATS, 2007.

[77] I. Sutskever, G. E. Hinton, and G. W. Taylor, “The recurrent temporal restricted boltzmann machine,” in NIPS, 2009.

[78] R. Mittelman, B. Kuipers, S. Savarese, and H. Lee, “Structured recurrent temporal restricted boltzmann machines,” in ICML, 2014.

[79] Z. Gan, C. Li, R. Henao, D. E. Carlson, and L. Carin, “Deep temporal sigmoid belief networks for sequence modeling,” in NIPS, 2015.

[80] X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool, “Dynamic filter networks,” in NIPS, 2016.

[81] Y. Wang, Z. Gao, M. Long, J. Wang, and P. S. Yu, “PredRNN++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning,” in ICML, 2018.

[82] X. Shi, Z. Gao, L. Lausen, H. Wang, D.-Y. Yeung, W.-k. Wong, and W.-c. Woo, “Deep learning for precipitation nowcasting: A benchmark and a new model,” in NIPS, 2017.

[83] R. Goroshin, M. F. Mathieu, and Y. LeCun, “Learning to linearize under uncertainty,” in NIPS, 2015.

[84] N. Kalchbrenner, A. v. d. Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu, “Video pixel networks,” in ICML, 2016.

[85] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” CoRR, vol. abs/1609.03499, 2016.

[86] Z. Xu, Y. Wang, M. Long, and J. Wang, “PredCNN: Predictive learning with cascade convolutions,” in IJCAI, 2018.

[87] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee, “Decomposing motion and content for natural video sequence prediction,” in ICLR, 2017.

[88] A. Bhattacharyya, B. Schiele, and M. Fritz, “Accurate and diverse sampling of sequences based on a best of many sample objective,” in CVPR, 2018.

[89] Y. Jang, G. Kim, and Y. Song, “Video prediction with appearance and motion conditions,” in ICML, 2018.

[90] M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer networks,” in NIPS, 2015.

[91] Y. Wang, M. Long, J. Wang, Z. Gao, and P. S. Yu, “PredRNN: Recurrent neural networks for predictive learning using spatiotemporal LSTMs,” in NIPS, 2017.

[92] J. Xu, B. Ni, Z. Li, S. Cheng, and X. Yang, “Structure preserving video prediction,” in CVPR, 2018.

[93] B. Yu, H. Yin, and Z. Zhu, “Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting,” in IJCAI, 2018.

[94] I. Hasan, F. Setti, T. Tsesmelis, A. Del Bue, F. Galasso, and M. Cristani, “MX-LSTM: Mixing tracklets and vislets to jointly forecast trajectories and head poses,” in CVPR, 2018.

[95] M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine, “Stochastic variational video prediction,” in ICLR, 2018.

[96] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, 2014.

[97] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” CoRR, vol. abs/1411.1784, 2014.

[98] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in ICLR, 2014.

[99] E. Denton and R. Fergus, “Stochastic video generation with a learned prior,” in ICML, 2018.

[100] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio, “A recurrent latent variable model for sequential data,” in NIPS, 2015.

[101] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour et al., “Policy gradient methods for reinforcement learning with function approximation,” in NIPS, 1999.

[102] H. Wang and D.-Y. Yeung, “Towards bayesian deep learning: A framework and some existing methods,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 12, pp. 3395–3408, Dec. 2016.


Xingjian Shi received his BEng degree in information security from Shanghai Jiao Tong University, China. He is currently a PhD student at the Department of Computer Science and Engineering of the Hong Kong University of Science and Technology. His research interests are in deep learning and spatiotemporal analysis. He received the Hong Kong PhD Fellowship in 2014.


Dit-Yan Yeung received his BEng degree in electrical engineering and MPhil degree in computer science from the University of Hong Kong, and PhD degree in computer science from the University of Southern California. He started his academic career as an assistant professor at the Illinois Institute of Technology in Chicago. He then joined the Hong Kong University of Science and Technology, where he is now a full professor and the Acting Head of the Department of Computer Science and Engineering. His research interests are in computational and statistical approaches to machine learning and artificial intelligence.