


Deep Learning-based Vehicle Behaviour Prediction For Autonomous Driving Applications: A Review

Sajjad Mozaffari, Omar Y. Al-Jarrah, Mehrdad Dianati, Paul Jennings, and Alexandros Mouzakitis

Abstract—Behaviour prediction function of an autonomous vehicle predicts the future states of the nearby vehicles based on the current and past observations of the surrounding environment. This helps enhance their awareness of the imminent hazards. However, conventional behaviour prediction solutions are applicable in simple driving scenarios that require short prediction horizons. Most recently, deep learning-based approaches have become popular due to their superior performance in more complex environments compared to the conventional approaches. Motivated by this increased popularity, we provide a comprehensive review of the state-of-the-art of deep learning-based approaches for vehicle behaviour prediction in this paper. We firstly give an overview of the generic problem of vehicle behaviour prediction and discuss its challenges, followed by classification and review of the most recent deep learning-based solutions based on three criteria: input representation, output type, and prediction method. The paper also discusses the performance of several well-known solutions, identifies the research gaps in the literature and outlines potential new research directions.

Index Terms—Vehicle behaviour prediction, trajectory prediction, autonomous vehicles, intelligent vehicles, machine learning, deep learning

I. INTRODUCTION

ADOPTION of autonomous vehicles in the near future is expected to reduce the number of road accidents and improve road safety [1]. However, for safe and efficient operation on roads, an autonomous vehicle should not only understand the current state of the nearby road-users, but also proactively anticipate their future behaviour (a.k.a. motion or trajectory). One part of this general problem is to predict the behaviour of pedestrians (or, generally speaking, the vulnerable road-users), which is well-studied in the computer vision literature [2], [3], [4], [5]. There are also several review papers on pedestrian behaviour prediction, such as [6], [7], [8]. Another equally important part of the problem is the prediction of the intended behaviour of other vehicles on the road. In contrast to pedestrians, vehicles' behaviour is constrained by their higher inertia, driving rules and road geometry, which could help reduce the complexity of the problem compared to the aforementioned problem. Nonetheless, new challenges arise from the interdependency among vehicles' behaviour, the influence of traffic rules and the driving environment,

This work was supported by Jaguar Land Rover and the U.K.-EPSRC as part of the jointly funded Towards Autonomy: Smart and Connected Control (TASCC) Programme under Grant EP/N01300X/1.

S. Mozaffari, O. Y. Al-Jarrah, M. Dianati, and P. Jennings are with the Warwick Manufacturing Group, University of Warwick, Coventry CV4 7AL, U.K. (e-mail: sajjad.mozaffari, omar.al-jarrah, m.dianati, [email protected])

A. Mouzakitis is with Jaguar Land Rover Ltd., Coventry CV4 7HS, U.K.

and the multimodality of vehicles' behaviour. Practical limitations in observing the surrounding environment and the computational resources required to execute prediction algorithms also add to the difficulty of the problem, as explained in the later sections of this paper.

There are several published survey papers on vehicle behaviour analysis. For example, M. S. Shirazi and B. T. Morris [9] provide a review of vehicle monitoring, behaviour and safety analysis at intersections. B. T. Morris and M. M. Trivedi [10] review unsupervised approaches for vehicle behaviour analysis with a focus on trajectory clustering and topic modelling methods. Anomaly detection techniques using visual surveillance are reviewed in [11]. In the paper most closely related to our work, Lefevre et al. [12] provide a survey on vehicle behaviour prediction and risk assessment in the context of autonomous vehicles. The authors review various conventional approaches that applied physics-based models and/or traditional machine learning algorithms such as Hidden Markov Models, Support Vector Machines, and Dynamic Bayesian Networks. Recent advances in machine learning techniques (e.g., deep learning) have provided new and powerful tools for solving the problem of vehicle behaviour prediction. Such approaches have become increasingly important due to their superiority in complex and realistic scenarios. However, to the best of our knowledge, there is no systematic and comparative review of the latter deep learning-based approaches. We thus present a review of such studies using a new classification method which is based on three criteria: input representation, output type, and prediction method. In addition, we report the practical limitations of implementing recent solutions in autonomous vehicles. To make the paper self-contained, we also provide a generic problem definition for vehicle behaviour prediction.

The rest of this paper is organised as follows: Section II introduces the basics and the challenges of vehicle behaviour prediction for autonomous vehicles. The definitions of widely used terminologies and the generic problem formulation are also given in Section II. Section III reviews the related deep learning-based solutions and classifies them based on three criteria: input representation, output type, and prediction method. Section IV discusses the commonly used evaluation metrics, compares the performance of several well-known trajectory prediction models on public highway driving datasets, and highlights the current research gaps in the literature and potential new research directions. The key concluding remarks are given in Section V.

arXiv:1912.11676v1 [cs.CV] 25 Dec 2019


II. BASICS AND CHALLENGES OF VEHICLE BEHAVIOUR PREDICTION

Object detection and behaviour prediction can be considered as two main functions of the perception system of an autonomous vehicle. While both of them rely on on-board and off-board sensory data, the former aims to localize and classify the objects in the surrounding environment of the autonomous vehicle, and the latter provides an understanding of the dynamics of surrounding objects and predicts their future behaviour. Behaviour prediction plays a pivotal role in autonomous driving applications as it supports efficient decision making [13] and enables risk assessment [12]. In this section, we firstly discuss the challenges of vehicle behaviour prediction, then we provide a terminology for vehicle behaviour prediction, and finally we present a generic probabilistic formulation of the problem.

A. Challenges

Vehicles (e.g., cars and trucks) have well-structured motions which are governed by driving rules and environment conditions. In addition, vehicles cannot change their trajectories instantly due to their high inertia. However, vehicle behaviour prediction is not a trivial task, due to several challenges. First, there is an interdependency among vehicles' behaviour, where the behaviour of a vehicle affects the behaviour of other vehicles and vice versa. Therefore, predicting the behaviour of a vehicle requires observing the behaviour of surrounding vehicles. Second, road geometry and traffic rules can reshape the behaviour of vehicles. For example, placing a give-way sign at an intersection can completely change the behaviour of vehicles approaching it. Therefore, without considering traffic rules and road geometry, a model trained in a specific driving environment would have limited performance in other driving environments. Third, the future behaviour of vehicles is multimodal, meaning that given the history of motion of a vehicle, there may exist more than one possible future behaviour for it. For example, when a vehicle is slowing down at an intersection without changing its heading direction, both turning right and turning left motions could be expected. A comprehensive behaviour prediction module in an autonomous vehicle should identify all possible future motions to allow the vehicle to act reliably.

In addition to the intrinsic challenges of the vehicle behaviour prediction problem, implementing a behaviour prediction module in autonomous vehicles comes with several practical limitations. For example, autonomous vehicles can only partially observe the surrounding environment due to sensor limitations (e.g., object occlusion, limited sensor range, and sensor noise). In addition, there are restricted computational resources for on-board implementation in autonomous vehicles. Therefore, certain assumptions need to be considered in defining the behaviour prediction problem for applications in autonomous driving.

B. Terminology

To define the problem of vehicle behaviour prediction, we adopt the following terms:

Fig. 1. An illustration of the proposed terminology and the limited observability of the EV. The vehicles which are not observable are blurred. For example, the preceding vehicle of the TV, which is not observable by the EV, is changing its lane. This lane change allows the TV to accelerate.

• Target Vehicles (TVs) are the vehicles whose behaviour we are interested in predicting. Most previous works assumed the autonomous vehicle predicts the behaviour of one TV at a time.

• Ego Vehicle (EV) is the autonomous vehicle which observes the surrounding environment to predict the behaviour of TVs.

• Surrounding Vehicles (SVs) are the vehicles in the vicinity of the TVs whose behaviour directly affects the TVs' behaviour. With this definition, the EV can also be considered as an SV, if it is close enough to the TVs.

• Non-Effective Vehicles (NVs) are the vehicles that we assume to have no impact on the TV's behaviour.

Figure 1 illustrates an example of a driving scenario using the proposed terminology.

C. Generic Problem Formulation

We use a probabilistic formulation for vehicle behaviour prediction to cope with the uncertain nature of the problem. We represent the future behaviour of TVs as the trajectories X_{TVs} they will traverse in the future, defined as:

X_{TVs} = \{(x^i_t, y^i_t), (x^i_{t+1}, y^i_{t+1}), \ldots, (x^i_{t+m}, y^i_{t+m})\}_{i=1}^{N}    (1)

where (x^i_t, y^i_t) represents the Cartesian coordinates of vehicle i in the XY-plane at time step t, N is the number of TVs, and m is the length of the prediction window.

The generic problem is formulated as computing the conditional distribution P(X_{TVs}|O_{EV}), where O_{EV} is the EV's raw/processed sensory data. Most previous studies assume having access to an unobstructed top-down view of the driving environment, which can be obtained by infrastructure sensors (e.g., an infrastructure surveillance camera). Such data can only be available if the infrastructure shares its sensory data with the EV (e.g., through V2I communication). In addition, it is not cost-effective to cover all road sections with such sensors. Therefore, a behaviour prediction module cannot always rely on infrastructure sensory data. We assume the EV only has access to the data from on-board sensors such as cameras, lidars, and radars. The EV may also exploit the states (e.g., bounding boxes, location, velocity, etc.) of TVs and SVs, which are usually estimated by the object detection module of an autonomous vehicle, and map information, which is usually provided for autonomous vehicles before starting a trip. Nonetheless, the EV may not have full observability of the surrounding environment of TVs, due to sensor impairments (e.g., occlusion) and noise. This needs to be considered


when designing a behaviour prediction module for autonomous vehicles.

The distribution P(X_{TVs}|O_{EV}) is a joint distribution over the trajectories of several interdependent vehicles, where each trajectory is a series of continuous variables, making it intractable. To reduce the computational requirement of estimating P(X_{TVs}|O_{EV}), most previous works dropped the interdependency among vehicles' future behaviour. As such, the behaviour of each TV can be predicted separately with an affordable computational requirement. At each step, one vehicle is selected as the TV and its P(X_{TV}|O_{EV}) is calculated, where:

X_{TV} = \{(x^T_t, y^T_t), (x^T_{t+1}, y^T_{t+1}), \ldots, (x^T_{t+m}, y^T_{t+m})\}    (2)

where T ∈ {1, 2, ..., N} is the index of the current TV.
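The per-vehicle factorisation above can be sketched in a few lines: each vehicle is selected as the TV in turn, and its trajectory (in the form of Eq. (2)) is predicted independently of the others. The predictor below is a deliberately simple stand-in (constant-velocity extrapolation, a classic physics-based baseline), not a method from the reviewed works; all names are illustrative.

```python
# Sketch of predicting each TV separately instead of estimating the joint
# distribution over all vehicles. The predictor is a placeholder.

def constant_velocity_predict(track, m):
    """Extrapolate the last observed displacement for m future steps."""
    (x0, y0), (x1, y1) = track[-2], track[-1]
    dx, dy = x1 - x0, y1 - y0
    return [(x1 + dx * k, y1 + dy * k) for k in range(1, m + 1)]

def predict_all_tvs(tracks, m):
    """tracks: {vehicle_id: [(x, y), ...] observed history} -> per-TV futures."""
    return {tv: constant_velocity_predict(hist, m) for tv, hist in tracks.items()}

tracks = {1: [(0.0, 0.0), (1.0, 0.0)], 2: [(4.0, 2.0), (4.0, 2.5)]}
preds = predict_all_tvs(tracks, m=3)
```

Each call prices in no interdependency between vehicles, which is exactly the simplification the factorisation makes.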

III. CLASSIFICATIONS OF PREVIOUS WORKS

Lefevre et al. [12] classify vehicle behaviour prediction models into physics-based, manoeuvre-based, and interaction-aware models. The approaches that use dynamic or kinematic models of vehicles as a prediction method are classified as physics-based models. The models that predict vehicles' manoeuvres as output are considered manoeuvre-based approaches. Finally, the models that consider the interaction of vehicles in the input are called interaction-aware models. One drawback of this classification is that it does not account for prediction methods other than physics-based ones or output types other than manoeuvres. It also does not specify how interaction is modelled in the input data (or, generally speaking, how the input is represented).

To overcome the mentioned drawbacks, we present three classifications based on three different criteria: input representation, output type, and prediction method. First, we classify previous studies based on how they represent the input data. In this classification, the interaction-aware models are divided into three classes. In the second classification, manoeuvre-based models (the intention class) as well as other types of prediction output are explained. We do not include physics-based approaches as they are no longer state-of-the-art, but the different deep learning methods used in behaviour prediction are discussed and classified in the last classification. Figure 2 provides the classes and sub-classes for each aforementioned classification criterion. The following subsections address the classification based on each criterion individually.

A. Input Representation Criterion

In this subsection, we provide a classification of previous studies based on the type of input data and how it is represented. We divide them into four classes: track history of the TV, track history of the TV and SVs, simplified bird's eye view, and raw sensor data. The last three classes can be considered as sub-classes of the interaction-aware approaches which were introduced in [12]. We also discuss the availability of these input data in autonomous driving applications.

Fig. 2. The Three Proposed Classifications for Vehicle Behaviour Prediction

1) Track history of the TV: The conventional approach for predicting the behaviour of the TV is to only use its current state (e.g., position, velocity, acceleration, heading) or the track history of its states over time. These features can be estimated based on the EV's observation if the TV is observable by the EV's sensors.
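As a concrete illustration of this input set, the sketch below turns a TV's position history into a sequence of per-step state vectors (position, speed, heading), the kind of feature sequence typically fed to a sequence model. Deriving speed and heading by finite differences, the fixed time step, and all names are illustrative assumptions, not details taken from the cited works.

```python
# Sketch of a "track history of the TV" input: one state vector per time step.
import math

def track_to_features(positions, dt=0.1):
    """positions: [(x, y), ...] -> [[x, y, speed, heading], ...] per step."""
    feats = []
    for k in range(1, len(positions)):
        (x0, y0), (x1, y1) = positions[k - 1], positions[k]
        speed = math.hypot(x1 - x0, y1 - y0) / dt   # finite-difference speed
        heading = math.atan2(y1 - y0, x1 - x0)      # direction of travel (rad)
        feats.append([x1, y1, speed, heading])
    return feats

hist = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
f = track_to_features(hist)
```

The resulting list of fixed-length vectors is the natural input shape for recurrent models of the kind reviewed below.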

In [14], [15], [16], the track history of the x/y position, speed, and heading of the TV is used to predict its behaviour at different road junctions. All these works study the behaviour of the TV in an environment without any SVs. A few deep learning-based methods use this input set to predict the vehicle behaviour in a driving environment with the presence of other vehicles [17], [18]. Long Xin et al. [17] argue that the information of SVs is not available due to the EV's sensor limitations and object occlusion; however, some of the SVs can usually be observed by the EV's sensors (see Figure 1). Excluding the observable SVs' states from the input set may result in inaccurate prediction of the TV's behaviour, due to the interdependencies of vehicles' behaviour.

Although the track history of the TV contains highly informative features about its short-term future motion, relying only on the TV's track history can lead to erroneous results, particularly in long-term prediction in crowded driving environments.

2) Track history of TV and SVs: One approach to consider the interaction among vehicles is to explicitly feed the track history of the TV and SVs to the prediction model. SVs' states, similar to the TV's states, can be estimated by the object detection module of the EV; however, some of the SVs can be outside the perception horizon of the EV, or they might be occluded by other vehicles on the road.

The previous studies vary in how they divide the vehicles in the scene into surrounding vehicles (SVs) and non-effective


vehicles (NVs). In [19], [20], [21], the history of states of the TV and six of its closest neighbours is exploited to predict the TV's behaviour. In [22], [23], the three closest vehicles in the TV's current lane and two adjacent lanes are chosen as reference vehicles. The reference vehicles and the four vehicles in front of and behind the two reference vehicles in adjacent lanes are selected as the SVs. F. Altché and A. de La Fortelle [24] consider nine vehicles in the three lanes surrounding the target vehicle, including two vehicles in front of the TV. They indicate that considering more vehicles in the input data can improve the performance of behaviour prediction. For example, in a traffic jam, knowing that the second vehicle ahead of the TV is accelerating can enable early prediction of a speed increase for the TV. Instead of considering a fixed number of vehicles as the SVs, a distance threshold is defined to divide vehicles into SVs and NVs in [25], [26], [27]. Therefore, only the interactions of vehicles within this threshold are modelled in the prediction model.
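The two selection strategies described above, a fixed number of closest neighbours versus a distance threshold, can be sketched as follows. Function names and the numeric values are illustrative, not code or parameters from any cited work.

```python
# Sketch of dividing the vehicles around a TV into SVs and NVs.
import math

def split_svs(tv_pos, others, k=None, radius=None):
    """others: {vehicle_id: (x, y)}. Returns (SV ids, NV ids)."""
    dist = {vid: math.dist(tv_pos, p) for vid, p in others.items()}
    if k is not None:
        svs = set(sorted(dist, key=dist.get)[:k])     # k closest neighbours
    else:
        svs = {vid for vid, d in dist.items() if d <= radius}  # threshold
    return svs, set(others) - svs

others = {1: (1.0, 0.0), 2: (3.0, 0.0), 3: (10.0, 0.0)}
by_k, _ = split_svs((0.0, 0.0), others, k=2)
by_r, _ = split_svs((0.0, 0.0), others, radius=5.0)
```

Here both strategies happen to agree; in dense traffic they generally will not, which is exactly the design choice the cited works differ on.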

One drawback of previous studies is that they assume that the states of all surrounding vehicles are always observable, which is not a practical assumption in autonomous driving applications. A more realistic approach should always consider sensor impairments like occlusion and noise when exploiting the features of SVs. In addition, relying only on the track history of the TV and SVs is not sufficient for behaviour prediction, because other factors like environment conditions and traffic rules can also modify the behaviour of vehicles.

3) Simplified Bird's Eye View: An alternative way to consider the interaction among vehicles is by exploiting a simplified Bird's Eye View (BEV) of the environment. In this approach, static and dynamic objects, road lanes, and other elements of the environment are usually depicted with a collection of polygons and lines in a BEV image. The result is a map-like image which preserves the size and location of objects (e.g., vehicles) and the road geometry while ignoring texture information.

D. Lee et al. [28] fuse front-facing radar and camera data to form a binary two-channel BEV image covering the frontal area of the EV. One of the image channels specifies whether each pixel is occupied by a vehicle or not, and the other depicts the existence of lane marks. For n past frames, the images are produced and stacked together to form a 2n-channel image as the input to the prediction model. Instead of using a sequence of binary images, which indicates the existence of objects over time, a single BEV image is used in [29], [30]. In this image, each element of the scene (e.g., road, crosswalks) loses its actual texture and instead is colour-coded according to its semantics. The vehicles are depicted by colour-coded bounding boxes, and the location history of vehicles is plotted using bounding boxes with the same colour and a reduced level of brightness.
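A minimal sketch of such a binary two-channel BEV raster, assuming a coarse grid, axis-aligned bounding boxes, and vertical lane lines. All names, sizes, and resolutions are illustrative assumptions, not the actual pipeline of [28].

```python
# Sketch of a 2-channel BEV raster: channel 0 = vehicle occupancy,
# channel 1 = lane lines. One such raster per frame would be stacked over time.

def rasterize(vehicle_boxes, lane_xs, width=8, height=8, cell=1.0):
    """vehicle_boxes: [(x_min, y_min, x_max, y_max)]; lane_xs: lane-line x coords."""
    grid = [[[0, 0] for _ in range(width)] for _ in range(height)]
    for (x0, y0, x1, y1) in vehicle_boxes:
        for r in range(int(y0 // cell), int(y1 // cell) + 1):
            for c in range(int(x0 // cell), int(x1 // cell) + 1):
                if 0 <= r < height and 0 <= c < width:
                    grid[r][c][0] = 1           # vehicle occupancy channel
    for x in lane_xs:
        c = int(x // cell)
        if 0 <= c < width:
            for r in range(height):
                grid[r][c][1] = 1               # lane-line channel
    return grid

g = rasterize([(1.0, 1.0, 2.0, 3.0)], lane_xs=[4.0])
```

Stacking n of these grids along the channel axis yields the 2n-channel input described above.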

To enrich the temporal information within the BEV image, N. Deo and M. M. Trivedi [31] use a social tensor, which was first introduced in [3] (known as the social pooling layer). A social tensor is a spatial grid around the target vehicle in which the occupied cells are filled with the processed temporal data (e.g., LSTM hidden state value) of the corresponding vehicle. Therefore, a social tensor contains both the temporal dynamics of the vehicles and the spatial interdependencies among them. N. Lee et al. [32] use a social pooling layer as an additional input to another BEV representation, which is created by performing semantic segmentation on the front-facing camera of the EV and transforming it to the BEV.

The aforementioned works do not consider sensor impairments in the input representation. To overcome this drawback, a dynamic occupancy grid map (DOGMa [33]) is exploited in [34], [35]. DOGMa is created from the data fusion of a variety of sensors and provides a BEV image of the environment. The channels of this image contain the probability of occupancy and a velocity estimate for each pixel. The velocity information helps distinguish between static and dynamic objects in the environment; however, it does not provide complete knowledge about the history of dynamic objects.

One advantage of the simplified BEV is that it is flexible in terms of the complexity of the representation. Thus, it can match applications with different computational resource constraints. Furthermore, the data gathered from different sensors can be fused into a single BEV representation. The drawback of this input set is that it does not include the scene texture information.

4) Raw sensor data: In this approach, raw sensor data is fed to the prediction model. Thus, the input data contains all available knowledge about the surrounding environment. This allows the model to learn to extract useful features from all available sensory data.

Raw sensor data, compared to the previous input representations, has a larger dimension. Therefore, more computational resources are required to process the input data, which can make it impractical for on-board implementation in autonomous vehicles. One solution to this problem is to share the computational resources among different functions of the autonomous vehicle. In the deep learning literature, it is common to train a model for multiple tasks [38]. In an autonomous vehicle, the object detection module exploits raw sensor data, and it usually relies on a model with millions of parameters [39]. Thus, it can be a candidate for parameter sharing with the behaviour prediction module.

W. Luo et al. [36] use a deep neural network to jointly solve the problems of 3D detection, tracking, and motion forecasting for autonomous vehicles. They exploit 3D point cloud data over several time frames. The data is represented in bird's eye view, and the height is considered as the channel dimension. To exploit the lidar data, the same approach is used by [37]; however, they feed the 3D point cloud data in addition to a simplified BEV to their deep model.

Table I provides a summary of the classification of previous studies based on input representation. It also summarizes the advantages and disadvantages of each class.

B. Output Type Criterion

In this subsection, we classify previous studies based on how they represent a vehicle's future behaviour in the output of their prediction model. We consider four classes: intention, unimodal trajectory, intention-based trajectory, and multimodal trajectory.


TABLE I
SUMMARY OF CLASSIFICATION OF PREVIOUS WORKS BASED ON INPUT REPRESENTATION AND THE ADVANTAGES/DISADVANTAGES OF EACH CLASS

Class: Track History of the TV
- Advantages: complies with the limited observability of the EV.
- Disadvantages: does not consider the impact of the environment and the interaction among vehicles on the TV's behaviour.
- Works: [14], [15], [16], [17], [18] — track history of the TV's states (e.g., position, velocity, heading, etc.).

Class: Track History of TV and SVs
- Advantages: considers the impact of the interaction among vehicles on the TV's behaviour.
- Disadvantages: does not consider the impact of the environment on the TV's behaviour; the states of SVs are not always observable.
- Works: [19], [20], [21] — history of states for the TV and six SVs; [22], [23] — history of states for the TV, three reference vehicles, and four vehicles adjacent to them; [24] — history of states for the TV and nine SVs; [25], [26], [27] — a distance threshold is defined to divide vehicles into the SVs and NVs.

Class: Simplified Bird's Eye View
- Advantages: considers the impact of the environment and the interaction among vehicles on the TV's behaviour; facilitates fusing the data gathered from different sensors on the EV; flexible in terms of complexity of representation; can comply with the limited observability of the EV.
- Disadvantages: potentially useful information in the scene texture is lost.
- Works: [28] — a sequence of 2-channel top-down images covering the environment in front of the TV, indicating the existence of vehicles and lane lines over time; [29], [30] — a BEV image of the environment, in which the road elements and vehicles are represented with colour-coded polygons and lines; [31] — a top-down grid representation in which each occupied cell is filled with the corresponding vehicle's LSTM hidden state (similar to [3]); [32] — semantic segmentation of the environment in bird's eye view; [34], [35] — a top-down grid representation in which each cell contains the probability of occupation and a velocity estimate.

Class: Raw Sensor Data
- Advantages: complies with the limited observability of the EV; no information loss.
- Disadvantages: high computational cost.
- Works: [36] — 3D point cloud data over several time steps; [37] — lidar data and a rasterized map (the representation used in [30]).

1) Intention: Vehicle intention (a.k.a. manoeuvre) prediction is the task of classifying the future behaviour of vehicles using a set of pre-defined discrete classes. For example, in highway driving, the set of classes could be left lane change, right lane change, and keeping the lane; while at an intersection, the set of classes could be: go straight, turn left, and turn right.

To predict the intention of a vehicle approaching a T-junction, A. Zyner et al. [15] define three classes based on the destination of the vehicle, namely "east", "west", or "south". In [14], the same set of classes is used to predict the intention of a vehicle at an un-signalized roundabout. D. J. Phillips et al. [19] design a generalizable intention prediction model that can predict the direction of travel of a vehicle up to 150 m before it reaches three- and four-way intersections. W. Ding et al. [23] and D. Lee et al. [28] apply intention prediction to the highway driving scenario. The former proposes an intention prediction model to predict lane change and lane keeping behaviour for the TV, while the latter designs a model to predict the cut-in intention of right/left preceding TVs w.r.t. the EV.

Previous studies predict the intention of vehicles using a set of few classes. One drawback of these works is that they can only provide a high-level understanding of the vehicle behaviour. This problem can be solved by subdividing high-level manoeuvres into sub-classes that describe the behaviour more precisely. For example, in a highway driving scenario, we can subdivide the lane change classes into sharp lane change and normal lane change. Another drawback is the specificity of the manoeuvre set to a single driving environment, which can be resolved by defining a set that contains the manoeuvres of all desired driving scenarios. However, to predict a vehicle's behaviour using a large and in-depth set of classes, a larger and more diverse training dataset that includes sufficient samples of each class is required. In addition, a larger model capacity is needed to learn the mapping of the input data to the intention set.

2) Unimodal trajectory: Trajectory prediction models describe the future behaviour of a vehicle by predicting a series of future locations of the TV over a time window. Dealing with the continuous output of trajectory prediction models can add more complexity to the problem compared to the discrete output of intention prediction models. However, predicting the trajectory instead of the intention provides more precise information about the future behaviour of vehicles. Given a specific driving situation and the motion history of a vehicle, it might be possible for it to traverse multiple different trajectories; therefore, the corresponding distribution has multiple modes. Unimodal trajectory predictors are models that predict only one of these possible trajectories (usually the one with the highest likelihood).

The straightforward approach to predicting the trajectory of the TV is to estimate its position over time [40], [25]. The predictor model can also estimate the displacement of the TV relative to its last position at each step [27], [21]. The other approach, used in [24], is to predict the lateral position


Fig. 3. An illustration of the invalidity of the average of manoeuvres. The red car is approaching the green car on the road. It is probable for the red car to either reduce its speed (green dots) or change its lane (blue dots). A unimodal trajectory predictor may predict an average of these two manoeuvres (red dots) to reduce the prediction error. However, the average of these two manoeuvres is not a valid manoeuvre, since it results in a collision with the preceding vehicle.

and longitudinal velocity separately. This approach can be especially useful when the region of interest is longitudinally large, so that the longitudinal position can become quite a large figure. In addition to the position and velocity, the heading angle of the vehicle is predicted in [36]. To cope with the uncertainty of the trajectory prediction problem, Nemanja Djuric et al. [30] propose a trajectory prediction model that estimates the standard deviation of the predicted x- and y-positions.

The main disadvantage of the aforementioned methods is that they do not fully represent the vehicle behaviour prediction space, which can have more than one mode. Furthermore, their models may converge to the average of all the possible modes, because the average can minimize the displacement error of unimodal trajectory prediction; however, the average of modes is not necessarily a valid future behaviour [16]. Figure 3 illustrates this problem.

3) Intention-based trajectory: To deal with the multimodal nature of vehicle behaviour prediction, one approach is to estimate the likelihood of each member of a predefined intention (i.e., policy/manoeuvre/behaviour mode) set and predict the trajectory that corresponds to the most probable intention.

Long Xin et al. [17] propose an intention-aware model to predict the trajectory based on the estimated lane change intention of the TV in highway driving. In [26], [37], the intention set is extended from lane change intentions only to turning right, turning left, stopping, and so on. This allows using the prediction model in urban driving.

Unlike unimodal trajectory predictors, intention-based trajectory prediction approaches are unlikely to converge to the mean of modes, as in these approaches the predicted trajectory corresponds to one of the predefined behaviour modes. However, there are two main drawbacks to these approaches. First, intention-based trajectory predictors cannot accurately predict a vehicle's trajectory if the vehicle's intention does not exist in the predefined intention set. This problem can commonly occur in complex driving scenarios, as it is hard to predetermine all possible driving intentions in such environments. Second, unlike unimodal trajectory prediction, we need to manually label the intention of vehicles in the training dataset, which is time-consuming, expensive, and error-prone.

4) Multimodal trajectory: Multimodal trajectory prediction models predict one trajectory per behaviour mode. They also estimate the probability of each of the modes. We divide multimodal prediction approaches into two sub-categories:

• Static Intention set: In this sub-class, a set of intentions is explicitly defined and the trajectories are predicted for each member of this set. In [31], [20], a set of six manoeuvre classes for highway driving is defined and the trajectory distribution for each manoeuvre class is predicted. Predicting the distribution allows them to model the uncertainty of trajectory prediction for each manoeuvre separately. Their model also predicts the likelihood of each manoeuvre.

• Dynamic Intention set: In these approaches, the intention set can be dynamically learnt by the trajectory prediction model based on the driving scenario. Henggang Cui et al. [29] develop a model that predicts a fixed number of deterministic trajectory sequences and their probabilities. Each of these sequences can correspond to a possible manoeuvre in the driving environment. In [32], [16], [18], the distribution of vehicles' trajectories is modelled. Then, a fixed number of trajectory sequences are sampled from the modelled distribution and ranked based on their likelihood. In [34], [35], the trajectory is predicted by estimating the TV's occupancy likelihood for each cell in the dynamic occupancy grid map (DOGMa [33]) and each time step in the prediction horizon. They create the DOGMa by assigning a grid map to a bird's eye view of the environment around the EV. Their model can dynamically predict multiple trajectory modes by assigning high probability to separate groups of cells in front of a TV.

The first sub-category of multimodal approaches can be considered an extension of intention-based trajectory prediction approaches. The only difference is that multimodal trajectory predictors with a static intention set predict the trajectories for all the behaviour modes rather than only the mode with the highest likelihood. Therefore, the drawbacks we mentioned for intention-based trajectory prediction approaches, namely difficulties in defining a comprehensive intention set and manual labelling of intentions in the training dataset, are not solved here. In contrast, the approaches in the second sub-category are exempt from these two problems, as they do not require a pre-defined intention set. However, due to the dynamic definition of manoeuvres, they are prone to converging to a single manoeuvre [16] or not being able to explore all the existing manoeuvres.

Table II provides a summary of the classification of previous studies based on output type. It also summarizes the advantages and disadvantages of each class.

C. Based on Prediction Method

In this subsection, we classify previous studies based on the prediction model used into three classes, namely recurrent neural networks, convolutional neural networks, and other methods.

1) Recurrent neural networks: The simplest recurrent neural network (a.k.a. vanilla RNN) can be considered an extension of a two-layer fully-connected neural network where the hidden layer has a feedback connection. This small change allows


TABLE II
SUMMARY OF CLASSIFICATION OF PREVIOUS WORKS BASED ON OUTPUT TYPE AND THE ADVANTAGES/DISADVANTAGES OF EACH CLASS

Class: Intention
Advantages: Usually has low computational cost.
Disadvantages: Only provides a high-level understanding of the vehicle behaviour; usually covers manoeuvres that are specifically defined for a single driving scenario.
[15], [14]: The destination of travel at a roundabout and a T-junction.
[19]: The probabilities of turning right, turning left, and going straight at an intersection.
[23]: Lane change behaviour of the TV in a highway driving scenario.
[28]: Right/left cut-in of left/right preceding TVs.

Class: Unimodal Trajectory
Advantages: Lower computational cost compared to multimodal models.
Disadvantages: Does not fully represent the vehicle behaviour prediction space, which is multimodal; prone to converging to the mean of behaviour modes, which itself might not be a valid prediction.
[40], [25]: The position of the TV(s) over time.
[27], [21]: The displacement of the TV relative to its last position at each step.
[24]: Longitudinal velocity and lateral position over time.
[30]: The x-y position and the standard deviation.
[36]: The bounding box (e.g., location and heading angle) of the TV.

Class: Intention-based Trajectory
Advantages: Fixes the problem of converging to the mean of behaviour modes.
Disadvantages: Prone to trajectory prediction error if the vehicle's intention is not among the pre-defined intentions; manual labelling is required.
[17]: The TV's trajectory (based on lane change estimation for highway driving).
[26], [37]: The TV's trajectory (based on intention estimation for urban driving).

Class: Multimodal Trajectory
Advantages: Solves the disadvantages of intention-based trajectory prediction models (dynamic intention sets).
Disadvantages: Prone to converging to one behaviour mode and not exploring all behaviour modes (dynamic intention sets); high computational cost (both sub-classes).
[31], [20]: Static intention set: the trajectory distribution for each of six predefined manoeuvres and their probabilities.
[32], [16], [18]: Dynamic intention set: a number of samples from the estimated trajectory distribution.
[34], [35]: Dynamic intention set: the probability of occupancy for each cell of a BEV grid map of the surrounding environment.
[29]: Dynamic intention set: a number of deterministic trajectory sequences and their probabilities.

to model sequential data more efficiently. At each sequence step, the vanilla RNN processes the input data of the current step alongside the memory of past steps, which is carried in the previous hidden neurons. A vanilla RNN with a sufficient number of hidden units can, in principle, learn to approximate any sequence-to-sequence mapping [45]. However, in practice it is difficult to train this network to learn long sequences due to gradient vanishing or exploding, which is why gated RNNs were introduced [46]. In these networks, instead of a simple fully-connected hidden layer, a gated architecture is used for processing sequential data. Long short-term memory (LSTM) [47] and gated recurrent unit (GRU) [48] are the most commonly used gated RNNs. In vehicle behaviour prediction, LSTMs are the most widely used deep models. Here, we sub-categorize recent studies based on the complexity of the network architecture:
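To make the recurrence concrete, the toy sketch below (our own illustration with a single scalar hidden unit and made-up weights, not taken from any reviewed work) threads a hidden state through a sequence exactly as a vanilla RNN does:

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    # The new hidden state mixes the current input with the memory
    # carried over from the previous hidden state.
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_sequence(xs, w_x=0.5, w_h=0.8, b=0.0):
    # Thread the hidden state through the sequence, step by step.
    h = 0.0
    for x_t in xs:
        h = rnn_step(x_t, h, w_x, w_h, b)
    return h

# The final hidden state summarises the whole sequence; an intention
# classifier would map it to class scores with a fully-connected layer.
summary = run_sequence([0.1, 0.4, -0.2, 0.3])
```

The repeated multiplication by `w_h` in this loop is also the root of the vanishing/exploding gradient problem over long sequences, which the gated cells (LSTM, GRU) were designed to mitigate.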

• Single RNN: In these models, either a single recurrent neural network is used for the simplest forms of behaviour prediction (e.g., intention prediction or unimodal trajectory prediction), or a secondary model is used alongside a single RNN to support more sophisticated features like interaction-awareness and/or multimodal prediction. To predict the intention of vehicles, an LSTM is used by [15], [14], [19] as a sequence classifier. In this task, a sequence of features is fed to successive cells of an LSTM. Then, the hidden state of the last cell in the sequence is mapped to the output dimension, which is the number of defined classes. In [15], [14], the input is embedded using a fully-connected layer and fed to a three-layer LSTM, while a two-layer LSTM without embedding is used in [19]. F. Altché and A. de La Fortelle [24] use a single-layer LSTM to predict the future x-y position of the TV as a regression task. Despite having fewer parameters and less complexity, single-layer LSTMs are reported to achieve competitive results compared to their multilayer counterparts in some tasks [49], [50]. To predict an intention-based trajectory, W. Ding and S. Shen [26] use an LSTM encoder to predict the intention of the TV using its states. Then, the predicted intention and map information are used to generate an initial future trajectory for the TV. Finally, a nonlinear optimization method is used to refine the initial future trajectory based on the vehicles' interaction, traffic rules (e.g., red lights), and road geometry. To predict multimodal behaviour, A. Zyner et al. [16] first use an encoder-decoder three-layer LSTM to predict the parameters of a weighted Gaussian Mixture Model (GMM) for each step of the future trajectory. Then,


TABLE III
SUMMARY OF CLASSIFICATION OF PREVIOUS WORKS BASED ON THE PREDICTION METHOD AND THE ADVANTAGES/DISADVANTAGES OF EACH CLASS

Class: Recurrent Neural Networks
[15], [14], [19]: Single RNN: a multi-layer LSTM network is used as a sequence classifier.
[26]: Single RNN: a two-layer LSTM is used to predict the parameters of the acceleration distribution.
[24]: Single RNN: a single-layer LSTM is used to predict the future x-y position of the TV.
[18]: Single RNN: an encoder-decoder LSTM is used to predict the probability of occupancy on a BEV grid.
[23]: Multiple RNNs: a group of GRUs is used to model the pairwise interaction between the TV and each of the SVs.
[21]: Multiple RNNs: one group of LSTMs is used to model individual vehicles' trajectories; another group is used to model pairwise interactions.
[17]: Multiple RNNs: one LSTM is used to estimate the target lane; another LSTM is used to predict the trajectory based on the estimated target lane.
[16]: Multiple RNNs: multi-layer LSTMs are used to predict mixtures of Gaussian distributions.
[20]: Multiple RNNs: one LSTM encoder is applied to the input sequence, and its hidden state is fed to six LSTM decoders (one per manoeuvre); another LSTM encoder is used to predict the probability of each manoeuvre.

Class: Convolutional Neural Networks
[28]: A six-layer CNN with convolution and fully-connected layers is used to predict the intention of surrounding vehicles.
[29], [30]: MobileNetV2 [41] is used as a feature extractor.
[34]: A convolution-deconvolution architecture, introduced in [42], is used to predict vehicle behaviour.
[36]: First, 3D convolutions are applied to the temporal dimension of the input data; then, a series of 2D convolutions are used to capture spatial features; finally, two branches of convolution layers are used to find the probability of being a vehicle and to predict the bounding box over current and future frames.
[37]: First, two backbone CNNs are used to extract the features of the lidar data and the rasterized map separately; then, three different networks are applied to the concatenation of the extracted features to generate detections, intentions, and trajectories.

Class: Other Methods
[22]: Fully-connected neural networks: the parameters of the vehicle behaviour distribution are estimated using a multi-layer fully-connected network.
[31]: Combination of RNNs and CNNs: an LSTM is applied to each vehicle's trajectory; the result is represented in a BEV grid structure and then fed to a CNN, whose output is fed to six LSTM decoders (one per manoeuvre).
[35]: Combination of RNNs and CNNs: a convolutional network extracts spatial features from the input image; these features are fed to an encoder-decoder LSTM, and the result is fed to a deconvolutional network that maps to an output image of the same size as the input.
[32]: Combination of RNNs and CNNs: a CVAE-based encoder-decoder GRU generates the trajectory distribution; a number of samples from this distribution are ranked and refined based on contextual features.
[27]: Graph neural networks: a Graph Convolutional Network (GCN [43]) and a Graph Attention Network (GAT [44]) are used with some adaptations.
[25]: Graph neural networks: a graph convolutional model is used, which consists of several convolutional and graph operation layers.

a clustering approach is used to extract the trajectories that correspond to the modes with the highest probabilities. Seong Hyeon Park et al. [18] use an encoder-decoder LSTM to predict the probability of occupancy on a grid map and apply a beam search algorithm [51] to select the k most probable future trajectory candidates.

• Multiple RNNs: To deal with multimodality and/or interaction awareness within recurrent neural networks, an architecture of several RNNs is usually used in previous studies. W. Ding et al. [23] use a group of GRU encoders to model the pairwise interaction between the TV and each of the SVs, based on which the intention of the TV is predicted over a longer horizon. S. Dai et al. [21] use two groups of LSTM networks for the TV's trajectory prediction: one group for modelling the individual trajectories of the TV and each of the SVs, and the other for modelling the interaction between the TV and each of the SVs. L. Xin et al. [17] exploit one LSTM to predict the target lane of the TV and another LSTM to predict the trajectory based on the TV's states and the predicted target lane. To predict multimodal trajectories, N. Deo and M. M. Trivedi [20] use six different decoder LSTMs which correspond to six specific manoeuvres of highway driving. An encoder LSTM is applied to the past trajectories of the vehicles. The hidden state of each decoder LSTM is initialized with the concatenation of the last hidden state of the encoder LSTM and a one-hot vector representing the manoeuvre specific to that decoder. The decoder LSTMs predict the parameters of a manoeuvre-conditioned bivariate Gaussian distribution for the future location of the vehicle. Another encoder LSTM is also used to predict the probability of each of the six manoeuvres.

2) Convolutional neural networks: Convolutional neural networks (CNNs) include convolution layers, where a filter with learnable weights is convolved over the input; pooling layers, which reduce the spatial size of the input by subsampling; and fully-connected layers, which map their input to the desired output dimension. CNNs are commonly used to extract features from image data and have achieved successful results in the computer vision domain [52], [53]. This success motivates researchers in other domains to represent their data as an image in order to apply CNNs to them [54]. Recently, however, one-dimensional CNNs have also been widely used to extract features from one-dimensional signals [55].

D. Lee et al. [28] use a six-layer CNN to predict the intention of surrounding vehicles using a binary BEV representation. MobileNetV2 [41], a memory-efficient CNN designed for mobile applications, is used in [29], [30] to extract relevant features from a relatively complex BEV representation. S. Hoermann et al. [34] use a convolution-deconvolution architecture, previously introduced in [42] for the image segmentation task, to output the probability of occupancy for future time steps in a BEV image. This model first generates a feature vector using a convolutional network. Then, a deconvolutional network is used to upscale this vector to the output image. A more complex architecture is used in [37], [36] to deal with the tasks of object detection and behaviour prediction simultaneously. In [36], 3D convolution is performed on the temporal dimension of a 4D representation of voxelized lidar data to capture temporal features; then a series of 2D convolutions are applied to extract spatial features. Finally, two branches of convolution layers are added to predict the bounding boxes over the detected objects for current and future frames and to estimate the probability of each detected object being a vehicle, respectively. In [37], two backbone CNNs are used to separately process the BEV lidar input data and the rasterized map. The extracted features are concatenated and fed to three different networks to detect the vehicles, estimate their intentions, and predict their trajectories.

3) Other methods:
• Fully-connected Neural Networks: A simplistic approach to vehicle behaviour modelling is to rely only on the current state of the vehicles, which might be inevitable due to the unavailability of the vehicles' state histories or a first-order Markov assumption. In this case, the input data is not a sequence, and any feed-forward neural network (e.g., a fully-connected neural network) can be used instead of an RNN. In [56], it is shown that in some driving scenarios, feed-forward neural networks can achieve results competitive with recurrent neural networks with faster processing time. Yeping Hu et al. [22] use a multi-layer fully-connected network to predict the parameters of a Gaussian Mixture Model (GMM). The GMM models the multimodal distribution of the arriving time and final location of the TV.

• Combination of RNNs and CNNs: In previous works, recurrent neural networks are used for their temporal feature extraction power, and convolutional neural networks for their spatial feature extraction ability. This inspires some researchers to use both in their models to process both the temporal and spatial dimensions of the data. N. Deo et al. [31] use one encoder LSTM per vehicle to extract the temporal dynamics of the vehicle. The internal states of these LSTMs form a social tensor which is fed to a convolutional neural network to learn the spatial interdependencies. Finally, six decoder LSTMs are used to produce the manoeuvre-conditioned distribution for the future trajectory of the TV. In [35], a CNN is applied to each simplified BEV frame representing the environment around the TV. Then, the sequence of extracted features is fed to an encoder-decoder LSTM to learn the temporal dynamics of the input data. The decoder LSTM outputs are fed to a deconvolutional neural network to produce output images which represent how the environment around the TV will evolve in the following time steps. In [32], an encoder-decoder GRU is used to generate the distribution of trajectories; then multiple samples of this distribution are fed to a decoder GRU to refine and rank them. The latter module also receives the contextual features, which are extracted by a CNN model applied to the scene representation.

• Graph Neural Networks: The vehicles in a driving scenario and their interactions can be considered as a graph in which the nodes are the vehicles and the edges represent the interactions among them. Using this representation, Graph Neural Networks (GNNs) [57], [58] can be used to predict the TV's behaviour. F. Diehl et al. [27] compare the trajectory prediction performance of two state-of-the-art graph neural networks, namely the Graph Convolutional Network (GCN) [43] and the Graph Attention Network (GAT) [44]. They also propose some adaptations to improve the performance of these networks for the vehicle behaviour prediction problem. X. Li et al. [25] propose a graph-based interaction-aware trajectory prediction (GRIP) model. They use a graph convolutional model, which consists of several convolutional layers as well as graph operations, to model the interaction among the vehicles. The output of the graph convolutional model is fed to an LSTM encoder-decoder to predict the trajectories of multiple TVs.

Table III provides a summary of the classification of previous studies based on the prediction method.

IV. EVALUATION

In this section, we first present the evaluation metrics that are commonly used for vehicle behaviour prediction. Then, the performance of some of the previous works is discussed. Finally, we identify and discuss the main research gaps and opportunities.

A. Evaluation Metrics

We discuss the evaluation metrics for intention prediction models and trajectory prediction models separately, as the former is a classification problem and the latter is a regression problem, and each has its own set of metrics.

1) Intention Prediction Metrics:
• Accuracy: One of the most common classification metrics is accuracy, which is defined as the total number of correctly classified data samples divided by the total number of data samples. However, relying only on accuracy is sometimes misleading for an imbalanced dataset. For example, the number of lane changes in a highway driving dataset is usually much smaller than the number of lane keepings. Thus, an intention predictor that, regardless of the input data, always outputs lane keeping attains a high accuracy score. Therefore, other metrics like precision, recall, and F1 score were also used in previous studies [19], [37].

• Precision: For a given class, precision is defined as the ratio of the number of data samples correctly classified in that class to the total number of samples classified as the given class. A low precision indicates a large number of incorrectly classified data in the given class.

• Recall: For a given class, recall is defined as the ratio of the number of data samples correctly classified in that class to the total number of samples in the given class. A low recall indicates that a large number of data in the given class are incorrectly classified in other classes.

• F1 Score: The F1 score (a.k.a. F-score or F-measure) is a balance between precision and recall and is defined as:

F1 = 2 · (precision · recall) / (precision + recall)    (3)
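As a concrete illustration of how these three metrics interact on an imbalanced dataset, the sketch below computes per-class precision, recall, and F1 from predicted and true labels (the toy labels and the helper name `precision_recall_f1` are our own, not from any specific library):

```python
def precision_recall_f1(y_true, y_pred, cls):
    # Per-class precision, recall, and F1 for lists of labels;
    # cls is the class of interest (e.g. lane change).
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == cls and t == cls)
    predicted = sum(1 for p in y_pred if p == cls)
    actual = sum(1 for t in y_true if t == cls)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Imbalanced toy data: mostly lane keeping (LK), few lane changes (LC).
# Accuracy is 5/6 here, yet half of the lane changes are missed.
y_true = ["LK", "LK", "LK", "LK", "LC", "LC"]
y_pred = ["LK", "LK", "LK", "LK", "LK", "LC"]
p, r, f1 = precision_recall_f1(y_true, y_pred, "LC")
```

On this example, precision for the LC class is 1.0 while recall is only 0.5, which the F1 score of 2/3 balances into a single figure.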

Page 10: Deep Learning-based Vehicle Behaviour Prediction For ...Sajjad Mozaffari, Omar Y. Al-Jarrah, Mehrdad Dianati, Paul Jennings, and Alexandros Mouzakitis Abstract—Behaviour prediction

10

TABLE IV
COMPARISON OF TRAJECTORY PREDICTION ERROR (RMSE) OF THE PREVIOUS MODELS FOR DIFFERENT PREDICTION HORIZONS

Work         | Input Representation        | Output Type                | Prediction Model              | 1 s  | 2 s  | 3 s  | 4 s  | 5 s
CV           | -                           | -                          | -                             | 0.73 | 1.78 | 3.13 | 4.78 | 6.68
[24]         | Track history of TV and SVs | Unimodal Trajectory        | RNN (Single RNN)              | 0.72 | 2.00 | 3.76 | 5.97 | 9.01
[17]         | Track history of TV         | Intention-based Trajectory | RNN (Multiple RNNs)           | 0.49 | 1.41 | 2.60 | 4.06 | 5.79
M-LSTM [20]  | Track history of TV and SVs | Multimodal Trajectory      | RNN (Multiple RNNs)           | 0.58 | 1.26 | 2.12 | 3.24 | 4.66
CS-LSTM [31] | Simplified Bird's Eye View  | Multimodal Trajectory      | Combination of RNNs and CNNs  | 0.61 | 1.27 | 2.09 | 3.10 | 4.37
ST-LSTM [21] | Track history of TV and SVs | Unimodal Trajectory        | RNN (Multiple RNNs)           | 0.56 | 1.19 | 1.93 | 2.78 | 3.76
GRIP [25]    | Track history of TV and SVs | Unimodal Trajectory        | Graph Neural Networks         | 0.64 | 1.13 | 1.80 | 2.62 | 3.60

• Negative Log Likelihood (NLL): For a multi-class classification task, NLL is calculated as:

NLL = −Σ_{c=1}^{M} y_c log(p_c)    (4)

where y_c is a binary indicator of the correctness of predicting the observation in class c, p_c is the predicted probability of belonging to class c, and M is the number of classes. Although NLL values are not as interpretable as the previous metrics, NLL can be used to compare the uncertainty of different intention prediction models [23].
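Since y_c is one-hot, the sum in equation (4) reduces to the negative log of the probability assigned to the true class, as this small illustrative sketch (our own, with made-up probabilities) shows:

```python
import math

def nll(class_probs, true_class):
    # With one-hot y_c, the sum in Eq. (4) reduces to the negative
    # log of the probability assigned to the true class.
    return -math.log(class_probs[true_class])

# A confident correct prediction scores a much lower (better) NLL
# than an uncertain one, even though both pick the right class.
confident = nll([0.9, 0.05, 0.05], 0)
uncertain = nll([0.4, 0.3, 0.3], 0)
```

This is why NLL, unlike accuracy, distinguishes a model that is confidently right from one that is barely right.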

• Prediction Horizon: Apart from the metrics that deal with the correctness of prediction, it is important to report how far into the future the model can predict the driving intention of the TV. The prediction horizon can also be used to distinguish between manoeuvre prediction and detection problems. For example, a model that outputs a lane change for the TV 1.0 s before it crosses the dividing line is considered a behaviour detection model rather than a behaviour prediction model, since lane change behaviour commonly takes 3.0 to 5.0 s [59] and during this period the manoeuvre is clearly happening.

2) Trajectory Prediction Metrics:
• Final Displacement Error: This error measures the distance between the predicted final location ŷ_{t_pred} and the true final location y_{t_pred} of the vehicle at the end of the prediction horizon t_pred, while it does not consider the prediction error at the other time steps in the prediction horizon:

e_final_disp = |ŷ_{t_pred} − y_{t_pred}|    (5)

• Mean Absolute Error (MAE) or Root Mean Squared Error (RMSE): MAE measures the average magnitude of the prediction error e_t, while RMSE measures the square root of the average squared prediction error:

MAE = (1/n) Σ_{t=1}^{n} |e_t|    (6)

RMSE = sqrt( (1/n) Σ_{t=1}^{n} e_t² )    (7)

where n is the number of data samples and e_t can be defined as the displacement error between the predicted trajectory and the ground truth. MAE and RMSE are two of the most common metrics for regression problems and behave roughly similarly. However, RMSE is more sensitive to large errors due to the use of the squared error in its definition.
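The sensitivity of RMSE to large errors can be seen in a minimal sketch of equations (6) and (7) (our own illustration with made-up error values):

```python
import math

def mae(errors):
    # Average magnitude of the per-step displacement errors (Eq. 6).
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    # Square root of the average squared error (Eq. 7); squaring
    # makes the metric more sensitive to large errors.
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# One large outlier at the last step inflates RMSE much more than MAE.
errors = [0.1, 0.2, -0.1, 2.0]
```

On these values MAE is 0.6, while RMSE exceeds 1.0 because the single 2.0 m outlier dominates the squared sum.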

• Negative Log Likelihood (NLL): For a modelled trajectory distribution f(Y), NLL is calculated as:

NLL = −log(f(Y))    (8)

where Y is the ground-truth trajectory. NLL can be reported as a metric in both intention prediction and trajectory prediction; however, in trajectory prediction this metric can be more important, as both MAE and RMSE are biased in favour of models that predict the average of modes [31], which is not necessarily a good prediction, as discussed before. Therefore, NLL is also reported in previous studies to compare the correctness of the underlying predicted trajectory distribution.
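For instance, if the predicted distribution at a single time step is a univariate Gaussian with mean mu and standard deviation sigma (a deliberate simplification of the bivariate Gaussians used in several reviewed works; the helper name and the numbers below are our own), equation (8) becomes:

```python
import math

def gaussian_nll(y, mu, sigma):
    # -log f(y) for a univariate Gaussian N(mu, sigma^2):
    # 0.5 * log(2*pi*sigma^2) + (y - mu)^2 / (2 * sigma^2)
    return (0.5 * math.log(2 * math.pi * sigma ** 2)
            + (y - mu) ** 2 / (2 * sigma ** 2))

# A predicted mean closer to the ground truth yields a lower NLL,
# even though neither prediction hits the ground truth exactly.
close = gaussian_nll(1.0, mu=1.1, sigma=0.5)
far = gaussian_nll(1.0, mu=2.0, sigma=0.5)
```

Unlike RMSE on a point prediction, this score also rewards a well-calibrated sigma: overconfident (too small) or underconfident (too large) spreads both increase the NLL.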

• Computation Time: Trajectory prediction models are usually more complex than intention prediction models. Therefore, they can take more computation time, which might make them impractical for on-board implementation in autonomous vehicles. Thus, it is crucial to report and compare the computation time of trajectory prediction models.

B. Performance of Existing Methods

In this part we compare the performance of some of the reviewed trajectory prediction methods. The studies selected for comparison are the ones that used common publicly available datasets and common metrics. These studies report RMSE errors for prediction horizons of 1.0 to 5.0 s on the NGSIM I-80 and US-101 highway driving datasets [60]. Table IV provides the reported error for each paper, obtained from the original paper (except the RMSE calculation of [24], which has been modified by [17] to match the position error in SI units). Note that the RMSE error is reported separately for the longitudinal and lateral positions in [17]; however, we calculated the total RMSE error to be consistent with the other studies. Furthermore, in [21] the error is calculated for US-101 and I-80 separately, while we report their average. We also report the prediction result of a constant velocity Kalman filter (CV) model as a baseline, which is obtained from [25].


To compare the performance of the selected works, Table IV states the category each work belongs to. According to Table IV, most deep learning-based methods surpass conventional methods like the constant velocity model (CV) by a high margin. Among the reviewed deep learning-based models, complex models (e.g., multiple RNNs or combinations of RNNs and CNNs) achieve better performance than simple models like a single RNN. Nonetheless, increasing the complexity of the output by predicting a multimodal trajectory instead of a unimodal trajectory does not always result in a lower RMSE. For example, the models named GRIP [25] and ST-LSTM [21] achieve better performance than M-LSTM [20] and CS-LSTM [31], while the former studies predict unimodal trajectories and the latter ones predict multimodal trajectories. This can be due to the limited model capacity or the limited data used in the discussed multimodal trajectory prediction models.

C. Research Gaps and New Opportunities

We discuss some of the main research gaps in the vehicle behaviour prediction problem, which can be considered as opportunities for future work:

1) Unlike object detection, which has a unified way of evaluation [39], there is no benchmark for evaluating previous studies on vehicle behaviour prediction. Among the reviewed papers, there are only six works that use a unified evaluation method (the works whose performance we compare in this paper). In addition, only a few works report the computation time of their algorithms, although this metric is highly important in autonomous driving applications. As future work, a benchmark can be defined and used in vehicle behaviour prediction to enable a thorough comparison of the performance of different studies.

2) Most of the previous works assume full observability of the surrounding environment and vehicles' states, which is not feasible in practice. Infrastructure sensors can provide a non-occluded top-down view of the environment; however, it is impractical to cover all road sections with such sensors. Therefore, a realistic solution for behaviour prediction should always consider sensor impairments (e.g., occlusion, noise), which can limit the number of observable vehicles around the target vehicle and in turn may reduce the accuracy of behaviour predictors in autonomous vehicles. One possible solution is the utilization of connected autonomous vehicles. In this case, the connected vehicle can exploit the information gained by sensors implemented in other vehicles or in infrastructure through V2V and V2I communication (see Figure 4).

3) In recent studies, traffic rules are rarely considered as an explicit input to the model, even though they can reshape the behaviour of a vehicle in a driving scenario. Some of the previous studies include road direction or a traffic light as an input to their model [19], [26], which are only a small part of traffic signs and rules.

4) In addition to a vehicle's states and the scene information, which are both usually considered in recent works, other visual and auditory data, such as a vehicle's signalling lights and horn, can also be used to infer its future behaviour.

Fig. 4. An illustration of the vehicle behaviour prediction problem for connected autonomous vehicles. The sensors implemented in other autonomous vehicles and in infrastructure can provide more information about the surrounding vehicles and reduce the object occlusion problem in the ego vehicle.

5) Most of the previous works are limited to a specific driving scenario such as a roundabout, intersection, or T-junction. However, a vehicle behaviour prediction module in a fully autonomous vehicle should be able to predict the behaviour in any driving scenario. Developing a model that can be applied to a variety of driving environments can be a direction for future research.

V. CONCLUSION

Although deep learning-based behaviour prediction solutions have superior performance and are more sophisticated in terms of input representation, output type, and prediction method compared to conventional solutions, there are several open challenges that need to be addressed to enable their adoption in autonomous driving applications. In particular, while most of the existing solutions consider the interaction among vehicles, factors such as environment conditions and the set of traffic rules are not directly input to the prediction model. In addition, practical limitations such as sensor impairments and limited computational resources have not been fully taken into account.

REFERENCES

[1] “Research on the impacts of connected and autonomous vehicles (CAVs) on traffic flow: Summary report,” UK Department for Transport, Tech. Rep., May 2016. [Online]. Available: https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/530091/impacts-of-connected-and-autonomous-vehicles-on-traffic-flow-summary-report.pdf

[2] D. Helbing and P. Molnar, “Social force model for pedestrian dynamics,” Physical Review E, vol. 51, no. 5, p. 4282, 1995.

[3] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social lstm: Human trajectory prediction in crowded spaces,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[4] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi, “Social gan: Socially acceptable trajectories with generative adversarial networks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[5] T. Fernando, S. Denman, S. Sridharan, and C. Fookes, “Soft + hardwired attention: An lstm framework for human trajectory prediction and abnormal event detection,” Neural Networks, vol. 108, pp. 466–478, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0893608018302648

[6] A. Rudenko, L. Palmieri, M. Herman, K. M. Kitani, D. M. Gavrila, and K. O. Arras, “Human motion trajectory prediction: A survey,” CoRR, vol. abs/1905.06113, 2019. [Online]. Available: http://arxiv.org/abs/1905.06113

[7] T. Hirakawa, T. Yamashita, T. Tamaki, and H. Fujiyoshi, “Survey on vision-based path prediction,” in Distributed, Ambient and Pervasive Interactions: Technologies and Contexts, N. Streitz and S. Konomi, Eds. Cham: Springer International Publishing, 2018, pp. 48–64.


[8] P. V. K. Borges, N. Conci, and A. Cavallaro, “Video-based human behavior understanding: A survey,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 11, pp. 1993–2008, Nov 2013.

[9] M. S. Shirazi and B. T. Morris, “Looking at intersections: A survey of intersection monitoring, behavior and safety analysis of recent studies,” IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 1, pp. 4–24, Jan 2017.

[10] B. T. Morris and M. M. Trivedi, “Understanding vehicular traffic behavior from video: a survey of unsupervised approaches,” Journal of Electronic Imaging, vol. 22, no. 4, pp. 1–16, 2013. [Online]. Available: https://doi.org/10.1117/1.JEI.22.4.041113

[11] S. K. Kumaran, D. P. Dogra, and P. P. Roy, “Anomaly detection in road traffic using visual surveillance: A survey,” CoRR, vol. abs/1901.08292, 2019. [Online]. Available: http://arxiv.org/abs/1901.08292

[12] S. Lefevre, D. Vasquez, and C. Laugier, “A survey on motion prediction and risk assessment for intelligent vehicles,” ROBOMECH Journal, vol. 1, no. 1, p. 1, Jul 2014. [Online]. Available: https://doi.org/10.1186/s40648-014-0001-z

[13] W. Zhan, A. L. de Fortelle, Y. Chen, C. Chan, and M. Tomizuka, “Probabilistic prediction from planning perspective: Problem formulation, representation simplification and evaluation metric,” in 2018 IEEE Intelligent Vehicles Symposium (IV), June 2018, pp. 1150–1156.

[14] A. Zyner, S. Worrall, and E. Nebot, “A recurrent neural network solution for predicting driver intention at unsignalized intersections,” IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 1759–1764, July 2018.

[15] A. Zyner, S. Worrall, J. Ward, and E. Nebot, “Long short term memory for driver intent prediction,” in 2017 IEEE Intelligent Vehicles Symposium (IV), June 2017, pp. 1484–1489.

[16] A. Zyner, S. Worrall, and E. M. Nebot, “Naturalistic driver intention and path prediction using recurrent neural networks,” CoRR, vol. abs/1807.09995, 2018. [Online]. Available: http://arxiv.org/abs/1807.09995

[17] L. Xin, P. Wang, C. Chan, J. Chen, S. E. Li, and B. Cheng, “Intention-aware long horizon trajectory prediction of surrounding vehicles using dual lstm networks,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Nov 2018, pp. 1441–1446.

[18] S. H. Park, B. Kim, C. M. Kang, C. C. Chung, and J. W. Choi, “Sequence-to-sequence prediction of vehicle trajectory via lstm encoder-decoder architecture,” in 2018 IEEE Intelligent Vehicles Symposium (IV), June 2018, pp. 1672–1678.

[19] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in 2017 IEEE Intelligent Vehicles Symposium (IV), June 2017, pp. 1665–1670.

[20] N. Deo and M. M. Trivedi, “Multi-modal trajectory prediction of surrounding vehicles with maneuver based lstms,” in 2018 IEEE Intelligent Vehicles Symposium (IV), June 2018, pp. 1179–1184.

[21] S. Dai, L. Li, and Z. Li, “Modeling vehicle interactions via modified lstm models for trajectory prediction,” IEEE Access, vol. 7, pp. 38287–38296, 2019.

[22] Y. Hu, W. Zhan, and M. Tomizuka, “Probabilistic prediction of vehicle semantic intention and motion,” in 2018 IEEE Intelligent Vehicles Symposium (IV), June 2018, pp. 307–313.

[23] W. Ding, J. Chen, and S. Shen, “Predicting vehicle behaviors over an extended horizon using behavior interaction network,” CoRR, vol. abs/1903.00848, 2019. [Online]. Available: http://arxiv.org/abs/1903.00848

[24] F. Altché and A. de La Fortelle, “An lstm network for highway trajectory prediction,” in 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), Oct 2017, pp. 353–359.

[25] X. Li, X. Ying, and M. C. Chuah, “Grip: Graph-based interaction-aware trajectory prediction,” arXiv preprint arXiv:1907.07792, 2019.

[26] W. Ding and S. Shen, “Online vehicle trajectory prediction using policy anticipation network and optimization-based context reasoning,” arXiv e-prints, p. arXiv:1903.00847, Mar 2019.

[27] F. Diehl, T. Brunner, M. T. Le, and A. Knoll, “Graph neural networks for modelling traffic participant interaction,” in IEEE Intelligent Vehicles Symposium 2019, Jun 2019. [Online]. Available: https://arxiv.org/abs/1903.01254

[28] D. Lee, Y. P. Kwon, S. McMains, and J. K. Hedrick, “Convolution neural network-based lane change intention prediction of surrounding vehicles for acc,” in 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), Oct 2017, pp. 1–6.

[29] H. Cui, V. Radosavljevic, F. Chou, T. Lin, T. Nguyen, T. Huang, J. Schneider, and N. Djuric, “Multimodal trajectory predictions for autonomous driving using deep convolutional networks,” CoRR, vol. abs/1809.10732, 2018. [Online]. Available: http://arxiv.org/abs/1809.10732

[30] N. Djuric, V. Radosavljevic, H. Cui, T. Nguyen, F. Chou, T. Lin, and J. Schneider, “Motion prediction of traffic actors for autonomous driving using deep convolutional networks,” CoRR, vol. abs/1808.05819, 2018. [Online]. Available: http://arxiv.org/abs/1808.05819

[31] N. Deo and M. M. Trivedi, “Convolutional social pooling for vehicle trajectory prediction,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.

[32] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. S. Torr, and M. Chandraker, “Desire: Distant future prediction in dynamic scenes with interacting agents,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[33] D. Nuss, S. Reuter, M. Thom, T. Yuan, G. Krehl, M. Maile, A. Gern, and K. Dietmayer, “A random finite set approach for dynamic occupancy grid maps with real-time application,” The International Journal of Robotics Research, vol. 37, no. 8, pp. 841–866, 2018. [Online]. Available: https://doi.org/10.1177/0278364918775523

[34] S. Hoermann, M. Bach, and K. Dietmayer, “Dynamic occupancy grid prediction for urban autonomous driving: A deep learning approach with fully automatic labeling,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), May 2018, pp. 2056–2063.

[35] M. Schreiber, S. Hoermann, and K. Dietmayer, “Long-term occupancy grid prediction using recurrent neural networks,” CoRR, vol. abs/1809.03782, 2018. [Online]. Available: http://arxiv.org/abs/1809.03782

[36] W. Luo, B. Yang, and R. Urtasun, “Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[37] S. Casas, W. Luo, and R. Urtasun, “Intentnet: Learning to predict intention from raw sensor data,” in Proceedings of The 2nd Conference on Robot Learning, ser. Proceedings of Machine Learning Research, A. Billard, A. Dragan, J. Peters, and J. Morimoto, Eds., vol. 87. PMLR, 29–31 Oct 2018, pp. 947–956. [Online]. Available: http://proceedings.mlr.press/v87/casas18a.html

[38] S. Ruder, “An overview of multi-task learning in deep neural networks,” CoRR, vol. abs/1706.05098, 2017. [Online]. Available: http://arxiv.org/abs/1706.05098

[39] E. Arnold, O. Y. Al-Jarrah, M. Dianati, S. Fallah, D. Oxtoby, and A. Mouzakitis, “A survey on 3d object detection methods for autonomous driving applications,” IEEE Transactions on Intelligent Transportation Systems, pp. 1–14, 2019.

[40] C. Ju, Z. Wang, C. Long, X. Zhang, G. Cong, and D. E. Chang, “Interaction-aware Kalman neural networks for trajectory prediction,” arXiv e-prints, Feb. 2019.

[41] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[42] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in The IEEE International Conference on Computer Vision (ICCV), December 2015.

[43] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” ArXiv, vol. abs/1609.02907, 2016.

[44] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=rJXMpikCZ

[45] B. Hammer, “On the approximation capability of recurrent neural networks,” Neurocomputing, vol. 31, no. 1-4, pp. 107–123, 2000.

[46] A. Graves, Long Short-Term Memory. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 37–45. [Online]. Available: https://doi.org/10.1007/978-3-642-24797-2_4

[47] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, pp. 1735–1780, 1997.

[48] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder–decoder for statistical machine translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1724–1734.

[49] J. Martinez, M. J. Black, and J. Romero, “On human motion prediction using recurrent neural networks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[50] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler, “Skip-thought vectors,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 3294–3302. [Online]. Available: http://papers.nips.cc/paper/5950-skip-thought-vectors.pdf

[51] G. Neubig, “Neural machine translation and sequence-to-sequence mod-els: A tutorial,” arXiv preprint arXiv:1703.01619, 2017.

[52] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105. [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

[53] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for scene labeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1915–1929, Aug 2013.

[54] C. Li, Z. Zhang, W. Sun Lee, and G. Hee Lee, “Convolutional sequence to sequence model for human dynamics,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[55] A. Van Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” SSW, vol. 125, 2016.

[56] M. Bahari and A. Alahi, “Feed-forwards meet recurrent networks in vehicle trajectory prediction,” 2019.

[57] M. Gori, G. Monfardini, and F. Scarselli, “A new model for learning in graph domains,” in Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, vol. 2, July 2005, pp. 729–734.

[58] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, Jan 2009.

[59] T. Toledo and D. Zohar, “Modeling duration of lane changes,” Transportation Research Record, vol. 1999, no. 1, pp. 71–78, 2007. [Online]. Available: https://doi.org/10.3141/1999-08

[60] “Traffic analysis tools: Next generation simulation - FHWA operations,” https://ops.fhwa.dot.gov/trafficanalysistools/ngsim.htm, accessed: 2019-08-15.

Sajjad Mozaffari is a PhD candidate with the Warwick Manufacturing Group (WMG) at the University of Warwick, UK. He received the B.Sc. and M.Sc. degrees in Electrical Engineering from the University of Tehran, Iran, in 2015 and 2018, respectively. His research interests include machine learning, computer vision, and connected and autonomous vehicles.

Omar Y. Al-Jarrah received the B.S. degree in Computer Engineering from Yarmouk University, Jordan, in 2005, the M.Sc. degree in Engineering from The University of Sydney, Sydney, Australia, in 2008, and the Ph.D. degree in Electrical and Computer Engineering from Khalifa University, UAE, in 2016. Omar has worked as a postdoctoral fellow in the Department of Electrical and Computer Engineering, Khalifa University, UAE, and currently works as a senior research fellow at WMG, The University of Warwick, UK. His main research interests include machine learning, connected and autonomous vehicles, intrusion detection, big data analytics, and knowledge discovery in various applications. He has authored/co-authored several publications on these topics. Omar has served as a TPC member of several conferences, such as IEEE Globecom 2018. He was the recipient of several scholarships during his undergraduate and graduate studies.

Mehrdad Dianati is a Professor of Autonomous and Connected Vehicles at Warwick Manufacturing Group (WMG), University of Warwick, as well as a visiting professor at the 5G Innovation Centre (5GIC), University of Surrey, where he was previously a Professor. He has been involved in a number of national and international projects as project leader and work-package leader in recent years. Prior to his academic endeavour, he worked in industry for more than 9 years as a senior software/hardware developer and Director of R&D. He frequently provides voluntary services to the research community in various editorial roles; for example, he has served as an associate editor for the IEEE Transactions on Vehicular Technology, IET Communications, and Wiley's Journal of Wireless Communications and Mobile Computing.

Paul Jennings received a BA degree in Physics from the University of Oxford in 1985 and an Engineering Doctorate from the University of Warwick in 1996. Since 1988, he has worked on industry-focused research for WMG at the University of Warwick. His current interests include: vehicle electrification, in particular energy management and storage; connected and autonomous vehicles, in particular the evaluation of their dependability; and user engagement in product and environment design, with a particular focus on automotive applications.

Alexandros Mouzakitis is the head of the Electrical, Electronics and Software Engineering Research Department at Jaguar Land Rover. Dr Mouzakitis has over 15 years of technological and managerial experience, especially in the area of automotive embedded systems. In his current role, he is responsible for leading a multidisciplinary research and technology department dedicated to delivering a portfolio of advanced research projects in the areas of human-machine interface, digital transformation, self-learning vehicle, smart/connected systems and onboard/off-board data platforms. In his previous position within Jaguar Land Rover, Dr Mouzakitis served as the head of the Model-based Product Engineering department, responsible for model-based development and automated testing standards and processes.