Reinforcement learning based traffic
optimization at an intersection with GLOSA
Master Thesis
Submitted in Fulfillment of the
Requirements for the Academic Degree
M.Sc.
Dept. of Computer Science
Chair of Computer Engineering
Submitted by: Rajitha Jayasinghe
Student ID: 456470
Date: 07.01.2019
Supervising tutor: Prof. Dr. W. Hardt
Prof. Dr. Uranchimeg Tudevdagva
Dr. Leonhard Lücken, DLR - Berlin
Abstract
Traffic flow optimization at an intersection helps to maintain a smooth urban traffic
flow. It can reduce travel time and emissions. New algorithms are regularly
introduced to control approaching vehicles and traffic light phases. The combination
of reinforcement learning and traffic optimization is a novel one that is actively used
by the research community. This thesis suggests a methodology to reduce the travel
time and emissions of vehicles for a specific intersection design. The author provides
a solution that considers the driving route of each vehicle approaching the
intersection. Using reinforcement learning and route information, this research
suggests a vehicle ordering mechanism to improve the throughput of the
intersection. Before proposing the solution, the author gives a thorough review of
previous studies. Findings regarding various reinforcement learning algorithms and
how they have been applied to traffic optimization are explained in the literature
review. Further, the author uses GLOSA as a baseline to evaluate the new solution,
and several GLOSA variations are discussed in this report. A new approach, which
can be seen as an extension of the existing GLOSA algorithms, is described in the
concept chapter. A deep Q network approach and a rule-based policy are introduced
as the solution. The proposed solution was implemented and evaluated. The author
was able to achieve promising results with the rule-based policy approach. Further,
the issues related to both approaches are discussed in detail, and solutions are
given to further improve the proposed approaches.
Keywords: Traffic management, junction optimization, GLOSA, Reinforcement
learning
Content
Abstract
Content
Acknowledgement
List of Figures
List of Tables
List of Abbreviations
1 Introduction
1.1 Motivation
1.2 Problem Statement
1.3 Proposed Solution and Approach
1.4 Objectives
1.5 Thesis structure
1.6 Summary
2 Fundamentals of Reinforcement Learning
2.1 Introduction
2.2 What is Reinforcement Learning
2.3 Comparison with Supervised and Unsupervised learning
2.4 Exploration and exploitation
2.5 Markov Decision Process
2.5.1 Definitions
2.6 Components of Reinforcement learning
2.7 Reinforcement learning algorithms
2.7.1 Q learning
2.7.2 Deep Q Network
2.7.3 Rule-based policies
2.8 Summary
3 State of the Art
3.1 Fundamental parameters of traffic flow
3.2 Traffic stream parameters
3.3 Car2X communication
3.3.1 Reasons for using Car2X communication
3.3.2 Technical specifications
3.3.3 The technical architecture of Car2X
3.3.4 Software architecture
3.3.5 Car2X message types
3.3.6 Forwarding types
3.3.7 Car2X applications
3.4 Traditional approaches to intersection management
3.5 Green Light Optimal Speed Advisory
3.5.2 Test scenarios
3.5.3 Limitations of GLOSA
3.5.4 AGLOSA and their variations
3.5.5 Results
3.6 Traditional intersection management plans
3.7 Reinforcement learning based intersection optimization
3.7.1 Results
3.8 Non-RL approaches to optimize traffic flow
3.9 Summary
4 Concept
4.1 Reinforcement learning and proposed solution
4.2 Deep Q network-based approach
4.2.1 System architecture
4.2.2 Building a Reinforcement learning model
4.2.3 Deep Q network structure
4.2.4 Single-agent and multi-agent approaches
4.2.5 Reward definition
4.3 Rule-based policy approach
4.3.1 System architecture
4.3.2 Observation space
4.3.3 Action space
4.3.4 Rules
4.3.5 Rewards
4.3.6 Policy parameters
4.4 Summary
5 Implementation
5.1 Technical details
5.1.1 Deep Q network approach
5.1.2 Rule-based policy approach
5.2 Implementation of Deep Q network approach
5.2.1 Flow configuration
5.2.2 Dynamic SUMO network configuration
5.3 Implementation of Rule-based policy approach
5.3.1 Data extraction
5.3.2 Rules implementation
5.3.3 Extended-GLOSA implementation
5.3.4 Optimizer implementation
5.3.5 Reward calculation
5.4 Summary
6 Evaluation
6.1 DQN based approach
6.1.1 Tests
6.1.2 Results and discussion of DQN approach
6.1.3 Issues observed
6.2 Rule-based policy approach
6.2.1 Tests
6.2.2 Results and discussion
6.2.3 Optimization results and discussion
6.2.4 Further discussion of Grid test
6.2.5 Solutions and improvements
6.3 Summary
7 Conclusion
7.1 Challenges
7.2 Future improvements
7.3 Concluding remarks
Bibliography
Appendix
Acknowledgement
First of all, I would like to thank Prof. Dr. Wolfram Hardt for giving me the
opportunity to commence my thesis at the Department of Automotive Software
Engineering, TU Chemnitz. I would especially like to thank my university supervisor
Prof. Uranchimeg Tudevdagva for her assistance towards the fulfillment of the thesis.
Next, I would like to thank Deutsches Zentrum für Luft- und Raumfahrt (DLR) – Berlin
for giving me the opportunity to carry out my master's thesis there. Special thanks to
my supervisor Dr. Leonhard Lücken for the constant support, the time spent on my
project, and the feedback during the research. Further, I would like to thank the
SUMO development team for creating such a wonderful simulation toolkit.
Finally, I would like to thank my parents and my closest friends for their continuous
encouragement and support throughout my Master's degree.
List of Figures
Figure 1 : Problem scenario
Figure 2 : Solution scenario
Figure 3 : Machine learning paradigms
Figure 4 : MDP sample scenario
Figure 5 : RL agent and environment interaction [7]
Figure 6 : Q learning in action
Figure 7 : Q table and DQN [15]
Figure 8 : CNN to calculate Q values [15]
Figure 9 : Car2X overview
Figure 10 : Car2X architecture [21]
Figure 11 : Car2X software architecture [19]
Figure 12 : Example Car2X message [21]
Figure 13 : Car2X application stack [23]
Figure 14 : GLOSA vs regular driving [24]
Figure 15 : GLOSA algorithm [5]
Figure 16 : Coasting and freewheeling [24]
Figure 17 : GLOSA + coasting and freewheeling [24]
Figure 18 : Road network by [2]
Figure 19 : Simulation runs vs average delay [26]
Figure 20 : Intersection network
Figure 21 : Running time, delay and RL policies [28]
Figure 22 : Online and offline training [30]
Figure 23 : Double DQN + Duel DQN structure [2]
Figure 24 : Convolutional Neural Network + prioritized replay [33]
Figure 25 : RL agents and intersection scenarios [34]
Figure 26 : FLOW experiments [11]
Figure 27 : FLOW architecture [11]
Figure 28 : FLOW components [35]
Figure 29 : Intersection and cells [44]
Figure 30 : Deep Q network approach overview
Figure 31 : Deep Q network approach components
Figure 32 : DQN - multi-agent
Figure 33 : DQN - single-agent
Figure 34 : Rule-based policy method overview
Figure 35 : Rule-based policy approach components
Figure 36 : Rule 1 expected result
Figure 37 : Rule 1 algorithm
Figure 38 : Rule 2 expected results
Figure 39 : Rule 2 algorithm
Figure 40 : Rule 3 expected results
Figure 41 : Rule 3 algorithm
Figure 42 : Extended-GLOSA algorithm
Figure 43 : FLOW steps
Figure 44 : Specify nodes code snippet
Figure 45 : Specify edges code snippet
Figure 46 : Specify route code snippet
Figure 47 : Specify edge starts code snippet
Figure 48 : Action space code snippet
Figure 49 : Observation space code snippet
Figure 50 : Get state code snippet
Figure 51 : Master configuration code snippet
Figure 52 : DQN policy
Figure 53 : Data extraction code snippet
Figure 54 : Rule 1 and 2 code snippet
Figure 55 : Traffic cycle
Figure 56 : Extended GLOSA code snippet
Figure 57 : Optimizer code snippet
Figure 58 : Reward calculation code snippet
Figure 59 : FLOW results 1
Figure 60 : FLOW results 2
Figure 61 : Travel time evaluation
Figure 62 : Emission evaluation
Figure 63 : Optimizer travel time results
Figure 64 : Optimizer emission results
Figure 65 : Gap creation issue
Figure 66 : Rules execution - simulation loop
Figure 67 : Rule 3 code snippet
Figure 68 : GLOSA arrival time calculation all variations
Figure 69 : Traffic light phase selector
Figure 70 : Advised speed - GREEN phase
Figure 71 : Advised speed next GREEN phase
Figure 72 : SUMO initialization
Figure 73 : SUMO main configuration
Figure 74 : Route file SUMO
Figure 75 : Rules pseudo code
List of Tables
Table 1 : Traffic stream parameters [18]
Table 2 : Car2X application categories
Table 3 : GLOSA experiments results
Table 4 : RL past experiments results
Table 5 : RL and proposed solutions
Table 6 : Observations - DQN approach
Table 7 : Actions - DQN approach
Table 8 : Components - rule-based policy approach
Table 9 : Observations - rule-based policy approach
Table 10 : Actions - rule-based policy approach
Table 11 : Optimizing variables
Table 12 : Software - DQN approach
Table 13 : Software - rule-based policy approach
List of Abbreviations
RL Reinforcement Learning
MDP Markov Decision Process
RELU Rectified Linear Unit
FC Fully Connected
CNN Convolutional Neural Network
DQN Deep Q Network
Car2X Car2X Communication
GLOSA Green Light Optimal Speed Advisory
RSU Road Side Unit
V2V Vehicle to vehicle communication
V2I Vehicle to Infrastructure communication
IEEE Institute of Electrical and Electronics Engineers
DSRC Dedicated Short Range Communication
SUMO Simulation of Urban Mobility
CDRL Cooperative Deep Reinforcement Learning
LIDAR Light Detection and Ranging
RADAR Radio Detection and Ranging
AU Application Unit
CCU Communication Control Unit
HMI Human Machine Interface
TCP Transmission Control Protocol
OBD2 On-Board Diagnostics
CAN Controller Area Network
CAM Cooperative Awareness Message
DENM Decentralized Environmental Notification Message
TSB Topologically Scoped Broadcast
GSB Geographically Scoped Broadcast
SPAT Signal Phase and Timing Message
SAM Service Announce Message
TTL Time to Live
AGLOSA Adaptive GLOSA
API Application Programming Interface
TRPO Trust Region Policy Optimization
1 Introduction
Traffic control at an intersection plays a vital role in achieving a smooth urban traffic
flow. Conventionally, an intersection is controlled by traffic lights with fixed-length
phases [1]. This is now considered an inefficient way of controlling traffic due to the
growth of road infrastructure and vehicle numbers. One way of addressing these
issues is to use pre-computed plans for different times of the day or different days of
the week [1], [2]. This is more efficient than conventional traffic lights. More
sophisticated forms of adaptation rely on real-time traffic measurements.
Optimization can mainly be done in two ways: controlling vehicles approaching an
intersection, or controlling traffic light phases depending on real-time traffic
information [1], [2], [3]. In some situations, both approaches are combined to achieve
more optimized behavior [1]. Algorithms related to all of these approaches are
explained in more detail in the next chapters. However, these advanced approaches
are still under development and testing.
An optimization of traffic may aim at several objectives. For instance, safety may be
increased by preventing sudden decelerations [1]. Efficiency can be increased by
reducing the number of vehicle stops since freely flowing traffic has higher
throughput than an accelerating queue. Likewise, reduction of emissions can be the
objective of an optimized control. With emerging progress in vehicle automation and
connectivity [4], it becomes more and more conceivable that the algorithms can rely
on a more adherent behavior of the traffic participants and thus unfold their
optimization potential to a higher degree in the future.
1.1 Motivation
Many algorithms based on different approaches have already been developed.
Traffic lights are already controlled by optimization algorithms and pre-computed
plans, but controlling vehicles is only available in test fields and is yet to come.
Algorithms like GLOSA [5], [6] use real-time information to adjust the speed of a
vehicle approaching an intersection. GLOSA uses vehicle speed, position, and
several other parameters to suggest an advisory speed. AGLOSA, which is more
sophisticated than GLOSA, is able to control vehicles as well as traffic light phases.
All these approaches achieve more efficient results than conventional traffic light
systems. However, these advanced algorithms do not address the direction of a
vehicle (its direction after it passes the junction), which prescribes which
lane it needs to use when approaching an intersection. Lane change advice is
another aspect that needs further investigation.
This thesis also uses machine learning to optimize traffic flows. Reinforcement
learning is a very promising machine learning paradigm due to its interaction
mechanism with the environment and the way a reinforcement learning algorithm
takes decisions depending on changes in the environment caused by its own actions
[7]. Applying reinforcement learning to traffic optimization is a new trend and
currently very popular with the research community [8], and it achieves promising
results. Here the author tries to optimize traffic flow at an intersection using
reinforcement learning, specifically for a scenario discussed further in the problem
statement.
The main research areas of this thesis are “traffic optimization” and “reinforcement
learning”.
1.2 Problem Statement
This thesis studies the problem for a particular scenario, in which a traffic light
controlled intersection features mixed turning lanes, and straight-going vehicles are
occasionally blocked by left-turning vehicles that have to wait for traffic in the
opposite direction.
Figure 1 : Problem scenario
The figure above illustrates the previously described problem statement. In this
example, the circled vehicles are going straight and are blocked by a leading
left-turner.
1.3 Proposed Solution and Approach
Assuming knowledge of the approaching vehicles' destinations as an additional
observable, it will be studied how an ordering of the approaching vehicles can
improve the intersection throughput. For instance, the ordering may lead to a queue
in which straight-going vehicles are in front of the left-turning ones (see the following
figure). As an additional controllable element, this approach requires the inclusion of
lane change advice to the approaching vehicles, which is necessary to modify the
vehicles' order.
Figure 2 : Solution scenario
The extended GLOSA algorithm, which provides the speed advisory, is expected to
further decrease the time loss of vehicles at the intersection in comparison to the
basic GLOSA algorithm [6], as it adds more degrees of controllability and provides
more information to the traffic control. To attain an operational controller, the thesis
takes a reinforcement learning (RL) [7], [9] approach to train vehicles in a simulated
environment (SUMO) [10]. To perform this training, two approaches were selected.
The first approach is the RL framework FLOW [11], which is based on the
reinforcement learning library Rllab. A rule-based policy with RL is the second
approach.
The more successful solution is compared to the classical GLOSA algorithm,
which serves as a baseline. Further, the author discusses issues and shortcomings
which occurred during the testing phase.
1.4 Objectives
The objectives of this thesis are:
Traffic simulation with a reinforcement learning model: This thesis aims at
evaluating a rule-based policy and a Deep Q network approach, both based on
reinforcement learning. The Deep Q network approach is implemented with the
framework FLOW. The research aims to build a simulation demonstrating the
approach and to examine the issues.
Comparison of the proposed solution with existing GLOSA: The more
promising approach is compared with the existing GLOSA algorithm.
Discussion of recorded issues and solutions: This explains the issues recorded
during testing and evaluation. Further, the report briefly discusses how to resolve
them.
1.5 Thesis structure
The thesis is organized into seven chapters.
Introduction: The chapter introduces the problem which the author is solving. It
also briefly introduces the solution approach, along with the motivation and thesis
objectives.
Fundamentals of Reinforcement Learning: This chapter briefly explains the
reinforcement learning paradigm. Mainly, it explains what reinforcement learning is,
how it relates to other machine learning paradigms, the main components of a
reinforcement learning scenario, and various types of reinforcement learning
algorithms.
State of the art: This chapter outlines previous work done by the research
community. Mainly, it explains traffic engineering basics, Car2X, various traffic
optimization algorithms, the usage of reinforcement learning to optimize traffic flow,
intersection management, and the results of previous work. Further, the author
explains GLOSA and its variations in more detail.
Concept: The author proposes the solution along with the technologies and
algorithms, mainly the technical architecture and specifications of the reinforcement
learning model.
Implementation: This chapter is reserved for presenting the implementation of the
previously introduced solution.
Evaluation: This chapter discusses the results obtained from the proposed
implementation and compares them with currently existing algorithms.
Conclusion: This chapter gives an overview of the report, the progress towards the
goals, and the drawbacks. It also states future work which can improve the proposed
solution.
1.6 Summary
In this chapter, the author introduced the research problem and the solution
approach. Further, it discussed the thesis motivation and the objectives which need
to be addressed during the research. The next chapter focuses on the fundamentals
of reinforcement learning.
2 Fundamentals of Reinforcement Learning
2.1 Introduction
In this chapter, the author briefly explains the theoretical aspects of reinforcement
learning: first the basics of RL and why it differs from other machine learning
paradigms, then the various types of RL algorithms that currently exist. Further, it
addresses the components of RL and how RL relates to the Markov Decision
Process.
2.2 What is Reinforcement Learning
Reinforcement learning is a specific type of machine learning in which states are
mapped to actions so as to maximize a numerical reward signal [7], [9]. RL differs
from the well-established supervised and unsupervised learning for several reasons,
which the author discusses in the next section. One main characteristic of RL is that
the learner is not told during the training phase which action to take. Instead, the
learner must discover which action is best for a specific state by trying them.
Choosing the best action is guided by the reward. A trial-and-error approach and a
reward following each action are the two main characteristics of RL [7].
2.3 Comparison with Supervised and Unsupervised learning
As stated earlier, reinforcement learning is different from supervised and
unsupervised learning. Supervised learning uses pre-labeled datasets for training a
classifier [12]; the pre-labeling is done by an external supervisor. A set of inputs or
attributes is fed to the model, and the model predicts the output or its class
depending on the experience gathered during training. The objective of a supervised
learning based system is to generalize and learn the characteristics of the training
data so that it can later classify unseen data. But there is no interaction with the
environment in supervised learning. In an interactive problem, it is sometimes not
possible to obtain examples of all the states a supervised learning model would
need for training [12]. That is why there should be a way for a model to interact with
the environment and learn from its own experience, which is what the reinforcement
learning paradigm provides.
Unsupervised learning also differs from reinforcement learning. The main difference
is that RL always tries to maximize a reward, while unsupervised learning tries to
find hidden structure. Also, reinforcement learning maps inputs to outputs, which
unsupervised learning does not.
There is another learning paradigm named "semi-supervised learning", which is a
combination of supervised and unsupervised methodologies. It also differs from
reinforcement learning for the above reasons.
Figure 3 : Machine learning paradigms
2.4 Exploration and exploitation
One challenge that comes with RL is the trade-off between exploration and
exploitation [13]. The agent passes through the same states several times in order to
find optimal actions. Assume that the agent has already found a good action set
along one path; there may be much better action sets along other paths. If RL does
not consider exploration, it will not find the best action set. On the other hand, if an
agent explores too much, it cannot stick to one or a limited number of paths [13]; it
cannot exploit its knowledge and acts as if it has not learned anything. This is why it
is important to find a balance between exploration and exploitation.
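A common way to strike this balance is an epsilon-greedy strategy. The following is a
minimal Python sketch (not from the thesis; the Q-table and action set are illustrative
placeholders): with probability epsilon the agent explores a random action, otherwise
it exploits the best action known so far.

import random

def epsilon_greedy(q_table, state, actions, epsilon=0.1):
    # Explore: with probability epsilon, try a random action.
    if random.random() < epsilon:
        return random.choice(actions)
    # Exploit: otherwise take the action with the highest known Q-value.
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))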
2.5 Markov Decision Process
Markov Decision Process (MDP) is a mathematical representation used to model
decision making in a stochastic environment [7]. The goal is to find a policy which
maps the best action onto every state of a certain environment. Reinforcement
learning is a methodology which can solve an MDP for a given scenario [14].
2.5.1 Definitions
States (S): The set of possible situations of a given scenario
Model T(s,a,s′) = P(s′|s,a): The probability that a state (s) transitions to another
state (s′) due to a specific action (a). This is called the transition model.
Action A(s): An influence which causes a state transition
Reward R(s): Feedback for an action
Policy Π(s)→a: A map of an action for each and every state
Optimal policy Π∗(s)→a: A special policy which maximizes the expected reward
The following figure is an example of an MDP. It consists of several states (s0, s1,
s2, s3); due to actions (A0, A1, A2), a state transitions to the next state. Also, a
reward value (R1, R2, R3) is marked for each state.
Figure 4 : MDP sample scenario
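The definitions above are tied together by the standard Bellman optimality equation,
which is implied by, but not written out in, the text: the optimal value of a state is the
immediate reward plus the best achievable discounted expected value of the
successor states,
V∗(s) = max over a of [ R(s) + γ · Σ over s′ of T(s,a,s′) · V∗(s′) ]
where γ is the discount factor. The optimal policy Π∗(s) selects, in each state, the
action attaining this maximum.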
2.6 Components of Reinforcement learning
Most of the components of a reinforcement learning model are similar to those of an
MDP, but there are some small differences.
Agent: An actor or object which lives inside the environment. The actor receives
information from the environment and executes actions on the environment.
Environment: The place where the actor lives. It generates states, and actions are
executed in it by the agent.
State: A specific situation returned by the environment, for example the values of
the observed parameters over a one-second duration.
Policy: The behavior of an agent at a given time. A policy is the methodology which
the agent uses to determine the next best action based on the state. In other words,
it is a map from states to the actions which can be executed in those states. The
policy is the core component of a reinforcement learning scenario. It can be a simple
function such as a lookup table, or a complicated structure such as a deep neural
network.
Reward: Immediate feedback given by the environment for the action executed in
the last step. The agent's main goal is to maximize the total reward in the long run.
The reward also informs the agent whether an action was good or bad: good actions
have positive rewards and bad actions have negative rewards, with the values
depending on the scenario. Based on the reward values, the policy should identify
whether an action is good or bad and try to avoid executing bad actions in the
future.
Value: Value is similar to reward, but it is long-term feedback: the total
(accumulated) reward an agent can gain from a specific state to the end. Values, not
immediate rewards, are used to select the best action at each step; but without
rewards, there are no values.
The following diagram shows how the agent and the environment interact. The
agent lives in an environment and receives a state, along with the observed factors,
from the environment at each step. Depending on the state and the reward received
in the last step, the policy decides the best action for the current step. The agent
executes this action and moves to the next state, receiving a reward at the same
time. This agent-environment interaction continues until the training phase ends.
Figure 5 : RL agent and environment interaction [7]
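This interaction loop can be sketched in a few lines of Python. The environment and
the random policy below are made-up toy placeholders, not the thesis
implementation; the point is only the state-action-reward cycle.

import random

class ToyEnv:
    # A trivial environment: walk from position 0 to position 5.
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):  # action is +1 (forward) or -1 (backward)
        self.state = max(0, self.state + action)
        done = self.state == 5
        reward = 1.0 if done else -0.1  # small penalty per step, bonus at goal
        return self.state, reward, done

env = ToyEnv()
state, done, total_reward = env.reset(), False, 0.0
while not done:
    action = random.choice([1, -1])          # a (poor) random policy
    state, reward, done = env.step(action)   # environment returns state + reward
    total_reward += reward                   # the agent's goal: maximize this sum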
2.7 Reinforcement learning algorithms
2.7.1 Q learning
Q-learning is a value-based algorithm [14], [15]. The main goal is to build a Q-table,
from which the best action is determined. The following is an example of a Q-table.
The scenario contains four actions (move up, down, left, and right), and each cell
contains the maximum expected future reward for taking that action in that state.
The author explains Q-learning with the following scenario [15]. The main goal of the
agent is to reach the bottom-right corner while avoiding obstacles (note that the
obstacles are not shown in the diagram). The agent has several possible actions
(move up, down, left, right). The following Q-table consists of 25 states and is shown
in the middle of the Q-learning process. At the end of the process, each state
contains a Q-value for every action; the action with the highest Q-value is the best
action for that state.
Figure 6 : Q learning in action
One other important fact about Q-learning is that there is no explicit policy, only the
learned Q-table. All Q-values are calculated by the action-value function [15].
Q(s,a) ← Q(s,a) + α · ( r + γ · max over a′ of Q(s′,a′) − Q(s,a) )
Eq 1 : Q value function [15]
Here α is the learning rate, r the immediate reward, and s′ the next state. Gamma (γ)
is the discount factor. A discount factor of 0 would mean that the algorithm only
considers immediate rewards; the higher the discount factor (up to 1), the further
rewards propagate through time. State and action are the inputs to the function, and
the output is the Q-value, which represents the expected future reward. The process
continues until a maximum number of iterations is exceeded. Finally, the outcome is
an optimized Q-table.
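As a minimal illustration of Eq. 1, the following Python sketch applies one Q-value
update for the grid scenario above. The constants and the environment interface are
assumptions for illustration, not the thesis code.

from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9                    # learning rate and discount factor
ACTIONS = ["up", "down", "left", "right"]
Q = defaultdict(float)                      # Q[(state, action)] -> expected future reward

def q_update(state, action, reward, next_state):
    # Eq. 1: move Q(s,a) towards r + gamma * max_a' Q(s',a').
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])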
2.7.2 Deep Q Network
When a scenario consists of a massive number of states, Q-learning is not a good
choice [2]: creating and updating the Q-table is not efficient at all. A better solution is
the Deep Q network. A Deep Q network is a neural network which takes states as
inputs and emits a Q-value for each action. The following diagram shows the
difference between Q-learning and a Deep Q network.
Figure 7 : Q table and DQN [15]
A DQN can consist of several fully connected hidden layers. Convolutional neural
networks (CNNs) are also used as DQNs in some experiments [3]; further details
regarding CNNs are discussed in the next chapter. The following is the structure of a
convolutional neural network for a gaming scenario. It consists of convolution,
Rectified Linear Unit (RELU), and fully connected (FC) layers. Video frames are used
as the input to the network, and the final output is the Q-values of the possible
actions for a given state.
Figure 8 : CNN to calculate Q values [15]
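To make the structure concrete, the following PyTorch sketch defines a small fully
connected DQN. The layer sizes and the use of PyTorch are illustrative assumptions;
the network actually used in this thesis is described in the concept chapter.

import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, n_observations: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_observations, 64),  # state vector in
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),       # one Q-value per action out
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

q_values = DQN(n_observations=8, n_actions=4)(torch.zeros(1, 8))
best_action = q_values.argmax(dim=1)        # greedy action selection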
2.7.3 Rule-based policies
Rule-based policies are considered a traditional approach to modeling reinforcement
learning policies. Here, hand-written rules decide the action that the agent takes
depending on the state. In the early stages of reinforcement learning, rule-based
policies were more popular than Q-learning and other approaches, as they are a
simpler and more direct way to solve a problem. Compared with a DQN, rule-based
systems are simpler and have a shorter training period.
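A rule-based policy can be as small as a single function. The sketch below is a
hypothetical example in the spirit of this thesis's setting; the thresholds and action
names are made up.

def rule_based_policy(phase: str, distance_to_light: float, speed: float) -> str:
    # Rules map the observed state directly to an action.
    if phase == "red" and distance_to_light < 50.0:
        return "brake"            # stop for a nearby red light
    if phase == "green" and speed < 10.0:
        return "accelerate"       # make use of the green phase
    return "keep_speed"           # default rule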
2.8 Summary
This chapter provided a basic understanding of RL. It is a special machine learning
paradigm which belongs to neither the supervised nor the unsupervised learning
category. Agent-environment interaction and a trial-and-error mechanism for finding
the best action in a given state are the main characteristics of an RL system. This
chapter introduced the basic components of RL: agents, states with observations,
actions, the policy, and, most importantly, the reward. The Q-value, which derives
from the reward, states how valuable a specific action is in a particular state. The
author introduced the MDP and how RL relates to it: reinforcement learning is a
method to solve an MDP. The chapter also briefly explained Q-learning and the DQN
with its variations. Further, the research community uses rule-based policies, which
are less complex and more straightforward than DQNs.
3 State of the Art
This chapter explains the work, experiments, and results of previous research. First,
the author introduces the fundamentals of traffic engineering, starting with the basic
parameters of traffic flow. The next focus is Car2X technology, especially how Car2X
works and why it is important. The author then introduces GLOSA, one of the
important Car2X applications, and discusses variations of GLOSA and the results
achieved by the research community. Afterwards, the chapter introduces important
approaches the experts have taken to optimize traffic flow and intersection
throughput. As reinforcement learning plays a vital role in this project, the author
points out the major reinforcement learning based algorithms and techniques past
research has used for optimization. Finally, the author briefly discusses a few other
notable (non-RL) algorithms.
3.1 Fundamental parameters of traffic flow
Traffic engineering aims to understand the characteristics and behavior of traffic
flows [16]. This assists in building smooth, efficient, and safe traffic models which
can later be deployed in the real world. Traffic stream parameters help to understand
the nature and the variations of traffic flow.
3.2 Traffic stream parameters
A traffic stream represents a combination of driver and vehicle behavior. More
importantly, engineers need to examine how a vehicle and its driver interact with the
other participants in a large flow [17]. A flow also varies depending on the location,
the geographical characteristics of the road, and time factors.
The stream parameters are important for modeling flows and are used to forecast
traffic. According to the research, there are several types of parameters [16], [18]:
1. Measurements of quality, e.g., speed
2. Measurements of quantity, e.g., density and flow of traffic
Stream parameters can also be either macroscopic or microscopic. Macroscopic
parameters represent the behavior of the flow as a whole, while microscopic
parameters represent the behavior of each individual vehicle and its interaction with
other participants.
The fundamental stream parameters are as follows:
Speed (distance / time): Considered a quality measurement.
Spot speed: The instantaneous speed of a vehicle at a specific location.
Running speed (length of the track / time spent in motion): The average speed a
vehicle maintained from one point to another, counting only the time it was in
motion.
Journey speed (length of the track / time spent, including stopped durations): The
average speed a vehicle took to reach from one point to another, including the
stopping times.
Time mean speed (sum of the speeds of all vehicles / time): The average speed of
all vehicles which pass a specific point during a given time.
Flow (number of vehicles / time interval): The number of vehicles which pass a given
point of a road during a specific time interval. Volume can be further divided into
various other measurements, for example: 1. average annual daily traffic,
2. average weekday traffic.
Density (number of vehicles / road length): The number of vehicles driving on a
given length of road.
Travel time: The time taken to complete a journey by a specific vehicle.
Time headway: The time difference between two successive vehicles passing a
given point.
Distance headway: The distance from the leading vehicle's rear bumper to the
following vehicle's rear bumper at a point in time.
Table 1: Traffic stream parameters [18]
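The quantity measurements are linked by the standard fundamental relation of
traffic flow, flow = density × space mean speed (a textbook identity, not stated in the
table above). A small Python sketch with made-up numbers:

vehicles_passed = 120        # vehicles counted at a point
interval_h = 0.25            # observation interval in hours
vehicles_on_road = 30        # vehicles present on the observed stretch
road_km = 2.0                # length of the observed stretch in km

flow = vehicles_passed / interval_h     # 480 veh/h
density = vehicles_on_road / road_km    # 15 veh/km
speed = flow / density                  # 32 km/h, from flow = density * speed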
3.3 Car2X communication
Car2X is wireless communication and data exchange between vehicles, roadside
units, and pedestrians [4]. Several types of Car2X communication are currently
available:
1. Car-to-car communication
2. Car-to-infrastructure communication
3. Car-to-pedestrian communication
Vehicles, RSUs (e.g., traffic lights and road signs), and the mobile phones owned by
pedestrians are the main building blocks of a Car2X environment.
Figure 9: Car2X overview [4]
3.3.1 Reasons for using Car2X communication
Car2X research is important for the following reasons [4], [19]:
1. Safety: to save people's lives and avoid road accidents.
2. Traffic efficiency: to optimize the traffic flow in smart cities and improve the
fuel efficiency of drivers, which also reduces travel time.
3. Comfort: low-priority applications which focus on improving the driving
experience.
3.3.2 Technical specifications
The Car2X research community uses Dedicated Short Range Communication
(DSRC) based on the IEEE 802.11p standard [20], which was specifically developed
for vehicular networks. The main advantage of DSRC is its ability to establish a
connection very quickly with low transmission latency [4]. It operates at a frequency
of 5.9 GHz; two main bandwidths are used, 75 MHz in the United States and 30 MHz
in Europe [21]. The maximum communication range is 1 km, which is better than
currently available LIDAR and RADAR. GPS is also used to position the vehicle or
RSU.
3.3.3 The technical architecture of Car2X
The following diagram depicts the technical architecture of the simTD project [22], a
premier project currently run by the German government.
Figure 10 : Car2x architecture [21]
All vehicles and RSUs contain an AU and a CCU. The CCU handles the
communication with other vehicles and RSUs such as traffic signs and traffic lights.
The AU runs the Car2X application, and the HMI lets the driver interact with the
application. All vehicles and RSUs are connected via regular TCP/IP connections to
central management systems where people monitor the environment. These
management centers manage the environment depending on the input they get
from the vehicles and RSUs.
3.3.4 Software architecture
Car2Car applications consist of vehicle components, while Car2I applications consist
of both vehicle and infrastructure components. The vehicle component gathers data
from the OBD2 system and the CAN bus. RSUs have embedded sensors depending
on their functionality.
Figure 11 : Car2x software architecture [19]
3.3.5 Car2X message types
Several message types are transmitted over Car2X networks [21].
CAM represents the presence of a vehicle or roadside unit. Each vehicle
maintains a neighborhood table which records details about nearby vehicles
and RSUs and is updated frequently.
DENM is an event-triggered message. Each scenario triggers a certain type of
DENM; depending on the situation, the DENM changes. It is broadcast via
TSB or GSB.
SPAT is mainly for intersection management: traffic lights broadcast the
remaining green light interval to the vehicles.
SAM is another type which is still under discussion.
The following is a generic Car2X message, which consists of a header and a
payload.
Figure 12 : Example Car2X message [21]
3.3.6 Forwarding types
Forwarding describes how a message is broadcast from origin to destination.
Transmission needs to consider the number of hops, the TTL, and other technical
information [21].
Unicast is the first type. The key point is that the destination is predefined: the
message knows exactly where it needs to go, possibly to a direct neighbor
(one hop) or over several hops, where the vehicle marked in orange in the
corresponding figure is the predefined destination.
TSB usually involves many hops. There is a TTL value; whenever the
message passes a node, the TTL value is reduced by 1 (sketched in code
after this list). In this way it reaches the destination. Messages usually use this
forwarding type, but in a high-density area, such as a traffic jam, the
interference will be very high.
A single hop is a transmission to the direct neighborhood; here the TTL is
one.
GSB is similar to TSB, but it also considers geographical details: the target
geographical area is marked as a circle or rectangle.
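The TTL mechanism behind TSB can be sketched as follows. The message
structure and node interface are hypothetical placeholders, not a real Car2X stack
implementation.

def forward_tsb(message: dict, neighbors: list) -> None:
    # Drop the message once its hop budget is used up.
    if message["ttl"] <= 0:
        return
    relayed = {**message, "ttl": message["ttl"] - 1}  # decrement TTL per hop
    for node in neighbors:
        node.receive(relayed)                         # rebroadcast to all neighbors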
3.3.7 Car2X applications
Car2X applications can be divided into safety and non-safety related applications.
Safety applications mainly focus on protecting human lives, while the rest aim at
improving traffic efficiency.
Safety [21], [19]: emergency vehicle warning, motorcycle approach notification,
intersection collision warning, weather-related warnings, blind spot detection, etc.
Improving traffic efficiency [21], [4]: Green Light Optimal Speed Advisory (GLOSA),
enhanced route guidance.
Table 2: Car2X application categories
Figure 13 : Car2X application stack [23]
3.4 Traditional approaches to intersection management
Several decades ago, traffic engineers used simple approaches to improve traffic
efficiency by controlling traffic light phases. The easiest way is to introduce
precomputed plans for traffic lights depending on the time of the day and the day of
the week (e.g., one traffic plan for busy hours and another for off-peak hours).
There are mainly two ways to address the traffic optimization problem at an
intersection [1]:
Adaptive traffic lights: change the traffic light phases in order to improve the
traffic flow.
Guidance for vehicles: give drivers real-time guidance to adjust vehicle
parameters before reaching an intersection.
3.5 Green Light Optimal Speed Advisory
Green Light Optimal Speed Advisory (GLOSA) is a Car2X application which aims to
improve a driver's fuel efficiency and reduce journey time. GLOSA is used to
optimize the approaching speed of vehicles towards a nearby traffic light controlled
intersection. This prevents driving too fast when the traffic light phase is red and too
slow when the phase is green [5]. The communication type is primarily car-to-
infrastructure communication. GLOSA can also reduce traffic congestion.
Algorithm execution starts when the vehicle enters the communication range;
communication here means the exchange between the vehicle and RSUs (traffic
lights). The traffic light sends a CAM declaring its availability to the vehicle. The
vehicle receives the message, and the on-board GLOSA application checks the type
of the message, whether it comes from a traffic light, and the position of the traffic
light. The on-board application also checks whether the traffic light is situated on the
current route; otherwise, it disregards the signal.
3.5.1.1 Important steps
The following steps represent how GLOSA makes decisions, the calling sequence,
the data exchanged, and the conditions required for it to work [1], [5].
1. The GLOSA application calculates the time the vehicle needs to reach the
traffic light, considering the current speed of the vehicle, the distance between
the traffic light and the vehicle, and the acceleration.
2. The next important step is checking the phase of the traffic light.
2.1. If the light is GREEN, the algorithm sets the expected speed to the
maximum speed.
2.2. If the light is RED, the algorithm slows down the vehicle if it is traveling
too fast: it calculates the speed the vehicle needs in order to reach the
traffic light in the next green phase.
2.3. If the light is YELLOW, the expected speed is finalized depending on
the distance of the vehicle.
3. Finally, the driver receives a speed range (minimum expected speed,
maximum expected speed).
4. The algorithm runs once per step.
The following figure shows how GLOSA advises a driver and slows down a vehicle
before it reaches the intersection.
Figure 14: GLOSA vs regular driving [24]
3.5.1.2 Useful mathematical equations
The following equations are used to find the arrival time (t) of a vehicle at the
intersection from its current distance, and the advised GLOSA speed (U) [5].
First, the arrival time (t) to the intersection from the current position is calculated
from the distance (d), the velocity (u), and the acceleration (a). The distance is the
difference between the current vehicle position and the intersection; assuming
constant acceleration,
d = u·t + (1/2)·a·t²
When a ≠ 0, solving the above equation for t gives
t = ( −u + √(u² + 2·a·d) ) / a
When a = 0,
t = d / u
Next, the advised speed (U) is calculated from the distance between the current
vehicle position and the traffic light (d), the time needed to reach the traffic light (t),
and the current speed of the vehicle (u):
U = 2·(d / t) − u
3.5.1.3 Pseudo code
The following pseudocode further illustrates how to implement the algorithm in its
simplest form.
Figure 15 : GLOSA algorithm [5]
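Since the referenced pseudocode is given as a figure, the following Python sketch
restates the core of the algorithm using the equations above. The phase timing
(when the light turns green) is a simplified assumption; the full algorithm in [5]
handles all phases and produces a speed range.

import math

def arrival_time(d: float, u: float, a: float) -> float:
    # Time to cover distance d at speed u with constant acceleration a.
    if a != 0:
        return (-u + math.sqrt(u * u + 2 * a * d)) / a
    return d / u

def advised_speed(d: float, u: float, t_target: float) -> float:
    # Speed advisory so the vehicle reaches the light after t_target seconds.
    return 2 * (d / t_target) - u

t = arrival_time(d=200.0, u=12.0, a=0.0)   # ~16.7 s at constant speed
GREEN_AT = 25.0                             # assumed: phase turns green at t = 25 s
if t < GREEN_AT:                            # vehicle would arrive during red
    print(advised_speed(d=200.0, u=12.0, t_target=GREEN_AT))  # 4.0 m/s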
According to [24], GLOSA productivity increases if the system can advise the driver
to freewheel and coast when the vehicle cannot pass during the current green
phase. Both freewheeling and coasting apply only to a vehicle with a manual
transmission.
Freewheeling: the vehicle slows down in neutral gear. This is more efficient than
braking, and the distance the vehicle can travel depends on its initial speed.
Coasting: the vehicle slows down in gear, using up its kinetic energy.
The following diagram shows how coasting and freewheeling can be used with
GLOSA. According to the experts, freewheeling is more energy efficient than
coasting [24].
Figure 16 : Coasting and freewheeling [24]
Before suggesting a speed for slowing down, the algorithm needs to check whether
freewheeling is possible. If not, it needs to check the possibility of coasting before
braking. Freewheeling is checked before coasting because it is the more efficient of
the two. The following is an activity diagram of the suggested approach.
Figure 17 : GLOSA + coasting and freewheeling [24]
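The decision order of the activity diagram can be condensed into a small function;
the feasibility checks are placeholders for the actual speed and distance conditions.

def deceleration_advice(can_freewheel: bool, can_coast: bool) -> str:
    if can_freewheel:
        return "freewheel"   # preferred: most energy-efficient (neutral gear)
    if can_coast:
        return "coast"       # next best: use kinetic energy, stay in gear
    return "brake"           # last resort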
Some experiments have also been carried out to check how useful multi-hop
communication is for GLOSA. Here, researchers suggest using a forwarding
mechanism like TSB instead of SHB (single-hop broadcast) [6]. This is important for
improving the information distance and communication range between RSUs and
vehicles, which is usually disregarded in simulation-based testing. Applying
multi-hop forwarding via parked cars is considered the best way to increase the
information distance.
3.5.2 Test scenarios
GLOSA has been tested with various traffic simulators [5], [1], and several authors have tested GLOSA in real test fields [6]. It has also been tested under various conditions; the length of the GLOSA-activated track, single- or multi-intersection approaches, the percentage of vehicles equipped with GLOSA, the traffic density, and the forwarding type were a few of the test variables.
3.5.3 Limitations of GLOSA
Share of vehicles equipped with GLOSA [24]: The percentage of vehicles which use GLOSA is very important when it is used in the real world. Vehicles which are not equipped with GLOSA drive in the regular, inefficient way, which can cause traffic congestion as usual. Even if the remaining vehicles have GLOSA, it does not work well under such congestion, because GLOSA does not detect congestion details; it only focuses on a specific vehicle and the traffic light. Installing the GLOSA system in all approaching vehicles is the best way to get maximum results [5], [24].
High traffic density areas: GLOSA does not consider traffic congestion, information about leading vehicles, or road information when it suggests a speed. Due to that, it is not a good solution for high traffic density situations [24].
3.5.4 AGLOSA and its variations
Adaptive GLOSA (AGLOSA) is a combination of adaptive traffic lights and GLOSA [1].
GLOSA only changes the vehicle speed; it does not affect the traffic light phases in any way.
Adaptive traffic lights only change the traffic light phases depending on the inputs they get from nearby vehicles.
AGLOSA changes both the approaching speed of the vehicles and the traffic light phases, which is more efficient. AGLOSA mainly has the following steps [1].
1. Every vehicle sends its position and speed to the RSU (traffic light).
2. The RSU calculates the most optimal plan.
3. The RSU sends the best switching time/phase to each vehicle.
4. The vehicle's on-board system calculates the approaching speed according to the switching phase.
3.5.5 Results
The following table summarizes important observations from past work related to GLOSA.
Past work Results
[5] According to the results [5], 300 m is the best distance at which to activate GLOSA for approaching vehicles. Further, the researchers ran an experiment consisting of both GLOSA-equipped and non-equipped vehicles; when increasing the share of GLOSA-equipped vehicles, they obtained better results.
[6] Single-hop broadcast suffers from signal attenuation due to various reasons. [6] suggests a multi-hop mechanism to extend the information distance.
[1] [1] compared AGLOSA with existing fixed-length lights, adaptive lights, and GLOSA. The simulation was run with up to 2000 vehicles per hour and achieved successful results.
Table 3 : GLOSA experiments results
3.6 Traditional intersection management plans
Fixed-length phases are used for traditional traffic lights. Here, all phase lengths are pre-calculated, and environmental or external factors are not considered when changing the phases [1]. Installing precomputed plans based on the time of day and the day of the week is a much better approach than fixed-length phases and is widely used in real environments [6]. The engineers mainly focus on the rush hours of the day, the average number of vehicles which pass the intersection during rush hours, and many other environmental factors. This is a more effective way of controlling the traffic flow than fixed-length phased traffic lights.
Due to the improvements in car-to-car and car-to-infrastructure communication, engineers have been able to introduce more advanced algorithms and methodologies to control vehicle movements and traffic light phases [24].
3.7 Reinforcement learning based intersection optimization
Reinforcement learning is one of the premier machine learning paradigms currently used by the research community to solve complex traffic optimization problems. Traffic flow optimization, intersection throughput enhancement, and autonomous driving tests are a few example use cases. Experiments are mainly done in simulation environments, and the overall results show that reinforcement learning is a good choice for solving the above problems.
Here the author lists several notable reinforcement learning based studies that mainly deal with solving various traffic situations.
There are mainly two ways of solving traffic-related problems using reinforcement learning:
1. Consider the traffic light as the agent
2. Consider the vehicle as the agent
[2] proposed a Deep Q-network and a Q-matrix approach to experiment on a traffic light system where the red and green phases change dynamically; their durations depend on the number of vehicles approaching the traffic light. [2] used the following road network.
Figure 18 : Road network by [2]
Vehicles approach the intersection from the north-south and the west-east direction and vice versa. The author used SUMO [10] as the simulation interface and PyBrain [25] to train the reinforcement learning algorithm.
Both the Q-network and the Q-matrix were tested with constant and with varying demand. The reward used was the negative sum of the squared delays of all vehicles, where the delay is the difference between the actual arrival time and the virtual arrival time. According to the results, the Q-network did not produce promising results when the vehicle queue was long, even though it was trained with various numbers of hidden layers. The Q-matrix, on the other hand, produced better results than the Q-network even when the queue length was high.
[26] introduced a Q-learning based traffic light optimization algorithm to improve the efficiency of traffic lights. The main difference is that the researcher used a road network with 3 nearby intersections, equipped with left-turning lanes similar to the real world. To build the algorithm, [26] used both peak and off-peak situations. The author used VISSIM [27] for the simulation and external APIs to train the Q-learning algorithm. This is also a multi-agent approach. The following observations were used as inputs to Q-learning:
1. Queue lengths of main and side lanes
2. Current signal phase
3. Duration of the green phase
4. Leader information
Extending the existing phase or switching the phase (green light for the opposite direction) were the actions proposed by Q-learning. The total Q-value of all intersections was taken as the global reward for the experiment, while the Q-value of each intersection was considered a local reward. The following chart shows the results achieved by [26]; mainly, it shows how the delay is reduced under various demands.
Figure 19 : Simulation runs VS average delay [26]
[28] also proposed a reinforcement learning based approach for a multi-intersection scenario where all intersections are located in close proximity.
Figure 20 : Intersection network
Traffic lights were considered agents, and due to the multiple intersections, the proposed scenario consists of multiple agents. The vehicle positions, speeds, and the traffic signal phases of the current and the other intersections were used as state variables. The direction that gets the green light was taken as the action space; as an example, south to west is one action. Similarly, there were four actions in total, including the left-turning vehicle phases. [28] used a convolutional network to create the reinforcement learning algorithm, which the researchers named Cooperative Deep Reinforcement Learning (CDRL). As usual, the network consists of convolution layers, batch normalization, ReLU layers, and finally fully connected layers. Further, CDRL is equipped with an experience replay feature.
The authors used a well-known transportation planning function which was first introduced by the U.S. Bureau of Public Roads (BPR) [29]. According to the results received, CDRL has lower delay values when the running time and the number of episodes are high. The following figure shows how CDRL compares with other algorithms like Deep Q-network and Q-learning.
Figure 21 : Running time, delay and RL policies [28]
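For reference, the BPR function mentioned above estimates the travel time on a road link from its volume-to-capacity ratio; in its common form,

t = t_0 \left(1 + \alpha \left(\frac{V}{C}\right)^{\beta}\right)

where t_0 is the free-flow travel time, V the traffic volume, C the capacity, and typically \alpha = 0.15 and \beta = 4.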
[30] is another reinforcement learning approach to reduce the intersection delay by considering the traffic light as an agent, but this approach has a few differences compared to the previous ones.
Mainly, the author simulated the implemented algorithm using real-world data, with SUMO as the simulation framework. The model introduced by [30] has 2 phases.
The offline phase is a data gathering phase with a fixed traffic light schedule; it lets vehicles pass the intersection as usual. The online phase is the reinforcement learning phase, which extracts details from the simulation and sends them to a convolutional network that finally outputs the Q-values. Like the previous study, the author used an experience replay mechanism. The following figure shows the researcher's approach in more detail.
Figure 22 : Online and offline training [30]
The main actions were similar to [28]. State observations were taken for each lane; the observation space includes the queue length, updated waiting times, and other regular vehicle data. However, [30] used several different reward types. Among them, the accumulated waiting time of all approaching vehicles, the total number of vehicles passing the intersection during a simulation step, and the total travel time of the passing vehicles were notable.
[2] is another, different approach, where the author used a 3DQN network. 3DQN is a combination of Double DQN [31] and Dueling DQN [32]. The Double DQN approach is heavily used to handle the problem of Q-value overestimation; the solution is the usage of two networks in order to decouple action selection from Q-value prediction:
A single DQN (the online network) predicts the best action for the next state.
A second network (the target network) estimates the target Q-value for the above-predicted action.
According to the research community, Double DQN is easier to train than the regular DQN and learns more stably.
Dueling DQN is the other special technique [2] used. It again decouples the Q-value into two parts:
One stream calculates the state value for the current state.
The other stream calculates the advantage of all available actions; the network determines the best action for the state by comparing all available actions.
According to the researcher, combining both techniques can enhance the performance of the overall network. Because the author used both techniques, the overall system consists of 3 networks, which is the main difference when comparing it with all the other approaches.
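The Double DQN target described above can be sketched as follows; this is a minimal, framework-agnostic illustration with NumPy, and the array names are assumptions, not the author's code.

import numpy as np

def double_dqn_target(q_online_next, q_target_next, reward, gamma=0.99, done=False):
    """The online network selects the action; the target network evaluates it."""
    best_action = int(np.argmax(q_online_next))  # action selection (online network)
    target_q = q_target_next[best_action]        # action evaluation (target network)
    return reward + (0.0 if done else gamma * target_q)

# Example: Q-values of the next state from both networks
y = double_dqn_target(np.array([1.2, 3.4, 2.0]), np.array([1.0, 2.8, 2.5]), reward=1.0)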
The author also used a prioritized experience replay technique, which favors samples that occur rarely and are more important than others; it increases the replay probability of samples with a high temporal-difference error. The following figure shows the overall architecture of the network.
Figure 23 : Double DQN + Duel DQN structure [2]
The author used the vehicle speeds and positions as state variables, and the action space consisted of the four traffic light phases. The accumulated waiting time between cycles was taken as the reward of the system. According to the final results, 3DQN achieved the highest cumulative reward and the lowest waiting time when compared with the well-established DQN approach.
[33] proposed another approach with a prioritized replay technique. The researcher used a convolutional neural network with a target network. Speeds and positions were fed into the convolutional network as state variables. The convolutional network also consists of two parts. The following figure shows the network structure [33] used, which is a distinctive structure compared with the other authors' networks.
Figure 24 : Convolutional Neural Network + prioritized replay [33]
The main action space was the green light direction, south to north or east to west. The author tested the algorithm with real traffic data, and the results show that the algorithm is more effective than fixed-phase control and the longest-queue-first algorithm.
As seen above, many studies use the traffic light as the agent, but only a few authors have used the approaching vehicle as the agent.
[34] tested how an automated vehicle handles an unsignalized intersection using reinforcement learning. According to [34], traditional rule-based systems are not good at handling complex scenarios like this, and DQN is a good choice for solving them.
The following figure shows a few scenarios used to train the DQN. The red vehicle is the reinforcement learning agent; the other vehicles were controlled by SUMO.
Figure 25 : RL agents and intersection scenarios [34]
Here the researcher tested the reinforcement learning algorithm with 3 representations.
Time-to-Go: The action space only consists of wait and go when the vehicle is next to the intersection.
Sequential: The action space consists of accelerate, decelerate, and keep a constant speed.
Creep-and-Go: A combination of the above 2 representations; the action space consists of wait, move forward slowly, and go.
The state variables were the heading angle, the presence of a vehicle (0 or 1), and the velocity. Rewards were calculated by checking the vehicle status; reward increments happen based on the collision status of the vehicles. According to the results, the Time-to-Go representation outperforms the other 2 representations and is much better than the traditional Time-to-Collision (TTC) algorithm.
[8]'s DeepTraffic is a DQN-based, JavaScript simulation that lives in a browser. The main idea of the project was to train a DQN in the most efficient way of handling a vehicle in a dense traffic area while avoiding collisions. The action space consists of changing lanes to either side, keeping the same settings, accelerating, and decelerating. This is an open competition in which students can change the hyper-parameters of the algorithm (learning rate, number of hidden layers and neurons of the DQN, rewards, exploration and exploitation, and many others), train it, and see how well the algorithm reacts to the changed parameters. This way, [8] tried to figure out the best set of hyper-parameters for solving the scenario. These are a few facts the researcher has identified so far:
The larger and deeper the DQN, the better the performance for this specific scenario.
A too-large input size is not good for performance when time is constrained, because the training time grows very high when the network is deeper and larger and the input size is high; a large input size is only acceptable if the time constraint is not a big issue for the project.
This is not always the case, however, as larger and deeper networks sometimes show diminishing returns.
[11], [35] introduced a computational framework named FLOW which helps to simulate autonomous vehicle behaviors using reinforcement learning algorithms. FLOW uses SUMO [10] as the simulation framework and a reinforcement learning library named rllab [36]. The research community uses FLOW to develop complex controllers for reinforcement learning agents mixed with human-driven vehicles in various scenarios. Mainly, [11] developed controllers for ring roads (single lane, double lane), a figure-eight road network, and intersection scenarios. Here the development team trained autonomous vehicles using reinforcement learning algorithms to drive without colliding with each other. The following images show a few example scenarios; all vehicles were reinforcement learning agents living in the SUMO environment.
Figure 26 : FLOW experiments [11]
The following diagram shows the FLOW process. As mentioned earlier, SUMO is the simulation environment, and Traci is used to extend SUMO's functionality and modify SUMO states dynamically in each simulation step. FLOW uses rllab to implement the reinforcement learning policies; rllab is also used for evaluation and optimization purposes. OpenAI Gym [37], another reinforcement learning library, was used as the base framework when creating rllab.
Figure 27 : FLOW architecture [11]
FLOW is the connector between SUMO and rllab. After the user adds the information needed for creating the SUMO network and for training the model, FLOW dynamically creates the SUMO road network at runtime. It collects data from the simulation in every step and passes it to rllab. The reinforcement learning policies from rllab use this information to find the best action for each state and pass it back to FLOW. SUMO receives the action to be executed as a Traci command.
The simulation continues until it exceeds the configured number of steps. At the end of the simulation, FLOW calculates the gradient and updates the policies. Then FLOW resets the environment and starts the simulation again until it reaches the user-defined number of iterations. At the end of the training phase, FLOW delivers a fully trained reinforcement learning policy for the specific scenario. However, there is no standard process defined to find the exact hyper-parameter values initially; they need to be found experimentally. As usual, computation power plays a key role during the training phase, as it takes a considerable amount of time to train the model. One disadvantage of FLOW is that the currently released version does not use a multi-agent architecture: even if the user has many agents in one simulation, all inputs still need to be passed to one single model. Due to that, the models are often larger and deeper and need more time to train.
FLOW has a dynamic way of creating SUMO-based road networks. It is completely written in Python and provides several classes which the user needs to override.
Generator: The user enters the information for generating nodes, edges, and routes.
Scenario: Specifies the network configuration using the shapes which the user defined in the Generator class for an experiment. Based on the specification, FLOW creates the SUMO configuration and net files.
Environment: The class responsible for retrieving the observation space in every step and executing the action from the listed action space. It also calculates the rewards. The user can write a custom function or use specifications already defined by FLOW.
The following figure provides more details regarding the main components of FLOW.
Figure 28 : FLOW components [35]
For all the above scenarios, the observation space consists of the velocity, the current lane, and the current position. The action space consists of a lane change or a velocity change (acceleration, deceleration, or keeping a constant velocity).
The user also needs to add other details which are important for the training process (reinforcement learning algorithm, policies, evaluation methodology, iterations, and simulation time).
3.7.1 Results
Past work Results
[2] The neural network approach had mixed results: the network was unable to fit the action-value function, and when the number of hidden layers was increased, the results got worse. The Q-learning approach results were much more robust and certainly better than the neural network.
[26] The Q-learning, multi-agent approach decreases the average delay per vehicle in both saturated and oversaturated flows.
[28] CDRL achieved the highest reward and the lowest average cumulative delay compared to deep reinforcement learning and Q-learning.
[33] The average staying time was reduced significantly after 800 episodes: up to a 40 % reduction compared with the longest-queue-first and fixed control algorithms.
[11] FLOW already provides an environment to experiment with RL algorithms for a wide range of scenarios, but it currently supports only a single-agent mechanism. In the future, the FLOW developers will introduce a complete framework which expands the number of scenarios with much more functionality, and future releases will support a multi-agent mechanism.
Table 4 : RL past experiments results
3.8 Non-RL approaches to optimize traffic flow
This section briefly explains a few other methodologies and algorithms the research community has used to optimize traffic flows.
[38] ran a queue length optimization in order to find the best cycle values (phase lengths for red and green) that achieve the lowest queue lengths at an intersection. The solution was designed for a real environment, and the optimal cycle length was 140 seconds according to the results.
[39] is a genetic algorithm based approach to allocating green time depending on the traffic demand. A chromosome consists of 4 genes, each representing an effective green time ratio. The author follows the regular genetic algorithm steps: the genes go through the fitness function, selection, crossover, and mutation until the stopping criteria are reached. Rank selection was used for the selection phase and blending for the crossover. [40] proposed a similar methodology to optimize traffic light phases using an adaptive genetic algorithm; the main goal was to reduce the queuing time and improve the intersection capacity.
[41] uses a fuzzy algorithm to optimize the phase sequence. The vehicle queue length and the arrival rate are inputs to the algorithm. It has 4 output classes (very long, long, medium, and short) and controls the length of the current green light phase depending on the predicted class. [42] introduced another fuzzy approach to control the phase duration, similar to [41]. The inputs to the fuzzy controller were the number of vehicles in the arrival direction (the direction which is green) and the queue length of the waiting direction.
[43] proposed a hybrid fuzzy-genetic controller which is architecturally a multi-agent system, consisting of an intersection management agent and driver agents.
[44] suggested a reservation-based scheduling approach for multilane intersections, which can be considered a unique system when compared with the other approaches.
In [44]'s design, the area belonging to the intersection is divided into small portions (cells). Approaching vehicles need to announce themselves to the intersection management system, which calculates an arrival time and a departure time for each vehicle at each cell. Each cell has a queue of vehicle identities sorted by the vehicles' arrival times. If a cell is not available, the system checks when it becomes available for the specific vehicle; depending on the availability, the system suggests an arrival time for the vehicle at that cell. The following figure shows how the author divided an intersection into cells.
Figure 29 : Intersection and cells [44]
Please note that the above approaches are only a few notable experiments. Traffic flow optimization and intersection management is a wide research area, and various experiments have been done with a wide range of optimization algorithms.
3.9 Summary
This chapter started by introducing traffic stream parameters; it defined important parameters like traffic density, travel time, and traffic flow, which are used later in this report. Next came a brief explanation of Car2X technology: the report explained how Car2X works, the overall architecture of its hardware and software setup, and further technical details like message types and forwarding types. Next, the author explained GLOSA, one of the Car2X applications; the report described how GLOSA works along with its variations and explained the mathematical and technical details of GLOSA. Then one of the most important topics, past work on reinforcement learning, was discussed. Various studies have used different types of DQNs to optimize traffic flows and intersections; among those, CNNs, dueling networks, and 3DQN are popular, and Q-learning has been used as well. Further, the ways the various authors addressed the problem also differed, mainly in the scenarios, road networks, assumptions, and the use of simulated or live data. Rule-based policies for traffic optimization have not been used widely in the research community. Finally, the author discussed a few other non-RL approaches taken for traffic optimization; among them, genetic algorithms, various model-based optimizations, and fuzzy logic are widely used.
4 Concept
This chapter introduces the solution to the problem statement. The author experiments with two main approaches, both of which use reinforcement learning to solve the problem. The chapter describes the structure of the Deep Q-network and of the rule-based policy. Further, it explains the system architecture and components of both solutions and the processes/activities required to achieve the final results.
4.1 Reinforcement learning and proposed solution
Before moving to a specific approach, the following table compares how the reinforcement learning concepts map to the proposed problem. Please note that this is a general view of the problem without specific details; more details are given in the following sections.
RL concepts Proposed problem
Environment A traffic simulation
Agent A vehicle in the traffic simulation
States A single simulation step of the simulation
Policy Deep Q-network or rule-based policy
Actions Vary depending on the approach, e.g., lane change, speed change
Observations Depend on the approach, e.g., speed, position, etc.
Table 5 : RL and proposed solutions
4.2 Deep Q network-based approach
4.2.1 System architecture
The following diagram shows an abstract view of the proposed prototype.
Figure 30: Deep Q network approach overview
The next diagram provides a clearer view of the connector component. It is responsible for connecting the simulator and the DQN: mainly, it extracts real-time data from the simulation and feeds it to the DQN for training. Further, it receives the best-suited action for each state from the DQN and feeds it back to the simulator. The connector is equipped with further components to calculate the reward and optimize the weights of the DQN.
Figure 31 : Deep Q network approach components
4.2.2 Building a Reinforcement learning model
Several major steps need to be followed in order to create an RL policy.
1. Creation of agents: The first challenge is to define the agents of an environment, how many agents exist, and the interaction between the agents.
2. Selection of observations: Defining the state of a scenario and selecting what kind of information to extract from the environment.
3. Selection of actions: Defining the possible actions an agent can take, considering the constraints.
4. Designing the policy: The author experiments with two algorithms; a DQN approach is discussed first and a rule-based policy later.
5. Defining rewards: Creating reward metrics/variables to provide feedback for the actions. Sometimes a reward needs to be normalized.
6. Running until optimized results are received: This phase takes a considerable amount of time and resources. Further, the author has to try various mechanisms in order to achieve better and faster results.
7. Rerunning with various hyper-parameters: Several parameters can be changed in order to observe the behavior of the algorithm.
4.2.3 Deep Q network structure
Before designing the network, the author has to decide on the inputs (observations) and outputs (actions) of the network.
4.2.3.1 Observations selection
Observation/input | Range | Description
Speed | 0 to 14 m/s | The current speed of a specific vehicle in the current simulation step
Lane | left lane 1, right lane 0 | The current lane in which a specific vehicle is positioned
Distance | 0 to 1000 m | Distance from the current position to the traffic light
Number of leading vehicles | 0 or more, depending on the flow | The number of leading vehicles up to the intersection, counted for each vehicle on both lanes
Number of following vehicles | 0 or more, depending on the flow | The number of following vehicles, counted for every vehicle
Remaining green light time | 0 to less than 90 s (the full traffic light cycle is 90 seconds) | If the current phase is GREEN, the remaining green light time until it changes
Next green light | 0, or up to 59 seconds (excluding the green phase) if the vehicle passes the intersection in the next cycle | If the vehicle misses the current green phase, the time until the next green phase after the RED and YELLOW phases
Table 6 : Observations - DQN approach
4.2.3.2 Actions selection
Action/output | Range | Description
Speed change | 0 to 14 m/s | Indicates an acceleration or a deceleration
Lane change | no lane change 0, left to right 1, right to left -1 | Instructs the vehicle to change the lane (from left to right, from right to left, or no lane change at all)
Table 7 : Actions: DQN approach
4.2.3.3 Hidden Neurons and layers
The author tested the proposed solution with several DQN structures to train the agents; in particular, the above observations and actions were tested with various hidden layer structures. For example, 32*32 means 2 hidden layers with 32 neurons in each layer, and 64*64 means 2 hidden layers with 64 neurons in each layer.
The following diagram shows an abstract structure of the proposed DQN. The inputs and outputs were already defined in the previous section.
Figure 32 : DQN - multi-agent
Further, the network structure varies between the multi-agent and the single-agent approach, which the author discusses in the next section. The above structure is only for the multi-agent approach; the next diagram shows how it can be modified for the single-agent approach.
Figure 33 : DQN - single-agent
The neurons of one layer are connected with all neurons of the next layer, as in a usual feed-forward neural network.
4.2.4 Single-agent and multi-agent approaches
Single-agent approach: The data obtained from all vehicles (agents) is sent to one Deep Q-network, which decides the action for each vehicle. Due to that, the training phase only contains one neural network. The disadvantage is that when the number of vehicles in the simulation changes, the DQN structure changes and an entirely new network needs to be trained. Also, when the number of vehicles increases, the observations and actions increase, and training a network with many inputs needs more time and computation power. Here, the "single-agent approach" means training all vehicles and the whole scenario as one instance.
Multi-agent approach: There is again only one Deep Q-network architecture to train all vehicles, but several instances with the same architecture are available, one for each vehicle. Each vehicle sends its observations to its Deep Q-network and receives the action from it. All Deep Q-networks run in parallel. The "multi-agent approach" means training each vehicle as a universal vehicle controller identical across all instances/agents. This approach also needs considerable computation power, but expectedly less time than the single-agent approach.
4.2.5 Reward definition
The author considers the accumulated reward, which is the total reward of every vehicle over the whole episode. An episode is a whole scenario, including all states from start to end.
Accumulated travel time: the travel time of all vehicles from start to end.
Accumulated emission: the emission of all vehicles over the whole scenario.
After completing a simulation, the program compares the accumulated travel time and emission with those of the previous simulation, mainly checking whether any improvement (a reduction of travel time and emission) occurred. If so, the reward increases proportionally to the improvement; if not, the reward decreases. A minimal sketch of this scheme follows.
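The following is a minimal sketch of this reward scheme, assuming the accumulated figures are available after each simulation; the function and variable names are illustrative, not taken from the author's implementation.

def episode_reward(travel_time, emission, prev_travel_time, prev_emission, scale=1.0):
    """Reward proportional to the improvement over the previous episode;
    positive when the accumulated travel time and emission decrease."""
    tt_gain = prev_travel_time - travel_time  # reduction in accumulated travel time
    em_gain = prev_emission - emission        # reduction in accumulated emission
    return scale * (tt_gain + em_gain)

# Example: both figures improved, so the reward is positive
r = episode_reward(travel_time=950.0, emission=140.0,
                   prev_travel_time=1000.0, prev_emission=150.0)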
4.3 Rule-based policy approach
4.3.1 System architecture
The following diagram shows the main components and the interaction between them.
Figure 34 : Rule-based policy method overview
Component | Functionality
Simulator | Used to extract the information for training the policies and to execute the actions in every simulation step
Connector | Extracts data from the simulation and passes the selected action in each step
Rule-based policy | The control logic of an agent; the component consists of a set of rules which decide the appropriate action depending on the state of the observation space
Optimizer | Responsible for optimizing the rule-based policies; several parameters of the policy are exposed to the optimizer. More details about the optimized variables are discussed later
Table 8 : Components, a Rule-based policy approach
The optimizer runs several parallel simulations to optimize the parameters. It continues until it achieves the best-suited values for the optimized variables or reaches a maximum number of iterations. The final output is a well-trained rule-based policy with optimized values. Unlike in the DQN approach, here the author only considers a multi-agent reinforcement learning approach; that means all simulated vehicles are considered agents controlled by rule-based policies. The following figure shows the main components of the suggested system.
Figure 35: Rule-based policy approach components
4.3.2 Observation space
The following observations are extracted from the simulation in each step. Please note that the table only shows the observation space of a single vehicle; as the author follows a multi-agent approach, each vehicle has its own observation space. The observation space is similar to that of the DQN-based approach above, but it contains a few additional variables.
Observation (single vehicle) | Range | Description
Speed | 0 to the maximum speed set by the simulation framework | The speed in the current simulation step
Acceleration | lowest and highest values set by the simulation framework | The acceleration in the current simulation step
Lane | left lane 1, right lane 0 | The current lane in which the vehicle is positioned
Distance | 0 to 1000 m | Distance from the current position to the traffic light
Leading vehicle IDs | 0 or many vehicles, depending on the flow | The leading vehicle IDs for the current vehicle
Following vehicle IDs | 0 or many vehicles, depending on the flow | The following vehicle IDs for the current vehicle
Remaining green light time | 0 to less than 90 s (the full traffic light cycle is 90 seconds) | If the current phase is GREEN, the remaining green light time until it changes
Next green light | 0, or up to 59 seconds (excluding the green phase) if the vehicle passes the intersection in the next cycle | If the vehicle misses the current green phase, the time until the next green phase after the RED and YELLOW phases
Table 9 : Observations - rule-based policy approach
4.3.3 Action Space
Action | Range | Description
Speed change | 0 to 14 m/s | Indicates an acceleration or a deceleration; the speed change is decided by the extended GLOSA function. Possible values: 0 (full stop) to the maximum speed
Lane change | 0 or 1 | Two Boolean values representing the left and the right lane
Table 10 : Actions - rule-based policy approach
4.3.4 Rules
The following set of rules defines the previously declared actions. Please note that each vehicle executes these rules in every simulation step to find the best possible action.
4.3.4.1 Rule 1
If a left-turning vehicle is traveling in the right lane, it should change to the left lane.
Figure 36: Rule 1 expected result
Figure 37 : Rule 1 algorithm
4.3.4.2 Rule 2
If a straight-going vehicle is following a left-turning vehicle in the left lane, it should change to the right lane.
Figure 38 : Rule 2 expected results
Figure 39 : Rule 2 algorithm
4.3.4.3 Rule 3
This rule is only valid for straight-going vehicles traveling in the right lane. If there are no leading left turners in the left lane and there are free SLOTS available in the left lane, the vehicle can change from the right to the left lane. The number of SLOTS in the left lane is a parameter defined according to the scenario; the number of vehicles which can pass during a green light phase and during the extended green light phase for left turners is considered when defining the slots.
SLOTS = the maximum desired number of leading straight-going vehicles in front of the first left turner traveling in the left lane.
Figure 40 : Rule 3 expected results
Several flags are set after a lane change to indicate that the ego vehicle has already executed a particular rule; thereafter, the program keeps the vehicle in the changed lane.
Figure 41 : Rule 3 algorithm
4.3.4.4 Extended GLOSA
This is an extension for traditional GLOSA algorithm which decides the approaching
speed of the vehicle. The extension is to establish the connection between the
Traditional GLOSA and Rules. Several new modifications were added mainly to
control the left turners and to figure out when the vehicle reaches the intersection.
The Following diagram shows the activity flow of the proposed extended GLOSA
function.
Following equations were introduced by above extended-GLOSA algorithm
If dMax > distance to junction condition is true, the following equation is valid. In other
word, when vehicle tries to reach the Max speed and the distance it needs to reach
the MAX_SPEED is larger than the current distance to the junction, following
equation (EQ1) should execute.
(
) √(
∗
∗ ) ∗
Eq 2 : Arrival time calculation equation 1
If the above condition is false, the next equation(EQ2) is valid.
Eq 3 : Arrival time calculation equation 2
Following equation(EQ3) should execute when the acceleration is 0 and speed is not
0
Eq 4 : Arrival time calculation equation 3
All 3 above equations are to find the arrival time to the junction from current vehicle
position in different circumstances.
Figure 42 : Extended-GLOSA algorithm
The final equation (EQ4) finds the suitable speed a vehicle needs in order to fulfill the GLOSA constraints:

U = \frac{2d}{t} - u

Eq 5 : Advised speed calculation
The GLOSA algorithm is executed for all RL agents in every simulation step; a minimal sketch of the reconstructed equations follows.
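The following sketch covers the three arrival-time cases and the advised speed, following the reconstructed equations above; the MAX_SPEED and margin handling are simplified, so this is an illustration, not the author's full extended GLOSA.

import math

def arrival_time(d, u, a, v_max):
    """Arrival time at the junction for distance d, speed u, acceleration a."""
    if a == 0:
        return d / u if u > 0 else float("inf")  # EQ3: constant speed
    d_max = (v_max ** 2 - u ** 2) / (2 * a)      # distance needed to reach v_max
    if d_max > d:
        # EQ1: the junction is reached while still accelerating
        return (-u + math.sqrt(u ** 2 + 2 * a * d)) / a
    # EQ2: accelerate to v_max, then cruise the remaining distance
    return (v_max - u) / a + (d - d_max) / v_max

def advised_speed(d, t, u):
    """EQ4: target speed so that the average of u and U covers d within t."""
    return 2 * d / t - u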
4.3.5 Rewards
The following equations are used to find the reward value for training the rule-based policies; the author explains each equation step by step. The final reward is constructed from the accumulated total travel time and the accumulated emission. Several parallel runs are executed to optimize the values chosen by the optimizer.
First, the author explains how the accumulated total travel time is measured.

TT = \sum_{runs} \sum_{vehicles} (duration_i + departDelay_i)

Eq 6 : Travel time calculation
Equation 6 gives the travel time of all vehicles over the whole simulation in each parallel run. The duration represents how long a particular vehicle runs in the simulation; the depart delay is the delay caused when a vehicle enters the simulation at the beginning because there is no space in the lane.

ATT = \sum_{vehicles} \frac{duration_i + departDelay_i}{routeLength_i}

Eq 7 : Accumulated travel time calculation
Equation 7 derives from equation 6. Here, the route length is the total distance a vehicle travels from start to finish; it is used to express the accumulated travel time per km.
Next comes the accumulated emission, which is the other factor relevant for constructing the reward.

E = \sum_{runs} \sum_{vehicles} CO2_i

Eq 8 : Accumulated emission calculation (per g)
Equation 8 calculates the accumulated emission (CO2) of all vehicles over the whole simulation in each parallel run.

AE = \sum_{vehicles} \frac{CO2_i}{routeLength_i}

Eq 9 : Accumulated emission calculation (per km)
Equation 9 is used to express the accumulated emission per km. At last, the author introduces the reward function, using the previously calculated accumulated travel time (per km) and accumulated emission (per km) together with two normalization coefficients, applied separately to travel time and emission: the travel time normalization is 1/120 and the emission normalization is 1/150; the values of these constants were found experimentally. Alpha controls the influence of travel time and emission on the final reward; in other words, alpha is a weighting factor. Alpha = 1 means only the travel time matters for the final reward.

R = \alpha \cdot \frac{ATT}{120} + (1 - \alpha) \cdot \frac{AE}{150}

Eq 10 : Reward calculation
4.3.6 Policy parameters
Optimizing variable | Description
Margin | Extra time added to the traffic light cycle, used to control the vehicles (all vehicles). The extended GLOSA uses the margin as a parameter when calculating the arrival time at the intersection.
Extra delay for the first left turner | This duration is added to the first left turner's targeted arrival time in the GLOSA algorithm in order to slow it down. This allows straight-going vehicles to overtake the left turner easily.
Table 11 : Optimizing variables
A suitable optimization algorithm is selected depending on the nature of the problem; here, the problem is a stochastic, discontinuous function. The author uses the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm for the optimization [45]. The optimizer runs until it achieves the best-suited margin and left-turner delay, i.e., until it gains the highest reward value; if it does not find the best-suited values, the optimizer runs a specified number of turns and picks the best-suited values out of all turns. Further, it is impossible to minimize both objective quantities simultaneously; for that reason, the author uses alpha as a weighting factor to combine both accumulated figures into one single figure. The author needs to provide starting values and bounds for the margin and the delay.
Bound values: margin from 0.1 to 5 and delay from 0 to 40
Starting values: margin = 2 and delay = 31
4.4 Summary
The author introduced the DQN and the rule-based policy approach. The chapter mainly discussed the major steps of creating RL algorithms and further described the design, components, and major processes/functions of both solutions. Next, the report defined the architecture of the solutions, the observation and action spaces, and the reward functions. For the DQN, two architectures were designed: the author divided the design into single-agent and multi-agent based variants. In the multi-agent based DQN, seven neurons were placed in the input layer and two in the output layer; the single-agent architecture was more complicated than the multi-agent architecture due to the many parameters. For the rule-based approach, the author introduced the rules and their expected results; the rule-based policy mainly consists of three rules which decide the best-suited lane for each vehicle in every simulation step. The extended GLOSA, which contains several modifications in order to support the rules, was used to calculate the speed of each vehicle. At last, the chapter discussed the optimization process used to improve the rules and the extended GLOSA.
5 Implementation
This chapter focuses on the implementation aspects of the proposed solutions. First, it introduces the software stack the author uses for implementing the solutions. Next, it discusses the steps to follow for both approaches: FLOW requires several implementation steps in order to create an RL prototype, and the rule-based policies also consist of several components. Code snippets are provided when explaining the implementation aspects.
5.1 Technical details
5.1.1 Deep Q network approach
Software Description
FLOW [11] | FLOW is used as a connector linking the reinforcement learning library and the SUMO microscopic traffic simulator. FLOW uses several other third-party libraries: Theano [46], OpenAI Gym [37], and SUMO Traci.
SUMO [10] | The microscopic traffic simulator which provides the environment for the RL agents
Rllab [36] | The reinforcement learning library used to train the Deep Q-network
Python | All of the above frameworks are written in Python
Table 12 : Software - DQN approach
5.1.2 Rule-based policy approach
Software Description
Python | All components are written in Python
SUMO | As in the above approach, SUMO is the simulation environment
SUMO-Traci [47] | Traci is able to change the state of the simulation at runtime. Using Traci, the user can pass commands to the simulation and retrieve data from it
SciPy [48] | SciPy provides Python-based optimization with various optimization algorithms
Table 13 : Software - rule-based policy approach
5.2 Implementation of Deep Q network approach
5.2.1 Flow configuration
The following diagram shows the steps the user needs to follow in order to create a FLOW-based SUMO simulation.
5.2.2 Dynamic SUMO network configuration
Figure 43 : FLOW steps
Please note that here the author only explains how to create a FLOW simulation; more information regarding FLOW's architecture and functionality was given in the literature review chapter.
5.2.2.1 FLOW Generator creation
The user needs to create a custom Generator class which extends the base Generator and overrides several inherited methods. Usually, in the SUMO context, the user needs to provide node, edge, and route information in order to generate the network configuration file. Similarly, the FLOW creators provide several methods for supplying node, edge, and route information to create the net file.
specify_nodes() states the locations of the nodes relative to one specific node. Here the author only considers an intersection scenario; therefore, the center node which creates the intersection is considered the main node (junction node), and all other nodes are positioned relative to it.
The following code snippet shows the node array. The proposed intersection scenario consists of 9 nodes, and the coordinates of the "center" node are (0, 0).
Figure 44 : Specify nodes code snippet
specify_edges() gives the edge details, mainly the type of the edge, its length, which nodes contribute to creating the edge, etc. The following code snippet shows the edge array, which consists of 9 edges.
Figure 45 : Specify edges code snippet
specify_routes() states the route names and the edges contributing to each route.
Figure 46 : Specify route code snippet
The above code shows the routing array. As an example, the route "left" consists of 3 edges ("left", "altleft1" and "right"), which create the road from left to right through the intersection.
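Since the snippets themselves are only available as figures, the following sketch shows the general shape of such a Generator subclass; the base class is stood in by a stub, and the node/edge values are illustrative, not the author's exact network.

class Generator:  # stand-in for FLOW's base Generator class (import path varies by version)
    pass

class IntersectionGenerator(Generator):

    def specify_nodes(self, net_params):
        # Positions are relative to the junction node "center" at (0, 0).
        return [
            {"id": "center", "x": "0",    "y": "0"},
            {"id": "left",   "x": "-500", "y": "0"},
            {"id": "right",  "x": "500",  "y": "0"},
            {"id": "top",    "x": "0",    "y": "500"},
            {"id": "bottom", "x": "0",    "y": "-500"},
        ]

    def specify_edges(self, net_params):
        # Each edge connects two nodes and carries a length and lane count.
        return [
            {"id": "left",  "from": "left",   "to": "center", "length": 500, "numLanes": 2},
            {"id": "right", "from": "center", "to": "right",  "length": 500, "numLanes": 2},
        ]

    def specify_routes(self, net_params):
        # The route "left" drives straight through the intersection.
        return {"left": ["left", "right"]}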
5.2.2.2 FLOW Scenario creation
Similar to the Generator creation, the user needs to create a custom Scenario class which overrides several methods of the base Scenario class.
specify_edge_starts(): Here the user provides the starting coordinates of each edge relative to one specific edge. The following code snippet shows a part of the edge array: the "bottom" edge is taken as the main edge, and other edges like "top" are given their starting positions relative to the "bottom" edge.
Figure 47 : Specify edge starts code snippet
specify_intersection_edge_starts() and specify_internal_edge_starts() are two further functions similar to specify_edge_starts(). The only difference is that the internal-edge function targets the internal edges of an intersection, which allow vehicles to pass through the intersection in various directions.
5.2.2.3 FLOW environment creation
The user needs to create a custom environment class extending the base Environment class. As usual, there are several inherited methods which need to be overridden.
action_space(): The following function declares the possible actions the agent can execute at a given time. The agent can only change the lane and the speed of the vehicles; the author also set upper and lower bounds for acceleration and deceleration. As FLOW currently only supports single-agent scenarios, the actions for all vehicles living in the simulation must be provided by one single DQN. As an example, if a simulation has 30 RL agents, the DQN should provide 60 (2*30) actions.
Figure 48 : Action space code snippet
observation_space(): Here the author provides the observations/inputs of the DQN. The following code snippet shows all inputs to the DQN; for 30 RL agents, the DQN should support 240 (30*8) inputs/observations. The user needs to provide lower and upper bounds for all inputs. More information regarding the inputs was already given in the Concept chapter.
Figure 49 : Observation space code snippet
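The general shape of these two methods is sketched below, assuming Gym-style Box spaces; the bounds and the class body are illustrative, not the author's exact code.

from gym.spaces import Box

class IntersectionEnv:  # in FLOW this would extend the base Environment class
    N_AGENTS = 30  # number of RL vehicles (illustrative)

    @property
    def action_space(self):
        # Two actions per vehicle: an acceleration value and a lane-change value,
        # bounded as described above (the bounds here are assumed).
        return Box(low=-3.0, high=3.0, shape=(2 * self.N_AGENTS,))

    @property
    def observation_space(self):
        # Eight observations per vehicle (speed, lane, distance, ...);
        # in the real implementation each input has its own lower and upper bound.
        return Box(low=0.0, high=1000.0, shape=(8 * self.N_AGENTS,))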
get_state() retrieves the observations of the following state after the execution of the actions; here the user can call getters which are already defined by FLOW.
Figure 50 : Get state code snippet
There are several other functions, like apply_rl_actions(), which executes the previously defined actions, and compute_reward(), which calculates the reward as specified in the Concept chapter.
5.2.2.4 Master configuration creation
The main configuration connects the previously defined Generator, Scenario, and Experiment classes. Furthermore, the user needs to enter additional information like the lengths of the vertical and horizontal lanes, the number of agents to create, velocity bounds for the vehicles, and much other technical information.
Figure 51 : Master configuration code snippet
The following code snippet shows how the author used the predefined Gaussian MLP policy, which consists of a network with 2 hidden layers (64*64). Further, FLOW uses the TRPO algorithm to adjust the network weights.
Figure 52 : DQN policy
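In rllab, this configuration typically looks like the following sketch; the class names are rllab's, while env is assumed to be the FLOW-created environment.

from rllab.policies.gaussian_mlp_policy import GaussianMLPPolicy
from rllab.baselines.linear_feature_baseline import LinearFeatureBaseline
from rllab.algos.trpo import TRPO

policy = GaussianMLPPolicy(
    env_spec=env.spec,      # env: the FLOW-created environment (assumed)
    hidden_sizes=(64, 64),  # two hidden layers with 64 neurons each
)
algo = TRPO(
    env=env,
    policy=policy,
    baseline=LinearFeatureBaseline(env_spec=env.spec),
    n_itr=1000,             # number of training iterations
)
algo.train()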
5.3 Implementation of Rule-based policy approach
5.3.1 Data extraction
The first step is data extraction using SUMO-Traci, which allows extracting data at runtime in every simulation step. Traci provides a wide range of functions, as seen in the following code snippet. Here the author collects information about a specific vehicle and its neighboring vehicles. The author uses a Python dictionary to store the information of all agents and several lists (e.g., leaders_rightlane_straight_list) to store information about neighboring vehicles. Further, there are several flags which indicate whether a rule has already been applied to a specific vehicle.
Figure 53 : Data extraction code snippet
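Since the snippet is only reproduced as a figure, the following sketch shows the kind of Traci calls involved; the dictionary layout and the leader search distance are illustrative.

import traci

def extract_state(veh_id):
    """Collect per-vehicle observations from the running SUMO simulation."""
    return {
        "speed": traci.vehicle.getSpeed(veh_id),
        "lane": traci.vehicle.getLaneIndex(veh_id),
        "position": traci.vehicle.getLanePosition(veh_id),
        # getLeader returns (leader_id, gap) or None if there is no leader
        "leader": traci.vehicle.getLeader(veh_id, 100.0),
    }

# Called once per simulation step for every controlled vehicle:
# states = {v: extract_state(v) for v in traci.vehicle.getIDList()}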
5.3.2 Rules implementation
The rules were already introduced in the previous chapter. The following code snippet shows rules 1 and 2. The author uses Traci commands to control the agents at runtime, mainly changeLane(), which moves a vehicle from one lane to another. Here, rule 1 changes left turners from the right lane to the left lane (lane 0 to 1 in the SUMO environment), and rule 2 changes a straight-going vehicle, if it is following a left turner, from the left lane to the right lane (1 to 0).
Figure 54 : Rule 1 and 2 code snippet
The code snippet relevant to rule 3 can be found in the appendix.
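A rough sketch of rules 1 and 2 using Traci's lane change command follows (lane indices as above; the turn and follower tests are placeholders for the author's route checks):

import traci

LEFT_LANE, RIGHT_LANE = 1, 0

def apply_lane_rules(veh_id, is_left_turner, follows_left_turner):
    lane = traci.vehicle.getLaneIndex(veh_id)
    # Rule 1: left turners belong in the left lane.
    if is_left_turner and lane == RIGHT_LANE:
        traci.vehicle.changeLane(veh_id, LEFT_LANE, 10.0)   # hold the lane for 10 s
    # Rule 2: straight-going vehicles behind a left turner move to the right lane.
    elif not is_left_turner and lane == LEFT_LANE and follows_left_turner:
        traci.vehicle.changeLane(veh_id, RIGHT_LANE, 10.0)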
5.3.3 Extended-GLOSA implementation
All equations and conditions needed for creating the extended GLOSA were already introduced in the Concept chapter, and the full algorithm is added in the appendix. Here the author discusses the implementation aspects.
The proposed SUMO traffic light has 8 phases in one cycle. The following code snippet shows the available phases: phase 1 is a green phase for all vehicles traveling from left to right and vice versa, and phase 3 is the extended green phase for left turners, which is not available for straight-going vehicles. The following code snippet finds the next green light phase for a given current phase.
Figure 55 : Traffic cycle
The next code snippet shows how the author slows down vehicles during red phases. The program mainly has 2 ways to handle the slowdown of a vehicle. First, it checks whether the vehicle can arrive during the next green phase or the following one, using the previously calculated arrival time; it also considers the extended green phase for left turners. The program has an extreme slowdown technique for the first left turner, which is controlled by the "slow_down_flag".
For slowing down a vehicle, the author uses SUMO's slowDown(), which reduces a specific vehicle's speed to a given value over a given time duration.
Figure 56 : Extended GLOSA code snippet
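The Traci call at the heart of this step looks roughly as follows; the target speed is the advised speed from the Concept chapter, and the duration value is illustrative.

import traci

def slow_to_green(veh_id, distance, arrival_time, current_speed):
    """Slow a vehicle so that it arrives at the junction during a green phase."""
    target = max(0.0, 2 * distance / arrival_time - current_speed)  # EQ4
    traci.vehicle.slowDown(veh_id, target, 5.0)  # reach the target speed within 5 s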
5.3.4 Optimizer implementation
The following implementation shows how the author provided the starting values and bounds for the arrival margin and the delay of the first left turner, as mentioned in the Concept chapter. The author uses the SciPy framework and minimize() to call the built-in optimization process, here with the L-BFGS-B optimization algorithm; during the optimization process, the author tried various optimization algorithms to find the best fit. After this is executed, the optimizer runs until it finds the lowest accumulated travel time and emission figure.
Figure 57 : Optimizer code snippet
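In SciPy terms, with the bounds and starting values from the Concept chapter, the call looks roughly like this; run_simulations is a placeholder for launching the parallel SUMO runs and computing the combined figure (Eq 10).

from scipy.optimize import minimize

x0 = [2.0, 31.0]                    # starting values: margin, left-turner delay
bounds = [(0.1, 5.0), (0.0, 40.0)]  # bounds from the Concept chapter

def objective(params):
    margin, left_turner_delay = params
    return run_simulations(margin, left_turner_delay)  # placeholder, see above

result = minimize(objective, x0, method="L-BFGS-B", bounds=bounds)
print(result.x)  # optimized margin and delay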
5.3.5 Reward calculation
This is part of the reward function which is called by the above minimize() function. After every simulation, SUMO generates 2 separate files which contain the emission figures and the travel times of all vehicles. The program runs several simulations in parallel, and each simulation creates these files separately. The program reads all files and calculates the accumulated travel time and emission; then, considering the normalization factors mentioned before, the final reward is calculated.
Figure 58 : Reward calculation code snippet
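SUMO's tripinfo output, for example, can be read as follows; duration, departDelay, and routeLength are standard SUMO tripinfo attributes, while the file name is illustrative.

import xml.etree.ElementTree as ET

def accumulated_travel_time(tripinfo_file="tripinfo.xml"):
    """Sum the per-km travel time over all finished vehicles (Eq 7)."""
    total = 0.0
    for trip in ET.parse(tripinfo_file).getroot().iter("tripinfo"):
        tt = float(trip.get("duration")) + float(trip.get("departDelay"))
        total += tt / (float(trip.get("routeLength")) / 1000.0)  # per km
    return total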
5.4 Summary
This chapter explained how the author implemented both solutions. FLOW was used to implement the DQN; according to the guidelines provided by the FLOW development team [11], the author implemented the required classes (Generator, Scenario, and Experiment). Further, the reward was introduced; it was based on the accumulated travel time and emission of each vehicle. Important code snippets were provided where necessary. Next, the rule-based approach was described. The author introduced the necessary inputs to the rules; the input vector was similar to that of the earlier approach, but it has several additional inputs, like data related to leading vehicles. Further, the chapter explained the implementation aspects of the rules and how SUMO-Traci was used, along with its important functions. Finally, the optimizer, built using SciPy, was introduced.
6 Evaluation
This chapter explains the results and the main observations the author obtained after executing the previously implemented prototypes. Issues which occurred during the testing phase are also discussed, and solutions to the discovered problems are explained briefly.
6.1 DQN based approach
6.1.1 Tests
As mentioned in the above chapters, the author used FLOW to create the DQN. During the training phase of the DQN, the author completed the following subtasks.
Preliminary tests: Initially, the simulation was run for a very short time; the number of simulation steps and iterations was low. The idea was to see how FLOW performs under test circumstances. The runs started at 10 minutes (execution time) and were later extended to 1 hour; the duration of a single simulation is 5 minutes.
The next step was to increase the simulation time (up to 15 minutes) and the number of iterations (up to 1000). The total execution time was approximately 20 hours for this set of experiments.
Changing the number of hidden layers and neurons per hidden layer: The experiments started with 32 neurons in a single hidden layer, and then two other variations were tried.
32*32: 2 hidden layers with 32 neurons in each.
64*64: likewise 2 hidden layers, but with 64 neurons in each layer.
Parallel runs: The author experimented with up to 4 parallel runs for 20 hours.
6.1.2 Results and discussion of DQN approach
Preliminary tests: These checked whether FLOW generates the intersection scenario as expected; there were no errors in FLOW's capabilities.
Single-agent architecture: The author was able to see how the agents react to RL. The starting positions of all vehicles in the first simulation step varied, as FLOW uses a custom algorithm introduced by the author to generate the starting positions of the vehicles on the lanes; that means every simulation starts with different starting positions. One advantage of this approach is that it creates a new scenario each time, instead of the vehicles starting at the same positions in every simulation.
In the first few iterations, the movement of the vehicles was extremely slow, and there were not many lane changes. As mentioned in the literature review, RL takes a long time to learn the best actions for each state, and the experience gained from the first set of iterations is not good enough to find the best action for a given state. Due to the slowness, the vehicles did not reach the intersection, and all vehicles were still structured in the initial partition. The following figure illustrates the observed output.
Figure 59 : FLOW results1
After 5 iterations, the author was able to see the vehicles creating groups (partitions), which means they broke away from the original partition as the leading vehicles tried to reach the intersection at a higher speed than in the initial runs. The author observed higher speeds from the leading vehicles of each partition, and one or two vehicles passed the intersection. Further, there were many overtakes in every step.
Figure 60 : Flow results2
Running 50 iterations took around 20 hours, and the output still looked similar. From the results, the author could mainly see that FLOW was trying various actions to find the best actions for all vehicles; however, after the 5th iteration, the changes between iterations were small.
When adding more neurons to the hidden layers (e.g., 64*64), training was much slower.
The above results are only valid for the single-agent approach.
6.1.3 Issues observed
Extreme slowness of the single-agent approach: As mentioned above, after the initial iterations, the results between iterations were very similar; there was little variation even after running for 20 hours. The author was able to see lane changes and speed changes, but the improvement was very slow. In other words, the issue here is the training time: training needs more time than expected. Due to the timeline of the project, the resources, and the preliminary inputs the author received from the rule-based policy approach, the author focused the experiments on the rule-based approach rather than the DQN.
FLOW-based multi-agent approach: Due to the extremely slow progress of the above tests, the author took steps to extend FLOW towards a multi-agent approach. However, due to the design of FLOW, it was not possible to convert FLOW into a multi-agent system with minor changes.
6.2 Rule-based policy approach
6.2.1 Tests
SUMO route files were used to generate various traffic flows. By changing flow attributes such as "period", "vehsPerHour" (vehicles per hour), and "probability", various traffic scenarios were generated [49], and the number of vehicles could be set for each direction/route; a small example follows the list below. All these scenarios were tested with the following variations.
GLOSA only scenario: The main goal here is to see how the traditional GLOSA algorithm performs for a given scenario.
Extended-GLOSA with rule-based system: Instead of traditional GLOSA, the author used the Extended-GLOSA algorithm together with the newly developed rules, so the system is equipped with both a speed and a lane advisory.
Estimation: This test step verifies whether the proposed approach functions as expected. Vehicles enter in an already arranged order: according to the proposed approach, straight going vehicles need to arrive earlier than left turners, so this order is already established when the vehicles enter. No speed or lane change advisory is used here.
Random SUMO runs: No speed or lane advisory is provided, and no order is imposed as in the estimation step above. All vehicles are controlled by SUMO's built-in models (Krauß model [49]).
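The following is a minimal sketch of how such route-file variants can be generated. The edge and route IDs are illustrative, not the thesis network, and SUMO accepts exactly one of vehsPerHour, period, or probability per flow element [49].

ROUTE_TEMPLATE = """<routes>
    <route id="straight" edges="approach_in straight_out"/>
    <route id="left_turn" edges="approach_in left_out"/>
    <flow id="f_straight" route="straight" begin="0" end="3600" vehsPerHour="{straight_rate}"/>
    <flow id="f_left" route="left_turn" begin="0" end="3600" probability="{left_prob}"/>
</routes>
"""

def write_routes(path, straight_rate, left_prob):
    # write one traffic scenario variant to a SUMO .rou.xml file
    with open(path, "w") as f:
        f.write(ROUTE_TEMPLATE.format(straight_rate=straight_rate, left_prob=left_prob))

write_routes("scenario.rou.xml", straight_rate=400, left_prob=0.05)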
Another important factor is that SUMO can introduce random behavior when loading vehicles, so it never runs exactly the same simulation twice.
Further tests were carried out for the above variants with simulation times between 10 and 60 minutes per run.
The final results for the variants were evaluated by calculating the accumulated travel time and emission.
6.2.2 Results and discussion
First, the author explains how the accumulated travel time changes with the above variants.
Figure 61: Travel time evaluation (accumulated travel time in % per category for Random, Estimation, GLOSA, and GLOSA+rules)
Category 1: number of approaching straight going vehicles in the left lane = 11 and left turn vehicles = 5 for a single green phase.
Category 2: number of approaching straight going vehicles in the left lane = 11 and left turn vehicles = 4 for a single green phase.
Category 3: number of approaching straight going vehicles in the left lane = 11 and left turn vehicles = 3 for a single green phase.
Category 4: number of approaching straight going vehicles in the left lane = 13 and left turn vehicles = 3 for a single green phase.
This figure was constructed from the SUMO experiments described above. According to the traffic light cycle used in the experiments, the maximum number of vehicles that can pass during the green phase plus the extended green phase (reserved for left turners) is 15. For the first 3 categories the number of straight going vehicles was fixed at 11, since only 11 can pass the intersection during the whole green phase, while the number of left turners was varied to check how the rules react. Up to 5 left turners can pass the intersection during the extended green phase. The last category examines the delay when the number of vehicles exceeds 15; in other words, it deliberately creates a traffic jam so that the author could see how the rules react to it.
Next, the author explains how the accumulated emission changes for the same 4 categories and variants. All the above conditions apply to this experiment too.
Figure 62: Emission evaluation (accumulated emission in % per category for Random, Estimation, GLOSA, and GLOSA+rules)
From the above results, this section can draw the following conclusions:
The newly introduced Extended-GLOSA+rules improves accumulated travel time by up to 10% compared with traditional GLOSA.
According to the estimation, this can be reduced even further.
Compared with random flows, the new prototype improves accumulated travel time by 15%.
The new prototype improves accumulated emission by up to 12% compared with GLOSA.
When more straight going vehicles are added, both the accumulated travel time and emission percentages decrease significantly.
6.2.3 Optimization results and discussion
Optimization process was carried out as mentioned in Concept chapter. Even though
optimizer ran for several hours with different optimization algorithms, it was unable to
find the most optimum values for arrival margin and left turner delay. Due to that the
author ran a parameter grid scan.
The idea was to examine a range of values for the algorithm parameters and check
monitoring values- accumulated travel time and emission figures. Further several
parallel simulations (15 runs) were run with various simulation times.
Arrival margin: 0.1 to 5
Left turner delay: 0 to 40
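A minimal sketch of such a grid scan follows. The step widths and the run_simulation helper are illustrative assumptions, not the thesis optimizer code.

import itertools
import numpy as np

def run_simulation(arrival_margin, left_turner_delay):
    """Placeholder: launch a SUMO run with these rule parameters and
    return (accumulated_travel_time, accumulated_emission)."""
    return 0.0, 0.0  # replace with the actual simulation call

arrival_margins = np.linspace(0.1, 5.0, 10)   # 10 values covering 0.1 .. 5
left_turner_delays = np.arange(0, 45, 5)      # 0, 5, ..., 40

results = []
for margin, delay in itertools.product(arrival_margins, left_turner_delays):
    travel_time, emission = run_simulation(margin, delay)
    results.append((margin, delay, travel_time, emission))

# e.g. pick the parameter pair with the lowest accumulated travel time
best = min(results, key=lambda r: r[2])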
After running several grid tests with various traffic flows, the following results were
achieved.
Figure 63: Optimizer travel time results
Figure 64: Optimizer emission results
As seen in the above diagrams, the grid test did not reveal any clear patterns either. Another main observation was that the optimization process was slower than expected. The reasons for this are discussed in the next section.
6.2.4 Further discussion of Grid test
Here the author explains the reasons for the above results in more detail.
One major issue was caused by the newly designed rules. This does not mean the rules were not working properly; as the above results show, they worked as expected, and the rule-based policy approach was clearly better than existing GLOSA. However, closer examination of the random scenarios generated by SUMO revealed situations in which no gaps were available for the required lane changes. For example, when rule 3 was applied to a certain straight going vehicle in the right lane but no space opened up in the left lane before it reached the intersection (the marked vehicle in the following figure), the proposed order could not be created. In parallel runs with randomness, there were many such situations where a vehicle was unable to perform the lane change as expected. This was the major reason for the unclear results of the grid test and the optimization.
Figure 65: Gap creation issue
Slowness of simulation execution: The author observed unexpected slowness when running the rule-based algorithm and the optimization. All rules and the Extended-GLOSA were executed in each and every simulation step for all monitored vehicles, which was the main reason for the slowness.
6.2.5 Solutions and improvements
A novel gap creation strategy is needed to create space when no gap is available in the target lane; this requires cooperative decision making among neighboring vehicles.
To reduce the slowness of the simulation, further experiments need to be carried out to find the optimal execution frequency for the rules and GLOSA; a sketch of such a reduced calling frequency is given after this list.
Even with the proposed rule set, there were situations where a few left turners were unable to cross the intersection and had to stop next to it during the red light. This happens when the number of left turners is higher than the rules expect. Special rules need to be introduced to address this situation.
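As an illustration of the proposed frequency reduction, the following is a minimal sketch assuming a TraCI control loop; RULE_PERIOD, the configuration file name, and the two apply_* routines are hypothetical placeholders, not the thesis code.

import traci

RULE_PERIOD = 5  # hypothetical tuning parameter: steps between rule executions

def apply_extended_glosa():
    pass  # placeholder for the Extended-GLOSA speed advisory

def apply_rules():
    pass  # placeholder for rules 1-3 (lane advisories)

traci.start(["sumo", "-c", "scenario.sumocfg"])  # assumed scenario config
step = 0
while traci.simulation.getMinExpectedNumber() > 0:
    traci.simulationStep()
    if step % RULE_PERIOD == 0:          # run advisories only every 5th step
        apply_extended_glosa()
        apply_rules()
    step += 1
traci.close()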
6.3 Summary
This chapter explained the results obtained during testing. First, the author presented the results of the FLOW based solution. Preliminary tests were run to verify that it created the designed SUMO road network, and several further tests evaluated the DQN with a few variations. A custom algorithm was created to generate vehicles for each lane, and the proposed scenario was created as expected. However, FLOW supported only a single-agent mechanism at the time; the development team is currently working on a multi-agent support toolkit, but due to the project timeline the author was unable to use a multi-agent version of FLOW. The model was trained for up to 20 hours with several parallel simulations. It responded well during the first few hours and showed promising results: FLOW was able to order vehicles and adjust speeds. After the preliminary runs, however, no further progress was visible even though training continued for several hours (nearly a day). The main issue of this approach was discussed in detail in this chapter.
According to the currently available results, the rule-based policy was more successful than FLOW. The implemented rules worked as expected, and the proposed solution performed better than traditional GLOSA. The optimizer, however, produced unexpected results, so a parameter grid scan was carried out to investigate further. The results were examined thoroughly and the issues that caused the unexpected outcomes were identified. Finally, suitable solutions were suggested in order to obtain better results in the future.
7 Conclusion
This chapter summarizes the work that has been done during the research. The
challenges that the author has faced and future work to improve the proposed
solution are outlined.
7.1 Challenges
The author faced several challenges during the research.
FLOW initial simulation setup: The FLOW framework is a newly introduced toolkit, and its development is still ongoing; at the moment it supports only simple scenarios. The author had to spend a considerable amount of time creating the proposed road network and traffic flow.
Finalization of GLOSA: The author found various approaches to implementing GLOSA. This research used the GLOSA algorithm suggested by [5], but during the implementation phase it did not work as expected and extra steps were needed to solve the issues. Extensive tests were then carried out to check that GLOSA was functioning properly.
FLOW multi-agent approach setup: As mentioned in the evaluation chapter, FLOW only supports a single-agent architecture. The FLOW development team has recently started implementing multi-agent support, but the author was unable to use this new version due to the research timeline and therefore investigated the possibility of converting the existing FLOW to a multi-agent architecture.
Initial development of rules: After the unexpected issues with FLOW, the research focused on developing a rule-based RL policy. Due to the nature of the problem, the initial design phase of the rules was complicated; the main questions were how to arrange the approaching vehicles and what the best arrangement would be.
Investigating optimization issues: As discussed in the evaluation chapter, the optimization produced unexpected results even after several runs. A comprehensive parameter scan allowed the author to identify the causes of these unexpected results.
7.2 Future improvements
DQN approach with multi-agent architecture: The FLOW based DQN solution was very slow due to the single-agent architecture. It should run in a GPU environment, and the necessary hardware needs to be provided. One disadvantage of DQN is its long training time, and it needs even more training time for complex problems because of its reward-driven trial and error mechanism.
Extension of rules: Several issues appeared during the optimization phase, especially the need to create gaps between vehicles for lane changes. The current version does not handle this; the rules only execute a lane change when space is already available.
Speeding up simulations: The simulation was slow and needs to be sped up by reducing the calling frequency of the rules and GLOSA.
7.3 Concluding remarks
This research introduces a reinforcement learning based approach to optimize the traffic flow at an intersection, focusing on reducing the travel time and emission of vehicles. The approach is most successful when the scenario contains more straight going vehicles than left turners. The author proposes a vehicle reordering mechanism that establishes a specific sequence of vehicles in the traffic flow before it reaches the intersection.
A separate chapter describes the basics of reinforcement learning: its theoretical aspects, how it works, the components of an RL scenario, and existing algorithms.
Before introducing the new approach, a literature review was carried out, pointing out the most important findings. It first introduced the basics of traffic engineering. Before discussing GLOSA variants, the author described the Car2X system, its structure, and how it works in a real environment. GLOSA is the only Car2X application discussed, as it serves as the baseline for the project. The literature review then focused on the various RL algorithms that have been used to optimize traffic flow and intersections; simple DQNs, CNNs, and more complex DQN variants are predominant in the existing research body.
The concept chapter introduced the DQN based solution and the rule-based policy. The DQN structure, observation space, action space, and reward functions were described in detail for both solutions. The DQN network had 7 neurons in the input layer and 2 in the output layer; the hidden layers used 32*32 (2 hidden layers with 32 neurons each) and 64*64 architectures. The next solution was the rule-based policy. It consists of several rules that control each approaching vehicle in every simulation step. The rules were designed to slow down left turners and let more straight going vehicles overtake when possible, so that as many vehicles as possible pass the intersection during the next green phase. An improved GLOSA was introduced as the speed advisory; it supports the newly introduced lane change rules and lets vehicles reach the intersection exactly on time (when the green phase starts), so that they do not come to a full stop next to the intersection. An optimizer was responsible for tuning the rules further.
The implementation chapter explains the technical aspects of the solution and introduces all third-party frameworks. The author used SUMO [10], SUMO-TraCI [47], and FLOW [11] to implement the first solution; the rules were fully implemented using TraCI [47]. The chapter also contains important code snippets for creating the SUMO road network, the rules, and Extended-GLOSA.
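As an illustration, the following is a minimal TraCI sketch in the spirit of Rule 1 from the appendix (moving an approaching left turner into the left lane); the route naming convention and the lane-change duration are assumptions, not the thesis code.

import traci

def apply_rule1(vehicle_id, left_lane_index=1):
    """If the vehicle will turn left but is driving in the right lane,
    request a change to the left lane (rule 1)."""
    # assumption: left-turn routes are named with a "left" substring
    is_left_turner = "left" in traci.vehicle.getRouteID(vehicle_id)
    if is_left_turner and traci.vehicle.getLaneIndex(vehicle_id) != left_lane_index:
        # request the lane change and hold it for 10 s (assumed duration)
        traci.vehicle.changeLane(vehicle_id, left_lane_index, 10.0)
        return True   # corresponds to setting flag 1 in the pseudocode
    return False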
The final step was to evaluate the solutions and point out the issues. The FLOW based solution did not provide the expected results: even after simulations of up to 20 hours, progress was slow, although some small development was visible in the form of lane changes and speed changes. Due to the single-agent setup, the simulation was much slower than anticipated, and a FLOW multi-agent setup was not available during the development phase of this research.
In the next approach, the rules were evaluated together with GLOSA and proved 10 to 12% more efficient than existing GLOSA. The optimizer, however, again showed unexpected results; the author troubleshot the issue and described it extensively in the evaluation chapter, together with promising solutions. As mentioned in the introduction chapter, the initial proposal was to reorder approaching left turners and straight going vehicles in order to improve intersection throughput. This succeeded: the new rule-based policy dynamically changes the speed and lane of each vehicle to achieve a better sequence, and its output is more efficient than the currently existing GLOSA.
Bibliography
[1] J. Erdmann, "Combining Adaptive Junction Control with Simultaneous Green-Light-Optimal-Speed-Advisory," presented at the 2013 IEEE 5th International Symposium on Wireless Vehicular Communications (WiVeC), Dresden, Germany, 2013.
[2] K. Yang, I. Tan, and M. Menendez, "A reinforcement learning based traffic signal control algorithm in a connected vehicle environment," presented at the 17th Swiss Transport Research Conference (STRC 2017), Lausanne, 2017.
[3] B. Baker, O. Gupta, N. Naik, and R. Raskar, "Designing neural network architectures using reinforcement learning," arXiv preprint, 2016.
[4] J. Harding et al., "Vehicle-to-vehicle communications: Readiness of V2V technology for application," NHTSA, Technical Report DOT HS 812 014, 2014.
[5] K. Katsaros, R. Kernchen, M. Dianati, and D. Rieck, "Performance study of a Green Light Optimized Speed Advisory (GLOSA) application using an integrated cooperative ITS simulation platform," presented at the 2011 7th International Wireless Communications and Mobile Computing Conference (IWCMC), Turkey, 2011, p. 6.
[6] R. S., A. T., R. German, and D. Eckhoff, "Multi-hop for GLOSA Systems: Evaluation and Results From a Field Experiment," presented at the 2017 IEEE Vehicular Networking Conference (VNC), Torino, Italy, 2017.
[7] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. MIT Press, 2017.
[8] L. Fridman, J. Terwilliger, and B. Jenik, "DeepTraffic: Driving Fast through Dense Traffic with Deep Reinforcement Learning," arXiv preprint, Jan. 2018.
[9] R. S. Sutton, "Introduction: The Challenge of Reinforcement Learning," Boston: Springer, 1992.
[10] DLR, "SUMO - Simulation of Urban Mobility," company website, 2018.
[11] C. Wu, A. Kreidieh, K. Parvate, E. Vinitsky, and A. M. Bayen, "Flow: Architecture and Benchmarking for Reinforcement Learning in Traffic Control," 2017.
[12] V. Mnih, K. Kavukcuoglu, D. Silver, et al., "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[13] M. Coggan, "Exploration and exploitation in reinforcement learning," McGill University, CRA-W DMP project supervised by Prof. Doina Precup, 2004.
[14] M. Lauer and M. Riedmiller, "An algorithm for distributed reinforcement learning in cooperative multi-agent systems," in Proceedings of the Seventeenth International Conference on Machine Learning, 2000.
[15] T. Simonini, "Diving deeper into Reinforcement Learning with Q-Learning," Apr. 2018.
[16] T. V. Mathew and K. V. Krishna Rao, Fundamental Parameters of Traffic Flow. NPTEL, 2016.
[17] C. F. Daganzo, Fundamentals of Transportation and Traffic Operations, vol. 30. Oxford: Pergamon, 1997.
[18] S. Maerivoet and B. De Moor, "Traffic flow theory," Physics and Society, 2005, p. 33.
[19] D. Stephens and J. Schroeder, "Vehicle-to-infrastructure (V2I) safety applications performance," US Department of Transportation, Technical Report FHWA-JPO-16-253, 2013.
[20] IEEE Standards Association, "IEEE 802.11p."
[21] H. Stübing, "Car-to-X Communication: System Architecture and Applications," in Multilayered Security and Privacy Protection in Car-to-X Networks, Wiesbaden: Springer, 2013, pp. 9-19.
[22] German Association of the Automotive Industry, "simTD," project simTD, 2018.
[23] R. Baldessari and W. Zhang, "CAR-2-X Communication SDK - A Software Toolkit for Rapid Application Development and Experimentations," presented at the International Conference on Communications, Dresden, 2009.
[24] D. Eckhoff, B. Halmos, and R. German, "Potentials and Limitations of Green Light Optimal Speed Advisory Systems," presented at the 2013 IEEE Vehicular Networking Conference (VNC), Boston, USA, 2013.
[25] PyBrain, "PyBrain features," 2018.
[26] R. Marsetič, D. Šemrov, and M. Žura, "Road Artery Traffic Light Optimization with Use of Reinforcement Learning," presented at Intelligent Transportation Systems (ITS), 2014.
[27] PTV Group, "PTV Vissim," 2010.
[28] M. Liu, J. Deng, and W. Wang, "Cooperative Deep Reinforcement Learning for Traffic Signal Control," presented at the International Workshop on Urban Computing, Canada, 2017.
[29] US Department of Transportation, "Federal Highway Administration," 2018.
[30] H. Wei, G. Zheng, H. Yao, and Z. Li, "IntelliLight: A Reinforcement Learning Approach for Intelligent Traffic Light Control," presented at the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, 2018, p. 10.
[31] H. van Hasselt, A. Guez, and D. Silver, "Deep Reinforcement Learning with Double Q-Learning," 2016.
[32] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, and M. Lanctot, "Dueling network architectures for deep reinforcement learning," arXiv, 2015.
[33] J. Gao, Y. Shen, J. Liu, M. Ito, and N. Shiratori, "Adaptive Traffic Signal Control: Deep Reinforcement Learning Algorithm with Experience Replay and Target Network," 2017, p. 10.
[34] D. Isele, R. Rahimi, A. Cosgun, K. Subramanian, and K. Fujimura, "Navigating Occluded Intersections with Autonomous Vehicles using Deep Reinforcement Learning," presented at the IEEE International Conference on Robotics and Automation (ICRA), 2017.
[35] A. Mehta and E. Vinitsky, "Framework for control and deep reinforcement learning in traffic," presented at the 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), Yokohama, Japan, 2017.
[36] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, "Benchmarking Deep Reinforcement Learning for Continuous Control," in Proceedings of the 33rd International Conference on Machine Learning, 2016.
[37] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "OpenAI Gym," arXiv, 2016.
[38] M. Maulida, H. Y. Sutarto, et al., "Queue Length Optimization of Vehicles at Road Intersection Using Parabolic Interpolation Method," presented at the International Conference on Automation, Cognitive Science, Optics, Micro Electro-Mechanical System, and Information Technology, Indonesia, 2015.
[39] R. K. Yin, M. Keng, H. Chuo, and K. Tze, "Genetic algorithm based signal optimizer for oversaturated urban signalized intersection," presented at the IEEE International Conference on Consumer Electronics, Malaysia, 2016.
[40] X. Guo and Y. Song, "Research of traffic assignment algorithm based on adaptive genetic algorithm," presented at the 2011 IEEE 2nd International Conference on Computing, Control and Industrial Engineering (CCIE), 2011.
[41] J. Leng and Y. Feng, "Research on the Fuzzy Control and Simulation for Intersection Based on the Phase Sequence Optimization," in Measuring Technology and Mechatronics Automation, 2009.
[42] V. V. Sawake et al., "Review of Traffic Signal Timing Optimization based on Fuzzy Logic Controller," presented at the International Conference on Innovation in Information Embedded and Communication Systems, 2017.
[43] M. M. Abdelhameed, M. Abdelaziz, O. M. Shehata, et al., "A hybrid fuzzy-genetic controller for a multi-agent intersection control system," presented at the 2014 International Conference on Engineering and Technology (ICET), 2014.
[44] M. Choi, A. Rubenecia, and H. H. Choi, "Reservation-based cooperative traffic management at an intersection of multi-lane roads," presented at the 2018 International Conference on Information Networking (ICOIN), 2018.
[45] SciPy, "SciPy optimization algorithms (BFGS)," Nov. 2018.
[46] Theano, 2018.
[47] DLR, SUMO-TraCI. Berlin, Germany: DLR, 2018.
[48] SciPy. Open source, 2018.
[49] DLR, "SUMO User Documentation," Nov. 2018.
Appendix
Figure 66: Rules execution - simulation loop
Figure 67: Rule 3 code snippet
Figure 68: GLOSA arrival time calculation, all variations
Figure 69: Traffic light phase selector
Figure 70: Advised speed - GREEN phase
Figure 71: Advised speed next GREEN phase
Figure 72: SUMO initialization
Figure 73: SUMO main configuration
Figure 74: Route file SUMO
Variable initialization:
GREEN_PHASE_DURATION
LEFT_TURN_SEPERATE_PHASE
CYCLE_LENGTH
MAX_SPEED
ARRIVAL_MARGIN
MAX_NUMBER_VEHICLES_LEFT_LANE_POOL
LEFT_LANE_CHANGE_CONTROL_DISTANCE
Simulated_DATA // list that stores the data of all simulated vehicles for all steps
For step 0 to end
    Extract data from list Simulated_DATA
    Data extracted for each vehicle:
        Speed
        Acceleration
        Distance to the intersection
        Current lane
        Leading vehicles in left lane
        Leading vehicles in right lane
        Leading vehicles in left lane (straight going)
        Leading vehicles in left lane (left turning)
        Leading vehicles in right lane (straight going)
        Flag 1 // flag for rule 1
        Flag 2 // flag for rule 2
        Flag 3 // flag for rule 3
        Slow_down_flag // for the first left turner
    Save the extracted data in a list
    Get the current phase and calculate the remaining GREEN phase and the NEXT GREEN phase
    Get the current traffic light phase and assign a number to it
    Ex. if phase == "rrrrGGGgrrrrGGGg":
        phaseNo = 1
    // continues up to phase 8; phase 1 is GREEN for all vehicles,
    // phase 3 is GREEN only for left turners, all others are RED phases
    Calculate the remaining time of the current phase
    Next, calculate the remaining GREEN light time and the NEXT green light time
    Eg: phaseNo == 1
        Remaining_green = remaining
        NextGreenPhase = remaining + 59 // depends on the SUMO traffic light program
    Extended-GLOSA
    Set the speed mode of the vehicle to 0b11111
    If acceleration != 0
        // tmax: time the vehicle needs to reach MAX speed
        tmax = (MAX_SPEED - speed) / acceleration
        // dmax: distance the vehicle covers while reaching MAX speed
        dmax = tmax * ((MAX_SPEED + speed) / 2)
        if dmax > distance to traffic light
            // still accelerating at the light:
            // solve distance = speed * t + 0.5 * acceleration * t^2 for the arrival time
            arrival_time = -(speed / acceleration) + sqrt(((speed * speed) / (acceleration * acceleration)) + ((2 * distance) / acceleration))
        else
            arrival_time = tmax + ((distance - dmax) / MAX_SPEED)
    else
        // executes for very small speeds (e.g. 0.1, 0.0012) next to a traffic light,
        // mainly to avoid unexpected stopping
        // only the last 100 m are considered (current 1 km scenario)
        if (speed >= 0 and speed < 1)
            if distance <= position of the intersection / 10
                arrival_time = 1
        else if speed != 0
            arrival_time = distance / speed
    if phase == 1 or 3
        // GREEN, or the additional 6 second phase for left turners
        if remaining > arrival_time or (vehicle is a left turner and remaining + LEFT_TURN_SEPERATE_GREEN_TIME > arrival_time)
            traci setSpeed -1 // advised speed is MAX speed
        Else
            Get the time until the next green phase + ARRIVAL_MARGIN
            // calculate advised speed
            advisedSpeed = max(0, ((2 * distance) / nextGreenPhase) - speed)
            traci slowDown with advisedSpeed
    else
        Calculate the start of the NEXT GREEN phase for every phase
        Eg: if phase == 1
            nextGreenPhase = 59 + remaining + ARRIVAL_MARGIN
        If the vehicle is a left turner, also consider the additional 6 second phase
        if arrival_time > nextGreenPhase
            // the vehicle arrives in a green phase after the next one
            If vehicle == first left turner
                myGreenPhase = nextGreenPhase + CYCLE_LENGTH + GREEN_PHASE_DURATION
            Else
                myGreenPhase = nextGreenPhase + CYCLE_LENGTH
        else
            // the vehicle arrives in the next green phase
            If vehicle == first left turner
                myGreenPhase = nextGreenPhase + GREEN_PHASE_DURATION
            else
                myGreenPhase = nextGreenPhase
        advisedSpeed = max(0, ((2 * distance) / myGreenPhase) - speed)
        Issue slowDown command with advisedSpeed
    Rules execution
    For each vehicle
        Rule 1
        // move all left turners to the left lane
        If the route turns left and the current lane is the right lane
            Change to the left lane
            Set flag 1 = true
        Rule 2
        // check the left lane: if a straight going vehicle follows a left turner, move it to the right lane
        If the route goes straight and the vehicle is currently in the left lane
            Count the left turners leading in the left lane
            If leading vehicles in left lane (left turning) > 0
                Change to the right lane
                Set flag 2 = true
        Rule 3
        // move leading straight going vehicles from the right lane into the left lane,
        // ahead of the first left turner
        Left_lane_pool // number of vehicles that may lead the first left turner in the left lane
        // first, find the first left turner
        Get the list of vehicles currently travelling in the left lane
        For the vehicle list from end to start
            If the route is a left turning route
                First_left_turner_found
                Break
            Else
                Left_lane_pool++
        If First_left_turner_found == currently investigated vehicle
            If Left_lane_pool <= MAX_NUMBER_VEHICLES_LEFT_LANE_POOL
                // slow down mechanism for the first left turner
                Slow_down_flag = true
            Get the list of leading vehicles in the right lane
            While from end to start
                If Left_lane_pool <= MAX_NUMBER_VEHICLES_LEFT_LANE_POOL
                    Change the leading straight going vehicle to the left lane
                    Left_lane_pool++
                else
                    break
    Store the updated data in list Simulated_DATA
End for
Figure 75: Rules pseudo code
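For reference, the arrival-time and advised-speed calculations from the pseudocode above can be expressed as the following runnable Python sketch. Variable names are illustrative, not the thesis code; the accelerating branch solves distance = speed * t + 0.5 * acceleration * t^2, and the speed advice assumes a linear speed change up to the green phase start.

import math

def arrival_time(speed, acceleration, distance, max_speed):
    """Estimated time until the vehicle reaches the traffic light."""
    if acceleration > 0:
        t_max = (max_speed - speed) / acceleration      # time to reach max speed
        d_max = t_max * (max_speed + speed) / 2.0        # distance covered meanwhile
        if d_max > distance:
            # light reached while still accelerating:
            # solve distance = speed*t + 0.5*acceleration*t^2 for t
            return (-speed / acceleration
                    + math.sqrt((speed / acceleration) ** 2
                                + 2.0 * distance / acceleration))
        # accelerate to max speed, then cruise the remaining distance
        return t_max + (distance - d_max) / max_speed
    if speed > 0:
        return distance / speed                          # constant-speed case
    return float("inf")                                  # standing vehicle

def advised_speed(distance, time_to_green, current_speed):
    """Speed advice so that the vehicle arrives when the phase turns green:
    covering distance in time_to_green with a linear speed change from
    current_speed gives the target speed 2*distance/time_to_green - current_speed."""
    return max(0.0, 2.0 * distance / time_to_green - current_speed)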