

An Open-Source Framework for Adaptive Traffic Signal Control

    Wade Genders and Saiedeh Razavi

Abstract—Sub-optimal control policies in transportation systems negatively impact mobility, the environment and human health. Developing optimal transportation control systems at the appropriate scale can be difficult as cities' transportation systems can be large, complex and stochastic. Intersection traffic signal controllers are an important element of modern transportation infrastructure where sub-optimal control policies can incur high costs to many users. Many adaptive traffic signal controllers have been proposed by the community, but research is lacking regarding their relative performance difference - which adaptive traffic signal controller is best remains an open question. This research contributes a framework for developing and evaluating different adaptive traffic signal controller models in simulation - both learning and non-learning - and demonstrates its capabilities. The framework is used to first, investigate the performance variance of the modelled adaptive traffic signal controllers with respect to their hyperparameters and second, analyze the performance differences between controllers with optimal hyperparameters. The proposed framework contains implementations of some of the most popular adaptive traffic signal controllers from the literature: Webster's, Max-pressure and Self-Organizing Traffic Lights, along with deep Q-network and deep deterministic policy gradient reinforcement learning controllers. This framework will aid researchers by accelerating their work from a common starting point, allowing them to generate results faster with less effort. All framework source code is available at https://github.com/docwza/sumolights.

Index Terms—traffic signal control, adaptive traffic signal control, intelligent transportation systems, reinforcement learning, neural networks.

    I. INTRODUCTION

CITIES rely on road infrastructure for transporting individuals, goods and services. Sub-optimal control policies incur environmental, human mobility and health costs. Studies observe vehicles consume a significant amount of fuel accelerating, decelerating or idling at intersections [1]. Land transportation emissions are estimated to be responsible for one third of all mortality from fine particulate matter pollution in North America [2]. Globally, over three million deaths are attributed to air pollution per year [3].

W. Genders was a Ph.D. student with the Department of Civil Engineering, McMaster University, Hamilton, Ontario, Canada (e-mail: [email protected]).

S. Razavi is an Associate Professor, Chair in Heavy Construction and Director of the McMaster Institute for Transportation & Logistics at the Department of Civil Engineering, McMaster University, Hamilton, Ontario, Canada (e-mail: [email protected]).

© 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

    Manuscript received August X, 2019; revised August X, 2019.

In 2017, residents of three of the United States' biggest cities, Los Angeles, New York and San Francisco, spent between three and four days on average delayed in congestion over the year, respectively costing 19, 33 and 10 billion USD in wasted fuel and individual time [4]. It is paramount to ensure transportation systems are optimal to minimize these costs.

Automated control systems are used in many aspects of transportation systems. Intelligent transportation systems seek to develop optimal solutions in transportation using intelligence. Intersection traffic signal controllers are an important element of many cities' transportation infrastructure where sub-optimal solutions can contribute high costs. Traditionally, traffic signal controllers have functioned using primitive logic which can be improved. Adaptive traffic signal controllers can improve upon traditional traffic signal controllers by conditioning their control on current traffic conditions.

Traffic microsimulators such as SUMO [5], Paramics, VISSIM and Aimsun have become popular tools for developing and testing adaptive traffic signal controllers before field deployment. However, researchers interested in studying adaptive traffic signal controllers are often burdened with developing their own adaptive traffic signal control implementations de novo. This research contributes an adaptive traffic signal control framework, including Webster's, Max-pressure, Self-organizing traffic lights (SOTL), deep Q-network (DQN) and deep deterministic policy gradient (DDPG) implementations for the freely available SUMO traffic microsimulator to aid researchers in their work. The framework's capabilities are demonstrated by studying the effect of optimizing traffic signal controllers' hyperparameters and comparing optimized adaptive traffic signal controllers' relative performance.

II. BACKGROUND

A. Traffic Signal Control

An intersection is composed of traffic movements, or ways that a vehicle can traverse the intersection beginning from an incoming lane to an outgoing lane. Traffic signal controllers use phases, combinations of coloured lights that indicate when specific movements are allowed, to control vehicles at the intersection.

Fundamentally, a traffic signal control policy can be decoupled into two sequential decisions at any given time: what should the next phase be, and for what duration? A variety of models have been proposed as policies. The simplest and most popular traffic signal controller determines the next phase by displaying the phases in an ordered sequence known as a cycle, where each phase in the cycle has a fixed, potentially unique, duration - this is known as a fixed-time, cycle-based traffic signal controller. Although simple, fixed-time, cycle-based traffic signal controllers are ubiquitous in transportation networks because they are predictable, stable and effective, as traffic demands exhibit reliable patterns over regular periods (i.e., times of the day, days of the week).


However, as ubiquitous as the fixed-time controller is, researchers have long sought to develop improved traffic signal controllers which can adapt to changing traffic conditions.

Actuated traffic signal controllers use sensors and boolean logic to create dynamic phase durations. Adaptive traffic signal controllers are capable of acyclic phase sequences and dynamic phase durations to adapt to changing intersection traffic conditions. Adaptive controllers attempt to achieve higher performance at the expense of complexity, cost and reliability. Various techniques have been proposed as the foundation for adaptive traffic signal controllers, ranging from analytic mathematical solutions and heuristics to machine learning.

    B. Literature Review

Developing an adaptive traffic signal controller ultimately requires some type of optimization technique. For decades researchers have proposed adaptive traffic signal controllers based on a variety of techniques such as evolutionary algorithms [6], [7], [8], [9], [10], [11] and heuristics such as pressure [12], [13], [14], immunity [15], [16] and self-organization [17], [18], [19]. Additionally, many comprehensive adaptive traffic signal control systems have been proposed such as OPAC [20], SCATS [21], RHODES [22] and ACS-Lite [23].

Reinforcement learning has been demonstrated to be an effective method for developing adaptive traffic signal controllers in simulation [6], [24], [25], [26], [27], [28]. Recently, deep reinforcement learning has been used for adaptive traffic signal control with varying degrees of success [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39]. A comprehensive review of reinforcement learning adaptive traffic signal controllers is presented in Table I.

Readers interested in additional adaptive traffic signal control research can consult extensive review articles [40], [41], [42], [43], [44].

Although ample research exists proposing novel adaptive traffic signal controllers, it can be arduous to compare previously proposed ideas. Developing adaptive traffic signal controllers can be challenging, as many of them require defining many hyperparameters. The authors address these problems by contributing an adaptive traffic signal control framework to aid researchers in their work.

    C. Contribution

    The authors’ work contributes in the following areas:

• Diverse Adaptive Traffic Signal Controller Implementations: The proposed framework contributes adaptive traffic signal controllers based on a variety of paradigms, the broadest being non-learning (e.g., Webster's, SOTL, Max-pressure) and learning (e.g., DQN and DDPG). The diversity of adaptive traffic signal controllers allows researchers to experiment at their leisure without investing time developing their own implementations.

• Scalable, Optimized: The proposed framework is optimized for use with parallel computation techniques leveraging modern multicore computer architecture. This feature significantly reduces the compute time of learning-based adaptive traffic signal controllers and the generation of results for all controllers. By making the framework computationally efficient, the search for optimal hyperparameters is tractable with modest hardware (e.g., an 8-core CPU). The framework was designed to scale to develop adaptive controllers for any SUMO network.

All source code used in this manuscript can be retrieved from https://github.com/docwza/sumolights.

    III. TRAFFIC SIGNAL CONTROLLERS

Before describing each traffic signal controller in detail, elements common to all are presented. All of the included traffic signal controllers share the following: a set of intersection lanes $L$, decomposed into incoming lanes $L_{inc}$ and outgoing lanes $L_{out}$, and a set of green phases $P$. The set of incoming lanes with green movements in phase $p \in P$ is denoted $L_{p,inc}$ and their outgoing lanes $L_{p,out}$.

    A. Non-Learning Traffic Signal Controllers

1) Uniform: A simple cycle-based, uniform phase duration traffic signal controller is included for use as a baseline comparison to the other controllers. The uniform controller's only hyperparameter is the green duration $u$, which defines the same duration for all green phases; the next phase is determined by a cycle.

2) Webster's: Webster's method develops a cycle-based, fixed phase length traffic signal controller using phase flow data [53]. The authors propose an adaptive Webster's traffic signal controller by collecting data for a time interval of duration $W$ and then using Webster's method to calculate the cycle and green phase durations for the next $W$ time interval. This adaptive Webster's essentially uses the most recent $W$ interval to collect data and assumes the traffic demand will be approximately the same during the next $W$ interval. The selection of $W$ is important and exhibits various trade-offs: smaller values allow for more frequent adaptations to changing traffic demands at the risk of instability, while larger values adapt less frequently but allow for increased stability. Pseudo-code for the Webster's traffic signal controller is presented in Algorithm 1.

In Algorithm 1, $F$ represents the set of phase flows collected over the most recent $W$ interval and $R$ represents the total cycle lost time. In addition to the time interval hyperparameter $W$, the adaptive Webster's algorithm also has hyperparameters defining a minimum cycle duration $c_{min}$, maximum cycle duration $c_{max}$ and lane saturation flow rate $s$.
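To make the timing computation concrete, the following is a minimal Python sketch of one adaptive Webster's update under the assumptions above; the function name, the default saturation flow and lost time values, and the use of the standard Webster cycle formula C = (1.5R + 5)/(1 - Y) are illustrative stand-ins rather than the framework's exact implementation.

def websters_timings(phase_flows, sat_flow=1900.0, lost_time_per_phase=4.0,
                     c_min=40.0, c_max=180.0):
    """Compute a cycle length and green splits from phase flows (veh/h)
    collected over the most recent W-second interval (Webster's method)."""
    # critical flow ratio for each phase
    y = [f / sat_flow for f in phase_flows]
    Y = sum(y)
    R = lost_time_per_phase * len(phase_flows)      # total cycle lost time
    # Webster's optimal cycle length, guarded against saturation (Y >= 1)
    if Y < 1.0:
        C = (1.5 * R + 5.0) / (1.0 - Y)
    else:
        C = c_max
    C = min(max(C, c_min), c_max)                   # clamp to [c_min, c_max]
    G = C - R                                       # total green time
    # allocate green time proportional to each phase's flow ratio
    greens = [G * (y_i / Y) if Y > 0 else G / len(y) for y_i in y]
    return C, greens

# example: three phases with hourly flows observed over the last interval
cycle, greens = websters_timings([600.0, 350.0, 200.0])
print(round(cycle, 1), [round(g, 1) for g in greens])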

3) Max-pressure: The Max-pressure algorithm develops an acyclic, dynamic phase length traffic signal controller. The Max-pressure algorithm models vehicles in lanes as a substance in a pipe and enacts control in a manner which attempts to maximize the relief of pressure between incoming and outgoing lanes [13].



TABLE I
ADAPTIVE TRAFFIC SIGNAL CONTROL RELATED WORK
(Columns: Research, Network, Intersections, Multi-agent RL, Function Approximation.)

Algorithm 1 Webster's Algorithm (excerpt)
 8: if C > c_max then
 9:     C = c_max
10: end if
11: G = C − R
12: # allocate green time proportional to flow
13: return C, { G · y/∑Y for y in Y }
14: end procedure

For a given green phase $p$, the pressure is defined in (1):

$$\text{Pressure}(p) = \sum_{l \in L_{p,inc}} |V_l| - \sum_{l \in L_{p,out}} |V_l| \quad (1)$$

where $L_{p,inc}$ represents the set of incoming lanes with green movements in phase $p$, $L_{p,out}$ represents the set of outgoing lanes from all incoming lanes in $L_{p,inc}$, and $|V_l|$ is the number of vehicles on lane $l$.

Pseudo-code for the Max-pressure traffic signal controller is presented in Algorithm 2.

Algorithm 2 Max-pressure Algorithm
1: procedure MAXPRESSURE(g_min, t_p, P)
2:     if t_p < g_min then
3:         t_p = t_p + 1
4:     else
5:         t_p = 0
6:         # next phase has largest pressure
7:         return argmax({ Pressure(p) for p in P })
8:     end if
9: end procedure

In Algorithm 2, $t_p$ represents the time spent in the current phase. The Max-pressure algorithm requires a minimum green time hyperparameter $g_{min}$, which ensures a newly enacted phase has a minimum duration.
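The phase choice of Algorithm 2 can be sketched in Python as follows; the lane and phase identifiers are illustrative, and vehicle counts are assumed to be supplied by the simulator rather than queried here.

def pressure(phase, counts, inc_lanes, out_lanes):
    """Pressure of a green phase: incoming minus outgoing vehicle counts (Eq. 1)."""
    return (sum(counts[l] for l in inc_lanes[phase])
            - sum(counts[l] for l in out_lanes[phase]))

def max_pressure_step(t_p, g_min, phases, counts, inc_lanes, out_lanes):
    """One control step (Algorithm 2): hold the phase until g_min elapses,
    then switch to the phase with the largest pressure."""
    if t_p < g_min:
        return None, t_p + 1                     # keep current phase
    best = max(phases, key=lambda p: pressure(p, counts, inc_lanes, out_lanes))
    return best, 0                               # switch, reset phase timer

# illustrative two-phase intersection
counts = {'n_in': 6, 's_in': 4, 'e_in': 1, 'w_in': 2,
          'n_out': 0, 's_out': 1, 'e_out': 3, 'w_out': 0}
inc = {'NS': ['n_in', 's_in'], 'EW': ['e_in', 'w_in']}
out = {'NS': ['n_out', 's_out'], 'EW': ['e_out', 'w_out']}
phase, t_p = max_pressure_step(t_p=12, g_min=10, phases=['NS', 'EW'],
                               counts=counts, inc_lanes=inc, out_lanes=out)
print(phase)   # 'NS' (pressure 9 vs. 0)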

4) Self-Organizing Traffic Lights: Self-organizing traffic lights (SOTL) [17], [18], [19] develop a cycle-based, dynamic phase length traffic signal controller based on self-organizing principles, where a "...self-organizing system would be one in which elements are designed to dynamically and autonomously solve a problem or perform a function at the system level." [18, p. 2].

Pseudo-code for the SOTL traffic signal controller is presented in Algorithm 3.

Algorithm 3 SOTL Algorithm
 1: procedure SOTL(t_p, g_min, θ, ω, µ)
 2:     # accumulate red phase vehicle-time integral
 3:     κ = κ + ∑_{l ∈ L_inc − L_p,inc} |V_l|
 4:     if t_p > g_min then
 5:         # vehicles approaching in current green phase
 6:         # within ω distance of the stop line
 7:         n = ∑_{l ∈ L_p,inc} |V_l|
 8:         # only consider a phase change if there is no platoon
 9:         # or the platoon is too large (n > µ)
10:         if n > µ or n == 0 then
11:             if κ > θ then
12:                 κ = 0
13:                 # next phase in cycle
14:                 i = i + 1
15:                 return P_{i mod |P|}
16:             end if
17:         end if
18:     end if
19: end procedure

The SOTL algorithm functions by changing lights according to a vehicle-time integral threshold $\theta$, constrained by a minimum green phase duration $g_{min}$. Additionally, small (i.e., $n < \mu$) vehicle platoons are kept together by preventing a phase change if they are sufficiently close (i.e., at a distance $< \omega$) to the stop line.
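A minimal Python sketch of the SOTL logic in Algorithm 3, assuming the caller supplies vehicle counts for the red approaches and for vehicles within $\omega$ of the stop line on the green approaches; the class and argument names are illustrative.

class SOTL:
    """Minimal SOTL sketch (Algorithm 3); counts are assumed to come from
    detectors within omega metres of the stop line."""
    def __init__(self, phases, g_min, theta, mu):
        self.phases, self.g_min = phases, g_min
        self.theta, self.mu = theta, mu
        self.kappa = 0          # red-approach vehicle-time integral
        self.i = 0              # index of current phase in the cycle

    def step(self, t_p, red_approach_count, green_approach_count):
        # accumulate vehicle-time for approaches currently served by red
        self.kappa += red_approach_count
        if t_p > self.g_min:
            n = green_approach_count
            # change phase only if no platoon is crossing, or it is too large
            if n > self.mu or n == 0:
                if self.kappa > self.theta:
                    self.kappa = 0
                    self.i += 1                       # next phase in the cycle
                    return self.phases[self.i % len(self.phases)]
        return None                                    # keep current phase

ctrl = SOTL(phases=['NS', 'EW'], g_min=10, theta=50, mu=3)
print(ctrl.step(t_p=15, red_approach_count=60, green_approach_count=0))  # 'EW'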

    B. Learning Traffic Signal Controllers

Reinforcement learning uses the framework of Markov Decision Processes to solve goal-oriented, sequential decision-making problems by repeatedly acting in an environment.


At discrete points in time $t$, a reinforcement learning agent observes the environment state $s_t$ and then uses a policy $\pi$ to determine an action $a_t$. After implementing its selected action, the agent receives feedback from the environment in the form of a reward $r_t$ and observes a new environment state $s_{t+1}$. The reward quantifies how 'well' the agent is achieving its goal (e.g., score in a game, completed tasks). This process is repeated until a terminal state $s_{terminal}$ is reached, and then begins anew. The return $G_t = \sum_{k=0}^{T} \gamma^k r_{t+k}$ is the accumulation of rewards by the agent over some time horizon $T$, discounted by $\gamma \in [0, 1)$. The agent seeks to maximize the expected return $E[G_t]$ from each state $s_t$. The agent develops an optimal policy $\pi^*$ to maximize the return.

There are many techniques for an agent to learn the optimal policy; however, most of them rely on estimating value functions. Value functions are useful to estimate future rewards. State value functions $V^{\pi}(s) = E[G_t|s_t = s]$ represent the expected return starting from state $s$ and following policy $\pi$. Action value functions $Q^{\pi}(s, a) = E[G_t|s_t = s, a_t = a]$ represent the expected return starting from state $s$, taking action $a$ and following policy $\pi$. In practice, value functions are unknown and must be estimated using sampling and function approximation techniques. Parametric function approximators, such as neural networks, use a set of parameters $\theta$ to estimate an unknown function $f(x|\theta) \approx f(x)$. To develop accurate approximations, the function parameters must be developed with some optimization technique.

Experiences are tuples $e_t = (s_t, a_t, r_t, s_{t+1})$ that represent an interaction between the agent and the environment at time $t$. A reinforcement learning agent interacts with its environment in trajectories, or sequences of experiences $e_t, e_{t+1}, e_{t+2}, \ldots$. Trajectories begin in an initial state $s_{init}$ and end in a terminal state $s_{terminal}$. To accurately estimate value functions, experiences are used to optimize the parameters. If neural network function approximation is used, the parameters are optimized using experiences to perform gradient-based techniques and backpropagation [54], [55]. Additional technical details regarding the proposed reinforcement learning adaptive traffic signal controllers can be found in the Appendix.
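A minimal sketch of the experience tuple and a uniformly sampled replay buffer as described above; the capacity and batch size shown are illustrative defaults, not the framework's settings.

import random
from collections import deque, namedtuple

# an experience e_t = (s_t, a_t, r_t, s_{t+1}) plus a terminal flag
Experience = namedtuple('Experience', ['s', 'a', 'r', 's_next', 'terminal'])

class ReplayBuffer:
    """Fixed-size buffer holding the most recent experiences; sampled
    uniformly to form training batches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old experiences are evicted

    def add(self, s, a, r, s_next, terminal=False):
        self.buffer.append(Experience(s, a, r, s_next, terminal))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

buf = ReplayBuffer(capacity=1000)
buf.add(s=[0.1, 0.0], a=2, r=-0.4, s_next=[0.2, 0.1])
batch = buf.sample(batch_size=1)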

To train reinforcement learning controllers for all intersections, a distributed acting, centralized learning architecture is developed [56], [57], [58]. Using parallel computing, multiple actors and learners are created, as illustrated in Figure 1. Actors have their own instance of the traffic simulation and neural networks for all intersections. Learners are assigned a subset of all intersections; for each assigned intersection they have a neural network and an experience replay buffer $D$. Actors generate experiences $e_t$ for all intersections and send them to the appropriate learner. Learners only receive experiences for their assigned subset of intersections. The learner stores the experiences in an experience replay buffer, which is uniformly sampled for batches to optimize the neural network parameters. After computing parameter updates, learners send new parameters to all actors.

There are many benefits to this architecture, foremost being that it makes the problem feasible; because there are hundreds of agents, distributing computation across many actors and learners is necessary to decrease training time. Another benefit is experience diversity, granted by multiple environments and varied exploration rates.
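The distributed acting, centralized learning loop can be sketched with Python's multiprocessing queues as below; the queue-based message passing, intersection identifier and step counts are simplifying assumptions, and the simulation, replay sampling and network updates are elided.

import multiprocessing as mp

def actor(exp_queues, param_queue, n_steps=100):
    """Each actor runs its own simulation and streams experiences to the
    learner responsible for each intersection (simulation omitted here)."""
    params = None
    for t in range(n_steps):
        if not param_queue.empty():
            params = param_queue.get()          # refresh policy parameters
        for tsc_id, q in exp_queues.items():
            q.put((tsc_id, ('s', 'a', 'r', 's_next')))   # placeholder experience

def learner(tsc_ids, exp_queue, param_queues, n_updates=100):
    """Each learner owns a subset of intersections: it stores their experiences,
    computes parameter updates, and broadcasts new parameters to all actors."""
    replay = {i: [] for i in tsc_ids}
    for _ in range(n_updates):
        tsc_id, exp = exp_queue.get()
        replay[tsc_id].append(exp)
        # ... sample a batch from replay[tsc_id] and update the network here ...
        for pq in param_queues:
            pq.put({tsc_id: 'new_params'})      # broadcast updated parameters

if __name__ == '__main__':
    exp_q = mp.Queue()
    param_q = mp.Queue()
    a = mp.Process(target=actor, args=({'gneJ0': exp_q}, param_q))
    l = mp.Process(target=learner, args=(['gneJ0'], exp_q, [param_q]))
    a.start(); l.start(); a.join(); l.join()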

    C. DQN

The proposed DQN traffic signal controller enacts control by choosing the next green phase without utilizing a phase cycle. This acyclic architecture is motivated by the observation that enacting phases in a repeating sequence may contribute to a sub-optimal control policy. After the DQN has selected the next phase, it is enacted for a fixed duration known as the action repeat $a_{repeat}$. After the phase has been enacted for the action repeat duration, a new phase is selected acyclically.

1) State: The proposed state observation for the DQN is a combination of the most recent green phase and the density and queue of incoming lanes at the intersection at time $t$. Assume each intersection has a set $L$ of incoming lanes and a set $P$ of green phases. The state space is then defined as $S \in (\mathbb{R}^{2|L|} \times \mathbb{B}^{|P|+1})$. The density and queue of each lane are normalized to the range $[0, 1]$ by dividing by the lane's jam density $k_j$. The most recent phase is encoded as a one-hot vector $\mathbb{B}^{|P|+1}$, where the plus one encodes the all-red clearance phase.
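A sketch of how this observation could be assembled, assuming lane densities and queues are reported in vehicles and a single jam density value is used for normalization; the function and argument names are illustrative.

import numpy as np

def dqn_state(densities, queues, jam_density, last_phase_idx, n_phases):
    """Build the DQN observation: normalized lane densities and queues plus a
    one-hot encoding of the most recent phase (index n_phases = all-red)."""
    dens = np.asarray(densities, dtype=np.float32) / jam_density
    que = np.asarray(queues, dtype=np.float32) / jam_density
    phase = np.zeros(n_phases + 1, dtype=np.float32)   # +1 for all-red clearance
    phase[last_phase_idx] = 1.0
    return np.concatenate([dens, que, phase])

# two incoming lanes, four green phases, most recent phase index 1
s_t = dqn_state(densities=[8, 3], queues=[5, 0], jam_density=20.0,
                last_phase_idx=1, n_phases=4)
print(s_t.shape)   # (2*|L| + |P|+1,) = (9,)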

2) Action: The proposed action space for the DQN traffic signal controller is the next green phase. The DQN selects one action from a discrete set, in this model one of the possible green phases $a_t \in P$. After a green phase has been selected, it is enacted for a duration equal to the action repeat $a_{repeat}$.

3) Reward: The reward used to train the DQN traffic signal controller is a function of vehicle delay. Delay $d$ is the difference in time between a vehicle's free-flow travel time and actual travel time. Specifically, the reward is the negative sum of all vehicles' delay at the intersection, defined in (2):

$$r_t = -\sum_{v \in V} d_t^v \quad (2)$$

where $V$ is the set of all vehicles on incoming lanes at the intersection and $d_t^v$ is the delay of vehicle $v$ at time $t$. Defined in this way, the reward is a punishment, with the agent's goal to minimize the amount of punishment it receives. Each intersection saves the reward with the largest magnitude experienced and performs minimum reward normalization $r_t / |r_{min}|$ to scale the reward to the range $[-1, 0]$ for stability.
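A sketch of the delay reward and its minimum reward normalization; the running magnitude update shown here is one plausible way to track $|r_{min}|$ and is not taken from the framework's code.

def delay_reward(delays, r_min_magnitude):
    """Negative sum of vehicle delays (Eq. 2), scaled by the largest reward
    magnitude seen so far to keep it roughly in [-1, 0]."""
    r_t = -sum(delays)
    r_min_magnitude = max(r_min_magnitude, abs(r_t))   # track largest magnitude
    return r_t / r_min_magnitude, r_min_magnitude

r, r_min = delay_reward(delays=[4.0, 0.5, 2.5], r_min_magnitude=10.0)
print(r, r_min)   # -0.7 10.0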

4) Agent Architecture: The agent approximates the action-value function $Q$ with a deep artificial neural network. The action-value function $Q$ has two hidden layers of $3|s_t|$ fully connected neurons with exponential linear unit (ELU) activation functions, and the output layer is $|P|$ neurons with linear activation functions. The $Q$ function's input is the local intersection state $s_t$. A visualization of the DQN is presented in Fig. 1.
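Assuming TensorFlow's Keras API (available in the TensorFlow 1.13 release used by the framework), the described Q-network could be sketched as follows; the learning rate and initializer are illustrative choices.

import tensorflow as tf

def build_q_network(state_dim, n_phases):
    """Q-network sketch: two hidden layers of 3*|s_t| ELU units and a linear
    output of |P| action values (one per green phase)."""
    s = tf.keras.Input(shape=(state_dim,))
    h = tf.keras.layers.Dense(3 * state_dim, activation='elu',
                              kernel_initializer='he_uniform')(s)
    h = tf.keras.layers.Dense(3 * state_dim, activation='elu',
                              kernel_initializer='he_uniform')(h)
    q = tf.keras.layers.Dense(n_phases, activation='linear')(h)
    model = tf.keras.Model(inputs=s, outputs=q)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss='mse')
    return model

q_net = build_q_network(state_dim=9, n_phases=4)
q_net.summary()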

    D. DDPG Traffic Signal Controller

The proposed DDPG traffic signal controller implements a cycle with dynamic phase durations. This architecture is motivated by the observation that cycle-based policies can maintain fairness and ensure a minimum quality of service between all intersection users.


Fig. 1. Adaptive traffic signal control DDPG and DQN neural network agents (left) and distributed acting, centralized learning architecture (right) composed of actors and learners. Each actor has one SUMO network as an environment and neural networks for all intersections. Each learner is assigned a subset of intersections at the beginning of training and is only responsible for computing parameter updates for their assigned intersections, effectively distributing the computational load for learning. However, learners distribute parameter updates to all actors.

Once the next green phase has been determined using the cycle, the policy $\pi$ is used to select its duration. Explicitly, the reinforcement learning agent is learning what duration to give the next green phase in the cycle to maximize its return. Additionally, the cycle skips phases when no vehicles are present on their incoming lanes.

1) Actor State: The proposed state observation for the actor is a combination of the current phase and the density and queue of incoming lanes at the intersection at time $t$. The state space is then defined as $S \in (\mathbb{R}^{2|L|} \times \mathbb{B}^{|P|+1})$. The density and queue of each lane are normalized to the range $[0, 1]$ by dividing by the lane's jam density $k_j$. The current phase is encoded as a one-hot vector $\mathbb{B}^{|P|+1}$, where the plus one encodes the all-red clearance phase.

2) Critic State: The proposed state observation for the critic combines the state $s_t$ and the actor's action $a_t$, as depicted in Figure 1.

3) Action: The proposed action space for the adaptive traffic signal controller is the duration of the next green phase in seconds. The action controls the duration of the next phase; there is no agency over what the next phase is, only how long it will last. The DDPG algorithm produces a continuous output, a real number over some range $a_t \in \mathbb{R}$. Since the DDPG algorithm outputs a real number and the phase duration is defined in intervals of seconds, the output is rounded to the nearest integer. In practice, phase durations are bounded by minimum time $g_{min}$ and maximum time $g_{max}$ hyperparameters to ensure a minimum quality of service for all users. Therefore the agent selects an action $\{a_t \in \mathbb{Z} \mid g_{min} \leq a_t \leq g_{max}\}$ as the next phase duration.
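A sketch of mapping the actor's bounded continuous output to an integer phase duration in $[g_{min}, g_{max}]$, assuming a tanh output in $[-1, 1]$; the exact scaling used by the framework may differ.

def duration_from_policy(raw_action, g_min, g_max):
    """Map the actor's tanh output in [-1, 1] to an integer green phase
    duration in [g_min, g_max] seconds."""
    scaled = g_min + (raw_action + 1.0) / 2.0 * (g_max - g_min)
    return int(round(min(max(scaled, g_min), g_max)))

print(duration_from_policy(0.2, g_min=5, g_max=60))   # 38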

4) Reward: The reward used to train the DDPG traffic signal controller is the same delay reward used by the DQN traffic signal controller, defined in (2).

5) Agent Architecture: The agent approximates the policy $\pi$ and action-value function $Q$ with deep artificial neural networks. The policy function has two hidden layers of $3|s_t|$ fully connected neurons, each with batch normalization and ELU activation functions, and the output layer is one neuron with a hyperbolic tangent activation function. The action-value function $Q$ has two hidden layers of $3(|s_t| + |a_t|)$ fully connected neurons with batch normalization and ELU activation functions, and the output layer is one neuron with a linear activation function. The policy's input is the intersection's local traffic state $s_t$ and the action-value function's input is the local state concatenated with the local action $(s_t, a_t)$. The action-value function $Q$ also uses an L2 weight regularization of $\lambda = 0.01$.

By deep reinforcement learning standards the networks used are not that deep; however, their architecture is selected for simplicity and they can easily be modified within the framework. Simple deep neural networks were also implemented to allow for future scalability, as the proposed framework can be deployed to any SUMO network - to reduce the computational load, the default networks are simple.
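Assuming TensorFlow's Keras API, the actor and critic described above could be sketched as follows; the layer ordering around batch normalization and the initializer are illustrative choices.

import tensorflow as tf

def build_actor(state_dim):
    """Actor sketch: two hidden layers of 3*|s_t| units with batch
    normalization and ELU, tanh output for the (scaled) phase duration."""
    s = tf.keras.Input(shape=(state_dim,))
    h = s
    for _ in range(2):
        h = tf.keras.layers.Dense(3 * state_dim, kernel_initializer='he_uniform')(h)
        h = tf.keras.layers.BatchNormalization()(h)
        h = tf.keras.layers.Activation('elu')(h)
    a = tf.keras.layers.Dense(1, activation='tanh')(h)
    return tf.keras.Model(inputs=s, outputs=a)

def build_critic(state_dim, action_dim=1, l2=0.01):
    """Critic sketch: Q(s, a) with two hidden layers of 3*(|s_t|+|a_t|) units,
    batch normalization, ELU, a linear output and L2 weight regularization."""
    s = tf.keras.Input(shape=(state_dim,))
    a = tf.keras.Input(shape=(action_dim,))
    h = tf.keras.layers.Concatenate()([s, a])
    for _ in range(2):
        h = tf.keras.layers.Dense(3 * (state_dim + action_dim),
                                  kernel_regularizer=tf.keras.regularizers.l2(l2),
                                  kernel_initializer='he_uniform')(h)
        h = tf.keras.layers.BatchNormalization()(h)
        h = tf.keras.layers.Activation('elu')(h)
    q = tf.keras.layers.Dense(1, activation='linear',
                              kernel_regularizer=tf.keras.regularizers.l2(l2))(h)
    return tf.keras.Model(inputs=[s, a], outputs=q)

actor, critic = build_actor(state_dim=9), build_critic(state_dim=9)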

    IV. EXPERIMENTS

    A. Hyperparameter Optimization

To demonstrate the capabilities of the proposed framework, experiments are conducted on optimizing adaptive traffic signal control hyperparameters. The framework is for use with the SUMO traffic microsimulator [5], which was used to evaluate the developed adaptive traffic signal controllers. Understanding how sensitive any specific adaptive traffic signal controller's performance is to changes in hyperparameters is important to instill confidence that the solution is robust. Determining optimal hyperparameters is necessary to ensure a balanced comparison between adaptive traffic signal control methods.


Fig. 2. Two intersection SUMO network (intersections gneJ0 and gneJ6) used for hyperparameter experiments. In addition to this two intersection network, a single, isolated intersection is also included with the framework.

Using the hyperparameter optimization script included in the framework, a grid search is performed over the implemented controllers' hyperparameters on a two intersection network, shown in Fig. 2, under a simulated three hour dynamic traffic demand scenario. The results for each traffic signal controller are displayed in Fig. 3 and collectively in Fig. 4.

As can be observed in Fig. 3 and Fig. 4, the choice of hyperparameters significantly impacts the performance of a given traffic signal controller. As a general trend observed in Fig. 3, methods with larger numbers of hyperparameters (e.g., SOTL, DDPG, DQN) exhibit greater performance variance than methods with fewer hyperparameters (e.g., Max-pressure). Directly comparing methods in Fig. 4 demonstrates the non-learning adaptive traffic signal control methods' (e.g., Max-pressure, Webster's) robustness to hyperparameter values and high performance (i.e., lowest travel time). Learning-based methods exhibit higher variance with changes in hyperparameters, DQN more so than DDPG. In the following section, the best hyperparameters for each adaptive traffic signal controller will be used to further investigate and compare performance.
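A sketch of how such a grid search might be orchestrated; the hyperparameter ranges and the placeholder evaluation function are illustrative, and the ranking by mean plus standard deviation mirrors the ordering used in Fig. 3.

import itertools
import statistics

# illustrative grids; the ranges actually searched by the paper are not shown here
grids = {
    'max_pressure': {'g_min': [5, 10, 15, 20]},
    'webster':      {'W': [300, 600, 900], 'c_max': [90, 120, 180]},
}

def evaluate(controller, hp, n_seeds=8):
    """Stand-in for launching n_seeds SUMO runs with these hyperparameters and
    returning the mean and standard deviation of travel time."""
    travel_times = [60.0 for _ in range(n_seeds)]   # placeholder results
    return statistics.mean(travel_times), statistics.stdev(travel_times)

best = {}
for controller, grid in grids.items():
    scores = {}
    for values in itertools.product(*grid.values()):
        hp = dict(zip(grid.keys(), values))
        mean_tt, std_tt = evaluate(controller, hp)
        scores[tuple(sorted(hp.items()))] = mean_tt + std_tt   # ranking as in Fig. 3
    best[controller] = min(scores, key=scores.get)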

    B. Optimized Adaptive Traffic Signal Controllers

Using the optimized hyperparameters, all traffic signal controllers are subjected to an additional 32 simulations with random seeds to estimate their performance, quantified using network travel time and individual intersection queue and delay measures of effectiveness (MoE). Results are presented in Fig. 5 and Fig. 6.

Observing the travel time boxplots in Fig. 5, the SOTL controller produces the worst results, exhibiting a mean travel time almost twice that of the next closest method and with many significant outliers. The Max-pressure algorithm achieves the best performance, with the lowest mean and median along with the lowest standard deviation. The DQN, DDPG, Uniform and Webster's controllers achieve approximately equal performance; however, the DQN controller has significant outliers, indicating some vehicles experience much longer travel times than most.

    Each intersection’s queue and delay MoE with respect toeach adaptive traffic signal controller is presented in Fig. 6.The results are consistent with previous observations from

    the hyperparameter search and travel time data, however, thereader’s attention is directed to comparing the performanceof DQN and DDPG in Fig. 6. The DQN controller performspoorly (i.e., high queues and delay) at the beginning and endof the simulation when traffic demand is low. However, at thedemand peak, the DQN controller performs just as well, ifnot a little better, than every method except the Max-pressurecontroller. Simultaneously considering the DDPG controller,the performance is opposite the DQN controller. The DDPGcontroller achieves relatively low queues and delay at thebeginning and end of the simulation and then is bested bythe DQN controller in the middle of the simulation when thedemand peaks. This performance difference can potentiallybe understood by considering the difference between theDQN and DDPG controllers. The DQN’s ability to selectthe next phase acyclically under high traffic demand mayallow it to reduce queues and delay more than the cycleconstrained DDPG controller. However, it is curious thatunder low demands the DQN controller performance sufferswhen it should be relatively simple to develop the optimalpolicy. The DQN controller may be overfitting to the periodsin the environment when the magnitude of the rewards arelarge (i.e., in the middle of the simulation when the demandpeaks) and converging to a policy that doesn’t generalizewell to the environment when the traffic demand is low.The author’s present these findings to reader’s and suggestfuture research investigate this and other issues to understandthe performance difference between reinforcement learningtraffic signal controllers. Understanding the advantages anddisadvantages of a variety of controllers can provide insightinto developing future improvements.

    V. CONCLUSION & FUTURE WORK

Learning and non-learning adaptive traffic signal controllers have been developed within an optimized framework for the traffic microsimulator SUMO for use by the research community. The proposed framework's capabilities were demonstrated by studying adaptive traffic signal control algorithms' sensitivity to hyperparameters, which was found to be high for hyperparameter-rich controllers (i.e., learning) and relatively low for hyperparameter-sparse controllers (i.e., heuristics). Poor hyperparameters can drastically alter the performance of an adaptive traffic signal controller, leading researchers to erroneous conclusions about an adaptive traffic signal controller's performance. This research provides evidence that dozens or hundreds of hyperparameter configurations may have to be tested before selecting the optimal one.

Using the optimized hyperparameters, each adaptive controller's performance was estimated and the Max-pressure controller was found to achieve the best performance, yielding the lowest travel times, queues and delay. This manuscript's research provides evidence that heuristics can offer powerful solutions even compared to complex deep-learning methods. This is not to suggest that this is definitively the case in all environments and circumstances. The authors hypothesize that learning-based controllers can be further developed to offer improved performance that may yet best the non-learning, heuristic-based methods detailed in this research.


Fig. 3. Individual hyperparameter results for each traffic signal controller (panels: Uniform, SOTL, DDPG, Max-pressure, DQN, Webster's; axes: mean vs. standard deviation of travel time). Travel time is used as a measure of effectiveness and is estimated for each hyperparameter from eight simulations with random seeds in units of seconds (s). The coloured dots' gradient from green (best) to red (worst) orders the hyperparameters by the sum of the travel time mean and standard deviation. Note the differing scales between graph axes, making direct visual comparison biased.

Fig. 4. Comparison of all traffic signal controllers' hyperparameter travel time performance (mean vs. standard deviation of travel time, in seconds, for DDPG, DQN, Max-pressure, SOTL, Uniform and Webster's). Note both the vertical and horizontal axis limits have been clipped at 200 to improve readability.

Promising extensions that have improved reinforcement learning in other applications, and may do the same for adaptive traffic signal control, include richer function approximators [59], [60], [61] and reinforcement learning algorithms [62], [63], [64].

The authors intend for the framework to grow, with the addition of more adaptive traffic signal controllers and features. In its current state, the framework can already help adaptive traffic signal control researchers rapidly experiment on a SUMO network of their choice. Acknowledging the importance of optimizing our transportation systems, the authors hope this research helps others solve practical problems.


Fig. 5. Boxplots depicting the distribution of travel times for each traffic signal controller; the (mean, standard deviation, median) travel times in seconds are DDPG (72, 34, 65), DQN (78, 46, 66), Max-pressure (59, 21, 54), SOTL (158, 169, 85), Uniform (79, 37, 74) and Webster's (71, 30, 66). The solid white line represents the median, the solid coloured box the interquartile range (IQR) (i.e., from first (Q1) to third quartile (Q3)), solid coloured lines Q1 − 1.5 IQR and Q3 + 1.5 IQR, and coloured crosses the outliers.

Fig. 6. Comparison of traffic signal controllers' individual intersection queue and delay measures of effectiveness (intersections gneJ0 and gneJ6, plotted over simulation time in minutes) in units of vehicles (veh) and seconds (s). Solid coloured lines represent the mean and shaded areas represent the 95% confidence interval. SOTL has been omitted to improve readability since its queue and delay values are exclusively outside the graph range.

APPENDIX A
DQN

Deep Q-Networks [65] combine Q-learning and deep neural networks to produce autonomous agents capable of solving complex tasks in high-dimensional environments. Q-learning [66] is a model-free, off-policy, value-based temporal difference [67] reinforcement learning algorithm which can be used to develop an optimal discrete action space policy for a given problem. Like other temporal difference algorithms, Q-learning uses bootstrapping [68] (i.e., using an estimate to improve future estimates) to develop an action-value function $Q(s, a)$ which can estimate the expected return of taking action $a$ in state $s$ and acting optimally thereafter. If the $Q$ function can be estimated accurately, it can be used to derive the optimal policy $\pi^* = \operatorname*{argmax}_a Q(s, a)$. In DQN, the $Q$ function is approximated with a deep neural network.


The DQN algorithm utilizes two techniques to ensure stable development of the $Q$ function with deep neural network function approximation: a target network and experience replay. Two parameter sets are used when training a DQN, online $\theta$ and target $\theta'$. The target parameters $\theta'$ are used to stabilize the return estimates when performing updates to the neural network and are periodically set to the online parameters, $\theta' = \theta$, at a fixed interval. The experience replay is a buffer which stores the most recent $D$ experience tuples to create a slowly changing dataset. The experience replay is uniformly sampled to obtain experience batches used to update the $Q$ function's online parameters $\theta$.

Training a deep neural network requires a loss function, which is used to determine how to change the parameters to achieve better approximations of the training data. Reinforcement learning develops value functions (e.g., neural networks) using experiences from the environment.

    The DQN’s loss function is the gradient of the mean squarederror of the return Gt target yt and the prediction, defined in(3).

$$y_t = r_t + \gamma Q(s_{t+1}, \operatorname*{argmax}_a Q(s_{t+1}, a|\theta')|\theta')$$
$$L_{DQN}(\theta) = (y_t - Q(s_t, a_t|\theta))^2 \quad (3)$$
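A NumPy sketch of computing the target and loss of (3) for a batch; the q_online/q_target callables stand in for the online and target networks, and the terminal-state masking is a common implementation detail not shown in the equation.

import numpy as np

def dqn_targets(batch, q_online, q_target, gamma=0.99):
    """Compute targets y_t = r_t + gamma * Q(s', argmax_a Q(s', a | theta') | theta')
    and the squared-error loss of Eq. (3); q_online/q_target map a batch of
    states to arrays of per-action values."""
    s, a, r, s_next, terminal = batch
    q_next = q_target(s_next)                          # Q(s', . | theta')
    greedy = np.argmax(q_next, axis=1)                 # argmax over actions
    y = r + gamma * q_next[np.arange(len(r)), greedy] * (1.0 - terminal)
    q_pred = q_online(s)[np.arange(len(r)), a]         # Q(s_t, a_t | theta)
    loss = np.mean((y - q_pred) ** 2)
    return y, loss

# toy example with random stand-in "networks"
rng = np.random.default_rng(0)
fake_q = lambda states: rng.normal(size=(len(states), 4))
batch = (np.zeros((2, 9)), np.array([1, 3]), np.array([-0.2, -0.5]),
         np.zeros((2, 9)), np.array([0.0, 1.0]))
print(dqn_targets(batch, fake_q, fake_q)[1])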

APPENDIX B
DDPG

Deep deterministic policy gradients [69] are an extension of DQN to continuous action spaces. Similar to Q-learning, DDPG is a model-free, off-policy reinforcement learning algorithm. The DDPG algorithm is an example of actor-critic learning, as it develops a policy function $\pi(s|\phi)$ (i.e., actor) using an action-value function $Q(s, a|\theta)$ (i.e., critic). The actor interacts with the environment and modifies its behaviour based on feedback from the critic.

    The DDPG critic’s loss function is the gradient of the meansquared error of the return Gt target yt and the prediction,defined in (4).

$$y_t = r_t + \gamma Q(s_{t+1}, \pi(s_{t+1}|\phi')|\theta')$$
$$L_{Critic}(\theta) = (y_t - Q(s_t, a_t|\theta))^2 \quad (4)$$

    The DDPG actor’s loss function is the sampled policygradient, defined in (5).

    LActor(θ) = ∇θQ(st, π(st|φ)|θ) (5)

Like DQN, DDPG uses two sets of parameters, online $\theta$ and target $\theta'$, and experience replay [70] to reduce instability during training. DDPG performs updates on the parameters for both the actor and critic by uniformly sampling batches of experiences from the replay. The target parameters are slowly updated towards the online parameters according to $\theta' = (1 - \tau)\theta' + \tau\theta$ after every batch update.
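The soft target update can be sketched as a per-array interpolation; applying it via Keras get_weights/set_weights is an assumption about the implementation.

def soft_update(target_weights, online_weights, tau=0.001):
    """theta' <- (1 - tau) * theta' + tau * theta, applied per parameter array
    (e.g., the lists returned by a Keras model's get_weights())."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_weights, online_weights)]

# usage with Keras models (assumed API):
# target_model.set_weights(soft_update(target_model.get_weights(),
#                                      online_model.get_weights()))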

APPENDIX C
TECHNICAL

Software used includes SUMO 1.2.0 [5], TensorFlow 1.13 [71], SciPy [72] and public code [73]. The neural network parameters were initialized with He initialization [74] and optimized using Adam [75].

To ensure intersection safety, two-second yellow change and three-second all-red clearance phases were inserted between all green phase transitions. For the DQN and DDPG traffic signal controllers, if no vehicles are present at the intersection, the phase defaults to all-red, which is considered a terminal state $s_{terminal}$. Each intersection's state observation is bounded by 150 m (i.e., the queue and density are calculated from vehicles up to a maximum of 150 m from the intersection stop line).
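One way the bounded observation could be gathered through SUMO's TraCI Python client is sketched below; the halting-speed threshold is an assumption and the framework's own observation code may differ.

import traci   # SUMO's TraCI Python client

def lane_observation(lane_id, bound=150.0, halt_speed=0.1):
    """Count vehicles (density) and stopped vehicles (queue) within `bound`
    metres of the stop line of one incoming lane."""
    length = traci.lane.getLength(lane_id)
    density, queue = 0, 0
    for veh_id in traci.lane.getLastStepVehicleIDs(lane_id):
        dist_to_stop_line = length - traci.vehicle.getLanePosition(veh_id)
        if dist_to_stop_line <= bound:
            density += 1
            if traci.vehicle.getSpeed(veh_id) < halt_speed:
                queue += 1
    return density, queue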

    ACKNOWLEDGMENTS

This research was enabled in part by support in the form of computing resources provided by SHARCNET (www.sharcnet.ca), their McMaster University staff and Compute Canada (www.computecanada.ca).

REFERENCES

[1] L. Wu, Y. Ci, J. Chu, and H. Zhang, "The influence of intersections on fuel consumption in urban arterial road traffic: a single vehicle test in Harbin, China," PLoS ONE, vol. 10, no. 9, 2015.
[2] R. A. Silva, Z. Adelman, M. M. Fry, and J. J. West, "The impact of individual anthropogenic emissions sectors on the global burden of human mortality due to ambient air pollution," Environmental Health Perspectives, vol. 124, no. 11, p. 1776, 2016.
[3] World Health Organization et al., "Ambient air pollution: A global assessment of exposure and burden of disease," 2016.
[4] G. Cookson, "INRIX global traffic scorecard," INRIX, Tech. Rep., 2018.
[5] D. Krajzewicz, J. Erdmann, M. Behrisch, and L. Bieker, "Recent development and applications of SUMO - Simulation of Urban MObility," International Journal On Advances in Systems and Measurements, vol. 5, no. 3&4, pp. 128–138, December 2012.
[6] S. Mikami and Y. Kakazu, "Genetic reinforcement learning for cooperative traffic signal control," in Evolutionary Computation, 1994. IEEE World Congress on Computational Intelligence., Proceedings of the First IEEE Conference on. IEEE, 1994, pp. 223–228.
[7] J. Lee, B. Abdulhai, A. Shalaby, and E.-H. Chung, "Real-time optimization for adaptive traffic signal control using genetic algorithms," Journal of Intelligent Transportation Systems, vol. 9, no. 3, pp. 111–122, 2005.
[8] H. Prothmann, F. Rochner, S. Tomforde, J. Branke, C. Müller-Schloer, and H. Schmeck, "Organic control of traffic lights," in International Conference on Autonomic and Trusted Computing. Springer, 2008, pp. 219–233.
[9] L. Singh, S. Tripathi, and H. Arora, "Time optimization for traffic signal control using genetic algorithm," International Journal of Recent Trends in Engineering, vol. 2, no. 2, p. 4, 2009.
[10] E. Ricalde and W. Banzhaf, "Evolving adaptive traffic signal controllers for a real scenario using genetic programming with an epigenetic mechanism," in 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, 2017, pp. 897–902.
[11] X. Li and J.-Q. Sun, "Signal multiobjective optimization for urban traffic network," IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 11, pp. 3529–3537, 2018.
[12] T. Wongpiromsarn, T. Uthaicharoenpong, Y. Wang, E. Frazzoli, and D. Wang, "Distributed traffic signal control for maximum network throughput," in Intelligent Transportation Systems (ITSC), 2012 15th International IEEE Conference on. IEEE, 2012, pp. 588–595.
[13] P. Varaiya, "The max-pressure controller for arbitrary networks of signalized intersections," in Advances in Dynamic Network Modeling in Complex Transportation Systems. Springer, 2013, pp. 27–66.
[14] J. Gregoire, X. Qian, E. Frazzoli, A. De La Fortelle, and T. Wongpiromsarn, "Capacity-aware backpressure traffic signal control," IEEE Transactions on Control of Network Systems, vol. 2, no. 2, pp. 164–173, 2015.
[15] S. Darmoul, S. Elkosantini, A. Louati, and L. B. Said, "Multi-agent immune networks to control interrupted flow at signalized intersections," Transportation Research Part C: Emerging Technologies, vol. 82, pp. 290–313, 2017.
[16] A. Louati, S. Darmoul, S. Elkosantini, and L. ben Said, "An artificial immune network to control interrupted flow at a signalized intersection," Information Sciences, vol. 433, pp. 70–95, 2018.
[17] C. Gershenson, "Self-organizing traffic lights," arXiv preprint nlin/0411066, 2004.



[18] S.-B. Cools, C. Gershenson, and B. D'Hooghe, "Self-organizing traffic lights: A realistic simulation," in Advances in Applied Self-Organizing Systems. Springer, 2013, pp. 45–55.
[19] S. Goel, S. F. Bush, and C. Gershenson, "Self-organization in traffic lights: Evolution of signal control with advances in sensors and communications," arXiv preprint arXiv:1708.07188, 2017.
[20] N. Gartner, "A demand-responsive strategy for traffic signal control," Transportation Research Record, vol. 906, pp. 75–81, 1983.
[21] P. Lowrie, "SCATS, Sydney Co-ordinated Adaptive Traffic System: A traffic responsive method of controlling urban traffic," 1990.
[22] P. Mirchandani and L. Head, "A real-time traffic signal control system: architecture, algorithms, and analysis," Transportation Research Part C: Emerging Technologies, vol. 9, no. 6, pp. 415–432, 2001.
[23] F. Luyanda, D. Gettman, L. Head, S. Shelby, D. Bullock, and P. Mirchandani, "ACS-Lite algorithmic architecture: applying adaptive control system technology to closed-loop traffic signal control systems," Transportation Research Record: Journal of the Transportation Research Board, no. 1856, pp. 175–184, 2003.
[24] T. L. Thorpe and C. W. Anderson, "Traffic light control using SARSA with three state representations," Citeseer, Tech. Rep., 1996.
[25] E. Bingham, "Reinforcement learning in neurofuzzy traffic signal control," European Journal of Operational Research, vol. 131, no. 2, pp. 232–241, 2001.
[26] B. Abdulhai, R. Pringle, and G. J. Karakoulas, "Reinforcement learning for true adaptive traffic signal control," Journal of Transportation Engineering, vol. 129, no. 3, pp. 278–285, 2003.
[27] L. Prashanth and S. Bhatnagar, "Reinforcement learning with function approximation for traffic signal control," IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 2, pp. 412–421, 2011.
[28] S. El-Tantawy, B. Abdulhai, and H. Abdelgawad, "Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (MARLIN-ATSC): methodology and large-scale application on downtown Toronto," IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 3, pp. 1140–1150, 2013.
[29] T. Rijken, "DeepLight: Deep reinforcement learning for signalised traffic control," Master's thesis, University College London, 2015.
[30] E. van der Pol, "Deep reinforcement learning for coordination in traffic light control," Master's thesis, University of Amsterdam, 2016.
[31] L. Li, Y. Lv, and F.-Y. Wang, "Traffic signal timing via deep reinforcement learning," IEEE/CAA Journal of Automatica Sinica, vol. 3, no. 3, pp. 247–254, 2016.
[32] W. Genders and S. Razavi, "Using a deep reinforcement learning agent for traffic signal control," arXiv preprint arXiv:1611.01142, 2016, https://arxiv.org/abs/1611.01142.
[33] M. Aslani, M. S. Mesgari, and M. Wiering, "Adaptive traffic signal control with actor-critic methods in a real-world traffic network with different traffic disruption events," Transportation Research Part C: Emerging Technologies, vol. 85, pp. 732–752, 2017.
[34] S. S. Mousavi, M. Schukat, P. Corcoran, and E. Howley, "Traffic light control using deep policy-gradient and value-function based reinforcement learning," arXiv preprint arXiv:1704.08883, 2017.
[35] X. Liang, X. Du, G. Wang, and Z. Han, "Deep reinforcement learning for traffic light control in vehicular networks," arXiv preprint arXiv:1803.11115, 2018.
[36] ——, "A deep reinforcement learning network for traffic light cycle control," IEEE Transactions on Vehicular Technology, vol. 68, no. 2, pp. 1243–1253, 2019.
[37] S. Wang, X. Xie, K. Huang, J. Zeng, and Z. Cai, "Deep reinforcement learning-based traffic signal control using high-resolution event-based data," Entropy, vol. 21, no. 8, p. 744, 2019.
[38] W. Genders and S. Razavi, "Asynchronous n-step Q-learning adaptive traffic signal control," Journal of Intelligent Transportation Systems, vol. 23, no. 4, pp. 319–331, 2019.
[39] T. Chu, J. Wang, L. Codecà, and Z. Li, "Multi-agent deep reinforcement learning for large-scale traffic signal control," IEEE Transactions on Intelligent Transportation Systems, 2019.

    [40] A. Stevanovic, Adaptive traffic control systems: domestic and foreignstate of practice, 2010, no. Project 20-5 (Topic 40-03).

    [41] S. El-Tantawy, B. Abdulhai, and H. Abdelgawad, “Design of reinforce-ment learning parameters for seamless application of adaptive trafficsignal control,” Journal of Intelligent Transportation Systems, vol. 18,no. 3, pp. 227–245, 2014.

    [42] S. Araghi, A. Khosravi, and D. Creighton, “A review on computationalintelligence methods for controlling traffic signal timing,” Expert systemswith applications, vol. 42, no. 3, pp. 1538–1550, 2015.

    [43] P. Mannion, J. Duggan, and E. Howley, “An experimental review ofreinforcement learning algorithms for adaptive traffic signal control,”in Autonomic Road Transport Support Systems. Springer, 2016, pp.47–66.

    [44] K.-L. A. Yau, J. Qadir, H. L. Khoo, M. H. Ling, and P. Komisarczuk,“A survey on reinforcement learning models and algorithms for trafficsignal control,” ACM Computing Surveys (CSUR), vol. 50, no. 3, p. 34,2017.

    [45] L. Kuyer, S. Whiteson, B. Bakker, and N. Vlassis, “Multiagent rein-forcement learning for urban traffic control using coordination graphs,”in Joint European Conference on Machine Learning and KnowledgeDiscovery in Databases. Springer, 2008, pp. 656–671.

    [46] J. C. Medina and R. F. Benekohal, “Traffic signal control using reinforce-ment learning and the max-plus algorithm as a coordinating strategy,”in Intelligent Transportation Systems (ITSC), 2012 15th InternationalIEEE Conference on. IEEE, 2012, pp. 596–601.

    [47] M. Abdoos, N. Mozayani, and A. L. Bazzan, “Holonic multi-agentsystem for traffic signals control,” Engineering Applications of ArtificialIntelligence, vol. 26, no. 5, pp. 1575–1587, 2013.

    [48] M. A. Khamis and W. Gomaa, “Adaptive multi-objective reinforcementlearning with hybrid exploration for traffic signal control based on co-operative multi-agent framework,” Engineering Applications of ArtificialIntelligence, vol. 29, pp. 134–151, 2014.

    [49] T. Chu, S. Qu, and J. Wang, “Large-scale traffic grid signal controlwith regional reinforcement learning,” in American Control Conference(ACC), 2016. IEEE, 2016, pp. 815–820.

    [50] N. Casas, "Deep deterministic policy gradient for urban traffic light control," arXiv preprint arXiv:1703.09035, 2017.

    [51] W. Liu, G. Qin, Y. He, and F. Jiang, "Distributed cooperative reinforcement learning-based traffic signal control that integrates V2X networks' dynamic clustering," IEEE Transactions on Vehicular Technology, vol. 66, no. 10, pp. 8667–8681, 2017.

    [52] W. Genders, "Deep reinforcement learning adaptive traffic signal control," Ph.D. dissertation, McMaster University, 2018.

    [53] F. Webster, "Traffic signal settings, road research technical paper no. 39," Road Research Laboratory, 1958.

    [54] S. Linnainmaa, "Taylor expansion of the accumulated rounding error," BIT Numerical Mathematics, vol. 16, no. 2, pp. 146–160, 1976.

    [55] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, p. 533, 1986.

    [56] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning, 2016, pp. 1928–1937.

    [57] D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. Van Hasselt, and D. Silver, "Distributed prioritized experience replay," arXiv preprint arXiv:1803.00933, 2018.

    [58] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning et al., "IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures," arXiv preprint arXiv:1802.01561, 2018.

    [59] M. G. Bellemare, W. Dabney, and R. Munos, "A distributional perspective on reinforcement learning," in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 449–458.

    [60] W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos, "Distributional reinforcement learning with quantile regression," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

    [61] W. Dabney, G. Ostrovski, D. Silver, and R. Munos, "Implicit quantile networks for distributional reinforcement learning," arXiv preprint arXiv:1806.06923, 2018.

    [62] R. S. Sutton, D. Precup, and S. Singh, "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning," Artificial Intelligence, vol. 112, no. 1-2, pp. 181–211, 1999.

    [63] P.-L. Bacon, J. Harb, and D. Precup, "The option-critic architecture," in Thirty-First AAAI Conference on Artificial Intelligence, 2017.

    [64] S. Sharma, A. S. Lakshminarayanan, and B. Ravindran, "Learning to repeat: Fine grained action repetition for deep reinforcement learning," arXiv preprint arXiv:1702.06054, 2017.

    [65] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.

    [66] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.


    [67] R. S. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.

    [68] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press Cambridge, 1998, vol. 1, no. 1.

    [69] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.

    [70] L.-J. Lin, "Self-improving reactive agents based on reinforcement learning, planning and teaching," Machine Learning, vol. 8, no. 3-4, pp. 293–321, 1992.

    [71] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/

    [72] E. Jones, T. Oliphant, P. Peterson et al., "SciPy: Open source scientific tools for Python," 2001. [Online]. Available: http://www.scipy.org/

    [73] P. Tabor, 2019. [Online]. Available: https://github.com/philtabor/Youtube-Code-Repository/blob/master/ReinforcementLearning/PolicyGradient/DDPG/pendulum/tensorflow/ddpg_orig_tf.py

    [74] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.

    [75] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

    Wade Genders earned a B.Eng. & Society in Software Engineering in 2013, an M.A.Sc. in Civil Engineering in 2014 and a Ph.D. in Civil Engineering in 2018, all from McMaster University. His research interests include traffic signal control, intelligent transportation systems, machine learning and artificial intelligence.

    Saiedeh Razavi is the inaugural Chair in Heavy Construction, Director of the McMaster Institute for Transportation and Logistics and Associate Professor at the Department of Civil Engineering at McMaster University. Dr. Razavi has a multidisciplinary background and considerable experience in collaborating on and leading national and international multidisciplinary team-based projects in sensing and data acquisition, sensor technologies, data analytics, data fusion and their applications in the safety, productivity and mobility of transportation, construction and other systems. She combines several years of industrial experience with academic teaching and research. Her formal education includes degrees in Computer Engineering (B.Sc.), Artificial Intelligence (M.Sc.) and Civil Engineering (Ph.D.). Her research, funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) as well as the Ministry of Transportation of Ontario, focuses on connected and automated vehicles, on smart and connected work zones and on computational models for improving the safety and productivity of highway construction. Dr. Razavi brings together the private and public sectors with academia for the development of high-quality research in smarter mobility, construction and logistics. She has received several awards, including the McMaster Students Union Merit Award for Teaching, the Faculty of Engineering Team Excellence Award and the Construction Industry Institute best poster award.

