



Cooperative Deep Reinforcement Learning for Traffic Signal Control

Mengqi LIU
Beijing University of Posts and Telecommunications
Beijing, China
[email protected]

Jiachuan DENG
Purdue University
Indiana, US
[email protected]

Ming XU∗
Tsinghua University
Beijing, China
[email protected]

Xianbo ZHANG
Northeastern University
Shenyang, China
neu zhang [email protected]

Wei WANG
Northeastern University
Shenyang, China
neu [email protected]

ABSTRACT
Traffic signal control plays a crucial role in intelligent transportation systems. However, existing approaches to traffic signal control based on reinforcement learning mainly focus on optimizing the signals of a single intersection. We propose a deep-reinforcement-learning-based approach to collaboratively control the traffic signal phases of multiple intersections. For the state representation, we use comprehensive road network information as input, and each controller (intersection) is given sufficient autonomy so that it can adapt to different kinds of intersections. Besides, we design a reward that considers driver patience and cumulative delay, which better reflects the reward from the road network. Based on these ideas, a variant of independent deep Q-learning is introduced, so that multiple agents can apply experience replay to speed up the training process.

KEYWORDS
Traffic Signal Control, Deep Reinforcement Learning, Independent Q-Learning, Simulation

1 INTRODUCTION
People's living standards are rising, which leads to increasing demand for private cars. To alleviate the growing traffic pressure, urban traffic signal management should be strengthened. Reasonable traffic light settings not only help increase traffic flow and reduce traffic safety hazards and travel time, but also reduce traffic energy consumption, urban air pollution and people's travel costs. There are three possible solutions to this problem.

∗ This is the corresponding author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
UrbComp'17, Halifax, Nova Scotia, Canada
© 2017 Copyright held by the owner/author(s). DOI: 10.1145/nnnnnnn.nnnnnnn

Figure 1: Road Network

(1) Macro-control, i.e., national policies that limit the number of vehicles on the road network, for example, restrictions based on even- and odd-numbered license plates. But promulgating such a policy involves too many aspects and is not within our capabilities.

(2) From the perspective of road infrastructure, we can build viaducts, underground expressways, etc., to increase road network capacity. But this method costs too much, and in the early stages it will actually reduce the road network capacity.

(3) On the basis of the existing infrastructure, we can improve the operational efficiency of the road network and our ability to manage it.

We are more inclined to the third solution, and we decide to make improvements from the point of view of traffic signal control.

The reinforcement learning technique has already been widely applied to the problem of traffic signal control [4, 9, 20]. However, the scope of this research is still largely limited to the behavior of a single intersection, with each intersection treated as independent of the others. In order to take the potential relationships among different intersections into consideration without going too far, intersections need to be managed one region at a time, and each intersection inside this region needs to be able to perceive the strategies of the others. To illustrate the relationship among intersections, consider the following examples.


As shown in Figure 1, when traffic flow at intersection A goes high, intersections B and C may need to increase the duration of the West-East green light and the North-South green light, respectively, so that there will not be future congestion at intersections B and C. Besides, since the problem of traffic signal control can be simplified to managing vehicles on the road, intersections that are far away from each other, i.e., out of a certain region, will not be taken into consideration; vehicles cannot reach distant intersections without passing through the intersections between them. To summarize, the impact between intersections decreases with distance, and distant intersections contribute too little to be counted. By limiting the scope of management to a certain region, we can achieve high accuracy as well as high efficiency at the same time. A powerful approach to the challenge of perceiving strategies is Independent Q-Learning (IQL) [24], in which the agents are trained one by one and each delivers its training results to the other agents, so that these cooperative agents can sense each other's policies and handle the collaborative task. However, IQL does not perform well when combined with Experience Replay [16], which will be explained in Section 2.1.

In addition to the above challenges, some other challenges, shown below, occur when mapping the problem to the language of reinforcement learning. To tackle all these challenges, we propose Cooperative Deep Reinforcement Learning (CDRL).

• State Representation. Previous research tends to represent the network with attributes such as the queue length, average traffic flow, etc. Although these attributes are very abstract expressions of road network information and are widely recognized in the field of traditional transportation systems, they cannot be used as inputs to a Deep Neural Network (DNN), because the inputs of a DNN are designed to be comprehensive, unmodified information.

• Action Space. The traditional operation model of traffic signal phases is not unified; for example, some intersections use through phases before advance left-turn phases, while others do the opposite.

• Rewards. There are some common reward definitions in this field, such as the change in the number of queued vehicles, the change in delay, etc. However, none of them can reflect the psychological changes during traffic congestion. For example, if we are waiting for a regular red light of about 40 seconds, we are generally not impatient. In contrast, if we are caught in a traffic jam and have been blocked for 10 minutes, most of us will feel very unhappy.

• Computing Complexity. Many problems in machine learning are affected by the curse of dimensionality: as the dimension of the data increases, the computing complexity increases exponentially.

For each of these difficulties, the contributions of CDRL are as follows:

• We summarize the road network as four image-like representations: vehicle position, vehicle speed, traffic signal phase of other intersections, and traffic signal phase of the current intersection (see Figure 3). CDRL takes these four matrices as inputs representing the current state of the road network.

• The action space of CDRL is fixed; however, the order of actions is generated by CDRL without restriction. For each intersection, our action space definition is intended to give it enough autonomy so that it can learn the potentially optimal policy despite the changing environment.

• We propose a reward function that describes the exponential relationship between the patience of the driver and the waiting time, so as to better describe road traffic from the perspective of human beings.

• CDRL uses multiple agents, i.e., one agent represents only one intersection, which prevents the state space of a single agent from becoming too large and growing exponentially as the number of intersections increases. Besides, we use a convolution-based residual network to speed up the training process while preventing the training results from degrading as the neural network goes deeper.

2 BACKGROUND
2.1 Independent Q-Learning with Experience Replay
2.1.1 Q-Learning [28]. Q-learning aims to calculate the utility value of executing a certain action under a given environment by iteratively updating the Q-values with Equation (1). The utility value refers to the long-term reward, which means the chosen action may not produce the maximum reward for the current state, but it contributes most to the future reward.

Q(s,a) = Q(s,a) + α (r + γ max_{a′} Q(s′,a′) − Q(s,a))    (1)

In Equation (1), Q(s,a) represents the utility value of taking action a under state s. The learning rate α determines how new estimates are weighted against the old ones, and γ denotes the discount factor, which controls the trade-off between future rewards and current rewards.
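To make the update rule concrete, the following is a minimal Python sketch of the tabular update in Equation (1) with ε-greedy action selection. The action set is the one defined later in Section 3.2; the state encoding, constants and environment interface are illustrative placeholders, not the paper's implementation.

import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1        # learning rate, discount, exploration (illustrative)
ACTIONS = ("NSG", "EWG", "NSLG", "EWLG")     # the four phase actions of Section 3.2

Q = defaultdict(float)                       # Q[(state, action)] -> utility value, default 0.0

def choose_action(state):
    # epsilon-greedy selection over the four phase actions
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    # Equation (1): Q(s,a) = Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])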

2.1.2 Experience Replay. At each training step, the agent's experience tuple e_t = (s_t, a_t, r_t, s_{t+1}) is stored in a fixed-size replay memory, where r_t is the reward obtained by taking action a_t under state s_t, and s_{t+1} is the next state resulting from action a_t and the current state s_t. These experience tuples are stored in order, older before newer, and when the replay memory is full, the oldest tuple is automatically popped. Training tuples are randomly sampled from the replay memory for Q-value updating, so as to stabilize the training process, enhance sample efficiency and reduce correlation between samples.
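A minimal replay-memory sketch is shown below, assuming a plain FIFO buffer with uniform random sampling. The default capacity (3000) and batch size (32) are taken from Table 1; the class itself is our own illustration, not the paper's code.

import random
from collections import deque

class ReplayMemory:
    """Fixed-size FIFO buffer of (s, a, r, s_next) tuples with uniform sampling."""

    def __init__(self, capacity=3000):           # replay size from Table 1
        self.buffer = deque(maxlen=capacity)     # oldest tuple dropped automatically when full

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):             # minibatch size from Table 1
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)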

2.1.3 Independent Q-Learning. In IQL, each agent learns its own policy independently and treats the other agents as part of the environment, while each agent is able to track the policies of the other agents in real time. From the perspective of a single agent, the environment is non-stationary, because the other agents in the same system are also changing (updating), which makes it difficult for the system to converge. Despite this non-stationarity issue, IQL can learn well as long as each agent can track the other agents' policies in real time.



Figure 2: Simplified Road Network

2.2 Deep Reinforcement Learning (DRL)
Q-Learning can be extended to DRL with Equation (2), where θ are the weights of the DNN and Q(s,a | θ) is the neural network approximation of Q(s,a) in Equation (1). To be specific, the weights are updated by minimizing Equation (2), which is also called the loss function.

L(s,a | θ_i) = (r + γ max_{a′} Q(s′,a′ | θ_i) − Q(s,a | θ_i))²    (2)
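The paper implements its model in TensorFlow; as one possible rendering of Equation (2), here is a schematic TensorFlow 2 style sketch of the squared TD error for a minibatch. The network q_net (mapping a batch of state tensors to per-action Q-values) is a placeholder, and the separate target network used in practice (see Table 1) is omitted for brevity; this is an assumption-laden sketch, not the authors' code.

import tensorflow as tf

def dqn_loss(q_net, states, actions, rewards, next_states, gamma=0.9, num_actions=4):
    """Mean squared TD error of Equation (2) over a minibatch of transitions."""
    # Bootstrap target r + gamma * max_a' Q(s', a'); no gradient flows through the target.
    q_next = tf.reduce_max(q_net(next_states), axis=1)
    targets = tf.stop_gradient(rewards + gamma * q_next)

    # Q(s, a) for the actions actually taken
    q_all = q_net(states)                                        # shape: (batch, num_actions)
    q_taken = tf.reduce_sum(q_all * tf.one_hot(actions, num_actions), axis=1)

    return tf.reduce_mean(tf.square(targets - q_taken))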

2.3 Deep Residual Network (ResNet) [12]
ResNet is designed to speed up the training process while providing reasonable training results, regardless of the depth of the neural network. This is achieved by performing the computation in Equation (3),

x_{l+1} = x_l + F(x_l, W_l)    (3)

where x_l and x_{l+1} are the input and output of the l-th Residual Unit, respectively, W_l is the corresponding weights (and biases), and F represents the residual function, e.g., the two stacked convolutional layers in [12].
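For illustration, a minimal Keras-style sketch of one residual unit in the form of Equation (3) is shown below, with F realized as two stacked convolutional layers with batch normalization, following [12]. The filter count and kernel size are illustrative, not taken from the paper, and the input is assumed to already have `filters` channels so that the identity shortcut can be added directly.

import tensorflow as tf
from tensorflow.keras import layers

def residual_unit(x, filters=32, kernel_size=3):
    """One residual unit: x_{l+1} = x_l + F(x_l, W_l), with F = two conv layers."""
    shortcut = x                                   # identity shortcut (x must have `filters` channels)
    y = layers.Conv2D(filters, kernel_size, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, kernel_size, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])                # the addition in Equation (3)
    return layers.ReLU()(y)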

3 METHOD
3.1 State Space
For convenience of display, we give another, simplified road network in Figure 2 with two intersections, instead of the original four intersections in Figure 1. For the state representation, CDRL transfers the state of the entire road network into four image-like representations: vehicle position, vehicle speed, traffic signal phases (TSP) of other intersections, and TSP of the current intersection.

3.1.1 Vehicle Position. Following previous work [11, 26], we use Boolean values to mark the presence of vehicles, 0 for absent and 1 for present, as shown in Figure 3(a).

3.1.2 Vehicle Speed. The speed of each vehicle is normalized by the speed limit, which leads to the real numbers shown in Figure 3(b).

Figure 3: Four-channel image-like representation of agent X in Figure 2

Because the traffic signal lights occupy neither the driving spaces nor the non-street spaces, and in order to avoid interference, we cannot mark the TSP at the entrance/exit of each lane or at the center of the intersections. In our work, the TSP are instead represented by filling the corresponding lanes with the value 4 for a red light and 2 for a green light. In this example, we take intersection X as the current intersection and intersection Y as the other intersection.

3.1.3 TSP of Other Intersections. Figure 3(c) represents the TSP of intersection Y. When the number of intersections increases, the area covered by TSP in the third image, Figure 3(c), should also increase correspondingly. For example, as shown in Figure 1, if the current intersection is A, then the other intersections are B, C and D, which means all three of these TSP should be shown in the third image.

3.1.4 TSP of the Current Intersection. As shown in Figure 3(d), the fourth image represents the TSP of intersection X. The fourth image should always represent the TSP of exactly one intersection, the current intersection, no matter how many intersections there are in the current road network. In the example of Section 3.1.3, the current intersection is A, and only the TSP of intersection A should appear in the fourth image.

However, in a real-time application, the representation of the state space will be much easier: we only need real-time videos or photo streams of each intersection as the input states. This is because a single frame of video can represent the vehicles' positions and the TSP, and adjacent frames can represent the vehicles' speeds.
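A minimal sketch of assembling these four 10x10 image-like channels into the state tensor consumed by one agent is shown below. The grid size matches the input size in Figure 4, and the value conventions (0/1 presence, speed normalized by the limit, 2 for green and 4 for red) follow Sections 3.1.1-3.1.4; the input lists describing vehicles and signal cells are hypothetical, not the paper's data format.

import numpy as np

GRID = 10                 # 10x10 grid, matching the input size in Figure 4
GREEN, RED = 2.0, 4.0     # lane-filling values for TSP channels

def build_state(vehicles, other_tsp_cells, current_tsp_cells):
    """Return a (GRID, GRID, 4) tensor: position, speed, other TSP, current TSP.

    vehicles:           list of (row, col, speed, speed_limit) tuples
    other_tsp_cells:    list of (row, col, is_green) for lanes of the other intersections
    current_tsp_cells:  list of (row, col, is_green) for lanes of the current intersection
    """
    state = np.zeros((GRID, GRID, 4), dtype=np.float32)
    for row, col, speed, limit in vehicles:
        state[row, col, 0] = 1.0                 # channel 0: vehicle presence (Boolean)
        state[row, col, 1] = speed / limit       # channel 1: speed normalized by the limit
    for row, col, is_green in other_tsp_cells:
        state[row, col, 2] = GREEN if is_green else RED   # channel 2: TSP of other intersections
    for row, col, is_green in current_tsp_cells:
        state[row, col, 3] = GREEN if is_green else RED   # channel 3: TSP of current intersection
    return state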

3.2 Action Space
Once the state is observed, the agent has to choose one appropriate action based on the definition of the action space. Formally, the set of all possible compass directions for each agent is defined as A = {NSG, EWG, NSLG, EWLG}, where NSG stands for North-South Green, EWG for East-West Green, NSLG for North-South Left Green, and EWLG for East-West Left Green. At each time step, the agent can only choose one of these actions, and the other three directions are set to red by default. In this way, the agent can decide whether to continue the previous action or change to another one based on the current state, which gives the agent more autonomy. Thus, agents can make more flexible decisions under changing states.
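As an illustration, one way to encode this action set and its "chosen movement group green, everything else red" convention is sketched below; the dictionary format is our own, not an encoding used in the paper.

# The four actions of Section 3.2; picking one sets that movement group to green
# and leaves the other three movement groups at red by default.
ACTIONS = ("NSG", "EWG", "NSLG", "EWLG")

def phase_assignment(action):
    """Map an action to a dict of movement group -> light colour."""
    assert action in ACTIONS
    return {group: ("green" if group == action else "red") for group in ACTIONS}

# Example: phase_assignment("NSG") ->
# {"NSG": "green", "EWG": "red", "NSLG": "red", "EWLG": "red"}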



Figure 4: CDRL architecture. Conv: Convolution; BN: Batch Normalization; FC: Fully-Connected

For example, at time step t, a_t = NSG; assume the north-south direction is slightly congested and there is no vehicle in the east-west direction. At time step t+1, assume the chosen action is a_{t+1} = NSG, the north-south direction continues to jam, and there is a single vehicle in the east-west direction. Then at time step t+2, an agent with a high degree of autonomy will continue with NSG, whereas an agent with strict pre-defined rules may be forced to choose EWG under certain restrictions, such as a rule that the duration of a red light should not exceed 40 seconds.

An additional yellow phase is added before a phase change for safety assurance. To be specific, a yellow phase is only inserted when the upcoming phase differs from the current phase.

3.3 Reward
The reward in reinforcement learning is used to evaluate the impact of an action on its environment. The objective of learning is to maximize the cumulative reward that agents receive in the long run. In order to achieve a global optimum, a common approach is to define the reward as the average delay, given by

r_t = (1/N) Σ_{i=1}^{N} t_i    (4)

where r_t in Equation (4) represents the reward at each time step, t_i refers to the current waiting time of each driver, and N is the total number of drivers.

However, when such rewards are used as learning objectives, the interests of a small number of vehicles may be overly damaged. Specifically, in order to minimize global vehicle delay, DRL may tend to keep some road segments with heavy traffic loads unobstructed, which may cause the extension of red signal phases on other segments with small traffic flows. Unfortunately, such extensions usually last long. In order to avoid similar situations, the reward function, inspired by the well-known U.S. Bureau of Public Roads (BPR) function in transportation planning, is defined as Equation (5),

r_t = (1/N) Σ_{i=1}^{N} d [η − η (t_i / C)^τ]    (5)

where r_t is the reward at time step t, t_i refers to the current waiting time of driver i, d represents one unit of time, C denotes the generally acceptable waiting time, and η and τ are constants, which are normally set as η = 0.15 and τ = 2 following BPR.
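The following small sketch computes both rewards for a list of per-driver waiting times: the baseline average delay of Equation (4) and the BPR-inspired reward of Equation (5). The reconstruction of the (t_i/C)^τ term, the unit-time constant d = 1 and the acceptable waiting time of 40 seconds (borrowed from the red-light example in the Introduction) are our own assumptions, not values stated by the paper.

def average_delay_reward(wait_times):
    """Baseline reward of Equation (4): mean current waiting time over all drivers."""
    return sum(wait_times) / len(wait_times)

def bpr_inspired_reward(wait_times, eta=0.15, tau=2, acceptable_wait=40.0, unit_time=1.0):
    """BPR-inspired reward of Equation (5).

    Waiting longer than the acceptable time C makes the bracketed term negative,
    so long-blocked drivers drag the reward down sharply (the driver-patience effect).
    """
    n = len(wait_times)
    return sum(unit_time * (eta - eta * (t / acceptable_wait) ** tau) for t in wait_times) / n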

3.4 CDRL with Experience Replay
3.4.1 Cooperation among Agents. In IQL, although each agent treats the others as part of the environment, non-stationarity is introduced by the interactions among multiple agents. In particular, since multiple agents are interacting with the environment, the combination of any single agent and the environment can no longer be regarded as stationary. In this case, if we still want to let multiple agents interact with the environment all together, all of the agents involved need to be taken into consideration, which will certainly cause the curse of dimensionality. In CDRL, each agent still learns its own policy and treats the others as part of the environment. However, to prevent non-stationarity, only one agent is trained in each training episode, while the others behave according to their previously generated policies. During the training process, an agent does not need to communicate with the other agents.



After each training episode, the final policy is delivered to the other agents, while the learning process itself is not shared (one training episode contains 450 iterations).

3.4.2 Combined with Experience Replay. Each agent has its own replay memory, since the tuples e_t = (s_t, a_t, r_t, s_{t+1}) generated by different agents are different, especially in their state representations s_t and s_{t+1}. If we simply applied one replay memory to all agents, a tuple randomly sampled from the replay memory would be unlikely to reflect the current agent.

Figure 4 shows the architecture of CDRL. Four 10x10 image-like matrices are combined into a 10x10x4 tensor as the input for each agent. Specifically, assume that we are currently training agent A (intersections A, B, C and D are represented by agents A, B, C and D, respectively; see Figure 1). Each of the four agents gets its unique 10x10x4 tensor as input. However, only agent A gets the reward, which means only the ResNet and replay memory of A are updated. After a certain number of iterations, the latest policy of A is distributed to the other three agents, and the training intersection changes to another intersection.

In this way, CDRL avoids the non-stationarity problem caused by the changing environment, and thus can be combined with experience replay without causing trouble.

3.5 Algorithm and Optimization
Algorithm 1 shows the training process of CDRL. We first initialize all intersections, then obtain actions from all intersections; moreover, only one intersection is trained and updated during each training period.
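For intuition, here is a compact Python sketch of the round-robin schedule described in Sections 3.4-3.5: one agent learns per period while the others act with their frozen policies, and the learner's policy is copied to the others afterwards. The agent and environment interface (act, observe, train_step, copy_policy_from, state_for, step) is hypothetical; this is a sketch of the schedule, not the authors' implementation.

def train_cdrl(agents, env, n_periods, iters_per_period):
    """Round-robin CDRL training: one learning agent per period, frozen peers."""
    for period in range(n_periods):
        learner = period % len(agents)                       # intersection trained this period
        for _ in range(iters_per_period):
            # every agent picks an action from its current policy
            actions = [agent.act(env.state_for(agent)) for agent in agents]
            rewards, next_states = env.step(actions)
            # only the learner stores experience and updates its network
            agents[learner].observe(rewards[learner], next_states[learner])
            agents[learner].train_step()
        # after the period, distribute the learner's latest policy to the others
        for other in agents:
            if other is not agents[learner]:
                other.copy_policy_from(agents[learner])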

4 EXPERIMENTS
4.1 Setting
To simulate a road network, the traffic micro-simulator SUMO v0.22 [13] is used in the experiments. It provides a Python API to interact with the neural network program. We implemented the CDRL model in the TensorFlow Python framework.
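SUMO's Python API is TraCI; a minimal interaction sketch might look like the following. The configuration file name, the traffic-light ID and the 1-second step length are placeholders, and the exact module names and launch procedure vary across SUMO versions (e.g., traci.trafficlights and a separate connect step in older releases such as v0.22), so this is only an assumption-level illustration.

import traci

# Launch SUMO with a (hypothetical) configuration describing the "#"-shaped network.
traci.start(["sumo", "-c", "cross_network.sumocfg"])

for step in range(1800):                      # roughly a 30-minute episode at 1-second steps
    traci.simulationStep()                    # advance the simulation by one step
    # read raw vehicle information used to build the image-like state channels
    for veh_id in traci.vehicle.getIDList():
        pos = traci.vehicle.getPosition(veh_id)
        speed = traci.vehicle.getSpeed(veh_id)
    # apply the phase chosen by the agent for a given traffic light (placeholder ID)
    traci.trafficlight.setPhase("intersection_A", 0)

traci.close()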

The simulated traffic network is built in the shape of "#" and consists of four intersections. Each intersection connects to four road segments, and each road segment includes three outgoing lanes in each of the two opposite directions: a left-turn lane, a straight lane and a right-turn lane, as shown in Figure 1. For simplicity and clarity, four lanes in different directions are shown explicitly, and the other parts of the traffic network are arranged in the same way. During the simulation, each road segment can be viewed as an origin or a destination, and a certain amount of traffic is generated according to the pre-defined origin-destination matrix.

To evaluate the effectiveness of CDRL, we compare CDRL with three baseline algorithms:

(1) Self-Organizing Traffic Lights (SOTL) [7]: a SOTL controller switches the state of the signal lights according to the elapsed time and the number of vehicles in the queues.

(2) Q-Learning [3]: a tabular reinforcement learning method widely used in signal control; its state and action spaces are small enough that the policy function can be represented as a table.

(3) Deep Q Network (DQN) [19]: each junction is controlled by one agent, and each agent determines the alternation of TSP based only on the traffic situation of its directly linked roads.

Algorithm 1: CDRL Training Algorithm
Input: replay memory D with size M; all available intersections I = {I_0, I_1, ..., I_l}; total number of training periods n; training period of each intersection p; number of intersections l
Output: actions for all intersections

  trainIter = 0
  // initialize all intersections
  Initialize replay memory D with size M
  Initialize action-value function Q with random weights

  // get actions
  if probability < ε then
      select a random action a_t
  else
      select a_t = argmax_a Q(s_t, a; θ_i)
  end
  Execute action a_t in the simulator and observe reward r_t and next state s_{t+1}

  // train and update
  get actions
  put all training instances (s_t, a_t, r_t, s_{t+1}) into D
  if terminal s_{j+1} then
      set y_j = r_j
  else
      set y_j = r_j + γ max_{a′} Q(s_{j+1}, a′; θ_i)
  end
  Perform a gradient step on (y_j − Q(s_j, a_j; θ_i))² with respect to θ

  // main loop
  initialize all intersections
  repeat
      for each iteration t of a training period (1 ≤ t ≤ p) do
          get actions from all intersections I
          train and update intersection I_{trainIter mod l}
      end
      trainIter += 1
  until trainIter > n

Note that both SOTL and Q-Learning need to take traffic features as input, such as the queue length, the traffic flow or the average speed of lanes, whereas DQN only needs the original traffic information, as in CDRL (see Section 3.1). In the training process, the Q-learning and DQN agents are trained simultaneously and independently and do not distribute policies to other agents. An episode is a simulation process that lasts 30 minutes.

4.2 Results
Figure 5 presents the variation of the reward as the number of episodes increases during the training process.



Figure 5: Reward - Episode

Figure 6: Average Delay - Episode

As shown in Figure 5, the reward of SOTL remains unimproved, since it is a pre-defined method. For Q-learning, DQN and CDRL, the reward grows at different rates and then stabilizes. CDRL converges the most slowly; however, once convergence is complete, CDRL behaves slightly better than the other two.

Figure 6 shows the change in travel delay as the number of episodes increases. Except for SOTL, the overall performance shows a declining trend. We can observe that the convergence rate of CDRL is the slowest among Q-Learning, DQN and CDRL. However, once converged, CDRL achieves the minimum travel delay. By comparing Figure 5 and Figure 6, it is not difficult to find that CDRL achieves a significant improvement in performance by sharing policies with the others in each turn (one turn = 1500 episodes). By comparing Q-Learning and DQN, we can see that DQN converges more slowly but performs better after convergence. Together with the performance of CDRL, this suggests that deep neural networks are more capable of capturing traffic flow and improving performance.

Figure 7: Overall Cumulative Delay - Running time

Figure 8: Maximum Delay - Episode

Figure 7 shows the relationship between the overall cumulative delay and the running time in the 7500-th episode. CDRL performs best. This further confirms the importance of multi-agent cooperation for good performance.

Next, in order to clearly understand the impact of different rewards, we compare the proposed reward with the baseline reward, the average delay, based on the state space and action space defined in CDRL (see Section 3.1 and Section 3.2). Specifically, we measure the two rewards by the maximum delay, which represents the delay of the vehicle with the greatest delay in the road network. As shown in Figure 8, the two rewards perform similarly in the early stages of training; as the running time increases, the proposed reward significantly reduces the maximum delay, whereas the baseline reward shows a trend of increasing maximum delay, which means that some vehicles are blocked for quite a long time.

Based on the above experiments, we can see that by applying the proposed reward, CDRL can usually minimize the average delay of vehicles as well as reduce the likelihood that a small number of vehicles will be blocked for a long time.



5 RELATED WORK
5.1 Traffic Signal Control
Early research in traffic signal control was largely limited by the performance of traffic simulators [2, 5, 25, 29]. From the beginning of 2000, traffic simulation made rapid progress: simulation tools became more and more complex and realistic, and can almost reproduce real-world traffic behavior. However, the emergence of reinforcement learning has broken with the traditional solutions and demonstrated remarkable results [10, 17]. The use of reinforcement learning to solve traffic signal control problems is mainly distinguished by differences in state space, action space, rewards, network topology and simulators (the traffic generation model is part of the simulator). The definition of the state space is often based on traffic attributes, such as the queue length [1, 2, 6, 29], average traffic flow [3, 4], cumulative delay [11, 21] and so on. The action space is often defined over all signal phases or over the green phases only [1, 4, 6]. And the reward is often defined by the change in average delay [3] or the change in queue length [1, 4, 6].

5.2 Deep Reinforcement Learning
Reinforcement learning aims to maximize long-term rewards by following a state-action policy. However, when the state space grows too large to handle, function approximators such as neural networks can be used to approximate the value functions. To summarize, deep learning is responsible for representing the state of the Markov Decision Process, while reinforcement learning takes control of the direction of learning.

In previous research, the road network state space is defined based on abstract information derived from existing traffic attributes, which may lack useful information for defining the state space. For example, a definition based on the queue length ignores the speed of the vehicles, since both of the following are likely to produce a large queue length: a large number of vehicles passing smoothly and quickly through the intersection, and a large number of vehicles without movement, i.e., traffic congestion.

In order to comprehensively extract road network information, it is necessary to use DRL, not only because of the limitations of reinforcement learning when dealing with large state spaces, but also because of the advantages of the convolutional neural network (CNN): it requires little pre-processing of the input and can develop its own features. In addition, convolutional neural networks have shown impressive results in the area of computer vision [14]. Since the road network itself can sometimes be represented by images, it is quite convenient to apply DRL to our problem. DRL itself has already been successfully applied to the problem of managing strategic dialogue [8] and the problem of continuous control [15]. There are also more complex domains, such as the game of Go [22]. In addition, researchers have studied various ways to improve DRL performance, such as double Q-learning [27] and asynchronous learning [18]. However, it is still very difficult to apply DRL to multi-agent cooperation, and the best existing results hold only for the cooperation of two DQNs [23].

Table 1: Hyperparameters for CDRL

Hyperparameter          Value
Episode                 450 iterations
Max train iterations    7500 episodes
Minibatch size          32
Replay size             3000
Learning rate           0.0001
Update rule             ADAM
Initial ε               1.0
ε decay                 5e-5
Target network update   150
Agent network update    1500 episodes

Finally, all previous studies have used computer simulationsbecause real-world experiments are not feasible for a variety ofreasons.

6 CONCLUSION
In this paper, we proposed a traffic signal control method for multiple junctions based on cooperative deep reinforcement learning. At the core of the proposed method is a multi-agent framework, in which each agent utilizes a convolution-based residual network to solve the control optimization problem with a continuous state space in a cooperative manner. In addition, we introduced a more realistic reward function from the perspective of traffic psychology. Simulation experiments show that the proposed method outperforms the other baseline traffic signal control methods.

7 FUTURE WORK
In future work, we will extend the deep multi-agent reinforcement learning framework to coordinate more agents and apply it to traffic signal scheduling on a larger-scale road network. In order to prove robustness, we will also apply our framework to road networks with different topologies and different traffic generation models.

A HYPERPARAMETERS FOR CDRL
The hyperparameters for training CDRL are shown in Table 1. The max train iterations refers to the total number of training iterations, which is p*n in Algorithm 1. Minibatch size refers to the number of training samples used for each update. Replay size is the size M of the replay memory D. Learning rate refers to α as explained in Equation (1). Our update rule for the optimizer is ADAM. Initial ε refers to the initial value of ε for ε-greedy exploration, and the decay rate of the exploration is given as ε decay. Target network update represents the frequency of updating the target Q-network in the experiment. And agent network update refers to the training period of each intersection, as shown in Algorithm 1.
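For reference, the same settings collected into a plain Python dictionary; the values are copied from Table 1, while the key names are our own labels rather than identifiers from the paper.

# Hyperparameters from Table 1 (key names are illustrative, not from the paper).
CDRL_HYPERPARAMS = {
    "episode_length_iterations": 450,       # one training episode
    "max_train_iterations": 7500,           # total training episodes (p * n in Algorithm 1)
    "minibatch_size": 32,
    "replay_size": 3000,
    "learning_rate": 1e-4,
    "optimizer": "ADAM",
    "initial_epsilon": 1.0,
    "epsilon_decay": 5e-5,
    "target_network_update": 150,
    "agent_network_update_episodes": 1500,  # training period of each intersection
}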

REFERENCES
[1] Monireh Abdoos, Nasser Mozayani, and Ana LC Bazzan. 2013. Holonic multi-agent system for traffic signals control. Engineering Applications of Artificial Intelligence 26, 5 (2013), 1575-1587.



[2] Baher Abdulhai, Rob Pringle, and Grigoris J Karakoulas. 2003. Reinforcement learning for true adaptive traffic signal control. Journal of Transportation Engineering 129, 3 (2003), 278-285.
[3] Itamar Arel, Cong Liu, T Urbanik, and AG Kohls. 2010. Reinforcement learning-based multi-agent system for network traffic signal control. IET Intelligent Transport Systems 4, 2 (2010), 128-135.
[4] PG Balaji, X German, and D Srinivasan. 2010. Urban traffic signal control using reinforcement learning agents. IET Intelligent Transport Systems 4, 3 (2010), 177-188.
[5] Elmar Brockfeld, Robert Barlovic, Andreas Schadschneider, and Michael Schreckenberg. 2001. Optimizing traffic lights in a cellular automaton model for city traffic. Physical Review E 64, 5 (2001), 056132.
[6] Yit Kwong Chin, Nurmin Bolong, Aroland Kiring, Soo Siang Yang, and Kenneth Tze Kin Teo. 2011. Q-learning based traffic optimization in management of signal timing plan. International Journal of Simulation, Systems, Science & Technology 12, 3 (2011), 29-35.
[7] Seung-Bae Cools, Carlos Gershenson, and Bart D'Hooghe. 2013. Self-organizing traffic lights: A realistic simulation. In Advances in Applied Self-Organizing Systems. Springer, 45-55.
[8] Heriberto Cuayahuitl, Simon Keizer, and Oliver Lemon. 2015. Strategic dialogue management via deep reinforcement learning. arXiv preprint arXiv:1511.08099 (2015).
[9] Samah El-Tantawy, Baher Abdulhai, and Hossam Abdelgawad. 2013. Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (MARLIN-ATSC): methodology and large-scale application on downtown Toronto. IEEE Transactions on Intelligent Transportation Systems 14, 3 (2013), 1140-1150.
[10] Samah El-Tantawy, Baher Abdulhai, and Hossam Abdelgawad. 2014. Design of reinforcement learning parameters for seamless application of adaptive traffic signal control. Journal of Intelligent Transportation Systems 18, 3 (2014), 227-245.
[11] Wade Genders and Saiedeh Razavi. 2016. Using a deep reinforcement learning agent for traffic signal control. arXiv preprint arXiv:1611.01142 (2016).
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770-778.
[13] Daniel Krajzewicz, Jakob Erdmann, Michael Behrisch, and Laura Bieker. 2012. Recent development and applications of SUMO - Simulation of Urban MObility. International Journal On Advances in Systems and Measurements 5, 3&4 (2012).
[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097-1105.
[15] Timothy Paul Lillicrap, Jonathan James Hunt, Alexander Pritzel, Nicolas Manfred Otto Heess, Tom Erez, Yuval Tassa, David Silver, and Daniel Pieter Wierstra. 2016. Continuous control with deep reinforcement learning. (July 22 2016). US Patent App. 15/217,758.
[16] Long-Ji Lin. 1993. Reinforcement learning for robots using neural networks. Ph.D. Dissertation. Fujitsu Laboratories Ltd.
[17] Patrick Mannion, Jim Duggan, and Enda Howley. 2016. An experimental review of reinforcement learning algorithms for adaptive traffic signal control. In Autonomic Road Transport Support Systems. Springer, 47-66.
[18] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning. 1928-1937.
[19] Joseph O'Neill, Barty Pleydell-Bouverie, David Dupret, and Jozsef Csicsvari. 2010. Play it again: reactivation of waking experience and memory. Trends in Neurosciences 33, 5 (2010), 220-229.
[20] LA Prashanth and Shalabh Bhatnagar. 2011. Reinforcement learning with function approximation for traffic signal control. IEEE Transactions on Intelligent Transportation Systems 12, 2 (2011), 412-421.
[21] Lu Shoufeng, Liu Ximin, and Dai Shiqiang. 2008. Q-learning for adaptive traffic signal control based on delay minimization strategy. In 2008 IEEE International Conference on Networking, Sensing and Control (ICNSC 2008). IEEE, 687-691.
[22] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484-489.
[23] Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan Aru, Jaan Aru, and Raul Vicente. 2017. Multiagent cooperation and competition with deep reinforcement learning. PLoS ONE 12, 4 (2017), e0172395.
[24] Ming Tan. 1993. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning. 330-337.
[25] Thomas L Thorpe and Charles W Anderson. 1996. Traffic light control using SARSA with three state representations. Technical Report. Citeseer.
[26] Elise van der Pol and Frans A Oliehoek. 2016. Coordinated deep reinforcement learners for traffic light control. NIPS.
[27] Hado Van Hasselt, Arthur Guez, and David Silver. 2016. Deep reinforcement learning with double Q-learning. In AAAI. 2094-2100.
[28] Christopher John Cornish Hellaby Watkins. 1989. Learning from delayed rewards. Ph.D. Dissertation. University of Cambridge, England.
[29] Marco Wiering et al. 2000. Multi-agent reinforcement learning for traffic light control. In ICML. 1151-1158.
