Reinforcement Learning and Network Design
TRANSCRIPT
Reinforcement Learning and Network Design
Yuki Oyama, Shibaura Institute of Technology
Activity Landscape Design Lab. September 17, 2021
Behavior Modeling in Transportation Networks, Lecture Series #3-1 (16:00–16:30)
A road network example
The planner, who aims to maximize efficiency, wants to answer:
• whether a new road should be constructed [topology]
• where and how large parking spaces should be placed [capacity]
• on which roads and how much toll should be charged [control]
• etc.

These decisions will impact travelers’ behavior.
Let’s generalize the framework
An example of pricing (on which road a toll is installed)

Planner (Supply): objective function and design variables. The planner wants to minimize total travel time.
Users (Demand): behavior distribution and network traffic.

A new toll is installed on road A → drivers change route choices to avoid road A → the alternative routes to road A become congested.
The new toll on road A made drivers change route choices to avoid it; the alternative routes became congested, and total travel time in the network increased as a result.
Network design is demand-based planning of network topology & systems.

Planner (Supply):
• Objective function: what the planner wants to maximize — what you (the planner) want to achieve by the project [review via indicators]
• Design variables: what the planner decides under some constraints [network topology & conditions]

Users (Demand):
• Behavior distribution: what travelers decide/change in reaction to the design of the network; it appears as a collective state
• Network traffic: the [aggregation] of behavior, summarized into [indicators]
How to predict the impact on behavior in a network?
Follows Magnanti and Wong (1984); Farahani et al. (2013)
Modeling behavior in a network
OD pair with demand q.

Path choice model (logit): P(r) = exp(v_r) / Σ_{r′∈C} exp(v_{r′}), where C is the choice set (set of paths).
Traffic flow on path r (in the static case): f_r = q · P(r).

Not as easy as it looks…
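As a sketch of the static model above (assuming a unit logit scale and hypothetical path utilities and demand):

```python
import math

def logit_probs(utilities):
    """Logit path choice probabilities over an explicit choice set C
    (unit scale assumed)."""
    e = [math.exp(v) for v in utilities]
    s = sum(e)
    return [x / s for x in e]

# Hypothetical 3-path choice set; utilities are negative travel times
v = [-2.0, -2.5, -3.0]
probs = logit_probs(v)

q = 1000                        # OD demand
flows = [q * p for p in probs]  # static path flows f_r = q * P(r)
```

This only works once the choice set is enumerated, which is exactly the difficulty raised on the next slide.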
Networks are generally complex…

[Figure: four example networks, each with an origin O and a destination D and many possible paths.]

The path set is almost impossible to define!! This is because a path is a combination of links in the network.

*A description of more complex choices (e.g., time) needs additional dimensions of the network, which further increases the network size.
Not a combination but a SEQUENCE

A path is not one choice from a combinatorial set but a sequence of choices (e.g., A, B, C, …).

An approach is modeling based on Reinforcement Learning (RL), which models the sequential decisions of agents. This presentation shows a special case of RL for network path choice modeling.
How to model a sequence?
⇒ what is link choice probability, exactly?

A path r can be described as a sequence of links: r = (a_1, a_2, …, a_J).

Path choice probability: P(r) = Π_j p(a_{j+1} | a_j)

p(a_{j+1} | a_j): link choice probability conditional on the previous link
What should be considered is the outcome given by the product of link choice probabilities, which must be consistent with the original model*, i.e.,

Π_j p(a_{j+1} | a_j) = P(r) = exp(v_r) / Σ_{r′∈C} exp(v_{r′})

This is achieved by considering a forward-looking mechanism.
*when assuming the logit model
Value function

1. Myopic: choose the next link a_{j+1} by its immediate utility v(a_{j+1} | a_j) alone.
2. Forward-looking: choose it by v(a_{j+1} | a_j) + V(a_{j+1}), where
v(a | k): link choice utility
V(a): value function — the (expected maximum) utility from link a to the destination D

The goal is modeling mechanisms of behavior.

cf. the Shortest Path (SP) problem: there, the value function is the SP cost from a_j to the destination. Adding random utility makes the model a generalization of the SP problem.
The Gumbel distribution has a nice property: the expected maximum is a logsum, which gives the recurrence relation

V(k) = E[ max_{a∈A(k)} ( v(a|k) + V(a) + ε_a ) ] = log Σ_{a∈A(k)} exp( v(a|k) + V(a) )

Letting z_k = exp(V(k)) (value function), M_{ka} = exp(v(a|k)) (weight incidence matrix), and b the unit vector with 1 at the destination, the value function is the solution to a system of linear equations:

z = M z + b
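A minimal sketch of this linear-system computation, assuming a hypothetical three-link line network with unit-scale Gumbel errors and link utility −1 per transition:

```python
import numpy as np

# Hypothetical line network of links 0 -> 1 -> 2 -> destination, with
# link utilities v(a|k) = -1 for each feasible transition.
n = 3                      # links 0, 1, 2; the destination is absorbing
M = np.zeros((n, n))       # weight incidence matrix: M[k, a] = exp(v(a|k))
M[0, 1] = np.exp(-1.0)
M[1, 2] = np.exp(-1.0)
b = np.zeros(n)
b[2] = np.exp(-1.0)        # link 2 connects to the destination

z = np.linalg.solve(np.eye(n) - M, b)  # solves z = M z + b
V = np.log(z)              # value function V(k) = log z_k
```

Because this toy network has a single path (no route choice), V recovers the SP costs (−3, −2, −1), matching the shortest-path interpretation of the value function.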
Let’s check the consistency!

Link choice probability is given by:

p(a | k) = exp( v(a|k) + V(a) ) / Σ_{a′∈A(k)} exp( v(a′|k) + V(a′) ) = exp( v(a|k) + V(a) − V(k) )

*like logit, with the new deterministic utility v(a|k) + V(a)

Then we have (the value-function terms telescope along the path):

P(r) = Π_j p(a_{j+1} | a_j) = exp( Σ_j v(a_{j+1} | a_j) ) / exp(V(a_1)) = exp(v_r) / Σ_{r′∈Ω} exp(v_{r′})

The path utility v_r is the sum of link utilities, and exp(V(a_1)) is the expected maximum over all possible paths, so the probabilities over Ω sum to 1.

⇒ Consistent with the logit model with the universal path set Ω.
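The consistency can be verified numerically. A sketch on a hypothetical diamond network (origin link o, then link a or b, each reaching the destination with utility 0): the product of link choice probabilities reproduces the logit probability over the enumerated path set.

```python
import numpy as np

# Hypothetical diamond network; link utilities v(a|o) = -1, v(b|o) = -2.
v = {('o', 'a'): -1.0, ('o', 'b'): -2.0}
links = ['o', 'a', 'b']
idx = {l: i for i, l in enumerate(links)}

M = np.zeros((3, 3))                 # M[k, a] = exp(v(a|k))
for (k, a), u in v.items():
    M[idx[k], idx[a]] = np.exp(u)
b = np.zeros(3)
b[idx['a']] = b[idx['b']] = 1.0      # exp(0): a and b reach the destination

z = np.linalg.solve(np.eye(3) - M, b)
V = np.log(z)                        # value function

def p(a, k):
    """Link choice probability p(a|k) = exp(v(a|k) + V(a) - V(k))."""
    return np.exp(v[(k, a)] + V[idx[a]] - V[idx[k]])

# Product of link probabilities vs. logit over the universal path set
P_path_a = p('a', 'o')               # path via a (the final step has prob. 1)
P_logit = np.exp(-1.0) / (np.exp(-1.0) + np.exp(-2.0))
```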
What’s the point?

Now you can model path choice behavior without explicitly defining a choice set:
1. Decompose path choice into sequential link choices.
2. Describe the forward-looking behavioral mechanism by the value function, which is recursively computed.

This (efficient) computational method of modeling is called the “Recursive Logit (RL) model,” named by Fosgerau et al. (2013).
Markov Decision Process (MDP)

To generalize further, define:
• Action: choice behavior (what the agent does)
• State: situation (where the agent is) that changes as a result of actions

*In path choice (recursive) modeling, an action is directly a choice of state.

An MDP is specified by an initial state, actions, state transitions with a state transition probability, and a discount factor.
Reinforcement Learning approaches

• MDP: the general modeling framework of sequential decisions.
• Reinforcement Learning: learn the optimal action probability through a number of trials & results.
• Recursive Logit (a Dynamic Discrete Choice Model): models the action probability under a RUM assumption, with action = state, Gumbel errors, and a stochastic Boltzmann policy (temperature = 1/scale).
• Inverse Reinforcement Learning: learn the utility (reward) function from the actions of experts — this corresponds to parameter estimation.

See also: Mai and Jaillet (2020)
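The Boltzmann policy mentioned above is a temperature-scaled softmax over action values; with temperature = 1/scale (here, temperature 1 for a unit logit scale) it coincides with the logit choice probability. A minimal sketch:

```python
import math

def boltzmann_policy(q_values, temperature=1.0):
    """Boltzmann (softmax) action probabilities; with temperature = 1/mu
    this coincides with a logit choice model of scale mu."""
    m = max(q_values)  # subtract the max for numerical stability
    e = [math.exp((q - m) / temperature) for q in q_values]
    s = sum(e)
    return [x / s for x in e]

probs = boltzmann_policy([-1.0, -2.0], temperature=1.0)
```

Lowering the temperature makes the policy more deterministic, which is the RL analogue of raising the logit scale.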
Now, we have link transition probabilities p(a | k).
Given OD demand q, we compute network traffic flows:

x_a: link flow (on link a)
x_{ka}: transition flow (from link k to link a)

[1] x_{ka} = x_k · p(a | k)
[2] x_a = Σ_k x_{ka} + q_a

[1] and [2] reduce to a system of linear equations,
(I − Pᵀ) x = q,
which can be efficiently computed!
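A minimal sketch of this flow computation on a hypothetical three-link network, where P holds the transition probabilities from the recursive model and all demand enters on link 0:

```python
import numpy as np

# Hypothetical transition matrix P[k, a] = p(a | k); link 2 exits to the
# destination, so its row is zero.
P = np.array([[0.0, 0.6, 0.4],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])
q = np.array([100.0, 0.0, 0.0])  # demand entering the network per link

# [1] x_{ka} = x_k p(a|k) and [2] x_a = sum_k x_{ka} + q_a reduce to
# the linear system (I - P^T) x = q.
x = np.linalg.solve(np.eye(3) - P.T, q)
```

Here 100 travelers split 60/40 at link 0 and merge again on link 2, so x = (100, 60, 100).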
Another dimension may be needed: e.g., a planner may expect changes in visitors’ time-use in a city center.

A time-structured network allows for integrated modeling of route, activity place, and duration.

[Figure: (a) a spatial network in (x, y); (b) a time axis t added; (c) the time-structured network on a 5×5 grid of nodes 1–25 over t = 1, …, 5. An example path: move from 8 to 9, move from 9 to 14, stay at 14, move from 14 to 13, move from 13 to 8. A stay link represents an activity.]

Network traffic:
n_a: number of people who visited space a
τ_a: total time spent at space a (which can represent multiple activities)
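These two traffic measures can be read off time-expanded trajectories. A sketch with hypothetical data, where a repeated space between consecutive time steps is treated as a stay link:

```python
from collections import defaultdict

dt = 60  # hypothetical seconds per time step
# Hypothetical time-expanded paths: sequences of (space, t) states
paths = [
    [(8, 1), (9, 2), (14, 3), (14, 4), (13, 5), (8, 6)],  # stays at 14
    [(8, 1), (9, 2), (14, 3), (13, 4), (8, 5)],
]

visitors = defaultdict(set)    # n_a: people who visited space a
time_spent = defaultdict(int)  # tau_a: total stay time at space a [s]

for person, path in enumerate(paths):
    for (a, t), (b, _) in zip(path, path[1:]):
        visitors[a].add(person)
        if a == b:             # a stay link represents an activity
            time_spent[a] += dt
    visitors[path[-1][0]].add(person)

n_14, tau_14 = len(visitors[14]), time_spent[14]
```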
Calculate indicators based on traffic

Examples:
• total travel time experienced [min]
• total CO₂ emission [g-CO₂]
• total revenue the manager gains [JPY]
• consumer surplus (welfare)

Remark (again): the choice of objective reflects what you (the planner) want to achieve through the project.
Is minimizing negative indicators enough? What is a better or ideal city, in your view?
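Given link flows x_a from the assignment, the first indicators above are simple flow-weighted sums. A sketch with hypothetical link attributes:

```python
# Hypothetical link flows and per-traveler attributes
flows = {'a': 100.0, 'b': 60.0, 'c': 100.0}
travel_time = {'a': 2.0, 'b': 3.0, 'c': 1.5}   # [min] per traveler
toll = {'a': 0.0, 'b': 200.0, 'c': 0.0}        # [JPY] per traveler

total_travel_time = sum(flows[a] * travel_time[a] for a in flows)  # [min]
total_revenue = sum(flows[a] * toll[a] for a in flows)             # [JPY]
```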
A public project entails trade-offs of goals

A road closure may increase travel time in the network. But the space can be utilized as a park that is good for activities, health, and the environment. Of course, it requires a large capital cost, and the budget is limited.

[Photo: Barcelona superblock, @Bcomu Global]

Is a weighted sum of objectives enough?
• Often, there is a clear trade-off between two objectives.
• Weight selection may lead to a biased policy decision.
Multi-objective design

[Figure: objective space with capital cost (to be minimized) plotted against social welfare (to be maximized). The ideal point is never achieved; in reality, candidate solutions are either dominated or lie on the Pareto frontier (the set of non-dominated solutions), and a budget constraint bounds the feasible options.]
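The non-dominated set in the figure can be computed by a straightforward pairwise dominance check. A sketch with hypothetical (welfare, cost) outcomes of candidate designs:

```python
def pareto_frontier(solutions):
    """Non-dominated (welfare, cost) pairs: welfare is maximized,
    capital cost is minimized."""
    frontier = []
    for w, c in solutions:
        dominated = any(w2 >= w and c2 <= c and (w2, c2) != (w, c)
                        for w2, c2 in solutions)
        if not dominated:
            frontier.append((w, c))
    return frontier

# Hypothetical candidates: (social welfare, capital cost)
cands = [(10, 5), (12, 8), (9, 7), (12, 6), (15, 12)]
front = pareto_frontier(cands)
```

Applying a budget constraint then simply truncates the frontier at the maximum affordable cost.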
Case study | A pedestrian activity network design

City center of Matsuyama city: department stores, the Gintengai and Okaido malls, the city hall, and a park.

[Figure: the study network of 21 nodes (1–21) and four start/end nodes (1o–4o); nodes are marked for move/stay or move only; scale bar 100 m.]

• Design: expansion of the walking space on each street [m]
• Expectation: resistance decreases, and more places are visited
Case study | A pedestrian activity network design

[Figure: accepted and rejected solutions in objective space, with the total area of widened sidewalk [m²] (0–50,000) on the vertical axis. (a) Goal I: sojourn time maximization, plotted against average sojourn time [×60 s] (78.0–84.0), with solution A at (82.60, 6000). (b) Goal II: expected utility maximization, plotted against total expected utility (104–109), with solution B at (105.30, 6000).]

• Clear trade-offs between goals and budget are observed.
• The Pareto frontier offers a variety of policies based on the investment level.
Case study | A pedestrian activity network design

[Upper-level problem] the network design (solution); [lower-level problem] the resulting link flows and activity durations.

[Figure: (1) the original network of nodes 1–21, with link flows (legend: 100, 250, 500, 1000) and activity durations [×60 s/person] at the staying nodes 1, 4, 7, 13, 15, 17, 18, 19, 20, 21. (2) Solution A (Goal I: sojourn time maximization) and (3) Solution B (Goal II: expected utility maximization), each showing the increased street widths [m] (e.g., +1 to +6) and the resulting changes: link-flow differences > +10 or < −10, and activity-duration differences > +0.5 or < −0.5.]
Summary & Remarks

• Reinforcement Learning is a general framework for modeling sequential decisions in networks.
• You can model any “state-action network”; “state = action = space” is just an example.
• Network design is a mathematical problem of planning based on behavior in a network.
• Be thoughtful when you set an objective.
• Multi-objective design may fit public projects.
References
• Magnanti, T.L., Wong, R.T., 1984. Network design and transportation planning: models and algorithms. Transportation Science 18(1), 1–55.
• Farahani, R.Z., Miandoabchi, E., Szeto, W.Y., Rashidi, H., 2013. A review of urban transportation network design problems. European Journal of Operational Research 229(2), 281–302.
• Fosgerau, M., Frejinger, E., Karlstrom, A., 2013. A link based network route choice model with unrestricted choice set. Transportation Research Part B: Methodological 56, 70–80.
• Mai, T., Jaillet, P., 2020. A Relation Analysis of Markov Decision Process Frameworks. arXiv:2008.07820.
• Oyama, Y., 2017. A Markovian route choice analysis for trajectory-based urban planning. PhD thesis, The University of Tokyo.
• Oyama, Y., Hato, E., 2019. Prism-based path set restriction for solving Markovian traffic assignment problem. Transportation Research Part B: Methodological 122, 528–546.
• Oyama, Y., Hato, E., 2017. Pedestrian activity network design based on multi-objective optimization. Journal of the City Planning Institute of Japan 52(3), 810–817. (in Japanese)
Appendix | Design levels and examples
Figure 1 in Farahani et al. (2013)
Appendix | Solution algorithms (metaheuristics)
Figure 2 in Farahani et al. (2013)
SA: Simulated Annealing; GA: Genetic Algorithm; TS: Tabu Search; AC: Ant Colony; PSO: Particle Swarm Optimization; SS: Scatter Search; H: Hybrid metaheuristics