Reinforcement Learning with Multiple, Qualitatively Different State Representations
Harm van Seijen (TNO / UvA), Bram Bakker (UvA), Leon Kester (TNO / UvA)
NIPS 2007 workshop
The Reinforcement Learning Problem

[Diagram: the agent sends an action a to the environment; the environment returns a state s and a reward r.]

Goal: maximize cumulative discounted reward

Question: What is the best way to represent the environment?
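The following is a minimal sketch of this agent-environment loop with tabular Q-learning; the slides do not specify the learning algorithm or environment interface, so the env methods (reset, step, actions) and all parameter values here are illustrative assumptions.

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes=100, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Sketch of tabular Q-learning over an assumed env interface."""
    Q = defaultdict(float)            # Q[(state, action)] -> value estimate
    actions = env.actions             # assumed: list of available actions
    for _ in range(num_episodes):
        s = env.reset()               # assumed: returns the initial state
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)   # assumed: (state, reward, terminal)
            # update towards the reward plus the discounted value of the next state
            target = r if done else r + gamma * max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```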
[Figures: example environments plotted in x-y coordinates, with both axes running from 0 to 100.]
Suppose 3 agents work in the same environment and have the same action space, but different state spaces:

agent 1: state space S1 = {s1_1, s1_2, s1_3, ..., s1_N1}, state space size = N1
agent 2: state space S2 = {s2_1, s2_2, s2_3, ..., s2_N2}, state space size = N2
agent 3: state space S3 = {s3_1, s3_2, s3_3, ..., s3_N3}, state space size = N3

shared action space A = {a1, a2}, action space size = 2
04/19/23NIPS 2007 workshop7
Extension action space
External Actionsa_e1 : old a1
a_e2 : old a2
Switch actions:a_s1 : ‘switch to representation 1’a_s2 : ‘switch to representation 2’a_s3 : ‘switch to representation 3’
New Action space:a1 : a_e1 + a_s1 a2 : a_e1 + a_s2
a3 : a_e1 + a_s3
a4 : a_e2 + a_s1 a5 : a_e2 + a_s2
a6 : a_e2 + a_s3
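A small sketch of this construction: every new action pairs one external action with one switch action, giving 2 x 3 = 6 combined actions. The names below are placeholders mirroring the slide, not identifiers from the original work.

```python
from itertools import product

external_actions = ["a_e1", "a_e2"]         # the original actions a1, a2
switch_actions = ["a_s1", "a_s2", "a_s3"]   # 'switch to representation i'

# Each combined action executes an external action in the environment and
# simultaneously selects the representation used for the next decision.
extended_actions = list(product(external_actions, switch_actions))

print(len(extended_actions))  # 2 external * 3 switch = 6 combined actions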
Extension of the state space

agent 1: state space S1 = {s1_1, s1_2, ..., s1_N1}, state space size = N1
agent 2: state space S2 = {s2_1, s2_2, ..., s2_N2}, state space size = N2
agent 3: state space S3 = {s3_1, s3_2, ..., s3_N3}, state space size = N3

switch agent: state space S = {s1_1, s1_2, ..., s1_N1, s2_1, s2_2, ..., s2_N2, s3_1, s3_2, ..., s3_N3}, state space size = N1 + N2 + N3
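A sketch of the switch agent's state space as the disjoint union of the individual state spaces; the sizes below are the example values from the next slide, and the indexing helper is an illustrative assumption.

```python
N = [100, 50, 100]   # example sizes N1, N2, N3 (see the state-action size example)

def switch_state(rep, local_state):
    """Map (representation index in {1, 2, 3}, state index within that
    representation) to a single index in the combined state space."""
    return sum(N[:rep - 1]) + local_state

total_states = sum(N)   # N1 + N2 + N3 = 250 for this example
```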
Requirements for Convergence

Theoretical requirement: if the individual representations obey the Markov property, then convergence to the optimal solution is guaranteed.

Empirical requirement: each representation should contain information that is useful for deciding which external action to take and information that is useful for deciding when to switch.
State-Action Space Sizes Example

Representation    States     Actions   State-Actions
Rep 1             100        2         200
Rep 2             50         2         100
Rep 3             100        2         200
Switch (OR)       250        6         1,500
Union (AND)       500,000    2         1,000,000
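The arithmetic behind this table, as a short sketch: switching (OR) adds the state spaces, while combining all features into one flat representation (AND) multiplies them. The variable names are illustrative.

```python
sizes = [100, 50, 100]   # states in representations 1, 2, 3
n_external = 2           # external actions
n_reps = len(sizes)      # number of representations to switch between

switch_states = sum(sizes)                                   # 250
switch_state_actions = switch_states * n_external * n_reps   # 250 * 6 = 1,500

union_states = 1
for n in sizes:
    union_states *= n                                        # 100 * 50 * 100 = 500,000
union_state_actions = union_states * n_external              # 1,000,000
```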
Switching is advantageous if:
1. the state space is very large, AND
2. the state space is heterogeneous.
Traffic Scenario

Situation: a crossroad of two one-way roads.

Task: at each time step, the traffic agent decides whether the vertical lane or the horizontal lane gets a green light. Changing the lights involves an orange period of 5 time steps.

Reward: -1 times the total number of cars waiting in front of the traffic light.
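A minimal sketch of this reward signal; the assumption that "total cars waiting" covers both approaches of the crossroad, and the function signature, are illustrative and not taken from the slides.

```python
def reward(cars_waiting_vertical, cars_waiting_horizontal):
    """Reward at one time step: -1 per car waiting in front of the light,
    summed over both lanes of the crossroad."""
    return -(cars_waiting_vertical + cars_waiting_horizontal)
```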
Representation 1

Feature 1: traffic light status (4 values)
Features 2-3: occupancy of the first 2 squares of the vertical lane (2x2 values)
Features 4-5: occupancy of the first 2 squares of the horizontal lane (2x2 values)
Representation 2

Feature 1: traffic light status (4 values)
Feature 2: first-car configuration (3 values)
Feature 3: occupancy of the first square before the traffic light (2 values)
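An illustrative encoding of the two representations as feature tuples, consistent with the value counts above; the argument names are assumptions, not the original feature definitions.

```python
# Representation 1: light status (4) x occupancy of the first two squares of
# the vertical lane (2*2) x the same for the horizontal lane (2*2)
# -> 4 * 4 * 4 = 64 states.
def rep1_state(light_status, v_sq1, v_sq2, h_sq1, h_sq2):
    return (light_status, v_sq1, v_sq2, h_sq1, h_sq2)

# Representation 2: light status (4) x first-car configuration (3) x
# occupancy of the square directly before the light (2) -> 24 states.
def rep2_state(light_status, first_car_config, square_before_light_occupied):
    return (light_status, first_car_config, square_before_light_occupied)
```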
Representations Compared

Representation    States    Actions   State-Actions
Rep 1             64        2         128
Rep 2             24        2         48
Switch            88        4         352
Rep 1+            256       2         512
On-line Performance for the Traffic Scenario

[Figure: average reward per time step (roughly -2 to -0.4) versus time steps (0 to 5x10^5) for representation 1, representation 2, the switch representation, and representation 1+.]
Conclusions

• We introduced an extension to the standard RL problem that allows the decision agent to dynamically switch between a number of qualitatively different representations.
• This approach offers advantages in RL problems with large, heterogeneous state spaces.
• Experiments with a (simulated) traffic control problem showed good results: the agent that was allowed to switch reached a higher end performance, with a convergence rate similar to that of a single representation of comparable state-action space size.
Future Work

• Use larger state spaces (~ a few hundred states per representation) and more than 2 different representations.
• Explore the application domain of sensor management (for example, switching between radar settings).
• Combine the switching approach with function approximation.
• Examine in more detail the convergence properties of the switch representation.
• Use representations that describe realistic sensor output.
• Explore new methods for switching.
Switching Algorithm versus POMDP

POMDP:
• Updates an estimate of a hidden variable and bases decisions on a probability distribution over all possible values of this hidden variable.
• Cannot choose between different representations.

Switch algorithm:
• Hidden information is present but not taken into account; the price for this is a more stochastic action outcome.
• When hidden information is very important for the decision-making process, the agent can decide to switch to a different representation that does take this information into account.