TRANSCRIPT
Cooperation and Communication in Multiagent Deep Reinforcement Learning
Matthew Hausknecht
Nov 28, 2016
Advisor: Peter Stone
1
Motivation
Intelligent decision making is at the heart of AI.
2
Thesis Question
How can the power of Deep Neural Networks be leveraged to extend Reinforcement Learning towards domains featuring partial observability, continuous parameterized action spaces, and sparse rewards?
3
How can Deep Reinforcement Learning agents learn to cooperate in a multiagent setting?
Contributions
• Half Field Offense Environment
• Deep RL in parameterized action space
• Multiagent Deep RL
• Deep Recurrent Q-Network (DRQN)
• Curriculum learning in HFO
4
Outline
1. Background
2. Deep Reinforcement Learning
3. Multiagent Architectures
4. Communication
5
Markov Decision Process
[Diagram: the agent takes action a_t; the environment returns state s_t and reward r_t]
Formalizes the interaction between the agent and environment.
6
Half Field Offense
Cooperative multiagent soccer domain built on the libraries used by the RoboCup competition.
Objective: Learn a goal-scoring policy for the offense agents.
Features continuous actions, partial observability, and opportunities for multiagent coordination.
7
Half Field Offense
8
9
10
State & Action Spaces
58 continuous state features encoding distances and angles to points of interest.
Parameterized-Continuous Action Space: Dash(direction, power), Turn(direction), Tackle(direction), Kick(direction, power)
Choose one discrete action + parameters every timestep.
11
Learning in HFO is difficult
12
Reinforcement Learning
Reinforcement Learning provides a general framework for sequential decision making.
Objective: Learn a policy that maximizes the discounted sum of future rewards.
A deterministic policy π is a mapping from states to actions.
For each encountered state, what is the best action to perform?
13
Q-Value Function
Estimates the expected return from a given state-action pair:
$Q^\pi(s, a) = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \mid s, a \right]$
Answers the question: "How good is action a from state s?"
The optimal Q-Value function yields an optimal policy.
14
Deep Neural Network
Parametric model with stacked layers of representation.
Powerful, general-purpose function approximator.
Parameters θ optimized via backpropagation.
[Diagram: input → hidden layers → output, parameterized by θ]
15
Deep Reinforcement Learning
A neural network is used to approximate the Q-Value function and the policy π.
Replay Memory: a queue of recent experience tuples (s, a, r, s') seen by the agent.
Updates to the network are done on experience sampled randomly from the replay memory.
[Diagram: state → network → Q-Value / Action]
16
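As a rough illustration of the replay memory described above, here is a minimal sketch (class and parameter names are illustrative, not from the thesis code):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size queue of (s, a, r, s_next) experience tuples."""
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks temporal correlation between updates.
        return random.sample(self.buffer, batch_size)
```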
Deep Deterministic Policy Gradients
Model-free deep actor-critic architecture [Lillicrap '15].
Actor learns a policy π; Critic learns to estimate Q-Values.
Actor outputs 4 actions + 6 parameters.
a_t = the discrete action with the highest activation, plus its associated parameter(s).
[Architecture diagram: Actor and Critic networks, each with fully connected ReLU layers of 1024, 512, 256, 128 units; the Actor maps state → 4 actions and 6 parameters; the Critic maps state and action → Q-Value.]
17
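A small sketch of how a parameterized action could be assembled from the actor's 4 action activations and 6 parameters (the parameter ordering and helper name are assumptions):

```python
import numpy as np

# Assumed parameter ordering:
# (dash_dir, dash_power, turn_dir, tackle_dir, kick_dir, kick_power)
def select_action(action_activations, params):
    idx = int(np.argmax(action_activations))      # discrete action with highest activation
    if idx == 0:
        return ("Dash", params[0], params[1])
    elif idx == 1:
        return ("Turn", params[2])
    elif idx == 2:
        return ("Tackle", params[3])
    else:
        return ("Kick", params[4], params[5])
```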
Training
Critic trained using temporal difference, given experience (s_t, a_t, r_t, s_{t+1}):
$y = r_t + \gamma \, Q(s_{t+1}, \mu(s_{t+1}) \mid \theta^Q)$
Actor trained via Critic gradients:
$\nabla_{\theta^\mu} \mu(s) = \nabla_a Q(s, a \mid \theta^Q) \, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)$
[Diagram: state → Actor (θ^μ) → actions and parameters → Critic (θ^Q) → Q-Value; ∇_a Q(s, a) flows back into the Actor.]
18
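A hedged, PyTorch-style sketch of these two updates; the module and optimizer objects are assumed stand-ins, not the thesis implementation:

```python
import torch

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99):
    s, a, r, s_next = batch
    # Critic: temporal-difference target y = r + gamma * Q(s', mu(s'))
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = torch.nn.functional.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor: follow the critic's gradient with respect to the action
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```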
Reward Signal
$r_t = -\Delta d(\text{Agent}, \text{Ball}) + I_{\text{kick}} - 3\,\Delta d(\text{Ball}, \text{Goal}) + 5\,I_{\text{Goal}}$
The first two terms reward going to the ball; the last two reward kicking it to the goal.
With only a goal-scoring reward, the agent never learns to approach the ball or dribble.
19
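A minimal sketch of this shaped reward, reading the Δ terms as per-step changes in distance (argument names are illustrative):

```python
def shaped_reward(d_agent_ball_prev, d_agent_ball, kicked,
                  d_ball_goal_prev, d_ball_goal, scored):
    """r_t = -Δd(agent, ball) + I_kick - 3Δd(ball, goal) + 5·I_goal."""
    r = d_agent_ball_prev - d_agent_ball            # -Δd(agent, ball)
    r += 1.0 if kicked else 0.0                     # I_kick
    r += 3.0 * (d_ball_goal_prev - d_ball_goal)     # -3Δd(ball, goal)
    r += 5.0 if scored else 0.0                     # 5·I_goal
    return r
```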
Results
20
Results

| Agent | Scoring Percent | Avg. Steps to Goal |
|---|---|---|
| DDPG1 | 1.0 | 108.0 |
| DDPG2 | .99 | 107.1 |
| DDPG3 | .98 | 104.8 |
| DDPG4 | .96 | 112.3 |
| Helios' Champion | .96 | 72.0 |
| DDPG5 | .94 | 119.1 |
| DDPG6 | .84 | 113.2 |
| SARSA | .81 | 70.7 |
| DDPG7 | .80 | 118.2 |

[Deep Reinforcement Learning in Parameterized Action Space, Hausknecht and Stone, ICLR '16]
21
Offense versus Keeper
The automated Helios goalkeeper is quite effective at stopping shots.
Independently created by the Helios RoboCup team.
DDPG fails to reliably score against the keeper.
22
23
Better Value Estimates
Q-Learning is known to overestimate Q-Values [Hasselt '16].
Several remedies exist, but they don't always extend to an actor-critic framework.
We will show that mixing off-policy updates with on-policy Monte Carlo updates yields quicker, more stable learning.
24
Q-Learning Spectrum
Q-Learning is a bootstrap, off-policy method:
$Q(s_t, a_t \mid \theta) = r_{t+1} + \gamma \max_a Q(s_{t+1}, a \mid \theta)$
N-step Q-Learning [Watkins '89]:
$Q(s_t, a_t \mid \theta) = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n \max_a Q(s_{t+n}, a \mid \theta)$
On-policy Monte Carlo updates are non-bootstrap:
$Q(s_t, a_t \mid \theta) = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{T} r_{T}$
25
[Figure: algorithms arranged along two axes, On-Policy vs. Off-Policy and Bootstrap vs. No-Bootstrap. Bootstrap methods (Q-Learning, SARSA) sit at the high-bias, low-variance end; non-bootstrap Monte Carlo methods (on-policy MC, off-policy MC) sit at the low-bias, high-variance end; n-step-return methods lie in between.]
26
On-Policy Monte Carlo
On-policy Monte Carlo updates make sense near the beginning of learning, since $\max_a Q(s_{t+1}, a \mid \theta)$ is nearly always wrong.
After Q-Values are refined, off-policy bootstrap updates utilize experience samples more efficiently.
A middle path is to mix both update types:
$y = \beta \, y_{\text{on-policy-MC}} + (1 - \beta) \, y_{\text{1-step-Q-learning}}$
27
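A rough sketch of the mixed target for one sampled transition, assuming the on-policy Monte Carlo return was stored alongside the experience (names are illustrative):

```python
def mixed_target(reward, next_state, done, mc_return,
                 target_critic, target_actor, beta=0.2, gamma=0.99):
    """y = beta * y_on-policy-MC + (1 - beta) * y_1-step-Q-learning."""
    if done:
        q_target = reward
    else:
        # 1-step bootstrap target using the (assumed) target networks
        q_target = reward + gamma * target_critic(next_state, target_actor(next_state))
    return beta * mc_return + (1.0 - beta) * q_target
```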
Experiments
Trained on the 1v1 task.
We evaluated 5 different β values: 0, .2, .5, .8, 1.
$y = \beta \, y_{\text{on-policy-MC}} + (1 - \beta) \, y_{\text{1-step-Q-learning}}$
28
β = 0
29
β = 0.2
30
β = 0.5
31
β = 0.8
32
β = 1.0
33
34
Off-Policy Monte Carlo
For 1v1 experiments, a middle ground between on-policy and off-policy updates works best.
Purely off-policy updates can't learn; purely on-policy updates take far too long to learn.
[On-Policy vs. Off-Policy Updates for Deep Reinforcement Learning, Hausknecht and Stone, DeepRL '16]
35
Thesis Question
How can the power of Deep Neural Networks be leveraged to extend Reinforcement Learning towards domains featuring partial observability, continuous parameterized action spaces, and sparse rewards?
Novel extension of DDPG to parameterized-continuous action spaces.
Method for efficiently mixing on-policy and off-policy update targets.
36
Outline
1. Background
2. Deep Reinforcement Learning
3. Multiagent Architectures
4. Communication
37
Deep Multiagent RL
Can multiple Deep RL agents cooperate to achieve a shared goal?
Examine several architectures:
Centralized: Single controller for multiple agents
Parameter Sharing: Layers shared between agents
Memory Sharing: Shared replay memory
38
Centralized
Both agents are controlled by a single actor-critic.
State and action spaces are concatenated.
Learning takes place in the higher-dimensional joint state and action space.
[Architecture diagram: a single Actor-Critic built from fully connected ReLU layers (1024, 512, 256, 128 units); the Actor takes both agents' states and outputs 4 actions + 6 parameters per agent; the Critic maps the joint states and actions to a Q-Value.]
39
40
41
Parameter Sharing
Weights are shared between layers of the two Actor networks; the Critic networks share weights separately.
Reduces the total number of parameters.
Encourages both agents to participate even though 2v0 is solvable by a single agent.
[Architecture diagram: two Actors and two Critics built from fully connected ReLU layers (1024, 512, 256, 128 units), with layers shared between the Actors and, separately, between the Critics; each Actor maps state → 4 actions + 6 parameters, each Critic → Q-Value.]
42
43
44
45
Memory Sharing
Both agents add experiences to a shared memory. Both agents perform updates from the shared memory. Parameters of the agents are not shared.
[Diagram: Agent-0 and Agent-1 both reading from and writing to a Shared Replay Queue]
46
Memory Sharing
Memory Sharing agents add experience to, and update from, a shared replay memory.
47
48
Multiagent Architectures
The centralized controller ends up utilizing only a single agent.
Sharing parameters and memories encourages policy similarity, which can help all agents learn the task.
Memory sharing results in the smallest performance gap between agents.
49
Outline
1. Background
2. Deep Reinforcement Learning
3. Multiagent Architectures
4. Communication
50
Symbiosis in Nature
Crocodile and Egyptian Plover
Clownfish and anemone
51
Communication
In human society, cooperation can be achieved far faster than in nature, through communication.
How can learning agents use communication to achieve cooperation?
52
Desire a learned communication protocol that can:
1. Identify task-relevant information
2. Communicate meaningful information to the teammate
3. Remain stable enough that the teammate can trust the meaning of messages
53
Related Work
• Multiagent Cooperation and Competition with Deep Reinforcement Learning; Tampuu et al., 2015
• Learning to Communicate to Solve Riddles with Deep Distributed Recurrent Q-Networks; Foerster et al., 2016
• Learning to Communicate with Deep Multi-Agent Reinforcement Learning; Foerster et al., 2016
54
Baseline Communication Architecture
Continuous communication actions. Messages are real values.
No meaning attached to messages; no pre-defined communication protocol.
Incoming messages appended to the state.
Messages updated in the direction of higher Q-Values.
[Architecture diagram: Actor and Critic (1024, 512, 256, 128 ReLU units); the Actor takes the state plus incoming comm and outputs actions, parameters, and an outgoing comm message; the Critic outputs a Q-Value.]
55
Teammate Comm Gradients
Same as the baseline, except communication gradients are exchanged with the teammate.
Allows the teammate to directly alter communicated messages in the direction of higher reward.
56
Baseline Comm Arch
[Diagram: two agents (Agent 0 and Agent 1), each with its own Actor and Critic; at timestep T=0 each agent emits an action and a comm message, which is appended to the other agent's state at T=1.]
57
Teammate Comm Gradients
[Diagram: the same two-agent unrolling as the baseline, but gradients flow back through the communication channel from each agent's Critic into the teammate's Actor.]
58
Guess My Number Task
Each agent is assigned a secret number.
Goal: the teammate sends a message close to your secret number.
Reward:
Max reward when the teammate's message equals your secret number.
59
Baseline
60
Teammate Comm Grad
61
Blind Soccer
Blind Agent can hear but cannot see.
Sighted Agent can see but cannot move.
Goal: the sighted agent must use communication to help the blind agent locate and approach the ball.
Agents communicate using messages.
Rewards:
62
63
Baseline
The baseline architecture begins to solve the task, but the protocol is not stable enough and performance crashes.
64
Teammate Comm Gradients
Fails to ground messages in the state of the environment.
Agents fabricate idealized messages that don't reflect reality.
Example: the blind agent wants the ball to be directly ahead, so it alters the sighted agent's messages to say this, regardless of the actual location of the ball.
65
Teammate Comm Gradients
66
Grounded Semantic Network
Intuition: we can use observed rewards to guide the learning of a communication protocol.
GSN learns to extract information from the sighted agent's observations that is useful for predicting the blind agent's rewards.
[Architecture diagram: the sighted agent's observation o(1) passes through ReLU layers (θ_m) to produce message m(1); m(1) and the blind agent's action a(2) pass through further ReLU layers (θ_r) to predict the blind agent's reward r(2).]
67
Grounded Semantic Network
Maps the sighted agent's observation o(1) and the blind teammate's action a(2) to the blind teammate's reward r(2):
r(2) = GSN(o(1), a(2))
[Same architecture diagram as above.]
68
Grounded Semantic Network
Composed of a message encoder M and a reward model R.
The activations of layer m(1) form the message.
Intuition: m(1) will contain any salient aspects of o(1) that are relevant for predicting reward.
[Same architecture diagram as above.]
69
Grounded Semantic Network
Training minimizes a supervised loss:
Evaluation requires only observation o(1) to generate message m(1).
The GSN is trained in parallel with the agent, using a learning rate 10x smaller than the agent's for stability.
[Same architecture diagram as above.]
70
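A hedged, PyTorch-style sketch of the GSN idea; the exact layer sizes and the squared-error loss are assumptions, since the slide's loss equation is not reproduced in this transcript:

```python
import torch
import torch.nn as nn

class GSN(nn.Module):
    """Message encoder M + reward model R: r2_hat = R(M(o1), a2)."""
    def __init__(self, obs_dim, act_dim, msg_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(           # M: o(1) -> m(1)
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, msg_dim))
        self.reward_model = nn.Sequential(      # R: (m(1), a(2)) -> r(2)
            nn.Linear(msg_dim + act_dim, 64), nn.ReLU(),
            nn.Linear(64, 1))

    def forward(self, o1, a2):
        m1 = self.encoder(o1)
        return m1, self.reward_model(torch.cat([m1, a2], dim=-1))

# Assumed supervised step: m1, r2_hat = gsn(o1, a2)
#                          loss = ((r2_hat.squeeze(-1) - r2) ** 2).mean()
```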
GSN
71
72
Is communication really helping?
74
t-SNE Analysis
2D t-SNE projection of 4D messages sent by the sighted agent.
Similar messages in 4D space are close in the 2D projection.
Each dot is colored according to whether the blind agent Dashed or Turned.
The content of messages strongly influences the actions of the blind agent.
75
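A small sketch of this kind of analysis with scikit-learn and matplotlib, using placeholder data (variable names are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# messages: (N, 4) array of 4-D messages; actions: length-N array, 0 = Dash, 1 = Turn
messages = np.random.rand(500, 4)               # placeholder data for the sketch
actions = np.random.randint(0, 2, size=500)

proj = TSNE(n_components=2, perplexity=30).fit_transform(messages)
plt.scatter(proj[:, 0], proj[:, 1], c=actions, cmap="coolwarm", s=8)
plt.title("2D t-SNE of sighted agent's messages, colored by blind agent's action")
plt.show()
```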
Desire a learned communication protocol that can:
1. Identify task-relevant information
2. Communicate meaningful information to the teammate
3. Remain stable enough that the teammate can trust the meaning of messages
GSN fulfills these criteria. For more info see: [Grounded Semantic Networks for Learning Shared Communication Protocols, NIPS DeepRL Workshop '16]
77
Communication Conclusions
Communication can help cooperation. It is possible to learn stable and informative communication protocols.
Teammate Communication Gradients works best in domains where reward is tied directly to the content of the messages.
GSN is ideal in domains where communication must be used as a means to achieve some other objective in the environment.
78
Thesis Question
How can Deep Reinforcement Learning agents learn to cooperate in a multiagent setting?
Showed that sharing parameters and replay memories can help multiple agents learn to perform a task.
Demonstrated that communication can help agents cooperate in a domain featuring asymmetric information.
79
Future Work
Teammate modeling: could a learned model of the teammate be used for planning or better cooperation?
Embodied Imitation Learning: How can an agent learn from a teacher without directly observing the states or actions of the teacher?
Adversarial multiagent learning: How to communicate in the presence of an adversary?
80
Contributions
• Extended Deep RL algorithms to parameterized-continuous action spaces.
• Demonstrated that mixing bootstrap and Monte Carlo returns yields better learning performance.
• Introduced and analyzed parameter- and memory-sharing multiagent architectures.
• Introduced communication architectures and demonstrated that learned communication can help cooperation.
• Open source contributions: HFO, all learning agents
81
Thanks!
[Backup diagrams: the parameterized-action Actor-Critic network (1024, 512, 256, 128 ReLU units) and the DRQN network (Convolutions 1-3 → LSTM → Fully Connected → Q-Values).]
82
Partially Observable MDP (POMDP)
[Diagram: the agent takes action a_t; the environment returns observation o_t and reward r_t]
Observations provide noisy or incomplete information.
Memory may help to learn a better policy.
83
Atari Environment
[Diagram: the agent takes action a_t; the environment returns observation o_t (game screen) and reward r_t]
Resolution 160x210x3; 18 discrete actions.
Reward is the change in game score.
84
Atari: MDP or POMDP?
Depends on the number of game screens used in the state representation.
Many games are partially observable given only a single frame.
85
Deep Q-Network (DQN)
Accepts the last 4 screens as input.
A neural network estimates Q-Values Q(s, a) for all 18 actions:
$Q(s \mid \theta) = (Q_{s,a_1}, \dots, Q_{s,a_n})$
Learns via temporal difference:
$L_i(\theta) = \mathbb{E}_{s,a,r,s' \sim D}\left[ \left( Q(s_t \mid \theta) - y_i \right)^2 \right]$
$y_i = r_t + \gamma \max\left( Q(s_{t+1} \mid \theta) \right)$
[Architecture: Convolution 1 → Convolution 2 → Convolution 3 → Fully Connected → Fully Connected → Q-Values]
86
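A compact, PyTorch-style sketch of the DQN loss above; the separate target network is a common stabilization and an assumption beyond the slide's single-θ notation:

```python
import torch

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    s, a, r, s_next, done = batch                              # sampled from replay memory D
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s_t, a_t | theta)
    with torch.no_grad():
        y = r + gamma * target_net(s_next).max(dim=1).values * (1.0 - done)
    return torch.nn.functional.mse_loss(q_sa, y)               # E[(Q(s_t|theta) - y)^2]
```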
Flickering Atari
How well does DQN perform on POMDPs?
Induce partial observability by stochastically obscuring the game screen:
$o_t = \begin{cases} s_t & \text{with } p = \tfrac{1}{2} \\ \langle 0, \dots, 0 \rangle & \text{otherwise} \end{cases}$
Game state must be inferred from past observations.
87
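A one-line sketch of the obscuring step (NumPy, illustrative):

```python
import numpy as np

def flicker(screen, p=0.5, rng=None):
    """Return the true screen with probability p, otherwise an all-zero observation."""
    rng = rng or np.random.default_rng()
    return screen if rng.random() < p else np.zeros_like(screen)
```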
DQN Pong
[Video: true game screen vs. observed game screen]
88
DQN Flickering Pong
[Video: true game screen vs. observed game screen]
89
Deep Recurrent Q-Network
Uses a Long Short-Term Memory (LSTM) to selectively remember past game screens.
Architecture identical to DQN except: 1. Replaces the first fully connected layer with an LSTM; 2. Takes a single frame as input each timestep.
Trained end-to-end using BPTT over the last 10 timesteps.
[Architecture: Convolution 1 → Convolution 2 → Convolution 3 → LSTM → Fully Connected → Q-Values]
90
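A hedged, PyTorch-style sketch of such a network; the convolution sizes and 84x84 grayscale input follow the standard DQN preprocessing, which is an assumption here:

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """Convolutions -> LSTM -> Q-Values; processes one frame per timestep."""
    def __init__(self, num_actions=18, lstm_size=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(), nn.Flatten())
        self.lstm = nn.LSTM(64 * 7 * 7, lstm_size, batch_first=True)
        self.q_head = nn.Linear(lstm_size, num_actions)

    def forward(self, frames, hidden=None):
        # frames: (batch, time, 1, 84, 84); the LSTM carries state across timesteps
        b, t = frames.shape[:2]
        feats = self.conv(frames.reshape(b * t, *frames.shape[2:])).reshape(b, t, -1)
        out, hidden = self.lstm(feats, hidden)
        return self.q_head(out), hidden
```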
DRQN Flickering Pong
[Video: true game screen vs. observed game screen]
91
92
LSTM infers velocity
93
DRQN Frostbite
94
95
Extensions
DRQN has been extended in several ways:
• Addressable Memory: Control of Memory, Active Perception, and Action in Minecraft; Oh et al., ICML '16
• Continuous Action Space: Memory-Based Control with Recurrent Neural Networks; Heess et al., 2016
[Deep Recurrent Q-Learning for Partially Observable MDPs, Hausknecht et al., 2015; arXiv]
96
Bounded Action Space
HFO's continuous parameters are bounded:
Dash(direction, power), Turn(direction), Tackle(direction), Kick(direction, power)
Direction in [-180, 180], Power in [0, 100]
Exceeding these ranges results in no action.
If DDPG is unaware of the bounds, it will invariably exceed them.
97
Bounded DDPG
We examine 3 approaches for bounding DDPG's action space:
1. Squashing Function
2. Zero Gradients
3. Invert Gradients
98
Squashing Function
1. Use a tanh non-linearity to bound the parameter output
2. Rescale into the desired range
99
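A tiny sketch of squashing and rescaling (illustrative):

```python
import numpy as np

def squash(raw_value, p_min, p_max):
    """tanh bounds the raw output to (-1, 1); rescale into [p_min, p_max]."""
    return p_min + (np.tanh(raw_value) + 1.0) * 0.5 * (p_max - p_min)

# e.g. squash(2.3, -180, 180) maps an unbounded activation to a valid direction
```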
Squashing Function
100
Zeroing Gradients
Each continuous parameter has a range: [p_min, p_max].
Let p denote the current value of the parameter and ∇p the suggested gradient. Then:
$\nabla p = \begin{cases} \nabla p & \text{if } p_{\min} < p < p_{\max} \\ 0 & \text{otherwise} \end{cases}$
101
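A minimal sketch of this rule (illustrative):

```python
def zero_bounded_gradient(p, grad_p, p_min, p_max):
    """Pass the gradient through only while the parameter is strictly inside its range."""
    return grad_p if p_min < p < p_max else 0.0
```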
Zeroing Gradients
102
Inverting Gradients
For each parameter:
$\nabla p = \nabla p \cdot \begin{cases} (p_{\max} - p)/(p_{\max} - p_{\min}) & \text{if } \nabla p \text{ suggests increasing } p \\ (p - p_{\min})/(p_{\max} - p_{\min}) & \text{otherwise} \end{cases}$
Allows parameters to approach the bounds of their ranges without exceeding them.
Parameters don't get "stuck" or saturate.
103
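A short sketch of the inverting-gradients rule, assuming a gradient-ascent convention where a positive gradient suggests increasing p (illustrative):

```python
def invert_gradient(p, grad_p, p_min, p_max):
    """Scale the gradient by the remaining headroom, shrinking it near a bound
    (and inverting it if p has crossed the bound)."""
    if grad_p > 0:                                   # gradient suggests increasing p
        return grad_p * (p_max - p) / (p_max - p_min)
    return grad_p * (p - p_min) / (p_max - p_min)
```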
Inverting Gradients
104
2v1
• Can agents learn cooperative behaviors like passing and cross kicks?
• Hypothesis: Cross kicks can help achieve more reliable scoring in the 2v1 setting. Can the sharing architectures learn such behaviors?
105
2v1
106
2v1
107
108
2v1
• Both memory sharing and parameter sharing result in a reasonably high goal-scoring percentage
• Agents do not learn passes or cross kicks and instead rely on individual attacks
109
Curriculum Learning
• Motivation: it's difficult to design unbiased reward functions for complex tasks.
• Easier to break a complex task into many subtasks, learn each subtask, and then use the skills to address the complex task.
• Given: Complex target task with sparse reward function, curriculum of tasks with non-sparse reward.
• Goal: Learn how to perform all tasks in curriculum including the target task.
110
State Embed Architecture
Each task in the curriculum is represented as an embedding vector.
The task embedding vector is concatenated with the agent's observation.
[Architecture diagram: the state, concatenated with a task embedding (index i → W_emb), feeds the Actor network (1024, 512, 256, 128 ReLU units) producing 4 actions + 6 parameters.]
111
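A hedged, PyTorch-style sketch of the state-embedding idea (the layer stack is abbreviated relative to the slide's diagram; names are illustrative):

```python
import torch
import torch.nn as nn

class StateEmbedActor(nn.Module):
    """Concatenate a learned task embedding with the observation before the actor layers."""
    def __init__(self, obs_dim, num_tasks, emb_dim=8, num_outputs=10):
        super().__init__()
        self.task_embedding = nn.Embedding(num_tasks, emb_dim)     # W_emb
        self.actor = nn.Sequential(
            nn.Linear(obs_dim + emb_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, num_outputs))                           # 4 actions + 6 parameters

    def forward(self, obs, task_id):
        emb = self.task_embedding(task_id)                         # task_id: LongTensor of indices
        return self.actor(torch.cat([obs, emb], dim=-1))
```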
Weight Embed Architecture
Each task in the curriculum is represented as an embedding vector.
The weight embedding architecture conditions the activations of a particular layer on the task embedding.
Allows a single network to learn many tasks and act uniquely in each task.
[Architecture diagram: the task embedding (index i → W_emb, with encoder W_enc and decoder W_dec) conditions a 128-unit hidden layer of the Actor network (1024, 512, 256, 128 ReLU units) mapping state → 4 actions + 6 parameters.]
112
Curriculum
• Target Task: Score on Goal
• Curriculum: Move to Ball, Kick to Goal
• Each task in the curriculum corresponds to one skill in the target task.
• Tasks are represented using an embedding vector.
113
Curriculum Ordering
• The order of training tasks has an impact on ultimate performance.
• Random curriculum: presents a random task from the curriculum at each episode.
• Sequential curriculum: easiest tasks presented first, then harder tasks. Each task is trained until convergence.
114
Random Curriculum, No Embedding
115
Seq Curriculum, No Embedding
116
Random Curriculum, State Embedding
117
Seq Curriculum, State Embedding
118
Random Curriculum, Weight Embedding
119
Seq Curriculum, Weight Embedding
120
Curriculum Learning
• Agents can reuse learned skills to perform the soccer task, which features a sparse goal reward
• Agents must continue to revisit all training tasks or they will forget previous skills
• Ablation experiments show that all tasks are necessary for the soccer curriculum
121