-
Deep Recurrent Q-Learning for Partially Observable MDPs
Matthew Hausknecht and Peter Stone
University of Texas at Austin
November 13, 2015
-
Motivation
Intelligent decision making is at the heart of AI
Desire agents capable of learning to act intelligently in diverse environments
Reinforcement Learning provides a general learning framework
RL + deep neural networks yields robust controllers that learn from pixels (DQN)
DQN lacks mechanisms for handling partial observability
Extend DQN to handle Partially Observable Markov Decision Processes (POMDPs)
-
Outline
Motivation
Background: MDP, POMDP, Atari Domain, Deep Q-Network
Deep Recurrent Q-Network
Results
Related Work
Appendix
-
Markov Decision Process (MDP)
[Figure: agent-environment loop with state s_t, action a_t, reward r_t]
At each timestep the agent performs action a_t and receives reward r_t and state s_{t+1} from the environment
The Markov property ensures that s_{t+1} depends only on s_t and a_t
Learning an optimal policy π* requires no memory of past states
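Formally, the Markov property says the next-state distribution is conditionally independent of the history given the current state and action:

$$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t)$$

so a policy that conditions only on the current state loses nothing.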
-
Partially Observable Markov Decision Process(POMDP)
[Figure: agent-environment loop with observation o_t, action a_t, reward r_t]
The true state of the environment is hidden; observations o_t provide only partial information
Memory of past observations may help infer the true system state and improve the policy
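One standard way to make this precise: since s_t is hidden, the policy conditions on the observable history rather than the state,

$$\pi(a_t \mid h_t), \qquad h_t = (o_1, a_1, o_2, \ldots, a_{t-1}, o_t)$$

and a recurrent network can be read as learning a compact summary of h_t.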
-
Atari Domain
[Figure: Atari game screen with observation, action, and score]
160 × 210 state space → 84 × 84 grayscale
18 discrete actions
Rewards clipped to {−1, 0, 1}
Source: www.arcadelearningenvironment.org
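As a concrete sketch of this preprocessing (not the authors' code; it assumes an RGB screen array from the Arcade Learning Environment and uses OpenCV for resizing, with illustrative helper names):

```python
import numpy as np
import cv2  # OpenCV, used here for grayscale conversion and resizing

def preprocess(frame: np.ndarray) -> np.ndarray:
    """Reduce a full-resolution RGB screen to the 84x84 grayscale network input."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)

def clip_reward(reward: float) -> float:
    """Clip rewards to {-1, 0, 1} so one learning rate works across games."""
    return float(np.sign(reward))
```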
-
Atari Domain: MDP or POMDP?
Depends on the state representation!
• Single Frame ⇒ POMDP
• Four Frames ⇒ MDP
• Console RAM ⇒ MDP
-
Deep Q-Network (DQN)
[Figure: DQN architecture; 84 × 84 × 4 input → Conv1 → Conv2 → Conv3 → IP1 (512) → 18 Q-values]
Model-free Reinforcement Learning method using a deep neural network as Q-value function approximator (Mnih et al., 2015)
Takes the last four game screens as input: enough to make most Atari games Markov
How well does DQN perform in partially observed domains?
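As an architectural sketch (the authors' implementation is in Caffe; this PyTorch version follows the layer sizes reported by Mnih et al. (2015) and is illustrative, not their code):

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """84x84x4 stacked frames in, one Q-value per action out."""
    def __init__(self, n_actions: int = 18):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # Conv1
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # Conv2
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # Conv3
        )
        self.ip1 = nn.Sequential(nn.Linear(64 * 7 * 7, 512), nn.ReLU())  # IP1
        self.q_values = nn.Linear(512, n_actions)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 4, 84, 84), pixel values scaled to [0, 1]
        h = self.conv(x).flatten(start_dim=1)
        return self.q_values(self.ip1(h))
```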
-
Flickering Atari
Induce partial observability by stochastically obscuring the game screen:

$$o_t = \begin{cases} s_t & \text{with probability } \frac{1}{2} \\ \langle 0, \ldots, 0 \rangle & \text{otherwise} \end{cases}$$

Game state must now be inferred from past observations
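A minimal sketch of this observation function (names are illustrative; `screen` is the preprocessed frame):

```python
import numpy as np

def flicker(screen: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return the true screen with probability 1/2, else an all-zero frame."""
    return screen if rng.random() < 0.5 else np.zeros_like(screen)
```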
-
DQN Pong
[Video: true game screen vs. perceived game screen]
-
DQN Flickering Pong
[Video: true game screen vs. perceived game screen]
-
Outline
Motivation
Background: MDP, POMDP, Atari Domain, Deep Q-Network
Deep Recurrent Q-Network
Results
Related Work
Appendix
-
Deep Recurrent Q-Network
[Figure: DRQN architecture; one 84 × 84 frame per timestep → conv stack → LSTM (512) → 18 Q-values, unrolled over timesteps t−1, t, . . .]
Long Short-Term Memory (Hochreiter and Schmidhuber, 1997)
Identical to DQN except:
• Replaces DQN's IP1 with a recurrent LSTM layer of the same dimension
• Each timestep takes a single frame as input
LSTM provides a selective memory of past game states
Trained end-to-end using BPTT, unrolled for the last 10 timesteps (sketched below)
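A sketch of that change in PyTorch (the authors' implementation is in Caffe; layer sizes here mirror the DQN sketch above and are illustrative):

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """DQN's convolutional stack with IP1 replaced by a 512-unit LSTM;
    the network sees one 84x84 frame per timestep."""
    def __init__(self, n_actions: int = 18):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(64 * 7 * 7, 512, batch_first=True)
        self.q_values = nn.Linear(512, n_actions)

    def forward(self, x: torch.Tensor, hidden=None):
        # x: (batch, time, 1, 84, 84); for BPTT, time would be the 10-step unroll
        b, t = x.shape[:2]
        feats = self.conv(x.reshape(b * t, 1, 84, 84)).reshape(b, t, -1)
        out, hidden = self.lstm(feats, hidden)  # hidden state carries the memory
        return self.q_values(out), hidden
```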
-
DRQN Maximal Activations
Unit detects the agent missing the ball
Unit detects ball reflection on paddle
Unit detects ball reflection on wall
-
Outline
Motivation
Background: MDP, POMDP, Atari Domain, Deep Q-Network
Deep Recurrent Q-Network
Results
Related Work
Appendix
-
DRQN Flickering Pong
[Video: true game screen vs. perceived game screen]
-
Flickering Pong
[Figure: Flickering Pong results]
-
Pong Generalization: POMDP ⇒ MDP
How does DRQN generalize when trained on Flickering Pong and evaluated on standard Pong?
-
Pong Generalization: POMDP ⇒ MDP
[Figure: Pong score vs. observation probability for DRQN 1-frame, DQN 10-frame, and DQN 4-frame]
-
Performance on Flickering Atari Games
Game          10-frame DRQN (±std)   10-frame DQN (±std)
Pong          12.1 (±2.2)            -9.9 (±3.3)
Beam Rider    618 (±115)             1685.6 (±875)
Asteroids     1032 (±410)            1010 (±535)
Bowling       65.5 (±13)             57.3 (±8)
Centipede     4319.2 (±4378)         5268.1 (±2052)
Chopper Cmd   1330 (±294)            1450 (±787.8)
Double Dunk   -14 (±2.5)             -16.2 (±2.6)
Frostbite     414 (±494)             436 (±462.5)
Ice Hockey    -5.4 (±2.7)            -4.2 (±1.5)
Ms. Pacman    1739 (±942)            1824 (±490)
-
Performance on Standard Atari Games
Game          10-frame DRQN (±std)   10-frame DQN (±std)
Double Dunk   -2 (±7.8)              -10 (±3.5)
Frostbite     2875 (±535)            519 (±363)
Beam Rider    3269 (±1167)           6923 (±1027)
Asteroids     1020 (±312)            1070 (±345)
Bowling       62 (±5.9)              72 (±11)
Centipede     3534 (±1601)           3653 (±1903)
Chopper Cmd   2070 (±875)            1460 (±976)
Ice Hockey    -4.4 (±1.6)            -3.5 (±3.5)
Ms. Pacman    2048 (±653)            2363 (±735)
-
DRQN Frostbite
[Video: true game screen vs. perceived game screen]
-
Generalization: MDP ⇒ POMDP
How does DRQN generalize when trained on standard Atari and evaluated on flickering Atari?
-
Generalization: MDP ⇒ POMDP
[Figure: percentage of original score retained vs. observation probability, for DRQN and DQN]
-
Outline
Motivation
Background: MDP, POMDP, Atari Domain, Deep Q-Network
Deep Recurrent Q-Network
Results
Related Work
Appendix
-
Related Work
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735-1780.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529-533.

Narasimhan, K., Kulkarni, T., and Barzilay, R. (2015). Language understanding for text-based games using deep reinforcement learning. CoRR, abs/1506.08941.

Wierstra, D., Foerster, A., Peters, J., and Schmidhuber, J. (2007). Solving deep memory POMDPs with recurrent policy gradients. In Proceedings of the International Conference on Artificial Neural Networks (ICANN).
-
Thanks!
[Figure: DRQN with LSTM over single 84 × 84 frames, unrolled over timesteps t−1, t, . . .]
LSTM can help deal with partial observability
Largest gains in generalization between MDP ⇔ POMDP
Future work: understanding why DRQN does better/worse on certain games
Source: https://github.com/mhauskn/dqn/tree/recurrent
Matthew Hausknecht and Peter Stone
-
Outline
Motivation
Background: MDP, POMDP, Atari Domain, Deep Q-Network
Deep Recurrent Q-Network
Results
Related Work
Appendix
-
Computational Efficiency
              Backwards (ms)           Forwards (ms)
Frames        1      4      10         1     4     10
Baseline      8.82   13.6   26.7       2.0   4.0   9.0
Unroll 1      18.2   22.3   33.7       2.4   4.4   9.4
Unroll 10     77.3   111.3  180.5      2.5   4.4   8.3
Unroll 30     204.5  263.4  491.1      2.5   3.8   9.4

Table: Average milliseconds per backwards/forwards pass. Frames refers to the number of channels in the input image. Baseline is a non-recurrent network (e.g., DQN). Unroll refers to an LSTM network backpropagated through time for 1/10/30 steps.