![Page 1: The Predictron: End-to-end Learning and Planning (mlg.postech.ac.kr/~readinglist/slides/20161227.pdf)](https://reader034.vdocument.in/reader034/viewer/2022050409/5f85f919b308e70bac00382b/html5/thumbnails/1.jpg)
The Predictron: End-to-end Learning and Planning
Yoonho Lee
Department of Computer Science and Engineering
Pohang University of Science and Technology
December 27, 2016
![Page 2](https://reader034.vdocument.in/reader034/viewer/2022050409/5f85f919b308e70bac00382b/html5/thumbnails/2.jpg)
Hierarchical RL: Motivation
![Page 3](https://reader034.vdocument.in/reader034/viewer/2022050409/5f85f919b308e70bac00382b/html5/thumbnails/3.jpg)
Hierarchical RL
![Page 4](https://reader034.vdocument.in/reader034/viewer/2022050409/5f85f919b308e70bac00382b/html5/thumbnails/4.jpg)
Hierarchical RL
![Page 5](https://reader034.vdocument.in/reader034/viewer/2022050409/5f85f919b308e70bac00382b/html5/thumbnails/5.jpg)
Reward augmentation¹
¹ Bellemare et al. Unifying Count-Based Exploration and Intrinsic Motivation, NIPS 2016
![Page 6](https://reader034.vdocument.in/reader034/viewer/2022050409/5f85f919b308e70bac00382b/html5/thumbnails/6.jpg)
Hierarchical RL
![Page 7](https://reader034.vdocument.in/reader034/viewer/2022050409/5f85f919b308e70bac00382b/html5/thumbnails/7.jpg)
Hierarchical actions²
² Florensa et al. Stochastic Neural Networks for Hierarchical Reinforcement Learning, under review for ICLR 2017
![Page 8](https://reader034.vdocument.in/reader034/viewer/2022050409/5f85f919b308e70bac00382b/html5/thumbnails/8.jpg)
Hierarchical RL
![Page 9](https://reader034.vdocument.in/reader034/viewer/2022050409/5f85f919b308e70bac00382b/html5/thumbnails/9.jpg)
Model-based video prediction³
³ Oh et al. Action-Conditional Video Prediction using Deep Networks in Atari Games, NIPS 2015
![Page 10](https://reader034.vdocument.in/reader034/viewer/2022050409/5f85f919b308e70bac00382b/html5/thumbnails/10.jpg)
Value Iteration Networks
Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, Pieter Abbeel
Best paper at NIPS 2016
![Page 11](https://reader034.vdocument.in/reader034/viewer/2022050409/5f85f919b308e70bac00382b/html5/thumbnails/11.jpg)
Value Iteration Networks: Motivation
![Page 12](https://reader034.vdocument.in/reader034/viewer/2022050409/5f85f919b308e70bac00382b/html5/thumbnails/12.jpg)
Value Iteration Networks: Network diagram
![Page 13](https://reader034.vdocument.in/reader034/viewer/2022050409/5f85f919b308e70bac00382b/html5/thumbnails/13.jpg)
Value Iteration Networks: Bellman Operator as a CNN forward pass
Bellman Operator
Every $V$ converges to $V^*$ under the Bellman operator $T$, defined as:
$$(TV)(s) = \max_{a \in A} \Big\{ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V(s') \Big\} \tag{1}$$
$$V^* = \lim_{n \to \infty} T^n V \quad \forall V \tag{2}$$
The Bellman operator can be viewed as a CNN forward pass:
$$(T^* V)(s) = \max_{a \in A} \Big\{ R(s, a) + \gamma\, \mathrm{conv}_{P(a)}(V(s)) \Big\}$$
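As a minimal sketch of equations (1) and (2), the snippet below applies the Bellman operator repeatedly to two different starting value functions and checks that both reach the same fixed point. The three-state chain MDP, the array layout (`P[a, s, s2]`), and the function names are assumptions made for illustration; they do not come from the slides.

```python
import numpy as np

def bellman_operator(V, R, P, gamma):
    """One application of (TV)(s) = max_a { R(s,a) + gamma * sum_s' P(s'|s,a) V(s') }.

    V: shape (S,); R: shape (S, A); P: shape (A, S, S) with P[a, s, s2] = P(s2 | s, a).
    """
    Q = R + gamma * np.einsum("ast,t->sa", P, V)  # Q[s, a]
    return Q.max(axis=1)

def value_iteration(R, P, gamma, V0, n_iters=500):
    """Approximate V* = lim T^n V by applying T a fixed number of times."""
    V = np.asarray(V0, dtype=float)
    for _ in range(n_iters):
        V = bellman_operator(V, R, P, gamma)
    return V

# Hypothetical toy MDP: a 3-state chain with actions 0 = left, 1 = right.
S, A, gamma = 3, 2, 0.9
P = np.zeros((A, S, S))
for s in range(S):
    P[0, s, max(s - 1, 0)] = 1.0      # move left, blocked at the left end
    P[1, s, min(s + 1, S - 1)] = 1.0  # move right, blocked at the right end
R = np.zeros((S, A))
R[2, 1] = 1.0  # reward for pushing right while in the rightmost state

V_from_zeros = value_iteration(R, P, gamma, V0=np.zeros(S))
V_from_other = value_iteration(R, P, gamma, V0=np.array([10.0, -5.0, 3.0]))
```

Both runs end at approximately `[8.1, 9.0, 10.0]`, which illustrates the claim that every starting $V$ converges to the same $V^*$.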
![Page 14](https://reader034.vdocument.in/reader034/viewer/2022050409/5f85f919b308e70bac00382b/html5/thumbnails/14.jpg)
Value Iteration Networks: VI Module
![Page 15](https://reader034.vdocument.in/reader034/viewer/2022050409/5f85f919b308e70bac00382b/html5/thumbnails/15.jpg)
Value Iteration Networks: Results
![Page 16](https://reader034.vdocument.in/reader034/viewer/2022050409/5f85f919b308e70bac00382b/html5/thumbnails/16.jpg)
Value Iteration Networks: Summary
- Neural network architecture that plans using value iteration
- Assumes that the state is a sufficient statistic for the reward function
- The small MDP must have a finite state space
- Uses prior knowledge about the environment's structure
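The idea of planning by value iteration inside a network can be sketched as a recurrence of "convolve, then max over action channels" on a grid. The version below is only an illustration under strong assumptions: the 3x3 kernels are hand-set to deterministic four-neighbour moves rather than learned, the reward depends only on the state, and `correlate2d_same` and `vi_module` are hypothetical helper names, not functions from the paper's code.

```python
import numpy as np

def correlate2d_same(X, K):
    """Same-size 2D cross-correlation of X with a 3x3 kernel K, using edge padding."""
    H, W = X.shape
    Xp = np.pad(X, 1, mode="edge")
    out = np.zeros((H, W))
    for di in range(3):
        for dj in range(3):
            out += K[di, dj] * Xp[di:di + H, dj:dj + W]
    return out

def vi_module(R, kernels, gamma, n_iters):
    """Repeat Q_a = R + gamma * conv_a(V), V = max_a Q_a for n_iters steps."""
    V = np.zeros_like(R, dtype=float)
    for _ in range(n_iters):
        Q = np.stack([R + gamma * correlate2d_same(V, K) for K in kernels])
        V = Q.max(axis=0)  # max over the action channel
    return V, Q

def one_hot_kernel(i, j):
    K = np.zeros((3, 3))
    K[i, j] = 1.0
    return K

# Assumed transition kernels: each action reads the value of one neighbouring cell.
kernels = [one_hot_kernel(0, 1),  # up
           one_hot_kernel(2, 1),  # down
           one_hot_kernel(1, 0),  # left
           one_hot_kernel(1, 2)]  # right

# Hypothetical 4x4 grid with reward 1 per step spent in the bottom-right cell.
gamma = 0.9
R = np.zeros((4, 4))
R[3, 3] = 1.0
V, Q = vi_module(R, kernels, gamma, n_iters=200)
```

With edge padding, moving into a wall leaves the agent in place, so the goal cell's value approaches 1 / (1 - 0.9) = 10 and every other cell's value approaches 10 times 0.9 raised to its Manhattan distance from the goal. The finite grid is what makes this convolutional form possible, which matches the finite-state-space restriction listed above.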
![Page 17](https://reader034.vdocument.in/reader034/viewer/2022050409/5f85f919b308e70bac00382b/html5/thumbnails/17.jpg)
The Predictron: End-to-End Learning and Planning
David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, Thomas Degris
Under review for ICLR 2017
![Page 18](https://reader034.vdocument.in/reader034/viewer/2022050409/5f85f919b308e70bac00382b/html5/thumbnails/18.jpg)
Predictron
![Page 19](https://reader034.vdocument.in/reader034/viewer/2022050409/5f85f919b308e70bac00382b/html5/thumbnails/19.jpg)
Predictron
![Page 20](https://reader034.vdocument.in/reader034/viewer/2022050409/5f85f919b308e70bac00382b/html5/thumbnails/20.jpg)
Predictron
![Page 21](https://reader034.vdocument.in/reader034/viewer/2022050409/5f85f919b308e70bac00382b/html5/thumbnails/21.jpg)
Predictron
![Page 22](https://reader034.vdocument.in/reader034/viewer/2022050409/5f85f919b308e70bac00382b/html5/thumbnails/22.jpg)
Predictron
![Page 23](https://reader034.vdocument.in/reader034/viewer/2022050409/5f85f919b308e70bac00382b/html5/thumbnails/23.jpg)
Predictron
![Page 24](https://reader034.vdocument.in/reader034/viewer/2022050409/5f85f919b308e70bac00382b/html5/thumbnails/24.jpg)
Predictron
![Page 25](https://reader034.vdocument.in/reader034/viewer/2022050409/5f85f919b308e70bac00382b/html5/thumbnails/25.jpg)
Predictron: Summary
- End-to-end simulation of an MRP
- Works with arbitrary state space for small MRP
- Small MRP has a non-interpretable state space
- Designed for RL, but does not take actions into account
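To make "end-to-end simulation of an MRP" concrete, here is a rough sketch of rolling an abstract, action-free model forward and accumulating its predicted rewards into value estimates. The k-step and λ-weighted accumulations follow my reading of the Predictron paper and are not spelled out on these slides; the random linear maps, the names `rollout`, `k_step_preturns`, and `lambda_preturn`, and all sizes are illustrative assumptions, with no training involved.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rollout(s0, params, K):
    """Unroll an assumed abstract MRP for K steps; note that no actions are involved."""
    W, w_r, w_g, w_l, w_v = params
    s = s0
    r, gamma, lam, v = [], [], [], [w_v @ s]
    for _ in range(K):
        r.append(w_r @ s)               # reward r^{k+1}
        gamma.append(sigmoid(w_g @ s))  # discount gamma^{k+1} in (0, 1)
        lam.append(sigmoid(w_l @ s))    # mixing weight lambda^k in (0, 1)
        s = np.tanh(W @ s)              # next abstract state s^{k+1}
        v.append(w_v @ s)               # value estimate v^{k+1}
    return np.array(r), np.array(gamma), np.array(lam), np.array(v)

def k_step_preturns(r, gamma, v):
    """g^k = r^1 + gamma^1 (r^2 + ... + gamma^k v^k) for k = 0..K."""
    K = len(r)
    g = np.empty(K + 1)
    for k in range(K + 1):
        acc = v[k]
        for j in reversed(range(k)):
            acc = r[j] + gamma[j] * acc
        g[k] = acc
    return g

def lambda_preturn(r, gamma, v, lam):
    """Backward recursion g = (1 - lam^k) v^k + lam^k (r^{k+1} + gamma^{k+1} g), from g = v^K."""
    g = v[-1]
    for k in reversed(range(len(r))):
        g = (1.0 - lam[k]) * v[k] + lam[k] * (r[k] + gamma[k] * g)
    return g

# Hypothetical random parameters and initial abstract state.
rng = np.random.default_rng(0)
d, K = 4, 3
params = (rng.normal(size=(d, d)), rng.normal(size=d), rng.normal(size=d),
          rng.normal(size=d), rng.normal(size=d))
r, gamma, lam, v = rollout(rng.normal(size=d), params, K)
g_k = k_step_preturns(r, gamma, v)
g_lam = lambda_preturn(r, gamma, v, lam)
```

Setting every λ to 1 recovers the full K-step estimate, and setting the first λ to 0 falls back to the immediate value estimate `v[0]`. The abstract state `s` is just a vector with no required correspondence to environment states, which is the "non-interpretable state space" point above.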
![Page 26](https://reader034.vdocument.in/reader034/viewer/2022050409/5f85f919b308e70bac00382b/html5/thumbnails/26.jpg)
Summary
- Hierarchical Reinforcement Learning attempts to identify crucial decisions
- Agents can now use NN-based planning for better decision making
- Research directions
  - Theoretical bounds for the optimal policy of a smaller MDP
  - Learning a smaller MDP with abstract actions
  - End-to-end planning-based Q-network or policy network
![Page 27](https://reader034.vdocument.in/reader034/viewer/2022050409/5f85f919b308e70bac00382b/html5/thumbnails/27.jpg)
Thank You