The Predictron: End-to-end Learning and Planning Yoonho Lee Department of Computer Science and Engineering Pohang University of Science and Technology December 27, 2016


Page 1: The Predictron: End-to-end Learning and Planning (mlg.postech.ac.kr/~readinglist/slides/20161227.pdf)

The Predictron: End-to-end Learning and Planning

Yoonho Lee

Department of Computer Science and Engineering, Pohang University of Science and Technology

December 27, 2016

Page 2

Hierarchical RL: Motivation

Page 3

Hierarchical RL

Page 4

Hierarchical RL

Page 5

Reward augmentation [1]

[1] Bellemare et al. Unifying Count-Based Exploration and Intrinsic Motivation, NIPS 2016

Page 6

Hierarchical RL

Page 7

Hierarchical actions [2]

[2] Florensa et al. Stochastic Neural Networks for Hierarchical Reinforcement Learning, under review for ICLR 2017

Page 8

Hierarchical RL

Page 9

Model-based video prediction [3]

[3] Oh et al. Action-Conditional Video Prediction using Deep Networks in Atari Games, NIPS 2015

Page 10

Value Iteration Networks

Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, Pieter Abbeel

Best paper at NIPS 2016

Page 11

Value Iteration Networks: Motivation

Page 12

Value Iteration Networks: Network diagram

Page 13

Value Iteration Networks: Bellman Operator as a CNN forward pass

Bellman Operator

Every $V$ converges to $V^*$ under the Bellman Operator $T$, defined as:

$$(TV)(s) = \max_{a \in A} \Big\{ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V(s') \Big\} \tag{1}$$

$$V^* = \lim_{n \to \infty} T^n V \quad \forall V \tag{2}$$
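Equations (1) and (2) can be checked numerically on a small tabular MDP. The following is a minimal sketch, not code from the talk: the 3-state, 2-action MDP is random and purely hypothetical, and the array layout (`R[s, a]`, `P[a, s, s2]`) is an assumption made for the example. It repeatedly applies $T$ from two different starting points and lands on the same fixed point.

```python
import numpy as np

def bellman_backup(V, R, P, gamma):
    """Apply (TV)(s) = max_a { R(s,a) + gamma * sum_s' P(s'|s,a) V(s') }.

    R has shape (S, A); P has shape (A, S, S) with P[a, s, s2] = P(s2 | s, a).
    """
    # Q[s, a] = R[s, a] + gamma * sum_s2 P[a, s, s2] * V[s2]
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    return Q.max(axis=1)

def value_iteration(V0, R, P, gamma, tol=1e-10, max_iters=10_000):
    """Iterate V <- TV until successive iterates differ by less than tol."""
    V = np.asarray(V0, dtype=float)
    for _ in range(max_iters):
        V_next = bellman_backup(V, R, P, gamma)
        if np.max(np.abs(V_next - V)) < tol:
            return V_next
        V = V_next
    return V

# Hypothetical MDP: random rewards and row-normalised random transitions.
rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
R = rng.uniform(size=(S, A))
P = rng.uniform(size=(A, S, S))
P /= P.sum(axis=2, keepdims=True)

# Two very different initial value functions.
V_from_zeros = value_iteration(np.zeros(S), R, P, gamma)
V_from_noise = value_iteration(rng.normal(scale=10.0, size=S), R, P, gamma)
```

Both runs agree up to the tolerance because $T$ is a contraction for $\gamma < 1$, which is what equation (2) states.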

The Bellman Operator can be viewed as a CNN forward pass:

$$(T^{*}V)(s) = \max_{a \in A} \Big\{ R(s, a) + \gamma\, \mathrm{conv}_{P(a)}(V(s)) \Big\}$$
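The convolutional view is easiest to see on a grid world where each action moves the agent to a neighbouring cell, so that $P(a)$ becomes a 3x3 kernel. The sketch below is an illustration under assumptions that are not in the slides: the kernels and reward map are set by hand (a Value Iteration Network would learn them), the grid size, goal cell and helper names `conv2d_same` and `vi_step` are hypothetical, and cells outside the grid are treated as having value zero.

```python
import numpy as np

def conv2d_same(V, kernel):
    """Cross-correlate a 2D value map with a 3x3 kernel, zero-padded to keep shape."""
    H, W = V.shape
    padded = np.pad(V, 1)  # zeros around the border
    out = np.zeros((H, W))
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * padded[di:di + H, dj:dj + W]
    return out

def vi_step(V, R, kernels, gamma):
    """One backup: max over actions of R[a] + gamma * conv_{P(a)}(V)."""
    Q = np.stack([R[a] + gamma * conv2d_same(V, kernels[a])
                  for a in range(len(kernels))])
    return Q.max(axis=0)  # max over the action channel

# Hypothetical 5x5 grid with reward 1 for being on the goal cell.
H, W, gamma = 5, 5, 0.9
goal = (1, 3)
r = np.zeros((H, W))
r[goal] = 1.0

# One kernel per action; the single 1 marks which neighbour the action leads to.
offsets = {"stay": (1, 1), "up": (0, 1), "down": (2, 1),
           "left": (1, 0), "right": (1, 2)}
kernels = []
for ki, kj in offsets.values():
    k = np.zeros((3, 3))
    k[ki, kj] = 1.0
    kernels.append(k)
R = [r] * len(kernels)  # reward depends only on the current cell here

V = np.zeros((H, W))
for _ in range(300):  # 300 stacked backups, i.e. 300 conv + max layers
    V = vi_step(V, R, kernels, gamma)
```

With this reward the best plan is to walk to the goal and stay there, so the converged map is $\gamma^{d} / (1 - \gamma)$, where $d$ is the Manhattan distance to the goal.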

Page 14

Value Iteration Networks: VI Module

Page 15

Value Iteration Networks: Results

Page 16

Value Iteration Networks: Summary

- Neural network architecture that plans using value iteration
- Assumes that the state is a sufficient statistic for the reward function
- The small MDP must have a finite state space
- Uses prior knowledge about the environment's structure

Page 17

The Predictron: End-to-End Learning and Planning

David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, Thomas Degris

Under review for ICLR 2017

Page 18

Predictron

Page 19

Predictron

Page 20

Predictron

Page 21

Predictron

Page 22

Predictron

Page 23

Predictron

Page 24

Predictron

Page 25

Predictron: Summary

- End-to-end simulation of an MRP
- Works with arbitrary state space for small MRP
- Small MRP has a non-interpretable state space
- Designed for RL, but does not take actions into account
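The points above can be made concrete with a toy rollout of an internal MRP. This is a rough sketch under stated assumptions, not the authors' implementation: the surviving slide text gives no equations, so the accumulation rule follows the k-step "preturn" described in the Predictron paper, $g^k = r^1 + \gamma^1 (r^2 + \gamma^2 (\dots + \gamma^k v^k))$, and every parameter below (`W_state`, `w_reward`, `w_discount`, `w_value`, the sizes `D` and `K`) is a random, untrained, hypothetical stand-in for a learned network. Note that no action enters the rollout anywhere.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4  # size of the abstract, non-interpretable state (hypothetical)
K = 3  # number of internal simulation steps (hypothetical)

# Random parameters standing in for learned networks.
W_state = rng.normal(scale=0.5, size=(D, D))
w_reward = rng.normal(size=D)
w_discount = rng.normal(size=D)
w_value = rng.normal(size=D)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def core(s):
    """One step of the internal MRP: next state, reward, discount in (0, 1)."""
    s_next = np.tanh(W_state @ s)
    return s_next, float(w_reward @ s), float(sigmoid(w_discount @ s))

def value(s):
    """Value estimate for an abstract state."""
    return float(w_value @ s)

def preturns(s0, num_steps):
    """Return [g^0, ..., g^K]; g^k bootstraps from the value after k steps."""
    s = s0
    reward_sum = 0.0     # r^1 + gamma^1 r^2 + ... accumulated so far
    discount_prod = 1.0  # gamma^1 * ... * gamma^k
    out = [value(s)]     # g^0 is just the value of the initial state
    for _ in range(num_steps):
        s_next, r, g = core(s)
        reward_sum += discount_prod * r
        discount_prod *= g
        s = s_next
        out.append(reward_sum + discount_prod * value(s))
    return out

s0 = rng.normal(size=D)  # stands in for an encoded observation
gs = preturns(s0, K)
```

In training, each of these preturns would be regressed towards an observed return so that the whole simulation is learned end to end; that step is left out here.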

Page 26

Summary

- Hierarchical Reinforcement Learning attempts to identify crucial decisions
- Agents can now use NN-based planning for better decision making
- Research directions
  - Theoretical bounds for optimal policy of a smaller MDP
  - Learning a smaller MDP with abstract actions
  - End-to-end planning-based Q network or policy network

Page 27

Thank You
