
Page 1: Introduction to Deep Reinforcement Learning Model-based

Introduction to Deep Reinforcement Learning

Model-based Methods

Zhiao Huang

Page 2: Introduction to Deep Reinforcement Learning Model-based

Model-free Methods

• The model 𝑝(𝑠′|𝑠, 𝑎) is unknown

• We solve 𝑄, 𝑉, and the policy from sampled trajectories/transitions

Page 3: Introduction to Deep Reinforcement Learning Model-based

Model-based Method

• Learn the environment model directly

𝑝(𝑠′|𝑠, 𝑎)

• Learn 𝑅(𝑠, 𝑎, 𝑠′) if it’s unknown

Page 4: Introduction to Deep Reinforcement Learning Model-based

Model-based Method

• Learn the environment model directly by supervised learning

𝑝(𝑠′|𝑠, 𝑎)

• Search the solution with the model directly

Page 5: Introduction to Deep Reinforcement Learning Model-based

Model-based Methods

Model: Physics, Geometry, Probability model, Inverse Dynamics, Game Engine, …

Search: MCTS, CEM, RL, iLQR, RRT/PRM, …

• Model and search have broad meanings

Page 6: Introduction to Deep Reinforcement Learning Model-based

Model-Predictive Control

• Forward model with parameters θ:

f_θ(s, a) : S × A → S

• Predicts what will happen if we execute the action a at the state s

Page 7: Introduction to Deep Reinforcement Learning Model-based

Forward Model

• We may have a forward model in mind

Page 8: Introduction to Deep Reinforcement Learning Model-based

Learning the Forward Model

• Sample transitions (s, a, s′) from the replay buffer and train the model with supervised learning:

min_θ E‖f_θ(s, a) − s′‖²
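As a concrete illustration, here is a minimal PyTorch sketch of this supervised step; the ForwardModel architecture and the replay_buffer.sample(batch_size) API (returning state, action, and next-state tensors) are assumptions, not something specified on the slides.

import torch
import torch.nn as nn

# Hypothetical forward model f_theta(s, a): a small MLP on the concatenation [s, a].
class ForwardModel(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def train_forward_model(model, replay_buffer, steps=1000, batch_size=256, lr=1e-3):
    # min_theta E || f_theta(s, a) - s' ||^2 over sampled transitions.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        s, a, s_next = replay_buffer.sample(batch_size)  # assumed buffer API
        loss = ((model(s, a) - s_next) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()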

Page 9: Introduction to Deep Reinforcement Learning Model-based

Rollout

• Predict a short trajectory s_1, s_2, …, s_T if we start at s_0 and execute a_0, a_1, …, a_{T−1} sequentially:

s_0 → s_1 → s_2 → ⋯ → s_T,   where s_{i+1} = f_θ(s_i, a_i)

“rollout” the forward model
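A tiny sketch of this rollout, where f is the learned forward model (e.g. the ForwardModel above) and actions is the list [a_0, …, a_{T−1}]:

def rollout(f, s0, actions):
    # Apply the learned model repeatedly: s_{i+1} = f(s_i, a_i).
    states = [s0]
    for a in actions:
        states.append(f(states[-1], a))
    return states  # [s_0, s_1, ..., s_T]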

Page 10: Introduction to Deep Reinforcement Learning Model-based

Model-Predictive Control

• Given the forward model f_θ and the current state s_0, find a sequence of actions a_0, a_1, …, a_{T−1} that has the maximum total reward:

max_{a_0:T−1} Σ_{i=0}^{T−1} R(s_i, a_i, s_{i+1})   s.t.   s_{i+1} = f_θ(s_i, a_i)

Page 11: Introduction to Deep Reinforcement Learning Model-based

Random Shooting

• Sample N random action sequences:

a_0^1, a_1^1, a_2^1, …, a_{T−1}^1
a_0^2, a_1^2, a_2^2, …, a_{T−1}^2
⋮
a_0^N, a_1^N, a_2^N, …, a_{T−1}^N

Page 12: Introduction to Deep Reinforcement Learning Model-based

Random Shooting

• Evaluate the reward of each action sequence by simulating the model f_θ

“rollout” the forward model f_θ(s, a): for each sequence k = 1, …, N, start from s_0, apply a_0^k, …, a_{T−1}^k to obtain s_1^k, …, s_T^k, and accumulate the total reward R^k

Page 13: Introduction to Deep Reinforcement Learning Model-based

Random Shooting

• Return the best action sequence a_{0:T−1}^* and execute it in the real environment

(the same N model rollouts as on the previous slide, each with its total reward R^1, …, R^N; the highest-reward sequence is selected)

Page 14: Introduction to Deep Reinforcement Learning Model-based

Planning at Each Step

• The action sequence a_1, a_2, …, a_{T−1} maximizes the reward starting from s_1′ = f_θ(s_0, a_0), not from the real next state s_1

• Search new actions for state s_1 again!

If we execute the searched action sequence a_0, a_1, a_2, … in the environment:
Model: s_0 → s_1′    Reality: s_0 → s_1

Page 15: Introduction to Deep Reinforcement Learning Model-based

Model Predictive Control

• Repeat:
  • Observe the current state s
  • Sample N random action trajectories
  • Evaluate the reward of each action sequence from s with the model f_θ; find the best action sequence {a_0^k, a_1^k, …, a_{T−1}^k}
  • Execute a_0^k in the environment
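A minimal Python sketch of this loop. The learned model f(s, a), the reward R(s, a, s′), and a gym-style env with a box action space are assumptions; only the first action of the best sampled sequence is executed before re-planning.

import numpy as np

def random_shooting(f, R, s, horizon, n_seq, action_low, action_high, action_dim):
    # Sample n_seq random action sequences, roll each out with the model f,
    # score it with R, and return the first action of the best sequence.
    best_return, best_action = -np.inf, None
    for _ in range(n_seq):
        actions = np.random.uniform(action_low, action_high, size=(horizon, action_dim))
        total, state = 0.0, s
        for a in actions:
            next_state = f(state, a)
            total += R(state, a, next_state)
            state = next_state
        if total > best_return:
            best_return, best_action = total, actions[0]
    return best_action

def mpc_random_shooting(env, f, R, horizon=20, n_seq=1000):
    # Re-plan at every step and execute only the first action of the best plan.
    s, done = env.reset(), False
    while not done:
        a = random_shooting(f, R, s, horizon, n_seq,
                            env.action_space.low, env.action_space.high,
                            env.action_space.shape[0])
        s, reward, done, info = env.step(a)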

Page 16: Introduction to Deep Reinforcement Learning Model-based

Cross-Entropy Method

• Black-box function optimization:

max_x f(x)

• Besides random shooting, we have many other choices

• Cross-Entropy Method (CEM)

Page 17: Introduction to Deep Reinforcement Learning Model-based

Cross-Entropy Method

• Basically the simplest evolutionary algorithm

• Maintain a distribution over candidate solutions

Page 18: Introduction to Deep Reinforcement Learning Model-based

Cross-Entropy Method

• Initialize μ ∈ ℝ^d, σ ∈ ℝ^d_{>0}

• For iteration = 1, 2, …
  • Sample n candidates x_i ~ N(μ, diag(σ²))
  • Evaluate the value f(x_i) of each x_i
  • Select the top k of the x_i as elites
  • Fit a new diagonal Gaussian to the elite samples and update μ, σ

Page 19: Introduction to Deep Reinforcement Learning Model-based

Cross-Entropy Method (in Python)
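The Python code on this slide is not preserved in the transcript; below is a minimal sketch that matches the algorithm on the previous page, where f is the black-box objective being maximized.

import numpy as np

def cem(f, mu, sigma, n_iter=10, n_pop=100, n_elite=10):
    # mu, sigma: initial mean and std of a diagonal Gaussian over solutions.
    mu, sigma = np.array(mu, dtype=float), np.array(sigma, dtype=float)
    for _ in range(n_iter):
        # Sample n_pop candidates x_i ~ N(mu, diag(sigma^2)).
        samples = mu + sigma * np.random.randn(n_pop, mu.size)
        values = np.array([f(x) for x in samples])
        # Keep the top n_elite candidates and refit the Gaussian to them.
        elites = samples[np.argsort(values)[-n_elite:]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0)
    return mu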

Page 20: Introduction to Deep Reinforcement Learning Model-based

Model Predictive Control

• Hyper-parameters: μ_a, σ_a, n_iter, n_pop, n_elite

• Initialize an action sequence μ = {a_i = μ_a}_{i<T}

• Repeat:
  • Observe the current state s
  • Search a new action sequence with CEM: {a_0′, a_1′, …, a_{T−1}′} = CEM(μ, {σ_a}_{i<T})
  • Execute a_0′ in the environment
  • Update μ ← {a_1′, a_2′, …, a_{T−1}′, μ_a}
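A sketch of this loop built on the cem function above. It keeps the warm start from the slide: after executing a_0′, the remaining plan is shifted forward one step and padded with μ_a. The env, the learned model f, the reward R, and the horizon/action dimensions are assumptions.

import numpy as np

def cem_mpc(env, f, R, T=20, action_dim=2, mu_a=0.0, sigma_a=0.5):
    # env, f (learned model), and R (reward) are assumed to be given.
    def plan_return(s, flat_actions):
        total, state = 0.0, s
        for a in flat_actions.reshape(T, action_dim):
            next_state = f(state, a)
            total += R(state, a, next_state)
            state = next_state
        return total

    mu = np.full((T, action_dim), mu_a)  # initial action sequence {a_i = mu_a}
    s, done = env.reset(), False
    while not done:
        # Search a new action sequence around the warm-started mean with CEM.
        plan = cem(lambda x: plan_return(s, x),
                   mu.ravel(), np.full(T * action_dim, sigma_a)).reshape(T, action_dim)
        s, reward, done, info = env.step(plan[0])  # execute a_0' in the environment
        # Warm start: drop a_0', shift the rest forward, and pad with mu_a.
        mu = np.concatenate([plan[1:], np.full((1, action_dim), mu_a)], axis=0)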


Page 22: Introduction to Deep Reinforcement Learning Model-based

Notes

• CEM performs well for most control tasks

• Instead of searching for an action sequence, we can also search for the parameters of a policy network

• CEM works as a general, gradient-free “gradient descent”

• CEM and random shooting work poorly for very long horizons T or dense rewards

Page 23: Introduction to Deep Reinforcement Learning Model-based

Performance of CEM

• When the model is known

Page 24: Introduction to Deep Reinforcement Learning Model-based

Comparisons

• On-policy methods: Policy

• Off-policy methods: Value/Q

• Model-based methods: Model

Model ⇒ Value ⇒ Policy

Page 25: Introduction to Deep Reinforcement Learning Model-based

Comparisons

• How difficult it is to model

• Q Value > Policy

• Model depends on the priors

• Robustness

• Model < Q Value < Policy

• Time complexity

• Model > Q/Policy

• Data efficiency / Generalization

• Model > Q Value > Policy

Page 26: Introduction to Deep Reinforcement Learning Model-based

Conclusion

• Very little data / we know the model well

• Model-based methods

• We can’t model the environment and we don’t want to sample too much

• Off-policy methods

• We have enough time/money

• Off-policy + On-policy methods

Page 27: Introduction to Deep Reinforcement Learning Model-based

Sample-based Motion Planning

Page 28: Introduction to Deep Reinforcement Learning Model-based

Topics

• Problem formulation

• Probabilistic roadmap method (PRM)

• Rapidly exploring random trees (RRT)

Page 29: Introduction to Deep Reinforcement Learning Model-based

Topics

• Problem formulation

• Probabilistic roadmap method (PRM)

• Rapidly exploring random trees (RRT)

Page 30: Introduction to Deep Reinforcement Learning Model-based

Configuration Space

• Configuration space (C-space) C is a subset of ℝⁿ containing all possible states of the system (the state space in RL).

• C_free ⊆ C contains all valid states.

• C_obs ⊆ C represents obstacles.

• Examples:
  • All valid poses of a robot.
  • All valid joint values of a robot.
  • …

Page 31: Introduction to Deep Reinforcement Learning Model-based

Motion Planning

• Problem:
  • Given a configuration space C_free
  • Given a start state q_start and a goal state q_goal in C_free
  • Calculate a sequence of actions that leads from start to goal
• Challenges:
  • Need to avoid obstacles
  • Long planning horizon
  • High-dimensional planning space

Page 32: Introduction to Deep Reinforcement Learning Model-based

Motion Planning

LaValle, Steven M. Planning algorithms. Cambridge university press, 2006.

Page 34: Introduction to Deep Reinforcement Learning Model-based

Examples

Ratliff, N., Zucker, M., Bagnell, J. A., et al. "CHOMP: Gradient optimization techniques for efficient motion planning." ICRA 2009.
Schulman, John, et al. "Finding Locally Optimal, Collision-Free Trajectories with Sequential Convex Optimization." RSS 2013.

Page 35: Introduction to Deep Reinforcement Learning Model-based

Sample-based Algorithms

• The key idea is to randomly explore a smaller subset of possibilities without exhaustively exploring all of them.

• Pros:
  • Probabilistically complete
  • Can solve the problem after knowing only part of C_free
  • Applies easily to high-dimensional C-spaces
• Cons:
  • Requires finding a path between two close points
  • Does not work well when the connectivity of C_free is bad
  • Never optimal

Page 36: Introduction to Deep Reinforcement Learning Model-based

Topics

• Problem formulation

• Probabilistic roadmap method (PRM)

• Rapidly exploring random trees (RRT)

Page 37: Introduction to Deep Reinforcement Learning Model-based

Probabilistic Roadmap (PRM)

• The algorithm contains two stages:
  • Construction phase
    • Randomly sample states in C_free
    • Connect every sampled state to its neighbors
    • Connect the start and goal states to the graph
  • Query phase
    • Run a path-finding algorithm like Dijkstra

Kavraki, Lydia E., et al. "Probabilistic roadmaps for path planning in high-dimensional configuration spaces." IEEE transactions on Robotics and Automation 12.4 (1996): 566-580.
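A compact sketch of the two phases. The helpers sample_free() (uniform sampling in C_free) and collision_free(q1, q2) (straight-line edge check) are hypothetical placeholders for a real sampler and collision checker.

import heapq
import numpy as np

def build_prm(sample_free, collision_free, n_nodes=500, k=10):
    # Construction phase: sample n_nodes states in C_free and try to connect
    # each one to its k nearest neighbors with a straight-line edge.
    nodes = [sample_free() for _ in range(n_nodes)]
    edges = {i: [] for i in range(n_nodes)}
    for i, q in enumerate(nodes):
        dists = np.array([np.linalg.norm(q - p) for p in nodes])
        for j in np.argsort(dists)[1:k + 1]:  # skip index 0 (q itself)
            j = int(j)
            if collision_free(q, nodes[j]):
                edges[i].append((j, dists[j]))
                edges[j].append((i, dists[j]))
    # (q_start and q_goal would be connected to the graph in the same way.)
    return nodes, edges

def shortest_path(edges, start, goal):
    # Query phase: Dijkstra on the roadmap.
    dist, prev = {start: 0.0}, {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, w in edges[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    path, u = [goal], goal
    while u != start:
        u = prev[u]  # a KeyError here means the goal is unreachable
        path.append(u)
    return path[::-1]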

Page 38: Introduction to Deep Reinforcement Learning Model-based

Rejection Sampling

• Aim: sample uniformly in C_free.

• Method:
  • Sample uniformly over C.
  • Reject samples that are not in the feasible area (C_free).
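A sketch of rejection sampling, assuming sample_uniform() draws uniformly over C and in_free(q) tests whether a configuration lies in C_free (both placeholders):

def rejection_sample(sample_uniform, in_free):
    # Sample uniformly over C and reject until the sample lies in C_free.
    while True:
        q = sample_uniform()
        if in_free(q):
            return q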

Page 39: Introduction to Deep Reinforcement Learning Model-based

Pipeline

Page 40: Introduction to Deep Reinforcement Learning Model-based

Challenges

• Connecting neighboring points:
  • In general it requires solving a boundary value problem.
• Collision checking:
  • It takes a lot of time to check whether the edges stay inside the free configuration space.

Page 41: Introduction to Deep Reinforcement Learning Model-based

Example

• PRM generates a graph G = (V, E) such that every edge lies in the free configuration space, i.e., does not collide with obstacles.

Page 42: Introduction to Deep Reinforcement Learning Model-based

Example

• Find the path from the start state q_start to the goal state q_goal

Page 43: Introduction to Deep Reinforcement Learning Model-based

Limitations: Narrow Passages

• It is unlikely to sample points inside the narrow bridge

Page 44: Introduction to Deep Reinforcement Learning Model-based

Gaussian Sampling

• Generate one sample q_1 uniformly in the configuration space

• Generate another sample q_2 from a Gaussian distribution 𝒩(q_1, σ²)

• If q_1 ∈ C_free and q_2 ∉ C_free, then add q_1
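A sketch of this sampler with the same hypothetical sample_uniform()/in_free() helpers as above; it biases samples toward obstacle boundaries:

import numpy as np

def gaussian_sample(sample_uniform, in_free, sigma):
    # Keep q1 only when it is free but its Gaussian neighbor q2 is not,
    # which concentrates samples near the boundaries of obstacles.
    while True:
        q1 = sample_uniform()
        q2 = q1 + sigma * np.random.randn(*q1.shape)
        if in_free(q1) and not in_free(q2):
            return q1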

Page 45: Introduction to Deep Reinforcement Learning Model-based

Bridge Sampling

• Generate one sample q_1 uniformly in the configuration space

• Generate another sample q_2 from a Gaussian distribution 𝒩(q_1, σ²)

• Let q_3 = (q_1 + q_2) / 2

• If q_1 and q_2 are not in C_free, then add q_3

Sun, Zheng, et al. "Narrow passage sampling for probabilistic roadmap planning." IEEE Transactions on Robotics 21.6 (2005): 1105-1115.
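A sketch of bridge sampling under the same assumptions as the Gaussian sampler; the final q_3 ∈ C_free test is the usual bridge check, which the slide leaves implicit:

import numpy as np

def bridge_sample(sample_uniform, in_free, sigma):
    # Keep the midpoint q3 of two colliding endpoints q1 and q2,
    # which concentrates samples inside narrow passages.
    while True:
        q1 = sample_uniform()
        q2 = q1 + sigma * np.random.randn(*q1.shape)
        q3 = (q1 + q2) / 2.0
        if not in_free(q1) and not in_free(q2) and in_free(q3):
            return q3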

Page 46: Introduction to Deep Reinforcement Learning Model-based

Topics

• Problem formulation

• Probabilistic roadmap method (PRM)

• Rapidly exploring random trees (RRT)

Page 47: Introduction to Deep Reinforcement Learning Model-based

Rapidly-exploring Random Tree (RRT)

• RRT grows a tree rooted at the start state by using random samples from the configuration space.

• As each sample is drawn, a connection is attempted between it and the nearest state in the tree. If the connection lies entirely in the free configuration space, it adds a new state to the tree.
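A minimal sketch of this procedure. sample() (random configurations from C), collision_free(q1, q2), the step size eps, and the goal tolerance are assumptions:

import numpy as np

def rrt(q_start, q_goal, sample, collision_free, eps=0.1, goal_tol=0.1, max_iters=10000):
    # Grow a tree rooted at q_start; each node stores the index of its parent.
    nodes, parents = [np.asarray(q_start, dtype=float)], [None]
    for _ in range(max_iters):
        q_rand = sample()  # random configuration in C
        i_near = int(np.argmin([np.linalg.norm(q_rand - q) for q in nodes]))
        d = q_rand - nodes[i_near]
        q_new = nodes[i_near] + eps * d / (np.linalg.norm(d) + 1e-12)  # one eps-step
        if not collision_free(nodes[i_near], q_new):
            continue
        nodes.append(q_new)
        parents.append(i_near)
        if np.linalg.norm(q_new - q_goal) < goal_tol:  # reached the goal region
            path, i = [], len(nodes) - 1
            while i is not None:
                path.append(nodes[i])
                i = parents[i]
            return path[::-1]
    return None  # no path found within the budget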

Page 48: Introduction to Deep Reinforcement Learning Model-based

Extend operation

Page 49: Introduction to Deep Reinforcement Learning Model-based

Pipeline

Page 50: Introduction to Deep Reinforcement Learning Model-based

Challenges

• Find the nearest neighbor in the tree
  • We need to support fast online queries
  • Example: k-d trees

• Need to choose a good step size ε to expand the tree efficiently
  • Large ε: hard to generate new samples
  • Small ε: too many samples in the tree

Page 51: Introduction to Deep Reinforcement Learning Model-based

Examples

Page 52: Introduction to Deep Reinforcement Learning Model-based

RRT-Connect

• Grow two trees starting from q_start and q_goal respectively, instead of just one.

• Grow the trees towards each other rather than towards random configurations.

• Use stronger greediness by growing the tree with multiple ε-steps instead of a single one.

Kuffner, James J., and Steven M. LaValle. "RRT-connect: An efficient approach to single-query path planning." Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No. 00CH37065). Vol. 2. IEEE, 2000.
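A sketch of the extend/connect primitives this variant builds on, using the same hypothetical collision_free helper and ε-step as in the RRT sketch above. The outer loop (not shown) alternates the two trees: one extends toward a random sample, and connect is then called on the other tree toward the newly added node; when it succeeds, the two partial paths are joined.

import numpy as np

def extend(nodes, parents, q_target, collision_free, eps):
    # One extend step toward q_target; returns the index of the new node or None.
    i_near = int(np.argmin([np.linalg.norm(q_target - q) for q in nodes]))
    d = q_target - nodes[i_near]
    q_new = nodes[i_near] + eps * d / (np.linalg.norm(d) + 1e-12)
    if not collision_free(nodes[i_near], q_new):
        return None
    nodes.append(q_new)
    parents.append(i_near)
    return len(nodes) - 1

def connect(nodes, parents, q_target, collision_free, eps, tol):
    # Greedy variant: keep taking eps-steps toward q_target until blocked or reached.
    while True:
        i = extend(nodes, parents, q_target, collision_free, eps)
        if i is None:
            return False  # trapped
        if np.linalg.norm(nodes[i] - q_target) < tol:
            return True   # the two trees meet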

Page 53: Introduction to Deep Reinforcement Learning Model-based

A single RRT-Connect iteration

Page 54: Introduction to Deep Reinforcement Learning Model-based

One tree grown using a random target

Page 55: Introduction to Deep Reinforcement Learning Model-based

New node becomes target for the other tree

Page 56: Introduction to Deep Reinforcement Learning Model-based

Calculate nearest node to target

Page 57: Introduction to Deep Reinforcement Learning Model-based

Try to add the new node

Page 58: Introduction to Deep Reinforcement Learning Model-based

If successful, keep extending

Page 59: Introduction to Deep Reinforcement Learning Model-based

If successful, keep extending

Page 60: Introduction to Deep Reinforcement Learning Model-based

If successful, keep extending

Page 61: Introduction to Deep Reinforcement Learning Model-based

Path found if branch reaches target

Page 62: Introduction to Deep Reinforcement Learning Model-based

Local planners

• Both RRT and PRM rely on a local planner, which gives a sequence of actions to connect two nearby states.

• RL can provide local planners without solving the dynamics equations explicitly.

• RL alone cannot solve long-horizon, high-dimensional planning problems efficiently because of exploration.

Page 63: Introduction to Deep Reinforcement Learning Model-based

PRM + RL

• Sample some landmarks and build a graph in the configuration space, like PRM.
• Find a path by planning on the graph.
• The action sequence is found by an RL algorithm like DDPG.

Huang, Zhiao, Fangchen Liu, and Hao Su. "Mapping state space using landmarks for universal goal reaching." Advances in Neural Information Processing Systems. 2019.

Page 64: Introduction to Deep Reinforcement Learning Model-based

MoveIt

• MoveIt is a package on ROS
• Many useful motion planning algorithms can be found there

Page 65: Introduction to Deep Reinforcement Learning Model-based

Summary

• Motion planning
  • PRM
    • Sampling: Gaussian sampling, bridge sampling
  • RRT
    • RRT-Connect
• Motion planning + RL
  • PRM + RL