Maximum Margin Planning
Presenters:
Ashesh Jain, Michael Hu
CS6784 Class Presentation
Paper by Nathan Ratliff, Drew Bagnell, and Martin Zinkevich
Theme
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
   – Reward-based learning
Inverse Reinforcement Learning
Motivating Example (Abbeel et al.)
Outline
Basics of Reinforcement Learning
– State and action space
– Activity
Introducing Inverse RL
– Max-margin formulation
– Optimization algorithms
– Activity
Conclusion
Reward-based Learning
Markov Decision Process (MDP)
An MDP is defined by the tuple $(\mathcal{X}, \mathcal{A}, T, R)$:
– $\mathcal{X}$: state space
– $\mathcal{A}$: action space
– $T$: dynamics model $T(x' \mid x, a)$
– $R$: reward function, $R(x)$ or $R(x, a)$
[Figure: grid-world example — an action moves as intended with probability 0.7 and slips to one of three neighboring cells with probability 0.1 each; the goal cell has reward 100, every other cell reward −1.]
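As a concrete illustration, here is a minimal sketch of this tuple in Python for a two-cell stand-in for the grid world above (all names and numbers are illustrative, not from the paper):

```python
from typing import Dict, List, NamedTuple, Tuple

State, Action = int, str

class MDP(NamedTuple):
    states: List[State]                                 # X: state space
    actions: List[Action]                               # A: action space
    T: Dict[Tuple[State, Action], Dict[State, float]]   # T(x' | x, a)
    R: Dict[State, float]                               # R(x)

# Two cells: "right" from cell 0 reaches the goal (cell 1) with
# probability 0.7 and slips back with probability 0.3, a stand-in
# for the 0.7 / 0.1 / 0.1 / 0.1 dynamics in the figure.
mdp = MDP(
    states=[0, 1],
    actions=["right", "stay"],
    T={(0, "right"): {1: 0.7, 0: 0.3},
       (0, "stay"):  {0: 1.0},
       (1, "right"): {1: 1.0},
       (1, "stay"):  {1: 1.0}},
    R={0: -1.0, 1: 100.0},   # goal reward 100, every other cell -1
)
```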
Policy
A policy maps states to actions: $\pi : \mathcal{X} \to \mathcal{A}$
[Figure: an example policy on the grid world.]
Goal: learn the optimal policy $\pi : \mathcal{X} \to \mathcal{A}$

$$\pi^* = \arg\max_{\pi} \; \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t R\big(x_t, \pi(x_t)\big)\right]$$

[Figure: graphical representation of the MDP — state $x_t$ transitions to $x_{t+1}$ under action $a_t$, yielding reward $r_t$: $(x_t, R(x_t)) \to (x_{t+1}, R(x_{t+1})) \to \cdots$]
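When the reward is known, $\pi^*$ can be computed with standard value iteration (textbook RL, not a contribution of this paper). A self-contained sketch on the two-cell MDP above, with illustrative names:

```python
# Value iteration: V(x) <- max_a [ R(x) + gamma * sum_x' T(x'|x,a) V(x') ],
# then pi*(x) is the greedy action under the converged V.
gamma = 0.9
states, actions = [0, 1], ["right", "stay"]
T = {(0, "right"): {1: 0.7, 0: 0.3}, (0, "stay"): {0: 1.0},
     (1, "right"): {1: 1.0},         (1, "stay"): {1: 1.0}}
R = {0: -1.0, 1: 100.0}

def q(x, a, V):
    return R[x] + gamma * sum(p * V[x2] for x2, p in T[(x, a)].items())

V = {x: 0.0 for x in states}
for _ in range(200):                       # Bellman backups to convergence
    V = {x: max(q(x, a, V) for a in actions) for x in states}

pi = {x: max(actions, key=lambda a: q(x, a, V)) for x in states}
print(V, pi)                               # pi[0] == "right": head for the goal
```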
Activity – Very Easy!!!
Draw the optimal policy! (Deterministic dynamics.)
[Figure: grid world with reward 1 at the goal cell and 0 elsewhere.]
Activity – Solution
[Figure: the solution policy.]
Outline
Basics of Reinforcement Learning
– State and action space
– Activity
Introducing Inverse RL
– Max-margin formulation
– Optimization algorithms
– Activities
Conclusion
RL to Inverse-RL (IRL)
Forward RL: given the state space $\mathcal{X}$, action space $\mathcal{A}$, dynamics model $T(x' \mid x, a)$, and reward function $R(x, a)$, an RL algorithm outputs the optimal policy $\pi^*: x \to a$.
Inverse problem: given the optimal policy $\pi^*$, or samples drawn from it, recover the reward $R(x, a)$.
Why Inverse-RL?
Specifying $R(x, a)$ is hard, but samples from $\pi^*$ are easily available (Abbeel et al., Ratliff et al., Kober et al.).
IRL framework
The expert $\pi^*: x \to a$ interacts with the environment and produces a demonstration
$$y^* = (x_1, a_1) \to (x_2, a_2) \to (x_3, a_3) \to \cdots \to (x_n, a_n)$$
The reward of the demonstration decomposes into a sum of per-step linear terms:
$$R(y^*) = \sum_{t=1}^{n} w^\top f(x_t, a_t)$$
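This per-step decomposition is easy to verify numerically. A small sketch with hypothetical features (random placeholders), showing that summing rewards step by step equals applying $w$ once to the accumulated feature counts — exactly the $f_i(y) = F_i \mu$ decomposition used on the next slides:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3                                       # feature dimension (illustrative)
w = np.array([1.0, -2.0, 0.5])              # reward weights (illustrative)

# One feature vector f(x_t, a_t) per step of the demonstration y*.
path_features = [rng.standard_normal(d) for _ in range(5)]

# Reward of the path: sum of per-step linear rewards w^T f(x_t, a_t).
reward_stepwise = sum(w @ f for f in path_features)

# Equivalent: accumulate the features first, then apply w once.
reward_counts = w @ np.sum(path_features, axis=0)

assert np.isclose(reward_stepwise, reward_counts)
```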
Paper contributions
1. Formulated IRL as structured prediction
   – Maximum margin approach
2. Two optimization algorithms
   – Problem-specific solver
   – Sub-gradient method
3. Robotic applications: 2D navigation etc.
Outline
Basics of Reinforcement Learning
– State and action space
– Activity
Introducing Inverse RL
– Max-margin formulation
– Optimization algorithms
– Activities
Conclusion
Formulation
Input: $\{(\mathcal{X}_i, \mathcal{A}_i, T_i, f_i, y_i, \mathcal{L}_i)\}_{i=1}^{n}$
– $\mathcal{X}_i$: state space
– $\mathcal{A}_i$: action space
– $T_i$: dynamics model
– $y_i$: expert demonstration
– $f_i(\cdot)$: feature function
– $\mathcal{L}_i(\cdot, y_i)$: loss function
[Figure: the grid-world MDP with stochastic dynamics (0.7/0.1/0.1/0.1) as a running example.]
Max-margin formulation

$$\min_{w,\,\zeta_i} \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_i \beta_i \zeta_i^q$$
$$\text{s.t. } \forall i:\;\; w^\top f_i(y_i) \;\ge\; \max_{y \in \mathcal{Y}_i} \big( w^\top f_i(y) + \mathcal{L}(y, y_i) \big) - \zeta_i$$

Features linearly decompose over the path: $f_i(y) = F_i \mu$, where $F_i$ is the feature map (one feature column per state-action pair) and $\mu \in \mathbb{R}^{|\mathcal{X}_i||\mathcal{A}_i|}$ holds the state-action frequency counts.
Max-margin formulation (in frequency counts)

$$\min_{w,\,\zeta_i} \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_i \beta_i \zeta_i^q$$
$$\text{s.t. } \forall i:\;\; w^\top F_i \mu_i \;\ge\; \max_{\mu \in \mathcal{G}_i} \big( w^\top F_i \mu + l_i^\top \mu \big) - \zeta_i$$

where $\mathcal{G}_i$ is the set of frequency counts $\mu$ satisfying the Bellman-flow constraints:

$$\forall x':\;\; \underbrace{\sum_{x,a} \mu_{x,a}\, p(x' \mid x, a) + s_{x'}}_{\text{inflow}} \;=\; \underbrace{\sum_a \mu_{x',a}}_{\text{outflow}}$$
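In words: at every state, visitation inflow plus the source distribution must equal outflow. A small checker sketch under the slide's notation (function and variable names are illustrative):

```python
def satisfies_bellman_flow(mu, p, s, states, actions, tol=1e-8):
    """Check that mu lies in G: for every state x',
    sum_{x,a} mu[x,a] * p(x'|x,a) + s[x']  ==  sum_a mu[x',a].
    mu is a dict keyed by (state, action); p maps (x, a) to
    a dict {x': probability}; s maps states to source mass."""
    for x2 in states:
        inflow = sum(mu[x, a] * p[(x, a)].get(x2, 0.0)
                     for x in states for a in actions)
        outflow = sum(mu[x2, a] for a in actions)
        if abs(inflow + s[x2] - outflow) > tol:
            return False
    return True
```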
Outline
Basics of Reinforcement Learning
– State and action space
– Activity
Introducing Inverse RL
– Max-margin formulation
– Optimization algorithms
  • Problem-specific QP-formulation
  • Sub-gradient method
– Activities
Conclusion
Problem-specific QP-formulation

Replace the inner maximization by its LP dual, introducing value variables $v_i$ (one per state); the Bellman-flow polytope turns into Bellman inequalities:

$$\min_{w,\,\zeta_i} \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_i \beta_i \zeta_i^q$$
$$\text{s.t. } \forall i:\;\; w^\top F_i \mu_i \;\ge\; s_i^\top v_i - \zeta_i$$
$$\forall i, x, a:\;\; v_i^{x} \;\ge\; w^\top F_i^{x,a} + l_i^{x,a} + \sum_{x'} p(x' \mid x, a)\, v_i^{x'}$$

Can be optimized using off-the-shelf QP solvers.
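A sketch of this QP for a single toy training instance using cvxpy. All data are random placeholders; the per-action blocking of $F$ and $l$, the uniform dynamics, and the discount factor (added to keep the toy instance feasible — the paper's planning MDPs rely on absorbing goals instead) are assumptions for illustration:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
d, nx, na = 3, 4, 2                      # features, states, actions (toy sizes)
lam, beta, gamma = 1.0, 1.0, 0.9         # regularizer, weight, discount

F = rng.standard_normal((d, nx * na))    # feature map; column (a*nx + x) = F^{x,a}
mu_e = rng.random(nx * na)               # expert frequency counts mu_i
l = rng.random(nx * na)                  # loss field l_i
s = np.full(nx, 1.0 / nx)                # source distribution s_i
P = [np.full((nx, nx), 1.0 / nx) for _ in range(na)]   # P_a[x, x'] = p(x'|x,a)

w = cp.Variable(d)
zeta = cp.Variable(nonneg=True)
v = cp.Variable(nx)

# Margin constraint: w^T F mu_i >= s^T v - zeta
constraints = [w @ (F @ mu_e) >= s @ v - zeta]
# Bellman inequalities, one block per action:
#   v[x] >= w^T F^{x,a} + l^{x,a} + gamma * sum_x' p(x'|x,a) v[x']
for a in range(na):
    Fa = F[:, a * nx:(a + 1) * nx]       # d x nx slice for action a
    la = l[a * nx:(a + 1) * nx]
    constraints.append(v >= Fa.T @ w + la + gamma * (P[a] @ v))

objective = cp.Minimize(lam / 2 * cp.sum_squares(w) + beta * zeta)
print(cp.Problem(objective, constraints).solve())   # off-the-shelf QP solver
```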
Outline
Basics of Reinforcement Learning
– State and action space
– Activity
Introducing Inverse RL
– Max-margin formulation
– Optimization algorithms
  • Problem-specific QP-formulation
  • Sub-gradient method
– Activities
Conclusion
Optimizing via Sub-gradient

Start from the max-margin program:

$$\min_{w,\,\zeta_i} \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_i \beta_i \zeta_i^q
\quad \text{s.t. } \forall i:\; w^\top F_i \mu_i \ge \max_{\mu \in \mathcal{G}_i} \big( w^\top F_i \mu + l_i^\top \mu \big) - \zeta_i$$

Re-writing the constraint:

$$\zeta_i \;\ge\; \max_{\mu \in \mathcal{G}_i} \big( w^\top F_i \mu + l_i^\top \mu \big) - w^\top F_i \mu_i \;\ge\; 0$$

It is tight at optimality:

$$\zeta_i \;=\; \max_{\mu \in \mathcal{G}_i} \big( w^\top F_i \mu + l_i^\top \mu \big) - w^\top F_i \mu_i$$
Substituting the tight slack back into the objective gives an unconstrained problem:

$$\min_w \; c(w) \;=\; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_i \beta_i \left( \max_{\mu \in \mathcal{G}_i} \big( w^\top F_i \mu + l_i^\top \mu \big) - w^\top F_i \mu_i \right)$$

(Ratliff, PhD Thesis)
Weight update via Sub-gradient

Standard gradient descent would update $w_{t+1} \leftarrow w_t - \alpha \nabla_{w_t} c(w)$; since $c$ is non-differentiable at the max, use a sub-gradient $g_{w_t}$ instead:

$$w_{t+1} \leftarrow w_t - \alpha\, g_{w_t}$$

Compute the frequency counts of the best path under the loss-augmented reward,

$$\mu^* = \arg\max_{\mu \in \mathcal{G}_i} \big( w_t^\top F_i \mu + l_i^\top \mu \big)$$

then

$$g_{w_t} \;=\; \lambda w_t + \frac{1}{n}\sum_i \beta_i F_i \big( \mu^* - \mu_i \big)$$
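Putting it together, a sketch of the training loop under simplifying assumptions: the inner planning problem is replaced by an argmax over a small precomputed set of candidate $\mu$ vectors, and all data are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d, nsa, m = 3, 10, 6                 # feature dim, |X||A|, candidate paths (toy)
F = rng.standard_normal((d, nsa))    # feature map F_i
mu_e = rng.random(nsa)               # expert frequency counts mu_i
cands = [rng.random(nsa) for _ in range(m)]            # stand-in for G_i / planner
loss = [float(np.abs(c - mu_e).sum()) for c in cands]  # l_i^T mu per candidate

lam, alpha, beta, n = 0.1, 0.05, 1.0, 1
w = np.zeros(d)
for t in range(100):
    # Loss-augmented "planning": best mu under reward w^T F mu + l^T mu.
    scores = [w @ (F @ mu) + L for mu, L in zip(cands, loss)]
    mu_star = cands[int(np.argmax(scores))]
    # Sub-gradient g = lam*w + (1/n) * beta_i * F (mu* - mu_i), then step.
    g = lam * w + (beta / n) * (F @ (mu_star - mu_e))
    w -= alpha * g
print(w)
```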
Convergence summary

$$c(w) \;=\; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} r_i(w), \qquad w \leftarrow w - \alpha\, g_w$$

with $r_i(w)$ the per-example hinge term from the previous slide. How to choose $\alpha$? If $\|\nabla_w c(w)\| \le G$ and $0 < \alpha \le \frac{1}{\lambda}$, then a constant step size gives linear convergence to a region around the minimum:

$$\|w_{t+1} - w^*\|^2 \;\le\; c^{\,t+1}\,\|w_0 - w^*\|^2 + \underbrace{\frac{\alpha G^2}{\lambda}}_{\text{optimization error}}, \qquad 0 < c < 1$$
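The "region around the minimum" is visible numerically. An illustrative one-dimensional stand-in (not the MMP objective): $c(w) = \frac{\lambda}{2}w^2 + |w|$ is $\lambda$-strongly convex and non-smooth; with a constant step the iterates approach $w^* = 0$ quickly, then oscillate in a band of width $O(\alpha G^2/\lambda)$:

```python
import numpy as np

lam, alpha = 1.0, 0.1          # strong convexity; constant step, alpha <= 1/lam

def subgrad(w):
    # c(w) = lam/2 * w^2 + |w|; minimizer w* = 0 (the subdifferential
    # at 0 is [-1, 1], which contains 0).
    return lam * w + np.sign(w)

w, dists = 5.0, []
for t in range(200):
    w -= alpha * subgrad(w)
    dists.append(abs(w))       # distance to w* = 0

print(dists[:3], dists[-3:])   # fast initial decrease, then a plateau ~ alpha
```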
Outline
Basics of Reinforcement Learning
– State and action space
– Activity
Introducing Inverse RL
– Max-margin formulation
– Optimization algorithms
– Activity
Conclusion
Convergence proof

$$\|w_{t+1} - w^*\|^2 \;=\; \|w_t - \alpha g_t - w^*\|^2 \;=\; \|w_t - w^*\|^2 + \alpha^2 \|g_t\|^2 - 2\alpha\, g_t^\top (w_t - w^*)$$

We know $\|\nabla_w c(w)\| \le G$. By $\lambda$-strong convexity of $c$,

$$c(w) \;\ge\; c(w') + g_{w'}^\top (w - w') + \frac{\lambda}{2}\|w - w'\|^2.$$

Use $w' = w_t$, $w = w^*$, and $c(w^*) \le c(w_t)$ to get $\frac{\lambda}{2}\|w_t - w^*\|^2 \le g_t^\top (w_t - w^*)$. Substituting:

$$\|w_{t+1} - w^*\|^2 \;\le\; (1 - \alpha\lambda)\,\|w_t - w^*\|^2 + \alpha^2 G^2$$
Outline
Basics of Reinforcement Learning
– State and action space
– Activity
Introducing Inverse RL
– Max-margin formulation
– Optimization algorithms
– Activity
Conclusion
2D path planning
[Figures: expert's demonstration; learned reward; planning in a new environment.]
Car driving (Abbeel et al.)
Bad driver (Abbeel et al.)
Recent works
1. N. Ratliff, D. Bradley, D. Bagnell, J. Chestnutt. Boosting Structured Prediction for Imitation Learning. In NIPS 2007.
2. B. Ziebart, A. Maas, D. Bagnell, A. K. Dey. Maximum Entropy Inverse Reinforcement Learning. In AAAI 2008.
3. P. Abbeel, A. Coates, A. Ng. Autonomous Helicopter Aerobatics through Apprenticeship Learning. In IJRR 2010.
4. S. Ross, G. Gordon, D. Bagnell. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In AISTATS 2011.
5. B. Akgun, M. Cakmak, K. Jiang, A. Thomaz. Keyframe-based Learning from Demonstration. In IJSR 2012.
6. A. Jain, B. Wojcik, T. Joachims, A. Saxena. Learning Trajectory Preferences for Manipulators via Iterative Improvement. In NIPS 2013.
Questions
Summary of key points:
– Reward-based learning can be modeled in the MDP framework.
– Inverse Reinforcement Learning takes perception features as input and outputs a reward function.
– The max-margin planning problem can be formulated as a tractable QP.
– Sub-gradient descent solves an equivalent problem with a provable convergence bound.