
Maximum Margin Planning

Presenters:

Ashesh Jain, Michael Hu

CS6784 Class Presentation

Paper by Nathan Ratliff, Drew Bagnell and Martin Zinkevich


Theme

1. Supervised learning

2. Unsupervised learning

3. Reinforcement learning

– Reward-based learning

Inverse Reinforcement Learning


Motivating Example (Abbeel et al.)


Outline

Basics of Reinforcement Learning – State and action space

– Activity

Introducing Inverse RL – Max-margin formulation

– Optimization algorithms

– Activity

Conclusion


Reward-based Learning


Markov Decision Process (MDP)

MDP is defined by $(\mathcal{X}, \mathcal{A}, p, R)$

$\mathcal{X}$: State space

$\mathcal{A}$: Action space

$p$: Dynamics model $p(x' \mid x, a)$

$R$: Reward function, $R(x)$ or $R(x, a)$

๐‘‹

๐‘Œ 0.7

0.1

0.1

0.1

100 -1

-1 โ€ฆ

โ€ฆ

โ€ฆ
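The grid-world in the figure can be written down directly as a pair of arrays. The sketch below is only illustrative: the grid size, goal location, slip model, and array layout are my assumptions, not details taken from the slides.

```python
import numpy as np

# Hypothetical 5x5 grid-world with 4 actions (up, right, down, left).
N = 5
S, A = N * N, 4
GOAL = S - 1                      # assume the goal is the last cell
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]

def step(s, a):
    """Deterministic move on the grid, staying put when hitting a wall."""
    r, c = divmod(s, N)
    dr, dc = MOVES[a]
    r, c = min(max(r + dr, 0), N - 1), min(max(c + dc, 0), N - 1)
    return r * N + c

# Dynamics p(x'|x, a): the intended move with probability 0.7,
# each of the other three moves with probability 0.1.
P = np.zeros((S, A, S))
for s in range(S):
    for a in range(A):
        for b in range(A):
            P[s, a, step(s, b)] += 0.7 if b == a else 0.1

# Reward R(x): 100 at the goal, -1 everywhere else.
R = np.full(S, -1.0)
R[GOAL] = 100.0
```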



Markov Decision Process (MDP)

Policy

$\pi: \mathcal{X} \to \mathcal{A}$

[Figure: an example policy, shown as one action per grid cell.]


๐‘ฅ๐‘ก ๐‘ฅ๐‘ก+1

๐‘Ž๐‘ก ๐‘Ž๐‘ก+1

๐‘…๐‘ก+1 ๐‘…๐‘ก

๐œ‹โˆ— = argmax๐œ‹๐ธ ๐›พ๐‘ก๐‘…๐‘ก(๐‘ฅ๐‘ก , ๐œ‹(๐‘ฅ๐‘ก))

โˆž

๐‘ก=0

๐œ‹:X โ†’A

Goal: Learn an optimal policy

Graphical representation of MDP

๐‘ฅ๐‘ก , ๐œ‹ ๐‘ฅ๐‘ก โ†’ ๐‘ฅ๐‘ก+1, ๐œ‹ ๐‘ฅ๐‘ก+1 โ†’โ‹…โ‹…โ‹…


Activity – Very Easy!!!

Draw the optimal policy!

Deterministic dynamics: the agent moves in the intended direction with probability 1 (and with probability 0 in every other direction).


Activity – Solution


Outline

Basics of Reinforcement Learning – State and action space

– Activity

Introducing Inverse RL – Max-margin formulation

– Optimization algorithms

– Activities

Conclusion


RL to Inverse-RL (IRL)

[Diagram: an RL algorithm takes the reward function $R(x, a)$, the dynamics model $p(x' \mid x, a)$, the state space $\mathcal{X}$, and the action space $\mathcal{A}$, and outputs an optimal policy $\pi^*: x \to a$. IRL inverts this pipeline.]

Inverse problem: Given the optimal policy $\pi^*$, or samples drawn from it, recover the reward $R(x, a)$.


Why Inverse-RL?

Specifying $R(x, a)$ is hard, but samples from $\pi^*$ are easily available (Abbeel et al., Ratliff et al., Kober et al.).


IRL framework

The expert, acting according to $\pi^*: x \to a$, interacts with the environment and produces a demonstration

$$y^* \;=\; (x_1, a_1) \to (x_2, a_2) \to (x_3, a_3) \to \cdots \to (x_n, a_n)$$

Reward: the demonstration is scored linearly in the features accumulated along the path, $w^T f(y^*) = \sum_t w^T f(x_t, a_t)$.


Paper contributions

1. Formulated IRL as structured prediction

Maximum margin approach

2. Two optimization algorithms

Problem specific solver

Sub-gradient method

3. Robotic application: 2D navigation etc.


Outline

Basics of Reinforcement Learning – State and action space

– Activity

Introducing Inverse RL – Max-margin formulation

– Optimization algorithms

– Activities

Conclusion


Formulation

Input: $\left\{\,\mathcal{X}_i, \mathcal{A}_i, p_i, f_i, y_i, \mathcal{L}_i\,\right\}_{i=1}^{n}$

$\mathcal{X}_i$: State space

$\mathcal{A}_i$: Action space

$p_i$: Dynamics model

$y_i$: Expert demonstration

$f_i(\cdot)$: Feature function

$\mathcal{L}_i(\cdot, y_i)$: Loss function

๐‘‹

๐‘Œ 0.7

0.1

0.1

0.1

Jain, Hu

X |A|

X |A|

Max-margin formulation

min๐‘ค,๐œ‰๐‘–

๐œ†

2๐‘ค 2 +

1

๐‘› ๐›ฝ๐‘–๐œ‰๐‘–๐‘–

๐‘ . ๐‘ก. โˆ€๐‘– ๐‘ค๐‘‡๐‘“๐‘– ๐‘ฆ๐‘– โ‰ฅ max๐‘ฆโˆˆY

๐‘–

๐‘ค๐‘‡๐‘“๐‘– ๐‘ฆ + L ๐‘ฆ, ๐‘ฆ๐‘– โˆ’ ๐œ‰๐‘–

Features decompose linearly over the path: $f_i(y) = F_i\,\mu$, where $F_i$ is the $d \times |\mathcal{X}_i||\mathcal{A}_i|$ feature map (one column of features per state-action pair) and $\mu$ is the vector of state-action frequency counts along the path.
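To make the decomposition concrete, here is a small numerical check. It is only a sketch: the sizes, the demonstration, and the column ordering of $F$ are my assumptions.

```python
import numpy as np

# Hypothetical sizes: S states, A actions, d features per state-action pair.
S, A, d = 4, 2, 3
rng = np.random.default_rng(0)
F = rng.normal(size=(d, S * A))            # feature map: one column per (x, a)

# A demonstration as a list of (state, action) pairs.
demo = [(0, 1), (1, 0), (2, 0), (3, 1)]

# Frequency counts mu: how often each (x, a) pair is visited on the path.
mu = np.zeros(S * A)
for x, a in demo:
    mu[x * A + a] += 1

# f(y) = F @ mu equals the sum of the per-step feature columns along the path.
assert np.allclose(F @ mu, sum(F[:, x * A + a] for x, a in demo))

# The path reward is then linear in w:  w^T f(y) = w^T F mu.
w = rng.normal(size=d)
print(w @ F @ mu)
```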


Max-margin formulation

min๐‘ค,๐œ‰๐‘–

๐œ†

2๐‘ค 2 +

1

๐‘› ๐›ฝ๐‘–๐œ‰๐‘–๐‘–

๐‘ . ๐‘ก. โˆ€๐‘– ๐‘ค๐‘‡๐น๐‘–๐œ‡๐‘– โ‰ฅ max๐œ‡โˆˆG

๐‘–

๐‘ค๐‘‡๐น๐‘–๐œ‡ + ๐‘™๐‘–๐‘‡๐œ‡ โˆ’ ๐œ‰๐‘–

satisfies Bellman-flow constraints ๐œ‡

๐œ‡๐‘ฅ,๐‘Ž๐‘๐‘– ๐‘ฅโ€ฒ ๐‘ฅ, ๐‘Ž + ๐‘ ๐‘–

๐‘ฅโ€ฒ = ๐œ‡๐‘ฅโ€ฒ,๐‘Ž

๐‘Ž๐‘ฅ,๐‘Ž

โˆ€๐‘ฅโ€ฒ

Outflow Inflow


Outline

Basics of Reinforcement Learning – State and action space – Activity

Introducing Inverse RL – Max-margin formulation – Optimization algorithms

• Problem specific QP-formulation • Sub-gradient method

– Activities

Conclusion


๐œ‡๐‘ฅ,๐‘Ž๐‘๐‘– ๐‘ฅโ€ฒ ๐‘ฅ, ๐‘Ž + ๐‘ ๐‘–

๐‘ฅโ€ฒ = ๐œ‡๐‘ฅโ€ฒ,๐‘Ž

๐‘Ž๐‘ฅ,๐‘Ž

โˆ€๐‘–, ๐‘ฅ, ๐‘Ž ๐‘ฃ๐‘–๐‘ฅ โ‰ฅ ๐‘ค๐‘‡๐น๐‘– + ๐‘™๐‘–

๐‘ฅ,๐‘Ž + ๐‘๐‘– ๐‘ฅโ€ฒ ๐‘ฅ, ๐‘Ž ๐‘ฃ๐‘–

๐‘ฅโ€ฒ

๐‘ฅโ€ฒ

Problem specific QP-formulation

min๐‘ค,๐œ‰๐‘–

๐œ†

2๐‘ค 2 +

1

๐‘› ๐›ฝ๐‘–๐œ‰๐‘–๐‘–

๐‘ . ๐‘ก. โˆ€๐‘– ๐‘ค๐‘‡๐น๐‘–๐œ‡๐‘– โ‰ฅ max๐œ‡โˆˆG

๐‘–

๐‘ค๐‘‡๐น๐‘–๐œ‡ + ๐‘™๐‘–๐‘‡๐œ‡ โˆ’ ๐œ‰๐‘–

โˆ€๐‘ฅโ€ฒ

Outflow Inflow

๐‘ . ๐‘ก. โˆ€๐‘– ๐‘ค๐‘‡๐น๐‘–๐œ‡๐‘– โ‰ฅ ๐‘ ๐‘–๐‘‡๐‘ฃ๐‘– โˆ’ ๐œ‰๐‘–

Can be optimized using off-the-shelf QP solvers
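As a concrete illustration of the "off-the-shelf solver" claim, here is a minimal, hypothetical encoding of the QP above in cvxpy. The data layout (per-example `F`, `mu`, `l`, `P`, `s`) and the function name `mmp_qp` are assumptions made for this sketch, not the paper's implementation.

```python
import cvxpy as cp
import numpy as np

def mmp_qp(examples, lam=1.0, beta=None):
    """examples: list of dicts with keys
       F  : (d, S*A) feature map
       mu : (S*A,)   expert state-action frequency counts
       l  : (S*A,)   loss vector (e.g. 1 for state-actions the expert avoided)
       P  : (S, A, S) dynamics, P[x, a, x'] = p(x'|x, a)
       s  : (S,)     initial state distribution
    """
    n = len(examples)
    d = examples[0]["F"].shape[0]
    beta = np.ones(n) if beta is None else beta

    w = cp.Variable(d)
    xis = [cp.Variable(nonneg=True) for _ in range(n)]
    constraints = []
    for ex, xi in zip(examples, xis):
        F, mu, l, P, s = ex["F"], ex["mu"], ex["l"], ex["P"], ex["s"]
        S, A, _ = P.shape
        v = cp.Variable(S)                        # dual value function v_i
        # Margin constraint:  w^T F_i mu_i >= s_i^T v_i - xi_i
        constraints.append(w @ (F @ mu) >= s @ v - xi)
        # Value constraints: v[x] >= (w^T F + l)_{x,a} + sum_x' p(x'|x,a) v[x']
        r = F.T @ w + l                           # loss-augmented reward, length S*A
        for x in range(S):
            for a in range(A):
                constraints.append(v[x] >= r[x * A + a] + P[x, a] @ v)
    objective = cp.Minimize(lam / 2 * cp.sum_squares(w)
                            + (1 / n) * sum(b * xi for b, xi in zip(beta, xis)))
    cp.Problem(objective, constraints).solve()
    return w.value
```

One practical caveat: the number of constraints grows with $|\mathcal{X}_i||\mathcal{A}_i|$ for every training MDP, which is one motivation for the sub-gradient method presented next.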


Outline

Basics of Reinforcement Learning – State and action space – Activity

Introducing Inverse RL – Max-margin formulation – Optimization algorithms

• Problem specific QP-formulation • Sub-gradient method

– Activities

Conclusion


Optimizing via Sub-gradient

min๐‘ค,๐œ‰๐‘–

๐œ†

2๐‘ค 2 +

1

๐‘› ๐›ฝ๐‘–๐œ‰๐‘–๐‘–

๐‘ . ๐‘ก. โˆ€๐‘– ๐‘ค๐‘‡๐น๐‘–๐œ‡๐‘– โ‰ฅ max๐œ‡โˆˆG

๐‘–

๐‘ค๐‘‡๐น๐‘–๐œ‡ + ๐‘™๐‘–๐‘‡๐œ‡ โˆ’ ๐œ‰๐‘–

satisfies Bellman-flow constraints ๐œ‡

๐œ‰๐‘– โ‰ฅ max๐œ‡โˆˆG

๐‘–

๐‘ค๐‘‡๐น๐‘–๐œ‡ + ๐‘™๐‘–๐‘‡๐œ‡ โˆ’ ๐‘ค๐‘‡๐น๐‘–๐œ‡๐‘– โ‰ฅ 0

๐œ‰๐‘– = max๐œ‡โˆˆG

๐‘–

๐‘ค๐‘‡๐น๐‘–๐œ‡ + ๐‘™๐‘–๐‘‡๐œ‡ โˆ’ ๐‘ค๐‘‡๐น๐‘–๐œ‡๐‘–

Re-writing constraint

Its tight at optimality


Optimizing via Sub-gradient

Substituting $\xi_i$ back into the objective gives the unconstrained problem $\min_w c(w)$ with

$$c(w) \;=\; \frac{\lambda}{2}\|w\|^2 \;+\; \frac{1}{n}\sum_i \beta_i \left( \max_{\mu \in \mathcal{G}_i}\big[\, w^T F_i \mu + l_i^T \mu \,\big] \;-\; w^T F_i \mu_i \right)$$

(Ratliff, PhD thesis)


Weight update via Sub-gradient

Recall the objective

$$c(w) \;=\; \frac{\lambda}{2}\|w\|^2 \;+\; \frac{1}{n}\sum_i \beta_i \left( \max_{\mu \in \mathcal{G}_i}\big[\, w^T F_i \mu + l_i^T \mu \,\big] \;-\; w^T F_i \mu_i \right)$$

A standard gradient descent update would be $w_{t+1} \leftarrow w_t - \alpha \nabla_{w_t} c(w)$; because of the inner max, $c(w)$ is not differentiable everywhere, so we step along a sub-gradient instead:

$$w_{t+1} \;\leftarrow\; w_t - \alpha\, g_{w_t}, \qquad g_{w_t} \;=\; \lambda w_t \;+\; \frac{1}{n}\sum_i \beta_i F_i\,(\mu^* - \mu_i)$$

where $\mu^*$ is the optimal frequency vector under the loss-augmented reward $w_t^T F_i + l_i^T$:

$$\mu^* \;=\; \operatorname*{argmax}_{\mu \in \mathcal{G}_i}\; \big(w_t^T F_i + l_i^T\big)\,\mu$$
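A compact sketch of this loop, assuming a caller-supplied planner `loss_augmented_best_mu` that returns the maximizing frequency vector $\mu^*$ for a given loss-augmented reward (in the paper this role is played by an optimal planner on each training MDP). The names, shapes, and hyperparameters are illustrative assumptions.

```python
import numpy as np

def mmp_subgradient(examples, loss_augmented_best_mu,
                    lam=1.0, alpha=0.1, iters=100, beta=None):
    """Sub-gradient descent on c(w).

    examples: list of dicts with keys
      F  : (d, S*A) feature map
      mu : (S*A,)   expert state-action frequency counts
      l  : (S*A,)   loss vector
    loss_augmented_best_mu(example, reward) -> mu* in G_i maximizing
      reward @ mu, where reward = w^T F_i + l_i (assumed to be provided,
      e.g. by a planner or value iteration on the loss-augmented MDP).
    """
    n = len(examples)
    d = examples[0]["F"].shape[0]
    beta = np.ones(n) if beta is None else beta
    w = np.zeros(d)
    for _ in range(iters):
        g = lam * w                              # gradient of (lam/2)||w||^2
        for b, ex in zip(beta, examples):
            reward = ex["F"].T @ w + ex["l"]     # loss-augmented reward per (x, a)
            mu_star = loss_augmented_best_mu(ex, reward)
            g += (b / n) * ex["F"] @ (mu_star - ex["mu"])
        w -= alpha * g                           # w_{t+1} <- w_t - alpha * g_{w_t}
    return w
```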


Convergence summary

$$c(w) \;=\; \frac{\lambda}{2}\|w\|^2 \;+\; \frac{1}{n}\sum_{i=1}^{n} r_i(w), \qquad w \;\leftarrow\; w - \alpha\, g_w$$

How to choose $\alpha$? If $\|g_w\| \le G$ for all $w$ and $0 < \alpha \le \frac{1}{\lambda}$, then a constant step size gives linear convergence to a region around the minimum:

$$\|w_{t+1} - w^*\|^2 \;\le\; \kappa^{t+1}\,\|w_0 - w^*\|^2 \;+\; \frac{\alpha G^2}{\lambda}, \qquad 0 < \kappa < 1$$

The second term, $\alpha G^2 / \lambda$, is the residual optimization error.


Outline

Basics of Reinforcement Learning – State and action space

– Activity

Introducing Inverse RL – Max-margin formulation

– Optimization algorithms

– Activity

Conclusion


Convergence proof

Expanding the update:

$$\|w_{t+1} - w^*\|^2 \;=\; \|w_t - \alpha g_t - w^*\|^2 \;=\; \|w_t - w^*\|^2 + \alpha^2 \|g_t\|^2 - 2\alpha\, g_t^T (w_t - w^*)$$

We know $\|g_w\| \le G$ for all $w$, and by the $\lambda$-strong convexity of $c(w)$,

$$c(w) \;\ge\; c(w') + g_{w'}^T (w - w') + \frac{\lambda}{2}\|w - w'\|^2$$

Setting $w \leftarrow w^*$ and $w' \leftarrow w_t$ and using $c(w^*) \le c(w_t)$ gives

$$\frac{\lambda}{2}\|w_t - w^*\|^2 \;\le\; g_t^T (w_t - w^*)$$

Substituting both bounds:

$$\|w_{t+1} - w^*\|^2 \;\le\; (1 - \alpha\lambda)\,\|w_t - w^*\|^2 + \alpha^2 G^2$$
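Unrolling this recursion recovers the bound quoted in the convergence summary; a short derivation sketch, with $\kappa = 1 - \alpha\lambda$ (so $0 \le \kappa < 1$ whenever $0 < \alpha \le 1/\lambda$):

\begin{align*}
\|w_{t+1}-w^*\|^2
  &\le \kappa\,\|w_t-w^*\|^2 + \alpha^2 G^2 \\
  &\le \kappa^{t+1}\,\|w_0-w^*\|^2 + \alpha^2 G^2 \sum_{j=0}^{t} \kappa^{j}
     && \text{apply the recursion } t{+}1 \text{ times} \\
  &\le \kappa^{t+1}\,\|w_0-w^*\|^2 + \frac{\alpha^2 G^2}{1-\kappa}
     && \text{geometric series} \\
  &=   \kappa^{t+1}\,\|w_0-w^*\|^2 + \frac{\alpha G^2}{\lambda}
     && \text{since } 1-\kappa = \alpha\lambda.
\end{align*}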


Outline

Basics of Reinforcement Learning – State and action space

– Activity

Introducing Inverse RL – Max-margin formulation

– Optimization algorithms

– Activity

Conclusion


2D path planning

[Figure panels: Expert's demonstration; Learned reward; Planning in a new environment.]


Car driving (Abbeel et al.)


Bad driver (Abbeel et al.)


Recent works

1. N. Ratliff, D. Bradley, D. Bagnell, J. Chestnutt. Boosting Structured Prediction for Imitation Learning. In NIPS 2007

2. B. Ziebart, A. Maas, D. Bagnell, A. K. Dey. Maximum Entropy Inverse Reinforcement Learning. In AAAI 2008

3. P. Abbeel, A. Coates, A. Ng. Autonomous helicopter aerobatics through apprenticeship learning. In IJRR 2010

4. S. Ross, G. Gordon, D. Bagnell. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In AISTATS 2011

5. B. Akgun, M. Cakmak, K. Jiang, A. Thomaz. Keyframe-based learning from demonstrations. In IJSR 2012

6. A. Jain, B. Wojcik, T. Joachims, A. Saxena. Learning Trajectory Preferences for Manipulators via Iterative Improvement. In NIPS 2013


Questions

• Summary of key points:

– Reward-based learning can be modeled by the MDP framework.

– Inverse Reinforcement Learning takes perception features as input and outputs a reward function.

– The max-margin planning problem can be formulated as a tractable QP.

– Sub-gradient descent solves an equivalent problem with a provable convergence bound.
