

Apprenticeship Learning via Inverse Reinforcement Learning

Pieter Abbeel

Stanford University

[Joint work with Andrew Ng.]


Overview

• Reinforcement Learning (RL)

• Motivation for Apprenticeship Learning

• Proposed algorithm

• Theoretical results

• Experimental results

• Conclusion


Example of Reinforcement Learning Problem

Highway driving.


RL formalism

• Assume that at each time step, our system is in some state s_t.

• Upon taking an action a_t, our system randomly transitions to some new state s_{t+1}.

• We are also given a reward function R.

• The goal: pick actions over time so as to maximize the expected sum of rewards E[R(s_0) + R(s_1) + … + R(s_T)].

[Diagram: the system evolves s_0 → s_1 → s_2 → … → s_{T-1} → s_T under the system dynamics, collecting total reward R(s_0) + R(s_1) + … + R(s_T).]


RL formalism

• Markov Decision Process (S, A, P, s_0, R)

• We assume the reward is linear in known features: R(s) = w^T φ(s); w.l.o.g. ||w||_2 ≤ 1.

• Policy π: a mapping from states to actions.

• Utility of a policy π for reward R = w^T φ:

  U_w(π) = E[ R(s_0) + … + R(s_T) | π ] = w^T E[ φ(s_0) + … + φ(s_T) | π ] = w^T μ(π),

  where μ(π) = E[ φ(s_0) + … + φ(s_T) | π ] are the feature expectations of π.
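To make μ(π) and U_w(π) concrete, here is a minimal sketch (not from the slides; `step`, `phi`, and `policy` are hypothetical callables standing in for the system dynamics, feature map, and policy) of estimating the feature expectations by Monte Carlo rollouts:

```python
import numpy as np

def feature_expectations(policy, step, phi, s0, T, n_rollouts=1000):
    """Monte Carlo estimate of mu(pi) = E[ phi(s_0) + ... + phi(s_T) | pi ]."""
    mu = np.zeros_like(phi(s0), dtype=float)
    for _ in range(n_rollouts):
        s = s0
        for t in range(T + 1):
            mu += phi(s)                 # accumulate features along the trajectory
            if t < T:
                s = step(s, policy(s))   # sample s_{t+1} given the chosen action
    return mu / n_rollouts

def utility(w, mu):
    """U_w(pi) = w^T mu(pi) when the reward is linear, R(s) = w^T phi(s)."""
    return float(w @ mu)
```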


Motivation for Apprenticeship Learning

Reinforcement learning (RL) gives powerful tools for solving MDPs, but it can be difficult to specify the reward function by hand. Example: highway driving.


Apprenticeship Learning

• Learning from observing an expert.

• Previous work:

– Learn to predict expert’s actions as a function of states.

– Usually lacks strong performance guarantees.

– (E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002; Atkeson & Schaal, 1997; …)

• Our approach:

– Based on inverse reinforcement learning (Ng & Russell, 2000).

– Returns a policy that performs as well as the expert, as measured according to the expert's unknown reward function.


Algorithm

For i = 1, 2, …

Inverse RL step:

Estimate the expert's reward function R(s) = w^T φ(s) such that, under R, the expert performs better than all previously found policies {π_j}_{j < i}.

RL step:

Compute the optimal policy π_i for the estimated reward w.
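A minimal sketch of this alternation, assuming hypothetical helpers: `solve_mdp(w)` (any RL solver returning an optimal policy for the reward w^T φ), `feature_expectations(pi)` (e.g., the Monte Carlo estimate sketched earlier), and `estimate_reward(mu_E, mus)` (either the max-margin QP sketched just below or the projection step from the poster slides), which returns a reward direction w together with the current margin:

```python
import numpy as np

def apprenticeship_learning(mu_E, solve_mdp, feature_expectations, estimate_reward,
                            eps=0.01, max_iters=100):
    """Alternate Inverse RL steps (estimate w) and RL steps (compute an optimal policy)."""
    pi0 = solve_mdp(np.zeros_like(mu_E))         # arbitrary initial policy (zero reward)
    policies, mus = [pi0], [feature_expectations(pi0)]
    for i in range(1, max_iters + 1):
        w, margin = estimate_reward(mu_E, mus)   # Inverse RL step
        if margin <= eps:                        # expert barely beats every policy found so far
            break
        pi_i = solve_mdp(w)                      # RL step: optimal policy for reward w^T phi
        policies.append(pi_i)
        mus.append(feature_expectations(pi_i))
    return policies, mus
```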


Algorithm: Inverse RL step

Find reward weights w under which the expert outperforms every previously found policy by the largest possible margin:

max_{t, w: ||w||_2 ≤ 1}  t   s.t.   w^T μ_E ≥ w^T μ(π_j) + t   for all j < i.

Quadratic programming problem (same form as for an SVM).
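A sketch of that QP with cvxpy (illustrative only; it plays the role of `estimate_reward` in the loop above): maximize the margin t by which the expert's feature expectations beat every μ(π_j), subject to ||w||_2 ≤ 1.

```python
import cvxpy as cp
import numpy as np

def max_margin_reward(mu_E, mus):
    """Inverse RL step as a max-margin program: returns (w, margin)."""
    k = len(mu_E)
    w, t = cp.Variable(k), cp.Variable()
    constraints = [cp.norm(w, 2) <= 1]
    constraints += [w @ mu_E >= w @ mu + t for mu in mus]   # expert beats each policy by >= t
    cp.Problem(cp.Maximize(t), constraints).solve()
    return np.asarray(w.value), float(t.value)
```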


Algorithm

[Figure: iterations in feature-expectation space (axes φ_1, φ_2): the policies' feature expectations μ^(0), μ^(1), μ^(2), the successive reward directions w^(1), w^(2), w^(3), and the expert's μ_E.]

Feature Expectation Closeness and Performance

If we can find a policy π such that

||μ_E − μ(π)||_2 ≤ ε,

then for any underlying reward R*(s) = w*^T φ(s) with ||w*||_2 ≤ 1, we have

|U_w*(π_E) − U_w*(π)| = |w*^T μ_E − w*^T μ(π)| ≤ ||w*||_2 ||μ_E − μ(π)||_2 ≤ ε.


Theoretical Results: Convergence

Theorem. Let an MDP (without reward function), a k-dimensional feature vector φ, and the expert's feature expectations μ_E be given. Then after at most

k T^2 / ε^2

iterations, the algorithm outputs a policy π that performs nearly as well as the expert, as evaluated on the unknown reward function R*(s) = w*^T φ(s), i.e.,

U_w*(π) ≥ U_w*(π_E) − ε.


Theoretical Results: Sampling

In practice, we have to use sampling to estimate the feature expectations of the expert. We still have ε-optimal performance with high probability if the number of observed samples is at least

O(poly(k, 1/ε)).

Note: the bound has no dependence on the “complexity” of the policy.


Gridworld Experiments

The reward function is piecewise constant over small regions. The features φ used for IRL are indicators of these small regions.

128×128 grid, with regions of size 16×16 (so k = 64 features).
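A minimal sketch of such region-indicator features (my construction for illustration; the slides specify only the grid and region sizes):

```python
import numpy as np

GRID = 128                              # 128 x 128 gridworld
REGION = 16                             # reward is constant over 16 x 16 regions
N_REGIONS = (GRID // REGION) ** 2       # 8 x 8 = 64 regions -> 64 features

def phi(state):
    """Indicator feature vector with a single 1 in the entry of the region containing `state`."""
    x, y = state
    idx = (x // REGION) * (GRID // REGION) + (y // REGION)
    f = np.zeros(N_REGIONS)
    f[idx] = 1.0
    return f

# The (unknown) true reward is then R(s) = w*^T phi(s) for some weight vector w* in R^64.
```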


Gridworld Experiments

[Result figures omitted.]

Case study: Highway driving

The only input to the learning algorithm was the driving demonstration (left panel). No reward function was provided.

Input: Driving demonstration Output: Learned behavior


More driving examples

In each video, the left sub-panel shows a demonstration of a different driving “style”, and the right sub-panel shows the behavior learned from watching the demonstration.


Car driving results

                  Collision  Left Shoulder  Left Lane  Middle Lane  Right Lane  Right Shoulder
1  μ (expert)        0          0             0.13       0.20         0.60        0.07
   μ (learned)       0          0             0.09       0.23         0.60        0.08
   w (learned)      -0.08      -0.04          0.01       0.01         0.03       -0.01
2  μ (expert)        0.12       0             0.06       0.47         0.47        0
   μ (learned)       0.13       0             0.10       0.32         0.58        0
   w (learned)       0.23      -0.11          0.01       0.05         0.06       -0.01
3  μ (expert)        0          0             0          0.01         0.70        0.29
   μ (learned)       0          0             0          0            0.74        0.26
   w (learned)      -0.11      -0.01         -0.06      -0.04         0.09        0.01


Different Formulation

LP formulation of the RL problem:

max_μ   Σ_{s,a} μ(s,a) R(s)

s.t.   ∀s:  Σ_a μ(s,a) = c(s) + Σ_{s',a} P(s | s',a) μ(s',a)

(μ(s,a) denotes the expected state-action visitation frequency; c is the initial-state distribution.)

QP formulation of Apprenticeship Learning:

min_μ   Σ_i ( μ_{E,i} − μ_i )^2

s.t.   ∀s:  Σ_a μ(s,a) = c(s) + Σ_{s',a} P(s | s',a) μ(s',a)

       ∀i:  μ_i = Σ_{s,a} φ_i(s) μ(s,a)
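For a small, explicitly enumerated MDP, this QP can be written down directly; a sketch with cvxpy (my construction for illustration: `P[s,a,s']` are transition probabilities, `phi[s,i]` the features, `c` the initial-state distribution, `mu_E` the expert's feature expectations):

```python
import cvxpy as cp
import numpy as np

def apprenticeship_qp(P, phi, c, mu_E):
    """Solve the occupancy-measure QP directly; returns visitation frequencies mu(s,a)."""
    S, A, _ = P.shape
    mu = cp.Variable((S, A), nonneg=True)          # visitation frequencies are nonnegative
    constraints = []
    for s in range(S):                             # flow conservation at each state
        inflow = sum(P[sp, a, s] * mu[sp, a] for sp in range(S) for a in range(A))
        constraints.append(cp.sum(mu[s, :]) == c[s] + inflow)
    state_visits = cp.sum(mu, axis=1)              # sum over actions
    mu_feat = phi.T @ state_visits                 # mu_i = sum_{s,a} phi_i(s) mu(s,a)
    objective = cp.Minimize(cp.sum_squares(mu_E - mu_feat))
    cp.Problem(objective, constraints).solve()
    return mu.value
```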


Different Formulation (ctd.)

Our algorithm is equivalent to iteratively

– linearizing the QP at the current point (Inverse RL step), and

– solving the resulting LP (RL step).

Why not solve the QP directly? That is typically possible only for very small toy problems (curse of dimensionality). [Our algorithm makes use of existing RL solvers to deal with the curse of dimensionality.]


Conclusions

• Our algorithm returns a policy with performance as good as the expert's, as evaluated according to the expert's unknown reward function.

• The algorithm is guaranteed to converge in poly(k, 1/ε) iterations.

• The sample complexity is poly(k, 1/ε).

• The algorithm exploits reward "simplicity" (vs. policy "simplicity" in previous approaches).


Proof (sketch)

[Figure: proof sketch in feature-expectation space (axes φ_1, φ_2), showing the iterates μ^(0), μ^(1), the reward direction w^(1), the expert's μ_E, and distances d_0, d_1.]


Additional slides for poster

(The slides that follow are additional material not included in the talk; in particular: the projection (vs. QP) version of the Inverse RL step, and another formulation of the apprenticeship learning problem and its relation to our algorithm.)


Simplification of the Inverse RL step: QP → Euclidean projection

• In the Inverse RL step:

– set μ̄^(i-1) = the orthogonal projection of μ_E onto the line through μ̄^(i-2) and μ(π^(i-1)),

– set w^(i) = μ_E − μ̄^(i-1).

• Note: the theoretical results on convergence and sample complexity hold unchanged for the simpler algorithm.
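A minimal numpy sketch of this projection update (function and variable names are mine; mu_bar denotes μ̄):

```python
import numpy as np

def projection_update(mu_bar_prev, mu_new, mu_E):
    """One projection-version Inverse RL step.
    mu_bar_prev: previous projected point, mu_bar^(i-2)
    mu_new:      feature expectations of the latest policy, mu(pi^(i-1))
    mu_E:        expert's feature expectations
    Returns (mu_bar, w) with w^(i) = mu_E - mu_bar^(i-1)."""
    d = mu_new - mu_bar_prev
    # Orthogonal projection of mu_E onto the line through mu_bar_prev and mu_new.
    alpha = float(d @ (mu_E - mu_bar_prev)) / float(d @ d)
    mu_bar = mu_bar_prev + alpha * d
    w = mu_E - mu_bar
    return mu_bar, w
```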


Algorithm (projection version)

[Figures: successive iterations of the projection version in feature-expectation space (axes φ_1, φ_2), showing the feature expectations μ^(0), μ^(1), μ^(2), the projected points μ̄^(1), μ̄^(2), the reward directions w^(1), w^(2), w^(3), and the expert's μ_E.]

Appendix: Different View

Bellman LP for solving MDPs:

min_V   c^T V

s.t.   ∀s,a:  V(s) ≥ R(s,a) + Σ_{s'} P(s,a,s') V(s')

Dual LP:

max_μ   Σ_{s,a} μ(s,a) R(s,a)

s.t.   ∀s:  c(s) − Σ_a μ(s,a) + Σ_{s',a} P(s',a,s) μ(s',a) = 0

Apprenticeship Learning as a QP:

min_μ   Σ_i ( μ_{E,i} − Σ_{s,a} μ(s,a) φ_i(s) )^2

s.t.   ∀s:  c(s) − Σ_a μ(s,a) + Σ_{s',a} P(s',a,s) μ(s',a) = 0


Different View (ctd.)

Our algorithm is equivalent to iteratively

– linearizing the QP at the current point (Inverse RL step), and

– solving the resulting LP (RL step).

Why not solve the QP directly? That is typically possible only for very small toy problems (curse of dimensionality). [Our algorithm makes use of existing RL solvers to deal with the curse of dimensionality.]


Slides that are different for poster

(The slides that follow are slightly different for the poster, but the same material already appeared earlier.)


Algorithm (QP version)

[Figures: successive iterations of the QP version in feature-expectation space (axes φ_1, φ_2), with utility U_w(π) = w^T μ(π): the feature expectations μ^(0), μ^(1), μ^(2), the reward directions w^(1), w^(2), w^(3), and the expert's μ_E.]

Gridworld Experiments


Case study: Highway driving

(Videos available.)

Input: Driving demonstration Output: Learned behavior


More driving examples

(Videos available.)

   

Car driving results (more detail)

                             Collision  Offroad Left  Left Lane  Middle Lane  Right Lane  Offroad Right
1  Feature distr. (expert)    0           0             0.1325     0.2033       0.5983      0.0658
   Feature distr. (learned)   5.00e-05    0.0004        0.0904     0.2286       0.6040      0.0764
   Weights (learned)         -0.0767     -0.0439        0.0077     0.0078       0.0318     -0.0035
2  Feature distr. (expert)    0.1167      0             0.0633     0.4667       0.4700      0
   Feature distr. (learned)   0.1332      0             0.1045     0.3196       0.5759      0
   Weights (learned)          0.2340     -0.1098        0.0092     0.0487       0.0576     -0.0056
3  Feature distr. (expert)    0           0             0          0.0033       0.7058      0.2908
   Feature distr. (learned)   0           0             0          0            0.7447      0.2554
   Weights (learned)         -0.1056     -0.0051       -0.0573    -0.0386       0.0929      0.0081
4  Feature distr. (expert)    0.06        0             0          0.0033       0.2908      0.7058
   Feature distr. (learned)   0.0569      0             0          0            0.2666      0.7334
   Weights (learned)          0.1079     -0.0001       -0.0487    -0.0666       0.0590      0.0564
5  Feature distr. (expert)    0.06        0             0          1            0           0
   Feature distr. (learned)   0.0542      0             0          1            0           0
   Weights (learned)          0.0094     -0.0108       -0.2765     0.8126      -0.5100     -0.0153


