
Page 1: Robot, Learning From Data

DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING, SEOUL NATIONAL UNIVERSITY

Robot, Learning From Data: Direct Policy Learning in RKHS

& Inverse Reinforcement Learning Methods

Presenter: Sungjoon Choi, Cyber-Physical Systems Laboratory (CPSLAB)

Seoul National University

Page 2: Robot, Learning From Data


https://canvas.northwestern.edu/courses/20122/assignments/syllabus

Page 3: Robot, Learning From Data


Page 4: Robot, Learning From Data


Contents

Learning from Demonstration

Direct Policy Learning

Reward Learning

Kernel Methods

Reproducing Kernel Hilbert Space

Learning Theory in RKHS

Inverse Reinforcement Learning Methods

Page 5: Robot, Learning From Data


Learning From Demonstration

Human Expert

http://villains.wikia.com/wiki/Chef_Skinner

http://www.filmspotting.net/forum/index.php?topic=12312.660

Learning from Demonstration

http://blogs.disney.com/oh-my-disney/2014/09/04/learn-to-love-cooking-with-ratatouille/

Execute in Unseen Environments

Page 6: Robot, Learning From Data


Learning From Demonstration

There are two approaches: direct policy learning and reward learning.

Direct policy learning

• Try to find a policy function which maps a state space to an action space: state space $\mathcal{S}$, action space $\mathcal{A}$, policy function $\pi: \mathcal{S} \rightarrow \mathcal{A}$.

• Cast the learning problem as a regression or multi-class classification problem.

• Standard learning theory or approximation theory is often used to analyze the performance of learning.

Reward learning

• Try to find a reward function indicating how 'good' each state-action pair is: joint state-action space $\mathcal{S} \times \mathcal{A}$, reward function $R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$.

• "The reward function, rather than the policy, is the most succinct, robust, and transferable definition of the task." [1]

• Often referred to as an inverse reinforcement learning (IRL) problem.

[1] Ng, Andrew Y., and Stuart J. Russell. "Algorithms for inverse reinforcement learning." ICML, 2000.

Page 7: Robot, Learning From Data


Direct Policy Learning

Page 8: Robot, Learning From Data


Direct Policy Learning in Reproducing Kernel Hilbert Space (RKHS)

State space $\mathcal{S}$, action space $\mathcal{A}$, policy function $\pi: \mathcal{S} \rightarrow \mathcal{A}$

• We will view this problem as a nonlinear regression problem.

• In particular, we will use a kernel-based regression method.

• RKHS refers to the reproducing kernel Hilbert space that contains our policy function, i.e., the hypothesis space.

• We will also focus on how well this function generalizes, that is, how well it estimates the outputs for previously unseen inputs, based on learning theory.

Page 9: Robot, Learning From Data


Definition and Existence of the Reproducing Kernel Hilbert Space

Existence of the RKHS is shown by the Moore-Aronszajn theorem.

The following is the definition of the reproducing kernel Hilbert space [1].

[1] Rasmussen, Carl Edward. "Gaussian processes for machine learning." (2006).

Page 10: Robot, Learning From Data


Reproducing Kernel Hilbert Space

Existence of the RKHS is shown by the Moore-Aronszajn theorem.

The following is the definition of the reproducing kernel Hilbert space [1].

[1] Rasmussen, Carl Edward. "Gaussian processes for machine learning." (2006).

What…?

Page 11: Robot, Learning From Data


Reproducing Kernel Hilbert Space – Approach 1

The first thing that might come to mind when you hear about kernel methods is mapping the input data into a feature space.

What Mercer's theorem says is that every (positive semi-definite) kernel function can be expressed as an inner product of infinite-dimensional (finite, for degenerate kernels) feature vectors built from its eigenfunctions.
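For reference, the Mercer expansion can be written as follows (a standard statement of the theorem, with notation assumed here rather than taken from the slide):

$$
k(x, x') \;=\; \sum_{i=1}^{\infty} \lambda_i\, \phi_i(x)\, \phi_i(x') \;=\; \big\langle \Phi(x), \Phi(x') \big\rangle,
\qquad
\Phi(x) = \big( \sqrt{\lambda_1}\,\phi_1(x),\; \sqrt{\lambda_2}\,\phi_2(x),\; \dots \big),
$$

where the $\lambda_i$ and $\phi_i$ are the eigenvalues and eigenfunctions of the integral operator induced by $k$.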

Page 12: Robot, Learning From Data


Reproducing Kernel Hilbert Space – Approach 1

Suppose we are given a kernel function $k(x, x')$.

From Mercer's theorem, we have an infinite number of basis (eigen-)functions: $k(x, x') = \sum_i \lambda_i \phi_i(x)\,\phi_i(x')$.

Then, let's think about the vector space spanned by the eigenfunctions: $f(x) = \sum_i f_i \phi_i(x)$ and $g(x) = \sum_i g_i \phi_i(x)$.

If we define the inner product of this space as $\langle f, g \rangle_{\mathcal{H}} = \sum_i f_i g_i / \lambda_i$, this space satisfies the reproducing property! In other words, this space spanned by the eigenfunctions (features) is an RKHS!!

Reproducing property: $\langle f, k(\cdot, x) \rangle_{\mathcal{H}} = \sum_i f_i \lambda_i \phi_i(x) / \lambda_i = f(x)$. Check! (By definition of the inner product.)

Page 13: Robot, Learning From Data


Reproducing Kernel Hilbert Space – Approach 1

Suppose we are given a kernel function $k(x, x')$.

From Mercer's theorem, we have an infinite number of basis (eigen-)functions: $k(x, x') = \sum_i \lambda_i \phi_i(x)\,\phi_i(x')$.

Then, let's think about the vector space spanned by the eigenfunctions: $f(x) = \sum_i f_i \phi_i(x)$ and $g(x) = \sum_i g_i \phi_i(x)$.

If we define the inner product of this space as $\langle f, g \rangle_{\mathcal{H}} = \sum_i f_i g_i / \lambda_i$, this space satisfies the reproducing property! In other words, this space spanned by the eigenfunctions (features) is an RKHS!!

Reproducing property: $\langle f, k(\cdot, x) \rangle_{\mathcal{H}} = \sum_i f_i \lambda_i \phi_i(x) / \lambda_i = f(x)$. Check! (By definition of the inner product.)

What…?

Page 14: Robot, Learning From Data


Reproducing Kernel Hilbert Space – Approach 2

The RKHS $\mathcal{H}$ has two properties: 1. For every $x$, $k(\cdot, x)$, as a function of its first argument, belongs to $\mathcal{H}$. 2. $k$ has the reproducing property.

Suppose our kernel function is $k(x, x')$.

Then, from the Moore-Aronszajn theorem, there exists an RKHS $\mathcal{H}$. But what does this mean? We want to define a space of functions whose elements have the following form:

$f(\cdot) = \sum_i \alpha_i k(z_i, \cdot)$ for some $\alpha_i \in \mathbb{R}$ and $z_i \in \mathcal{X}$.

Page 15: Robot, Learning From Data


Reproducing Kernel Hilbert Space – Approach 2

We want to define a space of functions whose elements have the following form:

$f(\cdot) = \sum_i \alpha_i k(z_i, \cdot)$ for some $\alpha_i \in \mathbb{R}$ and $z_i \in \mathcal{X}$.

The RKHS has two properties: 1. For every $x$, $k(\cdot, x)$, as a function of its first argument, belongs to $\mathcal{H}$. 2. $k$ has the reproducing property.

If we define the inner product of the Hilbert space as $\langle f, g \rangle_{\mathcal{H}} = \sum_i \sum_j \alpha_i \beta_j\, k(z_i, z_j)$ for $g(\cdot) = \sum_j \beta_j k(z_j, \cdot)$,

then this space satisfies the reproducing property:

$\langle f(\cdot), k(\cdot, x) \rangle_{\mathcal{H}} = \Big\langle \sum_i \alpha_i k(z_i, \cdot),\, k(\cdot, x) \Big\rangle_{\mathcal{H}} = \sum_i \alpha_i \langle k(z_i, \cdot), k(\cdot, x) \rangle_{\mathcal{H}} = \sum_i \alpha_i k(z_i, x) = f(x).$

Page 16: Robot, Learning From Data


Reproducing Kernel Hilbert Space – Approach 2

We want to define a space of functions whose elements have the following form:

$f(\cdot) = \sum_i \alpha_i k(z_i, \cdot)$ for some $\alpha_i \in \mathbb{R}$ and $z_i \in \mathcal{X}$.

If we define the inner product of the Hilbert space as $\langle f, g \rangle_{\mathcal{H}} = \sum_i \sum_j \alpha_i \beta_j\, k(z_i, z_j)$,

the space defined as above is a reproducing kernel Hilbert space.

As $\langle f, f \rangle_{\mathcal{H}}$ must be greater than or equal to zero, the kernel function should be positive semi-definite!

A norm is defined as follows: $\| f \|_{\mathcal{H}} = \sqrt{\langle f, f \rangle_{\mathcal{H}}}$.
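Restating this requirement in matrix form (notation matching the next slide; added here only as an illustrative aside, not taken from the slide):

$$
\langle f, f \rangle_{\mathcal{H}} \;=\; \sum_i \sum_j \alpha_i \alpha_j\, k(z_i, z_j) \;=\; \alpha^{\top} K_{ZZ}\, \alpha \;\ge\; 0 \quad \text{for all } \alpha,
$$

which is exactly the statement that the Gram matrix $K_{ZZ}$, and hence the kernel $k$, is positive semi-definite.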

Page 17: Robot, Learning From Data


Practical Usage

Empirical risk minimization with Hilbert-norm regularization:

$\min_{f \in \mathcal{H}} \; \sum_{i=1}^{N} \big( y_i - f(x_i) \big)^2 + \lambda \| f \|_{\mathcal{H}}^2$

If we set our hypothesis space to radial basis function networks, $f(x) = \sum_{j=1}^{M} \alpha_j k(z_j, x)$,

then the optimization becomes

$\min_{\alpha} \; \sum_{i=1}^{N} \Big( y_i - \sum_{j=1}^{M} \alpha_j k(z_j, x_i) \Big)^2 + \lambda \sum_{i=1}^{M} \sum_{j=1}^{M} \alpha_i \alpha_j\, k(z_i, z_j).$

If we rewrite it in matrix form,

$\min_{\alpha} \; \| Y - K_{XZ}\, \alpha \|_2^2 + \lambda\, \alpha^{\top} K_{ZZ}\, \alpha,$

which is a quadratic program with respect to the $M$-dimensional vector $\alpha$.
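For reference, a closed-form minimizer of this quadratic program (assuming the squared loss written above) follows by setting the gradient with respect to $\alpha$ to zero:

$$
\big( K_{ZX} K_{XZ} + \lambda K_{ZZ} \big)\, \alpha^{\star} = K_{ZX}\, Y
\quad\Longrightarrow\quad
\alpha^{\star} = \big( K_{ZX} K_{XZ} + \lambda K_{ZZ} \big)^{-1} K_{ZX}\, Y,
$$

where $K_{ZX} = K_{XZ}^{\top}$.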

Page 18: Robot, Learning From Data


Practical Usage

$\min_{\alpha} \; \| Y - K_{XZ}\, \alpha \|_2^2 + \lambda\, \alpha^{\top} K_{ZZ}\, \alpha$

If $Z$ is identical to $X$, then the above equation is identical to Gaussian process regression or kernel ridge regression. In practice, we can add additional constraints on $\alpha$, which can greatly improve stability!

This is often referred to as sparse Gaussian process regression with inducing points.
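Below is a minimal, illustrative Python sketch of this regression; the helper names (rbf_kernel, fit_rbf_network) and the toy data are assumptions introduced here, not taken from the slides. It solves the quadratic program above in closed form and reduces to kernel ridge regression when the inducing points Z equal the training inputs X.

```python
# A minimal sketch of:  minimize ||Y - K_XZ a||^2 + lam * a^T K_ZZ a,
# whose closed-form solution is  a = (K_ZX K_XZ + lam K_ZZ)^{-1} K_ZX Y.
import numpy as np

def rbf_kernel(A, B, gamma=10.0):
    """Squared-exponential kernel matrix between row-stacked point sets A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def fit_rbf_network(X, Y, Z, lam=1e-3):
    """X: (N, d) inputs, Y: (N,) targets, Z: (M, d) basis centers (inducing points)."""
    K_xz = rbf_kernel(X, Z)                      # (N, M)
    K_zz = rbf_kernel(Z, Z)                      # (M, M)
    # Normal equations of the regularized least-squares problem.
    return np.linalg.solve(K_xz.T @ K_xz + lam * K_zz, K_xz.T @ Y)

def predict(X_query, Z, alpha):
    """f(x) = sum_j alpha_j k(z_j, x)."""
    return rbf_kernel(X_query, Z) @ alpha

# Toy 1-D regression; when Z equals X this reduces to kernel ridge regression.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(50, 1))
Y = np.sin(2.0 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(50)
Z = np.linspace(0.0, 1.0, 10)[:, None]           # M = 10 inducing points
alpha = fit_rbf_network(X, Y, Z)
print(predict(np.array([[0.25], [0.75]]), Z, alpha))
```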

Page 19: Robot, Learning From Data


Learning Theory

Suppose our training data $\{(x_i, y_i)\}_{i=1}^{N}$ are sampled from a distribution $P(x, y)$ on $X \times Y$.

The expected risk of a function $f$ is defined as $I[f] = \int_{X \times Y} \big( y - f(x) \big)^2 P(x, y)\, dx\, dy$.

Then, the expected risk can be decomposed into

$I[f] = \int_{X \times Y} \big( y - f_\rho(x) \big)^2 P(x, y)\, dx\, dy + \int_{X} \big( f(x) - f_\rho(x) \big)^2 P(x)\, dx,$

where $f_\rho(x) = \int_Y y\, P(y \mid x)\, dy$ is the regression function.

Page 20: Robot, Learning From Data


Learning Theory

Expected Risk = Intrinsic Error + Estimation Error

$I[f] = \int_{X \times Y} \big( y - f_\rho(x) \big)^2 P(x, y)\, dx\, dy + \int_{X} \big( f(x) - f_\rho(x) \big)^2 P(x)\, dx$

Intrinsic error (approximation error): $\int_{X \times Y} \big( y - f_\rho(x) \big)^2 P(x, y)\, dx\, dy$. We cannot handle this.

Estimation error: $\int_{X} \big( f(x) - f_\rho(x) \big)^2 P(x)\, dx$. We can handle only part of this.

Page 21: Robot, Learning From Data


Learning Theory

The goal of learning theory is to minimize the following functional, which is often called the generalization error: $\int_{X} \big( f(x) - f_\rho(x) \big)^2 P(x)\, dx$.

[1] Niyogi, Partha, and Federico Girosi. "On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions." Neural Computation 8.4 (1996): 819-842.

If we use a radial basis function network, $f_n(x) = \sum_{j=1}^{n} \alpha_j k(z_j, x)$, we can achieve the following bound on the generalization error [1], which holds with probability at least $1 - \delta$.

Page 22: Robot, Learning From Data


Learning Theory

[1] Niyogi, Partha, and Federico Girosi. "On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions." Neural Computation 8.4 (1996): 819-842.

There are two sources of error:

1. Approximation error: we are trying to approximate an infinite-dimensional object, the regression function $f_\rho$, with a finite number of parameters.

2. Estimation error: we minimize the empirical risk and obtain an empirical-risk minimizer, rather than minimizing the expected risk.
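One standard way to write these two terms (following the framing of [1]; the symbols $f_n$ for the best function in the $n$-parameter hypothesis space and $\hat{f}_{n,N}$ for the empirical-risk minimizer from $N$ samples are assumed notation, not taken from the slide) is:

$$
\underbrace{I[\hat{f}_{n,N}] - I[f_\rho]}_{\text{generalization error}}
= \underbrace{\big( I[f_n] - I[f_\rho] \big)}_{\text{approximation error}}
+ \underbrace{\big( I[\hat{f}_{n,N}] - I[f_n] \big)}_{\text{estimation error}}.
$$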

Page 23: Robot, Learning From Data


Reward Learning

Page 24: Robot, Learning From Data


Reward Learning

Understanding the basic concepts of solving a Markov decision process (MDP) or reinforcement learning (RL) is crucial in a reward learning problem.

The goal of RL is to find a policy function which maximizes the expected sum of (discounted) rewards:

$\max_{\pi} \; \mathbb{E}_{\pi} \Big[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \Big]$

If we define the discounted state-action visitation measure $\mu(s, a) = \sum_{t=0}^{\infty} \gamma^t\, P(s_t = s,\, a_t = a)$, then the optimization becomes

$\max_{\mu \ge 0} \; \langle \mu, R \rangle \quad \text{subject to} \quad \sum_{a} \mu(s, a) = p_0(s) + \gamma \sum_{s', a'} P(s \mid s', a')\, \mu(s', a') \;\; \forall s,$

where the equality constraint is often called the Bellman flow constraint.
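As a concrete, illustrative sketch of this occupancy-measure view (the names P, R, p0 and the helper solve_mdp_lp are assumptions introduced here, not taken from the slides), the small Python example below solves a toy tabular MDP as a linear program: maximize ⟨μ, R⟩ subject to the Bellman flow constraint and μ ≥ 0.

```python
# Occupancy-measure LP for a tiny tabular MDP.
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, R, p0, gamma=0.9):
    """P: (S, A, S) transition tensor, R: (S, A) rewards, p0: (S,) initial distribution."""
    S, A = R.shape
    c = -R.reshape(-1)                            # linprog minimizes, so negate rewards
    A_eq = np.zeros((S, S * A))
    for s in range(S):                            # one flow constraint per state s
        for sp in range(S):
            for a in range(A):
                j = sp * A + a                    # column index of mu(sp, a)
                A_eq[s, j] = (1.0 if sp == s else 0.0) - gamma * P[sp, a, s]
    res = linprog(c, A_eq=A_eq, b_eq=p0, bounds=[(0, None)] * (S * A))
    mu = res.x.reshape(S, A)
    policy = mu.argmax(axis=1)                    # greedy policy w.r.t. visitation measure
    return mu, policy

# Toy 2-state, 2-action MDP.
P = np.zeros((2, 2, 2))
P[0, 0] = [0.9, 0.1]; P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.8, 0.2]; P[1, 1] = [0.2, 0.8]
R = np.array([[0.0, 1.0],
              [0.0, 2.0]])
mu, policy = solve_mdp_lp(P, R, p0=np.array([0.5, 0.5]))
print(mu, policy)
```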

Page 25: Robot, Learning From Data


Inverse Reinforcement Learning

(Diagram) Solving MDP: true reward → true density → state-action trajectories.

(Diagram) Solving IRL: state-action trajectories → $\max_{R \in \mathcal{H}(R)} \langle \mu, R \rangle$ → estimated reward.

http://users.eecs.northwestern.edu/~argall/learning.html

Page 26: Robot, Learning From Data


Inverse Reinforcement Learning Methods

NR [1], MMP [2], AN [3], MaxEnt [4], GPIRL [5], BIRL [6], RelEnt [7], StructIRL [8], DeepIRL [9]

[1] Ng, Andrew Y., and Stuart J. Russell. "Algorithms for inverse reinforcement learning." ICML, 2000.

[2] Ratliff, Nathan D., J. Andrew Bagnell, and Martin A. Zinkevich. "Maximum margin planning." ICML, 2006.

[3] Abbeel, Pieter, and Andrew Y. Ng. "Apprenticeship learning via inverse reinforcement learning." ICML, 2004.

[4] Ziebart, Brian D., Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. "Maximum entropy inverse reinforcement learning." AAAI, 2008.

[5] Levine, Sergey, Zoran Popovic, and Vladlen Koltun. "Nonlinear inverse reinforcement learning with Gaussian processes." NIPS, 2011.

[6] Ramachandran, Deepak, and Eyal Amir. "Bayesian inverse reinforcement learning." AAAI, 2007.

[7] Boularias, Abdeslam, Jens Kober, and Jan R. Peters. "Relative entropy inverse reinforcement learning." AISTATS, 2011.

[8] Klein, Edouard, Matthieu Geist, Bilal Piot, and Olivier Pietquin. "Inverse reinforcement learning through structured classification." NIPS, 2012.

[9] Wulfmeier, Markus, Peter Ondruska, and Ingmar Posner. "Deep inverse reinforcement learning." arXiv, 2015.

Page 27: Robot, Learning From Data


Inverse Reinforcement Learning Methods

Objective

NR [1]: Maximize the discrepancy between the expert's and the sampled values.

MMP [2]: Maximize the margin between the expert's demonstrations and all other state-actions.

AN [3]: Minimize the value difference between the expert's and the sampled ones.

StructIRL [8]: Cast IRL as a multi-class classification problem.

Page 28: Robot, Learning From Data


Inverse Reinforcement Learning Methods

Objective

MaxEnt [4]: Define a likelihood of state-action trajectories and use MLE.

BIRL [6]: Define a posterior over the reward given state-action trajectories and use MH sampling.

RelEnt [7]: Minimize the relative entropy between the expert's and the learner's distributions.

GPIRL [5]: Define a likelihood using a sparse Gaussian process (SGP) and use a gradient ascent method.

DeepIRL [9]: Model the likelihood with neural networks.

Page 29: Robot, Learning From Data


Conclusion

I believe selecting a proper machine learning algorithm is more than selecting a chocolate from a chocolate box.

"Deep Learning is brute force learning. It is not intelligent learning." "Machine learning is not only about machines, but also about humans."

- Vladimir Vapnik @ NIPS15

http://www.forbes.com/forbes/welcome/
http://aboutintelligence.blogspot.kr/2009/01/vapniks-picture-explained.html

Robotics deals with humans!

Page 30: Robot, Learning From Data


Thank you for your attention!! Any questions?

[email protected]