
Reinforcement Learning

Hanxiao Liu, Carnegie Mellon University, hanxiaol@cs.cmu.edu

September 20, 2016

Based on David Silver's lectures on RL.


Outline

Introduction

Markov Decision Process

Model-Free Prediction

Model-Free Control

Function Approximation


Examples

1. Helicopter Control

2. Atari Games

3. Learning Simple Algorithms


Basic Setups

History: Agent’s experience

$H_t = O_1, R_1, A_1, O_2, R_2, A_2, \ldots, A_{t-1}, O_t, R_t$  (1)

State: A summary of the history

$S_t = f(H_t)$  (2)

Markov Property

$\Pr[S_{t+1} \mid S_t] = \Pr[S_{t+1} \mid S_1, \ldots, S_t]$  (3)


Basic Setups

Key components of an RL agent

• Policy: behavior of the agent

$a = \pi(s)$  (4)

$\pi(a \mid s) = \Pr[A_t = a \mid S_t = s]$  (5)

• Value function: a prediction of future reward

$v_\pi(s) = \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s \right]$  (6)

• Model: the agent's representation of the environment.
  - We will primarily focus on model-free RL.


Basic Setups

Fundamental sequential decision-making problems:

• Planning: fully observed environment.

• RL: the environment is initially unknown.
  - Exploration and exploitation.

Tasks

• Prediction: evaluate vπ(s) given π.

• Control: optimize v∗(s) by refining π.


Outline

Introduction

Markov Decision Process

Model-Free Prediction

Model-Free Control

Function Approximation


Markov Decision Process


Markov Decision Process

An MDP M is defined by the tuple (S, A, P, R, γ):

$\mathcal{P}^a_{ss'} = \Pr[S_{t+1} = s' \mid S_t = s, A_t = a]$  (7)

$\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$  (8)

• Here we assume the environment is fully observed: P and R are known.

• For any fixed policy π, (S, Pπ) defines a Markov process, with

$\mathcal{P}^{\pi}_{ss'} = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{P}^a_{ss'}$  (9)
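As a concrete illustration (not part of the original slides), a small finite MDP and the induced process of eq. (9) can be stored as plain arrays; the minimal sketch below uses made-up numbers and array names chosen only for demonstration.

```python
import numpy as np

# A tiny finite MDP stored as dense arrays (all numbers are illustrative).
# P[s, a, s'] = Pr[S_{t+1} = s' | S_t = s, A_t = a],  R[s, a] = E[R_{t+1} | S_t = s, A_t = a]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.0, 1.0]]])   # shape (S, A, S); each P[s, a] sums to 1
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])                 # shape (S, A)
gamma = 0.9

# A stochastic policy pi[s, a] = pi(a|s) induces a Markov (reward) process, eq. (9):
pi = np.full((2, 2), 0.5)                  # uniform random policy
P_pi = np.einsum('sa,sat->st', pi, P)      # P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'}
R_pi = (pi * R).sum(axis=1)                # R^pi_s   = sum_a pi(a|s) R^a_s
```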


Value Functions

Goal: refine π to maximize the future return

$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$  (10)

Value functions

• State-value function

$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$  (11)

• Action-value function

$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$  (12)


Bellman Expectation Equation

Q: How to evaluate any given π (obtain vπ)?

Value = immediate reward + discounted successor value

$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$  (13)

$q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$  (14)

$v_\pi(s) = \sum_a \pi(a \mid s)\, \underbrace{\Big[ \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'}\, v_\pi(s') \Big]}_{q_\pi(s, a)}$  (15)

⇒ vπ can be obtained by solving a linear system.
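Written in matrix form, eq. (15) reads vπ = Rπ + γ Pπ vπ, so for a finite MDP vπ solves (I − γ Pπ) vπ = Rπ. A minimal sketch of this solve (array names and toy numbers are my own, matching the earlier MDP sketch):

```python
import numpy as np

# Policy evaluation by solving the Bellman expectation equation (15) as a
# linear system: v_pi = R_pi + gamma * P_pi v_pi  =>  (I - gamma * P_pi) v_pi = R_pi.
# P_pi, R_pi are the policy-averaged transition matrix / reward vector (toy values).
gamma = 0.9
P_pi = np.array([[0.55, 0.45],
                 [0.35, 0.65]])            # P^pi_{ss'}; rows sum to 1
R_pi = np.array([0.5, 1.25])               # expected one-step reward under pi

v_pi = np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)
print(v_pi)                                # exact v_pi for this toy problem
```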


Bellman Optimality Equation

Q: How do we know whether π is already optimal?

Recall for any fixed π:

$v_\pi(s) = \sum_a \pi(a \mid s) \Big[ \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'}\, v_\pi(s') \Big]$  (16)

For the optimal π∗:

$v_{\pi^*}(s) = \max_a \Big[ \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'}\, v_{\pi^*}(s') \Big]$  (17)

No closed-form solution, but iterative solvers are available.
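One standard iterative solver is value iteration, which repeatedly applies the max-backup of (17) until the values stop changing. The sketch below is illustrative only, using a made-up two-state MDP:

```python
import numpy as np

# Value iteration: v(s) <- max_a [ R(s, a) + gamma * sum_s' P(s, a, s') v(s') ].
# P, R, gamma describe an arbitrary finite MDP (illustrative values).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.0, 1.0]]])   # P[s, a, s']
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])                 # R[s, a]
gamma, tol = 0.9, 1e-8

v = np.zeros(P.shape[0])
while True:
    q = R + gamma * P @ v                  # q[s, a]; P @ v sums over s'
    v_new = q.max(axis=1)
    if np.max(np.abs(v_new - v)) < tol:    # stop once the backup is (nearly) a fixed point
        break
    v = v_new
greedy_policy = q.argmax(axis=1)           # pi* is implied by v*
```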


Control

Q: How to improve π?

Policy iteration

(a) Policy evaluation: compute vπ(s) given π.

(b) Policy improvement: get an improved policy π′ = greedy(vπ)
  - vπ(s) ⇒ qπ(s, a)
  - π′(s) = argmax_a qπ(s, a)
  - Theorem: π′ ≥ π.

Alternative approach: Value iteration

(a) Obtain v∗(s) by solving the Bellman optimality equation.

(b) π∗ is implied by v∗.
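A minimal sketch of the policy iteration loop above, alternating exact evaluation (the linear solve) with greedy improvement; the two-state MDP and all numbers are made up for illustration:

```python
import numpy as np

# Policy iteration on a small finite MDP (illustrative P, R, gamma).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.0, 1.0]]])   # P[s, a, s']
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])                 # R[s, a]
gamma = 0.9
n_states = P.shape[0]

policy = np.zeros(n_states, dtype=int)     # deterministic policy: s -> a
while True:
    # (a) Policy evaluation: solve (I - gamma * P_pi) v = R_pi exactly.
    P_pi = P[np.arange(n_states), policy]
    R_pi = R[np.arange(n_states), policy]
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    # (b) Policy improvement: act greedily w.r.t. q_pi(s, a).
    q = R + gamma * P @ v
    new_policy = q.argmax(axis=1)
    if np.array_equal(new_policy, policy): # stable policy => optimal
        break
    policy = new_policy
```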


Outline

Introduction

Markov Decision Process

Model-Free Prediction

Model-Free Control

Function Approximation


Model-Free Prediction

So far we’ve been assuming M is fully observed.

• We are given P and R, so computing vπ(s) = Eπ[Gt | St = s] is easy.

Can vπ(s) still be estimated if M is partially observed?

• Taking the full expectation is no longer possible.

• However, we can sample from the environment.


Sampling Approaches

Monte Carlo (MC)

1. Estimate Gt by sampling from M following π.
   - We are sampling from a Markov (reward) process.

2. $v_\pi(S_t) \leftarrow v_\pi(S_t) + \alpha\,(G_t - v_\pi(S_t))$

Temporal Difference (TD)

$V(S_t) \leftarrow V(S_t) + \alpha\,(R_{t+1} + \gamma V(S_{t+1}) - V(S_t))$  (18)

Advantages of TD

• Lower variance.

• More efficient (by exploiting the Markov property).

• Handles incomplete sequences.
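As a concrete illustration of the TD update (18), here is a minimal tabular TD(0) prediction sketch. The Gym-style env interface (reset/step) and the fixed policy `pi` are assumptions for illustration, not part of the slides:

```python
import numpy as np

def td0_prediction(env, pi, n_states, episodes=1000, alpha=0.1, gamma=0.99):
    """Estimate v_pi with TD(0): V(S_t) += alpha * (R_{t+1} + gamma*V(S_{t+1}) - V(S_t)).

    Assumes a Gym-style env (reset() -> s, step(a) -> (s', r, done, info))
    with integer states, and a fixed policy pi mapping a state to an action.
    """
    V = np.zeros(n_states)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done, _ = env.step(pi(s))
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])      # TD(0) update, eq. (18)
            s = s_next
    return V
```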


A Unified View

(Figure not reproduced here; see http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MC-TD.pdf)


Outline

Introduction

Markov Decision Process

Model-Free Prediction

Model-Free Control

Function Approximation


Model-Free Control

Policy Iteration

1. Policy evaluation: compute Qπ(s, a) given π.
   - using TD or MC.

2. Policy refinement: π′ = ε-greedy(Qπ)

$\pi(a \mid s) = \begin{cases} \dfrac{\varepsilon}{m} + 1 - \varepsilon & a = \arg\max_{a'} Q(s, a') \\[4pt] \dfrac{\varepsilon}{m} & \text{otherwise} \end{cases}$   (with m = |A|)

• Pure greedy is a bad idea: no exploration.
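A minimal sketch of this generalized policy iteration with TD-based evaluation (i.e., SARSA-style updates) and the ε-greedy policy above. The Gym-style env interface is an assumption made for illustration:

```python
import numpy as np

def epsilon_greedy(Q, s, eps, rng):
    """Pick argmax_a Q(s, a) with prob. 1 - eps + eps/m, otherwise uniform."""
    m = Q.shape[1]
    if rng.random() < eps:
        return int(rng.integers(m))
    return int(np.argmax(Q[s]))

def sarsa(env, n_states, n_actions, episodes=5000, alpha=0.1, gamma=0.99, eps=0.1):
    """On-policy TD control: evaluate Q with TD while improving eps-greedily."""
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, eps, rng)
        done = False
        while not done:
            s2, r, done, _ = env.step(a)
            a2 = epsilon_greedy(Q, s2, eps, rng)
            target = r + (0.0 if done else gamma * Q[s2, a2])
            Q[s, a] += alpha * (target - Q[s, a])   # TD update on Q
            s, a = s2, a2
    return Q
```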


Outline

Introduction

Markov Decision Process

Model-Free Prediction

Model-Free Control

Function Approximation


Large-scale RL

Some real-world problems:

• Go: 10^170 states

• Robot control: continuous (infinite) action space

Storing v(s) and/or q(s, a) in a table becomes intractable.

Solution: Function approximation

$\hat{v}(s, w) \approx v_\pi(s)$  (19)

$\hat{q}(s, a, w) \approx q_\pi(s, a)$  (20)

Ideally v̂ and q̂ are both differentiable and expressive:

• e.g., deep neural networks


Optimization

Recall in TD:

$\Delta v = \alpha\,(R_{t+1} + \gamma v(S_{t+1}) - v(S_t))$  (21)

With function approximation:

$\Delta w = \alpha\,(R_{t+1} + \gamma \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w))\, \nabla_w \hat{v}(S_t, w)$  (22)

≈ supervised learning using stochastic gradient descent

• Experience replay (in DQN): cache and reuse historical training examples to refine w.
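A minimal sketch of the semi-gradient update (22) with a linear approximator v̂(s, w) = w·φ(s); the feature map `phi` and the Gym-style env interface are assumptions for illustration:

```python
import numpy as np

def semi_gradient_td0(env, pi, phi, n_features, episodes=1000, alpha=0.01, gamma=0.99):
    """Approximate v_pi(s) ~= w . phi(s), updating w by eq. (22):
    w += alpha * (R + gamma * v_hat(s', w) - v_hat(s, w)) * grad_w v_hat(s, w).
    For a linear v_hat, grad_w v_hat(s, w) = phi(s).
    Assumes a Gym-style env, a fixed policy pi, and a feature map phi(s) -> ndarray."""
    w = np.zeros(n_features)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            s2, r, done, _ = env.step(pi(s))
            v_s = w @ phi(s)
            v_s2 = 0.0 if done else w @ phi(s2)
            w += alpha * (r + gamma * v_s2 - v_s) * phi(s)   # semi-gradient step
            s = s2
    return w
```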


Policy-based RL

Alternatively, we can parameterize π instead of vπ (or qπ)

$\pi_\theta(s, a) = \Pr[a \mid s, \theta]$  (23)

Then optimize θ w.r.t. some objective J(θ).

Advantage: no need to carry out argmax_a q(s, a)

• More efficient for high-dimensional or continuous A.


Policy Gradient

Consider a one-step MDP starting from s ∼ d(s)

$J(\theta) := \mathbb{E}_{\pi_\theta}[r] = \sum_s d(s) \sum_a \pi_\theta(s, a)\, \mathcal{R}^a_s$  (24)

$\nabla_\theta J(\theta) = \sum_s d(s) \sum_a \pi_\theta(s, a)\, \nabla_\theta \log \pi_\theta(s, a)\, \mathcal{R}^a_s$  (25)

  $= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\, r]$  (26)

The above gives us access to stochastic estimates of the policy gradient without knowing the environment.
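For concreteness, here is a minimal sketch of this estimator in the one-step case with a softmax-in-features policy. The feature map `phi_sa`, the `reward_fn` sampler, and all names are assumptions made for illustration:

```python
import numpy as np

def softmax_policy(theta, phi_sa, s, n_actions):
    """pi_theta(s, a) proportional to exp(theta . phi(s, a))."""
    logits = np.array([theta @ phi_sa(s, a) for a in range(n_actions)])
    logits -= logits.max()                      # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def one_step_pg_update(theta, phi_sa, s, n_actions, reward_fn, alpha, rng):
    """One stochastic policy-gradient step: theta += alpha * grad log pi(s, a) * r."""
    p = softmax_policy(theta, phi_sa, s, n_actions)
    a = rng.choice(n_actions, p=p)              # sample a ~ pi_theta(.|s)
    r = reward_fn(s, a)                         # sampled reward for (s, a)
    # For a linear softmax: grad_theta log pi(s, a) = phi(s, a) - sum_b pi(s, b) phi(s, b).
    grad_log_pi = phi_sa(s, a) - sum(p[b] * phi_sa(s, b) for b in range(n_actions))
    return theta + alpha * grad_log_pi * r
```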


Actor-Critic Models

Policy gradient in more general settings:

$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\, q_{\pi_\theta}(s, a)]$  (27)

qπθ(s, a) can be approximated by q̂(s, a, w).

• The critic updates w.

• The actor updates θ based on the critic's suggestion.
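A minimal one-step actor-critic sketch along these lines, with a linear critic q̂(s, a, w) = w·φ(s, a) and a softmax actor; the feature map `phi_sa` and the Gym-style env interface are assumptions for illustration:

```python
import numpy as np

def actor_critic(env, phi_sa, n_features, n_actions,
                 episodes=1000, alpha_w=0.05, alpha_theta=0.01, gamma=0.99):
    """One-step actor-critic: the critic does TD updates on w for q_hat,
    the actor moves theta along grad log pi_theta(s, a) * q_hat(s, a, w)."""
    rng = np.random.default_rng(0)
    w = np.zeros(n_features)
    theta = np.zeros(n_features)

    def pi(s):
        logits = np.array([theta @ phi_sa(s, a) for a in range(n_actions)])
        p = np.exp(logits - logits.max())
        return p / p.sum()

    for _ in range(episodes):
        s = env.reset()
        p = pi(s)
        a = rng.choice(n_actions, p=p)
        done = False
        while not done:
            s2, r, done, _ = env.step(a)
            p2 = pi(s2)
            a2 = rng.choice(n_actions, p=p2)
            q_sa = w @ phi_sa(s, a)
            q_next = 0.0 if done else w @ phi_sa(s2, a2)
            td_error = r + gamma * q_next - q_sa
            # Critic update (TD on q_hat), then actor update along eq. (27).
            w += alpha_w * td_error * phi_sa(s, a)
            grad_log_pi = phi_sa(s, a) - sum(p[b] * phi_sa(s, b) for b in range(n_actions))
            theta += alpha_theta * grad_log_pi * q_sa
            s, a, p = s2, a2, p2
    return theta, w
```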


Actor-Critic Models

Similar ideas are used in AlphaGo.


The End

Other interesting topics

• Convergence.

• Exploration vs. exploitation.

• Credit assignment.

• Off-policy RL.
