
Reinforcement Learning

Hanxiao Liu, Carnegie Mellon University, hanxiaol@cs.cmu.edu

September 20, 2016

Based on David Silver's lectures on RL.


Outline

Introduction

Markov Decision Process

Model-Free Prediction

Model-Free Control

Function Approximation


Examples

1. Helicopter Control

2. Atari Games

3. Learning Simple Algorithms


Basic Setups

History: Agent’s experience

$H_t = O_1, R_1, A_1, O_2, R_2, A_2, \ldots, A_{t-1}, O_t, R_t$  (1)

State: A summary of the history

$S_t = f(H_t)$  (2)

Markov Property

$\Pr[S_{t+1} \mid S_t] = \Pr[S_{t+1} \mid S_1, \ldots, S_t]$  (3)


Basic Setups

Key components of an RL agent

• Policy: behavior of the agent

$a = \pi(s)$  (4)

$\pi(a \mid s) = \Pr[A_t = a \mid S_t = s]$  (5)

• Value function: a prediction of future reward

$v_\pi(s) = \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s \right]$  (6)

• Model: the agent's representation of the environment.
  - We will primarily focus on model-free RL.


Basic Setups

Fundamental sequential decision-making problems:

• Planning: fully observed environment.

• RL: the environment is initially unknown.
  - Exploration and exploitation.

Tasks

• Prediction: evaluate vπ(s) given π.

• Control: optimize v∗(s) by refining π.


Outline

Introduction

Markov Decision Process

Model-Free Prediction

Model-Free Control

Function Approximation


Markov Decision Process


Markov Decision Process

An MDP M is defined by the tuple (S, A, P, R, γ):

$\mathcal{P}^a_{ss'} = \Pr[S_{t+1} = s' \mid S_t = s, A_t = a]$  (7)

$\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$  (8)

• Here we assume the environment is fully observed: P and R are known.

• For any fixed policy π, (S, Pπ) defines a Markov process, with

$\mathcal{P}^{\pi}_{ss'} = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{P}^a_{ss'}$  (9)
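As a concrete illustration (not part of the original slides), a small finite MDP and the induced process of eq. (9) can be stored as plain arrays; the minimal sketch below uses made-up numbers and array names chosen only for demonstration.

```python
import numpy as np

# A tiny finite MDP stored as dense arrays (all numbers are illustrative).
# P[s, a, s'] = Pr[S_{t+1} = s' | S_t = s, A_t = a],  R[s, a] = E[R_{t+1} | S_t = s, A_t = a]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.0, 1.0]]])   # shape (S, A, S); each P[s, a] sums to 1
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])                 # shape (S, A)
gamma = 0.9

# A stochastic policy pi[s, a] = pi(a|s) induces a Markov (reward) process, eq. (9):
pi = np.full((2, 2), 0.5)                  # uniform random policy
P_pi = np.einsum('sa,sat->st', pi, P)      # P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'}
R_pi = (pi * R).sum(axis=1)                # R^pi_s   = sum_a pi(a|s) R^a_s
```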


Value Functions

Goal: refine π to maximize the future return

$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$  (10)

Value functions

• State-value function

$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$  (11)

• Action-value function

$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$  (12)


Bellman Expectation Equation

Q: How to evaluate any given π (obtain vπ)?

Value = immediate reward + discounted successor value

$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$  (13)

$q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$  (14)

$v_\pi(s) = \sum_a \pi(a \mid s)\, \underbrace{\Big[ \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'}\, v_\pi(s') \Big]}_{q_\pi(s, a)}$  (15)

⇒ vπ can be obtained by solving a linear system.
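Written in matrix form, eq. (15) reads vπ = Rπ + γ Pπ vπ, so for a finite MDP vπ solves (I − γ Pπ) vπ = Rπ. A minimal sketch of this solve (array names and toy numbers are my own, matching the earlier MDP sketch):

```python
import numpy as np

# Policy evaluation by solving the Bellman expectation equation (15) as a
# linear system: v_pi = R_pi + gamma * P_pi v_pi  =>  (I - gamma * P_pi) v_pi = R_pi.
# P_pi, R_pi are the policy-averaged transition matrix / reward vector (toy values).
gamma = 0.9
P_pi = np.array([[0.55, 0.45],
                 [0.35, 0.65]])            # P^pi_{ss'}; rows sum to 1
R_pi = np.array([0.5, 1.25])               # expected one-step reward under pi

v_pi = np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)
print(v_pi)                                # exact v_pi for this toy problem
```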


Bellman Optimality Equation

Q: How do we know whether π is already optimal?

Recall for any fixed π:

$v_\pi(s) = \sum_a \pi(a \mid s) \Big[ \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'}\, v_\pi(s') \Big]$  (16)

For the optimal π∗:

$v_{\pi^*}(s) = \max_a \Big[ \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'}\, v_{\pi^*}(s') \Big]$  (17)

No closed-form solution, but iterative solvers are available.
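One standard iterative solver is value iteration, which repeatedly applies the max-backup of (17) until the values stop changing. The sketch below is illustrative only, using a made-up two-state MDP:

```python
import numpy as np

# Value iteration: v(s) <- max_a [ R(s, a) + gamma * sum_s' P(s, a, s') v(s') ].
# P, R, gamma describe an arbitrary finite MDP (illustrative values).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.0, 1.0]]])   # P[s, a, s']
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])                 # R[s, a]
gamma, tol = 0.9, 1e-8

v = np.zeros(P.shape[0])
while True:
    q = R + gamma * P @ v                  # q[s, a]; P @ v sums over s'
    v_new = q.max(axis=1)
    if np.max(np.abs(v_new - v)) < tol:    # stop once the backup is (nearly) a fixed point
        break
    v = v_new
greedy_policy = q.argmax(axis=1)           # pi* is implied by v*
```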


Control

Q: How to improve π?

Policy iteration

(a) Policy evaluation: compute vπ(s) given π.

(b) Policy improvement: get an improved policy π′ = greedy(vπ)
  - vπ(s) ⇒ qπ(s, a)
  - π′(s) = argmax_a qπ(s, a)
  - Theorem: π′ ≥ π.

Alternative approach: Value iteration

(a) Obtain v∗(s) by solving the Bellman optimality equation.

(b) π∗ is implied by v∗.
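A minimal sketch of the policy iteration loop above, alternating exact evaluation (the linear solve) with greedy improvement; the two-state MDP and all numbers are made up for illustration:

```python
import numpy as np

# Policy iteration on a small finite MDP (illustrative P, R, gamma).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.0, 1.0]]])   # P[s, a, s']
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])                 # R[s, a]
gamma = 0.9
n_states = P.shape[0]

policy = np.zeros(n_states, dtype=int)     # deterministic policy: s -> a
while True:
    # (a) Policy evaluation: solve (I - gamma * P_pi) v = R_pi exactly.
    P_pi = P[np.arange(n_states), policy]
    R_pi = R[np.arange(n_states), policy]
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    # (b) Policy improvement: act greedily w.r.t. q_pi(s, a).
    q = R + gamma * P @ v
    new_policy = q.argmax(axis=1)
    if np.array_equal(new_policy, policy): # stable policy => optimal
        break
    policy = new_policy
```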


Outline

Introduction

Markov Decision Process

Model-Free Prediction

Model-Free Control

Function Approximation


Model-Free Prediction

So far we’ve been assuming M is fully observed.

• We are given P and R, so computing vπ(s) = Eπ[Gt | St = s] is easy.

Can vπ(s) still be estimated if M is partially observed?

• Taking the full expectation is no longer possible.

• However, we can sample from the environment.


Sampling Approaches

Monte Carlo (MC)

1. Estimate Gt by sampling from M following π.
   - We are sampling from a Markov (reward) process.

2. $v_\pi(S_t) \leftarrow v_\pi(S_t) + \alpha\,(G_t - v_\pi(S_t))$

Temporal Difference (TD)

$V(S_t) \leftarrow V(S_t) + \alpha\,(R_{t+1} + \gamma V(S_{t+1}) - V(S_t))$  (18)

Advantages of TD

• Lower variance.

• More efficient (by exploiting the Markov property).

• Handles incomplete sequences.
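As a concrete illustration of the TD update (18), here is a minimal tabular TD(0) prediction sketch. The Gym-style env interface (reset/step) and the fixed policy `pi` are assumptions for illustration, not part of the slides:

```python
import numpy as np

def td0_prediction(env, pi, n_states, episodes=1000, alpha=0.1, gamma=0.99):
    """Estimate v_pi with TD(0): V(S_t) += alpha * (R_{t+1} + gamma*V(S_{t+1}) - V(S_t)).

    Assumes a Gym-style env (reset() -> s, step(a) -> (s', r, done, info))
    with integer states, and a fixed policy pi mapping a state to an action.
    """
    V = np.zeros(n_states)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done, _ = env.step(pi(s))
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])      # TD(0) update, eq. (18)
            s = s_next
    return V
```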


A Unified View

(Figure not reproduced here; see http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MC-TD.pdf)


Outline

Introduction

Markov Decision Process

Model-Free Prediction

Model-Free Control

Function Approximation


Model-Free Control

Policy Iteration

1. Policy evaluation: compute Qπ(s, a) given π.
   - using TD or MC.

2. Policy refinement: π′ = ε-greedy(Qπ)

$\pi(a \mid s) = \begin{cases} \dfrac{\varepsilon}{m} + 1 - \varepsilon & a = \arg\max_{a'} Q(s, a') \\[4pt] \dfrac{\varepsilon}{m} & \text{otherwise} \end{cases}$   (with m = |A|)

• Pure greedy is a bad idea: no exploration.
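A minimal sketch of this generalized policy iteration with TD-based evaluation (i.e., SARSA-style updates) and the ε-greedy policy above. The Gym-style env interface is an assumption made for illustration:

```python
import numpy as np

def epsilon_greedy(Q, s, eps, rng):
    """Pick argmax_a Q(s, a) with prob. 1 - eps + eps/m, otherwise uniform."""
    m = Q.shape[1]
    if rng.random() < eps:
        return int(rng.integers(m))
    return int(np.argmax(Q[s]))

def sarsa(env, n_states, n_actions, episodes=5000, alpha=0.1, gamma=0.99, eps=0.1):
    """On-policy TD control: evaluate Q with TD while improving eps-greedily."""
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, eps, rng)
        done = False
        while not done:
            s2, r, done, _ = env.step(a)
            a2 = epsilon_greedy(Q, s2, eps, rng)
            target = r + (0.0 if done else gamma * Q[s2, a2])
            Q[s, a] += alpha * (target - Q[s, a])   # TD update on Q
            s, a = s2, a2
    return Q
```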


Outline

Introduction

Markov Decision Process

Model-Free Prediction

Model-Free Control

Function Approximation


Large-scale RL

Some real-world problems:

• Go: 10^170 states

• Robot control: continuous (infinite) action space

Storing v(s) and/or q(s, a) in a table becomes intractable.

Solution: Function approximation

$\hat{v}(s, w) \approx v_\pi(s)$  (19)

$\hat{q}(s, a, w) \approx q_\pi(s, a)$  (20)

Ideally v̂ and q̂ are both differentiable and expressive:

• e.g., deep neural networks


Optimization

Recall in TD:

$\Delta v = \alpha\,(R_{t+1} + \gamma v(S_{t+1}) - v(S_t))$  (21)

With function approximation:

$\Delta w = \alpha\,(R_{t+1} + \gamma \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w))\, \nabla_w \hat{v}(S_t, w)$  (22)

≈ supervised learning using stochastic gradient descent

• Experience replay (in DQN): cache and reuse historical training examples to refine w.
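A minimal sketch of the semi-gradient update (22) with a linear approximator v̂(s, w) = w·φ(s); the feature map `phi` and the Gym-style env interface are assumptions for illustration:

```python
import numpy as np

def semi_gradient_td0(env, pi, phi, n_features, episodes=1000, alpha=0.01, gamma=0.99):
    """Approximate v_pi(s) ~= w . phi(s), updating w by eq. (22):
    w += alpha * (R + gamma * v_hat(s', w) - v_hat(s, w)) * grad_w v_hat(s, w).
    For a linear v_hat, grad_w v_hat(s, w) = phi(s).
    Assumes a Gym-style env, a fixed policy pi, and a feature map phi(s) -> ndarray."""
    w = np.zeros(n_features)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            s2, r, done, _ = env.step(pi(s))
            v_s = w @ phi(s)
            v_s2 = 0.0 if done else w @ phi(s2)
            w += alpha * (r + gamma * v_s2 - v_s) * phi(s)   # semi-gradient step
            s = s2
    return w
```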


Policy-based RL

Alternatively, we can parameterize π instead of vπ (or qπ)

$\pi_\theta(s, a) = \Pr[a \mid s, \theta]$  (23)

Then optimize θ w.r.t. some objective J(θ).

Advantage: no need to carry out argmax_a q(s, a)

• More efficient for high-dimensional or continuous A.


Policy Gradient

Consider a one-step MDP starting from s ∼ d(s)

$J(\theta) := \mathbb{E}_{\pi_\theta}[r] = \sum_s d(s) \sum_a \pi_\theta(s, a)\, \mathcal{R}^a_s$  (24)

$\nabla_\theta J(\theta) = \sum_s d(s) \sum_a \pi_\theta(s, a)\, \nabla_\theta \log \pi_\theta(s, a)\, \mathcal{R}^a_s$  (25)

  $= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\, r]$  (26)

The above gives us access to stochastic estimates of the policy gradient without knowing the environment.
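For concreteness, here is a minimal sketch of this estimator in the one-step case with a softmax-in-features policy. The feature map `phi_sa`, the `reward_fn` sampler, and all names are assumptions made for illustration:

```python
import numpy as np

def softmax_policy(theta, phi_sa, s, n_actions):
    """pi_theta(s, a) proportional to exp(theta . phi(s, a))."""
    logits = np.array([theta @ phi_sa(s, a) for a in range(n_actions)])
    logits -= logits.max()                      # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def one_step_pg_update(theta, phi_sa, s, n_actions, reward_fn, alpha, rng):
    """One stochastic policy-gradient step: theta += alpha * grad log pi(s, a) * r."""
    p = softmax_policy(theta, phi_sa, s, n_actions)
    a = rng.choice(n_actions, p=p)              # sample a ~ pi_theta(.|s)
    r = reward_fn(s, a)                         # sampled reward for (s, a)
    # For a linear softmax: grad_theta log pi(s, a) = phi(s, a) - sum_b pi(s, b) phi(s, b).
    grad_log_pi = phi_sa(s, a) - sum(p[b] * phi_sa(s, b) for b in range(n_actions))
    return theta + alpha * grad_log_pi * r
```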


Actor-Critic Models

Policy gradient in more general settings:

$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\, q_{\pi_\theta}(s, a)]$  (27)

qπθ(s, a) can be approximated by q̂(s, a, w).

• The critic updates w.

• The actor updates θ based on the critic's suggestion.
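A minimal one-step actor-critic sketch along these lines, with a linear critic q̂(s, a, w) = w·φ(s, a) and a softmax actor; the feature map `phi_sa` and the Gym-style env interface are assumptions for illustration:

```python
import numpy as np

def actor_critic(env, phi_sa, n_features, n_actions,
                 episodes=1000, alpha_w=0.05, alpha_theta=0.01, gamma=0.99):
    """One-step actor-critic: the critic does TD updates on w for q_hat,
    the actor moves theta along grad log pi_theta(s, a) * q_hat(s, a, w)."""
    rng = np.random.default_rng(0)
    w = np.zeros(n_features)
    theta = np.zeros(n_features)

    def pi(s):
        logits = np.array([theta @ phi_sa(s, a) for a in range(n_actions)])
        p = np.exp(logits - logits.max())
        return p / p.sum()

    for _ in range(episodes):
        s = env.reset()
        p = pi(s)
        a = rng.choice(n_actions, p=p)
        done = False
        while not done:
            s2, r, done, _ = env.step(a)
            p2 = pi(s2)
            a2 = rng.choice(n_actions, p=p2)
            q_sa = w @ phi_sa(s, a)
            q_next = 0.0 if done else w @ phi_sa(s2, a2)
            td_error = r + gamma * q_next - q_sa
            # Critic update (TD on q_hat), then actor update along eq. (27).
            w += alpha_w * td_error * phi_sa(s, a)
            grad_log_pi = phi_sa(s, a) - sum(p[b] * phi_sa(s, b) for b in range(n_actions))
            theta += alpha_theta * grad_log_pi * q_sa
            s, a, p = s2, a2, p2
    return theta, w
```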


Actor-Critic Models

Similar ideas are used in AlphaGo.


The End

Other interesting topics

• Convergence.

• Exploration vs. exploitation.

• Credit assignment.

• Off-policy RL.
