
Page 1: ReinforcementLearning - GitHub Pages

Reinforcement Learning

Shan-Hung Wu ([email protected])

Department of Computer Science, National Tsing Hua University, Taiwan

Machine Learning

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 1 / 64

Page 2: ReinforcementLearning - GitHub Pages

Outline

1 Introduction

2 Markov Decision Process
    Value Iteration
    Policy Iteration

3 Reinforcement Learning
    Model-Free RL using Monte Carlo Estimation
    Temporal-Difference Estimation and SARSA (Model-Free)
    Exploration Strategies
    Q-Learning (Model-Free)
    SARSA vs. Q-Learning

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 2 / 64

Page 4: ReinforcementLearning - GitHub Pages

Reinforcement Learning I

An agent sees states s(t)'s of an environment, takes actions a(t)'s, and receives rewards R(t)'s (or penalties)

The environment itself does not change over time
The state of the environment may change due to an action
Reward R(t) may depend on s(t+1), s(t), · · · or a(t), a(t−1), · · ·

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 4 / 64

Page 8: ReinforcementLearning - GitHub Pages

Reinforcement Learning II

Goal: to learn the best policy π∗(s(t)) = a(t) that maximizes the total reward ∑_t R(t)

Training:
1 Perform trial-and-error runs
2 Learn from the experience

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 5 / 64

Page 10: ReinforcementLearning - GitHub Pages

Compared to Supervised/Unsupervised Learning

π∗(s(t)) = a(t) maximizing total reward ∑_t R(t) vs. f∗(x(i)) = y(i) minimizing total loss

Examples x(i)'s are i.i.d., but not in RL: s(t+1) may depend on s(t), s(t−1), · · · and a(t), a(t−1), · · ·

There is no “what to predict” (y(i)'s), only “how good a prediction is” (R(t)'s)
R(t)'s are also called the critics

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 6 / 64

Page 13: ReinforcementLearning - GitHub Pages

Applications

Sequential decision making and control problems

From machine learning to AI that changes the world

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 7 / 64

Page 16: ReinforcementLearning - GitHub Pages

Markov Processes

A random process {s(t)}t is a collection of time-indexed random variables

A classic way to model dependency between input samples {s(t)}t

A random process is called a Markov process if it satisfies the Markov property:

P(s(t+1)|s(t),s(t−1), · · ·) = P(s(t+1)|s(t))

For those who know Markov processes:

                           States are fully observable        States are partially observable
Transition is autonomous   Markov chains                      Hidden Markov models
Transition is controlled   Markov decision processes (MDP)    Partially observable MDP (POMDP)

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 9 / 64

Page 18: ReinforcementLearning - GitHub Pages

Markov Decision Process I

A Markov decision process (MDP) is defined by:

S the state space; A the action space
Start state s(0)
P(s′|s; a) the transition distribution controlled by actions; fixed over time t
R(s,a,s′) ∈ R (or simply R(s′)) the deterministic reward function
γ ∈ [0,1] the discount factor
H ∈ N the horizon; can be infinite

An absorbing/terminal state transits to itself with probability 1

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 10 / 64
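
A minimal way to put these ingredients into code; the toy two-state MDP below (states, actions, probabilities, and rewards) is made up purely for illustration:

    # Hypothetical tabular MDP: dictionaries encode P(s'|s;a) and R(s,a,s').
    S = ["s0", "s1"]                     # state space S
    A = ["stay", "go"]                   # action space A
    gamma = 0.9                          # discount factor

    # P[s][a] maps each next state s' to P(s'|s;a); each row sums to 1.
    P = {
        "s0": {"stay": {"s0": 1.0}, "go": {"s0": 0.2, "s1": 0.8}},
        "s1": {"stay": {"s1": 1.0}, "go": {"s1": 1.0}},  # s1 is absorbing/terminal
    }

    # R[(s, a, s_next)] is the deterministic reward R(s,a,s').
    R = {
        ("s0", "stay", "s0"): 0.0,
        ("s0", "go", "s0"): 0.0,
        ("s0", "go", "s1"): 1.0,
        ("s1", "stay", "s1"): 0.0,
        ("s1", "go", "s1"): 0.0,
    }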

Page 20: ReinforcementLearning - GitHub Pages

Markov Decision Process II

Given a policy π(s) = a, an MDP proceeds as follows:

s(0) −a(0)→ s(1) −a(1)→ · · · −a(H−1)→ s(H),

with the accumulative reward

R(s(0),a(0),s(1)) + γR(s(1),a(1),s(2)) + · · · + γ^(H−1) R(s(H−1),a(H−1),s(H))

The discount encourages accruing rewards as soon as possible (prefer a short path)
Different accumulative rewards in different trials

[Figure: interaction between the MDP and the policy π]

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 11 / 64
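
As a quick check of the formula above, a short sketch that accumulates the discounted reward of one recorded trial (the reward numbers are made up):

    # Discounted accumulative reward of one episode:
    # R(0) + gamma*R(1) + gamma^2*R(2) + ...
    def discounted_return(rewards, gamma=0.9):
        total = 0.0
        for t, r in enumerate(rewards):        # t = 0, 1, ..., H-1
            total += (gamma ** t) * r          # gamma^t * R(s(t), a(t), s(t+1))
        return total

    print(discounted_return([0.0, 0.0, 1.0]))  # 0.81 when gamma = 0.9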

Page 22: ReinforcementLearning - GitHub Pages

Goal

Given a policy π, the expected accumulative reward collected by taking actions following π can be expressed as:

Vπ = E_{s(0),··· ,s(H)} ( ∑_{t=0}^{H} γ^t R(s(t), π(s(t)), s(t+1)); π )

Goal: to find the optimal policy

π∗ = argmax_π Vπ

How?

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 12 / 64

Page 25: ReinforcementLearning - GitHub Pages

Optimal Value Function

π∗ = argmax_π E_{s(0),··· ,s(H)} ( ∑_{t=0}^{H} γ^t R(s(t), π(s(t)), s(t+1)); π )

Optimal value function:

V∗(h)(s) = max_π E_{s(1),··· ,s(h)} ( ∑_{t=0}^{h} γ^t R(s(t), π(s(t)), s(t+1)) | s(0) = s; π )

Maximum expected accumulative reward when starting from state s and acting optimally for h steps

Having V∗(H−1)(s) for each s, we can solve π∗ easily by

π∗(s) = argmax_a ∑_{s′} P(s′|s;a)[R(s,a,s′) + γV∗(H−1)(s′)], ∀s

in O(|S||A|) time per state

How to obtain V∗(H−1)(s) for each s?

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 14 / 64
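
A sketch of this one-step-lookahead policy extraction, assuming the dictionary-based MDP representation sketched earlier and an already computed table V of optimal values (all names are illustrative):

    # Extract a greedy policy from a value table V: for every state, pick the
    # action maximizing the expected one-step lookahead value.
    def extract_policy(S, A, P, R, V, gamma=0.9):
        policy = {}
        for s in S:
            def lookahead(a):
                return sum(p * (R[(s, a, s2)] + gamma * V[s2])
                           for s2, p in P[s][a].items())
            policy[s] = max(A, key=lookahead)
        return policy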

Page 29: ReinforcementLearning - GitHub Pages

Dynamic Programming

V∗(h)(s) = max_π E_{s(1),··· ,s(h)} ( ∑_{t=0}^{h} γ^t R(s(t), π(s(t)), s(t+1)) | s(0) = s; π )

h = H−1:

V∗(H−1)(s) = max_a ∑_{s′} P(s′|s;a)[R(s,a,s′) + γV∗(H−2)(s′)], ∀s

h = H−2:

V∗(H−2)(s) = max_a ∑_{s′} P(s′|s;a)[R(s,a,s′) + γV∗(H−3)(s′)], ∀s

· · ·

h = 0:

V∗(0)(s) = max_a ∑_{s′} P(s′|s;a)[R(s,a,s′) + γV∗(−1)(s′)], ∀s

h = −1:

V∗(−1)(s) = 0, ∀s

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 15 / 64

Page 33: ReinforcementLearning - GitHub Pages

Algorithm: Value Iteration (Finite Horizon)

Input: MDP (S, A, P, R, γ, H)
Output: π∗(s)'s for all s's

For each state s, initialize V∗(s) ← 0;
for h ← 0 to H−1 do
    foreach s do
        V∗(s) ← max_a ∑_{s′} P(s′|s;a)[R(s,a,s′) + γV∗(s′)];
    end
end
foreach s do
    π∗(s) ← argmax_a ∑_{s′} P(s′|s;a)[R(s,a,s′) + γV∗(s′)];
end

Algorithm 1: Value iteration with finite horizon.

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 16 / 64
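
A runnable sketch of this finite-horizon loop, again assuming the dictionary-based MDP representation used above (P[s][a] maps next states to probabilities, R[(s,a,s')] gives rewards); it uses synchronous sweeps, a common variant of the in-place update in the pseudocode:

    # Value iteration with finite horizon H, followed by greedy policy extraction.
    def value_iteration(S, A, P, R, gamma=0.9, H=50):
        V = {s: 0.0 for s in S}                      # V*(s) <- 0

        def backup(s, a):
            return sum(p * (R[(s, a, s2)] + gamma * V[s2])
                       for s2, p in P[s][a].items())

        for _ in range(H):                           # for h <- 0 to H-1
            V = {s: max(backup(s, a) for a in A) for s in S}

        policy = {}
        for s in S:                                  # pi*(s) <- argmax_a backup(s, a)
            policy[s] = max(A, key=lambda a: backup(s, a))
        return V, policy

    # Usage with the toy MDP sketched earlier: V, pi = value_iteration(S, A, P, R)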

Page 34: ReinforcementLearning - GitHub Pages

Example

MDP settings:

Actions: “up,” “down,” “left,” “right”
Noise of transition probability P(s′|s; a): 0.2, e.g., “up” moves up with probability 0.8 and slips to each perpendicular side with probability 0.1; similarly for “right,” etc.
γ: 0.9

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 17 / 64

Page 40: ReinforcementLearning - GitHub Pages

ExampleMDP settings:

Actions: “up,” “down,” “left,” “right”Noise of transition probability P(s′|s; a): 0.2,

e.g., up:.8

.1 .1.0

, right:.1

0 .8.1

, right:.1

0 .1 .8�

,

etc.γ: 0.9

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 17 / 64

Page 41: ReinforcementLearning - GitHub Pages

Infinite Horizon & Bellman Optimality Equation

Recurrence of optimal values:

V∗(h)(s) = max_a ∑_{s′} P(s′|s;a)[R(s,a,s′) + γV∗(h−1)(s′)], ∀s

When h→ ∞, we have the Bellman optimality equation:

V∗(s) = max_a ∑_{s′} P(s′|s;a)[R(s,a,s′) + γV∗(s′)], ∀s

Optimal policy:

π∗(s) = argmax_a ∑_{s′} P(s′|s;a)[R(s,a,s′) + γV∗(s′)], ∀s

When h→ ∞, π∗ is
Stationary: the optimal action at a state s is the same at all times (efficient to store)
Memoryless: independent of s(0)

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 18 / 64

Page 45: ReinforcementLearning - GitHub Pages

Algorithm: Value Iteration (Infinite Horizon)

Input: MDP (S, A, P, R, γ, H→ ∞)
Output: π∗(s)'s for all s's

For each state s, initialize V∗(s) ← 0;
repeat
    foreach s do
        V∗(s) ← max_a ∑_{s′} P(s′|s;a)[R(s,a,s′) + γV∗(s′)];
    end
until V∗(s)'s converge;
foreach s do
    π∗(s) ← argmax_a ∑_{s′} P(s′|s;a)[R(s,a,s′) + γV∗(s′)];
end

Algorithm 2: Value iteration with infinite horizon.

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 19 / 64
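
The only change from the finite-horizon sketch above is the stopping rule: keep sweeping until the values stop moving. A minimal variant (the tolerance 1e-6 is an arbitrary choice):

    # Infinite-horizon value iteration: repeat sweeps until the largest change
    # across states falls below a small tolerance.
    def value_iteration_inf(S, A, P, R, gamma=0.9, tol=1e-6):
        V = {s: 0.0 for s in S}
        while True:
            def backup(s, a):
                return sum(p * (R[(s, a, s2)] + gamma * V[s2])
                           for s2, p in P[s][a].items())
            new_V = {s: max(backup(s, a) for a in A) for s in S}
            delta = max(abs(new_V[s] - V[s]) for s in S)
            V = new_V
            if delta < tol:                # "until V*(s)'s converge"
                return V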

Page 46: ReinforcementLearning - GitHub Pages

Convergence

Theorem

Value iteration converges and gives the optimal policy π∗ when H→ ∞.

Intuition:

V∗(s) − V∗(H)(s) = γ^(H+1) R(s(H+1),a(H+1),s(H+2)) + γ^(H+2) R(s(H+2),a(H+2),s(H+3)) + · · ·
                 ≤ γ^(H+1) Rmax + γ^(H+2) Rmax + · · ·
                 = (γ^(H+1) / (1−γ)) Rmax

Goes to 0 as H→ ∞

Hence, V∗(H)(s) → V∗(s) as H→ ∞

Assumed that R(·) ≥ 0; still holds if rewards can be negative, by using max |R(·)| and bounding from both sides

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 20 / 64

Page 49: ReinforcementLearning - GitHub Pages

Effect of MDP Parameters

How does the noise of P(s′|s; a) and γ affect π∗?

1 γ = 0.99, noise = 0.5
2 γ = 0.99, noise = 0
3 γ = 0.1, noise = 0.5
4 γ = 0.1, noise = 0

π∗ prefers the close exit (+1); risking the cliff (−10)? (4)
π∗ prefers the close exit (+1); avoiding the cliff (−10)? (3)
π∗ prefers the distant exit (+10); risking the cliff (−10)? (2)
π∗ prefers the distant exit (+10); avoiding the cliff (−10)? (1)

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 21 / 64

Page 56: ReinforcementLearning - GitHub Pages

Goal

Given an MDP (S,A,P,R,γ,H→ ∞)

Expected accumulative reward collected by taking actions following a policy π:

Vπ = E_{s(0),s(1),···} ( ∑_{t=0}^{∞} γ^t R(s(t), π(s(t)), s(t+1)); π )

Goal: to find the optimal policy

π∗ = argmax_π Vπ

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 23 / 64

Page 57: ReinforcementLearning - GitHub Pages

Algorithm: Policy Iteration (Simplified)

Given a π, define its value function:

Vπ(s) = E_{s(1),···} ( ∑_{t=0}^{∞} γ^t R(s(t), π(s(t)), s(t+1)) | s(0) = s; π )

Expected accumulative reward when starting from state s and acting based on π

Input: MDP (S, A, P, R, γ, H→ ∞)
Output: π(s)'s for all s's

For each state s, initialize π(s) randomly;
repeat
    Evaluate Vπ(s), ∀s;
    Improve π such that Vπ(s), ∀s, becomes higher;
until π(s)'s converge;

Algorithm 3: Policy iteration.

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 24 / 64

Page 59: ReinforcementLearning - GitHub Pages

Evaluating Vπ(s)

How to evaluate the value function of a given π?

Vπ(s) = E_{s(1),···} ( ∑_{t=0}^{∞} γ^t R(s(t), π(s(t)), s(t+1)) | s(0) = s; π )

Bellman expectation equation:

Vπ(s) = ∑_{s′} P(s′|s;π(s))[R(s,π(s),s′) + γVπ(s′)], ∀s

M1 Solve the system of linear equations
    |S| linear equations and |S| variables
    Time complexity: O(|S|^3)

M2 Dynamic programming (just like value iteration):
    Initialize Vπ(s) = 0, ∀s
    Iterate Vπ(s) ← ∑_{s′} P(s′|s;π(s))[R(s,π(s),s′) + γVπ(s′)], ∀s

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 25 / 64
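
A sketch of method M1, solving the |S|-by-|S| linear system with NumPy; it assumes the dictionary-based MDP representation from earlier and a policy given as a dict pi[s] = a (names are illustrative):

    import numpy as np

    # Evaluate V^pi by treating the Bellman expectation equation as a linear system:
    # V = R_pi + gamma * P_pi V  =>  (I - gamma * P_pi) V = R_pi
    def evaluate_policy(S, A, P, R, pi, gamma=0.9):
        n = len(S)
        idx = {s: i for i, s in enumerate(S)}
        P_pi = np.zeros((n, n))            # P_pi[i, j] = P(s_j | s_i; pi(s_i))
        R_pi = np.zeros(n)                 # R_pi[i] = expected one-step reward
        for s in S:
            a = pi[s]
            for s2, p in P[s][a].items():
                P_pi[idx[s], idx[s2]] = p
                R_pi[idx[s]] += p * R[(s, a, s2)]
        v = np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
        return {s: v[idx[s]] for s in S}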

Page 62: ReinforcementLearning - GitHub Pages

Policy Improvement

How to find π̂ such that Vπ̂(s) ≥ Vπ(s) for all s?

Update rule: for all s do

π̂(s) ← argmax_a ∑_{s′} P(s′|s;a)[R(s,a,s′) + γVπ(s′)]

Why?

Vπ(s) = ∑_{s′} P(s′|s;π(s))[R(s,π(s),s′) + γVπ(s′)]
      ≤ ∑_{s′} P(s′|s;π̂(s))[R(s,π̂(s),s′) + γVπ(s′)]
      ≤ ∑_{s′} P(s′|s;π̂(s)){R(s,π̂(s),s′) + γ ∑_{s′′} P(s′′|s′;π̂(s′))[R(s′,π̂(s′),s′′) + γVπ(s′′)]}
      · · ·
      ≤ E_{s′,s′′,···} ( R(s,π̂(s),s′) + γR(s′,π̂(s′),s′′) + · · · | s(0) = s; π̂ ) = Vπ̂(s), ∀s

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 26 / 64

Page 68: ReinforcementLearning - GitHub Pages

Algorithm: Policy Iteration

Input: MDP (S, A, P, R, γ, H→ ∞)
Output: π(s)'s for all s's

For each state s, initialize π(s) randomly;
repeat
    For each state s, initialize Vπ(s) ← 0;
    repeat
        foreach s do
            Vπ(s) ← ∑_{s′} P(s′|s;π(s))[R(s,π(s),s′) + γVπ(s′)];
        end
    until Vπ(s)'s converge;
    foreach s do
        π(s) ← argmax_a ∑_{s′} P(s′|s;a)[R(s,a,s′) + γVπ(s′)];
    end
until π(s)'s converge;

Algorithm 5: Policy iteration.

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 27 / 64
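
A compact sketch of the full loop, reusing the evaluate_policy solver from the previous sketch (dictionary-based MDP, illustrative names):

    import random

    # Policy iteration: alternate policy evaluation and greedy policy improvement
    # until the policy stops changing.
    def policy_iteration(S, A, P, R, gamma=0.9):
        pi = {s: random.choice(A) for s in S}           # initialize pi(s) randomly
        while True:
            V = evaluate_policy(S, A, P, R, pi, gamma)  # e.g., the linear solve above
            new_pi = {}
            for s in S:
                new_pi[s] = max(A, key=lambda a: sum(
                    p * (R[(s, a, s2)] + gamma * V[s2])
                    for s2, p in P[s][a].items()))
            if new_pi == pi:                            # "until pi(s)'s converge"
                return pi, V
            pi = new_pi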

Page 69: ReinforcementLearning - GitHub Pages

Convergence

Theorem

Policy iteration converges and gives the optimal policy π∗ when H→ ∞.

Convergence: in every step the policy improves
Optimal policy: at convergence, Vπ(s), ∀s, satisfies the Bellman optimality equation

V∗(s) = max_a ∑_{s′} P(s′|s;a)[R(s,a,s′) + γV∗(s′)]

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 28 / 64

Page 73: ReinforcementLearning - GitHub Pages

Difference from MDP

In practice, we may not be able to model the environment as an MDP:

Unknown transition distribution P(s′|s; a)

Unknown reward function R(s,a,s′)

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 30 / 64

Page 76: ReinforcementLearning - GitHub Pages

Exploration vs. Exploitation

What would you do to get some rewards?

1 Perform actions randomly to explore P(s′|s;a) and R(s,a,s′) first
    Collect samples so as to estimate P(s′|s;a) and R(s,a,s′)

2 Then, perform actions to exploit the learned P(s′|s;a) and R(s,a,s′)
    π∗ can be computed/planned “in mind” using value/policy iteration

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 31 / 64

Page 79: ReinforcementLearning - GitHub Pages

Model-based RL using Monte Carlo Estimation

1 Use some exploration policy π′ to perform one or more episodes/trials
    Each episode records samples of P(s′|s;a) and R(s,a,s′) from the start to a terminal state:

s(0) −π′(s(0))→ s(1) −π′(s(1))→ · · · −π′(s(H−1))→ s(H)

and

R(s(0),π′(s(0)),s(1)) → · · · → R(s(H−1),π′(s(H−1)),s(H))

2 Estimate P(s′|s;a) and R(s,a,s′) using the samples and update the exploitation policy π:
    P(s′|s;a) = (# times the action a takes state s to state s′) / (# times action a is taken in state s)
    R(s,a,s′) = average of reward values received when a takes s to s′

3 Repeat from Step 1, but gradually mix π into π′

Mix-in strategy? E.g., ε-greedy (more on this later)

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 32 / 64
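
A sketch of step 2, building count-based estimates of P(s′|s;a) and R(s,a,s′); `transitions` is assumed to be a list of (s, a, r, s_next) samples collected with the exploration policy:

    from collections import defaultdict

    # Estimate the transition distribution and reward function by counting.
    def estimate_model(transitions):
        counts = defaultdict(lambda: defaultdict(int))  # counts[(s, a)][s_next]
        reward_sum = defaultdict(float)                 # running sums for R(s,a,s')
        reward_cnt = defaultdict(int)
        for s, a, r, s2 in transitions:
            counts[(s, a)][s2] += 1
            reward_sum[(s, a, s2)] += r
            reward_cnt[(s, a, s2)] += 1
        P_hat, R_hat = {}, {}
        for (s, a), nexts in counts.items():
            total = sum(nexts.values())                 # # times a is taken in s
            for s2, c in nexts.items():
                P_hat.setdefault(s, {}).setdefault(a, {})[s2] = c / total
                R_hat[(s, a, s2)] = reward_sum[(s, a, s2)] / reward_cnt[(s, a, s2)]
        return P_hat, R_hat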

Page 80: ReinforcementLearning - GitHub Pages

Problems of Model-based RL

There may be lots of P(s′|s;a) and R(s,a,s′) to estimate
If P(s′|s;a) is low, there may be too few samples to have a good estimate
Low P(s′|s;a) may also lead to a poor estimate of R(s,a,s′)
    when R(s,a,s′) depends on s′

We estimate P(s′|s;a)'s and R(s,a,s′)'s in order to compute V∗(s)/Vπ(s) and solve π∗

Why not estimate V∗(s)/Vπ(s) directly?

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 33 / 64

Page 83: ReinforcementLearning - GitHub Pages

Model-based vs. Model-Free

How to estimate E(f(x)|y) = ∑_x P(x|y) f(x) given samples (x(1),y(1)), (x(2),y(2)), · · ·?

Model-based estimation:
1 For each x, estimate P(x|y) = count(x(i) = x and y(i) = y) / count(y(j) = y)
2 E(f(x)|y) = ∑_x P(x|y) f(x)

Model-free estimation:

E(f(x)|y) = (1 / count(y(j) = y)) ∑_{i : y(i) = y} f(x(i))

Why does it work? Because samples appear in the right frequencies

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 35 / 64
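
A tiny numeric illustration of the two estimators (the sample data is made up); both reduce to the same conditional average:

    from collections import Counter

    # Samples of (x, y) pairs and a function f whose conditional mean we want.
    samples = [(1, "a"), (2, "a"), (2, "a"), (3, "b")]
    f = lambda x: x * x
    y = "a"

    # Model-based: estimate P(x|y) first, then take the expectation.
    count_y = sum(1 for _, yi in samples if yi == y)
    count_xy = Counter(xi for xi, yi in samples if yi == y)
    model_based = sum((c / count_y) * f(x) for x, c in count_xy.items())

    # Model-free: directly average f(x(i)) over the samples with y(i) = y.
    model_free = sum(f(xi) for xi, yi in samples if yi == y) / count_y

    print(model_based, model_free)   # both 3.0 = (1 + 4 + 4) / 3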

Page 87: ReinforcementLearning - GitHub Pages

Challenges of Model-Free RL

Value iteration: start from V∗(s) ← 0
1 Iterate V∗(s) ← max_a ∑_{s′} P(s′|s;a)[R(s,a,s′) + γV∗(s′)] until converge
    ✗ Not easy to estimate V∗(s) = max_π E( ∑_t γ^t R(t) | s(0) = s; π )
2 Solve π∗(s) ← argmax_a ∑_{s′} P(s′|s;a)[R(s,a,s′) + γV∗(s′)]
    ✗ Need model to solve

Policy iteration: start from a random π, repeat until converge:
1 Iterate Vπ(s) ← ∑_{s′} P(s′|s;π(s))[R(s,π(s),s′) + γVπ(s′)] until converge
    ✓ Can estimate Vπ(s) = E( ∑_t γ^t R(t) | s(0) = s; π ) using MC est.
2 Solve π(s) ← argmax_a ∑_{s′} P(s′|s;a)[R(s,a,s′) + γVπ(s′)]
    ✗ Need model to solve

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 36 / 64

Page 89: ReinforcementLearning - GitHub Pages

Q Function for π

Value function for a given π:

Vπ(s) = E_{s(1),···} ( ∑_{t=0}^{∞} γ^t R(s(t), π(s(t)), s(t+1)) | s(0) = s; π )

with recurrence (Bellman expectation equation):

Vπ(s) = ∑_{s′} P(s′|s;π(s))[R(s,π(s),s′) + γVπ(s′)], ∀s

Expected accumulative reward when starting from state s and acting based on π onward

Define Q function for π:

Qπ(s,a) = E_{s(1),···} ( R(s,a,s(1)) + ∑_{t=1}^{∞} γ^t R(s(t), π(s(t)), s(t+1)); s, a, π )

such that Vπ(s) = Qπ(s,π(s)), with recurrence:

Qπ(s,a) = ∑_{s′} P(s′|s;a)[R(s,a,s′) + γQπ(s′,π(s′))], ∀s,a

Expected accumulative reward when starting from state s, taking action a, and then acting based on π

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 37 / 64

Page 91: ReinforcementLearning - GitHub Pages

Algorithm: Policy Iteration based on Qπ

Input: MDP (S, A, P, R, γ, H→ ∞)
Output: π(s)'s for all s's

For each state s, initialize π(s) randomly;
repeat
    Initialize Qπ(s,a) ← 0, ∀s,a;
    repeat
        foreach s and a do
            Qπ(s,a) ← ∑_{s′} P(s′|s;a)[R(s,a,s′) + γQπ(s′,π(s′))];
        end
    until Qπ(s,a)'s converge;
    foreach s do
        π(s) ← argmax_a Qπ(s,a);
    end
until π(s)'s converge;

Algorithm 6: Policy iteration based on Qπ (the Vπ updates of Algorithm 5 are replaced by their Qπ counterparts).

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 38 / 64

Page 92: ReinforcementLearning - GitHub Pages

Example

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 39 / 64

Page 93: ReinforcementLearning - GitHub Pages

Policy Iteration and Model-Free RL

Policy iteration: start from a random π, repeat until converge:
1 Iterate Qπ(s,a) ← ∑_{s′} P(s′|s;a)[R(s,a,s′) + γQπ(s′,π(s′))] until converge
    ✓ Can estimate Qπ(s,a) = E( R(0) + ∑_{t=1}^{∞} γ^t R(t) ) using MC est.
2 Solve π(s) ← argmax_a Qπ(s,a)
    ✓ No need for model to solve

RL: start from a random π for both exploration & exploitation, repeat until converge:
1 Create episodes using π and get MC estimates Qπ(s,a), ∀s,a
2 Update π by π(s) ← argmax_a Qπ(s,a), ∀s

Problem: π improves little after running lots of trials
Can we improve π right after each action?

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 40 / 64

Page 99: ReinforcementLearning - GitHub Pages

Policy Iteration Revisited

Policy iteration: start from a random π, repeat until converge:
1 Iterate Qπ(s,a) ← ∑_{s′} P(s′|s;a)[R(s,a,s′) + γQπ(s′,π(s′))] until converge
    ✓ Can estimate Qπ(s,a) = E( R(0) + ∑_{t=1}^{∞} γ^t R(t) ) using MC est.
    ✓ Can also estimate Qπ(s,a) based on the recurrence
2 Solve π(s) ← argmax_a Qπ(s,a)
    ✓ No need for model to solve

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 42 / 64

Page 102: ReinforcementLearning - GitHub Pages

Temporal Difference Estimation I

Qπ(s,a) ← ∑_{s′} P(s′|s;a)[R(s,a,s′) + γQπ(s′,π(s′))]

Given the samples

s(0) −a(0)→ s(1) −a(1)→ · · · −a(H−1)→ s(H)

and

R(s(0),a(0),s(1)) → · · · → R(s(H−1),a(H−1),s(H))

Temporal difference (TD) estimation of Qπ(s,a):
1 Q̂π(s,a) ← random value, ∀s,a
2 Repeat until converge, for each action a(t):

Q̂π(s(t),a(t)) ← Q̂π(s(t),a(t)) + η[(R(s(t),a(t),s(t+1)) + γQ̂π(s(t+1),π(s(t+1)))) − Q̂π(s(t),a(t))]

Q̂π(s,a) can be updated on the fly during an episode
η: the “learning rate”?

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 43 / 64
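
A few lines showing this TD update applied to one observed transition; Q is a dict of estimates keyed by (state, action), pi maps states to actions, and all names are illustrative:

    # One temporal-difference update of the Q estimate for policy pi, given a
    # single observed transition (s, a, r, s_next).
    def td_update(Q, pi, s, a, r, s_next, gamma=0.9, eta=0.1):
        target = r + gamma * Q[(s_next, pi[s_next])]   # R + gamma * Q(s', pi(s'))
        Q[(s, a)] += eta * (target - Q[(s, a)])        # move the estimate toward the target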

Page 105: ReinforcementLearning - GitHub Pages

Temporal Difference Estimation II

Q̂π(s,a) ← Q̂π(s,a) + η[(R(s,a,s′) + γQ̂π(s′,π(s′))) − Q̂π(s,a)]
         = η(R(s,a,s′) + γQ̂π(s′,π(s′))) + (1−η)Q̂π(s,a)

Exponential moving average:

x̄_n = [x(n) + (1−η)x(n−1) + (1−η)^2 x(n−2) + · · ·] / [1 + (1−η) + (1−η)^2 + · · ·],

where η ∈ [0,1] is the “forget rate”
Recent samples are exponentially more important

Since 1/η = 1 + (1−η) + (1−η)^2 + · · ·, we have x̄_n = ηx(n) + (1−η)x̄_{n−1} [Proof]

As long as η gradually decreases to 0:
    Q̂π(s,a) degenerates to the average of accumulative rewards
    Q̂π(s,a) converges to Qπ(s,a) (more on this later)

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 44 / 64
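
A quick numeric check of the recurrence x̄_n = ηx(n) + (1−η)x̄_{n−1}: the running estimate weights recent samples exponentially more (the sample values are made up):

    # Exponential moving average via the recurrence: the new estimate blends the
    # latest sample (weight eta) with the previous estimate (weight 1 - eta).
    def ema(samples, eta=0.5):
        estimate = samples[0]
        for x in samples[1:]:
            estimate = eta * x + (1 - eta) * estimate
        return estimate

    print(ema([1.0, 0.0, 0.0, 4.0]))   # 2.125: the most recent sample dominates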

Page 109: ReinforcementLearning - GitHub Pages

Algorithm: SARSA (Simplified)

Input: S, A, and γ
Output: π∗(s)'s for all s's

For each state s and a, initialize Qπ(s,a) arbitrarily;
foreach episode do
    Set s to initial state;
    repeat
        Take action a ← argmax_{a′} Qπ(s,a′);
        Observe s′ and reward R(s,a,s′);
        Qπ(s,a) ← Qπ(s,a) + η[(R(s,a,s′) + γQπ(s′,π(s′))) − Qπ(s,a)];
        s ← s′;
    until s is terminal state;
end

Algorithm 7: State-Action-Reward-State-Action (SARSA).

The policy improves each time the next action a is decided

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 45 / 64
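
A sketch of this simplified loop against a generic environment object; env.reset() and env.step(a) (returning the next state, the reward, and a done flag) are assumed interfaces, and action selection is purely greedy as on the slide (an ε-greedy variant appears in the next section):

    from collections import defaultdict

    # Simplified SARSA: greedy action selection, TD update after every step.
    def sarsa(env, actions, gamma=0.9, eta=0.1, episodes=1000):
        Q = defaultdict(float)                           # Q(s,a) estimates, init 0

        def greedy(s):                                   # pi(s) = argmax_a' Q(s, a')
            return max(actions, key=lambda a: Q[(s, a)])

        for _ in range(episodes):
            s = env.reset()                              # assumed environment API
            done = False
            while not done:
                a = greedy(s)                            # take action a
                s2, r, done = env.step(a)                # observe s' and R(s,a,s')
                target = r + gamma * Q[(s2, greedy(s2))] # R + gamma * Q(s', pi(s'))
                Q[(s, a)] += eta * (target - Q[(s, a)])  # TD update
                s = s2
        return dict(Q)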

Page 111: ReinforcementLearning - GitHub Pages

Convergence

Theorem

SARSA converges and gives the optimal policy π∗ almost surely if
1) π is GLIE (Greedy in the Limit with Infinite Exploration);
2) η is small enough eventually, but not decreasing too fast.

Greedy in the limit: the policy π converges (in the limit) to the exploitation/greedy policy
    At each step, we choose a ← argmax_{a′} Q̂π(s,a′)
Infinite exploration: all (s,a) pairs are visited infinitely many times
    a ← argmax_{a′} Q̂π(s,a′) cannot guarantee this
    Need a better way (next section)

Furthermore, η should satisfy: ∑_t η(t) = ∞ and ∑_t η(t)^2 < ∞
    E.g., η(t) = O(1/t)
    ∑_t 1/t is the harmonic series, known to diverge
    ∑_t (1/t)^p, p > 1, is a p-series converging to the Riemann zeta value ζ(p)

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 46 / 64

Page 115: ReinforcementLearning - GitHub Pages

Outline

1 Introduction

2 Markov Decision Process
    Value Iteration
    Policy Iteration

3 Reinforcement Learning
    Model-Free RL using Monte Carlo Estimation
    Temporal-Difference Estimation and SARSA (Model-Free)
    Exploration Strategies
    Q-Learning (Model-Free)
    SARSA vs. Q-Learning

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 47 / 64

Page 116: ReinforcementLearning - GitHub Pages

General RL Steps

Repeat until convergence:
    1 Use some exploration policy π′ to run episodes/trials
    2 Estimate targets (e.g., P(s′|s;a), R(s,a,s′), or Qπ(s,a)) and update the exploitation policy π
    3 Gradually mix π into π′

Goals of the mix-in/exploration strategy:
    Infinite exploration with π′
    π′ is greedy/exploitative in the end

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 48 / 64

Page 118: ReinforcementLearning - GitHub Pages

ε-Greedy Strategy

At every time step, flip a coin:
    With probability ε, act randomly (explore)
    With probability (1 − ε), compute/update the exploitation policy and act accordingly (exploit)
Gradually decrease ε over time
At each time step, the action is at one of two extremes: exploration or exploitation
A "soft" policy between the two extremes?

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 49 / 64
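A minimal sketch of ε-greedy action selection with a decaying ε; the exponential decay schedule is my own choice for illustration, not something the slides prescribe.

import numpy as np

def epsilon_greedy_action(Q, s, actions, eps):
    # With probability eps act randomly (explore); otherwise exploit the current Q.
    if np.random.rand() < eps:
        return actions[np.random.randint(len(actions))]
    return max(actions, key=lambda a: Q[(s, a)])

def decayed_epsilon(episode_k, eps0=1.0, decay=0.995):
    # Gradually decrease eps over time, shifting from exploration to exploitation.
    return eps0 * (decay ** episode_k)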

Page 121: ReinforcementLearning - GitHub Pages

Softmax Strategy

Idea: perform action a more often if a gives more accumulative reward
E.g., in SARSA, choose a from s by sampling from the distribution:

P(a|s) = exp(Qπ(s,a)/t) / ∑_{a′} exp(Qπ(s,a′)/t), ∀a

The softmax function converts the Qπ(s,a)'s into probabilities
The temperature t starts from a high value (exploration) and decreases over time (exploitation)

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 50 / 64
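A sketch of the softmax (Boltzmann) strategy above; subtracting the maximum before exponentiating is a standard numerical-stability trick and is not mentioned on the slide.

import numpy as np

def softmax_action(Q, s, actions, temperature):
    # P(a|s) proportional to exp(Q(s, a) / t):
    # high t ~ near-uniform (explore), low t ~ near-greedy (exploit).
    q = np.array([Q[(s, a)] for a in actions], dtype=float)
    q -= q.max()                              # stability shift; leaves the distribution unchanged
    p = np.exp(q / temperature)
    p /= p.sum()
    return actions[np.random.choice(len(actions), p=p)]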

Page 123: ReinforcementLearning - GitHub Pages

Exploration Function

Idea: explore the areas with the fewest samples
E.g., in each step of SARSA, define an exploration function

f(q,n) = q + K/n,

where
    q is the estimated Q-value,
    n is the number of samples behind the estimate,
    K is some positive constant.

Instead of

Qπ(s,a) ← Qπ(s,a) + η[(R(s,a,s′) + γ Qπ(s′,a′)) − Qπ(s,a)],

use f when updating Qπ(s,a):

Qπ(s,a) ← Qπ(s,a) + η{[R(s,a,s′) + γ f(Qπ(s′,a′), count(s′,a′))] − Qπ(s,a)}

Infinite exploration
Exploit once explored enough

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 51 / 64
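A sketch of how the exploration function could slot into one SARSA update step; Q and counts are assumed to be plain dictionaries keyed by (state, action), and K = 1 is an arbitrary choice.

def exploration_f(q, n, K=1.0):
    # f(q, n) = q + K/n: inflates the value of rarely sampled state-action pairs.
    return q + K / max(n, 1)

def sarsa_step_with_exploration_f(Q, counts, s, a, r, s_next, a_next, gamma=0.9, eta=0.1):
    # Q: (state, action) -> estimated value; counts: (state, action) -> number of samples so far.
    counts[(s, a)] = counts.get((s, a), 0) + 1
    target = r + gamma * exploration_f(Q.get((s_next, a_next), 0.0),
                                       counts.get((s_next, a_next), 0))
    q_sa = Q.get((s, a), 0.0)
    Q[(s, a)] = q_sa + eta * (target - q_sa)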

Page 125: ReinforcementLearning - GitHub Pages

Outline

1 Introduction

2 Markov Decision Process
    Value Iteration
    Policy Iteration

3 Reinforcement Learning
    Model-Free RL using Monte Carlo Estimation
    Temporal-Difference Estimation and SARSA (Model-Free)
    Exploration Strategies
    Q-Learning (Model-Free)
    SARSA vs. Q-Learning

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 52 / 64

Page 126: ReinforcementLearning - GitHub Pages

Model-Free RL with Value Iteration?

Value iteration: start from V∗(s) ← 0
    1 Iterate V∗(s) ← max_a ∑_{s′} P(s′|s;a)[R(s,a,s′) + γ V∗(s′)] until convergence
        ☹ Not easy to estimate V∗(s) = max_π E( ∑_t γ^t R(t) | s(0) = s; π )
    2 Solve π∗(s) ← argmax_a ∑_{s′} P(s′|s;a)[R(s,a,s′) + γ V∗(s′)]
        ☹ Need the model to solve

Is there a Q-version that helps sample-based estimation?

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 53 / 64

Page 128: ReinforcementLearning - GitHub Pages

Optimal Q Function

Optimal value function:

V∗(s) = max_π E_{s(1),···} ( ∑_{t=0}^{∞} γ^t R(s(t), π(s(t)), s(t+1)) | s(0) = s; π )

with recurrence

V∗(s) = max_a ∑_{s′} P(s′|s;a)[R(s,a,s′) + γ V∗(s′)], ∀s

Maximum expected accumulative reward when starting from state s and acting optimally onward

Q∗ function:

Q∗(s,a) = max_π E_{s(1),···} ( R(s,a,s(1)) + ∑_{t=1}^{∞} γ^t R(s(t), π(s(t)), s(t+1)) ; s, a, π )

such that V∗(s) = max_a Q∗(s,a), with recurrence:

Q∗(s,a) = ∑_{s′} P(s′|s;a)[R(s,a,s′) + γ max_{a′} Q∗(s′,a′)], ∀s,a

Maximum expected accumulative reward when starting from state s, taking action a, and then acting optimally onward

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 54 / 64

Page 130: ReinforcementLearning - GitHub Pages

Algorithm: Q-Value Iteration
Input: MDP (S,A,P,R,γ,H → ∞)
Output: π∗(s)'s for all s's

Initialize Q∗(s,a) ← 0, ∀s,a  (value iteration: for each state s, initialize V∗(s) ← 0);
repeat
    foreach s and a do
        Q∗(s,a) ← ∑_{s′} P(s′|s;a)[R(s,a,s′) + γ max_{a′} Q∗(s′,a′)]
        (value iteration: V∗(s) ← max_a ∑_{s′} P(s′|s;a)[R(s,a,s′) + γ V∗(s′)]);
    end
until the Q∗(s,a)'s converge;
foreach s do
    π∗(s) ← argmax_a Q∗(s,a)  (value iteration: π∗(s) ← argmax_a ∑_{s′} P(s′|s;a)[R(s,a,s′) + γ V∗(s′)]);
end
Algorithm 9: Q-value iteration with infinite horizon.

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 55 / 64
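A minimal tabular sketch of the Q-value iteration above. The model is assumed to be given as P[s][a], a list of (s_next, prob) pairs, and R[(s, a, s_next)], a reward table; these names are mine, not the slides'.

def q_value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    Q = {s: {a: 0.0 for a in actions} for s in states}      # Q*(s, a) <- 0 for all s, a
    while True:
        delta = 0.0
        for s in states:
            for a in actions:
                # Q*(s, a) <- sum_{s'} P(s'|s;a) [R(s, a, s') + gamma * max_{a'} Q*(s', a')]
                new_q = sum(p * (R[(s, a, s_next)] + gamma * max(Q[s_next].values()))
                            for s_next, p in P[s][a])
                delta = max(delta, abs(new_q - Q[s][a]))
                Q[s][a] = new_q
        if delta < tol:                                      # until the Q*(s, a)'s converge
            break
    # pi*(s) <- argmax_a Q*(s, a): no model needed for this step.
    return {s: max(actions, key=lambda a: Q[s][a]) for s in states}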

Page 131: ReinforcementLearning - GitHub Pages

Example

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 56 / 64

Page 132: ReinforcementLearning - GitHub Pages

Q-Value Iteration

Q-value iteration: start from Q∗(s,a) ← 0
    1 Iterate Q∗(s,a) ← ∑_{s′} P(s′|s;a)[R(s,a,s′) + γ max_{a′} Q∗(s′,a′)] until convergence
        ☹ Still not easy to estimate Q∗(s,a) = max_π E( R(0) + ∑_{t≥1} γ^t R(t) )
        ☺ But we can estimate Q∗(s,a) based on the recurrence now!
    2 Solve π∗(s) ← argmax_a Q∗(s,a)
        ☺ No need for a model to solve

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 57 / 64

Page 135: ReinforcementLearning - GitHub Pages

Temporal Difference Estimation

Q∗(s,a) ← ∑_{s′} P(s′|s;a)[R(s,a,s′) + γ max_{a′} Q∗(s′,a′)]

Given an exploration policy π′ and samples

s(0) −a(0)→ s(1) −a(1)→ ··· −a(H−1)→ s(H)

and

R(s(0),a(0),s(1)) → ··· → R(s(H−1),a(H−1),s(H))

Temporal difference (TD) estimation of Q∗(s,a) for the exploitation policy π:
    1 Q∗(s,a) ← random value, ∀s,a
    2 Repeat until convergence, for each action a(t):

Q∗(s(t),a(t)) ← Q∗(s(t),a(t)) + η[(R(s(t),a(t),s(t+1)) + γ max_a Q∗(s(t+1),a)) − Q∗(s(t),a(t))]

Q∗(s,a) can be updated on the fly during exploration
η is the "forget rate" of the moving average, and gradually decreases

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 58 / 64
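A sketch of applying the TD update along one trajectory sampled under π′; representing the samples as parallel lists is my own convention for illustration.

def td_update_along_trajectory(Q, states, taken, rewards, actions, gamma=0.9, eta=0.1):
    # states  = [s(0), ..., s(H)], taken = [a(0), ..., a(H-1)],
    # rewards = [R(s(0), a(0), s(1)), ..., R(s(H-1), a(H-1), s(H))].
    for t in range(len(taken)):
        s, a, s_next, r = states[t], taken[t], states[t + 1], rewards[t]
        best_next = max(Q.get((s_next, a_), 0.0) for a_ in actions)
        q_sa = Q.get((s, a), 0.0)
        # Moving-average update with forget rate eta.
        Q[(s, a)] = q_sa + eta * ((r + gamma * best_next) - q_sa)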

Page 138: ReinforcementLearning - GitHub Pages

Algorithm: Q-Learning
Input: S, A, and γ
Output: π∗(s)'s for all s's

For each state s and action a, initialize Q∗(s,a) arbitrarily;
foreach episode do
    Set s to the initial state;
    repeat
        Take action a from s using some exploration policy π′ derived from Q∗ (e.g., ε-greedy);
        Observe s′ and reward R(s,a,s′);
        Q∗(s,a) ← Q∗(s,a) + η[(R(s,a,s′) + γ max_{a′} Q∗(s′,a′)) − Q∗(s,a)];
        s ← s′;
    until s is a terminal state;
end
Algorithm 10: Q-learning.

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 59 / 64
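A minimal tabular Q-learning sketch matching Algorithm 10. As before, the Gym-like env interface is an assumption for illustration, and ε-greedy plays the role of the exploration policy π′.

import numpy as np
from collections import defaultdict

def q_learning(env, actions, gamma=0.9, eta=0.1, eps=0.1, episodes=1000):
    Q = defaultdict(float)                            # Q*(s, a), initialized arbitrarily (0 here)
    for _ in range(episodes):
        s = env.reset()                               # assumed toy, Gym-like interface
        done = False
        while not done:
            # Behavior policy pi' derived from Q* (here: epsilon-greedy).
            if np.random.rand() < eps:
                a = actions[np.random.randint(len(actions))]
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)             # observe s' and R(s, a, s')
            # Off-policy TD target: bootstrap on max_{a'} Q*(s', a').
            target = r + (0.0 if done else gamma * max(Q[(s_next, a_)] for a_ in actions))
            Q[(s, a)] += eta * (target - Q[(s, a)])
            s = s_next
    # pi*(s) = argmax_a Q*(s, a)
    visited_states = {s_ for (s_, _) in Q}
    return {s_: max(actions, key=lambda a_: Q[(s_, a_)]) for s_ in visited_states}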

Page 139: ReinforcementLearning - GitHub Pages

Convergence

Theorem
Q-learning converges and gives the optimal policy π∗ if
1) π′ has explored enough;
2) η is small enough eventually, but does not decrease too fast.

π′ has explored enough:
    All states and actions are visited infinitely often
    It does not matter how π′ selects actions!
η satisfies ∑_t η(t) = ∞ and ∑_t η(t)² < ∞
    E.g., η(t) = O(1/t)

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 60 / 64

Page 142: ReinforcementLearning - GitHub Pages

Outline

1 Introduction

2 Markov Decision Process
    Value Iteration
    Policy Iteration

3 Reinforcement Learning
    Model-Free RL using Monte Carlo Estimation
    Temporal-Difference Estimation and SARSA (Model-Free)
    Exploration Strategies
    Q-Learning (Model-Free)
    SARSA vs. Q-Learning

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 61 / 64

Page 143: ReinforcementLearning - GitHub Pages

Off-Policy vs. On-Policy RL

General RL steps: repeat until convergence:
    1 Use some exploration policy π′ to run episodes/trials
    2 Use samples to update the exploitation policy π
    3 Gradually mix π into π′

Off-policy: π is updated toward a greedy policy independent of π′
    Q-learning: Q∗(s,a) ← Q∗(s,a) + η[(R(s,a,s′) + γ max_{a′} Q∗(s′,a′)) − Q∗(s,a)]
On-policy: π is updated to improve (and depends on) π′
    SARSA: Qπ(s,a) ← Qπ(s,a) + η[(R(s,a,s′) + γ Qπ(s′,π(s′))) − Qπ(s,a)]

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 62 / 64
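The one-line difference is easiest to see by writing the two TD targets side by side (my own illustration; Q is assumed to be a table with an entry for every (state, action), and a_next is the action the behavior policy actually takes in s′).

def q_learning_target(Q, r, s_next, actions, gamma):
    # Off-policy: bootstrap on the greedy value, regardless of what pi' does next.
    return r + gamma * max(Q[(s_next, a)] for a in actions)

def sarsa_target(Q, r, s_next, a_next, gamma):
    # On-policy: bootstrap on the value of the action that is actually taken next.
    return r + gamma * Q[(s_next, a_next)]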

Page 146: ReinforcementLearning - GitHub Pages

Practical Results

SARSA can avoid the mistakes caused by exploration
    E.g., the maze-with-cliff problem
    Advantageous when ε > 0 under an ε-greedy exploration strategy
    Converges to the optimal policy π∗ (as Q-learning does) when ε → 0
But Q-learning can continue learning while the exploration policy is being changed

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 63 / 64

Page 148: ReinforcementLearning - GitHub Pages

Remarks

The update rule for Q∗ (or Qπ) treats terminal states differently:

Q∗(s,a) ← (1 − η) Q∗(s,a) + η R(s,a,s′)                               if s′ is terminal
Q∗(s,a) ← (1 − η) Q∗(s,a) + η [R(s,a,s′) + γ max_{a′} Q∗(s′,a′)]      otherwise

    Better convergence in practice

Watch out for your memory usage!
    Space complexity of SARSA/Q-learning: O(|S||A|)
    Store Q∗(s,a)'s or Qπ(s,a)'s for all (s,a) combinations
Why not train a (deep) regressor for the Q∗(s,a)'s or Qπ(s,a)'s?
    Space reduction thanks to the generalizability of the regressor

Shan-Hung Wu (CS, NTHU) Reinforcement Learning Machine Learning 64 / 64
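A sketch of the terminal-state-aware update in the remark above, written in the same (1 − η)/η mixing form; the code is my own illustration, with Q assumed to be a table holding an entry for every (state, action).

def q_update(Q, s, a, r, s_next, actions, terminal, gamma=0.9, eta=0.1):
    if terminal:
        # No bootstrap term: the episode ends at s'.
        Q[(s, a)] = (1 - eta) * Q[(s, a)] + eta * r
    else:
        best_next = max(Q[(s_next, a_)] for a_ in actions)
        Q[(s, a)] = (1 - eta) * Q[(s, a)] + eta * (r + gamma * best_next)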
