
Sapienza University of Rome

Master in Artificial Intelligence and Robotics
Master in Engineering in Computer Science

Machine Learning

A.Y. 2020/2021

Prof. L. Iocchi, F. Patrizi


17. Markov Decision Processes and Reinforcement Learning

L. Iocchi, F. Patrizi


Overview

Dynamic systems

Markov Decision Processes

Q-Learning

Experimentation strategies

Evaluation

MDP Non-deterministic case

Non-deterministic Q-Learning

Temporal Difference

SARSA

Policy gradient algorithms

References

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction.
http://incompleteideas.net/book/the-book.html


Dynamic System

The classical view of a dynamic system

[Figure: block diagram of a dynamic system: state x_k and noise w_k enter the transition model f_k, producing x_{k+1} (through a delay block Δ); x_k and noise v_k enter the observation model h_k, producing the observation z_k]

x: state
z: observations
w, v: noise
f: state transition model
h: observation model


Reasoning vs. Learning in Dynamic Systems

[Figure: the same dynamic-system block diagram as in the previous slide]

Reasoning: given the model (f, h) and the current state x_k, predict the future (x_{k+T}, z_{k+T}).

Learning: given past experience (z_{0:k}), determine the model (f, h).


State of a Dynamic System

The state x encodes:

all the past knowledge needed to predict the future

the knowledge gathered through operation

the knowledge needed to pursue the goal

Examples:

configuration of a board game

configuration of robot devices

screenshot of a video-game


Observability of the state

When the state is fully observable, the decision-making problem for an agent is to decide which action must be executed in a given state.

The agent has to compute the function

π : X → A

When the model of the system is not known, the agent has to learn the function π.


Supervised vs. Reinforcement Learning

Supervised Learning
Learning a function f : X → Y, given

D = {〈x_i, y_i〉}

Reinforcement Learning
Learning a behavior function π : X → A, given

D = {〈x_1, a_1, r_1, . . . , x_n, a_n, r_n〉^(i)}


Example: Tic-Tac-Toe


Example: Tic Tac Toe


RL Example: Humanoid Walk


RL Example: Controlling a Helicopter


Deep Reinforcement Learning: ATARI games

Deep Q-Network (DQN) by Deep Mind (Google)

Human-level control through Deep Reinforcement Learning, Nature (2015)

Dynamic System Representation

X: set of states
- explicit discrete and finite representation X = {x_1, . . . , x_n}
- continuous representation X = F(. . .) (state function)
- probabilistic representation P(X) (probabilistic state function)

A: set of actions
- explicit discrete and finite representation A = {a_1, . . . , a_m}
- continuous representation A = U(. . .) (control function)

δ: transition function
- deterministic / non-deterministic / probabilistic

Z: set of observations
- explicit discrete and finite representation Z = {z_1, . . . , z_k}
- continuous representation Z = Z(. . .) (observation function)
- probabilistic representation P(Z) (probabilistic observation function)


Markov property

Markov property

Once the current state is known, the evolution of the dynamic system does not depend on the history of states, actions and observations.

The current state contains all the information needed to predict the future.

Future states are conditionally independent of past states and past observations given the current state.

The knowledge about the current state makes past, present and future observations statistically independent.

A Markov process is a process that has the Markov property.


Markov Decision Processes (MDP)

Markov processes for decision making.

States are fully observable, so no observations are needed.

Graphical model


Markov Decision Processes (MDP)

Deterministic transitions

MDP = 〈X,A, δ, r〉

X is a finite set of states

A is a finite set of actions

δ : X × A → X is a transition function

r : X × A → ℝ is a reward function

Markov property: x_{t+1} = δ(x_t, a_t) and r_t = r(x_t, a_t)

Sometimes, the reward function is defined as r : X → ℝ
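
For illustration, a deterministic MDP of this form can be written down directly as plain Python dictionaries; the states, actions and rewards below are invented toy values, not taken from the slides:

```python
# A minimal sketch: a deterministic MDP encoded with plain Python dictionaries.
X = ["x0", "x1", "x2"]                       # finite set of states
A = ["a", "b"]                               # finite set of actions
delta = {                                    # transition function delta(x, a) -> x'
    ("x0", "a"): "x1", ("x0", "b"): "x0",
    ("x1", "a"): "x2", ("x1", "b"): "x0",
    ("x2", "a"): "x2", ("x2", "b"): "x2",
}
r = {key: 0.0 for key in delta}              # reward function r(x, a)
r[("x1", "a")] = 1.0                         # the transition into x2 is rewarded

def step(x, a):
    """Markov property: next state and reward depend only on (x, a)."""
    return delta[(x, a)], r[(x, a)]
```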


Markov Decision Processes (MDP)

Non-deterministic transitions

MDP = 〈X,A, δ, r〉

X is a finite set of states

A is a finite set of actions

δ : X × A → 2^X is a transition function

r : X × A × X → ℝ is a reward function


Markov Decision Processes (MDP)

Stochastic transitions

MDP = 〈X,A, δ, r〉

X is a finite set of states

A is a finite set of actions

P(x′|x, a) is a probability distribution over transitions

r : X × A × X → ℝ is a reward function


Example: deterministic grid controller

Reaching the right-most side of the environment from any initial state.

MDP 〈X, A, δ, r〉
X = {(r, c) | coordinates in the grid}
A = {Left, Right, Up, Down}
δ: cardinal movements, with no effect (i.e., the agent remains in the current state) if the destination state is a black square

r: 1000 for reaching the right-most column, -10 for hitting any obstacle, 0 otherwise


Example: non-deterministic grid controller

Reaching the right-most side of the environment from any initial state.

MDP 〈X, A, δ, r〉
X = {(r, c) | coordinates in the grid}
A = {Left, Right}
δ: cardinal movements with non-deterministic effects (0.1 probability of moving diagonally)

r: 1000 for reaching the right-most column, -10 for hitting any obstacle, +1 for any Right action, -1 for any Left action.


Full Observability in MDP

States are fully observable.

In the presence of non-deterministic or stochastic actions, the state resulting from the execution of an action is not known before the execution of the action, but it can be fully observed after its execution.


MDP Solution Concept

Given an MDP, we want to find an optimal policy.

A policy is a function

π : X → A

For each state x ∈ X, π(x) ∈ A is the optimal action to be executed in that state.

Optimality = maximizing the cumulative reward.


MDP Solution Concept

Optimality is defined with respect to maximizing the (expected value of the) cumulative discounted reward.

V^π(x_1) ≡ E[r_1 + γ r_2 + γ² r_3 + . . .]

where r_t = r(x_t, a_t, x_{t+1}), a_t = π(x_t), and γ ∈ [0, 1] is the discount factor for future rewards.

Optimal policy: π* ≡ argmax_π V^π(x), ∀x ∈ X
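
A quick numeric illustration of the discounted sum (the reward sequence and γ below are made up for the example):

```python
# Sketch: cumulative discounted reward of a reward sequence r_1, r_2, r_3, ...
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

print(discounted_return([0, 0, 10]))   # 0 + 0.9*0 + 0.9**2 * 10 = 8.1 (up to rounding)
```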


Value function

Deterministic case

V^π(x) ≡ r_1 + γ r_2 + γ² r_3 + . . .

Non-deterministic/stochastic case:

V^π(x) ≡ E[r_1 + γ r_2 + γ² r_3 + . . .]


Optimal policy

π∗ is an optimal policy iff for any other policy π

V^{π*}(x) ≥ V^π(x), ∀x

For infinite-horizon problems, a stationary MDP always has an optimal stationary policy.


Example: non-deterministic grid controller

Optimal policy

green: action Left red: action Right


Reasoning and Learning in MDP

Problem: MDP 〈X,A, δ, r〉

Solution: Policy π : X→ A

If the MDP 〈X,A, δ, r〉 is completely known → reasoning or planning

If the MDP 〈X,A, δ, r〉 is not completely known → learning

Note: Simple examples of reasoning in MDPs can be modeled as a search problem and solved using standard search algorithms (e.g., A*).


One-state Markov Decision Processes (MDP)

MDP = 〈{x_0}, A, δ, r〉

x_0: the unique state

A: finite set of actions

δ(x_0, a_i) = x_0, ∀a_i ∈ A (transition function)

r(x_0, a_i, x_0) = r(a_i) (reward function)

Optimal policy: π*(x_0) = a_i


Deterministic One-state MDP

If r(a_i) is deterministic and known, then

Optimal policy: π*(x_0) = argmax_{a_i ∈ A} r(a_i)

What if the reward function r is unknown?


Deterministic One-state MDP

If r(a_i) is deterministic and unknown, then

Algorithm:

1. for each a_i ∈ A
   execute a_i and collect the reward r(i)

2. Optimal policy: π*(x_0) = a_i, with i = argmax_{i=1...|A|} r(i)

Note: exactly |A| iterations are needed.
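
A direct sketch of this procedure; execute(a) is an assumed callback that runs the action and returns its (deterministic) reward:

```python
# Sketch: with deterministic, unknown rewards, one execution per action suffices.
def best_action(actions, execute):
    rewards = {a: execute(a) for a in actions}   # exactly |A| executions
    return max(rewards, key=rewards.get)         # pi*(x0) = action with the highest reward
```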


Non-Deterministic One-state MDP

If r(a_i) is non-deterministic and known, then

Optimal policy: π*(x_0) = argmax_{a_i ∈ A} E[r(a_i)]

Example:

If r(a_i) = N(µ_i, σ_i), then

π*(x_0) = a_i, with i = argmax_{i=1...|A|} µ_i


Non-Deterministic One-state MDP

If r(a_i) is non-deterministic and unknown, then

Algorithm:

1. Initialize a data structure Θ
2. For each time t = 1, . . . , T (until termination condition)
   - choose an action a(t) ∈ A
   - execute a(t) and collect the reward r(t)
   - update the data structure Θ
3. Optimal policy: π*(x_0) = . . ., according to the data structure Θ

Note: many iterations (T >> |A|) are needed.


Non-Deterministic One-state MDP

Example:

If r(a_i) is non-deterministic and unknown, and r(a_i) = N(µ_i, σ_i), then

Algorithm:

1. Initialize Θ^(0)[i] ← 0 and c[i] ← 0, for i = 1 . . . |A|
2. For each time t = 1, . . . , T (until termination condition)
   - choose an index ı for the action a(t) = a_ı ∈ A
   - execute a(t) and collect the reward r(t)
   - increment c[ı]
   - update Θ^(t)[ı] ← (1 / c[ı]) ( r(t) + (c[ı] − 1) Θ^(t−1)[ı] )
3. Optimal policy: π*(x_0) = a_i, with i = argmax_{i=1...|A|} Θ^(T)[i]
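
A sketch of the same sample-average scheme in Python; the Gaussian parameters and the purely random action choice are illustrative assumptions:

```python
import random

means, stds = [1.0, 2.0, 0.5], [0.5, 0.5, 0.5]   # made-up Gaussian reward parameters
theta = [0.0] * len(means)                        # Theta[i]: running mean reward of a_i
count = [0] * len(means)                          # c[i]: executions of a_i

for t in range(1000):
    i = random.randrange(len(means))              # here: uniform exploration
    r = random.gauss(means[i], stds[i])           # execute a_i and collect the reward
    count[i] += 1
    theta[i] += (r - theta[i]) / count[i]         # incremental form of the update rule above

best = max(range(len(means)), key=lambda i: theta[i])   # pi*(x0)
```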


Learning with Markov Decision Processes

Given an agent accomplishing a task according to an MDP 〈X, A, δ, r〉, for which the functions δ and r are unknown to the agent,

determine the optimal policy π∗

Note: This is not a supervised learning approach!

The target function is π : X → A

but we do not have training examples {(x^(i), π(x^(i)))}; training examples are in the form 〈(x(1), a(1), r(1)), . . . , (x(t), a(t), r(t))〉


Agent’s Learning Task

Since δ and r are not known, the agent cannot predict the effect of its actions. But it can execute them and then observe the outcome.

The learning task is thus performed by repeating these steps:

choose an action

execute the chosen action

observe the resulting new state

collect the reward


Approaches to Learning with MDP

Value iteration (estimate the value function and then compute π)

Policy iteration (estimate π directly)


Learning through value iteration

The agent could learn the value function V^{π*}(x) (written as V*(x)),

from which it could determine the optimal policy:

π*(x) = argmax_{a ∈ A} [ r(x, a) + γ V*(δ(x, a)) ]

However, this policy cannot be computed in this way because δ and r are not known.


Q Function (deterministic case)

Q^π(x, a): expected value when executing a in state x and then acting according to π.

Q^π(x, a) ≡ r(x, a) + γ V^π(δ(x, a))

Q(x, a) ≡ r(x, a) + γ V*(δ(x, a))

If the agent learns Q, then it can determine the optimal policy without knowing δ and r.

π*(x) = argmax_{a ∈ A} Q(x, a)


Q Function (deterministic case)

Observe that

V*(x) = max_{a ∈ A} { r(x, a) + γ V*(δ(x, a)) } = max_{a ∈ A} Q(x, a)

Thus, we can rewrite

Q(x, a) ≡ r(x, a) + γ V*(δ(x, a))

as

Q(x, a) = r(x, a) + γ max_{a′ ∈ A} Q(δ(x, a), a′)


Training Rule to Learn Q (deterministic case)

Deterministic case:

Q(x, a) = r(x, a) + γ max_{a′ ∈ A} Q(δ(x, a), a′)

Let Q̂ denote the learner's current approximation of Q.

Training rule:

Q̂(x, a) ← r + γ max_{a′} Q̂(x′, a′)

where r is the immediate reward and x′ is the state resulting from applying action a in state x.


Q Learning Algorithm for Deterministic MDPs

1. for each x, a initialize the table entry Q^(0)(x, a) ← 0

2. observe the current state x
3. for each time t = 1, . . . , T (until termination condition)
   - choose an action a
   - execute the action a
   - observe the new state x′
   - collect the immediate reward r
   - update the table entry for Q(x, a) as follows:
     Q^(t)(x, a) ← r + γ max_{a′ ∈ A} Q^(t−1)(x′, a′)
   - x ← x′
4. Optimal policy: π*(x) = argmax_{a ∈ A} Q^(T)(x, a)

Note: the algorithm does not use δ and r; it just observes the new state x′ and the immediate reward r after the execution of the chosen action.
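
Putting the loop together as a sketch; the environment interface (env.reset(), env.step(a) returning the new state and reward, env.actions) is an assumption made for this example, not something defined in the slides:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, gamma=0.9, epsilon=0.1, max_steps=100):
    Q = defaultdict(float)                              # table entries Q(x, a), initialized to 0
    for _ in range(episodes):
        x = env.reset()                                 # observe the current state
        for _ in range(max_steps):
            if random.random() < epsilon:               # epsilon-greedy choice of the action
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda b: Q[(x, b)])
            x_next, r = env.step(a)                     # observe the new state, collect the reward
            Q[(x, a)] = r + gamma * max(Q[(x_next, b)] for b in env.actions)
            x = x_next
    return Q                                            # pi*(x) = argmax_a Q[(x, a)]
```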


Convergence in deterministic MDP

Q̂_n(x, a) underestimates Q(x, a)

We have: 0 ≤ Q̂_n(x, a) ≤ Q̂_{n+1}(x, a) ≤ Q(x, a)

Convergence is guaranteed if all state-action pairs are visited infinitely often.


Experimentation Strategies

How are actions chosen by the agent?
Exploitation: select the action a that maximizes Q(x, a)
Exploration: select a random action a (possibly one with a low value of Q(x, a))

ε-greedy strategy

Given 0 ≤ ε ≤ 1,
select a random action with probability ε
select the best action with probability 1 − ε

ε can decrease over time (first exploration, then exploitation).


Experimentation Strategies

soft-max strategy

Actions with higher Q values are assigned higher probabilities, but every action is assigned a non-zero probability.

P(a_i | x) = k^{Q(x, a_i)} / Σ_j k^{Q(x, a_j)}

k > 0 determines how strongly the selection favors actions with high Q values.
k may increase over time (first exploration, then exploitation).
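
A sketch of soft-max selection over a tabular Q; here Q is assumed to be a dict mapping (state, action) pairs to values, and k > 1:

```python
import random

def softmax_action(Q, state, actions, k=2.0):
    weights = [k ** Q[(state, a)] for a in actions]   # k^Q(x, a_i)
    total = sum(weights)
    probs = [w / total for w in weights]              # P(a_i | x)
    return random.choices(actions, weights=probs)[0]
```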


Example: grid world

Reaching the goal state G from the initial state S0.

[Figure: grid world with states S0, S1, S2, S3, S4, G; arrows between states labeled with the immediate rewards (100 on the transitions entering G, 0 elsewhere)]

MDP 〈X, A, δ, r〉
X = {S0, S1, S2, S3, S4, G}
A = {L, R, U, D}
δ represented as arrows in the figure (e.g., δ(S0, R) = S1)

r(x, a) represented as red values on the arrows in the figure (e.g., r(S0, R) = 0)


Example: grid world

[Figure: two panels. Left: the grid-world MDP with its immediate rewards (100 on the transitions entering G, 0 elsewhere). Right: the corresponding Q values (100, 90, 81, 72.9), consistent with a discount factor γ = 0.9]


Example: Grid World

[Figure: left, the V*(x) values of each state in the grid world (e.g., 100, 90, 81); right, one optimal policy shown as arrows over the grid]


Example: Hanoi Tower


Evaluating Reinforcement Learning Agents

Evaluation of RL agents is usually performed through the cumulative reward gained over time.


Evaluating Reinforcement Learning Agents

The cumulative reward plot may be very noisy (due to exploration phases). A better approach could be:

Repeat until termination condition:

1. Execute k steps of learning

2. Evaluate the current policy π_k (average and standard deviation of the cumulative reward obtained in d runs with no exploration)
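
A sketch of this train/evaluate alternation; the agent and environment interfaces (learn_steps, run_episode_greedy) are assumptions made for the example:

```python
import statistics

def train_and_evaluate(agent, env, iterations=20, k=1000, d=10):
    history = []
    for _ in range(iterations):
        agent.learn_steps(env, k)                                     # k steps of learning (with exploration)
        returns = [agent.run_episode_greedy(env) for _ in range(d)]   # d runs with no exploration
        history.append((statistics.mean(returns), statistics.stdev(returns)))
    return history
```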


Evaluating Reinforcement Learning Agents

Domain-specific performance metrics can also be used.

Average of all the results obtained during the learning process.


Non-deterministic Case

Transition and reward functions are non-deterministic.

We define V, Q by taking expected values

V^π(x) ≡ E[r_t + γ r_{t+1} + γ² r_{t+2} + . . .] ≡ E[ Σ_{i=0}^{∞} γ^i r_{t+i} ]

Optimal policy

π* ≡ argmax_π V^π(x), (∀x)


Non-deterministic Case

Definition of Q

Q(x, a) ≡ E[r(x, a) + γ V*(δ(x, a))]
        = E[r(x, a)] + γ E[V*(δ(x, a))]
        = E[r(x, a)] + γ Σ_{x′} P(x′ | x, a) V*(x′)
        = E[r(x, a)] + γ Σ_{x′} P(x′ | x, a) max_{a′} Q(x′, a′)

Optimal policy

π*(x) = argmax_{a ∈ A} Q(x, a)


Example: k-Armed Bandit

One-state MDP with k actions: a_1, . . . , a_k.

Stochastic case: r(a_i) = N(µ_i, σ_i) (Gaussian distribution)

Choose a_i with the ε-greedy strategy:
uniform random choice with probability ε (exploration),
best choice with probability 1 − ε (exploitation).

Training rule:

Q_n(a_i) ← Q_{n−1}(a_i) + α [ r − Q_{n−1}(a_i) ]

α = 1 / (1 + v_{n−1}(a_i))

with v_{n−1}(a_i) = number of executions of action a_i up to time n − 1.
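
A sketch of this bandit learner, using the Gaussian parameters of the exercise on the next slide for illustration:

```python
import random

means, stds = [100, 90, 70, 50], [50, 20, 50, 50]   # r(a_i) = N(mu_i, sigma_i)
k = len(means)
Q = [0.0] * k            # Q_n(a_i)
v = [0] * k              # v_n(a_i): executions of a_i so far
epsilon = 0.1

for t in range(5000):
    if random.random() < epsilon:
        i = random.randrange(k)                     # exploration
    else:
        i = max(range(k), key=lambda j: Q[j])       # exploitation
    r = random.gauss(means[i], stds[i])
    alpha = 1.0 / (1 + v[i])                        # alpha = 1 / (1 + v_{n-1}(a_i))
    Q[i] += alpha * (r - Q[i])                      # Q_n <- Q_{n-1} + alpha [r - Q_{n-1}]
    v[i] += 1
```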


Exercise: k-Armed Bandit

Compare the following two strategies for the stochastic k-Armed Bandit problem (with Gaussian distributions), by plotting the reward over time.

1. For each of the k actions, perform 30 trials and compute the mean reward; then always play the action with the highest estimated mean.

2. ε-greedy strategy (with different values of ε) and the training rule from the previous slide.

Note: write software that is parametric with respect to k and the parameters of the Gaussian distributions, and use the following values for the experiments: k = 4, r(a_1) = N(100, 50), r(a_2) = N(90, 20), r(a_3) = N(70, 50), r(a_4) = N(50, 50).


Example: k-Armed Bandit

What happens if the parameters of the Gaussian distributions vary slightly over time, e.g. µ_i ± 10% at unknown instants of time (with a much lower frequency than the trials)?

Q_n(a_i) ← Q_{n−1}(a_i) + α [ r − Q_{n−1}(a_i) ]

α = constant


Non-deterministic Q-learning

Q learning generalizes to non-deterministic worlds with the training rule

Q_n(x, a) ← Q_{n−1}(x, a) + α [ r + γ max_{a′} Q_{n−1}(x′, a′) − Q_{n−1}(x, a) ]

which is equivalent to

Q_n(x, a) ← (1 − α) Q_{n−1}(x, a) + α [ r + γ max_{a′} Q_{n−1}(x′, a′) ]

where

α = α_{n−1}(x, a) = 1 / (1 + visits_{n−1}(x, a))

visits_n(x, a): total number of times the state-action pair (x, a) has been visited up to the n-th iteration
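
The corresponding update step as a sketch (Q and visits are assumed to be plain dictionaries keyed by (state, action)):

```python
from collections import defaultdict

Q = defaultdict(float)
visits = defaultdict(int)

def q_update(x, a, r, x_next, actions, gamma=0.9):
    alpha = 1.0 / (1 + visits[(x, a)])                           # visit-based learning rate
    target = r + gamma * max(Q[(x_next, b)] for b in actions)
    Q[(x, a)] += alpha * (target - Q[(x, a)])                    # same as the rule above
    visits[(x, a)] += 1
```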


Convergence in non-deterministic MDP

Deterministic Q-learning does not converge in non-deterministic worlds! Q_{n+1}(x, a) ≥ Q_n(x, a) is not valid anymore.

Non-deterministic Q-learning also converges when every state-action pair is visited infinitely often [Watkins and Dayan, 1992].


Example: non-deterministic Grid World

[Figure: non-deterministic grid world with states S0, S1, S3, S4, a goal state G and a state F; the transition into G has reward 100, all other transitions have reward 0]


Other algorithms for non-deterministic learning

Temporal Difference

SARSA


Temporal Difference Learning

Q learning: reduce discrepancy between successive Q estimates

One-step time difference:

Q^(1)(x_t, a_t) ≡ r_t + γ max_a Q̂(x_{t+1}, a)

Two-step time difference:

Q^(2)(x_t, a_t) ≡ r_t + γ r_{t+1} + γ² max_a Q̂(x_{t+2}, a)

n-step time difference:

Q^(n)(x_t, a_t) ≡ r_t + γ r_{t+1} + · · · + γ^{n−1} r_{t+n−1} + γ^n max_a Q̂(x_{t+n}, a)

Blend all of these (0 ≤ λ ≤ 1):

Q^λ(x_t, a_t) ≡ (1 − λ) [ Q^(1)(x_t, a_t) + λ Q^(2)(x_t, a_t) + λ² Q^(3)(x_t, a_t) + · · · ]


Temporal Difference Learning

Q^λ(x_t, a_t) ≡ (1 − λ) [ Q^(1)(x_t, a_t) + λ Q^(2)(x_t, a_t) + λ² Q^(3)(x_t, a_t) + · · · ]

Equivalent expression:

Q^λ(x_t, a_t) = r_t + γ [ (1 − λ) max_a Q̂(x_{t+1}, a) + λ Q^λ(x_{t+1}, a_{t+1}) ]

λ = 0: only Q^(1) is used, i.e., Q learning as seen before

λ > 0: the algorithm increases the emphasis on discrepancies based on more distant look-aheads

λ = 1: only the observed rewards r_{t+i} are considered.
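
As an illustration, the recursive form can be applied backward along a recorded episode to obtain Q^λ targets; the episode format, the Qhat table and the terminal-step handling are assumptions of this sketch:

```python
def lambda_returns(episode, Qhat, actions, gamma=0.9, lam=0.7):
    """episode: list of (x, a, r); Qhat: dict (state, action) -> current estimate."""
    targets = [0.0] * len(episode)
    next_q_lambda = 0.0                          # Q^lambda of the step after t
    for t in range(len(episode) - 1, -1, -1):
        x, a, r = episode[t]
        if t + 1 < len(episode):
            x_next = episode[t + 1][0]
            max_q = max(Qhat[(x_next, b)] for b in actions)
            targets[t] = r + gamma * ((1 - lam) * max_q + lam * next_q_lambda)
        else:
            targets[t] = r                       # assume the episode ends after the last step
        next_q_lambda = targets[t]
    return targets
```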


Temporal Difference Learning

Q^λ(x_t, a_t) = r_t + γ [ (1 − λ) max_a Q̂(x_{t+1}, a) + λ Q^λ(x_{t+1}, a_{t+1}) ]

The TD(λ) algorithm uses the above training rule.

It sometimes converges faster than Q learning.

It converges for learning V* for any 0 ≤ λ ≤ 1 [Dayan, 1992].

TD-Gammon [Tesauro, 1995] uses this algorithm (approximately equal to the best human backgammon players).


SARSA

SARSA is based on the tuple 〈s, a, r, s′, a′〉 (〈x, a, r, x′, a′〉 in our notation).

Q_n(x, a) ← Q_{n−1}(x, a) + α [ r + γ Q_{n−1}(x′, a′) − Q_{n−1}(x, a) ]

a′ is chosen according to a policy based on the current estimate of Q.

On-policy method: it evaluates the current policy
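
A sketch of the SARSA update step; note that the next action a′, chosen by the current (e.g. ε-greedy) policy, replaces the max over actions used by Q-learning:

```python
from collections import defaultdict

Q = defaultdict(float)

def sarsa_update(x, a, r, x_next, a_next, alpha=0.1, gamma=0.9):
    Q[(x, a)] += alpha * (r + gamma * Q[(x_next, a_next)] - Q[(x, a)])
```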


Convergence of non-deterministic algorithms

Fast convergence does not imply a better solution in terms of the resulting policy.

Example: comparison among Q-learning, TD, and SARSA.


Remarks on explicit representation of Q

An explicit representation of the Q table may not be feasible for large models.

The algorithms perform a kind of rote learning: no generalization to unseen state-action pairs.

Convergence is guaranteed only if every possible state-action pair is visited infinitely often.


Remarks on explicit representation of Q

Use function approximation:

Q_θ(x, a) = θ_0 + θ_1 F_1(x, a) + . . . + θ_n F_n(x, a)

Use linear regression to learn Q_θ(x, a).
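
A sketch of the linear form and of a simple gradient-style update of θ toward a given target value; the feature function F (returning [1, F_1(x,a), ..., F_n(x,a)]) is an assumption of the example:

```python
def q_value(theta, F, x, a):
    return sum(t * f for t, f in zip(theta, F(x, a)))      # theta_0*1 + theta_1*F_1 + ...

def linear_q_update(theta, F, x, a, target, lr=0.01):
    """Move theta so that Q_theta(x, a) gets closer to the target (e.g. r + gamma max Q)."""
    error = target - q_value(theta, F, x, a)
    return [t + lr * error * f for t, f in zip(theta, F(x, a))]
```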


Remarks on explicit representation of Q

Use a neural network as function approximation and learn the function Q with Backpropagation.

Implementation options:

Train a network, using (x, a) as input and Q(x, a) as output

Train a separate network for each action a, using x as input and Q(x, a) as output

Train a network, using x as input and one output Q(x, a) for each action

TD-Gammon [Tesauro, 1995] uses a neural network and the Backpropagation algorithm together with TD(λ).


Reinforcement Learning with Policy Iteration

Use π directly, instead of V(x) or Q(x, a).

Parametric representation of π: π_θ(x) = argmax_{a ∈ A} Q_θ(x, a)

Policy value: ρ(θ) = expected value of executing π_θ.

Policy gradient: ∇_θ ρ(θ)


Policy Gradient Algorithm

Policy gradient algorithm for a parametric representation of the policy π_θ(x)

choose θ
while termination condition do
    estimate ∇_θ ρ(θ) (through experiments)
    θ ← θ + η ∇_θ ρ(θ)
end while
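
One way to "estimate ∇_θ ρ(θ) through experiments" is a finite-difference sketch like the following, where evaluate_policy(θ) is an assumed function returning the average return of π_θ:

```python
def finite_difference_gradient(theta, evaluate_policy, eps=0.01):
    grad = []
    for j in range(len(theta)):
        plus, minus = list(theta), list(theta)
        plus[j] += eps
        minus[j] -= eps
        grad.append((evaluate_policy(plus) - evaluate_policy(minus)) / (2 * eps))
    return grad

def policy_gradient_step(theta, evaluate_policy, eta=0.1):
    grad = finite_difference_gradient(theta, evaluate_policy)
    return [t + eta * g for t, g in zip(theta, grad)]       # theta <- theta + eta * grad
```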


Policy Gradient Algorithm

Policy Gradient Algorithm for robot learning [Kohl and Stone, 2004]

Estimate the optimal parameters of a controller π_θ = {θ_1, ..., θ_N}, given an objective function F.
The method is based on iterating the following steps:

1) generate perturbations of π_θ by modifying the parameters
2) evaluate these perturbations
3) generate a new policy from the "best scoring" perturbations


Policy Gradient Algorithm

General method

π ← InitialPolicy
while termination condition do
    compute {R_1, ..., R_t}, random perturbations of π
    evaluate {R_1, ..., R_t}
    π ← getBestCombinationOf({R_1, ..., R_t})
end while

Note: in the last step we can simply set π ← argmax_{R_j} F(R_j) (i.e., hill climbing).


Policy Gradient Algorithm

Perturbations are generated from π by

R_i = {θ_1 + δ_1, ..., θ_N + δ_N}

with δ_j randomly chosen in {−ε_j, 0, +ε_j}, and ε_j a small fixed value relative to θ_j.


Policy Gradient Algorithm

The combination of {R_1, ..., R_t} is obtained by computing, for each parameter j:
- Avg_{−ε,j}: average score of all R_i with a negative perturbation of parameter j
- Avg_{0,j}: average score of all R_i with a zero perturbation of parameter j
- Avg_{+ε,j}: average score of all R_i with a positive perturbation of parameter j

Then define a vector A = {A_1, ..., A_N} as follows:

A_j = 0                               if Avg_{0,j} > Avg_{−ε,j} and Avg_{0,j} > Avg_{+ε,j}
A_j = Avg_{+ε,j} − Avg_{−ε,j}         otherwise

and finally

π ← π + (A / |A|) · η
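
A sketch of one iteration of this perturbation-based search; evaluate(θ) is an assumed function returning the score F of the policy with parameters θ:

```python
import random

def policy_search_step(theta, evaluate, eps=0.05, eta=0.1, t=15):
    n = len(theta)
    perturbs = [[random.choice((-eps, 0.0, +eps)) for _ in range(n)] for _ in range(t)]
    scores = [evaluate([p + d for p, d in zip(theta, deltas)]) for deltas in perturbs]

    A = []
    for j in range(n):
        avg = {}
        for sign in (-eps, 0.0, +eps):
            vals = [s for s, d in zip(scores, perturbs) if d[j] == sign]
            avg[sign] = sum(vals) / len(vals) if vals else 0.0
        if avg[0.0] > avg[-eps] and avg[0.0] > avg[+eps]:
            A.append(0.0)                            # zero perturbation scored best: keep theta_j
        else:
            A.append(avg[+eps] - avg[-eps])          # otherwise move along the score difference
    norm = sum(a * a for a in A) ** 0.5 or 1.0       # normalize A, then scale by eta
    return [p + eta * a / norm for p, a in zip(theta, A)]
```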


Policy Gradient Algorithm

Task: optimize the AIBO gait for fast and stable locomotion [Saggar et al., 2006]

Objective function F

F = 1 − (W_t M_t + W_a M_a + W_d M_d + W_θ M_θ)

M_t: normalized time to walk between two landmarks
M_a: normalized standard deviation of AIBO's accelerometers
M_d: normalized distance of the centroid of the landmark from the image center
M_θ: normalized difference between the slope of the landmark and an ideal slope
W_t, W_a, W_d, W_θ: weights


Example: Robot Learning


References

[AI] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach, Chapter 21: Reinforcement Learning. 3rd Edition, Pearson, 2010.

[RL] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998. On-line: http://webdocs.cs.ualberta.ca/~sutton/book/ebook/

[ML] Tom Mitchell. Machine Learning, Chapter 13: Reinforcement Learning. McGraw-Hill, 1997.

[ArtInt] David Poole and Alan Mackworth. Artificial Intelligence: Foundations of Computational Agents, Chapter 11.3: Reinforcement Learning. Cambridge University Press, 2010. On-line: http://artint.info/

[Watkins and Dayan, 1992] Watkins, C. and Dayan, P. Q-learning. Machine Learning, 8, 279-292. 1992.


References

[Dayan, 1992] Dayan, P. The convergence of TD(λ) for general λ. Machine Learning, 8, 341-362. 1992.

[Tesauro, 1995] Gerald Tesauro. Temporal Difference Learning and TD-Gammon. Communications of the ACM, Vol. 38, No. 3. 1995.

[Kohl and Stone, 2004] Nate Kohl and Peter Stone. Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2004.

[Saggar et al., 2006] Manish Saggar, Thomas D'Silva, Nate Kohl, and Peter Stone. Autonomous Learning of Stable Quadruped Locomotion. In RoboCup-2006: Robot Soccer World Cup X, pp. 98-109, Springer Verlag, 2007.
