
  • Défis en Intelligence Artificielle. Reinforcement Learning (1)

    Xavier Siebert

    MathRO, Université de Mons

    http://www.ig.fpms.ac.be/~siebertx

    Xavier Siebert DEFIS IA 1 / 61

  • The Reinforcement

    Pierre De Handschutter Bastien Vanderplaetse

    Xavier Siebert DEFIS IA 2 / 61

  • Kasparov vs. Deep Blue (1997)

    Xavier Siebert DEFIS IA 3 / 61

  • Ke Jie vs. Alpha Go Zero (2017)

    Xavier Siebert DEFIS IA 4 / 61

  • Complexity of the task

    Tic-Tac-Toe : 255 168 unique games

    Chess : ≈ 10^120 nodes for 80 moves

    Go : ≈ 10^170 − 10^360 nodes for 150 moves

    Number of atoms in the universe : ≈ 10^80

    Source: Wikimedia

    Xavier Siebert DEFIS IA 5 / 61

  • Deep Blue (1997) - Alpha Go Zero (2017)

    Computer power is important... but not enough !

    Deep Blue : knowledge engineering, evaluation function, libraries of moves, computational power

    Alpha Go Zero : deep reinforcement learning, zero knowledge engineering, computational power : 48 TPUs

    Xavier Siebert DEFIS IA 6 / 61

  • Deep Reinforcement Learning (Deep RL)

    Deep RL = Reinforcement Learning + Deep Learning

    computational approach to learning from interaction

    can solve a wide range of complex sequential decision-making tasks that were previously out of reach for a machine.

    Xavier Siebert DEFIS IA 7 / 61

  • Topics covered in this course

    1 Markov Decision Processes

    2 Reinforcement Learning

    3 Deep Reinforcement Learning

    Feel free to jump to part 3 if you already master 1 and 2.

    Xavier Siebert DEFIS IA 8 / 61

  • Introduction

    Menu

    1 Introduction

    2 Markov Decision Processes

    Xavier Siebert DEFIS IA 9 / 61

  • Introduction

    Reinforcement Learning Scheme

    [Diagram] Agent ⇄ Environment : the agent sends an action a_t, the environment returns a new state s_{t+1} and a reward r_{t+1}

    an (artificial) agent can choose actions, interacting with its environment

    state s_t = (partial) information about the environment at time t

    policy π = strategy = procedure followed to select actions

    objective = maximize the cumulative reward

    Xavier Siebert DEFIS IA 10 / 61

  • Introduction

    Reinforcement Learning (RL)

    Objective of Reinforcement learning

    Automatic acquisition of skills for decision making in a complex and uncertain environment

    born around the 1960s

    intersection of two fields :
    - Computational Neuroscience
    - Experimental Psychology : adaptive behaviour

    Xavier Siebert DEFIS IA 11 / 61

  • Introduction

    Experimental Psychology : animal conditioning (Pavlov, 1904)

    Reinforcement Learning is linked to the way animals learn
    The term “reinforcement” was first used by Pavlov in 1927

    Xavier Siebert DEFIS IA 12 / 61

  • Introduction

    Experimental Psychology : from the dog’s perspective...

    Reinforcement Learning is linked to the way humans learn

    Xavier Siebert DEFIS IA 13 / 61

  • Introduction

    Application : Games

    Source: David Silver

    Each action influences the agent’s future state

    Success is measured by a reward signal (e.g., score)

    Goal is to select actions to maximize future reward

    Xavier Siebert DEFIS IA 14 / 61

  • Introduction

    Application : Games

    Atari Breakout game.

    actions : left / right

    (occasional) feedback information = score

    Image credit: DeepMind.

    history : Checkers (Samuel, 1952), TD-Gammon (Tesauro, 1992), . . .

    Xavier Siebert DEFIS IA 15 / 61

  • Introduction

    Application : Robots

    observations = positions and angles of moving parts

    actions = forces, torques, ...

    reinforcement = keep balance / move / specific task / . . .

    https://www.youtube.com/watch?v=VCdxqn0fcnE

    Xavier Siebert DEFIS IA 16 / 61

  • Introduction

    Toy Example (for practical sessions)

    GridWorld

    Source: ai.berkeley.edu

    we use this toy example to model :
    - Markov Decision Processes (MDP)
    - basic Reinforcement Learning (RL)
    - Q-Learning

    Xavier Siebert DEFIS IA 17 / 61


  • Introduction

    Toy Example (for practical sessions)

    PacMan

    Source: ai.berkeley.edu

    we use this toy example to model :
    - Q-Learning
    - approximate Q-Learning
    - deep Q-Learning

    Xavier Siebert DEFIS IA 18 / 61


  • Markov Decision Processes

    Menu

    1 Introduction

    2 Markov Decision Processes
      - Definitions
      - Policies
      - Rewards
      - Value Function
      - Bellman Equation
      - Value Iteration
      - Policy Iteration

    Xavier Siebert DEFIS IA 19 / 61

  • Markov Decision Processes Definitions

    Markov Chain

    Definition : Markov chain

    A Markov chain is a directed graph whose nodes are states (∈ S) and whose edges are weighted by the transition probabilities (∈ 𝒯) from one state to another.

    example : weather forecast

    [Diagram : two-state Markov chain between Sunny and Rainy, with transition probabilities 0.6, 0.4, 0.7 and 0.3 on its edges]

    no actions nor rewards, only transition probabilities between states

    Xavier Siebert DEFIS IA 20 / 61
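
    To make the weather example concrete, here is a minimal Python sketch that samples a trajectory from such a chain. The exact assignment of the probabilities (0.6/0.4 from Sunny, 0.7/0.3 from Rainy) is an assumption read off the reconstructed diagram above.

    import random

    # Transition probabilities of the two-state weather chain (edge values
    # are an assumption read off the diagram).
    T = {
        "Sunny": {"Sunny": 0.6, "Rainy": 0.4},
        "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    }

    def simulate(start, steps, seed=0):
        """Sample a trajectory (s_0, s_1, ..., s_steps) from the chain."""
        rng = random.Random(seed)
        s, trajectory = start, [start]
        for _ in range(steps):
            nxt = list(T[s])
            s = rng.choices(nxt, weights=[T[s][x] for x in nxt], k=1)[0]
            trajectory.append(s)
        return trajectory

    print(simulate("Sunny", 10))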

  • Markov Decision Processes Definitions

    Markov Chain

    Can be used to model a discrete-time dynamical system

    (s_t)_{t∈N}, where s_t ∈ S = state space

    S_t is a random variable; s_t is a realization of S_t

    Definition : Markov property

    The transition probabilities T ∈ 𝒯 for a Markov chain,

    T(s, s′) = P(S_{t+1} = s′ | S_t = s),

    only depend on S_t, not on the past history (S_{t−1}, S_{t−2}, ...).

    Xavier Siebert DEFIS IA 21 / 61

  • Markov Decision Processes Definitions

    Markov Chain : Markov property rephrased

    Given the present state, the future and the past are independent

    Xavier Siebert DEFIS IA 22 / 61

  • Markov Decision Processes Definitions

    Markov Decision Process (MDP)

    Definition : Markov Decision Process (MDP)

    A Markov Decision Process (MDP) is a discrete-time stochastic control process characterized by a 5-tuple {S, A, 𝒯, T, R}.

    we assume that the system is fully observable

    a transition graph summarizes the dynamics of a (finite) MDP

    Xavier Siebert DEFIS IA 23 / 61

  • Markov Decision Processes Definitions

    Markov Decision Process (MDP)

    S = state space = {Cool, Warm, Overheated}
    A = action space = {Slow, Fast}
    𝒯 = transition space = {(Warm, Slow, Warm), ...}

    Xavier Siebert DEFIS IA 24 / 61

  • Markov Decision Processes Definitions

    Markov Decision Process (MDP) : transitions are stochastic !

    the selected action does not always lead to the same state

    Source: ai.berkeley.edu

    Xavier Siebert DEFIS IA 25 / 61


  • Markov Decision Processes Definitions

    Markov Decision Process (MDP) : transitions are stochastic !

    closer look at action Slow from state Warm

    does not always lead to the same state !
    - P(S_{t+1} = Warm | S_t = Warm, A_t = Slow) = 0.5
    - P(S_{t+1} = Cool | S_t = Warm, A_t = Slow) = 0.5

    Xavier Siebert DEFIS IA 26 / 61

  • Markov Decision Processes Definitions

    Markov Decision Process (MDP)

    T : transition function (conditional probability)

    T : 𝒯 → [0, 1] : (s, a, s′) → T(s, a, s′) = P(S_{t+1} = s′ | A_t = a, S_t = s)

    R : reward function

    R : 𝒯 → ℝ : (s, a, s′) → R(s, a, s′)

    Xavier Siebert DEFIS IA 27 / 61

  • Markov Decision Processes Definitions

    Markov Decision Process (MDP)

    T is known, for example :
    - P(Warm | Slow, Warm) = 0.5
    - P(Cool | Slow, Warm) = 0.5

    R is known :
    - R(s, Slow, s′) = +1
    - R(s, Fast, s′) = −10 if s′ = Overheated, +2 otherwise

    Xavier Siebert DEFIS IA 28 / 61
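
    The racing-car MDP of these slides can be written down as a small data structure. The sketch below is a plausible reconstruction: the probabilities and rewards quoted above are used as given, while the transitions not spelled out here (Cool + Slow, Cool + Fast, Warm + Fast) are inferred from the numerical examples later in the deck, so treat them as assumptions.

    # Plausible reconstruction of the racing-car MDP (assumption: transitions
    # not quoted on these slides are inferred from the worked examples later).
    # Each entry maps (s, a) to a list of (probability, next state, reward).
    MDP = {
        ("Cool", "Slow"): [(1.0, "Cool", +1)],
        ("Cool", "Fast"): [(0.5, "Cool", +2), (0.5, "Warm", +2)],
        ("Warm", "Slow"): [(0.5, "Cool", +1), (0.5, "Warm", +1)],
        ("Warm", "Fast"): [(1.0, "Overheated", -10)],
        # "Overheated" is terminal: no actions are available there.
    }
    STATES = ["Cool", "Warm", "Overheated"]
    ACTIONS = ["Slow", "Fast"]

    def actions(s):
        """Actions available in state s (none in the terminal state)."""
        return [a for a in ACTIONS if (s, a) in MDP]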

  • Markov Decision Processes Definitions

    Markov Decision Process (MDP)

    in general, the consequences of the actions could depend on all the past states, rewards and actions :

    P{S_{t+1} = s ; R_{t+1} = r | S_t = s_t, R_t = r_t, A_t = a_t, ..., S_0 = s_0, R_0 = r_0, A_0 = a_0}

    Definition : Markov property for a Markov Decision Process

    The consequences of an action only depend on the last step :

    P{S_{t+1} = s ; R_{t+1} = r | S_t = s_t, R_t = r_t, A_t = a_t}

    Thus the agent does not need to look at the full history.

    Xavier Siebert DEFIS IA 29 / 61

  • Markov Decision Processes Definitions

    Markov Decision Process (MDP)

    In a finite MDP, S, A and 𝒯 all have a finite number of elements.

    an abstract and flexible framework

    Xavier Siebert DEFIS IA 30 / 61

  • Markov Decision Processes Policies

    MDP and Policies

    policy π ∈ Π = strategy, defines the behaviour of an agent
    - deterministic π(s) : defines the action a to perform in state s (with P = 1)
    - stochastic π(s, a) : gives the probability to select a from state s

    any policy can be stationary (π_t = π, ∀t) or not

    Definition : stochastic policy

    A stochastic policy is a distribution over actions given states,

    π(a, s) = π(a|s) = P[A_t = a | S_t = s]

    Xavier Siebert DEFIS IA 31 / 61
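
    As an illustration of a stochastic policy, the sketch below samples A_t ~ π(·|s) for the Cool/Warm states of the racing-car MDP; the probability values are made up for the example and are not taken from the slides.

    import random

    # Hypothetical stochastic policy pi(a|s); the numbers are illustrative only.
    pi = {
        "Cool": {"Slow": 0.2, "Fast": 0.8},
        "Warm": {"Slow": 0.9, "Fast": 0.1},
    }

    def sample_action(s, rng=random.Random(0)):
        """Draw A_t ~ pi(.|s); a deterministic policy is the special case
        where one action has probability 1."""
        acts, probs = zip(*pi[s].items())
        return rng.choices(acts, weights=probs, k=1)[0]

    print([sample_action("Cool") for _ in range(5)])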

  • Markov Decision Processes Policies

    MDP - HANDS ON : Grid World

    do section 1.1 of RL1

    goal : move the agent (•) in the grid and understand stochasticity

    python gridworld.py -m -g BookGrid --noise=0.2

    the stochastic behavior is controlled by the parameter --noise

    it is the probability that the agent moves in a direction other than the one expected from the action

    Xavier Siebert DEFIS IA 32 / 61

  • Markov Decision Processes Policies

    Markov Decision Process (MDP)

    how to “solve” an MDP ?

    goal : find the optimal policy π∗

    Source: ai.berkeley.edu

    to find the optimal policy π∗, we have to maximize some reward

    Xavier Siebert DEFIS IA 33 / 61


  • Markov Decision Processes Rewards

    Markov Decision Process (MDP) : Cumulative Reward

    first (naïve) definition of the return

    expected cumulative reward from the current state S_t = s
    - after N steps (finite horizon) :

      E[R_{t+1} + R_{t+2} + ... + R_{t+N} | S_t = s]

    - after an infinite number of steps (infinite horizon) :

      E[ Σ_{i=0}^{∞} R_{t+i+1} | S_t = s ]

    this can lead to convergence issues

    need another definition

    Xavier Siebert DEFIS IA 34 / 61

  • Markov Decision Processes Rewards

    Markov Decision Process (MDP) : more general definition : Discounted Reward = Return

    expected discounted reward = return from the current state S_t = s

    E[ Σ_{i=0}^{∞} γ^i R_{t+i+1} | S_t = s ]   (the sum inside the expectation is denoted G_t)

    where the discount factor γ ∈ [0, 1]

    - γ → 0 : immediate reward matters (“myopic” agent)
    - γ → 1 : all rewards matter equally (“far-sighted” agent)

    reward hypothesis : any goal can be represented by the expected value of the cumulative sum of a received scalar signal (the return), which the agent tries to maximize

    Xavier Siebert DEFIS IA 35 / 61
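
    A tiny sketch of the discounted return G_t = Σ_i γ^i R_{t+i+1}, computed here for a finite list of future rewards (a simplifying assumption of the sketch; with γ < 1 the infinite series converges).

    # Discounted return over a finite list of future rewards R_{t+1}, R_{t+2}, ...
    def discounted_return(rewards, gamma):
        g = 0.0
        for i, r in enumerate(rewards):
            g += (gamma ** i) * r
        return g

    print(discounted_return([1, 1, 1, 1], gamma=0.5))  # 1 + 0.5 + 0.25 + 0.125 = 1.875
    print(discounted_return([1, 1, 1, 1], gamma=1.0))  # plain (undiscounted) sum = 4.0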

  • Markov Decision Processes Value Function

    Markov Decision Process (MDP)

    Definition : Value Function for policy π

    For a given policy π, a value function

    v_π : S → ℝ

    associates to any state s ∈ S the expected (discounted) reward obtained by following π from s :

    v_π(s) := E_π[G_t | S_t = s] = E_π[ Σ_{i=0}^{∞} γ^i R_{t+i+1} | S_t = s ]   ∀s ∈ S

    Intuitively, a value function encodes our knowledge about the system, and specifies what is good in the long run

    Xavier Siebert DEFIS IA 36 / 61

  • Markov Decision Processes Value Function

    Markov Decision Process (MDP)

    Definition : Value Function

    A value function

    v : S → ℝ

    associates to any state s ∈ S the expected (discounted) reward

    v(s) := E[G_t | S_t = s] = E[ Σ_{i=0}^{∞} γ^i R_{t+i+1} | S_t = s ]   ∀s ∈ S

    Xavier Siebert DEFIS IA 37 / 61

  • Markov Decision Processes Value Function

    Markov Decision Process (MDP) : Value Function and recurrence relationship

    from the definition

    G_t := Σ_{i=0}^{∞} γ^i R_{t+i+1}

    we can deduce the recurrence

    G_t = R_{t+1} + γ G_{t+1}

    the value function can thus be decomposed into two parts:

    v(s) = E[R_{t+1} + γ G_{t+1} | S_t = s]
         = E[ R_{t+1} (immediate reward) + γ v(S_{t+1}) (value of the successor) | S_t = s ]
         = E[R_{t+1} | S_t = s] + E[γ v(S_{t+1}) | S_t = s]

    Xavier Siebert DEFIS IA 38 / 61
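
    The recurrence G_t = R_{t+1} + γ G_{t+1} can be checked numerically on any reward sequence; the values below are made up for the check.

    # Numerical check of G_t = R_{t+1} + gamma * G_{t+1} on an arbitrary sequence.
    rewards = [1, 2, 0, 3]          # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}
    gamma = 0.5
    g = lambda rs: sum((gamma ** i) * r for i, r in enumerate(rs))
    print(g(rewards))                                # G_t          = 2.375
    print(rewards[0] + gamma * g(rewards[1:]))       # R_{t+1} + γ G_{t+1} = 2.375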

  • Markov Decision Processes Bellman Equation

    Markov Decision Process (MDP) : Optimal Policies and Optimal Value Functions

    value functions define a partial ordering over policies

    π dominates π′ if (assuming the policies are comparable)
    1. ∀s, v_π(s) ≥ v_{π′}(s)
    2. ∃s : v_π(s) > v_{π′}(s)

    a policy π* is optimal if it is not dominated by any other

    π*(s) = arg max_π v_π(s)

    there can be many optimal policies, but they all have the same value

    v*(s) = v_{π*}(s) = max_π v_π(s)

    how to find v∗ ?

    Xavier Siebert DEFIS IA 39 / 61

  • Markov Decision Processes Bellman Equation

    Markov Decision Process (MDP)

    Bellman Equations (1957)

    For a given state s, the optimal value v∗(s) is the solution of

    v(s) = max_a Σ_{s′} T(s, a, s′) [R(s, a, s′) + γ v(s′)]

    “Backup diagram”

    Xavier Siebert DEFIS IA 40 / 61
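
    As a concrete instance, a single Bellman backup for s = Cool with v ≡ 0 and γ = 0.5, using the reconstructed racing-car MDP from the earlier sketch (an assumption); the result, 2, matches v_1(Cool) in the value-iteration example later on.

    # One Bellman backup: max over actions of sum_s' T(s,a,s')[R(s,a,s') + gamma*v(s')].
    MDP = {("Cool", "Slow"): [(1.0, "Cool", 1)],
           ("Cool", "Fast"): [(0.5, "Cool", 2), (0.5, "Warm", 2)]}
    v = {"Cool": 0.0, "Warm": 0.0}
    gamma = 0.5
    backup = max(sum(p * (r + gamma * v[s2]) for p, s2, r in MDP[("Cool", a)])
                 for a in ("Slow", "Fast"))
    print(backup)  # 2.0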

  • Markov Decision Processes Bellman Equation

    Markov Decision Process (MDP) : Optimality Equations (Bellman)

    introducing the operator L:

    Lv_π(s) = max_a Σ_{s′} T(s, a, s′) [R(s, a, s′) + γ v_π(s′)]

    the optimal value v*(s) is the solution of

    v_π(s) = Lv_π(s)

    fixed-point theorems (Banach)

    how to compute v*(s) in practice ?
    - compute explicitly, for a finite MDP
    - linear programming
    - value iteration
    - policy iteration (= strategy iteration)

    Xavier Siebert DEFIS IA 41 / 61

  • Markov Decision Processes Bellman Equation

    Markov Decision Process (MDP) : compute explicitly, for a finite MDP

    For finite MDP with n states

    Bellman gives a system of n equations in n unknowns

    Computational complexity is O(n^3)

    Direct solution only possible for a small MDP

    Otherwise : need approximate methods

    Xavier Siebert DEFIS IA 42 / 61

  • Markov Decision Processes Bellman Equation

    Markov Decision Process (MDP) : Linear Programming

    Property

    If v ∈ V minimizes Σ_{s∈S} v(s) under the constraints

    v ≥ Lv

    then v = v*.

    we can thus use any linear programming solver

    in practice : slow. . .

    Xavier Siebert DEFIS IA 43 / 61
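
    A hedged sketch of this linear-programming formulation using scipy.optimize.linprog, applied to the racing-car MDP reconstructed earlier (an assumption). Pinning the terminal state to 0 through its bounds is a modelling choice of this sketch, not something stated on the slide.

    import numpy as np
    from scipy.optimize import linprog

    MDP = {("Cool", "Slow"): [(1.0, "Cool", 1)],
           ("Cool", "Fast"): [(0.5, "Cool", 2), (0.5, "Warm", 2)],
           ("Warm", "Slow"): [(0.5, "Cool", 1), (0.5, "Warm", 1)],
           ("Warm", "Fast"): [(1.0, "Overheated", -10)]}
    states = ["Cool", "Warm", "Overheated"]
    idx = {s: i for i, s in enumerate(states)}
    gamma = 0.5

    # One inequality per (s, a): v(s) >= sum_s' T(s,a,s')[R(s,a,s') + gamma v(s')],
    # rewritten as A_ub @ v <= b_ub for linprog.
    A_ub, b_ub = [], []
    for (s, a), outcomes in MDP.items():
        row = np.zeros(len(states))
        row[idx[s]] -= 1.0
        rhs = 0.0
        for p, s2, r in outcomes:
            row[idx[s2]] += gamma * p
            rhs -= p * r
        A_ub.append(row)
        b_ub.append(rhs)

    # Minimize sum_s v(s); the terminal state is pinned to 0 via its bounds.
    bounds = [(None, None), (None, None), (0, 0)]
    res = linprog(c=np.ones(len(states)), A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    print(dict(zip(states, np.round(res.x, 3))))  # expected: Cool 3.5, Warm 2.5, Overheated 0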

  • Markov Decision Processes Value Iteration

    Markov Decision Process (MDP) : Value Iteration (VI)

    construct the sequence {v_0, v_1, ..., v_k, v_{k+1}, ...} where

    v_{k+1} = Lv_k

    iterations to dynamically evaluate Bellman’s equation

    v_{k+1}(s) ← max_a Σ_{s′} T(s, a, s′) [R(s, a, s′) + γ v_k(s′)]

    Xavier Siebert DEFIS IA 44 / 61

  • Markov Decision Processes Value Iteration

    Markov Decision Process (MDP) : Value Iteration (VI)

    fixed-point theorem shows that

    v*(s) = lim_{k→∞} v_k(s)

    if the stopping criterion is

    ‖v_{k+1} − v_k‖ < ε

    then we can prove that the solution is close to the optimum, that is:

    ‖v_k − v*‖ < (2γ / (1 − γ)) ε

    Xavier Siebert DEFIS IA 45 / 61

  • Markov Decision Processes Value Iteration

    Markov Decision Process (MDP) : Value Iteration (VI)

    Algorithm 1 Value Iteration

    Initialize v_0(s) for all states; T and R are known
    k = −1
    repeat
      k = k + 1
      for s ∈ S do
        v_{k+1}(s) = max_a Σ_{s′} T(s, a, s′)[R(s, a, s′) + γ v_k(s′)]
      end for
    until ‖v_{k+1} − v_k‖ < ε
    return v_k

    Xavier Siebert DEFIS IA 46 / 61
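
    A minimal Python sketch of Algorithm 1, run on the reconstructed racing-car MDP (the transition model is an assumption, as noted earlier). The first iterates reproduce the worked example on the following slides, v_1 = (2, 1, 0) and v_2 = (2.75, 1.75, 0), and the limit is v* ≈ (3.5, 2.5, 0).

    MDP = {("Cool", "Slow"): [(1.0, "Cool", 1)],
           ("Cool", "Fast"): [(0.5, "Cool", 2), (0.5, "Warm", 2)],
           ("Warm", "Slow"): [(0.5, "Cool", 1), (0.5, "Warm", 1)],
           ("Warm", "Fast"): [(1.0, "Overheated", -10)]}
    STATES = ["Cool", "Warm", "Overheated"]

    def value_iteration(gamma=0.5, eps=1e-6):
        v = {s: 0.0 for s in STATES}          # v_0(s) = 0 for all states
        while True:
            v_new = {}
            for s in STATES:
                acts = [a for (s0, a) in MDP if s0 == s]
                if not acts:                  # terminal state keeps value 0
                    v_new[s] = 0.0
                    continue
                v_new[s] = max(sum(p * (r + gamma * v[s2])
                                   for p, s2, r in MDP[(s, a)]) for a in acts)
            if max(abs(v_new[s] - v[s]) for s in STATES) < eps:
                return v_new
            v = v_new

    print({s: round(x, 3) for s, x in value_iteration().items()})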

  • Markov Decision Processes Value Iteration

    Markov Decision Process (MDP) : Value Iteration (VI) - numerical example

    starting point : all values v0(s) initialized to 0

    c w o

    v0 0 0 0

    we take γ = 0.5

    Xavier Siebert DEFIS IA 47 / 61

  • Markov Decision Processes Value Iteration

    Markov Decision Process (MDP) : Value Iteration (VI) - numerical example

    first round of updates : v_1 from v_0

    v_1(c) = max{1·(1 + 0.5×0), 0.5·(2 + 0.5×0) + 0.5·(2 + 0.5×0)} = max{1, 2} = 2

    v_1(w) = max{0.5·(1 + 0.5×0) + 0.5·(1 + 0.5×0), 1·(−10 + 0.5×0)} = max{1, −10} = 1

    v_1(o) = max{0} = 0

    Xavier Siebert DEFIS IA 48 / 61

  • Markov Decision Processes Value Iteration

    Markov Decision Process (MDP) : Value Iteration (VI) - numerical example

    first round of updates : v_1 from v_0

           c     w     o
    v_0    0     0     0
    v_1    2     1     0

    Xavier Siebert DEFIS IA 49 / 61

  • Markov Decision Processes Value Iteration

    Markov Decision Process (MDP) : Value Iteration (VI) - numerical example

    second round of updates : v_2 from v_1

    v_2(c) = max{1·(1 + 0.5×2), 0.5·(2 + 0.5×2) + 0.5·(2 + 0.5×1)} = max{2, 2.75} = 2.75

    v_2(w) = max{0.5·(1 + 0.5×2) + 0.5·(1 + 0.5×1), 1·(−10 + 0.5×0)} = max{1.75, −10} = 1.75

    v_2(o) = max{0} = 0

    Xavier Siebert DEFIS IA 50 / 61

  • Markov Decision Processes Value Iteration

    Markov Decision Process (MDP) : Value Iteration (VI) - numerical example

    second round of updates : v_2 from v_1

           c       w       o
    v_0    0       0       0
    v_1    2       1       0
    v_2    2.75    1.75    0

    Xavier Siebert DEFIS IA 51 / 61

  • Markov Decision Processes Policy Iteration

    Markov Decision Process (MDP) : Policy Iteration (PI) = Strategy Iteration (SI)

    two steps, after initialization of v^{π_0}(s) :

    1. policy evaluation : solve (no max)

       v_k^{π_k} = L_{π_k} v_k^{π_k}

       equivalent to iterating

       v_{k+1}^{π_k}(s) ← Σ_{s′} T(s, π_k(s), s′)[R(s, π_k(s), s′) + γ v_k^{π_k}(s′)]

    2. policy improvement :

       π_{k+1} = arg max_{π∈Π} Σ_{s′} T(s, π(s), s′)[R(s, π(s), s′) + γ v_k^{π_k}(s′)]

    stop when the policy does not change anymore (π_k = π_{k+1})

    Xavier Siebert DEFIS IA 52 / 61

  • Markov Decision Processes Policy Iteration

    Markov Decision Process (MDP) : Policy Iteration (PI) = Strategy Iteration (SI)

    Algorithm 2 Policy Iteration

    Initialize π_0 with some policy
    k = −1
    repeat
      k = k + 1
      for s ∈ S do
        solve v_{π_k}(s) = Σ_{s′} T(s, π_k(s), s′)[R(s, π_k(s), s′) + γ v_{π_k}(s′)]
      end for
      for s ∈ S do
        π_{k+1}(s) = arg max_{π∈Π} Σ_{s′} T(s, π(s), s′)[R(s, π(s), s′) + γ v_{π_k}(s′)]
      end for
    until π_{k+1}(s) = π_k(s)
    return v_k, π_k

    Xavier Siebert DEFIS IA 53 / 61
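
    A minimal Python sketch of Algorithm 2 on the same reconstructed racing-car MDP (an assumption); policy evaluation is done here by solving the linear system with numpy, which is one possible choice for the "solve (no max)" step.

    import numpy as np

    MDP = {("Cool", "Slow"): [(1.0, "Cool", 1)],
           ("Cool", "Fast"): [(0.5, "Cool", 2), (0.5, "Warm", 2)],
           ("Warm", "Slow"): [(0.5, "Cool", 1), (0.5, "Warm", 1)],
           ("Warm", "Fast"): [(1.0, "Overheated", -10)]}
    STATES = ["Cool", "Warm", "Overheated"]
    idx = {s: i for i, s in enumerate(STATES)}

    def evaluate(pi, gamma):
        """Solve v(s) = sum_s' T(s,pi(s),s')[R + gamma v(s')] as a linear system."""
        n = len(STATES)
        A, b = np.eye(n), np.zeros(n)
        for s in STATES:
            if s not in pi:                  # terminal state: v(s) = 0
                continue
            for p, s2, r in MDP[(s, pi[s])]:
                A[idx[s], idx[s2]] -= gamma * p
                b[idx[s]] += p * r
        return np.linalg.solve(A, b)

    def policy_iteration(gamma=0.5):
        pi = {"Cool": "Slow", "Warm": "Slow"}         # pi_0 = "always go slow"
        while True:
            v = evaluate(pi, gamma)                   # (1) policy evaluation
            new_pi = {}
            for s in pi:                              # (2) policy improvement
                acts = [a for (s0, a) in MDP if s0 == s]
                new_pi[s] = max(acts, key=lambda a: sum(p * (r + gamma * v[idx[s2]])
                                                        for p, s2, r in MDP[(s, a)]))
            if new_pi == pi:
                return pi, v
            pi = new_pi

    pi_star, v_star = policy_iteration()
    print(pi_star)              # expected: {'Cool': 'Fast', 'Warm': 'Slow'}
    print(np.round(v_star, 3))  # expected: [3.5, 2.5, 0.]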

  • Markov Decision Processes Policy Iteration

    Markov Decision Process (MDP) : Policy Iteration (PI)

    the policy π(s) is often more important than the precise value v∗(s)

    policy iteration can converge even though the values did not

    Xavier Siebert DEFIS IA 54 / 61

  • Markov Decision Processes Policy Iteration

    Markov Decision Process (MDP) : Policy Iteration (PI) - numerical example

    starting point :
    - initial policy π_0 = “always go slow”
    - all values v_{π_0}(s) initialized to 0

    assume γ = 0.5

    c w o

    π0 slow slow -

    Xavier Siebert DEFIS IA 55 / 61

  • Markov Decision Processes Policy Iteration

    Markov Decision Process (MDP) : Policy Iteration (PI) - numerical example

    first round of updates : (1) policy evaluation

    v_{π_0}(c) = 1·(1 + 0.5 × v_{π_0}(c))
    v_{π_0}(w) = 0.5·(1 + 0.5 × v_{π_0}(c)) + 0.5·(1 + 0.5 × v_{π_0}(w))

    solving these equations yields

    v_{π_0}(c) = 2,   v_{π_0}(w) = 2

    Xavier Siebert DEFIS IA 56 / 61
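
    The two policy-evaluation equations above can be checked numerically; a small numpy sketch (with γ = 0.5, as in the example):

    import numpy as np

    # v(c) = 1*(1 + 0.5*v(c))                         ->  0.5*v(c)              = 1
    # v(w) = 0.5*(1 + 0.5*v(c)) + 0.5*(1 + 0.5*v(w))  -> -0.25*v(c) + 0.75*v(w) = 1
    A = np.array([[0.5,   0.0 ],
                  [-0.25, 0.75]])
    b = np.array([1.0, 1.0])
    print(np.linalg.solve(A, b))  # [2. 2.]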

  • Markov Decision Processes Policy Iteration

    Markov Decision Process (MDP) : Policy Iteration (PI) - numerical example

    first round of updates : (2) policy improvement

    π_1(c) = arg max {slow : 1·(1 + 0.5×2), fast : 0.5·(2 + 0.5×2) + 0.5·(2 + 0.5×2)}
           = arg max {slow : 2, fast : 3} = fast

    π_1(w) = arg max {slow : 0.5·(1 + 0.5×2) + 0.5·(1 + 0.5×2), fast : 1·(−10 + 0.5×0)}
           = arg max {slow : 2, fast : −10} = slow

    Xavier Siebert DEFIS IA 57 / 61

  • Markov Decision Processes Policy Iteration

    Markov Decision Process (MDP) : Policy Iteration (PI) - numerical example

           c      w      o
    π_0    slow   slow   -
    π_1    fast   slow   -

    Xavier Siebert DEFIS IA 58 / 61

  • Markov Decision Processes Policy Iteration

    Markov Decision Process (MDP) : Policy Iteration (PI) - numerical example

    second round of updates : policy evaluation and improvement yields

           c      w      o
    π_0    slow   slow   -
    π_1    fast   slow   -
    π_2    fast   slow   -

    the policy converged already !

    Xavier Siebert DEFIS IA 59 / 61

  • Markov Decision Processes Policy Iteration

    HANDS ON : MDP

    do sections 1.2 → 1.4 of RL1

    goal : explore MDP algorithms with simple labyrinths

    Xavier Siebert DEFIS IA 60 / 61

  • Markov Decision Processes Policy Iteration

    HANDS ON : MDP

    Sometimes the agent takes too long to find the exit
    One can also add a living reward to speed up the process
    ∀ (s, a, s′) with no reward, add a negative reward :
    - −0.01 (the agent has plenty of time to get out)
    - −2.0 (hurry up !)

    Source: ai.berkeley.edu

    Xavier Siebert DEFIS IA 61 / 61

