TRANSCRIPT
-
Challenges in Artificial Intelligence (Défis en Intelligence Artificielle). Reinforcement Learning (1)
Xavier Siebert
MathRO, Université de Mons
http://www.ig.fpms.ac.be/~siebertx
-
The Reinforcement Learning team
Pierre De Handschutter, Bastien Vanderplaetse
-
Kasparov vs. Deep Blue (1997)
-
Ke Jie vs. AlphaGo Zero (2017)
-
Complexity of the task
Tic-Tac-Toe : 255 168 unique games
Source: Wikimedia
Chess: ≈ 10^120 nodes for 80 moves
Go: ≈ 10^170 − 10^360 nodes for 150 moves
Number of atoms in the universe: ≈ 10^80
-
Deep Blue (1997) - AlphaGo Zero (2017)
Computing power is important... but not enough!
Deep Blue (1997):
- knowledge engineering
- evaluation function
- libraries of moves
- computational power
AlphaGo Zero (2017):
- deep reinforcement learning
- zero knowledge engineering
- computational power: 48 TPUs
-
Deep Reinforcement Learning (Deep RL)
Deep RL = Reinforcement Learning + Deep Learning
computational approach to learning from interaction
can solve a wide range of complex sequential decision-making tasks that were previously out of reach for a machine
-
Topics covered in this course
1 Markov Decision Processes
2 Reinforcement Learning
3 Deep Reinforcement Learning
Feel free to jump to part 3 if you already master 1 and 2.
-
Menu
1 Introduction
2 Markov Decision Processes
-
Reinforcement Learning Scheme
[Diagram: the agent takes an action a_t in the environment; the environment returns a new state s_{t+1} and a reward r_{t+1}]
an (artificial) agent chooses actions while interacting with its environment
state s_t = (partial) information about the environment at time t
policy π = strategy = procedure followed to select actions
objective = maximize the cumulative reward
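To make the loop concrete, here is a minimal Python sketch of one episode of interaction; `env` and `policy` are hypothetical placeholders (an environment with reset/step and a function from states to actions), not part of the course code.

```python
# Minimal sketch of the agent-environment loop.
# `env` and `policy` are hypothetical placeholders, not course code.
def run_episode(env, policy, max_steps=100):
    state = env.reset()                          # initial state s_0
    total_reward = 0.0
    for t in range(max_steps):
        action = policy(state)                   # a_t = pi(s_t)
        state, reward, done = env.step(action)   # s_{t+1}, r_{t+1}
        total_reward += reward                   # accumulate the reward
        if done:
            break
    return total_reward
```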
-
Reinforcement Learning (RL)
Objective of Reinforcement Learning
Automatic acquisition of skills for decision making in a complex and uncertain environment
born around the 1960s
at the intersection of two fields:
- Computational Neuroscience
- Experimental Psychology: adaptive behaviour
-
Experimental Psychology
Animal conditioning (Pavlov, 1904)
Reinforcement Learning is linked to the way animals learn
The term “reinforcement” was first used by Pavlov in 1927
-
Experimental Psychology
From the dog's perspective...
Reinforcement Learning is linked to the way humans learn
-
Application: Games
Source: David Silver
Each action influences the agent’s future state
Success is measured by a reward signal (e.g., score)
Goal is to select actions to maximize future reward
-
Application: Games
Atari Breakout game.
actions: left / right
(occasional) feedback information = score
Image credit: DeepMind.
history: Checkers (Samuel, 1952), TD-Gammon (Tesauro, 1992), ...
-
Application: Robots
observations = positions and angles of moving parts
actions = forces, torques, ...
reinforcement = keep balance / move / specific task / ...
https://www.youtube.com/watch?v=VCdxqn0fcnE
-
Toy Example (for practical sessions)
GridWorld
Source: ai.berkeley.edu
we use this toy example to model:
- Markov Decision Processes (MDP)
- basic Reinforcement Learning (RL)
- Q-Learning
-
Toy Example (for practical sessions)
PacMan
Source: ai.berkeley.edu
we use this toy example to model:
- Q-Learning
- approximate Q-Learning
- deep Q-Learning
-
Menu
1 Introduction
2 Markov Decision Processes
  Definitions
  Policies
  Rewards
  Value Function
  Bellman Equation
  Value Iteration
  Policy Iteration
-
Markov Chain
Definition: Markov chain
A Markov chain is a directed graph whose nodes are states (∈ S) and whose edges are weighted by the transition probabilities (∈ 𝒯) from one state to another.
example: weather forecast
[Diagram: two-state Markov chain with states Sunny and Rainy and transition probabilities 0.6, 0.4, 0.7, 0.3]
no actions or rewards, only transition probabilities between states
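Such a chain is easy to simulate; below is a small Python sketch. The assignment of the probabilities 0.6/0.4 and 0.7/0.3 to the two states is an assumption read off the diagram.

```python
import random

# Two-state weather Markov chain (probability assignment assumed from
# the diagram; each row must sum to 1).
P = {
    "Sunny": {"Sunny": 0.6, "Rainy": 0.4},
    "Rainy": {"Sunny": 0.7, "Rainy": 0.3},
}

def simulate(start, n_steps):
    """Sample a trajectory (s_0, s_1, ..., s_n) from the chain."""
    path = [start]
    for _ in range(n_steps):
        probs = P[path[-1]]
        path.append(random.choices(list(probs), weights=list(probs.values()))[0])
    return path

print(simulate("Sunny", 10))  # e.g. ['Sunny', 'Sunny', 'Rainy', ...]
```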
-
Markov Chain
A Markov chain can be used to model a discrete-time dynamical system (s_t)_{t∈ℕ}, where s_t ∈ S = state space.
S_t is a random variable
s_t is a realization of S_t
Definition: Markov property
The transition probabilities T ∈ 𝒯 for a Markov chain,
T(s, s') = P(S_{t+1} = s' | S_t = s),
only depend on S_t, not on the past history (S_{t−1}, S_{t−2}, ...).
-
Markov Chain
Markov property rephrased
Given the present state, the future and the past are independent
-
Markov Decision Process (MDP)
Definition: Markov Decision Process (MDP)
A Markov Decision Process (MDP) is a discrete-time stochastic control process characterized by a 5-tuple {S, A, 𝒯, T, R} (state space, action space, transition space, transition function, reward function).
we assume that the system is fully observable
a transition graph summarizes the dynamics of a (finite) MDP
-
Markov Decision Process (MDP)
S = state space = {Cool, Warm, Overheated}
A = action space = {Slow, Fast}
𝒯 = transition space = {(Warm, Slow, Warm), ...}
-
Markov Decision Process (MDP)
Transitions are stochastic!
the selected action does not always lead to the same state
Source: ai.berkeley.edu
-
Markov Decision Process (MDP)
Transitions are stochastic!
closer look at action Slow from state Warm:
it does not always lead to the same state!
- P(S_{t+1} = Warm | S_t = Warm, A_t = Slow) = 0.5
- P(S_{t+1} = Cool | S_t = Warm, A_t = Slow) = 0.5
-
Markov Decision Process (MDP)
T: transition function (conditional probability)
T : 𝒯 → [0, 1] : (s, a, s') ↦ T(s, a, s') = P(S_{t+1} = s' | A_t = a, S_t = s)
R: reward function
R : 𝒯 → ℝ : (s, a, s') ↦ R(s, a, s')
-
Markov Decision Process (MDP)
T is known, for example:
- P(Warm | Slow, Warm) = 0.5
- P(Cool | Slow, Warm) = 0.5
R is known:
- R(s, Slow, s') = +1
- R(s, Fast, s') = −10 if s' = Overheated, +2 otherwise
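For the examples that follow, this racing-car MDP can be written out as plain Python data. The transition probabilities below are reconstructed from the numbers used in the value and policy iteration examples later in these slides; treat them as an assumption, not course code.

```python
# Racing-car MDP as Python data (probabilities reconstructed from the
# numerical examples later in the slides).
STATES = ["Cool", "Warm", "Overheated"]
ACTIONS = {"Cool": ["Slow", "Fast"],
           "Warm": ["Slow", "Fast"],
           "Overheated": []}                      # terminal: no actions
T = {  # T[(s, a)] = list of (s', probability)
    ("Cool", "Slow"): [("Cool", 1.0)],
    ("Cool", "Fast"): [("Cool", 0.5), ("Warm", 0.5)],
    ("Warm", "Slow"): [("Cool", 0.5), ("Warm", 0.5)],
    ("Warm", "Fast"): [("Overheated", 1.0)],
}

def R(s, a, s_next):
    """R(s, a, s'): +1 for Slow, +2 for Fast, -10 when overheating."""
    if a == "Slow":
        return 1.0
    return -10.0 if s_next == "Overheated" else 2.0
```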
-
Markov Decision Process (MDP)
in general, the consequences of the actions could depend on all the past states, rewards and actions:
P{S_{t+1} = s, R_{t+1} = r | S_t = s_t, R_t = r_t, A_t = a_t, ..., S_0 = s_0, R_0 = r_0, A_0 = a_0}
Definition: Markov property for a Markov Decision Process
The consequences of an action only depend on the last step:
P{S_{t+1} = s, R_{t+1} = r | S_t = s_t, R_t = r_t, A_t = a_t}
Thus the agent has no interest in looking at the full history.
-
Markov Decision Process (MDP)
In a finite MDP, S, A and 𝒯 all have a finite number of elements.
This is an abstract and flexible framework.
-
MDP and Policies
policy π ∈ Π = strategy, defines the behaviour of an agent
- deterministic π(s): defines the action a to perform in state s (with P = 1)
- stochastic π(s, a): gives the probability of selecting a from state s
any policy can be stationary (π_t = π, ∀t) or not
Definition: stochastic policy
A stochastic policy is a distribution over actions given states,
π(a, s) = π(a|s) = P[A_t = a | S_t = s]
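Sampling an action from a stochastic policy is one call to a random generator; a minimal sketch, with made-up probabilities:

```python
import random

# Draw A_t ~ pi(.|s); the probability numbers are illustrative only.
pi = {
    "Cool": {"Slow": 0.4, "Fast": 0.6},
    "Warm": {"Slow": 0.9, "Fast": 0.1},
}

def sample_action(s):
    actions = list(pi[s])
    weights = [pi[s][a] for a in actions]
    return random.choices(actions, weights=weights)[0]

print(sample_action("Warm"))  # "Slow" about 90% of the time
```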
-
MDP - HANDS ON: Grid World
do section 1.1 of RL1
goal: move the agent (•) in the grid and understand stochasticity
python gridworld.py -m -g BookGrid --noise=0.2
the stochastic behavior is controlled by the parameter --noise
it is the probability that the agent moves in a direction other than the one expected from the action
-
Markov Decision Process (MDP)
how to “solve” an MDP?
goal : find the optimal policy π∗
Source: ai.berkeley.edu
to find the optimal policy π∗, we have to maximize some reward
-
Markov Decision Process (MDP)
Cumulative Reward
first (naïve) definition of return
expected cumulative reward from current state S_t = s:
- after N steps (finite horizon):
  E[R_{t+1} + R_{t+2} + ... + R_{t+N} | S_t = s]
- after an infinite number of steps (infinite horizon):
  E[Σ_{i=0}^∞ R_{t+i+1} | S_t = s]
this can lead to convergence issues
need another definition
-
Markov Decision Process (MDP)
More general definition: Discounted Reward = Return
expected discounted reward = return from current state S_t = s:
E[Σ_{i=0}^∞ γ^i R_{t+i+1} | S_t = s] =: E[G_t | S_t = s]
where the discount factor γ ∈ [0, 1]:
- γ → 0: immediate reward matters (“myopic” agent)
- γ → 1: all rewards matter equally (“far-sighted” agent)
reward hypothesis: all goals can be represented by the expected value of the cumulative sum of a received scalar signal (the return), which the agent tries to maximize
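For a finite trajectory of rewards, the return is a few lines of Python; the sketch below computes it backwards, which is exactly the recurrence G_t = R_{t+1} + γG_{t+1} derived a few slides further.

```python
# Discounted return of a reward list [R_{t+1}, R_{t+2}, ...],
# computed backwards with G_t = R_{t+1} + gamma * G_{t+1}.
def discounted_return(rewards, gamma=0.5):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1, 1, 2]))  # 1 + 0.5*1 + 0.25*2 = 2.0
```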
-
Markov Decision Process (MDP)
Definition: Value Function for policy π
For a given policy π, a value function
v_π : S → ℝ
associates to any state s ∈ S the value of the expected (discounted) reward obtained by following π from s:
v_π(s) := E_π[G_t | S_t = s] = E_π[Σ_{i=0}^∞ γ^i R_{t+i+1} | S_t = s], ∀s ∈ S
Intuitively, a value function encodes our knowledge about the system,and specifies what is good in the long run
-
Markov Decision Process (MDP)
Definition: Value Function
A value function
v : S → ℝ
associates to any state s ∈ S the value of the expected (discounted) reward:
v(s) := E[G_t | S_t = s] = E[Σ_{i=0}^∞ γ^i R_{t+i+1} | S_t = s], ∀s ∈ S
-
Markov Decision Process (MDP)
Value Function and recurrence relationship
from the definition
G_t := Σ_{i=0}^∞ γ^i R_{t+i+1}
we can deduce the recurrence
G_t = R_{t+1} + γ G_{t+1}
the value function can thus be decomposed into two parts:
v(s) = E[R_{t+1} + γ G_{t+1} | S_t = s]
     = E[R_{t+1} (immediate reward) + γ v(S_{t+1}) (value of successor) | S_t = s]
     = E[R_{t+1} | S_t = s] + E[γ v(S_{t+1}) | S_t = s]
-
Markov Decision Process (MDP)
Optimal Policies and Optimal Value Functions
value functions define a partial ordering over policies
π dominates π' if (assuming the policies are comparable)
1. ∀s, v_π(s) ≥ v_π'(s)
2. ∃s : v_π(s) > v_π'(s)
a policy π* is optimal if it is not dominated by any other:
π*(s) = argmax_π v_π(s)
there can be many optimal policies, but they all have the same value:
v*(s) = v_{π*}(s) = max_π v_π(s)
how to find v*?
-
Markov Decision Process (MDP)
Bellman Equations (1957)
For a given state s, the optimal value v∗(s) is the solution of
v(s) = max_a Σ_{s'} T(s, a, s')[R(s, a, s') + γ v(s')]
[Figure: backup diagram]
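A single Bellman backup for one state is a direct translation of this equation; the sketch below reuses the STATES/ACTIONS/T/R structures of the racing-car sketch given earlier.

```python
# One Bellman backup for state s, given the current value table v
# (reuses ACTIONS/T/R from the racing-car sketch above).
def bellman_backup(s, v, gamma=0.5):
    if not ACTIONS[s]:              # terminal state: value 0
        return 0.0
    return max(
        sum(p * (R(s, a, s2) + gamma * v[s2]) for s2, p in T[(s, a)])
        for a in ACTIONS[s]
    )
```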
-
Markov Decision Process (MDP)
Optimality Equations (Bellman)
introducing the operator L:
Lv_π(s) = max_a Σ_{s'} T(s, a, s')[R(s, a, s') + γ v_π(s')]
the optimal value v*(s) is the solution of the fixed-point equation
v(s) = Lv(s)
existence and uniqueness follow from Banach's fixed-point theorem (L is a contraction for γ < 1)
how to compute v*(s) in practice?
- compute explicitly, for a finite MDP
- linear programming
- value iteration
- policy iteration (= strategy iteration)
-
Markov Decision Process (MDP)
Compute explicitly, for a finite MDP
For finite MDP with n states
Bellman gives a system of n equations in n unknowns
Computational complexity is O(n^3)
Direct solution only possible for a small MDP
Otherwise : need approximate methods
-
Markov Decision Process (MDP)
Linear Programming
Property
If v ∈ V minimizes Σ_{s∈S} v(s) under the constraints
v ≥ Lv
then v = v*.
we can thus use any linear programming solver
in practice: slow...
-
Markov Decision Process (MDP)
Value Iteration (VI)
construct the sequence {v_0, v_1, ..., v_k, v_{k+1}, ...} where
v_{k+1} = Lv_k
i.e., iterate to dynamically evaluate Bellman's equation:
v_{k+1}(s) ← max_a Σ_{s'} T(s, a, s')[R(s, a, s') + γ v_k(s')]
-
Markov Decision Process (MDP)
Value Iteration (VI)
the fixed-point theorem shows that
v*(s) = lim_{k→∞} v_k(s)
if the stopping criterion is
‖v_{k+1} − v_k‖ < ε
then we can prove that the solution is close to the optimum, that is:
‖v_k − v*‖ < 2γε / (1 − γ)
-
Markov Decision Process (MDP)
Value Iteration (VI)
Algorithm 1 Value Iteration
  Initialize v_0(s) for all states; T and R are known
  k = −1
  repeat
    k = k + 1
    for s ∈ S do
      v_{k+1}(s) = max_a Σ_{s'} T(s, a, s')[R(s, a, s') + γ v_k(s')]
    end for
  until ‖v_{k+1} − v_k‖ < ε
  return v_k
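A direct Python transcription of Algorithm 1, reusing the racing-car structures and the bellman_backup() sketched earlier. With γ = 0.5 the first two sweeps reproduce the v_1 and v_2 of the next slides, and for the reconstructed model the iteration converges to v*(Cool) = 3.5 and v*(Warm) = 2.5.

```python
# Algorithm 1 (Value Iteration), reusing STATES and bellman_backup().
def value_iteration(gamma=0.5, eps=1e-6):
    v = {s: 0.0 for s in STATES}          # v_0(s) = 0 for all s
    while True:
        v_new = {s: bellman_backup(s, v, gamma) for s in STATES}
        if max(abs(v_new[s] - v[s]) for s in STATES) < eps:
            return v_new                  # ||v_{k+1} - v_k|| < eps
        v = v_new

print(value_iteration())  # ~ {'Cool': 3.5, 'Warm': 2.5, 'Overheated': 0.0}
```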
-
Markov Decision Process (MDP)
Value Iteration (VI) - numerical example
starting point: all values v_0(s) initialized to 0
        c   w   o
  v_0   0   0   0
we take γ = 0.5
-
Markov Decision Process (MDP)
Value Iteration (VI) - numerical example
first round of updates: v_1 from v_0
v_1(c) = max{1(1 + 0.5×0), 0.5(2 + 0.5×0) + 0.5(2 + 0.5×0)} = max{1, 2} = 2
v_1(w) = max{0.5(1 + 0.5×0) + 0.5(1 + 0.5×0), 1(−10 + 0.5×0)} = max{1, −10} = 1
v_1(o) = max{0} = 0
-
Markov Decision Process (MDP)
Value Iteration (VI) - numerical example
first round of updates: v_1 from v_0
        c   w   o
  v_0   0   0   0
  v_1   2   1   0
-
Markov Decision Process (MDP)
Value Iteration (VI) - numerical example
second round of updates: v_2 from v_1
v_2(c) = max{1(1 + 0.5×2), 0.5(2 + 0.5×2) + 0.5(2 + 0.5×1)} = max{2, 2.75} = 2.75
v_2(w) = max{0.5(1 + 0.5×2) + 0.5(1 + 0.5×1), 1(−10 + 0.5×0)} = max{1.75, −10} = 1.75
v_2(o) = max{0} = 0
-
Markov Decision Process (MDP)
Value Iteration (VI) - numerical example
second round of updates: v_2 from v_1
        c      w      o
  v_0   0      0      0
  v_1   2      1      0
  v_2   2.75   1.75   0
-
Markov Decision Process (MDP)
Policy Iteration (PI) = Strategy Iteration (SI)
two steps, after initialization of v^{π_0}(s):
1. policy evaluation: solve (no max)
   v^{π_k} = L_{π_k} v^{π_k}
   equivalent to iterating (with inner iteration index i)
   v^{π_k}_{i+1}(s) ← Σ_{s'} T(s, π_k(s), s')[R(s, π_k(s), s') + γ v^{π_k}_i(s')]
2. policy improvement:
   π_{k+1} = argmax_{π∈Π} Σ_{s'} T(s, π(s), s')[R(s, π(s), s') + γ v^{π_k}(s')]
stop when the policy does not change anymore (π_k = π_{k+1})
-
Markov Decision Process (MDP)
Policy Iteration (PI) = Strategy Iteration (SI)
Algorithm 2 Policy Iteration
  Initialize π_0 with some policy
  k = −1
  repeat
    k = k + 1
    for s ∈ S do
      solve v^{π_k}(s) = Σ_{s'} T(s, π_k(s), s')[R(s, π_k(s), s') + γ v^{π_k}(s')]
    end for
    for s ∈ S do
      π_{k+1}(s) = argmax_{π∈Π} Σ_{s'} T(s, π(s), s')[R(s, π(s), s') + γ v^{π_k}(s')]
    end for
  until π_{k+1}(s) = π_k(s)
  return v_k, π_k
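Algorithm 2 in Python, reusing the same racing-car structures; as a simplification the exact "solve" of the evaluation step is replaced here by iterating L_π to convergence, which reaches the same fixed point.

```python
# Algorithm 2 (Policy Iteration), reusing STATES/ACTIONS/T/R.
def policy_iteration(gamma=0.5, eps=1e-9):
    pi = {s: "Slow" for s in STATES if ACTIONS[s]}    # pi_0 = always slow
    while True:
        # (1) policy evaluation: iterate v <- L_pi v until convergence
        v = {s: 0.0 for s in STATES}
        while True:
            v_new = {s: 0.0 for s in STATES}
            for s, a in pi.items():
                v_new[s] = sum(p * (R(s, a, s2) + gamma * v[s2])
                               for s2, p in T[(s, a)])
            delta = max(abs(v_new[s] - v[s]) for s in STATES)
            v = v_new
            if delta < eps:
                break
        # (2) policy improvement: act greedily with respect to v
        pi_new = {s: max(ACTIONS[s],
                         key=lambda a: sum(p * (R(s, a, s2) + gamma * v[s2])
                                           for s2, p in T[(s, a)]))
                  for s in pi}
        if pi_new == pi:                              # policy is stable
            return v, pi
        pi = pi_new

print(policy_iteration())  # pi*: Cool -> Fast, Warm -> Slow
```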
-
Markov Decision Process (MDP)
Policy Iteration (PI)
the policy π(s) is often more important than the precise value v*(s)
policy iteration can converge even before the values have converged
-
Markov Decision Process (MDP)
Policy Iteration (PI) - numerical example
starting point:
- initial policy π_0 = “always go slow”
- all values v^{π_0}(s) initialized to 0
assume γ = 0.5
        c      w      o
  π_0   slow   slow   -
-
Markov Decision Process (MDP)
Policy Iteration (PI) - numerical example
first round of updates: (1) policy evaluation
v^{π_0}(c) = 1(1 + 0.5 × v^{π_0}(c))
v^{π_0}(w) = 0.5(1 + 0.5 × v^{π_0}(c)) + 0.5(1 + 0.5 × v^{π_0}(w))
solving these equations yields
v^{π_0}(c) = 2, v^{π_0}(w) = 2
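Policy evaluation is a linear system, so it can also be solved exactly. A sketch with numpy for the 2×2 system above; the matrix rows encode the (reconstructed) Slow transitions from Cool and Warm.

```python
import numpy as np

# Exact evaluation of pi_0 = "always Slow": solve (I - gamma*P_pi) v = r_pi
# over the non-terminal states (c, w).
gamma = 0.5
P_pi = np.array([[1.0, 0.0],    # Cool, Slow -> Cool with prob. 1
                 [0.5, 0.5]])   # Warm, Slow -> Cool or Warm, 0.5 each
r_pi = np.array([1.0, 1.0])     # expected immediate reward of Slow: +1
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(v)  # [2. 2.]  ->  v_pi0(c) = 2, v_pi0(w) = 2
```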
-
Markov Decision Process (MDP)
Policy Iteration (PI) - numerical example
first round of updates: (2) policy improvement
π_1(c) = argmax{slow: 1(1 + 0.5×2), fast: 0.5(2 + 0.5×2) + 0.5(2 + 0.5×2)}
       = argmax{slow: 2, fast: 3} = fast
π_1(w) = argmax{slow: 0.5(1 + 0.5×2) + 0.5(1 + 0.5×2), fast: 1(−10 + 0.5×0)}
       = argmax{slow: 2, fast: −10} = slow
-
Markov Decision Process (MDP)
Policy Iteration (PI) - numerical example
        c      w      o
  π_0   slow   slow   -
  π_1   fast   slow   -
-
Markov Decision Process (MDP)
Policy Iteration (PI) - numerical example
second round of updates: policy evaluation and improvement yield
        c      w      o
  π_0   slow   slow   -
  π_1   fast   slow   -
  π_2   fast   slow   -
the policy has already converged!
-
HANDS ON: MDP
do sections 1.2 → 1.4 of RL1
goal: explore the MDP algorithms on simple mazes
-
HANDS ON: MDP
Sometimes the agent takes too long to find the exit.
One can also add a living reward to speed up the process:
∀ (s, a, s') with no reward, add a negative reward:
- −0.01 (the agent has plenty of time to get out)
- −2.0 (hurry up!)
Source: ai.berkeley.edu