
  • Défis en Intelligence Artificielle. Reinforcement Learning (1)

    Xavier Siebert

    MathRO, Université de Mons

    http://www.ig.fpms.ac.be/~siebertx

    Xavier Siebert DEFIS IA 1 / 61

  • The Reinforcement

    Pierre De Handschutter Bastien Vanderplaetse

    Xavier Siebert DEFIS IA 2 / 61

  • Kasparov vs. Deep Blue (1997)

    Xavier Siebert DEFIS IA 3 / 61

  • Ke Jie vs. Alpha Go Zero (2017)

    Xavier Siebert DEFIS IA 4 / 61

  • Complexity of the task

    Tic-Tac-Toe : 255 168 unique games

    Chess : ≈ 10^120 nodes for 80 moves

    Go : ≈ 10^170 − 10^360 nodes for 150 moves

    Number of atoms in the universe : ≈ 10^80

    Source: Wikimedia

    Xavier Siebert DEFIS IA 5 / 61

  • Deep Blue (1997) - Alpha Go Zero (2017)

    Computer power is important... but not enough !

    Deep Blue : knowledge engineering, evaluation function, libraries of moves, computational power

    Alpha Go Zero : deep reinforcement learning, zero knowledge engineering, computational power : 48 TPUs

    Xavier Siebert DEFIS IA 6 / 61

  • Deep Reinforcement Learning (Deep RL)

    Deep RL = Reinforcement Learning + Deep Learning

    computational approach to learning from interaction

    can solve a wide range of complex sequential decision-making tasks that were previously out of reach for a machine.

    Xavier Siebert DEFIS IA 7 / 61

  • Topics covered in this course

    1 Markov Decision Processes

    2 Reinforcement Learning

    3 Deep Reinforcement Learning

    Feel free to jump to part 3 if you already master 1 and 2.

    Xavier Siebert DEFIS IA 8 / 61

  • Introduction

    Menu

    1 Introduction

    2 Markov Decision Processes

    Xavier Siebert DEFIS IA 9 / 61

  • Introduction

    Reinforcement Learning Scheme

    [Diagram] Agent ⇄ Environment : the agent sends an action a_t, the environment returns a new state s_{t+1} and a reward r_{t+1}

    an (artificial) agent can choose actions, interacting with its environment

    state s_t = (partial) information about the environment at time t

    policy π = strategy = procedure followed to select actions

    objective = maximize the cumulative reward

    Xavier Siebert DEFIS IA 10 / 61

  • Introduction

    Reinforcement Learning (RL)

    Objective of Reinforcement learning

    Automatic acquisition of skills for decision making in a complex and uncertain environment

    born around the 1960s

    intersection of two fields :
    - Computational Neuroscience
    - Experimental Psychology : adaptive behaviour

    Xavier Siebert DEFIS IA 11 / 61

  • Introduction

    Experimental Psychology : animal conditioning (Pavlov, 1904)

    Reinforcement Learning is linked to the way animals learn
    The term “reinforcement” was first used by Pavlov in 1927

    Xavier Siebert DEFIS IA 12 / 61

  • Introduction

    Experimental Psychology : from the dog’s perspective...

    Reinforcement Learning is linked to the way humans learn

    Xavier Siebert DEFIS IA 13 / 61

  • Introduction

    Application : Games

    Source: David Silver

    Each action influences the agent’s future state

    Success is measured by a reward signal (e.g., score)

    Goal is to select actions to maximize future reward

    Xavier Siebert DEFIS IA 14 / 61

  • Introduction

    Application : Games

    Atari Breakout game.

    actions : left / right

    (occasional) feedback information = score

    Image credit: DeepMind.

    history : Checkers (Samuel, 1952), TD-Gammon (Tesauro, 1992), . . .

    Xavier Siebert DEFIS IA 15 / 61

  • Introduction

    Application : Robots

    observations = positions and angles of moving parts

    actions = forces, torques, ...

    reinforcement = keep balance / move / specific task / . . .

    https://www.youtube.com/watch?v=VCdxqn0fcnE

    Xavier Siebert DEFIS IA 16 / 61

  • Introduction

    Toy Example (for practical sessions)

    GridWorld

    Source: ai.berkeley.edu

    we use this toy example to model :
    - Markov Decision Processes (MDP)
    - basic Reinforcement Learning (RL)
    - Q-Learning

    Xavier Siebert DEFIS IA 17 / 61


  • Introduction

    Toy Example (for practical sessions)

    PacMan

    Source: ai.berkeley.edu

    we use this toy example to model :
    - Q-Learning
    - approximate Q-Learning
    - deep Q-Learning

    Xavier Siebert DEFIS IA 18 / 61


  • Markov Decision Processes

    Menu

    1 Introduction

    2 Markov Decision Processes
      - Definitions
      - Policies
      - Rewards
      - Value Function
      - Bellman Equation
      - Value Iteration
      - Policy Iteration

    Xavier Siebert DEFIS IA 19 / 61

  • Markov Decision Processes Definitions

    Markov Chain

    Definition : Markov chain

    A Markov chain is a directed graph whose nodes are states (∈ S) and whose edges are weighted by the transition probabilities (∈ 𝒯) from one state to another.

    example : weather forecast

    [Diagram : two-state Markov chain between Sunny and Rainy, with transition probabilities 0.6, 0.4, 0.7 and 0.3 on its edges]

    no actions nor rewards, only transition probabilities between states

    Xavier Siebert DEFIS IA 20 / 61
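
    To make the weather example concrete, here is a minimal Python sketch that samples a trajectory from such a chain. The exact assignment of the probabilities (0.6/0.4 from Sunny, 0.7/0.3 from Rainy) is an assumption read off the reconstructed diagram above.

    import random

    # Transition probabilities of the two-state weather chain (edge values
    # are an assumption read off the diagram).
    T = {
        "Sunny": {"Sunny": 0.6, "Rainy": 0.4},
        "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    }

    def simulate(start, steps, seed=0):
        """Sample a trajectory (s_0, s_1, ..., s_steps) from the chain."""
        rng = random.Random(seed)
        s, trajectory = start, [start]
        for _ in range(steps):
            nxt = list(T[s])
            s = rng.choices(nxt, weights=[T[s][x] for x in nxt], k=1)[0]
            trajectory.append(s)
        return trajectory

    print(simulate("Sunny", 10))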

  • Markov Decision Processes Definitions

    Markov Chain

    Can be used to model a discrete-time dynamical system

    (s_t)_{t∈N}, where s_t ∈ S = state space

    S_t is a random variable; s_t is a realization of S_t

    Definition : Markov property

    The transition probabilities T ∈ 𝒯 for a Markov chain,

    T(s, s′) = P(S_{t+1} = s′ | S_t = s),

    only depend on S_t, not on the past history (S_{t−1}, S_{t−2}, ...).

    Xavier Siebert DEFIS IA 21 / 61

  • Markov Decision Processes Definitions

    Markov Chain : Markov property rephrased

    Given the present state, the future and the past are independent

    Xavier Siebert DEFIS IA 22 / 61

  • Markov Decision Processes Definitions

    Markov Decision Process (MDP)

    Definition : Markov Decision Process (MDP)

    A Markov Decision Process (MDP) is a discrete-time stochastic control process characterized by a 5-tuple {S, A, 𝒯, T, R}.

    we assume that the system is fully observable

    a transition graph summarizes the dynamics of a (finite) MDP

    Xavier Siebert DEFIS IA 23 / 61

  • Markov Decision Processes Definitions

    Markov Decision Process (MDP)

    S = state space = {Cool, Warm, Overheated}
    A = action space = {Slow, Fast}
    𝒯 = transition space = {(Warm, Slow, Warm), ...}

    Xavier Siebert DEFIS IA 24 / 61

  • Markov Decision Processes Definitions

    Markov Decision Process (MDP) : transitions are stochastic !

    the selected action does not always lead to the same state

    Source: ai.berkeley.edu

    Xavier Siebert DEFIS IA 25 / 61


  • Markov Decision Processes Definitions

    Markov Decision Process (MDP) : transitions are stochastic !

    closer look at action Slow from state Warm

    does not always lead to the same state !
    - P(S_{t+1} = Warm | S_t = Warm, A_t = Slow) = 0.5
    - P(S_{t+1} = Cool | S_t = Warm, A_t = Slow) = 0.5

    Xavier Siebert DEFIS IA 26 / 61

  • Markov Decision Processes Definitions

    Markov Decision Process (MDP)

    T : transition function (conditional probability)

    T : 𝒯 → [0, 1] : (s, a, s′) → T(s, a, s′) = P(S_{t+1} = s′ | A_t = a, S_t = s)

    R : reward function

    R : 𝒯 → ℝ : (s, a, s′) → R(s, a, s′)

    Xavier Siebert DEFIS IA 27 / 61

  • Markov Decision Processes Definitions

    Markov Decision Process (MDP)

    T is known, for example :
    - P(Warm | Slow, Warm) = 0.5
    - P(Cool | Slow, Warm) = 0.5

    R is known :
    - R(s, Slow, s′) = +1
    - R(s, Fast, s′) = −10 if s′ = Overheated, +2 otherwise

    Xavier Siebert DEFIS IA 28 / 61
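
    The racing-car MDP of these slides can be written down as a small data structure. The sketch below is a plausible reconstruction: the probabilities and rewards quoted above are used as given, while the transitions not spelled out here (Cool + Slow, Cool + Fast, Warm + Fast) are inferred from the numerical examples later in the deck, so treat them as assumptions.

    # Plausible reconstruction of the racing-car MDP (assumption: transitions
    # not quoted on these slides are inferred from the worked examples later).
    # Each entry maps (s, a) to a list of (probability, next state, reward).
    MDP = {
        ("Cool", "Slow"): [(1.0, "Cool", +1)],
        ("Cool", "Fast"): [(0.5, "Cool", +2), (0.5, "Warm", +2)],
        ("Warm", "Slow"): [(0.5, "Cool", +1), (0.5, "Warm", +1)],
        ("Warm", "Fast"): [(1.0, "Overheated", -10)],
        # "Overheated" is terminal: no actions are available there.
    }
    STATES = ["Cool", "Warm", "Overheated"]
    ACTIONS = ["Slow", "Fast"]

    def actions(s):
        """Actions available in state s (none in the terminal state)."""
        return [a for a in ACTIONS if (s, a) in MDP]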

  • Markov Decision Processes Definitions

    Markov Decision Process (MDP)

    in general, the consequences of the actions could depend on all the past states, rewards and actions :

    P{S_{t+1} = s ; R_{t+1} = r | S_t = s_t, R_t = r_t, A_t = a_t, ..., S_0 = s_0, R_0 = r_0, A_0 = a_0}

    Definition : Markov property for a Markov Decision Process

    The consequences of an action only depend on the last step :

    P{S_{t+1} = s ; R_{t+1} = r | S_t = s_t, R_t = r_t, A_t = a_t}

    Thus the agent does not need to look at the full history.

    Xavier Siebert DEFIS IA 29 / 61

  • Markov Decision Processes Definitions

    Markov Decision Process (MDP)

    In a finite MDP, S, A and 𝒯 all have a finite number of elements.

    an abstract and flexible framework

    Xavier Siebert DEFIS IA 30 / 61

  • Markov Decision Processes Policies

    MDP and Policies

    policy π ∈ Π = strategy, defines the behaviour of an agent
    - deterministic π(s) : defines the action a to perform in state s (with P = 1)
    - stochastic π(s, a) : gives the probability to select a from state s

    any policy can be stationary (π_t = π, ∀t) or not

    Definition : stochastic policy

    A stochastic policy is a distribution over actions given states,

    π(a, s) = π(a|s) = P[A_t = a | S_t = s]

    Xavier Siebert DEFIS IA 31 / 61
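
    As an illustration of a stochastic policy, the sketch below samples A_t ~ π(·|s) for the Cool/Warm states of the racing-car MDP; the probability values are made up for the example and are not taken from the slides.

    import random

    # Hypothetical stochastic policy pi(a|s); the numbers are illustrative only.
    pi = {
        "Cool": {"Slow": 0.2, "Fast": 0.8},
        "Warm": {"Slow": 0.9, "Fast": 0.1},
    }

    def sample_action(s, rng=random.Random(0)):
        """Draw A_t ~ pi(.|s); a deterministic policy is the special case
        where one action has probability 1."""
        acts, probs = zip(*pi[s].items())
        return rng.choices(acts, weights=probs, k=1)[0]

    print([sample_action("Cool") for _ in range(5)])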

  • Markov Decision Processes Policies

    MDP - HANDS ON : Grid World

    do section 1.1 of RL1

    goal : move the agent (•) in the grid and understand stochasticity

    python gridworld.py -m -g BookGrid --noise=0.2

    the stochastic behavior is controlled by the parameter --noise

    it is the probability that the agent moves in a direction other than the one expected from the action

    Xavier Siebert DEFIS IA 32 / 61

  • Markov Decision Processes Policies

    Markov Decision Process (MDP)

    how to “solve” an MDP ?

    goal : find the optimal policy π∗

    Source: ai.berkeley.edu

    to find the optimal policy π∗, we have to maximize some reward

    Xavier Siebert DEFIS IA 33 / 61


  • Markov Decision Processes Rewards

    Markov Decision Process (MDP) : Cumulative Reward

    first (naïve) definition of the return

    expected cumulative reward from the current state S_t = s
    - after N steps (finite horizon) :

      E[R_{t+1} + R_{t+2} + ... + R_{t+N} | S_t = s]

    - after an infinite number of steps (infinite horizon) :

      E[ Σ_{i=0}^{∞} R_{t+i+1} | S_t = s ]

    this can lead to convergence issues

    need another definition

    Xavier Siebert DEFIS IA 34 / 61

  • Markov Decision Processes Rewards

    Markov Decision Process (MDP) : more general definition : Discounted Reward = Return

    expected discounted reward = return from the current state S_t = s

    E[ Σ_{i=0}^{∞} γ^i R_{t+i+1} | S_t = s ]   (the sum inside the expectation is denoted G_t)

    where the discount factor γ ∈ [0, 1]

    - γ → 0 : immediate reward matters (“myopic” agent)
    - γ → 1 : all rewards matter equally (“far-sighted” agent)

    reward hypothesis : any goal can be represented by the expected value of the cumulative sum of a received scalar signal (the return), which the agent tries to maximize

    Xavier Siebert DEFIS IA 35 / 61
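
    A tiny sketch of the discounted return G_t = Σ_i γ^i R_{t+i+1}, computed here for a finite list of future rewards (a simplifying assumption of the sketch; with γ < 1 the infinite series converges).

    # Discounted return over a finite list of future rewards R_{t+1}, R_{t+2}, ...
    def discounted_return(rewards, gamma):
        g = 0.0
        for i, r in enumerate(rewards):
            g += (gamma ** i) * r
        return g

    print(discounted_return([1, 1, 1, 1], gamma=0.5))  # 1 + 0.5 + 0.25 + 0.125 = 1.875
    print(discounted_return([1, 1, 1, 1], gamma=1.0))  # plain (undiscounted) sum = 4.0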

  • Markov Decision Processes Value Function

    Markov Decision Process (MDP)

    Definition : Value Function for policy π

    For a given policy π, a value function

    v_π : S → ℝ

    associates to any state s ∈ S the expected (discounted) reward obtained by following π from s :

    v_π(s) := E_π[G_t | S_t = s] = E_π[ Σ_{i=0}^{∞} γ^i R_{t+i+1} | S_t = s ]   ∀s ∈ S

    Intuitively, a value function encodes our knowledge about the system, and specifies what is good in the long run

    Xavier Siebert DEFIS IA 36 / 61

  • Markov Decision Processes Value Function

    Markov Decision Process (MDP)

    Definition : Value Function

    A value function

    v : S → ℝ

    associates to any state s ∈ S the expected (discounted) reward

    v(s) := E[G_t | S_t = s] = E[ Σ_{i=0}^{∞} γ^i R_{t+i+1} | S_t = s ]   ∀s ∈ S

    Xavier Siebert DEFIS IA 37 / 61

  • Markov Decision Processes Value Function

    Markov Decision Process (MDP) : Value Function and recurrence relationship

    from the definition

    G_t := Σ_{i=0}^{∞} γ^i R_{t+i+1}

    we can deduce the recurrence

    G_t = R_{t+1} + γ G_{t+1}

    the value function can thus be decomposed into two parts:

    v(s) = E[R_{t+1} + γ G_{t+1} | S_t = s]
         = E[ R_{t+1} (immediate reward) + γ v(S_{t+1}) (value of the successor) | S_t = s ]
         = E[R_{t+1} | S_t = s] + E[γ v(S_{t+1}) | S_t = s]

    Xavier Siebert DEFIS IA 38 / 61
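
    The recurrence G_t = R_{t+1} + γ G_{t+1} can be checked numerically on any reward sequence; the values below are made up for the check.

    # Numerical check of G_t = R_{t+1} + gamma * G_{t+1} on an arbitrary sequence.
    rewards = [1, 2, 0, 3]          # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}
    gamma = 0.5
    g = lambda rs: sum((gamma ** i) * r for i, r in enumerate(rs))
    print(g(rewards))                                # G_t          = 2.375
    print(rewards[0] + gamma * g(rewards[1:]))       # R_{t+1} + γ G_{t+1} = 2.375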

  • Markov Decision Processes Bellman Equation

    Markov Decision Process (MDP) : Optimal Policies and Optimal Value Functions

    value functions define a partial ordering over policies

    π dominates π′ if (assuming the policies are comparable)
    1. ∀s, v_π(s) ≥ v_{π′}(s)
    2. ∃s : v_π(s) > v_{π′}(s)

    a policy π* is optimal if it is not dominated by any other

    π*(s) = arg max_π v_π(s)

    there can be many optimal policies, but they all have the same value

    v*(s) = v_{π*}(s) = max_π v_π(s)

    how to find v∗ ?

    Xavier Siebert DEFIS IA 39 / 61

  • Markov Decision Processes Bellman Equation

    Markov Decision Process (MDP)

    Bellman Equations (1957)

    For a given state s, the optimal value v∗(s) is the solution of

    v(s) = max_a Σ_{s′} T(s, a, s′) [R(s, a, s′) + γ v(s′)]

    “Backup diagram”

    Xavier Siebert DEFIS IA 40 / 61
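
    As a concrete instance, a single Bellman backup for s = Cool with v ≡ 0 and γ = 0.5, using the reconstructed racing-car MDP from the earlier sketch (an assumption); the result, 2, matches v_1(Cool) in the value-iteration example later on.

    # One Bellman backup: max over actions of sum_s' T(s,a,s')[R(s,a,s') + gamma*v(s')].
    MDP = {("Cool", "Slow"): [(1.0, "Cool", 1)],
           ("Cool", "Fast"): [(0.5, "Cool", 2), (0.5, "Warm", 2)]}
    v = {"Cool": 0.0, "Warm": 0.0}
    gamma = 0.5
    backup = max(sum(p * (r + gamma * v[s2]) for p, s2, r in MDP[("Cool", a)])
                 for a in ("Slow", "Fast"))
    print(backup)  # 2.0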

  • Markov Decision Processes Bellman Equation

    Markov Decision Process (MDP) : Optimality Equations (Bellman)

    introducing the operator L:

    Lv_π(s) = max_a Σ_{s′} T(s, a, s′) [R(s, a, s′) + γ v_π(s′)]

    the optimal value v*(s) is the solution of

    v_π(s) = Lv_π(s)

    fixed-point theorems (Banach)

    how to compute v*(s) in practice ?
    - compute explicitly, for a finite MDP
    - linear programming
    - value iteration
    - policy iteration (= strategy iteration)

    Xavier Siebert DEFIS IA 41 / 61

  • Markov Decision Processes Bellman Equation

    Markov Decision Process (MDP) : compute explicitly, for a finite MDP

    For finite MDP with n states

    Bellman gives a system of n equations in n unknowns

    Computational complexity is O(n^3)

    Direct solution only possible for a small MDP

    Otherwise : need approximate methods

    Xavier Siebert DEFIS IA 42 / 61

  • Markov Decision Processes Bellman Equation

    Markov Decision Process (MDP) : Linear Programming

    Property

    If v ∈ V minimizes Σ_{s∈S} v(s) under the constraints

    v ≥ Lv

    then v = v*.

    we can thus use any linear programming solver

    in practice : slow. . .

    Xavier Siebert DEFIS IA 43 / 61
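
    A hedged sketch of this linear-programming formulation using scipy.optimize.linprog, applied to the racing-car MDP reconstructed earlier (an assumption). Pinning the terminal state to 0 through its bounds is a modelling choice of this sketch, not something stated on the slide.

    import numpy as np
    from scipy.optimize import linprog

    MDP = {("Cool", "Slow"): [(1.0, "Cool", 1)],
           ("Cool", "Fast"): [(0.5, "Cool", 2), (0.5, "Warm", 2)],
           ("Warm", "Slow"): [(0.5, "Cool", 1), (0.5, "Warm", 1)],
           ("Warm", "Fast"): [(1.0, "Overheated", -10)]}
    states = ["Cool", "Warm", "Overheated"]
    idx = {s: i for i, s in enumerate(states)}
    gamma = 0.5

    # One inequality per (s, a): v(s) >= sum_s' T(s,a,s')[R(s,a,s') + gamma v(s')],
    # rewritten as A_ub @ v <= b_ub for linprog.
    A_ub, b_ub = [], []
    for (s, a), outcomes in MDP.items():
        row = np.zeros(len(states))
        row[idx[s]] -= 1.0
        rhs = 0.0
        for p, s2, r in outcomes:
            row[idx[s2]] += gamma * p
            rhs -= p * r
        A_ub.append(row)
        b_ub.append(rhs)

    # Minimize sum_s v(s); the terminal state is pinned to 0 via its bounds.
    bounds = [(None, None), (None, None), (0, 0)]
    res = linprog(c=np.ones(len(states)), A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    print(dict(zip(states, np.round(res.x, 3))))  # expected: Cool 3.5, Warm 2.5, Overheated 0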

  • Markov Decision Processes Value Iteration

    Markov Decision Process (MDP) : Value Iteration (VI)

    construct the sequence {v_0, v_1, ..., v_k, v_{k+1}, ...} where

    v_{k+1} = Lv_k

    iterations to dynamically evaluate Bellman’s equation

    v_{k+1}(s) ← max_a Σ_{s′} T(s, a, s′) [R(s, a, s′) + γ v_k(s′)]

    Xavier Siebert DEFIS IA 44 / 61

  • Markov Decision Processes Value Iteration

    Markov Decision Process (MDP) : Value Iteration (VI)

    fixed-point theorem shows that

    v*(s) = lim_{k→∞} v_k(s)

    if the stopping criterion is

    ‖v_{k+1} − v_k‖ < ε

    then we can prove that the solution is close to the optimum, that is:

    ‖v_k − v*‖ < (2γ / (1 − γ)) ε

    Xavier Siebert DEFIS IA 45 / 61

  • Markov Decision Processes Value Iteration

    Markov Decision Process (MDP) : Value Iteration (VI)

    Algorithm 1 Value Iteration

    Initialize v_0(s) for all states; T and R are known
    k = −1
    repeat
      k = k + 1
      for s ∈ S do
        v_{k+1}(s) = max_a Σ_{s′} T(s, a, s′)[R(s, a, s′) + γ v_k(s′)]
      end for
    until ‖v_{k+1} − v_k‖ < ε
    return v_k

    Xavier Siebert DEFIS IA 46 / 61
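
    A minimal Python sketch of Algorithm 1, run on the reconstructed racing-car MDP (the transition model is an assumption, as noted earlier). The first iterates reproduce the worked example on the following slides, v_1 = (2, 1, 0) and v_2 = (2.75, 1.75, 0), and the limit is v* ≈ (3.5, 2.5, 0).

    MDP = {("Cool", "Slow"): [(1.0, "Cool", 1)],
           ("Cool", "Fast"): [(0.5, "Cool", 2), (0.5, "Warm", 2)],
           ("Warm", "Slow"): [(0.5, "Cool", 1), (0.5, "Warm", 1)],
           ("Warm", "Fast"): [(1.0, "Overheated", -10)]}
    STATES = ["Cool", "Warm", "Overheated"]

    def value_iteration(gamma=0.5, eps=1e-6):
        v = {s: 0.0 for s in STATES}          # v_0(s) = 0 for all states
        while True:
            v_new = {}
            for s in STATES:
                acts = [a for (s0, a) in MDP if s0 == s]
                if not acts:                  # terminal state keeps value 0
                    v_new[s] = 0.0
                    continue
                v_new[s] = max(sum(p * (r + gamma * v[s2])
                                   for p, s2, r in MDP[(s, a)]) for a in acts)
            if max(abs(v_new[s] - v[s]) for s in STATES) < eps:
                return v_new
            v = v_new

    print({s: round(x, 3) for s, x in value_iteration().items()})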

  • Markov Decision Processes Value Iteration

    Markov Decision Process (MDP) : Value Iteration (VI) - numerical example

    starting point : all values v0(s) initialized to 0

    c w o

    v0 0 0 0

    we take γ = 0.5

    Xavier Siebert DEFIS IA 47 / 61

  • Markov Decision Processes Value Iteration

    Markov Decision Process (MDP) : Value Iteration (VI) - numerical example

    first round of updates : v_1 from v_0

    v_1(c) = max{1·(1 + 0.5×0), 0.5·(2 + 0.5×0) + 0.5·(2 + 0.5×0)} = max{1, 2} = 2

    v_1(w) = max{0.5·(1 + 0.5×0) + 0.5·(1 + 0.5×0), 1·(−10 + 0.5×0)} = max{1, −10} = 1

    v_1(o) = max{0} = 0

    Xavier Siebert DEFIS IA 48 / 61

  • Markov Decision Processes Value Iteration

    Markov Decision Process (MDP) : Value Iteration (VI) - numerical example

    first round of updates : v_1 from v_0

           c     w     o
    v_0    0     0     0
    v_1    2     1     0

    Xavier Siebert DEFIS IA 49 / 61

  • Markov Decision Processes Value Iteration

    Markov Decision Process (MDP) : Value Iteration (VI) - numerical example

    second round of updates : v_2 from v_1

    v_2(c) = max{1·(1 + 0.5×2), 0.5·(2 + 0.5×2) + 0.5·(2 + 0.5×1)} = max{2, 2.75} = 2.75

    v_2(w) = max{0.5·(1 + 0.5×2) + 0.5·(1 + 0.5×1), 1·(−10 + 0.5×0)} = max{1.75, −10} = 1.75

    v_2(o) = max{0} = 0

    Xavier Siebert DEFIS IA 50 / 61

  • Markov Decision Processes Value Iteration

    Markov Decision Process (MDP) : Value Iteration (VI) - numerical example

    second round of updates : v_2 from v_1

           c       w       o
    v_0    0       0       0
    v_1    2       1       0
    v_2    2.75    1.75    0

    Xavier Siebert DEFIS IA 51 / 61

  • Markov Decision Processes Policy Iteration

    Markov Decision Process (MDP) : Policy Iteration (PI) = Strategy Iteration (SI)

    two steps, after initialization of v^{π_0}(s) :

    1. policy evaluation : solve (no max)

       v_k^{π_k} = L_{π_k} v_k^{π_k}

       equivalent to iterating

       v_{k+1}^{π_k}(s) ← Σ_{s′} T(s, π_k(s), s′)[R(s, π_k(s), s′) + γ v_k^{π_k}(s′)]

    2. policy improvement :

       π_{k+1} = arg max_{π∈Π} Σ_{s′} T(s, π(s), s′)[R(s, π(s), s′) + γ v_k^{π_k}(s′)]

    stop when the policy does not change anymore (π_k = π_{k+1})

    Xavier Siebert DEFIS IA 52 / 61

  • Markov Decision Processes Policy Iteration

    Markov Decision Process (MDP) : Policy Iteration (PI) = Strategy Iteration (SI)

    Algorithm 2 Policy Iteration

    Initialize π_0 with some policy
    k = −1
    repeat
      k = k + 1
      for s ∈ S do
        solve v_{π_k}(s) = Σ_{s′} T(s, π_k(s), s′)[R(s, π_k(s), s′) + γ v_{π_k}(s′)]
      end for
      for s ∈ S do
        π_{k+1}(s) = arg max_{π∈Π} Σ_{s′} T(s, π(s), s′)[R(s, π(s), s′) + γ v_{π_k}(s′)]
      end for
    until π_{k+1}(s) = π_k(s)
    return v_k, π_k

    Xavier Siebert DEFIS IA 53 / 61
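
    A minimal Python sketch of Algorithm 2 on the same reconstructed racing-car MDP (an assumption); policy evaluation is done here by solving the linear system with numpy, which is one possible choice for the "solve (no max)" step.

    import numpy as np

    MDP = {("Cool", "Slow"): [(1.0, "Cool", 1)],
           ("Cool", "Fast"): [(0.5, "Cool", 2), (0.5, "Warm", 2)],
           ("Warm", "Slow"): [(0.5, "Cool", 1), (0.5, "Warm", 1)],
           ("Warm", "Fast"): [(1.0, "Overheated", -10)]}
    STATES = ["Cool", "Warm", "Overheated"]
    idx = {s: i for i, s in enumerate(STATES)}

    def evaluate(pi, gamma):
        """Solve v(s) = sum_s' T(s,pi(s),s')[R + gamma v(s')] as a linear system."""
        n = len(STATES)
        A, b = np.eye(n), np.zeros(n)
        for s in STATES:
            if s not in pi:                  # terminal state: v(s) = 0
                continue
            for p, s2, r in MDP[(s, pi[s])]:
                A[idx[s], idx[s2]] -= gamma * p
                b[idx[s]] += p * r
        return np.linalg.solve(A, b)

    def policy_iteration(gamma=0.5):
        pi = {"Cool": "Slow", "Warm": "Slow"}         # pi_0 = "always go slow"
        while True:
            v = evaluate(pi, gamma)                   # (1) policy evaluation
            new_pi = {}
            for s in pi:                              # (2) policy improvement
                acts = [a for (s0, a) in MDP if s0 == s]
                new_pi[s] = max(acts, key=lambda a: sum(p * (r + gamma * v[idx[s2]])
                                                        for p, s2, r in MDP[(s, a)]))
            if new_pi == pi:
                return pi, v
            pi = new_pi

    pi_star, v_star = policy_iteration()
    print(pi_star)              # expected: {'Cool': 'Fast', 'Warm': 'Slow'}
    print(np.round(v_star, 3))  # expected: [3.5, 2.5, 0.]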

  • Markov Decision Processes Policy Iteration

    Markov Decision Process (MDP) : Policy Iteration (PI)

    the policy π(s) is often more important than the precise value v∗(s)

    policy iteration can converge even though the values did not

    Xavier Siebert DEFIS IA 54 / 61

  • Markov Decision Processes Policy Iteration

    Markov Decision Process (MDP) : Policy Iteration (PI) - numerical example

    starting point :
    - initial policy π_0 = “always go slow”
    - all values v_{π_0}(s) initialized to 0

    assume γ = 0.5

    c w o

    π0 slow slow -

    Xavier Siebert DEFIS IA 55 / 61

  • Markov Decision Processes Policy Iteration

    Markov Decision Process (MDP) : Policy Iteration (PI) - numerical example

    first round of updates : (1) policy evaluation

    v_{π_0}(c) = 1·(1 + 0.5 × v_{π_0}(c))
    v_{π_0}(w) = 0.5·(1 + 0.5 × v_{π_0}(c)) + 0.5·(1 + 0.5 × v_{π_0}(w))

    solving these equations yields

    v_{π_0}(c) = 2,   v_{π_0}(w) = 2

    Xavier Siebert DEFIS IA 56 / 61
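
    The two policy-evaluation equations above can be checked numerically; a small numpy sketch (with γ = 0.5, as in the example):

    import numpy as np

    # v(c) = 1*(1 + 0.5*v(c))                         ->  0.5*v(c)              = 1
    # v(w) = 0.5*(1 + 0.5*v(c)) + 0.5*(1 + 0.5*v(w))  -> -0.25*v(c) + 0.75*v(w) = 1
    A = np.array([[0.5,   0.0 ],
                  [-0.25, 0.75]])
    b = np.array([1.0, 1.0])
    print(np.linalg.solve(A, b))  # [2. 2.]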

  • Markov Decision Processes Policy Iteration

    Markov Decision Process (MDP) : Policy Iteration (PI) - numerical example

    first round of updates : (2) policy improvement

    π_1(c) = arg max {slow : 1·(1 + 0.5×2), fast : 0.5·(2 + 0.5×2) + 0.5·(2 + 0.5×2)}
           = arg max {slow : 2, fast : 3} = fast

    π_1(w) = arg max {slow : 0.5·(1 + 0.5×2) + 0.5·(1 + 0.5×2), fast : 1·(−10 + 0.5×0)}
           = arg max {slow : 2, fast : −10} = slow

    Xavier Siebert DEFIS IA 57 / 61

  • Markov Decision Processes Policy Iteration

    Markov Decision Process (MDP) : Policy Iteration (PI) - numerical example

           c      w      o
    π_0    slow   slow   -
    π_1    fast   slow   -

    Xavier Siebert DEFIS IA 58 / 61

  • Markov Decision Processes Policy Iteration

    Markov Decision Process (MDP) : Policy Iteration (PI) - numerical example

    second round of updates : policy evaluation and improvement yields

           c      w      o
    π_0    slow   slow   -
    π_1    fast   slow   -
    π_2    fast   slow   -

    the policy converged already !

    Xavier Siebert DEFIS IA 59 / 61

  • Markov Decision Processes Policy Iteration

    HANDS ON : MDP

    do sections 1.2 → 1.4 of RL1

    goal : explore MDP algorithms with simple labyrinths

    Xavier Siebert DEFIS IA 60 / 61

  • Markov Decision Processes Policy Iteration

    HANDS ON : MDP

    Sometimes the agent takes too long to find the exit
    One can also add a living reward to speed up the process
    ∀ (s, a, s′) with no reward, add a negative reward :
    - −0.01 (the agent has plenty of time to get out)
    - −2.0 (hurry up !)

    Source: ai.berkeley.edu

    Xavier Siebert DEFIS IA 61 / 61

