
  • Introduction to Reinforcement Learning

    Ather Gattami, Senior Scientist, RISE SICS

    Stockholm, Sweden

    November 3, 2017

  • Outline

    1 Introduction

    2 Dynamical Systems

    3 Bellman’s Principle of Optimality

    4 Reinforcement Learning


  • Success Stories

  • Reinforcement Learning in a Nutshell

    Used in problems where actions (decisions) have to be made

    Each action (decision) affects future states of the system

    Success is measured by a scalar reward signal

    Goal: Take actions (decisions) to maximize reward (or minimize cost) where no system model is available


  • Dynamical Systems

    Let s_k, y_k, a_k be the state, observation, and action at time step k, respectively.

    Deterministic model:

    s_{k+1} = f_k(s_k, a_k)
    y_k = g_k(s_k, a_k)

    Stochastic model (Markov Decision Process):

    P(s_{k+1} | s_k, a_k, s_{k-1}, a_{k-1}, ...) = P(s_{k+1} | s_k, a_k)
    P(y_k | s_k, a_k, s_{k-1}, a_{k-1}, ...) = P(y_k | s_k, a_k)

    We assume perfect state observation, that is, y_k = s_k. (A minimal simulation sketch of both model classes follows below.)
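    To make these two model classes concrete, here is a minimal Python sketch (not from the slides): a step function standing in for f_k and a hypothetical transition table standing in for P(s_{k+1} | s_k, a_k); the states, dynamics, and probabilities are made-up placeholders.

    import random

    # Deterministic model: s_{k+1} = f(s_k, a_k).
    # Placeholder dynamics: the next state is the current state plus the action.
    def f(s, a):
        return s + a

    # Stochastic model (MDP): a transition kernel P(s' | s, a) for a
    # hypothetical two-state, two-action system.
    P = {
        (0, 0): {0: 0.9, 1: 0.1},
        (0, 1): {0: 0.2, 1: 0.8},
        (1, 0): {0: 0.5, 1: 0.5},
        (1, 1): {0: 0.1, 1: 0.9},
    }

    def sample_next_state(s, a):
        """Draw s_{k+1} from P(. | s_k = s, a_k = a)."""
        next_states, probs = zip(*P[(s, a)].items())
        return random.choices(next_states, weights=probs, k=1)[0]

    # Roll the deterministic system forward for a few steps with a fixed action.
    s = 0
    for k in range(3):
        s = f(s, 1)
    print("deterministic state after 3 steps:", s)

    # Draw one stochastic transition from state 0 under action 1.
    print("sampled next state:", sample_next_state(0, 1))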

  • Dynamical Systems

    Given a dynamical system with states, observations, and actions given by s_k, y_k, and a_k, respectively, and scalar-valued rewards r_k(s_k, a_k), find the actions a_k that maximize the expected discounted reward

    R_T = E[ Σ_{k=1}^{T} δ^k r_k(s_k, a_k) ]

    where 0 < δ ≤ 1 is the discount factor. (A short sketch that evaluates this sum follows below.)
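    As a small numerical illustration (not from the slides), the Python snippet below evaluates the discounted sum Σ_{k=1}^{T} δ^k r_k for a made-up reward sequence.

    # Discounted return R_T = sum_{k=1}^{T} delta^k * r_k for a made-up reward sequence.
    delta = 0.9                       # discount factor, 0 < delta <= 1
    rewards = [1.0, 0.0, 2.0, 1.5]    # hypothetical r_1, ..., r_T

    R_T = sum(delta**k * r for k, r in enumerate(rewards, start=1))
    print(R_T)  # 0.9*1.0 + 0.81*0.0 + 0.729*2.0 + 0.6561*1.5 ≈ 3.342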

  • Example

    Let r_k(θ_k, F_k) = -θ_k^2, where θ_k and F_k are time-discretized values of the angle θ (the state) and the force F (the action). Maximize the reward function R with respect to F.

  • Bellman's Principle of Optimality

  • Bellman's Equation

    Definition: A policy π(s_k) defines a probability distribution over actions given a state s_k,

    P(A_k = a_k | S_k = s_k)

    For deterministic policies, the action is given by a_k = π(s_k) with probability 1. (A small sketch of both kinds of policy follows below.)
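    As an illustration of this definition (not from the slides), a stochastic policy can be represented as a table of action probabilities per state, and a deterministic policy as a plain state-to-action mapping; the states, actions, and probabilities below are made up.

    import random

    # Stochastic policy: for each state s, a distribution P(A_k = a | S_k = s).
    stochastic_policy = {
        "s0": {"left": 0.3, "right": 0.7},
        "s1": {"left": 0.9, "right": 0.1},
    }

    def sample_action(policy, s):
        """Draw an action a ~ pi(. | s)."""
        actions, probs = zip(*policy[s].items())
        return random.choices(actions, weights=probs, k=1)[0]

    # Deterministic policy: a = pi(s) with probability 1.
    deterministic_policy = {"s0": "right", "s1": "left"}

    print(sample_action(stochastic_policy, "s0"))
    print(deterministic_policy["s0"])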

  • Bellman's Equation

    Let δ^i · Q^π_i(s, a) be the expected reward-to-go from time step k = i to k = T given the policy π(s_k), the state s_i = s, and the action a_i = a. That is,

    Q^π_i(s, a) = E[ Σ_{k=i}^{T} δ^{k-i} r_k(s_k, a_k) | s_i = s, a_i = a ]

    Then,

    Q^π_i(s, a) = E[ r_i(s_i, a_i) + δ · Σ_{k=i+1}^{T} δ^{k-(i+1)} r_k(s_k, a_k) | s_i = s, a_i = a ]
                = E[ r_i(s, a) + δ · Q^π_{i+1}(s_{i+1}, a_{i+1}) | s_i = s, a_i = a ]

  • Bellman's Equation

    Let π* be an optimal policy that maximizes Q^π_i(s, a) for all i, and define

    Q*_i(s, a) = Q^{π*}_i(s, a)

    Since the policy is optimal for all i, we have

    π*(s) = arg sup_a Q*_i(s, a)

    Bellman's Equation:

    Q*_i(s, a) = E[ r_i(s, a) + δ · Q*_{i+1}(s_{i+1}, a_{i+1}) | s_i = s, a_i = a ]

  • Dynamic Programming

    If the system model is known, we can use dynamic programming. Let

    V*(s) = sup_a Q*_i(s, a)

    The Bellman equation is then given by

    V(s_k) = sup_{a_k} E[ r_k(s_k, a_k) + δ · V(s_{k+1}) ]
           = sup_{a_k} Σ_{s'∈S} P(s' | s_k, a_k) ( r_k(s_k, a_k) + δ · V(s') )

    (A value-iteration sketch based on this recursion follows below.)
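    When P(s' | s, a) and the rewards are known, the recursion above can be solved by repeatedly applying its right-hand side until V stops changing (value iteration). The sketch below does this for a hypothetical two-state, two-action MDP; the transition table and rewards are illustrative, not from the slides.

    # Value iteration on a made-up 2-state, 2-action MDP.
    states = [0, 1]
    actions = [0, 1]
    delta = 0.9  # discount factor

    # Transition kernel P[s][a] = {s': probability} and rewards r[s][a] (made up).
    P = {
        0: {0: {0: 0.8, 1: 0.2}, 1: {0: 0.1, 1: 0.9}},
        1: {0: {0: 0.5, 1: 0.5}, 1: {0: 0.0, 1: 1.0}},
    }
    r = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.5, 1: 2.0}}

    V = {s: 0.0 for s in states}
    for _ in range(1000):  # apply the Bellman recursion until (approximate) convergence
        V_new = {
            s: max(
                sum(P[s][a][s2] * (r[s][a] + delta * V[s2]) for s2 in states)
                for a in actions
            )
            for s in states
        }
        if max(abs(V_new[s] - V[s]) for s in states) < 1e-9:
            V = V_new
            break
        V = V_new

    print(V)  # approximately optimal state values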

  • Model-Free Optimization and Reinforcement Learning

    What if we don't have the system model?

    If the system is deterministic, the model is given by

    s_{k+1} = f_k(s_k, a_k)

    If the system is stochastic, the model is given by

    P(s_{k+1} | s_k, a_k)

  • Q-Learning

    Let s = s_k and s' = s_{k+1}. Learning the Q function is done by minimizing, with respect to Q, the following cost function

    l = ( r(s, a) + δ sup_{a'} Q(s', a') - Q(s, a) )^2

    Gradient descent update rule with learning rate (step size) α:

    Q(s, a) ← Q(s, a) + α ( r(s, a) + δ sup_{a'} Q(s', a') - Q(s, a) )

    The optimal policy is estimated from Q(s, a):

    π(s) = arg sup_a Q(s, a)

    (A minimal tabular implementation sketch follows below.)
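    A minimal tabular implementation of this update rule might look as follows. The tiny chain environment, its reward, and the ε-greedy exploration scheme are illustrative choices, not from the slides; α plays the role of the learning rate above.

    import random
    from collections import defaultdict

    # Made-up chain environment: states 0..4, actions 0 (left) / 1 (right);
    # reaching state 4 gives reward 1 and ends the episode.
    def env_step(s, a):
        s_next = min(s + 1, 4) if a == 1 else max(s - 1, 0)
        reward = 1.0 if s_next == 4 else 0.0
        done = (s_next == 4)
        return s_next, reward, done

    alpha, delta, eps = 0.1, 0.9, 0.1   # learning rate, discount factor, exploration rate
    Q = defaultdict(float)              # Q[(s, a)], initialised to 0

    for episode in range(500):
        s, done = 0, False
        while not done:
            # epsilon-greedy action selection so all state-action pairs keep being tried
            if random.random() < eps:
                a = random.choice([0, 1])
            else:
                a = max([0, 1], key=lambda b: Q[(s, b)])
            s_next, reward, done = env_step(s, a)
            # Q(s,a) <- Q(s,a) + alpha * (r + delta * max_a' Q(s',a') - Q(s,a))
            target = reward if done else reward + delta * max(Q[(s_next, b)] for b in [0, 1])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next

    # Greedy policy estimated from Q: pi(s) = argmax_a Q(s, a)
    print({s: max([0, 1], key=lambda b: Q[(s, b)]) for s in range(5)})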

  • Q-Learning

    Theorem: Consider the Q-learning algorithm given by

    Q(s, a) ← Q(s, a) + α ( r(s, a) + δ sup_{a'} Q(s', a') - Q(s, a) )

    The Q-learning algorithm converges to the optimal action-value function, Q(s, a) → Q*(s, a), under the standard conditions that every state-action pair is visited infinitely often and the step sizes α_k decay appropriately (Σ_k α_k = ∞, Σ_k α_k^2 < ∞).

  • Deep Reinforcement Learning

    Deep reinforcement learning (used in AlphaGo, which defeated the World Champion in Go) is reinforcement learning where the Q function is approximated with a deep neural network.

    The loss function is minimized with respect to the neural network weights w:

    l = ( r(s, a) + δ sup_{a'} Q(s', a', w^-) - Q(s, a, w) )^2

    where w^- denotes the weights of a separate, periodically updated target network. (A sketch of this loss follows below.)
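    Below is a sketch of this loss for a small fully connected network, assuming the third-party PyTorch package (torch) is available; the layer sizes, the target-network copy, and the random batch of transitions are illustrative, not from the slides.

    import torch
    import torch.nn as nn

    n_states, n_actions = 4, 2   # e.g. cart-pole-like dimensions (illustrative)

    # Online network Q(s, a; w) and target network Q(s, a; w-) with frozen weights.
    q_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
    target_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
    target_net.load_state_dict(q_net.state_dict())   # w- <- w (updated only periodically)

    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    delta = 0.99

    # A made-up batch of transitions (s, a, r, s').
    s = torch.randn(32, n_states)
    a = torch.randint(0, n_actions, (32,))
    r = torch.randn(32)
    s_next = torch.randn(32, n_states)

    # l = (r + delta * max_a' Q(s', a'; w-) - Q(s, a; w))^2
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)            # Q(s, a; w)
    with torch.no_grad():
        target = r + delta * target_net(s_next).max(dim=1).values   # r + delta * sup_a' Q(s', a'; w-)
    loss = nn.functional.mse_loss(q_sa, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()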

  • Cart Pole Example

    Figure: the horizontal ("x") axis represents time and the vertical ("y") axis represents the upward position of the pole (the "z" axis in space). (A minimal interaction loop with a cart-pole environment follows below.)
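    For completeness, a minimal interaction loop with a cart-pole environment, assuming the third-party gymnasium package (not mentioned on the slides) and a purely random policy in place of a learned one:

    import gymnasium as gym

    env = gym.make("CartPole-v1")
    obs, info = env.reset(seed=0)

    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()   # random action, just to show the loop
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        done = terminated or truncated

    print("episode return:", total_reward)
    env.close()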

  • End of Presentation

    THANKS FOR LISTENING!
