
  • Introduction to Reinforcement Learning

    Ather Gattami, Senior Scientist, RISE SICS

    Stockholm, Sweden

    November 3, 2017

  • Outline

    1 Introduction

    2 Dynamical Systems

    3 Bellman’s Principle of Optimality

    4 Reinforcement Learning


  • Success Stories

  • Reinforcement Learning in a Nutshell

    Used in problems where actions (decisions) have to be made

    Each action (decision) affects future states of the system

    Success is measured by a scalar reward signal

    Goal: Take actions (decisions) to maximize reward (or minimize cost) where no system model is available


  • Dynamical Systems

    Let s_k, y_k, a_k be the state, observation, and action at time step k, respectively.

    Deterministic model:

    s_{k+1} = f_k(s_k, a_k)
    y_k = g_k(s_k, a_k)

    Stochastic model (Markov Decision Process):

    P(s_{k+1} | s_k, a_k, s_{k-1}, a_{k-1}, ...) = P(s_{k+1} | s_k, a_k)
    P(y_k | s_k, a_k, s_{k-1}, a_{k-1}, ...) = P(y_k | s_k, a_k)

    We assume perfect state observation, that is, y_k = s_k. (A minimal simulation sketch of both model classes follows below.)
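    To make these two model classes concrete, here is a minimal Python sketch (not from the slides): a step function standing in for f_k and a hypothetical transition table standing in for P(s_{k+1} | s_k, a_k); the states, dynamics, and probabilities are made-up placeholders.

    import random

    # Deterministic model: s_{k+1} = f(s_k, a_k).
    # Placeholder dynamics: the next state is the current state plus the action.
    def f(s, a):
        return s + a

    # Stochastic model (MDP): a transition kernel P(s' | s, a) for a
    # hypothetical two-state, two-action system.
    P = {
        (0, 0): {0: 0.9, 1: 0.1},
        (0, 1): {0: 0.2, 1: 0.8},
        (1, 0): {0: 0.5, 1: 0.5},
        (1, 1): {0: 0.1, 1: 0.9},
    }

    def sample_next_state(s, a):
        """Draw s_{k+1} from P(. | s_k = s, a_k = a)."""
        next_states, probs = zip(*P[(s, a)].items())
        return random.choices(next_states, weights=probs, k=1)[0]

    # Roll the deterministic system forward for a few steps with a fixed action.
    s = 0
    for k in range(3):
        s = f(s, 1)
    print("deterministic state after 3 steps:", s)

    # Draw one stochastic transition from state 0 under action 1.
    print("sampled next state:", sample_next_state(0, 1))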

  • Dynamical Systems

    Given a dynamical system with states, observations, and actions given by s_k, y_k, and a_k, respectively, and scalar-valued rewards r_k(s_k, a_k), find the actions a_k that maximize the expected discounted reward

    R_T = E[ Σ_{k=1}^{T} δ^k r_k(s_k, a_k) ]

    where 0 < δ ≤ 1 is the discount factor. (A short sketch that evaluates this sum follows below.)
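    As a small numerical illustration (not from the slides), the Python snippet below evaluates the discounted sum Σ_{k=1}^{T} δ^k r_k for a made-up reward sequence.

    # Discounted return R_T = sum_{k=1}^{T} delta^k * r_k for a made-up reward sequence.
    delta = 0.9                       # discount factor, 0 < delta <= 1
    rewards = [1.0, 0.0, 2.0, 1.5]    # hypothetical r_1, ..., r_T

    R_T = sum(delta**k * r for k, r in enumerate(rewards, start=1))
    print(R_T)  # 0.9*1.0 + 0.81*0.0 + 0.729*2.0 + 0.6561*1.5 ≈ 3.342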

  • Example

    Let r_k(θ_k, F_k) = -θ_k^2, where θ_k and F_k are time-discretized values of the angle θ (the state) and the force F (the action). Maximize the reward function R with respect to F.

  • Bellman's Principle of Optimality

  • Bellman's Equation

    Definition: A policy π(s_k) defines a probability distribution over actions given a state s_k,

    P(A_k = a_k | S_k = s_k)

    For deterministic policies, the action is given by a_k = π(s_k) with probability 1. (A small sketch of both kinds of policy follows below.)
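    As an illustration of this definition (not from the slides), a stochastic policy can be represented as a table of action probabilities per state, and a deterministic policy as a plain state-to-action mapping; the states, actions, and probabilities below are made up.

    import random

    # Stochastic policy: for each state s, a distribution P(A_k = a | S_k = s).
    stochastic_policy = {
        "s0": {"left": 0.3, "right": 0.7},
        "s1": {"left": 0.9, "right": 0.1},
    }

    def sample_action(policy, s):
        """Draw an action a ~ pi(. | s)."""
        actions, probs = zip(*policy[s].items())
        return random.choices(actions, weights=probs, k=1)[0]

    # Deterministic policy: a = pi(s) with probability 1.
    deterministic_policy = {"s0": "right", "s1": "left"}

    print(sample_action(stochastic_policy, "s0"))
    print(deterministic_policy["s0"])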

  • Bellman's Equation

    Let δ^i · Q^π_i(s, a) be the expected reward-to-go from time step k = i to k = T given the policy π(s_k), the state s_i = s, and the action a_i = a. That is,

    Q^π_i(s, a) = E[ Σ_{k=i}^{T} δ^{k-i} r_k(s_k, a_k) | s_i = s, a_i = a ]

    Then,

    Q^π_i(s, a) = E[ r_i(s_i, a_i) + δ · Σ_{k=i+1}^{T} δ^{k-(i+1)} r_k(s_k, a_k) | s_i = s, a_i = a ]
                = E[ r_i(s, a) + δ · Q^π_{i+1}(s_{i+1}, a_{i+1}) | s_i = s, a_i = a ]

  • Bellman's Equation

    Let π* be an optimal policy that maximizes Q^π_i(s, a) for all i, and define

    Q*_i(s, a) = Q^{π*}_i(s, a)

    Since the policy is optimal for all i, we have

    π*(s) = arg sup_a Q*_i(s, a)

    Bellman's Equation:

    Q*_i(s, a) = E[ r_i(s, a) + δ · Q*_{i+1}(s_{i+1}, a_{i+1}) | s_i = s, a_i = a ]

  • Dynamic Programming

    If the system model is known, we can use dynamic programming. Let

    V*(s) = sup_a Q*_i(s, a)

    The Bellman equation is then given by

    V(s_k) = sup_{a_k} E[ r_k(s_k, a_k) + δ · V(s_{k+1}) ]
           = sup_{a_k} Σ_{s'∈S} P(s' | s_k, a_k) ( r_k(s_k, a_k) + δ · V(s') )

    (A value-iteration sketch based on this recursion follows below.)
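    When P(s' | s, a) and the rewards are known, the recursion above can be solved by repeatedly applying its right-hand side until V stops changing (value iteration). The sketch below does this for a hypothetical two-state, two-action MDP; the transition table and rewards are illustrative, not from the slides.

    # Value iteration on a made-up 2-state, 2-action MDP.
    states = [0, 1]
    actions = [0, 1]
    delta = 0.9  # discount factor

    # Transition kernel P[s][a] = {s': probability} and rewards r[s][a] (made up).
    P = {
        0: {0: {0: 0.8, 1: 0.2}, 1: {0: 0.1, 1: 0.9}},
        1: {0: {0: 0.5, 1: 0.5}, 1: {0: 0.0, 1: 1.0}},
    }
    r = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.5, 1: 2.0}}

    V = {s: 0.0 for s in states}
    for _ in range(1000):  # apply the Bellman recursion until (approximate) convergence
        V_new = {
            s: max(
                sum(P[s][a][s2] * (r[s][a] + delta * V[s2]) for s2 in states)
                for a in actions
            )
            for s in states
        }
        if max(abs(V_new[s] - V[s]) for s in states) < 1e-9:
            V = V_new
            break
        V = V_new

    print(V)  # approximately optimal state values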

  • Model-Free Optimization and Reinforcement Learning

    What if we don't have the system model?

    If the system is deterministic, the model is given by

    s_{k+1} = f_k(s_k, a_k)

    If the system is stochastic, the model is given by

    P(s_{k+1} | s_k, a_k)

  • Q-Learning

    Let s = s_k and s' = s_{k+1}. Learning the Q function is done by minimizing, with respect to Q, the following cost function

    l = ( r(s, a) + δ sup_{a'} Q(s', a') - Q(s, a) )^2

    Gradient descent update rule with learning rate (step size) α:

    Q(s, a) ← Q(s, a) + α ( r(s, a) + δ sup_{a'} Q(s', a') - Q(s, a) )

    The optimal policy is estimated from Q(s, a):

    π(s) = arg sup_a Q(s, a)

    (A minimal tabular implementation sketch follows below.)
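    A minimal tabular implementation of this update rule might look as follows. The tiny chain environment, its reward, and the ε-greedy exploration scheme are illustrative choices, not from the slides; α plays the role of the learning rate above.

    import random
    from collections import defaultdict

    # Made-up chain environment: states 0..4, actions 0 (left) / 1 (right);
    # reaching state 4 gives reward 1 and ends the episode.
    def env_step(s, a):
        s_next = min(s + 1, 4) if a == 1 else max(s - 1, 0)
        reward = 1.0 if s_next == 4 else 0.0
        done = (s_next == 4)
        return s_next, reward, done

    alpha, delta, eps = 0.1, 0.9, 0.1   # learning rate, discount factor, exploration rate
    Q = defaultdict(float)              # Q[(s, a)], initialised to 0

    for episode in range(500):
        s, done = 0, False
        while not done:
            # epsilon-greedy action selection so all state-action pairs keep being tried
            if random.random() < eps:
                a = random.choice([0, 1])
            else:
                a = max([0, 1], key=lambda b: Q[(s, b)])
            s_next, reward, done = env_step(s, a)
            # Q(s,a) <- Q(s,a) + alpha * (r + delta * max_a' Q(s',a') - Q(s,a))
            target = reward if done else reward + delta * max(Q[(s_next, b)] for b in [0, 1])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next

    # Greedy policy estimated from Q: pi(s) = argmax_a Q(s, a)
    print({s: max([0, 1], key=lambda b: Q[(s, b)]) for s in range(5)})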

  • Q-Learning

    Theorem: Consider the Q-learning algorithm given by

    Q(s, a) ← Q(s, a) + α ( r(s, a) + δ sup_{a'} Q(s', a') - Q(s, a) )

    The Q-learning algorithm converges to the optimal action-value function, Q(s, a) → Q*(s, a), under the standard conditions that every state-action pair is visited infinitely often and the step sizes α_k decay appropriately (Σ_k α_k = ∞, Σ_k α_k^2 < ∞).

  • Deep Reinforcement Learning

    Deep reinforcement learning (used in AlphaGo, which defeated the World Champion in Go) is reinforcement learning where the Q function is approximated with a deep neural network.

    The loss function is minimized with respect to the neural network weights w:

    l = ( r(s, a) + δ sup_{a'} Q(s', a', w^-) - Q(s, a, w) )^2

    where w^- denotes the weights of a separate, periodically updated target network. (A sketch of this loss follows below.)
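    Below is a sketch of this loss for a small fully connected network, assuming the third-party PyTorch package (torch) is available; the layer sizes, the target-network copy, and the random batch of transitions are illustrative, not from the slides.

    import torch
    import torch.nn as nn

    n_states, n_actions = 4, 2   # e.g. cart-pole-like dimensions (illustrative)

    # Online network Q(s, a; w) and target network Q(s, a; w-) with frozen weights.
    q_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
    target_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
    target_net.load_state_dict(q_net.state_dict())   # w- <- w (updated only periodically)

    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    delta = 0.99

    # A made-up batch of transitions (s, a, r, s').
    s = torch.randn(32, n_states)
    a = torch.randint(0, n_actions, (32,))
    r = torch.randn(32)
    s_next = torch.randn(32, n_states)

    # l = (r + delta * max_a' Q(s', a'; w-) - Q(s, a; w))^2
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)            # Q(s, a; w)
    with torch.no_grad():
        target = r + delta * target_net(s_next).max(dim=1).values   # r + delta * sup_a' Q(s', a'; w-)
    loss = nn.functional.mse_loss(q_sa, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()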

  • Cart Pole Example

    Figure: the horizontal ("x") axis represents time and the vertical ("y") axis represents the upward position of the pole (the "z" axis in space). (A minimal interaction loop with a cart-pole environment follows below.)
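    For completeness, a minimal interaction loop with a cart-pole environment, assuming the third-party gymnasium package (not mentioned on the slides) and a purely random policy in place of a learned one:

    import gymnasium as gym

    env = gym.make("CartPole-v1")
    obs, info = env.reset(seed=0)

    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()   # random action, just to show the loop
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        done = terminated or truncated

    print("episode return:", total_reward)
    env.close()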

  • End of Presentation

    THANKS FOR LISTENING!
