reinforcement learning - lecture 1: introduction · 2017-10-02 · reinforcement learning lecture...
![Page 1: Reinforcement Learning - Lecture 1: Introduction · 2017-10-02 · Reinforcement Learning Lecture 1: Introduction Alexandre Proutiere, Sadegh Talebi, Jungseul Ok KTH, The Royal Institute](https://reader034.vdocument.in/reader034/viewer/2022042308/5ed411d38d46b66d22636626/html5/thumbnails/1.jpg)
Reinforcement Learning
Lecture 1: Introduction
Alexandre Proutiere, Sadegh Talebi, Jungseul Ok
KTH, The Royal Institute of Technology
Lecture 1: Outline
1. Generic models for sequential decision making
2. Overview and schedule of the course
Sequential Decision Making
Objective. Devise a sequential action selection / control policy
maximising rewards
Sequential Decision Making
Problem definition
1. System dynamics
2. Set of available policies – available information or feedback to the
decision maker
3. Reward structure
Applications
Sequential Decision Making
Dynamics. A few examples:
• Linear: $s_{t+1} = A s_t + B a_t$
• Deterministic and stationary: $s_{t+1} = F(s_t, a_t)$
• Markovian: $\mathbb{P}(s_{t+1} = s' \mid h_t, s_t = s, a_t = a) = p_t(s' \mid s, a)$,
where $\sum_{s'} p_t(s' \mid s, a) = 1$; homogeneous if $p_t(s' \mid s, a) = p(s' \mid s, a)$
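The three kinds of dynamics can be sketched in a few lines of Python. This is only an illustration: the scalar coefficients `A` and `B`, the map in `deterministic_step`, and the transition table `P` are made-up values, not taken from the lecture.

```python
import random

def linear_step(s, a, A=0.9, B=0.5):
    """Linear dynamics (scalar case): s_{t+1} = A s_t + B a_t."""
    return A * s + B * a

def deterministic_step(s, a):
    """Deterministic, stationary dynamics s_{t+1} = F(s_t, a_t).
    Hypothetical F: a counter modulo 3 driven by the action."""
    return (s + a) % 3

# Homogeneous Markovian dynamics: P[s][a][j] = p(j | s, a).
# Hypothetical two-state, two-action kernel; every row sums to 1.
P = {
    0: {0: [0.8, 0.2], 1: [0.1, 0.9]},
    1: {0: [0.5, 0.5], 1: [0.0, 1.0]},
}

def markov_step(s, a, rng):
    """Sample s_{t+1} from p(. | s, a); the history h_t is not needed."""
    return rng.choices([0, 1], weights=P[s][a])[0]

rng = random.Random(0)
s_lin = linear_step(1.0, 2.0)      # 0.9 * 1.0 + 0.5 * 2.0
s_det = deterministic_step(2, 1)   # (2 + 1) % 3
s_mkv = markov_step(1, 1, rng)     # p(1 | 1, 1) = 1, so the next state is 1
```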
Sequential Decision Making
Information - Set of policies. A few examples:
• Markov Decision Process (MDP)
- Fully observable state and reward
- Known reward distribution and transition probabilities
- $a_t$ function of $(s_0, a_0, r_0, \ldots, s_{t-1}, a_{t-1}, r_{t-1}, s_t)$
Sequential Decision Making
Information - Set of policies. A few examples:
• Partially Observable Markov Decision Process (POMDP)
- Partially observable state: we know $z_t$ with known $\mathbb{P}[s_t = s \mid z_t]$
- Observed rewards
- Known reward distribution and transition probabilities
- $a_t$ function of $(z_0, a_0, r_0, \ldots, z_{t-1}, a_{t-1}, r_{t-1}, z_t)$
Sequential Decision Making
Information - Set of policies. A few examples:
• Reinforcement learning
- Observable state and reward
- Unknown reward distribution
- Unknown transition probabilities
- $a_t$ function of $(s_0, a_0, r_0, \ldots, s_{t-1}, a_{t-1}, r_{t-1}, s_t)$
Sequential Decision Making
Information - Set of policies. A few examples:
• Adversarial problems
- Observable state and reward
- Arbitrary and time-varying reward function and state transitions
- $a_t$ function of $(s_0, a_0, r_0, \ldots, s_{t-1}, a_{t-1}, r_{t-1}, s_t)$
Sequential Decision Making
Objectives. A few examples:
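• Finite horizon: $\max_\pi \mathbb{E}\left[\sum_{t=0}^{T} r_t(a_t^\pi, s_t^\pi)\right]$
• Infinite horizon discounted: $\max_\pi \mathbb{E}\left[\sum_{t=0}^{\infty} \lambda^t r_t(a_t^\pi, s_t^\pi)\right]$
• Infinite horizon average: $\max_\pi \liminf_{T \to \infty} \frac{1}{T} \mathbb{E}\left[\sum_{t=0}^{T} r_t(a_t^\pi, s_t^\pi)\right]$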
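To make the three criteria concrete, the sketch below evaluates them on one realised reward sequence. The reward values and the discount factor are hypothetical, and a single trajectory only gives a sample of the quantity inside the expectation, not the expectation itself.

```python
def finite_horizon_return(rewards):
    """Sum of rewards over the horizon: sum_t r_t."""
    return sum(rewards)

def discounted_return(rewards, lam):
    """Discounted sum: sum_t lam^t r_t, with t starting at 0."""
    return sum((lam ** t) * r for t, r in enumerate(rewards))

def average_reward(rewards):
    """Empirical average reward over the observed prefix."""
    return sum(rewards) / len(rewards)

rewards = [1.0, 0.0, 2.0, 1.0]           # hypothetical r_0, ..., r_3
total = finite_horizon_return(rewards)   # 1 + 0 + 2 + 1
disc = discounted_return(rewards, 0.5)   # 1 + 0 + 2 * 0.25 + 1 * 0.125
avg = average_reward(rewards)            # 4 / 4
```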
Problem classification
Selling an item
You need to sell your house, and receive offers sequentially. Rejecting an
offer has a cost of 10 kSEK. What is the rejection/acceptance policy
maximising your profit?
MDP. Offers are i.i.d. with known distribution
Reinforcement learning. (Bandit optimisation) Offers are i.i.d. with
unknown distribution
Adversarial problem. The sequence of offers is arbitrary
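For the MDP variant, a possible sketch is given below under extra assumptions that the slide does not state: offers keep coming forever, there is no discounting, and a rejected offer cannot be recalled. Under these assumptions the value $V$ of continuing after a rejection satisfies $V = \mathbb{E}[\max(X, V)] - c$, that is $\mathbb{E}[(X - V)^+] = c$, and an offer is accepted when it is at least $V$. The offer distribution below is hypothetical.

```python
def expected_excess(values, probs, v):
    """E[(X - v)^+] for a discrete offer distribution."""
    return sum(p * max(x - v, 0.0) for x, p in zip(values, probs))

def reservation_price(values, probs, cost, iters=100):
    """Solve E[(X - v)^+] = cost for v by bisection.
    The left-hand side is continuous and non-increasing in v."""
    lo, hi = min(values) - cost, max(values)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expected_excess(values, probs, mid) > cost:
            lo = mid   # continuing is still worth more than mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical offers in kSEK, each with probability 0.2; rejection costs 10 kSEK.
offers = [2000.0, 2100.0, 2200.0, 2300.0, 2400.0]
probs = [0.2] * 5
threshold = reservation_price(offers, probs, cost=10.0)
```

With these numbers the threshold is 2350 kSEK, so only the 2400 kSEK offer is accepted. When the distribution is unknown, the threshold itself has to be learnt from the observed offers.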
Reinforcement learning
Learning optimal sequential behaviour / control from interacting with the
environment
Unknown state dynamics and reward function: $s_{t+1}^\pi = F_t(s_t^\pi, a_t^\pi)$ and $r_t(\cdot, \cdot)$
[. . .]
By the time we learn to live
It’s already too late
Our hearts cry in unison at night
[. . .]
Louis Aragon
Reinforcement learning: Applications
• Making a robot walk
• Portfolio optimisation
• Playing games better than humans
• Helicopter stunt manoeuvres
• Optimal communication protocols in radio networks
• Display ads
• Search engines
• ...
1. Bandit Optimisation
State dynamics:
$s_{t+1}^\pi = F_t(s_t^\pi, a_t^\pi)$
• Interact with an i.i.d. or adversarial environment
• The reward is independent of the state and is the only feedback:
- i.i.d. environment: $r_t(a, s) = r_t(a)$ is a random variable with mean $\theta_a$
- adversarial environment: $r_t(a, s) = r_t(a)$ is arbitrary!
2. Markov Decision Process (MDP)
State dynamics:
$s_{t+1}^\pi = F_t(s_t^\pi, a_t^\pi)$
• History at $t$: $h_t^\pi = (s_1^\pi, a_1^\pi, \ldots, s_{t-1}^\pi, a_{t-1}^\pi, s_t^\pi)$
• Markovian environment: $\mathbb{P}[s_{t+1}^\pi = s' \mid h_t^\pi, s_t^\pi = s, a_t^\pi = a] = p(s' \mid s, a)$
• Stationary deterministic rewards (for simplicity): $r_t(a, s) = r(a, s)$
What is to be learnt and optimised?
• Bandit optimisation: the average rewards of actions are unknown
Information available at time $t$ under $\pi$: $a_1^\pi, r_1(a_1^\pi), \ldots, a_{t-1}^\pi, r_{t-1}(a_{t-1}^\pi)$
• MDP: The state dynamics $p(\cdot \mid s, a)$ and the reward function $r(a, s)$ are unknown
Information available at time $t$ under $\pi$: $s_1^\pi, a_1^\pi, r_1(a_1^\pi, s_1^\pi), \ldots, s_{t-1}^\pi, a_{t-1}^\pi, r_{t-1}(a_{t-1}^\pi, s_{t-1}^\pi), s_t^\pi$
• Objective: maximise the cumulative reward
$\sum_{t=1}^{T} \mathbb{E}[r_t(a_t^\pi, s_t^\pi)]$ or $\sum_{t=1}^{\infty} \lambda^t \mathbb{E}[r_t(a_t^\pi, s_t^\pi)]$
Regret
• Difference between the cumulative reward of an "Oracle" policy and that of agent $\pi$
• Regret quantifies the price to pay for learning!
• Exploration vs. exploitation trade-off: we need to probe all actions
to play the best later ...
1. Bandit Optimisation
First application: Clinical trial, Thompson 1933
- A set of possible actions at each step
- Unknown sequence of rewards for each action
- Bandit feedback: only rewards of chosen actions are observed
- Goal: maximise the cumulative reward (up to step $T$)
Two examples:
a. Finite number of actions, stochastic rewards
b. Continuous actions, concave adversarial rewards
a. Stochastic bandits – Robbins 1952
• Finite set of actions A
• (Unknown) rewards of action $a \in A$: $(r_t(a), t \ge 0)$ i.i.d. Bernoulli with $\mathbb{E}[r_t(a)] = \theta_a$
• Optimal action $a^\star \in \arg\max_a \theta_a$
• Online policy $\pi$: select action $a_t^\pi$ at time $t$ depending on $a_1^\pi, r_1(a_1^\pi), \ldots, a_{t-1}^\pi, r_{t-1}(a_{t-1}^\pi)$
• Regret up to time $T$: $R^\pi(T) = T\theta_{a^\star} - \sum_{t=1}^{T} \theta_{a_t^\pi}$
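The regret definition can be checked on a simulated Bernoulli bandit. The sketch below is illustrative only: the means in `theta` and the two baseline policies are hypothetical, and a policy sees nothing but the history of its own actions and rewards.

```python
import random

def run_policy(theta, policy, T, rng):
    """Play a Bernoulli bandit for T rounds and return
    R(T) = T * theta_star - sum_t theta_{a_t}."""
    theta_star = max(theta)
    history = []   # (action, reward) pairs: all that the policy may use
    regret = 0.0
    for _ in range(T):
        a = policy(history, len(theta), rng)
        r = 1 if rng.random() < theta[a] else 0   # Bernoulli(theta_a) reward
        history.append((a, r))
        regret += theta_star - theta[a]
    return regret

def uniform_policy(history, n_actions, rng):
    """Ignores the feedback and picks an action uniformly at random."""
    return rng.randrange(n_actions)

def always_first(history, n_actions, rng):
    """Always plays action 0, whatever has been observed."""
    return 0

theta = [0.3, 0.5]   # hypothetical means; action 1 is optimal
rng = random.Random(0)
regret_uniform = run_policy(theta, uniform_policy, 100, rng)
regret_first = run_policy(theta, always_first, 100, rng)
```

Both baselines have regret growing linearly in $T$, since neither uses the observed rewards to favour the better action.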
a. Stochastic bandits
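Fundamental performance limits (Lai-Robbins 1985). For any reasonable $\pi$:
$\liminf_{T \to \infty} \frac{R^\pi(T)}{\log(T)} \ge \sum_{a \ne a^\star} \frac{\theta_{a^\star} - \theta_a}{\mathrm{KL}(\theta_a, \theta_{a^\star})}$
where $\mathrm{KL}(a, b) = a \log\left(\frac{a}{b}\right) + (1 - a) \log\left(\frac{1 - a}{1 - b}\right)$ (KL divergence)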
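As a numerical illustration, the constant in front of $\log(T)$ can be computed for a given set of Bernoulli means. The means below are hypothetical, and the functions assume all means lie strictly between 0 and 1.

```python
import math

def kl_bernoulli(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b), for 0 < a, b < 1."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def lai_robbins_constant(theta):
    """Sum over suboptimal actions of (theta_star - theta_a) / KL(theta_a, theta_star)."""
    theta_star = max(theta)
    return sum((theta_star - th) / kl_bernoulli(th, theta_star)
               for th in theta if th < theta_star)

theta = [0.3, 0.5]   # hypothetical means
constant = lai_robbins_constant(theta)
```

For these means the constant is roughly 2.43, so no reasonable policy can have regret growing slower than about $2.43 \log(T)$.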
Algorithms:
(i) $\varepsilon$-greedy: linear regret
(ii) $\varepsilon_t$-greedy: logarithmic regret ($\varepsilon_t = 1/t$)
(iii) Upper Confidence Bound algorithm: $b_a(t) = \hat{\theta}_a(t) + \sqrt{\frac{2 \log(t)}{n_a(t)}}$
$\hat{\theta}_a(t)$: empirical reward of $a$ up to $t$
$n_a(t)$: number of times $a$ was played up to $t$
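A minimal sketch of the Upper Confidence Bound rule follows, assuming the usual convention that each action is played once for initialisation and that the action with the largest index $b_a(t)$ is played afterwards. The means in `theta` and the horizon are hypothetical.

```python
import math
import random

def ucb_index(mean, n, t):
    """b_a(t) = theta_hat_a(t) + sqrt(2 log(t) / n_a(t))."""
    return mean + math.sqrt(2.0 * math.log(t) / n)

def run_ucb(theta, T, rng):
    """Run UCB on a Bernoulli bandit; return play counts and regret."""
    k = len(theta)
    counts = [0] * k    # n_a(t): number of times each action was played
    sums = [0.0] * k    # cumulative reward collected from each action
    regret = 0.0
    for t in range(1, T + 1):
        if t <= k:
            a = t - 1   # initialisation: play every action once
        else:
            a = max(range(k),
                    key=lambda i: ucb_index(sums[i] / counts[i], counts[i], t))
        r = 1.0 if rng.random() < theta[a] else 0.0
        counts[a] += 1
        sums[a] += r
        regret += max(theta) - theta[a]
    return counts, regret

theta = [0.2, 0.8]   # hypothetical means; action 1 is optimal
counts, regret = run_ucb(theta, 2000, random.Random(0))
```

The exploration bonus shrinks as an action is played more often, so suboptimal actions are still probed, but increasingly rarely.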
b. Adversarial Convex Bandits
At the beginning of each year, Volvo has to select a vector x (in a convex
set) representing the relative efforts in producing various models (S60,
V70, V90, . . .). The reward is an arbitrarily varying and unknown
concave function of x. How to maximise reward over say 50 years?
b. Adversarial Convex Bandits
• Continuous set of actions $A = [0, 1]$
• (Unknown) arbitrary but concave rewards of action $x \in A$: $r_t(x)$
• Online policy $\pi$: select action $x_t^\pi$ at time $t$ depending on $x_1^\pi, r_1(x_1^\pi), \ldots, x_{t-1}^\pi, r_{t-1}(x_{t-1}^\pi)$
• Regret up to time $T$ (defined w.r.t. the best empirical action up to time $T$):
$R^\pi(T) = \max_{x \in [0, 1]} \sum_{t=1}^{T} r_t(x) - \sum_{t=1}^{T} r_t(x_t^\pi)$
Can we do something smart at all? Achieve a sublinear regret?
b. Adversarial Convex Bandits
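• If $r_t(\cdot) = r(\cdot)$, and if $r(\cdot)$ were known, we could apply a gradient ascent algorithm
• One-point gradient estimate:
$\hat{f}(x) = \mathbb{E}_{v \in B}[f(x + \delta v)]$, $B = \{x : \|x\|_2 \le 1\}$
$\mathbb{E}_{u \in S}[f(x + \delta u) u] = \delta \nabla \hat{f}(x)$, $S = \{x : \|x\|_2 = 1\}$
• Simulated Gradient Ascent algorithm: at each step $t$, do
- $u_t$ uniformly chosen in $S$
- $x_t = y_t + \delta u_t$ (the action played; only $r_t(x_t)$ is observed)
- $y_{t+1} = y_t + \alpha r_t(x_t) u_t$
• Regret: $R(T) = O(T^{5/6})$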
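The one-point estimate and the resulting ascent rule can be sketched in dimension one, where $S = \{-1, +1\}$. This is an illustration under stated assumptions: the quadratic reward functions, the step size `alpha`, and `delta` are hypothetical, the internal iterate `y` is perturbed to obtain the played action `x`, and the clipping step (a projection keeping the played action inside $[0, 1]$) is an addition that the slide does not mention.

```python
import random

def expected_one_point_estimate(f, x, delta):
    """E_u[f(x + delta * u) * u] for u uniform on S = {-1, +1}."""
    return 0.5 * (f(x + delta) - f(x - delta))

def bandit_gradient_ascent(rewards, y0, alpha, delta, rng):
    """Gradient ascent using a single reward observation per step.
    rewards: list of concave functions r_t on [0, 1]."""
    y = y0
    played = []
    for r_t in rewards:
        u = 1.0 if rng.random() < 0.5 else -1.0   # u_t uniform on S
        x = y + delta * u                         # action actually played
        played.append(x)
        y = y + alpha * r_t(x) * u                # one-point gradient step
        # Added assumption: project so that the next played action stays in [0, 1].
        y = min(max(y, delta), 1.0 - delta)
    return played

def make_reward(c):
    """Hypothetical concave reward with its maximum at c."""
    return lambda x: -(x - c) ** 2

# Hypothetical time-varying rewards: the maximiser alternates between 0.6 and 0.8.
rewards = [make_reward(0.6 if t % 2 == 0 else 0.8) for t in range(1000)]
played = bandit_gradient_ascent(rewards, y0=0.2, alpha=0.05, delta=0.1,
                                rng=random.Random(0))
estimate = expected_one_point_estimate(make_reward(0.7), 0.4, 0.1)
```

For the quadratic reward centred at 0.7, the expected estimate at $x = 0.4$ equals $\delta \cdot (-2)(0.4 - 0.7) = 0.06$, in line with the identity $\mathbb{E}_{u \in S}[f(x + \delta u) u] = \delta \nabla \hat{f}(x)$.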
2. Markov Decision Process (MDP)
State dynamics:
$s_{t+1}^\pi = F_t(s_t^\pi, a_t^\pi)$
• Markovian environment: $\mathbb{P}[s_{t+1}^\pi = s' \mid h_t^\pi, s_t^\pi = s, a_t^\pi = a] = p(s' \mid s, a)$
• Stationary deterministic rewards (for simplicity): $r_t(a, s) = r(a, s)$
• $p(\cdot \mid s, a)$ and $r(\cdot, \cdot)$ are unknown initially
Example
Playing Pac-Man (Google DeepMind experiment, 2015)
State: the current displayed image
Action: right, left, down, up
Feedback: the score and its increments + state
Bellman’s equation
Objective: maximise the expected discounted reward $\sum_{t=1}^{\infty} \lambda^t \mathbb{E}[r(a_t^\pi, s_t^\pi)]$
Assume the transition probabilities and the reward function are known
• Value function: maps the initial state $s$ to the corresponding maximum reward $v(s)$
• Bellman's equation:
$v(s) = \max_{a \in A} \left[ r(a, s) + \lambda \sum_j p(j \mid s, a) v(j) \right]$
• Solve Bellman's equation. The optimal policy is given by:
$a^\star(s) = \arg\max_{a \in A} \left[ r(a, s) + \lambda \sum_j p(j \mid s, a) v(j) \right]$
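One standard way to solve Bellman's equation numerically is value iteration, which repeatedly applies the right-hand side until the values stop changing. The sketch below uses a hypothetical two-state MDP; the code indexes rewards as `r[s][a]`, which corresponds to $r(a, s)$ in the slide notation.

```python
def value_iteration(p, r, lam, tol=1e-10):
    """Iterate v <- max_a [ r(a, s) + lam * sum_j p(j | s, a) v(j) ].
    p[s][a][j] = p(j | s, a); r[s][a] = reward of action a in state s."""
    n_states = len(p)
    v = [0.0] * n_states
    while True:
        q = [[r[s][a] + lam * sum(p[s][a][j] * v[j] for j in range(n_states))
              for a in range(len(p[s]))]
             for s in range(n_states)]
        new_v = [max(q_s) for q_s in q]
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            # Greedy policy: the maximising action in each state.
            policy = [max(range(len(q_s)), key=q_s.__getitem__) for q_s in q]
            return new_v, policy
        v = new_v

# Hypothetical MDP: action 0 stays put, action 1 moves to the other state.
# The only non-zero reward is 1 for staying in state 1.
p = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
r = [[0.0, 0.0],
     [1.0, 0.0]]
v, policy = value_iteration(p, r, lam=0.5)
```

Here $v(1) = 1 + 0.5\, v(1)$ gives $v(1) = 2$, and $v(0) = 0.5\, v(1) = 1$; the optimal policy moves out of state 0 and stays in state 1.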
Q-learning
What if the transition probabilities and the reward function are unknown?
• Q-value function: the max expected reward starting from state s
and playing action a:
$Q(s, a) = r(a, s) + \lambda \sum_j p(j \mid s, a) \max_{b \in A} Q(j, b)$
Note that: $v(s) = \max_{a \in A} Q(s, a)$
• Algorithm: update the Q-value estimate sequentially so that it
converges to the true Q-value
Q-learning
1. Initialisation: select $Q \in \mathbb{R}^{S \times A}$ arbitrarily, and $s_0$
2. Q-value iteration: at each step $t$, select action $a_t$ (each state-action pair must be selected infinitely often)
Observe the new state $s_{t+1}$ and the reward $r(s_t, a_t)$
Update $Q(s_t, a_t)$:
$Q(s_t, a_t) := Q(s_t, a_t) + \alpha_t \left[ r(s_t, a_t) + \lambda \max_{a \in A} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$
The estimate converges to the true Q-value function if $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$
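The two steps above can be sketched as tabular Q-learning on a small example. Everything specific here is an assumption made for illustration: the deterministic two-state environment in `step`, uniformly random action selection (one simple way to visit every pair infinitely often), and the per-pair step size $n^{-0.6}$, which satisfies both step-size conditions.

```python
import random

def step(s, a):
    """Hypothetical deterministic environment: action 0 stays, action 1
    switches state; reward 1 only for staying in state 1."""
    s_next = s if a == 0 else 1 - s
    reward = 1.0 if (s == 1 and a == 0) else 0.0
    return s_next, reward

def q_learning(step_fn, n_states, n_actions, lam, T, rng, s0=0):
    """Tabular Q-learning; the learner only sees sampled transitions."""
    Q = [[0.0] * n_actions for _ in range(n_states)]    # arbitrary initial Q
    visits = [[0] * n_actions for _ in range(n_states)]
    s = s0
    for _ in range(T):
        a = rng.randrange(n_actions)       # uniform exploration
        s_next, reward = step_fn(s, a)
        visits[s][a] += 1
        alpha = visits[s][a] ** -0.6       # sum alpha = inf, sum alpha^2 < inf
        target = reward + lam * max(Q[s_next])
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next
    return Q

Q = q_learning(step, 2, 2, lam=0.5, T=20000, rng=random.Random(0))
```

In this example the true values are $Q(0, \cdot) = (0.5, 1)$ and $Q(1, \cdot) = (2, 0.5)$, which follow from $v(1) = 2$ and $v(0) = 1$ with $\lambda = 0.5$.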
Q-learning: demo
The crawling robot ...
https://www.youtube.com/watch?v=2iNrJx6IDEo
Scaling up Q-learning
Q-learning converges very slowly, especially when the state and action
spaces are large ...
State-of-the-art algorithms (optimal exploration, ideas from bandit opt.): regret $O(\sqrt{SAT})$
What if the action and state are continuous variables?
Example: Mountain car demo (see Sutton tutorial, NIPS 2015)
Q-learning with function approximation
Idea: restrict our attention to Q-value functions belonging to a family of functions $\mathcal{Q}$
Examples:
1. Linear functions: $\mathcal{Q} = \{Q_\theta, \theta \in \mathbb{R}^M\}$,
$Q_\theta(s, a) = \sum_{i=1}^{M} \phi_i(s, a) \theta_i = \phi^\top \theta$
where for all $i$, $\phi_i$ is linear. The $\phi_i$'s are linearly independent.
2. Deep networks: $\mathcal{Q} = \{Q_w, w \in \mathbb{R}^M\}$, $Q_w(s, a)$ given as the output of a neural network with weights $w$ and inputs $(s, a)$
Q-learning with linear function approximation
1. Initialisation: select $\theta \in \mathbb{R}^M$ arbitrarily, and $s_0$
2. Q-value iteration: at each step $t$, select action $a_t$ (each state-action pair must be selected infinitely often)
Observe the new state $s_{t+1}$ and the reward $r(s_t, a_t)$
Update $\theta$:
$\theta := \theta + \alpha_t \Delta_t \nabla_\theta Q_\theta(s_t, a_t) = \theta + \alpha_t \Delta_t \phi(s_t, a_t)$
where $\Delta_t = r(s_t, a_t) + \lambda \max_{a \in A} Q_\theta(s_{t+1}, a) - Q_\theta(s_t, a_t)$
For convergence results, see "An analysis of Reinforcement Learning with Function Approximation", Melo et al., ICML 2008
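The update of $\theta$ can be sketched as follows. To keep the result checkable, the sketch assumes hypothetical one-hot features (one coordinate per state-action pair, so $M = S \times A$ and the method reduces to the tabular case), the same kind of made-up two-state environment, uniform exploration, and a global step size $\alpha_t = t^{-0.6}$. With genuinely compressed features, convergence is not guaranteed in general.

```python
import random

N_STATES, N_ACTIONS = 2, 2

def phi(s, a):
    """Hypothetical one-hot feature vector for the pair (s, a)."""
    vec = [0.0] * (N_STATES * N_ACTIONS)
    vec[s * N_ACTIONS + a] = 1.0
    return vec

def q_value(theta, s, a):
    """Q_theta(s, a) = phi(s, a)^T theta."""
    return sum(f * w for f, w in zip(phi(s, a), theta))

def step(s, a):
    """Hypothetical deterministic environment: action 0 stays, action 1
    switches state; reward 1 only for staying in state 1."""
    s_next = s if a == 0 else 1 - s
    reward = 1.0 if (s == 1 and a == 0) else 0.0
    return s_next, reward

def linear_q_learning(lam, T, rng):
    theta = [0.0] * (N_STATES * N_ACTIONS)   # arbitrary initial theta
    s = 0
    for t in range(1, T + 1):
        a = rng.randrange(N_ACTIONS)         # uniform exploration
        s_next, reward = step(s, a)
        # Temporal-difference term Delta_t.
        delta = (reward
                 + lam * max(q_value(theta, s_next, b) for b in range(N_ACTIONS))
                 - q_value(theta, s, a))
        alpha = t ** -0.6
        grad = phi(s, a)   # gradient of Q_theta(s, a) with respect to theta
        theta = [w + alpha * delta * g for w, g in zip(theta, grad)]
        s = s_next
    return theta

theta = linear_q_learning(lam=0.5, T=20000, rng=random.Random(0))
```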
Q-learning with function approximation
Success stories:
• TD-Gammon (Backgammon), Tesauro 1995 (neural nets)
• Acrobatic helicopter autopilots, Ng et al. 2006
• Jeopardy, IBM Watson, 2011
• 49 Atari games, pixel-level visual inputs, Google DeepMind 2015
Outline of the course
L1. Introduction
L2. Markov Decision Processes and Bellman's equation for finite and infinite horizon (with or without discount)
L3. RL problems. Regret, sample complexity, exploration-exploitation trade-off
L4. First RL algorithms (e.g. Q-learning, TD-learning, SARSA). Convergence analysis
L5. Bandit optimization: the "optimism in face of uncertainty" principle vs. posterior sampling
L6. RL algorithms 2.0 (e.g. UCRL, Thompson Sampling, REGAL). Regret and sample complexity analysis
L7. Scalable RL algorithms: State aggregation, function approximation (deep RL, experience replay)
L8. Examples and empirical comparison of various algorithms