Sample Efficiency in Online RL
TRANSCRIPT
Reinforcement Learning China Summer School
RLChina 2021
Sample Efficiency in Online RL
Zhuoran Yang
Princeton → Yale (2022)
August 16, 2021
Motivation: sample complexity challenge in deep RL
Success and challenge of deep RL
DRL = Representation (DL) + Decision Making (RL)
[Figure panels: board games, computer games, robotic control, policy making]
image source: DeepMind, OpenAI, Salesforce Research
Success and challenge of deep RL
AlphaGo: 3 × 10^7 games of self-play (data), 40 days of training (computation)
AlphaStar: 2 × 10^2 years of self-play, 44 days of training
Rubik's Cube: 10^4 years of simulation, 10 days of training
[Diagram: Insufficient Data → Lack of Convergence → Unreliable AI; hence Massive Data Needed → Massive Computation]
Our goal: provably efficient RL algorithms
Sample efficiency: how many data points are needed?
Computational efficiency: how much computation is needed?
Function approximation: can we allow an infinite number of observations?
Background: Episodic Markov decision process
Episodic Markov decision process
[Diagram: agent–environment loop — state s_h ∈ S, action a_h = π_h(s_h) ∈ A, reward r_h = r_h(s_h, a_h) ∈ [0, 1], next state s_{h+1} ∼ P_h(· | s_h, a_h) ∈ S]
For h ∈ [H], in the h-th step, observe state s_h, take action a_h
Receive immediate reward r_h, with E[r_h | s_h = s, a_h = a] = r_h(s, a)
Environment evolves to a new state s_{h+1} ∼ P_h(· | s_h, a_h)
Policy π_h : S → A: how the agent takes actions at the h-th step
Goal: find the policy π⋆ that maximizes J(π) := E_π[∑_{h=1}^H r_h]
Episodic Markov decision process
[Diagram: trajectory s_1 → a_1 = π_1(s_1) → r(s_1, a_1) → s_2 ∼ P(· | s_1, a_1) → a_2 = π_2(s_2) → ⋯ → s_H → a_H = π_H(s_H) → r(s_H, a_H)]
J(π) = E[r(s_1, a_1) + r(s_2, a_2) + r(s_3, a_3) + ⋯ + r(s_H, a_H)]
For simplicity, fix s_1 = s⋆, r_h = r, and P_h = P for all h.
Contextual bandit is a special case of MDP
Contextual bandit (CB): observe context s ∈ S, take action a ∈ A, observe (random) reward r with mean r⋆(s, a)
Optimal policy: π⋆(s) = argmax_{a∈A} r⋆(s, a)
CB is an MDP with H = 1
Multi-armed bandit (MAB): H = 1, |A| finite, and |S| = 1
MDP is significantly more challenging than CB due to temporal dependency (i.e., state transitions)
From MDP to RL
Informal definition of RL: solve the MDP when the environment (r, P) is unknown
(i) generative model: able to query the reward r_h and next state s_{h+1} for any (s_h, a_h) ∈ S × A
(ii) offline setting: given a dataset D = {(s_h^t, a_h^t, r_h^t)}_{h∈[H], t∈[T]} (T trajectories), learn π⋆ without any interaction
(iii) online setting: without any prior knowledge, learn π⋆ by interacting with the MDP for T episodes
(i) learning from model → query complexity
(ii) learning from data → sample complexity
(iii) learning by doing → sample complexity + exploration
Online RL: Pipeline
[Diagram: the online agent executes π^t in the environment, collects data D^t ← D^{t−1} ∪ Traj(π^t), and updates π^{t+1} ← Algorithm(D^t, F); continue for T episodes]
Initialize π^1 = {π_h^1}_{h∈[H]} arbitrarily
For each t ∈ [T], execute π^t, get a trajectory Traj(π^t)
Store Traj(π^t) into the dataset D^t = {Traj(π^i), i ≤ t}
Update the policy via an RL algorithm (which we need to design)
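The loop above can be sketched in a few lines. Everything concrete here is a made-up stand-in: the two-state MDP, the `trajectory` sampler, and the trivial `update_policy` are hypothetical placeholders for the environment and for the RL algorithm we still need to design.

```python
import random

H, T = 5, 10                     # horizon and number of episodes
S, A = [0, 1], [0, 1]            # toy state and action spaces

def trajectory(policy, seed):
    """Execute a policy for one episode on a toy 2-state MDP; return Traj(pi)."""
    rng = random.Random(seed)
    s, traj = 0, []
    for h in range(H):
        a = policy[h][s]
        r = 1.0 if (s, a) == (1, 1) else 0.0   # toy reward
        s_next = rng.choice(S)                 # toy transition
        traj.append((s, a, r))
        s = s_next
    return traj

def update_policy(dataset):
    """Placeholder for the RL algorithm we need to design."""
    return [{s: 1 for s in S} for _ in range(H)]

policy = [{s: 0 for s in S} for _ in range(H)]  # arbitrary pi^1
dataset = []
for t in range(T):
    dataset.append(trajectory(policy, seed=t))  # D^t = D^{t-1} ∪ Traj(pi^t)
    policy = update_policy(dataset)

print(len(dataset), len(dataset[0]))  # T trajectories, each of length H
```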
Sample efficiency in online RL: regret
Measure sample efficiency via regret:
Regret(T) = ∑_{t=1}^T [J(π⋆) − J(π^t)], where the t-th summand is the suboptimality in the t-th episode
Why is regret meaningful?
Define a random policy π̄^T uniformly sampled from {π^t, t ∈ [T]}. Then J(π⋆) − E[J(π̄^T)] = Regret(T)/T
If Regret(T) = o(T), then when T is sufficiently large, π̄^T is arbitrarily close to optimal
Goal: Regret(T) = O(√T · poly(H) · poly(dim))
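As a quick numeric illustration of the Regret(T)/T conversion, suppose (hypothetically) that the per-episode suboptimality decays like 1/√t, so that Regret(T) grows like √T = o(T); then the average suboptimality vanishes:

```python
import math

T = 10_000
# Hypothetical per-episode suboptimality J(pi*) - J(pi^t), decaying like 1/sqrt(t)
subopt = [1 / math.sqrt(t) for t in range(1, T + 1)]
regret = sum(subopt)          # Regret(T) = sum of per-episode suboptimalities

# The uniformly sampled policy pi_bar^T has expected suboptimality Regret(T)/T
avg_subopt = regret / T
print(regret, avg_subopt)     # regret ~ 2*sqrt(T) = o(T), average -> 0
```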
Challenge of Online RL: exploration & uncertainty
[Figure: scatter plots of data points, illustrating function estimation from adaptively collected data]
Deep Exploration (exploration vs. exploitation): how to construct exploration incentives tailored to function approximators?
Function Estimation: how to assess estimation uncertainty based on adaptively acquired data?
Warmup example: LinUCB for linear CB
Linear contextual bandit
Setting of linear CB:
Observation structure: observe a (perhaps adversarial) context s^t ∈ S, take action a^t ∈ A, receive bounded reward r^t ∈ [0, 1]
E[r^t | s^t = s, a^t = a] = r⋆(s, a) for all (s, a) ∈ S × A
Linear reward function: r⋆(s, a) = φ(s, a)^⊤ θ⋆
φ : S × A → R^d: known feature mapping
Normalization condition: ‖θ⋆‖_2 ≤ √d, sup_{s,a} ‖φ(s, a)‖_2 ≤ 1
Why? This includes MAB as a special case: |S| = 1, d = |A|, φ(s, a) = e_a, θ⋆ = (μ_1, μ_2, ⋯, μ_{|A|}) with μ_a ∈ [0, 1] for all a ∈ A
Regret(T) = ∑_{t=1}^T ( max_{a∈A} ⟨φ(s^t, a) − φ(s^t, a^t), θ⋆⟩ )
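The MAB embedding in the last bullet can be checked directly: with one-hot features φ(s, a) = e_a and θ⋆ the vector of arm means, the linear model reproduces every arm's mean reward. The arm means below are made up for illustration.

```python
import numpy as np

mu = np.array([0.2, 0.5, 0.9])   # hypothetical arm means, mu_a in [0, 1]
d = len(mu)                      # d = |A|, and |S| = 1
theta_star = mu                  # theta* = (mu_1, ..., mu_|A|)

def phi(s, a):
    return np.eye(d)[a]          # phi(s, a) = e_a (one-hot)

# The linear reward recovers each arm's mean: r*(s, a) = <phi(s, a), theta*> = mu_a
for a in range(d):
    assert np.isclose(phi(0, a) @ theta_star, mu[a])

# Normalization condition holds: ||theta*||_2 <= sqrt(d), ||phi(s, a)||_2 <= 1
assert np.linalg.norm(theta_star) <= np.sqrt(d)
print("ok")
```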
Exploration via optimism in the face of uncertainty (OFU)
For each t ∈ [T], in the t-th round, determine a^t in the following two steps:
Uncertainty quantification: based on the current dataset D^{t−1}, construct a high-probability confidence set Θ^t that contains θ⋆:
P(∀t ∈ [T], θ⋆ ∈ Θ^t) ≥ 1 − δ    (1)
Optimistic planning: choose a^t to maximize the most optimistic model within Θ^t:
a^t = argmax_{a∈A} max_{θ∈Θ^t} ⟨φ(s^t, a), θ⟩    (2)
Intuition of the OFU principle
Define two functions:
r^{+,t}(s, a) = max_{θ∈Θ^t} ⟨φ(s, a), θ⟩
r^{−,t}(s, a) = min_{θ∈Θ^t} ⟨φ(s, a), θ⟩
By (1), θ⋆ ∈ Θ^t, so r^{+,t}(s, a) ≥ r⋆(s, a) ≥ r^{−,t}(s, a) for all (s, a) ∈ S × A
|r^{+,t}(s, a) − r^{−,t}(s, a)| reflects the uncertainty about action a at context s
By (2), a^t = argmax_a r^{+,t}(s^t, a): choose the action with either large uncertainty (explore) or large reward (exploit)
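As a sanity check of this intuition, when Θ^t is an ellipsoid {θ : ‖θ − θ̂‖_Λ ≤ β} the optimistic and pessimistic rewards admit the closed form ⟨φ, θ̂⟩ ± β‖φ‖_{Λ^{−1}} (this closed form reappears for LinUCB later in the slides). The sketch below compares it against random points on the ellipsoid boundary; all concrete numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
d, beta = 3, 0.5
theta_hat = rng.normal(size=d)
A = rng.normal(size=(d, d))
Lam = A @ A.T + np.eye(d)          # a positive-definite Lambda
phi = rng.normal(size=d)

# Closed form: r_{+/-} = <phi, theta_hat> +/- beta * ||phi||_{Lam^{-1}}
bonus = beta * np.sqrt(phi @ np.linalg.solve(Lam, phi))
r_plus = phi @ theta_hat + bonus
r_minus = phi @ theta_hat - bonus

# Sample points on the ellipsoid boundary ||theta - theta_hat||_Lam = beta:
# theta = theta_hat + beta * L^{-T} u with ||u||_2 = 1 and Lam = L L^T
L = np.linalg.cholesky(Lam)
u = rng.normal(size=(1000, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)
thetas = theta_hat + beta * np.linalg.solve(L.T, u.T).T
vals = thetas @ phi

# No sampled theta beats the closed-form optimistic/pessimistic values
assert vals.max() <= r_plus + 1e-9
assert vals.min() >= r_minus - 1e-9
print("closed-form width:", 2 * bonus)
```

The gap r_plus − r_minus = 2β‖φ‖_{Λ^{−1}} is exactly the "uncertainty width" the slide refers to.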
How to construct Θ^t for linear CB?
In the t-th round, our dataset is D^{t−1} = {(s^i, a^i, r^i)}_{i≤t−1}
First construct an estimator of θ⋆ via ridge regression:
θ^t = argmin_{θ∈R^d} L^t(θ) := ∑_{i=1}^{t−1} [r^i − ⟨φ(s^i, a^i), θ⟩]^2 + ‖θ‖_2^2
Hessian of the ridge loss (up to a factor of 2):
Λ^t = I + ∑_{i=1}^{t−1} φ(s^i, a^i) φ(s^i, a^i)^⊤
Closed-form solution: θ^t = (Λ^t)^{−1} ∑_{i=1}^{t−1} r^i · φ(s^i, a^i)
Define Θ^t as the confidence ellipsoid:
Θ^t = {θ ∈ R^d : ‖θ − θ^t‖_{Λ^t} ≤ β},
where β = C · √(d · log(1 + T/δ)) and ‖x‖_M = √(x^⊤ M x)
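A minimal check on synthetic data that the closed form θ^t = (Λ^t)^{−1} ∑ r^i · φ(s^i, a^i) really minimizes the ridge loss: its gradient ∇L^t(θ^t) = 2(Λ^t θ^t − ∑ r^i φ^i) should vanish. The data here are arbitrary random draws.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 50
Phi = rng.normal(size=(n, d))        # row i is phi(s^i, a^i)
r = rng.normal(size=n)               # rewards r^i

Lam = np.eye(d) + Phi.T @ Phi        # Lambda^t = I + sum_i phi_i phi_i^T
theta = np.linalg.solve(Lam, Phi.T @ r)   # closed-form ridge solution

# Gradient of L(theta) = sum_i (r^i - <phi_i, theta>)^2 + ||theta||^2
grad = -2 * Phi.T @ (r - Phi @ theta) + 2 * theta
assert np.allclose(grad, 0, atol=1e-8)    # first-order optimality holds
print("gradient norm:", np.linalg.norm(grad))
```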
LinUCB algorithm
Confidence ellipsoid: Θ^t = {θ ∈ R^d : ‖θ − θ^t‖_{Λ^t} ≤ β}
r^{+,t} admits a closed form:
r^{+,t}(s, a) = max_{θ∈Θ^t} ⟨φ(s, a), θ⟩ = ⟨φ(s, a), θ^t⟩ + β · ‖φ(s, a)‖_{(Λ^t)^{−1}}, where the second term is the bonus
LinUCB algorithm (Dani et al. (2008), Abbasi-Yadkori et al. (2011), ...):
For t = 1, ..., T, at the beginning of the t-th round:
Observe context s^t
Compute θ^t via ridge regression and compute the Hessian Λ^t
Choose a^t = argmax_{a∈A} r^{+,t}(s^t, a) and observe r^t
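The algorithm above can be sketched end to end. The environment below (random contexts, θ⋆, Gaussian noise) and the practical choice β = 1 are synthetic assumptions for illustration, not part of the slides; theory calls for β = C·√(d·log(1 + T/δ)).

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 5, 10, 2000                  # dimension, actions per round, rounds
theta_star = rng.normal(size=d)
theta_star *= np.sqrt(d) / np.linalg.norm(theta_star)   # ||theta*||_2 = sqrt(d)
beta = 1.0                             # practical constant; theory: C*sqrt(d*log(1+T/delta))

Lam = np.eye(d)                        # Lambda^1 = I
b = np.zeros(d)                        # running sum of r^i * phi^i
regret = 0.0
for t in range(T):
    feats = rng.normal(size=(K, d))    # this round's K candidate feature vectors
    feats /= np.maximum(np.linalg.norm(feats, axis=1, keepdims=True), 1.0)  # ||phi||_2 <= 1
    theta_hat = np.linalg.solve(Lam, b)                  # ridge estimator theta^t
    Lam_inv = np.linalg.inv(Lam)
    bonus = beta * np.sqrt(np.einsum("kd,de,ke->k", feats, Lam_inv, feats))
    a = int(np.argmax(feats @ theta_hat + bonus))        # optimistic action a^t
    means = feats @ theta_star
    r_t = means[a] + 0.1 * rng.normal()                  # noisy reward r^t
    regret += means.max() - means[a]                     # instant regret
    Lam += np.outer(feats[a], feats[a])                  # update Lambda and b
    b += r_t * feats[a]

print(f"cumulative regret over {T} rounds: {regret:.1f}")
```

Keeping the running sums Λ^t and b makes each round O(d^2 + Kd^2) after a single d×d solve, which is the poly(d, |A|, T) computational cost the next slide mentions.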
Regret Bound of LinUCB
Theorem. With probability 1 − δ, LinUCB satisfies
Regret(T) ≤ β · √(dT) · √(log T) = Õ(d√T)
Independent of the sizes of the context and action spaces; depends only on the dimension (linear in d)
Depends on T through √T (sample efficient)
Computational cost is poly(d, |A|, T) and the memory requirement is poly(d, T) (Why?)
Regret Analysis of LinUCB (1/3)
Step 1. Condition on the event E = {∀t ∈ [T], θ⋆ ∈ Θ^t}, so that
r^{−,t}(s, a) ≤ r⋆(s, a) ≤ r^{+,t}(s, a), ∀(s, a) ∈ S × A
Step 2. Regret decomposition:
max_a r⋆(s^t, a) − r⋆(s^t, a^t)
= (max_a r⋆(s^t, a) − r^{+,t}(s^t, a^t)) + (r^{+,t}(s^t, a^t) − r⋆(s^t, a^t))
≤ r^{+,t}(s^t, a^t) − r^{−,t}(s^t, a^t) ≤ 2β · ‖φ(s^t, a^t)‖_{(Λ^t)^{−1}},
where the first term is nonpositive because a^t maximizes r^{+,t}(s^t, ·) and r^{+,t} ≥ r⋆ on E. Since the instant regret is also at most 1 (bounded rewards), we have shown
max_a r⋆(s^t, a) − r⋆(s^t, a^t) ≤ 2β · min{1, ‖φ(s^t, a^t)‖_{(Λ^t)^{−1}}}
Regret Analysis of LinUCB (2/3)
Step 3. Telescope the bonus: Cauchy–Schwarz + elliptical potential lemma:
Regret(T) ≤ 2β · ∑_{t=1}^T min{1, ‖φ(s^t, a^t)‖_{(Λ^t)^{−1}}}
≤ 2β√T · [∑_{t=1}^T min{1, ‖φ(s^t, a^t)‖²_{(Λ^t)^{−1}}}]^{1/2}
≤ 2β√T · √(2 log det(Λ^T)) ≲ β · √T · √(d · log T)
Elliptical potential lemma
Lemma. ∑_{t=1}^T min{1, ‖φ(s^t, a^t)‖²_{(Λ^t)^{−1}}} ≤ 2 log det(Λ^{T+1}) ≤ d · log T (up to constants)
Bayesian perspective:
Prior distribution: θ⋆ ∼ N(0, I)
Likelihood: r^i = r⋆(s^i, a^i) + ε, ε ∼ N(0, 1)
Given the dataset D^{t−1}, the posterior distribution is N(θ^t, (Λ^t)^{−1})
‖φ(s, a)‖²_{(Λ^t)^{−1}} is (up to constants) the conditional mutual information gain I(θ⋆; (s, a) | D^{t−1}) = H(θ⋆ | D^{t−1}) − H(θ⋆ | D^{t−1} ∪ (s, a))
Elliptical potential lemma ↔ chain rule of mutual information
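A numeric check of the lemma's first inequality on random features (the feature sequence is arbitrary, invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 5, 500
Lam = np.eye(d)                      # Lambda^1 = I, so det(Lambda^1) = 1
lhs = 0.0
for t in range(T):
    phi = rng.normal(size=d)
    phi /= max(np.linalg.norm(phi), 1.0)               # enforce ||phi||_2 <= 1
    lhs += min(1.0, phi @ np.linalg.solve(Lam, phi))   # min{1, ||phi||^2_{(Lam^t)^{-1}}}
    Lam += np.outer(phi, phi)                          # Lambda^{t+1}

rhs = 2 * np.log(np.linalg.det(Lam))                   # 2 log det(Lambda^{T+1})
assert lhs <= rhs
print(round(lhs, 2), "<=", round(rhs, 2))
```

The inequality holds deterministically for any feature sequence with ‖φ‖_2 ≤ 1, which is what makes the lemma usable under adaptively collected data.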
Regret Analysis of LinUCB (3/3)
Step 4. Show optimism: θ⋆ ∈ Θ^t = {θ ∈ R^d : ‖θ − θ^t‖_{Λ^t} ≤ β}
θ^t − θ⋆ = (Λ^t)^{−1} [∑_{i=1}^{t−1} (r⋆(s^i, a^i) + ε^i) · φ(s^i, a^i) − Λ^t · θ⋆]   (writing r^i = r⋆(s^i, a^i) + ε^i)
= −(Λ^t)^{−1} θ⋆ + (Λ^t)^{−1} ∑_{i=1}^{t−1} ε^i · φ(s^i, a^i), where the ε^i form a martingale difference sequence
‖θ^t − θ⋆‖_{Λ^t} ≈ ‖∑_{i=1}^{t−1} ε^i · φ(s^i, a^i)‖_{(Λ^t)^{−1}} ≤ β
by concentration of self-normalized processes (e.g., Abbasi-Yadkori et al., 2011)
Summary of LinUCB
Algorithm is based on the optimism principle:
upper confidence bound = ridge estimator + bonus
Õ(d√T) regret via
a. utilizing optimism to bound the instant regret by the width of the confidence region (the bonus)
b. summing up the instant regret terms via the elliptical potential lemma
c. constructing the confidence region via uncertainty quantification for ridge regression
Extensions: generalized linear models, RKHS, overparameterized networks, abstract function classes with bounded eluder dimension, ...
25
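The loop summarized above fits in a few lines of code; this is a minimal illustration on a synthetic finite-armed linear bandit (the arm features, $\beta$, and environment are hypothetical choices, not from the talk):

```python
import numpy as np

def linucb(K=20, d=5, T=3000, beta=1.0, seed=0):
    """Minimal LinUCB sketch on a synthetic K-armed linear bandit."""
    rng = np.random.default_rng(seed)
    Phi = rng.normal(size=(K, d)) / np.sqrt(d)      # fixed arm features
    theta_star = rng.normal(size=d) / np.sqrt(d)    # unknown reward parameter
    Lam, b = np.eye(d), np.zeros(d)                 # ridge statistics
    best = (Phi @ theta_star).max()
    regret = 0.0
    for t in range(T):
        theta_hat = np.linalg.solve(Lam, b)         # ridge estimator
        Lam_inv = np.linalg.inv(Lam)
        # bonus_k = beta * ||phi_k||_{Lambda^{-1}}
        bonus = beta * np.sqrt(np.einsum('kd,de,ke->k', Phi, Lam_inv, Phi))
        a = int(np.argmax(Phi @ theta_hat + bonus)) # optimistic arm choice
        r = Phi[a] @ theta_star + rng.normal()      # noisy reward
        Lam += np.outer(Phi[a], Phi[a])
        b += r * Phi[a]
        regret += best - Phi[a] @ theta_star
    return regret
```

Running `linucb()` gives cumulative regret far below the linear-in-$T$ regret of a non-adaptive policy, consistent with the $O(d\sqrt{T})$ bound.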
![Page 62: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/62.jpg)
From Linear CB to Linear RL
26
![Page 63: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/63.jpg)
Value functions and Bellman equation
Our goal is to find $\pi^\star = \arg\max_\pi J(\pi) = \mathbb{E}\big[\sum_{h=1}^{H} r_h\big]$
$\pi^\star$ is characterized by $Q^\star = \{Q^\star_h\}_{h \in [H+1]} : \mathcal{S} \times \mathcal{A} \to [0, H]$
$\pi^\star_h(s) = \arg\max_a Q^\star_h(s, a)$, $\forall s \in \mathcal{S}$; $\quad V^\star_h(s) = \max_a Q^\star_h(s, a)$
$Q^\star_h(s, a)$: optimal return starting from $s_h = s$ and $a_h = a$
$Q^\star$ is characterized by the Bellman equation
[Figure: $Q^\star_h(s,a)$ decomposed along a trajectory $s_h = s$, $a_h = a$, $s_{h+1}, a_{h+1}, \ldots, s_H, a_H$ that follows $\pi^\star_{h+1}, \ldots, \pi^\star_H$ after step $h$, accumulating $r(s,a) + r_{h+1} + \cdots + r_H$; the recursion is given by the Bellman equation]
27
![Page 67: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/67.jpg)
Value functions and Bellman equation
MDP terminates after $H$ steps — $Q^\star_{H+1} = 0$, $Q^\star_H = r$
Bellman equation:
$Q^\star_h(s, a) = \overbrace{r(s, a) + \mathbb{E}_{s_{h+1} \sim P}\big[V^\star_{h+1}(s_{h+1}) \mid s_h = s, a_h = a\big]}^{\text{Bellman operator: } \mathbb{B}Q^\star_{h+1}}, \quad V^\star_h(s) = \max_a Q^\star_h(s, a)$
RL with linear function approximation: use $\mathcal{F}_{\mathrm{lin}} = \{\phi(s, a)^\top \theta : \theta \in \mathbb{R}^d\}$, functions $\mathcal{S} \times \mathcal{A} \to \mathbb{R}$, to approximate $\{Q^\star_h\}_{h \in [H]}$
$\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$: known feature mapping
Can allow known, step-varying features $\{\phi_h\}_{h \in [H]}$
28
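The recursion $Q^\star_{H+1} = 0$, $Q^\star_h = \mathbb{B}Q^\star_{h+1}$ can be run directly by backward induction in a small tabular MDP; the two-state example below uses made-up numbers for illustration:

```python
import numpy as np

def backward_induction(P, r, H):
    """Finite-horizon optimal Q via the Bellman equation Q*_h = r + P V*_{h+1}.

    P: (S, A, S') transition tensor, r: (S, A) rewards, horizon H."""
    S, A = r.shape
    Q = np.zeros((H + 2, S, A))          # Q[H+1] = 0 by convention
    for h in range(H, 0, -1):
        V_next = Q[h + 1].max(axis=1)    # V*_{h+1}(s') = max_a Q*_{h+1}(s', a)
        Q[h] = r + P @ V_next            # Bellman operator B applied to Q*_{h+1}
    return Q

# Toy deterministic 2-state, 2-action MDP (hypothetical numbers):
# action 0 goes to state 0, action 1 goes to state 1; state 1 pays reward 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
r = np.array([[0.0, 0.0], [1.0, 1.0]])
Q = backward_induction(P, r, H=3)
```

Since $Q^\star_{H+1} = 0$, the last step satisfies $Q^\star_H = r$, and the optimal policy is greedy: $\pi^\star_h(s) = \arg\max_a Q^\star_h(s, a)$.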
![Page 70: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/70.jpg)
RL is more challenging than bandit
[Figure: the online loop — the agent executes $\pi_t$ in the environment, collects data $\mathcal{D}_t \leftarrow \mathcal{D}_{t-1} \cup \mathrm{Traj}(\pi_t)$, and updates $\pi_{t+1} \leftarrow \mathrm{Update}(\mathcal{D}_t, \mathcal{F})$; continue for $T$ episodes]
$\mathrm{Regret}(T) = \sum_{t=1}^{T}\big[J(\pi^\star) - J(\pi_t)\big]$
$J(\pi^\star) = \max_a Q^\star_1(s_1, a)$, fixed initial state $s_1$ for simplicity
Can generalize to an adversarial $s_1$
Online RL ≠ a contextual bandit with $TH$ rounds
RL requires deep exploration to achieve $o(T)$ regret.
29
![Page 73: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/73.jpg)
Deep Exploration
[Figure: a chain MDP with initial state $s^\star$, a row of $H$ "good" states, and a row of absorbing "bad" states; the action set is $\mathcal{A} = \{b_1, b_2, \ldots\}$; taking $b_1$ (the action of $\pi^\star$) advances along the good chain, while any action in $\mathcal{A} \setminus \{b_1\}$ drops to the bad states, where every action $\forall a$ stays bad]
A contextual bandit algorithm goes directly to the bad states from the initial state — $\Omega(T)$ regret
Random sampling requires $|\mathcal{A}|^H$ samples (inefficient)
Goal: $\mathrm{Regret}(T) = \mathrm{poly}(H) \cdot \sqrt{T}$ (needs deep exploration)
$O(SAH)$ query complexity if we have a generative model
30
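For the chain above, uniform exploration must guess the single good action at every one of the $H$ steps, so the per-episode success probability is $|\mathcal{A}|^{-H}$; a quick Monte-Carlo check (illustrative, treating action index 0 as $b_1$):

```python
import numpy as np

def p_random_reaches_goal(num_actions, H, episodes=200_000, seed=0):
    """Fraction of uniformly random episodes that stay on the good chain,
    i.e., pick the single good action (index 0) at all H steps."""
    rng = np.random.default_rng(seed)
    picks = rng.integers(num_actions, size=(episodes, H))
    return (picks == 0).all(axis=1).mean()

p = p_random_reaches_goal(num_actions=2, H=10)
# analytic value: |A|^{-H} = 2^{-10}, so on the order of |A|^H episodes
# are needed before random exploration ever sees the reward
```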
![Page 77: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/77.jpg)
What assumption should we impose?
In CB, $\pi^\star$ is greedy with respect to $r^\star$; we assume $r^\star$ is linear
In RL, $\pi^\star$ is greedy with respect to $Q^\star$
Linear realizability: $Q^\star_h \in \mathcal{F}_{\mathrm{lin}}$, $\forall h \in [H]$
Theorem [Wang-Wang-Kakade-21]: There exists a class of linearly realizable MDPs such that any online RL algorithm requires $\min\{\Omega(2^d), \Omega(2^H)\}$ samples to obtain a near-optimal policy.
Fundamentally different from supervised learning and bandits
To achieve efficiency in online RL, we need stronger assumptions
31
![Page 80: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/80.jpg)
Assumption: Image(B) ⊆ Flin
We assume the image set of the Bellman operator lies in $\mathcal{F}_{\mathrm{lin}}$:
For any $Q : \mathcal{S} \times \mathcal{A} \to [0, H]$, there exists $\theta_Q \in \mathbb{R}^d$ s.t.
$(\mathbb{B}Q)(s, a) = r(s, a) + \mathbb{E}\big[\max_{a'} Q(s', a')\big] = \langle \phi(s, a), \theta_Q \rangle$
Normalization condition: $\|\theta_Q\|_2 \le 2H\sqrt{d}$, $\sup_{s,a} \|\phi(s, a)\|_2 \le 1$
Is there such an MDP? Yes, the linear MDP:
$r(s, a) = \langle \phi(s, a), \omega \rangle, \quad P(s' \mid s, a) = \langle \phi(s, a), \mu(s') \rangle$
Normalization: $\|\omega\|_2 \le \sqrt{d}$, $\sum_{s'} \|\mu(s')\|_2 \le \sqrt{d}$
Linear MDP contains the tabular MDP as a special case: $\phi(s, a) = e_{(s,a)}$, $d = |\mathcal{S}| \cdot |\mathcal{A}|$
32
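The claim that a linear MDP puts the Bellman image inside $\mathcal{F}_{\mathrm{lin}}$ can be checked numerically. The construction below (random mixture features and a random $Q$, all illustrative) computes $\mathbb{B}Q$ two ways and confirms $\theta_Q = \omega + \sum_{s'} \mu(s') \max_{a'} Q(s', a')$:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, d = 6, 3, 4

# Hypothetical linear-MDP parameters: r = <phi, w>, P(s'|s,a) = <phi(s,a), mu(s')>.
# Simplex features mixed with row-stochastic mu give a valid transition kernel.
phi = rng.dirichlet(np.ones(d), size=(S, A))   # (S, A, d), each phi(s,a) on simplex
mu = rng.dirichlet(np.ones(S), size=d)         # (d, S'); column mu[:, s'] plays mu(s')
w = rng.uniform(0, 1, size=d)

Q = rng.uniform(0, 1, size=(S, A))             # an arbitrary Q : S x A -> R
V = Q.max(axis=1)                              # V(s') = max_{a'} Q(s', a')

BQ_direct = phi @ w + (phi @ mu) @ V           # r(s,a) + E_{s' ~ P}[V(s')]
theta_Q = w + mu @ V                           # theta_Q = w + sum_{s'} mu(s') V(s')
BQ_linear = phi @ theta_Q
# the two agree exactly: Image(B) lies in the span of the features
```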
![Page 83: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/83.jpg)
Algorithm for linear RL: LSVI-UCB
33
![Page 84: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/84.jpg)
Algorithm: LSVI + optimism
At the beginning of the $t$-th episode, the dataset is $\mathcal{D}_{t-1} = \{(s^i_h, a^i_h, r^i_h), h \in [H]\}_{i < t}$
For $h = H, \ldots, 1$, backwardly solve ridge regression:
$y^i_h = r^i_h + \max_a Q^{+,t}_{h+1}(s^i_{h+1}, a)$
$\theta^t_h = \arg\min_\theta \sum_{i=1}^{t-1} \big[y^i_h - \langle \phi(s^i_h, a^i_h), \theta \rangle\big]^2 + \|\theta\|_2^2$
Hessian of the ridge loss: $\Lambda^t_h = I + \sum_{i=1}^{t-1} \phi(s^i_h, a^i_h)\,\phi(s^i_h, a^i_h)^\top$
Construct the bonus $\Gamma^t_h(s, a) = \beta \cdot \|\phi(s, a)\|_{(\Lambda^t_h)^{-1}}$
UCB Q-function:
$Q^{+,t}_h = \mathrm{Trunc}_{[0,H]}\big\{\langle \phi(s, a), \theta^t_h \rangle + \Gamma^t_h(s, a)\big\}$
Execute $\pi^t_h(\cdot) = \arg\max_a Q^{+,t}_h(\cdot, a)$
34
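The pseudocode above translates almost line-for-line into code. This sketch plans one episode from a given dataset; the data layout, finite action set, and all names are assumptions for illustration, not from the talk:

```python
import numpy as np

def lsvi_ucb_plan(data, phi, actions, H, d, beta):
    """One backward planning pass of LSVI-UCB (a sketch; finite actions assumed).

    data[h]: transitions (s, a, r, s_next) collected at step h in past episodes;
    phi(s, a) -> R^d is the known feature map.  Returns {h: (theta_h, Lambda_h)}."""
    params = {}
    Q_next = lambda s, a: 0.0                        # Q^{+,t}_{H+1} = 0
    for h in range(H, 0, -1):
        Lam, b = np.eye(d), np.zeros(d)              # ridge statistics
        for (s, a, r, s_next) in data[h]:
            y = r + max(Q_next(s_next, ap) for ap in actions)  # regression target
            f = phi(s, a)
            Lam += np.outer(f, f)                    # Hessian of the ridge loss
            b += y * f
        theta = np.linalg.solve(Lam, b)              # ridge estimator theta^t_h

        def Q_next(s, a, theta=theta, Lam=Lam):      # optimistic Q^{+,t}_h
            f = phi(s, a)
            bonus = beta * np.sqrt(f @ np.linalg.solve(Lam, f))
            return float(np.clip(f @ theta + bonus, 0.0, H))   # Trunc_[0,H]

        params[h] = (theta, Lam)
    return params
```

The executed policy is then greedy with respect to the returned optimistic Q-functions, $\pi^t_h(s) = \arg\max_a Q^{+,t}_h(s, a)$.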
![Page 89: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/89.jpg)
A more abstract version of LSVI-UCB
For $h = H, \ldots, 1$, backwardly solve least-squares regression:
$y^i_h = r^i_h + \max_a Q^{+,t}_{h+1}(s^i_{h+1}, a)$
$Q^t_h = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^{t-1} \big[y^i_h - f(s^i_h, a^i_h)\big]^2 + \mathrm{penalty}(f)$
Uncertainty quantification for $Q^t_h$: find $\Gamma^t_h$ such that
$\mathbb{P}\big(\forall (t, h, s, a),\ |Q^t_h(s, a) - (\mathbb{B}Q^{+,t}_{h+1})(s, a)| \le \Gamma^t_h(s, a)\big) \ge 1 - \delta$
Optimism:
$Q^{+,t}_h = \mathrm{Trunc}_{[0,H]}\big\{Q^t_h + \Gamma^t_h(s, a)\big\}$
• Computation & memory cost independent of $|\mathcal{S}|$ (even for RKHS, overparameterized NN)
35
![Page 93: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/93.jpg)
A more abstract version of LSVI-UCB
[Figure: the curves $Q^+_h$, $Q^t_h$, and $Q^+_h - 2\Gamma^t_h$, with $\mathbb{B}Q^+_{h+1}$ sandwiched inside the band]
$|Q^t_h(s, a) - (\mathbb{B}Q^+_{h+1})(s, a)| \le \Gamma^t_h(s, a), \quad \forall t, h, s, a$
$Q^+_h(s, a) = \underbrace{Q^t_h(s, a)}_{\text{LSVI}} + \underbrace{\Gamma^t_h(s, a)}_{\text{bonus}}$
Optimism gives: $\mathbb{B}Q^+_{h+1} \in [Q^+_h - 2\Gamma^t_h,\ Q^+_h]$
Monotonicity of the Bellman operator: $Q^+_h \ge Q^\star_h$ (UCB)
36
![Page 94: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/94.jpg)
Sample efficiency of LSVI-UCB
37
![Page 95: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/95.jpg)
LSVI-UCB achieves $\sqrt{T}$-regret
Theorem [Jin-Yang-Wang-Jordan-20]: Choosing $\beta = O(dH)$, with probability at least $1 - 1/T$,
$\mathrm{Regret}(T) = O(\beta \cdot H \cdot \sqrt{dT}) = O(H^2 \cdot \sqrt{d^3 T})$
Directly implies an $|\mathcal{S}|^{1.5} |\mathcal{A}|^{1.5} H^2 \sqrt{T}$ regret for tabular RL
First algorithm with both sample and computational efficiency in the context of RL with function approximation
Only assumption: the image set of $\mathbb{B}$ is in $\mathcal{F}_{\mathrm{lin}}$
The optimal regret $d H^2 \sqrt{T}$ is achieved by [Zanette et al. 2020] under a relaxed model assumption, but the computation is intractable.
38
![Page 100: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/100.jpg)
A general version of regret
Let $\mathcal{Q}_{\mathrm{ucb}}$ be the function class containing the Q-functions constructed by LSVI-UCB:
$\mathcal{Q}_{\mathrm{ucb}} = \big\{\mathrm{Trunc}_{[0,H]}\{\phi(\cdot, \cdot)^\top \theta + \beta \cdot \|\phi(\cdot, \cdot)\|_{\Lambda^{-1}}\}\big\}$ for the linear case
Theorem [Jin-Yang-Wang-Wang-Jordan-20]: Choosing $\beta = O\big(H \cdot \sqrt{\log N_\infty(\mathcal{Q}_{\mathrm{ucb}}, T^{-2})}\big)$, with probability at least $1 - 1/T$,
$\mathrm{Regret}(T) = O(\beta H \cdot \sqrt{d_{\mathrm{eff}} \cdot T}) = O(H^2 \cdot \sqrt{d_{\mathrm{eff}} \cdot T \cdot \log N_\infty})$
$\log N_\infty(\mathcal{Q}_{\mathrm{ucb}}, T^{-2}) \lesssim d \log T$ for the linear case
Includes kernels and overparameterized neural networks
$d_{\mathrm{eff}}$ is the effective dimension of the RKHS or NTK
For an abstract function class $\mathcal{F}$, $d_{\mathrm{eff}}$ can be set as the Bellman-eluder dimension [Jin-Liu-Miryoosefi-21]
39
![Page 104: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/104.jpg)
Regret analysis
40
![Page 105: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/105.jpg)
Regret analysis: sensitivity analysis + elliptical potential
A general sensitivity analysis:
$J(\pi^\star) - J(\pi_t) = \underbrace{\sum_{h \in [H]} \mathbb{E}_{\pi^\star}\big[Q^{+,t}_h\big(s_h, \pi^\star_h(s_h)\big) - Q^{+,t}_h\big(s_h, \pi^t_h(s_h)\big)\big]}_{\text{(i) policy optimization error}} + \underbrace{\sum_{h \in [H]} \mathbb{E}_{\pi_t}\big[(Q^{+,t}_h - \mathbb{B}Q^{+,t}_{h+1})(s_h, a_h)\big]}_{\text{(ii) Bellman error on } \pi_t} + \underbrace{\sum_{h \in [H]} \mathbb{E}_{\pi^\star}\big[-(Q^{+,t}_h - \mathbb{B}Q^{+,t}_{h+1})(s_h, a_h)\big]}_{\text{(iii) Bellman error on } \pi^\star}$
Term (i) $\le 0$ as $\pi^t$ is greedy with respect to $Q^{+,t}_h$
Optimism: $Q^{+,t}_h - 2\Gamma^t_h \le \mathbb{B}Q^{+,t}_{h+1} \le Q^{+,t}_h$, so Term (iii) $\le 0$
41
![Page 106: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/106.jpg)
Regret analysis: sensitivity analysis + elliptical potential
Therefore, we have
$\mathrm{Regret}(T) = \sum_{t=1}^{T}\big[J(\pi^\star) - J(\pi_t)\big] \le 2\sum_{t \in [T]}\sum_{h \in [H]} \mathbb{E}_{\pi_t}\big[\Gamma^t_h(s_h, a_h)\big] = 2\sum_{t \in [T]}\sum_{h \in [H]} \Gamma^t_h(s^t_h, a^t_h) + \text{martingale difference} = 2\beta \sum_{h \in [H]} \sum_{t \in [T]} \|\phi(s^t_h, a^t_h)\|_{(\Lambda^t_h)^{-1}} + H \cdot \sqrt{T} = O(\beta H \sqrt{dT})$
The second equality holds because $\{(s^t_h, a^t_h), h \in [H]\} \sim \pi_t$
It remains to conduct uncertainty quantification:
$\mathbb{P}\big(\forall (t, h, s, a),\ |Q^t_h(s, a) - (\mathbb{B}Q^{+,t}_{h+1})(s, a)| \le \Gamma^t_h(s, a)\big) \ge 1 - \delta$
via uniform concentration over $\mathcal{Q}_{\mathrm{ucb}}$ (new for RL) + self-normalized concentration (same as in CB)
42
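The elliptical-potential step in the middle line can also be seen numerically: for any feature sequence with $\|\phi\| \le 1$ and $\Lambda_1 = I$, the sum of $(\Lambda^t)^{-1}$-norms is bounded by $\sqrt{2dT\log(1 + T/d)}$ via Cauchy-Schwarz plus the potential lemma (a sketch with a random sequence; the constant is the standard one, stated without proof):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 10, 5000

Lam = np.eye(d)
total = 0.0
for t in range(T):
    f = rng.normal(size=d)
    f /= max(1.0, np.linalg.norm(f))          # enforce ||phi|| <= 1
    total += np.sqrt(f @ np.linalg.solve(Lam, f))  # ||phi_t||_{(Lambda_t)^{-1}}
    Lam += np.outer(f, f)

# deterministic bound: sum_t ||phi_t||_{Lambda_t^{-1}} <= sqrt(2 T d log(1 + T/d))
bound = np.sqrt(2 * T * d * np.log(1 + T / d))
```

The sum grows like $\sqrt{dT}$ rather than linearly in $T$, which is exactly what turns the per-episode bonuses into a $\sqrt{T}$ regret.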
![Page 108: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/108.jpg)
Summary & Extensions
Exploration in RL is more challenging in that we need to handle deep exploration
LSVI-UCB explores by doing uncertainty quantification for the LSVI estimator
LSVI propagates the uncertainty at later steps $h$ backward to $h = 1$
By construction, $Q^+_1$ contains the uncertainty about all $h \ge 1$
LSVI-UCB achieves sample efficiency and computational tractability in online RL with function approximation
Similar principles can be extended to:
proximal policy optimization (use soft-greedy instead of greedy)
zero-sum Markov games (two-player extension)
constrained MDPs (primal-dual optimization)
43
![Page 112: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/112.jpg)
References (a very incomplete list)
linear bandit
• [Dani et al., 2008] Stochastic Linear Optimization under Bandit Feedback
• [Abbasi-Yadkori et al., 2011] Improved Algorithms for Linear Stochastic Bandits
optimism in tabular RL
• [Auer & Ortner, 2007] Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
• [Jaksch et al., 2010] Near-optimal Regret Bounds for Reinforcement Learning
• [Azar et al., 2017] Minimax Regret Bounds for Reinforcement Learning
44
![Page 113: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/113.jpg)
References (a very incomplete list)
RL with function approximation
• [Dann et al., 2018] On Oracle-Efficient PAC RL with Rich Observations
• [Wang & Yang, 2019] Sample-Optimal Parametric Q-Learning Using Linearly Additive Features
• [Jin et al., 2020] Provably Efficient Reinforcement Learning with Linear Function Approximation (this talk)
• [Zanette et al., 2020] Learning Near Optimal Policies with Low Inherent Bellman Error
• [Du et al., 2020] Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning?
• [Xie et al., 2020] Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium
• [Yang et al., 2020] On Function Approximation in Reinforcement Learning: Optimism in the Face of Large State Spaces (this talk)
• [Jin et al., 2021] Bellman Eluder Dimension: New Rich Classes of RL Problems, and Sample-Efficient Algorithms
45