Sample Efficiency in Online RL
TRANSCRIPT
Reinforcement Learning China Summer School
RLChina 2021
Sample Efficiency in Online RL
Zhuoran Yang
Princeton → Yale (2022)
August 16, 2021
Motivation: sample complexity challenge in deep RL
Success and challenge of deep RL
DRL = Representation (DL) + Decision Making (RL)
[Figure panels: board games, computer games, robotic control, policy making]
image source: DeepMind, OpenAI, Salesforce Research
Success and challenge of deep RL
AlphaGo: 3 × 10^7 games of self-play (data), 40 days of training (computation)
AlphaStar: 2 × 10^2 years of self-play, 44 days of training
Rubik's Cube: 10^4 years of simulation, 10 days of training
[Diagram: Insufficient Data → Lack of Convergence → Unreliable AI; hence Massive Data Needed → Massive Computation]
Our goal: provably efficient RL algorithms
Sample efficiency: how many data points are needed?
Computational efficiency: how much computation is needed?
Function approximation: can we allow an infinite number of observations?
Background: Episodic Markov decision process
Episodic Markov decision process
[Diagram: agent–environment loop — state s_h ∈ S, action a_h = π_h(s_h) ∈ A, reward r_h = r_h(s_h, a_h) ∈ [0, 1], next state s_{h+1} ∼ P_h(· | s_h, a_h) ∈ S]
For h ∈ [H], in the h-th step, observe state s_h, take action a_h
Receive immediate reward r_h, with E[r_h | s_h = s, a_h = a] = r_h(s, a)
Environment evolves to a new state s_{h+1} ∼ P_h(· | s_h, a_h)
Policy π_h : S → A: how the agent takes actions at the h-th step
Goal: find the policy π⋆ that maximizes J(π) := E_π[∑_{h=1}^H r_h]
Episodic Markov decision process
[Diagram: trajectory s_1 → a_1 = π_1(s_1) → r(s_1, a_1) → s_2 ∼ P(· | s_1, a_1) → a_2 = π_2(s_2) → ⋯ → s_H → a_H = π_H(s_H) → r(s_H, a_H)]
J(π) = E[r(s_1, a_1) + r(s_2, a_2) + r(s_3, a_3) + ⋯ + r(s_H, a_H)]
For simplicity, fix s_1 = s⋆, r_h = r, and P_h = P for all h.
Contextual bandit is a special case of MDP
Contextual bandit (CB): observe context s ∈ S, take action a ∈ A, observe (random) reward r with mean r⋆(s, a)
Optimal policy: π⋆(s) = argmax_{a∈A} r⋆(s, a)
CB is an MDP with H = 1
Multi-armed bandit (MAB): H = 1, |A| finite, and |S| = 1
MDP is significantly more challenging than CB due to temporal dependency (i.e., state transitions)
From MDP to RL
Informal definition of RL: solve the MDP when the environment (r, P) is unknown
(i) generative model: able to query the reward r_h and next state s_{h+1} for any (s_h, a_h) ∈ S × A
(ii) offline setting: given a dataset D = {(s_h^t, a_h^t, r_h^t)}_{h∈[H], t∈[T]} (T trajectories), learn π⋆ without any interaction
(iii) online setting: without any prior knowledge, learn π⋆ by interacting with the MDP for T episodes
(i) learning from model → query complexity
(ii) learning from data → sample complexity
(iii) learning by doing → sample complexity + exploration
Online RL: Pipeline
[Diagram: the online agent executes π^t in the environment, collects data D^t ← D^{t−1} ∪ Traj(π^t), and updates π^{t+1} ← Algorithm(D^t, F); continue for T episodes]
Initialize π^1 = {π_h^1}_{h∈[H]} arbitrarily
For each t ∈ [T], execute π^t, get a trajectory Traj(π^t)
Store Traj(π^t) into the dataset D^t = {Traj(π^i), i ≤ t}
Update the policy via an RL algorithm (which we need to design)
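The loop above can be sketched in a few lines. Everything concrete here is a made-up stand-in: the two-state MDP, the `trajectory` sampler, and the trivial `update_policy` are hypothetical placeholders for the environment and for the RL algorithm we still need to design.

```python
import random

H, T = 5, 10                     # horizon and number of episodes
S, A = [0, 1], [0, 1]            # toy state and action spaces

def trajectory(policy, seed):
    """Execute a policy for one episode on a toy 2-state MDP; return Traj(pi)."""
    rng = random.Random(seed)
    s, traj = 0, []
    for h in range(H):
        a = policy[h][s]
        r = 1.0 if (s, a) == (1, 1) else 0.0   # toy reward
        s_next = rng.choice(S)                 # toy transition
        traj.append((s, a, r))
        s = s_next
    return traj

def update_policy(dataset):
    """Placeholder for the RL algorithm we need to design."""
    return [{s: 1 for s in S} for _ in range(H)]

policy = [{s: 0 for s in S} for _ in range(H)]  # arbitrary pi^1
dataset = []
for t in range(T):
    dataset.append(trajectory(policy, seed=t))  # D^t = D^{t-1} ∪ Traj(pi^t)
    policy = update_policy(dataset)

print(len(dataset), len(dataset[0]))  # T trajectories, each of length H
```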
Sample efficiency in online RL: regret
Measure sample efficiency via regret:
Regret(T) = ∑_{t=1}^T [J(π⋆) − J(π^t)], where the t-th summand is the suboptimality in the t-th episode
Why is regret meaningful?
Define a random policy π̄^T uniformly sampled from {π^t, t ∈ [T]}. Then J(π⋆) − E[J(π̄^T)] = Regret(T)/T
If Regret(T) = o(T), then when T is sufficiently large, π̄^T is arbitrarily close to optimal
Goal: Regret(T) = O(√T · poly(H) · poly(dim))
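As a quick numeric illustration of the Regret(T)/T conversion, suppose (hypothetically) that the per-episode suboptimality decays like 1/√t, so that Regret(T) grows like √T = o(T); then the average suboptimality vanishes:

```python
import math

T = 10_000
# Hypothetical per-episode suboptimality J(pi*) - J(pi^t), decaying like 1/sqrt(t)
subopt = [1 / math.sqrt(t) for t in range(1, T + 1)]
regret = sum(subopt)          # Regret(T) = sum of per-episode suboptimalities

# The uniformly sampled policy pi_bar^T has expected suboptimality Regret(T)/T
avg_subopt = regret / T
print(regret, avg_subopt)     # regret ~ 2*sqrt(T) = o(T), average -> 0
```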
Challenge of Online RL: exploration & uncertainty
[Figure: scatter plots of data points, illustrating function estimation from adaptively collected data]
Deep Exploration (exploration vs. exploitation): how to construct exploration incentives tailored to function approximators?
Function Estimation: how to assess estimation uncertainty based on adaptively acquired data?
Warmup example: LinUCB for linear CB
Linear contextual bandit
Setting of linear CB:
Observation structure: observe a (perhaps adversarial) context s^t ∈ S, take action a^t ∈ A, receive bounded reward r^t ∈ [0, 1]
E[r^t | s^t = s, a^t = a] = r⋆(s, a) for all (s, a) ∈ S × A
Linear reward function: r⋆(s, a) = φ(s, a)^⊤ θ⋆
φ : S × A → R^d: known feature mapping
Normalization condition: ‖θ⋆‖_2 ≤ √d, sup_{s,a} ‖φ(s, a)‖_2 ≤ 1
Why? This includes MAB as a special case: |S| = 1, d = |A|, φ(s, a) = e_a, θ⋆ = (μ_1, μ_2, ⋯, μ_{|A|}) with μ_a ∈ [0, 1] for all a ∈ A
Regret(T) = ∑_{t=1}^T ( max_{a∈A} ⟨φ(s^t, a) − φ(s^t, a^t), θ⋆⟩ )
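The MAB embedding in the last bullet can be checked directly: with one-hot features φ(s, a) = e_a and θ⋆ the vector of arm means, the linear model reproduces every arm's mean reward. The arm means below are made up for illustration.

```python
import numpy as np

mu = np.array([0.2, 0.5, 0.9])   # hypothetical arm means, mu_a in [0, 1]
d = len(mu)                      # d = |A|, and |S| = 1
theta_star = mu                  # theta* = (mu_1, ..., mu_|A|)

def phi(s, a):
    return np.eye(d)[a]          # phi(s, a) = e_a (one-hot)

# The linear reward recovers each arm's mean: r*(s, a) = <phi(s, a), theta*> = mu_a
for a in range(d):
    assert np.isclose(phi(0, a) @ theta_star, mu[a])

# Normalization condition holds: ||theta*||_2 <= sqrt(d), ||phi(s, a)||_2 <= 1
assert np.linalg.norm(theta_star) <= np.sqrt(d)
print("ok")
```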
Exploration via optimism in the face of uncertainty (OFU)
For each t ∈ [T], in the t-th round, determine a^t in the following two steps:
Uncertainty quantification: based on the current dataset D^{t−1}, construct a high-probability confidence set Θ^t that contains θ⋆:
P(∀t ∈ [T], θ⋆ ∈ Θ^t) ≥ 1 − δ    (1)
Optimistic planning: choose a^t to maximize the most optimistic model within Θ^t:
a^t = argmax_{a∈A} max_{θ∈Θ^t} ⟨φ(s^t, a), θ⟩    (2)
Intuition of the OFU principle
Define two functions:
r^{+,t}(s, a) = max_{θ∈Θ^t} ⟨φ(s, a), θ⟩
r^{−,t}(s, a) = min_{θ∈Θ^t} ⟨φ(s, a), θ⟩
By (1), θ⋆ ∈ Θ^t, so r^{+,t}(s, a) ≥ r⋆(s, a) ≥ r^{−,t}(s, a) for all (s, a) ∈ S × A
|r^{+,t}(s, a) − r^{−,t}(s, a)| reflects the uncertainty about action a at context s
By (2), a^t = argmax_a r^{+,t}(s^t, a): choose the action with either large uncertainty (explore) or large reward (exploit)
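As a sanity check of this intuition, when Θ^t is an ellipsoid {θ : ‖θ − θ̂‖_Λ ≤ β} the optimistic and pessimistic rewards admit the closed form ⟨φ, θ̂⟩ ± β‖φ‖_{Λ^{−1}} (this closed form reappears for LinUCB later in the slides). The sketch below compares it against random points on the ellipsoid boundary; all concrete numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
d, beta = 3, 0.5
theta_hat = rng.normal(size=d)
A = rng.normal(size=(d, d))
Lam = A @ A.T + np.eye(d)          # a positive-definite Lambda
phi = rng.normal(size=d)

# Closed form: r_{+/-} = <phi, theta_hat> +/- beta * ||phi||_{Lam^{-1}}
bonus = beta * np.sqrt(phi @ np.linalg.solve(Lam, phi))
r_plus = phi @ theta_hat + bonus
r_minus = phi @ theta_hat - bonus

# Sample points on the ellipsoid boundary ||theta - theta_hat||_Lam = beta:
# theta = theta_hat + beta * L^{-T} u with ||u||_2 = 1 and Lam = L L^T
L = np.linalg.cholesky(Lam)
u = rng.normal(size=(1000, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)
thetas = theta_hat + beta * np.linalg.solve(L.T, u.T).T
vals = thetas @ phi

# No sampled theta beats the closed-form optimistic/pessimistic values
assert vals.max() <= r_plus + 1e-9
assert vals.min() >= r_minus - 1e-9
print("closed-form width:", 2 * bonus)
```

The gap r_plus − r_minus = 2β‖φ‖_{Λ^{−1}} is exactly the "uncertainty width" the slide refers to.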
How to construct Θ^t for linear CB?
In the t-th round, our dataset is D^{t−1} = {(s^i, a^i, r^i)}_{i≤t−1}
First construct an estimator of θ⋆ via ridge regression:
θ^t = argmin_{θ∈R^d} L^t(θ) := ∑_{i=1}^{t−1} [r^i − ⟨φ(s^i, a^i), θ⟩]^2 + ‖θ‖_2^2
Hessian of the ridge loss (up to a factor of 2):
Λ^t = I + ∑_{i=1}^{t−1} φ(s^i, a^i) φ(s^i, a^i)^⊤
Closed-form solution: θ^t = (Λ^t)^{−1} ∑_{i=1}^{t−1} r^i · φ(s^i, a^i)
Define Θ^t as the confidence ellipsoid:
Θ^t = {θ ∈ R^d : ‖θ − θ^t‖_{Λ^t} ≤ β},
where β = C · √(d · log(1 + T/δ)) and ‖x‖_M = √(x^⊤ M x)
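A minimal check on synthetic data that the closed form θ^t = (Λ^t)^{−1} ∑ r^i · φ(s^i, a^i) really minimizes the ridge loss: its gradient ∇L^t(θ^t) = 2(Λ^t θ^t − ∑ r^i φ^i) should vanish. The data here are arbitrary random draws.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 50
Phi = rng.normal(size=(n, d))        # row i is phi(s^i, a^i)
r = rng.normal(size=n)               # rewards r^i

Lam = np.eye(d) + Phi.T @ Phi        # Lambda^t = I + sum_i phi_i phi_i^T
theta = np.linalg.solve(Lam, Phi.T @ r)   # closed-form ridge solution

# Gradient of L(theta) = sum_i (r^i - <phi_i, theta>)^2 + ||theta||^2
grad = -2 * Phi.T @ (r - Phi @ theta) + 2 * theta
assert np.allclose(grad, 0, atol=1e-8)    # first-order optimality holds
print("gradient norm:", np.linalg.norm(grad))
```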
LinUCB algorithm
Confidence ellipsoid: Θ^t = {θ ∈ R^d : ‖θ − θ^t‖_{Λ^t} ≤ β}
r^{+,t} admits a closed form:
r^{+,t}(s, a) = max_{θ∈Θ^t} ⟨φ(s, a), θ⟩ = ⟨φ(s, a), θ^t⟩ + β · ‖φ(s, a)‖_{(Λ^t)^{−1}}, where the second term is the bonus
LinUCB algorithm (Dani et al. (2008), Abbasi-Yadkori et al. (2011), ...):
For t = 1, ..., T, at the beginning of the t-th round:
Observe context s^t
Compute θ^t via ridge regression and compute the Hessian Λ^t
Choose a^t = argmax_{a∈A} r^{+,t}(s^t, a) and observe r^t
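The algorithm above can be sketched end to end. The environment below (random contexts, θ⋆, Gaussian noise) and the practical choice β = 1 are synthetic assumptions for illustration, not part of the slides; theory calls for β = C·√(d·log(1 + T/δ)).

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 5, 10, 2000                  # dimension, actions per round, rounds
theta_star = rng.normal(size=d)
theta_star *= np.sqrt(d) / np.linalg.norm(theta_star)   # ||theta*||_2 = sqrt(d)
beta = 1.0                             # practical constant; theory: C*sqrt(d*log(1+T/delta))

Lam = np.eye(d)                        # Lambda^1 = I
b = np.zeros(d)                        # running sum of r^i * phi^i
regret = 0.0
for t in range(T):
    feats = rng.normal(size=(K, d))    # this round's K candidate feature vectors
    feats /= np.maximum(np.linalg.norm(feats, axis=1, keepdims=True), 1.0)  # ||phi||_2 <= 1
    theta_hat = np.linalg.solve(Lam, b)                  # ridge estimator theta^t
    Lam_inv = np.linalg.inv(Lam)
    bonus = beta * np.sqrt(np.einsum("kd,de,ke->k", feats, Lam_inv, feats))
    a = int(np.argmax(feats @ theta_hat + bonus))        # optimistic action a^t
    means = feats @ theta_star
    r_t = means[a] + 0.1 * rng.normal()                  # noisy reward r^t
    regret += means.max() - means[a]                     # instant regret
    Lam += np.outer(feats[a], feats[a])                  # update Lambda and b
    b += r_t * feats[a]

print(f"cumulative regret over {T} rounds: {regret:.1f}")
```

Keeping the running sums Λ^t and b makes each round O(d^2 + Kd^2) after a single d×d solve, which is the poly(d, |A|, T) computational cost the next slide mentions.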
Regret Bound of LinUCB
Theorem. With probability 1 − δ, LinUCB satisfies
Regret(T) ≤ β · √(dT) · √(log T) = Õ(d√T)
Independent of the sizes of the context and action spaces; depends only on the dimension (linear in d)
Depends on T through √T (sample efficient)
Computational cost is poly(d, |A|, T) and the memory requirement is poly(d, T) (Why?)
Regret Analysis of LinUCB (1/3)
Step 1. Condition on the event E = {∀t ∈ [T], θ⋆ ∈ Θ^t}, so that
r^{−,t}(s, a) ≤ r⋆(s, a) ≤ r^{+,t}(s, a), ∀(s, a) ∈ S × A
Step 2. Regret decomposition:
max_a r⋆(s^t, a) − r⋆(s^t, a^t)
= (max_a r⋆(s^t, a) − r^{+,t}(s^t, a^t)) + (r^{+,t}(s^t, a^t) − r⋆(s^t, a^t))
≤ r^{+,t}(s^t, a^t) − r^{−,t}(s^t, a^t) ≤ 2β · ‖φ(s^t, a^t)‖_{(Λ^t)^{−1}},
where the first term is nonpositive because a^t maximizes r^{+,t}(s^t, ·) and r^{+,t} ≥ r⋆ on E. Since the instant regret is also at most 1 (bounded rewards), we have shown
max_a r⋆(s^t, a) − r⋆(s^t, a^t) ≤ 2β · min{1, ‖φ(s^t, a^t)‖_{(Λ^t)^{−1}}}
Regret Analysis of LinUCB (2/3)
Step 3. Telescope the bonus: Cauchy–Schwarz + elliptical potential lemma:
Regret(T) ≤ 2β · ∑_{t=1}^T min{1, ‖φ(s^t, a^t)‖_{(Λ^t)^{−1}}}
≤ 2β√T · [∑_{t=1}^T min{1, ‖φ(s^t, a^t)‖²_{(Λ^t)^{−1}}}]^{1/2}
≤ 2β√T · √(2 log det(Λ^T)) ≲ β · √T · √(d · log T)
Elliptical potential lemma
Lemma. ∑_{t=1}^T min{1, ‖φ(s^t, a^t)‖²_{(Λ^t)^{−1}}} ≤ 2 log det(Λ^{T+1}) ≤ d · log T (up to constants)
Bayesian perspective:
Prior distribution: θ⋆ ∼ N(0, I)
Likelihood: r^i = r⋆(s^i, a^i) + ε, ε ∼ N(0, 1)
Given the dataset D^{t−1}, the posterior distribution is N(θ^t, (Λ^t)^{−1})
‖φ(s, a)‖²_{(Λ^t)^{−1}} is (up to constants) the conditional mutual information gain I(θ⋆; (s, a) | D^{t−1}) = H(θ⋆ | D^{t−1}) − H(θ⋆ | D^{t−1} ∪ (s, a))
Elliptical potential lemma ↔ chain rule of mutual information
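A numeric check of the lemma's first inequality on random features (the feature sequence is arbitrary, invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 5, 500
Lam = np.eye(d)                      # Lambda^1 = I, so det(Lambda^1) = 1
lhs = 0.0
for t in range(T):
    phi = rng.normal(size=d)
    phi /= max(np.linalg.norm(phi), 1.0)               # enforce ||phi||_2 <= 1
    lhs += min(1.0, phi @ np.linalg.solve(Lam, phi))   # min{1, ||phi||^2_{(Lam^t)^{-1}}}
    Lam += np.outer(phi, phi)                          # Lambda^{t+1}

rhs = 2 * np.log(np.linalg.det(Lam))                   # 2 log det(Lambda^{T+1})
assert lhs <= rhs
print(round(lhs, 2), "<=", round(rhs, 2))
```

The inequality holds deterministically for any feature sequence with ‖φ‖_2 ≤ 1, which is what makes the lemma usable under adaptively collected data.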
Regret Analysis of LinUCB (3/3)
Step 4. Show optimism: θ⋆ ∈ Θ^t = {θ ∈ R^d : ‖θ − θ^t‖_{Λ^t} ≤ β}
θ^t − θ⋆ = (Λ^t)^{−1} [∑_{i=1}^{t−1} (r⋆(s^i, a^i) + ε^i) · φ(s^i, a^i) − Λ^t · θ⋆]   (writing r^i = r⋆(s^i, a^i) + ε^i)
= −(Λ^t)^{−1} θ⋆ + (Λ^t)^{−1} ∑_{i=1}^{t−1} ε^i · φ(s^i, a^i), where the ε^i form a martingale difference sequence
‖θ^t − θ⋆‖_{Λ^t} ≈ ‖∑_{i=1}^{t−1} ε^i · φ(s^i, a^i)‖_{(Λ^t)^{−1}} ≤ β
by concentration of self-normalized processes (e.g., Abbasi-Yadkori et al., 2011)
Summary of LinUCB
Algorithm is based on the optimism principle:
upper confidence bound = ridge estimator + bonus
Õ(d√T) regret via
a. utilizing optimism to bound the instant regret by the width of the confidence region (the bonus)
b. summing up the instant regret terms via the elliptical potential lemma
c. constructing the confidence region via uncertainty quantification for ridge regression
Extensions: generalized linear models, RKHS, overparameterized networks, abstract function classes with bounded eluder dimension, ...
25
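The loop summarized above fits in a few lines of code; this is a minimal illustration on a synthetic finite-armed linear bandit (the arm features, $\beta$, and environment are hypothetical choices, not from the talk):

```python
import numpy as np

def linucb(K=20, d=5, T=3000, beta=1.0, seed=0):
    """Minimal LinUCB sketch on a synthetic K-armed linear bandit."""
    rng = np.random.default_rng(seed)
    Phi = rng.normal(size=(K, d)) / np.sqrt(d)      # fixed arm features
    theta_star = rng.normal(size=d) / np.sqrt(d)    # unknown reward parameter
    Lam, b = np.eye(d), np.zeros(d)                 # ridge statistics
    best = (Phi @ theta_star).max()
    regret = 0.0
    for t in range(T):
        theta_hat = np.linalg.solve(Lam, b)         # ridge estimator
        Lam_inv = np.linalg.inv(Lam)
        # bonus_k = beta * ||phi_k||_{Lambda^{-1}}
        bonus = beta * np.sqrt(np.einsum('kd,de,ke->k', Phi, Lam_inv, Phi))
        a = int(np.argmax(Phi @ theta_hat + bonus)) # optimistic arm choice
        r = Phi[a] @ theta_star + rng.normal()      # noisy reward
        Lam += np.outer(Phi[a], Phi[a])
        b += r * Phi[a]
        regret += best - Phi[a] @ theta_star
    return regret
```

Running `linucb()` gives cumulative regret far below the linear-in-$T$ regret of a non-adaptive policy, consistent with the $O(d\sqrt{T})$ bound.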
![Page 62: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/62.jpg)
From Linear CB to Linear RL
26
![Page 63: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/63.jpg)
Value functions and Bellman equation
Our goal is to find $\pi^\star = \arg\max_\pi J(\pi) = \mathbb{E}\big[\sum_{h=1}^{H} r_h\big]$
$\pi^\star$ is characterized by $Q^\star = \{Q^\star_h\}_{h \in [H+1]} : \mathcal{S} \times \mathcal{A} \to [0, H]$
$\pi^\star_h(s) = \arg\max_a Q^\star_h(s, a)$, $\forall s \in \mathcal{S}$; $\quad V^\star_h(s) = \max_a Q^\star_h(s, a)$
$Q^\star_h(s, a)$: optimal return starting from $s_h = s$ and $a_h = a$
$Q^\star$ is characterized by the Bellman equation
[Figure: $Q^\star_h(s,a)$ decomposed along a trajectory $s_h = s$, $a_h = a$, $s_{h+1}, a_{h+1}, \ldots, s_H, a_H$ that follows $\pi^\star_{h+1}, \ldots, \pi^\star_H$ after step $h$, accumulating $r(s,a) + r_{h+1} + \cdots + r_H$; the recursion is given by the Bellman equation]
27
![Page 67: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/67.jpg)
Value functions and Bellman equation
MDP terminates after $H$ steps — $Q^\star_{H+1} = 0$, $Q^\star_H = r$
Bellman equation:
$Q^\star_h(s, a) = \overbrace{r(s, a) + \mathbb{E}_{s_{h+1} \sim P}\big[V^\star_{h+1}(s_{h+1}) \mid s_h = s, a_h = a\big]}^{\text{Bellman operator: } \mathbb{B}Q^\star_{h+1}}, \quad V^\star_h(s) = \max_a Q^\star_h(s, a)$
RL with linear function approximation: use $\mathcal{F}_{\mathrm{lin}} = \{\phi(s, a)^\top \theta : \theta \in \mathbb{R}^d\}$, functions $\mathcal{S} \times \mathcal{A} \to \mathbb{R}$, to approximate $\{Q^\star_h\}_{h \in [H]}$
$\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$: known feature mapping
Can allow known, step-varying features $\{\phi_h\}_{h \in [H]}$
28
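The recursion $Q^\star_{H+1} = 0$, $Q^\star_h = \mathbb{B}Q^\star_{h+1}$ can be run directly by backward induction in a small tabular MDP; the two-state example below uses made-up numbers for illustration:

```python
import numpy as np

def backward_induction(P, r, H):
    """Finite-horizon optimal Q via the Bellman equation Q*_h = r + P V*_{h+1}.

    P: (S, A, S') transition tensor, r: (S, A) rewards, horizon H."""
    S, A = r.shape
    Q = np.zeros((H + 2, S, A))          # Q[H+1] = 0 by convention
    for h in range(H, 0, -1):
        V_next = Q[h + 1].max(axis=1)    # V*_{h+1}(s') = max_a Q*_{h+1}(s', a)
        Q[h] = r + P @ V_next            # Bellman operator B applied to Q*_{h+1}
    return Q

# Toy deterministic 2-state, 2-action MDP (hypothetical numbers):
# action 0 goes to state 0, action 1 goes to state 1; state 1 pays reward 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
r = np.array([[0.0, 0.0], [1.0, 1.0]])
Q = backward_induction(P, r, H=3)
```

Since $Q^\star_{H+1} = 0$, the last step satisfies $Q^\star_H = r$, and the optimal policy is greedy: $\pi^\star_h(s) = \arg\max_a Q^\star_h(s, a)$.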
![Page 70: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/70.jpg)
RL is more challenging than bandit
[Figure: the online loop — the agent executes $\pi_t$ in the environment, collects data $\mathcal{D}_t \leftarrow \mathcal{D}_{t-1} \cup \mathrm{Traj}(\pi_t)$, and updates $\pi_{t+1} \leftarrow \mathrm{Update}(\mathcal{D}_t, \mathcal{F})$; continue for $T$ episodes]
$\mathrm{Regret}(T) = \sum_{t=1}^{T}\big[J(\pi^\star) - J(\pi_t)\big]$
$J(\pi^\star) = \max_a Q^\star_1(s_1, a)$, fixed initial state $s_1$ for simplicity
Can generalize to an adversarial $s_1$
Online RL ≠ a contextual bandit with $TH$ rounds
RL requires deep exploration to achieve $o(T)$ regret.
29
![Page 73: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/73.jpg)
Deep Exploration
[Figure: a chain MDP with initial state $s^\star$, a row of $H$ "good" states, and a row of absorbing "bad" states; the action set is $\mathcal{A} = \{b_1, b_2, \ldots\}$; taking $b_1$ (the action of $\pi^\star$) advances along the good chain, while any action in $\mathcal{A} \setminus \{b_1\}$ drops to the bad states, where every action $\forall a$ stays bad]
A contextual bandit algorithm goes directly to the bad states from the initial state — $\Omega(T)$ regret
Random sampling requires $|\mathcal{A}|^H$ samples (inefficient)
Goal: $\mathrm{Regret}(T) = \mathrm{poly}(H) \cdot \sqrt{T}$ (needs deep exploration)
$O(SAH)$ query complexity if we have a generative model
30
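For the chain above, uniform exploration must guess the single good action at every one of the $H$ steps, so the per-episode success probability is $|\mathcal{A}|^{-H}$; a quick Monte-Carlo check (illustrative, treating action index 0 as $b_1$):

```python
import numpy as np

def p_random_reaches_goal(num_actions, H, episodes=200_000, seed=0):
    """Fraction of uniformly random episodes that stay on the good chain,
    i.e., pick the single good action (index 0) at all H steps."""
    rng = np.random.default_rng(seed)
    picks = rng.integers(num_actions, size=(episodes, H))
    return (picks == 0).all(axis=1).mean()

p = p_random_reaches_goal(num_actions=2, H=10)
# analytic value: |A|^{-H} = 2^{-10}, so on the order of |A|^H episodes
# are needed before random exploration ever sees the reward
```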
![Page 77: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/77.jpg)
What assumption should we impose?
In CB, $\pi^\star$ is greedy with respect to $r^\star$; we assume $r^\star$ is linear
In RL, $\pi^\star$ is greedy with respect to $Q^\star$
Linear realizability: $Q^\star_h \in \mathcal{F}_{\mathrm{lin}}$, $\forall h \in [H]$
Theorem [Wang-Wang-Kakade-21]: There exists a class of linearly realizable MDPs such that any online RL algorithm requires $\min\{\Omega(2^d), \Omega(2^H)\}$ samples to obtain a near-optimal policy.
Fundamentally different from supervised learning and bandits
To achieve efficiency in online RL, we need stronger assumptions
31
![Page 80: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/80.jpg)
Assumption: Image(B) ⊆ Flin
We assume the image set of the Bellman operator lies in $\mathcal{F}_{\mathrm{lin}}$:
For any $Q : \mathcal{S} \times \mathcal{A} \to [0, H]$, there exists $\theta_Q \in \mathbb{R}^d$ s.t.
$(\mathbb{B}Q)(s, a) = r(s, a) + \mathbb{E}\big[\max_{a'} Q(s', a')\big] = \langle \phi(s, a), \theta_Q \rangle$
Normalization condition: $\|\theta_Q\|_2 \le 2H\sqrt{d}$, $\sup_{s,a} \|\phi(s, a)\|_2 \le 1$
Is there such an MDP? Yes, the linear MDP:
$r(s, a) = \langle \phi(s, a), \omega \rangle, \quad P(s' \mid s, a) = \langle \phi(s, a), \mu(s') \rangle$
Normalization: $\|\omega\|_2 \le \sqrt{d}$, $\sum_{s'} \|\mu(s')\|_2 \le \sqrt{d}$
Linear MDP contains the tabular MDP as a special case: $\phi(s, a) = e_{(s,a)}$, $d = |\mathcal{S}| \cdot |\mathcal{A}|$
32
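The claim that a linear MDP puts the Bellman image inside $\mathcal{F}_{\mathrm{lin}}$ can be checked numerically. The construction below (random mixture features and a random $Q$, all illustrative) computes $\mathbb{B}Q$ two ways and confirms $\theta_Q = \omega + \sum_{s'} \mu(s') \max_{a'} Q(s', a')$:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, d = 6, 3, 4

# Hypothetical linear-MDP parameters: r = <phi, w>, P(s'|s,a) = <phi(s,a), mu(s')>.
# Simplex features mixed with row-stochastic mu give a valid transition kernel.
phi = rng.dirichlet(np.ones(d), size=(S, A))   # (S, A, d), each phi(s,a) on simplex
mu = rng.dirichlet(np.ones(S), size=d)         # (d, S'); column mu[:, s'] plays mu(s')
w = rng.uniform(0, 1, size=d)

Q = rng.uniform(0, 1, size=(S, A))             # an arbitrary Q : S x A -> R
V = Q.max(axis=1)                              # V(s') = max_{a'} Q(s', a')

BQ_direct = phi @ w + (phi @ mu) @ V           # r(s,a) + E_{s' ~ P}[V(s')]
theta_Q = w + mu @ V                           # theta_Q = w + sum_{s'} mu(s') V(s')
BQ_linear = phi @ theta_Q
# the two agree exactly: Image(B) lies in the span of the features
```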
![Page 83: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/83.jpg)
Algorithm for linear RL: LSVI-UCB
33
![Page 84: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/84.jpg)
Algorithm: LSVI + optimism
At the beginning of the $t$-th episode, the dataset is $\mathcal{D}_{t-1} = \{(s^i_h, a^i_h, r^i_h), h \in [H]\}_{i < t}$
For $h = H, \ldots, 1$, backwardly solve ridge regression:
$y^i_h = r^i_h + \max_a Q^{+,t}_{h+1}(s^i_{h+1}, a)$
$\theta^t_h = \arg\min_\theta \sum_{i=1}^{t-1} \big[y^i_h - \langle \phi(s^i_h, a^i_h), \theta \rangle\big]^2 + \|\theta\|_2^2$
Hessian of the ridge loss: $\Lambda^t_h = I + \sum_{i=1}^{t-1} \phi(s^i_h, a^i_h)\,\phi(s^i_h, a^i_h)^\top$
Construct the bonus $\Gamma^t_h(s, a) = \beta \cdot \|\phi(s, a)\|_{(\Lambda^t_h)^{-1}}$
UCB Q-function:
$Q^{+,t}_h = \mathrm{Trunc}_{[0,H]}\big\{\langle \phi(s, a), \theta^t_h \rangle + \Gamma^t_h(s, a)\big\}$
Execute $\pi^t_h(\cdot) = \arg\max_a Q^{+,t}_h(\cdot, a)$
34
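The pseudocode above translates almost line-for-line into code. This sketch plans one episode from a given dataset; the data layout, finite action set, and all names are assumptions for illustration, not from the talk:

```python
import numpy as np

def lsvi_ucb_plan(data, phi, actions, H, d, beta):
    """One backward planning pass of LSVI-UCB (a sketch; finite actions assumed).

    data[h]: transitions (s, a, r, s_next) collected at step h in past episodes;
    phi(s, a) -> R^d is the known feature map.  Returns {h: (theta_h, Lambda_h)}."""
    params = {}
    Q_next = lambda s, a: 0.0                        # Q^{+,t}_{H+1} = 0
    for h in range(H, 0, -1):
        Lam, b = np.eye(d), np.zeros(d)              # ridge statistics
        for (s, a, r, s_next) in data[h]:
            y = r + max(Q_next(s_next, ap) for ap in actions)  # regression target
            f = phi(s, a)
            Lam += np.outer(f, f)                    # Hessian of the ridge loss
            b += y * f
        theta = np.linalg.solve(Lam, b)              # ridge estimator theta^t_h

        def Q_next(s, a, theta=theta, Lam=Lam):      # optimistic Q^{+,t}_h
            f = phi(s, a)
            bonus = beta * np.sqrt(f @ np.linalg.solve(Lam, f))
            return float(np.clip(f @ theta + bonus, 0.0, H))   # Trunc_[0,H]

        params[h] = (theta, Lam)
    return params
```

The executed policy is then greedy with respect to the returned optimistic Q-functions, $\pi^t_h(s) = \arg\max_a Q^{+,t}_h(s, a)$.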
![Page 89: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/89.jpg)
A more abstract version of LSVI-UCB
For $h = H, \ldots, 1$, backwardly solve least-squares regression:
$y^i_h = r^i_h + \max_a Q^{+,t}_{h+1}(s^i_{h+1}, a)$
$Q^t_h = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^{t-1} \big[y^i_h - f(s^i_h, a^i_h)\big]^2 + \mathrm{penalty}(f)$
Uncertainty quantification for $Q^t_h$: find $\Gamma^t_h$ such that
$\mathbb{P}\big(\forall (t, h, s, a),\ |Q^t_h(s, a) - (\mathbb{B}Q^{+,t}_{h+1})(s, a)| \le \Gamma^t_h(s, a)\big) \ge 1 - \delta$
Optimism:
$Q^{+,t}_h = \mathrm{Trunc}_{[0,H]}\big\{Q^t_h + \Gamma^t_h(s, a)\big\}$
• Computation & memory cost independent of $|\mathcal{S}|$ (even for RKHS, overparameterized NN)
35
![Page 93: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/93.jpg)
A more abstract version of LSVI-UCB
[Figure: the curves $Q^+_h$, $Q^t_h$, and $Q^+_h - 2\Gamma^t_h$, with $\mathbb{B}Q^+_{h+1}$ sandwiched inside the band]
$|Q^t_h(s, a) - (\mathbb{B}Q^+_{h+1})(s, a)| \le \Gamma^t_h(s, a), \quad \forall t, h, s, a$
$Q^+_h(s, a) = \underbrace{Q^t_h(s, a)}_{\text{LSVI}} + \underbrace{\Gamma^t_h(s, a)}_{\text{bonus}}$
Optimism gives: $\mathbb{B}Q^+_{h+1} \in [Q^+_h - 2\Gamma^t_h,\ Q^+_h]$
Monotonicity of the Bellman operator: $Q^+_h \ge Q^\star_h$ (UCB)
36
![Page 94: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/94.jpg)
Sample efficiency of LSVI-UCB
37
![Page 95: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/95.jpg)
LSVI-UCB achieves $\sqrt{T}$-regret
Theorem [Jin-Yang-Wang-Jordan-20]: Choosing $\beta = O(dH)$, with probability at least $1 - 1/T$,
$\mathrm{Regret}(T) = O(\beta \cdot H \cdot \sqrt{dT}) = O(H^2 \cdot \sqrt{d^3 T})$
Directly implies an $|\mathcal{S}|^{1.5} |\mathcal{A}|^{1.5} H^2 \sqrt{T}$ regret for tabular RL
First algorithm with both sample and computational efficiency in the context of RL with function approximation
Only assumption: the image set of $\mathbb{B}$ is in $\mathcal{F}_{\mathrm{lin}}$
The optimal regret $d H^2 \sqrt{T}$ is achieved by [Zanette et al. 2020] under a relaxed model assumption, but the computation is intractable.
38
![Page 100: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/100.jpg)
A general version of regret
Let $\mathcal{Q}_{\mathrm{ucb}}$ be the function class containing the Q-functions constructed by LSVI-UCB:
$\mathcal{Q}_{\mathrm{ucb}} = \big\{\mathrm{Trunc}_{[0,H]}\{\phi(\cdot, \cdot)^\top \theta + \beta \cdot \|\phi(\cdot, \cdot)\|_{\Lambda^{-1}}\}\big\}$ for the linear case
Theorem [Jin-Yang-Wang-Wang-Jordan-20]: Choosing $\beta = O\big(H \cdot \sqrt{\log N_\infty(\mathcal{Q}_{\mathrm{ucb}}, T^{-2})}\big)$, with probability at least $1 - 1/T$,
$\mathrm{Regret}(T) = O(\beta H \cdot \sqrt{d_{\mathrm{eff}} \cdot T}) = O(H^2 \cdot \sqrt{d_{\mathrm{eff}} \cdot T \cdot \log N_\infty})$
$\log N_\infty(\mathcal{Q}_{\mathrm{ucb}}, T^{-2}) \lesssim d \log T$ for the linear case
Includes kernels and overparameterized neural networks
$d_{\mathrm{eff}}$ is the effective dimension of the RKHS or NTK
For an abstract function class $\mathcal{F}$, $d_{\mathrm{eff}}$ can be set as the Bellman-eluder dimension [Jin-Liu-Miryoosefi-21]
39
![Page 104: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/104.jpg)
Regret analysis
40
![Page 105: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/105.jpg)
Regret analysis: sensitivity analysis + elliptical potential
A general sensitivity analysis:
$J(\pi^\star) - J(\pi_t) = \underbrace{\sum_{h \in [H]} \mathbb{E}_{\pi^\star}\big[Q^{+,t}_h\big(s_h, \pi^\star_h(s_h)\big) - Q^{+,t}_h\big(s_h, \pi^t_h(s_h)\big)\big]}_{\text{(i) policy optimization error}} + \underbrace{\sum_{h \in [H]} \mathbb{E}_{\pi_t}\big[(Q^{+,t}_h - \mathbb{B}Q^{+,t}_{h+1})(s_h, a_h)\big]}_{\text{(ii) Bellman error on } \pi_t} + \underbrace{\sum_{h \in [H]} \mathbb{E}_{\pi^\star}\big[-(Q^{+,t}_h - \mathbb{B}Q^{+,t}_{h+1})(s_h, a_h)\big]}_{\text{(iii) Bellman error on } \pi^\star}$
Term (i) $\le 0$ as $\pi^t$ is greedy with respect to $Q^{+,t}_h$
Optimism: $Q^{+,t}_h - 2\Gamma^t_h \le \mathbb{B}Q^{+,t}_{h+1} \le Q^{+,t}_h$, so Term (iii) $\le 0$
41
![Page 106: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/106.jpg)
Regret analysis: sensitivity analysis + elliptical potential
Therefore, we have
$\mathrm{Regret}(T) = \sum_{t=1}^{T}\big[J(\pi^\star) - J(\pi_t)\big] \le 2\sum_{t \in [T]}\sum_{h \in [H]} \mathbb{E}_{\pi_t}\big[\Gamma^t_h(s_h, a_h)\big] = 2\sum_{t \in [T]}\sum_{h \in [H]} \Gamma^t_h(s^t_h, a^t_h) + \text{martingale difference} = 2\beta \sum_{h \in [H]} \sum_{t \in [T]} \|\phi(s^t_h, a^t_h)\|_{(\Lambda^t_h)^{-1}} + H \cdot \sqrt{T} = O(\beta H \sqrt{dT})$
The second equality holds because $\{(s^t_h, a^t_h), h \in [H]\} \sim \pi_t$
It remains to conduct uncertainty quantification:
$\mathbb{P}\big(\forall (t, h, s, a),\ |Q^t_h(s, a) - (\mathbb{B}Q^{+,t}_{h+1})(s, a)| \le \Gamma^t_h(s, a)\big) \ge 1 - \delta$
via uniform concentration over $\mathcal{Q}_{\mathrm{ucb}}$ (new for RL) + self-normalized concentration (same as in CB)
42
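The elliptical-potential step in the middle line can also be seen numerically: for any feature sequence with $\|\phi\| \le 1$ and $\Lambda_1 = I$, the sum of $(\Lambda^t)^{-1}$-norms is bounded by $\sqrt{2dT\log(1 + T/d)}$ via Cauchy-Schwarz plus the potential lemma (a sketch with a random sequence; the constant is the standard one, stated without proof):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 10, 5000

Lam = np.eye(d)
total = 0.0
for t in range(T):
    f = rng.normal(size=d)
    f /= max(1.0, np.linalg.norm(f))          # enforce ||phi|| <= 1
    total += np.sqrt(f @ np.linalg.solve(Lam, f))  # ||phi_t||_{(Lambda_t)^{-1}}
    Lam += np.outer(f, f)

# deterministic bound: sum_t ||phi_t||_{Lambda_t^{-1}} <= sqrt(2 T d log(1 + T/d))
bound = np.sqrt(2 * T * d * np.log(1 + T / d))
```

The sum grows like $\sqrt{dT}$ rather than linearly in $T$, which is exactly what turns the per-episode bonuses into a $\sqrt{T}$ regret.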
![Page 108: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/108.jpg)
Summary & Extensions
Exploration in RL is more challenging in that we need to handle deep exploration
LSVI-UCB explores by doing uncertainty quantification for the LSVI estimator
LSVI propagates the uncertainty at later steps $h$ backward to $h = 1$
By construction, $Q^+_1$ contains the uncertainty about all $h \ge 1$
LSVI-UCB achieves sample efficiency and computational tractability in online RL with function approximation
Similar principles can be extended to:
proximal policy optimization (use soft-greedy instead of greedy)
zero-sum Markov games (two-player extension)
constrained MDPs (primal-dual optimization)
43
![Page 112: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/112.jpg)
References (a very incomplete list)
linear bandit
• [Dani et al., 2008] Stochastic Linear Optimization under Bandit Feedback
• [Abbasi-Yadkori et al., 2011] Improved Algorithms for Linear Stochastic Bandits
optimism in tabular RL
• [Auer & Ortner, 2007] Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
• [Jaksch et al., 2010] Near-optimal Regret Bounds for Reinforcement Learning
• [Azar et al., 2017] Minimax Regret Bounds for Reinforcement Learning
44
![Page 113: Sample Efficiency in Online RL](https://reader033.vdocument.in/reader033/viewer/2022050207/626dacdb82797f482d03e8a3/html5/thumbnails/113.jpg)
References (a very incomplete list)
RL with function approximation
• [Dann et al., 2018] On Oracle-Efficient PAC RL with Rich Observations
• [Wang & Yang, 2019] Sample-Optimal Parametric Q-Learning Using Linearly Additive Features
• [Jin et al., 2020] Provably Efficient Reinforcement Learning with Linear Function Approximation (this talk)
• [Zanette et al., 2020] Learning Near Optimal Policies with Low Inherent Bellman Error
• [Du et al., 2020] Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning?
• [Xie et al., 2020] Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium
• [Yang et al., 2020] On Function Approximation in Reinforcement Learning: Optimism in the Face of Large State Spaces (this talk)
• [Jin et al., 2021] Bellman Eluder Dimension: New Rich Classes of RL Problems, and Sample-Efficient Algorithms
45