
Page 1:

`Learning to Control’ an Unknown System

Simons Institute RTDM Program - Societal Networks Workshop, March 26, 2018


Joint work with Mukul Gagrani (USC), Ashutosh Nayyar (USC) and Yi Ouyang (UC Berkeley)

Rahul Jain

University of Southern California

Page 2:

Outline

I. MDPs, Dynamic Programming

II. Bandit Models, Online Learning

III. PSDE: An RL Algorithm for Unknown MDPs

IV. PSDE Algorithm for Unknown Linear Stochastic Systems


Page 3:

A Markov Decision Process

V_𝛑(𝜽) = lim inf_{T→∞} (1/T) E[ Σ_{t=1}^T r(x_t, u_t) ]

[Figure: MDP control loop with state x, action u, reward r(x, u), transition kernel 𝜽(y|x, u), and control policy 𝛑(u|x; 𝜽)]

Finite state space X, finite action space U; model 𝜽 known

Page 4:

Dynamic Programming

Weakly communicating finite MDP

Optimal average reward: V*(𝜽) = sup_𝛑 V_𝛑(𝜽)

Bellman equation: V*(𝜽) + w*(x, 𝜽) = sup_u { r(x, u) + Σ_y 𝜽(y|x, u) w*(y, 𝜽) }, where the sum is E[w*(y) | x, u]

‣ w*(x, 𝜽) is the relative value function

Solve by average-reward DP algorithms
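As a concrete illustration of solving this Bellman equation by average-reward DP, here is a minimal relative value iteration sketch (not from the slides; the reference-state normalization, array layout, and function name are illustrative assumptions):

```python
import numpy as np

def relative_value_iteration(theta, r, tol=1e-8, max_iter=10_000):
    """Solve V* + w*(x) = max_u [ r(x,u) + sum_y theta(y|x,u) w*(y) ]
    for a weakly communicating finite MDP (sketch).

    theta : array (X, U, X), theta[x, u, y] = P(y | x, u)
    r     : array (X, U),    r[x, u] = one-step reward
    Returns (average-reward estimate, relative values, greedy policy).
    """
    X, U = r.shape
    w = np.zeros(X)
    g = 0.0
    for _ in range(max_iter):
        Q = r + theta @ w            # Q(x,u) = r(x,u) + E[w(y) | x, u]
        Tw = Q.max(axis=1)           # Bellman operator applied to w
        g = Tw[0]                    # running estimate of the optimal average reward V*
        w_new = Tw - g               # relative values, pinned at reference state 0
        if np.max(np.abs(w_new - w)) < tol:
            w = w_new
            break
        w = w_new
    Q = r + theta @ w
    return g, w, Q.argmax(axis=1)    # greedy policy for the given theta
```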

Page 5:

Unknown Model

True parameter 𝜃_o is unknown and distributed according to a prior μ

Learning policy 𝜙_t(h_t), where the history h_t = (past states and actions)

Objective of Learning: find a nearly optimal policy at the fastest possible rate


[Figure: MDP control loop with state x, action u, reward r(x, u), and unknown transition kernel 𝜽(y|x, u); the learner forms an estimate 𝜽̂ and adapts the policy 𝛑(u|x; 𝜽̂)]

Finite state space X, finite action space U; model 𝜽 unknown

[Borkar-Varaiya’82]

Page 6:

Bandit Models and Online Learning

Reward on Heads = $1, on Tails = $0

Objective: “max expected long-term total reward”

min (expected) Regret

Lai & Robbins (1985) lower bound O(log T)

UCB1 algorithm achieves O(log T) regret [Agrawal’95, Auer et al.’02]


[Figure: two coins with unknown biases 𝜽_1 and 𝜽_2]

UCB index (Optimism in the Face of Uncertainty, OFU): g_i(t, t_i) = X̄_i + √(2 log t / t_i)

Regret: R_T(𝜙) = T max_i 𝜽_i − E[ Σ_{t=1}^T r_t ]
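For concreteness, a small sketch of the UCB1 rule with the index g_i(t, t_i) = X̄_i + √(2 log t / t_i) shown above, for two Bernoulli coins (function and variable names are illustrative, not from the talk):

```python
import math
import random

def ucb1(arms, horizon):
    """Pull each arm once, then pull arg max_i  X_bar_i + sqrt(2 log t / t_i)."""
    n = [0] * len(arms)        # pull counts t_i
    mean = [0.0] * len(arms)   # empirical means X_bar_i
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= len(arms):
            i = t - 1          # initialization: pull every arm once
        else:
            i = max(range(len(arms)),
                    key=lambda j: mean[j] + math.sqrt(2 * math.log(t) / n[j]))
        x = arms[i]()          # observe reward in [0, 1]
        n[i] += 1
        mean[i] += (x - mean[i]) / n[i]   # running-average update
        total += x
    return total

# Two coins: Heads pays $1, Tails pays $0, with unknown biases 0.4 and 0.6.
ucb1([lambda: float(random.random() < 0.4),
      lambda: float(random.random() < 0.6)], horizon=10_000)
```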

Page 7:

Posterior Sampling Algorithms

The (Thompson) Posterior Sampling Algorithm:

Maintain a belief (posterior distribution) µ_i over each 𝜽_i

Sample 𝜽̂_i from µ_i

Choose i* = arg max_i 𝜽̂_i

Achieves expected regret R_T(𝜙) = O(log T)

‣ Advantage: superior numerical performance, computationally simpler

‣ Thompson’33, Chapelle-Li’11, Agrawal-Goyal’12

[Figure: m coins with unknown biases 𝜽_1, 𝜽_2, …, 𝜽_m]
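For the coin setting on this slide, a minimal sketch of the posterior sampling rule with conjugate Beta posteriors (the Beta prior is an illustrative assumption; the slides leave the prior unspecified):

```python
import random

def thompson_sampling(arms, horizon):
    """Keep a Beta(a_i, b_i) posterior over each bias theta_i, sample
    theta_hat_i from each posterior, and play i* = arg max_i theta_hat_i."""
    a = [1.0] * len(arms)   # Beta parameters: 1 + number of heads observed
    b = [1.0] * len(arms)   # Beta parameters: 1 + number of tails observed
    total = 0.0
    for _ in range(horizon):
        theta_hat = [random.betavariate(a[i], b[i]) for i in range(len(arms))]
        i = max(range(len(arms)), key=lambda j: theta_hat[j])
        x = arms[i]()              # observe Bernoulli reward
        a[i] += x                  # Bayes' rule for the Beta-Bernoulli model
        b[i] += 1.0 - x
        total += x
    return total
```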

Page 8:

Learning an Unknown MDP

Learning policy 𝜙t(ht) to search over space 𝚯

Objective of Learning: minimize the regret R_T(𝜙) (defined below)

Lower bound: Ω(√T) [Tsitsiklis, et al (2010)]

OFU vs. PS

[Figure: MDP control loop with state x, action u, reward r(x, u), and unknown transition kernel 𝜽(y|x, u), 𝜽 ∈ 𝚯; the learner forms an estimate 𝜽̂ and adapts the policy 𝛑(u|x; 𝜽̂); finite state space X, finite action space U]

Regret: R_T(𝜙) = T V*(𝜽_o) − E[ Σ_{t=1}^T r(x_t, u_t) ]

Page 9:

The PSDE Algorithm: Posterior Sampling with Dynamic Episodes

The PSDE Algorithm:

Resample 𝜃 from posterior 𝜇t at end of every episode

‣ Compute policy optimal for sampled 𝜃

At each t, update posterior using Bayes’ rule


µ_t(𝜃) = P(𝜃 | h_t)

Stopping Rule 1: t > t_k + T_{k−1}

Stopping Rule 2: N_t(x, u) > 2 N_{t_k}(x, u) for some (x, u)

[Figure: timeline 0, t_1, t_2, …, with dynamic episodes of lengths T_1, T_2, …; at the start of each episode, resample 𝜃 and compute the policy 𝜋*(𝜃); episodes alternate exploration and exploitation]
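For concreteness, here is a structural sketch of PSDE built around the two stopping rules above; the environment interface (env), posterior sampler (sample_posterior), and planner (solve_mdp) are hypothetical placeholders, since the slides leave them abstract:

```python
import numpy as np

def psde(env, sample_posterior, solve_mdp, horizon, n_states, n_actions):
    """Posterior Sampling with Dynamic Episodes (structural sketch).

    A new episode starts when either
      (1) t > t_k + T_{k-1}                              (length rule), or
      (2) N_t(x, u) > 2 * N_{t_k}(x, u) for some (x, u)  (doubling rule).
    At each episode start: resample theta from the posterior and compute
    a policy that is optimal for the sampled theta.
    """
    N = np.zeros((n_states, n_actions))     # visit counts N_t(x, u)
    t_k, T_prev = 1, 0                      # current episode start, previous episode length
    N_tk = N.copy()
    policy = solve_mdp(sample_posterior())  # pi*(theta_1)
    x = env.reset()
    for t in range(1, horizon + 1):
        if t > t_k + T_prev or np.any(N > 2 * N_tk):
            T_prev, t_k = t - t_k, t                # close episode k, open episode k+1
            N_tk = N.copy()
            policy = solve_mdp(sample_posterior())  # resample theta, recompute pi*(theta)
        u = policy(x)
        x_next, _ = env.step(u)   # the posterior is updated from (x, u, x_next) via Bayes' rule
        N[x, u] += 1
        x = x_next
    return N
```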

Page 10:

Non-asymptotic Regret bound for PSDE

Theorem.* If the MDP is weakly communicating and its span ≤ H, then

R_T(PSDE) ≤ Õ(H X √(U T)),

where X is the state space size and U is the action space size.

‣ Up to logarithmic factors; exact constants are known

‣ The PSDE algorithm also works with approximately optimal policies in each episode

‣ Episode length can’t increase faster

*Y. Ouyang, M. Gagrani, A. Nayyar and R. Jain, “Learning Unknown MDPs: A Thompson Sampling Approach”, NIPS, 2017.

Page 11:

Numerical Performance

RiverSwim benchmark problem


[Figure: cumulative regret vs. time horizon (up to 10 × 10^4 steps) on RiverSwim; PSDE compared against UCRL2 [Jaksch, Ortner, Auer (2010)], TSMDP [Gopalan & Mannor (2015)], and Lazy-PSRL [Yadkori & Szepesvari (2015)]]

Page 12:

Proof Outline

For any function f and any random variable X measurable with respect to the history, the posterior sampling step satisfies

E[f(𝜃_k, X)] = E[f(𝜃_o, X)]

Upper bound on the number of episodes:

K_T ≤ √(2 X U T log T)

Upper bound on the gap between true and sampled parameters:

T E[V*(𝜃_o)] − Σ_{k=1}^{K_T} E[T_k V*(𝜃_k)] ≤ E[K_T]
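The first identity is the defining property of posterior sampling: at a (history-measurable) sampling time, the sampled parameter 𝜃_k and the true parameter 𝜃_o are identically distributed given the history. A toy Monte Carlo check on a Beta-Bernoulli model (the Uniform prior and the test function f are illustrative choices, not from the talk):

```python
import random

# Draw theta_o ~ Uniform(0,1) (the prior), observe heads in n coin tosses, then sample
# theta_k from the posterior Beta(1 + heads, 1 + n - heads). Given the history, theta_k
# and theta_o have the same distribution, so E[f(theta_k, X)] = E[f(theta_o, X)] for any
# f and any history-measurable X (here X = number of heads).
def one_run(n=10):
    theta_o = random.random()
    heads = sum(random.random() < theta_o for _ in range(n))
    theta_k = random.betavariate(1 + heads, 1 + n - heads)
    f = lambda th, x: th * x           # an arbitrary test function f(theta, X)
    return f(theta_o, heads), f(theta_k, heads)

runs = [one_run() for _ in range(200_000)]
print(sum(a for a, _ in runs) / len(runs),   # estimate of E[f(theta_o, X)]
      sum(b for _, b in runs) / len(runs))   # estimate of E[f(theta_k, X)]; agrees up to MC noise
```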

Page 13:

Unknown Stochastic Linear System

Parameters 𝜃 unknown, where 𝜃^⊤ = [A, B]; the states x_t and controls u_t are observed

Dynamics and cost: x_{t+1} = A x_t + B u_t + w_t, c_t = x_t^⊤ Q x_t + u_t^⊤ R u_t

Control policy: u_t = 𝜙_t(h_t)

Regret: R_T(𝜙) = E[ Σ_{t=1}^T c_t − T J(𝜃) ]

Optimal control policy is linear: u = G(𝜃) x, where G(𝜃) = −(R + B^⊤ S(𝜃) B)^{−1} B^⊤ S(𝜃) A

Assumption 1: There is a set Θ such that for all 𝜃 ∈ Θ, there is a unique p.d. solution S(𝜃) to the Riccati equation
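Under Assumption 1, the gain G(𝜃) comes from the unique p.d. solution of the discrete algebraic Riccati equation; a short sketch using SciPy's solve_discrete_are (the helper name lqr_gain is illustrative; the sign convention matches u = G(𝜃) x above):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_gain(A, B, Q, R):
    """Solve the discrete Riccati equation for S(theta) and return the
    stationary optimal gain G(theta) = -(R + B' S B)^{-1} B' S A."""
    S = solve_discrete_are(A, B, Q, R)    # unique p.d. solution under Assumption 1
    G = -np.linalg.solve(R + B.T @ S @ B, B.T @ S @ A)
    return G, S

# Usage: u_t = G @ x_t is the optimal stationary controller for the given (A, B, Q, R).
```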

Page 14:

Stochastic Adaptive Control

Classical Adaptive Control…

Certainty equivalence principle

‣ Astrom-Wittenmark’94, Sastry’89, Narendra’89

Cost-biased Max Likelihood approach

‣ Campi and Kumar’98, Prandini-Campi’01,…

Optimism in the Face of Uncertainty (OFU)

‣ Yadkori-Szepesvari’11,’15, Van Roy, et al’12,’13,’16 (computation!)

‣ Abeille-Lazaric’17: Õ(T^{2/3})

Page 15:

The Posterior Sampling with Dynamic Episodes (PSDE) Learning Algorithm

From data z_t = [x_t, u_t], estimate parameters 𝜃 (maintain the posterior µ_t)

Posterior Sampling: sample parameters 𝜃_k from µ_{t_k}(𝜃), solve the Riccati equation, and compute the gain G(𝜃_k)

Dynamic Episodes (timeline 0, …, t_k, t_{k+1}, …): episode k ends when its length reaches T_k = T_{k−1} + 1 or when det(Σ_t) < 0.5 det(Σ_{t_k})
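Putting the pieces together, a structural sketch of PSDE for the linear-quadratic case: a recursive least-squares (Gaussian) posterior over 𝜃 with 𝜃^⊤ = [A, B], episodes ended by the determinant rule, and the Riccati-based gain computed from the sampled parameters. The prior, the noise scaling, the column-independent sampling, and the step(x, u) interface are illustrative assumptions; in the actual algorithm the sample is also kept inside the set Θ of Assumption 1, a step omitted here:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def psde_lqr(step, n, m, horizon, Q=None, R=None, noise_var=1.0):
    """PSDE for an unknown linear system x_{t+1} = A x_t + B u_t + w_t (sketch).

    step(x, u) -> next state; the true (A, B) are hidden inside `step`.
    The posterior over theta (theta^T = [A, B]) is tracked by recursive least
    squares: mean theta_hat (d x n) and covariance Sigma (d x d), d = n + m.
    """
    d = n + m
    Q = np.eye(n) if Q is None else Q
    R = np.eye(m) if R is None else R
    theta_hat, Sigma = np.zeros((d, n)), np.eye(d)   # illustrative prior
    x = np.zeros(n)
    t_k, T_prev, det_k = 0, 0, np.linalg.det(Sigma)
    G = np.zeros((m, n))
    for t in range(horizon):
        if t == 0 or t - t_k > T_prev or np.linalg.det(Sigma) < 0.5 * det_k:
            T_prev, t_k, det_k = t - t_k, t, np.linalg.det(Sigma)   # start a new episode
            # sample theta column-wise from the posterior (independent columns assumed)
            theta = np.column_stack(
                [np.random.multivariate_normal(theta_hat[:, i], noise_var * Sigma)
                 for i in range(n)])
            A_s, B_s = theta[:n, :].T, theta[n:, :].T
            S = solve_discrete_are(A_s, B_s, Q, R)                  # Riccati equation
            G = -np.linalg.solve(R + B_s.T @ S @ B_s, B_s.T @ S @ A_s)
        u = G @ x
        x_next = step(x, u)
        z = np.concatenate([x, u])                 # regressor z_t = [x_t, u_t]
        Sz = Sigma @ z
        K = Sz / (1.0 + z @ Sz)                    # RLS / Gaussian posterior update
        theta_hat = theta_hat + np.outer(K, x_next - z @ theta_hat)
        Sigma = Sigma - np.outer(K, Sz)
        x = x_next
    return theta_hat, G
```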

Page 16:

√T-Regret of PSDE Algorithm

Assumption 2. State space X compact,

This implies that for all 𝜃 ∈ Θ, the spectral radius 𝜌(A_1 + B_1 G(𝜃)) < δ < 1

Theorem.* Expected regret of PSDE satisfies

R_T(PSDE) ≤ O(√T)

[Figure: normalized regret R(T)/T vs. horizon T (0 to 5000), for an open-loop stable system and an open-loop unstable system, with δ = 0.99 and δ = 2]

*Y. Ouyang, M. Gagrani and R. Jain, “Learning-based Control of Unknown Linear Systems with Thompson Sampling”, arXiv:1709.04047

Page 17:

Conclusions

Simple Posterior Sampling (PS)-based Learning-to-Control Algorithms

‣ For MDPs and Linear Stochastic Systems

Trades off `Exploration vs. Exploitation’ nearly optimally to get O(√T) regret

‣ Unlike OFU-type algorithms, computationally simple

‣ A natural design

‣ A deterministic schedule possible?

Extensions

‣ Continuous state space MDPs via function approximation

‣ Time-varying systems

*Y. Ouyang, M. Gagrani, A. Nayyar and R. Jain, “Learning Unknown MDPs: A Thompson Sampling Approach”, NIPS, 2017.

*Y. Ouyang, M. Gagrani and R. Jain, “Learning-based Control of Unknown Linear Systems with Thompson Sampling”, arXiv:1709.04047, 2017.