
Page 1

Approximate Reinforcement Learning (Optimal Adaptive Control workshop: Part II)

Lucian Busoniu, SequeL, INRIA Lille, France

CDC-ECC 2011, 11 December, Orlando

Page 2

Reinforcement learning perspectives

Learn to optimally control an unknown system

... or stated in AI terms ...

Learn to optimally behave in an unknown environment

This talk: methods developed in AI, with a focus on control problems

Page 3

Reminder: Markov decision process

Observe states $x$, apply actions $u$, receive rewards $r$

System: stochastic dynamics $x_{k+1} \sim f(x_k, u_k, \cdot)$
Performance: reward function $r_{k+1} = r(x_k, u_k, x_{k+1})$

Controller: $u_k = h(x_k)$

Page 4

Goal

Find $h$ to maximize, from any $x_0$, the expected discounted return:

$R^h(x_0) = E\left\{ \sum_{k=0}^{\infty} \gamma^k r_{k+1} \;\middle|\; h \right\}$

Equivalent to minimizing the expected cost:

$E\{J(x_0) \mid h\} = E\left\{ \sum_{k=0}^{\infty} \gamma^k \cdot (-r_{k+1}) \;\middle|\; h \right\}$

Page 5

Solution using Q-functions

Q-function of $h$: $Q^h(x,u) = E\{ r(x,u,x') + \gamma R^h(x') \}$

Bellman equation:

$Q^h(x,u) = E\{ r(x,u,x') + \gamma Q^h(x', h(x')) \}$

Optimal Q-function: $Q^* = \max_h Q^h$

Bellman optimality equation:

$Q^*(x,u) = E\{ r(x,u,x') + \gamma \max_{u'} Q^*(x',u') \}$

Optimal policy: $h^*(x) = \arg\max_u Q^*(x,u)$

Page 6

Value & policy iteration with Q-functions

Q-value iteration: turn the Bellman optimality equation into an iterative update

repeat
   $Q(x,u) \leftarrow E\{ r(x,u,x') + \gamma \max_{u'} Q(x',u') \}$ for all $x, u$
until convergence to $Q^*$

Policy iteration: iteratively evaluate & improve policies

repeat
   policy evaluation: solve $Q^h(x,u) = E\{ r(x,u,x') + \gamma Q^h(x', h(x')) \}$ (e.g. by using a VI-like update)
   policy improvement: $h(x) \leftarrow \arg\max_u Q^h(x,u)$ for all $x$
until convergence to $h^*$
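
As a concrete illustration of both procedures, here is a minimal Python sketch for a small finite MDP. The model interface (numpy tensors `P[x, u, x_next]` for transition probabilities and `R[x, u, x_next]` for rewards) and all names are illustrative assumptions, not something specified in the talk.

```python
# Minimal sketch of Q-value iteration and policy iteration for a small finite MDP.
# Assumed model interface: P[x, u, x'] = transition probability, R[x, u, x'] = reward.
import numpy as np

def q_value_iteration(P, R, gamma, tol=1e-8):
    n_x, n_u = P.shape[0], P.shape[1]
    Q = np.zeros((n_x, n_u))
    while True:
        # Bellman optimality update: Q(x,u) <- E{ r + gamma * max_u' Q(x',u') }
        Q_new = np.einsum("xuy,xuy->xu", P, R + gamma * Q.max(axis=1)[None, None, :])
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new

def policy_iteration(P, R, gamma, n_eval_sweeps=100, max_iter=1000):
    n_x, n_u = P.shape[0], P.shape[1]
    h = np.zeros(n_x, dtype=int)                    # arbitrary initial policy
    for _ in range(max_iter):
        # policy evaluation: approximate Q^h by repeated VI-like sweeps
        Q = np.zeros((n_x, n_u))
        for _ in range(n_eval_sweeps):
            Q = np.einsum("xuy,xuy->xu", P,
                          R + gamma * Q[np.arange(n_x), h][None, None, :])
        h_new = Q.argmax(axis=1)                    # policy improvement
        if np.array_equal(h_new, h):
            break
        h = h_new
    return Q, h
```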

Page 7

Example: Inverted pendulum

$x = [\text{angle } \alpha, \text{ velocity } \dot{\alpha}]^\top$, $u$ = voltage; deterministic dynamics

Goal: stabilize pointing up:

$r(x,u) = -x^\top \begin{bmatrix} 5 & 0 \\ 0 & 0.1 \end{bmatrix} x - u^\top 1\, u$

discount factor $\gamma = 0.98$

Underactuated ⇒ must swing before stabilizing

Page 8

Example: Optimal solution

Left: slice of $Q^*(x,u)$ for $u = 0$. Right: optimal policy


Page 9

Need for approximation

Classical RL – discrete $x$, $u$; $Q(x,u)$ and $h(x)$ exactly represented

In control problems, $x$, $u$ typically continuous

Approximation over $x$, $u$ necessary

Page 10

Approximators

Parametric: fixed form, # of parameters

Linear: $\widehat{Q}_W(x,u) = \phi^\top(x,u)\, W$, e.g. RBFs

Nonlinear: e.g. neural net

Nonparametric: form, # of parameters derived from data, e.g. locally linear regression

Page 11

Taxonomy of methods

By path to solution:
1 Approximate value iteration: Find $Q^*$, use it to compute $h^*$

2 Approximate policy iteration: Evaluate $h$ (find $Q^h$), improve $h$, repeat

3 Approximate policy search: Look directly for $h^*$

By level of interaction:
1 Offline (batch): data collected in advance
2 Online: learn by interaction

Page 12

1 Introduction & Recap

2 Offline approximate value iteration & policy iteration
   AVI: Fuzzy Q-iteration
   AVI: Fitted Q-iteration
   API: Least-squares policy iteration

3 Online temporal-difference RL
   Classical Q-learning and SARSA
   Approximate Q-learning and SARSA

4 Optimistic planning

5 Conclusions

Page 13

Fuzzy approximation

Linear approximator:
basis functions (BFs) over the state, $\phi_1, \dots, \phi_N : X \to [0, 1]$

discretized actions $u_1, \dots, u_M \in U$
parameters $W \in \mathbb{R}^{N \times M}$

Approximate Q-function:

$\widehat{Q}_W(x,u) = \phi^\top(x)\, W_{*,j} = \sum_{i=1}^{N} \phi_i(x)\, W_{i,j}$, where $j = \arg\min_{j'} \| u - u_{j'} \|$ (nearest neighbor)

Page 14

Assumptions

Deterministic system
BFs normalized: $\sum_{i=1}^{N} \phi_i(x) = 1$
At the center $x_i$ of each BF: $\phi_i(x_i) = 1$, $\phi_{i'}(x_i) = 0$ for all $i' \neq i$

Simplest BFs satisfying this – triangular:

⇒ multilinear interpolation

Page 15

Fuzzy Q-iteration

Revisit exact Q-iteration:

repeat at each iteration $\ell$
   $Q_{\ell+1}(x,u) \leftarrow r(x,u,x') + \gamma \max_{u'} Q_\ell(x',u')$ for all $x, u$
   $\quad =: [T(Q_\ell)](x,u)$, with $T : \mathcal{Q} \to \mathcal{Q}$ the Bellman mapping
until convergence to $Q^*$
output $Q^*$, $h^*(x) = \arg\max_u Q^*(x,u)$

Fuzzy Q-iteration (Busoniu et al., 2007)

repeat at each iteration $\ell$
   $W_{\ell+1,\,i,j} \leftarrow [T(\widehat{Q}_{W_\ell})](x_i, u_j) = r(x_i, u_j, x') + \gamma \max_{u'} \widehat{Q}_{W_\ell}(x', u')$ for all $i, j$
until convergence to $W^*$
output $\widehat{Q}_{W^*}$, $\widehat{h}_{W^*}(x) = \arg\max_{u_j} \widehat{Q}_{W^*}(x, u_j)$
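
A rough Python sketch of fuzzy Q-iteration for a deterministic system with a one-dimensional state, using the normalized triangular BFs from the previous slide. The model functions `f(x, u)` and `rew(x, u, x_next)`, the grid of BF centers, and the discretized action set are assumed to be supplied by the caller; none of these names come from the talk.

```python
# Hedged sketch of fuzzy Q-iteration with 1-D triangular BFs (deterministic system).
import numpy as np

def triangular_bfs(x, centers):
    """Normalized triangular BFs on a sorted 1-D grid: phi_i(x_i)=1, sum_i phi_i(x)=1."""
    phi = np.zeros(len(centers))
    x = np.clip(x, centers[0], centers[-1])
    i = np.searchsorted(centers, x)
    if i == 0:
        phi[0] = 1.0
    else:
        w = (x - centers[i - 1]) / (centers[i] - centers[i - 1])
        phi[i - 1], phi[i] = 1.0 - w, w
    return phi

def fuzzy_q_iteration(f, rew, centers, actions, gamma, n_iter=200, tol=1e-8):
    N, M = len(centers), len(actions)
    W = np.zeros((N, M))
    for _ in range(n_iter):
        W_new = np.empty_like(W)
        for i, xi in enumerate(centers):
            for j, uj in enumerate(actions):
                x_next = f(xi, uj)
                phi = triangular_bfs(x_next, centers)
                # W[i,j] <- r(x_i, u_j, x') + gamma * max_u' Qhat_W(x', u')
                W_new[i, j] = rew(xi, uj, x_next) + gamma * np.max(phi @ W)
        if np.max(np.abs(W_new - W)) < tol:
            W = W_new
            break
        W = W_new
    return W

def greedy_action(W, centers, actions, x):
    """h_W*(x) = argmax over the discrete actions of Qhat_W*(x, u_j)."""
    phi = triangular_bfs(x, centers)
    return actions[int(np.argmax(phi @ W))]
```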

Page 16

Convergence

$T$ is a contraction with factor $\gamma$: $\|T(Q) - T(Q')\|_\infty \leq \gamma \|Q - Q'\|_\infty$

$\widehat{Q}_W$ is a non-expansion (seen as a mapping $\mathbb{R}^{N \times M} \to \mathcal{Q}$): $\|\widehat{Q}_W - \widehat{Q}_{W'}\|_\infty \leq \max_{i,j} |W_{i,j} - W'_{i,j}|$

Therefore:

Theorem

Fuzzy Q-iteration monotonically converges to $W^*$

Page 17

Solution quality

Characterize approximation power by the minimum distance to $Q^*$:

$\varepsilon = \min_{W \in \mathbb{R}^{N \times M}} \|Q^* - \widehat{Q}_W\|_\infty$

Theorem (continued)
The returned Q-function is near-optimal: $\|Q^* - \widehat{Q}_{W^*}\|_\infty \leq \frac{2\varepsilon}{1-\gamma}$

... and the corresponding policy is also near-optimal: $\|Q^* - Q^{\widehat{h}_{W^*}}\|_\infty \leq \frac{4\gamma\varepsilon}{(1-\gamma)^2}$

Page 18

Example: Fuzzy QI for the inverted pendulum

BFs over the state space: triangular, on a $41 \times 21$ equidistant grid
Action discretization: grid of 5 actions, centered on 0

Page 19

1 Introduction & Recap

2 Offline approximate value iteration & policy iteration
   AVI: Fuzzy Q-iteration
   AVI: Fitted Q-iteration
   API: Least-squares policy iteration

3 Online temporal-difference RL
   Classical Q-learning and SARSA
   Approximate Q-learning and SARSA

4 Optimistic planning

5 Conclusions

Page 20

Generalizing fuzzy Q-iteration

Arbitrary approximator, parametric or nonparametric (fuzzy QI: restricted BFs + discrete actions)

Arbitrary transition samples $(x_{i_s}, u_{i_s}, r_{i_s}, x'_{i_s})$, $i_s = 1, \dots, n_s$ (fuzzy QI: center–discrete-action pairs $(x_i, u_j)$)

Possibly stochastic system

Page 21

Fitted Q-iteration

Algorithm for a parametric approximator (can also use nonparametric regression)

Fitted Q-iteration (Ernst et al., 2005)

repeat at each iteration $\ell$
   for each sample $i_s = 1, \dots, n_s$ do
      compute the target Q-value: $q_{i_s} \leftarrow r_{i_s} + \gamma \max_{u'} \widehat{Q}_{W_\ell}(x'_{i_s}, u')$
   end for
   perform regression on the data $(x_{i_s}, u_{i_s}) \to q_{i_s}$:
      $W_{\ell+1} \leftarrow \arg\min_{W \in \mathbb{R}^{N \times M}} \sum_{i_s=1}^{n_s} \left| q_{i_s} - \widehat{Q}_W(x_{i_s}, u_{i_s}) \right|^2$
until finished
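
A hedged sketch of fitted Q-iteration for the special case of a linear-in-parameters approximator $\widehat{Q}_W(x,u) = \phi^\top(x,u) W$, with the regression step solved by least squares. The samples, the feature map `phi`, and the discrete action set are assumptions supplied by the caller, and any other regressor could be plugged in place of the least-squares fit.

```python
# Hedged fitted Q-iteration sketch with a linear-in-parameters approximator.
import numpy as np

def fitted_q_iteration(samples, phi, actions, gamma, n_iter=100):
    # samples: list of transitions (x, u, r, x_next)
    Phi = np.array([phi(x, u) for (x, u, r, xn) in samples])    # regression inputs
    W = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        # targets: q_s = r_s + gamma * max_u' Qhat_W(x'_s, u')
        q = np.array([
            r + gamma * max(phi(xn, up) @ W for up in actions)
            for (x, u, r, xn) in samples
        ])
        # least-squares regression (x_s, u_s) -> q_s
        W, *_ = np.linalg.lstsq(Phi, q, rcond=None)
    return W

def greedy_policy(W, phi, actions):
    """Greedy policy in the fitted Q-function, over the discrete action set."""
    return lambda x: max(actions, key=lambda u: phi(x, u) @ W)
```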

Page 22

Fitted Q-iteration (cont’d)

If the system is stochastic, we actually aim for: $\widehat{Q}_W(x,u) \approx E\{ r(x,u,x') + \gamma \max_{u'} \widehat{Q}_{W_\ell}(x',u') \}$
– regression on $(x_{i_s}, u_{i_s}) \to q_{i_s}$ naturally achieves this

Convergence to a near-optimal region (given MDP and approximator characteristics)

(Munos & Szepesvari, 2008)

Left: fitted QI Right: fuzzy QI

Page 23

Example: Fitted QI for the inverted pendulum

Approximator over the state space: local linear regression
Action discretization: 3 actions
Samples: on a $31 \times 15 \times 3$ grid

Page 24

1 Introduction & Recap

2 Offline approximate value iteration & policy iteration
   AVI: Fuzzy Q-iteration
   AVI: Fitted Q-iteration
   API: Least-squares policy iteration

3 Online temporal-difference RL
   Classical Q-learning and SARSA
   Approximate Q-learning and SARSA

4 Optimistic planning

5 Conclusions

Page 25

Approximate policy iteration

Revisit exact policy iteration:

repeat
   policy evaluation: solve $Q^{h_\ell}(x,u) = E\{ r(x,u,x') + \gamma Q^{h_\ell}(x', h_\ell(x')) \} =: [T^{h_\ell}(Q^{h_\ell})](x,u)$, with $T^h$ the policy evaluation mapping
   policy improvement: $h_{\ell+1}(x) \leftarrow \arg\max_u Q^{h_\ell}(x,u)$ for all $x$
until convergence to $h^*$

Approximate policy iteration

repeat
   approximate policy evaluation: solve $\widehat{Q}^{h_\ell} \approx T^{h_\ell}(\widehat{Q}^{h_\ell})$
   policy improvement: $h_{\ell+1}(x) \leftarrow \arg\max_u \widehat{Q}^{h_\ell}(x,u)$
until finished

Page 26

Guarantees

Theorem (Bertsekas & Tsitsiklis 1996, Lagoudakis & Parr 2003)

If every policy evaluation is $\varepsilon$-accurate, $\|\widehat{Q}^{h_\ell} - Q^{h_\ell}\|_\infty \leq \varepsilon$,
then a near-optimal policy is eventually obtained:

$\limsup_{\ell \to \infty} \|Q^{h_\ell} - Q^*\|_\infty \leq \frac{2\gamma\varepsilon}{(1-\gamma)^2}$

Page 27

Projected policy evaluation (LSTD)

Linear approximator $\widehat{Q}_W(x,u) = \phi^\top(x,u)\, W$

$Q^h \approx T^h(Q^h)$ instantiated as $\widehat{Q}_W = P[T^h(\widehat{Q}_W)]$, with $P$ a weighted least-squares projection
Has a meaningful solution under conditions on $P$
Called least-squares temporal difference (LSTD)

Page 28

Projected policy evaluation (cont’d)

$\widehat{Q}_W = P[T^h(\widehat{Q}_W)]$

$\widehat{Q}_W$ linear in $W$; $T^h(Q)$ linear in $Q$; $P(Q)$ linear in $Q$ ⇒ rewrite as a linear equation in $W$:

$A\, W = b$

$A$, $b$ can be estimated from samples $(x_{i_s}, u_{i_s}, r_{i_s}, x'_{i_s})$:

$A \leftarrow A + \phi(x_{i_s}, u_{i_s})\,\phi^\top(x_{i_s}, u_{i_s}) - \gamma\, \phi(x_{i_s}, u_{i_s})\,\phi^\top(x'_{i_s}, h(x'_{i_s}))$

$b \leftarrow b + \phi(x_{i_s}, u_{i_s})\, r_{i_s}$
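
A minimal numpy sketch of these sample-based LSTD estimates, assuming a user-supplied feature map `phi(x, u)` and policy `h(x)`; the small regularization added before solving is an extra assumption, used only to keep the estimated $A$ invertible on finite data.

```python
# Hedged sketch of LSTD-Q estimation for a fixed policy h from transition samples.
import numpy as np

def lstd_q(samples, phi, h, gamma, reg=1e-6):
    n = len(phi(samples[0][0], samples[0][1]))
    A = np.zeros((n, n))
    b = np.zeros(n)
    for (x, u, r, x_next) in samples:
        f = phi(x, u)
        f_next = phi(x_next, h(x_next))
        A += np.outer(f, f) - gamma * np.outer(f, f_next)   # A <- A + phi phi^T - gamma phi phi'^T
        b += f * r                                          # b <- b + phi r
    # small regularization (assumption) so that A is invertible on finite data
    return np.linalg.solve(A + reg * np.eye(n), b)
```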

Page 29

Least-squares policy iteration

LSPI (Lagoudakis & Parr, 2003)

repeat
   use the samples to build the estimates $A_\ell$, $b_\ell$ for $h_\ell$
   projected policy evaluation: solve $A_\ell W_\ell = b_\ell$
   policy improvement: $h_{\ell+1}(x) \leftarrow \arg\max_u \widehat{Q}_{W_\ell}(x,u)$
until finished
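
A hedged sketch of the LSPI loop, reusing the `lstd_q` estimator from the previous snippet and a discrete action set for the greedy improvement; as before, all names and interfaces are assumptions rather than the authors' implementation.

```python
# Hedged LSPI sketch: alternate LSTD evaluation (lstd_q above) with greedy improvement.
import numpy as np

def lspi(samples, phi, actions, gamma, n_iter=20):
    W = np.zeros(len(phi(samples[0][0], samples[0][1])))
    for _ in range(n_iter):
        # current greedy policy h_l(x) = argmax_u Qhat_{W_l}(x, u)
        h = lambda x, W=W: max(actions, key=lambda u: phi(x, u) @ W)
        W = lstd_q(samples, phi, h, gamma)       # projected policy evaluation
    return W
```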

Page 30

Example: LSPI for the inverted pendulum

BFs over the state space: RBFs on a $15 \times 9$ grid
Action discretization: 3 actions
Samples: 7500 transitions from random (x, discrete u) pairs

Page 31

Comparison between API and AVI

Similar theoretical behavior in general: convergence to a near-optimal region

In practice, API typically converges in fewer iterations...

...but also has larger complexity per iteration

Page 32

1 Introduction & Recap

2 Offline approximate value iteration & policy iteration
   AVI: Fuzzy Q-iteration
   AVI: Fitted Q-iteration
   API: Least-squares policy iteration

3 Online temporal-difference RL
   Classical Q-learning and SARSA
   Approximate Q-learning and SARSA

4 Optimistic planning

5 Conclusions

Page 33

Q-learning

1 Take the (model-based) Q-iteration update: $Q_{\ell+1}(x,u) \leftarrow E\{ r(x,u,x') + \gamma \max_{u'} Q_\ell(x',u') \}$

2 Use the transition sample $(x_k, u_k, r_{k+1}, x_{k+1})$ at each step $k$: $Q(x_k, u_k) \leftarrow r_{k+1} + \gamma \max_{u'} Q(x_{k+1}, u')$

– note $x_{k+1} = x'$, $r_{k+1} = r(x,u,x')$ in the deterministic case; in the stochastic case they just provide a sample of the r.h.s.

3 Make the update incremental with learning rate $\alpha_k \in (0, 1]$:
$Q(x_k, u_k) \leftarrow Q(x_k, u_k) + \alpha_k \left[ r_{k+1} + \gamma \max_{u'} Q(x_{k+1}, u') - Q(x_k, u_k) \right]$

Page 34

Temporal difference

$\left[ r_{k+1} + \gamma \max_{u'} Q(x_{k+1}, u') - Q(x_k, u_k) \right]$

is the Q-learning temporal difference: the "error" in the Bellman optimality equation for the current sample

Page 35

Q-learning algorithm

Q-learning

for every trial do
   initialize $x_0$
   repeat for each step $k$
      take action $u_k$
      measure $x_{k+1}$, receive $r_{k+1}$
      $Q(x_k, u_k) \leftarrow Q(x_k, u_k) + \alpha_k \left[ r_{k+1} + \gamma \max_{u'} Q(x_{k+1}, u') - Q(x_k, u_k) \right]$
   until trial finished
end for
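
A minimal tabular Q-learning sketch in Python. The environment interface (`reset()` returning an integer state, `step(u)` returning `(x_next, r, done)`) is an assumption for illustration, and action selection uses the ε-greedy rule detailed on the next slide, with a fixed ε for simplicity.

```python
# Hedged sketch of tabular Q-learning on a finite MDP with an assumed env interface.
import numpy as np

def q_learning(env, n_states, n_actions, gamma, n_trials=500,
               alpha=0.1, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_trials):
        x = env.reset()
        done = False
        while not done:
            # behavior policy: epsilon-greedy in the current Q (see next slide)
            if np.random.rand() < epsilon:
                u = np.random.randint(n_actions)
            else:
                u = int(np.argmax(Q[x]))
            x_next, r, done = env.step(u)
            # temporal-difference update toward the optimal value
            Q[x, u] += alpha * (r + gamma * np.max(Q[x_next]) - Q[x, u])
            x = x_next
    return Q
```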

Page 36

Exploration-exploitation tradeoff

Essential condition for convergence to $Q^*$: all $(x,u)$ pairs must be visited infinitely often

⇒ Exploration is necessary: sometimes, choose actions randomly
Exploitation of current knowledge is also necessary: sometimes, choose actions greedily

Simple solution: ε-greedy

$u_k = \begin{cases} \arg\max_u Q(x_k, u) & \text{with probability } 1 - \varepsilon_k \\ \text{a random action} & \text{with probability } \varepsilon_k \end{cases}$

with exploration probability $\varepsilon_k \in (0, 1)$ decreasing over time
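
A small illustrative helper for ε-greedy selection with a decaying exploration schedule; the schedule and its constants are arbitrary assumptions, not values from the talk.

```python
# Hedged epsilon-greedy helper with a decaying exploration schedule epsilon_k.
import numpy as np

def epsilon_k(k, eps0=1.0, decay=1e-3):
    return eps0 / (1.0 + decay * k)           # epsilon_k decreases toward 0 over time

def epsilon_greedy(Q, x, k):
    if np.random.rand() < epsilon_k(k):
        return np.random.randint(Q.shape[1])  # explore: random action
    return int(np.argmax(Q[x]))               # exploit: greedy action in current Q
```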

Page 37

SARSA

Just like Q-learning, but starting from policy evaluation: $Q^h(x,u) = E\{ r(x,u,x') + \gamma Q^h(x', h(x')) \}$

We get the temporal-difference update:

$Q(x_k, u_k) \leftarrow Q(x_k, u_k) + \alpha_k \left[ r_{k+1} + \gamma Q(x_{k+1}, u_{k+1}) - Q(x_k, u_k) \right]$

Two algorithms can be obtained:
1 Policy held fixed ⇒ online variant of policy evaluation
2 Policy based on greedy selection ⇒ SARSA

($x_k, u_k, r_{k+1}, x_{k+1}, u_{k+1}$ = State, Action, Reward, State, Action)

Page 38

SARSA Algorithm

SARSA with ε-greedy exploration

for every trial do
   initialize $x_0$
   $u_0 = \begin{cases} \arg\max_u Q(x_0, u) & \text{w.p. } 1 - \varepsilon_0 \\ \text{random} & \text{w.p. } \varepsilon_0 \end{cases}$
   repeat for each step $k$
      apply $u_k$, measure $x_{k+1}$, receive $r_{k+1}$
      $u_{k+1} = \begin{cases} \arg\max_u Q(x_{k+1}, u) & \text{w.p. } 1 - \varepsilon_{k+1} \\ \text{random} & \text{w.p. } \varepsilon_{k+1} \end{cases}$
      $Q(x_k, u_k) \leftarrow Q(x_k, u_k) + \alpha_k \left[ r_{k+1} + \gamma Q(x_{k+1}, u_{k+1}) - Q(x_k, u_k) \right]$
   until trial finished
end for
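
A compact SARSA sketch under the same assumed environment interface as the Q-learning sketch above; note that the update bootstraps with the action actually selected at the next step.

```python
# Hedged SARSA sketch with epsilon-greedy exploration (assumed env interface).
import numpy as np

def eps_greedy(Q, x, epsilon):
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[x]))

def sarsa(env, n_states, n_actions, gamma, n_trials=500, alpha=0.1, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_trials):
        x = env.reset()
        u = eps_greedy(Q, x, epsilon)
        done = False
        while not done:
            x_next, r, done = env.step(u)
            u_next = eps_greedy(Q, x_next, epsilon)
            # on-policy TD update: bootstrap with Q(x_{k+1}, u_{k+1})
            Q[x, u] += alpha * (r + gamma * Q[x_next, u_next] - Q[x, u])
            x, u = x_next, u_next
    return Q
```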

Page 39

On-policy vs. off-policy

SARSA is on-policy: $\left[ r_{k+1} + \gamma Q(x_{k+1}, u_{k+1}) - Q(x_k, u_k) \right]$
– updates toward the value of the policy currently used ⇒ the policy must converge to the greedy policy

Q-learning is off-policy: $\left[ r_{k+1} + \gamma \max_{u'} Q(x_{k+1}, u') - Q(x_k, u_k) \right]$
– updates toward the optimal value, regardless of the policy ⇒ any policy can be used to learn

Page 40

1 Introduction & Recap

2 Offline approximate value iteration & policy iteration
   AVI: Fuzzy Q-iteration
   AVI: Fitted Q-iteration
   API: Least-squares policy iteration

3 Online temporal-difference RL
   Classical Q-learning and SARSA
   Approximate Q-learning and SARSA

4 Optimistic planning

5 Conclusions

Page 41

Approximate Q-learning

Parametric approximation $\widehat{Q}_W(x,u)$

Gradient descent on the error $\left[ Q^*(x_k, u_k) - \widehat{Q}_W(x_k, u_k) \right]^2$:

$W \leftarrow W - \frac{1}{2}\alpha_k \frac{\partial}{\partial W}\left[ Q^*(x_k, u_k) - \widehat{Q}_W(x_k, u_k) \right]^2 = W + \alpha_k \frac{\partial \widehat{Q}_W(x_k, u_k)}{\partial W}\left[ Q^*(x_k, u_k) - \widehat{Q}_W(x_k, u_k) \right]$

Estimate $Q^*(x_k, u_k)$ using the Bellman optimality equation:

$W \leftarrow W + \alpha_k \frac{\partial \widehat{Q}_W(x_k, u_k)}{\partial W}\left[ r_{k+1} + \gamma \max_{u'} \widehat{Q}_W(x_{k+1}, u') - \widehat{Q}_W(x_k, u_k) \right]$

(approximate temporal difference)
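
For a linear approximator the gradient $\partial \widehat{Q}_W / \partial W$ is simply $\phi(x,u)$, so the update becomes a few lines of numpy. The feature map, discrete action set, feature count, and environment interface below are all illustrative assumptions.

```python
# Hedged sketch of approximate Q-learning with a linear approximator
# Qhat_W(x,u) = phi(x,u)^T W, for which dQhat_W/dW = phi(x,u).
import numpy as np

def approx_q_learning(env, phi, actions, n_features, gamma,
                      n_trials=200, alpha=0.01, epsilon=0.1):
    W = np.zeros(n_features)
    for _ in range(n_trials):
        x = env.reset()
        done = False
        while not done:
            if np.random.rand() < epsilon:                    # epsilon-greedy exploration
                u = actions[np.random.randint(len(actions))]
            else:
                u = max(actions, key=lambda a: phi(x, a) @ W)
            x_next, r, done = env.step(u)
            # approximate temporal difference
            td = r + gamma * max(phi(x_next, a) @ W for a in actions) - phi(x, u) @ W
            W += alpha * td * phi(x, u)                       # gradient step on W
            x = x_next
    return W
```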

Page 42

Approximate Q-learning algorithm

Approximate Q-learning, ε-greedy (Sutton & Barto, 1998)

for every trial do
   initialize $x_0$
   repeat at each step $k$
      $u_k = \begin{cases} \arg\max_u \widehat{Q}_W(x_k, u) & \text{w.p. } 1 - \varepsilon_k \\ \text{random} & \text{w.p. } \varepsilon_k \end{cases}$
      apply $u_k$, measure $x_{k+1}$, receive $r_{k+1}$
      $W \leftarrow W + \alpha_k \frac{\partial \widehat{Q}_W(x_k, u_k)}{\partial W}\left[ r_{k+1} + \gamma \max_{u'} \widehat{Q}_W(x_{k+1}, u') - \widehat{Q}_W(x_k, u_k) \right]$
   until trial finished
end for

Page 43

Approximate SARSA

Approximate SARSA, ε-greedy (Sutton & Barto, 1998)

for every trial do
   initialize $x_0$
   $u_0 = \begin{cases} \arg\max_u \widehat{Q}_W(x_0, u) & \text{w.p. } 1 - \varepsilon_0 \\ \text{random} & \text{w.p. } \varepsilon_0 \end{cases}$
   repeat at each step $k$
      apply $u_k$, measure $x_{k+1}$, receive $r_{k+1}$
      $u_{k+1} = \begin{cases} \arg\max_u \widehat{Q}_W(x_{k+1}, u) & \text{w.p. } 1 - \varepsilon_{k+1} \\ \text{random} & \text{w.p. } \varepsilon_{k+1} \end{cases}$
      $W \leftarrow W + \alpha_k \frac{\partial \widehat{Q}_W(x_k, u_k)}{\partial W}\left[ r_{k+1} + \gamma \widehat{Q}_W(x_{k+1}, u_{k+1}) - \widehat{Q}_W(x_k, u_k) \right]$
   until trial finished
end for

Page 44

Convergence

Assumptions: linear parametrization; technical conditions on the policy

Results:
1 Approximate SARSA converges
2 Approximate Q-learning converges when the policy is fixed

(Melo et al., 2008)

Page 45

Application: Robotic goalkeeper

Vision-based control: learn how to catch a ball using the video camera image
6 states, 2 actions (motor torques)

(Adam et al., 2011)

Page 46

Robot goalkeeper details

Challenge: fast learning from only a little data (~25 samples per ball shot with a 25 FPS camera)

Modifications:
Reuse transition samples ("experience replay")
Reduce dimensionality: $x = [\varphi_{\text{ball}}, \varphi_1]^\top$, $u = \tau_1$ (& control the second arm so the camera points forward)
Guide the initial learning phase with a heuristic controller ("easy missions")

Reward $r = -x^\top \begin{bmatrix} 1 & 0 \\ 0 & 0.01 \end{bmatrix} x$, discount $\gamma = 0.8$

Page 47

Demo: online RL for robot goalkeeper

Page 48

1 Introduction & Recap

2 Offline approximate value iteration & policy iteration
   AVI: Fuzzy Q-iteration
   AVI: Fitted Q-iteration
   API: Least-squares policy iteration

3 Online temporal-difference RL
   Classical Q-learning and SARSA
   Approximate Q-learning and SARSA

4 Optimistic planning

5 Conclusions

Page 49

Idea: Online, receding-horizon planning

At each step $k$:
1 Explore possible policies (e.g. action sequences) from $x_k$

2 Choose $u_k$ from the resulting information

Also a type of model-predictive control

Page 50

Optimistic planning

Optimistic planning for MDPs (Busoniu & Munos, 2011)

initialize tree with root $x_k$
repeat $n$ times
   find the optimistic partial policy
   expand the node with the largest uncertainty
output a near-optimal action
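
As a hedged illustration of the optimistic-expansion idea, here is a sketch of the simpler deterministic variant (optimistic planning for deterministic systems), not the stochastic OP-MDP algorithm of the slide; it assumes rewards normalized to [0, 1] and caller-supplied deterministic model functions `f(x, u)` and `rew(x, u)`.

```python
# Hedged sketch of optimistic planning for deterministic systems (OPD-style).
def optimistic_planning_action(x0, f, rew, actions, gamma, n_expansions=100):
    # each leaf: (state, depth, discounted return so far, first action on its path)
    leaves = [(x0, 0, 0.0, None)]
    best_value, best_action = -float("inf"), actions[0]
    for _ in range(n_expansions):
        # optimistic choice: expand the leaf with the largest upper bound
        # b(leaf) = R_d + gamma^d / (1 - gamma), valid for rewards in [0, 1]
        idx = max(range(len(leaves)),
                  key=lambda i: leaves[i][2] + gamma ** leaves[i][1] / (1 - gamma))
        x, d, R, first_u = leaves.pop(idx)
        for u in actions:
            x_next = f(x, u)
            R_next = R + gamma ** d * rew(x, u)
            fu = u if first_u is None else first_u
            leaves.append((x_next, d + 1, R_next, fu))
            if R_next > best_value:                 # track the best explored return
                best_value, best_action = R_next, fu
    return best_action                              # near-optimal action at x_k
```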

Page 51

Online treatment of HIV (simulation)

Optimistic planning vs. full treatment
Infection eventually controlled without drugs

Page 52

Outlook

Additional topics:
Better exploration-exploitation
Policy search & gradient methods
Algorithm variants & extensions

Open issues in "AI" RL:
Large-scale systems
Safe learning
Good applications
Tighter integration with "control" ADP & RL!

Page 53

Conclusion

Approximate reinforcement learning = learn how to near-optimally control unknown, nonlinear, stochastic systems

Page 54

Appendix

Books on ADP & RL (chronologically)

Bertsekas & Tsitsiklis, Neuro-Dynamic Programming, 1996.
Sutton & Barto, Reinforcement Learning: An Introduction, 1998.
Si, Barto, & Powell (eds.), Handbook of Learning and Approximate Dynamic Programming, 2004.
Bertsekas, Dynamic Programming and Optimal Control, 3rd ed., 2007.
Sigaud & Buffet (eds.), Markov Decision Processes in Artificial Intelligence, 2010.
Busoniu, Babuska, De Schutter, & Ernst, Reinforcement Learning and Dynamic Programming Using Function Approximators, 2010.
Szepesvari, Algorithms for Reinforcement Learning, 2010.
Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2nd ed., 2011.

Upcoming: Lewis & Liu (eds.), RL and ADP for Feedback Control; Wiering & Otterlo (eds.), RL: State of the Art.

Page 55

Appendix

References for this talk (in citation order)

Busoniu, Ernst, Babuska, De Schutter, Fuzzy Approximation for Convergent Model-Based Reinforcement Learning, FUZZ-IEEE 2007; extended in Automatica, 2010.
Ernst, Geurts, Wehenkel, Tree-Based Batch Mode Reinforcement Learning, JMLR, 2005.
Munos & Szepesvari, Finite Time Bounds for Fitted Value Iteration, JMLR, 2008.
Bertsekas & Tsitsiklis, Neuro-Dynamic Programming, 1996.
Lagoudakis & Parr, Least-Squares Policy Iteration, JMLR, 2003.
Sutton & Barto, Reinforcement Learning: An Introduction, 1998.
Melo, Meyn, Ribeiro, An Analysis of Reinforcement Learning with Function Approximation, ICML 2008.
Adam, Busoniu, Babuska, Experience Replay for Real-Time Reinforcement Learning Control, IEEE Trans. SMC-C, 2011.
Busoniu, Munos, De Schutter, & Babuska, Optimistic Planning for Sparsely Stochastic Systems, ADPRL 2011.