![Page 1: Optimism in the Face of Uncertainty: a Unifying approach](https://reader036.vdocument.in/reader036/viewer/2022070410/568145ef550346895db2f537/html5/thumbnails/1.jpg)
Optimism in the Face of Uncertainty:
a Unifying approach
István Szita & András Lőrincz
Eötvös Loránd University
Hungary
![Page 2: Optimism in the Face of Uncertainty: a Unifying approach](https://reader036.vdocument.in/reader036/viewer/2022070410/568145ef550346895db2f537/html5/thumbnails/2.jpg)
Outline
background
quick overview of exploration methods
construction of the new algorithm
analysis & experimental results
outlook
![Page 3: Optimism in the Face of Uncertainty: a Unifying approach](https://reader036.vdocument.in/reader036/viewer/2022070410/568145ef550346895db2f537/html5/thumbnails/3.jpg)
Background
Markov decision processes: finite, discounted (…but wait until the end of the talk)
value function-based methods: Q(x,a) values
the efficient exploration problem
![Page 4: Optimism in the Face of Uncertainty: a Unifying approach](https://reader036.vdocument.in/reader036/viewer/2022070410/568145ef550346895db2f537/html5/thumbnails/4.jpg)
Basic exploration: ε-greedy
extremely simple
sufficient for convergence in the limit for many classical methods like Q-learning, Dyna, Sarsa
…under suitable conditions
extremely inefficient
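As a minimal sketch of the rule named on this slide: with probability ε pick a uniformly random action, otherwise pick a greedy one. The function name `epsilon_greedy` and the random tie-breaking are illustrative assumptions, not something the slides specify.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index for one state, given its list of Q-values."""
    if rng.random() < epsilon:
        # explore: uniformly random action
        return rng.randrange(len(q_values))
    best = max(q_values)
    # exploit: greedy action, ties broken at random (an assumption)
    return rng.choice([a for a, q in enumerate(q_values) if q == best])
```

The random actions are spread evenly over everything, whether or not it has already been tried, which is one way to read "extremely inefficient".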
![Page 5: Optimism in the Face of Uncertainty: a Unifying approach](https://reader036.vdocument.in/reader036/viewer/2022070410/568145ef550346895db2f537/html5/thumbnails/5.jpg)
Advanced exploration
in case of uncertainty, be optimistic!
…details vary
we will use concepts from R-max, optimistic initial values, exploration bonus methods, model-based interval estimation
there are many others: Bayesian methods, UCT, delayed Q-learning, …
![Page 6: Optimism in the Face of Uncertainty: a Unifying approach](https://reader036.vdocument.in/reader036/viewer/2022070410/568145ef550346895db2f537/html5/thumbnails/6.jpg)
R-max
builds model from observations
uses an optimistic model: unknown transitions go to a “garden of Eden” (hypothetical state with max. reward)
transitions declared known after O(nVisits3) steps
+poly-time convergence
−slow in practice
(Brafman & Tennenholtz, 2001)
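A rough sketch of how such an optimistic model could be assembled from visit counts. The threshold name `m`, the array layout, and the choice to give unknown pairs reward `r_max` are assumptions made for illustration; the slide only says that unknown transitions lead to the Eden state.

```python
import numpy as np

def rmax_model(counts, reward_sums, r_max, m):
    """counts[x, a, y]: visit counts; reward_sums[x, a, y]: summed rewards.

    Returns (P, R) over n + 1 states; the last index is the Eden state.
    """
    n, k, _ = counts.shape
    P = np.zeros((n + 1, k, n + 1))
    R = np.zeros((n + 1, k))
    P[n, :, n] = 1.0                 # Eden is absorbing (assumption)
    R[n, :] = r_max
    visits = counts.sum(axis=2)
    for x in range(n):
        for a in range(k):
            if visits[x, a] >= m:    # "known": use empirical estimates
                P[x, a, :n] = counts[x, a] / visits[x, a]
                R[x, a] = reward_sums[x, a].sum() / visits[x, a]
            else:                    # "unknown": jump straight to Eden
                P[x, a, n] = 1.0
                R[x, a] = r_max
    return P, R
```

The hard known/unknown switch at `m` visits is what the later slides replace with a gradual transition.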
![Page 7: Optimism in the Face of Uncertainty: a Unifying approach](https://reader036.vdocument.in/reader036/viewer/2022070410/568145ef550346895db2f537/html5/thumbnails/7.jpg)
Optimistic initial values
set initial values high:
no extra work
usually combined with other techniques
with very high initial values, no need for additional exploration
+no extra work
−wears out slowly
−only model-free
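A small sketch of the idea with tabular Q-learning. The initial value `r_max / (1 - gamma)` is one common choice and is an assumption here, as are the helper names `init_q` and `q_update`.

```python
def init_q(n_states, n_actions, r_max, gamma):
    """Start every Q(x, a) at an upper bound on the discounted return."""
    v_max = r_max / (1.0 - gamma)
    return [[v_max] * n_actions for _ in range(n_states)]

def q_update(q, x, a, r, y, alpha, gamma):
    """Standard Q-learning update for the transition (x, a, r, y)."""
    target = r + gamma * max(q[y])
    q[x][a] += alpha * (target - q[x][a])

q = init_q(n_states=2, n_actions=2, r_max=1.0, gamma=0.5)
q_update(q, x=0, a=0, r=0.0, y=1, alpha=0.5, gamma=0.5)
# the tried action now looks worse than the untried one,
# so a purely greedy agent tries action 1 next
```

Each update removes only part of the optimism, which matches the "wears out slowly" remark.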
![Page 8: Optimism in the Face of Uncertainty: a Unifying approach](https://reader036.vdocument.in/reader036/viewer/2022070410/568145ef550346895db2f537/html5/thumbnails/8.jpg)
Exploration bonus methods
bonus reward for “interesting” states: rarely visited, large TD-error, etc.
exact size/form varies
can oscillate fervently
regular/bonus rewards accumulated in separate value functions
+can be efficient in practice
−ad-hoc method
−bonuses do not converge
(e.g. Meuleau & Bourgine, 1999; many others)
![Page 9: Optimism in the Face of Uncertainty: a Unifying approach](https://reader036.vdocument.in/reader036/viewer/2022070410/568145ef550346895db2f537/html5/thumbnails/9.jpg)
Model-based interval estimation
builds model from observations
estimates confidence intervals of state values
exploration bonus: half-widths of intervals
+poly-time convergence
−???
(Wiering, 1998; Strehl & Littman, 2006)
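For orientation, the exploration-bonus form of MBIE (MBIE-EB) is usually written as a Bellman equation with an added count-based term. The notation below ($\hat{R}$, $\hat{P}$, $n(x,a)$, and the constant $\beta$) is an assumption chosen to match the (x,a,y) convention of these slides:

$$
\tilde{Q}(x,a) = \hat{R}(x,a) + \gamma \sum_{y} \hat{P}(x,a,y)\,\max_{a'} \tilde{Q}(y,a') + \frac{\beta}{\sqrt{n(x,a)}}
$$

Here $n(x,a)$ is the number of visits to $(x,a)$, so the bonus shrinks as the pair is tried more often.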
![Page 10: Optimism in the Face of Uncertainty: a Unifying approach](https://reader036.vdocument.in/reader036/viewer/2022070410/568145ef550346895db2f537/html5/thumbnails/10.jpg)
Assembling the new algorithm
model estimation:
sum of rewards for all (x,a,y) up to t
number of visits to (x,a,y) up to t
number of visits to (x,a) up to t
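One natural way to turn these three counters into a model is by empirical averages. The symbols $C_t$, $N_t$, $\hat{P}_t$, and $\hat{R}_t$ are names assumed here for illustration:

$$
\hat{P}_t(x,a,y) = \frac{N_t(x,a,y)}{N_t(x,a)}, \qquad
\hat{R}_t(x,a,y) = \frac{C_t(x,a,y)}{N_t(x,a,y)}, \qquad
N_t(x,a) = \sum_{y} N_t(x,a,y)
$$

where $C_t(x,a,y)$ is the sum of rewards and $N_t(x,a,y)$ the number of visits for $(x,a,y)$ up to $t$; $\hat{R}_t$ is defined only where $N_t(x,a,y) > 0$.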
![Page 11: Optimism in the Face of Uncertainty: a Unifying approach](https://reader036.vdocument.in/reader036/viewer/2022070410/568145ef550346895db2f537/html5/thumbnails/11.jpg)
Assembling the new algorithm II
Optimistic initial model: a single visit to x_E from each (x,a)
really optimistic!
cf. optimistic initial values: no extra work after initialization
cf. R-max: hypothetical “Eden” state with max. reward
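A sketch of what this initialization might look like with count arrays. The array layout, the helper names, and the convention that the fictitious visit earned `r_max` are assumptions.

```python
import numpy as np

def init_optimistic_model(n_states, n_actions, r_max):
    """Counters with one fictitious visit to the Eden state from every (x, a)."""
    eden = n_states                              # extra state index for x_E
    visits = np.zeros((n_states + 1, n_actions, n_states + 1))
    reward_sums = np.zeros_like(visits)
    visits[:, :, eden] = 1.0                     # the single optimistic visit
    reward_sums[:, :, eden] = r_max              # assumed to have paid r_max
    return visits, reward_sums, eden

def eden_probability(visits, x, a, eden):
    """Estimated probability of reaching Eden from (x, a)."""
    return visits[x, a, eden] / visits[x, a].sum()
```

Before any real experience the estimate sends every (x,a) to Eden with probability 1. After N real visits it drops to 1/(N+1), so the optimism fades gradually without any further bookkeeping.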
![Page 13: Optimism in the Face of Uncertainty: a Unifying approach](https://reader036.vdocument.in/reader036/viewer/2022070410/568145ef550346895db2f537/html5/thumbnails/13.jpg)
Assembling the new algorithm III
in each step t:
a_t := greedy with respect to Q_t(x_t, ·)
perform a_t, observe next state and reward
update counters, model parameters
solve model MDP
... can be done incrementally & fast, e.g.: a few steps of value iteration; asynchronously, by prioritized sweeping
get new value function Q_{t+1}
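The loop above can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: `run_oim`, `step_fn`, the array layout, the fixed number of synchronous value-iteration sweeps, and the toy environment are all assumptions.

```python
import numpy as np

def run_oim(step_fn, n_states, n_actions, r_max, gamma, x0, n_steps, sweeps=5):
    """Always-greedy agent on an optimistically initialized model (sketch)."""
    eden = n_states                               # index of the Eden state
    n_xay = np.zeros((n_states + 1, n_actions, n_states + 1))
    c_xay = np.zeros_like(n_xay)
    n_xay[:, :, eden] = 1.0                       # one fictitious Eden visit per (x, a)
    c_xay[:, :, eden] = r_max                     # assumed to have paid r_max
    q = np.full((n_states + 1, n_actions), r_max / (1.0 - gamma))
    x = x0
    for _ in range(n_steps):
        a = int(np.argmax(q[x]))                  # greedy w.r.t. Q_t(x_t, .)
        y, r = step_fn(x, a)                      # observe next state and reward
        n_xay[x, a, y] += 1.0                     # update counters
        c_xay[x, a, y] += r
        n_xa = n_xay.sum(axis=2)                  # visits to (x, a)
        p_hat = n_xay / n_xa[:, :, None]          # estimated transitions
        r_hat = c_xay.sum(axis=2) / n_xa          # estimated mean reward of (x, a)
        for _sweep in range(sweeps):              # a few value-iteration sweeps
            q = r_hat + gamma * (p_hat @ q.max(axis=1))
        x = y
    return q, n_xay

def toy_step(x, a):
    # hypothetical 2-state MDP: action 0 stays, action 1 switches state
    y = x if a == 0 else 1 - x
    r = 1.0 if (x == 1 and a == 1) else 0.0
    return y, r

q, counts = run_oim(toy_step, 2, 2, r_max=10.0, gamma=0.9, x0=0, n_steps=10)
```

There is no exploration parameter. Untried (x,a) pairs keep the Eden value, so the greedy choice is drawn to them until real data accumulates.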
![Page 14: Optimism in the Face of Uncertainty: a Unifying approach](https://reader036.vdocument.in/reader036/viewer/2022070410/568145ef550346895db2f537/html5/thumbnails/14.jpg)
Assembling the new algorithm IV
Potential problem: Rmax is too large!
separate real/bonus rewards!
real rewards: initialize to 0, add “real” rewards
bonus rewards: initialize to 0 or Rmax, add nothing
we can use it at any time!
cf. exploration bonus methods
exploration bonus!
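One possible way to keep the two reward streams apart during planning is to back up a "real" and a "bonus" value table separately while acting greedily on their sum. The function `split_sweep` and its arguments are hypothetical, and the slides do not confirm that the split is carried out exactly this way.

```python
import numpy as np

def split_sweep(p_hat, r_real, r_bonus, q_r, q_e, gamma):
    """One value-iteration sweep on separate real (q_r) and bonus (q_e) tables.

    p_hat: (S, A, S) transition estimates; r_real, r_bonus, q_r, q_e: (S, A).
    """
    greedy = np.argmax(q_r + q_e, axis=1)         # greedy action on the sum
    idx = np.arange(q_r.shape[0])
    v_r = q_r[idx, greedy]                        # real part of the greedy value
    v_e = q_e[idx, greedy]                        # bonus part of the greedy value
    new_q_r = r_real + gamma * p_hat @ v_r
    new_q_e = r_bonus + gamma * p_hat @ v_e
    return new_q_r, new_q_e
```

The two tables always add up to an ordinary Bellman backup of the combined value, so the behaviour is unchanged while the bonus part stays visible on its own.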
![Page 16: Optimism in the Face of Uncertainty: a Unifying approach](https://reader036.vdocument.in/reader036/viewer/2022070410/568145ef550346895db2f537/html5/thumbnails/16.jpg)
Convergence results
One parameter: Rmax
for large Rmax, converges to near-optimum (with high probability)
proof is based on MBIE’s proof (and R-max, E3)
by the time the bonus becomes small ⇒ numVisits is large ⇒ model estimate is accurate
bonus is (instead of MBIE’s )
looser bound (but polynomial!)
![Page 17: Optimism in the Face of Uncertainty: a Unifying approach](https://reader036.vdocument.in/reader036/viewer/2022070410/568145ef550346895db2f537/html5/thumbnails/17.jpg)
Experimental results I
“RiverSwim”
“SixArms”
(Strehl & Littman, 2006)
![Page 18: Optimism in the Face of Uncertainty: a Unifying approach](https://reader036.vdocument.in/reader036/viewer/2022070410/568145ef550346895db2f537/html5/thumbnails/18.jpg)
Experimental results II
“Chain”
“Loop”
(Meuleau & Bourgine, 1999; Strens, 2000; Dearden, 2000)
![Page 19: Optimism in the Face of Uncertainty: a Unifying approach](https://reader036.vdocument.in/reader036/viewer/2022070410/568145ef550346895db2f537/html5/thumbnails/19.jpg)
Experimental results III
“FlagMaze”
(Meuleau & Bourgine, 1999; Strens, 2000; Dearden, 2000)
![Page 20: Optimism in the Face of Uncertainty: a Unifying approach](https://reader036.vdocument.in/reader036/viewer/2022070410/568145ef550346895db2f537/html5/thumbnails/20.jpg)
Experimental results IV
“Maze with subgoals”
(Wiering & Schmidhuber, 1998)
(figure: maze with rewards marked +500, +500, +1000)
![Page 21: Optimism in the Face of Uncertainty: a Unifying approach](https://reader036.vdocument.in/reader036/viewer/2022070410/568145ef550346895db2f537/html5/thumbnails/21.jpg)
Outlook
extension to factored MDPs: almost ready (we need benchmarks)
extension to general function approximation: in progress
![Page 22: Optimism in the Face of Uncertainty: a Unifying approach](https://reader036.vdocument.in/reader036/viewer/2022070410/568145ef550346895db2f537/html5/thumbnails/22.jpg)
Advantages of OIM
polynomial-time convergence (to near-optimum, with high probability)
convincing performance in practice
extremely simple to implement
all work done at initialization
decision making is always greedy
Matlab source code to be released soon
![Page 23: Optimism in the Face of Uncertainty: a Unifying approach](https://reader036.vdocument.in/reader036/viewer/2022070410/568145ef550346895db2f537/html5/thumbnails/23.jpg)
Thank you for your attention!
check our web pages at http://szityu.web.eotvos.elte.hu and http://inf.elte.hu/lorincz
or my reinforcement learning blog “Gimme Reward” at http://gimmereward.wordpress.com
![Page 24: Optimism in the Face of Uncertainty: a Unifying approach](https://reader036.vdocument.in/reader036/viewer/2022070410/568145ef550346895db2f537/html5/thumbnails/24.jpg)
Full pseudocode of the OIM algorithm
![Page 25: Optimism in the Face of Uncertainty: a Unifying approach](https://reader036.vdocument.in/reader036/viewer/2022070410/568145ef550346895db2f537/html5/thumbnails/25.jpg)
Exact statement of the convergence theorem