Approximate Models for Batch RL
Emma Brunskill
2/18/15
Image from David Silver
FVI / FQI
Policy iteration (PI) maintains both an explicit representation of a policy and the value of that policy.
Approximate model planners
Exact/Exhaustive Forward Search
Slide modified from David Silver
[Figure: forward search tree with max nodes over actions (a1, a2) and expectation nodes over next states (s1, s2)]
How many nodes are in an H-depth tree, as a function of state space size |S| and action space size |A|?
How many nodes in an H-depth tree (as a function of |S| and |A|)? (|S||A|)^H
Sparse Sampling: Don’t Enumerate All Next States; Instead Sample Next States s’ ~ P(s’|s,a)
Sample n next states, s_i ~ P(s’|s,a)
Compute (1/n) Σ_i V(s_i)
This converges to the expected future value: Σ_{s’} P(s’|s,a) V(s’)
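Applying this sampling step recursively gives the full sparse-sampling estimator. A minimal sketch, assuming a hypothetical generative-model interface `sample_next(s, a)` that returns one sampled `(s', r)` pair:

```python
def sparse_sampling_q(sample_next, actions, s, h, n, gamma=0.95):
    """Estimate Q(s, a) for each action by recursive sparse sampling.

    sample_next(s, a) -> (s', r): one draw from the generative model
    h: remaining horizon; n: samples per (state, action) node.
    """
    if h == 0:
        return {a: 0.0 for a in actions}
    q = {}
    for a in actions:
        total = 0.0
        for _ in range(n):  # s_i ~ P(s'|s,a)
            s2, r = sample_next(s, a)
            # V(s_i) approximated by the max over recursive Q estimates
            v = max(sparse_sampling_q(sample_next, actions,
                                      s2, h - 1, n, gamma).values())
            total += r + gamma * v
        q[a] = total / n  # (1/n) * sum_i [r_i + gamma * V(s_i)]
    return q
```

Each call fans out into n|A| recursive calls, which is where the (n|A|)^H node count on the next slide comes from.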
Sparse Sampling: how many nodes if we sample n states at each action node? (n|A|)^H, which is independent of |S|!
Upside: n can be chosen to achieve bounds on the accuracy of the value function at the root state, independent of state space size
Downside: still exponential in the horizon H, and n must still be large for good bounds
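A quick sanity check of the two node counts, with made-up sizes |S| = 1000, |A| = 4, n = 20, H = 3 (the numbers are purely illustrative):

```python
# Node counts for a depth-H lookahead: exhaustive vs. sparse sampling.
S, A, n, H = 1000, 4, 20, 3

exhaustive = (S * A) ** H   # (|S||A|)^H: enumerate every next state
sparse = (n * A) ** H       # (n|A|)^H: sample n next states per action

print(exhaustive)  # 64000000000 -- infeasible
print(sparse)      # 512000 -- independent of |S|
```

Even modest problems make the exhaustive tree hopeless, while the sparse tree stays the same size no matter how large |S| grows.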
Monte Carlo Tree Search
Combines the ideas of sparse sampling with an adaptive method for focusing on more promising parts of the tree
Here “more promising” means the actions that seem likely to yield higher long-term reward
Uses the idea of simulation search
Simple Monte Carlo Search
Estimate the value of each action by averaging the returns of simulated rollouts of a fixed rollout policy
Act by greedy improvement with respect to the fixed rollout policy
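A minimal sketch of this idea, again assuming the hypothetical `sample_next(s, a) -> (s', r)` generative-model interface (names and parameters are illustrative, not from the slides):

```python
def simple_mc_search(sample_next, actions, rollout_policy, s0,
                     n_rollouts=100, horizon=20, gamma=0.95):
    """Pick an action by simulating rollouts of a fixed rollout policy.

    For each candidate first action a, run n_rollouts simulations that
    take a and then follow rollout_policy; estimate Q(s0, a) as the mean
    return. Acting greedily on these estimates is a one-step improvement
    over the fixed rollout policy.
    """
    def rollout(s, a):
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            s, r = sample_next(s, a)
            ret += discount * r
            discount *= gamma
            a = rollout_policy(s)   # follow the fixed policy afterwards
        return ret

    q = {a: sum(rollout(s0, a) for _ in range(n_rollouts)) / n_rollouts
         for a in actions}
    return max(q, key=q.get)        # greedy w.r.t. the rollout estimates
```

Note the search is not adaptive: every action gets the same simulation budget, which is what MCTS improves on.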
Upper Confidence Tree (UCT) [Kocsis & Szepesvari, 2006]
Slide modified from Alan Fern
• Combines forward search and simulation search
• An instance of Monte Carlo Tree Search
• Repeated Monte Carlo simulation of a rollout policy
• Rollouts add one or more nodes to the search tree
• UCT uses the optimism-under-uncertainty idea
• Has some nice theoretical properties
• Much better real-time performance than sparse sampling
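The optimism-under-uncertainty idea in UCT is instantiated with the UCB1 score from bandits: at each tree node pick the action maximizing Q(s,a) + c * sqrt(ln N(s) / N(s,a)). A sketch (the constant c and the data layout are illustrative):

```python
import math

def uct_select(q, counts, c=1.4):
    """UCB1 action selection at a tree node, as used by UCT.

    q[a]: current mean-return estimate for action a
    counts[a]: visit count N(s, a); N(s) is their sum
    Untried actions (count 0) get an infinite bonus, so they go first.
    """
    n_s = sum(counts.values())
    def score(a):
        if counts[a] == 0:
            return float('inf')
        return q[a] + c * math.sqrt(math.log(n_s) / counts[a])
    return max(q, key=score)
```

Rarely-tried actions get a large exploration bonus, so simulation effort adaptively concentrates on promising actions without abandoning the rest.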
● Requires a simulator / generative model
● On each pass down the tree, follow the tree policy until reaching a leaf state where not all actions have been tried
● Then simulate, starting from that leaf state, the result of taking an untried action
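One such pass can be sketched as follows (undiscounted for simplicity; `sample_next(s, a) -> (s', r)` is the same hypothetical generative-model interface as before, and the tree layout is an assumption, not the authors' code):

```python
import math, random

def uct_pass(tree, sample_next, actions, rollout_policy, s0,
             horizon=20, c=1.4):
    """One simulation pass of UCT.

    tree: dict mapping state -> {'N': {a: visits}, 'Q': {a: mean return}}
    Selection: follow the UCB1 tree policy while all actions are tried.
    Expansion: at a node with untried actions, try one and stop descending.
    Rollout: finish the episode with the fixed rollout policy.
    Backup: update running means along the visited path.
    """
    path, rewards, s, t = [], [], s0, 0
    while t < horizon:
        if s not in tree:  # add a new leaf node to the tree
            tree[s] = {'N': {a: 0 for a in actions},
                       'Q': {a: 0.0 for a in actions}}
        node = tree[s]
        untried = [a for a in actions if node['N'][a] == 0]
        if untried:
            a = random.choice(untried)            # expansion
        else:
            n_s = sum(node['N'].values())         # UCB1 selection
            a = max(actions, key=lambda b: node['Q'][b]
                    + c * math.sqrt(math.log(n_s) / node['N'][b]))
        path.append((s, a))
        s, r = sample_next(s, a)
        rewards.append(r)
        t += 1
        if untried:
            break
    while t < horizon:                            # rollout phase
        s, r = sample_next(s, rollout_policy(s))
        rewards.append(r)
        t += 1
    # backup: each (s, a) on the path gets the return observed from there on
    g, returns = 0.0, []
    for r in reversed(rewards):
        g += r
        returns.append(g)
    returns.reverse()
    for (si, ai), gi in zip(path, returns):
        node = tree[si]
        node['N'][ai] += 1
        node['Q'][ai] += (gi - node['Q'][ai]) / node['N'][ai]
```

Repeating this pass many times from the current state, then acting greedily on the root Q estimates, is the full UCT planner.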
Slide modified from Alan Fern & David Silver
Computer Go
Previous game-tree approaches fared poorly
Monte Carlo Evaluation in Go: a planning problem, just a very, very hard one
Going Back to Batch RL...
• Use a supervised learning method to fit a model from the batch data
• Use the learned model with MCTS planning
• Note: error in the model will propagate into error in the estimated values!
• Compute an action for the current state, take that action, then replan for the next state
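Putting the pieces together, a sketch of this loop, where `fit_model`, `planner`, and the `env` interface are placeholder assumptions (e.g. a supervised regressor over the batch and UCT, respectively), not anything prescribed by the slides:

```python
def act_with_learned_model(fit_model, planner, dataset, env, n_steps=100):
    """Batch RL with a learned model plus replanning at every step.

    fit_model(dataset) -> sample_next(s, a) -> (s', r): a generative
        model fit to the batch data by supervised learning
    planner(sample_next, s) -> a: e.g. UCT run from the current state
    Note: error in the learned model propagates into the planner's
    value estimates and hence into the chosen actions.
    """
    sample_next = fit_model(dataset)   # supervised learning step (offline)
    s = env.reset()
    for _ in range(n_steps):
        a = planner(sample_next, s)    # replan from the current state
        s, r, done = env.step(a)
        if done:
            break
```

Because planning restarts from each visited state, model error only needs to be small along the trajectories the policy actually encounters.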