TRANSCRIPT
Reinforcement Learning &
Monte Carlo Planning
(Slides by Alan Fern, Dan Klein, Subbarao Kambhampati, Raj Rao,
Lisa Torrey, Dan Weld)
Learning/Planning/Acting
Main Dimensions
Model-based vs. Model-free
• Model-based: have or learn action models (i.e., transition probabilities)
  – E.g., approximate DP
• Model-free: skip the models and directly learn what action to do when (without necessarily finding out the exact model of the action)
  – E.g., Q-learning
Passive vs. Active
• Passive: assume the agent is already following a policy (so there is no action choice to be made; you just need to learn the state values, and maybe the action model)
• Active: need to learn both the optimal policy and the state values (and maybe the action model)
Main Dimensions (cont'd)
Extent of Backup
• Full DP: adjust a state's value based on the values of all its neighbors (as predicted by the transition model)
  – Can only be done when a transition model is present
• Temporal difference: adjust the value based only on the actual transitions observed
Strong or Weak Simulator
• Strong: I can jump to any part of the state space and start simulation there.
• Weak: the simulator is the real world, and I cannot teleport to a new state.
The agent does self-learning through a simulator. [Infants don't get to "simulate" the world, since they have neither T(.) nor R(.) of their world.]
We are basically doing EMPIRICAL Policy Evaluation!
But we know this will be wasteful (since it misses the correlation between values of neighboring states!)
Do DP-based policy evaluation!
Aside: Online Mean Estimation
• Suppose that we want to incrementally compute the mean of a sequence of numbers (x1, x2, x3, ….)
– E.g. to estimate the expected value of a random variable from a sequence of samples.
$$\hat{X}_{n+1} \;=\; \frac{1}{n+1}\sum_{i=1}^{n+1} x_i \;=\; \frac{1}{n+1}\Big(x_{n+1} + \sum_{i=1}^{n} x_i\Big) \;=\; \frac{1}{n+1}\,x_{n+1} + \frac{n}{n+1}\,\hat{X}_n \;=\; \hat{X}_n + \frac{1}{n+1}\big(x_{n+1} - \hat{X}_n\big)$$

(average of n+1 samples)
• Given a new sample $x_{n+1}$, the new mean is the old estimate (for n samples) plus the weighted difference between the new sample and the old estimate
$$\hat{X}_{n+1} \;=\; \hat{X}_n + \frac{1}{n+1}\big(x_{n+1} - \hat{X}_n\big)$$

(average of n+1 samples; $x_{n+1}$ is sample n+1, and $\frac{1}{n+1}$ is the learning rate)
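The incremental update is easy to check in code (a minimal sketch; the function name is mine):

```python
def online_mean(samples):
    """X_{n+1} = X_n + (x_{n+1} - X_n) / (n+1): running mean with no stored history."""
    mean = 0.0
    for n, x in enumerate(samples):
        mean += (x - mean) / (n + 1)   # learning rate 1/(n+1)
    return mean

print(online_mean([2.0, 4.0, 6.0]))  # same as (2 + 4 + 6) / 3 = 4.0
```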
Temporal Difference Learning
• TD update for transition from s to s’:
• So the update is maintaining a “mean” of the (noisy) value samples
• If the learning rate decreases appropriately with the number of samples (e.g. 1/n) then the value estimates will converge to true values! (non-trivial)
$$V(s) \leftarrow V(s) + \alpha\big(R(s) + \gamma V(s') - V(s)\big)$$

Here $\alpha$ is the learning rate, $R(s) + \gamma V(s')$ is a (noisy) sample of the value at s based on the next state s', and the left-hand side is the updated estimate. For comparison, the exact value under the policy satisfies

$$V^{\pi}(s) = R(s) + \gamma \sum_{s'} T(s, \pi(s), s')\, V^{\pi}(s')$$
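The TD update can be sketched as a short policy-evaluation loop. The `transition`/`reward` interface below is an assumption for illustration (not from the slides), and the 1/n schedule is one learning rate that decays appropriately:

```python
def td0(transition, reward, start, terminals, gamma=0.9, episodes=5000):
    """TD(0) policy evaluation: V(s) <- V(s) + alpha*(R(s) + gamma*V(s') - V(s))."""
    V, n = {}, {}
    for _ in range(episodes):
        s = start
        while s not in terminals:
            s2 = transition(s)                            # observe an actual transition
            n[s] = n.get(s, 0) + 1
            alpha = 1.0 / n[s]                            # learning rate decays as 1/n
            sample = reward(s) + gamma * V.get(s2, 0.0)   # noisy value sample
            V[s] = V.get(s, 0.0) + alpha * (sample - V.get(s, 0.0))
            s = s2
    return V

# Deterministic two-state chain: s0 -> s1 -> done, R(s0)=0, R(s1)=1
V = td0(lambda s: {"s0": "s1", "s1": "done"}[s],
        lambda s: {"s0": 0.0, "s1": 1.0}[s],
        "s0", {"done"})
# V(s1) converges to 1.0 and V(s0) to gamma * 1.0 = 0.9
```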
Early Results: Pavlov and his Dog
• Classical (Pavlovian) conditioning experiments
• Training: Bell → Food
• After: Bell → Salivate
• Conditioned stimulus (bell) predicts future reward (food)
(http://employees.csbsju.edu/tcreed/pb/pdoganim.html)
Predicting Delayed Rewards
• Reward is typically delivered at the end (when you know whether you succeeded or not)
• Time: 0 ≤ t ≤ T, with stimulus u(t) and reward r(t) at each time step t (note: r(t) can be zero at some time points)
• Key Idea: Make the output v(t) predict total expected future reward starting from time t
$$v(t) = \sum_{\tau=0}^{T-t} r(t+\tau)$$
Predicting Delayed Reward: TD Learning
Stimulus at t = 100 and reward at t = 200
Prediction error for each time step
(over many trials)
Prediction Error in the Primate Brain?
Dopaminergic cells in Ventral Tegmental Area (VTA)
Before Training
After Training
Reward Prediction error?
No error
$$\delta(t) = r(t) + v(t+1) - v(t)$$

No error once training has converged, since then $v(t) = r(t) + v(t+1)$; at times with no reward the error is $\delta(t) = 0 + v(t+1) - v(t)$.
More Evidence for Prediction Error Signals
Dopaminergic cells in VTA
Negative error
If $r(t) = 0$ and $v(t+1) = 0$ (the predicted reward is withheld), then
$$\delta(t) = r(t) + v(t+1) - v(t) = -v(t)$$
• Under certain conditions: – The environment model doesn’t change
– States and actions are finite
– Rewards are bounded
– Learning rate decays with visits to state-action pairs
• but does not decay too fast: $\sum_i \alpha_i(s,a) = \infty$, $\sum_i \alpha_i^2(s,a) < \infty$
– Exploration method would guarantee infinite visits to every state-action pair over an infinite training period
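A minimal tabular Q-learning sketch that meets these conditions: the 1/n(s,a) learning rate satisfies both sums, and ε-greedy action choice keeps visiting every state-action pair. The `step(s, a) -> (s', reward, done)` interface is an assumed simulator API, not from the slides:

```python
import random

def q_learning(step, actions, start, gamma=0.95, episodes=2000, eps=0.2):
    """Tabular Q-learning sketch with learning rate 1/n(s,a) and eps-greedy exploration."""
    Q, n = {}, {}
    for _ in range(episodes):
        s, done = start, False
        while not done:
            if random.random() < eps:
                a = random.choice(actions)                          # explore
            else:
                a = max(actions, key=lambda b: Q.get((s, b), 0.0))  # exploit
            s2, r, done = step(s, a)
            n[(s, a)] = n.get((s, a), 0) + 1
            alpha = 1.0 / n[(s, a)]   # sum of alphas diverges, sum of squares converges
            target = r if done else r + gamma * max(Q.get((s2, b), 0.0) for b in actions)
            Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
            s = s2
    return Q

# One-step toy problem: "good" pays 1, "bad" pays 0, then the episode ends
Q = q_learning(lambda s, a: ("end", 1.0 if a == "good" else 0.0, True),
               ["good", "bad"], "s")
```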
Explore/Exploit Policies
• GLIE Policy 2: Boltzmann Exploration – Select action a with probability,
– T is the temperature. Large T means that each action has about the same probability. Small T leads to more greedy behavior.
– Typically start with large T and decrease with time
$$\Pr(a \mid s) = \frac{\exp\!\big(Q(s,a)/T\big)}{\sum_{a' \in A} \exp\!\big(Q(s,a')/T\big)}$$
Model based vs. Model Free RL
• Model based
– estimate O(|S|²|A|) parameters
– requires relatively larger data for learning
– can make use of background knowledge easily
• Model free
– estimate O(|S||A|) parameters
– requires relatively less data for learning
Applications of RL
• Games: Backgammon, Solitaire, real-time strategy games
• Elevator scheduling
• Stock investment decisions
• Chemotherapy treatment decisions
• Robotics: navigation, RoboCup
  – http://www.youtube.com/watch?v=CIF2SBVY-J0
  – http://www.youtube.com/watch?v=5FGVgMsiv1s
  – http://www.youtube.com/watch?v=W_gxLKSsSIE
• Helicopter maneuvering
Learning/Planning/Acting
Planning Monte-Carlo Planning Reinforcement Learning
Monte-Carlo Planning
• Often a simulator of a planning domain is available
or can be learned from data
– Even when the domain can't be expressed in an MDP specification language
Klondike Solitaire
Fire & Emergency Response
Example Domains with Simulators
• Traffic simulators
• Robotics simulators
• Military campaign simulators
• Computer network simulators
• Emergency planning simulators
  – large-scale disaster and municipal
• Sports domains (Madden Football)
• Board games / video games
  – Go / RTS

In many cases Monte-Carlo techniques yield state-of-the-art performance, even in domains where a model-based planner is applicable.
Slot Machines as MDP?
Outline
• Uniform Sampling
– PAC Bound for Single State MDPs
– Policy Rollouts for full MDPs
• Adaptive Sampling
– UCB for Single State MDPs
– UCT for full MDPs
Single State Monte-Carlo Planning
• Suppose the MDP has a single state and k actions
  – Figure out which action has the best expected reward
  – Can sample rewards of actions using calls to the simulator
  – Sampling a is like pulling a slot machine arm with random payoff function R(s,a)
[Figure: Multi-Armed Bandit Problem: state s with arms a1 … ak and random payoffs R(s,a1) … R(s,ak)]
PAC Bandit Objective
Probably Approximately Correct (PAC)
• Select an arm that probably (with high probability, 1−δ) has approximately (i.e., within ε) the best expected reward
• Use as few simulator calls (or pulls) as possible
UniformBandit Algorithm (NaiveBandit from [Even-Dar et al., 2002])
1. Pull each arm w times (uniform pulling).
2. Return the arm with the best average reward.
How large must w be to provide a PAC guarantee?
[Figure: state s with arms a1 … ak; reward samples r11 … r1w, r21 … r2w, …, rk1 … rkw]
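The two steps can be sketched directly; `pull(i)` stands in for one simulator call to arm i (an assumed interface):

```python
import random

def uniform_bandit(pull, k, w):
    """UniformBandit: pull each of the k arms w times (uniform pulling),
    then return the index of the arm with the best average reward."""
    best_arm, best_avg = 0, float("-inf")
    for i in range(k):
        avg = sum(pull(i) for _ in range(w)) / w   # w simulator calls for arm i
        if avg > best_avg:
            best_arm, best_avg = i, avg
    return best_arm

# Toy check: three Bernoulli arms with success probabilities 0.2, 0.8, 0.5
random.seed(0)
probs = [0.2, 0.8, 0.5]
arm = uniform_bandit(lambda i: 1.0 if random.random() < probs[i] else 0.0, k=3, w=500)
```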
Aside: Additive Chernoff Bound
• Let R be a random variable with maximum absolute value Z. And let $r_i$ (for i = 1, …, w) be i.i.d. samples of R.
• The Chernoff bound bounds the probability that the average of the $r_i$ is far from E[R]:

Chernoff bound:
$$\Pr\!\left(\,\Big|E[R] - \tfrac{1}{w}\textstyle\sum_{i=1}^{w} r_i\Big| \ge \epsilon\right) \le \exp\!\left(-w\left(\frac{\epsilon}{Z}\right)^{2}\right)$$

Equivalently, with probability at least $1-\delta$ we have that
$$\Big|E[R] - \tfrac{1}{w}\textstyle\sum_{i=1}^{w} r_i\Big| \le Z\sqrt{\tfrac{1}{w}\ln\tfrac{1}{\delta}}$$
UniformBandit PAC Bound
With a bit of algebra and the Chernoff bound we get: if
$$w = \left\lceil \left(\frac{R_{\max}}{\epsilon}\right)^{2} \ln\frac{k}{\delta} \right\rceil$$
then, with probability at least $1-\delta$, for all arms simultaneously
$$\Big|\tfrac{1}{w}\textstyle\sum_{j=1}^{w} r_{ij} - E[R(s,a_i)]\Big| \le \epsilon$$

That is, the estimates of all actions are ε-accurate with probability at least 1−δ.

Thus selecting the arm with the highest estimate is approximately optimal with high probability, i.e. PAC.
# Simulator Calls for UniformBandit
Total simulator calls for PAC:
$$k \cdot w = O\!\left(\frac{k}{\epsilon^{2}}\ln\frac{k}{\delta}\right)$$
The ln(k) term can be removed with a more complex algorithm [Even-Dar et al., 2002].
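The bound translates directly into a small sample-size calculator (a sketch; the function name and the `r_max` default are my choices):

```python
import math

def uniform_bandit_pulls(k, eps, delta, r_max=1.0):
    """w = ceil((r_max/eps)^2 * ln(k/delta)) pulls per arm make every
    empirical mean eps-accurate with probability at least 1 - delta."""
    w = math.ceil((r_max / eps) ** 2 * math.log(k / delta))
    return w, k * w   # (pulls per arm, total simulator calls)

# E.g. k=10 arms, eps=0.1, delta=0.05 -> 530 pulls per arm, 5300 total
```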
Outline
• Uniform Sampling
– PAC Bound for Single State MDPs
– Policy Rollouts for full MDPs
• Adaptive Sampling
– UCB for Single State MDPs
– UCT for full MDPs
Policy Improvement via Monte-Carlo
• Now consider a multi-state MDP.
• Suppose we have a simulator and a non-optimal policy – E.g. policy could be a standard heuristic or based on intuition
• Can we somehow compute an improved policy?
52
[Figure: Base Policy + World Simulator on one side; on the other, the agent sends an action to the Real World and receives state + reward]
Policy Rollout Algorithm
1. For each a_i, run SimQ(s, a_i, π, h) w times.
2. Return the action with the best average of the SimQ results.
[Figure: from state s, w SimQ(s, a_i, π, h) trajectories per action a_i, giving samples q_i1 … q_iw; each trajectory simulates taking action a_i and then following π for h−1 steps]
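The two steps above can be sketched as follows; `sim(s, a) -> (next_state, reward)` is an assumed simulator interface, and `policy` is the base policy π:

```python
def sim_q(sim, s, a, policy, h):
    """One SimQ(s, a, pi, h) sample: take action a, then follow pi for h-1 steps."""
    total = 0.0
    for _ in range(h):
        s, r = sim(s, a)
        total += r
        a = policy(s)     # after the first step, actions come from the base policy
    return total

def rollout_action(sim, s, actions, policy, h, w):
    """Policy rollout: average w SimQ samples per action (k*h*w simulator calls),
    then return the action with the best average."""
    def avg_q(a):
        return sum(sim_q(sim, s, a, policy, h) for _ in range(w)) / w
    return max(actions, key=avg_q)

# Toy check: "right" always pays 1, "left" pays 0; the base policy always goes left
sim = lambda s, a: (s + (1 if a == "right" else -1), 1.0 if a == "right" else 0.0)
base = lambda s: "left"
act = rollout_action(sim, 0, ["left", "right"], base, h=5, w=3)
```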
Policy Rollout: # of Simulator Calls
• For each action, w calls to SimQ, each using h sim calls • Total of khw calls to the simulator
[Figure: SimQ(s, a_i, π, h) trajectories from state s; each simulates taking action a_i and then following π for h−1 steps]
Multi-Stage Rollout
[Figure: from state s, trajectories of SimQ(s, a_i, Rollout(π), h) for each action a_i]
Each step requires khw simulator calls
• Two-stage: compute the rollout policy of the rollout policy of π
• Requires (khw)² calls to the simulator for 2 stages
• In general, exponential in the number of stages
Rollout Summary
We often are able to write simple, mediocre policies:
• Network routing policy
• Compiler instruction scheduling
• Policy for the card game of Hearts
• Policy for the game of Backgammon
• Solitaire playing policy
• Game of Go
• Combinatorial optimization
Policy rollout is a general and easy way to improve upon such policies
Often observe substantial improvement!
Example: Rollout for Thoughtful Solitaire [Yan et al. NIPS’04]
| Player | Success Rate | Time/Game |
|---|---|---|
| Human Expert | 36.6% | 20 min |
| (naïve) Base Policy | 13.05% | 0.021 sec |
| 1 rollout | 31.20% | 0.67 sec |
| 2 rollout | 47.6% | 7.13 sec |
| 3 rollout | 56.83% | 1.5 min |
| 4 rollout | 60.51% | 18 min |
| 5 rollout | 70.20% | 1 hour 45 min |
Deeper rollout can pay off, but is expensive
Outline
• Uniform Sampling
– PAC Bound for Single State MDPs
– Policy Rollouts for full MDPs
• Adaptive Sampling
– UCB for Single State MDPs
– UCT for full MDPs
Non-Adaptive Monte-Carlo
What is the issue with uniform sampling? Time is wasted equally on all actions, with no early focusing away from suboptimal actions.

Policy rollouts devote equal resources to each state encountered in the tree. We would like to focus on the most promising parts of the tree. But how do we control exploration of new parts of the tree?
Regret Minimization Bandit Objective
Problem: find arm-pulling strategy such that the expected total reward at time n is close to the best possible (i.e. pulling the best arm always)
UniformBandit is a poor choice: it wastes time on bad arms
Must balance exploring machines to find good payoffs and exploiting current knowledge
UCB Adaptive Bandit Algorithm [Auer, Cesa-Bianchi, & Fischer, 2002]
• Q(a) : average payoff for action a based on current experience
• n(a) : number of pulls of arm a
• Action choice by UCB after n pulls:
$$a^{*} = \arg\max_{a}\; Q(a) + \sqrt{\frac{2\ln n}{n(a)}}$$
Assumes payoffs in [0,1]
Value Term: favors actions that looked good historically
Exploration Term: actions get an exploration bonus that grows with ln(n)
Doesn’t waste much time on sub-optimal arms unlike uniform!
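A sketch of the UCB rule driving a bandit loop; the incremental-mean update is the same one used for online mean estimation earlier:

```python
import math

def ucb_choose(Q, n, t):
    """UCB rule: argmax_a Q(a) + sqrt(2 ln t / n(a)); unpulled arms go first."""
    def score(a):
        if n[a] == 0:
            return float("inf")      # exploration bonus is infinite before any pull
        return Q[a] + math.sqrt(2.0 * math.log(t) / n[a])
    return max(Q, key=score)

def ucb_bandit(pull, k, pulls):
    Q = {a: 0.0 for a in range(k)}   # average payoff per arm
    n = {a: 0 for a in range(k)}     # pull counts
    for t in range(1, pulls + 1):
        a = ucb_choose(Q, n, t)
        r = pull(a)
        n[a] += 1
        Q[a] += (r - Q[a]) / n[a]    # online mean update
    return Q, n

# Deterministic toy: arm 1 always pays 0.9, arm 0 pays 0.1
Q, n = ucb_bandit(lambda a: 0.9 if a == 1 else 0.1, k=2, pulls=200)
```

After 200 pulls the good arm has been pulled far more often, while the bad arm still gets the occasional exploratory pull from its ln(t) bonus.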
UCB Algorithm [Auer, Cesa-Bianchi, & Fischer, 2002]
$$a^{*} = \arg\max_{a}\; Q(a) + \sqrt{\frac{2\ln n}{n(a)}}$$

Theorem: the expected number of pulls of a sub-optimal arm a after n pulls is bounded by
$$\frac{8}{\Delta_a^{2}}\ln n$$
where $\Delta_a$ is the regret (gap) of arm a.
Hence, the expected regret after n arm pulls compared to optimal behavior is bounded by O(log n)
No algorithm can achieve a better loss rate
Outline
• Uniform Sampling
– PAC Bound for Single State MDPs
– Policy Rollouts for full MDPs
• Adaptive Sampling
– UCB for Single State MDPs
– UCT for full MDPs
UCB-Based Policy Rollout
• Allocate samples non-uniformly
– based on UCB action selection
– More sample-efficient than uniform policy rollout
– Still suboptimal.
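A minimal sketch of this idea: spend the simulation budget at the root using UCB rather than uniformly. The `simulate` hook and all names here are hypothetical, standing in for one rollout of the fixed policy after a candidate root action:

```python
import math, random

def ucb_rollout(state, actions, simulate, budget=200, seed=0):
    """Allocate rollout simulations over root actions with UCB instead of
    uniformly. simulate(state, action, rng) runs one rollout of a fixed
    policy after taking `action` and returns a total reward in [0, 1].
    Returns the action with the best average simulated payoff."""
    rng = random.Random(seed)
    Q = {a: 0.0 for a in actions}
    n = {a: 0 for a in actions}
    for t in range(1, budget + 1):
        untried = [a for a in actions if n[a] == 0]
        if untried:                       # try each action at least once
            a = untried[0]
        else:                             # UCB allocation of the budget
            a = max(actions, key=lambda a: Q[a] +
                    math.sqrt(2 * math.log(t) / n[a]))
        r = simulate(state, a, rng)
        n[a] += 1
        Q[a] += (r - Q[a]) / n[a]         # incremental average
    return max(Q, key=Q.get)              # final choice: best empirical mean
```

With the same budget, most simulations end up on the empirically good actions, which is why this is more sample-efficient than uniform policy rollout.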
UCT Algorithm [Kocsis & Szepesvari, 2006]
• Instance of Monte-Carlo Tree Search
– Applies the principle of UCB
– Some nice theoretical properties
– Better than policy rollouts (asymptotically optimal)
– Major advance in computer Go
• Monte-Carlo Tree Search
– Repeated Monte Carlo simulation of a rollout policy
– Each rollout adds one or more nodes to the search tree
– Rollout policy depends on nodes already in tree
Current World State
At a leaf node perform a random rollout
Initially tree is single leaf
[Diagram: from the current world state, action a1 is simulated; the rollout policy reaches a terminal state with reward 1.]
[Diagram: the terminal reward 1 is backed up along the simulated path.]
Must select each action at a node at least once.
[Diagram: a second rollout takes action a2; the rollout policy reaches a terminal state with reward 0.]
[Diagram: the 0 return is backed up; the root value estimate becomes 1/2.]
When all node actions tried once, select action according to tree policy.
[Diagram: with both root actions tried, the tree policy now selects which action to simulate next.]
[Diagram: the tree policy descends the tree; a random rollout from the resulting leaf returns reward 0.]
[Diagram: the 0 return is backed up; value estimates along the selected path become 1/2 and 1/3.]
What is an appropriate tree policy? Rollout policy?
UCT Algorithm [Kocsis & Szepesvari, 2006]
• Basic UCT uses a random rollout policy
• Tree policy is based on UCB:
– Q(s,a) : average reward received in current trajectories after taking action a in state s
– n(s,a) : number of times action a taken in s
– n(s) : number of times state s encountered

$\pi_{UCT}(s) = \arg\max_{a}\left[\, Q(s,a) + c\sqrt{\tfrac{\ln n(s)}{n(s,a)}} \,\right]$

c is a theoretical constant that must be selected empirically in practice
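The tree policy on this slide can be sketched as a small function (the `node` structure used here is an illustrative assumption, not from the slides):

```python
import math

def uct_choose(node, c=1.4):
    """Tree policy: argmax_a Q(s,a) + c * sqrt(ln n(s) / n(s,a)).

    node is a dict with keys 'Q' (action -> average reward),
    'n_sa' (action -> visit count), and 'n' (total visits to the state).
    Untried actions are selected first, as UCT requires.
    """
    untried = [a for a in node['Q'] if node['n_sa'][a] == 0]
    if untried:
        return untried[0]
    return max(node['Q'], key=lambda a: node['Q'][a] +
               c * math.sqrt(math.log(node['n']) / node['n_sa'][a]))
```

Larger values of c push the search toward exploration; smaller values make it greedier with respect to the current Q estimates, which is exactly why c must be tuned per domain.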
[Diagram: at the root, the tree policy compares actions a1 and a2 using the UCB rule
$\pi_{UCT}(s) = \arg\max_{a}\left[\, Q(s,a) + c\sqrt{\tfrac{\ln n(s)}{n(s,a)}} \,\right]$]
[Diagram: the same UCB rule is applied at each node encountered on the way down the tree.]
UCT Recap
• To select an action at a state s:
– Build a tree using N iterations of Monte-Carlo tree search
• Default (rollout) policy is uniform random
• Tree policy is based on the UCB rule
– Select the action that maximizes Q(s,a) (note that this final action selection does not take the exploration term into account, just the Q-value estimate)
• The more simulations, the more accurate the estimates
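Putting the recap together, here is a compact sketch of basic UCT for a small MDP. The `actions`/`step` interface, the discount factor, and the simplification that every visited state gets a tree node (rather than one new node per rollout) are assumptions of this sketch, not part of the original algorithm statement:

```python
import math, random

class Node:
    """Per-state statistics: visit counts and average returns per action."""
    def __init__(self, acts):
        self.n = 0
        self.n_sa = {a: 0 for a in acts}
        self.Q = {a: 0.0 for a in acts}

def uct_search(root, actions, step, n_iters=500, c=1.4, horizon=10,
               gamma=0.95, seed=0):
    """actions(s) -> list of actions; step(s, a) -> (next_state, reward, done).
    Rollout policy is uniform random, as in basic UCT."""
    rng = random.Random(seed)
    tree = {}
    for _ in range(n_iters):
        s, traj, done = root, [], False
        for _ in range(horizon):
            if done:
                break
            node = tree.get(s)
            if node is None:
                node = tree[s] = Node(actions(s))
            untried = [a for a in node.Q if node.n_sa[a] == 0]
            if untried:                      # each action tried at least once
                a = rng.choice(untried)
            else:                            # UCB-based tree policy
                a = max(node.Q, key=lambda a: node.Q[a] +
                        c * math.sqrt(math.log(node.n) / node.n_sa[a]))
            s2, r, done = step(s, a)
            traj.append((s, a, r))
            s = s2
        # Back up the discounted return-to-go along the trajectory
        G = 0.0
        for (si, ai, ri) in reversed(traj):
            G = ri + gamma * G
            node = tree[si]
            node.n += 1
            node.n_sa[ai] += 1
            node.Q[ai] += (G - node.Q[ai]) / node.n_sa[ai]
    # Final action selection uses only the Q estimates (no exploration term)
    root_node = tree[root]
    return max(root_node.Q, key=root_node.Q.get)
```

On a toy chain MDP where only moving right ever reaches the reward, the search settles on the right action at the root.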
Computer Go
“Task Par Excellence for AI” (Hans Berliner)
“New Drosophila of AI” (John McCarthy)
“Grand Challenge Task” (David Mechner)
9x9 (smallest board) 19x19 (largest board)
Game of Go
Human champions refuse to compete against computers because the software is too bad.

| | Chess | Go |
| --- | --- | --- |
| Size of board | 8 x 8 | 19 x 19 |
| Average no. of moves per game | 100 | 300 |
| Avg branching factor per turn | 35 | 235 |
| Additional complexity | | Players can pass |
A Brief History of Computer Go
2005: Computer Go is impossible!
2006: UCT invented and applied to 9x9 Go (Kocsis, Szepesvari; Gelly et al.)
2007: Human master level achieved at 9x9 Go (Gelly, Silver; Coulom)
2008: Human grandmaster level achieved at 9x9 Go (Teytaud et al.)
Elo rating: 1800 → 2600
Other Successes
Klondike Solitaire (wins 40% of games)
General Game Playing Competition
Real-Time Strategy Games
Combinatorial Optimization
Probabilistic Planning (MDPs)
These usually extend UCT in some way
Improvements/Issues
• Use domain knowledge to improve the base policies
– e.g., don't choose obviously stupid actions
– A better rollout policy does not always imply better UCT performance
• Learn a heuristic function to evaluate positions
– Use heuristic to initialize leaves
• Interesting question: UCT versus minimax
Generalization in Learning
Task Hierarchy: MAXQ Decomposition [Dietterich’00]
Root
Take Give Navigate(loc)
Deliver Fetch
Extend-arm Extend-arm Grab Release
MoveE MoveW MoveS MoveN
Children of a task are unordered
Summary
• Multi-armed Bandits – Principles of both RL and Monte-Carlo
• Reinforcement Learning – Exploration/Exploitation tradeoff
– Passive/Active RL
– Model free/Model based
• Monte-Carlo Planning – Exploration/Exploitation tradeoff
– Uniform/Adaptive Sampling