Monte-Carlo Tree Search
Alan Fern
Introduction
- Rollout does not guarantee optimality or near-optimality
  - It only guarantees policy improvement (under certain conditions)
- Theoretical question: Can we develop Monte-Carlo methods that give us near-optimal policies?
  - With computation that does NOT depend on the number of states!
  - This was an open theoretical question until the late 90s.
- Practical question: Can we develop Monte-Carlo methods that improve smoothly and quickly with more computation time?
Look-Ahead Trees
- Rollout can be viewed as performing one level of search on top of the base policy
- In deterministic games and search problems it is common to build a look-ahead tree at a state to select the best action
  - Can we generalize this to general stochastic MDPs?
- Sparse Sampling is one such algorithm
  - Strong theoretical guarantees of near-optimality
[Figure: one-level look-ahead tree over actions a1, a2, ..., ak. Callout: "Maybe we should search for multiple levels."]
Online Planning with Look-Ahead Trees
- At each state we encounter in the environment, we build a look-ahead tree of depth h and use it to estimate the optimal Q-value of each action
  - Select the action with the highest Q-value estimate
- s = current state
- Repeat
  - T = BuildLookAheadTree(s)   ;; sparse sampling or UCT
                                ;; tree provides Q-value estimates for the root actions
  - a = BestRootAction(T)       ;; action with best Q-value
  - Execute action a in the environment
  - s is the resulting state
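The loop above can be sketched in Python as follows. This is only an illustration under stated assumptions: `build_q_estimates` is a hypothetical stand-in for BuildLookAheadTree plus reading off the root Q-values, and `toy_step` / `toy_q` are a made-up environment, not something from the slides.

```python
def online_planning(s, build_q_estimates, env_step, num_steps):
    """Replan at every visited state and act greedily on the root Q estimates."""
    total_reward = 0.0
    for _ in range(num_steps):
        q = build_q_estimates(s)   # stands in for T = BuildLookAheadTree(s)
        a = max(q, key=q.get)      # stands in for a = BestRootAction(T)
        s, r = env_step(s, a)      # execute a; s becomes the resulting state
        total_reward += r
    return s, total_reward


# Hypothetical toy environment: a walk on the integers, reward 1 for moving right.
def toy_step(s, a):
    return (s + 1, 1.0) if a == "right" else (s - 1, 0.0)


def toy_q(s):
    # One-step look-ahead that queries the simulator for each action.
    return {a: toy_step(s, a)[1] for a in ("left", "right")}


final_state, total = online_planning(0, toy_q, toy_step, num_steps=3)
```

The tree is rebuilt from scratch at every step; nothing is carried over between calls in this sketch.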
Planning with Look-Ahead Trees
[Figure: the real-world state/action sequence runs along the top. At each visited state s a look-ahead tree is built over actions a1, a2, with sampled next states s11 ... s1w under a1 and s21 ... s2w under a2, and leaf rewards R(s11,a1), R(s11,a2), ..., R(s2w,a1), R(s2w,a2).]
Expectimax Tree (depth 1)
Alternate max & expectation
- After taking each action there is a distribution over next states (nature's moves)
- The value of an action depends on the immediate reward and the weighted value of those states
[Figure: root state s is a max node (max over actions a, b). Each action leads to an expectation node (weighted average over states) with children s1, s2, ..., sn-1, sn, weighted by their probability of occurring after taking that action in s.]
Expectimax Tree (depth 2)
Alternate max & expectation
- Max nodes: value equal to the max of the values of the child expectation nodes
- Expectation nodes: value is the weighted average of the values of its children
- Compute values from the bottom up (leaf values = 0). Select the root action with the best value.
[Figure: depth-2 tree rooted at state s. Max nodes (max over actions a, b) alternate with expectation nodes (weighted average over states).]
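The bottom-up computation can be sketched as a short recursion. The function names, the explicit transition lists, and the toy two-state MDP are assumptions made for illustration. The discount `beta` is borrowed from the sparse sampling pseudocode later in the deck and defaults to 1.

```python
def expectimax(s, h, actions, transitions, reward, beta=1.0):
    """Return (value, best_action) for state s with h steps to go."""
    if h == 0:
        return 0.0, None                      # leaf values = 0
    best_value, best_action = float("-inf"), None
    for a in actions(s):                      # max node: max over actions
        q = reward(s, a)
        for prob, s_next in transitions(s, a):  # expectation node: weighted average
            v_next, _ = expectimax(s_next, h - 1, actions, transitions, reward, beta)
            q += prob * beta * v_next
        if q > best_value:
            best_value, best_action = q, a
    return best_value, best_action


# Hypothetical toy MDP with states 0 and 1 and actions "a" and "b".
def toy_actions(s):
    return ["a", "b"]


def toy_transitions(s, a):
    return [(0.5, 0), (0.5, 1)] if a == "a" else [(1.0, 0)]


def toy_reward(s, a):
    if a == "b":
        return 0.4
    return 1.0 if s == 1 else 0.0


value, action = expectimax(0, 2, toy_actions, toy_transitions, toy_reward)
```

The inner loop enumerates every next state, which is why the exact tree scales with the size of the state space.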
Exact Expectimax Tree
Alternate max & expectation
- In general we can grow the tree to any horizon H
- Size depends on the size of the state space. Bad!
[Figure: tree of horizon H alternating Max and Exp levels, with k = # actions and n = # states, giving (kn)^H leaves. The root max node is labeled V*(s,H) and its expectation children Q*(s,a,H).]
Sparse Sampling Tree
- Replace the expectation with an average over w samples
- w will typically be much smaller than n
[Figure: the same tree of horizon H with k = # actions, root V*(s,H) and children Q*(s,a,H). Each Exp node now has only w sampled children (sampling width w) instead of n, giving (kw)^H leaves rather than (kn)^H. An overlaid version of the figure labels the sampling width C and the sampling depth Hs, with (kC)^Hs leaves.]
Sparse Sampling [Kearns et al. 2002]
SparseSampleTree(s,h,w)
  If h == 0 Then Return [0, null]
  For each action a in s
    Q*(s,a,h) = 0
    For i = 1 to w
      Simulate taking a in s resulting in si and reward ri
      [V*(si,h-1), a*] = SparseSampleTree(si,h-1,w)
      Q*(s,a,h) = Q*(s,a,h) + ri + β V*(si,h-1)
    Q*(s,a,h) = Q*(s,a,h) / w      ;; estimate of Q*(s,a,h)
  V*(s,h) = max_a Q*(s,a,h)        ;; estimate of V*(s,h)
  a* = argmax_a Q*(s,a,h)
  Return [V*(s,h), a*]
The Sparse Sampling algorithm computes the root value via depth-first expansion.
It returns the value estimate V*(s,h) of state s and the estimated optimal action a*.
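A minimal Python sketch of the same recursion, assuming a generative model `simulate(s, a)` that returns a sampled next state and reward. The function names and the single-state toy problem are hypothetical and only serve to make the block runnable.

```python
import random


def sparse_sample(s, h, w, actions, simulate, beta=1.0):
    """Return (value_estimate, best_action) for state s with h steps to go."""
    if h == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in actions(s):
        total = 0.0
        for _ in range(w):                      # average over w sampled outcomes
            s_next, r = simulate(s, a)
            v_next, _ = sparse_sample(s_next, h - 1, w, actions, simulate, beta)
            total += r + beta * v_next
        q = total / w                           # estimate of Q*(s,a,h)
        if q > best_value:
            best_value, best_action = q, a
    return best_value, best_action


# Hypothetical toy problem: one state; "good" always pays 1, "bad" pays 0 or 0.5.
def toy_actions(s):
    return ["good", "bad"]


def toy_simulate(s, a):
    if a == "good":
        return s, 1.0
    return s, random.choice([0.0, 0.5])


value, action = sparse_sample(0, h=2, w=2, actions=toy_actions, simulate=toy_simulate)
```

The number of simulator calls grows as (kw)^h but never depends on how many states exist.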
SparseSample(h=2,w=2)
Alternate max & averaging
- Max nodes: value equal to the max of the values of the child average nodes
- Average nodes: value is the average of the values of the w sampled children
- Compute values from the bottom up (leaf values = 0). Select the root action with the best value.
[Figure, built up over several slides: root state s with actions a and b. For action a, the two recursive SparseSample(h=1,w=2) calls return 10 and 0, so a's value is their average, 5. For action b, the two calls return 4 and 2, so b's value is 3.]
Select action 'a' since its value is larger than b's.
Sparse Sampling (Cont'd)
- For a given desired accuracy, how large should the sampling width and depth be?
  - Answered: Kearns, Mansour, and Ng (1999)
- Good news: gives values for w and H to achieve a policy arbitrarily close to optimal
  - Values are independent of state-space size!
  - First near-optimal general MDP planning algorithm whose runtime didn't depend on the size of the state space
- Bad news: the theoretical values are typically still intractably large, and also exponential in H
  - Exponential in H is the best we can do in general
  - In practice: use a small H and use a heuristic at the leaves
Sparse Sampling (Cont'd)
- In practice: use a small H and evaluate leaves with a heuristic
  - For example, if we have a policy, leaves could be evaluated by estimating the policy value (i.e. average reward across runs of the policy)
[Figure: shallow look-ahead tree over actions a1, a2 whose leaves are evaluated by policy simulations.]
Uniform vs. Adaptive Tree Search
- Sparse sampling wastes time on bad parts of the tree
  - Devotes equal resources to each state encountered in the tree
  - Would like to focus on the most promising parts of the tree
- But how to control exploration of new parts of the tree vs. exploiting promising parts?
- Need an adaptive bandit algorithm that explores more effectively
What now?
- Adaptive Monte-Carlo Tree Search
  - UCB Bandit Algorithm
  - UCT Monte-Carlo Tree Search
Bandits: Cumulative Regret Objective
[Figure: a single state s with arms (actions) a1, a2, ..., ak.]
- Problem: find an arm-pulling strategy such that the expected total reward at time n is close to the best possible (one pull per time step)
  - Optimal (in expectation) is to pull the optimal arm n times
  - UniformBandit is a poor choice: it wastes time on bad arms
  - Must balance exploring machines to find good payoffs and exploiting current knowledge
Cumulative Regret Objective
- Theoretical results are often about the "expected cumulative regret" of an arm-pulling strategy.
- Protocol: at time step n the algorithm picks an arm a_n based on what it has seen so far and receives reward r_n (a_n and r_n are random variables).
- Expected cumulative regret E[Reg_n]: the difference between the optimal expected cumulative reward and the expected cumulative reward of our strategy at time n
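Written out as a formula, the definition above could read as follows. The symbols are assumptions introduced here, not taken from the slides: \mu_a is the expected reward of arm a, \mu^* is the best such value, and r_t is the reward received at step t.

```latex
\mathbb{E}[\mathrm{Reg}_n]
  \;=\; n\,\mu^{*} \;-\; \mathbb{E}\!\left[\sum_{t=1}^{n} r_t\right],
\qquad \mu^{*} = \max_{a} \mu_a .
```

The first term is what pulling the optimal arm n times would earn in expectation; the second is what the strategy actually earns.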
UCB Algorithm for Minimizing Cumulative Regret
- Q(a): average reward for trying action a (in our single state s) so far
- n(a): number of pulls of arm a so far
- Action choice by UCB after n pulls:
  $$ a_n = \arg\max_a \left[ Q(a) + \sqrt{\frac{2 \ln n}{n(a)}} \right] $$
- Assumes rewards in [0,1]. We can always normalize if we know the max value.
Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2), 235-256.
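A small sketch of the UCB rule as a loop, under a few assumptions that are not on the slide: each arm is pulled once before the formula is applied (so n(a) is never zero), `pull` is a hypothetical callback returning a reward in [0,1], and the two fixed-payoff arms in the usage line are made up to keep the run deterministic.

```python
import math


def ucb1(pull, k, n_pulls):
    """Run the UCB rule for n_pulls steps over k arms; return (counts, means)."""
    counts = [0] * k          # n(a)
    means = [0.0] * k         # Q(a)
    for _ in range(n_pulls):
        n = sum(counts)       # total pulls so far
        untried = [a for a in range(k) if counts[a] == 0]
        if untried:
            a = untried[0]    # assumption: try every arm once first
        else:
            a = max(range(k),
                    key=lambda i: means[i] + math.sqrt(2.0 * math.log(n) / counts[i]))
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]   # incremental average
    return counts, means


# Hypothetical arms with fixed payoffs 0.1 and 0.9.
counts, means = ucb1(lambda a: (0.1, 0.9)[a], k=2, n_pulls=1000)
```

With real stochastic arms, `pull` would draw a random reward; the selection rule itself stays the same.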
UCB: Bounded Sub-Optimality
$$ a_n = \arg\max_a \left[ Q(a) + \sqrt{\frac{2 \ln n}{n(a)}} \right] $$
- Value term Q(a): favors actions that looked good historically
- Exploration term \sqrt{2 \ln n / n(a)}: actions get an exploration bonus that grows with ln(n)
- Expected number of pulls of sub-optimal arm a is bounded by
  $$ \frac{8}{\Delta_a^2} \ln n $$
  where \Delta_a is the sub-optimality of arm a
- Doesn't waste much time on sub-optimal arms, unlike uniform!
UCB Performance Guarantee [Auer, Cesa-Bianchi, & Fischer, 2002]
- Theorem: The expected cumulative regret of UCB after n arm pulls is bounded by O(log n)
- Is this good?
  - Yes. The average per-step regret is O(log n / n), which goes to zero as n grows
- Theorem: No algorithm can achieve a better expected regret (up to constant factors)
What now?
- Adaptive Monte-Carlo Tree Search
  - UCB Bandit Algorithm
  - UCT Monte-Carlo Tree Search
UCT Algorithm [Kocsis & Szepesvari, 2006]
- UCT is an instance of Monte-Carlo Tree Search
  - Applies the principle of UCB
  - Similar theoretical properties to sparse sampling
  - Much better anytime behavior than sparse sampling
- Famous for yielding a major advance in computer Go
- A growing number of success stories
Monte-Carlo Tree Search
- Builds a sparse look-ahead tree rooted at the current state by repeated Monte-Carlo simulation of a "rollout policy"
- During construction each tree node s stores:
  - state-visitation count n(s)
  - action counts n(s,a)
  - action values Q(s,a)
- Repeat until time is up
  1. Execute the rollout policy starting from the root until the horizon (generates a state-action-reward trajectory)
  2. Add the first node not in the current tree to the tree
  3. Update the statistics of each tree node s on the trajectory
     - Increment n(s) and n(s,a) for the selected action a
     - Update Q(s,a) by the total reward observed after the node
What is the rollout policy?
Rollout Policies
- Monte-Carlo Tree Search algorithms mainly differ in their choice of rollout policy
- Rollout policies have two distinct phases
  - Tree policy: selects actions at nodes already in the tree (each action must be selected at least once)
  - Default policy: selects actions after leaving the tree
- Key idea: the tree policy can use statistics collected from previous trajectories to intelligently expand the tree in the most promising direction
  - Rather than uniformly exploring actions at each node
Example MCTS
- For illustration purposes we will assume the MDP is:
  - Deterministic
  - Only non-zero rewards are at terminal/leaf nodes
- The algorithm is well-defined without these assumptions
[Figure sequence: worked MCTS example on a tree rooted at the current world state. Assume all non-zero reward occurs at terminal nodes.]
- Iteration 1: Initially the tree is a single leaf. At a leaf node the tree policy selects a random action, then executes the default policy. The default policy reaches a terminal with reward = 1, and a new tree node is added. Node labels shown: 1, 1, 1.
- Iteration 2: Must select each action at a node at least once. The default policy reaches a terminal with reward = 0, and a new tree node is added. Node labels shown: 1, 1, 0.
- Iteration 3: Must select each action at a node at least once. When all node actions have been tried once, select an action according to the tree policy. The tree policy descends into the tree, the default policy is executed from there, and a new tree node is added. Node labels shown: 1, 1/2, 0, 0.
- Iteration 4: When all node actions have been tried once, select an action according to the tree policy. Node labels shown: 1, 1/2, 1/3, 0, 0.
- Final frame: node labels 2/3, 2/3, 0, 0, 1.
What is an appropriate tree policy? Default policy?
UCT Algorithm [Kocsis & Szepesvari, 2006]
- Basic UCT uses a random default policy
  - In practice often use a hand-coded or learned policy
- Tree policy is based on UCB:
  - Q(s,a): average reward received in trajectories so far after taking action a in state s
  - n(s,a): number of times action a was taken in s
  - n(s): number of times state s was encountered
  $$ \pi_{UCT}(s) = \arg\max_a \left[ Q(s,a) + K \sqrt{\frac{\ln n(s)}{n(s,a)}} \right] $$
- K is a theoretical constant that must be selected empirically in practice
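As a sketch, the UCT tree policy at a single node might be coded like this. The dictionary layout and the function name `uct_action` are assumptions for illustration; untried actions (count 0) are returned first, matching the rule that each action must be selected at least once.

```python
import math


def uct_action(q, n_sa, n_s, K):
    """Pick an action at one tree node.

    q    : dict action -> Q(s,a)
    n_sa : dict action -> n(s,a)
    n_s  : visit count n(s) of the node
    K    : exploration constant
    """
    untried = [a for a in q if n_sa[a] == 0]
    if untried:
        return untried[0]
    return max(q, key=lambda a: q[a] + K * math.sqrt(math.log(n_s) / n_sa[a]))


# Hypothetical node statistics: a1 looks better, a2 has been tried less often.
q = {"a1": 0.5, "a2": 0.0}
n_sa = {"a1": 2, "a2": 1}
greedy_choice = uct_action(q, n_sa, n_s=3, K=0.0)
exploring_choice = uct_action(q, n_sa, n_s=3, K=5.0)
```

With K = 0 the rule is purely greedy; a larger K shifts the choice toward the less-visited action.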
[Figure sequence: the worked example continues with the UCT tree policy. When all node actions have been tried once, actions a1, a2 at each tree node are selected according to
$$ \pi_{UCT}(s) = \arg\max_a \left[ Q(s,a) + K \sqrt{\frac{\ln n(s)}{n(s,a)}} \right] $$
Node labels shown across the frames: first 1, 1/2, 1/3, 0, 0; then 1, 1/3, 1/4, 0, 0/1, 0 after a rollout returning 0; then 1, 1/3, 2/5, 1/2, 0/1, 0, 1 after a rollout returning 1.]
UCT Recap
- To select an action at a state s
  - Build a tree using N iterations of Monte-Carlo tree search
    - Default policy is uniform random
    - Tree policy is based on the UCB rule
  - Select the action that maximizes Q(s,a) (note that this final action selection does not take the exploration term into account, just the Q-value estimate)
- The more simulations, the more accurate
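Putting the recap together, a compact UCT loop could look like the sketch below. It is one possible reading of the slides rather than a reference implementation. It assumes hashable states, a non-terminal root, undiscounted returns, and a generative model `step(s, a, rng)`; the `Node` class, all function names, and the two-move toy game are hypothetical.

```python
import math
import random


class Node:
    """Statistics stored at a tree node: n(s), n(s,a), Q(s,a)."""
    def __init__(self):
        self.n = 0
        self.n_sa = {}
        self.q = {}


def uct_search(root, actions, step, is_terminal, n_iters, horizon, K, rng):
    tree = {root: Node()}                       # state -> Node
    for _ in range(n_iters):
        s, expanded, trajectory = root, False, []
        for _depth in range(horizon):
            if is_terminal(s):
                break
            node = tree.get(s)
            if node is None and not expanded:
                node = tree[s] = Node()         # add first node not in the tree
                expanded = True
            if node is None:
                a = rng.choice(actions(s))      # default policy: uniform random
            else:
                untried = [b for b in actions(s) if b not in node.n_sa]
                if untried:
                    a = rng.choice(untried)     # each action at least once
                else:                           # tree policy: UCB rule
                    a = max(node.q, key=lambda b: node.q[b]
                            + K * math.sqrt(math.log(node.n) / node.n_sa[b]))
            s, r = step(s, a, rng)
            trajectory.append((node, a, r))
        ret = 0.0                               # total reward observed after the node
        for node, a, r in reversed(trajectory):
            ret += r
            if node is not None:
                node.n += 1
                node.n_sa[a] = node.n_sa.get(a, 0) + 1
                old = node.q.get(a, 0.0)
                node.q[a] = old + (ret - old) / node.n_sa[a]
    root_node = tree[root]
    return max(root_node.q, key=root_node.q.get)  # final choice ignores exploration


# Hypothetical toy game: two moves, reward 1 only for the sequence "RR".
def toy_actions(s):
    return ["L", "R"]


def toy_step(s, a, rng):
    s_next = s + a
    return s_next, (1.0 if s_next == "RR" else 0.0)


def toy_terminal(s):
    return len(s) == 2


best = uct_search("", toy_actions, toy_step, toy_terminal,
                  n_iters=200, horizon=10, K=1.0, rng=random.Random(0))
```

The final line of `uct_search` picks the root action by Q-value alone, as the recap notes.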
Some Successes
Computer Go
Klondike Solitaire (wins 40% of games)
General Game Playing Competition
Real-Time Strategy Games
Combinatorial Optimization
Crowd Sourcing
List is growing
Usually extend UCT in some ways
Some Improvements
- Use domain knowledge to handcraft a more intelligent default policy than random
  - E.g. don't choose obviously stupid actions
  - In Go a hand-coded default policy is used
- Learn a heuristic function to evaluate positions
  - Use the heuristic function to initialize leaf nodes (otherwise initialized to zero)
Practical Issues
- Selecting K
  - There is no fixed rule
  - Experiment with different values in your domain
  - Rule of thumb: try values of K that are of the same order of magnitude as the reward signal you expect
- The best value of K may depend on the number of iterations N
  - Usually a single value of K will work well for values of N that are large enough
Practical Issues
- UCT can have trouble building deep trees when actions can result in a large # of possible next states
- Each time we try action 'a' we get a new state (new leaf) and very rarely resample an existing leaf
[Figure: root state s with actions a and b.]
Practical Issues
- Degenerates into a depth-1 tree search
- Solution: we have a "sparseness parameter" that controls how many new children of an action will be generated
  - Set sparseness to something like 5 or 10
[Figure: root state s with actions a and b, each with a limited set of children.]
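One way the sparseness parameter could be realized is sketched below. The slides do not say how an existing child is chosen once the cap is reached, so resampling uniformly among the stored outcomes is an assumption here, as are the names `sample_child` and `toy_simulate`.

```python
import random


def sample_child(children, simulate, s, a, sparseness, rng):
    """Return a (next_state, reward) outcome for (s, a), capping distinct children.

    children : dict (s, a) -> list of outcomes generated so far
    """
    outcomes = children.setdefault((s, a), [])
    if len(outcomes) < sparseness:
        outcome = simulate(s, a, rng)    # still allowed to generate a new child
        outcomes.append(outcome)
        return outcome
    return rng.choice(outcomes)          # assumption: reuse an existing child uniformly


# Hypothetical simulator with a continuous next state, so every call is a new state.
def toy_simulate(s, a, rng):
    return rng.random(), 0.0


rng = random.Random(0)
children = {}
seen = [sample_child(children, toy_simulate, 0, "a", 5, rng) for _ in range(100)]
```

After the cap is hit, repeated visits land on existing children, so the tree below them can keep growing.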