Bandit-based Search for Constraint Programming
Manuel Loth (1,2,4), Michele Sebag (2,4,1), Youssef Hamadi (3,1), Marc Schoenauer (4,2,1), Christian Schulte (5)
(1) Microsoft-INRIA joint centre  (2) LRI, Univ. Paris-Sud and CNRS  (3) Microsoft Research Cambridge
(4) INRIA Saclay  (5) KTH, Stockholm
Review AERES, Nov. 2013
1 / 23
Search/Optimization and Machine Learning
Different Learning contexts
I Supervised (from examples) vs Reinforcement (from reward)
I Off-line (static) vs On-line (while searching)
Here: use on-line Reinforcement Learning (MCTS) to improve CP search
2 / 23
Main idea
Constraint Programming
I Explore a search tree
I Heuristics: (learn to) order variables & values
Monte-Carlo Tree Search
I A tree-search method
I Breakthrough for games and planning
Hybridizing MCTS and CP: Bandit-based Search for Constraint Programming
3 / 23
Overview
MCTS
BaSCoP
Experimental validation
Conclusions and Perspectives
4 / 23
The Multi-Armed Bandit problem
Lai, Robbins 85
In a casino, one wants to maximize one's gains while playing.
Lifelong learning
Exploration vs Exploitation Dilemma
I Play the best arm so far ? Exploitation
I But there might exist better arms... Exploration
5 / 23
The Multi-Armed Bandit problem (2)
I K arms; the i-th arm gives reward 1 with probability µ_i, 0 otherwise
I At each time t, one selects an arm i*_t and gets a reward r_t
n_{i,t} = number of times arm i has been selected in [0, t]
µ_{i,t} = average reward of arm i in [0, t]
Upper Confidence Bound Auer et al. 2002
Be optimistic when facing the unknown
Select argmax_i { µ_{i,t} + C · sqrt( log(Σ_j n_{j,t}) / n_{i,t} ) }

ε-greedy
I with probability 1 − ε, select argmax {µ_{i,t}} (exploitation)
I else select an arm uniformly (exploration)
6 / 23
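As an illustration (the function names and toy inputs are mine, not from the slides), the two selection rules above can be sketched in Python:

```python
import math
import random

def ucb_select(mu, n, C=0.5):
    """Pick the arm maximizing mu_i + C * sqrt(log(sum_j n_j) / n_i).
    An arm that has never been played (n_i == 0) is tried first."""
    total = sum(n)
    def score(i):
        if n[i] == 0:
            return float("inf")  # force at least one play per arm
        return mu[i] + C * math.sqrt(math.log(total) / n[i])
    return max(range(len(mu)), key=score)

def epsilon_greedy_select(mu, eps=0.1, rng=random):
    """With probability 1 - eps exploit the best arm so far, else explore uniformly."""
    if rng.random() < eps:
        return rng.randrange(len(mu))                  # exploration
    return max(range(len(mu)), key=lambda i: mu[i])    # exploitation
```

With equal play counts, UCB reduces to picking the best empirical mean; the exploration term only matters for under-sampled arms.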
Monte-Carlo Tree Search Kocsis & Szepesvari, 06
UCT == UCB for Trees: gradually grow the search tree
I Iterate Tree-Walks
I Building blocks:
I Select next action (bandit phase)
I Add a node (grow a leaf of the search tree)
I Select next action again (random phase, roll-out)
I Compute instant reward (evaluate)
I Update information in the visited nodes of the search tree (propagate)
I Returned solution: the path visited most often
[Figure: the explored search tree, showing the bandit-based phase, the new node, and the random phase]
7 / 23
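The tree-walk loop above can be sketched as a minimal, generic MCTS skeleton (assuming user-supplied `actions`, `rollout`, and `evaluate` functions; none of these names come from the slides):

```python
import math
import random

class Node:
    def __init__(self):
        self.children = {}  # action -> Node
        self.visits = 0
        self.value = 0.0    # running mean of rewards

def ucb_score(parent, child, C):
    if child.visits == 0:
        return float("inf")
    return child.value + C * math.sqrt(math.log(parent.visits) / child.visits)

def tree_walk(root, actions, rollout, evaluate, C=1.0, rng=random):
    """One MCTS iteration: bandit descent, grow one leaf, roll out, propagate."""
    path, node, state = [root], root, []
    # Bandit phase: descend while every action already has a child node
    while node.children and len(node.children) == len(actions(state)):
        a = max(node.children, key=lambda b: ucb_score(node, node.children[b], C))
        state.append(a)
        node = node.children[a]
        path.append(node)
    # Growth phase: add one new node under the current leaf
    untried = [a for a in actions(state) if a not in node.children]
    if untried:
        a = rng.choice(untried)
        node.children[a] = Node()
        node = node.children[a]
        state.append(a)
        path.append(node)
    # Random phase + evaluation: complete the walk and score it
    reward = evaluate(rollout(state, rng))
    # Propagation: update visit counts and mean reward along the path
    for n in path:
        n.visits += 1
        n.value += (reward - n.value) / n.visits
    return reward
```

On a toy binary tree where only one leaf pays off, repeated tree-walks concentrate visits on the rewarding branch, which is exactly the "path visited most often" returned as the solution.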
Overview
MCTS
BaSCoP
Experimental validation
Conclusions and Perspectives
8 / 23
Adaptation
Main issues
I Which default policy ? (random phase)
I Which reward ?
I Which selection rule ?
Desired
I As problem-independent as possible
I Compatible with multiple restarts
I (some) Guarantees of completeness
9 / 23
Default policy: Depth-first search (DFS)
I Enforces completeness
I Accounts for priors about values (some are better than others; neighborhood of the last best solution)
I Limited memory resources required: under each MCTS leaf node, store the current DFS path (assignments to the left of the DFS path are closed)
10 / 23
Reward
I If multiple restarts, rewards cannot be attached to tree nodes
I → rewards are attached to elementary assignments, i.e. (variable = value)
Guiding principles
I Variables: Fail first existing heuristics perform well
I Values: Fail deep →
reward(var = val) = 1 if the failure is deeper than the (local) average, 0 otherwise
Discussion
I Compatible with multiple restarts
I Noise: var might occur at different depths
I But this noise equally affects all values of var
11 / 23
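A minimal sketch of the fail-deep reward (the slides only specify the 0/1 rule; the per-variable running average, the class name, and the method names below are my own choices):

```python
class FailDeepReward:
    """Fail-deep reward: an assignment (var = val) gets reward 1 when the
    failure it led to is deeper than the average failure depth seen so far
    for var, and 0 otherwise. The slides' "local average" is approximated
    here by a per-variable running mean (an assumption of this sketch)."""

    def __init__(self):
        self.avg = {}    # var -> running mean of failure depths
        self.count = {}  # var -> number of observed failures

    def reward(self, var, val, failure_depth):
        c = self.count.get(var, 0)
        avg = self.avg.get(var, 0.0)
        # the average is per variable, so all values of var share the same
        # reference depth (the noise affects them equally)
        r = 1 if c > 0 and failure_depth > avg else 0
        # update the running mean of failure depths for this variable
        self.count[var] = c + 1
        self.avg[var] = avg + (failure_depth - avg) / (c + 1)
        return r
```

Because the reference depth is shared by all values of a variable, comparing values through this reward stays meaningful even though the variable itself may appear at different depths across restarts.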
Selection rules
L-value: left value (0); R-value: right value (1)
Baselines (non-adaptive)
I Uniform
I ε-left: with probability 1 − ε select the L-value, otherwise the R-value
Adaptive selection rules
I UCB: select val maximizing
reward(var = val) + C · sqrt( log(Σ_v n(var = v)) / n(var = val) )
I UCB-left: same, but C_left = ρ · C_right, with ρ > 1
12 / 23
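A sketch of the binary value choice under UCB-left (the function name and default constants are illustrative, not from the slides):

```python
import math

def ucb_left_select(r_left, n_left, r_right, n_right, C=0.1, rho=2.0):
    """Choose between the L-value (0) and the R-value (1) with UCB-left:
    the exploration constant of the L-value is rho times the R-value's,
    i.e. C_left = rho * C_right with rho > 1."""
    total = n_left + n_right
    def score(r, n, c):
        if n == 0:
            return float("inf")  # try an unplayed value first
        return r + c * math.sqrt(math.log(total) / n)
    s_left = score(r_left, n_left, rho * C)  # C_left = rho * C_right
    s_right = score(r_right, n_right, C)
    return 0 if s_left >= s_right else 1
```

With equal statistics the larger exploration bonus tilts the choice to the left, mirroring the positional bias of values in job-shop instances; a clearly better right value still wins.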
Overview
MCTS
BaSCoP
Experimental validation
Conclusions and Perspectives
13 / 23
Goal of experiments
Compare BaSCoP with baselines
I DFS alone
I Adaptive and non-adaptive selection rules
Genericity
I Robustness wrt multiple restarts
I Sensitivity analysis wrt parameters
14 / 23
Experimental setting
Algorithmic framework: Gecode http://gecode.org
Top policies (value selection):
I non-adaptive: Uniform, ε-Left
I adaptive: UCB, UCB-Left
Bottom policy: Depth-First Search
Parameters
I ε ∈ {.05, .1, .15, .2} (ε-Left)
I C ∈ {.05, .1, .2, .5} (UCB)
I ρ ∈ {1, 2, 4, 8} (UCB-Left)
15 / 23
Benchmark problems
Job-shop scheduling
I 40 Taillard instances http://mistic.heig-vd.ch/taillard/
I Multiple restarts (Luby sequence), neighborhood search
I Performance: mean relative error (to best known results)
Car-sequencing
I 70 instances, circa 200 n-ary variables
I Performance: number of violations
I No restart
All results averaged over 11 runs
16 / 23
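The Luby sequence used for the restart schedule is 1, 1, 2, 1, 1, 2, 4, 1, 1, 2, 1, 1, 2, 4, 8, ...; each restart is typically allowed a failure budget proportional to the corresponding term. A standard sketch of the sequence (not Gecode's implementation):

```python
def luby(i):
    """i-th term (1-indexed) of the Luby restart sequence.
    The sequence is self-similar: each block of length 2^k - 1 consists of
    two copies of the previous block followed by the single term 2^(k-1)."""
    k = 1
    while (1 << k) - 1 < i:  # smallest k with i <= 2^k - 1
        k += 1
    if (1 << k) - 1 == i:
        return 1 << (k - 1)  # i ends a block: the term is 2^(k-1)
    return luby(i - (1 << (k - 1)) + 1)  # otherwise recurse into a copy
```

The schedule balances many short runs against occasional long ones, which is what makes restarts effective without knowing the right cutoff in advance.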
Structures of visited trees
[Figure: typical tree shapes on a JSP Taillard 15×20 instance, under Uniform, UCB, ε-Left, and UCB-Left selection]
17 / 23
Experimental Results
State-of-the-art results on several instances (200 000 tree-walks)
[Figure: mean relative error to the best-known solution vs. number of tree-walks, comparing DFS, Balanced, ε-left (ε=0.15), UCB (C=0.1), and UCB-left (C_left=0.2, C_right=0.1)]
Sample result: Mean Relative Error on Taillard 11-20
18 / 23
Car Sequencing
[Figure: a car assembly line; cars require options such as ABS, with station capacities like 2/3, 2/5, 1/2]
I Car assembly line, different options on ordered cars.
I Stalls can handle a given number of cars
I Arrange the car sequence so as not to exceed any capacity → minimize the number of empty stalls
n-ary, no restart, no positional bias of values
19 / 23
Car Sequencing
[Figure: number of empty stalls per instance, comparing DFS and UCB with C ∈ {0.05, 0.1, 0.2, 0.5}]
BaSCoP is modestly but significantly better than DFS... but both are significantly worse than ad hoc heuristics
20 / 23
Overview
MCTS
BaSCoP
Experimental validation
Conclusions and Perspectives
21 / 23
Conclusion
BaSCoP integrated in the Gecode framework
I Generic heuristics for value ordering
I Compatible with multiple restarts
I DFS as rollout policy provides completeness guarantees
I Improves on DFS on 2 of the 3 benchmark families
I State-of-the-art CP results on JSP without any ad hoc heuristics
22 / 23
Perspectives
Extensions
I Rank-based reward for values, for n-ary contexts
I When there is no restart, full MCTS (rewards attached to partial assignments)
I Rewards for variable ordering
I Control of the parallelization scheme (adaptive work stealing)
23 / 23