Bandit-based Search for Constraint Programming
Manuel Loth (1,2,4), Michele Sebag (2,4,1), Youssef Hamadi (3,1), Marc Schoenauer (4,2,1), Christian Schulte (5)
(1) Microsoft-INRIA joint centre  (2) LRI, Univ. Paris-Sud and CNRS  (3) Microsoft Research Cambridge
(4) INRIA Saclay  (5) KTH, Stockholm
Review AERES, Nov. 2013
1 / 23
Search/Optimization and Machine Learning
Different Learning contexts
I Supervised (from examples) vs Reinforcement (from reward)
I Off-line (static) vs On-line (while searching)
Here: use on-line Reinforcement Learning (MCTS) to improve CP search
2 / 23
Main idea
Constraint Programming
I Explore a search tree
I Heuristics: (learn to) order variables & values
Monte-Carlo Tree Search
I A tree-search method
I Breakthrough for games and planning
Hybridizing MCTS and CP: Bandit-based Search for Constraint Programming
3 / 23
Overview
MCTS
BaSCoP
Experimental validation
Conclusions and Perspectives
4 / 23
The Multi-Armed Bandit problem
Lai, Robbins 85
In a casino, one wants to maximize one's gains while playing.
Lifelong learning
Exploration vs Exploitation Dilemma
I Play the best arm so far ? Exploitation
I But there might exist better arms... Exploration
5 / 23
The Multi-Armed Bandit problem (2)
I K arms; the i-th arm gives reward 1 with probability µ_i, 0 otherwise
I At each time t, one selects an arm i*_t and gets a reward r_t
n_{i,t} = number of times arm i has been selected in [0, t]
µ_{i,t} = average reward of arm i in [0, t]
Upper Confidence Bound Auer et al. 2002
Be optimistic when facing the unknown
Select argmax_i { µ_{i,t} + C · sqrt( log(Σ_j n_{j,t}) / n_{i,t} ) }

ε-greedy
I with probability 1 − ε, select argmax {µ_{i,t}} (exploitation)
I else select an arm uniformly (exploration)
6 / 23
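As an illustration (the function names and toy inputs are mine, not from the slides), the two selection rules above can be sketched in Python:

```python
import math
import random

def ucb_select(mu, n, C=0.5):
    """Pick the arm maximizing mu_i + C * sqrt(log(sum_j n_j) / n_i).
    An arm that has never been played (n_i == 0) is tried first."""
    total = sum(n)
    def score(i):
        if n[i] == 0:
            return float("inf")  # force at least one play per arm
        return mu[i] + C * math.sqrt(math.log(total) / n[i])
    return max(range(len(mu)), key=score)

def epsilon_greedy_select(mu, eps=0.1, rng=random):
    """With probability 1 - eps exploit the best arm so far, else explore uniformly."""
    if rng.random() < eps:
        return rng.randrange(len(mu))                  # exploration
    return max(range(len(mu)), key=lambda i: mu[i])    # exploitation
```

With equal play counts, UCB reduces to picking the best empirical mean; the exploration term only matters for under-sampled arms.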
Monte-Carlo Tree Search Kocsis & Szepesvari, 06
UCT == UCB for Trees: gradually grow the search tree
I Iterate Tree-Walks
I Building blocks:
I Select next action (bandit phase)
I Add a node (grow a leaf of the search tree)
I Select next action again (random phase, roll-out)
I Compute instant reward (evaluate)
I Update information in the visited nodes of the search tree (propagate)
I Returned solution: the path visited most often
[Figure: the explored search tree, showing the bandit-based phase, the new node, and the random phase]
7 / 23
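The tree-walk loop above can be sketched as a minimal, generic MCTS skeleton (assuming user-supplied `actions`, `rollout`, and `evaluate` functions; none of these names come from the slides):

```python
import math
import random

class Node:
    def __init__(self):
        self.children = {}  # action -> Node
        self.visits = 0
        self.value = 0.0    # running mean of rewards

def ucb_score(parent, child, C):
    if child.visits == 0:
        return float("inf")
    return child.value + C * math.sqrt(math.log(parent.visits) / child.visits)

def tree_walk(root, actions, rollout, evaluate, C=1.0, rng=random):
    """One MCTS iteration: bandit descent, grow one leaf, roll out, propagate."""
    path, node, state = [root], root, []
    # Bandit phase: descend while every action already has a child node
    while node.children and len(node.children) == len(actions(state)):
        a = max(node.children, key=lambda b: ucb_score(node, node.children[b], C))
        state.append(a)
        node = node.children[a]
        path.append(node)
    # Growth phase: add one new node under the current leaf
    untried = [a for a in actions(state) if a not in node.children]
    if untried:
        a = rng.choice(untried)
        node.children[a] = Node()
        node = node.children[a]
        state.append(a)
        path.append(node)
    # Random phase + evaluation: complete the walk and score it
    reward = evaluate(rollout(state, rng))
    # Propagation: update visit counts and mean reward along the path
    for n in path:
        n.visits += 1
        n.value += (reward - n.value) / n.visits
    return reward
```

On a toy binary tree where only one leaf pays off, repeated tree-walks concentrate visits on the rewarding branch, which is exactly the "path visited most often" returned as the solution.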
Overview
MCTS
BaSCoP
Experimental validation
Conclusions and Perspectives
8 / 23
Adaptation
Main issues
I Which default policy ? (random phase)
I Which reward ?
I Which selection rule ?
Desired
I As problem-independent as possible
I Compatible with multiple restarts
I (some) Guarantees of completeness
9 / 23
Default policy: Depth-first search (DFS)
I Enforces completeness
I Accounts for priors about values (some are better than others; neighborhood of the last best solution)
I Limited memory resources required: under each MCTS leaf node, store the current DFS path (assignments to the left of the DFS path are closed)
10 / 23
Reward
I If multiple restarts, rewards cannot be attached to tree nodes
I → rewards are attached to elementary assignments, i.e. (variable = value)
Guiding principles
I Variables: Fail first existing heuristics perform well
I Values: Fail deep →
reward(var = val) = 1 if the failure is deeper than the (local) average, 0 otherwise
Discussion
I Compatible with multiple restarts
I Noise: var might occur at different depths
I But this noise equally affects all values of var
11 / 23
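A minimal sketch of the fail-deep reward (the slides only specify the 0/1 rule; the per-variable running average, the class name, and the method names below are my own choices):

```python
class FailDeepReward:
    """Fail-deep reward: an assignment (var = val) gets reward 1 when the
    failure it led to is deeper than the average failure depth seen so far
    for var, and 0 otherwise. The slides' "local average" is approximated
    here by a per-variable running mean (an assumption of this sketch)."""

    def __init__(self):
        self.avg = {}    # var -> running mean of failure depths
        self.count = {}  # var -> number of observed failures

    def reward(self, var, val, failure_depth):
        c = self.count.get(var, 0)
        avg = self.avg.get(var, 0.0)
        # the average is per variable, so all values of var share the same
        # reference depth (the noise affects them equally)
        r = 1 if c > 0 and failure_depth > avg else 0
        # update the running mean of failure depths for this variable
        self.count[var] = c + 1
        self.avg[var] = avg + (failure_depth - avg) / (c + 1)
        return r
```

Because the reference depth is shared by all values of a variable, comparing values through this reward stays meaningful even though the variable itself may appear at different depths across restarts.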
Selection rules
L-value: left value (0); R-value: right value (1)
Baselines (non-adaptive)
I Uniform
I ε-left: with probability 1 − ε select the L-value, otherwise the R-value
Adaptive selection rules
I UCB: select val maximizing
reward(var = val) + C · sqrt( log(Σ_v n(var = v)) / n(var = val) )
I UCB-left: same, but C_left = ρ · C_right, with ρ > 1
12 / 23
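A sketch of the binary value choice under UCB-left (the function name and default constants are illustrative, not from the slides):

```python
import math

def ucb_left_select(r_left, n_left, r_right, n_right, C=0.1, rho=2.0):
    """Choose between the L-value (0) and the R-value (1) with UCB-left:
    the exploration constant of the L-value is rho times the R-value's,
    i.e. C_left = rho * C_right with rho > 1."""
    total = n_left + n_right
    def score(r, n, c):
        if n == 0:
            return float("inf")  # try an unplayed value first
        return r + c * math.sqrt(math.log(total) / n)
    s_left = score(r_left, n_left, rho * C)  # C_left = rho * C_right
    s_right = score(r_right, n_right, C)
    return 0 if s_left >= s_right else 1
```

With equal statistics the larger exploration bonus tilts the choice to the left, mirroring the positional bias of values in job-shop instances; a clearly better right value still wins.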
Overview
MCTS
BaSCoP
Experimental validation
Conclusions and Perspectives
13 / 23
Goal of experiments
Compare BaSCoP with baselines
I DFS alone
I Adaptive and non-adaptive selection rules
Genericity
I Robustness wrt multiple restarts
I Sensitivity analysis wrt parameters
14 / 23
Experimental setting
Algorithmic framework: Gecode http://gecode.org
Top policies (value selection):
I non-adaptive: Uniform, ε-Left
I adaptive: UCB, UCB-Left
Bottom policy: Depth-First Search
Parameters
I ε ∈ {.05, .1, .15, .2} (ε-Left)
I C ∈ {.05, .1, .2, .5} (UCB)
I ρ ∈ {1, 2, 4, 8} (UCB-Left)
15 / 23
Benchmark problems
Job-shop scheduling
I 40 Taillard instances http://mistic.heig-vd.ch/taillard/
I Multiple restarts (Luby sequence), neighborhood search
I Performance: mean relative error (to best known results)
Car-sequencing
I 70 instances, circa 200 n-ary variables
I Performance: number of violations
I No restart
All results averaged over 11 runs
16 / 23
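The Luby sequence used for the restart schedule is 1, 1, 2, 1, 1, 2, 4, 1, 1, 2, 1, 1, 2, 4, 8, ...; each restart is typically allowed a failure budget proportional to the corresponding term. A standard sketch of the sequence (not Gecode's implementation):

```python
def luby(i):
    """i-th term (1-indexed) of the Luby restart sequence.
    The sequence is self-similar: each block of length 2^k - 1 consists of
    two copies of the previous block followed by the single term 2^(k-1)."""
    k = 1
    while (1 << k) - 1 < i:  # smallest k with i <= 2^k - 1
        k += 1
    if (1 << k) - 1 == i:
        return 1 << (k - 1)  # i ends a block: the term is 2^(k-1)
    return luby(i - (1 << (k - 1)) + 1)  # otherwise recurse into a copy
```

The schedule balances many short runs against occasional long ones, which is what makes restarts effective without knowing the right cutoff in advance.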
Structures of visited trees
[Figure: typical tree shapes on a JSP Taillard 15×20 instance, under Uniform, UCB, ε-Left, and UCB-Left selection]
17 / 23
Experimental Results
State-of-the-art results on several instances (200 000 tree-walks)
[Figure: mean relative error to the best-known solution vs. number of tree-walks, comparing DFS, Balanced, ε-left (ε=0.15), UCB (C=0.1), and UCB-left (C_left=0.2, C_right=0.1)]
Sample result: Mean Relative Error on Taillard 11-20
18 / 23
Car Sequencing
[Figure: a car assembly line; cars require options such as ABS, with station capacities like 2/3, 2/5, 1/2]
I Car assembly line, different options on ordered cars.
I Stalls can handle a given number of cars
I Arrange the car sequence so as not to exceed any capacity → minimize the number of empty stalls
n-ary, no restart, no positional bias of values
19 / 23
Car Sequencing
[Figure: number of empty stalls per instance, comparing DFS and UCB with C ∈ {0.05, 0.1, 0.2, 0.5}]
BaSCoP is modestly but significantly better than DFS... but both are significantly worse than ad hoc heuristics
20 / 23
Overview
MCTS
BaSCoP
Experimental validation
Conclusions and Perspectives
21 / 23
Conclusion
BaSCoP integrated in the Gecode framework
I Generic heuristics for value ordering
I Compatible with multiple restarts
I DFS as rollout policy provides completeness guarantees
I Improves on DFS on 2 of the 3 benchmark families
I State-of-the-art CP results on JSP without any ad hoc heuristics
22 / 23
Perspectives
Extensions
I Rank-based reward for values, for n-ary contexts
I When there is no restart, full MCTS (rewards attached to partial assignments)
I Rewards for variable ordering
I Control of the parallelization scheme (adaptive work stealing)
23 / 23