Transfer Learning, Soft Distance-Based Bias, and the Hierarchical BOA
DESCRIPTION
An automated technique has recently been proposed to transfer learning in the hierarchical Bayesian optimization algorithm (hBOA) based on distance-based statistics. The technique enables practitioners to improve hBOA efficiency by collecting statistics from probabilistic models obtained in previous hBOA runs and using the obtained statistics to bias future hBOA runs on similar problems. The purpose of this paper is threefold: (1) test the technique on several classes of NP-complete problems, including MAXSAT, spin glasses, and minimum vertex cover; (2) demonstrate that the technique is effective even when previous runs were done on problems of different size; (3) provide empirical evidence that combining transfer learning with other efficiency enhancement techniques can often yield nearly multiplicative speedups.
TRANSCRIPT
Transfer Learning, Soft Distance-Based Bias, and the Hierarchical BOA
Martin Pelikan
Missouri Estimation of Distribution Algorithms Laboratory (MEDAL), University of Missouri, St. Louis, MO
E-mail: [email protected] WWW: http://martinpelikan.net/
Mark W. Hauschild
Missouri Estimation of Distribution Algorithms Laboratory (MEDAL), University of Missouri, St. Louis, MO
E-mail: [email protected]
Pier Luca Lanzi
Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milano, Italy
E-mail: [email protected] WWW: http://www.pierlucalanzi.net/
Background
• Model-directed optimizers (MDOs), such as estimation of distribution algorithms, learn and use models to solve difficult optimization problems scalably and reliably.
• MDOs often provide more than the solution; they provide a set of models that reveal information about the problem. Why not use that information in future runs?
Purpose
• Combine prior models with a problem-specific distance metric to solve new problem instances with increased speed, accuracy, and reliability.
• Focus on the hBOA algorithm and additively decomposable functions, although the approach can be generalized to other MDOs and other problem classes.
• Extend previous work, mainly to demonstrate that:
  • Previous MDO runs on smaller problems can be used to bias runs on larger problems.
  • Previous MDO runs on one problem class can be used to bias runs on another problem class.
Hierarchical Bayesian Optimization Algorithm (hBOA)
• Models allow hBOA to learn and use problem structure.
• To build models, hBOA uses Bayesian metrics that combine the current population and prior knowledge (via a prior distribution over structures and parameters).
[Diagram: current population → selected population → Bayesian network model → new population]
Basic Idea of the Approach
• Outline of the approach:
  1. Define distances between problem variables.
  2. Mine probabilistic models from previous runs for model regularities with respect to distances.
  3. Use the obtained information to bias model building when solving new problem instances.
• Apply the approach to hBOA and additively decomposable functions (ADFs).
• The distance metric was developed for ADFs, but other metrics can be used.
• The bias is incorporated by modifying prior probabilities of network structures when building the models; its strength is controlled by a parameter κ > 0 (a sketch of one possible form follows below).
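To make the last point concrete, the following minimal Python sketch shows one plausible way to turn distance statistics mined from previous runs into a structural log-prior. It assumes a prior of the form p(B) ∝ κ^(Σ s(d)), where s(d) is a log-proportion statistic per edge; the function names, the flooring constant, and this exact exponent are illustrative assumptions, not the paper's precise formulation.

```python
import math

def edge_exponent(dist_stats, distance, floor=1e-3):
    """s(d): log2 of the (floored) proportion of previous-run models that
    contained a dependency at this distance. Negative, and more negative
    for distances at which dependencies were rare."""
    return math.log2(max(dist_stats.get(distance, 0.0), floor))

def structure_log_prior(edges, distance_fn, dist_stats, kappa):
    """log p(B) for a candidate network structure B (a set of edges),
    assuming p(B) is proportional to kappa ** (sum of edge exponents).

    With kappa > 1, edges at distances that were rare in prior runs are
    penalized; kappa = 1 yields a uniform (unbiased) structural prior.
    """
    exponent = sum(edge_exponent(dist_stats, distance_fn(i, j))
                   for (i, j) in edges)
    return exponent * math.log(kappa)
```

During model building, such a log-prior would simply be added to the data-driven Bayesian score of each candidate structure, so the single parameter κ trades off prior experience against evidence from the current population.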
Distance Metric for ADFs
• ADF: f(X1, ..., Xn) = Σᵢ fᵢ(Sᵢ), where {Sᵢ} are subsets of variables and {fᵢ} are arbitrary functions.
• Create a graph for the ADF with one node per variable by connecting variables that appear in the same subproblem.
• The number of edges along the shortest path between two nodes defines their distance; for disconnected variables, the distance equals the number of variables.
• Other distance metrics can be used (e.g., for QAP and scheduling).
Selected Results
• Problem classes:
  • Nearest-neighbor NK landscapes.
  • Spin glasses (2D and 3D).
  • MAXSAT for transformed graph coloring.
  • Minimum vertex cover for random graphs.
• Use bias from smaller problems on bigger problems (compared to bias from problems of the same size): NK landscapes, MVC, spin glass (2D).
• Use bias from another problem class: NK landscapes, MVC (easier), MVC (harder).
• Summary of results (many results omitted here):
  • Significant speedups were obtained in all test cases.
  • Bias based on runs on smaller problems and on other problem classes yields significant speedups as well.
  • This demonstrates the flexibility, practicality, and potential of the approach.
  • The approach can be adapted to other MDOs (LTGA, ECGA, DSMGA, ...).
[Figure 9: Speedups obtained on NK landscapes and 2D spin glasses without using local search. Panels: (a) NK landscapes with nearest neighbors (n = 50, 60, 70; k = 5); (b) 2D ±J Ising spin glass (n = 64, 100, 144). Plots show multiplicative speedup w.r.t. CPU time and percentage of instances with improved execution time vs. kappa (strength of bias, 1 to 10), and average CPU speedup (multiplicative) vs. problem size n for kappa = 2, 4, 6, 8, 10; the base case corresponds to no speedup.]
[Figure 10: Speedups obtained on all test problems except for MAXSAT with bias based on models obtained from problems of smaller size, compared to the base case with bias based on the same problem size. Panels: (a) NK landscapes with nearest neighbors, n = 200, k = 5 (bias from n = 150 vs. n = 200); (b) minimum vertex cover, n = 200, c = 2; (c) minimum vertex cover, n = 200, c = 4; (d) 2D ±J Ising spin glass, n = 20 × 20 = 400 (bias from n = 324 vs. n = 400); (e) 3D ±J Ising spin glass, n = 7 × 7 × 7 = 343 (bias from n = 216 vs. n = 343). Plots show multiplicative speedup w.r.t. CPU time vs. kappa (strength of bias, 1 to 10).]
[Figure 11: Speedups obtained on NK landscapes and minimum vertex cover with the bias from models on another problem class compared to the base case (the base case corresponds to 10-fold crossvalidation using the models obtained on the problems from the same class). Panels: (a) NK landscapes with nearest neighbors, n = 200, k = 5 (models from MVC with c = 2 and c = 4); (b) minimum vertex cover, n = 200, c = 2 (models from NK and from MVC with c = 2.0 and c = 4.0); (c) minimum vertex cover, n = 200, c = 4 (models from NK and from MVC with c = 2.0 and c = 4.0). Plots show multiplicative speedup w.r.t. CPU time vs. kappa (strength of bias, 1 to 10).]
of experiments; the results of these experiments can be found in fig. 11. Specifically, the figure presents the results of using bias from the NK landscapes while solving the minimum vertex cover instances for the two sets of graphs (with c = 2 and c = 4), the results of using bias from the minimum vertex cover instances for both classes of graphs when solving the NK landscapes, and also the results of using bias from the minimum vertex cover on another class of graphs (with different c). It can be seen that significant speedups were obtained even when using models from another problem class, although in some cases the speedups obtained when using the same problem class were significantly better. Some combinations of problem classes discussed in this paper were omitted because of the high computational demands of testing every pair of problems against each other. Despite that, the results presented here indicate that even bias based on problems from a different class can be beneficial.
5.3.5 Combining Distance-Based Bias and Sporadic Model Building
While most efficiency enhancement techniques provide significant speedups in their own right (Hauschild and Pelikan, 2008; Hauschild et al., 2012; Lozano et al., 2001; Mendiburu et al., 2007; Ocenasek, 2002; Pelikan et al., 2008; Sastry et al., 2006), their combinations are usually significantly more powerful because the speedups often multiply. For example, with a speedup of 4 due to sporadic model building (Pelikan et al., 2008) and a speedup of 20 due to parallelization using 20 computational nodes (Lozano et al., 2001; Ocenasek and Schwarz, 2000), combining the two techniques can be expected to yield a speedup of about 80; obtaining a comparable speedup using parallelization on its own would require at least 4 times as many computational nodes. How will the distance-based bias from prior hBOA runs combine with other efficiency enhancements?
To illustrate the benefits of combining the distance-based bias discussed in this paper with other efficiency enhancement techniques, we combined the approach with sporadic model building and tested the two efficiency enhancement techniques separately and in combination. Fig. 12 shows the results of these experiments. The results show that even in cases where the speedup due to distance-based bias is only moderate, the overall speedup increases substantially due to the combination of the two techniques. Nearly multiplicative speedups can be observed in most scenarios.
6 Thoughts on Extending the Approach to Other Model-Directed Optimizers
6.1 Extended Compact Genetic Algorithm
The extended compact genetic algorithm (ECGA) (Harik, 1999) uses marginal product models (MPMs) to guide optimization. An MPM splits variables into disjoint linkage groups. The distribution of the variables in each group is encoded by a probability table that specifies a probability for each combination of values of these variables. The overall probability distribution is given by the product of the distributions of the individual linkage groups. In comparison with Bayesian network models, instead of connecting variables that influence each other strongly using directed edges, ECGA puts such variables into the same linkage group.
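As a concrete illustration of this model class (not of ECGA's model search, which selects the grouping using a minimum description length metric), here is a minimal Python sketch of a marginal product model with fixed linkage groups; the class and method names are ours, not ECGA's.

```python
import random
from collections import Counter

class MarginalProductModel:
    """Minimal MPM sketch: disjoint linkage groups, each with its own
    probability table over the group's joint value combinations."""

    def __init__(self, groups):
        self.groups = groups                     # e.g. [(0, 1, 4), (2, 3), (5,)]
        self.tables = [Counter() for _ in groups]

    def fit(self, population):
        """Estimate each group's table from the selected population."""
        for table, group in zip(self.tables, self.groups):
            table.clear()
            for individual in population:
                table[tuple(individual[i] for i in group)] += 1

    def sample(self, n_vars):
        """Sample each group independently; the overall distribution is
        the product of the group distributions."""
        child = [None] * n_vars
        for table, group in zip(self.tables, self.groups):
            values, weights = zip(*table.items())
            chosen = random.choices(values, weights=weights)[0]
            for var, val in zip(group, chosen):
                child[var] = val
        return child
```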
More Information
• Download MEDAL Reports No. 2012004 and 2012007:
  http://medal-lab.org/files/2012004.pdf
  http://medal-lab.org/files/2012007.pdf
Acknowledgments
• NSF grants ECS-0547013 and IIS-1115352.
• ITS at the University of Missouri in St. Louis.
• University of Missouri Bioinformatics Consortium.
• Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
http://medal-lab.org
in hBOA for additively decomposable problems based on a problem-specific distance metric. However, note that the framework can be applied to many other model-directed optimization techniques and the distance function can be defined in many other ways. To illustrate this, we outline how this approach can be extended to several other model-directed optimization techniques in section 6.
4 Distance-Based Bias
4.1 Additively Decomposable Functions
For many optimization problems, the objective function (fitness function) can be expressed in the form of an additively decomposable function (ADF) of m subproblems:
$$f(X_1, \ldots, X_n) = \sum_{i=1}^{m} f_i(S_i), \qquad (5)$$
where (X1, . . . , Xn) are the problem's decision variables, fi is the ith subfunction, and Si ⊆ {X1, X2, . . . , Xn} is the subset of variables contributing to fi (the subsets {Si} can overlap). While there may often exist multiple ways of decomposing the problem using additive decomposition, one would typically prefer decompositions that minimize the sizes of the subsets {Si}. As an example, consider the following objective function for a problem with 6 variables:
$$f_{\text{example}}(X_1, X_2, X_3, X_4, X_5, X_6) = f_1(X_1, X_2, X_5) + f_2(X_3, X_4) + f_3(X_2, X_5, X_6).$$
In the above objective function, there are three subsets of variables, S1 = {X1, X2, X5}, S2 = {X3, X4}, and S3 = {X2, X5, X6}, and three subfunctions {f1, f2, f3}, each of which can be defined arbitrarily.
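The additive structure is straightforward to capture in code. The following Python sketch builds an ADF from its subsets and subfunctions and evaluates the 6-variable example above (with 0-based indices; the concrete subfunctions are chosen arbitrarily for illustration, since the text leaves them unspecified).

```python
def make_adf(subsets, subfunctions):
    """Build f(x) = sum_i f_i(x restricted to S_i) from index tuples S_i
    and callables f_i."""
    def f(x):
        return sum(fi(tuple(x[j] for j in Si))
                   for Si, fi in zip(subsets, subfunctions))
    return f

# S1 = {X1, X2, X5}, S2 = {X3, X4}, S3 = {X2, X5, X6} as 0-based indices:
subsets = [(0, 1, 4), (2, 3), (1, 4, 5)]
subfunctions = [sum, lambda v: v[0] * v[1], max]  # arbitrary choices
f_example = make_adf(subsets, subfunctions)
print(f_example([1, 0, 1, 1, 0, 1]))  # f1 + f2 + f3 = 1 + 1 + 1 = 3
```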
It is of note that the difficulty of an ADF is not determined solely by the order (size) of its subproblems, but also by the definition of the subproblems and their interactions. In fact, there exist a number of NP-complete problems that can be formulated as ADFs with subproblems of order 2 or 3, such as MAXSAT for 3-CNF formulas. On the other hand, one may easily define ADFs with subproblems of order n that can be solved by a simple bit-flip hill climbing in low-order polynomial time.
4.2 Measuring Variable Distances for ADFs
The definition of a distance between two variables of an ADF used in this paper is based on the work of Hauschild and Pelikan (2008) and Hauschild et al. (2012). Given an additively decomposable problem with n variables, we define the distance between two variables using a graph G of n nodes, one node per variable. For any two variables Xi and Xj in the same subset Sk, that is, Xi, Xj ∈ Sk, we create an edge in G between the nodes Xi and Xj. See fig. 2 for an example of an ADF and the corresponding graph. Denoting by li,j the length of the shortest path between Xi and Xj in G (in terms of the number of edges), we define the distance between two variables as
$$D(X_i, X_j) = \begin{cases} l_{i,j} & \text{if a path between } X_i \text{ and } X_j \text{ exists,} \\ n & \text{otherwise.} \end{cases} \qquad (6)$$
Fig. 2 illustrates the distance metric on a simple example. The above distance measure makes variables in the same subproblem close to each other, whereas for the remaining variables, the distances correspond to the length of the chain of subproblems that relate the two variables. The distance is maximal for variables that are completely independent (the value of one variable does not influence the contribution of the other variable in any way).
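Computing eq. (6) for all pairs amounts to building the graph G and running a breadth-first search from each node. A minimal Python sketch (the function name is ours):

```python
from collections import deque

def variable_distances(n, subsets):
    """All-pairs D(X_i, X_j) of eq. (6): connect variables that share a
    subset, BFS from every node, and assign distance n to unreachable pairs."""
    adj = [set() for _ in range(n)]
    for S in subsets:
        for i in S:
            for j in S:
                if i != j:
                    adj[i].add(j)
    D = [[n] * n for _ in range(n)]
    for src in range(n):
        D[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v != src and D[src][v] == n:
                    D[src][v] = D[src][u] + 1
                    queue.append(v)
    return D
```

On the 6-variable example above, variable_distances(6, [(0, 1, 4), (2, 3), (1, 4, 5)]) gives D(X1, X6) = 2 (via X2 or X5) and D(X1, X3) = 6, since X3 is disconnected from X1.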
Since interactions between problem variables are encoded mainly in the subproblems of the additive problem decomposition, the above distance metric should typically correspond closely to the likelihood of dependencies between problem variables in the probabilistic models discovered by EDAs. Specifically, variables located closer to each other with respect to the metric should be more likely to interact with each other. Fig. 3 illustrates this on two ADFs discussed later in this paper: the NK landscape with nearest-neighbor interactions and the two-dimensional Ising spin glass (for a description of these problems, see section 5.1). The figure analyzes the probabilistic models discovered by hBOA in 10 independent runs on each of the 1,000 random instances for each problem and problem size. For a range of distances d between problem variables, the figure shows how often the models contain a dependency between variables at distance d.
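Per-distance statistics of this kind can be mined directly from stored model structures. The sketch below aggregates, for each distance, the fraction of variable pairs at that distance that are connected in the previous-run models; this simple aggregation is our illustration (the paper's exact statistic differs in details), and its output is the kind of dist_stats table consumed by the prior-bias sketch given earlier.

```python
from collections import defaultdict

def dependency_stats_by_distance(models, D):
    """For each distance d, the fraction of (model, variable pair)
    combinations at distance d that contain a dependency.

    models: list of undirected edge sets {(i, j), ...}, one per prior run;
    D: all-pairs distance matrix from variable_distances().
    """
    n = len(D)
    connected = defaultdict(int)  # distance -> dependencies observed
    total = defaultdict(int)      # distance -> pairs examined over all models
    for edges in models:
        for i in range(n):
            for j in range(i + 1, n):
                total[D[i][j]] += 1
                if (i, j) in edges or (j, i) in edges:
                    connected[D[i][j]] += 1
    return {d: connected[d] / total[d] for d in total}
```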