Transfer Learning, Soft Distance-Based Bias, and the Hierarchical BOA
DESCRIPTION
An automated technique has recently been proposed to transfer learning in the hierarchical Bayesian optimization algorithm (hBOA) based on distance-based statistics. The technique enables practitioners to improve hBOA efficiency by collecting statistics from probabilistic models obtained in previous hBOA runs and using the obtained statistics to bias future hBOA runs on similar problems. The purpose of this paper is threefold: (1) test the technique on several classes of NP-complete problems, including MAXSAT, spin glasses, and minimum vertex cover; (2) demonstrate that the technique is effective even when previous runs were done on problems of different size; (3) provide empirical evidence that combining transfer learning with other efficiency enhancement techniques can often yield nearly multiplicative speedups.
TRANSCRIPT
Transfer Learning, Soft Distance-Based Bias, and the Hierarchical BOA
Martin Pelikan
Missouri Estimation of Distribution Algorithms Laboratory (MEDAL), University of Missouri, St. Louis, MO
E-mail: [email protected] WWW: http://martinpelikan.net/
Mark W. Hauschild
Missouri Estimation of Distribution Algorithms Laboratory (MEDAL), University of Missouri, St. Louis, MO
E-mail: [email protected]
Pier Luca Lanzi
Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milano, Italy
E-mail: [email protected] WWW: http://www.pierlucalanzi.net/
Background
• Model-directed optimizers (MDOs), such as estimation of distribution algorithms, learn and use models to solve difficult optimization problems scalably and reliably.
• MDOs often provide more than the solution; they provide a set of models that reveal information about the problem. Why not use that information in future runs?
Purpose
• Combine prior models with a problem-specific distance metric to solve new problem instances with increased speed, accuracy, and reliability.
• Focus on the hBOA algorithm and additively decomposable functions, although the approach can be generalized to other MDOs and other problem classes.
• Extend previous work, mainly to demonstrate that:
  • Previous MDO runs on smaller problems can be used to bias runs on larger problems.
  • Previous MDO runs on one problem class can be used to bias runs on another problem class.
Hierarchical Bayesian Optimization Algorithm (hBOA)
• Models allow hBOA to learn and use problem structure.
• To build models, hBOA uses Bayesian metrics that combine the current population and prior knowledge (via a prior distribution over structures and parameters).
[Diagram: current population → selected population → Bayesian network model → new population]
Basic Idea of the Approach
• Outline of the approach:
  1. Define distances between problem variables.
  2. Mine probabilistic models from previous runs for model regularities with respect to distances.
  3. Use the obtained information to bias model building when solving new problem instances.
• Apply the approach to hBOA and additively decomposable functions (ADFs).
• The distance metric was developed for ADFs, but other metrics can be used.
• The bias is incorporated by modifying prior probabilities of network structures when building the models; its strength is controlled by a parameter κ > 0 (a sketch of one possible form follows below).
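To make the last point concrete, the following minimal Python sketch shows one plausible way to turn distance statistics mined from previous runs into a structural log-prior. It assumes a prior of the form p(B) ∝ κ^(Σ s(d)), where s(d) is a log-proportion statistic per edge; the function names, the flooring constant, and this exact exponent are illustrative assumptions, not the paper's precise formulation.

```python
import math

def edge_exponent(dist_stats, distance, floor=1e-3):
    """s(d): log2 of the (floored) proportion of previous-run models that
    contained a dependency at this distance. Negative, and more negative
    for distances at which dependencies were rare."""
    return math.log2(max(dist_stats.get(distance, 0.0), floor))

def structure_log_prior(edges, distance_fn, dist_stats, kappa):
    """log p(B) for a candidate network structure B (a set of edges),
    assuming p(B) is proportional to kappa ** (sum of edge exponents).

    With kappa > 1, edges at distances that were rare in prior runs are
    penalized; kappa = 1 yields a uniform (unbiased) structural prior.
    """
    exponent = sum(edge_exponent(dist_stats, distance_fn(i, j))
                   for (i, j) in edges)
    return exponent * math.log(kappa)
```

During model building, such a log-prior would simply be added to the data-driven Bayesian score of each candidate structure, so the single parameter κ trades off prior experience against evidence from the current population.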
Distance Metric for ADFs
• ADF: f(X1, ..., Xn) = Σᵢ fᵢ(Sᵢ), where {Sᵢ} are subsets of variables and {fᵢ} are arbitrary functions.
• Create a graph for the ADF with one node per variable by connecting variables that appear in the same subproblem.
• The number of edges along the shortest path between two nodes defines their distance; for disconnected variables, the distance equals the number of variables.
• Other distance metrics can be used (e.g., for QAP and scheduling).
Selected Results
• Problem classes:
  • Nearest-neighbor NK landscapes.
  • Spin glasses (2D and 3D).
  • MAXSAT for transformed graph coloring.
  • Minimum vertex cover for random graphs.
• Use bias from smaller problems on bigger problems (compared to bias from problems of the same size): NK landscapes, MVC, spin glass (2D).
• Use bias from another problem class: NK landscapes, MVC (easier), MVC (harder).
• Summary of results (many results omitted here):
  • Significant speedups were obtained in all test cases.
  • Bias based on runs on smaller problems and on other problem classes yields significant speedups as well.
  • This demonstrates the flexibility, practicality, and potential of the approach.
  • The approach can be adapted to other MDOs (LTGA, ECGA, DSMGA, ...).
[Figure 9: Speedups obtained on NK landscapes and 2D spin glasses without using local search. Panels: (a) NK landscapes with nearest neighbors (n = 50, 60, 70; k = 5); (b) 2D ±J Ising spin glass (n = 64, 100, 144). Plots show multiplicative speedup w.r.t. CPU time and percentage of instances with improved execution time vs. kappa (strength of bias, 1 to 10), and average CPU speedup (multiplicative) vs. problem size n for kappa = 2, 4, 6, 8, 10; the base case corresponds to no speedup.]
[Figure 10: Speedups obtained on all test problems except for MAXSAT with bias based on models obtained from problems of smaller size, compared to the base case with bias based on the same problem size. Panels: (a) NK landscapes with nearest neighbors, n = 200, k = 5 (bias from n = 150 vs. n = 200); (b) minimum vertex cover, n = 200, c = 2; (c) minimum vertex cover, n = 200, c = 4; (d) 2D ±J Ising spin glass, n = 20 × 20 = 400 (bias from n = 324 vs. n = 400); (e) 3D ±J Ising spin glass, n = 7 × 7 × 7 = 343 (bias from n = 216 vs. n = 343). Plots show multiplicative speedup w.r.t. CPU time vs. kappa (strength of bias, 1 to 10).]
[Figure 11: Speedups obtained on NK landscapes and minimum vertex cover with the bias from models on another problem class compared to the base case (the base case corresponds to 10-fold crossvalidation using the models obtained on the problems from the same class). Panels: (a) NK landscapes with nearest neighbors, n = 200, k = 5 (models from MVC with c = 2 and c = 4); (b) minimum vertex cover, n = 200, c = 2 (models from NK and from MVC with c = 2.0 and c = 4.0); (c) minimum vertex cover, n = 200, c = 4 (models from NK and from MVC with c = 2.0 and c = 4.0). Plots show multiplicative speedup w.r.t. CPU time vs. kappa (strength of bias, 1 to 10).]
of experiments; the results of these experiments can be found in fig. 11. Specifically, the figure presents the results of using bias from the NK landscapes while solving the minimum vertex cover instances for the two sets of graphs (with c = 2 and c = 4), the results of using bias from the minimum vertex cover instances for both classes of graphs when solving the NK landscapes, and also the results of using bias from the minimum vertex cover on another class of graphs (with different c). It can be seen that significant speedups were obtained even when using models from another problem class, although in some cases the speedups obtained when using the same problem class were significantly better. Some combinations of problem classes discussed in this paper were omitted because of the high computational demands of testing every pair of problems against each other. Despite that, the results presented here indicate that even bias based on problems from a different class can be beneficial.
5.3.5 Combining Distance-Based Bias and Sporadic Model Building
While most efficiency enhancement techniques provide significant speedups in their own right (Hauschild and Pelikan, 2008; Hauschild et al., 2012; Lozano et al., 2001; Mendiburu et al., 2007; Ocenasek, 2002; Pelikan et al., 2008; Sastry et al., 2006), their combinations are usually significantly more powerful because the speedups often multiply. For example, with a speedup of 4 due to sporadic model building (Pelikan et al., 2008) and a speedup of 20 due to parallelization using 20 computational nodes (Lozano et al., 2001; Ocenasek and Schwarz, 2000), combining the two techniques can be expected to yield a speedup of about 80; obtaining a comparable speedup using parallelization on its own would require at least 4 times as many computational nodes. How will the distance-based bias from prior hBOA runs combine with other efficiency enhancements?
To illustrate the benefits of combining the distance-based bias discussed in this paper with other efficiency enhancement techniques, we combined the approach with sporadic model building and tested the two efficiency enhancement techniques separately and in combination. Fig. 12 shows the results of these experiments. The results show that even in cases where the speedup due to distance-based bias is only moderate, the overall speedup increases substantially due to the combination of the two techniques. Nearly multiplicative speedups can be observed in most scenarios.
6 Thoughts on Extending the Approach to Other Model-Directed Optimizers
6.1 Extended Compact Genetic Algorithm
The extended compact genetic algorithm (ECGA) (Harik, 1999) uses marginal product models (MPMs) to guide optimization. An MPM splits variables into disjoint linkage groups. The distribution of the variables in each group is encoded by a probability table that specifies a probability for each combination of values of these variables. The overall probability distribution is given by the product of the distributions of the individual linkage groups. In comparison with Bayesian network models, instead of connecting variables that influence each other strongly using directed edges, ECGA puts such variables into the same linkage group.
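As a concrete illustration of this model class (not of ECGA's model search, which selects the grouping using a minimum description length metric), here is a minimal Python sketch of a marginal product model with fixed linkage groups; the class and method names are ours, not ECGA's.

```python
import random
from collections import Counter

class MarginalProductModel:
    """Minimal MPM sketch: disjoint linkage groups, each with its own
    probability table over the group's joint value combinations."""

    def __init__(self, groups):
        self.groups = groups                     # e.g. [(0, 1, 4), (2, 3), (5,)]
        self.tables = [Counter() for _ in groups]

    def fit(self, population):
        """Estimate each group's table from the selected population."""
        for table, group in zip(self.tables, self.groups):
            table.clear()
            for individual in population:
                table[tuple(individual[i] for i in group)] += 1

    def sample(self, n_vars):
        """Sample each group independently; the overall distribution is
        the product of the group distributions."""
        child = [None] * n_vars
        for table, group in zip(self.tables, self.groups):
            values, weights = zip(*table.items())
            chosen = random.choices(values, weights=weights)[0]
            for var, val in zip(group, chosen):
                child[var] = val
        return child
```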
More Information
• Download MEDAL Reports No. 2012004 and 2012007:
  http://medal-lab.org/files/2012004.pdf
  http://medal-lab.org/files/2012007.pdf
Acknowledgments
• NSF grants ECS-0547013 and IIS-1115352.
• ITS at the University of Missouri in St. Louis.
• University of Missouri Bioinformatics Consortium.
• Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
http://medal-lab.org
in hBOA for additively decomposable problems based on a problem-specific distance metric. However, note that the framework can be applied to many other model-directed optimization techniques and the distance function can be defined in many other ways. To illustrate this, we outline how this approach can be extended to several other model-directed optimization techniques in section 6.
4 Distance-Based Bias
4.1 Additively Decomposable Functions
For many optimization problems, the objective function (fitness function) can be expressed in the form of an additively decomposable function (ADF) of m subproblems:
$$f(X_1, \ldots, X_n) = \sum_{i=1}^{m} f_i(S_i), \qquad (5)$$
where (X1, . . . , Xn) are the problem's decision variables, fi is the ith subfunction, and Si ⊆ {X1, X2, . . . , Xn} is the subset of variables contributing to fi (the subsets {Si} can overlap). While there may often exist multiple ways of decomposing the problem using additive decomposition, one would typically prefer decompositions that minimize the sizes of the subsets {Si}. As an example, consider the following objective function for a problem with 6 variables:
$$f_{\text{example}}(X_1, X_2, X_3, X_4, X_5, X_6) = f_1(X_1, X_2, X_5) + f_2(X_3, X_4) + f_3(X_2, X_5, X_6).$$
In the above objective function, there are three subsets of variables, S1 = {X1, X2, X5}, S2 = {X3, X4}, and S3 = {X2, X5, X6}, and three subfunctions {f1, f2, f3}, each of which can be defined arbitrarily.
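The additive structure is straightforward to capture in code. The following Python sketch builds an ADF from its subsets and subfunctions and evaluates the 6-variable example above (with 0-based indices; the concrete subfunctions are chosen arbitrarily for illustration, since the text leaves them unspecified).

```python
def make_adf(subsets, subfunctions):
    """Build f(x) = sum_i f_i(x restricted to S_i) from index tuples S_i
    and callables f_i."""
    def f(x):
        return sum(fi(tuple(x[j] for j in Si))
                   for Si, fi in zip(subsets, subfunctions))
    return f

# S1 = {X1, X2, X5}, S2 = {X3, X4}, S3 = {X2, X5, X6} as 0-based indices:
subsets = [(0, 1, 4), (2, 3), (1, 4, 5)]
subfunctions = [sum, lambda v: v[0] * v[1], max]  # arbitrary choices
f_example = make_adf(subsets, subfunctions)
print(f_example([1, 0, 1, 1, 0, 1]))  # f1 + f2 + f3 = 1 + 1 + 1 = 3
```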
It is of note that the difficulty of an ADF is not determined solely by the order (size) of its subproblems, but also by the definition of the subproblems and their interactions. In fact, there exist a number of NP-complete problems that can be formulated as ADFs with subproblems of order 2 or 3, such as MAXSAT for 3-CNF formulas. On the other hand, one may easily define ADFs with subproblems of order n that can be solved by a simple bit-flip hill climbing in low-order polynomial time.
4.2 Measuring Variable Distances for ADFs
The definition of a distance between two variables of an ADF used in this paper is based on the work of Hauschild and Pelikan (2008) and Hauschild et al. (2012). Given an additively decomposable problem with n variables, we define the distance between two variables using a graph G of n nodes, one node per variable. For any two variables Xi and Xj in the same subset Sk, that is, Xi, Xj ∈ Sk, we create an edge in G between the nodes Xi and Xj. See fig. 2 for an example of an ADF and the corresponding graph. Denoting by li,j the length of the shortest path between Xi and Xj in G (in terms of the number of edges), we define the distance between two variables as
$$D(X_i, X_j) = \begin{cases} l_{i,j} & \text{if a path between } X_i \text{ and } X_j \text{ exists,} \\ n & \text{otherwise.} \end{cases} \qquad (6)$$
Fig. 2 illustrates the distance metric on a simple example. The above distance measure makes variables in the same subproblem close to each other, whereas for the remaining variables, the distances correspond to the length of the chain of subproblems that relate the two variables. The distance is maximal for variables that are completely independent (the value of one variable does not influence the contribution of the other variable in any way).
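Computing eq. (6) for all pairs amounts to building the graph G and running a breadth-first search from each node. A minimal Python sketch (the function name is ours):

```python
from collections import deque

def variable_distances(n, subsets):
    """All-pairs D(X_i, X_j) of eq. (6): connect variables that share a
    subset, BFS from every node, and assign distance n to unreachable pairs."""
    adj = [set() for _ in range(n)]
    for S in subsets:
        for i in S:
            for j in S:
                if i != j:
                    adj[i].add(j)
    D = [[n] * n for _ in range(n)]
    for src in range(n):
        D[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v != src and D[src][v] == n:
                    D[src][v] = D[src][u] + 1
                    queue.append(v)
    return D
```

On the 6-variable example above, variable_distances(6, [(0, 1, 4), (2, 3), (1, 4, 5)]) gives D(X1, X6) = 2 (via X2 or X5) and D(X1, X3) = 6, since X3 is disconnected from X1.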
Since interactions between problem variables are encoded mainly in the subproblems of the additive problem decomposition, the above distance metric should typically correspond closely to the likelihood of dependencies between problem variables in the probabilistic models discovered by EDAs. Specifically, variables located closer to each other with respect to the metric should be more likely to interact with each other. Fig. 3 illustrates this on two ADFs discussed later in this paper: the NK landscape with nearest-neighbor interactions and the two-dimensional Ising spin glass (for a description of these problems, see section 5.1). The figure analyzes the probabilistic models discovered by hBOA in 10 independent runs on each of the 1,000 random instances for each problem and problem size. For a range of distances d between problem variables, the figure shows how often the models contain a dependency between variables at distance d.
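Per-distance statistics of this kind can be mined directly from stored model structures. The sketch below aggregates, for each distance, the fraction of variable pairs at that distance that are connected in the previous-run models; this simple aggregation is our illustration (the paper's exact statistic differs in details), and its output is the kind of dist_stats table consumed by the prior-bias sketch given earlier.

```python
from collections import defaultdict

def dependency_stats_by_distance(models, D):
    """For each distance d, the fraction of (model, variable pair)
    combinations at distance d that contain a dependency.

    models: list of undirected edge sets {(i, j), ...}, one per prior run;
    D: all-pairs distance matrix from variable_distances().
    """
    n = len(D)
    connected = defaultdict(int)  # distance -> dependencies observed
    total = defaultdict(int)      # distance -> pairs examined over all models
    for edges in models:
        for i in range(n):
            for j in range(i + 1, n):
                total[D[i][j]] += 1
                if (i, j) in edges or (j, i) in edges:
                    connected[D[i][j]] += 1
    return {d: connected[d] / total[d] for d in total}
```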