adam padee ( [email protected] ) wojciech padee ( [email protected] )

Cracow Grid Workshop – 15-18 October 2006

Large-Scale Evolutionary Optimization on the Grid: Multiple-Deme Genetic Algorithm in the Globus-Based Environment

Adam Padee ([email protected])

Wojciech Padee ([email protected])

Krzysztof Zaremba ([email protected])


Goals of the project

• Create a tool for numerical optimization of complex problems that are:
  • Computationally very expensive
  • Impossible to solve using classical, gradient-based methods (too many local optima)

• Utilize evolutionary algorithms, as they don't rely directly on the gradient vector

• Objective function calls an external program and parses its output
  • Easy adaptation to new tasks

• Example application: track reconstruction optimization in HEP experiments (shown at the end of this presentation)
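
A minimal sketch of an objective function that calls an external program and parses its output. The output convention (fitness on the last non-empty line) and the stand-in command `DEMO_COMMAND` are assumptions for illustration, not the interface used in the project.

```python
import subprocess
import sys

def evaluate(params, command):
    """Run an external program with the parameters as command-line arguments
    and return the fitness parsed from its standard output."""
    args = list(command) + [repr(float(p)) for p in params]
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    # Assumed convention: the last non-empty output line holds one number.
    lines = [ln for ln in result.stdout.splitlines() if ln.strip()]
    return float(lines[-1])

# Hypothetical stand-in for the external program: a Python one-liner that
# prints the sum of squares of its arguments.
DEMO_COMMAND = [
    sys.executable, "-c",
    "import sys; print(sum(float(a) ** 2 for a in sys.argv[1:]))",
]

if __name__ == "__main__":
    print(evaluate([1.0, 2.0, 3.0], DEMO_COMMAND))
```

Adapting such a wrapper to a new task would only require a different command and a different parsing rule.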


Common architectures of the parallel evolutionary algorithms (1)

• Master-slave
  • One population is stored on a server (master node); calculation of the fitness function values is distributed among the worker nodes (slaves)
    • Synchronous
    • Asynchronous (split population or SSGA, Steady State Genetic Algorithm)

• Multiple-population algorithms (also called coarse-grained)
  • They consist of multiple independent populations that exchange only selected individuals. The frequency of the exchanges, the migration channels and the operators applied to the individuals depend on the model, e.g.:
    • Fully connected topology (especially suitable for parallel supercomputers)
    • Island Model (arbitrary topology; simple but less frequent migrations)
    • Pollen transmission
    • Social
    • ...
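
The synchronous master-slave scheme can be sketched as follows. A thread pool stands in for the worker nodes, and the `sphere` fitness function is a toy assumption; the point is only that the master waits for all fitness values before continuing.

```python
from concurrent.futures import ThreadPoolExecutor

def sphere(individual):
    """Toy fitness function (assumption): sum of squares, to be minimized."""
    return sum(x * x for x in individual)

def evaluate_generation(population, fitness, workers=4):
    """Synchronous master-slave step: the master hands every individual to a
    pool of workers and waits until all fitness values are back."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns the results in the order of the population.
        return list(pool.map(fitness, population))

population = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]
scores = evaluate_generation(population, sphere)
```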


Common architectures of the parallel evolutionary algorithms (2)

• Cellular (also called fine-grained)
  • One population divided spatially among neighboring processors. Each of them can process one or more individuals. Selection and crossing-over take place only among neighbors. Most popular implementations:
    • Hardware (dedicated integrated circuits)
    • Software: usually on SIMD processors, although there are also very efficient implementations on ccNUMA (Cache Coherent Non-Uniform Memory Access) architectures

• Hierarchical
  • Coarse-grained algorithms consisting of multiple cellular or master-slave algorithms. This is the most advanced and also the most flexible architecture.
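
To illustrate selection restricted to neighbors in a cellular algorithm, here is a small sketch. The toroidal grid, the four-cell (von Neumann) neighborhood and the "pick the best neighbor" rule are assumptions chosen for brevity.

```python
def neighbors(row, col, rows, cols):
    """Von Neumann neighborhood (up, down, left, right) on a toroidal grid."""
    return [
        ((row - 1) % rows, col),
        ((row + 1) % rows, col),
        (row, (col - 1) % cols),
        (row, (col + 1) % cols),
    ]

def best_neighbor(fitness_grid, row, col):
    """Choose the mate for cell (row, col): the neighbor with the lowest
    fitness value (minimization is assumed)."""
    rows, cols = len(fitness_grid), len(fitness_grid[0])
    return min(neighbors(row, col, rows, cols),
               key=lambda rc: fitness_grid[rc[0]][rc[1]])

grid = [
    [5.0, 1.0, 7.0],
    [4.0, 9.0, 2.0],
    [8.0, 3.0, 6.0],
]
```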


Asynchronous master-slave SSGA on a single cluster

1. Create an empty population.

2. Create an empty execution queue (an internal object mapped to one of the physical queues in the batch system).

3. If there are fewer than two free places in the execution queue, go to step 5.

4. Check whether there are free places in the population:

a) If there are, create two random individuals, place them in the execution queue and return to step 3.

b) If not, select two individuals using the reproduction operator, apply crossing-over and mutation, place the offspring in the execution queue and return to step 3.

5. Wait until one of the clients finishes its work. Collect the results.

6. If there are free places in the population, place the newcomer in one of them. If not, select the individual to replace using the reverse reproduction operator (tournament, proportional or random).

7. Check whether the stop criterion has been reached. If yes, terminate the program; otherwise return to step 3.
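
The steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's implementation: a thread pool replaces the batch system, the toy `sphere` function replaces the external program, and the stop criterion is a fixed evaluation budget (`max_evals`).

```python
import random
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def sphere(ind):
    """Toy objective (assumption): sum of squares, to be minimized."""
    return sum(x * x for x in ind)

def ssga(fitness, dim=2, pop_size=10, queue_size=4, max_evals=200, seed=0):
    rng = random.Random(seed)
    population = []   # step 1: list of (fitness, individual) pairs
    pending = {}      # step 2: execution queue, future -> individual
    submitted = 0
    with ThreadPoolExecutor(max_workers=queue_size) as pool:
        while True:
            # Steps 3-4: refill the queue two individuals at a time.
            while queue_size - len(pending) >= 2 and submitted < max_evals:
                if len(population) < pop_size:
                    pair = [[rng.uniform(-5, 5) for _ in range(dim)]
                            for _ in range(2)]
                else:
                    # Reproduction by binary tournament, then one-point
                    # crossing-over and Gaussian mutation.
                    a, b = (min(rng.sample(population, 2))[1]
                            for _ in range(2))
                    cut = rng.randrange(1, dim) if dim > 1 else 0
                    pair = [a[:cut] + b[cut:], b[:cut] + a[cut:]]
                    pair = [[x + rng.gauss(0, 0.1) for x in c] for c in pair]
                for ind in pair:
                    pending[pool.submit(fitness, ind)] = ind
                    submitted += 1
            if not pending:
                break   # step 7: budget used up and nothing left to collect
            # Step 5: wait until at least one worker has finished.
            done, _ = wait(list(pending), return_when=FIRST_COMPLETED)
            for fut in done:
                newcomer = (fut.result(), pending.pop(fut))
                # Step 6: use a free place, or replace a tournament loser.
                if len(population) < pop_size:
                    population.append(newcomer)
                else:
                    i, j = rng.sample(range(pop_size), 2)
                    worse = i if population[i] > population[j] else j
                    population[worse] = newcomer
    return population
```

Because results are collected in whatever order the workers finish, two runs with the same seed need not produce identical populations.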


Implementation details (LSF and OpenPBS)

• Master process runs on the batch system server (or on the designated UI machine - LSF):
  • Creates new individuals and applies the genetic operators
  • Registers input data in MSS
  • Runs and monitors the slave processes
  • Collects the results using batch system mechanisms and assigns the fitness values to the individuals in the execution queue (the execution queue is the program's internal object; the mapping to the batch system queue is done via the appropriate API)

• Batch system introduces a delay of a couple of seconds:
  • Registration in the queue
  • Selection of the free CPU
  • Transfer of the parameters
  • Monitoring
  • Gathering results

• With job flow around 50-100 jobs/sec, the failure rate doesn't exceed 10% (in a real-life application: RECON 2000)


Flat master-slave on the Grid

• At first glance, implementation is relatively easy
  • Convenient API/CLI functions for job submission
  • Single sign-on allows the master process to operate autonomously
  • Global file systems (e.g. LFC) facilitate data access

But ...

• Approximately 100 times more processors (hence 100 times more slaves, and very high network bandwidth requirements)

• Complicated task monitoring and error analysis

• Job submission overhead can reach the order of minutes for a single job

• RB + L&B is not prepared for massive submission of short jobs (hence frequent failures and disturbance for other users)
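
A short worked example of why submission overhead matters for short jobs. The concrete durations are hypothetical; only their orders of magnitude (a couple of seconds on a local batch system, minutes on the Grid, about 10 minutes per evaluation in the example application) come from the slides.

```python
def useful_fraction(eval_seconds, overhead_seconds):
    """Fraction of a job's wall-clock time spent on the actual evaluation."""
    return eval_seconds / (eval_seconds + overhead_seconds)

# Hypothetical numbers for illustration.
local_long = useful_fraction(600, 2)    # 10 min job, 2 s batch delay
grid_long = useful_fraction(600, 180)   # 10 min job, 3 min Grid overhead
grid_short = useful_fraction(30, 180)   # 30 s job, 3 min Grid overhead
```

With these assumed numbers, about 99.7% of the time is useful on the local cluster, about 77% for the long job on the Grid, and only about 14% for the short job on the Grid.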


Island Model GA: basic concept

[Diagram: six islands, GA 1 to GA 6]

Growth phase: population on each of the islands is being developed independently


Island Model GA: basic concept

[Diagram: six islands, GA 1 to GA 6, connected by migration channels]

Migration phase: each population selects one or more individuals (usually the best ones) and sends them to the neighboring island, where the immigrants are introduced into the local population.
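
One migration phase can be sketched as below. The directed ring topology and the "best replaces worst" policy are assumptions; the slides allow arbitrary topologies and other selection rules.

```python
def migrate_ring(islands, n_migrants=1):
    """One migration phase on a directed ring: island i sends copies of its
    best individuals to island i+1, where they replace the worst ones.
    Each island is a list of (fitness, individual); lower fitness is better."""
    # Select all emigrants first, so the result does not depend on the
    # order in which the islands are processed.
    emigrants = [sorted(isl)[:n_migrants] for isl in islands]
    for i, group in enumerate(emigrants):
        target = islands[(i + 1) % len(islands)]
        target.sort()
        # Overwrite the worst residents with the immigrants.
        target[len(target) - len(group):] = group
    return islands

islands = [
    [(1.0, "a1"), (9.0, "a2")],
    [(2.0, "b1"), (8.0, "b2")],
    [(3.0, "c1"), (7.0, "c2")],
]
migrate_ring(islands)
```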


Island model: parameters

• Size of the member populations

• Migration topology (directed graph)

• Frequency of the migrations

• Selection of the migrants and adoption of the immigrants

These parameters have a big influence on the convergence speed, but the optimal choice of their values depends strongly on the optimized function and on the infrastructure used (type of computer, cost of CPU cycles vs. communication). There are models based on Markov chains that allow these values to be calculated for a given probability of reaching the global optimum, but their applicability is limited to very simple cases.
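
As an illustration, the four parameters can be collected in one configuration object, with the migration topology stored as a directed graph. The class name, field names and default values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class IslandModelConfig:
    deme_sizes: dict                 # island name -> population size
    topology: dict                   # island name -> list of destination islands
    migration_interval: int = 10     # generations between migrations
    migrants_per_channel: int = 1
    migrant_selection: str = "best"  # how emigrants are chosen
    replacement: str = "worst"       # which residents immigrants replace

    def validate(self):
        """Check that every island named in the topology has a deme size."""
        for source, targets in self.topology.items():
            for name in [source] + list(targets):
                if name not in self.deme_sizes:
                    raise ValueError(f"unknown island: {name}")

def fully_connected(names):
    """Every island sends migrants to every other island."""
    return {a: [b for b in names if b != a] for a in names}

def ring(names):
    """Every island sends migrants to the next one in the list."""
    return {a: [names[(i + 1) % len(names)]] for i, a in enumerate(names)}
```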


Flat Island Model GA on the Grid

• One deme per CPU requires high migration rates (bandwidth problems)

• Flat model of communication is hard to implement across sites:
  • Grid-wide MPI is not available in big production grids like EGEE
  • Introduction of a dedicated service is not flexible
  • Exchange of information via replicas in LFC is possible, but it is a very slow and inefficient solution


Hybrid algorithm: Islands with master-slave SSGA populations

• One island is formed on each cluster.
  • Thanks to fast internal communication, the master-slave algorithm for clusters can be used with only slight modifications

• Master is running on the gatekeeper (CE), which is usually a batch system server or at least has proper rights to run batch jobs directly (via qsub or bsub)

• This machine has outbound and inbound IP connectivity with other sites (at least GLOBUS_TCP_PORT_RANGE ports are open)

• Communication with other islands is possible in any topology

• Relatively big population size at each island allows lower migration rates

• Migrants can be exchanged also via files in LFC
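
The file-based exchange of migrants can be sketched as follows. A local temporary directory stands in for the LFC, and the file naming scheme and JSON format are assumptions; no LFC commands or APIs are used here.

```python
import json
import os
import tempfile

def publish_migrants(shared_dir, island, epoch, migrants):
    """Write this island's emigrants to a file named after island and epoch."""
    path = os.path.join(shared_dir, f"{island}_{epoch:06d}.json")
    tmp = path + ".tmp"
    with open(tmp, "w") as handle:
        json.dump(migrants, handle)
    os.replace(tmp, path)   # rename, so readers never see a half-written file
    return path

def collect_immigrants(shared_dir, island, seen):
    """Read every migrant file of the other islands that was not read before."""
    immigrants = []
    for name in sorted(os.listdir(shared_dir)):
        if not name.endswith(".json") or name in seen:
            continue
        if name.startswith(island + "_"):
            continue        # skip our own emigrants
        with open(os.path.join(shared_dir, name)) as handle:
            immigrants.extend(json.load(handle))
        seen.add(name)
    return immigrants

shared = tempfile.mkdtemp()
publish_migrants(shared, "siteA", 1, [{"genes": [0.1, 0.2], "fitness": 3.5}])
publish_migrants(shared, "siteB", 1, [{"genes": [0.4, 0.3], "fitness": 2.5}])
seen_by_a = set()
first = collect_immigrants(shared, "siteA", seen_by_a)
second = collect_immigrants(shared, "siteA", seen_by_a)
```

Each island polls the shared location at its own pace, which fits the low migration rates mentioned above.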


Hybrid algorithm (LFC variant): start of the master processes

[Diagram labels: UI, JDL with task, Resource Broker, CondorG, Logical File Catalog, CEs with PBS, passwordless SSH, GA, Worker Nodes]


Hybrid algorithm (LFC variant): calculations

[Diagram: UI with JDL; CEs, each running a GA process] Running slave processes via PBS/LSF


Hybrid algorithm (LFC variant): migration

[Diagram: UI with JDL; CEs, each running a GA process] Registration of the migrants


Hybrid algorithm (LFC variant): migration

[Diagram: UI with JDL; CEs, each running a GA process] Readout of the immigrants


Hybrid algorithm (LFC variant): calculations

[Diagram: UI with JDL; CEs, each running a GA process]


Hybrid algorithm (LFC variant): registration of the results

[Diagram: UI with JDL; CEs, each running a GA process] Saving the best individuals


Hybrid algorithm (MPI variant): start of the master processes

[Diagram labels: UI, JDL with task, MPI-enabled Resource Broker, CondorG, CEs with PBS, GA]


Hybrid algorithm (MPI variant): calculations

[Diagram: UI with JDL; CEs, each running a GA process]


Hybrid algorithm (MPI variant): migration

[Diagram: UI with JDL; CEs, each running a GA process] Communication through MPI


Hybrid algorithm (MPI variant): calculations

[Diagram: UI with JDL; CEs, each running a GA process]


Hybrid algorithm (MPI variant): registration of the results

[Diagram: UI with JDL; CEs, each running a GA process]


Hybrid algorithm (TCP variant): start of the master processes

[Diagram labels: UI, JDL with task, Resource Broker, CondorG, CEs with PBS, passwordless SSH, GA] Registration of IP / port


Hybrid algorithm (TCP variant): start of the master processes

[Diagram: UI with JDL; CEs, each running a GA process] Readout of other machines' addresses


Hybrid algorithm (TCP variant): calculations

[Diagram: UI with JDL; CEs, each running a GA process]


Hybrid algorithm (TCP variant): migration

[Diagram: UI with JDL; CEs, each running a GA process] Communication through TCP/IP


Hybrid algorithm (TCP variant): calculations

[Diagram: UI with JDL; CEs, each running a GA process]


Hybrid algorithm (TCP variant): registration of the results

[Diagram: UI with JDL; CEs, each running a GA process]


Hybrid algorithm on the Grid – conclusions and problems

• Size of each deme should reflect the number of CPUs available at a site (so demes have different sizes)

• To avoid differences in the convergence speed, it is necessary to differentiate intensities and ranges of the genetic operators. For example, smaller demes have lower Gaussian mutation range, thus performing their search more locally.

• Length of the epoch at each deme should be variable and adapted with local population development

• Too early migrations may lead all the islands to a suboptimal solution

• Migrations have to be done asynchronously
  • This is a problem especially with the MPI or plain TCP versions. To overcome that difficulty, two independent processes are needed (one for migration control, one for population development and batch system management)
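
The split into a migration-control part and a population-development part can be sketched with a queue between them. Threads are used here in place of the two independent processes, and the `incoming` list stands in for the real transport (files, TCP or MPI); both are simplifying assumptions.

```python
import queue
import threading

def migration_listener(incoming, inbox, stop):
    """Migration control: move arriving migrants into the inbox."""
    for migrant in incoming:
        if stop.is_set():
            break
        inbox.put(migrant)

def absorb_immigrants(population, inbox):
    """Called by the population loop between steps; it never blocks.
    Each immigrant replaces the currently worst resident."""
    while True:
        try:
            migrant = inbox.get_nowait()
        except queue.Empty:
            return population
        population.sort()
        population[-1] = migrant

inbox = queue.Queue()
stop = threading.Event()
listener = threading.Thread(
    target=migration_listener,
    args=([(0.5, "m1"), (0.7, "m2")], inbox, stop),
)
listener.start()
listener.join()   # in a real run the listener keeps working alongside the GA

population = [(1.0, "r1"), (4.0, "r2"), (9.0, "r3")]
absorb_immigrants(population, inbox)
```

The population loop only polls the inbox, so a slow or silent neighbor never stalls the local evolution.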


Hybrid algorithm on the Grid – conclusions and problems

• MPI and TCP variants are not yet implemented
  • For the most demanding application (particle track reconstruction optimization in HEP), LFC seems to be enough
  • Inter-cluster MPI is not available (at least not in EGEE)
  • Manual TCP/IP communication is troublesome

• It is hard to guess how many job slots are really available for a given VO (different batch system configurations, not always reflected in the information index)
  • This does not affect the SSGA directly, but may eventually lead to improper adaptation of the operator ranges and intensities. Therefore, some islands may lag behind the others due to slower convergence


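Test results – simulated behavior on well-known deceptive functions

Rosenbrock function

$$f(\mathbf{x}) = 100\,(x_2 - x_1^2)^2 + (1 - x_1)^2$$

Griewank function

[Formula garbled in the slide extraction: a Griewank-type function of $x_1$, $x_2$ and a constant $c$, with a quadratic term divided by 40 and a product of two cosine terms]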
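
For reference, both test functions are easy to write down in code. The Rosenbrock function is given in its usual two-variable form, 100(x2 − x1²)² + (1 − x1)². The Griewank function below is the textbook two-dimensional form, used as a stand-in: the variant on the slide appears to be shifted and rescaled, and its exact constants are not assumed here.

```python
import math

def rosenbrock(x1, x2):
    """Two-variable Rosenbrock function; global minimum 0 at (1, 1)."""
    return 100.0 * (x2 - x1 ** 2) ** 2 + (1.0 - x1) ** 2

def griewank(x1, x2):
    """Textbook 2-D Griewank function; global minimum 0 at (0, 0).
    Not necessarily the variant used on the slide."""
    quadratic = (x1 ** 2 + x2 ** 2) / 4000.0
    oscillation = math.cos(x1) * math.cos(x2 / math.sqrt(2.0))
    return 1.0 + quadratic - oscillation
```

The Rosenbrock function has a long curved valley that slows convergence, while the Griewank function adds many regularly spaced local optima on top of a shallow bowl.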


Test results – simulated behavior on well-known deceptive functions

[Plot: Griewank function, fully connected topologies; curves for 1x100, 2x100 and 10x100 individuals; horizontal axis 0 to 50, vertical axis 997 to 1000]

[Plot: Rosenbrock function, fully connected topologies; curves for 1x50, 2x50 and 4x50 individuals; horizontal axis 0 to 50, vertical axis 1.5 to 4 (x 10^5)]


"Real life" application – optimization of particle track reconstruction

[Diagram labels: Target, Bending Magnet, Detector planes]

• Input data: set of hits from the detector planes

• Output data: momenta of the charged particles

• To get the momentum we need to reconstruct the whole track first.


Problems

• Particles don’t leave traces in all the detector planes.

• Many hits originate from the background noise.

• Mathematical models used in reconstruction are simplified.

• In one trigger there are tracks from many particles.


Optimized parameters

• Geometrical tolerances on the straight parts of the tracks (areas 1 and 3)

• Number of missing planes allowed in each track

• Precision of the crossing point in the area of the magnet

• Precision of the primary interaction vertex in the target

• ...

And

• Everything in 3 dimensions + angles where applicable

• Each step consists of several iterations controlled by different parameters

• In total, about 70 parameters should be optimized simultaneously

• Evaluation of one set takes about 10 minutes
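
A rough estimate shows why parallel evaluation is needed at 10 minutes per parameter set. The total budget of 5000 evaluations and the CPU counts are hypothetical; scheduling overhead is ignored, so the result is a lower bound.

```python
import math

def wall_clock_hours(evaluations, cpus, minutes_per_eval=10.0):
    """Lower bound on the run time when evaluations are spread evenly over
    the CPUs and all scheduling overhead is ignored."""
    rounds = math.ceil(evaluations / cpus)
    return rounds * minutes_per_eval / 60.0

# Hypothetical budget: 5000 evaluations in total.
serial = wall_clock_hours(5000, 1)      # about 833 hours, i.e. roughly 35 days
parallel = wall_clock_hours(5000, 100)  # about 8.3 hours
```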


Results

[Plots: four panels; horizontal axes 0 to 20 (two panels) and 0 to 35 (two panels); vertical axes 648 to 662, 3200 to 3320, 131 to 137 and 16 to 28]

Mean number of properly and improperly reconstructed tracks (synchronous, total population size 40, 100 physical events used for fitness calculation).

Mean number of properly and improperly reconstructed tracks (asynchronous, total population size 100, 500 physical events used for fitness calculation).


