evolution of numeric constants in genetic programming

65
Evolution of Numeric Constants in Genetic Programming Thomas Fernandez Florida Atlantic University

Upload: gene

Post on 17-Jan-2016

17 views

Category:

Documents


0 download

DESCRIPTION

Evolution of Numeric Constants in Genetic Programming. Florida Atlantic University. Thomas Fernandez. Genetic Programming (GP). How can computers solve problems without being explicitly programmed? - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Evolution of Numeric Constants in Genetic Programming

Evolution of Numeric Constants in Genetic Programming

Thomas Fernandez

Florida Atlantic University

Page 2: Evolution of Numeric Constants in Genetic Programming

Genetic Programming (GP)

How can computers solve problems without being explicitly programmed?

GP is a domain independent method for inducing programs by searching the space of S-expressions.

Page 3: Evolution of Numeric Constants in Genetic Programming

Motivation

One of the weaknesses of GP is the difficulty it suffers in discovering useful numeric constants for the terminal nodes of the S-expression trees.

The goal of this work is to improve the effectiveness of GP by creating hybrid GP systems that use hill climbing and other local search techniques to improve these numeric constants.

Page 4: Evolution of Numeric Constants in Genetic Programming

Contributions

A cross-platform, object oriented, multi-processor enabled, GP/GA system implemented in C++.

A simple experimental design useful for analyzing different GP parameters, strategies and enhancements.

Page 5: Evolution of Numeric Constants in Genetic Programming

Contributions (continued)

Three hybrid GP systems using local search algorithms Multi-Dimensional Hill Climbing

(MDHC) Vector Hill Climbing (VHC) Numeric Mutation (NM)

Page 6: Evolution of Numeric Constants in Genetic Programming

Contributions (continued)

We analyze the performance of each of the three hybrid systems when applied to three problems with different difficulty.

We discuss the future directions for hybrid GP systems.

Page 7: Evolution of Numeric Constants in Genetic Programming

Charles Darwin’s principle of Natural

Selection Individuals having any

advantage, however slight, over others, have the best chance of surviving and of procreating their kind.

Over time, the population as a whole will gradually evolve to have more combinations of these good attributes.

GP simulates this process of natural selection in a computer.

“…from so simple a beginning endless forms most beautiful and most wonderful have been, and are being evolved.”

Page 8: Evolution of Numeric Constants in Genetic Programming

Genetic Algorithms (GA) Forerunner of GP Evolve binary strings that

represent the solutions to problems

Start with a population of random strings

During each generation a new population is formed.

Hopefully the solutions will improve.

Page 9: Evolution of Numeric Constants in Genetic Programming

Three steps to setting up a GA 1) Devise a binary encoding

representing the potential solutions to a problem.

2) Define a fitness function. 3) Set control parameters.

population size maximum generations probability of mutation and crossover others

Page 10: Evolution of Numeric Constants in Genetic Programming

Running a GA

Generating an initial population of random binary strings

Create next generation Randomly select elements for direct inclusion Randomly select pairs of elements for mating

Offspring will have a combination of randomly selected parts of the binary strings of both parents

Select some elements for mutation. Typically one or two random bits will be flipped

Page 11: Evolution of Numeric Constants in Genetic Programming

Running a GA (continued) Repeatedly create new

generations. Terminate when an acceptable

solution is been found or when the specified maximum number of generations is reached.

Page 12: Evolution of Numeric Constants in Genetic Programming

Running a GA/GP

Do Straight Selection

Do Selection for Mating

Do Mutation

Success

Failure

Reached Maximum Number ofGenerations?

Found Acceptable Solution?

Score PopulationGeneration_Count=Generation_Count +1

Generate Random Population

Page 13: Evolution of Numeric Constants in Genetic Programming

Genetic Programming Elements

The individual elements in the GA’s population are binary strings. The elements in the population of a GP are trees.

This is an example of an S-expression tree (sqrt ( (a + b) /2.0 ) )

SQRT

/

a b

+ 2.0

terminal set = {a, b, c, 0, 1, 2} function set ={+, -, *, /, SQRT}

Page 14: Evolution of Numeric Constants in Genetic Programming

Representation of a Program Structure as an S-expression tree.

float treeFunc(float a){

if ( a>10.0) {

return 20.0;}else{

return a/2.0;}

}

if

> 20.0 /

a 10.0 a 2.0

Looping Constructs and Subroutine Calls are also possible.

Page 15: Evolution of Numeric Constants in Genetic Programming

Three steps to setting up a GP Define an appropriate set of functions and

terminals. Functions must have closure. Functions and Terminals must be sufficient.

Define a fitness function. Set control parameters.

like GA population size, maximum size or depth of the individual

trees size and the shape of the original trees others

Page 16: Evolution of Numeric Constants in Genetic Programming

Running a GP

Generating a initial population of random S-expression trees.

Create next generation Randomly select elements for direct

inclusion Randomly select pairs of elements for

mating Different from GA.

Select some elements for mutation. Also different from GA.

Page 17: Evolution of Numeric Constants in Genetic Programming

Running a GP (continued) Repeatedly create new generations. Terminate when an acceptable

solution is been found or when the specified maximum number of generations is reached.

The termination criteria is based on the number of hits, where a hit is defined as the successful completion of some subgoal.

Page 18: Evolution of Numeric Constants in Genetic Programming

+ 3

a b

+

7 *

* b

2 a

/Crossover

Point

CrossoverPoint

3

/

+

a b

+

7

*

* b

2 a

Swap

3

/

+

a b

+

7*

* b

2 a

S-expression Mating Process

Step 1) Randomly select crossover points in both parents.

Step 2) Detach the Subtrees and swap them.

Resulting in these offspring.

(a+b)/3 7+((2*a)*b)

7+(a+b)((2*a)*b)/3

Page 19: Evolution of Numeric Constants in Genetic Programming

Mutation with GP

Elements that are selected for mutation will have some randomly selected node (and any subtree under it) replaced with a randomly generated subtree.

Often mutation is not used in the GP process.

New research has indicated that it may be beneficial.

Page 20: Evolution of Numeric Constants in Genetic Programming

Mutation with GP*

+ /

a 10.0

a 20

Randomly SelectMutation Point

*

- /

* 4 a 20

*

b b

a

Replace with RandomlyGenerated Tree

Page 21: Evolution of Numeric Constants in Genetic Programming

Early work with numeric constants Arithmetic genesis

A terminal a variable be divided by itself in an S-expression resulting in the number 1.0.

2.0 can evolve via an S-expression that adds 1.0 to itself. 0.5 can evolve via an S-expression that divides 1.0 by 2.0. An arbitrary number of constants can be created this way.

Arithmetic combination. A number of numeric constants can be included in the

terminal set. They can be combined in S-expressions. Any subtree with only numeric constants for terminal nodes

can itself be thought of as a numeric constant node.

Page 22: Evolution of Numeric Constants in Genetic Programming

Arithmetic Genesis of Numeric Constants

a a

/ +

a a

/

a a

/

a a

/ +

a a

/

a a

/

/

1.0 2.0 0.5

Page 23: Evolution of Numeric Constants in Genetic Programming

Arithmetic Combination Terminal Set ={ x, y, 0, 1, 2, 3 }

/

1 3

-

3 2

/

* 2

-1.5

Page 24: Evolution of Numeric Constants in Genetic Programming

The Ephemeral Random Constant Each time the ephemeral random constant

is selected as a terminal in the creation of the initial population, it is replaced with a randomly generated number within some specified range.

Even with the use of the ephemeral random constant and/or the presence of predefined constants in the terminal set GP still has difficulty generating sufficient numeric constants.

Page 25: Evolution of Numeric Constants in Genetic Programming

The Ephemeral Random Constant

R a

/ +

R 1

/

R 2

/

/

0.578 a

/ +

1

/

2

/

/

0.578

-0.381 1.971

Terminal Set = {a, 1, 2, R }

R is replaced with random numbers

Page 26: Evolution of Numeric Constants in Genetic Programming

Numeric constants are a problem for GP

A problem consisting of discovering just a single numeric constant required 14 generations to create an S-expression comprising almost half a page.

John Koza, the discoverer of GP, said: “The finding of numeric constants is a skeleton in the GP closet... [and an] area of research that requires more investigation.”

Page 27: Evolution of Numeric Constants in Genetic Programming

A Simple Symbolic Regression Problem

We will try to evolve a function which passes through eleven given target points. The target points all lie on the curve defined by the function y=x2 + 3.141592654.

0

20

40

60

80

100

120

0 1 2 3 4 5 6 7 8 9 10

x

y

x

y

Page 28: Evolution of Numeric Constants in Genetic Programming

A More Difficult Symbolic Regression Problem

We will again try to evolve a function which passes through eleven given target points. This time the target points all lie on the curve defined by the function

y = x3 - 0.3x2 - 0.4x - 0.6

-200

0

200

400

600

800

1000

0 1 2 3 4 5 6 7 8 9 10

x

yy

x

Page 29: Evolution of Numeric Constants in Genetic Programming

Financial Symbolic Regression Problem Our third problem is taken from the domain of

financial analysis. The goal is to do symbolic regression, where the target points are a financial time series.

In this case we are using a target time series that is derived from the daily closing prices of the S&P 500 from the years 1994 and 1995.

Instead of using just one independent variable like the last two problems, we use 33 independent variables taken from time series that that are derived from the S&P 500 itself and from the closing

daily prices of 32 Fidelity Select Mutual Funds.

Page 30: Evolution of Numeric Constants in Genetic Programming

Preprocessing the Financial Data The first step to preprocessing the data is to take the 21

day moving average. This is to reduce the effect of day to day fluctuations.

The second step is to take the percent change in the financial time series between the current day and 21 days prior. This is to normalize the data within each series and

between different series. The target data points are taken from the S&P 500 and

preprocessed in the same way but are also shifted 10 days into the past. The result of this is that the system is trying to predict

the general trend of this time series, two weeks into the

future.

Page 31: Evolution of Numeric Constants in Genetic Programming

The Target data for the Financial Problem

The top line in the graph is the daily closing price of the S&P 500. The solid line below it is the graph of the target time series after the preprocessing.

The dotted line is a function evolved using GP. It is included here only as an example to illustrate that criterion for success

does not require a great deal of accuracy.

0

100

200

300

400

500

600

700

940103 940701 941229 950628 951226

-0.02

-0.01

0

0.01

0.02

0.03

0.04

0.05

Page 32: Evolution of Numeric Constants in Genetic Programming

The Example Evolved Function y = (((0.38)-((-0.20923)-(FSPTX-(((-0.79706)

/(0.38))*((FSUTX-FSCSX)*(FSCGX-(-0.34247))))))) *(SPX*((0.82794)/(0.54431)))) The independent variables that were used by this

evolved function are derived from the following time series.

FSPTX Fidelity select Technology Portfolio. FSUTX Fidelity Select Utility Portfolio FSCSX Fidelity Select Software Portfolio FSCGX Fidelity Select Capital Goods Portfolio SPX S&P 500 Index

Page 33: Evolution of Numeric Constants in Genetic Programming

Related Work

Adaptive Restrictive Tournament Selection and a Local Hill Climbing hybrid for the Identification of Multiple Good Design Solutions R. Roy and I.C. Parmee

Application of a Hybrid Genetic Algorithm to Airline Crew Scheduling David Levine

Hybridized Crossover-Based Search Techniques for Program Discovery Una-May O’Reilly and Franz Oppacher

Exploring Alternative Operators and Search Strategies in Genetic Programming Kim Harries and Peter Smith

Page 34: Evolution of Numeric Constants in Genetic Programming

Overview of Local Search

GP is more adept at finding the general form of an S-expression, but was less efficient at determining the appropriate numeric constants.

The three techniques for assisting the GP to find numeric constants are: Multi-Dimensional Hill Climbing (MDHC) Vector Hill Climbing (VHC) Numeric Mutation (NM)

Page 35: Evolution of Numeric Constants in Genetic Programming

Features That They Have in Common Selection of Individuals for Local Search

SelectionCount elements are randomly selected for the operation from the top SelectionGroup elements

Control of the Temperature Factor temperature factor =

(bestElementFitnessScore * TemperatureScoreParm )

temperature factor = ( ( ( maxHits - hitsFromBestElement ) / maxHits ) *

TemperatureHitsParm)

Page 36: Evolution of Numeric Constants in Genetic Programming

Multi-Dimensional Hill Climbing For each selected numeric constant a delta value is

randomly selected between zero and the current temperature factor.

For each selected constant three values will be considered, the original value, the original value plus the delta, and the original value minus the delta.

All combinations of one of these three values for each of the selected constants is then evaluated using the fitness function.

The best of these combinations is then selected to replace the original constants.

Page 37: Evolution of Numeric Constants in Genetic Programming

Multi-Dimensional Hill Climbing Example Select only two constants so the hyper cuboid has two

dimensions, and is a rectangle. The original value of the constants is 10.0 and 20.0. The temperature factor is currently 4.0. Deltas for the two constants will be randomly selected from

0.0 to 4.0. Let us say that the delta for the first constant is 3.0 and the delta for the second constant is 1.0.

The drawing below shows all combinations of the original values plus and minus the deltas and the unchanged original values.

The best of all these combinations would replace the original values.

(7,21)

(7,20)

(7,19)

(10,21)

(10,19)

(10,20)

(13,21)

(13,20)

(13,19)

Page 38: Evolution of Numeric Constants in Genetic Programming

Multi-Dimensional Hill Climbing( 3D )

C1

C2

C3

Delta 2

Delta 1

Delta 3Search Points =all combinations oforiginal constantsplus and minus the

deltas.

RandomlySelected Delta

Values betweenZero and

Temperature

Each point represents an S-expression with the same form but different constants.

Page 39: Evolution of Numeric Constants in Genetic Programming

The Problem with MDHC

Unless HillClimbConstantCount is very small, the multi-dimensional hill climbing will require many calls to the fitness function .

The number of calls to the fitness function is calculated by raising 3 to the power of the number of numeric constants selected (HillClimbConstantCount ) and then multiplying this by the number of elements that are selected.

For example if 5 constants are selected and applied to 10 elements. 2430 ( 35 * 10 ) calls to the fitness function are required.

For this reason in our experiments we only applied MDHC to the highest scoring element. We also limited the number of constants selected to a maximum of 4.

Page 40: Evolution of Numeric Constants in Genetic Programming

Vector Hill Climbing

For each numeric constants, the fitness function is applied after adding a very small epsilon. The change in the fitness score is called the Delta for that constant.

All of the Deltas together are called the Delta Vector. The Delta Vector is normalized by dividing each value in it by

the largest absolute value in the vector. A new position in the search space is determined by multiplying

the Delta Vector by a temperature factor and adding it to the current constants.

If the fitness score of the new position is better than the original position all constants are replaced by the constants from the new position and this is then considered the current position in the search space.

The Delta Vector is then repeatedly added to the current position until the fitness function results in no further improvement or a maximum of 100 such additions is reached.

Page 41: Evolution of Numeric Constants in Genetic Programming

Vector Hill Climbing

C1

C2

fitness

Delta 2

Delta 1

Starting Point

New Point = Starting Point +(Delta Vector * Temperature)

Page 42: Evolution of Numeric Constants in Genetic Programming

Vector Hill Climbing Example Let us consider a simple two dimensional example, of an S-expression with

two constants 50.0 and 12.5, and that the current fitness score is 100 and the temperature factor is 2.0.

We add a small epsilon say 0.01 to 50.0 and obtain a fitness score of 103.0, so s the delta for 50.0 is equal to 3. We restore the value of to 50.0 and add the epsilon to 12.5 and this time we get a fitness score of 98.5 so the delta for 12.5 is -1.5.

The delta vector (3, -1.5) is then normalized by dividing it by the largest absolute value in it, resulting in (1,-0.5). We can now determine the new position by multiplying the delta vector by the temperature factor (2.0) and adding it to the current position.

The delta vector times the temperature will equal (2,-1). When we add this to the current position our new position will be (52.0, 11.5).

The fitness function is called after replacing the constants with 52.0 and 11.5. If the fitness score was better than 100.0 we would keep the new constants.

Then the delta vector would be added to the constants again. This is repeated 100 times or till the fitness score shows no further improvement.

Page 43: Evolution of Numeric Constants in Genetic Programming

Numeric Mutation

NM replaces all of the numeric constants with new numeric constants, chosen at random from a uniform distribution within a specific selection range.

The selection range for each numeric constant is specified as the old value of that constant plus or minus the temperature factor.

NM is the simplest of the three techniques described here.

NM is not concerned with improving the fitness score and so it requires only one call to the fitness function.

It is feasible to apply the method to many elements during each generation of the GP.

Page 44: Evolution of Numeric Constants in Genetic Programming

Numeric Mutation

C1

C2

C3

temperature

New Point

Page 45: Evolution of Numeric Constants in Genetic Programming

Measuring GP Systems

Can these three local search techniques improve the GP process?

We need a means for measuring the effectiveness of the hybrid and non-hybrid GP systems.

A GP run is not always successful. We can estimate the success ratio for a GP system by making repeated runs and dividing the number of successful runs by the total number of runs.

Page 46: Evolution of Numeric Constants in Genetic Programming

The Experimental Design Using the three problems we described, we will

compare the success ratios of the three hybrid GP systems with the success ratios of a non-hybrid GP system.

We generally make multiple sets of runs with the hybrid systems using different values for some of the parameters.

To determine if the differences are statistically significant at a 95% confidence level we use the Two Tailed Large-Sample Statistical Test for Comparing

Two Binomial Proportions.

Page 47: Evolution of Numeric Constants in Genetic Programming

MDHC and Simple Symbolic Regression

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

No HillClimbing

27 timeswith 1

constant

9 timeswith 2

constants

3 timeswith 3

constants

1 timewith 4

constants

Success Ratios

Page 48: Evolution of Numeric Constants in Genetic Programming

VHC and Simple Symbolic Regression

Success Ratios

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

No HillClimbing

0.01 0.1 1 2 10

TSP

Succ

ess

Ratio

Page 49: Evolution of Numeric Constants in Genetic Programming

VHC and Simple Symbolic Regression

Run Time in Minutes

0

24

6

810

12

14

1618

20

No HillClimbing

0.01 0.1 1 2 10

TSP

min

utes

Page 50: Evolution of Numeric Constants in Genetic Programming

NM and Simple Symbolic Regression

Success Ratios

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

NoNumericMutation

0.0002 0.002 0.02 0.2 2 20 200 2000

TSP

succ

ess

ratio

Page 51: Evolution of Numeric Constants in Genetic Programming

NM and Simple Symbolic Regression

Success Ratios

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

0.5

NoNumeri

cMutatio

100 200 300 400 500 600 700 800 900 1000

TSP

succ

ess

ratio

Page 52: Evolution of Numeric Constants in Genetic Programming

MDHC and Harder Symbolic Regression

Success Ratios

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

No HillClimbing

27 timeswith 1

constant

9 timeswith 2

constants

3 timeswith 3

constants

1 timewith 4

constants

Suc

ess

Rat

ios

Page 53: Evolution of Numeric Constants in Genetic Programming

VHC and Harder Symbolic Regression

Success Ratios

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

No HillClimbing

0.00001 0.0001 0.001 0.01 0.1 1 10

TSP

succ

ess

rati

o

Page 54: Evolution of Numeric Constants in Genetic Programming

NM and Harder Symbolic Regression Success Ratios Selection Group Vs. Selected Elements

50 100 200 300 400

10

20

30

40

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

0.5

Selection Group

Selected Elements

Success Ratio

Page 55: Evolution of Numeric Constants in Genetic Programming

NM and Harder Symbolic Regression

Success Ratios

0

0.1

0.2

0.3

0.4

0.5

0.6

Non-hybrid

100 10 1 0.1 0.02 0.01 0.008 0.006 0.004 0.002 0.001

TSP

succ

ess

rati

o

Page 56: Evolution of Numeric Constants in Genetic Programming

MDHC and Financial Symbolic Regression

Success Ratios

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

No HillClimbing

27 timeswith 1

constant

9 timeswith 2

constants

3 timeswith 3

constants

1 timewith 4

constants

Page 57: Evolution of Numeric Constants in Genetic Programming

VHC and Financial Symbolic Regression

Success Ratios

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

0.5

No Hill Climbing VHC w/ THP = 0.1

succ

ess

rati

o

Page 58: Evolution of Numeric Constants in Genetic Programming

NM and Financial Symbolic Regression

Success Ratios

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

0.5

Non-Hybrid

0.000001 0.00001 0.0001 0.001 0.01 0.1

THP

succ

ess

rati

o

Page 59: Evolution of Numeric Constants in Genetic Programming

Conclusions

All of the hybrid systems, MDHC, VHC, and NM, created statistically significant improvements in the success ratios when

applied to the simple regression problem With the other two problems NM was the

only hybrid system that demonstrated statistically significant improvements to the GP process.

Page 60: Evolution of Numeric Constants in Genetic Programming

Overview of the Software Object Oriented C++. Windows 95 (MS Visual C++ 5.0) Ported to UNIX. (GNU C++) Extended to run cooperatively on

multiple machines using MPI.

Page 61: Evolution of Numeric Constants in Genetic Programming

OOA of Genetic Programming

PopulationProblem

SymbolicIntegration

FinancialTradingSystems

Strategy

GA_Stratagy GP_Stratagy

Is made of

Will solve

scoreStratagy

Page 62: Evolution of Numeric Constants in Genetic Programming

OOA of the GP_Strategy Class

GP Strategy

S Expression Node

Terminal Function

Variable Constant

Page 63: Evolution of Numeric Constants in Genetic Programming

Dynamic Model of the Population Class

Empty Populationdo: generate strategies

New Generationdo: score and sort stratagies

Generation Scoreddo: straight selection

Next Gen. Partialy Selecteddo: mating

Gen Complete /Gen . Count =0

Score and sort complete

Termination criteria met

Term. crit. not met & Gen Count > max

Success

FailureStraight Selection Complete

Mating Complete

Mutation complete / Gen..Count ++

Next Gen. Selected and Generateddo:mutation

Page 64: Evolution of Numeric Constants in Genetic Programming

Overview of Future Work Extensions to the MDHC Extensions to the VHC Extensions to the NM Other Future Directions

Applying These Local Searches in Combination Multiple Processors (MPI) Optimal Times to Implement the Local

Searches Discovery of Econometric Models

Page 65: Evolution of Numeric Constants in Genetic Programming

Thank You! Are there any Questions?