8/6/2019 4 Ecology GA Regression
All Subset Regression Using a
Genetic Algorithm
Olcay Akman
Department of Mathematics
Illinois State University
-
OVERVIEW OF GENETIC ALGORITHMS
Genetic algorithms (GA) belong to a group of
optimization techniques collectively called
evolutionary computation.
They constitute robust but simple search
techniques inspired by observations of the
mechanics of natural selection and
genetics.
-
Ask not what mathematics can do for biology, but
ask what biology can do for mathematics
The idea is that biological evolution has produced
organisms capable of living in almost every
possible landscape available, so why don't we take a
tip from nature and exploit the utility of
evolution to do optimization?
-
OVERVIEW OF GENETIC ALGORITHMS
String data structures are used to represent sets
of possible problem solutions, where each location
in the string contains a character (gene)
identifying the state of a particular process
variable.
This is analogous to a chromosomal structure,
which is occupied by genes at fixed locations.
-
OVERVIEW OF GENETIC ALGORITHMS
These string structures experience alterations
(mutations) throughout an iterative search
process (analogous to biological structures during
an evolutionary process).
Over many generations, stronger strings
(individuals) appear and merge with the
population.
These eventually dominate the population through the
string selection, breeding and replacement
process (natural selection).
-
OVERVIEW OF GENETIC ALGORITHMS
As such, over several generations, the population
of strings will experience successive incremental
improvements, eventually stabilizing itself as the
best strings emerge.
-
COMPONENTS OF GA
Population
Mating Pool
New Offspring
Selection (Darwinian selection operation)
Mating and Mutation
Evaluation
-
PSEUDO-CODE
Randomly generate chromosomes
While t
-
THE BASICS: SELECTION
There are several methods to choose which
chromosomes will contribute to the next
generation. Among the most popular are:
Proportional (Roulette Wheel) Selection
Tournament
Ranking Selection
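A minimal sketch of the three schemes named above, assuming chromosomes are bit strings and that a lower score is better. The function names and the `scores` table are illustrative and do not come from the slides.

```python
import random

def tournament_select(population, fitness, rng, size=2):
    """Draw `size` distinct chromosomes at random and return the one with the lowest score."""
    contestants = rng.sample(population, size)
    return min(contestants, key=fitness)

def roulette_select(population, weights, rng):
    """Draw one chromosome with probability proportional to its (non-negative) weight.

    For a minimization problem the raw scores must first be turned into
    weights where a larger weight means a better chromosome.
    """
    return rng.choices(population, weights=weights, k=1)[0]

def ranking_select(population, fitness, rng):
    """Linear ranking: the best chromosome gets weight N, the worst gets weight 1."""
    ranked = sorted(population, key=fitness)  # best (lowest score) first
    n = len(ranked)
    weights = [n - i for i in range(n)]
    return rng.choices(ranked, weights=weights, k=1)[0]

# Illustrative scores only (lower is better).
scores = {"101110": 143.7, "110001": 138.32, "100101": 134.18}
population = list(scores)
rng = random.Random(0)
parent = tournament_select(population, scores.get, rng)
```

Ranking selection depends only on the order of the scores, so it is less sensitive to their scale than proportional selection.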
-
THE BASICS: MATING
Arguably the most important part of the
algorithm, since it creates new combinations (implicit
parallelism).
Mating of chromosomes is analogous to biological
crossover which diploid organisms use to create
new combinations of genes.
Crossover is done until every chromosome in the
mating pool is mated with another chromosome.
-
THE BASICS: CROSSOVER
First, the number of crossover points is chosen.
Suppose the number of points is 1 and the
two chromosomes are 01001110 and 11101011.
The point of crossover is chosen at random from [1, l-1], where l denotes the length of the string.
Parent 1: 01001110
Parent 2: 11101011
Split after position 3:
010|01110
111|01011
Offspring (tails exchanged):
11101110
01001011
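The single-point crossover above can be sketched as follows. The function name and the optional `point` argument are illustrative; the slides only specify that the point is drawn at random from [1, l-1].

```python
import random

def single_point_crossover(parent1, parent2, rng=random, point=None):
    """Split both parents after `point` bits and exchange their tails."""
    if len(parent1) != len(parent2):
        raise ValueError("parents must have the same length")
    if point is None:
        # randint is inclusive at both ends, matching the interval [1, l-1].
        point = rng.randint(1, len(parent1) - 1)
    child1 = parent2[:point] + parent1[point:]
    child2 = parent1[:point] + parent2[point:]
    return child1, child2

# Reproduces the example from the slide (split after position 3).
children = single_point_crossover("01001110", "11101011", point=3)
```

Restricting the point to [1, l-1] guarantees that each child receives at least one bit from each parent.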
-
THE BASICS: MUTATION
Again analogous to biological mutation, in which
there is a base change in a nucleotide of DNA,
mutation changes a 0 to a 1 or a 1 to a 0.
With a certain probability (p_m), each 0 or 1 has a
chance of being flipped. Typically this
probability is low.
A general rule is p_m = 1/l, so that on average there is
one mutation per chromosome.
Before: 01001011
The fifth bit (a 1, shown in red on the original slide) is chosen
for mutation.
After: 01000011
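A minimal sketch of bit-flip mutation with the 1/l rule of thumb as the default rate. The function name and signature are illustrative.

```python
import random

def mutate(chromosome, pm=None, rng=random):
    """Flip each bit independently with probability pm (default 1/l)."""
    if pm is None:
        pm = 1.0 / len(chromosome)
    flipped = []
    for bit in chromosome:
        if rng.random() < pm:
            flipped.append("1" if bit == "0" else "0")
        else:
            flipped.append(bit)
    return "".join(flipped)

mutant = mutate("01001011", rng=random.Random(0))
```

With pm = 1/l the expected number of flips per chromosome is one, but any particular chromosome may receive zero flips or several.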
-
Mating-Offspring
For every two parents, create
two children via crossover and
mutation.
Crossover: Split the two
parent chromosomes into
two at the same location. Exchange tails.
Parent 1: 1 1 1 1 | 1 1 Child 1: 1 1 1 1 0 0
Parent 2: 0 0 0 0 | 0 0 Child 2: 0 0 0 0 1 1
Parent 1: 1 0 1 | 1 0 1 Child 1: 1 0 1 1 0 0
Parent 2: 0 0 1 | 1 0 0 Child 2: 0 0 1 1 0 1
-
Crossover
Crossover can produce children that are radically
different from their parents.
Parent 1: 1 1 1 1 | 1 1 Child 1: 1 1 1 1 0 0
Parent 2: 0 0 0 0 | 0 0 Child 2: 0 0 0 0 1 1
-
Crossover
Crossover will not introduce differences for a bit
position where both parents have the same value.
Parent 1: 1 0 1 | 1 0 1 Child 1: 1 0 1 1 0 0
Parent 2: 0 0 1 | 1 0 0 Child 2: 0 0 1 1 0 1
An extreme instance occurs when both parents are identical. In
such cases crossover can introduce no diversity in the children.
This makes mutation a vital component of the GA.
-
Offspring-Mutation
For every two parents, create
two children via crossover and
mutation.
Mutation: Randomly
mutate every bit in every
offspring.
Old Chromosome   Random Numbers        New Bit   New Chromosome
1 0 1 0          .801 .102 .266 .373   -         1 0 1 0
1 1 0 0          .120 .096 .005 .840   0*        1 1 0 0
0 0 1 0          .760 .473 .894 .001   1         0 0 1 1
*: Randomly generated bit is the same as the original bit.
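The table suggests a variant in which a selected bit is replaced by a freshly drawn random bit rather than flipped, so the new bit can coincide with the old one (the starred row). A minimal sketch under that reading; the threshold `pm` is an assumption, since the slide does not state the value used.

```python
import random

def mutate_by_redraw(chromosome, pm, rng):
    """For each bit, draw a random number; if it falls below pm, redraw the bit at random."""
    result = []
    for bit in chromosome:
        if rng.random() < pm:
            # The redrawn bit equals the original about half of the time.
            result.append(rng.choice("01"))
        else:
            result.append(bit)
    return "".join(result)

# pm=0.01 is a hypothetical threshold chosen only for this example.
new_chromosome = mutate_by_redraw("0010", pm=0.01, rng=random.Random(0))
```

Under this variant the effective flip rate is roughly pm/2, which is worth keeping in mind when comparing it with a plain bit-flip rate.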
-
Generation
Delete all:
Repeat the reproduction process until you obtain 100
offspring.
Kill the original population.
Start anew with the second generation.
Stop at (say) 50 generations.
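The "delete all" generational scheme can be sketched as a single loop. This is an illustrative sketch, not the authors' implementation: binary tournament selection, single-point crossover, bit-flip mutation, and the bookkeeping of the best chromosome seen so far are all assumptions made here, and the toy fitness simply counts zeros.

```python
import random

def run_ga(fitness, length, pop_size=100, generations=50, pm=None, seed=0):
    """Generational GA on bit strings that minimizes `fitness`."""
    rng = random.Random(seed)
    pm = 1.0 / length if pm is None else pm
    population = ["".join(rng.choice("01") for _ in range(length))
                  for _ in range(pop_size)]
    best = min(population, key=fitness)
    for _ in range(generations):
        offspring = []
        while len(offspring) < pop_size:
            # Binary tournaments pick the two parents.
            p1 = min(rng.sample(population, 2), key=fitness)
            p2 = min(rng.sample(population, 2), key=fitness)
            point = rng.randint(1, length - 1)
            for child in (p1[:point] + p2[point:], p2[:point] + p1[point:]):
                child = "".join(
                    ("1" if b == "0" else "0") if rng.random() < pm else b
                    for b in child
                )
                offspring.append(child)
        # "Kill the original population": only the offspring survive.
        population = offspring[:pop_size]
        # Remember the best chromosome ever seen (not part of the population).
        best = min(population + [best], key=fitness)
    return best

# Toy problem: fewer zeros is better, so the optimum is the all-ones string.
best = run_ga(lambda c: c.count("0"), length=10)
```

Because the whole population is replaced each generation, a good chromosome can be lost; tracking `best` outside the population is one simple way to avoid forgetting it.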
-
The Process
The algorithm begins by creating a random
initial population, as shown in the figure.
-
USING GA FOR SUBSET
REGRESSION MODEL
SELECTION
-
MULTIPLE LINEAR REGRESSION
Real-world mathematical models can have many
variables to explain a single response variable.
Determining the correct set of variables can be
difficult.
There are methods such as stepwise,
backward and forward selection; however, these methods
may not produce the optimal set.
-
MULTIPLE LINEAR REGRESSION
General Model:
$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$
For the general model there are $2^{k+1}-1$ different
possible subsets for k variables.
[Figure: Number of Possible Models for Multiple Regression. x-axis: Number of Ind. Variables (1 to 15); y-axis: Number of Models (0 to 140,000).]
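The count follows from treating each of the k+1 coefficients (the intercept plus k slopes) as either in or out of the model and discarding the empty model. A short sketch that checks the closed form against brute-force enumeration; the function names are illustrative.

```python
from itertools import product

def count_models(k):
    """Number of non-empty subsets of the k+1 terms (intercept plus k variables)."""
    return 2 ** (k + 1) - 1

def count_models_by_enumeration(k):
    """Enumerate every in/out pattern over k+1 terms and skip the all-zero pattern."""
    return sum(1 for bits in product("01", repeat=k + 1) if "1" in bits)

for k in (1, 5, 10, 15):
    print(k, count_models(k))
# prints:
# 1 3
# 5 63
# 10 2047
# 15 65535
```

The count doubles with every added candidate variable, which is why exhaustive search quickly becomes impractical.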
-
MULTIPLE REGRESSION AND
GENETIC ALGORITHMS
The first step to using a GA is to get the correct
encoding to take advantage of the genetic
operators.
Binary encoding is used in which the following rule
is applied:
If the ith position is 0, the ith explanatory variable is
not included in the model
If the ith position is 1, the ith explanatory variable is
included in the model
-
ENCODING EXAMPLE
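Suppose the full model is
$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$
The binary structure 1011101 would represent
$y = \beta_0 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_{k-2} X_{k-2} + \beta_k X_k$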
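A minimal sketch of decoding a chromosome into the terms of a model. It assumes, as the example suggests, that the first position switches the intercept and position i+1 switches variable X_i; the function names are illustrative.

```python
def decode(chromosome):
    """Return the names of the terms switched on by a binary chromosome."""
    terms = []
    for i, bit in enumerate(chromosome):
        if bit == "1":
            # Position 0 is taken to be the intercept, position i the variable X_i.
            terms.append("1" if i == 0 else f"X{i}")
    return terms

def model_string(chromosome):
    """Render the decoded model as a readable formula."""
    return "y ~ " + " + ".join(decode(chromosome))

print(model_string("1011101"))  # prints: y ~ 1 + X2 + X3 + X4 + X6
```

With seven positions the string corresponds to k = 6, so the selected variables X2, X3, X4 and X6 match the pattern in the formula above.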
-
MULTIPLE REGRESSION AND
GENETIC ALGORITHMS
Genetic algorithms operate to find the most fit
chromosome.
Thus, to use genetic algorithms with multiple
regression modeling, a fitness function must be
determined.
Candidates include $R^2$, adjusted $R^2$, Mallows' $C_p$, MSE, AIC and so on.
In our analysis we employ ICOMP (Bozdogan, 2004).
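To show how a chromosome is turned into a score, here is a sketch of a fitness function. It uses AIC, one of the alternatives listed above, rather than ICOMP, in the common Gaussian form n ln(RSS/n) + 2p (up to an additive constant). The helper names, the pure-Python least-squares solver, and the toy data are all assumptions made for illustration.

```python
import math

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (collinear columns)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def aic_fitness(chromosome, columns, y):
    """AIC of the least-squares model encoded by `chromosome` (lower is better).

    chromosome[0] switches the intercept; chromosome[i] switches columns[i-1].
    """
    n_obs = len(y)
    design = [[1.0] * n_obs] if chromosome[0] == "1" else []
    design += [col for bit, col in zip(chromosome[1:], columns) if bit == "1"]
    if not design:
        return math.inf  # the empty model is never selected
    # Normal equations (X'X) beta = X'y, built column by column.
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in design] for ci in design]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in design]
    beta = solve_linear_system(xtx, xty)
    rss = sum((yi - sum(b * col[i] for b, col in zip(beta, design))) ** 2
              for i, yi in enumerate(y))
    return n_obs * math.log(rss / n_obs) + 2 * len(design)

# Toy data: y depends on x1 only; x2 is irrelevant.
x1 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
x2 = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
noise = [0.1, -0.1, 0.2, -0.2, 0.1, -0.1, 0.05, -0.05]
y = [1.0 + 2.0 * a + e for a, e in zip(x1, noise)]
score = aic_fitness("110", [x1, x2], y)
```

Any criterion that maps a chromosome to a single number can be plugged into the GA in the same way; a perfect fit (RSS of zero) would need special handling that this sketch omits.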
-
ICOMP
ICOMP is defined as:
[Equation: the ICOMP formula is not recoverable from the extracted slide text. The surviving fragments show terms in ln(2π), a trace term and a ln(det(·)) term in X'X; see Bozdogan (2004) for the full expression.]
where
n is the number of variables in the model
$\hat{\sigma}^2$ is the mean squared error, RSS/q
q is the number of observations the model is based on
X is the q × (n+1) matrix whose kth column holds the values of the (k-1)th variable and whose first column is filled with 1s,
unless the model has no intercept, in which case X is q × n and the kth
column corresponds to the kth variable
-
ICOMP
The best models have low complexity; thus, the
GA seeks to minimize the ICOMP value.
-
An Illustration
Consider a regression problem with k=6 candidate
variables.
Suppose that three randomly chosen regression
subsets are the members of the initial population,
with ICOMP as their fitness:
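101110   $y = \beta_0 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4$   ICOMP=143.7
110001   $y = \beta_0 + \beta_1 X_1 + \beta_5 X_5$   ICOMP=138.32
100101   $y = \beta_0 + \beta_3 X_3 + \beta_5 X_5$   ICOMP=134.18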
-
An Illustration
Parents (crossover after position 3):
110|001
100|101
Offspring:
110101   $y = \beta_0 + \beta_1 X_1 + \beta_3 X_3 + \beta_5 X_5$   (ICOMP=132.25)
100001   $y = \beta_0 + \beta_5 X_5$   (ICOMP=140.16)
We now rank all of the strings in the current population and replace the
lowest ranking string with the new string to generate a new population.
The new population might then consist of the following strings:
110101 (ICOMP=132.25), 100001 (ICOMP=140.16), 100101 (ICOMP=134.18)
-
An Illustration
At this time strings 110101 and 100101 will most likely reproduce.
Suppose they did and the following occurred:
Parents (crossover after position 3):
110|101
100|101
Offspring:
110101
100101
No new genetic combination is produced.
Mutation will alter some bits on both strings; a new population will emerge.
-
An Illustration
Question:
Can the mating and/or mutation result in weaker individuals
which may eventually enter the general population?
Answer:
Yes, possibly. However, because of their lower fitness, these
individuals will be less likely to be chosen for mating, and will soon
disappear from the population as better-fit newborn individuals appear.
-
WHAT DID WE CONTRIBUTE?
Evolution works fastest when the initial genetic
variance is the largest.
Using binary coding, the variance is maximized when,
in each position of the chromosome, 50% of the
chromosomes have 1 and the others have 0.
In the context of choosing a regression model, this
means that in the initial population each variable
has an equal chance of being included in the model.
We have developed a method called Initial Population
Diversification, in which half the population is
randomly created and the other half is created by
taking each of those chromosomes and changing every 0 to 1 and every 1 to
0.
-
INITIAL POPULATION
DIVERSIFICATION EXAMPLE
First 10 chromosomes:
01110101011111
00001001010001
00000100001101
00110001000100
10111001001110
10100010111010
10010111001001
11101010111101
01110101010010
01111101110101
Last 10 chromosomes:
10001010100000
11110110101110
11111011110010
11001110111011
01000110110001
01011101000101
01101000110110
00010101000010
10001010101101
10000010001010
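In the example, each of the last 10 chromosomes is the bitwise complement of the corresponding chromosome among the first 10. A minimal sketch of building such a population; the function names are illustrative.

```python
import random

FLIP = str.maketrans("01", "10")

def complement(chromosome):
    """Change every 0 to 1 and every 1 to 0."""
    return chromosome.translate(FLIP)

def diversified_population(pop_size, length, rng):
    """Random first half followed by the complement of each random chromosome."""
    if pop_size % 2:
        raise ValueError("pop_size must be even")
    half = ["".join(rng.choice("01") for _ in range(length))
            for _ in range(pop_size // 2)]
    return half + [complement(c) for c in half]

population = diversified_population(20, 14, random.Random(1))
# Every position holds a 1 in exactly half of the chromosomes.
ones_per_position = [sum(c[i] == "1" for c in population) for i in range(14)]
```

Pairing each chromosome with its complement guarantees the 50/50 split at every position exactly, instead of only on average as with a fully random population.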
-
WHAT DID WE MODIFY?
Populations start larger and reduce.
Diversification
We use Binary Tournament instead of
Proportional Selection, which requires less
computation since ICOMP and model parameters
only need to be computed for those in the
tournament.
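The saving can be sketched with a lazily evaluated, cached fitness: a binary tournament scores only its two contestants, whereas proportional selection needs a score for every chromosome before it can draw. The helper names and the toy score below are assumptions for illustration.

```python
import random
from functools import lru_cache

def make_lazy_fitness(score):
    """Wrap `score` so each chromosome is evaluated at most once, and count evaluations."""
    calls = {"n": 0}

    @lru_cache(maxsize=None)
    def fitness(chromosome):
        calls["n"] += 1
        return score(chromosome)

    return fitness, calls

def binary_tournament(population, fitness, rng):
    """Pick two distinct population members and return the one with the lower score."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) <= fitness(b) else b

# Fifty distinct 6-bit chromosomes and a toy score (number of ones).
population = [format(i, "06b") for i in range(50)]
fitness, calls = make_lazy_fitness(lambda c: c.count("1"))
winner = binary_tournament(population, fitness, random.Random(0))
evaluations = calls["n"]  # 2 here; proportional selection would need all 50
```

When each evaluation means fitting a regression and computing a criterion, avoiding evaluations for chromosomes that never enter a tournament can be a noticeable saving.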
-
RESULTS
[Figure: Trials and ICOMP values. x-axis: Trial (1 to 196); y-axis: ICOMP (1468 to 1486); series: Adapt deltaF=.1, Adapt deltaF=.2, Typical deltaF=0, Adapt deltaF=.3, Exponential, Linear, Bozdogan.]
Trials and ICOMP values ordered by ICOMP value. 1473.9 is the smallest value and the correct solution based on Bozdogan (2004). In all trials 600 computations were allowed and mutation was set at .05.

Frequency of Correct Solution
Typical deltaF=0                                      0.915
Adapt deltaF=.1                                       0.935
Adapt deltaF=.2                                       0.93
Adapt deltaF=.3                                       0.905
Exponential                                           0.42
Linear                                                0.50
Bozdogan (No bit flipping/no population reduction)    0.09
Frequency of the GAs that found the correct solution. The first 6 were based on the GAs we developed, the last was Bozdogan's. The first entry did not have any population reduction (deltaF=0), but did have bit flipping, alluding to the fact that bit flipping is beneficial.
-
FUTURE WORK
Understanding how the ruggedness of the
search space affects the progress of the algorithm
How to calculate this ruggedness for any
arbitrary function with little computation
Mapping optimum parameters to certain
values of ruggedness
-
REFERENCES
Dumitrescu, D. et al., eds. Evolutionary
Computation. CRC Press, 2000.
Holland, J. (1975). Adaptation in Natural and
Artificial Systems. University of Michigan Press.
Lima, C.F. and Lobo, F.G. A Review of Adaptive
Population Sizing Schemes in Genetic
Algorithms. In Proceedings of the 2005
Workshops on Genetic and Evolutionary
Computation, pages 228-234. ACM, 2005.
Wright, Sewall. Evolution in Mendelian
Populations. Genetics 16 (1931): 97-159.