GPU-Accelerated Genetic Algorithms

Genetic Algorithms

A class of evolutionary algorithms

Efficiently solve optimization tasks

Potential applications in many fields

Challenges: large execution time

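[Figure: genetic algorithm flowchart. The user specifies: a representation for the chromosome, GA parameters, a method for fitness evaluation, a crossover operator, a mutation operator, and termination criteria. Flowchart steps: Create Initial Population, Evaluate Fitness, Select Parents, Create New Population, Termination Criteria (Yes: Exit).]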

High degree of parallelism: fitness evaluation, crossover, mutation

Most obvious: chromosome-level parallelism

Same operations on each chromosome

Use a thread per chromosome

Thread-per-chromosome model

Good enough for small to moderate-sized multi-core systems

Doesn't map well to massively multithreaded GPUs

Solution : identify and exploit gene-level parallelism

A column of threads reads a chromosome gene by gene and cooperates to perform operations

Results in coalesced reads and faster processing

[Figure: population matrix in memory, processed by thread blocks in a grid.]

Construct Initial Population

[Figure: system overview, divided into an "On CPU" part and an "On GPU" part. GPU global memory holds: Random Numbers, Old Population, New Population, Fitness Scores, Statistics. Kernels: Evaluation, Statistics Update, Selection, Crossover, Mutation. Other steps: Parse GA Parameters, Generate Random Numbers.]


Evaluation Kernel: reads the population, writes the fitness scores

Partially Parallel Method

The user specifies a serial code fragment for fitness evaluation.

Threads are arranged in a 1D grid.

Each thread executes the user's code on one chromosome, providing chromosome-level parallelism.

Benefit : Abstraction

Fully Parallel Method

A user familiar with CUDA can make effective use of a 2D thread layout

Use gene-level parallelism for fitness evaluation

Benefit : Efficiency

Task: given the item weights, the item costs, and the knapsack capacity

Aim: maximize the total cost.

Representation: 1D binary string

0/1: absence/presence of an item

W and C are the total weight and total cost of a given representation

Best Solution : One with max C given W < Wmax
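As an illustration of the serial code fragment a user might supply in the partially parallel method, here is a minimal Python sketch of a knapsack fitness function. The names knapsack_fitness and evaluate_population are hypothetical, and scoring an overweight chromosome as 0 is an assumption; the slides do not say how infeasible solutions are penalized.

```python
def knapsack_fitness(chromosome, weights, costs, w_max):
    """Serial fitness for one chromosome: total cost C, or 0 if overweight."""
    total_w = sum(w for gene, w in zip(chromosome, weights) if gene)
    total_c = sum(c for gene, c in zip(chromosome, costs) if gene)
    # Assumption: an infeasible chromosome (W >= Wmax) simply scores 0.
    return total_c if total_w < w_max else 0


def evaluate_population(population, weights, costs, w_max):
    # Each list element stands in for one GPU thread of the 1D grid:
    # one thread runs the user's serial code on one chromosome.
    return [knapsack_fitness(ch, weights, costs, w_max) for ch in population]
```

On a GPU the list comprehension would be replaced by one thread per chromosome; the per-chromosome code itself stays serial.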

Fully Parallel Method

Use a group of threads to compute total cost and weight in logarithmic time
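The logarithmic-time sum can be pictured as a pairwise tree reduction. The sketch below simulates it sequentially in Python; tree_reduce_sum is a hypothetical name, and the inner loop stands in for threads that would all run in the same step on a GPU. It is not the kernel used in the slides.

```python
def tree_reduce_sum(values):
    """Sum by pairwise tree reduction; returns (total, number_of_steps)."""
    data = list(values)
    n = len(data)
    steps = 0
    stride = 1
    while stride < n:
        # Every iteration of this inner loop is independent of the others,
        # so on a GPU each one would be a separate thread in the same step.
        for i in range(0, n - stride, 2 * stride):
            data[i] += data[i + stride]
        stride *= 2
        steps += 1
    return (data[0] if data else 0), steps
```

For a chromosome of length L, the total cost would be the reduction of gene[i] * cost[i] over all genes, taking about log2(L) steps instead of L.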


Statistics Update Kernel: reads the fitness scores, writes the statistics

Selection and termination most often use population statistics

We use a standard parallel reduction to calculate the maximum, minimum, and average scores

We use CUDPP, a highly optimized public library, to sort and rank chromosomes


Selection Kernel: reads the statistics, writes the parents

The selection kernel uses N/2 threads

Each thread selects two parents for producing offspring

Uniform selection: selects parents uniformly at random

Roulette wheel selection: fitness-based approach; the higher the fitness, the better the chance of selection

Roulette Wheel

Sort fitness scores

Compute a roulette wheel array by doing a prefix-sum scan of scores and normalizing it.

Generate a random number in 0-1.

Perform a binary search in the roulette wheel array for the nearest number smaller than the random number.

Return the index of the result in the array.
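The roulette wheel steps can be sketched in Python as follows. The names build_wheel and spin are hypothetical. The sketch assumes non-negative scores with a positive sum, and it uses an exclusive prefix sum so that "nearest smaller number" picks each chromosome with probability proportional to its score; the slides do not state which scan variant is used.

```python
from bisect import bisect_right


def build_wheel(scores):
    """Exclusive prefix sum of the sorted scores, normalized to [0, 1)."""
    ordered = sorted(scores)
    total = float(sum(ordered))  # assumed to be positive
    wheel, running = [], 0.0
    for s in ordered:
        wheel.append(running / total)
        running += s
    return wheel


def spin(wheel, r):
    """Index of the largest wheel entry that is <= r, for r in [0, 1)."""
    return bisect_right(wheel, r) - 1
```

The returned index refers to the sorted order of the scores, so a real implementation would also need the permutation produced by the sort to map it back to a chromosome.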



Crossover Kernel: reads the old population, writes the new population

[Figure: crossover kernel thread layout. The population resides in GPU global memory; thread index y (1 to 4 in the figure) selects a row of the population and thread index x (1 to L) selects a gene.]
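With one thread per gene, a one-point crossover (the operator listed in the test parameters) reduces to a per-gene choice between two parents. The Python sketch below is an assumption about how such a kernel could be organized, not the slides' implementation; crossover_gene and one_point_crossover are hypothetical names.

```python
def crossover_gene(parent_a, parent_b, cut, x):
    """What the thread for gene x would write: parent_a before the cut,
    parent_b from the cut onward."""
    return parent_a[x] if x < cut else parent_b[x]


def one_point_crossover(parent_a, parent_b, cut):
    length = len(parent_a)
    # Each list element stands in for one thread along the x dimension.
    child_1 = [crossover_gene(parent_a, parent_b, cut, x) for x in range(length)]
    child_2 = [crossover_gene(parent_b, parent_a, cut, x) for x in range(length)]
    return child_1, child_2
```

Because every gene is written independently, no thread needs to wait for another, which is what makes the gene-level layout attractive here.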


Mutation Kernel: reads and writes the new population

Flip Mutator

Each thread handles one gene and mutates it with the probability of mutation

[Figure: flip mutator. Thread index y selects a chromosome of the population; an example thread (1,4) flips a coin and, depending on the coin state, flips its gene or leaves it unchanged.]
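A minimal Python sketch of gene-level flip mutation for a binary representation follows. It assumes each gene reads its own pre-generated random number from a flat pool indexed by y * L + x; that indexing and the names mutate_gene and mutate_population are illustrative assumptions, not details given in the slides.

```python
def mutate_gene(gene, coin, p_mutation):
    """Flip a binary gene when its random number falls below p_mutation."""
    return 1 - gene if coin < p_mutation else gene


def mutate_population(population, random_pool, p_mutation):
    length = len(population[0])
    # One (x, y) pair stands in for one GPU thread; each reads its own
    # entry of the random pool (assumed layout: y * length + x).
    return [
        [mutate_gene(gene, random_pool[y * length + x], p_mutation)
         for x, gene in enumerate(chromosome)]
        for y, chromosome in enumerate(population)
    ]
```

The pool must hold at least N x L numbers for a population of N chromosomes of length L.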


Generate Random Numbers: writes the random numbers to GPU global memory

Extensive use of random numbers

There is no primitive for generating a single random number on the fly

Solution: generate a pool of random numbers and copy it to the GPU

We use a CUDPP routine to generate a large pool of random numbers on the GPU (faster)

If better-quality random numbers are needed, this can be replaced by a CPU-based routine

Test Device : A quarter of Nvidia Tesla S1030 GPU

Test Problem : Solve a 0/1 knapsack problem

Test Parameters:

Representation: a 1D binary string

Crossover: one-point crossover

Mutation: flip mutation

Selection: uniform and roulette wheel

[Figures: average run-time for 100 iterations (uniform selection); average run-time for 100 iterations (roulette wheel selection); growth in run-time with increasing N x L. N: population size, L: chromosome length.]

Our approach is modeled after GAlib and maintains structures for GA, Genome and Statistics

It is abstracted far enough from the user program that the user does not need to know the CUDA architecture or CUDA programming.

This can be extended to build a GPU-Accelerated GA library
