GPU-Accelerated Genetic Algorithms
Rajvi Shah+, P J Narayanan+, Kishore Kothapalli^
IIIT Hyderabad, Hyderabad, India
+: Center for Visual Information Technology
^: Center for Security, Theory and Algorithmic Research
-
Genetic Algorithms
A class of evolutionary algorithms
Efficiently solve optimization tasks
Potential applications in many fields
Challenge: large execution time
-
[Flowchart: GA workflow. Steps shown: Create Initial Population, Evaluate Fitness, Select Parents, Create New Population, Termination Criteria (Yes: Exit).]
User specifies:
A representation for the chromosome
GA parameters
Crossover operator
Mutation operator
A method for fitness evaluation
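The workflow on this slide can be sketched as a plain serial loop. This is a minimal CPU illustration in Python, not the GPU implementation described in the talk. The function name `run_ga`, its parameters, the fixed-generation termination rule, and the use of fitness-proportional selection are all assumptions made for the example. Fitness values are assumed to be non-negative and `length` is assumed to be at least 2.

```python
import random

def run_ga(fitness, length, pop_size=20, generations=50, p_mut=0.02, seed=0):
    """Serial GA over binary chromosomes; returns the best chromosome seen."""
    rng = random.Random(seed)
    # Create the initial population of random bit strings.
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    # Termination criterion (assumed here): a fixed number of generations.
    for _ in range(generations):
        # Evaluate fitness; the tiny offset keeps the weights from summing to zero.
        weights = [fitness(c) + 1e-9 for c in pop]
        new_pop = []
        while len(new_pop) < pop_size:
            # Select two parents with probability proportional to fitness.
            a, b = rng.choices(pop, weights=weights, k=2)
            # One-point crossover at a random cut position.
            cut = rng.randrange(1, length)
            child = a[:cut] + b[cut:]
            # Flip mutation: each gene flips with probability p_mut.
            child = [g ^ 1 if rng.random() < p_mut else g for g in child]
            new_pop.append(child)
        pop = new_pop
        best = max(pop + [best], key=fitness)
    return best

# Usage: maximize the number of 1-bits in a 16-gene chromosome.
best = run_ga(sum, length=16)
```

Every step inside the generation loop applies the same operation to each chromosome, which is the property the following slides exploit for parallel execution.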
-
High degree of parallelism in fitness evaluation, crossover, and mutation
Most obvious: chromosome-level parallelism
Same operations on each chromosome
Use a thread per chromosome
-
Thread-per-chromosome model
Good enough for small to moderately sized multi-core processors
Doesn't map well to massively multithreaded GPUs
Solution: identify and exploit gene-level parallelism
-
A column of threads reads a chromosome gene by gene, and the threads cooperate to perform operations
Results in coalesced reads and faster processing
[Figure: population matrix in memory, mapped to thread blocks in a grid]
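To make the coalescing point concrete, here is a small Python sketch of how a 2D thread index could map onto the population matrix. It assumes a row-major layout in which each chromosome occupies `length` consecutive memory cells. That layout, and the helper names `flat_index` and `addresses_for_chromosome`, are assumptions for illustration and are not stated on the slide.

```python
def flat_index(idy, idx, length):
    """Linear memory offset of gene idx in chromosome idy (row-major layout assumed)."""
    return idy * length + idx

def addresses_for_chromosome(idy, length):
    """Offsets touched by the threads idx = 0..length-1 that share one chromosome."""
    return [flat_index(idy, idx, length) for idx in range(length)]

# Threads working on chromosome 2 of length 4 touch consecutive offsets.
offsets = addresses_for_chromosome(2, 4)
```

Under this assumption, neighbouring threads touch neighbouring addresses, which is the access pattern a GPU can combine into a single coalesced memory transaction.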
-
[Diagram: program flow, split between CPU and GPU. Stages shown: Parse GA Parameters, Construct Initial Population, Generate Random Numbers, Evaluation Kernel, Statistics Update Kernel, Selection Kernel, Crossover Kernel, Mutation Kernel. GPU global memory holds: Random Numbers, Old Population, New Population, Fitness Scores, Statistics.]
-
[Program flow diagram repeated, with the Evaluation Kernel highlighted; data labels: Population, Scores]
-
Partially parallel method
The user specifies a serial code fragment for fitness evaluation.
Threads are arranged in a 1D grid.
Each thread executes the user's code on one chromosome, providing chromosome-level parallelism.
Benefit: abstraction
Fully parallel method
A user familiar with CUDA can effectively use a 2D thread layout.
Uses gene-level parallelism for fitness evaluation.
Benefit: efficiency
-
Example: 0/1 knapsack problem
Task: given item weights, item costs, and a knapsack capacity
Aim: maximize the total cost of the selected items
Representation: 1D binary string; 0/1 marks the absence/presence of an item
W and C are the total weight and total cost of a given representation
Best solution: the one with maximum C given W < Wmax
Fully parallel method: use a group of threads to compute the total cost and weight in logarithmic time
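The logarithmic-time claim can be illustrated with a pairwise (tree) reduction. The Python sketch below simulates on the CPU what a group of cooperating threads would do: at each step the active range is halved, and all additions within a step are independent of each other. The names `tree_reduce_pairs` and `knapsack_fitness` are hypothetical, and scoring an overweight solution as zero is an assumption; the slides do not say how infeasible chromosomes are handled.

```python
def tree_reduce_pairs(pairs):
    """Sum (cost, weight) pairs by halving the active range each step.

    Returns ((total_cost, total_weight), number_of_steps). Assumes a non-empty input.
    """
    data = list(pairs)
    n = len(data)
    steps = 0
    while n > 1:
        half = (n + 1) // 2
        # On a GPU, each value of i below would be handled by its own thread.
        for i in range(n - half):
            c1, w1 = data[i]
            c2, w2 = data[i + half]
            data[i] = (c1 + c2, w1 + w2)
        n = half
        steps += 1
    return data[0], steps

def knapsack_fitness(chromosome, costs, weights, w_max):
    """Total cost of the selected items, or 0 if the weight limit is violated."""
    # Each gene contributes its item's cost and weight only when the gene is 1.
    pairs = [(c * g, w * g) for g, c, w in zip(chromosome, costs, weights)]
    (total_cost, total_weight), _ = tree_reduce_pairs(pairs)
    # Assumption: infeasible solutions (W >= Wmax) score zero.
    return total_cost if total_weight < w_max else 0
```

For a chromosome of L genes the loop runs about log2(L) times, compared with L additions for a single thread walking the chromosome serially.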
-
[Program flow diagram repeated, with the Statistics Update Kernel highlighted; data labels: Scores, Statistics]
-
Selection and termination most often use population statistics
We use a standard parallel reduce algorithm to calculate the maximum, minimum, and average scores
We use CUDPP, a highly optimized public library, to sort and rank chromosomes
-
[Program flow diagram repeated, with the Selection Kernel highlighted; data labels: Statistics, Parents]
-
Selection Kernel
Uses N/2 threads
Each thread selects two parents for producing offspring
Uniform selection: selects parents uniformly at random
Roulette wheel selection: fitness-based approach; the higher the fitness, the better the chance of selection
-
Roulette Wheel
Sort the fitness scores.
Compute a roulette wheel array by doing a prefix-sum scan of the scores and normalizing it.
Generate a random number in the range 0 to 1.
Perform a binary search in the roulette wheel array for the nearest number smaller than the random number.
Return the index of the result in the array.
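The five steps above can be sketched in Python as follows. This is a CPU illustration under stated assumptions, not the talk's kernel: it uses an exclusive prefix sum (so the wheel starts at 0 and the slot found by the search maps directly to a chromosome), it assumes non-negative scores with a positive total, and the names `build_wheel` and `spin` are hypothetical.

```python
import bisect
from itertools import accumulate

def build_wheel(scores):
    """Return (order, wheel): chromosome indices sorted by score, and the
    normalized exclusive prefix sum of the sorted scores."""
    # Step 1: sort, remembering which chromosome each score belongs to.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    sorted_scores = [scores[i] for i in order]
    # Step 2: prefix-sum scan, shifted to be exclusive, then normalized.
    inclusive = list(accumulate(sorted_scores))
    total = inclusive[-1]
    exclusive = [0.0] + inclusive[:-1]
    wheel = [s / total for s in exclusive]
    return order, wheel

def spin(order, wheel, r):
    """Map a random number r in [0, 1) to the index of the selected chromosome."""
    # Step 4: binary search for the last wheel entry not greater than r.
    slot = bisect.bisect_right(wheel, r) - 1
    # Step 5: translate the slot back to the original chromosome index.
    return order[slot]

# Usage: chromosome 0 has score 6 out of a total of 10, so it owns 60% of the wheel.
order, wheel = build_wheel([6.0, 1.0, 3.0])
chosen = spin(order, wheel, 0.9)
```

In the usage above, the random number for step 3 is passed in as `r`; in the setting described in the talk it would come from the pre-generated pool of random numbers.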
-
[Program flow diagram repeated, with the Crossover Kernel highlighted; data labels: Old Population, New Population]
-
[Figure: crossover kernel thread layout over the population in GPU global memory. Rows are labelled Thread idy 1 to 4; columns are labelled Thread idx 1 to L.]
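A gene-level crossover kernel can be sketched as a double loop in which each (idy, idx) pair stands for one thread that copies a single gene. The Python below is a CPU simulation of that idea using one-point crossover, which is the operator listed in the test parameters. The function name, the `parents` and `cuts` arguments, and the convention that each parent pair produces two complementary offspring are assumptions for illustration.

```python
def one_point_crossover(old_pop, parents, cuts):
    """Build a new population in which every gene is copied independently.

    parents[k] is a pair of indices into old_pop, and cuts[k] is the crossover
    point for that pair. Pair k produces offspring 2k and 2k + 1.
    """
    n = 2 * len(parents)
    length = len(old_pop[0])
    new_pop = [[0] * length for _ in range(n)]
    for idy in range(n):             # one row of threads per offspring
        first, second = parents[idy // 2]
        if idy % 2 == 1:
            # The second offspring of a pair takes the parents in swapped order.
            first, second = second, first
        cut = cuts[idy // 2]
        for idx in range(length):    # one thread per gene
            source = first if idx < cut else second
            new_pop[idy][idx] = old_pop[source][idx]
    return new_pop

# Usage: one pair of parents, cut after the first gene.
children = one_point_crossover([[0, 0, 0, 0], [1, 1, 1, 1]], [(0, 1)], [1])
```

No thread depends on the result of another, so all N x L gene copies could run concurrently.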
-
[Program flow diagram repeated, with the Mutation Kernel highlighted; data labels: New Population, New Population]
-
Flip Mutator
Each thread handles one gene and mutates it with the mutation probability
[Figure: flip mutator on the population; labels: Thread 1,4, Thread Id y, Flip Coin, Coin State, Gene, Flip]
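As a rough illustration of the flip mutator, the Python sketch below treats each (idy, idx) position as one thread that reads its own pre-generated random number, compares it with the mutation probability, and flips its gene when the "coin" comes up true. The function name `flip_mutation`, the flat `rand_pool` argument, and the one-number-per-gene indexing are assumptions for the example.

```python
def flip_mutation(pop, rand_pool, p_mut):
    """Return a mutated copy of a binary population.

    rand_pool holds one number in [0, 1) per gene, indexed as idy * length + idx.
    """
    length = len(pop[0])
    mutated = []
    for idy, chromosome in enumerate(pop):
        row = []
        for idx, gene in enumerate(chromosome):
            # Coin flip: true with probability p_mut.
            coin = rand_pool[idy * length + idx] < p_mut
            # Flip the bit only when the coin is true.
            row.append(gene ^ 1 if coin else gene)
        mutated.append(row)
    return mutated

# Usage: only the first and last genes draw numbers below p_mut, so only they flip.
result = flip_mutation([[0, 1], [1, 0]], [0.0, 0.9, 0.9, 0.01], p_mut=0.05)
```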
-
[Program flow diagram repeated, with Generate Random Numbers highlighted; data label: Random Numbers]
-
Extensive use of random numbers
No primitive for generating a single random number on the fly
Solution: generate a pool of random numbers and copy it to the GPU
We use a CUDPP routine to generate a large pool of random numbers on the GPU (faster)
If better-quality random numbers are needed, this can be replaced by a CPU-based routine
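The pooling idea can be sketched as follows: all random numbers are generated up front, and each thread later reads the entry that belongs to it instead of calling a generator. The Python class below is a hypothetical CPU stand-in. The class name `RandomPool`, the `offset` parameter, and the wrap-around reuse of entries when a thread index exceeds the pool size are assumptions, and Python's `random` module replaces the CUDPP routine named on the slide.

```python
import random

class RandomPool:
    """A fixed pool of uniform random numbers in [0, 1), filled once up front."""

    def __init__(self, size, seed=0):
        rng = random.Random(seed)
        self.values = [rng.random() for _ in range(size)]

    def at(self, thread_id, offset=0):
        """Number assigned to a thread; indices wrap around the pool (assumed)."""
        return self.values[(offset + thread_id) % len(self.values)]

# Usage: each kernel could start reading at a different offset into the pool.
pool = RandomPool(8, seed=1)
coin = pool.at(thread_id=3, offset=2)
```

Swapping in a higher-quality generator would only change how `values` is filled; the per-thread lookup stays the same.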
-
Test device: a quarter of an Nvidia Tesla S1030 GPU
Test problem: solve a 0/1 knapsack problem
Test parameters:
Representation: a 1D binary string
Crossover: one-point crossover
Mutation: flip mutation
Selection: uniform and roulette wheel
-
[Plots: average run-time for 100 iterations (uniform selection); average run-time for 100 iterations (roulette wheel selection); growth in run-time with increasing N x L. N: population size, L: chromosome length]
-
Our approach is modeled after GAlib and maintains structures for GA, Genome, and Statistics
It is built with enough abstraction from the user program that the user does not need to know the CUDA architecture or CUDA programming
This can be extended to build a GPU-accelerated GA library