Hardware Implementation of a Memetic Algorithm for VLSI Circuit Layout
Stephen Coe
MSc Engineering Candidate
Advisors: Dr. Shawki Areibi, Dr. Medhat Moussa

Topic Overview
Introduction
Background
– Circuit Partitioning (CP)
– Handel-C vs. VHDL
– Memetic Algorithm
Research Challenges
Hardware Approach
Current Status and Future Work

Introduction
Today's technology allows billions of transistors to be implemented in a single circuit
As these transistors become smaller, interconnect delay becomes the limiting factor in computer execution speeds
These factors place increasing importance on CAD tools that minimize interconnect length
As FPGAs become larger and faster, new methods for improving algorithm performance become available
[Figure: delay (ns, axis ticks 0.1, 1.0, 10) versus minimum feature size (2.0 µ, 1.5 µ, 1.0 µ, 0.8 µ, 0.5 µ, 0.35 µ), with curves for typical gate delay and interconnect delay]

Circuit Partitioning
Method of splitting complex designs into smaller subsystems
Attempts to minimize the connections between subsystems
The objective is to maximize the number of uncut nets
– The longer the interconnects between modules, the longer the delay within the circuit
[Figure: six modules (M0 to M5) connected by Net 1 to Net 5]
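
The partitioning objective above (maximize the number of uncut nets) can be sketched in a few lines. This is only an illustration: the netlist `EXAMPLE_NETS` and the function name `count_uncut_nets` are hypothetical and are not taken from the slide, whose figure does not list which modules each net connects.

```python
from typing import Dict, List


def count_uncut_nets(nets: Dict[str, List[int]], block_of: List[int]) -> int:
    """Count the nets whose modules all sit in the same block."""
    uncut = 0
    for modules in nets.values():
        blocks = {block_of[m] for m in modules}
        if len(blocks) == 1:  # every module of this net is in one block
            uncut += 1
    return uncut


# Hypothetical netlist (an assumption): the modules connected by each net.
EXAMPLE_NETS = {
    "net1": [0, 1],
    "net2": [1, 2],
    "net3": [2, 3],
    "net4": [3, 4],
    "net5": [4, 5],
}

if __name__ == "__main__":
    # block_of[i] is the block (0 or 1) that module i is assigned to.
    print(count_uncut_nets(EXAMPLE_NETS, [0, 0, 0, 1, 1, 1]))
    print(count_uncut_nets(EXAMPLE_NETS, [0, 1, 0, 1, 0, 1]))
```

With this made-up netlist, grouping neighbouring modules leaves only one net cut, while alternating the blocks cuts every net.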

Development Tools
Celoxica DK Design Suite
– High-level language based on ISO/ANSI-C for the implementation of algorithms in hardware
– Allows software engineers to design hardware without retraining
– Can generate VHDL code or an EDIF file
– Supports many Actel, Altera and Xilinx devices
– Uses third-party placement and routing programs to generate bit files
[Figure: Design Flow. Stages shown: Handel-C source files, compile, generate EDIF (netlist), generate VHDL/Verilog, simulate and netlist, place and route tools, bitstream generation]

Similarities of Handel-C & ISO C
Similarities
– #define, #ifdef, etc.
– Casting between different variable types
– Function declarations are the same
– Registers are stored as variables (e.g. int, unsigned, etc.)
– for, while and do loops
Differences
– No float or double in Handel-C
– Variables in Handel-C do not have predefined widths
– No recursive function calls
– Inline functions generate entirely new hardware
– No malloc or free (hardware cannot create dynamic memory)
– Data can be read in for simulation only
– Parallelism exists

Memory in Handel-C
Memory Access Advantage
Memory is accessed as an array
• MemoryData[1024] = WriteData;
• Allows multi-dimensional memory access
Type of memory is easily distinguishable
• Block RAM
• External RAM
• Logic RAM
Memory data is accessed within 1 clock
• No specific timing required
Memory Access Disadvantage
Divides operating clock frequency by 4
[Figure: timing diagram with signals External Clock, Handel-C Clock, Write Enable and Data]
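
The cost of the divide-by-4 can be made concrete with a little arithmetic. The sketch below assumes the divider applies directly to the external clock, as the timing diagram suggests, and uses the 59 MHz figure quoted on the hardware results slide; the function names are illustrative only.

```python
def handel_c_clock_mhz(external_clock_mhz: float, divider: int = 4) -> float:
    """Effective design clock after the memory interface divides the external clock."""
    return external_clock_mhz / divider


def clock_period_ns(clock_mhz: float) -> float:
    """Period in nanoseconds of a clock given in MHz."""
    return 1000.0 / clock_mhz


if __name__ == "__main__":
    effective = handel_c_clock_mhz(59.0)   # 59 MHz / 4
    print(effective, "MHz")
    print(round(clock_period_ns(effective), 2), "ns per Handel-C clock cycle")
```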

Parallel Execution in Handel-C
Parallel Execution
– par{ } command
[Figure: two parallel branches stepping through Clock 1 to Clock 4; one branch waits for the right-hand branch to finish execution]
Channel Communication
– Allows parallel components to talk to each other
[Figure: parallel components connected by a channel]
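
The timing behaviour of a par{ } block can be modelled with a short sketch. It assumes each branch is described only by the number of clock cycles it needs and that the block finishes when its slowest branch finishes, which is what the "waiting for right execution to finish" figure depicts. The function names are hypothetical, and this is a cycle-count model in Python, not Handel-C code.

```python
from typing import List


def seq_cycles(branch_cycles: List[int]) -> int:
    """Cycles needed when the branches run one after another."""
    return sum(branch_cycles)


def par_cycles(branch_cycles: List[int]) -> int:
    """Cycles needed when the branches run in parallel: the slowest branch decides."""
    return max(branch_cycles, default=0)


def wait_cycles(branch_cycles: List[int]) -> List[int]:
    """Cycles each branch spends idle while waiting for the slowest branch."""
    total = par_cycles(branch_cycles)
    return [total - cycles for cycles in branch_cycles]


if __name__ == "__main__":
    branches = [2, 4]             # left branch: 2 cycles, right branch: 4 cycles
    print(seq_cycles(branches))   # sequential cost
    print(par_cycles(branches))   # parallel cost
    print(wait_cycles(branches))  # the left branch idles while the right one finishes
```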

Memetic Algorithm
A genetic/evolutionary algorithm that includes a non-genetic local search to improve solutions
Genetic Algorithm
– Population-based heuristic technique based on the biological reproductive system
– Operates on the theory of “survival of the fittest”
– Good at exploring the solution space
Local Search
– Iterative improvement algorithms
– Often get trapped in sub-optimal solutions
– Good at exploiting the solution space
– Success is dependent on good starting solutions
[Figure: search landscape comparing Genetic Algorithm and Local Search, with a point labelled "Not Global Minimum"]
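
How the two halves combine can be shown with a minimal sketch: a genetic loop in which every new solution is refined by a local search. Everything here is an assumption made for illustration (binary tournament selection, one-point crossover, single-bit mutation, replace-the-worst, and the placeholder fitness `one_max`); it is not the design presented in these slides. For circuit partitioning the fitness would instead be the number of uncut nets.

```python
import random
from typing import Callable, List

Solution = List[int]
Fitness = Callable[[Solution], int]


def local_search(solution: Solution, fitness: Fitness) -> Solution:
    """Iterative improvement: keep any single-bit flip that raises fitness."""
    current = solution[:]
    current_fit = fitness(current)
    improved = True
    while improved:
        improved = False
        for i in range(len(current)):
            current[i] ^= 1              # try flipping bit i
            trial_fit = fitness(current)
            if trial_fit > current_fit:
                current_fit = trial_fit  # keep the improving move
                improved = True
            else:
                current[i] ^= 1          # undo the move
    return current


def memetic(fitness: Fitness, n_bits: int, pop_size: int = 8,
            generations: int = 20, seed: int = 0) -> Solution:
    """Genetic loop whose initial solutions and offspring are refined by local search."""
    rng = random.Random(seed)
    population = [
        local_search([rng.randint(0, 1) for _ in range(n_bits)], fitness)
        for _ in range(pop_size)
    ]

    def tournament() -> Solution:
        a, b = rng.sample(population, 2)          # selection: binary tournament
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        parent1, parent2 = tournament(), tournament()
        cut = rng.randrange(1, n_bits)            # one-point crossover
        child = parent1[:cut] + parent2[cut:]
        child[rng.randrange(n_bits)] ^= 1         # mutation: flip one bit
        child = local_search(child, fitness)      # the non-genetic refinement step
        worst = min(range(pop_size), key=lambda k: fitness(population[k]))
        if fitness(child) > fitness(population[worst]):
            population[worst] = child             # replacement
    return max(population, key=fitness)


def one_max(bits: Solution) -> int:
    """Placeholder fitness (an assumption): the number of 1 bits."""
    return sum(bits)


if __name__ == "__main__":
    print(memetic(one_max, n_bits=10))
```

The genetic operators supply varied starting points (exploration) and the local search polishes each one (exploitation), which is the division of labour the slide describes.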

Research Challenges
Memetic Algorithms
– Increase computational performance of the algorithm (CPU time)
– Exploit the inherent parallel nature of genetic algorithms
Hardware Development Languages
– Determine the impact of high-level languages vs. low-level languages

Approach
Explore the most efficient design to implement memetic algorithms on a single FPGA chip
Achieve increased performance through pipelining and parallelization
– Divide the tasks into separate but concurrent components
[Figure: FPGA chip divided among the different tasks of the algorithm]

Genetic Algorithm in Hardware
[Figure: block diagram with a Selection Module, a Crossover Module, and two parallel chains of Mutation, Repair and Fitness Modules (one each for Offspring 1 and Offspring 2), followed by Replacement]
Genetic Algorithm in Hardware (Pipelined Approach)
[Figure: pipelined version with a single chain of Selection, Crossover, Mutation, Repair and Fitness Modules and Replacement; Offspring 1, Offspring 2 and Offspring 3 move through the stages in successive steps]
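
The benefit of the pipelined approach can be estimated with a simple schedule. The sketch assumes six stages named after the modules in the figure, that every stage takes one equal time step, and that successive offspring do not depend on each other; none of these timing assumptions comes from the slides.

```python
from typing import Dict, List

STAGES = ["selection", "crossover", "mutation", "repair", "fitness", "replacement"]


def sequential_steps(n_offspring: int, n_stages: int = len(STAGES)) -> int:
    """Steps needed when each offspring passes through all stages before the next starts."""
    return n_offspring * n_stages


def pipelined_steps(n_offspring: int, n_stages: int = len(STAGES)) -> int:
    """Steps needed when a new offspring enters the pipeline at every step."""
    return n_stages + n_offspring - 1 if n_offspring else 0


def pipeline_schedule(n_offspring: int) -> List[Dict[str, int]]:
    """For each time step, map each busy stage to the offspring number it is processing."""
    schedule = []
    for t in range(pipelined_steps(n_offspring)):
        busy = {}
        for s, stage in enumerate(STAGES):
            k = t - s                     # index of the offspring at stage s at time t
            if 0 <= k < n_offspring:
                busy[stage] = k + 1
        schedule.append(busy)
    return schedule


if __name__ == "__main__":
    print(sequential_steps(3), pipelined_steps(3))
    for step, busy in enumerate(pipeline_schedule(3), start=1):
        print(step, busy)
```

Under these assumptions the gain grows with the number of offspring, because the fill time of the pipeline is paid only once.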

Local Search Algorithm
[Figure: modules M0 to M5 connected by Net 1 to Net 5 and divided between Block 0 and Block 1; the module data stores one block bit for each of modules 0 to 5; the objective value (uncut nets) is shown as 2 and 3]
(forcing specific nets within one block)

Sequential Issues
[Figure: sequential steps Select Next Move, Copy Solution and Update Net Info, shown with repeated groups of Loop1, Loop2 and Loop3 and two Block RAMs]

Preliminary Results of GA
Software Results (Sun Blade 1000)

Benchmark   Modules  Nets  Best    Worst   Mean    Std Dev  Time
prim1.dat   833      902   795.4   767.2   786.4   5.642    30.6
prim2.dat   3014     3029  2580.6  2504.4  2546.6  14.539   122.1
struct.dat  1952     1920  1713.2  1671.2  1694.6  8.252    73.1
ind1.dat    2271     2192  1947.6  1887.8  1919.6  12.134   87.9
pcb1.dat    24       32    25      19.2    24.7    1.073    0.8
chip1.dat   300      294   253.2   241.2   251.1   2.703    8.4
chip4.dat   224      221   186.6   175.4   184.6   2.361    6.6
fract.dat   149      147   107.6   96.2    107.4   2.480    4.3

Hardware Results (@ 59 MHz / 4)

Benchmark   Modules  Nets  Best    Worst   Mean    Std Dev  Time  Speedup  Quality
prim1.dat   833      902   661.4   645.2   657.2   3.775    10.3  290%     -16.8%
prim2.dat   3014     3029  1732.0  1703.0  1723.8  7.041    33.0  370%     -32.8%
struct.dat  1952     1920  1275.4  1246.8  1266.2  6.705    21.4  342%     -25.5%
ind1.dat    2271     2192  1415.0  1390.0  1407.8  6.138    23.8  369%     -27.3%
pcb1.dat    24       32    25.2    22.0    25.2    0.333    0.3   266%     0.8%
chip1.dat   300      294   230.8   221.2   229.8   1.883    3.4   247%     -8.8%
chip4.dat   224      221   188.8   182.0   188.2   1.316    2.5   264%     1.1%
fract.dat   149      147   116.6   112.0   116.3   0.661    1.7   253%     8.4%
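
The Speedup and Quality columns can be approximately recomputed from the Best and Time columns of the two result tables. The slides do not state the formulas, so the ones below are inferred: speedup as software time divided by hardware time, and quality as the relative change in the Best value. They reproduce most rows to within rounding; a few differ slightly (prim1.dat, for instance, comes out near 297% rather than 290%), which may reflect unrounded times behind the slide.

```python
from typing import Dict, Tuple

# (software best, software time, hardware best, hardware time), copied from the tables.
RESULTS: Dict[str, Tuple[float, float, float, float]] = {
    "prim1.dat": (795.4, 30.6, 661.4, 10.3),
    "prim2.dat": (2580.6, 122.1, 1732.0, 33.0),
    "struct.dat": (1713.2, 73.1, 1275.4, 21.4),
    "ind1.dat": (1947.6, 87.9, 1415.0, 23.8),
    "pcb1.dat": (25.0, 0.8, 25.2, 0.3),
    "chip1.dat": (253.2, 8.4, 230.8, 3.4),
    "chip4.dat": (186.6, 6.6, 188.8, 2.5),
    "fract.dat": (107.6, 4.3, 116.6, 1.7),
}


def speedup_percent(sw_time: float, hw_time: float) -> float:
    """Inferred formula: software run time as a percentage of hardware run time."""
    return 100.0 * sw_time / hw_time


def quality_percent(sw_best: float, hw_best: float) -> float:
    """Inferred formula: relative change of the hardware Best value versus software."""
    return 100.0 * (hw_best - sw_best) / sw_best


if __name__ == "__main__":
    for name, (sw_best, sw_time, hw_best, hw_time) in RESULTS.items():
        print(f"{name:11s} speedup {speedup_percent(sw_time, hw_time):6.1f}%  "
              f"quality {quality_percent(sw_best, hw_best):6.1f}%")
```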

Handel-C vs VHDL for Local Search Designs (xcv1000-4bg560)

                                          Handel-C            VHDL Prototype
Usage Summary
Number of GCLKs                           1/4 (25%)           2/4 (50%)
Number of 4 input LUTs                    3,349/24,576 (13%)  3,333/24,576 (13%)
Number of Slice Registers                 2,193/24,576 (8%)   1,709/24,576 (6%)
Number of Slices                          2,204/12,288 (17%)  2,573/12,288 (20%)
Total equivalent gate count               42,192              42,898
Speed
Average Delay on the 10 Worst Nets        11.612 ns           11.309 ns
Maximum Delay                             15.768 ns           11.979 ns
Average Connection Delay for this design  2.921 ns            2.775 ns

Current Status and Future Work
Current Status
– Completed VHDL Local Search Prototype
  • Verified through simulation
– Completed Handel-C Local Search Design
  • Verified and implemented on RC1000
– Completed Handel-C Genetic Algorithm Design
  • Currently in testing stages
Future Work
– Complete VHDL Local Search Design and Implementation
– Analyze the performance difference between the hardware-based memetic algorithm and the software algorithm