
Design of Two Different 128-bit Adders

Project Report

By Vladislav Muravin Concordia ID: 5505763

COEN6501: Digital Design & Synthesis Offered by Professor Asim Al-Khalili

Concordia University

December 2004

Table of Contents

1 Introduction
    1.1 Report Organization
    1.2 Common Adder Structures
        1.2.1 1-bit Full Adder
        1.2.2 N-bit Ripple Carry Adder
        1.2.3 Carry Skip Adder
        1.2.4 Carry Select Adder
        1.2.5 Carry Look Ahead Adder
        1.2.6 Prefix Adders
        1.2.7 Sklansky Prefix Adder
        1.2.8 Kogge-Stone Prefix Adder
2 Design Flow & Implementation
    2.1 Micro Architecture
        2.1.1 Top Entity
        2.1.2 Sub-Block Partitioning
            2.1.2.1 "Carry Propagate" and "Carry Generate" Block (pg_gen)
            2.1.2.2 Carry Generation Block
                2.1.2.2.1 Carry Generation Block – Sklansky Prefix Adder (cg_gen_sklansky)
                2.1.2.2.2 Carry Generation Block – Kogge-Stone Prefix Adder (cg_gen_kogge_stone)
            2.1.2.3 Sum Bits Generation Block (sb_gen)
    2.2 RTL Coding
    2.3 Verification Plan
    2.4 Synthesis, Place and Route
3 Results
    3.1 Simulation Results
        3.1.1 Initial Test Cases
        3.1.2 General Test Case
    3.2 Synthesis Results
        3.2.1 Multiplexing I/O
            3.2.1.1 Multiplexed Inputs
            3.2.1.2 Multiplexed Outputs
            3.2.1.3 Multiplexed Inputs and Outputs
        3.2.2 Changing Target Device
4 Design Enhancement – Pipelining
5 Summary and Conclusions
6 References

Table of Figures

Figure 1: 1-bit Full Adder
Figure 2: N-bit Carry Propagate Adder
Figure 3: Carry Skip Concept
Figure 4: Carry Select Concept
Figure 5: Sklansky Prefix Tree
Figure 6: Kogge-Stone Prefix Tree
Figure 7: Design Flow
Figure 8: Top Level View
Figure 9: full_adder Sub-Block Partitioning
Figure 10: "Carry Generate" and "Carry Propagate" Block Implementation
Figure 11: Sum Bits Generation Block Implementation
Figure 12: Test Bench & Verification Plan
Figure 13: Initial Test Case Simulation Results
Figure 14: General Test Case - Full Zoom
Figure 15: General Test Case - Example 1
Figure 16: General Test Case - Example 2
Figure 17: Forward Registers Balancing (Pipelining)
Figure 18: Backward Registers Balancing (Pipelining)

Table 1: Signal Description
Table 2: Synthesis Results (No Placement and Routing): XC2V500 -FG456-4 Device
Table 3: Synthesis Results: XC2V1000 – FF896-4 Device
Table 4: Placement and Routing Results – FF896-4 Device
Table 5: Placement and Routing Results of Pipelined Sklansky Adder
Table 6: Placement and Routing Results of Pipelined Kogge-Stone Adder

1 Introduction

The objective of this project is to design two different 128-bit adders, following the full design cycle from initial concept through structural RTL coding, simulation and synthesis for the Xilinx Virtex-2 FPGA family, device XC2V500.

1.1 Report Organization

The report is organized into six sections. Section 1 introduces common adder structures, briefly describing the Carry Skip, Carry Select and Carry Look-Ahead principles, and then elaborates on parallel-prefix adders, two of which, the Sklansky prefix adder and the Kogge-Stone prefix adder, are implemented in this project. Section 2 describes the design flow, the micro architecture of the design and the verification plan. Section 3 presents the simulation and synthesis results, followed by section 4, which describes a pipelining enhancement. Finally, sections 5 and 6 close the report with the conclusions and references, respectively.

1.2 Common Adder Structures

1.2.1 1-bit Full Adder

A 1-bit Full Adder is shown in Figure 1. The equations describing the outputs are:

$$S = A \oplus B \oplus C_{in}$$

$$C_{out} = A \cdot B + (A \oplus B) \cdot C_{in}$$

Figure 1: 1-bit Full Adder (inputs A, B, Cin; outputs S, Cout)

1.2.2 N-bit Ripple Carry Adder

An N-bit adder can be built iteratively by cascading N 1-bit full adders, as illustrated in Figure 2. The critical path is the carry path from the least significant stage to $C_{out}$, so its delay grows linearly with N.

Figure 2: N-bit Carry Propagate Adder
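The two output equations and the cascading of Figure 2 can be checked with a small behavioral model. The following Python sketch is only a bit-level illustration, not the VHDL of this project; the function names `full_adder` and `ripple_carry_add` are chosen here for illustration.

```python
def full_adder(a, b, c_in):
    """1-bit full adder: returns (sum, carry_out) for single-bit inputs."""
    s = a ^ b ^ c_in
    c_out = (a & b) | ((a ^ b) & c_in)
    return s, c_out


def ripple_carry_add(x, y, n, c_in=0):
    """Add two n-bit integers by chaining n full adders, LSB first."""
    carry = c_in
    result = 0
    for i in range(n):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry


# 4-bit example: 11 + 6 = 17, which does not fit in 4 bits.
total, c_out = ripple_carry_add(0b1011, 0b0110, 4)
print(bin(total), c_out)
```

The loop makes the linear dependence visible: stage i cannot produce its carry before stage i-1 has produced its own.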

1.2.3 Carry Skip Adder

Let $p_i = a_i \oplus b_i$ and $g_i = a_i \cdot b_i$, where p denotes "propagate" and g denotes "generate". The basic carry-skip (or carry-bypass) adder divides an N-bit adder into $N/M$ blocks of M bits each, as shown in Figure 3. Within each block, a simple M-bit ripple structure is realized (linear time Carry Skip Adder), in which the "propagate" and "generate" signals of the respective input bits form the output sum bits and the output carries. The multiplexer at the end of a block allows the input carry to bypass the block when all of the "propagate" signals in that block are asserted. After the carry-generate delay of the first block, the carry bypasses the subsequent blocks, each of which adds one multiplexer delay. If any of the "propagate" signals in a block is deasserted, the carry-out of that block does not depend on the carry-in from the previous blocks. The critical path delay is

$$t_{PD} = t_{setup} + M \cdot t_{FA} + \left(\frac{N}{M} - 1\right) \cdot t_{MUX} + (K - 1) \cdot t_{FA} + t_{SUM}$$

Section 1.2.4 explains how better performance can be achieved by varying the block size.

Figure 3: Carry Skip Concept

1.2.4 Carry Select Adder

Although this type of adder requires more hardware, it has a very interesting design concept. The linear Carry Select Adder is divided into $N/M$ blocks of M bits each, just as the Carry Skip Adder. In each block, the hardware is replicated in order to calculate the sum and carry-out bits for both possible carry-ins. Figure 4 illustrates this concept. The multiplexer at the end of each block chooses between the two carry-outs based on the carry-in from the previous stage. In this implementation, the critical path delay comprises the carry-generate delay of the first block, followed by the multiplexer delays of the successive blocks. This results in a linear time Carry Select Adder (CSA).

Variable-sized blocks can yield higher performance [5]. In a carry-select adder, the block sizes can increase so that all inputs arrive at each multiplexer at the same time, which minimizes the delay. For example, if the multiplexer delay is similar to the delay of a full adder, the minimal carry delay is achieved with 1 bit in the first block, 2 in the second, and so on. Linearly increasing block sizes result in a square-root number of block stages for the carry-propagate delay, and hence a square-root time CSA. A similar approach can yield a square-root time Carry Skip Adder (CSkA).

Figure 4: Carry Select Concept
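The square-root behaviour of linearly increasing block sizes can be illustrated numerically. The Python sketch below only counts blocks, under the assumption used in the text that a multiplexer delay is similar to a full adder delay; it does not model real gate delays, and the function name `linear_block_sizes` is introduced here for illustration.

```python
def linear_block_sizes(n, first=1):
    """Split n bits into blocks of size first, first+1, ... (last block truncated)."""
    sizes = []
    size = first
    remaining = n
    while remaining > 0:
        take = min(size, remaining)
        sizes.append(take)
        remaining -= take
        size += 1
    return sizes


sizes = linear_block_sizes(128)
print(len(sizes), sizes)
```

For a 128-bit adder this yields 16 blocks, i.e. on the order of $\sqrt{2N}$ multiplexer stages on the carry path instead of a number of stages proportional to N.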

1.2.5 Carry Look Ahead Adder

The Ripple Carry Adder generates its carries sequentially, making the output carry of each stage dependent on the input carry to that stage. In the Carry Look Ahead implementation, each carry is computed directly from the operand bits and the input carry, without waiting for the previous carries. Let $p_i = a_i \oplus b_i$ and $g_i = a_i \cdot b_i$, where p denotes "propagate" and g denotes "generate". Then $s_i = p_i \oplus c_i$ and $c_{i+1} = g_i + p_i \cdot c_i$. Expanding these equations for an n-bit adder gives:

$$c_1 = g_0 + p_0 c_0$$

$$c_2 = g_1 + p_1 g_0 + p_1 p_0 c_0$$

$$\vdots$$

$$c_n = g_{n-1} + p_{n-1} g_{n-2} + \dots + p_{n-1} p_{n-2} \cdots p_1 g_0 + p_{n-1} p_{n-2} \cdots p_0 c_0$$

Since no carry depends on the previous carries, the adder can be implemented as a two-level sum of products, which reduces the delay and increases the speed. Unfortunately, because CMOS gate delay increases non-linearly as the fan-in grows, the Carry Look Ahead implementation is used in a modular way, cascading several 4-bit CLAs.
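The equivalence between the recurrence $c_{i+1} = g_i + p_i c_i$ and its expanded sum-of-products form can be verified exhaustively for a 4-bit CLA. The Python sketch below is an illustrative check only; the helper names are chosen here and do not come from the project sources.

```python
from itertools import product


def carries_recurrence(p, g, c0):
    """Carries c[0..n] from c[i+1] = g[i] | (p[i] & c[i]), evaluated sequentially."""
    c = [c0]
    for pi, gi in zip(p, g):
        c.append(gi | (pi & c[-1]))
    return c


def carry_expanded(p, g, c0, k):
    """c[k] as a flat sum of products that uses no intermediate carry."""
    def prod(bits):
        out = 1
        for b in bits:
            out &= b
        return out

    # One term per g[j], ANDed with every propagate bit above it, plus the c0 term.
    terms = [g[j] & prod(p[j + 1:k]) for j in range(k)]
    terms.append(c0 & prod(p[:k]))
    return int(any(terms))


def check_4bit():
    """Compare both forms for every 4-bit operand pair and both carry-in values."""
    for a, b, c0 in product(range(16), range(16), (0, 1)):
        p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]
        g = [((a >> i) & (b >> i)) & 1 for i in range(4)]
        c = carries_recurrence(p, g, c0)
        if any(carry_expanded(p, g, c0, k) != c[k] for k in range(5)):
            return False
    return True


print(check_4bit())
```

The widest product term of $c_k$ has $k + 1$ inputs, which is the fan-in growth that motivates cascading 4-bit CLAs.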

1.2.6 Prefix Adders

In very simple words, a parallel prefix algorithm takes n inputs $x_{n-1}, x_{n-2}, \dots, x_0$ and produces in parallel the n outputs $x_{n-1} \circ x_{n-2} \circ \dots \circ x_0,\ x_{n-2} \circ \dots \circ x_0,\ \dots,\ x_0$. The analogy between carry computation and the prefix algorithm is that the carry at a certain stage i depends on all inputs of stages $i-1$ down to 0.

Let $a_{n-1}, a_{n-2}, \dots, a_0$ and $b_{n-1}, b_{n-2}, \dots, b_0$ be the n-bit binary numbers to be added. Let $c_0$ designate the input carry and $c_n$ the output carry. For each bit, the "propagate" ($p_i$) and "generate" ($g_i$) signals are defined as in the previous section. Furthermore, to parallelize the carry computation, two additional terms are defined: Group Carry Generate ($G_{i:j}$) and Group Carry Propagate ($P_{i:j}$).

For each group of bits, the Group Carry Generate signal $G_{i:j}$ means that a carry is generated somewhere between stages i and j and is propagated from that location to stage i. This implies $c_{i+1} = 1$ and, in particular, if $j = 0$ (and $c_0 = 0$), then $G_{i:0} = c_{i+1}$. The Group Carry Propagate signal $P_{i:j}$ means that the carry is propagated from stage j to stage i, i.e. $c_{i+1} = c_j$. The formal definition of $G_{i:j}$ and $P_{i:j}$ is expressed by the following relationship:

$$[G_{i:j}, P_{i:j}] = [g_i, p_i] \quad \text{if } i = j$$

$$[G_{i:j}, P_{i:j}] = [G_{i:k}, P_{i:k}] \circ [G_{k:j}, P_{k:j}] \quad \text{if } i \neq j$$

where $i \ge k \ge j$ and the "$\circ$" operator, introduced by Brent and Kung [1], is $[G, P] \circ [G', P'] = [G + P \cdot G',\ P \cdot P']$. Finally, once the carries $G_{i:0}$ for all $i < n$ have been computed, the sum bits are calculated as:

$$s_i = \begin{cases} p_i \oplus G_{i-1:0}, & n > i > 0 \\ p_i, & i = 0 \end{cases}$$

The traditional carry-ripple adder (CRA) can be regarded as a serial prefix adder using the above definitions.
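The definitions above can be exercised with a serial prefix adder, which applies the Brent-Kung operator one bit position at a time. The following Python sketch assumes $c_0 = 0$ and uses non-overlapping groups ($[g_i, p_i]$ combined with the group $i-1:0$); the names `combine` and `serial_prefix_add` are illustrative.

```python
def combine(left, right):
    """Brent-Kung operator on (G, P) pairs: (G, P) o (G', P') = (G | P & G', P & P')."""
    g_hi, p_hi = left
    g_lo, p_lo = right
    return g_hi | (p_hi & g_lo), p_hi & p_lo


def serial_prefix_add(x, y, n):
    """n-bit addition with c0 = 0, computing G(i:0) serially from bit 0 upward."""
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(n)]
    g = [((x >> i) & (y >> i)) & 1 for i in range(n)]
    group = [(g[0], p[0])]                          # [G(0:0), P(0:0)]
    for i in range(1, n):
        group.append(combine((g[i], p[i]), group[i - 1]))
    s = p[0]                                        # s_0 = p_0
    for i in range(1, n):
        s |= (p[i] ^ group[i - 1][0]) << i          # s_i = p_i xor G(i-1:0)
    return s, group[n - 1][0]                       # carry-out = G(n-1:0)


print(serial_prefix_add(0b1111, 0b0001, 4))
```

Because the operator is associative, the same group signals can be produced by a tree instead of a chain, which is what the Sklansky and Kogge-Stone structures exploit.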

1.2.7 Sklansky Prefix Adder

The Sklansky prefix tree is shown in Figure 5 for a 16-bit adder. Its structure is the simplest among the prefix adders. It is used for conditional-sum addition [2]. The fan-out of such an adder grows exponentially from input to output along the critical path and reaches $n/2$. This leads to a large delay as the operand width increases. Recursive division of the blocks can construct a full adder using such a tree for the implementation. The number of "o" cells required is $\frac{n}{2} \log_2 n$ and the delay is $\log_2 n$ levels, where n is the adder's width. The detailed implementation of the "o" cell is described in 2.1.2.2.

Figure 5: Sklansky Prefix Tree (bit positions 0 to 15)
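The cell count and fan-out quoted for the Sklansky tree can be reproduced by enumerating the cell positions. The Python sketch below places a cell in column i at level j whenever bit j-1 of i is set, and counts only "o" cell inputs as loads (pass-through nodes are ignored); the function names are illustrative and n is assumed to be a power of two.

```python
from collections import Counter


def sklansky_cells(n):
    """List (level, column, partner) for every 'o' cell of an n-bit Sklansky tree."""
    levels = n.bit_length() - 1                   # log2(n) for a power-of-two n
    cells = []
    for j in range(1, levels + 1):
        half = 1 << (j - 1)
        for i in range(n):
            if i & half:                          # bit j-1 of column index i is set
                partner = i - (i % half) - 1      # top column of the lower half-block
                cells.append((j, i, partner))
    return cells


def max_fanout(cells):
    """Largest number of 'o' cell inputs driven by a single node of the previous level."""
    loads = Counter()
    for j, i, k in cells:
        loads[(j - 1, i)] += 1
        loads[(j - 1, k)] += 1
    return max(loads.values())


for width in (16, 128):
    cells = sklansky_cells(width)
    print(width, len(cells), max_fanout(cells))
```

For $n = 128$ the enumeration gives 448 cells and a maximum fan-out of 64, in agreement with $\frac{n}{2} \log_2 n$ and $n/2$.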

1.2.8 Kogge-Stone Prefix Adder

Compared with the Sklansky structure, the Kogge-Stone structure reduces the fan-out to 2 at the expense of a larger number of "o" (circle) cells. It is obtained by copying the of the most significant bit position [3]. Figure 6 shows this prefix tree for 16-bit operands. Just as in 1.2.7, recursive division of the blocks can construct a full adder using such a tree for the implementation. The number of "o" cells required for the implementation is $n \log_2 n - n + 1$ and the delay is $\log_2 n$ levels, where n is the adder's width. For $n = 128$, the delay is 7 levels. Because of the larger cell count, the Kogge-Stone adder is expected to consume more resources than the Sklansky adder.

Figure 6: Kogge-Stone Prefix Tree (bit positions 0 to 15)
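The Kogge-Stone cell count and fan-out can be checked in the same way, by enumerating the cells. The Python sketch below assumes that level j holds a cell in every column $i \ge 2^{j-1}$, combined with column $i - 2^{j-1}$, and counts only "o" cell inputs as loads; the function names are illustrative and n is assumed to be a power of two.

```python
from collections import Counter


def kogge_stone_cells(n):
    """List (level, column, partner) for every 'o' cell of an n-bit Kogge-Stone tree."""
    levels = n.bit_length() - 1                   # log2(n) for a power-of-two n
    return [(j, i, i - (1 << (j - 1)))
            for j in range(1, levels + 1)
            for i in range(1 << (j - 1), n)]


def max_fanout(cells):
    """Largest number of 'o' cell inputs driven by a single node of the previous level."""
    loads = Counter()
    for j, i, k in cells:
        loads[(j - 1, i)] += 1
        loads[(j - 1, k)] += 1
    return max(loads.values())


for width in (16, 128):
    cells = kogge_stone_cells(width)
    print(width, len(cells), max_fanout(cells))
```

For $n = 128$ this gives 769 cells with a fan-out of 2, matching $n \log_2 n - n + 1$.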

2 Design Flow & Implementation

Figure 7 illustrates the design flow for the implementation of the prefix adders.

Figure 7: Design Flow (Design Specification and Macro Architecture; VHDL RTL Coding at Structural Level with Emacs VHDL mode; Simulation with ModelSim 6.0 SE; Synthesis, Place and Route with Xilinx ISE 6.3 SP3; supported by the Verification Plan and Test Case Specification, a Test Bench with PRBS Generator, and the Compare Results, Analyze Results and Reports steps)

2.1 Micro Architecture

2.1.1 Top Entity

Figure 8 illustrates the top-level view. The top entity is named full_adder_sklansky or full_adder_kogge_stone, respectively, with the ports listed in Table 1.

Figure 8: Top Level View

| Signal Name | Width [bits] | Direction | Comments |
|---|---|---|---|
| operand1 | 128 | input | Number #1 to be added |
| operand2 | 128 | input | Number #2 to be added |
| sys_clk | 1 | input | System clock |
| reset_n | 1 | input | System reset (active low) |
| result | 128 | output | Result of an addition |
| carry_out | 1 | output | Output carry resulting from an addition |

Table 1: Signal Description

2.1.2 Sub-Block Partitioning

The top-level block is further partitioned into three sub-blocks, as shown in Figure 9. Many partitioning choices are possible; this one is chosen because the two adder designs then differ in only one sub-block, the Carry Generation Block (cg_gen). Consequently, two variants of this sub-block are designed: cg_gen_sklansky and cg_gen_kogge_stone.

Figure 9: full_adder sub-block partitioning (pg_gen, the "Carry Propagate" & "Carry Generate" Block, takes operand1[127:0] and operand2[127:0] and produces p[i] and g[i](0); cg_gen_sklansky or cg_gen_kogge_stone, the 2-D Carry Generation Block, produces g[i](M-1); sb_gen, the Sum Bits Generation Block, produces s[127:0] and carry_out)

The subsequent sections elaborate on each of the sub-blocks.

2.1.2.1 "Carry Propagate" and "Carry Generate" Block (pg_gen)

This sub-block calculates the "carry propagate" signals $p[i](0)$ and the "carry generate" signals $g[i](0)$ bitwise from operand1 and operand2, as defined in 1.2.5, namely:

$$p[i](0) = \text{operand1}[i] \oplus \text{operand2}[i]$$

$$g[i](0) = \text{operand1}[i] \cdot \text{operand2}[i]$$

The implementation is shown in Figure 10. This block consumes 128 2-input AND gates and 128 2-input XOR gates.

Figure 10: "Carry Generate" and "Carry Propagate" Block Implementation

2.1.2.2 Carry Generation Block

The signals $p[i](0)$ and $g[i](0)$ generated in the pg_gen block are used within the Carry Generation Block to calculate the $g[i](M-1)$ signals, which can be represented as a two-dimensional carry generate structure. The following sections describe the implementation of the Carry Generation Block for each of the chosen designs.

2.1.2.2.1 Carry Generation Block – Sklansky Prefix Adder (cg_gen_sklansky)

Following the Sklansky prefix tree (presented in 1.2.7) and assuming a two-dimensional structure of j rows by i columns, the following observations are made:

• In column i, cells occupy the nodes whose row coordinates j correspond to a "1" in the binary representation of i, i.e. they follow directly from the binary encoding of the index i. A coordinate corresponding to a "0" in the binary representation of i simply propagates $p[i](j)$ and $g[i](j)$.

• All "o" (circle) cells are of GP type except those situated in the bottom border of $j < \log_2 i$, which are of G type.

The outputs of a GP cell are defined as follows:

$$p[i](j) = p[i](j-1) \cdot p[i - (i \bmod 2^{j-1}) - 1](j-1)$$

$$g[i](j) = g[i](j-1) + p[i](j-1) \cdot g[i - (i \bmod 2^{j-1}) - 1](j-1)$$

The output of a G cell is defined as follows:

$$g[i](j) = g[i](j-1) + p[i](j-1) \cdot g[i - (i \bmod 2^{j-1}) - 1](j-1)$$

Following the prefix algorithm description, with $n = 128$ the implementation consumes 448 "o" cells, namely 448 2-input OR gates and the same number of 2-input AND gates. The delay is 7 levels and the fan-out is 64.
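The index relationships of the Sklansky carry generation block can be checked with a behavioral model of the whole data path (pg_gen, carry tree, sb_gen). The Python sketch below is an illustration under the assumptions that carry_in is 0 and that a cell sits in column i at level j when bit j-1 of i is set; for simplicity every cell computes both p and g (GP type). It is not the structural VHDL of the project, and the function name is chosen here.

```python
def sklansky_add(x, y, n=128):
    """Bit-level model: pg_gen -> Sklansky carry tree -> sb_gen (carry_in = 0)."""
    levels = n.bit_length() - 1                   # log2(n) for a power-of-two n
    # pg_gen: p[i](0) and g[i](0)
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(n)]
    g = [((x >> i) & (y >> i)) & 1 for i in range(n)]
    p0 = p[:]                                     # sb_gen needs the level-0 propagates
    for j in range(1, levels + 1):
        half = 1 << (j - 1)
        new_p, new_g = p[:], g[:]                 # columns without a cell pass through
        for i in range(n):
            if i & half:                          # bit j-1 of i is "1": place a cell
                k = i - (i % half) - 1            # partner column from the formula
                new_g[i] = g[i] | (p[i] & g[k])
                new_p[i] = p[i] & p[k]
        p, g = new_p, new_g
    # sb_gen: s[0] = p[0], s[i] = p[i] xor g[i-1] of the last level
    result = p0[0]
    for i in range(1, n):
        result |= (p0[i] ^ g[i - 1]) << i
    return result, g[n - 1]                       # (result, carry_out)


a = 0x0123456789ABCDEF0123456789ABCDEF
b = 0xFFFFFFFF00000000FFFFFFFF00000001
s, c = sklansky_add(a, b)
print(s | (c << 128) == a + b)
```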

2.1.2.2.2 Carry Generation Block – Kogge-Stone Prefix Adder (cg_gen_kogge_stone)

Following the Kogge-Stone prefix tree (presented in 1.2.8) and assuming a two-dimensional structure of j rows by i columns, the nodes in the upper-left are populated with "o" (circle) cells, while the rest of the two-dimensional array is empty, i.e. the "o" (circle) cells are placed in the nodes whose coordinates satisfy the following relationship:

$$1 \le j \le M-1 \quad \text{and} \quad 2^{j-1} \le i \le N-1$$

The outputs of the placed cells are:

$$p[i](j) = p[i](j-1) \cdot p[i - 2^{j-1}](j-1)$$

$$g[i](j) = g[i](j-1) + p[i](j-1) \cdot g[i - 2^{j-1}](j-1)$$

Following the prefix algorithm description, with $n = 128$ the implementation consumes 769 "o" cells, hence occupying 769 2-input OR gates and the same number of 2-input AND gates.
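The Kogge-Stone cell equations can be checked with a behavioral model of the whole data path as well. The Python sketch below assumes carry_in = 0 and places a cell in every column $i \ge 2^{j-1}$ of level j; it also counts the cells it evaluates. It is an illustration only, not the structural VHDL of the project, and the function name is chosen here.

```python
def kogge_stone_add(x, y, n=128):
    """Bit-level model: pg_gen -> Kogge-Stone carry tree -> sb_gen (carry_in = 0)."""
    levels = n.bit_length() - 1                   # log2(n) for a power-of-two n
    # pg_gen: p[i](0) and g[i](0)
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(n)]
    g = [((x >> i) & (y >> i)) & 1 for i in range(n)]
    p0 = p[:]                                     # sb_gen needs the level-0 propagates
    cells = 0
    for j in range(1, levels + 1):
        dist = 1 << (j - 1)
        new_p, new_g = p[:], g[:]                 # columns below dist pass through
        for i in range(dist, n):
            new_g[i] = g[i] | (p[i] & g[i - dist])
            new_p[i] = p[i] & p[i - dist]
            cells += 1
        p, g = new_p, new_g
    # sb_gen: s[0] = p[0], s[i] = p[i] xor g[i-1] of the last level
    result = p0[0]
    for i in range(1, n):
        result |= (p0[i] ^ g[i - 1]) << i
    return result, g[n - 1], cells                # (result, carry_out, cell count)


a = 0x0123456789ABCDEF0123456789ABCDEF
b = 0xFFFFFFFF00000000FFFFFFFF00000001
s, c, cells = kogge_stone_add(a, b)
print(s | (c << 128) == a + b, cells)
```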

2.1.2.3 Sum Bits Generation Block (sb_gen)

The sum bits are produced in the Sum Bits Generation Block by XORing the "carry propagate" signals $p[i](0)$, generated in the pg_gen block, with the "carry generate" bits $g[i-1](M-1)$ of the preceding bit position; $s[0]$ is formed from $p[0]$ and carry_in. Figure 11 illustrates the implementation, which consumes 128 2-input XOR gates.

Figure 11: Sum Bits Generation Block Implementation

2.2 RTL Coding

RTL coding is done in VHDL at the structural level. The basic cells are a 2-input AND gate, a 2-input OR gate, a 2-input XOR gate and a D-type positive-edge-triggered flip-flop. The text editor used is Emacs version 20.7 with VHDL mode, since it provides many templates for aligning VHDL code so that it is easy to read. Each file has a header at the top stating the entity name and its logical function.

2.3 Verification Plan

In general, a design (especially a large and complex one) is verified across many scenarios and input combinations by describing the same functionality in a high-level language, such as C/C++, or with verification tools, such as Verisity Specman. For the verification of the two full adders, the following is proposed (Figure 12). A test bench, written in behavioral Verilog, instantiates both designs. Two 128-bit numbers are generated using dedicated LFSR (Linear Feedback Shift Register) generators [4], which produce a pseudo-random bit stream. Every clock cycle, the values of the two 128-bit numbers change in a pseudo-random way. These values are summed using the '+' operation within the test bench and are also applied as inputs to both adders. The resulting output sum and carry of each adder are compared with the result of the '+' addition within the test bench.

A test case is successful (test passed) when the result of the test bench matches the result of each adder.

Figure 12: Test Bench & Verification Plan (two 128-bit PRBS generators drive operand1[127:0] and operand2[127:0] of full_adder_sklansky and full_adder_kogge_stone; the test bench computes operand1+operand2, compares it with result[127:0] and carry_out of each adder to produce match_sklansky and match_kogge_stone, and writes a results file)
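The verification scheme can be sketched in software: two LFSRs generate the operands, a '+' reference computes the expected sum, and the output of the device under test is compared against it every cycle. The Python model below is an illustration only. The report does not state its LFSR polynomial, so the tap positions, the seeds, and the stand-in `ripple_dut` adder are all assumptions made here.

```python
MASK128 = (1 << 128) - 1


def lfsr128_step(state, taps=(128, 126, 101, 99)):
    """Shift a 128-bit Fibonacci LFSR once; taps are 1-based bit positions (assumed)."""
    feedback = 0
    for t in taps:
        feedback ^= (state >> (t - 1)) & 1
    return ((state << 1) | feedback) & MASK128


def reference_add(a, b):
    """The test bench's '+' reference: (128-bit result, carry_out)."""
    total = a + b
    return total & MASK128, total >> 128


def ripple_dut(a, b, n=128):
    """Stand-in device under test: a bit-level ripple adder (hypothetical, not the RTL)."""
    carry, result = 0, 0
    for i in range(n):
        x, y = (a >> i) & 1, (b >> i) & 1
        result |= (x ^ y ^ carry) << i
        carry = (x & y) | ((x ^ y) & carry)
    return result, carry


def run_testbench(dut, cycles=1000,
                  seed1=0x0123456789ABCDEF0123456789ABCDEF,
                  seed2=0x0F1E2D3C4B5A69780F1E2D3C4B5A6978):
    """Return the number of cycles in which dut disagrees with the reference."""
    s1, s2 = seed1, seed2          # seeds must be non-zero for an XOR-feedback LFSR
    mismatches = 0
    for _ in range(cycles):
        s1, s2 = lfsr128_step(s1), lfsr128_step(s2)
        if dut(s1, s2) != reference_add(s1, s2):
            mismatches += 1
    return mismatches


print("mismatches:", run_testbench(ripple_dut))
```

A mismatch count of zero corresponds to the match signals staying asserted for the whole simulation.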

2.4 Synthesis, Place and Route

Synthesis, placement and routing of the design are done using Xilinx ISE 6.3i with the latest service pack, SP3. The constraints are set for the best timing by selecting the optimization criterion "speed" with maximum effort. More details on the results, as well as the problems encountered, are given in section 3.2.

3 Results

3.1 Simulation Results

3.1.1 Initial Test Cases

The initial test cases are defined as sums of specific 128-bit numbers. The very first case verifies the sum of the following numbers:

• 64 zeros followed by 64 ones.
• 64 ones followed by 64 zeros.

The next case is:

• 32 repetitions of 0xA.
• 32 repetitions of 0x5.

In this fashion, possible bit swapping or incorrect index generation is tested. Figure 13 illustrates the simulation results for the initial test cases. operand1 and operand2 are the two 128-bit numbers to be added. result and carry_out are the outputs of each adder, marked by the appropriate divider (Sklansky Adder and Kogge-Stone Adder, respectively).
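The expected outputs for these two initial cases can be worked out directly. The short Python sketch below builds the four operands and computes the reference result and carry with ordinary integer addition; the variable and function names are illustrative.

```python
MASK128 = (1 << 128) - 1

low_ones = (1 << 64) - 1                  # 64 zeros followed by 64 ones
high_ones = low_ones << 64                # 64 ones followed by 64 zeros
all_a = int("A" * 32, 16)                 # 32 repetitions of 0xA
all_5 = int("5" * 32, 16)                 # 32 repetitions of 0x5


def expected(a, b):
    """Reference (result, carry_out) of a 128-bit addition."""
    total = a + b
    return total & MASK128, total >> 128


for a, b in [(low_ones, high_ones), (all_a, all_5)]:
    result, carry = expected(a, b)
    print(f"{a:032X} + {b:032X} = {result:032X}, carry_out = {carry}")
```

In both cases the operands have no overlapping "1" bits, so every result bit should be 1 and carry_out should be 0.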

Figure 13: Initial Test Case Simulation Results

3.1.2 General Test Case

In the general test case, the data is generated in a pseudo-random way, as described in section 2.3. Three snapshots of the simulation results are given in the following figures. Figure 14 illustrates the entire simulation. The lowest divider separates the test bench signals. operand1_prbs and operand2_prbs are the 128-bit PRBS data applied to the adders. operand1 and operand2 are the input numbers; result and carry_out are the outputs of the adder circuits, marked by the corresponding divider (Sklansky Adder and Kogge-Stone Adder, respectively). Two more very important test bench signals are result_match_sklansky and result_match_kogge_stone, which are updated each clock cycle depending on whether the test bench result matches the respective result of the Sklansky adder and the Kogge-Stone adder. Figure 15 and Figure 16 give two "zoom-in" examples of the same simulation.

Figure 14: General Test Case - Full Zoom

Figure 15: General Test Case - Example 1

Figure 16: General Test Case - Example 2

3.2 Synthesis Results

Both designs were successfully synthesized for the Virtex-2 device XC2V500. The synthesis results are summarized in Table 2. The Kogge-Stone adder consumes more resources than the Sklansky adder, as expected.

Results Explanation (Table 2): The inputs and outputs of the design are registered in order to obtain a more accurate delay estimate. Furthermore, in the placement and routing stage, an option that forces the flip-flops to be packed into the I/O buffers is selected, so that the logic delay represents a true estimate of each adder's processing delay in this FPGA implementation. However, the maximum number of available user I/O pins for this device is 264 (package FG456), while the ports in Table 1 total 387 signals. Placement and routing of the design, and hence a true estimate of its logic delay, is therefore not possible. Consequently, there are two alternatives. One is to multiplex the I/Os in order to fit the design into the XC2V500 device. The other is to select a larger device, the XC2V1000. Both alternatives are described in the following subsections.

Table 2: Synthesis Results (No placement and routing): XC2V500 -FG456-4 device

| Design | LUTs Usage | 1-bit Registers Usage | Total Slices Usage | Maximum Frequency |
|---|---|---|---|---|
| Sklansky Adder | 829 (13%) | 385 (6%) | 453 (14%) | 85.6 MHz |
| Kogge-Stone Adder | 1449 (23%) | 385 (6%) | 751 (24%) | 100.5 MHz |

3.2.1 Multiplexing I/O

This alternative requires a complete redesign of the interface and a change to the overall architecture of the design. Loading the numbers or outputting the result in a multiplexed way each has advantages and disadvantages, which are summarized below. In addition, handshaking signals that designate the start of loading and the completion of the addition are required.

3.2.1.1 Multiplexed Inputs

In this case, the design latency (overall processing time) increases, since the input numbers cannot be acquired all at once. However, two major advantages can be achieved. First, the logic required for the addition can be reduced, since the adder cannot process more bits than are present on the interface in a given cycle. Consequently, the addition can be performed in a multiplexed fashion, especially if the least significant part of the input numbers is loaded first. Second, the clock speed of the design should increase, since the complexity and the number of combinational logic levels decrease.

3.2.1.2 Multiplexed Outputs

In this case, the design latency (overall processing time) also increases, since the output is not generated all at once. However, two major advantages can be achieved here as well. First, the logic required for the addition can be reduced, since the adder does not need to generate more bits per cycle than the output (result) width. Consequently, the addition can be performed in a multiplexed fashion, processing the least significant part of the input numbers first, i.e. the least significant part of the output is generated earlier than the most significant one. Second, the clock speed of the design should increase, since the complexity and the number of combinational logic levels decrease.

3.2.1.3 Multiplexed Inputs and Outputs

In general, this case combines the alternatives discussed in 3.2.1.1 and 3.2.1.2. The design latency (overall processing time) increases. Assuming that the inputs are loaded with the least significant part first, the least significant part of the output can be generated immediately. The same two major advantages can be achieved in this case as well. First, the logic required for the addition can be reduced. Second, the clock speed of the design should increase, since the complexity and the number of combinational logic levels decrease. The optimal case is when the input and output widths are the same. Different input and output widths would add another level of complexity to the design, which is left outside the scope of this project.

3.2.2 Changing Target Device

This alternative is the quickest solution because it introduces no modifications to the RTL design. The new target device is the XC2V1000 with package FF896, allowing up to 432 user I/O pins. The main disadvantage of this alternative is that the larger device is a more costly solution. Table 3 and Table 4 present the synthesis results and the placement and routing results, respectively, both with maximum effort on timing. The results differ because the synthesis tool estimates the delays without knowing the true placement and routing.

Table 3: Synthesis Results: XC2V1000 – FF896-4 device

| Design | LUTs Usage | 1-bit Registers Usage | Total Slices Usage | Maximum Frequency |
|---|---|---|---|---|
| Sklansky Adder | 829 (8%) | 385 (3%) | 453 (6%) | 85.6 MHz |
| Kogge-Stone Adder | 1449 (14%) | 385 (3%) | 751 (14%) | 100.5 MHz |

Table 4: Placement and Routing Results – FF896-4 device

| Design | Total Slices Usage | Maximum Delay / Frequency |
|---|---|---|
| Sklansky Adder | 585 (11%) | 15.428 ns / 64.8 MHz |
| Kogge-Stone Adder | 1042 (20%) | 14.149 ns / 70.1 MHz |

4 Design Enhancement – Pipelining

Pipelining is introduced in order to improve the design speed. There are two ways of applying it. The first, manual, is to locate the point on the critical path whose arrival time is exactly half the total delay of the critical path (or one third, if two pipeline stages are inserted, and so on) and to insert a pipeline register there. The second, automatic pipelining, is described below. The location of the pipeline registers is chosen automatically by the Xilinx synthesis tool. N pipeline stages are added to the inputs, the outputs, or both the inputs and outputs of the design, and the software optimizes the location of the pipeline registers according to the specified timing requirements and synthesis effort by moving them forward and backward. This is referred to as "forward/backward register balancing" in Xilinx ISE [6] and as "retiming" in Synplicity Synplify Pro 7.xx [7], and it is illustrated in Figure 17 and Figure 18. The software automatically determines Td1 and Td2 according to the given timing constraints and synthesis effort.

[Diagram labels: pipeline stages clocked by sys_clk; path delay Td split so that Td = Td1 + Td2]

Figure 17: Forward Registers Balancing (Pipelining)

[Diagram labels: pipeline stages clocked by sys_clk; path delay Td split so that Td = Td1 + Td2]

Figure 18: Backward Registers Balancing (Pipelining)
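The idea of splitting a path delay Td into balanced parts can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the algorithm used by the Xilinx or Synplicity tools: the function name best_register_cuts and the per-element delays are hypothetical, the search is brute force, and register overhead and routing delay are ignored.

```python
from itertools import combinations


def best_register_cuts(delays_ns, num_registers):
    """Return (clock_period, cut_positions) minimizing the slowest segment.

    delays_ns     -- delays of consecutive logic elements on one path
    num_registers -- number of pipeline registers to insert
    A cut at position k places a register between elements k-1 and k.
    """
    n = len(delays_ns)
    best = None
    for cuts in combinations(range(1, n), num_registers):
        bounds = (0, *cuts, n)
        # The clock period is set by the slowest register-to-register segment.
        period = max(sum(delays_ns[a:b]) for a, b in zip(bounds, bounds[1:]))
        if best is None or period < best[0]:
            best = (period, cuts)
    return best


# Hypothetical critical path: six logic levels, total delay Td = 10.0 ns.
path = [1.0, 2.0, 1.5, 2.5, 1.0, 2.0]

one_stage = best_register_cuts(path, 1)   # Td1 = 4.5 ns, Td2 = 5.5 ns
two_stages = best_register_cuts(path, 2)  # segments of 3.0, 4.0 and 3.0 ns

print("one register: ", one_stage)
print("two registers:", two_stages)
```

Because the logic elements cannot be divided, the best single cut here gives a period of 5.5 ns rather than the ideal Td / 2 = 5.0 ns, which is why the exact half-delay point of the manual method is only a target.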

Table 5 gives the results of automatic pipelining of the Sklansky adder, and Table 6 gives the results for the Kogge-Stone adder. From the results, it is observed that:

• Adding one output pipeline stage improves the timing, while adding two output pipeline stages does not improve it further. The main reason is that the delay consists of approximately 25% to 30% logic delay and approximately 70% routing delay.

• Although adding two pipeline stages improves the flip-flop to flip-flop delay, the routing delay makes the total delay worse than with only one pipeline stage.

• Another important factor that may prevent good performance is the high usage of I/O pins, which adds another level of complexity for the place and route tool.

• The faster a path is, the larger the share of its delay that is contributed by the actual logic delay.

• Multiple iterations of synthesis, placement and routing produce slightly different results.

Number of Pipeline Stages       Total Slices Usage    Maximum Delay / Frequency    Delay Distribution Logic % / Routing %
1 input stage                   551 (11%)             10.895 ns / 91.7 MHz         33 / 67
2 input stages                  746 (14%)             9.9 ns / 101 MHz             36 / 64
1 output stage                  603 (11%)             12.174 ns / 82.1 MHz         32 / 68
2 output stages                 630 (12%)             12.644 ns / 79.1 MHz         27 / 73
1 stage at input and output     571 (11%)             8.905 ns / 112.2 MHz         43 / 57
2 stages at input and output    777 (15%)             8.698 ns / 114.9 MHz         45 / 55

Table 5: Placement and Routing Results of Pipelined Sklansky Adder
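The effect of routing on the total delay becomes easier to see when the percentages of Table 5 are converted into absolute delays. The Python sketch below does only that arithmetic; the helper name split_delay is hypothetical, and the input values are copied from Table 5.

```python
# (configuration, maximum delay in ns, logic share in %) copied from Table 5
table5 = [
    ("1 input stage", 10.895, 33),
    ("2 input stages", 9.9, 36),
    ("1 output stage", 12.174, 32),
    ("2 output stages", 12.644, 27),
    ("1 stage at input and output", 8.905, 43),
    ("2 stages at input and output", 8.698, 45),
]


def split_delay(total_ns, logic_pct):
    """Split a total path delay into (logic_ns, routing_ns)."""
    logic_ns = total_ns * logic_pct / 100.0
    return logic_ns, total_ns - logic_ns


breakdown = {name: split_delay(total, pct) for name, total, pct in table5}

for name, (logic_ns, routing_ns) in breakdown.items():
    print(f"{name:30s} logic {logic_ns:5.2f} ns   routing {routing_ns:5.2f} ns")
```

For the output-stage rows, the logic delay falls from about 3.9 ns to about 3.4 ns when the second stage is added, while the routing delay rises from about 8.3 ns to about 9.2 ns, so the total delay gets worse even though the logic between registers is shorter.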

Number of Pipeline Stages       Total Slices Usage    Maximum Delay / Frequency    Delay Distribution Logic % / Routing %
1 input stage                   838 (16%)             11.112 ns / 89.9 MHz         32 / 68
2 input stages                  948 (18%)             10.597 ns / 94.36 MHz        28 / 72
1 output stage                  852 (16%)             8.802 ns / 113.6 MHz         30 / 70
2 output stages                 933 (18%)             9.286 ns / 107.68 MHz        41 / 69
1 stage at input and output     888 (17%)             7.724 ns / 129.4 MHz         43 / 57
2 stages at input and output    1075 (%)              7.612 ns / 131.3 MHz         47 / 53

Table 6: Placement and Routing Results of Pipelined Kogge-Stone Adder

5 Summary and Conclusions

Two different 128-bit parallel prefix adders were designed, analyzed and tested. At the beginning of the design process, it was noted that the required device (XC2V500) could not accommodate the design because of its limited number of available user I/O pins. Two alternatives were considered for the next step of the design: multiplexing the I/O, and hence reducing the overall number of I/Os used, or changing the target device to the XC2V1000. The second alternative was chosen because it required no redesign and introduced no additional complexity.

Because of the nature of the Kogge-Stone prefix network, the resource usage of the Kogge-Stone adder was expected to be greater than that of the Sklansky adder, and the results confirmed this.

It was also observed that multiple synthesis iterations of the same design sometimes produce slightly different placement results in terms of logic resource usage and timing. The reason is that the placement and routing algorithm used by the Xilinx tools is based on randomized initial settings [6], [8], as opposed to Altera [7].

The designs were enhanced by inserting a number of pipeline stages, and the results were analyzed. It turns out that pipelining does not necessarily improve the design speed. The main reason is that in most cases the logic accounts for only approximately 20% to 40% of the delay, while routing accounts for the remaining 80% to 60%, respectively. It is therefore concluded that adding more pipeline stages does not necessarily improve the total delay.

6 References

[1] R. P. Brent and H. T. Kung, "A Regular Layout for Parallel Adders", IEEE Transactions on Computers, Vol. C-31, No. 3, pp. 260-264, March 1982.

[2] J. Sklansky, "Conditional-Sum Addition Logic", IRE Transactions on Electronic Computers, Vol. EC-9, No. 2, pp. 226-231, June 1960.

[3] P. M. Kogge and H. S. Stone, "A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations", IEEE Transactions on Computers, Vol. C-22, No. 8, pp. 786-793, August 1973.

[4] P. H. Bardell, W. H. McAnney and J. Savir, "Built-In Test for VLSI: Pseudorandom Techniques", John Wiley & Sons, New York, 1987.

[5] V. G. Oklobdzija and E. R. Barnes, "Some Optimal Schemes for ALU Implementation in VLSI Technology", Proceedings of the 7th Symposium on Computer Arithmetic ARITH-7, pp. 2-8. Reprinted in Computer Arithmetic, E. E. Swartzlander (editor), Vol. II, pp. 137-142, 1985.

[6] Xilinx Programmable Logic Devices, PLD & FPGA, www.xilinx.com

[7] Synplicity Synplify Pro 7.02 User's Guide, www.synplicity.com

[8] Xilinx ISE 6.2 / 6.3 User's Manual, www.xilinx.com
