project report

of 50

V.L.S.I. I

CARRY LOOK AHEAD ADDER

Submitted by

Daniel Koontz

Kaustubh Kowjalgi

Project requirements

• Maximum pin count 40

• Data output rate > 450 MHz

• 32 bit inputs and output

• Minimize total chip power dissipation

• Minimize total chip area

• Maintain good noise margin

Carry lookahead adder

This adder is a practical design with reduced delay at the price of more complex

hardware. The carry lookahead design can be obtained by a transformation

of the ripple carry design into a design in which the carry logic over fixed groups

of bits of the adder is reduced to two-level logic.

Figure 1 transformation for a 4 bit adder group

Figure 2 4 bit look ahead circuit

The ripple carry adder, although simple in concept, has a long circuit delay due to

the many gates in the carry path from the least significant bit to the most

significant bit. For a typical design, the longest delay path through an n-bit ripple

carry adder is approximately 2n + 2 gate delays. Thus, for a 32-bit ripple carry adder,

the delay is 66 gate delays.

We call the first part of each full adder a partial full adder (PFA). This separation

is shown in Figure 1, which presents a diagram of a PFA and a diagram of four PFAs

connected to the carry path. We have removed the OR gate and one of the AND

gates from each of the full adders to form the ripple carry path.

There are two outputs, Pi and Gi, from each PFA to the ripple carry path and one input Ci, the carry input, from the carry path to each PFA. The function Pi = Ai ⊕ Bi is called the propagate function. Whenever Pi is equal to 1, an incoming carry is propagated through the bit position from Ci to Ci+1. For Pi equal to 0, carry propagation through the bit position is blocked. The function Gi = Ai · Bi is called the generate function. Whenever Gi is equal to 1, the carry output from the position is 1, regardless of the value of Pi, so a carry has been generated in the position. When Gi is 0, a carry is not generated, so that Ci+1 is 0 if the carry propagated through the position from Ci is also 0. The generate and propagate functions correspond exactly to the half adder and are essential in controlling the values in the ripple carry path. Also, as in the full adder, the PFA generates the sum function by the exclusive-OR of the incoming carry Ci and the propagate function Pi. The ripple carry path of the 4-bit adder has a total of eight gates in cascade, so the circuit has a delay of eight gate delays.
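The PFA behaviour described above can be sketched in a few lines of Python. This is only an illustrative bit-level model, not the circuit used in the project; the function names pfa and ripple_add4 are made up for the example, and bit lists are assumed to be ordered least significant bit first.

```python
def pfa(a, b, c):
    """Partial full adder for one bit: returns (sum, propagate, generate)."""
    p = a ^ b      # propagate: Pi = Ai XOR Bi
    g = a & b      # generate:  Gi = Ai AND Bi
    s = p ^ c      # sum:       Si = Pi XOR Ci
    return s, p, g


def ripple_add4(a_bits, b_bits, c0):
    """4-bit adder built from four PFAs and a ripple carry path (LSB first)."""
    carry = c0
    sums = []
    for a, b in zip(a_bits, b_bits):
        s, p, g = pfa(a, b, carry)
        sums.append(s)
        carry = g | (p & carry)   # ripple carry path: Ci+1 = Gi + Pi*Ci
    return sums, carry
```

For instance, ripple_add4([1, 1, 0, 0], [1, 0, 0, 0], 0) adds 3 and 1 and returns the bits of 4 with no carry out.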

Since only AND and OR gates are involved in the carry path, ideally, the delay

for each of the four carry signals produced, C1 through C4, would be just two

gate delays. The basic carry lookahead circuit is simply a circuit in which

functions C1 through C3 have a delay of only two gate delays. The

implementation of C4 is more complicated in order to allow the 4-bit carry

lookahead adder to be extended to multiples of 4 bits, such as 16 bits. The 4-bit

carry lookahead circuit is shown in Figure 1(b). It is designed to directly replace

the ripple carry path in Figure 1(a). Since the logic generating C1 is already two-

level, it remains unchanged. The logic for C2, however, has four levels. So to find

the carry lookahead logic for C2, we must reduce the logic to two levels. The

equation for C2 is found from Figure 1(a), and the distributive

law is applied to obtain C2 = G1 +P1G0 +P1P0C0

We obtain the two-level logic for C3 by finding its equation from the carry path in

Figure 1(a) and applying the distributive law

C3 = G2 +P2G1 +P2P1G0 +P2P1P0C0

The two-level logic with output C3 in Figure 2 implements this function.

C2 = G1 + P1(G0 + P0C0)

   = G1 + P1G0 + P1P0C0

C3 = G2 + P2(G1 + P1(G0 + P0C0))

   = G2 + P2(G1 + P1G0 + P1P0C0)

   = G2 + P2G1 + P2P1G0 + P2P1P0C0
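A quick way to convince oneself that the two-level forms are equivalent to the ripple carry path is to compare them exhaustively. The sketch below does this in Python; the function names are illustrative only, and G and P are treated as free Boolean inputs because the distributive law holds for any values.

```python
from itertools import product


def ripple_carries(g, p, c0):
    """Carries from the ripple path: Ci+1 = Gi + Pi*Ci."""
    c = [c0]
    for gi, pi in zip(g, p):
        c.append(gi | (pi & c[-1]))
    return c   # [C0, C1, C2, C3]


def lookahead_carries(g, p, c0):
    """The same carries written as two-level (AND-OR) logic."""
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = (g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
          | (p[2] & p[1] & p[0] & c0))
    return [c0, c1, c2, c3]


def equations_agree():
    """Check all 2**7 combinations of G0..G2, P0..P2 and C0."""
    for bits in product((0, 1), repeat=7):
        g, p, c0 = bits[0:3], bits[3:6], bits[6]
        if ripple_carries(g, p, c0) != lookahead_carries(g, p, c0):
            return False
    return True
```

Calling equations_agree() confirms that the expanded expressions for C1 through C3 produce the same values as the ripple recursion for every input combination.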

Instead of generating C4, we produce generate and propagate functions that apply to 4-bit groups instead of a single bit to act as the inputs for the group carry lookahead circuit. To propagate a carry from C0 to C4, we need to have all four of the propagate functions equal to 1, giving the group propagate function

P0-3 = P3P2P1P0
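The group propagate function can be paired with a group generate function to recover C4. The report does not write the group generate out; the sketch below assumes the usual textbook form G0-3 = G3 + P3G2 + P3P2G1 + P3P2P1G0, and the function names are illustrative only.

```python
def bit_pg(a, b, width=4):
    """Per-bit propagate and generate lists (index 0 = least significant bit)."""
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]
    return p, g


def group_pg(p, g):
    """Group propagate and group generate for one 4-bit group."""
    group_p = p[3] & p[2] & p[1] & p[0]
    # Assumed standard form of the group generate (not stated in the report).
    group_g = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
               | (p[3] & p[2] & p[1] & g[0]))
    return group_p, group_g


def c4_from_group(p, g, c0):
    """Carry out of the group: C4 = G0-3 + P0-3 * C0."""
    group_p, group_g = group_pg(p, g)
    return group_g | (group_p & c0)
```

With these definitions, c4_from_group(*bit_pg(a, b), c0) matches the carry out of a + b + c0 for every pair of 4-bit operands.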

Assuming that an exclusive OR contributes 2 gate delays, the longest delay in

the 4-bit carry lookahead adder is 6 gate delays, compared with 10 gate delays in

the ripple carry adder. The improvement is very modest and perhaps not worth

all the extra logic.

The carry equations for carry lookahead adder are given by

Figure 2 (b) 16 bit carry look ahead adder

A 16-bit CLA is shown in Figure 2(b). This adder requires two CLC levels and five carry lookahead circuits (CLCs). Note how the second-level CLC uses the group P and G outputs from the four first-level CLCs as inputs and provides the carry outputs C4, C8, and C12. Also, the P and G group outputs from the second-level CLC cover carry generation and propagation for all 16 bits and, by using an OC circuit, can be combined with C0 to produce carry C16. In Figure 2(b), a blue (or gray) path represents one of many longest delay paths through the circuit. Assuming that each passage through a PFA or CLC block requires two gate delays, the delay of this circuit can be estimated as 5 x 2 = 10 gate delays compared to 2 x 16 + 2 = 34 gate delays for the 16-bit ripple carry adder. The performance improves by a factor greater than three.

Based on the results for the 4-bit and 16-bit CLAs, we now attempt to deduce a formula for the maximum delay of an n-bit CLA using L CLC levels. For the 4-bit CLA, there are 6 gate delays for L = 1 and, for the 16-bit CLA, there are 10 gate delays for L = 2. In the 4-bit CLA, there is one pass through a CLC and, in the 16-bit CLA, there are three passes through CLCs. Based on this information, we can conclude that the number of passes through CLCs is 2L – 1 for L levels. Based on the numerical values of 6 and 10 gate delays just presented, the estimated maximum path delay is

2(2L – 1) + 4 = 4L – 2 + 4 = 4L + 2 gate delays
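The two delay estimates can be compared with a short Python sketch. The function names are illustrative. The helper levels_for is an assumption, not a formula from the report: it supposes that every lookahead level groups by four. The adder built later in the report ripples between 4 bit blocks instead, so this estimate applies only to a fully multi-level CLA.

```python
def ripple_delay(n_bits):
    """Longest path of an n-bit ripple carry adder, in gate delays."""
    return 2 * n_bits + 2


def cla_delay(levels):
    """Estimated longest path of a CLA with L lookahead levels: 4L + 2."""
    passes = 2 * levels - 1        # passes through CLC blocks
    return 2 * passes + 4


def levels_for(n_bits, group=4):
    """Assumed number of CLC levels when each level groups by four."""
    levels, covered = 1, group
    while covered < n_bits:
        covered *= group
        levels += 1
    return levels
```

Under these assumptions ripple_delay(16) gives 34 and cla_delay(levels_for(16)) gives 10, matching the figures quoted above.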

Determining the best P/N transistor ratio

Before beginning the transistor level design of the chip, the best βp/βn ratio was determined to give the best average delay and power consumption. The methodology used to find the best ratio was an inverter gate driving an identical inverter. Figure 3 shows an inverter circuit loaded with an identical inverter circuit.

Figure 3 Inverter driving an identical inverter

Hspice, using an optimization algorithm, was used to find the best beta ratio. Figure 4 shows the essential parts of the Hspice simulation script.

Figure 4 Partial Hspice optimization script

The first section of the Hspice script describes the voltage sources, inverter subcircuits, and the inverter pair circuit as shown in Figure 3. Note the N and P variables used in the inverter subcircuit. These are the variables that will be used by the optimization routine. Also note the gate parasitics, which are expressed as a function of the N and P variables.

The next section of the script is the optimization commands themselves. The first part is the parameters. As one can see in the comments, the parameters are defined in order of best guess, minimum, and maximum values respectively. The .model command is the optimization control command. It sets the optimization algorithm’s parameters. Following the optimization section is the transient command. This command performs a transient sweep for a given time range. It also refers back to the optimization model and sets the goal variables for the optimization. The last section of the script is the measurement command that produces the variable for the optimization. Note the goal value at the end of the measurement command. This is the value that the optimizer will try to meet.

The optimization script was run using the following test environment.

• TSMC 0.35u technology
• Temperature 25C
• Voltage 3.3V
• Input waveform of 5ns period, 100ps rise/fall time, 2.8ns pulse width

Hspice was run to optimize the average delay (tav) of the driving inverter. The average delay of the inverter is given by Equation 1.

tav = (tpdf + tpdr) / 2

where tpdf is the high-to-low delay time and tpdr is the low-to-high delay time

Equation 1 Average delay calculation

The lowest average delay occurred when the widths of the P and N transistors were 13.67 and 9.7 lambda respectively. The length (L) of the transistors remained constant at 2 lambda. Multiplying by lambda gives the transistor sizes Wp/Lp = 2.7u/0.4u and Wn/Ln = 1.9u/0.4u. Figure 5 shows the resultant timing waveform for an optimization of the average delay. tpdf is 55.3ps and tpdr is 76.2ps. Applying Equation 1 gives 65.7ps. The simulation time was 4.5ns. The average power over the simulation was 114uW.
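The arithmetic in this paragraph can be reproduced with a short Python sketch. The value of lambda is not stated explicitly in the report; 0.2u is an assumption inferred from L = 2 lambda = 0.4u, and the function names are illustrative.

```python
LAMBDA_UM = 0.2   # assumption: lambda in microns, inferred from 2 lambda = 0.4u


def average_delay(tpdf_ps, tpdr_ps):
    """Equation 1: mean of the high-to-low and low-to-high delays."""
    return (tpdf_ps + tpdr_ps) / 2


def lambda_to_um(size_lambda, decimals=1):
    """Convert a size in lambda to microns, rounded to the given precision."""
    return round(size_lambda * LAMBDA_UM, decimals)
```

With the measured delays, average_delay(55.3, 76.2) evaluates to 65.75ps, which the report quotes as 65.7ps. lambda_to_um(13.67) and lambda_to_um(9.7) give the 2.7u and 1.9u widths, and lambda_to_um(2) gives the 0.4u length.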

Figure 5 Timing waveform of an optimized inverter

There is one issue with the optimized sizes. The N and P values are not whole numbers. This is a problem in the TSMC 0.35u technology because they give dimensions in microns with more than one decimal place. To avoid this problem, and to improve power consumption by using smaller transistors, it was decided to round the N and P values down. Now the P value is 13 and the N value is 9. Figure 6 shows the resulting waveform.

Figure 6 Timing waveform of an inverter with P=13 and N=9 lambda

Using Equation 1 again gives an average delay of 66.2ps. The average power over the simulation time of 4.5ns was 89.4uW. This savings in power, from 114uW to 89.4uW, is worth the tradeoff of 0.5ps. The final sizes of the transistors were Wp/Lp = 2.6u/0.4u and Wn/Ln = 1.8u/0.4u. These sizes would be the basis for the sizing of the gates for the rest of the design.

The Logic Library

As seen before in the design architecture, the 32 bit carry look ahead (CLA) adder is built around a 4 bit CLA adder. This building block is where the CLA logic is contained. The schematic of the 4 bit carry look ahead adder is shown in Figure 7.

Figure 7 4 bit carry look ahead adder (CLA)

The adder is made up mostly of standard logic cells, with the exception of the carry generation logic of C4 and C3. These cells are built with domino logic. The list of cells is

AND02 -- 2 input AND logic gate
OR02 -- 2 input OR logic gate
AND02_ii -- 2 input AND logic gate with one inverted input
AND03 -- 3 input AND logic gate
OR03 -- 3 input OR logic gate
XOR02 -- 2 input exclusive OR logic gate
carry_unit_1 -- carry generator of C4
carry_unit_2 -- carry generator of C3

Before the system simulation and layout can begin, it is necessary to simulate and lay out the individual logic cells. The simulations were performed under the following conditions

• TSMC 0.35u technology
• Temperature 25C
• Voltage 3.3V
• Input waveform of 2ns period, 50ps rise/fall time, 900ps pulse width

AND02

As stated earlier, the initial transistor sizes came from the optimization of the transistors of the loaded inverter. Figure 8 shows the schematic of the AND02.

Figure 8 AND02 schematic

The simulation was done with the AND gate loaded with a similar AND gate. Using the schematic netlist resulted in the waveform in Figure 9.

Figure 9 Timing waveform of schematic simulation

From the waveforms, the AND gate has an input rising propagation delay of 109ps and an input falling propagation delay of 67.7ps. The large inequality in delays comes from the sizing of the N transistors. Since they are in series in the AND gate, they need to be sized larger than those in the INV gate. The schematic leads to the layout of the circuit shown in Figure 10.

Figure 10 Layout of AND02 gate

Once the layout was finished, it was extracted with Calibre. A successful extraction cannot be completed without a successful LVS. Figure 11 shows an abridged LVS report showing the cell name and the successful “smiley face”.

Figure 11 Successful LVS of AND02 gate

Simulating with the extracted parasitics yielded the waveforms in Figure 12.

Figure 12 Timing waveform of extracted parasitics

The waveforms show a degradation of the delays in the AND gate. This is expected since the layout extraction takes into consideration interconnect and accurate gate and drain/source measurements. The gate delay is still acceptable and the output waveform is still fairly sharp, so the gates remain unchanged to keep power consumption to a minimum.

OR02

The next logic gate in the design was the OR02 gate. Design methods similar to those of the AND02 were used in the design of the OR02. Figure 13 shows the schematic of the OR02 gate.

Figure 13 OR02 schematic

The simulations were done with an identical OR02 gate loading the first gate. The node N1 is the output of the loaded OR02 gate. Delay measurements were made from the input of the OR02 gate to the N1 node. Figure 14 shows the waveforms of the schematic netlist.

Figure 14 Timing waveform of OR02 gate schematic

From the waveforms, the OR gate has an input rising propagation delay of 98.3ps and an input falling propagation delay of 160ps. The large inequality in delays comes from the sizing of the P transistors.

Figure 15 OR02 gate layout

This makes sense in that the gate is partially built with two series P transistors. This means they need to be quite large compared to the N transistors to achieve a more equal delay. Using the schematics produces the layout in Figure 15. Calibre produced the LVS report in Figure 16 during parasitic extraction.

Figure 16 OR02 gate LVS report

Simulating with the extracted parasitics yielded the timing results in Figure 17.

Figure 17 Timing waveform of OR02 extracted parasitics

As in the AND02 gate, the parasitics increase the delay of the gate. However, the output waveform had good rise and fall times. The delay of the gate was acceptable.

AND02_ii

The AND02_ii is the same logic gate as the AND02 except the B input of the gate is inverted. Figure 18 shows the schematic of the AND02_ii.

Figure 18 AND02_ii schematic

The same simulation procedures were applied to the AND02_ii. Figure 19 shows the waveforms and measurements of the propagation delay of the gate. The node N1 is the connecting node between the driving and load gates. As one would expect, the propagation delays through the AND02_ii gate from the B input to N1 are greater than those of the AND02. This comes from the delay of the inverter on the B input.

Figure 19 Timing waveform of AND02_ii gate schematic

The physical implementation of the AND02_ii gate is shown in Figure 20.

Figure 20 AND02_ii layout

Following the layout, the parasitics were extracted. Again, a clean LVS of the layout is necessary for the extracted parasitics to be valid. Figure 21 shows the shortened LVS report for the Calibre extraction.

Figure 21 AND02_ii gate LVS report

Figure 22 shows the timing waveforms from simulating the gate again under the same conditions as the schematic run.

Figure 22 Timing waveform of AND02_ii with extracted parasitics

The delays grow after layout as expected. The edges of the output waveform are becoming less sharp due to the increase in the rise and fall times. This means the transistors will need to be resized if the load increases slightly beyond the load of the gate here.

AND03

The next gate in the library is the AND03. As its name implies, the gate is a 3 input AND gate. Figure 23 shows the schematic of the AND03 gate.

Figure 23 AND03 schematic

The schematic shows the sizing of the P transistor to be 13 lambdas. This matches the sizing given by the optimization simulations of the inverter. The N transistor sizing is greater than 9 lambdas given by the same optimization. It’s also greater than the P transistor sizes. This 25% increase in sizing is due to the fact that there are three of the transistors in series. Simulating the AND03 schematic gives the waveform results shown in Figure 24.

Figure 24 AND03 schematic timing results

Figure 25 AND03 layout

In the schematic simulation the loaded node between the AND03 gates is Y. The waveform diagram shows a rising delay of 130ps and a falling delay of 65ps. This difference in delays comes from the series N transistors. It is made worse by the fact that the N transistor connected to the C input is the transistor nearest ground. It needs to work very hard to discharge the capacitances of the nodes between the rest of the N transistors as well as the output Y. If these delays affect further timing in the system, the N transistors will need to be increased in size to bring the delay down.

The layout of the AND03 is shown in Figure 25. One can see that the drain/source area of the series N transistors is reduced by the elimination of unnecessary contacts. This will have a positive effect on the extracted parasitics. The next step in the AND03 design was extracting layout parasitics and simulating with the parasitics netlist. The successful LVS report is shown in Figure 26, followed by the timing waveforms of the parasitic simulation in Figure 27.

Figure 26 AND03 LVS report

The parasitics results are similar to the schematic results in that there is the same disparity in the rising and falling delay times. It appears that the rising delay was reduced somewhat in the layout by the source/drain sharing mentioned earlier.

Figure 27 Waveform of parasitic simulation

OR03

The next standard logic gate in the design is the OR03. This gate is a three input OR gate. Figure 28 shows the schematic of the gate.

Figure 28 OR03 schematic

The major challenge with this gate is the series P transistors. Not only is there the added series resistance, but also the lower mobility of the P transistor itself. With that in mind, you can see from the schematic the large size of the P transistors that make up the NOR part of the gate. The waveforms of the OR03 gate are shown in Figure 29.

Figure 29 Timing results of OR03 schematic simulation

The timing results for the OR03 are 166ps for the rising transition delay and 203ps for the falling transition delay. The large delay on the falling transition is due to the series P transistors.

Figure 30 OR03 layout

The lower mobility of the P transistors and the fact that they are connected in series are the main reasons for the delay. The output waveform has good transition times, so with that in mind the power consumption of the gate takes priority over its delay. The gate is used in only one place, where it is responsible for the generation of the carry 2 bit. The larger delay did not affect the calculation of the 2nd sum bit. The layout of the circuit is shown in Figure 30. One can easily see the 3 large P transistors that make up the NOR03 part of the gate. By sharing source/drains on these transistors, the parasitics will be reduced. Figure 31 shows the successful LVS report for the parasitics extraction.

Figure 31 LVS report of parasitic extraction for OR03 gate

As with the other cells in the library, the extracted netlist was used to generate another set of timing waveforms. Figure 32 shows the timing waveform of the OR03 circuit with the layout parasitics considered. The delay increased as expected, but the delay did not impact the overall operation of the 4 bit adder.

Figure 32 Timing results of OR03 parasitic simulation

XOR

The final combinational block is the XOR02 gate. Figure 34 shows the schematic of this 2 input XOR gate.

Figure 34 XOR schematic

Using the schematic netlist, the Hspice simulation gave the timing results shown in Figure 35. The input waveforms exercise the truth table for an XOR gate.

Figure 35 Timing waveforms from XOR schematic simulation

Node y is the output node of the driving XOR gate. The rise and fall delays are 174ps and 205ps respectively. As with the other gates the delay is acceptable with respect to conserving power. Figure 36 shows the completed layout of the XOR gate.

Figure 36 XOR gate layout

The layout shows a great amount of drain/source sharing. Even transistors of different sizes can share drain/source regions. This design passed all DRC rules. Once the layout was finished, the parasitic extraction was run. The clean LVS report is shown in Figure 37.

Figure 37 Successful XOR02 LVS report

Taking the extracted netlist and simulating with the same parameters as in the schematic simulations gave the timing waveforms in Figure 38.

Figure 38 Waveforms from XOR parasitic simulation

The gate delays increased significantly after layout. However, the adder timing still meets requirements. Again, to maintain low power consumption, the transistors remain at the sizes in the schematic.

The Case for Domino Logic

As mentioned in the architecture section, the carry logic in the 4bit CLA involves a large number of terms. Equation 2 shows the logic expression for the carry (C4) of the 4bit CLA.

C4 = G4 + P4·G3 + P4·P3·G2 + P4·P3·P2·G1 + P4·P3·P2·P1·Cin

Equation 2
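Equation 2 follows a regular pattern, and the pattern makes the fan-in growth easy to see. The Python sketch below generates the product terms for an n bit lookahead carry using the 1-based indexing of Equation 2. The function names are illustrative and are not part of the design flow.

```python
def carry_terms(n):
    """Product terms of the n-bit lookahead carry, indexed 1..n as in Equation 2."""
    terms = []
    for k in range(n, 0, -1):
        props = [f"P{i}" for i in range(n, k, -1)]
        terms.append(props + [f"G{k}"])
    # Last term: every propagate signal ANDed with the carry input.
    terms.append([f"P{i}" for i in range(n, 0, -1)] + ["Cin"])
    return terms


def carry_expression(n):
    """Render the carry as a sum-of-products string."""
    return " + ".join("·".join(term) for term in carry_terms(n))


def max_fan_in(n):
    """Return (inputs of the widest AND term, inputs of the final OR)."""
    terms = carry_terms(n)
    return max(len(term) for term in terms), len(terms)
```

carry_expression(4) reproduces Equation 2 term for term, and max_fan_in(4) reports a 5 input AND feeding a 5 input OR.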

The above logic equation will require some large fan-in gates. Figure 39 shows the schematic of the carry generation logic using the high fan-in gates. From the earlier gate simulations of the 3 input AND gate and 3 input OR gate, the delays of these large fan-in gates can become quite large. What’s worse is the fact that the Cin signal will have one of the largest delays since it is part of the 5 term AND function and 5 term OR function. While each 4bit adder uses CLA logic, the 32 bit adder uses a ripple carry between the 4 bit adders. Equation 3 shows the budget for the carry generator for each 4 bit adder.

Carry generator time budget = chip cycle time / (32 bit adder / 4 bit adder)

Equation 3
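Equation 3 can be written as a one-line calculation. The sketch below is illustrative. The function name is made up, and the 500 MHz value in the example call is the clock frequency the report states for the chip.

```python
def carry_budget_ps(clock_mhz, total_bits=32, block_bits=4):
    """Equation 3: cycle time divided by the number of 4 bit adder blocks."""
    cycle_ps = 1e6 / clock_mhz          # 1 / MHz expressed in picoseconds
    blocks = total_bits // block_bits   # 32 / 4 = 8 carry generators in series
    return cycle_ps / blocks
```

carry_budget_ps(500) evaluates to 250.0, meaning each of the eight carry generators may use at most 250ps of the 2ns cycle.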

Figure 39 Carry lookahead logic generator

The cycle time of the chip is 2ns since the clock frequency is 500 MHz. This means that the time budget for each carry generator is 250ps. As we can see, this stringent requirement will be impossible to meet without using very large gates. Simulations of the network within the 4bit adder showed that the carry generator was too slow. To solve the speed and power issues, domino logic was used for the C4 carry term as well as the C3 carry. Equation 4 shows the C3 carry expression.

C3 = G3 + P3·G2 + P3·P2·G1 + P3·P2·P1·Cin

Equation 4

While the C3 carry has a larger timing budget than C4 because it stays within the 4bit adder logic, it was decided to implement it in domino logic as well, since the precharge clock had already been introduced for the C4 domino logic. One disadvantage of the domino logic was the introduction of a clock signal that is used to precharge the output node. Another disadvantage was determining the timing of the precharge clock to ensure proper function in the larger chip. Figure 40 and Figure 41 show the schematics of the C4 carry generator and C3 carry generator [1].

Figure 40 C4 carry generator schematic

Figure 41 C3 carry generator schematic

For both carry generators, the individual transistor sizes are 20 lambda for the N transistor pull-down network. The size of the precharge P transistor is 14 lambda. This is much smaller than the P transistors that would be required for standard logic. The output inverter for the C3 generator is P=14 and N=10, which correlates well with the optimized sizes. The output inverter of C4 is double the size of the C3 inverter so that it can drive the carry signal to other adder units.

The timing of the domino logic is different from that of static logic. First, there is a precharge clock. This clock turns on the P transistor to charge the dynamic node to VDD. When the clock turns off the P transistor, the dynamic node will change according to the state of the evaluation N transistor network. The evaluation must be completed before the next precharge cycle. Since the chip clock has a 2ns period, the dynamic logic clock has the same period. However, the phase of the dynamic clock is shifted 500ps so that the evaluation phase of the clock is centered on the positive edge of the system clock. This gives the most room for error in evaluation, since the output of the adder will be captured on the positive edge of the system clock.

Figure 42 shows the best case timing of the C4 carry logic. The simulation was run with a capacitive load of 50fF on the output of the carry generator. The top signal, sysclk, is as its name implies the system clock with a 2ns period. The next signal is clk, which is the clock for the dynamic logic. The figure shows how this clock is shifted 500ps in phase from the system clock. Also notice how the C4 output fluctuates with this clock. This is where care must be taken with the timing of the domino logic. This fluctuation can cause errors in the logic downstream and is why the evaluation phase of this clock is centered on the rising edge of the system clock.

The use of domino logic was confined to the carry generators to control power consumption and keep the timing as simple as possible. The timing shown is for the best case situation in which the single G4 term causes the generation of a carry. The C4 term goes high within 60.7ps of the assertion of G4. C4 then goes low when the clk signal goes low to precharge. Figure 43 shows the timing waveforms for the worst case delay.

Figure 42 Best case timing of the C4 carry generator schematic
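The precharge and evaluate behaviour described above can be pictured with a small behavioural model. The Python sketch below is a logic-level illustration only and carries no timing information. The class and helper names are made up for the example, and the pull-down function is simply Equation 2.

```python
NAMES = ("G1", "G2", "G3", "G4", "P1", "P2", "P3", "P4", "Cin")


def make_inputs(**high):
    """Build an input dictionary with every signal low unless named."""
    values = dict.fromkeys(NAMES, 0)
    values.update(high)
    return values


def c4_pulldown(x):
    """True when the N transistor network conducts (Equation 2)."""
    return (x["G4"] | (x["P4"] & x["G3"]) | (x["P4"] & x["P3"] & x["G2"])
            | (x["P4"] & x["P3"] & x["P2"] & x["G1"])
            | (x["P4"] & x["P3"] & x["P2"] & x["P1"] & x["Cin"]))


class DominoGate:
    """Dynamic node with a precharge P transistor and an output inverter."""

    def __init__(self, pulldown):
        self.pulldown = pulldown
        self.node = 1                 # dynamic node starts precharged

    def step(self, clk, inputs):
        if clk == 0:
            self.node = 1             # precharge: node pulled to VDD
        elif self.pulldown(inputs):
            self.node = 0             # evaluate: node discharges and stays low
        return 1 - self.node          # output inverter drives the carry
```

In this model the output is always low during precharge, which corresponds to the C4 fluctuation with clk noted above. Once the node has discharged it cannot recover until the next precharge.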

In the worst case scenario, the generation of the carry occurs when Cin, P1, P2, P3 and P4 go high. Looking back at the schematic shows the series path of N transistors that causes the dynamic node to be discharged and the C4 output to go high. This transition time is 231ps, which is very close to the budgeted time of 250ps. This delay may indicate a need to increase the size of the N network transistors.

Figure 43 Worst case timing of the C4 carry generator schematic

Before sizing was considered, the layout of the carry generator was completed. This layout is shown in Figure 44. The layout is built with as much drain/source sharing as possible. The layout also allows resizing of transistors without much adjustment. Once the layout was finished, a parasitic extraction was performed. Figure 45 shows the successful LVS report from the extraction.

Figure 44 C4 Carry generator logic layout

Figure 45 C4 carry generator successful LVS report

The next step was a set of simulations based on the same methodology as the schematic simulations. Figure 46 shows the waveforms for the best case delay of the carry generator. While the delay has increased over the schematic, it is still within the timing budget of 250ps. The worst case timing is shown in Figure 47. The delay has greatly increased over the schematic netlist.

Figure 46 Best case timing of the C4 carry generator layout

Figure 47 Worst case timing of the C4 carry generator layout

The delay through the generator exceeds the budget of 250ps. This means that the transistors will need to be resized in order for the carry to propagate through the adder within the 2ns clock period. The layout and simulations of the C3 carry generator follow the same methods as the C4 carry generator. Figures 48a and 48b show the best and worst case timing waveforms for the schematic simulation of the C3 generator.

Figure 48a Best case timing of the C3 carry generator schematic

Figure 48b Worst case timing of the C3 carry generator schematic

While the delay in both cases is acceptable, both waveforms show poor rise and fall times. The output transistors may need resizing to improve these transition times when the C3 generator is in the adder circuit. The layout of the C3 generator is similar to that of the C4. Figure 47 shows the C3 generator layout. Figure 48 shows the LVS report from the successful parasitic extraction of the layout.

Figure 47 C3 Carry generator layout

Figure 48 C3 carry generator successful LVS report

The same simulations were done for the extracted parasitic layout as were done for the schematics. Figures 49a and 49b are the results for the best and worst case carry generation.

Figure 49a Best case timing of the C3 carry generator parasitic netlist

Figure 49b Worst case timing of the C3 carry generator parasitic netlist

As expected, the delay times are worse. The transition time on the C3 output of the extracted layout during the worst case logic state is extremely poor. The delays have caused a reduction in the pulse width, which may cause a logic error. More accurate timing results will come from the entire adder simulations.

Design of the 4Bit Carry Lookahead Adder

The schematic of the 4bit adder is shown again in Figure 50 for easy reference.

Figure 50 4 bit Carry lookahead adder

The four bit adder is the basic building block of the design. On the left, the A and B operands enter the AND and OR logic, which generates the G and P signals. These signals, together with the carry in signal Cin, feed the carry logic. After this block was finished, Hspice simulations were run using an additional feature of Hspice known as a digital vector file. Figure 51 shows a partial example of a digital vector file.

Figure 51 Partial digital vector file input to Hspice
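The expected sum and carry columns of such a stimulus file can be computed ahead of time. The Python sketch below shows one way to do so. The row layout it prints is purely illustrative and is not Hspice vector file syntax, and the function names are hypothetical.

```python
def expected_rows(cases, width=4):
    """For each (a, b, cin) case, compute the expected sum and carry out."""
    mask = (1 << width) - 1
    rows = []
    for a, b, cin in cases:
        total = a + b + cin
        rows.append((a, b, cin, total & mask, total >> width))
    return rows


def format_row(row, width=4):
    """Render one row as binary fields: A B CIN S COUT (illustrative layout)."""
    a, b, cin, s, cout = row
    return f"{a:0{width}b} {b:0{width}b} {cin} {s:0{width}b} {cout}"
```

For example, the case (15, 1, 0) yields a sum of 0 with a carry out of 1, which is the kind of carry-generating addition exercised in the simulation.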

Figure 52 Timing results of the 4bit adder schematic simulation

Using this file to generate the A[4:1] and B[4:1] operands and the CIN signal greatly simplified the Hspice simulations. Having Hspice do the comparison of the expected sum S[4:1] and COUT made troubleshooting the circuit much quicker and more accurate. With the vector file in place, a simulation of the schematic was done exercising simple addition in the adder along with an addition generating a carry. A 50fF load was applied to the outputs. Figure 52 shows the timing waveforms for this simulation. Knowing the critical timing needed for the C4 carry signal, a measurement was made to see how long it takes the carry to go high after the precharge clock goes into its evaluation phase. This time is 196ps, which is under the budget of 250ps. As seen in other parts of the waveform, the addition function is working as expected. As with all of the cells that make up the 4 bit adder, the layout was then assembled. Figure 53 shows the complete layout for the entire 4 bit adder.

Figure 53 4 bit adder layout.

The layout follows a bit slice approach in that the A and B data flow downward vertically until they reach the XOR. The carry generators are shown in the lower right part of the figure. The layout was extracted for parasitics and completed successfully. Figure 54 shows the successful LVS report. The parasitic netlist was simulated using the same settings as the schematic netlist. The resulting timing waveforms are shown in Figure 55.

Figure 54 Successful LVS report for 4bit adder during parasitic extraction

Figure 55 Timing results of the 4bit adder parasitic simulation

The parasitics, as expected, have increased the delay from rising edge of the domino logic clock to the assertion of carry C4 signal. Some transistor sizing will need to be done to bring the C4 signal back under the 250ps budget.

Design of the 32 Bit Carry Lookahead (CLA) Adder

Before any layout began, the schematic of the full CLA was assembled. Figure 56a shows the entire 32 bit CLA schematic while Figure 56b shows a zoomed view of a single 4 bit unit.

Figure 56a 32 Bit Carry Lookahead schematic

Figure 56b Zoom of one 4bit CLA unit within the 32 bit CLA schematic
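Functionally, the assembled adder chains eight 4 bit CLA units, with the carry out of each unit feeding the carry in of the next, as stated in the domino logic section. The Python sketch below models only that arithmetic structure. The flip flops, clocks and pad logic are not modelled. The function names are illustrative, and bits are indexed from 0 here rather than from 1.

```python
def cla4(a, b, cin):
    """One 4 bit CLA unit: returns (4 bit sum, carry out)."""
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(4)]
    # Two-level lookahead carries into each bit position.
    c = [cin,
         g[0] | (p[0] & cin),
         g[1] | (p[1] & g[0]) | (p[1] & p[0] & cin),
         g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
         | (p[2] & p[1] & p[0] & cin)]
    cout = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
            | (p[3] & p[2] & p[1] & g[0])
            | (p[3] & p[2] & p[1] & p[0] & cin))
    s = sum((p[i] ^ c[i]) << i for i in range(4))
    return s, cout


def add32(a, b, cin=0):
    """32 bit adder: eight cla4 units with the carry rippling between units."""
    result, carry = 0, cin
    for block in range(8):
        nib_a = (a >> (4 * block)) & 0xF
        nib_b = (b >> (4 * block)) & 0xF
        s, carry = cla4(nib_a, nib_b, carry)
        result |= s << (4 * block)
    return result, carry
```

add32(0xFFFFFFFF, 1) returns (0, 1): the carry passes through all eight units, which is the path the 250ps per-unit budget is meant to cover.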

The additional logic blocks around the 4bit adder are D flip flops. Due to the limit of 40 pins on the 32bit CLA chip, it was necessary to bring the 32 bit operands into the adder sequentially using these flip flops.

The theory of operation is that the A[32:1] operands are clocked into the first set of flops in the first clock cycle. On the next clock cycle the B[32:1] operands are clocked in. The cin follows the same clocking to stay in sync with the A and B operands. After two clock cycles the adder has both of its operands and the addition can begin. Here is where the 2ns cycle time becomes important. The third clock captures the sum S[32:1] and Cout, so all data must be valid at the third clock. One can see that the sum outputs are also clocked. This provides the timing required to receive the valid output. Also note that as the operands are being loaded in the first clock cycle, the outputs of the adder will be invalid until both A and B reach the adder inputs. Handling this requires an enable signal. This enable signal is used on the bidirectional output pads to set the I/O pad as an input or an output.

The D Flip Flop

The heart of the sequential design is the D flip flop. Figure 57 shows the design schematic of a D flip flop that uses a transmission gate approach. This design was selected over a NAND or NOR gate implementation to reduce delay and power consumption.

Figure 57 D Flip Flop schematic

The transistors were sized according to the inverter optimizations. One must note the addition of the clock which will control the data flow through the flip flop. The inverted phase needed by the transmission gate is generated locally in each flop. Figure 58 shows the simulation waveform for the flip flop schematic.
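The behavior of a transmission gate flip flop can be sketched in software. The model below assumes a positive edge triggered master-slave arrangement, which is a common way to build such a flop but is not confirmed by the text for the circuit in Figure 57; the class and attribute names are hypothetical, and no delays or transistor sizes are modeled.

```python
class TGFlipFlop:
    """Behavioral sketch of a master-slave transmission gate D flip flop."""

    def __init__(self):
        self.master = 0  # value held by the master latch
        self.slave = 0   # value held by the slave latch (the Q output)

    def step(self, d, clk):
        """Evaluate the flop for one clock level and return Q."""
        clk_b = 1 - clk  # inverted clock phase, generated locally
        if clk_b:
            # Clock low: master gate conducts and tracks D; slave holds.
            self.master = d
        if clk:
            # Clock high: master holds; slave gate conducts and passes it to Q.
            self.slave = self.master
        return self.slave


ff = TGFlipFlop()
for d, clk in [(1, 0), (0, 1), (0, 0), (0, 1)]:
    print(d, clk, ff.step(d, clk))
```

The value present on D while the clock is low appears on Q only after the clock rises, and later changes on D while the clock is high are ignored.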

Figure 58 Timing waveform of D flip flop schematic simulation

The delays through the flop are acceptable as long as the setup times of flops downstream are not violated. Figure 59 shows the layout of the complete flop design.

Figure 59 D flip flop layout

With the layout completed, a parasitic extraction was performed for the flip flop. Figure 60 shows the LVS report from the parasitic extraction. The transistor size errors occurred because the flip flop schematic was not updated to reflect the new sizes in the layout.

Figure 60 LVS report for D flip flop during parasitic extraction

Figure 61 Timing waveform of D flip flop parasitic simulation

As expected, the delays of the flop are worse in the simulation from the extracted layout than in the schematic simulation. This may be a sign that the transmission gates will need to be resized.

32 Bit Carry Lookahead (CLA) Adder Simulations

Using the schematic in Figure 56a, an Hspice simulation was set up for the entire 32 bit adder. With the number of inputs involved, the digital vector file mentioned before became an invaluable tool for these simulations. Figure 62 shows the vector file for the Hspice simulation.

Figure 62 Digital vector file for 32 bit adder

The X values in the vector file under the S sum output are don’t care values. They are set this way because of the invalid data coming from the adder while the operands are being loaded, as mentioned before. Once the adder is placed in the chip the don’t care values will go away, because the pads are bidirectional at that point and the invalid data occurs only while the pads are in input mode for A and B. Thus the erroneous output will not be captured.

Simulations of the 32 bit adder are not available at this time due to disk space and time limits. Errors in the sequential timing are currently being investigated. However, Figure 63 shows the current layout of the 32 bit adder.
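The don’t care convention used for the S outputs can be illustrated with a short comparison routine. This is a generic sketch, not the Hspice vector file syntax: the function name and the use of plain bit strings are assumptions made only for illustration.

```python
def matches_expected(expected: str, actual: str) -> bool:
    """Compare two bit strings; an 'X' in `expected` matches any actual bit."""
    if len(expected) != len(actual):
        return False
    return all(e in "Xx" or e == a for e, a in zip(expected, actual))


# While operands are loading, the sum is masked with X and any value passes.
print(matches_expected("XXXX", "1010"))
# Once the output is valid, every specified bit must agree.
print(matches_expected("10X1", "1101"))
```

Masking the sum during the load cycles keeps the invalid adder output from being reported as a mismatch.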

Figure 63 32 bit CLA adder layout

Full Chip

Before examining the layout in its entirety, the full chip schematics, including pads, were captured in Mentor Design Architect. Figures 64a and 64b show the zoomed-in chip schematic and the entire full chip schematic with the large 32 bit adder block instantiation.

Figure 64a Full Chip schematic zoomed view

Figure 64b Full Chip schematic

Results

Total average chip power consumption from 0 to 50 ns = 91 mW (based on schematics)

Pin count = 39 pins

• 32 bits for operands

• 1 bit for Cin

• 1 bit for Cout

• 1 bit for system clock

• 1 bit for domino logic clock

• 1 bit for I/O enable

• 1 bit for Vdd

• 1 bit for Vss

Total chip area = 1500 um x 1500 um

Data output rate = 1.39 ns (worst path delay from Cin to Cout without sequential elements)

Core area = 443 um x 347 um
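The reported figures can be checked against the project requirements with a few lines of arithmetic. The sketch below makes a simplifying assumption that is not stated in the report: it treats the 1.39 ns worst path delay as the minimum clock period, ignoring flip flop overhead and the multi-cycle operand loading. All variable names are hypothetical.

```python
# Pin budget, using the pin breakdown from the results list.
pins = {
    "operands": 32, "cin": 1, "cout": 1, "system_clock": 1,
    "domino_clock": 1, "io_enable": 1, "vdd": 1, "vss": 1,
}
total_pins = sum(pins.values())
MAX_PINS = 40  # project requirement

# Equivalent frequency if the worst path delay alone set the clock period.
worst_path_ns = 1.39
max_freq_mhz = 1e3 / worst_path_ns  # 1 / ns = GHz, so 1e3 / ns = MHz
REQUIRED_MHZ = 450  # project requirement

# Fraction of the die occupied by the adder core.
core_fraction = (443 * 347) / (1500 * 1500)

print(f"pins used: {total_pins} of {MAX_PINS}")
print(f"combinational limit: {max_freq_mhz:.0f} MHz (required > {REQUIRED_MHZ})")
print(f"core / chip area: {core_fraction:.1%}")
```

Under these assumptions the pin count fits the 40 pin limit and the combinational path leaves margin over the 450 MHz target, while the core occupies only a small part of the pad-limited die.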

Bibliography

1. Weste, N. and K. Eshraghian. Principles of CMOS VLSI Design.

2. Mano, M. M. and C. R. Kime. Logic and Computer Design Fundamentals, 4th ed. Upper Saddle River, NJ: Pearson Prentice Hall, 2008.

3. Weste, N., D. Harris and A. Banerjee. CMOS VLSI Design.

4. http://tams-www.informatik.uni-hamburg.de
