Multiplication is Basically a Shift Add Operation

Upload: seshu-babu

Post on 07-Apr-2018


  • 8/3/2019 Multiplication is Basically a Shift Add Operation


    Multiplication is basically a shift add operation. There are, however, many variations on how to do it. Some are more suitable for FPGA use than others. This page is a brief tutorial on multiplication hardware. The hyperlinked items in this list are currently in the text. The remaining items will be added in a future release of this page.

    Scaling Accumulator Multipliers

    Serial by Parallel Booth Multipliers

    Ripple Carry Array Multipliers

    Row Adder Tree Multipliers

    Carry Save Array Multipliers

    Look-Up Table Multipliers

    Partial Product LUT Multipliers

    Computed Partial Product Multipliers

    Constant Multipliers from Adders

    KCM multipliers

    Limited Set LUT Multipliers

    Wallace Trees

    Booth Recoding

    Negative inputs

    Scaling Accumulator Multipliers

    Parallel by serial algorithm

    Iterative shift add routine

    N clock cycles to complete

    Very compact design

    Serial input can be MSB or LSB first depending on direction of shift in accumulator

    Parallel output

    A scaling accumulator multiplier performs multiplication using an iterative shift-add routine. One input is presented in bit parallel form while the other is in bit serial form. Each bit in the serial input multiplies the parallel input by either 0 or 1. The parallel input is held constant while each bit of the serial input is presented. Note that the one bit multiplication either passes the parallel input unchanged or substitutes zero. The result from each bit is added to an accumulated sum. That sum is shifted one bit before the result of the next bit multiplication is added to it.


    1      1011001
    0     0000000
    1    1011001
    1 + 1011001
       -----------
       10010000101
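    The iterative routine described above can be sketched in software. The Python model below is a behavioral sketch only, not a hardware description; the function names and the list-of-bits representation of the serial input are assumptions made for illustration. It shows both shift directions: an MSB-first serial input with a left-shifting accumulator, and an LSB-first serial input with a right-shifting accumulator.

```python
def multiply_msb_first(parallel, serial_bits):
    """Scaling accumulator, serial input presented MSB first.

    The accumulator is shifted left one bit before each 1 x N
    partial product (the parallel input or zero) is added.
    """
    acc = 0
    for bit in serial_bits:
        acc <<= 1                       # scale the running sum
        acc += parallel if bit else 0   # 1-bit multiply: pass input or zero
    return acc


def multiply_lsb_first(parallel, serial_bits):
    """Scaling accumulator, serial input presented LSB first.

    Each partial product is added at the top of the accumulator,
    which is shifted right one bit between additions.
    """
    n = len(serial_bits)
    acc = 0
    for bit in serial_bits:
        acc >>= 1                                   # shift right first...
        acc += (parallel << (n - 1)) if bit else 0  # ...then add at the top
    return acc


# 1011001 (89) times 1101 (13), the example worked above
print(bin(multiply_msb_first(0b1011001, [1, 1, 0, 1])))
print(bin(multiply_lsb_first(0b1011001, [1, 0, 1, 1])))
```

    Both calls print 0b10010000101 (1157). Each loop iteration corresponds to one clock cycle, so an N bit serial input takes N iterations.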

    Serial by Parallel Booth Multipliers

    Bit serial adds eliminate need for carry chain

    Well suited for FPGAs without fast carry logic

    Serial input LSB first

    Serial output

    Routing is all nearest neighbor except serial input which is broadcast

    Latency is one bit time

    The simple serial by parallel Booth multiplier is particularly well suited for bit serial processors implemented in FPGAs without carry chains because all of its routing is to nearest neighbors with the exception of the input. The serial input must be sign extended to a length equal to the sum of the lengths of the serial input and parallel input to avoid overflow, which means this multiplier takes more clocks to complete than the scaling accumulator version. This is the structure used in the venerable TTL serial by parallel multiplier.
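    The sign extension requirement can be illustrated with a small behavioral model. The sketch below assumes the usual radix-2 Booth recoding of adjacent serial bits and two's complement inputs; it does not model the bit serial adders or the nearest neighbor routing, and the function and parameter names are hypothetical.

```python
def booth_serial_by_parallel(parallel, serial, parallel_width, serial_width):
    """Multiply two signed integers the way a serial Booth multiplier sees them.

    The serial input is presented LSB first and sign extended to
    parallel_width + serial_width bits, one bit per clock.
    """
    total_bits = parallel_width + serial_width
    product = 0
    previous_bit = 0                      # implied 0 to the right of the LSB
    for clock in range(total_bits):
        bit = (serial >> clock) & 1       # >> sign-extends negative Python ints
        booth_digit = previous_bit - bit  # 01 -> +1, 10 -> -1, 00 and 11 -> 0
        product += booth_digit * (parallel << clock)
        previous_bit = bit
    # Keep total_bits of result and reinterpret as two's complement.
    product &= (1 << total_bits) - 1
    if product >> (total_bits - 1):
        product -= 1 << total_bits
    return product


print(booth_serial_by_parallel(7, -3, 4, 4))
```

    The loop runs for parallel_width + serial_width clocks, which is why this form takes more clocks than the scaling accumulator. Once the sign extension bits repeat, the Booth digit is zero and the product no longer changes.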

    Ripple Carry Array Multipliers


    Row ripple form

    Unrolled shift-add algorithm

    Delay is proportional to N

    A ripple carry array multiplier (also called row ripple form) is an unrolled embodiment of the classic shift-add multiplication algorithm. The illustration shows the adder structure used to combine all the bit products in a 4x4 multiplier. The bit products are the logical AND of the bits from each input. They are shown in the form x,y in the drawing. The maximum delay is the path from either LSB input to the MSB of the product, and is the same (ignoring routing delays) regardless of the path taken. The delay is approximately 2*n.

    This basic structure is simple to implement in FPGAs, but does not make efficient use of the logic in many FPGAs, and is therefore larger and slower than other implementations.
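    A bit level software model may help make the unrolled structure concrete. The sketch below is an illustration under assumptions, not a description of the drawing: it forms each row of AND gate bit products and adds it to the running sum with a ripple carry adder built from full adders. The helper names are hypothetical.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder: returns (sum, carry_out)."""
    total = a + b + carry_in
    return total & 1, total >> 1


def ripple_add(x_bits, y_bits):
    """Add two equal-length LSB-first bit lists; the final carry is dropped."""
    result, carry = [], 0
    for x, y in zip(x_bits, y_bits):
        s, carry = full_adder(x, y, carry)
        result.append(s)
    return result


def array_multiply(a, b, n=4):
    """Unrolled shift-add: one row of bit products per bit of b."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> j) & 1 for j in range(n)]
    acc = [0] * (2 * n)
    for j in range(n):
        # Row j: bit products a_i AND b_j, weighted by 2**j.
        row = [0] * j + [a_bits[i] & b_bits[j] for i in range(n)] + [0] * (n - j)
        acc = ripple_add(acc, row)
    return sum(bit << k for k, bit in enumerate(acc))


print(array_multiply(13, 11))
```

    The product of two n bit inputs always fits in 2n bits, so dropping the final carry in ripple_add loses nothing here.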

    Row Adder Tree Multipliers

    Optimized Row Ripple Form

    Fundamentally same gate count as row ripple form

    Row Adders arranged in tree to reduce delay

    Routing more difficult, but workable in most FPGAs

    Delay proportional to log2(N)

    Row Adder tree multipliers rearrange the adders of the row ripple multiplier to equalize the number of adders the results from each partial product must pass through. The result uses the same number of adders, but the worst case path is through log2(n) adders instead of through n adders. In strictly combinatorial multipliers, this reduces the delay. For pipelined multipliers, the clock latency is reduced.

    The tree structure of the routing means some of the individual wires are longer than the row ripple form. As a result a pipelined row ripple multiplier can have a higher throughput in an FPGA (shorter clock cycle) even though the latency is increased.
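    The effect of the rearrangement on adder depth can be checked with a short word level sketch. The code below is an assumption-laden illustration (function names are hypothetical, and each row adder is modeled as a plain integer addition): it sums the 1 x n partial products in pairs and counts how many adder levels the worst case path crosses.

```python
def partial_products(a, b, n):
    """The n shifted 1 x n partial products of a * b (b is n bits wide)."""
    return [(a << j) if (b >> j) & 1 else 0 for j in range(n)]


def adder_tree(values):
    """Sum the values pairwise; returns (total, number of adder levels)."""
    levels = 0
    while len(values) > 1:
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])   # odd one out passes to the next level
        values = paired
        levels += 1
    return values[0], levels


product, levels = adder_tree(partial_products(89, 13, 8))
print(product, levels)
```

    For 8 partial products the tree is 3 levels deep (log2 of 8), while the row ripple arrangement chains the same adders one after another.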


    Carry Save Array Multipliers

    Column ripple form

    Fundamentally same delay and gate count as row ripple form

    Gate level speed ups available for ASICs

    Ripple adder can be replaced with faster carry tree adder

    Regular routing pattern


    Look-Up Table (LUT) Multipliers

    Complete times table of all possible input combinations

    One address bit for each bit in each input

    Table size grows exponentially

    Very limited use

    Fast - result is just a memory access away

    Look-Up Table multipliers are simply a block of memory containing a complete multiplication table of all possible input combinations. The large table sizes needed for even modest input widths make these impractical for FPGAs.

    The following table is the contents for a 6 input LUT for a 3 bit by 3 bit multiplication table.

           000    001    010    011    100    101    110    111
    000 000000 000000 000000 000000 000000 000000 000000 000000
    001 000000 000001 000010 000011 000100 000101 000110 000111
    010 000000 000010 000100 000110 001000 001010 001100 001110
    011 000000 000011 000110 001001 001100 001111 010010 010101
    100 000000 000100 001000 001100 010000 010100 011000 011100
    101 000000 000101 001010 001111 010100 011001 011110 100011
    110 000000 000110 001100 010010 011000 011110 100100 101010
    111 000000 000111 001110 010101 011100 100011 101010 110001
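    The table contents can be generated mechanically. The following sketch assumes the LUT address is simply the two inputs concatenated (one input in the upper address bits, the other in the lower); the function name is hypothetical.

```python
def build_multiplier_lut(bits=3):
    """Full times table for two unsigned inputs of the given width.

    The address is the two inputs concatenated, so the table has
    2**(2*bits) entries.
    """
    size = 1 << bits
    return [(address >> bits) * (address & (size - 1))
            for address in range(size * size)]


lut = build_multiplier_lut(3)
a, b = 0b101, 0b111
print(format(lut[(a << 3) | b], "06b"))   # 100011 (5 * 7 = 35)
print(len(lut), len(build_multiplier_lut(8)))
```

    The second print shows the exponential growth: 64 entries for 3 bit inputs, but 65536 entries for 8 bit inputs.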

    Partial Product LUT Multipliers

    Works like long hand multiplication

    LUT used to obtain products of digits

    Partial products combined with adder tree

    Partial Products LUT multipliers use partial product techniques similar to those used in longhand multiplication (like you learned in 3rd grade) to extend the usefulness of LUT multiplication. Consider the long hand multiplication:

          67
        x 54
        ----
          28    (4 x 7)
         240    (4 x 60)
         350    (50 x 7)
      + 3000    (50 x 60)
        ----
        3618

    By performing the multiplication one digit at a time and then shifting and summing the individual partial products, the size of the memorized times table is greatly reduced. While this example is decimal, the technique works for any radix. The order in which the partial products are obtained or summed is not important. The proper weighting by shifting must be maintained however.

    The example below shows how this technique is applied in hardware to obtain a 6x6 multiplier using the 3x3 LUT multiplier shown above. The LUT (which performs multiplication of a pair of octal digits) is duplicated so that all of the partial products are obtained simultaneously. The partial products are then shifted as needed and summed together. An adder tree is used to obtain the sum with minimum delay.

    The LUT could be replaced by any other multiplier implementation, since the LUT is simply being used as a multiplier. This gives insight into how to combine multipliers of an arbitrary size to obtain a larger multiplier.
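    The 6x6 construction can be sketched in a few lines. The model below is an illustration with hypothetical names: each 6 bit input is split into two octal digits, four copies of the 3x3 times table supply the partial products, and a two level adder tree sums them after shifting by the combined digit weights.

```python
# 3x3 times table indexed by a pair of octal digits
LUT_3X3 = [[x * y for y in range(8)] for x in range(8)]


def multiply_6x6(a, b):
    """6 x 6 multiply built from four 3 x 3 LUT partial products."""
    a_hi, a_lo = a >> 3, a & 0b111
    b_hi, b_lo = b >> 3, b & 0b111
    low = LUT_3X3[a_lo][b_lo]             # weight 8**0
    mid_1 = LUT_3X3[a_hi][b_lo] << 3      # weight 8**1
    mid_2 = LUT_3X3[a_lo][b_hi] << 3      # weight 8**1
    high = LUT_3X3[a_hi][b_hi] << 6       # weight 8**2
    # Two-level adder tree: two adders in parallel, then one more.
    return (low + mid_1) + (mid_2 + high)


print(multiply_6x6(0o67, 0o54))
```

    Swapping LUT_3X3 for any other 3x3 multiplier leaves multiply_6x6 unchanged, which is the point made in the text about combining smaller multipliers.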

    The LUT multipliers shown have matched radices (both inputs are octal). The partial products can also have mixed radices on the inputs provided care is taken to make sure the partial products are shifted properly before summing. Where the partial products are obtained with small LUTs, the most efficient implementation occurs when the LUT is square (i.e. the input radices are the same). For 8 bit LUTs, such as might be found in an Altera 10K FPGA, this means the LUT radix is hexadecimal. For 4 bit LUTs, found in many FPGA logic cells, the ideal radix is 2 bits. (This is really the only option for a 4 LUT: a 1 bit input reduces the LUT to an AND gate, and since each LUT cell has 1 output, it can only use one bit on the other input.)

    A more compact but slower version is possible by computing the partial products sequentially using one LUT and accumulating the results in a scaling accumulator. Note that in this case, the shifter would need a special control to obtain the proper shift on all the partials.

    Computed Partial Product Multipliers

    Partial product optimization for FPGAs having small LUTs

    Fewer partial products decrease depth of adder tree

    2 x n bit partial products generated by logic rather than LUT

    Smaller and faster than 4 LUT partial product multipliers

    A partial product multiplier constructed from the 4 LUTs found in many FPGAs is not very efficient because of the large number of partial products that need to be summed (and the large number of LUTs required). A more efficient multiplier can be made by recognizing that a 2 bit input to a multiplier produces a product 0, 1, 2 or 3 times the other input. All four of these products are easily generated in one step using just an adder and shifter. A multiplexer controlled by the 2 bit multiplicand selects the appropriate product as shown below. Unlike the LUT solution, there is no restriction on the width of the A input to the partial product. This structure greatly reduces the number of partial products and the depth of the adder tree. Since the 0, 1, 2 and 3x inputs to the multiplexers for all the partial products are the same, one adder can be shared by all the partial product generators. This structure works well in coarser grained FPGAs like the Xilinx 4K series.

    2 x n bit partial product generated with adder and multiplexer
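    As a rough software analogue of the adder and multiplexer structure, the sketch below precomputes the 0, 1, 2 and 3 times multiples of the A input once (only the 3x multiple needs an adder) and lets each 2 bit slice of the other input select one of them. The function name and the b_width parameter are assumptions for illustration.

```python
def computed_partial_product_multiply(a, b, b_width):
    """Multiply using 2 x n partial products chosen by a 4-way multiplexer."""
    # Shared multiples of A: 2A is a wired shift, 3A needs the one shared adder.
    multiples = (0, a, a << 1, (a << 1) + a)
    product = 0
    for shift in range(0, b_width, 2):
        select = (b >> shift) & 0b11          # 2-bit multiplexer control
        product += multiples[select] << shift
    return product


print(computed_partial_product_multiply(89, 13, 4))
```

    An 8 bit B input yields four partial products here instead of the eight produced by 1 x n partial products, so the adder tree that follows is one level shallower.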

    The Xilinx Virtex device includes an extra gate in the carry chain logic that allows a 4 input LUT plus the carry chain to perform a 2xN partial product, thereby achieving twice the density otherwise attainable. In this case, the adder (consisting of the XOR gates and muxes in the carry chain) adds a pair of 1xN partial products obtained with AND gates. The extra AND gate in the carry logic allows you to put AND gates on both inputs of the adder while maintaining a 4 input function.


    2 x n bit computed partial product implemented in Xilinx Virtex using special MULTAND gate in carry chain logic

    Constant Coefficient Multipliers

    Multiplies input by a constant

    LUT contains custom times table

    Width of the constant does not affect depth of adder tree

    All LUT inputs available for multiplicand

    More efficient than full multiplier

    Size is constant regardless of value of constant (assuming equal constant bit widths)

    A full multiplier accepts the full range of inputs for each multiplicand. If one of the multiplicands is a constant, then it is far more efficient to construct a times table that only has the column corresponding to the constant value. These are known as constant (K) Coefficient Multipliers or KCMs. The example below multiplies a 5 bit input (values 0 to 31) by a constant 67. Note that with a constant multiplier, all of the LUT inputs are available for the variable multiplicand. This makes the KCM more efficient than a full multiplier (fewer partial products for a given width).

    5 bit input * 67 (columns: upper 2 input bits; rows: lower 3 input bits)

    input    00    01    10    11
    000       0   536  1072  1608
    001      67   603  1139  1675
    010     134   670  1206  1742
    011     201   737  1273  1809
    100     268   804  1340  1876
    101     335   871  1407  1943
    110     402   938  1474  2010
    111     469  1005  1541  2077

    When the LUT does not offer enough inputs to accommodate the desired variable width, several identical LUTs may be combined using the partial products techniques discussed above. In this case, the constant multiplicand is full width, so the partial products will be m x n where m is the number of LUT inputs and n is the width of the constant.
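    A KCM and its extension to wider inputs can be sketched as follows. This is an illustrative model only: the names are hypothetical, the constant 67 and the 5 input LUT come from the example above, and the wider input is handled by reusing the same table on each 5 bit slice and summing the shifted results.

```python
CONSTANT = 67       # constant from the example above
LUT_INPUTS = 5      # all LUT inputs carry the variable multiplicand

# Custom times table: one entry per value of the variable input
KCM_LUT = [CONSTANT * value for value in range(1 << LUT_INPUTS)]


def kcm_multiply(x, x_width):
    """Multiply x by CONSTANT using identical LUTs on each slice of x."""
    product = 0
    for shift in range(0, x_width, LUT_INPUTS):
        digit = (x >> shift) & ((1 << LUT_INPUTS) - 1)
        product += KCM_LUT[digit] << shift   # m x n partial product, weighted
    return product


print(KCM_LUT[0b01000], KCM_LUT[0b11111])
print(kcm_multiply(1000, 10))
```

    A 10 bit input needs only two partial products and a single adder, whatever the value of the constant.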

    Limited Set LUT Multipliers

    Multiplies input by one of a small set of constants

    Similar to KCM multiplier

    LUT input bit(s) select which constant to use

    Useful in modulators, other signal processing applications

    In signal processing, there are often instances where one multiplicand is taken from a small set of constant values. In these cases, the KCM multiplier can be extended so that the LUT contains the times tables for each constant. One or more of the LUT inputs select which constant is used, while the remaining inputs are for the variable multiplicand. The example below is a 6 LUT containing times tables for the constants 67 and 85. One bit of the input selects which times table is used. The remaining inputs are the 5 bit variable multiplicand (values from 0 to 31). Again, the input width can be extended using the partial product techniques discussed previously.

             5 bit input * 67        5 bit input * 85
            000   001   010   011   100   101   110   111
    000       0   536  1072  1608     0   680  1360  2040
    001      67   603  1139  1675    85   765  1445  2125
    010     134   670  1206  1742   170   850  1530  2210
    011     201   737  1273  1809   255   935  1615  2295
    100     268   804  1340  1876   340  1020  1700  2380
    101     335   871  1407  1943   425  1105  1785  2465
    110     402   938  1474  2010   510  1190  1870  2550
    111     469  1005  1541  2077   595  1275  1955  2635
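    The 6 input table can be generated the same way as the KCM table. The sketch below assumes the select bit is the most significant address bit, which matches the column order of the table above (columns 000 to 011 hold the 67 times table, 100 to 111 the 85 times table); the names are hypothetical.

```python
CONSTANTS = (67, 85)    # the two constants from the example above

# Address = select bit (MSB) followed by the 5-bit variable input
LIMITED_SET_LUT = [CONSTANTS[address >> 5] * (address & 0b11111)
                   for address in range(64)]


def limited_set_multiply(select, x):
    """Multiply the 5-bit input x by CONSTANTS[select] with one LUT access."""
    return LIMITED_SET_LUT[(select << 5) | (x & 0b11111)]


print(limited_set_multiply(0, 0b01001), limited_set_multiply(1, 0b01001))
```

    The two results, 603 and 765, are the entries for input 01001 in the 67 and 85 halves of the table.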

    Constant Multipliers from Adders

    Adder for each '1' bit in constant

    Subtractor replaces strings of '1' bits using Booth recoding

    Efficiency, size depend on value of constant

    KCM multipliers are usually more efficient for arbitrary constant values

    The shift-add multiply algorithm essentially produces m 1xn partial products and sums them together with appropriate shifting. The partial products corresponding to '0' bits in the 1 bit input are zero, and therefore do not have to be included in the sum. If the number of '1' bits in a constant coefficient multiplier is small, then a constant multiplier may be realized with wired shifts and a few adders as shown in the 'times 10' example below.

    0     0000000
    1    1011001
    0   0000000
    1 +1011001
       ----------
       1101111010

    In cases where there are strings of '1' bits in the constant, adders can be eliminated by using Booth recoding methods with subtractors. The 'times 14' example below illustrates this technique. Note that 14 = 8+4+2 can be expressed as 14 = 16-2, which reduces the number of partial products.

    0      0000000
    1     1011001
    1    1011001
    1 + 1011001
       -----------
       10011011110

     0      0000000
    -1  1110100111
     0    0000000
     0   0000000
     1 +1011001
        -----------
        10011011110

    Combinations of partial products can sometimes also be shifted and added in order to reduce the number of partials, although this may not necessarily reduce the depth of a tree. For example, the 'times 1/3' approximation (85/256=0.332) below uses fewer adders than would be necessary if all the partial products were summed directly. Note that the shifts are in the opposite direction to obtain the fractional partial products.

    Clearly, the complexity of a constant multiplier constructed from adders is dependent upon the constant. For an arbitrary constant, the KCM multiplier discussed above is a better choice. For certain 'quick and dirty' scaling applications, this multiplier works nicely.
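    The three constant multipliers discussed in this section can be written as shift and add expressions. The sketch below is illustrative: the function names are hypothetical, and the factoring used for the 'times 1/3' case (85 = 5 x 17, so two adders instead of three) is an assumption about how the partial products are combined, since the original illustration is not reproduced here. For simplicity the integer model shifts left and applies a single right shift by 8 at the end, rather than shifting each fractional partial product right.

```python
def times_10(x):
    """10 = 8 + 2: two wired shifts and one adder."""
    return (x << 3) + (x << 1)


def times_14(x):
    """14 = 16 - 2: Booth recoding replaces two adders with one subtractor."""
    return (x << 4) - (x << 1)


def times_one_third(x):
    """Approximate x / 3 as 85 * x / 256 using two adders."""
    five_x = x + (x << 2)                 # 5x
    return (five_x + (five_x << 4)) >> 8  # (5x + 80x) / 256


print(times_10(0b1011001), times_14(0b1011001), times_one_third(768))
```

    With the input 1011001 (89) the first two results are 890 and 1246, matching the binary sums 1101111010 and 10011011110 worked above.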

    Wallace Trees

    Optimized column adder tree

    Combines all partial products into 2 vectors (carry and sum)

    Carry and sum outputs combined using a conventional adder

    Delay is log(n)

    Wallace tree multiplier uses Wallace tree to combine 1 x n partial products

    Irregular routing

    Not optimum in many FPGAs

    A Wallace tree is an implementation of an adder tree designed for minimum propagation delay. Rather than completely adding the partial products in pairs like the ripple adder tree does, the Wallace tree sums up all the bits of the same weights in a merged tree. Usually full adders are used, so that 3 equally weighted bits are combined to produce two bits: one (the carry) with weight of n+1 and the other (the sum) with weight n. Each layer of the tree therefore reduces the number of vectors by a factor of 3:2 (Another popular scheme obtains a 4:2 reduction using a different adder style that adds little delay in an ASIC implementation). The tree has as many layers as is necessary to reduce the number of vectors to two (a carry and a sum). A conventional adder is used to combine these to obtain the final product. The structure of the tree is shown below. The red numbers after each full adder in the illustration indicate the bit weights of each signal. For a multiplier, this tree is pruned because the input partial products are shifted by varying amounts.

    A section of an 8 input Wallace tree. The Wallace tree combines the 8 partial product inputs to two output vectors corresponding to a sum and a carry. A conventional adder is used to combine these outputs to obtain the complete product.

    If you trace the bits in the tree (the tree in the illustration is color coded to help in this regard), you will find that the Wallace tree is a tree of carry-save adders arranged as shown to the left. A carry save adder consists of full adders like the more familiar ripple adders, but the carry output from each bit is brought out to form a second result vector rather than being wired to the next most significant bit. The carry vector is 'saved' to be combined with the sum later, hence the carry-save moniker.

    A Wallace tree multiplier is one that uses a Wallace tree to combine the partial products from a field of 1 x n multipliers (made of AND gates). It turns out that the number of Carry Save Adders in a Wallace tree multiplier is exactly the same as in the carry save array multiplier. The Wallace tree rearranges the wiring however, so that the partial product bits with the longest delays are wired closer to the root of the tree. This changes the delay characteristic from O(n+n) to O(n+log(n)) at no gate cost. Unfortunately the nice regular routing of the array multiplier is also replaced with a rats-nest.
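    The 3:2 reduction can be demonstrated with a word level sketch. The model below is a simplification under stated assumptions: each carry-save adder operates on whole vectors (Python integers) rather than on individual pruned columns, vectors are grouped in list order rather than by arrival time, and the function names are hypothetical. It reduces the 1 x n partial products layer by layer until two vectors remain, then combines them with a conventional addition.

```python
def carry_save_add(a, b, c):
    """3:2 compressor on whole vectors: a + b + c == sum_vector + carry_vector."""
    sum_vector = a ^ b ^ c
    carry_vector = ((a & b) | (a & c) | (b & c)) << 1   # carries have weight n+1
    return sum_vector, carry_vector


def wallace_reduce(vectors):
    """Apply layers of carry-save adders until at most two vectors remain."""
    vectors = list(vectors)
    layers = 0
    while len(vectors) > 2:
        reduced = []
        full_groups = len(vectors) // 3 * 3
        for i in range(0, full_groups, 3):
            reduced.extend(carry_save_add(*vectors[i:i + 3]))
        reduced.extend(vectors[full_groups:])   # leftovers pass to the next layer
        vectors = reduced
        layers += 1
    return vectors, layers


def wallace_multiply(a, b, n):
    """Multiply a by the n-bit value b; returns (product, carry-save layers)."""
    partials = [(a << j) if (b >> j) & 1 else 0 for j in range(n)]
    vectors, layers = wallace_reduce(partials)
    return sum(vectors), layers     # final conventional adder


print(wallace_multiply(89, 13, 8))
```

    For 8 partial products the vector count goes 8, 6, 4, 3, 2, so four carry-save layers precede the final adder.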
