new lut

Upload: btechece65

Post on 08-Apr-2018

231 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/7/2019 new LUT

    1/4

    New Look-up-Table Optimizations for

    Memory-Based Multiplication

    Pramod Kumar MeherCommunication Systems Department, Institute for Infocomm Research

    1 Fusionopolis Way, Singapore, email: [email protected].

    AbstractTwo new techniques are presented in this paper,for the reduction of look-up-table (LUT) size of memory-basedmultipliers to be used in digital signal processing applications. It isshown that by simple sign-bit exclusion, the LUT size is reducedby half at the cost of a marginal area overhead. Moreover, anovel anti-symmetric product coding (APC) scheme is proposedto reduce the LUT size by further half, where the LUT outputis added with or subtracted from a fixed value. It is shown thatthe optimized LUTs for small input width could be used forefficient implementation of high-precision LUT-multipliers, wherethe total contribution of all such fixed offsets could be added to thefinal result or could be initialized for successive accumulations.

    The proposed LUT-multiplier and the existing ones are codedin VHDL and synthesized by Synopsys Design Compiler usingTSMC 90 nanometer library. The proposed optimized LUT-multiplier is found to involve less area and less multiplicationtime than the existing LUT-multipliers. The conventional LUT-multiplier and the odd-multiple storage LUT of [1] involve nearly56.8% and 26.2% more area-delay product, in average, for 16-bitinput than the proposed LUT design.

    I. INTRODUCTION

    Along with the device scaling over the years, semiconductor

    memory has become cheaper, faster and more power-efficient.

    Moreover, according to the projections of the international

    technology roadmap for semiconductors (ITRS) [2], embedded

    memories will have dominating presence in the system-on-

    chips (SoC), which may exceed 90% of total SoC content.

    It has also been found that the transistor packing density

    of memory components is not only high but also increasing

    much faster than the transistor density of logic components.

    Apart from that, the memory-based computing structures are

    more regular than the multiply-accumulate structures; and

    offer many other advantages, e.g., greater potential for high-

    throughput and low-latency implementation; and less dynamic

    power consumption. Memory-based computing is well-suited

    for many digital signal processing (DSP) algorithms, which

    involve multiplication with fixed set of coefficients.

    The basic functional model of memory-based multiplieris shown in Fig.1. Let A be a fixed coefficient and X bean input word to be multiplied with A. Assuming X to bea positive binary number with word-length L, there can be2L possible values of X, and accordingly, there can be 2L

    possible values of product C= A X. Therefore, for memory-based multiplication, an LUT of 2L words consisting of pre-computed product values corresponding to all possible values

    ofX, is conventionally used. The product-word AXi is storedat the location whose address is the same as Xi for 0 2L1,

    such that ifL-bit binary value ofXi is used as address for theLUT, then the corresponding product value A Xi is availableas its output.

    Several architectures have been reported in the literature for

    memory-based implementation of DSP algorithms involving

    orthogonal transforms and digital filters [3][6]. But, we do not

    find any significant work on LUT optimization for memory-

    based multiplication. In a recent paper [1], we have presented a

    new approach to LUT design for memory-based multiplication,

    which could be used to reduce the memory-size by half for

    small input-widths. It is shown in [1] that although 2L

    possiblevalues ofX correspond to 2L possible values ofC= A X,only (2L/2) words corresponding to the odd multiples of Amay only be stored in the LUT while the even multiples of Acould be derived by left-shift operations of one of those odd

    multiples. We have referred to this scheme as odd-multiple-

    storage LUT in the rest of this paper. Using this approach,

    one can reduce the LUT size to half, but it has significant

    combinational overhead since it requires a barrel-shifter along

    with a control-circuit to generate the control-bits for producing

    a maximum of (L 1) left-shifts, and an encoder to map theL-bit input word to (L 1)-bit LUT address.

    In this paper, we propose two new schemes for optimization

    of LUT with lower area- and time-overhead. One of theproposed optimization is based on exclusion of sign-bit from

    the LUT address, and the other optimization is based on a

    recoding of stored product word.

    I I . PROPOSED OPTIMIZATIONS OF LOO K-UP-TABLE FOR

    MEMORY-BASED MULTIPLICATION

    The proposed LUT optimizations for multiplication of a

    number X with a given constant A is possible for both sign-magnitude and 2s complement representations of X and A.Besides, both X and A could be fractions or integers in fixed-

    point format. But, for simplicity of presentation, the multipli-cand X is assumed here to be an integer in sign-magnituderepresentation, while the constant A is assumed to be either insign-magnitude or in 2s complement representation.

    MEMORYCORE

    (2LWORDS)

    L (W+L)X AX

    ADDRESS

    PORT

    OUTPUT

    PORT

    Fig. 1. Basic Functional Model LUT-Based Multiplier.

    ISIC 2009663

  • 8/7/2019 new LUT

    2/4

    A. Sign-Bit Exclusion for LUT Optimization

    Let us write the input multiplicand, X = sX |X|, wherethe operation is a bit-concatenation at the MSB of |X|,sX is the sign-bit (MSB) ofX, and the magnitude-part |X| isgiven by

    |X| =L2i=0

    xi 2i (1)

    The product C= A X, similarly, can be written as

    C= s |A| |X|, and (2a)

    s = sA XOR sX (2b)

    where sA and |A|, respectively, are the sign-bit and magnitude-part ofA.

    Since |X| is an (L 1)-bit binary number, all possibleproduct values of|A| |X| can be stored as 2(L1) LUT words,while the sign-bit can be derived by an XOR operation accord-

    ing to (2b). The product words and words required to be stored

    for different values of X for LUT-based multiplication (forL = 5) is shown in Table I. The product word corresponding to

    X= (1 x3x2x1x0) is negative of that for X= (0 x3x2x1x0)for any given value of |X| = (x3x2x1x0). The product wordson the fourth column of Table I can, therefore, be derived

    by negating the product word stored at the second column on

    the same row. Therefore, instead of 32 product words only 16

    values of |A| |X| for all possible values of |X| = (x3x2x1x0)are required to be stored, as shown in the sixth column

    of Table I. The corresponding LUT-multiplier is shown in

    Fig.2(a). It requires only one additional XOR gate to determine

    the sign of product word C. The sign-exclusion technique canalso be applied for 2s complement representation of coefficient

    (W+4)-bit product, AXin sign-magnitude representation

    x0

    x1

    x2

    15|A|14|A|13|A|12|A|11|A|10|A|9|A|8|A|7|A|6|A|5|A|4|A|3|A|2|A||A|0

    x3

    x4= sX

    sA

    (W+3)

    1

    16-WORDLUTPRODUCTSINSIGN-

    MAGNITUDESREPRESENTATION

    4TO16LINEDECODER

    AXin 2s complement representation

    x0

    x1

    x2

    16-WORDLUTOFPRODUCTWORDSIN2S

    COMPLEMENTREPRESENTATION

    15A14A13A12A11A10A9A8A7A6A5A4A3A2AA0

    x3

    MUX

    x4=sX

    (W+4)

    (W+4)

    1 0

    TWOS COMPLEMENT

    4TO16LINEDECODER

    (a) (b)

    Fig. 2. Proposed LUTs for multiplication ofW-bit fixed coefficient A with 5-bit operand X = x4 x3 x2 x1 x0, using sign-bit exclusion. (a) LUT-multiplierwith coefficient A in sign-magnitude representation. (b) LUT-multiplier withcoefficient A in 2s complement representation.

    TABLE IINPUT WORDS, P RODUCTS WORDS AND STORED WORDS FOR L = 5

    address, Xproduct

    address, Xproduct stored words

    values values 2s comp sign-mag

    0 0 0 0 0 0 1 0 0 0 0 0 0 0

    0 0 0 0 1 A 1 0 0 0 1 A A |A|

    0 0 0 1 0 2A 1 0 0 1 0 2A 2A 2|A|

    0 0 0 1 1 3A 1 0 0 1 1 3A 3A 3|A|

    0 0 1 0 0 4A 1 0 1 0 0 4A 4A 4|A|

    0 0 1 0 1 5A 1 0 1 0 1 5A 5A 5|A|0 0 1 1 0 6A 1 0 1 1 0 6A 6A 6|A|

    0 0 1 1 1 7A 1 0 1 1 1 7A 7A 7|A|

    0 1 0 0 0 8A 1 1 0 0 0 8A 8A 8|A|

    0 1 0 0 1 9A 1 1 0 0 1 9A 9A 9|A|

    0 1 0 1 0 10A 1 1 0 1 0 10A 10A 10|A|

    0 1 0 1 1 11A 1 1 0 1 1 11A 11A 11|A|

    0 1 1 0 0 12A 1 1 1 0 0 12A 12A 12|A|

    0 1 1 0 1 13A 1 1 1 0 1 13A 13A 13|A|

    0 1 1 1 0 14A 1 1 1 1 0 14A 14A 14|A|

    0 1 1 1 1 15A 1 1 1 1 1 15A 15A 15|A|

    A. The multiplier for 2s complement representation of A

    is shown in Fig.2(b), which stores the product words in 2scomplement representation, and requires a 2s complement unit

    along with a 2:1 MUX to change the sign of LUT output for

    negative values ofX.

    B. Anti-Symmetric Product Coding for LUT Optimization

    The product words for different values of modulus |X| forL = 5 are shown in the second column of Table II. It may beobserved in Table II that, the address word |X| on the i-th rowis 1s complement of that on (16+1i)-th row for 1 i 8.Besides, the sum of product values on these two rows is 15A.Let the product values on the i-th and (16+1 i)-th rows beu and v, respectively. Since one can write u =

    u+v2

    vu2

    and v =u+v2 + v

    u2

    , for (u + v) = 15A, we can have

    u =15A

    2v u

    2

    and v =

    15A

    2+v u

    2

    (3)

    The product values on the second column of Table II,

    therefore, could be re-written (in a form as given in the third

    column of the Table) according to (3). Since the product words

    at i-th and (16 + 1 i)-th rows, for 1 i 8, on the thirdcolumn of this Table have a negative mirror symmetry, we

    can name it as anti-symmetric product coding (APC). This

    behavior APC can be used to reduce the LUT size as shown

    in the fourth and fifth columns of the Table, where instead of

    storing u and v at i-th and (16 + 1 i)-th rows, respectively,(v u) is stored at i-th row, for 1 i 8. The desiredproduct could be obtained by adding or subtracting the stored

    value (v u) to or from the fixed value 15A when x3 is 1 or0, respectively, followed by a right-shift operation (for scaling

    down by a factor 2).

    APC can also be used when both the multiplicands are in

    sign-magnitude representation. As shown in Table I, all the

    stored values in this case are positive, since (v u) is alwayspositive. Besides, the result of addition/subtraction with/from

    664

  • 8/7/2019 new LUT

    3/4

    TABLE IITHE APC-BASED LUT ALLOCATION F OR DIFFERENT VALUES OF |X|

    modulus, |X| product APC stored wordsx3x2x1x0 values representation 2s comp sign-mag

    0 0 0 0 0 (15A 15A)/2 15A 15|A|

    0 0 0 1 A (15A 13A)/2 13A 13|A|

    0 0 1 0 2A (15A 11A)/2 11A 11|A|

    0 0 1 1 3A (15A 9A)/2 9A 9|A|

    0 1 0 0 4A (15A 7A)/2 7A 7|A|

    0 1 0 1 5A (15A 5A)/2 5A 5|A|

    0 1 1 0 6A (15A 3A)/2 3A 3|A|

    0 1 1 1 7A (15A A)/2 A |A|

    1 0 0 0 8A (15A + A)/2

    1 0 0 1 9A (15A + 3A)/2

    1 0 1 0 10 (15A + 5A)/2

    1 0 1 1 11A (15A + 7A)/2

    1 1 0 0 12A (15A + 9A)/2

    1 1 0 1 13A (15A + 11A)/2

    1 1 1 0 14A (15A + 13A)/2

    1 1 1 1 15A (15A + 15A)/2

    the fixed value 15|A| is also positive. The sign of product word

    s, therefore, can finally be determined by an XOR operationaccording to (2b).

    The structure and function of LUT-based multiplier for L =5 using sign-bit exclusion and APC techniques (assuming A aswell as the product values C in 2s complement representation)is shown in Fig.3(a). It consists of an LUT of 8 words to store

    the values of product words in 2s complement representation

    as given on the first eight rows of the fourth column of Table II.

    A 3-to-8 line decoder is fed with the address bits through

    an address mapping circuit, which generates the word-select

    signals for the LUT. When x3 = 0, the first three bits ofX, i.e., (x2 x1 x0) are used as the LUT address while forx3 = 1, ones complement of (x2 x1 x0) is used as address.The address-mapping circuit, therefore, can be designed by

    three XOR gates, where x3 is fed to all the XOR gates while(x2 x1 x0) are fed to three XOR gates as shown in the figure.The output of LUT is added with or subtracted from 15A, forx3 = 1 or 0, respectively, followed by a scaling down by afactor of two, using an ADD/SUB-SCALE cell according to

    (3). The output of ADD/SUB-SCALE cell is negated by a 2s

    complement unit and a 2:1 MUX when x4 = 1 .

    The structure and function of LUT-based multiplier for

    L = 5 using sign-bit elimination and APC techniques assum-ing both the multiplicands in sign-magnitude representation is

    shown in Fig.3(b). It stores the magnitudes of product values

    on the first eight rows of the fifth column of Table II. Theaddress mapping circuit in this case is the same as that of

    Fig.3(a). The output of LUT is added with or subtracted from

    15|A|, for x3 = 1 or 0, respectively, by an ADD/SUB-SCALEcell according to (3). The sign-bit s of the product word isgenerated by a 2-input XOR gate as in case of Fig. 2(a).

    In the next section, we have shown that APC technique

    could be used for efficient implementation of high-precision

    LUT-based multiplications by suitable input operand decom-

    position and summation of resulting partial sums.

    PRODUCTWORDSIN

    2SCOMPLEMENT

    REPRESENTATION

    A

    3A

    5A

    7A

    9A

    11A

    13A

    15A

    (W+4)

    3TO8LINEDECODERx0

    15A

    product value AXin 2s complement representation

    x4 = sX

    (W+4)

    (W+4)

    1 0

    2S COMPLEMENT

    MUX

    ADD/ SUB-SCALE

    x1

    x2

    x3

    address mapping 8-word LUT

    sign-determination

    (a)

    |A|

    3|A|

    5|A|

    7|A|

    9|A|

    11|A|

    13|A|

    15|A|

    (W+3)

    3TO8LINEDECODERx0

    15A

    ADD/ SUB-SCALE

    x1

    x2

    x3

    address mapping

    sign-determination

    (W+3)

    (W+4)-bit product, AXin sign-magnitude representation

    x4= sX

    sA1

    MAGNITUDEOFTHE

    PRODUCTWORDS

    8-word LUT

    (b)

    Fig. 3. LUT design for multiplication of W-bit fixed coefficient A with5-bit input, X = x4 x3 x2 x1 x0 using APC. (a) LUT-based multiplierwith constant multiplicand in 2s complement representation. (b) LUT-basedmultiplier with constant multiplicand in sign-magnitude representation.

    III. HIG H-P RECISION MEMORY-BASED MULTIPLICATION

    BY INPUT-OPERAND DECOMPOSITION

    Although the area of memory-core of the LUT-basedmultiplier proposed in Section II is nearly half that of the

    conventional LUT-multiplier, still the memory-size increases

    exponentially with the input word-length. It is, therefore, not

    an area-efficient choice to extend the proposed LUT-multiplier

    of Section-II for large input words. When the width of input

    multiplicand X is large, it could however be decomposedinto certain number of segments or sub-words, and the partial

    products pertaining to different sub-words be shift-added to

    obtain the desired product word.

    665

  • 8/7/2019 new LUT

    4/4

    TABLE IIIAREA AND TIM E COMPLEXITIES OF LUT-MULTIPLIERS F OR L = 16 .

    multiplier designsW = 8 W = 16

    area delay ADP excess ADP area delay ADP excess ADP

    conventional LUT multiplier 3688.9 3.24 11952.0 61.02 % 5866.4 3.99 23406.9 52.58 %odd-multiple LUT-multiplier [1] 2870.4 3.32 9529.7 28.39 % 4608.3 4.13 19032.3 24.07 %proposed LUT-multiplier 2348.9 3.16 7422.5 3893.5 3.94 15340.4

    Area is in m2 . Delay refers to the average multiplication-time in ns. ADP stands for area-delay product. ADP saving corresponds to the saving over conventional multiplier.

    Let the magnitude part of input multiplicand X be de-composed into P number of S-bit segments (or sub-words),{X1 X2 XP}, for Xi = { x(i1)S x(i1)S+1 xiS1},is an S-bit sub-word for 1 i P 1, and XP ={x(P1)S x(P1)S+1 . . . x(P1)S+S1} is the last sub-wordof S-bit, where S S, and xi is the (i + 1)-th bit of X,and P is an integer for which the word-length of |X| is notmore than PS, (i.e., L 1 PS). Without loss of generality,we can however assume the width of input magnitude to be

    L 1 = PS. The product word C= A.X thus can be writtenas sum of partial products generated by multiplying A withthe sub-words ofX as

    C= s Pi=1

    2S(i1) Ci (4)

    The partial product Ci = A Xi for 1 i P, canhave maximum width of (W+ S 1) bits or (W + S) bitswhen A is in sign-magnitude or 2s complement representation,respectively. denotes 2s complement operation in caseof 2s complement representation of A if s = sX = 1, orconcatenation of s = sA sX to the MSB of the estimatedsum in case of sign-magnitude representation ofA.

    For L = 16, let us take S = 4 and P = 4. The inputmultiplicand X, therefore, can be decomposed into 4 sub-words X1, X2, X3 and X4, where X2, X3 and X4 containthe 12 more significant bits ofX, and each of them is a 4-bitword. Let X1 contain the less significant 3-bits of |X|. Threepartial products C2 = A.X2, C3 = A.X3 and C4 = A.X4, arecomputed by three LUT-based multipliers using APC. Since

    X1 has three bits only, C1 = A.X1 can be computed bya conventional 8-word LUT, moreover the fixed offset value

    V = [120A ( 1 + 24 + 28)] due to other three LUTs can alsobe added with C1 to be stored in this LUT, such that LUT-1stores the values (A X1 + V). The proposed multiplier forL = 16 for any given value of coefficient size W is shownin Fig. 4. The least significant 3-bit sub-word X1 ofX is fedas input to LUT-1, and the next more significant sub-words

    X2, X3 and X4 are fed as input to LUT-2 and LUT-3 andLUT-4 respectively. The structure of each of LUT-2, LUT-3and LUT-4 is similar to the one shown in Fig. 3 [except that

    they do not have the ADD/SUB-SCALE cell]. The sign of the

    LUT outputs are modified according to the value of the most-

    significant bit of corresponding sub-word Xi for i = 2, 3 and4; and the LUT outputs are appropriately shifted and added

    together as shown in Fig. 4. The sign of product is modified

    finally depending on the sign of X by a sign-determinationcircuit (as shown in Fig. 3).

    X44

    AX4

    X34

    W+4

    LUT4

    AX3W+4

    LUT3

    X24

    AX2W+4

    LUT2

    W+15

    A|X|

    ADDER

    sX=x15

    AX

    ADDER

    ADDERW+8 W+15

    W+15

    1

    X13

    LUT1

    V+AX1