
Cascadable NMOS VLSI circuit for implementing a fast convolver using the Fermat number transform

P.J. Towers, BSc, A. Pajayakrit, BSc, and Prof. A.G.J. Holt, DSc

Indexing terms: Mathematical techniques, Very large scale integration, Circuit theory and design

Abstract: The paper describes a novel NMOS VLSI circuit, a set of which can be cascaded to form a 32-point Fermat number transformer/inverse transformer operating over $F_4$ ($2^{2^4} + 1 = 2^{16} + 1 = 65537$). With the addition of a modulo $F_4$ multiplier a fast convolver/correlator can be constructed, allowing the favourable properties of the transform to be exploited while avoiding the difficulties of implementing 17-bit arithmetic with a standard micro- or signal processor. The design comprises one complete section of a pipelined transformer, and is novel in that it may be programmed to function at any point in a forward or inverse pipeline, so allowing the construction of a pipelined convolver or correlator using identical chips. This overcomes the difficulty of fitting a complete pipeline on to one chip without resorting to the use of several different designs.

1 Introduction

The calculation of convolutions and correlations is a problem which occurs frequently in digital signal processing. Difficulties that arise in these calculations are (i) the need for high speed to give large bandwidth and (ii) the errors due to rounding or truncation that occur because of the finite register lengths employed.

A high speed of operation can be obtained by the use of dedicated VLSI chips which, for this purpose, offer considerable speed advantages over single general purpose microprocessors. The errors arising due to rounding and truncation (but not from quantisation) can be eliminated and the multiplication process greatly simplified by the use of the Fermat number transform.

This paper describes a design for a dedicated VLSI chip that may be used to implement the Fermat number transform for convolutions and correlations. The transforms are briefly outlined with a short discussion of their use for convolutions using the overlap and save technique. After the implementation scheme has been briefly outlined the design of the pipelined transformer is discussed with its simulation. Finally, a design for a fast convolver is given using a number of identical chips each programmed to perform different functions in the convolution process.

Paper 5204G (E10, C2), first received 4th August 1986 and in revised form 8th January 1987

The authors are with the Department of Electrical & Electronic Engineering, The Merz Laboratories, University of Newcastle upon Tyne, Newcastle upon Tyne NE1 7RU, United Kingdom

VLSI is an ideal medium for the implementation of transforms based on Fermat numbers (Fermat number transforms, FNTs) because the moduli used are of the form $2^{2^t} + 1$ and so the word lengths required in the system are $2^t + 1$. The peculiar number of bits required presents no difficulty in a custom VLSI design, but makes the use of standard 8-, 16- or 32-bit microprocessors and bit slice systems quite inefficient.

Truong et al. [1] proposed a VLSI implementation of a convolver using the FNT modulo $F_4$, and designed a forward transformer making use of a number coding scheme specifically designed by Leibowitz [2] for the purpose of performing arithmetic modulo a Fermat number. The design uses the pipelined scheme, first proposed for the computation of fast Fourier transforms (FFTs). Although this scheme is very efficient, the area of silicon required is large ($12 \times 10^6 \lambda^2$) and would not easily be fabricated in a $\lambda = 3\mu$ process. It would appear also that either a separate design would have to be fabricated, or data externally reordered, to perform the inverse transform necessary for convolution.

In an attempt to overcome these problems our approach was to section the pipeline into smaller units. Each unit is identical and can be programmed to take up the characteristics of any section of a pipeline capable of performing a 32-point modulo $F_4$ forward or inverse FNT, or a reversed-order forward transform necessary for correlation. This allows the convolution/correlation to be performed using a set of identical chips, a very favourable attribute when considering a VLSI solution to a problem.

The pipeline section has been designed for implementation in $3\mu$ NMOS. To achieve a high data throughput, pipeline registers [3] have been placed between the constituent arithmetic and logic blocks. These registers double up as scan path [4] registers to aid testability, a very important consideration in a design of this size.

The resulting design is highly practical and has an area of approximately $9 \times 10^6 \lambda^2$, a predicted data rate of 1 MHz and power consumption of 1 W. The sections are easily interconnected and programmed to perform convolution or correlation, requiring only a residue number multiplier to complete the system. For impulse response lengths of up to 16 points, the transforms of which are already known, a pipelined convolver/correlator would consist of five chips for the forward transform, a modulo $F_4$ multiplier and five chips for the inverse transform. The system would then have a maximum sampling frequency of 0.5 MHz, power consumption of 10 W and chip count of 11. Impulse response lengths can be increased in multiples of 16 by providing an extra multiplier and inverse transformer per 16 points.

IEE PROCEEDINGS, Vol. 134, Pt. G, No. 2, APRIL 1987 57

2 Convolution using transforms

Convolution can easily be achieved by a direct time domain implementation of the linear convolution sum:

$$y(n) = \sum_{m=0}^{L-1} x(m)\,h(n-m) = x(n) * h(n) \qquad (1)$$

where x(n), h(n) and y(n) are the input, impulse and output quantised sequences. h(n) is of finite length L, whereas x(n) (and hence y(n)) may or may not be of finite length.

2.1 Discrete Fourier transforms (DFTs)

With the advent of Cooley and Tukey's [5] fast Fourier transform (FFT) for computing the discrete Fourier transform (DFT) it becomes more efficient, for impulse responses of greater than a certain length, to compute the convolution by making use of the cyclic convolution property (CCP) of the DFT. That is

$$\mathrm{DFT}[x(n) * h(n)] = \mathrm{DFT}[x(n)] \cdot \mathrm{DFT}[h(n)] \qquad (2)$$

To obtain the result of the convolution y(n) the transforms of x(n) and h(n) are multiplied point by point and the inverse transform of the product taken. The cyclic convolution result is the same as the linear convolution if the transform is long enough to contain the output sequence. If x(n) is of indefinite length the 'overlap and save' technique may be used to join the finite length cyclic convolution results.

The N-point DFT is defined as

$$\mathrm{DFT}[x(n)] = X(k) = \sum_{n=0}^{N-1} x(n)\,e^{-j2\pi kn/N} \qquad (3)$$

and the inverse transform as

$$\mathrm{IDFT}[X(k)] = x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\,e^{j2\pi kn/N} \qquad (4)$$

However, calculating the DFT and IDFT has the disadvantage that all arithmetic is complex, and the functions $e^{\pm j\theta} = \cos\theta \pm j\sin\theta$ have to be calculated or stored and are irrational, leading to errors due to their truncation and the subsequent truncation of their products.

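2.2 Number theoretic transforms (NTTs)

Agarwal and Burrus [6] demonstrated that a transform pair with the DFT structure

$$X(k) = \sum_{n=0}^{N-1} x(n)\,\alpha^{nk} \qquad (5)$$

$$x(n) = N^{-1} \sum_{k=0}^{N-1} X(k)\,\alpha^{-nk} \qquad (6)$$

will have the CCP if

$$\alpha^N = 1 \qquad (7)$$

i.e. $\alpha$ is an Nth root of unity, and

$$N^{-1} \text{ exists} \qquad (8)$$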

In the complex field $\alpha$ can only be $e^{\pm j2\pi/N}$, giving the DFT. However, if all the arithmetic is carried out on a finite field or ring of integers (that is, using a residue number system) there are many possible values of $\alpha$ dependent on N and the modulus chosen, M.

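The transform pair is therefore

$$X(k) = \left(\left(\sum_{n=0}^{N-1} x(n)\,\alpha^{nk}\right)\right)_M \qquad (9)$$

$$x(n) = \left(\left(N^{-1} \sum_{k=0}^{N-1} X(k)\,\alpha^{-nk}\right)\right)_M \qquad (10)$$

$$Y(k) = ((X(k)H(k)))_M \qquad (11)$$

and

$$((\alpha^N))_M \equiv 1 \qquad (12)$$

where $\equiv$ denotes a congruence and $((\,\cdot\,))_M$ the residue reduction modulo M.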

This type of transform is a number theoretic transform (NTT). All computation is in integers, so no truncation errors occur, and as all results are residue reduced no overflows occur either. The computation of the convolution sum adds no further noise to the result as the input signal has been quantised and no scaling is required during the computation. This is in contrast to an FFT computation.
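The transform pair and the cyclic convolution property can be checked numerically. The following is a minimal Python sketch, not the hardware algorithm: it evaluates eqns. 9 to 11 directly in O(N²) operations, assuming M = F4 = 65537, α = 2 and N = 32 (the values used later in the paper). The function names are illustrative only, and the three-argument `pow` with a negative exponent assumes Python 3.8 or later.

```python
# Direct (slow) evaluation of the NTT pair over F4 with alpha = 2, N = 32.
M = 2**16 + 1      # Fermat number F4 = 65537
N = 32             # transform length
ALPHA = 2          # 2**16 = -1 and 2**32 = 1 (mod F4), so alpha has order 32


def ntt(x, alpha=ALPHA, m=M):
    """Forward transform: X(k) = sum_n x(n) * alpha**(n*k), reduced mod m."""
    n_pts = len(x)
    return [sum(x[n] * pow(alpha, n * k, m) for n in range(n_pts)) % m
            for k in range(n_pts)]


def intt(X, alpha=ALPHA, m=M):
    """Inverse transform: x(n) = N**-1 * sum_k X(k) * alpha**(-n*k), mod m."""
    n_pts = len(X)
    n_inv = pow(n_pts, -1, m)        # modular inverse of N
    alpha_inv = pow(alpha, -1, m)    # modular inverse of alpha
    return [n_inv * sum(X[k] * pow(alpha_inv, n * k, m) for k in range(n_pts)) % m
            for n in range(n_pts)]


def cyclic_convolve_ntt(x, h, m=M):
    """Cyclic convolution via point-by-point product in the transform domain."""
    Y = [(a * b) % m for a, b in zip(ntt(x), ntt(h))]
    return intt(Y)


def cyclic_convolve_direct(x, h, m=M):
    """Reference cyclic convolution computed in the time domain, mod m."""
    n_pts = len(x)
    return [sum(x[i] * h[(n - i) % n_pts] for i in range(n_pts)) % m
            for n in range(n_pts)]


if __name__ == "__main__":
    x = [(3 * i + 1) % 50 for i in range(N)]
    h = [1, 2, 3, 4] + [0] * (N - 4)
    print(cyclic_convolve_ntt(x, h) == cyclic_convolve_direct(x, h))
```

Because every intermediate value is an integer reduced modulo M, the two results agree exactly rather than to within a rounding tolerance.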

From the definition of the transform pair, it can be seen that the result y(n) is only congruent to the actual result, i.e.

$$y(n)_{actual} = kM + y(n)_{NTT} \qquad (13)$$

where k is a positive or negative unknown integer. If $y(n)_{NTT}$ is to be of any use, x(n) and h(n) must be constrained so that k is known to be zero. Assuming that positive and negative numbers are represented in the available range of 0 to M − 1, the maximum permissible value of $|y(n)_{actual}|$ is M/2.

Referring to eqn. 1, it can be seen that

$$|y(n)| \le |x(n)|_{\max} \sum_{m=0}^{L-1} |h(m)| \qquad (14)$$

and so

$$|x(n)|_{\max} \sum_{m=0}^{L-1} |h(m)| \le M/2 \qquad (15)$$

So, for a given length of impulse response L (which would determine the transform length N) the magnitudes of x(n) and h(n) have to be adjusted to satisfy eqn. 15.
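The dynamic range constraint can be illustrated with a short Python sketch. It assumes M = F4 and the signed interpretation described above, in which residues above M/2 stand for negative values. The helper names `to_signed` and `max_input_magnitude` are hypothetical and are not taken from the paper.

```python
M = 2**16 + 1  # F4, assumed modulus


def to_signed(residue, m=M):
    """Interpret a residue in 0..m-1 as a signed value in about -m/2..m/2."""
    return residue - m if residue > m // 2 else residue


def max_input_magnitude(h, m=M):
    """Largest integer |x(n)| for which |x|max * sum|h(m)| <= M/2 holds."""
    return (m // 2) // sum(abs(v) for v in h)


if __name__ == "__main__":
    h = [3, -2, 1]
    print(max_input_magnitude(h))   # bound on |x(n)| for this impulse response
    print(to_signed((-5) % M))      # a negative result recovered from its residue
```

If the inputs exceed the bound, the residue still comes out of the transform, but the unknown multiple k of M in eqn. 13 is no longer zero and the signed interpretation is wrong.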

2.3 Fermat number transforms (FNTs)

Of particular interest are NTTs where the modulus used is a Fermat number, i.e.

$$M = F_t = 2^{2^t} + 1 \qquad (16)$$

Agarwal and Burrus give more detail regarding the choice of $\alpha$ but the main values of interest are $\alpha = 2$ and $\alpha = \sqrt{2}$ (i.e. $((\alpha^2))_M = 2$, an integer). These give transform lengths of $N = 2 \times 2^t$ and $N = 4 \times 2^t$. As these are powers of 2, the lengths allow a very efficient radix-2 FFT-type algorithm to be used for the computation of the transform. Although $\alpha = 2$ gives a shorter transform length than $\alpha = \sqrt{2}$, its use is attractive in that multiplications by powers of 2 are just bit shifts, so a relatively simple shifter rather than a multiplier is required when computing $x(n)\alpha^{kn}$. Also, the problem of calculating or storing the complex $\alpha^{kn}$ values as required by the DFT is solved: only the exponent $((kn))_N$ has to be generated for the shifter.

The FNT therefore overcomes the previously stated problems involved in computing convolutions using the DFT.
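The claim that multiplication by a power of 2 reduces to a shift can be sketched in Python as follows. This works on ordinary binary residues rather than on the diminished-1 code used by the chip, and the function name `mul_pow2` is an assumption for illustration. It relies only on the identities $2^{16} \equiv -1$ and $2^{32} \equiv 1 \pmod{F_4}$.

```python
B = 16                 # 2**t with t = 4
M = (1 << B) + 1       # F4 = 65537


def mul_pow2(v, k, b=B):
    """Return v * 2**k modulo 2**b + 1 using shifts, masks and one subtraction.

    v must lie in 0..2**b. No general multiplier is needed.
    """
    m = (1 << b) + 1
    k %= 2 * b                      # 2**(2b) = 1 (mod m), so exponents wrap
    if k >= b:                      # 2**b = -1 (mod m): pull out a negation
        v = (m - v) % m
        k -= b
    wide = v << k                   # plain left shift
    low = wide & ((1 << b) - 1)     # bits below position b
    high = wide >> b                # bits at and above position b
    return (low - high) % m         # high part carries weight 2**b = -1


if __name__ == "__main__":
    print(mul_pow2(12345, 7) == (12345 * 2**7) % M)
```

The final `% m` only ever corrects by a single multiple of M, which corresponds to one conditional addition in hardware.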


3 Linear convolution with the FNT using the generalised overlap and save technique

To perform a linear convolution with an indefinite length input sequence using the CCP of a transform, the overlap and save technique may be used. This is described in detail elsewhere [7], but a brief description will follow.

The indefinite length input sequence is sectioned into blocks of length N (transform length) overlapping by L points. The maximum length of the impulse response is then L points and this is then padded out with zeros to length N. The cyclic convolution of each of the input blocks and the impulse response is then computed. The first L points of each block result are incorrect and are discarded; the desired result is obtained by taking the last N − L points of each of the output blocks and placing them consecutively.
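The sectioning described above can be sketched in Python. For clarity this sketch computes each cyclic convolution directly in ordinary integers instead of through a transform, so it illustrates only the block bookkeeping; the block length N = 32 and overlap L = 16 are the values used later in the paper, and all function names are illustrative assumptions.

```python
def cyclic_convolve(x, h):
    """Length-N cyclic convolution of two equal-length integer sequences."""
    n = len(x)
    return [sum(x[i] * h[(k - i) % n] for i in range(n)) for k in range(n)]


def linear_convolve(x, h):
    """Reference linear convolution (eqn. 1) for comparison."""
    y = [0] * (len(x) + len(h) - 1)
    for i, xv in enumerate(x):
        for j, hv in enumerate(h):
            y[i + j] += xv * hv
    return y


def overlap_save(x, h, n_block=32, l_max=16):
    """First len(x) samples of x*h using blocks of n_block overlapping by l_max."""
    if len(h) > l_max:
        raise ValueError("impulse response longer than l_max")
    h_pad = list(h) + [0] * (n_block - len(h))      # zero-pad h to length N
    step = n_block - l_max                          # new samples per block
    n_blocks = -(-len(x) // step)                   # ceiling division
    padded = [0] * l_max + list(x)                  # zero history before x(0)
    padded += [0] * (n_blocks * step + l_max - len(padded))
    y = []
    for b in range(n_blocks):
        block = padded[b * step: b * step + n_block]
        # The first l_max outputs are corrupted by wrap-around: discard them.
        y.extend(cyclic_convolve(block, h_pad)[l_max:])
    return y[:len(x)]


if __name__ == "__main__":
    x = [(7 * i) % 11 - 5 for i in range(100)]
    h = [1, -2, 3, 0, 5]
    print(overlap_save(x, h) == linear_convolve(x, h)[:len(x)])
```

With N = 32 and L = 16 each block contributes 16 valid output samples, which is why a 32-point transform must be completed every 16 input samples.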

When using the FNT, the choice of transform length, and therefore impulse response length L, is somewhat restricted. For the realistic moduli of $F_4$ or $F_5$ (17- and 33-bit data, $2^t + 1$) and using the more attractive value of $\alpha = 2$, N is 32 and 64 points, respectively, restricting L to within these lengths. The value $\alpha = \sqrt{2}$ could be used, giving transform lengths of 64 and 128 points, but, as implied in eqn. 15, the increase in impulse response lengths possible would restrict the magnitudes of x(n) and h(n) if the output is to be within the desired range of −M/2 to M/2. To allow arbitrary lengths of impulse response to be used, Truong et al. [1, 8] devised the 'generalised overlap and save' technique.

Here, if the desired impulse response length is 2L, for example, it is sectioned into two parts of length L and the convolution of each of these with the input signal is computed using the normal overlap and save technique. The two sets of results are added together, after passing through the appropriate delay (Fig. 1), to produce the result. This addition is not carried out in the transform domain and so need not be carried out modulo M ($F_t$). So, although the magnitude constraints still apply for each individual section of the impulse response, they do not apply to the overall result. This allows any length of impulse response to be used, no matter what transform length is possible, merely by using the appropriate number of inverse transforms, adders and delays.

Fig. 1 32-point impulse response filter using the generalised overlap and save technique (Truong [5])
Block labels: x(n), 32-point forward FNT, X(k), two 16-point H(k) blocks, two 32-point inverse FNTs, 16-point delay

This technique makes the use of FNTs feasible, especially with $F_4$. This is an attractive modulus to use because it requires a reasonable wordlength (17 bits). For some applications it may give an unacceptably short transform length (32 points for $\alpha = 2$) and hence impulse response length, but this problem can be overcome by the use of the generalised overlap and save technique.
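The sectioning of a long impulse response can be sketched as follows. This Python fragment is an illustration under stated assumptions: each 16-point convolver is replaced by a plain truncated linear convolution (`convolve_truncated`, a stand-in for one overlap and save channel), and the section length of 16 matches the $F_4$, $\alpha = 2$ case. The names are hypothetical.

```python
def convolve_truncated(x, h):
    """Stand-in for one overlap and save convolver: first len(x) samples of x*h."""
    y = [0] * len(x)
    for i, xv in enumerate(x):
        for j, hv in enumerate(h):
            if i + j < len(x):
                y[i + j] += xv * hv
    return y


def generalised_overlap_save(x, h, l_sec=16):
    """Split h into sections of l_sec points, convolve each, delay and add."""
    sections = [h[i:i + l_sec] for i in range(0, len(h), l_sec)]
    y = [0] * len(x)
    for s, h_sec in enumerate(sections):
        part = convolve_truncated(x, h_sec)
        delay = s * l_sec                 # section s is delayed by s*l_sec samples
        for n in range(delay, len(x)):
            # Ordinary integer addition: this sum is outside the transform
            # domain, so it is not reduced modulo M.
            y[n] += part[n - delay]
    return y


if __name__ == "__main__":
    x = [(5 * i) % 13 - 6 for i in range(80)]
    h = [(i % 5) - 2 for i in range(32)]      # 32-point impulse response
    print(generalised_overlap_save(x, h) == convolve_truncated(x, h))
```

Each section must still satisfy eqn. 15 on its own, but the accumulated output may exceed M/2 in magnitude because the final addition is performed in ordinary arithmetic.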

4 Implementation of FNTs (and FFTs)

The signal graphs of a 16-point forward and inverse transform are given in Fig. 2. There are many different schemes for implementing the transforms [7] exploiting different degrees of parallelism. These range from the single processor, bus and memory type schemes with no parallelism up to completely parallel schemes where there is a butterfly processor for each butterfly in the transform. Obviously, fewer clock cycles are required to compute the transform if more parallelism is exploited, but more hardware is required instead. The more parallel the scheme used the less complex is the problem of control, which can also be an important criterion. For example, the single processor approach would require a complex microprogrammed controller which would occupy a substantial area of silicon, and this is essentially wasted space because it contributes nothing to the actual computation. The area-time product suffers as a result. In contrast to this, a fully parallel implementation requires no control at all and so no area is wasted on controllers.

Fig. 2 Forward decimation in frequency (DIF) transform and corresponding inverse transform
a Forward transform
b Inverse transform

In selecting a particular scheme for implementation in VLSI other factors also have to be considered. The amount of interconnection required is significant in that it is expensive in terms of area and time. Also, if the scheme is not implemented on one chip, the way in which it is partitioned into smaller sections can affect the performance drastically, as well as increasing the cost by having to fabricate a number of different designs instead of just one.

Taking all of these factors into consideration the pipelined scheme adopted by Truong et al. [1, 8] appears to be the most attractive for a VLSI implementation. A detailed description of a pipelined transformer can be found elsewhere [7] but a block diagram of a 32-point transformer is given in Fig. 3.

5 Design of a pipelined Fermat number transformer in NMOS VLSI

The pipelined scheme has been selected as the most appropriate, and the preliminary calculations of the area required for a complete 32-point $F_4$ pipeline in NMOS agreed closely with Truong's [1] figure of $12 \times 10^6 \lambda^2$. This was too large for the $3\mu$ NMOS process available, and so the pipeline was partitioned, one section per chip.

Fig. 3 32-point pipelined transformer

This scheme exploits parallelism in that it has five butterfly (computational) units (Fig. 4) operating at the same time, each specific to one particular rank of the transform. The pipeline can easily be partitioned into sections, one section for each rank, which is a useful feature. Also, if the two inputs of the first rank are connected together rather than switched, the pipeline automatically transforms 32-point blocks of input data overlapping by 16 points. This makes the system ideal for implementing the overlap and save technique, especially as all the butterflies are then working 100% of the time, which implies a very efficient implementation.

Fig. 4 Decimation in frequency (DIF) butterfly

It should be noted that the pipelined scheme works satisfactorily only with the structure of transform shown in Fig. 2a, where the input is in natural order and the output is in bit-reversed order or vice versa. For convolution the input is in natural order and so there is no choice but to have the transformed data in bit-reversed order. The inverse transform must then be of the type shown, taking its input in bit-reversed order and producing a natural ordered output, otherwise reordering of the transform and time domain results would be necessary, which is wasteful of time and hardware.
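The natural-in, bit-reversed-out behaviour of the forward DIF structure can be demonstrated in software. The following Python sketch is an in-place radix-2 DIF transform over $F_4$ with $\alpha = 2$; it models the arithmetic of the five ranks but not the pipelining, delays or commutators of the chip. The function names are illustrative, and the default $\alpha = 2$ is valid only for 32-point inputs.

```python
M = 2**16 + 1   # F4


def fnt_dif(x, alpha=2, m=M):
    """Radix-2 DIF transform: natural-order input, bit-reversed-order output.

    alpha must be a root of unity of order len(x) modulo m (2 for 32 points).
    """
    a = list(x)
    n = len(a)
    half = n // 2
    w_rank = alpha                      # twiddle base for the current rank
    while half >= 1:
        for start in range(0, n, 2 * half):
            w = 1
            for j in range(half):
                u, v = a[start + j], a[start + j + half]
                a[start + j] = (u + v) % m              # sum output
                a[start + j + half] = (u - v) * w % m   # difference times twiddle
                w = w * w_rank % m
        w_rank = w_rank * w_rank % m    # next rank uses the squared base
        half //= 2
    return a


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    return int(format(i, "0{}b".format(bits))[::-1], 2)


if __name__ == "__main__":
    x = [(5 * i * i + 3) % 1000 for i in range(32)]
    direct = [sum(x[n] * pow(2, n * k, M) for n in range(32)) % M
              for k in range(32)]
    out = fnt_dif(x)
    print(all(out[bit_reverse(k, 5)] == direct[k] for k in range(32)))
```

Each inner-loop step is one butterfly: an addition, a negation followed by an addition, and a multiplication by a power of 2.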


From Fig. 3 it can be seen that all the sections differ. Not only are the lengths of the delays different for each rank, but so are the 'twiddle' factors and the commutator switchings. Rather than producing a different design for each section, it was decided that one programmable chip should be designed that could be programmed to take on the characteristics of any section of the pipeline under the control of an externally supplied control word.

As the aim of this project was to produce a fast convolver, the design must also be capable of performing the inverse transform. As mentioned previously, there is little choice of algorithm for the inverse transform, in that it has to accept a bit-reversed input sequence. It therefore has a different structure to the forward transform (Fig. 2) so that the delays and commutator switchings are all different again from the forward transform. Not only do the twiddle factors now have negative exponents, but they are in bit-reversed order too.

It was also desired that the chip could be used to perform correlation as easily as convolution. This requires the computation of H(−k), and this has been provided for without the need to reverse h(n) or H(k).

5.1 Functional description of the programmable $F_4$ FNT pipeline section

A block diagram of the pipeline section is given in Fig. 5. All data is coded according to Leibowitz's [2] scheme, which requires 17 bits to represent the range 0 to $2^{16}$. The section contains the basic elements of a decimation in frequency (DIF) butterfly (negater, two adders and a multiplier) along with the delay elements required for the pipeline scheme. There is also a counter that, along with a shifter, produces the necessary exponents for the multiplier. Pipeline/scan path registers subdivide the chip.

To reduce the pin count, the inputs and outputs are multiplexed under the control of the enable signal generated by the two-state, externally resettable finite state machine (FSM) toggle. As the internal data rate is thus half the external 2-phase clock and data rates, all clocked elements within the chip (the delays, pipeline registers and the counter) are clocked under the control of the enable signal. The toggle FSM also has an external scan 2 input that forces the enable line low, freezing the state of the entire chip. This is intended primarily for use during testing with the scan path registers.

Fig. 5 Block diagram of the $F_4$ FNT pipeline section

During normal operation, the registers act as pipeline registers clocking data in and out in parallel. They break the circuit up into sections of reduced propagation delay and so allow a substantial increase in clock speed [3]. The latency in terms of clock cycles (i.e. the number of clock cycles taken for input data to make their way to the output) is greatly increased, though this is not usually a problem when balanced against the increase in throughput obtained.

The registers are formed into four separate scan paths [4], as can be seen in Fig. 5. During testing, the scan 1 line can be held high causing the four sets of registers to shift data serially rather than in parallel. This allows internal data to be clocked out of the chip and externally supplied data to be clocked in. Individual logic blocks can therefore be tested by clocking data into the preceding scan path and then using the following scan path to clock out the resultant output for examination. The ability to isolate small sections of the chip in this way greatly eases the testing problem and is essential in a design of this size.

The two delay units, Del A and Del B, are controlled by the internal enable signal and externally supplied rank and Dir 1 (direction) signals. The latter two alter the length of the delays to suit the rank in the forward or inverse pipeline to which the chip has been assigned. The operation of these registers is detailed in Appendix 11.

The supply of exponents to the multiplier and the control of the output multiplexing required to perform the necessary commutation function are performed by the counter FSM and associated shifter. The multiplier requires a different sequence of four-bit exponents for every rank in the forward and inverse transform, and the multiplexer control signal has a similarly varied sequence. At first sight this would seem an ideal application for a programmable logic array (PLA) [3] based FSM, taking as its input the rank and Dir 1 signals that would determine its behaviour. However, such an approach leads to an extremely large, and hence slow, PLA.

Examination of the sequence of exponents in the different ranks of both the forward and inverse transforms shows that if the two base sequences of a normal 0 to 15 count and a bit-reversed count are shifted left according to the rank of the transform, the correct sequences are generated. This allows the use of a much simplified PLA based FSM that produces either of the four-bit counts, depending on the Dir 1 input, and a simple pass transistor shifter controlled by the 3-bit rank vector.
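For the forward transform this observation can be checked against the twiddle exponents of a textbook 32-point radix-2 DIF structure. The Python sketch below assumes that the 16 butterflies of each rank are processed block by block in natural order, which is an assumption about the pipeline ordering and not a detail stated here. It also lists the bit-reversed base count used for the inverse direction without attempting to verify the inverse sequences. All names are illustrative.

```python
def shifted_count_exponents(rank, bits=4):
    """Exponents produced by shifting a 0..15 count left by `rank` in 4 bits."""
    mask = (1 << bits) - 1
    return [(count << rank) & mask for count in range(1 << bits)]


def dif_twiddle_exponents(rank, n=32):
    """Twiddle exponents of rank `rank` of an n-point radix-2 DIF transform.

    Rank 0 is the first rank; butterflies are listed block by block.
    """
    half = (n // 2) >> rank          # butterflies per block in this rank
    blocks = 1 << rank               # number of blocks in this rank
    return [j << rank for _ in range(blocks) for j in range(half)]


def bit_reversed_count(bits=4):
    """The bit-reversed 0..15 count, the second base sequence mentioned above."""
    return [int(format(i, "0{}b".format(bits))[::-1], 2) for i in range(1 << bits)]


if __name__ == "__main__":
    for rank in range(5):
        print(rank, shifted_count_exponents(rank) == dif_twiddle_exponents(rank))
    print(bit_reversed_count())
```

Under that ordering assumption, a single four-bit counter and a shifter reproduce the exponent sequence of every forward rank, which is consistent with the reduction in PLA size described above.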

The output multiplexer would normally switch every clock cycle under the control of the enable signal, but using the most significant bit of the exponent to reverse the switching performs the commutation required for all ranks of the forward and inverse transform.

The counter FSM is externally resettable independently of the toggle FSM, to take account of the different lengths of delay possible between the clocking in of data to the chip and their arrival at the multiplier.

Finally, the multiplier interprets the exponent it receives as being positive or negative depending on the direction signal Dir 2. This is brought out of the chip separately from the other, Dir 1, signal so that the forward algorithm can be performed using negative exponents if required. This allows correlation to be performed, rather than convolution, without the need for reversing the order of the impulse response.

The more significant logic blocks, including the adder and negater, are described in detail in Appendix 11 along with brief coverage of Leibowitz's diminished-1 coding scheme.

5.2 Simulation and layout

The design was simulated and laid out using ESP* and PLAP [9], both written within the Department of Electrical and Electronic Engineering, the University of Newcastle upon Tyne, UK.

PLAP is a set of procedures embedded in Pascal which allows the layout of logic cells, in terms of primitive rectangle and layer statements, to be written as procedures. These procedures may then be called by procedures specifying higher level logic blocks, and so on, thus encouraging a hierarchical structure to the design. PLAP also provides procedures for routing, input/output pad placement and layout of PLA based FSMs automatically generated by STATIC [9] and PLAMIN from a high level state table description.

The use of Pascal allows all its constructs to be used when specifying the layout, and allows easy parameterisation of wordlength, for example. On running the program in which the layout has been specified, the low level GAELIC [9] description of the layout is generated. The resulting layout of the chip is given in Fig. 6.

Fig. 6 Layout of the $F_4$ FNT pipeline section

The simulator ESP is similarly a set of procedures embedded in Pascal, allowing the functional description of logic blocks in terms of Boolean statements to be written in procedure form. This enables a one-to-one correspondence to be maintained between the simulation and layout of the design, which has been achieved as far as is practicable.

Apart from exhaustively verifying the individual logic blocks, ESP was used to simulate an entire convolver to verify the correctness of the control sequences and the programmability of the design.

* SAYERS, I.L., and CHESTER, E.G.: 'ESP'. Internal Memo, Department of Electrical and Electronic Engineering, The University of Newcastle upon Tyne, UK, 1985.

The SPICE [10] circuit simulator was used to check the functioning of the more novel logic elements and the results used to estimate the speed and power consumption of the overall design.

6 Fast convolver using the F4 FNT pipeline section

Formation of a fast convolver with an impulse response length of 16 using the design is straightforward, as illustrated in Fig. 7, requiring five chips for the forward transform, a modulo $F_4$ multiplier and five chips for the inverse transform. It is assumed that the multiplication by $N^{-1}$ ($32^{-1} \bmod F_4$) required for the computation of the inverse transform (eqn. 6) is performed by the multiplier. Also, encoding and decoding of the data to and from Leibowitz's diminished-1 code is accomplished externally to the system.
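The scaling constant $32^{-1} \bmod F_4$ can be evaluated with a few lines of Python. This is a worked check only; the three-argument `pow` with a negative exponent assumes Python 3.8 or later, and the variable names are illustrative.

```python
M = 2**16 + 1   # F4 = 65537
N = 32          # transform length

n_inv = pow(N, -1, M)            # modular inverse of 32 modulo F4
print(n_inv)                     # 63489

# Because 2**32 = 1 (mod F4), the inverse of 2**5 is 2**27, i.e. -(2**11).
print(n_inv == pow(2, 27, M) == M - 2**11)
```

The constant is itself a power of 2 modulo $F_4$, so it can be folded into the stored H(k) values or into the multiplier operand without any loss of precision.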

The direction and rank signals are set up permanently for each chip. To start the system running, the reset sequence is started and the data fed to the input the appropriate number of clock cycles later. The two reset signals required by each chip occur at different clock cycles relative to each other for each section, and on later clock cycles at each section down the pipeline, because of the latency within the pipeline. This is easily accommodated either by generating the individual signals with a controlling micro or an EPROM based FSM, or by deriving them from one signal with the appropriate delays placed between the sections.

The sequencing of H(k) must be timed to coincide with the arrival of transformed data at the multiplier. H(k) may be stored in an EPROM for a fixed filter or be computed by another forward pipeline. For correlation, H(−k) can be computed using a forward pipeline with the Dir 2 input set so as to use negative rather than positive exponents.

To perform convolution using the overlap and save technique, the two inputs to the system must be connected together and the first half of each inverse transform result discarded. As the input and output of each chip is multiplexed, this can be achieved by holding each input value for two clock cycles and, at the output, taking only alternate results.

For impulse responses longer than 16 points, more devices can be used to form a generalised overlap and save scheme, as described previously.

7 Performance

The design has an area of $8.8 \times 10^6 \lambda^2$ ($\lambda = 3\mu$) and an estimated power consumption and clock rate of 1 W and 1 MHz, respectively. A 16-point impulse response overlap and save convolver constructed using the devices (Fig. 7) would therefore have a chip count of 11 (including a modulo $F_4$ multiplier), an estimated power consumption of 10 W and sample rate of 0.5 MHz.

Comparison with Truong's design [1] is difficult because estimates of its performance are not known. However, an identical convolver using Truong's design would have a chip count of three and, one might estimate, a lower power consumption and higher sample rate. The latter two benefits would be a result of the reduced number of times that data are driven 'off chip' in that system, and the reduced complexity of the individual sections of the pipeline as compared to the programmable sections in our design.


A comparison of the FNT convolver system can also be made with a commercial digital signal processor chip such as the TMS32010. Figures published by Texas Instruments [11] indicate that the device can perform a 32-point FFT in 0.254 ms. To implement the overlap and save technique a 32-point transform must be completed every 16 samples. This gives a maximum sample rate of 16/0.254 kHz = 63 kHz. To maintain this rate a separate multiplier would be required as well as another TMS32010 to perform the inverse transform, giving a chip count of three, disregarding support devices.

Fig. 7 Fast convolver using the $F_4$ FNT pipeline section

However, if the TMS32010 is programmed to perform convolution in the time domain, the time between samples is the product of the length of the impulse response and the multiply-accumulate time. In this case this gives 16 × 0.4 μs = 6.4 μs, giving a sample rate of 150 kHz. Note that only one device is needed to achieve this.

If cost/performance is expressed in terms of a chip count/bandwidth quotient, the time domain system and Truong's FNT system are approximately equivalent and the most economical, followed by the FNT system in this paper and lastly the FFT system implemented on the TMS32010.

It should be noted that the transform methods do not out-perform the time domain implementation because the lengths of the transforms (and hence impulse responses) are too short to take advantage of the computational efficiency of the transforms.

8 Conclusions

In this paper a technique has been demonstrated that overcomes the problem of the area required to implement a pipelined 32-point $F_4$ Fermat number transformer in NMOS VLSI. This allows construction of a fast convolver/correlator using the transform, avoiding both the inefficiency of using a 16- or 32-bit programmable system to perform 17-bit arithmetic and the heavy hardware cost of a transistor-transistor logic realisation.

As may be expected from the lower level of integration used, the cost/performance of a convolver system using this design is not as good as that achievable using the more highly integrated design by Truong [1]. The technique of partitioning the pipeline into sections may not therefore appear to be advantageous; however, it does allow the system to be implemented in the $\lambda = 3\mu$ NMOS process available, which would not have otherwise been possible. It also allows the inclusion of the scan paths, which are necessary for testing, and the increased flexibility of the programmable design makes it possible for any of the three types of transform necessary for convolution/correlation to be performed using the single design. This is clearly an advantage over a system in which three separate designs or external reordering of data are needed to perform these transforms.

The most important advantage of the partitioning technique demonstrated here relates to the efficiency of the FNT in performing fast convolution in comparison to the time domain implementation. The FNT has clear advantages over the FFT in this respect, but, like transform methods in general, becomes efficient only when long impulse responses are used. Clearly the 16-point impulse response possible with the 32-point $F_4$ transform here is too short for any advantage to manifest itself.

For this reason, FNTs will only find applications if 64- or 128-point transforms are implemented. This will necessitate the use of $F_5$ as the modulus on which the transform is based, and the use of $((\sqrt{2}))_{F_5}$ as $\alpha$. This would have several other effects. The use of the 33-bit data required to represent $F_5$ ($2^{32} + 1$) would overcome problems associated with the dynamic range of the residue results of the transform, but the resulting increase in size of the arithmetic units combined with the increase in the number of sections in the pipeline from 5 to 6 or 7 would make the fabrication of the complete pipeline on one chip impossible in the current state of VLSI technology, even with substantial decreases in the linewidths of the processes.

Therefore, if the advantages of the FNT for fast convolution are to be exploited via the implementation of 64- or 128-point $F_5$ pipelined transformers, it will be necessary to make use of the partitioning technique demonstrated on the 32-point $F_4$ system in this paper.

9 Acknowledgments

The authors would like to thank Prof. D.J. Kinniment, Dr. I.L. Sayers, Dr. E.G. Chester and Dr. S.S. Dlay (all of the University of Newcastle upon Tyne) for helpful discussions regarding VLSI design in NMOS, and Ali Shakaff for the useful discussions on the subject of FNTs.

We would also like to thank Miss P. Leeson for typing this paper and the UK Science and Engineering Research Council who are funding the project along with the Royal Thai Navy for their part in supporting the project.

10 References

1 TRUONG, T.K., YEH, C.-S., REED, I.S., and CHANG, J.J.: 'VLSI design of number theoretic transforms for a fast convolution'. Proceedings of IEEE International Conference on Computer Design: VLSI in Computing, ICCD '83, New York, NY, USA, 31st Oct.-3rd Nov. 1983, pp. 200-203

2 LEIBOWITZ, L.M.: 'A simplified binary arithmetic for the Fermat number transform', IEEE Trans., 1976, ASSP-24, (5), pp. 356-359

3 MEAD, C., and CONWAY, L.: 'Introduction to VLSI systems' (Addison Wesley, 1980)

4 WILLIAMS, T.W., and PARKER, K.P.: 'Design for testability - a survey', Proc. IEEE, 1983, 71, (1), pp. 98-112

5 COOLEY, J.W., and TUKEY, J.W.: 'An algorithm for the machine calculation of complex Fourier series', Math. Comput., 1965, 19, pp. 297-301

6 AGARWAL, R.C., and BURRUS, C.S.: 'Number theoretic transforms to implement fast convolution', Proc. IEEE, 1975, 63, (4), pp. 550-560

7 RABINER, L.R., and GOLD, B.: 'Theory and application of digital signal processing' (Prentice-Hall, 1975)

8 TRUONG, T.K., REED, I.S., YEH, C.-S., and SHAO, H.M.: 'A parallel VLSI architecture for a digital filter of arbitrary length using Fermat number transforms'. Proceedings of IEEE International Conference on Circuits and Computers, ICCC '82, New York, NY, USA, 28th Sept.-1st Oct. 1982, pp. 574-578

9 RUSSELL, G., KINNIMENT, D.J., CHESTER, E.G., and McLAUCHLAN, M.R.: 'CAD for VLSI' (Van Nostrand Reinhold (UK), 1985)

10 VLADIMIRESCU, A., NEWTON, A.R., and PEDERSON, D.O.: 'SPICE user's guide'. Department of Electrical Engineering and Computer Studies, University of California, Berkeley, CA, USA, 1980

11 Texas Instruments, Digital Signal Processor TMS32010 Product Description, 1983, p. 10

12 McCLELLAN, J.H.: 'Hardware realisation of a Fermat number transform', IEEE Trans., 1976, ASSP-24, (3), pp. 216-225

11 Appendixes

11.1 The arithmetic units

All arithmetic is carried out modulo F4 using Leibowitz's diminished-1 [2] number coding scheme. This requires 17 bits for representation of data over F4. In this scheme, the binary value of the word used is one less (mod F4) than the actual data represented. In this way the most significant bit (MSB) can be used as a zero detection bit, because zero is represented as 1 0000 ... 0000. This simplifies the arithmetic units considerably, because arithmetic operations are carried out on the 16 least significant bits (LSBs) only and the MSB is used to modify or inhibit the operations, depending on whether the data are zero or nonzero.
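As an illustration of the coding scheme just described, the following Python sketch encodes and decodes diminished-1 words. The function names are hypothetical and the code models only the number representation, not the circuit.

```python
F4 = (1 << 16) + 1   # 65537
ZERO_WORD = 1 << 16  # MSB set, 16 LSBs clear: the code for zero


def to_diminished1(x):
    """Encode an integer as a 17-bit diminished-1 word: one less, mod F4."""
    return (x - 1) % F4


def from_diminished1(word):
    """Decode a 17-bit diminished-1 word back to a residue in 0..F4-1."""
    return (word + 1) % F4


def is_zero(word):
    """Zero detection needs only the MSB (bit 16)."""
    return bool(word >> 16)


print(format(to_diminished1(0), "017b"))  # 10000000000000000
print(to_diminished1(1))                  # 0
print(to_diminished1(65536))              # 65535
```

Because every nonzero residue maps to a word with the MSB clear, the 16 LSBs alone carry the value and the MSB acts purely as a zero flag.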

11.1.1 Negater: Negation is straightforward. If the input a is nonzero, the output b is found by taking the one's complement of the 16 LSBs, i.e.

$$\sum_{i=0}^{15} b_i 2^i = \sum_{i=0}^{15} \bar{a}_i 2^i$$

The MSB remains the same. If the input a is zero, so is the output b.

The negater is easily implemented using the logic shown in Fig. 8.
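The negation rule can be sketched in a few lines of Python. This is a behavioural model under the diminished-1 coding described in Section 11.1, with a hypothetical function name; it is not a description of the logic in Fig. 8.

```python
F4 = (1 << 16) + 1
MASK16 = 0xFFFF


def dim1_negate(word):
    """Negate a 17-bit diminished-1 word modulo F4."""
    if word >> 16:           # MSB set: input is zero, so the output is zero
        return word
    return ~word & MASK16    # one's complement of the 16 LSBs, MSB stays 0


# Example: 3 is coded as 2; -3 mod F4 = 65534 is coded as 65533.
print(dim1_negate(2))  # 65533
```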

11.1.2 Adder: Addition is complicated by the fact that a residue reduction may be necessary. If the inputs are a and b and the sum is c, then, briefly:

(i) If a or b is zero, c = b or a, respectively
(ii) If neither a nor b is zero then: add the 16 LSBs of a and b to give c'. Complement the carry out of c' and add to the 16 LSBs of c'. This sum forms the result c, the carry out being the MSB of c (i.e. c16).
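Steps (i) and (ii) can be modelled directly in Python. The sketch below is behavioural only, with hypothetical names, and says nothing about the carry look ahead structure used on the chip.

```python
F4 = (1 << 16) + 1
MASK16 = 0xFFFF
ZERO_WORD = 1 << 16


def dim1_add(a, b):
    """Add two 17-bit diminished-1 words modulo F4."""
    if a & ZERO_WORD:        # (i) a is zero, so the sum is b
        return b
    if b & ZERO_WORD:        # (i) b is zero, so the sum is a
        return a
    c_prime = (a & MASK16) + (b & MASK16)   # (ii) first 16-bit addition
    carry = c_prime >> 16
    # Add the complemented carry; a carry out of this second addition
    # lands in bit 16 and so becomes the MSB (zero flag) of the result.
    return (c_prime & MASK16) + (1 - carry)


# Example: 65536 + 1 = 65537 = 0 (mod F4); codes 0xFFFF and 0 give the zero word.
print(hex(dim1_add(0xFFFF, 0)))  # 0x10000
```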

A number of schemes for implementing the addition are given by McClellan [12]. The adder used in this design is based on that shown in Fig. 9 and is given in Fig. 10.

In designing the adder, advantage was taken of the freedom offered by custom designing in VLSI and so simplified PLA based carry look ahead (CLA) units were used in the first stage, where only the carry out from the 4-bit CLA units is required. Also, only one set of generate (G) and propagate (P) signals are generated for both the first stage of CLA units and the full CLA units that form the second part of the adder.

This zero level CLA scheme seemed to be the best in terms of area and speed. A first level CLA approach, in which only four modified zero level and one first level CLA units can perform both the first and second stages of the addition, was used by McClellan. This was attempted in NMOS because it should be smaller and faster than the zero level scheme, but in fact the reverse was true. This is because of the large amount of 'global' routing to and from the first level CLA unit, which occupies a lot of space and is time consuming to drive because of its high capacitance.

Fig. 8 F4 negater

Fig. 9 Zero level carry look ahead (CLA) adder mod F4 (McClellan [12])

Fig. 10 Zero level CLA adder using modified CLA units

11.1.3 Multiplier: If $b = a \cdot 2^c \bmod F_4$ then b can be found using the following procedure:

(a) If a = 0 then b = 0

(b) If a ≠ 0 then if c is positive: ignoring the MSB, circularly shift the 16 LSBs c bits left, negating those that rotate past the 16th bit.

This is easily and elegantly achieved using a modified barrel shifter, as described by Truong [8], in which both the input bits and their inverses are available within the barrel shifter array. The MSB is fed straight from input to output and also used to set the 16 LSBs to zero if the input is zero.
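The shift rule for positive exponents can be checked with a bit-serial Python model. This is a sketch with a hypothetical function name that performs one shift per loop pass, whereas the barrel shifter produces the whole shift at once.

```python
F4 = (1 << 16) + 1
MASK16 = 0xFFFF
ZERO_WORD = 1 << 16


def dim1_mul_pow2(a, c):
    """Multiply a diminished-1 word by 2**c modulo F4, for 0 <= c <= 16."""
    if a & ZERO_WORD:        # (a) zero input gives zero output
        return a
    x = a & MASK16
    for _ in range(c):       # (b) rotate left, complementing the wrapped bit
        wrapped = x >> 15
        x = ((x << 1) & MASK16) | (wrapped ^ 1)
    return x


# Example: 3 * 2**4 = 48; 3 is coded as 2 and 48 as 47.
print(dim1_mul_pow2(2, 4))  # 47
```

A shift of 16 places complements every bit, which agrees with the negater because $2^{16} \equiv -1 \pmod{F_4}$.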

To multiply by negative powers of two, as required by the inverse transform, it is possible to negate all the inputs, effectively multiplying by $2^{16}$. Multiplying by $2^{-k}$ is equivalent to multiplication by $2^{32-k}$ and hence could be achieved by negating the inputs and left shifting by 16 − k. However, this greatly complicates the generation of the sequence of exponents supplied to the multiplier, and so a second multiplier is used in series with that used for the positive exponent (Fig. 11).

IEE PROCEEDINGS, Vol. 134, Pt. G, No. 2, APRIL 1987

Multiplication by a negative power of two can be achieved by the following procedure:

(b) If a ≠ 0 then if c is negative: ignoring the MSB, circularly shift the 16 LSBs |c| bits right, negating those that pass the LSB.
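The right-shift rule can be modelled in the same way as the left shift. In this Python sketch the function name and the parameter `k` (the magnitude of the negative exponent) are hypothetical, and the loop performs one shift per pass rather than modelling the barrel shifter.

```python
F4 = (1 << 16) + 1
MASK16 = 0xFFFF
ZERO_WORD = 1 << 16


def dim1_div_pow2(a, k):
    """Multiply a diminished-1 word by 2**(-k) modulo F4, for 0 <= k <= 16."""
    if a & ZERO_WORD:        # zero input gives zero output
        return a
    x = a & MASK16
    for _ in range(k):       # rotate right, complementing the wrapped bit
        wrapped = x & 1
        x = (x >> 1) | ((wrapped ^ 1) << 15)
    return x


# Example: 48 * 2**(-4) = 3; 48 is coded as 47 and 3 as 2.
print(dim1_div_pow2(47, 4))  # 2
```

Since $2^{32} \equiv 1 \pmod{F_4}$, the result can be cross-checked against multiplication by $2^{32-k}$.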

11.2 The registers

11.2.1 Delay units: The delay units have to be programmable to take on different lengths according to whether the chip is performing the forward or inverse transform and to which rank the chip is assigned. The lengths are given in Table 1.

This would easily be implemented using a set of 16 tapped registers, but the space required for the large amount of routing needed would dwarf the registers themselves. The speed of clocking would also be greatly affected by the time taken to drive the long interconnections. Buffering could alleviate the problem, but would increase the area further still. A more novel approach was therefore adopted.

Fig. 11 F4 $2^{\pm kn}$ multiplier (blocks: positive exponent modified barrel shifter, negative exponent modified barrel shifter, 1 of 16 decoder; inputs: exponent kn, direction DIR 2)

Table 1: Lengths of delays required in 32-point forward and inverse pipelined transformer

Rank     First delay (Del A)   Second delay (Del B)
for 0    16                    8
for 1    8                     4
for 2    4                     2
for 3    2                     1
for 4    1                     0
inv 0    0                     1
inv 1    1                     2
inv 2    2                     4
inv 3    4                     8
inv 4    8                     16

The basic two phase clocked register cell used is shown in Fig. 12. With the normal two phase clock applied, this cell delays data by one clock cycle. If both clocks are held high, data pass straight from input to output, delayed only by the propagation through the circuit.

Fig. 12 Basic 2-phase clocked dynamic register cell

The delays are built up from groups of 8, 4, 2, 1 and 1 cells in series and the 2-phase clock is gated to them under control of a PLA based decoder. This decoder accepts the rank and Dir 1 (forward/inverse) signals and holds the appropriate register groups' clock lines high to produce the required delay.

Note that the PHI1 clock is further gated by the enable signal, because the delays accept new data on alternate clock cycles only, owing to the input/output multiplexing.

Though this method may seem to involve long propagation delays (worst when delay is zero), simulation results indicate a worst case delay of 100 ns, which is adequate.
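The decoding task can be illustrated with a small Python model. This is a sketch only: the closed-form expressions for the Table 1 lengths and the greedy choice of which groups to clock are assumptions made for illustration, and the actual PLA contents are not given in the paper. All names are hypothetical.

```python
GROUPS = (8, 4, 2, 1, 1)  # register groups in series, as described in the text


def delay_lengths(rank, inverse):
    """Return (Del A, Del B) for ranks 0..4, matching the values in Table 1."""
    if inverse:
        return (1 << rank) >> 1, 1 << rank
    return 16 >> rank, 8 >> rank


def clocked_groups(length):
    """Choose which groups are clocked; the rest are held transparent."""
    active = []
    remaining = length
    for size in GROUPS:
        use = size <= remaining
        active.append(use)
        if use:
            remaining -= size
    if remaining:
        raise ValueError("length must be between 0 and 16")
    return tuple(active)


for rank in range(5):
    del_a, del_b = delay_lengths(rank, inverse=False)
    print(rank, del_a, clocked_groups(del_a), del_b, clocked_groups(del_b))
```

A zero-length delay leaves every group transparent, which is the case the text identifies as having the longest propagation path.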

11.2.2 Pipeline/scan path registers: These use the same dynamic register cells as the delays (Fig. 12) but with an extra serial input and output. Selection of serial/parallel operation is performed by the gating of the PHI1 clock either to the pass transistors of the serial inputs or parallel inputs (Fig. 13).

Fig. 13 Scan path register using dynamic register cells with a gated clock

A. Pajayakrit received his BSc degree from Royal Thai Naval Academy, Thailand in 1981. After receiving his Diploma in Applied Electronics from the University of Newcastle upon Tyne in 1983 he has been working for his PhD degree at the same University. His current research interests include VLSI architecture, implementation and applications of transform techniques in digital signal processing.

P.J. Towers gained a BSc in Electrical and Electronic Engineering at the University of Leeds in 1983. He joined the Department of Electrical and Electronic Engineering at the University of Newcastle upon Tyne in 1983 as a Research Associate, where he has been working on VLSI design for digital signal processing applications.

A.G.J. Holt has been with the University of Newcastle upon Tyne since 1957 and is at present a Professor of Electrical Engineering. He has been awarded the PhD and DSc degrees from Southampton University. His current interests are in digital signal processing.

