
Design and implementation of a FIR filter with MMAlpha

An initiatory journey through Distributed Arithmetic and VHDL synthesis

–

Jean-Baka Domelevo (ENS Cachan), welcomed by Patrice Quinton & Steven Derrien at IRISA, F-35042 Rennes

June/July 2004

Abstract

During this training period, we set out to design a Distributed Arithmetic FIR filter (FIR means “Finite Impulse Response”) using Alpha, a functional language for the description of regular programs, their transformation and evaluation, and the hardware implementation of the derived architecture. This required getting used to a series of tools we had never handled before: the Alpha and VHDL languages, the Mathematica environment, and some tools for RTL synthesis (RTL means Register-Transfer Level, the lowest level of description that still makes sense to a computer scientist) and simulation (ModelSim and Xilinx’s FPGA Editor).

Summary

During this internship, we attempted to design a FIR filter (for “Finite Impulse Response”) in distributed arithmetic using the Alpha language, a functional language for describing regular algorithms as well as for their transformation, evaluation (i.e. scheduling) and the implementation of a derived architecture. This work first required us to become familiar with tools that were new to us: the Alpha and VHDL languages, the Mathematica environment, and a few tools for RTL synthesis (an acronym for “Register-Transfer Level”, denoting the lowest level of description that still makes sense from the computer scientist’s point of view) and for simulation (namely ModelSim and FPGA Editor, from Xilinx).


Introduction

High performance embedded systems

Nowadays, part of what we call the computer science field is dedicated to the research and design of efficient architectures for high-speed computation. Indeed, our daily environment is full of devices which need both high integration and high performance: from mobile phones to computers and other domestic devices, the world is full of embedded systems. Often called SoCs (for “Systems on a Chip”), their development has been stimulated by growing industrial needs. Working on the borderline between electronics and computer science, researchers and engineers have to be fully aware of both the software and the hardware possibilities that will enable them, if cleverly combined, to bring out the best design at the lowest price. And, like all borderlines between two scientific fields, research on such architectures proves to be one of the most exciting areas of computer science to investigate. There is no doubt that it is bound to take on ever-growing importance in the coming decades.

FPGAs and other ICs

In the large world of integrated circuits (ICs), we can highlight a few classes:

• General-purpose microprocessors are undoubtedly the most common ICs. Examples include the x86 processors, of course, but also embedded processors such as the ARM or the IBM PowerPC. Several reasons account for general-purpose microprocessors being so commonly used in various devices, from microcomputers to personal assistants, through entertainment systems (think of the Xbox) or set-top boxes (like those by Akimbo or Fujitsu-Siemens). Microprocessors are cheap circuits, able to carry out many different tasks without any reconfiguration: the user is free to run a large spectrum of applications on the very same architecture, simply by launching the appropriate software. The price to pay for such flexibility is relative sluggishness: be they SIMD or VLIW (e.g. in some Digital Signal Processors), superscalar, CISC, RISC, equipped with Hyper-Threading technology or whatever else manufacturers may advertise, it is obvious that many applications would run much faster on a dedicated architecture (by “applications”, we mean any computation, from the product of two integers to the compilation of source code or the routing of packets over a network). Considering this, some companies have released processors which remain general-purpose but whose instruction set is particularly efficient for a specific kind of operation, such as a multiply-accumulate in a DSP or pattern matching on a field in a network processor. Examples of Digital Signal Processors are the Texas Instruments TMS320C62x series and the chips by Analog Devices or Lucent Technologies. As an example of a network processor, we can cite the Intel XP2000.

• ASICs (Application-Specific Integrated Circuits) are fully dedicated circuits: they are designed with a specific application in mind. In terms of performance, ASICs are the most efficient circuits: they are the core of most high-performance routers, which are able to route up to 40 Gbit/s per slot (as in today’s best routers of the Cisco 12000 Series). On the other hand, they are difficult to design and incur huge NRE (Non-Recurring Engineering) costs ($600,000 for a 0.13 µm wafer). Last but not least, they have a fairly long time-to-market (approximately 6 months). These drawbacks make them viable only for very large production volumes (at least a million units).

• FPGA is an acronym for “Field Programmable Gate Array”: an FPGA is a highly integrated circuit, made of thousands of transistors (recent ones can include up to $10^9$ of them). It can be represented as a matrix of logic cells that make it easy to implement any logic function of a given number of inputs. FPGAs are a good tradeoff between the high efficiency and high price of ASICs on the one hand, and the low price but relative slowness of general-purpose microprocessors on the other.

Hardware synthesis from high-level specifications

Obviously, to design circuits, one must be able to specify a behavior and an architecture in a structured and standardized way. Such a language exists: VHDL (an acronym for VHSIC Hardware Description Language, where VHSIC itself stands for Very High Speed Integrated Circuit) was first standardized in 1987. Today, after many evolutions, it is widespread in the community of circuit designers all around the world, and its sole competitor is the Verilog language. The reader can get a glimpse of the way VHDL works in Appendix A. The drawback of such a low-level language is that designing hardware with VHDL is difficult and error-prone. We therefore need automated tools to shorten development time and also to help verification.

Digital signal processing systems

Since the early development of electronics, fostered by computer science and information technologies, digital signal processing has been in constant development. This restless growth is easily accounted for: new industrial needs in the fields of digital audio, modems, mobile phones, image processing, etc. have made DSP the core of many areas of technology. Today, due to the extraordinary development of telecommunication infrastructures, it is mainly used by the mobile phone networking industry.

In the whole landscape of digital signal processing, filters stand in the very foreground. Whether implemented in software or in hardware, they are designed to apply a transformation to their input values in order to produce a real-time or an offline output, which is bound to undergo further processing throughout the rest of the processing chain.

Figure 1: A generic filter (input x(t), output y(t)).

The most common family of filters is that of FIR filters, which are used in many applications, from medical imaging to radar and wireless telecommunications. Because this filtering is a very compute-intensive operation, and since many applications require it to be done in real time, the search for an efficient implementation of FIR filters (either in software or in hardware) has received a lot of interest.

1 FIR filters

1.1 The FIR transformation

There are two types of digital filters: IIR (Infinite Impulse Response) and FIR (Finite Impulse Response) filters. In this work, we will focus on FIR filters, which are the most commonly used. Let x(t) be the input at time t; the filter output is then given by:


$$y(t) = \sum_{k=0}^{N-1} a_{N-1-k} \cdot x(t-k) \qquad (1)$$

where N stands for the “order” of the filter, and the $a_k$ are fixed coefficients.
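As a quick illustration of equation (1), here is a minimal Python sketch of its direct evaluation. The function name `fir_direct` and the convention that x(t) = 0 for t < 0 are assumptions made for this example only; they are not part of the MMAlpha design described later.

```python
def fir_direct(a, x):
    """Evaluate y(t) = sum_{k=0}^{N-1} a[N-1-k] * x(t-k) for every t.

    a: list of the N filter coefficients a_0 .. a_{N-1}
    x: list of input samples x(0), x(1), ...
    Assumption: samples before t = 0 are taken to be zero.
    """
    n_taps = len(a)
    y = []
    for t in range(len(x)):
        acc = 0
        for k in range(n_taps):
            if t - k >= 0:  # skip samples that lie before the start of the input
                acc += a[n_taps - 1 - k] * x[t - k]
        y.append(acc)
    return y


# Feeding a unit impulse returns the coefficients in reversed order,
# because equation (1) pairs x(t-k) with a_{N-1-k}.
impulse_response = fir_direct([3, 5, 7], [1, 0, 0, 0, 0])
```

The output has finitely many non-zero samples, which is exactly what “finite impulse response” means.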

1.2 MAC (Multiply-Accumulate) based implementation

In a DSP (Digital Signal Processor), many “multiply-accumulate” operations are performed. Basically, it is a matter of building a sum of products: in the slightly modified version of equation 1,

$$y = \sum_{k=0}^{N} a_k x_k = a_0 x_0 + a_1 x_1 + \cdots + a_N x_N \qquad (2)$$

one usually computes this expression gradually, by means of the partial sums $S_n$, where

$$S_n = \sum_{k=0}^{n} a_k x_k = a_0 x_0 + a_1 x_1 + \cdots + a_n x_n, \quad\text{and then}\quad S_{n+1} = S_n + a_{n+1} x_{n+1},$$

so that the final result is $y = S_N$. This is exactly what a MAC (multiply-accumulate) unit does; its schematic is shown in figure 2.

Figure 2: The multiply-accumulate structure.
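The recurrence on the partial sums can be sketched in a few lines of Python. This is only a software model of the behavior of a MAC unit, with one multiplication and one addition per step; the function name `mac` is an assumption made for the example.

```python
def mac(a, x):
    """Accumulate the products a_k * x_k one at a time.

    Returns the final sum together with the list of partial sums
    S_0, S_1, ..., each obtained as S_{n+1} = S_n + a_{n+1} * x_{n+1}.
    """
    s = 0
    partial_sums = []
    for a_k, x_k in zip(a, x):
        s = s + a_k * x_k  # one multiply-accumulate step
        partial_sums.append(s)
    return s, partial_sums


y, steps = mac([1, 2, 3], [4, 5, 6])
```

Each loop iteration corresponds to one use of the multiplier, which is the resource that distributed arithmetic tries to avoid.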

The main problem is that, on any architecture, multipliers are costly (they need a lot of logic gates to be built), and in many cases they do not exactly fit your needs, because their input data size is either too large or too small. It would be a real improvement if we were able to limit the number of multipliers used, or even to do our computation without using any of them. That is precisely the point of what we call “distributed arithmetic”, which is a means of taking advantage of some arithmetic properties appearing at bit level to design efficient circuits. Depending on the needs, this means finding a design that requires little area (i.e. uses few gates) or that computes an arithmetic expression in as few clock cycles as possible. These two objectives constantly pull against each other, and the right balance is seldom easy to strike.


1.3 Distributed arithmetic based implementation

To explain the ideas behind DA, a short example is worth presenting. Let us take the recurrent problem of computing a basic sum of products, as in (2). To illustrate it, we will discuss the problem with N = 4 (but in the formulae we will keep the formal parameter). The basic flow corresponding to this approach is given in figure 3.

Figure 3: The generic flow.

The first step toward a Distributed Arithmetic implementation of such a flow is to see what happens at bit level. For a start, let the inputs $(x_k)$, $k \in [0, N-1]$, be M-bit unsigned integers. Once we have noticed that multiplying two bits is nothing but an and operation, we can avoid using multipliers. To multiply $x_0$ by $a_0$, we simply use:

• a serializer, whose sequential output is the series of the M bits of its input, starting with the least significant one.

• an and gate whose output is either $a_0$ if its second input is 1, or 0 if its second input is 0.

• a scaling accumulator, performing the whole multiplication in M clock cycles.

Figure 4 describes this first approach.
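The three components listed above can be modelled in Python as follows. This is a sketch under stated assumptions: the names `serialize_lsb_first` and `bit_serial_multiply` are hypothetical, and the scaling is modelled by weighting the gated coefficient by $2^j$ in integer arithmetic, whereas a hardware scaling accumulator would typically shift its own content instead.

```python
def serialize_lsb_first(x, m):
    """Serializer: yield the m bits of the unsigned integer x, LSB first."""
    for j in range(m):
        yield (x >> j) & 1


def bit_serial_multiply(a0, x0, m):
    """Compute a0 * x0 without a multiplier, one bit of x0 per clock cycle."""
    acc = 0
    for j, bit in enumerate(serialize_lsb_first(x0, m)):
        gated = a0 if bit else 0  # the 'and' gate: a0 passes only when the bit is 1
        acc += gated << j         # scaling accumulator: the bit at position j weighs 2^j
    return acc


product = bit_serial_multiply(13, 11, 4)  # 11 fits in M = 4 bits
```

The loop runs exactly M times, which matches the M clock cycles announced for the scaling accumulator.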

Please notice that, for the first time, we are breaking the apparent symmetry between the data $x_i$ and the coefficients $a_i$. This is justified by the fact that, most of the time, the coefficients are constant while the data are not. We can now refine our implementation of the data flow given in figure 3, which leads to figure 5.

The next step is to observe that, in this design, we have used N scaling accumulators and N − 1 additional adders (for the adder tree summing the partial products). This results in a total of 2N − 1 adders, which are the costly circuits in this design. Can we reduce this figure? Yes, we can! In fact, we can use only one scaling accumulator by summing the outputs of the N and gates early, before scaling and accumulating. This results in the layout shown in figure 6.

Now, as previously said, in many applications the coefficients $(a_k)$, $k \in [0, N-1]$, have constant values. We can therefore notice that the subsystem composed of the series of and gates plus the following adder tree only implements a logical (combinational) function of its N inputs: in the schematic of figure 6, $\mathrm{data}(t) = f(x_{0,t}, \ldots, x_{N-1,t})$. It thus boils down to what we call a Look-Up Table (LUT), an addressable memory made of $2^N$ words of the appropriate size (wide enough to contain the sum $\sum_{k=0}^{N-1} a_k$).

Figure 4: Multiplication at low gate cost (computes $a_0 x_0$).

Figure 5: Data flow with no multiplier.

Figure 6: Early summation unveils an architecture with fewer gates.


Figure 7: A Look-Up Table efficiently stands for N and gates plus an adder tree.
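To make the content of such a table concrete, the following Python sketch precomputes it for a given list of constant coefficients. The name `build_lut` and the addressing convention (bit k of the address is the current bit of input $x_k$) are assumptions chosen for the example.

```python
def build_lut(a):
    """Precompute the output of the N 'and' gates plus adder tree.

    Entry number `addr` holds the sum of the coefficients a_k whose
    address bit k is set, so the table has 2^N entries for N coefficients.
    """
    n = len(a)
    return [
        sum(a[k] for k in range(n) if (addr >> k) & 1)
        for addr in range(2 ** n)
    ]


lut = build_lut([3, -5])  # four entries for N = 2
```

The table size doubles with every additional coefficient, which is why this approach is mostly attractive for small N or when the table is split.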

Before going further and explaining how this Distributed Arithmetic solution works in mathematical terms, we have to make an implementation choice and decide on the form of the data we are to handle. Indeed, the $x_k$ could be M-bit signed integers, unsigned integers, fixed-point numbers or floating-point numbers (the latter usually prove rather hard to implement correctly). Here we assume that our data are signed fixed-point numbers, with one sign bit (the most significant bit) followed by M − 1 bits forming the fractional part. As a matter of notation, we will henceforth assume that $x_{k,j}$ denotes the j-th bit of the M-bit data $x_k$. The interpretation of the binary form of $x_k$ is then quite simple:

$$x_k = -x_{k,0} + \sum_{j=1}^{M-1} x_{k,j} \cdot 2^{-j}$$
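As a small check of this interpretation, the Python sketch below decodes a list of bits into its exact rational value. The function name `fixed_point_value` and the use of `fractions.Fraction` (to avoid rounding) are choices made for this example.

```python
from fractions import Fraction


def fixed_point_value(bits):
    """Decode [x_0, x_1, ..., x_{M-1}] where x_0 is the sign bit
    (weight -1) and x_j has weight 2^-j for j >= 1."""
    value = Fraction(-bits[0])
    for j in range(1, len(bits)):
        value += Fraction(bits[j], 2 ** j)
    return value


smallest = fixed_point_value([1, 0, 0, 0])  # sign bit alone
largest = fixed_point_value([0, 1, 1, 1])   # all fractional bits set
```

With M bits, the representable values therefore lie in the interval from −1 (included) to 1 (excluded), in steps of $2^{-(M-1)}$.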

This implementation choice is not arbitrary: in digital signal processing applications, it is easy and also extremely handy to normalize the data, so this fixed-point arithmetic is used in many real-life cases. We can thus rewrite (2):

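$$y = \sum_{k=0}^{N} a_k x_k = \sum_{k=0}^{N} a_k \Bigl(-x_{k,0} + \sum_{j=1}^{M-1} x_{k,j} \cdot 2^{-j}\Bigr) = \sum_{k=0}^{N} \Bigl(-a_k \cdot x_{k,0} + \sum_{j=1}^{M-1} a_k x_{k,j} \cdot 2^{-j}\Bigr)$$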

And, finally:

$$y = -\sum_{k=0}^{N} a_k x_{k,0} + \sum_{j=1}^{M-1} 2^{-j} \sum_{k=0}^{N} a_k x_{k,j}$$

If we want a formula closer to the way we will actually compute y, we can write it in a Horner-like form:

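$$y = \Bigl(\cdots\Bigl(\Bigl(\sum_{k=0}^{N} a_k x_{k,M-1}\Bigr) \cdot 2^{-1} + \sum_{k=0}^{N} a_k x_{k,M-2}\Bigr) \cdot 2^{-1} + \cdots + \sum_{k=0}^{N} a_k x_{k,1}\Bigr) \cdot 2^{-1} - \sum_{k=0}^{N} a_k x_{k,0}$$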

If we consider one of the above sums, such as the “first” one (the first to be computed once the system is scheduled), $\sum_{k=0}^{N} a_k x_{k,M-1}$, and if we assume that the coefficients $(a_i)_{i \in [0, N-1]}$ have constant values, then we understand that this sum can be computed as a mere function of the N bits $x_{0,M-1}$ to $x_{N-1,M-1}$. That is why we can implement the computation of these sums through a look-up table that will be addressed, at time t, by the vector made of the N bits $x_{0,M-1-t}$ to $x_{N-1,M-1-t}$. This way, the computation of $y = \sum_{k=0}^{N-1} a_k x_k$ will be performed in M clock cycles, and the result will appear at the output of the adder, valid at time M − 1.
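The whole scheme can be sketched in Python for N coefficients $a_0, \ldots, a_{N-1}$: a look-up table addressed by one bit of every input, followed by a halve-and-accumulate step, with the sign bits handled last by a subtraction. The function name `da_sum_of_products`, the addressing convention and the use of exact fractions are assumptions made for this illustration; it is a behavioral model, not the architecture generated by MMAlpha.

```python
from fractions import Fraction


def da_sum_of_products(a, x_bits):
    """Distributed-arithmetic evaluation of sum_k a[k] * x_k.

    a:      list of N constant coefficients
    x_bits: x_bits[k][j] is bit j of x_k (j = 0 is the sign bit,
            bit j >= 1 has weight 2^-j), each x_k being M bits wide
    """
    n, m = len(a), len(x_bits[0])
    # Look-up table: entry `addr` is the sum of the a[k] whose address bit k is set.
    lut = [sum(a[k] for k in range(n) if (addr >> k) & 1) for addr in range(2 ** n)]
    acc = Fraction(0)
    for t in range(m):               # one iteration per clock cycle, M in total
        j = m - 1 - t                # least significant bits are processed first
        addr = sum(x_bits[k][j] << k for k in range(n))
        if j > 0:
            acc = (acc + lut[addr]) / 2   # add, then scale by 2^-1
        else:
            acc = acc - lut[addr]         # sign bits carry a negative weight
    return acc


coeffs = [3, -2, 5]
data = [[0, 1, 0, 1],   # x_0 =  5/8
        [1, 0, 1, 1],   # x_1 = -5/8
        [0, 0, 1, 0]]   # x_2 =  1/4
result = da_sum_of_products(coeffs, data)
```

No multiplication between a coefficient and a data word is ever performed: the only operations are a table read, an addition or subtraction, and a division by two (a one-bit shift in hardware).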

2 MMAlpha and the polyhedral model

2.1 What is MMAlpha?

Alpha is a functional data-parallel language for the description of algorithms. It was first proposed by C. Mauras in [Mau89]. It is intended to clearly expose the real dependencies in a given computation. For instance, in the expression T[i,j] = a[] * T[i,j+1], we say that T[i,j] depends on T[i,j+1], or that the dependency vector for T is (0,1). This is particularly well suited to operations on matrices or vectors, but not only to them. The reader wishing to have a quick look at the Alpha syntax is invited to refer to Appendix B. Alpha was originally designed to help people design parallel synchronous architectures, but it can be used to analyze various programs and to try to build a specialized architecture from any system of affine recurrence equations (SARE), as we want to show here.

MMAlpha is a program developed by Patrice Quinton, Tanguy Risset and other people from the Api, Cosi and R2D2 research teams. It is designed as a module for Mathematica, so that the user can pass commands to MMAlpha from a Mathematica session. This makes it quite interactive, which is often useful for debugging purposes. The goal of MMAlpha is to process its input (a program written in the Alpha language), to analyze it and work out the dependencies it contains in order to schedule it, and eventually to produce C code for simulation purposes and/or a VHDL description of the architecture it has derived from the schedule found. A flowchart is useful to understand how it works (figure 8).

2.2 The Polyhedral Model

When one tries to analyze a computation, it may be useful to think of it (whatever it may be, provided it can be modelled as a SARE) as an iteration space of dimension n. Since any iteration step is unambiguously designated by the values of the corresponding indices, this iteration space is a domain in $\mathbb{Z}^n$. Such a domain is called a polyhedron. In the following, we might also refer to it as “the computation graph”. In this graph, each node stands for an elementary computation step: for instance an addition, a product or a test. An edge going from node i to node j represents the fact that the operation performed in node j needs a result from the one carried out in node i. Intuitively, this means that node i has to be processed before node j.

Once the program analyzer (be it a human or a piece of software such as Alpha) has figured out how to unfold the if tests or the for loops to represent them as a nest of nodes, the next step is to find a suitable schedule, i.e. a schedule complying with the above dependences. This task is probably the hardest, given that we want the best schedule, meaning that the total execution time has to be as short as possible. Usually, for a complex program, it is extremely difficult to find the best schedule, so any automation of this search is welcome.

At this point, a few words about schedules are in order. First of all, finding a schedule means figuring out how to map our n-dimensional computation graph onto another, K-dimensional graph, where one of the dimensions represents time (which is usually one-dimensional) and the others stand for the K − 1 dimensions of the processor space used. Of course, it is up to us to set K before trying to find the schedule. Here the word processor should be understood in a fully generic way, standing for a type of “cell” in our circuit-to-be, each cell being designed to perform each of the different operations that we decide to map onto it.

Figure 8: The typical MMAlpha processing scheme.

An important point to mention here is that the schedules found by MMAlpha are “affine-by-variable”, which means that the execution date of a given variable at a given point of its domain is an affine function of its indices (i.e. the coordinates of that point) and of its parameters (often size parameters, as are M and N in our FIR).
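To illustrate what “complying with the dependences” means for an affine schedule, here is a small Python sketch. It takes a variable T defined by T[i,j] = a * T[i,j+1] on a rectangular domain, so that each point needs the value at the point shifted by the vector (0,1), and checks candidate schedules by brute force. The helper `schedule_is_valid` and the one-cycle latency per computation are assumptions made for the example; MMAlpha solves this problem symbolically rather than by enumeration.

```python
def schedule_is_valid(schedule, domain, dependence):
    """Check that every point is computed at least one time step
    after the point it depends on (when that point is in the domain).

    schedule:   function mapping the indices of a point to its execution date
    domain:     iterable of index tuples
    dependence: vector d such that point p needs the result at p + d
    """
    points = set(domain)
    for p in points:
        q = tuple(pi + di for pi, di in zip(p, dependence))
        if q in points and schedule(*p) < schedule(*q) + 1:
            return False
    return True


J = 4  # size parameter of the domain
domain = [(i, j) for i in range(3) for j in range(J)]

# t(i, j) = (J - 1) - j is affine in the indices and in the parameter J.
good = schedule_is_valid(lambda i, j: (J - 1) - j, domain, (0, 1))
# t(i, j) = j computes each point before the one it needs.
bad = schedule_is_valid(lambda i, j: j, domain, (0, 1))
```

Since the index i does not appear in the valid schedule, all points sharing the same j can be computed at the same date, which is where the parallelism comes from.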

3 My own work: through the design of an efficient FIR filter with MMAlpha

The objective was to carry out the entire design of a structured FIR filter, using a distributed arithmetic-based approach: first to produce correct Alpha source code, then to go through the whole MMAlpha toolchain, and eventually to implement the solution on an FPGA using Xilinx’s Place & Route software. In the next paragraphs, I will show MMAlpha at work through the Mathematica notebook I wrote and kept debugging to achieve my goals. Along the way, I will try to explain the series of transformations MMAlpha executes.

3.1 The source code: a structured implementation

The first idea which came to our mind was to design, at first, a structured code as close as possible to the simple Distributed Arithmetic LUT-based solution presented earlier. And so we did. In Alpha, it is rather easy to specify separate components called “systems”, giving the type of their inputs and outputs. This is done through the system statements you can see in the Alpha source code. The body of a system then has two parts: the first contains the declarations of the local variables used by this system, and the second is the actual core of the system, containing a set of equations defining the behavior of each variable or output as a function of the system inputs (and of the other variables, of course).

Once the different components we need have been clearly identified and written down as Alpha systems, we can use them as parts of a bigger system. This instantiation is done through the use statement, which specifies a mapping between the formal parameters (the inputs/outputs of the used system) and the actual variables (which are variables, inputs or outputs of the enclosing system). Please notice that this is quite similar to port map-style instantiation in the VHDL language.

In our design (see figure 9), we have built four subsystems:

• the role of the ShiftRightBuffer is to produce an output of M bits (the matching Alpha type is boolean) corresponding to those of its integer input, from the LSB (Least Significant Bit) to the MSB, i.e. from right to left. This subsystem is instantiated N times, where N stands for the order of the filter.

• the LUT is our look-up table, filled with arbitrary coefficients (mainly for testing purposes).

• the FinalAdder is a simple adder of width 2M-1 bits.

• and last, the RightShifter shifts its input right by one position, in order to lower its weight in the final sum, according to the position of the corresponding processed bits in the inputs (i.e. according to the value of j in the $x_{i,j}$ values).

3.2 Scheduling

The next step is scheduling. We also have to normalize the system, i.e. to make sure that all the local variables have the same dimension, so that appSched[] works properly once the schedule has been found. That operation is called “placing” by Alpha experts. Two techniques are currently implemented in MMAlpha for the computation of a schedule function. The default one is called the Farkas method; the reader willing to know more about it can refer to Paul Feautrier’s papers (mainly [Fea92a] & [Fea92b]).

Concerning the schedule, the user can impose conditions on it, stating for instance that the coefficient of t in the arrival date of data D[i, j] has to be 1, or any other affine constraint (even unsolvable ones, which will produce an error message). This makes it very flexible and, with a bit of manual work, Alpha finds an appropriate schedule which matches reality.¹

¹ Alpha draws a clear separation line between what it considers as indices and what it considers as data. Parameters can be compared to indices but not to data. Hence, you will notice that we sometimes had no solution but to “compute” the values of the parameters, i.e. to instantiate a mirror of them in the world of data. Such a computation is an artefact, and as such it must not be allotted any non-zero time in the schedule. The addconstraints[] command helps us fix that.

Figure 9: Our implementation of the FIR.


3.3 Toward an architecture. . .

After the appSched[] command, MMAlpha has separated the space and time dimensions: there is one dimension for time, and one or more dimensions for space. The mapping is its own choice, and that is precisely one of the tricky tasks. The indices have now changed: D[t,p] is one of the D data (previously it was one of the D[i,j]), and with that notation Alpha indicates that it will be computed at time t on “processor” p (think in terms of cells or nodes). Here the space appears to be one-dimensional, but that is not always the case.

Then, one of the optional operations is pipelining. It consists in transmitting a signal (i.e. a data item) from cell to cell rather than broadcasting it from a single point of the circuit. Although not fundamental, it helps the designer: if you keep in mind that the final goal is to produce a hardware architecture, you will soon see that it is really uncomfortable to deal with signals broadcast from one end of the circuit to the other. The pipelined solution, if one can be found, looks much better.
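The difference between broadcasting and pipelining can be pictured with a short Python simulation of a chain of registers, one per cell. This is a hypothetical toy model (the function `pipeline_arrival_times` is not an MMAlpha command): a value injected at time t reaches cell p at time t + p, having only ever travelled between neighbouring cells.

```python
def pipeline_arrival_times(values, n_cells):
    """Shift `values` through a chain of n_cells registers.

    Returns a list of (time, cell, value) triples recording which value
    each cell holds at each clock cycle. Assumption: None marks an empty
    register, so the values themselves must not be None.
    """
    regs = [None] * n_cells
    trace = []
    # Extra empty inputs let the last value travel to the end of the chain.
    stream = list(values) + [None] * (n_cells - 1)
    for t, v in enumerate(stream):
        regs = [v] + regs[:-1]  # every cell hands its value to its right neighbour
        for p, held in enumerate(regs):
            if held is not None:
                trace.append((t, p, held))
    return trace


trace = pipeline_arrival_times(["a", "b"], 3)
```

In a broadcast version, every cell would receive each value at the same time, at the price of one long wire driving all the cells.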

3.4 Generating the VHDL

After some cleanup has been done (removing redundant equations, simplifying others and putting all the equations in a normal form), we are ready to proceed with the last two steps:

1. the alpha0ToAlphard[] command translates the alpha0 format into another form, much closer to a hardware description. You can think of it as a transitional form before getting to proper VHDL.

2. the a2v[] command translates the AlpHard code into a structured set of VHDL files, implementing the initial algorithm through a basic Controller/Modules structure (the classic separation between control units and execution units), as described in figure 10.

Figure 10: The decomposition scheme of the generated .vhd files.

3.5 Mapping on FPGAs: placing and routing with Xilinx ISE

At this point, we should have generated fully working VHDL code. Alas, MMAlpha is still experimental software, and thus some extra work has to be done before we can use the VHDL for RTL synthesis. We chose to work with Synplify Pro®, which is a powerful tool by Synopsys®.² Such tools enable the user to synthesize his or her VHDL, that is, to get ready to implement the circuit on the FPGAs we talked about in the introduction. To understand how it works, one should know that the structure of an FPGA is relatively regular: it is made of blocks named CLBs (Configurable Logic Blocks), along with switch matrices to route the signals between the different CLBs, I/O logic blocks to handle the input and output signals connecting the FPGA to the rest of the platform, and additionally some banks of RAM. The basic schematic of a standard FPGA is given in figure 11.

² In the world of IC design, software and trademarks are not usually governed by the GPL. Business is business.

Figure 11: The architecture of an FPGA (labels: I/O pad, elementary logic block, interconnect matrix).

Figure 12: The inner organization of signals in an FPGA (labels: registers and LUTs inside a logic block, interconnect matrix).

Figure 13: One of the CLBs found in the Virtex series FPGA (excerpt from the Xilinx Virtex-E 1.8 V Field Programmable Gate Arrays data sheet, DS022-2 (v2.2), July 23, 2001, Module 2 of 4: “Figure 4: 2-Slice Virtex-E CLB” and “Figure 5: Detailed View of Virtex-E Slice”).

The latest FPGAs from Xilinx (the Virtex-II Pro series) even include one or two PowerPC processors at their very core. Each CLB is usually made of two “slices” plus some additional carry and register logic. Basically, a slice can directly implement a 4-input LUT, which proves interesting when software can spot the parts of the VHDL description that could be implemented through a look-up table. This is precisely what synthesis tools do. Then we run a “place & route” suite, a proprietary piece of software that lets the user define precisely the pinout of the implementation on a given FPGA. We used Xilinx’s ISE suite, and for our FIR we were able to fit the whole architecture on a small FPGA, the Virtex-II XC2V80 by Xilinx®.

The final results of this implementation are given in Appendix C. We only used 21 percent of the available slices (i.e., roughly, of the available system gates), and we reached a clock frequency of about 180 MHz, which is good enough: it means that we are able to perform the FIR computation at an input data rate of 180 Mbps. The (beautiful) resulting layout is given in figure 14.

Conclusion and future prospects

Personally, I found this 7-week training period an interesting introduction to the low-level design of processing architectures. It turned out to be somewhat tricky because MMAlpha is still an experimental add-on to Mathematica: bugs and lesser misbehaviors were to be found around every corner. In this respect, the presence of leading MMAlpha developers such as Patrice Quinton and Tanguy Risset was beneficial to our project. On the basis of the problems and bugs we discovered through our Alpha venture, we contributed to the updating and development of the MMAlpha documentation, which nevertheless remains partial.

Had I had more time, I would have been glad to pursue my research in a few directions:

• one of the most important points for the further development of MMAlpha is the cleaning and redesign of the AlpHard to VHDL translator. In particular, the part of the translator dealing with the basic VHDL data type (the so-called std_logic, i.e. the bit) could be somewhat improved.

• concerning the core of our implementation, it would be possible to try other types of Distributed Arithmetic solutions, like the one processing two bits at a time for each of the $x_i$ inputs, or the one dividing the LUT into multiple smaller LUTs whose outputs are added afterwards (the corresponding article remains to be added to the bibliography). It would have been highly interesting to compare our results with theirs, for instance.

• a time-continuous solution, with reentrant shifting buffers translating into hardware the fact that $x_i(t) = x_{i+1}(t+1)$. We actually did write such a solution, but MMAlpha troubles kept us from reaching the corresponding VHDL description.

• last but not least, there are genuine problems when one wants to go through the MMAlpha process while keeping a generic version of one’s system. By “generic”, we mean that the formal parameters are not assigned until the last step, i.e. the generation of a set of VHDL files (for which, obviously, such parameters have to be fixed). Problems arise because Alpha generates, in its Alpha0ToAlpHard step, a system of cells where each cell is bound to process a certain computation only on a certain domain. It sometimes happens that, after fixing the formal parameters, some regions are empty, so the resulting system is meaningless. Alpha has difficulty dropping such cells, and forces the user to be less generic.

I greatly appreciated the time I spent in the R2D2 team, in a cheerful yet stimulating research environment. I sincerely thank Steven Derrien and Patrice Quinton for offering me this training period and remaining by my side, as attentive to my comments as their schedule allowed. The other members of the team, such as Madeleine Nyamsi, Gilles Georges and Ludovic Lhours, were also helpful and friendly colleagues.

Figure 14: The resultant layout on the FPGA.


References

[Alpha] Api, then Cosi, then R2D2 & Compsys. Getting started with Alpha, June 2004.

[Ash99] Peter J. Ashenden. The Designer’s Guide to VHDL, Morgan Kaufmann Publishers,1999.

[Fea92a] Paul Feautrier. Some efficient solutions to the affine scheduling problem, Part I: One-

dimensional Time,International Journal of Parallel Programming, 21(5):313-348, October1992.

[Fea92b] Paul Feautrier. Some efficient solutions to the affine scheduling problem, Part II: Multidimensional Time, International Journal of Parallel Programming, 21(6):389-420, December 1992.

[Math] Stephen Wolfram. The Mathematica Book, Wolfram Media/Cambridge University Press, 1999.

[Mau89] C. Mauras. Alpha : un langage équationnel pour la conception et la programmation d'architectures parallèles synchrones, PhD thesis, Université de Rennes 1 / IFSIC, December 1989.

[Par99] Keshab K. Parhi. VLSI Digital Signal Processing Systems - Design and Implementation, Wiley & Sons, 1999. (chapter 13: Bit-Level Arithmetic Architectures)

[Pel97] David Pellerin & Douglas Taylor. VHDL Made Easy!, Prentice Hall, 1997.

[Qui89] Patrice Quinton & Yves Robert. Algorithmes et architectures systoliques, Masson, 1989.


A Vhdl at a glance

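Here is a small piece of VHDL code describing the behavior of a 4-input multiplexer.

library ieee;
use ieee.std_logic_1164.all;  -- provides std_logic and std_logic_vector

entity mux4to1 is
  port (
    A   : in  std_logic_vector(3 downto 0);  -- the four data inputs
    Sel : in  std_logic_vector(1 downto 0);  -- selects which input drives R
    R   : out std_logic
  );
end mux4to1;

architecture STRUCT of mux4to1 is
begin
  -- combinational process: re-evaluated whenever Sel or A changes
  process(Sel, A)
  begin
    case Sel is
      when "00"   => R <= A(0);
      when "01"   => R <= A(1);
      when "10"   => R <= A(2);
      when "11"   => R <= A(3);
      -- Sel contains a metavalue ('U', 'X', ...): propagate an unknown
      -- instead of silently holding the previous value of R
      when others => R <= 'X';
    end case;
  end process;
end STRUCT;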


B The Alpha language: a short example

Imagine we have to solve a linear system AX = B for X, and assume that A is an N × N lower triangular matrix with a non-zero diagonal. With indices ranging from 1 to N, the solution is:

$$X_1 = \frac{B_1}{A_{1,1}} \quad\text{and}\quad \forall\, i > 1,\quad X_i = \frac{B_i - \sum_{j=1}^{i-1} A_{i,j}\, X_j}{A_{i,i}}$$

It is now quite simple to build an Alpha system that takes A and B as inputs and returns the solution X:

system solver : {N | N >= 1}
                -- the integer N is a parameter for this system
       (A : {i,j | 1 <= i <= N ; 1 <= j <= i} of real;
                -- A is lower left triangular
        B : {i | 1 <= i <= N} of real)
returns
       (X : {i | 1 <= i <= N} of real);
let
  X[i] =
    case
      {| i = 1} : B / A[i,i];
      {| i > 1} : (B - reduce(+, (i,j -> i), {| j+1 <= i} : A * X[j])) / A[i,i];
      -- the "reduce" structure is a kind of lambda-expression.
    esac;
tel;
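For readers more familiar with imperative code, the same recurrence can be sketched in Python as a plain forward substitution. This is only an illustrative counterpart to the Alpha system, not something MMAlpha produces; the function name `solve_lower_triangular` and the 0-based list-of-lists layout are assumptions made for the example.

```python
def solve_lower_triangular(A, B):
    """Solve A X = B for a lower triangular A with a non-zero diagonal.

    A is a list of rows (only entries with j <= i are read) and B a list;
    indices are 0-based here, whereas the Alpha system uses 1..N.
    """
    n = len(B)
    X = [0.0] * n
    for i in range(n):
        # Counterpart of reduce(+, (i,j -> i), {| j+1 <= i} : A * X[j])
        acc = sum(A[i][j] * X[j] for j in range(i))
        X[i] = (B[i] - acc) / A[i][i]
    return X


A = [[2, 0, 0],
     [1, 3, 0],
     [4, 5, 6]]
B = [2, 7, 32]
X = solve_lower_triangular(A, B)
```

The loop fixes an evaluation order by hand, whereas the Alpha description only states the equations and leaves the ordering to the scheduler.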

After scheduling, we will get the following output from Alpha (figure 15).


In[5]:= ashow[];

system solver :{N | 1<=N}
               (A : {i,j | j<=i<=N; 1<=j} of real;
                B : {i | 1<=i<=N} of real)
       returns (X : {i | 1<=i<=N} of real);
var
  ser1 : {i,j | (2,j+1)<=i<=N; 0<=j} of real;
let
  ser1[i,j] =
      case
        { | 2<=i<=N; j=0} : 0[];
        { | j+1<=i<=N; 1<=j} : ser1[i,j-1] + ({ | j+1<=i} : A * X[j]);
      esac;
  X[i] =
      case
        { | i=1} : B / A[i,i];
        { | 2<=i} : (B - ({ | 2<=i<=N} : ser1[i,i-1])) / A[i,i];
      esac;
tel;

In[6]:= schedule[];

Do not forget to change the initialization of $scheduleLibrary
Checking options...
Dependence analysis...
Building LP...
LP: 78 variables, 88 Constraints
Writing file for PIP....
Solving the LP...
Shift coef: 0
Total execution Time: 2 N
T_A{i, j, N} = 0
T_B{i, N} = 0
T_X{i, N} = 2 i
T_ser1{i, j, N} = i + j

Figure 15: MMAlpha returns from his quest for a schedule.
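The schedule reported above can be checked by hand: every value must be computed strictly after the values it depends on. The following Python sketch performs this check for a given N, under the assumption that the inputs A and B are available at time 0 as the output states. The function names `schedule_violations` and `makespan` are hypothetical and not part of MMAlpha.

```python
def t_x(i):
    """Schedule reported for X[i]."""
    return 2 * i


def t_ser1(i, j):
    """Schedule reported for ser1[i,j]."""
    return i + j


def schedule_violations(n):
    """Return the dependences of the scheduled system that are not
    strictly respected by the schedule (empty list if it is valid)."""
    bad = []
    for i in range(2, n + 1):
        for j in range(1, i):
            # ser1[i,j] = ser1[i,j-1] + A * X[j]
            if t_ser1(i, j) <= t_ser1(i, j - 1):
                bad.append(("ser1", i, j, "ser1", i, j - 1))
            if t_ser1(i, j) <= t_x(j):
                bad.append(("ser1", i, j, "X", j))
        # X[i] = (B - ser1[i,i-1]) / A[i,i]
        if t_x(i) <= t_ser1(i, i - 1):
            bad.append(("X", i, "ser1", i, i - 1))
    return bad


def makespan(n):
    """Latest instant at which a result X[i] is produced."""
    return max(t_x(i) for i in range(1, n + 1))
```

For instance, ser1[i,j] is computed at time i + j and reads X[j], computed at time 2j; since j + 1 <= i on that domain, i + j > 2j always holds. The last result, X[N], is produced at time 2N, which matches the total execution time given by the scheduler.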


C Final results from the placement of our FIR on an XC2V-80

/udd/domelevo/Alpha/test_fpga/FIR1NM.par

Release 5.2i - Par F.28
Copyright (c) 1995-2002 Xilinx, Inc. All rights reserved.

cidre:: Thu Jul 22 15:07:01 2004

/soft/xilinx_hd/ISE5.2i/bin/sol/par -w -ol 2 -t 1 FIR1NM_map.ncd FIR1NM.ncd FIR1NM.pcf

Constraints file: FIR1NM.pcf

Loading device database for application par from file "FIR1NM_map.ncd".
   "FIR1NM" is an NCD, version 2.37, device xc2v80, package fg256, speed -6
Loading device for application par from file '2v80.nph' in environment /soft/xilinx_hd/ISE5.2i.
The STEPPING level for this design is 1.
Device speed data version: PRODUCTION 1.114 2002-12-13.

Device utilization summary:

   Number of External IOBs            115 out of 120    95%
      Number of LOCed External IOBs     0 out of 115     0%

   Number of SLICEs                   112 out of 512    21%

   Number of BUFGMUXs                   1 out of 16      6%

Overall effort level (-ol):   2 (set by user)
Placer effort level (-pl):    2 (set by user)
Placer cost table entry (-t): 1
Router effort level (-rl):    2 (set by user)

Starting initial Timing Analysis.  REAL time: 5 secs
Finished initial Timing Analysis.  REAL time: 6 secs

Phase 1.1
Phase 1.1 (Checksum:989a85) REAL time: 6 secs

Phase 3.23
.....
Phase 3.23 (Checksum:1c9c37d) REAL time: 19 secs

Phase 4.3
Phase 4.3 (Checksum:26259fc) REAL time: 19 secs

Phase 6.5
Phase 6.5 (Checksum:39386fa) REAL time: 19 secs

Phase 7.8
....
Phase 7.8 (Checksum:9bd1c8) REAL time: 20 secs

Phase 8.5
Phase 8.5 (Checksum:4c4b3f8) REAL time: 20 secs

Phase 9.18
Phase 9.18 (Checksum:55d4a77) REAL time: 22 secs

Phase 10.24
Phase 10.24 (Checksum:5f5e0f6) REAL time: 22 secs

Writing design to file FIR1NM.ncd.

Total REAL time to placer completion: 22 secs
Total CPU time to placer completion: 21 secs

Starting Router REAL time: 23 secs

Phase 1: 1074 unrouted; REAL time: 25 secs

Phase 2: 996 unrouted; REAL time: 25 secs

Phase 3: 227 unrouted; (0) REAL time: 26 secs




Finished Router REAL time: 27 secs

Total REAL time to router completion: 28 secs
Total CPU time to router completion: 27 secs

Generating "par" statistics.

**************************
Generating Clock Report
**************************

+----------------------------+----------+--------+------------+-------------+
|         Clock Net          | Resource | Fanout |Max Skew(ns)|Max Delay(ns)|
+----------------------------+----------+--------+------------+-------------+
|           clk_c            |  Global  |   77   |   0.014    |    0.597    |
+----------------------------+----------+--------+------------+-------------+

The Delay Summary Report

The Score for this design is: 117

The Number of signals not completely routed for this design is: 0

The Average Connection Delay for this design is:        0.781 ns
The Maximum Pin Delay is:                                4.158 ns
The Average Connection Delay on the 10 Worst Nets is:   1.992 ns

Listing Pin Delays by value: (ns)

 d < 1.00   < d < 2.00  < d < 3.00  < d < 4.00  < d < 5.00  d >= 5.00
 ---------   ---------   ---------   ---------   ---------   ---------
       752         280          32           9           1           0

Timing Score: 0

Asterisk (*) preceding a constraint indicates it was not met.

--------------------------------------------------------------------------------
  Constraint                                | Requested  | Actual     | Logic
                                            |            |            | Levels
--------------------------------------------------------------------------------
  TS_clk = PERIOD TIMEGRP "clk"  7 nS   HIG | 7.000ns    | 5.929ns    | 3
  H 50.000000 %                             |            |            |
--------------------------------------------------------------------------------
  OFFSET = OUT 1000 nS  AFTER COMP "clk"    | 1000.000ns | 11.767ns   | 7
--------------------------------------------------------------------------------
  OFFSET = IN 1000 nS  BEFORE COMP "clk"    | 1000.000ns | 3.894ns    | 17
--------------------------------------------------------------------------------

All constraints were met.

All signals are completely routed.

Total REAL time to par completion: 29 secs
Total CPU time to par completion: 28 secs

Placement: Completed - No errors found.
Routing: Completed - No errors found.
Timing: Completed - No errors found.

Writing design to file FIR1NM.ncd.

PAR done.
