VHDL Implementation of CORDIC Algorithm


Faculty of Computing and Information Technology
Department of Robotics and Digital Technology

Technical Report 94-9

A VHDL Implementation of a CORDIC Arithmetic Processor Chip

Grant Hampson, Student Member, IEEE
Andrew Papliński, Member, IEEE

October 10, 1994

Enquiries:
Technical Report Coordinator
Robotics and Digital Technology
Monash University
Clayton VIC [email protected] +61 3 905 3402


Contents

Abstract and Keywords
Preface
1 The CORDIC Algorithm
2 CORDIC Hardware Implementations
    2.1 CORDIC Processor Architecture
        2.1.1 A Word-Serial CORDIC Architecture
        2.1.2 A Word-Parallel CORDIC Architecture
3 Improving CORDIC Accuracy
    3.1 Estimation of CORDIC Accuracy
    3.2 The Lower Bound of CORDIC Accuracy
    3.3 Reducing the z update error
    3.4 Unexpected Truncation Errors
4 VHDL Implementation
    4.1 The Basic CORDIC Unit
    4.2 VHDL Describes Structure and Behaviour
        4.2.1 Hierarchical vs Flat Designs
        4.2.2 The Viewlogic Synthesiser
    4.3 VHDL Design of the CORDIC Unit
        4.3.1 The Rounding Unit
    4.4 Combining the CORDIC Units
        4.4.1 A Solution
    4.5 Improvements
Conclusion
A CORDIC Functions
B Upper Bound of CORDIC Error
References


List of Tables

1.1 Elementary angles of $\alpha_i$
1.2 Various values of $K_n$
4.1 Some CORDIC hardware statistics.
A.1 The six CORDIC modes.


List of Figures

1.1 Rotation of a point in 2-D space.
2.1 Generic Processor Architecture.
2.2 An Optimised Word-Serial CORDIC Architecture.
2.3 Word-Parallel CORDIC architecture with possible data pipelining.
3.1 Numerical accuracy of the CORDIC processor.
3.2 Predicted and Actual accuracy of a CORDIC processor with a 12 bit internal datapath.
3.3 A plot showing bits of error for a typical test vector rotated through all possible angles.
3.4 A 12 bit, 8 stage CORDIC processor produces 9 bit accurate results.
3.5 An 8 bit, 8 stage CORDIC processor produces 7 bit accurate results.
3.6 Simulation results from a CORDIC processor illustrating the effects of the normalisation scheme.
3.7 An 8 bit, 8 stage CORDIC processor (a) without rounding, (b) with rounding.
4.1 The basic CORDIC unit.
4.2 A Hierarchical Design of the Adder/Subtracter for n = 4.
4.3 A Flat Design of the Adder/Subtracter for n = 4.
4.4 A Behavioural Design of the Adder/Subtracter for n = 4.
4.5 The structure of CORDIC unit showing the various entities.
4.6 The top level schematic of a 4 stage CORDIC processor with Increased Convergence Range and Rounding components.


Abstract

This report describes the fundamentals of the CORDIC (Co-ordinate Rotations Digital Computer) algorithm and a possible implementation using the VHDL hardware description language. The errors associated with a fixed point implementation of CORDIC are analysed, and methods for reducing these errors are discussed. One such method is a normalisation scheme which reduces error and requires no extra hardware. Various CORDIC structures and possible VHDL implementations are described in detail, including design and language issues. Finally a parallel hardware implementation is described and simulated.

CORDIC has many applications, some of which can be used for array imaging techniques.

Keywords

CORDIC, VHDL


Preface

CORDIC is an acronym for Coordinate Rotations Digital Computer and was derived by Volder [1] in the late 1950's for the purpose of calculating trigonometric functions. Its popularity came about nearly twenty years later when VLSI solutions became a reality.

The original algorithm describes the rotation of a 2-D vector, which can be applied in areas such as Digital Signal Processing [2] (Fourier Transforms, Digital Filters), Computer Graphics [3] and Robotics [4].

CORDIC processing offers high computational rates, making it attractive to applications such as computer graphics where a combination of scaling and rotations is required in real time. CORDIC is also attractive to Robotics since the fundamental operation is coordinate transformation; it could also be used for more computationally intensive processes such as motion planning and collision detection.

Array Imaging typically involves complex signal processing which may require many computationally intensive matrix operations. Increasing the complexity of the imaging model places greater demands on accuracy. Solutions to such complex systems require better, and hence more complex, algorithms. Most of these algorithms are based on matrix factorisation (decomposition) techniques, of which Singular Value Decomposition (SVD) is the most robust method. The SVD factorisation requires a two-sided transformation which involves several trigonometric operations and rotations ideally suited to dedicated VLSI hardware (CORDIC processing) for real time calculations. CORDIC has also been applied to phase correction in dynamic range focusing when Digital Baseband Demodulation [5] techniques are employed in Interpolation Beamforming [6]. A complex signal is represented by its in-phase, I, and quadrature, Q, components, and is phase corrected by rotating the complex signal.

Haviland and Tuszynski designed and built a CORDIC processor [7] in 1980 which used an iterative process to calculate circular, linear and hyperbolic functions. A more recent implementation (1993) by Duprat and Muller [8] discusses the possibility of using a redundant number system for the representation of a signed digit.

This report is broken into four logical sections, namely, CORDIC Theory, Hardware Implementations, Improving CORDIC Accuracy and finally a VHDL Implementation.


Chapter 1
The CORDIC Algorithm

Consider a 2-D vector $(x, y)$ represented by a point $v = x + jy$ in the complex plane. If the vector is rotated by an angle $\theta$, the new co-ordinate vector is given by:

$$\tilde{v} = v \, e^{j\theta} \qquad (1.1)$$

and shown in Figure (1.1).

[Diagram: the point $v = x + jy$ rotated through the angle $\theta$ to $\tilde{v} = \tilde{x} + j\tilde{y}$ in the $x$, $y$ plane.]

Figure 1.1: Rotation of a point in 2-D space.

The angle $\theta$ can be expanded into a set of elementary angles $\alpha_i$ with pseudo-digits $q_i \in \{-1, +1\}$, and angle expansion error $z_n$, such that

$$\theta = \sum_{i=-1}^{n-1} q_i \, \alpha_i + z_n \qquad (1.2)$$

and the sub-rotation angles $\alpha_i$ take on the following values:

$$\alpha_i = \begin{cases} \pi/2 & \text{for } i = -1 \\ \arctan(2^{-i}) & \text{for } i = 0, 1, \ldots, n-1 \end{cases} \qquad (1.3)$$

Note that $\alpha_i$ is approximately equal to but less than $2^{-i}$ and the resulting angular expansion error is therefore $|z_n| \lesssim 2^{-(n-1)}$.


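Substitution of Equation (1.2) into Equation (1.1) gives:

$$\tilde{v} = v \cdot \prod_{i=-1}^{n-1} e^{j q_i \alpha_i} \cdot e^{j z_n} = v \cdot (j q_{-1}) \cdot \prod_{i=0}^{n-1} e^{j q_i \alpha_i} \cdot e^{j z_n} \qquad (1.4)$$

and expanding $e^{j q_i \alpha_i}$,

$$e^{j q_i \alpha_i} = \cos q_i\alpha_i + j \sin q_i\alpha_i = \cos q_i\alpha_i \, (1 + j \tan q_i\alpha_i) = \cos\alpha_i \, (1 + j \, q_i \, 2^{-i})$$

Finally

$$\tilde{v} = v \cdot \left( \prod_{i=0}^{n-1} \cos\alpha_i \right) \cdot (j q_{-1}) \cdot \left( \prod_{i=0}^{n-1} \left(1 + j \, q_i \, 2^{-i}\right) \right) \cdot e^{j z_n} \qquad (1.5)$$

The range of rotation angles which can be represented by Equation (1.2) is $\pm\theta_{\max}$, where

$$\theta_{\max} = \sum_{i=-1}^{n-1} \alpha_i \approx 190^\circ \qquad (1.6)$$

and some values of $\alpha_i$ are given in Table (1.1).

If the expected range of rotation angles is $\pm 90^\circ$ then the initial rotation by $90^\circ$, that is, $e^{j q_{-1} \pi/2} = j q_{-1}$, does not have to be performed and the initial rotation is by $\pm 45^\circ$.

The second term in Equation (1.5), the product of cosines, is a constant scaling factor. For a given value of $n$ it can be pre-evaluated using Equation (1.7); values for $n$ up to 15 are listed in Table (1.2).

$$K_n = \prod_{i=0}^{n-1} \cos\alpha_i = \prod_{i=0}^{n-1} \left(1 + 2^{-2i}\right)^{-\frac{1}{2}} = \prod_{i=0}^{n-1} \frac{1}{\sqrt{1 + \frac{1}{4^i}}} \qquad (1.7)$$

The basic CORDIC algorithm, which describes rotation of a unity length vector $v = x + jy$ by an angle $\theta$, can be derived from Equation (1.5) using the following initial conditions, where $z_i$ is the accumulated angular residue:

$$v_{-1} = v \cdot K_n, \qquad z_{-1} = \theta$$

And, proceeding with $i = -1, 0, \ldots, n-1$:

$$q_i = \begin{cases} -1 & \text{if } z_i < 0 \\ +1 & \text{if } z_i \ge 0 \end{cases} \qquad (1.8)$$

$$v_{i+1} = \begin{cases} v_i \cdot j q_i & \text{if } i = -1 \\ v_i \, (1 + j \, q_i \cdot 2^{-i}) & \text{if } i \ge 0 \end{cases} \qquad (1.9)$$

$$z_{i+1} = z_i - q_i \alpha_i \qquad (1.10)$$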


 i   Angle            Angle (degrees)   16-bit binary
 0   arctan(2^0)      45.0000°          B400 = 101101.0000000000
 1   arctan(2^-1)     26.5651°          6A43 = 011010.1001000011
 2   arctan(2^-2)     14.0362°          3825 = 001110.0000100101
 3   arctan(2^-3)      7.1250°          1C80 = 000111.0010000000
 4   arctan(2^-4)      3.5763°          0E40 = 000011.1001000000
 5   arctan(2^-5)      1.7899°          0729 = 000001.1100101001
 6   arctan(2^-6)      0.8952°          0395 = 000000.1110010101
 7   arctan(2^-7)      0.4476°          01CA = 000000.0111001010
 8   arctan(2^-8)      0.2238°          00E5 = 000000.0011100101
 9   arctan(2^-9)      0.1119°          0073 = 000000.0001110011
10   arctan(2^-10)     0.0560°          0039 = 000000.0000111001
11   arctan(2^-11)     0.0280°          001D = 000000.0000011101
12   arctan(2^-12)     0.0140°          000E = 000000.0000001110
13   arctan(2^-13)     0.0070°          0007 = 000000.0000000111
14   arctan(2^-14)     0.0035°          0004 = 000000.0000000100
15   arctan(2^-15)     0.0017°          0002 = 000000.0000000010
16   arctan(2^-16)     0.0008°          0001 = 000000.0000000001

Table 1.1: Elementary angles of $\alpha_i$

 n   K_n
 0   0.70710678118655
 1   0.63245553203368
 2   0.61357199107790
 3   0.60883391251775
 4   0.60764825625617
 5   0.60735177014130
 6   0.60727764409353
 7   0.60725911229889
 8   0.60725447933256
 9   0.60725332108988
10   0.60725303152913
11   0.60725295913894
12   0.60725294104140
13   0.60725293651701
14   0.60725293538591
15   0.60725293510314

Table 1.2: Various values of $K_n$
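The entries of Tables 1.1 and 1.2 can be regenerated numerically. The Python sketch below is illustrative only: the function names are hypothetical, the hexadecimal word assumes 6 integer and 10 fractional bits with rounding to nearest (inferred from the binary column), and Table 1.2 is assumed to list the product of $\cos\alpha_i$ taken over $i = 0$ to $n$ inclusive, which is what its first rows suggest.

```python
import math

def elementary_angle_word(i, frac_bits=10):
    """Return arctan(2**-i) in degrees and as a 16-bit hexadecimal word.

    Assumption: the word holds degrees with `frac_bits` fractional bits,
    rounded to the nearest representable value.
    """
    degrees = math.degrees(math.atan(2.0 ** -i))
    word = round(degrees * (1 << frac_bits))
    return degrees, format(word, "04X")

def scale_factor(last):
    """Product of cos(arctan(2**-i)) for i = 0 .. last (inclusive)."""
    k = 1.0
    for i in range(last + 1):
        k /= math.sqrt(1.0 + 4.0 ** -i)
    return k

if __name__ == "__main__":
    for i in range(3):
        degrees, word = elementary_angle_word(i)
        print(i, f"{degrees:.4f}", word, f"{scale_factor(i):.14f}")
```

Under these assumptions the first rows of both tables are reproduced; other word formats or rounding rules would give different hexadecimal values.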


The final rotated vector is $v_n$, with angle expansion error $z_n$:

$$v_n = \tilde{v} \cdot e^{-j z_n} = v \cdot e^{j\theta} \cdot e^{-j z_n} \qquad (1.11)$$

$$z_n = \theta - \sum_{i=-1}^{n-1} q_i \alpha_i \qquad (1.12)$$

One complex operation on $v_i$ is equivalent to two operations on real numbers. For $i = -1$:

$$x_0 + j y_0 = j q_{-1} (x_{-1} + j y_{-1})$$

Hence

$$x_0 = -q_{-1} \, y_{-1} \qquad (1.13)$$

$$y_0 = q_{-1} \, x_{-1} \qquad (1.14)$$

For $i = 0, 1, \ldots, n-1$:

$$x_{i+1} + j y_{i+1} = (x_i + j y_i)(1 + j \, q_i \cdot 2^{-i})$$

Hence

$$x_{i+1} = x_i - q_i \cdot y_i \cdot 2^{-i} \qquad (1.15)$$

$$y_{i+1} = y_i + q_i \cdot x_i \cdot 2^{-i} \qquad (1.16)$$

The CORDIC algorithm reduces to an iterative set of operations consisting of a binary shift and an accumulator for each of $x$, $y$ and $z$.

Refer to Appendix A for a list of transcendental functions.
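The recurrences for $x$, $y$ and $z$ can be tried out directly. The following is a minimal floating point sketch in Python, not the hardware described later in the report: the function name `cordic_rotate` and the default of 16 iterations are assumptions made for illustration, and the multiplications by $2^{-i}$ stand in for the binary shifts.

```python
import math

def cordic_rotate(x, y, theta, n=16):
    """Rotate (x, y) by theta radians using n shift-and-add iterations.

    Valid for |theta| up to roughly 190 degrees. Returns the rotated
    vector and the remaining angle residue z_n.
    """
    # Pre-scale by K_n so that the output has the same length as the input.
    k = 1.0
    for i in range(n):
        k /= math.sqrt(1.0 + 4.0 ** -i)
    x, y, z = x * k, y * k, theta

    # i = -1: rotation by +/- 90 degrees.
    q = 1 if z >= 0 else -1
    x, y, z = -q * y, q * x, z - q * math.pi / 2

    # i = 0 .. n-1: micro-rotations by +/- arctan(2**-i).
    for i in range(n):
        q = 1 if z >= 0 else -1
        x, y = x - q * y * 2.0 ** -i, y + q * x * 2.0 ** -i
        z -= q * math.atan(2.0 ** -i)
    return x, y, z

if __name__ == "__main__":
    print(cordic_rotate(1.0, 0.0, math.radians(30.0)))
```

The tuple assignment inside the loop matters: both updates must use the values of $x$ and $y$ from the start of the iteration.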


Chapter 2
CORDIC Hardware Implementations

A hardware implementation of a CORDIC processor depends on the number of functions required and the computational speed. If all functions are to be computed, then there will be a necessary overhead for selecting each function. However, a small, fast design will result if only a small number of functions is required. This chapter presents possible solutions to a mixture of design problems.

2.1 CORDIC Processor Architecture

A CORDIC algorithm can take on two primary architectures, namely, word-serial or word-parallel. A word-serial processor minimises hardware requirements by utilising a single CORDIC unit repeatedly. However, iterative algorithms which are controlled by a small number of variables can be expanded over a two-dimensional area, i.e., instead of executing a certain set of instructions n times using a single element (e.g., a CORDIC unit), n duplicated elementary cells are used in successive steps of an iteration [9]. This flattened structure can now perform many operations in parallel and is called a word-parallel CORDIC processor.

A word-parallel architecture has the advantage of being up to n times faster, but due to the expansion requires, at worst, n times more hardware. However, the word-serial architecture requires complex controlling hardware and a variable shifter, decreasing the hardware saving ratio.

2.1.1 A Word-Serial CORDIC Architecture

The CORDIC algorithm has the advantage of not requiring any special hardware other than an accumulator and a variable shifter, which are generally available in most micro-controllers.

A multi-function word-serial CORDIC processor architecture could be realised using a basic micro structure consisting of a two-port register file and a variable shifter combined with an ALU, interconnected by several data paths as shown in Figure (2.1).

[Block diagram: variable shifter producing $2^{-i} \cdot y_i$ or $2^{-i} \cdot x_i$, ROM of $K_n$ values, ROM of $\alpha_i$ values, register file, controlling micro-code, CC register and ALU; input data buses $x_i$, $y_i$, $z_i$; result bus $x_{i+1}$, $y_{i+1}$, $z_{i+1}$.]

Figure 2.1: Generic Processor Architecture.

A generic controller could consist of microcode instructions for the ALU and register file, and would execute an iterative algorithm. This structure is similar to that of a microprocessor or DSP and allows many variations of the CORDIC algorithm, as the order of operations and the expanded instruction set increase flexibility. This type of structure illustrates that it would be possible to implement the CORDIC algorithm on any microprocessor or DSP.

Optimising the generic processor structure for a word-serial CORDIC processor is achieved by reducing the functionality to only those operations required by the CORDIC algorithm. A possible word-serial architecture is shown in Figure (2.2), where the ALU now contains three adders and dedicated registers. The microcode controller has been replaced by faster Combinational Control Logic dedicated to the CORDIC operation sequence.

2.1.2 A Word-Parallel CORDIC Architecture

The word-parallel method expands a one-dimensional algorithm into a two-dimensional problem and results in shorter computational times. Greater speeds of computation can be obtained by pipelining between stages so that many partial results can be calculated in parallel. A pipelined word-parallel architecture is shown in Figure (2.3), where each iteration is represented by a separate CORDIC block and a latch is placed after each iteration, or after several iterations.

The following chapters will develop, implement, and simulate such a parallel CORDIC structure using the VHDL hardware description language.
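The difference between the two architectures can be illustrated in software. The Python sketch below is a behavioural model under stated assumptions, not a description of the circuits in this report: the names `cordic_cell`, `word_serial` and `word_parallel` are hypothetical, values are integers with an assumed 12 fractional bits, and a latch is assumed after every cell.

```python
import math

FRAC_BITS = 12  # assumed fixed point format: value = integer / 2**FRAC_BITS
ALPHAS = [round(math.atan(2.0 ** -i) * (1 << FRAC_BITS)) for i in range(16)]

def cordic_cell(x, y, z, shift, alpha):
    """One CORDIC unit: three add/subtract operations and a right shift."""
    q = 1 if z >= 0 else -1
    return x - q * (y >> shift), y + q * (x >> shift), z - q * alpha

def word_serial(x, y, z, n=8):
    """A single cell reused n times; the shift amount changes each pass."""
    for i in range(n):
        x, y, z = cordic_cell(x, y, z, i, ALPHAS[i])
    return x, y, z

def word_parallel(samples, n=8):
    """n cells with fixed shifts and a latch after each cell.

    Every clock tick moves each latched value one cell forward, so a new
    input can enter the pipeline on every tick.
    """
    latches = [None] * n                        # latches[i]: output of cell i
    results = []
    for sample in list(samples) + [None] * n:   # extra ticks flush the pipe
        if latches[-1] is not None:
            results.append(latches[-1])
        # Update from the last cell backwards so values advance one stage.
        for i in range(n - 1, 0, -1):
            prev = latches[i - 1]
            latches[i] = None if prev is None else cordic_cell(*prev, i, ALPHAS[i])
        latches[0] = None if sample is None else cordic_cell(*sample, 0, ALPHAS[0])
    return results

if __name__ == "__main__":
    inputs = [(2000, 0, 1500), (0, 2000, -900)]
    print(word_parallel(inputs))
    print([word_serial(*v) for v in inputs])
```

Both functions apply the same sequence of cells, so they return identical results; the pipelined version simply accepts a new input before the previous one has finished.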


[Block diagram: three n-bit registers holding $x_i$, $y_i$ and $z_i$, each feeding an adder; shifted terms $q_i y_i 2^{-i}$ and $-q_i x_i 2^{-i}$; a lookup table of $\alpha_i$ values; combinational control logic with a counter (m-bit register) and a next-state q-bit register; signals Clock, Reset, Load, Select, Precision, Increment and Zero; a Finished flag; initial inputs $x_0$, $y_0$, $z_0$.]

Figure 2.2: An Optimised Word-Serial CORDIC Architecture.


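[Diagram: a chain of cells #0, ..., #i, ..., #n. Cell i takes $x_i$, $y_i$, $z_i$ and $\alpha_i$, forms $q_i = \mathrm{sign}[z_i]$ and the terms $q_i \cdot y_i \cdot 2^{-i}$ and $-q_i \cdot x_i \cdot 2^{-i}$, and produces $x_{i+1}$, $y_{i+1}$, $z_{i+1}$. Clocked latches for pipelining of data sit between cells; the final outputs are $x_n$, $y_n$, $z_n$.]

Figure 2.3: Word-Parallel CORDIC architecture with possible data pipelining.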


Chapter 3
Improving CORDIC Accuracy

As expected, iterative algorithms calculate results by approximation and the solution will contain errors. CORDIC is no exception: errors are introduced by a combination of quantisation and approximation. The accuracy of a CORDIC processor depends on the word length used for the three input variables $x$, $y$ and $z$, as well as the number of iterations or steps performed. This chapter describes the errors associated with a fixed point implementation and a means of reducing these errors.

3.1 Estimation of CORDIC Accuracy

The fundamental operation performed by a CORDIC processor is the shift-and-add process, into which fixed point arithmetic will introduce errors. For example, consider the binary scaling of the vector $v_i = (x_i, y_i)$ at the $i$th stage:

if $i \le m$ then $v_{i+1}$ is updated with the truncated value $v_i \cdot 2^{-i}$
if $i > m$ then $v_{i+1} = v_i$, and the update will be 0

where $m$ is the internal bus width of $v$ and limits the maximum number of useful iterations. Peak accuracy could be achieved after $m$ iterations since all accuracy has been exhausted in $v$. However, truncation errors may exceed the accuracy gained by more iterations, and it is desirable to find the optimal number of iterations.

The accuracy of the rotation is determined by how closely the input rotation angle is approximated by the summation of sub-rotation angles $\alpha_i$. The error in $v$ after $n$ iterations will be proportional to the error in $z$. An increase in the $z$ datapath width will increase the accuracy of the $z$ update and hence the $v$ update.

The numerical accuracy of the CORDIC algorithm can be calculated by examining truncation and approximation errors. Truncation errors are due to the finite word length and approximation errors are due to the finite number of iterations. Walther [10] analysed the $x$ and $y$ iterations independently of the $z$ iterations and concluded that $\log n$ extra bits in the data paths can provide $n$ bits of accuracy. This work was re-calculated by Kota and Cavallaro [11] in a non-independent manner, concluding that $\log n + 2$ extra bits are required to achieve $n$ bits of accuracy after $n$ iterations.

This solution represents an upper bound of error in the CORDIC processor. A graph of this function appears in Figure (3.1), from which it can be seen that to achieve 8 or 16 bit accuracy, the internal datapaths need to be 13 and 22 bits wide respectively.
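The datapath widths quoted above follow from simple arithmetic. As a small Python sketch, assuming the logarithm is taken to base 2 and rounded up when $n$ is not a power of two (the function name is hypothetical):

```python
import math

def datapath_width(n):
    """Internal datapath width from the n + log(n) + 2 bound.

    Assumptions: base-2 logarithm, rounded up for n that is not a power of two.
    """
    return n + math.ceil(math.log2(n)) + 2

if __name__ == "__main__":
    for n in (8, 16):
        print(n, datapath_width(n))
```

For $n = 8$ this gives $8 + 3 + 2 = 13$ bits and for $n = 16$ it gives $16 + 4 + 2 = 22$ bits, matching the figures in the text.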

[Plot: "Datapath resolution vs Output Resolution". Horizontal axis: internal datapath width (n + log(n) + 2), 0 to 40. Vertical axis: output resolution of (n) bits with (n) iterations, 0 to 32.]

Figure 3.1: Numerical accuracy of the CORDIC processor.

3.2 The Lower Bound of CORDIC Accuracy

A CORDIC processor can be presented with all possible input combinations to find the lower bound of error. Simulation results are shown in Figure (3.2), where a 12 bit CORDIC processor with a variable number of stages is presented with all possible rotation angles in the range $-\pi \le z_{-1} \le \pi$ and the resulting accuracy in bits is calculated. Kota and Cavallaro's upper bound of error (as defined by their maximum error equation in Appendix (B)) is also shown in Figure (3.2). The upper bound of error has a well defined peak of accuracy; however, the simulation results indicate that accuracy will improve if more iterations are performed.

[Plot: output accuracy (vertical axis, 0 to 12) against number of stages n (horizontal axis, 0 to 12). Solid: predicted accuracy; dashed: actual accuracy.]

Figure 3.2: Predicted and Actual accuracy of a CORDIC processor with a 12 bit internal datapath.
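An exhaustive simulation of this kind can be approximated in a few lines of Python. The sketch below is not the simulator used for the figures: the fixed point format (integers with `frac_bits` fractional bits), the restriction to angles within ±90°, the unit test vector and the definition of accuracy as the negative base-2 logarithm of the worst component error are all assumptions made for illustration.

```python
import math

def cordic_fixed(x, y, theta, frac_bits=12, stages=8):
    """Integer-only CORDIC rotation for |theta| <= pi/2; returns floats."""
    one = 1 << frac_bits
    k = 1.0
    for i in range(stages):
        k /= math.sqrt(1.0 + 4.0 ** -i)
    xi = round(x * k * one)
    yi = round(y * k * one)
    zi = round(theta * one)
    for i in range(stages):
        q = 1 if zi >= 0 else -1
        # >> on Python integers is an arithmetic shift, so negative values
        # are truncated towards minus infinity as in two's complement.
        xi, yi = xi - q * (yi >> i), yi + q * (xi >> i)
        zi -= q * round(math.atan(2.0 ** -i) * one)
    return xi / one, yi / one

def worst_error_bits(frac_bits, stages, steps=360):
    """Accuracy in bits of the worst result over a sweep of angles."""
    worst = 0.0
    for s in range(steps + 1):
        theta = -math.pi / 2 + math.pi * s / steps
        x, y = cordic_fixed(1.0, 0.0, theta, frac_bits, stages)
        err = max(abs(x - math.cos(theta)), abs(y - math.sin(theta)))
        worst = max(worst, err)
    return float(frac_bits) if worst == 0.0 else -math.log2(worst)

if __name__ == "__main__":
    for stages in range(2, 13, 2):
        print(stages, round(worst_error_bits(12, stages), 2))
```

Sweeping `stages` for a fixed `frac_bits` gives a rough counterpart of the simulated curve; the exact numbers depend on the assumptions listed above.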


Figure (3.3) illustrates the accuracy of a 12 bit, 12 stage processor, obtained by simulation, as the resulting bits of error. About 0.3% of results have more than 2 bits of error, which indicates that the error bound of a CORDIC processor lies between the upper and lower bounds of error.

[Polar plot: bits of error (radial axis, 1 to 3) against rotation angle (0° to 330° in 30° steps).]

Figure 3.3: A plot showing bits of error for a typical test vector rotated through all possible angles.

The simulation results indicate that $n + \log n + 2$ is an overestimate of the data path width required, and that a reduction in datapath width is possible if the number of iterations is increased. Simulation results of two 8 stage CORDIC processors, with 12 bit and 8 bit datapaths, are shown for comparison in Figure (3.4) and Figure (3.5) respectively. The simulation results were obtained by varying the magnitude of $v$ and the angle $\theta$ in uniform steps. The difference in resolution obtained is two bits, indicating that the lower bound of error is closer to the error bound of CORDIC.

[Polar plot: vector magnitude (radial axis, 0.17 to 1.0) against rotation angle (−150° to 180° in 30° steps).]

Figure 3.4: A 12 bit, 8 stage CORDIC processor produces 9 bit accurate results.

[Polar plot: vector magnitude (radial axis, 0.1 to 1.0) against rotation angle (−150° to 180° in 30° steps).]

Figure 3.5: An 8 bit, 8 stage CORDIC processor produces 7 bit accurate results.

3.3 Reducing the z update error

In the rotational mode of CORDIC, the residual of $\theta$ converges towards zero by adding/subtracting sub-rotation angles, and the final iterations of the $z_i$ update will result in numbers approaching zero. More precisely, the angular error $z_i$ is approximately equal to $2^{-i}$, thus for a bus width $m$, only $(m - i)$ bits are used to represent the error.

To reduce the $z_i$ error a floating point system could be used, but it requires complex hardware not suited to word-parallel structures. A simpler method to improve accuracy, i.e., to utilise all $m$ bits, is a quasi-floating point or normalisation scheme, implemented by scaling the existing sequence by $2^i$, i.e.,

$$\hat{z}_i = 2^i \cdot z_i$$

Therefore, the new sequence becomes

$$\hat{z}_{i+1} = 2^{i+1} \cdot z_{i+1} = 2 \cdot 2^i \cdot (z_i - q_i \alpha_i) = 2 \cdot (2^i \cdot z_i - q_i \cdot 2^i \cdot \alpha_i) = 2 (\hat{z}_i - q_i \hat{\alpha}_i) \qquad (3.1)$$

which requires a shift left at each iteration, and requires no extra hardware for a word-parallel structure. A new sequence of sub-rotation angles can be defined as:

$$\hat{\alpha}_i = 2^i \alpha_i = 2^i \arctan(2^{-i}) \qquad (3.2)$$

where $\hat{\alpha}_i$ approaches a finite value of 1 for increasing values of $i$, and will utilise most of the bus width. Since the scaling system results in full use of the databus width, overflow may occur if the bus width is too small. Using Equation (3.1), the maximum value $\hat{z}_{i+1}$ can have is when $\hat{z}_i$ approaches zero, giving

$$\max[\hat{z}_{i+1}] \approx 2 \cdot \max[\hat{\alpha}_i] \qquad (3.3)$$

To calculate the increase in accuracy is beyond the scope of this report; however, simulation indicates that there is a direct improvement in accuracy. The simulation results indicated that using the traditional scheme the accuracy of the rotation is

$$\text{accuracy} \propto \log(z_i \text{ datapath width}) + \log(\text{number of stages}) \qquad (3.4)$$

whereas the normalisation scheme has the advantage of

$$\text{accuracy} \propto \log(\text{number of stages}) \qquad (3.5)$$

since the $z$ datapath is always in a semi-normalised state.

Using the traditional scheme, $\alpha_i \to 0$, limiting the number of useful stages. However when normalised, there is no limit on the number of stages and a significant reduction in hardware is possible by reducing the bus width of $z$.

Figure (3.6) illustrates the error dependencies on the number of stages and bits for the scaled and unscaled CORDIC processors. Figure (3.6(a)) and Figure (3.6(b)) show the angular expansion error. Figure (3.6(c)) and Figure (3.6(d)) show the dependence of $v$ error on the angular expansion error.
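The algebra of the normalisation scheme can be checked numerically. The Python sketch below uses floating point and hypothetical function names, and starts at $i = 0$ (the initial 90° step is left out); it only demonstrates that the scaled recurrence tracks $2^i z_i$ with the same sign decisions. The accuracy benefit itself appears only in a fixed point datapath, which this sketch does not model.

```python
import math

def z_sequence(theta, n):
    """Traditional update: z_{i+1} = z_i - q_i * alpha_i."""
    z, out = theta, []
    for i in range(n):
        q = 1 if z >= 0 else -1
        z -= q * math.atan(2.0 ** -i)
        out.append(z)
    return out

def z_sequence_normalised(theta, n):
    """Scaled update: zhat_{i+1} = 2 * (zhat_i - q_i * alphahat_i)."""
    z_hat, out = theta, []                # zhat_0 = 2**0 * z_0
    for i in range(n):
        q = 1 if z_hat >= 0 else -1
        alpha_hat = 2.0 ** i * math.atan(2.0 ** -i)
        z_hat = 2.0 * (z_hat - q * alpha_hat)
        out.append(z_hat)                 # equals 2**(i+1) * z_{i+1}
    return out

if __name__ == "__main__":
    plain = z_sequence(0.6, 12)
    scaled = z_sequence_normalised(0.6, 12)
    for i, (z, z_hat) in enumerate(zip(plain, scaled)):
        print(i + 1, z, z_hat / 2.0 ** (i + 1))
```

The scaled angles $2^i \arctan(2^{-i})$ approach 1 as $i$ grows, which is why the normalised datapath stays well filled in later stages.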


[Four surface plots, in order of appearance: angle expansion error (×10⁻³) against bits and stages, titled "Alpha scaling"; the same axes, titled "No alpha scaling"; relative v error against bits in z and stages/bits in v, titled "No alpha scaling"; the same axes, titled "Alpha scaling".]

Figure 3.6: Simulation results from a CORDIC processor illustrating the effects of the normalisation scheme.


3.4 Unexpected Truncation Errors

Using fixed point arithmetic in a CORDIC processor will introduce an unexpected truncation error. The error occurs when the vector $(x, y)$ has a negative component. Consider the final iterations, where the update of vector $v$ approaches 0 since a larger number of right shifts is performed at each iteration. This is not the case if $x$ or $y$ is negative. For example, let $x_{i \to N}$ equal the hexadecimal number X"2D", or positive 45. The right shifted value of $x_{i \to N}$ approaches zero. However, the negative of X"2D" in twos-complement form is X"D3", and its right shifted value approaches X"FF", or −1, not the expected zero.

This is a significant problem in the CORDIC processor, since the addition of extra iterations will only increase the error. A simple method of removing this error is to round the shifted value instead of truncating it. A simple way to round is to add the bit that was last shifted out to the shifted value.

The rounder could be implemented using a half-adder and typically requires three logic gates per bit. Minimal extra hardware is required in the word-serial architecture; however, a word-parallel structure requires two half-adders per stage. The additional delay will have a direct effect on the performance of the processor.

Figure (3.7) shows the simulation results of two CORDIC processors, with and without rounding units. The test vector was rotated in steps of 5° through 360°, and the rounded results are significantly more accurate. The rounding maintains monotonicity in the actual angle of rotation as well as uniform magnitude.

[Two polar plots: test vector magnitude (radial axis, labelled 32.95) against rotation angle (−150° to 180° in 30° steps).]

Figure 3.7: An 8 bit, 8 stage CORDIC processor (a) without rounding, (b) with rounding.
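The truncation effect and the rounding remedy can be reproduced with ordinary integers. In this Python sketch the function names are hypothetical; Python's `>>` on integers is an arithmetic shift, so it behaves like a sign-extending shifter on twos-complement words.

```python
def shift_truncate(value, shift):
    """Arithmetic right shift: truncates towards minus infinity."""
    return value >> shift

def shift_round(value, shift):
    """Right shift, then add the bit that was last shifted out."""
    if shift == 0:
        return value
    last_bit_out = (value >> (shift - 1)) & 1
    return (value >> shift) + last_bit_out

if __name__ == "__main__":
    # 45 is X"2D"; -45 is X"D3" as an 8-bit twos-complement word.
    for value in (45, -45):
        for shift in (3, 7):
            print(value, shift, shift_truncate(value, shift), shift_round(value, shift))
```

With truncation, 45 shifted right by 7 gives 0 but −45 gives −1; with rounding both give 0.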


Chapter 4
VHDL Implementation

Various tools can be used to implement the CORDIC processor; however, a standardised approach to this problem would unify the solution for further development in various applications. VHDL (VHSIC Hardware Description Language) has been used here to describe the structural and behavioural characteristics of a Word-Parallel CORDIC processor. VHDL has become the standard hardware description language and has its own IEEE standard [12].

4.1 The Basic CORDIC Unit

Any CORDIC structure will involve a basic unit containing three adders/subtracters, as shown in Figure (4.1). The binary scaler would be variable in the case of a Word-Serial device, but is much simpler in the Word-Parallel device, as a shift translates directly to a misalignment of the data bus.

[Diagram: cell i with inputs $x_i$, $y_i$, $z_i$ and $\alpha_i$ and outputs $x_{i+1}$, $y_{i+1}$, $z_{i+1}$.]

Figure 4.1: The basic CORDIC unit.

This unit, together with a suitable FSM and registers, could form a word-serial structure. A word-parallel implementation can be obtained by linking n CORDIC units.

The rest of this chapter deals with the development of a Word-Parallel unit and the interconnection of these devices using the VHDL language. It should be a relatively trivial task, but unfortunately the Viewlogic VHDL Synthesiser has many bugs and supports only a subset of the full VHDL standard.

The main aim of the project was to describe a CORDIC processor using the VHDL language and to allow the application designer to change the size of the structure easily. This flexibility could include fundamental changes such as variable datapath widths and a variable number of stages. Other options such as rounding intermediate nodes and pipelining could also be easily integrated.

Currently, Viewlogic's VHDL is a partial implementation of the 1987 IEEE Standard VHDL, and many constructs are missing from their implementation. Most of the useful constructs are there, but they are accompanied by ambiguous messages saying that the construct only works partially. This made it very difficult to work with.

4.2 VHDL Describes Structure and Behaviour

VHDL has the ability to describe a design in two ways:

• in terms of its component structure,
• in terms of the behavioural functionality of the design,

and also offers the possibility of integrating the two streams. A requirement for structural descriptions is that the lowest level description will be a behavioural description, to ensure portability between different synthesis libraries. An example of a lowest level operator is the logical operator AND (behavioural), used to describe the ANDing of two operands. This may be synthesised as an AND standard cell from the library. Consequently a component cannot be accessed directly from a cell library, since doing so would limit portability.

Consider a slightly more complex design of an n-bit adder/subtracter, which could be described by the following behavioural description:

    addsub : PROCESS(a, b, sel)
        -- res is one bit wider than s so that it can hold the carry out
        VARIABLE res : VLBIT_VECTOR(n DOWNTO 0);
    BEGIN
        res := zero(n DOWNTO 0);    -- needs to be initialised
        IF sel = '1' THEN
            res := add2c(a, b);
        ELSE
            res := sub2c(a, b);
        END IF;
        s <= res(n-1 DOWNTO 0);     -- discard cout
    END PROCESS;

The process activates when one of the signals in the sensitivity list changes, and then produces a result in the internal variable res. The signal s is assigned the lower portion of the sum. Now consider a structural description of the same adder/subtracter, where several components are used:

    c(0) <= sel;    -- carry in
    -- one bit slice per pass: inverter, mux and full adder
    connect: FOR i IN 0 TO n-1 GENERATE
        invert:      invf101 PORT MAP( b(i), b_bar(i) );
        mux_b_b_bar: muxf201 PORT MAP( b_bar(i), b(i), sel, b_hat(i) );
        addsub:      faf001  PORT MAP( a(i), b_hat(i), c(i), s(i), c(i+1) );
    END GENERATE;

Note that the muxf201 component is used to select between the non-inverted and inverted signals of the b bus. The components are user defined entities describing the appropriate logic gates. For example, a fragment of the faf001 component contains the following lowest level behavioural description:

    SUM <= A1 xor B1 xor CIN2;
    CO  <= (A1 and B1) or (A1 and CIN2) or (B1 and CIN2);

It is not immediately obvious which way a designer should describe a particular design; however, the next section reveals the results of the synthesiser, on which a decision may be based. In general, the easier it is for a designer to write a design in VHDL, the more optimisation the synthesiser needs to perform.

4.2.1 Hierarchical vs Flat Designs

One of the very useful features of Viewlogic's VHDL Synthesiser [13] is the ability to create either a hierarchical (top-down) or a flat (bottom-up) design. A hierarchical design allows the engineer to see lower level interconnections between design units, unlike the flat design where no (or little) hierarchy can be seen. This allows easier debugging of designs; however, it has the disadvantage of being less efficient than a flat design, which combines all the design elements together into one circuit and then performs optimisation.

Figure (4.2) illustrates the previous structural design of the Adder/Subtracter, where it can be observed that the schematic consists of higher level components than standard library cells. This feature of Viewlogic VHDL enables easy debugging of high level components when compared to a flat design.
It is relatively simple to navigate between levels in a design.

[Schematic: four bit slices, each built from an INVF101 inverter (A1, O), a MUXF201 multiplexer (A1, B2, SEL3, O) and an FAF001 full adder (A1, B1, CIN2, SUM, CO), with inputs A0 to A3, B0 to B3 and SEL and outputs S0 to S3.]

Figure 4.2: A Hierarchical Design of the Adder/Subtracter for n = 4.

Most libraries contain standard cells for full adders, muxes and inverters but, remembering that VHDL doesn't allow direct access to library cells, these components had to be described behaviourally. A mux simply maps to an IF statement; however, no behavioural description will map to the full adder cell, so the description stated previously has to be used.

Compiling the same design using the flat (bottom-up) design approach, the synthesiser produces the following statistics when using, for example, the X2000 library. The schematic generated by the synthesiser is shown in Figure (4.3).

    *********************************************
    Gate Usage Summary
    *********************************************
    Cell          Count  Area/Cell    Cell          Count  Area/Cell
    ----------------------------------------------------------------
    X2000:NAND2   15     0.25         X2000:OR2     3      0.25
    X2000:XOR2    15     0.25
    ----------------------------------------------------------------
    Total Cells : 33     Total Area : 8.25
    *********************************************
    Netlist Statistics
    *********************************************
    Maximum level of gates = 14    Total number of nets = 42

[Schematic: a flat netlist of XOR2, NAND2 and OR2 cells; inputs A0-A3, B0-B3 and SEL, outputs S0-S3.]
Figure 4.3: A Flat Design of the Adder/Subtracter for n = 4.

Reconsidering the behavioural description of the Adder/Subtracter and synthesising the design, the following statistics are generated, and the corresponding schematic is shown in Figure (4.4).

    *********************************************
    Gate Usage Summary
    *********************************************
    Cell          Count  Area/Cell    Cell          Count  Area/Cell
    ----------------------------------------------------------------------------
    X2000:AND2       21       0.25    X2000:AND3        1       0.50
    X2000:INV        11       0.00    X2000:NAND2       8       0.25
    X2000:OR2        17       0.25    X2000:XOR2        3       0.25
    ----------------------------------------------------------------------------
    Total Cells : 61                  Total Area : 12.75
    *********************************************
    Netlist Statistics
    *********************************************
    Maximum level of gates = 11       Total number of nets = 70

Figure 4.4: A Behavioural Design of the Adder/Subtracter for n = 4.

From the statistics of each design, it is important to note that the total area and the maximum level of gates differ. The structural description produces a small but slow design, whereas the behavioural description produces a fast but large design.

A characteristic of the synthesiser is that a behavioural description maps to a structure by representing each output in terms of its inputs, much like a lookup table, and removes any structure. The synthesiser performs logic level optimisation on the structural description, thus producing a design with less logic.

4.2.2 The Viewlogic Synthesiser

The Viewlogic Synthesiser has the ability to alter the emphasis on speed or area when optimizing a design. The statistics generated in the previous section were area optimized, and neglected the effect of gate delays. For example, optimizing the behavioural design for speed, the synthesiser generates 14 more gates than before (75 instead of 61); however, the maximum level of gates drops from 11 to 9:

    *********************************************
    Gate Usage Summary
    *********************************************
    Cell          Count  Area/Cell    Cell          Count  Area/Cell
    ----------------------------------------------------------------------------
    X2000:AND2       10       0.25    X2000:AND3        2       0.50
    X2000:AND4        1       0.75    X2000:INV        15       0.00
    X2000:NAND2      17       0.25    X2000:NAND3       1       0.50
    X2000:NAND4       1       0.75    X2000:NOR3        2       0.50
    X2000:NOR4        2       0.75    X2000:OR2        22       0.25
    X2000:OR4         1       0.75    X2000:XOR2        1       0.25
    ----------------------------------------------------------------------------
    Total Cells : 75                  Total Area : 18.75
    *********************************************
    Netlist Statistics
    *********************************************
    Maximum level of gates = 9        Total number of nets = 84

The synthesiser can optimise small designs, but when the design grows large, the memory and processing power required to optimize such a design is considerable. The CORDIC unit contains three adders/subtracters and takes several minutes to compile and optimize. However, when this unit is integrated into a larger design of several units, the compiler has many problems and eventually crashes after half an hour of compilation.

A solution to get around this optimization problem is to use a hierarchical flow and describe the components using behavioural or structural descriptions. Using this method the compiler knows nothing about large components and cannot perform any global optimization.
This is not a fully optimized solution, but it is currently the best solution. However, it is possible to flatten the design below the top level, making the design slightly more efficient.

4.3 VHDL Design of the CORDIC Unit

The first stage of the design of a CORDIC processor is to create the CORDIC unit, where two approaches can be taken: a behavioural description or a structural description. Firstly, consider the following behavioural description, where the shifting of the values (x_i, y_i) is done external to the CORDIC unit in the top level design. This approach is optimal, since it only requires a misalignment of the data buses in the top level interconnections.

If the shifting were contained inside the CORDIC unit, each unit would require a variable shifter and could not be optimized using the current version of Viewlogic VHDL, for reasons discussed previously. Another reason why shifting is done external to the CORDIC unit is that the LOOP variable inside the generate statement cannot be passed to any user defined function, procedure or entity. This is not stated in the manual, and the problem took many days to identify.

The behavioural description is as follows:

    ARCHITECTURE behaviour OF adder IS
    begin
      cell_i : process (xi,xs,yi,ys,zi,ai)
        VARIABLE x_res: vlbit_vector(n downto 0); -- temporary results
        VARIABLE y_res: vlbit_vector(n downto 0);
        VARIABLE z_res: vlbit_vector(k downto 0);
      begin
        x_res := zero(n downto 0); -- initialise, otherwise the compiler complains
        y_res := zero(n downto 0);
        z_res := zero(k downto 0);
        if zi(k-1) = '0' then -- z_i is positive (sign bit clear)
          x_res := add2c (xi, ys);
          y_res := sub2c (yi, xs);
          z_res := sub2c (zi, ai);
        else -- z_i is negative
          x_res := sub2c (xi, ys);
          y_res := add2c (yi, xs);
          z_res := add2c (zi, ai);
        end if;
        xip1 <= x_res (n-1 downto 0);
        yip1 <= y_res (n-1 downto 0);
        zip1 <= z_res (e-1 downto 0);
      end process;
    END behaviour;

The synthesiser generates the following statistics for an 8 bit version of the code. The maximum level of gates is 20, since each bit requires 2 levels, plus additional gates for the multiplexer and inversion.

    *********************************************
    Gate Usage Summary
    *********************************************
    Cell          Count  Area/Cell    Cell          Count  Area/Cell
    ----------------------------------------------------------------------------
    X2000:AND2      159       0.25    X2000:AND3        3       0.50
    X2000:INV        69       0.00    X2000:NAND2      76       0.25
    X2000:OR2       125       0.25    X2000:XOR2        7       0.25
    ----------------------------------------------------------------------------
    Total Cells : 439                 Total Area : 93.25
    *********************************************
    Netlist Statistics
    *********************************************
    Maximum level of gates = 20       Total number of nets = 487

The structural description of the CORDIC unit is slightly more complex and is best represented pictorially, as shown in Figure (4.5). Each box in the figure represents a different VHDL entity (component), and some components are used more than once. The design is very bulky and it is easier to make mistakes.

[Block diagram: adders.vhd contains two addsub_n.vhd units and one addsub_e.vhd unit, each built from faf001.vhd full adders, muxf201.vhd 2-to-1 multiplexers and inv101.vhd inverters; inputs xi, xs, yi, ys, zi, ai; outputs xip1, yip1, zip1.]
Figure 4.5: The structure of the CORDIC unit showing the various entities.

It achieves the same functionality as the behavioural description but requires a lot more effort to make sure all the connections are correct. As stated previously, the structural design will minimise area, but will result in a slower design, as reflected by the following synthesiser statistics.

    *********************************************
    Gate Usage Summary
    *********************************************
    Cell          Count  Area/Cell    Cell          Count  Area/Cell
    ----------------------------------------------------------------------------
    X2000:INV         3       0.00    X2000:NAND2     139       0.25
    X2000:OR2        41       0.25    X2000:XOR2       75       0.25
    ----------------------------------------------------------------------------
    Total Cells : 258                 Total Area : 63.75
    *********************************************
    Netlist Statistics
    *********************************************
    Maximum level of gates = 31       Total number of nets = 306

Using the structural design will save about 30% on area (63.75 against 93.25) but will run about 50% slower (31 gate levels against 20). In an FPGA implementation, speed might be more desirable than area optimization, since the devices operate relatively slowly when compared to a custom VLSI device. The extra area of the behavioural design will be a relatively small concern.

4.3.1 The Rounding Unit

The rounding unit is formed by the interconnection of n half adders or, in behavioural terms, by the addition of the bit shifted out during the shifting process. Describing it structurally involves using the inc001 component, which contains an AND and an XOR gate to form a half adder. The interconnection of the inc001 components is:

    c(0) <= cin; -- first carry
    connect: for i in 0 to n-1 generate
      addsub: inc001 port map( a(i), c(i), s(i), c(i+1) );
    end generate;

Alternatively, a much simpler behavioural description is created using the unsigned addition routine addum. This avoids the sign extension used in the add2c routine.

    rounder : process (a,cin)
      VARIABLE res: vlbit_vector(n downto 0); -- temporary results
    begin
      res := zero(n downto 0); -- initialise, otherwise the compiler complains
      res := addum(a,cin);     -- use addum rather than add2c: add2c would
                               -- sign-extend cin, making it -1 rather than +1
      s <= res (n-1 downto 0);
    end process;
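For readers without the Viewlogic library, the behavioural rounder can be sketched in standard VHDL. This is a minimal sketch, assuming the IEEE numeric_std package in place of the addum routine; the entity name round_unit and the default width n = 8 are illustrative and are not taken from the original design.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical stand-in for the rounding unit: s = a + cin,
-- with cin zero-extended so that it always counts as +1, never -1.
entity round_unit is
  generic (n : positive := 8);                 -- bus width (assumed)
  port (
    a   : in  std_logic_vector(n-1 downto 0);  -- shifted value
    cin : in  std_logic;                       -- bit shifted out
    s   : out std_logic_vector(n-1 downto 0)   -- rounded value
  );
end entity round_unit;

architecture behaviour of round_unit is
begin
  rounder : process (a, cin)
    variable inc : unsigned(n-1 downto 0);
  begin
    inc    := (others => '0');
    inc(0) := cin;                             -- zero extension of cin
    -- Adding as unsigned gives the same bit pattern as a two's-complement
    -- increment; any carry out of the top bit is discarded.
    s <= std_logic_vector(unsigned(a) + inc);
  end process rounder;
end architecture behaviour;
```

The zero extension in the variable inc plays the role that choosing addum over add2c plays in the original code: a sign-extended cin of '1' would be the all-ones pattern, which subtracts one instead of adding it.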

4.4 Combining the CORDIC Units

The process of combining the CORDIC and Rounding units involves writing the top level design of the hierarchical solution. As before with structural descriptions, the generate statement is used, which allows iterative or conditional generation of a portion of the description.

The first definition to be made in the top level file is the alpha_i constants, and this version implements the Alpha Normalisation Scheme. Next, the x, y, z intermediate signals between CORDIC units are shifted by the appropriate amount. The function shift_all is defined in another file, which contains the user defined functions. The shifting has to be done here because it will not work inside the generate statement: concurrent procedure calls only execute when a variable in the sensitivity list changes state, and a change in the shift value is not recognised inside the generate statement.

    -- Scaled a_i * 2^i values are decimal 45 53 56 57 57 57 57 57
    ai <= X"39_39_39_39_39_38_35_2D";
    sh_x: xis <= shift_all(xi); -- shift intermediate signals
    sh_y: yis <= shift_all(yi);
    sh_z: zis <= shift_z(zi);

It should be noted that the variables xis, yis, zis, xi, yi, and zi are large vectors containing several smaller vectors. This system had to be used since Viewlogic's VHDL cannot handle two-dimensional arrays of vlbit.
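The scaled constants packed into ai can be reproduced with a small simulation-only check. This is a sketch under stated assumptions: it uses the standard ieee.math_real package rather than anything from the Viewlogic library, it assumes the angles are held in whole degrees and rounded to the nearest integer, and the entity name alpha_check is hypothetical.

```vhdl
library ieee;
use ieee.math_real.all;

-- Hypothetical test bench: recomputes arctan(2^-i) * 2^i in degrees
-- and compares it with the values quoted in the source comment.
entity alpha_check is
end entity alpha_check;

architecture sim of alpha_check is
  type int_array is array (natural range <>) of integer;
  -- Values from the comment in the top level file, stage 0 first.
  constant expected : int_array(0 to 7) := (45, 53, 56, 57, 57, 57, 57, 57);
begin
  check : process
    variable alpha_deg : real;
    variable scaled    : integer;
  begin
    for i in expected'range loop
      alpha_deg := arctan(2.0 ** (-i)) * 180.0 / MATH_PI;  -- alpha_i in degrees
      scaled    := integer(round(alpha_deg * 2.0 ** i));   -- alpha_i * 2^i
      assert scaled = expected(i)
        report "stage " & integer'image(i) & ": got " & integer'image(scaled)
        severity error;
    end loop;
    wait;  -- stop after one pass
  end process check;
end architecture sim;
```

Read least significant byte first, the hexadecimal literal X"39_39_39_39_39_38_35_2D" holds the same sequence: 2D = 45, 35 = 53, 38 = 56 and 39 = 57. The scaled values settle near 57 because arctan(2^-i) * 2^i approaches one radian (about 57.3 degrees), which is what allows every stage to use the full width of the bus.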
The shifting of intermediate signals is done by the following function:

    FUNCTION shift_all (x : vlbit_vector (n*(k-1)-1 downto 0))
    RETURN vlbit_vector IS
      VARIABLE x_s : vlbit_vector(n*(k-1)-1 downto 0) := zero(n*(k-1)-1 downto 0);
    BEGIN
      x_s(1*n-1 downto 0)   := shiftr2c(x( 1*n-1 downto 0 ),1);   -- stage 2
      x_s(2*n-1 downto 1*n) := shiftr2c(x( 2*n-1 downto 1*n ),2); -- stage 3
      x_s(3*n-1 downto 2*n) := shiftr2c(x( 3*n-1 downto 2*n ),3); -- stage 4
      x_s(4*n-1 downto 3*n) := shiftr2c(x( 4*n-1 downto 3*n ),4); -- stage 5
      x_s(5*n-1 downto 4*n) := shiftr2c(x( 5*n-1 downto 4*n ),5); -- stage 6
      x_s(6*n-1 downto 5*n) := shiftr2c(x( 6*n-1 downto 5*n ),6); -- stage 7
      x_s(7*n-1 downto 6*n) := shiftr2c(x( 7*n-1 downto 6*n ),7); -- stage 8
      x_s(8*n-1 downto 7*n) := shiftr2c(x( 8*n-1 downto 7*n ),8); -- stage 9
      x_s(9*n-1 downto 8*n) := shiftr2c(x( 9*n-1 downto 8*n ),9); -- stage 10
      return x_s;
    END shift_all;

Next comes the connection of the init component, which is used to expand the convergence range of the CORDIC processor to −190° < z < 190°. The input signals x_in, y_in and z_in are connected to a unit similar to the CORDIC unit, except that there is an extra bit appended to the alpha bus to account for the expanded convergence range.

    initial: init port map(
      xi => X"00",
      xs => x_in,
      yi => X"00",
      ys => y_in,
      zi => z_in,
      ai => B"0_0101_1010", -- add/sub 90 degrees
      xip1 => xinit,        -- xinit = 0 +- yin
      yip1 => yinit,        -- yinit = 0 -+ xin
      zip1 => zinit );

The following code has been compressed to reduce detail; however, it can be seen that there are three separate stages: initial connection, intermediate connections, and final connection. This can be seen in Figure (4.6). (Also not shown is the conditional generation of components, e.g., selection of behavioural or structural components, rounding units, etc.)

    connect: for i in 0 to k-1 generate -- k stages
      ls_unit: if i=0 generate
        first_unit: adder port map( ... );
      end generate ls_unit;
      i_unit: if i>0 and i<k-1 generate
        x_round: round port map ( ... );
        y_round: round port map ( ... );
        middle_units: adder port map( ... );
      end generate i_unit;
      ms_unit: if i=k-1 generate
        x_round_last: round port map ( ... );
        y_round_last: round port map ( ... );
        last_unit: adder port map( ... );
      end generate ms_unit;
    end generate connect;

The contents of ... are similar to the port map of the init component.

4.4.1 A Solution

This represents a solution to the CORDIC problem, and is close to an optimized solution, but due to compiler and language difficulties a completely optimized solution is not possible. Under these conditions the design has been optimised as far as possible.

There are many choices to be made about the design of the CORDIC unit, chiefly whether it is going to be area or speed efficient.
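The behaviour of the init stage connected at the start of this section can be sketched in standard VHDL. This is only a sketch under assumptions: it takes the init component to behave like the behavioural CORDIC unit with xi = yi = 0 and a fixed angle of 90 degrees, as the port map comments suggest, and it assumes 8-bit x and y buses with a 9-bit z input. The entity name init_sketch and the use of numeric_std are illustrative and are not the original implementation.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical model of the initial +/-90 degree rotation stage.
entity init_sketch is
  port (
    x_in  : in  signed(7 downto 0);
    y_in  : in  signed(7 downto 0);
    z_in  : in  signed(8 downto 0);   -- one extra bit for the wider range
    xinit : out signed(7 downto 0);
    yinit : out signed(7 downto 0);
    zinit : out signed(7 downto 0)
  );
end entity init_sketch;

architecture behaviour of init_sketch is
  constant ninety : signed(8 downto 0) := to_signed(90, 9);  -- B"0_0101_1010"
begin
  rotate : process (x_in, y_in, z_in)
    variable z_res : signed(8 downto 0);
  begin
    if z_in(8) = '0' then             -- z_in is positive (sign bit clear)
      xinit <= y_in;                  -- 0 + y_in
      yinit <= -x_in;                 -- 0 - x_in (wraps for the most negative value)
      z_res := z_in - ninety;
    else                              -- z_in is negative
      xinit <= -y_in;                 -- 0 - y_in
      yinit <= x_in;                  -- 0 + x_in
      z_res := z_in + ninety;
    end if;
    zinit <= z_res(7 downto 0);       -- |z| < 100 now fits in 8 bits
  end process rotate;
end architecture behaviour;
```

Because the first rotation is by exactly 90 degrees, it needs no shift and introduces no scale factor: the x and y values are only swapped and negated, which is why a single extra stage is enough to widen the convergence range.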

[Schematic: an INIT block followed by four ADDER blocks, with ROUND blocks between stages; inputs X_IN0-X_IN7, Y_IN0-Y_IN7, A_IN0-A_IN8; outputs X_OUT0-X_OUT7, Y_OUT0-Y_OUT7, A_OUT0-A_OUT7.]
Figure 4.6: The top level schematic of a 4 stage CORDIC processor with Increased Convergence Range and Rounding components.

The user can flexibly change the characteristics of the CORDIC processor by changing the value of a few constants, to achieve more or less accuracy as well as different hardware configurations.

Some of the CORDIC hardware statistics generated are:

    Type          Internal Bus Width   Stages   Rounding   Number of Gates
    ----------------------------------------------------------------------
    Behavioural   12 bit                8       no         5841
    Behavioural   12 bit               10       no         7139
    Behavioural   12 bit               12       no         8437
    Behavioural    8 bit                8       no         3753
    Behavioural    8 bit                8       yes        5060
    Structural     8 bit                8       no         2313
    Structural     8 bit                8       yes        2775

Table 4.1: Some CORDIC hardware statistics. (Remember that there is also one additional stage for increasing the convergence range.)

4.5 Improvements

There is one main improvement which could be made to the current design, which is to include pipelining registers between stages. If a Xilinx FPGA were used, no extra hardware for latches would be necessary, as each cell contains a latch.

Another possibility is to design a Word-Serial CORDIC architecture around the already designed CORDIC unit. This would only require an FSM driver along with some additional hardware for the variable shifter.
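The pipelining improvement could take the form of a register bank placed between adjacent CORDIC units. The following is a minimal sketch, not part of the original design: the entity name stage_reg, the port names and the default widths are assumptions, and it infers edge-triggered registers using standard std_logic types rather than Viewlogic's vlbit.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical pipeline register for the x, y and z buses of one stage.
entity stage_reg is
  generic (
    n : positive := 8;   -- x and y bus width (assumed)
    e : positive := 8    -- z bus width (assumed)
  );
  port (
    clk : in  std_logic;
    x_d : in  std_logic_vector(n-1 downto 0);
    y_d : in  std_logic_vector(n-1 downto 0);
    z_d : in  std_logic_vector(e-1 downto 0);
    x_q : out std_logic_vector(n-1 downto 0);
    y_q : out std_logic_vector(n-1 downto 0);
    z_q : out std_logic_vector(e-1 downto 0)
  );
end entity stage_reg;

architecture rtl of stage_reg is
begin
  reg : process (clk)
  begin
    if rising_edge(clk) then   -- capture all three buses on each clock edge
      x_q <= x_d;
      y_q <= y_d;
      z_q <= z_d;
    end if;
  end process reg;
end architecture rtl;
```

With one such register after every unit, the clock period would be set by the gate levels of a single CORDIC unit rather than by the whole chain, at the cost of one clock cycle of latency per stage.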

Conclusion

The theory behind the CORDIC algorithm has been covered in detail and its possible applications in array imaging discussed.

It was shown how Kota predicted the upper bound on CORDIC errors; however, simulations reveal CORDIC to be significantly more accurate. A normalisation scheme on the z datapath was introduced to maximize bus usage, and hence increase accuracy. This scheme can reduce the bus width required for z and still achieve the same or greater accuracy. Also observed were the unexpected truncation errors introduced by the twos-complement binary format, which were minimised using a half adder to perform a rounding operation. A method for increasing the convergence range to cover −180° to 180° in one extra iteration was also introduced.

Various CORDIC architectures were discussed and a word-parallel architecture was described using the VHDL hardware description language. A few design issues and implementation problems were discussed, and it was concluded that a hierarchical design flow had to be followed to avoid compiler memory problems. The solution is not completely optimal, but the resulting design could easily be implemented on an FPGA.

The VHDL design of the CORDIC processor could now be easily integrated into any application, as alterations in the configuration of the CORDIC processor can easily be achieved using the VHDL language.

Appendix A
CORDIC Functions

The functional results from a CORDIC processor are derived from the initialisation of the three input variables x, y, and z, and the subsequent mode of operation selected.

Equations (1.15, 1.16) can be rewritten into a general solution, from which six modes of operation are possible, with the introduction of a mode variable m:

$$x_{i+1} = x_i + m \cdot q_i y_i \cdot 2^{-i} \tag{A.1}$$
$$y_{i+1} = y_i - q_i x_i \cdot 2^{-i} \tag{A.2}$$
$$z_{i+1} = z_i - q_i \alpha_i \tag{A.3}$$

where $\alpha_i$ depends on m, which can take on the following values:

$$\alpha_i = \begin{cases} \operatorname{atanh}(2^{-i}) & \text{if } m = -1 \\ 2^{-i} & \text{if } m = 0 \\ \arctan(2^{-i}) & \text{if } m = +1 \end{cases} \tag{A.4}$$

The three values of m determine the class of function being evaluated: linear (m = 0), circular (m = +1) or hyperbolic (m = −1). The values of $\alpha_i$ were previously given for the circular functions mode, and a similar table could be given for the hyperbolic functions.

The six modes of operation exist because there are two sub-classes available, depending upon whether the iterations seek to drive the variable y or z towards zero. Table (A.1) summarises all of the functions available with this configuration.

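Table A.1: The six CORDIC modes.

Iterations
    Hyperbolic (m = −1):  $i = \{0, 1, 2, \ldots, N-1\}$ (repeat for $i = \{4, 13, 40, \ldots\}$)
    Linear (m = 0):       $i = \{0, 1, 2, \ldots, N-1\}$
    Circular (m = +1):    $i = \{0, 1, 2, \ldots, N-1\}$

Mode z → 0
    Direction (all three classes):  $q_i = +1$ if $z_i < 0$, $q_i = -1$ if $z_i \ge 0$
    Hyperbolic (m = −1):
        $x_N \approx K_N (x_{in} \cosh(z_{in}) + y_{in} \sinh(z_{in}))$
        $y_N \approx K_N (x_{in} \sinh(z_{in}) + y_{in} \cosh(z_{in}))$
        $|z_{in}| \le 1.1182$
    Linear (m = 0):
        $x_N = x_{in}$
        $y_N = y_{in} + x_{in} z_{in}$
        $|z_{in}| \le 1$
    Circular (m = +1):
        $x_N \approx K_N (x_{in} \cos(z_{in}) - y_{in} \sin(z_{in}))$
        $y_N \approx K_N (x_{in} \sin(z_{in}) - y_{in} \cos(z_{in}))$
        $|z_{in}| \le 1.7433$ (99.9°)

Mode y → 0
    Hyperbolic (m = −1):
        $q_i = +1$ if $x_i y_i \ge 0$, $q_i = -1$ if $x_i y_i < 0$
        $x_N \approx K_N \sqrt{x_{in}^2 - y_{in}^2}$
        $z_N \approx z_{in} + \operatorname{atanh}(y_{in}/x_{in})$
        $|\operatorname{atanh}(y_{in}/x_{in})| \le 1.1182$
    Linear (m = 0):
        $q_i = +1$ if $x_i y_i \ge 0$, $q_i = -1$ if $x_i y_i < 0$
        $x_N = x_{in}$
        $z_N = z_{in} + y_{in}/x_{in}$
        $|y_{in}/x_{in}| \le 1$
    Circular (m = +1):
        $q_i = +1$ if $y_i \ge 0$, $q_i = -1$ if $y_i < 0$
        $x_N \approx K_N \sqrt{x_{in}^2 + y_{in}^2}$
        $z_N \approx z_{in} + \operatorname{atan2}(y_{in}, x_{in})$
        $|\operatorname{atan2}(y_{in}, x_{in})| \le 1.7433$ (99.9°)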

Appendix B
Upper Bound of CORDIC Error

Kota and Cavallaro calculated the numerical accuracy of the CORDIC algorithm by examination of truncation and approximation errors.

They concluded that for a CORDIC processor with all data paths being m bits wide, and the number of iterations being n, the upper bound of error is:

$$E_u = 2^{-n} + 3.5 \cdot 2^{-m} \cdot n \tag{B.1}$$

This cannot be solved analytically, but a numerical solution (that is, for any given m) can be approximated graphically. The solution approximates to

$$\text{Internal Bus Width} = m \approx n + \log_2 n + 2 \tag{B.2}$$

Hence, to obtain a precision of n bits, a CORDIC processor with $(n + \log_2 n + 2)$ bits and n iterations would be sufficient. This solution represents the upper bound of error.
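As a check on (B.2), the bus width can be substituted back into (B.1). This is a worked substitution added for illustration, not part of the original analysis, and the choice n = 8 is only an example.

```latex
\begin{align*}
2^{-m} &= 2^{-(n + \log_2 n + 2)} = \frac{2^{-n}}{4n}, \\
E_u &= 2^{-n} + 3.5 \cdot \frac{2^{-n}}{4n} \cdot n
     = 2^{-n}\,(1 + 0.875) = 1.875 \cdot 2^{-n}, \\
n = 8:\quad m &= 8 + \log_2 8 + 2 = 13, \qquad
E_u = 1.875 \cdot 2^{-8} \approx 7.3 \times 10^{-3}.
\end{align*}
```

With this bus width the truncation term is therefore slightly smaller than the approximation term $2^{-n}$, and the total bound stays just below $2^{-(n-1)}$.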


Bibliography

[1] J. E. Volder, "The CORDIC trigonometric computing technique," IRE Transactions on Electronic Computers, vol. EC-8, no. 3, pp. 330–334, 1959.
[2] Y. H. Hu, "CORDIC-based VLSI architectures for digital signal processing," IEEE Signal Processing Magazine, pp. 16–35, July 1992.
[3] F. Koscsis and J. Bohme, "Fast algorithms and parallel structures for form factor evaluation," The Visual Computer, no. 8, pp. 205–216, 1992.
[4] M. Kameyama, T. Amada, and T. Higuchi, "Highly parallel collision detection processor for intelligent robots," IEEE Journal of Solid-State Circuits, vol. 27, no. 4, pp. 500–506, 1992.
[5] M. O'Donnell et al., "Real-time phased array imaging using digital beam forming and autonomous channel control," Ultrasonics Symposium, pp. 1499–1502, 1990.
[6] G. Hampson and A. Papliński, "Beamforming by interpolation," Tech. Rep. 93-12, Monash University, 1993.
[7] G. L. Haviland and A. A. Tuszynski, "A CORDIC arithmetic processor chip," IEEE Transactions on Computers, vol. C-29, no. 2, pp. 68–79, 1980.
[8] J. Duprat and J.-M. Muller, "The CORDIC algorithm: New results for fast VLSI implementation," IEEE Transactions on Computers, vol. 42, pp. 168–178, February 1993.
[9] A. Papliński, "Array processor units for evaluating the exponential and logarithmic functions," Tech. Rep. TR-CS-82-07, The Australian National University, 1982.
[10] J. S. Walther, "A unified algorithm for elementary functions," Proceedings AFIPS Spring Joint Computer Conference, pp. 379–385, 1971.
[11] K. Kota and J. R. Cavallaro, "Numerical accuracy and hardware tradeoffs for CORDIC arithmetic for special-purpose processors," IEEE Transactions on Computers, vol. 42, pp. 769–779, July 1993.
[12] Experts, "IEEE Std 1076-1987, IEEE Standard VHDL Language Reference Manual," IEEE Computer Society, February 1992.
[13] ViewLogic, VHDL Reference Manual for Synthesis, Powerview 5.1.3 release ed.