real/complex logarithmic number system alu

8/13/2019 Real/Complex Logarithmic Number System ALU

http://slidepdf.com/reader/full/realcomplex-logarithmic-number-system-alu 1/12

A Real/Complex Logarithmic NumberSystem ALU

Mark G. Arnold, Member , IEEE , and Sylvain Collange

Abstract—The real Logarithmic Number System (LNS) offers fast multiplication but uses more expensive addition. Cotransformation

and higher order table methods allow real LNS ALUs with reasonable precision on Field-Programmable Gate Arrays (FPGAs). The

Complex LNS (CLNS) is a generalization of LNS, which represents complex values in log-polar form. CLNS is a more compact

representation than traditional rectangular methods, reducing bus and memory cost in the FFT; however, prior CLNS implementations

were either slow CORDIC-based or expensive 2D-table-based approaches. Instead, we reuse real LNS hardware for CLNS, with

specialized hardware (including a novel logsin that overcomes singularity problems) that is smaller than the real-valued LNS ALU to

which it is attached. All units were derived from the Floating-Point-Cores (FloPoCo) library. FPGA synthesis shows our CLNS ALU is

smaller than prior fast CLNS units. We also compare the accuracy of prior and proposed CLNS implementations. The most accurate of

the proposed methods increases the error in radix-two FFTs by less than half a bit, and a more economical FloPoCo-based

implementation increases the error by only one bit.

Index Terms—Complex arithmetic, logarithmic number system, hardware function evaluation, FPGA, fast Fourier transform, VHDL.

Ç

1 INTRODUCTION

THE usual approach to complex arithmetic works withpairs of real numbers that represent points in a

rectangular coordinate system. To multiply two complexnumbers, denoted in this paper by upper-case variables, X and Y , using rectangular coordinates involves four realmultiplications:

<½ X Y ¼ < ½ X < ½ Y = ½ X = ½ Y ð1Þ

=½ X Y ¼ < ½ X = ½ Y þ = ½ X < ½ Y ; ð2Þ

where <½ X is the real part and =½ X is the imaginary part of acomplex value, X . The bar indicates that this is an exactlinearly represented unbounded precision number. Thereare many implementation alternatives for the real arithmeticin (1) and (2): fixed-point, Floating-Point (FP), or moreunusual systems like the scaled Residue Number System(RNS) [7] or the real Logarithmic Number System (LNS) [17].Swartzlander et al. [17] analyzed fixed-point, FP, and LNSusage in a Fast Fourier Transform (FFT) implemented withrectangular coordinates, and found several advantages inusing logarithmic arithmetic to represent <½ X and =½ X .Several hundred papers [24] have considered conventional,real-valued LNS; most have found some advantages for low-precision implementation of multiply-rich algorithms, like(1) and (2).

This paper considers an even more specialized numbersystem in which complex values are represented in log-polar coordinates. The advantage is that the cost of complex

multiplication and division is reduced even more than inrectangular LNS at the cost of a very complicated additionalgorithm. This paper proposes a novel approach to reducethe cost of this log-polar addition algorithm by using aconventional real-valued LNS ALU, together with addi-tional hardware that is less complex than the real-valuedLNS ALU, to which it is attached.

Arnold et al. [4] introduced a generalization of logarith-mic arithmetic, known as the Complex LogarithmicNumber System (CLNS), which represents each complexpoint in log-polar coordinates. CLNS was inspired by anineteenth century paper by Mehmke [14] on manual usageof such log-polar representation, and shares only the polaraspect with the highly-unusual complex-level-index repre-sentation [18]. The initial CLNS implementations, like themanual approach, were based on straight table lookup,which grows quite expensive as precision increases, sincethe lookup involves a 2D table.

Cotransformation [4]cuts thecost of CLNS significantly byconverting difficult cases for addition and subtraction intoeasier cases. In CLNS, it is easy to break a complex value into

two parts such that X ¼

X 1

X 2. Incrementation of thecomplex value can be described in terms of these two parts:

1 þ X ¼ 1 þ X 1 X 2 þ X 1 X 1 ¼ ð1 þ X 1Þ þ ð X 1 ð X 2 1ÞÞ

¼ ð1 þ X 1Þ 1 þX 1 ð X 2 1Þ

1 þ X 1

:

The CLNS representations of ð1 þ X 1Þ and ð X 2 1Þ can beobtained from smaller tables and the incrementation of X 1ð X 21Þ

1þ X 1is easier. Although this saves memory, the overall

table sizes are still large.Lewis [13] overcame such large area requirements in the

design of a 32-bit CLNS ALU by using a CORDIC algorithm,

which is significantly less expensive; however, the imple-mentation involves many steps, making it rather slow.Despite the implementation cost, CLNS may be preferredin certain applications. For example, Arnold et al. [2], [3]

202 IEEE TRANSACTIONS ON COMPUTERS, VOL. 60, NO. 2, FEBRUARY 2011

. M.G. Arnold is with the Computer Science and Engineering Department,Lehigh University, Bethlehem, PA 18015. E-mail: [email protected].

. S. Collange is with ELIAUS, Universite de Perpignan, 52 av. Paul Alduy,66860 Perpignan Cedex, France.

Manuscript received 12 Aug. 2009; accepted 22 Feb. 2010; published online

18 June 2010.Recommended for acceptance by J. Bruguera, M. Cornea, and D. Das Sarma.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference IEEECS Log Number TCSI-2009-08-0381.Digital Object Identifier no. 10.1109/TC.2010.154.

0018-9340/11/$26.00 2011 IEEE Published by the IEEE Computer Society



showed that CLNS is significantly more compact than acomparable rectangular fixed-point representation for aradix-two FFT, making the memory and busses in the systemmuch less expensive. These savings counterbalance the extracost of the CLNS ALU, if the precision requirements are lowenough. Vouzis et al. [20] analyzed a CLNS approach forOrthogonal Frequency Division Multiplexing (OFDM) de-modulation of Ultrawideband (UWB) receivers. To reduce

the implementation cost of such CLNS applications, Vouzisand Arnold [19] proposed using a Range-AddressableLookup Table (RALUT), similar to [15].

2 REAL-VALUED LNS

The format of the real-valued LNS [16] has a base-b (usually,b ¼ 2) logarithm (consisting of kL-signed integer bits andf L fractional bits) to represent the absolute value of a realnumber and often an additional sign bit to allow for thatreal to be negative. We can describe the ideal logarithmictransformation and the quantization with distinct notations.For an arbitrary nonzero real, x, the ideal (infinite precision)

LNS value is xL ¼ logb jxj. The exact linear x has beentransformed into an exact but nonlinear xL, which is thenquantized as xL ¼ bxL2f L þ 0:5c2f L . In analogy to theFP number system, the kL integer bits behave like anFP exponent (negative exponents mean smaller than unity;positive exponents mean larger than unity); the f L fractional bits are similar to the FP mantissa. Multiplication ordivision simply requires adding or subtracting the loga-rithms and exclusive-ORing the sign bits.

To compute real LNS sums, a conventional LNS ALUcomputes a special function, known as sbðz Þ ¼ logbð1 þ bz Þ,which, in effect, increments the LNS representation by 1.0.(The s reminds us this function is only used for sums when

the sign bits are the same.) The LNS addition algorithmstarts with xL ¼ logb jxj and yL ¼ logb jyj already in LNSformat. The result, tL ¼ logb jx þ yj is computed as tL ¼yL þ sbðxL yLÞ, which is justified by the fact thatt ¼ x þ y ¼ yðx=y þ 1Þ, even though such a step neveroccurs in the hardware. Since x þ y ¼ y þ x, we can alwayschoose z L ¼ jxL yLj [16] by simultaneously choosing themaximum of xL and yL. In other words, tL ¼ maxðxL; yLÞ þsbðjxL yLjÞ, thereby restricting z < 0.

Analogously to sb, there is a similar function to computethe logarithm of the differences with tL ¼ maxðxL; yLÞ þd bðjxL yLjÞ, where d bðz Þ ¼ logb j1 bz j. The decisionwhether to compute sb or d b is based on the signs of the

real values.Early LNS implementations [16] used ROM to lookup sb

and d b achieving moderate (f L ¼ 12 bits) accuracy. Morerecent methods [8], [9], [12], [21], [22] can obtain accuracynear single-precision floating point at reasonable cost. Thegoal here is to leverage the recent advances made in real-valued LNS units for the more specialized context of CLNS.

3 COMPLEX LOGARITHMIC NUMBER SYSTEM

CLNS uses a log-polar approach for X 6¼ 0 in which wedefine, for hardware simplicity,

logbð X Þ ¼ X ¼ X L þ X i; ð3Þ

where X L ¼ logbð ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

<½ X 2 þ =½ X 2q

Þ is thelogarithm (base b) of the length of a vector in the complex plane and X <

is the angle of that vector.1 This can be converted back exactlyto rectangular form:

<½ X ¼ bX L cosðX Þ;

=½ X ¼ bX L sinðX Þ:ð4Þ

In a practical implementation, both the logarithm and anglewill be quantized:

X L ¼ bX L2f L þ 0:5c2f L ;

X ¼ bX 2f þ2= þ 0:5c2f :ð5Þ

We scale the angle by 4= so that the quantization near theunit circle will be roughly the same in the angular andradial axes when f L ¼ f while allowing the complete circleof 2 radians to be represented by a power of two. Therectangular value, ~X , represented by the quantized CLNShas the following real and imaginary parts:

<½ ~X ¼ b X L cosð=4 X Þ;

=½ ~X ¼ b ^X L sinð=4 X Þ:

ð6Þ

To summarize our notation, an arbitrary precision complexvalue X 6¼ 0 is transformed losslessly to its ideal CLNSrepresentation, X . Quantizing X to X produces anabsolute error perceived by the user in the rectangularsystem as j X ~X j.

Complex multiplication and division are trivial in CLNS.Given the exact CLNS representations, X and Y , the resultof multiplication, Z ¼ X þ Y , can be computed by twoparallel adders:

Z L ¼ X L þ Y L;

Z ¼ ðX þ Y Þ mod 2: ð7Þ

The modular operation for the imaginary part is not strictlynecessary, but reduces the range of angles that thehardware needs to accept. In the quantized system, thisreduction comes at no cost in the underlying binary adder because of the 4= scaling.

The conjugate of the value is represented by theconjugate of the representation, which can be formed bynegating Z . The negative can be formed by adding to Z .Division is like multiplication, except the representationsare subtracted.

The difficult issue for all variations of LNS is additionand subtraction. Like conventional LNS, CLNS utilizesX þ Y ¼ Y ð Z þ 1Þ, where Z ¼ X= Y . This is valid for all

nonzero complex values, except when X ¼ Y , a case thatcan be handled specially, as in real LNS.

The complex addition logarithm, S bðZ Þ is a function [14]that accepts a complex argument, Z , the log-polar repre-sentation of Z , and returns the corresponding representa-tion of Z þ 1. Thus, to find the representation, T , of the sum,T ¼ X þ Y simply involves

T ¼ Y þ S bðX Y Þ; ð8Þ

which is implemented by two subtractors, a complexapproximation unit, and two adders:

ARNOLD AND COLLANGE: A REAL/COMPLEX LOGARITHMIC NUMBER SYSTEM ALU 203

1. Unless b ¼ e, logb in this paper is defined slightly differently than thetypical definition lnð X Þ= lnðbÞ used in complex analysis.



T L ¼ Y L þ <½S bðX Y Þ;

T ¼ ðY þ =½S bðX Y ÞÞ mod 2: ð9Þ

The major cost in this circuit is the complex S b unit.Subtraction simply requires adding to Y beforeperforming (8).

4 NOVEL CLNS ADDITION ALGORITHM

The complex addition logarithm, S bðZ Þ, has the samealgebraic definition as its real-valued (sbðz Þ) counterpart,and simple algebra on the complex Z ¼ Z L þ iZ reveals theunderlying polar conversion required to increment a CLNSvalue by one:

S bðZ Þ ¼ logbð1 þ bZ Þ

¼ logbð1 þ cosðZ ÞbZ L þ i sinðZ ÞbZ L Þ: ð10Þ

This complex function can have its real and imaginary partscomputed separately. In all of the prior CLNS literature [2],[3], [13], the real part is derived from (10) as:

<ðS bðZ ÞÞ ¼logbð1 þ 2cosðZ ÞbZ L þ b2Z L Þ

2 ; ð11Þ

and the imaginary part is derived from (10) as:

=ðS bðZ ÞÞ ¼ arctanð1 þ cosðZ ÞbZ L ; sinðZ ÞbZ L Þ; ð12Þ

where arctanðx; yÞ is the four-quadrant function, like

atan2(y,x) in many programming languages.In the prior literature, (11) and (12) are computed directly

from the complex Z , either using fast, but costly, simplelookup [20], or less expensive methods like cotransforma-tion [4], CORDIC [13], or content-addressable memory [19].The proposed approach uses a different derivation in whichthe real and imaginary parts are manipulated separately.Unlike the prior literature, in the proposed technique, thereal values in the intermediate computations are repre-sented as conventional LNS real values; the angles remainas scaled fixed-point values at every step. The advantages of the proposed technique are lower cost and the ability to

reuse the same hardware for both real- and complex-LNSoperations. In order to use this novel approach, a differentderivation from (10) is required, while still considering thisa single complex function of a complex variable:

S bðZ Þ ¼ logb

1 þ sgncosðZ ÞblogcosðZ ÞbZ L

þ i sgnsinðZ ÞblogsinðZ ÞbZ L;ð13Þ

where sgncosðZ Þ and sgnsinðZ Þ are the signs of therespective real-valued trigonometric functions, and wherelogcosðZ Þ ¼ logb j cosðZ Þj and logsinðZ Þ ¼ logb j sinðZ Þj arereal-valued functions compatible with a real-valued LNSALU. For simplicity, (13) ignores the trivial cases, when Z isa multiple of =2, which causes sinðZ Þ or cosðZ Þ to be zero.These cases are resolved in the Appendix and reported inTable 1. In order to calculate the complex-valued function(13) using a conventional real LNS, there are two cases thatchoose whether the “1þ” operation is implemented with thereal sb function or the real d b function.

Case 1. (=2 < Z < =2) is when “1þ” means a sum

(sb) is computed:

S bðZ Þ ¼ logb

1 þ blogcosðZ ÞþZ L

þ i sgnsinðZ ÞblogsinðZ ÞþZ L

¼ logb

bsbðlogcosðZ ÞþZ LÞ

þ isgnsinðZ ÞblogsinðZ ÞþZ L

:

ð14Þ

Case 2. (Z < =2 or Z > =2) is when “1þ” means adifference (d b) is computed:

S bðZ Þ ¼ logb 1 blogcosðZ ÞþZ L

þ isgnsinðZ Þb

logsinðZ ÞþZ L¼ logb

bd bðlogcosðZ ÞþZ LÞ

þ isgnsinðZ ÞblogsinðZ ÞþZ L

:

ð15Þ

Uptothispoint,insteadofgivingseparaterealandimaginaryparts, we have described a complex function equivalent to(10) in a novel way that eventually will be amenable to beingcomputed with a conventional real LNS ALU. Of course, touse such an ALU, it will be necessary to compute the real andimaginary parts separately. This derivation, which is quiteinvolved, is given in the Appendix.

5 IMPLEMENTATION

Fig. 1 shows a straightforward implementation of (13)—more precisely its real (26) and imaginary (34) parts


TABLE 1Table of Required CLNS ALU Sign Logic



derived in the Appendix. The operation of the sign logicunit is summarized in Table 1, which takes into account thespecial cases described in the Appendix. The sign logicunit controls the Four-Quadrant (4Q) Correction Unit, thesb=d b unit, and the output Multiplexor (Mux). This logicoperates in two modes: complex mode, which is the focusof this paper, and real mode (which uses the sb/d b unit toperform traditional LNS arithmetic). In real mode, Z ¼ 0 isused to represent a positive sign, and Z ¼ is used torepresent a negative sign. Given the angular quantization,the LNS sign bit corresponds to the most significant bit of Z , with the other bits masked to zero.

In order to minimize the cost of CLNS hardwareimplementation, we exploit several techniques to transformthe arguments of the function approximation units intolimited ranges where approximation is affordable. Assum-ing b ¼ 2, the reduction for the addition logarithm is typicalof real LNS implementations:

s2ðxÞ ¼

0; x < 2wez ;log2ð1 þ 2xÞ; 2wez < x 0;x þ s2ðxÞ; 0 < x < 2wez ;x; x 2wez ;

8>><>>: ð16Þ

where wez kL describes the wordsize needed to reach the

“essential zero,” [16] which is the least negative value thatcauses log2ð1 þ 2xÞ to be quantized to zero. We use similarrange reduction for the subtraction logarithm when z is farfrom zero, together with cotransformation [21] for z near

zero. The scaled arctangent is nearly equal to s2 for negativearguments (only because we choose b ¼ 2 and 4= scaling),although the treatment of positive arguments reflects theasymptotic nature of the arctangent:

a2ðxÞ ¼

0; x < 2wez ;4= arctanð2xÞ; 2wez < x 0;2 a2ðxÞ; 0 < x < 2wez ;

2; x 2wez :

8>><>>: ð17Þ

The range reduction for the trigonometric functionsinvolves one case for each octant:

c2ðxÞ ¼

c2ðxÞ; x < 0;logcosð=4xÞ; 0 x < 1;d 2ð2 c2ð2 xÞÞ=2; 1 x < 2;2wez þ1; x ¼ 2;d 2ð2 c2ðx 2ÞÞ=2; 2 < x < 3;c2ð4 xÞ; 3 x < 4;c2ðx 8Þ; x 4:

8>>>>>>>><>>>>>>>>:

ð18Þ

We can reuse (18) to yield logsin

ð=4

xÞ ¼ c2ðx þ2

Þ as wellas logcosð=4xÞ ¼ c2ðxÞ. Instead of 1, the output at thesingularity, 2wez þ1 is large enough to trigger essential zero inthe later units, but is small enough to avoid overflow.

6 DIRECT FUNCTION LOOKUP

There are several possible hardware realizations for (26)and (34), which appear in the Appendix. When f L and f are small, it is possible to use a direct table lookup for sb,d b, ab, and cb. This implementation allows round-to-nearestresults for each functional unit, reducing roundoff errors in(26) and (34). Fig. 2a shows the errors in computing the

real part (26) as a function of Z L and Z with such round-to-nearest tables when f L ¼ f ¼ 7. These errors have aRoot-Mean-Squared (RMS) value around 0.003. Fig. 2bshows errors for the imaginary part (34), with an RMSvalue around 0.002. The errors in these figures spikeslightly near Z L ¼ 0, Z ¼ because of the inherentsingularity in S bðZ Þ. In contrast to later figures, here theonly noticeable error spike occurs near the singularity.

The memory requirements for each of the sb, d b, andab direct lookup tables (with a Z L input) is 2wez þf L words,assuming reduction rules like (16) and (17), which allowthe interval spanned by these tables to be as small as

ð2wez

; 0Þ. The trigonometric tables, with Z inputs, require22þf words because Z is reduced to 0 Z < 4 (equivalentto 0 Z < ) and two integer bits are sufficient to coverthis dynamic range. Instead of the more involved (18) ruleused in later sections, cbðxÞ ¼ cbðxÞ suffices for directlookup.

Since there are two instances of trigonometric tables andfour instances of other tables in Fig. 1, the total number of words required for naıve direct lookup is 2f þwez þ2 þ 2f þ3

assuming f ¼ f L ¼ f . Table 2 shows the number of wordsrequired by prior 2D direct lookup methods [3], [19]compared to our proposed 1D direct lookup implementa-tion of (26) and (34). Savings occur for f 5, and grow

exponentially more beneficial as f increases.For f 8, the cost of direct implementation grows

exponentially, although at a slower rate than the priormethods. One approach to reduce table size is to use


Fig. 1. CLNS ALU with complex and real modes.



7.4 A CLNS Coprocessor

We extended the FloPoCo library with real LNS arithmetic.To perform the LNS addition, the input domain of sb is splitinto three subdomains: ½2wez ; 8½, ½8; 4½, and ½4; 0½,where wez is such that sb evaluates to zero in 1; 2wez ½.This is done to account for the exponential nature of sbðz Þfor smaller values of z . The same partitioning was used inFPLibrary. We then use an HOTBM operator to evaluate sb

inside each interval. This partitioning scheme allows

significant area improvements compared to a regularpartitioning as performed by HOTBM.

Likewise, the domain of d b is split into intervals. Thelower range ½2wez ; t½ is evaluated using several HOTBMoperators just like sb. As we get closer to the singularity of d bnear 0, the function becomes harder to evaluate usingpolynomial approximations. The upper range ½t; 0½ isevaluated using cotransformation. This approach is similarto the one used by Vouzis et al. in FPLibrary [21].

We used this extended FloPoCo library to generate theVHDL code for the functional units in a CLNS ALU atvarious precisions. One component was generated foreach function involved in the CLNS addition. Real LNSfunctions sb and d b are generated by the extendedFloPoCo. The function logcosð=4xÞ is implemented asan HOTBM component, and 4= arctanð2xÞ is evaluatedusing several HOTBM components with the same inputdomain decomposition as sb.

Given the complexity of HOTBM and the influence of synthesizer optimizations, it is difficult to estimate preciselythe best value of the d b threshold t, cotransformationparameter j, and HOTBM order using simple formulas.We synthesized the individual units for a Virtex-4 LX25with Xilinx ISE 10.1 using various parameters. For eachprecision tested, the combination of parameters yielding thesmallest area on this configuration was determined. Table 4sums up the parameters obtained. We plan to automate thissearch in the future by using heuristics relying on thelatency/area estimation framework integrated in FloPoCo.

7.5 Time/Area Trade-Off

The synthesis results are listed in detail in Tables 5 and 6,where s, s=d , c, and a are the areas of the respectiveunits, and s, s=d , c, and a are the corresponding delays.One implementation option would be to follow thecombinational architecture in Fig. 1, in which case the areais roughly s þ 2s=d þ c þ a because one d b together withab implements both logcos and logsin . (We neglect theinsignificant area for four fixed-point adders, a mux, andthe control logic.) The delay is 2 s=d þ c þ maxð a; sÞ.

An alternative, which saves area (s=d þ c þ a), usesthree cycles to complete the computation (so that the samehybrid unit may be reused for all three sb=d b usages). Thetotal area is less than double the area s=d of a stand-alonereal LNS unit. The delay is roughly 3 s=d , since the hybridunit has the longest delay. Despite using only about half the area of the combinational alternative, the three-cycleapproach produces the result in about the same time as thecombinational option.

It is hard to make direct comparisons, but the CORDICimplementation of CLNS reported in [13] uses 10 stages.Some of those stages have relatively large multipliers, similarto what is generated for our circuit by FloPoCo. If suchCORDIC stages have a delay longer than 0.3 of our proposedclock cycle, our approach would be as fast or faster than [13].

7.6 Accuracy

Fig. 3a shows the errors in computing the real part (26) as afunction of Z L and Z using the f ¼ 7 function unitsgenerated by FloPoCo. These errors have an RMS valuearound 0.007. This is larger than the errors when computingthe direct lookup with the round-to-nearest method shownin Fig. 2a, but overall the errors appear uniform except, as before, near the singularity. As such, this may be acceptable

for some applications. In contrast, Fig. 3b shows errors for theimaginary part (34), which are significantly more than theerrors for direct lookup round-to-nearest shown in Fig. 2b.These errors have an RMS value around 0.008.

To determine the cause of the much larger error,simulations were run, in which each of the function unitsgenerated by FloPoCo was replaced with round-to-nearestunits. When both logsin and logcos are computed withround-to-nearest, the imaginary result has smaller errorswhich are much closer to those of Fig. 2b. Using sinðxÞ ¼ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

1 cos2ðxÞp

contributes to the extra errors, which are thenmagnified by the slightly larger roundoff errors from all theother FloPoCo units.

7.7 Novel Sine Approximation

The source of this additional error visible in Fig. 3b is thequantization of logcos as cb in (18). Extra guard bits could


TABLE 4Parameters Used for Synthesis

TABLE 3Table of Functions and Derivatives



help, but this, in turn, would require extra guard bits in allthe approximation units, including the very costly one ford b. We consider the cost of such a solution to be prohibitive.Although inexpensive methods [1] for dealing withlogsinð2xÞ have been proposed, they are not applicable here.Instead, we need an alternative way to compute logsinðxÞfor x 0. The novel approach, which we propose, is based

on the simple observation that for small x,sinðxÞ x: ð19Þ

The values of x, which are considered small enough to beapproximated this way, will depend on the precision, f . TheCLNS hardware uses a quantized angle, which means (19)can be restated as:

logsinð=4 xÞ logbðxÞ þ logbð=4Þ: ð20Þ

We wish to implement this without adding any additionaltables, delay, or complexity to the hardware (aside from atiny bit of extra logic), which means we rule out evaluating

logbðxÞ directly. Instead, we use a novel logarithm approx-imation [4] that only needs d b hardware plus a few trivialadders, and which is moderately accurate in cases like this,where x is known to be near zero:

logbðxÞ ¼ d bðxÞ logbðlnðbÞÞ þ x=2: ð21Þ

The error is on the order of x2=24. Substituting (21) into (20),we have

logsinð=4 xÞ d bðxÞ þ x=2 þ ðlogbð=4Þ logbðlnðbÞÞÞ:

ð22Þ

The binary value of

log2ð=4Þ log2ðlnð2ÞÞ ¼ 0:00101110001002

suggests 1=8, 3=16, or 23=128 are low-cost approximationsfor this constant. Fig. 4 shows the error of approximatinglogsin with the Pythagorean approach, ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

1 cos2ð=4 xÞp

(implemented as d 2ð2 c2ðxÞÞ=2), versus the novel approach(22) for f ¼ 7. For values of x near zero, the novel approach isconsistently better than the Pythagorean approach. Becauseof quantization, both plots are noisy which makes them cross

each other several times in the middle. For f ¼ 7, x ¼ 0:375 isa reasonable choice for the point at which the approximationswitches from the novel approach to the Pythagorean one.Replacing these cases in (18), we have


TABLE 5Area of Function Approximation Units (Slices) on Xilinx Virtex-4

Fig. 3. Error in S 2ðZ Þ using FloPoCo with Pythagorean (18) and f L ¼ f ¼ 7. (a) Real part. (b) Imaginary part.

TABLE 6Latency of Function Approximation Units (ns) on Xilinx Virtex-4



c2ðxÞ ¼

c2ðxÞ; x < 0;logcosð=4xÞ; 0 x < 1;d 2ð2 c2ð2 xÞÞ=2; 1 x < 1:625;d 2ðx 2Þ þ ð2 xÞ=2

þ ðlog2ð=4Þ log2ðlnð2ÞÞÞ; 1:625 x < 2;2wez þ1; x ¼ 2;d 2ð2 xÞ þ ðx 2Þ=2

þ ðlog2ð=4Þ log2ðlnð2ÞÞÞ; 2 < x < 2:375;d 2ð2 c2ðx 2ÞÞ=2; 2:375 x < 3;c2ð4 xÞ; 3 x < 4;c2ðx 8Þ; x 4:

8>>>>>>>>>>>>>>>><>>>>>>>>>>>>>>>>:

ð23Þ

The same function approximation units and pipelineschedule described earlier can implement (23) as easily as(18), with only two additional adders (and possibly aregister to make x available at the proper time).

Fig. 5a shows the error for the real part of the approximateS 2ðZ Þ using (23) for range reduction and otherwise usingf ¼ 7 FloPoCo-generated approximation units. Fig. 5b showsthe errors for the corresponding imaginary part. In bothfigures, especially for the imaginary part, the errors are muchcloser to the amount seen in round-to-nearest. The RMS

errors (0.006 for the real part and 0.005 for the imaginarypart) are in between those for round-to-nearest and those forPythagorean FloPoCo.

8 OBJECT-ORIENTED CLNS SIMULATION

In order to test the proposed techniques in a complex numberapplication, like theFFT, we would like to substitute differentarithmetic implementationswithout significantchange to theapplication code. The polymorphism of an object-orientedlanguage like Java makes such experimentation easy.Although operator overloading in languages like C++ would

make this slightly nicer, Java’s polymorphic method-callsyntax makes it fairly easy to describe the complex FFT butterfly succinctly:

t ¼ w½ j:mulðx½m þ iÞ;

x½m þ i ¼ x½m:subðtÞ;

x½m ¼ x½m:addðtÞ;

where the variables in the application are of an abstractclass, which defines four abstract methods (PLUSI(),inc(), recip(), and mul(CmplxAbs y)) that returnreferences to such a CmplxAbs object and two abstractmethods (real() and imag()) that return doubles. Inaddition, CmplxAbs defines several concrete methods thatcan be used in the application:

public CmplxAbs MINUS1(){return(PLUSI().mul(PLUSI()));}

public CmplxAbs div(CmplxAbs y){

return(this.mul(y.recip()));}public CmplxAbs add(CmplxAbs y){

return(this.mul((y.div(this)).inc()));}public CmplxAbs neg(){

return(this.mul(MINUS1()));}public CmplxAbs sub(CmplxAbs y){

return(this.add(y.neg()));}

Surprisingly, this is sufficient to define both polar andrectangular complex arithmetic implementations. Only threearithmetic methods ( mul, recip, and inc) need to beprovided for each derived class. Also, the derived classneeds to provide three utility methods (the accessors for the

constant i and the rectangular parts of this). Although bestsuited for CLNS implementations, this abstract class alsoworks with a rectangular class using two floating-pointinstance variables manipulated by (1) and (2) together withthe rectangular definition of reciprocal and incrementation.This rather unusual factoring of complex arithmetic has theadvantage that application code can be tested with therectangular implementation to isolate whether there are anyCLNS-related bugs in the application or in the classdefinition itself. Two CLNS classes derived from theabstract class store Z L and Z as integers. The first CLNSclass implements ideal, round-to-nearest 2D lookup of


Fig. 4. Errors for Pythagorean and novel logsinð=4 xÞ for 0 < x 1.

Fig. 5. Error in S 2ðZ Þ using FloPoCo with novel (23) and f L ¼ f ¼ 7. (a) Real part. (b) Imaginary part.



S bðZ Þ as in the prior literature as given by (11) and (12). Thesecond class, which is derived from the first CLNS class,implements the proposed approach (26) and (34), usingtables that can simulate any of the design alternativesdiscussed earlier. The first CLNS class defines all sixmethods; the second CLNS class inherits everything exceptits inc method. As CLNS formulas are involved and proneto implementation error, this object-oriented approach

reduces duplication of untested formulas. In other words,we test add, sub, etc., using simple rectangular definitions;we test CLNS methods with the more straightforward (11)and (12). Only when we are certain these are correct do wetest with the novel inc method in the grandchild class.

Having the two CLNS classes allows us to reuse the sameapplication code to see what effect the proposed ALU willhave compared to ideal CLNS arithmetic. In this case, theapplication is a 64-point radix-two FFT whose input is areal-valued 25 percent duty cycle square wave pluscomplex white noise. This is rerun 100 times with differentpseudorandom noise. On each run, several FFTs are

computed using the same data: nearly exact rectangulardouble-precision arithmetic, ideal (2D table lookup) f ¼ 7

CLNS, and variations (direct, FloPoCo-Pythagorean, FloPo-Co-novel-sine) of the proposed f ¼ 7 CLNS. The CLNSresults from each run are compared against the double-precision rectangular FFT on the same data. The ideal CLNSFFT has RMS error 0.00025 and maximum error of 0.0033.The proposed direct lookup approach has about 50 percenthigher RMS error (about 0.00035) and similar maximumerror (0.0029). The FloPoCo-Pythagorean approach didmuch worse (RMS error of 0.00073 and maximum error of 0.01). The FloPoCo-novel-sine approach is better (RMS error

of 0.00056 and maximum error of 0.0064). In other words,the best FloPoCo implementation loses about one bit of accuracy compared to the best possible (but unaffordable)f ¼ 7 CLNS implementation and the FloPoCo-Pythagoreanloses more. To put these errors in perspective, thequantization step for Z L is 27 0:0078, and on that scaleperhaps even the errors observed in the FloPoCo-Pythagor-ean approach may be acceptable for some applications.

9 CONCLUSIONS

A new addition algorithm for complex log-polar addition

was proposed, based around an existing real-valued LNSALU. The proposed design allows this ALU to continue to beused for real arithmetic, in addition to the special complexfunctionality described in this paper. The novel CLNSalgorithm requires extra function units for log-trigonometricfunctions, which may have application beyond complexpolar representation. Two implementation options for thenovel special units were considered: medium accuracy directlookup and higher accuracy interpolation (as generated by atool called FloPoCo). The errors resulting from these twoalternatives were studied. As expected, round-to-nearestdirect lookup tables give the lowest roundoff errors. We

show that the errors in the FloPoCo implementation can bereduced by using a more accurate logsin unit than one basedon Pythagorean calculations from logcos . Instead, weproposed a novel algorithm for logsin of arguments near

zero which uses the same hardware as the Pythagorean-onlyapproach used in our earlier research [5].

We compared these implementations of our novel CLNSalgorithm to all known prior approaches. Our novel methodhas speed comparable to CORDIC, and uses orders-of-magnitude less area than prior 2D table lookup approaches.As such, our approach provides a good compromise between speed and area. Considering that CLNS offerssmaller bus widths than conventional rectangular repre-sentation of complex numbers, our proposed algorithmmakes the use of this rather unusual number system inpractical algorithms, like the FFTs in OFDM, more feasible.Using an object-oriented simulation, we have observed theadditional errors introduced by our proposed methods inan FFT simulation. With our most accurate direct lookupapproach, this is on the order of half of a bit. With theFloPoCo-novel-sine method, the additional error is aboutone bit. Given that CLNS offers many bit savings comparedto rectangular arithmetic, this level of additional errorseems a reasonable trade-off in exchange for the huge

memory savings our proposed CLNS ALU offers.In the course of implementing the CLNS ALU, we

uncovered and corrected several bugs in the HOTBMimplementation of FloPoCo. We also made it easier to use by allowing the user to select the input range and scale of the target function instead of having to map its rangesmanually to ½0; 1½ ! ½ 1; 1½. Hence, our work contributes tothe maturity of the FloPoCo tool beyond the field of LNS.

APPENDIX

In Case 1 (=2 < Z < =2), the real part of (14) can bedescribed as (24). There is a similar derivation

<ðS bðZ ÞÞ

¼ <

logb

bsbðlogcosðZ ÞþZ LÞ þ i sgnsinðZ ÞblogsinðZ ÞþZ L

¼ logb

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffib2sbðlogcosðZ ÞþZ LÞ þ b2ðlogsinðZ ÞþZ LÞ

p ¼

logbðb2sbðlogcosðZ ÞþZ LÞ þ b2ðlogsinðZ ÞþZ LÞÞ

2

¼logbðb2ðlogsinðZ ÞþZ LÞþsbð2ðsbðlogcosðZ ÞþZ LÞðlogsinðZ ÞþZ LÞÞÞÞ

2

¼ logsinðZ Þ þ Z L

þsbð2ðsbðlogcosðZ Þ þ Z LÞ ðlogsinðZ Þ þ Z LÞÞÞ

2 : ð24Þ

<ðS bðZ ÞÞ

¼ <

logb

bd bðlogcosðZ ÞþZ LÞ þ i sgnsinðZ ÞblogsinðZ ÞþZ L

¼ logb

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffib2d bðlogcosðZ ÞþZ LÞ þ b2ðlogsinðZ ÞþZ LÞ

p ¼

logbðb2d b ðlogcosðZ ÞþZ LÞ þ b2ðlogsinðZ ÞþZ LÞÞ

2

¼logbðb2ðlogsinðZ ÞþZ LÞþsbð2ðd bðlogcosðZ ÞþZ LÞðlogsinðZ ÞþZ LÞÞÞÞ

2

¼ logsinðZ Þ þ Z L

þsbð2ðd bðlogcosðZ Þ þ Z LÞ ðlogsinðZ Þ þ Z LÞÞÞ

2 : ð25Þ




log

j sinðZ ÞbZ L j

j1 þ cosðZ ÞbZ L j

!

¼

sbðlogcosðZ Þ þ Z LÞ ðlogsinðZ Þ þ Z LÞ

;

=2 < Z < =2;

d bðlogcosðZ Þ þ Z LÞ ðlogsinðZ Þ þ Z LÞ

;

Z < =2 or Z > =2:

8>>><>>>:

ð31Þ

=ðS bðZ ÞÞ

¼

þ arctan j sinðZ ÞbZ L j


; Z < =2 and

logcosðZ Þ þ Z L > 0;

arctan j sinðZ ÞbZ L j


; Z =2 and

ðZ > =2 or logcosðZ Þ þ Z L < 0Þ;


j1 þ cosðZ ÞbZ L j ; =2 < Z < 0 and

ðZ > =2 or logcosðZ Þ þ Z L < 0Þ;



; Z > =2 and




; Z =2 and ðZ < =2

or logcosðZ Þ þ Z L < 0Þ;



; 0 < Z ;

=2; Z < =2 and logcosðZ Þ þ Z L ¼ 0;

=2; Z > =2 and logcosðZ Þ þ Z L ¼ 0:

8>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>:

ð32Þ

=ðS bðZ ÞÞ

¼

þ arctan j sinðZ ÞbZ L j


; Z < =2 and



j1 þ cosðZ ÞbZ L j ; Z =2 and

logcosðZ Þ þ Z L < 0;



; =2 < Z < 0;



; Z > =2 and




; Z =2 and



j1 þ cosðZ ÞbZ L j ; 0 < Z =2;

=2; Z < =2 andlogcosðZ Þ þ Z L ¼ 0;


8>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>:

ð33Þ

=ðS bðZ ÞÞ

¼

0; Z ¼ and logcosðZ Þ þ Z L < 0;

; Z ¼ and logcosðZ Þ þ Z L > 0;

þ arctanðbðd b ðlogcosðZ ÞþZ LÞðlogsinðZ ÞþZ LÞÞÞ;

< Z =2 and logcosðZ Þ þ Z L > 0;

arctanðbðd bðlogcosðZ ÞþZ LÞðlogsinðZ ÞþZ LÞÞÞ; Z =2

and logcosðZ Þ þ Z

L < 0;

arctanðbðsb ðlogcosðZ ÞþZ LÞðlogsinðZ ÞþZ LÞÞÞ;

=2 < Z < 0; 0; Z ¼ 0;

arctanðbðd bðlogcosðZ ÞþZ LÞðlogsinðZ ÞþZ LÞÞÞ; Z > =2

and logcosðZ Þ þ Z L > 0;

arctanðbðd bðlogcosðZ ÞþZ LÞðlogsinðZ ÞþZ LÞÞÞ; Z > =2 and


arctanðbðsbðlogcosðZ ÞþZ LÞðlogsinðZ ÞþZ LÞÞÞ; 0 < Z =2;

=2; Z < =2 and logcosðZ Þ þ Z L ¼ 0;


8>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>: ð34Þ

REFERENCES

[1] M.G. Arnold, “Approximating Trigonometric Functions with theLaws of Sines and Cosines Using the Logarithmic NumberSystem,” Proc. Eighth EuroMicro Conf. Digital System Design,pp. 48-55, 2005.

[2] M.G. Arnold, T.A. Bailey, J.R. Cowles, and C. Walter, “Analysis of Complex LNS FFTs,” Signal Processing Systems SIPS 2001: Designand Implementation, F. Catthoor and M. Moonen, eds., pp. 58-69,IEEE Press, 2001.

[3] M.G. Arnold, T.A. Bailey, J.R. Cowles, and C. Walter, “FastFourier Transforms Using the Complex Logarithm NumberSystem,” J. VLSI Signal Processing, vol. 33, no. 3, pp. 325-335, 2003.

[4] M.G. Arnold, T.A. Bailey, J.R. Cowles, and M.D. Winkel,“Arithmetic Co-Transformations in the Real and ComplexLogarithmic Number Systems,” IEEE Trans. Computers, vol. 47,no. 7, pp. 777-786, July 1998.

[5] M.G. Arnold and S. Collange, “A Dual-Purpose Real/ComplexLogarithmic Number System ALU,” Proc. 19th IEEE Symp.Computer Arithmetic, pp. 15-24, June 2009.

[6] N. Brisebarre, F. de Dinechin, and J.-M. Muller, “Integer andFloating-Point Constant Multipliers for FPGAs,” Proc. Int’l Conf.

Application-Specific Systems, Architectures and Processors, pp. 239-244, 2008.

[7] N. Burgess, “Scaled and Unscaled Residue Number System toBinary Conversion Techniques Using the Residue NumberSystem,” Proc. 13th Symp. Computer Arithmetic (ARITH ’97),pp. 250-257, Aug. 1997.

[8] J. Detrey and F. de Dinechin, “A Tool for Unbiased Comparison between Logarithmic and Floating-Point Arithmetic,” J. VLSI Signal Processing, vol. 49, no. 1, pp. 161-175, 2007.

[9] J. Detrey and F. de Dinechin, “Table-Based Polynomials for FastHardware Function Evaluation,” Proc. IEEE Int’l Conf. Application-Specific Systems, Architecture and Processors, pp. 328-333, July 2005.

[10] F. de Dinechin, C. Klein, and B. Pasca, “Generating High-Performance Custom Floating-Point Pipelines,” Proc. Int’l Conf.Field-Programmable Logic, Aug. 2009.

[11] F. de Dinechin, B. Pasca, O. Cret , and R. Tudoran, “An FPGA-Specific Approach to Floating-Point Accumulation and Sum-of-Products,” Field-Programmable Technology, pp. 33-40, IEEE Press,2008.

[12] D.M. Lewis, “An Architecture for Addition and Subtraction of Long Word Length Numbers in the Logarithmic Number

System,” IEEE Trans. Computers, vol. 39, no. 11, pp. 1325-1336,Nov. 1990.[13] D.M. Lewis, “Complex Logarithmic Number System Arithmetic

Using High Radix Redundant CORDIC Algorithms,” Proc. 14thIEEE Symp. Computer Arithmetic, pp. 194-203, Apr. 1999.




[14] R. Mehmke, “Additionslogarithmen fur Complexe Grossen,”Zeitschrift fu r Math. Physik, vol. 40, pp. 15-30, 1895.

[15] R. Muscedere, V.S. Dimitrov, G.A. Jullien, and W.C. Miller,“Efficient Conversion from Binary to Multi-Digit Multidimen-sional Logarithmic Number Systems Using Arrays of RangeAddressable Look-Up Tables,” Proc. 13th IEEE Int’l Conf. Applica-tion-Specific Systems, Architectures and Processors (ASAP ’02),pp. 130-138, July 2002.

[16] E.E. Swartzlander and A.G. Alexopoulos, “The Sign/LogarithmNumber System,” IEEE Trans. Computers, vol. 24, no. 12, pp. 1238-

1242, Dec. 1975.[17] E.E. Swartzlander et al., “Sign/Logarithm Arithmetic for FFT

Implementation,” IEEE Trans. Computers, vol. 32, no. 6, pp. 526-534, June 1983.

[18] P.R. Turner, “Complex SLI Arithmetic: Representation, Algo-rithms and Analysis,” Proc. 11th IEEE Symp. Computer Arithmetic,pp. 18-25, July 1993.

[19] P. Vouzis and M.G. Arnold, “A Parallel Search Algorithm forCLNS Addition Optimization,” Proc. IEEE Int’l Symp. Circuits andSystems (ISCAS ’06), pp. 20-24, May 2006.

[20] P. Vouzis, M.G. Arnold, and V. Paliouras, “Using CLNS for FFTsin OFDM Demodulation of UWB Receivers,” Proc. IEEE Int’lSymp. Circuits and Systems (ISCAS ’05), pp. 3954-3957, May 2005.

[21] P. Vouzis, S. Collange, and M.G. Arnold, “CotransformationProvides Area and Accuracy Improvement in an HDL Library forLNS Subtraction,” Proc. 10th EuroMicro Conf. Digital System Design

Architectures, Methods and Tools, pp. 85-93, Aug. 2007.[22] P. Vouzis, S. Collange, and M.G. Arnold, “LNS Subtraction Using

Novel Cotransformation and/or Interpolation,” Proc. 18th Int’lConf. Application-Specific Systems, Architectures and Processors,pp. 107-114, July 2007.

[23] http://www.wikipedia.org/wiki/arctangent, Oct. 2008.[24] http://www.xlnsresearch.com, 2010.

Mark G. Arnold received the BS and MSdegrees from the University of Wyoming, andthe PhD d eg re e fro m the Unive rsity o fManchester Institute of Science and Technol-ogy (UMIST), United Kingdom. From 1982 to2000, he was on the faculty of the Universityof Wyoming, Laramie. From 2000 to 2002, hewas a lecturer at UMIST, United Kingdom. In2002, he joined the faculty of Lehigh Uni-versity, Bethlehem, Pennsylvania. In 1976, he

codeveloped SCELBAL, the first open-source floating-point high-levellanguage for personal computers. In 1997, he received the BestPaper Award from Open Verilog International for describing theVerilog Implicit To One-hot (VITO) tool that he codeveloped. In 2007,he received the Best Paper Award from the Application-SpecificSystems, Architectures and Processors (ASAP) Conference. He isthe author of Verilog Digital Computer Design . His current researchinterests include computer arithmetic, hardware description lan-guages, microrobotics and embedded control, and multimedia andapplication-specific systems. He is a member of the IEEE.

Sylvain Collange received the master’s degreein computer science from the Ecole NormaleSuprieure de Lyon, France, in 2007. He iscurrently in the final year of a PhD program atthe University of Perpignan, France. In 2006, heworked as a research intern at Lehigh Uni-versity in Pennsylvania. He joined NVIDIA inSanta Clara, California, for an internship in2010. His current research focuses on parallelcomputer architectures. His other interests

include computer arithmetic and general-purpose computing ongraphics processing units.

. For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.


real/complex logarithmic number system alu

Documents