TRANSCRIPT
Slide 1
Hybrid Floating-Point Technique Yields a 1.2 Giga-sample-per-second, 32-to-2048-point Floating-Point FFT in a Single FPGA
HPEC 2006, Poster Session B.4, 20 September 2006
Ray Andraka, P.E., President, Andraka Consulting Group, Inc.
Copyright 2006 Andraka Consulting Group, Inc. All rights reserved.
Slide 2
Floating point addition & subtraction is resource intensive
[Block diagram of the floating-point adder/subtractor datapath: exponent difference, exchange network, barrel-shift denormalize of the smaller mantissa, mantissa add/sub, leading-zeros detect, barrel-shift renormalize, rounding, exponent adder; mantissa/exponent A and B in, mantissa/exponent out]
Slide 3
Apply floating point to larger functions
• Floating point is typically applied at add- and multiply-level operations
• Instead, construct higher-order operations from fixed-point operators
  – Phase rotator
  – FFT
• Apply floating point to those more complicated operators
  – Denormalize to convert mantissas to fixed point plus a common scale
  – Pass the exponent around the series of fixed-point operations
  – Renormalize after several operations rather than after each one
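The denormalize/renormalize bookkeeping described above can be sketched in software. This is an illustrative model, not the poster's hardware: `denormalize` and `renormalize` are hypothetical helpers operating on (mantissa, exponent) pairs with 24-bit mantissas and an implied radix point after the MSB, so a value is m * 2^(e - 23).

```python
def denormalize(pairs):
    """Shift each mantissa right so all share the largest exponent."""
    common = max(e for _, e in pairs)
    return [m >> (common - e) for m, e in pairs], common

def renormalize(mantissa, exponent, width=24):
    """Left-justify the result into `width` bits, adjusting the exponent."""
    if mantissa == 0:
        return 0, 0
    while abs(mantissa) >= (1 << width):       # overflow: shift down
        mantissa >>= 1
        exponent += 1
    while abs(mantissa) < (1 << (width - 1)):  # left-justify: zero-fill LSBs
        mantissa <<= 1
        exponent -= 1
    return mantissa, exponent

# Three addends (1.101*2^5 = 52, 1.101*2^3 = 13, 1.011*2^4 = 22) summed
# with plain fixed-point adds and a single renormalize at the end.
mants, exp = denormalize([(0b1101 << 20, 5), (0b1101 << 20, 3), (0b1011 << 20, 4)])
total = sum(mants)                            # fixed-point adds, no per-op shifts
result, result_exp = renormalize(total, exp)  # 87 == result * 2**(result_exp - 23)
```

The point of the structure is that only one barrel-shifter pass is needed after the whole sum, rather than a full denormalize/renormalize pair inside every add.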
Slide 4
Apply floating point to larger functions
[Block diagram of the same wrapper applied once around a fixed-point function: exponent difference against the max exponent, barrel-shift denormalize of the mantissas, fixed-point function, leading-zeros detect, barrel-shift renormalize, rounding, exponent adder; mantissas/exponents in, mantissa/exponent out]
Slide 5
Floating point sum has only as much precision as larger addend
• Addition requires both addends to have the same scale
  – Radix points must align
  – Addition is inherently a fixed-point operation
• The smaller addend's mantissa is right-shifted until its exponent matches the larger's
  – The exponent increments with each shift
  – Right shifts truncate LSBs
  – Truncated LSBs are lost
• The sum is left-shifted to left-justify it
  – LSBs are zero-filled
  – No improvement to precision
Examples:
– Different exponents (LSBs of B are lost):
  A = 1.101 * 2^5
  B = 1.101 * 2^3 = 0.01101 * 2^5
  A + B = (1.101 + 0.011) * 2^5
        = (10.000) * 2^5 = (1.0000) * 2^6
– Renormalizing (sum LSBs are filled with 0s):
  A = 1.101 * 2^5
  B = 1.011 * 2^5
  A - B = (1.101 - 1.011) * 2^5
        = (0.010) * 2^5
        = (1.000) * 2^3
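The alignment-and-truncation behavior in the first example can be modeled directly. This is a toy sign-magnitude adder with a 4-bit mantissa, an illustration rather than IEEE rounding:

```python
def align_add(ma, ea, mb, eb, bits=4):
    """Add two positive floats whose `bits`-bit integer mantissas carry an
    implied radix point after the MSB (so 0b1101 with exponent 5 is 1.101*2^5)."""
    if ea < eb:                      # make A the larger-exponent addend
        ma, ea, mb, eb = mb, eb, ma, ea
    mb >>= (ea - eb)                 # denormalize B: truncated LSBs are lost
    s, e = ma + mb, ea
    while s >= (1 << bits):          # renormalize the sum back to `bits` bits
        s >>= 1
        e += 1
    return s, e

# A = 1.101 * 2^5 (= 52), B = 1.101 * 2^3 (= 13)
s, e = align_add(0b1101, 5, 0b1101, 3)
# s, e == (0b1000, 6): 1.000 * 2^6 = 64, not the exact 65; B's LSBs were lost
```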
Slide 6
Phase rotation does not change amplitude
• Re(y) = Re(x) * cos(θ) - Im(x) * sin(θ)
• Im(y) = Re(x) * sin(θ) + Im(x) * cos(θ)
• The magnitudes of the individual I and Q components change, but the complex magnitude is not altered
• No precision is lost by treating I and Q with a common exponent
  – The complex operation is limited to the precision of the larger component anyway
• Using a common exponent for I and Q reduces hardware
  – Single copy of the exponent logic
  – No rescaling of I with respect to Q
• Simplifies the rotator
  – Fixed-point complex multiply (the smaller of I or Q is denormalized)
  – Fixed-point sines and cosines
  – The output renormalize is a +/-1 bit shift
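The amplitude-preserving property is easy to check numerically. A small sketch, where `rotate` is just an illustrative name:

```python
import cmath
import math

def rotate(x: complex, theta: float) -> complex:
    """Phase-rotate x by theta: the I/Q split changes, |x| does not."""
    return x * cmath.exp(1j * theta)

x = 3 + 4j                              # |x| = 5
y = rotate(x, math.pi / 3)
assert math.isclose(abs(y), abs(x))     # complex magnitude is unchanged
```

Because the magnitude is invariant, a single exponent shared by I and Q loses nothing relative to per-component exponents, which is what lets the rotator's complex multiply run in pure fixed point.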
Slide 7
FFT butterflies are only as precise as largest input
• Cooley-Tukey FFT butterfly
  – Sum and difference of a pair of complex inputs
  – One input is rotated by a "twiddle factor" phasor
• Rotation does not affect scale
• The smaller input is right-shifted
  – Shifted to match scale
  – LSBs are lost
• Both outputs have the same LSB weight before renormalizing
• Renormalizing does not add precision (it zero-fills LSBs)
• The output is 1 bit wider than the input
  – Sum of similar-sized addends
[Diagram: FFT butterfly, two complex inputs, one multiplied by the twiddle factor k = cos(w) + j*sin(w), summed and differenced into two complex outputs]
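The butterfly itself is only a few lines. An illustrative model of the radix-2 decimation-in-time butterfly in complex arithmetic, not the poster's fixed-point hardware:

```python
import cmath

def butterfly(a: complex, b: complex, k: complex):
    """Sum and difference of two complex inputs, one rotated by the
    twiddle factor k; the rotation does not change b's scale."""
    t = k * b
    return a + t, a - t      # both outputs share the same LSB weight

# twiddle k = e^(-j*pi/2) rotates b = j onto the real axis
y0, y1 = butterfly(1 + 0j, 0 + 1j, cmath.exp(-1j * cmath.pi / 2))
# y0 is approximately 2, y1 approximately 0
```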
Slide 8
FFT output is only as precise as largest input
• Cascade of butterfly elements
• Each output is essentially an adder tree with phase rotators
  – Rotators don't change scale
  – Inputs are right-shifted to match the scale of the largest input
  – Intermediate renormalizing is not effective
  – There is a term from every FFT input
• 1 bit of growth per stage
  – Renormalizing maintains width
  – Alternative: grow the word width
• Similar effect in other FFTs (Winograd, Sande-Tukey, Singleton, etc.)
[Diagram: cascade of butterfly stages with twiddle-factor (k) multipliers]
Slide 9
Fixed Point FFT Replaces Floating Point FFT
• Denormalize inputs
– Shift each input right to match scale of largest
• Perform fixed point FFT
– Pass common exponent around it
– Input width = mantissa bits
– Maximum 1 bit growth per equivalent radix 2 stage
• Renormalize outputs
– Add common exponent to delta exponent from renormalize
[Block diagram: mantissas are right-shifted by n against the max exponent (denormalize), pass through the fixed-point FFT, then are left-shifted back (renormalize); the common exponent is carried around the FFT and summed with the renormalize shift]
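The whole scheme can be sketched end to end in software. Everything here is an illustrative model: a recursive integer radix-2 FFT with 14-bit scaled-integer twiddles stands in for the proprietary kernel, and `float_fft` is a hypothetical wrapper that denormalizes all components to the max exponent and renormalizes the outputs.

```python
import math

def fft_fixed(re, im, tw_bits=14):
    """Recursive radix-2 DIT FFT on integer mantissas.  Twiddles are
    scaled integers (Q1.14); products are shifted back down, so the
    words grow roughly one bit per radix-2 stage from the adds."""
    n = len(re)
    if n == 1:
        return re[:], im[:]
    er, ei = fft_fixed(re[0::2], im[0::2], tw_bits)
    dr, di = fft_fixed(re[1::2], im[1::2], tw_bits)
    R, I = [0] * n, [0] * n
    for k in range(n // 2):
        c = round(math.cos(2 * math.pi * k / n) * (1 << tw_bits))
        s = -round(math.sin(2 * math.pi * k / n) * (1 << tw_bits))
        tr = (dr[k] * c - di[k] * s) >> tw_bits    # rotate the odd branch
        ti = (dr[k] * s + di[k] * c) >> tw_bits
        R[k],          I[k]          = er[k] + tr, ei[k] + ti
        R[k + n // 2], I[k + n // 2] = er[k] - tr, ei[k] - ti
    return R, I

def float_fft(xs, mant_bits=24):
    """Hypothetical wrapper matching the slide: denormalize every real and
    imaginary component to the max exponent, run the fixed-point FFT, then
    renormalize by folding the common exponent back in."""
    comps = [x.real for x in xs] + [x.imag for x in xs]
    emax = max(math.frexp(v)[1] for v in comps)           # common exponent
    scale = mant_bits - emax
    re = [round(math.ldexp(x.real, scale)) for x in xs]   # denormalize
    im = [round(math.ldexp(x.imag, scale)) for x in xs]
    R, I = fft_fixed(re, im)
    return [complex(math.ldexp(r, -scale), math.ldexp(i, -scale))
            for r, i in zip(R, I)]                        # renormalize
```

The integer adds are where the one-bit-per-stage word growth noted above shows up; the `>> tw_bits` shifts and the final renormalize are the only shift operations left in the datapath.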
Slide 10
Advantages and Limitations
• Advantages
– Large reduction in required hardware
– Less complexity means higher clock rates, smaller parts
• Limitations
– Word width grows for each radix 2 stage
• Becomes excessive for large FFTs
– Max Exponent needed at beginning of set
• Problem for large sequential FFTs
– Use periodic renormalization to manage word widths
• A few bits of growth don't significantly affect timing
• Words are not limited to specific widths in an FPGA
– Fixed width assets like DSP48s limit practical word sizes.
– Find balance between precision, growth and renormalizing stages
Slide 11
Small FFTs as building blocks
• Larger FFT constructed from small FFTs with “mixed radix” algorithm
– Similar to Cooley-Tukey decomposition
• Arbitrarily large FFTs using small off-the-shelf kernels
• Combination uses FFT plus phase rotator and reorder memory
• “In-place” operation (results written to same memory locations)
[Diagram: fill along rows -> FFT along rows -> multiply by e^(-j2πkn/N) -> FFT down columns -> read down columns]
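The row/column combine can be demonstrated with a naive DFT standing in for the small kernels. An illustrative sketch; the row/column orientation used here is one common Cooley-Tukey convention and may be the transpose of the poster's figure:

```python
import cmath

def dft(v):
    """Naive DFT standing in for a small off-the-shelf FFT kernel."""
    n = len(v)
    return [sum(v[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]

def mixed_radix_fft(x, n1, n2):
    """N = n1*n2 mixed-radix combine: column FFTs, twiddle multiply,
    row FFTs, then read the results down columns."""
    N = n1 * n2
    # Fill along rows: A[a][b] = x[n2*a + b]
    A = [x[n2 * a:n2 * a + n2] for a in range(n1)]
    # n1-point FFT down each column b
    cols = [dft([A[a][b] for a in range(n1)]) for b in range(n2)]
    # Twiddle multiply: G[c][b] = cols[b][c] * e^(-2*pi*j*b*c/N)
    G = [[cols[b][c] * cmath.exp(-2j * cmath.pi * b * c / N) for b in range(n2)]
         for c in range(n1)]
    # n2-point FFT along each row c; output index is c + n1*d
    rows = [dft(G[c]) for c in range(n1)]
    return [rows[c][d] for d in range(n2) for c in range(n1)]
```

For N = n1 * n2 this reproduces the full N-point DFT exactly while using only n1- and n2-point kernels plus one twiddle multiply per element, which is the structure the hardware pipeline exploits.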
Slide 12
Winograd FFT
• Different factorization
• Minimizes multiplies
  – Advantageous for hardware implementation
  – 74 adds and 18 real multiplies for a 16-point Winograd
  – 176 adds and 72 real multiplies for a 16-point Cooley-Tukey
• Irregular data sequence
  – Difficult for shared memory
  – Easy when reorder memory is distributed
[Diagram: Winograd kernel structure, reorder stages interleaved with weight multipliers]
Slide 13
32 to 2048 point mixed radix FFT
• 2K FFT is 8 x 256 mixed radix
• 256 point is 16 x 16 mixed radix
• Combined algorithms: 2K = 8 x 16 x 16
• Data arranged in cube, FFT along each dimension
• Reorder at input and output (not shown)
• Kernel is a proprietary 1/4/8/16-point Winograd kernel
  – Each kernel has a floating-point wrapper
[Block diagram: 1/8-point FFT -> phase rotator -> data reorder (4K-sample BRAM) -> 4/8/16-point FFT -> phase rotator -> data reorder (512-sample BRAM) -> 8/16-point FFT; the last two kernels alone form the 32/64/128/256-point FFT]
Slide 14
32-2K point FFT statistics
• Speed: 400 MS/sec per FFT engine (3 in FPGA)
– 400MHz clock in XC4VSX55-10 (slowest speed grade)
– 1 complex sample per clock in and out continuous
– Latency: ~430 + 3*FFT length + (32,64,128 or 256) clocks
• Utilization – less than 30% of XC4VSX55
– DSP48s: 151
– Slice flip-flops: 9707
– RAMB16s: 69
– LUTs: 7736 (4975 are SRL16s)
• Precision
– 30-35 bit mantissa internal, 8 bit exponents
– IEEE single precision input and output
– Matches Matlab FFT to +/- 1 LSB of output mantissa
Slide 15
1.2 GSample/sec IEEE floating point FFT
[Block diagram: input buffer demultiplexing to three parallel 32-to-2K-point floating-point FFT engines, recombined in an output buffer]
Slide 16
Who is Andraka Consulting Group?
• Exclusively FPGAs since 1994
• Leading industry expert on DSP in FPGAs
– Charter Xilinx ‘Xperts’ partner
– First published FIR filter in FPGAs (1992)
– Fastest single threaded FFT kernel for FPGA
• Other current projects
– Beamforming digital receiver: 10 25MHz channels, 260 antennas, 500MS/sec input sample rate
– Cylindrical Sonar Array processor
– Other Digital receiver and radar projects
Slide 17
Floating Point Format
• Floating point dedicates part of the word to indicating scale (the exponent)
  – The radix point position is tracked as part of the data
  – Compare to fixed point, where the radix point is at an implied fixed location
  – Trades precision for dynamic range
  – Useful when the data range is unknown or spans a large range
• The IEEE single-precision floating-point format is a 32-bit word
  – The leftmost bit is the sign bit S: '1' is negative, '0' is positive
  – The next 8 bits are the exponent, in excess-127 format
  – The right 23 bits are the fraction; there is an implicit '1' bit to the left of the fraction except in special cases. The fraction's radix point is between the implied '1' and the leftmost bit of the fraction.
• S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF
• Number = (-1)^S * 2^(E-127) * 1.F
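The field layout above can be checked by unpacking a float's raw bits, a small sketch using Python's struct module:

```python
import struct

def decode_ieee754(x: float):
    """Unpack a single-precision float into its (sign, exponent, fraction)
    fields, per the S / EEEEEEEE / FFF... layout above."""
    (w,) = struct.unpack('>I', struct.pack('>f', x))
    s = (w >> 31) & 1
    e = (w >> 23) & 0xFF          # excess-127 exponent
    f = w & 0x7FFFFF              # 23 fraction bits, implicit leading 1
    return s, e, f

# -6.5 = (-1)^1 * 2^(129-127) * 1.101b: sign 1, exponent 129,
# fraction 0.101b * 2^23 = 0b101 << 20
s, e, f = decode_ieee754(-6.5)
```

Reconstructing (-1)^S * 2^(E-127) * (1 + F/2^23) from the unpacked fields returns the original value exactly for any normal number.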