TRANSCRIPT
Slide 1
Hybrid Floating-Point Technique Yields a 1.2 Giga-sample-per-second, 32-to-2048-point Floating-Point FFT in a Single FPGA
HPEC 2006, Poster Session B.4, 20 September 2006
Ray Andraka, P.E., President, Andraka Consulting Group, Inc.
Copyright 2006 Andraka Consulting Group, Inc. All rights reserved.
Slide 2
Floating point addition & subtraction is resource intensive
[Block diagram of the floating-point adder/subtractor datapath: exponent difference, exchange network, barrel-shift denormalize of the smaller mantissa, mantissa add/sub, leading-zeros detect, barrel-shift renormalize, rounding, exponent adder; mantissa/exponent A and B in, mantissa/exponent out]
Slide 3
Apply floating point to larger functions
• Floating point is typically applied at add- and multiply-level operations
• Instead, construct higher-order operations from fixed-point operators
  – Phase rotator
  – FFT
• Apply floating point to those more complicated operators
  – Denormalize to convert mantissas to fixed point plus a common scale
  – Pass the exponent around the series of fixed-point operations
  – Renormalize after several operations rather than after each one
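The denormalize/renormalize bookkeeping described above can be sketched in software. This is an illustrative model, not the poster's hardware: `denormalize` and `renormalize` are hypothetical helpers operating on (mantissa, exponent) pairs with 24-bit mantissas and an implied radix point after the MSB, so a value is m * 2^(e - 23).

```python
def denormalize(pairs):
    """Shift each mantissa right so all share the largest exponent."""
    common = max(e for _, e in pairs)
    return [m >> (common - e) for m, e in pairs], common

def renormalize(mantissa, exponent, width=24):
    """Left-justify the result into `width` bits, adjusting the exponent."""
    if mantissa == 0:
        return 0, 0
    while abs(mantissa) >= (1 << width):       # overflow: shift down
        mantissa >>= 1
        exponent += 1
    while abs(mantissa) < (1 << (width - 1)):  # left-justify: zero-fill LSBs
        mantissa <<= 1
        exponent -= 1
    return mantissa, exponent

# Three addends (1.101*2^5 = 52, 1.101*2^3 = 13, 1.011*2^4 = 22) summed
# with plain fixed-point adds and a single renormalize at the end.
mants, exp = denormalize([(0b1101 << 20, 5), (0b1101 << 20, 3), (0b1011 << 20, 4)])
total = sum(mants)                            # fixed-point adds, no per-op shifts
result, result_exp = renormalize(total, exp)  # 87 == result * 2**(result_exp - 23)
```

The point of the structure is that only one barrel-shifter pass is needed after the whole sum, rather than a full denormalize/renormalize pair inside every add.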
Slide 4
Apply floating point to larger functions
[Block diagram of the same wrapper applied once around a fixed-point function: exponent difference against the max exponent, barrel-shift denormalize of the mantissas, fixed-point function, leading-zeros detect, barrel-shift renormalize, rounding, exponent adder; mantissas/exponents in, mantissa/exponent out]
Slide 5
Floating point sum has only as much precision as larger addend
• Addition requires both addends to have the same scale
  – Radix points must align
  – Addition is inherently a fixed-point operation
• The smaller addend's mantissa is right-shifted until its exponent matches the larger's
  – The exponent increments with each shift
  – Right shifts truncate LSBs
  – Truncated LSBs are lost
• The sum is left-shifted to left-justify it
  – LSBs are zero-filled
  – No improvement to precision
Examples:
– Different exponents (LSBs of B are lost):
  A = 1.101 * 2^5
  B = 1.101 * 2^3 = 0.01101 * 2^5
  A + B = (1.101 + 0.011) * 2^5
        = (10.000) * 2^5 = (1.0000) * 2^6
– Renormalizing (sum LSBs are filled with 0s):
  A = 1.101 * 2^5
  B = 1.011 * 2^5
  A - B = (1.101 - 1.011) * 2^5
        = (0.010) * 2^5
        = (1.000) * 2^3
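The alignment-and-truncation behavior in the first example can be modeled directly. This is a toy sign-magnitude adder with a 4-bit mantissa, an illustration rather than IEEE rounding:

```python
def align_add(ma, ea, mb, eb, bits=4):
    """Add two positive floats whose `bits`-bit integer mantissas carry an
    implied radix point after the MSB (so 0b1101 with exponent 5 is 1.101*2^5)."""
    if ea < eb:                      # make A the larger-exponent addend
        ma, ea, mb, eb = mb, eb, ma, ea
    mb >>= (ea - eb)                 # denormalize B: truncated LSBs are lost
    s, e = ma + mb, ea
    while s >= (1 << bits):          # renormalize the sum back to `bits` bits
        s >>= 1
        e += 1
    return s, e

# A = 1.101 * 2^5 (= 52), B = 1.101 * 2^3 (= 13)
s, e = align_add(0b1101, 5, 0b1101, 3)
# s, e == (0b1000, 6): 1.000 * 2^6 = 64, not the exact 65; B's LSBs were lost
```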
Slide 6
Phase rotation does not change amplitude
• Re(y) = Re(x) * cos(θ) - Im(x) * sin(θ)
• Im(y) = Re(x) * sin(θ) + Im(x) * cos(θ)
• The magnitudes of the individual I and Q components change, but the complex magnitude is not altered
• No precision is lost by treating I and Q with a common exponent
  – The complex operation is limited to the precision of the larger component anyway
• Using a common exponent for I and Q reduces hardware
  – Single copy of the exponent logic
  – No rescaling of I with respect to Q
• Simplifies the rotator
  – Fixed-point complex multiply (the smaller of I or Q is denormalized)
  – Fixed-point sines and cosines
  – The output renormalize is a +/-1 bit shift
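The amplitude-preserving property is easy to check numerically. A small sketch, where `rotate` is just an illustrative name:

```python
import cmath
import math

def rotate(x: complex, theta: float) -> complex:
    """Phase-rotate x by theta: the I/Q split changes, |x| does not."""
    return x * cmath.exp(1j * theta)

x = 3 + 4j                              # |x| = 5
y = rotate(x, math.pi / 3)
assert math.isclose(abs(y), abs(x))     # complex magnitude is unchanged
```

Because the magnitude is invariant, a single exponent shared by I and Q loses nothing relative to per-component exponents, which is what lets the rotator's complex multiply run in pure fixed point.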
Slide 7
FFT butterflies are only as precise as largest input
• Cooley-Tukey FFT butterfly
  – Sum and difference of a pair of complex inputs
  – One input is rotated by a "twiddle factor" phasor
• Rotation does not affect scale
• The smaller input is right-shifted
  – Shifted to match scale
  – LSBs are lost
• Both outputs have the same LSB weight before renormalizing
• Renormalizing does not add precision (it zero-fills LSBs)
• The output is 1 bit wider than the input
  – Sum of similar-sized addends
[Diagram: FFT butterfly, two complex inputs, one multiplied by the twiddle factor k = cos(w) + j*sin(w), summed and differenced into two complex outputs]
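The butterfly itself is only a few lines. An illustrative model of the radix-2 decimation-in-time butterfly in complex arithmetic, not the poster's fixed-point hardware:

```python
import cmath

def butterfly(a: complex, b: complex, k: complex):
    """Sum and difference of two complex inputs, one rotated by the
    twiddle factor k; the rotation does not change b's scale."""
    t = k * b
    return a + t, a - t      # both outputs share the same LSB weight

# twiddle k = e^(-j*pi/2) rotates b = j onto the real axis
y0, y1 = butterfly(1 + 0j, 0 + 1j, cmath.exp(-1j * cmath.pi / 2))
# y0 is approximately 2, y1 approximately 0
```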
Slide 8
FFT output is only as precise as largest input
• Cascade of butterfly elements
• Each output is essentially an adder tree with phase rotators
  – Rotators don't change scale
  – Inputs are right-shifted to match the scale of the largest input
  – Intermediate renormalizing is not effective
  – There is a term from every FFT input
• 1 bit of growth per stage
  – Renormalizing maintains width
  – Alternative: grow the word width
• Similar effect in other FFTs (Winograd, Sande-Tukey, Singleton, etc.)
[Diagram: cascade of butterfly stages with twiddle-factor (k) multipliers]
Slide 9
Fixed Point FFT Replaces Floating Point FFT
• Denormalize inputs
– Shift each input right to match scale of largest
• Perform fixed point FFT
– Pass common exponent around it
– Input width = mantissa bits
– Maximum 1 bit growth per equivalent radix 2 stage
• Renormalize outputs
– Add common exponent to delta exponent from renormalize
[Block diagram: mantissas are right-shifted by n against the max exponent (denormalize), pass through the fixed-point FFT, then are left-shifted back (renormalize); the common exponent is carried around the FFT and summed with the renormalize shift]
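The whole scheme can be sketched end to end in software. Everything here is an illustrative model: a recursive integer radix-2 FFT with 14-bit scaled-integer twiddles stands in for the proprietary kernel, and `float_fft` is a hypothetical wrapper that denormalizes all components to the max exponent and renormalizes the outputs.

```python
import math

def fft_fixed(re, im, tw_bits=14):
    """Recursive radix-2 DIT FFT on integer mantissas.  Twiddles are
    scaled integers (Q1.14); products are shifted back down, so the
    words grow roughly one bit per radix-2 stage from the adds."""
    n = len(re)
    if n == 1:
        return re[:], im[:]
    er, ei = fft_fixed(re[0::2], im[0::2], tw_bits)
    dr, di = fft_fixed(re[1::2], im[1::2], tw_bits)
    R, I = [0] * n, [0] * n
    for k in range(n // 2):
        c = round(math.cos(2 * math.pi * k / n) * (1 << tw_bits))
        s = -round(math.sin(2 * math.pi * k / n) * (1 << tw_bits))
        tr = (dr[k] * c - di[k] * s) >> tw_bits    # rotate the odd branch
        ti = (dr[k] * s + di[k] * c) >> tw_bits
        R[k],          I[k]          = er[k] + tr, ei[k] + ti
        R[k + n // 2], I[k + n // 2] = er[k] - tr, ei[k] - ti
    return R, I

def float_fft(xs, mant_bits=24):
    """Hypothetical wrapper matching the slide: denormalize every real and
    imaginary component to the max exponent, run the fixed-point FFT, then
    renormalize by folding the common exponent back in."""
    comps = [x.real for x in xs] + [x.imag for x in xs]
    emax = max(math.frexp(v)[1] for v in comps)           # common exponent
    scale = mant_bits - emax
    re = [round(math.ldexp(x.real, scale)) for x in xs]   # denormalize
    im = [round(math.ldexp(x.imag, scale)) for x in xs]
    R, I = fft_fixed(re, im)
    return [complex(math.ldexp(r, -scale), math.ldexp(i, -scale))
            for r, i in zip(R, I)]                        # renormalize
```

The integer adds are where the one-bit-per-stage word growth noted above shows up; the `>> tw_bits` shifts and the final renormalize are the only shift operations left in the datapath.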
Slide 10
Advantages and Limitations
• Advantages
– Large reduction in required hardware
– Less complexity means higher clock rates, smaller parts
• Limitations
– Word width grows for each radix 2 stage
• Becomes excessive for large FFTs
– Max Exponent needed at beginning of set
• Problem for large sequential FFTs
– Use periodic renormalization to manage word widths
• A few bits of growth don't significantly affect timing
• Words are not limited to specific widths in an FPGA
– Fixed width assets like DSP48s limit practical word sizes.
– Find balance between precision, growth and renormalizing stages
Slide 11
Small FFTs as building blocks
• Larger FFT constructed from small FFTs with “mixed radix” algorithm
– Similar to Cooley-Tukey decomposition
• Arbitrarily large FFTs using small off-the-shelf kernels
• Combination uses FFT plus phase rotator and reorder memory
• “In-place” operation (results written to same memory locations)
[Diagram: fill along rows -> FFT along rows -> multiply by e^(-j2πkn/N) -> FFT down columns -> read down columns]
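The row/column combine can be demonstrated with a naive DFT standing in for the small kernels. An illustrative sketch; the row/column orientation used here is one common Cooley-Tukey convention and may be the transpose of the poster's figure:

```python
import cmath

def dft(v):
    """Naive DFT standing in for a small off-the-shelf FFT kernel."""
    n = len(v)
    return [sum(v[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]

def mixed_radix_fft(x, n1, n2):
    """N = n1*n2 mixed-radix combine: column FFTs, twiddle multiply,
    row FFTs, then read the results down columns."""
    N = n1 * n2
    # Fill along rows: A[a][b] = x[n2*a + b]
    A = [x[n2 * a:n2 * a + n2] for a in range(n1)]
    # n1-point FFT down each column b
    cols = [dft([A[a][b] for a in range(n1)]) for b in range(n2)]
    # Twiddle multiply: G[c][b] = cols[b][c] * e^(-2*pi*j*b*c/N)
    G = [[cols[b][c] * cmath.exp(-2j * cmath.pi * b * c / N) for b in range(n2)]
         for c in range(n1)]
    # n2-point FFT along each row c; output index is c + n1*d
    rows = [dft(G[c]) for c in range(n1)]
    return [rows[c][d] for d in range(n2) for c in range(n1)]
```

For N = n1 * n2 this reproduces the full N-point DFT exactly while using only n1- and n2-point kernels plus one twiddle multiply per element, which is the structure the hardware pipeline exploits.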
Slide 12
Winograd FFT
• Different factorization
• Minimizes multiplies
  – Advantageous for hardware implementation
  – 74 adds and 18 real multiplies for a 16-point Winograd
  – 176 adds and 72 real multiplies for a 16-point Cooley-Tukey
• Irregular data sequence
  – Difficult for shared memory
  – Easy when reorder memory is distributed
[Diagram: Winograd kernel structure, reorder stages interleaved with weight multipliers]
Slide 13
32 to 2048 point mixed radix FFT
• 2K FFT is 8 x 256 mixed radix
• 256 point is 16 x 16 mixed radix
• Combined algorithms: 2K = 8 x 16 x 16
• Data arranged in cube, FFT along each dimension
• Reorder at input and output (not shown)
• Kernel is a proprietary 1/4/8/16-point Winograd kernel
  – Each kernel has a floating-point wrapper
[Block diagram: 1/8-point FFT -> phase rotator -> data reorder (4K-sample BRAM) -> 4/8/16-point FFT -> phase rotator -> data reorder (512-sample BRAM) -> 8/16-point FFT; the last two kernels alone form the 32/64/128/256-point FFT]
Slide 14
32-2K point FFT statistics
• Speed: 400 MS/sec per FFT engine (3 in FPGA)
– 400MHz clock in XC4VSX55-10 (slowest speed grade)
– 1 complex sample per clock in and out continuous
– Latency: ~430 + 3*FFT length + (32,64,128 or 256) clocks
• Utilization – less than 30% of XC4VSX55
– DSP48s: 151
– Slice flip-flops: 9707
– RAMB16s: 69
– LUTs: 7736 (4975 are SRL16s)
• Precision
– 30-35 bit mantissa internal, 8 bit exponents
– IEEE single precision input and output
– Matches Matlab FFT to +/- 1 LSB of output mantissa
Slide 15
1.2 GSample/sec IEEE floating point FFT
[Block diagram: input buffer demultiplexing to three parallel 32-to-2K-point floating-point FFT engines, recombined in an output buffer]
Slide 16
Who is Andraka Consulting Group?
• Exclusively FPGAs since 1994
• Leading industry expert on DSP in FPGAs
– Charter Xilinx ‘Xperts’ partner
– First published FIR filter in FPGAs (1992)
– Fastest single threaded FFT kernel for FPGA
• Other current projects
– Beamforming digital receiver: 10 25MHz channels, 260 antennas, 500MS/sec input sample rate
– Cylindrical Sonar Array processor
– Other Digital receiver and radar projects
Slide 17
Floating Point Format
• Floating point dedicates part of the word to indicating scale (the exponent)
  – The radix point position is tracked as part of the data
  – Compare to fixed point, where the radix point is at an implied fixed location
  – Trades precision for dynamic range
  – Useful when the data range is unknown or spans a large range
• The IEEE single-precision floating-point format is a 32-bit word
  – The leftmost bit is the sign bit S: '1' is negative, '0' is positive
  – The next 8 bits are the exponent, in excess-127 format
  – The right 23 bits are the fraction; there is an implicit '1' bit to the left of the fraction except in special cases. The fraction's radix point is between the implied '1' and the leftmost bit of the fraction.
• S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF
• Number = (-1)^S * 2^(E-127) * 1.F
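The field layout above can be checked by unpacking a float's raw bits, a small sketch using Python's struct module:

```python
import struct

def decode_ieee754(x: float):
    """Unpack a single-precision float into its (sign, exponent, fraction)
    fields, per the S / EEEEEEEE / FFF... layout above."""
    (w,) = struct.unpack('>I', struct.pack('>f', x))
    s = (w >> 31) & 1
    e = (w >> 23) & 0xFF          # excess-127 exponent
    f = w & 0x7FFFFF              # 23 fraction bits, implicit leading 1
    return s, e, f

# -6.5 = (-1)^1 * 2^(129-127) * 1.101b: sign 1, exponent 129,
# fraction 0.101b * 2^23 = 0b101 << 20
s, e, f = decode_ieee754(-6.5)
```

Reconstructing (-1)^S * 2^(E-127) * (1 + F/2^23) from the unpacked fields returns the original value exactly for any normal number.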