EADS: Accelerator Project
Rohit Prakash (2003CS10186)
Anand Silodia (2003CS50210)
Speed up scientific application
Application
Candidate Partition
Performance Prediction
Choose next partition
Time lines (tentative)
28th January: Figure out the best FFT algorithm
Compare the algorithms on the following parameters:
- Execution time
- No. of multiplications
- No. of additions
19th February : Study hardware implementation of FFT.....
Terminologies
radix: the size of an FFT decomposition
twiddle factors: the coefficients used to combine results from a previous stage to form inputs to the next stage
First Implementation
Implemented a recursive radix-4 FFT
Analysed it using gprof
Looked into other FFT implementations
- iterative
- parallel
- split radix
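A recursive radix-4 decimation-in-time FFT can be sketched roughly as follows. This is an illustrative Python sketch, not the project's code: the function name `recursive_fft_radix4` is hypothetical, the input length is assumed to be a power of 4, and the transform is assumed to use the positive exponent e^(2πi/n).

```python
import cmath

def recursive_fft_radix4(a):
    """Recursive radix-4 DFT with kernel e^(+2*pi*i/n); len(a) must be a power of 4."""
    n = len(a)
    if n == 1:
        return [complex(a[0])]
    if n % 4 != 0:
        raise ValueError("length must be a power of 4")
    # Each slice allocates a new list: this is the memory cost of recursion.
    subs = [recursive_fft_radix4(a[r::4]) for r in range(4)]
    q = n // 4
    out = [0j] * n
    for k in range(q):
        # Twiddle factor recomputed at every call and every k.
        w = cmath.exp(2j * cmath.pi * k / n)
        t = subs[0][k]
        u = w * subs[1][k]
        v = w * w * subs[2][k]
        x = w * w * w * subs[3][k]
        out[k] = t + u + v + x
        out[k + q] = t + 1j * u - v - 1j * x
        out[k + 2 * q] = t - u + v - x
        out[k + 3 * q] = t - 1j * u - v + 1j * x
    return out
```

The slicing at every level and the repeated calls to `cmath.exp` show where a recursive version tends to spend extra memory and redundant work.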
Analysis of the implementation
Considered FFT of 1024 random points (double)
Results from gprof:
No. of complex multiplications: 21760
No. of complex additions: 7680
(Each complex multiplication consists of 4 real multiplications and 2 real additions)
(Each complex addition/subtraction consists of 2 real additions/subtractions)
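To see what these counts mean in real arithmetic, the conversion rules above can be applied directly. The helper name `real_op_counts` is hypothetical; the sketch only restates the two rules (4 real multiplies and 2 real adds per complex multiply, 2 real adds per complex add).

```python
def real_op_counts(complex_mults, complex_adds):
    """Convert complex operation counts into (real multiplies, real adds)."""
    real_mults = 4 * complex_mults
    real_adds = 2 * complex_mults + 2 * complex_adds
    return real_mults, real_adds

# Counts reported by gprof for the recursive implementation (1024 points)
print(real_op_counts(21760, 7680))  # (87040, 58880)
```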
Problems with this implementation
Inefficient use of memory (recursive procedure)
Wasted computations (some factors computed multiple times)
Most of the time is spent computing twiddle factors (complex number multiplications)
2nd Implementation
Radix-4 iterative in-place implementation
iterativeFFT(a)
  BitReversal(a, A)
  n ← length(a)
  for (s ← 1 to log4(n)) // logarithm is of base 4
  {
    do m ← 4^s
    ω ← e^(2πi/m)
    for (k ← 0 to n-1 by m)
    {
      do τ ← 1
      for (j ← 0 to m/4 - 1)
      {
        t ← A[k+j]
        u ← τ A[k+j+m/4]
        v ← τ^2 A[k+j+2*m/4]
        x ← τ^3 A[k+j+3*m/4]
        A[k+j] ← t + u + v + x
        A[k+j+m/4] ← t + (i)u - v - (i)x
        A[k+j+2*m/4] ← t - u + v - x
        A[k+j+3*m/4] ← t - (i)u - v + (i)x
        τ ← τ * ω
      }
    }
  }
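The pseudocode above can be turned into a runnable sketch along the following lines. This is an illustration in Python, not the authors' implementation. Two assumptions are made: the reordering step is taken to be a base-4 digit reversal (which is what a radix-4 decimation-in-time pass needs), and the inner loop runs over j = 0 .. m/4 - 1 so that each butterfly stays inside its own block of m elements. The function names are hypothetical.

```python
import cmath

def digit_reverse_base4(a):
    """Return a copy of a in base-4 digit-reversed order; len(a) must be a power of 4."""
    n = len(a)
    digits = 0
    while 4 ** digits < n:
        digits += 1
    if 4 ** digits != n:
        raise ValueError("length must be a power of 4")
    out = [0j] * n
    for i in range(n):
        rev, x = 0, i
        for _ in range(digits):
            rev = rev * 4 + (x % 4)
            x //= 4
        out[rev] = complex(a[i])
    return out

def iterative_fft_radix4(a):
    """Iterative radix-4 DFT with kernel e^(+2*pi*i/n), as in the pseudocode."""
    A = digit_reverse_base4(a)
    n = len(A)
    m = 4
    while m <= n:                      # one pass per stage: m = 4, 16, ..., n
        omega = cmath.exp(2j * cmath.pi / m)
        q = m // 4
        for k in range(0, n, m):       # each block of m elements
            tau = 1 + 0j
            for j in range(q):         # j = 0 .. m/4 - 1
                t = A[k + j]
                u = tau * A[k + j + q]
                v = tau * tau * A[k + j + 2 * q]
                x = tau * tau * tau * A[k + j + 3 * q]
                A[k + j] = t + u + v + x
                A[k + j + q] = t + 1j * u - v - 1j * x
                A[k + j + 2 * q] = t - u + v - x
                A[k + j + 3 * q] = t - 1j * u - v + 1j * x
                tau *= omega
        m *= 4
    return A
```

Because the kernel uses a positive exponent, the result matches n times the inverse transform of libraries that define the forward FFT with e^(-2πi/n), such as NumPy.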
Analysis of this implementation
Considered FFT of 1024 random points (double)
Results from gprof:
No. of complex multiplications: 14080
No. of complex additions/subtractions: 7680
(Each complex multiplication consists of 4 real multiplications and 2 real additions)
(Each complex addition/subtraction consists of 2 real additions/subtractions)
Improvements
Precompute twiddle factors
Trade multiplications for additions
(a complex multiplication can be done with 3 real multiplies and 5 real adds rather than the usual 4 real multiplies and 2 real adds)
Use compiler flags (10%-15% improvement in execution time on some systems): -O3 -march=pentiumpro -ffast-math -fomit-frame-pointer
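The 3-multiply form of complex multiplication mentioned above can be sketched as follows. This is one well-known arrangement, shown here for illustration; the function name `complex_mul_3m` is hypothetical and the slides do not say which variant was used.

```python
def complex_mul_3m(a, b, c, d):
    """Return (re, im) of (a + bi)(c + di) using 3 real multiplies and 5 real adds."""
    k1 = c * (a + b)      # multiply 1, add 1
    k2 = a * (d - c)      # multiply 2, add 2
    k3 = b * (c + d)      # multiply 3, add 3
    return k1 - k3, k1 + k2   # adds 4 and 5: re = ac - bd, im = ad + bc

print(complex_mul_3m(3, 4, 1, -2))  # (11, -2), same as (3+4j) * (1-2j)
```

When one operand is a precomputed twiddle factor, the terms d - c and c + d depend only on that factor and could be stored in advance.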
Some results
Precomputing twiddle factors: 8960 complex multiplications, 5120 fewer than before
Trading multiplications for additions: did not show any appreciable decline in execution time
Using compiler flags: drastic improvement in execution time
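One way to precompute twiddle factors is to build a single table of the n-th roots of unity and index into it at every stage. The sketch below is an assumption about how this could be done, not the project's code; `make_twiddle_table` and `twiddle` are hypothetical names, and the positive-exponent convention e^(2πi/n) is assumed.

```python
import cmath

def make_twiddle_table(n):
    """Table of e^(2*pi*i*k/n) for k = 0 .. n-1, computed once per transform size."""
    return [cmath.exp(2j * cmath.pi * k / n) for k in range(n)]

def twiddle(table, m, j, r):
    """Look up e^(2*pi*i*r*j/m), the r-th power of the stage-m twiddle for index j.

    m must divide len(table); no complex multiplication is needed.
    """
    n = len(table)
    return table[(r * j * (n // m)) % n]
```

In a radix-4 butterfly the three factors for u, v and x would then be `twiddle(table, m, j, 1)`, `twiddle(table, m, j, 2)` and `twiddle(table, m, j, 3)`, replacing the running product and its powers.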
Comparative Analysis
[Bar chart: user time (milliseconds) for 1024 points; bars: recursive fft, inplace fft, inplace twiddle precompute, inplace twiddle compiler, final, fftw]
[Bar chart: user time (milliseconds) for 4096 points; same bars]
[Bar chart: user time (milliseconds) for 262144 points; same bars]
Further enhancements possible
Use higher radix: 8, 16, 32, etc.
Use split-radix or Winograd algorithms
If data is real, we can have great improvements
Use Fast Bit-Reversal method (IEEE D.M.W. Evans)
Resources
Rivest, Cormen
Numerical Recipes in C
IEEE papers
- Conversion of Digit-Reversed to Bit-Reversed order in FFT algorithms (Panos E. and C.S. Burrus)
- The Design and Implementation of FFTW3 (Matteo Frigo and Steven G. Johnson)
cnx.org
Other FFT implementations on the net
- Best: fftw
Thank You