
Fastest Fourier Transform in the West

Devendra Ghate and Nachi Gupta

devg@comlab.ox.ac.uk, nachi@comlab.ox.ac.uk

Oxford University Computing Laboratory

MT 2006 Internal Seminar – p. 1/33

Cooley-Tukey Algorithm for FFT

$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-\frac{2\pi i}{N} nk}, \qquad k = 0, \ldots, N-1$$

If we evaluate $X_k$ directly, then $O(N^2)$ operations are required for the solution.

If N = p × q, then this transformation can be split into two separate transformations...

...recursively, we get the FFT algorithm.

If N is split into N/r intervals of r each, then the complexity is $r \times N \times \log_r N$.

This can be generalised to mixed-radix algorithms.

Overview of FFT algorithms

Cooley-Tukey FFT algorithm (popularized in 1965), known by Gauss in 1805.

Prime-factor FFT algorithm – a.k.a. Good-Thomas (1958-1963), N = N1 N2 becomes a 2D N1 by N2 DFT, for relatively prime N1, N2 only.

Bruun’s FFT algorithm (1978, generalized to arbitrary even composite sizes by H. Murakami in 1996), recursive polynomial factorization approach.

Rader’s FFT algorithm (1968), for prime size by expressing DFT as a convolution.

Bluestein’s FFT algorithm (1968) – a.k.a. chirp z-transform algorithm (1969), for arbitrary sizes (including prime sizes) by expressing DFT as a convolution.

Rader-Brenner (1976) – Cooley-Tukey like, but with purely imaginary twiddle factors.

What is FFTW?

FFTW is a package for computing a one- or multi-dimensional complex discrete Fourier transform (DFT) of arbitrary size.

At the heart of FFTW lies a philosophy geared towards speed, portability, and elegance – achieved via an adaptive software architecture.

Code, documentation, and some benchmarking results at www.fftw.org.

Features

Speed.

Both one-dimensional and multi-dimensional transforms.

Arbitrary size (small prime factors are best, but FFTW uses O(N log N) algorithms even for prime sizes).

Fast transforms of purely real input or output data.

Discrete Cosine Transform and Discrete Sine Transform.

Parallel transforms (an MPI version for distributed-memory transforms).

Portable to any platform with a C compiler.

Both C and Fortran interfaces.

Released under the GPL.

Wilkinson Prize

FFTW received the 1999 J. H. Wilkinson Prize for Numerical Software, which is awarded every four years to the software that “best addresses all phases of the preparation of high quality numerical software.”

Wilkinson, a seminal figure in modern numerical analysis, was a key proponent of the notion of reusable common libraries for scientific computing.

Why FFTW is fast!

The transform is computed by an executor, composed of highly optimized composable blocks of C code called codelets.

At runtime, a planner finds an efficient way to mix these codelets: it measures the speeds of different plans and chooses the best using a dynamic programming algorithm.

The executor then interprets this plan with negligible overhead.

Codelets are generated automatically and fast.

FFTW is easy to use

FFTW’s internal complexity is not visible to the user.

    #include <fftw3.h>

    int n = 1024;

    // Allocate memory (fftw_malloc returns suitably aligned storage)
    fftw_complex *in  = fftw_malloc( sizeof( fftw_complex ) * n );
    fftw_complex *out = fftw_malloc( sizeof( fftw_complex ) * n );

    // Create plan
    fftw_plan plan = fftw_plan_dft_1d( n, in, out,
                                       FFTW_FORWARD, FFTW_ESTIMATE );

    // ... fill in ...

    // Execute
    fftw_execute( plan );

    // Destroy plan and release memory
    fftw_destroy_plan( plan );
    fftw_free( in );
    fftw_free( out );

The Executor

The executor implements the recursive divide-and-conquer Cooley-Tukey FFT algorithm.

The executor is composed of many optimized code sequences called codelets.

Codelets come in two flavors:

Nontwiddle codelets compute transforms of small sizes.

Twiddle codelets combine small transforms to compute bigger transforms.

The executor invokes the codelets as dictated by the plan.

The Planner

A planner tries out various combinations of codelets to decide on the best algorithm.

The planner can produce many plans, measure their speed, and then pick the best.

This is accomplished via dynamic programming techniques.

The planner can produce a “reasonable” plan quickly if desired.

The planner collects information about the machine, which can be stored on disk and reused at a later time.
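The dynamic programming idea can be illustrated with a toy planner in C. This is only a sketch under stated assumptions: the real planner times candidate plans on the machine, whereas this version uses an invented cost model (a size-n nontwiddle codelet costs n², and a radix-p twiddle step on size n costs n·p). The names `plan_cost`, `MAX_CODELET` and `NO_PLAN` are hypothetical.

```c
#include <stddef.h>

#define MAX_CODELET 16   /* assumed largest size handled by a single codelet */
#define NO_PLAN (-1.0)   /* marks sizes this toy planner cannot handle */

/* Cheapest modelled cost of a size-n transform.
   memo must hold n + 1 entries, all initialised to 0.0 ("not planned yet").
   Each size is planned once and reused, which is the dynamic programming step. */
double plan_cost(size_t n, double *memo)
{
    if (memo[n] != 0.0)
        return memo[n];                       /* already planned */

    /* Option 1: solve directly with a nontwiddle codelet, if one exists. */
    double best = (n <= MAX_CODELET) ? (double)(n * n) : NO_PLAN;

    /* Option 2: split into p subproblems of size n/p, then combine. */
    for (size_t p = 2; p <= MAX_CODELET && p < n; ++p) {
        if (n % p != 0)
            continue;
        double sub = plan_cost(n / p, memo);  /* cost of the rest of the plan */
        if (sub == NO_PLAN)
            continue;
        double cost = (double)(n * p) + (double)p * sub;
        if (best == NO_PLAN || cost < best)
            best = cost;
    }

    memo[n] = best;
    return best;
}
```

Replacing the modelled cost with a measured run time per candidate would bring the sketch closer to the measuring behaviour described on this slide.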

What a plan looks like

FFT( 128 ) =

DIVIDE-AND-CONQUER( 128, 4 )

DIVIDE-AND-CONQUER( 32, 8 )

SOLVE( 4 )

The plan is a sequence of instructions, each of which is either:

SOLVE(n): compute the FFT of size n using a nontwiddle codelet.

DIVIDE-AND-CONQUER(n, p): solve p problems of size n/p using the rest of the plan, and then combine the results using a twiddle codelet.
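One way to read the plan format is as an array of steps in which each DIVIDE-AND-CONQUER(n, p) must hand problems of size n/p to the next step. The C sketch below checks that property for a plan. The types and function names (`plan_step`, `plan_is_consistent`, `plan_leaf_calls`) are hypothetical and do not reflect FFTW's internal data structures.

```c
#include <stddef.h>

typedef enum { SOLVE, DIVIDE_AND_CONQUER } step_kind;

typedef struct {
    step_kind kind;
    size_t n;   /* transform size handled by this step */
    size_t p;   /* number of subproblems (unused for SOLVE) */
} plan_step;

/* Returns 1 if every DIVIDE_AND_CONQUER(n, p) is followed by a step of
   size n/p and the plan ends with a SOLVE; returns 0 otherwise. */
int plan_is_consistent(const plan_step *plan, size_t len)
{
    for (size_t i = 0; i < len; ++i) {
        const plan_step *s = &plan[i];
        if (s->kind == SOLVE)
            return i + 1 == len;          /* SOLVE must be the last step */
        if (s->p == 0 || s->n % s->p != 0)
            return 0;                     /* p must divide n */
        if (i + 1 == len)
            return 0;                     /* nothing left to solve subproblems */
        if (plan[i + 1].n != s->n / s->p)
            return 0;                     /* next step has the wrong size */
    }
    return 0;                             /* empty plan */
}

/* Number of times the final SOLVE codelet runs: the product of all p. */
size_t plan_leaf_calls(const plan_step *plan, size_t len)
{
    size_t calls = 1;
    for (size_t i = 0; i < len; ++i)
        if (plan[i].kind == DIVIDE_AND_CONQUER)
            calls *= plan[i].p;
    return calls;
}
```

For the FFT( 128 ) plan on this slide, the steps are (128, 4), (32, 8) and SOLVE( 4 ): the plan is consistent and the size-4 codelet runs 4 × 8 = 32 times, covering all 128 points.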

What a real plan looks like

Plan for an array of size 247

(dft-ct-dit/13
  (dftw-generic-dit-13-19
    (dft-direct-13-x19 "n1_13"))
  (dft-vrank>=1-x13/1
    (dft-generic-19)))

FFTW Plans

FFTW_ESTIMATE - no run-time tuning, probably suboptimal

FFTW_MEASURE - experiment with different algorithms and choose the fastest

FFTW_PATIENT - experiment with more algorithms, ...

FFTW_EXHAUSTIVE - even more, ...

Creating a plan for an array of size 823467 in FFTW_EXHAUSTIVE mode takes 25 minutes.

Wisdom

A central or local database of plans for various array sizes for a given hardware configuration.

Helps greatly in the estimate mode

Reduces computation time for the exhaustive mode

Wisdom from the other array sizes is also utilised

Can be exported from one architecture to another

Matlab uses a wisdom database by default.

FFTW Compiler

Various basic algorithms written in Caml (a functional language)

Directed acyclic graph (DAG) created for all these algorithms

Optimisation rules specially suited to FFT algorithms are defined (these also include standard optimisations carried out by a compiler)

Small snippets of code, “codelets”, generated for FFT algorithms of size N = 2 : 16

Multi-dimensional Arrays

FFTW does not take a multidimensional array type as input.

It expects a single one-dimensional array in "row-major" format (the C standard layout)

Use of fftw_malloc for contiguous array allocation

MKL routines accept multi-dimensional arrays. Various compact formats are also defined.
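The row-major layout mentioned above maps a multi-index to a single offset in one contiguous block. A small C sketch follows. The helper names are hypothetical, and plain `calloc` stands in for `fftw_malloc` so that the example does not depend on the FFTW library.

```c
#include <stddef.h>
#include <stdlib.h>

/* Offset of element (i, j) in an n0 x n1 array stored in row-major order. */
size_t row_major_2d(size_t i, size_t j, size_t n1)
{
    return i * n1 + j;
}

/* Offset of element (i, j, k) in an n0 x n1 x n2 row-major array:
   the last index varies fastest. */
size_t row_major_3d(size_t i, size_t j, size_t k, size_t n1, size_t n2)
{
    return (i * n1 + j) * n2 + k;
}

/* Allocate one zero-initialised contiguous block for an n0 x n1 array.
   With FFTW one would use fftw_malloc here instead of calloc. */
double *alloc_2d(size_t n0, size_t n1)
{
    return calloc(n0 * n1, sizeof(double));
}
```

An element is then addressed as `a[row_major_2d(i, j, n1)]` rather than `a[i][j]`, and the whole block can be passed to a transform routine as a single pointer.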

NAG FFT routines

Complex-Complex, Real-Complex routines

FFT algorithm (Brigham 1974)

Scaling factor of $1/\sqrt{n}$

Array dimension n such that the largest prime factor of n does not exceed 19 and the total number of prime factors of n, including repetitions, does not exceed 20.
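The size restriction stated above can be checked before calling such a routine. The sketch below is written directly from the two conditions on this slide; `nag_size_ok` is a hypothetical helper, not a NAG library function, and any further restrictions a particular routine may impose are not modelled.

```c
/* Returns 1 if n has no prime factor larger than 19 and at most 20 prime
   factors counted with repetition; returns 0 otherwise. */
int nag_size_ok(unsigned long n)
{
    if (n == 0)
        return 0;

    unsigned count = 0;
    /* Trial division by 2..19. Composite divisors never match, because
       their prime factors have already been divided out. */
    for (unsigned long p = 2; p <= 19 && n > 1; ++p) {
        while (n % p == 0) {
            n /= p;
            ++count;
        }
    }
    /* Anything left over is a product of primes larger than 19. */
    return n == 1 && count <= 20;
}
```

For example, 247 = 13 × 19 passes, 23 fails because it is a prime larger than 19, and 2^21 fails because it has 21 prime factors.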

MKL FFT routines

Complex-Complex & real-complex transforms

Multi-dimensional arrays up to order 7

Interface for both FORTRAN and C

Mixed-radix transforms!

MKL also provides an API for FFTW 3.x!

MATLAB uses FFTW3 internally for all FFT calculations. An interface for setting plan options is available.

Henrici

icc used with the following flags:

-O3 : Level-3 optimization

-c99 : Support for complex numbers using the C99 standard

-ipo : Multi-file inlining

-funroll-loops : Unrolling the loops

-xN : Compile and optimise code specifically for Pentium 4

-align : Analyse and reorder the memory layout

Benchmark

Time per transform is averaged over 1000 runs for each array size N

Minimum time from 8 trials

FLOPS = 5 × N × log_2 N

Compiler: icc

void fftw_flops(const fftw_plan plan, double *add, double *mul, double *fma)

[Figure: plot with legend entries 5N log2(N) and 2N log2(N).]

Benchmark for complex case

[Figure: 1D Complex Bench - Powers of 2. x-axis: N (Array Size). Legend: FFTW3-Estimate, FFTW3-Measure, FFTW3-Patient, FFTW3-Exhaustive, MKL, NAG, MATLAB-Estimate, MATLAB-Measure, MATLAB-Patient, MATLAB-Exhaustive.]

[Figure: 1D Complex Bench - General. x-axis: N (Array Size).]

[Figure: 1D Complex Bench - Combined. x-axis: N (Array Size).]

BenchFFT Results on Henrici1

All figures in this section plot speed (mflops) against transform size.

[Figure: double-precision complex, 1d transforms, powers of two (sizes 2 to 262144). Legend: fftw3 out-of-place, fftw3 in-place, ooura-sg, spiral-egner-fft, ffte, green, dfftpack, rmayer-lookup, bloodworth, gsl-mixed-radix, harm, arprec, sciport, fxt-fht, numutils, kissfft, monnier, mpfun77, mixfft, cross, fxt-matrixfft, jmfftc, esrfft, valkenburg.]

[Figure: double-precision complex, 1d transforms, non-powers of two (sizes 6 to 165375). Legend: fftw3 out-of-place, fftw3 in-place, dfftpack, ffte, gsl-mixed-radix, kissfft, monnier, mixfft, jmfftc, numutils, valkenburg.]

[Figure: double-precision real-data, 1d transforms, powers of two (sizes 2 to 262144). Legend: fftw3 out-of-place, fftw3-r2r, fftw3 in-place, ooura-sg, fftreal, sciport, green, rmayer-lookup, dfftpack, bloodworth, fxt-fht-real, gsl-mixed-radix, jmfftc.]

[Figure: double-precision real-data, 1d transforms, non-powers of two (sizes 6 to 165375). Legend: fftw3-r2r, fftw3 out-of-place, fftw3 in-place, dfftpack, gsl-mixed-radix, jmfftc.]

[Figure: 2d transforms, sizes 16x512 to 1024x1024. Legend: fftw3 in-place, fftw3 out-of-place, ooura-4f2d, green, ffte, harm, fxt-twodim, jmfftc.]

[Figure: double-precision complex, 2d transforms, non-powers of two (sizes 5x5 to 1000x1000). Legend: fftw3 in-place, fftw3 out-of-place, ffte, jmfftc.]

[Figure: 3d transforms, sizes 4x8x16 to 128x128x128. Legend: fftw3 in-place, fftw3 out-of-place, ooura-sg3d, green, harm, ffte, fxt-ndim, jmfftc.]

[Figure: 3d transforms, sizes 10x10x10 to 120x120x120.]

Conclusions

Based on the experiments we conducted on Henrici1:

MKL DFT routines are faster than FFTW3.

FFTW has competitive performance.

NAG performed poorly.

We prefer FFTW for C/C++/Fortran because of its ease of use, portability, additional routines, documentation, and many other features. This is the package that we would recommend.

In general, we find the Matlab routine to be very easy to use and recommend it when the performance overhead is acceptable.
