
SIAM J. SCI. STAT. COMPUT. Vol. 13, No. 3, pp. 676-686, May 1992

© 1992 Society for Industrial and Applied Mathematics

A GENERALIZED PRIME FACTOR FFT ALGORITHM FOR ANY $N = 2^p 3^q 5^r$*

CLIVE TEMPERTON†

Abstract. Prime factor fast Fourier transform (FFT) algorithms have two important advantages: they can be simultaneously self-sorting and in-place, and they have a lower operation count than conventional FFT algorithms. The major disadvantage of the prime factor FFT has been that it was only applicable to a limited set of values of the transform length $N$. This paper presents a generalized prime factor FFT, which is applicable for any $N = 2^p 3^q 5^r$, while maintaining both the self-sorting in-place capability and the lower operation count. Timing experiments on the Cray Y-MP demonstrate the advantages of the new algorithm.

Key words. fast Fourier transform (FFT), prime factor algorithm (PFA), self-sorting FFT, in-place FFT

AMS(MOS) subject classification. 65T05

1. Introduction. Fast Fourier transform (FFT) algorithms can be defined whenever the transform length $N$ can be factorized as $N = N_1 N_2 \cdots N_k$, where the factors $N_i$ are integers. Though there are many variants of these algorithms, they fall into two basic categories: those based on the prime factor algorithm (PFA) of Good [5], which are only applicable if the factors $N_i$ are mutually prime, and those descended from the algorithm of Cooley and Tukey [3], for which there is no such restriction (indeed the most familiar case is $N_i = 2$ for all $i$).

The prime factor algorithms have two important advantages. For a given value of $N$, the operation count is lower than that for the corresponding Cooley-Tukey algorithm. Moreover, the PFA can be made both self-sorting (input and output both naturally ordered) and in-place (requiring no work space) [2], [10], [14]. Their principal disadvantage is that in order to achieve this, a program to implement the PFA requires an explicit section of code (a "discrete Fourier transform (DFT) module") which performs a self-sorting in-place transform of length $N_i$ for each of the factors. (The DFT module is the radix-$N_i$ generalization of the familiar radix-2 "butterfly.") In most implementations of the PFA, the factors $N_i$ are thus assumed to be members of the set {2, 3, 4, 5, 7, 8, 9, 16}. Coupled with the requirement that the factors be mutually prime, this severely limits the set of practicable transform lengths $N$.

Johnson and Burrus [7] demonstrated that the radix-2 Cooley-Tukey algorithm could also be made self-sorting and in-place, and in [18] the principle was extended to radix-3, radix-4, and radix-5 transforms, and finally to the mixed-radix case. However, as hinted in [18], the operation count for the resulting algorithm can be improved if $N$ contains a mixture of factors.

In this paper we combine ideas from the PFA and from the self-sorting in-place form of the Cooley-Tukey algorithm, to derive a new generalized prime factor FFT algorithm with the following nice properties:

(1) It works for any transform length of the form $N = 2^p 3^q 5^r$;

(2) It is always self-sorting and in-place (the only work space required is for an optional list of precomputed twiddle factors);

* Received by the editors April 30, 1990; accepted for publication (in revised form) March 19, 1991. Much of this work was done at Recherche en Prévision Numérique, Atmospheric Environment Service, Canada.

† European Centre for Medium Range Weather Forecasts, Shinfield Park, Reading, Berkshire RG2 9AX, United Kingdom.


(3) For values of $N$ suitable for the PFA, the new algorithm reduces to the PFA and has the same operation count;

(4) For $N = 2^p$, $3^q$, or $5^r$, the new algorithm reduces to that described in [18], and has the same operation count;

(5) If $N$ contains a mixture of factors but is unsuitable for the PFA, the new algorithm has a lower operation count than the Cooley-Tukey algorithm.

Section 2 of this paper reviews the essentials of the prime factor algorithm. In §3 we show how the PFA can be generalized to yield the new algorithm, removing the restriction on the transform lengths. Section 4 describes the implementation on a Cray Y-MP, and includes timing results to illustrate the properties of the new algorithm. In §5, we describe a refinement of the algorithm which gives an even lower operation count for suitable values of $N$. Finally, §6 includes a brief summary and discussion.

2. Prime factor algorithms. A thorough derivation of the family of prime factor FFT algorithms has been given by Burrus [1], [2]. Briefly, the DFT of length $N$ is defined by

(2.1)   $x(n) = \sum_{k=0}^{N-1} z(k)\,\omega_N^{nk}, \qquad 0 \le n \le N-1,$

where $x(n)$ and $z(k)$ are complex, and we use the notation

(2.2)   $\omega_N = \exp(\pm 2i\pi/N).$

Either sign may be taken in the definition (2.2).

We illustrate the derivation of the PFA by means of an example. Suppose that $N = N_1 N_2$, where $N_1$ and $N_2$ are mutually prime. In this case [8, p. 250] we can find integers $p$, $q$, $r$, $s$ ($0 < p < N_1$, $0 < q < N_2$, $0 < r < N_2$, $0 < s < N_1$) such that

(2.3)   $pN_2 = rN_1 + 1, \qquad qN_1 = sN_2 + 1.$

We use this "Chinese Remainder Theorem" (CRT) to define a mapping between the integers $n$, $k$ ($0 \le n \le N-1$, $0 \le k \le N-1$) and the corresponding integer pairs $(n_1, n_2)$ and $(k_1, k_2)$, where $0 \le n_1 < N_1$, $0 \le n_2 < N_2$, $0 \le k_1 < N_1$, $0 \le k_2 < N_2$. As described in [14], we have a choice of two such mappings, the CRT map itself and the "Ruritanian" map [6]. In [14] the CRT map was chosen; this time we choose the Ruritanian map, for reasons which will become clear later.

Thus, the mapping is defined by

(2.4)   $n_1 = (pn) \bmod N_1, \qquad n_2 = (qn) \bmod N_2;$

(2.5)   $k_1 = (pk) \bmod N_1, \qquad k_2 = (qk) \bmod N_2,$

where $p$, $q$ are defined in (2.3).

The inverse map is given by

(2.6)   $n = (N_2 n_1 + N_1 n_2) \bmod N;$

(2.7)   $k = (N_2 k_1 + N_1 k_2) \bmod N.$

An example for $N = 40$ ($N_1 = 8$, $N_2 = 5$) is shown in Table 1. The integer solutions of (2.3) are $p = 5$, $q = 2$, $r = 3$, $s = 3$. In fact, it is easy to construct the mapping in the form of a table without having to find these solutions. The entries in the first column increase from 0 in steps of $N/N_1$ ($= N_2$), while those in the first row increase from 0 in steps of $N/N_2$ ($= N_1$). The remaining columns (or rows) can then be filled in by using the same increment as in the first column (or row), and taking the results modulo $N$. We will use this technique later for indexing in the transform algorithm.
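As a small check of (2.3)-(2.7), the following Python sketch solves (2.3) with a modular inverse and confirms that the forward and inverse maps undo each other. The function names (`crt_constants`, `forward_map`, `inverse_map`) are illustrative and do not come from the paper.

```python
def crt_constants(n1, n2):
    """Return (p, q, r, s) with p*n2 == r*n1 + 1 and q*n1 == s*n2 + 1, as in (2.3)."""
    # pow(a, -1, m) is the modular inverse (Python 3.8+); it needs gcd(n1, n2) == 1.
    p = pow(n2, -1, n1)
    q = pow(n1, -1, n2)
    r = (p * n2 - 1) // n1
    s = (q * n1 - 1) // n2
    return p, q, r, s


def forward_map(n, n1, n2):
    """Map n to the pair (n_1, n_2) via (2.4)."""
    p, q, _, _ = crt_constants(n1, n2)
    return (p * n) % n1, (q * n) % n2


def inverse_map(i1, i2, n1, n2):
    """Map the pair (n_1, n_2) back to n via the Ruritanian map (2.6)."""
    return (n2 * i1 + n1 * i2) % (n1 * n2)


N1, N2 = 8, 5
print(crt_constants(N1, N2))  # (5, 2, 3, 3)
assert all(inverse_map(*forward_map(n, N1, N2), N1, N2) == n for n in range(N1 * N2))
```

For $N_1 = 8$, $N_2 = 5$ this gives the solutions $p = 5$, $q = 2$, $r = 3$, $s = 3$ quoted in the text.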


TABLE 1
The Ruritanian map for $N_1 = 8$, $N_2 = 5$.

                    $n_2$
  $n_1$      0    1    2    3    4

    0        0    8   16   24   32
    1        5   13   21   29   37
    2       10   18   26   34    2
    3       15   23   31   39    7
    4       20   28   36    4   12
    5       25   33    1    9   17
    6       30   38    6   14   22
    7       35    3   11   19   27
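The table-filling technique can be sketched in Python as follows. This is only an illustration of the stepping rule described in the text (steps of $N/N_1$ down the first column, steps of $N/N_2$ along each row, reduced modulo $N$); the function name `ruritanian_table` is an assumption of this sketch.

```python
def ruritanian_table(n1, n2):
    """Build the n1-by-n2 Ruritanian map by stepping, without solving (2.3)."""
    n = n1 * n2
    col_step = n // n1   # N2: increment down the first column
    row_step = n // n2   # N1: increment along a row
    table = []
    base = 0
    for _ in range(n1):
        row, addr = [], base
        for _ in range(n2):
            row.append(addr)
            addr += row_step
            if addr >= n:          # a single subtraction is enough here
                addr -= n
        table.append(row)
        base += col_step           # first column is monotonic: no wrap needed
    return table


for row in ruritanian_table(8, 5):
    print(" ".join(f"{v:2d}" for v in row))
```

Each integer from 0 to $N-1$ appears exactly once, and the first column never needs a modulo reduction, which is the property exploited later for indexing.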

As shown in [14], if we substitute (2.6) and (2.7) into (2.1), we obtain

(2.8)   $x(n_1, n_2) = \sum_{k_2=0}^{N_2-1} \left[ \sum_{k_1=0}^{N_1-1} z(k_1, k_2)\, \omega_{N_1}^{N_2 k_1 n_1} \right] \omega_{N_2}^{N_1 k_2 n_2}.$

If it were not for the appearance of $N_2$ and $N_1$ multiplying the exponents, (2.8) would be exactly in the form of a two-dimensional DFT of dimension $N_1 \times N_2$, and the transform could be computed simply by performing $N_1$ DFTs of length $N_2$ in one dimension, followed by $N_2$ DFTs of length $N_1$ in the other (or vice versa, without changing the results). There are no "twiddle factors" between the two stages (hence the lower operation count than for the Cooley-Tukey algorithm). The output of each of the short one-dimensional transforms can overwrite the corresponding input, and the whole computation can thus be done in place. (Here we assume that $N_1$ and $N_2$ are "small" so that explicit DFT modules can be coded for these transform lengths.)

As further shown in [14], the appearance of $N_2$ and $N_1$ in the exponents simply rotates the transforms (applying a rotation $r$ to a transform of length $N_i$ means that instead of appearing in the original order $0, 1, 2, \ldots, N_i - 1$, the same results appear in the order $0, r, 2r, \ldots, (N_i - 1)r$, where these indices are to be interpreted modulo $N_i$). Moreover, these rotations can be incorporated by modifying certain constants which appear in the definitions of the short DFT algorithms of length $N_1$ and $N_2$. Detailed algorithms for rotated "small-n" DFTs are given in [14], [16].
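The effect of a rotation can be illustrated with a direct $O(N^2)$ evaluation. The sketch below is not one of the minimum-operation modules of [14], [16]; it simply raises every element of the DFT matrix to the power $r$ and checks that the usual results reappear in the order $0, r, 2r, \ldots$ (modulo the length). The name `rotated_dft` and the test data are assumptions of this example.

```python
import cmath


def rotated_dft(z, r=1, sign=1):
    """Multiply z by the DFT matrix with every element raised to the power r."""
    n = len(z)
    w = cmath.exp(sign * 2j * cmath.pi / n)
    return [sum(z[k] * w ** ((r * j * k) % n) for k in range(n)) for j in range(n)]


z = [complex(k * k, -k) for k in range(5)]
plain = rotated_dft(z, r=1)
rotated = rotated_dft(z, r=3)

# Output j of the rotated transform is output (3*j mod 5) of the plain transform.
for j in range(5):
    assert abs(rotated[j] - plain[(3 * j) % 5]) < 1e-9
```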

The generalization to the case $N = N_1 N_2 \cdots N_k$, where all the factors are mutually prime, is straightforward; the one-dimensional transform of length $N$ is equivalent to a $k$-dimensional transform in which the DFTs in each dimension are rotated as appropriate.

For future reference, it will be helpful to express these results in matrix form. We define $W_N$ to be the DFT matrix of order $N$; thus element $(j, k)$ of $W_N$ is $\omega_N^{jk}$ (rows and columns of $W_N$ are indexed from 0 to $N-1$), and (2.1) can be rewritten as $x = W_N z$. Further, we define $W_N^{[r]}$ to be the matrix $W_N$ with all its elements raised to the power $r$. Then (2.8) corresponds to the factorization

(2.9)   $W_N = R^{-1} \left( W_{N_2}^{[N_1]} \times W_{N_1}^{[N_2]} \right) R,$

where $\times$ denotes the Kronecker (tensor) product and $R$ is the permutation matrix which maps the integers $0 \le n \le N-1$ to the corresponding integer pairs $(n_1, n_2)$ via the Ruritanian map (2.4). In general, if $N = N_1 N_2 \cdots N_k$, where all the $N_i$'s are mutually prime, then the factorization is

(2.10)   $W_N = R^{-1} \left( W_{N_k}^{[N/N_k]} \times \cdots \times W_{N_2}^{[N/N_2]} \times W_{N_1}^{[N/N_1]} \right) R,$



where $R$ now maps the array $x(n)$ to the corresponding $k$-dimensional array $x(n_1, n_2, \ldots, n_k)$ via the appropriate Ruritanian mapping. Notice that, to compute the transform, it is not necessary to reorder the input (or output) physically; the required mapping can be implemented implicitly via the indexing logic.
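A two-factor instance of this scheme can be checked numerically. The Python sketch below evaluates (2.8) with naive rotated DFTs addressed through the Ruritanian map and compares the result with a direct DFT. It is a plain illustration, not the optimized DFT modules the paper relies on, and the names `rotated_dft` and `pfa_2d` are assumptions of this example.

```python
import cmath


def rotated_dft(z, r):
    """Naive DFT with every matrix element raised to the power r."""
    n = len(z)
    w = cmath.exp(2j * cmath.pi / n)
    return [sum(z[k] * w ** ((r * j * k) % n) for k in range(n)) for j in range(n)]


def pfa_2d(z, n1, n2):
    """Length n1*n2 DFT (gcd(n1, n2) == 1) as two sets of rotated short DFTs, per (2.8)."""
    n = n1 * n2

    def addr(i1, i2):
        return (n2 * i1 + n1 * i2) % n          # Ruritanian map (2.6)/(2.7)

    x = list(z)                                 # each short transform overwrites its input
    for i2 in range(n2):                        # N2 transforms of length N1, rotation N2
        idx = [addr(i1, i2) for i1 in range(n1)]
        for a, v in zip(idx, rotated_dft([x[a] for a in idx], n2 % n1)):
            x[a] = v
    for i1 in range(n1):                        # N1 transforms of length N2, rotation N1
        idx = [addr(i1, i2) for i2 in range(n2)]
        for a, v in zip(idx, rotated_dft([x[a] for a in idx], n1 % n2)):
            x[a] = v
    return x                                    # already in natural order


z = [complex(k, (k * k) % 7) for k in range(40)]
direct = rotated_dft(z, 1)
fast = pfa_2d(z, 8, 5)
assert max(abs(a - b) for a, b in zip(direct, fast)) < 1e-8
```

No twiddle factors appear between the two stages, and no separate permutation pass is needed: the mapping lives entirely in the address computation.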

3. The generalized PFA. Using the results of the previous section, we can derive self-sorting in-place PFAs for certain values of $N$. For example, if $N = 60 = 3 \cdot 4 \cdot 5$, (2.10) becomes

(3.1)   $W_{60} = R^{-1} \left( W_5^{[12]} \times W_4^{[15]} \times W_3^{[20]} \right) R.$

Since the rotation $[r]$ in $W_{N_i}^{[r]}$ can be taken modulo $N_i$, (3.1) simplifies to

(3.2)   $W_{60} = R^{-1} \left( W_5^{[2]} \times W_4^{[3]} \times W_3^{[2]} \right) R.$

Algorithms are available for each of the short transforms $W_5$, $W_4$, $W_3$, including the rotations [14], [16], and these short transforms can be self-sorting and in-place.

Suppose now that $N = 129600 = 3^4 \times 4^3 \times 5^2$. If we arrange the factors in the form of a palindrome, for example

$N = 3 \times 3 \times 4 \times 5 \times 4 \times 5 \times 4 \times 3 \times 3,$

then we can use the results of [18, §5] to obtain a self-sorting in-place algorithm for a transform of this length. However, the operation count would be the same as that for the corresponding mixed-radix Cooley-Tukey algorithm.

Alternatively, by (2.10) we have

(3.3)   $W_{129600} = R^{-1} \left( W_{25}^{[5184]} \times W_{64}^{[2025]} \times W_{81}^{[1600]} \right) R,$

which, on reducing the rotations modulo $N_i$, becomes

(3.4)   $W_{129600} = R^{-1} \left( W_{25}^{[9]} \times W_{64}^{[41]} \times W_{81}^{[61]} \right) R.$
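The rotations in (3.1)-(3.4) follow directly from (2.10): each factor $N_i$ receives the rotation $N/N_i$, reduced modulo $N_i$. A minimal Python check (the helper name `rotations` is illustrative only):

```python
def rotations(factors):
    """Rotation (N / N_i) mod N_i for each mutually prime factor N_i."""
    n = 1
    for f in factors:
        n *= f
    return {f: (n // f) % f for f in factors}


print(rotations([3, 4, 5]))      # {3: 2, 4: 3, 5: 2}
print(rotations([81, 64, 25]))   # {81: 61, 64: 41, 25: 9}
```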

In previous work on the PFA, it appeared that a self-sorting in-place implementation of (3.4) was not possible, since there was no way to perform self-sorting in-place transforms of lengths 25, 64, and 81. This is no longer true, thanks to the Johnson-Burrus self-sorting in-place radix-2 algorithm [7] and its generalization to other radices [18]. We need one final ingredient, so that the necessary rotations can be incorporated.

The required lemma was given in [14], in connection with deriving a rotated DFT module for $N = 9$. It can be stated as follows: if we have a radix-$p$ algorithm for a transform of length $N = p^m$, then we can apply a rotation $r$ by the following:

(1) applying the rotation $r$ (modulo $p$) to each radix-$p$ module (e.g., by changing the multiplier constants); and

(2) raising all the twiddle factors to the power $r$.

For example, $W_{25}^{[9]}$ in (3.4) can be implemented by rotating each radix-5 module by $r' = 4$ ($= 9$ modulo 5), and by raising all the twiddle factors to the power 9 (note that if the twiddle factors are stored in a precomputed list, this simply reorders them).
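The lemma can be verified numerically with a simple recursive radix-$p$ transform. The sketch below uses an ordinary decimation-in-time recursion, which is neither self-sorting in the sense of [18] nor in-place; it is only meant to show that steps (1) and (2) together reproduce the rotated transform. The function names are assumptions of this example.

```python
import cmath


def radix_p_fft(z, p, r=1):
    """Rotated DFT of length p**m by recursive radix-p decimation in time."""
    n = len(z)
    if n == 1:
        return list(z)
    m = n // p
    # Shorter transforms on the p interleaved subsequences, with the same rotation r.
    subs = [radix_p_fft(z[a::p], p, r) for a in range(p)]
    w_n = cmath.exp(2j * cmath.pi / n)
    w_p = cmath.exp(2j * cmath.pi / p)
    rp = r % p                                   # step (1): module rotated by r mod p
    out = [0j] * n
    for j in range(m):
        # step (2): twiddle factors raised to the power r
        t = [subs[a][j] * w_n ** ((r * a * j) % n) for a in range(p)]
        for b in range(p):
            out[j + m * b] = sum(t[a] * w_p ** ((rp * a * b) % p) for a in range(p))
    return out


def rotated_dft(z, r):
    """Direct evaluation of the rotated DFT, used as the reference."""
    n = len(z)
    w = cmath.exp(2j * cmath.pi / n)
    return [sum(z[k] * w ** ((r * j * k) % n) for k in range(n)) for j in range(n)]


z = [complex(k % 7, k % 3) for k in range(25)]
assert max(abs(a - b) for a, b in zip(radix_p_fft(z, 5, 9), rotated_dft(z, 9))) < 1e-8
```

With `p = 5` and `r = 9` each radix-5 module is rotated by 4, matching the $W_{25}^{[9]}$ example in the text.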

Thus, the generalized self-sorting in-place prime factor FFT algorithm (GPFA) for any $N = 2^p 3^q 5^r$ is constructed as follows:

(1) Use the Ruritanian mapping to convert to a three-dimensional transform of size $2^p \times 3^q \times 5^r$;

(2) Do the component one-dimensional transforms of length $2^p$, $3^q$, and $5^r$ using the generalized Johnson-Burrus scheme [18] with rotations incorporated as described above.


If the factors $2^p$, $3^q$, and $5^r$ are all "small," then the above algorithm is equivalent to the original PFA. If $N = 2^p$, $3^q$, or $5^r$ (i.e., only one of $p$, $q$, $r$ is nonzero), then the transform remains one-dimensional and the above algorithm is equivalent to that in [18], with the same operation count as the Cooley-Tukey algorithm. In other cases, we have a new self-sorting in-place algorithm which takes advantage of the splitting of $N$ into its mutually prime factors to reduce the operation count.

3.1. Operation counts. In [13], it was shown how to compute the operation counts for the mixed-radix Cooley-Tukey FFT algorithm. Here we adapt the formulae given in [13] to the case $N = 2^p 3^q 5^r$, where the factors of 2 are treated in pairs (i.e., using a radix-4 algorithm). As in [13], the formulae assume that redundant multiplications are avoided when the twiddle factor is 1, but that there is no special treatment of other "simple" twiddle factors. The number of real additions is then given by

(3.5)   $\mathcal{A}(N) = 2N(1.375p + 2.67q + 4r - 1) + 2,$

while the number of real multiplications is

(3.6)   $\mathcal{M}(N) = 2N(0.75p + 2q + 2.8r - 2) + 4.$

In the case of the GPFA, we sum the operation counts from the transforms in each of the three dimensions: if $N = N_1 N_2 N_3$, where $N_1 = 2^p$, $N_2 = 3^q$, $N_3 = 5^r$, then the numbers of real additions and multiplications are given, respectively, by

(3.7)   $\mathcal{A}(N) = \sum_{i=1}^{3} (N/N_i)\,\mathcal{A}(N_i),$

(3.8)   $\mathcal{M}(N) = \sum_{i=1}^{3} (N/N_i)\,\mathcal{M}(N_i),$

where $\mathcal{A}(N_i)$ and $\mathcal{M}(N_i)$ are obtained from (3.5) and (3.6).

Examples for $N = 3600 = 3^2 \cdot 4^2 \cdot 5^2$ are given in Table 2. In comparison with the Cooley-Tukey algorithm, the GPFA saves 10 percent of the additions and 33 percent of the multiplications. Thus, besides being both self-sorting and in-place, the GPFA has a significantly lower operation count than the conventional FFT when $N$ contains a mixture of factors. Further examples of operation counts will be given in §4. (The GPFA+ algorithm is described in §5.)
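The Cooley-Tukey and GPFA rows of Table 2 can be reproduced from (3.5)-(3.8) with a few lines of Python. One assumption is made here: the coefficient printed as 2.67 is read as the exact fraction 8/3, which is what makes the results come out as the integers listed in the table. The function names are illustrative, and the formulae apply as written only when $p$ is even.

```python
from fractions import Fraction


def ct_counts(p, q, r):
    """Real (additions, multiplications) from (3.5)-(3.6); p is assumed even."""
    n = 2**p * 3**q * 5**r
    # Assumption: the printed 2.67 stands for 8/3.
    adds = 2 * n * (Fraction(11, 8) * p + Fraction(8, 3) * q + 4 * r - 1) + 2
    muls = 2 * n * (Fraction(3, 4) * p + 2 * q + Fraction(14, 5) * r - 2) + 4
    return int(adds), int(muls)


def gpfa_counts(p, q, r):
    """Real (additions, multiplications) for the GPFA from (3.7)-(3.8)."""
    n = 2**p * 3**q * 5**r
    adds = muls = 0
    for ni, exps in ((2**p, (p, 0, 0)), (3**q, (0, q, 0)), (5**r, (0, 0, r))):
        a, m = ct_counts(*exps)      # a factor of length 1 contributes (0, 0)
        adds += (n // ni) * a
        muls += (n // ni) * m
    return adds, muls


print(ct_counts(4, 2, 2))     # (128402, 76324)
print(gpfa_counts(4, 2, 2))   # (115538, 50596)
```

The ratios 115538/128402 and 50596/76324 correspond to the quoted savings of roughly 10 percent and 33 percent.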

As an additional bonus, there is a useful saving in the storage required for tables of precalculated twiddle factors. Since a separate table is now used for the transforms in each of the three dimensions, the storage required becomes $2^p + 3^q + 5^r$ rather than $2^p \times 3^q \times 5^r$. For example, in the case $N = 216000 = 3^3 \cdot 4^3 \cdot 5^3$ the twiddle factor storage requirement falls from 216000 to 216.

TABLE 2
Real operation counts for $N = 3600$.

  Algorithm        Additions    Multiplications

  Cooley-Tukey      128402          76324
  GPFA              115538          50596
  GPFA+             110250          40020

3.2. Indexing. To illustrate the indexing logic, we present a section of Fortran code which implements the first half of the radix-2 part of the algorithm for general $N = 2^p 3^q 5^r$. Set NI $= 2^p$, IP $= p$. We first compute the required rotation and set up the table of twiddle factors:

      COMPLEX TRIGS(NI)
C     DEL = 2*PI/NI; IROT IS THE ROTATION (N/NI) MODULO NI
      DEL = 4.0*ASIN(1.0)/FLOAT(NI)
      IROT = MOD((N/NI),NI)
      KK = 0
      DO 10 K = 1, NI
      ANGLE = FLOAT(KK)*DEL
      TRIGS(K) = CMPLX(COS(ANGLE),SIN(ANGLE))
      KK = KK + IROT
      IF (KK.GT.NI) KK = KK - NI
   10 CONTINUE

The first $(p + 1)/2$ radix-2 passes are then performed by the following code:

      COMPLEX X(N), W, Z
      NH = N/2
      INC = N/NI
      DO 50 L = 1, (IP+1)/2
      LA = 2**(L-1)
      JA = 0
      JB = NH/LA
      KK = 1
      DO 40 K = 0, JB-1, INC
      W = TRIGS(KK)
      DO 30 J = K+1, N, N/LA
      IA = JA + J
      IB = JB + J
C     STEP ACROSS THE TABLE, ONE BUTTERFLY PER TRANSFORM
      DO 20 I = 1, INC
      Z = W*(X(IA) - X(IB))
      X(IA) = X(IA) + X(IB)
      X(IB) = Z
      IA = IA + NI
      IF (IA.GT.N) IA = IA - N
      IB = IB + NI
      IF (IB.GT.N) IB = IB - N
   20 CONTINUE
   30 CONTINUE
      KK = KK + LA
   40 CONTINUE
   50 CONTINUE

The details of the indexing may be understood by comparing this code with Table 1 ($N = 40$, NI $= 8$). The three outer loops are very similar to the three loops of the code presented in [18], which performed the first half of a self-sorting in-place radix-2 algorithm. In the present case these loops set up base addresses in the first column of Table 1. The advantage of the Ruritanian map is that the entries in the first column (or row) increase monotonically in steps of N/NI; there is no need to compute these addresses modulo $N$. The innermost loop (DO 20) then steps across the table, performing one "butterfly" from each of the N/NI transforms of length NI. Only in this innermost loop is it necessary to update the addresses modulo $N$.
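To follow the address arithmetic without running the Fortran, the loop nest of the radix-2 passes can be transcribed into Python so that it only records which pairs of elements each butterfly touches. This is a tracing aid, not part of the algorithm; the function name `first_half_pairs` and the 0-based output convention are choices made for this sketch.

```python
def first_half_pairs(n, ni, level):
    """0-based (ia, ib, kk) triples visited by pass `level` (1-based) of the loop nest."""
    nh, inc = n // 2, n // ni
    la = 2 ** (level - 1)
    ja, jb = 0, nh // la
    pairs = []
    kk = 1                                       # index into TRIGS (1-based, as in Fortran)
    for k in range(0, jb, inc):                  # DO 40
        for j in range(k + 1, n + 1, n // la):   # DO 30: base addresses, first column
            ia, ib = ja + j, jb + j
            for _ in range(inc):                 # DO 20: step across the table
                pairs.append((ia - 1, ib - 1, kk))
                ia += ni
                if ia > n:
                    ia -= n
                ib += ni
                if ib > n:
                    ib -= n
        kk += la
    return pairs


pairs = first_half_pairs(40, 8, 1)
print(pairs[:5])   # [(0, 20, 1), (8, 28, 1), (16, 36, 1), (24, 4, 1), (32, 12, 1)]
```

For $N = 40$, NI $= 8$ the first five butterflies pair rows 0 and 4 of Table 1, column by column, which is the "stepping across the table" described in the text.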


The code for the second half of the radix-2 part of the algorithm for general $N = 2^p 3^q 5^r$ may be obtained by similarly adapting the corresponding code given in [18], again inserting a new innermost loop which "traverses" the table. Notice that the indexing for the radix-2 part of the algorithm depends only on NI and N/NI; it is immaterial whether the Ruritanian map is one-, two- or three-dimensional. The same is true of the corresponding radix-3 and radix-5 parts.

4. Implementation on Cray Y-MP. The three Cray Assembly Language (CAL) routines described in [18], which implemented multiple self-sorting in-place complex FFTs for $N = 2^p$, $3^q$ and $5^r$, have been generalized to implement the GPFA. The input parameter list for the modified version of each of the three routines now includes both $N = 2^p 3^q 5^r$ and the appropriate NI $= 2^p$, $3^q$, or $5^r$. The sequences of floating-point vector instructions are essentially unchanged, but the addressing and loop control now have to "navigate" the Ruritanian map as described in §3.2, and are more complicated. However, this extra complexity has no impact on the vectorization, since the innermost loops (within the individual vector instructions) still step across the $M$ simultaneous transforms being performed, with constant stride. For general $N = 2^p 3^q 5^r$, each of the three routines is now called in turn to implement the complete transform algorithm.

For the timing experiments presented below, 64 transforms were performed simultaneously. The experiments were run using a single processor of a Cray Y-MP with clock cycle 6.4 nsec. The new algorithm was compared with three other FFT routines, all CAL-coded and vectorized in the same way. In Tables 3-5, CFFT refers to the "old routine" used as a standard of comparison in [18]; this was originally coded for the Cray-1 and implements the self-sorting form of the Cooley-Tukey algorithm by alternating between the original data array and a work array of the same size. PFA refers to the implementation of the prime factor algorithm described in [15]. SSIP refers to the self-sorting in-place routines presented in [18], and GPFA is the new algorithm. Operation counts given in the tables were computed via (3.5)-(3.8), with appropriate modifications when $p$ is odd (as in [18], the $N_i = 2^p$ part of the computation is done using a radix-4 algorithm with an extra radix-2 or radix-8 section called once if $p$ is odd).

TABLE 3
Timing comparisons for self-sorting transforms of length $N$.

                            Real operations (+/*)       Time per transform (μs)    Efficiency
                               per transform                                         of GPFA
  N                          CFFT          GPFA          CFFT     PFA     GPFA        (%)

  120 = 2^3 · 3 · 5        2382/1276     2028/508        19.1    14.1    14.2        97.1
  144 = 2^4 · 3^2          2834/1444     2594/964        22.5    18.0    18.0        98.0
  180 = 2^2 · 3^2 · 5      3992/2272     3472/1232       31.1    24.0    24.0        98.2
  240 = 2^4 · 3 · 5        5362/2788     4686/1436       41.5    32.3    32.3        98.5
  360 = 2^3 · 3^2 · 5      9062/5260     7844/2644       71.5    54.1    53.9        98.9
  720 = 2^4 · 3^2 · 5     19922/11236   17578/6584      153.6   120.6   120.2        99.4

TABLE 4

                     Real operations (+/*)    Time per transform (μs)    Efficiency
                        per transform                                      of GPFA
  N                                            CFFT     SSIP    GPFA        (%)

   125 = 5^3            2752/1604              22.2     18.9    18.9        98.8
   243 = 3^5            5996/3892              47.3     41.4    41.1        99.2
   256 = 2^8            5122/2052              39.9     35.1    35.3        98.7
   625 = 5^4           18752/11504            150.1    127.9   127.9        99.7
   729 = 3^6           21872/14584            171.3    150.4   149.2        99.7
  1024 = 2^10          26114/11268            201.7    178.4   178.9        99.3
  2187 = 3^7           77276/52492            602.3    530.6   526.6        99.8
  3125 = 5^5          118752/75004            952.9    809.1   809.1        99.8
  4096 = 2^12         126978/57348            978.5    866.4   868.4        99.4

TABLE 5

                              Real operations (+/*)        Time per transform (μs)   Efficiency
                                 per transform                                         of GPFA
  N                            CFFT           GPFA           CFFT      GPFA             (%)

   216 = 2^3 · 3^3            4862/2812      4444/1868       39.2      30.6            98.7
   300 = 2^2 · 3 · 5^2        7452/4264      6624/2608       58.3      45.5            99.0
   400 = 2^4 · 5^2           10002/5284      9282/3844       78.0      63.6            99.3
   600 = 2^3 · 3 · 5^2       16702/9724     14748/5516      132.3     100.9            99.4
   900 = 2^2 · 3^2 · 5^2     27152/16384    24272/10624     211.1     165.7            99.6
  1200 = 2^4 · 3 · 5^2       36402/20644    32646/13132     282.6     222.9            99.6
  1500 = 2^2 · 3 · 5^3       49252/29704    45024/21248     386.8     307.1            99.7
  1800 = 2^3 · 3^2 · 5^2     59702/36364    53044/22148     470.5     362.0            99.7
  2400 = 2^5 · 3 · 5^2       80002/46084    71742/29564     629.1     491.6            99.2
  3000 = 2^3 · 3 · 5^3      107502/65404    97548/43996     825.5     665.6            99.7
  3200 = 2^7 · 5^2          107202/58244   100306/42852     842.2     684.0            99.7
  3600 = 2^4 · 3^2 · 5^2    128402/76324   115538/50596     994.8     787.7            99.7

Table 3 presents timing comparisons for values of $N$ suitable for the PFA. Operation counts for the PFA routine are in principle the same as those for GPFA. In practice they are slightly smaller in some cases, since the special DFT modules used in PFA for $N_i = 9$ and $N_i = 16$ include some economies which are not available to the general-purpose radix-3 and radix-4 DFT modules in GPFA. As shown in Table 3, the times for GPFA are almost exactly the same as those for PFA, and represent savings of 20-25 percent over CFFT. The last column of Table 3 gives a measure of the efficiency of GPFA, defined as in [18] as

$\dfrac{\text{Minimum possible time}}{\text{Measured time}} \times 100$ percent.

The minimum possible time is computed on the basis of the number of real additions in the algorithm, and the efficiency is equivalent to the percentage of CPU time during which the floating-point addition unit is active. As demonstrated in Table 3, optimum use of the hardware is very nearly achieved. In particular, the rather complicated indexing is successfully hidden behind the floating-point arithmetic.

Table 4 presents timings for values of $N$ suitable for the self-sorting in-place routines described in [18], i.e., of the form $2^p$, $3^q$, or $5^r$. The operation counts for the three routines are the same. The times for GPFA are almost exactly the same as those for SSIP (despite the more complicated indexing, which is redundant in this case as the Ruritanian map is now one-dimensional), representing savings of 10-15 percent over CFFT.


Table 5 presents timings for values of $N$ which contain a mixture of factors but are unsuitable for the PFA because at least one of the mutually prime factors is too large. The times for GPFA represent savings of 18-24 percent over CFFT, demonstrating that the two major advantages of the prime factor algorithm (self-sorting in-place capability and reduced operation count) have been extended to general values of $N = 2^p 3^q 5^r$.

The results presented in this section show that in GPFA we have a self-sorting in-place FFT algorithm which works for any $N = 2^p 3^q 5^r$. If the transform length $N$ is suitable for the PFA, then GPFA is essentially equivalent to it and runs just as fast. If $N = 2^p$, $3^q$, or $5^r$, then GPFA is equivalent to the self-sorting in-place algorithms in [18], and runs just as fast. If $N$ contains a mixture of factors but the PFA is not applicable, then GPFA has a lower operation count than the conventional FFT, and runs faster.

5. A further refinement. For suitable values of the transform length $N$, a refinement is possible which leads to a further reduction in the operation count. We illustrate the procedure by way of an example, for $N = 3600 = 3^2 \cdot 4^2 \cdot 5^2$. From (2.10), with the rotations $N/N_i$ reduced modulo $N_i$ ($144 \bmod 25 = 19$, $225 \bmod 16 = 1$, $400 \bmod 9 = 4$), we have

(5.1)   $W_{3600} = R^{-1} \left( W_{25}^{[19]} \times W_{16} \times W_9^{[4]} \right) R.$

The heart of the algorithm is a three-dimensional transform; the Ruritanian mapping and the rotations are not germane to the present discussion, and the same procedure could be used with any multidimensional transform.

In §3, (5.1) was implemented by performing transforms of length 9 along the first dimension, then transforms of length 16 along the second dimension, and finally transforms of length 25 along the third dimension. The transform of length 9 is a two-stage procedure using a radix-3 algorithm; it can be written

(5.2)   $W_9^{[4]} = U_9 (D_9 T_9).$

The first stage ($D_9 T_9$) consists of three radix-3 "butterflies" ($T_9$) followed by multiplication by a diagonal matrix ($D_9$) of twiddle factors. The second stage ($U_9$) consists of another three radix-3 butterflies (since we are using a self-sorting in-place algorithm here, the butterflies in $U_9$ are coupled and some results are interchanged [18], but again this does not affect the argument). Similarly, we can write the transforms of length 16 and 25 as two-stage procedures using radix-4 and radix-5 algorithms, respectively:

(5.3)   $W_{16} = U_{16} (D_{16} T_{16}),$

(5.4)   $W_{25}^{[19]} = U_{25} (D_{25} T_{25}),$

where $D_{16}$ and $D_{25}$ are diagonal matrices of twiddle factors.

Substituting (5.2)-(5.4) into (5.1),

(5.5)   $W_{3600} = R^{-1} \left( (U_{25} D_{25} T_{25}) \times (U_{16} D_{16} T_{16}) \times (U_9 D_9 T_9) \right) R.$

Using the algebra of Kronecker (tensor) products, (5.5) becomes

(5.6)   $W_{3600} = R^{-1} \left( (U_{25} \times U_{16} \times U_9)(D_{25} \times D_{16} \times D_9)(T_{25} \times T_{16} \times T_9) \right) R.$
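The step from (5.5) to (5.6) rests on the mixed-product property of the Kronecker product, which holds whenever the ordinary products involved are defined. Written out (with $A_i$, $B_i$ standing for arbitrary conforming square matrices, a notation introduced only for this remark):

```latex
% Mixed-product property for three Kronecker factors
(A_1 \times A_2 \times A_3)(B_1 \times B_2 \times B_3)
    = (A_1 B_1) \times (A_2 B_2) \times (A_3 B_3).

% Applied twice, read from right to left, to the factors of (5.5):
(U_{25} D_{25} T_{25}) \times (U_{16} D_{16} T_{16}) \times (U_9 D_9 T_9)
    = (U_{25} \times U_{16} \times U_9)\,
      (D_{25} \times D_{16} \times D_9)\,
      (T_{25} \times T_{16} \times T_9).
```

Since the Kronecker product of diagonal matrices is again diagonal, the middle factor is a single diagonal matrix of order 3600.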

The interpretation of (5.6) is as follows. We can perform the first stage ($T_9$) of the radix-3 algorithm along the first dimension, without the twiddle factors, and follow it immediately by the first stage ($T_{16}$) of the radix-4 algorithm along the second dimension, and the first stage ($T_{25}$) of the radix-5 algorithm along the third dimension. Next, we apply the "delayed" twiddle factors, contained in the matrix ($D_{25} \times D_{16} \times D_9$). Finally, we perform the second stages ($U_9$, $U_{16}$, $U_{25}$) of the radix-3, radix-4, and radix-5 algorithms along the three dimensions in turn. The important point about this reordering of the calculation is that ($D_{25} \times D_{16} \times D_9$) is a single diagonal matrix (of order $N$) of twiddle factors; thus, the twiddle factors in the original algorithm (5.1) have been nested together. Each of these combined twiddle factors can be found somewhere in the precomputed list of length $N$ as used for the Cooley-Tukey algorithm (unfortunately the reduction to a table of length $2^p + 3^q + 5^r$ is now lost). This nesting of the twiddle factors is analogous to that proposed for two-dimensional transforms in [13].

The number of twiddle factors in ($D_{25} \times D_{16} \times D_9$) that have the value 1 is given by the product of the numbers of 1's in the three matrices $D_{25}$, $D_{16}$, $D_9$, respectively. Assuming we can still pick these out and avoid redundant twiddle factor multiplications, the operation count for this refined algorithm (GPFA+), for $N = 3600$, is given in Table 2. Compared with GPFA, the new algorithm saves a further 5 percent of the additions and 20 percent of the multiplications.
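The GPFA+ row of Table 2 can be recovered from the GPFA row by counting non-unit twiddle factors. The Python sketch below makes two assumptions that are not stated explicitly in the text: the diagonal of each $D$ matrix holds the factors $\omega^{jk}$ for $0 \le j, k <$ radix (so a factor is 1 exactly when $jk = 0$), and each non-unit twiddle multiplication costs 4 real multiplications and 2 real additions. With these assumptions the tabulated figures are reproduced.

```python
def unit_twiddles(radix):
    """Number of 1's on the diagonal of D for a two-stage transform of length radix**2."""
    return sum(1 for j in range(radix) for k in range(radix) if j * k == 0)


N = 3600
ones = {n: unit_twiddles(rad) for n, rad in ((9, 3), (16, 4), (25, 5))}
nested_ones = ones[9] * ones[16] * ones[25]

gpfa_adds, gpfa_muls = 115538, 50596                      # GPFA row of Table 2
separate = sum((N // n) * (n - ones[n]) for n in ones)    # non-unit twiddles in GPFA
nested = N - nested_ones                                  # non-unit twiddles in GPFA+

# Assumed cost of one complex twiddle multiplication: 4 real mults, 2 real adds.
plus_adds = gpfa_adds - 2 * (separate - nested)
plus_muls = gpfa_muls - 4 * (separate - nested)
print(nested_ones, plus_adds, plus_muls)   # 315 110250 40020
```

Nesting replaces 5929 separate non-unit twiddle multiplications by 3285 combined ones, which accounts for the whole difference between the GPFA and GPFA+ rows.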

For $N = 2^p 3^q 5^r$, nesting the twiddle factors in this fashion can be done whenever at least two of $p$, $q$, $r$ are greater than one (so the Ruritanian map is at least two-dimensional, and the transforms in at least two of the dimensions consist of at least two stages with intervening twiddle factors). The practical aspects of indexing the refined algorithm in the general case have not yet been addressed.

6. Summary and discussion. In this paper we have developed a new FFT algorithm which works for any transform length of the form $N = 2^p 3^q 5^r$, and is always self-sorting and in-place. It includes the prime factor algorithm [14] and the self-sorting in-place version of the fixed-radix Cooley-Tukey algorithm [18] as special cases, and is always at least as fast as previously available algorithms.

A refinement, applicable for certain values of $N$, has already been proposed in §5; it is natural to ask whether further refinements are possible. Promising directions include replacing the radix-2 part of the procedure by the split-radix algorithm [4], [11], and the radix-3 part by the algorithm of Suzuki, Sone, and Kido [12]; but in both cases it remains to be shown whether these algorithms can be made self-sorting and in-place, and generalized to include rotations. These refinements would reduce the multiplication count to a greater extent than the addition count, which is unfortunate from the viewpoint of implementation on Cray-like machines, where the multiplications are effectively already free of charge.

Another topic worth pursuing is the analogous generalization to any $N = 2^p 3^q 5^r$ of the self-sorting in-place real/half-complex prime factor FFT [17] and the corresponding fast sine and cosine transform algorithms [9].

Acknowledgments. The author wishes to thank Dr. Deborah Salmond of Cray Research Inc. for her help in running the timing experiments on the Cray Y-MP.

REFERENCES

[1] C. S. BURRUS, Index mappings for multidimensional formulation of the DFT and convolution, IEEE Trans. Acoust. Speech Signal Process., 25 (1977), pp. 239-242.

[2] C. S. BURRUS AND P. W. ESCHENBACHER, An in-place, in-order prime factor FFT algorithm, IEEE Trans. Acoust. Speech Signal Process., 29 (1981), pp. 806-817.

[3] J. W. COOLEY AND J. W. TUKEY, An algorithm for the machine calculation of complex Fourier series, Math. Comp., 19 (1965), pp. 297-301.

[4] P. DUHAMEL, Implementation of "split-radix" FFT algorithms for complex, real and real-symmetric data, IEEE Trans. Acoust. Speech Signal Process., 34 (1986), pp. 285-295.

[5] I. J. GOOD, The interaction algorithm and practical Fourier analysis, J. Roy. Statist. Soc. Ser. B, 20 (1958), pp. 361-372.


[6] I. J. GOOD, The relationship between two Fast Fourier Transforms, IEEE Trans. Comput., 20 (1971), pp. 310-317.

[7] H. W. JOHNSON AND C. S. BURRUS, An in-place in-order radix-2 FFT, IEEE ICASSP-84 (1984), pp. 28A.2.1-28A.2.4.

[8] D. E. KNUTH, The Art of Computer Programming: Vol. 2, Seminumerical Algorithms, Addison-Wesley, Reading, MA, 1969.

[9] J. S. OTTO, Symmetric prime factor Fast Fourier Transform algorithms, SIAM J. Sci. Statist. Comput., 10 (1989), pp. 419-431.

[10] J. H. ROTHWEILER, Implementation of the in-order prime factor transform for various sizes, IEEE Trans. Acoust. Speech Signal Process., 30 (1982), pp. 105-107.

[11] H. V. SORENSEN, M. Z. HEIDEMAN, AND C. S. BURRUS, On computing the split-radix FFT, IEEE Trans. Acoust. Speech Signal Process., 34 (1986), pp. 152-156.

[12] Y. SUZUKI, T. SONE, AND K. KIDO, A new FFT algorithm of radix 3, 6 and 12, IEEE Trans. Acoust. Speech Signal Process., 34 (1986), pp. 380-383.

[13] C. TEMPERTON, Self-sorting mixed-radix Fast Fourier Transforms, J. Comput. Phys., 52 (1983), pp. 1-23.

[14] C. TEMPERTON, Implementation of a self-sorting in-place prime factor FFT algorithm, J. Comput. Phys., 58 (1985), pp. 283-299.

[15] C. TEMPERTON, Implementation of a prime factor FFT algorithm on Cray-1, Parallel Computing, 6 (1988), pp. 99-108.

[16] C. TEMPERTON, A new set of minimum-add small-n rotated DFT modules, J. Comput. Phys., 75 (1988), pp. 190-198.

[17] C. TEMPERTON, A self-sorting in-place prime factor real/half-complex FFT algorithm, J. Comput. Phys., 75 (1988), pp. 199-216.

[18] C. TEMPERTON, Self-sorting in-place Fast Fourier Transforms, SIAM J. Sci. Statist. Comput., 12 (1991), pp. 808-823.
