The Hardware-Software Co-Design Process for the Fast Fourier Transform (FFT)


Carlo C. del Mundo
Advisor: Prof. Wu-chun Feng
synergy.cs.vt.edu

The Multi- and Many-core Menace

• "...when we start talking about parallelism and ease of use of truly parallel computers, we're talking about a problem that's as hard as any that computer science has faced. ... I would be panicked if I were in industry."
  – John Hennessy, Stanford University, author of Computer Architecture: A Quantitative Approach

Berkeley's View

Dwarfs of Symbolic Computation1,2

• Dwarf (noun): An algorithmic method that captures a pattern of computation and communication.
• The thirteen dwarfs: Dense Linear Algebra, Sparse Linear Algebra, Spectral Methods, N-Body Methods, Structured Grid, Unstructured Grid, MapReduce, Combinational Logic, Graph Traversal, Dynamic Programming, Branch-and-Bound, Graphical Models, Finite State Machines.

1 Colella, Phillip. Defining Software Requirements for Scientific Computing. 2004. www.lanl.gov/orgs/hpc/salishan/salishan2005/davidpatterson.pdf
2 Asanovic et al. A View of the Parallel Computing Landscape. CACM, 2009.


Dwarfs of Symbolic Computation1,2

• Spectral Methods dwarf: the fast Fourier transform (FFT).
• The FFT sits at the hardware/software boundary, with software on one side and hardware on the other.


OpenFFT: A Heterogeneous FFT Library

• C. del Mundo and W. Feng. "Towards a Performance-Portable FFT Library for Heterogeneous Computing," IEEE IPDPS '14, Phoenix, AZ, USA, May 2014. (Under review.)
• C. del Mundo and W. Feng. "Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture," SC '13, Denver, CO, Nov. 2013. (Poster.)
• C. del Mundo et al. "Accelerating FFT for Wideband Channelization," IEEE ICC '13, Budapest, Hungary, June 2013.


Computational Pattern: The Butterfly

• Dwarf: Spectral Methods – the butterfly pattern.
• Figure 1 (simple butterfly): inputs a and b produce outputs a + b and a − b.
• Figure 2 (butterfly with twiddle wk): inputs a and b produce outputs a + b·wk and a − b·wk.
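As a concrete illustration (not from the original slides), here is a minimal CUDA-style sketch of the Figure 2 butterfly on complex samples stored as float2. The helper names cmul and butterfly_radix2 are hypothetical:

__device__ float2 cmul(float2 x, float2 y) {            // complex multiply
    return make_float2(x.x * y.x - x.y * y.y,
                       x.x * y.y + x.y * y.x);
}

__device__ void butterfly_radix2(float2 *a, float2 *b, float2 wk) {
    float2 t  = cmul(*b, wk);                            // b * wk
    float2 hi = make_float2(a->x + t.x, a->y + t.y);     // a + b*wk
    float2 lo = make_float2(a->x - t.x, a->y - t.y);     // a - b*wk
    *a = hi;
    *b = lo;
}

The simple butterfly of Figure 1 is the special case wk = 1.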

16-pt FFT: Stages of Computation

Starting from the input array (viewed as a 4 × 4 tile), the 16-point FFT proceeds in stages:
• S1: "Columns" – 4-point FFTs over the columns
• S2: "Twiddles" – element-wise multiplication by twiddle factors
• S3: "Transpose" – transpose the 4 × 4 tile
• S1: "Columns" – 4-point FFTs over the columns again
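A minimal sketch of these four stages on a single thread, assuming hypothetical helpers fft4() (a 4-point FFT over four complex values, playing the role of the FFT4_in_order_output routine used later in the deck) and cmul() (complex multiply), with the 16 samples held in a row-major 4 × 4 tile and output-ordering details ignored:

__device__ void fft4(float2 *a, float2 *b, float2 *c, float2 *d);   // hypothetical 4-point FFT
__device__ float2 cmul(float2 x, float2 y);                         // complex multiply

__device__ void fft16_staged(float2 x[16], const float2 tw[16]) {
    // S1: 4-point FFTs down the columns (stride 4).
    for (int c = 0; c < 4; ++c)
        fft4(&x[c], &x[c + 4], &x[c + 8], &x[c + 12]);
    // S2: multiply by twiddle factors.
    for (int i = 0; i < 16; ++i)
        x[i] = cmul(x[i], tw[i]);
    // S3: transpose the 4x4 tile in place.
    for (int r = 0; r < 4; ++r)
        for (int c = r + 1; c < 4; ++c) {
            float2 t = x[4*r + c]; x[4*r + c] = x[4*c + r]; x[4*c + r] = t;
        }
    // S1 again: 4-point FFTs down the columns of the transposed tile.
    for (int c = 0; c < 4; ++c)
        fft4(&x[c], &x[c + 4], &x[c + 8], &x[c + 12]);
}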

16-pt FFT: Computation-to-Core Mapping

The 16-point FFT can be mapped to the GPU at several granularities:
• 1 thread
• 4 threads
• One warp
• One block
• One SM (37.5% occupancy on an NVIDIA Kepler K20c)


Background (GPUs): GPU Memory Spaces

• GPU memory hierarchy:
  – Global memory
  – Image memory
  – Constant memory
  – Local memory
  – Registers

Table: Memory read bandwidth for the AMD Radeon HD 6970

Memory Unit    Read Bandwidth (TB/s)
Registers      16.2
Constant       5.4
Local          2.7
L1/L2 Cache    1.35 / 0.45
Global         0.17
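The slide uses OpenCL terminology for an AMD GPU. As a rough CUDA-side analogue (constant ↔ __constant__, local ↔ __shared__, private/registers ↔ automatic variables), here is a minimal sketch showing where each space appears in a kernel; the names are illustrative only and the kernel assumes a 64-thread block:

__constant__ float2 c_twiddles[16];                 // constant memory: small, read-only, cached

__global__ void memory_spaces_demo(const float2 *g_in, float2 *g_out) {
    __shared__ float2 s_tile[64];                   // shared memory (OpenCL "local memory")
    float2 r = g_in[threadIdx.x];                   // global-memory load into a register
    s_tile[threadIdx.x] = r;                        // stage through shared memory
    __syncthreads();
    float2 t = c_twiddles[threadIdx.x & 15];        // constant-memory lookup
    float2 v = s_tile[threadIdx.x ^ 1];             // read a neighbor's value from shared memory
    g_out[threadIdx.x] = make_float2(v.x * t.x - v.y * t.y,   // complex multiply, store to global
                                     v.x * t.y + v.y * t.x);
}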

Global Data Bandwidth (Bus Traffic)

• Bus traffic: bytes transferred from off-chip to on-chip memory and back.
• Suppose we take the FFT of a 128 MB data set. The minimum bus traffic is 2 × 128 MB = 256 MB:
  – Load 128 MB (global → on-chip)
  – Perform the FFT
  – Store 128 MB (on-chip → global)
• Bus traffic bounds performance for memory-bound kernels such as the FFT.
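A back-of-the-envelope sketch of that bound (assumptions: the 128 MB example above and the Radeon HD 6970's 176 GB/s peak bandwidth from the testbed table later in the deck; plain host-side C):

// Minimum time a memory-bound pass can take, given every byte crosses the bus
// twice (one load, one store), regardless of how fast the arithmetic is.
static double min_bus_time_seconds(double bytes, double peak_bw_bytes_per_s) {
    return 2.0 * bytes / peak_bw_bytes_per_s;
}
// Example: min_bus_time_seconds(128e6, 176e9) is about 1.5 ms.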

16-pt FFT: Computation-to-Core: One Warp

• What's wrong with this access pattern?
  – Scattered memory accesses (power-of-2 strides)
  – Uncoalesced memory accesses


System-Level Optimizations (applicable to any application)

1. Register preloading
2. Vector access with {vector, scalar} arithmetic
3. Constant memory usage
4. Dynamic instruction reduction
5. Memory coalescing

S1: Register Preloading

• Load data into registers first.

Without register preloading:

__kernel void FFT16_vanilla(__global float2 *buffer)
{
    int index = …;
    buffer += index;

    FFT4_in_order_output(&buffer[0], &buffer[4], &buffer[8], &buffer[12]);

With register preloading:

__kernel void FFT16_strawberry1(__global float2 *buffer)
{
    int index = …;
    buffer += index;

    float2 registers[4];            // explicit loads
    for (int i = 0; i < 4; ++i)
        registers[i] = buffer[4*i];
    FFT4_in_order_output(&registers[0], &registers[1], &registers[2], &registers[3]);

S2: Vector Types

• Vector access (float{2, 4, 8, 16})
  – Vector access, scalar math (VASM): float + float
  – Vector access, vector math (VAVM): floatN + floatN
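The deck's kernels are OpenCL, where floatN types support arithmetic operators directly. As a CUDA-flavored sketch of the same distinction (the operator+ overload below is an illustrative helper, not a CUDA built-in, and the function names are hypothetical):

// Vector access: one float4 load brings in four contiguous floats.
__device__ float4 vasm_add(const float4 *a, const float4 *b) {
    float4 x = a[0], y = b[0];                       // vector access
    return make_float4(x.x + y.x, x.y + y.y,         // scalar math on components (VASM)
                       x.z + y.z, x.w + y.w);
}

__device__ float4 operator+(float4 x, float4 y) {    // helper so floatN + floatN compiles in CUDA
    return make_float4(x.x + y.x, x.y + y.y, x.z + y.z, x.w + y.w);
}

__device__ float4 vavm_add(const float4 *a, const float4 *b) {
    return a[0] + b[0];                              // vector access, vector math (VAVM)
}

The source-level distinction is what matters here; on the VLIW AMD GPUs targeted by the deck, the two forms can map to different instruction mixes.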

S3: Constant Memory

• Fast, cached lookup for frequently used data.

__constant float2 twiddles[16] = {
    (float2)(1.0f, 0.0f), (float2)(1.0f, 0.0f),
    (float2)(1.0f, 0.0f), (float2)(1.0f, 0.0f),
    ... more sin/cos values
};

Without constant memory:

for (int j = 1; j < 4; ++j)
{
    double theta = -2.0 * M_PI * tid * j / 16;
    float2 twid = make_float2(cos(theta), sin(theta));
    result[j] = buffer[j*4] * twid;
}

With constant memory:

for (int j = 1; j < 4; ++j)
    result[j] = buffer[j*4] * twiddles[4*j + tid];


Algorithm-Level Optimizations (applicable only to the FFT)

1. Transpose via local memory (LM)
2. Compute and transpose via LM
3. Compute, no transpose, via LM
4. Register-to-register transpose (shuffle)

A1: Transpose via Local1 Memory

• Via shared memory
• Register to register (shfl)

1 CUDA shared memory == OpenCL local memory
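For concreteness, a minimal CUDA sketch of the shared-memory option (CUDA shared memory == OpenCL local memory, per the footnote). The function and the 4-thread setup are illustrative, not the authors' kernel; it assumes the 4 threads belong to one block so that __syncthreads() is valid:

// Threads tid = 0..3 each hold one row of a 4x4 tile in registers and
// cooperatively transpose it by staging through shared memory.
__device__ void transpose4x4_via_shared(float2 row[4], int tid) {
    __shared__ float2 tile[4][4];
    for (int j = 0; j < 4; ++j)      // write my row
        tile[tid][j] = row[j];
    __syncthreads();
    for (int j = 0; j < 4; ++j)      // read back my column
        row[j] = tile[j][tid];
    __syncthreads();
}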

Algorithm-level optimizations: 1. Naïve Transpose (LM-CM)

[Figure: threads t0–t3 move data between the register file and local memory, converting the original layout into the transposed layout.]

Shuffle Mechanics

• An FFT transpose can be implemented using shuffle.
• Across the register files of threads t0–t3, the transpose proceeds in three stages:
  – Stage 1: Horizontal (permute within each thread's registers)
  – Stage 2: Vertical (exchange values across threads)
  – Stage 3: Horizontal (permute within each thread's registers)
• Bottleneck: intra-thread data movement

Code 1 (NAIVE):

for (int k = 0; k < 4; ++k)
    dst_registers[k] = src_registers[(4 - tid + k) % 4];
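A minimal CUDA sketch of the three stages (illustrative, not the deck's kernel). It uses the modern __shfl_sync intrinsic, where the Kepler-era deck would have used __shfl, and assumes all 32 lanes of a warp execute it, with each aligned group of 4 lanes transposing its own 4 × 4 tile of float2 values. The dynamic register indexing in stages 1 and 3 is exactly the NAIVE pattern discussed on the next slide; it can spill to CUDA local memory, which the SELP variants avoid.

__device__ float2 shfl_float2(float2 v, int srcLane) {
    v.x = __shfl_sync(0xffffffffu, v.x, srcLane, 4);   // width 4: shuffle within a 4-lane group
    v.y = __shfl_sync(0xffffffffu, v.y, srcLane, 4);
    return v;
}

// Thread `lane` (0..3 within its group) enters holding row `lane` of a 4x4 tile
// in r[0..3] and exits holding column `lane`.
__device__ void transpose4x4_shfl(float2 r[4], int lane) {
    float2 t[4], s[4];
    #pragma unroll
    for (int j = 0; j < 4; ++j)                        // Stage 1: horizontal (rotate within the thread)
        t[j] = r[(lane + j) & 3];
    #pragma unroll
    for (int j = 0; j < 4; ++j)                        // Stage 2: vertical (exchange across threads)
        s[j] = shfl_float2(t[j], (lane - j + 4) & 3);
    #pragma unroll
    for (int j = 0; j < 4; ++j)                        // Stage 3: horizontal (rotate back)
        r[j] = s[(lane - j + 4) & 3];
}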

Shuffle Mechanics

General strategies:
• Shuffle instructions are cheap.
• CUDA local memory is slow: the compiler is forced to place register arrays into CUDA local memory if array indices cannot be determined at compile time.

Code 1 (NAIVE): dynamic register indexing

for (int k = 0; k < 4; ++k)
    dst_registers[k] = src_registers[(4 - tid + k) % 4];

Code 2 (DIV): explicit branches per tid, which introduces divergence

int tmp = src_registers[0];
if (tid == 1)
{
    src_registers[0] = src_registers[3];
    src_registers[3] = src_registers[2];
    src_registers[2] = src_registers[1];
    src_registers[1] = tmp;
}
else if (tid == 2)
{
    src_registers[0] = src_registers[2];
    src_registers[2] = tmp;
    tmp = src_registers[1];
    src_registers[1] = src_registers[3];
    src_registers[3] = tmp;
}
else if (tid == 3)
{
    src_registers[0] = src_registers[1];
    src_registers[1] = src_registers[2];
    src_registers[2] = src_registers[3];
    src_registers[3] = tmp;
}

Code 3 (SELP OOP): branch-free predicated selects into a separate destination array

dst_registers[0] = (tid == 0) ? src_registers[0] : dst_registers[0];
dst_registers[1] = (tid == 0) ? src_registers[1] : dst_registers[1];
dst_registers[2] = (tid == 0) ? src_registers[2] : dst_registers[2];
dst_registers[3] = (tid == 0) ? src_registers[3] : dst_registers[3];

dst_registers[0] = (tid == 1) ? src_registers[3] : dst_registers[0];
dst_registers[3] = (tid == 1) ? src_registers[2] : dst_registers[3];
dst_registers[2] = (tid == 1) ? src_registers[1] : dst_registers[2];
dst_registers[1] = (tid == 1) ? src_registers[0] : dst_registers[1];

dst_registers[0] = (tid == 2) ? src_registers[2] : dst_registers[0];
dst_registers[2] = (tid == 2) ? src_registers[0] : dst_registers[2];
dst_registers[1] = (tid == 2) ? src_registers[3] : dst_registers[1];
dst_registers[3] = (tid == 2) ? src_registers[1] : dst_registers[3];

dst_registers[0] = (tid == 3) ? src_registers[1] : dst_registers[0];
dst_registers[1] = (tid == 3) ? src_registers[2] : dst_registers[1];
dst_registers[2] = (tid == 3) ? src_registers[3] : dst_registers[2];
dst_registers[3] = (tid == 3) ? src_registers[0] : dst_registers[3];

Results (Experimental Testbed)

Evaluation
• Algorithm: 1D FFT (batched), N = 16, 64, and 256 points
• FFTW version: v3.3.2 (4 threads, OpenMP, with AVX extensions)
• FFTW hardware: Intel i5-2400 (4 cores @ 3.1 GHz)

GPU Testbed

Device               Cores   Peak Performance (GFLOPS)   Peak Bandwidth (GB/s)   Max TDP (W)
AMD Radeon HD 6970   1536    2703                        176                     250
AMD Radeon HD 7970   2048    3788                        264                     250
NVIDIA Tesla C2075   448     1288                        144                     225
NVIDIA Tesla K20c    2496    4106                        208                     225

Results (optimizations in isolation)

• Minimize bus traffic via on-chip optimizations (RP, LM-CM, LM-CC, LM-CT)
  – Critical on AMD GPUs, not so much on NVIDIA GPUs
• Use VASM2/VASM4 (do not consider VAVM types)

Legend: RP: register preloading; LM-{CM, CT, CC}: local memory – {communication only; compute, no transpose; computation and communication}; VASM{n}: vectorized access & scalar math (floatn); VAVM{n}: vectorized access & vector math (floatn); CM: constant memory usage; CGAP: coalesced access pattern; LU: loop unrolling; CSE: common subexpression elimination; IL: function inlining; baseline: VASM2.

Results (optimizations in concert)

• Device data transfer (black) subsumes execution time.§
• One set of optimizations works for all GPUs.§

§ Configuration: RP + LM-CM + VASM2 + CM/CGAP

(Legend as in the previous slide.)

Shuffle Results

• Fraction_enhanced (0 < f < 1) = 16.5%
• Max. speedup_enhanced (s > 1) = 1.19x
  – Speedup_enhanced = 1.08x ... but wait, there's more!

Name        Reg.   Shm.   SLOC   LMEM
Shared      50     4K     11     0
SELP(OOP)   74     4K     382    0
SELP(IP)    49     4K     418    0
DIV         72     4K     462    0
Naive       52     4K     100    128
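For reference, the 1.19x ceiling is consistent with Amdahl's law for an enhanced fraction of f = 0.165 (a rough check, assuming the usual formulation):

\[
S_{\text{overall}} = \frac{1}{(1-f) + f/s}, \qquad
S_{\max} = \lim_{s \to \infty} S_{\text{overall}} = \frac{1}{1-f} = \frac{1}{1-0.165} \approx 1.19 .
\]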

Shuffle Results: At 50% Occupancy

• Fraction_enhanced (0 < f < 1) = 16.5%
• Max. speedup_enhanced1 (s > 1) = 1.19x
  – Speedup_enhanced2 = 1.08x → 1.17x
• Higher performance at higher occupancy

Name        Reg.   Shm.   SLOC   OCC
Shared      50     4K     11     37.5%
SELP(OOP)   74     4K     382    37.5%
SELP(IP)    49     4K     418    37.5%
SELP(IP)    49     0      418    50%

1 Calculation at 37.5% occupancy
2 Calculation at 50% occupancy

Results (1D FFT, 256 points)

• Speedups as high as 9.1x and 5.8x over FFTW
• Speedups as high as 18.2x and 2.9x over the unoptimized GPU code

Summary

• Approach
  – Focus on identifying optimizations for the hardware
• Takeaways
  – FFTs are memory-bound; the focus should be on memory optimizations
  – One homogeneous set of optimizations works for all GPUs: RP + LM-CM + VASM2 + CM/CGAP

Thank You!

• Contributions:
  – Optimization principles for the FFT on GPUs
  – An analysis of GPU optimizations applied in isolation and in concert on AMD and NVIDIA GPU architectures
• Contact:
  – Carlo del Mundo <cdel@vt.edu>