The Hardware-Software Co-Design Process for the Fast Fourier Transform (FFT)


Carlo C. del Mundo
Advisor: Prof. Wu-chun Feng
synergy.cs.vt.edu

The Multi- and Many-core Menace

• "...when we start talking about parallelism and ease of use of truly parallel computers, we're talking about a problem that's as hard as any that computer science has faced. ... I would be panicked if I were in industry."
  – John Hennessy, Stanford University, author of Computer Architecture: A Quantitative Approach

Berkeley's View

Dwarfs of Symbolic Computation1,2

• Dwarf (noun): An algorithmic method that captures a pattern of computation and communication.
• The thirteen dwarfs: Dense Linear Algebra, Sparse Linear Algebra, Spectral Methods, N-Body Methods, Structured Grid, Unstructured Grid, MapReduce, Combinational Logic, Graph Traversal, Dynamic Programming, Branch-and-Bound, Graphical Models, Finite State Machines.

1 Colella, Phillip. Defining Software Requirements for Scientific Computing. 2004. www.lanl.gov/orgs/hpc/salishan/salishan2005/davidpatterson.pdf
2 Asanovic et al. A View of the Parallel Computing Landscape. CACM, 2009.


Dwarfs of Symbolic Computation1,2

• Spectral Methods dwarf: the fast Fourier transform (FFT).
• The FFT sits at the hardware/software boundary, with software on one side and hardware on the other.


OpenFFT: A Heterogeneous FFT Library

• C. del Mundo and W. Feng. "Towards a Performance-Portable FFT Library for Heterogeneous Computing," IEEE IPDPS '14, Phoenix, AZ, USA, May 2014. (Under review.)
• C. del Mundo and W. Feng. "Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture," SC '13, Denver, CO, Nov. 2013. (Poster.)
• C. del Mundo et al. "Accelerating FFT for Wideband Channelization," IEEE ICC '13, Budapest, Hungary, June 2013.


Computational Pattern: The Butterfly

• Dwarf: Spectral Methods – the butterfly pattern.
• Figure 1 (simple butterfly): inputs a and b produce outputs a + b and a − b.
• Figure 2 (butterfly with twiddle wk): inputs a and b produce outputs a + b·wk and a − b·wk.
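As a concrete illustration (not from the original slides), here is a minimal CUDA-style sketch of the Figure 2 butterfly on complex samples stored as float2. The helper names cmul and butterfly_radix2 are hypothetical:

__device__ float2 cmul(float2 x, float2 y) {            // complex multiply
    return make_float2(x.x * y.x - x.y * y.y,
                       x.x * y.y + x.y * y.x);
}

__device__ void butterfly_radix2(float2 *a, float2 *b, float2 wk) {
    float2 t  = cmul(*b, wk);                            // b * wk
    float2 hi = make_float2(a->x + t.x, a->y + t.y);     // a + b*wk
    float2 lo = make_float2(a->x - t.x, a->y - t.y);     // a - b*wk
    *a = hi;
    *b = lo;
}

The simple butterfly of Figure 1 is the special case wk = 1.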

16-pt FFT: Stages of Computation

Starting from the input array (viewed as a 4 × 4 tile), the 16-point FFT proceeds in stages:
• S1: "Columns" – 4-point FFTs over the columns
• S2: "Twiddles" – element-wise multiplication by twiddle factors
• S3: "Transpose" – transpose the 4 × 4 tile
• S1: "Columns" – 4-point FFTs over the columns again
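A minimal sketch of these four stages on a single thread, assuming hypothetical helpers fft4() (a 4-point FFT over four complex values, playing the role of the FFT4_in_order_output routine used later in the deck) and cmul() (complex multiply), with the 16 samples held in a row-major 4 × 4 tile and output-ordering details ignored:

__device__ void fft4(float2 *a, float2 *b, float2 *c, float2 *d);   // hypothetical 4-point FFT
__device__ float2 cmul(float2 x, float2 y);                         // complex multiply

__device__ void fft16_staged(float2 x[16], const float2 tw[16]) {
    // S1: 4-point FFTs down the columns (stride 4).
    for (int c = 0; c < 4; ++c)
        fft4(&x[c], &x[c + 4], &x[c + 8], &x[c + 12]);
    // S2: multiply by twiddle factors.
    for (int i = 0; i < 16; ++i)
        x[i] = cmul(x[i], tw[i]);
    // S3: transpose the 4x4 tile in place.
    for (int r = 0; r < 4; ++r)
        for (int c = r + 1; c < 4; ++c) {
            float2 t = x[4*r + c]; x[4*r + c] = x[4*c + r]; x[4*c + r] = t;
        }
    // S1 again: 4-point FFTs down the columns of the transposed tile.
    for (int c = 0; c < 4; ++c)
        fft4(&x[c], &x[c + 4], &x[c + 8], &x[c + 12]);
}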

16-pt FFT: Computation-to-Core Mapping

The 16-point FFT can be mapped to the GPU at several granularities:
• 1 thread
• 4 threads
• One warp
• One block
• One SM (37.5% occupancy on an NVIDIA Kepler K20c)


Background (GPUs): GPU Memory Spaces

• GPU memory hierarchy:
  – Global memory
  – Image memory
  – Constant memory
  – Local memory
  – Registers

Table: Memory read bandwidth for the AMD Radeon HD 6970

Memory Unit    Read Bandwidth (TB/s)
Registers      16.2
Constant       5.4
Local          2.7
L1/L2 Cache    1.35 / 0.45
Global         0.17
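The slide uses OpenCL terminology for an AMD GPU. As a rough CUDA-side analogue (constant ↔ __constant__, local ↔ __shared__, private/registers ↔ automatic variables), here is a minimal sketch showing where each space appears in a kernel; the names are illustrative only and the kernel assumes a 64-thread block:

__constant__ float2 c_twiddles[16];                 // constant memory: small, read-only, cached

__global__ void memory_spaces_demo(const float2 *g_in, float2 *g_out) {
    __shared__ float2 s_tile[64];                   // shared memory (OpenCL "local memory")
    float2 r = g_in[threadIdx.x];                   // global-memory load into a register
    s_tile[threadIdx.x] = r;                        // stage through shared memory
    __syncthreads();
    float2 t = c_twiddles[threadIdx.x & 15];        // constant-memory lookup
    float2 v = s_tile[threadIdx.x ^ 1];             // read a neighbor's value from shared memory
    g_out[threadIdx.x] = make_float2(v.x * t.x - v.y * t.y,   // complex multiply, store to global
                                     v.x * t.y + v.y * t.x);
}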

Global Data Bandwidth (Bus Traffic)

• Bus traffic: bytes transferred from off-chip to on-chip memory and back.
• Suppose we take the FFT of a 128 MB data set. The minimum bus traffic is 2 × 128 MB = 256 MB:
  – Load 128 MB (global → on-chip)
  – Perform the FFT
  – Store 128 MB (on-chip → global)
• Bus traffic bounds performance for memory-bound kernels such as the FFT.
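A back-of-the-envelope sketch of that bound (assumptions: the 128 MB example above and the Radeon HD 6970's 176 GB/s peak bandwidth from the testbed table later in the deck; plain host-side C):

// Minimum time a memory-bound pass can take, given every byte crosses the bus
// twice (one load, one store), regardless of how fast the arithmetic is.
static double min_bus_time_seconds(double bytes, double peak_bw_bytes_per_s) {
    return 2.0 * bytes / peak_bw_bytes_per_s;
}
// Example: min_bus_time_seconds(128e6, 176e9) is about 1.5 ms.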

16-pt FFT: Computation-to-Core: One Warp

• What's wrong with this access pattern?
  – Scattered memory accesses (power-of-2 strides)
  – Uncoalesced memory accesses


System-Level Optimizations (applicable to any application)

1. Register preloading
2. Vector access with {vector, scalar} arithmetic
3. Constant memory usage
4. Dynamic instruction reduction
5. Memory coalescing

S1: Register Preloading

• Load data into registers first.

Without register preloading:

__kernel void FFT16_vanilla(__global float2 *buffer)
{
    int index = …;
    buffer += index;

    FFT4_in_order_output(&buffer[0], &buffer[4], &buffer[8], &buffer[12]);

With register preloading:

__kernel void FFT16_strawberry1(__global float2 *buffer)
{
    int index = …;
    buffer += index;

    float2 registers[4];            // explicit loads
    for (int i = 0; i < 4; ++i)
        registers[i] = buffer[4*i];
    FFT4_in_order_output(&registers[0], &registers[1], &registers[2], &registers[3]);

S2: Vector Types

• Vector access (float{2, 4, 8, 16})
  – Vector access, scalar math (VASM): float + float
  – Vector access, vector math (VAVM): floatN + floatN
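The deck's kernels are OpenCL, where floatN types support arithmetic operators directly. As a CUDA-flavored sketch of the same distinction (the operator+ overload below is an illustrative helper, not a CUDA built-in, and the function names are hypothetical):

// Vector access: one float4 load brings in four contiguous floats.
__device__ float4 vasm_add(const float4 *a, const float4 *b) {
    float4 x = a[0], y = b[0];                       // vector access
    return make_float4(x.x + y.x, x.y + y.y,         // scalar math on components (VASM)
                       x.z + y.z, x.w + y.w);
}

__device__ float4 operator+(float4 x, float4 y) {    // helper so floatN + floatN compiles in CUDA
    return make_float4(x.x + y.x, x.y + y.y, x.z + y.z, x.w + y.w);
}

__device__ float4 vavm_add(const float4 *a, const float4 *b) {
    return a[0] + b[0];                              // vector access, vector math (VAVM)
}

The source-level distinction is what matters here; on the VLIW AMD GPUs targeted by the deck, the two forms can map to different instruction mixes.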

S3: Constant Memory

• Fast, cached lookup for frequently used data.

__constant float2 twiddles[16] = {
    (float2)(1.0f, 0.0f), (float2)(1.0f, 0.0f),
    (float2)(1.0f, 0.0f), (float2)(1.0f, 0.0f),
    ... more sin/cos values
};

Without constant memory:

for (int j = 1; j < 4; ++j)
{
    double theta = -2.0 * M_PI * tid * j / 16;
    float2 twid = make_float2(cos(theta), sin(theta));
    result[j] = buffer[j*4] * twid;
}

With constant memory:

for (int j = 1; j < 4; ++j)
    result[j] = buffer[j*4] * twiddles[4*j + tid];


Algorithm-Level Optimizations (applicable only to the FFT)

1. Transpose via local memory (LM)
2. Compute and transpose via LM
3. Compute, no transpose, via LM
4. Register-to-register transpose (shuffle)

A1: Transpose via Local1 Memory

• Via shared memory
• Register to register (shfl)

1 CUDA shared memory == OpenCL local memory
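For concreteness, a minimal CUDA sketch of the shared-memory option (CUDA shared memory == OpenCL local memory, per the footnote). The function and the 4-thread setup are illustrative, not the authors' kernel; it assumes the 4 threads belong to one block so that __syncthreads() is valid:

// Threads tid = 0..3 each hold one row of a 4x4 tile in registers and
// cooperatively transpose it by staging through shared memory.
__device__ void transpose4x4_via_shared(float2 row[4], int tid) {
    __shared__ float2 tile[4][4];
    for (int j = 0; j < 4; ++j)      // write my row
        tile[tid][j] = row[j];
    __syncthreads();
    for (int j = 0; j < 4; ++j)      // read back my column
        row[j] = tile[j][tid];
    __syncthreads();
}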

Algorithm-level optimizations: 1. Naïve Transpose (LM-CM)

[Figure: threads t0–t3 move data between the register file and local memory, converting the original layout into the transposed layout.]

Shuffle Mechanics

• An FFT transpose can be implemented using shuffle.
• Across the register files of threads t0–t3, the transpose proceeds in three stages:
  – Stage 1: Horizontal (permute within each thread's registers)
  – Stage 2: Vertical (exchange values across threads)
  – Stage 3: Horizontal (permute within each thread's registers)
• Bottleneck: intra-thread data movement

Code 1 (NAIVE):

for (int k = 0; k < 4; ++k)
    dst_registers[k] = src_registers[(4 - tid + k) % 4];
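A minimal CUDA sketch of the three stages (illustrative, not the deck's kernel). It uses the modern __shfl_sync intrinsic, where the Kepler-era deck would have used __shfl, and assumes all 32 lanes of a warp execute it, with each aligned group of 4 lanes transposing its own 4 × 4 tile of float2 values. The dynamic register indexing in stages 1 and 3 is exactly the NAIVE pattern discussed on the next slide; it can spill to CUDA local memory, which the SELP variants avoid.

__device__ float2 shfl_float2(float2 v, int srcLane) {
    v.x = __shfl_sync(0xffffffffu, v.x, srcLane, 4);   // width 4: shuffle within a 4-lane group
    v.y = __shfl_sync(0xffffffffu, v.y, srcLane, 4);
    return v;
}

// Thread `lane` (0..3 within its group) enters holding row `lane` of a 4x4 tile
// in r[0..3] and exits holding column `lane`.
__device__ void transpose4x4_shfl(float2 r[4], int lane) {
    float2 t[4], s[4];
    #pragma unroll
    for (int j = 0; j < 4; ++j)                        // Stage 1: horizontal (rotate within the thread)
        t[j] = r[(lane + j) & 3];
    #pragma unroll
    for (int j = 0; j < 4; ++j)                        // Stage 2: vertical (exchange across threads)
        s[j] = shfl_float2(t[j], (lane - j + 4) & 3);
    #pragma unroll
    for (int j = 0; j < 4; ++j)                        // Stage 3: horizontal (rotate back)
        r[j] = s[(lane - j + 4) & 3];
}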

Shuffle Mechanics

General strategies:
• Shuffle instructions are cheap.
• CUDA local memory is slow: the compiler is forced to place register arrays into CUDA local memory if array indices cannot be determined at compile time.

Code 1 (NAIVE): dynamic register indexing

for (int k = 0; k < 4; ++k)
    dst_registers[k] = src_registers[(4 - tid + k) % 4];

Code 2 (DIV): explicit branches per tid, which introduces divergence

int tmp = src_registers[0];
if (tid == 1)
{
    src_registers[0] = src_registers[3];
    src_registers[3] = src_registers[2];
    src_registers[2] = src_registers[1];
    src_registers[1] = tmp;
}
else if (tid == 2)
{
    src_registers[0] = src_registers[2];
    src_registers[2] = tmp;
    tmp = src_registers[1];
    src_registers[1] = src_registers[3];
    src_registers[3] = tmp;
}
else if (tid == 3)
{
    src_registers[0] = src_registers[1];
    src_registers[1] = src_registers[2];
    src_registers[2] = src_registers[3];
    src_registers[3] = tmp;
}

Code 3 (SELP OOP): branch-free predicated selects into a separate destination array

dst_registers[0] = (tid == 0) ? src_registers[0] : dst_registers[0];
dst_registers[1] = (tid == 0) ? src_registers[1] : dst_registers[1];
dst_registers[2] = (tid == 0) ? src_registers[2] : dst_registers[2];
dst_registers[3] = (tid == 0) ? src_registers[3] : dst_registers[3];

dst_registers[0] = (tid == 1) ? src_registers[3] : dst_registers[0];
dst_registers[3] = (tid == 1) ? src_registers[2] : dst_registers[3];
dst_registers[2] = (tid == 1) ? src_registers[1] : dst_registers[2];
dst_registers[1] = (tid == 1) ? src_registers[0] : dst_registers[1];

dst_registers[0] = (tid == 2) ? src_registers[2] : dst_registers[0];
dst_registers[2] = (tid == 2) ? src_registers[0] : dst_registers[2];
dst_registers[1] = (tid == 2) ? src_registers[3] : dst_registers[1];
dst_registers[3] = (tid == 2) ? src_registers[1] : dst_registers[3];

dst_registers[0] = (tid == 3) ? src_registers[1] : dst_registers[0];
dst_registers[1] = (tid == 3) ? src_registers[2] : dst_registers[1];
dst_registers[2] = (tid == 3) ? src_registers[3] : dst_registers[2];
dst_registers[3] = (tid == 3) ? src_registers[0] : dst_registers[3];

Results (Experimental Testbed)

Evaluation
• Algorithm: 1D FFT (batched), N = 16, 64, and 256 points
• FFTW version: v3.3.2 (4 threads, OpenMP, with AVX extensions)
• FFTW hardware: Intel i5-2400 (4 cores @ 3.1 GHz)

GPU Testbed

Device               Cores   Peak Performance (GFLOPS)   Peak Bandwidth (GB/s)   Max TDP (W)
AMD Radeon HD 6970   1536    2703                        176                     250
AMD Radeon HD 7970   2048    3788                        264                     250
NVIDIA Tesla C2075   448     1288                        144                     225
NVIDIA Tesla K20c    2496    4106                        208                     225

Results (optimizations in isolation)

• Minimize bus traffic via on-chip optimizations (RP, LM-CM, LM-CC, LM-CT)
  – Critical on AMD GPUs, not so much on NVIDIA GPUs
• Use VASM2/VASM4 (do not consider VAVM types)

Legend: RP: register preloading; LM-{CM, CT, CC}: local memory – {communication only; compute, no transpose; computation and communication}; VASM{n}: vectorized access & scalar math (floatn); VAVM{n}: vectorized access & vector math (floatn); CM: constant memory usage; CGAP: coalesced access pattern; LU: loop unrolling; CSE: common subexpression elimination; IL: function inlining; baseline: VASM2.

Results (optimizations in concert)

• Device data transfer (black) subsumes execution time.§
• One set of optimizations works for all GPUs.§

§ Configuration: RP + LM-CM + VASM2 + CM/CGAP

(Legend as in the previous slide.)

Shuffle Results

• Fraction_enhanced (0 < f < 1) = 16.5%
• Max. speedup_enhanced (s > 1) = 1.19x
  – Speedup_enhanced = 1.08x ... but wait, there's more!

Name        Reg.   Shm.   SLOC   LMEM
Shared      50     4K     11     0
SELP(OOP)   74     4K     382    0
SELP(IP)    49     4K     418    0
DIV         72     4K     462    0
Naive       52     4K     100    128
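For reference, the 1.19x ceiling is consistent with Amdahl's law for an enhanced fraction of f = 0.165 (a rough check, assuming the usual formulation):

\[
S_{\text{overall}} = \frac{1}{(1-f) + f/s}, \qquad
S_{\max} = \lim_{s \to \infty} S_{\text{overall}} = \frac{1}{1-f} = \frac{1}{1-0.165} \approx 1.19 .
\]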

Shuffle Results: At 50% Occupancy

• Fraction_enhanced (0 < f < 1) = 16.5%
• Max. speedup_enhanced1 (s > 1) = 1.19x
  – Speedup_enhanced2 = 1.08x → 1.17x
• Higher performance at higher occupancy

Name        Reg.   Shm.   SLOC   OCC
Shared      50     4K     11     37.5%
SELP(OOP)   74     4K     382    37.5%
SELP(IP)    49     4K     418    37.5%
SELP(IP)    49     0      418    50%

1 Calculation at 37.5% occupancy
2 Calculation at 50% occupancy

Results (1D FFT, 256 points)

• Speedups as high as 9.1x and 5.8x over FFTW
• Speedups as high as 18.2x and 2.9x over the unoptimized GPU code

Summary

• Approach
  – Focus on identifying optimizations for the hardware
• Takeaways
  – FFTs are memory-bound; the focus should be on memory optimizations
  – One homogeneous set of optimizations works for all GPUs: RP + LM-CM + VASM2 + CM/CGAP

Thank You!

• Contributions:
  – Optimization principles for the FFT on GPUs
  – An analysis of GPU optimizations applied in isolation and in concert on AMD and NVIDIA GPU architectures
• Contact:
  – Carlo del Mundo <cdel@vt.edu>