synergy.cs.vt.edu

Accelerating Fast Fourier Transform for Wideband Channelization

Carlo del Mundo*, Vignesh Adhinarayanan§, Wu-chun Feng*§

* Department of Electrical and Computer Engineering, § Department of Computer Science, Virginia Tech

Forecast

• Goal: Accelerate the Fast Fourier Transform (FFT) using graphics processing units (GPUs)
  – Replace fixed hardware ASICs with programmable GPUs

Carlo del Mundo, [email protected], carlodelmundo.com

http://upload.wikimedia.org/wikipedia/commons/7/79/SSDTR-ASIC_technology.jpga

http://images.bit-tech.net/content_images/2011/12/amd-radeon-hd-7970-3gb-review/amd-radeon-hd7970-e.jpg

Motivation

• FFT is a critical building block across many disciplines

http://www.ajnr.org/content/27/6/1230/F1.large.jpg

http://www.elektrodaily.com/wp-content/uploads/2013/02/shazam-app.png

http://www.wireless.vt.edu/symposium/2012/tutorials/sessionA2.html

Introduction

• Wideband Channelization
  – Purpose: To isolate channels within a wideband signal

Figure: Stages in a PFB Channelizer http://www.wireless.vt.edu/symposium/2012/tutorials/sessionA2.html

Introduction (Channelization)

• Algorithm: Polyphase filter bank (PFB) channelizer
  – Problem: FFT stage grows fastest in channelization

Choosing the Right Processor

• Criteria: Programmability & Performance

http://upload.wikimedia.org/wikipedia/commons/7/79/SSDTR-ASIC_technology.jpga

http://www.maximumpc.com/files/u154082/intel_cpu_socket3.jpg

http://fr.academic.ru/pictures/frwiki/70/Fpga_xilinx_spartan.jpg

http://images.bit-tech.net/content_images/2011/12/amd-radeon-hd-7970-3gb-review/amd-radeon-hd7970-e.jpg

Outline

• Motivation
• Introduction
• Background
• Approach
  – System-level optimizations
  – Algorithm-level optimizations
• Results
  – Optimizations in isolation
  – Optimizations in concert
• Conclusion

Background (GPUs)

• GPU Memory Hierarchy
  – Global Memory
  – Image Memory
  – Constant Memory
  – Local Memory
  – Registers

Memory Unit    Read Bandwidth (TB/s)
Registers      16.2
Constant       5.4
Local          2.7
L1/L2 Cache    1.35 / 0.45
Global         0.17

Table: Memory Read Bandwidth for Radeon HD 6970

Approach

• Act as the “human compiler”
  1. Derive a candidate set of optimizations for FFT on GPUs
  2. Apply optimizations in isolation
  3. Apply optimizations in concert

Figure labels: Candidate Optimizations, Optimizations in Isolation, Optimizations in Concert

Approach

• System-level Optimizations (applicable to any application)
  1. Register Preloading
  2. Vector Access/{Vector,Scalar} Arithmetic
  3. Constant Memory Usage
  4. Dynamic Instruction Reduction
  5. Memory Coalescing
  6. Image Memory

• Algorithm-level Optimizations
  1. Naïve Transpose (LM-CM)
  2. Compute/Transpose via LM (LM-CC)
  3. Compute/No Transpose via LM (LM-CT)

C. del Mundo et al., “Accelerating Fast Fourier Transform for Wideband Channelization,” IEEE ICC, Budapest, Hungary, June 2013.

System-level Optimizations

1. Register Preloading (RP)
   – Load to registers first

Without Register Preloading

    __kernel void unoptimized(__global float2 *buffer)
    {
        int index = …;
        buffer += index;

        // Operands are passed to the butterfly directly from global memory.
        FFT4_in_order_output(&buffer[0], &buffer[4], &buffer[8], &buffer[12]);
        // ... rest of the kernel is not shown on the slide
    }

With Register Preloading

    __kernel void optimized(__global float2 *buffer)
    {
        int index = …;
        buffer += index;

        __private float2 r0, r1, r2, r3;  // Register declaration

        // Explicit loads: copy the operands from global memory into registers.
        r0 = buffer[0]; r1 = buffer[1]; r2 = buffer[2]; r3 = buffer[3];

        // The butterfly now operates on registers only.
        FFT4_in_order_output(&r0, &r1, &r2, &r3);
        // ... rest of the kernel is not shown on the slide
    }
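The same load-first pattern can be sketched in plain C, outside OpenCL. This is an illustrative analogue, not the authors' kernel: `cfloat`, `fft4_preloaded`, and the `stride` parameter are hypothetical names, and because the slides do not show the body of `FFT4_in_order_output`, a standard forward 4-point DFT butterfly stands in for it. The point is only the structure: each input is read once into a local variable, the arithmetic touches locals only, and each output is written once.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical complex type standing in for OpenCL's float2. */
typedef struct { float re, im; } cfloat;

static cfloat cadd(cfloat a, cfloat b) { cfloat r = { a.re + b.re, a.im + b.im }; return r; }
static cfloat csub(cfloat a, cfloat b) { cfloat r = { a.re - b.re, a.im - b.im }; return r; }

/* In-place forward 4-point DFT over buffer[0], buffer[stride],
 * buffer[2*stride], buffer[3*stride]. */
void fft4_preloaded(cfloat *buffer, size_t stride)
{
    assert(buffer != NULL);

    /* Explicit loads: each input is read from memory exactly once. */
    cfloat r0 = buffer[0 * stride];
    cfloat r1 = buffer[1 * stride];
    cfloat r2 = buffer[2 * stride];
    cfloat r3 = buffer[3 * stride];

    /* Radix-4 butterfly computed on the local copies only. */
    cfloat a = cadd(r0, r2), b = csub(r0, r2);
    cfloat c = cadd(r1, r3), d = csub(r1, r3);

    cfloat x0 = cadd(a, c);
    cfloat x2 = csub(a, c);
    cfloat x1 = { b.re + d.im, b.im - d.re };  /* b - i*d */
    cfloat x3 = { b.re - d.im, b.im + d.re };  /* b + i*d */

    /* Explicit stores: each output is written exactly once. */
    buffer[0 * stride] = x0;
    buffer[1 * stride] = x1;
    buffer[2 * stride] = x2;
    buffer[3 * stride] = x3;
}
```

Calling `fft4_preloaded(x, 4)` would touch elements 0, 4, 8, and 12, the same indices the unoptimized slide kernel passes to its butterfly; `fft4_preloaded(x, 1)` matches the consecutive loads in the optimized one.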

2. Vector Access (float{2, 4, 8, 16})

   – Scalar Math (VASM)
     • float + float
   – Vector Math (VAVM)
     • float4 + float4
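The difference between the two variants can be sketched on the CPU. This is a hedged analogue only: it assumes the GCC/Clang `vector_size` extension as a stand-in for OpenCL's `float4`, and the function names `add_vasm` and `add_vavm` are hypothetical. Both variants fetch each operand as one four-wide vector; VASM then adds component by component (`float + float`), while VAVM issues a single whole-vector add (`float4 + float4`).

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: GCC/Clang vector extension, four packed floats,
 * used here as a stand-in for OpenCL's float4. */
typedef float float4 __attribute__((vector_size(16)));

/* VASM: vector access, scalar math. */
void add_vasm(const float4 *a, const float4 *b, float4 *out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        float4 va = a[i];              /* one vector-wide load per operand */
        float4 vb = b[i];
        float4 r = { 0.0f, 0.0f, 0.0f, 0.0f };
        r[0] = va[0] + vb[0];          /* four separate float + float adds */
        r[1] = va[1] + vb[1];
        r[2] = va[2] + vb[2];
        r[3] = va[3] + vb[3];
        out[i] = r;
    }
}

/* VAVM: vector access, vector math. */
void add_vavm(const float4 *a, const float4 *b, float4 *out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];          /* one float4 + float4 add */
    }
}
```

Both functions produce identical results; which one runs faster depends on the target hardware, which is presumably why the deck evaluates the two as separate optimizations.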

Page 57: Synergy.cs.vt.edu Accelerating Fast Fourier Transform for Wideband Channelization Carlo del Mundo*, Vignesh Adhinarayanan §, Wu-chun Feng* § * Department

synergy.cs.vt.edu

Accelerating Fast Fourier Transform for Wideband Channelization

Approach

• System-level Optimizations (applicable to any application) 1. Register Preloading2. Vector Access/{Vector,Scalar} Arithmetic3. Constant Memory Usage 4. Dynamic Instruction Reduction5. Memory Coalescing6. Image Memory

• Algorithm-level Optimizations

Carlo del Mundo, [email protected], carlodelmundo.com

1C. del Mundo, W. Feng. “Accelerating Fast Fourier Transform for Wideband Channelization,” IEEE ICC, Budapest, Hungary, June 2013.

Page 58: Synergy.cs.vt.edu Accelerating Fast Fourier Transform for Wideband Channelization Carlo del Mundo*, Vignesh Adhinarayanan §, Wu-chun Feng* § * Department

Approach

• System-level Optimizations (applicable to any application)
1. Register Preloading
2. Vector Access/{Vector,Scalar} Arithmetic
3. Constant Memory Usage
4. Dynamic Instruction Reduction
5. Memory Coalescing
6. Image Memory

• Algorithm-level Optimizations
1. Naïve Transpose (LM-CM)
2. Compute/Transpose via LM (LM-CC)
3. Compute/No Transpose via LM (LM-CT)

[1] C. del Mundo, W. Feng. “Accelerating Fast Fourier Transform for Wideband Channelization,” IEEE ICC, Budapest, Hungary, June 2013.

Page 61: Synergy.cs.vt.edu Accelerating Fast Fourier Transform for Wideband Channelization Carlo del Mundo*, Vignesh Adhinarayanan §, Wu-chun Feng* § * Department

Algorithm-level optimizations

• Transpose: elements across the diagonal are exchanged

[Figure: a 4x4 matrix and its transposed matrix]
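As a small worked example of the definition above, the following sketch transposes a 4x4 matrix by swapping each pair of elements that mirror each other across the diagonal. The function name `transpose` and the sample values are illustrative only and do not come from the slides.

```python
def transpose(matrix):
    """Return a copy of a square matrix with elements swapped across the diagonal."""
    n = len(matrix)
    result = [row[:] for row in matrix]
    for i in range(n):
        # Only visit the upper triangle so each pair is swapped exactly once.
        for j in range(i + 1, n):
            result[i][j], result[j][i] = result[j][i], result[i][j]
    return result


# Row r of the original holds the values 4*r .. 4*r+3.
original = [[4 * r + c for c in range(4)] for r in range(4)]
transposed = transpose(original)
```

Diagonal elements stay in place, and transposing twice returns the original matrix.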

Page 67: Synergy.cs.vt.edu Accelerating Fast Fourier Transform for Wideband Channelization Carlo del Mundo*, Vignesh Adhinarayanan §, Wu-chun Feng* § * Department

Algorithm-level optimizations

1. Naïve Transpose (LM-CM)

[Figure: original and transposed matrices; threads t0, t1, t2, t3; register file; local memory]
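One plausible reading of the naïve transpose is that local memory is used purely for communication: each thread stores the row it holds in registers, then loads a column back. The sketch below simulates that with four threads in plain Python. The layout (thread t owns row t) and all variable names are assumptions made for illustration, not details taken from the slides.

```python
N = 4

# registers[t] models the register file of thread t, holding row t (assumed layout).
registers = [[N * t + k for k in range(N)] for t in range(N)]

# Local memory shared by the four threads, stored row-major.
local_memory = [None] * (N * N)

# Phase 1: every thread stores its row to local memory.
for t in range(N):
    for k in range(N):
        local_memory[N * t + k] = registers[t][k]

# On a GPU a work-group barrier would be needed here so that all stores
# are visible before any thread starts loading.

# Phase 2: every thread loads column t back into its registers.
transposed_registers = [
    [local_memory[N * k + t] for k in range(N)] for t in range(N)
]
```

No arithmetic happens while the data sits in local memory, so the round trip is pure overhead on top of the FFT computation.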

Page 78: Synergy.cs.vt.edu Accelerating Fast Fourier Transform for Wideband Channelization Carlo del Mundo*, Vignesh Adhinarayanan §, Wu-chun Feng* § * Department

Algorithm-level optimizations

3. The pseudo transpose (LM-CT)

– Idea:
• Load data to local memory
• Perform computation on columns, then rows.

– Advantage:
• Skips the transpose step

– Disadvantage:
• Local memory has lower throughput than registers.

[Figure: original and transposed matrices in local memory]
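The idea above can be sketched for a 16-point FFT viewed as a 4x4 matrix. This is a minimal illustration under stated assumptions, not the authors' kernel: a Python list plays the role of local memory, the 4-point sub-transforms use a direct DFT for brevity, and the function names are hypothetical. The columns are transformed in place, twiddle factors are applied, then the rows are transformed in place. No separate transpose pass is run; the reordering is folded into the index used for the final read.

```python
import cmath


def dft(values):
    """Direct O(n^2) DFT, used for the 4-point sub-transforms and as a reference."""
    n = len(values)
    return [
        sum(values[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft16_columns_then_rows(x):
    """16-point DFT computed on the columns, then the rows, of a 4x4 view."""
    n1 = n2 = 4
    n = n1 * n2
    # "Local memory": the 16 inputs viewed as a row-major 4x4 matrix.
    lm = list(x)
    # Step 1: 4-point DFT down each column, written back in place.
    for c in range(n2):
        col = dft([lm[n2 * r + c] for r in range(n1)])
        for r in range(n1):
            # Twiddle factor W_16^(r*c) is applied while storing.
            lm[n2 * r + c] = col[r] * cmath.exp(-2j * cmath.pi * r * c / n)
    # Step 2: 4-point DFT along each row, again in place.
    for r in range(n1):
        lm[n2 * r : n2 * r + n2] = dft(lm[n2 * r : n2 * r + n2])
    # lm[4*k1 + k2] now holds X[k1 + 4*k2], so read with swapped indices.
    return [lm[n2 * (k % n1) + k // n1] for k in range(n)]


signal = [complex(i % 5, (3 * i) % 7) for i in range(16)]
result = fft16_columns_then_rows(signal)
reference = dft(signal)
max_error = max(abs(a - b) for a, b in zip(result, reference))
```

The column accesses in step 1 are strided reads of local memory, which is where the slide's throughput disadvantage relative to registers would show up.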

Page 79: Synergy.cs.vt.edu Accelerating Fast Fourier Transform for Wideband Channelization Carlo del Mundo*, Vignesh Adhinarayanan §, Wu-chun Feng* § * Department

Outline

• Motivation
• Introduction
• Background
• Approach
– System-level optimizations
– Algorithm-level optimizations
• Results
– Optimizations in isolation
– Optimizations in concert
• Conclusion

Page 80: Synergy.cs.vt.edu Accelerating Fast Fourier Transform for Wideband Channelization Carlo del Mundo*, Vignesh Adhinarayanan §, Wu-chun Feng* § * Department

Results (Experimental Testbed)

GPU Testbed

Device (AMD Radeon)   Cores   Peak Performance (GFLOPS)   Peak Bandwidth (GB/s)
HD 7970               2048    3788                        264
HD 6970 (VLIW)        1536    2703                        176
HD 5870 (VLIW)        1600    2720                        154

• Algorithm:
– 1D FFT (batched), N = 16 pts
– Cooley-Tukey Decomposition
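The peak figures in the testbed table can be approximated from the core counts with a short calculation. The clock speeds, memory bus widths, and per-pin data rates below are assumptions taken from commonly published reference specifications for these cards, not from the slides, and the formula assumes each core retires one multiply-add (2 FLOPs) per cycle.

```python
# Assumed reference specs (not from the slides): engine clock in MHz,
# memory bus width in bits, effective memory data rate in Mbit/s per pin.
SPECS = {
    "HD 7970": {"cores": 2048, "clock_mhz": 925, "bus_bits": 384, "rate_mbps": 5500},
    "HD 6970": {"cores": 1536, "clock_mhz": 880, "bus_bits": 256, "rate_mbps": 5500},
    "HD 5870": {"cores": 1600, "clock_mhz": 850, "bus_bits": 256, "rate_mbps": 4800},
}


def peak_gflops(spec):
    """Cores x 2 FLOPs per cycle (multiply-add) x clock, in GFLOPS."""
    return spec["cores"] * 2 * spec["clock_mhz"] / 1000


def peak_bandwidth_gbs(spec):
    """Bus width in bytes x per-pin data rate, in GB/s."""
    return spec["bus_bits"] / 8 * spec["rate_mbps"] / 1000


peaks = {name: (peak_gflops(s), peak_bandwidth_gbs(s)) for name, s in SPECS.items()}
```

Under these assumptions each computed value lands within one unit of the corresponding table entry.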

Page 101: Synergy.cs.vt.edu Accelerating Fast Fourier Transform for Wideband Channelization Carlo del Mundo*, Vignesh Adhinarayanan §, Wu-chun Feng* § * Department

Results (in isolation)

[Chart: results in isolation, with callouts 53%, 18%, and 34%; legend: AMD Radeon HD 7970 (Scalar, non-VLIW), AMD Radeon HD 5870/6970 (VLIW)]

Page 112: Synergy.cs.vt.edu Accelerating Fast Fourier Transform for Wideband Channelization Carlo del Mundo*, Vignesh Adhinarayanan §, Wu-chun Feng* § * Department

Results (in concert)

• Improvements (Max. % Increase)
  – {RP + LM-CM} best on-chip optimization
  – Use Constant Memory (CM) for twiddle calculations
  – Use global memory (instead of image memory)
  – Optimal set for AMD GPUs
    • RP – Register Preloading
    • LM-CM – Transpose via local memory
    • CM – Constant memory usage
    • CGAP – Coalesced Global Access Pattern
    • VASM2 – Vector Access, Scalar Math (float2)

[Chart: speedups for combined optimizations; callouts: 2.9x, 2.4x, 2.4x, 2.1x, 6.5x, 2.4x, 2.4x, 6.5x, 6.3x, 1.8x, 1.5x, 5.6x, 5.6x, 5.6x, 5.6x]

*Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations.
2 All implementations are coalesced (CGAP) and use VASM2.
3 The speedups listed in the graph apply only to the Radeon HD 7970 (for brevity).

Page 113: Synergy.cs.vt.edu Accelerating Fast Fourier Transform for Wideband Channelization Carlo del Mundo*, Vignesh Adhinarayanan §, Wu-chun Feng* § * Department

Results (1D FFT 16-pts, GPU versions)

• The optimized GPU implementation is faster than the baseline GPU implementation by a factor of 14.5

Page 115: Synergy.cs.vt.edu Accelerating Fast Fourier Transform for Wideband Channelization Carlo del Mundo*, Vignesh Adhinarayanan §, Wu-chun Feng* § * Department

Conclusions

• Contributions:
  – A portable building block for FFT towards GPU-based radios
  – Architecture-aware insights for mapping and optimizing FFT across three generations of AMD GPUs
• Optimal set for AMD GPUs
  – RP – Register Preloading
  – LM-CM – Transpose via local memory
  – CM – Constant memory usage
  – CGAP – Coalesced Global Access Pattern
  – VASM2 – Vector Access, Scalar Math (float2)
• Contact:
  – Carlo del Mundo
  – [email protected]

http://images.bit-tech.net/content_images/2011/12/amd-radeon-hd-7970-3gb-review/amd-radeon-hd7970-e.jpg

http://www.wireless.vt.edu/symposium/2012/tutorials/sessionA2.html

Page 116: Synergy.cs.vt.edu Accelerating Fast Fourier Transform for Wideband Channelization Carlo del Mundo*, Vignesh Adhinarayanan §, Wu-chun Feng* § * Department

Appendix Slides

Page 117: Synergy.cs.vt.edu Accelerating Fast Fourier Transform for Wideband Channelization Carlo del Mundo*, Vignesh Adhinarayanan §, Wu-chun Feng* § * Department

Introduction (FFT)

• Fast Fourier Transform (FFT)
  – A spectral method
• Key computational idiom for present and future applications (dwarf)§

List of Dwarfs
1. Finite State Machine
2. Circuits
3. Graph Algorithms
4. Structured Grid
5. Dense Matrix
6. Sparse Matrix
7. Spectral Methods
8. Dynamic Prog.
9. Particle Methods
10. Backtrack/B&B
11. Graphical Models
12. Unstructured Grids
13. Map Reduce

§ Asanovic et al. A View of the Parallel Computing Landscape. CACM, 2009.

Page 118: Synergy.cs.vt.edu Accelerating Fast Fourier Transform for Wideband Channelization Carlo del Mundo*, Vignesh Adhinarayanan §, Wu-chun Feng* § * Department

Background (Optimizing on GPUs)

1. RP (Register Preloading) - All data elements are first preloaded into the register file of the respective GPU. Computation is performed solely on registers.
2. CGAP (Coalesced Global Access Pattern) - Threads access memory contiguously (the kth thread accesses memory element k).
3. VASM2/4 (Vector Access, Scalar Math, float{2/4}) - Data elements are loaded as the listed vector type. Arithmetic operations are scalar (float x float).
4. LM-CM (Local Memory, Communication Only) - Data elements are loaded into local memory only for communication. Threads swap data elements solely in local memory.
5. LM-CT (Local Memory, Computation, No Transpose) - Data elements are loaded into local memory for computation. The communication step is avoided by algorithm reorganization.
6. LM-CC (Local Memory, Computation and Communication) - All data elements are preloaded into local memory. Computation is performed in local memory, while registers are used for scratchpad communication.
7. CM-K (Constant Memory - Kernel Argument) - The twiddle factors used in the twiddle multiplication stage of the FFT are precomputed on the CPU and stored in GPU constant memory for fast lookup.
8. CSE (Common Subexpression Elimination) - A traditional optimization that collapses identical expressions to save computation. It may lengthen register live ranges and thereby increase register pressure.
9. IL (Function Inlining) - A function's code body is inserted in place of a function call. It is used primarily for functions that are called frequently.
10. IM (Image Memory) - A texture image is used in place of global memory.
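
As a rough illustration of CGAP (item 2), the Python sketch below models which global-memory element each thread touches at each step, once for a coalesced layout and once for a layout where every thread walks its own contiguous chunk. The function names and the 4-thread, 4-element sizes are assumptions chosen for illustration; they do not come from the slides, and the model says nothing about actual hardware transaction sizes.

```python
def coalesced_addresses(num_threads, elems_per_thread):
    """Step s: thread k reads element s * num_threads + k (neighbours are adjacent)."""
    return [[s * num_threads + k for k in range(num_threads)]
            for s in range(elems_per_thread)]


def strided_addresses(num_threads, elems_per_thread):
    """Step s: thread k reads element k * elems_per_thread + s (neighbours are far apart)."""
    return [[k * elems_per_thread + s for k in range(num_threads)]
            for s in range(elems_per_thread)]


def is_contiguous(addresses):
    """True if the threads of one step touch consecutive memory elements."""
    return all(b - a == 1 for a, b in zip(addresses, addresses[1:]))


if __name__ == "__main__":
    for name, pattern in (("coalesced", coalesced_addresses(4, 4)),
                          ("strided", strided_addresses(4, 4))):
        for step, addrs in enumerate(pattern):
            print(name, "step", step, addrs, "contiguous:", is_contiguous(addrs))
```

Both layouts visit the same 16 elements; only the coalesced one has neighbouring threads reading neighbouring elements at every step.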

Page 119: Synergy.cs.vt.edu Accelerating Fast Fourier Transform for Wideband Channelization Carlo del Mundo*, Vignesh Adhinarayanan §, Wu-chun Feng* § * Department

Motivation (GPU FFT vs. CPU FFT)

• GPU FFT outperforms CPU FFT by factors as high as 6.5*
  – 1D batched FFT, N = 16 pts

* Device-Host Data Transfer Not Included
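
To make "1D batched FFT, N = 16 pts" concrete, here is a minimal CPU-side reference in Python: a flat sample list is cut into consecutive 16-point blocks and each block is transformed independently. This is only a sketch that could serve as a correctness check for kernel output; it is not the CPU implementation that was benchmarked, and the function names `fft`, `batched_fft`, and `dft` are assumptions.

```python
import cmath


def fft(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def batched_fft(signal, n=16):
    """Split a flat sample list into consecutive length-n blocks and transform each."""
    if len(signal) % n != 0:
        raise ValueError("signal length must be a multiple of n")
    return [fft(signal[i:i + n]) for i in range(0, len(signal), n)]


def dft(x):
    """Direct O(n^2) DFT, used only to cross-check the FFT."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]
```

Because the blocks are independent, a batch maps naturally onto many GPU work-groups running the same small transform.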

Page 120: Synergy.cs.vt.edu Accelerating Fast Fourier Transform for Wideband Channelization Carlo del Mundo*, Vignesh Adhinarayanan §, Wu-chun Feng* § * Department

Introduction (Channelizer Architecture)

• Channelizer Architecture
  – FIR Filtering, FFT, and Channel Mapping.
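
As a minimal sketch of how the three stages could fit together, the Python model below uses a weighted overlap-add arrangement: each frame is weighted by the FIR prototype taps and folded down to `num_channels` values, transformed, and then selected bins are routed to output channels. The slides do not specify the filter structure, so this arrangement, the function name `channelize`, its parameters, and the direct DFT standing in for the FFT are all assumptions made for illustration.

```python
import cmath


def dft(x):
    """Direct DFT; stands in for the FFT stage of the channelizer."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


def channelize(samples, prototype, num_channels, channel_map):
    """Return one list of complex outputs per entry of channel_map.

    samples      -- input sample sequence
    prototype    -- FIR prototype taps; length must be a multiple of num_channels
    num_channels -- transform size (number of frequency bins per frame)
    channel_map  -- bin indices to keep, in output order
    """
    taps = len(prototype)
    if taps % num_channels != 0:
        raise ValueError("prototype length must be a multiple of num_channels")
    outputs = [[] for _ in channel_map]
    for start in range(0, len(samples) - taps + 1, num_channels):
        # Stage 1 (FIR filtering): weight the frame by the prototype taps and
        # fold it down to num_channels values.
        weighted = [samples[start + n] * prototype[n] for n in range(taps)]
        folded = [sum(weighted[r::num_channels]) for r in range(num_channels)]
        # Stage 2 (FFT): transform the folded values into frequency bins.
        bins = dft(folded)
        # Stage 3 (channel mapping): route the selected bins to output channels.
        for out, k in zip(outputs, channel_map):
            out.append(bins[k])
    return outputs
```

In this model the frame advances by `num_channels` samples per output, so each channel is produced at the input rate divided by `num_channels`.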

Page 122: Synergy.cs.vt.edu Accelerating Fast Fourier Transform for Wideband Channelization Carlo del Mundo*, Vignesh Adhinarayanan §, Wu-chun Feng* § * Department

S3: Constant Memory

• Fast cached lookup for frequently used data

    __constant float2 twiddles[16] = {
        (float2)(1.0f, 0.0f), (float2)(1.0f, 0.0f),
        (float2)(1.0f, 0.0f), (float2)(1.0f, 0.0f),
        /* ... more sin/cos values */
    };

    /* Complex multiply (helper name is illustrative): float2 `*` in
       OpenCL C is component-wise, not a complex product. */
    float2 cmul(float2 a, float2 b)
    {
        return (float2)(a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x);
    }

Without Constant Memory

    /* Twiddle factors are recomputed with cos/sin on every iteration. */
    for (int j = 1; j < 4; ++j)
    {
        float theta = -2.0f * M_PI_F * tid * j / 16.0f;
        float2 twid = (float2)(cos(theta), sin(theta));
        result[j] = cmul(buffer[j*4], twid);
    }

With Constant Memory

    /* Twiddle factors are read from the precomputed table. */
    for (int j = 1; j < 4; ++j)
        result[j] = cmul(buffer[j*4], twiddles[4*j + tid]);
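
One way the 16-entry table could be generated on the host is sketched below in Python. The slides do not show this step, so the function name `make_twiddle_table` and its parameters are assumptions. Entry `4*j + tid` holds exp(-2πi · tid · j / 16), which matches both the index used in the constant-memory loop and the angle computed in the loop without constant memory; the j = 0 row is all (1, 0), consistent with the first four entries shown on the slide.

```python
import math


def make_twiddle_table(radix=4, n=16):
    """Return radix*radix (re, im) pairs; entry radix*j + tid is exp(-2*pi*i * tid*j / n)."""
    table = []
    for j in range(radix):
        for tid in range(radix):
            theta = -2.0 * math.pi * tid * j / n
            table.append((math.cos(theta), math.sin(theta)))
    return table


if __name__ == "__main__":
    # Print the table in lookup order, one entry per line.
    for index, (re, im) in enumerate(make_twiddle_table()):
        print(index, "%.6f" % re, "%.6f" % im)
```

The values depend only on `tid` and `j`, never on the input data, which is why they can be computed once and kept in constant memory.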