© Synopsys 2013 1
Multi-Gigahertz Parallel FFTs for
FPGA and ASIC Implementation
Doug Johnson, Applications Consultant
Chris Eddington, Technical Marketing
Synopsys, Inc.
700 E. Middlefield Road
Mountain View, CA 94043
(949) 585-2748
Agenda
• Why FPGA & Parallelism for Signal Processing?
• Parallel FFT Architecture Basics
• Design Example: 1024-pt Parallel FFT
– Throughput Scalability on FPGA
– Power reduction in ASIC
• Using High-Level Synthesis to Implement Parallel FFTs
• Summary
Why FPGAs (and Parallelism)?
• Parallelism can achieve higher capacities than CPU or DSP
processor-based implementation (Software/Firmware)
• Flexibility allows architecture tuning for various throughputs
• Lower power
– Slower clocks and reduced logic/calculation for a given throughput [1]
[Figure: plot against time, with labels "Multicore DSP Processor Capacities" and "Flexible Signal Processing Architectures"]
[1] Dejan Markovic, Robert Brodersen, “DSP Architecture Design Essentials”, Springer, 2012
FPGA DSP Capacity Keeps Growing!
[Figure: capacity (gates, hardware DSP units and speed) and design effort across FPGA generations. Devices: Virtex-II, Virtex-4, Virtex-5, Stratix-IV / Virtex-6 (288-2016 DSPs, 500 MHz), Virtex-7 (1900K LUT, ~700-4000 DSPs), Stratix V (~700-4000 DSPs). Capacity labels: 100K, 200K, 330K, 760K (760K LUT), 1M, 2M. RTL-to-bitfile turnaround labels: minutes, hours, a day, days, "too long".]
Multi-Gigahertz Challenges
• How can we achieve multi-gigahertz throughput with device FMAX < 500 MHz?
– Parallel architectures!
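The arithmetic behind this bullet can be sketched in a few lines. This is an illustrative helper, not part of any Synopsys tool: the function names are hypothetical, and rounding the lane count up to a power of two is an assumption that suits radix-2 structures.

```python
def required_parallelism(throughput_sps: int, fmax_hz: int) -> int:
    """Smallest power-of-two number of samples per clock that reaches the
    target throughput without exceeding the device FMAX.
    (Power-of-two rounding is an assumption, not a rule from the slides.)"""
    if throughput_sps <= 0 or fmax_hz <= 0:
        raise ValueError("throughput and FMAX must be positive")
    lanes = -(-throughput_sps // fmax_hz)      # integer ceiling division
    return 1 << (lanes - 1).bit_length()       # round up to a power of two


def system_clock_hz(throughput_sps: int, lanes: int) -> float:
    """Clock rate needed when `lanes` samples are processed per cycle."""
    return throughput_sps / lanes


if __name__ == "__main__":
    target = 2_000_000_000                     # 2 GS/s
    for fmax in (500_000_000, 450_000_000):
        lanes = required_parallelism(target, fmax)
        clock_mhz = system_clock_hz(target, lanes) / 1e6
        print(f"FMAX {fmax / 1e6:.0f} MHz -> x{lanes} at {clock_mhz:.0f} MHz")
```

With a 500 MHz limit, 2 GS/s needs 4 samples per clock; if the fabric only closes timing at 450 MHz, the next power of two is 8 samples per clock at 250 MHz.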
But the design flow poses difficulties:
• IP Availability
– Can I get the tuned architecture that I really need?
• HDL design flow
– Expertise, architecture choices
• FPGA resource mapping & QoR
– How do I map to the specific HW resources?
• Portability
– Which vendor/device?
• Capacity and Long Turnaround Times
[Figure labels: "Algorithm Specification", "Flexible Signal Processing Architectures"]
Parallel FFT Architecture Basics
• Process multiple samples at a time
• Radix-2
– Multi-path delay commutator (MDC): throughput & pipelining
– Single-path delay feedback (SDF): area
• Radix-4
– Single-path delay feedback (SDF)
– Multi-path delay commutator (MDC)
– Single-path delay commutator (SDC)
Radix-2 Multi-Path Delay Commutator
• The classic approach to pipelined implementation of the radix-2 FFT
• The input sequence is split into two parallel data streams flowing forward; delays schedule the streams so that the data elements entering each butterfly are the correct “distance” apart
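The “distance” between butterfly inputs can be seen in a purely functional model. The sketch below is a plain radix-2 decimation-in-frequency FFT in Python, not the Synphony R2MDC block and not cycle-accurate: `span` is the spacing that the delays and commutators of an MDC pipeline would create between the two streams at each stage. All function names are illustrative.

```python
import cmath


def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def r2_dif_fft(samples):
    """Radix-2 DIF FFT. Each stage pairs elements `span` apart, which is
    the alignment an MDC stage achieves in hardware with delay lines."""
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    data = [complex(v) for v in samples]
    span = n // 2                              # distance between butterfly inputs
    while span >= 1:
        for start in range(0, n, 2 * span):
            for k in range(span):
                upper = data[start + k]            # stream A
                lower = data[start + k + span]     # stream B, `span` later
                twiddle = cmath.exp(-2j * cmath.pi * k / (2 * span))
                data[start + k] = upper + lower
                data[start + k + span] = (upper - lower) * twiddle
        span //= 2                             # the distance halves each stage
    bits = n.bit_length() - 1
    # In-place DIF leaves results in bit-reversed order; undo that here.
    return [data[bit_reverse(i, bits)] for i in range(n)]


if __name__ == "__main__":
    print(r2_dif_fft([1, 0, 0, 0, 0, 0, 0, 0]))   # unit impulse: flat spectrum
```

For a 1024-point transform the first stage pairs samples 512 apart, the next 256 apart, and so on down to adjacent samples, which is why the delay lengths in an MDC pipeline shrink from stage to stage.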
Implementation Using High-Level Design
• Created a parallel R2MDC parameterizable subsystem
in Synphony Model Compiler (Simulink/MATLAB)
• Top-level block accepts user inputs (length, number of parallel
inputs, flow control, dynamic programming) and
instantiates a chain of R2MDCs underneath
[Figure: parallel FFT architecture based on user configuration, built from a parameterized parallel radix2-mdc sub-block]
High Level System Models using Synphony Model Compiler
High Quality FPGA and ASIC Design From Simulink
• High-level synthesis creates highly optimized and re-usable hardware for FPGA and ASIC
• Save months in verifying & validating your FPGA or ASIC system hardware
• Increase simulation and system validation productivity from Simulink
• High-level signal processing IP library for easy capture of multirate, fixed-point algorithms
• Use, simulate and verify RTL natively within Simulink
[Figure: High-Level Design & Verification in Simulink. The SMC IP Model Library feeds Synphony Model Compiler high-level synthesis, which produces RTL for multiple architectures and targets (fft and filter blocks shown), RTL hardware verification, and C-models for system-level verification.]
Synphony High Level Synthesis
Automated flow from higher levels of abstraction
• Broad set of synthesizable signal processing functions
• Quickly create fixed-point algorithms
• Architecture optimizations and exploration:
• Automatic retiming/pipelining
• Folding for area optimizations using resource sharing & scheduling
• Fast, accurate speed, area, power tradeoff exploration
• Multi-rate clock circuit generation
• Choice of clocking strategies for multi-rate designs
[Figure: Synphony flow. A Synphony fixed-point model (Simulink simulation and verification, Synphony IP Model Library) enters the Synphony HLS engine for high-level synthesis, which outputs RTL, an RTL test bench, and a C-model for silicon, silicon prototype, and simulator targets.]
Synphony HLS Architectural Exploration
Example: Exploring several digital down-converter architectures
Example architectures to choose from
1. Baseline with dedicated clocks
– Retiming
2. Fold x1 with dedicated clocks
– Retiming
– Resource sharing turned on, with the system clock at 1X the sample rate
– Memory extraction turned on (>128x8)
3. Fold x2 with dedicated clocks
– Retiming
– Resource sharing turned on, with the system clock at 2X the sample rate
– Memory extraction turned on (>128x8)
4, 5, 6. Same as 1, 2, 3, but with enabled clocks
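The six candidates are the cross product of three folding options and two clocking styles. A minimal sketch of that enumeration follows; the dictionary keys are illustrative names, not Synphony option names.

```python
from itertools import product

# (label, system-clock factor relative to the sample rate; None = no folding)
FOLDING = [("Baseline", None), ("Fold x1", 1), ("Fold x2", 2)]
CLOCKING = ["dedicated clocks", "enabled clocks"]


def candidate_architectures():
    """List the six example architectures in the order given on the slide."""
    candidates = []
    for clocking, (label, fold) in product(CLOCKING, FOLDING):
        candidates.append({
            "id": len(candidates) + 1,
            "name": f"{label} with {clocking}",
            "retiming": True,                    # on for every candidate
            "resource_sharing": fold is not None,
            "system_clock_factor": fold,         # multiple of the sample rate
            "memory_extraction": fold is not None,   # slide threshold: >128x8
        })
    return candidates


if __name__ == "__main__":
    for arch in candidate_architectures():
        print(arch["id"], arch["name"])
```

Each entry would correspond to one set of user constraints handed to the HLS run, so the speed, area, and power results can be compared side by side.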
[Figure: Synphony HLS takes user constraints and applies architectural transformations and optimizations using an advanced timing engine. Outputs for each of the 6 example architectures: RTL, logic synthesis constraints, logic synthesis scripts, and an RTL test bench & script.]
Synphony High-Level Synthesis
Automatic test bench generation for equivalence checking & requirements traceability
• VHDL or Verilog test bench automatically generated based on Simulink data
• System data inputs are used to create stimulus
• System data outputs are used to generate expected results
• Bit-true proof that the VHDL/Verilog implementation is equivalent to the Simulink specification
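The idea of a bit-true check can be illustrated outside any HDL flow. In this sketch the reference model stands in for the Simulink specification and a second function stands in for the generated RTL; every name here is hypothetical, and the generated test benches themselves are VHDL or Verilog, not Python.

```python
def record_vectors(reference_model, stimulus):
    """Run the reference once and keep (input, expected output) pairs."""
    return [(sample, reference_model(sample)) for sample in stimulus]


def find_mismatches(implementation, vectors):
    """Compare outputs exactly; bit-true means no tolerance is allowed."""
    mismatches = []
    for index, (sample, expected) in enumerate(vectors):
        actual = implementation(sample)
        if actual != expected:
            mismatches.append((index, sample, expected, actual))
    return mismatches


def reference_gain(code: int) -> int:
    """Hypothetical fixed-point block: multiply by 3/4 with truncation."""
    return (code * 3) >> 2


def shift_add_gain(code: int) -> int:
    """Same arithmetic written as shift-and-add, as hardware might do it."""
    return ((code << 1) + code) >> 2


def rounding_gain(code: int) -> int:
    """Deliberately different: rounds instead of truncating."""
    return (code * 3 + 2) >> 2


if __name__ == "__main__":
    vectors = record_vectors(reference_gain, range(-8, 8))
    print(len(find_mismatches(shift_add_gain, vectors)), "mismatches (shift-add)")
    print(len(find_mismatches(rounding_gain, vectors)), "mismatches (rounding)")
```

A rounding-mode difference that would pass a tolerance-based comparison is flagged here, which is the point of checking fixed-point hardware bit for bit.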
FPGA Tool Suite – Implementation
[Figure: tool flow. High-level synthesis: Synphony Model Compiler. Simulation: VCS Simulator. RTL synthesis and analysis: Synplify Pro with HDL Analyst (logic synthesis); Synplify Premier with HDL Analyst and Physical Analyst (logic & physical synthesis). FPGA synthesis & debug, in-system debug: Identify RTL Instrumentor, Identify RTL Debugger. FPGA P&R tools. Targets: FPGA-based prototype board (HAPS), production board.]
Pipelining and Flow Control
• The Radix2-MDC implemented using Synphony Model Compiler has no
recursive loops, so it can be pipelined to achieve fast timing
• It is also specifically designed to map to advanced FPGA DSP devices
– Stratix V, Virtex-7
• Scalability is very good, but depends on FPGA fabric
Lowering Power: TSMC 40nm LP Results
• Power tradeoff results using Synphony MC ASIC flow
– 2 GS/s throughput held constant while varying parallelism
– Using activity data and Power Compiler optimization & report
• Parallel architectures achieve up to 51% lower dynamic power per frame (x16 vs. x2)
– Lower clock speeds
– Area penalty: up to 2.51X the x2 area (about 250%)
Parallel FFT (1)  FFT Throughput  System   Relative Dynamic  Relative  Relative       Latency
                  (Samples/s)     Clock    Power (2,3)       Area (2)  Leakage Power  (system cycles)
Parallel x2       2 GS/s          1 GHz    1X                1X        .0003          597
Parallel x4       2 GS/s          500 MHz  .72               1.23      .0004          332
Parallel x8       2 GS/s          250 MHz  .59               1.64      .0008          200
Parallel x16      2 GS/s          125 MHz  .49               2.51      .0008          102
(1) Using Synphony Model Compiler, parallel Radix2-MDC architecture and configuration as described in previous section
(2) Implementation flow with Synphony Model Compiler 2013.03, Design Compiler 2011.09-SP4
(3) Dynamic power estimated using Synphony generated design and test bench of 4 frames, VCS E-2011.03 for activity data, and Power Compiler 2011.09-SP1
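The table values can be cross-checked with a short script. The numbers below are copied from the table; the derived columns (throughput as lanes times clock, savings in percent, latency in nanoseconds) are simple arithmetic added here for illustration, and the latency conversion assumes the cycle counts are measured at the listed system clock.

```python
# (samples per clock, system clock in MHz, relative dynamic power,
#  relative area, latency in system cycles), copied from the table
ROWS = [
    (2, 1000, 1.00, 1.00, 597),
    (4, 500, 0.72, 1.23, 332),
    (8, 250, 0.59, 1.64, 200),
    (16, 125, 0.49, 2.51, 102),
]


def summarize(rows):
    """Derive throughput, power saving, area growth and latency in ns."""
    summary = []
    for lanes, clock_mhz, power, area, cycles in rows:
        summary.append({
            "lanes": lanes,
            "throughput_msps": lanes * clock_mhz,          # MS/s
            "power_saving_pct": round((1 - power) * 100),
            "area_increase_pct": round((area - 1) * 100),
            "latency_ns": cycles * 1000 / clock_mhz,
        })
    return summary


if __name__ == "__main__":
    for row in summarize(ROWS):
        print(row)
```

Every row works out to 2000 MS/s, confirming that throughput is held constant. Latency in cycles falls with parallelism, but because the clock slows as well, latency in time rises from 597 ns at x2 to 816 ns at x16.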
References
• [1] Dejan Markovic, Robert Brodersen, “DSP Architecture Design Essentials”, Springer, 2012.
• [2] Vladimir Stojanović, “Lecture 10, Fast Fourier Transform: VLSI Architectures”, course materials for 6.973 Communication System Design, MIT OpenCourseWare (http://ocw.mit.edu/), Spring 2006.
• [3] Shousheng He and M. Torkelson, “A new approach to pipeline FFT processor”, Proceedings of IPPS '96, The 10th International Parallel Processing Symposium, pp. 766-770, 1996.
• [4] E. Wold and Alvin M. Despain, “Pipeline and Parallel-Pipeline FFT Processors for VLSI Implementations”, IEEE Trans. Computers, vol. 33, no. 5, pp. 414-426, 1984.
Thank You