
© Synopsys 2013 1

Multi-Gigahertz Parallel FFTs for FPGA and ASIC Implementation

Doug Johnson, Applications Consultant

Chris Eddington, Technical Marketing

Synopsys, Inc.

700 E. Middlefield Road

Mountain View, CA 94043

[email protected]

(949) 585-2748


Agenda

• Why FPGA & Parallelism for Signal Processing?

• Parallel FFT Architecture Basics

• Design Example: 1024-pt Parallel FFT

– Throughput Scalability on FPGA

– Power reduction in ASIC

• Using High-Level Synthesis to Implement Parallel FFTs

• Summary


Why FPGAs (and Parallelism)?

• Parallelism can achieve higher capacities than CPU- or DSP-processor-based implementations (software/firmware)

• Flexibility allows architecture tuning for various throughputs

• Lower power!

– Slower clocks and reduced logic/calculation for a given throughput [1]

[Figure: multicore DSP processor capacities vs. flexible signal processing architectures, plotted over time]

[1] Dejan Markovic, Robert Brodersen, "DSP Architecture Design Essentials", Springer, 2012


FPGA DSP Capacity Keeps Growing!

[Figure: FPGA capacity (gates, hardware DSP units and speed) and design effort across device generations. Devices shown: Virtex-II, Virtex-4, Virtex-5, Stratix-IV / Virtex-6 (760K LUT, 288-2016 DSPs, 500 MHz), Virtex-7 (1900K LUT, ~700-4000 DSPs), Stratix V (~700-4000 DSPs). Capacity axis ticks: 100K, 200K, 330K, 760K, 1M, 2M. RTL-to-bitfile turnaround labels: minutes, hours, a day, days, "too long".]


Multi-Gigahertz Challenges

• How to achieve multi-gigahertz throughput with device FMAX < 500 MHz?

– Parallel architectures!

But Design Flow Difficulties:

• IP Availability

– Can I get the tuned architecture that I really need?

• HDL design flow

– Expertise, architecture choices

• FPGA resource mapping & QoR

– How do I map to the specific HW resources?

• Portability

– Which vendor/device?

• Capacity and Long Turnaround Times

[Figure: from algorithm specification to flexible signal processing architectures]
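As a rough sizing aid for the throughput question on this slide: a design that accepts P samples per clock delivers P × f_clk samples per second, so the number of parallel lanes follows from the target throughput and the achievable FMAX. The sketch below is illustrative only. The function names are hypothetical, and rounding the lane count up to a power of two is an assumption made here for radix-2 convenience, not something the slides state.

```python
import math


def required_parallelism(throughput_sps: float, fmax_hz: float) -> int:
    """Smallest power-of-two number of samples per clock that meets the throughput."""
    if throughput_sps <= 0 or fmax_hz <= 0:
        raise ValueError("throughput and fmax must be positive")
    lanes = math.ceil(throughput_sps / fmax_hz)
    # Assumption: round up to a power of two so the lanes divide a radix-2 FFT evenly.
    return 1 << (lanes - 1).bit_length()


def required_clock(throughput_sps: float, lanes: int) -> float:
    """Clock frequency needed once the lane count is fixed."""
    return throughput_sps / lanes


if __name__ == "__main__":
    for fmax in (500e6, 450e6):
        lanes = required_parallelism(2e9, fmax)
        clock_mhz = required_clock(2e9, lanes) / 1e6
        print(f"FMAX {fmax / 1e6:.0f} MHz: {lanes} lanes at {clock_mhz:.0f} MHz")
```

For 2 GS/s with a 500 MHz FMAX this gives 4 lanes at 500 MHz; with a 450 MHz FMAX it gives 8 lanes at 250 MHz.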


Parallel FFT Architecture Basics

• Process multiple samples at a time

• Radix-2

– Multi-path delay commutator (MDC): throughput & pipelining

– Single-path delay feedback (SDF): area

• Radix-4

– Single-path delay feedback (SDF)

– Multi-path delay commutator (MDC)

– Single-path delay commutator (SDC)


Radix-2 Multi-Path Delay Commutator

• The most classical approach for pipeline implementation of the radix-2 FFT

• Input sequence is broken into two parallel data streams flowing forward, with the correct "distance" between data elements entering the butterfly scheduled by proper delays
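The "distance" between butterfly inputs can be seen in a plain software model. The sketch below is a textbook radix-2 FFT in Python, assuming a decimation-in-frequency ordering; it is not the R2MDC hardware itself. Each pass of the outer loop plays the role of one pipeline stage, and `dist` is the spacing between the two samples that enter a butterfly together, which delays and commutators would have to arrange in a streaming design. The function names are hypothetical.

```python
import cmath


def bit_reverse(i: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_dif_stages(x):
    """Radix-2 decimation-in-frequency FFT, one butterfly stage per loop pass."""
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    data = [complex(v) for v in x]
    dist = n // 2  # distance between the two inputs of each butterfly
    while dist >= 1:
        for start in range(0, n, 2 * dist):
            for k in range(dist):
                a = data[start + k]
                b = data[start + k + dist]
                w = cmath.exp(-2j * cmath.pi * k / (2 * dist))  # twiddle factor
                data[start + k] = a + b
                data[start + k + dist] = (a - b) * w
        dist //= 2  # the next stage pairs samples half as far apart
    # A DIF flow leaves the results in bit-reversed order; undo that here.
    bits = n.bit_length() - 1
    out = [0j] * n
    for i, v in enumerate(data):
        out[bit_reverse(i, bits)] = v
    return out
```

For an 8-point transform the loop runs three stages with distances 4, 2 and 1.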


Implementation Using High-Level Design

• Created a parameterizable parallel R2MDC subsystem in Synphony Model Compiler (Simulink/MATLAB)

• Top-level block accepts user inputs (length, number of parallel inputs, flow control, dynamic programming) and instantiates a chain of R2MDCs underneath

[Figure: parallel FFT architecture based on user configuration, using a parameterized parallel radix-2 MDC sub-block]
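To make the two main user inputs concrete, the sketch below derives basic quantities from an FFT length and a number of parallel inputs. It is not the Synphony block's interface: the class and field names are hypothetical, and it relies only on two general facts, that a radix-2 FFT of length N has log2(N) butterfly stages and that a streaming design taking P samples per clock needs N/P clocks to accept one frame.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelFftPlan:
    """Hypothetical container for the user-facing FFT parameters."""

    length: int           # FFT size N
    parallel_inputs: int  # samples accepted per clock, P

    def __post_init__(self):
        for name in ("length", "parallel_inputs"):
            value = getattr(self, name)
            if value < 1 or value & (value - 1):
                raise ValueError(f"{name} must be a power of two")
        if self.parallel_inputs > self.length:
            raise ValueError("parallel_inputs cannot exceed length")

    @property
    def stages(self) -> int:
        """Number of radix-2 butterfly stages, log2(N)."""
        return self.length.bit_length() - 1

    @property
    def cycles_per_frame(self) -> int:
        """Clock cycles needed to stream one N-point frame in."""
        return self.length // self.parallel_inputs


if __name__ == "__main__":
    plan = ParallelFftPlan(length=1024, parallel_inputs=4)
    print(plan.stages, plan.cycles_per_frame)
```

A 1024-point FFT with 4 parallel inputs has 10 radix-2 stages and takes 256 clocks to accept one frame.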


High Level System Models using Synphony Model Compiler

High Quality FPGA and ASIC Design From Simulink

• High-level synthesis creates highly optimized and re-usable hardware for FPGA and ASIC

• Save months in verifying & validating your FPGA or ASIC system hardware

• Increase simulation and system validation productivity from Simulink

• High-level signal processing IP library for easy capture of multirate, fixed-point algorithms

• Use, simulate and verify RTL natively within Simulink

[Figure: High-level design and verification in Simulink. Synphony Model Compiler high-level synthesis, together with the SMC IP model library, takes fft/filter models and produces RTL for multiple architectures and targets, supporting hardware verification, plus C-models for system-level verification.]


Synphony High Level Synthesis

Automated flow from higher levels of abstraction

• Broad set of synthesizable signal processing functions

• Quickly create fixed-point algorithms

• Architecture optimizations and exploration:

• Automatic retiming/pipelining

• Folding for area optimizations using resource sharing & scheduling

• Fast, accurate speed, area, power tradeoff exploration

• Multi-rate clock circuit generation

• Choice of clocking strategies for multi-rate designs
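Folding, listed above as resource sharing and scheduling, trades clock rate for area. The sketch below is only an illustration of the idea, not the Synphony algorithm: an FIR dot product is computed once with one multiplier per tap, and once with `fold` products time-multiplexed onto each multiplier over `fold` system-clock ticks. The function names and the FIR example are assumptions chosen here for clarity.

```python
def fir_unfolded(taps, window):
    """One multiplier per tap: all products formed within one sample period."""
    return sum(c * x for c, x in zip(taps, window))


def fir_folded(taps, window, fold):
    """Share each multiplier across `fold` taps.

    Returns (result, multipliers_used). The system clock is assumed to run
    `fold` times faster than the sample rate.
    """
    if fold < 1:
        raise ValueError("fold must be at least 1")
    n_mults = -(-len(taps) // fold)  # ceiling division
    acc = 0
    for tick in range(fold):          # one pass per system-clock tick
        for m in range(n_mults):      # multipliers that work in parallel
            i = m * fold + tick       # tap scheduled on multiplier m this tick
            if i < len(taps):
                acc += taps[i] * window[i]
    return acc, n_mults


if __name__ == "__main__":
    taps, window = [1, 2, 3, 4], [5, 6, 7, 8]
    print(fir_unfolded(taps, window))
    print(fir_folded(taps, window, fold=2))
```

With four taps, a fold factor of 2 halves the multiplier count to two while producing the same result; the cost is a system clock running at twice the sample rate.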

[Figure: Synphony flow. A Synphony fixed-point model, built from the Synphony IP model library and simulated and verified in Simulink, feeds the Synphony HLS engine (high-level synthesis). Outputs shown: C-model, RTL test bench and RTL. Targets shown: silicon, silicon prototype and simulators.]


Synphony HLS Architectural Exploration

Example: Exploring several digital down-converter architectures

Example architectures to choose from

1. Baseline with dedicated clocks

– Retiming

2. Fold x1 with dedicated clocks

– Retiming

– Resource sharing turned on, using a system clock at 1X the sample rate

– Memory extraction turned on (>128x8)

3. Fold x2 with dedicated clocks

– Retiming

– Resource sharing turned on, using a system clock at 2X the sample rate

– Memory extraction turned on (>128x8)

4, 5, 6: Same as above but with enabled clocks

[Figure: Synphony HLS takes user constraints and applies architectural transformations and optimizations with an advanced timing engine. Outputs shown: RTL, logic synthesis constraints, logic synthesis scripts, and an RTL test bench and script. 6 example architectures.]


Synphony High-Level Synthesis

Automatic test bench generation for equivalence checking & requirements traceability

• VHDL or Verilog test bench automatically generated based on Simulink data.

• System data inputs are used to create the stimulus

• System data outputs are used to generate the expected results

• Bit-true proof of the VHDL/Verilog implementation equivalence vs. Simulink specification
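The bit-true idea above can be sketched in a few lines: quantize the reference data to fixed point, then require the implementation's output to match it exactly, sample for sample, with no tolerance. The helper names and the fixed-point format below are hypothetical and are not taken from the generated test bench.

```python
def to_fixed(value: float, frac_bits: int, word_bits: int) -> int:
    """Quantize to a signed fixed-point integer with saturation.

    Note: Python's round() rounds halves to even; a real flow would use
    whatever rounding mode the fixed-point model specifies.
    """
    scaled = round(value * (1 << frac_bits))
    lo = -(1 << (word_bits - 1))
    hi = (1 << (word_bits - 1)) - 1
    return max(lo, min(hi, scaled))


def first_mismatch(expected, actual):
    """Return (index, expected, actual) for the first difference, or None if bit-true."""
    if len(expected) != len(actual):
        raise ValueError("expected and actual must have the same length")
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return (i, e, a)
    return None


if __name__ == "__main__":
    golden = [to_fixed(v, frac_bits=8, word_bits=16) for v in (0.5, -0.25, 1.0)]
    dut_output = list(golden)  # stand-in for values captured from RTL simulation
    print(first_mismatch(golden, dut_output))
```

Exact integer comparison is the point here: a floating-point tolerance would hide single-LSB differences that a bit-true check is meant to catch.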


FPGA Tool Suite – Implementation

[Figure: tool flow. High-level synthesis: Synphony Model Compiler. RTL synthesis and analysis: Synplify Pro with HDL Analyst (logic synthesis) and Synplify Premier with HDL Analyst and Physical Analyst (logic & physical synthesis). FPGA synthesis & debug, in-system debug: Identify RTL Instrumentor and Identify RTL Debugger. Simulation: VCS Simulator. FPGA P&R tools. Targets: FPGA-based prototype board (HAPS) and production board.]


Pipelining and Flow Control

• The Radix2-MDC implemented using Synphony Model Compiler has no recursive loops, so it can be pipelined to achieve fast timing

• It is also specifically designed to map to advanced FPGA DSP devices

– Stratix V, Virtex-7

• Scalability is very good, but depends on FPGA fabric


Lowering Power: TSMC 40nm LP Results

• Power tradeoff results using Synphony MC ASIC flow

– 2 GS/s throughput held constant while varying parallelism

– Using activity data and Power Compiler optimization & report

• Parallel architectures achieve 51% lower power/frame

– Lower clock speeds

– Area penalty 250%

1. Using Synphony Model Compiler, parallel Radix2-MDC architecture and configuration as described in previous section

2. Implementation flow with Synphony Model Compiler 2013.03, Design Compiler 2011.09-SP4

3. Dynamic power estimated using Synphony generated design and test bench of 4 frames, VCS E-2011.03 for activity data, and Power Compiler 2011.09-SP1

Parallel FFT (1)   FFT Throughput   System    Relative Dynamic   Relative   Relative Leakage   Latency
                   (Samples/s)      Clock     Power (2,3)        Area (2)   Power              (system cycles)
Parallel x2        2 GS/s           1 GHz     1X                 1X         0.0003             597
Parallel x4        2 GS/s           500 MHz   0.72               1.23       0.0004             332
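The two table rows can be cross-checked with a little arithmetic: parallelism times system clock should equal the 2 GS/s throughput, and latency in cycles converts to time through the clock frequency. The sketch below hard-codes only the two rows listed above; the tuple layout and function name are illustrative.

```python
ROWS = [
    # (name, lanes, clock_hz, rel_dynamic_power, rel_area, latency_cycles)
    ("Parallel x2", 2, 1.0e9, 1.00, 1.00, 597),
    ("Parallel x4", 4, 500e6, 0.72, 1.23, 332),
]


def summarize(row):
    """Derive throughput, latency in ns, and changes relative to the x2 baseline."""
    name, lanes, clock_hz, power, area, cycles = row
    return {
        "name": name,
        "throughput_gsps": lanes * clock_hz / 1e9,
        "latency_ns": cycles * 1e9 / clock_hz,
        "power_saving_pct": round((1 - power) * 100),
        "area_increase_pct": round((area - 1) * 100),
    }


if __name__ == "__main__":
    for row in ROWS:
        print(summarize(row))
```

Both rows work out to 2 GS/s. The x4 design has fewer latency cycles (332 vs. 597) but a slightly longer latency in time, 664 ns vs. 597 ns, because its clock is half as fast. These two rows show a 28% dynamic power reduction and a 23% area increase from x2 to x4; the 51% and 250% figures quoted in the bullets cannot be derived from them.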




References

• [1] Dejan Markovic, Robert Brodersen, "DSP Architecture Design Essentials", Springer, 2012.

• [2] Vladimir Stojanović, "Lecture 10, Fast Fourier Transform: VLSI Architectures", course materials for 6.973 Communication System Design, MIT OpenCourseWare (http://ocw.mit.edu/), Spring 2006.

• [3] Shousheng He and M. Torkelson, "A new approach to pipeline FFT processor," Proceedings of the 10th International Parallel Processing Symposium (IPPS '96), pp. 766-770, 1996.

• [4] E. Wold and Alvin M. Despain, "Pipeline and Parallel-Pipeline FFT Processors for VLSI Implementations," IEEE Trans. Computers, vol. 33, no. 5, pp. 414-426, 1984.
