v4 tech module dsp

Upload: dcesenther

Post on 02-Jun-2018

241 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/10/2019 V4 Tech Module DSP

    1/24

    Xilinx Advanced Products Division

    Digital SignalProcessingVersion 2.0

    January 2005

  • 8/10/2019 V4 Tech Module DSP

    2/24

    Virtex-4 DSP 2

    Agenda

    Introduction Background

    Virtex-4 Solutions

    Summary

  • 8/10/2019 V4 Tech Module DSP

    3/24

    Introduction

  • 8/10/2019 V4 Tech Module DSP

    4/24

    Virtex-4 DSP 4

    High-Speed DSP Challenges

    High performance digital communicationand video imaging designs challenge

    existing DSP solutions Need higher performance

    Need lower costs

    Need lower power

    Compromises are often made Performance is sacrificed

    Time is spent designing substituteimplementations

  • 8/10/2019 V4 Tech Module DSP

    5/24

    Virtex-4 DSP 5

    Achieve DSP Performance and

    Efficiency in Virtex-4

    Virtex-4 XtremeDSP

    Performance 512 XtremeDSP slices at 500MHz

    256 GMACs/s DSP bandwidth

    Low Power

    2.3mW/100MHz scalable power efficiency

    Value

    Operate the XtremeDSP slice in over 40 different modes

    Highest DSP bandwidth per dollar solution

  • 8/10/2019 V4 Tech Module DSP

    6/24

    Background

  • 8/10/2019 V4 Tech Module DSP

    7/24

    Virtex-4 DSP 7

    FPGAs Enable Massively Parallel

    DSP

    Data OutData Out

    MAC UnitMAC Unit

    CoefficientsCoefficients

    Programmable DSPProgrammable DSP -- SequentialSequential

    1 GHz1 GHz1 GHz

    256 clock cycles256 clock cycles256 clock cycles

    = 4 MSPS= 4 MSPS= 4 MSPS

    256 clock256 clock

    cyclescyclesneededneeded

    Data InData In

    XX

    ++

    RegReg

    500 MHz500 MHz500 MHz

    1 clock cycle1 clock cycle1 clock cycle

    = 500 MSPS= 500 MSPS= 500 MSPS

    Data OutData Out

    FPGAFPGA -- Fully Parallel ImplementationFully Parallel Implementation

    256 operations256 operations

    in 1 clock cyclein 1 clock cycle

    Data InData In

    XX

    ++

    C0C0 C0C0XXC1C1 XXC2C2 XXC3C3 XXC255C255

    the unprecedented signal processing requirements of nextthe unprecedented signal processing requirements of next--generation wirelessgeneration wireless

    devices threaten to outpace the capabil ities of DSP processors,devices threaten to outpace the capabili ties of DSP processors, creating opportunitiescreating opportunities

    for massively parallel and highly customized devices.for massively parallel and highly customized devices. BDTI, 2004BDTI, 2004

    Example 256 TAP Filter ImplementationExample 256 TAP Filter Implementation

    Reg

    Reg

    Reg

    Reg

    Reg

    Reg

    Reg

    Reg

  • 8/10/2019 V4 Tech Module DSP

    8/24

    Virtex-4 DSP 8

    Parallel Adder Tree Implementation

    Consumes FPGA resources

    Fabric and Routing MayFabric and Routing May

    Reduce PerformanceReduce Performance

    32 TAP filter implementation willconsume 1,461 logic cells toimplement adders in fabric

    Parallel Adder Tree ImplementationParallel Adder Tree Implementation

    Data InData In

    XX

    ++

    C0C0 C0C0XXC1C1 XXC2C2 XXC3C3

    ++++

    XXC4C4 C0C0XXC5C5 XXC6C6 XXC7C7 XXC30C30 XXC31C31

    ++++ ++++

    Data OutData Out

    ++

    ++Consumes Logic toConsumes Logic to

    Implement AddersImplement Adders VariableVariable

    LatencyLatency

    Reg

    Reg

    Reg

    Reg

    Reg

    Reg

    Reg

    Reg

    Reg

    Reg

    Reg

    Reg

    Reg

    Reg

    Reg

    Reg

    Reg

    Reg

  • 8/10/2019 V4 Tech Module DSP

    9/24

    Virtex-4 DSP 9

    Parallel Adder Cascade ImplementationParallel Adder Cascade Implementation

    Data InData In

    XX

    ++

    XX

    ++

    XX

    ++

    Data OutData Out

    XX

    ++

    XX

    ++

    XX

    ++

    XX

    ++

    XX

    ++

    XX

    ++

    Filters ImplementedFilters Implemented

    Entirely Within theEntirely Within the

    XtremeDSP SliceXtremeDSP SliceGuaranteed 500MHz PerformanceGuaranteed 500MHz Performance

    Regardless of Filter SizeRegardless of Filter Size

    Virtex-4 Parallel Implementation

    Consumes Zero Logic Resources

    32 TAP filter implementation using32 XtremeDSP Slices

    C0C0 C1C1 C2C2 C3C3 C5C5 C6C6 C7C7 C30C30 C31C31C4C4 XX

    ++

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Re

    g

    Reg

    Reg

    Reg

    Reg

    Reg

    Reg

    Reg

    Reg

    Reg

    Reg

    Reg

    Reg

    Reg

    Reg

    Reg

    Reg

    Reg

    Reg

  • 8/10/2019 V4 Tech Module DSP

    10/24

    Virtex-4 DSP 10

    Xilinx 4th Generation XtremeDSP

    00

    5050

    100100

    150150

    200200

    250250

    VirtexVirtex--EE VirtexVirtex--IIII VirtexVirtex--II ProII Pro VirtexVirtex--44

    DSP BandwidthDSP Bandwidth

    GMACs/sGMACs/s

    1st Generation

    11GMACs/s

    11stst GenerationGeneration

    11GMACs/s11GMACs/s

    2nd Generation

    32GMACs/s

    22ndnd GenerationGeneration

    32GMACs/s32GMACs/s

    3rd Generation

    111GMACs/s

    33rdrd GenerationGeneration

    111GMACs/s111GMACs/s

    4th Generation

    256GMACs/s

    44thth GenerationGeneration

    256GMACs/s256GMACs/s

    Virtex-4 XtremeDSP Highest DSP

    Bandwidth Available

    VirtexVirtex--4 XtremeDSP Highest DSP4 XtremeDSP Highest DSP

    Bandwidth AvailableBandwidth Available

  • 8/10/2019 V4 Tech Module DSP

    11/24

    Virtex-4Solutions

  • 8/10/2019 V4 Tech Module DSP

    12/24

    Virtex-4 DSP 12

    Full Custom Design Results in Higher

    PerformanceScalable 500MHz performance is impossible with

    Standard Cell libraries and Standard Cell design flow

    Scalable 500MHz performance is impossible withScalable 500MHz performance is impossible with

    Standard Cell libraries and Standard Cell design flowStandard Cell libraries and Standard Cell design flow

    Ar ithmatica Parallel Counter

    20% faster performance and

    uses less area

    Ar ithmaticaAr ithmatica Parallel CounterParallel Counter

    20% faster performance and20% faster performance and

    uses less areauses less area

    Integrated Cascade

    Routing enables

    scalable performance

    Integrated CascadeIntegrated Cascade

    Routing enablesRouting enables

    scalable performancescalable performance

    Ar ithmatica A+Adder

    20% faster than

    other implementations

    Ar ithmaticaAr ithmaticaA+AdderA+Adder

    20% faster than20% faster than

    other implementationsother implementations

    Pipeline Registers

    enable 500Mhzperformance

    Pipeline RegistersPipeline Registers

    enable 500Mhzenable 500Mhzperformanceperformance

    2X the performance

    of Virtex-II Pro

    2X the performance2X the performance

    of Virtexof Virtex--II ProII Pro

  • 8/10/2019 V4 Tech Module DSP

    13/24

    Virtex-4 DSP 13

    Wide Filters At Full SpeedWithin the Virtex-4 DSP Slice Column

    Systolic N-tap FIR

    Scalable N-level deep implementation

    500MHz performance at N-level deep

    Uses Integrated Pipeline Registers tosynchronize filter inputs

    Utilizes Input and Output Cascade Routing

    Build massively parallel 512-TAP FIR filter

    in a single device achieving

    256 GMACs/s performance

    Build massively parallel 512Build massively parallel 512--TAP FIR filterTAP FIR fi lter

    in a single device achievingin a single device achieving

    256 GMACs/s performance256 GMACs/s performance

    Equivalent implementation would consume

    444 Embedded Mult ipliers and 77,008 LCs

    and would only achieve half the performance

    Equivalent implementation would consumeEquivalent implementation would consume

    444 Embedded Multipliers and 77,008444 Embedded Mult ipliers and 77,008 LCsLCs

    and would only achieve half the performanceand would only achieve half the performance

  • 8/10/2019 V4 Tech Module DSP

    14/24

    Virtex-4 DSP 14

    Lowest Power DSP

    FrequencyFrequency

    (MHz)(MHz)

    Number ofNumber ofXtremeDSP SlicesXtremeDSP Slices

    Power (Power (mWmW))

    300300200200 400400 500500100100

    10001000

    400400

    200200

    512512

    300300

    100100200200

    800800

    00

    400400

    600600

    18x18 Multiply-Accumulate

    scalable power efficiency

    at 2.3mW/100MHz

    18x18 Multiply18x18 Multiply--AccumulateAccumulate

    scalable power efficiencyscalable power efficiency

    at 2.3at 2.3mmW/100MHzW/100MHz

    20 GMACs/s for .46 W20 GMACs/s for .46 W20 GMACs/s for .46 W

    60 GMACs/s for 1.38 W60 GMACs/s for 1.38 W60 GMACs/s for 1.38 W

    Note: Power efficiency achieved using the DSP48 component wit h a togg le rate of 38%.

    It is not an entire MAC with BRAM, control path sequencer/address generator in fabric, and including external routing.

  • 8/10/2019 V4 Tech Module DSP

    15/24

    Virtex-4 DSP 15

    35x18 Complex Multiply

    Imaginary Portion

    35x18 Complex Multiply35x18 Complex Multiply

    Imaginary PortionImaginary Portion

    35x18 Complex Multiply

    Real Portion

    35x18 Complex Multiply35x18 Complex Multiply

    Real PortionReal Portion

    High-Speed, Low Power

    Complex Multiply

    35x18 Complex Multiplyat 500MHz

    35x18 Complex Multiply35x18 Complex Multiplyat 500MHzat 500MHz

    Real and Imaginary

    35x18 Complex Multiply

    consumes only 92mW at 500MHz

    Real and ImaginaryReal and Imaginary

    35x18 Complex Multiply35x18 Complex Multiply

    consumes only 92mW at 500MHzconsumes only 92mW at 500MHz

    Complex filter implementation Register the inputs using minimal

    external resources Synchronize data using pipeline

    delay elements

  • 8/10/2019 V4 Tech Module DSP

    16/24

  • 8/10/2019 V4 Tech Module DSP

    17/24

    Virtex-4 DSP 17

    Virtex-4 DSP SolutionsChoose the Right Combination

    DSP SlicesDSP Slices

    MemoryMemory

    LogicLogic

    DCMsDCMs

    VirtexVirtex--4 LX4 LX

    Logic PlatformLogic Platform

    VirtexVirtex--4 SX4 SX

    DSP PlatformDSP Platform

    VirtexVirtex--4 FX4 FX

    Full Featured PlatformFull Featured Platform

    FeaturesFeatures

    256GMACs/s256GMACs/s

    DSP BandwidthDSP Bandwidth

    96GMACs/s96GMACs/s

    DSP BandwidthDSP Bandwidth

    48GMACs/s48GMACs/s

    DSP BandwidthDSP Bandwidth

  • 8/10/2019 V4 Tech Module DSP

    18/24

    Virtex-4 DSP 18

    Dynamically Programmable

    DSP Op Modes

    Enables time-division

    multiplexing for DSP Over 40 different modes

    Each XtremeDSP Slice

    individually controllable

    Change operation in a singleclock cycle

    Control functionality fromlogic, memory or processor

    6 5 4 3 2 1 0

    Zero 0 0 0 0 0 0 0 +/- Cin

    Hold P 0 0 0 0 0 1 0 +/- (P + Cin)

    A:B Sel ect 0 0 0 0 0 1 1 +/- (A:B + Cin )

    Multiply 0 0 0 0 1 0 1 +/- (A * B + Cin)

    C Select 0 0 0 1 1 0 0 +/- (C + Cin)

    Feedback Add 0 0 0 1 1 1 0 +/- (C + P + Cin)

    36-Bit Adder 0 0 0 1 1 1 1 +/- (A:B + C + Cin)

    P Cascade Select 0 0 1 0 0 0 0 PCIN +/- Cin

    P Cascade Feedback Add 0 0 1 0 0 1 0 PCIN +/- (P + Cin)

    P Cascade Add 0 0 1 0 0 1 1 PCIN +/- (A:B + Cin)

    P Cascade Multiply Add 0 0 1 0 1 0 1 PCIN +/- (A * B + Cin)

    P Cascade Add 0 0 1 1 1 0 0 PCIN +/- (C + Cin)

    P Cascade Feedback Add Ad 0 0 1 1 1 1 0 PCIN +/- (C + P + Cin)P Cascade Add Add 0 0 1 1 1 1 1 PCIN +/- (A:B + C + Cin)

    Hold P 0 1 0 0 0 0 0 P +/- Cin

    Double Feedback Add 0 1 0 0 0 1 0 P +/- (P + Cin)

    Feedback Add 0 1 0 0 0 1 1 P +/- (A:B + Cin)

    Multiply-Accumulate 0 1 0 0 1 0 1 P +/- (A * B + Cin)

    Feedback Add 0 1 0 1 1 0 0 P +/- (C + Cin)

    Double Feedback Add 0 1 0 1 1 1 0 P +/- (C + P + Cin)

    Feedback Add Add 0 1 0 1 1 1 1 P +/- (A:B + C + Cin)

    C Select 0 1 1 0 0 0 0 C +/- Cin

    Feedback Add 0 1 1 0 0 1 0 C +/- (P + Cin)

    36-Bit Adder 0 1 1 0 0 1 1 C +/- (A:B + Cin)

    Multiply-Add 0 1 1 0 1 0 1 C +/- (A * B + Cin)

    Double 0 1 1 1 1 0 0 C +/- (C + Cin)

    Double Add Feedback Add 0 1 1 1 1 1 0 C +/- (C + P + Cin)

    Double Add 0 1 1 1 1 1 1 C +/- (A:B + C + Cin)

    OpMode OutputXYZ

  • 8/10/2019 V4 Tech Module DSP

    19/24

    Virtex-4 DSP 19

    Virtex-4 XtremeDSP Slices Useful

    For More Than DSP

    6:1 high-speed, 36-bit Multiplexer

    Use four XtremeDSP Slice and op-modes 500 MHz performance using no programmable logic

    Save 1584 LCs to build equivalent function in logic

    Dynamic 18-bit Barrel Shifter Use two XtremeDSP slices

    Use dedicated cascade routing and integrated 17-bit shift Save 1449 LCs to build equivalent function in logic

    36-bit Loadable Counter Use a single XtremeDSP slice, achieve 500 MHz performance

    Save 540 LCs to build equivalent function in logic

  • 8/10/2019 V4 Tech Module DSP

    20/24

    Virtex-4 DSP 20

    DSP Design Services, Training & HotlineDSP Design Services, Training & Hotl ine Systems ExpertiseSystems Expertise

    FPGAs with DSP FunctionsFPGAs with DSP Functions Shortest Design TimeShortest Design Time Major DSP AlliancesMajor DSP Alliances

    Dedicated Field SpecialistsDedicated Field Specialists60+ Advanced DSP Cores60+ Advanced DSP Cores 60+ DSP Development Boards60+ DSP Development Boards

    256256 GMACsGMACs

    PerformancePerformance

    50+ Field DSP

    Experts

    Comprehensive Library

    Fast Turnaround

    Exceptional Performance

    Xilinx Design Services, Education and Support

    Lowest CostLowest Cost

    (90nm)(90nm)

    Distributor

    Services &

    Training

    DSP Division Experts

    Tools, IP Solutions

  • 8/10/2019 V4 Tech Module DSP

    21/24

    Virtex-4 DSP 21

    Xilinx FPGA DSP Design

    FlowFPGA Designer andFPGA Designer and

    System ArchitectSystem Architect

    Synthesize DesignSynthesize DesignSynthesize Design

    VHDL and Coregen OutputVHDL andVHDL and CoregenCoregenOutputOutput

    Specify DesignSpecify DesignSpecify DesignImplement In HardwareImplement In HardwareImplement In Hardware

    DSP Architectural WizardDSP Architectural Wizard

  • 8/10/2019 V4 Tech Module DSP

    22/24

    Summary

  • 8/10/2019 V4 Tech Module DSP

    23/24

    Virtex-4 DSP 23

    Virtex-4 XtremeDSP Enabling next generation high-performance DSP

    Highest Performance

    512 XtremeDSP slices at 500MHz

    256 GMACs/s DSP bandwidth

    Lowest Power

    2.3mW/100MHz Most Value

    Operate the XtremeDSP slice in over 40 different modes

    Highest DSP bandwidth per dollar solution available

  • 8/10/2019 V4 Tech Module DSP

    24/24