fpga prototyping

Upload: ramakrishnarao-soogoori

Post on 03-Apr-2018


  • 7/28/2019 FPGA prototyping

    FPGA prototyping of complex SoCs: Partitioning and Timing Closure Challenges with Solutions

    Vijay Kumar Kodavalla, Nitin Raverkar
    Wipro Technologies, Bangalore, India

    Abstract

    In the nanometer era, complex SoCs carry a higher risk of re-spins. FPGA prototyping is an effective way to validate an SoC before silicon, accelerate system software development and meet time-to-market demands. Today's EDA tools are not mature enough to tackle complex FPGA partitioning and timing closure issues effectively. For successful FPGA prototyping, design partitioning and timing closure need to be handled skillfully. This paper presents partitioning and timing closure challenges along with effective schemes to resolve them. It is backed by FPGA prototyping experience across various SoCs with logic gate counts of up to four million.

    1. Introduction

    FPGA prototyping is a viable solution to address growing SoC development complexities and associated risks. The key benefits of an FPGA prototype are:

    a. Concurrent software development and testing: Quick fine tuning of hardware/software partitioning, software development and comprehensive testing before actual silicon

    b. Comprehensive verification: Integrated hardware-software testing

    c. Field testing: In-system device validation in the end-application deployment scenario

    The objectives for an effective FPGA prototype are:

    a. System performance

    b. Optimal number of FPGAs

    c. Shorter turn-around cycle from bug-fixed RTL to FPGA bitmaps with consistent results

    Following are the limiting factors for achieving the objectives:

    a. System performance: Due to system software and interface requirements, the prototype is expected to run at a certain minimum frequency (e.g. 30-40 MHz for video processing chips)

    b. Available FPGA resources: Gates, Pins, memories, clocks and resets

    c. Unfrozen SoC RTL: Due to concurrent prototype development and RTL verification

    d. No SoC RTL customizations: RTL modifications are not desirable for FPGA prototype timing improvement

    Partitioning and timing closure are the major challenges in mitigating the effects of these limiting factors and meeting the objectives. Section 2 presents the limitations of today's FPGA prototyping methodology. Section 3 discusses PTC (Partitioning & Timing Closure) challenges and effective techniques to resolve them. Section 4 highlights the benefits obtained by applying PTC techniques to a sample complex SoC with four million logic gates, followed by conclusions in section 5.

    2. Limitations of current Prototyping Methods

    Even when the FPGA prototyping flow is followed rigorously using state-of-the-art EDA tools, the following limitations remain:

    a. Many FPGAs are required for SoC partitioning, leading to prototype system complexity

    b. Unable to do TDM (Time Division Multiplexing) pin assignment due to stringent timing requirements

    c. Unable to partition multiple clocks and reset trees

    d. No correlation between synthesis and P&R [place & route] timing results and critical paths

    e. Post P&R, routing delay is 4 to 9 times the logic delay and the frequency achieved is 3-4X lower than the target value

    f. Inter-FPGA timing not met with long combinational paths including board delays

    g. Inconsistent timing results even with minor RTL bug fixes and enhancements

    h. Route delay estimates of physical synthesis tools are inaccurate due to lack of knowledge of the target device's physical characteristics, leading to only 0.1X performance improvement

    Tactful planning and innovative PTC techniques need to be applied to handle these critical issues.

    3. Critical Issues and Solutions [PTC]

    The limitations of current prototyping methods listed in section 2 broadly fall into partitioning and timing closure categories. This section presents the critical issues in partitioning and timing closure and innovative ways to resolve them.

    Figure 1 shows a sample SoC block diagram highlighting the bus structure.

    Figure 1 SoC System level block diagram

    Knowledge of the SoC architecture is a must to get clarity on the internal bus structure and inter-module connectivity. The bus structure indicates a possible partitioning boundary, while the inter-module connectivity indicates the pin count requirements. Knowledge of module-level gate counts gives an idea of which modules can be combined and helps in deciding the type and number of FPGAs required.

    The critical issues and solutions of partitioning and timing closure are tightly coupled. The solutions discussed need to be applied with state-of-the-art flows and EDA tools.

    Challenge 1: Many FPGAs required

    Solution: Based on the application test requirements, determine whether all SoC modules are required concurrently on FPGA for validation. Usually not all the SoC modules need to be prototyped concurrently, so different SoC subsets can be formed. Build a "concurrency matrix" as shown in Figure 2, which helps in arriving at an optimal balance between the number of SoC subsets and the number of FPGAs required.

    Assume that an SoC has different modules such as a processor, DMA controller, memory controller and other data processing engines like M1, M2 & M3. Depending on the application test scenarios (A, B, C and D), modules that demand concurrent verification can be grouped together. The modules required for each test scenario are shaded in Figure 2.

    Figure 2 Concurrency Matrix

    In this concurrency matrix (Figure 2), M1-M3-M4 or M2-M3-M4 need to be validated concurrently. If the entire SoC is to be validated in one go, the total gate count is the sum of the individual module gate counts (M1+M2+M3+M4). In this example the SoC has been split into two subsets (M1-M3-M4 and M2-M3-M4). The worst-case gate count of these subsets determines the number of FPGAs required. In this case the gate count to be considered for FPGA validation is the maximum of M1+M3+M4 and M2+M3+M4.

    A lower number of FPGAs also reduces interconnect complexity. The RTL for the various subsets can easily be selected using the "`ifdef" construct in the top-level RTL file.
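
    The subset arithmetic above can be sketched in a few lines of Python. This is only an illustration: the per-module gate counts and the usable capacity per FPGA are hypothetical values, since the paper gives no per-module figures, and the sketch sizes by gate count alone, ignoring IO and utilization limits.

```python
# Gate counts in thousands of gates. These figures are hypothetical.
GATES_K = {"M1": 1200, "M2": 1000, "M3": 800, "M4": 1000}

# Concurrency matrix reduced to the two subsets named in the text.
SUBSETS = {
    "subset_1": ("M1", "M3", "M4"),
    "subset_2": ("M2", "M3", "M4"),
}

# Assumed usable capacity of one FPGA, in thousands of gates.
FPGA_CAPACITY_K = 1000


def gate_count(modules):
    """Sum the gate counts of the given modules."""
    return sum(GATES_K[m] for m in modules)


def fpgas_needed(gates_k, capacity_k=FPGA_CAPACITY_K):
    """Ceiling division: smallest number of FPGAs whose capacity covers gates_k."""
    return -(-gates_k // capacity_k)


full_soc_k = gate_count(GATES_K)  # M1+M2+M3+M4
worst_subset_k = max(gate_count(mods) for mods in SUBSETS.values())

print("full SoC:", full_soc_k, "k gates ->", fpgas_needed(full_soc_k), "FPGAs")
print("worst subset:", worst_subset_k, "k gates ->", fpgas_needed(worst_subset_k), "FPGAs")
```

    With these assumed numbers the board is sized for the larger subset rather than for the sum of all modules, which is the point of the concurrency matrix.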

    Challenge 2: Selecting appropriate FPGA

    Solution: The worst-case gate count, memories, multipliers, DLLs, number of IOs and IO standards of the derived subsets drive the FPGA selection. The logic gate capacity of the chosen FPGA should be at least 30-40% higher than the worst-case requirement, as the RTL might not have matured at the start of the prototype development cycle.

    Challenge 3: Large number of IOs

    Solution: Subset partitioning can start with knowledge of module-level area utilization, IO and clock requirements. EDA tools aid in performing interactive partitioning with "what if" analysis. Partitioning exposes large internal SoC buses and may demand more IOs than the FPGA has available. It is not always possible to do TDM of pins, as it brings down system speed. Before attempting TDM, apply logical solutions like:

    a. Common module logic distribution: Slice and place common modules (e.g. register block) that have many net connections to other modules such that their interconnections are reduced.

    To elaborate, Figure 3 shows the register block kept in a single FPGA and Figure 4 shows the sliced register block and its effect on interconnections.

    Figure 3 Common Register Block

    For example, assume that the SoC logic is divided into three FPGAs with the concurrency matrix technique. The SoC register block (Register Array) is attached to the processor through the processor bus. The register block outputs many configuration and control signals [v + n] to various SoC modules named M1, M2, Ma, Mb etc. Similarly, signals like status, interrupts and hand-shake signals [u + m] are inputs to the register block from various SoC modules.

    When placed in a single FPGA, the register block consumes [m + n + u + v] IOs, leaving almost no pins for the remaining module connections between FPGAs. The large IO requirement is resolved by slicing the register block. Each register block slice should be placed with its related modules (Figure 4). This has no adverse effect on timing, as most of the register block connections to other modules carry static signals (false or multi-cycle paths), e.g. configuration signals. Hence these signals are not timing critical.
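
    A rough way to see the pin saving is to count the signals that cross an FPGA boundary in each arrangement. The sketch below is a simplified model, not the authors' method: the signal counts, the FPGA assignments and the processor bus width are all hypothetical, and it assumes that after slicing only the processor bus has to reach each remote FPGA.

```python
# Hypothetical per-module data: FPGA assignment, config/control outputs from the
# register block ("cfg") and status/interrupt/handshake inputs to it ("status").
MODULES = {
    "M1": {"fpga": 1, "cfg": 120, "status": 40},
    "M2": {"fpga": 1, "cfg": 100, "status": 30},
    "Ma": {"fpga": 2, "cfg": 150, "status": 50},
    "Mb": {"fpga": 3, "cfg": 130, "status": 45},
}
REG_BLOCK_FPGA = 1       # assumed: the single register block sits with the processor
PROCESSOR_BUS_PINS = 80  # assumed processor bus width (address + data + control)


def monolithic_ios(modules, reg_fpga):
    """Pins used on the register-block FPGA by signals going to other FPGAs."""
    return sum(m["cfg"] + m["status"]
               for m in modules.values() if m["fpga"] != reg_fpga)


def sliced_ios(modules, reg_fpga, bus_pins):
    """With one slice per FPGA, only the processor bus crosses each boundary."""
    remote_fpgas = {m["fpga"] for m in modules.values()} - {reg_fpga}
    return bus_pins * len(remote_fpgas)


before = monolithic_ios(MODULES, REG_BLOCK_FPGA)
after = sliced_ios(MODULES, REG_BLOCK_FPGA, PROCESSOR_BUS_PINS)
print("single register block:", before, "pins; sliced:", after, "pins")
```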

    Figure 4 Distributed Register Block

    b. Functional based partitioning: If partitioning leads to multiple data buses coming out of various FPGAs and being multiplexed in one FPGA (e.g. a DMA controller with one channel active at a time, as shown in Figure 5), slice and place the multiplexers as shown in Figure 6.

    Figure 5 Partitioning Centralized Mux

    Figure 6 Partitioning Distributed Muxes

    Challenge 4: Partitioning Clock generator with multiple derived clocks

    Solution: The use of PLLs, dividers, multiplexers and synchronizers in the clock/reset generator of the SoC complicates partitioning. Though dividers and multiplexers can be mapped to FPGA, the delay on these derived clocks will be high and may vary from run to run. EDA tools are not able to perform IO timing analysis with respect to derived clocks. To get a common clock reference for all the FPGAs, place the clock generator in one of the FPGAs as shown in Figure 7, bring out the derived clocks and feed them to all the FPGAs as primary clocks. This also helps in getting correct IO offset timing analysis for derived clocks. If the number of clocks exceeds the available global clock lines, apply the following techniques:

    a. Check if any clock domains can be merged

    b. Assign high fan-out clocks to the dedicated clock trees in the device

    c. Assign relatively low fan-out clock nets to local low-skew lines in the device

    d. Convert gated clocks to clock enables of flops using advanced synthesis tools

    Figure 7 Clock Generator
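
    Techniques (b) and (c) for Challenge 4 amount to ranking clocks by fan-out and giving the dedicated clock trees to the largest ones. A minimal sketch of that ranking follows; the clock names, fan-out figures and number of global lines are hypothetical and not taken from the paper.

```python
def assign_clocks(fanout_by_clock, global_lines):
    """Give the highest fan-out clocks the global trees, the rest local low-skew lines."""
    ranked = sorted(fanout_by_clock, key=fanout_by_clock.get, reverse=True)
    return {"global": ranked[:global_lines], "local": ranked[global_lines:]}


# Hypothetical clocks and flop fan-outs for one FPGA.
clocks = {
    "clk_sys": 52000,
    "clk_uart": 300,
    "clk_ddr": 18000,
    "clk_i2c": 120,
    "clk_vid": 9000,
}

plan = assign_clocks(clocks, global_lines=3)
print(plan)
```

    In practice the result would still have to be expressed as placement or clock-buffer constraints in the chosen FPGA tool flow.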

    Challenge 5: Partitioning Reset generator

    Solution: The reset generator module has reset synchronization logic to synchronize reset to each clock domain. The best way of handling the reset generator module is to duplicate it in all FPGAs as shown in Figure 8. Also, if available, use dedicated low-skew routing resources or a device-wide dedicated reset resource.

    Figure 8 Reset Generator

    After finalizing the FPGA partitioning, the next steps are synthesis, place/route and timing closure. Analysis of the synthesis report helps in estimating the frequency that can be achieved after P&R. The maximum FPGA prototype frequency is achieved when the routing delay is brought down until it is almost equal to the logic delay.

    Challenge 6: Multiple iterations between Synthesis and P&R

    Solution: The maximum achievable FPGA prototype frequency and its limiting factors should be known upfront, before iterating between synthesis and P&R. In synthesis, meeting the final target frequency is a necessary but not sufficient condition, as the route delay estimates are inaccurate.

    Logic delay to be achieved in synthesis for a given target frequency = 0.5 * [(1 / Target frequency) - off-chip delay (if any) + Clock skew]

    The above equation is valid only when PTC techniques are applied. Current synthesis tools don't support constraining only logic delay. Hence manually check whether the required logic delay is met in synthesis for a given target frequency. If logic delay is not met in synthesis, the achievable post-P&R frequency can be estimated by using the above equation.
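
    The logic delay budget and its inverse can be worked through numerically. The sketch below assumes the gap before "off-chip delay" in the equation is a subtraction, takes the 27 MHz prototype target from section 4 as the input, and uses a hypothetical 20 ns synthesis logic delay for the reverse estimate; the function names are illustrative only.

```python
def logic_delay_budget_ns(target_mhz, off_chip_ns=0.0, skew_ns=0.0):
    """Logic delay to meet in synthesis: half of the adjusted clock period."""
    period_ns = 1000.0 / target_mhz
    return 0.5 * (period_ns - off_chip_ns + skew_ns)


def achievable_mhz(logic_delay_ns, off_chip_ns=0.0, skew_ns=0.0):
    """Invert the budget equation: frequency implied by a given logic delay."""
    period_ns = 2.0 * logic_delay_ns + off_chip_ns - skew_ns
    return 1000.0 / period_ns


budget = logic_delay_budget_ns(27.0)  # no off-chip delay or skew assumed
estimate = achievable_mhz(20.0)       # hypothetical synthesis logic delay of 20 ns

print(f"logic delay budget at 27 MHz: {budget:.2f} ns")
print(f"estimated frequency for 20 ns logic delay: {estimate:.1f} MHz")
```

    The factor of 0.5 reflects the premise that, with PTC techniques applied, routing delay is roughly equal to logic delay.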

    Synthesis tool features like register re-timing, logic replication and fan-out control can improve synthesis performance. Also keep the hierarchy intact in synthesis, which will help in P&R.

    Challenge 7: Post P&R routing delay is high and intra-FPGA timings are 3-4X lower

    Solutions: For complex designs with around 70% or more device utilization, it has been observed that post P&R routing delay is 80-90% of the overall delay. With these excessive routing delays, the final frequency achieved is 3-4X lower than the target.

    The reasons for large routing delays include congestion, fixed-position macros, paths traversing hierarchies and auto-placement inefficiencies. Register block partitioning, reset mapping, device macro location fixing, module-level floorplanning, "IOB Ring" pin locking and fan-out control are the techniques to control high routing delays.
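
    The 80-90% routing share can be related to the other figures in the paper with a little arithmetic. The sketch below assumes a path delay made up only of logic and routing delay, and assumes the goal stated earlier that routing delay is brought down to equal logic delay; under those assumptions the implied gains bracket the 3-4X figure.

```python
def routing_to_logic_ratio(routing_fraction):
    """Routing delay expressed as a multiple of logic delay."""
    return routing_fraction / (1.0 - routing_fraction)


def speedup_if_routing_matches_logic(routing_fraction):
    """Frequency gain when routing delay is reduced to equal logic delay.

    Before: period = logic / (1 - routing_fraction). After: period = 2 * logic.
    """
    return 1.0 / (2.0 * (1.0 - routing_fraction))


for fraction in (0.8, 0.9):
    print(f"routing share {fraction:.0%}: "
          f"routing = {routing_to_logic_ratio(fraction):.1f}x logic, "
          f"possible gain = {speedup_if_routing_matches_logic(fraction):.1f}x")
```

    A routing share of 80-90% corresponds to routing delay of about 4 to 9 times the logic delay, which matches limitation (e) in section 2.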

    IO pin locking, macro location fixing and module-level floorplanning techniques:

    a. FPGA pin-out fixing has a major impact on the internal routing delays. It is inadequate to assign pin-out based on the physical pin sequence in the BGA package. The FPGA IO ring, which is present on the periphery of the FPGA die, needs to be considered while assigning pin-out

    b. In floorplanning, proximity doesn't always guarantee good results, as the results depend on the routing structure of the device

    c. Draw the data flow diagram of the SoC with the memories that are used to terminate the data paths

    d. Interdependent units should be placed close together, avoiding criss-cross and diagonal routes

    e. Place the macros close to the interfacing unit and constrain the macro locations

    f. Units that are not timing critical need not be floorplanned, so that the P&R tool has flexibility in placing them

    g. Avoid overlapping regions and allow some free rows and columns between modules, which will aid inter-module routing

    Challenge 8: Post P&R inter-FPGA timing issues

    Solutions: To avoid long combinational paths between FPGAs, partitioning should always be on a register boundary. The solution for challenge 4 also ensures source synchronous inter-FPGA communication without sending the clock along with the data.

    While driving a clock out from the FPGA to off-chip devices like DDR memory, use the "clock forwarding" technique to match clock and data path delays. Figure 9 explains the clock forwarding technique using DDR IOs. With this technique the DDR data and clock paths experience an equal amount of delay in the IO.

    Figure 9 Clock Forwarding

    Even with enhanced and bug-fixed RTL, the PTC techniques ensure the best and most consistent results in every run.

    4. Experimental Results

    The example SoC design attributes are:

    a. 4M logic gates with 2M memory bits

    b. Targeted to run at 100 MHz

    c. Maximum number of logic levels from flop to flop: 55

    d. Number of clocks: 24; Gated clocks: 200

    The FPGA prototype frequency target is 27 MHz. Table 1 lists the results achieved by applying PTC techniques.

    Challenge | Results with standard flow and state-of-the-art EDA tools | Results/benefits with PTC techniques
    ----------|-----------------------------------------------------------|-------------------------------------
    4 million logic gate SoC partitioning | Number of FPGAs required = 5 (FPGA: 8M system gates with 1100 usable IOs) | Number of FPGAs required = 3 with two downloads. 40% reduction
    IO pins | IO pins required per FPGA = 1750 | IO pins required per FPGA = 950. 45% reduction
    Intra-FPGA timing | 12 MHz | 40 MHz. 3.33X improvement
    Inter-FPGA timing | 10 MHz | 30 MHz. 3X improvement

    Table 1 Experimental results
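
    The percentages and ratios in Table 1 follow directly from the before and after values. The short check below recomputes them; the helper names are illustrative, and the numbers are the ones reported in the table.

```python
def reduction_pct(before, after):
    """Percentage reduction from before to after."""
    return 100.0 * (before - after) / before


def improvement_x(before, after):
    """Improvement factor from before to after."""
    return after / before


print(f"FPGAs 5 -> 3: {reduction_pct(5, 3):.1f}% reduction")
print(f"IO pins 1750 -> 950: {reduction_pct(1750, 950):.1f}% reduction")
print(f"Intra-FPGA 12 -> 40 MHz: {improvement_x(12, 40):.2f}X")
print(f"Inter-FPGA 10 -> 30 MHz: {improvement_x(10, 30):.2f}X")
```

    Both achieved frequencies are above the 27 MHz prototype target, while the standard-flow results of 12 MHz and 10 MHz are below it.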

    5. Conclusion

    Partitioning and timing closure challenges in FPGA prototyping of a complex SoC need to be skillfully handled with PTC techniques at various stages of prototype development. Use of PTC techniques assures consistent results, which helps in reducing the FPGA prototype development time.

    We have demonstrated the results of FPGA prototyping using PTC techniques with minimal iterations and reduced cycle time. This paper will help in successfully meeting FPGA prototype objectives with predictable mapping and timing closure results.

    Multi-FPGA Implementation and Partitioning

    Overview

    The Certify software is the leading implementation and partitioning tool for ASIC designers who use FPGA-based prototypes to

    their designs. Certify provides a quick and easy method for partitioning large ASIC designs onto multi-FPGA prototyping boardincludes powerful features that make it easy to adapt to existing device flows; speeding the verification process and helping to

    time to market challenges.

    Key Features

    Includes easy to use graphical user interface (GUI) flow guide

    Allows automatic and/or manual partitioning

    Supports Synopsys Design Constraints for timing management

    Tightly integrated with Confirma hardware

    Supports multi-core parallel processing for faster runtimes

    Supports most leading FPGA devices

    Includes industry standard Synplify Premier synthesis engine

    Figure 1 Flow based graphical interface guides the user

    Design Implementation

    In order to prototype an ASIC design using FPGAs, certain design elements must be converted to structures that are recogniza

    FPGA implementation tools. These elements, such as ASIC gate-level components or gated-clock tree structures, can be very

    and time-consuming to edit manually. The Certify software automatically recognizes and converts these ASIC-specific constru

    equivalent FPGA structures.

    Partitioning

    Certify's automated mode partitions basic designs quickly with minimal user intervention by employing an intuitive, flow-driven

    graphical user interface (GUI). For more complex designs, this flow-driven GUI will guide the user through the partitioning proc

    provide utilities such as I/O pin multiplexing designed to reduce the number of I/O pins between FPGA partitions. Users can

    functional partition solutions quickly and use Certify's advanced features to optimize these solutions.

    Performance

    The Certify tool supports system timing constraints, defined in industry standard Synopsys Design Constraint (SDC) format - e

    that the overall ASIC timing is matched in the multi-FPGA implementation. The Certify software can also provide a timing repo

    outlining the possible performance of the prototype prior to programming the hardware. With Certify, users are assured that the

    constraints for the ASIC are achieved by the equivalent multi-FPGA prototyping implementation.

    Confirma Flow Integration

    Certify is tightly integrated into the Confirma Rapid Prototyping Platform - the complete ASIC verification hardware and softwar

    solution. Board descriptions for HAPS High-performance ASIC Prototyping Systems are built into the Certify tool allowing imme

    productivity with almost no set-up time. Certify software assures optimum performance because it automatically takes advanta

    HAPS signals to provide high speed time domain multiplexing which ensures the fastest available connections between FPGA

    Certify uses the world-leading FPGA synthesis engine, Synplify Premier, to achieve the best possible mapping to the target FP

    The Synplify Premier tool's integration with the Identify Pro Visibility Debugging and Enhancement tool offers advanced debug capabilities to monitor signals in critical areas of a design.

    Figure 2 Certify is the key to Multi-FPGA Implementation, a part of the Confirma Rapid Prototy

    Plus Solution