    An evaluation of methods for FPGA

    implementation from a Matlab description

    Using a Matlab model as the origin, an evaluation of VHDL generation using

    AccelDSP and a method for comparing implementations on FPGA and PowerPC.

    Kristofer Lindberg

    Kristofer Nissbrandt

    Master's Degree Project

    Stockholm, Sweden 2008

    XR-EE-SB 2008:014


    Public

    REPORT

    Prepared (also subject responsible if other): Kristofer Lindberg, Kristofer Nissbrandt
    No.: SMW/DD/G-08:070 En
    Approved: DD/GFC (Anette Hansson)
    Checked:
    Date: 2008-09-18
    Rev: A
    Reference:

    An evaluation of FPGA implementations from a Matlab

    description

    Using a Matlab model as the origin, an evaluation of VHDL generation using

    AccelDSP and a method for comparing implementations on FPGA and

    PowerPC.

    A Master's Thesis by:

    Kristofer Lindberg

    Kristofer Nissbrandt

    Supervisor:

    Anders Bergker, Department of Data and Signal Processing, Saab

    Microwave Systems


    Abstract

    Algorithms for signal processing and radar control at Saab Microwave Systems (SMW) are implemented on a platform consisting of several boards equipped with programmable logic, in the form of Field Programmable Gate Array (FPGA) devices and PowerPC processors. Different boards are responsible for different parts of the processing, which is performed in parallel on the incoming video data from the antenna.

    It is hard to know whether a certain algorithm should be implemented on an FPGA or a PowerPC to obtain the best realisation. Today, implementations targeting FPGAs are written in the Hardware Description Language VHDL, which leads to a very long implementation time. This is especially a problem for rapid prototyping, used by SMW to develop new functionality in their radars. Generating VHDL automatically from a high-level description would reduce the implementation time.

    In this thesis Xilinx AccelDSP, a software tool for generating VHDL from a high-level MathWorks Matlab description, has been evaluated. A method for deciding whether an algorithm should be implemented on an FPGA or a PowerPC has also been developed. The method is based on profiling of implementations on the two platforms, made with code generated from the same Matlab description. VHDL is generated using AccelDSP and C-Code using MathWorks Real Time Workshop Embedded Coder (EMLC).

    Five different aspects of AccelDSP, and how it can be used by SMW, have been evaluated. The result is that the main purpose of AccelDSP at SMW is to be used in the method for platform decision. The software can also be used for generating parts of a design, but this is not recommended: the drawbacks of VHDL generated by AccelDSP outweigh the reduced design time. The main problems are the performance, the reliability of the description, the readability and the difficulty of maintaining the source.

    The method for platform decision gives an estimate of the execution time and the resource usage on both platforms for a specific design, and includes a discussion of other metrics important for the decision. The estimated metrics are accurate enough to give an indication of the resulting performance, usable for deciding the implementation platform.


    Contents

    List of acronyms 7

    Terminology 11

    1 Introduction 13

    1.1 Hardware implementation . . . . . . . . . . . . . . . . . . . . . . 14

    1.2 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . 15

    1.2.1 Problem statement . . . . . . . . . . . . . . . . . . . . . . 15

    1.3 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

    1.3.1 Xilinx System Generator . . . . . . . . . . . . . . . . . . . 15

    1.3.2 Altera DSP Builder . . . . . . . . . . . . . . . . . . . . . 16

    1.3.3 MathWorks Simulink HDL Coder . . . . . . . . . . . . . . 16

    1.3.4 MathWorks Filter Design HDL Coder . . . . . . . . . . . 16

    1.3.5 The HPEC Challenge Benchmark Suite . . . . . . . . . . 16

    1.3.6 Implementation using Xilinx System Generator . . . . . . 16

    1.3.7 Matlab to C-Code Translation . . . . . . . . . . . . . . . 17

    2 Background 19

    2.1 FPGA Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 19

    2.1.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . 19

    2.1.2 Architecture description . . . . . . . . . . . . . . . . . . . 19

    2.2 FPGA development using VHDL . . . . . . . . . . . . . . . . . . 22

    2.2.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . 22

    2.2.2 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 24

    2.2.3 Intellectual Properties . . . . . . . . . . . . . . . . . . . . 24

    2.3 Processor Architecture . . . . . . . . . . . . . . . . . . . . . . . . 25

    2.3.1 Fundamental design . . . . . . . . . . . . . . . . . . . . . 25

    2.3.2 Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

    2.3.3 Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

    2.3.4 PowerPC . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

    2.4 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

    2.4.1 Benchmarking fundamentals . . . . . . . . . . . . . . . . . 28


    2.4.2 Processor benchmarking . . . . . . . . . . . . . . . . . . . 29

    2.4.3 FPGA benchmarking . . . . . . . . . . . . . . . . . . . . 32

    2.4.4 Digital signal processing benchmarking . . . . . . . . . . 34

    2.5 Xilinx AccelDSP . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

    2.5.1 Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

    2.5.2 AccelWare . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

    2.5.3 Original compiler implementation . . . . . . . . . . . . . . 39

    2.6 MathWorks Matlab . . . . . . . . . . . . . . . . . . . . . . . . . . 41

    2.6.1 Native code generation . . . . . . . . . . . . . . . . . . . . 41

    3 Method 43

    3.1 VHDL generation and simulation . . . . . . . . . . . . . . . . . . 43

    3.1.1 VHDL code generation . . . . . . . . . . . . . . . . . . . . 44

    3.1.2 Verification . . . . . . . . . . . . . . . . . . . . . . . . . . 46

    3.1.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 46

    3.2 C-Code generation and simulation . . . . . . . . . . . . . . . . . 49

    3.2.1 C-Code generation . . . . . . . . . . . . . . . . . . . . . . 49

    3.2.2 Compilation . . . . . . . . . . . . . . . . . . . . . . . . . . 49

    3.2.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 51

    3.3 Platform decision . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

    3.3.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

    4 Algorithms 55

    4.1 FFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

    4.1.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . 56

    4.1.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 56

    4.2 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

    4.2.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . 58

    4.2.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 58

    4.3 Matrix multiplication . . . . . . . . . . . . . . . . . . . . . . . . . 59

    4.3.1 Matlab implementation . . . . . . . . . . . . . . . . . . . 59

    4.3.2 VHDL reference design . . . . . . . . . . . . . . . . . . . 60

    4.3.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 61

    4.4 FIR filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

    4.4.1 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 62

    4.5 CORDIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

    4.5.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . 63

    4.5.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 63

    4.6 CFAR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

    4.6.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . 64

    4.6.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 66

    4.7 MUSIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67


    4.7.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . 67

    4.7.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 67

    4.8 Cartesian to polar map transformation . . . . . . . . . . . . . . . 68

    4.8.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . 68

    4.8.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 69

    5 Results 71

    5.1 VHDL generation using AccelDSP . . . . . . . . . . . . . . . . . 71

    5.1.1 FFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

    5.1.2 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . 73

    5.1.3 Matrix multiplication . . . . . . . . . . . . . . . . . . . . 75

    5.1.4 FIR-filter . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

    5.1.5 CORDIC . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

    5.1.6 CFAR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

    5.1.7 MUSIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

    5.1.8 Cartesian to polar map transformation . . . . . . . . . . . 87

    5.2 C-Code generation using EMLC . . . . . . . . . . . . . . . . . . 89

    5.2.1 FFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

    5.2.2 MUSIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

    5.2.3 Cartesian to polar map transformation . . . . . . . . . . . 91

    6 Discussion 93

    6.1 Evaluation of AccelDSP . . . . . . . . . . . . . . . . . . . . . . . 93

    6.1.1 Usability . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

    6.1.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . 97

    6.1.3 Rapid prototyping . . . . . . . . . . . . . . . . . . . . . . 99

    6.1.4 Technology independence . . . . . . . . . . . . . . . . . . 100

    6.1.5 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

    6.2 Comparing a processor to a FPGA . . . . . . . . . . . . . . . . . 103

    6.2.1 Speed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

    6.2.2 Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . 103

    6.2.3 Resource usage . . . . . . . . . . . . . . . . . . . . . . . . 103

    6.2.4 Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

    6.2.5 Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

    6.2.6 I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

    6.3 Implementation decision based on profiling . . . . . . . . . . . . 106

    6.3.1 Implementation decision . . . . . . . . . . . . . . . . . . . 106

    6.3.2 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

    6.3.3 Usability . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

    6.4 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

    6.4.1 Evaluation of AccelDSP . . . . . . . . . . . . . . . . . . . 110

    6.4.2 Implementation decision . . . . . . . . . . . . . . . . . . . 110


    6.4.3 Evaluation of other tools . . . . . . . . . . . . . . . . . . . 110

    7 Conclusions 111

    7.1 VHDL generation using AccelDSP . . . . . . . . . . . . . . . . . 111

    7.1.1 Evaluation of algorithms for AccelDSP . . . . . . . . . . . 112

    7.1.2 Evaluation of AccelDSP . . . . . . . . . . . . . . . . . . . 112

    7.2 Implementation decision based on profiling . . . . . . . . . . . . 113

    Appendices 117

    A Programs used 119

    B Performance data from PSIM 121

    C Makefile for GCC and PSIM 125

    D Cartesian to polar map transformation 129

    D.1 CPU implementation . . . . . . . . . . . . . . . . . . . . . . . . . 129

    D.2 FPGA implementation . . . . . . . . . . . . . . . . . . . . . . . . 131

    E VHDL Constraints 135


    List of acronyms

    Arithmetical Logic Unit (ALU) See, p. 26.

    American National Standards Institute (ANSI) See, p. 16.

    American Standard Code for Information Interchange (ASCII) See, p. 38.

    Application Specific Integrated Circuit (ASIC) See, p. 19.

    Abstract Syntax Tree (AST) See, p. 39.

    Computer-Aided Design (CAD) See, p. 20.

    Constant False Alarm Rate (CFAR) See, p. 64.

    Configurable Logical Block (CLB) See, p. 20.

    Component Object Model (COM) See, p. 95.

    COordinate Rotation DIgital Computer (CORDIC) See, p. 44.

    Cycles Per Instruction (CPI) See, p. 53.

    Central Processing Unit (CPU) See, p. 11.

    Defence Advanced Research Projects Agency (DARPA) See, p. 16.

    Digital Clock Management (DCM) See, p. 21.

    Discrete Fourier Transform (DFT) See, p. 55.

    Digital Signal Processing (DSP) See, p. 12.

    Digital Signal Processor (DSP) See, p. 109.

    Xilinx Virtex-5 DSP slices (DSP48E) See, p. 21.

    Executable and Linkable Format (ELF) See, p. 50.

    Embedded Matlab Subset (EML) See, p. 41.

    Real Time Workshop Embedded Coder (EMLC) See, p. 41.


    Fast Fourier Transform (FFT) See, p. 55.

    Finite Impulse Response (FIR) See, p. 13.

    Field Programmable Gate Array (FPGA) See, p. 11.

    Front Side Bus (FSB) See, p. 104.

    GNU Compiler Collection (GCC) See, p. 50.

    GNU Debugger (GDB) See, p. 51.

    Graphical User Interface (GUI) See, p. 95.

    Institute of Electrical and Electronics Engineers, Inc (IEEE) See, p. 16.

    Infinite Impulse Response (IIR) See, p. 40.

    Input/Output (I/O) See, p. 20.

    Instructions Per Cycle (IPC) See, p. 25.

    Look-Up Table (LUT) See, p. 20.

    Matlab Code (M-Code) See, p. 41.

    Multiply Accumulate (MAC) See, p. 21.

    Mapping (MAP) See, p. 23.

    Mega Samples Per Second (MSPS) See, p. 72.

    Multiple Signal Classification Method (MUSIC) See, p. 67.

    Operating System (OS) See, p. 31.

    Place And Route (PAR) See, p. 24.

    Program Counter (PC) See, p. 25.

    Programmable Logic Device (PLD) See, p. 19.

    Phase-Locked Loop (PLL) See, p. 20.

    Plane Position Indicator (PPI) See, p. 13.

    Random Access Memory (RAM) See, p. 21.

    Reduced Instruction Set Computer (RISC) See, p. 27.

    Read Only Memory (ROM) See, p. 24.

    Register Transfer Level (RTL) See, p. 22.


    Synthetic Aperture Radar (SAR) See, p. 16.

    Saab Microwave Systems (SMW) See, p. 13.

    Static Random Access Memory (SRAM) See, p. 20.

    VERIfy LOGic (VERILOG) See, p. 14.

    VHSIC Hardware Description Language (VHDL) See, p. 11.

    Very High Speed Integrated Circuit (VHSIC).

    Wide Sense Stationary (WSS) See, p. 58.


    Terminology

    AccelDSP AccelDSP is a software tool from Xilinx for translating floating-point

    M-Code to fixed-point VHDL, p. 35.

    AccelWare AccelWare is a library of functions that can be used in AccelDSP

    to replace generic Matlab functions with functions optimized for

    generating VHDL, p. 39.

    bitstream Information used for programming an FPGA, generated after PAR in the VHDL workflow, p. 22.

    CORE Generator The Xilinx software for distributing IP-Blocks, p. 24.

    execution unit A unit in the data path on a CPU responsible for a certain

    type of operations, p. 25.

    IP-Block An intellectual property block is an algorithm implementation provided by an FPGA manufacturer under a licence term, p. 24.

    ISE Xilinx ISE Design suite, a suite containing all programs needed to

    create FPGA designs, p. 12.

    Matlab A high-level technical computation environment from MathWorks,

    p. 41.

    netlist A technology dependent description of a design generated by the

    synthesis step in the VHDL workflow, p. 23.

    PowerPC Processor with RISC architecture created by Apple, IBM and Mo-

    torola in 1991 based on the POWER architecture created by IBM,

    p. 27.

    profiler A software tool used for analysing the performance of a program by

    running it in a simulated environment, p. 31.

    PSIM A PowerPC simulator for performance evaluation and simulations

    integrated with GDB, p. 51.


    Simulink An environment, supported by a graphical user interface, for simu-

    lation and model-based design of embedded and dynamic systems, from MathWorks, p. 15.

    synthesize A design process stage where VHDL is synthesized into a technology-dependent netlist, p. 23.

    video data Radar information after the early signal processing stages before

    target identification, p. 64.

    xc5vsx95t-f1136-3 A Xilinx Virtex-5 SXT FPGA device optimized for DSP

    and memory intensive applications, p. 46.

    XST Xilinx Synthesis Tool, a free synthesis tool included in ISE, p. 46.


    Chapter 1

    Introduction

    Saab Microwave Systems (SMW) manufactures radar and other sensor systems

    for air, ground and sea, primarily for military but also for civil security domains.

    The goal is to provide their customers with information superiority, which gives them the ability to observe and react to situations more accurately and rapidly than their opponents.

    The signal processing platform is, together with the antenna and the components for generating, transmitting and receiving microwaves, the core functionality of a radar. The signal processing platform modulates and directs the generated radar pulses, commonly called beamforming, and processes the received signals. The processing extracts target identification, including velocity, course, bearing and distance, as well as classification of targets, for example hovering helicopters. SMW can provide complete radar systems including cabin, operator interface and support systems.

    Identified targets are transferred through an Ethernet interface to a system

    for data processing and presentation. The most common way to present targets

    is to use a Plane Position Indicator (PPI), optionally projected on a map.

    The platform for signal processing consists of several boards equipped with programmable logic, in the form of Field Programmable Gate Array (FPGA) devices and PowerPC processors. Different boards are responsible for different parts of the processing, which is performed in parallel on the incoming data. The data stream for the surface radars with 18 antenna lobes consists of 18 channels with complex data sampled at a rate of 3.125 MHz. This corresponds to a sequential rate of 112.5 million words per second.
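    The stated word rate follows from the channel count and sample rate if each complex sample is counted as two words, one for the real and one for the imaginary part; this factor of two is an interpretation of the figures above rather than something stated explicitly:

        \[ 18~\text{channels} \times 3.125~\text{Msamples/s} \times 2~\tfrac{\text{words}}{\text{sample}} = 112.5~\text{Mwords/s} \]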

    When new or improved functionality is added, algorithms are realised in the same way as related algorithms in former projects, without knowing whether it could be done more efficiently. For example, if a FIR filter with 16 taps was described in VHDL for implementation in an FPGA during one project, and a filter with 32 taps is needed in a following project, the realisation will be done using


    the same method. The reason is that it is very hard to determine which realisation is the best.

    It might, for instance, be more suitable to write the FIR filter in the previous example in C-Code for implementation on a processor, or to generate VHDL from a Matlab model for implementation on an FPGA.
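    To make the example concrete, a direct-form FIR filter is only a few lines in C. This is a minimal sketch for illustration only; the function name, data type and buffer convention are assumptions and do not reflect SMW's actual implementations.

        #include <stddef.h>

        #define NUM_TAPS 16   /* illustrative tap count, matching the example above */

        /* Direct-form FIR filter: y[n] = sum over k of h[k] * x[n + NUM_TAPS - 1 - k].
         * The input buffer x must hold len + NUM_TAPS - 1 samples; its first
         * NUM_TAPS - 1 samples act as filter history. */
        static void fir_filter(const float h[NUM_TAPS],
                               const float *x, float *y, size_t len)
        {
            for (size_t n = 0; n < len; n++) {
                float acc = 0.0f;
                for (size_t k = 0; k < NUM_TAPS; k++) {
                    /* each term is one multiply-accumulate (MAC) */
                    acc += h[k] * x[n + (NUM_TAPS - 1) - k];
                }
                y[n] = acc;
            }
        }

    On a processor the inner loop maps onto MAC instructions, while an FPGA realisation can unroll the taps into parallel multipliers; the platform decision method developed in this thesis is intended to quantify exactly this kind of trade-off.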

    To develop a method for making platform decisions, and to reduce the implementation time for FPGA targets, SMW became interested in the software AccelDSP. This interest in AccelDSP, which can generate VHDL from Matlab, is the origin of this thesis.

    1.1 Hardware implementation

    An algorithm can be implemented in many ways but the thesis is restricted

    to implementations on FPGA and PowerPC. The PowerPC is an example of

    a processor platform and the implementation procedure is similar if another

    processor is of interest.

    The natural way to make the implementations is to write code for the proces-

    sor in a native programming language like C/C++ or Ada. The FPGA imple-

    mentation is made in a hardware description language like VHDL or VERILOG.

    The drawback of using native languages is the very long implementation time. If the implementation can be generated from a description in a high-level language, where the algorithm can be implemented quickly, a lot of time can be saved.

    Figure 1.1 shows how an algorithm described in Matlab can be implemented

    using translations between different languages.

    Figure 1.1: A Matlab model can be translated along these paths for implementation on a processor or an FPGA (Matlab to C/C++ for a processor, or Matlab to VHDL/Verilog for an FPGA).


    1.2 Problem definition

    The aim of the thesis is to develop a procedure for making better decisions about realisations of algorithms for signal processing and radar control than is possible at SMW today. The procedure should assist in deciding the implementation platform for a certain algorithm and whether the implementation should be generated or hand-coded.

    A Matlab design of the algorithm is the origin for evaluation of implemen-

    tations on PowerPC and FPGA.

    For generating VHDL from Matlab, AccelDSP has been evaluated, and for generating C-Code, Real Time Workshop Embedded Coder (EMLC) has been used. The evaluation of AccelDSP is a requirement from SMW, and EMLC is used because it has been evaluated by Marcus Mulleger and an evaluation license was provided by MathWorks. The work by Mulleger, see section 1.3.7, is used as a reference for the performance of C-Code generation and for the evaluation of the functionality of different translators. Only AccelDSP has been evaluated in this thesis and no further evaluation of EMLC has been performed.

    1.2.1 Problem statement

    1. Is AccelDSP useful for SMW and which purposes can it fulfil?

    2. How can an algorithm be evaluated to see if it is suitable for VHDL generation from Matlab using AccelDSP, or if it is better to describe the algorithm directly in VHDL?

    3. How can profiling on both platforms, with code generated from Matlab, be used to determine whether an algorithm should be implemented on a PowerPC or an FPGA?

    1.3 Related work

    This section describes software tools and work related to this thesis.

    1.3.1 Xilinx System Generator

    This software is closely related to AccelDSP since it is used for generating FPGA

    implementations from Matlab. The difference is that it generates technology-dependent netlists from Simulink and not VHDL from Matlab Code (M-Code).

    This software is therefore unsuitable for SMW. The product contains a block

    library for Simulink for modelling DSP algorithms suitable for FPGA imple-

    mentation and a tool for generating the implementation.


    1.3.2 Altera DSP Builder

    This is the equivalent of Xilinx System Generator but from the FPGA manufacturer Altera. In this program it is possible to generate VHDL files, which is not possible in System Generator. The program is designed for Altera devices and is therefore not useful for SMW.

    1.3.3 MathWorks Simulink HDL Coder

    Simulink HDL Coder is MathWorks' equivalent of Xilinx System Generator and Altera DSP Builder. It generates VHDL and VERILOG files, which are stated to be target independent and compliant with the IEEE 1076 standard, from MathWorks Simulink. Matlab code from the Embedded Matlab Subset (EML) can also be used for generation through a special block in Simulink. An evaluation of this program is considered future work.

    1.3.4 MathWorks Filter Design HDL Coder

    This is a specialised tool for generating VHDL and VERILOG code for fixed-

    point filters designed with Matlab Filter Design Toolbox. The code is stated to

    be compliant with the IEEE 1076 standard.

    1.3.5 The HPEC Challenge Benchmark Suite

    The Lincoln Laboratory at the Massachusetts Institute of Technology has developed an advanced benchmarking suite for radar systems in ANSI C, sponsored by DARPA. The suite contains both kernel benchmarks, including signal, image, information and knowledge processing algorithms, and a SAR system benchmark. The latter is available as M-Code.

    1.3.6 Implementation using Xilinx System Generator

    Xilinx System Generator has been evaluated in a master's thesis at SMW, at the time called Ericsson Microwave Systems, in cooperation with Linköping University of Technology in 2004 [5].

    The evaluation is made using two complex algorithms, not a kernel set of algorithms as in this thesis. The two algorithms are a baseband modulator and an algorithm for automatic gain control. The resource utilisation is compared to a hand-coded implementation for the modulator but not for the gain control. The timing performance is not compared with a reference design but is verified to fulfil the requirements set by SMW. This is possible since SMW uses implementations of both algorithms and therefore has design specifications to compare with.


    The result of the thesis is that Xilinx System Generator can be used to

    effectively implement a model of a digital function in an FPGA from Xilinx.

    1.3.7 Matlab to C-Code Translation

    A functional and performance evaluation of C-Code generation from Matlab has

    been made in a master's thesis by Marcus Mulleger for Ericsson AB in cooperation

    with Halmstad University [17].

    In detail, two software tools for translating M-Code to C-Code have been evaluated: EMLC, used in this thesis for C-Code generation, and Matlab to C Synthesis (MCS). The thesis compares the two translators, and also compares their performance with hand-coded implementations. The following three aspects are studied:

    1. Generation of reference code

    2. Target code generation

    3. Floating to fixed-point conversion

    The benchmarking technique is a combination of using both simple and complex algorithms, and different algorithms are used for evaluating different aspects. The result is that MCS provides better support for C algorithm reference generation, by covering a larger set of the Matlab language, while EMLC is more suitable for direct target implementation. EMLC only allocates memory statically, which makes the supported subset of the M-Code language smaller.


    Chapter 2

    Background

    This chapter contains background information for a better understanding of the report. Only the sections covering topics the reader does not already know need to be read, since the rest of the report refers back to this chapter wherever background information is needed. The chapter contains information about FPGA and processor architecture, an introduction to benchmarking and descriptions of the two software packages AccelDSP and Matlab.

    2.1 FPGA Architecture

    This section contains a brief introduction to FPGA architecture needed to un-

    derstand the performance metrics discussed in section 2.4.3 and the major dif-

    ferences between processor and FPGA architecture.

    2.1.1 Applications

    An FPGA has many advantages, compared to other platforms, since it allows for a fast and programmable implementation. The main drawbacks are the high price and the complicated implementation procedure compared to a CPU implementation. FPGAs are used for small batches, for rapid prototyping when the design of an Application Specific Integrated Circuit (ASIC) would be too time-consuming, and when a reprogrammable hardware implementation is desired. Small batches will normally lead to a higher price per chip for ASICs than for FPGAs.

    2.1.2 Architecture description

    A Field Programmable Gate Array (FPGA) contains programmable logic and

    is thus a Programmable Logic Device (PLD). The logic is contained in blocks

    which can be programmed to perform basic logical operations as well as com-

    plex mathematical functions and the blocks can be connected by programmable


    interconnections. FPGAs have memory elements, such as flip-flops, in their blocks for performing synchronous operations.

    The logical blocks are arranged in a two-dimensional array, with interconnections for connecting logical blocks to each other and to I/O pads. There are different ways of creating a programmable connection matrix. The most common are SRAM, which is reprogrammable, and antifuse, which is not.

    As explained above, the FPGA contains a predefined structure of logic, which limits the possible designs and the performance, such as timing and power usage, compared to ASICs. Most FPGA architectures can be reprogrammed in circuit; this ability is where the term Field Programmable comes from. Using an FPGA instead of an ASIC makes it possible to update the hardware in end-user products for patches or upgrades.

    Since an FPGA contains configurable logic, designed using a CAD program, it is possible to design parallel structures optimised for a given task. The only limitation on the work that can be executed in parallel is the number of I/O pads and logic blocks. This should be compared to a processor with a fixed hardware layout, capable of sequentially executing instructions. Some non-deterministic parallelism is introduced in modern processor architectures using advanced out-of-order execution, but this is far more limited than the parallelism that can be implemented in hardware.

    For running different parts of the device at different clock frequencies, most FPGAs contain several Phase-Locked Loops (PLL). A PLL is a control system used to generate a signal with a higher or lower frequency than a reference signal while keeping a fixed phase relation to it. It uses a negative feedback loop to control an oscillator so that it maintains a constant phase angle relative to the reference signal.

    2.1.2.1 XILINX Virtex5

    Xilinx uses a Look-Up Table (LUT) based design for the logical blocks. A LUT is a small one-bit-wide memory which can be used as a logical function, where the address lines are the inputs and the one-bit memory output is the function value. The logic is realised by properly programming the 2^k bits of the memory, where k is the number of inputs.
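    As a software illustration of this, assuming a 4-input LUT and packing its 2^4 = 16 configuration bits into one word (the configuration value and packing order below are illustrative, not a real device format):

        #include <stdint.h>

        /* Software model of a 4-input LUT: 2^4 = 16 configuration bits packed
         * into a 16-bit word; bit i holds the output for input pattern i. */
        typedef struct {
            uint16_t config;
        } lut4_t;

        /* Evaluate the LUT: the four inputs form the address 0..15. */
        static int lut4_eval(const lut4_t *lut, int a, int b, int c, int d)
        {
            unsigned addr = (unsigned)(a & 1)
                          | ((unsigned)(b & 1) << 1)
                          | ((unsigned)(c & 1) << 2)
                          | ((unsigned)(d & 1) << 3);
            return (lut->config >> addr) & 1;
        }

        /* Example: programming 0x8888 sets exactly the bits where both a and b
         * are 1, so the LUT computes a AND b (c and d are ignored). */
        static const lut4_t lut_and_ab = { 0x8888 };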

    Each logical block, by Xilinx called a Configurable Logical Block (CLB), contains four slices, and each slice contains LUTs, flip-flops and logic for, for example, carry-chain operations. The interconnections between CLBs consist of segments of different lengths, spanning from one CLB pair to the entire length of the chip. For routing a signal between two CLBs the signal must pass through at least one switch. The switch is usually an SRAM cell. The performance depends on

    how the CAD tools manage to route the signals.


    The number of slices, their components, as well as the design of the supporting logic and the interconnection matrix, differ between devices.

    For increased performance, specialised blocks are added to the FPGA architecture by many manufacturers. These can be specialised versions of slices or blocks supporting advanced operations. The Virtex5 series has two types of slices, SLICEL and SLICEM, the latter slightly more complex, as well as on-chip 36 Kbit BlockRAM and specialised DSP slices. A block diagram of the slices can be found in [27, pp. 173-174].

    All slices have a 6-input LUT as logic function generator, a carry chain for implementation of fast adder/subtractor structures, a multiplexer and a flip-flop. The SLICEM has, instead of ordinary LUTs, primitives which can be implemented as distributed RAM, shift registers or a LUT.

    Since FPGAs are often used for signal processing, the Virtex5 series has

    built-in Xilinx Virtex-5 DSP slices with support for commonly used DSP oper-

    ations. These functions include multiply, Multiply Accumulate (MAC), multiply

    add, three-input add, barrel shift, wide-bus multiplexing, magnitude compara-

    tor, etc. The architecture also supports cascading multiple slices to form wide

    math functions, DSP filters, and complex arithmetic without the use of general

    FPGA fabric [28].

    For generating internal clock signals, the Virtex5 series contains Digital Clock Management (DCM) blocks as well as PLLs. The DCM provides advanced clocking capabilities, including eliminating clock skew on clock nets, phase shifting signals and generating new clock signals by a mixture of clock multiplication and division.

    Clock skew is the difference in timing delay between the original clock signal and the signal that arrives at different components in the device. The difference arises from differences in wire length, temperature differences, capacitive coupling, etc.


    2.2 FPGA development using VHDL

    There are several steps involved in the typical design flow when implementing on an FPGA using the VHSIC Hardware Description Language (VHDL). The first is to describe the implementation using one of the three abstraction levels available in VHDL.

    Behavioural level is the highest level of abstraction. This level is used to

    model algorithms and system behaviour without specifying clock informa-

    tion. Some synthesis tools are available that can take behavioural VHDL

    code as input, but the lack of architecture details means that the imple-

    mentation can be inefficient in terms of area and performance.

    RTL stands for Register Transfer Level. An RTL description of an algorithm

    has an explicit clock, which means that all operations are scheduled to

    occur in specific clock cycles. The behaviour of the circuit is defined in

    terms of all registers and the flow of signals between them. This is the

    most common level used for synthesis.

    Gate level is the lowest level of abstraction in VHDL. A gate level description consists of a network of gates and registers instantiated from a technology-specific library. This library provides information about the components' timing and gate delays.

    VHDL supports syntax for dividing the code into components and packages.

    This way generic components can be created, which are easy to reuse. The

    components can be connected together to form application specific algorithms.

    VHDL was developed at the behest of the US Department of Defense and was originally a modelling language for documentation and simulation of circuits. The language is standardised by the IEEE for simulation, but only a subset of the language is supported for synthesis and there is no standard for which subset should be supported. Fortunately, the same subset is supported by the most common programs used for synthesis [20].

    2.2.1 Implementation

    When the algorithm has been described in VHDL the code must be translated

    to a bitstream, which can be used for programming the FPGA. The bitstream is the information stored in the SRAM used for the interconnections, the bits stored in the LUTs, and configuration information, for example for the primitives. This translation is the equivalent of compiling in the programming world and consists of several steps. After most steps, a report containing information about the design, including performance data, is generated.


    2.2.1.1 Synthesis

    The first step is to synthesize the VHDL code into a netlist. In this process

    the tool tries to recognise components from the code and the generated netlist

    contains components and how they are connected. Examples of components are

    memory elements, registers and adders. It is possible to influence the process by

    adding constraints controlling timing, area limitations and I/O pin assignment.

    The constraints are used in the optimisation of the netlist and are also saved to

    the next step of the translation. Many components, like adders and multipliers, can be implemented in many ways. For example, a fast implementation of an adder is the Sklansky tree adder, which requires a large area, while the ripple-carry adder is small but slow. The constraints will guide the program in choosing an implementation, i.e. the implementation using the smallest area while still fulfilling the timing constraints will be used.

    Information gathered during synthesis is summarised in a report. These reports are generated using Xilinx's FPGA design suite ISE. The reports contain information about macro statistics, resources and timing. The macro¹ statistics are information about which components were found by the macros analysing the VHDL code. The resource information is an estimate of which on-chip resources are required to implement the components on a certain chip. The timing information is a rough estimate of how fast the implementation will be. No routing delay is included in this estimate, since the layout performed by the Place And Route (PAR) process has not yet been made. It is also possible that optimisations done during mapping will improve the timing.

    2.2.1.2 Mapping

    The next translation step is Mapping (MAP), where the component implementations are mapped to primitive components in the target technology. This is usually the ordinary slices in the CLBs, but if memory elements or advanced mathematics are requested, this is mapped to SLICEM and DSP48E slices by the mapping process. The mapping needs information about the target architecture to know which resources are available. This information is provided by a technology library describing the target device.

    The report created after mapping contains essentially the same information

    as the synthesis report. Most interesting is the more accurate information about

    resource requirements.

    ¹ The synthesis tools contain macro scripts that recognise certain behaviour of the VHDL code. If the behaviour matches an internal component, like a BlockRAM, the script will use the RAM instead of implementing the behaviour with LUTs.


    2.2.1.3 Place and Route

    Finally the mapped components are assigned to physical resources in

    the FPGA and the resources are routed together. This step is called

    Place And Route (PAR). The process includes optimisation of the location of

    every resource and how they should be connected to meet the user-defined tim-

    ing constraints. The PAR generates information to configure every resource on

    the FPGA. For example certain SLICEM are configured as ROM if memory of

    that type is defined in the design phase.

    The report generated after PAR contains the most accurate timing information, which includes routing delay. Since all translation steps take a long time for large designs, it can be better to use the estimates in the earlier reports when evaluating designs. The information in the PAR report is useful as a verification in design steps before hardware verification.

    2.2.2 Simulation

    The possibility to simulate a design before it is implemented in hardware is

    an important part of VHDL. Two types of simulation can be made during the

    synthesis flow. Behavioural simulation can be made directly from the VHDL

    models without considering the hardware implementation. A gate level simula-

    tion can be made after the PAR process when accurate timing information is

    available. The gate level simulation is much slower but is able to find problem

    with timing and optimisations made by the CAD tool.

    By simulating the code before implementation it is easier to verify correct

    behaviour since internal error handling can be added to each component and

    internal signals not available to the outside world can be measured. The internal

    error handling can for example be assertions that certain signals are not high

    at the same time.

    2.2.3 Intellectual Properties

    To enable faster development and code reuse, most FPGA manufacturers provide implementations of various algorithms called IP-Blocks. These come with different licensing agreements and costs. To protect the source, device-dependent netlists are often provided together with software containing a library of the cores and functionality to adapt the cores to a specific usage. ISE contains a program called CORE Generator, which contains free simple IP-Blocks, including DSP functions, memories, storage elements and math functions, as well as evaluation versions of complex blocks [21].


    2.3 Processor Architecture

    This section contains a brief introduction to processor architecture needed to

    understand the performance metrics discussed in section 2.4.2 and the major dif-

    ferences between processor and FPGA architecture. The architecture described

    is a simple form used to introduce important concepts. Modern processors are

    far more advanced.

    2.3.1 Fundamental design

    A processor is a component capable of changing its functionality when given

    different instructions. The basic processor can be divided into two parts: a datapath and a control unit. The datapath includes logic, called execution units, for manipulating data. Examples of execution units are integer and floating-point units, which are used for performing mathematical operations on integer or floating-point data.

    The control unit configures the data path for performing the operation rep-

    resented by an instruction. For example it activates the correct execution unit

    used by an instruction.

    A processor is a sequential machine and the operation of the simple archi-

    tecture described here can be summarised by fetch-decode-execute-store. This is

    called one instruction cycle.

    Fetch An instruction is fetched from the memory position indicated by the Program Counter (PC).

    Decode The instruction is decoded, which means that the datapath is configured by the control unit for executing the instruction.

    Execute The datapath then executes the instruction, for example calculates the sum of two integers.

    Store The result of the execution is stored in memory and the PC is updated to point to the next instruction.
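    The cycle can be made concrete with a toy software model. The two-instruction encoding below is invented purely for illustration and does not correspond to any real processor discussed in this report:

        #include <stdint.h>

        /* Toy machine: 4 registers, a small memory and two instructions.
         * Instruction byte: bits [7:4] = opcode, [3:2] = dst, [1:0] = src. */
        enum { OP_HALT = 0, OP_ADD = 1 };

        typedef struct {
            uint8_t  mem[256];   /* instruction memory */
            uint32_t reg[4];     /* register file */
            uint8_t  pc;         /* program counter */
        } cpu_t;

        static void run(cpu_t *c)
        {
            for (;;) {
                /* Fetch: read the instruction the PC points at. */
                uint8_t instr = c->mem[c->pc];

                /* Decode: split into opcode and operand fields. */
                uint8_t op  = instr >> 4;
                uint8_t dst = (instr >> 2) & 3;
                uint8_t src = instr & 3;

                if (op == OP_HALT)
                    return;

                /* Execute: perform the operation in the datapath (OP_ADD). */
                uint32_t result = c->reg[dst] + c->reg[src];

                /* Store: write back the result and advance the PC. */
                c->reg[dst] = result;
                c->pc++;
            }
        }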

    2.3.2 Pipeline

    This architecture is a valid design but the clock frequency is limited by the delay

    of the combinational logic in the datapath. A better way is to introduce registers

    between the steps in the instruction cycle, which store the intermediate values.

    This will allow an increased clock frequency and will increase the performance if

    several instructions can be executed concurrently. Executing several instructions

    concurrently by storing the intermediate values in registers is called pipelining.

    The architecture described above has four steps in its pipeline. Increasing

    the number of steps will generally reduce the Instructions Per Cycle (IPC) but


    allows a higher clock frequency. One reason that the IPC is decreased is data dependency. A simplified explanation is that one instruction in the pipeline needs data from another instruction in the pipeline which has not yet written its result to the memory. A resolution is to insert stalls, which are instructions without operation, between the two dependent instructions. During the stall the result is written to the memory to be available for the next instruction.

    Another problem with pipelining is conditional branches². This type of instruction is generated by conditional statements like case and if, and by loops like while and for. If the address of the next instruction to be inserted in the pipeline is calculated by a conditional branch, this information is not available until the branch is resolved. This will lead to a stall of four cycles. The resolution is advanced branch prediction techniques, where the processor predicts the next instruction to be executed and throws away the result if the prediction fails.

    Modern general-purpose processors generally have much longer pipelines than four stages. For example, the Pentium 4 from Intel with the Prescott architecture has a 31-stage pipeline. This architecture allows clock frequencies up to almost 4 GHz but suffers heavily from problems with branch prediction. Such processors have several datapaths, which allow instructions to execute in parallel.

    Application specific processors have datapaths for application specific oper-

    ations like mathematical functions and specialised datapaths for feedback of re-

    sults to prevent stalling. Examples are multiple Arithmetical Logic Units (ALU) and specialised ALUs in processors for signal processing.

    Specialised datapaths and hardware require special instructions. Using

    these instructions is crucial for performance. An experienced programmer or

    an advanced compiler schedules instructions to take advantage of parallel dat-

    apaths and rearranges instructions to prevent data dependencies.

    2.3.3 Memory

    The processor architecture described above as well as the PowerPC operates

    on data located in registers inside the core. This memory is extremely small,

    expensive and cannot possibly contain all data and instructions needed by even

    extremely small programs. A larger memory is therefore made available to

    the processor from which it can fetch instructions and data for storage in the

    registers. Such memory is considerably slower than the processor.

    The solution is to introduce cache memories on the same chip as the processor

    but outside the core and create dedicated hardware to make sure that all data

    and instructions needed by the processor are in the cache. Otherwise the processor

    is stalled while the data is transferred from the memory to the cache. One cache

    can be used for both instructions and data but they are often separated to gain

    performance.

    ² When the value of the Program Counter (PC) is changed depending on a condition.


    The details of the complex algorithms that handle the cache are outside the scope of this introduction. Still, it is important to understand that the memory bottleneck and cache misses are performance critical.
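    A simple way to see this effect in practice is the traversal order of a two-dimensional array; the sketch below (array sizes chosen arbitrarily) sums the same data in cache-friendly row-major order and in cache-hostile column-major order, where the latter typically causes far more cache misses since C stores arrays row by row:

        #include <stddef.h>

        #define ROWS 1024
        #define COLS 1024

        /* Row-major traversal: consecutive accesses touch consecutive addresses,
         * so each fetched cache line is fully used before it is evicted. */
        static double sum_row_major(const double a[ROWS][COLS])
        {
            double s = 0.0;
            for (size_t i = 0; i < ROWS; i++)
                for (size_t j = 0; j < COLS; j++)
                    s += a[i][j];
            return s;
        }

        /* Column-major traversal: consecutive accesses are COLS elements apart,
         * so nearly every access may miss in the cache. */
        static double sum_col_major(const double a[ROWS][COLS])
        {
            double s = 0.0;
            for (size_t j = 0; j < COLS; j++)
                for (size_t i = 0; i < ROWS; i++)
                    s += a[i][j];
            return s;
        }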

    2.3.4 PowerPC

    PowerPC was designed by Apple, IBM and Motorola in 1991 based

    on the POWER architecture created by IBM. The architecture is a Reduced Instruction Set Computer (RISC) architecture designed for personal computers, but since Apple's transition to Intel-based workstations it is mostly used in embedded environments. The reason PowerPC is used as a target for the performance evaluation in this thesis is that it is a common embedded processor and that it is used by SMW.

    2.3.4.1 PowerPC 604

    This processor is the reference used for evaluation of processor performance

    using PSIM in this thesis. It is a general purpose processor and not specific to

    a certain application but shows good DSP performance because of the support

    for MAC instructions.

    The architecture is a superscalar RISC architecture that can dispatch four instructions and execute up to seven instructions in parallel. This is possible since the processor contains seven execution units, including three integer units and one floating-point unit. The address width is 32 bits, as is the maximum integer resolution, but up to 64-bit floating-point numbers are supported.

    A branch processing unit and a completion unit control the out-of-order execution of instructions on the processor, trying to maximise the performance.

    The processor includes separate on-chip instruction and data caches of 32

    KByte. The maximum clock speed of the CPU is 180 MHz. More information

    is found in [3, PowerPC 604 Processor System] and [10].


    2.4 Benchmarking

    In this report, benchmarking is taken to mean the act of running a synthetic workload on an object and measuring the performance in order to be able to compare the object with others. Analysing the performance of a fixed workload is called profiling. The object in this report is an FPGA device or a processor. Benchmarking has been used for evaluating the performance of VHDL generated with AccelDSP on an FPGA, and profiling on a PowerPC and an FPGA has been used for comparing realisations on the different platforms.

    With benchmarking as a major focus of this report, it is important to present

    a description of the fundamentals and problems concerning benchmarking and

    profiling.

    2.4.1 Benchmarking fundamentals

    Three considerations arise when benchmarking an object. The first consider-

    ation is what kind of performance to measure and in what unit. The next is

    how the performance should be measured, especially without influencing the

    object. And the last one is that the workload should be designed to give a fair

    comparison between different objects. When benchmarking a compiler, in this

    report AccelDSP, it is important that the workload contains enough diversity

    to be able to draw a general conclusion of the performance of the compiler.

    Traditionally, different approaches are used for performance measurement:

    Simple metrics are parameters directly derived from the manufacturing of the object. These are usually too simple to describe the actual performance.

    Application benchmarking is when a complete application is used as work-

    load and is a very good way to cover a lot of the possible operations of the

    object. The disadvantage is, as already mentioned, that the usage gets very specific, which gives a benchmark that is only interesting for a few applications similar to the workload.

    Algorithm kernel benchmarking is an interesting method if a kernel set

    of algorithms can be extracted from the interesting applications. The performance of these benchmarks is then related to the performance

    of the application of interest. This method is used for benchmarking of

    AccelDSP in this thesis.

    Micro benchmarking is when a single metric is measured to identify peak

    capability and potential bottlenecks of a device. This can be overly optimistic in terms of real application performance.

    Functionality benchmarking can be seen as a kind of micro benchmarking

    and is when different types of functionality of an object are measured with


    different benchmarks. What kind of functionality the application of interest is using determines which measurements to rely on. Examples of functionality could be I/O performance, memory performance or computing power.

    Benchmarking is becoming more and more advanced as the complexity of new

    hardware architectures increases. Profiling, i.e. determining the performance of a certain application, run in a specific way on a certain platform, is rather easy, but gaining generally usable information is not trivial. The more complex systems become, the harder it is to isolate the performance of the object of interest. Advanced hardware often requires the compiler to generate code that is able to utilise the advanced features. For a C-Code compiler this means generating specialised machine code instructions, and for a translator from M-Code to VHDL it means generating a description that can utilise optimised slices.

    2.4.2 Processor benchmarking

    Performance modelling and measurement for software and processors are thor-

    oughly discussed in [11]. This section contains a brief summary of the theory

    regarding performance evaluation and metrics, as a background for finding tech-

    niques suitable for comparing the performance of processor and FPGA archi-

    tectures.

    2.4.2.1 Traditional processor metrics

    There are many simple performance metrics that can be used to describe processor performance, but they are almost solely interesting within an architecture, not when comparing different architectures to each other. The commonly used

    metrics are:

    Clock frequency A common metric in consumer electronics which is totally irrelevant unless the same model, but with different clock frequencies, is compared.

    MIPS Describes how many instructions can be executed per second. This

    may be interesting to compare on architectures with the same instruction

    set. It is important to remember that not all instructions can be executed

    concurrently if several datapaths are available and that not all instructions

    take the same amount of cycles to execute. For example floating-point

    operations are usually slower than fixed-point operations.

    MOPS Describes how many operations can be executed per second. This suffers from problems similar to those of MIPS: what counts as an operation, and how many operations are needed to perform useful work?


    FLOPS Describes how many floating-point operations can be executed per second. This is a more interesting metric than those previously mentioned, since a floating-point operation takes a defined number of cycles on a specific architecture. FLOPS is a common metric in scientific calcu-

    lations. This metric, like the previous, does not tell anything about the

    overall performance since it is theoretical and does not consider memory

    performance.

    IPC/CPI Describes how many instructions are executed per cycle. This is a very good metric which can be used to compare different processors with each other. The metric is not constant for a certain processor, for reasons previously discussed. An average can, however, be calculated for a certain piece of software; the standard relations are given after this list.
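    The standard relations between these quantities (textbook definitions, not formulas taken from this report) are, with instruction count IC and clock frequency f:

        \[
          \mathrm{CPI} = \frac{\text{total clock cycles}}{\text{instructions executed}}, \qquad
          \mathrm{IPC} = \frac{1}{\mathrm{CPI}}, \qquad
          T_{\mathrm{exec}} = \frac{\mathrm{IC} \times \mathrm{CPI}}{f}
        \]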

    Since the simple metrics only give limited information about the performance of a certain processor, some kind of performance evaluation must be carried out. How this is performed depends on the software of interest and on how accurate the results need to be.

    2.4.2.2 Performance evaluation

Performance evaluation is of interest for those designing both hardware and software and can be divided into performance modelling and performance measurement. Performance modelling is an estimation of the performance of a system without running the code on the actual hardware. The method is more common in early stages of the design process and can further be divided into simulation-based modelling and analytical modelling. In performance measurement the code is run on actual hardware and some kind of instrumentation technique is used for gathering performance data.

    2.4.2.3 Performance measuring

    Observing how code is executed during runtime is very important to understand

    the bottlenecks of a system. This can be done by:

On-chip hardware monitoring Most processors have built-in performance-monitoring counters that observe interesting metrics for the actual architecture. Depending on the architecture and configuration the counters can monitor cycle count, instruction counts (fetched/completed), cache misses, branch mispredictions, etc... These counters can be read by software using special instructions to present the information to a developer.

Off-chip hardware monitoring is when dedicated hardware is attached to the processor for the purpose of monitoring the performance. The off-chip


hardware can for example interrupt the processor after every instruction completion and save all information of interest available in the processor.

Software monitoring A method similar to off-chip monitoring can be performed in software by interrupting the processor using trap instructions. This is very invasive but easy to implement. A major drawback is that OS activity is hard to monitor unless trap instructions can be added to the OS source code.

Microcoded instrumentation This is a method requiring hardware support for recording instruction execution by modifying the microcode. The method does not gather performance information directly, but an instruction trace that can be used in a trace simulation.

    2.4.2.4 Analytical modelling

Analytical modelling is not very popular for microprocessors, but rather for whole systems. The method is based on models relying on probabilistic methods, queuing theory, Markov models or Petri nets. More information can be found in [11, 2.1.3 Analytical modelling].

    2.4.2.5 Simulation based modelling

By creating a model of the system being simulated, i.e. the target machine, and running it on a host machine, the created software can be run without access to the specific hardware it was designed for. The simulator can be either a functional simulator or a timing simulator, and be either trace-driven or execution-driven.

A functional simulator is able to run the program in a simulated environment where performance data can be revealed. The functional simulator can be cycle accurate, which means that each clock cycle of the processor is simulated. This gives the most accurate results, with the drawback of long simulation time. When only performance data is of interest a fully functional simulator is not necessary; much simulation can be made with only limited functionality if the performance, and not the result, is relevant. Performance analysis is commonly called profiling and is done using a profiler tool.

A trace-driven simulator can be seen as a simplified form of an execution-driven simulator, where the simulator is not able to execute the program but instead analyses a trace of information representing the instruction sequence that would have executed on the target machine. The input to a trace-driven simulator can either be fed continuously or fetched from a memory. These simulators suffer from two main problems: the large size of the trace, since it is proportional to the dynamic instruction count, and that the trace is not very representative for out-of-order processors.


Execution-driven simulators solve the two major problems of the trace-driven ones, since the static instructions are used as input and out-of-order processing can be simulated as well. The simulator can either interpret all instructions or only those of interest, letting the host processor execute the rest natively. The latter reduces the intrusiveness and therefore increases the speed of the simulation. Execution-driven simulation is highly accurate but is very time consuming and requires long periods of time for developing the simulator [11, 2.1.1.2 Execution-driven simulation].

    2.4.2.6 Energy and power simulators

The power consumption of a processor consists of two parts. One is static, which is independent of the processor activity but depends on the processor mode; modern processors can typically go into an idle mode where power consuming parts are turned off. The other part is dynamic and depends on the work executed by the processor.
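As a rough illustration, the two contributions are often written using the first-order CMOS approximation (a textbook approximation, not a result from [11]):

    P_{total} = P_{static} + P_{dynamic}, \qquad P_{dynamic} \approx \alpha \, C_{switched} \, V_{dd}^{2} \, f_{clock}

where alpha is the activity (toggle) factor, C_switched the switched capacitance, V_dd the supply voltage and f_clock the clock frequency.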

Power simulation is done by modelling the power consumption of the individual components and letting the simulator calculate the consumption based on the activity statistics of each component. The accuracy of the models and the granularity of the simulation determine how exact the estimation will be.

    2.4.3 FPGA benchmarking

To accurately benchmark the performance of an FPGA is a costly and time-consuming task. The reason is that the complexity of today's FPGA architectures and CAD tools makes it difficult to obtain proper benchmarking results. For example, FPGAs vary in terms of size, maximum clock frequency, number of I/O pins, chip specific implementations, built-in PowerPC cores, DSP specialised slices, LUT sizes, BlockRAM, etc... Even if VHDL and VERILOG are platform independent languages, the compilers and CAD tools vary between platforms and some architecture specific modules may have different interfaces.

When implementing on FPGAs the CAD tools have a big impact on the performance of the final design. Timing constraints and the configuration of optimisation trade-offs in the tools strongly influence the performance of the generated design. Correctly adding constraints and configuring trade-offs to maximise the performance therefore sets high requirements on a benchmarking methodology.

    2.4.3.1 Traditional FPGA measurements

It is important to set constraints and configure trade-offs depending on what to measure. In FPGA design there are three important benchmarking and design


    directions:

Timing When designing and benchmarking towards timing, important metrics are: maximum clock frequency, input pin to setup delay, and clock to output pin delay. To get comparable results for these measures there are many things to consider, see section 2.4.3.2 for more details.

Area When designing with area limitations the goal is to keep the design as small as possible, most probably to reduce the cost. The resource usage is given by the CAD tools after a design has been successfully mapped to a device. Important metrics are LUT usage, register usage, I/O pins used, BlockRAM count, DSP48E usage and the number of PLL/DCM.

Power Power consumption is split into two important factors, static and dynamic power usage. Static power usage comes from the device's physical parameters like size, package and supply voltage. Other factors that contribute to static power usage are design parameters like placement, routing and operating conditions. Dynamic power usage comes from the signal toggle rate. Measuring power consumption is done using probing or with the integrated power analysis tools often supplied with the CAD tools. The analysis tool for ISE is called XPower.

    2.4.3.2 Performance benchmarking

Achieving maximum performance and meaningful benchmarks on a device requires a structured benchmark methodology. Without strict guidelines the performance results can be misleading or, in the worst case, wrong. In this section an overview of these requirements is presented. Manufacturer specific details for Xilinx and Altera are found in [1, 22].

To set up an FPGA performance benchmark there are a number of important points to consider:

Apply timing constraints in synthesis and place and route until the timing slack³ is negative. This is done to force the tools to optimise the timing.

Constrain all important clocks. This forces the tools to improve all clocks in the design. If a single global clock constraint is used in a design with several clocks, the CAD tool will only apply the constraint to the clock domain which has the worst performance.

The highest effort choice in both the synthesis and PAR tools should be used.

³ The timing slack is the difference between the constraint and the resulting performance. If the slack is positive it means that the constraint can be set tighter to force the CAD tools to work harder.


Apply moderate I/O constraints. This is done to force the CAD tools to factor in any trade-offs in I/O timing requirements. Without this constraint the results may be unrealistic.

When comparing different FPGAs it is important to make sure that the devices run at similar speed grades.

To interpret the results given by the timing reports it is important to check that the CAD tools used all constraints. It is also important to run the synthesis and PAR a few times with varied constraints to tune the tools for better performance. Xilinx recommends tightening the constraint in 5% increments until timing is no longer met.
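Expressed compactly, and interpreting the 5% recommendation as a 5% tightening of the period constraint per run (our reading of the guideline, not an exact quote from [1, 22]):

    slack = T_{constraint} - T_{achieved}, \qquad T_{constraint} \leftarrow 0.95 \cdot T_{constraint} \ \text{while } slack \ge 0

i.e. the constraint is tightened iteratively and the last run with non-negative slack gives the performance figure to report.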

    2.4.4 Digital signal processing benchmarking

    In this thesis the focus is to benchmark and implement algorithms for

    Digital Signal Processing (DSP). This section presents a few important terms

    and arithmetic operations used in DSP.

Multiply-add is when two values are multiplied and a third value is added to the result. It is an important operation used in DSP applications including computation of vector products, FIR filtering, correlation and FFT.

Multiply-accumulate operation (MAC) is the more common term used to describe the multiply-add operation in hardware. The difference is that the result from the multiply-add operation is stored in a result register called an accumulator, hence the name.

Multiply-accumulates per second or MACs is a common micro benchmarking metric used to measure the peak performance in the number of multiply-accumulates a device can perform per second.

Complex number calculations Since complex numbers are very common for describing the phase and magnitude of signals, many calculations are carried out with complex numbers. Complex multiplication can be implemented in two ways, using either 3 real multiplications and 5 additions or 4 multiplications and 2 additions, as shown in the sketch after this list.

Matrix operations Many advanced algorithms use matrix operations including multiplication, decompositions, and singular value and eigenvalue calculations.
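To make the two complex-multiplication variants concrete, the following M-Code sketch (written for illustration only; the variable names are ours and the values are arbitrary) computes the same product with 4 multiplications/2 additions and with 3 multiplications/5 additions:

    % Two equivalent ways to compute (a + jb)*(c + jd).
    a = 1.5;  b = -2.0;  c = 0.75;  d = 3.25;

    % Variant 1: 4 real multiplications and 2 additions/subtractions.
    re1 = a*c - b*d;
    im1 = a*d + b*c;

    % Variant 2: 3 real multiplications and 5 additions/subtractions
    % (trades one multiplier for extra adders, which can pay off in hardware).
    k1 = c*(a + b);
    k2 = a*(d - c);
    k3 = b*(c + d);
    re2 = k1 - k3;
    im2 = k1 + k2;

    % Both variants agree with Matlab's built-in complex multiplication.
    ref = (a + 1i*b)*(c + 1i*d);
    disp([re1 im1; re2 im2; real(ref) imag(ref)]);

Which variant is preferable depends on the relative cost of multipliers and adders on the target device.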


    2.5 Xilinx AccelDSP

AccelDSP became a part of Xilinx XtremeDSP solutions in 2006 when Xilinx acquired the company AccelChip, founded in 2000 in California. AccelChip was a provider of Matlab synthesis software for DSP systems, with roots in a research project at Northwestern University in cooperation with DARPA. The project was called MATCH, an acronym for Matlab Compilation Environment for Adaptive Computing [23, 18]. The only operating system AccelDSP supports is Microsoft Windows XP.

The software is used for translating M-Code into VHDL which can be synthesised to create digital hardware. The translation includes automatic analysis of the types and shapes of variables and generation of a fixed-point design suitable for hardware implementation. The original research included functionality for mapping the application to multiple FPGAs by parallelising it, but this has not reached AccelDSP.

Many programs exist for translating general purpose programming languages, including C/C++ and Java, to VHDL, but developing a direct synthesis path from Matlab enables fast and easy evaluation of many algorithms [7]. The reason is that Matlab is used by high technology companies for developing algorithms, and a direct path makes intermediate translations to other programming languages unnecessary. This was the motivation behind the original research.

Using Matlab as a source has both advantages and disadvantages. The main advantage, besides its superiority for developing algorithms, is the high level syntax of the M-Code language. Since most signal processing building blocks, such as matrix multiplication, FFT, correlation, eigenvalue calculations etc..., are available as a single function, it is easy to auto-infer optimised IP-Blocks when translating to VHDL.

The main drawback is that Matlab is an interpreted language with dynamic type and shape resolution of its variables. A variable can even change shape, which means that it can go from being a scalar to a vector or a matrix during runtime. Dynamically changing the shape of matrices in loops is unfortunately common in M-Code; such code is very slow and not suitable for hardware implementation (see the sketch below). There is no concept of constants in M-Code, so a workaround is needed for optimisations that depend on constants.
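As a small illustration of the shape problem (our own example, not code from the thesis designs), the first loop below grows a vector dynamically, so its final size cannot be determined statically, while the second preallocates a fixed shape:

    n = 32;

    % Dynamically grown vector: its shape changes every iteration, so the
    % final size cannot be determined statically by a synthesis tool.
    y_dyn = [];
    for k = 1:n
        y_dyn = [y_dyn, k^2];
    end

    % Preallocated vector: the shape is fixed before the loop, which is both
    % faster in Matlab and, when n is a constant, suitable for hardware
    % generation.
    y_pre = zeros(1, n);
    for k = 1:n
        y_pre(k) = k^2;
    end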

    2.5.1 Workflow

    Generating VHDL from a Matlab model using AccelDSP starts with dividing

    the model into two parts, a script file and a function-file. The function-file

    contains the actual function to be translated into VHDL written as an ordinary

    Matlab function with an interface of input and output variables.

    The script file has three functions. It creates stimuli, feeds the stimuli to


the function in a streaming loop and verifies the output from the function. The streaming loop simulates the infinite stream of data entering and leaving the design in hardware, and the combination of a Matlab function and the loop makes it possible to feed the data in manageable partitions. The loop can be either a for or a while loop.

The stimuli, generated or imported, in the script file are important for two reasons. First, the data is used as a reference in the automatic type and size identification and fixed-point generation framework implemented in AccelDSP; it is from the scaling of the input that the internal bit-widths are determined. Second, the input must represent the real world input for the verification of the function to be relevant. The verification is made in several steps in the AccelDSP workflow.
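A minimal sketch of such a script/function pair is shown below (the file names, signal contents and the example computation are our own illustrative assumptions, not taken from the AccelDSP documentation); the script generates stimuli, feeds them to the function inside a streaming loop and verifies the result:

    % my_stream_script.m -- script file: creates stimuli, runs the streaming
    % loop and verifies the output.
    n_frames  = 64;                       % number of data partitions
    frame_len = 128;                      % samples per partition
    stimuli   = randn(n_frames, frame_len);
    result    = zeros(n_frames, 1);
    reference = zeros(n_frames, 1);

    for k = 1:n_frames                    % the streaming loop
        result(k)    = my_function(stimuli(k, :));
        reference(k) = sum(stimuli(k, :).^2);  % floating-point reference
    end

    max_error = max(abs(result - reference))

    % my_function.m -- function-file: the design to be translated to VHDL.
    function y = my_function(x)
    y = sum(x.^2);                        % example: energy of one input frame
    end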

An overview of the workflow is found in Figure 2.1 and is briefly explained in the following sections. The complete manual is found in [24].

[Figure 2.1: AccelDSP Workflow — verify floating-point ("golden" model), analyse, generate fixed-point, verify fixed-point, generate RTL, verify RTL, synthesise RTL, implement, verify gate level; the intermediate products are the fixed-point model, the RTL model (VHDL/Verilog), the gate-level netlist and finally the bitstream and simulation file.]


    2.5.1.1 Verify floating-point

AccelDSP lets the user execute the script file inside the program and shows all plots, variables and output. The floating-point model is the golden source, which must be verified by the designer using this output. Errors in this model will propagate through all later steps and exist in the final bitstream. It is also important to check that all important variables are observed, since the output is used to verify the fixed-point model.

If probes, see section 2.5.1.4, are used, floating-point data is stored in this step for later use.

    2.5.1.2 Analyse

An in-memory model of the design is created in this step. This involves parsing the M-Code into an internal structure, and all constructs that are not supported will cause an error to be raised. This step also tries to identify the streaming loop and the function-file from the script file.

AccelDSP supports a subset of the M-Code language, with restrictions on usable functions, operators, syntax and shapes. A reference can be found in [26]. Matlab functions are supported using a library called AccelWare which is included with AccelDSP, see section 2.5.2. Some of the functions can be automatically inferred in the design, but some need to be generated and inserted in the M-Code manually. For example, a divider can be inferred automatically and directives for the hardware implementation can be set in the fixed-point parse tree. For other functions, like an FFT, the code must be generated and inserted manually.

    2.5.1.3 Generate fixed-point

The framework already mentioned creates a fixed-point model in either Matlab or C++ from the floating-point source. The design is presented in a parse tree which includes all information about the design, including inferred IP-Blocks, operators and shapes. The interface lets the user graphically add directives to control bit-widths, add pipeline steps or choose hardware specific implementations.

    2.5.1.4 Verify fixed-point

In this step the fixed-point model is executed by Matlab and the same output as for the floating-point code is presented, so that the user can compare it with the floating-point output. If the results are unsatisfactory the user has to go back and annotate the design with more directives, or change the floating-point design. This iteration is performed until the user is satisfied with the results.


To simplify the analysis of variables inside functions and the effect of fixed-point conversion, AccelDSP provides probes. These are inserted into the M-Code for observing both the floating-point and fixed-point values of a variable. A plot presenting the values and the differences between the models is shown when running the AccelDSP function verify fixed-point, which can be used to identify errors in the quantisation.

    2.5.1.5 Generate RTL

When the user is satisfied with the fidelity of the fixed-point results an RTL description can be generated. Both VHDL and VERILOG files containing the RTL description are created. A testbench for verifying the generated design is also created automatically, together with ASCII files containing the function stimuli and the output from the Matlab model. The testbench reads the input to the VHDL function from the ASCII files and compares the output with the Matlab reference. The verification passes if all values are the same.

The testbench depends on Xilinx packages for performing the comparison with the ASCII files. This leads to compact testbenches, but the packages must be available if the testbench is to be run outside AccelDSP.

    2.5.1.6 Verify RTL

The testbench is run inside AccelDSP using one of several commercially available tools. The result is either pass or fail.

    2.5.1.7 Synthesise RTL

If the verification passes, the synthesis can also be performed inside the program by utilising an external tool, as in the verification. After this step a gate level netlist is created.

    2.5.1.8 Implement

The netlist is mapped to the hardware using MAP and PAR in this step. After this process two files of particular interest are created: the configuration bitstream, which is used for programming the FPGA, and a gate level simulation file.

    2.5.1.9 Verify Gate Level

    The same testbench used in the RTL verification can be used to verify the gate

    level implementation generated after PAR. This verification is able to find both

    synthesis and timing related errors. If this verification passes it guarantees that

    the implementation is bit-true with the Matlab fixed-point model.


    2.5.2 AccelWare

M-Code implementations of Matlab functions optimised for generation of VHDL are included in AccelDSP through a library called AccelWare. The library only contains a small subset of all Matlab functions, and only some of the functions can be automatically inferred using the same or a similar syntax as in Matlab. Functions that cannot be inferred automatically must be generated using a graphical interface with a form for choosing implementation parameters. This interface creates an AccelDSP project with both a script file, containing plots and test data to verify the functionality, and a function-file which can be inserted in the user code instead of the unsupported Matlab function.

    2.5.3 Original compiler implementation

This section contains information about the original implementation of the MATCH compiler as described in [6]. How much of this has changed in the current version of AccelDSP is not known, but at least some parts, like the state machine generation, are still implemented.

As a first compiler step a Matlab Abstract Syntax Tree (AST) is generated from the source code, annotated with the information in the directive files. A type-shape inference phase infers the types and shapes of the variables by analysing the tree, including type and shape directives. Optimised library functions are recognised and IP-Blocks are inferred. For operations where no cores are available, a scalarisation phase expands the matrix and vector operations into loops, as sketched below.
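As an illustration of what scalarisation means (our own example, not taken from [6]), a matrix product written as a single M-Code operation corresponds to the following explicit loop nest of scalar multiply-adds:

    % Vectorised form: a single M-Code statement.
    A = randn(4, 3);
    B = randn(3, 5);
    C = A * B;

    % Scalarised form: the same product expanded into scalar operations,
    % which is closer to what is generated when no optimised core is used.
    m = size(A, 1);
    k = size(A, 2);
    n = size(B, 2);
    C2 = zeros(m, n);
    for i = 1:m
        for j = 1:n
            acc = 0;
            for p = 1:k
                acc = acc + A(i, p) * B(p, j);   % multiply-accumulate
            end
            C2(i, j) = acc;
        end
    end
    max(abs(C(:) - C2(:)))   % should be zero up to rounding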

    The original compiler then performed a parallelisation phase which split

    loops or assigned different tasks onto multiple FPGAs available on the same

    board. This required a compatible board with several FPGAs and libraries for

    communication between the devices.

The Matlab AST is translated to a VHDL AST using a state machine description. The state machine is, among other things, required to hold the states of loops and calculations when using pipelining.

A precision inference scheme operates on the new AST to find the minimum number of bits required to represent every variable. When the bit widths have been inferred, hardware dependent optimisations are performed which can alter parts of the generated state machine. A traversal of the generated tree produces the output VHDL code.

    2.5.3.1 Performance

During the development of the MATCH compiler the researchers used benchmarking for evaluating the performance and measuring the effect of different optimisations. The benchmark suite presented in [6] consists of Matrix Multiplication,


FIR filter, IIR filter, Sobel edge detection algorithm, Average filter and Motion Estimation algorithm.

The conclusion drawn by the authors was that the auto-generated code was almost equivalent in execution time to manually designed hardware, and in some cases superior. The resource utilisation was within a factor of four of the manually designed hardware, and the design time was reduced from months to minutes.

The benchmarking was done on the WildChild board from Annapolis Micro Systems with 9 FPGA devices. How parallelism in the algorithms is exploited to divide the algorithm between the devices, and how the communication is implemented, is crucial for the performance and resource utilisation. This functionality is not a part of AccelDSP.


    2.6 MathWorks Matlab

Matlab is a high-level technical computation environment from The MathWorks, founded by Jack Little, Steve Bangert and Cleve Moler in 1984. The foundation of Matlab was written by Cleve Moler and was an implementation of the EISPACK and LINPACK Fortran packages for linear algebra and eigenvalue calculations. The program was written when Moler was a math professor at the University of New Mexico, to let the students do scientific calculations without having to program in Fortran.

Nowadays Matlab lets engineers develop algorithms, do numerical computations and visualise data more easily and faster than with traditional programming languages. The core of Matlab is the Matlab Code (M-Code) language, which is a high level, mathematically oriented language that can be used for algorithm design using Matlab's built-in functions as well as for programming new functions.

Matlab's built-in functions are mostly written in M-Code, which makes it easy to inspect the algorithms and make changes. Fundamental and performance sensitive functions are compiled to gain better performance. The core functionality of Matlab can be extended by installing toolboxes which contain tools for a certain field. Toolboxes are available for everything from aerospace and finance to bioinformatics [13, 15, 14].

2.6.1 Native code generation

Matlab can generate C-Code with both the Matlab Compiler, which can be used to generate code and executables for workstations, and the software used in this thesis, Real Time Workshop Embedded Coder (EMLC), which is designed for embedded targets. The difference in generated code is considerable and EMLC is recommended for embedded platforms [17].

EMLC is a module of Real Time Workshop, which is an extension to Matlab for generating and executing stand-alone C-Code. The module can be used with both the Embedded Matlab Subset (EML) and Simulink. The latter is an environment, supported by a graphical user interface, for simulation and model-based design of embedded and dynamic systems. EMLC was originally a part of Real Time Workshop, which leads to EMLC still being dependent on Simulink.

Both fixed-point and floating-point code generation are possible using EMLC. Parameters describing the code generation target and its abilities are provided to EMLC using configuration objects. This information includes endianness, floating-point support, etc... For fixed-point generation, which is of interest in this report, the Fixed-Point Toolbox is needed. This toolbox contains object types which can be used for annotating variables with fixed-point directives. These directives are used directly in the M-Code to create a fixed-point Matlab implementation supported for code generation, as in the sketch below.
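A minimal sketch of such an annotation is shown below (the function and its word lengths are our own illustrative choices; the exact code generation command and its options depend on the Real Time Workshop Embedded Coder setup and are therefore not shown):

    % fixed_point_scale.m -- Fixed-Point Toolbox annotations in M-Code.
    function y = fixed_point_scale(x)
    xfi  = fi(x, 1, 16, 12);      % signed, 16-bit word, 12 fraction bits
    gain = fi(1.375, 1, 8, 6);    % signed, 8-bit constant gain
    y = xfi * gain;               % result gets a derived fixed-point type;
                                  % rounding/overflow follow the fimath rules
    end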


    For more information and a performance evaluation of code generation from

    Matlab see section 1.3.7.


    Chapter 3

    Method

This chapter is divided into three sections. The first, found in section 3.1, describes a method for generating VHDL and evaluating the performance on a Field Programmable Gate Array (FPGA). Next, section 3.2 describes a method for generating C-Code using Real Time Workshop Embedded Coder (EMLC) and evaluating the performance on a PowerPC. A method for making a platform decision based on the previous two methods is found in section 3.3. Descriptions of the algorithms used in all methods are found in chapter 4.

    3.1 VHDL generation and simulation

This section describes the method for generating and simulating VHDL code from a Matlab implementation, used in this thesis for evaluating AccelDSP and for generating VHDL for the method for platform decision found in section 3.3. Background information on how AccelDSP should be used according to Xilinx is found in section 2.5. A flow chart of the method described in this section is fou