    An evaluation of methods for FPGA

    implementation from a Matlab description

    Using a Matlab model as the origin, an evaluation of VHDL generation using

    AccelDSP and a method for comparing implementations on FPGA and PowerPC.

    Kristofer Lindberg

    Kristofer Nissbrandt

    Master's Degree Project

    Stockholm, Sweden 2008

    XR-EE-SB 2008:014


    Public

    REPORT

    Prepared (also subject responsible if other): Kristofer Lindberg, Kristofer Nissbrandt
    No.: SMW/DD/G-08:070 En
    Approved: DD/GFC (Anette Hansson)
    Checked:
    Date: 2008-09-18
    Rev: A
    Reference:

    An evaluation of FPGA implementations from a Matlab

    description

    Using a Matlab model as the origin, an evaluation of VHDL generation using

    AccelDSP and a method for comparing implementations on FPGA and

    PowerPC.

    A Master's Thesis by:

    Kristofer Lindberg

    Kristofer Nissbrandt

    Supervisor:

    Anders Bergker, Department of Data and Signal Processing, Saab

    Microwave Systems


    Abstract

    Algorithms for signal processing and radar control at Saab Microwave Systems (SMW) are implemented on a platform consisting of several boards equipped with programmable logic, in the form of Field Programmable Gate Array (FPGA) devices and PowerPC processors. Different boards are responsible for different parts of the processing, which is performed in parallel on the incoming video data from the antenna.

    It is hard to know whether a certain algorithm should be implemented on an FPGA or a PowerPC to obtain the best realisation. Today, implementations targeting FPGAs are written in the Hardware Description Language VHDL, which leads to a very long implementation time. This is especially a problem for rapid prototyping, used by SMW to develop new functionality in their radars. Generating VHDL automatically from a high-level description would reduce the implementation time.

    In this thesis Xilinx AccelDSP, a software tool for generating VHDL from a high-level MathWorks Matlab description, has been evaluated. A method for deciding whether an algorithm should be implemented on an FPGA or a PowerPC has also been developed. The method is based on profiling of implementations on the two platforms, made with code generated from the same Matlab description. VHDL is generated using AccelDSP and C-Code using MathWorks Real Time Workshop Embedded Coder (EMLC).

    Five different aspects of AccelDSP, and how it can be used by SMW, have been evaluated. The result is that the main purpose of AccelDSP at SMW is to be used in the method for platform decision. The software can also be used for generating parts of a design, but this is not recommended: the drawbacks of VHDL generated by AccelDSP outweigh the reduced design time. The main problems are the performance, the reliability of the description, the readability and the difficulty of maintaining the source.

    The method for platform decision gives an estimate of the execution time and the resource usage on both platforms for a specific design, and includes a discussion of other metrics important for the decision. The estimated metrics are accurate enough to give an indication of the resulting performance, usable for deciding the implementation platform.


    Contents

    List of acronyms 7

    Terminology 11

    1 Introduction 13

    1.1 Hardware implementation . . . . . . . . . . . . . . . . . . . . . . 14

    1.2 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . 15

    1.2.1 Problem statement . . . . . . . . . . . . . . . . . . . . . . 15

    1.3 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

    1.3.1 Xilinx System Generator . . . . . . . . . . . . . . . . . . . 15

    1.3.2 Altera DSP Builder . . . . . . . . . . . . . . . . . . . . . 16

    1.3.3 MathWorks Simulink HDL Coder . . . . . . . . . . . . . . 16

    1.3.4 MathWorks Filter Design HDL Coder . . . . . . . . . . . 16

    1.3.5 The HPEC Challenge Benchmark Suite . . . . . . . . . . 16

    1.3.6 Implementation using Xilinx System Generator . . . . . . 16

    1.3.7 Matlab to C-Code Translation . . . . . . . . . . . . . . . 17

    2 Background 19

    2.1 FPGA Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 19

    2.1.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . 19

    2.1.2 Architecture description . . . . . . . . . . . . . . . . . . . 19

    2.2 FPGA development using VHDL . . . . . . . . . . . . . . . . . . 22

    2.2.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . 22

    2.2.2 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 24

    2.2.3 Intellectual Properties . . . . . . . . . . . . . . . . . . . . 24

    2.3 Processor Architecture . . . . . . . . . . . . . . . . . . . . . . . . 25

    2.3.1 Fundamental design . . . . . . . . . . . . . . . . . . . . . 25

    2.3.2 Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

    2.3.3 Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

    2.3.4 PowerPC . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

    2.4 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

    2.4.1 Benchmarking fundamentals . . . . . . . . . . . . . . . . . 28


    2.4.2 Processor benchmarking . . . . . . . . . . . . . . . . . . . 29

    2.4.3 FPGA benchmarking . . . . . . . . . . . . . . . . . . . . 32

    2.4.4 Digital signal processing benchmarking . . . . . . . . . . 34

    2.5 Xilinx AccelDSP . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

    2.5.1 Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

    2.5.2 AccelWare . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

    2.5.3 Original compiler implementation . . . . . . . . . . . . . . 39

    2.6 MathWorks Matlab . . . . . . . . . . . . . . . . . . . . . . . . . . 41

    2.6.1 Native code generation . . . . . . . . . . . . . . . . . . . . 41

    3 Method 43

    3.1 VHDL generation and simulation . . . . . . . . . . . . . . . . . . 43

    3.1.1 VHDL code generation . . . . . . . . . . . . . . . . . . . . 44

    3.1.2 Verification . . . . . . . . . . . . . . . . . . . . . . . . . . 46

    3.1.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 46

    3.2 C-Code generation and simulation . . . . . . . . . . . . . . . . . 49

    3.2.1 C-Code generation . . . . . . . . . . . . . . . . . . . . . . 49

    3.2.2 Compilation . . . . . . . . . . . . . . . . . . . . . . . . . . 49

    3.2.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 51

    3.3 Platform decision . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

    3.3.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

    4 Algorithms 55

    4.1 FFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

    4.1.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . 56

    4.1.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 56

    4.2 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

    4.2.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . 58

    4.2.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 58

    4.3 Matrix multiplication . . . . . . . . . . . . . . . . . . . . . . . . . 59

    4.3.1 Matlab implementation . . . . . . . . . . . . . . . . . . . 59

    4.3.2 VHDL reference design . . . . . . . . . . . . . . . . . . . 60

    4.3.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 61

    4.4 FIR filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

    4.4.1 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 62

    4.5 CORDIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

    4.5.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . 63

    4.5.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 63

    4.6 CFAR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

    4.6.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . 64

    4.6.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 66

    4.7 MUSIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67


    4.7.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . 67

    4.7.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 67

    4.8 Cartesian to polar map transformation . . . . . . . . . . . . . . . 68

    4.8.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . 68

    4.8.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 69

    5 Results 71

    5.1 VHDL generation using AccelDSP . . . . . . . . . . . . . . . . . 71

    5.1.1 FFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

    5.1.2 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . 73

    5.1.3 Matrix multiplication . . . . . . . . . . . . . . . . . . . . 75

    5.1.4 FIR-filter . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

    5.1.5 CORDIC . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

    5.1.6 CFAR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

    5.1.7 MUSIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

    5.1.8 Cartesian to polar map transformation . . . . . . . . . . . 87

    5.2 C-Code generation using EMLC . . . . . . . . . . . . . . . . . . 89

    5.2.1 FFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

    5.2.2 MUSIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

    5.2.3 Cartesian to polar map transformation . . . . . . . . . . . 91

    6 Discussion 93

    6.1 Evaluation of AccelDSP . . . . . . . . . . . . . . . . . . . . . . . 93

    6.1.1 Usability . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

    6.1.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . 97

    6.1.3 Rapid prototyping . . . . . . . . . . . . . . . . . . . . . . 99

    6.1.4 Technology independence . . . . . . . . . . . . . . . . . . 100

    6.1.5 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

    6.2 Comparing a processor to a FPGA . . . . . . . . . . . . . . . . . 103

    6.2.1 Speed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

    6.2.2 Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . 103

    6.2.3 Resource usage . . . . . . . . . . . . . . . . . . . . . . . . 103

    6.2.4 Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

    6.2.5 Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

    6.2.6 I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

    6.3 Implementation decision based on profiling . . . . . . . . . . . . 106

    6.3.1 Implementation decision . . . . . . . . . . . . . . . . . . . 106

    6.3.2 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

    6.3.3 Usability . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

    6.4 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

    6.4.1 Evaluation of AccelDSP . . . . . . . . . . . . . . . . . . . 110

    6.4.2 Implementation decision . . . . . . . . . . . . . . . . . . . 110


    6.4.3 Evaluation of other tools . . . . . . . . . . . . . . . . . . . 110

    7 Conclusions 111

    7.1 VHDL generation using AccelDSP . . . . . . . . . . . . . . . . . 111

    7.1.1 Evaluation of algorithms for AccelDSP . . . . . . . . . . . 112

    7.1.2 Evaluation of AccelDSP . . . . . . . . . . . . . . . . . . . 112

    7.2 Implementation decision based on profiling . . . . . . . . . . . . 113

    Appendices 117

    A Programs used 119

    B Performance data from PSIM 121

    C Makefile for GCC and PSIM 125

    D Cartesian to polar map transformation 129

    D.1 CPU implementation . . . . . . . . . . . . . . . . . . . . . . . . . 129

    D.2 FPGA implementation . . . . . . . . . . . . . . . . . . . . . . . . 131

    E VHDL Constraints 135


    List of acronyms

    Arithmetical Logic Unit (ALU) See, p. 26.

    American National Standards Institute (ANSI) See, p. 16.

    American Standard Code for Information Interchange (ASCII) See, p. 38.

    Application Specific Integrated Circuit (ASIC) See, p. 19.

    Abstract Syntax Tree (AST) See, p. 39.

    Computer-Aided Design (CAD) See, p. 20.

    Constant False Alarm Rate (CFAR) See, p. 64.

    Configurable Logical Block (CLB) See, p. 20.

    Component Object Model (COM) See, p. 95.

    COordinate Rotation DIgital Computer (CORDIC) See, p. 44.

    Cycles Per Instruction (CPI) See, p. 53.

    Central Processing Unit (CPU) See, p. 11.

    Defence Advanced Research Projects Agency (DARPA) See, p. 16.

    Digital Clock Management (DCM) See, p. 21.

    Discrete Fourier Transform (DFT) See, p. 55.

    Digital Signal Processing (DSP) See, p. 12.

    Digital Signal Processor (DSP) See, p. 109.

    Xilinx Virtex-5 DSP slices (DSP48E) See, p. 21.

    Executable and Linkable Format (ELF) See, p. 50.

    Embedded Matlab Subset (EML) See, p. 41.

    Real Time Workshop Embedded Coder (EMLC) See, p. 41.


    Fast Fourier Transform (FFT) See, p. 55.

    Finite Impulse Response (FIR) See, p. 13.

    Field Programmable Gate Array (FPGA) See, p. 11.

    Front Side Bus (FSB) See, p. 104.

    GNU Compiler Collection (GCC) See, p. 50.

    GNU Debugger (GDB) See, p. 51.

    Graphical User Interface (GUI) See, p. 95.

    Institute of Electrical and Electronics Engineers, Inc (IEEE) See, p. 16.

    Infinite Impulse Response (IIR) See, p. 40.

    Input/Output (I/O) See, p. 20.

    Instructions Per Cycle (IPC) See, p. 25.

    Look-Up Table (LUT) See, p. 20.

    Matlab Code (M-Code) See, p. 41.

    Multiply Accumulate (MAC) See, p. 21.

    Mapping (MAP) See, p. 23.

    Mega Samples Per Second (MSPS) See, p. 72.

    Multiple Signal Classification Method (MUSIC) See, p. 67.

    Operating System (OS) See, p. 31.

    Place And Route (PAR) See, p. 24.

    Program Counter (PC) See, p. 25.

    Programmable Logic Device (PLD) See, p. 19.

    Phase-Locked Loop (PLL) See, p. 20.

    Plane Position Indicator (PPI) See, p. 13.

    Random Access Memory (RAM) See, p. 21.

    Reduced Instruction Set Computer (RISC) See, p. 27.

    Read Only Memory (ROM) See, p. 24.

    Register Transfer Level (RTL) See, p. 22.


    Synthetic Aperture Radar (SAR) See, p. 16.

    Saab Microwave Systems (SMW) See, p. 13.

    Static Random Access Memory (SRAM) See, p. 20.

    VERIfy LOGic (VERILOG) See, p. 14.

    VHSIC Hardware Description Language (VHDL) See, p. 11.

    Very High Speed Integrated Circuit (VHSIC).

    Wide Sense Stationary (WSS) See, p. 58.


    Terminology

    AccelDSP AccelDSP is a software tool from Xilinx for translating floating-point

    M-Code to fixed-point VHDL, p. 35.

    AccelWare AccelWare is a library of functions that can be used in AccelDSP

    to replace generic Matlab functions with functions optimized for

    generating VHDL, p. 39.

    bitstream Information used for programming an FPGA, generated after PAR in the VHDL workflow, p. 22.

    CORE Generator The Xilinx software for distributing IP-Blocks, p. 24.

    execution unit A unit in the data path on a CPU responsible for a certain

    type of operations, p. 25.

    IP-Block An intellectual property block is an algorithm implementation provided by an FPGA manufacturer under a licence term, p. 24.

    ISE Xilinx ISE Design suite, a suite containing all programs needed to

    create FPGA designs, p. 12.

    Matlab A high-level technical computation environment from MathWorks,

    p. 41.

    netlist A technology dependent description of a design generated by the

    synthesis step in the VHDL workflow, p. 23.

    PowerPC Processor with RISC architecture created by Apple, IBM and Mo-

    torola in 1991 based on the POWER architecture created by IBM,

    p. 27.

    profiler A software tool used for analysing the performance of a program by

    running it in a simulated environment, p. 31.

    PSIM A PowerPC simulator for performance evaluation and simulations

    integrated with GDB, p. 51.


    Simulink An environment, supported by a graphical user interface, for simu-

    lation and model-based design of embedded and dynamic systems, from MathWorks, p. 15.

    synthesize A design process stage where VHDL is synthesized into a technology-dependent netlist, p. 23.

    video data Radar information after the early signal processing stages before

    target identification, p. 64.

    xc5vsx95t-f1136-3 A Xilinx Virtex-5 SXT FPGA device optimized for DSP

    and memory intensive applications, p. 46.

    XST Xilinx Synthesis Tool, a free synthesis tool included in ISE, p. 46.


    Chapter 1

    Introduction

    Saab Microwave Systems (SMW) manufactures radar and other sensor systems

    for air, ground and sea, primarily for military but also for civil security domains.

    The goal is to provide their customers with information superiority, which gives them the ability to observe and react to situations more accurately and rapidly than their opponents.

    The signal processing platform is, together with the antenna and the components for generating, transmitting and receiving microwaves, the core functionality of a radar. The signal processing platform modulates and directs the generated radar pulses, commonly called beamforming, and processes the received signals. The processing extracts target identification, including velocity, course, bearing and distance, as well as classification of targets, for example hovering helicopters. SMW can provide complete radar systems including cabin, operator interface and support systems.

    Identified targets are transferred through an Ethernet interface to a system

    for data processing and presentation. The most common way to present targets

    is to use a Plane Position Indicator (PPI), optionally projected on a map.

    The platform for signal processing consists of several boards equipped with programmable logic, in the form of Field Programmable Gate Array (FPGA) devices and PowerPC processors. Different boards are responsible for different parts of the processing, which is performed in parallel on the incoming data. The data stream for the surface radars with 18 antenna lobes consists of 18 channels with complex data sampled at a rate of 3.125 MHz. This corresponds to a sequential rate of 112.5 million words per second.
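    The stated word rate follows from the channel count and sample rate if each complex sample is counted as two words, one for the real and one for the imaginary part; this factor of two is an interpretation of the figures above rather than something stated explicitly:

        \[ 18~\text{channels} \times 3.125~\text{Msamples/s} \times 2~\tfrac{\text{words}}{\text{sample}} = 112.5~\text{Mwords/s} \]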

    When new or improved functionality is added, algorithms are realised in the same way as related algorithms in former projects, without knowing whether it could be done more efficiently. For example, if a FIR filter with 16 taps was described in VHDL for implementation in an FPGA during one project, and a filter with 32 taps is needed in a following project, the realisation will be done using


    the same method. The reason is that it is very hard to determine which realisation is the best.

    It might, for instance, be more suitable to write the FIR filter in the previous example in C-Code for implementation on a processor, or to generate VHDL from a Matlab model for implementation on an FPGA.
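    To make the example concrete, a direct-form FIR filter is only a few lines in C. This is a minimal sketch for illustration only; the function name, data type and buffer convention are assumptions and do not reflect SMW's actual implementations.

        #include <stddef.h>

        #define NUM_TAPS 16   /* illustrative tap count, matching the example above */

        /* Direct-form FIR filter: y[n] = sum over k of h[k] * x[n + NUM_TAPS - 1 - k].
         * The input buffer x must hold len + NUM_TAPS - 1 samples; its first
         * NUM_TAPS - 1 samples act as filter history. */
        static void fir_filter(const float h[NUM_TAPS],
                               const float *x, float *y, size_t len)
        {
            for (size_t n = 0; n < len; n++) {
                float acc = 0.0f;
                for (size_t k = 0; k < NUM_TAPS; k++) {
                    /* each term is one multiply-accumulate (MAC) */
                    acc += h[k] * x[n + (NUM_TAPS - 1) - k];
                }
                y[n] = acc;
            }
        }

    On a processor the inner loop maps onto MAC instructions, while an FPGA realisation can unroll the taps into parallel multipliers; the platform decision method developed in this thesis is intended to quantify exactly this kind of trade-off.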

    To develop a method for making platform decisions, and to reduce the implementation time for FPGA targets, SMW became interested in the software AccelDSP. This interest in AccelDSP, which can generate VHDL from Matlab, is the origin of this thesis.

    1.1 Hardware implementation

    An algorithm can be implemented in many ways but the thesis is restricted

    to implementations on FPGA and PowerPC. The PowerPC is an example of

    a processor platform and the implementation procedure is similar if another

    processor is of interest.

    The natural way to make the implementations is to write code for the proces-

    sor in a native programming language like C/C++ or Ada. The FPGA imple-

    mentation is made in a hardware description language like VHDL or VERILOG.

    The drawback of using native languages is the very long implementation time. If the implementation can be generated from a description in a high-level language, where the algorithm can be implemented quickly, a lot of time can be saved.

    Figure 1.1 shows how an algorithm described in Matlab can be implemented

    using translations between different languages.

    Figure 1.1: A Matlab model can be translated along these paths for implementation on a processor or an FPGA (Matlab to C/C++ for a processor, or Matlab to VHDL/Verilog for an FPGA).


    1.2 Problem definition

    The aim of the thesis is to develop a procedure for making better decisions about realisations of algorithms for signal processing and radar control than is possible at SMW today. The procedure should assist in deciding the implementation platform for a certain algorithm and whether the implementation should be generated or hand-coded.

    A Matlab design of the algorithm is the origin for evaluation of implemen-

    tations on PowerPC and FPGA.

    For generating VHDL from Matlab, AccelDSP has been evaluated, and for generating C-Code, Real Time Workshop Embedded Coder (EMLC) has been used. The evaluation of AccelDSP is a requirement from SMW, and EMLC is used because it has been evaluated by Marcus Mulleger and an evaluation license was provided by MathWorks. The work by Mulleger, see section 1.3.7, is used as a reference for the performance of C-Code generation and for the evaluation of the functionality of different translators. Only AccelDSP has been evaluated in this thesis and no further evaluation of EMLC has been performed.

    1.2.1 Problem statement

    1. Is AccelDSP useful for SMW and which purposes can it fulfil?

    2. How can an algorithm be evaluated to see if it is suitable for VHDL generation from Matlab using AccelDSP, or if it is better to describe the algorithm directly in VHDL?

    3. How can profiling on both platforms, with code generated from Matlab, be used to determine whether an algorithm should be implemented on a PowerPC or an FPGA?

    1.3 Related work

    This section describes software tools and work related to this thesis.

    1.3.1 Xilinx System Generator

    This software is closely related to AccelDSP since it is used for generating FPGA

    implementations from Matlab. The difference is that it generates technology-dependent netlists from Simulink and not VHDL from Matlab Code (M-Code).

    This software is therefore unsuitable for SMW. The product contains a block

    library for Simulink for modelling DSP algorithms suitable for FPGA imple-

    mentation and a tool for generating the implementation.


    1.3.2 Altera DSP Builder

    This is the equivalent of Xilinx System Generator but from the FPGA manufacturer Altera. In this program it is possible to generate VHDL files, which is not possible in System Generator. The program is designed for Altera devices and is therefore not useful for SMW.

    1.3.3 MathWorks Simulink HDL Coder

    Simulink HDL Coder is MathWorks' equivalent of Xilinx System Generator and Altera DSP Builder. It generates VHDL and VERILOG files, which are stated to be target independent and compliant with the IEEE 1076 standard, from MathWorks Simulink. Matlab code from the Embedded Matlab Subset (EML) can also be used for generation through a special block in Simulink. An evaluation of this program is considered future work.

    1.3.4 MathWorks Filter Design HDL Coder

    This is a specialised tool for generating VHDL and VERILOG code for fixed-

    point filters designed with Matlab Filter Design Toolbox. The code is stated to

    be compliant with the IEEE 1076 standard.

    1.3.5 The HPEC Challenge Benchmark Suite

    The Lincoln Laboratory at the Massachusetts Institute of Technology has developed an advanced benchmarking suite for radar systems in ANSI C, sponsored by DARPA. The suite contains both kernel benchmarks, including signal, image, information and knowledge processing algorithms, and a SAR system benchmark. The latter is available as M-Code.

    1.3.6 Implementation using Xilinx System Generator

    Xilinx System Generator has been evaluated in a master's thesis at SMW, at the time called Ericsson Microwave Systems, in cooperation with Linköping University of Technology in 2004 [5].

    The evaluation is made using two complex algorithms, not a kernel set of algorithms as in this thesis. The two algorithms are a baseband modulator and an algorithm for automatic gain control. The resource utilisation is compared to a hand-coded implementation for the modulator but not for the gain control. The timing performance is not compared with a reference design but is verified to fulfil the requirements set by SMW. This is possible since SMW uses implementations of both algorithms and therefore has design specifications to compare with.


    The result of the thesis is that Xilinx System Generator can be used to

    effectively implement a model of a digital function in an FPGA from Xilinx.

    1.3.7 Matlab to C-Code Translation

    A functional and performance evaluation of C-Code generation from Matlab has

    been made in a master's thesis by Marcus Mulleger for Ericsson AB in cooperation

    with Halmstad University [17].

    In detail, two software tools for translating M-Code to C-Code have been evaluated: EMLC, used in this thesis for C-Code generation, and Matlab to C Synthesis (MCS). The thesis compares the two translators, and also compares their performance with hand-coded implementations. The following three aspects are studied:

    1. Generation of reference code

    2. Target code generation

    3. Floating to fixed-point conversion

    The benchmarking technique is a combination of using both simple and complex algorithms, and different algorithms are used for evaluating different aspects. The result is that MCS provides better support for C algorithm reference generation, by covering a larger set of the Matlab language, while EMLC is more suitable for direct target implementation. EMLC only allocates memory statically, which makes the supported subset of the M-Code language smaller.


    Chapter 2

    Background

    This chapter contains background information for a better understanding of the report. Only the sections covering topics the reader does not already know need to be read, since the rest of the report refers back to this chapter wherever background information is needed. The chapter contains information about FPGA and processor architecture, an introduction to benchmarking and descriptions of the two software packages AccelDSP and Matlab.

    2.1 FPGA Architecture

    This section contains a brief introduction to FPGA architecture needed to un-

    derstand the performance metrics discussed in section 2.4.3 and the major dif-

    ferences between processor and FPGA architecture.

    2.1.1 Applications

    An FPGA has many advantages, compared to other platforms, since it allows for a fast and programmable implementation. The main drawbacks are the high price and the complicated implementation procedure compared to a CPU implementation. FPGAs are used for small batches, for rapid prototyping when the design of an Application Specific Integrated Circuit (ASIC) would be too time-consuming, and when a reprogrammable hardware implementation is desired. Small batches will normally lead to a higher price per chip for ASICs than for FPGAs.

    2.1.2 Architecture description

    A Field Programmable Gate Array (FPGA) contains programmable logic and

    is thus a Programmable Logic Device (PLD). The logic is contained in blocks

    which can be programmed to perform basic logical operations as well as com-

    plex mathematical functions and the blocks can be connected by programmable


    interconnections. FPGAs have memory elements, such as flip-flops, in their blocks for performing synchronous operations.

    The logical blocks are arranged in a two-dimensional array, with interconnections for connecting logical blocks to each other and to I/O pads. There are different ways of creating a programmable connection matrix. The most common are SRAM, which is reprogrammable, and antifuse, which is not.

    As explained above, the FPGA contains a predefined structure of logic, which limits the possible designs and the performance, such as timing and power usage, compared to ASICs. Most FPGA architectures can be reprogrammed in circuit; this ability is where the term Field Programmable comes from. Using an FPGA instead of an ASIC makes it possible to update the hardware in end-user products for patches or upgrades.

    Since an FPGA contains configurable logic, designed using a CAD program, it is possible to design parallel structures optimised for a given task. The only limitation on the work that can be executed in parallel is the number of I/O pads and logic blocks. This should be compared to a processor with a fixed hardware layout, capable of sequentially executing instructions. Some non-deterministic parallelism is introduced in modern processor architectures using advanced out-of-order execution, but this is far more limited than the parallelism that can be implemented in hardware.

    For running different parts of the device at different clock frequencies, most FPGAs contain several Phase-Locked Loops (PLL). A PLL is a control system used to generate a signal with a higher or lower frequency than a reference signal while keeping a fixed phase relation to it. It uses a negative feedback loop to control an oscillator so that it maintains a constant phase angle relative to the reference signal.

    2.1.2.1 XILINX Virtex5

    Xilinx uses a Look-Up Table (LUT) based design for the logical blocks. A LUT is a small one-bit-wide memory which can be used as a logical function, where the address lines are the inputs and the one-bit memory output is the function value. The logic is realised by properly programming the 2^k bits of the memory, where k is the number of inputs.
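    As a software illustration of this, assuming a 4-input LUT and packing its 2^4 = 16 configuration bits into one word (the configuration value and packing order below are illustrative, not a real device format):

        #include <stdint.h>

        /* Software model of a 4-input LUT: 2^4 = 16 configuration bits packed
         * into a 16-bit word; bit i holds the output for input pattern i. */
        typedef struct {
            uint16_t config;
        } lut4_t;

        /* Evaluate the LUT: the four inputs form the address 0..15. */
        static int lut4_eval(const lut4_t *lut, int a, int b, int c, int d)
        {
            unsigned addr = (unsigned)(a & 1)
                          | ((unsigned)(b & 1) << 1)
                          | ((unsigned)(c & 1) << 2)
                          | ((unsigned)(d & 1) << 3);
            return (lut->config >> addr) & 1;
        }

        /* Example: programming 0x8888 sets exactly the bits where both a and b
         * are 1, so the LUT computes a AND b (c and d are ignored). */
        static const lut4_t lut_and_ab = { 0x8888 };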

    Each logical block, by Xilinx called a Configurable Logical Block (CLB), contains four slices, and each slice contains LUTs, flip-flops and logic for, for example, carry-chain operations. The interconnections between CLBs consist of segments of different lengths, spanning from one CLB pair to the entire length of the chip. For routing a signal between two CLBs the signal must pass through at least one switch. The switch is usually an SRAM cell. The performance depends on

    how the CAD tools manage to route the signals.


    The number of slices, their components, as well as the design of the supporting logic and the interconnection matrix, differ between devices.

    For increased performance, specialised blocks are added to the FPGA architecture by many manufacturers. These can be specialised versions of slices or blocks supporting advanced operations. The Virtex5 series has two types of slices, SLICEL and SLICEM, the latter slightly more complex, as well as on-chip 36 Kbit BlockRAM and specialised DSP slices. A block diagram of the slices can be found in [27, pp. 173-174].

    All slices have a 6-input LUT as logic function generator, a carry chain for implementation of fast adder/subtractor structures, a multiplexer and a flip-flop. The SLICEM has, instead of ordinary LUTs, primitives which can be implemented as distributed RAM, shift registers or a LUT.

    Since FPGAs are often used for signal processing, the Virtex5 series has

    built-in Xilinx Virtex-5 DSP slices with support for commonly used DSP oper-

    ations. These functions include multiply, Multiply Accumulate (MAC), multiply

    add, three-input add, barrel shift, wide-bus multiplexing, magnitude compara-

    tor, etc. The architecture also supports cascading multiple slices to form wide

    math functions, DSP filters, and complex arithmetic without the use of general

    FPGA fabric [28].

    For generating internal clock signals, the Virtex5 series contains Digital Clock Management (DCM) blocks as well as PLLs. The DCM provides advanced clocking capabilities, including eliminating clock skew on clock nets, phase shifting signals and generating new clock signals by a mixture of clock multiplication and division.

    Clock skew is the difference in timing delay between the original clock signal and the signal that arrives at different components in the device. The difference arises from differences in wire length, temperature differences, capacitive coupling, etc.


    2.2 FPGA development using VHDL

    There are several steps involved in the typical design flow when implementing on an FPGA using the VHSIC Hardware Description Language (VHDL). The first is to describe the implementation using one of the three abstraction levels available in VHDL.

    Behavioural level is the highest level of abstraction. This level is used to

    model algorithms and system behaviour without specifying clock informa-

    tion. Some synthesis tools are available that can take behavioural VHDL

    code as input, but the lack of architecture details means that the imple-

    mentation can be inefficient in terms of area and performance.

    RTL stands for Register Transfer Level. An RTL description of an algorithm

    has an explicit clock, which means that all operations are scheduled to

    occur in specific clock cycles. The behaviour of the circuit is defined in

    terms of all registers and the flow of signals between them. This is the

    most common level used for synthesis.

    Gate level is the lowest level of abstraction in VHDL. A gate level description consists of a network of gates and registers instantiated from a technology-specific library. This library provides information about the components' timing and gate delays.

    VHDL supports syntax for dividing the code into components and packages.

    This way generic components can be created, which are easy to reuse. The

    components can be connected together to form application specific algorithms.

    VHDL was developed at the behest of the US Department of Defense and was originally a modelling language for documentation and simulation of circuits. The language is standardised by the IEEE for simulation, but only a subset of the language is supported for synthesis and there is no standard for which subset should be supported. Fortunately, the same subset is supported by the most common programs used for synthesis [20].

    2.2.1 Implementation

    When the algorithm has been described in VHDL the code must be translated

    to a bitstream, which can be used for programming the FPGA. The bitstream is the information stored in the SRAM used for the interconnections, the bits stored in the LUTs, and configuration information, for example for the primitives. This translation is the equivalent of compiling in the programming world and consists of several steps. After most steps, a report containing information about the design, including performance data, is generated.


    2.2.1.1 Synthesis

    The first step is to synthesize the VHDL code into a netlist. In this process

    the tool tries to recognise components from the code and the generated netlist

    contains components and how they are connected. Examples of components are

    memory elements, registers and adders. It is possible to influence the process by

    adding constraints controlling timing, area limitations and I/O pin assignment.

    The constraints are used in the optimisation of the netlist and are also saved to

    the next step of the translation. Many components, like adders and multipliers, can be implemented in many ways. For example, a fast implementation of an adder is the Sklansky tree adder, which requires a large area, while the ripple-carry adder is small but slow. The constraints will guide the program in choosing an implementation, i.e. the implementation using the smallest area while still fulfilling the timing constraints will be used.

    Information gathered during synthesis is summarised in a report. These reports are generated using Xilinx's FPGA design suite ISE. The reports contain information about macro statistics, resources and timing. The macro¹ statistics are information about which components were found by the macros analysing the VHDL code. The resource information is an estimate of which on-chip resources are required to implement the components on a certain chip. The timing information is a rough estimate of how fast the implementation will be. No routing delay is included in this estimate, since the layout performed by the Place And Route (PAR) process has not yet been made. It is also possible that optimisations done during mapping will improve the timing.

    2.2.1.2 Mapping

    The next translation step is Mapping (MAP), where the component implementations are mapped to primitive components in the target technology. This is usually the ordinary slices in the CLBs, but if memory elements or advanced mathematics are requested, this is mapped to SLICEM and DSP48E slices by the mapping process. The mapping needs information about the target architecture to know which resources are available. This information is provided by a technology library describing the target device.

    The report created after mapping contains essentially the same information

    as the synthesis report. Most interesting is the more accurate information about

    resource requirements.

    ¹ The synthesis tools contain macro scripts that recognise certain behaviour of the VHDL code. If the behaviour matches an internal component, like a BlockRAM, the script will use the RAM instead of implementing the behaviour with LUTs.


    2.2.1.3 Place and Route

    Finally the mapped components are assigned to physical resources in

    the FPGA and the resources are routed together. This step is called

    Place And Route (PAR). The process includes optimisation of the location of

    every resource and how they should be connected to meet the user-defined tim-

    ing constraints. The PAR generates information to configure every resource on

    the FPGA. For example certain SLICEM are configured as ROM if memory of

    that type is defined in the design phase.

    The report generated after PAR contains the most accurate timing information, which includes routing delay. Since all translation steps take a long time for large designs, it can be better to use the estimates in the earlier reports when evaluating designs. The information in the PAR report is useful as a verification in design steps before hardware verification.

    2.2.2 Simulation

    The possibility to simulate a design before it is implemented in hardware is

    an important part of VHDL. Two types of simulation can be made during the

    synthesis flow. Behavioural simulation can be made directly from the VHDL

    models without considering the hardware implementation. A gate level simula-

    tion can be made after the PAR process when accurate timing information is

    available. The gate level simulation is much slower but is able to find problem

    with timing and optimisations made by the CAD tool.

    By simulating the code before implementation it is easier to verify correct

    behaviour since internal error handling can be added to each component and

    internal signals not available to the outside world can be measured. The internal

    error handling can for example be assertions that certain signals are not high

    at the same time.

    2.2.3 Intellectual Properties

    To enable faster development and code reuse, most FPGA manufacturers provide implementations of various algorithms called IP-Blocks. These come with different licensing agreements and costs. To protect the source, device-dependent netlists are often provided together with software containing a library of the cores and functionality to adapt the cores to a specific usage. ISE contains a program called CORE Generator, which contains free simple IP-Blocks, including DSP functions, memories, storage elements and math functions, as well as evaluation versions of complex blocks [21].


    2.3 Processor Architecture

    This section contains a brief introduction to processor architecture needed to

    understand the performance metrics discussed in section 2.4.2 and the major dif-

    ferences between processor and FPGA architecture. The architecture described

    is a simple form used to introduce important concepts. Modern processors are

    far more advanced.

    2.3.1 Fundamental design

    A processor is a component capable of changing its functionality when given

    different instructions. The basic processor can be divided into two parts: a datapath and a control unit. The datapath includes logic, called execution units, for manipulating data. Examples of execution units are integer and floating-point units, which are used for performing mathematical operations on integer or floating-point data.

    The control unit configures the data path for performing the operation rep-

    resented by an instruction. For example it activates the correct execution unit

    used by an instruction.

    A processor is a sequential machine and the operation of the simple archi-

    tecture described here can be summarised by fetch-decode-execute-store. This is

    called one instruction cycle.

    Fetch An instruction is fetched from the memory position indicated by the Program Counter (PC).

    Decode The instruction is decoded, which means that the datapath is configured by the control unit for executing the instruction.

    Execute The datapath then executes the instruction, for example calculates the sum of two integers.

    Store The result of the execution is stored in memory and the PC is updated to point to the next instruction.
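    The cycle can be made concrete with a toy software model. The two-instruction encoding below is invented purely for illustration and does not correspond to any real processor discussed in this report:

        #include <stdint.h>

        /* Toy machine: 4 registers, a small memory and two instructions.
         * Instruction byte: bits [7:4] = opcode, [3:2] = dst, [1:0] = src. */
        enum { OP_HALT = 0, OP_ADD = 1 };

        typedef struct {
            uint8_t  mem[256];   /* instruction memory */
            uint32_t reg[4];     /* register file */
            uint8_t  pc;         /* program counter */
        } cpu_t;

        static void run(cpu_t *c)
        {
            for (;;) {
                /* Fetch: read the instruction the PC points at. */
                uint8_t instr = c->mem[c->pc];

                /* Decode: split into opcode and operand fields. */
                uint8_t op  = instr >> 4;
                uint8_t dst = (instr >> 2) & 3;
                uint8_t src = instr & 3;

                if (op == OP_HALT)
                    return;

                /* Execute: perform the operation in the datapath (OP_ADD). */
                uint32_t result = c->reg[dst] + c->reg[src];

                /* Store: write back the result and advance the PC. */
                c->reg[dst] = result;
                c->pc++;
            }
        }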

    2.3.2 Pipeline

    This architecture is a valid design but the clock frequency is limited by the delay

    of the combinational logic in the datapath. A better way is to introduce registers

    between the steps in the instruction cycle, which store the intermediate values.

    This will allow an increased clock frequency and will increase the performance if

    several instructions can be executed concurrently. Executing several instructions

    concurrently by storing the intermediate values in registers is called pipelining.

    The architecture described above has four steps in its pipeline. Increasing

    the number of steps will generally reduce the Instructions Per Cycle (IPC) but


    allows a higher clock frequency. One reason that the IPC is decreased is data dependency. A simplified explanation is that one instruction in the pipeline needs data from another instruction in the pipeline which has not yet written its result to the memory. A resolution is to insert stalls, which are instructions without operation, between the two dependent instructions. During the stall the result is written to the memory to be available for the next instruction.

    Another problem with pipelining is conditional branches². This type of instruction is generated by conditional statements like case and if, and by loops like while and for. If the address of the next instruction to be inserted in the pipeline is calculated by a conditional branch, this information is not available until the branch is resolved. This will lead to a stall of four cycles. The resolution is advanced branch prediction techniques, where the processor predicts the next instruction to be executed and throws away the result if the prediction fails.

    Modern general-purpose processors generally have much longer pipelines than four stages. For example, the Pentium 4 from Intel with the Prescott architecture has a 31-stage pipeline. This architecture allows clock frequencies up to almost 4 GHz but suffers heavily from problems with branch prediction. Such processors have several datapaths, which allow instructions to execute in parallel.

    Application specific processors have datapaths for application specific oper-

    ations like mathematical functions and specialised datapaths for feedback of re-

    sults to prevent stalling. Examples are multiple Arithmetical Logic Units (ALU) and specialised ALUs in processors for signal processing.

    Specialised datapaths and hardware require special instructions. Using

    these instructions is crucial for performance. An experienced programmer or

    an advanced compiler schedules instructions to take advantage of parallel dat-

    apaths and rearranges instructions to prevent data dependencies.

    2.3.3 Memory

    The processor architecture described above as well as the PowerPC operates

    on data located in registers inside the core. This memory is extremely small,

    expensive and cannot possibly contain all data and instructions needed by even

    extremely small programs. A larger memory is therefore made available to

    the processor from which it can fetch instructions and data for storage in the

    registers. Such memory is considerably slower than the processor.

    The solution is to introduce cache memories on the same chip as the processor

    but outside the core and create dedicated hardware to make sure that all data

    and instructions needed by the processor are in the cache. Otherwise the processor

    is stalled while the data is transferred from the memory to the cache. One cache

    can be used for both instructions and data but they are often separated to gain

    performance.

    ² When the value of the Program Counter (PC) is changed depending on a condition.


    The details of the complex algorithms that handle the cache are outside the scope of this introduction. Still, it is important to understand that the memory bottleneck and cache misses are performance critical.
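    A simple way to see this effect in practice is the traversal order of a two-dimensional array; the sketch below (array sizes chosen arbitrarily) sums the same data in cache-friendly row-major order and in cache-hostile column-major order, where the latter typically causes far more cache misses since C stores arrays row by row:

        #include <stddef.h>

        #define ROWS 1024
        #define COLS 1024

        /* Row-major traversal: consecutive accesses touch consecutive addresses,
         * so each fetched cache line is fully used before it is evicted. */
        static double sum_row_major(const double a[ROWS][COLS])
        {
            double s = 0.0;
            for (size_t i = 0; i < ROWS; i++)
                for (size_t j = 0; j < COLS; j++)
                    s += a[i][j];
            return s;
        }

        /* Column-major traversal: consecutive accesses are COLS elements apart,
         * so nearly every access may miss in the cache. */
        static double sum_col_major(const double a[ROWS][COLS])
        {
            double s = 0.0;
            for (size_t j = 0; j < COLS; j++)
                for (size_t i = 0; i < ROWS; i++)
                    s += a[i][j];
            return s;
        }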

    2.3.4 PowerPC

    PowerPC was designed by Apple, IBM and Motorola in 1991 based

    on the POWER architecture created by IBM. The architecture is a Reduced Instruction Set Computer (RISC) architecture designed for personal computers, but since Apple's transition to Intel-based workstations it is mostly used in embedded environments. The reason PowerPC is used as a target for the performance evaluation in this thesis is that it is a common embedded processor and that it is used by SMW.

    2.3.4.1 PowerPC 604

    This processor is the reference used for evaluation of processor performance

    using PSIM in this thesis. It is a general purpose processor and not specific to

    a certain application but shows good DSP performance because of the support

    for MAC instructions.

    The architecture is a superscalar RISC architecture that can dispatch four instructions and execute up to seven instructions in parallel. This is possible since the processor contains seven execution units, including three integer units and one floating-point unit. The address width is 32 bits, as is the maximum integer resolution, but up to 64-bit floating-point numbers are supported.

    A branch processing unit and a completion unit control the out-of-order execution of instructions on the processor, trying to maximise the performance.

    The processor includes separate on-chip instruction and data caches of 32

    KByte. The maximum clock speed of the CPU is 180 MHz. More information

    is found in [3, PowerPC 604 Processor System] and [10].


    2.4 Benchmarking

    In this report, benchmarking is taken to mean the act of running a synthetic workload on an object and measuring the performance in order to be able to compare the object with others. Analysing the performance of a fixed workload is called profiling. The object in this report is an FPGA device or a processor. Benchmarking has been used for evaluating the performance of VHDL generated with AccelDSP on an FPGA, and profiling on a PowerPC and an FPGA has been used for comparing realisations on the different platforms.

    With benchmarking as a major focus of this report, it is important to present

    a description of the fundamentals and problems concerning benchmarking and

    profiling.

    2.4.1 Benchmarking fundamentals

    Three considerations arise when benchmarking an object. The first consider-

    ation is what kind of performance to measure and in what unit. The next is

    how the performance should be measured, especially without influencing the

    object. And the last one is that the workload should be designed to give a fair

    comparison between different objects. When benchmarking a compiler, in this

    report AccelDSP, it is important that the workload contains enough diversity

    to be able to draw a general conclusion of the performance of the compiler.

    Traditionally, different approaches are used for performance measurement:

    Simple metrics are parameters directly derived from the manufacturing of the object. These are usually too simple to describe the actual performance.

    Application benchmarking is when a complete application is used as work-

    load and is a very good way to cover a lot of the possible operations of the

    object. The disadvantage is, as already mentioned, that the usage gets very specific, which gives a benchmark that is only interesting for a few applications similar to the workload.

    Algorithm kernel benchmarking is an interesting method if a kernel set

    of algorithms can be extracted from the interesting applications. The performance of these benchmarks is then related to the performance

    of the application of interest. This method is used for benchmarking of

    AccelDSP in this thesis.

    Micro benchmarking is when a single metric is measured to identify peak

    capability and potential bottlenecks of a device. This can be overly optimistic in terms of real application performance.

    Functionality benchmarking can be seen as a kind of micro benchmarking

    and is when different types of functionality of an object are measured with


    different benchmarks. What kind of functionality the application of interest is using determines which measurements to rely on. Examples of functionality could be I/O performance, memory performance or computing power.

    Benchmarking is becoming more and more advanced as the complexity of new

    hardware architectures increases. Profiling, i.e. determining the performance of a certain application, run in a specific way on a certain platform, is rather easy, but gaining generally usable information is not trivial. The more complex systems become, the harder it is to isolate the performance of the object of interest. Advanced hardware often requires the compiler to generate code that is able to utilise the advanced features. For a C-Code compiler this means generating specialised machine code instructions, and for a translator from M-Code to VHDL it means generating a description that can utilise optimised slices.

    2.4.2 Processor benchmarking

    Performance modelling and measurement for software and processors are thor-

    oughly discussed in [11]. This section contains a brief summary of the theory

    regarding performance evaluation and metrics, as a background for finding tech-

    niques suitable for comparing the performance of processor and FPGA archi-

    tectures.

    2.4.2.1 Traditional processor metrics

    There are many simple performance metrics that can be used to describe processor performance, but they are almost solely interesting within an architecture, not when comparing different architectures to each other. The commonly used

    metrics are:

    Clock frequency A common metric in consumer electronics which is totally irrelevant unless the same model, but with different clock frequencies, is compared.

    MIPS Describes how many instructions can be executed per second. This

    may be interesting to compare on architectures with the same instruction

    set. It is important to remember that not all instructions can be executed

    concurrently if several datapaths are available and that not all instructions

    take the same amount of cycles to execute. For example floating-point

    operations are usually slower than fixed-point operations.

    MOPS Describes how many operations can be executed per second. This suffers from problems similar to those of MIPS: what counts as an operation, and how many operations are needed to perform useful work?


    FLOPS Describes how many floating-point operations can be executed per second. This is a more interesting metric than those previously mentioned, since a floating-point operation takes a defined number of cycles on a specific architecture. FLOPS is a common metric in scientific calcu-

    lations. This metric, like the previous, does not tell anything about the

    overall performance since it is theoretical and does not consider memory

    performance.

    IPC/CPI Describes how many instructions are executed per cycle. This is a very good metric which can be used to compare different processors with each other. The metric is not constant for a certain processor, for reasons previously discussed. An average can, however, be calculated for a certain piece of software; the standard relations are given after this list.
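    The standard relations between these quantities (textbook definitions, not formulas taken from this report) are, with instruction count IC and clock frequency f:

        \[
          \mathrm{CPI} = \frac{\text{total clock cycles}}{\text{instructions executed}}, \qquad
          \mathrm{IPC} = \frac{1}{\mathrm{CPI}}, \qquad
          T_{\mathrm{exec}} = \frac{\mathrm{IC} \times \mathrm{CPI}}{f}
        \]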

    Since the simple metrics only give limited information about the performance of a certain processor, some kind of performance evaluation must be carried out. How this is performed depends on the software of interest and on how accurate the results need to be.

    2.4.2.2 Performance evaluation

Performance evaluation is of interest for those designing both hardware and software and can be divided into performance modelling and performance measurement. Performance modelling is an estimation of the performance of a system without running the code on the actual hardware. The method is more common in early stages of the design process and can further be divided into simulation-based modelling and analytical modelling. In performance measurement the code is run on actual hardware and some kind of instrumentation technique is used for gathering performance data.

    2.4.2.3 Performance measuring

    Observing how code is executed during runtime is very important to understand

    the bottlenecks of a system. This can be done by:

On-chip hardware monitoring Most processors have built-in performance-monitoring counters that observe interesting metrics for the actual architecture. Depending on the architecture and configuration the counters can monitor cycle count, instruction counts (fetched/completed), cache misses, branch mispredictions, etc... These counters can be read by software using special instructions to present the information to a developer.

Off-chip hardware monitoring is when dedicated hardware is attached to the processor for the purpose of monitoring the performance. The off-chip


hardware can for example interrupt the processor after every instruction completion and save all information of interest available in the processor.

Software monitoring A method similar to off-chip monitoring can be performed in software by interrupting the processor using trap instructions. This is very invasive but easy to implement. A major drawback is that OS activity is hard to monitor unless trap instructions can be added to the OS source code.

Microcoded instrumentation This is a method requiring hardware support for recording instruction execution by modifying the microcode. The method does not gather performance information directly, but an instruction trace that can be used in a trace simulation.

    2.4.2.4 Analytical modelling

Analytical modelling is not very popular for microprocessors, but rather for whole systems. The method is based on models relying on probabilistic methods, queuing theory, Markov models or Petri nets. More information can be found in [11, 2.1.3 Analytical modelling].

    2.4.2.5 Simulation based modelling

By creating a model of the system being simulated, i.e. the target machine, and running it on a host machine, the created software can be run without access to the specific hardware it was designed for. The simulator can be either a functional simulator or a timing simulator, and be either trace-driven or execution-driven.

A functional simulator is able to run the program in a simulated environment where performance data can be revealed. The functional simulator can be cycle accurate, which means that each clock cycle of the processor is simulated. This gives the most accurate results, with the drawback of long simulation time. When only performance data is of interest a fully functional simulator is not necessary; much simulation can be made with only limited functionality if the performance, and not the result, is relevant. Performance analysis is commonly called profiling and is done using a profiler tool.

A trace-driven simulator can be seen as a simplified form of an execution-driven simulator, where the simulator is not able to execute the program but instead analyses a trace of information representing the instruction sequence that would have executed on the target machine. The input to a trace-driven simulator can either be fed continuously or fetched from a memory. These simulators suffer from two main problems: the large size of the trace, since it is proportional to the dynamic instruction count, and that the trace is not very representative for out-of-order processors.


Execution-driven simulators solve the two major problems of the trace-driven ones, since the static instructions are used as input and out-of-order processing can be simulated as well. The simulator can either interpret all instructions or only those of interest, letting the host processor execute the rest natively. The latter reduces the intrusiveness and therefore increases the speed of the simulation. Execution-driven simulation is highly accurate but is very time consuming and requires long periods of time for developing the simulator [11, 2.1.1.2 Execution-driven simulation].

    2.4.2.6 Energy and power simulators

The power consumption of a processor consists of two parts. One is static, which is independent of the processor activity but depends on the processor mode; modern processors can typically go into an idle mode where power consuming parts are turned off. The other part is dynamic and depends on the work executed by the processor.
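As a rough illustration, the two contributions are often written using the first-order CMOS approximation (a textbook approximation, not a result from [11]):

    P_{total} = P_{static} + P_{dynamic}, \qquad P_{dynamic} \approx \alpha \, C_{switched} \, V_{dd}^{2} \, f_{clock}

where alpha is the activity (toggle) factor, C_switched the switched capacitance, V_dd the supply voltage and f_clock the clock frequency.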

Power simulation is done by modelling the power consumption of the individual components and letting the simulator calculate the consumption based on the activity statistics of each component. The accuracy of the models and the granularity of the simulation determine how exact the estimation will be.

    2.4.3 FPGA benchmarking

To accurately benchmark the performance of an FPGA is a costly and time-consuming task. The reason is that the complexity of today's FPGA architectures and CAD tools makes it difficult to obtain proper benchmarking results. For example, FPGAs vary in terms of size, maximum clock frequency, number of I/O pins, chip specific implementations, built-in PowerPC cores, DSP specialised slices, LUT sizes, BlockRAM, etc... Even if VHDL and VERILOG are platform independent languages, the compilers and CAD tools vary between platforms and some architecture specific modules may have different interfaces.

When implementing on FPGAs the CAD tools have a big impact on the performance of the final design. Timing constraints and the configuration of optimisation trade-offs in the tools strongly influence the performance of the generated design. Correctly adding constraints and configuring trade-offs to maximise the performance therefore sets high requirements on a benchmarking methodology.

    2.4.3.1 Traditional FPGA measurements

It is important to set constraints and configure trade-offs depending on what to measure. In FPGA design there are three important benchmarking and design


    directions:

Timing When designing and benchmarking towards timing, important metrics are: maximum clock frequency, input pin to setup delay, and clock to output pin delay. To get comparable results for these measures there are many things to consider, see section 2.4.3.2 for more details.

Area When designing with area limitations the goal is to keep the design as small as possible, most probably to reduce the cost. The resource usage is given by the CAD tools after a design has been successfully mapped to a device. Important metrics are LUT usage, register usage, I/O pins used, BlockRAM count, DSP48E usage and the number of PLL/DCM.

Power Power consumption is split into two important factors, static and dynamic power usage. Static power usage comes from the device's physical parameters like size, package and supply voltage. Other factors that contribute to static power usage are design parameters like placement, routing and operating conditions. Dynamic power usage comes from the signal toggle rate. Measuring power consumption is done using probing or with the integrated power analysis tools often supplied with the CAD tools. The analysis tool for ISE is called XPower.

    2.4.3.2 Performance benchmarking

Achieving maximum performance and meaningful benchmarks on a device requires a structured benchmark methodology. Without strict guidelines the performance results can be misleading or, in the worst case, wrong. In this section an overview of these requirements is presented. Manufacturer specific details for Xilinx and Altera are found in [1, 22].

To set up an FPGA performance benchmark there are a number of important points to consider:

Apply timing constraints in synthesis and place and route until the timing slack³ is negative. This is done to force the tools to optimise the timing.

Constrain all important clocks. This forces the tools to improve all clocks in the design. If a single global clock constraint is used in a design with several clocks, the CAD tool will only apply the constraint to the clock domain which has the worst performance.

The highest effort choice in both the synthesis and PAR tools should be used.

³ The timing slack is the difference between the constraint and the resulting performance. If the slack is positive it means that the constraint can be set tighter to force the CAD tools to work harder.


Apply moderate I/O constraints. This is done to force the CAD tools to factor in any trade-offs in I/O timing requirements. Without this constraint the results may be unrealistic.

When comparing different FPGAs it is important to make sure that the devices run at similar speed grades.

To interpret the results given by the timing reports it is important to check that the CAD tools used all constraints. It is also important to run the synthesis and PAR a few times with varied constraints to tune the tools for better performance. Xilinx recommends tightening the constraint in 5% increments until timing is no longer met.
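Expressed compactly, and interpreting the 5% recommendation as a 5% tightening of the period constraint per run (our reading of the guideline, not an exact quote from [1, 22]):

    slack = T_{constraint} - T_{achieved}, \qquad T_{constraint} \leftarrow 0.95 \cdot T_{constraint} \ \text{while } slack \ge 0

i.e. the constraint is tightened iteratively and the last run with non-negative slack gives the performance figure to report.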

    2.4.4 Digital signal processing benchmarking

    In this thesis the focus is to benchmark and implement algorithms for

    Digital Signal Processing (DSP). This section presents a few important terms

    and arithmetic operations used in DSP.

Multiply-add is when two values are multiplied and a third value is added to the result. It is an important operation used in DSP applications including computation of vector products, FIR filtering, correlation and FFT.

Multiply-accumulate operation (MAC) is the more common term used to describe the multiply-add operation in hardware. The difference is that the result from the multiply-add operation is stored in a result register called an accumulator, hence the name.

Multiply-accumulates per second or MACs is a common micro benchmarking metric used to measure the peak performance in the number of multiply-accumulates a device can perform per second.

Complex number calculations Since complex numbers are very common for describing the phase and magnitude of signals, many calculations are carried out with complex numbers. Complex multiplication can be implemented in two ways, using either 3 real multiplications and 5 additions or 4 multiplications and 2 additions, as shown in the sketch after this list.

Matrix operations Many advanced algorithms use matrix operations including multiplication, decompositions, and singular value and eigenvalue calculations.
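To make the two complex-multiplication variants concrete, the following M-Code sketch (written for illustration only; the variable names are ours and the values are arbitrary) computes the same product with 4 multiplications/2 additions and with 3 multiplications/5 additions:

    % Two equivalent ways to compute (a + jb)*(c + jd).
    a = 1.5;  b = -2.0;  c = 0.75;  d = 3.25;

    % Variant 1: 4 real multiplications and 2 additions/subtractions.
    re1 = a*c - b*d;
    im1 = a*d + b*c;

    % Variant 2: 3 real multiplications and 5 additions/subtractions
    % (trades one multiplier for extra adders, which can pay off in hardware).
    k1 = c*(a + b);
    k2 = a*(d - c);
    k3 = b*(c + d);
    re2 = k1 - k3;
    im2 = k1 + k2;

    % Both variants agree with Matlab's built-in complex multiplication.
    ref = (a + 1i*b)*(c + 1i*d);
    disp([re1 im1; re2 im2; real(ref) imag(ref)]);

Which variant is preferable depends on the relative cost of multipliers and adders on the target device.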


    2.5 Xilinx AccelDSP

AccelDSP became a part of Xilinx XtremeDSP solutions in 2006 when Xilinx acquired the company AccelChip, founded in 2000 in California. AccelChip was a provider of Matlab synthesis software for DSP systems, with roots in a research project at Northwestern University in cooperation with DARPA. The project was called MATCH, an acronym for Matlab Compilation Environment for Adaptive Computing [23, 18]. The only operating system AccelDSP supports is Microsoft Windows XP.

The software is used for translating M-Code into VHDL which can be synthesised to create digital hardware. The translation includes automatic analysis of the types and shapes of variables and generation of a fixed-point design suitable for hardware implementation. The original research included functionality for mapping the application to multiple FPGAs by parallelising it, but this has not reached AccelDSP.

Many programs exist for translating general purpose programming languages, including C/C++ and Java, to VHDL, but developing a direct synthesis path from Matlab enables fast and easy evaluation of many algorithms [7]. The reason is that Matlab is used by high technology companies for developing algorithms, and a direct path makes intermediate translations to other programming languages unnecessary. This was the motivation behind the original research.

Using Matlab as a source has both advantages and disadvantages. The main advantage, besides its superiority for developing algorithms, is the high level syntax of the M-Code language. Since most signal processing building blocks, such as matrix multiplication, FFT, correlation, eigenvalue calculations etc..., are available as a single function, it is easy to auto-infer optimised IP-Blocks when translating to VHDL.

The main drawback is that Matlab is an interpreted language with dynamic type and shape resolution of its variables. A variable can even change shape, which means that it can go from being a scalar to a vector or a matrix during runtime. Dynamically changing the shape of matrices in loops is unfortunately common in M-Code; such code is very slow and not suitable for hardware implementation (see the sketch below). There is no concept of constants in M-Code, so a workaround is needed for optimisations that depend on constants.
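As a small illustration of the shape problem (our own example, not code from the thesis designs), the first loop below grows a vector dynamically, so its final size cannot be determined statically, while the second preallocates a fixed shape:

    n = 32;

    % Dynamically grown vector: its shape changes every iteration, so the
    % final size cannot be determined statically by a synthesis tool.
    y_dyn = [];
    for k = 1:n
        y_dyn = [y_dyn, k^2];
    end

    % Preallocated vector: the shape is fixed before the loop, which is both
    % faster in Matlab and, when n is a constant, suitable for hardware
    % generation.
    y_pre = zeros(1, n);
    for k = 1:n
        y_pre(k) = k^2;
    end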

    2.5.1 Workflow

    Generating VHDL from a Matlab model using AccelDSP starts with dividing

    the model into two parts, a script file and a function-file. The function-file

    contains the actual function to be translated into VHDL written as an ordinary

    Matlab function with an interface of input and output variables.

    The script file has three functions. It creates stimuli, feeds the stimuli to


the function in a streaming loop and verifies the output from the function. The streaming loop simulates the infinite stream of data entering and leaving the design in hardware, and the combination of a Matlab function and the loop makes it possible to feed the data in manageable partitions. The loop can be either a for or a while loop.

The stimuli, generated or imported, in the script file are important for two reasons. First, the data is used as a reference in the automatic type and size identification and fixed-point generation framework implemented in AccelDSP; it is from the scaling of the input that the internal bit-widths are determined. Second, the input must represent the real world input for the verification of the function to be relevant. The verification is made in several steps in the AccelDSP workflow.
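A minimal sketch of such a script/function pair is shown below (the file names, signal contents and the example computation are our own illustrative assumptions, not taken from the AccelDSP documentation); the script generates stimuli, feeds them to the function inside a streaming loop and verifies the result:

    % my_stream_script.m -- script file: creates stimuli, runs the streaming
    % loop and verifies the output.
    n_frames  = 64;                       % number of data partitions
    frame_len = 128;                      % samples per partition
    stimuli   = randn(n_frames, frame_len);
    result    = zeros(n_frames, 1);
    reference = zeros(n_frames, 1);

    for k = 1:n_frames                    % the streaming loop
        result(k)    = my_function(stimuli(k, :));
        reference(k) = sum(stimuli(k, :).^2);  % floating-point reference
    end

    max_error = max(abs(result - reference))

    % my_function.m -- function-file: the design to be translated to VHDL.
    function y = my_function(x)
    y = sum(x.^2);                        % example: energy of one input frame
    end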

An overview of the workflow is found in Figure 2.1 and is briefly explained in the following sections. The complete manual is found in [24].

[Figure 2.1: AccelDSP Workflow — verify floating-point ("golden" model), analyse, generate fixed-point, verify fixed-point, generate RTL, verify RTL, synthesise RTL, implement, verify gate level; the intermediate products are the fixed-point model, the RTL model (VHDL/Verilog), the gate-level netlist and finally the bitstream and simulation file.]


    2.5.1.1 Verify floating-point

AccelDSP lets the user execute the script file inside the program and shows all plots, variables and output. The floating-point model is the golden source, which must be verified by the designer using this output. Errors in this model will propagate through all later steps and exist in the final bitstream. It is also important to check that all important variables are observed, since the output is used to verify the fixed-point model.

If probes, see section 2.5.1.4, are used, floating-point data is stored in this step for later use.

    2.5.1.2 Analyse

An in-memory model of the design is created in this step. This involves parsing the M-Code into an internal structure, and all constructs that are not supported will cause an error to be raised. This step also tries to identify the streaming loop and the function-file from the script file.

AccelDSP supports a subset of the M-Code language, with restrictions on usable functions, operators, syntax and shapes. A reference can be found in [26]. Matlab functions are supported using a library called AccelWare which is included with AccelDSP, see section 2.5.2. Some of the functions can be automatically inferred in the design, but some need to be generated and inserted in the M-Code manually. For example, a divider can be inferred automatically and directives for the hardware implementation can be set in the fixed-point parse tree. For other functions, like an FFT, the code must be generated and inserted manually.

    2.5.1.3 Generate fixed-point

The framework already mentioned creates a fixed-point model in either Matlab or C++ from the floating-point source. The design is presented in a parse tree which includes all information about the design, including inferred IP-Blocks, operators and shapes. The interface lets the user graphically add directives to control bit-widths, add pipeline steps or choose hardware specific implementations.

    2.5.1.4 Verify fixed-point

In this step the fixed-point model is executed by Matlab and the same output as for the floating-point code is presented, so that the user can compare it with the floating-point output. If the results are unsatisfactory the user has to go back and annotate the design with more directives, or change the floating-point design. This iteration is performed until the user is satisfied with the results.


To simplify the analysis of variables inside functions and the effect of fixed-point conversion, AccelDSP provides probes. These are inserted into the M-Code for observing both the floating-point and fixed-point values of a variable. A plot presenting the values and the differences between the models is shown when running the AccelDSP function verify fixed-point, which can be used to identify errors in the quantisation.

    2.5.1.5 Generate RTL

When the user is satisfied with the fidelity of the fixed-point results an RTL description can be generated. Both VHDL and VERILOG files containing the RTL description are created. A testbench for verifying the generated design is also created automatically, together with ASCII files containing the function stimuli and the output from the Matlab model. The testbench reads the input to the VHDL function from the ASCII files and compares the output with the Matlab reference. The verification passes if all values are the same.

The testbench depends on Xilinx packages for performing the comparison with the ASCII files. This leads to compact testbenches, but the packages must be available if the testbench is to be run outside AccelDSP.

    2.5.1.6 Verify RTL

The testbench is run inside AccelDSP using one of several commercially available tools. The result is either pass or fail.

    2.5.1.7 Synthesise RTL

If the verification passes, the synthesis can also be performed inside the program by utilising an external tool, as in the verification. After this step a gate level netlist is created.

    2.5.1.8 Implement

The netlist is mapped to the hardware using MAP and PAR in this step. After this process two files of particular interest are created: the configuration bitstream, which is used for programming the FPGA, and a gate level simulation file.

    2.5.1.9 Verify Gate Level

    The same testbench used in the RTL verification can be used to verify the gate

    level implementation generated after PAR. This verification is able to find both

    synthesis and timing related errors. If this verification passes it guarantees that

    the implementation is bit-true with the Matlab fixed-point model.


    2.5.2 AccelWare

M-Code implementations of Matlab functions optimised for generation of VHDL are included in AccelDSP through a library called AccelWare. The library only contains a small subset of all Matlab functions, and only some of the functions can be automatically inferred using the same or a similar syntax as in Matlab. Functions that cannot be inferred automatically must be generated using a graphical interface with a form for choosing implementation parameters. This interface creates an AccelDSP project with both a script file, containing plots and test data to verify the functionality, and a function-file which can be inserted in the user code instead of the unsupported Matlab function.

    2.5.3 Original compiler implementation

This section contains information about the original implementation of the MATCH compiler as described in [6]. How much of this has changed in the current version of AccelDSP is not known, but at least some parts, like the state machine generation, are still implemented.

As a first compiler step a Matlab Abstract Syntax Tree (AST) is generated from the source code, annotated with the information in the directive files. A type-shape inference phase infers the types and shapes of the variables by analysing the tree, including type and shape directives. Optimised library functions are recognised and IP-Blocks are inferred. For operations where no cores are available, a scalarisation phase expands the matrix and vector operations into loops, as sketched below.
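As an illustration of what scalarisation means (our own example, not taken from [6]), a matrix product written as a single M-Code operation corresponds to the following explicit loop nest of scalar multiply-adds:

    % Vectorised form: a single M-Code statement.
    A = randn(4, 3);
    B = randn(3, 5);
    C = A * B;

    % Scalarised form: the same product expanded into scalar operations,
    % which is closer to what is generated when no optimised core is used.
    m = size(A, 1);
    k = size(A, 2);
    n = size(B, 2);
    C2 = zeros(m, n);
    for i = 1:m
        for j = 1:n
            acc = 0;
            for p = 1:k
                acc = acc + A(i, p) * B(p, j);   % multiply-accumulate
            end
            C2(i, j) = acc;
        end
    end
    max(abs(C(:) - C2(:)))   % should be zero up to rounding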

    The original compiler then performed a parallelisation phase which split

    loops or assigned different tasks onto multiple FPGAs available on the same

    board. This required a compatible board with several FPGAs and libraries for

    communication between the devices.

The Matlab AST is translated to a VHDL AST using a state machine description. The state machine is, among other things, required to hold the states of loops and calculations when using pipelining.

A precision inference scheme operates on the new AST to find the minimum number of bits required to represent every variable. When the bit widths have been inferred, hardware dependent optimisations are performed which can alter parts of the generated state machine. A traversal of the generated tree produces the output VHDL code.

    2.5.3.1 Performance

During the development of the MATCH compiler the researchers used benchmarking for evaluating the performance and measuring the effect of different optimisations. The benchmark suite presented in [6] consists of Matrix Multiplication,


FIR filter, IIR filter, Sobel edge detection algorithm, Average filter and Motion Estimation algorithm.

The conclusion drawn by the authors was that the auto-generated code was almost equivalent in execution time to manually designed hardware, and in some cases superior. The resource utilisation was within a factor of four of the manually designed hardware, and the design time was reduced from months to minutes.

The benchmarking was done on the WildChild board from Annapolis Micro Systems with 9 FPGA devices. How parallelism in the algorithms is exploited to divide the algorithm between the devices, and how the communication is implemented, is crucial for the performance and resource utilisation. This functionality is not a part of AccelDSP.


    2.6 MathWorks Matlab

Matlab is a high-level technical computation environment from The MathWorks, founded by Jack Little, Steve Bangert and Cleve Moler in 1984. The foundation of Matlab was written by Cleve Moler and was an implementation of the EISPACK and LINPACK Fortran packages for linear algebra and eigenvalue calculations. The program was written when Moler was a math professor at the University of New Mexico, to let the students do scientific calculations without having to program in Fortran.

Nowadays Matlab lets engineers develop algorithms, do numerical computations and visualise data more easily and faster than with traditional programming languages. The core of Matlab is the Matlab Code (M-Code) language, which is a high level, mathematically oriented language that can be used for algorithm design using Matlab's built-in functions as well as for programming new functions.

Matlab's built-in functions are mostly written in M-Code, which makes it easy to inspect the algorithms and make changes. Fundamental and performance sensitive functions are compiled to gain better performance. The core functionality of Matlab can be extended by installing toolboxes which contain tools for a certain field. Toolboxes are available for everything from aerospace and finance to bioinformatics [13, 15, 14].

2.6.1 Native code generation

Matlab can generate C-Code with both the Matlab Compiler, which can be used to generate code and executables for workstations, and the software used in this thesis, Real Time Workshop Embedded Coder (EMLC), which is designed for embedded targets. The difference in generated code is considerable and EMLC is recommended for embedded platforms [17].

EMLC is a module of Real Time Workshop, which is an extension to Matlab for generating and executing stand-alone C-Code. The module can be used with both the Embedded Matlab Subset (EML) and Simulink. The latter is an environment, supported by a graphical user interface, for simulation and model-based design of embedded and dynamic systems. EMLC was originally a part of Real Time Workshop, which leads to EMLC still being dependent on Simulink.

Both fixed-point and floating-point code generation are possible using EMLC. Parameters describing the code generation target and its abilities are provided to EMLC using configuration objects. This information includes endianness, floating-point support, etc... For fixed-point generation, which is of interest in this report, the Fixed-Point Toolbox is needed. This toolbox contains object types which can be used for annotating variables with fixed-point directives. These directives are used directly in the M-Code to create a fixed-point Matlab implementation supported for code generation, as in the sketch below.
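A minimal sketch of such an annotation is shown below (the function and its word lengths are our own illustrative choices; the exact code generation command and its options depend on the Real Time Workshop Embedded Coder setup and are therefore not shown):

    % fixed_point_scale.m -- Fixed-Point Toolbox annotations in M-Code.
    function y = fixed_point_scale(x)
    xfi  = fi(x, 1, 16, 12);      % signed, 16-bit word, 12 fraction bits
    gain = fi(1.375, 1, 8, 6);    % signed, 8-bit constant gain
    y = xfi * gain;               % result gets a derived fixed-point type;
                                  % rounding/overflow follow the fimath rules
    end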


    For more information and a performance evaluation of code generation from

    Matlab see section 1.3.7.


    Chapter 3

    Method

This chapter is divided into three sections. The first, found in section 3.1, describes a method for generating VHDL and evaluating the performance on a Field Programmable Gate Array (FPGA). Next, section 3.2 describes a method for generating C-Code using Real Time Workshop Embedded Coder (EMLC) and evaluating the performance on a PowerPC. A method for making a platform decision based on the previous two methods is found in section 3.3. Descriptions of the algorithms used in all methods are found in chapter 4.

    3.1 VHDL generation and simulation

This section describes the method for generating and simulating VHDL code from a Matlab implementation, used in this thesis for evaluating AccelDSP and for generating VHDL for the method for platform decision found in section 3.3. Background information on how AccelDSP should be used according to Xilinx is found in section 2.5. A flow chart of the method described in this section is fou