
Kristofer Jonsson

Components and Services for RHWOS

Master's Thesis DA-2004-05
Winter Term 2003/2004
Tutor: Herbert Walder
Supervisor: Prof. Dr. Lothar Thiele
28.5.2004

Preface

During the last five years I have studied at Linköping Institute of Technology and ETH Zurich, and it is with a touch of sentimentality that I hand over this document, since it marks the end of my time as a student. This master's thesis is a last test to prove what I have learned before taking a step out into the "real" world.

During the last 20 weeks I have helped develop components and services for a Reconfigurable Hardware Operating System (RHWOS), as part of the research project X-FORCES at ETH Zurich. In this report you will read about the development of an ethernet driver, an MMU and an SDRAM DMA loader. This has not only been very exciting, I have also learned a lot.

I would like to thank all of you who made this master's thesis possible:

A special thanks to Herbert Walder, my advisor, who always provided great support and came up with new ideas when all the "good" ones were used up. He put together a fascinating project that was great fun to be a part of.

Professor Doctor Lothar Thiele was my supervisor and made this master's thesis possible.

Thanks to all of you in ETZ G69. You were an invaluable support and always took the time for anyone who needed help or advice.

Contents

1 Introduction
  1.1 Reconfigurable Hardware Operating Systems
  1.2 Assignment
  1.3 Outline

2 X-FORCES
  2.1 XFBoard
  2.2 C-FPGA
  2.3 R-FPGA

3 Ethernet driver
  3.1 OSI Reference Model
    3.1.1 Data Link layer (2) - Ethernet protocol
    3.1.2 Network layer (3)
    3.1.3 Transport layer (4)
  3.2 Design overview
  3.3 Receiver implementation
    3.3.1 Datalatcher
    3.3.2 Ethernet receiver
    3.3.3 ARP
    3.3.4 IP
    3.3.5 UDP
    3.3.6 Controller
  3.4 Transmitter implementation
  3.5 Result
  3.6 Future work

4 MMU
  4.1 Specification
  4.2 Implementation
    4.2.1 Request Broker
    4.2.2 Read/write interface
    4.2.3 OSBridge interface
  4.3 Result
  4.4 Further work

5 DMA Loader
  5.1 Specification
  5.2 Implementation
    5.2.1 SDRAM controller
    5.2.2 OPB slave
    5.2.3 DMA
  5.3 How a reconfiguration works
  5.4 Result
  5.5 Further work

6 Achievements and outlook
  6.1 Achievements
  6.2 Outlook

A MMU cook-book

List of Figures

2.1 XFBoard
2.2 R-FPGA layout
3.1 OSI seven layer model
3.2 Ethernet frame
3.3 Ethernet header
3.4 CRC definitions
3.5 IPv4 header
3.6 ARP header, ARP request data
3.7 UDP header
3.8 Ethernet PHY
3.9 Ethernet Rx waveform
3.10 Ethernet driver
4.1 MMU integration in the RHWOS
4.2 TID and TRFID
4.3 VFDL/PFDL lookup tables
4.4 MMU layout
4.5 PC state graph
5.1 DMA system overview
5.2 SDRAM controller state diagram
5.3 Read timing
5.4 Write timing
5.5 OPB bus
5.6 OPB address decoding
5.7 SelectMAP timing
5.8 DMA data flow
A.1 MMU demonstration setup
A.2 MMU demonstration setup
A.3 MMU demonstration setup
A.4 MMU demonstration setup
A.5 MMU demonstration setup

List of Tables

3.1 Modulo 2 arithmetic
4.1 PFDL Type Mapping
4.2 Read interface
4.3 OSBridge interface
4.4 OSBridge address mapping
5.1 Hardware registers
5.2 Command/Status register
5.3 SelectMAP interface

Chapter 1

Introduction

Personal computers, cellular phones and portable mp3 players are examples of devices that during the last ten years have become an ordinary part of our everyday life. The producers are many and the competition is razor sharp. Consumers demand low prices, high performance and, for portable devices, also a long battery life.

The CPU is central in such a system. Modern general purpose CPUs still follow the development path that Gordon Moore suggested. The clock frequency roughly doubles every 18 months, but still they cannot satisfy all the demands that today's applications put on them.

The computing power is not sufficient for certain applications, such as heavy real-time video and audio compression. This applies especially to mobile devices that often, due to low power constraints, have a CPU that runs at a lower clock frequency. To enable these kinds of applications they are commonly run on dedicated hardware based on Application Specific Integrated Circuits (ASICs).

An ASIC has a fixed design that cannot be changed after tape-out. The advantage is that ASICs are cheap to mass produce and offer high performance. The drawback is that when an update is needed the whole design has to be discarded, which costs time and above all money.

Reconfigurable logic may, in contrast to ASICs, be reconfigured. This advantage is paid for with a higher chip production cost, lower performance and a lower logic density. On the other hand the turn-around time is significantly shorter and the chip may be reconfigured during runtime to execute different services.

A common setup for a reconfigurable system is to couple a CPU with a reconfigurable device. Computing intensive tasks are handled by the reconfigurable device while all other computation is executed by the CPU.

The Field Programmable Gate Array (FPGA) is a widespread type of reconfigurable logic. During the last couple of years the logic density and speed have increased appreciably, and today an FPGA can host a 32 bit CPU with peripheral controllers without reaching its limit. This means that a CPU may delegate several tasks to the FPGA.

Xilinx[11] is the FPGA market leader and their Virtex-II series allows not only full but also partial reconfiguration. This means that it is possible to overwrite just a part of the FPGA and leave the rest of the logic untouched. On a multitasking hardware system this makes it possible to replace a task without affecting the other tasks.

1.1 Reconfigurable Hardware Operating Systems

As already mentioned, an FPGA may be coupled to a CPU and seen by the CPU as a peripheral resource. The FPGA may in turn have further peripheral devices, for example memory, I/O devices or a VGA device. In a multitasking environment these resources may be requested by several tasks, and an efficient way of sharing them has to be found. The accepted solution on the software side is to introduce an operating system as an abstraction layer between the hardware and the processes. This concept may also be applied to a hardware multitasking system.

A Reconfigurable Hardware Operating System (RHWOS) works as an abstraction layer between the tasks and the peripheral devices. A driver for a peripheral device should only be implemented once. It would be a terrible waste of logic to let each task implement its own driver for every resource it demands.

Ideally an unlimited number of tasks should be able to access the same resource without affecting each other. In practice this is not the case and each request has to be queued and scheduled before being granted access to the resource. All this should be done behind the curtains of the RHWOS, and from the view of a task it should look as if it had exclusive access to the resource.

Memory is essential for many applications and it is likely that several tasks may share the same memory cells. The RHWOS must provide basic functionality to solve resource conflicts and supply a uniform interface to access the available memory technologies. Further functionality may also be desirable. This is described in detail in chapter 4.

In the computer engineering laboratory at ETH Zurich, a prototype of an RHWOS is currently under development, and the XFBoard (see 2.1) serves as a platform for this.

1.2 Assignment

My assignment has been to develop components and services for the X-FORCES RHWOS (see chapter 2). The designs should be implemented in the RHWOS, their functionality proven correct and their performance measured.

I have implemented the following components and services for the R-FPGA:

• Memory Management Unit (MMU)
• Ethernet Driver

For the C-FPGA I have developed an SDRAM DMA loader that involves the following components:

• OPB Slave
• SelectMAP DMA Loader
• SDRAM controller

1.3 Outline

In the following chapters a short introduction to the project X-FORCES is given in order to explain the background of this master's thesis. It is followed by descriptions of the components and services that have been implemented.

• Ethernet driver. This chapter first gives a background of the ethernet protocol before going into a detailed description of how the ethernet driver is implemented. It ends with the result and a list of future work.

• MMU. This chapter begins with a motivation of why an MMU is needed in an RHWOS. Then follows a specification and a detailed description of how the MMU is implemented. Finally the result is presented together with a list of future work.

• DMA loader. This chapter first gives a short motivation of why a DMA loader is needed. Then follows a specification and a detailed description of how the DMA loader is implemented. Finally the results are presented together with a list of future work.

Chapter 2

X-FORCES

X-FORCES[3] is a research project at ETH Zurich[15] that strives to develop a complete system for partial reconfiguration of FPGAs.

The goal of this project is to offer a methodology and a design environment that helps to exploit the potential of reconfigurable hardware in future embedded systems. The vision is that designing reconfigurable subsystems becomes as efficient and flexible as designing processor-based systems and that heterogeneous systems will use hardware reconfiguration to combine the high performance of dedicated hardware with the flexibility of software.¹

To fully reconfigure an FPGA during runtime or to implement a System on a Chip (SoC) is nothing new, but combining these techniques and partially reconfiguring an SoC during runtime is something that needs exploration before it can become reality.

2.1 XFBoard

Many ideas may look perfect in theory but prove useless in practical realizations. The XFBoard was developed in order to turn concepts into reality.

The XFBoard was developed as a semester thesis[4] by Samuel Nobs at ETH Zurich. It has two FPGAs coupled to each other, where one acts as CPU and the other as reconfigurable device. They are called CPU FPGA (C-FPGA) and Reconfigurable FPGA (R-FPGA).

The XFBoard is specially designed to support the needs of an RHWOS, and specifically targeted at networking, audio/video streaming, multimedia, encryption/decryption algorithms and real-time signal processing². The R-FPGA has all its I/O devices located at the left and right sides without any port conflicts between different devices. There are communication wires between the C-FPGA and the R-FPGA, and the C-FPGA may reconfigure the R-FPGA over the SelectMAP interface at a maximum of 50 MHz.

Further details about the XFBoard may be found in [4].

Figure 2.1: XFBoard (block diagram showing the Virtex-II C-FPGA and Virtex-II R-FPGA, SDRAM Left/Right 16M x 16, SRAM Left/Right 1M x 32, SDRAM 16M x 32, SRAM 1M x 32, FlashRAM 4M x 32, BootPROM, two Ethernet PHYs with RJ-45 connectors, audio codec, video DAC, 8-LED bar, LEDs and switches, JTAG headers, VGA outputs, PS/2, RS-232, audio in/out jacks and 36/40 pin expansion slots; the legend distinguishes data signals from configuration signals)

1 http://www.tik.ee.ethz.ch/~xforces/
2 [4], p. 11

2.2 C-FPGA

A MicroBlaze, a 32 bit soft-core microprocessor, is implemented in the C-FPGA. An OS, the XFOS, has been tailor-made for the RHWOS. Supporting hardware functionality such as a VGA driver and an SDRAM controller has been attached to an OPB bus and can be addressed by the CPU.

A detailed description of the C-FPGA can be found in Reconfigurable Hardware OS Prototype "C" [6].

2.3 R-FPGA

The approach of the RHWOS is to divide the R-FPGA into different areas, as figure 2.2 shows.

Figure 2.2: R-FPGA layout (labels: OS frame left (static), OS frame right (static), BARL, BARR, BAC, bus macros, task T1 and four dummy tasks)

The layout is based on the prerequisites of the XFBoard (see section 2.1). I/O devices are located left and right and communication with the CPU is located on the upper side.

The R-FPGA is divided into four parts: OS left and right, bus structure and task slots. The RHWOS serves as an abstraction layer between the tasks and the peripheral devices. At the top right and top left of the OS space is the OSBridge, a communication channel between the XFOS running on the CPU and the RHWOS running on the FPGA. The bus structure works as a communication channel between the tasks and the RHWOS; hence, no task may be routed across another task's area. The remaining area is used to implement hardware tasks.

For a more detailed description of the FPGA design layout I recommend reading [7].

Chapter 3

Ethernet driver

In 1973 Xerox was building the world's first laser printer at their Palo Alto Research Center (PARC). They wanted their hundreds of computers to be able to connect to it, and the job of building a networking system was given to Robert Metcalfe. He had to face two challenges: the network had to be fast enough to drive the fast laser printer and it had to connect hundreds of computers in the same building¹.

In 1980 Digital Equipment, Intel and Xerox released DIX Ethernet, the de facto 10 Mbps Ethernet standard. In 1983 the first IEEE standard, IEEE 802.3 10Base5, was approved, and since then the development of the world's most widespread LAN technology has continued².

Ethernet fulfils its original requirements of transporting large amounts of data and letting hundreds of computers share the same medium. The XFBoard has two ethernet devices, one for each of the FPGAs. They can, for example, be used to transport streaming audio and video data. For the C-FPGA there is already an available core from Xilinx, but for the R-FPGA no core is available and an ethernet driver for the RHWOS has to be developed and implemented.

3.1 OSI Reference Model

The Open Systems Interconnection Reference Model (OSI Reference Model), also called the OSI seven layer model, is a layered abstract description for communication and computer network design. The model divides the communication between computers into seven layers, as figure 3.1 shows, where each layer's functionality is described by the model.

The strength of the model is that the implementations of the layers may be arbitrarily combined to fit any software and hardware environment. For example, an email application in the Application layer (7) does not have to care whether the message is transported over ADSL or ethernet in the Physical layer (1).

Figure 3.1: OSI seven layer model

The ethernet PHY on the XFBoard handles layer 1. To make the data usable for the hardware tasks, the RHWOS has to implement layers 2, 3 and 4.

1 http://inventors.about.com/library/weekly/aa111598.htm
2 http://www2.rad.com/networks/2001/ethernet/hist.htm

3.1.1 Data Link layer (2) - Ethernet protocol

Before any data is transmitted, a 64 bit long preamble is sent in order to synchronize sender and receiver. It consists of 62 alternating 1s and 0s followed by the pattern 11. The last two 1s are known as the Start of Frame Delimiter (SFD) and indicate the end of the preamble. When encoded using Manchester encoding at 10 Mbit/s the 62 alternating bits produce a 5 MHz square wave³.
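The preamble layout described above can be written out as a short sketch. This is only an illustration in Python, not part of the hardware design; the function name `preamble_bits` is hypothetical.

```python
def preamble_bits():
    """Return the preamble as a list of bits: 62 alternating bits, then '11'."""
    alternating = [1, 0] * 31   # 62 bits: 1010...10
    sfd = [1, 1]                # Start of Frame Delimiter
    return alternating + sfd


bits = preamble_bits()
print(len(bits))                       # total number of preamble bits
print("".join(map(str, bits[-8:])))    # the last eight bits, ending in the SFD
```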

As figure 3.3 shows, the first data sent is the 6 byte MAC destination address followed by the equally long MAC source address. A MAC address can be compared to a telephone number, where each ethernet card has its own MAC address. When an ethernet frame is received and the MAC destination address either matches the receiver's own address or is the broadcast address (ff:ff:ff:ff:ff:ff), then it has reached its target and is received; otherwise it is normally discarded.

Figure 3.2: Ethernet frame

Figure 3.3: Ethernet header

3 http://www.erg.abdn.ac.uk/users/gorry/course/lan-pages/mac.html

After the MAC destination and source addresses the 2 byte long type field is transmitted. It tells which protocol is carried. Among the most common are IP (0x0800) and ARP (0x0806).
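To make the header layout concrete, the following minimal Python sketch splits the first 14 bytes of a frame into destination address, source address and type field. The function names and the example source address are made up for illustration; the actual driver is implemented in hardware.

```python
import struct

ETHERTYPE_IP = 0x0800
ETHERTYPE_ARP = 0x0806


def parse_ethernet_header(frame: bytes):
    """Split the 14 byte ethernet header into destination, source and type."""
    destination, source, ethertype = struct.unpack("!6s6sH", frame[:14])
    return destination, source, ethertype


def format_mac(mac: bytes) -> str:
    """Render a 6 byte MAC address in the usual colon separated form."""
    return ":".join(f"{octet:02x}" for octet in mac)


# Hypothetical frame: broadcast destination, made-up source address, ARP type,
# followed by 46 bytes of zeroed data.
example_frame = bytes.fromhex("ffffffffffff" "000a35000001" "0806") + bytes(46)
destination, source, ethertype = parse_ethernet_header(example_frame)
print(format_mac(destination), format_mac(source), hex(ethertype))
```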

Then come between 46 and 1500 bytes of data. The lower limit ensures that an ethernet frame is long enough for a collision to be detected. A collision can only be detected while a frame is being sent; if the time for sending a frame were shorter than the time for the first byte of the frame to reach its target, a collision could in the worst case occur only after the whole frame had been sent.
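As a worked example of the lower limit, consider the shortest possible frame at 10 Mbit/s. The sketch assumes the frame layout described in this chapter: a 14 byte header, 46 bytes of data and a 4 byte frame check sequence, with the preamble not counted.

$$(14 + 46 + 4)\ \text{bytes} \cdot 8\ \frac{\text{bits}}{\text{byte}} = 512\ \text{bits}, \qquad \frac{512\ \text{bits}}{10 \cdot 10^{6}\ \text{bits/second}} = 51.2\ \mu\text{s}$$

A sender is therefore busy for at least 51.2 µs per frame, which is the window in which a collision can still be noticed.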

The data link protocol shall deliver reliable point to point communication. If a collision happens, it is detected and the frame may be requested to be resent.

CRC generation

The last element that is sent is the Frame Check Sequence (FCS). It is based on the Cyclic Redundancy Check (CRC) and is a checksum used to validate the frame. The algorithm is based on modulo 2 arithmetic. It is possible to implement the CRC in software but it is much more efficient in hardware.

Modulo 2 arithmetic is illustrated in table 3.1. Note in particular that the addition of two equal numbers gives zero.

A  B  A xor B
0  0  0
0  1  1
1  0  1
1  1  0

  10110100      10110100      10110100
+ 00101010    + 10110100    - 10011110
  --------      --------      --------
  10011110      00000000      00101010

Table 3.1: Modulo 2 arithmetic

Figure 3.4: CRC definitions

Let us define the following, illustrated by figure 3.4:

• M - The original frame to be transmitted. It is k bits long.
• F - The resulting FCS. It is n bits long.
• T - The concatenation of M and F. It is k+n bits long.
• P - The predefined CRC polynomial. It is n+1 bits long.

It is clear that the total frame to be sent is

$$T = M \cdot x^{n} + F \tag{3.1}$$

Suppose we divide $M \cdot x^{n}$ by $P$:

$$\frac{M \cdot x^{n}}{P} = Q + \frac{R}{P} \tag{3.2}$$

Let us look back at equation 3.1 and assign F = R. The frame sent will then be

$$T = M \cdot x^{n} + R \tag{3.3}$$

To check whether a received frame, T, is valid, it is divided by P:

$$\frac{T}{P} = \frac{M \cdot x^{n} + R}{P} = \frac{M \cdot x^{n}}{P} + \frac{R}{P} = Q + \frac{R}{P} + \frac{R}{P} = Q + \frac{R + R}{P} \tag{3.4}$$

But from table 3.1 we know that the sum of two equal numbers is zero. Hence, if the remainder of a received frame is not zero, there is an error in the frame.
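The derivation can be tried out with a small Python sketch of modulo 2 division. It uses a toy 4 bit polynomial (n = 3) chosen only for readability, not the 32 bit polynomial used by ethernet, and the function names are hypothetical.

```python
def mod2_remainder(value: int, poly: int) -> int:
    """Remainder R of value / poly in modulo 2 arithmetic (subtraction is XOR)."""
    n = poly.bit_length() - 1                  # P is n+1 bits long
    while value.bit_length() > n:
        shift = value.bit_length() - poly.bit_length()
        value ^= poly << shift                 # cancel the current top bit
    return value


def append_fcs(message: int, poly: int) -> int:
    """Build T = M * x^n + R, where R is the remainder of M * x^n / P."""
    n = poly.bit_length() - 1
    shifted = message << n                     # M * x^n
    return shifted ^ mod2_remainder(shifted, poly)


P = 0b1011                                     # toy polynomial, n = 3
M = 0b10110100                                 # example message
T = append_fcs(M, P)

print(mod2_remainder(T, P) == 0)               # an intact frame leaves no remainder
print(mod2_remainder(T ^ (1 << 5), P) != 0)    # a flipped bit is detected
```

The receiver does not need to know M: it only divides the received T by P and checks that the remainder is zero.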

3.1.2 Network layer (3)

The network layer handles the routing of the data, to make sure that data is sent in the right direction and to the correct destination.

IP protocol

Internet Protocol (IP) is, as the name suggests, one of the basic protocols of the internet. It carries information about the source and destination of a package, which is used to determine whether the package has reached its final host or should be routed further.

The IP protocol provides a so-called unreliable service, also called best effort. This means that no guarantee is given that the package is correctly delivered. The package may be delivered damaged, out of order, duplicated or even be dropped. If certainty that the package is correctly delivered is desired, this is added in the transport layer.

The IP header is shown in figure 3.5.

Version tells which version of the IP protocol is carried. The most widespread version is IPv4, represented by a 4. The internet boom of recent years has, however, consumed a large number of IP addresses and there is now a shortage of addresses. IPv6 has been suggested as a successor, and with 128 bit addresses no shortage is expected within the foreseeable future.

Protocol tells which transport layer protocol is carried.

Source IP tells who the sender of a package is. The address is used to address a reply package.

Destination IP tells to whom the package is addressed. When a package arrives at a node the destination IP is examined, and a decision is made whether the package should be received, dropped or sent further. The address ff:ff:ff:ff (IPv4), 255.255.255.255 in dotted decimal notation, is known as the broadcast address and addresses all computers in a network.

Figure 3.5: IPv4 header
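The four fields discussed above sit at fixed positions in a 20 byte IPv4 header without options. The following Python sketch extracts them; the function name and the example addresses are illustrative, and the checksum of the example header is left at zero rather than computed.

```python
import socket
import struct

PROTOCOL_UDP = 17   # IP protocol number assigned to UDP


def parse_ipv4_header(header: bytes):
    """Extract version, protocol, source IP and destination IP."""
    version = header[0] >> 4                    # upper 4 bits of the first byte
    protocol = header[9]
    source = socket.inet_ntoa(header[12:16])
    destination = socket.inet_ntoa(header[16:20])
    return version, protocol, source, destination


# Hypothetical header: version 4, header length 5 words, total length 28,
# TTL 64, protocol UDP, checksum not filled in.
example_header = struct.pack(
    "!BBHHHBBH4s4s",
    0x45, 0, 28, 0, 0, 64, PROTOCOL_UDP, 0,
    socket.inet_aton("192.168.0.1"),
    socket.inet_aton("255.255.255.255"),
)
print(parse_ipv4_header(example_header))
```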

ARP protocol

The Address Resolution Protocol (ARP) is used together with the IP protocol to map an IP address to a MAC address. The name address resolution refers to the procedure of finding an address in a network.

When the MAC address belonging to an IP address is unknown, an ARP request (see figure 3.6) is broadcast over the network with the message "who is X.X.X.X tell Y.Y.Y.Y", where X.X.X.X and Y.Y.Y.Y both are IP addresses. A computer in the network that receives an ARP request drops the request if it does not match its own IP address; otherwise it replies with an ARP reply. The ARP reply contains the MAC address asked for.

To minimize the number of address resolutions a client normally holds a cache of recently resolved addresses. This table is of finite size and is periodically flushed to keep it up to date.
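A cache of this kind could look like the following Python sketch. It is a generic illustration and not taken from the X-FORCES design: the class name, the table size and the choice of expiring entries by age (instead of flushing the whole table periodically) are all assumptions.

```python
class ArpCache:
    """Small IP-to-MAC table with a size limit and a maximum entry age."""

    def __init__(self, max_entries=4, max_age=60.0):
        self.max_entries = max_entries
        self.max_age = max_age          # seconds an entry stays valid
        self._table = {}                # ip -> (mac, time the entry was stored)

    def store(self, ip, mac, now):
        # When the table is full, evict the oldest entry to make room.
        if ip not in self._table and len(self._table) >= self.max_entries:
            oldest = min(self._table, key=lambda key: self._table[key][1])
            del self._table[oldest]
        self._table[ip] = (mac, now)

    def lookup(self, ip, now):
        entry = self._table.get(ip)
        if entry is None:
            return None                 # unknown: an ARP request is needed
        mac, stored = entry
        if now - stored > self.max_age:
            del self._table[ip]         # stale entry: resolve again
            return None
        return mac


cache = ArpCache(max_entries=2, max_age=60.0)
cache.store("192.168.0.7", "00:0a:35:00:00:07", now=0.0)
print(cache.lookup("192.168.0.7", now=10.0))    # still valid
print(cache.lookup("192.168.0.7", now=100.0))   # expired
```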

Figure 3.6: ARP header, ARP request data

3.1.3 Transport layer (4)

The basic functionality of the transport layer is to receive data from the session layer, split it into smaller units if needed, pass these to the network layer and make sure that they arrive correctly. The network layer provides only a basic and unreliable service for data transfer, while the transport layer may extend this to a reliable service. The transport layer is, however, not obliged to provide reliability.

User Datagram Protocol (UDP)

The User Datagram Protocol (UDP) is a minimal service that provides no guarantee that a message is delivered correctly. UDP only adds 4 header fields, as figure 3.7 shows. Since UDP is stateless there is no need for replies and the source port is optional. If it is not used, it should be set to zero.

Figure 3.7: UDP header

Lacking reliability, an application using UDP must be ready to accept some data loss. Streaming media, real-time gaming and voice over IP are examples of applications that often use UDP.

Since UDP is a very simple protocol and the applications that run on the XFBoard are streaming oriented, UDP is the only transport layer protocol used.
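The four UDP header fields are source port, destination port, length and checksum, each 16 bits wide. A minimal Python sketch of decoding them follows; the function name and the port number are made up for illustration. Note that the length field counts the 8 header bytes as well as the payload.

```python
import struct


def parse_udp_header(datagram: bytes):
    """Return the four UDP header fields and the payload they describe."""
    source_port, destination_port, length, checksum = struct.unpack(
        "!HHHH", datagram[:8]
    )
    payload = datagram[8:length]        # length covers header and payload
    return source_port, destination_port, length, checksum, payload


# Hypothetical datagram: unused source port (zero), destination port 5000,
# five payload bytes, checksum not filled in.
example_datagram = struct.pack("!HHHH", 0, 5000, 8 + 5, 0) + b"hello"
print(parse_udp_header(example_datagram))
```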

3.2 Design overview

The LXT970A Fast Ethernet PHY Transceiver on the XFBoard is capable of handling both 10 and 100 Mbit/s. The basic requirement for the ethernet driver is to handle 10 Mbit/s. 100 Mbit/s is also desired, but is left as a future improvement.

Figure 3.8: Ethernet PHY

Data is sent over an ethernet cable and received by the ethernet PHY. The PHY decodes OSI layer 1 and presents a stream of 4 bit nibbles together with control signals that indicate when a valid frame is received. The three most important signals are Rx Clk, Rx Data Valid and Rx Data.

Figure 3.9: The figure illustrates how the hexadecimal sequence 0123 is sent over ethernet. (Waveform of Rx Clk, Rx DV and Data: Rx DV is high while the nibbles 1, 0, 3, 2 and then the 8 CRC nibbles are presented.)

Figure 3.9 illustrates the signals presented by the ethernet PHY when a valid frame is received. After the preamble has been transmitted, Rx DV goes high on the falling edge of Rx Clk. At the same time the data changes and the first byte of the MAC header is presented. The last 4 bytes that are sent are the CRC value.

The package is divided into bytes, where the lower nibble, 4 bits, is sent before the higher nibble. A sequence of 01234567 would be sent as 10325476.
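The nibble ordering can be illustrated with a short Python sketch. It models only the ordering of the data, not the clocking of the PHY interface, and the function names are hypothetical.

```python
def to_nibble_stream(data: bytes):
    """Split bytes into 4 bit nibbles, lower nibble first."""
    nibbles = []
    for byte in data:
        nibbles.append(byte & 0x0F)     # lower nibble is sent first
        nibbles.append(byte >> 4)       # higher nibble follows
    return nibbles


def from_nibble_stream(nibbles):
    """Reassemble bytes from a lower-nibble-first stream."""
    pairs = zip(nibbles[0::2], nibbles[1::2])
    return bytes(low | (high << 4) for low, high in pairs)


stream = to_nibble_stream(bytes.fromhex("01234567"))
print("".join(f"{nibble:x}" for nibble in stream))    # 10325476
print(from_nibble_stream(stream).hex())               # 01234567
```

The second function corresponds to what a receiver has to do before it can work on whole bytes.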

The ethernet driver takes the data stream from the ethernet PHY and decodes OSI layers 2, 3 and 4. The package payload is then written to a FIFO buffer.

Transmitting a package works in much the same way as receiving, but in the opposite direction. A package is written to a FIFO buffer that is read by the ethernet driver and passed on to the ethernet PHY, which sends it over the ethernet cable. The difference is that in this case the ethernet driver does not implement OSI layers 3 and 4. The reason is that it should be possible to send both UDP/IP packages and ARP replies.

3.3 Receiver implementation

The ethernet PHY presents four signals that are read by the ethernet driver: Rx Clock, 4 bit Rx Data, Rx Data Valid and Rx Error. With 10 Mbit/s and 4 bits per sample this gives an Rx Clock of 2.5 MHz.

$$\frac{10 \cdot 10^{6}\ \text{bits/second}}{4\ \text{bits/sample}} = 2.5 \cdot 10^{6}\ \text{samples/second} = 2.5\ \text{MHz}$$

One byte is divided into a lower and a higher nibble, where the lower nibble contains the least significant bit. The lower nibble is transmitted before the higher nibble.

Figure 3.10: Ethernet driver

3.3.1 Datalatcher

The Datalatcher is the clock domain border between the RHWOS and the ethernet PHY. It samples all signals from the ethernet PHY on the rising edge of the system clock (50 MHz). It also takes the lower and higher nibbles and concatenates them into one byte.

3.3.2 Ethernet receiver

The Ethernet receiver decodes the ethernet header. If the MAC destination address matches neither the MAC address of the driver nor the broadcast address, the package is dropped.

Furthermore, the type field of the ethernet header is decoded. It tells which network layer protocol is carried. If it matches neither ARP nor IP, the whole frame is dropped. Otherwise a control signal is presented to indicate which of them is being received.

When the full ethernet header has been received, the ethernet receiver presents the ETH Data Valid signal.
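The two checks made by the Ethernet receiver can be summarized in a behavioural Python sketch. It is a software model under simplifying assumptions (the whole frame is available at once, whereas the hardware decodes byte by byte), and the names `classify_frame` and `own_mac` are hypothetical.

```python
BROADCAST_MAC = bytes([0xFF] * 6)
ETHERTYPE_IP = 0x0800
ETHERTYPE_ARP = 0x0806


def classify_frame(frame: bytes, own_mac: bytes):
    """Return "ARP" or "IP" for an accepted frame, or None if it is dropped."""
    destination = frame[0:6]
    if destination != own_mac and destination != BROADCAST_MAC:
        return None                     # not addressed to this driver
    ethertype = int.from_bytes(frame[12:14], "big")
    if ethertype == ETHERTYPE_ARP:
        return "ARP"
    if ethertype == ETHERTYPE_IP:
        return "IP"
    return None                         # unsupported network layer protocol


own_mac = bytes.fromhex("000a35000002")             # made-up driver address
sender = bytes.fromhex("000a35000001")              # made-up sender address
arp_frame = BROADCAST_MAC + sender + bytes.fromhex("0806") + bytes(46)
ip_frame = own_mac + sender + bytes.fromhex("0800") + bytes(46)
print(classify_frame(arp_frame, own_mac), classify_frame(ip_frame, own_mac))
```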

3.3.3 ARP

The ARP header is not decoded. Instead the header and payload are written to an ARP FIFO buffer.

The ARP Data Valid signal is presented while the ARP header and payload are received.

3.3.4 IP

The IP header is decoded and the IP address and transport layer protocol are extracted. If the IP address matches neither the board IP address nor the broadcast address, or the protocol carried is not UDP, the package is dropped.

When the full IP header has been received the IP Data Valid signal is presented.

3.3.5 UDP

When the IP Data Valid signal is read the UDP process starts decoding the UDP header. The source port and payload length are extracted. The payload length is stored in a register and is then decremented after each received byte.

When the full UDP header has been received the UDP Data Valid signal is presented. When the payload counter has reached zero, UDP Data Valid is deasserted.
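The behaviour of the payload counter can be modelled in Python as follows. This is a sketch only: it assumes that the counter is loaded with the UDP length field minus the 8 header bytes, which the text does not state explicitly, and the function name is hypothetical.

```python
def mark_udp_payload(udp_bytes: bytes):
    """Pair each byte after the IP header with a modelled UDP Data Valid flag."""
    length = int.from_bytes(udp_bytes[4:6], "big")   # header plus payload
    remaining = length - 8                           # payload counter
    marked = []
    for index, byte in enumerate(udp_bytes):
        if index < 8:
            marked.append((byte, False))             # header is still decoded
        elif remaining > 0:
            marked.append((byte, True))              # payload byte
            remaining -= 1                           # decrement per byte
        else:
            marked.append((byte, False))             # counter reached zero
    return marked


# Hypothetical datagram with 3 payload bytes followed by 2 padding bytes.
header = (0).to_bytes(2, "big") + (5000).to_bytes(2, "big") \
    + (8 + 3).to_bytes(2, "big") + (0).to_bytes(2, "big")
flags = [valid for _, valid in mark_udp_payload(header + b"abc" + bytes(2))]
print(flags)
```

The example shows why the counter is needed: short packages are padded to the minimum ethernet data size, and the padding must not be written to the FIFO buffer.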

3.3.6 Controller

The Ethernet receiver, ARP, IP and UDP all present Data Valid signals when they have decoded their headers. They also present some of the decoded signals, such as ARP/IP type and UDP port. These signals are read by the controller.

When the controller sees that the payload of a valid package, ARP or IP/UDP, is being received, it presents a signal one clock pulse long for each received byte. This pulse is used as the write signal for a FIFO buffer.

3.4 Transmitter implementation

The transmitter only implements the data link layer. This decision was made to allow the transmitter to send both ARP and IP packages.

When the transmitter is given a send command it first sends the 64 bit preamble. After that, it reads a full package and passes it, lower nibble before upper nibble, to the ethernet PHY. A CRC value is generated in hardware and is finally appended to the ethernet frame.
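The transmit sequence can be sketched in Python as preamble, package and CRC. The sketch assumes the standard IEEE 802.3 CRC-32, which is what `zlib.crc32` computes; the text above does not name the polynomial used in the hardware. The function name and the example package contents are made up.

```python
import zlib

# Each byte goes out least significant bit first, so seven bytes 0x55 and one
# byte 0xD5 appear on the wire as 62 alternating bits followed by "11".
PREAMBLE = bytes([0x55] * 7 + [0xD5])


def build_frame(package: bytes) -> bytes:
    """Return preamble, package and the appended 4 byte frame check sequence."""
    fcs = zlib.crc32(package).to_bytes(4, "little")
    return PREAMBLE + package + fcs


# Hypothetical package: 14 header bytes plus 46 data bytes, all set to 0x11.
example_package = bytes([0x11] * 60)
frame = build_frame(example_package)
print(len(frame))    # 8 preamble bytes + 60 package bytes + 4 CRC bytes
```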

3.5 Result

From the beginning the goal of the ethernet driver was to have a working design, and optimization was of secondary interest. This is also the final result.

The driver works for small designs and decodes the headers as it should, but the code is not very flexible and only fulfils its purpose for the XF RHWOS. The project has also changed over time as new components such as the MMU (see chapter 4) and the Package Discriminator⁴ were introduced in the RHWOS, which has led to modifications of the ethernet driver.

For larger designs the Xilinx Place and Route tool finds it difficult to comply with the 20 ns timing constraint, which leads to unpredictable behavior of the system. In particular, the one clock pulse long signal from the controller often seems to miss its deadline, so that a byte here and there is lost.

3.6 Future work

• The OSI model could be better honored.

• For larger designs timing problems have been noticed. To put this right, the longest path could be analyzed and broken into smaller parts.

• A hardware ARP daemon could be implemented to automatically reply to ARP requests.

4 The Package Discriminator takes 8 bit data and a UDP port as arguments and writes the data into different FIFO buffers.

Chapter 4

MMU

An FPGA may have many different memory technologies at its disposal, with different characteristics such as access time, cost per bit and volatility. A Memory Management Unit (MMU) is in charge of accessing the different memory technologies and forming storage services out of them, such as FIFO buffers or shared or private random access memory.

In a RHWOS the memory cells are shared by the tasks, but from the view of a task it should look as if it had exclusive access to its own memory space. Ideally an MMU should be able to serve all tasks in parallel without any delay. This is normally not the case, and each request has to be serialized before being served. The MMU must find a scheme to solve resource conflicts by queueing and scheduling requests in a way that is fair and optimizes the throughput.

4.1 Specification

The R-FPGA has access to BlockRAM, SRAM and SDRAM. Most of the applications of the RHWOS are streaming oriented, and for these memory is an important service. A design decision has been made that all inter-task communication must go through the RHWOS, and FIFO buffers are well suited for passing data between tasks.

The MMU should form FIFO buffers out of the on-chip BlockRAM and the peripheral SRAM cells. The SDRAM cells have been left out since they are not suitable for FIFO buffers: they have a complicated access protocol and a long delay for single read and write commands.

The MMU should be implemented in the existing RHWOS and its performance evaluated.

• FIFO buffers should be implemented. Random read/write is not required, but may be desired later.

• Tasks identify themselves with their Task ID and with the Task Relative FIFO ID they want to access.

• The C-FPGA shall have full access to all memory and internal lookup tables over the OSBridge.

• Drivers with high communication needs shall have direct access to fixed FIFO buffers.

• BlockRAM and SRAM should be supported. SDRAM is not required, but may be desired in the future.

4.2 Implementation

Figure 4.1 shows a simplified picture of how the MMU is integrated in the RHWOS. The C-FPGA has access to the MMU over the OSBridge, the tasks have access over the Task Communication Bus (TCB) and the drivers have direct access to fixed FIFO buffers.

There are two different SRAM banks, one at the left side of the R-FPGA and one at the right side, each storing 16 Mbit. In the R-FPGA fixed FIFO buffers are instantiated with sizes varying from 128 up to 2048 16 bit half words. They are listed in table 4.1. The available BlockRAM depends on how large the OS frame is, since no wires may be routed to BlockRAM in the task slot area. For the current setup of the MMU the BlockRAM is 4096 16 bit half words large.

From the view of a task the MMU looks like a number of FIFO buffers that it can access by giving its Task ID (TID) and a Task Relative FIFO ID (TRFID). The TID is fixed and uniquely identifies the task, while the TRFID is a relative ID that gives the task the opportunity to work with more than one FIFO. This is illustrated in figure 4.2.

The MMU has two internal lookup tables to map a TID and TRFID to a physical memory address, see figure 4.3. The Virtual FIFO Descriptor List (VFDL) tells which direction a FIFO has, read or write, and points to an entry in the Physical FIFO Descriptor List (PFDL). The PFDL describes the physical layout of the FIFO: in which memory technology it is implemented, its base address, its size and the current read and write pointers.

To simplify the allocation of new FIFO buffers and the control hardware, a minimum and a maximum FIFO buffer size have been defined.

• Minimum FIFO buffer depth: 2^8 · 16 bit = 0.5 kB

• Maximum FIFO buffer depth: 2^14 · 16 bit = 32 kB. 32 kB is enough for buffering audio data (2 channels, 16 bit, 44.1 kHz) for 0.186 s.
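The two size limits and the audio figure can be checked with a few lines of C. This is only a sketch of the arithmetic; the function names `fifo_bytes` and `audio_buffer_us` are hypothetical and not part of the RHWOS.

```c
#include <stdint.h>

/* Width of one FIFO entry (a 16 bit half word). */
enum { WORD_BITS = 16 };

/* Size in bytes of a FIFO buffer that holds 2^depth_log2 half words. */
uint32_t fifo_bytes(unsigned depth_log2) {
    return ((uint32_t)1 << depth_log2) * WORD_BITS / 8;
}

/* Time in microseconds that `bytes` of PCM audio data last. */
uint32_t audio_buffer_us(uint32_t bytes, uint32_t channels,
                         uint32_t bits_per_sample, uint32_t sample_rate_hz) {
    uint32_t bytes_per_second = channels * (bits_per_sample / 8) * sample_rate_hz;
    return (uint32_t)(((uint64_t)bytes * 1000000u) / bytes_per_second);
}
```

With these helpers, `fifo_bytes(8)` and `fifo_bytes(14)` give the minimum and maximum buffer sizes, and `audio_buffer_us(fifo_bytes(14), 2, 16, 44100)` gives the buffering time of the largest FIFO for CD quality stereo audio.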

Figure 4.1: MMU integration in the RHWOS

The TID has been chosen to be 3 bits wide, since the system can execute at most 5 tasks at a time. The TRFID has also been chosen to be 3 bits wide, so that each task may have 8 FIFO buffers, which should be enough. Together they form an address in the VFDL.

The direction tells if the FIFO buffer is being read or written. '1' corresponds to the write direction.

The PFID is a pointer to an entry in the PFDL. It is 4 bits wide, which gives a maximum of 16 FIFO buffers in the system.

The type tells in which memory technology the FIFO buffer is implemented. The mapping is described in table 4.1.

The physical base address is calculated as BaseAddress · BlockSize, where the block size is the minimum FIFO buffer depth. The SRAM cells have the largest memory space with 2^20 addresses, and this gives a base address of 12 bits.

The physical size is calculated as Size · BlockSize, where the block size is the minimum FIFO buffer depth. Since the maximum FIFO buffer size is set to 2^14, this gives a maximum of 2^6 blocks and a size field of 6 bits.

Figure 4.2: TID and TRFID

Figure 4.3: VFDL/PFDL lookup tables

The read and write pointers are 14 bits wide, as given by the maximum FIFO buffer depth. They are implemented in separate lookup tables so that they, and only they, can be updated without touching any other information.
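The field widths described above can be summarized as C data structures. This is a software sketch only: the type and function names are hypothetical, the TID is placed in the upper three bits of the VFDL index as in the OSBridge address mapping of table 4.4, and treating the read and write pointers as offsets relative to the base address is an assumption.

```c
#include <stdint.h>

#define BLOCK_SIZE 256u /* minimum FIFO depth: 2^8 half words */

/* One VFDL entry, addressed by TID and TRFID. */
typedef struct {
    uint8_t direction; /* 1 bit: '1' = write, '0' = read */
    uint8_t pfid;      /* 4 bit index into the PFDL */
} vfdl_entry_t;

/* One PFDL entry, addressed by PFID. */
typedef struct {
    uint8_t  type;      /* 3 bit memory technology, see table 4.1 */
    uint16_t base;      /* 12 bit base address, counted in blocks */
    uint8_t  size;      /* 6 bit size, counted in blocks */
    uint16_t read_ptr;  /* 14 bit read pointer */
    uint16_t write_ptr; /* 14 bit write pointer */
} pfdl_entry_t;

/* 6 bit VFDL index: TID in bits [5:3], TRFID in bits [2:0]. */
unsigned vfdl_index(unsigned tid, unsigned trfid) {
    return ((tid & 7u) << 3) | (trfid & 7u);
}

/* Physical half word address for a FIFO with the given base (in blocks)
 * and a pointer that is assumed to be relative to that base. */
uint32_t physical_address(uint16_t base_blocks, uint16_t pointer) {
    return (uint32_t)base_blocks * BLOCK_SIZE + pointer;
}
```

A write by task 5 to its FIFO 3 would, for example, use `vfdl_index(5, 3)` to find the VFDL entry, follow its `pfid` into the PFDL and then call `physical_address(entry.base, entry.write_ptr)`.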

4.2.1 Request Broker

The MMU layout is shown in figure 4.4. Each port has a Port Controller (PC) that is responsible for reading or writing data and for keeping internal registers up to date. Each resource has a Memory Controller (MC) that implements the access protocol of the memory device.

Central in the design is the Memory Request Broker (MRB). It connects the PCs with the MCs. When a PC wants to access one or more MCs it sends a vector of requested resources to the MRB. The MRB replies with either a grant or a deny, and if a PC is granted access it is connected to the requested resources. A request is only granted if the full request can be fulfilled.

Type  Memory Technology
000   SRAM 1
001   SRAM 2
010   BlockRAM
011   Direct FIFO buffer

Base address  Direct FIFO buffer
0000          Ethernet Tx
0001          Ethernet Arp
0010          Ethernet Udp 1
0011          Ethernet Udp 2
0100          RS 232 Tx
0101          RS 232 Rx
0110          Audio Tx
0111          Audio Rx 1
1000          Audio Rx 2

Table 4.1: PFDL Type Mapping

The MRB implements a priority based scheme. As in a pride of lions, the most important PC eats first, followed by the second most important PC. After the highest priority PC has taken the resources it needs, the second highest priority PC tries to acquire the resources it needs. If it can be granted all requested resources, it takes those and hands over what is left to the next PC in the priority queue; otherwise it leaves the resource vector untouched and passes it on.

A PC may request one or more of the following resources: VFDL A, VFDL B, PFDL A, PFDL B, BRAM A, BRAM B, SRAM 1, SRAM 2 and the fixed FIFO buffers. As the list shows, both lookup tables and the BRAM have two ports each. A common scenario is that two tasks communicate with each other: one task writes to the same FIFO buffer that another task reads from. It is therefore likely that the numbers of read and write requests will be roughly equal, and therefore read requests use port A by default and write requests port B.
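The priority scheme of the MRB can be modelled in software as a chain that passes a vector of taken resources from PC to PC. The following is a behavioural sketch, not the hardware implementation; the bit assignment of the resources and the name `mrb_arbitrate` are assumptions made for illustration.

```c
#include <stdint.h>

/* One bit per resource, in the order in which the text lists them
 * (the bit positions are an assumption of this sketch). */
enum {
    RES_VFDL_A = 1u << 0, RES_VFDL_B = 1u << 1,
    RES_PFDL_A = 1u << 2, RES_PFDL_B = 1u << 3,
    RES_BRAM_A = 1u << 4, RES_BRAM_B = 1u << 5,
    RES_SRAM_1 = 1u << 6, RES_SRAM_2 = 1u << 7,
    RES_FIXED_FIFO = 1u << 8
};

/* requests[0] belongs to the PC with the highest priority.
 * Returns a bit mask with one bit set for every PC that is granted. */
uint32_t mrb_arbitrate(const uint16_t *requests, unsigned count) {
    uint16_t taken = 0;  /* resources already handed out */
    uint32_t grants = 0;
    for (unsigned pc = 0; pc < count; pc++) {
        uint16_t want = requests[pc];
        /* Grant only if the full request can be fulfilled; a denied PC
         * leaves the resource vector untouched for the next PC. */
        if (want != 0 && (want & taken) == 0) {
            taken |= want;
            grants |= 1u << pc;
        }
    }
    return grants;
}
```

If PC 0 requests VFDL A, PC 1 requests VFDL A and BRAM A, and PC 2 requests BRAM A, then PC 1 is denied as a whole and BRAM A remains available to PC 2, which mirrors the all-or-nothing rule described above.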

4.2.2 Read/write interface

As said in the previous section, a common scenario for streaming applications is that one task writes into the same FIFO buffer that another task reads from. To optimize performance, two separate interfaces have been implemented for read and write requests instead of one common interface. The PC should then behave exactly like a FIFO buffer interface.

A PC read interface is described in table 4.2. A write interface looks exactly the same, the only difference being that the data is an output.

Figure 4.4: MMU layout

When En is asserted the PC samples the TID and TRFID. A read PC also samples the data. Until the PC replies with either an Ack or Err the inputs are regarded as don't cares.

The PC goes through three states, as figure 4.5 shows, before the data is read or written. A look at figure 4.3 may help in understanding how the PC works. In the first state the PC reads the VFDL. If the direction of the FIFO buffer is invalid the PC goes to the error state. If the PC is not granted access by the MRB it stays in the current state, otherwise it continues to the next state.

The PFID retrieved from the VFDL is used to look up the PFDL. Here the information about the FIFO is stored, and a physical read or write pointer can be calculated. If the FIFO buffer is full for a write request or empty for a read request, the PC goes to the error state. If the PC is not granted access by the MRB it stays in the current state, otherwise it continues to its final state.

In the third state the PC is ready to read or write the data. Here it also updates the read or write pointer before finally moving back to the idle state. If the En signal is still asserted the PC goes directly to the VFDL state. This is done for optimization reasons and is also how a standard FIFO buffer controller works. This allows streaming reads and writes.

Name        Type  Description
TID[2:0]    I     Task ID. Every task has a unique ID assigned to it, allowing the MMU to identify which task is requesting access. Three bits give a total of eight possible tasks.
TRFID[2:0]  I     Task Relative FIFO ID, tells which FIFO buffer the task requests. Three bits lead to a maximum of eight FIFO buffers per task.
Data[15:0]  I     Data in.
En          I     Enable, active high.
Ack         O     Acknowledge, the signal is raised after a successful read request has been finished.
Err         O     Error, the signal is raised on error.

Table 4.2: Read interface

Figure 4.5: PC state graph

The PC has no timeout, and a read or write request therefore has no upper time limit. In an MMU with a small number of PCs starvation is no problem, but if the number of PCs increases this could become troublesome.
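The state sequence of a PC described in this section can be written down as a next-state function. This is a behavioural sketch under assumptions: the names are hypothetical, the grant is checked before the direction or fill level (the lookup table has to be read first), and the error state is assumed to return to idle after Err has been signalled.

```c
typedef enum { PC_IDLE, PC_VFDL, PC_PFDL, PC_DATA, PC_ERROR } pc_state_t;

/* en:      En input of the PC
 * granted: the MRB grants the resources needed in the current state
 * dir_ok:  the direction stored in the VFDL matches the request
 * fifo_ok: the FIFO is not full (write) or not empty (read) */
pc_state_t pc_next(pc_state_t state, int en, int granted,
                   int dir_ok, int fifo_ok) {
    switch (state) {
    case PC_IDLE:
        return en ? PC_VFDL : PC_IDLE;
    case PC_VFDL:
        if (!granted) return PC_VFDL;   /* wait for the MRB */
        return dir_ok ? PC_PFDL : PC_ERROR;
    case PC_PFDL:
        if (!granted) return PC_PFDL;   /* wait for the MRB */
        return fifo_ok ? PC_DATA : PC_ERROR;
    case PC_DATA:
        /* Streaming: with En still asserted, skip the idle state. */
        return en ? PC_VFDL : PC_IDLE;
    case PC_ERROR:
        return PC_IDLE;                 /* assumption of this sketch */
    }
    return PC_IDLE;
}
```

The missing timeout is visible in the sketch: as long as `granted` stays low, the PC remains in the VFDL or PFDL state indefinitely.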

4.2.3 OSBridge interface

The OSBridge on the C-FPGA side is implemented as an OPB slave and has to deliver a response within 16 clock cycles, or else the command times out. Therefore the OSBridge PC in the MMU has been assigned the highest priority.

The OSBridge PC has an interface similar to that of a random access memory. Table 4.3 describes this in more detail. When the PC is enabled it sends a request to the MRB and asks for permission to access the resource given by the resource signal. The resource mapping is described in table 4.4. After the data has been read or written the PC replies with an acknowledge signal.

Name            Type  Description
Resource[2:0]   I     Resource selects which lookup table or memory device is addressed.
Address[19:0]   I     Address has a different mapping depending on which resource is selected. This is described in table 4.4.
En              I     The OS Bridge is enabled when En is high.
RNW             I     Read Not Write decides the direction of the OSB interface. If RNW is high data is read, else it is written.
Data In[15:0]   I     Data written to the OSB. For simplicity, separate data in and data out interfaces have been chosen instead of a tristate buffer.
Data Out[15:0]  O     Data read from the OSB.
Ack             O     The OSB answers with an acknowledge after each read or write request is finished.

Table 4.3: OSBridge interface

4.3 Result

The MMU has been tested with the test setup below.

• Music samples arrive at the R-FPGA over Ethernet and are written into a FIFO buffer. The C-FPGA monitors the fill level and requests more packets when the fill level goes below a given threshold.

• A hardware task generates a sawtooth that is written into a FIFO buffer.

• The C-FPGA writes a sawtooth over the OSBridge into a FIFO buffer. The C-FPGA also monitors the fill level of this FIFO and fills it up when the fill level goes below a given threshold.

• A loop back task reads data from one FIFO into another. This task is used to select the data source to be written into the Audio Tx FIFO buffer.

The tests have been performed with either a hardware task sawtooth or music samples delivered over Ethernet. These tests have shown that the data is kept consistent and private, assuming that the lookup tables, VFDL and PFDL, are correctly configured. The resource allocation conflicts are resolved, and with only five PCs starvation has not been observed as a problem.

Resource  Type         Address[a:b]
000       VFDL         [2:0]   TRFID
                       [5:3]   TID
001       PFDL         [1:0]   PFDL Field
                                 00  Type[2:0] & Size[5:0]
                                 01  Base Address
                                 10  Read Pointer
                                 11  Write Pointer
                       [5:2]   PFID
010       SRAM 1       [19:0]  SRAM 1 Address
011       SRAM 2       [19:0]  SRAM 2 Address
100       BRAM         [15:0]  BlockRAM Address
101       Direct Fifo  [3:0]   Direct Fifo ID
                                 0000  Ethernet Tx
                                 0001  Ethernet Arp
                                 0010  Ethernet Udp 1
                                 0011  Ethernet Udp 2
                                 0100  RS 232 Tx
                                 0101  RS 232 Rx
                                 0110  Audio Tx
                                 0111  Audio Rx 1
                                 1000  Audio Rx 2
                       [4]     Data/Status Select
                                 0  Data[15:0]
                                 1  Empty & Full & Count[13:0]

Table 4.4: OSBridge address mapping
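The address layout of table 4.4 translates directly into small helper functions for software on the C-FPGA. This is a minimal sketch: the enum and function names are hypothetical, and only the bit positions are taken from the table.

```c
#include <stdint.h>

/* Values of the Resource[2:0] signal, see table 4.4. */
enum {
    OSB_RES_VFDL = 0, OSB_RES_PFDL = 1, OSB_RES_SRAM1 = 2,
    OSB_RES_SRAM2 = 3, OSB_RES_BRAM = 4, OSB_RES_DIRECT_FIFO = 5
};

/* Values of the PFDL field selector in Address[1:0]. */
enum {
    PFDL_FIELD_TYPE_SIZE = 0, PFDL_FIELD_BASE = 1,
    PFDL_FIELD_READ_PTR = 2, PFDL_FIELD_WRITE_PTR = 3
};

/* PFDL address: PFID in bits [5:2], field selector in bits [1:0]. */
uint32_t osb_pfdl_address(unsigned pfid, unsigned field) {
    return ((pfid & 0xFu) << 2) | (field & 0x3u);
}

/* Direct FIFO address: data/status select in bit [4], FIFO ID in bits [3:0]. */
uint32_t osb_direct_fifo_address(unsigned fifo_id, int status) {
    return ((status ? 1u : 0u) << 4) | (fifo_id & 0xFu);
}
```

Monitoring the fill level of the Audio Tx FIFO (ID 0110) would then mean reading resource `OSB_RES_DIRECT_FIFO` at `osb_direct_fifo_address(6, 1)` and extracting Count[13:0] from the returned word.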

The possibility of controlling the lookup tables from the C-FPGA has proven very useful. Data flow in the R-FPGA can be controlled in a very flexible way, and adding a hardware task to a streaming application only needs a few updates of the lookup tables.

No exact performance figure has been established, but a theoretical value can be reasoned about. It is clear that the minimum access time to a FIFO buffer is 3 clock cycles. The overhead is introduced by the two lookup tables, which need one clock cycle each. For a system with only one PC this would mean a large overhead of 200 %. This is, though, not the targeted application.

The MMU is designed to be able to serve several tasks at a time. The design with the MRB allows the PCs to work in parallel. In fact, the PCs implement a three stage pipeline, which decreases the overhead.

The parallel design of the PCs has a disadvantage: they require a large number of wires, which results in long routing delays. When place and route has the whole FPGA area at its disposal, routing paths shorter than 20 ns are achieved, but when area constraints are applied the longest path no longer honors this timing constraint (see footnote 1).

4.4 Further work

Before the MMU was developed we had no idea how large the overhead would be, or if it was possible to implement a dynamic MMU for the needs of a RHWOS. A design was drawn up conceptually and implemented in hardware with the aim of having a working design. These are some of the possible improvements or extensions that could be made:

• The number of wires could be minimized by using a bus structure.

• The current version of the MMU offers FIFO buffers to the hardware tasks. Some tasks, though, need random access memory. This service could be added.

• BRAM and SRAM are the only supported memory technologies. The R-FPGA also has access to SDRAM cells, and support for these could be added.

1 Referring to [7]

Chapter 5

DMA Loader

Direct Memory Access (DMA) means hardware support for transferring large amounts of data. The classical approach of moving data is to read data into an internal register, in most cases wait several clock cycles before the data is successfully read, and finally write the data to another memory location. This is very inefficient since the CPU is occupied, and in general most of the time is spent waiting for data to be received and written.

A DMA controller takes over the task of moving the data. The CPU only has to set up a source address, a target address and the length of the data to be transferred. No data has to take a detour through the CPU's registers, and the CPU can use the time for other calculations. The speedup is significant for large amounts of data and for slow peripheral devices such as hard disks.

For the XFOS the bitstreams are located in the SDRAM cells. With the classical approach, a full reconfiguration has been measured to take 2.3 seconds. This is much slower than the theoretical value of 26 ms that equation 5.1 gives.

A full bitstream is 1,313,220 bytes long and the SelectMAP interface is capable of writing 1 byte at 50 MHz. The SDRAM cells store a total of 512 Mbit (64 MB), which corresponds to about 51 full bitstreams.

\[
\frac{1{,}313{,}220\ \text{bytes}}{50 \cdot 10^{6}\ \text{bytes/s}} = 26.2644\ \text{ms} \tag{5.1}
\]

The speedup is large enough to motivate the implementation of a DMA loader for reconfiguration of the R-FPGA.
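To put a number on the possible gain, the measured 2.3 s of the classical approach can be compared with the theoretical write time of equation 5.1. This is an upper bound only, since it ignores the time needed to erase the FPGA:

```latex
\[
\frac{2.3\ \text{s}}{26.2644\ \text{ms}} = \frac{2.3}{0.0262644} \approx 87.6
\]
```

A DMA loader that keeps the SelectMAP interface busy in every clock cycle could thus be almost two orders of magnitude faster than the CPU driven transfer.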

The MicroBlaze on the C-FPGA uses an On-chip Peripheral Bus (OPB) to connect peripheral devices. Each device is given a span of addresses starting at the Base Address and ending at the High Address. A core from Xilinx is used to access the SDRAM cells, and this core has to be replaced with an OPB slave that acts as SDRAM controller and DMA loader.

5.1 Specification

• The DMA loader is connected to an OPB bus, SDRAM cells and a SelectMAP interface. These interfaces are fixed and have to be implemented. The internal design of the DMA loader is free to choose.

• The DMA loader should be able to fully and partially reconfigure the R-FPGA over SelectMAP at maximum speed, that is 8 bits in parallel at 50 MHz.

• The SDRAM cells need to be refreshed to keep the data consistent. A suitable refresh strategy should be chosen and implemented.

• The SDRAM cells support single and burst mode read and write. Both single and burst mode should be implemented by the SDRAM controller, and the DMA loader should make use of the burst mode.

• The last sixteen addresses of the SDRAM address space are reserved as control and status registers.

5.2 Implementation

The DMA loader is implemented, as figure 5.1 shows, with one controller for each interface. The controller in the middle selects which process, OPB slave or DMA, is granted access to the SDRAM controller.

Figure 5.1: DMA system overview

5.2.1 SDRAM controller

SDRAM stands for Synchronous Dynamic Random Access Memory. Synchronous means that the memory cells are controlled by a small microprocessor. This microprocessor is clocked and executes commands synchronously. Dynamic means that the bits are stored in capacitors that lose their information if they are not continuously refreshed.

The C-FPGA has access to two parallel SDRAM cells from Micron [10]. They are 16 bit wide and each stores 256 Mbit. Together this yields a total storage capacity of 64 MB that can be accessed as 32 bit words. The cells are organized as 4 banks by 8192 rows by 512 columns by 16 bit.

The SDRAM cells may be clocked at up to 133 MHz, but since the system clock of the XFBoard runs at 50 MHz this is not a limiting constraint. A DCM (see footnote 1) could be used to generate a 133 MHz clock, but this has not been done in order to avoid two clock domains.

Just like any processor, the SDRAM microprocessor has a set of commands that may be executed. The most important are listed below.

NOP                 No Operation
ACTIVE              Select bank and activate row
READ                Select bank and column, and start READ burst
WRITE               Select bank and column, and start WRITE burst
PRECHARGE           Deactivate row in bank or banks
AUTO REFRESH        Refresh
LOAD MODE REGISTER  Load preferences into MODE REGISTER

The SDRAM controller state diagram is given by figure 5.2.

Figure 5.2: SDRAM controller state diagram

1 A DCM is an FPGA component that generates a faster output clock than the input clock.

Init

Before the SDRAM cells are ready to use, NOP commands have to be issued for the first 100 µs, followed by one precharge and two refresh commands. After that the SDRAM cells are ready to use.

Idle state

Idle is the default waiting state. All states move back to idle after finishing their job.

First, the controller checks if a refresh is needed. If not, it checks if a read or write command has been issued. If not, there is nothing to do and the SDRAM controller stays in the idle state. If yes, the read or write command may only be issued if the requested burst length (1, 2, 4 or 8) matches the burst length programmed in the mode register. If not, the mode register has to be programmed and the next state will be load mode register. If yes, the read or write command is issued.

Refresh state

The SDRAM controller first issues a precharge command followed by a refresh command and then returns to the idle state.

Load mode register state

The SDRAM cells may be read or written in burst mode, with a burst length of 1, 2, 4, 8 or full page (512). All these modes are supported. There is also an additional mode where a read or write command is initiated and later terminated by a burst terminate command. This mode is not supported.

The only case in which the SDRAM controller moves to the load mode register state is when the burst length has to be reprogrammed. The SDRAM controller moves back to the idle state after reprogramming the mode register.

Read state

After a read command has been issued to the SDRAM cells the SDRAM controller has to wait 2 clock cycles (see footnote 2) before the data may be read. When the data has been read successfully a Data Valid signal is asserted. In burst mode Data Valid is kept asserted as long as the data is valid.

After the read has finished the SDRAM controller moves back to the idle state.

Write state

The write command and the data are written to the SDRAM cells in the same clock cycle, and when the first byte is written a Data Valid signal is presented. In burst mode the Data Valid signal stays asserted as long as data is being written. One 32 bit word may be written per clock cycle.

2 The 2 clock cycles are given by the CAS latency. When the SDRAM cells are clocked at 133 MHz the CAS latency is 3 clock cycles.

[Timing diagram: CLK cycles T0 to T3; COMMAND: READ followed by NOP; DQ: DOUT after tAC; CAS latency = 2]

Figure 5.3: Read timing

Similar to the read command, the write command has a latency that has to be honored. After a write command no command other than NOP may be issued for the following two clock cycles.

[Timing diagram: CLK cycles T0 to T3; COMMAND: WRITE followed by NOP; ADDRESS: BANK, COL n; DQ: DIN n, DIN n + 1; burst length = 2]

Figure 5.4: Write timing

After the write command has finished the SDRAM controller moves back to the idle state.
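The decision taken in the idle state, as described at the beginning of this section, can be sketched as a small C function. The type and function names are hypothetical; only the order of the checks is taken from the text.

```c
typedef enum { SD_IDLE, SD_REFRESH, SD_LOAD_MODE, SD_READ, SD_WRITE } sd_state_t;
typedef enum { REQ_NONE, REQ_READ, REQ_WRITE } sd_request_t;

/* Next state when the SDRAM controller is in the idle state. */
sd_state_t sd_idle_next(int refresh_needed, sd_request_t request,
                        unsigned requested_burst, unsigned programmed_burst) {
    if (refresh_needed)
        return SD_REFRESH;          /* a pending refresh is checked first */
    if (request == REQ_NONE)
        return SD_IDLE;             /* nothing to do */
    if (requested_burst != programmed_burst)
        return SD_LOAD_MODE;        /* reprogram the burst length first */
    return request == REQ_READ ? SD_READ : SD_WRITE;
}
```

Because the load mode register state returns to idle, a request with a new burst length passes through this function twice: once to reprogram the mode register and once to issue the actual read or write.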

Refresh strategy

The SDRAM cells on the XFBoard have to be refreshed 8192 times per 64 ms in order to keep the data consistent. There are two strategies to handle this.

• All 8192 refreshes are issued once every 64 ms in a burst.

• One AUTO REFRESH command is issued every 7.8125 µs.

The first alternative is not a suitable strategy, since if a refresh burst is initiated in the middle of a DMA load the reconfiguration has to be halted.

The second strategy is much better and may, with a cleverly designed buffer, have no effect on a DMA load. The DMA controller reads data from the SDRAM cells into a FIFO buffer. This FIFO buffer is then emptied and the data is written to the SelectMAP interface.

The SelectMAP interface is capable of writing 8 bits per clock cycle. The SDRAM is faster and is, in burst mode, capable of reading one 32 bit word per clock cycle after a read command has been set up. If the FIFO buffer is chosen large enough, this gives the SDRAM controller sufficient time to execute an AUTO REFRESH command before the FIFO buffer runs empty.
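The refresh interval and a lower bound for the FIFO fill level follow from simple arithmetic. The sketch below is an estimate under stated assumptions, not the dimensioning used in the design: the function names are hypothetical, the SelectMAP side is assumed to drain one byte per clock cycle, and the stall length has to be supplied by the caller (for example the 8 clock cycles quoted for a refresh in section 5.2.2 plus the latency of the following read command).

```c
#include <stdint.h>

/* Clock cycles between two AUTO REFRESH commands, rounded down so that
 * the cells are never refreshed less often than required. */
uint32_t refresh_interval_cycles(uint32_t clock_hz, uint32_t refreshes,
                                 uint32_t period_ms) {
    return (uint32_t)(((uint64_t)clock_hz * period_ms) /
                      ((uint64_t)refreshes * 1000u));
}

/* Number of 32 bit words the FIFO must hold at the start of a stall of
 * `stall_cycles` so that a reader draining 1 byte per cycle never
 * finds it empty. */
uint32_t fifo_words_needed(uint32_t stall_cycles) {
    return (stall_cycles + 3u) / 4u;
}
```

At 50 MHz, `refresh_interval_cycles(50000000, 8192, 64)` yields 390 clock cycles, which corresponds to the 7.8125 µs interval of the second strategy.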

5.2.2 OPB slave

The On-chip Peripheral Bus (OPB) is a standardized on-chip bus. It is designed as figure 5.5 shows, with a bus master and one or more slaves. Each slave is assigned a base address and a high address. The master broadcasts messages to all slaves, and a slave replies only if the broadcast address lies within its address space. Since the slaves are connected to the bus with OR gates, a slave may only drive the bus when it is being addressed and enabled. If a slave misbehaves and does not follow that rule, the whole bus system fails.

Figure 5.5: OPB bus (see footnote 3)

An OPB slave has to decode its address itself in order to decide if it should reply or not. A valid address space must be divisible into a constant identifier and a variable address; hence 400-7ff is a valid address space but 400-8ff is not, since for the latter no constant identifier can be found.

3 Observe that not all OPB signals are presented in the figure.


Figure 5.6: OPB address decoding
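The rule that an address space must split into a constant identifier and a variable address can be checked mechanically: the bits in which base and high address differ must form one contiguous block of low bits, and the base address must be zero in these bits. The following C sketch illustrates this; the function names are hypothetical.

```c
#include <stdint.h>

/* Returns 1 if [base, high] can be decoded by comparing a constant
 * identifier in the upper bits, 0 otherwise. */
int opb_range_is_decodable(uint32_t base, uint32_t high) {
    if (high < base)
        return 0;
    uint32_t variable = base ^ high;       /* bits that differ */
    /* The differing bits must have the form 2^n - 1 ... */
    if ((variable & (variable + 1u)) != 0)
        return 0;
    /* ... and the base address must be all zero in these bits. */
    return (base & variable) == 0;
}

/* Address decoder of a slave: compare only the constant identifier. */
int opb_address_hit(uint32_t addr, uint32_t base, uint32_t high) {
    uint32_t variable = base ^ high;
    return (addr & ~variable) == (base & ~variable);
}
```

For the range 0x400 to 0x7ff the variable part is 0x3ff and the identifier is the constant upper part of the address, whereas for 0x400 to 0x8ff the differing bits 0xcff are not contiguous, so no constant identifier exists.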

The OPB slave of the DMA loader has an address space of 64 MB, where each byte corresponds to one address. The last 16 addresses are reserved for control and status registers for the DMA. They are described in tables 5.1 and 5.2.

Address      Description
3f:ff:ff:f0  Command/Status register. See table 5.2
3f:ff:ff:f4  Start Address
3f:ff:ff:f8  Stop Address
3f:ff:ff:fc  Reserved

Table 5.1: Hardware registers

Bit   Description
31:7  Reserved. Tied to '0'.
6     Start ≥ Stop address error - RO
5     Stop address not 4 byte aligned error - RO
4     Start address not 4 byte aligned error - RO
3     Error - RO
2     Done bit set - RO. The R-FPGA sets a Done bit when it is correctly programmed.
1     Reconfiguration mode - RW. 1 = Full reconfiguration, 0 = Partial reconfiguration
0     DMA Enable - RW. 1 = Enabled, 0 = Disabled/Ready. Writing a '1' to this bit enables the DMA controller. When the DMA controller has finished its task it resets this bit to '0'.

Table 5.2: Command/Status register

All signals from the master are valid as long as the enable signal is asserted, hence the slave does not have to latch any data. By default an OPB command times out after 16 clock cycles if no slave replies. The timeout may, though, be extended if a slave asserts the timeout signal. This feature is not used by the OPB slave.

If an address is decoded and found to lie between the base address and the high address, and not to be a hardware register, the OPB slave requests access to the SDRAM controller from the controller. Then it waits either for the SDRAM controller to reply with a Data Valid signal or for the OPB master to time out and deassert the enable signal.

During normal operation the timeout should not be a problem. In the worst case the OPB slave has to wait for a refresh (8 clock cycles), a reprogramming of the mode register (3 clock cycles) and a read command (5 clock cycles). The worst case thus just reaches the bound of 16 clock cycles.

It is also possible for the OPB slave to access the SDRAM controller during a DMA load, but no guarantee is given that the SDRAM controller will reply within the time limit of 16 clock cycles. The OPB slave has the lowest priority in the system, and if a read or write request is preceded by an SDRAM refresh followed by a DMA burst read, the SDRAM controller will not be able to deliver the data in time.

5.2.3 DMA

The R-FPGA may be reconfigured over the SelectMAP interface. This is capable of writing 8 parallel bits at 50 MHz, which leads to very short configuration times, as equation 5.1 shows.

The SelectMAP interface is described in table 5.3. To fully reconfigure the FPGA, init first has to be pulled low for at least 300 ns in order to initiate an erase of the FPGA. During this time program goes low to indicate that the FPGA is not ready to receive any program code. When program goes high again, CS and write may be set low and data written.

For a partial reconfiguration no initialization needs to be done. The SelectMAP process may directly set CS and write low and start writing data.

[Timing diagram: signals PROGRAM, INIT, CCLK, CS, WRITE and DATA[0:7] carrying Byte 0, Byte 1 up to Byte n; annotations: Byte 0 loaded, Byte 1 loaded, Byte n loaded, CCLK and WRITE ignored]

Figure 5.7: SelectMAP timing

Signal     D   Description
CClk       O   Output clock fed to the SelectMAP interface. Maximum clock frequency is 50 MHz.
Init       O   1 = No action, 0 = Erase the FPGA. Init should be pulled low for at least 300 ns to fully erase the FPGA configuration.
Program    I   1 = FPGA is ready to be programmed, 0 = FPGA is not ready to be programmed. Program is low during an erase of the FPGA. It goes high again when the FPGA is ready to receive a new configuration.
Data[7:0]  IO  The 8 data bits. Bit 7 is the least significant bit.
CS         O   1 = No valid data, 0 = Valid data. Chip select may be pulled high during a write sequence to temporarily halt the reconfiguration.
Write      O   1 = Read, 0 = Write
Done       I   1 = FPGA is correctly configured, 0 = FPGA is not correctly configured

Table 5.3: SelectMAP interface

The DMA is divided into two processes, SelectMAP and SDRAM, as figure 5.8 shows. The SDRAM process reads data from the SDRAM controller into a 32 bit FIFO buffer, while the SelectMAP process reads data from this buffer and writes it 8 bits at a time to the SelectMAP interface. The SDRAM process works as master, and when the DMA is enabled it activates the SelectMAP process by sending an enable signal to it.
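The width conversion done by the SelectMAP process can be modelled in C as follows. This is a sketch under assumptions that the text does not settle: the byte order within a 32 bit word is assumed to be most significant byte first, and whether the bits of each byte have to be mirrored depends on how the bitstream is stored and how Data[7:0] is wired. Both function names are hypothetical.

```c
#include <stdint.h>

/* Byte number `index` (0 to 3) of a 32 bit FIFO word,
 * most significant byte first (assumed order). */
uint8_t fifo_word_byte(uint32_t word, unsigned index) {
    return (uint8_t)(word >> (8u * (3u - (index & 3u))));
}

/* Mirror the bit order of a byte. This would be needed if the stored
 * bitstream treats bit 0 as least significant while the interface
 * treats bit 7 as least significant, as table 5.3 states. */
uint8_t mirror_bits(uint8_t b) {
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (uint8_t)((r << 1) | (b & 1u));
        b >>= 1;
    }
    return r;
}
```

One FIFO word thus produces four SelectMAP writes, which is why the SDRAM side only needs to deliver a new word every fourth clock cycle on average.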

Figure 5.8: DMA data flow

5.3 How a reconfiguration works

The following piece of C code describes how a DMA load may be set up. Since the DMA loader does not generate an interrupt, the CPU must poll the control/status register in order to find out whether the DMA has finished or not. The CPU is of course not obliged to poll the register if it is sure that no new reconfiguration will be initiated in the following 28 ms.

#include <stdio.h>

/* SDRAM_BASEADDR is expected to be defined elsewhere
   (base address of the OPB slave). */
void reconfigure(int startAddress, int stopAddress, int full) {
    /* Hardware registers, see table 5.1. They are declared volatile
       because the DMA loader changes them behind the CPU's back. */
    volatile int *controlReg = (volatile int *)(SDRAM_BASEADDR + 0x3ffffff0);
    volatile int *startReg   = (volatile int *)(SDRAM_BASEADDR + 0x3ffffff4);
    volatile int *stopReg    = (volatile int *)(SDRAM_BASEADDR + 0x3ffffff8);
    int status, i = 0;

    // Set up start-, stop- and control register
    *startReg = startAddress;
    *stopReg = stopAddress;
    *controlReg = (full << 1) | 1;

    // Poll the status register until the DMA is ready
    while ((*controlReg & 1) == 1) {
        i++;
    }
    printf("Reconfiguration ended after %i polls\n", i);

    // Check if an error occurred (bits 3 to 6, see table 5.2)
    status = *controlReg;
    if (status & (1 << 3)) {
        printf("An error occurred\n");
        if (status & (1 << 4))
            printf("Start address is not 4 byte aligned\n");
        if (status & (1 << 5))
            printf("Stop address is not 4 byte aligned\n");
        if (status & (1 << 6))
            printf("Start address is larger than stop address\n");
    }

    // Check if done bit is set
    if (status & (1 << 2))
        printf("Done bit is set\n");
    else
        printf("Done bit is not set\n");
}

5.4 Result

Tests have shown that the DMA loader works well for smaller systems. The data is kept consistent, the OPB slave replies within the 16 clock cycle timeout and the DMA writes data to the SelectMAP interface at maximum speed without being halted.

A full reconfiguration takes 28 ms, of which 2 ms come from erasing the FPGA and 26 ms from writing the program code. No buffer underruns that would force the reconfiguration to halt have been seen, even when the SDRAM controller needs to issue a refresh command.

When the DMA loader is integrated in the larger system on the C-FPGA with speed grade 4, place and route reports timing errors for the global system. One of the effects is that the DMA loader from time to time does not work properly due to timing violations. The system is stressed to its limits, and the constraining factor appears to be place and route, where the signals suffer from long routing paths and do not reach their targets within the given timing constraints.

The results with C-FPGA are better and the DMA loader works as it should.

5.5 Further work

• The SelectMAP interface does not only allow bitstreams to be written, it may also read bitstreams back from an already configured FPGA. Further work of interest would be to extend the DMA with read back functionality.

• The OPB slave may access the SDRAM controller during a DMA load, but no guarantee is given that the SDRAM controller will respond within the 16 clock cycle timeout. The OPB slave could extend this timeout by raising the timeout signal on the OPB bus if it believes it will be able to produce a response.

Chapter 6

Achievements and outlook

6.1 Achievements

During this master thesis I have developed components and services that areimportant for a RHWOS. The ethernet driver allows large amounts of datato be received and passed further to hardware tasks. This is important forstreaming applications that consume large amounts of data. The OSI layersare decoded, the payload is extracted and written to a FIFO bu�er.

In a multitasking streaming environment memory is most likely a crucialcomponent. The MMU o�ers the service of FIFO bu�ers, solves resourcecon�icts to the di�erent memory technologies and implements a priorityscheduling scheme that works well for a small number of hardware tasks.The C-FPGA has full access to all memory devices and internal registersthat give a powerful way of controlling and monitoring the data streams inthe R-FPGA. By reading and modifying the internal registers new FIFObu�ers may dynamically be allocated or deallocated. Status information, as�ll level, can be monitored and data streams can be steered in an arbitraryway.

The SDRAM DMA loader was implemented as an OPB slave for the MicroBlaze in the C-FPGA. The DMA runs at maximum speed without any interruption, which allows the R-FPGA to be fully reconfigured in 28 ms. For small partial bitstreams the reconfiguration time shrinks to a few milliseconds. The reconfiguration is fast enough that hardware preemption becomes a viable technology. The bitstreams are stored in SDRAM, and an SDRAM controller has been implemented that keeps the data consistent and reads and writes data in different burst modes.
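As a rough plausibility check of the 28 ms figure, the sketch below estimates the configuration time from the bitstream size. It assumes an 8-bit SelectMAP port that accepts one byte per clock cycle and a 50 MHz clock (derived from the 20 ns timing constraint mentioned in this chapter). The function name and the bitstream sizes used afterwards are illustrative and are not taken from the design.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption (not taken from the thesis): the SelectMAP port is 8 bits
 * wide and accepts one byte per clock cycle without wait states. */
#define SELECTMAP_BYTES_PER_CYCLE 1u

/* Estimated configuration time in microseconds for a bitstream of
 * `bytes` bytes at a configuration clock of `clock_hz` Hz. */
static uint64_t reconfig_time_us(uint64_t bytes, uint64_t clock_hz)
{
    assert(clock_hz > 0);
    uint64_t cycles = bytes / SELECTMAP_BYTES_PER_CYCLE;
    return (cycles * 1000000u) / clock_hz;
}
```

Under these assumptions, reconfig_time_us(1400000, 50000000) gives 28000 us, so a full reconfiguration in 28 ms corresponds to a bitstream of roughly 1.4 MB, and a hypothetical partial bitstream of 100 kB would load in about 2 ms.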

To prove the correctness of the components and services, the designs were first simulated in ModelSim and then tested in the real environment on the XFBoard. A larger RHWOS has been put together from the audio driver, MMU, OSBridge, Ethernet driver and one hardware task that writes a sawtooth signal to the MMU. The test has shown that the components work not only stand-alone, but also integrated in a larger RHWOS.

During the test, timing problems were encountered. An FPGA has, compared to an ASIC, a low routing density, and for large designs the routing delay is on the brink of violating the 20 ns timing constraint. To solve this problem, the longest paths need to be located and, if possible, broken into smaller pieces. The insight I take with me is that the longest path always has to be analyzed in detail and that the routing delay increases rapidly as a design grows.

6.2 Outlook

The work of [7] has shown that there is still a long way to go before a RHWOS becomes reality. The Virtex-II FPGA does support partial reconfiguration, but it is not intended for more than one or perhaps two fixed modules. Partially implementing a RHWOS has been a hard nut to crack for place and route and has often resulted in too long routing delays. In the future, better hardware support may become available, and we may see hybrid chips with a mixture of ASIC and FPGA technology. The RHWOS could then take advantage of the high performance and the logic and routing density of static logic, while the rest of the chip is left to FPGA technology.

For future work, hardware preemption would be a most interesting topic to explore further. A partial bitstream can be loaded within a few milliseconds, and just as processes in a software OS share one common processor, hardware tasks may share one common FPGA. Suitable applications must be found in which a context switch of a few milliseconds is acceptable. A scheduler must be chosen for this special scenario. A task must also either be stateless, or a way of extracting and saving the context of a task must be found.

Appendix A

MMU cook-book

The XFOS offers predefined commands and structures to communicate with the MMU over the OSBridge. The following code is defined in mmu.h.

//! Virtual FiFo Descriptor List Item
/*!
  An item in the virtual FiFo descriptor list contains information
  about the mapping between tasks and the read / write interfaces
  of the FiFos listed in the physical FiFo descriptor list.
*/
typedef struct XF_VFDL_t{
    Xuint8 TID;        //!< ID of the task connected to the FiFo
    Xuint8 TRFID;      //!< Task-relative FiFo ID
    Xuint8 direction;  //!< Access direction: read (0), write (1)
    Xuint8 PFID;       //!< ID of the physically connected FiFo
}XF_VFDL;

//! Physical FiFo Descriptor List Item
/*!
  An item in the physical FiFo descriptor list contains information
  about the type and the dimensions of a FiFo physically present in
  the design.
*/
typedef struct XF_PFDL_t{
    Xuint32 PFID;      //!< ID of the physical FiFo
    Xuint32 type;      //!< Type of the FiFo
    Xuint32 baseAddr;  //!< Base address of the FiFo (in blocks)
    Xuint32 size;      //!< Size of the FiFo (in blocks)
    Xuint32 rdPtr;     //!< Read pointer (in bytes), relative to
                       //   the base address converted to bytes
    Xuint32 wrPtr;     //!< Write pointer (in bytes), relative to
                       //   the base address converted to bytes
}XF_PFDL;

void mmu_readVFDL(XF_VFDL* vfdlPtr, Xuint32 length,
                  Xboolean sequential);
void mmu_writeVFDL(XF_VFDL* vfdlPtr, Xuint32 length,
                   Xboolean sequential);
void mmu_readPFDL(XF_PFDL* pfdlPtr, Xuint32 length,
                  Xboolean sequential);
void mmu_writePFDL(XF_PFDL* pfdlPtr, Xuint32 length,
                   Xboolean sequential);
Xuint32 mmu_writeToFifo(XF_PFDL* pfdlPtr, Xuint16 value,
                        Xboolean first);
Xuint32 mmu_readFromFifo(XF_PFDL* pfdlPtr, Xboolean first);
Xuint32 mmu_fifoFillLevel(XF_PFDL* pfdlPtr);
void mmu_dumpFifo(XF_PFDL* pfdlPtr, Xuint32 start,
                  Xuint32 end);
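The relation between the size, rdPtr and wrPtr fields can be illustrated with a small software model. The sketch below assumes that a FIFO is organised as a circular buffer in which equal pointers mean "empty", and that a block holds BLOCK_BYTES bytes. Both the block size and the names PfdlModel and model_fillLevel are hypothetical; the function is not the implementation of mmu_fifoFillLevel.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t Xuint32;   /* stand-in for the Xilinx type */

#define BLOCK_BYTES 512u    /* hypothetical block size in bytes */

/* Reduced copy of XF_PFDL with only the fields used here. */
typedef struct {
    Xuint32 size;   /* size of the FIFO in blocks */
    Xuint32 rdPtr;  /* read pointer in bytes, relative to the base */
    Xuint32 wrPtr;  /* write pointer in bytes, relative to the base */
} PfdlModel;

/* Number of bytes stored, assuming a circular buffer in which
 * rdPtr == wrPtr means "empty". */
static Xuint32 model_fillLevel(const PfdlModel *p)
{
    Xuint32 capacity = p->size * BLOCK_BYTES;
    assert(capacity > 0 && p->rdPtr < capacity && p->wrPtr < capacity);
    return (p->wrPtr + capacity - p->rdPtr) % capacity;
}
```

Adding the capacity before subtracting keeps the unsigned arithmetic correct when the write pointer has wrapped around and is numerically smaller than the read pointer.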

To illustrate how the MMU may be used, an example scenario, described by figures A.1 and A.2, has been chosen. Two hardware tasks, Saw 1 and Saw 2, generate sawtooth signals with different frequencies that are written to the MMU. The MMU routes the incoming data to the destination FIFO buffers given by the internal lookup tables. The Audio Driver reads data from a fixed FIFO buffer and passes them on to the audio codec.

The following code sets up a pointer in the PFDL to the Audio Tx FIFO buffer, as figure A.3 illustrates. The type field is set to 3, which corresponds to a direct FIFO buffer. The base address is set to 6, which points to the Audio Tx FIFO buffer. Since it is a direct FIFO buffer, size, read pointer and write pointer are ...

XF_PFDL pfdl;
pfdl.PFID = 0;
pfdl.type = 3;      // Direct FIFO
pfdl.baseAddr = 6;  // Audio TX
pfdl.size = 4;
pfdl.rdPtr = 0;
pfdl.wrPtr = 0;
mmu_writePFDL(&pfdl, 1, XFALSE);

The following code sets up a pointer for Saw 1 in the VFDL that points to the PFDL entry for the Audio Tx FIFO buffer. Saw 1 has TID 0 and writes to TRFID 0. Figure A.4 illustrates this.


Figure A.1: MMU demonstration setup


XF_VFDL vfdl;
vfdl.TID = 0;
vfdl.TRFID = 0;
vfdl.direction = 1;  // Write
vfdl.PFID = 0;
mmu_writeVFDL(&vfdl, 1, XFALSE);

The following code disables Saw 1 by setting its direction in the VFDL to read. As a result, the MMU answers with errors, because Saw 1 tries to write to a read FIFO buffer.

The code also enables Saw 2 in the same way as Saw 1 was enabled in the previous step. Figure A.5 illustrates this.




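XF_VFDL vfdl;
vfdl.TID = 0;        // Saw 1
vfdl.TRFID = 0;
vfdl.direction = 0;  // Read: disables writing for Saw 1
vfdl.PFID = 0;
mmu_writeVFDL(&vfdl, 1, XFALSE);

vfdl.TID = 1;        // Saw 2
vfdl.TRFID = 0;
vfdl.direction = 1;  // Write
vfdl.PFID = 0;
mmu_writeVFDL(&vfdl, 1, XFALSE);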
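The effect of the two VFDL entries can be followed in a small software model of the lookup. The sketch below is an assumption about how such a table could be searched, not a description of the MMU hardware; the type VfdlEntry, the function model_lookup and the result NO_FIFO are hypothetical names. An access is routed only when an entry with matching TID, TRFID and direction exists.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint8_t Xuint8;     /* stand-in for the Xilinx type */

#define DIR_READ  0u
#define DIR_WRITE 1u
#define NO_FIFO   (-1)      /* hypothetical "access error" result */

/* Same fields as XF_VFDL in mmu.h. */
typedef struct {
    Xuint8 TID;
    Xuint8 TRFID;
    Xuint8 direction;
    Xuint8 PFID;
} VfdlEntry;

/* Return the PFID a task access is mapped to, or NO_FIFO if no entry
 * matches the task, the task-relative FIFO ID and the direction. */
static int model_lookup(const VfdlEntry *table, size_t n,
                        Xuint8 tid, Xuint8 trfid, Xuint8 direction)
{
    assert(table != NULL);
    for (size_t i = 0; i < n; i++) {
        if (table[i].TID == tid && table[i].TRFID == trfid &&
            table[i].direction == direction) {
            return table[i].PFID;
        }
    }
    return NO_FIFO;
}

/* Table contents after the last step of the example. */
static const VfdlEntry example_table[] = {
    { 0, 0, DIR_READ,  0 },  /* Saw 1: no longer allowed to write */
    { 1, 0, DIR_WRITE, 0 },  /* Saw 2: writes to PFID 0 */
};
```

With this table, a write from Saw 1 (TID 0) finds no matching entry and yields NO_FIFO, while a write from Saw 2 (TID 1) is mapped to PFID 0, the Audio Tx FIFO buffer.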



Bibliography

[1] Reconfigurable Hardware OS Prototype. http://www.tik.ee.ethz.ch/~walder/HomePage/XFORCES/TIKR168.pdf

[2] Prototype Platform for Reconfigurable Hardware Operating Systems. http://www.tik.ee.ethz.ch/~walder/HomePage/XFORCES/TIKR193.pdf

[3] XForces. http://www.tik.ee.ethz.ch/~walder/

[4] Prototype Board for Reconfigurable OS. Semester thesis by Samuel Nobs. http://www.tik.ee.ethz.ch/~walder/HomePage/SADA/PrototypeBoardForReconfigurableOS/PBFROS.pdf

[5] Inter-Task-Communication in Reconfigurable Operating Systems. Master thesis by Andres Erni and Stefan Reichmuth. http://www.tik.ee.ethz.ch/~walder/HomePage/SADA/InterTaskCommunicationInReconfigurableOS/ITCROS_Report.pdf

[6] Reconfigurable Hardware OS Prototype "C". Master thesis by Samuel Nobs.

[7] Reconfigurable Hardware OS Prototype "R". Master thesis by Simon Steinegger.

[8] IP Protocol suite. http://www.networksorcery.com/enp/topic/ipsuite.htm

[9] Ethernet Frame Check Sequence. http://www2.rad.com/networks/1994/err_con/crc_how.htm

[10] Micron SDRAM. http://www.micron.com/products/dram/SDRAM/part.aspx?part=MT48LC16M16A2TG-7E

[11] Xilinx. http://www.xilinx.com

[12] Designing Custom OPB Slave Peripherals for MicroBlaze. http://www.xilinx.com

[13] Virtex FPGA Series Configuration and Readback. http://www.xilinx.com

[14] Virtex-II Platform FPGA User Guide. http://www.xilinx.com

[15] ETH Zurich. http://www.ethz.ch
