11case study on a software communications

7/27/2019 11Case Study on a Software Communications

1/5

Case Study on a Software Communications

Architecture Component for Hardware Acceleration

Rolando Duarte, Omar Granados, Chen Liu, Jean Andrian

Department of Electrical and Computer Engineering

Florida International University

Miami, FL 33174, U.S.A

{rduar002, ogran002, cliu, andrianj}@fiu.edu

Abstract In this paper we introduce an architecture for

hardware acceleration of Software Communication Architecture

(SCA) components. Intercommunication, as well as

responsiveness and power dissipation, is essential for the

components of SCA. Thus, we studied the SCA to explore the

hotspot function and optimize its overall performance to not only

operate efficiently but also in an energy-efficient way. We spotted

the amplifier function takes a significant amount of the execution

time which involves single-precision floating-point computation.

A case study on the implementation of the amplifier component

on the Virtex5 Field-Programmable Gate Array (FPGA)

platform is presented. Consequently, we make a comparison of

the hardware accelerated amplifier implementation employing a

floating-point unit (FPU) engine over pure software

implementation with no FPU support. We obtained a speedup of

up to 2x faster while minimizing the energy consumption by up to

100% with a power consumption increase of 0.68%, achieving

our goal.

Keywords- Software Communication Architecture, Software

Defined Radio, Energy Efficiency, Hardware Acceleration.

I. INTRODUCTIONThe vision of ubiquitous computing systems has lead to a

sharp increase in research efforts on the development and

deployment of devices whose underlying technology is

invisible to the end users. That is, such systems are designed

to allow users to perform a wide range of computational tasks

anytime and anywhere without them being aware of the

specific devices performing such tasks [1].

Because in ubiquitous computing the users environment is

likely to change frequently, one of the main requirements for

this vision to become a reality is that of spontaneous

interoperation between devices [2]. This means that as mobile

devices change environment, they should be able to

spontaneously communicate with any other device in their

new operating environment without manual intervention. The

complexity of this issue has increased considerably nowadays

with the significant number of wireless protocols available to

serve as the link between computing elements.

Recently, a promising technology known as software

defined radio (SDR) has gained a great deal of attention as a

possible way of enabling spontaneous interoperation among

devices. An SDR system is a radio in which all of the

baseband, intermediate frequency (IF), and part of the radio

frequency (RF) processing tasks are done exclusively through

software. This characteristic makes possible for the radio to

use intelligent algorithms to modify its transmission and

reception parameters depending on the changes it detects in its

operating environment, thus eliminating the need for user

intervention [3].

In order to deploy SDR systems in a truly ubiquitous

computational environment, mobility is a requirement and thussize, weight, and power (SWaP) are major constrains. In

addition to this, the system should be able to support

communication protocols of different levels of complexity.

Therefore, two main characteristics the SDR system must have

are high computational performance and low power

consumption. A promising approach that offers improvements

in the aforementioned areas is to apply hardware acceleration

techniques to the computation intensive components

implemented under the SDRs software architecture. A general

hardware-assisted approach was presented by Tang et al. [14].

However, their design mainly focused on accelerating the

garbage collection function in the middleware layer, which is

for a totally different application domain comparing with ours.

In [4], Lau et al. showed that increased execution speed could

be achieved by offloading some of the signal processing

functions of a general SDR system onto an FPGA-based

hardware accelerator. Since the Software Communications

Architecture (SCA) is the most widely accepted software

platform for SDR systems, in [5] Millage showed how

hardware acceleration of signal processing components

implemented in the SCA can be achieved using a graphics

processing unit (GPU). There, it is observed that execution

time is reduced compared to those of software components

running only on a general purpose processor (GPP). Although

because of parallelization GPUs have considerable

computational power, their use as hardware accelerators in

mobile devices is not very practical because of their

significantly high power consumption. Therefore, in contrast

with [5], in this work we propose a hardware acceleration

platform for SCA components using an FPGA in order to

achieve a balance between computational performance and

energy consumption. We also present an initial feasibility

study of such platform on a specific FPGA board. Since

hardware acceleration is only performed on the signal

processing portion of the code that implements a waveform

component, the resulting platform is still compliant with SCA

standards, thus it can be readily deployed for current military

Fourth International Conference on Ubi-Media Computing

978-0-7695-4493-9/11 $26.00 2011 IEEE

DOI 10.1109/U-MEDIA.2011.17

36


2/5

and commercial systems without backwa

issues.

The rest of the paper is organized as foll

an overview of software defined radio a

Communications Architecture (SCA) is pres

introduces the proposed architecture

acceleration and describes the specific FP

will be used to implement it. A studyconsumption and performance of an S

implemented with and without floating poin

presented in section IV. Lastly, concluding r

in section V.

II. BACKGROUNDa) Software Defined RadioSoftware defined radio (SDR) refers t

where the majority of the transmission an

components are realized through software. Th

such system is to allow the modification of

and protocols with no or minimal hardware

alterations to parameters such as mo

transmission power, or transmission frequeentirely through software. The block diagr

hardware (HW) components of an SDR syste

Fig. 1. In an ideal SDR system, most of the p

hardware element should be able to be reco

software in order to support a large var

protocols [6]. It can be observed that for this

to be implemented, enabling technologies s

processors, analog-to-digital and digital-to-a

(ADC and DAC), reconfigurable compone

frequency (RF) front end, as well as antenna e

to be properly chosen to comply with the re

waveforms they are aimed to support.

Z&D /&

Fig. 1 Block diagram of SDR Syst

As critical as the choice of hardware plat

as important to select the right software arch

system. Such architecture should provide p

different hardware platforms. Moreover, in

its adoption in commercial applications, th

software performance bottleneck should be

is, one has to ensure that the baseband s

components and the communication protoco

are as efficiently implemented as possible.

doing so can be reflected in lower computati

less energy consumption.

Although recently there has been an em

SDR software architectures including GN

PIM and PSM for Software Radio Compone

[7], and NASAs Space Telecommunicatio

rd compatibility

ws. In section II,

d the Software

nted. Section III

sing hardware

A platform that

on the energyCA component

t unit support is

marks are given

a radio system

d reception-path

e purpose behind

radio waveforms

hanges. That is,

ulation format,

cy can be doneam of the main

m is presented in

rameters of each

nfigured through

iety of wireless

type of platform

uch as baseband

nalog converters

nts of the radio

lements, all need

uirements of the

W

m

form, it is almost

itecture for SDR

ortability among

rder to facilitate

SDR systems

minimized. That

ignal processing

ls between them

The benefits of

nal intensity and

rgence of many

Radio, OMGs

nts Specification

n Radio System

(STRS) architecture [8], t

communications architecture (SC

facto standard of the SDR commu

work our focus is on SCA architect

b) The Software CommunicatThe software communication

developed as the standard soft

Department of Defenses (DoD)program known as Joint Tactical

introduction of this standard was

technology insertion, hardware

interoperability for SDR systems [

of the SCA is presented. Here,

through C4 represent the wavefor

context, the function of each o

perform a specific signal processin

In the SCA, all the services and i

component are defined in the SCA

figure it can be seen that wavefor

each other and with the core fra

bus provided by the Common

Architecture (CORBA). One of tarchitecture is that it allows differe

or waveform components) to run

units and to communicate with eac

useful when using only one proce

the computational requirements of

Fig. 2 Software Communications

III. SYSTEM ARa) Proposed System Architect

Acceleration

To take advantage of the ben

hardware acceleration of softwar

adhering to the SCA standard, o

processing routine for each

accelerated in our proposed archit

That is, all the core framewor

services to these accelerated com

allowing the component to be initi

e so-called software

) still remains as the de-

nity [9]. Therefore, in this

re.

ions Architecture

s architecture (SCA) was

ware framework for the

ilitary radio developmentadio System (JTRS). The

ainly aimed at facilitating

bstraction, and hardware

10]. In Fig. 2 the structure

the elements labeled C1

components. In the SDR

these components is to

g task within a waveform.

nterfaces available to each

core framework. From the

components interact with

ework through the logical

Object Request Broker

e crucial features of thisnt applications (waveforms

on distributed processing

h other. This is particularly

ssing unit does not satisfy

omplex waveforms.

rchitecture Structure [10]

HITECTURE

ure for Hardware

efits obtained through the

components, while still

ly the code of the signal

aveform component is

cture as shown in Fig. 3.

interfaces and CORBA

onents remain unchanged,

alized and also located by

37


3/5

other waveform components. However, the signal processing

portion of the code is replaced by a subroutine which calls

upon the hardware accelerated implementation. One of the

implications of this process is that data stored in CORBA type

variables must be converted and loaded to the memory

location of the hardware accelerator used to implement the

signal processing routine. In Fig. 3, the component to be

accelerated is denoted as C3 and the accelerated signalprocessing routine is C3_HW.

Fig. 3 Proposed SCA Architecture with Hardware Acceleration

Since the proposed system is to be implemented using an

FPGA board, in the following subsection a thorough

description of the characteristics of the hardware platform in

question is provided.

Fig. 4 System Architecture with FPU Support

b) Hardware PlatformWe used the Virtex5 FPGA platform [11] to construct the

system architecture shown in Fig. 4. FPGA is of great

advantages for SDR design since they can be reconfigured and

reprogrammable. MicroBlaze [12] is a soft-core 32-bit RISC

microprocessor operating at a clock rate of 125 MHz. It can be

configured to have a 5-stage pipeline (performance-optimized)

or 3-stage pipeline (area-optimized), barrel shifter, integer

multiplier/divider, floating point operation and cache support.

For our design, we elect to use the 5-stage pipeline and

floating point unit (FPU) support for IEEE single precision

format. MicroBlaze can access the local memory denoted as

Block RAM (BRAM) through the Local Memory Bus (LMB).

Thus, it accesses the instruction and data through the

Instruction Local Memory Bus (ILMB) and Data Local

Memory Bus (DLMB) respectively. Every instruction or data

access through the LMB takes 1 clock cycle and the LMB

must only be connected from processor to local memoryBRAM. In contrast, the Processor Local Bus (PLB) takes

about 7 clock cycles per access; however, it allows any

peripheral or custom hardware to connect to it. The PLB is

bidirectional and is 32-bit wide. We also employs a timer to

collect timing information such as the number of clock cycles

a specified operation takes on MicroBlaze or any other

peripheral connected through the PLB. Interrupt is used to

make the system interruptible. If Microblaze needs to perform

an urgent operation, then it can interrupt the current operation

to serve the urgent one. RS232_UART is used for displaying

the output on the host machine. The FPGA platform is a

memory-mapped system and every component is assigned a

memory range. Therefore, if MicroBlaze needs to access any

peripheral it only needs to provide the address location and/ordata. The address of MicroBlaze is 32-bit wide thus it can

access up to 0xFFFFFFFF memory location.

IV. EXPERIMENTAL RESULTSa) Experimental SetupFor the purpose of this study, a waveform implemented in

an SCA-based software architecture known as OSSIE was

initially profiled. It was observed that one of the components

on which the processor spent more time (i.e., higher

percentage of samples were taken during execution) was the

IQ amplifier. This was inferred as an indication that the

acceleration of this component might lead to significant

performance improvement as it is frequently accessed by the

processor.

Therefore, our initial feasibility study consisted of

executing the amplifier function from the SCA without the

FPU incorporated in the MicroBlaze architecture over the case

with FPU support. The amplifier function accepts seven

parameters to multiply the in_phase (I_in_0) and its gain

(I_gain) as well as the quadrature (Q_in_0) and its gain

(Q_gain). Now, I_gain and Q_gain are floating point while the

I_in_0 and Q_in_0 are integer vectors.

b) ResultsWe execute the amplifier function on MicroBlaze with

and without FPU support on data set from 128 Bytes to 64

Kbyte (local memory maximum capacity). The clock cycle,execution time and energy consumption readings are collected

and presented in Table I and Table II respectively, with

doubling the data set size each time.

We first enable the FPU support and run the amplifier

with data set of 128 Bytes. From Table I, we can see that it

takes 17842 clock cycles to perform floating-point

multiplication of 128 Bytes data set. The microprocessor runs

at 125 MHz. Given that execution time equals to the number

38


4/5

of clock cycles multiply the unit clock

execution time we obtained is 0.1427 ms. F

can observe that it takes 32651 clock cy

floating-point multiplication of 128 Bytes

execution time is 0.2612 ms.

Next, we utilized XPower Analyzer [12]

consumption of the system. Xpower Analyze

dedicated to analyze the power consumptimplemented place and routed design. It coll

about the hardware design and provides ac

about the power utilization. By using the

tool, the overall power consumption for the

can be measured, which are shown in Fig. 4.

that with FPU support the total power consu

Watts, of which 1.1105 is static power and 0.

power. If we disable the FPU unit, the static

and dynamic power is 0.1463, respectively, w

consumption of 1.2565 watts. Adding the fl

support inquires more hardware thus it wil

power as well. Based on our measuremen

overhead incurred by FPU corresponds to a 0

power consumption. The system architec

support requires 415 slice register used as fl

look-up table (LUT), which constitute to 21

flip-flops and LUTs resource respectively w

the system that does not use the FPU engine.

Table I - MicroBlaze with FPU Suppo

# Bytes

Clk Cycles(10^6)

Exec Time(ms)

Energy(mJoule

128 0.0178 0.1427 0.1805

256 0.0345 0.2799 0.3541

512 0.0658 0.5484 0.6937

1K 0.1341 1.0731 1.3576

2K 0.2622 2.0983 2.6544

4K 0.5124 4.0996 5.1862

8K 1.0005 8.0041 10.1255

16K 1.9521 15.6166 19.7557

32K 3.8061 30.4487 38.5188

64K 7.4158 59.3265 75.0504

Table II - MicroBlaze without FPU Sup

# Bytes

Clk Cycles

(10^6)

Exec Time

(ms)

Energy

(mJoule

128 0.0326 0.2612 0.3270

256 0.0645 0.5164 0.6465

512 0.1275 1.0207 1.2778

1K 0.2521 2.0168 2.5248

2K 0.4984 3.9875 4.9919

4K 0.9855 7.8845 9.8705

8K 1.9475 15.1504 19.5048

16K 3.8469 30.7755 38.5272

32K 7.5965 60.7725 76.0798

64K 14.9975 119.9802 150.200

cycle time, the

om Table II, we

cles to perform

data set. The

to get the power

r is a Xilinx tool

ion for a post-ects information

urate estimation

power Analyzer

hardware system

Our results show

ption is 1.2650

1545 is dynamic

power is 1.1102

ith a total power

ating point unit

l consume more

t, the hardware

.68% increase in

ture with FPU

ip flops and 771

and 40% more

en compared to

rt

)uJoules /

Bytes

1.4106

1.3834

1.3549

1.3257

1.2961

1.2661

1.2360

1.2057

1.1755

1.1451

port

)

uJoules /

Bytes

2.5546

2.5256

2.4957

2.4656

2.4374

2.4097

2.3809

2.3515

2.3217

2.2918

Since we get the power concalculate the energy consumed d

Thus, the 128-Byte data set has0.1805 mJoules as shown in Tabl

consumed for the same data set

accordingly.

Fig. 5 Performance of FPU o

Fig. 6 Energy Reduction of FP

Fig. 7 Normalized Energy per Byte C

<

W

/

11case study on a software communications

Documents