11case study on a software communications
TRANSCRIPT
-
7/27/2019 11Case Study on a Software Communications
1/5
Case Study on a Software Communications
Architecture Component for Hardware Acceleration
Rolando Duarte, Omar Granados, Chen Liu, Jean Andrian
Department of Electrical and Computer Engineering
Florida International University
Miami, FL 33174, U.S.A
{rduar002, ogran002, cliu, andrianj}@fiu.edu
Abstract In this paper we introduce an architecture for
hardware acceleration of Software Communication Architecture
(SCA) components. Intercommunication, as well as
responsiveness and power dissipation, is essential for the
components of SCA. Thus, we studied the SCA to explore the
hotspot function and optimize its overall performance to not only
operate efficiently but also in an energy-efficient way. We spotted
the amplifier function takes a significant amount of the execution
time which involves single-precision floating-point computation.
A case study on the implementation of the amplifier component
on the Virtex5 Field-Programmable Gate Array (FPGA)
platform is presented. Consequently, we make a comparison of
the hardware accelerated amplifier implementation employing a
floating-point unit (FPU) engine over pure software
implementation with no FPU support. We obtained a speedup of
up to 2x faster while minimizing the energy consumption by up to
100% with a power consumption increase of 0.68%, achieving
our goal.
Keywords- Software Communication Architecture, Software
Defined Radio, Energy Efficiency, Hardware Acceleration.
I. INTRODUCTIONThe vision of ubiquitous computing systems has lead to a
sharp increase in research efforts on the development and
deployment of devices whose underlying technology is
invisible to the end users. That is, such systems are designed
to allow users to perform a wide range of computational tasks
anytime and anywhere without them being aware of the
specific devices performing such tasks [1].
Because in ubiquitous computing the users environment is
likely to change frequently, one of the main requirements for
this vision to become a reality is that of spontaneous
interoperation between devices [2]. This means that as mobile
devices change environment, they should be able to
spontaneously communicate with any other device in their
new operating environment without manual intervention. The
complexity of this issue has increased considerably nowadays
with the significant number of wireless protocols available to
serve as the link between computing elements.
Recently, a promising technology known as software
defined radio (SDR) has gained a great deal of attention as a
possible way of enabling spontaneous interoperation among
devices. An SDR system is a radio in which all of the
baseband, intermediate frequency (IF), and part of the radio
frequency (RF) processing tasks are done exclusively through
software. This characteristic makes possible for the radio to
use intelligent algorithms to modify its transmission and
reception parameters depending on the changes it detects in its
operating environment, thus eliminating the need for user
intervention [3].
In order to deploy SDR systems in a truly ubiquitous
computational environment, mobility is a requirement and thussize, weight, and power (SWaP) are major constrains. In
addition to this, the system should be able to support
communication protocols of different levels of complexity.
Therefore, two main characteristics the SDR system must have
are high computational performance and low power
consumption. A promising approach that offers improvements
in the aforementioned areas is to apply hardware acceleration
techniques to the computation intensive components
implemented under the SDRs software architecture. A general
hardware-assisted approach was presented by Tang et al. [14].
However, their design mainly focused on accelerating the
garbage collection function in the middleware layer, which is
for a totally different application domain comparing with ours.
In [4], Lau et al. showed that increased execution speed could
be achieved by offloading some of the signal processing
functions of a general SDR system onto an FPGA-based
hardware accelerator. Since the Software Communications
Architecture (SCA) is the most widely accepted software
platform for SDR systems, in [5] Millage showed how
hardware acceleration of signal processing components
implemented in the SCA can be achieved using a graphics
processing unit (GPU). There, it is observed that execution
time is reduced compared to those of software components
running only on a general purpose processor (GPP). Although
because of parallelization GPUs have considerable
computational power, their use as hardware accelerators in
mobile devices is not very practical because of their
significantly high power consumption. Therefore, in contrast
with [5], in this work we propose a hardware acceleration
platform for SCA components using an FPGA in order to
achieve a balance between computational performance and
energy consumption. We also present an initial feasibility
study of such platform on a specific FPGA board. Since
hardware acceleration is only performed on the signal
processing portion of the code that implements a waveform
component, the resulting platform is still compliant with SCA
standards, thus it can be readily deployed for current military
Fourth International Conference on Ubi-Media Computing
978-0-7695-4493-9/11 $26.00 2011 IEEE
DOI 10.1109/U-MEDIA.2011.17
36
-
7/27/2019 11Case Study on a Software Communications
2/5
and commercial systems without backwa
issues.
The rest of the paper is organized as foll
an overview of software defined radio a
Communications Architecture (SCA) is pres
introduces the proposed architecture
acceleration and describes the specific FP
will be used to implement it. A studyconsumption and performance of an S
implemented with and without floating poin
presented in section IV. Lastly, concluding r
in section V.
II. BACKGROUNDa) Software Defined RadioSoftware defined radio (SDR) refers t
where the majority of the transmission an
components are realized through software. Th
such system is to allow the modification of
and protocols with no or minimal hardware
alterations to parameters such as mo
transmission power, or transmission frequeentirely through software. The block diagr
hardware (HW) components of an SDR syste
Fig. 1. In an ideal SDR system, most of the p
hardware element should be able to be reco
software in order to support a large var
protocols [6]. It can be observed that for this
to be implemented, enabling technologies s
processors, analog-to-digital and digital-to-a
(ADC and DAC), reconfigurable compone
frequency (RF) front end, as well as antenna e
to be properly chosen to comply with the re
waveforms they are aimed to support.
Z&D /&
Fig. 1 Block diagram of SDR Syst
As critical as the choice of hardware plat
as important to select the right software arch
system. Such architecture should provide p
different hardware platforms. Moreover, in
its adoption in commercial applications, th
software performance bottleneck should be
is, one has to ensure that the baseband s
components and the communication protoco
are as efficiently implemented as possible.
doing so can be reflected in lower computati
less energy consumption.
Although recently there has been an em
SDR software architectures including GN
PIM and PSM for Software Radio Compone
[7], and NASAs Space Telecommunicatio
rd compatibility
ws. In section II,
d the Software
nted. Section III
sing hardware
A platform that
on the energyCA component
t unit support is
marks are given
a radio system
d reception-path
e purpose behind
radio waveforms
hanges. That is,
ulation format,
cy can be doneam of the main
m is presented in
rameters of each
nfigured through
iety of wireless
type of platform
uch as baseband
nalog converters
nts of the radio
lements, all need
uirements of the
W
m
form, it is almost
itecture for SDR
ortability among
rder to facilitate
SDR systems
minimized. That
ignal processing
ls between them
The benefits of
nal intensity and
rgence of many
Radio, OMGs
nts Specification
n Radio System
(STRS) architecture [8], t
communications architecture (SC
facto standard of the SDR commu
work our focus is on SCA architect
b) The Software CommunicatThe software communication
developed as the standard soft
Department of Defenses (DoD)program known as Joint Tactical
introduction of this standard was
technology insertion, hardware
interoperability for SDR systems [
of the SCA is presented. Here,
through C4 represent the wavefor
context, the function of each o
perform a specific signal processin
In the SCA, all the services and i
component are defined in the SCA
figure it can be seen that wavefor
each other and with the core fra
bus provided by the Common
Architecture (CORBA). One of tarchitecture is that it allows differe
or waveform components) to run
units and to communicate with eac
useful when using only one proce
the computational requirements of
Fig. 2 Software Communications
III. SYSTEM ARa) Proposed System Architect
Acceleration
To take advantage of the ben
hardware acceleration of softwar
adhering to the SCA standard, o
processing routine for each
accelerated in our proposed archit
That is, all the core framewor
services to these accelerated com
allowing the component to be initi
e so-called software
) still remains as the de-
nity [9]. Therefore, in this
re.
ions Architecture
s architecture (SCA) was
ware framework for the
ilitary radio developmentadio System (JTRS). The
ainly aimed at facilitating
bstraction, and hardware
10]. In Fig. 2 the structure
the elements labeled C1
components. In the SDR
these components is to
g task within a waveform.
nterfaces available to each
core framework. From the
components interact with
ework through the logical
Object Request Broker
e crucial features of thisnt applications (waveforms
on distributed processing
h other. This is particularly
ssing unit does not satisfy
omplex waveforms.
rchitecture Structure [10]
HITECTURE
ure for Hardware
efits obtained through the
components, while still
ly the code of the signal
aveform component is
cture as shown in Fig. 3.
interfaces and CORBA
onents remain unchanged,
alized and also located by
37
-
7/27/2019 11Case Study on a Software Communications
3/5
other waveform components. However, the signal processing
portion of the code is replaced by a subroutine which calls
upon the hardware accelerated implementation. One of the
implications of this process is that data stored in CORBA type
variables must be converted and loaded to the memory
location of the hardware accelerator used to implement the
signal processing routine. In Fig. 3, the component to be
accelerated is denoted as C3 and the accelerated signalprocessing routine is C3_HW.
Fig. 3 Proposed SCA Architecture with Hardware Acceleration
Since the proposed system is to be implemented using an
FPGA board, in the following subsection a thorough
description of the characteristics of the hardware platform in
question is provided.
Fig. 4 System Architecture with FPU Support
b) Hardware PlatformWe used the Virtex5 FPGA platform [11] to construct the
system architecture shown in Fig. 4. FPGA is of great
advantages for SDR design since they can be reconfigured and
reprogrammable. MicroBlaze [12] is a soft-core 32-bit RISC
microprocessor operating at a clock rate of 125 MHz. It can be
configured to have a 5-stage pipeline (performance-optimized)
or 3-stage pipeline (area-optimized), barrel shifter, integer
multiplier/divider, floating point operation and cache support.
For our design, we elect to use the 5-stage pipeline and
floating point unit (FPU) support for IEEE single precision
format. MicroBlaze can access the local memory denoted as
Block RAM (BRAM) through the Local Memory Bus (LMB).
Thus, it accesses the instruction and data through the
Instruction Local Memory Bus (ILMB) and Data Local
Memory Bus (DLMB) respectively. Every instruction or data
access through the LMB takes 1 clock cycle and the LMB
must only be connected from processor to local memoryBRAM. In contrast, the Processor Local Bus (PLB) takes
about 7 clock cycles per access; however, it allows any
peripheral or custom hardware to connect to it. The PLB is
bidirectional and is 32-bit wide. We also employs a timer to
collect timing information such as the number of clock cycles
a specified operation takes on MicroBlaze or any other
peripheral connected through the PLB. Interrupt is used to
make the system interruptible. If Microblaze needs to perform
an urgent operation, then it can interrupt the current operation
to serve the urgent one. RS232_UART is used for displaying
the output on the host machine. The FPGA platform is a
memory-mapped system and every component is assigned a
memory range. Therefore, if MicroBlaze needs to access any
peripheral it only needs to provide the address location and/ordata. The address of MicroBlaze is 32-bit wide thus it can
access up to 0xFFFFFFFF memory location.
IV. EXPERIMENTAL RESULTSa) Experimental SetupFor the purpose of this study, a waveform implemented in
an SCA-based software architecture known as OSSIE was
initially profiled. It was observed that one of the components
on which the processor spent more time (i.e., higher
percentage of samples were taken during execution) was the
IQ amplifier. This was inferred as an indication that the
acceleration of this component might lead to significant
performance improvement as it is frequently accessed by the
processor.
Therefore, our initial feasibility study consisted of
executing the amplifier function from the SCA without the
FPU incorporated in the MicroBlaze architecture over the case
with FPU support. The amplifier function accepts seven
parameters to multiply the in_phase (I_in_0) and its gain
(I_gain) as well as the quadrature (Q_in_0) and its gain
(Q_gain). Now, I_gain and Q_gain are floating point while the
I_in_0 and Q_in_0 are integer vectors.
b) ResultsWe execute the amplifier function on MicroBlaze with
and without FPU support on data set from 128 Bytes to 64
Kbyte (local memory maximum capacity). The clock cycle,execution time and energy consumption readings are collected
and presented in Table I and Table II respectively, with
doubling the data set size each time.
We first enable the FPU support and run the amplifier
with data set of 128 Bytes. From Table I, we can see that it
takes 17842 clock cycles to perform floating-point
multiplication of 128 Bytes data set. The microprocessor runs
at 125 MHz. Given that execution time equals to the number
38
-
7/27/2019 11Case Study on a Software Communications
4/5
of clock cycles multiply the unit clock
execution time we obtained is 0.1427 ms. F
can observe that it takes 32651 clock cy
floating-point multiplication of 128 Bytes
execution time is 0.2612 ms.
Next, we utilized XPower Analyzer [12]
consumption of the system. Xpower Analyze
dedicated to analyze the power consumptimplemented place and routed design. It coll
about the hardware design and provides ac
about the power utilization. By using the
tool, the overall power consumption for the
can be measured, which are shown in Fig. 4.
that with FPU support the total power consu
Watts, of which 1.1105 is static power and 0.
power. If we disable the FPU unit, the static
and dynamic power is 0.1463, respectively, w
consumption of 1.2565 watts. Adding the fl
support inquires more hardware thus it wil
power as well. Based on our measuremen
overhead incurred by FPU corresponds to a 0
power consumption. The system architec
support requires 415 slice register used as fl
look-up table (LUT), which constitute to 21
flip-flops and LUTs resource respectively w
the system that does not use the FPU engine.
Table I - MicroBlaze with FPU Suppo
# Bytes
Clk Cycles(10^6)
Exec Time(ms)
Energy(mJoule
128 0.0178 0.1427 0.1805
256 0.0345 0.2799 0.3541
512 0.0658 0.5484 0.6937
1K 0.1341 1.0731 1.3576
2K 0.2622 2.0983 2.6544
4K 0.5124 4.0996 5.1862
8K 1.0005 8.0041 10.1255
16K 1.9521 15.6166 19.7557
32K 3.8061 30.4487 38.5188
64K 7.4158 59.3265 75.0504
Table II - MicroBlaze without FPU Sup
# Bytes
Clk Cycles
(10^6)
Exec Time
(ms)
Energy
(mJoule
128 0.0326 0.2612 0.3270
256 0.0645 0.5164 0.6465
512 0.1275 1.0207 1.2778
1K 0.2521 2.0168 2.5248
2K 0.4984 3.9875 4.9919
4K 0.9855 7.8845 9.8705
8K 1.9475 15.1504 19.5048
16K 3.8469 30.7755 38.5272
32K 7.5965 60.7725 76.0798
64K 14.9975 119.9802 150.200
cycle time, the
om Table II, we
cles to perform
data set. The
to get the power
r is a Xilinx tool
ion for a post-ects information
urate estimation
power Analyzer
hardware system
Our results show
ption is 1.2650
1545 is dynamic
power is 1.1102
ith a total power
ating point unit
l consume more
t, the hardware
.68% increase in
ture with FPU
ip flops and 771
and 40% more
en compared to
rt
)uJoules /
Bytes
1.4106
1.3834
1.3549
1.3257
1.2961
1.2661
1.2360
1.2057
1.1755
1.1451
port
)
uJoules /
Bytes
2.5546
2.5256
2.4957
2.4656
2.4374
2.4097
2.3809
2.3515
2.3217
2.2918
Since we get the power concalculate the energy consumed d
Thus, the 128-Byte data set has0.1805 mJoules as shown in Tabl
consumed for the same data set
accordingly.
Fig. 5 Performance of FPU o
Fig. 6 Energy Reduction of FP
Fig. 7 Normalized Energy per Byte C
<
W
/