11case study on a software communications

Upload: surbhi-prasad

Post on 02-Apr-2018


  • 7/27/2019 11Case Study on a Software Communications

    1/5

    Case Study on a Software Communications

    Architecture Component for Hardware Acceleration

    Rolando Duarte, Omar Granados, Chen Liu, Jean Andrian

    Department of Electrical and Computer Engineering

    Florida International University

    Miami, FL 33174, U.S.A

    {rduar002, ogran002, cliu, andrianj}@fiu.edu

    Abstract- In this paper we introduce an architecture for

    hardware acceleration of Software Communication Architecture

    (SCA) components. Intercommunication, as well as

    responsiveness and power dissipation, is essential for the

    components of SCA. Thus, we studied the SCA to identify its

    hotspot function and optimize its overall performance so that it

    operates not only quickly but also energy-efficiently. We found that

    the amplifier function, which involves single-precision floating-point

    computation, takes a significant amount of the execution time.

    A case study on the implementation of the amplifier component

    on the Virtex5 Field-Programmable Gate Array (FPGA)

    platform is presented. We then compare the hardware-accelerated

    amplifier implementation, which employs a floating-point unit

    (FPU) engine, against a pure software implementation with no

    FPU support. We obtained a speedup of up to 2x, with the

    software-only implementation consuming up to 100% more

    energy, at the cost of a 0.68% increase in power consumption,

    achieving our goal.

    Keywords- Software Communication Architecture, Software

    Defined Radio, Energy Efficiency, Hardware Acceleration.

    I. INTRODUCTION

    The vision of ubiquitous computing systems has led to a

    sharp increase in research efforts on the development and

    deployment of devices whose underlying technology is

    invisible to the end users. That is, such systems are designed

    to allow users to perform a wide range of computational tasks

    anytime and anywhere without them being aware of the

    specific devices performing such tasks [1].

    Because in ubiquitous computing the user's environment is

    likely to change frequently, one of the main requirements for

    this vision to become a reality is that of spontaneous

    interoperation between devices [2]. This means that as mobile

    devices change environment, they should be able to

    spontaneously communicate with any other device in their

    new operating environment without manual intervention. The

    complexity of this issue has increased considerably nowadays

    with the significant number of wireless protocols available to

    serve as the link between computing elements.

    Recently, a promising technology known as software

    defined radio (SDR) has gained a great deal of attention as a

    possible way of enabling spontaneous interoperation among

    devices. An SDR system is a radio in which all of the

    baseband, intermediate frequency (IF), and part of the radio

    frequency (RF) processing tasks are done exclusively through

    software. This characteristic makes it possible for the radio to

    use intelligent algorithms to modify its transmission and

    reception parameters depending on the changes it detects in its

    operating environment, thus eliminating the need for user

    intervention [3].

    In order to deploy SDR systems in a truly ubiquitous

    computational environment, mobility is a requirement and thus

    size, weight, and power (SWaP) are major constraints. In

    addition to this, the system should be able to support

    communication protocols of different levels of complexity.

    Therefore, two main characteristics the SDR system must have

    are high computational performance and low power

    consumption. A promising approach that offers improvements

    in the aforementioned areas is to apply hardware acceleration

    techniques to the computation-intensive components

    implemented under the SDR's software architecture. A general

    hardware-assisted approach was presented by Tang et al. [14].

    However, their design mainly focused on accelerating the

    garbage collection function in the middleware layer, which is

    a totally different application domain from ours.

    In [4], Lau et al. showed that increased execution speed could

    be achieved by offloading some of the signal processing

    functions of a general SDR system onto an FPGA-based

    hardware accelerator. Since the Software Communications

    Architecture (SCA) is the most widely accepted software

    platform for SDR systems, in [5] Millage showed how

    hardware acceleration of signal processing components

    implemented in the SCA can be achieved using a graphics

    processing unit (GPU). There, execution time was observed to be

    reduced compared to that of software components

    running only on a general purpose processor (GPP). Although

    GPUs offer considerable computational power through

    parallelization, their use as hardware accelerators in

    mobile devices is not very practical because of their

    high power consumption. Therefore, in contrast

    with [5], in this work we propose a hardware acceleration

    platform for SCA components using an FPGA in order to

    achieve a balance between computational performance and

    energy consumption. We also present an initial feasibility

    study of such a platform on a specific FPGA board. Since

    hardware acceleration is only performed on the signal

    processing portion of the code that implements a waveform

    component, the resulting platform is still compliant with SCA

    standards, thus it can be readily deployed for current military

    Fourth International Conference on Ubi-Media Computing

    978-0-7695-4493-9/11 $26.00 2011 IEEE

    DOI 10.1109/U-MEDIA.2011.17


    and commercial systems without backward compatibility

    issues.

    The rest of the paper is organized as follows. In section II,

    an overview of software defined radio and the Software

    Communications Architecture (SCA) is presented. Section III

    introduces the proposed architecture using hardware

    acceleration and describes the specific FPGA platform that

    will be used to implement it. A study on the energy

    consumption and performance of an SCA component

    implemented with and without floating point unit support is

    presented in section IV. Lastly, concluding remarks are given

    in section V.

    II. BACKGROUND

    a) Software Defined Radio

    Software defined radio (SDR) refers to a radio system

    where the majority of the transmission and reception-path

    components are realized through software. The purpose behind

    such system is to allow the modification of radio waveforms

    and protocols with no or minimal hardware changes. That is,

    alterations to parameters such as modulation format,

    transmission power, or transmission frequency can be done

    entirely through software. The block diagram of the main

    hardware (HW) components of an SDR system is presented in

    Fig. 1. In an ideal SDR system, most of the parameters of each

    hardware element should be able to be reconfigured through

    software in order to support a large variety of wireless

    protocols [6]. It can be observed that for this type of platform

    to be implemented, enabling technologies such as baseband

    processors, analog-to-digital and digital-to-analog converters

    (ADC and DAC), reconfigurable components of the radio

    frequency (RF) front end, as well as antenna elements, all need

    to be properly chosen to comply with the requirements of the

    waveforms they are aimed to support.

    Fig. 1 Block diagram of SDR System

    As critical as the choice of hardware platform, it is almost

    as important to select the right software architecture for SDR

    system. Such architecture should provide portability among

    different hardware platforms. Moreover, in order to facilitate

    its adoption in commercial applications, the SDR system's

    software performance bottleneck should be minimized. That

    is, one has to ensure that the baseband signal processing

    components and the communication protocols between them

    are as efficiently implemented as possible. The benefits of

    doing so can be reflected in lower computational intensity and

    less energy consumption.

    Although recently there has been an emergence of many

    SDR software architectures including GNU Radio, OMG's

    PIM and PSM for Software Radio Components Specification

    [7], and NASA's Space Telecommunication Radio System

    (STRS) architecture [8], the so-called software

    communications architecture (SCA) still remains as the

    de-facto standard of the SDR community [9]. Therefore, in this

    work our focus is on SCA architecture.

    b) The Software Communications Architecture

    The software communications architecture (SCA) was

    developed as the standard software framework for the

    Department of Defense's (DoD) military radio development

    program known as Joint Tactical Radio System (JTRS). The

    introduction of this standard was mainly aimed at facilitating

    technology insertion, hardware abstraction, and hardware

    interoperability for SDR systems [10]. In Fig. 2 the structure

    of the SCA is presented. Here, the elements labeled C1

    through C4 represent the waveform components. In the SDR

    context, the function of each of these components is to

    perform a specific signal processing task within a waveform.

    In the SCA, all the services and interfaces available to each

    component are defined in the SCA core framework. From the

    figure it can be seen that waveform components interact with

    each other and with the core framework through the logical

    bus provided by the Common Object Request Broker

    Architecture (CORBA). One of the crucial features of this

    architecture is that it allows different applications (waveforms

    or waveform components) to run on distributed processing

    units and to communicate with each other. This is particularly

    useful when using only one processing unit does not satisfy

    the computational requirements of complex waveforms.

    Fig. 2 Software Communications Architecture Structure [10]

    III. SYSTEM ARCHITECTURE

    a) Proposed System Architecture for Hardware Acceleration

    To take advantage of the benefits obtained through the

    hardware acceleration of software components, while still

    adhering to the SCA standard, only the code of the signal

    processing routine for each waveform component is

    accelerated in our proposed architecture as shown in Fig. 3.

    That is, all the core framework interfaces and CORBA

    services to these accelerated components remain unchanged,

    allowing the component to be initialized and also located by


    other waveform components. However, the signal processing

    portion of the code is replaced by a subroutine which calls

    upon the hardware accelerated implementation. One of the

    implications of this process is that data stored in CORBA type

    variables must be converted and loaded to the memory

    location of the hardware accelerator used to implement the

    signal processing routine. In Fig. 3, the component to be

    accelerated is denoted as C3 and the accelerated signal

    processing routine is C3_HW.
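
    The substitution described above can be sketched roughly as follows. This is only an illustration in Python and not the authors' implementation: the class name, the c3_hw stub, and the use of a flat array buffer to stand in for the accelerator's memory are all assumptions made for the example, and a plain list stands in for a CORBA sequence.

```python
from array import array


def c3_hw(buffer, gain):
    """Hypothetical stand-in for the hardware-accelerated routine C3_HW.

    On the real platform this work would be done by the accelerator;
    here it is emulated in software so the sketch can run anywhere.
    """
    return array("f", (gain * x for x in buffer))


class C3Component:
    """Hypothetical waveform component; its outward interface is unchanged."""

    def __init__(self, gain):
        self.gain = gain

    def process(self, samples):
        # 1. Convert the middleware-level sequence (a plain list stands in
        #    for a CORBA sequence) into a flat buffer, emulating the copy
        #    into the accelerator's memory.
        hw_buffer = array("f", samples)
        # 2. Call the accelerated routine instead of a software loop.
        hw_result = c3_hw(hw_buffer, self.gain)
        # 3. Convert the result back to the type other components expect.
        return list(hw_result)


component = C3Component(gain=2.0)
result = component.process([1, 2, 3])
```

    The point of the sketch is that callers of process() see the same interface whether or not the routine behind it is accelerated; only steps 1 to 3 inside the component change.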

    Fig. 3 Proposed SCA Architecture with Hardware Acceleration

    Since the proposed system is to be implemented using an

    FPGA board, in the following subsection a thorough

    description of the characteristics of the hardware platform in

    question is provided.

    Fig. 4 System Architecture with FPU Support

    b) Hardware Platform

    We used the Virtex5 FPGA platform [11] to construct the

    system architecture shown in Fig. 4. FPGAs are of great

    advantage for SDR design since they can be reconfigured and

    reprogrammed. MicroBlaze [12] is a soft-core 32-bit RISC

    microprocessor operating at a clock rate of 125 MHz. It can be

    configured to have a 5-stage pipeline (performance-optimized)

    or 3-stage pipeline (area-optimized), barrel shifter, integer

    multiplier/divider, floating point operation and cache support.

    For our design, we elect to use the 5-stage pipeline and

    floating point unit (FPU) support for IEEE single precision

    format. MicroBlaze can access the local memory denoted as

    Block RAM (BRAM) through the Local Memory Bus (LMB).

    Thus, it accesses the instruction and data through the

    Instruction Local Memory Bus (ILMB) and Data Local

    Memory Bus (DLMB) respectively. Every instruction or data

    access through the LMB takes 1 clock cycle and the LMB

    must only be connected from processor to local memory

    BRAM. In contrast, the Processor Local Bus (PLB) takes

    about 7 clock cycles per access; however, it allows any

    peripheral or custom hardware to connect to it. The PLB is

    bidirectional and is 32 bits wide. We also employ a timer to

    collect timing information such as the number of clock cycles

    a specified operation takes on MicroBlaze or any other

    peripheral connected through the PLB. Interrupts are used to

    make the system interruptible. If MicroBlaze needs to perform

    an urgent operation, then it can interrupt the current operation

    to serve the urgent one. RS232_UART is used for displaying

    the output on the host machine. The FPGA platform is a

    memory-mapped system and every component is assigned a

    memory range. Therefore, if MicroBlaze needs to access any

    peripheral it only needs to provide the address location and/or

    data. MicroBlaze addresses are 32 bits wide, so it can

    access memory locations up to 0xFFFFFFFF.
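
    To put the bus figures above in perspective, the following short Python sketch converts the stated per-access cycle counts into time at the stated 125 MHz clock and counts the locations a 32-bit address can name. The constant and function names are illustrative assumptions; the numeric values are the ones quoted in the text, with the PLB cost taken as exactly 7 cycles even though the text says "about 7".

```python
CLOCK_HZ = 125_000_000          # MicroBlaze clock rate stated in the text
LMB_CYCLES_PER_ACCESS = 1       # Local Memory Bus, from the text
PLB_CYCLES_PER_ACCESS = 7       # Processor Local Bus ("about 7" in the text)


def access_time_ns(cycles, clock_hz=CLOCK_HZ):
    """Time in nanoseconds for one access that takes `cycles` clock cycles."""
    return cycles * 1e9 / clock_hz


lmb_ns = access_time_ns(LMB_CYCLES_PER_ACCESS)   # 8 ns per LMB access
plb_ns = access_time_ns(PLB_CYCLES_PER_ACCESS)   # 56 ns per PLB access

# A 32-bit address can name 2**32 locations: 0x00000000 .. 0xFFFFFFFF.
address_count = 0xFFFFFFFF + 1
```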

    IV. EXPERIMENTAL RESULTS

    a) Experimental Setup

    For the purpose of this study, a waveform implemented in

    an SCA-based software architecture known as OSSIE was

    initially profiled. It was observed that one of the components

    on which the processor spent the most time (i.e., a higher

    percentage of samples was taken during its execution) was the

    IQ amplifier. This was taken as an indication that

    accelerating this component might lead to a significant

    performance improvement, as it is frequently accessed by the

    processor.

    Therefore, our initial feasibility study consisted of

    executing the amplifier function from the SCA without the

    FPU incorporated in the MicroBlaze architecture and comparing it

    against the case with FPU support. The amplifier function accepts

    seven parameters and multiplies the in-phase input (I_in_0) by its

    gain (I_gain) and the quadrature input (Q_in_0) by its gain

    (Q_gain). I_gain and Q_gain are floating-point values, while

    I_in_0 and Q_in_0 are integer vectors.
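
    The computation performed by the amplifier can be sketched as below. This is a simplified Python illustration, not the OSSIE code: the text does not list all seven parameters, so only the four it names are modeled, and returning floating-point output vectors is an assumption of the sketch.

```python
def iq_amplifier(i_in, q_in, i_gain, q_gain):
    """Scale the in-phase and quadrature sample vectors by their gains.

    i_in, q_in     -- integer sample vectors (I_in_0 and Q_in_0 in the text)
    i_gain, q_gain -- floating-point gains (I_gain and Q_gain in the text)

    Returning floating-point vectors is an assumption of this sketch.
    """
    if len(i_in) != len(q_in):
        raise ValueError("I and Q vectors must have the same length")
    # One floating-point multiplication per sample: this is the operation
    # that an FPU executes in hardware and that must otherwise be carried
    # out in software.
    i_out = [i_gain * sample for sample in i_in]
    q_out = [q_gain * sample for sample in q_in]
    return i_out, q_out


i_out, q_out = iq_amplifier([1, -2, 3], [4, 5, -6], 0.5, 2.0)
```

    Because every sample costs one integer-to-float multiplication, the run time of this routine depends almost entirely on how fast a single floating-point multiply executes, which is why it is a natural candidate for FPU support.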

    b) Results

    We execute the amplifier function on MicroBlaze with

    and without FPU support on data sets from 128 Bytes to 64

    KBytes (local memory maximum capacity), doubling the data

    set size each time. The clock cycle, execution time and energy

    consumption readings are collected and presented in Table I

    (with FPU support) and Table II (without FPU support).

    We first enable the FPU support and run the amplifier

    with a data set of 128 Bytes. From Table I, we can see that it

    takes 17842 clock cycles to perform floating-point

    multiplication of the 128-Byte data set. The microprocessor runs

    at 125 MHz. Given that execution time equals the number


    of clock cycles multiplied by the unit clock cycle time, the

    execution time we obtained is 0.1427 ms. From Table II, we

    can observe that without FPU support it takes 32651 clock

    cycles to perform floating-point multiplication of the 128-Byte

    data set. The execution time is 0.2612 ms.

    Next, we utilized XPower Analyzer [12] to get the power

    consumption of the system. XPower Analyzer is a Xilinx tool

    dedicated to analyzing the power consumption of a

    post-implementation placed and routed design. It collects

    information about the hardware design and provides accurate

    estimation of the power utilization. By using the XPower

    Analyzer tool, the overall power consumption for the hardware

    system can be measured, which is shown in Fig. 4. Our results

    show that with FPU support the total power consumption is

    1.2650 Watts, of which 1.1105 is static power and 0.1545 is

    dynamic power. If we disable the FPU unit, the static power is

    1.1102 and dynamic power is 0.1463, with a total power

    consumption of 1.2565 Watts. Adding the floating point unit

    support requires more hardware, thus it will consume more

    power as well. Based on our measurement, the hardware

    overhead incurred by the FPU corresponds to a 0.68% increase

    in power consumption. The system architecture with FPU

    support requires 415 slice registers used as flip-flops and 771

    look-up tables (LUTs), which amount to 21% and 40% more

    flip-flop and LUT resources, respectively, when compared to

    the system that does not use the FPU engine.

    Table I - MicroBlaze with FPU Support

    # Bytes   Clk Cycles (10^6)   Exec Time (ms)   Energy (mJoules)   uJoules / Byte
    128       0.0178              0.1427           0.1805             1.4106
    256       0.0345              0.2799           0.3541             1.3834
    512       0.0658              0.5484           0.6937             1.3549
    1K        0.1341              1.0731           1.3576             1.3257
    2K        0.2622              2.0983           2.6544             1.2961
    4K        0.5124              4.0996           5.1862             1.2661
    8K        1.0005              8.0041           10.1255            1.2360
    16K       1.9521              15.6166          19.7557            1.2057
    32K       3.8061              30.4487          38.5188            1.1755
    64K       7.4158              59.3265          75.0504            1.1451

    Table II - MicroBlaze without FPU Support

    # Bytes   Clk Cycles (10^6)   Exec Time (ms)   Energy (mJoules)   uJoules / Byte
    128       0.0326              0.2612           0.3270             2.5546
    256       0.0645              0.5164           0.6465             2.5256
    512       0.1275              1.0207           1.2778             2.4957
    1K        0.2521              2.0168           2.5248             2.4656
    2K        0.4984              3.9875           4.9919             2.4374
    4K        0.9855              7.8845           9.8705             2.4097
    8K        1.9475              15.1504          19.5048            2.3809
    16K       3.8469              30.7755          38.5272            2.3515
    32K       7.5965              60.7725          76.0798            2.3217
    64K       14.9975             119.9802         150.200            2.2918
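
    The relation between clock cycles and execution time used for the tables can be checked with a few lines of Python. This is a small worked calculation rather than part of the original study; the function and variable names are assumptions, while the cycle counts, clock rate and 64K execution times are the values reported in the text and in Tables I and II.

```python
CLOCK_HZ = 125_000_000  # processor clock rate given in the text


def exec_time_ms(clock_cycles, clock_hz=CLOCK_HZ):
    """Execution time in milliseconds = clock cycles * clock period."""
    return clock_cycles / clock_hz * 1e3


t_fpu = exec_time_ms(17842)      # 128-Byte data set, with FPU
t_no_fpu = exec_time_ms(32651)   # 128-Byte data set, without FPU
speedup_128 = t_no_fpu / t_fpu   # roughly 1.83

# Speedup for the largest data set, from the 64K rows of Tables I and II.
speedup_64k = 119.9802 / 59.3265  # roughly 2.02
```

    Rounded to four decimals, t_fpu and t_no_fpu give the 0.1427 ms and 0.2612 ms quoted in the text, and the ratio for the 64K rows is consistent with the reported speedup of up to 2x.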

    Since we get the power concalculate the energy consumed d

    Thus, the 128-Byte data set has0.1805 mJoules as shown in Tabl

    consumed for the same data set

    accordingly.
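
    Energy follows from power and execution time as energy = power x time. The Python sketch below reproduces the 128-Byte entry of Table I and the 0.68% power overhead from the values reported in the text; the names are illustrative assumptions, and the per-byte result differs from the table in the fourth decimal only because the rounded execution time is used as input.

```python
P_FPU_W = 1.1105 + 0.1545      # static + dynamic power with FPU (Watts)
P_NO_FPU_W = 1.1102 + 0.1463   # static + dynamic power without FPU (Watts)


def energy_mj(power_w, time_ms):
    """Energy in millijoules: Watts * milliseconds = millijoules."""
    return power_w * time_ms


e_fpu_128 = energy_mj(P_FPU_W, 0.1427)   # Table I, 128-Byte row
uj_per_byte = e_fpu_128 * 1e3 / 128      # microjoules per byte
power_overhead_pct = (P_FPU_W / P_NO_FPU_W - 1) * 100
```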

    Fig. 5 Performance of FPU o

    Fig. 6 Energy Reduction of FP

    Fig. 7 Normalized Energy per Byte C
