efficient and portable real-time application implementation on heterogeneous mpsoc platforms

Upload: sreekanthreddy-peram

Post on 02-Apr-2018

222 views

Category:

Documents


0 download

TRANSCRIPT

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    1/40

    Efficient and Portable

    ea - me pp ca on mp emen a on

    on Heterogeneous MPSoC Platforms

    Gerd Ascheid

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    2/40

    Outline

    n ro uc on

    Case Study : MIMO OFDM Transceiver

    Tool Flow

    Summary & Outlook

    2

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    3/40

    Application Specific MPSoC Platforms

    SDR Platform

    P2012 Platform

    Proprietary Platform

    3

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    4/40

    Application Specific MPSoC Platforms

    SDR PlatformPlatform programmability issues:

    Portability

    Software is portable onto different platforms. _ , ..., _

    Loadability

    P2012 Platform

    Platform is capable of running different standards

    Device Standard_1.exe, ..., Standard_n.exe

    Proprietary PlatformSpecific mobile platform issue:

    Efficiency

    Power consumption must be close to power consumption

    4

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    5/40

    Application Specific MPSoC Platforms

    SDR PlatformPlatform programmability issues:

    Portability

    Software is portable onto different platforms. _ , ..., _

    LoadabilityContradicting Requirements !

    P2012 Platform

    Platform is capable of running different standards

    Device Standard_1.exe, ..., Standard_n.exeFlexibility (programmability) vs.Energy Efficiency

    Proprietary PlatformSpecific mobile platform issue:

    Efficiency

    Power consumption must be close to power consumption

    5

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    6/40

    Energy Efficiency Comparison

    CAESAR

    Algorithmic Flexibility Cost by Energy Efficiency

    ~

    nJ

    1

    0.1

    .

    of magnitude

    ionBits/

    10-2

    ~ . or ers

    of magnitude

    QPSKIRISC-fp(C-code,

    hardwaretInform

    at

    10-3

    10-4

    64 QAM

    fixed pt.)

    IRISC(portableCorrec

    10-5

    C-code,

    software fixed pt. emul.)10-6

    6

    Signal-to-Noise Ratio

    0 10 20 3040

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    7/40

    Outline

    n ro uc on

    Case Study: MIMO OFDM Transceiver

    Tool Flow

    Summary & Outlook

    7

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    8/40

    Nucleus Methodology

    Transceiver

    Description

    Transceiver DescriptionN 1

    N 7 N 5N 2Non N

    Tasks

    Nuclei N 1 N 2 N 7 NNN 5

    Nucleus

    Library

    Critical, demanding, algorithmic kernel

    Kernel is common among different waveforms

    o wave orm nor ar ware spec c

    PE 2

    (rASIP)

    PE 3

    (DSP)PE1

    (ASIP)HW

    8

    MEM

    Comm. Arch.

    PE 5(FPGA)

    PE 4(GPP)

    Platform

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    9/40

    Nucleus Methodology

    Transceiver

    Description

    Transceiver DescriptionN 1

    N 7 N 5N 2Non N

    Tasks

    Nuclei N 1 N 2 N 7 NNN 5

    Nucleus

    Library

    NINI

    Flavor

    PEsPE 2 PE 3 PE 4 PE 5PE 1

    PE 2

    (rASIP)

    PE 3

    (DSP)PE1

    (ASIP)HW

    MEM

    Comm. Arch.

    PE 5(FPGA)

    PE 4(GPP)

    Platform

    9

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    10/40

    Nucleus Methodology

    Transceiver

    Description

    Transceiver DescriptionN 1

    N 7 N 5N 2Non N

    Tasks

    Nuclei N 1 N 2 N 7 NNN 5

    Nucleus

    Library

    Mapping

    &

    Evaluation

    Compile

    Board

    NI

    NINI

    PEs

    uppor

    Package

    PE 2PE 1 PE 3 PE 4 PE 5

    Flavors

    PE 2

    (rASIP)

    PE 3

    (DSP)PE1

    (ASIP)HW

    MEM

    Comm. Arch.

    PE 5(FPGA)

    PE 4(GPP)

    Platform

    10

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    11/40

    Outline

    n ro uc on

    Case Study: MIMO OFDM Transceiver

    Application analysis - Nuclei identification Mapping on a ST P2012 platform

    -

    Using a preprocessing ASIP

    Tool Flow

    11

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    12/40

    Nuclei Identification: Transceiver Structure

    Outer Modem

    IEEE 802.11n

    Channel (De-)coding (De-)Interleaving

    nner o em

    RX OFDM Processing

    Channel Estimation

    OFDM Slot

    Spatial Equalizing: Mitigate channel impact on payload

    Soft Demapping: Calculate soft bits (LLRs)

    , , ,

    12

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    13/40

    Nuclei Identification: Kernel Identification

    Analyze different algorithmic choices within RX blocks

    Identify computational kernels

    Operate on data with certain alignment

    Build application as composition of kernels

    13

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    14/40

    Nuclei Identification: Kernel Overview

    Application variants consist of a few kernels only!14

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    15/40

    Nuclei Identification: Kernel Overview

    Wide variety of algorithms is coveredfor key transceiver functions

    Spatial Equalization

    Channel Coding

    15

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    16/40

    Frame Error Rate of 4x4 MIMO System

    Fix-point issues at low FER

    mus e cons ere n mapp ng me o o ogy(not discussed in this presentation)

    16

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    17/40

    What Describes a Nucleus Library Element?

    Flavors and configuration parameters influenceperformance properties

    Latency

    Throughput

    Implementation properties

    rea

    Memory

    Algorithmic properties

    Bit error rate

    V. Ramakrishnan et al.: Efficient Implementations

    From Libraries: Analyzing the Influence of

    Properties, IEEE PIMRC Conference 2009

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    18/40

    Outline

    n ro uc on

    Case Study: MIMO OFDM Transceiver

    Application analysis - Nuclei identification

    Mapping on a ST P2012 platform

    -

    Using a preprocessing ASIP

    Tool Flow

    18

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    19/40

    Application Implementation:

    P2012 Platform (ST Microelectronics)

    SoC platform with maximum of 32 clusters

    One cluster provides

    Max. 16 RISC cores (STxP70) @ 600MHz

    VECx vector extension (SIMD)

    128 bit vector re isters

    8x16 bit or 4x32 bit operations

    Hardware synchronizer for inter-core signaling

    19

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    20/40

    Application Implementation: Kernel Overview

    For 2x2 and 4x4 MIMO use case

    Cycles for execution on

    single STxP70 processorcore including VECX unit

    Corresponding time for

    600MHz clock frequency

    In the range of

    IEEE 802.11n real time(4s per OFDM slot)

    20

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    21/40

    Application-to-Platform Mapping:

    Assigning Cores to PGs

    Given: Single core timing requirements

    Goal: Assi n cores to match real time constraints 4 s er slot

    Task time (us) #cores

    LS Channel Estimation 17.47

    Equalizer Preprocessing 159.36

    4

    4

    Actual Processing (per OFDM slot)

    OFDM Demodulation (mem. realign) 6.83

    E ualizer Actual Detection 6.08

    2

    Soft Demapping (64 QAM) 3.78 2

    21

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    22/40

    Application-to-Platform Mapping:

    Assigning Cores

    Final mapping includes parallelism on task level Partitioning of components into processing groups

    um er o cores per group

    8 cores enable real time

    PG 2

    PP&EQ

    4 PEs

    PG 1Modulation

    2 PEs

    Demapping2 PEs

    22

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    23/40

    Occupation Graph

    Implementation on P2012 platform using 8 cores

    Minimum latency for 21 or more OFDM slots of data payload

    .

    IEEE 802.11n allows 16s (including MAC layer)

    23

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    24/40

    Outline

    n ro uc on

    Case Study: MIMO OFDM Transceiver

    Application analysis - Nuclei identification

    Mapping on a ST P2012 platform

    -

    Using a preprocessing ASIP

    Tool Flow

    24

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    25/40

    TI C6474 Evaluation Board

    25

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    26/40

    Execution Time 2x2

    Component Cycles Time (us) @850MHzPreprocessing Steps

    OFDM Demodulation 1,296 1.525

    Subcarrier Demapping 1,220 1.435

    CP remove 360 0.424

    , , .

    Detection Preproc. (MIMO 2x2, MMSE, DS) 39,499 46.469

    Sum 45,647 53.702

    Detection Steps

    OFDM Demodulation 648 0.762

    Subcarrier Demapping 610 0.359

    CP remove 180 0.718

    Equalizing - MIMO 984 1.158

    . .

    Soft Demapping 16QAM (per OFDM sym.) 618 0.727

    Soft Demapping 64QAM (per OFDM sym.) 1,062 1.249

    Close to real time

    requirements (4s)

    already with 1 core

    Close to real time

    requirements (4s)

    already with 1 core

    26

    Sum 3,484 4.099

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    27/40

    Outline

    n ro uc on

    Case Study: MIMO OFDM Transceiver

    Application analysis - Nuclei identification

    Mapping on a ST P2012 platform

    -

    Using a preprocessing-optimized ASIP

    Tool Flow

    27

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    28/40

    Issue

    Some preprocessing functionsoperate on short vectors and small matrices (4x4)

    rregu ar process ng no so su a e or vec or processors

    ASIP with register file large enough

    to hold all vectors and matrices

    erna ve :

    Coarse grained reconfigurable ASIP

    28

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    29/40

    Issue

    Some preprocessing functionsoperate on short vectors and small matrices (4x4)

    rregu ar process ng no so su a e or vec or processors

    ASIP with register file large enough

    to hold all vectors and matrices

    erna ve :

    Coarse grained reconfigurable ASIP

    29

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    30/40

    Alternative Solution 2: CGRA

    Pipelined Coarse Grained Reconfigurable Array (CGRA) Processing Element (PE)

    omp ex mu p er

    Complex ALU

    Shifter PE 2x2

    d1 d2 d3 d4

    PE 2x2

    Local register file

    PE 2x2

    PE PE

    PE PE

    PE PE

    PE PE

    Broadcast inputs PE Chain

    PEChainPE PECenterAlpha PE PE

    Results from PE 2x2

    Center Alpha Unit

    S ecial al ha arameter

    CGRA

    PE PE

    PE PE

    PE PE

    PE PE

    in matrix inversion

    30

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    31/40

    Integration with Processor

    LLR

    CGRA

    on . em a r

    Ctrl. Info.

    LLR Reg. idx

    H, y, N0

    conf. bitsRISC Pipeline

    PF FE DC WBEX

    Wide

    Register

    FileData

    Memory

    (128-bit) Mem. Write Req. & Data

    Mem. Read Data

    . .

    31

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    32/40

    Results: Area and Throughput

    Synthesis results*Component Area / kGE

    (Gate Equivalents)

    Complex multiplier 13,0

    Complex adder 1,1

    Processing Element 18,6

    CGRA 393

    Configuration Memory 43

    Processing performance Input: H, y

    Output: soft

    Throughput: yHIHHx HH 10

    N

    Speed up > 2 orders of magnitude

    (150 Msym/s @ 588 MHz 30 ns/sym)^

    * Synopsys Design Compiler Ultra 2010.12-SP2 Faraday 65nm technology,

    Standard Performance, typical case library & conditions, 1.00V, 25C

    32

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    33/40

    Outline

    n ro uc on

    Case Study: MIMO OFDM Transceiver

    Tool Flow

    Summary & Outlook

    33

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    34/40

    Tool Flow: Overview

    Flow inputs

    Nucleus

    Library

    User inputs

    N1 N2 NN1

    Waveform descriptionConfig & constraints

    Flow inputs

    N3 .cpn.xml .xml

    Flavor

    Board Support

    Package

    Library

    Platform

    Description

    Nucleus Mapper MAPS*

    .xmlCode Generator

    -

    Virtual Platform

    -

    implementation

    -

    Virtual Platform

    34

    * For further details see the presentation by Rainer Leupers

    Programming Heterogeneous MPSoC Platforms: The MAPS Approach

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    35/40

    Tool Flow: Overview (2)

    Screenshot of Graphical User Interface:

    35

    T l Fl M i

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    36/40

    Tool Flow: Mapping

    Mapping: Finding a flavor for every nucleus Architecture: List ofOption-Graphs that is refined

    Exports: Feasible options for further analysis

    Flavor

    Library

    Platform

    Description

    Provides timing analysis,Provides timing analysis,Provides timing analysis,

    Interface Agent

    .xml mapp ng an sc e u ngmapp ng an sc e u ngmapp ng an sc e u ng

    Option Graphs

    (NiFij)

    Throughput Agent

    Latency AgentMAPS

    Mapping and.

    36

    p on rap

    Scheduler descriptor

    T l Fl C d G ti

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    37/40

    Tool Flow: Code Generation

    With option graph and schedule descriptorgenerates code for different platforms

    Multi-threaded host

    implementation

    Generate

    avor .

    MAPS

    CPN -

    Virtual PlatformCompiler

    Mapper

    ISS-based

    Virtual Platform

    Code generation for target system or simulation platform

    37

    O tli

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    38/40

    Outline

    n ro uc on

    Case Study: MIMO OFDM Transceiver

    Tool Flow

    Summary & Outlook

    38

    Summary & Outlook

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    39/40

    Summary & Outlook

    Summary

    Nucleus methodology represents a systematic design approach

    for flexible and efficient transceiver design

    Nucleus methodology successfully appliedto a MIMO OFDM transceiver with mapping

    onto different MPSoC platforms

    Tool support for methodology available

    Outlook

    xpan concep o o er cr ca process ng unc ons

    in portable multimedia devices

    39

    Basic References

  • 7/27/2019 Efficient and Portable Real-Time Application Implementation on Heterogeneous MPSoC Platforms

    40/40

    Basic References

    Ramakrishnan, V., Witte, E. M., Kempf, T., Kammler, D., Ascheid, G., Meyr, H., Adrat, M., and Antweiler, M.:

    Efficient and Portable SDR Waveform Development: The Nucleus Concept,

    in Proceedings of the IEEE Military Communications Conference (MILCOM) (Boston, USA), pp. 918924, Oct. 2009

    Kempf, T., Guenther, D., Ishaque, A., and Ascheid, G.:

    MIMO OFDM transceiver for a Many-Core Computing Fabric - A Nucleus based Implementation,

    in SDR'11 - The Wireless Innovation Forum Conference on Communications Technologies and

    Software Defined Radio (Washington D.C., USA), Dec. 2011

    Chen, X., Minwegen, A., Hassan, Y., Kammler, D., Li, S., Kempf, T., Chattopadhyay, A., Ascheid, G., Leupers, R.:

    FLEXDET: Flexible, Efficient Multi-Mode MIMO Detection using reconfigurable ASIP,

    20th Annual IEEE International Symposium on Field-Programmable Custom Computing Machines 2012

    J. Castrillon, S. Schrmans, A. Stulova, W. Sheng, T. Kempf, R. Leupers, G. Ascheid, and H. Meyr,

    Component-Based Waveform Development: The Nucleus Tool Flow for Efficient and Portable Software Defined Radio,Journal of Analog Integrated Circuits and Signal Processing, vol. 69, no. 2, pp. 173190, Oct. 2011

    , ., , ., , .

    MAPS: Mapping Concurrent Dataflow Applications to Heterogeneous MPSoCs,

    IEEE Transactions on Industrial Informatics, p. 19, Nov. 2011

    40