
  • Slide-1 HPEC-SI

    MITRE AFRL MIT Lincoln Laboratory www.hpec-si.org

    High Performance Embedded Computing Software Initiative (HPEC-SI)

    This work is sponsored by the High Performance Computing Modernization Office under Air Force Contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.

    Dr. Jeremy Kepner, MIT Lincoln Laboratory

  • Slide-2 www.hpec-si.org

    Outline

    • Introduction
      – DoD Need
      – Program Structure

    • Software Standards

    • Parallel VSIPL++

    • Future Challenges

    • Summary

  • Slide-3 www.hpec-si.org

    Overview - High Performance Embedded Computing (HPEC) Initiative

    [Figure: the Common Imagery Processor (CIP) with the Enhanced Tactical Radar Correlator (ETRAC) processing ASARS-2 imagery, moving from a shared-memory server to an embedded multi-processor; the HPEC Software Initiative links DARPA applied research, development, and demonstration to DoD programs]

    Challenge: Transition advanced software technology and practices into major defense acquisition programs

  • Slide-4 www.hpec-si.org

    Why Is DoD Concerned with Embedded Software?

    [Chart: estimated DoD expenditures for embedded signal and image processing hardware and software ($B), $0–$3B range, beginning FY98. Source: “HPEC Market Study”, March 2001]

    • COTS acquisition practices have shifted the burden from “point design” hardware to “point design” software

    • Software costs for embedded systems could be reduced by one-third with improved programming models, methodologies, and standards


  • Slide-5 www.hpec-si.org

    Issues with Current HPEC Development – Inadequacy of Software Practices & Standards

    Platforms: NSSN, AEGIS, Rivet Joint, Standard Missile, Predator, Global Hawk, U-2, JSTARS, MSAT-Air, P-3/APS-137, F-16, MK-48 Torpedo

    Today – Embedded Software Is:
    • Not portable
    • Not scalable
    • Difficult to develop
    • Expensive to maintain

    [Timeline: system development/acquisition stages of roughly 4 years each – program milestones from system technology development, through system field demonstration and engineering/manufacturing development, to insertion into a military asset – while the signal processor evolves through 1st to 6th generations]

    • High Performance Embedded Computing is pervasive throughout DoD applications

    – Airborne radar insertion program: 85% software rewrite for each hardware platform

    – Missile common processor: processor board costs < $100k; software development costs > $100M

    – Torpedo upgrade: two software rewrites required after changes in hardware design

  • Slide-6 www.hpec-si.org

    Evolution of Software Support Towards “Write Once, Run Anywhere/Anysize”

    [Figure, 1990: in both DoD and COTS development, application software sits directly on vendor software]

    • Application software has traditionally been tied to the hardware

    • Support “Write Once, Run Anywhere/Anysize”

    [Figure, 2000–2005: applications sit on a middleware layer above vendor software]

    • Many acquisition programs are developing stove-piped middleware “standards”

    • Open software standards can provide portability, performance, and productivity benefits

    [Figure: embedded software standards define the middleware layer between applications and vendor software]

  • Slide-7 www.hpec-si.org

    Overall Initiative Goals & Impact

    [Figure: HPEC Software Initiative cycle – Develop, Prototype, Demonstrate – built on object-oriented, open, interoperable, and scalable standards, targeting Portability (3x), Productivity (3x), and Performance (1.5x)]

    Program Goals
    • Develop and integrate software technologies for embedded parallel systems to address portability, productivity, and performance
    • Engage acquisition community to promote technology insertion
    • Deliver quantifiable benefits

    Portability: reduction in lines-of-code changed to port/scale to a new system

    Productivity: reduction in overall lines-of-code

    Performance: computation and communication benchmarks

  • Slide-8 www.hpec-si.org

    HPEC-SI Path to Success

    Benefit to DoD Programs
    • Reduces software cost & schedule
    • Enables rapid COTS insertion
    • Improves cross-program interoperability
    • Basis for improved capabilities

    Benefit to DoD Contractors
    • Reduces software complexity & risk
    • Easier comparisons/more competition
    • Increased functionality

    Benefit to Embedded Vendors
    • Lower software barrier to entry
    • Reduced software maintenance costs
    • Evolution of open standards

    HPEC Software Initiative builds on:
    • Proven technology
    • Business models
    • Better software practices

  • Slide-9 www.hpec-si.org

    Organization

    Government Lead: Dr. Rich Linderman (AFRL)

    Executive Committee: Dr. Charles Holland (PADUSD(S&T)), RADM Paul Sullivan (N77)

    Technical Advisory Board: Dr. Rich Linderman (AFRL), Dr. Richard Games (MITRE), Mr. John Grosh (OSD), Mr. Bob Graybill (DARPA/ITO), Dr. Keith Bromley (SPAWAR), Dr. Mark Richards (GTRI), Dr. Jeremy Kepner (MIT/LL)

    Demonstration: Dr. Keith Bromley (SPAWAR), Dr. Richard Games (MITRE), Dr. Jeremy Kepner (MIT/LL), Mr. Brian Sroka (MITRE), Mr. Ron Williams (MITRE), ...

    Development: Dr. James Lebak (MIT/LL), Dr. Mark Richards (GTRI), Mr. Dan Campbell (GTRI), Mr. Ken Cain (MERCURY), Mr. Randy Judd (SPAWAR), ...

    Applied Research: Mr. Bob Bond (MIT/LL), Mr. Ken Flowers (MERCURY), Dr. Spaanenburg (PENTUM), Mr. Dennis Cottel (SPAWAR), Capt. Bergmann (AFRL), Dr. Tony Skjellum (MPISoft), ...

    Advanced Research: Mr. Bob Graybill (DARPA)

    • Partnership with ODUSD(S&T), Government Labs, FFRDCs, Universities, Contractors, Vendors and DoD programs

    • Over 100 participants from over 20 organizations


  • Slide-10 www.hpec-si.org

    Outline

    • Introduction

    • Software Standards
      – Standards Overview
      – Future Standards

    • Parallel VSIPL++

    • Future Challenges

    • Summary

  • Slide-11 www.hpec-si.org

    Emergence of Component Standards

    [Figure: a parallel embedded processor (nodes P0–P3 with a node controller) connected to a system controller, consoles, and other computers. Computation: VSIPL, VSIPL++, ||VSIPL++. Data communication: MPI, MPI/RT, DRI. Control communication: CORBA, HP-CORBA]

    Definitions
    VSIPL = Vector, Signal, and Image Processing Library
    ||VSIPL++ = Parallel Object Oriented VSIPL
    MPI = Message Passing Interface
    MPI/RT = MPI Real-Time
    DRI = Data Reorganization Interface
    CORBA = Common Object Request Broker Architecture
    HP-CORBA = High Performance CORBA

    The HPEC Initiative builds on completed research and existing standards and libraries.

  • Slide-12 www.hpec-si.org

    The Path to Parallel VSIPL++ (world’s first parallel object-oriented standard)

    • First demo successfully completed
    • VSIPL++ v0.5 spec completed
    • VSIPL++ v0.1 code available
    • Parallel VSIPL++ spec in progress
    • High performance C++ demonstrated

    [Figure: functionality grows with time across three program phases]

    VSIPL / MPI (existing standards): demonstrate insertions into fielded systems (e.g., CIP); demonstrate 3x portability
    VSIPL++ (object-oriented standard): high-level code abstraction; reduce code size 3x
    Parallel VSIPL++ (unified computation/communication standard): demonstrate scalability

    Phase 1 – Demonstration: Existing Standards; Development: Object-Oriented Standards; Applied Research: Unified Comp/Comm Lib (prototype)
    Phase 2 – Demonstration: Object-Oriented Standards; Development: Unified Comp/Comm Lib; Applied Research: Fault tolerance (prototype)
    Phase 3 – Demonstration: Unified Comp/Comm Lib; Development: Fault tolerance; Applied Research: Self-optimization

  • Slide-13 www.hpec-si.org

    Working Group Technical Scope

    Development – VSIPL++:
    – Mapping (data parallelism)
    – Early binding (computations)
    – Compatibility (backward/forward)
    – Local knowledge (accessing local data)
    – Extensibility (adding new functions)
    – Remote procedure calls (CORBA)
    – C++ compiler support
    – Test suite
    – Adoption incentives (vendor, integrator)

    Applied Research – Parallel VSIPL++:
    – Mapping (task/pipeline parallel)
    – Reconfiguration (for fault tolerance)
    – Threads
    – Reliability/Availability
    – Data permutation (DRI functionality)
    – Tools (profiles, timers, ...)
    – Quality of service

  • Slide-14 www.hpec-si.org

    Overall Technical Tasks and Schedule

    [Gantt chart: tasks across FY01–FY08 (near / mid / long term), organized into Applied Research, Development, and Demonstration tracks]

    VSIPL (Vector, Signal, and Image Processing Library)
    MPI (Message Passing Interface)
    VSIPL++ (Object Oriented): v0.1 Spec, v0.1 Code, v0.5 Spec & Code, v1.0 Spec & Code
    Parallel VSIPL++: v0.1 Spec, v0.1 Code, v0.5 Spec & Code, v1.0 Spec & Code
    Fault Tolerance / Self-Optimizing Software

    Demonstrations: CIP, Demo 2, Demo 3, Demo 4, Demo 5, Demo 6

  • Slide-15 www.hpec-si.org

    HPEC-SI Goals / 1st Demo Achievements

    [Figure: HPEC Software Initiative cycle – Develop, Prototype, Demonstrate – built on object-oriented, open, interoperable, and scalable standards]

    Portability: goal 3x, achieved 10x+ (zero code changes required to port/scale)
    Productivity: goal 3x, achieved 6x* (DRI code 6x smaller vs. MPI, *estimated)
    Performance: goal 1.5x, achieved 2x (3x reduced cost or form factor)

  • Slide-16 www.hpec-si.org

    Outline

    • Introduction

    • Software Standards

    • Parallel VSIPL++
      – Technical Basis
      – Examples

    • Future Challenges

    • Summary

  • Slide-17 www.hpec-si.org

    Parallel Pipeline

    Signal Processing Algorithm (mapped onto a parallel computer):

    Filter: XOUT = FIR(XIN)  →  Beamform: XOUT = w * XIN  →  Detect: XOUT = |XIN| > c

    • Data parallel within stages
    • Task/pipeline parallel across stages
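    To make the stage math concrete, the following is a minimal serial C++ sketch of the three operations (FIR filter, beamform as a weighted combination, threshold detect). The container types, weights, and threshold are illustrative assumptions; the real implementation uses distributed PVL/VSIPL++ objects, with data parallelism inside each stage and task/pipeline parallelism across stages.

    #include <cmath>
    #include <complex>
    #include <cstddef>
    #include <vector>

    using cvec = std::vector< std::complex<float> >;

    // Filter: XOUT = FIR(XIN)
    cvec fir(const cvec& x, const cvec& h) {
        cvec y(x.size());
        for (std::size_t n = 0; n < x.size(); ++n)
            for (std::size_t k = 0; k < h.size() && k <= n; ++k)
                y[n] += h[k] * x[n - k];
        return y;
    }

    // Beamform: XOUT = w * XIN (element-by-element weighting, an assumed simplification)
    cvec beamform(const cvec& x, const cvec& w) {
        cvec y(x.size());
        for (std::size_t i = 0; i < x.size(); ++i) y[i] = w[i] * x[i];
        return y;
    }

    // Detect: XOUT = |XIN| > c
    std::vector<bool> detect(const cvec& x, float c) {
        std::vector<bool> hits(x.size());
        for (std::size_t i = 0; i < x.size(); ++i) hits[i] = std::abs(x[i]) > c;
        return hits;
    }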

  • Slide-18 www.hpec-si.org

    Types of Parallelism

    [Figure: Input → FIR Filters → Beamformer 1 / Beamformer 2 → Detector 1 / Detector 2, coordinated by a Scheduler; the diagram illustrates task parallelism, pipeline parallelism, round-robin scheduling, and data parallelism]

  • Slide-19 www.hpec-si.org

    Current Approach to Parallel Code

    Code (Algorithm + Mapping):

    while(!done) {
      if ( rank()==1 || rank()==2 )
        stage1();
      else if ( rank()==3 || rank()==4 )
        stage2();
    }

    [Figure: Stage 1 runs on Proc 1–2 and Stage 2 on Proc 3–4; scaling Stage 2 up to Proc 3–6 forces the code change shown below]

    while(!done) {
      if ( rank()==1 || rank()==2 )
        stage1();
      else if ( rank()==3 || rank()==4 ||
                rank()==5 || rank()==6 )
        stage2();
    }

    • Algorithm and hardware mapping are linked
    • Resulting code is non-scalable and non-portable

  • Slide-20 www.hpec-si.org

    Scalable Approach

    Single Processor Mapping

    #include <...>   // header names lost in transcription

    void addVectors(aMap, bMap, cMap) {
      Vector< Complex > a('a', aMap, LENGTH);
      Vector< Complex > b('b', bMap, LENGTH);
      Vector< Complex > c('c', cMap, LENGTH);

      b = 1;
      c = 2;
      a = b + c;
    }

    A = B + C

    Multi Processor Mapping

    A = B + C

    • Single processor and multi-processor code are the same
    • Maps can be changed without changing software
    • High level code is compact

    Lincoln Parallel Vector Library (PVL)
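    As a hedged illustration of “maps change, software does not,” the toy sketch below uses stand-in Map/Vector types of our own (hypothetical, not PVL’s real classes) to show the calling pattern: the arithmetic never mentions processors, and only the map objects passed in differ between a single-processor and a multi-processor run.

    #include <cstddef>
    #include <vector>

    // Toy stand-in types (assumed, not PVL's API): a Map lists the processors
    // a vector is distributed over.
    struct Map { std::vector<int> procs; };

    struct Vector {
        Map map;                        // where the data lives
        std::vector<double> data;       // local view (toy: the whole vector)
        Vector(const Map& m, std::size_t n) : map(m), data(n, 0.0) {}
    };

    void addVectors(const Map& aMap, const Map& bMap, const Map& cMap) {
        const std::size_t LENGTH = 1024;
        Vector a(aMap, LENGTH), b(bMap, LENGTH), c(cMap, LENGTH);
        for (std::size_t i = 0; i < LENGTH; ++i) {
            b.data[i] = 1;
            c.data[i] = 2;
            a.data[i] = b.data[i] + c.data[i];   // a = b + c, independent of the maps
        }
    }

    int main() {
        Map single{{0}};                      // everything on processor 0
        Map quad{{0, 1, 2, 3}};               // same vectors spread over 4 processors
        addVectors(single, single, single);   // single-processor mapping
        addVectors(quad, quad, quad);         // multi-processor mapping: only the maps changed
        return 0;
    }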

  • Slide-21 www.hpec-si.org

    C++ Expression Templates and PETE

    [Figure: evaluation of A = B + C with expression templates –
    1. Pass B and C references to operator+
    2. Create expression parse tree (BinaryNode)
    3. Return expression parse tree
    4. Pass expression tree reference to operator=
    5. Calculate result and perform assignment
    For A = B + C * D, a parse tree / expression-template type is created instead of temporary vectors]

    • Expression Templates enhance performance by allowing temporary variables to be avoided
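    To make the mechanism concrete, here is a minimal, self-contained expression-template sketch (illustrative only; it is not PETE, which is far more general). operator+ returns a lightweight parse-tree node instead of a temporary vector, and the templated assignment operator walks the tree once per element.

    #include <cstddef>
    #include <vector>

    // Parse-tree node representing L + R; holds references, does no work yet.
    template <class L, class R>
    struct AddNode {
        const L& l;
        const R& r;
        double operator[](std::size_t i) const { return l[i] + r[i]; }
    };

    struct Vec {
        std::vector<double> data;
        explicit Vec(std::size_t n) : data(n, 0.0) {}
        double  operator[](std::size_t i) const { return data[i]; }
        double& operator[](std::size_t i)       { return data[i]; }

        // Assignment from any expression: one loop, no temporary vectors.
        template <class Expr>
        Vec& operator=(const Expr& e) {
            for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
            return *this;
        }
    };

    // Deliberately unconstrained for brevity: builds the parse tree only.
    template <class L, class R>
    AddNode<L, R> operator+(const L& l, const R& r) { return {l, r}; }

    int main() {
        Vec A(4), B(4), C(4), D(4);
        A = B + C + D;   // builds AddNode<AddNode<Vec,Vec>,Vec>, then one assignment loop
        return 0;
    }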

  • Slide-22 www.hpec-si.org

    PETE Linux Cluster Experiments

    [Charts: relative execution time vs. vector length (32 to ~131072) for A=B+C, A=B+C*D, and A=B+C*D/E+fft(F), comparing VSIPL, PVL/VSIPL, and PVL/PETE]

    • PVL with VSIPL has a small overhead
    • PVL with PETE can surpass VSIPL

  • Slide-23 www.hpec-si.org

    PowerPC AltiVec Experiments

    [Chart: relative performance for A=B+C, A=B+C*D, A=B+C*D+E*F, and A=B+C*D+E/F across the three software technologies]

    Results
    • Hand coded loop achieves good performance, but is problem specific and low level (a plain-loop sketch follows below)
    • Optimized VSIPL performs well for simple expressions, worse for more complex expressions
    • PETE style array operators perform almost as well as the hand-coded loop and are general, can be composed, and are high-level

    AltiVec loop: C; for loop; direct use of AltiVec extensions; assumes unit stride; assumes vector alignment
    VSIPL (vendor optimized): C; AltiVec aware VSIPro Core Lite (www.mpi-softtech.com); no multiply-add; cannot assume unit stride; cannot assume vector alignment
    PETE with AltiVec: C++; PETE operators; indirect use of AltiVec extensions; assumes unit stride; assumes vector alignment
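    For reference, the kind of hand-coded fused loop the first column describes might look like the plain C++ sketch below (without the AltiVec intrinsics, alignment, and stride assumptions of the measured version; the names and length are illustrative). It performs the multiply-add in a single pass with no temporary vectors, which is essentially the loop a PETE-style A = B + C * D expression can expand to.

    // Hand-coded fused loop for A = B + C * D (illustrative sketch only).
    void fused_madd(float* A, const float* B, const float* C, const float* D, int n) {
        for (int i = 0; i < n; ++i)
            A[i] = B[i] + C[i] * D[i];   // one pass over the data, no temporaries
    }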

  • Slide-24 www.hpec-si.org

    Outline

    • Introduction

    • Software Standards

    • Parallel VSIPL++
      – Technical Basis
      – Examples

    • Future Challenges

    • Summary

  • Slide-25 www.hpec-si.org

    A = sin(A) + 2 * B;

    • Generated code (no temporaries)

    for (index i = 0; i < A.size(); ++i)
      A.put(i, sin(A.get(i)) + 2 * B.get(i));

    • Apply inlining to transform to

    for (index i = 0; i < A.size(); ++i)
      Ablock[i] = sin(Ablock[i]) + 2 * Bblock[i];

    • Apply more inlining to transform to

    T* Bp   = &(Bblock[0]);
    T* Aend = &(Ablock[A.size()]);
    for (T* Ap = &(Ablock[0]); Ap < Aend; ++Ap, ++Bp)
      *Ap = fmadd (2, *Bp, sin(*Ap));

    • Or apply PowerPC AltiVec extensions

    • Each step can be automatically generated
    • Optimization level: whatever the vendor desires

  • Slide-26 www.hpec-si.org

    BLAS zherk Routine

    • BLAS = Basic Linear Algebra Subprograms
    • Hermitian matrix M: conjug(M) = Mt
    • zherk performs a rank-k update of Hermitian matrix C:

    C ← α ∗ A ∗ conjug(A)t + β ∗ C

    • VSIPL code

    A   = vsip_cmcreate_d(10,15,VSIP_ROW,MEM_NONE);
    C   = vsip_cmcreate_d(10,10,VSIP_ROW,MEM_NONE);
    tmp = vsip_cmcreate_d(10,10,VSIP_ROW,MEM_NONE);
    vsip_cmprodh_d(A,A,tmp);       /* A*conjug(A)t */
    vsip_rscmmul_d(alpha,tmp,tmp); /* α*A*conjug(A)t */
    vsip_rscmmul_d(beta,C,C);      /* β*C */
    vsip_cmadd_d(tmp,C,C);         /* α*A*conjug(A)t + β*C */
    vsip_cblockdestroy(vsip_cmdestroy_d(tmp));
    vsip_cblockdestroy(vsip_cmdestroy_d(C));
    vsip_cblockdestroy(vsip_cmdestroy_d(A));

    • VSIPL++ code (also parallel)

    Matrix A(10,15);
    Matrix C(10,10);
    C = alpha * prodh(A,A) + beta * C;

  • Slide-27 www.hpec-si.org

    Simple Filtering Application

    int main () {
      using namespace vsip;
      const length ROWS = 64;
      const length COLS = 4096;

      vsipl v;

      FFT forward_fft (Domain(ROWS,COLS), 1.0);
      FFT inverse_fft (Domain(ROWS,COLS), 1.0);

      const Matrix weights (load_weights (ROWS, COLS));

      try {
        while (1)
          output (inverse_fft (forward_fft (input ()) * weights));
      }
      catch (std::runtime_error) {
        // Successfully caught access outside domain.
      }
    }

  • Slide-28 www.hpec-si.org

    Explicit Parallel Filter

    #include <...>   // header names lost in transcription

    using namespace VSIPL;

    const int ROWS = 64;
    const int COLS = 4096;

    int main (int argc, char **argv)
    {
      Matrix W (ROWS, COLS, "WMap"); // weights matrix
      Matrix X (ROWS, COLS, "WMap"); // input matrix
      Matrix Y (ROWS, COLS, "WMap"); // output matrix (declaration assumed; lost in transcription)
      load_weights (W);

      try
      {
        while (1)
        {
          input (X);                     // some input function
          Y = IFFT ( mul (FFT(X), W));
          output (Y);                    // some output function
        }
      }
      catch (Exception &e) { cerr /* ... (remainder truncated in source) */ ; }
    }

  • Slide-29 www.hpec-si.org

    Multi-Stage Filter (main)

    using namespace vsip;

    const length ROWS = 64;
    const length COLS = 4096;

    int main (int argc, char **argv) {
      sample_low_pass_filter LPF;   // "LPF()" would declare a function, not an object
      sample_beamform        BF;
      sample_matched_filter  MF;

      try {
        while (1)
          output (MF(BF(LPF(input ()))));
      }
      catch (std::runtime_error) {
        // Successfully caught access outside domain.
      }
    }

  • Slide-30 www.hpec-si.org

    Multi-Stage Filter (low pass filter)

    template <...>   // template parameters lost in transcription
    class sample_low_pass_filter {
    public:
      sample_low_pass_filter()
        : FIR1_(load_w1 (W1_LENGTH), FIR1_LENGTH),
          FIR2_(load_w2 (W2_LENGTH), FIR2_LENGTH)
      { }

      Matrix operator () (const Matrix& Input) {
        Matrix output(ROWS, COLS);
        for (index row=0; row /* ... (remainder truncated in source) */

  • Slide-31 www.hpec-si.org

    Multi-Stage Filter (beam former)

    template <...>   // template parameters lost in transcription
    class sample_beamform {
    public:
      sample_beamform() : W3_(load_w3 (ROWS,COLS)) { }

      Matrix operator () (const Matrix& Input) const
      { return W3_ * Input; }

    private:
      const Matrix W3_;
    };

  • Slide-32 www.hpec-si.org

    Multi-Stage Filter (matched filter)

    template <...>   // template parameters lost in transcription
    class sample_matched_filter {
    public:
      sample_matched_filter()   // constructor name matched to the class
        : W4_(load_w4 (ROWS,COLS)),
          forward_fft_ (Domain(ROWS,COLS), 1.0),
          inverse_fft_ (Domain(ROWS,COLS), 1.0)
      { }

      Matrix operator () (const Matrix& Input) const
      { return inverse_fft_ (forward_fft_ (Input) * W4_); }

    private:
      const Matrix W4_;
      FFT forward_fft_;
      FFT inverse_fft_;
    };

  • Slide-33 www.hpec-si.org

    Outline

    • Introduction

    • Software Standards

    • Parallel VSIPL++

    • Future Challenges
      – Fault Tolerance
      – Self Optimization
      – High Level Languages

    • Summary

  • Slide-34 www.hpec-si.org

    Dynamic Mapping for Fault Tolerance

    [Figure: a parallel processor with a spare node hosting an input task, a compute task (XIN → XOUT), and an output task bound to maps Map0, Map1, and Map2 over nodes 0,1 / 0,2 / 1,3; when a node fails, the affected task is remapped onto the spare]

    • Switching processors is accomplished by switching maps

    • No change to algorithm required
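    A self-contained toy sketch of the idea follows; the Map and Task types are hypothetical stand-ins (not PVL’s real interfaces), but they capture the control flow: on a failure, only the map binding changes and the algorithm itself is untouched.

    #include <functional>
    #include <vector>

    struct Map  { std::vector<int> nodes; };          // which nodes a task runs on

    struct Task {
        std::function<void()> algorithm;              // the computation; never changes
        Map map;                                      // where it runs; swapped on failure
        void rebind(const Map& m) { map = m; }
    };

    int main() {
        Map primary{{0, 2}};                          // compute task on nodes 0 and 2
        Map spare{{0, 3}};                            // node 3 is the spare

        Task compute{ [] { /* filter / beamform / detect ... */ }, primary };

        bool node_failed = true;                      // assume a failure was detected
        if (node_failed)
            compute.rebind(spare);                    // switch maps; no change to the algorithm

        compute.algorithm();                          // same code now runs on the new node set
        return 0;
    }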

  • Slide-35 www.hpec-si.org

    Dynamic Mapping Performance Results

    [Chart: relative time vs. data size (32–2048) for Remap/Baseline, Redundant/Baseline, and Rebind/Baseline]

    • Good dynamic mapping performance is possible

  • Slide-36 www.hpec-si.org

    Optimal Mapping of Complex Algorithms

    [Figure: application pipeline – Input (XIN) → Low Pass Filter (FIR1 with weights W1, FIR2 with weights W2) → Beamform (multiply by W3) → Matched Filter (FFT, multiply by W4, IFFT) → XOUT – mapped onto different hardware (workstation, embedded board, embedded multi-computer, PowerPC cluster, Intel cluster), each requiring a different optimal map]

    • Need to automate the process of mapping the algorithm to hardware

  • Slide-37 www.hpec-si.org

    Self-optimizing Software for Signal Processing

    • Find
      – Min(latency | #CPU)
      – Max(throughput | #CPU)

    • S3P selects the correct optimal mapping

    • Excellent agreement between S3P predicted and achieved latencies and throughputs

    [Charts: latency (seconds) and throughput (frames/sec) vs. #CPU (4–8) for small (48x4K) and large (48x128K) problem sizes; predicted and achieved values agree closely, with S3P-selected stage mappings such as 1-1-1-1, 1-1-2-1, 1-2-2-2, 1-3-2-2, and 2-2-2-2]
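    To illustrate what “find Min(latency | #CPU)” means in practice, here is a small self-contained search sketch (not the actual S3P algorithm, which uses measured stage timings and also maximizes throughput): given an assumed per-stage cost model, it enumerates how many CPUs each of four pipeline stages gets and keeps the allocation with the lowest total latency for a fixed CPU budget, reporting it in the same “1-2-2-2” style as the mappings above.

    #include <array>
    #include <iostream>
    #include <limits>

    int main() {
        const std::array<double, 4> work = {1.0, 3.0, 2.0, 1.0};  // assumed per-stage costs
        const int nCpus = 8;                                      // CPU budget

        double bestLatency = std::numeric_limits<double>::max();
        std::array<int, 4> best = {1, 1, 1, 1};

        for (int a = 1; a <= nCpus; ++a)
            for (int b = 1; a + b <= nCpus; ++b)
                for (int c = 1; a + b + c <= nCpus; ++c) {
                    int d = nCpus - a - b - c;                    // remaining CPUs go to stage 4
                    if (d < 1) continue;
                    std::array<int, 4> alloc = {a, b, c, d};
                    double latency = 0.0;                         // toy model: stage time = work / cpus
                    for (int s = 0; s < 4; ++s) latency += work[s] / alloc[s];
                    if (latency < bestLatency) { bestLatency = latency; best = alloc; }
                }

        std::cout << "best mapping " << best[0] << "-" << best[1] << "-"
                  << best[2] << "-" << best[3] << ", latency " << bestLatency << "\n";
        return 0;
    }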

  • Slide-38 www.hpec-si.org

    High Level Languages

    [Figure: high performance Matlab applications (DoD sensor processing, DoD mission planning, scientific simulation, commercial applications) connect through a user interface and a hardware interface to a Parallel Matlab Toolbox running on parallel computing hardware]

    • Parallel Matlab need has been identified – HPCMO (OSU)

    • Required user interface has been demonstrated – Matlab*P (MIT/LCS), PVL (MIT/LL)

    • Required hardware interface has been demonstrated – MatlabMPI (MIT/LL)

    Parallel Matlab Toolbox / Parallel Computing Hardware

    • Parallel Matlab Toolbox can now be realized

  • Slide-39 www.hpec-si.org

    MatlabMPI deployment (speedup)

    • Maui
      – Image filtering benchmark (300x on 304 CPUs)

    • Lincoln
      – Signal processing (7.8x on 8 CPUs)
      – Radar simulations (7.5x on 8 CPUs)
      – Hyperspectral (2.9x on 3 CPUs)

    • MIT
      – LCS Beowulf (11x Gflops on 9 duals)
      – AI Lab face recognition (10x on 8 duals)

    • Other
      – Ohio St. EM simulations
      – ARL SAR image enhancement
      – Wash U hearing aid simulations
      – So. Ill. benchmarking
      – JHU digital beamforming
      – ISL radar simulation
      – URI heart modeling

    • Rapidly growing MatlabMPI user base demonstrates the need for parallel Matlab

    • Demonstrated scaling to 300 processors

    [Chart: performance (Gigaflops) vs. number of processors (1–1000), MatlabMPI vs. linear scaling – image filtering on the IBM SP at the Maui Computing Center]

  • Slide-40 www.hpec-si.org

    Summary

    • HPEC-SI expected benefit
      – Open software libraries, programming models, and standards that provide portability (3x), productivity (3x), and performance (1.5x) benefits to multiple DoD programs

    • Invitation to participate
      – DoD program offices with signal/image processing needs
      – Academic and government researchers interested in high performance embedded computing
      – Contact: [email protected]

  • Slide-41 www.hpec-si.org

    The Links

    High Performance Embedded Computing Workshop
    http://www.ll.mit.edu/HPEC

    High Performance Embedded Computing Software Initiative
    http://www.hpec-si.org/

    Vector, Signal, and Image Processing Library
    http://www.vsipl.org/

    MPI Software Technologies, Inc.
    http://www.mpi-softtech.com/

    Data Reorganization Initiative
    http://www.data-re.org/

    CodeSourcery, LLC
    http://www.codesourcery.com/

    MatlabMPI
    http://www.ll.mit.edu/MatlabMPI
