agile, extensible, fast i/o module for the cyber-physical era

22
Carlos Alvarez, Eduard Ayguadé, Javier Bueno, Antonio Filgueras, Daniel Jimenez-Gonzalez, Xavier Martorell, Nacho Navarro Barcelona Supercomputing Center, Spain Dimitris Theodoropoulos, Dionisios N. Pnevmatikatos FORTH-ICS Claudio Scordino, Paolo Gai Evidence Srl Pisa, Italy Davide Catani SECO Arezzo, Italy Antonio Rizzo University of Siena Siena, Italy Carlos Segura, Carles Fernandez, David Oro, Javier Rodriguez Saeta Herta Security, Barcelona, Spain Pierluigi Passera, Alberto Pomella VIMAR SpA, Marostica, Italy Coordinator: Roberto Giorgi University of Siena The AXIOM Software Layers DSD, Funchal, 26 th August 2015 Agile, eXtensible, fast I/O Module for the cyber-physical era

Upload: others

Post on 14-Mar-2022

3 views

Category:

Documents


0 download

TRANSCRIPT

Carlos Alvarez, Eduard Ayguadé,

Javier Bueno, Antonio Filgueras,

Daniel Jimenez-Gonzalez,

Xavier Martorell, Nacho Navarro

Barcelona Supercomputing Center,

Spain

Dimitris Theodoropoulos,

Dionisios N. Pnevmatikatos

FORTH-ICS

Claudio Scordino,

Paolo Gai

Evidence Srl

Pisa, Italy

Davide Catani

SECO

Arezzo, Italy

Antonio Rizzo

University of Siena

Siena, Italy

Carlos Segura, Carles Fernandez,

David Oro, Javier Rodriguez Saeta

Herta Security, Barcelona, Spain

Pierluigi Passera,

Alberto Pomella

VIMAR SpA, Marostica, Italy

Coordinator:

Roberto Giorgi

University of Siena

The AXIOM Software LayersDSD, Funchal, 26th August 2015

Agile, eXtensible, fast I/O Module for the cyber-physical era

Scenario

• Heterogeneous many-core designs to fight the power wall

• E.g. CPU + GPU in a single chip

• AMD Fusion APUs

• < 20 watt, thousands of cores

• GPU lacks flexibility

• CPU + Programmable logics

• E.g., Xilinx Zynq

• ARM Dual-core + Artix7 | Kintex7

• Programmability opportunities

AXIOM id. 645496

http://www.axiom-project.eu2

The AXIOM Project

• Realizing a small board that is flexible, energy efficient and modularly scalable

• Easy programmability of multi-core, multi-board, FPGA

• Leveraging Open-Source software to manage the board

• Easy Interfacing with the Cyber-Physical Worlds

• Enabling real-time thread scheduling

• Contribution to Standards

AXIOM id. 645496

http://www.axiom-project.eu3

• BSC – Programming models

• EVI – Runtime, OS (Linux RT scheduler)

• FORTH – High-speed interconnects

• SECO – Hardware module realization

• UNISI – Simulation, evaluation, coordination

• VIMAR, HERTA – Applications

Xilinx Zynq

• Year 2011

• 2013: Zynq-7100

• Broad range of high-end embedded applications

• ADAS, medical, wireless, vision, ...

• ARM host to increase programmability

• Different FPGAs

• Depending on requirements

• Still, the Prog. Logic needs to be programmed!

AXIOM id. 645496

http://www.axiom-project.eu4

The AXIOM board

• Integrate multiple cores + FPGA in the same chip

• Produce prototypes of single-board computers

• connected with high-speed board-to-board comm medium

• Easily plug peripherals and components

• Such as Arduino Shields

AXIOM id. 645496

http://www.axiom-project.eu5

The AXIOM software stack

• Linux-based OS

• Highly-expressive programming model OmpSs

• OpenMP-like

• Portions of applications on FPGA

• Mapping of (HW) task to resources: OmpSs@Zynq

At cluster level:

• OmpSs@Cluster (+ OmpSs@Zynq)

• DSM (+ OmpSs@Zynq)

AXIOM id. 645496

http://www.axiom-project.eu6

Current status: MxM example

AXIOM id. 645496

http://www.axiom-project.eu7

void matrix_multiply( int BS, T a[BS][BS], T b[BS][BS], T c[BS][BS]) {

for ( int ia = 0; ia < BS; ++ia)for ( int ib = 0; ib < BS; ++ib) {

T sum = 0;for ( int id = 0; id < BS; ++id)

sum += a[ia][id] * b[id][ib];c[ia][ib] = sum;

}

}

OmpSs Pragmas: Programmability

AXIOM id. 645496

http://www.axiom-project.eu8

We can do it with

only two OmpSs

pragma directives!

#pragma omp target device(fpga, smp) copy_deps#pragma omp task in(a[0:BS*BS-1], b[0:BS*BS-1]) \

inout(c[0:BS*BS-1])void matrix_multiply( int BS, T a[BS][BS], T b[BS][BS],

T c[BS][BS]) {

for ( int ia = 0; ia < BS; ++ia)for ( int ib = 0; ib < BS; ++ib) {

T sum = 0;for ( int id = 0; id < BS; ++id)

sum += a[ia][id] * b[id][ib];c[ia][ib] = sum;

}

}

How we want to execute

AXIOM id. 645496

http://www.axiom-project.eu9

ARM

ARM

FPGA ARM

FPGA

ARM

FPGA

ARM

FPGA

ARM

FPGA

#pragma omp target device(fpga, smp) copy_deps#pragma omp task in(a[0:BS*BS-1], b[0:BS*BS-1]) \

inout(c[0:BS*BS-1])void matrix_multiply( int BS, T a[BS][BS], T b[BS][BS],

T c[BS][BS]) {

for ( int ia = 0; ia < BS; ++ia)for ( int ib = 0; ib < BS; ++ib) {

T sum = 0;for ( int id = 0; id < BS; ++id)

sum += a[ia][id] * b[id][ib];c[ia][ib] = sum;

}

}

How we want to execute

AXIOM id. 645496

http://www.axiom-project.eu10

ARM

ARM

FPGA ARM

FPGA

ARM

FPGA

ARM

FPGA

ARM

FPGA

#pragma omp target device(AXIOM)

#pragma omp target device(fpga, smp) copy_deps#pragma omp task in(a[0:BS*BS-1], b[0:BS*BS-1]) \

inout(c[0:BS*BS-1])void matrix_multiply( int BS, T a[BS][BS], T b[BS][BS],

T c[BS][BS]) {

for ( int ia = 0; ia < BS; ++ia)for ( int ib = 0; ib < BS; ++ib) {

T sum = 0;for ( int id = 0; id < BS; ++id)

sum += a[ia][id] * b[id][ib];c[ia][ib] = sum;

}

}

OmpSs@Zynq

• Mercurium source to source compiler

• C code transformation, Nanos++ call insertions and bitstream generation automation

• Nanos++ runtime

• Task dependency management and data movement (DMA lib management)

• Extrae library

• Task execution instrumentation

AXIOM id. 645496

http://www.axiom-project.eu11

OmpSs@Zynq infrastructure

AXIOM id. 645496

http://www.axiom-project.eu12

CORE

OmpSs@Zynq runtime (Nanos++)

AXIOM id. 645496

http://www.axiom-project.eu13

CORE FPGA

Going cluster

Two alternatives are being evaluated

• OmpSs@Cluster uses additional modules as heterogeneous accelerators

• Nanos takes care of the data movements between modules

• No global variables shared between the main module and the accelerator modules

• A DSM approach where the OS takes care of the data movement and behaves like a large multicore

AXIOM id. 645496

http://www.axiom-project.eu14

CORE

OmpSs@Cluster + OmpSs@Zynq

AXIOM id. 645496

http://www.axiom-project.eu15

CORE FPGA

OmpSs@Cluster memory management

AXIOM id. 645496

http://www.axiom-project.eu16

DSM + OmpSs@Zynq

AXIOM id. 645496

http://www.axiom-project.eu17

Compilation & Execution

• Non instrumentation compilation

• $ fpgacc --ompss -o main.elf main.c

• Instrumented compilation

• $ fpgacc --ompss --instrument -o main.elf main.c

• Non instrumented execution

• $ ./main.elf

• Instrumented execution

• $ NX_ARGS=”--instrumentation=extrae” ./main.elf

AXIOM id. 645496

http://www.axiom-project.eu18

Performance Analysis Tools

AXIOM id. 645496

http://www.axiom-project.eu19

Dynamic task graph

Analysis with Paraver

Results

AXIOM id. 645496

http://www.axiom-project.eu20

1 SMP 2SMP 1 ACC 2ACC Heterogeneous 1 SMP + 2 ACC

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4 64x64 MxM

Hand codedOMPSST

ime

(s)

1 SMP 2SMP 1 ACC 2ACC Heterogeneous 1 SMP + 2 ACC

0

10

20

30

40

50

60

70

80 Covariance

Hand codedOMPSST

ime

(s)

1 SMP 2SMP 1 ACC 2ACC Heterogeneous 1 SMP + 2 ACC

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45 Cholesky

Hand codedOMPSST

ime

(s)

Application Seq w/

DMA

Pthread OmpSs

Cholesky 71 26 3

Covariance 94 29 3

Matrix

Multiply

95 39 3

Programming effort

Conclusions

• H2020 AXIOM to explore heterogeneous architectures for next-generation Cyber-Physical Systems

• Need for a programming model that relieves the programmer from the burden of dealing with micro-managing hardware details: OmpSs

• Two possible alternatives of managing inter-module task scheduling

• Our preliminary system to build a software framework for the AXIOM project shows promising programmability and performance results

AXIOM id. 645496

http://www.axiom-project.eu21

Agile, eXtensible, fast I/O Module for the cyber-physical eraPROJECT ID: 645496

University of Siena

(Coordinator Partner)