kalray mcsoc 2016€¦ · kalray is the ideal co-processor to off-load x86 or ... data and control...

Kalray MPPA®

Massively Parallel Processor Array

Engineering a Manycore Processor Platform for

Mission-Critical Applications

Benoît Dupont de Dinechin, CTO

2016 – Kalray SA All Rights Reserved MCSoC 2016 2

� Cloud and Data Center acceleration� Kalray is the ideal co-processor to off-load x86 or

accelerate real-time or compute intensive functions

� Domains of application: Networking, Storage, Big Data analytics, Cybersecurity, Deep learning

� High Performance Embedded Computing� Embedded applications require higher

performance without increasing power consumption and more integration of functions including those constrained by real-time

� Domains of application: aerospace, automotive, transport, energy

Kalray Solutions Target Two Strategic Markets


� Mission-Critical Applications

� MPPA®-256 Bostan Processor

� MPPA® NoC Management

� MPPA® Software Environment

� MPPA® Coolidge Processor

� Conclusions

Outline


Evolution of Sensor Fusion for Autonomous Driving(figures from M. Fausten keynote presentation at ESWEEK 2015)


“Number Crunching” in Sensor Fusion Units (figures from M. Fausten keynote presentation at ESWEEK 2015)


� Dependability requirements

� Detect, isolate or mitigate errors that occur in digital circuits

� Transient electrical problems, soft errors, physical damage

� Timing requirements

� Time constraints associated with information manipulation

� Execution time determinism, predictability, composability

� Certification of worst-case response time by static analysis

Mission-Critical Computing


� Intra-core non-deterministic timing

� Speculative execution, branch prediction

� Cache content

� Inter-core interferences

� Concurrent accesses to caches, busses, and peripheral devices

� Cache coherence invalidations

� Main issue

� Growing gap between average execution time and static WCET

Execution Timing on a Multicore Processor


� Challenge: managing interference between cores

Classic Multicore Memory Hierarchy


� Challenge: programmability of DMA and private memories

Embedded Multicore Memory Hierarchy







� Conclusions

Outline


� DSP type of processing

� Energy efficiency

� Timing predictability

� Software programmability

� CPU ease of programming

� C/C++ GNU environment

� 32-bit/64-bit, little-endian

� Rich operating system (Linux with dynamic loading&linking)

� Integrated many-core processor

� 32 management cores on chip

� 256 application cores on chip

� High-performance I/O

� Scalable parallel computing

� MPPA® processors can be tiled together through NoC extension

� Run-time support for pooling the external DDR memory resources

MPPA®-256 Manycore Architecture Highlights

Co

urt

esy

Alt

era


MPPA®-256 Bostan Processor Architecture

Manycore Processor

� 16 compute clusters

� 2 I/O clusters each with quad-core CPUs, DDR3, 4 Ethernet 10G and 8 PCIe Gen3

� Data and control networks-on-chip

� Distributed memory architecture

� 634 GFLOPS SP for 25W @ 600Mhz

Compute Cluster

� 16 user cores + 1 system core

� NoC Tx and Rx interfaces

� Debug & Support Unit (DSU)

� 2 MB multi-banked shared memory

� 77GB/s Shared Memory BW

�16 cores SMP System

� 32-bit or 64-bit addresses

� 5-issue VLIW architecture

� MMU + I&D cache (8KB+8KB)

� 32-bit/64-bit IEEE 754-2008 FMA FPU

� Tightly coupled crypto co-processor

� 2.4 GFLOPS SP per core @600Mhz

VLIW Core


� Physical characteristics� TSMC CMOS 28HP

� 100µW/MHz per core + L1 caches

� 2W to 3W leakage

� Processor interfaces

� 2x DDR3 Memory interfaces

� 2x PCIe Gen3 8-lane interface

� 8x 1G/10G Ethernet or 2x 40G Ethernet interfaces

� SPI/I2C/UART interfaces

� Universal Static Memory Controller (NAND/NOR/SRAM)

� GPIOs with Direct NoC Access

� NoC extension through Interlaken interface (NoCX)

MPPA®-256 Bostan Processor256 + 32 VLIW cores / 18 address spaces / 400 MHz – 800 MHz


� U. Saarland / AbsInt GMBH recommendations on VLIW core and cache micro-architecture design

� AbsInt provides the aiT static timing analysis tool used to certify flight control systems at Airbus

� AbsInt aiT tool also targets the Kalray VLIW cores

� Architecture design with a focus on timing predictability

� Core level: micro-architecture� Fully timing compositional core

� LRU caches and write buffer

� Cache bypass memory opcodes

� Cluster level: multi-banked shared memory� Core-private buses for memory bank access

� Address interleaving or blocking across banks

� Processor level: NoC/DDR guaranteed services� NoC minimum bandwidth & maximum latency

� DDR controller configurations and address mapping

MPPA® Processor Co-Design for Avionics


MPPA®-256 Bostan VLIW Core Data Path

� Shared ACC read port� Unified MAU and FPU� LSU and MAU also execute

single-cycle ALU instructions

� 5-issue, polycyclic property� Unified 10 read, 5 write

64x32-bit register file� 64-bit data in register pairs


46117

225

497

0

100

200

300

400

500

600

Kalray Bostan ARM Cortex A7

(Little)*

GPU** ARM Cortex A15

(Big)*

pJ per Instruction on Matrix Multiply

MPPA® Bostan VLIW Core Energy Efficiency

*: “An Instruction Level Energy Characterization of ARM Processors” FORTH-ICS/TR-450, March 2015

**: Source: Measured on Matrix Multiply EPI Test + Source Bill Dally, “To ExaScale and Beyond” - NVidia

Kalray Bostan at 141 pJ per bundle with 3.14 instructions average per bundle







� Conclusions

Outline


� Dual 2D-torus NoC

� Direct, wormhole switching

� D-NoC: high-bandwidth RDMA

� C-NoC: low-latency mailboxes

� 4B/cycle per link direction per NoC

� Nx10Gb/s NoC extensions for connection to FPGA or other MPPA®

� Predictability

� Data NoC is configured by selecting routes and injection parameters

� Routing ensure deadlock-free traffic

� Injection parameters are the (σ,ρ) or (burst, rate) of network calculus

MPPA®-256 Bostan Network-on-Chip (NoC)


Wormhole Switching Illustrated

1 2

5 6

3

7

4

8

9

B

A

When the required channel is busy, the hop flow control blocks the trailing flits and they stay in flit buffers along the established route

A packet is composed of several flitsThe header flits governs the routeThe payload flits follow the header in a pipeline fashion


� Complex to implement

� May be true for input queueing & output matching (e.g. iSLIP)

� The MPPA® NoC routers only include demultiplexers, output queues and RR arbiters

� Prone to deadlocking

� In this example, the red flow cannot use R3→R2 because the blue flow is using it

� Likewise, the blue flow needsR1→R4 held by the red flow

� Deadlock requires full queues

Wormhole Switching Network Issues


� Deadlock results from circuits of agents and resources

connected by a wait-for relation [Dally & Seitz 1998]

� Wormhole switching: agents are packets; resources are link buffers

� Links between routers and links internal to routers (‘turns’)

� Solutions when the network topology is a 2D mesh

� Dimension order (X-Y on 2D meshes)

� Turn model [Glass & Ni 1994] (not the same as ‘turn prohibition’)

� Odd-Even [Chiu 2000] (better path diversity than turn model)

� Hamiltonian Odd-Even [Bahrebar & Stroobandt 2015] (multicasting)

� Strategy for the MPPA® NoC management

� Isolate a 2D mesh in topology and apply deadlock-free routing

Deadlock-Free Routing Wormhole Switching


� Even rows

� East-South turn prohibited

� North-West turn prohibited

� Odd rows

� North-East turn prohibited

� West-South turn prohibited

Hamiltonian Odd-Even Prohibited Turns

Even

Even

Odd

Odd

Even

Even

Odd

Odd


� Routing between compute clusters

� Routes generated assuming a 6x6 mesh

� Impossible routes are discarded

� Routing between I/O and compute cluster

� Always possible, thanks to the 4 NoC nodes per I/O cluster

Hamiltonian Odd-Even on the MPPA® NoC

I/O DDR 1

I/O DDR 0

I/O Ethernet 1

I/O Ethernet 0


� For each flow, select a single path among those proposed by adaptive routing so as to maximize utility

� If multi-path allowed, reducesto a multi-commodity flowproblem with a direct solution

� Adding the single path constraintsmakes the problem NP-hard

� Practical instances are solved withmixed integer programming

� Utility function is the max-minfairness of flow rates

� Maximize the lowest flow rate,then the next lowest flow rate etc.

MPPA® NoC Path Selection Problem

Flow Route Bandwidth

F48 R1 0.441901304029376

F49

R1 0

R2 0.375611423014079

R3 0

R4 0

F50 R1 0.13862940041161

F51 R1 0.23889251894024

R2 0

� Numerical results for all pairs offlows between 8 clusters (55 flows)


� Data NoC packet injection implements a (σ,ρ)regulation

� No more than σ+ρ(t-s) flits are injected for any interval [s,t]

� Application of Deterministic Network Calculus prevents NoC congestion and provides bounds on end-to-end delays

� Network Calculus application requires feed-forward flows

� Deadlock-free wormhole switching implies feed-forward flows

Principle of the MPPA NoC Guaranted Services

Cumulative flit count

time

σ

~ρL







� Conclusions

Outline


Complete AccessCore™ SDK

Standard C/C++

Programming Environment

C - Low Level/ Lib

Programming DSP Style

C - POSIX-Level

Programming CPU Style

OpenCL

Programming GPU Style

Simulators & Profilers,

Debuggers & System Trace

Operating Systems

& Device Drivers

AccessLib™

Optimized Libraries

OpenDataPlane

Open API for networking


� Approaches to communication� Load/Store to unified memory

� Remote DMA to unified memory

� Remote DMA to private memories

� Send/receive to private memories

� Approaches to synchronization� Locks, semaphores, conditional variables

� Barriers, collectives (reduce/scan/all2all)

� Remote atomic operations

� Remote atomic queues

� Classic HPC clusters vs MPPA� The working memory of HPC clusters is the

collection of identical node memories

� The working memory of the MPPA is the collection of SMEMs backed by the DDRs

MPPA® Distributed Memory Architecture Challenge


� MPPA local memory benefits� Local memory accesses are more energy-

efficient than a Last-Level Cache (LLC)

� Local memory access interferences can be managed for time-critical applications

� MPPA DSM for implicit DDR access� Escape code size and data size restrictions

� Mostly used for application porting

� Some implicit global memory accesses cannot be converted to remote DMA

� MPPA Asynchronous Operations� Using of local memories and sharing DDR

memories through remote DMA over NoC

� Remote atomic operations and queues

� Co-exist with the DSM or run stand-alone

MPPA® Distributed Memory Architecture + Runtime


MPPA Asynchronous Operations Ordering


MPPA® Software Stack Overview

Rich OS

Hypervisor (Exokernel)

GNU C, C++, Fortran

MPPA® platform

OpenCL runtime

OpenCL C

LanguageLibraries

RPC, Asynchronous Operations, DSM

Low-level programming interfaces

POSIX threads + OpenMP

RTOS


� Safety-critical control-command applications

� Model-based programming using SCADE Suite® from Esterel Technologies

� Complemented with static timing analysis of binary code (aiT from AbsInt)

� Motivations for multicore and manycore execution

� Distribute the compute load across cores and reduce memory interferences

� Effective implementation of multi-rate harmonic applications

� Envision use of fast Model Predictive Control (MPC) techniques

SCADE Code Generation for the MPPA®


� eSOL eMCOS is the world’s first many-core RTOS for embedded use

� Runs on the MPPA® clusters on top of Low-Level programming

eMCOS for MPPA® Processors


� Number-crunching applications

� Use OpenCL hosted on an I/O cluster quad-core running Linux

� Partition the compute clusters using OpenCL subDevices

� Enable OpenCL Data Parallel and Task Parallel models

� Inside Task Parallel model, support Posix programming

� C/C++/Pthread/FILE instead of OpenCL-C

� MPPA Asynchronous operations instead of OpenCL async_copy

� Control-command applications

� SCADE Suite® code generation for compute clusters

� Exchange with I/O application part using RDMA through NoC

� Automotive RTOS & middleware

� OSEK/VDX RTOS and support of AUTOSAR deployment environment

Automotive Software Development Kit Vision







� Conclusions

Outline


� Compute cluster as CPU

� Direct access to DDR memory

� Optional L1 cache coherence

� Full 64-bit OS support

� Functional safety

� Power domains

� Robust partitioning

� NoC error protection

� Security architecture

� Trust provisioning

� Secure boot chain

� Authenticated debug

MPPA® Coolidge Processors

Co

urt

esy

Ra

mb

us

Co

urt

esy

Insi

de

Se

cure


� TSMC 16FFC technology node� 50% maximum frequency increase

compared to 28nm HPC

� 40% dynamic power reduction at same frequency compared to 28nm HPC

� 3rd generation Kalray VLIW core� 1GHz typical core frequency

� 64-bit 5-issue VLIW core architecture

� 128-bit load/store unit (16 bytes /cycle)

� 16-bit, 32-bit and 64-bit IEEE 754 FPU

� 16KB, 4-way associative L1 cache s

� 4 privilege levels like in ARMv8

� Co-processing� Transcendental function acceleration for

deep learning and scientific computing

� Block cipher and hashing acceleration

� Pixel pre-processing for computer vision

� Compute cluster� 4MB memory capacity, 128-bit busses

� Local memory / L2 cache partitioning

� Interleaved or locked memory map

� RM core as L2 cache controller

� Unified NoC� Dual virtual channel, 128-bit flits

� RDMA, load/store and control

� 2D grid topology

� External memory� One DDR4 channel per 4 compute clusters

� 1 channel on MPPA®-64

� 4 channels on MPPA®-256

� High-speed peripherals� Ethernet 25Gbps

� PCIe Gen4

� Interlaken 25Gbps

MPPA® Coolidge Main Evolutions


� Kalray KANN tool interprets Berkeley Caffe files of trained networks and generates code

� Network architecture

� Parameters (filter coefficients)

� Focus on low-latency neural network inference

� Convolutional layer neurons are spread across compute clusters

� NoC multicasting to copy parameters from DDR to SMEM in compute clusters

� All data transfers and coordination with MPPA asynchronous operations

MPPA® Processors on Deep Learning Application

0

50

100

150

200

250

Kalray**

Bostan

Nvidia*

Tegra

Kalray**

Coolidge

Fp

s

Neural Network(Googlenet - FP32)







� Conclusions

Outline


� Target applications that benefit from manycore processors

� High performance computing (GPU, MIC)

� High performance networking (Cavium, Mellanox/Tilera)

� High integrity embedded computing (robotics, aerospace, defense)

� Achieving ‘near-data computing’ (using local memory) is key

� Required for energy efficiency and timing predictability

� Local memory is naturally exploited by KPN models of computation

� Must also support transparent and effective access to global memory

� Programming models must be adapted

� Explicit memory hierarchy and relaxed memory consistency

� OpenCL a first step but local memory exploitation is too restrictive

Lessons from Engineering Manycore Processors


Multiple applications … Cost efficiency

Sensing

ComputingNetworking

Running in

PARALLEL

& in

REAL TIME

Safety Security


MPPA, ACCESSCORE and the Kalray logo are trademarks or registered

trademarks of Kalray in various countries.

All trademarks, service marks, and trade names are the marks of the respective owner(s), and any unauthorized use thereof is strictly

prohibited. All terms and prices are indicatives and subject to any modification without notice.

KALRAY INC.

Los Altos - USA

4962 El Camino Real Los Altos, CA USA

Tel: +1 (650) 469 3729email: [email protected]

KALRAY S.A.

Grenoble - France

445 rue Lavoisier, 38 330 MontbonnotFrance

Tel: +33 (0)4 76 18 09 18email: [email protected]

KALRAY S.A.

Paris - France

86 rue de Paris,91 400 OrsayFrance

Tel: +33 (0) 184 00 00 45

email: [email protected]

Thank you


MPPA®-256 Bostan Backup Slides


� Optimize use of the data memory bandwidth

� Widening to 64-bit, no alignment restrictions

� Enable large immediate values in instruction stream

� All memory accesses may bypass the L1 data cache & write buffer

� Eliminate DLX ISA features and restrictions

� Instructions with 3 or 4 source operands, 1 or 2 target operands

� No aliasing between registers and special resources (LR, zero)

� Memory addressing modes similar to those of PowerPC

� Effective floating-point support with Fused Multiply Add

� Rework if-conversion support

� Remove Boolean registers and SELECT instructions

� Use CMOV and conditional load/store instructions

� Support hardware looping

Kalray VLIW Architecture Compared to HP Labs LxThe Lx architecture begat the STMicroelectronics ST200 VLIW family


� 20 bus masters

� 16 application cores

� 1 management core

� NoC Tx and Rx interfaces

� Debug support unit (DSU)

� 16-banked shared memory

� 2MB extensible to 4MB

� No bus interferences between cores

� RR arbitration between bus masters

� Interleaved or blocked address map

MPPA2®-256 Bostan Compute Cluster


� Ethernet as the main high-performance / low-latency IO

� Integration of Ethernet Rx and Tx to the D-NoC architecture

� Per Ethernet Rx port

� 8 classification tables for hardware dispatch

� Round-robin or classified cluster & core allocation

� Per Ethernet Tx port

� 64 independent Tx FIFOs

� Weighted round-robin between Tx FIFOs

� Flow control between clusters and Tx FIFOs

MPPA2®-256 Bostan Ethernet Support

kalray mcsoc 2016€¦ · kalray is the ideal co-processor to off-load x86 or ... data and control...

Documents