Kalray MPPA®
Massively Parallel Processor Array
Engineering a Manycore Processor Platform for
Mission-Critical Applications
Benoît Dupont de Dinechin, CTO
2016 – Kalray SA All Rights Reserved MCSoC 2016 2
Kalray Solutions Target Two Strategic Markets
• Cloud and Data Center acceleration
  - Kalray is the ideal co-processor to off-load x86 hosts or to accelerate real-time or compute-intensive functions
  - Domains of application: networking, storage, big data analytics, cybersecurity, deep learning
• High Performance Embedded Computing
  - Embedded applications require higher performance without increased power consumption, and tighter integration of functions, including those with real-time constraints
  - Domains of application: aerospace, automotive, transport, energy

Outline
• Mission-Critical Applications
• MPPA®-256 Bostan Processor
• MPPA® NoC Management
• MPPA® Software Environment
• MPPA® Coolidge Processor
• Conclusions

Evolution of Sensor Fusion for Autonomous Driving (figures from M. Fausten's keynote presentation at ESWEEK 2015)

“Number Crunching” in Sensor Fusion Units (figures from M. Fausten's keynote presentation at ESWEEK 2015)

Mission-Critical Computing
• Dependability requirements
  - Detect, isolate, or mitigate errors that occur in digital circuits
  - Transient electrical problems, soft errors, physical damage
• Timing requirements
  - Time constraints associated with information manipulation
  - Execution time determinism, predictability, composability
  - Certification of worst-case response time by static analysis

Execution Timing on a Multicore Processor
• Intra-core non-deterministic timing
  - Speculative execution, branch prediction
  - Cache content
• Inter-core interference
  - Concurrent accesses to caches, buses, and peripheral devices
  - Cache coherence invalidations
• Main issue
  - Growing gap between average execution time and static worst-case execution time (WCET)

Classic Multicore Memory Hierarchy
• Challenge: managing interference between cores

Embedded Multicore Memory Hierarchy
• Challenge: programmability of DMA and private memories

MPPA®-256 Manycore Architecture Highlights
• DSP type of processing
  - Energy efficiency
  - Timing predictability
  - Software programmability
• CPU ease of programming
  - C/C++ GNU environment
  - 32-bit/64-bit, little-endian
  - Rich operating system (Linux with dynamic loading and linking)
• Integrated many-core processor
  - 32 management cores on chip
  - 256 application cores on chip
  - High-performance I/O
• Scalable parallel computing
  - MPPA® processors can be tiled together through NoC extension
  - Run-time support for pooling the external DDR memory resources
[Figure: courtesy Altera]

MPPA®-256 Bostan Processor Architecture
Manycore Processor
• 16 compute clusters
• 2 I/O clusters, each with quad-core CPUs, DDR3, 4 Ethernet 10G and 8 PCIe Gen3
• Data and control networks-on-chip
• Distributed memory architecture
• 634 GFLOPS SP for 25 W @ 600 MHz
Compute Cluster
• 16 user cores + 1 system core
• NoC Tx and Rx interfaces
• Debug & Support Unit (DSU)
• 2 MB multi-banked shared memory
• 77 GB/s shared memory bandwidth
• 16-core SMP system
VLIW Core
• 32-bit or 64-bit addresses
• 5-issue VLIW architecture
• MMU + I&D cache (8 KB + 8 KB)
• 32-bit/64-bit IEEE 754-2008 FMA FPU
• Tightly coupled crypto co-processor
• 2.4 GFLOPS SP per core @ 600 MHz
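
The peak figures quoted on this slide can be cross-checked with simple arithmetic. The sketch below is only a sanity check: the value of 4 single-precision operations per cycle per core is an assumption inferred from dividing 2.4 GFLOPS by 600 MHz, and the helper name `gflops` is illustrative.

```python
# Cross-check of the peak single-precision figures quoted on the slide.
FREQ_HZ = 600e6                # 600 MHz, from the slide
FLOPS_PER_CYCLE_PER_CORE = 4   # assumption: inferred from 2.4 GFLOPS / 600 MHz
USER_CORES = 16 * 16           # 16 compute clusters x 16 user cores, from the slide


def gflops(cores: int,
           flops_per_cycle: int = FLOPS_PER_CYCLE_PER_CORE,
           freq_hz: float = FREQ_HZ) -> float:
    """Peak GFLOPS for `cores` cores running at `freq_hz`."""
    return cores * flops_per_cycle * freq_hz / 1e9


per_core = gflops(1)                    # per-core peak
user_core_peak = gflops(USER_CORES)     # peak of the 256 user cores alone
cores_implied_by_634 = 634 / per_core   # cores needed to reach the quoted 634 GFLOPS
gflops_per_watt = 634 / 25              # efficiency from the quoted 25 W

print(per_core, user_core_peak, round(cores_implied_by_634), gflops_per_watt)
```

Under these assumptions the 256 user cores account for 614.4 GFLOPS, and the quoted 634 GFLOPS corresponds to roughly 264 cores at 2.4 GFLOPS each, so the chip-level figure presumably counts some system or I/O cores as well; the slide does not say which.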

MPPA®-256 Bostan Processor
256 + 32 VLIW cores / 18 address spaces / 400 MHz – 800 MHz
• Physical characteristics
  - TSMC CMOS 28HP
  - 100 µW/MHz per core + L1 caches
  - 2 W to 3 W leakage
• Processor interfaces
  - 2x DDR3 memory interfaces
  - 2x PCIe Gen3 8-lane interfaces
  - 8x 1G/10G Ethernet or 2x 40G Ethernet interfaces
  - SPI/I2C/UART interfaces
  - Universal Static Memory Controller (NAND/NOR/SRAM)
  - GPIOs with direct NoC access
  - NoC extension through Interlaken interface (NoCX)

MPPA® Processor Co-Design for Avionics
• U. Saarland / AbsInt GmbH recommendations on VLIW core and cache micro-architecture design
  - AbsInt provides the aiT static timing analysis tool used to certify flight control systems at Airbus
  - The AbsInt aiT tool also targets the Kalray VLIW cores
• Architecture design with a focus on timing predictability
  - Core level: micro-architecture
    - Fully timing-compositional core
    - LRU caches and write buffer
    - Cache-bypass memory opcodes
  - Cluster level: multi-banked shared memory
    - Core-private buses for memory bank access
    - Address interleaving or blocking across banks
  - Processor level: NoC/DDR guaranteed services
    - NoC minimum bandwidth and maximum latency
    - DDR controller configurations and address mapping

MPPA®-256 Bostan VLIW Core Data Path
• Shared ACC read port
• Unified MAU and FPU
• LSU and MAU also execute single-cycle ALU instructions
• 5-issue, polycyclic property
• Unified 10-read, 5-write 64x32-bit register file
• 64-bit data in register pairs

MPPA® Bostan VLIW Core Energy Efficiency
[Bar chart: pJ per instruction on matrix multiply, y-axis from 0 to 600. Bar labels, in order: Kalray Bostan, ARM Cortex A7 (Little)*, GPU**, ARM Cortex A15 (Big)*. Bar values, in order of appearance: 46, 117, 225, 497.]
*: “An Instruction Level Energy Characterization of ARM Processors”, FORTH-ICS/TR-450, March 2015
**: Measured on Matrix Multiply EPI test; Bill Dally, “To ExaScale and Beyond”, NVIDIA
Kalray Bostan at 141 pJ per bundle, with 3.14 instructions per bundle on average
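
The per-bundle figure converts to a per-instruction figure by dividing by the average bundle occupancy. The sketch below does that conversion; the `watts` helper and the one-bundle-per-cycle issue rate at 600 MHz are illustrative assumptions, not numbers from the slide.

```python
PJ_PER_BUNDLE = 141.0            # from the slide
INSTRUCTIONS_PER_BUNDLE = 3.14   # average on matrix multiply, from the slide


def pj_per_instruction(pj_per_bundle: float, instr_per_bundle: float) -> float:
    """Energy per instruction when a bundle carries instr_per_bundle instructions."""
    return pj_per_bundle / instr_per_bundle


def watts(pj_per_op: float, ops_per_second: float) -> float:
    """Average power when ops_per_second operations each cost pj_per_op picojoules."""
    return pj_per_op * 1e-12 * ops_per_second


epi = pj_per_instruction(PJ_PER_BUNDLE, INSTRUCTIONS_PER_BUNDLE)
# Hypothetical operating point: one bundle issued every cycle at 600 MHz.
core_power = watts(PJ_PER_BUNDLE, 600e6)

print(round(epi, 1), round(core_power, 4))
```

The division gives about 45 pJ per instruction for this workload.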

MPPA®-256 Bostan Network-on-Chip (NoC)
• Dual 2D-torus NoC
  - Direct, wormhole switching
  - D-NoC: high-bandwidth RDMA
  - C-NoC: low-latency mailboxes
  - 4 B/cycle per link direction per NoC
  - N x 10 Gb/s NoC extensions for connection to an FPGA or another MPPA®
• Predictability
  - The data NoC is configured by selecting routes and injection parameters
  - Routing ensures deadlock-free traffic
  - Injection parameters are the (σ,ρ), or (burst, rate), of network calculus

Wormhole Switching Illustrated
[Figure: wormhole switching example with labels 1 to 9, A and B]
• A packet is composed of several flits
• The header flits govern the route
• The payload flits follow the header in a pipelined fashion
• When the required channel is busy, the hop flow control blocks the trailing flits, and they stay in flit buffers along the established route

Wormhole Switching Network Issues
• Complex to implement
  - May be true for input queueing and output matching (e.g. iSLIP)
  - The MPPA® NoC routers only include demultiplexers, output queues, and round-robin (RR) arbiters
• Prone to deadlocking
  - In this example, the red flow cannot use R3→R2 because the blue flow is using it
  - Likewise, the blue flow needs R1→R4, which is held by the red flow
  - Deadlock requires full queues
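
The red/blue example can be modelled as a wait-for graph over links: an edge from link X to link Y means that a packet holding X is waiting for Y. A deadlock is then a cycle in that graph. The sketch below is a generic depth-first cycle search, not Kalray's tooling; the function name and the string encoding of links are illustrative.

```python
from typing import Dict, Hashable, List, Optional


def find_wait_cycle(wait_for: Dict[Hashable, List[Hashable]]) -> Optional[List[Hashable]]:
    """Return one cycle of the wait-for graph as a list of nodes, or None."""
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current DFS path, finished
    color: Dict[Hashable, int] = {}
    stack: List[Hashable] = []

    def visit(node: Hashable) -> Optional[List[Hashable]]:
        color[node] = GREY
        stack.append(node)
        for nxt in wait_for.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GREY:              # back edge: nxt is already on the path
                return stack[stack.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(wait_for):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Slide example: red holds R1->R4 and waits for R3->R2;
# blue holds R3->R2 and waits for R1->R4.
deadlocked = {"R1->R4": ["R3->R2"], "R3->R2": ["R1->R4"]}
# Same links, but blue is not waiting on anything held by red.
safe = {"R1->R4": ["R3->R2"], "R3->R2": []}

print(find_wait_cycle(deadlocked))
print(find_wait_cycle(safe))
```

A cycle in this graph is only a potential deadlock: as the slide notes, the queues along the cycle must also be full.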

Deadlock-Free Routing Wormhole Switching
• Deadlock results from circuits of agents and resources connected by a wait-for relation [Dally & Seitz 1998]
  - Wormhole switching: agents are packets; resources are link buffers
  - Links between routers and links internal to routers (‘turns’)
• Solutions when the network topology is a 2D mesh
  - Dimension order (X-Y on 2D meshes)
  - Turn model [Glass & Ni 1994] (not the same as ‘turn prohibition’)
  - Odd-Even [Chiu 2000] (better path diversity than the turn model)
  - Hamiltonian Odd-Even [Bahrebar & Stroobandt 2015] (multicasting)
• Strategy for the MPPA® NoC management
  - Isolate a 2D mesh in the topology and apply deadlock-free routing
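
Dimension-order routing is the simplest of the listed solutions: a packet first travels along X until its column matches the destination, then along Y. Because no packet ever turns from Y back to X, the wait-for relation between links cannot close into a circuit. A minimal sketch on (x, y) mesh coordinates, with illustrative function names:

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (x, y) coordinates on the 2D mesh


def xy_route(src: Node, dst: Node) -> List[Node]:
    """Nodes visited from src to dst, moving along X first, then along Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def axes_used(path: List[Node]) -> List[str]:
    """Axis ('X' or 'Y') of each hop, used to check that no Y-to-X turn occurs."""
    return ["X" if a[1] == b[1] else "Y" for a, b in zip(path, path[1:])]


route = xy_route((0, 0), (2, 1))
print(route, axes_used(route))
```

The price of this simplicity is a single path per source and destination pair, which is why the slide lists the turn model and the odd-even variants as alternatives with more path diversity.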

Hamiltonian Odd-Even Prohibited Turns
• Even rows
  - East-South turn prohibited
  - North-West turn prohibited
• Odd rows
  - North-East turn prohibited
  - West-South turn prohibited
[Figure: prohibited turns illustrated on mesh rows labeled Even and Odd]
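
The prohibited-turn table can be applied as a filter over candidate routes. The sketch below makes several assumptions that the slide does not state: a turn "East-South" is read as "travelling East, then turning South"; the row parity is that of the router where the turn happens; and row 0 is the top row, so moving North decreases the row index.

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (row, column)

# Turn = (direction before the router, direction after it), from the slide.
PROHIBITED_TURNS = {
    0: {("E", "S"), ("N", "W")},   # even rows
    1: {("N", "E"), ("W", "S")},   # odd rows
}

# Assumption: row 0 is the top row, so North means a decreasing row index.
STEP_TO_DIRECTION = {(0, 1): "E", (0, -1): "W", (-1, 0): "N", (1, 0): "S"}


def route_is_allowed(path: List[Node]) -> bool:
    """True if no turn along `path` is prohibited in the row where it occurs."""
    for prev, here, nxt in zip(path, path[1:], path[2:]):
        incoming = STEP_TO_DIRECTION[(here[0] - prev[0], here[1] - prev[1])]
        outgoing = STEP_TO_DIRECTION[(nxt[0] - here[0], nxt[1] - here[1])]
        if (incoming, outgoing) in PROHIBITED_TURNS[here[0] % 2]:
            return False
    return True


print(route_is_allowed([(0, 0), (0, 1), (1, 1)]))  # East then South in even row 0
print(route_is_allowed([(1, 0), (1, 1), (2, 1)]))  # East then South in odd row 1
```

With these conventions, the same East-South turn is rejected in an even row and accepted in an odd row.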

Hamiltonian Odd-Even on the MPPA® NoC
• Routing between compute clusters
  - Routes are generated assuming a 6x6 mesh
  - Impossible routes are discarded
• Routing between I/O and compute clusters
  - Always possible, thanks to the 4 NoC nodes per I/O cluster
[Figure: NoC topology with I/O DDR 0, I/O DDR 1, I/O Ethernet 0, and I/O Ethernet 1]

MPPA® NoC Path Selection Problem
• For each flow, select a single path among those proposed by adaptive routing so as to maximize utility
  - If multi-path is allowed, the problem reduces to a multi-commodity flow problem with a direct solution
  - Adding the single-path constraints makes the problem NP-hard
  - Practical instances are solved with mixed integer programming
• The utility function is the max-min fairness of flow rates
  - Maximize the lowest flow rate, then the next lowest flow rate, etc.
• Numerical results for all pairs of flows between 8 clusters (55 flows)

Flow  Route  Bandwidth
F48   R1     0.441901304029376
F49   R1     0
F49   R2     0.375611423014079
F49   R3     0
F49   R4     0
F50   R1     0.13862940041161
F51   R1     0.23889251894024
F51   R2     0
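
Once a single path has been fixed for every flow, the max-min fair rates can be computed with the textbook progressive-filling algorithm: raise all rates together until a link saturates, freeze the flows crossing it, and repeat. The sketch below shows only that evaluation step; it is not the mixed integer program that selects the paths, and the link names, capacities and flows are made-up illustration data.

```python
from typing import Dict, Set


def max_min_fair(paths: Dict[str, Set[str]], capacity: Dict[str, float]) -> Dict[str, float]:
    """Max-min fair rates for flows with fixed paths (each path is a set of links).

    Assumes every flow crosses at least one link listed in `capacity`.
    """
    rates = {flow: 0.0 for flow in paths}
    residual = dict(capacity)
    active = set(paths)
    while active:
        # Number of still-growing flows on each link.
        load = {l: sum(1 for f in active if l in paths[f]) for l in residual}
        # Largest equal increment before some loaded link saturates.
        inc = min(residual[l] / n for l, n in load.items() if n > 0)
        for f in active:
            rates[f] += inc
        for l, n in load.items():
            residual[l] -= inc * n
        saturated = {l for l, n in load.items() if n > 0 and residual[l] <= 1e-12}
        # Flows crossing a saturated link keep their current rate.
        active = {f for f in active if not (saturated & paths[f])}
    return rates


# Hypothetical instance: two links, three flows.
links = {"A": 1.0, "B": 2.0}
flows = {"f1": {"A"}, "f2": {"A", "B"}, "f3": {"B"}}
fair = max_min_fair(flows, links)
print(fair)
```

In this instance link A saturates first and limits f1 and f2 to 0.5 each; f3 then takes the remaining 1.5 on link B. This mirrors the "lowest rate first, then the next lowest" wording of the slide.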

Principle of the MPPA NoC Guaranteed Services
• Data NoC packet injection implements a (σ,ρ) regulation
  - No more than σ+ρ(t-s) flits are injected during any interval [s,t]
• Applying Deterministic Network Calculus prevents NoC congestion and provides bounds on end-to-end delays
  - Network Calculus requires feed-forward flows
  - Deadlock-free wormhole switching implies feed-forward flows
[Figure: cumulative flit count versus time, with labels σ and ~ρL]
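
A (σ,ρ) regulator behaves like a token bucket: the bucket holds at most σ tokens, refills at ρ tokens per time unit, and each injected flit consumes one token. The sketch below is a software model for illustration only, not the hardware injection logic; the class and function names are hypothetical, and time is counted in abstract cycles.

```python
from typing import List


class SigmaRhoRegulator:
    """Token-bucket model of (sigma, rho) injection regulation."""

    def __init__(self, sigma: float, rho: float) -> None:
        self.sigma = sigma      # burst: bucket depth in flits
        self.rho = rho          # rate: flits per time unit
        self.tokens = sigma     # the bucket starts full
        self.last = 0.0         # time of the previous update

    def try_inject(self, now: float, flits: int = 1) -> bool:
        """Refill the bucket up to `now`, then inject if enough tokens remain."""
        self.tokens = min(self.sigma, self.tokens + self.rho * (now - self.last))
        self.last = now
        if flits <= self.tokens:
            self.tokens -= flits
            return True
        return False


def conforms(times: List[float], sigma: float, rho: float) -> bool:
    """Check that at most sigma + rho*(t - s) flits fall in every interval [s, t]."""
    ts = sorted(times)
    return all(
        (j - i + 1) <= sigma + rho * (ts[j] - ts[i]) + 1e-9
        for i in range(len(ts)) for j in range(i, len(ts))
    )


regulator = SigmaRhoRegulator(sigma=4, rho=0.5)
# A greedy source tries to inject one flit on every cycle.
accepted = [t for t in range(20) if regulator.try_inject(t)]
print(accepted, conforms(accepted, 4, 0.5))
```

The greedy source first drains the burst allowance, then settles at one flit every two cycles, which is the long-term rate ρ = 0.5.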

Complete AccessCore™ SDK
• Standard C/C++ programming environment
  - C, low-level/library: programming DSP style
  - C, POSIX-level: programming CPU style
  - OpenCL: programming GPU style
• Simulators and profilers, debuggers and system trace
• Operating systems and device drivers
• AccessLib™ optimized libraries
• OpenDataPlane: open API for networking

MPPA® Distributed Memory Architecture Challenge
• Approaches to communication
  - Load/store to unified memory
  - Remote DMA to unified memory
  - Remote DMA to private memories
  - Send/receive to private memories
• Approaches to synchronization
  - Locks, semaphores, condition variables
  - Barriers, collectives (reduce/scan/all2all)
  - Remote atomic operations
  - Remote atomic queues
• Classic HPC clusters vs MPPA
  - The working memory of HPC clusters is the collection of identical node memories
  - The working memory of the MPPA is the collection of SMEMs backed by the DDRs

MPPA® Distributed Memory Architecture + Runtime
• MPPA local memory benefits
  - Local memory accesses are more energy-efficient than accesses to a Last-Level Cache (LLC)
  - Local memory access interference can be managed for time-critical applications
• MPPA DSM (distributed shared memory) for implicit DDR access
  - Escapes code size and data size restrictions
  - Mostly used for application porting
  - Some implicit global memory accesses cannot be converted to remote DMA
• MPPA Asynchronous Operations
  - Use local memories and share DDR memories through remote DMA over the NoC
  - Remote atomic operations and queues
  - Co-exist with the DSM or run stand-alone

MPPA Asynchronous Operations Ordering

MPPA® Software Stack Overview
[Figure labels: Rich OS; RTOS; Hypervisor (Exokernel); GNU C, C++, Fortran; OpenCL C Language; Libraries; OpenCL runtime; POSIX threads + OpenMP; RPC, Asynchronous Operations, DSM; Low-level programming interfaces; MPPA® platform]

SCADE Code Generation for the MPPA®
• Safety-critical control-command applications
  - Model-based programming using SCADE Suite® from Esterel Technologies
  - Complemented with static timing analysis of binary code (aiT from AbsInt)
• Motivations for multicore and manycore execution
  - Distribute the compute load across cores and reduce memory interference
  - Effective implementation of multi-rate harmonic applications
  - Envision use of fast Model Predictive Control (MPC) techniques

eMCOS for MPPA® Processors
• eSOL eMCOS is the world’s first many-core RTOS for embedded use
• Runs on the MPPA® clusters on top of the low-level programming interfaces

Automotive Software Development Kit Vision
• Number-crunching applications
  - Use OpenCL hosted on an I/O cluster quad-core running Linux
  - Partition the compute clusters using OpenCL sub-devices
  - Enable the OpenCL Data Parallel and Task Parallel models
  - Inside the Task Parallel model, support POSIX programming
    - C/C++/Pthread/FILE instead of OpenCL-C
    - MPPA Asynchronous Operations instead of OpenCL async_copy
• Control-command applications
  - SCADE Suite® code generation for compute clusters
  - Exchange with the I/O application part using RDMA through the NoC
• Automotive RTOS & middleware
  - OSEK/VDX RTOS and support of the AUTOSAR deployment environment

MPPA® Coolidge Processors
• Compute cluster as CPU
  - Direct access to DDR memory
  - Optional L1 cache coherence
  - Full 64-bit OS support
• Functional safety
  - Power domains
  - Robust partitioning
  - NoC error protection
• Security architecture
  - Trust provisioning
  - Secure boot chain
  - Authenticated debug
[Figures: courtesy Rambus; courtesy Inside Secure]

MPPA® Coolidge Main Evolutions
• TSMC 16FFC technology node
  - 50% maximum frequency increase compared to 28nm HPC
  - 40% dynamic power reduction at the same frequency compared to 28nm HPC
• 3rd generation Kalray VLIW core
  - 1 GHz typical core frequency
  - 64-bit 5-issue VLIW core architecture
  - 128-bit load/store unit (16 bytes/cycle)
  - 16-bit, 32-bit and 64-bit IEEE 754 FPU
  - 16 KB, 4-way associative L1 caches
  - 4 privilege levels like in ARMv8
• Co-processing
  - Transcendental function acceleration for deep learning and scientific computing
  - Block cipher and hashing acceleration
  - Pixel pre-processing for computer vision
• Compute cluster
  - 4 MB memory capacity, 128-bit buses
  - Local memory / L2 cache partitioning
  - Interleaved or locked memory map
  - RM core as L2 cache controller
• Unified NoC
  - Dual virtual channel, 128-bit flits
  - RDMA, load/store and control
  - 2D grid topology
• External memory
  - One DDR4 channel per 4 compute clusters
  - 1 channel on MPPA®-64
  - 4 channels on MPPA®-256
• High-speed peripherals
  - Ethernet 25 Gbps
  - PCIe Gen4
  - Interlaken 25 Gbps

MPPA® Processors on Deep Learning Application
• The Kalray KANN tool interprets Berkeley Caffe files of trained networks and generates code
  - Network architecture
  - Parameters (filter coefficients)
• Focus on low-latency neural network inference
  - Convolutional layer neurons are spread across compute clusters
  - NoC multicasting copies parameters from DDR to SMEM in compute clusters
  - All data transfers and coordination use MPPA asynchronous operations
[Bar chart: neural network inference (GoogLeNet, FP32) in frames per second (fps), y-axis from 0 to 250. Bar labels: Kalray** Bostan, Nvidia* Tegra, Kalray** Coolidge.]

Lessons from Engineering Manycore Processors
• Target applications that benefit from manycore processors
  - High performance computing (GPU, MIC)
  - High performance networking (Cavium, Mellanox/Tilera)
  - High integrity embedded computing (robotics, aerospace, defense)
• Achieving ‘near-data computing’ (using local memory) is key
  - Required for energy efficiency and timing predictability
  - Local memory is naturally exploited by Kahn process network (KPN) models of computation
  - Must also support transparent and effective access to global memory
• Programming models must be adapted
  - Explicit memory hierarchy and relaxed memory consistency
  - OpenCL is a first step, but its local memory exploitation is too restrictive

Multiple applications … Cost efficiency
Sensing, Computing, Networking
Running in PARALLEL & in REAL TIME
Safety, Security

Thank you

MPPA, ACCESSCORE and the Kalray logo are trademarks or registered trademarks of Kalray in various countries.
All trademarks, service marks, and trade names are the marks of the respective owner(s), and any unauthorized use thereof is strictly prohibited. All terms and prices are indicative and subject to modification without notice.

KALRAY INC.
Los Altos - USA
4962 El Camino Real, Los Altos, CA, USA
Tel: +1 (650) 469 3729
email: [email protected]

KALRAY S.A.
Grenoble - France
445 rue Lavoisier, 38 330 Montbonnot, France
Tel: +33 (0)4 76 18 09 18
email: [email protected]

KALRAY S.A.
Paris - France
86 rue de Paris, 91 400 Orsay, France
Tel: +33 (0) 184 00 00 45
email: [email protected]

MPPA®-256 Bostan Backup Slides

Kalray VLIW Architecture Compared to HP Labs Lx
The Lx architecture begat the STMicroelectronics ST200 VLIW family
• Optimize use of the data memory bandwidth
  - Widening to 64-bit, no alignment restrictions
  - Enable large immediate values in the instruction stream
  - All memory accesses may bypass the L1 data cache and write buffer
• Eliminate DLX ISA features and restrictions
  - Instructions with 3 or 4 source operands, 1 or 2 target operands
  - No aliasing between registers and special resources (LR, zero)
  - Memory addressing modes similar to those of PowerPC
  - Effective floating-point support with Fused Multiply-Add
• Rework if-conversion support
  - Remove Boolean registers and SELECT instructions
  - Use CMOV and conditional load/store instructions
• Support hardware looping

MPPA2®-256 Bostan Compute Cluster
• 20 bus masters
  - 16 application cores
  - 1 management core
  - NoC Tx and Rx interfaces
  - Debug support unit (DSU)
• 16-banked shared memory
  - 2 MB, extensible to 4 MB
  - No bus interference between cores
  - Round-robin (RR) arbitration between bus masters
  - Interleaved or blocked address map
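
The two address maps differ in how a byte address selects one of the 16 banks. The sketch below contrasts them for the 2 MB configuration. The interleaving granularity of 64 bytes is a hypothetical value chosen for illustration, since the slide does not give it, and the function names are illustrative.

```python
NUM_BANKS = 16                          # from the slide
SMEM_BYTES = 2 * 1024 * 1024            # 2 MB configuration, from the slide
BANK_BYTES = SMEM_BYTES // NUM_BANKS    # 128 KiB per bank
INTERLEAVE_BYTES = 64                   # assumption: granularity not given on the slide


def bank_interleaved(addr: int) -> int:
    """Interleaved map: consecutive INTERLEAVE_BYTES chunks rotate across banks."""
    if not 0 <= addr < SMEM_BYTES:
        raise ValueError("address outside the shared memory")
    return (addr // INTERLEAVE_BYTES) % NUM_BANKS


def bank_blocked(addr: int) -> int:
    """Blocked map: each bank owns one contiguous BANK_BYTES region."""
    if not 0 <= addr < SMEM_BYTES:
        raise ValueError("address outside the shared memory")
    return addr // BANK_BYTES


for addr in (0, 64, 1024, 131072):
    print(hex(addr), bank_interleaved(addr), bank_blocked(addr))
```

Interleaving spreads a sequential access stream over all banks, which favors throughput. A blocked map lets software place the data of different cores in different banks, which is one way to keep their accesses from contending at the bank arbiters.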

MPPA2®-256 Bostan Ethernet Support
• Ethernet as the main high-performance / low-latency I/O
  - Integration of Ethernet Rx and Tx into the D-NoC architecture
• Per Ethernet Rx port
  - 8 classification tables for hardware dispatch
  - Round-robin or classified cluster and core allocation
• Per Ethernet Tx port
  - 64 independent Tx FIFOs
  - Weighted round-robin between Tx FIFOs
  - Flow control between clusters and Tx FIFOs
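
Weighted round-robin between Tx FIFOs can be sketched as follows: in each round, FIFO i may send up to its weight in packets before the arbiter moves on. This is a generic packet-count model with made-up FIFO contents and weights; the slide does not describe the hardware's weight encoding or whether it counts packets or bytes.

```python
from collections import deque
from typing import Deque, List, Sequence


def weighted_round_robin(fifos: Sequence[Deque[str]],
                         weights: Sequence[int],
                         max_packets: int) -> List[str]:
    """Drain FIFOs in rounds, serving up to weights[i] packets from FIFO i per round."""
    if len(fifos) != len(weights) or any(w <= 0 for w in weights):
        raise ValueError("need one positive weight per FIFO")
    sent: List[str] = []
    while len(sent) < max_packets and any(fifos):
        for queue, weight in zip(fifos, weights):
            for _ in range(weight):
                if not queue or len(sent) >= max_packets:
                    break
                sent.append(queue.popleft())
    return sent


# Hypothetical example: two Tx FIFOs, the first with twice the weight of the second.
tx_fifos = [deque(["a1", "a2", "a3"]), deque(["b1", "b2", "b3"])]
order = weighted_round_robin(tx_fifos, weights=[2, 1], max_packets=6)
print(order)
```

While both FIFOs are backlogged, the first one gets two transmission slots for each slot of the second; once it empties, the second FIFO uses the whole link.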