RTAS 2014 Industrial Session
Guaranteed Services on the Network-on-Chip of a Manycore Processor
Duco van Amstel
Benoît Dupont de Dinechin
2013 – Kalray SA All Rights Reserved RTAS 2014 2
Manycore versus Multicore
A multicore processor has 4 to 16 cores that share the memory hierarchy
(cache, on-chip, off-chip)
Intel Core series, ARM Cortex-A series, Freescale P4080
A manycore processor has more than 32 cores, which implies that, for each
core, some memory is close and most memory is far
Intel Xeon-Phi, GP-GPU (NVIDIA, AMD), Tilera Tile-GX
Challenges of Manycore
Difficult programming models, hard to reach performance
OpenMP or shared memory + vector instructions (Intel Xeon Phi)
OpenCL or CUDA + vector instructions (GP-GPU, TI Keystone)
Ill-suited to embedded systems
Unstable or unpredictable or unverifiable response times
High power consumption, high energy per operation
About Manycore Computing
High processing performance
700 GOPS – 230 GFLOPS SP
Low power consumption
5 – 15W at 200 – 400MHz
High execution predictability
High-level programming models
PCI Gen3, Ethernet 10G, NoCX
Kalray MPPA®-256 Processor with CMOS 28nm TSMC
Shipping since January 2013
256 VLIW processing engine cores + 32 VLIW resource management cores
MPPA®-256 Processor Hierarchical Architecture
Manycore Processor: Process Level Parallelism
Compute Cluster: Thread Level Parallelism
VLIW Core: Instruction Level Parallelism
20 memory address spaces
16 compute clusters
4 I/O subsystems with direct access
to external DDR3 memory
Dual Network-on-Chip (NoC)
Data NoC & Control NoC
Full duplex links, 4B/cycle
2D torus topology + extension links
Explicit routing at source node
Unicast and multicast transfers
Oblivious synchronization
Data NoC guaranteed services
Flow regulation at source node
MPPA®-256 Clustered Memory Architecture
[Figure: chip floorplan showing the Eth0, Eth1, PCI0, and PCI1 I/O subsystems]
Explicitly routed NoC with wormhole switching
Message is a sequence of atomically received packets
Each packet has a header and a payload, composed of flits
The header flits contain the route, port, and other information
MPPA® NoC Main Features
Packet layout: header flit, header extension flit, payload flit, payload flit
Header flit fields: [31] bit ext., [30] notification, [29:22] port number, [21] multicast, [20:0] route
Further fields shown: protocol [29:28], PDI (protocol dependent information) [28:0]
Header extension flit fields: [31] bit ext., route extension
Other labels: offset, byte count[6:0], not used
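The header flit layout listed for this slide can be sketched as a pair of pack/unpack helpers. This is a minimal illustration that assumes a 32-bit flit and treats the route as an opaque 21-bit value; the names `HeaderFlit`, `pack_header`, and `unpack_header` are hypothetical, and the meaning given to the "bit ext." flag is a guess.

```python
from typing import NamedTuple


class HeaderFlit(NamedTuple):
    ext: int           # bit 31, "bit ext." on the slide (assumed: an extension flit follows)
    notification: int  # bit 30
    port: int          # bits 29:22, port number (8 bits)
    multicast: int     # bit 21
    route: int         # bits 20:0, route (21 bits, encoding not described here)


def pack_header(h: HeaderFlit) -> int:
    """Pack the fields into one 32-bit word, checking each field width."""
    for name in ("ext", "notification", "multicast"):
        if getattr(h, name) not in (0, 1):
            raise ValueError(f"{name} must be 0 or 1")
    if not 0 <= h.port < (1 << 8):
        raise ValueError("port must fit in 8 bits")
    if not 0 <= h.route < (1 << 21):
        raise ValueError("route must fit in 21 bits")
    return ((h.ext << 31) | (h.notification << 30) | (h.port << 22)
            | (h.multicast << 21) | h.route)


def unpack_header(word: int) -> HeaderFlit:
    """Extract the fields from a 32-bit header word."""
    return HeaderFlit(
        ext=(word >> 31) & 0x1,
        notification=(word >> 30) & 0x1,
        port=(word >> 22) & 0xFF,
        multicast=(word >> 21) & 0x1,
        route=word & 0x1FFFFF,
    )
```

Packing followed by unpacking returns the original fields, which makes the helpers convenient for checking a route encoder against the bit positions above.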
Wormhole Switching Illustrated
[Figure: wormhole switching example; elements labeled 1 to 9, A, and B]
One header flit and 1-N payload flits
A packet that arrives on a busy link waits
NoC injection policy that implements a (σ, ρ) regulation
For any time interval of length t, the number of packets injected is not greater than σ + ρt
Implemented with a ‘sliding window’ scheme with parameters Tw and Nmax
NoC routers that only contain FIFO queues
Routers are ‘work conserving’: no idling if data ready to transmit
One set of queues per out-going link, with round-robin arbitration
Foundations of the MPPA® NoC Guaranteed Services
[Figure: cumulative flit count versus time, limited to Nmax flits per window Tw]
$\rho = N_{max} / T_w$
$\sigma = \rho (1 - \rho) T_w$
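The relation between the window parameters and the (σ, ρ) envelope can be illustrated with a small simulation. This is a sketch under assumptions, not the hardware mechanism: it counts flits, assumes the link carries one flit per cycle, and uses the hypothetical names `sigma_rho` and `SlidingWindowRegulator`.

```python
from collections import deque
from typing import Tuple


def sigma_rho(n_max: int, t_w: int) -> Tuple[float, float]:
    """Return (sigma, rho) with rho = Nmax / Tw and sigma = rho * (1 - rho) * Tw."""
    rho = n_max / t_w
    sigma = rho * (1.0 - rho) * t_w
    return sigma, rho


class SlidingWindowRegulator:
    """Allow a flit at cycle `now` only if fewer than n_max flits were
    sent during the last t_w cycles (including `now`)."""

    def __init__(self, n_max: int, t_w: int) -> None:
        self.n_max = n_max
        self.t_w = t_w
        self.sent = deque()  # cycles at which flits were sent, oldest first

    def try_send(self, now: int) -> bool:
        # Forget flits that have left the window (now - t_w, now].
        while self.sent and self.sent[0] <= now - self.t_w:
            self.sent.popleft()
        if len(self.sent) < self.n_max:
            self.sent.append(now)
            return True
        return False


sigma, rho = sigma_rho(4, 16)  # (3.0, 0.25)
regulator = SlidingWindowRegulator(4, 16)
# A greedy source that tries to send one flit on every cycle.
sent_cycles = [t for t in range(64) if regulator.try_send(t)]
```

With Nmax = 4 and Tw = 16, the greedy source sends bursts of four flits every sixteen cycles, and the flit count over any interval of length t stays within σ + ρt = 3 + 0.25t.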
MPPA®-256 Data NoC Tx Model
MPPA® NoC Router Structure
[Figure: router structure with an arbiter and output links To N, To S, To E, To W, To L; each output link has its own set of input queues (From N, From S, From E, From W, From L); queue labels run from 0 to 20]
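The behavior of one output link, a set of FIFO queues served by a work-conserving round-robin arbiter, can be sketched as follows. The class name `RoundRobinArbiter` is hypothetical, and arbitration is assumed to happen per packet; the actual router granularity is not specified here.

```python
from collections import deque


class RoundRobinArbiter:
    """Round-robin arbitration over the FIFO queues feeding one output link."""

    def __init__(self, inputs):
        self.names = list(inputs)
        self.queues = {name: deque() for name in self.names}
        self.next_idx = 0  # index of the queue with priority for the next grant

    def push(self, source, packet):
        """Enqueue a packet arriving from the given input direction."""
        self.queues[source].append(packet)

    def grant(self):
        """Return (source, packet) for the next packet to forward,
        or None only when every queue is empty (work conserving)."""
        n = len(self.names)
        for step in range(n):
            idx = (self.next_idx + step) % n
            name = self.names[idx]
            if self.queues[name]:
                # The queue after the served one gets priority next time.
                self.next_idx = (idx + 1) % n
                return name, self.queues[name].popleft()
        return None


arbiter = RoundRobinArbiter(["N", "S", "E", "L"])
arbiter.push("N", "n1")
arbiter.push("N", "n2")
arbiter.push("E", "e1")
grants = [arbiter.grant() for _ in range(4)]
```

In this example the grants alternate between the two non-empty queues (N, then E, then N again) and the fourth call returns None because all queues are empty.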
Avoid NoC router queue overflow by configuring the source parameters (σi, ρi)
On the MPPA® NoC, σ is related to ρ: $\sigma = \rho (1 - \rho) T_w$
The problem can therefore be solved by considering only the injection rates ρi
MPPA® NoC Guaranteed Services Problem Statement
[Figure: example with Channel 1, Channel 3, and Channel 4, annotated as follows]
Capacity constraint: sum of flow rates on a link
Backlog constraint: router queue usage by all contributing flows
Applicative constraint: minimal injection rate
Traffic crossing creates sporadic bursts
Link capacity constraints
For each link traversed by a set of flows { (σi, ρi) } : $\sum_i \rho_i \le 1$
Queue backlog constraints
For each queue buffering a link with flows { (σi, ρi) } : $\sum_i \sigma_i \le Q_{size}$
Propagation of (σ, ρ) between source and hop k [Cruz 1991]
$(\sigma', \rho) = \left(\sigma + \rho \sum_{L=1}^{k} d_L,\ \rho\right)$ with $d_L = (n_L - 1) P_{size}$
$n_L$ is the number of directions merging to link L
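The two constraints and the burst propagation rule can be checked numerically with a few helper functions. This is a minimal sketch assuming flits and cycles as units and a link rate of one flit per cycle; `Flow`, `propagate`, `link_capacity_ok`, and `queue_backlog_ok` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Flow:
    sigma: float  # burstiness
    rho: float    # rate, as a fraction of the link bandwidth


def propagate(flow: Flow, merging: List[int], p_size: int) -> Flow:
    """Return (sigma', rho) at hop k, where merging[L-1] = n_L is the number
    of directions merging to link L for L = 1..k and d_L = (n_L - 1) * Psize."""
    total_delay = sum((n_l - 1) * p_size for n_l in merging)
    return Flow(flow.sigma + flow.rho * total_delay, flow.rho)


def link_capacity_ok(flows: List[Flow]) -> bool:
    """Link capacity constraint: the rates of the flows on a link sum to at most 1."""
    return sum(f.rho for f in flows) <= 1.0


def queue_backlog_ok(flows: List[Flow], q_size: float) -> bool:
    """Queue backlog constraint: the bursts of the flows buffered in a queue,
    as propagated to that queue, sum to at most Qsize."""
    return sum(f.sigma for f in flows) <= q_size


source = Flow(sigma=3.0, rho=0.25)                      # Nmax = 4, Tw = 16
at_hop_2 = propagate(source, merging=[2, 3], p_size=8)  # sigma' = 3 + 0.25 * 24
```

Here the burst grows from 3 to 9 flits after two hops, so two such flows sharing a queue need Qsize of at least 18 flits.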
MPPA® NoC Calculus Linear Constraints (1/3)
[Figure: Channel 1 with rate ρ1; elements labeled 6, 7, 5, 8]
Approximating non-linear term in queue backlog constraints
$\sigma_c = \rho_c (1 - \rho_c) T_w \le \frac{n_L - 1}{n_L} T_w$
Valid because $\rho_c \le 1$ for $n_L$ flows on L
Linearized queue backlog constraints
For all c in CC, the set of contributing channels
For all L in RR, the links traversed upstream
$\sum_{c \in CC} \sum_{L \in RR} d_{L,c}\, \rho_c \le Q_{size} - \frac{n_L - 1}{n_L} T_w$
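One possible reading of the constant subtracted from Qsize, offered as an interpretation rather than as the authors' derivation, is that it bounds the aggregate burst of the $n_L \ge 2$ flows sharing link L, whose rates sum to $S \le 1$ by the link capacity constraint:

```latex
\begin{align*}
\sum_{c=1}^{n_L} \sigma_c
  &= T_w \sum_{c=1}^{n_L} \rho_c (1 - \rho_c)
   = T_w \Bigl( S - \sum_{c=1}^{n_L} \rho_c^2 \Bigr),
   \qquad S = \sum_{c=1}^{n_L} \rho_c \le 1 \\
  &\le T_w \Bigl( S - \frac{S^2}{n_L} \Bigr)
   && \text{since } \textstyle\sum_c \rho_c^2 \ge S^2 / n_L \text{ (Cauchy-Schwarz)} \\
  &\le T_w \Bigl( 1 - \frac{1}{n_L} \Bigr) = \frac{n_L - 1}{n_L}\, T_w
   && \text{since } S - S^2 / n_L \text{ is increasing on } [0, 1] \text{ for } n_L \ge 2
\end{align*}
```

Under this reading the quadratic terms are replaced by a constant, and only the terms $d_{L,c}\, \rho_c$, which are linear in the rates, remain on the left-hand side.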
MPPA® NoC Calculus Linear Constraints (2/3)
MPPA® NoC Calculus Linear Constraints (3/3)
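$C \times \begin{pmatrix} \rho_1 \\ \vdots \\ \rho_N \end{pmatrix} \le \begin{pmatrix} \text{capacity bounds} \\ \text{backlog bounds} \end{pmatrix}$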
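The matrix form can be illustrated by assembling the capacity rows from channel routes and testing a candidate rate vector. This sketch covers only the capacity rows; the backlog rows, with coefficients built from the $d_{L,c}$ terms, would be stacked underneath in the same way. The link names and the helpers `capacity_rows` and `satisfies` are hypothetical.

```python
from typing import List, Sequence


def capacity_rows(routes: Sequence[Sequence[str]], links: Sequence[str]) -> List[List[float]]:
    """One row per link; entry [l][c] is 1 if channel c traverses link l, else 0."""
    return [[1.0 if link in route else 0.0 for route in routes] for link in links]


def satisfies(matrix: List[List[float]], rho: Sequence[float],
              bounds: Sequence[float], tol: float = 1e-9) -> bool:
    """Check matrix x rho <= bounds row by row."""
    return all(
        sum(a * r for a, r in zip(row, rho)) <= bound + tol
        for row, bound in zip(matrix, bounds)
    )


# Three channels over three hypothetical links A, B, C.
routes = [["A", "B"], ["B", "C"], ["C"]]
links = ["A", "B", "C"]
C = capacity_rows(routes, links)
capacity_bounds = [1.0] * len(links)

feasible = satisfies(C, [0.5, 0.5, 0.5], capacity_bounds)
infeasible = satisfies(C, [0.6, 0.5, 0.5], capacity_bounds)  # link B carries 1.1
```

The first rate vector loads links B and C to exactly 1 and is accepted; the second overloads link B and is rejected.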
Objective function
Non-Linear version
Based solely on the flow rates
Simple sum $F = \sum_i \rho_i$ does not give enough to small flows
Use proportional fairness instead
$F = \sum_i \log \rho_i$
Linear version
Non-regular piece-wise linearization of the logarithm
50% of pieces within [0; 0.0001], 50% within [0.0001; 1]
Experimental results required less than 100 linear pieces
Execution times
Linearization with regular piece-wise linearization: speed-up of 10+
Linearization with non-regular piece-wise linearization: speed-up of 100+
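A non-regular piecewise linearization of the logarithm can be sketched with tangent lines, which give an upper approximation because the logarithm is concave. Several choices here are assumptions: the tangent-based construction, the geometric spacing of the breakpoints, and the lower limit `lo` (needed because log 0 is undefined). The names `breakpoints` and `log_upper` are hypothetical.

```python
import math
from typing import List


def breakpoints(n_pieces: int, split: float = 1e-4, lo: float = 1e-8) -> List[float]:
    """Half of the breakpoints in [lo, split], the other half in (split, 1],
    geometrically spaced within each range."""
    half = n_pieces // 2
    if half < 2:
        raise ValueError("need at least 4 pieces")
    rest = n_pieces - half
    low = [lo * (split / lo) ** (k / (half - 1)) for k in range(half)]
    high = [split * (1.0 / split) ** ((k + 1) / rest) for k in range(rest)]
    return low + high


def log_upper(rho: float, points: List[float]) -> float:
    """Piecewise-linear upper approximation of log(rho): the minimum over
    the tangent lines of the logarithm taken at each breakpoint."""
    return min(math.log(b) + (rho - b) / b for b in points)


points = breakpoints(100)
errors = [log_upper(r, points) - math.log(r) for r in (1e-6, 1e-4, 1e-2, 0.3, 1.0)]
```

In a linear program, each term log ρi would be replaced by an auxiliary variable ti constrained to lie below every tangent line, with the sum of the ti maximized; concentrating breakpoints near zero keeps the approximation tight for small flows, where the logarithm varies fastest.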
Linear vs. Non-Linear Model : experimental results
Solving the same instance on both models
Original model with quadratic constraints via SQP
Presented linearized version via GLPK
One toy example and two industrial applications
MotionEstimation
STAP (Space-Time Adaptive Processing)
Application       Solver  Variance   Sum
Toy               SQP     0.027316   1.6128
Toy               GLPK    0.040702   1.4041
MotionEstimation  SQP     0.19835    8.4326
MotionEstimation  GLPK    0.215123   8.091047
STAP              SQP     0.016439   6.2612
STAP              GLPK    0.009790   6.150907
Computation of Worst Case Traversal Time
Assume that the flow rates ρ have been computed
ρ : rate of virtual channel
Tw : sliding window size
Psize : packet size
n : number of hops
d : router fixed delay
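Then the worst case traversal time t_max is given by [Zhang 95] :
[Equation not recoverable from the extracted text: t_max is expressed in terms of ρ, Tw, Psize, n, and d, with terms annotated as flow regulation, packet granularity, and burst incidents; the annotation $\sigma / \rho = (1 - \rho) T_w$ accompanies it]
More precise than summing the individual worst cases
Property known as "pay bursts only once"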
Architecture & Implementation
Fully timing-compositional VLIW cores
Includes LRU caches and hardware looping
Multi-banked local memory system without bus interferences
Dual address mapping: interleaved or blocked
Network on Chip with guaranteed services
Flow regulation at the source similar to AFDX
Low-latency local connections
Deterministic Ethernet
Programming models for time-critical applications
POSIX-Like programming model
Processes on clusters and threads on cores
Synchronous and asynchronous POSIX I/O with call-back
POSIX timers and signals
MPPA®-256 For Time-Critical Computing