Addressing MPSoC HW/SW platform challenges
A snapshot of ongoing research activity in the MPSoC group

Luca Benini ([email protected])
DEIS, Università di Bologna
Research thrusts
- Power modeling and optimization
- Networks on chip
- MPSoC architectures: modeling platform, software analysis and optimization
People involved
- Senior staff members
  - Luca Benini (PI), Davide Bertozzi (PL)
- Graduate students (core team)
  - Federico Angiolini, Francesco Poletti, Martino Ruggiero, Mirko Loghi
- In cooperation with
  - Stanford (2), IMEC (1), UNICA (1), TUD (1), PSU (1), UM, UNIVR (1)
Power analysis and optimization
Storage and communication energy
Power management (VFS)
Power analysis platform
- MPARM: complete MPSoC functional simulation (cycle accurate)
- Technology-homogeneous power models
  - Parameterized models provided by STM
- HW- and SW-centric power profiling
  - Per-component breakdown
  - Per-function breakdown
- Supports energy-aware architectural exploration
  - Cores and memory
  - Communication
[Chart: energy (pJ) vs. data cache size and I$ size, 512 B to 16384 B]
Sensitivity analysis and design space exploration
- Example: strong sensitivity to the I$ size of the ARM core
- Points to the criticality of cache power minimization
Interconnect topology optimization
- A single shared bus is clearly non-scalable
- Evolutionary path
  - "Patch" the bus topology
- Two approaches
  - Clustering & bridging
  - Multi-layer / multi-bus
STBus Crossbar & Partial CB
[Diagram: full crossbar (FC) and partial crossbar (PC) topologies connecting masters to the STBus]
Crossbar & Partial CB cost

Type  Avg. lat (cycles)  Max. lat (cycles)  Size ratio
BUS   35.1               51                 1
PC    9.9                20                 4
FC    6                  9                  10.5

Key issue: the crossbar is not scalable! The partial crossbar is a compromise solution.
Traffic-constrained PC optimization
- Basic ideas
  - Merge channels with non-overlapping traffic
  - Use time windows to tighten the worst case
[Diagram: slave channels SL1 and SL2 merged under traffic overlaps OVR = 0, 0.1, 0.3]
- Can be cast as a small-scale ILP
  - An exact solution is feasible
In cooperation with Stanford University
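The channel-merging step above can be sketched with a greedy grouping heuristic; the exact small-scale ILP mentioned on the slide would instead guarantee optimality. The function names, trace encoding, and overlap threshold below are illustrative assumptions, not the actual formulation:

```python
def overlap(a, b):
    """Fraction of time windows in which both channels carry traffic."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    return both / len(a)

def merge_channels(channels, max_ovr=0.05):
    """Greedily group channels whose pairwise traffic overlap stays
    below max_ovr; each group can share one partial-crossbar port."""
    groups = []
    for name, trace in channels.items():
        for g in groups:
            if all(overlap(trace, channels[m]) <= max_ovr for m in g):
                g.append(name)
                break
        else:
            groups.append([name])
    return groups

# Toy per-window traffic traces: 1 = channel busy in that time window.
traces = {
    "m0": [1, 1, 0, 0, 0, 0],
    "m1": [0, 0, 1, 1, 0, 0],   # no overlap with m0 -> can share a port
    "m2": [1, 0, 1, 0, 1, 0],   # overlaps both -> needs its own port
}
print(merge_channels(traces))   # [['m0', 'm1'], ['m2']]
```

Tightening the time windows shrinks the measured overlaps, which lets more channels merge at the cost of a looser worst-case guarantee.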
Results for benchmark applications
Only 5% maximum overlap allowed.

Sensitivity to constraints
- The allowed conflicts can be finely traded off against HW complexity
- Energy implications are under exploration
Bus clustering and frequency assignment
- Applied to an AMBA bus
- Key ideas
  - Cluster tightly coupled masters and slaves
  - Assign frequencies based on bus utilization
[Diagram: masters M0–M7 and processors P0–P3 on a single bus vs. partitioned into clusters]
In cooperation with Penn State University
Experimental results
- Using a high-level simulation model (to be validated in MPARM)
- Optimal configuration found using a GA
[Chart: clustered configurations vs. a single bus at 250 MHz and 500 MHz]
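A minimal GA for this kind of cluster-and-frequency search might look as follows; the cost model, utilisation figures, and frequency menu are invented for illustration and are not the actual formulation used in the experiments:

```python
import random

random.seed(0)

MASTERS = 8
CLUSTERS = 2
FREQS = [100, 250, 500]          # MHz, hypothetical menu
UTIL = [0.6, 0.5, 0.1, 0.1, 0.2, 0.1, 0.1, 0.1]  # assumed per-master utilisation

def cost(genome):
    """Toy cost: energy grows with frequency; a cluster pays a latency
    penalty when its load exceeds what its frequency can sustain."""
    clusters, freqs = genome[:MASTERS], genome[MASTERS:]
    total = 0.0
    for c in range(CLUSTERS):
        load = sum(u for m, u in enumerate(UTIL) if clusters[m] == c)
        cap = freqs[c] / 500                  # capacity: util 1.0 at 500 MHz
        total += cap                          # energy term
        if load > cap:
            total += 10 * (load - cap)        # latency penalty
    return total

def random_genome():
    return ([random.randrange(CLUSTERS) for _ in range(MASTERS)]
            + [random.choice(FREQS) for _ in range(CLUSTERS)])

def ga(generations=200, pop_size=20):
    pop = [random_genome() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        parents = pop[: pop_size // 2]            # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(a))
            child = a[:cut] + b[cut:]             # one-point crossover
            i = random.randrange(len(child))      # mutate one gene
            if i < MASTERS:
                child[i] = random.randrange(CLUSTERS)
            else:
                child[i] = random.choice(FREQS)
            children.append(child)
        pop = parents + children                  # elitist replacement
    return min(pop, key=cost)

best = ga()
print(best, cost(best))
```

Elitism keeps the best configuration alive across generations; the real fitness function would come from the high-level simulation model rather than a closed-form cost.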
Multiprocessor DPM
- Multi-task application (one task per processor)
  - Unbalanced computations
  - Synchronization and inter-task communication
  - A throughput constraint is given
- Assign nominal core frequencies
  - Static analysis (Pareto curve for a single task)
  - Compose the Pareto curves
- Dynamic tuning for workload fluctuations
  - Use data queues for feedback control
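The static assignment step can be sketched as picking, per task, the cheapest point on its Pareto curve that still meets the throughput constraint. The Pareto points, cycle counts, and task names below are made-up illustrative numbers:

```python
# Per-task Pareto points: (frequency_MHz, energy_mJ_per_item); hypothetical.
pareto = {
    "producer": [(100, 2.0), (200, 3.5), (400, 6.0)],
    "worker":   [(100, 5.0), (200, 8.0), (400, 14.0)],
    "consumer": [(100, 1.0), (200, 1.8), (400, 3.2)],
}
cycles = {"producer": 1.5e5, "worker": 3.0e5, "consumer": 1.0e5}  # per item

def assign(throughput_items_per_s):
    """Pick, per task, the lowest-energy Pareto point whose execution
    time per item still meets the throughput constraint."""
    deadline = 1.0 / throughput_items_per_s
    out = {}
    for task, points in pareto.items():
        feasible = [(f, e) for f, e in points
                    if cycles[task] / (f * 1e6) <= deadline]
        if not feasible:
            raise ValueError(f"{task} cannot meet the throughput constraint")
        out[task] = min(feasible, key=lambda p: p[1])  # lowest energy
    return out

print(assign(1000))   # 1 ms per item
```

With these numbers the heavily loaded worker is pinned at its top frequency while the lighter producer and consumer slow down, which is exactly the unbalanced-computation case the slide describes; the dynamic queue-based tuning would then adjust around this nominal point.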
Example (energy vs. time)
1. The producer/consumer frequency determines the energy reduction beyond a threshold
2. The working processors' frequency determines the execution time before a threshold
[Chart: power P vs. 1/throughput; WK runs at f_max while the PR/CN frequency is lowered, with f_WK / f_PR/CN = const]
Sources of non-ideality
[Diagram: pipeline Prod → Work ×4 → Cons; charts of power P vs. 1/throughput as the frequency is lowered]
- M-bound vs. CPU-bound tasks
- Cost of communication
- Cost of synchronization
Analysis of communication architectures for MPSoCs
Advances in protocols and topologies
Network on chip design
Xpipes Architecture and Toolset
Parallel programming on MPSoCs
Hardware support
Software analysis and optimization
Support for message passing
Matching the architecture to programming paradigms
- Hardware extensions for messages
  - Distributed memory system
  - Queues, distributed semaphores, interrupts
- Software support for messages
  - System software (task suspension)
  - Application software (library of APIs for MP)
In cooperation with IMEC
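In software terms, the queue-based message passing above behaves like a bounded producer-consumer channel. This Python threading sketch only illustrates the semantics; it is not the MPARM hardware queue or the actual MP API:

```python
import threading
import queue

# A software view of a hardware message queue: the producer writes
# items, the consumer blocks (is "suspended") until data is available.
channel = queue.Queue(maxsize=4)   # bounded, like an on-chip HW queue
results = []

def producer(n):
    for i in range(n):
        channel.put(i * i)         # plays the role of msg_send()
    channel.put(None)              # end-of-stream marker

def consumer():
    while True:
        item = channel.get()       # plays the role of msg_receive()
        if item is None:
            break
        results.append(item)

t1 = threading.Thread(target=producer, args=(5,))
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(results)                     # [0, 1, 4, 9, 16]
```

The bounded queue gives the same back-pressure a hardware queue provides: a full queue stalls the producer, an empty one suspends the consumer.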
Basic architecture
[Diagram: processor tiles #1…#N (ARM core, MMU, I/D cache) connected through an interconnect to shared memory, hardware semaphores, and an interrupt controller; a producer tile communicates with a consumer tile]
Support for UMA
[Diagram: processor tile #1 with ARM core, cache, and a snoop device on the bus*; the snoop device observes address and data and performs invalidate/update]
*cannot be a generic interconnect!
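A toy model of the snoop device's behaviour, assuming a write-invalidate, write-through policy for simplicity (class and variable names are illustrative, not MPARM identifiers):

```python
class SnoopyCache:
    """Minimal write-invalidate snooping: a write by one cache
    invalidates every other cache's copy of that line."""
    def __init__(self, bus):
        self.lines = {}            # addr -> cached value
        self.bus = bus
        bus.append(self)           # register on the shared bus

    def read(self, mem, addr):
        if addr not in self.lines:         # miss: fetch from memory
            self.lines[addr] = mem[addr]
        return self.lines[addr]

    def write(self, mem, addr, value):
        self.lines[addr] = value
        mem[addr] = value                  # write-through for simplicity
        for other in self.bus:             # snoop: invalidate other copies
            if other is not self:
                other.lines.pop(addr, None)

bus, mem = [], {0x40: 7}
c0, c1 = SnoopyCache(bus), SnoopyCache(bus)
print(c0.read(mem, 0x40), c1.read(mem, 0x40))   # 7 7 (both cache the line)
c0.write(mem, 0x40, 9)                          # c1's copy is invalidated
print(c1.read(mem, 0x40))                       # 9 (miss, re-fetched)
```

This is also why the interconnect cannot be generic: every cache must be able to observe every write on the bus, which a point-to-point or switched fabric does not provide for free.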
Support for message passing
[Diagram: processor tiles #1 and #2 (ARM core, MMU, I/D cache) extended with a scratch-pad memory (SPM) and local semaphores, connected through the interconnect to shared memory; producer and consumer exchange messages through the scratch-pads]
Hardware support for MP
Key advantages
- More efficient access to data through the scratch-pad memory, both energy- and delay-wise
- Avoids the centralized-slave bottleneck
  - A more scalable approach
- Allows active polling on local semaphores without generating bus traffic for synchronization
- If semaphore unlocking triggers an interrupt, suspended slaves can be reactivated
No centralized bottlenecks
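The interrupt-driven reactivation can be mimicked in software with an event that the "semaphore unlock" signals; this is a behavioural sketch of the mechanism, not the actual hardware:

```python
import threading

# Interrupt-style wake-up: a suspended consumer waits on an event;
# the producer's "semaphore unlock" sets it, reactivating the task
# without any polling traffic on the shared bus.
data_ready = threading.Event()    # models semaphore + interrupt line
mailbox = {}

def producer():
    mailbox["payload"] = 42       # write the message into shared memory
    data_ready.set()              # unlock semaphore -> raise interrupt

def consumer(out):
    data_ready.wait()             # task suspended: zero bus traffic
    out.append(mailbox["payload"])

out = []
t = threading.Thread(target=consumer, args=(out,))
t.start()
producer()
t.join()
print(out)    # [42]
```

Contrast this with active polling: with local semaphores in the scratch-pad, polling is also cheap because the spin loop never crosses the shared bus.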
Matrix Pipeline with basic architecture vs. Matrix Pipeline with message passing support
[Charts: relative execution time on 8 cores for Shared, Bridging, and Multi-layer interconnects — ~170% overhead with the basic architecture vs. ~20% with message passing support]
Send+Receive cost: 35 Kcycles (basic architecture) vs. 4 Kcycles (MP support). Configuration: 4 processors, shared bus.
Supporting interrupts
- 1 task/core: overhead for interrupt handling and idle-task scheduling
- 2 tasks/core: interrupts allow better exploitation of computation resources
[Charts: relative execution time for 4 cores with 2 tasks/core and 8 cores with 1 task/core, comparing Shared, Shared+interrupt, HW support, and HW support+interrupt]
Application partitioning
- Mapping applications onto MPSoC platforms
  - The HW/SW designer perspective: bridging the gap
- MPSIM simulator for analyzing partitioning solutions
- Degree of parallelism: the energy-performance trade-off (MPEG-2 case study)
[Charts: MPEG-2 decoding — execution time (millions of clock cycles) and total energy for 1 s of decoding (800–960 mJ) for 1, 2, and 4 processors]
Programming paradigm
Which workload allocation policy? Message passing or shared memory?
MULTIMEDIA APPLICATIONS
Each programming model has an architecture optimized for it!
Exploration space
- Workload allocation policy: master-slave vs. pipelining
- Application features: data size, computation-to-communication ratio, common vs. disjoint processing data
- Three algorithms were used for space exploration: Parallel Matrix Multiply, DES Encryption, Pipelined Matrix Multiply
Master-slave partitioning
- Common processing data
- Disjoint processing data

Master-slave, common processing data
- Computation is O(N³), communication is O(N²)
- For large data sets, the computation efficiency of MP (thanks to data localization) starts making the difference
- Cache size strongly impacts SHM performance
- Energy plots are similar
[Chart: execution time ratio MP/SHM (0.7–1.4) for Parallel Matrix Multiply vs. matrix size (8, 16, 32) at cache sizes of 1, 2, and 4 KB]
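The scaling claim behind this comparison can be checked with two lines of arithmetic:

```python
# Matrix multiply on N x N data: computation scales as N**3, data
# exchanged as N**2, so the computation-to-communication ratio grows
# linearly with N -- which is why message passing pulls ahead of
# shared memory on the larger matrices.
for n in (8, 16, 32):
    print(n, n ** 3 / n ** 2)    # ratio equals n
```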
Master-slave, common processing data
[Chart: execution time ratio MP/SHM (0.7–4.2) for Parallel Sum of Matrices vs. matrix size (8, 16, 32) at cache sizes of 1, 2, and 4 KB]
- Computation is O(N²), communication is O(N²)
- Communication plays a role here
- Broadcast-like communication degrades MP performance
- Energy plots are similar
Master-slave, disjoint input data
[Chart: bit rate (Mbit/s, 0–350) vs. computed data size (bytes, 0–4500) for message passing, non-cacheable SHM, and cache-coherent SHM]
- The bit rate increases because of fewer synchronization events
- MP does not broadcast the same data
  - The distance between the two curves is constant
  - It is determined by the better computation efficiency of MP
Master-slave, disjoint input data
[Chart: energy (mJ, 0–800) vs. computed data size (bytes, 0–5000) for message passing, non-cacheable SHM, and cache-coherent SHM]
- MP consumes less energy because of its lower execution time
- Processing is lightweight: energy comes mainly from the core and the I-cache
Pipelined Matrix Multiply
[Chart: bit rate (Mbit/s, 0–400) vs. matrix line (8–32) for message passing, non-cacheable SHM, and cache-coherent SHM]
- The bit rate decreases because of the increase in computation
- Small data sets: MP and SHM have the same performance
- Large data sets: convergence of cache-coherent and non-coherent SHM (cache misses)
Pipelined sum of matrices
[Chart: bit rate (Mbit/s, 0–3000) vs. matrix line (8–32) for message passing, non-cacheable SHM, and cache-coherent SHM]
- Computation is O(N²); communication efficiency comes into play
- For large data sets, SHM outperforms MP
- Large matrices: convergence of cache-coherent and non-coherent SHM
- Energy plots are similar
Summing up…
- A high computation/communication ratio favors MP; a low one favors SHM
- Sharing of processing data favors SHM over MP
Some trends are emerging… but trade-offs might be imposed by conflicting application features.
System modeling infrastructure
Speeding up validation
Enhancing modeling capabilities
Directions of improvement
- Issues:
  - Stressing performance bottlenecks with easily configurable amounts of traffic
  - Triggering bugs in rarely used code
  - Simulation speed!
- Traffic generators
- FPGA mapping
Traffic Generator architecture
- Plug-and-play OCP replacement for OCP cores
- Simple ISS
- Loads "programs" generated from OCP traces
[Diagram: an MPARM benchmark run produces a trace (.trc); a translator and assembler turn it into a TG object file (.bin) and TG program file (.noc); (a) IP cores with SW attached to the NoC vs. (b) TGs replacing them behind the same OCP interface]
In cooperation with DTU – Technical University of Denmark
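A possible shape for the trace-to-program translation step; the trace line format and the tuple layout of the TG "program" are assumptions for illustration, not the actual .trc/.noc formats:

```python
# Hypothetical OCP trace lines: "<cycle> <RD|WR> <address>".
TRACE = """\
10 RD 0x1000
12 WR 0x1004
40 RD 0x2000
"""

def translate(trace):
    """Turn an absolute-cycle trace into a TG 'program': a list of
    (idle_cycles, op, address) tuples the generator replays in order."""
    program, last = [], 0
    for line in trace.splitlines():
        cycle, op, addr = line.split()
        cycle = int(cycle)
        program.append((cycle - last, op, int(addr, 16)))
        last = cycle
    return program

print(translate(TRACE))
# [(10, 'RD', 4096), (2, 'WR', 4100), (28, 'RD', 8192)]
```

Replaying relative idle times rather than absolute cycles is what lets a simple ISS reproduce the core's traffic pattern without simulating the core itself.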
TG accuracy and speedup
[Charts: relative error (<2.5%) on execution time, reads, and writes, and speedup (1×–2.4×) for task, pipe, and IO benchmarks on 2, 4, and 6 processors]
- Complex scenario with interrupts and synchronization: >98% accuracy, ~2× speedup
- Simpler cases: ~100% accuracy, up to 4× speedup
FPGA mapping
- NoC analysis environment available
  - Virtex-II Pro platform
  - Stochastic and trace-based analysis
- MPARM porting to FPGA underway
In cooperation with Stanford University, Universidad Complutense de Madrid
Model enhancements
- Interconnects
  - AMBA AHB Multilayer
  - AMBA AXI
- Cores
  - LX
  - MIPS, PPC, XScale (with some limitations)
- Peripherals
  - DSP coprocessor with DMA (FFT)
  - Smart memories (DMA capable)
Future work
- Power optimization
  - Better integration and validation of communication and storage power optimization
  - Deeper MP DPM exploration
- NoC
  - Compare area, power, and timing w.r.t. STBus on layout
  - Additional features: link protocol, QoS support, multiple outstanding transactions
- MPSoC HW and SW architectures
  - Communication library
  - Improve software parallelization analysis (parallelization advisor)
- Modeling platform
  - Improve I/O and external memory interface models (memory controller)
  - Heterogeneous platform support