luca benini lbenini@deis.unibo.it deis università di bologna
Post on 26-Jan-2016
Optimization techniques for the development of embedded applications on single-chip multiprocessor platforms
Luca Benini, lbenini@deis.unibo.it
DEIS, Università di Bologna
Embedded Systems
[Figure: microprocessor market shares. General-purpose systems: 1%; embedded systems: 99%]
Example Area: Automotive Electronics
What is “automotive electronics”?
• Vehicle functions implemented with electronics
  • Body electronics
  • System electronics: chassis, engine
  • Information/entertainment
Automotive Electronics Market Size
Year:                 1998  1999  2000  2001  2002  2003  2004  2005
Market ($ billions):   8.9  10.5  13.1  14.1  15.8  17.4  19.3  21.0
[Figure: the same chart also plots the cost of electronics per car ($), on an axis from 0 to 1400]
• 90% of future innovations in vehicles: based on electronic embedded systems
• 2006: 25% of the total cost of a car will be electronics
Automotive Electronics Platform Example
Source: Expanding automotive electronic systems, IEEE Computer, Jan. 2002
Digital Convergence – Mobile Example
[Figure: converging functions: Broadcasting, Telematics, Imaging, Computing, Communication, Entertainment]
• One device, multiple functions
• Center of ubiquitous media network
• Smart mobile device: next drive for the semiconductor industry
4th Gen and Next-Gen Networks
Includes: 802.20, WiMAX (802.16), HSDPA, TDD UMTS, UMTS and future versions of UMTS
SoC: Enabler for Digital Convergence
[Figure: from today's SoC to the future SoC (storage; 4G/5G, DMB, WiBro, etc.): > 100X performance, low power, complexity]
Application pull
[Figure: applications vs. year of introduction (2005 to 2015), plotted against power-efficiency levels of 5 GOPS/W, 100 GOPS/W and 1 TOPS/W. Applications shown: sign recognition, A/V streaming, adaptive route, collision avoidance, autonomous driving, 3D projected display, HMI by motion, gesture detection, ubiquitous navigation, Si Xray, Gbit radio, UWB, 802.11n, structured encoding, structured decoding, 3D TV, 3D gaming, H264 encoding, H264 decoding, image recognition, fully recognition (security), auto personalization, dictation, 3D ambient interaction, language, emotion recognition, gesture recognition, expression recognition, mobile base-band. Source: IMEC]
MPSoC Platform Evolution
[Figure: tile-based MPSoC platform (45 nm, <4mm, <1GHz, 30Mtr) showing I/O peripherals, 3D stacked main memory, local memory hierarchy, network interface, router, power/test management, and a bus-based multiprocessor. Software layers: applications; software optimization; middleware, RTOS, API, run-time controller; mapping of V, Vt, Fclk, IL]
• Tile-based design
• Today’s SoCs could fit in 1 tile!!
Multicores Are Here!
[Figure: number of cores per chip (axis ticks from 1 to 512 in powers of two) vs. year (1970 to 20??). Processors shown: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, Itanium 2, Athlon, Raw, Power4, Opteron, Power6, Niagara, Yonah, PExtreme, Tanglewood, Cell, Intel Tflops, Xbox360, Cavium Octeon, Raza XLR, PA-8800, Cisco CSR-1, Picochip PC102, Broadcom 1480, Opteron 4P, Xeon MP, Ambric AM2045]
[Amarasinghe06]
MPSoC – 2005 ITRS roadmap
[Figure: total logic size and total memory size (normalized to 2005, left axis) and number of processing engines (right axis), 2005 to 2020]
Number of processing engines per year:
2005: 16    2006: 23    2007: 32    2008: 46
2009: 63    2010: 79    2011: 101   2012: 133
2013: 161   2014: 212   2015: 268   2016: 348
2017: 424   2018: 526   2019: 669   2020: 878
[Martin06]
SoC: Solution-on-a-Chip
[Figure: target system application, with hardware levels paired with software layers: System / Service with Application S/W; Mobile Terminal with Middleware; Module with RTOS; Chip with HAL; Process with S/W IP. SoC = chip + system e-SW]
Requires design of hardware AND software
Design as optimization
• Design space: the set of “all” possible design choices
• Constraints: solutions that we are not willing to accept
• Cost function: a property we are interested in (execution time, power, reliability…)
Hardware synthesis
[Figure: hardware synthesis flow with the stages APPLICATION, ALGORITHM, HIGH-LEVEL SYNTHESIS, ARCHITECTURE, and LOGIC AND PHYSICAL SYNTHESIS. The algorithm level shows filter signal-flow graphs (adders, delay elements D, coefficients c1 to c8 and k, IN/OUT) and a frequency response plot (amplitude in dB vs. frequency); the architecture level shows an ASIC, a GP signal processor, memory, interconnect and an MCM]
Behavioral synthesis
[Figure: a Control/Data Flow Graph (CDFG) and its implementation, built from registers, a multiplier and an adder]
Allocation, Assignment, and Scheduling
• Allocation: How much? (e.g. 2 adders, 1 shifter, 24 registers)
• Assignment: Where? (e.g. Shifter 1)
• Schedule: When? (e.g. Time Slot 4)
[Figure: data-flow graph with +, -, >> and delay (D) operations]
Techniques Well Understood and Mature
Resource constraints
[Figure: sequencing graph with three additions (+1, +2, +3) and three multiplications (*1, *2, *3), scheduled in 4 control steps in two different ways]
Control step   Schedule 1   Schedule 2
1              +1           +3
2              +2           +1 *2
3              +3 *1        +2 *3
4              *2 *3        *1
Scheduling under resource constraints
• Intractable problem
• Algorithms:
  • Exact: integer linear program; Hu (restrictive assumptions)
  • Approximate: list scheduling; force-directed scheduling
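As an illustration of the approximate approach, here is a minimal sketch of list scheduling for unit-delay operations. The `Op` record, the `list_schedule` name and the priority rule (lowest operation index first) are assumptions made for this example; the slides do not specify a priority function.

```c
#include <assert.h>

#define MAX_OPS   32
#define MAX_TYPES 4

/* Hypothetical operation record: resource type and predecessor list. */
typedef struct {
    int type;            /* index of the resource type that executes the op */
    int n_pred;          /* number of predecessors */
    int pred[MAX_OPS];   /* indices of the predecessor operations */
} Op;

/* List scheduling with unit-delay operations on an acyclic graph.
 * avail[k] is the number of units of resource type k (at least 1 each).
 * On return, start[i] holds the 1-based control step of operation i.
 * Returns the schedule latency in control steps. */
int list_schedule(const Op *ops, int n, const int *avail, int n_types, int *start)
{
    assert(n >= 0 && n <= MAX_OPS && n_types >= 1 && n_types <= MAX_TYPES);
    for (int k = 0; k < n_types; k++) assert(avail[k] >= 1);
    for (int i = 0; i < n; i++) start[i] = 0;      /* 0 = not yet scheduled */

    int scheduled = 0, step = 0;
    while (scheduled < n) {
        step++;
        int used[MAX_TYPES] = {0};                 /* units busy in this step */
        for (int i = 0; i < n; i++) {
            if (start[i] != 0) continue;
            /* ready = every predecessor was scheduled in an earlier step */
            int ready = 1;
            for (int p = 0; p < ops[i].n_pred; p++) {
                int s = start[ops[i].pred[p]];
                if (s == 0 || s >= step) { ready = 0; break; }
            }
            if (ready && used[ops[i].type] < avail[ops[i].type]) {
                used[ops[i].type]++;
                start[i] = step;
                scheduled++;
            }
        }
    }
    return step;
}
```

With one adder and two multipliers, three multiplications feeding two chained additions are scheduled in three control steps. Unlike the ILP, the result is not guaranteed to be optimal in general.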
ILP formulation
• Binary decision variables: $X = \{x_{il};\ i = 1, 2, \dots, n;\ l = 1, 2, \dots, \lambda + 1\}$
• $x_{il}$ is TRUE only when operation $v_i$ starts in step $l$ of the schedule (i.e. $l = t_i$)
• $\lambda$ is an upper bound on latency
• Start time of operation $v_i$: $t_i = \sum_l l \cdot x_{il}$

ILP formulation constraints
• Operations start only once:
$$\sum_l x_{il} = 1, \quad i = 1, 2, \dots, n$$
• Sequencing relations must be satisfied: $t_i \ge t_j + d_j$ for all $(v_j, v_i) \in E$, that is
$$\sum_l l \cdot x_{il} - \sum_l l \cdot x_{jl} - d_j \ge 0, \quad \forall (v_j, v_i) \in E$$
• Resource bounds must be satisfied. Simple case (unit delay):
$$\sum_{i : T(v_i) = k} x_{il} \le a_k, \quad k = 1, 2, \dots, n_{res};\ \forall l$$
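ILP Formulation
$$\min \sum_l l \cdot x_{nl}$$
such that
$$\sum_l x_{il} = 1, \quad i = 1, 2, \dots, n$$
$$\sum_l l \cdot x_{il} - \sum_l l \cdot x_{jl} - d_j \ge 0, \quad i, j = 1, 2, \dots, n,\ (v_j, v_i) \in E$$
$$\sum_{i : T(v_i) = k}\ \sum_{m = l - d_i + 1}^{l} x_{im} \le a_k, \quad k = 1, 2, \dots, n_{res};\ l = 1, 2, \dots, \lambda + 1$$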
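Example
• Resource constraints: 2 ALUs; 2 multipliers: $a_1 = 2$; $a_2 = 2$
• Single-cycle operation: $d_i = 1\ \forall i$
[Figure: sequencing graph with operations numbered 1 to 11 (multiplications, additions, subtractions and a comparison) between a source NOP (0) and a sink NOP (n)]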
Example
• Operations start only once:
$$x_{11} = 1$$
$$x_{61} + x_{62} = 1$$
…
• Sequencing relations must be satisfied:
$$x_{61} + 2x_{62} - 2x_{72} - 3x_{73} + 1 \le 0$$
$$2x_{92} + 3x_{93} + 4x_{94} - 5x_{N5} + 1 \le 0$$
…
• Resource bounds must be satisfied:
$$x_{11} + x_{21} + x_{61} + x_{81} \le 2$$
$$x_{32} + x_{62} + x_{72} + x_{82} \le 2$$
…
Example
[Figure: the same sequencing graph scheduled over four time steps, TIME 1 to TIME 4]
Resource-Efficient Application Mapping for MPSoCs
Given a platform:
1. Achieve a specified throughput
2. Minimize usage of shared resources
MULTIMEDIA APPLICATIONS
Optimization Development
• The abstraction gap between high-level optimization tools and standard application programming models can introduce unpredictable and undesired behaviours.
• Programmers must be aware of the simplifying assumptions made by optimization tools.
• New methodology for multi-task application development on MPSoCs.
[Figure: application design flow with the steps platform modelling, optimization analysis, optimal solution, starting implementation, platform execution and final implementation; the abstraction gap is marked in the flow]
Resource assignment and scheduling
THE SYSTEM
[Figure: nodes 1 to N on a SHARED SYSTEM BUS with an on-chip memory. Each node contains a processor, a tightly-coupled memory and a bus interface, and runs tasks A, B, …, N with worst-case execution times Ta, Tb, …, Tn. Annotations: limited-size memory; max bus bandwidth; max time wheel period T; a memory assumed to be infinite]
The application
[Figure: signal processing pipeline of tasks T0, T1, T2, T3, …, T7, subject to a throughput constraint]
• Queues for inter-processor communication in TCM for efficiency reasons
• Program data in TCM (if space) or on-chip memory
• Internal state in TCM (if space) or on-chip memory
Each task is characterized by:
• WCET
• Memory requirements
Communication-aware Allocation and Scheduling for Stream-Oriented MPSoCs
[Figure: a signal processing pipeline (T0, T1, T2, …, T7) and a message-oriented MPSoC architecture in which ARM7 cores, each with a local scratchpad memory and a private memory, share a bus; a question mark links the two]
• Simplifying assumptions vs. predictability
• Efficient solutions in reasonable time
• Pure ILP formulations suitable for small task sets
• Widespread use of heuristics
Master Problem model
Assignment of tasks and memory slots (master problem)
• $T_{ij} = 1$ if task $i$ executes on processor $j$, 0 otherwise
• $Y_{ij} = 1$ if task $i$ allocates program data on processor $j$ memory, 0 otherwise
• $Z_{ij} = 1$ if task $i$ allocates the internal state on processor $j$ memory, 0 otherwise
• $X_{ij} = 1$ if task $i$ executes on processor $j$ and task $i+1$ does not, 0 otherwise
Each process should be allocated to one processor:
$$\sum_j T_{ij} = 1 \quad \text{for all } i$$
Link between variables X and T:
$$X_{ij} = |T_{ij} - T_{i+1,j}| \quad \text{for all } i \text{ and } j \text{ (can be linearized)}$$
If a task is NOT allocated to a processor, neither are its required memories:
$$T_{ij} = 0 \ \Rightarrow\ Y_{ij} = 0 \text{ and } Z_{ij} = 0$$
Objective function:
$$\min \sum_i \sum_j \left[ mem_i (T_{ij} - Y_{ij}) + state_i (T_{ij} - Z_{ij}) + data_i X_{ij} / 2 \right]$$
Improvement of the model
With the proposed model, the allocation problem solver tends to pack all tasks on a single processor and all required memory on the local memory, so as to have ZERO communication cost: a TRIVIAL SOLUTION.
To improve the model, we add a relaxation of the subproblem to the master problem model:
for each set $S$ of consecutive tasks whose sum of durations exceeds the real-time requirement, we impose that their processors should not all be the same:
$$\sum_{i \in S} WCET_i > RT \ \Rightarrow\ \sum_{i \in S} T_{ij} \le |S| - 1 \quad \text{for all } j$$
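A minimal sketch of how such sets S could be enumerated for a pipeline. The function name and the array-based interface are hypothetical. As a design choice of this sketch, only the shortest violating window for each starting task is reported, because the cut for a window implies the cut for every longer window that contains it.

```c
#include <assert.h>

/* For a pipeline of n tasks, find for each starting task i the shortest
 * window [i, j] of consecutive tasks whose total WCET exceeds rt.
 * Each reported window S yields the cut: sum over i in S of T_ij <= |S| - 1.
 * Writes at most max_out windows and returns how many were written. */
int find_cut_windows(const int *wcet, int n, int rt,
                     int *first, int *last, int max_out)
{
    assert(n >= 0 && max_out >= 0);
    int count = 0;
    for (int i = 0; i < n; i++) {
        int sum = 0;
        for (int j = i; j < n; j++) {
            sum += wcet[j];
            if (sum > rt) {                 /* tasks i..j cannot share a processor */
                if (count < max_out) {
                    first[count] = i;
                    last[count] = j;
                    count++;
                }
                break;                      /* longer windows are implied */
            }
        }
    }
    return count;
}
```

For four tasks with WCET 4 each and RT = 8, the sketch reports the windows (0..2) and (1..3).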
Sub-Problem model
Task scheduling with static resource assignment (subproblem)
We have to schedule tasks, so we have to decide when they start.
• Activity starting time: $Start_i :: [0..Deadline_i]$
• Precedence constraints: $Start_i + Dur_i \le Start_j$
• Real-time constraints: for all activities $i$ running on the same processor, $Start_i + Dur_i \le RT$
• Cumulative constraints on resources:
  • processors are unary resources: cumulative([Start], [Dur], [1], 1)
  • memories are additive resources: cumulative([Start], [Dur], [MR], C)
What about the bus?
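To make the meaning of cumulative([Start], [Dur], [MR], C) concrete, the following sketch checks a given schedule against it. This is only a feasibility check written for illustration (the function name is an assumption); a CP solver enforces the same condition through propagation during search.

```c
#include <assert.h>

/* Returns 1 if, at every time point, the total demand of the activities
 * that are running stays within the capacity; 0 otherwise.
 * Activity j runs in the half-open interval [start[j], start[j] + dur[j]). */
int cumulative_ok(const int *start, const int *dur, const int *demand,
                  int n, int capacity)
{
    assert(n >= 0 && capacity >= 0);
    for (int i = 0; i < n; i++) {
        /* Usage can only rise when an activity starts, so it is enough
         * to check the instants start[i]. */
        int t = start[i], load = 0;
        for (int j = 0; j < n; j++)
            if (start[j] <= t && t < start[j] + dur[j])
                load += demand[j];
        if (load > capacity) return 0;
    }
    return 1;
}
```

A unary resource (a processor) is the special case with all demands equal to 1 and capacity 1; a memory uses the memory requirements MR as demands and its size C as capacity.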
Bus model
• Unary resource: granularity of one clock cycle
• An arbitration mechanism decides the bus allocation
[Figure: bandwidth (bit/s) vs. time, up to the max bus bandwidth, over the execution time of task i and task j, showing the state reads and state writes of task i and task j]
Bus model
• Additive bus model
[Figure: bandwidth (bit/s) vs. time, up to the max bus bandwidth, showing the state reads and state writes of task i and task j; labels: size of program data, task execution time. Task0 accesses input data: BW = MaxBW/NoProc]
• The model does not hold under heavy bus congestion (more than 65% of total bandwidth)
• Bus traffic has to be minimized
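A small sketch of how the additive model might be applied. It assumes, based only on the figure labels (size of program data, task execution time), that each task's bus demand is approximated as its data size spread evenly over its execution time; the function names are hypothetical, and the 65% threshold is the one quoted on the slide.

```c
#include <assert.h>

/* Assumed approximation: a task that moves data_bits over the bus during
 * exec_time_s seconds requests a constant bandwidth for its whole duration. */
double task_bandwidth(double data_bits, double exec_time_s)
{
    assert(exec_time_s > 0.0);
    return data_bits / exec_time_s;
}

/* Returns 1 if the summed demand of the concurrently running tasks stays
 * within 65% of the maximum bus bandwidth, the range in which the slides
 * report the additive model to hold; 0 otherwise. */
int additive_model_applicable(const double *bw, int n, double max_bw)
{
    assert(n >= 0 && max_bw > 0.0);
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += bw[i];
    return total <= 0.65 * max_bw;
}
```

When communication hotspots are present, the slides report a lower threshold (50%), so the constant would need to be adjusted accordingly.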
No-good generation
• Master problem: assignment of tasks and memory slots (IP solver)
• Subproblem: task scheduling with static resource assignment (CP solver)
[Figure: the master problem passes its solution to the subproblem, which returns either a solution or a no-good]
If no feasible schedule exists for the allocation provided by the master, a no-good is generated. We use a simple BUT EFFECTIVE one: identify the set CR of CONFLICTING RESOURCES. For each $R \in CR$, let $S_{TR}$ be the set of tasks allocated on $R$; then
$$\sum_{i \in S_{TR}} T_{iR} \le |S_{TR}| - 1$$
Other cuts are also possible [Hooker, Constraints 2005], but these are enough for our case and easy to extract.
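The master/subproblem loop can be sketched on a toy instance. Everything below is a simplification made for illustration, not the solvers used in this work: there are only two processors, the master problem is solved by brute-force enumeration instead of an IP solver, the cost counts only the data exchanged between consecutive pipeline tasks placed on different processors, and the subproblem is reduced to a per-processor load check instead of CP scheduling. The no-good has the same form as the cut above.

```c
#include <assert.h>
#include <limits.h>

#define MAX_TASKS 8
#define MAX_CUTS  (2 << MAX_TASKS)   /* at most 2 cuts per rejected allocation */

/* No-good: the tasks in the bitmask `tasks` must not all run on `proc`. */
typedef struct { unsigned tasks; int proc; } NoGood;

/* An allocation is a bitmask: bit i set means task i runs on processor 1. */
static unsigned tasks_on(unsigned alloc, int proc, unsigned all)
{
    return proc ? alloc : (~alloc & all);
}

static int violates(unsigned alloc, unsigned all, const NoGood *cuts, int n_cuts)
{
    for (int c = 0; c < n_cuts; c++)
        if ((tasks_on(alloc, cuts[c].proc, all) & cuts[c].tasks) == cuts[c].tasks)
            return 1;
    return 0;
}

/* Returns the minimum communication cost of a feasible allocation (stored
 * in *alloc_out), or -1 if no allocation meets the real-time requirement. */
int benders_allocate(const int *wcet, const int *data, int n, int rt,
                     unsigned *alloc_out)
{
    assert(n >= 1 && n <= MAX_TASKS);
    unsigned all = (1u << n) - 1u;
    NoGood cuts[MAX_CUTS];
    int n_cuts = 0;

    for (;;) {
        /* Master: cheapest allocation that satisfies every no-good so far. */
        int best = INT_MAX;
        unsigned arg = 0;
        for (unsigned a = 0; a <= all; a++) {
            if (violates(a, all, cuts, n_cuts)) continue;
            int cost = 0;
            for (int i = 0; i + 1 < n; i++)
                if (((a >> i) & 1u) != ((a >> (i + 1)) & 1u))
                    cost += data[i];
            if (cost < best) { best = cost; arg = a; }
        }
        if (best == INT_MAX) return -1;

        /* Subproblem: each processor's total load must fit within rt. */
        int feasible = 1;
        for (int p = 0; p < 2; p++) {
            unsigned set = tasks_on(arg, p, all);
            int load = 0;
            for (int i = 0; i < n; i++)
                if ((set >> i) & 1u) load += wcet[i];
            if (load > rt) {              /* conflicting resource: add a no-good */
                feasible = 0;
                assert(n_cuts < MAX_CUTS);
                cuts[n_cuts].tasks = set;
                cuts[n_cuts].proc = p;
                n_cuts++;
            }
        }
        if (feasible) { *alloc_out = arg; return best; }
    }
}
```

Because the master always returns the cheapest allocation not yet excluded, the first allocation accepted by the subproblem is optimal for this toy cost. Each no-good removes at least the rejected allocation, so the loop terminates.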
Computational efficiency
• CP and IP formulations simplified
• Hybrid approach clearly outperforms pure CP and IP techniques
• Search time bounded to 15 minutes
• CP and IP could find a solution in only 50%- of the instances
• Hybrid approach always found a solution
Validation of bus model
• Requesting more than 65% of the theoretical maximum bandwidth causes the additive model to fail.
• Lower threshold in the presence of communication hotspots (50%)
• Benefits of the additive model:
  • task execution time almost independent of bus utilization
  • performance predictability greatly enhanced
Validation of optimizer solutions
• MAX error lower than 10%
• AVG error equal to 4.7%, with standard deviation of 0.08
• Optimizer turns out to be conservative in predicting infeasibility
• The flow was successfully applied to the GSM benchmark
Energy-Efficient Application Mapping for MPSoCs
Given a platform:
1. Achieve a specified throughput
2. Minimize power consumption
MULTIMEDIA APPLICATIONS
Application Mapping
• The problem of allocation, scheduling and frequency selection for task graphs on multiprocessors in a distributed real-time system is NP-hard.
• New tool flows for efficient mapping of multi-task applications onto hardware platforms
[Figure: a task graph (T1 to T8) and a platform with processors Proc. 1 to Proc. N, each with a private memory, connected by an INTERCONNECT. Allocation assigns tasks to processors; schedule & frequency selection place them on a resources vs. time chart with a deadline]
Exploiting Voltage Supply
• Supply voltage impacts power and performance
  • Circuit slowdown: $T = 1/f = K/(V_{dd} - V_t)^a$
  • Cubic power savings: $P = C_{eff} \cdot V_{dd}^2 \cdot f$
• Just-in-time computation
  • Stretch execution time up to the max tolerable
[Figure: power vs. available time, comparing fixed voltage + shutdown with variable voltage]
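A short sketch of just-in-time computation with discrete operating points. From $P = C_{eff} \cdot V_{dd}^2 \cdot f$ and an execution time of cycles/f, the energy of a task is $E = C_{eff} \cdot V_{dd}^2 \cdot cycles$, so the sketch picks the lowest-energy (Vdd, f) pair that still meets the deadline. The type and function names, and any voltage and frequency values passed in, are assumptions for illustration.

```c
#include <assert.h>

/* One discrete operating point: supply voltage (V) and clock frequency (Hz). */
typedef struct { double vdd; double freq_hz; } OpPoint;

/* E = P * t = (Ceff * Vdd^2 * f) * (cycles / f) = Ceff * Vdd^2 * cycles */
double task_energy(double ceff, OpPoint op, double cycles)
{
    return ceff * op.vdd * op.vdd * cycles;
}

/* Returns the index of the lowest-energy operating point whose execution
 * time (cycles / f) fits within the deadline, or -1 if none does. */
int pick_operating_point(const OpPoint *ops, int n, double ceff,
                         double cycles, double deadline_s)
{
    assert(n >= 0 && cycles >= 0.0);
    int best = -1;
    double best_energy = 0.0;
    for (int i = 0; i < n; i++) {
        if (cycles / ops[i].freq_hz > deadline_s) continue;   /* too slow */
        double e = task_energy(ceff, ops[i], cycles);
        if (best < 0 || e < best_energy) {
            best = i;
            best_energy = e;
        }
    }
    return best;
}
```

The frequency cancels out of the task energy, which is why slowing down only saves energy when the supply voltage is lowered together with the clock.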
Scheduling & Voltage Scaling
• Energy/speed trade-offs: varying the voltages (CPU Vdd, Vbs)
• Different voltages: different frequencies (f1, f2, f3)
• Mapping and scheduling: given (fastest freq.)
[Figure: two power vs. time charts for tasks 1, 2 and 3 with a deadline; one of them shows slack before the deadline]
Target architecture - 2
• Homogeneous computation tiles:
  • ARM cores (including instruction and data caches);
  • tightly coupled software-controlled scratch-pad memories (SPM);
  • AMBA AHB;
  • DMA engine;
  • RTEMS OS;
• Technology homogeneous (0.13 um), industrial power models (ST)
• Variable voltage/frequency cores with discrete (Vdd, f) pairs
  • Frequency dividers scale down the baseline 200 MHz system clock
• Cores use non-cacheable shared memory to communicate;
• Semaphore and interrupt facilities are used for synchronization;
• Private on-chip memory to store data.
[Figure: tiles, each with a synchronization module and a private memory, attached to an AMBA AHB INTERCONNECT together with a shared memory and an interrupt slave; a programmable register and a CLOCK TREE GENERATOR derive CLOCK 1 to CLOCK N from the system clock]
Application model
A task graph represents:
• A group of tasks T
• Task dependencies
• Execution times expressed in clock cycles: WCN(Ti)
• Communication times (writes & reads) expressed as: WCN(WTiTj) and WCN(RTiTj)
These values can be back-annotated from functional simulation.
[Figure: task graph with Task1 to Task6 and edges T1→T2, T1→T3, T2→T4, T3→T5, T4→T6, T5→T6. Each task Ti is annotated with WCN(Ti) and each edge with WCN(WTiTj) and WCN(RTiTj)]
Efficient Application Development Support
• Optimization tools generally rely on many simplifying assumptions.
• Neglecting these assumptions in the software implementation can:
  • generate unpredictable and undesired system-level interactions;
  • make the overall system error-prone.
• We propose an entire framework to help programmers in software implementation:
  • a generic customizable application template (OFFLINE SUPPORT);
  • a set of high-level APIs (ONLINE SUPPORT).
• The main goals of our development framework are:
  • exact and reliable execution of the application after the optimization step;
  • guarantees about high performance and constraint satisfaction.
Customizable Application Template
• Starting from a high-level task and data flow graph, software developers can easily and quickly build their application infrastructure.
• Programmers can intuitively translate the high-level representation into C code using our facilities and library.
• Users can specify:
  • the number of tasks included in the target application;
  • their nature (e.g. branch, fork, or-node, and-node);
  • their precedence constraints (e.g. due to data communication);
  thus quickly drawing the CTG schema.
• Programmers can focus on the functionality of the tasks:
  • the main effort goes into the more specific and critical sections of the application.
OS-level and Task-level APIs
• Users can easily reproduce optimizer solutions, thus:
  • indirectly neglecting the optimizer’s abstractions (task model; communication model; OS overheads);
  • obtaining the needed application constraint satisfaction.
• Programmers can allocate to the right hardware resources:
  • tasks;
  • program data;
  • queues.
• Scheduling support APIs
• Frequency and voltage selection
• Communication issues: shared queues; semaphores; interrupts.
// Node behaviour: 0 AND; 1 OR; 2 FORK; 3 BRANCH
uint node_behaviour[TASK_NUMBER] = {2,3,3,..};

#define N_CPU 2
uint task_on_core[TASK_NUMBER] = {1,1,2,1};
int schedule_on_core[N_CPU][TASK_NUMBER] = {{1,2,4,8}..};

uint queue_consumer[..][..] = {
    {0,1,1,0,..},
    {0,0,0,1,1,.},
    {0,0,0,0,0,1,1..},
    {0,0,0,0,..}..};

// Node type: 0 NORMAL; 1 BRANCH; 2 STOCHASTIC
uint node_type[TASK_NUMBER] = {1,2,2,1,..};
Example
• Number of nodes: 12
• Graph of activities
• Node type: Normal, Branch, Conditional, Terminator
• Node behaviour: Or, And, Fork, Branch
• Number of CPUs: 2
• Task allocation
• Task scheduling
• Arc priorities
• Freq. & voltage
[Figure: conditional task graph with 12 nodes (N1, B2, B3, C4, C5, C6, C7, N8, N9, N10, N11, T12), arcs a1 to a14 and fork, or, and, branch behaviours, together with its allocation and schedule on processors P1 and P2 (resources vs. time, with a deadline)]
#define TASK_NUMBER 12
Queue ordering optimization
Communication ordering affects system performance
[Figure: tasks T1 to T6 on CPU1 and CPU2 exchange data through queues C1 to C5, shown for two communication orderings with tasks marked Wait! and RUN!; in the second ordering T4 is re-activated]
Synchronization among tasks
[Figure: tasks T1 to T4 communicating through queues C1, C2 and C3, scheduled on Proc. 1 and Proc. 2; T4 is suspended]
• Non-blocked semaphores
Logic Based Benders Decomposition
[Figure: allocation & frequency assignment (INTEGER PROGRAMMING) passes a valid allocation to scheduling (CONSTRAINT PROGRAMMING), which returns a no-good as a linear constraint. Labels: objective function (communication cost & energy consumption), memory constraints, real-time constraint]
Decomposes the problem into 2 sub-problems:
• Allocation & assignment of frequency settings → IP
  • Objective function: minimizing energy consumption during execution and communication of tasks
• Scheduling → CP
  • Objective function: minimizing energy consumption during frequency switching
Solver Performance
• Hundreds of decision variables
• Much beyond ILP solver or CP solver capability
Allocation problem model
• $X_{tfp} = 1$ if task $t$ executes on processor $p$ at frequency $f$;
• $W_{ijfp} = 1$ if tasks $i$ and $j$ run on different cores and task $i$, on core $p$, writes data to $j$ at frequency $f$;
• $R_{ijfp} = 1$ if tasks $i$ and $j$ run on different cores and task $j$, on core $p$, reads data from $i$ at frequency $f$.
Each task can execute on only one processor at one frequency:
$$\sum_{p=1}^{P} \sum_{f=1}^{M} X_{tfp} = 1 \quad \forall t$$
Each communication between tasks can execute at most once per execution, and one write corresponds to one read:
$$\sum_{p=1}^{P} \sum_{f=1}^{M} W_{ijfp} \le 1 \quad \forall i, j \in T$$
$$\sum_{p=1}^{P} \sum_{f=1}^{M} R_{ijfp} \le 1 \quad \forall i, j \in T$$
$$\sum_{p=1}^{P} \sum_{f=1}^{M} (W_{ijfp} - R_{ijfp}) = 0 \quad \forall i, j \in T$$
The objective function minimizes the energy consumption associated with task execution and communication:
$$OF = En_{Comp} + En_{Read} + En_{Write}$$
Scheduling problem model
• Five phases behaviour:
  • INPUT = input data reading;
  • EXEC = computation activity;
  • OUTPUT = output data writing.
• Atomic activities
[Figure: a task as a sequence of input (reading phase), exec and output (writing phase) activities, with fork and join]
The objective function: minimize the energy consumption associated with frequency switching.
• Processors are modelled as unary resources
• Bus is modelled as additive resource
Duration of task i is now fixed since the mode is fixed. For a task i that precedes a task j:
• Tasks running on the same processor at the same frequency:
$$Start_i + duration_i \le Start_j$$
• Tasks running on the same processor at different frequencies:
$$Start_i + duration_i + T_{ij} \le Start_j$$
• Tasks running on different processors:
$$Start_i + duration_i + dWrite_{ij} + dRead_{ij} \le Start_j$$
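The three precedence cases can be summarized in one helper that returns the earliest start of a successor task. This is a sketch: the enum, the function name and the reading of the extra term in the second case as a frequency-switching delay are assumptions, since the slide does not name that term.

```c
#include <assert.h>

/* Relative placement of a task i and its successor j. */
typedef enum {
    SAME_PROC_SAME_FREQ,
    SAME_PROC_DIFF_FREQ,
    DIFF_PROC
} Placement;

/* Earliest time at which task j may start after its predecessor i.
 * t_switch: assumed delay for changing frequency on the same processor.
 * d_write, d_read: durations of the write and read communication phases. */
int earliest_successor_start(int start_i, int duration_i, Placement pl,
                             int t_switch, int d_write, int d_read)
{
    assert(duration_i >= 0 && t_switch >= 0 && d_write >= 0 && d_read >= 0);
    int t = start_i + duration_i;
    switch (pl) {
    case SAME_PROC_SAME_FREQ:
        break;                       /* no extra delay */
    case SAME_PROC_DIFF_FREQ:
        t += t_switch;               /* pay for the frequency change */
        break;
    case DIFF_PROC:
        t += d_write + d_read;       /* data crosses the bus: write, then read */
        break;
    }
    return t;
}
```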
Application Development Methodology
[Figure: tool flow. Characterization phase: a simulator produces application profiles from the CTG. Optimization phase: the optimizer produces the allocation and scheduling. Application development support leads to the optimal SW application implementation, followed by platform execution]
Validation of optimizer solutions: Throughput
The optimal allocation & schedule produced by the optimizer were validated on the virtual platform over 250 instances:
• MAX error lower than 10%;
• AVG error equal to 4.51%, with standard deviation of 1.94;
• All the deadline constraints are satisfied.
[Figure: histogram of probability (%) vs. throughput difference (%), from -5% to 11%]
Validation of optimizer solutions: Power
The optimal allocation & schedule produced by the optimizer were validated on the virtual platform over 250 instances:
• MAX error lower than 10%;
• AVG error equal to 4.80%, with standard deviation of 1.71.
[Figure: histogram of probability (%) vs. energy consumption difference (%), from -5% to 11%]
GSM Encoder
• Throughput required: 1 frame/10 ms
• With 2 processors and 4 possible frequency & voltage settings
• Task graph: 10 computational tasks; 15 communication tasks
• Without optimizations: 50.9 μJ
• With optimizations: 17.1 μJ (-66.4%)
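As a quick check, the quoted saving follows directly from the two energy figures:

$$\frac{50.9\,\mu J - 17.1\,\mu J}{50.9\,\mu J} = \frac{33.8}{50.9} \approx 0.664 = 66.4\%$$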
Summary & future work
• Energy-optimal task mapping
  • Strong optimization engine (complete)
  • Programmer support (design & exec time)
  • Validation: accuracy & optimality
• Future work
  • Conditional task graphs
  • Dealing with multiple use cases
  • Variable execution times
  • Aggressive communication scheduling