efficient computing in cyber-physical systemsinformatik 12, 2013 sfb 876 energy consumption of...
TRANSCRIPT
technische universität dortmund
fakultät für informatikinformatik 12
Efficient Computingin Cyber-Physical Systems
Peter MarwedelTU Dortmund (Germany)
Informatik 12
2013/06/20
©S
prin
ger,
2010
- 2 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Cyber-physical systems and embedded systems
CPS = ES + physical environment
Embedded systems("small computers")
Embedded systems (ES): information processing embedded into a larger product [Peter Marwedel, 2003] Cyber-physical systems (CPS): integrations of computation
with physical processes [Edward Lee, 2006].
PhysicsInformation processing
- 3 -technische universitätdortmund
fakultät fürinformatik
Scope, opportunities & challenges (Partial) solutions to challenges
• Timing predictability- WCET-aware compilation: unrolling, locking- register allocation, scratch pads
• Energy efficiency- scratch pads: (non)-overlaying, compile/run-time allocation
• Tradeoffs between multiple objectives- PAMONO bio sensor- automatic parallelization: energy efficiency vs. performance- approximate computations: timeliness vs. exactness
Summary p. marwedel, informatik 12, 2013
SFB876
Outline
- 4 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Opportunities
©: Microsoft Cliparts+ P. Marwedel, 2011,
Transportation
• Automotive
• Avionics
• Rail
Medical
• Hospitals
• Support of handicapped
• Patient information
• Support for the elderly© www.dobelle.com,~2000
- 5 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Opportunities (2)
(Urban) living
• Telecommunication
• Consumer electronics
• Smart home
• Logistics
• Smart grid
• Public safety
• Structural health monitoring
Robotics
Military systems© Graphics: P. Marwedel, 2011,Microsoft
- 6 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Challenge (1): timing predictability
Timing: The system must react within a time interval dictated by the environment [Adopted from Bergé]
texecute
Consequences frequently underestimated
Always! Not subject to statistical arguments(except possibly extremely small probabilities)
© Graphics: Microsoft Cliparts
- 7 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Challenges (2): Energy/power efficiency
Need for energy/power efficient processing (“sweet corner”)
© Graphics: P. Marwedel,H. De Man, Microsoft
- 8 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Challenge(3): Many more objectives
exactness of results, QoS average case delay safety security thermal behavior reliability EMC characteristics radiation hardness cost, size weight environmental friendliness extendability …
© Graphics: P. Marwedel, 2011; Microsoft
Threatened by increased openness
- 9 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Challenges (4)
Analog/digital interface (quantization, …)
Consequences of networkedembedded systems
• e.g. maintain level of local control
digital-analog signal
- 10 -technische universitätdortmund
fakultät fürinformatik
Scope, opportunities & challenges (Partial) solutions to challenges
• Timing predictability- WCET-aware compilation: unrolling, locking- register allocation, scratch pads
• Energy efficiency- scratch pads: (non)-overlaying, compile/run-time allocation
• Tradeoffs between multiple objectives- PAMONO bio sensor- automatic parallelization: energy efficiency vs. performance- approximate computations: timeliness vs. exactness
Summary p. marwedel, informatik 12, 2013
SFB876
Outline
- 11 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Let’s look at timing!
Definition of worst case execution time:
WC
ET E
ST
© Graphics: adopted fromR. Wilhelm + Microsoft Cliparts
WCETEST must be 1. safe (i.e. ≥ WCET) and2. tight (WCETEST-WCET≪WCETEST)
Timeconstraint
disr
tribu
tion
of ti
mes
possible execution times
worst-case performanceworst-case guarantee
WC
ET
BC
ET
BC
ET E
ST
timing predictabilitytime
- 12 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Current Trial-and-Error Based Development
1. Specification of CPS/ES software 2. Generation of Code (ANSI-C or similar)3. Compilation for given target processor4. Execution and/or simulation of machine code,
using a (e.g. random) set of input data5. Measurement-based computation of “estimated worst
case execution time” (WCETmeas)6. Adding safety margin (e.g. 20%) on top of WCETmeas and
call this “WCEThypo”7. If “WCEThypo” > real-time constraint: change some detail,
go back to 1 or 2.
- 13 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Structure of WCC(WCET-aware C-compiler)
LLIR
2CR
LC
RL2
LLIR
- 14 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Challenges for WCETEST-Minimization
Worst-Case Execution Path (WCEP)
WCETEST of a program = Length of longestexecution path (WCEP) in that program
WCETEST-Minimization:Reduction of the longest path
Other optimizations do not resultin a reduction of WCETEST
Optimizations need to know the WCEP
- 15 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
WCET-oriented optimizations
Extended loop analysis (CGO 09) Instruction cache locking (CODES/ISSS 07, CGO 12) Cache partitioning (WCET-Workshop 09) Procedure cloning (WCET-WS 07, CODES 07, SCOPES 08) Procedure/code positioning (ECRTS 08, CASES 11 (2x)) Function inlining (SMART 09, SMART 10) Loop unswitching/invariant paths (SCOPES 09) Loop unrolling (ECRTS 09) Register allocation (DAC 09, ECRTS 11)) Scratchpad optimization (DAC 09) Extension towards multi-objective optimization (RTSS 08) Superblock-based optimizations (ICESS 10) Surveys (Springer 10, Springer 12)
- 16 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
WCET-aware Loop Unrolling via Back-annotation
WCET-information available at assembly level
Unrolling to be applied at internal representationof source code
Solution: Back-annotation:Experimental WCET-aware compiler WCCallows copying information:assembly code source code
• WCET data
• Assembly code size
• Amount of spill code
Memory architecture info available
High-Level IR(Source Code)
Low-Level IR(Assembly Code)
Back-Annotation
Mem.Spec.
Code generation
- 17 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
80828486889092949698
100
0.5 kByte 1 kByte 2 kByte 4 kByte 8 kByte 16 kByte 16 kByte
I-Cache Size
Results for unrolling: WCET
100%: Avg. WCET for all benchmarks with –O3 & no unrolling WCET reduction between 10.2% and 15.4%WCET-driven Unrolling outperforms std. unrolling by 13.7%
Rel
ativ
e W
CET
[%]
STAN
DAR
D
- 18 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Relative WCETEST with I-Cache Locking 5 Bench-marks/ARM920T/Postpass-Opt
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
110%
64 128 256 512 1024 2048 4096 8192 16384
Cache Size [bytes]
Rel
. WC
ET_E
ST [%
]
ADPCM G723 Statemate Compress MPEG2
(ARM920T)
[S. Plazar et al.]
H. Falk, Heiko, S. Plazar, H. Theiling: Compile Time Decided Instruction Cache Locking Using Worst-Case Execution Paths, Intern. Conf. on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2007
- 19 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Register Allocation
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
110%
adpcm
_veri
fy
cjpeg
_tran
supp
compr
ess crc
dijkstr
aduff
edge_
detect ed
nep
ic
expint
fdct
fft_1
024
fft_2
56 fir
fir2d
im gsm
gsm_e
ncode h26
3
h264d
ec_b
lock
h264d
ec_m
acro
iir_4_
64
iir_biq
uad_N
jfdcti
nt
latnrm
_32_
64
lmsfi
r_8_1
lmsfi
r_32_
64 lpc
ludcmp
matmult
matrix2
_fixe
d
matrix2
_float
md5
minver
mult_10_
10
mult_4_4
ndesprim
e
qmf_rec
eive
qmf_tran
smit qurt
rijndae
l_enc
selec
tsh
a
spec
tral
startu
p
v32_
benc
Avera
ge
Rel
ativ
e W
CET
EST [
%]
(Opt
imiz
atio
n Le
vel -
O3)
100% = WCETEST using Standard Graph Coloring (highest degree)
H. Falk, Heiko: WCET-aware Register Allocation based on Graph Coloring, 46th Design Automation Conference (DAC), 2009
- 20 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Scratch pad memories (SPM):Fast, energy-efficient, timing-predictable
Address space
ARM7TDMI cores, well-known for low power consumption
scratch pad memory
0
FFF..
ExampleExample
Small; no tag memory
SPMs are small, physically separate memories mapped into the address space;Selection is by an appropriate address decoder (simple!)
SPM
select
- 21 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Property of the PREDATOR consortium. Confidential.
WCETEST-aware scratch pad allocation
Setup
Bosch Democar:Runnable IgnitionSWCSync
WCETEST-aware scratch padallocation of program code byWCC, including fully automatedWCETEST analyses using aiT
WCETEST reduced to about 50%, compared to gcc.
© P. Marwedel, 2011
- 22 -technische universitätdortmund
fakultät fürinformatik
Scope, opportunities & challenges (Partial) solutions to challenges
• Timing predictability- WCET-aware compilation: unrolling, locking- register allocation, scratch pads
• Energy efficiency- scratch pads: (non)-overlaying, compile/run-time allocation
• Tradeoffs between multiple objectives- PAMONO bio sensor- automatic parallelization: energy efficiency vs. performance- approximate computations: timeliness vs. exactness
Summary p. marwedel, informatik 12, 2013
SFB876
Outline
- 23 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Why care about energy efficiency ?
Increasing performance
Energy a very scarce resource ReliabilityAvoiding high currents & metal migrationProblems with cooling, avoiding hot spots
Cost of energyGlobal warming
SensorCarFactoryE.g.
Unplug-ged
Unchargedperiods
PluggedExecution platformRelevant during use?
Power
© Graphics: P. Marwedel, 2011
- 24 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Where is the energy consumed? - Consumer portable systems -
According to International Technology Roadmap for Semiconductors(ITRS), 2010 update, [www.itrs.net]
Violation of 0.5-1 W constraint for small mobiles.
It not just the logic, don’t ignore memory!
© ITRS, 2010
- 25 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Energy consumption of memories
0
5
10
15
20
25
30
35
40
512 2K 8K 32
K12
8K51
2K 2M 8M 32M
128M
512M 2G
High Performance - 16banks SPM Access time(ns)High Performance - 16banks SPM Energy (nJ) -readHigh Performance - 16banks DDR Access time(ns)High Performance - 16banks DDR Energy (nJ) -read
Source: Olivera Jovanovic, TU Dortmund, 2011
SRAM
DRAM
CACTI: Scratchpad (SRAM) vs. DRAM (DDR2):
16 bit read; size in bytes;65 nm for SRAM, 80 nm for DRAM
- 26 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Migration of data & instructions,global optimization model (TU Dortmund)
Which memory object (array,loop, etc.) to be stored in SPM?Non-overlaying (“static”) allocation:Gain gk and size sk for each object k. Maximise gain G = gk, respecting size of SPM SSP sk.Solution: knapsack algorithm.Overlaying (“dynamic”) allocation:Moving objects back and forthprocessor
scratch pad memory,capacity SSP
main memory
?
for i .{ }
for j ..{ }
while ...
repeat …
call ...
array ...
int ...
array …
Example:
- 27 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Reduction in energy and average run-time for migrating functions and variables
Multi_sort (mix of sort algorithms)
Cyc
les
[x10
0]En
ergy
[µJ]
Measured processor / external memory energy + CACTI values for SPM (combined model)
Numbers will change with technology, algorithms remain unchanged.
- 28 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Non-overlaying allocation problematic formultiple hot spots Overlaying allocation
Effectively results in a kind of compiler-controlled overlays for SPM
Address assignment within SPM required
CPU
Memory
Memory
SPM
- 29 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Overlaying allocation by Verma et al. (1)
C
DEF A
USE A
USE A
MOD A USE C
USE C
B1
B2
B3
B4
B5
B6
B7
B8
Based on control flow graph.
[M.Verma, P.Marwedel: Dynamic Overlay of Scratchpad Memory for Energy Minimization, ISSS, 2004]
- 30 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Overlaying allocation by Verma et al. (2)
SPILL_STORE(A);SPILL_LOAD(C);
SPILL_STORE(A);SPILL_LOAD(C);
SPILL_LOAD(A);SPILL_LOAD(A);
C
DEF A
USE A
USE A
MOD A USE C
USE C
B1
B2
B3
B4
B5
B6
B7
B8
B9
B10
Global set of ILP equations reflects cost/benefit relations of potential copy points
Code handled like data
- 31 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Runtime/energy reduction with respect tonon-overlaying (“static”) allocation
- 32 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Dynamic set of multiple applications
MEMMEM
CPU
SPM
SPMManager
SPMManager
App. 2App. 2
App. 1App. 1
App. nApp. n
SP
M
App. 3
App. 2
App. 1
t
Address space:
SPM
??
Compile-time partitioning of SPM not feasibleIntroduction of SPM-managerRuntime decisions, but
compile-time supported
[R. Pyka, Ch. Faßbach, M. Verma, H. Falk, P. Marwedel: Operating system integrated energy aware scratchpad allocation strategies for multi-process applications, SCOPES, 2007]
- 33 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Comparison of SPMM to Caches for SORT
Baseline: Main memory onlySPMM peak energy reduction
by 83% at 4k Bytes scratchpad
SPM Size Δ 4-way1024 74,81%
2048 65,35%
4096 64,39%
8192 65,64%
16384 63,73%
- 34 -technische universitätdortmund
fakultät fürinformatik
Scope, opportunities & challenges (Partial) solutions to challenges
• Timing predictability- WCET-aware compilation: unrolling, locking- register allocation, scratch pads
• Energy efficiency- scratch pads: (non)-overlaying, compile/run-time allocation
• Tradeoffs between multiple objectives- PAMONO bio sensor- automatic parallelization: energy efficiency vs. performance- approximate computations: timeliness vs. exactness
Summary p. marwedel, informatik 12, 2013
SFB876
Outline
- 35 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Detection of nano objects using plasmon effect(Optical effect of objects ≪ wavelength of light)
Energy consumption of (bio-) virus detection
Laser
Image sequence
CCD
Prisma
Nano object
Sensor
55,0 55,2 55,4 55,6 55,8 56,0 56,2-0,04
-0,02
0,00
0,02
0,04
0,06
Sig
nal (I/I
)
incidence angle, grad
Reflectivity
- 36 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Timeliness in (bio-) virus detection (2)
Low-end Mobile GPU meets realtime constraint up to 1024x256 pixels Faster GPUs give room for larger usable sensor area
Frame Rate (fps) Image Size1024x128 1024x256 1024x512 1024x1024
CPU: i7-2600 23 20.1 16.1 11.5CPU: i7-620M (mobile) 16 8.5 4.6 2.4GPU: 3100M (mobile) 60 32.6 16 7.8GPU: GTX 480 60 60 60 26.8GPU: GTX 560Ti 60 60 60 40.3
Target: 30 frames per second at 1024x256 pixels
- 37 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Timeliness in (bio-) virus detection (3)
C. Timm, A. Gelenberg, P. Marwedel, F. Weichert: Energy Considerations within the Integration of General Purpose GPUs in Embedded Systems. Intern. Conf. on Advances in Distributed and Parallel Computing, 2010C. Timm, F. Weichert, P. Marwedel, H. Müller: Design Space Exploration Towards a Realtime and Energy-Aware GPGPU-based Analysis of Biosensor Data. Computer Science - Research and Development, ENA-HPC, 2011
current clamp
Energy per frame CPU 3.26 J 5.84 J 10.52 J Reduced toEnergy per frame GPU 0.93 J 1.56 J 2.76 J avg 27%
- 38 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
D. Cordes, P. Marwedel: Multi-Objective Aware Extraction of Task-Level Parallelism Using Genetic Algorithms, DATE 2012
Multi-objective based parallelizationof sequential programs
- 39 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Tradeoff between timeliness and exactness
Traditional error correction requires resources,including time
Conflicting with timing requirements of real-time systems
➁➀
©
➂
- 40 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Tradeoff between timeliness and exactness (2)
In some cases, errors have no effect In some cases, timing requirements are dominating & error
correction may be compromised In other cases, error correction requirements are
dominating & timing requirements may be compromised
Imprecise outputNo effect System crash
unused
variable i
unused
variable j
Register 0
Register 1
Register 2
Register 3
Classification requires application knowledge
- 41 -technische universitätdortmund
fakultät fürinformatik
p. marwedel, informatik 12, 2013
SFB876
Analysis of H.264 video decoder(~ 3500 lines of C code)
Fraction of memory space
Resolution Memory size of reliable data
Memory size of unreliable data
176 x 144 90 kB (55%) 74 kB (45%)352 x 288 223 kB (43%) 297 kB (57%)
1280 x 720 1585 kB (37%) 2700 kB (63%)
Impact of uncorrected errors:
Injection into unreliable memory 240 faults / sec 80 faults / sec
Average PSNR of frames with errors 40.89 dB 50.57 dB
Deteriorated exactness, timeliness maintained!
- 42 -technische universitätdortmund
fakultät fürinformatik
Scope, opportunities & challenges (Partial) solutions to challenges
• Timing predictability- WCET-aware compilation: unrolling, locking- register allocation, scratch pads
• Energy efficiency- scratch pads: (non)-overlaying, compile/run-time allocation
• Tradeoffs between multiple objectives- PAMONO bio sensor- automatic parallelization: energy efficiency vs. performance- approximate computations: timeliness vs. exactness
Summary p. marwedel, informatik 12, 2013
SFB876
Summary