efficient computing in cyber-physical systems

42
technische universität dortmund fakultät für informatik informatik 12 Efficient Computing in Cyber-Physical Systems Peter Marwedel TU Dortmund (Germany) Informatik 12 2013/06/20 © Springer, 2010

Upload: others

Post on 05-Dec-2021

3 views

Category:

Documents


0 download

TRANSCRIPT

technische universität dortmund

fakultät für informatikinformatik 12

Efficient Computingin Cyber-Physical Systems

Peter MarwedelTU Dortmund (Germany)

Informatik 12

2013/06/20

©S

prin

ger,

2010

- 2 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Cyber-physical systems and embedded systems

CPS = ES + physical environment

Embedded systems("small computers")

Embedded systems (ES): information processing embedded into a larger product [Peter Marwedel, 2003] Cyber-physical systems (CPS): integrations of computation

with physical processes [Edward Lee, 2006].

PhysicsInformation processing

- 3 -technische universitätdortmund

fakultät fürinformatik

Scope, opportunities & challenges (Partial) solutions to challenges

• Timing predictability- WCET-aware compilation: unrolling, locking- register allocation, scratch pads

• Energy efficiency- scratch pads: (non)-overlaying, compile/run-time allocation

• Tradeoffs between multiple objectives- PAMONO bio sensor- automatic parallelization: energy efficiency vs. performance- approximate computations: timeliness vs. exactness

Summary p. marwedel, informatik 12, 2013

SFB876

Outline

- 4 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Opportunities

©: Microsoft Cliparts+ P. Marwedel, 2011,

Transportation

• Automotive

• Avionics

• Rail

Medical

• Hospitals

• Support of handicapped

• Patient information

• Support for the elderly© www.dobelle.com,~2000

- 5 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Opportunities (2)

(Urban) living

• Telecommunication

• Consumer electronics

• Smart home

• Logistics

• Smart grid

• Public safety

• Structural health monitoring

Robotics

Military systems© Graphics: P. Marwedel, 2011,Microsoft

- 6 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Challenge (1): timing predictability

Timing: The system must react within a time interval dictated by the environment [Adopted from Bergé]

texecute

Consequences frequently underestimated

Always! Not subject to statistical arguments(except possibly extremely small probabilities)

© Graphics: Microsoft Cliparts

- 7 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Challenges (2): Energy/power efficiency

Need for energy/power efficient processing (“sweet corner”)

© Graphics: P. Marwedel,H. De Man, Microsoft

- 8 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Challenge(3): Many more objectives

exactness of results, QoS average case delay safety security thermal behavior reliability EMC characteristics radiation hardness cost, size weight environmental friendliness extendability …

© Graphics: P. Marwedel, 2011; Microsoft

Threatened by increased openness

- 9 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Challenges (4)

Analog/digital interface (quantization, …)

Consequences of networkedembedded systems

• e.g. maintain level of local control

digital-analog signal

- 10 -technische universitätdortmund

fakultät fürinformatik

Scope, opportunities & challenges (Partial) solutions to challenges

• Timing predictability- WCET-aware compilation: unrolling, locking- register allocation, scratch pads

• Energy efficiency- scratch pads: (non)-overlaying, compile/run-time allocation

• Tradeoffs between multiple objectives- PAMONO bio sensor- automatic parallelization: energy efficiency vs. performance- approximate computations: timeliness vs. exactness

Summary p. marwedel, informatik 12, 2013

SFB876

Outline

- 11 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Let’s look at timing!

Definition of worst case execution time:

WC

ET E

ST

© Graphics: adopted fromR. Wilhelm + Microsoft Cliparts

WCETEST must be 1. safe (i.e. ≥ WCET) and2. tight (WCETEST-WCET≪WCETEST)

Timeconstraint

disr

tribu

tion

of ti

mes

possible execution times

worst-case performanceworst-case guarantee

WC

ET

BC

ET

BC

ET E

ST

timing predictabilitytime

- 12 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Current Trial-and-Error Based Development

1. Specification of CPS/ES software 2. Generation of Code (ANSI-C or similar)3. Compilation for given target processor4. Execution and/or simulation of machine code,

using a (e.g. random) set of input data5. Measurement-based computation of “estimated worst

case execution time” (WCETmeas)6. Adding safety margin (e.g. 20%) on top of WCETmeas and

call this “WCEThypo”7. If “WCEThypo” > real-time constraint: change some detail,

go back to 1 or 2.

- 13 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Structure of WCC(WCET-aware C-compiler)

LLIR

2CR

LC

RL2

LLIR

- 14 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Challenges for WCETEST-Minimization

Worst-Case Execution Path (WCEP)

WCETEST of a program = Length of longestexecution path (WCEP) in that program

WCETEST-Minimization:Reduction of the longest path

Other optimizations do not resultin a reduction of WCETEST

Optimizations need to know the WCEP

- 15 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

WCET-oriented optimizations

Extended loop analysis (CGO 09) Instruction cache locking (CODES/ISSS 07, CGO 12) Cache partitioning (WCET-Workshop 09) Procedure cloning (WCET-WS 07, CODES 07, SCOPES 08) Procedure/code positioning (ECRTS 08, CASES 11 (2x)) Function inlining (SMART 09, SMART 10) Loop unswitching/invariant paths (SCOPES 09) Loop unrolling (ECRTS 09) Register allocation (DAC 09, ECRTS 11)) Scratchpad optimization (DAC 09) Extension towards multi-objective optimization (RTSS 08) Superblock-based optimizations (ICESS 10) Surveys (Springer 10, Springer 12)

- 16 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

WCET-aware Loop Unrolling via Back-annotation

WCET-information available at assembly level

Unrolling to be applied at internal representationof source code

Solution: Back-annotation:Experimental WCET-aware compiler WCCallows copying information:assembly code source code

• WCET data

• Assembly code size

• Amount of spill code

Memory architecture info available

High-Level IR(Source Code)

Low-Level IR(Assembly Code)

Back-Annotation

Mem.Spec.

Code generation

- 17 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

80828486889092949698

100

0.5 kByte 1 kByte 2 kByte 4 kByte 8 kByte 16 kByte 16 kByte

I-Cache Size

Results for unrolling: WCET

100%: Avg. WCET for all benchmarks with –O3 & no unrolling WCET reduction between 10.2% and 15.4%WCET-driven Unrolling outperforms std. unrolling by 13.7%

Rel

ativ

e W

CET

[%]

STAN

DAR

D

- 18 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Relative WCETEST with I-Cache Locking 5 Bench-marks/ARM920T/Postpass-Opt

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

110%

64 128 256 512 1024 2048 4096 8192 16384

Cache Size [bytes]

Rel

. WC

ET_E

ST [%

]

ADPCM G723 Statemate Compress MPEG2

(ARM920T)

[S. Plazar et al.]

H. Falk, Heiko, S. Plazar, H. Theiling: Compile Time Decided Instruction Cache Locking Using Worst-Case Execution Paths, Intern. Conf. on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2007

- 19 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Register Allocation

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

110%

adpcm

_veri

fy

cjpeg

_tran

supp

compr

ess crc

dijkstr

aduff

edge_

detect ed

nep

ic

expint

fdct

fft_1

024

fft_2

56 fir

fir2d

im gsm

gsm_e

ncode h26

3

h264d

ec_b

lock

h264d

ec_m

acro

iir_4_

64

iir_biq

uad_N

jfdcti

nt

latnrm

_32_

64

lmsfi

r_8_1

lmsfi

r_32_

64 lpc

ludcmp

matmult

matrix2

_fixe

d

matrix2

_float

md5

minver

mult_10_

10

mult_4_4

ndesprim

e

qmf_rec

eive

qmf_tran

smit qurt

rijndae

l_enc

selec

tsh

a

spec

tral

startu

p

v32_

benc

Avera

ge

Rel

ativ

e W

CET

EST [

%]

(Opt

imiz

atio

n Le

vel -

O3)

100% = WCETEST using Standard Graph Coloring (highest degree)

H. Falk, Heiko: WCET-aware Register Allocation based on Graph Coloring, 46th Design Automation Conference (DAC), 2009

- 20 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Scratch pad memories (SPM):Fast, energy-efficient, timing-predictable

Address space

ARM7TDMI cores, well-known for low power consumption

scratch pad memory

0

FFF..

ExampleExample

Small; no tag memory

SPMs are small, physically separate memories mapped into the address space;Selection is by an appropriate address decoder (simple!)

SPM

select

- 21 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Property of the PREDATOR consortium. Confidential.

WCETEST-aware scratch pad allocation

Setup

Bosch Democar:Runnable IgnitionSWCSync

WCETEST-aware scratch padallocation of program code byWCC, including fully automatedWCETEST analyses using aiT

WCETEST reduced to about 50%, compared to gcc.

© P. Marwedel, 2011

- 22 -technische universitätdortmund

fakultät fürinformatik

Scope, opportunities & challenges (Partial) solutions to challenges

• Timing predictability- WCET-aware compilation: unrolling, locking- register allocation, scratch pads

• Energy efficiency- scratch pads: (non)-overlaying, compile/run-time allocation

• Tradeoffs between multiple objectives- PAMONO bio sensor- automatic parallelization: energy efficiency vs. performance- approximate computations: timeliness vs. exactness

Summary p. marwedel, informatik 12, 2013

SFB876

Outline

- 23 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Why care about energy efficiency ?

Increasing performance

Energy a very scarce resource ReliabilityAvoiding high currents & metal migrationProblems with cooling, avoiding hot spots

Cost of energyGlobal warming

SensorCarFactoryE.g.

Unplug-ged

Unchargedperiods

PluggedExecution platformRelevant during use?

Power

© Graphics: P. Marwedel, 2011

- 24 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Where is the energy consumed? - Consumer portable systems -

According to International Technology Roadmap for Semiconductors(ITRS), 2010 update, [www.itrs.net]

Violation of 0.5-1 W constraint for small mobiles.

It not just the logic, don’t ignore memory!

© ITRS, 2010

- 25 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Energy consumption of memories

0

5

10

15

20

25

30

35

40

512 2K 8K 32

K12

8K51

2K 2M 8M 32M

128M

512M 2G

High Performance - 16banks SPM Access time(ns)High Performance - 16banks SPM Energy (nJ) -readHigh Performance - 16banks DDR Access time(ns)High Performance - 16banks DDR Energy (nJ) -read

Source: Olivera Jovanovic, TU Dortmund, 2011

SRAM

DRAM

CACTI: Scratchpad (SRAM) vs. DRAM (DDR2):

16 bit read; size in bytes;65 nm for SRAM, 80 nm for DRAM

- 26 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Migration of data & instructions,global optimization model (TU Dortmund)

Which memory object (array,loop, etc.) to be stored in SPM?Non-overlaying (“static”) allocation:Gain gk and size sk for each object k. Maximise gain G = gk, respecting size of SPM SSP sk.Solution: knapsack algorithm.Overlaying (“dynamic”) allocation:Moving objects back and forthprocessor

scratch pad memory,capacity SSP

main memory

?

for i .{ }

for j ..{ }

while ...

repeat …

call ...

array ...

int ...

array …

Example:

- 27 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Reduction in energy and average run-time for migrating functions and variables

Multi_sort (mix of sort algorithms)

Cyc

les

[x10

0]En

ergy

[µJ]

Measured processor / external memory energy + CACTI values for SPM (combined model)

Numbers will change with technology, algorithms remain unchanged.

- 28 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Non-overlaying allocation problematic formultiple hot spots Overlaying allocation

Effectively results in a kind of compiler-controlled overlays for SPM

Address assignment within SPM required

CPU

Memory

Memory

SPM

- 29 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Overlaying allocation by Verma et al. (1)

C

DEF A

USE A

USE A

MOD A USE C

USE C

B1

B2

B3

B4

B5

B6

B7

B8

Based on control flow graph.

[M.Verma, P.Marwedel: Dynamic Overlay of Scratchpad Memory for Energy Minimization, ISSS, 2004]

- 30 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Overlaying allocation by Verma et al. (2)

SPILL_STORE(A);SPILL_LOAD(C);

SPILL_STORE(A);SPILL_LOAD(C);

SPILL_LOAD(A);SPILL_LOAD(A);

C

DEF A

USE A

USE A

MOD A USE C

USE C

B1

B2

B3

B4

B5

B6

B7

B8

B9

B10

Global set of ILP equations reflects cost/benefit relations of potential copy points

Code handled like data

- 31 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Runtime/energy reduction with respect tonon-overlaying (“static”) allocation

- 32 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Dynamic set of multiple applications

MEMMEM

CPU

SPM

SPMManager

SPMManager

App. 2App. 2

App. 1App. 1

App. nApp. n

SP

M

App. 3

App. 2

App. 1

t

Address space:

SPM

??

Compile-time partitioning of SPM not feasibleIntroduction of SPM-managerRuntime decisions, but

compile-time supported

[R. Pyka, Ch. Faßbach, M. Verma, H. Falk, P. Marwedel: Operating system integrated energy aware scratchpad allocation strategies for multi-process applications, SCOPES, 2007]

- 33 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Comparison of SPMM to Caches for SORT

Baseline: Main memory onlySPMM peak energy reduction

by 83% at 4k Bytes scratchpad

SPM Size Δ 4-way1024 74,81%

2048 65,35%

4096 64,39%

8192 65,64%

16384 63,73%

- 34 -technische universitätdortmund

fakultät fürinformatik

Scope, opportunities & challenges (Partial) solutions to challenges

• Timing predictability- WCET-aware compilation: unrolling, locking- register allocation, scratch pads

• Energy efficiency- scratch pads: (non)-overlaying, compile/run-time allocation

• Tradeoffs between multiple objectives- PAMONO bio sensor- automatic parallelization: energy efficiency vs. performance- approximate computations: timeliness vs. exactness

Summary p. marwedel, informatik 12, 2013

SFB876

Outline

- 35 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Detection of nano objects using plasmon effect(Optical effect of objects ≪ wavelength of light)

Energy consumption of (bio-) virus detection

Laser

Image sequence

CCD

Prisma

Nano object

Sensor

55,0 55,2 55,4 55,6 55,8 56,0 56,2-0,04

-0,02

0,00

0,02

0,04

0,06

Sig

nal (I/I

)

incidence angle, grad

Reflectivity

- 36 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Timeliness in (bio-) virus detection (2)

Low-end Mobile GPU meets realtime constraint up to 1024x256 pixels Faster GPUs give room for larger usable sensor area

Frame Rate (fps) Image Size1024x128 1024x256 1024x512 1024x1024

CPU: i7-2600 23 20.1 16.1 11.5CPU: i7-620M (mobile) 16 8.5 4.6 2.4GPU: 3100M (mobile) 60 32.6 16 7.8GPU: GTX 480 60 60 60 26.8GPU: GTX 560Ti 60 60 60 40.3

Target: 30 frames per second at 1024x256 pixels

- 37 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Timeliness in (bio-) virus detection (3)

C. Timm, A. Gelenberg, P. Marwedel, F. Weichert: Energy Considerations within the Integration of General Purpose GPUs in Embedded Systems. Intern. Conf. on Advances in Distributed and Parallel Computing, 2010C. Timm, F. Weichert, P. Marwedel, H. Müller: Design Space Exploration Towards a Realtime and Energy-Aware GPGPU-based Analysis of Biosensor Data. Computer Science - Research and Development, ENA-HPC, 2011

current clamp

Energy per frame CPU 3.26 J 5.84 J 10.52 J Reduced toEnergy per frame GPU 0.93 J 1.56 J 2.76 J avg 27%

- 38 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

D. Cordes, P. Marwedel: Multi-Objective Aware Extraction of Task-Level Parallelism Using Genetic Algorithms, DATE 2012

Multi-objective based parallelizationof sequential programs

- 39 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Tradeoff between timeliness and exactness

Traditional error correction requires resources,including time

Conflicting with timing requirements of real-time systems

➁➀

©

- 40 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Tradeoff between timeliness and exactness (2)

In some cases, errors have no effect In some cases, timing requirements are dominating & error

correction may be compromised In other cases, error correction requirements are

dominating & timing requirements may be compromised

Imprecise outputNo effect System crash

unused

variable i

unused

variable j

Register 0

Register 1

Register 2

Register 3

Classification requires application knowledge

- 41 -technische universitätdortmund

fakultät fürinformatik

p. marwedel, informatik 12, 2013

SFB876

Analysis of H.264 video decoder(~ 3500 lines of C code)

Fraction of memory space

Resolution Memory size of reliable data

Memory size of unreliable data

176 x 144 90 kB (55%) 74 kB (45%)352 x 288 223 kB (43%) 297 kB (57%)

1280 x 720 1585 kB (37%) 2700 kB (63%)

Impact of uncorrected errors:

Injection into unreliable memory 240 faults / sec 80 faults / sec

Average PSNR of frames with errors 40.89 dB 50.57 dB

Deteriorated exactness, timeliness maintained!

- 42 -technische universitätdortmund

fakultät fürinformatik

Scope, opportunities & challenges (Partial) solutions to challenges

• Timing predictability- WCET-aware compilation: unrolling, locking- register allocation, scratch pads

• Energy efficiency- scratch pads: (non)-overlaying, compile/run-time allocation

• Tradeoffs between multiple objectives- PAMONO bio sensor- automatic parallelization: energy efficiency vs. performance- approximate computations: timeliness vs. exactness

Summary p. marwedel, informatik 12, 2013

SFB876

Summary