Post on 26-Jan-2016

Page 1: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Optimization techniques for developing embedded applications on single-chip multiprocessor platforms

Luca Benini, Lbenini@deis.unibo.it

DEIS Università di Bologna

Page 2: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Embedded Systems

Microprocessor market shares: general purpose systems 1%, embedded systems 99%

Page 3: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Example Area: Automotive Electronics

What is “automotive electronics”? Vehicle functions implemented with electronics:

• Body electronics
• System electronics: chassis, engine
• Information/entertainment

Page 4: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Automotive Electronics Market Size

Year: 1998 1999 2000 2001 2002 2003 2004 2005
Market ($billions): 8.9 10.5 13.1 14.1 15.8 17.4 19.3 21.0

[Chart: market size and cost of electronics per car ($, axis 0 to 1400), 1998 to 2005]

• 90% of future innovations in vehicles: based on electronic embedded systems
• 2006: 25% of the total cost of a car will be electronics

Page 5: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Automotive Electronics Platform Example

Source: Expanding automotive electronic systems, IEEE Computer, Jan. 2002

Page 6: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Digital Convergence – Mobile Example

[Figure: a mobile device at the convergence of broadcasting, telematics, imaging, computing, communication, and entertainment]

• One device, multiple functions
• Center of ubiquitous media network
• Smart mobile device: next driver for the semiconductor industry

Page 7: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

4th Gen and Next-Gen Networks

Includes: 802.20, WiMAX (802.16), HSDPA, TDD UMTS, UMTS and future versions of UMTS

Page 8: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

SoC: Enabler for Digital Convergence

[Figure: SoC evolution from today to the future. Labels: > 100X, performance, low power, complexity, storage; 4G/5G, DMB, WiBro, etc.]

Page 9: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Application pull

[Figure, source IMEC: applications by year of introduction (2005 to 2015) against required efficiency (5 GOPS/W, 100 GOPS/W, 1 TOPS/W). Labels: sign recognition, A/V streaming, adaptive route, collision avoidance, autonomous driving, 3D projected display, HMI by motion, gesture detection, ubiquitous navigation, Si Xray, Gbit radio, UWB, 802.11n, structured encoding, structured decoding, 3D TV, 3D gaming, H264 encoding, H264 decoding, image recognition, fully recognition (security), auto personalization, dictation, 3D ambient interaction, language, emotion recognition, gesture recognition, expression recognition, mobile base-band]

Page 10: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

MPSoC Platform Evolution

[Figure: tile-based platform. Labels: 45 nm, <4mm, <1GHz, 30Mtr, I/O peripherals, 3D stacked main memory, local memory hierarchy, network interface, router, bus-based multiprocessor, power/test management. Software stack: applications, software optimization, middleware, RTOS, API, run-time controller, mapping. Knobs: V, Vt, Fclk, IL]

• Today’s SoCs could fit in 1 tile!!
• Tile-based design

Page 11: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Multicores Are Here!

[Figure, source Amarasinghe06: number of cores per chip (1 to 512, logarithmic axis) against year (1970 to 2005 and beyond). Labels: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, Itanium 2, Athlon, Raw, Power4, Opteron, Power6, Niagara, Yonah, PExtreme, Tanglewood, Cell, Intel Tflops, Xbox360, Cavium Octeon, Raza XLR, PA-8800, Cisco CSR-1, Picochip PC102, Broadcom 1480, Opteron 4P, Xeon MP, Ambric AM2045]

Page 12: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

MPSoC – 2005 ITRS roadmap [Martin06]

[Chart: total logic size and total memory size (normalized to 2005, left axis) and number of processing engines (right axis), 2005 to 2020]

Number of processing engines by year:
2005: 16, 2006: 23, 2007: 32, 2008: 46, 2009: 63, 2010: 79, 2011: 101, 2012: 133, 2013: 161, 2014: 212, 2015: 268, 2016: 348, 2017: 424, 2018: 526, 2019: 669, 2020: 878

Page 13: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

SoC: Solution-on-a-Chip

Target system application, level by level (hardware : software):
• System / Service : Application S/W
• Mobile Terminal : Middleware
• Module : RTOS
• Chip : HAL
• Process : S/W IP

Requires design of hardware AND software: SoC = Chip + System e-SW

Page 14: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Design as optimization

• Design space: the set of “all” possible design choices
• Constraints: solutions that we are not willing to accept
• Cost function: a property we are interested in (execution time, power, reliability…)
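The three ingredients of this slide can be illustrated with a brute-force search over a small, enumerated design space. This is only a sketch: the `design_point` fields, the power budget, and the function name `pick_design` are hypothetical choices made for the illustration, not part of the talk.

```c
#include <assert.h>
#include <stddef.h>

/* One candidate design. The fields are illustrative assumptions. */
typedef struct {
    double exec_time_ms;  /* cost function: the property we minimize */
    double power_mw;      /* property bounded by a constraint        */
} design_point;

/* Scan the design space and return the index of the feasible point with
 * the lowest execution time, or -1 when every point violates the power
 * constraint. */
int pick_design(const design_point *space, size_t n, double max_power_mw)
{
    assert(space != NULL || n == 0);
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (space[i].power_mw > max_power_mw)
            continue;                 /* constraint: solution not acceptable */
        if (best < 0 || space[i].exec_time_ms < space[best].exec_time_ms)
            best = (int)i;            /* better cost among feasible points   */
    }
    return best;
}
```

Real design spaces are far too large to enumerate, which is why the rest of the talk turns to ILP, constraint programming, and decomposition.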

Page 15: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Hardware synthesis

[Figure: synthesis flow. ALGORITHM / APPLICATION (filter signal-flow graphs with coefficients c1 to c8, gain k, delay elements D, and adders between IN and OUT; frequency response plot, amplitude in dB against frequency) goes through HIGH-LEVEL SYNTHESIS (states S1 to S4) to an ARCHITECTURE (ASIC, GP signal processor, memory, interconnect, MCM) and then LOGIC AND PHYSICAL SYNTHESIS]

Page 16: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Behavioral synthesis

[Figure: a Control/Data Flow Graph (CDFG) and its implementation built from registers, a multiplier, and an adder]

Page 17: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Allocation, Assignment, and Scheduling

[Figure: data-flow graph of add (+), subtract (-), shift (>>), and delay (D) operations]

• Allocation: How much? 2 adders, 1 shifter, 24 registers
• Assignment: Where? Shifter 1
• Schedule: When? Time slot 4

Techniques well understood and mature

Page 18: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

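Resource constraints

[Figure: two schedules of the same graph of three additions (+1, +2, +3) and three multiplications (*1, *2, *3) over 4 control steps]

Schedule 1
Control step 1: +1
Control step 2: +2
Control step 3: +3, *1
Control step 4: *2, *3

Schedule 2
Control step 1: +3
Control step 2: +1, *2
Control step 3: +2, *3
Control step 4: *1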

Page 19: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Scheduling under resource constraints

• Intractable problem
• Algorithms:
  • Exact: integer linear program; Hu (restrictive assumptions)
  • Approximate: list scheduling; force-directed scheduling
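As an illustration of the approximate family, here is a minimal sketch of resource-constrained list scheduling for unit-delay operations. The array layout, the size limits, and the function name `list_schedule` are assumptions made for this example; the priority values are supplied by the caller (a common choice, assumed here, is the longest path to the sink).

```c
#include <assert.h>

#define MAX_OPS   16
#define MAX_TYPES 4

/* Resource-constrained list scheduling, all operations take one step.
 * n        : number of operations (at most MAX_OPS)
 * type[i]  : resource type of operation i (0 .. MAX_TYPES-1)
 * dep[i][j]: nonzero when operation j must complete before i starts
 * avail[k] : number of units of resource type k
 * prio[i]  : priority; among ready operations the largest goes first
 * step[i]  : output, 1-based control step assigned to operation i
 * Returns the latency in control steps, or -1 if no schedule is found. */
int list_schedule(int n, const int type[], int dep[][MAX_OPS],
                  const int avail[], const int prio[], int step[])
{
    assert(n > 0 && n <= MAX_OPS);
    for (int i = 0; i < n; i++) {
        assert(type[i] >= 0 && type[i] < MAX_TYPES);
        step[i] = 0;                       /* 0 means "not scheduled yet" */
    }

    int scheduled = 0, t = 0;
    while (scheduled < n) {
        if (++t > n)
            return -1;       /* cyclic graph or a resource with no units */
        int used[MAX_TYPES] = {0};
        for (;;) {
            /* pick the highest-priority ready operation that still fits */
            int best = -1;
            for (int i = 0; i < n; i++) {
                if (step[i] != 0 || used[type[i]] >= avail[type[i]])
                    continue;
                int ready = 1;
                for (int j = 0; j < n; j++)
                    if (dep[i][j] && (step[j] == 0 || step[j] >= t))
                        ready = 0;         /* predecessor not finished */
                if (ready && (best < 0 || prio[i] > prio[best]))
                    best = i;
            }
            if (best < 0)
                break;                     /* nothing else fits in step t */
            step[best] = t;
            used[type[best]]++;
            scheduled++;
        }
    }
    return t;
}
```

The heuristic is greedy: it never revisits a decision, so it runs in polynomial time but gives no optimality guarantee, in contrast to the exact ILP formulation.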

Page 20: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

ILP formulation

• Binary decision variables: $X = \{ x_{il},\ i = 1, 2, \dots, n;\ l = 1, 2, \dots, \lambda + 1 \}$
• $x_{il}$ is TRUE only when operation $v_i$ starts in step $l$ of the schedule (i.e. $l = t_i$)
• $\lambda$ is an upper bound on latency
• Start time of operation $v_i$: $t_i = \sum_l l \cdot x_{il}$

Page 21: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

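ILP formulation constraints

• Operations start only once: $\sum_l x_{il} = 1, \quad i = 1, 2, \dots, n$
• Sequencing relations must be satisfied: $t_i \ge t_j + d_j \quad \forall (v_j, v_i) \in E$, that is, $\sum_l l \cdot x_{il} - \sum_l l \cdot x_{jl} - d_j \ge 0 \quad \forall (v_j, v_i) \in E$
• Resource bounds must be satisfied. Simple case (unit delay): $\sum_{i : T(v_i) = k} x_{il} \le a_k, \quad k = 1, 2, \dots, n_{res};\ \forall l$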

Page 22: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

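ILP Formulation

$\min \sum_l l \cdot x_{nl}$ such that

$\sum_l x_{il} = 1, \quad i = 1, 2, \dots, n$

$\sum_l l \cdot x_{il} - \sum_l l \cdot x_{jl} - d_j \ge 0, \quad i, j = 1, 2, \dots, n,\ (v_j, v_i) \in E$

$\sum_{i : T(v_i) = k} \ \sum_{m = l - d_i + 1}^{l} x_{im} \le a_k, \quad k = 1, 2, \dots, n_{res};\ l = 0, 1, \dots, \lambda$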
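To make the formulation concrete, the sketch below checks a candidate 0/1 matrix $x_{il}$ against the three constraint families and evaluates the objective. It is a checker, not a solver, and the instance size, the `edge` type, and the function names are assumptions for illustration; the last operation is assumed to play the role of the sink $v_n$.

```c
#include <assert.h>

#define N_OPS   5   /* hypothetical instance: operations 0 .. 4   */
#define N_STEPS 6   /* control steps l = 1 .. N_STEPS             */
#define N_RES   2   /* resource types k = 0 .. N_RES-1            */

typedef struct { int from, to; } edge;   /* (v_from, v_to) in E */

/* Start time t_i = sum_l l * x_il. Column 0 of x is unused. */
static int start_time(int x[N_OPS][N_STEPS + 1], int i)
{
    int t = 0;
    for (int l = 1; l <= N_STEPS; l++)
        t += l * x[i][l];
    return t;
}

/* Objective: start time of the sink, assumed to be the last operation. */
int ilp_objective(int x[N_OPS][N_STEPS + 1])
{
    return start_time(x, N_OPS - 1);
}

/* Returns 1 when x satisfies all three constraint families, 0 otherwise. */
int ilp_feasible(int x[N_OPS][N_STEPS + 1], const int d[N_OPS],
                 const int type[N_OPS], const int a[N_RES],
                 const edge *E, int n_edges)
{
    assert(n_edges >= 0);

    /* 1. every operation starts exactly once */
    for (int i = 0; i < N_OPS; i++) {
        int starts = 0;
        for (int l = 1; l <= N_STEPS; l++)
            starts += x[i][l];
        if (starts != 1)
            return 0;
    }

    /* 2. sequencing: t_i - t_j - d_j >= 0 for every (v_j, v_i) in E */
    for (int e = 0; e < n_edges; e++) {
        int j = E[e].from, i = E[e].to;
        if (start_time(x, i) - start_time(x, j) - d[j] < 0)
            return 0;
    }

    /* 3. resource bounds: operations of type k executing in step l,
     *    i.e. started in steps l-d_i+1 .. l, must not exceed a_k */
    for (int k = 0; k < N_RES; k++)
        for (int l = 1; l <= N_STEPS; l++) {
            int busy = 0;
            for (int i = 0; i < N_OPS; i++) {
                if (type[i] != k)
                    continue;
                for (int m = l - d[i] + 1; m <= l; m++)
                    if (m >= 1)
                        busy += x[i][m];
            }
            if (busy > a[k])
                return 0;
        }
    return 1;
}
```

An ILP solver searches over all such matrices for the feasible one with the smallest objective; the checker only shows what each inequality means for a given schedule.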

Page 23: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Example

• Resource constraints: 2 ALUs; 2 multipliers: $a_1 = 2$; $a_2 = 2$
• Single-cycle operation: $d_i = 1 \quad \forall i$

[Figure: sequencing graph with a source NOP (vertex 0), operations 1 to 11 (*, +, <, -), and a sink NOP (vertex n)]

Page 24: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Example

• Operations start only once: $x_{11} = 1$; $x_{61} + x_{62} = 1$; …
• Sequencing relations must be satisfied: $x_{61} + 2x_{62} - 2x_{72} - 3x_{73} + 1 \le 0$; $2x_{92} + 3x_{93} + 4x_{94} - 5x_{N5} + 1 \le 0$; …
• Resource bounds must be satisfied: $x_{11} + x_{21} + x_{61} + x_{81} \le 2$; $x_{32} + x_{62} + x_{72} + x_{82} \le 2$; …

Page 25: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Example

[Figure: the sequencing graph (source NOP 0, operations 1 to 11, sink NOP n) scheduled over four control steps, TIME 1 to TIME 4]

Page 26: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Resource-Efficient Application Mapping for MPSoCs

Multimedia applications. Given a platform:
1. Achieve a specified throughput
2. Minimize usage of shared resources

Page 27: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Optimization Development

• The abstraction gap between high-level optimization tools and standard application programming models can introduce unpredictable and undesired behaviours.
• Programmers must be aware of the simplifying assumptions made by optimization tools.
• New methodology for multi-task application development on MPSoCs.

[Figure: application design flow. Labels: platform modelling, optimization analysis, optimal solution, starting implementation, platform execution, abstraction gap, final implementation]

Page 28: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Resource assignment and scheduling

[Figure: the system. Nodes 1 to N, each with a processor, a tightly-coupled memory, and a bus interface, share a system bus and an on-chip memory. Tasks A to N are annotated with their WCETs (Ta, Tb, …, Tn)]

• Tightly-coupled memory: limited size
• Shared system bus: max bus bandwidth
• Time wheel period T: max time
• On-chip memory: assumed to be infinite

Page 29: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

The application

Signal processing pipeline (T0, T1, T2, T3, …, T7) with a throughput constraint

• Queues for inter-processor communication in TCM for efficiency reasons
• Program data in TCM (if space) or on-chip memory
• Internal state in TCM (if space) or on-chip memory

Each task is characterized by:
• WCET
• Memory requirements

Page 30: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Communication-aware Allocation and Scheduling for Stream-Oriented MPSoCs

[Figure: how should a signal processing pipeline (T0, T1, T2, …, T7) be mapped onto a message-oriented MPSoC architecture of ARM7 cores, each with a local scratchpad memory and a private memory, connected by a bus?]

• Simplifying assumptions vs predictability
• Efficient solutions in reasonable time
• Pure ILP formulations suitable for small task sets
• Widespread use of heuristics

Page 31: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Master Problem model

Assignment of tasks and memory slots (master problem)

• $T_{ij} = 1$ if task $i$ executes on processor $j$, 0 otherwise
• $Y_{ij} = 1$ if task $i$ allocates program data on processor $j$ memory, 0 otherwise
• $Z_{ij} = 1$ if task $i$ allocates the internal state on processor $j$ memory, 0 otherwise
• $X_{ij} = 1$ if task $i$ executes on processor $j$ and task $i+1$ does not, 0 otherwise

Each process should be allocated to one processor: $\sum_j T_{ij} = 1$ for all $i$

Link between variables X and T: $X_{ij} = |T_{ij} - T_{i+1,j}|$ for all $i$ and $j$ (can be linearized)

If a task is NOT allocated to a processor, neither are its required memories: $T_{ij} = 0 \Rightarrow Y_{ij} = 0$ and $Z_{ij} = 0$

Objective function: $\min \sum_i \sum_j \left[ mem_i (T_{ij} - Y_{ij}) + state_i (T_{ij} - Z_{ij}) + data_i X_{ij} / 2 \right]$

Page 32: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Improvement of the model

With the proposed model, the allocation problem solver tends to pack all tasks on a single processor and all memory required on the local memory so as to have a ZERO communication cost: TRIVIAL SOLUTION

To improve the model we should add a relaxation of the subproblem to the master problem model:

For each set S of consecutive tasks whose sum of durations exceeds the real-time requirement, we impose that their processors should not be the same:

$\sum_{i \in S} WCET_i > RT \ \Rightarrow \ \sum_{i \in S} T_{ij} \le |S| - 1 \quad \forall j$

Page 33: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Sub-Problem model

Task scheduling with static resource assignment (subproblem)

Page 34: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Sub-Problem model

Task scheduling with static resource assignment (subproblem)

We have to schedule tasks, so we have to decide when they start.

• Activity starting time: $Start_i :: [0..Deadline_i]$
• Precedence constraints: $Start_i + Dur_i \le Start_j$
• Real time constraints: for all activities $i$ running on the same processor, $(Start_i + Dur_i) \le RT$
• Cumulative constraints on resources:
  • processors are unary resources: cumulative([Start], [Dur], [1], 1)
  • memories are additive resources: cumulative([Start], [Dur], [MR], C)

What about the bus??

Page 35: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Bus model

[Figure: bus bandwidth (bit/sec) against time, bounded by the max bus bandwidth. The state reads and state writes of task i and task j occupy the bus one at a time around the execution times of task i and task j]

• Unary resource: granularity clock cycle
• Arbitration mechanism that decides the bus allocation

Page 36: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Bus model

[Figure: bus bandwidth (bit/sec) against time, bounded by the max bus bandwidth. The state reads and state writes of task i and task j overlap and their bandwidth demands add up. Annotations: size of program data, task execution time, Task0 accesses input data with BW = MaxBW/NoProc]

• Additive bus model
• The model does not hold under heavy bus congestion (more than 65% of total bandwidth)
• Bus traffic has to be minimized
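The additive view of the bus can be sketched as a cumulative-resource check: at every instant the bandwidth demands of the active transfers are summed and compared against a limit, for instance 65% of the maximum bandwidth as quoted on the slide. The `bus_use` structure, the integer time base, and the function name are assumptions for this illustration.

```c
#include <assert.h>

/* One bus activity, active on the half-open interval [start, end). */
typedef struct {
    int    start, end;
    double bw;          /* average bandwidth demand while active */
} bus_use;

/* Additive resource check: the summed demand must stay within `limit`
 * at every instant. The total only rises when some activity starts, so
 * testing the start instants is sufficient.
 * Returns 1 when the profile fits, 0 otherwise. */
int bus_profile_fits(const bus_use *u, int n, double limit)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            if (u[j].start <= u[i].start && u[i].start < u[j].end)
                sum += u[j].bw;   /* activity j is on the bus at u[i].start */
        if (sum > limit)
            return 0;
    }
    return 1;
}
```

Compared with the unary model, which serializes transfers at clock-cycle granularity, this check keeps the scheduling problem coarse-grained; the price is that it is only trustworthy below the congestion threshold.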

Page 37: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

No-good generation

[Figure: the master problem (assignment of tasks and memory slots, IP solver) passes its solution to the sub-problem (task scheduling with static resource assignment, CP solver), which returns either a solution or a no-good]

If no feasible schedule exists for the allocation provided by the master, a no-good is generated. We use the simple BUT EFFECTIVE one: identify the CONFLICTING RESOURCES CR. For each $R \in CR$, let $ST_R$ be the set of tasks allocated on $R$:

$\sum_{i \in ST_R} T_{iR} \le |ST_R| - 1$

Other cuts are also possible [Hooker, Constraints 2005], but these are enough for our case and easy to extract.
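The interaction between master problem, sub-problem, and no-goods can be sketched on a toy instance. Everything here is a simplification assumed for illustration: the master is solved by enumeration instead of an IP solver, its cost is the number of consecutive task pairs split across two processors, and the sub-problem only checks that the summed WCET on each processor fits the real-time bound instead of running a CP scheduler. Only the shape of the cut follows the slide.

```c
#include <assert.h>

#define N_TASKS  4
#define N_PROCS  2      /* bit i of an assignment = processor of task i */
#define MAX_CUTS 64

/* No-good: sum over i in `tasks` of T_{i,proc} <= |tasks| - 1 */
typedef struct { int proc; unsigned tasks; } nogood;

static int proc_of(unsigned assign, int task)
{
    return (int)((assign >> task) & 1u);
}

/* Toy master objective: consecutive tasks on different processors. */
static int comm_cost(unsigned assign)
{
    int cost = 0;
    for (int i = 0; i + 1 < N_TASKS; i++)
        cost += proc_of(assign, i) != proc_of(assign, i + 1);
    return cost;
}

/* A cut is violated when every task of its set sits on its processor. */
static int violates(unsigned assign, const nogood *c)
{
    for (int i = 0; i < N_TASKS; i++)
        if (((c->tasks >> i) & 1u) && proc_of(assign, i) != c->proc)
            return 0;
    return 1;
}

/* Master by enumeration: cheapest assignment satisfying all cuts, or -1. */
static int solve_master(const nogood *cuts, int n_cuts)
{
    int best = -1;
    for (unsigned a = 0; a < (1u << N_TASKS); a++) {
        int ok = 1;
        for (int c = 0; c < n_cuts && ok; c++)
            ok = !violates(a, &cuts[c]);
        if (ok && (best < 0 || comm_cost(a) < comm_cost((unsigned)best)))
            best = (int)a;
    }
    return best;
}

/* Toy sub-problem: feasible when each processor's total WCET fits in rt.
 * On failure, reports the overloaded processor and its task set. */
static int solve_sub(unsigned assign, const int wcet[N_TASKS], int rt,
                     nogood *out)
{
    for (int p = 0; p < N_PROCS; p++) {
        int load = 0;
        unsigned set = 0;
        for (int i = 0; i < N_TASKS; i++)
            if (proc_of(assign, i) == p) {
                load += wcet[i];
                set |= 1u << i;
            }
        if (load > rt) {
            out->proc = p;
            out->tasks = set;
            return 0;
        }
    }
    return 1;
}

/* Iterate master and sub-problem until the allocation is schedulable.
 * Returns the assignment bitmask, or -1 when the cuts exclude everything. */
int benders(const int wcet[N_TASKS], int rt, int *iterations)
{
    nogood cuts[MAX_CUTS];
    int n_cuts = 0;
    *iterations = 0;
    for (;;) {
        int a = solve_master(cuts, n_cuts);
        (*iterations)++;
        if (a < 0)
            return -1;
        nogood ng;
        if (solve_sub((unsigned)a, wcet, rt, &ng))
            return a;
        if (n_cuts == MAX_CUTS)
            return -1;
        cuts[n_cuts++] = ng;      /* forbid this task set on this processor */
    }
}
```

Each cut removes at least the allocation that produced it, so the loop terminates; because the master always returns the cheapest allocation that survives the cuts, the first schedulable one is optimal for the master objective.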

Page 38: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Computational efficiency

• CP and IP formulations simplified
• Hybrid approach clearly outperforms pure CP and IP techniques
• Search time bounded to 15 minutes
• CP and IP could find a solution in at most 50% of the instances
• Hybrid approach always found a solution

Page 39: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Validation of bus model

• Requesting more than 65% of the theoretical maximum bandwidth causes the additive model to fail.
• Lower threshold in the presence of communication hotspots (50%)
• Benefits of the additive model:
  • Task execution time almost independent of bus utilization
  • Performance predictability greatly enhanced

Page 40: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Validation of optimizer solutions

• MAX error lower than 10%
• AVG error equal to 4.7%, with standard deviation of 0.08
• Optimizer turns out to be conservative in predicting infeasibility
• The flow was successfully applied to a GSM benchmark

Page 41: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Energy-Efficient Application Mapping for MPSoCs

Multimedia applications. Given a platform:
1. Achieve a specified throughput
2. Minimize power consumption

Page 42: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Application Mapping

• The problem of allocation, scheduling, and frequency selection for task graphs on multiprocessors in a distributed real-time system is NP-hard.
• New tool flows for efficient mapping of multi-task applications onto hardware platforms

[Figure: a task graph (T1 to T8) is allocated onto processors Proc. 1 to Proc. N, each with a private memory and connected by an interconnect; scheduling and frequency selection then place the tasks on a resources-against-time chart bounded by the deadline]

Page 43: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Exploiting Voltage Supply

• Supply voltage impacts power and performance
  • Circuit slowdown: $T = 1/f = K / (V_{dd} - V_t)^a$
  • Cubic power savings: $P = C_{eff} \cdot V_{dd}^2 \cdot f$
• Just-in-time computation: stretch execution time up to the max tolerable

[Figure: power against available time, comparing fixed voltage + shutdown with variable voltage]
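The just-in-time idea can be sketched with the power formula from this slide. Running a task of a given number of cycles takes cycles/f seconds, so its dynamic energy is P·T = C_eff·V_dd²·cycles: the frequency cancels out and only the supply voltage matters, which is why the slowest operating point that still meets the deadline is the cheapest. The `op_point` type, the function names, and any numbers used with them are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One discrete (Vdd, f) operating point. Values are supplied by the caller. */
typedef struct {
    double vdd;    /* supply voltage in volts   */
    double f_hz;   /* clock frequency in hertz  */
} op_point;

/* Dynamic energy of `cycles` clock cycles:
 * P * T = (Ceff * Vdd^2 * f) * (cycles / f) = Ceff * Vdd^2 * cycles. */
double dyn_energy(double c_eff, double vdd, double cycles)
{
    return c_eff * vdd * vdd * cycles;
}

/* Just-in-time selection: among the points that finish `cycles` within
 * `deadline_s`, return the index of the one with the lowest supply
 * voltage (hence lowest dynamic energy), or -1 if none meets it. */
int pick_point(const op_point *pts, size_t n, double cycles, double deadline_s)
{
    assert(pts != NULL || n == 0);
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (cycles / pts[i].f_hz > deadline_s)
            continue;                         /* too slow for the deadline */
        if (best < 0 || pts[i].vdd < pts[best].vdd)
            best = (int)i;
    }
    return best;
}
```

This sketch ignores leakage, frequency-switching overheads, and the shutdown alternative shown in the figure, all of which a real optimizer would have to account for.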

Page 44: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Scheduling & Voltage Scaling

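[Figure: power against time for tasks 1, 2, 3 on a CPU with adjustable Vdd and Vbs. With mapping and scheduling given, running at the fastest frequency leaves slack before the deadline; running the tasks at different frequencies (f1, f2, f3) uses the time up to the deadline]

• Energy/speed trade-offs: varying the voltages (Vdd, Vbs)
• Different voltages: different frequencies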

Page 45: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Target architecture - 2

• Homogeneous computation tiles:
  • ARM cores (including instruction and data caches);
  • Tightly coupled software-controlled scratch-pad memories (SPM);
  • AMBA AHB;
  • DMA engine;
  • RTEMS OS;
  • Technology homogeneous (0.13um) industrial power models (ST)
• Variable Voltage/Frequency cores with discrete (Vdd, f) pairs
  • Frequency dividers scale down the baseline 200 MHz system clock
• Cores use non-cacheable shared memory to communicate;
• Semaphore and interrupt facilities are used for synchronization;
• Private on-chip memory to store data.

[Figure: tiles with synchronization modules and private memories, a shared memory, and an interrupt slave on an AMBA AHB interconnect; a clock tree generator with a programmable register derives CLOCK 1 to CLOCK N from the system clock]

Page 46: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Application model

A task graph represents:
• A group of tasks T
• Task dependencies
• Execution times expressed in clock cycles: WCN(Ti)
• Communication times (writes and reads) expressed as WCN(WTiTj) and WCN(RTiTj)
• These values can be back-annotated from functional simulation

[Figure: task graph with six tasks. Nodes Task1 to Task6 are labelled WCN(T1) to WCN(T6); the arcs T1→T2, T1→T3, T2→T4, T3→T5, T4→T6, and T5→T6 are labelled with the corresponding WCN(WTiTj) and WCN(RTiTj)]

Page 47: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Efficient Application Development Support

• Optimization tools generally rely on many simplifying assumptions
• Neglecting these assumptions in the software implementation can:
  • generate unpredictable and undesired system-level interactions;
  • make the overall system error-prone.
• We propose a complete framework to help programmers with the software implementation:
  • a generic customizable application template (OFFLINE SUPPORT);
  • a set of high-level APIs (ONLINE SUPPORT).
• The main goals of our development framework are:
  • exact and reliable execution of the application after the optimization step;
  • guarantees of high performance and constraint satisfaction.

Page 48: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Customizable Application Template

• Starting from a high-level task and data flow graph, software developers can easily and quickly build their application infrastructure.
• Programmers can intuitively translate the high-level representation into C code using our facilities and library.
• Users can specify:
  • the number of tasks included in the target application;
  • their nature (e.g. branch, fork, or-node, and-node);
  • their precedence constraints (e.g. due to data communication);
  thus quickly drawing the CTG schema.
• Programmers can focus on the functionality of the tasks: the main effort goes into the more specific and critical sections of the application.

Page 49: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

OS-level and Task-level APIs

• Users can easily reproduce optimizer solutions, thus:
  • Indirectly neglecting the optimizer’s abstractions: task model; communication model; OS overheads.
  • Obtaining the needed application constraint satisfaction.
• Programmers can allocate to the right hardware resources: tasks; program data; queues.
• Scheduling support APIs: frequency and voltage selection
• Communication issues: shared queues; semaphores; interrupts.

Page 50: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Example

• Number of nodes: 12
• Graph of activities
• Node type: Normal, Branch, Conditional, Terminator
• Node behaviour: Or, And, Fork, Branch
• Number of CPUs: 2
• Task allocation
• Task scheduling
• Arc priorities
• Freq. & voltage

#define TASK_NUMBER 12
#define N_CPU 2

//Node Type: 0 NORMAL; 1 BRANCH ; 2 STOCHASTIC
uint node_type[TASK_NUMBER] = {1,2,2,1,..};

//Node Behaviour: 0 AND ; 1 OR; 2 FORK; 3 BRANCH
uint node_behaviour[TASK_NUMBER] = {2,3,3,..};

uint queue_consumer [..] [..] = {
    {0,1,1,0,..},
    {0,0,0,1,1,.},
    {0,0,0,0,0,1,1..},
    {0,0,0,0,..}..};

uint task_on_core[TASK_NUMBER] = {1,1,2,1};
int schedule_on_core[N_CPU][TASK_NUMBER] = {{1,2,4,8}..};

[Figure: conditional task graph with 12 nodes (N1, B2, B3, C4 to C7, N8 to N11, T12), arcs a1 to a14, and fork, or, and, and branch nodes, together with its allocation on processors P1 and P2 on a resources-against-time chart bounded by the deadline]
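The template arrays are elided on the slide, so the following is a complete but hypothetical four-task instance showing how an allocation table such as `task_on_core` could be consumed at start-up. The helper `tasks_for_core` and the reduced `TASK_NUMBER` are assumptions for illustration, not part of the authors' library.

```c
#include <assert.h>

#define TASK_NUMBER 4   /* hypothetical: the slide's example has 12 tasks */
#define N_CPU 2

/* Core (1-based) each task is allocated to, as in the slide's
 * task_on_core array; the values are illustrative. */
static const unsigned int task_on_core[TASK_NUMBER] = {1, 1, 2, 1};

/* Collect the 1-based ids of the tasks allocated to `core`, in task
 * order, into `out`. Returns how many tasks were found. */
int tasks_for_core(unsigned int core, unsigned int out[TASK_NUMBER])
{
    assert(core >= 1 && core <= N_CPU);
    int count = 0;
    for (unsigned int t = 0; t < TASK_NUMBER; t++)
        if (task_on_core[t] == core)
            out[count++] = t + 1;    /* tasks are numbered from 1 */
    return count;
}
```

A per-core start-up routine could call such a helper to decide which task bodies to create locally, keeping the optimizer's allocation in one table instead of scattering it through the code.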

Page 51: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Queue ordering optimization

Communication ordering affects system performance

[Figure: tasks T1 to T6 on CPU1 and CPU2 exchange data through queues C1 to C5; depending on the order of the communications, a task has to wait or can run]

Page 52: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Queue ordering optimization

Communication ordering affects system performance

[Figure: tasks T1 to T6 on CPU1 and CPU2 exchange data through queues C1 to C5; depending on the order of the communications, a task has to wait or can run]

Page 53: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Synchronization among tasks

[Figure: tasks T1 to T4 communicating through queues C1 to C3 on Proc. 1 and Proc. 2; T4 is suspended and later re-activated]

Non blocked semaphores

Page 54: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Logic Based Benders Decomposition

Decomposes the problem into 2 sub-problems:
• Allocation & assignment of frequency settings → IP
  • Objective function: minimizing energy consumption during execution and communication of tasks
• Scheduling → CP
  • Objective function: minimizing energy consumption during frequency switching

[Figure: the INTEGER PROGRAMMING stage (allocation and frequency assignment, memory constraints; objective function: communication cost and energy consumption) passes a valid allocation to the CONSTRAINT PROGRAMMING stage (scheduling, real time constraint), which returns a no-good as a linear constraint]

Page 55: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Solver Performance

• Hundreds of decision variables
• Much beyond ILP solver or CP solver capability

Page 56: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Allocation problem model

• $X_{tfp} = 1$ if task $t$ executes on processor $p$ at frequency $f$
• $W_{ijfp} = 1$ if tasks $i$ and $j$ run on different cores and task $i$ on core $p$ writes data to $j$ at frequency $f$
• $R_{ijfp} = 1$ if tasks $i$ and $j$ run on different cores and task $j$ on core $p$ reads data from $i$ at frequency $f$

Each task can execute only on one processor at one frequency:

$\sum_{p=1}^{P} \sum_{f=1}^{M} X_{tfp} = 1 \quad \forall t$

Communication between tasks can execute only once per execution, and one write corresponds to one read:

$\sum_{p=1}^{P} \sum_{f=1}^{M} W_{ijfp} \le 1 \quad \forall i, j \in T$

$\sum_{p=1}^{P} \sum_{f=1}^{M} R_{ijfp} \le 1 \quad \forall i, j \in T$

$\sum_{p=1}^{P} \sum_{f=1}^{M} W_{ijfp} - \sum_{p=1}^{P} \sum_{f=1}^{M} R_{ijfp} = 0 \quad \forall i, j \in T$

The objective function: minimize energy consumption associated with task execution and communication:

$OF = En_{Comp} + En_{Read} + En_{Write}$

Page 57: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Scheduling problem model

• Five phases behaviour:
  • INPUT = input data reading;
  • EXEC = computation activity;
  • OUTPUT = output data writing.
• Atomic activities

[Figure: a task as the sequence INPUT, EXEC, OUTPUT, with several input activities (reading phase) at a join and several output activities (writing phase) at a fork]

The objective function: minimize energy consumption associated with frequency switching

• Processors are modelled as unary resources
• Bus is modelled as additive resource

The duration of task $i$ is now fixed since its mode is fixed. Precedence between task $i$ and task $j$:

• Tasks running on the same processor at the same frequency: $Start_i + duration_i \le Start_j$
• Tasks running on the same processor at different frequencies: $Start_i + duration_i + T_i \le Start_j$
• Tasks running on different processors: $Start_i + duration_i + dWrite_{ij} + dRead_{ij} \le Start_j$
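The three precedence cases can be sketched as a single helper that returns the earliest admissible start of a successor. The `sched_task` structure and the function name are hypothetical, and treating the extra term of the second case as a frequency-switching delay (`t_switch`) is an interpretation assumed here, not something the slide states.

```c
#include <assert.h>

/* Scheduling view of a task after allocation and frequency selection. */
typedef struct {
    int proc;      /* processor the task is allocated to        */
    int freq;      /* index of the selected frequency setting   */
    int start;     /* start time                                */
    int duration;  /* fixed once the frequency is fixed         */
} sched_task;

/* Earliest start allowed for successor j after predecessor i.
 * t_switch : extra delay when the core changes frequency (assumed meaning)
 * d_write  : duration of i's write of the data sent to j
 * d_read   : duration of j's read of that data */
int earliest_start(const sched_task *i, const sched_task *j,
                   int t_switch, int d_write, int d_read)
{
    assert(i != 0 && j != 0);
    int done = i->start + i->duration;
    if (i->proc != j->proc)
        return done + d_write + d_read;  /* data crosses the bus        */
    if (i->freq != j->freq)
        return done + t_switch;          /* same core, new frequency    */
    return done;                         /* same core, same frequency   */
}
```

In the CP sub-problem these bounds become precedence constraints between start-time variables rather than values computed in sequence.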

Page 58: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Application Development Methodology

[Figure: development flow. Labels: CTG, characterization phase, simulator, application profiles, optimization phase, optimizer, allocation, scheduling, application development support, optimal SW application implementation, platform execution]

Page 59: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Validation of optimizer solutions: Throughput

Optimal allocation and schedule from the optimizer, validated on the virtual platform (250 instances):
• MAX error lower than 10%;
• AVG error equal to 4.51%, with standard deviation of 1.94;
• All the deadline constraints are satisfied.

[Histogram: probability (%) against throughput difference (%), from -5% to 11%]

Page 60: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Validation of optimizer solutions: Power

Optimal allocation and schedule from the optimizer, validated on the virtual platform (250 instances):
• MAX error lower than 10%;
• AVG error equal to 4.80%, with standard deviation of 1.71.

[Histogram: probability (%) against energy consumption difference (%), from -5% to 11%]

Page 61: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

GSM Encoder

• Throughput required: 1 frame/10ms.
• Task graph: 10 computational tasks; 15 communication tasks.
• With 2 processors and 4 possible frequency and voltage settings:
  • Without optimizations: 50.9 μJ
  • With optimizations: 17.1 μJ (-66.4%)

Page 62: Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Summary & future work

• Energy-optimal task mapping
  • Strong optimization engine (complete)
  • Programmer support (design & exec time)
  • Validation: accuracy & optimality
• Future work
  • Conditional task graphs
  • Dealing with multiple use cases
  • Variable execution times
  • Aggressive communication scheduling