luca benini lbenini@deis.unibo.it deis università di bologna

Post on 26-Jan-2016

40 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

DESCRIPTION

Tecniche di ottimizzazione per lo sviluppo di applicazioni embedded su piattatforme multiprocessore su singolo chip. Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna. Embedded Systems. General purpose systems. Embedded systems. Microprocessor market shares. - PowerPoint PPT Presentation

TRANSCRIPT

Tecniche di ottimizzazione per lo sviluppo di applicazioni embedded su piattatforme multiprocessore su singolo chip

Luca BeniniLbenini@deis.unibo.it

DEIS Università di Bologna

1%

99%

Embedded Systems

General purpose systems Embedded systems

Microprocessor market shares

Example Area: Automotive Electronics

What is “automotive electronics”?

Vehicle functions implemented with electronics

Body electronics System electronics: chassis,

engine Information/entertainment

Automotive Electronics Market Size

8.9Market

($billions)

10.5

13.1 14.1 15.8 17.4 19.3 21.0

0

200

400

600

800

1000

1200

1400

1998 1999 2000 2001 2002 2003 2004 2005

Cost of Electronics / Car ($)

90% of future innovations in vehicles:based on electronic embedded systems

2006: 25% of the total cost

of a car will be electronics

Automotive Electronics Platform Example

Source: Expanding automotive electronic systems, IEEE Computer, Jan. 2002

Digital Convergence – Mobile Example

Broadcasting

TelematicsImaging

Computing

CommunicationEntertainment

One device, multiple functions Center of ubiquitous media network Smart mobile device: next drive for semicon.

Industry

4th Gen and Next-Gen Networks

Includes: 802.20, WiMAX (802.16), HSDPA, TDD UMTS, UMTS and future versions of UMTS

SoC: Enabler for Digital Convergence

Today

Future

> 100XPerformanceLow PowerComplexity

Storage

4G/5G, DMB,

WiBro, etc.

SoCSoCSoCSoC

Application pull

Year of Introduction2005 2007 2009 2011 2013 2015

5 GOPS/W

100GOPS/W

Signrecognition

A/Vstreaming

Adaptiveroute

Collisionavoidance

Autonomousdriving

3D projecteddisplay

HMI by motionGesture detection

Ubiquitousnavigation

Si Xray

Gbit radio

UWB

802.11n

Structured encoding

Structured decoding

3D TV3D gaming

H264encoding

H264decoding

Imagerecognition

Fully recognition(security)

Autopersonalization

dictation

3D ambientinteraction

LanguageEmotionrecognition

Gesturerecognition

Expressionrecognition

MobileBase-band

1TOPS/W

[IMEC]

MPSoC Platform Evolution

45 nm

<4mm

<1GHz

I/OPERIPHERALS

3D stacked m

ain mem

ory

2

30MtrLocalMemory

hierarchy

NetInt

PowerTest

Mgmt

routerBus basedMulti Proc

Applications Software opt. Middleware, RTOS, API,Run-Time Controller

MappingV,Vt,Fclk,IL

Today’s SoCs could fit in 1 tile!! Tile-based design

Multicores Are Here!

1985 199019801970 1975 1995 2000

4004

8008

80868080 286 386 486 Pentium P2 P3P4Itanium

Itanium 2

2005 20??

# of

cor

es

1

2

4

8

16

32

64

128

256

512

Athlon

Raw

Power4Opteron

Power6

Niagara

YonahPExtreme

Tanglewood

Cell

IntelTflops

Xbox360

CaviumOcteon

RazaXLR

PA-8800

CiscoCSR-1

PicochipPC102

Boardcom 1480 Opteron 4P

Xeon MP

AmbricAM2045

[Amarasinghe06]

MPSoC – 2005 ITRS roadmap

2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020

0

200

400

600

800

1000

60

50

40

30

20

10

0

1200

Nu

mb

er

of

Pro

ce

ss

ing

En

gin

es

Lo

gic

, M

em

ory

Siz

e (

No

rma

lize

d t

o 2

00

5)

Number of Processing Engines(Right Axis)

Total Logic Size(Normalized to 2005, Left Axis)

Total Memory Size(Normalized to 2005, Left Axis)

16 23 32 4663 79

101133 161

212

268

348

424

526

669

878

[Martin06]

System / ServiceApplication S/W

Mobile TerminalMiddleware

ModuleRTOS

ChipHAL

ProcessS/W IP

Target System ApplicationTarget System Application

Requires design of Hardware AND software

SoC Solution-on-a-Chip

+

SOCSOC

System e-SW

Chip

Design as optimization Design space

The set of “all” possible design choices

ConstraintsSolutions that we are not willing to

accept Cost function

A property we are interested in (execution time, power, reliability…)

Hardware synthesisALGORITHM

HIGH-LEVEL SYNTHESIS

S1 S3 S4S2

0.0 200.0 4 00.0 600. 0Freq

-120 .0

-100 .0

-80 .0

-60 .0

-40 .0

-20 .0

Am

pl

(db

)

++

++

D

D

++

++

D

D

c1 c2

c3

c4 c5

c6

kIN

+

+

D

D

++

+

D

D

+

++c1

c2 c3

c4

c5

c6 c7

c8

k

d

IN OUT

APPLICATION

interconnect

ASICGP signal

MCM

processor

memory

ARCHITECTURE

LOGIC AND PHYSICAL SYNTHESIS

Behavioral synthesisControl/DataFlow Graph

(CDFG)Implementation

RegReg

Multiplier

Adder

RegReg2 1 1 ...2 3 2 ...

4 3 2 ...

0 4 7 ...

4 7 9 ...

Allocation, Assignment, and Scheduling

D

+

-

>>

>>

+

-

>>

+ >>

+

>>

+

Allocation: How Much?2 adders

Assignment: Where?

Schedule: When?

Shifter 1

Time Slot 4

1 shifter24 registers

D

Techniques Well Understood and Mature

Resource constraints

+

*3*2

3

+

*1

2

+1 1

2

3

3

4 4

+

*3*2

3

+2

+1 2

3

4

1

2 3

4 control steps

+ * * + *

*1

Schedule 1 Schedule 2

1 +1

2 +2

3 +3 *1

4 *2 *3

Control Step

1 +3

2 +1 *2

3 +2 *3

4 *1

Control Step

Scheduling under resource constraints Intractable problem Algorithms:

Exact: Integer linear program Hu (restrictive assumptions)

Approximate : List scheduling Force-directed scheduling

Binary decision variables:X = { xil, i = 1,2,…. n; l = 1,2,…, λ +

1}xil, is TRUE only when operation vi starts

in step l of the schedule (i.e. l = ti)

λ is an upper bound on latency Start time of operation vi :

Σl . xil

ILP formulation

l

Operations start only onceΣ xil = 1 i = 1, 2,…, n

Sequencing relations must be satisfiedti ≥ tj + dj (vj, vi) є E

Σ l • xil – Σ l • xil – dj ≥ 0 (vj, vi) є E Resource bounds must be satisfied

Simple case (unit delay)Σ xil ≤ ak k = 1,2,…nres ; l

ILP formulation constraints

l

A

A

A

l

l

i:T(vi)=k

ILP Formulation

min (Σ l • xnl) such that

Σ xil = 1 i = 1, 2, …, n

Σ l • xij - Σ l • xjl - dj ≥ 0 i, j = 1, 2, …, n, (vj, vi) є E

Σ Σ xim ≤ ak k = 1, 2, …, nres ; l = 0, 1, …, λ

l

ll

l

m=l-di+1i:T(vi)=k

l

Example

Resource constraints: 2 ALUs; 2 Multipliers a1 = 2; a2 = 2

Single-cycle operation di = 1 i

* * + <

-

-

* * * * +

NOP

NOP

0

1 2

3

4

5

6

7

8

9

10

11

n

A

Example Operations start only once

x11 = 1x61 + x62 =1…

Sequencing relations must be satisfiedx61 + 2x62 – 2x72 – 3x73 + 1 ≤ 02x92 + 3x93 + 4x94 – 5xN5 + 1 ≤ 0…

Resource bounds must be satisfiedx11 + x21 +x61 + x81 ≤ 2x32 + x62 + x72 + x81 ≤ 2…

Example

*

*

+

<

-

-

* *

*

*

+

NOP

NOP

0

1 2

3

4

5

6

78

9

10

11

n

TIME 1

TIME 2

TIME 3

TIME 4

Resource-EfficientApplication mapping for MPSoCs

Given a platform1. Achieve a specified throughput2. Minimize usage of shared resources

MULTIMEDIAAPPLICATIONS

Optimization Development

The abstraction gap between high level optimization tools and standard application programming models can introduce unpredictable and undesired behaviours.

Programmers must be conscious about simplified assumptions taken into account in optimization tools.

New methodology for multi-task application development on MPSoCs.

Platform Modelling

Optimization Analysis

Optimal Solution

Starting Implementation

Platform Execution

Abstractiongap

(. .

Final Implementation

Application design flow

Resource assignment and scheduling

SHARED SYSTEM BUS

On-chipMemory

Node 1 Node N

Processor

Tightly-CoupledMemory

Bus Interface

.

.

.

.

.

Task. A (WCET Ta)Task. B (WCET Tb)

Task. N (WCET Tn)

THE SYSTEM

LimitedSize Mem

Max busbandwidth

Maxtime

wheelperiod

T

AssumedTo be

infinite

The application

T7T1 T2 T0 T3 …..

Signal Processing Pipeline

Queues for inter-processor communication in TCM for efficiency reasons Program data in TCM (if space) or on-chip memory Internal state in TCM (if space) or on-chip memory

Each task is characterized by:• WCET• Memory requirements

ThroughputConstraint

Communication-aware Allocation and Scheduling for Stream-Oriented MPSoCs

T7T1 T2 T0 …..Signal Processing

Pipeline

ARM7 LocalScratchpad

MemoryBUS

PrivateMemory

ARM7

………………..

LocalScratchpad

Memory

PrivateMemory

……….

Message-oriented MPSoC

architecture

?

Simplifying assumptions vs predictabilityEfficient solutions in reasonable timePure ILP formulations suitable for small task setsWidespread use of heuristics

Master Problem model Assignment of tasks and memory slotsAssignment of tasks and memory slots (master problem)

Tij= 1 if task i executes on processor j, 0 otherwise, Yij =1 if task i allocates program data on processor j memory, 0 otherwise, Zij =1 if task i allocates the internal state on processor j memory, 0 otherwise Xij =1 if task i executes on processor j and task i+1 does not, 0 otherwise

Each process should be allocated to one processor Tij= 1 for all j

Link between variables X and T: Xij = |Tij – Ti+1 j | for all i and j (can be linearized)

If a task is NOT allocated to a processor nor its required memories are:Tij= 0 Yij =0 and Zij =0

Objective function memi (Tij - Yij) + statei (Tij - Yij) + datai Xij /2

i

i j

Improvement of the model

With the proposed model, the allocation problem solver tends With the proposed model, the allocation problem solver tends to pack all tasks on a single processor and all memory required to pack all tasks on a single processor and all memory required on the local memory so as to have a ZERO communication cost: on the local memory so as to have a ZERO communication cost: TRIVIAL SOLUTIONTRIVIAL SOLUTION

To improve the model we should add a relaxation of the To improve the model we should add a relaxation of the subproblem to the master problem model: subproblem to the master problem model:

For each set S of consecutive tasks whose sum of durations For each set S of consecutive tasks whose sum of durations exceeds the Real time requirement, we impose that their processors exceeds the Real time requirement, we impose that their processors should not be the same should not be the same

WCETi > RT Tij |S| -1

i S i S

Sub-Problem modelTask scheduling with static resource assignmentTask scheduling with static resource assignment (subproblem)

i

Sub-Problem modelTask scheduling with static resource assignmentTask scheduling with static resource assignment (subproblem)We have to schedule tasks so we have to decide when they start

Activity Starting Time: Starti::[0..Deadlinei]

Precedence constraints: Starti+Duri Startj

Real time constraints: for all activities running on the same processor (Starti+Duri ) RT

Cumulative constraints on resourcesprocessors are unary resources: cumulative([Start], [Dur], [1],1)memories are additive resources: cumulative([Start],[Dur],[MR],C)

What about the bus??

i

Bus model

BANDWIDTHBIT/SEC

TIME

Max busbandwidth

Taskistate read

Taskistate write

Execution timetaski and task j

Unary resource: granularity clock cycle

Arbitration mechanism that decides the bus allocation

Taskjstate read

Taskj State write

Bus modelBANDWIDTH

BIT/SEC

TIME

Max busbandwidthSize of program data

TaskExecTimeTask0 accessesinput data:

BW=MaxBW/NoProc

Taskistate read

Taski state write

taskjtaski

Additive bus model

The model does not hold under heavy bus congestion(more than 65% of total bandwidth)

Bus traffic has to be minimized

Taskjstate read

Taski state write

No good generation Assignment of tasks and memory slotsAssignment of tasks and memory slots (master problem)Task scheduling with static resource assignmentTask scheduling with static resource assignment (subproblem)

If no feasible schedule exist for the allocation provided by the master a no-good is generated. We use the simple BUT EFFECTIVE one: identify CONFLICTING RESOURCES CR. For each R CR, STR set of tasks allocated on R

TiR | STR | - 1

Other cuts are also possible, [Hooker, Constraints 2005], but these are enough for our case and easy to extract

MasterProblem

solution Sub-Problem

no good

solution

IP solver CP solver

i STR

Computational efficiency

CP and IP formulations simplified Hybrid approach clearly outperforms pure CP and IP techniques Search time bounded to 15 minutes

CP and IP can found a solution only in 50%- of the instances Hybrid approach always found a solution

Validation of bus model

Requesting more than 65% of the theoretical maximum bandwidth causes the additive model to fail.

Lower threshold in presence of communication hotspots (50%) Benefits of the additive model

task execution time almost indep. of bus utilizationPerformance predictability greatly enhanced

Validation of optimizer solutions

MAX error lower than 10% AVG error equal to 4.7%, with standard deviation of 0.08 Optimizer turn out to be conservative in predicting infeasibility The flow was successfully applied to GSM benchmark

Energy-EfficientApplication mapping for MPSoCs

Given a platform1. Achieve a specified throughput2. Minimize power consumption

MULTIMEDIAAPPLICATIONS

Application Mapping

The problem of allocating, scheduling and freq. selection for task graphs on multi-processors in a distributed real-time system is NP-hard.

New tool flows for efficient mapping of multi-task applications onto hardware platforms

T1

T2 T3

T4 T5 T6

T7

T8

…Proc. 1 Proc. 2 Proc. N

INTERCONNECT

Private

Mem

Private

Mem

Private

Mem…

T1 T2 T3T4 T5 T6T8 T7

Time

Res

ou

rces

T1 T2

T3

T4

T5 T7

Deadline

T8

Allocatio

n

Schedule&Freq.sel.

Exploiting Voltage Supply Supply voltage impacts power and

performance Circuit slowdown T=1/f=K/(Vdd-Vt)a

Cubic power savings P=Ceff*Vdd2*f

Just-in-time computation Stretch execution time up to the max tolerable

Available time

PowerFixed voltage + Shutdown

Variable voltage

Scheduling & Voltage Scaling

deadlinet

P

1 2 3

Energy/speed trade-offs:

varying the voltagesVbs

CPUVdd

f1 f2 f3

Different voltages:different frequencies

Mapping and scheduling: given

(fastest freq.)

Power

deadlinet1 2 3

Slack

Target architecture - 2 Homogeneous computation tiles:

ARM cores (including instruction and data caches);

Tightly coupled software-controlled scratch-pad memories (SPM);

AMBA AHB; DMA engine; RTEMS OS; Technology homogeneous (0.13um)

industrial power models (ST) Variable Voltage/Frequency cores with discrete (Vdd,f) pairs

Frequency dividers scale down the baseline 200 MHz system clock

Cores use non-cacheable shared memory to communicate;

Semaphore and interrupt facilities are used for synchronization;

Private on-chip memory to store data.

Tile TileTile Tile …Sync. Sync. Sync. Sync.

PrivateMem

PrivateMem

PrivateMem

PrivateMem

SharedMem

AMBA AHB INTERCONNECT

PrivateMem..

Prog.REG

CLOCK TREEGENERATOR

System

CL

OC

K

CLOCK NCLOCK 3

CLOCK 2CLOCK 1

INTSlave

… Int_

CL

KTile TileTile Tile …Sync. Sync. Sync. Sync.

PrivateMem

PrivateMem

PrivateMem

PrivateMem

SharedMem

AMBA AHB INTERCONNECT

PrivateMem..

Prog.REG

CLOCK TREEGENERATOR

System

CL

OC

K

CLOCK NCLOCK 3

CLOCK 2CLOCK 1

INTSlave

… Int_

CL

K

A task graph represents: A group of tasks T Task dependencies Execution times express in clock cycles: WCN(Ti) Communication time (writes & reads) expressed as: WCN(WTiTj) and WCN(RTiTj) These values can be back-annotated from functional simulation

Application model

Task1

Task2

Task3

Task4

Task5

Task6

WCN(WT1T2)WCN(RT1T2)WCN(T1)

WCN(WT1T3)WCN(RT1T3)

WCN(T2) WCN(WT2T4)WCN(RT2T4)

WCN(WT3T5)WCN(RT3T5)

WCN(WT4T6)WCN(RT4T6)

WCN(WT5T6)WCN(RT5T6)

WCN(T3)

WCN(T4)

WCN(T5)

WCN(T6)

Efficient Application Development Support In optimization tools many simplifying assumptions are

generally considered The neglecting of these assumptions in software

implementation can generate: unpredictable and not desired system-level interactions; make the overall system error-prone.

We propose an entire framework to help programmers in software implementation:

a generic customizable application template OFFLINE SUPPORT;

a set of high-level APIs ONLINE SUPPORT. The main goals of our development framework are:

the exact and reliable application’s execution after the optimization step;

guarantees about high performance and constraint satisfaction.

Customizable Application Template

Starting from a high level task and data flow graph, software developers can easily and quickly build their application infrastructure.

Programmer can intuitively translate high level representation into C-code using our facilities and library.

Users can specify: the number of tasks included in the target application; their nature (e.g. branch, fork, or-node, and-node); their precedence constraints (e.g. due to data communication);

….thus quickly drawing its CTG schema. Programmer can focus onto the functionalities of the

tasks: the main effort is given to the more specific and critic sections of

the application.

OS-level and Task-level APIs Users can easily reproduce optimizer solutions, thus:

Indirectly neglecting optimizer’s abstractions Task model; Communication model; OS overheads.

Obtaining the needed application constraint satisfaction.

Programmer can allocate to the right hardware resources Tasks; Program data; Queues.

Scheduling support APIs Frequency and voltage selection;

Communication issues Shared queues; Semaphores; Interrupts.

//Node Behaviour: 0 AND ; 1 OR; 2 FORK; 3 BRANCHuint node_behaviour[TASK_NUMBER] = {2,3,3,..};

#define N_CPU 2uint task_on_core[TASK_NUMBER] = {1,1,2,1};int schedule_on_core[N_CPU][TASK_NUMBER] = {{1,2,4,8}..};

uint queue_consumer [..] [..] = { {0,1,1,0,..},

{0,0,0,1,1,.}, {0,0,0,0,0,1,1..},

{0,0,0,0,..}..};

//Node Type: 0 NORMAL; 1 BRANCH ; 2 STOCHASTICuint node_type[TASK_NUMBER] = {1,2,2,1,..};

Example Number of nodes : 12 Graph of activities Node type

Normal, Branch, Conditional, Terminator Node behaviour

Or, And, Fork, Branch Number of CPU : 2 Task Allocation Task Scheduling Arc priorities Freq. & Voltage

Time

Res

ourc

es

N1 B2

B3

C4

C7

Deadline

N8

T2 T3

T4 T5 T6 T7

T8 T9 T10

T11

T12

T1N1

B2 B3

C4 C5 C6 C7

N8 N9 N10

N11

T12

fork

or

or

and

branch branch

P1

P2

N11

N10

T12

a1a2

a3 a4 a5 a6

a7 a8 a9 a10

a11 a12

B3 C7 N10

T12

a13

a14

#define TASK_NUMBER 12

Queue ordering optimization

Communication ordering affects system performances

T1

T2

T4

CPU1 CPU2

C3C1

T3

…C

2

Wait!

RUN!

T5 T6… …

C4 C5

Queue ordering optimization

Communication ordering affects system performances

T1

T2

T4

T5 T6

CPU1 CPU2

…… …

C3C1

T3

…C

2

Wait!

RUN!

C4 C5

T4 re-activated

Synchronization among tasks

T1

T2 T4C2

T3

C1

C3

Proc. 1

T1

Proc. 2

T2T3 T4

T4 is suspended

Non blocked semaphores

Logic Based Benders Decomposition

Obj. Function:Communication cost

& energy consumption

Validallocation

Allocation& Freq. Assign.:

INTEGER PROGRAMMING

Scheduling:CONSTRAINT PROGRAMMING

No good: linearconstraint

Memory constraints

Real Time constraint

Decomposes the problem into 2 sub-problems: Allocation & Assignment of freq. settings → IP

Objective Function: minimizing energy consumption during execution and communication of tasks

Scheduling → CP Objective Function: minimizing energy consumption

during frequency switching

Solver Performance

Hundreds of of decision variables Much beyond ILP solver or CP solver capability

Allocation problem modelXtfp = 1 if task t executes on processor p at frequency f;Wijfp = 1 if task i and j run on different core.

Task i on core p writes data to j at freq. f;Rijfp = 1 if task i and j run on different core.

Task j on core p reads data to i at freq. f;

WriteadComp

P

p

M

fijfpijfp

P

p

M

fijfp

P

p

M

fijfp

P

p

M

ftfp

EnEnEnOF

TjiRW

TjiR

TjiW

tX

Re

1 1

1 1

1 1

1 1

,0)(

,1

,1

1 Each task can execute only on one processor at one freq.

Communication between tasks can execute only once for execution and one write corresponds to one read

The objective function: minimize energy consumption associated with task execution and communication

Five phases behaviour INPUT=input data

reading; EXEC=computation

activity; OUTPUT=output data

writing. Atomic activities

Scheduling problem model INPUT EXEC OUTPUT

The objective function: minimize energy consumption associated with frequency switching

•Processors are modelled as unary resource•Bus is modelled as additive resource

Duration of task i is now fixed since mode is fixed:Reading phase

input

input

input

exec

output

output

output

forkjoin

Writing phase

jijijii

jiii

jii

S ta rtadddW ritedura tionS ta rt

S ta rtTdura tionS ta rt

S ta rtdu ra tionS ta rt

R e

Task i Task j

Tasks running on the same processor at the same frequency

Tasks running on the same processor at different frequencies

Tasks running on different processors

Application Development Methodology

CTGCharacterization

Phase

Simulator

OptimizationPhase

Optimizer

ApplicationProfiles

Optimal SWApplication

Implementation

ApplicationDevelopment

Support

Alloca

tion

Sched

ulin

g

PlatformExecution

MAX error lower than 10%; AVG error equal to 4.51%, with standard

deviation of 1.94; All the deadline constraints are satisfied.

Optimizer

Optimal Allocation & Schedule

Virtual Platform validation

-0.05

0

0.05

0.1

0.15

0.2

0.25

-5% -4% -3% -2% -1% 0% 1% 2% 3% 4% 5% 6% 7% 8% 9% 10% 11%

250 instances

Validation of optimizer solutions: Throughput

Pro

bab

ilit

y (%

)Throughput difference (%)

MAX error lower than 10%; AVG error equal to 4.80%, with

standard deviation of 1.71;

Optimizer

Optimal Allocation & Schedule

Virtual Platform validation

250 instances

Validation of optimizer solutions: Power

-0.05

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

-5% -4% -3% -2% -1% 0% 1% 2% 3% 4% 5% 6% 7% 8% 9% 10% 11%

250 instances

Pro

bab

ilit

y (%

)Energy consumption difference (%)

GSM Encoder

Throughput required: 1 frame/10ms. With 2 processors and 4 possible

freq.&voltage settings:

Task Graph: 10 computational tasks; 15 communication tasks.

Without optimizations:50.9μJ

With optimizations:17.1 μJ - 66,4%

Summary & future work Energy-optimal task mapping

Strong optimization engine (complete) Programmer support (design & exec time) Validation: accuracy & optimality

Future work Conditional task graphs Dealing with multiple use cases Variable execution times Aggressive communication scheduling

top related