Dynamic Memory Management for new embedded systems David Atienza (DACYA/UCM), Stelios Mamagkakis (VLSI D&T Center, Xanthi), Marc Leeman (IMEC vzw, Leuven), Francky Catthoor (IMEC vzw, Leuven), José M. Mendías (DACYA/UCM), Dimitrios Soudris (VLSI D&T Center, Xanthi)

Post on 17-Dec-2015



New embedded systems?

- New consumer devices (e.g. mobiles, PDAs):

- Main features:
1) More complex than traditional embedded devices (complex memory hierarchy, CPU, real-time OSes)
2) Portable, with limited batteries
3) Preserve “relatively” high performance (real time)
4) Many applications are usually running concurrently

New applications?

- Multimedia and wireless network protocols: scalable video rendering, 3D virtual reality games, wireless protocols

- Main features:
1) Complex high-level design (e.g. C++, Java)
2) Very dynamic (variable use of resources)
3) Power hungry
4) Intensive memory use (accesses and footprint)

Why optimize these systems?

No optimizations: new applications on new platforms
- Out of memory or battery fast!
- No real-time!

Users do not want these problems… new embedded systems need to be optimized!

Outline:

1) Memory subsystem in new embedded devices

(Static vs dynamic memory allocation)

2) Dynamic Memory management mechanism

3) Dynamic memory subsystem refinement:

3.1) Dynamic Data Type Refinement (Application Level)

3.2) Dynamic Memory Management Refinement (OS Level)

4) Real life case studies and results

5) Enhanced Dynamic Memory Management (Multi-level) – Real life case study and results

6) Conclusions and future work

Memory subsystem in new embedded devices

- Highly optimized:
1) Multi-level memory hierarchy (e.g. caches, etc.)
2) High-performance buses (e.g. AMBA bus)

[Figure: ARM processor core with MMU, instruction cache, data cache and scratchpad memory, connected to main memory over the AMBA bus]

- Not enough… The memory subsystem accounts for up to 70% of total system power and causes high performance degradation (>20x) [Vijaykrisnan2002, Catthoor2000]

Access and storage optimizations for DM needed!

Static memory vs dynamic memory

Scenario 1 - Compile-time (worst case)

[Figure: scalable 3D decoding (per object) at low, medium and high quality. Data memory size over time t1 to t4 for Object 1, Object 2 and Object 3. Verdict: NO!]

Scenario 2 – Run time allocation

[Figure: scalable 3D decoding (per object) at low, medium and high quality. Data memory size over time t1 to t5 for Object 1 to Object 5. Verdict: OK!]

Memory usage scales to current input!
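The difference between the two scenarios can be put into numbers with a small sketch. It is not taken from the slides: the `Timeline` type, the function names and all byte counts are made up for illustration, and every time step is assumed to list the same number of objects.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One row per time step; each row holds the bytes needed by every object at
// that step (0 means the object is not alive). Illustrative data only.
using Timeline = std::vector<std::vector<std::size_t>>;

// Scenario 1 (compile time, worst case): each object keeps its own maximum
// for the whole run, so the reservation is the sum of per-object maxima.
std::size_t staticWorstCase(const Timeline& t) {
    if (t.empty()) return 0;
    std::vector<std::size_t> maxPerObject(t.front().size(), 0);
    for (const auto& step : t)
        for (std::size_t i = 0; i < step.size(); ++i)
            maxPerObject[i] = std::max(maxPerObject[i], step[i]);
    std::size_t total = 0;
    for (std::size_t m : maxPerObject) total += m;
    return total;
}

// Scenario 2 (run-time allocation): only live data is held, so the footprint
// is the largest per-step sum. Manager overhead is ignored in this sketch.
std::size_t dynamicPeak(const Timeline& t) {
    std::size_t peak = 0;
    for (const auto& step : t) {
        std::size_t sum = 0;
        for (std::size_t bytes : step) sum += bytes;
        peak = std::max(peak, sum);
    }
    return peak;
}
```

With three objects whose demands peak at different times, the worst-case reservation is the sum of the three peaks, while the run-time footprint is only the busiest single moment.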

Overview of options for memory allocation (results 3D Image Reconstruction case study)

• Worst case static memory solutions not possible or do not work in extreme cases of input data
• Dynamic solutions achieve better results
• As shown later: well-designed custom solutions can further improve on standard DM mechanisms

DM management works at 2 levels

1) Applications use SW functions, C++: new()/delete()

2) Real time OS support: DM manager

[Figure: heap in main memory, managed by the RTOS dynamic memory manager, shown over time t1 to t6. The application issues new(O1), new(O2), new(O3), delete(O2), new(O4), new(O5), and the heap ends up with objects O1, O3, O4 and O5 scattered across it. Fragmentation!!]
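A toy model makes the fragmentation effect concrete. The sketch below is an assumption-laden illustration, not the RTOS manager from the slides: `ToyHeap` is a hypothetical class that tracks a heap as numbered cells, allocates with first fit, never moves live objects, and expects non-zero object ids.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy heap: each cell stores the id of the object occupying it, 0 if free.
class ToyHeap {
public:
    explicit ToyHeap(std::size_t cells) : cells_(cells, 0) {}

    // First fit: claim the first run of `size` contiguous free cells.
    // Returns false when no single run is large enough.
    bool allocate(int id, std::size_t size) {
        std::size_t run = 0;
        for (std::size_t i = 0; i < cells_.size(); ++i) {
            run = (cells_[i] == 0) ? run + 1 : 0;
            if (run == size) {
                for (std::size_t j = i + 1 - size; j <= i; ++j) cells_[j] = id;
                return true;
            }
        }
        return false;
    }

    // Free every cell owned by `id`.
    void release(int id) {
        for (int& c : cells_) if (c == id) c = 0;
    }

    // Total number of free cells, contiguous or not.
    std::size_t freeCells() const {
        std::size_t n = 0;
        for (int c : cells_) if (c == 0) ++n;
        return n;
    }

    // Length of the longest contiguous free run.
    std::size_t largestHole() const {
        std::size_t best = 0, run = 0;
        for (int c : cells_) {
            run = (c == 0) ? run + 1 : 0;
            if (run > best) best = run;
        }
        return best;
    }

private:
    std::vector<int> cells_;
};
```

Replaying the sequence new(O1), new(O2), new(O3), delete(O2), new(O4), new(O5) on a 10-cell heap with sizes 3, 3, 2, 2 and 3 leaves three free cells, yet the request for O5 fails because the largest hole is only two cells wide.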

1) Which parts use the DM SW functions?

Embedded application:
- Algorithms (functionality)
- Static data (e.g. frames)
- Dynamic data (e.g. objects to render)

Dynamic Data Types (DDT 1 … DDT n): structured data (sets of objects), created with new(Object)

[Figure: two examples of two-layer DDTs holding data blocks. One uses PA(key1) in layer 1 and AR(key2) in layer 2; the other uses LAR(key1) in layer 1 and PA(key2) in layer 2]

2) RTOS support, DM manager to use?

- Partition manager: suitable for one allocation size
– One fixed block size
– First fit allocation order
– Simple - low energy consumption

- Region allocator: suitable for several allocation sizes
– Many block sizes, doubly-linked list
– First fit, splitting, coalescing (best fit approximation)
– Complex – higher energy consumption

[Figure: heap layouts of both managers. Each shows the global info of the manager followed by free and used blocks (with per-block headers in the region allocator) and how a new request is served. Fragmentation!!]
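The partition manager idea (one fixed block size, trivially cheap allocation) can be sketched as a fixed-size block pool. This is a minimal illustration under stated assumptions, not the RTOS implementation referred to on the slide: `PartitionPool` and its members are hypothetical names, and the free list is threaded through the free blocks themselves so no per-block header is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Fixed-size block pool: every request gets exactly one BlockSize block.
template <std::size_t BlockSize, std::size_t NumBlocks>
class PartitionPool {
    static_assert(NumBlocks > 0, "pool needs at least one block");
    static_assert(BlockSize >= sizeof(void*) && BlockSize % alignof(void*) == 0,
                  "a free block must be able to hold an aligned link pointer");

public:
    PartitionPool() {
        // Push back-to-front so that block 0 is handed out first.
        for (std::size_t i = NumBlocks; i-- > 0;) push(storage_ + i * BlockSize);
    }

    // O(1): pop the head of the free list, or nullptr when the pool is empty.
    void* allocate() {
        if (!head_) return nullptr;
        Node* n = head_;
        head_ = n->next;
        --free_;
        return n;
    }

    // O(1): push the block back on the free list (no coalescing needed).
    void deallocate(void* p) {
        if (p) push(p);
    }

    std::size_t freeBlocks() const { return free_; }

private:
    struct Node { Node* next; };

    void push(void* p) {
        head_ = new (p) Node{head_};  // reuse the free block as a list node
        ++free_;
    }

    alignas(std::max_align_t) unsigned char storage_[BlockSize * NumBlocks];
    Node* head_ = nullptr;
    std::size_t free_ = 0;
};
```

Because every block has the same size, there is no search, splitting or coalescing, which is where the "simple, low energy" label comes from; the price is wasted space whenever the application asks for less than one block.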

Design flow for new embedded systems

Specification of the system at very high-level (e.g. C++ or UML)

Refinement of DM subsystem (2 levels):
1. Dynamic Data Types Refinement (DDTR)
2. DM Managers Refinement (DMMR)

Further optimizations in static data

Final implementation

DDTs in new embedded applications

1) Complex control flow (many sub-algorithms) => Many DDTs with different (and irregular) behaviour interacting in time

2) Implementations designed from a functional point of view, not for efficient access or mapping to memory (e.g. clustering of data)

3) Complex implementations => combinations of trees, arrays, single linked lists, doubly linked lists, …

Result: Very expensive (e.g. energy) if not well designed

Proposed Dynamic Data Type Refinement steps

1) Specification of the required DDTs (from the algorithm)
2) Working implementation of DDTs (e.g. C++)
3) Run time simulation and profiling acquisition
4) Refinement of current design (interaction and implementation of DDTs)
5) Final implementation of DDTs

PROBLEMS:
- Insertion of profiling: error-prone, time-consuming. Automated ways?
- Huge design space of DDTs. High-level estimations?

Solutions proposed:
- Library of auto-profiled DDTs
- Structured reports generation
- Heuristics to limit the exploration
- Analytical high level estimations

Library of auto-profiled DDTs

1.- Library of complex multi-level DDTs:
– Based on initial exploration in Matlab with traces of real programs
– Multiplatform compatible: ANSI C++ compliant
– Basic data types (e.g. int) or user-defined (e.g. objects)
– List of relevant DDTs (i.e. 14) for multimedia included:

1) Basic data types: single and doubly linked lists, array (AR) and pointer-AR

2) 2-layered combinations of ARs and pointer-ARs

3) Single linked list – AR

4) Doubly linked list – AR

5) Binary trees

6) DDTs with pattern optimizations (e.g. fast access to the last element)

2.- Extension: Mechanism to create new DDTs in the library based on template classes (or “mixins”)

“Mixin” concept and how we use it

● Y. Smaragdakis (2002): A method to specify extensions of classes without defining up-front which classes exactly
● Uses in our library of DM managers:

1) Specifying a subclass while the parent class is a template parameter:

template<class SuperClass>
class Mixin : public SuperClass {
  // mixin definitions
};

2) Using a template class inside another class:

template<class SuperClass>
class Myclass {
  SuperClass *data;
  // template class definitions
};
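To show that the first mixin form composes in practice, here is a small compilable sketch. All class names (`IntStack`, `Counting`, `Bounded`) are hypothetical and chosen only for illustration; they are not part of the authors' library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Base layer: a plain growable stack of ints.
class IntStack {
public:
    void push(int v) { data_.push_back(v); }
    int pop() { int v = data_.back(); data_.pop_back(); return v; }
    std::size_t size() const { return data_.size(); }
private:
    std::vector<int> data_;
};

// Mixin 1: counts pushes. The parent class is a template parameter, so the
// same layer can be put on top of any class that offers push(int).
template <class SuperClass>
class Counting : public SuperClass {
public:
    void push(int v) { ++pushes_; SuperClass::push(v); }
    std::size_t pushes() const { return pushes_; }
private:
    std::size_t pushes_ = 0;
};

// Mixin 2: refuses to grow beyond Limit elements. It only relies on size()
// and push(int) of whatever it is layered on.
template <std::size_t Limit, class SuperClass>
class Bounded : public SuperClass {
public:
    bool tryPush(int v) {
        if (this->size() >= Limit) return false;
        this->push(v);  // resolves to the nearest layer below (here Counting)
        return true;
    }
};

// Layers are stacked at the point of use, not fixed when they are written.
using BoundedCountingStack = Bounded<2, Counting<IntStack> >;
```

Each layer is written once without knowing its final parent, and since the composition is resolved at compile time the calls can be inlined by the compiler.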

Examples of DDT definitions:

● Generic and basic DDTs:

template<int Size, typename T, class Super> class Array { ... };
template<typename T, class Super> class DLList { ... };
template<typename T, class Super> class BTTree { ... };

● One-level concrete DDTs:

class F_array : public Array<256, float> {};
class I_DLList : public DLList<int, int> {};
class D_BTTree : public BTTree<double, double> {};

● New multi-level concrete DDTs:

class ARARInteg : public Array<20, int, Array<128, int> > {};
class DLLARInteg : public DLList<int, Array<128, int> > {};
class BTSLLDoub : public BTTree<double, SLList<double> > {};
class ARARAREmployee : public Array<2, Employee,
    Array<4, Employee, Array<2, Employee> > > {};

Structured reports of DDTs implementations at run-time

1) Profiling already inserted in all the DDT implementations of the library

2) Information reported at run-time from the DDTs:
1. Read and write accesses
2. Memory footprint behaviour
3. Access pattern to the data => Methods calls to data (e.g. sequential)

3) Graphical Tool (based on Gtk/Perl) to perform code parsing and profiling insertion in new DDTs
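A DDT with profiling built in might look roughly like the sketch below. It is a simplified assumption of how such instrumentation could work, not the library's actual interface: `AccessProfile` and `ProfiledArray` are hypothetical names, and only read/write counts and the peak payload footprint are recorded.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counters filled in by an instrumented DDT while the application runs.
struct AccessProfile {
    std::size_t reads = 0;
    std::size_t writes = 0;
    std::size_t peakBytes = 0;  // payload only, container overhead ignored
};

// Array-like DDT whose every access updates the attached profile object.
template <typename T>
class ProfiledArray {
public:
    explicit ProfiledArray(AccessProfile& profile) : prof_(profile) {}

    void append(const T& v) {
        data_.push_back(v);
        ++prof_.writes;
        const std::size_t bytes = data_.size() * sizeof(T);
        if (bytes > prof_.peakBytes) prof_.peakBytes = bytes;
    }

    T get(std::size_t i) const {
        ++prof_.reads;
        return data_[i];
    }

    void set(std::size_t i, const T& v) {
        ++prof_.writes;
        data_[i] = v;
    }

    std::size_t size() const { return data_.size(); }

private:
    std::vector<T> data_;
    AccessProfile& prof_;  // owned by the caller, read out in post-processing
};
```

Because the counters live inside the data type, the application code does not change when a DDT is swapped for another instrumented one; only the report differs.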

Our run time exploration of DDTs

1) Heuristics based on clustering of blocks => Possible up to one per DDT

2) Unified exploration loop during usual execution:

3) Refinement is done in a post-processing phase.

[Figure: exploration loop. Start exploration → the application runs at normal execution speed with instrumented DDTs (from the library of DDTs for multimedia), acquiring profiling into profile objects → heuristics evaluation → Finished? No: keep acquiring profiling; Yes: exploration finished]

Post-processing phase (refinement)

● Automated refinement process:
– Input: acquired run-time profiling information
– Analytical power model (0.8 to 0.13 tech. node), based on Cacti v3.0
– Graphical evaluation tool
– Output: global optimal (Pareto) points, trade-offs: power consumption / memory footprint / execution time

• Further refinements possible with run time behaviour information of DDTs: global control flow simplification => Intermediate Variable Elimination (additional global gains!)

Simplification of control flow: Intermediate Variable Elimination phase

● Interaction between DDTs in a global context:

1) Complex algorithms consist of many smaller ones: data generation and consumptions (DSP)

2) Each step performs “some” transformations: filtering of points, proximity, selection, …

3) Injective relationship: remove buffers when the index function is simple enough and no intermediate results are needed later

Very significant additional global gains!
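A minimal sketch of what eliminating an intermediate variable can mean: two steps that communicate through a temporary buffer are fused so the buffer disappears. The function names, the filter and the transformation are invented for illustration and are much simpler than the DSP pipelines the slide refers to.

```cpp
#include <cassert>
#include <vector>

// Before: step 1 filters the input into an intermediate dynamic buffer,
// step 2 consumes that buffer. The buffer costs footprint and accesses.
long sumOfKeptWithBuffer(const std::vector<int>& in, int threshold) {
    std::vector<int> kept;  // intermediate DDT
    for (int v : in)
        if (v >= threshold) kept.push_back(v);
    long sum = 0;
    for (int v : kept) sum += 2L * v;
    return sum;
}

// After: each kept element is consumed in order, exactly once, and is not
// needed later, so producer and consumer are fused and the buffer is removed.
long sumOfKeptFused(const std::vector<int>& in, int threshold) {
    long sum = 0;
    for (int v : in)
        if (v >= threshold) sum += 2L * v;
    return sum;
}
```

Both versions compute the same result; the fused one never allocates `kept`, which removes one dynamic data type together with all its writes and reads.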

Automation tool for Dynamic Data Type Refinement

Design flow for new embedded systems

Specification of the system at very high-level (e.g. C++ or UML)

Refinement of DM subsystem (2 levels):
1. Dynamic Data Types Refinement (DDTR)
2. DM Managers Refinement (DMMR)

Further optimizations in static data

Final implementation

Problems to create custom DM managers

● DM management left to the OS => General-purpose DM managers, not custom ones! (Lea Allocator – Linux-based systems 2003)

● Custom DM managers? No guidelines! Only designers’ experience (try-test phase):

1) Huge design space to manually explore (e.g. organization of memory blocks, fit algorithms) (Wilson et al. 1995)

2) Frameworks to build and profile custom DM managers are not available (Berger2001,Attardi98)

Small example of different choices in DM managers

Several options to decide:
– DM manager with information in the blocks?
– DM manager with coalescing service or not?

[Figure: memory blocks with headers illustrating the choices: one block size vs several block sizes, no coalescing vs coalescing]

Both options are possible; which is best for the current application?

New methodologies to decide the best options needed!

Proposed DM management methodology

• Main phases:

1) Profiling of the application’s dynamic behavior to detect the most commonly occurring data type accesses

2) Systematic exploration of possible DM management solutions from a structured (orthogonalized) design space, for a certain cost function

3) Efficient code implementation and empirical evaluation of promising DM management solutions using our own high-level C++ library to create them

Proposed Dynamic Memory Management Refinement steps

1) Profiling of application’s dynamic behavior (identification of DDTs access patterns)
2) Exploration of possible DM management solutions for certain constraints (e.g. power, performance, etc.)
3) Implementation and run-time evaluation of promising custom DM management candidates
4) Final selection of custom DM manager

Profiling of application’s DM behavior

Dynamic data = organized data structures: Dynamic Data Types (DDTs)
– Allocation sizes of each structure in the DDTs
– Temporal behavior of each DDT
– Memory footprint of each DDT
– Interaction of the DDTs (spatial locality)

=> From our Dynamic Data Type Refinement

Exploration of possible DM management solutions to minimize memory footprint

1) Definition of structured design space for DM management:
– Orthogonal categories to create custom DM managers
– Categories propagate dependencies and make the design space exploration feasible

2) Definition of a suitable order to traverse the design space reducing a certain cost function/s.

Our design space for DM management

• According to basic blocks defined in DM managers
• Orthogonal decision trees inside each category:

A. Creating block structures
– Block structure: DDT options
– Block sizes: one, many
– Block tags: none, header, boundary tags
– Block recorded info: size, status, pointers
– Flexible block size manager: split, coalesce, ask for more memory

B. Pool division based on criterion
– Size: one pool per size, single pool
– Pool structure: DDT options

C. Allocating blocks
– Fit algorithms: exact fit, first fit, next fit, best fit, worst fit

D. Coalescing blocks
– Number of max block size: one, many (categories); fixed, not fixed
– When: always, sometimes, never; immediate, deferred

E. Splitting blocks
– Number of min block size
– When
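The fit algorithms in category C differ only in which free block they pick for a request. A minimal sketch, assuming the free blocks are given as a plain list of sizes (the function names and the `kNoFit` marker are hypothetical, and exact fit and next fit are left out for brevity):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returned when no free block can hold the request.
constexpr std::size_t kNoFit = static_cast<std::size_t>(-1);

// First fit: index of the first free block that is large enough.
std::size_t firstFit(const std::vector<std::size_t>& freeSizes, std::size_t request) {
    for (std::size_t i = 0; i < freeSizes.size(); ++i)
        if (freeSizes[i] >= request) return i;
    return kNoFit;
}

// Best fit: index of the smallest free block that is large enough
// (the earliest one on ties). Needs a full scan of the list.
std::size_t bestFit(const std::vector<std::size_t>& freeSizes, std::size_t request) {
    std::size_t best = kNoFit;
    for (std::size_t i = 0; i < freeSizes.size(); ++i)
        if (freeSizes[i] >= request && (best == kNoFit || freeSizes[i] < freeSizes[best]))
            best = i;
    return best;
}

// Worst fit: index of the largest free block, provided it is large enough.
std::size_t worstFit(const std::vector<std::size_t>& freeSizes, std::size_t request) {
    std::size_t worst = kNoFit;
    for (std::size_t i = 0; i < freeSizes.size(); ++i)
        if (freeSizes[i] >= request && (worst == kNoFit || freeSizes[i] > freeSizes[worst]))
            worst = i;
    return worst;
}
```

First fit stops at the first match and is therefore the cheapest in accesses, while best fit leaves the least slack in the chosen block at the cost of visiting every free block; this is the kind of trade-off the cost function has to arbitrate.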

Complete DM management design space

● All important state-of-the-art DM managers covered within our DM design space:
– Binary buddy, Double buddy
– Simple segregated fit
– First fit, next fit, best fit allocation orders
– Kingsley allocator (among the fastest, N-Gage)
– Region allocators (fast, embedded RTOS: RTEMS)
– Complex region-segregated fit
– Win XP real time (fast)
– Doug Lea Allocator (Linux, best trade-off)
– Obstacks (custom, optimized for stack behavior, gcc)
– Xalloc (custom, variation of regions-stack, Apache)
– …

Main problem: Huge design space! Order of decision trees?

Interdependencies help to explore the DM management design space

Interdependencies exist between orthogonal trees:

– A2) Block sizes in the DM manager: one or several? (one block size vs several block sizes)

– A5) Flexible block size DM manager: coalesce or not? (no coalescing vs coalescing)

[Figure: memory blocks with headers illustrating both decisions]

These interdependencies make the exploration feasible: not all combinations of trees are realistic!

E.g. Final order for reduced memory footprint (interdependencies and factors of influence)

[Figure: the decision trees of categories A to E (creating block structures, pool division based on criterion, allocating blocks, coalescing blocks, splitting blocks), numbered (1) to (12) in the order in which they are traversed]

Our approach to implement and evaluate custom DM managers

1) Object-oriented library to compose DM managers:

- ANSI C++ code with “mixins” that can be efficiently optimised by current compilers (e.g. gcc, Visual C++)

- Custom DM managers composed by basic components (e.g. fit algorithms, memory blocks organizations, etc.)

- Fast creation and debugging of custom DM managers

2) Profiling framework can be easily inserted in the library to profile the candidates (e.g. memory footprint, energy, etc. )

3) Post-processing phase to compare the DM candidates

Advantages of our mixin-based approach for DM managers over traditional implementations

1) Direct equivalences between our DM design space and its implementation classes

2) Independent layers => Good maintainability and fast changes in parts of DM managers (e.g. Lea Alloc., 15000 lines of complex C code)

3) Extensive code reuse => implementation classes reused among different DM managers using common interfaces of methods

4) Profiling can be done in a similar way to DDTs

Graphical example of custom DM manager created by composition of our basic blocks

[Figure: DMMHeap, the final custom DM manager, chooses which manager to use according to the allocation size. One branch combines Best Fit with a doubly-linked list blocks structure, the other First Fit with a binary tree blocks structure; each sits on its own OS interface heap (4B physical blocks and 8B physical blocks)]

Example of custom DM manager created with our categories of basic blocks

/* Basic blocks for heap requests to the system */

template<typename MyT> class BasicHeap: public TypeClass<MyT,mheap>{};

/* Data types of the Dynamic Memory (DM) manager: DLL –> Doubly Linked List and BTT -> Binary Tree */

template<typename MyT, class SuperClass> class DLList {/*Implem. DLL*/ };

template<typename MyType, class SuperClass> class BTTree { /*Imp. BTT */};

/* Two basic data types instantiated for the memory manager */

class I_DLList : public DLList<int, BasicHeap<int> > {};

class D_BTTree : public BTTree<double, BasicHeap<double> >{};

/* DM manager with 2 seg. fit lists of data types, best or first fit policy */

class DMMHeap : public SegLists<
    list_Sizes,          // List of sizes for segList
    numElemFirstList,    // Number of lists with type 1st segList
    BestFit<I_DLList>,   // 1st segList
    FirstFit<D_BTTree>   // 2nd segList
> {};
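The DMMHeap definition above relies on library classes that the slides do not show. As a rough, self-contained analogue, the sketch below composes a manager from layers in the same spirit. Every name in it (`SystemHeap`, `Counted`, `SizeSplit`, `DemoHeap`) is hypothetical, the bottom layer simply forwards to `std::malloc`, and the caller is assumed to pass the block size again on deallocation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Bottom layer: obtains memory from the system.
class SystemHeap {
public:
    void* allocate(std::size_t bytes) { return std::malloc(bytes); }
    void deallocate(void* p) { std::free(p); }
};

// Mixin layer: counts live blocks; can be stacked on any heap layer.
template <class Super>
class Counted : public Super {
public:
    void* allocate(std::size_t bytes) {
        void* p = Super::allocate(bytes);
        if (p) ++live_;
        return p;
    }
    void deallocate(void* p) {
        if (p) --live_;
        Super::deallocate(p);
    }
    std::size_t live() const { return live_; }
private:
    std::size_t live_ = 0;
};

// Top layer: picks one of two managers according to the allocation size,
// loosely mirroring how DMMHeap chooses between its two segregated lists.
template <std::size_t Threshold, class Small, class Large>
class SizeSplit {
public:
    void* allocate(std::size_t bytes) {
        return bytes <= Threshold ? small_.allocate(bytes) : large_.allocate(bytes);
    }
    void deallocate(void* p, std::size_t bytes) {
        if (bytes <= Threshold) small_.deallocate(p);
        else large_.deallocate(p);
    }
    const Small& smallPart() const { return small_; }
    const Large& largePart() const { return large_; }
private:
    Small small_;
    Large large_;
};

// The final manager is just a type built from the layers.
using DemoHeap = SizeSplit<64, Counted<SystemHeap>, Counted<SystemHeap> >;
```

Swapping a policy means changing one template argument in the `using` line, which is the maintainability argument made for the mixin-based library.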

Profiling framework for DM managers

1) Reuse of the interface for the profiling of DDTs
2) Profiling already inserted in all the classes of our library of DM managers
3) Information reported at run-time from the DM managers:
– Memory footprint behaviour
– Access pattern due to DM managers to the data (e.g. allocations/deallocations, etc.)
– Fragmentation in the managers
– Classified for each implementation part of the DM managers (e.g. fit algorithms, internal data structures, etc.)

Code example of integration of profiling objects in our custom DM managers

- Easy insertion of our C++ profiling framework in the original structure of custom DM managers:

/* Declaration of profile objects of our common profile framework */

_profile *prof1, *prof2, *prof3;

class DMMHeap : public SegLists<
    prof1,                        // Profile object at this level
    list_Sizes,                   // List of memory block sizes for the segList
    numElemFirstList,             // Number of lists with type 1st segList
    BestFit<I_DLList<prof2> >,    // 1st segList with profile object
    FirstFit<D_BTTree<prof3> >    // 2nd segList with profile object
> {};

Few new parts are required!

Case study 1: 3D reconstruction algorithm

Matching of points in consecutive frames (“like” motion detection)

- 1,500 2D points to ‘match’ on average
- Size: 700000 lines of C++ code
- Sources of uncertainty:
1) Unknown input image sizes
2) Additional intermediate DDTs

Initial and optimised DDTs in the 3D reconstruction algorithm – DDTR phase

[Figure: initial DDTs implementation vs final DDTs implementation]

Results obtained with our DMMR phase

● Memory footprint reduction:

[Figure: bar chart “Memory footprint of different DM managers (2 frames)”, memory footprint in MBytes (axis 0 to 2.50), for Kingsley (Win32), Regions and our DM manager]

● Execution time reduced, overhead added to total execution time is not significant: 600 frames, 20s.

[Figure: bar chart “Overhead due to DM managers”, time in secs (axis 0 to 2), for Kingsley (Win32), Regions and our custom DM manager]

Final results 3D Reconstruction case study

[Figure: three charts comparing the Original, DDT, Manual and Optimised implementations: normalised memory (B, axis 0 to 325000), accesses (logarithmic axis, 10000 to 100000000) and energy (nJ, logarithmic axis, 1000 to 100000000)]

Overall gains of almost 2 orders of magnitude!

Case Study 2: Virtual reality game

• Real images interact with 3D generated objects
• Initially designed for embedded devices (e.g. Trimedia 1300)
• Unpredictable behaviour:
– Objects on the screen
– Wall detection
– User movements

Initial image Processed image

Overall results trying to minimize power consumption

• Final comparison with original DM implementation:– Total memory saving: 22.48%

– Total power consumption saving: 75.3%

[Figure: DDTs behaviour with 6 images]

[Figure: global Pareto points for the DDTs]

[Figure: examples of control flow simplification: Intermediate Variable Elimination]

Case study 3: 3D rendering system

● Scalable rendering of objects according to available system resources (QoS):

Scalable 3D coding (3D Mesh + 2D Texture)

Low-quality decoding High-quality decoding

- Size: 5000 lines of C++ code
- Usual “static” mem. footprint: 12 MB (real scenes with 7 objects)
- Sources of uncertainty:
1) Movements of the user (QoS)
2) Several DDTs: 3D points and 3D faces

Results DDTR reducing memory footprint

● Memory footprint reduction (35%):

[Figure: bar chart of memory (B, axis 0 to 12000), original vs optimised implementation]

● Energy results with two different memory hierarchies (with and without cache): very different results!

[Figure: bar charts “Energy 1” and “Energy 2”, energy in nJ (axis 0 to 16000000), original vs optimised implementation, without cache and with cache]

Results DMMR (reducing memory footprint)

● Memory footprint improved:

[Figure: bar chart “Memory footprint of different DM managers (5 objects)”, memory footprint in MBytes (axis 0 to 4.50), for Lea (Linux), Kingsley (Win32), Obstacks and our DM manager]

● Execution may be slower:

[Figure: bar chart “Execution time of different DM managers”, time in secs (axis 0 to 10), for Lea (Linux), Kingsley (Win32), Obstacks and our DM manager]

(Trade-offs between execution time, memory footprint and power)

Final results 3D rendering case study

[Figure: bar chart of memory footprint and energy, normalized to the original implementation (axis 0 to 120), for: original, after DDTR + OS, manual + OS, after DDTR + DMMR]

- Great gains in memory footprint: almost 60%, but… Important loss in energy consumption!

(Trade-offs exist between memory footprint, energy, etc.)

Multi-level DM management

- Use the whole memory hierarchy:

[Figure: a dynamic embedded application issues new(O1), new(O2), new(O3), free(O2), new(O4), new(O5) … to the RTOS dynamic memory manager, whose heap spans the scratchpad memory (low energy) and main memory (high energy); objects O1 to O5 are distributed over both levels]
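One simple way to picture a manager that spans two memory levels is a small arena standing in for the scratchpad, with a fallback to the ordinary heap. The sketch below is only an assumption about how such placement could be expressed in software: `TwoLevelHeap` is a hypothetical class, the `hot` flag stands for a design-time hint that a DDT is accessed often, the scratchpad part is a bump allocator that never reclaims individual blocks, and no real scratchpad hardware or DMA is involved.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>

// Two-level placement: hot data goes to a small fast arena while it has room,
// everything else goes to main memory through std::malloc.
template <std::size_t ScratchBytes>
class TwoLevelHeap {
public:
    void* allocate(std::size_t bytes, bool hot) {
        if (bytes == 0) bytes = 1;  // keep every returned pointer distinct
        const std::size_t rounded = (bytes + kAlign - 1) / kAlign * kAlign;
        if (hot && rounded <= ScratchBytes - used_) {
            void* p = scratch_ + used_;
            used_ += rounded;
            return p;
        }
        return std::malloc(bytes);
    }

    // Scratchpad blocks are not reclaimed one by one in this sketch;
    // main-memory blocks are returned to the system.
    void deallocate(void* p) {
        if (!inScratchpad(p)) std::free(p);
    }

    bool inScratchpad(const void* p) const {
        const unsigned char* c = static_cast<const unsigned char*>(p);
        std::less<const unsigned char*> before;  // total order over pointers
        return !before(c, scratch_) && before(c, scratch_ + ScratchBytes);
    }

    std::size_t scratchUsed() const { return used_; }

private:
    static constexpr std::size_t kAlign = alignof(std::max_align_t);
    alignas(std::max_align_t) unsigned char scratch_[ScratchBytes];
    std::size_t used_ = 0;
};
```

A real multi-level manager would also decide when to move blocks between levels; here the decision is made once, at allocation time, which is enough to show how the same new/free interface can hide two memories with different energy costs.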

Hardware extensions for multi-level DM management: DMA

● DMA-like transfer controller used for DM management, moves blocks between main memory and scratchpad.
– Operates in parallel with the processor.
– More energy-efficient than the processor for handling data transfers.
– Low setup time.

[Figure: ARM core with MMU, caches and scratchpad, connected over the AMBA bus to external memory. New part: DMA transfer controller]

Efficient partition of scratchpad for DM management

Compile time decisions of partition of memory hierarchy for dynamic allocation

[Figure: for each dynamic scenario of a certain application (Dynamic Scenario 1, Dynamic Scenario 2), the DDTs (DDT 1 … DDT n, DDT n+1) are divided between main memory + cache and scratchpad + DMA]

Case Study 4: Network scheduling application (simulated on MPARM-Bologna)

● Wireless networks -> Deficit Round Robin (DRR) application (NetBench benchmark suite)

● Forwarding scheduler algorithm in many routers:

[Figure: a stream of incoming packets is distributed over queues of data packets, which are then forwarded in turn (T1, T2, … T5)]

- Size: 500 lines of C++ code
- Sources of uncertainty:
1) Variable number of packets arriving at any moment in time
2) Multiple packet sizes allowed and variable number of queues
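For readers unfamiliar with Deficit Round Robin, the scheduling rule itself fits in a few lines. The sketch below follows the textbook formulation of DRR and is not the NetBench code used in the case study; `DrrQueue`, `Sent` and `drrRound` are hypothetical names, and packets are represented only by their size in bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// One flow: pending packet sizes (bytes) and the credit carried between rounds.
struct DrrQueue {
    std::deque<std::size_t> packets;
    std::size_t deficit = 0;
};

// Record of one forwarded packet.
struct Sent {
    std::size_t queue;
    std::size_t bytes;
};

// One DRR round: every backlogged queue earns `quantum` bytes of credit and
// forwards packets from its head while the credit covers them.
std::vector<Sent> drrRound(std::vector<DrrQueue>& queues, std::size_t quantum) {
    std::vector<Sent> out;
    for (std::size_t q = 0; q < queues.size(); ++q) {
        DrrQueue& dq = queues[q];
        if (dq.packets.empty()) {
            dq.deficit = 0;  // idle queues do not accumulate credit
            continue;
        }
        dq.deficit += quantum;
        while (!dq.packets.empty() && dq.packets.front() <= dq.deficit) {
            dq.deficit -= dq.packets.front();
            out.push_back({q, dq.packets.front()});
            dq.packets.pop_front();
        }
        if (dq.packets.empty()) dq.deficit = 0;
    }
    return out;
}
```

The queues grow and shrink with the traffic and the packets have variable sizes, which is exactly the kind of dynamic memory behaviour that makes this small application a useful DM management benchmark.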

Multi-level DMMR has a large impact on energy and performance

- Caches and design time techniques are not the best choice for dynamic systems

- Large differences between DM managers (well-guided customization is needed!)

[Figure: bar chart of energy and execution time (normalized to Region/Partition, axis 0 to 180) for: cache only, design time, runtime Region allocator, runtime Region/Partition. Bar annotations: 30%, 20%, 70% and 18%, 14%, 10%]

Execution time results (cycles) in multiprocessors

• AMBA bus, 3 cycles L2 latency, ARM7TDMI cores:

[Figure: cycles (axis 40000000 to 130000000) vs number of processors (1, 4, 8, 12, 16, 20) for: cache, scratchpad with explicit copies, scratchpad with DMA]

Gains scale and increase further with scratchpad+DMA approach for DM management!

Conclusions

● Fast system design flow for embedded systems thanks to automation and libraries (DDTR, DMMR)
– 2 to 4 weeks for applications of similar complexity
● Promising results in new dynamic multimedia and wireless network applications:
– Significant speedup, and memory footprint and power consumption reductions
– Trade-offs of memory footprint, performance, power consumed in DM management are possible
● Multi-layered DM management proven to be useful (also some companies agree… Infineon)

Future work

● Hardware support of some operations for DM management:
– Significant speedups/power savings thanks to additional operations moved to HW (e.g. pointer-chasing)
– Implementation of some kind of MMU for DM that the DM managers can control to move the data
● Run time switching of DM managers for different scenarios (e.g. just in software or with reconfigurable HW)