Dynamic Memory Management for new embedded systems
David Atienza (DACYA/UCM), Stelios Mamagkakis (VLSI D&T Center, Xanthi),
Marc Leeman (IMEC vzw, Leuven), Francky Catthoor (IMEC vzw, Leuven),
José M. Mendías (DACYA/UCM), Dimitrios Soudris (VLSI D&T Center, Xanthi)
New embedded systems?
- New consumer devices (e.g. mobiles, PDAs):
- Main features: 1) More complex than traditional embedded devices (complex memory hierarchy, CPU, real-time OSes) 2) Portable, with limited battery capacity 3) Must preserve “relatively” high performance (real time) 4) Many applications usually run concurrently
New applications?
- Multimedia and wireless network protocols (e.g. scalable video rendering, 3D virtual reality games, wireless protocols)
- Main features: 1) Complex high-level design (e.g. C++, Java) 2) Very dynamic (variable use of resources) 3) Power hungry 4) Intensive memory use (accesses and footprint)
Why optimize these systems?
- No optimizations (new applications on new platforms): out of memory or battery fast, no real-time behaviour, low battery
Users do not want these problems… new embedded systems need to be optimized!
Outline:
1) Memory subsystem in new embedded devices (static vs dynamic memory allocation)
2) Dynamic Memory management mechanism
3) Dynamic memory subsystem refinement:
3.1) Dynamic Data Type Refinement (Application Level)
3.2) Dynamic Memory Management Refinement (OS Level)
4) Real life case studies and results
5) Enhanced Dynamic Memory Management (Multi-level) – Real life case study and results
6) Conclusions and future work
Memory subsystem in new embedded devices
- Highly optimized:
1) Multi-level memory hierarchy (e.g. caches, etc.)
2) High-performance buses (e.g. AMBA bus)
[Figure: ARM processor core with MMU, instruction cache, data cache and scratchpad memory, connected to main memory through the AMBA bus]
- Not enough… The memory subsystem accounts for up to 70% of total system power and can cause high performance degradation (>20x) [Vijaykrisnan2002, Catthoor2000]
Access and storage optimizations for DM needed!
Static memory vs dynamic memory: Scenario 1 - Compile-time (worst case)
Scalable 3D decoding (per object)
[Figure: memory size reserved for Objects 1-3 over time (t1-t4) at low, medium and high quality, with worst-case allocation at compile time. Verdict: NO!]
Static memory vs dynamic memory: Scenario 2 - Run-time allocation
Scalable 3D decoding (per object)
[Figure: memory size used by Objects 1-5 over time (t1-t5) at low, medium and high quality, with allocation at run time. Verdict: OK!]
Memory usage scales to current input!
Overview of options for memory allocation (results of the 3D Image Reconstruction case study)
• Worst-case static memory solutions are not possible or do not work in extreme cases of input data
• Dynamic solutions achieve better results
• As shown later: well-designed custom solutions can further improve on standard DM mechanisms
DM management works at 2 levels
1) Applications use SW functions, C++: new()/delete()
2) Real-time OS support: DM manager
[Figure: the application issues new(O1), new(O2), new(O3), delete(O2), new(O4), new(O5) over time (t1-t6); the RTOS dynamic memory manager places the data of objects O1-O5 in the heap in main memory. Result: Fragmentation!!]
1) Which parts use the DM SW functions?
Embedded application:
- Algorithms (functionality)
- Static data (e.g. frames)
- Dynamic data (e.g. objects to render)
Dynamic Data Types: structured data (sets of objects), DDT 1 … DDT n, created with new(Object)
[Figure: two examples of two-layer DDTs, built from PA(key), AR(key) and LAR(key) nodes in Layer 1 and Layer 2 that point to data blocks]
2) RTOS support, DM manager to use?
- Partition manager: suitable for one allocation size
  - One fixed block size
  - First fit allocation order
  - Simple: low energy consumption
- Region allocator: suitable for several allocation sizes
  - Many block sizes, doubly-linked list
  - First fit, splitting, coalescing (best fit approximation)
  - Complex: higher energy consumption
[Figure: for each manager, the global manager information followed by free and used blocks (with headers in the region allocator) as new requests arrive. Result: Fragmentation!!]
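The partition manager idea above (one fixed block size, the first free block is taken) can be sketched in a few lines of C++. This is a minimal illustration under our own assumptions, not the code of any particular RTOS: the class name PartitionPool and its methods are hypothetical.

```cpp
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical partition manager: one fixed block size, free blocks kept
// in a singly-linked list threaded through the free blocks themselves.
class PartitionPool {
public:
    PartitionPool(std::size_t block_size, std::size_t block_count)
        : block_size_(round_up(block_size)),
          storage_(block_size_ * block_count),
          free_list_(nullptr),
          free_count_(0) {
        // Push the blocks in reverse so the lowest address is handed out first.
        for (std::size_t i = block_count; i > 0; --i) {
            release(storage_.data() + (i - 1) * block_size_);
        }
    }

    // Returns the first free block, or nullptr if the pool is exhausted.
    void* allocate() {
        if (free_list_ == nullptr) return nullptr;
        Node* node = free_list_;
        free_list_ = node->next;
        --free_count_;
        return node;
    }

    // Returns a block to the pool. All blocks have the same size, so no
    // splitting or coalescing is ever needed.
    void release(void* block) {
        Node* node = new (block) Node{free_list_};
        free_list_ = node;
        ++free_count_;
    }

    std::size_t free_blocks() const { return free_count_; }

private:
    struct Node { Node* next; };

    // Blocks must be able to hold a Node and stay suitably aligned.
    static std::size_t round_up(std::size_t n) {
        const std::size_t a = alignof(std::max_align_t);
        if (n < sizeof(Node)) n = sizeof(Node);
        return (n + a - 1) / a * a;
    }

    std::size_t block_size_;
    std::vector<unsigned char> storage_;
    Node* free_list_;
    std::size_t free_count_;
};
```

Allocation and release are constant-time pointer swaps, which is why such a manager is cheap. The price is that a request larger than the block size cannot be served, and smaller requests waste the rest of the block.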
Design flow for new embedded systems
Specification of the system at very high-level (e.g. C++ or UML)
Refinement of DM subsystem (2 levels):
1. Dynamic Data Types Refinement (DDTR)
2. DM Managers Refinement (DMMR)
Further optimizations in static data
Final implementation
DDTs in new embedded applications
1) Complex control flow (many sub-algorithms) => Many DDTs with different (and irregular) behaviour interacting in time
2) Implementations designed from a functional point of view, not for efficient access or mapping to memory (e.g. clustering of data)
3) Complex implementations => combinations of trees, arrays, single linked lists, doubly linked lists, …
Result: Very expensive (e.g. energy) if not well designed
Proposed Dynamic Data Type Refinement steps
1) Specification of the required DDTs (from the algorithm)
2) Working implementation of DDTs (e.g. C++)
3) Run-time simulation and profiling acquisition
4) Refinement of current design (interaction and implementation of DDTs)
5) Final implementation of DDTs
PROBLEMS:
- Insertion of profiling: error-prone and time-consuming. Automated ways?
- Huge design space of DDTs. High-level estimations?
Solutions proposed:
- Library of auto-profiled DDTs
- Structured reports generation
- Heuristics to limit the exploration
- Analytical high-level estimations
Library of auto-profiled DDTs
1.- Library of complex multi-level DDTs:
– Based on initial exploration in Matlab with traces of real programs
– Multiplatform compatible: ANSI C++ compliant
– Basic data types (e.g. int) or user-defined (e.g. objects)
– List of relevant DDTs (i.e. 14) for multimedia included:
1) Basic data types: single and doubly linked lists, array (AR) and pointer-AR
2) 2-layered combinations of ARs and pointer-ARs
3) Single linked list – AR
4) Doubly linked list – AR
5) Binary trees
6) DDTs with pattern optimizations (e.g. fast access to the last element)
2.- Extension: Mechanism to create new DDTs in the library based on template classes (or “mixins”)
“Mixin” concept and how we use it
● Y. Smaragdakis (2002): A method to specify extensions of classes without defining up-front which classes exactly
● Uses in our library of DM managers:
1) Specifying a subclass while the parent class is a template parameter:
    template<class SuperClass>
    class Mixin : public SuperClass {
      // mixin definitions
    };
2) Using a template class inside another class:
    template<class SuperClass>
    class Myclass {
      SuperClass *data;
      // template class definitions
    };
Examples of DDT definitions:
● Generic and basic DDTs:
    template<int Size, typename T, class Super> class Array { ... };
    template<typename T, class Super> class DLList { ... };
    template<typename T, class Super> class BTTree { ... };
● One-level concrete DDTs:
    class F_array : public Array<256, float> {};
    class I_DLList : public DLList<int, int> {};
    class D_BTTree : public BTTree<double, double> {};
● New multi-level concrete DDTs:
    class ARARInteg : public Array<20, int, Array<128, int> > {};
    class DLLARInteg : public DLList<int, Array<128, int> > {};
    class BTSLLDoub : public BTTree<double, SLList<double> > {};
    class ARARAREmployee : public Array<2, Employee,
        Array<4, Employee, Array<2, Employee> > > {};
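As a self-contained illustration of the two mixin uses above, the following sketch stacks an access-counting layer on top of a basic container. All names here (SimpleArray, AccessCounted, Wrapper) are hypothetical and chosen for this example; they are not the classes of the DDT library, which the slides only show in outline.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical base layer standing in for a basic DDT.
template <typename T>
class SimpleArray {
public:
    void append(const T& value) { items_.push_back(value); }
    const T& get(std::size_t index) const { return items_.at(index); }
    std::size_t size() const { return items_.size(); }
private:
    std::vector<T> items_;
};

// Use 1: mixin layer whose parent class is a template parameter.
// It adds read/write counting to any class offering append() and get().
template <typename T, class SuperClass>
class AccessCounted : public SuperClass {
public:
    void append(const T& value) {
        ++writes_;
        SuperClass::append(value);
    }
    const T& get(std::size_t index) const {
        ++reads_;
        return SuperClass::get(index);
    }
    std::size_t reads() const { return reads_; }
    std::size_t writes() const { return writes_; }
private:
    mutable std::size_t reads_ = 0;  // mutable: get() is const
    std::size_t writes_ = 0;
};

// Use 2: a template class used inside another class (composition).
template <class Inner>
class Wrapper {
public:
    Inner& inner() { return data_; }
private:
    Inner data_;
};

// A concrete type obtained by stacking the layers.
using CountedIntArray = AccessCounted<int, SimpleArray<int> >;
```

Because the layers are resolved at compile time, the compiler can inline the forwarding calls, so the extra layer need not cost a virtual call per access.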
Structured reports of DDTs implementations at run-time
1) Profiling already inserted in all the DDT implementations of the library
2) Information reported at run-time from the DDTs:
  1. Read and write accesses
  2. Memory footprint behaviour
  3. Access pattern to the data => Method calls to data (e.g. sequential)
3) Graphical Tool (based on Gtk/Perl) to perform code parsing and profiling insertion in new DDTs
Our run time exploration of DDTs
1) Heuristics based on clustering of blocks => Possible up to one per DDT
2) Unified exploration loop during usual execution
3) Refinement is done in a post-processing phase.
[Figure: exploration loop. Start exploration -> execution of the application at normal speed with instrumented DDTs (from the library of DDTs for multimedia) -> acquiring profiling (profile objects) -> heuristics evaluation -> finished? No: continue the loop. Yes: exploration finished]
Post-processing phase (refinement)
● Automated refinement process:
- Input: acquired run-time profiling information
- Graphical evaluation tool
- Analytical power model (0.8 to 0.13 tech. node) (based on Cacti v3.0)
- Output: global optimal (Pareto) points, trade-offs: power consumption / memory footprint / execution time
• Further refinements possible with run time behaviour information of DDTs: global control flow simplification => Intermediate Variable Elimination (additional global gains!)
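Selecting Pareto points from the profiled candidates can be sketched as below. This is a generic dominance filter written for illustration, assuming each candidate is summarised by three metrics where lower is better; the DesignPoint record and function names are hypothetical and do not describe the actual post-processing tool.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical summary of one profiled DDT implementation.
// For all three metrics, lower is better.
struct DesignPoint {
    double energy;
    double footprint;
    double time;
};

// True if a is at least as good as b in every metric and strictly
// better in at least one.
inline bool dominates(const DesignPoint& a, const DesignPoint& b) {
    const bool no_worse = a.energy <= b.energy &&
                          a.footprint <= b.footprint &&
                          a.time <= b.time;
    const bool better = a.energy < b.energy ||
                        a.footprint < b.footprint ||
                        a.time < b.time;
    return no_worse && better;
}

// Keeps only the points that no other point dominates (simple O(n^2) scan).
inline std::vector<DesignPoint> pareto_front(const std::vector<DesignPoint>& points) {
    std::vector<DesignPoint> front;
    for (std::size_t i = 0; i < points.size(); ++i) {
        bool dominated = false;
        for (std::size_t j = 0; j < points.size() && !dominated; ++j) {
            if (i != j && dominates(points[j], points[i])) dominated = true;
        }
        if (!dominated) front.push_back(points[i]);
    }
    return front;
}
```

Every point left in the front is a legitimate trade-off: improving one metric from there means giving up another.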
Simplification of control flow: Intermediate Variable Elimination phase
● Interaction between DDTs in a global context:
1) Complex algorithms consist of many smaller ones: data generation and consumption (DSP)
2) Each step performs “some” transformations: filtering of points, proximity, selection, …
3) Injective relationship: remove buffers when the index function is simple enough and no intermediate results are needed later
Very significant additional global gains!
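The effect of removing an intermediate buffer can be seen in a toy producer/consumer pair. The example below is our own illustration of the general idea (a filtering step followed by a consuming step), not the transformation as implemented by the authors; the function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Before: step 1 filters the input into an intermediate dynamic buffer,
// step 2 consumes that buffer.
inline long sum_selected_buffered(const std::vector<int>& input, int threshold) {
    std::vector<int> selected;  // intermediate buffer (extra footprint and accesses)
    for (std::size_t i = 0; i < input.size(); ++i) {
        if (input[i] >= threshold) selected.push_back(input[i]);
    }
    long sum = 0;
    for (std::size_t i = 0; i < selected.size(); ++i) sum += selected[i];
    return sum;
}

// After: each selected element is consumed as soon as it is produced.
// This is valid here because the consumer visits elements in production
// order and nothing reads the buffer later.
inline long sum_selected_fused(const std::vector<int>& input, int threshold) {
    long sum = 0;
    for (std::size_t i = 0; i < input.size(); ++i) {
        if (input[i] >= threshold) sum += input[i];
    }
    return sum;
}
```

The fused version performs no dynamic allocation at all, which removes both the footprint of the buffer and the write and read accesses to it.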
Problems to create custom DM managers
● DM management left to the OS => General-purpose DM managers, not custom ones! (Lea Allocator – Linux-based systems 2003)
● Custom DM managers? No guidelines! Only designers’ experience (try-test phase):
1) Huge design space to manually explore (e.g. organization of memory blocks, fit algorithms) (Wilson et al. 1995)
2) Frameworks to build and profile custom DM managers are not available (Berger2001,Attardi98)
Small example of different choices in DM managers
Several options to decide:– DM manager with information in the blocks?
– DM manager with coalescing service or not?
[Figure: block layouts with headers, comparing one block size vs several block sizes and no coalescing vs coalescing]
Both options are possible; which is the best for the current application?
New methodologies to decide the best options needed!
Proposed DM management methodology
• Main phases:
1) Profiling of the application’s dynamic behavior to detect the most commonly occurring data type accesses
2) Systematic exploration of possible DM management solutions from a structured (orthogonalized) design space, for a certain cost function
3) Efficient code implementation and empirical evaluation of promising DM management solutions using our own high-level C++ library to create them
Proposed Dynamic Memory Management Refinement steps
1) Profiling of application’s dynamic behavior (identification of DDTs access patterns)
2) Exploration of possible DM management solutions for certain constraints (e.g. power, performance, etc.)
3) Implementation and run-time evaluation of promising custom DM management candidates
4) Final selection of custom DM manager
Profiling of application’s DM behavior
Dynamic data = organized data structures: Dynamic Data Types (DDTs)
– Allocation sizes of each structure in the DDTs
– Temporal behavior of each DDT
– Memory footprint of each DDT
– Interaction of the DDTs (spatial locality)
=> From our Dynamic Data Type Refinement
Exploration of possible DM management solutions to minimize memory footprint
1) Definition of structured design space for DM management:
– Orthogonal categories to create custom DM managers
– Categories propagate dependencies and make the design space exploration feasible
2) Definition of a suitable order to traverse the design space so as to reduce one or more cost functions.
Our design space for DM management
• According to basic blocks defined in DM managers
• Orthogonal decision trees inside each category:
A. Creating block structures
– Block structure: DDT options
– Block sizes: one, many
– Block tags: header, boundary tags, none
– Block recorded info: status, size, pointers
– Flexible block size manager: coalesce, split, ask for more memory
B. Pool division based on criterion
– Size: one pool per size, single pool
– Pool structure: DDT options
C. Allocating blocks
– Fit algorithms: exact fit, first fit, next fit, best fit, worst fit
D. Coalescing blocks
– Number of max block sizes: one, many (categories), fixed or not fixed
– When: always, sometimes, never; immediate or deferred
E. Splitting blocks
– Number of min block sizes
– When
Complete DM management design space
● All important state-of-the-art DM managers covered within our DM design space:
– Binary buddy, Double buddy
– Simple segregated fit
– First fit, next fit, best fit allocation orders
– Kingsley allocator (among the fastest, N-Gage)
– Region allocators (fast, embedded RTOS: RTEMS)
– Complex region-segregated fit
– Win XP real time (fast)
– Doug Lea Allocator (Linux, best trade-off)
– Obstacks (custom, optimized for stack behavior, gcc)
– Xalloc (custom, variation of regions-stack, Apache)
– …
Main problem: Huge design space! Order of decision trees?
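To make one leaf of the design space concrete, the sketch below contrasts two of the fit algorithms named above, first fit and best fit. It is a simplified illustration that works on a plain list of free block sizes (the FreeList type and kNoFit value are our own assumptions), whereas a real manager would walk its block headers.

```cpp
#include <cstddef>
#include <vector>

// Sizes (in bytes) of the free blocks, in list order. Hypothetical stand-in
// for a real free list made of block headers.
typedef std::vector<std::size_t> FreeList;

// Returned when no free block can hold the request.
const std::size_t kNoFit = static_cast<std::size_t>(-1);

// First fit: index of the first block that is large enough.
inline std::size_t first_fit(const FreeList& free_blocks, std::size_t request) {
    for (std::size_t i = 0; i < free_blocks.size(); ++i) {
        if (free_blocks[i] >= request) return i;
    }
    return kNoFit;
}

// Best fit: index of the smallest block that is large enough
// (the earliest such block on ties).
inline std::size_t best_fit(const FreeList& free_blocks, std::size_t request) {
    std::size_t best = kNoFit;
    for (std::size_t i = 0; i < free_blocks.size(); ++i) {
        if (free_blocks[i] < request) continue;
        if (best == kNoFit || free_blocks[i] < free_blocks[best]) best = i;
    }
    return best;
}
```

First fit can stop at the first match and so touches fewer blocks, while best fit always scans the whole list but leaves less unused space inside the chosen block. This is the kind of trade-off between accesses and footprint that the exploration has to weigh.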
Interdependencies help to explore the DM management design space
Interdependencies exist between orthogonal trees:
1) A2) Block sizes in the DM manager: one or several?
2) A5) Flexible block size DM manager: coalesce or not?
[Figure: block layouts with headers, for one block size vs several block sizes and for no coalescing vs coalescing]
These interdependencies make the exploration feasible: not all combinations of trees are realistic!
E.g. Final order for reduced memory footprint (interdependencies and factors of influence)
[Figure: the decision trees of the design space, numbered (1) to (12) in the order in which they are traversed]
Our approach to implement and evaluate custom DM managers
1) Object-oriented library to compose DM managers:
- ANSI C++ code with “mixins” that can be efficiently optimised by current compilers (e.g. gcc, Visual C++)
- Custom DM managers composed by basic components (e.g. fit algorithms, memory blocks organizations, etc.)
- Fast creation and debugging of custom DM managers
2) Profiling framework can be easily inserted in the library to profile the candidates (e.g. memory footprint, energy, etc. )
3) Post-processing phase to compare the DM candidates
Advantages of our mixin-based approach for DM managers over traditional implementations
1) Direct equivalences between our DM design space and its implementation classes
2) Independent layers => Good maintainability and fast changes in parts of DM managers (e.g. Lea Alloc., 15000 lines of complex C code)
3) Extensive code reuse => implementation classes reused among different DM managers using common interfaces of methods
4) Profiling can be done in a similar way to DDTs
Graphical example of custom DM manager created by composition of our basic blocks
[Figure: DMMHeap, the final custom DM manager, chooses which manager to use according to allocation size. Building blocks shown: Best Fit, First Fit, doubly-linked list blocks structure, binary tree blocks structure, and OS interface heaps with 4B and 8B physical blocks]
Example of custom DM manager created with our categories of basic blocks
    /* Basic blocks for heap requests to the system */
    template<typename MyT> class BasicHeap : public TypeClass<MyT, mheap> {};

    /* Data types of the Dynamic Memory (DM) manager:
       DLL -> Doubly Linked List and BTT -> Binary Tree */
    template<typename MyT, class SuperClass> class DLList { /* Implem. DLL */ };
    template<typename MyType, class SuperClass> class BTTree { /* Imp. BTT */ };

    /* Two basic data types instantiated for the memory manager */
    class I_DLList : public DLList<int, BasicHeap<int> > {};
    class D_BTTree : public BTTree<double, BasicHeap<double> > {};

    /* DM manager with 2 seg. fit lists of data types, best or first fit policy */
    class DMMHeap : public SegLists<
        list_Sizes,          // List of sizes for segList
        numElemFirstList,    // Number of lists with type 1st segList
        BestFit<I_DLList>,   // 1st segList
        FirstFit<D_BTTree>   // 2nd segList
    > {};
Profiling framework for DM managers
1) Reuse of the interface for the profiling of DDTs
2) Profiling already inserted in all the classes of our library of DM managers
3) Information reported at run-time from the DM managers:
– Memory footprint behaviour
– Access pattern due to DM managers to the data (e.g. allocations/deallocations, etc.)
– Fragmentation in the managers
– Classified for each implementation part of the DM managers (e.g. fit algorithms, internal data structures, etc.)
Code example of integration of profiling objects in our custom DM managers
- Easy insertion of our C++ profiling framework in the original structure of custom DM managers:
    /* Declaration of profile objects of our common profile framework */
    _profile *prof1, *prof2, *prof3;

    class DMMHeap : public SegLists<
        prof1,                       // Profile object at this level
        list_Sizes,                  // List of memory block sizes for the segList
        numElemFirstList,            // Number of lists with type 1st segList
        BestFit<I_DLList<prof2> >,   // 1st segList with profile object
        FirstFit<D_BTTree<prof3> >   // 2nd segList with profile object
    > {};
Few new parts are required!
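The slide code above depends on library classes (SegLists, BestFit, TypeClass, _profile) that are not shown in this transcript. The following stand-alone sketch mimics the same composition style with hypothetical classes of our own: a system heap at the bottom, a profiling mixin layer on top of it, and a composite manager that picks a sub-manager by allocation size. It assumes malloc/free as the system interface and is not the library's implementation.

```cpp
#include <cstddef>
#include <cstdlib>

// Bottom layer: obtains memory from the system (hypothetical stand-in for
// the OS interface heap).
class SystemHeap {
public:
    void* allocate(std::size_t size) { return std::malloc(size); }
    void release(void* block) { std::free(block); }
};

// Profiling mixin layer: counts the requests passing through it and
// forwards them to SuperClass.
template <class SuperClass>
class CountingLayer : public SuperClass {
public:
    void* allocate(std::size_t size) {
        ++allocations_;
        requested_bytes_ += size;
        return SuperClass::allocate(size);
    }
    void release(void* block) {
        ++releases_;
        SuperClass::release(block);
    }
    std::size_t allocations() const { return allocations_; }
    std::size_t releases() const { return releases_; }
    std::size_t requested_bytes() const { return requested_bytes_; }
private:
    std::size_t allocations_ = 0;
    std::size_t releases_ = 0;
    std::size_t requested_bytes_ = 0;
};

// Composite manager: routes each request to one of two sub-managers
// according to the allocation size.
template <std::size_t Threshold, class SmallHeap, class LargeHeap>
class SizeSwitch {
public:
    void* allocate(std::size_t size) {
        return size <= Threshold ? small_.allocate(size) : large_.allocate(size);
    }
    // In this sketch the caller passes the size back so that the same
    // sub-manager is chosen; a real manager would read it from a header.
    void release(void* block, std::size_t size) {
        if (size <= Threshold) small_.release(block);
        else large_.release(block);
    }
    SmallHeap& small_heap() { return small_; }
    LargeHeap& large_heap() { return large_; }
private:
    SmallHeap small_;
    LargeHeap large_;
};

// Example composition: two profiled system heaps split at 64 bytes.
typedef SizeSwitch<64, CountingLayer<SystemHeap>, CountingLayer<SystemHeap> > DemoHeap;
```

Adding or removing the profiling layer only changes the typedef, which is the property the slide points at: the manager's own code stays untouched.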
Case study 1: 3D reconstruction algorithm
Matching of points in successive frames (“like” motion detection)
- 1,500 2D points to ‘match’ on average
- Size: 700000 lines of C++ code
- Sources of uncertainty: 1) Unknown input image sizes 2) Additional intermediate DDTs
Initial and optimised DDTs in the 3D reconstruction algorithm – DDTR phase
● Initial DDTs implementation
● Final DDTs implementation
Results obtained with our DMMR phase
● Memory footprint reduction:
[Chart: Memory footprint of different DM managers (2 frames); y-axis: memory footprint (MBytes), 0 to 2.50; bars: Kingsley (Win32), Regions, Our DM manager]
● Execution time reduced, overhead added to total execution time is not significant: 600 frames, 20 s.
[Chart: Overhead due to DM managers; y-axis: time (secs), 0 to 2; bars: Kingsley (Win32), Regions, Our custom DM manager]
Final results 3D Reconstruction case study
[Chart: Memory; y-axis: normalised memory (B), 0 to 325000; x-axis: implementation (Original, DDT, Manual, Optimised)]
[Chart: Accesses; y-axis: accesses, log scale from 10000 to 100000000; x-axis: implementation (Original, DDT, Manual, Optimised)]
[Chart: Energy; y-axis: energy (nJ), log scale from 1000 to 100000000; x-axis: implementation (Original, DDT, Manual, Optimised)]
Overall gains of almost 2 orders of magnitude!
Case Study 2: Virtual reality game
• Real images interact with 3D generated objects
• Initially designed for embedded devices (e.g. Trimedia 1300)
• Unpredictable behaviour:
– Objects on the screen
– Wall detection
– User movements
[Figure: initial image and processed image]
Overall results trying to minimize power consumption
• Final comparison with original DM implementation:
– Total memory saving: 22.48%
– Total power consumption saving: 75.3%
[Figures: DDTs behaviour with 6 images; global Pareto points for the DDTs]
Case study 3: 3D rendering system
● Scalable rendering of objects according to available system resources (QoS):
[Figure: scalable 3D coding (3D mesh + 2D texture), low-quality decoding vs high-quality decoding]
- Size: 5000 lines of C++ code
- Usual “static” mem. footprint: 12 MB (real scenes with 7 objects)
- Sources of uncertainty: 1) Movements of the user (QoS) 2) Several DDTs: 3D points and 3D faces
Results DDTR reducing memory footprint
● Memory footprint reduction (35%):
[Chart: Memory; y-axis: memory (B), 0 to 12000; x-axis: implementation (Original, Optimised)]
● Energy results with two different memory hierarchies (with and without cache):
[Charts: Energy 1 and Energy 2, labelled “Without cache” and “With cache”; y-axis: energy (nJ), 0 to 16000000; x-axis: implementation (Original, Optimised)]
Very different results!
Results DMMR (reducing memory footprint)
● Memory footprint improved:
[Chart: Memory footprint of different DM managers (5 objects); y-axis: memory footprint (MBytes), 0 to 4.50; bars: Kingsley (Win32), Obstacks, Lea (Linux), Our DM manager]
● Execution may be slower:
[Chart: Execution time of different DM managers; y-axis: time (secs), 0 to 10; bars: Lea (Linux), Kingsley (Win32), Obstacks, Our DM manager]
(Trade-offs between execution time, memory footprint and power)
Final results 3D rendering case study
[Chart: results normalized to the original implementation (scale 0 to 120); series: memory footprint, energy; x-axis: original, after DDTR+OS, manual+OS, after DDTR+DMMR]
- Great gains in memory footprint: almost 60%, but… Important loss in energy consumption!
(Trade-offs exist between memory footprint, energy, etc.)
Multi-level DM management
- Use the whole memory hierarchy:
[Figure: a dynamic embedded application issues new(O1), new(O2), new(O3), free(O2), new(O4), new(O5) … to the RTOS dynamic memory manager, whose heap spans scratchpad memory (low energy) and main memory (high energy); objects O1-O5 are distributed between the two]
Hardware extensions for multi-level DM management: DMA
● DMA-like transfer controller used for DM management, moves blocks between main memory and scratchpad.
– Operates in parallel with the processor.
– More energy-efficient than the processor for handling data transfers.
– Low setup time.
[Figure: ARM core with MMU, caches and scratchpad on the AMBA bus with external memory; new part: DMA transfer controller]
Efficient partition of scratchpad for DM management
Compile time decisions of partition of memory hierarchy for dynamic allocation
[Figure: for each dynamic scenario of a certain application (Dynamic Scenario 1, Dynamic Scenario 2), the DDTs (DDT 1 … DDT n, DDT n+1) are split between main memory + cache and scratchpad + DMA]
Case Study 4: Network scheduling application (simulated on MPARM-Bologna)
● Wireless networks -> Deficit Round Robin (DRR) application (NetBench benchmark suite)
● Forwarding scheduler algorithm in many routers:
[Figure: a stream of packets is distributed into queues of data blocks, which are forwarded at times T1, T2, … T5]
- Size: 500 lines of C++ code
- Sources of uncertainty: 1) Variable number of packets arriving at any moment in time 2) Multiple packet sizes allowed and variable number of queues
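For readers unfamiliar with Deficit Round Robin, the scheduling rule can be sketched as follows. This is a textbook-style formulation written for this transcript, not the NetBench source: the DrrScheduler class, its method names and the use of std::deque for the queues are our own assumptions. The dynamically growing queues are exactly the kind of data whose allocation behaviour the case study targets.

```cpp
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// One forwarded packet: (queue index, packet size in bytes).
typedef std::pair<std::size_t, std::size_t> Sent;

class DrrScheduler {
public:
    DrrScheduler(std::size_t queue_count, std::size_t quantum)
        : quantum_(quantum), queues_(queue_count), deficits_(queue_count, 0) {}

    // Appends a packet of the given size to one queue (dynamic allocation).
    void enqueue(std::size_t queue, std::size_t packet_size) {
        queues_.at(queue).push_back(packet_size);
    }

    // Serves every queue once and returns the packets forwarded in this round.
    std::vector<Sent> run_round() {
        std::vector<Sent> sent;
        for (std::size_t q = 0; q < queues_.size(); ++q) {
            std::deque<std::size_t>& queue = queues_[q];
            if (queue.empty()) continue;
            deficits_[q] += quantum_;  // credit for this round
            // Forward packets while the head packet fits in the credit.
            while (!queue.empty() && queue.front() <= deficits_[q]) {
                deficits_[q] -= queue.front();
                sent.push_back(Sent(q, queue.front()));
                queue.pop_front();
            }
            // An emptied queue does not carry credit into later rounds.
            if (queue.empty()) deficits_[q] = 0;
        }
        return sent;
    }

    std::size_t deficit(std::size_t queue) const { return deficits_.at(queue); }

private:
    std::size_t quantum_;
    std::vector<std::deque<std::size_t> > queues_;
    std::vector<std::size_t> deficits_;
};
```

With a quantum of 500 bytes, a queue holding a 700-byte packet sends nothing in its first round but keeps its credit, so the packet goes out in the second round. This is how DRR stays fair across different packet sizes.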
Multi-level DMMR has a large impact on energy and performance
- Caches and design time techniques are not the best choice for dynamic systems
- Large differences between DM managers (well-guided customization is needed!)
[Chart: energy and execution time, normalized to Region/Partition (scale 0 to 180); configurations: cache only, design time, runtime Region allocator, runtime Region/Partition; annotated differences: 30%, 20%, 70% and 18%, 14%, 10%]
Execution time results (cycles) in multiprocessors
• AMBA bus, 3 cycles L2 latency, ARM7TDMI cores:
[Chart: cycles (40000000 to 130000000) vs number of processors (1, 4, 8, 12, 16, 20); series: Cache, Scratchpad with explicit copies, Scratchpad with DMA]
Gains scale and increase further with scratchpad+DMA approach for DM management!
Conclusions
● Fast system design flow for embedded systems thanks to automation and libraries (DDTR, DMMR)
– 2 to 4 weeks for applications of similar complexity
● Promising results in new dynamic multimedia and wireless network applications:
– Significant speedup, and memory footprint and power consumption reductions
– Trade-offs of memory footprint, performance, power consumed in DM management are possible
● Multi-layered DM management proven to be useful (also some companies agree… Infineon)
Future work
● Hardware support of some operations for DM management:
– Significant speedups/power savings thanks to additional operations moved to HW (e.g. pointer-chasing)
– Implementation of some kind of MMU for DM that the DM managers can control to move the data
● Run time switching of DM managers for different scenarios (e.g. just in software or with reconfigurable HW)