
Page 1

The Rise and Fall of Scratchpad Memories

Aviral Shrivastava
Compiler Microarchitecture Lab
Arizona State University

Page 2
Web page: aviral.lab.asu.edu

Remember - It is all about Memory!

- First Generation: ENIAC, UNIVAC - no memory
- Second Generation: IBM 7000 series - magnetic core memory
- Third Generation: IBM 360 - semiconductor memory
- Fourth Generation: PC and onwards - VLSI memory
- First documented use of a cache: IBM 360* - "to bridge the speed gap between processor and memory"
- Since then, caches may be the most important feature in a processor; in the Itanium 2, cache and cache-like structures account for more than 90% of transistors by count, 70% of the chip by area, 50% of power, and 80% of leakage

10/8/13

*IBM (June, 1968), IBM System/360 Model 85 Functional Characteristics, SECOND EDITION, A22-6916-1.

Computer Architecture and Networks

First Generation (1945-1958)...

1943-46: ENIAC (Electronic Numerical Integrator and Calculator), built by J. Mauchly and J. Presper Eckert, was the first general-purpose electronic computer. Built to calculate trajectories for ballistic shells during WWII, it was programmed by setting switches and plugging and unplugging cables. It used 18,000 tubes, weighed 30 tons, and consumed 160 kilowatts of electrical power. The size of its numerical word was 10 decimal digits, and it could perform 5,000 additions and 357 multiplications per second.

Page 3

SPMs for Power, Performance, and Area

[Figure: energy per access (nJ) vs. memory size (256 to 16384 bytes), comparing a scratchpad against 2-way caches with 1 MB, 16 MB, and 4 GB address spaces]

[Figure: a cache needs a data array, tag array, tag comparators, muxes, and an address decoder; an SPM needs only the data array and address decoder]

- 40% less energy compared to a cache [Banakar02]: absence of tag arrays, comparators, and muxes
- 34% less area compared to a cache of the same size [Banakar02]: simple hardware design (only a memory array and address-decoding circuitry), so it is simpler and cheaper to build and verify

Page 4

SPMs became popular in embedded systems
- DSPs have used SPMs for a long time: the TI-99/4A, released in 1981, had 256 bytes of SPM
- Gaming consoles regularly use SPMs: SuperH in the Sega Saturn; the PS1 could use SPM for stack data; the PS2 has a 16KB SPM; in the PS3, each SPU has a 256KB SPM
- Network and graphics processors: Intel network processors, and Nvidia Tesla
- Many embedded processors used line locking: ColdFire MCF5249, PowerPC440, MPC5554, ARM940, and ARM946E-S
- Several versions of ARM and Renesas processors have SPMs: ARM supports up to 4M of SPM

[Images: Sony PlayStation, Sega Saturn]

Page 5

Using SPMs in Embedded Systems

[Figure: ARM memory architecture - an ARM core with SPM and cache, connected by DMA to global memory]

- Programs work without using the SPM; the SPM is there for optimization, to improve power and performance
- Frequently used data (typically arrays) is placed in the SPM using a linker script

All of this was done manually!
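For instance (a sketch only - the region names, addresses, and sizes below are invented for illustration, not taken from a particular chip), a GNU ld linker-script fragment that pins tagged data into an on-chip SPM region might look like:

```
MEMORY
{
    SPM (rwx) : ORIGIN = 0x40000000, LENGTH = 64K   /* on-chip scratchpad */
    RAM (rwx) : ORIGIN = 0x80000000, LENGTH = 64M   /* off-chip global memory */
}

SECTIONS
{
    .spm_data : { *(.spm_data) } > SPM   /* objects tagged .spm_data go to SPM */
    .data     : { *(.data) }     > RAM   /* everything else stays off-chip */
}
```

In C, a frequently used array could then be tagged with `__attribute__((section(".spm_data"))) int coeff[256];` so the linker places it in the scratchpad.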

Page 6

Compilers for using SPMs
- As applications became more complex, it was no longer easy to identify what should be mapped to the SPM
- Compiler techniques to use SPMs in embedded systems:
  - Global data: Panda97, Brockmeyer03, Avissar02, Gao05, Kandemir02, Steinke02, Grosslinger09
  - Code: Janapsatya06, Egger06, Angiolini04
  - Stack: Udayakumaran06, Dominguez05
  - Heap: Dominguez05, Mcllroy08

Page 7

Compilers to use SPMs
- In general-purpose systems, Kennedy proposed using the SPM for register spills
- SPMs have largely remained in embedded systems; they are not popular in general-purpose computing

Not much work - because caches keep programming and debugging simple

Times are a-changing...

Page 8

Inevitable march to multi-cores
- Marketing needs: Moore's law
- Real needs: temperature and power problems
  - Microarchitecture level: hotspots
  - Chip level: cooling efficiency
  - System level: total power consumption
- Multi-cores are the only way to improve performance without much increase in power
- Multi-cores also reduce design complexity, spread heat to alleviate hotspots, and improve reliability through redundancy

Page 9

But... how do you scale the memory?
- Coherent-cache architectures (the current path): you can still write programs as in the uni-core era, but coherency overheads do not scale - Tilera64 needs a whole separate mesh network for coherence traffic (http://www.theinquirer.net/inquirer/news/1006963/tilera-releases-core-chip)
- Non-coherent cache architectures: the 48-core Single-chip Cloud Computer (SCC)
- Partly coherent: the TI-6678 is vertically coherent, but not horizontally coherent
- Hybrid: locally coherent, but globally non-coherent
- In every case, caches still consume a very significant amount of power

Page 10

Software Managed Memory (SMM) Architecture

[Figure: IBM Cell architecture - the PPE and SPEs 0-7 sit on the Element Interconnect Bus (EIB), with off-chip global memory. PPE: Power Processor Element; SPE: Synergistic Processor Element; LS: Local Store; each SPE contains an SPU and its LS]

- Cores have small local memories (scratchpads)
- A core can only access its local memory
- Accesses to global memory go through explicit DMAs in the program
- e.g. the IBM Cell architecture, which is in the Sony PS3

Page 11

SMM Execution
- Task-based programming, MPI-like communication

Main core:
  #include <libspe2.h>
  extern spe_program_handle_t hello_spu;
  int main(void) {
      int speid, status;
      speid = spe_create_thread(&hello_spu);
  }

Each local core:
  #include <spu_mfcio.h>
  int main(speid, argp) {
      printf("Hello world!\n");
  }

- Extremely power-efficient computation, if all code and data fit into the local memories of the cores

Processor                           Fab    Frequency   GFlops   Power   Power Efficiency (GFlops/W)
Cell/B.E.                           45nm   3.2 GHz     230      50 W    4.6
Intel i7 4-core Bloomfield 965 XE   45nm   3.2 GHz     70       130 W   0.5

Page 12

SMM memory organization

[Figure: ARM memory architecture (core with SPM and cache, DMA to global memory) next to the IBM Cell memory architecture (SPE with SPM only, DMA to global memory)]

In the ARM architecture the SPM is for optimization; in the IBM Cell the SPM is essential.
- Dynamic code/data management is needed
- All code/data must be managed

Previous works are not directly applicable.

Page 13

How to manage data within a core?

Original code:
  int global;
  f1() {
      int a, b;
      global = a + b;
      f2();
  }

Local-memory-aware code:
  int global;
  f1() {
      int a, b;
      DMA.fetch(global);
      global = a + b;
      DMA.writeback(global);
      DMA.fetch(f2);
      f2();
  }

Page 14

Data Management in LLM multicores
- Manage any amount of heap, stack, and code in each core of an LLM multi-core
- Global data: if small, can be permanently located in the local memory
- Stack data: 'liveness' depends on the call path; a function's stack-frame size is known at compile time, but the stack depth is not
- Heap data: dynamic, and its size can be unbounded
- Code: statically linked
- Our strategy: partition the local memory into regions for each kind of data, and manage each kind of data in a constant amount of space

[Figure: local memory partitioned into code, global, stack, and heap regions, with stack and heap data overflowing to global memory]

Page 15

Stack Management: Problem

Function   Frame Size (bytes)
F1         28
F2         40
F3         60
F4         54

Local memory size = 128 bytes. The frames of F1, F2, and F3 fill the local memory exactly (28 + 40 + 60 = 128 bytes), so when F4 is called there is no room for its 54-byte frame: older frames must be moved out to global memory at the global stack pointer.

[Figure: local memory holding F1, F2, and F3 up to the 128-byte limit, with F4's frame placed in global memory]

Page 16

Stack Management: Solution
- Keep the active portion of the stack in the local memory; the granularity of stack frames is chosen to minimize management overhead
- It is a dynamic software technique:
  - fci(func_stack_size): check for available space in the local memory, and move old frame(s) to global memory if needed
  - fco(): check whether the caller's frame exists in the local memory, and fetch it from global memory if it is absent
- A modified GCC 4.1.1 inserts the calls, and the executable links against a runtime library providing void fci(int func_stack_size) and void fco()

C source:
  F1() { int a, b; F2(); }
  F2() { F3(); }
  F3() { int j = 30; }

Instrumented code:
  F1() { int a, b; fci(F2); F2(); fco(F1); }
  F2() { fci(F3); F3(); fco(F2); }
  F3() { int j = 30; }
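The fci()/fco() bookkeeping above can be sketched in plain C. This is an invented illustration under stated assumptions (the 128-byte region size from the earlier example, integer frame-size tables, oldest-first eviction), not the actual runtime library; a real implementation moves the frame bytes themselves with DMA rather than only tracking sizes:

```c
#include <string.h>

/* Sketch of the fci()/fco() stack manager. Frame tables and the
 * oldest-first eviction policy are assumptions for illustration. */
#define STACK_REGION 128
#define MAX_FRAMES   16

int spm[MAX_FRAMES], spm_top = 0, spm_used = 0;  /* frame sizes resident in SPM */
int glb[MAX_FRAMES], glb_top = 0;                /* frame sizes evicted to global memory */

/* fci: before a call, evict the oldest frames until the callee fits */
void fci(int func_stack_size) {
    while (spm_used + func_stack_size > STACK_REGION && spm_top > 0) {
        glb[glb_top++] = spm[0];               /* DMA the oldest frame out */
        spm_used -= spm[0];
        memmove(spm, spm + 1, --spm_top * sizeof(int));
    }
    spm[spm_top++] = func_stack_size;          /* callee frame now resident */
    spm_used += func_stack_size;
}

/* fco: after a return, fetch the caller's frame back if it was evicted */
void fco(void) {
    spm_used -= spm[--spm_top];                /* pop the callee frame */
    if (spm_top == 0 && glb_top > 0) {         /* caller frame is absent */
        spm[spm_top++] = glb[--glb_top];       /* DMA it back in */
        spm_used += spm[0];
    }
}
```

Running the slide's sequence (F1 = 28, F2 = 40, F3 = 60, F4 = 54 bytes) evicts F1 and F2 when F4 is called, matching the earlier figure.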

Page 17

Code Management: Problem
- Static compilation: functions need to be linked before execution
- Divide the code part of the SPM into regions
- Map functions to these SPM regions
- Functions mapped to the same region replace each other

[Figure: the code section of local memory divided into regions]

Page 18

Code Management: Solution

(a) Application call graph: F1 calls F2 and F3
(b) Linker script:
  SECTIONS {
      OVERLAY { F1.o F3.o }
      OVERLAY { F2.o }
  }
(c) Local memory: one code region shared by F1 and F3, and another region for F2
(d) Global memory: holds the global, stack, and heap data plus copies of F1, F2, and F3

- The number of regions and the function-to-region mapping must be chosen; the two extreme cases are a single region (least space, most transfers) and one region per function (most space, least transfers)
- Careful code placement is needed - the problem is NP-complete: minimize data transfer within a given space
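Since optimal placement is NP-complete, practical tools use heuristics. The following is an invented greedy sketch, not the talk's or paper's algorithm: start with one region per function and repeatedly merge the two regions with the fewest combined calls (a crude proxy for transfer cost) until the total region size fits the SPM code budget; a merged region is as large as its largest member, whose functions then evict one another:

```c
/* Invented greedy sketch of function-to-region mapping. */
typedef struct { int size; long calls; } Region;

/* Merges regions in place; returns how many regions remain. */
int map_functions(Region r[], int n, int budget) {
    for (;;) {
        int total = 0;
        for (int i = 0; i < n; i++) total += r[i].size;
        if (total <= budget || n <= 1) return n;

        /* find the pair of regions with the fewest combined calls */
        int a = 0, b = 1;
        long best = r[0].calls + r[1].calls;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (r[i].calls + r[j].calls < best) {
                    best = r[i].calls + r[j].calls;
                    a = i; b = j;
                }

        /* merge b into a: size = max of the two, call counts add up */
        if (r[b].size > r[a].size) r[a].size = r[b].size;
        r[a].calls += r[b].calls;
        r[b] = r[--n];   /* drop slot b */
    }
}
```

The cost model here is deliberately crude; the real problem also accounts for which functions actually interfere on the call paths.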

Page 19

Heap Data Management

  typedef struct {
      int id;
      float score;
  } Student;

  main() {
      for (i = 0; i < N; i++) {
          student[i] = malloc(sizeof(Student));
      }
      for (i = 0; i < N; i++) {
          student[i]->id = i;
      }
  }

Heap size = 32 bytes, sizeof(Student) = 16 bytes.
- malloc() allocates space in the local memory
- A new malloc() may need to evict older heap objects to global memory, and may need to allocate more global memory

[Figure: heap pointer (HP) in local memory; by malloc3, older objects have been evicted to global memory at GM_HP]
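The evicting malloc() described above can be sketched as follows. Everything here is an assumption for illustration (fixed 16-byte chunks, the slide's 32-byte SPM heap, memcpy standing in for DMA); it is not the paper's _malloc implementation:

```c
#include <string.h>

/* Sketch of an SPM heap manager: objects come from a tiny SPM heap;
 * when it is full, the oldest object is "DMAed" to a global buffer. */
#define SPM_HEAP 32
#define CHUNK    16
#define MAX_OBJ  64

char spm_heap[SPM_HEAP];
char global_heap[MAX_OBJ * CHUNK];
int  spm_count = 0;      /* chunks resident in SPM */
int  evict_count = 0;    /* chunks evicted to global memory */

/* allocate one fixed-size chunk, evicting the oldest if the SPM is full */
void *spm_malloc(void) {
    if (spm_count * CHUNK >= SPM_HEAP) {
        /* evict oldest chunk to global memory, slide the rest down */
        memcpy(global_heap + evict_count * CHUNK, spm_heap, CHUNK);
        evict_count++;
        memmove(spm_heap, spm_heap + CHUNK, (spm_count - 1) * CHUNK);
        spm_count--;
    }
    return spm_heap + (spm_count++) * CHUNK;
}
```

Note that eviction slides the surviving objects down, so local addresses handed out earlier become stale - a hazard of the same flavor as the pointer threat on the following pages.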

Page 20

Pointer Threat: Problem

  F1() { int a = 5, b; fci(F2); F2(&a); fco(F1); }
  F2(int *a) { fci(F3); F3(a); fco(F2); }
  F3(int *a) { int j = 30; *a = 100; }

With a 100-byte stack region, the frames of F1 (50 bytes), F2 (20 bytes), and F3 (30 bytes) all fit, so F3's dereference of &a finds "a" in the local memory. With a 70-byte stack region, F1's frame has been evicted to global memory by the time F3 runs, so the same local-memory address now holds other data and F3 writes 100 to the wrong value of "a".

[Figure: the two local-memory layouts side by side; in the 70-byte case F1's frame is marked EVICTED]

Page 21

Pointer Threat: Resolution

Instrumented code:
  F1() { int a = 5, b; fci(F2); F2(&a); fco(F1); }
  F2(int *a) { fci(F3); F3(a); fco(F2); }
  F3(int *a) { int j = 30; t = g2l(a); *t = 100; l2g(a, t); }

Every access through a pointer is rewritten to go through translation functions:
  *ptr = val;   becomes   tptr = _g2l(ptr); *tptr = val; _l2g(ptr, tptr);
  val = *ptr;   becomes   tptr = _g2l(ptr); val = *tptr;
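The g2l/l2g rewriting can be illustrated with a small model. The address layout, region bounds, and bounce buffer below are invented for this sketch and differ from the real _g2l/_l2g API (which also takes a size and a write flag):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of g2l/l2g pointer translation. A global stack address is
 * translated to its SPM copy when the owning frame is resident;
 * otherwise the data is fetched into a bounce buffer, and l2g
 * writes it back out. */
#define GLOBAL_BASE 0x1000u
#define SPM_SIZE    128

char     spm[SPM_SIZE];              /* resident stack frames */
char     global_mem[SPM_SIZE];       /* evicted frames live here */
uint32_t resident_lo, resident_hi;   /* byte offsets currently resident */
uint32_t fetch_buf[2];               /* bounce buffer for evicted data */

void *g2l(uint32_t gaddr) {
    uint32_t off = gaddr - GLOBAL_BASE;
    if (off >= resident_lo && off < resident_hi)
        return spm + (off - resident_lo);        /* frame is in SPM */
    memcpy(fetch_buf, global_mem + off, sizeof fetch_buf); /* "DMA" read */
    return fetch_buf;
}

/* write back the data if it went through the bounce buffer */
void l2g(uint32_t gaddr, void *laddr) {
    if (laddr == (void *)fetch_buf)
        memcpy(global_mem + (gaddr - GLOBAL_BASE), fetch_buf, sizeof fetch_buf);
}
```

The key property is that the rewritten code never dereferences a raw global address: resident frames are served straight from the SPM, and evicted ones round-trip through the buffer.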

Page 22

How to evict data to global memory?
- Can use DMA to transfer a heap object to global memory; DMA is very fast, with no core-to-core communication
- But eventually you can overwrite some other core's data, so mediation is needed

[Figure: several execution cores issuing malloc requests to the main core, which mediates allocation in global memory; the data itself moves by DMA]

Page 23

Compiler and Runtime Infrastructure
- Our infrastructure includes:
  - a code-overlay-script generating tool,
  - a runtime library implementing the API,
  - a compiler that inserts the API functions into the application.

[Figure: SPE source goes through the API-inserting compiler to SPE objects; the code-overlay-script generating tool produces a linker script; the SPE linker combines both with the runtime library into the SPE executable]

Runtime library API:
  void *_malloc(int size, int chunkSize);
  void _free(void *ppeAddr);
  void _fci(int func_stack_size);
  void _fco();
  void *_g2l(void *ppeAddr, int size, int wrFlag);
  void *_l2g(void *ppeAddr, void *speAddr, int size);

Page 24

Experimental Setup
- Sony PlayStation 3 running Fedora Core 9 Linux; only 6 SPEs are available
- MiBench benchmark suite and some other applications
- Runtimes are measured with spu_decrementer() on the SPEs and _mftb() on the PPE, both provided with the IBM Cell SDK 3.1
- The GCC compiler patch can be downloaded from http://aviral.lab.asu.edu/?p=95

Page 25

Results

Enables execution for arbitrary stack sizes - but with quite high overheads!

[Figure: log of runtime (us) vs. parameter n, with and without stack management]

  int rcount(int n) {
      if (n == 0) return 0;
      return rcount(n - 1) + 1;
  }

At n = 3842, the program without management crashes - there is no space left in the local memory for the stack. Our technique works for arbitrary stack sizes.

Page 26

How does it work?
- Pretty badly!! Several programs run, but with high overhead; several programs still do not run
- Remaining problems: the pointer problem, how to evict to global memory, and reducing the overheads (the number of times the API functions are called, and the number of times DMA is performed)
- Good news: it only gets better from here!

Page 27

Reduce Data Transfer Overhead

  malloc() {
      if (enough space in global memory)
          write heap data using DMA
      else
          request more space in global memory
  }

The execution thread on the execution core uses mailbox-based communication to ask the main core to allocate at least S bytes of global memory (startAddr to endAddr); the heap data itself is then written from local memory to global memory by DMA.

[Figure: execution cores, main core, and global memory, with malloc requests mediated by the main core and data moved by DMA]

Page 28

Improving Stack Management
- Opportunities to reduce repeated API calls by consolidation

Sequential calls:
  original:      F1(); F2();
  naive:         fci(F1); F1(); fco(F0); fci(F2); F2(); fco(F0);
  consolidated:  fci(max(F1, F2)); F1(); F2(); fco(F0);

Nested call:
  original:      F1() { F2(); }
  naive:         fci(F1); F1() { fci(F2); F2(); fco(F1); } fco(F0);
  consolidated:  fci(F1 + F2); F1() { F2(); } fco(F0);

Call in a loop:
  original:      while (<condition>) { F1(); }
  naive:         while (<condition>) { fci(F1); F1(); fco(F0); }
  consolidated:  fci(F1); while (<condition>) { F1(); } fco(F1);

Page 29

Find optimal stack management points
- Function-frame movement can be consolidated; frames do not need to be moved at every function call
- Formulate the problem as inserting cuts in the GCCFG; at each cut, dump the SPM contents into global memory

[Figure: GCCFG with nodes main (128), print (32), stream (1936), init (0), update (160), final (80), and transform (352); edges are labeled with call counts (0, 1, 10, 100), and four candidate cuts are marked]

Page 30

More Stack Management Optimizations
- Movement of function frames is the biggest contributor: consolidate management for multiple functions
- Pointer management: reduce the number of times p2s is called
  - If a stack variable is used continuously, perform p2s only once
  - If the stack variable belongs to a function that is in the SPM, p2s is not needed
- Reduce the instructions in the management functions: SPM-level management is simpler, and with less fragmentation the management code is smaller

Page 31

Efficient Execution
- Very few fci and fco calls inserted
- Fewer g2l calls
- Fewer instructions executed at every management point

Page 32

Overheads

Table 3: Number of sstore/fci and sload/fco calls

Benchmark         sstore/fci       sload/fco
                  CSM      SSDM    CSM      SSDM
BasicMath         40012    0       40012    0
Dijkstra          60365    202     60365    202
FFT               7190     8       7190     8
FFT inverse       7190     8       7190     8
SHA               57       2       57       2
String Search     503      143     503      143
Susan Edges       776      1       776      1
Susan Smoothing   112      2       112      2

Table 4: Code size of stack manager (in bytes)

       sstore/fci   sload/fco   l2g   g2l    wb
CSM    2404         1900        96    1024   1112
SSDM   184          176         24    120    80

Table 5: Dynamic instructions per function

       sstore/fci   sload/fco   l2g   g2l         wb
       F      NF    F      NF         hit   miss  hit   miss
CSM    180    100   148    95   24    45    76    60    34
SSDM   46     0     44     0    6     11    30    4     20

* F: the stack region is full when the function is called; NF: the stack region has enough space for the incoming function frame.

Table 6: Number of pointer mgmt. function calls

            l2g             g2l              wb
            CSM     SSDM    CSM      SSDM    CSM     SSDM
BasicMath   37010   0       123046   0       89026   0
SHA         2       2       163      158     68      68
Edges       1       0       515      0       514     0
Smoothing   1       0       515      0       514     0

* Edges - Susan Edges, Smoothing - Susan Smoothing

Only four of our eight applications contain pointers to stack data. We can observe that our scheme slightly improves the performance of SHA, and totally eliminates the pointer management functions for the other three benchmarks.

More results: Besides comparing results between SSDM and CSM, we also examined the impact of different stack space sizes, the scalability of our heuristic, and discussed our SSDM against a cache design. We found that i) performance improves as we increase the space for stack data, ii) our SSDM scales well with different numbers of cores, and iii) the penalty of management is much less with our SSDM than with a hardware cache. The detailed results are presented in the Appendix, sections F, G, and H.

9. CONCLUSION

Scratchpad-based Multicore Processor (SMP) architectures are promising, since they are more scalable. However, since scratchpad memory cannot always accommodate the whole program, certain schemes are required to manage the code, global data, stack data, and heap data of the program to enable its execution. The main focus of this paper is on managing stack data, since the majority of accesses in embedded applications may be to stack variables. Even assuming other data are properly managed by other schemes, managing stack data is especially challenging. In this paper, we formulated the problem of efficiently placing library functions at function call sites. In addition, we proposed a heuristic algorithm, SSDM, to generate an efficient function placement. As for pointers to stack data, a scheme was presented to reduce the management cost. Our experimental results show that SSDM generates function placements that lead to significant performance improvement compared to CSM.

10. REFERENCES

[1] "GCC Internals". http://gcc.gnu.org/onlinedocs/gccint/.
[2] Intel Core i7 Processor Extreme Edition and Intel Core i7 Processor Datasheet, Volume 1. White paper, Intel.
[3] Raw Performance: SiSoftware Sandra 2010 Pro (GFLOPS).
[4] SPU C/C++ Language Extensions. Technical report.
[5] The SCC Programmer's Guide. Technical report.
[6] Compilers: Principles, Techniques, and Tools. Addison Wesley, 1986.
[7] F. Angiolini, F. Menichelli, A. Ferrero, L. Benini, and M. Olivieri. A Post-Compiler Approach to Scratchpad Mapping of Code. In Proc. CASES, pages 259-267, 2004.
[8] O. Avissar, R. Barua, and D. Stewart. An Optimal Memory Allocation Scheme for Scratch-pad-based Embedded Systems. ACM TECS, 1(1):6-26, 2002.
[9] K. Bai and A. Shrivastava. Heap Data Management for Limited Local Memory (LLM) Multi-core Processors. In Proc. CODES+ISSS, 2010.
[10] K. Bai, A. Shrivastava, and S. Kudchadker. Stack Data Management for Limited Local Memory (LLM) Multi-core Processors. In Proc. ASP-DAC, pages 231-234, 2011.
[11] M. A. Baker, A. Panda, N. Ghadge, A. Kadne, and K. S. Chatha. A Performance Model and Code Overlay Generator for Scratchpad Enhanced Embedded Processors. In Proc. CODES+ISSS, pages 287-296, 2010.
[12] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, and P. Marwedel. Scratchpad Memory: Design Alternative for Cache On-chip Memory in Embedded Systems. In Proc. CODES+ISSS, pages 73-78, 2002.
[13] A. Dominguez, S. Udayakumaran, and R. Barua. Heap Data Allocation to Scratch-pad Memory in Embedded Systems. J. Embedded Comput., 1(4):521-540, 2005.
[14] B. Egger, C. Kim, C. Jang, Y. Nam, J. Lee, and S. L. Min. A Dynamic Code Placement Technique for Scratchpad Memory Using Postpass Optimization. In Proc. CASES, pages 223-233, 2006.
[15] B. Flachs et al. The Microarchitecture of the Synergistic Processor for a Cell Processor. IEEE J. Solid-State Circuits, 41(1):63-70, 2006.
[16] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. MiBench: A Free, Commercially Representative Embedded Benchmark Suite. In Proc. Workload Characterization, pages 3-14, 2001.
[17] A. Janapsatya, A. Ignjatovic, and S. Parameswaran. A Novel Instruction Scratchpad Memory Optimization Method Based on Concomitance Metric. In Proc. ASP-DAC, pages 612-617, 2006.
[18] S. C. Jung, A. Shrivastava, and K. Bai. Dynamic Code Mapping for Limited Local Memory Systems. In Proc. ASAP, pages 13-20, 2010.
[19] M. Kandemir and A. Choudhary. Compiler-directed Scratch pad Memory Hierarchy Design and Management. In Proc. DAC, pages 628-633, 2002.
[20] M. Kandemir, J. Ramanujam, J. Irwin, N. Vijaykrishnan, I. Kadayif, and A. Parikh. Dynamic Management of Scratch-pad Memory Space. In Proc. DAC, pages 690-695, 2001.
[21] M. Kistler, M. Perrone, and F. Petrini. Cell Multiprocessor Communication Network: Built for Speed. IEEE Micro, 26(3):10-23, May 2006.
[22] L. Li, L. Gao, and J. Xue. Memory Coloring: A Compiler Approach for Scratchpad Memory Management. In Proc. PACT, pages 329-338, 2005.
[23] M. Mamidipaka and N. Dutt. On-chip Stack Based Memory Organization for Low Power Embedded Architectures. In Proc. DATE, pages 1082-1087, 2003.
[24] R. Mcllroy, P. Dickman, and J. Sventek. Efficient Dynamic Heap Allocation of Scratch-pad Memory. In Proc. ISMM, pages 31-40, 2008.
[25] N. Nguyen, A. Dominguez, and R. Barua. Memory Allocation for Embedded Systems with a Compile-time-unknown Scratch-pad Size. In Proc. CASES, pages 115-125, 2005.
[26] P. Panda, N. D. Dutt, and A. Nicolau. On-chip vs. Off-chip Memory: the Data Partitioning Problem in Embedded Processor-based Systems. ACM TODAES, pages 682-704, 2000.
[27] S. Park, H.-W. Park, and S. Ha. A Novel Technique to Use Scratch-pad Memory for Stack Management. In Proc. DATE, pages 1478-1483, 2007.
[28] F. Poletti, P. Marchal, D. Atienza, L. Benini, F. Catthoor, and J. M. Mendias. An Integrated Hardware/Software Approach for Run-time Scratchpad Management. In Proc. DAC, pages 238-243, 2004.
[29] A. Shrivastava, A. Kannan, and J. Lee. A Software-only Solution to Use Scratch Pads for Stack Data. IEEE TCAD, 28(11):1719-1728, 2009.
[30] J. E. Smith. A Study of Branch Prediction Strategies. In Proc. ISCA, pages 135-148, 1981.
[31] S. Udayakumaran, A. Dominguez, and R. Barua. Dynamic Allocation for Scratch-pad Memory Using Compile-time Decisions. ACM TECS, 5(2):472-511, 2006.


[14] B. Egger, C. Kim, C. Jang, Y. Nam, J. Lee, and S. L. Min. ADynamic Code Placement Technique for Scratchpad MemoryUsing Postpass Optimization. In Proc. CASES, pages 223–233,2006.

[15] B. Flachs at el. The Microarchitecture of the SynergisticProcessor for A Cell Processor. IEEE Solid-state circuits,41(1):63–70, 2006.

[16] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin,T. Mudge, and R. B. Brown. Mibench: A Free, CommerciallyRepresentative Embedded Benchmark Suite. Proc. WorkloadCharacterization, pages 3–14, 2001.

[17] A. Janapsatya, A. Ignjatovic, and S. Parameswaran. A NovelInstruction Scratchpad Memory Optimization Method Based onConcomitance Metric. In Proc. ASP-DAC, pages 612–617, 2006.

[18] S. c. Jung, A. Shrivastava, and K. Bai. Dynamic Code Mappingfor Limited Local Memory Systems. In Proc. ASAP, pages13–20, 2010.

[19] M. Kandemir and A. Choudhary. Compiler-directed Scratch padMemory Hierarchy Design and Management. In Proc. DAC,pages 628–633, 2002.

[20] M. Kandemir, J. Ramanujam, J. Irwin, N. Vijaykrishnan,I. Kadayif, and A. Parikh. Dynamic Management of Scratch-padMemory Space. In Proc. DAC, pages 690–695, 2001.

[21] M. Kistler, M. Perrone, and F. Petrini. Cell MultiprocessorCommunication Network: Built for Speed. IEEE Micro,26(3):10–23, May 2006.

[22] L. Li, L. Gao, and J. Xue. Memory Coloring: A CompilerApproach for Scratchpad Memory Management. In Proc. PACT,pages 329–338, 2005.

[23] M. Mamidipaka and N. Dutt. On-chip Stack Based MemoryOrganization for Low Power Embedded Architectures. In Proc.DATE, pages 1082–1087, 2003.

[24] R. Mcllroy, P. Dickman, and J. Sventek. E�cient Dynamic HeapAllocation of Scratch-pad Memory. In ISMM, pages 31–40, 2008.

[25] N. Nguyen, A. Dominguez, and R. Barua. Memory Allocationfor Embedded Systems with A Compile-time-unknownScratch-pad Size. In Proc. CASES, pages 115–125, 2005.

[26] P. Panda, N. D. Dutt, and A. Nicolau. On-chip vs. O↵-chipMemory: the Data Partitioning Problem in EmbeddedProcessor-based Systems. In ACM TODAES, pages 682–704,2000.

[27] S. Park, H.-w. Park, and S. Ha. A Novel Technique to UseScratch-pad Memory for Stack Management. In Proc. DATE,pages 1478–1483, 2007.

[28] F. Poletti, P. Marchal, D. Atienza, L. Benini, F. Catthoor, andJ. M. Mendias. An Integrated Hardware/Software Approach forRun-time Scratchpad Management. In Proc. DAC, pages238–243, 2004.

[29] A. Shrivastava, A. Kannan, and J. Lee. A Software-only Solutionto Use Scratch Pads for Stack Data. IEEE TCAD,28(11):1719–1728, 2009.

[30] J. E. Smith. A Study of Branch Prediction Strategies. In Proc.of ISCA, pages 135–148, 1981.

[31] S. Udayakumaran, A. Dominguez, and R. Barua. DynamicAllocation for Scratch-pad Memory Using Compile-timeDecisions. ACM TECS, 5(2):472–511, 2006.

Table 3: Number of sstore/ fci and sload/ fco Calls

Benchmark

sstore/ fci sload/ fco

CSM SSDM CSM SSDM

BasicMath 40012 0 40012 0

Dijkstra 60365 202 60365 202

FFT 7190 8 7190 8

FFT inverse 7190 8 7190 8

SHA 57 2 57 2

String Search 503 143 503 143

Susan Edges 776 1 776 1

Susan Smoothing 112 2 112 2

Table 4: Code size of stack manager (in bytes)sstore/ fci sload/ fco l2g g2l wb

CSM 2404 1900 96 1024 1112

SSDM 184 176 24 120 80

only four applications among our eight applications that con-tain pointers to stack data. We can observe that our schemecan slightly improve the performance of SHA, and totallyeliminate the pointer management functions for other threebenchmarks.

More results: Besides comparing results between SSDMand CSM, we also examined the impact of di↵erent stackspace sizes, the scalability of our heuristic, and discussed ourSSDM with cache design. We found that i) performance im-proves as we increase the space for stack data, ii) our SSDMscales well with di↵erent number of cores, iii) the penaltyof management is much less with our SSDM compared tohardware cache. The detailed results are presented in theAppendix, section F, section G, and section H.

9. CONCLUSIONScratchpad based Multicore Processor (SMP) architectures

are promising, since they are more scalable. However, sincescratchpad memory cannot always accommodate the wholeprogram, certain schemes are required to mange code, globaldata, stack data and heap data of the program to enable itsexecution. The main focus of this paper is on managing stackdata, since majority of the accesses in embedded applicationsmay be to stack variables. Assuming other data are properlymanaged by other schemes, managing stack data is especiallychallenging. In this paper, we formulated the problem of ef-ficiently placing library functions at the function call sites.In addition, we proposed a heuristic algorithm called SSDMto generate the e�cient function placement. As for pointersto stack data, a proper scheme was presented to reduce themanagement cost. Our experimental results show that SSDMgenerates function placement which leads to significant per-formance improvement compared to CSM.

10. REFERENCES[1] “GCC Internals”. http://gcc.gnu.org/onlinedocs/gccint/.[2] Intel Core i7 Processor Extreme Edition and Intel Core i7

Processor Datasheet, Volume 1. In White paper. Intel.[3] Raw Performance: SiSoftware Sandra 2010 Pro (GFLOPS).[4] SPU C/C++ Language Extensions. Technical report.[5] The SCC Programmer’s Guide. Technical report.[6] Compilers: Principles, Techniques, and Tools. Addison Wesley,

1986.[7] F. Angiolini, F. Menichelli, A. Ferrero, L. Benini, and

M. Olivieri. A Post-Compiler Approach to Scratchpad Mappingof Code. In Proc. CASES, pages 259–267, 2004.

[8] O. Avissar, R. Barua, and D. Stewart. An Optimal MemoryAllocation Scheme for Scratch-pad-based Embedded Systems.ACM TECS, 1(1):6–26, 2002.

[9] K. Bai and A. Shrivastava. Heap Data Management for LimitedLocal Memory (LLM) Multi-core Processors. In Proc.CODES+ISSS, 2010.

[10] K. Bai, A. Shrivastava, and S. Kudchadker. Stack DataManagement for Limited Local Memory (LLM) Multi-coreProcessors. In Proc. ASP-DAC, pages 231–234, 2011.

Table 5: Dynamic instructions per functionsstore/ fci sload/ fco

l2g

g2l wb

F NF F NF hit miss hit miss

CSM 180 100 148 95 24 45 76 60 34

SSDM 46 0 44 0 6 11 30 4 20

* F: stack region is full when function is called; NF: stack region isenough for the incoming function frame.

Table 6: Number of pointer mgmt. function callsl2g g2l wb

CSM SSDM CSM SSDM CSM SSDM

BasicMath 37010 0 123046 0 89026 0

SHA 2 2 163 158 68 68

Edges 1 0 515 0 514 0

Smoothing 1 0 515 0 514 0

* Edges - Susan Edges, Smoothing - Susan Smoothing

[11] M. A. Baker, A. Panda, N. Ghadge, A. Kadne, and K. S.Chatha. A Performance Model and Code Overlay Generator forScratchpad Enhanced Embedded Processors. In Proc.CODES+ISSS, pages 287–296, 2010.

[12] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, andP. Marwedel. Scratchpad Memory: Design Alternative for Cacheon-chip Memory in Embedded Systems. In Proc. CODES+ISSS,pages 73–78, 2002.

[13] A. Dominguez, S. Udayakumaran, and R. Barua. Heap DataAllocation to Scratch-pad Memory in Embedded Systems. J.Embedded Comput., 1(4):521–540, 2005.

[14] B. Egger, C. Kim, C. Jang, Y. Nam, J. Lee, and S. L. Min. ADynamic Code Placement Technique for Scratchpad MemoryUsing Postpass Optimization. In Proc. CASES, pages 223–233,2006.

[15] B. Flachs at el. The Microarchitecture of the SynergisticProcessor for A Cell Processor. IEEE Solid-state circuits,41(1):63–70, 2006.

[16] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin,T. Mudge, and R. B. Brown. Mibench: A Free, CommerciallyRepresentative Embedded Benchmark Suite. Proc. WorkloadCharacterization, pages 3–14, 2001.

[17] A. Janapsatya, A. Ignjatovic, and S. Parameswaran. A NovelInstruction Scratchpad Memory Optimization Method Based onConcomitance Metric. In Proc. ASP-DAC, pages 612–617, 2006.

[18] S. c. Jung, A. Shrivastava, and K. Bai. Dynamic Code Mappingfor Limited Local Memory Systems. In Proc. ASAP, pages13–20, 2010.

[19] M. Kandemir and A. Choudhary. Compiler-directed Scratch padMemory Hierarchy Design and Management. In Proc. DAC,pages 628–633, 2002.

[20] M. Kandemir, J. Ramanujam, J. Irwin, N. Vijaykrishnan,I. Kadayif, and A. Parikh. Dynamic Management of Scratch-padMemory Space. In Proc. DAC, pages 690–695, 2001.

[21] M. Kistler, M. Perrone, and F. Petrini. Cell MultiprocessorCommunication Network: Built for Speed. IEEE Micro,26(3):10–23, May 2006.

[22] L. Li, L. Gao, and J. Xue. Memory Coloring: A CompilerApproach for Scratchpad Memory Management. In Proc. PACT,pages 329–338, 2005.

[23] M. Mamidipaka and N. Dutt. On-chip Stack Based MemoryOrganization for Low Power Embedded Architectures. In Proc.DATE, pages 1082–1087, 2003.

[24] R. Mcllroy, P. Dickman, and J. Sventek. E�cient Dynamic HeapAllocation of Scratch-pad Memory. In ISMM, pages 31–40, 2008.

[25] N. Nguyen, A. Dominguez, and R. Barua. Memory Allocationfor Embedded Systems with A Compile-time-unknownScratch-pad Size. In Proc. CASES, pages 115–125, 2005.

[26] P. Panda, N. D. Dutt, and A. Nicolau. On-chip vs. O↵-chipMemory: the Data Partitioning Problem in EmbeddedProcessor-based Systems. In ACM TODAES, pages 682–704,2000.

[27] S. Park, H.-w. Park, and S. Ha. A Novel Technique to UseScratch-pad Memory for Stack Management. In Proc. DATE,pages 1478–1483, 2007.

[28] F. Poletti, P. Marchal, D. Atienza, L. Benini, F. Catthoor, andJ. M. Mendias. An Integrated Hardware/Software Approach forRun-time Scratchpad Management. In Proc. DAC, pages238–243, 2004.

[29] A. Shrivastava, A. Kannan, and J. Lee. A Software-only Solutionto Use Scratch Pads for Stack Data. IEEE TCAD,28(11):1719–1728, 2009.

[30] J. E. Smith. A Study of Branch Prediction Strategies. In Proc.of ISCA, pages 135–148, 1981.

[31] S. Udayakumaran, A. Dominguez, and R. Barua. DynamicAllocation for Scratch-pad Memory Using Compile-timeDecisions. ACM TECS, 5(2):472–511, 2006.

Figure 6: SSDM reduces the data management overhead and improves performance. (a) SSDM against ILP and CSM. (b) Overhead comparison between SSDM and CSM.

We first utilized the PPE and 1 SPE available in the IBM Cell BE and compared our SSDM performance against the results from ILP and CSM [10]. The y-axis in Figure 6(a) stands for the execution time of each benchmark normalized to its execution time with ILP. In this section, the number of function calls used in the Weighted Call Graph (WCG) is estimated from profile information. In the Appendix, section D, we present a compile-time scheme to assign weights to edges. Experimental results show that both the non-profiling-based scheme and the profiling-based scheme achieve almost the same performance. As observed from Figure 6(a), our SSDM shows performance very similar to the ILP approach. This means our heuristic approaches the optimal solution when the benchmark has a small call graph. Compared to the CSM scheme, our SSDM demonstrates up to 19% and on average 11% performance improvement. The overhead of the management comprises i) time for data transfer, and ii) execution of the instructions in the management library functions. Figure 6(b) compares the execution time overhead of CSM and the proposed SSDM. Results show that when using CSM, an average of 11.3% of the execution time was spent on stack data management. With our new approach, SSDM, the overhead is reduced to a mere 0.8% – a reduction of 13X. Next we break down the overhead and explain the effect of our techniques on its different components:

Opt1 - Increase in the granularity of management: Due to our stack-space-level granularity of management, the number of DMA calls has been reduced. Table 2 shows the number of stack data management DMAs executed when we use CSM vs. the new technique SSDM. Note that no DMAs are required for BasicMath; this is because the whole stack fits into the stack space allowed for this benchmark. Our technique performs well for all benchmarks except Dijkstra. This is because of the recursive function print_path in Dijkstra. CSM performs a DMA only when the stack space is full of recursive function instantiations, while we have to evict recursive functions every time, even with unused stack space. As a result, our technique does not perform very well on recursive programs. However, since many embedded programs are non-recursive, we have left the problem of optimizing for recursive functions as future work.
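The stack-space-level granularity described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's actual library: the names sstore, sload, REGION_SIZE, and dma_transfer are hypothetical, and a real implementation would also copy the frame bytes and handle recursion. The point of the sketch is only the granularity: eviction and restoration move the whole occupied region in one DMA, rather than one frame at a time.

```c
#include <assert.h>

#define REGION_SIZE 512            /* hypothetical stack-region size in bytes */

static int sp = 0;                 /* occupied bytes in the local stack region */
static int evicted_bytes = 0;      /* bytes currently evicted to global memory */
static int dma_count = 0;          /* DMA transfers issued */

/* Stand-in for a DMA to/from off-chip global memory. */
static void dma_transfer(int bytes) { (void)bytes; dma_count++; }

/* sstore: invoked before a call only when the incoming frame may not fit.
 * It evicts ALL resident frames with one DMA (stack-space granularity),
 * instead of shuttling frames out one at a time. */
static void sstore(int incoming_frame)
{
    if (sp + incoming_frame > REGION_SIZE) {
        dma_transfer(sp);          /* one DMA evicts the whole occupied region */
        evicted_bytes += sp;
        sp = 0;
    }
    sp += incoming_frame;          /* the new frame now lives locally */
}

/* sload: invoked after the callee returns; once the region drains,
 * one DMA brings the caller's evicted frames back. */
static void sload(int frame_size)
{
    sp -= frame_size;
    if (sp == 0 && evicted_bytes > 0) {
        int restore = evicted_bytes > REGION_SIZE ? REGION_SIZE : evicted_bytes;
        dma_transfer(restore);     /* one DMA restores a region's worth */
        evicted_bytes -= restore;
        sp = restore;
    }
}
```

With three 200-byte frames and a 512-byte region, only the third call overflows, so a single DMA evicts the 400 resident bytes; a per-frame scheme would have issued two.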

Opt2 - Not performing management when not absolutely needed: Our SSDM scheme reduces the number

Table 1: Benchmarks, their stack sizes, and the stack space we manage them on.

Benchmark         Stack Size (bytes)   Stack Region Size (bytes)
BasicMath         400                  512
Dijkstra          1712                 1024
FFT               656                  512
FFT inverse       656                  512
SHA               2512                 2048
String Search     992                  768
Susan Edges       832                  768
Susan Smoothing   448                  256

of library function calls because of our compile-time analysis. In Table 3, we compare the number of sstore and sload function calls executed when using SSDM vs. _fci and _fco calls when using CSM. We can observe that our scheme makes far fewer library function calls. The main reason is that our SSDM considers the thrashing effect discussed in Section 4. Our approach tries to avoid placing the management library functions sstore and sload around a function containing a large number of function calls if possible, while CSM always inserts management functions at all function call sites.

Opt3 - Performing minimal work each time management is performed: Our management library is simpler, since we only need to maintain a linear queue, as compared to a circular queue in CSM. Table 4 shows the amount of local memory required by our SSDM and CSM, where we can see that our runtime library has a much smaller footprint than CSM's. This matters for performance, since stack frames get less space in the local memory if the library occupies more of it. The reason for CSM's larger footprint is that it needs to handle memory fragmentation, while our SSDM does not have this problem.

Table 5 shows the cost of extra instructions per library function call. We ran all benchmarks with both schemes and approximately calculated the average additional instructions incurred by each library call. As demonstrated in Table 5, our SSDM performs much better than CSM. There is no cost in SSDM when the stack region is sufficient to hold the incoming frames; CSM, however, still needs extra instructions, since it checks the status of the stack region at runtime. A hit for g2l and wb means the accessed stack data is residing in the local memory when the function is called, while a miss denotes that the stack data is not in the local memory. In the CSM approach, more instructions are needed for the hit case than the miss case in the function wb. This is because on a miss the library directly writes the data back to the global memory, whereas on a hit it must look up the management table to translate the address. More importantly, as the table itself occupies space and therefore needs to be managed, CSM may need additional instructions to transfer table entries.
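The hit/miss distinction for the pointer-translation functions can be sketched as below. g2l is the function named in Table 5, but its internals here (the frame_entry layout, the mgmt_table name, and returning NULL on a miss so the caller triggers a DMA) are illustrative assumptions, not the actual CSM or SSDM library code.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define TABLE_ENTRIES 8

/* Each entry maps a frame's global address range to its scratchpad copy. */
typedef struct {
    uintptr_t global_base;   /* frame's address in global memory             */
    char     *local_base;    /* its copy in the scratchpad, NULL if evicted  */
    size_t    size;          /* frame size in bytes, 0 = unused entry        */
} frame_entry;

static frame_entry mgmt_table[TABLE_ENTRIES];

/* g2l: translate a global stack address to a local one.  A hit returns
 * the translated pointer; a miss returns NULL, signalling that the
 * caller must first DMA the frame into the scratchpad. */
static char *g2l(uintptr_t gaddr)
{
    for (int i = 0; i < TABLE_ENTRIES; i++) {
        frame_entry *e = &mgmt_table[i];
        if (e->size != 0 &&
            gaddr >= e->global_base && gaddr < e->global_base + e->size) {
            if (e->local_base == NULL)
                return NULL;                  /* miss: frame was evicted */
            return e->local_base + (gaddr - e->global_base);
        }
    }
    return NULL;                              /* address not in any frame */
}
```

The cost asymmetry in Table 5 follows from this shape: a hit pays for the table walk plus the address arithmetic, while a miss short-circuits to a bulk transfer.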

Opt4 - Not performing pointer management when not needed: Stack pointers are managed only where necessary in SSDM, while CSM might manage all pointers excessively. Table 6 shows the results of the four benchmarks with and without the pointer optimization technique. They are the

Table 2: Comparison of number of DMAs

Benchmark         CSM   SSDM
BasicMath         0     0
Dijkstra          108   364
FFT               26    14
FFT inverse       26    14
SHA               10    4
String Search     380   342
Susan Edges       8     2
Susan Smoothing   12    4

Table 3: Number of sstore/_fci and sload/_fco calls

                  sstore/_fci        sload/_fco
Benchmark         CSM      SSDM      CSM      SSDM
BasicMath         40012    0         40012    0
Dijkstra          60365    202       60365    202
FFT               7190     8         7190     8
FFT inverse       7190     8         7190     8
SHA               57       2         57       2
String Search     503      143       503      143
Susan Edges       776      1         776      1
Susan Smoothing   112      2         112      2

Table 4: Code size of stack manager (in bytes)

       sstore/_fci   sload/_fco   l2g   g2l    wb
CSM    2404          1900         96    1024   1112
SSDM   184           176          24    120    80

only four applications among our eight that contain pointers to stack data. We can observe that our scheme slightly improves the performance of SHA, and totally eliminates the pointer management function calls for the other three benchmarks.

More results: Besides comparing results between SSDM and CSM, we also examined the impact of different stack space sizes and the scalability of our heuristic, and compared our SSDM with a cache design. We found that i) performance improves as we increase the space for stack data, ii) our SSDM scales well with different numbers of cores, and iii) the penalty of management is much less with our SSDM than with a hardware cache. The detailed results are presented in the Appendix, sections F, G, and H.

9. CONCLUSION
Scratchpad based Multicore Processor (SMP) architectures are promising, since they are more scalable. However, since scratchpad memory cannot always accommodate the whole program, schemes are required to manage the code, global data, stack data, and heap data of the program to enable its execution. The main focus of this paper is on managing stack data, since the majority of accesses in embedded applications may be to stack variables. Even assuming other data are properly managed by other schemes, managing stack data is especially challenging. In this paper, we formulated the problem of efficiently placing library functions at function call sites. In addition, we proposed a heuristic algorithm called SSDM to generate an efficient function placement. As for pointers to stack data, a proper scheme was presented to reduce the management cost. Our experimental results show that SSDM generates function placements that lead to significant performance improvement compared to CSM.

10. REFERENCES
[1] "GCC Internals". http://gcc.gnu.org/onlinedocs/gccint/.
[2] Intel Core i7 Processor Extreme Edition and Intel Core i7 Processor Datasheet, Volume 1. White paper. Intel.
[3] Raw Performance: SiSoftware Sandra 2010 Pro (GFLOPS).
[4] SPU C/C++ Language Extensions. Technical report.
[5] The SCC Programmer's Guide. Technical report.
[6] Compilers: Principles, Techniques, and Tools. Addison Wesley, 1986.
[7] F. Angiolini, F. Menichelli, A. Ferrero, L. Benini, and M. Olivieri. A Post-Compiler Approach to Scratchpad Mapping of Code. In Proc. CASES, pages 259–267, 2004.
[8] O. Avissar, R. Barua, and D. Stewart. An Optimal Memory Allocation Scheme for Scratch-pad-based Embedded Systems. ACM TECS, 1(1):6–26, 2002.
[9] K. Bai and A. Shrivastava. Heap Data Management for Limited Local Memory (LLM) Multi-core Processors. In Proc. CODES+ISSS, 2010.
[10] K. Bai, A. Shrivastava, and S. Kudchadker. Stack Data Management for Limited Local Memory (LLM) Multi-core Processors. In Proc. ASP-DAC, pages 231–234, 2011.

Table 5: Dynamic instructions per function

       sstore/_fci   sload/_fco   l2g   g2l          wb
       F      NF     F     NF           hit   miss   hit   miss
CSM    180    100    148   95     24    45    76     60    34
SSDM   46     0      44    0      6     11    30     4     20

* F: stack region is full when the function is called; NF: stack region is enough for the incoming function frame.

Table 6: Number of pointer mgmt. function calls

            l2g             g2l             wb
            CSM     SSDM    CSM     SSDM    CSM     SSDM
BasicMath   37010   0       123046  0       89026   0
SHA         2       2       163     158     68      68
Edges       1       0       515     0       514     0
Smoothing   1       0       515     0       514     0

* Edges - Susan Edges, Smoothing - Susan Smoothing

[11] M. A. Baker, A. Panda, N. Ghadge, A. Kadne, and K. S. Chatha. A Performance Model and Code Overlay Generator for Scratchpad Enhanced Embedded Processors. In Proc. CODES+ISSS, pages 287–296, 2010.
[12] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, and P. Marwedel. Scratchpad Memory: Design Alternative for Cache On-chip Memory in Embedded Systems. In Proc. CODES+ISSS, pages 73–78, 2002.
[13] A. Dominguez, S. Udayakumaran, and R. Barua. Heap Data Allocation to Scratch-pad Memory in Embedded Systems. J. Embedded Comput., 1(4):521–540, 2005.
[14] B. Egger, C. Kim, C. Jang, Y. Nam, J. Lee, and S. L. Min. A Dynamic Code Placement Technique for Scratchpad Memory Using Postpass Optimization. In Proc. CASES, pages 223–233, 2006.
[15] B. Flachs et al. The Microarchitecture of the Synergistic Processor for a Cell Processor. IEEE Solid-State Circuits, 41(1):63–70, 2006.
[16] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. MiBench: A Free, Commercially Representative Embedded Benchmark Suite. In Proc. Workload Characterization, pages 3–14, 2001.
[17] A. Janapsatya, A. Ignjatovic, and S. Parameswaran. A Novel Instruction Scratchpad Memory Optimization Method Based on Concomitance Metric. In Proc. ASP-DAC, pages 612–617, 2006.
[18] S. C. Jung, A. Shrivastava, and K. Bai. Dynamic Code Mapping for Limited Local Memory Systems. In Proc. ASAP, pages 13–20, 2010.
[19] M. Kandemir and A. Choudhary. Compiler-directed Scratch pad Memory Hierarchy Design and Management. In Proc. DAC, pages 628–633, 2002.
[20] M. Kandemir, J. Ramanujam, J. Irwin, N. Vijaykrishnan, I. Kadayif, and A. Parikh. Dynamic Management of Scratch-pad Memory Space. In Proc. DAC, pages 690–695, 2001.
[21] M. Kistler, M. Perrone, and F. Petrini. Cell Multiprocessor Communication Network: Built for Speed. IEEE Micro, 26(3):10–23, May 2006.
[22] L. Li, L. Gao, and J. Xue. Memory Coloring: A Compiler Approach for Scratchpad Memory Management. In Proc. PACT, pages 329–338, 2005.
[23] M. Mamidipaka and N. Dutt. On-chip Stack Based Memory Organization for Low Power Embedded Architectures. In Proc. DATE, pages 1082–1087, 2003.
[24] R. McIlroy, P. Dickman, and J. Sventek. Efficient Dynamic Heap Allocation of Scratch-pad Memory. In Proc. ISMM, pages 31–40, 2008.
[25] N. Nguyen, A. Dominguez, and R. Barua. Memory Allocation for Embedded Systems with a Compile-time-unknown Scratch-pad Size. In Proc. CASES, pages 115–125, 2005.
[26] P. Panda, N. D. Dutt, and A. Nicolau. On-chip vs. Off-chip Memory: The Data Partitioning Problem in Embedded Processor-based Systems. ACM TODAES, pages 682–704, 2000.
[27] S. Park, H.-W. Park, and S. Ha. A Novel Technique to Use Scratch-pad Memory for Stack Management. In Proc. DATE, pages 1478–1483, 2007.
[28] F. Poletti, P. Marchal, D. Atienza, L. Benini, F. Catthoor, and J. M. Mendias. An Integrated Hardware/Software Approach for Run-time Scratchpad Management. In Proc. DAC, pages 238–243, 2004.
[29] A. Shrivastava, A. Kannan, and J. Lee. A Software-only Solution to Use Scratch Pads for Stack Data. IEEE TCAD, 28(11):1719–1728, 2009.
[30] J. E. Smith. A Study of Branch Prediction Strategies. In Proc. ISCA, pages 135–148, 1981.
[31] S. Udayakumaran, A. Dominguez, and R. Barua. Dynamic Allocation for Scratch-pad Memory Using Compile-time Decisions. ACM TECS, 5(2):472–511, 2006.

Page 33

C M L Web page: aviral.lab.asu.edu C M L

Minimal Overhead
} 4% of execution time spent on management

Page 34


Comparison with Caches
} Cache miss penalty = # misses * miss latency
} SPM miss overhead = # API function calls * no. of instructions in API function + # times DMA is called * delay of the DMA (dep. on DMA size)
} Cache is better when miss latency < 260 ps (260 ps = 0.86 * cycle time)
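Plugging the slide's two formulas into code makes the break-even point concrete: the cache wins whenever its miss latency is below the SPM overhead divided by the number of misses. The counts in the test are hypothetical placeholders, not the measurements behind the slide's 260 ps figure.

```c
#include <assert.h>

/* SPM miss overhead = #API calls * instructions per call
 *                   + #DMAs * delay per DMA (size-dependent). */
static double spm_overhead(double api_calls, double instr_per_call,
                           double dma_calls, double dma_delay)
{
    return api_calls * instr_per_call + dma_calls * dma_delay;
}

/* Cache miss penalty = #misses * miss latency, so the break-even
 * miss latency is the SPM overhead amortized over the misses. */
static double breakeven_miss_latency(double overhead, double cache_misses)
{
    return overhead / cache_misses;
}
```

With a measured overhead and miss count in hand, comparing the actual miss latency against this break-even value decides which design wins for a given workload.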

Page 35


Scalability of Management
} The main core does not choke on the memory requests from several cores

[Figure: Normalized runtime (0.97–1.07) vs. number of cores (1–6) for basicmath, DFS, dijkstra, fft, invfft, MST, rbTree, sha, and stringsearch]

Page 36


Summary
} SPMs are an embedded system technology
} SPMs will be needed in general-purpose computing
} Will need to manage stack, heap, and code
} Do not work without management
} Need different strategies for different data
} Code (statically linked)
} Stack (Circular)
} Heap (High associativity)
} Overheads of Software Data Management
} DMA overhead can be comparable or better than cache
} We have just begun – lots of room for improvement

[Diagram: local memory regions – Stack, Heap, Global, Code]

Page 37


Communication Management
} No problem in MPI-style
} Communication is explicit
} For multi-threaded programs
} Replace load => coh_load(), and store => coh_store()
} Too much overhead for sequential consistency
} Weak consistency models allow for efficient software implementations of coherency protocols
} Lazy vs. Eager
} Invalidate vs. Update
} Page-based granularity in multi-processor systems
} Need finer granularity in multi-cores
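A software coherence wrapper of the kind the slide describes (load => coh_load(), store => coh_store(), with lazy write-back at a release point in the spirit of lazy release consistency) might look roughly like this. The line size, the valid/dirty bookkeeping, and the coh_release name are illustrative assumptions, not the actual implementation behind the slide's measurements.

```c
#include <string.h>

#define LINE  32
#define LINES 4

static char global_mem[LINES * LINE];   /* stand-in for off-chip memory */
static char local_mem[LINES * LINE];    /* stand-in for the scratchpad  */
static int  valid[LINES];               /* per-line valid bits          */
static int  dirty[LINES];               /* per-line dirty bits          */
static int  fetches;                    /* line fetches (DMA stand-in)  */

static void fetch_line(int line)
{
    memcpy(&local_mem[line * LINE], &global_mem[line * LINE], LINE);
    valid[line] = 1;
    fetches++;
}

/* coh_load: the compiler rewrites each load into this wrapper. */
static char coh_load(int addr)
{
    int line = addr / LINE;
    if (!valid[line])
        fetch_line(line);               /* miss: pull the line in */
    return local_mem[addr];
}

/* coh_store: writes go to the local copy and are only marked dirty. */
static void coh_store(int addr, char v)
{
    int line = addr / LINE;
    if (!valid[line])
        fetch_line(line);               /* write-allocate */
    local_mem[addr] = v;
    dirty[line] = 1;
}

/* coh_release: lazy write-back of dirty lines at a synchronization point. */
static void coh_release(void)
{
    for (int i = 0; i < LINES; i++)
        if (dirty[i]) {
            memcpy(&global_mem[i * LINE], &local_mem[i * LINE], LINE);
            dirty[i] = 0;
        }
}
```

Replacing every load and store with such wrappers is exactly why sequential consistency is too expensive in software: only by deferring write-back to release points does the per-access cost stay tolerable.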

[Figure: Execution time (ms, log scale, 1–100000) across benchmarks for CRC, LRC-inv, and LRC-upd]

Page 38


Real-time Multicores
} Data and communication management in software
} Better timing guarantees
} Managing data at its natural granularity simplifies WCET calculation
} e.g., find out how many instruction cache misses vs. find out how many function swaps
} Not only lower WCET, but tighter WCET estimates
} Excellent platform for Real-time Systems
} Can tune the management policy to improve WCET
} Software Branch Hinting
} Close to 1-bit HBP performance
} Can place hints to achieve tighter WCET