
Page 1

The Rise and Fall of Scratchpad Memories

Aviral Shrivastava
Compiler Microarchitecture Lab
Arizona State University

Page 2
Web page: aviral.lab.asu.edu

Remember - It is all about Memory!

- First Generation: ENIAC, UNIVAC - no memory
- Second Generation: IBM 7000 series - magnetic core memory
- Third Generation: IBM 360 - semiconductor memory
- Fourth Generation: PC and onwards - VLSI memory
- First documented use of a cache: IBM 360* - "to bridge the speed gap between processor and memory"
- Since then, caches may be the most important feature in a processor; in the Itanium 2, cache and cache-like structures account for more than 90% of transistors by count, 70% of the chip by area, 50% of power, and 80% of leakage

10/8/13

*IBM (June, 1968), IBM System/360 Model 85 Functional Characteristics, SECOND EDITION, A22-6916-1.

Computer Architecture and Networks

First Generation (1945-1958)...

1943-46: ENIAC (Electronic Numerical Integrator and Calculator), built by J. Mauchly and J. Presper Eckert, was the first general-purpose electronic computer. Built to calculate trajectories for ballistic shells during WWII, it was programmed by setting switches and plugging and unplugging cables. It used 18,000 tubes, weighed 30 tons, and consumed 160 kilowatts of electrical power. The size of its numerical word was 10 decimal digits, and it could perform 5,000 additions and 357 multiplications per second.

Page 3

SPMs for Power, Performance, and Area

[Figure: energy per access (nJ) vs. memory size (256 to 16384 bytes), comparing a scratchpad against 2-way caches with 1 MB, 16 MB, and 4 GB address spaces]

[Figure: a cache needs a data array, tag array, tag comparators, muxes, and an address decoder; an SPM needs only the data array and address decoder]

- 40% less energy compared to a cache [Banakar02]: absence of tag arrays, comparators, and muxes
- 34% less area compared to a cache of the same size [Banakar02]: simple hardware design (only a memory array and address-decoding circuitry), so it is simpler and cheaper to build and verify

Page 4

SPMs became popular in embedded systems
- DSPs have used SPMs for a long time: the TI-99/4A, released in 1981, had 256 bytes of SPM
- Gaming consoles regularly use SPMs: SuperH in the Sega Saturn; the PS1 could use SPM for stack data; the PS2 has a 16KB SPM; in the PS3, each SPU has a 256KB SPM
- Network and graphics processors: Intel network processors, and Nvidia Tesla
- Many embedded processors used line locking: ColdFire MCF5249, PowerPC440, MPC5554, ARM940, and ARM946E-S
- Several versions of ARM and Renesas processors have SPMs: ARM supports up to 4M of SPM

[Images: Sony PlayStation, Sega Saturn]

Page 5

Using SPMs in Embedded Systems

[Figure: ARM memory architecture - an ARM core with SPM and cache, connected by DMA to global memory]

- Programs work without using the SPM; the SPM is there for optimization, to improve power and performance
- Frequently used data (typically arrays) is placed in the SPM using a linker script

All of this was done manually!
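For instance (a sketch only - the region names, addresses, and sizes below are invented for illustration, not taken from a particular chip), a GNU ld linker-script fragment that pins tagged data into an on-chip SPM region might look like:

```
MEMORY
{
    SPM (rwx) : ORIGIN = 0x40000000, LENGTH = 64K   /* on-chip scratchpad */
    RAM (rwx) : ORIGIN = 0x80000000, LENGTH = 64M   /* off-chip global memory */
}

SECTIONS
{
    .spm_data : { *(.spm_data) } > SPM   /* objects tagged .spm_data go to SPM */
    .data     : { *(.data) }     > RAM   /* everything else stays off-chip */
}
```

In C, a frequently used array could then be tagged with `__attribute__((section(".spm_data"))) int coeff[256];` so the linker places it in the scratchpad.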

Page 6

Compilers for using SPMs
- As applications became more complex, it was no longer easy to identify what should be mapped to the SPM
- Compiler techniques to use SPMs in embedded systems:
  - Global data: Panda97, Brockmeyer03, Avissar02, Gao05, Kandemir02, Steinke02, Grosslinger09
  - Code: Janapsatya06, Egger06, Angiolini04
  - Stack: Udayakumaran06, Dominguez05
  - Heap: Dominguez05, Mcllroy08

Page 7

Compilers to use SPMs
- In general-purpose systems, Kennedy proposed using the SPM for register spills
- SPMs have largely remained in embedded systems; they are not popular in general-purpose computing

Not much work - because caches keep programming and debugging simple

Times are a-changing...

Page 8

Inevitable march to multi-cores
- Marketing needs: Moore's law
- Real needs: temperature and power problems
  - Microarchitecture level: hotspots
  - Chip level: cooling efficiency
  - System level: total power consumption
- Multi-cores are the only way to improve performance without much increase in power
- Multi-cores also reduce design complexity, spread heat to alleviate hotspots, and improve reliability through redundancy

Page 9

But... how do you scale the memory?
- Coherent-cache architectures (the current path): you can still write programs as in the uni-core era, but coherency overheads do not scale - Tilera64 needs a whole separate mesh network for coherence traffic (http://www.theinquirer.net/inquirer/news/1006963/tilera-releases-core-chip)
- Non-coherent cache architectures: the 48-core Single-chip Cloud Computer (SCC)
- Partly coherent: the TI-6678 is vertically coherent, but not horizontally coherent
- Hybrid: locally coherent, but globally non-coherent
- In every case, caches still consume a very significant amount of power

Page 10

Software Managed Memory (SMM) Architecture

[Figure: IBM Cell architecture - the PPE and SPEs 0-7 sit on the Element Interconnect Bus (EIB), with off-chip global memory. PPE: Power Processor Element; SPE: Synergistic Processor Element; LS: Local Store; each SPE contains an SPU and its LS]

- Cores have small local memories (scratchpads)
- A core can only access its local memory
- Accesses to global memory go through explicit DMAs in the program
- e.g. the IBM Cell architecture, which is in the Sony PS3

Page 11

SMM Execution
- Task-based programming, MPI-like communication

Main core:
  #include <libspe2.h>
  extern spe_program_handle_t hello_spu;
  int main(void) {
      int speid, status;
      speid = spe_create_thread(&hello_spu);
  }

Each local core:
  #include <spu_mfcio.h>
  int main(speid, argp) {
      printf("Hello world!\n");
  }

- Extremely power-efficient computation, if all code and data fit into the local memories of the cores

Processor                           Fab    Frequency   GFlops   Power   Power Efficiency (GFlops/W)
Cell/B.E.                           45nm   3.2 GHz     230      50 W    4.6
Intel i7 4-core Bloomfield 965 XE   45nm   3.2 GHz     70       130 W   0.5

Page 12

SMM memory organization

[Figure: ARM memory architecture (core with SPM and cache, DMA to global memory) next to the IBM Cell memory architecture (SPE with SPM only, DMA to global memory)]

In the ARM architecture the SPM is for optimization; in the IBM Cell the SPM is essential.
- Dynamic code/data management is needed
- All code/data must be managed

Previous works are not directly applicable.

Page 13

How to manage data within a core?

Original code:
  int global;
  f1() {
      int a, b;
      global = a + b;
      f2();
  }

Local-memory-aware code:
  int global;
  f1() {
      int a, b;
      DMA.fetch(global);
      global = a + b;
      DMA.writeback(global);
      DMA.fetch(f2);
      f2();
  }

Page 14

Data Management in LLM multicores
- Manage any amount of heap, stack, and code in each core of an LLM multi-core
- Global data: if small, can be permanently located in the local memory
- Stack data: 'liveness' depends on the call path; a function's stack-frame size is known at compile time, but the stack depth is not
- Heap data: dynamic, and its size can be unbounded
- Code: statically linked
- Our strategy: partition the local memory into regions for each kind of data, and manage each kind of data in a constant amount of space

[Figure: local memory partitioned into code, global, stack, and heap regions, with stack and heap data overflowing to global memory]

Page 15

Stack Management: Problem

Function   Frame Size (bytes)
F1         28
F2         40
F3         60
F4         54

Local memory size = 128 bytes. The frames of F1, F2, and F3 fill the local memory exactly (28 + 40 + 60 = 128 bytes), so when F4 is called there is no room for its 54-byte frame: older frames must be moved out to global memory at the global stack pointer.

[Figure: local memory holding F1, F2, and F3 up to the 128-byte limit, with F4's frame placed in global memory]

Page 16

Stack Management: Solution
- Keep the active portion of the stack in the local memory; the granularity of stack frames is chosen to minimize management overhead
- It is a dynamic software technique:
  - fci(func_stack_size): check for available space in the local memory, and move old frame(s) to global memory if needed
  - fco(): check whether the caller's frame exists in the local memory, and fetch it from global memory if it is absent
- A modified GCC 4.1.1 inserts the calls, and the executable links against a runtime library providing void fci(int func_stack_size) and void fco()

C source:
  F1() { int a, b; F2(); }
  F2() { F3(); }
  F3() { int j = 30; }

Instrumented code:
  F1() { int a, b; fci(F2); F2(); fco(F1); }
  F2() { fci(F3); F3(); fco(F2); }
  F3() { int j = 30; }
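The fci()/fco() bookkeeping above can be sketched in plain C. This is an invented illustration under stated assumptions (the 128-byte region size from the earlier example, integer frame-size tables, oldest-first eviction), not the actual runtime library; a real implementation moves the frame bytes themselves with DMA rather than only tracking sizes:

```c
#include <string.h>

/* Sketch of the fci()/fco() stack manager. Frame tables and the
 * oldest-first eviction policy are assumptions for illustration. */
#define STACK_REGION 128
#define MAX_FRAMES   16

int spm[MAX_FRAMES], spm_top = 0, spm_used = 0;  /* frame sizes resident in SPM */
int glb[MAX_FRAMES], glb_top = 0;                /* frame sizes evicted to global memory */

/* fci: before a call, evict the oldest frames until the callee fits */
void fci(int func_stack_size) {
    while (spm_used + func_stack_size > STACK_REGION && spm_top > 0) {
        glb[glb_top++] = spm[0];               /* DMA the oldest frame out */
        spm_used -= spm[0];
        memmove(spm, spm + 1, --spm_top * sizeof(int));
    }
    spm[spm_top++] = func_stack_size;          /* callee frame now resident */
    spm_used += func_stack_size;
}

/* fco: after a return, fetch the caller's frame back if it was evicted */
void fco(void) {
    spm_used -= spm[--spm_top];                /* pop the callee frame */
    if (spm_top == 0 && glb_top > 0) {         /* caller frame is absent */
        spm[spm_top++] = glb[--glb_top];       /* DMA it back in */
        spm_used += spm[0];
    }
}
```

Running the slide's sequence (F1 = 28, F2 = 40, F3 = 60, F4 = 54 bytes) evicts F1 and F2 when F4 is called, matching the earlier figure.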

Page 17

Code Management: Problem
- Static compilation: functions need to be linked before execution
- Divide the code part of the SPM into regions
- Map functions to these SPM regions
- Functions mapped to the same region replace each other

[Figure: the code section of local memory divided into regions]

Page 18

Code Management: Solution

(a) Application call graph: F1 calls F2 and F3
(b) Linker script:
  SECTIONS {
      OVERLAY { F1.o F3.o }
      OVERLAY { F2.o }
  }
(c) Local memory: one code region shared by F1 and F3, and another region for F2
(d) Global memory: holds the global, stack, and heap data plus copies of F1, F2, and F3

- The number of regions and the function-to-region mapping must be chosen; the two extreme cases are a single region (least space, most transfers) and one region per function (most space, least transfers)
- Careful code placement is needed - the problem is NP-complete: minimize data transfer within a given space
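Since optimal placement is NP-complete, practical tools use heuristics. The following is an invented greedy sketch, not the talk's or paper's algorithm: start with one region per function and repeatedly merge the two regions with the fewest combined calls (a crude proxy for transfer cost) until the total region size fits the SPM code budget; a merged region is as large as its largest member, whose functions then evict one another:

```c
/* Invented greedy sketch of function-to-region mapping. */
typedef struct { int size; long calls; } Region;

/* Merges regions in place; returns how many regions remain. */
int map_functions(Region r[], int n, int budget) {
    for (;;) {
        int total = 0;
        for (int i = 0; i < n; i++) total += r[i].size;
        if (total <= budget || n <= 1) return n;

        /* find the pair of regions with the fewest combined calls */
        int a = 0, b = 1;
        long best = r[0].calls + r[1].calls;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (r[i].calls + r[j].calls < best) {
                    best = r[i].calls + r[j].calls;
                    a = i; b = j;
                }

        /* merge b into a: size = max of the two, call counts add up */
        if (r[b].size > r[a].size) r[a].size = r[b].size;
        r[a].calls += r[b].calls;
        r[b] = r[--n];   /* drop slot b */
    }
}
```

The cost model here is deliberately crude; the real problem also accounts for which functions actually interfere on the call paths.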

Page 19

Heap Data Management

  typedef struct {
      int id;
      float score;
  } Student;

  main() {
      for (i = 0; i < N; i++) {
          student[i] = malloc(sizeof(Student));
      }
      for (i = 0; i < N; i++) {
          student[i]->id = i;
      }
  }

Heap size = 32 bytes, sizeof(Student) = 16 bytes.
- malloc() allocates space in the local memory
- A new malloc() may need to evict older heap objects to global memory, and may need to allocate more global memory

[Figure: heap pointer (HP) in local memory; by malloc3, older objects have been evicted to global memory at GM_HP]
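The evicting malloc() described above can be sketched as follows. Everything here is an assumption for illustration (fixed 16-byte chunks, the slide's 32-byte SPM heap, memcpy standing in for DMA); it is not the paper's _malloc implementation:

```c
#include <string.h>

/* Sketch of an SPM heap manager: objects come from a tiny SPM heap;
 * when it is full, the oldest object is "DMAed" to a global buffer. */
#define SPM_HEAP 32
#define CHUNK    16
#define MAX_OBJ  64

char spm_heap[SPM_HEAP];
char global_heap[MAX_OBJ * CHUNK];
int  spm_count = 0;      /* chunks resident in SPM */
int  evict_count = 0;    /* chunks evicted to global memory */

/* allocate one fixed-size chunk, evicting the oldest if the SPM is full */
void *spm_malloc(void) {
    if (spm_count * CHUNK >= SPM_HEAP) {
        /* evict oldest chunk to global memory, slide the rest down */
        memcpy(global_heap + evict_count * CHUNK, spm_heap, CHUNK);
        evict_count++;
        memmove(spm_heap, spm_heap + CHUNK, (spm_count - 1) * CHUNK);
        spm_count--;
    }
    return spm_heap + (spm_count++) * CHUNK;
}
```

Note that eviction slides the surviving objects down, so local addresses handed out earlier become stale - a hazard of the same flavor as the pointer threat on the following pages.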

Page 20

Pointer Threat: Problem

  F1() { int a = 5, b; fci(F2); F2(&a); fco(F1); }
  F2(int *a) { fci(F3); F3(a); fco(F2); }
  F3(int *a) { int j = 30; *a = 100; }

With a 100-byte stack region, the frames of F1 (50 bytes), F2 (20 bytes), and F3 (30 bytes) all fit, so F3's dereference of &a finds "a" in the local memory. With a 70-byte stack region, F1's frame has been evicted to global memory by the time F3 runs, so the same local-memory address now holds other data and F3 writes 100 to the wrong value of "a".

[Figure: the two local-memory layouts side by side; in the 70-byte case F1's frame is marked EVICTED]

Page 21

Pointer Threat: Resolution

Instrumented code:
  F1() { int a = 5, b; fci(F2); F2(&a); fco(F1); }
  F2(int *a) { fci(F3); F3(a); fco(F2); }
  F3(int *a) { int j = 30; t = g2l(a); *t = 100; l2g(a, t); }

Every access through a pointer is rewritten to go through translation functions:
  *ptr = val;   becomes   tptr = _g2l(ptr); *tptr = val; _l2g(ptr, tptr);
  val = *ptr;   becomes   tptr = _g2l(ptr); val = *tptr;
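The g2l/l2g rewriting can be illustrated with a small model. The address layout, region bounds, and bounce buffer below are invented for this sketch and differ from the real _g2l/_l2g API (which also takes a size and a write flag):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of g2l/l2g pointer translation. A global stack address is
 * translated to its SPM copy when the owning frame is resident;
 * otherwise the data is fetched into a bounce buffer, and l2g
 * writes it back out. */
#define GLOBAL_BASE 0x1000u
#define SPM_SIZE    128

char     spm[SPM_SIZE];              /* resident stack frames */
char     global_mem[SPM_SIZE];       /* evicted frames live here */
uint32_t resident_lo, resident_hi;   /* byte offsets currently resident */
uint32_t fetch_buf[2];               /* bounce buffer for evicted data */

void *g2l(uint32_t gaddr) {
    uint32_t off = gaddr - GLOBAL_BASE;
    if (off >= resident_lo && off < resident_hi)
        return spm + (off - resident_lo);        /* frame is in SPM */
    memcpy(fetch_buf, global_mem + off, sizeof fetch_buf); /* "DMA" read */
    return fetch_buf;
}

/* write back the data if it went through the bounce buffer */
void l2g(uint32_t gaddr, void *laddr) {
    if (laddr == (void *)fetch_buf)
        memcpy(global_mem + (gaddr - GLOBAL_BASE), fetch_buf, sizeof fetch_buf);
}
```

The key property is that the rewritten code never dereferences a raw global address: resident frames are served straight from the SPM, and evicted ones round-trip through the buffer.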

Page 22

How to evict data to global memory?
- Can use DMA to transfer a heap object to global memory; DMA is very fast, with no core-to-core communication
- But eventually you can overwrite some other core's data, so mediation is needed

[Figure: several execution cores issuing malloc requests to the main core, which mediates allocation in global memory; the data itself moves by DMA]

Page 23

Compiler and Runtime Infrastructure
- Our infrastructure includes:
  - a code-overlay-script generating tool,
  - a runtime library implementing the API,
  - a compiler that inserts the API functions into the application.

[Figure: SPE source goes through the API-inserting compiler to SPE objects; the code-overlay-script generating tool produces a linker script; the SPE linker combines both with the runtime library into the SPE executable]

Runtime library API:
  void *_malloc(int size, int chunkSize);
  void _free(void *ppeAddr);
  void _fci(int func_stack_size);
  void _fco();
  void *_g2l(void *ppeAddr, int size, int wrFlag);
  void *_l2g(void *ppeAddr, void *speAddr, int size);

Page 24

Experimental Setup
- Sony PlayStation 3 running Fedora Core 9 Linux; only 6 SPEs are available
- MiBench benchmark suite and some other applications
- Runtimes are measured with spu_decrementer() on the SPEs and _mftb() on the PPE, both provided with the IBM Cell SDK 3.1
- The GCC compiler patch can be downloaded from http://aviral.lab.asu.edu/?p=95

Page 25

Results

Enables execution for arbitrary stack sizes - but with quite high overheads!

[Figure: log of runtime (us) vs. parameter n, with and without stack management]

  int rcount(int n) {
      if (n == 0) return 0;
      return rcount(n - 1) + 1;
  }

At n = 3842, the program without management crashes - there is no space left in the local memory for the stack. Our technique works for arbitrary stack sizes.

Page 26

How does it work?
- Pretty badly!! Several programs run, but with high overhead; several programs still do not run
- Remaining problems: the pointer problem, how to evict to global memory, and reducing the overheads (the number of times the API functions are called, and the number of times DMA is performed)
- Good news: it only gets better from here!

Page 27

Reduce Data Transfer Overhead

  malloc() {
      if (enough space in global memory)
          write heap data using DMA
      else
          request more space in global memory
  }

The execution thread on the execution core uses mailbox-based communication to ask the main core to allocate at least S bytes of global memory (startAddr to endAddr); the heap data itself is then written from local memory to global memory by DMA.

[Figure: execution cores, main core, and global memory, with malloc requests mediated by the main core and data moved by DMA]

Page 28

Improving Stack Management
- Opportunities to reduce repeated API calls by consolidation

Sequential calls:
  original:      F1(); F2();
  naive:         fci(F1); F1(); fco(F0); fci(F2); F2(); fco(F0);
  consolidated:  fci(max(F1, F2)); F1(); F2(); fco(F0);

Nested call:
  original:      F1() { F2(); }
  naive:         fci(F1); F1() { fci(F2); F2(); fco(F1); } fco(F0);
  consolidated:  fci(F1 + F2); F1() { F2(); } fco(F0);

Call in a loop:
  original:      while (<condition>) { F1(); }
  naive:         while (<condition>) { fci(F1); F1(); fco(F0); }
  consolidated:  fci(F1); while (<condition>) { F1(); } fco(F1);

Page 29

Find optimal stack management points
- Function-frame movement can be consolidated; frames do not need to be moved at every function call
- Formulate the problem as inserting cuts in the GCCFG; at each cut, dump the SPM contents into global memory

[Figure: GCCFG with nodes main (128), print (32), stream (1936), init (0), update (160), final (80), and transform (352); edges are labeled with call counts (0, 1, 10, 100), and four candidate cuts are marked]

Page 30

More Stack Management Optimizations
- Movement of function frames is the biggest contributor: consolidate management for multiple functions
- Pointer management: reduce the number of times p2s is called
  - If a stack variable is used continuously, perform p2s only once
  - If the stack variable belongs to a function that is in the SPM, p2s is not needed
- Reduce the instructions in the management functions: SPM-level management is simpler, and with less fragmentation the management code is smaller

Page 31

Efficient Execution
- Very few fci and fco calls inserted
- Fewer g2l calls
- Fewer instructions executed at every management point

Page 32

Overheads

Table 3: Number of sstore/fci and sload/fco calls

Benchmark         sstore/fci       sload/fco
                  CSM      SSDM    CSM      SSDM
BasicMath         40012    0       40012    0
Dijkstra          60365    202     60365    202
FFT               7190     8       7190     8
FFT inverse       7190     8       7190     8
SHA               57       2       57       2
String Search     503      143     503      143
Susan Edges       776      1       776      1
Susan Smoothing   112      2       112      2

Table 4: Code size of stack manager (in bytes)

       sstore/fci   sload/fco   l2g   g2l    wb
CSM    2404         1900        96    1024   1112
SSDM   184          176         24    120    80

Table 5: Dynamic instructions per function

       sstore/fci   sload/fco   l2g   g2l         wb
       F      NF    F      NF         hit   miss  hit   miss
CSM    180    100   148    95   24    45    76    60    34
SSDM   46     0     44     0    6     11    30    4     20

* F: the stack region is full when the function is called; NF: the stack region has enough space for the incoming function frame.

Table 6: Number of pointer mgmt. function calls

            l2g             g2l              wb
            CSM     SSDM    CSM      SSDM    CSM     SSDM
BasicMath   37010   0       123046   0       89026   0
SHA         2       2       163      158     68      68
Edges       1       0       515      0       514     0
Smoothing   1       0       515      0       514     0

* Edges - Susan Edges, Smoothing - Susan Smoothing

Only four of our eight applications contain pointers to stack data. We can observe that our scheme slightly improves the performance of SHA, and totally eliminates the pointer management functions for the other three benchmarks.

More results: Besides comparing results between SSDM and CSM, we also examined the impact of different stack space sizes, the scalability of our heuristic, and discussed our SSDM against a cache design. We found that i) performance improves as we increase the space for stack data, ii) our SSDM scales well with different numbers of cores, and iii) the penalty of management is much less with our SSDM than with a hardware cache. The detailed results are presented in the Appendix, sections F, G, and H.

9. CONCLUSION

Scratchpad-based Multicore Processor (SMP) architectures are promising, since they are more scalable. However, since scratchpad memory cannot always accommodate the whole program, certain schemes are required to manage the code, global data, stack data, and heap data of the program to enable its execution. The main focus of this paper is on managing stack data, since the majority of accesses in embedded applications may be to stack variables. Even assuming other data are properly managed by other schemes, managing stack data is especially challenging. In this paper, we formulated the problem of efficiently placing library functions at function call sites. In addition, we proposed a heuristic algorithm, SSDM, to generate an efficient function placement. As for pointers to stack data, a scheme was presented to reduce the management cost. Our experimental results show that SSDM generates function placements that lead to significant performance improvement compared to CSM.

10. REFERENCES

[1] "GCC Internals". http://gcc.gnu.org/onlinedocs/gccint/.
[2] Intel Core i7 Processor Extreme Edition and Intel Core i7 Processor Datasheet, Volume 1. White paper, Intel.
[3] Raw Performance: SiSoftware Sandra 2010 Pro (GFLOPS).
[4] SPU C/C++ Language Extensions. Technical report.
[5] The SCC Programmer's Guide. Technical report.
[6] Compilers: Principles, Techniques, and Tools. Addison Wesley, 1986.
[7] F. Angiolini, F. Menichelli, A. Ferrero, L. Benini, and M. Olivieri. A Post-Compiler Approach to Scratchpad Mapping of Code. In Proc. CASES, pages 259-267, 2004.
[8] O. Avissar, R. Barua, and D. Stewart. An Optimal Memory Allocation Scheme for Scratch-pad-based Embedded Systems. ACM TECS, 1(1):6-26, 2002.
[9] K. Bai and A. Shrivastava. Heap Data Management for Limited Local Memory (LLM) Multi-core Processors. In Proc. CODES+ISSS, 2010.
[10] K. Bai, A. Shrivastava, and S. Kudchadker. Stack Data Management for Limited Local Memory (LLM) Multi-core Processors. In Proc. ASP-DAC, pages 231-234, 2011.
[11] M. A. Baker, A. Panda, N. Ghadge, A. Kadne, and K. S. Chatha. A Performance Model and Code Overlay Generator for Scratchpad Enhanced Embedded Processors. In Proc. CODES+ISSS, pages 287-296, 2010.
[12] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, and P. Marwedel. Scratchpad Memory: Design Alternative for Cache On-chip Memory in Embedded Systems. In Proc. CODES+ISSS, pages 73-78, 2002.
[13] A. Dominguez, S. Udayakumaran, and R. Barua. Heap Data Allocation to Scratch-pad Memory in Embedded Systems. J. Embedded Comput., 1(4):521-540, 2005.
[14] B. Egger, C. Kim, C. Jang, Y. Nam, J. Lee, and S. L. Min. A Dynamic Code Placement Technique for Scratchpad Memory Using Postpass Optimization. In Proc. CASES, pages 223-233, 2006.
[15] B. Flachs et al. The Microarchitecture of the Synergistic Processor for a Cell Processor. IEEE J. Solid-State Circuits, 41(1):63-70, 2006.
[16] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. MiBench: A Free, Commercially Representative Embedded Benchmark Suite. In Proc. Workload Characterization, pages 3-14, 2001.
[17] A. Janapsatya, A. Ignjatovic, and S. Parameswaran. A Novel Instruction Scratchpad Memory Optimization Method Based on Concomitance Metric. In Proc. ASP-DAC, pages 612-617, 2006.
[18] S. C. Jung, A. Shrivastava, and K. Bai. Dynamic Code Mapping for Limited Local Memory Systems. In Proc. ASAP, pages 13-20, 2010.
[19] M. Kandemir and A. Choudhary. Compiler-directed Scratch pad Memory Hierarchy Design and Management. In Proc. DAC, pages 628-633, 2002.
[20] M. Kandemir, J. Ramanujam, J. Irwin, N. Vijaykrishnan, I. Kadayif, and A. Parikh. Dynamic Management of Scratch-pad Memory Space. In Proc. DAC, pages 690-695, 2001.
[21] M. Kistler, M. Perrone, and F. Petrini. Cell Multiprocessor Communication Network: Built for Speed. IEEE Micro, 26(3):10-23, May 2006.
[22] L. Li, L. Gao, and J. Xue. Memory Coloring: A Compiler Approach for Scratchpad Memory Management. In Proc. PACT, pages 329-338, 2005.
[23] M. Mamidipaka and N. Dutt. On-chip Stack Based Memory Organization for Low Power Embedded Architectures. In Proc. DATE, pages 1082-1087, 2003.
[24] R. Mcllroy, P. Dickman, and J. Sventek. Efficient Dynamic Heap Allocation of Scratch-pad Memory. In Proc. ISMM, pages 31-40, 2008.
[25] N. Nguyen, A. Dominguez, and R. Barua. Memory Allocation for Embedded Systems with a Compile-time-unknown Scratch-pad Size. In Proc. CASES, pages 115-125, 2005.
[26] P. Panda, N. D. Dutt, and A. Nicolau. On-chip vs. Off-chip Memory: the Data Partitioning Problem in Embedded Processor-based Systems. ACM TODAES, pages 682-704, 2000.
[27] S. Park, H.-W. Park, and S. Ha. A Novel Technique to Use Scratch-pad Memory for Stack Management. In Proc. DATE, pages 1478-1483, 2007.
[28] F. Poletti, P. Marchal, D. Atienza, L. Benini, F. Catthoor, and J. M. Mendias. An Integrated Hardware/Software Approach for Run-time Scratchpad Management. In Proc. DAC, pages 238-243, 2004.
[29] A. Shrivastava, A. Kannan, and J. Lee. A Software-only Solution to Use Scratch Pads for Stack Data. IEEE TCAD, 28(11):1719-1728, 2009.
[30] J. E. Smith. A Study of Branch Prediction Strategies. In Proc. ISCA, pages 135-148, 1981.
[31] S. Udayakumaran, A. Dominguez, and R. Barua. Dynamic Allocation for Scratch-pad Memory Using Compile-time Decisions. ACM TECS, 5(2):472-511, 2006.


[14] B. Egger, C. Kim, C. Jang, Y. Nam, J. Lee, and S. L. Min. ADynamic Code Placement Technique for Scratchpad MemoryUsing Postpass Optimization. In Proc. CASES, pages 223–233,2006.

[15] B. Flachs at el. The Microarchitecture of the SynergisticProcessor for A Cell Processor. IEEE Solid-state circuits,41(1):63–70, 2006.

[16] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin,T. Mudge, and R. B. Brown. Mibench: A Free, CommerciallyRepresentative Embedded Benchmark Suite. Proc. WorkloadCharacterization, pages 3–14, 2001.

[17] A. Janapsatya, A. Ignjatovic, and S. Parameswaran. A NovelInstruction Scratchpad Memory Optimization Method Based onConcomitance Metric. In Proc. ASP-DAC, pages 612–617, 2006.

[18] S. c. Jung, A. Shrivastava, and K. Bai. Dynamic Code Mappingfor Limited Local Memory Systems. In Proc. ASAP, pages13–20, 2010.

[19] M. Kandemir and A. Choudhary. Compiler-directed Scratch padMemory Hierarchy Design and Management. In Proc. DAC,pages 628–633, 2002.

[20] M. Kandemir, J. Ramanujam, J. Irwin, N. Vijaykrishnan,I. Kadayif, and A. Parikh. Dynamic Management of Scratch-padMemory Space. In Proc. DAC, pages 690–695, 2001.

[21] M. Kistler, M. Perrone, and F. Petrini. Cell MultiprocessorCommunication Network: Built for Speed. IEEE Micro,26(3):10–23, May 2006.

[22] L. Li, L. Gao, and J. Xue. Memory Coloring: A CompilerApproach for Scratchpad Memory Management. In Proc. PACT,pages 329–338, 2005.

[23] M. Mamidipaka and N. Dutt. On-chip Stack Based MemoryOrganization for Low Power Embedded Architectures. In Proc.DATE, pages 1082–1087, 2003.

[24] R. Mcllroy, P. Dickman, and J. Sventek. E�cient Dynamic HeapAllocation of Scratch-pad Memory. In ISMM, pages 31–40, 2008.

[25] N. Nguyen, A. Dominguez, and R. Barua. Memory Allocationfor Embedded Systems with A Compile-time-unknownScratch-pad Size. In Proc. CASES, pages 115–125, 2005.

[26] P. Panda, N. D. Dutt, and A. Nicolau. On-chip vs. O↵-chipMemory: the Data Partitioning Problem in EmbeddedProcessor-based Systems. In ACM TODAES, pages 682–704,2000.

[27] S. Park, H.-w. Park, and S. Ha. A Novel Technique to UseScratch-pad Memory for Stack Management. In Proc. DATE,pages 1478–1483, 2007.

[28] F. Poletti, P. Marchal, D. Atienza, L. Benini, F. Catthoor, andJ. M. Mendias. An Integrated Hardware/Software Approach forRun-time Scratchpad Management. In Proc. DAC, pages238–243, 2004.

[29] A. Shrivastava, A. Kannan, and J. Lee. A Software-only Solutionto Use Scratch Pads for Stack Data. IEEE TCAD,28(11):1719–1728, 2009.

[30] J. E. Smith. A Study of Branch Prediction Strategies. In Proc.of ISCA, pages 135–148, 1981.

[31] S. Udayakumaran, A. Dominguez, and R. Barua. DynamicAllocation for Scratch-pad Memory Using Compile-timeDecisions. ACM TECS, 5(2):472–511, 2006.

Table 3: Number of sstore/ fci and sload/ fco Calls

Benchmark

sstore/ fci sload/ fco

CSM SSDM CSM SSDM

BasicMath 40012 0 40012 0

Dijkstra 60365 202 60365 202

FFT 7190 8 7190 8

FFT inverse 7190 8 7190 8

SHA 57 2 57 2

String Search 503 143 503 143

Susan Edges 776 1 776 1

Susan Smoothing 112 2 112 2

Table 4: Code size of stack manager (in bytes)sstore/ fci sload/ fco l2g g2l wb

CSM 2404 1900 96 1024 1112

SSDM 184 176 24 120 80

only four applications among our eight applications that con-tain pointers to stack data. We can observe that our schemecan slightly improve the performance of SHA, and totallyeliminate the pointer management functions for other threebenchmarks.

More results: Besides comparing results between SSDMand CSM, we also examined the impact of di↵erent stackspace sizes, the scalability of our heuristic, and discussed ourSSDM with cache design. We found that i) performance im-proves as we increase the space for stack data, ii) our SSDMscales well with di↵erent number of cores, iii) the penaltyof management is much less with our SSDM compared tohardware cache. The detailed results are presented in theAppendix, section F, section G, and section H.

9. CONCLUSIONScratchpad based Multicore Processor (SMP) architectures

are promising, since they are more scalable. However, sincescratchpad memory cannot always accommodate the wholeprogram, certain schemes are required to mange code, globaldata, stack data and heap data of the program to enable itsexecution. The main focus of this paper is on managing stackdata, since majority of the accesses in embedded applicationsmay be to stack variables. Assuming other data are properlymanaged by other schemes, managing stack data is especiallychallenging. In this paper, we formulated the problem of ef-ficiently placing library functions at the function call sites.In addition, we proposed a heuristic algorithm called SSDMto generate the e�cient function placement. As for pointersto stack data, a proper scheme was presented to reduce themanagement cost. Our experimental results show that SSDMgenerates function placement which leads to significant per-formance improvement compared to CSM.

10. REFERENCES[1] “GCC Internals”. http://gcc.gnu.org/onlinedocs/gccint/.[2] Intel Core i7 Processor Extreme Edition and Intel Core i7

Processor Datasheet, Volume 1. In White paper. Intel.[3] Raw Performance: SiSoftware Sandra 2010 Pro (GFLOPS).[4] SPU C/C++ Language Extensions. Technical report.[5] The SCC Programmer’s Guide. Technical report.[6] Compilers: Principles, Techniques, and Tools. Addison Wesley,

1986.[7] F. Angiolini, F. Menichelli, A. Ferrero, L. Benini, and

M. Olivieri. A Post-Compiler Approach to Scratchpad Mappingof Code. In Proc. CASES, pages 259–267, 2004.

[8] O. Avissar, R. Barua, and D. Stewart. An Optimal MemoryAllocation Scheme for Scratch-pad-based Embedded Systems.ACM TECS, 1(1):6–26, 2002.

[9] K. Bai and A. Shrivastava. Heap Data Management for LimitedLocal Memory (LLM) Multi-core Processors. In Proc.CODES+ISSS, 2010.

[10] K. Bai, A. Shrivastava, and S. Kudchadker. Stack DataManagement for Limited Local Memory (LLM) Multi-coreProcessors. In Proc. ASP-DAC, pages 231–234, 2011.

Table 5: Dynamic instructions per functionsstore/ fci sload/ fco

l2g

g2l wb

F NF F NF hit miss hit miss

CSM 180 100 148 95 24 45 76 60 34

SSDM 46 0 44 0 6 11 30 4 20

* F: stack region is full when function is called; NF: stack region isenough for the incoming function frame.

Table 6: Number of pointer mgmt. function callsl2g g2l wb

CSM SSDM CSM SSDM CSM SSDM

BasicMath 37010 0 123046 0 89026 0

SHA 2 2 163 158 68 68

Edges 1 0 515 0 514 0

Smoothing 1 0 515 0 514 0

* Edges - Susan Edges, Smoothing - Susan Smoothing

[11] M. A. Baker, A. Panda, N. Ghadge, A. Kadne, and K. S.Chatha. A Performance Model and Code Overlay Generator forScratchpad Enhanced Embedded Processors. In Proc.CODES+ISSS, pages 287–296, 2010.

[12] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, andP. Marwedel. Scratchpad Memory: Design Alternative for Cacheon-chip Memory in Embedded Systems. In Proc. CODES+ISSS,pages 73–78, 2002.

[13] A. Dominguez, S. Udayakumaran, and R. Barua. Heap DataAllocation to Scratch-pad Memory in Embedded Systems. J.Embedded Comput., 1(4):521–540, 2005.

[14] B. Egger, C. Kim, C. Jang, Y. Nam, J. Lee, and S. L. Min. ADynamic Code Placement Technique for Scratchpad MemoryUsing Postpass Optimization. In Proc. CASES, pages 223–233,2006.

[15] B. Flachs at el. The Microarchitecture of the SynergisticProcessor for A Cell Processor. IEEE Solid-state circuits,41(1):63–70, 2006.

[16] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin,T. Mudge, and R. B. Brown. Mibench: A Free, CommerciallyRepresentative Embedded Benchmark Suite. Proc. WorkloadCharacterization, pages 3–14, 2001.

[17] A. Janapsatya, A. Ignjatovic, and S. Parameswaran. A NovelInstruction Scratchpad Memory Optimization Method Based onConcomitance Metric. In Proc. ASP-DAC, pages 612–617, 2006.

[18] S. c. Jung, A. Shrivastava, and K. Bai. Dynamic Code Mappingfor Limited Local Memory Systems. In Proc. ASAP, pages13–20, 2010.

[19] M. Kandemir and A. Choudhary. Compiler-directed Scratch padMemory Hierarchy Design and Management. In Proc. DAC,pages 628–633, 2002.

[20] M. Kandemir, J. Ramanujam, J. Irwin, N. Vijaykrishnan,I. Kadayif, and A. Parikh. Dynamic Management of Scratch-padMemory Space. In Proc. DAC, pages 690–695, 2001.

[21] M. Kistler, M. Perrone, and F. Petrini. Cell MultiprocessorCommunication Network: Built for Speed. IEEE Micro,26(3):10–23, May 2006.

[22] L. Li, L. Gao, and J. Xue. Memory Coloring: A CompilerApproach for Scratchpad Memory Management. In Proc. PACT,pages 329–338, 2005.

[23] M. Mamidipaka and N. Dutt. On-chip Stack Based MemoryOrganization for Low Power Embedded Architectures. In Proc.DATE, pages 1082–1087, 2003.

[24] R. Mcllroy, P. Dickman, and J. Sventek. E�cient Dynamic HeapAllocation of Scratch-pad Memory. In ISMM, pages 31–40, 2008.

[25] N. Nguyen, A. Dominguez, and R. Barua. Memory Allocationfor Embedded Systems with A Compile-time-unknownScratch-pad Size. In Proc. CASES, pages 115–125, 2005.

[26] P. Panda, N. D. Dutt, and A. Nicolau. On-chip vs. O↵-chipMemory: the Data Partitioning Problem in EmbeddedProcessor-based Systems. In ACM TODAES, pages 682–704,2000.

[27] S. Park, H.-w. Park, and S. Ha. A Novel Technique to UseScratch-pad Memory for Stack Management. In Proc. DATE,pages 1478–1483, 2007.

[28] F. Poletti, P. Marchal, D. Atienza, L. Benini, F. Catthoor, andJ. M. Mendias. An Integrated Hardware/Software Approach forRun-time Scratchpad Management. In Proc. DAC, pages238–243, 2004.

[29] A. Shrivastava, A. Kannan, and J. Lee. A Software-only Solutionto Use Scratch Pads for Stack Data. IEEE TCAD,28(11):1719–1728, 2009.

[30] J. E. Smith. A Study of Branch Prediction Strategies. In Proc.of ISCA, pages 135–148, 1981.

[31] S. Udayakumaran, A. Dominguez, and R. Barua. DynamicAllocation for Scratch-pad Memory Using Compile-timeDecisions. ACM TECS, 5(2):472–511, 2006.

Figure 6: SSDM reduces the data management overhead and improves performance. (a) SSDM against ILP and CSM. (b) Overhead comparison between SSDM and CSM.

We first utilized the PPE and 1 SPE available in the IBM Cell BE and compared our SSDM performance against the results from ILP and CSM [10]. The y-axis in Figure 6(a) stands for the execution time of each benchmark normalized to its execution time with ILP. In this section, the number of function calls used in the Weighted Call Graph (WCG) is estimated from profile information. In the Appendix, section D, we present a compile-time scheme to assign weights to edges. Experimental results show that both the non-profiling-based scheme and the profiling-based scheme achieve almost the same performance. As observed from Figure 6(a), our SSDM shows performance very similar to the ILP approach. This means our heuristic approaches the optimal solution when the benchmark has a small call graph. Compared to the CSM scheme, our SSDM demonstrates up to 19% and on average 11% performance improvement. The overhead of the management comprises i) time for data transfer, and ii) execution of the instructions in the management library functions. Figure 6(b) compares the execution time overhead of CSM and the proposed SSDM. Results show that when using CSM, an average of 11.3% of the execution time was spent on stack data management. With our new approach, SSDM, the overhead is reduced to a mere 0.8% – a reduction of 13X. Next we break down the overhead and explain the effect of our techniques on its different components:

Opt1 - Increase in the granularity of management: Due to our stack-space-level granularity of management, the number of DMA calls has been reduced. Table 2 shows the number of stack data management DMAs executed when we use CSM vs. the new technique SSDM. Note that no DMAs are required for BasicMath; this is because the whole stack fits into the stack space allowed for this benchmark. Our technique performs well for all benchmarks except Dijkstra. This is because of the recursive function print_path in Dijkstra. CSM performs a DMA only when the stack space is full of recursive function instantiations, while we have to evict recursive functions every time, even with unused stack space. As a result, our technique does not perform very well on recursive programs. However, since many embedded programs are non-recursive, we have left the problem of optimizing for recursive functions as future work.
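The stack-space-level granularity described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's actual library: the names sstore, sload, REGION_SIZE, and dma_transfer are hypothetical, and a real implementation would also copy the frame bytes and handle recursion. The point of the sketch is only the granularity: eviction and restoration move the whole occupied region in one DMA, rather than one frame at a time.

```c
#include <assert.h>

#define REGION_SIZE 512            /* hypothetical stack-region size in bytes */

static int sp = 0;                 /* occupied bytes in the local stack region */
static int evicted_bytes = 0;      /* bytes currently evicted to global memory */
static int dma_count = 0;          /* DMA transfers issued */

/* Stand-in for a DMA to/from off-chip global memory. */
static void dma_transfer(int bytes) { (void)bytes; dma_count++; }

/* sstore: invoked before a call only when the incoming frame may not fit.
 * It evicts ALL resident frames with one DMA (stack-space granularity),
 * instead of shuttling frames out one at a time. */
static void sstore(int incoming_frame)
{
    if (sp + incoming_frame > REGION_SIZE) {
        dma_transfer(sp);          /* one DMA evicts the whole occupied region */
        evicted_bytes += sp;
        sp = 0;
    }
    sp += incoming_frame;          /* the new frame now lives locally */
}

/* sload: invoked after the callee returns; once the region drains,
 * one DMA brings the caller's evicted frames back. */
static void sload(int frame_size)
{
    sp -= frame_size;
    if (sp == 0 && evicted_bytes > 0) {
        int restore = evicted_bytes > REGION_SIZE ? REGION_SIZE : evicted_bytes;
        dma_transfer(restore);     /* one DMA restores a region's worth */
        evicted_bytes -= restore;
        sp = restore;
    }
}
```

With three 200-byte frames and a 512-byte region, only the third call overflows, so a single DMA evicts the 400 resident bytes; a per-frame scheme would have issued two.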

Opt2 - Not performing management when not absolutely needed: Our SSDM scheme reduces the number

Table 1: Benchmarks, their stack sizes, and the stack space we manage them on.

Benchmark         Stack Size (bytes)   Stack Region Size (bytes)
BasicMath         400                  512
Dijkstra          1712                 1024
FFT               656                  512
FFT inverse       656                  512
SHA               2512                 2048
String Search     992                  768
Susan Edges       832                  768
Susan Smoothing   448                  256

of library function calls because of our compile-time analysis. In Table 3, we compare the number of sstore and sload function calls executed when using SSDM vs. _fci and _fco calls when using CSM. We can observe that our scheme makes far fewer library function calls. The main reason is that our SSDM considers the thrashing effect discussed in Section 4. Our approach tries to avoid placing the management library functions sstore and sload around a function containing a large number of function calls if possible, while CSM always inserts management functions at all function call sites.

Opt3 - Performing minimal work each time management is performed: Our management library is simpler, since we only need to maintain a linear queue, as compared to a circular queue in CSM. Table 4 shows the amount of local memory required by our SSDM and CSM, where we can see that our runtime library has a much smaller footprint than CSM's. This matters for performance, since stack frames get less space in the local memory if the library occupies more of it. The reason for CSM's larger footprint is that it needs to handle memory fragmentation, while our SSDM does not have this problem.

Table 5 shows the cost of extra instructions per library function call. We ran all benchmarks with both schemes and approximately calculated the average additional instructions incurred by each library call. As demonstrated in Table 5, our SSDM performs much better than CSM. There is no cost in SSDM when the stack region is sufficient to hold the incoming frames; CSM, however, still needs extra instructions, since it checks the status of the stack region at runtime. A hit for g2l and wb means the accessed stack data is residing in the local memory when the function is called, while a miss denotes that the stack data is not in the local memory. In the CSM approach, more instructions are needed for the hit case than the miss case in the function wb. This is because on a miss the library directly writes the data back to the global memory, whereas on a hit it must look up the management table to translate the address. More importantly, as the table itself occupies space and therefore needs to be managed, CSM may need additional instructions to transfer table entries.
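The hit/miss distinction for the pointer-translation functions can be sketched as below. g2l is the function named in Table 5, but its internals here (the frame_entry layout, the mgmt_table name, and returning NULL on a miss so the caller triggers a DMA) are illustrative assumptions, not the actual CSM or SSDM library code.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define TABLE_ENTRIES 8

/* Each entry maps a frame's global address range to its scratchpad copy. */
typedef struct {
    uintptr_t global_base;   /* frame's address in global memory             */
    char     *local_base;    /* its copy in the scratchpad, NULL if evicted  */
    size_t    size;          /* frame size in bytes, 0 = unused entry        */
} frame_entry;

static frame_entry mgmt_table[TABLE_ENTRIES];

/* g2l: translate a global stack address to a local one.  A hit returns
 * the translated pointer; a miss returns NULL, signalling that the
 * caller must first DMA the frame into the scratchpad. */
static char *g2l(uintptr_t gaddr)
{
    for (int i = 0; i < TABLE_ENTRIES; i++) {
        frame_entry *e = &mgmt_table[i];
        if (e->size != 0 &&
            gaddr >= e->global_base && gaddr < e->global_base + e->size) {
            if (e->local_base == NULL)
                return NULL;                  /* miss: frame was evicted */
            return e->local_base + (gaddr - e->global_base);
        }
    }
    return NULL;                              /* address not in any frame */
}
```

The cost asymmetry in Table 5 follows from this shape: a hit pays for the table walk plus the address arithmetic, while a miss short-circuits to a bulk transfer.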

Opt4 - Not performing pointer management when not needed: Stack pointers are managed only where necessary in SSDM, while CSM might manage all pointers excessively. Table 6 shows the results of the four benchmarks with and without the pointer optimization technique. They are the

Table 2: Comparison of number of DMAs

Benchmark         CSM   SSDM
BasicMath         0     0
Dijkstra          108   364
FFT               26    14
FFT inverse       26    14
SHA               10    4
String Search     380   342
Susan Edges       8     2
Susan Smoothing   12    4

Table 3: Number of sstore/_fci and sload/_fco calls

                  sstore/_fci        sload/_fco
Benchmark         CSM      SSDM      CSM      SSDM
BasicMath         40012    0         40012    0
Dijkstra          60365    202       60365    202
FFT               7190     8         7190     8
FFT inverse       7190     8         7190     8
SHA               57       2         57       2
String Search     503      143       503      143
Susan Edges       776      1         776      1
Susan Smoothing   112      2         112      2

Table 4: Code size of stack manager (in bytes)

       sstore/_fci   sload/_fco   l2g   g2l    wb
CSM    2404          1900         96    1024   1112
SSDM   184           176          24    120    80

only four applications among our eight that contain pointers to stack data. We can observe that our scheme slightly improves the performance of SHA, and totally eliminates the pointer management function calls for the other three benchmarks.

More results: Besides comparing results between SSDM and CSM, we also examined the impact of different stack space sizes and the scalability of our heuristic, and compared our SSDM with a cache design. We found that i) performance improves as we increase the space for stack data, ii) our SSDM scales well with different numbers of cores, and iii) the penalty of management is much less with our SSDM than with a hardware cache. The detailed results are presented in the Appendix, sections F, G, and H.

9. CONCLUSION
Scratchpad based Multicore Processor (SMP) architectures are promising, since they are more scalable. However, since scratchpad memory cannot always accommodate the whole program, schemes are required to manage the code, global data, stack data, and heap data of the program to enable its execution. The main focus of this paper is on managing stack data, since the majority of accesses in embedded applications may be to stack variables. Even assuming other data are properly managed by other schemes, managing stack data is especially challenging. In this paper, we formulated the problem of efficiently placing library functions at function call sites. In addition, we proposed a heuristic algorithm called SSDM to generate an efficient function placement. As for pointers to stack data, a proper scheme was presented to reduce the management cost. Our experimental results show that SSDM generates function placements that lead to significant performance improvement compared to CSM.

10. REFERENCES
[1] "GCC Internals". http://gcc.gnu.org/onlinedocs/gccint/.
[2] Intel Core i7 Processor Extreme Edition and Intel Core i7 Processor Datasheet, Volume 1. White paper. Intel.
[3] Raw Performance: SiSoftware Sandra 2010 Pro (GFLOPS).
[4] SPU C/C++ Language Extensions. Technical report.
[5] The SCC Programmer's Guide. Technical report.
[6] Compilers: Principles, Techniques, and Tools. Addison Wesley, 1986.
[7] F. Angiolini, F. Menichelli, A. Ferrero, L. Benini, and M. Olivieri. A Post-Compiler Approach to Scratchpad Mapping of Code. In Proc. CASES, pages 259–267, 2004.
[8] O. Avissar, R. Barua, and D. Stewart. An Optimal Memory Allocation Scheme for Scratch-pad-based Embedded Systems. ACM TECS, 1(1):6–26, 2002.
[9] K. Bai and A. Shrivastava. Heap Data Management for Limited Local Memory (LLM) Multi-core Processors. In Proc. CODES+ISSS, 2010.
[10] K. Bai, A. Shrivastava, and S. Kudchadker. Stack Data Management for Limited Local Memory (LLM) Multi-core Processors. In Proc. ASP-DAC, pages 231–234, 2011.

Table 5: Dynamic instructions per function

       sstore/_fci   sload/_fco   l2g   g2l          wb
       F      NF     F     NF           hit   miss   hit   miss
CSM    180    100    148   95     24    45    76     60    34
SSDM   46     0      44    0      6     11    30     4     20

* F: stack region is full when the function is called; NF: stack region is enough for the incoming function frame.

Table 6: Number of pointer mgmt. function calls

            l2g             g2l             wb
            CSM     SSDM    CSM     SSDM    CSM     SSDM
BasicMath   37010   0       123046  0       89026   0
SHA         2       2       163     158     68      68
Edges       1       0       515     0       514     0
Smoothing   1       0       515     0       514     0

* Edges - Susan Edges, Smoothing - Susan Smoothing

[11] M. A. Baker, A. Panda, N. Ghadge, A. Kadne, and K. S. Chatha. A Performance Model and Code Overlay Generator for Scratchpad Enhanced Embedded Processors. In Proc. CODES+ISSS, pages 287–296, 2010.
[12] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, and P. Marwedel. Scratchpad Memory: Design Alternative for Cache On-chip Memory in Embedded Systems. In Proc. CODES+ISSS, pages 73–78, 2002.
[13] A. Dominguez, S. Udayakumaran, and R. Barua. Heap Data Allocation to Scratch-pad Memory in Embedded Systems. J. Embedded Comput., 1(4):521–540, 2005.
[14] B. Egger, C. Kim, C. Jang, Y. Nam, J. Lee, and S. L. Min. A Dynamic Code Placement Technique for Scratchpad Memory Using Postpass Optimization. In Proc. CASES, pages 223–233, 2006.
[15] B. Flachs et al. The Microarchitecture of the Synergistic Processor for a Cell Processor. IEEE Solid-State Circuits, 41(1):63–70, 2006.
[16] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. MiBench: A Free, Commercially Representative Embedded Benchmark Suite. In Proc. Workload Characterization, pages 3–14, 2001.
[17] A. Janapsatya, A. Ignjatovic, and S. Parameswaran. A Novel Instruction Scratchpad Memory Optimization Method Based on Concomitance Metric. In Proc. ASP-DAC, pages 612–617, 2006.
[18] S. C. Jung, A. Shrivastava, and K. Bai. Dynamic Code Mapping for Limited Local Memory Systems. In Proc. ASAP, pages 13–20, 2010.
[19] M. Kandemir and A. Choudhary. Compiler-directed Scratch pad Memory Hierarchy Design and Management. In Proc. DAC, pages 628–633, 2002.
[20] M. Kandemir, J. Ramanujam, J. Irwin, N. Vijaykrishnan, I. Kadayif, and A. Parikh. Dynamic Management of Scratch-pad Memory Space. In Proc. DAC, pages 690–695, 2001.
[21] M. Kistler, M. Perrone, and F. Petrini. Cell Multiprocessor Communication Network: Built for Speed. IEEE Micro, 26(3):10–23, May 2006.
[22] L. Li, L. Gao, and J. Xue. Memory Coloring: A Compiler Approach for Scratchpad Memory Management. In Proc. PACT, pages 329–338, 2005.
[23] M. Mamidipaka and N. Dutt. On-chip Stack Based Memory Organization for Low Power Embedded Architectures. In Proc. DATE, pages 1082–1087, 2003.
[24] R. McIlroy, P. Dickman, and J. Sventek. Efficient Dynamic Heap Allocation of Scratch-pad Memory. In Proc. ISMM, pages 31–40, 2008.
[25] N. Nguyen, A. Dominguez, and R. Barua. Memory Allocation for Embedded Systems with a Compile-time-unknown Scratch-pad Size. In Proc. CASES, pages 115–125, 2005.
[26] P. Panda, N. D. Dutt, and A. Nicolau. On-chip vs. Off-chip Memory: The Data Partitioning Problem in Embedded Processor-based Systems. ACM TODAES, pages 682–704, 2000.
[27] S. Park, H.-W. Park, and S. Ha. A Novel Technique to Use Scratch-pad Memory for Stack Management. In Proc. DATE, pages 1478–1483, 2007.
[28] F. Poletti, P. Marchal, D. Atienza, L. Benini, F. Catthoor, and J. M. Mendias. An Integrated Hardware/Software Approach for Run-time Scratchpad Management. In Proc. DAC, pages 238–243, 2004.
[29] A. Shrivastava, A. Kannan, and J. Lee. A Software-only Solution to Use Scratch Pads for Stack Data. IEEE TCAD, 28(11):1719–1728, 2009.
[30] J. E. Smith. A Study of Branch Prediction Strategies. In Proc. ISCA, pages 135–148, 1981.
[31] S. Udayakumaran, A. Dominguez, and R. Barua. Dynamic Allocation for Scratch-pad Memory Using Compile-time Decisions. ACM TECS, 5(2):472–511, 2006.

Page 33

C M L Web page: aviral.lab.asu.edu C M L

Minimal Overhead
} 4% of execution time spent on management

Page 34


Comparison with Caches
} Cache miss penalty = # misses * miss latency
} SPM miss overhead = # API function calls * no. of instructions in API function + # times DMA is called * delay of the DMA (dep. on DMA size)
} Cache is better when miss latency < 260 ps (260 ps = 0.86 * cycle time)
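Plugging the slide's two formulas into code makes the break-even point concrete: the cache wins whenever its miss latency is below the SPM overhead divided by the number of misses. The counts in the test are hypothetical placeholders, not the measurements behind the slide's 260 ps figure.

```c
#include <assert.h>

/* SPM miss overhead = #API calls * instructions per call
 *                   + #DMAs * delay per DMA (size-dependent). */
static double spm_overhead(double api_calls, double instr_per_call,
                           double dma_calls, double dma_delay)
{
    return api_calls * instr_per_call + dma_calls * dma_delay;
}

/* Cache miss penalty = #misses * miss latency, so the break-even
 * miss latency is the SPM overhead amortized over the misses. */
static double breakeven_miss_latency(double overhead, double cache_misses)
{
    return overhead / cache_misses;
}
```

With a measured overhead and miss count in hand, comparing the actual miss latency against this break-even value decides which design wins for a given workload.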

Page 35


Scalability of Management
} The main core does not choke on the memory requests from several cores

[Figure: Normalized runtime (0.97–1.07) vs. number of cores (1–6) for basicmath, DFS, dijkstra, fft, invfft, MST, rbTree, sha, and stringsearch]

Page 36


Summary
} SPMs are an embedded system technology
} SPMs will be needed in general-purpose computing
} Will need to manage stack, heap, and code
} Do not work without management
} Need different strategies for different data
} Code (statically linked)
} Stack (Circular)
} Heap (High associativity)
} Overheads of Software Data Management
} DMA overhead can be comparable or better than cache
} We have just begun – lots of room for improvement

[Diagram: local memory regions – Stack, Heap, Global, Code]

Page 37


Communication Management
} No problem in MPI-style
} Communication is explicit
} For multi-threaded programs
} Replace load => coh_load(), and store => coh_store()
} Too much overhead for sequential consistency
} Weak consistency models allow for efficient software implementations of coherency protocols
} Lazy vs. Eager
} Invalidate vs. Update
} Page-based granularity in multi-processor systems
} Need finer granularity in multi-cores
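A software coherence wrapper of the kind the slide describes (load => coh_load(), store => coh_store(), with lazy write-back at a release point in the spirit of lazy release consistency) might look roughly like this. The line size, the valid/dirty bookkeeping, and the coh_release name are illustrative assumptions, not the actual implementation behind the slide's measurements.

```c
#include <string.h>

#define LINE  32
#define LINES 4

static char global_mem[LINES * LINE];   /* stand-in for off-chip memory */
static char local_mem[LINES * LINE];    /* stand-in for the scratchpad  */
static int  valid[LINES];               /* per-line valid bits          */
static int  dirty[LINES];               /* per-line dirty bits          */
static int  fetches;                    /* line fetches (DMA stand-in)  */

static void fetch_line(int line)
{
    memcpy(&local_mem[line * LINE], &global_mem[line * LINE], LINE);
    valid[line] = 1;
    fetches++;
}

/* coh_load: the compiler rewrites each load into this wrapper. */
static char coh_load(int addr)
{
    int line = addr / LINE;
    if (!valid[line])
        fetch_line(line);               /* miss: pull the line in */
    return local_mem[addr];
}

/* coh_store: writes go to the local copy and are only marked dirty. */
static void coh_store(int addr, char v)
{
    int line = addr / LINE;
    if (!valid[line])
        fetch_line(line);               /* write-allocate */
    local_mem[addr] = v;
    dirty[line] = 1;
}

/* coh_release: lazy write-back of dirty lines at a synchronization point. */
static void coh_release(void)
{
    for (int i = 0; i < LINES; i++)
        if (dirty[i]) {
            memcpy(&global_mem[i * LINE], &local_mem[i * LINE], LINE);
            dirty[i] = 0;
        }
}
```

Replacing every load and store with such wrappers is exactly why sequential consistency is too expensive in software: only by deferring write-back to release points does the per-access cost stay tolerable.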

[Figure: Execution time (ms, log scale, 1–100000) across benchmarks for CRC, LRC-inv, and LRC-upd]

Page 38


Real-time Multicores
} Data and communication management in software
} Better timing guarantees
} Managing data at its natural granularity simplifies WCET calculation
} e.g., find out how many instruction cache misses vs. find out how many function swaps
} Not only lower WCET, but tighter WCET estimates
} Excellent platform for Real-time Systems
} Can tune the management policy to improve WCET
} Software Branch Hinting
} Close to 1-bit HBP performance
} Can place hints to achieve tighter WCET