graphite)internals) - mit csailgroups.csail.mit.edu/.../uploads/2013/...internals.pdf ·...

34
Graphite Internals Addi0onal Details of the Architecture and Opera0on of the Simulator 1

Upload: others

Post on 03-Jun-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Graphite  Internals  

Addi0onal  Details  of  the  Architecture  and  Opera0on  of  the  Simulator  

1  

Page 2: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Outline  

•  Mul0-­‐machine  distribu0on  –  Single  shared  address  space  –  Thread  distribu0on  –  System  calls  

•  Component  Models  – Overview  –  Core  – Memory  Hierarchy  – Network  –  Conten0on  –  Power  

2  

Page 3: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Graphite  Architecture  

•  Applica0on  threads  mapped  to  target  0les  –  On  trap,  use  correct  target  0le’s  models  

•  Target  0les  are  distributed  among  host  processes  

•  Processes  can  be  distributed  to  mul0ple  host  machines    

Target Architecture

Target Tile

Target Tile

Target Tile

Target Tile

Target Tile

Target Tile

Target Tile

Target Tile

Target Tile

Graphite

Host Threads

Application Threads

Application

Host Core

Host Core

Host Core

Host Core

Host Core

Host Core

Host Process

Host OS Host OS

Host Process

Host Process

Host Machines

3  

Page 4: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Parallel  Distribu0on  Benefits  

•  Accelerate  slow  simula0ons  –  Addi0onal  compute/cache  resources    –  Reduces  latency  of  simula0on  for  quick  turn-­‐around  –  Some  efficiency  lost  to  communica0on  overhead  

•  Enable  huge  simula0ons  –  Can  easily  exhaust  OS  or  memory  resources  of  one  machine  –  Addi0onal  DRAM,  memory  bandwidth  –  Addi0onal  thread  contexts  –  No  other  way  to  run  these  experiments  

4  

Page 5: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Parallel  Distribu0on  Challenges  

•  Wanted  support  for  standard  pthreads  model  –  Allows  use  of  off-­‐the-­‐shelf  apps  –  Simulate  coherent-­‐shared-­‐memory  architectures  

•  Must  provide  the  illusion  that  all  threads  are  running  in  a  single  process  on  a  single  machine  –  Single  shared  address  space    –  Thread  spawning  –  System  calls  

5  

Page 6: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Single  Shared  Address  Space  

•  All  applica0on  threads  run  in  a  single  simulated  address  space  

•  Memory  subsystem  provides  modeling  as  well  as  func0onality  

•  Func0onality  implemented  as  part  of  the  target  memory  models  –  Eliminate  redundant  work  –  Test  correctness  of  memory  models  

Host  Address  Space  

Host  Address  Space  

Host  Address  Space  

Simulated  Address  Space  

Application

6  

Page 7: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Simulated  Address  Space  

•  Simulated  address  space  distributed  among  hosts  •  Graphite  manages  the  simulated  address  space  

–  Follows  the  System  V  ABI  

Simulated  Address  Space  

Pin   Models   Pin   Models   Pin   Models  

Host  Address  Space   Host  Address  Space  Host  Address  Space  

7  

Page 8: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Managing  the  address  space  

•  Stack  space  is  allocated  at  thread  start  

•  Appropriate  syscalls  are  intercepted  and  handled  by  Graphite  –  mmap  and  munmap  use  dynamically  allocated  segments  –  brk  allocates  from  program  heap  

•  Memory  accesses  corresponding  to  instruc0on  fetch  not  redirected  –  These  accesses  are  s0ll  modeled  –  Don’t  support  self  modifying  or  dynamically  linked  code  at  the  

moment  

Simulated  Address  Space  

8  

Page 9: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Memory  Bootstrapping  

•  Need  to  bootstrap  the  simulated  address  space  –  Copy  over  code  and  data  from  the  applica0on  binary  –  Copy  over  arguments  and  environment  variables  from  the  stack  

Simulated  Address  Space  

9  

Page 10: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Rewri0ng  memory  operands  

•  Graphite  uses  Pin  API  calls  to  rewrite  memory  accesses  •  Data  resides  somewhere  in  the  modeled  memory  system  

–  May  be  on  a  different  machine!  •  Data  access  may  span  mul0ple  cache  lines  

Emula&on  code  

L1  Cache  

Cache  Hierarchy  

DRAM  

read(addr,  8)   write(addr,  8)  

AND  addr,  rax  Rewrite  

10  

Target  Memory  System  

Page 11: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Rewri0ng  memory  operands  (contd.)  

•  Solu0on:  scratchpads!  

Emula&on  Code  

L1  Cache  

Cache  Hierarchy  

DRAM  

read(addr,  8)   write(addr,  8)  

ADD  addr,  rax  Rewrite  

1

2

3

Execute  

11  

Target  Memory  System  

Page 12: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Atomic  memory  opera0ons  

•  Need  to  prevent  other  cores  from  modifying  data  –  Lock  the  private  L1    cache  during  execu0on  –  This  together  with  the  cache  coherence  protocol  ensures  atomicity  

Emula&on  Code  

L1  Cache  

Cache  Hierarchy  

DRAM  

LOCK  CMPXCHG  addr,  rax  Rewrite  

Execute  

1

2

3

12  

Target  Memory  System  

Page 13: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Thread  Distribu0on  

•  Graphite  runs  applica0on  threads  across  several  host  machines  

•  Must  ini0alize  each  host  process  correctly  

•  Threads  are  automa0cally  distributed  by  trapping  threading  calls  

Host  Machines  

core   core   core   core   core   core  

core   core   core   core   core   core  

Target  Mul0core  

Applica0on  

13  

Page 14: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Process  Ini0aliza0on  

•  Need  to  ini0alize  state  correctly  in  each  process  (glibc  ini0aliza0on,  TLS  setup)  

•  Execute  ini0aliza0on  rou0nes  serially  in  each  process  

•  Process  0  executes  main()  

Process  0  

Ini0alize   Ini0alize  

Ini0alize  

Process  1  

Process  2  

Execute  main()  Wait  for    spawn  request  

14  

Page 15: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Thread  Spawning  

•  Thread  distribu0on  managed  through  MCP/LCPs  

Process  0   Process  1   Process  2  

Target  Cores  

MCP  LCP  

LCP  

LCP  

Spawn  request  

Spawn  request  

Spawn  request  

15  

–  MCP  and  LCPs  not  part  of  target  architecture  –  Perform  management  tasks  (thread  spawning,  syscalls,  etc.)  

Page 16: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Thread  Management  

•  MCP  keeps  table  of  thread  state  

•  Performs  simple  load  balancing  on  spawns  –  Target  cores  striped  across  host  processes  –  Future  work:  beeer  scheduling/load  balancing  

•  Implements  pthread  API  by  intercep0ng  calls  –  Pthread_create()  ini0ates  a  spawn  request  to  MCP  –  Pthread_join()  messages  MCP  and  waits  for  a  reply  when  thread  exits    

16  

Page 17: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Other  syscalls  

File  Management  

Synchroniza0on/  Communica0on  

Memory  Management  

mmap,  munmap,  brk  

kill,  waitpid,  futex  

open,  access,  read,  write  

getrlimit,  nanosleep,  gefd  

Signal  Management  

sigprocmask,  sigsuspend,  sigac0on  

System  Calls  

17  

Page 18: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Other  syscalls  

File  Management  

Synchroniza0on/  Communica0on  

Memory  Management  

mmap,  munmap,  brk  

kill,  waitpid,  futex  

open,  access,  read,  write  

getrlimit,  nanosleep,  ge3d  

Signal  Management  

sigprocmask,  sigsuspend,  sigac9on  

System  Calls  

Handled  locally  

18  

Page 19: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Other  syscalls  

File  Management  

Synchroniza9on/Communica9on  

Memory  Management  

mmap,  munmap,  brk  

kill,  waitpid,  futex  

open,  access,  read,  write  

getrlimit,  nanosleep,  gefd  

Signal  Management  

sigprocmask,  sigsuspend,  sigac0on  

System  Calls  

Handled  at  the  MCP  19  

Page 20: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Core  

… … mov eax, 1 int $0x80 … …

Mechanism  -­‐  Syscalls  Handled  Locally  

On Syscall Entry

On Syscall Exit

Syscalls  Handled  Locally  

20  

Arguments  are  copied  into  a  local  buffer  (if  needed)  

Arguments  copied  back  into  simulated  memory  (if  needed)  

Syscall  Executed  

Page 21: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Core  

MCP  

… … mov eax, 1 int $0x80 … …

1)  Arguments  are  copied  from  simulated  memory  into  a  local  buffer  

2)  Syscall  is  changed  to  “NOP”  (getpid)  

On Syscall Entry

Sent to MCP

1)  Syscall  return  value  received  from  MCP  

2)  Arguments  copied  back  to  simulated  memory  

Mechanism  -­‐  Syscalls  Handled  Centrally  at  the  MCP  

On Syscall Exit

Syscalls  Handled  Centrally  

21  

Syscall  Executed  

Page 22: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Emulated  Syscalls  

•  Some  system  calls  have  to  be  emulated  and  not  simply  executed  locally/centrally  

•  Synchroniza0on  (e.g.,  futex)  – Ensures  global  thread  co-­‐ordina0on  and  correct  target  0me  updates  

•  Memory  management  (brk,  mmap,  munmap)  – Ensures  global  management  of  virtual  memory  

•  Syscalls  with  modeled  state  dependence  – E.g.,  clock_gefme  (needs  target  0me)  

22  

Page 23: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Applica0on  Synchroniza0on  

•  Normal  futex  /  atomic  instruc0ons  – Useful  for  pthread  style  programs  –  Falls  through  to  mechanisms  previously  described  –  Implemented  via  memory  system  

•  Applica0on  func0on  calls  (e.g.,  Barrier())  – Gets  replaced  by  a  simulated  version  – Allows  explora0on  of  architectural  support  for  synchroniza0on  mechanisms  

– Does  not  depend  on  the  memory  system  

23  

Page 24: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Outline  

•  Mul0-­‐machine  distribu0on  –  Single  shared  address  space  –  Thread  distribu0on  –  System  calls  

•  Component  Models  – Overview  –  Core  – Memory  Hierarchy  – Network  –  Conten0on  –  Power  

24  

Page 25: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Simulated  Target  Architecture  

•  Swappable  models  for  processor,  network,  and  memory  hierarchy  components  –  Explore  different  architectures  –  Trade  accuracy  for  performance  

•  Cores  may  be  homogeneous  or  heterogeneous  

Target Tile

Interconnection Network

DRAM

Target Tile

Target Tile

Processor Core

Network Switch

Cache Hierarchy

DRAM Controller

25  

Page 26: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Modeling  Overview  •  Func0onal  and  0ming  components  are  separate  where  

possible  –  Excep0ons  made  for  performance  reasons  

•  Func0onality  –  Direct-­‐execu0on  of  as  many  instruc0ons  as  possible  –  Trap  into  simulator  for  new  behaviors  

•  Timing  (performance)  –  Inputs  from  front  end  and  func0onal  components  used  to  update  simulated  clock  

•  Energy  –  Es0mated  on-­‐line  using  events  from  modeled  components  

•  Each  0le  actually  has  two  threads  –  App  thread  is  the  original  applica0on  thread  instrumented  by  Pin  –  Sim  thread  executes  most  models  (including  memory  and  network)  

26  

Page 27: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Interac0on  between  Models  

Power  Model  McPAT/DSENT  

Conten0on  Model  

Front  End  (applica0on  thread  running  on  Pin)  

Cache  Model  

Network  Model  

Core  Model   Memory  

Controllers/DRAM  

Inputs  from  all  models  

Page 28: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Core  Modeling  

•  Performance  model  completely  separate  from  func0onal  component  –  Applica0on  executes  na0vely  –  Stream  of  events  fed  into  0ming  model  

•  Inputs  from  Pin  as  well  as  dynamic  informa0on  from  the  network  and  memory  components  –  Instruc0on  stream  –  Latency  of  memory  and  network  opera0ons  

•  The  current  model  is  a  simple  in-­‐order  model  –  Basic  pipeline  with  configurable  latency  for  different  classes  of  

instruc0ons  –  Allows  mul0ple  outstanding  memory  opera0ons  

•  “Special  instruc0ons”  used  to  model  aspects  such  as  message  passing  

28  

Page 29: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Memory  Modeling  

1)  Private  L1,  private  L2  cache  hierarchy  –  Directory-­‐based  coherence  scheme  for  L2  –  Directory  distributed  across  all  0les  

2)  Private  L1,  shared  L2  cache  hierarchy  –  L2  cache  distributed  across  all  0les  –  Directory  co-­‐located  with  L2  tags    

•  Configurable  number  of  controllers/DRAM  channels  •  Memory  models  are  both  func0onal  and  0ming  

–  Target  coherence  scheme  used  to  maintain  coherence  across  machines  

–  Messages  are  used  both  to  communicate  data/update  state  and  to  compute  latencies  

29  

Page 30: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Network  models  

•  Func0onal  and  0ming  components  – Message  type  (unicast,  mul0cast,  broadcast)  –  Rou0ng  algorithms  – Network  traversal  latencies  (queuing,  serializa0on,  zero-­‐load)  

•  Uses  Physical  Transport  layer  to  send  messages  to  other  cores’  network  models  

•  Opportunity  for  performance/accuracy  trade-­‐off  –  Timing  may  be  analy0cal,  fully  detailed  or  a  combina0on  

30  

Page 31: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Conten0on  Models  

•  Used  by  network  and  DRAM  to  calculate  queuing  delay  

•  Analy0cal  Model  – Using  an  M/G/1  Queuing  Model  –  Inputs  are  link  u0liza0on,  average  packet  size  

•  History  of  Free  Intervals  – Captures  history  of  network  u0liza0on  – More  accurately  handles  burs0ness  and  clock  skew  

Page 32: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Power  Models  

•  Ac0vity  counters  track  events  during  simula0on  – E.g.,  cache  access,  network  link  traversal  – Energy  calculated  online  from  sta0c  and  dynamic  components  

•  Models  available  for  following  components:  – Network  (using  DSENT)  – Caches  (using  McPAT)  – Cores  (using  McPAT)  

Page 33: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Summary  

•  Special  techniques  used  for  distributed  simula0on:  –  Single,  distributed  shared  address  space  –  Thread  spawning  and  distribu0on  –  Syscall  intercep0on  and  proxying  

•  Graphite  provides  performance  and  power  models  for  core,  memory,  and  network  subsystems  – Modular  architecture  makes  it  easy  to  create  new  models  

33  

Page 34: Graphite)Internals) - MIT CSAILgroups.csail.mit.edu/.../uploads/2013/...internals.pdf · Graphite)Internals) Addi0onal)Details)of)the)Architecture) and)Operaon)of)the)Simulator) 1

Examples  of  Alternate  Models  

•  Ghent  U.,  Intel  Exascience  Lab  –  Replaced  core  model  with  “interval  simula0on”  

•  Univ.  of  New  Mexico  –  Modified  cache  models  

•  CSG  group  @  MIT  –  Modified  cache  protocols  –  Added  special-­‐purpose  mul0-­‐threading  to  core  

•  MIT,  Intel  PAR  group  –  Implemented  hierarchy  of  shared  caches  –  New  cache  coherence  protocol  –  Formally  verified  with  Murphi  checker  

•  Carbon  group  (in-­‐house)  –  Mul0ple  network  models  (accuracy/performance,  on-­‐chip  op0cs)  –  Mul0ple  cache  coherence  protocols  

34