Graphite Internals
Additional Details of the Architecture and Operation of the Simulator
1
Outline
• Multi-machine distribution
  – Single shared address space
  – Thread distribution
  – System calls
• Component Models
  – Overview
  – Core
  – Memory Hierarchy
  – Network
  – Contention
  – Power
2
Graphite Architecture
• Application threads are mapped to target tiles
  – On a trap, use the correct target tile's models
• Target tiles are distributed among host processes
• Processes can be distributed across multiple host machines
[Figure: target tiles of the target architecture are mapped onto application/host threads, which run within host processes (on host cores, under the host OS) across multiple host machines.]
3
Parallel Distribution Benefits
• Accelerate slow simulations
  – Additional compute/cache resources
  – Reduces simulation latency for quick turn-around
  – Some efficiency is lost to communication overhead
• Enable huge simulations
  – Can easily exhaust the OS or memory resources of one machine
  – Additional DRAM and memory bandwidth
  – Additional thread contexts
  – No other way to run these experiments
4
Parallel Distribution Challenges
• Wanted support for the standard pthreads model
  – Allows use of off-the-shelf apps
  – Simulate coherent-shared-memory architectures
• Must provide the illusion that all threads are running in a single process on a single machine
  – Single shared address space
  – Thread spawning
  – System calls
5
Single Shared Address Space
• All application threads run in a single simulated address space
• The memory subsystem provides modeling as well as functionality
• Functionality is implemented as part of the target memory models
  – Eliminates redundant work
  – Tests the correctness of the memory models
[Figure: the application sees one simulated address space, backed by the host address spaces of several host processes.]
6
Simulated Address Space
• The simulated address space is distributed among the hosts
• Graphite manages the simulated address space
  – Follows the System V ABI
[Figure: the simulated address space spans the Pin models running in each host address space.]
7
Managing the address space
• Stack space is allocated at thread start
• The appropriate syscalls are intercepted and handled by Graphite (see the allocator sketch below)
  – mmap and munmap use dynamically allocated segments
  – brk allocates from the program heap
• Memory accesses corresponding to instruction fetch are not redirected
  – These accesses are still modeled
  – Self-modifying and dynamically linked code are not supported at the moment
8
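To make the interception concrete, here is a minimal sketch (hypothetical class and region layout, not Graphite's actual code) of a manager that serves intercepted brk/mmap/munmap requests from the simulated address space:

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>

// Hypothetical manager for the simulated address space: brk grows the
// program heap, while mmap/munmap hand out dynamically allocated segments.
class SimAddressSpace {
public:
    SimAddressSpace(uint64_t heap_base, uint64_t mmap_base)
        : heap_base_(heap_base), heap_end_(heap_base), mmap_next_(mmap_base) {}

    // brk(new_end): grow (or query) the program heap.
    uint64_t Brk(uint64_t new_end) {
        if (new_end > heap_end_) heap_end_ = new_end;   // never shrinks in this sketch
        return heap_end_;
    }

    // mmap(length): return a fresh segment from the dynamic-segment region.
    uint64_t Mmap(uint64_t length) {
        uint64_t addr = mmap_next_;
        mmap_next_ += RoundUpToPage(length);
        segments_[addr] = length;
        return addr;
    }

    // munmap(addr): release a previously mapped segment.
    void Munmap(uint64_t addr) {
        if (segments_.erase(addr) == 0)
            throw std::runtime_error("munmap of unknown segment");
    }

private:
    static uint64_t RoundUpToPage(uint64_t n) { return (n + 4095) & ~uint64_t(4095); }

    uint64_t heap_base_, heap_end_, mmap_next_;
    std::map<uint64_t, uint64_t> segments_;   // start address -> length
};
```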
Memory Bootstrapping
• Need to bootstrap the simulated address space
  – Copy over code and data from the application binary
  – Copy over arguments and environment variables from the stack
9
Rewriting memory operands
• Graphite uses Pin API calls to rewrite memory accesses
• The data resides somewhere in the modeled memory system
  – It may be on a different machine!
• A data access may span multiple cache lines
[Figure: the instruction "AND addr, rax" is rewritten so that emulation code issues read(addr, 8) / write(addr, 8) against the target memory system (L1 cache, cache hierarchy, DRAM).]
10
Rewriting memory operands (contd.)
• Solution: scratchpads! (see the Pin-tool sketch below)
[Figure: the rewritten "ADD addr, rax" executes against a scratchpad: (1) the data is read from the target memory system (L1 cache, cache hierarchy, DRAM) into the scratchpad, (2) the instruction executes, and (3) the result is written back.]
11
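As a rough illustration of the scratchpad mechanism, the following Pin-tool skeleton redirects each memory operand to a scratchpad address returned by an analysis routine. The PIN_*/INS_* calls are real Pin API; the scratchpad handling and the simMemRead()/simMemWrite() hooks are hypothetical stand-ins for Graphite's memory models.

```cpp
#include "pin.H"

// A single scratchpad for brevity; a real tool would keep one per thread.
static UINT8 scratchpad[64];
static REG scratch_reg;   // tool register that carries the redirected address

// Copy the operand from the simulated memory system into the scratchpad and
// return the scratchpad address.
ADDRINT RedirectToScratchpad(ADDRINT ea, UINT32 size, THREADID tid)
{
    // simMemRead(tid, ea, scratchpad, size);   // hypothetical fetch from the target memory system
    return (ADDRINT)scratchpad;
}

VOID Instruction(INS ins, VOID *v)
{
    for (UINT32 op = 0; op < INS_MemoryOperandCount(ins); op++) {
        // Before the instruction, compute the scratchpad address into scratch_reg...
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)RedirectToScratchpad,
                       IARG_MEMORYOP_EA, op,
                       IARG_UINT32, (UINT32)INS_MemoryOperandSize(ins, op),
                       IARG_THREAD_ID,
                       IARG_RETURN_REGS, scratch_reg,
                       IARG_END);
        // ...then rewrite the operand so the instruction accesses the scratchpad.
        INS_RewriteMemoryOperand(ins, op, scratch_reg);
        // Write operands would also need an IPOINT_AFTER call that copies the
        // scratchpad contents back into the simulated memory system.
    }
}

int main(int argc, char *argv[])
{
    PIN_Init(argc, argv);
    scratch_reg = PIN_ClaimToolRegister();     // register used by the rewritten operand
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_StartProgram();                        // never returns
    return 0;
}
```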
Atomic memory operations
• Need to prevent other cores from modifying the data (see the sketch below)
  – Lock the private L1 cache during execution
  – Together with the cache coherence protocol, this ensures atomicity
[Figure: the rewritten "LOCK CMPXCHG addr, rax" is (1) staged from the target memory system (L1 cache, cache hierarchy, DRAM), (2) executed, and (3) written back, all while the private L1 cache is locked.]
12
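A minimal sketch (hypothetical types, not Graphite's interfaces) of the idea: hold a lock on the private L1 for the whole read-modify-write so the emulated LOCK CMPXCHG cannot interleave with other accesses.

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical private L1 cache; the mutex stands in for "locking the cache"
// while an atomic instruction is emulated. In Graphite the cache coherence
// protocol keeps other tiles from holding a writable copy of the line.
struct PrivateL1 {
    std::mutex lock;                               // held for the whole atomic op
    std::unordered_map<uint64_t, uint64_t> lines;  // toy backing store

    uint64_t Read(uint64_t addr) { return lines[addr]; }
    void Write(uint64_t addr, uint64_t v) { lines[addr] = v; }
};

// Emulate LOCK CMPXCHG: compare-and-swap performed while the L1 is locked,
// so no other access can slip in between the read and the write.
bool EmulateCmpxchg(PrivateL1& l1, uint64_t addr, uint64_t expected, uint64_t desired)
{
    std::lock_guard<std::mutex> guard(l1.lock);    // 1) lock the private L1
    uint64_t current = l1.Read(addr);              // 2) read through the memory system
    if (current != expected) return false;
    l1.Write(addr, desired);                       // 3) write the new value back
    return true;                                   // lock released on scope exit
}
```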
Thread Distribution
• Graphite runs application threads across several host machines
• Each host process must be initialized correctly
• Threads are automatically distributed by trapping threading calls
[Figure: application threads on the target multicore (a grid of cores) are distributed across the host machines.]
13
Process Initialization
• Need to initialize state correctly in each process (glibc initialization, TLS setup)
• Initialization routines are executed serially in each process
• Process 0 executes main()
[Figure: processes 0, 1, and 2 run their initialization serially; process 0 then executes main() while the other processes wait for spawn requests.]
14
Thread Spawning
• Thread distribution is managed through the MCP and LCPs
  – The MCP and LCPs are not part of the target architecture
  – They perform management tasks (thread spawning, syscalls, etc.)
[Figure: spawn requests flow between the LCPs in processes 0, 1, and 2 and the MCP, which assigns application threads to target cores.]
15
Thread Management
• The MCP keeps a table of thread state
• Performs simple load balancing on spawns (see the sketch below)
  – Target cores are striped across host processes
  – Future work: better scheduling/load balancing
• Implements the pthread API by intercepting calls
  – pthread_create() initiates a spawn request to the MCP
  – pthread_join() messages the MCP and waits for a reply when the thread exits
16
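A toy sketch, with hypothetical message and class names (not Graphite's actual interfaces), of the MCP side of this protocol: an intercepted pthread_create() becomes a spawn request, and the MCP answers by picking the next target core, with cores striped across host processes.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical spawn request/reply exchanged between an LCP and the MCP.
struct SpawnRequest { uint64_t entry_func; uint64_t arg; };
struct SpawnReply   { int target_core; int host_process; };

// Toy MCP-side thread table and load balancer: target cores are assigned
// round-robin and striped across host processes, as described on the slide.
class Mcp {
public:
    Mcp(int num_target_cores, int num_host_processes)
        : num_cores_(num_target_cores), num_procs_(num_host_processes) {}

    SpawnReply HandleSpawn(const SpawnRequest& req) {
        int core = next_core_++ % num_cores_;        // pick the next target core
        int proc = core % num_procs_;                // cores striped across host processes
        thread_table_.push_back({req.entry_func, core, proc, /*running=*/true});
        return {core, proc};
    }

    void HandleExit(int core) {                      // pthread_join() waits for this
        for (auto& t : thread_table_)
            if (t.core == core) t.running = false;
    }

private:
    struct ThreadState { uint64_t entry; int core; int proc; bool running; };
    int num_cores_, num_procs_, next_core_ = 0;
    std::vector<ThreadState> thread_table_;
};
```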
System Calls
• Memory Management: mmap, munmap, brk
• Synchronization/Communication: kill, waitpid, futex
• File Management: open, access, read, write
• Signal Management: sigprocmask, sigsuspend, sigaction
• Other syscalls: getrlimit, nanosleep, gettid
17
System Calls
[Figure: the same syscall categories as the previous slide, with the syscalls that are handled locally highlighted.]
18
System Calls
[Figure: the same syscall categories, with the syscalls that are handled centrally at the MCP highlighted.]
19
Mechanism - Syscalls Handled Locally
• On syscall entry: arguments are copied into a local buffer (if needed)
• The syscall is executed
• On syscall exit: arguments are copied back into simulated memory (if needed)
[Figure: a core executing "... mov eax, 1; int $0x80 ..."]
20
Mechanism - Syscalls Handled Centrally at the MCP
• On syscall entry: 1) arguments are copied from simulated memory into a local buffer; 2) the syscall is changed to a "NOP" (getpid); the request is sent to the MCP
• The syscall is executed at the MCP
• On syscall exit: 1) the syscall return value is received from the MCP; 2) arguments are copied back to simulated memory
(A skeleton Pin tool illustrating both handling paths follows below.)
[Figure: the core's "... mov eax, 1; int $0x80 ..." is forwarded to the MCP.]
21
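For illustration, a skeleton Pin tool showing where the two paths hook in: a syscall-entry callback inspects the syscall number and either fixes up arguments locally or neuters the syscall (turning it into getpid) and forwards the request to the MCP. Only the PIN_* calls are real Pin API; the classification and the commented-out helper names are assumptions.

```cpp
#include "pin.H"
#include <sys/syscall.h>

// Decide, per syscall number, whether it is handled locally or sent to the MCP.
// This classification is illustrative, not Graphite's actual table.
static bool HandledAtMcp(ADDRINT num) {
    return num == SYS_open || num == SYS_read || num == SYS_write;
}

VOID OnSyscallEntry(THREADID tid, CONTEXT *ctxt, SYSCALL_STANDARD std, VOID *v)
{
    ADDRINT num = PIN_GetSyscallNumber(ctxt, std);

    if (HandledAtMcp(num)) {
        ADDRINT arg0 = PIN_GetSyscallArgument(ctxt, std, 0);   // first argument, for the request
        (void)arg0;
        // sendRequestToMcp(tid, num, arg0, ...);               // hypothetical: copy args, forward
        PIN_SetSyscallNumber(ctxt, std, SYS_getpid);            // neuter the real syscall ("NOP")
    } else {
        // copyArgsIntoLocalBuffer(tid, num);                   // hypothetical local fix-up
    }
}

VOID OnSyscallExit(THREADID tid, CONTEXT *ctxt, SYSCALL_STANDARD std, VOID *v)
{
    // For MCP-handled calls: install the return value received from the MCP and
    // copy result buffers back into simulated memory (hypothetical helpers), e.g.:
    // PIN_SetContextReg(ctxt, REG_GAX, replyFromMcp);
}

int main(int argc, char *argv[])
{
    PIN_Init(argc, argv);
    PIN_AddSyscallEntryFunction(OnSyscallEntry, 0);
    PIN_AddSyscallExitFunction(OnSyscallExit, 0);
    PIN_StartProgram();
    return 0;
}
```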
Emulated Syscalls
• Some system calls have to be emulated rather than simply executed locally or centrally
• Synchronization (e.g., futex)
  – Ensures global thread coordination and correct target time updates
• Memory management (brk, mmap, munmap)
  – Ensures global management of virtual memory
• Syscalls that depend on modeled state
  – E.g., clock_gettime (needs the target time)
22
Application Synchronization
• Normal futexes / atomic instructions
  – Useful for pthread-style programs
  – Fall through to the mechanisms previously described
  – Implemented via the memory system
• Application function calls (e.g., Barrier()); see the replacement sketch below
  – The call is replaced by a simulated version
  – Allows exploration of architectural support for synchronization mechanisms
  – Does not depend on the memory system
23
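A sketch of how an application-level call such as Barrier() could be swapped for a simulated version using Pin routine replacement; the replacement body and the symbol name are hypothetical, while the IMG_*/RTN_* calls are real Pin API.

```cpp
#include "pin.H"

// Hypothetical simulated barrier: instead of spinning in application code, the
// call is modeled by the simulator (e.g., by messaging the synchronization model).
VOID SimulatedBarrier()
{
    // carbonBarrierWait();   // hypothetical hook into the simulator's sync model
}

VOID ImageLoad(IMG img, VOID *v)
{
    // Look for the application's Barrier() symbol and replace it wholesale.
    RTN rtn = RTN_FindByName(img, "Barrier");
    if (RTN_Valid(rtn))
        RTN_Replace(rtn, (AFUNPTR)SimulatedBarrier);
}

int main(int argc, char *argv[])
{
    PIN_InitSymbols();                       // needed so RTN_FindByName can see symbols
    PIN_Init(argc, argv);
    IMG_AddInstrumentFunction(ImageLoad, 0);
    PIN_StartProgram();
    return 0;
}
```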
Outline
• Multi-machine distribution
  – Single shared address space
  – Thread distribution
  – System calls
• Component Models
  – Overview
  – Core
  – Memory Hierarchy
  – Network
  – Contention
  – Power
24
Simulated Target Architecture
• Swappable models for the processor, network, and memory hierarchy components (see the interface sketch below)
  – Explore different architectures
  – Trade accuracy for performance
• Cores may be homogeneous or heterogeneous
[Figure: target tiles (processor core, cache hierarchy, network switch, DRAM controller) connected by the interconnection network to DRAM.]
25
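One way to picture the swappable-models idea (a hypothetical interface, not Graphite's actual class names or configuration keys): an abstract model interface plus a factory that selects an implementation from a configuration string.

```cpp
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical core-model interface: each implementation trades accuracy for speed.
class CoreModel {
public:
    virtual ~CoreModel() = default;
    virtual void HandleInstruction(uint64_t memory_latency) = 0;
    virtual uint64_t Cycles() const = 0;
};

// A one-cycle-per-instruction model that ignores memory latency entirely...
class SimpleCoreModel : public CoreModel {
public:
    void HandleInstruction(uint64_t) override { ++cycles_; }
    uint64_t Cycles() const override { return cycles_; }
private:
    uint64_t cycles_ = 0;
};

// ...and a slightly more detailed one that charges the memory latency it is given.
class InOrderCoreModel : public CoreModel {
public:
    void HandleInstruction(uint64_t memory_latency) override { cycles_ += 1 + memory_latency; }
    uint64_t Cycles() const override { return cycles_; }
private:
    uint64_t cycles_ = 0;
};

// Factory: the model is chosen from the simulator configuration at startup.
std::unique_ptr<CoreModel> CreateCoreModel(const std::string& name)
{
    if (name == "simple")   return std::make_unique<SimpleCoreModel>();
    if (name == "in_order") return std::make_unique<InOrderCoreModel>();
    throw std::runtime_error("unknown core model: " + name);
}
```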
Modeling Overview
• Functional and timing components are separate where possible
  – Exceptions made for performance reasons
• Functionality
  – Direct execution of as many instructions as possible
  – Trap into the simulator for new behaviors
• Timing (performance)
  – Inputs from the front end and functional components are used to update the simulated clock
• Energy
  – Estimated on-line using events from the modeled components
• Each tile actually has two threads
  – The app thread is the original application thread instrumented by Pin
  – The sim thread executes most models (including memory and network)
26
Interaction between Models
[Figure: the front end (the application thread running on Pin) feeds the core model, cache model, and network model; these interact with the contention model and the memory controllers/DRAM, and the power model (McPAT/DSENT) takes inputs from all models.]
Core Modeling
• The performance model is completely separate from the functional component
  – The application executes natively
  – A stream of events is fed into the timing model
• Inputs come from Pin as well as dynamic information from the network and memory components
  – Instruction stream
  – Latency of memory and network operations
• The current model is a simple in-order model (see the sketch below)
  – Basic pipeline with configurable latency for different classes of instructions
  – Allows multiple outstanding memory operations
• "Special instructions" are used to model aspects such as message passing
28
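A minimal sketch, not Graphite's actual core model, of an in-order timing model that charges a configurable latency per instruction class and lets a memory operation overlap with later instructions until its result is used:

```cpp
#include <cstdint>
#include <map>

// Instruction classes with configurable latencies (cycles); the values are made up.
enum class InsClass { ALU, BRANCH, FPU, LOAD, STORE };

class InOrderTimingModel {
public:
    InOrderTimingModel() : latency_{{InsClass::ALU, 1}, {InsClass::BRANCH, 1},
                                    {InsClass::FPU, 3}, {InsClass::LOAD, 1},
                                    {InsClass::STORE, 1}} {}

    // Non-memory instruction: issue in order, charge the class latency.
    void Execute(InsClass c) { cycle_ += latency_.at(c); }

    // Memory instruction: issue now; the completion time comes from the memory model.
    // The operation stays outstanding until a later instruction uses its result.
    void ExecuteLoad(uint64_t memory_latency) {
        cycle_ += latency_.at(InsClass::LOAD);
        load_ready_ = cycle_ + memory_latency;        // allowed to overlap with later work
    }

    // A consumer of the load result must wait until the data is back.
    void UseLoadResult() { if (cycle_ < load_ready_) cycle_ = load_ready_; }

    uint64_t Cycle() const { return cycle_; }

private:
    std::map<InsClass, uint64_t> latency_;
    uint64_t cycle_ = 0;
    uint64_t load_ready_ = 0;                          // one outstanding load, for brevity
};
```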
Memory Modeling
1) Private L1, private L2 cache hierarchy
  – Directory-based coherence scheme for the L2 (a toy directory sketch follows this slide)
  – Directory distributed across all tiles
2) Private L1, shared L2 cache hierarchy
  – L2 cache distributed across all tiles
  – Directory co-located with the L2 tags
• Configurable number of memory controllers/DRAM channels
• Memory models are both functional and timing
  – The target coherence scheme is used to maintain coherence across machines
  – Messages are used both to communicate data/update state and to compute latencies
29
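For concreteness, a toy directory entry and handler (hypothetical, and far simpler than Graphite's protocols): the home tile of a cache line tracks sharers and an owner, and a write request invalidates other sharers before granting ownership.

```cpp
#include <cstdint>
#include <set>
#include <unordered_map>

// Toy directory-based coherence state for one cache line.
enum class DirState { UNCACHED, SHARED, MODIFIED };

struct DirEntry {
    DirState state = DirState::UNCACHED;
    std::set<int> sharers;        // tiles holding a read-only copy
    int owner = -1;               // tile holding the modified copy, if any
};

class Directory {
public:
    // Read request from `tile`: add it to the sharers (downgrading an owner is omitted).
    void HandleRead(uint64_t line, int tile) {
        DirEntry& e = dir_[line];
        e.sharers.insert(tile);
        if (e.state == DirState::UNCACHED) e.state = DirState::SHARED;
    }

    // Write request from `tile`: invalidate all other sharers, then grant ownership.
    void HandleWrite(uint64_t line, int tile) {
        DirEntry& e = dir_[line];
        for (int s : e.sharers)
            if (s != tile) SendInvalidate(line, s);   // each message would also be timed
        e.sharers = {tile};
        e.owner = tile;
        e.state = DirState::MODIFIED;
    }

private:
    void SendInvalidate(uint64_t /*line*/, int /*tile*/) { /* network message in Graphite */ }
    std::unordered_map<uint64_t, DirEntry> dir_;      // home-node slice of the directory
};
```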
Network models
• Functional and timing components
  – Message type (unicast, multicast, broadcast)
  – Routing algorithms
  – Network traversal latencies (queuing, serialization, zero-load); see the latency sketch below
• Uses the physical transport layer to send messages to other cores' network models
• Opportunity for a performance/accuracy trade-off
  – Timing may be analytical, fully detailed, or a combination
30
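An assumed, back-of-the-envelope decomposition of packet latency into zero-load, serialization, and queueing terms on a 2D mesh with XY routing; Graphite's actual network models may differ.

```cpp
#include <cstdint>
#include <cstdlib>

// Assumed decomposition: latency = zero-load (hops * per-hop delay)
//                                 + serialization (packet size / link width)
//                                 + queueing delay (from a contention model).
struct MeshNetworkModel {
    uint64_t hop_latency;        // cycles per router/link traversal
    uint64_t link_width_bytes;   // bytes transferred per cycle

    uint64_t PacketLatency(int src_x, int src_y, int dst_x, int dst_y,
                           uint64_t packet_bytes, uint64_t queueing_delay) const {
        // XY (dimension-ordered) routing on a 2D mesh: hop count is the Manhattan distance.
        uint64_t hops = static_cast<uint64_t>(std::abs(dst_x - src_x) + std::abs(dst_y - src_y));
        uint64_t zero_load     = hops * hop_latency;
        uint64_t serialization = (packet_bytes + link_width_bytes - 1) / link_width_bytes;
        return zero_load + serialization + queueing_delay;
    }
};
```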
Contention Models
• Used by the network and DRAM models to calculate queuing delay
• Analytical model
  – Uses an M/G/1 queuing model (a worked example follows below)
  – Inputs are the link utilization and the average packet size
• History of free intervals
  – Captures the history of network utilization
  – Handles burstiness and clock skew more accurately
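For the analytical option, the standard M/G/1 mean waiting time (the Pollaczek-Khinchine formula, W = lambda * E[S^2] / (2 * (1 - rho)) with rho = lambda * E[S]) can be computed from the arrival rate and the first two moments of the service time; whether Graphite uses exactly this form is an assumption.

```cpp
#include <cstdio>
#include <stdexcept>

// Pollaczek-Khinchine mean waiting time for an M/G/1 queue:
//   W = lambda * E[S^2] / (2 * (1 - rho)),  where rho = lambda * E[S].
// With packets whose service (serialization) time has mean E[S] and second
// moment E[S^2], and link utilization rho, the expected queueing delay follows.
double Mg1QueueingDelay(double arrival_rate, double mean_service, double second_moment)
{
    double rho = arrival_rate * mean_service;          // link utilization
    if (rho >= 1.0) throw std::runtime_error("unstable queue (utilization >= 1)");
    return arrival_rate * second_moment / (2.0 * (1.0 - rho));
}

int main()
{
    // Example: packets arrive at 0.02 per cycle, take 10 cycles on average to
    // serialize, and have E[S^2] = 150 (some size variance). Utilization = 0.2.
    double w = Mg1QueueingDelay(0.02, 10.0, 150.0);
    std::printf("mean queueing delay = %.2f cycles\n", w);   // ~1.88 cycles
    return 0;
}
```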
Power Models
• Activity counters track events during simulation (see the accounting sketch below)
  – E.g., cache accesses, network link traversals
  – Energy is calculated online from static and dynamic components
• Models are available for the following components:
  – Network (using DSENT)
  – Caches (using McPAT)
  – Cores (using McPAT)
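A minimal sketch (with made-up per-event energies) of online energy accounting from activity counters, combining a static leakage term with dynamic per-event costs:

```cpp
#include <cstdint>
#include <cstdio>

// Energy accounting from activity counters: E = P_static * t + sum_i(events_i * E_i).
struct PowerModel {
    double static_power_watts;      // leakage, paid for every simulated second
    double cache_access_joules;     // per-event dynamic energies (illustrative values)
    double link_traversal_joules;

    uint64_t cache_accesses = 0;    // activity counters updated by the other models
    uint64_t link_traversals = 0;

    double TotalEnergy(double simulated_seconds) const {
        double static_energy  = static_power_watts * simulated_seconds;
        double dynamic_energy = cache_accesses  * cache_access_joules +
                                link_traversals * link_traversal_joules;
        return static_energy + dynamic_energy;
    }
};

int main()
{
    PowerModel pm{0.5, 1e-10, 2e-11};     // 0.5 W leakage; 0.1 nJ/access; 0.02 nJ/traversal
    pm.cache_accesses  = 1'000'000;
    pm.link_traversals = 250'000;
    std::printf("energy = %.6f J\n", pm.TotalEnergy(0.001));  // over 1 ms of target time
    return 0;
}
```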
Summary
• Special techniques are used for distributed simulation:
  – A single, distributed shared address space
  – Thread spawning and distribution
  – Syscall interception and proxying
• Graphite provides performance and power models for the core, memory, and network subsystems
  – The modular architecture makes it easy to create new models
33
Examples of Alternate Models
• Ghent U., Intel Exascience Lab
  – Replaced the core model with "interval simulation"
• Univ. of New Mexico
  – Modified cache models
• CSG group @ MIT
  – Modified cache protocols
  – Added special-purpose multi-threading to the core
• MIT, Intel PAR group
  – Implemented a hierarchy of shared caches
  – New cache coherence protocol
  – Formally verified with the Murphi checker
• Carbon group (in-house)
  – Multiple network models (accuracy/performance, on-chip optics)
  – Multiple cache coherence protocols
34