Graphite Internals
Additional Details of the Architecture and Operation of the Simulator
1
Outline
• Multi-machine distribution
  – Single shared address space
  – Thread distribution
  – System calls
• Component Models
  – Overview
  – Core
  – Memory Hierarchy
  – Network
  – Contention
  – Power
2
Graphite Architecture
• Application threads are mapped to target tiles
  – On a trap, use the correct target tile's models
• Target tiles are distributed among host processes
• Processes can be distributed across multiple host machines
[Figure: target tiles of the target architecture are mapped onto application/host threads, which run within host processes (on host cores, under the host OS) across multiple host machines.]
3
Parallel Distribution Benefits
• Accelerate slow simulations
  – Additional compute/cache resources
  – Reduces simulation latency for quick turn-around
  – Some efficiency is lost to communication overhead
• Enable huge simulations
  – Can easily exhaust the OS or memory resources of one machine
  – Additional DRAM and memory bandwidth
  – Additional thread contexts
  – No other way to run these experiments
4
Parallel Distribution Challenges
• Wanted support for the standard pthreads model
  – Allows use of off-the-shelf apps
  – Simulate coherent-shared-memory architectures
• Must provide the illusion that all threads are running in a single process on a single machine
  – Single shared address space
  – Thread spawning
  – System calls
5
Single Shared Address Space
• All application threads run in a single simulated address space
• The memory subsystem provides modeling as well as functionality
• Functionality is implemented as part of the target memory models
  – Eliminates redundant work
  – Tests the correctness of the memory models
[Figure: the application sees one simulated address space, backed by the host address spaces of several host processes.]
6
Simulated Address Space
• The simulated address space is distributed among the hosts
• Graphite manages the simulated address space
  – Follows the System V ABI
[Figure: the simulated address space spans the Pin models running in each host address space.]
7
Managing the address space
• Stack space is allocated at thread start
• The appropriate syscalls are intercepted and handled by Graphite (see the allocator sketch below)
  – mmap and munmap use dynamically allocated segments
  – brk allocates from the program heap
• Memory accesses corresponding to instruction fetch are not redirected
  – These accesses are still modeled
  – Self-modifying and dynamically linked code are not supported at the moment
8
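To make the interception concrete, here is a minimal sketch (hypothetical class and region layout, not Graphite's actual code) of a manager that serves intercepted brk/mmap/munmap requests from the simulated address space:

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>

// Hypothetical manager for the simulated address space: brk grows the
// program heap, while mmap/munmap hand out dynamically allocated segments.
class SimAddressSpace {
public:
    SimAddressSpace(uint64_t heap_base, uint64_t mmap_base)
        : heap_base_(heap_base), heap_end_(heap_base), mmap_next_(mmap_base) {}

    // brk(new_end): grow (or query) the program heap.
    uint64_t Brk(uint64_t new_end) {
        if (new_end > heap_end_) heap_end_ = new_end;   // never shrinks in this sketch
        return heap_end_;
    }

    // mmap(length): return a fresh segment from the dynamic-segment region.
    uint64_t Mmap(uint64_t length) {
        uint64_t addr = mmap_next_;
        mmap_next_ += RoundUpToPage(length);
        segments_[addr] = length;
        return addr;
    }

    // munmap(addr): release a previously mapped segment.
    void Munmap(uint64_t addr) {
        if (segments_.erase(addr) == 0)
            throw std::runtime_error("munmap of unknown segment");
    }

private:
    static uint64_t RoundUpToPage(uint64_t n) { return (n + 4095) & ~uint64_t(4095); }

    uint64_t heap_base_, heap_end_, mmap_next_;
    std::map<uint64_t, uint64_t> segments_;   // start address -> length
};
```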
Memory Bootstrapping
• Need to bootstrap the simulated address space
  – Copy over code and data from the application binary
  – Copy over arguments and environment variables from the stack
9
Rewriting memory operands
• Graphite uses Pin API calls to rewrite memory accesses
• The data resides somewhere in the modeled memory system
  – It may be on a different machine!
• A data access may span multiple cache lines
[Figure: the instruction "AND addr, rax" is rewritten so that emulation code issues read(addr, 8) / write(addr, 8) against the target memory system (L1 cache, cache hierarchy, DRAM).]
10
Rewriting memory operands (contd.)
• Solution: scratchpads! (see the Pin-tool sketch below)
[Figure: the rewritten "ADD addr, rax" executes against a scratchpad: (1) the data is read from the target memory system (L1 cache, cache hierarchy, DRAM) into the scratchpad, (2) the instruction executes, and (3) the result is written back.]
11
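As a rough illustration of the scratchpad mechanism, the following Pin-tool skeleton redirects each memory operand to a scratchpad address returned by an analysis routine. The PIN_*/INS_* calls are real Pin API; the scratchpad handling and the simMemRead()/simMemWrite() hooks are hypothetical stand-ins for Graphite's memory models.

```cpp
#include "pin.H"

// A single scratchpad for brevity; a real tool would keep one per thread.
static UINT8 scratchpad[64];
static REG scratch_reg;   // tool register that carries the redirected address

// Copy the operand from the simulated memory system into the scratchpad and
// return the scratchpad address.
ADDRINT RedirectToScratchpad(ADDRINT ea, UINT32 size, THREADID tid)
{
    // simMemRead(tid, ea, scratchpad, size);   // hypothetical fetch from the target memory system
    return (ADDRINT)scratchpad;
}

VOID Instruction(INS ins, VOID *v)
{
    for (UINT32 op = 0; op < INS_MemoryOperandCount(ins); op++) {
        // Before the instruction, compute the scratchpad address into scratch_reg...
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)RedirectToScratchpad,
                       IARG_MEMORYOP_EA, op,
                       IARG_UINT32, (UINT32)INS_MemoryOperandSize(ins, op),
                       IARG_THREAD_ID,
                       IARG_RETURN_REGS, scratch_reg,
                       IARG_END);
        // ...then rewrite the operand so the instruction accesses the scratchpad.
        INS_RewriteMemoryOperand(ins, op, scratch_reg);
        // Write operands would also need an IPOINT_AFTER call that copies the
        // scratchpad contents back into the simulated memory system.
    }
}

int main(int argc, char *argv[])
{
    PIN_Init(argc, argv);
    scratch_reg = PIN_ClaimToolRegister();     // register used by the rewritten operand
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_StartProgram();                        // never returns
    return 0;
}
```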
Atomic memory operations
• Need to prevent other cores from modifying the data (see the sketch below)
  – Lock the private L1 cache during execution
  – Together with the cache coherence protocol, this ensures atomicity
[Figure: the rewritten "LOCK CMPXCHG addr, rax" is (1) staged from the target memory system (L1 cache, cache hierarchy, DRAM), (2) executed, and (3) written back, all while the private L1 cache is locked.]
12
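A minimal sketch (hypothetical types, not Graphite's interfaces) of the idea: hold a lock on the private L1 for the whole read-modify-write so the emulated LOCK CMPXCHG cannot interleave with other accesses.

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical private L1 cache; the mutex stands in for "locking the cache"
// while an atomic instruction is emulated. In Graphite the cache coherence
// protocol keeps other tiles from holding a writable copy of the line.
struct PrivateL1 {
    std::mutex lock;                               // held for the whole atomic op
    std::unordered_map<uint64_t, uint64_t> lines;  // toy backing store

    uint64_t Read(uint64_t addr) { return lines[addr]; }
    void Write(uint64_t addr, uint64_t v) { lines[addr] = v; }
};

// Emulate LOCK CMPXCHG: compare-and-swap performed while the L1 is locked,
// so no other access can slip in between the read and the write.
bool EmulateCmpxchg(PrivateL1& l1, uint64_t addr, uint64_t expected, uint64_t desired)
{
    std::lock_guard<std::mutex> guard(l1.lock);    // 1) lock the private L1
    uint64_t current = l1.Read(addr);              // 2) read through the memory system
    if (current != expected) return false;
    l1.Write(addr, desired);                       // 3) write the new value back
    return true;                                   // lock released on scope exit
}
```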
Thread Distribution
• Graphite runs application threads across several host machines
• Each host process must be initialized correctly
• Threads are automatically distributed by trapping threading calls
[Figure: application threads on the target multicore (a grid of cores) are distributed across the host machines.]
13
Process Initialization
• Need to initialize state correctly in each process (glibc initialization, TLS setup)
• Initialization routines are executed serially in each process
• Process 0 executes main()
[Figure: processes 0, 1, and 2 run their initialization serially; process 0 then executes main() while the other processes wait for spawn requests.]
14
Thread Spawning
• Thread distribution is managed through the MCP and LCPs
  – The MCP and LCPs are not part of the target architecture
  – They perform management tasks (thread spawning, syscalls, etc.)
[Figure: spawn requests flow between the LCPs in processes 0, 1, and 2 and the MCP, which assigns application threads to target cores.]
15
Thread Management
• The MCP keeps a table of thread state
• Performs simple load balancing on spawns (see the sketch below)
  – Target cores are striped across host processes
  – Future work: better scheduling/load balancing
• Implements the pthread API by intercepting calls
  – pthread_create() initiates a spawn request to the MCP
  – pthread_join() messages the MCP and waits for a reply when the thread exits
16
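A toy sketch, with hypothetical message and class names (not Graphite's actual interfaces), of the MCP side of this protocol: an intercepted pthread_create() becomes a spawn request, and the MCP answers by picking the next target core, with cores striped across host processes.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical spawn request/reply exchanged between an LCP and the MCP.
struct SpawnRequest { uint64_t entry_func; uint64_t arg; };
struct SpawnReply   { int target_core; int host_process; };

// Toy MCP-side thread table and load balancer: target cores are assigned
// round-robin and striped across host processes, as described on the slide.
class Mcp {
public:
    Mcp(int num_target_cores, int num_host_processes)
        : num_cores_(num_target_cores), num_procs_(num_host_processes) {}

    SpawnReply HandleSpawn(const SpawnRequest& req) {
        int core = next_core_++ % num_cores_;        // pick the next target core
        int proc = core % num_procs_;                // cores striped across host processes
        thread_table_.push_back({req.entry_func, core, proc, /*running=*/true});
        return {core, proc};
    }

    void HandleExit(int core) {                      // pthread_join() waits for this
        for (auto& t : thread_table_)
            if (t.core == core) t.running = false;
    }

private:
    struct ThreadState { uint64_t entry; int core; int proc; bool running; };
    int num_cores_, num_procs_, next_core_ = 0;
    std::vector<ThreadState> thread_table_;
};
```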
System Calls
• Memory Management: mmap, munmap, brk
• Synchronization/Communication: kill, waitpid, futex
• File Management: open, access, read, write
• Signal Management: sigprocmask, sigsuspend, sigaction
• Other syscalls: getrlimit, nanosleep, gettid
17
System Calls
[Figure: the same syscall categories as the previous slide, with the syscalls that are handled locally highlighted.]
18
System Calls
[Figure: the same syscall categories, with the syscalls that are handled centrally at the MCP highlighted.]
19
Mechanism - Syscalls Handled Locally
• On syscall entry: arguments are copied into a local buffer (if needed)
• The syscall is executed
• On syscall exit: arguments are copied back into simulated memory (if needed)
[Figure: a core executing "... mov eax, 1; int $0x80 ..."]
20
Mechanism - Syscalls Handled Centrally at the MCP
• On syscall entry: 1) arguments are copied from simulated memory into a local buffer; 2) the syscall is changed to a "NOP" (getpid); the request is sent to the MCP
• The syscall is executed at the MCP
• On syscall exit: 1) the syscall return value is received from the MCP; 2) arguments are copied back to simulated memory
(A skeleton Pin tool illustrating both handling paths follows below.)
[Figure: the core's "... mov eax, 1; int $0x80 ..." is forwarded to the MCP.]
21
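For illustration, a skeleton Pin tool showing where the two paths hook in: a syscall-entry callback inspects the syscall number and either fixes up arguments locally or neuters the syscall (turning it into getpid) and forwards the request to the MCP. Only the PIN_* calls are real Pin API; the classification and the commented-out helper names are assumptions.

```cpp
#include "pin.H"
#include <sys/syscall.h>

// Decide, per syscall number, whether it is handled locally or sent to the MCP.
// This classification is illustrative, not Graphite's actual table.
static bool HandledAtMcp(ADDRINT num) {
    return num == SYS_open || num == SYS_read || num == SYS_write;
}

VOID OnSyscallEntry(THREADID tid, CONTEXT *ctxt, SYSCALL_STANDARD std, VOID *v)
{
    ADDRINT num = PIN_GetSyscallNumber(ctxt, std);

    if (HandledAtMcp(num)) {
        ADDRINT arg0 = PIN_GetSyscallArgument(ctxt, std, 0);   // first argument, for the request
        (void)arg0;
        // sendRequestToMcp(tid, num, arg0, ...);               // hypothetical: copy args, forward
        PIN_SetSyscallNumber(ctxt, std, SYS_getpid);            // neuter the real syscall ("NOP")
    } else {
        // copyArgsIntoLocalBuffer(tid, num);                   // hypothetical local fix-up
    }
}

VOID OnSyscallExit(THREADID tid, CONTEXT *ctxt, SYSCALL_STANDARD std, VOID *v)
{
    // For MCP-handled calls: install the return value received from the MCP and
    // copy result buffers back into simulated memory (hypothetical helpers), e.g.:
    // PIN_SetContextReg(ctxt, REG_GAX, replyFromMcp);
}

int main(int argc, char *argv[])
{
    PIN_Init(argc, argv);
    PIN_AddSyscallEntryFunction(OnSyscallEntry, 0);
    PIN_AddSyscallExitFunction(OnSyscallExit, 0);
    PIN_StartProgram();
    return 0;
}
```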
Emulated Syscalls
• Some system calls have to be emulated rather than simply executed locally or centrally
• Synchronization (e.g., futex)
  – Ensures global thread coordination and correct target time updates
• Memory management (brk, mmap, munmap)
  – Ensures global management of virtual memory
• Syscalls that depend on modeled state
  – E.g., clock_gettime (needs the target time)
22
Application Synchronization
• Normal futexes / atomic instructions
  – Useful for pthread-style programs
  – Fall through to the mechanisms previously described
  – Implemented via the memory system
• Application function calls (e.g., Barrier()); see the replacement sketch below
  – The call is replaced by a simulated version
  – Allows exploration of architectural support for synchronization mechanisms
  – Does not depend on the memory system
23
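A sketch of how an application-level call such as Barrier() could be swapped for a simulated version using Pin routine replacement; the replacement body and the symbol name are hypothetical, while the IMG_*/RTN_* calls are real Pin API.

```cpp
#include "pin.H"

// Hypothetical simulated barrier: instead of spinning in application code, the
// call is modeled by the simulator (e.g., by messaging the synchronization model).
VOID SimulatedBarrier()
{
    // carbonBarrierWait();   // hypothetical hook into the simulator's sync model
}

VOID ImageLoad(IMG img, VOID *v)
{
    // Look for the application's Barrier() symbol and replace it wholesale.
    RTN rtn = RTN_FindByName(img, "Barrier");
    if (RTN_Valid(rtn))
        RTN_Replace(rtn, (AFUNPTR)SimulatedBarrier);
}

int main(int argc, char *argv[])
{
    PIN_InitSymbols();                       // needed so RTN_FindByName can see symbols
    PIN_Init(argc, argv);
    IMG_AddInstrumentFunction(ImageLoad, 0);
    PIN_StartProgram();
    return 0;
}
```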
Outline
• Multi-machine distribution
  – Single shared address space
  – Thread distribution
  – System calls
• Component Models
  – Overview
  – Core
  – Memory Hierarchy
  – Network
  – Contention
  – Power
24
Simulated Target Architecture
• Swappable models for the processor, network, and memory hierarchy components (see the interface sketch below)
  – Explore different architectures
  – Trade accuracy for performance
• Cores may be homogeneous or heterogeneous
[Figure: target tiles (processor core, cache hierarchy, network switch, DRAM controller) connected by the interconnection network to DRAM.]
25
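One way to picture the swappable-models idea (a hypothetical interface, not Graphite's actual class names or configuration keys): an abstract model interface plus a factory that selects an implementation from a configuration string.

```cpp
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical core-model interface: each implementation trades accuracy for speed.
class CoreModel {
public:
    virtual ~CoreModel() = default;
    virtual void HandleInstruction(uint64_t memory_latency) = 0;
    virtual uint64_t Cycles() const = 0;
};

// A one-cycle-per-instruction model that ignores memory latency entirely...
class SimpleCoreModel : public CoreModel {
public:
    void HandleInstruction(uint64_t) override { ++cycles_; }
    uint64_t Cycles() const override { return cycles_; }
private:
    uint64_t cycles_ = 0;
};

// ...and a slightly more detailed one that charges the memory latency it is given.
class InOrderCoreModel : public CoreModel {
public:
    void HandleInstruction(uint64_t memory_latency) override { cycles_ += 1 + memory_latency; }
    uint64_t Cycles() const override { return cycles_; }
private:
    uint64_t cycles_ = 0;
};

// Factory: the model is chosen from the simulator configuration at startup.
std::unique_ptr<CoreModel> CreateCoreModel(const std::string& name)
{
    if (name == "simple")   return std::make_unique<SimpleCoreModel>();
    if (name == "in_order") return std::make_unique<InOrderCoreModel>();
    throw std::runtime_error("unknown core model: " + name);
}
```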
Modeling Overview
• Functional and timing components are separate where possible
  – Exceptions made for performance reasons
• Functionality
  – Direct execution of as many instructions as possible
  – Trap into the simulator for new behaviors
• Timing (performance)
  – Inputs from the front end and functional components are used to update the simulated clock
• Energy
  – Estimated on-line using events from the modeled components
• Each tile actually has two threads
  – The app thread is the original application thread instrumented by Pin
  – The sim thread executes most models (including memory and network)
26
Interaction between Models
[Figure: the front end (the application thread running on Pin) feeds the core model, cache model, and network model; these interact with the contention model and the memory controllers/DRAM, and the power model (McPAT/DSENT) takes inputs from all models.]
Core Modeling
• The performance model is completely separate from the functional component
  – The application executes natively
  – A stream of events is fed into the timing model
• Inputs come from Pin as well as dynamic information from the network and memory components
  – Instruction stream
  – Latency of memory and network operations
• The current model is a simple in-order model (see the sketch below)
  – Basic pipeline with configurable latency for different classes of instructions
  – Allows multiple outstanding memory operations
• "Special instructions" are used to model aspects such as message passing
28
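A minimal sketch, not Graphite's actual core model, of an in-order timing model that charges a configurable latency per instruction class and lets a memory operation overlap with later instructions until its result is used:

```cpp
#include <cstdint>
#include <map>

// Instruction classes with configurable latencies (cycles); the values are made up.
enum class InsClass { ALU, BRANCH, FPU, LOAD, STORE };

class InOrderTimingModel {
public:
    InOrderTimingModel() : latency_{{InsClass::ALU, 1}, {InsClass::BRANCH, 1},
                                    {InsClass::FPU, 3}, {InsClass::LOAD, 1},
                                    {InsClass::STORE, 1}} {}

    // Non-memory instruction: issue in order, charge the class latency.
    void Execute(InsClass c) { cycle_ += latency_.at(c); }

    // Memory instruction: issue now; the completion time comes from the memory model.
    // The operation stays outstanding until a later instruction uses its result.
    void ExecuteLoad(uint64_t memory_latency) {
        cycle_ += latency_.at(InsClass::LOAD);
        load_ready_ = cycle_ + memory_latency;        // allowed to overlap with later work
    }

    // A consumer of the load result must wait until the data is back.
    void UseLoadResult() { if (cycle_ < load_ready_) cycle_ = load_ready_; }

    uint64_t Cycle() const { return cycle_; }

private:
    std::map<InsClass, uint64_t> latency_;
    uint64_t cycle_ = 0;
    uint64_t load_ready_ = 0;                          // one outstanding load, for brevity
};
```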
Memory Modeling
1) Private L1, private L2 cache hierarchy
  – Directory-based coherence scheme for the L2 (a toy directory sketch follows this slide)
  – Directory distributed across all tiles
2) Private L1, shared L2 cache hierarchy
  – L2 cache distributed across all tiles
  – Directory co-located with the L2 tags
• Configurable number of memory controllers/DRAM channels
• Memory models are both functional and timing
  – The target coherence scheme is used to maintain coherence across machines
  – Messages are used both to communicate data/update state and to compute latencies
29
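For concreteness, a toy directory entry and handler (hypothetical, and far simpler than Graphite's protocols): the home tile of a cache line tracks sharers and an owner, and a write request invalidates other sharers before granting ownership.

```cpp
#include <cstdint>
#include <set>
#include <unordered_map>

// Toy directory-based coherence state for one cache line.
enum class DirState { UNCACHED, SHARED, MODIFIED };

struct DirEntry {
    DirState state = DirState::UNCACHED;
    std::set<int> sharers;        // tiles holding a read-only copy
    int owner = -1;               // tile holding the modified copy, if any
};

class Directory {
public:
    // Read request from `tile`: add it to the sharers (downgrading an owner is omitted).
    void HandleRead(uint64_t line, int tile) {
        DirEntry& e = dir_[line];
        e.sharers.insert(tile);
        if (e.state == DirState::UNCACHED) e.state = DirState::SHARED;
    }

    // Write request from `tile`: invalidate all other sharers, then grant ownership.
    void HandleWrite(uint64_t line, int tile) {
        DirEntry& e = dir_[line];
        for (int s : e.sharers)
            if (s != tile) SendInvalidate(line, s);   // each message would also be timed
        e.sharers = {tile};
        e.owner = tile;
        e.state = DirState::MODIFIED;
    }

private:
    void SendInvalidate(uint64_t /*line*/, int /*tile*/) { /* network message in Graphite */ }
    std::unordered_map<uint64_t, DirEntry> dir_;      // home-node slice of the directory
};
```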
Network models
• Functional and timing components
  – Message type (unicast, multicast, broadcast)
  – Routing algorithms
  – Network traversal latencies (queuing, serialization, zero-load); see the latency sketch below
• Uses the physical transport layer to send messages to other cores' network models
• Opportunity for a performance/accuracy trade-off
  – Timing may be analytical, fully detailed, or a combination
30
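An assumed, back-of-the-envelope decomposition of packet latency into zero-load, serialization, and queueing terms on a 2D mesh with XY routing; Graphite's actual network models may differ.

```cpp
#include <cstdint>
#include <cstdlib>

// Assumed decomposition: latency = zero-load (hops * per-hop delay)
//                                 + serialization (packet size / link width)
//                                 + queueing delay (from a contention model).
struct MeshNetworkModel {
    uint64_t hop_latency;        // cycles per router/link traversal
    uint64_t link_width_bytes;   // bytes transferred per cycle

    uint64_t PacketLatency(int src_x, int src_y, int dst_x, int dst_y,
                           uint64_t packet_bytes, uint64_t queueing_delay) const {
        // XY (dimension-ordered) routing on a 2D mesh: hop count is the Manhattan distance.
        uint64_t hops = static_cast<uint64_t>(std::abs(dst_x - src_x) + std::abs(dst_y - src_y));
        uint64_t zero_load     = hops * hop_latency;
        uint64_t serialization = (packet_bytes + link_width_bytes - 1) / link_width_bytes;
        return zero_load + serialization + queueing_delay;
    }
};
```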
Contention Models
• Used by the network and DRAM models to calculate queuing delay
• Analytical model
  – Uses an M/G/1 queuing model (a worked example follows below)
  – Inputs are the link utilization and the average packet size
• History of free intervals
  – Captures the history of network utilization
  – Handles burstiness and clock skew more accurately
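For the analytical option, the standard M/G/1 mean waiting time (the Pollaczek-Khinchine formula, W = lambda * E[S^2] / (2 * (1 - rho)) with rho = lambda * E[S]) can be computed from the arrival rate and the first two moments of the service time; whether Graphite uses exactly this form is an assumption.

```cpp
#include <cstdio>
#include <stdexcept>

// Pollaczek-Khinchine mean waiting time for an M/G/1 queue:
//   W = lambda * E[S^2] / (2 * (1 - rho)),  where rho = lambda * E[S].
// With packets whose service (serialization) time has mean E[S] and second
// moment E[S^2], and link utilization rho, the expected queueing delay follows.
double Mg1QueueingDelay(double arrival_rate, double mean_service, double second_moment)
{
    double rho = arrival_rate * mean_service;          // link utilization
    if (rho >= 1.0) throw std::runtime_error("unstable queue (utilization >= 1)");
    return arrival_rate * second_moment / (2.0 * (1.0 - rho));
}

int main()
{
    // Example: packets arrive at 0.02 per cycle, take 10 cycles on average to
    // serialize, and have E[S^2] = 150 (some size variance). Utilization = 0.2.
    double w = Mg1QueueingDelay(0.02, 10.0, 150.0);
    std::printf("mean queueing delay = %.2f cycles\n", w);   // ~1.88 cycles
    return 0;
}
```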
Power Models
• Activity counters track events during simulation (see the accounting sketch below)
  – E.g., cache accesses, network link traversals
  – Energy is calculated online from static and dynamic components
• Models are available for the following components:
  – Network (using DSENT)
  – Caches (using McPAT)
  – Cores (using McPAT)
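A minimal sketch (with made-up per-event energies) of online energy accounting from activity counters, combining a static leakage term with dynamic per-event costs:

```cpp
#include <cstdint>
#include <cstdio>

// Energy accounting from activity counters: E = P_static * t + sum_i(events_i * E_i).
struct PowerModel {
    double static_power_watts;      // leakage, paid for every simulated second
    double cache_access_joules;     // per-event dynamic energies (illustrative values)
    double link_traversal_joules;

    uint64_t cache_accesses = 0;    // activity counters updated by the other models
    uint64_t link_traversals = 0;

    double TotalEnergy(double simulated_seconds) const {
        double static_energy  = static_power_watts * simulated_seconds;
        double dynamic_energy = cache_accesses  * cache_access_joules +
                                link_traversals * link_traversal_joules;
        return static_energy + dynamic_energy;
    }
};

int main()
{
    PowerModel pm{0.5, 1e-10, 2e-11};     // 0.5 W leakage; 0.1 nJ/access; 0.02 nJ/traversal
    pm.cache_accesses  = 1'000'000;
    pm.link_traversals = 250'000;
    std::printf("energy = %.6f J\n", pm.TotalEnergy(0.001));  // over 1 ms of target time
    return 0;
}
```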
Summary
• Special techniques are used for distributed simulation:
  – A single, distributed shared address space
  – Thread spawning and distribution
  – Syscall interception and proxying
• Graphite provides performance and power models for the core, memory, and network subsystems
  – The modular architecture makes it easy to create new models
33
Examples of Alternate Models
• Ghent U., Intel Exascience Lab
  – Replaced the core model with "interval simulation"
• Univ. of New Mexico
  – Modified cache models
• CSG group @ MIT
  – Modified cache protocols
  – Added special-purpose multi-threading to the core
• MIT, Intel PAR group
  – Implemented a hierarchy of shared caches
  – New cache coherence protocol
  – Formally verified with the Murphi checker
• Carbon group (in-house)
  – Multiple network models (accuracy/performance, on-chip optics)
  – Multiple cache coherence protocols
34