(C) 2005 Multifacet Project http://www.cs.wisc.edu/gems
ISCA Tutorial, June 5th, 2005
Mike Marty, Brad Beckmann, Luke Yen, Alaa Alameldeen, Min Xu, Kevin Moore
Please Ask Questions
Slide 2 http://www.cs.wisc.edu/gems
What do you want to simulate?
[Figure: target systems]
• Uniprocessor
• Symmetric Multiprocessor
• Glueless Multiprocessor
• Chip Multiprocessor (CMP)
• Multiple-CMP
Slide 3 http://www.cs.wisc.edu/gems
Open Source Release of GEMS
• GEMS v1.1 released as GPL software
  http://www.cs.wisc.edu/gems
• Contributors
Alaa Alameldeen
Brad Beckmann
Ross Dickson
Pacia Harper
Milo Martin
Mike Marty
Carl Mauer
Kevin Moore
Manoj Plakal
Dan Sorin
Min Xu
Luke Yen
• Multifacet Project directed by Mark Hill & David Wood
Slide 4 http://www.cs.wisc.edu/gems
GEMS Requirements
• Virtutech Simics 2.0.x or 2.2.x
  – Personal academic licenses available
  – http://www.virtutech.com
• Host Machine
  – x86 (32 or 64-bit) Linux or Sparc/Solaris host machine
  – > 1 GB Memory
• Workload Checkpoints YOU Create
  – License issues w/ releasing checkpoints
Slide 5 http://www.cs.wisc.edu/gems
GEMS From 50,000 Feet
[Figure: GEMS components. Processor drivers: Simics; Opal (detailed processor model). Other drivers: random tester; microbenchmarks (deterministic, contended locks, trace file).]
Slide 6 http://www.cs.wisc.edu/gems
GEMS From 50,000 Feet
Simics: Full-System Functional Simulator
• Boots unmodified Solaris 9
• BUT, each instruction takes 1 cycle
• www.virtutech.com
Slide 7 http://www.cs.wisc.edu/gems
GEMS From 50,000 Feet
Ruby: Memory System Model
• Flexible multiprocessor memory hierarchy
• Includes domain-specific language (SLICC)
Slide 8 http://www.cs.wisc.edu/gems
GEMS From 50,000 Feet
Opal: OoO Processor Model
• Implements partial SPARC v9 ISA
• Modeled after MIPS R10000
Slide 9 http://www.cs.wisc.edu/gems
GEMS From 50,000 Feet
Other Drivers
• Testing independent of Simics
• Microbenchmarks
Slide 10 http://www.cs.wisc.edu/gems
Outline
• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
• Building Workloads
Slide 11 http://www.cs.wisc.edu/gems
Full-System Simulation with GEMS
• Steps:
  – Choosing a Ruby protocol
  – Building Ruby and Opal
  – Starting and configuring Simics
  – Loading and configuring Ruby
  – Loading and configuring Opal
  – Running simulation
  – Getting results
Demo
Slide 12 http://www.cs.wisc.edu/gems
Choosing the Ruby System/Protocol
• Included with GEMS release v1.1
  – CMP protocols
    • MOESI_CMP_token: M-CMP token coherence
    • MSI_MOSI_CMP_directory: 2-level Directory
    • MOESI_CMP_directory: higher performing 2-level Directory
  – SMP protocols
    • MOSI_SMP_bcast: snooping on ordered interconnect
    • MOSI_SMP_directory
    • MOSI_SMP_hammer: based on AMD Hammer
• And more
Demo
Slide 13 http://www.cs.wisc.edu/gems
Building Ruby and Opal
• Ruby module
    cd $GEMS_ROOT/ruby
  – Set compile-time defaults
    vi config/rubyconfig.defaults
  – Build module, choosing protocol and destination dir
    make PROTOCOL=MOESI_CMP_token DESTINATION=MOESI_CMP_token
  – SLICC runs, generates HTML and additional C++ files
  – Ruby module built and moved to
    $GEMS_ROOT/simics/home/MOESI_CMP_token
• Build Opal
    cd $GEMS_ROOT/opal
    make module DESTINATION=MOESI_CMP_token
Demo
Slide 14 http://www.cs.wisc.edu/gems
Starting Simics
• Start non-GUI Simics
maya(9)% cd $GEMS_ROOT/simics/home/MOESI_CMP_token/
maya(10)% ./simics
Checking out a license... done: academic license.
Looking for additional Simics modules in ./modules
+----------------+ Copyright 1998-2004 by Virtutech, All Rights Reserved
| Virtutech | Version: simics-2.0.23
| Simics | Compiled: Thu Oct 14 20:27:36 CEST 2004
+----------------+
www.simics.com "Virtutech" and "Simics" are trademarks of Virtutech AB
Type 'copyright' for details on copyright.
Type 'license' for details on warranty, copying, etc.
Type 'readme' for further information about this version.
Type 'help help' for info on the on-line documentation.
simics>
Demo
Slide 15 http://www.cs.wisc.edu/gems
Checkpoint and Configuration
• Checkpoints should be created first
  – Simics-only process
    simics> read-configuration ../../checkpoints-u3/jbb/jbb-16p.check
  – SpecJBB checkpoint loaded
• Load python scripts
    simics> @sys.path.append("../../../gen-scripts")
    simics> @import mfacet
• Configure Simics
    simics> istc-disable
    Turning I-STC off and flushing old data
    simics> dstc-disable
    Turning D-STC off and flushing old data
    simics> instruction-fetch-mode instruction-fetch-trace
    simics> magic-break-enable
    simics> cpu-switch-time 1
Demo
Slide 16 http://www.cs.wisc.edu/gems
Load and Configure Ruby
Load module
simics> load-module ruby
Setting # processors is required
simics> ruby0.setparam g_NUM_PROCESSORS 16
Create a M-CMP system (4 chips, 4 procs/chip)
simics> ruby0.setparam g_PROCS_PER_CHIP 4
Override compile-time defaults
simics> ruby0.setparam g_NUM_L2_BANKS 32
simics> ruby0.setparam L2_CACHE_ASSOC 4
simics> ruby0.setparam L2_CACHE_NUM_SETS_BITS 16
simics> ruby0.setparam NETWORK_LINK_LATENCY 50
Initialize
simics> ruby0.init
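The Ruby setup on this slide follows a fixed order: load the module, set the processor counts, apply overrides, then initialize. A minimal Python sketch that emits the same Simics command lines from a dictionary; the helper name ruby_setup_commands is hypothetical and not part of GEMS or mfacet, and it only builds strings, it does not talk to Simics.

```python
# Sketch only: builds the Simics CLI lines shown on this slide.
# ruby_setup_commands is a hypothetical helper, not a GEMS function.

def ruby_setup_commands(num_processors, procs_per_chip, overrides):
    """Return the command lines in the order the slide uses them."""
    if num_processors % procs_per_chip != 0:
        raise ValueError("processors must divide evenly across chips")
    lines = [
        "load-module ruby",
        f"ruby0.setparam g_NUM_PROCESSORS {num_processors}",
        f"ruby0.setparam g_PROCS_PER_CHIP {procs_per_chip}",
    ]
    for name, value in overrides.items():
        lines.append(f"ruby0.setparam {name} {value}")
    lines.append("ruby0.init")  # initialization comes last
    return lines


# Values taken from the slide: 16 processors, 4 per chip, so 4 chips.
commands = ruby_setup_commands(
    16,
    4,
    {
        "g_NUM_L2_BANKS": 32,
        "L2_CACHE_ASSOC": 4,
        "L2_CACHE_NUM_SETS_BITS": 16,
        "NETWORK_LINK_LATENCY": 50,
    },
)
num_chips = 16 // 4
for line in commands:
    print("simics>", line)
```

Keeping the overrides in one dictionary makes it easy to diff two configurations before pasting them into a Simics session.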
Demo
Slide 17 http://www.cs.wisc.edu/gems
Optionally Load and Configure Opal
Load module
simics> load-module opal
Initialize default processor
simics> opal0.init
simics> opal0.listparam
Start opal (but do not start simulating)
simics> opal0.sim-start "output.opal"
Demo
Slide 18 http://www.cs.wisc.edu/gems
Running simulation
• Setup transaction-based simulation
  – “magic breakpoints”
  – Five JBB transactions
    simics> @mfacet.setup_run_for_n_transactions(5,1)
• Start simulating
  – Ruby only (Simics drives Ruby):
    simics> c
  – Opal is loaded (Opal steps Simics):
    simics> opal0.sim-step 9999999999
Demo
Slide 19 http://www.cs.wisc.edu/gems
Dumping Some Output
• Opal stats
simics> opal0.stats
• Ruby stats
simics> ruby0.dump-stats ruby.stats
• Ruby short stats
simics> ruby0.dump-short-stats
– Ruby_cycles is a good runtime metric
Demo
Slide 20 http://www.cs.wisc.edu/gems
Outline
• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
  – Overview (Drivers & Memory System)
  – Event-driven simulation
  – Interconnection network
  – SLICC: Specifying the logic of the system
  – Simple example: SMP MI protocol
  – Limitations
• BREAK
• Opal: Out-of-order processor model
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
• Building Workloads
Slide 21 http://www.cs.wisc.edu/gems
High-Level Infrastructure Map
[Figure: drivers feed the memory system. Internal Ruby testers: random tester; microbenchmarks (deterministic, contended locks, trace file). External CPU models: Simics; Opal (detailed processor model).]
Slide 22 http://www.cs.wisc.edu/gems
Ruby Driver: Random Tester
• “Verifying a Multiprocessor Cache Controller Using Random Test Generation” [Wood et al. 90]
• Purpose: Excite cache coherency bugs
• Competing actions performed then checked
• Utilizes false sharing
  – Multiple writers - action
  – Single read - check
• Randomly inserted delay
Slide 23 http://www.cs.wisc.edu/gems
Ruby Driver: Microbenchmarks
• Deterministic tester
  – Simple sequence of requests
  – Sanity checking and performance tuning
  – DeterministicDriver.C
    • GETX, SeriesGETS, Inv
• Contended locks
  – Compare and swap atomic op.
  – RequestGenerator.C / SyntheticDriver.C
• Trace file
  – Issues requests one at a time
  – Similar to cache warmup mechanism
  – ‘-z <trace_file.gz>’
Slide 24 http://www.cs.wisc.edu/gems
Ruby Driver: In-order Processor Model
• Simics blocking interface (in-order processor)
  – Single issue, non-pipelined processor
  – Only one outstanding request per CPU
• SIMICS_RUBY_MULTIPLIER > 1
  – Estimates a higher performance processor
  – Multiple Simics processor cycles == one Ruby cycle
Slide 25 http://www.cs.wisc.edu/gems
Ruby Driver: In-order Processor Model
• Implements Simics’ mh_memorytracer_possible_cache_miss()
• “Callback” Simics with SIM_stall_cycle(proc_ptr, 0)
[Figure: Simics in-order processor model with processors P0 to P3 on the Simics time queue; each processor's instructions exchange stall()/unstall() with the Ruby memory system model.]
Slide 26 http://www.cs.wisc.edu/gems
Ruby Driver: Out-of-order Processor Model
• Opal (out-of-order processor)
  – Super-scalar pipelined processor
  – Multiple outstanding requests per CPU
• OPAL_RUBY_MULTIPLIER > 1
  – Faster processor core frequency than memory
  – Simulation execution optimization
What are they driving?
Slide 27 http://www.cs.wisc.edu/gems
Ruby Multiprocessor Memory System
• Physical Components
  – Caches
  – Memory
  – System Interconnect
• Determines the timing of memory requests
  – Driver issues memory request to Ruby
  – Ruby simulates the request
  – Ruby eventually calls back the driver with the latency
• Ruby’s purpose: return memory latency
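The request/callback contract between a driver and Ruby can be sketched in a few lines of Python. This is a toy stand-in, not Ruby's interface: the class name ToyMemorySystem, its methods, and the hit and miss latencies (2 and 80 cycles) are all assumptions made for illustration.

```python
class ToyMemorySystem:
    """Stands in for Ruby: accepts requests, later calls back with a latency."""

    def __init__(self, hit_latency, miss_latency):
        self.hit_latency = hit_latency
        self.miss_latency = miss_latency
        self.cached = set()   # addresses seen before count as hits
        self.pending = []     # (finish_time, address, issue_time, callback)
        self.now = 0

    def make_request(self, address, callback):
        """Driver side: issue a request and return immediately."""
        latency = self.hit_latency if address in self.cached else self.miss_latency
        self.cached.add(address)
        self.pending.append((self.now + latency, address, self.now, callback))

    def advance_time(self):
        """Advance one cycle and call back every request that has finished."""
        self.now += 1
        still_pending = []
        for finish, address, issued, callback in self.pending:
            if finish <= self.now:
                callback(address, self.now - issued)
            else:
                still_pending.append((finish, address, issued, callback))
        self.pending = still_pending


observed = []  # the driver records (address, latency) pairs here
memory = ToyMemorySystem(hit_latency=2, miss_latency=80)

memory.make_request(0x40, lambda addr, lat: observed.append((addr, lat)))
for _ in range(100):
    memory.advance_time()

memory.make_request(0x40, lambda addr, lat: observed.append((addr, lat)))
for _ in range(2):
    memory.advance_time()

print(observed)
```

The driver never computes a latency itself; it only learns one when the callback fires, which is the division of labor the slide describes.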
Slide 29 http://www.cs.wisc.edu/gems
Discrete Event-driven Simulation
• Discrete event-driven simulation
  – Events change system state
  – Series of scheduled events
• Global EventQueue
  – Heart of Ruby
  – Priority heap of event/time pairs
    • Not a true queue: not in FIFO order
    • Self-sorting queue
  – Events scheduled for the same cycle occur in arbitrary order
  – Every event must be scheduled at least one unit of time in the future
GlobalEventQueue
Event      Time
*Event G   7
*Event B   5
*Event J   3
*Event S   3
*Event A   4
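The queue on this slide can be approximated with a binary heap. A minimal Python sketch, assuming events are plain names and using the five event/time pairs from the figure; the EventQueue class here is illustrative and is not Ruby's EventQueue. Ties are broken by insertion order, which is one valid choice given that same-cycle order is arbitrary.

```python
import heapq


class EventQueue:
    """Toy priority heap of (time, name) pairs, ordered by time."""

    def __init__(self):
        self.now = 0
        self._heap = []
        self._count = 0  # insertion counter, used only to break ties

    def schedule(self, name, delta):
        """Schedule an event delta time units from now (at least 1)."""
        if delta < 1:
            raise ValueError("events must be at least one time unit ahead")
        heapq.heappush(self._heap, (self.now + delta, self._count, name))
        self._count += 1

    def run(self):
        """Pop events in time order and return the (time, name) sequence."""
        order = []
        while self._heap:
            time, _, name = heapq.heappop(self._heap)
            self.now = time
            order.append((time, name))
        return order


queue = EventQueue()
for name, time in [("G", 7), ("B", 5), ("J", 3), ("S", 3), ("A", 4)]:
    queue.schedule(name, time)

order = queue.run()
print(order)
```

Events come out sorted by time regardless of the order in which they were scheduled, which is why the slide calls it self-sorting rather than FIFO.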
Slide 30 http://www.cs.wisc.edu/gems
Events and Consumers
• Event = Consumer Wakeup
  – Consumer determines event type
  – Consumer changes system state
• Typical event
  – Consumer wakes up to observe its input ports
  – Consumer acts upon the incoming message(s)
    • Change system state
    • Enqueue outgoing messages
  – Consumer pops the incoming message(s)
  – Consumer schedules the consumers of its outgoing message(s)
[Figure: a Consumer with an input port and two output ports, each output port feeding another Consumer]
Slide 31 http://www.cs.wisc.edu/gems
Events and Consumers
• Stalled event
  – Consumer wakes up to observe its input ports
  – Consumer encounters a stall
  – Consumer schedules itself again
    • Doesn’t pop incoming queue
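The difference between a typical wakeup and a stalled wakeup is whether the input port is popped. A minimal Python sketch, assuming a consumer that is stalled until a fixed cycle; ToyConsumer, busy_until, and the scheduling helper are hypothetical names, not Ruby classes.

```python
import heapq
import itertools
from collections import deque


class ToyConsumer:
    """Handles one message per wakeup; stalls while a resource is busy."""

    def __init__(self, busy_until):
        self.input_port = deque()
        self.busy_until = busy_until  # hypothetical stall condition
        self.wakeups = []             # cycles at which we were woken
        self.handled = []             # (cycle, message) pairs

    def wakeup(self, now, schedule):
        self.wakeups.append(now)
        if not self.input_port:
            return
        if now < self.busy_until:
            # Stalled: leave the message on the port and try again later.
            schedule(now + 1, self)
            return
        message = self.input_port.popleft()  # pop only when we can act
        self.handled.append((now, message))
        if self.input_port:
            schedule(now + 1, self)


events = []
ticket = itertools.count()  # tie-breaker so consumers are never compared


def schedule(time, consumer):
    heapq.heappush(events, (time, next(ticket), consumer))


consumer = ToyConsumer(busy_until=3)
consumer.input_port.append("GETX")
schedule(1, consumer)

while events:
    time, _, target = heapq.heappop(events)
    target.wakeup(time, schedule)

print(consumer.wakeups, consumer.handled)
```

The message stays at the head of the port across the two stalled wakeups and is consumed only on the third.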
Slide 33 http://www.cs.wisc.edu/gems
Interconnection Network
• A single flexible infrastructure
  – Point-to-point links and switches: Consumers
  – Both intra-chip and inter-chip networks
• Dynamic network creation
  – Routing tables created at runtime
  – Utilizes input parameters
• Two ways to generate topologies
  1. Auto-generated
    – Intra-chip network: Single on-chip switch
    – Inter-chip network: 4 included (next slide)
  2. Customized
    – TopologyType_FILE_SPECIFIED
    – Adjust individual link latency and bandwidth
    – Specify one link per line
[Figure labels: Link (Throttle.C), Switch (PerfectSwitch.C)]
Slide 34 http://www.cs.wisc.edu/gems
Auto-generated Inter-chip Network Topologies
TopologyType_TORUS_2D
TopologyType_CROSSBAR
TopologyType_HIERARCHICAL_SWITCH
TopologyType_PT_TO_PT
Slide 35 http://www.cs.wisc.edu/gems
Network Characteristics
• Link latency
  1. Auto-generated
    – ON_CHIP_LINK_LATENCY
    – NETWORK_LINK_LATENCY
  2. Customized
    – ‘link_latency:’
• Link bandwidth
  – Bandwidth specified in 1000ths of a byte
  1. Auto-generated
    – On-chip = 10 x g_endpoint_bandwidth
    – Off-chip = g_endpoint_bandwidth
  2. Customized
    – Individual link bandwidth = ‘bw_multiplier:’ x g_endpoint_bandwidth
• Buffer size
  1. Infinite by default
  2. Customized network supports finite buffering
    • Prevent 2D-mesh network deadlock through e-cube restrictive routing
    • ‘link_weight’
• Perfect switch bandwidth
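The bandwidth rules on this slide reduce to simple arithmetic. A minimal Python sketch, assuming the unit is thousandths of a byte as the slide states; the function names and the example g_endpoint_bandwidth value of 1000 are illustrative choices, not GEMS defaults.

```python
ON_CHIP_FACTOR = 10  # from the slide: on-chip = 10 x g_endpoint_bandwidth


def auto_generated_bandwidth(g_endpoint_bandwidth, on_chip):
    """Link bandwidth for an auto-generated topology, in 1000ths of a byte."""
    if on_chip:
        return ON_CHIP_FACTOR * g_endpoint_bandwidth
    return g_endpoint_bandwidth


def customized_bandwidth(g_endpoint_bandwidth, bw_multiplier):
    """Link bandwidth for a file-specified link, in 1000ths of a byte."""
    return bw_multiplier * g_endpoint_bandwidth


def to_bytes(bandwidth_units):
    """Convert from 1000ths of a byte to bytes."""
    return bandwidth_units / 1000


example_endpoint = 1000  # hypothetical parameter value, i.e. 1 byte

on_chip_bytes = to_bytes(auto_generated_bandwidth(example_endpoint, on_chip=True))
off_chip_bytes = to_bytes(auto_generated_bandwidth(example_endpoint, on_chip=False))
custom_bytes = to_bytes(customized_bandwidth(example_endpoint, bw_multiplier=8))

print(on_chip_bytes, off_chip_bytes, custom_bytes)
```

Under the same unit, a stats line reporting bw: 8000 would correspond to an 8-byte-wide link.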
Slide 37 http://www.cs.wisc.edu/gems
Specification Language for Implementing Cache Coherence (SLICC)
• Domain-specific language
  – Designed to specify state machines for cache coherence
  – Syntactically similar to C/C++/Java
  – Constrained to hardware-like structures (e.g., no loops)
  – Generates C++ tightly coupled to Ruby
• Two purposes
  1. Specify system coherence
    – Per-memory-block state machines
    – I.e., cache and memory controller logic
  2. Glue components together
    – Caches with transaction buffers
    – Network ports with controllers
[Figure: SLICC state machine with network in-ports and network out-ports]
Slide 38 http://www.cs.wisc.edu/gems
System Flexibility via SLICC
• Substantial portion of Ruby code generated
  – In combination with dynamic network creation
  – Permits a tremendously flexible simulation infrastructure
• protocols/<protocol_name>.slicc
  – Indicates the SLICC files needed by the protocol
  – Specifies the necessary generated objects
    • Controller state machines
    • Network messages
      – Snooping protocol: request and response messages
      – Directory protocol: requests, forwarded requests, and responses
  – Allocates only C++ objects needed by the particular protocol
    • Ex. Shadow tags for an exclusive two-level cache
    • Ex. Persistent Request Table for Token coherence
Slide 39 http://www.cs.wisc.edu/gems
Inside a SLICC State Machine
• Network buffers
  – Outgoing and incoming ports
• States
  – Base and transient states
• Events
  – Internal events that cause state transitions
• Ruby Structures
  – Caches, transaction buffers… etc.
• Trigger events
  – Incoming messages trigger internal events
• Actions
  – Operations performed on structures
• Transitions
  – Cross-product of possible states and events
  – Performs atomic sequence of actions
[Figure: layout of <controller_name>.sm, top to bottom: network ports, states, events, ruby structures, trigger events, actions, transitions]
Slide 41 http://www.cs.wisc.edu/gems
Creating a protocol with SLICC
• MI-example protocol
  – Simple, SMP directory protocol
  – Cache and directory/memory controller
  – Assume ordered interconnect (for simplicity)
Demo
[Figure: caches ($) and directories (dir) connected through the Ruby interconnect; state diagram with states M and I and edges labeled GETS/GETX and Fwd]
Slide 42 http://www.cs.wisc.edu/gems
MI Cache Controller – States and Events
// STATES
enumeration(State, desc="Cache states") {
  // stable states
  I, desc="Not Present/Invalid";
  M, desc="Modified";
  // transient states
  MI, desc="Modified, issued PUT";
  II, desc="Not Present/Invalid, issued PUT";
  IS, desc="Issued request for IFETCH/GETX";
  IM, desc="Issued request for STORE/ATOMIC";
}
// EVENTS
enumeration(Event, desc="Cache events") {
  // from processor
  Load, desc="Load request from processor";
  Ifetch, desc="Ifetch request from processor";
  Store, desc="Store request from processor";
  Data, desc="Data from network";
  Fwd_GETX, desc="Forward from network";
  Replacement, desc="Replace a block";
  Writeback_Ack, desc="Ack from the directory for a writeback";
  Writeback_Nack, desc="Nack from the directory for a writeback";
}
Demo
Slide 43 http://www.cs.wisc.edu/gems
MI Cache Controller – Network Ports
// NETWORK BUFFERS
MessageBuffer requestFromCache, network="To", virtual_network="0", ordered="true";
MessageBuffer responseFromCache, network="To", virtual_network="1", ordered="true";
MessageBuffer forwardToCache, network="From", virtual_network="2", ordered="true";
MessageBuffer responseToCache, network="From", virtual_network="1", ordered="true";
// NETWORK PORTS
out_port(requestNetwork_out, RequestMsg, requestFromCache);
out_port(responseNetwork_out, ResponseMsg, responseFromCache);
in_port(forwardRequestNetwork_in, RequestMsg, forwardToCache) {
  if (forwardRequestNetwork_in.isReady()) {
    peek(forwardRequestNetwork_in, RequestMsg) {
      if (in_msg.Type == CoherenceRequestType:GETX) {
        trigger(Event:Fwd_GETX, in_msg.Address);
      }
      else if (in_msg.Type == CoherenceRequestType:WB_ACK) {
        trigger(Event:Writeback_Ack, in_msg.Address);
      }
      else {
        error("Unexpected message");
      }
    }
  }
}
Demo
Slide 44 http://www.cs.wisc.edu/gems
MI Cache Controller – Structures
// CacheEntry
structure(Entry, desc="...", interface="AbstractCacheEntry") {
  State CacheState, desc="cache state";
  bool Dirty, desc="Is the data dirty (different than memory)?";
  DataBlock DataBlk, desc="data for the block";
}
external_type(CacheMemory) {
  bool cacheAvail(Address);
  Address cacheProbe(Address);
  void allocate(Address);
  void deallocate(Address);
  Entry lookup(Address);
  void changePermission(Address, AccessPermission);
  bool isTagPresent(Address);
}
CacheMemory cacheMemory, template_hack="<L1Cache_Entry>", constructor_hack='L1_CACHE_NUM_SETS_BITS, L1_CACHE_ASSOC, MachineType_L1Cache, int_to_string(i)+"_L1"', abstract_chip_ptr="true";
Demo
Slide 45 http://www.cs.wisc.edu/gems
MI Cache Controller – “Mandatory Queue”
// Mandatory Queue
in_port(mandatoryQueue_in, CacheMsg, mandatoryQueue, desc="...") {
  if (mandatoryQueue_in.isReady()) {
    peek(mandatoryQueue_in, CacheMsg) {
      if (cacheMemory.isTagPresent(in_msg.Address) == false &&
          cacheMemory.cacheAvail(in_msg.Address) == false ) {
        // make room for the block
        trigger(Event:Replacement, cacheMemory.cacheProbe(in_msg.Address));
      }
      else {
        trigger(mandatory_request_type_to_event(in_msg.Type), in_msg.Address);
      }
    }
  }
}
Demo
Slide 46 http://www.cs.wisc.edu/gems
MI Cache Controller – Transitions
transition(I, Store, IM) {
  v_allocateTBE;
  i_allocateL1CacheBlock;
  a_issueRequest;
  m_popMandatoryQueue;
}
transition(IM, Data, M) {
  u_writeDataToCache;
  s_store_hit;
  w_deallocateTBE;
  n_popResponseQueue;
}
transition(M, Fwd_GETX, I) {
  e_sendData;
  o_popForwardedRequestQueue;
}
transition(M, Replacement, MI) {
  v_allocateTBE;
  b_issuePUT;
  x_copyDataFromCacheToTBE;
  h_deallocateL1CacheBlock;
}
Atomic sequence of actions
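A transition is a lookup on a (state, event) pair that yields a next state and an ordered list of actions. A minimal Python sketch of that table, using only the four transitions shown on this slide; the step helper is hypothetical, the actions are recorded as names rather than executed, and this is not the C++ that SLICC generates.

```python
# (current state, event) -> (next state, actions run atomically, in order)
TRANSITIONS = {
    ("I", "Store"): (
        "IM",
        ["v_allocateTBE", "i_allocateL1CacheBlock",
         "a_issueRequest", "m_popMandatoryQueue"],
    ),
    ("IM", "Data"): (
        "M",
        ["u_writeDataToCache", "s_store_hit",
         "w_deallocateTBE", "n_popResponseQueue"],
    ),
    ("M", "Fwd_GETX"): (
        "I",
        ["e_sendData", "o_popForwardedRequestQueue"],
    ),
    ("M", "Replacement"): (
        "MI",
        ["v_allocateTBE", "b_issuePUT",
         "x_copyDataFromCacheToTBE", "h_deallocateL1CacheBlock"],
    ),
}


def step(state, event, action_log):
    """Apply one transition; an undefined (state, event) pair is an error."""
    if (state, event) not in TRANSITIONS:
        raise KeyError(f"no transition for state {state} on event {event}")
    next_state, actions = TRANSITIONS[(state, event)]
    action_log.extend(actions)
    return next_state


log = []
state = "I"
visited = [state]
for event in ["Store", "Data", "Fwd_GETX"]:
    state = step(state, event, log)
    visited.append(state)

print(visited)
print(log)
```

A store to an invalid block walks I, IM, M, and a forwarded GETX from another cache then returns the block to I.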
Demo
Slide 47 http://www.cs.wisc.edu/gems
MI Cache Controller – Actions
action(a_issueRequest, "a", desc="Issue a request") {
  enqueue(requestNetwork_out, RequestMsg, latency="ISSUE_LATENCY") {
    out_msg.Address := address;
    out_msg.Type := CoherenceRequestType:GETX;
    out_msg.Requestor := machineID;
    out_msg.Destination.add(map_Address_to_Directory(address));
    out_msg.MessageSize := MessageSizeType:Control;
  }
}
action(e_sendData, "e", desc="Send data from cache to requestor") {
  peek(forwardRequestNetwork_in, RequestMsg) {
    enqueue(responseNetwork_out, ResponseMsg, latency="CACHE_RESPONSE_LATENCY") {
      out_msg.Address := address;
      out_msg.Type := CoherenceResponseType:DATA;
      out_msg.Sender := machineID;
      out_msg.Destination.add(in_msg.Requestor);
      out_msg.DataBlk := cacheMemory[address].DataBlk;
      out_msg.MessageSize := MessageSizeType:Response_Data;
    }
  }
}
Demo
Slide 48 http://www.cs.wisc.edu/gems
SLICC-generated HTML tables
Demo
• http://www.cs.wisc.edu/gems/MI_example_html/
Slide 49 http://www.cs.wisc.edu/gems
Testing MI_example
Demo
Build Protocol
    cd $GEMS_ROOT/ruby
    make PROTOCOL=MI_example
Random test
  – Stresses protocol with simultaneous false-sharing requests
  – 16 processors (-p), 10000 requests (-l)
    ./amd64_linux/generated/MI_example/bin/tester.exec -p 16 -l 10000
Deterministic test with transition trace
  – Use a trace, requests handled one at a time
  – Input trace (-z), compressed or non-compressed
  – Transition debug (-s) starting at cycle 1
    ./amd64_linux/generated/MI_example/bin/tester.exec -p 16 -z ruby.trace.gz -s 1
Slide 50 http://www.cs.wisc.edu/gems
Outline
• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
  – Overview
  – Pipeline
  – Example: Load instruction
  – Additional Tidbits
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
• Building Workloads
Slide 51 http://www.cs.wisc.edu/gems
Overview
• What is OPAL?
  – Out-of-Order SPARC processor simulator
    • (modeled after MIPS R10K)
  – Uses Timing-First design
  – Realized as a Simics module, like RUBY
  – Does NOT use Simics’ MAI interface
• Goal of this section
  – Starting point for hacking Opal
• Learning approaches
  – Code review / summarization (using Control Flow Graphs)
  – Example: a load instruction
  – Analogies to SimpleScalar…pay attention to the differences
Slide 53 http://www.cs.wisc.edu/gems
Preview: OPAL & Simics
• Use Opal’s opal0.sim-step command
[Figure: numbered interaction among Simics (P0, Phy_mem), Opal (fetch, decode, schedule/execute, retire, check), and Ruby (IFETCH hit, LOAD hit) while stepping one instruction]
Slide 54 http://www.cs.wisc.edu/gems
Timing-First Simulation [Mauer Sigmetrics 02]
• Timing Simulator (Opal)
  – functional execution of user/supervisor operations
  – speculative, OoO multiprocessor timing simulation
  – does NOT implement full ISA or any devices
• Functional Simulator (Simics)
  – full-system multiprocessor simulation
  – does NOT model detailed micro-architectural timing
KEY: Reload state if Opal state != Simics state
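The timing-first rule (reload state whenever the timing simulator disagrees with the functional simulator) can be shown with a one-register toy machine. A minimal Python sketch, assuming a made-up two-operation ISA in which the timing model deliberately lacks "mul"; none of these function names come from Opal or Simics.

```python
def functional_step(reg, op, operand):
    """Stand-in for the functional simulator: implements every operation."""
    if op == "add":
        return reg + operand
    if op == "mul":
        return reg * operand
    raise ValueError(f"unknown op {op}")


def timing_step(reg, op, operand):
    """Stand-in for the timing simulator: 'mul' is unimplemented here."""
    if op == "add":
        return reg + operand
    return reg  # wrong on purpose; the check below repairs it


def run_timing_first(program):
    """Execute in both models, reloading timing state on any mismatch."""
    functional_reg = 0
    timing_reg = 0
    deviations = 0
    for op, operand in program:
        timing_reg = timing_step(timing_reg, op, operand)
        functional_reg = functional_step(functional_reg, op, operand)
        if timing_reg != functional_reg:
            timing_reg = functional_reg  # reload state from the functional side
            deviations += 1
    return timing_reg, deviations


final_value, deviations = run_timing_first([("add", 1), ("mul", 3), ("add", 2)])
print(final_value, deviations)
```

The timing model stays functionally correct even with an incomplete ISA, at the cost of one counted deviation per unimplemented instruction.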
Slide 55 http://www.cs.wisc.edu/gems
Measured Deviations
• Less than 20 deviations per 100,000 instructions (0.02%)
• Worst case performance error: 2.4% (assuming deviation latency is pipeline flush)
Slide 56 http://www.cs.wisc.edu/gems
Opal and UltraSparc
• Functionally simulates 103 of the 183 UltraSparc ISA instructions (99.99% of all dynamic instrs in workloads)
• Sample of unimplemented instrs:
  – ARRAY, FEXPAND, FPADD, RDSOFTINT
  – EDGE, FMUL8x16, FPMERGE, RDSTICK
  – SHUTDOWN, SIAM, SIR, WRSOFTINT, WRSTICK
• Does not functionally simulate devices or any I/O instructions
  – SCSI controllers and disks
  – PCI and SBUS interfaces
  – interrupt and DMA controllers
  – temperature sensors
Correctness type   % error
Functional         0
Performance        2.4 (worst case)
Slide 57 http://www.cs.wisc.edu/gems
Simulation Control (system.[C h])
system_t::simulate(int instrs)
[Flowchart]
1. Disable all Simics procs
2. Simulated enough instrs? (For MP sims: P0’s instrs counted here)
   – Yes: return
   – No: continue
3. Forall seq->advanceCycle() (pipeline is modeled here)
4. ruby->advanceTime()
5. global_cycle++, then back to step 2
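The control loop on this slide can be sketched in Python. This is a toy under stated assumptions: ToySequencer retires one instruction every fixed number of cycles, ToyRuby only counts time, and the loop stops on processor 0's retired count as the slide notes for multiprocessor runs. None of these classes are the real system_t, pseq_t, or Ruby.

```python
class ToySequencer:
    """Retires one instruction every `retire_every` cycles (an assumption)."""

    def __init__(self, retire_every):
        self.retire_every = retire_every
        self.cycle = 0
        self.retired = 0

    def advance_cycle(self):
        self.cycle += 1
        if self.cycle % self.retire_every == 0:
            self.retired += 1


class ToyRuby:
    """Memory system stand-in that only tracks its own time."""

    def __init__(self):
        self.time = 0

    def advance_time(self):
        self.time += 1


def simulate(sequencers, ruby, instrs):
    """Advance every sequencer, then Ruby, until P0 has retired enough."""
    global_cycle = 0
    while sequencers[0].retired < instrs:  # P0's instructions are counted
        for seq in sequencers:
            seq.advance_cycle()
        ruby.advance_time()
        global_cycle += 1
    return global_cycle


processors = [ToySequencer(retire_every=2), ToySequencer(retire_every=3)]
ruby = ToyRuby()
cycles = simulate(processors, ruby, instrs=5)
print(cycles, processors[0].retired, processors[1].retired, ruby.time)
```

Because only P0 is counted, the other processor may have retired a different number of instructions when the loop returns.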
Slide 59 http://www.cs.wisc.edu/gems
What’s done in a cycle?
• SimpleScalar uses a reverse order, why?
pseq::advanceCycle()
  FetchInstructions()
  DecodeInstructions()
  ScheduleInstructions()
  RetireInstructions()
  Scheduler->execute()
  return
Uses separate queues (finitecycle.h) to record how many instructions are available for each stage.
The order is in fact not important here.
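One way to see why stage order need not matter: if every inter-stage queue imposes at least one cycle of delay, nothing produced in a cycle can be consumed in that same cycle. A minimal Python sketch under that assumption; DelayQueue is a hypothetical stand-in and is not the contents of finitecycle.h, and the three-stage pipeline is simplified from the five calls on the slide.

```python
class DelayQueue:
    """Items inserted at cycle c become visible at cycle c + delay."""

    def __init__(self, delay):
        if delay < 1:
            raise ValueError("delay must be at least one cycle")
        self.delay = delay
        self.items = []  # (ready_cycle, instruction id)

    def insert(self, instr, cycle):
        self.items.append((cycle + self.delay, instr))

    def pop_ready(self, cycle):
        ready = [instr for ready_at, instr in self.items if ready_at <= cycle]
        self.items = [(r, i) for r, i in self.items if r > cycle]
        return ready


def run_pipeline(stage_order, num_instrs, num_cycles):
    """Run fetch/decode/retire in the given per-cycle order."""
    fetch_to_decode = DelayQueue(1)
    decode_to_retire = DelayQueue(1)
    retired_at = {}
    fetched = 0

    def fetch(cycle):
        nonlocal fetched
        if fetched < num_instrs:
            fetch_to_decode.insert(fetched, cycle)
            fetched += 1

    def decode(cycle):
        for instr in fetch_to_decode.pop_ready(cycle):
            decode_to_retire.insert(instr, cycle)

    def retire(cycle):
        for instr in decode_to_retire.pop_ready(cycle):
            retired_at[instr] = cycle

    stages = {"fetch": fetch, "decode": decode, "retire": retire}
    for cycle in range(num_cycles):
        for name in stage_order:
            stages[name](cycle)
    return retired_at


forward = run_pipeline(["fetch", "decode", "retire"], 3, 10)
reverse = run_pipeline(["retire", "decode", "fetch"], 3, 10)
print(forward, reverse)
```

A simulator that passes values between stages within the same cycle would instead have to evaluate stages in reverse so that a stage never sees work produced earlier in that cycle.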
Slide 60 http://www.cs.wisc.edu/gems
Pipeline Model (pseq.[C h])
• Instructions stored/tracked in a RUU-like structure (iwindow.[C h])
• Flexible multi-stage pipeline
  – Delay modeled with separate queues (finitecycle.h)
• Models fully-pipelined FUs
  – Types: CONFIG_ALU_MAPPING
  – Number: CONFIG_NUM_ALUS
[Figure: pipeline of fetch stages (F; FETCH_STAGES deep, MAX_FETCH wide), decode stages (D; DECODE_STAGES deep, MAX_DECODE wide), scheduler (Sched; MAX_DISPATCH), functional units (FU0, FU1; depth determined by CONFIG_ALU_LATENCY, MAX_EXECUTE wide), and retire stages (R; RETIRE_STAGES deep, MAX_RETIRE wide)]
Slide 61 http://www.cs.wisc.edu/gems
Instructions ({dynamic,statici,memop,controlop}.[C h])
[Figure: instruction class hierarchy. Dynamic instruction types: ALU (dynamic.[C h]), Memory (memop.[C h]), Control (controlop.[C h]), built on the decoded instr (statici.[C h]). Fields shown: Seq #, Wait List ptr, Event Times, Registers, Traps; Virtual/Phys Addr, LSQ index; Predicted Addr, Actual Addr, Taken/Not Taken.]
Slide 63 http://www.cs.wisc.edu/gems
Fetch
[Flowchart]
– Is fetch ready? No: stall fetch
– Address translation; I-TLB miss? Yes: emit NOP / stall fetch
– Read instruction: pseq::getInstr()
– Invoke Ruby to simulate Ifetch timing
– Create dynamic instr (load_inst_t)
Slide 64 http://www.cs.wisc.edu/gems
Decode
[Flowchart]
– Get load instr from instr window
– dynamic_inst_t::decode()
– Insert decoded load inst in decode queue
– Get current source operand mappings: arf::readDecodeMap() (regmap.[C h], arf.[C h])
– Rename dest reg: arf::allocateRegister() (regmap.[C h], arf.[C h])
Slide 65 http://www.cs.wisc.edu/gems
Schedule
[Flowchart]
– Get load instr from instr window
– Exceeded scheduling window? Yes: stop scheduling
– TestSourceReadiness()
  • Source not ready: WAIT_XX_STAGE; wakeup when source is ready
– All sources ready? Yes: Scheduler->schedule()
Slide 66 http://www.cs.wisc.edu/gems
Execute
[Flowchart]
– Read port avail? No: reschedule
– D-TLB address translate (memory_inst_t::addresstranslate())
– TLB miss? Yes: raise TLB miss exception
– Invoke Ruby to simulate load timing (rubycache_t::access())
– Cache miss? Yes: CACHE_MISS_STAGE
– Read value from Simics memory (pseq->readPhysicalMemory())
– pseq->complete()
Slide 67 http://www.cs.wisc.edu/gems
Retire
[Flowchart]
– Get completed LD inst
– Traps? Yes: takeTrap() (set trap state, squash pipeline)
– Step Simics (pseq->advanceSimics())
– checkCriticalstate(): (PC, NPC, regs)
– checkChangedState() (verify load value; memory consistency)
– FAIL on either check: FullSquash() (reload state & refetch from instr following LD)
– Match on both checks: retire LD
Slide 69 http://www.cs.wisc.edu/gems
Opal-Ruby Interface
[Figure: eight numbered steps of a load (LD) crossing the asynchronous Opal-Ruby interface]
OPAL side: load_inst_t::Execute(), Complete(); rubycache_t: access(), complete(); pseq_t: completedRequest(); system_t: rubyCompletedRequest()
RUBY side: OpalInterface: isReady(), makeRequest(), hitCallback()
Slide 70 http://www.cs.wisc.edu/gems
Branch Prediction
pseq_t::createInstruction { … s_instr->nextPC() … }
dynamic_inst_t::nextPC_call(), nextPC_predicated_branch(), nextPC_predict_branch(), nextPC_indirect()
  – calls Predict()
Controlop_t::Execute() { (check prediction and flush if mispredict) }
Retire() { … Bpred->Update() … }
  – calls Update()
Branch predictor (fetch/{yags.[C h], …}): Predict(), Update()
Slide 71 http://www.cs.wisc.edu/gems
Common Config Parameters
Processor Width: MAX_FETCH, MAX_DECODE, MAX_DISPATCH, MAX_EXECUTE, MAX_RETIRE
Pipeline Stages: FETCH_STAGES, DECODE_STAGES, RETIRE_STAGES
Register File Sizes: CONFIG_IREG_PHYSICAL (int), CONFIG_FPREG_PHYSICAL (fp), CONFIG_CCREG_PHYSICAL (cond code)
ROB Size: IWINDOW_ROB_SIZE
Scheduling Window Size: IWINDOW_WIN_SIZE
Slide 72 http://www.cs.wisc.edu/gems
Opal : Present and Future
• Implements Sparc instructions
  – Simulating additional Sparc instructions: easy task
  – Porting to x86: substantial code rewrite
• Simulates timing of weaker memory consistency models
  – Add SC checks in Opal
  – Add write buffers for weaker models (like TSO)
• No functional simulation of I/O
  – Plug in disk simulator that interacts with Opal
• Not currently using MAI interface
  – Possible to replace Opal w/ MAI module that interacts with Ruby
• Aggressive micro-architectural techniques not modeled
  – Add support for trace caches, mem. dependence pred., etc.
Slide 73 http://www.cs.wisc.edu/gems
Outline
• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
• Demo: Two gems are better than one
  – Breakdown network stats
  – Example: Network contention with and without Opal
  – Simulation runtimes
• GEMS Source Code Tour and Extending Ruby
• Building Workloads
Slide 74 http://www.cs.wisc.edu/gems
Breaking Down Ruby Stats Files
• Ruby system config print
  – Values of all ruby config parameters
• Overall runtime
  – Target and host machine runtimes, IPC, etc.
• Cache profiling: L1I, L1D, L2…etc.
• Structure occupancy
  – Demand for cache ports, transaction buffers
• Latency breakdown
• Request vs. system state (optional)
• Message delay cycles (optional)
• Network stats
  – Link and switch utilization
• CC event / transition counts
Demo
[Figure: layout of <system_config>.stats, top to bottom: Ruby config, overall runtime, cache profiling, structure occupancy, latency breakdown, request vs. system state, message delay cycles, network stats, event / transition counts]
Slide 75 http://www.cs.wisc.edu/gems
Two GEMS are Better than One
• Network behavior with and without Opal
• 8 processor CMP
• SPLASH benchmark: ocean
• 8 byte-wide links between CPUs & L2 cache banks
• Two runs using a customized network
  1. Ruby only
    • Allows only one request per processor
    • Maximum 8 outstanding requests
    • Low network utilization
    • Little network contention
  2. Ruby & Opal
    • Allows multiple outstanding requests
    • Maximum 128 outstanding requests
    • Higher network utilization
    • Noticeable network contention
Demo
Slide 76 http://www.cs.wisc.edu/gems
Two GEMS are Better than One
Ruby Only
Demo
Message Delayed Cycles
----------------------
Total_delay_cycles: [binsize: 16 max: 553 count: 22892759
average: 0.534205 | standard deviation: 4.18656 | 22855760 20077 1945 325 309 175 105 3935 7681 518 338 254 397 273 166 130 142 33 41 25 26 29 15 10 10 2 0 2 0 4 4 10 7 6 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
Network Stats
-------------
links_utilized_percent_switch_0_link_3: 4.38966 bw: 8000 base_latency: 1
links_utilized_percent_switch_0_link_4: 4.36838 bw: 8000 base_latency: 1
Ruby_cycles: 41361869
Slide 77 http://www.cs.wisc.edu/gems
Two GEMS are Better than One
Ruby & Opal
Demo
Message Delayed Cycles
----------------------
Total_delay_cycles: [binsize: 16 max: 703 count: 22893122
average: 1.35992 (0.534205) | standard deviation: 6.55126 | 22608266 220366 29575 9084 4686 3248 2009 1687 6018 1798 1143 828 625 516 384 272 271 288 398 319 299 228 203 161 92 51 41 26 12 9 30 39 48 43 25 20 3 0 0 1 0 2 4 4 0 0 0 0 0 0 ]
Network Stats
-------------
links_utilized_percent_switch_0_link_3: 7.81863 (4.38966) bw: 8000 base_latency: 1
links_utilized_percent_switch_0_link_4: 7.64388 (4.36838) bw: 8000 base_latency: 1
Ruby_cycles: 72550169 (41361869)
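The two stats dumps are easier to compare as ratios. A minimal Python sketch that copies the averages, link utilizations, and Ruby_cycles values from the two slides and divides the Ruby & Opal figure by the Ruby-only figure; the dictionary keys are names chosen here, not fields of the stats file.

```python
# Values copied from the "Ruby Only" and "Ruby & Opal" slides.
ruby_only = {
    "avg_delay_cycles": 0.534205,
    "link_3_utilization_percent": 4.38966,
    "link_4_utilization_percent": 4.36838,
    "ruby_cycles": 41361869,
}
ruby_and_opal = {
    "avg_delay_cycles": 1.35992,
    "link_3_utilization_percent": 7.81863,
    "link_4_utilization_percent": 7.64388,
    "ruby_cycles": 72550169,
}

# Ratio of each metric: Ruby & Opal relative to Ruby only.
ratios = {key: ruby_and_opal[key] / ruby_only[key] for key in ruby_only}

for key, value in ratios.items():
    print(f"{key}: {value:.2f}x")
```

Average message delay grows by roughly two and a half times, while link utilization rises by somewhat less than a factor of two, which matches the slide's point that contention becomes noticeable once Opal allows multiple outstanding requests.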
Slide 78 http://www.cs.wisc.edu/gems
Simulation Time Comparison
• Comparisons of Runtimes
  – Progressively add more simulation fidelity
    • Simics only
    • Simics + Ruby
    • Simics + Ruby + Opal
  – Accuracy vs. simulation time tradeoff
• Target Machine
  – 8 UltraSPARC™ III processor SMP (1 GHz)
  – 4 GB of memory
• Host Machine
  – AMD Opteron™ uniprocessor (2.2 GHz)
  – 4 GB of memory
Slide 79 http://www.cs.wisc.edu/gems
Simulation Slowdown
Workload: 2000 JBB transactions

Configuration          Time         Slowdown    Slowdown / CPU
Target                 20 ms        1           1
Simics                 1 minute     3000 x      380 x
Simics + Ruby          15 minutes   45000 x     5600 x
Simics + Ruby + Opal   45 minutes   140000 x    17000 x
CAVEAT: These performance numbers may not reflect the optimal configuration of Virtutech Simics. For example, running Simics in “fast mode” (or emulation-only mode) can reduce the per-CPU slowdown of Simics, compared to real hardware, to less than 10x.
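The slowdown columns follow from the measured times. A minimal sketch of that arithmetic, with hypothetical function names and assuming the per-CPU figure is simply the total slowdown divided by the 8 target processors:

```cpp
#include <cassert>

// Slowdown = host wall-clock time / target wall-clock time.
double slowdown(double host_seconds, double target_seconds) {
    assert(target_seconds > 0.0);
    return host_seconds / target_seconds;
}

// Per-CPU slowdown (assumption: total slowdown divided evenly
// across the simulated target processors).
double slowdownPerCpu(double host_seconds, double target_seconds,
                      int num_target_cpus) {
    assert(num_target_cpus > 0);
    return slowdown(host_seconds, target_seconds) / num_target_cpus;
}
```

For example, slowdown(60.0, 0.020) is about 3000 and slowdownPerCpu(60.0, 0.020, 8) is about 375. The 45-minute run works out to about 135000 and 16875, so the figures in the table appear to be rounded to two significant digits.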
Slide 80 http://www.cs.wisc.edu/gems
Outline
• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
  – GEMS software structure
  – Directory Tour
  – Demo: Extending Ruby and a CMP Protocol
• Building Workloads
Slide 81 http://www.cs.wisc.edu/gems
GEMS Software Structure
System
  Driver                     common/Driver.h
    Internal Drivers
      Deterministic Tester   tester/DeterministicDriver.h
      Contended Locks        tester/SyntheticDriver.h
      Random Tester          tester/Tester.h
    Simics Interface         simics/SimicsInterface.h
    Opal Interface           interface/OpalInterface.h
  Chip                       generated/<protocol>/Chip.h
  Profiler                   profiler/Profiler.h
  Network                    network/simple
  Topology                   network/simple/Topology.h
Legend (figure): Multiple Instantiations / One Instantiation
Slide 82 http://www.cs.wisc.edu/gems
Ruby Software Structure
Chip                     generated/<protocol>/Chip.h
  Sequencer              system/Sequencer.h
  Caches                 system/CacheMemory.h
    Cache Line           generated/<protocol>/L1Cache_Entry.h
  Cache Controllers      generated/<protocol>/L1Cache_Controller.h
                         generated/<protocol>/L2Cache_Controller.h
  Directory              system/DirectoryMemory.h
    Directory State      generated/<protocol>/Directory_Entry.h
  Directory Controller   generated/<protocol>/Directory_Controller.h
  Network Ports          buffer/MessageBuffer.h
Legend (figure): SLICC / Ruby
Slide 83 http://www.cs.wisc.edu/gems
Outline
• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
  – GEMS software structure
  – Directory Tour
  – Demo: Extending Ruby and a CMP Protocol
• Building Workloads
Slide 84 http://www.cs.wisc.edu/gems
Map of Directories: Top-Level
Top-Level Directory
ruby              Memory System Components
opal              Processor Components
slicc             Generator Code
protocols         Protocol Specification Files
common            Common GEMS C++ code
gen-scripts       Generated Simics Interface Scripts
scripts           Common GEMS scripts
microbenchmarks   Separate Microbenchmark Executables
results           Simulation Output
LICENSE, KNOWN_ISSUES, README
Slide 85 http://www.cs.wisc.edu/gems
Map of Directories: ruby
ruby
buffers            MessageBuffer between consumers
common             Common Ruby C++ structs
config             Ruby config files for module and tester
eventqueue         Global event queue
interfaces         Ruby → Opal & Simics
module             The ruby simics module
network            Simple network code
profiler           Profiling code
recorder           Cache and trace recorders
simics             Simics → Ruby
slicc_interface    Abstract classes that interface with different protocols
system             Physical memory components
tester             Random tester & ubenchmarks
platform           Object files & executables
generated          SLICC generated C++ files
html               Protocol tables
ruby.trace.gz      Example trace file
init.h/.C          Ruby initializer & destroyer
README.debugging   Ruby debug flag info
Makefile
Slide 86 http://www.cs.wisc.edu/gems
Map of Directories: ruby/system
ruby/system
CacheMemory.h              cache template data structure
DirectoryMemory.h/C        memory data structure
MachineID.h                object that uniquely identifies all ruby machines
NodeID.h                   object that identifies a unique chip or machine instantiation
NodePersistentTable.h/C    specific to token protocol
PerfectCacheMemory.h       a fully associative, unbounded cache memory template
PersistentArbiter.h/C      specific to token protocol
PersistentTable.h/C        specific to token protocol
Sequencer.h/C              manages memory requests between the driver and L1 cache controller
StoreBuffer.h/C            used to simulate TSO-like timing
StoreCache.h/C             used to simulate TSO-like timing
System.h/C                 top-level object of the ruby memory system; all ruby objects can be accessed via the g_system_ptr
TBETable.h/C               transaction buffer entry table used by cache controllers for transient requests
TimerTable.h               specific to token protocol
Slide 87 http://www.cs.wisc.edu/gems
Map of Directories: ruby/slicc_interface
ruby/slicc_interface
AbstractCacheEntry.h/C             ruby abstract class for the protocol-specific cache entries
AbstractChip.h/C                   ruby abstract class for the protocol-specific chip object
AbstractProtocol.h/C               contains booleans to define protocol characteristics to ruby
Message.h                          parent class of all messages communicated between consumers via MessageBuffers
NetworkMessage.h                   parent class of all network messages; each protocol implements unique network message objects to communicate between controllers
RubySlicc_ComponentMapping.h       all address manipulation to determine location and set mapping is here
RubySlicc_Profiler_interface.h/C   interface between the generated protocol logic and the ruby profiler code
RubySlicc_Util.h                   miscellaneous ruby functions used by the generated controllers
RubySlicc_includes.h               wrapper for the RubySlicc interface files
Slide 88 http://www.cs.wisc.edu/gems
Map of Directories: slicc
slicc
ast              Abstract Syntax Tree code
doc              contains some old but useful documentation
parser           contains the lexer and parser that construct a protocol’s AST
symbols          contains SLICC objects created during the first pass of the AST; majority of code generated by these symbols
generator        file, html and MIF generator code
platform         Object files & executables
generated        generated lexer and parser files
main.h/C         main function of the SLICC executable
slicc_global.h   defines typedef, namespaces, etc.
Makefile         Makefile for the SLICC code generator executable
README           Summary of how SLICC works
Slide 89 http://www.cs.wisc.edu/gems
Map of Directories: opal
opal
benchmark                   Micro-architecture benchmarks
bypassing                   Misc. proc structs
common                      Global Opal structs
config                      Module and tester config files
design                      Helpful informal design docs
fetch                       Predictors (branch, Trap, RAS)
module                      Code for Opal module
python                      Misc test and graphing scripts
regression                  Golden results for tester
sparc                       Implementation-specific defines
system                      Pipeline model
tester                      Opal tester files
trace                       Files for branch, memory traces
platform                    Object files & executables
generated                   Files for parsing config params
TODO                        Todo wish list
README                      Describes building & running Opal
README.memory_consistency   Opal handling of mem. consistency
Makefile
Slide 90 http://www.cs.wisc.edu/gems
Map of Directories: opal/system (1)
opal/system
actor.[C h]       General micro-arch. structure class
arf.[C h]         Register file interface
cache.[C h]       Opal’s built-in cache structures
chain.[C h]       Used to analyze mem dependencies
checkresult.h     Structs used in validation w/ Simics
config.include    Type defines for config params
controlop.[C h]   Branch instr type class
decode.[C h]      Per opcode stats collector class
dtlb.[C h]        TLB implementation for stand-alone sims
dx.[C h i]        Code for execution of dynamic instrs
dynamic.[C h]     Top-level class for all dynamic instrs
flatarf.[C h]     Non-renamed register file interface
flow.[C h]        CFG class
hfa.C             Opal-Simics interface
hfa_init.h        Opal-Simics interface externs
histogram.[C h]   Histogram stats class
Slide 91 http://www.cs.wisc.edu/gems
Map of Directories: opal/system (2)
opal/system
ipage.[C h]       Instruction page class
ipagemap.[C h]    Instruction page cache class
iwindow.[C h]     RUU-like struct for storing/tracking instrs
ix.[C h]          Code to execute CFG instructions
lockstat.[C h]    Stats on locks in system
lsq.[C h]         LSQ structure
memop.[C h]       Memory instr class
memstat.[C h]     Memory addr stats class
mf_api.h          Symlink to Opal-Ruby interface
mshr.[C h]        MSHR structure (used in Opal cache hierarchy only)
pipepool.[C h]    Wait-list object. Used to model MSHR when running w/ Ruby
pipestate.[C h]   Single waiter object for pipepool
pseq.[C h]        Top-level proc sequencer
pstate.[C h]      Functions used for API calls to Simics
ptrace.[C h]      Used for analyzing memory traces
regbox.[C h]      Contains interface ptrs to registers
Slide 92 http://www.cs.wisc.edu/gems
Map of Directories: opal/system (3)
opal/system
regfile.[C h]      Models the register file itself
regmap.[C h]       Rename map structure
rubycache.[C h]    Handles all Opal-Ruby memory transactions
scheduler.[C h]    Global event queue
simdist12.C        Dummy Simics functions for tester
sparx.C            Several includes
sstat.[C h]        Stats class for static insts
statici.[C h]      Decoded instr class
stopwatch.[C h]    Timer class, used to collect time stats
sysstat.[C h]      Stats class for dynamic insts
system.[C h]       Top-level class for manipulating sim
threadstat.[C h]   Stats class for tracking per-thread stats
wait.[C h]         Wait-list object for dynamic insts
Slide 93 http://www.cs.wisc.edu/gems
Outline
• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
  – GEMS software structure
  – Directory Tour
  – Demo: Extending Ruby and a CMP Protocol
• Building Workloads
Slide 94 http://www.cs.wisc.edu/gems
Extending Ruby
• Goal:
  – Add new functionality to Ruby and interface to SLICC
• DemoPrefetcher
  – Simple, L2->memory next-line prefetcher
  – Module implemented as C++ object (DemoPrefetcher.C)
  – New type added to SLICC
  – Observes L1 GETS requests via function call
  – Triggers event for prefetch in next cycle
    • Object is connected to an in_port
  – Not the only way (or the right way) of implementing a prefetcher
Slide 95 http://www.cs.wisc.edu/gems
Implementing DemoPrefetcher
• Creating an object that can “wakeup” a controller
DemoPrefetcher.h
class DemoPrefetcher {
public:
  // An object in a SLICC controller will be passed a Chip*
  DemoPrefetcher(Chip* chip_ptr);

  // Allow an in_port to be attached
  void setConsumer(Consumer* consumer_ptr) { m_consumer_ptr = consumer_ptr; }

  // When wakeup() is called, ensure it should do something
  bool isReady() const;

  // functions to implement simple next-line prefetching
  const Address& popNextPrefetch();
  const Address& peekNextPrefetch() const;
  void cancelNextPrefetch();
  void observeL1Request(const Address& address);

  // (remainder of the class is not shown on the slide)
};
Slide 96 http://www.cs.wisc.edu/gems
Implementing DemoPrefetcher
DemoPrefetcher.C
void DemoPrefetcher::observeL1Request(const Address& address)
{
  // next-line prefetch address
  Address prefetch_addr = address;
  prefetch_addr.makeNextStrideAddress(1);

  // add to prefetch queue
  m_prefetch_queue.push( prefetch_addr );

  // when to wake up: choose 1 cycle later
  Time ready_time = g_eventQueue_ptr->getTime() + 1;

  // schedule a wakeup() so that the L2 controller can trigger
  g_eventQueue_ptr->scheduleEventAbsolute(m_consumer_ptr, ready_time);
}
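To experiment with the queueing and wakeup idea outside of Ruby, a self-contained sketch can help. Everything here is a hypothetical stand-in: Address and Time are plain integers, the 64-byte block size is assumed, the current time is passed in explicitly instead of coming from g_eventQueue_ptr, and the block alignment only approximates what makeNextStrideAddress(1) might do.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>

// Hypothetical stand-ins; none of these are the real Ruby types.
using Address = std::uint64_t;
using Time = std::uint64_t;
constexpr Address kBlockBytes = 64;  // assumed cache-block size

class SketchPrefetcher {
public:
    // Queue the block after the one containing `address` and record
    // that the consumer should be woken one cycle after `now`.
    void observeL1Request(Address address, Time now) {
        Address block = address - (address % kBlockBytes);
        m_queue.push(block + kBlockBytes);
        m_wakeups.push(now + 1);
    }

    // True once the oldest pending prefetch has reached its wakeup time.
    bool isReady(Time now) const {
        return !m_queue.empty() && m_wakeups.front() <= now;
    }

    Address peekNextPrefetch() const {
        assert(!m_queue.empty());
        return m_queue.front();
    }

    // Return the oldest pending prefetch by value and remove it.
    Address popNextPrefetch() {
        Address next = peekNextPrefetch();
        cancelNextPrefetch();
        return next;
    }

    // Drop the oldest pending prefetch without issuing it.
    void cancelNextPrefetch() {
        assert(!m_queue.empty());
        m_queue.pop();
        m_wakeups.pop();
    }

private:
    std::queue<Address> m_queue;  // pending prefetch addresses, oldest first
    std::queue<Time> m_wakeups;   // wakeup time of each pending prefetch
};
```

The sketch returns addresses by value from popNextPrefetch() so that the caller never holds a reference to an element that has already been removed from the queue.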
Slide 97 http://www.cs.wisc.edu/gems
Interfacing DemoPrefetcher to SLICC
external_type(DemoPrefetcher, inport="yes") {
  bool isReady();
  Address popNextPrefetch();
  void cancelNextPrefetch();
  Address peekNextPrefetch();
  void observeL1Request(Address);
}

DemoPrefetcher prefetcher;

// wakeup logic
in_port(prefetcher_in, Null, prefetcher) {
  if (prefetcher_in.isReady()) {
    if (L2cacheMemory.cacheAvail(prefetcher.peekNextPrefetch()) ||
        L2cacheMemory.isTagPresent(prefetcher.peekNextPrefetch())) {
      if (getState(prefetcher.peekNextPrefetch()) == State:I ||
          getState(prefetcher.peekNextPrefetch()) == State:NP) {
        trigger(Event:Prefetch, prefetcher.popNextPrefetch());
      } else {
        // tag is already present in a non-invalid state
        prefetcher.cancelNextPrefetch();
      }
    } else {
      trigger(Event:L2_Replacement,
              L2cacheMemory.cacheProbe(prefetcher.peekNextPrefetch()));
    }
  }
}
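The decision made by the in_port logic can be restated as a pure function, which makes the three outcomes easier to see. This is a sketch in C++ rather than SLICC; the enum and function names are hypothetical, and the boolean parameters stand in for the isReady(), cacheAvail() and isTagPresent() calls.

```cpp
#include <cassert>

// NP and I as on the slide; Other stands for any non-invalid state.
enum class BlockState { NP, I, Other };

enum class WakeupAction {
    None,                 // prefetcher has nothing ready
    TriggerPrefetch,      // Event:Prefetch on the popped address
    CancelPrefetch,       // block already present in a non-invalid state
    TriggerL2Replacement  // no room: evict a victim first
};

// Mirrors the nesting of the in_port block: readiness, then space/tag
// check, then the state of the block to be prefetched.
WakeupAction decideWakeup(bool prefetcher_ready, bool cache_avail,
                          bool tag_present, BlockState state) {
    if (!prefetcher_ready) {
        return WakeupAction::None;
    }
    if (cache_avail || tag_present) {
        if (state == BlockState::I || state == BlockState::NP) {
            return WakeupAction::TriggerPrefetch;
        }
        return WakeupAction::CancelPrefetch;
    }
    return WakeupAction::TriggerL2Replacement;
}
```

Note that in the replacement case the prefetch stays queued (the slide only peeks at it), so it would be retried on a later wakeup.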
Slide 98 http://www.cs.wisc.edu/gems
Implementing DemoPrefetcher
• Nice property of TokenCMP: no tracking of prefetch
  – A tag is allocated and a request issued to memory
  – Keeps received tokens/data if tag allocated

MOESI_CMP_tokenDEMO-L2cache.sm

transition(NP, Prefetch, I) {
  vv_allocateL2CacheBlock;
  a_issuePrefetch;
}

transition(I, Prefetch) {
  a_issuePrefetch;
}

transition({S,O,M,I_L,S_L}, Prefetch) {
  // do nothing
}
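The three transitions amount to a small lookup from the current L2 state to a next state and a set of actions. A minimal C++ sketch of that table, with hypothetical type names and assuming that a transition without an explicit end state leaves the state unchanged:

```cpp
#include <cassert>

// L2 states named on the slide.
enum class L2State { NP, I, S, O, M, I_L, S_L };

// Result of handling a Prefetch event in a given state.
struct PrefetchOutcome {
    L2State next;         // state after the transition
    bool allocate_block;  // corresponds to vv_allocateL2CacheBlock
    bool issue_prefetch;  // corresponds to a_issuePrefetch
};

PrefetchOutcome onPrefetch(L2State current) {
    switch (current) {
    case L2State::NP:
        return {L2State::I, true, true};    // allocate a tag, then issue
    case L2State::I:
        return {L2State::I, false, true};   // tag exists, just issue
    default:
        return {current, false, false};     // S, O, M, I_L, S_L: do nothing
    }
}
```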
Slide 99 http://www.cs.wisc.edu/gems
Outline
• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
• Building Workloads
Slide 100 http://www.cs.wisc.edu/gems
Workloads for Simics/GEMS
• Unfortunately, we cannot release our workloads (legal reasons)
• Steps for Workload Development
  – Simple Example: Barnes-Hut
  – What about more complex applications?
• Workload Simulation Methodology
  – Simulating transactions/requests
  – Coping with workload variability
Slide 101 http://www.cs.wisc.edu/gems
Workload Setup
• Simple Example: Barnes-Hut (Splash2 suite)
  – Commands not to be taken literally! (might be different in different versions)
• Main Steps:
  – Build OS checkpoint
  – Copy application source or binary to simulation
  – Create initial (cold) application checkpoint in Simics
  – Create warm application checkpoint with Simics/Ruby
Slide 102 http://www.cs.wisc.edu/gems
Build OS Checkpoint
• Use Simics to boot your OS and get a checkpoint (assuming a 16-processor Serengeti target machine)
  – cd simics/home/sarek
  – ./simics -x sarek-16p.simics
    • Script loads configuration and boots Solaris
    • Scripts should be provided with your Simics distribution, assuming you have a Solaris license (contact the Virtutech Simics Forum)
    • Modify scripts to fit your target configuration (e.g., memory, disk, network)
  – At the end of your script, take a system snapshot (checkpoint):
      simics> write-configuration CHKPT_DIR/sarek-16p.check
      simics> quit
  – Use this checkpoint to build all your workloads’ 16-processor checkpoints
Slide 103 http://www.cs.wisc.edu/gems
Copy Barnes Source or Binary
• Develop benchmark on real machine (if available)
  – Use Simics “magic” instructions after initialization
    • See Simics reference manual for magic instruction use
  – Compile benchmark with such instructions before running in Simics
• Load from your OS checkpoint
  – ./simics
      simics> read-configuration CHKPT_DIR/sarek-16p.check
      simics> magic-break-enable
• Copy binary into simulated machine (or copy source and compile)
  – Console commands:
      mount /host
      cp -r /host/workloads/splash2/codes/apps/barnes/BARNES .
    • See Simics reference manual on the use of the /host filesystem
Slide 104 http://www.cs.wisc.edu/gems
Obtain Initial Barnes Checkpoint
• Warm up application in Simics
  – Console commands:
      ./BARNES < input-warm
    • input-warm specifies Barnes parameters
      ./BARNES < input-warm
    • Use this second run to warm up cache (see next slide)
      ./BARNES < input-run > output; magic_call break
• After initial run, write checkpoint
      simics> write-configuration CHKPT_DIR/barnes-cold-16p.check
      simics> quit
• Checkpoint is ready for GEMS run
Slide 105 http://www.cs.wisc.edu/gems
Obtain Warm Barnes Checkpoint
• Load initial checkpoint
  – setenv CHECKPOINT_AT_END yes
  – setenv TRANSACTIONS 1
  – setenv PROCESSORS 16
  – setenv CHECKPOINT CHKPT_DIR/barnes-cold-16p.check
  – ./simics -no-win -x GEMS_ROOT/gen-scripts/go.simics
• Script (provided in release) should load ruby and run till the end of the warmup run
  – Also writes checkpoint at the end
    • Edit checkpoint to remove ruby object
  – Modify script to suit your needs
Slide 106 http://www.cs.wisc.edu/gems
What About More Complex Applications?
• Setup on real hardware
  – Tune workload, OS parameters
  – Scale down for PC memory limits
  – Re-tune
  – For details, see [Alameldeen et al., IEEE Computer, Feb’03]
• What if we don’t have access to real hardware?
  – Install applications and set up in Simics
  – Checkpoint often
  – Not optimal for large-scale applications!
Slide 107 http://www.cs.wisc.edu/gems
Simulating Transactions/Requests
• Throughput-based applications
  – Work-based unit to compare configurations
  – IPC not always meaningful
• Counting transactions during simulation
  – Enable magic breaks in Simics
  – Benchmark traps to Simics on every magic instruction
  – Count magic breaks until we reach required number of transactions
  – Cope with benchmark variability
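The counting step can be pictured as a small counter that is bumped on every magic break and tells the simulation when to stop. This is a hypothetical sketch, not the GEMS or Simics implementation; it assumes that each magic break marks exactly one completed transaction.

```cpp
#include <cassert>

// Counts magic breaks until a required number of transactions is reached.
class TransactionCounter {
public:
    explicit TransactionCounter(unsigned long required)
        : m_required(required) {}

    // Call once per magic break (assumed: one break per transaction).
    // Returns true when the required count has been reached.
    bool onMagicBreak() {
        ++m_seen;
        return done();
    }

    bool done() const { return m_seen >= m_required; }
    unsigned long seen() const { return m_seen; }

private:
    unsigned long m_required;  // transactions to simulate
    unsigned long m_seen = 0;  // magic breaks observed so far
};
```

A run that should cover 2000 transactions would construct TransactionCounter(2000) and keep simulating until onMagicBreak() returns true.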
Slide 108 http://www.cs.wisc.edu/gems
Why Consider Variability?
OLTP
Slide 109 http://www.cs.wisc.edu/gems
Workload Variability
• How can slower memory lead to faster workload?
• Answer: Multithreaded workload takes different paths
  – Different lock race outcomes
  – Different scheduling decisions
  → Runs from same initial conditions can be different
This can lead to wrong conclusions for deterministic simulations
• Solution with deterministic simulation
  – Add pseudo-random delay on memory accesses (MEMORY_LATENCY)
  – Simulate base (and enhanced) system multiple times
  – Use simple or complex statistics [Alameldeen and Wood, HPCA 2003]
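The "simple statistics" option can be sketched as follows: perturb the memory latency with a seeded pseudo-random delay, repeat the run several times, and summarize the runtimes with a mean and sample standard deviation. The function names and the uniform perturbation are assumptions for illustration; they are not the GEMS implementation of MEMORY_LATENCY.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Base latency plus a pseudo-random perturbation in [0, max_extra] cycles.
// A different seed per run yields a different (but reproducible) execution path.
int perturbedLatency(int base_latency, int max_extra, std::mt19937& rng) {
    assert(max_extra >= 0);
    std::uniform_int_distribution<int> extra(0, max_extra);
    return base_latency + extra(rng);
}

// Arithmetic mean of the per-run results (e.g. cycles per transaction).
double mean(const std::vector<double>& xs) {
    assert(!xs.empty());
    double sum = 0.0;
    for (double x : xs) sum += x;
    return sum / static_cast<double>(xs.size());
}

// Sample standard deviation (n - 1 in the denominator).
double sampleStdDev(const std::vector<double>& xs) {
    assert(xs.size() >= 2);
    const double m = mean(xs);
    double sq = 0.0;
    for (double x : xs) sq += (x - m) * (x - m);
    return std::sqrt(sq / static_cast<double>(xs.size() - 1));
}
```

Comparing the base and enhanced systems then means comparing two means with their spreads, rather than two single runs.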
Slide 110 http://www.cs.wisc.edu/gems
The End
• Download and subscribe to mailing lists
  http://www.cs.wisc.edu/gems
• We encourage your contributions
  – Workloads
  – Additional timing fidelity
Slide 111 http://www.cs.wisc.edu/gems
Additional Opal Slides
Slide 112 http://www.cs.wisc.edu/gems
Sensitivity Analysis
Slide 113 http://www.cs.wisc.edu/gems
Sensitivity Results
Slide 114 http://www.cs.wisc.edu/gems
Opal and Memory Consistency
• Designed to be aggressive OoO processor
• Our use of Simics is sequentially consistent execution
• Models the performance of weaker models (such as TSO) for only SC memory interleavings
• Violations of SC in Opal:
  – Identical MSHR entry for memory requests with same addr
  – Executes Ld/St out of program order
  – No snooping of LSQ for external stores
Slide 115 http://www.cs.wisc.edu/gems
Implemented UltraSparc Instructions (1)
add addc addcc addccc alignaddr alignaddrl and andcc andn andncc ba bcc bcs be bg bge bgu bl ble bleu bmask bn bne bneg bpa bpcc bpcs
bpe bpg bpge bpgu bpl bple bpleu bpn bpne bpneg bpos bppos bpvc bpvs brgez brgz brlez brlz brnz brz bshuffle bvc bvs call casa casxa cmp done retry fabsd fabsq fabss
faddd faddq fadds faligndata fba fbe fbg fbge fbl fble fblg fbn fbne fbo fbpa fbpe fbpg fbpge fbpl fbple fbplg fbpn fbpne fbpo fbpu fbpue fbpug fbpuge fbpul fbpule fbu fbue fbug
fbuge fbul fbule fcmpd fcmped fcmpeq fcmpeq16 fcmpeq32 fcmpes fcmpgt16 fcmpgt32 fcmple16 fcmple32 fcmpne16 fcmpne32 fcmpq fcmps fdivd fdivq fdivs fdmulq fdtoi fdtoq fdtos fdtox fitod fitoq fitos flush flushw fmovd fmovda fmovdcc fmovdcs fmovde
fmovdg fmovdge fmovdgu fmovdl fmovdle fmovdleu fmovdn fmovdne fmovdneg fmovdpos fmovdvc fmovdvs fmovfda fmovfde fmovfdg fmovfdge fmovfdl fmovfdle fmovfdlg fmovfdn fmovfdne fmovfdo fmovfdu fmovfdue fmovfdug fmovfduge fmovfdul fmovfdule fmovfqa fmovfqe fmovfqg fmovfqge fmovfql fmovfqle
fmovfqlg fmovfqn fmovfqne fmovfqo fmovfqu fmovfque fmovfqug fmovfquge fmovfqul fmovfqule fmovfsa fmovfse fmovfsg fmovfsge fmovfsl fmovfsle fmovfslg fmovfsn fmovfsne fmovfso fmovfsu fmovfsue fmovfsug fmovfsuge fmovfsul fmovfsule fmovq fmovqa fmovqcc fmovqcs fmovqe fmovqg fmovqge fmovqgu
fmovql fmovqle fmovqleu fmovqn fmovqne fmovqneg fmovqpos fmovqvc fmovqvs fmovrdgez fmovrdgz fmovrdlez fmovrdlz fmovrdnz fmovrdz fmovrqgez fmovrqgz fmovrqlez fmovrqlz fmovrqnz fmovrqz fmovrsgez fmovrsgz fmovrslez fmovrslz fmovrsnz fmovrsz fmovs fmovsa fmovscc fmovscs fmovse fmovsg fmovsge
fmovsgu fmovsl fmovsle fmovsleu fmovsn fmovsne fmovsneg fmovspos fmovsvc fmovsvs fmuld fmulq fmuls fnegd fnegq fnegs fqtod fqtoi fqtos fqtox fsmuld fsqrtd fsqrtq fsqrts fsrc1 fstod fstoi fstoq fstox fsubd fsubq fsubs fxtod fxtoq
Slide 116 http://www.cs.wisc.edu/gems
Implemented UltraSparc Instructions (2)
fxtos fzero fzeros ill impdep1 impdep2 jmp jmpl ldblk ldd ldda lddf lddfa ldf ldfa ldfsr ldqa ldqf ldqfa ldsb ldsba ldsh ldsha ldstub ldstuba ldsw ldswa ldub lduba lduh lduha lduw lduwa ldx
ldxa ldxfsr membar mov mova movcc movcs move movfa movfe movfg movfge movfl movfle movflg movfn movfne movfo movfu movfue movfug movfuge movful movfule movg movge movgu movl movle movleu movn movne movneg movpos
movrgez movrgz movrlez movrlz movrnz movrz movvc movvs mulscc mulx nop not or orcc orn orncc popc prefetch prefetcha rd rdcc rdpr restore restored retrn save saved sdiv sdivcc sdivx sethi sll sllx smul
smulcc sra srax srl srlx stb stba stbar stblk std stda stdf stdfa stf stfa stfsr sth stha stqf stqfa stw stwa stx stxa stxfsr sub subc subcc subccc swap swapa ta taddcc taddcctv
tcc tcs te tg tge tgu tl tle tleu tn tne tneg tpos trap tsubcc tsubcctv tvc tvs udiv udivcc udivx umul umulcc wr wrcc wrpr xnor xnorcc xor xorcc
Slide 117 http://www.cs.wisc.edu/gems
TLB Misses
• ITLB Misses
  – Emit special NOP instruction: STATIC_INSTR_MOP; stall fetch
  – Does NOT update PC, NPC
  – Fetch resumes whenever any instr (including special NOP) squashes
• DTLB Misses
  – Set DTLB miss trap for instruction (setTrapType()) in Execute()
  – In retireInstruction(), retrieve trap and call takeTrap() to set trap state for DTLB handler
  – Refetch from DTLB handler
Slide 118 http://www.cs.wisc.edu/gems
Example: Load instruction
• In dynamic_t::Schedule(), load waits until all operands ready (WAIT_XX_STAGE cases)
• Scheduler gets invoked when all operands ready
• Load waits until read port to L1 is available
• load_inst_t::Execute() gets called
  – Generates virtual address
  – Performs D-TLB address translation
  – Inserts entry in LSQ
  – Initiates cache access (via Ruby or Opal’s built-in simple cache hierarchy)
  – On a cache miss, the load is put on a wait list (CACHE_MISS_STAGE) and is woken up by rubycache_t::complete()
• Invokes Simics to read actual memory value in load_inst_t::Complete()
• Retirement check of load value & squash if value deviates from Simics
Slide 119 http://www.cs.wisc.edu/gems
Modifying Opal-Ruby Interface
• Ruby->Opal interface defined in mf_opal_api object (ruby/interfaces/mf_api.h)
• Opal->Ruby interface defined in mf_ruby_api object
• To create new Ruby->Opal callback (ex: hitCallback())
  – Define function in ruby/interfaces/OpalInterface.C
  – Add new function pointer to mf_opal_api object
  – Create a new function handler in opal/system/system.C and assign m_opal_api object’s new function pointer to this function handler
• To create new Opal->Ruby callback (ex: makeRequest())
  – Define function in ruby/interfaces/OpalInterface.C
  – Add new function pointer to mf_ruby_api object
  – Assign function pointer to new function in OpalInterface::installInterface()
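The steps above follow a common pattern: an interface object holds function pointers, one side installs its handlers, and the other side calls through the pointers. A miniature, self-contained sketch of that pattern is shown here. All names (SketchOpalApi, sketchHitHandler, installSketchInterface, notifyHit) are hypothetical and the real mf_opal_api layout and callback signatures are not reproduced.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical miniature of a function-pointer interface table.
struct SketchOpalApi {
    void (*hitCallback)(std::uint64_t address) = nullptr;
};

// State touched by the handler, so the effect of a callback is observable.
static std::uint64_t g_last_hit = 0;
static int g_hit_count = 0;

// Handler on the "processor model" side.
void sketchHitHandler(std::uint64_t address) {
    g_last_hit = address;
    ++g_hit_count;
}

// Installation step: assign the handler to the new function pointer.
void installSketchInterface(SketchOpalApi& api) {
    api.hitCallback = &sketchHitHandler;
}

// "Memory system" side: call through the pointer if one is installed.
bool notifyHit(const SketchOpalApi& api, std::uint64_t address) {
    if (api.hitCallback == nullptr) {
        return false;
    }
    api.hitCallback(address);
    return true;
}
```

Checking the pointer before the call keeps the caller safe when the other module has not been loaded or has not yet installed its handlers.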