Enabling Efficient On-the-fly Microarchitecture Simulation
Thierry Lafage
September 2000
September 2000 Thierry Lafage 2
Introduction
• Microarchitecture simulation:
  – Accurate, but slow (execution time ×1000 to ×10000)
  – “On-the-fly” (vs. trace-driven):
    • Enables execution-driven simulation (complex microprocessors)
    • Simulation of long-running workloads
• Complete microprocessor simulation requires:
  – Realistic workloads and working sets
  – Huge amount of CPU time
Introduction (2)
• Realistic simulations in an affordable time → simulations of a reduced number of instructions:
  – One “big slice” (e.g. after the program start-up phase)
  – Trace sampling
[Figure: two execution timelines, marked 0, 500M, 1B and 0, 500M, 1B, 1.5B, ...]
Representativeness of the simulated execution slices?
• On-the-fly simulations → fast forwarding
  Current tools' “fast” forwarding mode: more than 20× execution slowdown
Outline
1. Speeding up the fast forwarding mode
  – Approach
  – Implementation
  – Performance on the SPEC95 benchmarks
  – Conclusion
2. Selecting representative execution slices
  – Approach
  – Application to data cache simulations
  – Conclusion
Conclusion and Future Work
Speeding up the fast forwarding mode
Two execution modes:
• A really fast mode (static code annotation)
  → Rapid positioning, by direct execution, at the point where the simulation begins
• An emulation mode (embedded instruction-set emulator)
  → Calls to user-provided analysis routines
At run time: dynamic switches between both modes
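The two modes and the run-time switch between them can be pictured with a toy model. This is only a sketch under stated assumptions: the `run` function, the `(opcode, is_checkpoint)` instruction tuples, and the `switch_at` / `emulate_count` parameters are hypothetical names chosen for illustration and are not calvin2 or DICE interfaces. Fast mode only tests for a switch at checkpoints; emulation mode calls a user analysis routine on every instruction.

```python
from typing import Callable, List, Tuple

# A toy "instruction": (opcode, is_checkpoint). Hypothetical representation.
Instr = Tuple[str, bool]

def run(program: List[Instr],
        switch_at: int,
        emulate_count: int,
        analysis: Callable[[int, str], None]) -> Tuple[int, int]:
    """Run `program`, switching to emulation at the first checkpoint reached
    at or after instruction index `switch_at`, for `emulate_count` instructions.
    Returns (instructions run in fast mode, instructions emulated)."""
    fast = emulated = 0
    remaining = 0      # instructions left to emulate in the current slice
    done = False       # this sketch emulates a single slice only
    for index, (opcode, is_checkpoint) in enumerate(program):
        if remaining == 0 and not done and is_checkpoint and index >= switch_at:
            remaining = emulate_count   # switching event: fast -> emulation
        if remaining > 0:
            analysis(index, opcode)     # emulation mode: call the analysis routine
            emulated += 1
            remaining -= 1
            if remaining == 0:
                done = True             # switching event: emulation -> fast
        else:
            fast += 1                   # fast mode: no analysis routine is called
    return fast, emulated

# Toy program with a checkpoint on every 4th instruction.
program = [("op%d" % i, i % 4 == 0) for i in range(20)]
trace: List[int] = []
counts = run(program, switch_at=6, emulate_count=5,
             analysis=lambda index, opcode: trace.append(index))
print(counts, trace)
```

In this toy run the switch can only happen at a checkpoint, so emulation starts at instruction 8 rather than at instruction 6. The checkpoint density therefore bounds how precisely a slice can be positioned.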
Implementation
[Figure: implementation diagram. Labels: original code (SPARC V9 assembly code); calvin2, static code annotation tool; checkpoints inserted in the code; switching events; emulation mode; DICE, host ISA emulator; user analysis routines]
Performance on the SPEC95 Benchmarks
• calvin2+DICE:
  – Average slowdown in fast mode: 1.31 (checkpoints at procedure calls and inside loops)
  – Average slowdown in emulation mode (instruction and data address trace): 117.47
• Shade (instruction and data address generation enabled):
  – Average slowdown in “fast forward” mode: 17.07 (empty analysis routine)
  – Average slowdown in emulation mode: 82.19 (tracing analysis routine)
A Simple Example of Microprocessor Simulation
• Simulation of 1% of a 1-hour workload
• Additional 1000× slowdown for the simulated part
With calvin2+DICE (direct execution, then emulation + simulation):
  0.99 × 1.31 + 0.01 × (117.45 + 1000) ≈ 12.5 hours
With Shade (fast forward, then emulation + simulation):
  0.99 × 17.07 + 0.01 × (82.19 + 1000) ≈ 27.7 hours
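The arithmetic on this slide can be replayed in a few lines. The sketch below uses the figures quoted on this slide; the function name `total_hours` and its parameter names are illustrative only. As on the slide, the cost model simply adds the emulation and simulation slowdowns for the simulated fraction.

```python
def total_hours(workload_hours: float,
                simulated_fraction: float,
                forward_slowdown: float,
                emulation_slowdown: float,
                simulation_slowdown: float) -> float:
    """Total time = fast-forwarded part + emulated-and-simulated part."""
    forwarded = (1.0 - simulated_fraction) * forward_slowdown
    simulated = simulated_fraction * (emulation_slowdown + simulation_slowdown)
    return workload_hours * (forwarded + simulated)

# 1% of a 1-hour workload is simulated, with an additional 1000x slowdown.
calvin2_dice = total_hours(1.0, 0.01, 1.31, 117.45, 1000.0)
shade = total_hours(1.0, 0.01, 17.07, 82.19, 1000.0)

print(f"calvin2+DICE: {calvin2_dice:.1f} hours")  # 12.5 hours
print(f"Shade:        {shade:.1f} hours")         # 27.7 hours
```

With only 1% of the workload simulated, the fast-forwarded 99% accounts for about 1.3 hours with calvin2+DICE but about 16.9 hours with Shade, which is where the gap comes from.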
Conclusion for calvin2+DICE
• Performance of the emulator: not an issue
• Overall performance is determined by the fast forwarding mode (long-running workloads)
→ calvin2+DICE enables simulations on slices spread over a whole application
Outline
1. Speeding up the fast forwarding mode
  – Approach
  – Implementation
  – Performance on the SPEC95 benchmarks
  – Conclusion
2. Selecting representative execution slices
  – Approach
  – Application to cache simulations
  – Conclusion
Conclusion and Future Work
Introduction
• On-the-fly simulations using realistic applications in an affordable time → simulations of a reduced number of instructions
  – Before: one “big slice” (after the program start-up phase)
  – With calvin2+DICE: on-the-fly statistical sampling
• Number of simulated instructions often determined by:
  – The simulation time
  – Empirical results
[Figure: two execution timelines, marked 0, 500M, 1B and 0, 500M, 1B, 1.5B, ...]
Representativeness of the simulated instructions?
Our Approach
Dynamic characterization of the target programs
Select representative execution slices for simulations (classification)
Aim:
Tune a per-program amount of simulated activity
→ Reduce simulation time or increase the accuracy of simulation results
Dynamic Characterization of the Target Programs
[Figure: the execution is divided into execution slices 0, 1, 2, ..., N, each with its own program characterization]
Metrics independent of the implementation details of the simulated components
Selection of Representative Execution Slices
[Figure: hierarchical classification of execution slices 0, 1, 2, 3, 4]
Resulting classes: {2,1,3}, {0,4}
→ Two slices selected
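A hierarchical classification of slices can be sketched as a bottom-up (agglomerative) clustering of per-slice characterization vectors. The code below is an illustration under assumptions, not the classification method used in this work: it uses Euclidean distance with average linkage, and the function name `classify` and the toy two-dimensional vectors are hypothetical.

```python
from itertools import combinations
from math import dist
from typing import List, Sequence

def classify(vectors: Sequence[Sequence[float]], n_classes: int) -> List[List[int]]:
    """Start with one class per slice and repeatedly merge the two closest
    classes (average linkage) until only n_classes remain."""
    classes: List[List[int]] = [[i] for i in range(len(vectors))]

    def linkage(a: List[int], b: List[int]) -> float:
        # Average Euclidean distance over all cross-class pairs of slices.
        total = sum(dist(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(classes) > n_classes:
        a, b = min(combinations(classes, 2), key=lambda pair: linkage(*pair))
        classes.remove(a)
        classes.remove(b)
        classes.append(sorted(a + b))
    return sorted(classes)

# Toy characterization vectors for slices 0..4 (made-up values).
vectors = [(9.0, 1.0), (1.0, 1.0), (1.2, 0.8), (0.9, 1.3), (8.5, 1.4)]
print(classify(vectors, 2))
```

With these made-up vectors, slices 1, 2 and 3 end up in one class and slices 0 and 4 in the other, mirroring the grouping shown on the slide.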
Selection of Class Representatives
Wmdc indicator: weighted mean of distances from class centers
[Figure legend: class centers, class representatives]
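One plausible reading of this slide is sketched below: the representative of a class is the slice closest to the class center, and the indicator averages the distances of slices to their class centers. The slide does not give the weighting, so weighting each class by its number of slices is an assumption here, as are the function names `center` and `representatives` and the toy vectors.

```python
from math import dist
from typing import List, Sequence, Tuple

def center(vectors: Sequence[Sequence[float]], members: List[int]) -> List[float]:
    """Component-wise mean of the vectors of the slices in one class."""
    dims = len(vectors[members[0]])
    return [sum(vectors[i][d] for i in members) / len(members) for d in range(dims)]

def representatives(vectors: Sequence[Sequence[float]],
                    classes: List[List[int]]) -> Tuple[List[int], float]:
    """Return one representative slice per class and a Wmdc-like indicator."""
    reps: List[int] = []
    distance_sum = 0.0
    slice_count = 0
    for members in classes:
        c = center(vectors, members)
        distances = {i: dist(vectors[i], c) for i in members}
        # Representative: the slice nearest the center (first one on ties).
        reps.append(min(members, key=distances.get))
        # Assumption: each class's mean distance is weighted by its size,
        # which amounts to summing all distances and dividing by the slice count.
        distance_sum += sum(distances.values())
        slice_count += len(members)
    return reps, distance_sum / slice_count

# Toy vectors for slices 0..4 and the two classes from the previous slide.
vectors = [(10.0, 0.0), (0.0, 0.0), (2.0, 0.0), (4.0, 0.0), (10.0, 4.0)]
reps, wmdc = representatives(vectors, [[1, 2, 3], [0, 4]])
print(reps, wmdc)
```

A smaller value of the indicator means the slices sit closer to their class centers, so fewer representatives are needed to cover the program's behavior.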
Application to the Data Stream
Data stream characterization:
– Temporal locality: data reuse distances
– Spatial locality: data reuse distances with several line sizes
[Figure: histogram of data reuse distances. X-axis: data reuse distance (in instructions); y-axis: relative frequency (%)]
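A reuse-distance characterization of this kind could be computed from an address trace as sketched below. The trace format, the function name `reuse_histogram`, and the power-of-two bucketing are assumptions made for illustration. The distance is counted in instructions between two accesses to the same line, and repeating the computation with several line sizes captures spatial locality.

```python
from collections import Counter
from typing import Dict, List, Tuple

def reuse_histogram(trace: List[Tuple[int, int]], line_size: int) -> Dict[int, float]:
    """trace holds (instruction_count, byte_address) for each data access.
    Returns the relative frequency (%) of reuse distances, where bucket k
    collects distances d with 2**k <= d < 2**(k + 1)."""
    last_seen: Dict[int, int] = {}   # line number -> instruction count of last access
    buckets: Counter = Counter()
    for instr_count, address in trace:
        line = address // line_size
        if line in last_seen:        # first touches have no reuse distance
            distance = instr_count - last_seen[line]
            buckets[max(distance, 1).bit_length() - 1] += 1
        last_seen[line] = instr_count
    total = sum(buckets.values())
    if total == 0:
        return {}
    return {k: 100.0 * n / total for k, n in sorted(buckets.items())}

# Toy trace (made-up addresses).
trace = [(0, 0x100), (1, 0x108), (2, 0x200), (5, 0x100), (9, 0x204)]
h16 = reuse_histogram(trace, line_size=16)  # 0x100 and 0x108 share a 16-byte line
h8 = reuse_histogram(trace, line_size=8)    # with 8-byte lines they do not
print(h16, h8)
```

The short-distance reuse between `0x100` and `0x108` only appears with the 16-byte line size, which is the spatial-locality effect the second bullet refers to.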
Results for Trained Cache Simulations on the SPEC95 Benchmarks
[Figure: bar chart of Avg. RE (%), y-axis from 0 to 16. Bars labelled CHAVL, Sampling, Sampling, Big slice, with annotations 3.3%, 5%, 10%, 10%; bar heights not recoverable]
Cache configurations:
– 4-way set associative, LRU, write back, write allocate
– Sizes from 4KB to 512KB
– Line sizes from 16B to 128B
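The cache model behind such experiments can be sketched as follows: a 4-way set-associative cache with LRU replacement, write back and write allocate. This is a minimal illustration and not the simulator used for these results; the `Cache` class, its method names, and the toy trace are hypothetical.

```python
from collections import OrderedDict
from typing import List

class Cache:
    def __init__(self, size_bytes: int, line_size: int, ways: int = 4) -> None:
        self.line_size = line_size
        self.ways = ways
        self.num_sets = size_bytes // (line_size * ways)
        # One OrderedDict per set: tag -> dirty flag, least recently used first.
        self.sets: List[OrderedDict] = [OrderedDict() for _ in range(self.num_sets)]
        self.accesses = self.misses = self.writebacks = 0

    def access(self, address: int, is_write: bool) -> bool:
        """Simulate one data access; return True on a hit."""
        line = address // self.line_size
        lines = self.sets[line % self.num_sets]
        tag = line // self.num_sets
        self.accesses += 1
        hit = tag in lines
        if hit:
            lines.move_to_end(tag)          # mark as most recently used
        else:
            self.misses += 1
            if len(lines) == self.ways:     # set full: evict the LRU line
                _, dirty = lines.popitem(last=False)
                self.writebacks += dirty    # only dirty lines are written back
            lines[tag] = False              # allocate on read and write misses
        if is_write:
            lines[tag] = True               # write back: just mark the line dirty
        return hit

    def miss_ratio(self) -> float:
        return self.misses / self.accesses if self.accesses else 0.0

# Smallest configuration listed above: 4KB, 16B lines, 4 ways (64 sets).
cache = Cache(size_bytes=4 * 1024, line_size=16)
# Toy trace of (address, is_write); all addresses map to set 0.
trace = [(0x0000, True), (0x0004, False), (0x0400, False), (0x0800, False),
         (0x0C00, False), (0x1000, False), (0x0000, False)]
hits = [cache.access(address, is_write) for address, is_write in trace]
print(hits, cache.misses, cache.writebacks)
```

In the toy trace, the fifth distinct line mapping to set 0 evicts the dirty line written first, which produces the single write back; the final access to `0x0000` then misses again.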
Conclusion for representative slice selection
• Similar results with:
  – Branch characterization for branch predictor simulations
  – Data stream characterization, branch characterization, instruction mix and basic block sizes for data cache simulations and branch predictor simulations
→ Program characterization actually helps in tuning the amount of simulated activity
General Conclusion
• calvin2+DICE enables simulations on slices spread over a whole application
• Our approach makes it possible to select representative execution slices
Future Work
• Complete execution-driven simulations (complex microprocessor)
• Operating system activity: LiKE, a Linux Kernel Emulator