Enabling Efficient On-the-fly Microarchitecture Simulation

Thierry Lafage

[email protected]

September 2000

Introduction

• Microarchitecture simulation:
  – Accurate, but slow (execution slowdown of 1000x to 10000x)
  – “On-the-fly” (vs. trace-driven):
    • Enables execution-driven simulation (complex microprocessors)
    • Simulation of long-running workloads
• Complete microprocessor simulation requires:
  – Realistic workloads and working sets
  – A huge amount of CPU time


Introduction (2)

• Realistic simulations in an affordable time => simulations of a reduced number of instructions:
  – One “big slice” (e.g. after the program start-up phase)
  – Trace sampling

Representativeness of the simulated execution slices?

• On-the-fly simulations => fast forwarding
  Current tools' “fast” forwarding mode: >20x execution slowdown

[Figure: two execution timelines, one marked 0, 500M, 1B, 1.5B, ... and one marked 0, 500M, 1B]


Outline

1. Speeding up the fast forwarding mode
   – Approach
   – Implementation
   – Performance on the SPEC95 benchmarks
   – Conclusion
2. Selecting representative execution slices
   – Approach
   – Application to data cache simulations
   – Conclusion

Conclusion and Future Work


Speeding up the fast forwarding mode

Two execution modes:

• A really fast mode (static code annotation) => rapid positioning of the execution, by direct execution, at the point where the simulation is to begin
• An emulation mode (embedded instruction-set emulator) => calls to user-provided analysis routines

At run time: dynamic switches between both modes


Implementation

[Figure labels: original code (SPARC V9 assembly code); calvin2, the static code annotation tool; checkpoints inserted in the code; DICE, the host ISA emulator; user analysis routines; switching events into and out of emulation mode]


Performance on the SPEC95 Benchmarks

• calvin2+DICE:
  – Average slowdown in fast mode: 1.31 (checkpoints at procedure calls and inside loops)
  – Average slowdown in emulation mode (instruction and data address trace): 117.47
• Shade (instruction and data address generation enabled):
  – Average slowdown in “fast forward” mode: 17.07 (empty analysis routine)
  – Average slowdown in emulation mode: 82.19 (tracing analysis routine)


A Simple Example of Microprocessor Simulation

• Simulation of 1% of a 1-hour workload
• Additional 1000x slowdown for the simulation itself

With calvin2+DICE (direct execution, then emulation + simulation):
0.99 × 1.31 + 0.01 × (117.45 + 1000) ≈ 12.5 hours

With Shade (fast forward, then emulation + simulation):
0.99 × 17.07 + 0.01 × (82.19 + 1000) ≈ 27.7 hours
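The two estimates above follow the same arithmetic: the non-simulated fraction of the workload pays the fast-mode slowdown, and the simulated fraction pays the emulation slowdown plus the simulator slowdown. A minimal sketch, using the numbers from this slide (the function and parameter names are illustrative, not part of calvin2, DICE or Shade):

```python
def total_hours(workload_hours, simulated_fraction, fast_slowdown,
                emulation_slowdown, simulation_slowdown):
    """Estimated wall-clock time for simulating a fraction of a workload."""
    # Time spent skipping over the part of the run that is not simulated.
    fast = (1.0 - simulated_fraction) * fast_slowdown
    # Time spent in the simulated part: emulation plus simulator overhead.
    detailed = simulated_fraction * (emulation_slowdown + simulation_slowdown)
    return workload_hours * (fast + detailed)


# Slowdown figures taken from the slide; 1-hour workload, 1% simulated.
calvin2_dice = total_hours(1.0, 0.01, 1.31, 117.45, 1000.0)
shade = total_hours(1.0, 0.01, 17.07, 82.19, 1000.0)

print(f"calvin2+DICE: {calvin2_dice:.1f} hours")
print(f"Shade:        {shade:.1f} hours")
```

With only 1% of the run simulated, the fast-mode term contributes about 1.3 hours for calvin2+DICE against about 16.9 hours for Shade, which is where the gap comes from.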


Conclusion for calvin2+DICE

• Performance of the emulator: not an issue

• Overall performance given by the performance of the fast forwarding mode (long running workloads)

calvin2+DICE enables simulations on slices spread over a whole application


Outline

1. Speeding up the fast forwarding mode
   – Approach
   – Implementation
   – Performance on the SPEC95 benchmarks
   – Conclusion
2. Selecting representative execution slices
   – Approach
   – Application to cache simulations
   – Conclusion

Conclusion and Future Work


Introduction

• On-the-fly simulations using realistic applications in an affordable time => simulations of a reduced number of instructions
  – Before: one “big slice” (after the program start-up phase)
  – With calvin2+DICE: on-the-fly statistical sampling
• Number of simulated instructions often determined by:
  – The simulation time
  – Empirical results

Representativeness of the simulated instructions?

[Figure: two execution timelines, one marked 0, 500M, 1B and one marked 0, 500M, 1B, 1.5B, ...]


Our Approach

Dynamic characterization of the target programs
=> Select representative execution slices for simulations (classification)

Aim:
Tune a per-program amount of simulated activity
=> Reduce simulation time or increase simulation result accuracy


Dynamic Characterization of the Target Programs

[Figure: the execution is divided into slices 0, 1, 2, ..., N, and each slice receives a program characterization]

Metrics independent of the implementation details of the simulated components


Selection of Representative Execution Slices

[Figure: hierarchical classification of five execution slices, numbered 0 to 4, into the classes {2,1,3} and {0,4}]

Two slices selected
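The grouping of slices into classes can be illustrated with a small agglomerative clustering. This is a sketch only: it uses plain average linkage on Euclidean distances, which is a stand-in and not necessarily the classification method used in this work, and the per-slice feature vectors are invented for illustration.

```python
from itertools import combinations
from math import dist


def agglomerate(points, n_classes):
    """Merge the two closest classes (average linkage) until n_classes remain."""
    classes = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average pairwise distance between the members of two classes.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(classes) > n_classes:
        a, b = min(combinations(classes, 2), key=lambda pair: linkage(*pair))
        classes.remove(a)
        classes.remove(b)
        classes.append(sorted(a + b))
    return sorted(classes)


# Hypothetical characterization vectors for slices 0..4 (not measured data).
slices = [(0.9, 0.1), (0.2, 0.8), (0.25, 0.75), (0.3, 0.7), (0.85, 0.2)]
print(agglomerate(slices, 2))
```

With these made-up vectors the two resulting classes are {0,4} and {1,2,3}, matching the shape of the example on the slide; one slice per class would then be simulated.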


Selection of Class Representatives

Wmdc indicator: weighted mean of distances from class centers

[Figure labels: class centers, class representatives]
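One way to read this slide is sketched below: the representative of a class is the slice closest to the class center, and the indicator averages the distances to the centers. The exact weighting behind Wmdc is not given here, so weighting each class by its share of the slices is an assumption, as are the function names and the sample vectors.

```python
from math import dist


def center(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def representatives_and_wmdc(points, classes):
    """Pick the slice nearest each class center and compute a weighted mean distance."""
    total = len(points)
    reps, wmdc = [], 0.0
    for members in classes:
        c = center([points[i] for i in members])
        distances = {i: dist(points[i], c) for i in members}
        # Representative: the member closest to the class center.
        reps.append(min(distances, key=distances.get))
        # Assumed weighting: each class counts in proportion to its size.
        mean_distance = sum(distances.values()) / len(members)
        wmdc += (len(members) / total) * mean_distance
    return reps, wmdc


# Hypothetical characterization vectors for slices 0..4 (not measured data).
points = [(0.9, 0.1), (0.2, 0.8), (0.25, 0.75), (0.3, 0.7), (0.85, 0.2)]
reps, wmdc = representatives_and_wmdc(points, [[0, 4], [1, 2, 3]])
print(reps, round(wmdc, 4))
```

A smaller value means the slices sit closer to their class centers, so the selected representatives stand in for their classes more faithfully.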


Application to the Data Stream

Data stream characterization:
– Temporal locality: data reuse distances
– Spatial locality: data reuse distances with several line sizes

[Figure: relative frequency (%) plotted against data reuse distance (in instructions)]
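The characterization above can be sketched as a histogram of reuse distances. The sketch assumes that a reuse distance is the number of instructions executed between two consecutive accesses to the same data item, and that spatial locality is captured by grouping addresses into lines of a given size; the trace format and names are illustrative.

```python
from collections import Counter


def reuse_distance_histogram(trace, line_size=1):
    """Count reuse distances, in instructions, per line of line_size bytes.

    trace: iterable of (instruction_number, byte_address) pairs.
    """
    last_seen = {}          # line -> instruction number of its last access
    histogram = Counter()   # reuse distance -> number of occurrences
    for instr, address in trace:
        line = address // line_size
        if line in last_seen:
            histogram[instr - last_seen[line]] += 1
        last_seen[line] = instr
    return histogram


def relative_frequencies(histogram):
    """Convert occurrence counts into percentages."""
    total = sum(histogram.values())
    return {d: 100.0 * n / total for d, n in sorted(histogram.items())}


# Hypothetical trace: (instruction number, data address).
trace = [(0, 0x100), (2, 0x104), (3, 0x100), (7, 0x140), (9, 0x104)]
temporal = reuse_distance_histogram(trace)               # exact-address reuse
spatial = reuse_distance_histogram(trace, line_size=32)  # reuse per 32-byte line
print(relative_frequencies(temporal))
print(relative_frequencies(spatial))
```

Repeating the measurement with several values of `line_size` gives one histogram per line size, which is how the spatial-locality metric is described on the slide.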


Results for Trained Cache Simulations on the SPEC95 Benchmarks

[Figure: bar chart of the average relative error, Avg. RE (%), on an axis from 0 to 16, with bars for CHAVL, Sampling, Sampling and Big slice; bar labels: 3.3%, 5%, 10%, 10%]

Cache configurations: 4-way set associative, LRU, write back, write allocate; sizes from 4 KB to 512 KB; line sizes from 16 B to 128 B
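For reference, a cache of the kind listed above can be modelled in a few lines. This is a minimal sketch, not the simulator used for these results: it counts misses only, so the write-back policy is not modelled, and with write allocate reads and writes are treated alike. It also assumes that RE stands for the relative error of an estimate against a full-run reference.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Set-associative cache with LRU replacement, counting misses only."""

    def __init__(self, size_bytes, line_bytes, ways=4):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (line_bytes * ways)
        # One ordered dict per set, ordered from least to most recently used.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.accesses = 0
        self.misses = 0

    def access(self, address):
        """Access one byte address; return True on a hit, False on a miss."""
        line = address // self.line_bytes
        lines = self.sets[line % self.num_sets]
        self.accesses += 1
        if line in lines:
            lines.move_to_end(line)      # mark as most recently used
            return True
        self.misses += 1
        if len(lines) >= self.ways:
            lines.popitem(last=False)    # evict the least recently used line
        lines[line] = True
        return False

    def miss_ratio(self):
        return self.misses / self.accesses if self.accesses else 0.0


def relative_error(estimate, reference):
    """Relative error of an estimate, in percent of the reference value."""
    return abs(estimate - reference) / reference * 100.0


cache = SetAssociativeCache(size_bytes=4 * 1024, line_bytes=16)
# Five lines that map to the same set of this 64-set cache.
pattern = [0, 1024, 2048, 3072, 0, 4096, 1024, 0]
hits = [cache.access(a) for a in pattern]
print(hits, cache.miss_ratio())
```

Running the same model on the selected slices and on the full trace, then calling `relative_error` on the two miss ratios, gives the kind of per-configuration error that the chart averages.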


Conclusion for representative slice selection

• Similar results with:
  – Branch characterization for branch predictor simulations
  – Data stream characterization, branch characterization, instruction mix and basic block sizes for data cache simulations and branch predictor simulations

=> Program characterization actually helps in tuning the amount of simulated activity


General Conclusion

• calvin2+DICE enables simulations on slices spread over a whole application
• Our approach enables the selection of representative execution slices

Future Work

• Complete execution-driven simulations (complex microprocessor)
• Operating system activity: LiKE, a Linux Kernel Emulator
