Methodologies for Performance Simulation of Super-scalar OOO Processors

Methodologies for Performance Simulation of Super-scalar OOO processors Srinivas Neginhal Anantharaman Kalyanaraman CprE 585: Survey Project


Post on 18-Mar-2016



TRANSCRIPT

Page 1: Methodologies for Performance Simulation of Super-scalar OOO processors

Methodologies for Performance Simulation of Super-scalar OOO Processors

Srinivas Neginhal, Anantharaman Kalyanaraman

CprE 585: Survey Project

Page 2: Methodologies for Performance Simulation of Super-scalar OOO processors

Architectural Simulators

Explore the design space: evaluate existing hardware, or predict the performance of proposed hardware.

The designer has control.

Functional simulators: model the architecture (programmer's focus), e.g., sim-fast, sim-safe.

Performance simulators: model the microarchitecture (designer's focus), e.g., cycle-by-cycle simulation (sim-outorder).

Page 3: Methodologies for Performance Simulation of Super-scalar OOO processors

Simulation Issues

Real applications take too long to run under cycle-by-cycle simulation.

Vast design space:

- Design parameters: code properties, value prediction, dynamic instruction distance, basic block size, instruction fetch mechanisms, etc.

- Architectural metrics: IPC/ILP, cache miss rate, branch prediction accuracy, etc.

Find design flaws and provide design improvements.

Need a “robust” simulation methodology!

Page 4: Methodologies for Performance Simulation of Super-scalar OOO processors

Two Methodologies

HLS: a hybrid of statistical and symbolic simulation.

REF: HLS: Combining Statistical and Symbolic Simulation to Guide Microprocessor Designs. M. Oskin, F. T. Chong and M. Farrens. Proc. ISCA, pp. 71-82, 2000.

BBDA: basic block distribution analysis.

REF: Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications. T. Sherwood, E. Perelman and B. Calder. Proc. PACT, 2001.

Page 5: Methodologies for Performance Simulation of Super-scalar OOO processors

HLS: An Overview

A hybrid processor simulator: HLS = statistical model + symbolic execution.

Performance contours spanned by design space parameters.

What can be achieved? Explore design changes in architectures and compilers that would be impractical to simulate using conventional simulators.

Page 6: Methodologies for Performance Simulation of Super-scalar OOO processors

HLS: Main Idea

Application code

Statistical profiling (instruction stream, data stream)

Code characteristics:
- Basic block size
- Dynamic instruction distance
- Instruction mix

Architecture metrics:
- Cache behavior
- Branch prediction accuracy

Synthetically generated code

Structural simulation of functional units (FU) and issue pipeline units

Page 7: Methodologies for Performance Simulation of Super-scalar OOO processors

Statistical Code Generation

Each “synthetic instruction” contains the following parameters based on the statistical profile:

- Functional unit requirements
- Dynamic instruction distances
- Cache behavior
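As a rough illustration of the idea, the sketch below draws synthetic instructions from a statistical profile. It is not the HLS implementation: the `PROFILE` dictionary, its field names, and all of its numbers are hypothetical, chosen only to show how the three per-instruction parameters could be sampled.

```python
import random
from dataclasses import dataclass


@dataclass
class SyntheticInstruction:
    functional_unit: str  # functional unit the instruction needs
    dep_distance: int     # dynamic instruction distance to its producer
    l1_hit: bool          # cache behavior (only meaningful for loads/stores)


# Hypothetical statistical profile; the numbers are illustrative, not measured.
PROFILE = {
    "instruction_mix": {"int_alu": 0.50, "load": 0.25, "store": 0.10, "branch": 0.15},
    "dep_distance_hist": {1: 0.4, 2: 0.3, 4: 0.2, 8: 0.1},
    "l1_hit_rate": 0.95,
}


def sample(distribution, rng):
    """Draw one key from a {value: probability} distribution."""
    keys = list(distribution)
    weights = [distribution[k] for k in keys]
    return rng.choices(keys, weights=weights, k=1)[0]


def generate(profile, count, seed=0):
    """Generate `count` synthetic instructions from the profile."""
    rng = random.Random(seed)
    instructions = []
    for _ in range(count):
        unit = sample(profile["instruction_mix"], rng)
        distance = sample(profile["dep_distance_hist"], rng)
        is_memory = unit in ("load", "store")
        # Non-memory instructions never touch the cache; mark them as hits.
        hit = rng.random() < profile["l1_hit_rate"] if is_memory else True
        instructions.append(SyntheticInstruction(unit, distance, hit))
    return instructions
```

Fixing the seed makes the generated stream reproducible, which is convenient when the same synthetic code has to be replayed under different simulated design parameters.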

Page 8: Methodologies for Performance Simulation of Super-scalar OOO processors

Validation of HLS against SimpleScalar

For varying combinations of design parameters:

- Run the original benchmark code on SimpleScalar (using sim-outorder)

- Run the statistically generated code on HLS

- Compare SimpleScalar IPC vs. HLS IPC
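The comparison step can be expressed as a relative error. This is a minimal sketch; the function name and the IPC values in the usage note are hypothetical and are not taken from the HLS results.

```python
def ipc_relative_error(reference_ipc, estimated_ipc):
    """Relative error of an estimated IPC against a reference IPC."""
    if reference_ipc <= 0:
        raise ValueError("reference IPC must be positive")
    return abs(estimated_ipc - reference_ipc) / reference_ipc
```

For example, a reference IPC of 2.0 and an estimated IPC of 1.9 give a relative error of about 5%.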

Page 9: Methodologies for Performance Simulation of Super-scalar OOO processors

Validation: Single- and Multi-value Correlations

IPC vs. L1-cache hit rate

For SPECint95, HLS errors are within 5-7% of the cycle-by-cycle results.

Page 10: Methodologies for Performance Simulation of Super-scalar OOO processors

HLS: Code Properties

Basic block size vs. L1-cache hit rate

The correlation suggests that increasing block size helps only when the L1 cache hit rate is >96% or <82%.

Page 11: Methodologies for Performance Simulation of Super-scalar OOO processors

HLS: Value Prediction

DID (dynamic instruction distance) vs. value predictability

Goal: break true dependencies

Stall penalty for mispredict vs. value prediction knowledge

Page 12: Methodologies for Performance Simulation of Super-scalar OOO processors

HLS: Conclusions

Low error rate only on the SPECint95 benchmark suite.

High error rates on the SPECfp95 and STREAM benchmarks (findings by R. H. Bell et al., 2004).

Reason: instruction-level granularity for the workload.

Recommended improvement: basic block-level granularity.

Page 13: Methodologies for Performance Simulation of Super-scalar OOO processors

Basic Block Distribution Analysis

Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications.

T. Sherwood, E. Perelman and B. Calder.

Proc. PACT. 2001.

Page 14: Methodologies for Performance Simulation of Super-scalar OOO processors

Introduction

Goal: to capture large-scale program behavior in significantly reduced simulation time.

Approach:

- Find a representative subset of the full program.

- Find an ideal place to simulate, given a specific number of instructions one has to simulate.

- Accurate confidence estimation of the simulation point.

[Figure: program execution timeline marking initialization, period, and simulation points]

Page 15: Methodologies for Performance Simulation of Super-scalar OOO processors

Program Behavior

Program behavior has ramifications on architectural techniques.

Program behavior is different in different parts of execution:

- Initialization
- Cyclic behavior (periodic)

Cyclic behavior is not representative of all programs.

It is the common case for compute-bound applications.

Page 16: Methodologies for Performance Simulation of Super-scalar OOO processors

BBDA Basics

Fast profiling is used to determine the number of times each basic block executes.

The behavior of the program is directly related to the code that it is executing.

Profiling gives a basic block fingerprint for that particular interval of time.

The interval chosen is ideally representative of the full execution of the program.

Profiling information is collected in intervals of 100 million instructions.

Page 17: Methodologies for Performance Simulation of Super-scalar OOO processors

Basic Block Vector

[Figure: BBV for interval i. The basic blocks B1, B2, …, Bx executed in interval i are counted into a frequency vector with one entry per basic block B1, B2, …, BD.]

BBV = fingerprint of an interval

Varying-size intervals: a BBV collected over an interval of N times 100 million instructions is a BBV of duration N.
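A minimal sketch of collecting BBVs from an execution trace is shown below. The trace format (a sequence of basic block IDs), the `block_sizes` list, and the function name are assumptions made for illustration; a real profiler would use intervals of 100 million instructions rather than the tiny ones used here.

```python
def collect_bbvs(trace, block_sizes, interval_len):
    """Split a basic block trace into intervals and count block executions.

    trace        -- sequence of basic block IDs in execution order (assumed format)
    block_sizes  -- block_sizes[b] is the number of instructions in block b
    interval_len -- instructions per interval
    Returns one frequency vector (BBV) per interval.
    """
    num_blocks = len(block_sizes)
    bbvs = []
    counts = [0] * num_blocks
    executed = 0
    for block_id in trace:
        counts[block_id] += 1
        executed += block_sizes[block_id]
        if executed >= interval_len:
            # Interval is full: emit its BBV and start a new one.
            bbvs.append(counts)
            counts = [0] * num_blocks
            executed = 0
    if executed > 0:
        bbvs.append(counts)  # final, partially filled interval
    return bbvs
```

Adding the count vectors of N consecutive intervals element by element would give a BBV of duration N.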

Page 18: Methodologies for Performance Simulation of Super-scalar OOO processors

Target BBV

BBVs are normalized: each element is divided by the sum of all elements.

Target BBV: the BBV for the entire execution of the program.

Objective: find a BBV of smallest duration “similar” to the target BBV.
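Normalization and the target BBV can be sketched as follows, assuming each interval's BBV is a plain list of execution counts. The function names are illustrative.

```python
def normalize(bbv):
    """Divide each element by the sum of all elements."""
    total = sum(bbv)
    if total == 0:
        return [0.0] * len(bbv)
    return [count / total for count in bbv]


def target_bbv(interval_bbvs):
    """Normalized BBV for the entire execution: sum all intervals, then normalize."""
    summed = [sum(column) for column in zip(*interval_bbvs)]
    return normalize(summed)
```

After normalization every BBV sums to 1, so intervals of different durations can be compared directly.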

Page 19: Methodologies for Performance Simulation of Super-scalar OOO processors

Basic Block Vector Difference

Difference between two BBVs A and B, each with D elements:

Euclidean distance: $\sqrt{\sum_{i=1}^{D} (A_i - B_i)^2}$

Manhattan distance: $\sum_{i=1}^{D} |A_i - B_i|$
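Both distance measures are short to write down. The sketch below assumes two BBVs of equal length given as lists of numbers.

```python
import math


def euclidean_distance(a, b):
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute element-wise differences."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

For two normalized BBVs the Manhattan distance lies between 0 (identical vectors) and 2 (no basic blocks in common).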

Page 20: Methodologies for Performance Simulation of Super-scalar OOO processors

Basic Block Difference Graph

A plot of how well each individual interval in the program compares to the target BBV.

For each interval of 100 million instructions, we create a BBV and calculate its difference from the target BBV.

Used to:

- Find the end of the initialization phase.
- Find the period of the program.
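The Y values of a basic block difference graph can be computed as sketched below. Using the Manhattan distance here is an assumption made for the example; the helper names are illustrative.

```python
def normalize(bbv):
    """Divide each element by the sum of all elements."""
    total = sum(bbv)
    return [count / total for count in bbv] if total else [0.0] * len(bbv)


def difference_graph(interval_bbvs):
    """One difference value per interval: distance from the target BBV."""
    # Target BBV: normalized sum over the whole execution.
    target = normalize([sum(column) for column in zip(*interval_bbvs)])
    graph = []
    for bbv in interval_bbvs:
        current = normalize(bbv)
        graph.append(sum(abs(x - y) for x, y in zip(current, target)))
    return graph
```

Plotting the returned list against the interval index gives the difference graph; low values mark intervals that resemble the program as a whole.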

Page 21: Methodologies for Performance Simulation of Super-scalar OOO processors

Basic Block Difference Graph

Page 22: Methodologies for Performance Simulation of Super-scalar OOO processors

Initialization

Initialization is not trivial.

It is important to simulate representative sections of the initialization code.

Detecting the end of the initialization phase is important.

Initialization Difference Graph:

- Initial Representative Signal (IRS): the first quarter of the BB difference graph.

- Slide it across the BB difference graph.

- The difference is calculated at each point for the first half of the BBDG.

- When the IRS reaches the end of the initialization stage on the BB difference graph, the difference is maximized.
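The sliding comparison can be sketched as below. The slides do not say how the difference between the IRS and the graph is measured, so the sum of absolute differences used here is an assumption, as are the function names and the choice to return the first offset with maximal difference.

```python
def initialization_difference_graph(bbdg):
    """Slide the first quarter of the BBDG (the IRS) across its first half."""
    n = len(bbdg)
    width = max(1, n // 4)
    irs = bbdg[:width]
    graph = []
    for offset in range(n // 2):
        # Assumed measure: sum of absolute differences at this offset.
        graph.append(sum(abs(s - bbdg[offset + j]) for j, s in enumerate(irs)))
    return graph


def estimate_end_of_initialization(bbdg):
    """Index of the first interval where the difference is maximal."""
    graph = initialization_difference_graph(bbdg)
    if not graph:
        return 0
    return max(range(len(graph)), key=graph.__getitem__)
```

On a synthetic difference graph with four high "initialization" values followed by a repeating low pattern, the estimate lands on the first interval after initialization.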

Page 23: Methodologies for Performance Simulation of Super-scalar OOO processors

Initialization

Page 24: Methodologies for Performance Simulation of Super-scalar OOO processors

Period

Period Difference Graph:

- Period Representative Signal: the part of the BBDG from the end of initialization to one quarter of the length of program execution.

- Slide it across half the BBDG.

- The distance between the minimum Y-axis points is the period.

- Using larger durations of a BBV creates a BBDG that emphasizes larger periods.
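A rough sketch of the period search follows. Several details are assumptions not stated in the slides: the representative signal is taken to be a quarter of the BBDG in length, the difference is a sum of absolute differences, and minima are detected with a small tolerance `tol`.

```python
def period_difference_graph(bbdg, init_end):
    """Slide the period representative signal across half the BBDG."""
    n = len(bbdg)
    width = max(1, n // 4)
    signal = bbdg[init_end:init_end + width]
    last = min(init_end + n // 2, n - width)  # keep the window inside the graph
    graph = []
    for offset in range(init_end, last + 1):
        graph.append(sum(abs(s - bbdg[offset + j]) for j, s in enumerate(signal)))
    return graph


def estimate_period(pdg, tol=1e-9):
    """Smallest spacing between the minimum points, in intervals (None if unknown)."""
    if not pdg:
        return None
    lowest = min(pdg)
    minima = [i for i, value in enumerate(pdg) if value <= lowest + tol]
    gaps = [b - a for a, b in zip(minima, minima[1:])]
    return min(gaps) if gaps else None
```

With a synthetic graph whose post-initialization pattern repeats every three intervals, the minima of the period difference graph are three intervals apart, so the estimated period is 3.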

Page 25: Methodologies for Performance Simulation of Super-scalar OOO processors

Period

Page 26: Methodologies for Performance Simulation of Super-scalar OOO processors

Summary of Results

The IPC of the chosen period differed from the IPC of the full execution by 5%.

BBV-based technique (to be continued…)

Page 27: Methodologies for Performance Simulation of Super-scalar OOO processors

Characterizing Program Behavior Through Clustering

Automatically Characterizing Large Scale Program Behavior.

T. Sherwood, E. Perelman, G. Hamerly and B. Calder.

ASPLOS 2002

Page 28: Methodologies for Performance Simulation of Super-scalar OOO processors

Clustering Approach

[Figure: N BBVs -> Clustering -> Clusters #1, #2, …, #K -> Multiple Simulation Points P1, P2, …, Pk]

Page 29: Methodologies for Performance Simulation of Super-scalar OOO processors

Clustering (k-means)

The goal is to divide a set of points into groups such that points within each group are similar to one another by a desired metric.

Input: N points in D-dimensional space

Output: a partition into k clusters

Algorithm:

1. Randomly choose k points as centroids (initialization).

2. Compute the cluster membership of each point based on its distance from each centroid.

3. Compute the new centroid of each cluster.

4. Iterate steps 2 and 3 until convergence.

Runtime complexity is affected by the “curse of dimensionality”.
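The four steps above can be sketched in plain Python as follows. This is a minimal illustration using squared Euclidean distance; the `seed` and `max_iters` parameters are additions for reproducibility and termination, not part of the slide.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iters=100):
    """Partition points into k clusters; returns (assignment, centroids)."""
    rng = random.Random(seed)
    # Step 1: randomly choose k of the points as initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once memberships no longer change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids
```

Each distance computation costs time proportional to the number of dimensions D, which is one reason to reduce the dimension of the BBVs before clustering.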

Page 30: Methodologies for Performance Simulation of Super-scalar OOO processors

Random Projection

Reduce the dimension of the BBVs to 15.

Dimension selection

Dimension reduction: random linear projection.
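A random linear projection multiplies each D-dimensional BBV by a fixed D x 15 random matrix. The sketch below assumes matrix entries drawn uniformly from [-1, 1]; the slide does not specify the distribution, and the function names are illustrative.

```python
import random


def random_projection_matrix(num_dims, target_dims=15, seed=0):
    """D x target_dims matrix with entries drawn uniformly from [-1, 1] (assumed)."""
    rng = random.Random(seed)
    return [
        [rng.uniform(-1.0, 1.0) for _ in range(target_dims)]
        for _ in range(num_dims)
    ]


def project(bbv, matrix):
    """Project a D-dimensional BBV onto the lower-dimensional space."""
    target_dims = len(matrix[0])
    return [
        sum(x * row[j] for x, row in zip(bbv, matrix))
        for j in range(target_dims)
    ]
```

The same matrix must be reused for every BBV of a program so that distances between the projected vectors remain comparable.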

Page 31: Methodologies for Performance Simulation of Super-scalar OOO processors

BBDA: Conclusions

BBDA provides better sensitivity and lower performance variation in phases.

Other related work, such as the instruction working set technique, provides higher “stability”.