Statistically Rigorous Regression Modeling for the Microprocessor Design Space

Benjamin C. Lee (1,2), David M. Brooks (1)
[email protected]

1 Division of Engineering and Applied Sciences, Harvard University
2 Center for Applied Scientific Computing Research, Lawrence Livermore National Laboratory

Benjamin C. Lee, David M. Brooks :: 18 June 2006 :: Workshop on Modeling, Benchmarking, and Simulation



Outline

- Motivation & Background :: Simulation Challenges, Simulation Paradigms, Regression Theory
- Model Derivation :: Experimental Methodology, Correlation Analysis, Model Specification
- Model Evaluation :: Validation Approach, Performance, Power
- Conclusion :: Summary, Future Directions


Microarchitectural Design Space

- Trend toward chip multiprocessors (CMPs) with varying core designs
  - Examples :: POWER4, Pentium 4, UltraSPARC T1
- Goal :: tractably quantify trade-offs between core complexity and core count


Design Space Exploration

- Limitations of existing simulation methodology
  - Trace sampling and compression reduce per-simulation costs
  - Existing techniques do not reduce the number of simulations
  - Space size increases exponentially with parameter count
  - Multi-threaded, multi-core simulations are further constrained
- Prior design space analyses
  - Consider only a limited number of design points
  - Vary one or two parameters at fine granularity
  - Vary multiple parameters at coarse granularity
  - Hold the majority of parameters at constant values


Simulation Paradigms

- Objectives
  - Comprehensively understand the microprocessor design space
  - Selectively perform a modest number of simulations
  - Efficiently leverage simulation data
- Random configuration sampling
  - Sample points uniformly at random (UAR) from the design space for simulation
  - Controls the exponential increase in design count
- Statistical inference
  - Reveals trends and trade-offs from sparse sampling
  - Enables prediction for metrics of interest
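The sampling step above can be sketched in a few lines. This is an illustrative Python sketch, not the paper's tooling: only three of the twelve parameter groups are shown, with value grids taken from the predictor table later in the talk.

```python
import random

# Illustrative slice of the design space: parameter -> candidate values
# (grids follow the talk's predictor table; the full space has 12 groups).
design_space = {
    "depth": list(range(9, 37, 3)),       # FO4 :: 9::3::36
    "width": [4, 8, 16],                  # instruction bandwidth
    "l2cache_size": list(range(11, 16)),  # log2(entries) :: 11::1::15
}

def sample_uar(space, n, seed=0):
    """Draw n configurations uniformly at random (UAR) from the space."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in space.items()}
            for _ in range(n)]

samples = sample_uar(design_space, n=4)
```

Because each parameter value is drawn independently and uniformly, every configuration in the cross-product space is equally likely, which is what makes the later regression training data unbiased.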


Statistical Inference

- Approach
  - Models approximate solutions to intractable problems
  - Require initial data to train and formulate the model
  - Leverage correlations in the initial data for prediction
- Regression modeling
  - Efficient formulation :: sample 1K of ≈1B designs, fit by least squares
  - Accurate inference :: 4–7% median error
  - Static accuracy :: no predictive training


Model Formulation

- Notation
  - n observations
  - Response :: y = y_1, ..., y_n
  - Predictors for observation i :: x_i = x_{i,1}, ..., x_{i,p}
  - Regression coefficients :: β = β_0, ..., β_p
  - Random error :: e = e_1, ..., e_n, where e_i ~ N(0, σ²)
  - Transformations :: f; g = g_1, ..., g_p
- Model

    f(y_i) = β g(x_i) + e_i = β_0 + Σ_{j=1}^{p} β_j g_j(x_{i,j}) + e_i
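For concreteness, here is a minimal least-squares fit of this model form with NumPy. The transformations g_j (identity and square) and the coefficient values are invented for illustration; the paper fits its models in R, not Python.

```python
import numpy as np

# Fit f(y_i) = b0 + sum_j b_j * g_j(x_ij) + e_i by ordinary least squares.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0.0, 1.0, size=(n, 2))
G = np.column_stack([x[:, 0], x[:, 1] ** 2])   # g_1(x) = x, g_2(x) = x^2
beta = np.array([0.5, 2.0, -1.0])              # true b0, b1, b2 (made up)
y = beta[0] + G @ beta[1:] + rng.normal(0.0, 0.01, n)

A = np.column_stack([np.ones(n), G])           # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
```

The key point is that any nonlinear transformations g_j become ordinary columns of the design matrix, so the fit itself stays linear least squares.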


Predictor Interaction

- Modeling interaction
  - Suppose the effects of predictors x_1, x_2 cannot be separated
  - Construct the interaction predictor x_3 = x_1 x_2:

    y = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_1 x_2 + e

- Example
  - Let x_1 be pipeline depth and x_2 be L2 cache size
  - The performance impact of pipelining is affected by cache size:

    Speedup = Depth / (1 + Stalls/Inst)
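A small numerical sketch of why these two predictors interact; all constants below are hypothetical, not the paper's. Deepening the pipeline raises Stalls/Inst (each miss costs more cycles in a deeper pipe), and by how much depends on the cache.

```python
# Speedup = Depth / (1 + Stalls/Inst), with stalls per instruction modeled
# (hypothetically) as growing with both the miss rate and the depth.
def speedup(depth, miss_rate, stall_per_miss_per_stage=0.05):
    stalls_per_inst = miss_rate * stall_per_miss_per_stage * depth
    return depth / (1.0 + stalls_per_inst)

# Doubling depth helps more when a large L2 keeps the miss rate low:
gain_small_l2 = speedup(24, miss_rate=0.10) / speedup(12, miss_rate=0.10)
gain_large_l2 = speedup(24, miss_rate=0.02) / speedup(12, miss_rate=0.02)
```

Because the benefit of x_1 (depth) depends on the level of x_2 (cache size), a separable model β_1x_1 + β_2x_2 cannot capture the effect; the product term β_3x_1x_2 can.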


Predictor Non-Linearity

- Restricted cubic splines [Stone, SS '86]
  - Divide the predictor domain into intervals separated by knots
  - Piecewise cubic polynomials joined at the knots
  - Higher-order polynomials provide better fits
- Location of knots
  - Location of knots is less important than the number of knots
  - Place knots at fixed predictor quantiles
- Number of knots
  - Flexibility and the risk of over-fitting increase with knot count
  - 5 knots or fewer are often sufficient
  - 4 knots balance flexibility against over-fitting
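The restricted cubic spline basis can be written down directly. This sketch uses Harrell's parameterization (as in the Design package the talk uses later); the knot locations are illustrative rather than the paper's quantiles.

```python
# Restricted cubic spline basis: for knots t_1 < ... < t_k, predictor x
# expands into x plus k-2 cubic terms, constrained to be linear in the tails.
def rcs_basis(x, knots):
    t = sorted(knots)
    k = len(t)
    cube = lambda v: max(v, 0.0) ** 3          # truncated cubic (v)+^3
    span = t[k - 1] - t[k - 2]
    terms = [x]
    for j in range(k - 2):
        terms.append(cube(x - t[j])
                     - cube(x - t[k - 2]) * (t[k - 1] - t[j]) / span
                     + cube(x - t[k - 1]) * (t[k - 2] - t[j]) / span)
    return terms

knots = [12.0, 18.0, 24.0, 30.0]   # 4 knots (hypothetical quantiles)
basis = rcs_basis(20.0, knots)     # -> [x, C_1(x), C_2(x)]
```

Each basis term enters the regression as an ordinary column, so fitting remains least squares; beyond the outermost knots every term is linear in x, which tames the tails.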


Prediction

- Expected response
  - Suppose the coefficients β and predictor values x_{i,1}, ..., x_{i,p} are known
  - The expected response is a weighted sum of the predictor values, since E[e_i] = 0:

    E[y_i] = E[β_0 + Σ_{j=1}^{p} β_j x_{i,j}] + E[e_i] = β_0 + Σ_{j=1}^{p} β_j x_{i,j}


Tools and Benchmarks

- Simulation framework
  - Turandot :: a cycle-accurate, trace-driven simulator
  - PowerTimer :: power models derived from circuit analyses
  - The baseline simulator models a POWER4/POWER5 architecture
- Benchmarks
  - SPEC CPU2000 :: compute-intensive benchmarks
  - SPECjbb :: Java server benchmark
- Statistical framework
  - R :: software environment for statistical computing
  - Hmisc and Design packages [Harrell, Springer '01]


Configuration Sampling

- Design space size
  - For i ∈ [1, p], S_i defines the possible values for parameter x_i
  - S = ∏_{i=1}^{p} S_i defines the design space
  - |S| = ∏_{i=1}^{p} |S_i| defines the space size
  - B defines the set of benchmarks, giving |B| × |S| potential simulations
  - |S| ≈ 10^9 and |B| = 22
- Sampling uniformly at random (UAR)
  - Sample n = 4,000 design points and benchmarks
  - Unbiased observations over the full range of parameter values
  - Exposes trends and trade-offs between parameters at fine granularity


Predictors :: Microarchitecture

Set  Group                 Parameter             Measure        Range        |S_i|
S1   Depth                 depth                 FO4            9::3::36     10
S2   Width                 width                 insn b/w       4,8,16       3
                           L/S reorder queue     entries        15::15::45
                           store queue           entries        14::14::42
                           functional units      count          1,2,4
S3   Physical Registers    general purpose (GP)  count          40::10::130  10
                           floating-point (FP)   count          40::8::112
                           special purpose (SP)  count          42::6::96
S4   Reservation Stations  branch                entries        6::1::15     10
                           fixed-point/memory    entries        10::2::28
                           floating-point        entries        5::1::14
S5   I-L1 Cache            i-L1 cache size       log2(entries)  7::1::11     5
S6   D-L1 Cache            d-L1 cache size       log2(entries)  6::1::10     5
S7   L2 Cache              L2 cache size         log2(entries)  11::1::15    5
                           L2 cache latency      cycles         6::2::14
S8   Control Latency       branch latency        cycles         1,2          2
S9   FX Latency            ALU latency           cycles         1::1::5      5
                           FX-multiply latency   cycles         4::1::8
                           FX-divide latency     cycles         35::5::55
S10  FP Latency            FPU latency           cycles         5::1::9      5
                           FP-divide latency     cycles         25::5::45
S11  L/S Latency           Load/Store latency    cycles         3::1::7      5
S12  Memory Latency        main memory latency   cycles         70::5::115   10

Ranges are written min::step::max. Parameters within a set vary jointly, so |S_i| counts the settings of the set.
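A quick consistency check on the quoted space size: multiplying the per-set cardinalities |S_i| from the table reproduces |S| ≈ 10^9.

```python
from math import prod

# |S_i| for S1..S12, read off the rightmost column of the table.
cardinalities = [10, 3, 10, 10, 5, 5, 5, 2, 5, 5, 5, 10]

space_size = prod(cardinalities)   # |S| = prod_i |S_i| = 937,500,000
```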


Predictors :: Application-Specific

- Application characteristics
  - Collect program characteristics on the baseline architecture
  - Baseline instruction throughput (BIPS)
  - Cache access patterns (i-L1, d-L1, L2 miss rates)
  - Branch patterns (branch frequency, misprediction rate)
  - Sources of pipeline stalls (per-queue stall histograms)
- Application effects
  - Characteristics are significant predictors when interacting with microarchitectural predictors
  - Example :: the impact of the d-L1 cache is affected by access rates


Variable Clustering

[Figure: hierarchical clustering of predictors by squared Spearman rank correlation (ρ², vertical axis 0.0–1.0). Clustered variables include microarchitectural predictors (depth, width, phys_reg, resv, icache_size, dcache_size, l2cache_size, fix_lat, ctl_lat, ls_lat, fpu_lat, mem_lat) and application characteristics (bips, base_bips, il1miss_rate, il2miss_rate, dl1miss_rate, dl2miss_rate, br_rate, br_mis_rate, br_stall, and per-queue stall counts such as stall_cast, stall_resv, stall_dmissq, stall_storeq, stall_reorderq, stall_inflight, stall_rename).]
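The clustering metric is the squared Spearman rank correlation between predictor pairs. A self-contained sketch follows; it ignores rank ties for simplicity, and the data values are made up.

```python
# Squared Spearman rank correlation: the Pearson correlation of the two
# variables' ranks, squared. This simple ranking ignores ties.
def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    for position, i in enumerate(order):
        r[i] = float(position)
    return r

def spearman_rho2(a, b):
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return (cov * cov) / (var_a * var_b)

depth = [9, 12, 18, 24, 30, 36]
stalls = [0.2, 0.3, 0.5, 0.9, 1.4, 2.0]   # hypothetical, monotone in depth
rho2 = spearman_rho2(depth, stalls)
```

Squaring makes the metric direction-free: a perfectly increasing and a perfectly decreasing relationship both score ρ² = 1, which is what a redundancy analysis wants.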


Strength of Marginal Relationships


Regression Model Specification

- Interactions
  - Pipeline width and depth interact with
    - instruction bandwidth structures (queues, register file)
    - the cache hierarchy
  - Cache hierarchy sizes interact with
    - adjacent levels in the hierarchy
    - application-specific access rates
  - Baseline performance interacts with resource sizings
- Restricted cubic splines
  - Weaker relationships (latencies, caches, queues) :: 3 knots
  - Stronger relationships (depth, registers) :: 4 knots
  - Baseline application performance :: 5 knots


Validation Approach

- Framework
  - Formulate models with n* < n = 4,000 samples
  - Obtain 100 additional random samples for validation
  - Quantify percentage error, 100 × |ŷ_i − y_i| / y_i
- Model variants
  - Baseline (B) :: model the untransformed response
  - Variance Stabilized (S) :: model the square root of the response
  - Regional (S+R) :: for each query, reformulate the model with the samples configured most similarly to the query, using the distance

    d = [ Σ_{i=1}^{p} ((a_i − b_i) / a_i)² ]^{1/2}

  - Application-Specific (S+A) :: fix the sampled benchmarks
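The error metric and the regional distance translate directly into code; the configuration vectors and prediction values below are illustrative.

```python
# Percentage error used for validation, and the normalized distance used to
# pick the most similarly configured samples for the regional (S+R) variant.
def pct_error(y_hat, y):
    return 100.0 * abs(y_hat - y) / y

def distance(a, b):
    return sum(((ai - bi) / ai) ** 2 for ai, bi in zip(a, b)) ** 0.5

query   = [24, 8, 13]      # hypothetical: depth, width, log2(L2 entries)
nearby  = [21, 8, 13]
faraway = [9, 16, 11]
d_near, d_far = distance(query, nearby), distance(query, faraway)
err = pct_error(1.05, 1.00)   # a 5% miss
```

Dividing each coordinate difference by a_i normalizes parameters with very different scales (e.g. FO4 depths versus log2 cache sizes) before they are combined.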


Performance Prediction


Power Prediction


Performance-Power Comparison

- Performance accuracy
  - 7.4% median error for the S+A model
  - S+A reduces performance variance across applications
  - S+R is ineffective since the application is the primary determinant of performance
- Power accuracy
  - 4.3% median error for the S+R model
  - S+R reduces power variance across configurations
  - S+A is ineffective since resource sizings are the primary determinants of power


Summary

- Simulation challenges
  - Design space studies limited by simulation costs
  - Existing frameworks reduce only per-simulation costs
- Regression models
  - Sampling :: 1K of ≈1B configurations, UAR
  - Specification :: correlation analyses
  - Refinement :: variance-stabilizing transformations
- Model evaluation
  - 7.4% and 4.3% median errors for performance and power
  - S+A and S+R are more effective for performance and power, respectively


Future Directions

- Model applications
  - Demonstrate applicability to prior studies
  - Models enable more aggressive studies
  - Construct a CMP simulation framework
- Model improvements
  - Techniques and transformations to further reduce error and bias
- Survey approaches in statistical inference
  - Compare regression modeling with machine learning

Appendix

Publications
www.deas.harvard.edu/~bclee

B.C. Lee and D.M. Brooks. Statistically rigorous regression modeling for the microprocessor design space. ISCA-33: Workshop on Modeling, Benchmarking, and Simulation, June 2006.

B.C. Lee and D.M. Brooks. Accurate, efficient regression modeling for microarchitectural performance, power prediction. ASPLOS-XII: International Conference on Architectural Support for Programming Languages and Operating Systems, Oct 2006. (To appear)


References I

Y. Li, B.C. Lee, D. Brooks, Z. Hu, and K. Skadron. CMP design space exploration subject to physical constraints. HPCA-12: International Symposium on High-Performance Computer Architecture, Feb 2006.

L. Eeckhout, S. Nussbaum, J. Smith, and K. De Bosschere. Statistical simulation: Adding efficiency to the computer designer's toolbox. IEEE Micro, Sept/Oct 2003.

R. Liu and K. Asanovic. Accelerating architectural exploration using canonical instruction segments. ISPASS: International Symposium on Performance Analysis of Systems and Software, Austin, Texas, March 2006.

T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically characterizing large scale program behavior. ASPLOS-X: Architectural Support for Programming Languages and Operating Systems, October 2002.

B.C. Lee and D.M. Brooks. Effects of pipeline complexity on SMT/CMP power-performance efficiency. ISCA-32: Workshop on Complexity Effective Design, June 2005.


References II

C. Stone. Comment: Generalized additive models. Statistical Science, 1986.

F. Harrell. Regression Modeling Strategies. Springer, New York, NY, 2001.

J. Yi, D. Lilja, and D. Hawkins. Improving computer architecture simulation methodology by adding statistical rigor. IEEE Computer, Nov 2005.

P. Joseph, K. Vaswani, and M.J. Thazhuthaveetil. Construction and use of linear regression models for processor performance analysis. HPCA-12: International Symposium on High Performance Computer Architecture, Austin, Texas, February 2006.

S. Nussbaum and J. Smith. Modeling superscalar processors via statistical simulation. PACT 2001: International Conference on Parallel Architectures and Compilation Techniques, Barcelona, Sept 2001.


Assessing Fit

- Multiple correlation statistic
  - R² is the fraction of response variance captured by the predictors
  - Larger R² suggests a better fit to the observed data
  - R² → 1 suggests over-fitting (less likely if p < n/20)

    R² = 1 − [ Σ_{i=1}^{n} (y_i − ŷ_i)² ] / [ Σ_{i=1}^{n} (y_i − ȳ)² ],  where ȳ = (1/n) Σ_{i=1}^{n} y_i

- Residual distribution assumptions
  - Residuals are normally distributed, e_i ~ N(0, σ²)
  - No correlation between residuals and the response or predictors
  - Validate with scatterplots and quantile-quantile plots

    ê_i = y_i − β_0 − Σ_{j=1}^{p} β_j x_{i,j}
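The fit statistic and residuals above translate directly into code; the response and prediction values below are invented.

```python
# R^2 = 1 - SS_res / SS_tot, and residuals e_i = y_i - yhat_i.
def r_squared(y, y_hat):
    n = len(y)
    y_bar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

y     = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]          # hypothetical model predictions
r2 = r_squared(y, y_hat)              # 1 - 0.10/5.0 = 0.98
residuals = [yi - fi for yi, fi in zip(y, y_hat)]
```

Plotting these residuals against ŷ (or against each predictor) is the scatterplot check named in the slide; a quantile-quantile plot against N(0, σ²) checks the normality assumption.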


Predictor Non-Linearity I

- Polynomial transformations
  - Undesirable peaks and valleys
  - Differing trends across regions
- Linear splines
  - Piecewise linear regions separated by knots
  - Inadequate for complex, highly curved relationships
- Restricted cubic splines
  - Higher-order polynomials provide better fits
  - Continuous at the knots
  - Linear constraint on the tails


Predictor Non-Linearity II

- Location of knots
  - Location of knots is less important than the number of knots
  - Place knots at fixed predictor quantiles
- Number of knots
  - Flexibility and the risk of over-fitting increase with knot count
  - 5 knots or fewer are often sufficient [Stone, SS '86]
  - 4 knots are a good compromise between flexibility and over-fitting
  - Fewer knots are required for small data sets


Significance Testing I

- Approach
  - Given two nested models, the null hypothesis H0 states that the additional predictors in the larger model have no association with the response
  - Test H0 with F-statistics and p-values
- Example
  - Testing a predictor interaction requires comparing nested models
  - Consider the model y = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_1 x_2
  - Test the significance of x_1 with the null hypothesis H0 : β_1 = β_3 = 0


Significance Testing II

- F-statistic
  - Compare two nested models using their R² values and the F-statistic
  - R² is the fraction of response variance captured by the predictors:

    R² = 1 − [ Σ_{i=1}^{n} (y_i − ŷ_i)² ] / [ Σ_{i=1}^{n} (y_i − ȳ)² ]

  - The F-statistic of two nested models follows an F distribution; with R²_* for the smaller model and k additional predictors (p total):

    F_{k, n−p−1} = [ (R² − R²_*) / k ] × [ (n − p − 1) / (1 − R²) ]

- P-values
  - The probability that an F-statistic greater than or equal to the observed value would occur under H0
  - Small p-values cast doubt on H0
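The nested-model test reduces to arithmetic on the two R² values; the numbers below are invented for illustration.

```python
# F-statistic comparing a smaller model (R^2_*) against a larger model that
# adds k predictors, reaching p predictors total over n observations.
def f_statistic(r2_full, r2_star, n, p, k):
    return ((r2_full - r2_star) / k) * ((n - p - 1) / (1.0 - r2_full))

# e.g. do k = 2 added interaction terms matter? (hypothetical R^2 values)
F = f_statistic(r2_full=0.92, r2_star=0.90, n=1000, p=10, k=2)
# Compare F against the F_{k, n-p-1} distribution to obtain a p-value.
```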


Treatment of Missing Data

- Missing completely at random (MCAR)
  - Treat unobserved design points as missing data
  - Sampling UAR ensures observations are MCAR
  - Data are missing for reasons unrelated to the characteristics or responses of the configuration
- Informative missing
  - Data are more likely to be missing if their responses are systematically higher or lower
  - "Missingness" is then non-ignorable and must itself be modeled
  - Sampling UAR avoids such modeling complications


Performance Associations I

[Figure: mean bips over the N = 4,000 samples, binned by predictor values, in three panels — Pipeline (depth, width, phys_reg, resv), Latency (mem_lat, ls_lat, ctl_lat, fix_lat, fpu_lat), and Memory Hierarchy (l2cache_size, icache_size, dcache_size).]


Performance Associations II

[Figure: mean bips over the N = 4,000 samples, binned by application characteristics — Branch (br_rate, br_mis_rate) and Stalls (stall_inflight, stall_dmissq, stall_cast, stall_storeq, stall_reorderq, stall_resv, stall_rename).]


Performance Associations III

[Figure: mean bips over the N = 4,000 samples, binned by cache miss rates (il1miss_rate, dl1miss_rate, dl2miss_rate) and by baseline performance (base_bips).]


Significance Tests

- Microarchitectural predictors
  - The majority of F-tests imply significance (p-values < 2.2e-16)
  - Several predictors were less significant:
    - Control latency (p-value = 0.1247)
    - Reservation station size (p-value = 0.1239)
    - L1 instruction cache size (p-value = 0.02941)
- Application-specific predictors
  - The majority of F-tests imply significance (p-values < 2.2e-16)
  - Pipeline stalls classified by structure are less significant:
    - Completion and reorder queue stalls (p-values > 0.4)


Related Work

- Statistical significance ranking
  - Yi :: Plackett-Burman designs, effect rankings
  - Joseph :: stepwise regression, coefficient rankings
  - Bound parameter values to improve tractability
  - Require simulation for estimation
- Synthetic workloads
  - Eeckhout :: profile workloads to obtain synthetic traces
  - Nussbaum :: superscalar and SMP simulation
  - Obtain distributions of instructions and data dependencies
  - Require simulation with smaller traces for estimation
