Data Analysis Overview

Upload: poppy-hill

Post on 13-Dec-2015


Page 1: Data Analysis Overview Experimental environment prototype real sys exec- driven sim trace- driven sim stochastic sim Workload parameters System Config

Data Analysis Overview

[Diagram: Experimental environment (prototype, real sys, exec-driven sim, trace-driven sim, stochastic sim); Workload parameters; System Config parameters; Factor levels; Raw Data; Samples from different experiments]

Page 2

Data Analysis Overview

[Diagram as on page 1, highlighting 1 set of experiments and a repeated set]

Comparison of Alternatives
Common case – one sample point for each.
• Conclusions only about this set of experiments.

Page 3

Data Analysis Overview

[Diagram as on page 1, highlighting the samples from 1 repeated experiment]

Characterizing this sample data set
• Central tendency – means, mode, median
• Variability – range, std dev, COV, quantiles
• Fit to known distribution

Sample data vs. Population
• Confidence interval for mean
• Significance level
• Sample size n given r% accuracy
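As a rough illustration of the quantities listed above, the following Python sketch summarizes one sample and builds a confidence interval for its mean. The sample values, the function name `summarize`, and the t-quantile passed in are assumptions made for illustration; the slides give no data at this point.

```python
import math
import statistics

def summarize(sample, t_quantile):
    """Return central-tendency and variability measures plus a CI for the mean.

    t_quantile is the t-value for n-1 degrees of freedom, looked up by the
    caller (the standard library has no t-table).
    """
    n = len(sample)
    mean = statistics.mean(sample)
    stdev = statistics.stdev(sample)          # sample standard deviation (n-1)
    half_width = t_quantile * stdev / math.sqrt(n)
    return {
        "mean": mean,
        "median": statistics.median(sample),
        "range": max(sample) - min(sample),
        "stdev": stdev,
        "cov": stdev / mean,                  # coefficient of variation
        "ci": (mean - half_width, mean + half_width),
    }

# Hypothetical response times (seconds); t[0.95; 4] = 2.132 gives a 90% interval.
stats = summarize([3.1, 2.9, 3.4, 3.0, 3.6], t_quantile=2.132)
print(stats)
```

The interval describes the population mean, not the spread of individual observations; a larger sample narrows it roughly in proportion to the square root of n.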

Page 4

Data Analysis Overview

[Diagram as on page 1, highlighting 1 experiment within 1 set of experiments, with alternatives A and B]

Comparison of Alternatives
Paired Observations
• As one sample of pairwise differences $a_i - b_i$
• Confidence interval
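The paired comparison can be sketched in a few lines of Python: treat the differences $a_i - b_i$ as a single sample and check whether the confidence interval for their mean contains zero. The measurements and the name `paired_ci` are hypothetical.

```python
import math
import statistics

def paired_ci(a, b, t_quantile):
    """Confidence interval for the mean of the pairwise differences a_i - b_i."""
    diffs = [ai - bi for ai, bi in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    half_width = t_quantile * statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff - half_width, mean_diff + half_width

# Hypothetical run times of systems A and B on the same six workloads.
a = [10, 12, 9, 11, 13, 10]
b = [8, 11, 9, 9, 12, 8]
low, high = paired_ci(a, b, t_quantile=2.015)   # t[0.95; 5], 90% interval
different = not (low <= 0.0 <= high)            # zero inside means no significant difference
print(low, high, different)
```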

Page 5

Data Analysis Overview

[Diagram as on page 1, highlighting 1 experiment within 1 set of experiments, annotated with $\bar{x}_a, s_a, CI_a$ and $\bar{x}_b, s_b, CI_b$]

Unpaired Observations
• As multiple samples, sample means and overlapping CIs
• t-test on mean difference: $\bar{x}_a - \bar{x}_b$
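For unpaired observations, one common form of the t-test on the mean difference is Welch's test. The sketch below computes its statistic and approximate degrees of freedom. Choosing Welch's formula is an assumption here (the slide does not say which variant it intends), and the data are hypothetical.

```python
import math
import statistics

def welch_t(a, b):
    """t statistic and approximate degrees of freedom for mean(a) - mean(b)."""
    va = statistics.variance(a) / len(a)      # s_a^2 / n_a
    vb = statistics.variance(b) / len(b)      # s_b^2 / n_b
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof

# Hypothetical, independently collected samples for alternatives A and B.
t, dof = welch_t([10, 12, 9, 11, 13], [8, 9, 7, 8, 8])
print(t, dof)
```

The statistic is then compared against a t-quantile with `dof` degrees of freedom at the chosen confidence level.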

Page 6

Data Analysis Overview

[Diagram as on page 1, with predictor values x (factor levels) and samples of the response $y_1, y_2, \ldots, y_n$]

Regression models
• response var = f (predictors)

Page 7

© 1998, Geoff Kuenning

Linear Regression Models

What is a (good) model?
Estimating model parameters
Allocating variation ($R^2$)
• Confidence intervals for regressions
Verifying assumptions visually

Page 8

Confidence Intervals for Regressions

• Regression is done from a single sample (size n)
  – Different sample might give different results
  – True model is $y = \beta_0 + \beta_1 x$
  – Parameters $b_0$ and $b_1$ are really means taken from a population sample

Page 9

Calculating Intervals for Regression Parameters

• Standard deviations of parameters:

$$s_{b_0} = s_e \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum x^2 - n\bar{x}^2} \right]^{1/2}$$

$$s_{b_1} = \frac{s_e}{\left[ \sum x^2 - n\bar{x}^2 \right]^{1/2}}$$

• Confidence intervals are $b_i \pm t\, s_{b_i}$
• where t has n - 2 degrees of freedom

Page 10

Example of Regression Confidence Intervals

• Recall $s_e = 0.13$, $n = 5$, $\sum x^2 = 264$, $\bar{x} = 6.8$
• So

$$s_{b_0} = 0.13 \left[ \frac{1}{5} + \frac{(6.8)^2}{264 - 5(6.8)^2} \right]^{1/2} = 0.16$$

$$s_{b_1} = \frac{0.13}{\left[ 264 - 5(6.8)^2 \right]^{1/2}} = 0.023$$

• Using a 90% confidence level, $t_{0.95;3} = 2.353$

Page 11

Regression Confidence Example, cont’d

• Thus, $b_0$ interval is $0.35 \pm 2.353(0.16) = (-0.03, 0.73)$
  – Not significant at 90%
• And $b_1$ is $0.29 \pm 2.353(0.023) = (0.24, 0.34)$
  – Significant at 90% (and would survive even a 99% test)
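The arithmetic of this example can be checked with a short Python sketch that uses only the summary values quoted on the slides ($s_e$, $n$, $\sum x^2$, $\bar{x}$, $b_0 = 0.35$, $b_1 = 0.29$, $t = 2.353$). The function name is hypothetical. With the square root applied to the $s_{b_1}$ denominator, the slope's standard deviation comes out near 0.023.

```python
import math

def parameter_intervals(b0, b1, s_e, n, sum_x2, x_bar, t_quantile):
    """Standard deviations and confidence intervals for intercept b0 and slope b1."""
    sxx = sum_x2 - n * x_bar ** 2                      # sum of x^2 minus n * mean^2
    s_b0 = s_e * math.sqrt(1.0 / n + x_bar ** 2 / sxx)
    s_b1 = s_e / math.sqrt(sxx)
    ci_b0 = (b0 - t_quantile * s_b0, b0 + t_quantile * s_b0)
    ci_b1 = (b1 - t_quantile * s_b1, b1 + t_quantile * s_b1)
    return s_b0, s_b1, ci_b0, ci_b1

s_b0, s_b1, ci_b0, ci_b1 = parameter_intervals(
    b0=0.35, b1=0.29, s_e=0.13, n=5, sum_x2=264, x_bar=6.8, t_quantile=2.353)
print(s_b0, s_b1)
print(ci_b0, ci_b1)
```

A parameter is judged significant at the chosen level when its interval excludes zero, which holds here for the slope but not for the intercept.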

Page 12

Confidence Intervals for Predictions

• Previous confidence intervals are for parameters
  – How certain can we be that the parameters are correct?
• Purpose of regression is prediction
  – How accurate are the predictions?
  – Regression gives mean of predicted response, based on sample we took

Page 13

Predicting m Samples

• Standard deviation for mean of future sample of m observations at $x_p$ is

$$s_{\hat{y}_{mp}} = s_e \left[ \frac{1}{m} + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum x^2 - n\bar{x}^2} \right]^{1/2}$$

• Note deviation drops as $m \to \infty$
• Variance minimal at $x = \bar{x}$
• Use t-quantiles with n–2 DOF for interval

Page 14

Example of Confidence of Predictions

• Using previous equation, what is predicted time for a single run of 8 loops?
• Time = 0.35 + 0.29(8) = 2.67
• Standard deviation of errors $s_e = 0.13$

$$s_{\hat{y}_{1p}} = 0.13 \left[ 1 + \frac{1}{5} + \frac{(8 - 6.8)^2}{264 - 5(6.8)^2} \right]^{1/2} = 0.14$$

• 90% interval is then $2.67 \pm 2.353(0.14) = (2.34, 3.00)$
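A minimal Python sketch of the prediction interval, again using only the summary values from the example ($s_e = 0.13$, $n = 5$, $\sum x^2 = 264$, $\bar{x} = 6.8$). The function name `prediction_stdev` is an assumption for illustration.

```python
import math

def prediction_stdev(s_e, m, n, x_p, x_bar, sum_x2):
    """Std dev of the mean of m future observations at predictor value x_p."""
    sxx = sum_x2 - n * x_bar ** 2
    return s_e * math.sqrt(1.0 / m + 1.0 / n + (x_p - x_bar) ** 2 / sxx)

y_hat = 0.35 + 0.29 * 8                                   # predicted time for 8 loops
s_y = prediction_stdev(s_e=0.13, m=1, n=5, x_p=8, x_bar=6.8, sum_x2=264)
interval = (y_hat - 2.353 * s_y, y_hat + 2.353 * s_y)     # t[0.95; 3] = 2.353

# Averaging over more future runs (larger m) shrinks the deviation.
s_y_10 = prediction_stdev(s_e=0.13, m=10, n=5, x_p=8, x_bar=6.8, sum_x2=264)
print(y_hat, s_y, interval, s_y_10)
```

Because the code carries the unrounded deviation through, its interval ends may differ from the slide's in the second decimal place.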

Page 15

Other Regression Methods

• Multiple linear regression – more than one predictor variable
• Categorical predictors – some of the predictors aren’t quantitative but represent categories
• Curvilinear regression – nonlinear relationship
• Transformations – when errors not normally distributed or variance not constant
• Handling outliers
• Common mistakes in regression analysis

Page 16

Multiple Linear Regression

• Models with more than one predictor variable
• But each predictor variable has a linear relationship to the response variable
• Conceptually, plotting a regression line in n-dimensional space, instead of 2-dimensional

Page 17

Basic Multiple Linear Regression Formula

• Response y is a function of k predictor variables x1,x2, . . . , xk

y = b0 + b1x1 + b2x2 + . . . + bkxk + e

Page 18

A Multiple Linear Regression Model

Given sample of n observations

$$\{(x_{11}, x_{21}, \ldots, x_{k1}, y_1), \ldots, (x_{1n}, x_{2n}, \ldots, x_{kn}, y_n)\}$$

model consists of n equations (note typo in book):

$$\begin{aligned}
y_1 &= b_0 + b_1 x_{11} + b_2 x_{21} + \cdots + b_k x_{k1} + e_1 \\
y_2 &= b_0 + b_1 x_{12} + b_2 x_{22} + \cdots + b_k x_{k2} + e_2 \\
&\;\;\vdots \\
y_n &= b_0 + b_1 x_{1n} + b_2 x_{2n} + \cdots + b_k x_{kn} + e_n
\end{aligned}$$

Page 19

Looks Like It’s Matrix Arithmetic Time

y = Xb + e

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} =
\begin{bmatrix} 1 & x_{11} & x_{21} & \cdots & x_{k1} \\ 1 & x_{12} & x_{22} & \cdots & x_{k2} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{1n} & x_{2n} & \cdots & x_{kn} \end{bmatrix}
\begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{bmatrix} +
\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}$$

Page 20

Analysis of Multiple Linear Regression

• Formulas are listed in Box 15.1 of Jain
• Not terribly important (for our purposes) how they were derived
  – This isn’t a class on statistics
• But you need to know how to use them
• Mostly matrix analogs to simple linear regression results

Page 21

Example of Multiple Linear Regression

• Internet Movie Database keeps popularity ratings of movies (in numerical form)
• Postulate popularity of Academy Award winning films is based on two factors
  – Age
  – Running time
• Produce a regression

rating = b0 + b1(age) + b2(length)

Page 22

Some Sample Data

Title                    Age  Length  Rating
Silence of the Lambs       5     118     8.1
Terms of Endearment       13     132     6.8
Rocky                     20     119     7.0
Oliver!                   28     153     7.4
Marty                     41      91     7.7
Gentleman’s Agreement     49     118     7.5
Mutiny on the Bounty      61     132     7.6
It Happened One Night     62     105     8.0

Page 23

Now for Some Tedious Matrix Arithmetic

• We need to calculate $X$, $X^T$, $X^T X$, $(X^T X)^{-1}$, and $X^T y$
• Because $b = (X^T X)^{-1} X^T y$
• We will see that b = (8.373, .005, -.009)
• Meaning the regression predicts:

rating = 8.373 + 0.005*age – 0.009*length

Page 24

X Matrix for Example

$$X = \begin{bmatrix} 1 & 5 & 118 \\ 1 & 13 & 132 \\ 1 & 20 & 119 \\ 1 & 28 & 153 \\ 1 & 41 & 91 \\ 1 & 49 & 118 \\ 1 & 61 & 132 \\ 1 & 62 & 105 \end{bmatrix}$$

Page 25

Transpose to Get $X^T$

$$X^T = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 5 & 13 & 20 & 28 & 41 & 49 & 61 & 62 \\ 118 & 132 & 119 & 153 & 91 & 118 & 132 & 105 \end{bmatrix}$$

Page 26

Multiply To Get $X^T X$

$$X^T X = \begin{bmatrix} 8 & 279 & 968 \\ 279 & 13025 & 33045 \\ 968 & 33045 & 119572 \end{bmatrix}$$

Page 27

Invert to Get $(X^T X)^{-1}$

$$(X^T X)^{-1} = \begin{bmatrix} 7.7134 & -.0227 & -.0562 \\ -.0227 & .0003 & .0001 \\ -.0562 & .0001 & .0004 \end{bmatrix}$$

Page 28

Multiply to Get $X^T y$

$$X^T y = \begin{bmatrix} 60.1 \\ 2118.9 \\ 7247.5 \end{bmatrix}$$

Page 29

Multiply $(X^T X)^{-1}(X^T y)$ to Get b

$$b = \begin{bmatrix} 8.37 \\ .005 \\ -.009 \end{bmatrix}$$
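The matrix steps above can be reproduced with a small pure-Python sketch. Instead of forming $(X^T X)^{-1}$ explicitly, it solves the normal equations $(X^T X)\,b = X^T y$ by Gauss-Jordan elimination, a substitution made here for brevity. The helper names `solve` and `fit` are hypothetical; the data are the eight films from the sample table.

```python
def solve(a, rhs):
    """Solve a square linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [r] for row, r in zip(a, rhs)]     # augmented matrix [A | rhs]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit(rows, y):
    """Least-squares b from the normal equations (X^T X) b = X^T y."""
    x = [[1.0] + list(r) for r in rows]                # leading 1 for the intercept
    k = len(x[0])
    xtx = [[sum(xi[i] * xi[j] for xi in x) for j in range(k)] for i in range(k)]
    xty = [sum(xi[i] * yi for xi, yi in zip(x, y)) for i in range(k)]
    return solve(xtx, xty)

# (age, length) and rating for the eight films in the sample data.
films = [(5, 118), (13, 132), (20, 119), (28, 153),
         (41, 91), (49, 118), (61, 132), (62, 105)]
ratings = [8.1, 6.8, 7.0, 7.4, 7.7, 7.5, 7.6, 8.0]
b = fit(films, ratings)
print(b)   # intercept, age coefficient, length coefficient
```

Solving the system directly avoids the explicit inverse; the inverse is still needed later for the diagonal terms used in the parameter standard deviations.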

Page 30

How Good Is This Regression Model?

• How accurately does the model predict the rating of a film based on its age and running time?
• Best way to determine this analytically is to calculate the errors

$$SSE = \sum e_i^2$$

or

$$SSE = y^T y - b^T X^T y$$

Page 31

Calculating the Errors

                       Estimated
Rating   Age   Length     Rating     e_i   e_i^2
   8.1     5      118        7.4   -0.71    0.51
   6.8    13      132        7.3    0.51    0.26
   7.0    20      119        7.4    0.45    0.21
   7.4    28      153        7.2   -0.20    0.04
   7.7    41       91        7.8    0.10    0.01
   7.5    49      118        7.6    0.11    0.01
   7.6    61      132        7.5   -0.05    0.00
   8.0    62      105        7.8   -0.21    0.04

Page 32

Calculating the Errors, Continued

• So SSE = 1.08
• SSY = $\sum y_i^2 = 452.9$
• SS0 = $n\bar{y}^2 = 451.5$
• SST = SSY - SS0 = 452.9 - 451.5 = 1.4
• SSR = SST - SSE = .33
• $R^2 = \frac{SSR}{SST} = \frac{.33}{1.41} = .23$
• In other words, this regression stinks

Page 33

Why Does It Stink?

• Let’s look at the properties of the regression parameters

$$s_e = \sqrt{\frac{SSE}{n - 3}} = \sqrt{\frac{1.08}{5}} = .46$$

• Now calculate standard deviations of the regression parameters

Page 34

Calculating STDEV of Regression Parameters

• Estimations only, since we’re working with a sample
• Estimated stdev of each parameter, where $c_{jj}$ is the j-th diagonal element of $(X^T X)^{-1}$:

$$s_{b_0} = s_e \sqrt{c_{00}} = .465\sqrt{7.71} = 1.29$$

$$s_{b_1} = s_e \sqrt{c_{11}} = .465\sqrt{.00032} \approx .008$$

$$s_{b_2} = s_e \sqrt{c_{22}} = .465\sqrt{.00044} \approx .010$$

Page 35

Calculating Confidence Intervals

• At the 90% level, for instance
• Confidence intervals for

b0 = 8.37 ± (2.015)(1.29) = (5.77, 10.97)

b1 = .005 ± (2.015)(.008) = (-.01, .02)

b2 = -.009 ± (2.015)(.010) = (-.03, .01)

• Only b0 is significant, at this level
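The error sums, $R^2$, $s_e$, and the parameter standard deviations can be recomputed from the sample data with the sketch below. For two predictors it uses a closed-form solution on centered sums rather than the full matrix inverse (an implementation shortcut, not the slides' procedure). Here `sll / det` and `saa / det` play the roles of $c_{11}$ (age) and $c_{22}$ (length).

```python
import math

ages    = [5, 13, 20, 28, 41, 49, 61, 62]
lengths = [118, 132, 119, 153, 91, 118, 132, 105]
ratings = [8.1, 6.8, 7.0, 7.4, 7.7, 7.5, 7.6, 8.0]
n = len(ratings)

def centered(u, v):
    """Sum of (u - mean(u)) * (v - mean(v))."""
    mu, mv = sum(u) / n, sum(v) / n
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))

# Two-predictor least squares in closed form (Cramer's rule on centered sums).
saa, sll, sal = centered(ages, ages), centered(lengths, lengths), centered(ages, lengths)
say, sly = centered(ages, ratings), centered(lengths, ratings)
det = saa * sll - sal ** 2
b1 = (sll * say - sal * sly) / det            # age coefficient
b2 = (saa * sly - sal * say) / det            # length coefficient
b0 = sum(ratings) / n - b1 * sum(ages) / n - b2 * sum(lengths) / n

sse = sum((rating - (b0 + b1 * age + b2 * length)) ** 2
          for age, length, rating in zip(ages, lengths, ratings))
sst = centered(ratings, ratings)              # SSY - SS0
r2 = (sst - sse) / sst                        # SSR / SST
s_e = math.sqrt(sse / (n - 3))                # three parameters estimated
s_b1 = s_e * math.sqrt(sll / det)             # stdev of the age coefficient
s_b2 = s_e * math.sqrt(saa / det)             # stdev of the length coefficient
print(sse, r2, s_e, s_b1, s_b2)
```

Multiplying each standard deviation by the t-quantile 2.015 (5 degrees of freedom, 90% level) gives the half-widths of the intervals for b1 and b2.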

Page 36

Analysis of Variance

• So, can we really say that none of the predictor variables are significant?
  – Not yet; predictors may be correlated
• F-test can be used for this purpose
  – E.g., to determine if the SSR is significantly higher than the SSE
  – Equivalent to testing that y does not depend on any of the predictor variables

Page 37

Running an F-Test

• Need to calculate SSR and SSE
• From those, calculate mean squares of the regression (MSR) and the errors (MSE)
• MSR/MSE has an F distribution
• If MSR/MSE > F-table, predictors explain a significant fraction of response variation
• Note typos in book’s table 15.3
  – SSR has k degrees of freedom
  – SST matches $y - \bar{y}$

Page 38

F-Test for Our Example

• SSR = .33
• SSE = 1.08
• MSR = SSR/k = .33/2 = .16
• MSE = SSE/(n-k-1) = 1.08/(8 - 2 - 1) = .22
• F-computed = MSR/MSE = .76
• F[90; 2,5] = 3.78 (at 90%)
• So it fails the F-test at 90% (miserably)
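The F-test arithmetic is short enough to sketch directly. The function name is hypothetical, and the table value 3.78 is the one quoted on the slide rather than computed.

```python
def f_statistic(ssr, sse, n, k):
    """Ratio of mean squares MSR/MSE for a regression with k predictors."""
    msr = ssr / k                 # regression: k degrees of freedom
    mse = sse / (n - k - 1)       # errors: n - k - 1 degrees of freedom
    return msr / mse

f_computed = f_statistic(ssr=0.33, sse=1.08, n=8, k=2)
f_table = 3.78                    # F[0.90; 2, 5], taken from the slide
significant = f_computed > f_table
print(f_computed, significant)
```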

Page 39

Multicollinearity

• If two predictor variables are linearly dependent, they are collinear
  – Meaning they are related
  – And thus the second variable does not improve the regression
  – In fact, it can make it worse
• Typical symptom is inconsistent results from various significance tests

Page 40

Finding Multicollinearity

• Must test correlation between predictor variables

• If it’s high, eliminate one and repeat the regression without it

• If the significance of regression improves, it’s probably due to collinearity between the variables

Page 41

Is Multicollinearity a Problem in Our Example?

• Probably not, since the significance tests are consistent
• But let’s check, anyway
• Calculate correlation of age and length
• After tedious calculation, -.25
  – Not especially correlated
• Important point - adding a predictor variable does not always improve a regression
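The "tedious calculation" is the sample correlation coefficient between the two predictors. A minimal sketch on the example's age and length columns (the function name is an assumption):

```python
import math

def correlation(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    svv = sum((b - v_bar) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

ages    = [5, 13, 20, 28, 41, 49, 61, 62]
lengths = [118, 132, 119, 153, 91, 118, 132, 105]
r = correlation(ages, lengths)
print(r)
```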

Page 42

Why Didn’t Regression Work Well Here?

• Check the scatter plots
  – Rating vs. age
  – Rating vs. length
• Regardless of how good or bad regressions look, always check the scatter plots

Page 43

Rating vs. Length

[Scatter plot: Rating (vertical axis, 6 to 9) against Length (horizontal axis, 80 to 160)]

Page 44

Rating vs. Age

[Scatter plot: Rating (vertical axis, 6 to 9) against Age (horizontal axis, 0 to 80)]

Page 45

Regression With Categorical Predictors

• Regression methods discussed so far assume numerical variables

• What if some of your variables are categorical in nature?

• Use techniques discussed later in the class if all predictors are categorical

• Levels - number of values a category can take

Page 46

Handling Categorical Predictors

• If only two levels, define $x_i$ as follows
  – $x_i = 0$ for first value
  – $x_i = 1$ for second value
• This definition is missing from book in section 15.2
• Can use +1 and -1 as values, instead
• Need k-1 predictor variables for k levels
  – To avoid implying order in categories
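One way to picture the k-1 rule is the following sketch, which encodes a category with k levels as k-1 zero/one predictors. The function name `dummy_code` and the storage-device levels are hypothetical, chosen only for illustration.

```python
def dummy_code(value, levels):
    """Encode a category as len(levels) - 1 zero/one predictors.

    The first level is the baseline and maps to all zeros, so no numeric
    ordering is implied among the levels.
    """
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels[1:]]

levels = ["disk", "ssd", "nvme"]          # hypothetical category with k = 3 levels
encoded = [dummy_code(v, levels) for v in ["disk", "ssd", "nvme"]]
print(encoded)
```

Coding the same category as a single variable with values 1, 2, 3 would instead force the regression to assume equally spaced, ordered levels.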

Page 47

Categorical Variables Example

• Which is a better predictor of a high rating in the movie database: winning an Oscar, winning the Golden Palm at Cannes, or winning the New York Critics Circle?

Page 48

Choosing Variables

• Categories are not mutually exclusive
• x1 = 1 if Oscar, 0 otherwise
• x2 = 1 if Golden Palm, 0 otherwise
• x3 = 1 if Critics Circle Award, 0 otherwise
• y = b0 + b1 x1 + b2 x2 + b3 x3

Page 49

A Few Data Points

Title                   Rating  Oscar  Palm  NYC
Gentleman’s Agreement      7.5    X            X
Mutiny on the Bounty       7.6    X
Marty                      7.4    X      X     X
If                         7.8           X
La Dolce Vita              8.1           X
Kagemusha                  8.2           X
The Defiant Ones           7.5                 X
Reds                       6.6                 X
High Noon                  8.1                 X

Page 50

And Regression Says . . .

• $\hat{y} = 7.8 - .1x_1 + .2x_2 - .4x_3$
• How good is that?
• $R^2$ is 34% of the variation
  – Better than age and length
  – But still no great shakes
• Are the regression parameters significant at the 90% level?

Page 51

Curvilinear Regression

• Linear regression assumes a linear relationship between predictor and response
• What if it isn’t linear?
• You need to fit some other type of function to the relationship

Page 52

When To Use Curvilinear Regression

• Easiest to tell by sight
• Make a scatter plot
  – If plot looks non-linear, try curvilinear regression
• Or if non-linear relationship is suspected for other reasons
• Relationship should be convertible to a linear form

Page 53

Types of Curvilinear Regression

• Many possible types, based on a variety of relationships:

$$y = a + \frac{b}{x}$$

$$y = ab^x$$

$$y = bx^a$$

• Many others

Page 54

Transform Them to Linear Forms

• Apply logarithms, multiplication, division, whatever to produce something in linear form
• I.e., y = a + b*something
• Or a similar form
• If predictor appears in more than one transformed predictor variable, correlation likely
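As one concrete case, an exponential relationship $y = ab^x$ becomes linear after taking logarithms: $\ln y = \ln a + x \ln b$. The sketch below fits a line to $(x, \ln y)$ and transforms the parameters back. The data are synthetic, generated with a = 2 and b = 1.5 purely for illustration.

```python
import math

def linear_fit(xs, ys):
    """Simple least-squares line ys = intercept + slope * xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / \
            (sum(x * x for x in xs) - n * x_bar ** 2)
    return y_bar - slope * x_bar, slope

# Synthetic data following y = a * b**x with a = 2 and b = 1.5.
xs = [1, 2, 3, 4, 5]
ys = [2 * 1.5 ** x for x in xs]

# ln y = ln a + x ln b is linear in x, so fit a line to (x, ln y).
ln_a, ln_b = linear_fit(xs, [math.log(y) for y in ys])
a, b = math.exp(ln_a), math.exp(ln_b)
print(a, b)
```

With real measurements the fit minimizes squared error in $\ln y$, not in $y$, so the residuals should be checked on the transformed scale.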

Page 55

Transformations

• Using some function of the response variable y in place of y itself

• Curvilinear regression is one example of transformation

• But techniques are more generally applicable

Page 56

When To Transform?

• If known properties of the measured system suggest it

• If the data’s range covers several orders of magnitude

• If the homogeneous variance assumption of the residuals is violated

Page 57

Transforming Due To Homoscedasticity

• If spread of scatter plot of residual vs. predicted response is not homogeneous,

• Then residuals are still functions of the predictor variables

• Transformation of response may solve the problem

Page 58

What Transformation To Use?

• Compute standard deviation of the residuals
• Plot as function of the mean of the observations
  – Assuming multiple experiments for single set of predictor values
• Check for linearity - if it is, use a log transform

Page 59

Other Tests for Transformations

• If variance against mean of observations is linear, use square root transform

• If standard deviation against mean squared is linear, use inverse transform

• If standard deviation against mean to a power is linear, use a power transform

• More covered in the book

Page 60

General Transformation Principle

For some observed function

$$s = g(\bar{y})$$

if

$$h(y) = \int \frac{1}{g(y)}\, dy$$

transform to

$$w = h(y)$$

Page 61

For Example,

• A log transformation:
• If the standard deviation against the mean is linear, then $g(y) = ay$

$$\text{So } h(y) = \int \frac{1}{ay}\, dy = \frac{1}{a} \ln y$$
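The same principle also recovers the square-root rule for the case where variance against the mean is linear. As a sketch, assuming the variance is proportional to the mean with an illustrative constant $a$:

$$s^2 = a\bar{y} \;\Rightarrow\; g(y) = \sqrt{ay}, \qquad h(y) = \int \frac{1}{\sqrt{ay}}\, dy = \frac{2}{\sqrt{a}} \sqrt{y}$$

so $w = \sqrt{y}$ up to a constant factor, and a constant factor only rescales the transformed response.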

Page 62

Outliers

• Atypical observations might be outliers
  – Measurements that are not truly characteristic
  – By chance, several standard deviations out
  – Or mistakes might have been made in measurement
• Which leads to a problem:

Do you include outliers in analysis or not?

Page 63

Deciding How To Handle Outliers

1. Find them (by looking at scatter plot)
2. Check carefully for experimental error
3. Repeat experiments at predictor values for the outlier
4. Decide whether to include or not include outliers
   – Or do analysis both ways

Question: Is the first point in the example an outlier on the rating vs. age plot?

Page 64

Common Mistakes in Regression

• Generally based on taking shortcuts
• Or not being careful
• Or not understanding some fundamental principles of statistics

Page 65

Not Verifying Linearity

• Draw the scatter plot
• If it isn’t linear, check for curvilinear possibilities
• Using linear regression when the relationship isn’t linear is misleading

Page 66

Relying on Results Without Visual Verification

• Always check the scatter plot as part of regression
  – Examining the line regression predicts vs. the actual points
• Particularly important if regression is done automatically

Page 67

Attaching Importance To Values of Parameters

• Numerical values of regression parameters depend on scale of predictor variables
• So a parameter’s value seeming “small” or “large” is not necessarily an indication of importance
• E.g., converting seconds to microseconds doesn’t change anything fundamental
  – But magnitude of associated parameter changes
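The seconds-versus-microseconds point can be demonstrated with a small sketch on hypothetical measurements. Rescaling the predictor changes the slope by the same factor while leaving the quality of the fit ($R^2$) untouched.

```python
def fit_line(xs, ys):
    """Return (slope, r_squared) of the least-squares line through the points."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sxx, sxy ** 2 / (sxx * syy)

# Hypothetical predictor in seconds and an arbitrary response.
seconds = [1.0, 2.0, 3.0, 4.0]
response = [2.1, 3.9, 6.2, 7.8]
micros = [s * 1e6 for s in seconds]           # same predictor in microseconds

slope_s, r2_s = fit_line(seconds, response)
slope_us, r2_us = fit_line(micros, response)
print(slope_s, slope_us, r2_s, r2_us)
```

The microsecond slope is a million times smaller, yet both regressions describe the same relationship equally well.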

Page 68

Not Specifying Confidence Intervals

• Samples of observations are random
• Thus, regression performed on them yields parameters with random properties
• Without a confidence interval, it’s impossible to understand what a parameter really means

Page 69

Not Calculating Coefficient of Determination

• Without $R^2$, difficult to determine how much of variance is explained by the regression
• Even if $R^2$ looks good, safest to also perform an F-test
• The extra amount of effort isn’t that large, anyway

Page 70

Using Coefficient of Correlation Improperly

• Coefficient of determination is $R^2$
• Coefficient of correlation is R
• $R^2$ gives percentage of variance explained by regression, not R
• E.g., if R is .5, $R^2$ is .25
  – And the regression explains 25% of variance
  – Not 50%

Page 71

Using Highly Correlated Predictor Variables

• If two predictor variables are highly correlated, using both degrades regression
• E.g., likely to be a correlation between an executable’s on-disk size and in-core size
  – So don’t use both as predictors of run time
• Which means you need to understand your predictor variables as well as possible

Page 72

Using Regression Beyond Range of Observations

• Regression is based on observed behavior in a particular sample
• Most likely to predict accurately within range of that sample
  – Far outside the range, who knows?
• E.g., a regression on run time of executables that are smaller than size of main memory may not predict performance of executables that require much VM activity

Page 73

Using Too Many Predictor Variables

• Adding more predictors does not necessarily improve the model
• More likely to run into multicollinearity problems
• So what variables to choose?
  – Subject of much of this course

Page 74

Measuring Too Little of the Range

• Regression only predicts well near range of observations

• If you don’t measure the commonly used range, regression won’t predict much

• E.g., if many programs are bigger than main memory, only measuring those that are smaller is a mistake

Page 75

Assuming Good Predictor Is a Good Controller

• Correlation isn’t necessarily control
• Just because variable A is related to variable B doesn’t mean you can control values of B by varying A
• E.g., if number of hits on a Web page and server bandwidth are correlated, you might not increase hits by increasing bandwidth
• Often, a goal of regression is finding control variables

Page 76

For Discussion Today

Project Proposal
1. Statement of hypothesis
2. Workload decisions
3. Metrics to be used
4. Method