
Class 16: Thursday, Nov. 4

• Note: I will e-mail you some info on the final project this weekend and will discuss in class on Tuesday.

Predicting Emergency Calls to the AAA Club

Response Calls

Summary of Fit
RSquare                     0.692384
RSquare Adj                 0.584719
Root Mean Square Error      1735.151
Mean of Response            4318.75
Observations (or Sum Wgts)  28

Parameter Estimates
Term                 Estimate   Std Error  t Ratio  Prob>|t|
Intercept            3628.7902  2153.788   1.68     0.1076
Average Temperature  -35.63182  51.52383   -0.69    0.4972
Range                133.30434  50.85675   2.62     0.0164
Rain forecast        429.70588  1211.933   0.35     0.7266
Snow forecast        548.80038  1342.27    0.41     0.6870
Weekday              -1603.1    876.7378   -1.83    0.0824
Sunday               -1847.152  1212.612   -1.52    0.1433
Subzero              3857.6004  1489.803   2.59     0.0175
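For readers working outside JMP, here is a minimal sketch of the same fit in Python with statsmodels; the file name aaa_calls.csv and the underscore column names are hypothetical stand-ins for the class data set.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file and column names standing in for the class JMP data set.
df = pd.read_csv("aaa_calls.csv")

model = smf.ols(
    "Calls ~ Average_Temperature + Range + Rain_forecast + Snow_forecast"
    " + Weekday + Sunday + Subzero",
    data=df,
).fit()
print(model.summary())  # analogous to JMP's Summary of Fit and Parameter Estimates
```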

R-Squared

• R-squared: as in simple linear regression, it measures the proportion of variability in Y explained by the regression of Y on these X’s. It is between 0 and 1; values nearer to 1 indicate more variability explained.

• Don’t get excited when R-squared increases after you add more variables to the model: adding another explanatory variable will always increase R-squared. The right question to ask is not whether R-squared has increased, but whether it has increased by a useful amount. The t-statistic and the associated p-value for the t-test of each coefficient answer this question.

Summary of Fit
RSquare                     0.692384
RSquare Adj                 0.584719
Root Mean Square Error      1735.151
Mean of Response            4318.75
Observations (or Sum Wgts)  28
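The "RSquare Adj" entry is the standard adjusted R-squared, which penalizes R-squared for the number of explanatory variables. With $n = 28$ observations and $p = 7$ explanatory variables it reproduces the value in the table:

$$R^2_{\text{adj}} = 1 - (1 - R^2)\frac{n-1}{n-p-1} = 1 - (1 - 0.692384)\frac{27}{20} \approx 0.5847$$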

Overall F-test

• Test of whether any of the predictors are useful: $H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$ vs. $H_a$: at least one of $\beta_1, \ldots, \beta_p$ does not equal zero. Tests whether the model provides better predictions than the sample mean of Y.

• p-value for the test: Prob > F in the Analysis of Variance table.
• p-value = 0.0005: strong evidence that at least one of the predictors is useful for predicting ERS calls for the New York AAA club.

Analysis of Variance
Source    DF  Sum of Squares  Mean Square  F Ratio  Prob > F
Model     7   135532366       19361767     6.4309   0.0005
Error     20  60214949        3010747.4
C. Total  27  195747315

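The F ratio and its p-value can be reproduced directly from the Analysis of Variance table; a quick check in Python (scipy assumed available):

```python
from scipy import stats

# Values from the Analysis of Variance table above
ss_model, df_model = 135532366, 7
ss_error, df_error = 60214949, 20

ms_model = ss_model / df_model                     # 19361766.6 (Mean Square, Model)
ms_error = ss_error / df_error                     # 3010747.45 (Mean Square, Error)
f_ratio = ms_model / ms_error                      # 6.4309, the F Ratio above
p_value = stats.f.sf(f_ratio, df_model, df_error)  # about 0.0005 (Prob > F)
print(f_ratio, p_value)
```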

Assumptions of Multiple Linear Regression Model

1. Linearity: $E(Y \mid X_1 = x_1, \ldots, X_p = x_p) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$.

2. Constant variance: the standard deviation of Y for the subpopulation of units with $X_1 = x_1, \ldots, X_p = x_p$ is the same for all subpopulations.

3. Normality: the distribution of Y for the subpopulation of units with $X_1 = x_1, \ldots, X_p = x_p$ is normally distributed for all subpopulations.

4. The observations are independent.

Assumptions for linear regression and their importance to inferences

• Point prediction, point estimation: linearity, independence.
• Confidence interval for slope, hypothesis test for slope, confidence interval for mean response: linearity, constant variance, independence, normality (only if n < 30).
• Prediction interval: linearity, constant variance, independence, normality.

Checking Linearity

• Plot residuals versus each of the explanatory variables. Each of these plots should look like random scatter, with no pattern in the mean of the residuals.

[Figure: Bivariate Fit of Residual Calls By Average Temperature (Residual Calls vs. Average Temperature)]

[Figure: Bivariate Fit of Residual Calls By Range (Residual Calls vs. Range)]

If the residual plots show a problem, then we could try to transform the x-variable and/or the y-variable.

Residual Plots in JMP

• After Fit Model, click red triangle next to Response, click Save Columns and click Residuals.

• Use Fit Y by X with Y=Residuals and X the explanatory variable of interest. Fit Line will draw a horizontal line with intercept zero. It is a property of the residuals from multiple linear regression that a least squares regression of the residuals on an explanatory variable has slope zero and intercept zero.
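Outside JMP, the same residual plots can be sketched in Python, continuing from the hypothetical statsmodels fit earlier:

```python
import matplotlib.pyplot as plt

# 'model' and 'df' are from the earlier hypothetical statsmodels fit.
df["Residuals"] = model.resid

for x in ["Average_Temperature", "Range"]:
    plt.figure()
    plt.scatter(df[x], df["Residuals"])
    plt.axhline(0)  # residuals regress on each included x with slope 0, intercept 0
    plt.xlabel(x)
    plt.ylabel("Residual Calls")
plt.show()
```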

Residual by Predicted Plot

• Fit Model displays the Residual by Predicted Plot automatically in its output.
• The plot is a plot of the residuals versus the predicted Y’s, $\hat{Y}_i = \hat{E}(Y \mid X_1 = x_{i1}, \ldots, X_p = x_{ip})$. We can think of the predicted Y’s as summarizing all the information in the X’s. As usual, we would like this plot to show random scatter.
• Pattern in the mean of the residuals as the predicted Y’s increase: indicates a problem with linearity. Look at the residual plots versus each explanatory variable to isolate the problem, and consider transformations.
• Pattern in the spread of the residuals: indicates a problem with constant variance.
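A sketch of the same plot in Python, again using the hypothetical fit from earlier:

```python
import matplotlib.pyplot as plt

# 'model' is the hypothetical statsmodels fit from earlier.
plt.scatter(model.fittedvalues, model.resid)  # predicted Y's vs. residuals
plt.axhline(0)
plt.xlabel("Calls Predicted")
plt.ylabel("Calls Residual")
plt.show()
```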

[Figure: Residual by Predicted Plot (Calls Residual vs. Calls Predicted)]

Checking Normality

• As with simple linear regression, make histogram of residuals and normal quantile plot of residuals.
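A sketch of both checks in Python, continuing with the hypothetical model from earlier:

```python
import matplotlib.pyplot as plt
from scipy import stats

resid = model.resid  # residuals from the hypothetical fit earlier

plt.hist(resid, bins=10)  # histogram of residuals
plt.show()

stats.probplot(resid, dist="norm", plot=plt)  # normal quantile (Q-Q) plot
plt.show()
```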

[Figure: Distribution of Residual Calls (histogram with normal quantile plot)]

Normality appears to be violated: several points are outside the confidence bands. The distribution of the residuals is skewed to the right.

Transformations to Remedy Constant Variance and Normality

Nonconstant Variance
• When the variance of $Y \mid X_1, \ldots, X_p$ increases with $E(Y \mid X_1, \ldots, X_p)$, try transforming Y to log Y or Y to $\sqrt{Y}$.
• When the variance of $Y \mid X_1, \ldots, X_p$ decreases with $E(Y \mid X_1, \ldots, X_p)$, try transforming Y to 1/Y or Y to $Y^2$.

Nonnormality
• When the distribution of the residuals is skewed to the right, try transforming Y to log Y.
• When the distribution of the residuals is skewed to the left, try transforming Y to $Y^2$.
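Since the residuals here are skewed to the right, the suggested remedy is a log transformation of Y; a sketch of the refit in Python, using the same hypothetical names as before:

```python
import numpy as np
import statsmodels.formula.api as smf

# Refit with log(Calls) as the response (hypothetical column names as before).
log_model = smf.ols(
    "np.log(Calls) ~ Average_Temperature + Range + Rain_forecast"
    " + Snow_forecast + Weekday + Sunday + Subzero",
    data=df,
).fit()
print(log_model.summary())
```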

Influential Points, High Leverage Points, Outliers

• As in simple linear regression, we identify high leverage and high influence points by checking the leverages and Cook’s distances (use Save Columns to save Cook’s D Influence and Hats).
• High influence points: Cook’s distance > 1.
• High leverage points: a point with a hat value greater than (3 × (# of explanatory variables + 1))/n has high leverage.
• Use the same guidelines for dealing with influential observations as in simple linear regression.
• A point that has an unusual Y given its explanatory variables: a point with a residual that is more than 3 RMSEs away from zero.
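These diagnostics can also be computed outside JMP; a sketch using statsmodels influence measures with the cutoffs above (same hypothetical model as earlier):

```python
import numpy as np

influence = model.get_influence()      # statsmodels influence diagnostics
cooks_d = influence.cooks_distance[0]  # Cook's distances
hats = influence.hat_matrix_diag       # leverages (hat values)

n, k = len(df), 7                      # k = number of explanatory variables
high_influence = cooks_d > 1
high_leverage = hats > 3 * (k + 1) / n
rmse = np.sqrt(model.mse_resid)
outliers = np.abs(model.resid) > 3 * rmse  # unusual Y given its X's
```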

Scatterplot Matrix

• Before fitting a multiple linear regression model, it is a good idea to make scatterplots of the response variable versus each explanatory variable. These can suggest transformations of the explanatory variables that need to be done, as well as potential outliers and influential points.
• Scatterplot matrix in JMP: click Analyze, Multivariate Methods, Multivariate, and then put the response variable first in the Y, Columns box, followed by the explanatory variables.
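An equivalent sketch in Python with pandas, response listed first as described above (column names hypothetical):

```python
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Response first, then the explanatory variables (hypothetical column names).
cols = ["Calls", "Average_Temperature", "Range", "Rain_forecast",
        "Snow_forecast", "Weekday", "Sunday", "Subzero"]
scatter_matrix(df[cols], figsize=(10, 10))
plt.show()
```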

[Figure: Scatterplot matrix of Calls, Average Temperature, Range, Rain forecast, Snow forecast, Weekday, Sunday, and Subzero]

• In order to evaluate the benefits of a proposed irrigation scheme in Egypt, the relationship of wheat yield Y to rainfall is investigated over several years (see rainfall.JMP).

• How can regression analysis help?

Year  Yield (Bu./Acre), Y  Total Spring Rainfall, R  Average Spring Temperature, T
1963  60                   8                         56
1964  50                   10                        47
1965  70                   11                        53
1966  70                   10                        53
1967  80                   9                         56
1968  50                   9                         47
1969  60                   12                        44
1970  40                   11                        44

Simple Linear Regression of Yield on Rainfall

• Bivariate Fit of Yield By Total Spring Rainfall

[Figure: Yield vs. Total Spring Rainfall, with linear fit]

Linear Fit: Yield = 76.666667 - 1.6666667 × Total Spring Rainfall

Rainfall reduces yield!? Is irrigation a bad idea?

• Interpretation of the coefficient of rainfall: the change in mean yield that is associated with a one-inch increase in rainfall. Other important variables (lurking variables) are not held fixed and might tend to change as rainfall increases.


[Figure: Bivariate Fit of Average Spring Temperature By Total Spring Rainfall]

Temperature tends to decrease as rainfall increases.

Controlling for Known Lurking Variables: Multiple Regression

• To evaluate the benefits of the irrigation scheme, we want to know how changes in rainfall are associated with changes in yield when all other important variables (lurking variables), such as temperature, are held fixed.

• Multiple regression provides this.
• The coefficient on rainfall in the multiple regression of yield on rainfall and temperature = the change in mean yield that is associated with a one-inch increase in rainfall when temperature is held fixed.

Multiple Regression Analysis

• Rainfall is estimated to be beneficial once temperature is held fixed.

Response Yield

Summary of Fit
RSquare                     0.790476
RSquare Adj                 0.706667
Root Mean Square Error      7.091242
Mean of Response            60
Observations (or Sum Wgts)  8

Parameter Estimates
Term                        Estimate   Std Error  t Ratio  Prob>|t|
Intercept                   -144.7619  55.8499    -2.59    0.0487
Total Spring Rainfall       5.7142857  2.680238   2.13     0.0862
Average Spring Temperature  2.952381   0.692034   4.27     0.0080
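Because the rainfall data are printed in full above, both fits can be reproduced outside JMP; a minimal sketch in Python with statsmodels (variable names shortened):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Data from the rainfall table above (rainfall.JMP)
df = pd.DataFrame({
    "Yield":    [60, 50, 70, 70, 80, 50, 60, 40],
    "Rainfall": [8, 10, 11, 10, 9, 9, 12, 11],
    "Temp":     [56, 47, 53, 53, 56, 47, 44, 44],
})

simple = smf.ols("Yield ~ Rainfall", data=df).fit()
print(simple.params)    # Yield = 76.67 - 1.67 * Rainfall: rainfall looks harmful

multiple = smf.ols("Yield ~ Rainfall + Temp", data=df).fit()
print(multiple.params)  # rainfall coefficient becomes +5.71 with temperature held fixed
```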