TRANSCRIPT
Class 16: Thursday, Nov. 4
• Note: I will e-mail you some info on the final project this weekend and will discuss in class on Tuesday.
Predicting Emergency Calls to the AAA Club
Response Calls

Summary of Fit
  RSquare                      0.692384
  RSquare Adj                  0.584719
  Root Mean Square Error       1735.151
  Mean of Response             4318.75
  Observations (or Sum Wgts)   28

Parameter Estimates
  Term                  Estimate    Std Error   t Ratio   Prob>|t|
  Intercept             3628.7902   2153.788     1.68     0.1076
  Average Temperature   -35.63182   51.52383    -0.69     0.4972
  Range                 133.30434   50.85675     2.62     0.0164
  Rain forecast         429.70588   1211.933     0.35     0.7266
  Snow forecast         548.80038   1342.27      0.41     0.6870
  Weekday               -1603.1     876.7378    -1.83     0.0824
  Sunday                -1847.152   1212.612    -1.52     0.1433
  Subzero               3857.6004   1489.803     2.59     0.0175
R-Squared
• R-squared: As in simple linear regression, measures the proportion of variability in Y explained by the regression of Y on these X’s. It is between 0 and 1; values nearer to 1 indicate more variability explained.
• Don’t get excited just because R-squared increases when you add more variables to the model: adding another explanatory variable will always increase R-squared. The right question is not whether R-squared has increased, but whether it has increased by a useful amount. The t-statistic and the associated p-value for the t-test of each coefficient answer this question.
Summary of Fit
  RSquare                      0.692384
  RSquare Adj                  0.584719
  Root Mean Square Error       1735.151
  Mean of Response             4318.75
  Observations (or Sum Wgts)   28
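A quick numerical sketch of the point above, using simulated data (not the AAA data): R-squared never decreases when a predictor is added, even a pure-noise one, while adjusted R-squared penalizes useless additions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 28
x1 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 + rng.normal(size=n)
noise = rng.normal(size=n)  # an irrelevant, pure-noise predictor

def r_squared(predictors, y):
    """R^2 and adjusted R^2 from a least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sse = resid @ resid
    sst = ((y - y.mean()) ** 2).sum()
    r2 = 1 - sse / sst
    k = X.shape[1] - 1  # number of predictors
    adj = 1 - (1 - r2) * (len(y) - 1) / (len(y) - k - 1)
    return r2, adj

r2_small, adj_small = r_squared([x1], y)
r2_big, adj_big = r_squared([x1, noise], y)
print(r2_small, r2_big)  # r2_big >= r2_small, always
```

The inequality holds for any data set, because the larger model can always reproduce the smaller model's fit by setting the extra coefficient to zero.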
Overall F-test
• Test of whether any of the predictors are useful: H0: β1 = β2 = … = βp = 0 vs. Ha: at least one of β1, …, βp does not equal zero. Tests whether the model provides better predictions than the sample mean of Y.
• p-value for the test: Prob>F in the Analysis of Variance table.
• p-value = 0.0005, strong evidence that at least one of the predictors is useful for predicting ERS calls for the New York AAA club.

Analysis of Variance
  Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
  Model       7   135532366        19361767      6.4309    0.0005
  Error      20   60214949         3010747.4
  C. Total   27   195747315
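The F ratio and its p-value can be recomputed by hand from the sums of squares in the ANOVA table above (numbers taken from the JMP output; scipy supplies the F-distribution tail probability):

```python
from scipy import stats

ss_model, df_model = 135532366, 7
ss_error, df_error = 60214949, 20

ms_model = ss_model / df_model   # mean square for the model
ms_error = ss_error / df_error   # mean square error
f_ratio = ms_model / ms_error    # about 6.43, matching the table
p_value = stats.f.sf(f_ratio, df_model, df_error)
print(round(f_ratio, 4), p_value)  # small p-value: reject H0
```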
Assumptions of Multiple Linear Regression Model
1. Linearity: E(Y | X1 = x1, …, Xp = xp) = β0 + β1x1 + … + βpxp.
2. Constant variance: The standard deviation of Y for the subpopulation of units with X1 = x1, …, Xp = xp is the same for all subpopulations.
3. Normality: The distribution of Y for the subpopulation of units with X1 = x1, …, Xp = xp is normally distributed for all subpopulations.
4. The observations are independent.
Assumptions for linear regression and their importance to inferences
• Point prediction, point estimation: linearity, independence.
• Confidence interval for slope, hypothesis test for slope, confidence interval for mean response: linearity, constant variance, independence, normality (only if n < 30).
• Prediction interval: linearity, constant variance, independence, normality.
Checking Linearity
• Plot residuals versus each of the explanatory variables. Each of these plots should look like random scatter, with no pattern in the mean of the residuals.
[Figure: Bivariate Fit of Residual Calls By Average Temperature — Residual Calls (-3000 to 4000) vs. Average Temperature (0 to 50)]

[Figure: Bivariate Fit of Residual Calls By Range — Residual Calls (-3000 to 4000) vs. Range (-5 to 40)]
If residual plots show a problem, then we could try to transform the x-variable and/or the y-variable.
Residual Plots in JMP
• After Fit Model, click red triangle next to Response, click Save Columns and click Residuals.
• Use Fit Y by X with Y=Residuals and X the explanatory variable of interest. Fit Line will draw a horizontal line with intercept zero. It is a property of the residuals from multiple linear regression that a least squares regression of the residuals on an explanatory variable has slope zero and intercept zero.
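The property stated above can be checked numerically. This sketch (simulated data, not the AAA data) fits a multiple regression with numpy, then regresses the residuals on one of the model's explanatory variables; the fitted slope and intercept come out as zero up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1 + 2 * x1 - x2 + rng.normal(size=n)

# Multiple regression of y on x1 and x2 (with intercept)
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Simple regression of the residuals on x1 (with intercept)
b0, b1 = np.linalg.lstsq(np.column_stack([np.ones(n), x1]), resid, rcond=None)[0]
print(b0, b1)  # both zero up to floating-point error
```

This is exactly why JMP's Fit Line on a residual plot draws a horizontal line through zero: least-squares residuals are orthogonal to every column of the design matrix, including the intercept column.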
Residual by Predicted Plot
• Fit Model displays the Residual by Predicted Plot automatically in its output.
• The plot is a plot of the residuals versus the predicted Y’s. We can think of the predicted Y’s as summarizing all the information in the X’s. As usual we would like this plot to show random scatter.
• Pattern in the mean of the residuals as the predicted Y’s increase: indicates a problem with linearity. Look at residual plots versus each explanatory variable to isolate the problem and consider transformations.
• Pattern in the spread of the residuals: indicates a problem with constant variance.
Residual by Predicted Plot

[Figure: Residual by Predicted Plot — Calls Residual (-3000 to 4000) vs. Calls Predicted (1000 to 9000)]

Ŷi = Ê(Y | X1 = xi1, …, Xp = xip)
Checking Normality
• As with simple linear regression, make histogram of residuals and normal quantile plot of residuals.
Distributions: Residual Calls

[Figure: Histogram and normal quantile plot of Residual Calls (-3000 to 4000)]

Normality appears to be violated: several points are outside the confidence bands. The distribution of residuals is skewed to the right.
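Beyond the quantile plot, a formal test can back up the visual impression (my addition, not part of the slides): the Shapiro-Wilk test rejects normality for clearly right-skewed residuals. Simulated right-skewed data stand in for the AAA residuals here:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Exponential draws shifted to mean ~0: strongly right-skewed "residuals"
skewed_resid = rng.exponential(scale=1.0, size=200) - 1.0

stat, p = stats.shapiro(skewed_resid)
print(p)  # tiny p-value: normality is rejected
```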
Transformations to Remedy Constant Variance and Normality

Nonconstant variance:
• When the variance of Y|X increases with the mean of Y|X, try transforming Y to log Y or Y to √Y.
• When the variance of Y|X decreases with the mean of Y|X, try transforming Y to 1/Y or Y to Y².

Nonnormality:
• When the distribution of the residuals is skewed to the right, try transforming Y to log Y.
• When the distribution of the residuals is skewed to the left, try transforming Y to Y².
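A small sketch of why the log transform helps with right skew: a lognormal sample is strongly right-skewed, but its log is (by construction) normal, with skewness near zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
y = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # right-skewed response

skew_raw = stats.skew(y)          # large and positive
skew_log = stats.skew(np.log(y))  # near zero after the transform
print(skew_raw, skew_log)
```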
Influential Points, High Leverage Points, Outliers
• As in simple linear regression, we identify high leverage and high influence points by checking the leverages and Cook’s distances (use Save Columns to save Cook’s D Influence and Hats).
• High influence points: Cook’s distance > 1.
• High leverage points: a point with a hat value greater than (3*(# of explanatory variables + 1))/n has high leverage.
• Use the same guidelines for dealing with influential observations as in simple linear regression.
• A point that has an unusual Y given its explanatory variables: a point with a residual that is more than 3 RMSEs away from zero.
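In class these diagnostics come from JMP's Save Columns, but they can be computed directly from the design matrix; this sketch (simulated data) does so with numpy. Here p denotes the number of coefficients including the intercept, so the leverage cutoff 3*(# of explanatory variables + 1)/n is 3p/n:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 1 + x1 + x2 + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix
h = np.diag(H)                        # leverages (JMP's "Hats")
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
p = X.shape[1]                        # coefficients incl. intercept
mse = resid @ resid / (n - p)
cooks_d = resid**2 * h / (p * mse * (1 - h) ** 2)

print(h.sum())                  # leverages always sum to p (= 3 here)
high_leverage = h > 3 * p / n   # rule of thumb from the slide
```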
Scatterplot Matrix
• Before fitting a multiple linear regression model, it is a good idea to make scatterplots of the response variable versus each of the explanatory variables. These can suggest transformations of the explanatory variables that need to be done, as well as potential outliers and influential points.
• Scatterplot matrix in JMP: Click Analyze, Multivariate Methods, Multivariate, and then put the response variable first in the Y, Columns box, followed by the explanatory variables.
Scatterplot Matrix

[Figure: Scatterplot matrix of Calls, Average Temperature, Range, Rain forecast, Snow forecast, Weekday, Sunday, and Subzero]
• In order to evaluate benefits of a proposed irrigation scheme in Egypt, the relation of yield Y of wheat to rainfall is investigated over several years (see rainfall.JMP).
• How can regression analysis help?

  Year   Yield (Bu./Acre), Y   Total Spring Rainfall, R   Average Spring Temperature, T
  1963   60                    8                          56
  1964   50                    10                         47
  1965   70                    11                         53
  1966   70                    10                         53
  1967   80                    9                          56
  1968   50                    9                          47
  1969   60                    12                         44
  1970   40                    11                         44
Simple Linear Regression of Yield on Rainfall
[Figure: Bivariate Fit of Yield By Total Spring Rainfall — Yield (30 to 90) vs. Total Spring Rainfall (7 to 13), with linear fit]

Linear Fit: Yield = 76.666667 - 1.6666667 Total Spring Rainfall
Rainfall reduces yield!? Is irrigation a bad idea?
• Interpretation of coefficient of rainfall: The change in the mean yield that is associated with a one inch increase in rainfall. Other important variables (lurking variables) are not held fixed and might tend to change as rainfall increases.
Linear Fit Yield = 76.666667 - 1.6666667 Total Spring Rainfall
Bivariate Fit of Average Spring Temperature By Total Spring Rainfall

[Figure: Average Spring Temperature (42.5 to 57.5) vs. Total Spring Rainfall (7 to 13)]

Temperature tends to decrease as rainfall increases.
Controlling for Known Lurking Variables: Multiple Regression
• To evaluate the benefits of the irrigation scheme, we want to know how changes in rainfall are associated with changes in yield when all other important variables (lurking variables), such as temperature, are held fixed.
• Multiple regression provides this.
• Coefficient on rainfall in the multiple regression of yield on rainfall and temperature = the change in the mean yield that is associated with a one inch increase in rainfall when temperature is held fixed.
Multiple Regression Analysis
• Rainfall is estimated to be beneficial once temperature is held fixed.
Response Yield

Summary of Fit
  RSquare                      0.790476
  RSquare Adj                  0.706667
  Root Mean Square Error       7.091242
  Mean of Response             60
  Observations (or Sum Wgts)   8

Parameter Estimates
  Term                         Estimate    Std Error   t Ratio   Prob>|t|
  Intercept                    -144.7619   55.8499     -2.59     0.0487
  Total Spring Rainfall        5.7142857   2.680238     2.13     0.0862
  Average Spring Temperature   2.952381    0.692034     4.27     0.0080
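Both fits can be reproduced outside JMP with ordinary least squares; this sketch uses numpy and the eight years of data from the table in the slides, and shows the sign flip on the rainfall coefficient once temperature is held fixed:

```python
import numpy as np

# Data from the rainfall.JMP example (1963-1970)
rain = np.array([8, 10, 11, 10, 9, 9, 12, 11], dtype=float)
temp = np.array([56, 47, 53, 53, 56, 47, 44, 44], dtype=float)
yield_ = np.array([60, 50, 70, 70, 80, 50, 60, 40], dtype=float)
n = len(yield_)

# Simple regression: Yield on Rainfall only
b_simple = np.linalg.lstsq(
    np.column_stack([np.ones(n), rain]), yield_, rcond=None)[0]

# Multiple regression: Yield on Rainfall and Temperature
b_mult = np.linalg.lstsq(
    np.column_stack([np.ones(n), rain, temp]), yield_, rcond=None)[0]

print(b_simple)  # ~ [76.67, -1.67]: rainfall looks harmful on its own
print(b_mult)    # ~ [-144.76, 5.71, 2.95]: beneficial with temperature fixed
```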