Lecture 24: Thurs., April 8th
Inference for Multiple Regression
Types of inferences:
• Confidence intervals/hypothesis tests for regression coefficients
• Confidence intervals for mean response, prediction intervals
• Overall usefulness of predictors (F-test, R-squared)
• Effect tests (we will cover these later when we cover categorical explanatory variables)
Overall usefulness of predictors
• Model: $\mu\{y \mid x_1,\ldots,x_p\} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$
• Are any of the predictors useful? Does the mean of y change as any of the explanatory variables changes?
• $H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$ vs. $H_a$: at least one of the $\beta_j$'s does not equal zero.
• The test (called the overall F test) is carried out in the Analysis of Variance table. We reject $H_0$ for large values of the F statistic. Prob>F is the p-value for this test.
• For the fish mercury data, Prob>F is less than 0.0001: strong evidence that at least one of length/weight is a useful predictor of mercury concentration.
The R-Squared Statistic
• The p-value from the overall F test tests whether any of the predictors are useful but does not give a measure of how useful the predictors are.
• R-squared is a measure of how good the predictions from the multiple regression model are compared to using the mean of y, i.e., none of the predictors, to predict y.
• Similar interpretation as in simple linear regression: the R-squared statistic is the proportion of the variation in y explained by the multiple regression model.
• Total Sum of Squares: $\sum_{i=1}^{n} (y_i - \bar{y})^2$
• Residual Sum of Squares: $\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \cdots - \hat\beta_p x_{ip})^2$
• $R^2 = \dfrac{\text{Total sum of squares} - \text{Residual sum of squares}}{\text{Total sum of squares}}$
Summary of Fit
RSquare 0.427626
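As a rough illustration of the two formulas above, the sketch below computes R-squared and the overall F statistic from the total and residual sums of squares. The data are simulated (they are not the fish mercury data), and the function name `r_squared_and_f` is a hypothetical helper, not a JMP or NumPy routine.

```python
import numpy as np

def r_squared_and_f(X, y):
    """Least squares fit of y on the columns of X (an intercept is added here).

    Returns R-squared and the overall F statistic.
    """
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ beta_hat
    total_ss = np.sum((y - y.mean()) ** 2)      # uses only the mean of y
    residual_ss = np.sum((y - fitted) ** 2)     # uses the fitted regression
    r2 = (total_ss - residual_ss) / total_ss
    # Overall F: model mean square divided by error mean square
    f_stat = ((total_ss - residual_ss) / p) / (residual_ss / (n - p - 1))
    return r2, f_stat

# Simulated example with p = 2 predictors (illustrative values only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=100)
r2, f_stat = r_squared_and_f(X, y)
print(f"R-squared = {r2:.3f}, overall F = {f_stat:.1f}")
```

A large F says that at least one predictor is useful; R-squared says how much of the variation in y the predictors account for.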
Air Pollution and Mortality
• Data set pollution.JMP provides information about the relationship between pollution and mortality for 60 cities from 1959 to 1961.
• The variables are:
  – y (MORT) = total age-adjusted mortality in deaths per 100,000 population;
  – PRECIP = mean annual precipitation (in inches);
  – EDUC = median number of school years completed for persons 25 and older;
  – NONWHITE = percentage of 1960 population that is nonwhite;
  – NOX = relative pollution potential of NOx (related to the number of tons of NOx emitted per day per square kilometer);
  – SO2 = relative pollution potential of SO2
Multiple Regression and Causal Inference
• Goal: Figure out what the causal effect on mortality would be of decreasing air pollution (and keeping everything else in the world fixed)
• Confounding variable: A variable that is related to both air pollution in a city and mortality in a city.
• In order to figure out whether air pollution causes mortality, we want to compare mean mortality among cities with different air pollution levels but the same values of the confounding variables.
• If we include all of the confounding variables in the multiple regression model, the coefficient on air pollution represents the change in the mean of mortality that is caused by a one unit increase in air pollution.
Omitted Variables
• What happens if we omit a confounding variable from the regression, e.g., percentage of smokers?
• Suppose we are interested in the causal effect of $x_1$ on y and believe that there are confounding variables $x_2,\ldots,x_p$ and that
  $\mu\{y \mid x_1,\ldots,x_p\} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$
• $\beta_1$ is the causal effect of $x_1$ on y. If we omit the confounding variable $x_p$, then the multiple regression will be estimating the coefficient $\beta_1^*$ as the coefficient on $x_1$, where
  $\mu\{y \mid x_1,\ldots,x_{p-1}\} = \beta_0^* + \beta_1^* x_1 + \cdots + \beta_{p-1}^* x_{p-1}$
  How different are $\beta_1^*$ and $\beta_1$?
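Omitted Variables Bias Formula
• Suppose that
  $\mu\{y \mid x_1,\ldots,x_p\} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$
  $\mu\{y \mid x_1,\ldots,x_{p-1}\} = \beta_0^* + \beta_1^* x_1 + \cdots + \beta_{p-1}^* x_{p-1}$
  $\mu\{x_p \mid x_1,\ldots,x_{p-1}\} = \gamma_0 + \gamma_1 x_1 + \cdots + \gamma_{p-1} x_{p-1}$
• Then $\beta_1^* = \beta_1 + \gamma_1 \beta_p$
• The formula tells us about the direction and magnitude of the bias from omitting a variable in estimating a causal effect.
• The formula also applies to least squares estimates, i.e., $\hat\beta_1^* = \hat\beta_1 + \hat\gamma_1 \hat\beta_p$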
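The least squares version of the omitted variables bias formula can be checked numerically. The sketch below uses simulated data with p = 2, where x2 plays the role of the omitted confounder. The helper `ols` and all numeric values are assumptions made for illustration; `gamma_hat` holds the coefficients from regressing the omitted variable on the included one.

```python
import numpy as np

def ols(X, y):
    """Least squares coefficients, intercept first (hypothetical helper)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Simulated data: x2 is related to both x1 and y, so it is a confounder
rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y = 3.0 + 1.5 * x1 + 2.0 * x2 + rng.normal(size=n)

beta_hat = ols(np.column_stack([x1, x2]), y)  # full model: y on x1, x2
beta_star_hat = ols(x1, y)                    # x2 omitted: y on x1 only
gamma_hat = ols(x1, x2)                       # omitted x2 on included x1

lhs = beta_star_hat[1]
rhs = beta_hat[1] + gamma_hat[1] * beta_hat[2]
print(f"coefficient on x1 with x2 omitted: {lhs:.4f}")
print(f"full-model coefficient plus bias:  {rhs:.4f}")
```

Here the omitted variable is positively related to both x1 and y, so the coefficient on x1 is biased upward when x2 is left out.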
Assumptions of Multiple Linear Regression Model
• Assumptions of multiple linear regression:
  – For each subpopulation $x_1,\ldots,x_p$,
    • (A-1A) $\mu\{Y \mid X_1,\ldots,X_p\} = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$
    • (A-1B) $\mathrm{Var}(Y \mid X_1,\ldots,X_p) = \sigma^2$
    • (A-1C) The distribution of $Y \mid X_1,\ldots,X_p$ is normal
    [Distribution of residuals should not depend on $x_1,\ldots,x_p$]
  – (A-2) The observations are independent of one another
Checking/Refining Model
• Tools for checking (A-1A) and (A-1B)
  – Residual plots versus predicted (fitted) values
  – Residual plots versus explanatory variables $x_1,\ldots,x_p$
  – If the model is correct, there should be no pattern in the residual plots
• Tool for checking (A-1C)
  – Histogram of residuals
• Tool for checking (A-2)
  – Residual plot versus time or spatial order of observations
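To make "no pattern in the residual plots" concrete, here is a minimal sketch on simulated data (not the pollution data). A straight-line fit to a curved relationship leaves residuals that track the curvature, while a fit that includes the squared term does not. The helper `fit_residuals` is hypothetical; in practice one would look at the plots rather than a single correlation.

```python
import numpy as np

def fit_residuals(design, y):
    """Return fitted values and residuals from a least squares fit."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    return fitted, y - fitted

# Simulated data whose true mean is quadratic in x
rng = np.random.default_rng(4)
n = 200
x = rng.uniform(0.0, 10.0, size=n)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(size=n)

ones = np.ones(n)
_, resid_linear = fit_residuals(np.column_stack([ones, x]), y)
_, resid_quad = fit_residuals(np.column_stack([ones, x, x**2]), y)

# A crude numeric stand-in for "pattern": correlation with a curvature term
curvature = (x - x.mean()) ** 2
corr_linear = np.corrcoef(resid_linear, curvature)[0, 1]
corr_quad = np.corrcoef(resid_quad, curvature)[0, 1]
print(f"straight-line fit: corr(residual, curvature) = {corr_linear:.2f}")
print(f"quadratic fit:     corr(residual, curvature) = {corr_quad:.2f}")
```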
Model Building
1. Make a scatterplot matrix of the variables (in JMP: Analyze, Multivariate). Decide whether to transform any of the explanatory variables.
2. Fit the model.
3. Check residual plots for whether the assumptions of the multiple regression model are satisfied. Also look for outliers and influential points.
4. Make changes to the model and repeat steps 2-3 until an adequate model is found.
1. Get the correlation matrix of all the variables:

Correlations:
                MORT    PRECIP      EDUC  NONWHITE       NOX       SO2
MORT          1.0000    0.5095   -0.5110    0.6437   -0.0774    0.4259
PRECIP        0.5095    1.0000   -0.4904    0.4132   -0.4873   -0.1069
EDUC         -0.5110   -0.4904    1.0000   -0.2088    0.2244   -0.2343
NONWHITE      0.6437    0.4132   -0.2088    1.0000    0.0184    0.1593
NOX          -0.0774   -0.4873    0.2244    0.0184    1.0000    0.4094
SO2           0.4259   -0.1069   -0.2343    0.1593    0.4094    1.0000
2. a) From the scatterplot of MORT vs. NOX we see that the NOX values are tightly bunched together. A log transformation of NOX is needed.
b) The curvature in MORT vs. SO2 indicates that a log transformation of SO2 may be suitable.
After the two transformations we have the following correlations:

                MORT    PRECIP      EDUC  NONWHITE       NOX       SO2  Log(NOX)  Log(SO2)
MORT          1.0000    0.5095   -0.5110    0.6437   -0.0774    0.4259    0.2920    0.4031
PRECIP        0.5095    1.0000   -0.4904    0.4132   -0.4873   -0.1069   -0.3683   -0.1212
EDUC         -0.5110   -0.4904    1.0000   -0.2088    0.2244   -0.2343    0.0180   -0.2562
NONWHITE      0.6437    0.4132   -0.2088    1.0000    0.0184    0.1593    0.1897    0.0524
NOX          -0.0774   -0.4873    0.2244    0.0184    1.0000    0.4094    0.7054    0.3582
SO2           0.4259   -0.1069   -0.2343    0.1593    0.4094    1.0000    0.6905    0.7738
Log(NOX)      0.2920   -0.3683    0.0180    0.1897    0.7054    0.6905    1.0000    0.7328
Log(SO2)      0.4031   -0.1212   -0.2562    0.0524    0.3582    0.7738    0.7328    1.0000
[Figure: Scatterplot Matrix of MORT, PRECIP, EDUC, NONWHITE, NOX, and SO2]
Dealing with Influential Observations
• By influential observations, we mean one or several observations whose removal causes a different conclusion or course of action.
• Display 11.8 provides a strategy for dealing with suspected influential cases.
Cook’s Distance
• Cook’s distance is a statistic that can be used to flag observations which are influential.
• After fitting the model, click the red triangle next to Response, then choose Save Columns, Cook’s D Influence.
• A Cook’s distance close to or larger than 1 indicates a large influence.
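For readers working outside JMP, Cook's distance can be computed directly from the residuals and leverages of a least squares fit. The sketch below uses the standard formula $D_i = \dfrac{e_i^2}{k s^2} \cdot \dfrac{h_{ii}}{(1-h_{ii})^2}$, where k is the number of coefficients including the intercept. The function name `cooks_distance` and the simulated data, with one planted unusual point, are assumptions for illustration.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each observation of a least squares fit.

    An intercept column is added here; k counts it as a coefficient.
    """
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    k = design.shape[1]
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta_hat
    s2 = resid @ resid / (n - k)  # mean square error
    # Leverages: diagonal of the hat matrix design @ pinv(design)
    leverage = np.einsum("ij,ji->i", design, np.linalg.pinv(design))
    return resid**2 * leverage / (k * s2 * (1.0 - leverage) ** 2)

# Simulated data with one planted high-leverage outlier at index 0
rng = np.random.default_rng(2)
n = 30
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
x[0], y[0] = 6.0, -10.0

d = cooks_distance(x, y)
print("most influential observation:", int(np.argmax(d)))
print(f"its Cook's distance: {d.max():.2f}")
```

The planted point is both far from the other x values and far from the line the rest of the data follow, which is the combination that produces a large Cook's distance.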
Leverage Plots
• The leverage plots produced by JMP provide a “simple regression view” of a multiple regression coefficient. (The leverage plot for variable $x_j$ is a plot of $x_j \mid x_1,\ldots,x_{j-1},x_{j+1},\ldots,x_K$ vs. multiple regression residuals.)
• The slope of the line shown in the leverage plot is equal to the coefficient $\hat\beta_j$ for that variable in the multiple regression.
• Distances from the points to the line in the leverage plot are multiple regression residuals. The distance from a point to the horizontal line is the residual if the explanatory variable $x_j$ is not included in the model.
• These plots are used to identify outliers, leverage, and influential points for the particular regression coefficient in the multiple regression. (Use them the same way as in a simple regression.)
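The "simple regression view" can be illustrated with the closely related added-variable (partial regression) construction. This is not JMP's exact leverage plot, whose axes are scaled differently, but it shares the two properties listed above. Regress y on the other predictors, regress $x_j$ on the other predictors, and then regress the first set of residuals on the second. The data and the helper `residuals` are simulated or hypothetical.

```python
import numpy as np

def residuals(X, y):
    """Residuals from a least squares fit of y on X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef

# Simulated data: xj is the variable of interest, `others` the remaining predictors
rng = np.random.default_rng(3)
n = 200
others = rng.normal(size=(n, 2))
xj = 0.5 * others[:, 0] + rng.normal(size=n)
y = 1.0 + 2.0 * xj - 1.0 * others[:, 0] + 0.5 * others[:, 1] + rng.normal(size=n)

# Full multiple regression; full_coef[1] is the coefficient on xj
full_design = np.column_stack([np.ones(n), xj, others])
full_coef, *_ = np.linalg.lstsq(full_design, y, rcond=None)
full_resid = y - full_design @ full_coef

# Added-variable view: both sets of residuals have the other predictors removed
y_res = residuals(others, y)
xj_res = residuals(others, xj)
slope = (xj_res @ y_res) / (xj_res @ xj_res)
view_resid = y_res - slope * xj_res

print(f"multiple regression coefficient on xj: {full_coef[1]:.4f}")
print(f"slope in the added-variable view:      {slope:.4f}")
```

The slope of the simple regression through the two sets of residuals equals the multiple regression coefficient, and the residuals around that line equal the multiple regression residuals.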
Whole Model

Summary of Fit
RSquare    0.688278

Analysis of Variance
Source      DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model        5        157115.28       31423.1   23.8462     <.0001
Error       54         71157.80        1317.7
C. Total    59        228273.08

Parameter Estimates
Term          Estimate   Std Error   t Ratio   Prob>|t|
Intercept     940.6541    94.05424     10.00     <.0001
PRECIP       1.9467286    0.700696      2.78     0.0075
EDUC         -14.66406    6.937846     -2.11     0.0392
NONWHITE      3.028953    0.668519      4.53     <.0001
Log(NOX)     6.7159712     7.39895      0.91     0.3681
Log(SO2)      11.35814    5.295487      2.14     0.0365

Effect Tests
Source      Sum of Squares   F Ratio   Prob > F
PRECIP           10171.388    7.7188     0.0075
EDUC              5886.913    4.4674     0.0392
NONWHITE         27051.227   20.5285     <.0001
Log(NOX)          1085.691    0.8239     0.3681
Log(SO2)          6062.217    4.6005     0.0365

[Figure: Residual by Predicted Plot, MORT Residual vs. MORT Predicted; the labeled point is New Orleans, LA]
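The entries in the output above are tied together by simple arithmetic, which the following sketch reproduces from the reported sums of squares, degrees of freedom, and the PRECIP estimate and standard error. Variable names are illustrative, and small differences from the reported values come from rounding in the printed output.

```python
# Values copied from the Analysis of Variance and Parameter Estimates tables
ss_model, ss_error, ss_total = 157115.28, 71157.80, 228273.08
df_model, df_error = 5, 54

r_square = ss_model / ss_total                 # RSquare
ms_model = ss_model / df_model                 # Mean Square, Model row
ms_error = ss_error / df_error                 # Mean Square, Error row
f_ratio = ms_model / ms_error                  # overall F Ratio

# t Ratio = Estimate / Std Error; for a single coefficient, the
# Effect Test F Ratio is the square of the t Ratio
t_precip = 1.9467286 / 0.700696
f_precip = t_precip**2

print(f"RSquare         = {r_square:.6f}")
print(f"F Ratio         = {f_ratio:.4f}")
print(f"PRECIP t Ratio  = {t_precip:.2f}")
print(f"PRECIP effect F = {f_precip:.4f}")
```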
An alternative model
Because of the importance of NOX and SO2, one could choose the final model to be: MORT vs. PRECIP, NONWHITE, EDUC, log NOX, and log SO2.
Notice that even though log NOX is not significant, one could still leave it in the model.
Influential points can have an extreme impact on the analysis.
[Figures: five leverage plots, each showing MORT Leverage Residuals vs. the leverage of one predictor:
PRECIP Leverage Plot (P=0.0075), EDUC Leverage Plot (P=0.0392), NONWHITE Leverage Plot (P<.0001), Log NOX Leverage Plot (P=0.3681), Log SO2 Leverage Plot (P=0.0365)]
New Orleans (the enlarged point in the leverage plots) is an outlier for estimating each coefficient and has high leverage for estimating the coefficients of interest, those on log NOX and log SO2. Since New Orleans is both an outlier and a high-leverage point, we expect it to be influential.
Multiple Regression with New Orleans

Summary of Fit
RSquare                       0.688278
RSquare Adj                   0.659415
Root Mean Square Error        36.30065
Mean of Response              940.3568
Observations (or Sum Wgts)          60

Analysis of Variance
Source      DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model        5        157115.28       31423.1   23.8462     <.0001
Error       54         71157.80        1317.7
C. Total    59        228273.08

Parameter Estimates
Term          Estimate   Std Error   t Ratio   Prob>|t|
Intercept     940.6541    94.05424     10.00     <.0001
PRECIP       1.9467286    0.700696      2.78     0.0075
EDUC         -14.66406    6.937846     -2.11     0.0392
NONWHITE      3.028953    0.668519      4.53     <.0001
Log NOX      6.7159712     7.39895      0.91     0.3681
Log SO2       11.35814    5.295487      2.14     0.0365

Multiple Regression without New Orleans

Summary of Fit
RSquare                       0.724661
RSquare Adj                   0.698686
Root Mean Square Error        32.06752
Mean of Response              937.4297
Observations (or Sum Wgts)          59

Analysis of Variance
Source      DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model        5        143441.28       28688.3   27.8980     <.0001
Error       53         54501.26        1028.3
C. Total    58        197942.54

Parameter Estimates
Term          Estimate   Std Error   t Ratio   Prob>|t|
Intercept     852.3761     85.9328      9.92     <.0001
PRECIP       1.3633298    0.635732      2.14     0.0366
EDUC         -5.666948     6.52378     -0.87     0.3889
NONWHITE     3.0396794    0.590566      5.15     <.0001
Log NOX      -9.898442    7.730645     -1.28     0.2060
Log SO2      26.032584    5.931083      4.39     <.0001
Removing New Orleans has a large impact on the coefficients of log NOX and log SO2: it reverses the sign of the log NOX coefficient (from 6.72 to -9.90) and more than doubles the log SO2 coefficient (from 11.36 to 26.03).
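A quick way to see which coefficients move is to compare the two Parameter Estimates tables term by term. The sketch below does this with the reported estimates; the dictionary names are illustrative.

```python
# Estimates copied from the two Parameter Estimates tables
with_new_orleans = {
    "PRECIP": 1.9467286, "EDUC": -14.66406, "NONWHITE": 3.028953,
    "Log NOX": 6.7159712, "Log SO2": 11.35814,
}
without_new_orleans = {
    "PRECIP": 1.3633298, "EDUC": -5.666948, "NONWHITE": 3.0396794,
    "Log NOX": -9.898442, "Log SO2": 26.032584,
}

# Terms whose estimated coefficient changes sign when New Orleans is removed
sign_flips = [
    term for term in with_new_orleans
    if with_new_orleans[term] * without_new_orleans[term] < 0
]
so2_ratio = without_new_orleans["Log SO2"] / with_new_orleans["Log SO2"]

for term in with_new_orleans:
    change = without_new_orleans[term] - with_new_orleans[term]
    print(f"{term:9s} change in estimate: {change:+.3f}")
print("sign changes:", sign_flips)
print(f"Log SO2 coefficient ratio (without / with): {so2_ratio:.2f}")
```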
[Figure: Scatterplot Matrix of MORT, PRECIP, EDUC, NONWHITE, NOX, SO2, Log(NOX), and Log(SO2)]