Lecture 24: Thurs., April 8th



Page 1: Lecture 24: Thurs., April 8th


Page 2:

Inference for Multiple Regression

Types of inferences:

• Confidence intervals/hypothesis tests for regression coefficients

• Confidence intervals for mean response, prediction intervals

• Overall usefulness of predictors (F-test, R-squared)

• Effect tests (we will cover these later when we cover categorical explanatory variables)

Page 3:

Overall usefulness of predictors

• Are any of the predictors useful? Does the mean of y change as any of the explanatory variables changes?

• In the model μ{y | x1, …, xp} = β0 + β1x1 + … + βpxp, we test H0: β1 = β2 = … = βp = 0 vs. Ha: at least one of the βj's does not equal zero.

• The test (called the overall F test) is carried out in the Analysis of Variance table. We reject H0 for large values of the F statistic. Prob > F is the p-value for this test.

• For the fish mercury data, Prob > F is less than 0.0001 – strong evidence that at least one of length/weight is a useful predictor of mercury concentration.

Page 4:

The R-Squared Statistic

• The p-value from the overall F test tests whether any of the predictors are useful but does not give a measure of how useful the predictors are.

• R-squared is a measure of how good the predictions from the multiple regression model are compared to using the mean of y, i.e., none of the predictors, to predict y.

• Similar interpretation as in simple linear regression: the R-squared statistic is the proportion of the variation in y explained by the multiple regression model.

• Total sum of squares: Σ_{i=1}^{n} (y_i − ȳ)²

• Residual sum of squares: Σ_{i=1}^{n} (y_i − (β̂0 + β̂1·x_{i1} + … + β̂p·x_{ip}))²

• R² = (Total sum of squares − Residual sum of squares) / Total sum of squares

Summary of Fit
RSquare  0.427626

Page 5:

Air Pollution and Mortality

• Data set pollution.JMP provides information about the relationship between pollution and mortality for 60 cities between 1959-1961.

• The variables are:
– y (MORT) = total age-adjusted mortality in deaths per 100,000 population;
– PRECIP = mean annual precipitation (in inches);
– EDUC = median number of school years completed for persons 25 and older;
– NONWHITE = percentage of 1960 population that is nonwhite;
– NOX = relative pollution potential of NOx (related to amount of tons of NOx emitted per day per square kilometer);
– SO2 = relative pollution potential of SO2.

Page 6:

Multiple Regression and Causal Inference

• Goal: Figure out what the causal effect on mortality would be of decreasing air pollution (and keeping everything else in the world fixed)

• Confounding variable: A variable that is related to both air pollution in a city and mortality in a city.

• In order to figure out whether air pollution causes mortality, we want to compare mean mortality among cities with different air pollution levels but the same values of the confounding variables.

• If we include all of the confounding variables in the multiple regression model, the coefficient on air pollution represents the change in the mean of mortality that is caused by a one unit increase in air pollution.

Page 7:

Omitted Variables

• What happens if we omit a confounding variable from the regression, e.g., percentage of smokers?

• Suppose we are interested in the causal effect of x1 on y and believe that there are confounding variables x2, …, xp and that

μ{y | x1, …, xp} = β0 + β1x1 + … + βpxp

• β1 is the causal effect of x1 on y. If we omit the confounding variable xp, then the multiple regression

μ{y | x1, …, x(p-1)} = β0* + β1*x1 + … + β(p-1)*x(p-1)

will be estimating the coefficient β1* as the coefficient on x1. How different are β1 and β1*?

Page 8:

Omitted Variables Bias Formula

• Suppose that

μ{y | x1, …, xp} = β0 + β1x1 + … + βpxp
μ{y | x1, …, x(p-1)} = β0* + β1*x1 + … + β(p-1)*x(p-1)
μ{xp | x1, …, x(p-1)} = δ0 + δ1x1 + … + δ(p-1)x(p-1)

• Then β1* = β1 + βp·δ1.

• The formula tells us about the direction and magnitude of the bias from omitting a variable in estimating a causal effect.

• The formula also applies to the least squares estimates, i.e., β̂1* = β̂1 + β̂p·δ̂1.

Page 9:

Assumptions of Multiple Linear Regression Model

• Assumptions of multiple linear regression:
– For each subpopulation x1, …, xp:
• (A-1A) μ{Y | X1, …, Xp} = β0 + β1X1 + … + βpXp
• (A-1B) Var(Y | X1, …, Xp) = σ²
• (A-1C) The distribution of Y | X1, …, Xp is normal. [The distribution of the residuals should not depend on x1, …, xp.]
– (A-2) The observations are independent of one another.

Page 10:

Checking/Refining Model

• Tools for checking (A-1A) and (A-1B):
– Residual plots versus predicted (fitted) values
– Residual plots versus explanatory variables x1, …, xp
– If the model is correct, there should be no pattern in the residual plots

• Tool for checking (A-1C):
– Histogram of residuals

• Tool for checking (A-2):
– Residual plot versus time or spatial order of observations

Page 11:

Model Building

1. Make scatterplot matrix of variables (using analyze, multivariate). Decide on whether to transform any of the explanatory variables.

2. Fit model.

3. Check residual plots for whether the assumptions of the multiple regression model are satisfied. Also look for outliers and influential points.

4. Make changes to the model and repeat steps 2-3 until an adequate model is found.

Page 12:

1. Get the correlation matrix of all the variables:

Correlations:

          MORT     PRECIP   EDUC     NONWHITE  NOX      SO2
MORT      1.0000   0.5095  -0.5110   0.6437   -0.0774   0.4259
PRECIP    0.5095   1.0000  -0.4904   0.4132   -0.4873  -0.1069
EDUC     -0.5110  -0.4904   1.0000  -0.2088    0.2244  -0.2343
NONWHITE  0.6437   0.4132  -0.2088   1.0000    0.0184   0.1593
NOX      -0.0774  -0.4873   0.2244   0.0184    1.0000   0.4094
SO2       0.4259  -0.1069  -0.2343   0.1593    0.4094   1.0000

Page 13:

2. a) From the scatterplot of MORT vs. NOX we see that the NOX values are compressed into a narrow range near zero. A log transformation of NOX is needed.

b) The curvature in MORT vs. SO2 indicates a log transformation for SO2 may be suitable.

After the two transformations we have the following correlations:

          MORT     PRECIP   EDUC     NONWHITE  NOX      SO2      Log(NOX)  Log(SO2)
MORT      1.0000   0.5095  -0.5110   0.6437   -0.0774   0.4259   0.2920    0.4031
PRECIP    0.5095   1.0000  -0.4904   0.4132   -0.4873  -0.1069  -0.3683   -0.1212
EDUC     -0.5110  -0.4904   1.0000  -0.2088    0.2244  -0.2343   0.0180   -0.2562
NONWHITE  0.6437   0.4132  -0.2088   1.0000    0.0184   0.1593   0.1897    0.0524
NOX      -0.0774  -0.4873   0.2244   0.0184    1.0000   0.4094   0.7054    0.3582
SO2       0.4259  -0.1069  -0.2343   0.1593    0.4094   1.0000   0.6905    0.7738
Log(NOX)  0.2920  -0.3683   0.0180   0.1897    0.7054   0.6905   1.0000    0.7328
Log(SO2)  0.4031  -0.1212  -0.2562   0.0524    0.3582   0.7738   0.7328    1.0000
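The rationale for the log transformation can be illustrated on simulated data (the real pollution.JMP values are the ones tabulated above, not reproduced here): when the response is linear in log(NOX) and NOX itself is right-skewed, the correlation with log(NOX) is noticeably stronger than with raw NOX.

```python
import numpy as np

# Illustration of why a log transformation can help: a right-skewed
# predictor (like NOX) whose effect on the response is linear on the
# log scale. Simulated data; coefficients are illustrative assumptions.
rng = np.random.default_rng(6)
n = 60
nox = rng.lognormal(mean=2.0, sigma=1.2, size=n)   # right-skewed, like NOX
mort = 900.0 + 20.0 * np.log(nox) + rng.normal(scale=10.0, size=n)

r_raw = np.corrcoef(mort, nox)[0, 1]
r_log = np.corrcoef(mort, np.log(nox))[0, 1]
print(r_raw, r_log)  # the log-scale correlation is clearly larger
```

This mirrors the jump from 0.4259 (SO2) to 0.7738 (Log(SO2)) and from -0.0774 (NOX) to 0.2920 (Log(NOX)) seen in the table above.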

Page 14:

[Scatterplot matrix of MORT, PRECIP, EDUC, NONWHITE, NOX, and SO2]

Page 15:

Dealing with Influential Observations

• By influential observations, we mean one or several observations whose removal causes a different conclusion or course of action.

• Display 11.8 provides a strategy for dealing with suspected influential cases.

Page 16:

Cook’s Distance

• Cook’s distance is a statistic that can be used to flag observations which are influential.

• After fitting the model, click on the red triangle next to Response, then Save Columns, Cook's D Influence.

• A Cook's distance close to or larger than 1 indicates a large influence.

Page 17:

Leverage Plots

• The leverage plots produced by JMP provide a "simple regression view" of a multiple regression coefficient. (The leverage plot for variable xj is a plot of the residuals from regressing xj on the other explanatory variables x1, …, x(j-1), x(j+1), …, xK vs. the multiple regression residuals.)

• The slope of the line shown in the leverage plot is equal to the coefficient β̂j for that variable in the multiple regression.

• Distances from the points to the line in the leverage plot are the multiple regression residuals. The distance from a point to the horizontal line is the residual if the explanatory variable is not included in the model.

• These plots are used to identify outliers, leverage, and influential points for the particular regression coefficient in the multiple regression. (Use them the same way as in a simple regression.)

Page 18:

Whole Model

Summary of Fit
RSquare  0.688278

Analysis of Variance
Source    DF  Sum of Squares  Mean Square  F Ratio  Prob > F
Model      5  157115.28       31423.1      23.8462  <.0001
Error     54   71157.80        1317.7
C. Total  59  228273.08

Parameter Estimates
Term       Estimate    Std Error  t Ratio  Prob>|t|
Intercept  940.6541    94.05424   10.00    <.0001
PRECIP       1.9467286  0.700696   2.78    0.0075
EDUC       -14.66406    6.937846  -2.11    0.0392
NONWHITE     3.028953   0.668519   4.53    <.0001
Log(NOX)     6.7159712  7.39895    0.91    0.3681
Log(SO2)    11.35814    5.295487   2.14    0.0365

Effect Tests
Source    Sum of Squares  F Ratio  Prob > F
PRECIP    10171.388        7.7188  0.0075
EDUC       5886.913        4.4674  0.0392
NONWHITE  27051.227       20.5285  <.0001
Log(NOX)   1085.691        0.8239  0.3681
Log(SO2)   6062.217        4.6005  0.0365

[Residual by Predicted plot: MORT residuals vs. MORT predicted; New Orleans, LA stands out]

An alternative model

Because of the importance of NOX and SO2, one could choose the final model to be MORT vs. PRECIP, NONWHITE, EDUC, Log(NOX) and Log(SO2).

Notice that even though Log(NOX) is not significant, one could still leave it in the model.

Influential points can have an extreme impact on the analysis.

Page 19:

[Leverage plots of MORT leverage residuals vs. each predictor's leverage:
 PRECIP Leverage, P=0.0075
 EDUC Leverage, P=0.0392
 NONWHITE Leverage, P<.0001
 Log NOX Leverage, P=0.3681
 Log SO2 Leverage, P=0.0365]

The enlarged observation New Orleans is an outlier for estimating each coefficient and is highly leveraged for estimating the coefficients of interest on Log(NOX) and Log(SO2). Since New Orleans is both highly leveraged and an outlier, we expect it to be influential.

Page 20:

Multiple Regression with New Orleans

Summary of Fit
RSquare                      0.688278
RSquare Adj                  0.659415
Root Mean Square Error      36.30065
Mean of Response           940.3568
Observations (or Sum Wgts)  60

Analysis of Variance
Source    DF  Sum of Squares  Mean Square  F Ratio  Prob > F
Model      5  157115.28       31423.1      23.8462  <.0001
Error     54   71157.80        1317.7
C. Total  59  228273.08

Parameter Estimates
Term       Estimate    Std Error  t Ratio  Prob>|t|
Intercept  940.6541    94.05424   10.00    <.0001
PRECIP       1.9467286  0.700696   2.78    0.0075
EDUC       -14.66406    6.937846  -2.11    0.0392
NONWHITE     3.028953   0.668519   4.53    <.0001
Log NOX      6.7159712  7.39895    0.91    0.3681
Log SO2     11.35814    5.295487   2.14    0.0365

Multiple Regression without New Orleans

Summary of Fit
RSquare                      0.724661
RSquare Adj                  0.698686
Root Mean Square Error      32.06752
Mean of Response           937.4297
Observations (or Sum Wgts)  59

Analysis of Variance
Source    DF  Sum of Squares  Mean Square  F Ratio  Prob > F
Model      5  143441.28       28688.3      27.8980  <.0001
Error     53   54501.26        1028.3
C. Total  58  197942.54

Parameter Estimates
Term       Estimate    Std Error  t Ratio  Prob>|t|
Intercept  852.3761    85.9328     9.92    <.0001
PRECIP       1.3633298  0.635732   2.14    0.0366
EDUC        -5.666948   6.52378   -0.87    0.3889
NONWHITE     3.0396794  0.590566   5.15    <.0001
Log NOX     -9.898442   7.730645  -1.28    0.2060
Log SO2     26.032584   5.931083   4.39    <.0001

Removing New Orleans has a large impact on the coefficients of Log(NOX) and Log(SO2); in particular, it reverses the sign of the Log(NOX) coefficient.

Page 21:

[Scatterplot matrix of MORT, PRECIP, EDUC, NONWHITE, NOX, SO2, Log(NOX), and Log(SO2)]