

Diagnostics for Linear Regression Models

Prof. David Sibbritt

Session Outline

• Using residuals to check the model (residual plots)

• Multicollinearity diagnostics (variance inflation factor, VIF)

Background

• When a regression model is selected, one cannot usually be certain in advance that the model is appropriate

• It is therefore important to consider the regression diagnostics available, which

  allow us to look for flaws that may affect our parameter estimates

  help us consider whether the assumptions underlying the model are violated and whether our results are heavily impacted by influential observations

Using Residuals to Check the Model

• There are several plots we can use to determine whether there are any departures from the regression model (a Python sketch producing these plots follows the list):

1) The regression function is not linear

   plot residuals vs the independent variable (or the dependent variable)

   residuals should be evenly scattered about the zero line, with no other obvious pattern (e.g. curvature)

2) The error terms do not have constant variance

   plot residuals vs the independent variable (or the dependent variable)

   residuals should be scattered with constant spread about zero (i.e. no ‘fanning out’)

3) The model fits all but one or a few outlier observations

   plot residuals vs the independent variable (or the dependent variable)

   there should not be any residuals positioned far from zero

4) The error terms are not normally distributed

   use a histogram of residuals and/or a normal probability plot

5) One or several important independent variables have been omitted from the model

   plot residuals vs any other independent variables

   residuals should not be systematically raised above, or lowered below, zero
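The following is a minimal Python sketch of these checks (the original slides use SPSS); the simulated data and the column names x1, x2 and y are hypothetical stand-ins:

```python
# Minimal sketch of the residual checks above.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 2 + 1.5 * df["x1"] - 0.5 * df["x2"] + rng.normal(size=100)

model = sm.OLS(df["y"], sm.add_constant(df[["x1", "x2"]])).fit()
resid = model.resid

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Checks 1-3: residuals vs a predictor -- look for curvature,
# fanning out, and points far from zero.
axes[0, 0].scatter(df["x1"], resid)
axes[0, 0].axhline(0)
axes[0, 0].set(xlabel="x1", ylabel="residual")

# Check 2 is often done against the fitted values as well.
axes[0, 1].scatter(model.fittedvalues, resid)
axes[0, 1].axhline(0)
axes[0, 1].set(xlabel="fitted value", ylabel="residual")

# Check 4: histogram and normal probability (Q-Q) plot.
axes[1, 0].hist(resid, bins=15)
axes[1, 0].set(xlabel="residual", ylabel="frequency")
sm.qqplot(resid, line="45", fit=True, ax=axes[1, 1])

plt.tight_layout()
plt.show()
```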

• Below are some example residual plots (assuming a linear regression model) of the kinds that typically result:

  (a) shows all residuals evenly scattered about zero, with no obvious pattern or outliers

      suggests that the model is appropriate

[Figure: four example residual plots, panels (a)–(d); each shows residuals centred on zero plotted against X, except panel (d), which plots them against Time]

  (b) shows that as the predictor variable X increases, so does the variation in the residuals (fanning out)

      suggests that the model is not appropriate

  (c) shows a definite pattern (curved) in the residuals

      suggests that the model is not appropriate (and also that a curved model would be a better choice)
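As an illustration (not from the original slides), the ‘fanning out’ of panel (b) and the curved pattern of panel (c) can be reproduced by simulating data that violate the corresponding assumption; the data here are entirely synthetic:

```python
# Simulating the residual patterns of panels (b) and (c):
# non-constant variance and a nonlinear regression function.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = np.linspace(1, 10, 200)

# (b) the error variance grows with X, so the residuals fan out
y_fan = 2 + 3 * x + rng.normal(scale=0.5 * x)

# (c) the true function is curved, so a straight-line fit
# leaves a curved pattern in the residuals
y_curve = 2 + 3 * x + 0.8 * x**2 + rng.normal(scale=2.0, size=x.size)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, y, title in [(axes[0], y_fan, "(b) fanning out"),
                     (axes[1], y_curve, "(c) curved pattern")]:
    resid = sm.OLS(y, sm.add_constant(x)).fit().resid
    ax.scatter(x, resid)
    ax.axhline(0)
    ax.set(title=title, xlabel="X", ylabel="residual")
plt.tight_layout()
plt.show()
```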

• In SPSS, we can produce residuals as follows:

  [SPSS screenshots omitted from the transcript]

• The residuals should be randomly scattered about zero

Notes of caution:

a) Range of observations

   even if the fit appears satisfactory for the observations we have available, the model may not be a good fit when extended outside the range of past observations

b) Causality

   the presence of a regression relation between two variables does not imply a cause-and-effect relation between them

Multicollinearity Diagnostics – Variance Inflation Factor

• In multiple linear regression, problems can arise when the independent variables being considered for the regression model are highly correlated among themselves

  i.e. the correlated variables will have a similar relationship with the dependent variable

• There is a highly useful diagnostic: the variance inflation factor

• The variance inflation factor (VIF) measures how much the variances of the estimated regression coefficients are inflated compared to when the independent variables are not linearly related

• The largest VIF value among all X variables is often used as an indicator of the severity of multicollinearity

  a maximum VIF value in excess of 10 is often taken as an indication that multicollinearity may be unduly influencing the least squares estimates

• Mean VIF values considerably larger than 1 are indicative of serious multicollinearity problems

• In general, the VIF for the jth regression coefficient can be written as

  VIF_j = 1 / (1 − R_j²)

  where R_j² is the coefficient of multiple determination obtained from regressing X_j on the other regressor variables

• If X_j is nearly linearly dependent on some of the other regressors, then R_j² will be near unity and VIF_j will be large

• If X_j is orthogonal to the remaining predictors, its VIF will be 1

• When strong multicollinearity is present, regression models fit to data by the method of least squares are notoriously poor prediction equations, and the values of the regression coefficients are often very sensitive to the data in the particular sample collected
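As an illustration (not from the original slides), the following Python sketch computes VIF_j both directly from the definition above and with statsmodels' variance_inflation_factor; the simulated predictors and their names are hypothetical:

```python
# Computing VIF_j = 1 / (1 - R_j^2) by regressing each X_j on the
# other predictors (column names here are hypothetical).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.1 * rng.normal(size=200),  # nearly collinear with x1
    "x3": rng.normal(size=200),             # roughly orthogonal
})

# From the definition: regress X_j on the remaining predictors.
for col in X.columns:
    others = sm.add_constant(X.drop(columns=col))
    r2 = sm.OLS(X[col], others).fit().rsquared
    print(col, "VIF =", 1 / (1 - r2))

# Equivalent per-column computation via statsmodels.
Xc = sm.add_constant(X)
for j, col in enumerate(Xc.columns):
    if col != "const":
        print(col, "VIF =", variance_inflation_factor(Xc.values, j))
```

Note that a constant term is included before calling variance_inflation_factor, so that each R_j² is computed about the mean; x1 and x2 then show large VIFs while x3 stays near 1.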

Example: AIS Athletes Study

• Maximum VIF = 5.01 < 10 (not too bad)

• Mean VIF = 3.73 > 1 (not too bad?)

Coefficients(a)

Model           Unstandardized Coefficients   Standardized Coefficients      t      Sig.   Collinearity Statistics
                     B         Std. Error              Beta                                 Tolerance     VIF
1  (Constant)    -63.714        47.543                                    -1.340    .182
   RCC           -12.359        15.533                 -.118                -.796    .427      .211       4.747
   Hg             13.884         5.356                  .396               2.592    .010      .200       5.012
   Bfat            -.233          .626                 -.030               -.372    .710      .700       1.428

a. Dependent Variable: Ferr

Mean VIF = ([4.747 + 5.012 + 1.428] ÷ 3) = 3.73
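As a quick cross-check (not in the original slides): SPSS reports Tolerance = 1 − R_j², so each VIF is simply 1/Tolerance, and the mean follows directly; small differences from the printed VIFs are due to rounding in the Tolerance column:

```python
# Recover each VIF from the Tolerance column (VIF = 1 / Tolerance)
# and average them; values match the table up to rounding.
tolerance = {"RCC": 0.211, "Hg": 0.200, "Bfat": 0.700}
vif = {name: 1 / t for name, t in tolerance.items()}
print(vif)                           # {'RCC': 4.74..., 'Hg': 5.0, 'Bfat': 1.43...}
print(sum(vif.values()) / len(vif))  # 3.72..., cf. mean VIF = 3.73 above
```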

