STATS 330: Lecture 8
Collinearity

Aims of today's lecture:
• To explain the idea of collinearity and its connection with estimating regression coefficients
• To discuss added variable plots, a graphical method for deciding if a variable should be added to a regression
Variance of regression coefficients
We saw in Lecture 6 how the standard errors of the regression coefficients depend on the error variance σ²: the bigger σ², the bigger the standard errors.
We also suggested that the standard error depends on the arrangement of the x’s.
In today’s lecture, we explore this idea a bit further.
Example

Suppose we have a regression relationship of the form

$$Y = 1 + 2x - w + \varepsilon$$

between a response variable Y and two explanatory variables x and w.
Consider two data sets, A and B, each following the model above.
Data sets A & B: x,w data

[Figure: scatterplots of w against x for the two data sets. Data set A: corr = -0.149. Data set B: corr = 0.991.]
Conclusion: the greater the correlation between x and w, the more variable the fitted plane. In fact, for the coefficient of x,

$$\operatorname{var}(\hat\beta_x) = \frac{\sigma^2/(n-1)}{\operatorname{var}(x)\,(1-r^2)}$$

where r is the correlation between x and w.
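A small simulation illustrates the effect. This sketch is not from the lecture; sim.se is a hypothetical helper, and the correlations roughly match data sets A and B:

# Spread of the estimated coefficient of x when x and w have correlation r,
# for the model Y = 1 + 2x - w + error
sim.se <- function(r, n = 50, nrep = 1000) {
  b <- replicate(nrep, {
    x <- rnorm(n)
    w <- r * x + sqrt(1 - r^2) * rnorm(n)   # corr(x, w) is about r
    y <- 1 + 2 * x - w + rnorm(n)
    coef(lm(y ~ x + w))["x"]
  })
  sd(b)
}
set.seed(330)
sim.se(-0.149)   # low correlation: coefficient estimated precisely
sim.se(0.991)    # high correlation: estimate much more variable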
Generalization

If we have k explanatory variables, then the variance of the jth estimated coefficient is

$$\operatorname{var}(\hat\beta_j) = \frac{\sigma^2/(n-1)}{\operatorname{var}(x_j)\,(1-R_j^2)}$$

where R_j² is the R² obtained when we regress variable j on the other explanatory variables.
Best case

If x_j is orthogonal to (uncorrelated with) the other explanatory variables, then R_j² is zero and the variance is the smallest possible, i.e.

$$\operatorname{var}(\hat\beta_j) = \frac{\sigma^2/(n-1)}{\operatorname{var}(x_j)}$$
Variance inflation factor
The factor

$$\mathrm{VIF}_j = \frac{1}{1-R_j^2}$$

represents the increase in variance caused by correlation between the explanatory variables, and is called the variance inflation factor (VIF).
Calculating the VIF: theory
To calculate the VIF for the jth explanatory variable, use the relationship

$$\mathrm{VIF}_j = \frac{1}{1-R_j^2} = \frac{1}{RSS_j/TSS_j} = \frac{TSS_j}{RSS_j} = \frac{\operatorname{var}(\text{variable } j)}{\operatorname{var}(\text{residuals})}$$

using the residuals from regressing the jth explanatory variable on the other explanatory variables.
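The same VIF can also be computed directly from the R² of that regression. A minimal sketch, assuming the vapour.df data frame and the tvp.reg fit used in the example on the next slide:

# VIF for t.vp from R-squared
tvp.reg <- lm(t.vp ~ t.temp + p.temp + p.vp, data = vapour.df)
1/(1 - summary(tvp.reg)$r.squared)   # about 66.14, matching the variance ratio below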
Calculating the VIF: example
For the petrol data, calculate the VIF for t.vp (tank vapour pressure):

> attach(vapour.df)
> tvp.reg <- lm(t.vp ~ t.temp + p.temp + p.vp, data = vapour.df)
> var(t.vp)/var(residuals(tvp.reg))
[1] 66.13817

Correlation increases the variance by a factor of 66.
Calculating the VIF: quick method
A useful mathematical relationship: if we calculate the inverse of the correlation matrix of the explanatory variables, the VIFs are the diagonal elements.

> X <- vapour.df[,-5]   # delete 5th column (hc, the response)
> VIF <- diag(solve(cor(X)))
> VIF
   t.temp    p.temp      t.vp      p.vp
11.927292  5.615662 66.138172 60.938695
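This works because the jth diagonal element of the inverse of a correlation matrix equals 1/(1 − R_j²), which is exactly VIF_j. A quick sanity check, a sketch reusing the objects defined above:

# The quick method agrees with the variance-ratio calculation
all.equal(unname(VIF["t.vp"]),
          var(vapour.df$t.vp) / var(residuals(tvp.reg)))   # TRUE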
Pairs plot
[Figure: pairs plot of t.temp, p.temp, t.vp, p.vp and the response hc for the vapour data.]
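The plot was presumably produced by something like:

> pairs(vapour.df)   # scatterplot matrix of all variables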
Collinearity

If one or more variables in a regression have big VIFs, the regression is said to be collinear.
• Caused by one or more variables being almost linear combinations of the others
• Sometimes indicated by high correlations between the explanatory variables
• Results in imprecise estimation of regression coefficients
• Standard errors are high, so t-statistics are small and variables are often non-significant (the data are insufficient to detect a difference)
Non-significance

If a variable has a non-significant t, then either:
• the variable is not related to the response, or
• the variable is related to the response, but is not required in the regression because it is strongly related to a third variable that is in the regression, so we don't need both.

First case: small t-value, small VIF, small correlation with response.
Second case: small t-value, big VIF, big correlation with response.
Remedy

The usual remedy is to drop one or more variables from the model. This "breaks" the linear relationship between the variables.
This leads to the problem of "subset selection": which subset to choose? See Lectures 14 and 15.
Example: Cement data
Measurements on batches of cement.
Response variable: Heat (heat emitted)
Explanatory variables:
• X1: amount of tricalcium aluminate (%)
• X2: amount of tricalcium silicate (%)
• X3: amount of tetracalcium aluminoferrite (%)
• X4: amount of dicalcium silicate (%)
Example: Cement data
 Heat  X1  X2  X3  X4
 78.5   7  26   6  60
 74.3   1  29  15  52
104.3  11  56   8  20
 87.6  11  31   8  47
 95.9   7  52   6  33
109.2  11  55   9  22
102.7   3  71  17   6
 72.5   1  31  22  44
 93.1   2  54  18  22
115.9  21  47   4  26
 83.8   1  40  23  34
113.3  11  66   9  12
109.4  10  68   8  12
Example: Cement data

            Estimate Std. Error t value Pr(>|t|)
(Intercept)  62.4054    70.0710   0.891   0.3991
X1            1.5511     0.7448   2.083   0.0708 .
X2            0.5102     0.7238   0.705   0.5009
X3            0.1019     0.7547   0.135   0.8959
X4           -0.1441     0.7091  -0.203   0.8441

Residual standard error: 2.446 on 8 degrees of freedom
Multiple R-Squared: 0.9824, Adjusted R-squared: 0.9736
F-statistic: 111.5 on 4 and 8 DF, p-value: 4.756e-07

> round(cor(cement.df), 2)
     Heat    X1    X2    X3    X4
Heat 1.00  0.73  0.82 -0.53 -0.82
X1   0.73  1.00  0.23 -0.82 -0.25
X2   0.82  0.23  1.00 -0.14 -0.97
X3  -0.53 -0.82 -0.14  1.00  0.03
X4  -0.82 -0.25 -0.97  0.03  1.00

Note the large p-values despite the big R-squared, and the big correlations between the explanatory variables (e.g. -0.97 between X2 and X4): classic signs of collinearity.
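The output above was presumably produced by a call like this, assuming the data shown earlier are in cement.df:

> cement.lm <- lm(Heat ~ X1 + X2 + X3 + X4, data = cement.df)
> summary(cement.lm)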
Cement data

> diag(solve(cor(cement.df[, -1])))   # omit Heat (column 1)
       X1        X2        X3        X4
 38.49621 254.42317  46.86839 282.51286

The VIFs, especially for X2 and X4, are very large. The reason: the four ingredient percentages add to almost exactly 100, so the explanatory variables are very nearly linearly dependent.

> cement.df$X1 + cement.df$X2 + cement.df$X3 + cement.df$X4
 [1] 99 97 95 97 98 97 97 98 96 98 98 98 98
Drop X4

> diag(solve(cor(cement.df[, -c(1,5)])))
      X1       X2       X3
3.251068 1.063575 3.142125

            Estimate Std. Error t value Pr(>|t|)
(Intercept) 48.19363    3.91330  12.315 6.17e-07 ***
X1           1.69589    0.20458   8.290 1.66e-05 ***
X2           0.65691    0.04423  14.851 1.23e-07 ***
X3           0.25002    0.18471   1.354    0.209

Residual standard error: 2.312 on 9 degrees of freedom
Multiple R-Squared: 0.9823, Adjusted R-squared: 0.9764
F-statistic: 166.3 on 3 and 9 DF, p-value: 3.367e-08

The VIFs are now small, and X1 and X2 are now significant.
Added variable plots (AVPs)

To see if a variable, say x, is needed in a regression:
• Step 1: Calculate the residuals from regressing the response on all the explanatory variables except x
• Step 2: Calculate the residuals from regressing x on the other explanatory variables
• Step 3: Plot the first set of residuals versus the second set (see the sketch after this list)

NB: Also called partial regression plots in some books.
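The three steps can be wrapped up as a helper function. A minimal sketch; avp_resids is a hypothetical name, not part of R or the R330 package:

# Added variable plot for one variable, by hand.
# 'response' and 'x' name columns of 'data'; the remaining columns
# are taken to be the other explanatory variables.
avp_resids <- function(data, response, x) {
  others <- setdiff(names(data), c(response, x))
  rhs    <- paste(others, collapse = " + ")
  y.res  <- residuals(lm(as.formula(paste(response, "~", rhs)), data = data)) # step 1
  x.res  <- residuals(lm(as.formula(paste(x, "~", rhs)), data = data))        # step 2
  plot(x.res, y.res, xlab = paste("Residuals of", x),
       ylab = paste("Residuals of", response))                                # step 3
  invisible(data.frame(x.res, y.res))
}

For example, avp_resids(vapour.df, "hc", "t.vp") reproduces the plot built step by step in the petrol example below.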
Rationale

The first set of residuals represents the variation in y not explained by the other explanatory variables.
The second set of residuals represents the part of x not explained by the other explanatory variables.
If there is a relationship between the two sets, there is a relationship between x and the response that is not accounted for by the other explanatory variables.
Thus, if we see a relationship in the plot, x is needed in the regression!
Example: the petrol data
Let's do an AVP for tank vapour pressure, t.vp:

> rest.reg <- lm(hc ~ t.temp + p.temp + p.vp, data = vapour.df)
> y.res <- residuals(rest.reg)
> tvp.reg <- lm(t.vp ~ t.temp + p.temp + p.vp, data = vapour.df)
> tvp.res <- residuals(tvp.reg)
> plot(tvp.res, y.res, xlab = "Tank vapour pressure",
+      ylab = "Hydrocarbon emission", main = "AVP for Tank vapour pressure")
[Figure: AVP for Tank vapour pressure, plotting hydrocarbon emission residuals against tank vapour pressure residuals. Hint of a relationship: so the variable is required?]
Short cut in R

There is a function added.variable.plots in R to draw the plots automatically. It is one of the functions in the R330 package, which must be installed before the function can be used.

> vapour.lm <- lm(hc ~ ., data = vapour.df)
> par(mfrow = c(2,2))   # 2 x 2 array of plots: note this useful trick!
> added.variable.plots(vapour.lm)
[Figure: added variable plots for all four explanatory variables; one panel is annotated "Not significant in the regression (Lecture 7)".]
Some curious facts about AVPs

Assuming a constant term in both regressions:
• A least squares line fitted through the AVP goes through the origin.
• The slope of this line is the fitted regression coefficient for the variable in the original regression.
• The residuals from this line are the same as the residuals from the original regression.
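These facts can be checked numerically. A sketch, assuming vapour.df and the objects y.res and tvp.res from the petrol example above:

> avp.line <- lm(y.res ~ tvp.res)
> full.reg <- lm(hc ~ t.temp + p.temp + t.vp + p.vp, data = vapour.df)
> coef(avp.line)[1]                       # intercept: essentially zero
> coef(avp.line)[2]                       # slope: equals coef(full.reg)["t.vp"]
> all.equal(unname(residuals(avp.line)),  # AVP residuals match the
+           unname(residuals(full.reg)))  # full-regression residuals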