topic 7 – other regression issues reading: some parts of chapters 11 and 15
![Page 1: Topic 7 – Other Regression Issues Reading: Some parts of Chapters 11 and 15](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c735503460f94925d79/html5/thumbnails/1.jpg)
Topic 7 – Other Regression Issues
Reading: Some parts of
Chapters 11 and 15
![Page 2: Topic 7 – Other Regression Issues Reading: Some parts of Chapters 11 and 15](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c735503460f94925d79/html5/thumbnails/2.jpg)
Overview
Confounding (Chapter 11)
Interaction (Chapter 11)
Using Polynomial Terms (Chapter 15)
![Page 3: Topic 7 – Other Regression Issues Reading: Some parts of Chapters 11 and 15](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c735503460f94925d79/html5/thumbnails/3.jpg)
Regression: Primary Goals
We usually are focused on one of the following goals:
Predicting the response variable based on a set of predictors (reliability)
Quantifying the relationship between the predictors and the response (interpretability)
In both situations, confounding and interaction can be concerns.
![Page 4: Topic 7 – Other Regression Issues Reading: Some parts of Chapters 11 and 15](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c735503460f94925d79/html5/thumbnails/4.jpg)
What is “Confounding”?
We saw this with the Smoking and Age predictors in our SBP example.
We consider the relationship of SBP to…
Smoking Status alone
Smoking Status along with age
Our interest is in determining whether smoking raises blood pressure.
![Page 5: Topic 7 – Other Regression Issues Reading: Some parts of Chapters 11 and 15](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c735503460f94925d79/html5/thumbnails/5.jpg)
SBP Example Continued
![Page 6: Topic 7 – Other Regression Issues Reading: Some parts of Chapters 11 and 15](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c735503460f94925d79/html5/thumbnails/6.jpg)
Smoking is confounded with Age
Smoking by itself is not significant
Without age, we are not able to see a difference in the smoking groups.
(The groups are actually different, but we cannot see it until we add age, a covariate.)
Parameter Estimates

| Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|---|---|---|---|---|---|
| Intercept | 1 | 140.80000 | 3.66147 | 38.45 | <.0001 |
| SMK | 1 | 7.02353 | 5.02350 | 1.40 | 0.1723 |

![Page 7: Topic 7 – Other Regression Issues Reading: Some parts of Chapters 11 and 15](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c735503460f94925d79/html5/thumbnails/7.jpg)
Smoking is confounded with Age (2)
Smoking variable tests significant
After adjusting for age, the two smoking groups are clearly different!
Parameter Estimates

| Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|---|---|---|---|---|---|
| Intercept | 1 | 48.04960 | 11.12956 | 4.32 | 0.0002 |
| AGE | 1 | 1.70916 | 0.20176 | 8.47 | <.0001 |
| SMK | 1 | 10.29439 | 2.76811 | 3.72 | 0.0009 |

![Page 8: Topic 7 – Other Regression Issues Reading: Some parts of Chapters 11 and 15](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c735503460f94925d79/html5/thumbnails/8.jpg)
Estimates
The effect of smoking is confounded with age: if we don’t first adjust for age, we won’t see the effect of smoking accurately.
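To make the change in significance concrete, the t values in the two tables can be reproduced as estimate divided by standard error. This is a small sketch that only reuses the numbers printed in the SAS output; the dictionary layout and function name are our own.

```python
# Estimates and standard errors copied from the two SAS tables above.
smk_only = {"Intercept": (140.80000, 3.66147), "SMK": (7.02353, 5.02350)}
age_smk = {
    "Intercept": (48.04960, 11.12956),
    "AGE": (1.70916, 0.20176),
    "SMK": (10.29439, 2.76811),
}

def t_ratios(table):
    """Return {term: estimate / standard error} for one fitted model."""
    return {term: est / se for term, (est, se) in table.items()}

t_unadjusted = t_ratios(smk_only)
t_adjusted = t_ratios(age_smk)

for name, ratios in [("SMK only", t_unadjusted), ("AGE + SMK", t_adjusted)]:
    for term, t in ratios.items():
        print(f"{name:10s} {term:10s} t = {t:6.2f}")

# The SMK standard error shrinks once AGE absorbs its share of the variation.
se_shrink = smk_only["SMK"][1] / age_smk["SMK"][1]
print(f"SE(SMK) is about {se_shrink:.1f} times smaller after adjusting for age")
```

The SMK estimate grows while its standard error nearly halves, and both changes push the t value up.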
![Page 9: Topic 7 – Other Regression Issues Reading: Some parts of Chapters 11 and 15](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c735503460f94925d79/html5/thumbnails/9.jpg)
Confounding
Confounding exists if meaningfully different interpretations of a relationship of interest can be made depending on whether or not a nuisance variable (or covariate) is included in the model.
How to find confounding?
Get lucky and stumble upon it (like we did)
Look for it intentionally by running a lot of different models and watching for variables that aren’t significant at first but become significant when adding other variables (covariates).
![Page 10: Topic 7 – Other Regression Issues Reading: Some parts of Chapters 11 and 15](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c735503460f94925d79/html5/thumbnails/10.jpg)
Confounding (2)
If confounding is present, it may lead to inaccurate results if we are not careful: important covariates MUST be included (even if they aren’t significant themselves!)
Making the variable of interest significant is enough to warrant including the covariate
If we had failed to adjust for age, we would not have gotten a good estimate of the difference due to smoking, and we would also have wrongly concluded that smoking status doesn’t matter.
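One way to see how far off the unadjusted estimate is uses a standard least-squares identity: the unadjusted SMK coefficient equals the adjusted one plus the AGE coefficient times the difference in mean age between smokers and non-smokers. The sketch below only rearranges the estimates printed in the two tables. The variable names are ours, and the age gap is backed out from the estimates rather than read from the data.

```python
# Estimates copied from the two SAS tables for the SBP example.
b_smk_unadjusted = 7.02353   # SMK coefficient, model without AGE
b_smk_adjusted = 10.29439    # SMK coefficient, model with AGE
b_age = 1.70916              # AGE coefficient, model with AGE

# Least-squares identity: unadjusted = adjusted + b_age * delta, where delta is
# the mean age of smokers minus the mean age of non-smokers.
delta = (b_smk_unadjusted - b_smk_adjusted) / b_age
bias = b_age * delta

print(f"implied age gap (smokers - non-smokers): {delta:.2f} years")
print(f"bias in the unadjusted smoking effect:   {bias:.2f} SBP units")
```

Under this identity the smokers in the sample come out slightly younger on average, which is what hides part of the smoking effect when age is left out.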
![Page 11: Topic 7 – Other Regression Issues Reading: Some parts of Chapters 11 and 15](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c735503460f94925d79/html5/thumbnails/11.jpg)
Confounding vs. Multicollinearity
Parameter estimates will change wildly when (multi)collinearity is involved too!
They are almost opposite
SE’s increase and X1 becomes insignificant (added last) when X2 is in the model – (MULTI)COLLINEARITY
This (usually) works both ways—both variables “fight”
SE’s decrease and X1 becomes significant (added last) only when X2 is in the model – CONFOUNDING
Confounding is usually only one way: the covariate (Z) helps the confounded variable (X). Here, Age is helping Smoking.
![Page 12: Topic 7 – Other Regression Issues Reading: Some parts of Chapters 11 and 15](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c735503460f94925d79/html5/thumbnails/12.jpg)
Confounding vs. Multicollinearity (2)
Can catch (multi)collinearity in the correlation matrix
Any single correlation > 0.9 suggests collinearity between just those two predictors
Any predictor that has several correlations between 0.5 and 0.9 with other predictors suggests multicollinearity
For confounding, there will usually be some correlation between X and Z but it will not be very large.
Our example: the correlation between age and smk is only about 0.13 in magnitude.
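The two rules of thumb for reading a correlation matrix can be sketched as a small screening function. This is only an illustration: the function name, the made-up correlation matrix, and the choice that "several" means at least two are all assumptions, not part of the slides.

```python
def screen_correlations(names, corr, pair_cut=0.9, band=(0.5, 0.9), several=2):
    """Flag predictors using the two rules of thumb from the slide.

    corr is a square list of lists of pairwise correlations.
    several=2 is an assumption: the slide only says "several values".
    """
    collinear_pairs = []
    multicollinear = []
    for i, name in enumerate(names):
        in_band = 0
        for j, other in enumerate(names):
            if i == j:
                continue
            r = abs(corr[i][j])
            # Rule 1: one very large correlation (report each pair once).
            if r > pair_cut and i < j:
                collinear_pairs.append((name, other))
            # Rule 2: count moderate correlations for this predictor.
            if band[0] <= r <= band[1]:
                in_band += 1
        if in_band >= several:
            multicollinear.append(name)
    return collinear_pairs, multicollinear

# Hypothetical correlation matrix for four made-up predictors.
names = ["x1", "x2", "x3", "x4"]
corr = [
    [1.00, 0.95, 0.60, 0.10],
    [0.95, 1.00, 0.55, 0.05],
    [0.60, 0.55, 1.00, 0.20],
    [0.10, 0.05, 0.20, 1.00],
]
pairs, multi = screen_correlations(names, corr)
print("collinear pairs:", pairs)
print("possible multicollinearity:", multi)
```

A correlation as small as the one between age and smk would pass both checks, which is consistent with the problem there being confounding rather than collinearity.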
![Page 13: Topic 7 – Other Regression Issues Reading: Some parts of Chapters 11 and 15](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c735503460f94925d79/html5/thumbnails/13.jpg)
Interaction
Interaction is (sort of) one step beyond confounding – not only does it make a difference to adjust for Z, but the relationship between Y and X is fundamentally different at different levels of Z.
Can think of this as having a different regression line for each fixed level of Z. With no interaction, these lines would be parallel.
![Page 14: Topic 7 – Other Regression Issues Reading: Some parts of Chapters 11 and 15](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c735503460f94925d79/html5/thumbnails/14.jpg)
SBP Example
We found Age and Smk to both be important. Is it possible that they are interacting?
X = age
Z = 0 for non-smokers, 1 for smokers
![Page 15: Topic 7 – Other Regression Issues Reading: Some parts of Chapters 11 and 15](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c735503460f94925d79/html5/thumbnails/15.jpg)
![Page 16: Topic 7 – Other Regression Issues Reading: Some parts of Chapters 11 and 15](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c735503460f94925d79/html5/thumbnails/16.jpg)
Interaction
Looking at plots can give us some idea of interaction (are the lines parallel?). However...
It is very easy to just test to see if the XZ interaction term is important.
Treat it just as you would any other variable and do a partial F-test.
Note that if a model includes XZ interaction term, it should also include X and Z main effects. We would never just look at the XZ term by itself.
![Page 17: Topic 7 – Other Regression Issues Reading: Some parts of Chapters 11 and 15](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c735503460f94925d79/html5/thumbnails/17.jpg)
Age/Smk Interaction Model
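Interaction is described mathematically using a product term:

$$Y = \beta_0 + \beta_1 X + \beta_2 Z + \beta_{12} XZ + \varepsilon$$

Or just:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$$

where $X_3 = X_1 X_2$.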
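Plugging the two values of Z into the product-term model shows why interaction means non-parallel lines. A short derivation, assuming Z is the 0/1 smoking indicator defined earlier, X is age, and the coefficients are written $\beta_0, \beta_1, \beta_2, \beta_{12}$:

```latex
\begin{aligned}
Z = 0 \text{ (non-smokers)}: \quad & E[Y \mid X] = \beta_0 + \beta_1 X \\
Z = 1 \text{ (smokers)}: \quad & E[Y \mid X] = (\beta_0 + \beta_2) + (\beta_1 + \beta_{12}) X
\end{aligned}
```

The intercepts differ by $\beta_2$ and the slopes differ by $\beta_{12}$, so testing $\beta_{12} = 0$ is the same as testing whether the two lines are parallel.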
![Page 18: Topic 7 – Other Regression Issues Reading: Some parts of Chapters 11 and 15](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c735503460f94925d79/html5/thumbnails/18.jpg)
SBP Example
The interaction term tests insignificant, so there is no significant interaction between age and smk
Suppose it was significant
Would then have to keep the age_smk interaction term AS WELL AS both the age and smk variables (even if age and smk themselves are insignificant)

| Source | DF | Type I SS | F Value | Pr > F |
|---|---|---|---|---|
| age | 1 | 3862 | 64.84 | <.0001 |
| smk | 1 | 828 | 13.90 | 0.0009 |
| age_smk | 1 | 69 | 1.15 | 0.2918 |

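Because age_smk is entered last, its Type I row is the partial F-test for the interaction: the extra sum of squares divided by the full model's mean squared error (MSE). The table does not print the MSE, so the sketch below backs it out from the rows themselves. This is arithmetic on the rounded table values, not a rerun of the analysis, and the names are ours.

```python
# (Type I SS, F value) pairs copied from the SAS table above; each term has 1 df.
type1 = {"age": (3862, 64.84), "smk": (828, 13.90), "age_smk": (69, 1.15)}

# For a 1-df term, F = SS / MSE, so every row implies the same MSE (up to rounding).
implied_mse = {term: ss / f for term, (ss, f) in type1.items()}
for term, mse in implied_mse.items():
    print(f"{term:8s} implied MSE = {mse:.1f}")

# Rebuild the interaction F using the MSE implied by the age row, which is
# the row least affected by rounding.
f_interaction = type1["age_smk"][0] / implied_mse["age"]
print(f"F for age_smk = {f_interaction:.2f}")
```

All three rows point to an MSE of roughly 60, and the rebuilt interaction F agrees with the table's 1.15 up to rounding of the sums of squares.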
![Page 19: Topic 7 – Other Regression Issues Reading: Some parts of Chapters 11 and 15](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c735503460f94925d79/html5/thumbnails/19.jpg)
Confounding vs. Interaction
Y = response
X = predictor
Z = covariate / 2nd predictor
Is the estimated relationship between Y and X dramatically different if one adjusts or does not adjust for Z? If so: confounding.
Is the estimated relationship between Y and X meaningfully different at different values of Z? If so: interaction.
![Page 20: Topic 7 – Other Regression Issues Reading: Some parts of Chapters 11 and 15](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c735503460f94925d79/html5/thumbnails/20.jpg)
Correlations
One problem with using interaction terms is that they tend to be highly correlated with one or both of the original variables
In our example: Correlation between SMK and AGE_SMK turned out to be 0.98
This is NOT REAL!!! It is a form of “fake” collinearity, the variables aren’t really “fighting” to explain SS
To remove this “fake” collinearity just center the variables
Subtract the mean from all predictors
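This doesn’t change the fit of the model or the test of the interaction term (the highest-order term); it only removes what we are calling fake collinearity. (Estimates and t-tests for the main effects can change, because after centering they describe each effect at the mean of the other variable.)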
![Page 21: Topic 7 – Other Regression Issues Reading: Some parts of Chapters 11 and 15](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c735503460f94925d79/html5/thumbnails/21.jpg)
How to center? SBP Example
Mean age was 53.25: subtract 53.25 from all the ages in the dataset and use these new values in the analysis.
Mean smk was 0.53125: do the same thing (subtract 0.53125 from every smk value).
After centering:
Correlation between SMK and AGE_SMK is now 0.017 (so they weren’t really fighting, it just looked like it because we didn’t center)
Maybe we should always center???
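The effect of centering on the "fake" collinearity can be reproduced on a toy dataset. The sketch below uses made-up ages and smoking indicators (NOT the SBP data), so the correlations it prints will not match the 0.98 and 0.017 quoted above; it only illustrates the same pattern. The helper `pearson` is our own.

```python
from math import sqrt

def pearson(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / sqrt(suu * svv)

# Made-up data (NOT the SBP dataset): 12 ages and a 0/1 smoking indicator.
age = [40, 45, 50, 55, 60, 65, 42, 48, 52, 58, 62, 66]
smk = [0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1]

# Raw product term and its correlation with smk.
age_smk = [a * s for a, s in zip(age, smk)]
r_raw = pearson(smk, age_smk)

# Center both predictors first, then form the product.
mean_age, mean_smk = sum(age) / len(age), sum(smk) / len(smk)
age_c = [a - mean_age for a in age]
smk_c = [s - mean_smk for s in smk]
age_smk_c = [a * s for a, s in zip(age_c, smk_c)]
r_centered = pearson(smk_c, age_smk_c)

print(f"corr(smk, age*smk), raw:      {r_raw:.2f}")
print(f"corr(smk, age*smk), centered: {r_centered:.2f}")
```

With the raw product, age*smk is essentially "age for smokers, zero otherwise", so it tracks smk almost perfectly; after centering, the product no longer carries that built-in link.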
![Page 22: Topic 7 – Other Regression Issues Reading: Some parts of Chapters 11 and 15](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c735503460f94925d79/html5/thumbnails/22.jpg)
Polynomial Regression
Chapter 15
![Page 23: Topic 7 – Other Regression Issues Reading: Some parts of Chapters 11 and 15](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c735503460f94925d79/html5/thumbnails/23.jpg)
General Uses
Polynomial models used in situations where the relationship between Y and X is non-linear
Can usually see it in scatterplots
Should definitely catch it in residual plots!
Somewhat dangerous, since a polynomial model of order n – 1 will always fit n data points (with distinct X values) exactly.
Example?
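As one possible example, the sketch below passes a cubic (order 4 – 1) through four made-up points that have no real pattern. It uses Lagrange's interpolation formula with exact fractions rather than a regression routine; the least-squares fit of a cubic to these four points would be the same curve. The function name and the data are ours.

```python
from fractions import Fraction

def lagrange(points):
    """Return the unique polynomial of degree <= n-1 through n points,
    as a function, using Lagrange's formula (x values must be distinct)."""
    pts = [(Fraction(x), Fraction(y)) for x, y in points]

    def p(x):
        x = Fraction(x)
        total = Fraction(0)
        for i, (xi, yi) in enumerate(pts):
            term = yi
            for j, (xj, _) in enumerate(pts):
                if i != j:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total

    return p

# Four made-up points: the cubic still passes through all of them,
# so every residual is exactly zero.
points = [(1, 3), (2, -1), (3, 4), (4, 0)]
cubic = lagrange(points)
residuals = [y - cubic(x) for x, y in points]
print(residuals)            # all zero
print(float(cubic(2.5)))    # value of the curve between two data points
```

A perfect fit here says nothing about the relationship between X and Y; the curve simply has as many free coefficients as there are data points.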
![Page 24: Topic 7 – Other Regression Issues Reading: Some parts of Chapters 11 and 15](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c735503460f94925d79/html5/thumbnails/24.jpg)
Strategy for fitting
CENTER your variables to avoid the “fake” (multi)collinearity.
Use a special type of backward elimination procedure: test the highest order term first!
If a higher order term is significant, you MUST include all lower order terms for that variable
![Page 25: Topic 7 – Other Regression Issues Reading: Some parts of Chapters 11 and 15](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c735503460f94925d79/html5/thumbnails/25.jpg)
Example
Problem 15.7 (sas/data available online)
X = amount of vaccine, Y = measure of skin response in rats.
12 data points
If we run just a simple linear regression, the R-square is only 45%. We will consider a polynomial model and try to do better!
![Page 26: Topic 7 – Other Regression Issues Reading: Some parts of Chapters 11 and 15](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c735503460f94925d79/html5/thumbnails/26.jpg)
Scatter Plot
![Page 27: Topic 7 – Other Regression Issues Reading: Some parts of Chapters 11 and 15](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c735503460f94925d79/html5/thumbnails/27.jpg)
Residual plot
![Page 28: Topic 7 – Other Regression Issues Reading: Some parts of Chapters 11 and 15](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c735503460f94925d79/html5/thumbnails/28.jpg)
Cubic Model
x is $X$, x2 is $X^2 = X \cdot X$, x3 is $X^3 = X \cdot X \cdot X$, etc.
$X^3$ is important, so we must keep $X^2$ and $X$. Why?
The cubic model (with $X$, $X^2$, and $X^3$) now explains 82% of the variation (it was only 45% for the linear model)

| Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|---|---|---|---|---|---|
| Intercept | 1 | 14.53498 | 0.20011 | 72.63 | <.0001 |
| x | 1 | -0.54454 | 0.11047 | -4.93 | 0.0012 |
| x2 | 1 | 0.12179 | 0.05386 | 2.26 | 0.0536 |
| x3 | 1 | 0.28852 | 0.08177 | 3.53 | 0.0077 |

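The jump from 45% to 82% can also be checked with a partial F-test for adding the squared and cubed terms together. The sketch below computes that statistic from the two R-square values; it relies on the rounded percentages quoted in the slides and the 12 data points, so the result is approximate. The function name is ours.

```python
def partial_f_from_r2(r2_full, r2_reduced, n, p_full, p_reduced):
    """Partial F statistic for the terms added going from the reduced to the
    full model, computed from the two R-square values.

    p_full and p_reduced count predictors, not including the intercept.
    """
    df_num = p_full - p_reduced
    df_den = n - p_full - 1
    f = ((r2_full - r2_reduced) / df_num) / ((1.0 - r2_full) / df_den)
    return f, df_num, df_den

# Rounded values from the slides: linear R-square 45%, cubic 82%, 12 points.
f_stat, df_num, df_den = partial_f_from_r2(0.82, 0.45, n=12, p_full=3, p_reduced=1)
print(f"F = {f_stat:.2f} on ({df_num}, {df_den}) df")
```

The statistic comes out near 8.2 on (2, 8) degrees of freedom, above the tabled 5% critical value of about 4.46, which agrees with the slide's conclusion that the higher-order terms are worth keeping.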