Lecture 14
Diagnostics & Remedial Measures
STAT 512
Spring 2011
Background Reading
KNNL: 6.8, 7.6, 10.5
Topic Overview
• Usual plots/tests to examine error
assumptions
• Multicollinearity
• CDI Case Study
Diagnostic (Residual) Plots
• Residuals vs. Normal Quantiles (Check
Normality)
• Residuals vs. Predicted Values (Check
Constant Variance)
• Residuals vs. Predictor Variables (Check
Linearity, Constant Variance)
• Residuals vs. Order of Observations (Check
Independence)
Diagnostic Tests
• Breusch-Pagan or Brown-Forsythe to test
for constancy of variance.
• Kolmogorov-Smirnov, etc. to test for
normality.
• Lack-of-fit test (we’ll hold off talking about
this one until we’ve discussed ANOVA)
Scatter Plot Matrix
• Plots Y, X1, X2, etc. against each of the
other variables.
• Compare Y to X’s to find relationships.
• Compare X’s to each other to identify
potential multicollinearity.
Remedial Measures
• Transform X if relationship non-linear
• Transform Y if violations of constant
variance and/or normality assumptions
• Use Box-Cox to come up with “best”
transformation on Y.
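The slides do not write out the Box-Cox family, so the following is only a minimal sketch of one common form of it, (Y^λ - 1)/λ with the natural log as the λ = 0 case. The function name `box_cox` and the sample value 100.0 are illustrative assumptions, not taken from the course's SAS code.

```python
import math

def box_cox(y, lam):
    """Box-Cox power transformation of one positive response value."""
    if y <= 0:
        raise ValueError("Box-Cox requires a positive response")
    if lam == 0:
        return math.log(y)            # limiting case as lam -> 0
    return (y ** lam - 1.0) / lam

# A few members of the family applied to the same (made-up) value.
examples = {lam: box_cox(100.0, lam) for lam in (-1, 0, 0.5, 1)}
```

In practice the "best" λ is chosen by refitting the regression over a grid of λ values and comparing the fits; this sketch shows only the transformation itself.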
Multicollinearity (1)
• Definition: Intercorrelation exists
whenever the predictor variables are
correlated. The term multicollinearity is
generally reserved for instances where the
correlation is very high (greater than 0.9).
• Multicollinearity can make it difficult to...
◦ Judge relative importance of predictor variables.
◦ Ascertain the magnitude of an effect of a predictor
variable on the response.
Ideal Situation
• For a balanced design, absolutely no
intercorrelation exists. (For example, if
there are two predictor variables, $r_{12}^2 = 0$.)
• Uncorrelated variables cannot overlap in the
variation in the response that they explain
• Type I and Type III SS will be identical.
• Slope estimates will also be the same.
Example
(p. 279) Eight observations on productivity based
on crew-size and bonus-size.
Productivity (Y)   Crew Size (X1)   Bonus Pay (X2)
42, 39             4                2
48, 51             4                3
49, 53             6                2
61, 60             6                3
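As a quick check that this design really is balanced, the correlation between the two predictors can be computed directly from the table. This is a small standard-library sketch; the helper name `correlation` is an illustrative choice.

```python
import math

# (crew size X1, bonus pay X2) for the eight observations in the table
crew = [4, 4, 4, 4, 6, 6, 6, 6]
bonus = [2, 2, 3, 3, 2, 2, 3, 3]

def correlation(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r12 = correlation(crew, bonus)   # cross-products cancel, so r12 is 0
```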
Example (2)
Output from PROC GLM
Source     DF   SS        MS        F    P-value
Model      2    402.250   201.125   57   0.0004
Error      5    17.625    3.525
Total      7    419.875

Source     DF   Type I SS      Mean Square
size       1    231.1250000    231.1250000
bonuspay   1    171.1250000    171.1250000

Source     DF   Type III SS    Mean Square
size       1    231.1250000    231.1250000
bonuspay   1    171.1250000    171.1250000
Example (3)
Parameter   Est     SE     T      P-value
Intercept   0.375   4.74   0.08   0.9400
size        5.375   0.66   8.10   0.0005
bonuspay    9.250   1.33   6.97   0.0009

If we consider only size:
Source   DF   SS        MS        F     P-value
Model    1    231.125   231.125   7.4   0.035
Error    6    188.750   31.458
Total    7    419.875

Parameter   Est     SE     T      P-value
Intercept   23.5    10.1   2.32   0.0591
size        5.375   1.98   2.71   0.0351
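The point of the two fits (same slope and same sum of squares for size, with or without bonuspay in the model) can be reproduced by hand from the eight observations. The sketch below solves the centered 2x2 normal equations with the standard library only; all variable names are illustrative.

```python
crew = [4, 4, 4, 4, 6, 6, 6, 6]
bonus = [2, 2, 3, 3, 2, 2, 3, 3]
productivity = [42, 39, 48, 51, 49, 53, 61, 60]

def centered(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x1, x2, y = centered(crew), centered(bonus), centered(productivity)
s11, s22, s12 = dot(x1, x1), dot(x2, x2), dot(x1, x2)
s1y, s2y = dot(x1, y), dot(x2, y)

# Simple regression on crew size alone.
slope_alone = s1y / s11
ss_size_alone = s1y ** 2 / s11

# Two-predictor fit: solve the centered normal equations by Cramer's rule.
det = s11 * s22 - s12 ** 2
b_size = (s22 * s1y - s12 * s2y) / det
b_bonus = (s11 * s2y - s12 * s1y) / det
ss_model = b_size * s1y + b_bonus * s2y

# Extra sum of squares for size when it is added after bonuspay.
ss_size_added_last = ss_model - s2y ** 2 / s22
```

Because `s12` is zero here, the cross terms drop out of both slope formulas, which is why the estimate for size does not move when bonuspay enters the model.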
Perfect Correlation Example
(p. 281) Four observations in three-space, but
all lying over the line X2 = 5 + 0.5 * X1.
X1 X2 Y
2 6 25
8 9 81
6 8 60
10 10 113
Perfect Correlation Example (2)
• Since the (X1, X2) points fall exactly on a line
in two-space, there are infinitely many regression
planes available. There is no unique best
regression plane.
• If you try to fit this in SAS, you will get output
that does not fit the full model (it cannot,
since $X'X$ is not invertible).
• SAS is “smart” enough to figure out that
something is wrong, and try to do something
about it.
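To see the singularity concretely, X'X can be formed for the four observations in the table, using an intercept column plus X1 and X2. This sketch uses exact integer arithmetic; the helper `det3` is an illustrative name.

```python
x1 = [2, 8, 6, 10]
x2 = [6, 9, 8, 10]

# Design matrix rows: intercept, X1, X2.
rows = [[1, a, b] for a, b in zip(x1, x2)]

# X'X as a 3x3 matrix of integer sums of cross-products.
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

determinant = det3(xtx)   # exactly 0 because x2 = 5 + 0.5 * x1
```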
Output
Source DF SS MS F P-val
Model 1 4007.15 4007.15 91.49 0.0108
Error 2 87.60 43.80
Total 3 4094.75
NOTE: Model is not full rank. Least-squares
solutions for the parameters are not unique.
Some statistics will be misleading. A
reported DF of 0 or B means that the
estimate is biased.
Output (2)
NOTE: The following parameters have been set
to 0, since the variables are a linear
combination of other variables as shown.
x2 = 5 * Intercept + 0.5 * x1
Variable    DF   Parameter Estimate   Std Error   T      P-value
Intercept   B    0.20                 7.99        0.03   0.9800
x1          B    10.70                1.12        9.56   0.0108
x2          0    0                    .           .      .
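The non-uniqueness warned about in the SAS note can be illustrated with the reported estimates: since x2 = 5 + 0.5 * x1 for every observation, weight can be moved onto x2 without changing any fitted value. In the sketch below, the shift amount `c` is an arbitrary illustrative choice.

```python
x1 = [2, 8, 6, 10]
x2 = [6, 9, 8, 10]

def fitted(b0, b1, b2):
    """Fitted values for the plane b0 + b1*x1 + b2*x2."""
    return [b0 + b1 * a + b2 * b for a, b in zip(x1, x2)]

# The solution SAS reports, with the x2 coefficient set to 0.
sas_fit = fitted(0.20, 10.70, 0.0)

# Move c units of weight onto x2; adjust b0 and b1 to compensate.
c = 2.0
alt_fit = fitted(0.20 - 5 * c, 10.70 - 0.5 * c, c)

max_gap = max(abs(u - v) for u, v in zip(sas_fit, alt_fit))
```

Any value of `c` gives the same fitted values, so the data cannot distinguish among these coefficient vectors.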
Effects of Multicollinearity
• Variables are almost never 100% correlated.
• When there is a lot of intercorrelation...
◦ Can generally still obtain a “good fit”.
◦ Predictions made within the scope of the model
are generally unaffected.
◦ The $X'X$ matrix has a near-zero determinant, which
can be a source of serious round-off errors.
Effects on Regression Coefficients
• Regression coefficients are highly correlated
and have large standard errors
• Cannot use the common interpretation of a
regression coefficient (since, for one thing,
it probably isn’t feasible to hold the other
variables constant).
Simultaneous T-tests
• A common abuse of multiple linear regression
models is to do simultaneous t-tests of
$\beta_k = 0$, $k = 1, 2, \ldots, p - 1$.
• The big problem with this is that these are all
MARGINAL (or variable-added-last) tests.
• If there were no intercorrelation, all of the
variables would act “independently” and this
would be no problem. But when there is
intercorrelation, one would often end up
incorrectly dropping ALL variables on this
basis.
Extra Sums of Squares
• When predictor variables are correlated, Type I
and Type III SS tend to be quite different
• Added first, a variable may explain a lot of the
variation, but added later it may not explain
much.
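A small numerical illustration of this effect follows. The data are hypothetical (made up for this sketch, not from the lecture) and were chosen so that x1 and x2 have correlation 0.9. For x1, the Type I SS is computed with x1 entered first and the Type III SS with x1 entered last.

```python
# Hypothetical data, not from the lecture: x1 and x2 are strongly correlated.
x1_raw = [1, 2, 3, 4, 5]
x2_raw = [1, 2, 3, 5, 4]
y_raw = [2, 4, 5, 9, 9]

def centered(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x1, x2, y = centered(x1_raw), centered(x2_raw), centered(y_raw)
s11, s22, s12 = dot(x1, x1), dot(x2, x2), dot(x1, x2)
s1y, s2y = dot(x1, y), dot(x2, y)

# Regression sums of squares for each one-predictor model.
ssr_x1 = s1y ** 2 / s11
ssr_x2 = s2y ** 2 / s22

# Regression sum of squares for the two-predictor model.
det = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
ssr_both = b1 * s1y + b2 * s2y

type1_x1 = ssr_x1               # x1 entered first
type3_x1 = ssr_both - ssr_x2    # x1 entered last (given x2)
```

With these made-up numbers x1 explains about 36.1 units of variation on its own but only about 1.9 once x2 is already in the model.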
Indicators of Multicollinearity
• Large simple correlations between pairs of
predictors.
• F-test says the model is significant, but the marginal
t-tests do not show any significance.
• Watch for Type I and Type III SS having
large differences.
• Large changes in estimated regression
coefficients when variables are
added/deleted.
Variance Inflation Factors
• Formal method for detecting multicollinearity
• VIF is related to the variance of the estimated
regression coefficients (think: variances get
“inflated” by having intercorrelation among
the predictors)
$$VIF_k = \frac{1}{1 - R_k^2}$$
• $R_k^2$ is the coefficient of determination obtained
in the regression of Xk on all other predictors.
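The "inflation" reading can be made explicit with the standard expression for the variance of a least-squares slope. This is a textbook result stated here as a reminder, not something derived on the slides. Here $\sigma^2$ is the error variance and $X_{ik}$ is the value of predictor $k$ for observation $i$.

```latex
\sigma^2\{b_k\}
  = \frac{\sigma^2}{\left(1 - R_k^2\right)\sum_{i=1}^{n}\left(X_{ik} - \bar{X}_k\right)^2}
  = VIF_k \cdot \frac{\sigma^2}{\sum_{i=1}^{n}\left(X_{ik} - \bar{X}_k\right)^2}
```

The second factor is the variance the slope would have if Xk were uncorrelated with the other predictors, so $VIF_k$ is the multiplier by which intercorrelation inflates it.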
Variance Inflation Factors (2)
• $R_k^2 > 0.9$ means that Xk is well predicted by
the other variables. This corresponds to a VIF
of 10 or higher and indicates excessive
multicollinearity.
• Tolerance is defined as
$$TOL_k = 1 - R_k^2 = 1 / VIF_k$$
• A tolerance below 0.01, 0.001, or 0.0001
typically raises concern.
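The relationship between $R_k^2$, VIF, and tolerance is simple enough to tabulate. The sketch below uses hypothetical function names `vif` and `tolerance`, and the $R_k^2$ values in the loop are illustrative.

```python
def vif(r_squared):
    """Variance inflation factor from R_k^2 (Xk regressed on the others)."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("R^2 must be in [0, 1); VIF is undefined at 1")
    return 1.0 / (1.0 - r_squared)

def tolerance(r_squared):
    """Tolerance is the reciprocal of the VIF."""
    return 1.0 - r_squared

# (R_k^2, tolerance, VIF) for a few illustrative values.
table = [(r2, tolerance(r2), vif(r2)) for r2 in (0.0, 0.5, 0.9, 0.99)]
```

An $R_k^2$ of 0.9 gives a VIF of 10 and a tolerance of 0.1, which is the rule-of-thumb boundary mentioned above.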
Physicians Case Study (7.37)
• Goal: Predict # of active physicians in a
county (Y) from
1. X1 = Total Population
2. X2 = Total Personal Income
3. X3 = Land Area
4. X4 = Percent of Pop. Age 65 or older
5. X5 = # of hospital beds
6. X6 = Total Serious Crimes
• SAS code available in file CDI.sas.
Initial Model
• All six predictors included
• VIF and TOL can be used as options after
the ‘/’ in the model statement of REG
Variable      DF   Tolerance   VIF
tot_pop       1    0.01192     83.87229
tot_income    1    0.01883     53.10731
land_area     1    0.79952     1.25074
pop_elderly   1    0.94209     1.06147
beds          1    0.12251     8.16293
crimes        1    0.16763     5.96556
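The reported values can be read back through the definitions: tolerance should equal 1/VIF up to rounding, and 1 - tolerance gives the implied $R_k^2$ for each predictor. The sketch below copies the numbers from the output above; the VIF cutoff of 10 is the rule of thumb from the earlier slide, and the dictionary layout is an illustrative choice.

```python
# (tolerance, VIF) as reported in the SAS output above
reported = {
    "tot_pop": (0.01192, 83.87229),
    "tot_income": (0.01883, 53.10731),
    "land_area": (0.79952, 1.25074),
    "pop_elderly": (0.94209, 1.06147),
    "beds": (0.12251, 8.16293),
    "crimes": (0.16763, 5.96556),
}

# R_k^2 implied by each tolerance (TOL = 1 - R_k^2).
implied_r2 = {name: 1.0 - tol for name, (tol, _) in reported.items()}

# Predictors exceeding the VIF >= 10 rule of thumb.
flagged = [name for name, (_, v) in reported.items() if v >= 10]

# Largest disagreement between TOL and 1/VIF (rounding error only).
max_mismatch = max(abs(tol - 1.0 / v) for tol, v in reported.values())
```

Only tot_pop and tot_income are flagged. Each is almost perfectly predicted by the remaining variables, with an implied $R_k^2$ of about 0.99 and 0.98 respectively.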
Residual Plots
[Figure: Residual vs. Predicted Value of physicians]
Residual Plots
Resid vs. Total Population – See SAS for other variables
[Figure: Residual vs. tot_pop]
Normal Probability Plot
[Figure: Residual vs. Normal Quantiles]
Assumption Violations
• Errors not normal.
• Variance does not appear to be constant.
• BOXCOX suggests a log transformation,
which clears up some of the issues.
Normal Probability Plot
[Figure: Residual vs. Normal Quantiles]
Two Outliers
• Can see these in the QQ-plot.
• Further investigation shows that they are
Los Angeles County and Cook County.
◦ Twice as many physicians as other counties.
◦ Also outliers in total population and total
income.
◦ There is reason to drop these two for the time
being, since it makes sense that such huge
counties should not be treated as part of the same
population as the rest.
QQ Plot w/o Outliers
[Figure: Residual vs. Normal Quantiles]
Residual Plot
[Figure: Residual vs. Predicted Value of lphysicians]
Residual Plot
Resid vs. Total Population – See SAS for other variables
[Figure: Residual vs. tot_pop]
Still Problems?
• Normality is ok
• No other unreasonable outliers
• Residual Plot suggests some nonlinearity
• Look at Residual vs. Predictor Variable
Plots to learn more
• Possibly add some quadratic or other terms
• We’ve thus far ignored multicollinearity –
time to consider it.
Multicollinearity
• VIF’s for tot_pop and tot_income already
have informed us that there are problems.
              pop     inc     l_ar    eld     beds
tot_pop       1.00    0.99    0.17   -0.03    0.92
tot_income    0.99    1.00    0.13   -0.02    0.90
land_area     0.17    0.13    1.00    0.01    0.07
pop_elderly  -0.03   -0.02    0.01    1.00    0.05
beds          0.92    0.90    0.07    0.05    1.00
• Will continue analysis with Model Selection
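Pairwise correlations also give a floor for the VIFs. $R_k^2$ from regressing Xk on all other predictors is at least the squared correlation with any single one of them, so $VIF_k \geq 1/(1 - r^2)$. The sketch below applies this to the three large correlations in the matrix. These are rounded to two decimals and may come from a different subset of counties than the earlier VIF output, so the results are rough guides only; `vif_lower_bound` is an illustrative name.

```python
def vif_lower_bound(r):
    """Smallest VIF compatible with a pairwise correlation of r."""
    return 1.0 / (1.0 - r * r)

# Rounded correlations read from the matrix above.
bounds = {
    ("tot_pop", "tot_income"): vif_lower_bound(0.99),
    ("tot_pop", "beds"): vif_lower_bound(0.92),
    ("tot_income", "beds"): vif_lower_bound(0.90),
}
```

A correlation of 0.99 alone already implies a VIF of roughly 50 for both tot_pop and tot_income, far beyond the rule-of-thumb value of 10.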
Big Picture
• For checking basic assumptions: PLOTS
are generally easier to construct than
TESTS – and generally if there is
something to see, it will show up in the
appropriate plot.
• MULTICOLLINEARITY is a big issue
when trying to interpret estimates –
however it’s not really a problem for
prediction.
Upcoming in Lecture 15...
• Model Building: Selection Criteria (Ch 9)
• Continuing the Physicians Dataset Analysis