TRANSCRIPT
22s:152 Applied Linear Regression
Chapter 11 Diagnostics: Unusual and Influential Data, Outliers and Leverage
————————————————————
We’ve seen that there are certain assumptions made when we model the data.
These assumptions allow us to calculate a test statistic, and know the distribution of the test statistic under a null hypothesis.
We’ve also seen that there are other things to check for in regression, besides the specified distributional assumptions, such as multicollinearity.
In this section, we again look for characteristics in the data that can cause problems with our fitted models.
In multiple regression, we often investigate with a scatterplot matrix first, checking for:
• collinearity
• outliers
• marginal pairwise linearity
These scatterplots can be useful, but they only show pairwise relationships.
After fitting the model, we check for:
• nonlinearity
• non-constant variance
• non-normality
• multicollinearity (VIFs)
• outliers and influential observations
Outliers, Leverage, and Influence
• In the general sense, an outlier is any unusual data point.
• It helps to be more specific:
[Scatterplot: repwt vs. weight. Most points follow a tight linear trend; one point sits in the bottom right, far outside the range of the other x-values.]
• In the above plot there is a strong outlier in the bottom right.
This is an outlier with respect to the X-distribution (independent variable). It is not an outlier with respect to the Y-distribution (dependent variable).
• A strong outlier with respect to the independent variable(s) is said to have high leverage.
• Such a point could have a big impact on the fitted model.
• Here is the fitted line for the full data set:
[Scatterplot of repwt vs. weight with the least-squares line fit to the full data set.]
• Here is the fitted line after removing the high-leverage point:
[Scatterplot of repwt vs. weight with the fitted line after removing the high-leverage point.]
• This one point had a lot of influence on the fitted line.
• The second picture shows a much better fit to the bulk of the data. But one must be careful about removing outliers. In this case the value was reported incorrectly, and removal was justified.
• A point with high leverage doesn’t HAVE to have a large impact on the fitted line.
• Recall the cigarette data:
[Scatterplot of cig.data$Nic vs. cig.data$Tar with the fitted line from the full data set.]
• Above is the fitted line with the inclusion of the point of high leverage (the top-right data point is far outside the distribution of x-values).
• Here is the fitted line after removing the outlier:
[Scatterplot of cig.data$Nic vs. cig.data$Tar with the fitted line after removing the outlier.]
• The before and after in this case are quite similar. This leverage point did NOT have a large influence on the fitted model.
• A point with high leverage has the potential to greatly affect the fitted model.
• The blue point below is an outlier, but not with respect to the independent variable (not high leverage).
[Scatterplot of y vs. x with the fitted line; the outlying point lies within the range of x but far from the trend in y.]
• Here is the fitted line after removal of the outlier; it did not have a big influence on the fitted line.
[Scatterplot of y vs. x with the fitted line after removing the outlier.]
• What about higher dimensions (more predictors)?
• A point with high leverage will be outside of the ‘cloud’ of observed x-values.
• With 2 predictors X1 and X2:
[Scatterplot of x2 vs. x1; one point at the top left lies outside the cloud of points.]
The point at the top left has high leverage because it is outside the cloud of (X1, X2) independent values.
Assessing Leverage: Hat Values
• Beyond graphics, we have a quantity called the hat value, which is a measure of leverage.
• Leverage is a measure of “potential” influence.
• The hat value hij captures the contribution of observation Yi to the fitted value of the jth observation, Ŷj.
• If hij is large, Yi CAN have a substantial impact on the jth fitted value.
In fact, we have the relationship:

    Ŷj = h1j Y1 + h2j Y2 + · · · + hnj Yn = ∑i=1..n hij Yi
• The overall potential impact of Yi on the fitted model is found by summing all the squared hij values (a sum across all j’s).
• The hat value for observation i:

    hi = ∑j=1..n hij²

• How do we determine if hi is large?
    * 1/n ≤ hi ≤ 1
    * ∑i=1..n hi = k + 1   (where k is the number of predictors)
    * h̄ = (k + 1)/n
• We can compare hi to the average h̄. If it is much larger, then the point has high leverage.
• Often 2(k + 1)/n or 3(k + 1)/n is used as a guide for determining high leverage.
• In SLR, hi just measures the distance from the mean X̄:

    hi = 1/n + (Xi − X̄)² / ∑j=1..n (Xj − X̄)²

• In multiple regression, hi measures distance from the centroid (the center of the cloud of data points in the X-space).
• The dependent-variable values are not involved in determining leverage.
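To make the SLR formula concrete, here is a small sketch in Python (the lecture’s examples use R, and the data below are hypothetical) that computes the hat values directly and checks the facts listed above:

```python
# Hypothetical data (not from the lecture); the last x-value lies far from the mean.
xs = [1.0, 2.0, 3.0, 4.0, 10.0]
n, k = len(xs), 1  # k = number of predictors in SLR

xbar = sum(xs) / n
sxx = sum((x - xbar) ** 2 for x in xs)

# h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2
h = [1.0 / n + (x - xbar) ** 2 / sxx for x in xs]

# The hat values sum to k + 1, each lies in [1/n, 1], and the extreme
# x-value is flagged by the 2(k + 1)/n rule of thumb.
hbar = (k + 1) / n
high_leverage = [i for i, hi in enumerate(h) if hi > 2 * hbar]
print(h, sum(h), high_leverage)
```

Note that the response values never enter the computation, matching the point above that leverage depends only on the X’s.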
• Example: Duncan data
Response: Prestige — a prestige measure of the occupation.
Predictors:
  Income — percent of males in the occupation earning $3,500 or more in 1950.
  Education — percent of males in the occupation in 1950 who were high school graduates.
> attach(Duncan)
> head(Duncan)
type income education prestige
accountant prof 62 86 82
pilot prof 72 76 83
architect prof 75 92 90
author prof 55 90 76
chemist prof 64 86 90
minister prof 21 84 87
> nrow(Duncan)
[1] 45
Two Predictors:
> plot(education,income,pch=16,cex=2)
## The next line produces an interactive option for the
## user to identify points on a plot with the mouse.
> identify(education,income,row.names(Duncan))
[Scatterplot of income vs. education; ‘minister’, ‘conductor’, and ‘RR.engineer’ are identified on the plot.]
From this bivariate plot of the two predictors, it looks like ‘RR.engineer’, ‘conductor’, and ‘minister’ may have high leverage.
Fit the model, get the hat values:
> lm.out=lm(prestige ~ education + income)
> hatvalues(lm.out)
1 2 3 4 ...
0.05092832 0.05732001 0.06963699 0.06489441 ...
Plot the hat values against the indices 1 to 45, and include thresholds at 2(k + 1)/n and 3(k + 1)/n:
> plot(hatvalues(lm.out),pch=16,cex=2)
> abline(h=2*3/45,lty=2)
> abline(h=3*3/45,lty=2)
> identify(1:45,hatvalues(lm.out),row.names(Duncan))
[Index plot of hatvalues(lm.out) with the two dashed threshold lines; ‘minister’, ‘conductor’, and ‘RR.engineer’ stand out.]
These three points have high leverage (the potential to greatly influence the fitted model) using the 2-times and 3-times the average hat value criterion.
To measure influence, we’ll look at another statistic, but first... consider studentized residuals.
Studentized Residuals
• In this section, we’re looking for outliers in the Y-direction for a given combination of independent variables (x-vector).
• Such an observation would have an unusually large residual, and we call it a regression outlier.
• If you have two predictors, the mean structure is a plane, and you’d be looking for observations that fall exceptionally far from the plane compared to the other observations.
• First, we start with standardized residuals.
• Though we assume the errors in our model have constant variance [εi ∼ N(0, σ²)], the estimated errors, or sample residuals ei, DO NOT have equal variance...
• Observations with high leverage tend to have smaller residuals. This is an artifact of our model fitting: these points can pull the fitted model close to them, giving them a tendency toward smaller residuals.
• Var(ei) = σ²(1 − hi)
• We know... 1/n ≤ hi ≤ 1.
• With this variance, we can form standardized residuals e′i, which all have equal variance:

    e′i = ei / ( σ̂ √(1 − hi) ) = ei / ( SE √(1 − hi) )

but the distribution of e′i isn’t a t because the numerator and denominator are not independent (part of the theory of the t).
• We can avoid this problem by estimating σ² with a sum of squares that does not include the ith residual.
  – Delete the ith observation, re-fit the model based on the remaining n − 1 observations, and get

    σ̂²(−i) = RSS(−i) / (n − 1 − k − 1)    and    SE(−i) = √[ RSS(−i) / (n − 1 − k − 1) ]

• This gives us the studentized residual

    e*i = ei / ( SE(−i) √(1 − hi) )

• Now, the scale estimate in the denominator is not correlated with the numerator, and

    e*i ∼ t(n−1−k−1)
e*i ∼ t(n−1−k−1)

• We can now use the t-distribution to judge what is a large studentized residual, or how likely we are to get a studentized residual as far from 0 as the ones we observe.
• As a note, if the ith observation has a large residual and we left it in the computation of SE, it could greatly inflate SE, deflating the standardized residual e′i and making it hard to notice that the residual was large.
• The standardized residual is also referred to as the internally studentized residual.
• The studentized residual described here is also referred to as the externally studentized residual.
• Test for outliers by comparing e*i to a t distribution with n − k − 2 df and applying a Bonferroni correction (multiply the p-values by the number of residuals).
• Really, we only have to be concerned about the large studentized residuals. Perhaps take a closer look at those with |e*i| > 2.
————————————————————
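The deletion idea above can be sketched by brute force (hypothetical data, in Python rather than the R used in the lecture): for each case, refit without it, estimate σ from that fit, and studentize ei.

```python
import math

def fit_slr(xs, ys):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    return ybar - b1 * xbar, b1

# Hypothetical data; the point at x = 5 sits well above the trend.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.9, 9.0, 6.1]
n, k = len(xs), 1

b0, b1 = fit_slr(xs, ys)
xbar = sum(xs) / n
sxx = sum((x - xbar) ** 2 for x in xs)
h = [1.0 / n + (x - xbar) ** 2 / sxx for x in xs]
e = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

tstar = []
for i in range(n):
    xd, yd = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
    b0d, b1d = fit_slr(xd, yd)
    rss_del = sum((y - (b0d + b1d * x)) ** 2 for x, y in zip(xd, yd))
    se_del = math.sqrt(rss_del / (n - 1 - k - 1))  # SE_(-i): sigma estimated without case i
    tstar.append(e[i] / (se_del * math.sqrt(1 - h[i])))

print(tstar)  # the off-trend case has by far the largest |e*_i|
```

In practice R’s rstudent() does exactly this computation analytically, without literally refitting n times.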
• Example: Returning to the Duncan model:
> nrow(Duncan)
[1] 45
> lm.out=lm(prestige ~ education + income)
> sort(rstudent(lm.out))
-2.3970223990 -1.9309187757 -1.7604905300 ...
Can look at adjusted p-values from the t41 distribution, but there is a built-in function to help with this...
Get the adjusted p-value for the largest |e*i|:

> outlier.test(lm.out,row.names(Duncan))
max|rstudent| = 3.134519, degrees of freedom = 41,
unadjusted p = 0.003177202, Bonferroni p = 0.1429741
Observation: minister

Since p = 0.1429741 is larger than 0.05, we would conclude that this model doesn’t have any extreme residuals.
————————————————————
• An outlier may indicate a sample peculiarity or may indicate a data entry error. It may suggest an observation belongs to another ‘population’.
• Fitting a model with and without an observation gives you a feel for the sensitivity of the fitted model to the observation.
• If an observation has a big impact on the fitted model and it’s not justified to remove it, one option is to report both models (with and without), commenting on the differences.
• An analysis called Robust Regression will allow you to leave the observation in, while reducing its impact on the fitted model.
This method essentially ‘weights’ the observations differently when computing the least squares estimates.
Measuring Influence

• We’ve described a point with high leverage as having the potential to greatly influence the fitted model.
• To greatly influence the fitted model, a point has high leverage and its Y-value is not ‘in line’ with the general trend of the data (it has high leverage and looks ‘odd’).
• High leverage + large studentized residual = high influence.
• To check influence, we can delete an observation and see how much the fitted regression coefficients change. A large change suggests high influence.

    Difference = β̂j − β̂j(−i)
• Fortunately this can be done analytically.
We will use subscript (−i) to indicate quantities calculated without case i...
• DFBETAS — effect of Yi on a single estimated coefficient:

    DFBETASj,i = ( β̂j − β̂j(−i) ) / SE(β̂j(−i))

  |DFBETAS| > 1 is considered large in a small- or medium-sized sample
  |DFBETAS| > 2/√n is considered large in a big sample
• DFFITS — effect of the ith case on the fitted value Ŷi:

    DFFITSi = ( Ŷi − Ŷi(−i) ) / ( SE(−i) √hi ) = e*i √( hi / (1 − hi) )

  |DFFITS| > 1 is considered large in a small- or medium-sized sample
  |DFFITS| > 2√((k + 1)/n) is considered large in a big sample
  Look at the DFFITS relative to each other.
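As a numerical check (hypothetical data, in Python rather than R), the two expressions for DFFITS above agree: the deletion definition and the closed form e*i √(hi/(1 − hi)) give identical values.

```python
import math

def fit_slr(xs, ys):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    return ybar - b1 * xbar, b1

# Hypothetical data; the case at x = 5 sits off the trend.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.9, 9.0, 6.1]
n, k = len(xs), 1

b0, b1 = fit_slr(xs, ys)
xbar = sum(xs) / n
sxx = sum((x - xbar) ** 2 for x in xs)
h = [1.0 / n + (x - xbar) ** 2 / sxx for x in xs]
e = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

dffits_direct, dffits_formula = [], []
for i in range(n):
    xd, yd = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
    b0d, b1d = fit_slr(xd, yd)
    rss = sum((y - (b0d + b1d * x)) ** 2 for x, y in zip(xd, yd))
    se_del = math.sqrt(rss / (n - 1 - k - 1))          # SE_(-i)
    tstar = e[i] / (se_del * math.sqrt(1 - h[i]))      # externally studentized residual
    # Deletion definition: change in the ith fitted value, scaled.
    yhat_full = b0 + b1 * xs[i]
    yhat_del = b0d + b1d * xs[i]
    dffits_direct.append((yhat_full - yhat_del) / (se_del * math.sqrt(h[i])))
    # Closed form: e*_i * sqrt(h_i / (1 - h_i))
    dffits_formula.append(tstar * math.sqrt(h[i] / (1 - h[i])))
```

The equality holds because Ŷi − Ŷi(−i) = ei hi / (1 − hi), a standard deletion identity.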
• COOKSD — effect on all fitted values.
  Based on:
  - how far the x-values are from the mean of the x’s
  - how far Yi is from the regression line

    COOKSDi = ∑j ( Ŷj − Ŷj(−i) )² / ( (k + 1) S²E ) = e²i hi / ( S²E (k + 1)(1 − hi)² )

  COOKSD > 4/(n − k − 1) suggests an unusually influential case.
  Look at the COOKSD relative to each other.
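The two forms of Cook’s distance above can likewise be verified against each other numerically (hypothetical data, in Python rather than R): the deletion definition matches the algebraic form e²i hi / (S²E (k + 1)(1 − hi)²).

```python
import math

def fit_slr(xs, ys):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    return ybar - b1 * xbar, b1

# Hypothetical data; one case sits off the trend.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.9, 9.0, 6.1]
n, k = len(xs), 1

b0, b1 = fit_slr(xs, ys)
e = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
s2 = sum(r * r for r in e) / (n - k - 1)  # S_E^2 from the full fit
xbar = sum(xs) / n
sxx = sum((x - xbar) ** 2 for x in xs)
h = [1.0 / n + (x - xbar) ** 2 / sxx for x in xs]

# Algebraic form: e_i^2 h_i / (S_E^2 (k + 1) (1 - h_i)^2)
cooks_formula = [e[i] ** 2 * h[i] / (s2 * (k + 1) * (1 - h[i]) ** 2)
                 for i in range(n)]

# Deletion form: sum over all j of (Yhat_j - Yhat_j(-i))^2, over (k + 1) S_E^2
cooks_deletion = []
for i in range(n):
    b0d, b1d = fit_slr(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    ss = sum(((b0 + b1 * x) - (b0d + b1d * x)) ** 2 for x in xs)
    cooks_deletion.append(ss / ((k + 1) * s2))
```

In R, cooks.distance() (or the car helpers used below) returns these same quantities directly.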
• Example: Returning to the Duncan model:
> influence.measures(lm.out)
Influence measures of
lm(formula = prestige ~ education + income) :
dfb.1_ dfb.edct dfb.incm dffit cov.r cook.d hat inf
1 -2.25e-02 0.035944 6.66e-04 0.070398 1.125 1.69e-03 0.0509
2 -2.54e-02 -0.008118 5.09e-02 0.084067 1.131 2.41e-03 0.0573
3 -9.19e-03 0.005619 6.48e-03 0.019768 1.155 1.33e-04 0.0696
4 -4.72e-05 0.000140 -6.02e-05 0.000187 1.150 1.20e-08 0.0649
5 -6.58e-02 0.086777 1.70e-02 0.192261 1.078 1.24e-02 0.0513
...

You get a DFBETA for each regression coefficient and each observation.
We’re not really interested in the intercept DFBETA though. That said, a bivariate plot of the other DFBETAS may be useful...
> plot( dfbetas(lm.out)[,c(2,3)],pch=16)
> identify(dfbetas(lm.out)[,2],dfbetas(lm.out)[,3],
row.names(Duncan))
[Scatterplot of the income DFBETAS vs. the education DFBETAS; ‘minister’, ‘conductor’, and ‘RR.engineer’ are identified.]
The ‘minister’ observation may have a large impact on both regression coefficients.
Cook’s distance is perhaps the most commonly examined measure (the effect on all fitted values).
> plot(cookd(lm.out),pch=16,cex=1.5)
> abline(h=4/(45-2-1),lty=2)
> identify(1:45,cookd(lm.out),row.names(Duncan))
[Index plot of cookd(lm.out) with the 4/(n − k − 1) threshold line; ‘minister’ and ‘conductor’ are identified.]
We can bring together leverage, studentized residuals, and Cook’s distance in the following “bubble plot”:
> plot(hatvalues(lm.out),rstudent(lm.out),type="n")
> cook=sqrt(cookd(lm.out))
> points(hatvalues(lm.out),rstudent(lm.out),
cex=10*cook/max(cook))
## Include diagnostic thresholds:
> abline(h=c(-2,0,2),lty=2)
> abline(v=c(2,3)*3/45,lty=2)
## Overlay names:
> identify(hatvalues(lm.out),rstudent(lm.out),
row.names(Duncan))
[Bubble plot: rstudent(lm.out) vs. hatvalues(lm.out), with point size proportional to the square root of Cook’s distance; dashed lines mark rstudent = ±2 and the hat-value thresholds. ‘RR.engineer’, ‘minister’, ‘reporter’, and ‘conductor’ are identified.]
• ‘conductor’, ‘minister’, and ‘RR.engineer’ have high leverage
• ‘minister’ and ‘reporter’ have large regression residuals
• ‘minister’ has the highest influence on the fitted model