TRANSCRIPT
22s:152 Applied Linear Regression
Chapter 11 Diagnostics: Unusual and Influential Data, Outliers and Leverage
————————————————————
We’ve seen that there are certain assumptions made when we model the data.
These assumptions allow us to calculate a test statistic, and know the distribution of the test statistic under a null hypothesis.
We’ve also seen that there are other things to check for in regression, besides the specified distributional assumptions, such as multicollinearity.
In this section, we again look for characteristics in the data that can cause problems with our fitted models.
In multiple regression, we often investigate with a scatterplot matrix first, checking for:
• collinearity
• outliers
• marginal pairwise linearity
These scatterplots can be useful, but they only show pairwise relationships.
After fitting the model, we check for:
• nonlinearity
• non-constant variance
• non-normality
• multicollinearity (VIFs)
• outliers and influential observations
Outliers, Leverage, and Influence
• In the general sense, an outlier is any unusual data point.
• It helps to be more specific:
[Scatterplot: repwt vs. weight. Most points follow a tight linear trend; one point sits in the bottom right, far outside the range of the other x-values.]
• In the above plot there is a strong outlier in the bottom right.
This is an outlier with respect to the X-distribution (independent variable). It is not an outlier with respect to the Y-distribution (dependent variable).
• A strong outlier with respect to the independent variable(s) is said to have high leverage.
• Such a point could have a big impact on the fitted model.
• Here is the fitted line for the full data set:
[Scatterplot of repwt vs. weight with the least-squares line fit to the full data set.]
• Here is the fitted line after removing the high-leverage point:
[Scatterplot of repwt vs. weight with the fitted line after removing the high-leverage point.]
• This one point had a lot of influence on the fitted line.
• The second picture shows a much better fit to the bulk of the data. But one must be careful about removing outliers. In this case the value was reported incorrectly, and removal was justified.
• A point with high leverage doesn’t HAVE to have a large impact on the fitted line.
• Recall the cigarette data:
[Scatterplot of cig.data$Nic vs. cig.data$Tar with the fitted line from the full data set.]
• Above is the fitted line with the inclusion of the point of high leverage (the top-right data point is far outside the distribution of x-values).
• Here is the fitted line after removing the outlier:
[Scatterplot of cig.data$Nic vs. cig.data$Tar with the fitted line after removing the outlier.]
• The before and after in this case are quite similar. This leverage point did NOT have a large influence on the fitted model.
• A point with high leverage has the potential to greatly affect the fitted model.
• The blue point below is an outlier, but not with respect to the independent variable (not high leverage).
[Scatterplot of y vs. x with the fitted line; the outlying point lies within the range of x but far from the trend in y.]
• Here is the fitted line after removal of the outlier; it did not have a big influence on the fitted line.
[Scatterplot of y vs. x with the fitted line after removing the outlier.]
• What about higher dimensions (more predictors)?
• A point with high leverage will be outside of the ‘cloud’ of observed x-values.
• With 2 predictors X1 and X2:
[Scatterplot of x2 vs. x1; one point at the top left lies outside the cloud of points.]
The point at the top left has high leverage because it is outside the cloud of (X1, X2) independent values.
Assessing Leverage: Hat Values
• Beyond graphics, we have a quantity called the hat value, which is a measure of leverage.
• Leverage is a measure of “potential” influence.
• The hat value hij captures the contribution of observation Yi to the fitted value of the jth observation, Ŷj.
• If hij is large, Yi CAN have a substantial impact on the jth fitted value.
In fact, we have the relationship:

    Ŷj = h1j Y1 + h2j Y2 + · · · + hnj Yn = ∑i=1..n hij Yi
• The overall potential impact of Yi on the fitted model is found by summing all the squared hij values (a sum across all j’s).
• The hat value for observation i:

    hi = ∑j=1..n hij²

• How do we determine if hi is large?
    * 1/n ≤ hi ≤ 1
    * ∑i=1..n hi = k + 1   (where k is the number of predictors)
    * h̄ = (k + 1)/n
• We can compare hi to the average h̄. If it is much larger, then the point has high leverage.
• Often 2(k + 1)/n or 3(k + 1)/n is used as a guide for determining high leverage.
• In SLR, hi just measures the distance from the mean X̄:

    hi = 1/n + (Xi − X̄)² / ∑j=1..n (Xj − X̄)²

• In multiple regression, hi measures distance from the centroid (the center of the cloud of data points in the X-space).
• The dependent-variable values are not involved in determining leverage.
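To make the SLR formula concrete, here is a small sketch in Python (the lecture’s examples use R, and the data below are hypothetical) that computes the hat values directly and checks the facts listed above:

```python
# Hypothetical data (not from the lecture); the last x-value lies far from the mean.
xs = [1.0, 2.0, 3.0, 4.0, 10.0]
n, k = len(xs), 1  # k = number of predictors in SLR

xbar = sum(xs) / n
sxx = sum((x - xbar) ** 2 for x in xs)

# h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2
h = [1.0 / n + (x - xbar) ** 2 / sxx for x in xs]

# The hat values sum to k + 1, each lies in [1/n, 1], and the extreme
# x-value is flagged by the 2(k + 1)/n rule of thumb.
hbar = (k + 1) / n
high_leverage = [i for i, hi in enumerate(h) if hi > 2 * hbar]
print(h, sum(h), high_leverage)
```

Note that the response values never enter the computation, matching the point above that leverage depends only on the X’s.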
• Example: Duncan data
Response: Prestige — a prestige measure of the occupation.
Predictors:
  Income — percent of males in the occupation earning $3,500 or more in 1950.
  Education — percent of males in the occupation in 1950 who were high school graduates.
> attach(Duncan)
> head(Duncan)
type income education prestige
accountant prof 62 86 82
pilot prof 72 76 83
architect prof 75 92 90
author prof 55 90 76
chemist prof 64 86 90
minister prof 21 84 87
> nrow(Duncan)
[1] 45
Two Predictors:
> plot(education,income,pch=16,cex=2)
## The next line produces an interactive option for the
## user to identify points on a plot with the mouse.
> identify(education,income,row.names(Duncan))
[Scatterplot of income vs. education; ‘minister’, ‘conductor’, and ‘RR.engineer’ are identified on the plot.]
From this bivariate plot of the two predictors, it looks like ‘RR.engineer’, ‘conductor’, and ‘minister’ may have high leverage.
Fit the model, get the hat values:
> lm.out=lm(prestige ~ education + income)
> hatvalues(lm.out)
1 2 3 4 ...
0.05092832 0.05732001 0.06963699 0.06489441 ...
Plot the hat values against the indices 1 to 45, and include thresholds at 2(k + 1)/n and 3(k + 1)/n:
> plot(hatvalues(lm.out),pch=16,cex=2)
> abline(h=2*3/45,lty=2)
> abline(h=3*3/45,lty=2)
> identify(1:45,hatvalues(lm.out),row.names(Duncan))
[Index plot of hatvalues(lm.out) with the two dashed threshold lines; ‘minister’, ‘conductor’, and ‘RR.engineer’ stand out.]
These three points have high leverage (the potential to greatly influence the fitted model) using the 2-times and 3-times the average hat value criterion.
To measure influence, we’ll look at another statistic, but first... consider studentized residuals.
Studentized Residuals
• In this section, we’re looking for outliers in the Y-direction for a given combination of independent variables (x-vector).
• Such an observation would have an unusually large residual, and we call it a regression outlier.
• If you have two predictors, the mean structure is a plane, and you’d be looking for observations that fall exceptionally far from the plane compared to the other observations.
• First, we start with standardized residuals.
• Though we assume the errors in our model have constant variance [εi ∼ N(0, σ²)], the estimated errors, or sample residuals ei, DO NOT have equal variance...
• Observations with high leverage tend to have smaller residuals. This is an artifact of our model fitting: these points can pull the fitted model close to them, giving them a tendency toward smaller residuals.
• Var(ei) = σ²(1 − hi)
• We know... 1/n ≤ hi ≤ 1.
• With this variance, we can form standardized residuals e′i, which all have equal variance:

    e′i = ei / ( σ̂ √(1 − hi) ) = ei / ( SE √(1 − hi) )

but the distribution of e′i isn’t a t because the numerator and denominator are not independent (part of the theory of the t).
• We can avoid this problem by estimating σ² with a sum of squares that does not include the ith residual.
  – Delete the ith observation, re-fit the model based on the remaining n − 1 observations, and get

    σ̂²(−i) = RSS(−i) / (n − 1 − k − 1)    and    SE(−i) = √[ RSS(−i) / (n − 1 − k − 1) ]

• This gives us the studentized residual

    e*i = ei / ( SE(−i) √(1 − hi) )

• Now, the scale estimate in the denominator is not correlated with the numerator, and

    e*i ∼ t(n−1−k−1)
e*i ∼ t(n−1−k−1)

• We can now use the t-distribution to judge what is a large studentized residual, or how likely we are to get a studentized residual as far from 0 as the ones we observe.
• As a note, if the ith observation has a large residual and we left it in the computation of SE, it could greatly inflate SE, deflating the standardized residual e′i and making it hard to notice that the residual was large.
• The standardized residual is also referred to as the internally studentized residual.
• The studentized residual described here is also referred to as the externally studentized residual.
• Test for outliers by comparing e*i to a t distribution with n − k − 2 df and applying a Bonferroni correction (multiply the p-values by the number of residuals).
• Really, we only have to be concerned about the large studentized residuals. Perhaps take a closer look at those with |e*i| > 2.
————————————————————
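The deletion idea above can be sketched by brute force (hypothetical data, in Python rather than the R used in the lecture): for each case, refit without it, estimate σ from that fit, and studentize ei.

```python
import math

def fit_slr(xs, ys):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    return ybar - b1 * xbar, b1

# Hypothetical data; the point at x = 5 sits well above the trend.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.9, 9.0, 6.1]
n, k = len(xs), 1

b0, b1 = fit_slr(xs, ys)
xbar = sum(xs) / n
sxx = sum((x - xbar) ** 2 for x in xs)
h = [1.0 / n + (x - xbar) ** 2 / sxx for x in xs]
e = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

tstar = []
for i in range(n):
    xd, yd = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
    b0d, b1d = fit_slr(xd, yd)
    rss_del = sum((y - (b0d + b1d * x)) ** 2 for x, y in zip(xd, yd))
    se_del = math.sqrt(rss_del / (n - 1 - k - 1))  # SE_(-i): sigma estimated without case i
    tstar.append(e[i] / (se_del * math.sqrt(1 - h[i])))

print(tstar)  # the off-trend case has by far the largest |e*_i|
```

In practice R’s rstudent() does exactly this computation analytically, without literally refitting n times.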
• Example: Returning to the Duncan model:
> nrow(Duncan)
[1] 45
> lm.out=lm(prestige ~ education + income)
> sort(rstudent(lm.out))
-2.3970223990 -1.9309187757 -1.7604905300 ...
Can look at adjusted p-values from the t41 distribution, but there is a built-in function to help with this...
Get the adjusted p-value for the largest |e*i|:

> outlier.test(lm.out,row.names(Duncan))
max|rstudent| = 3.134519, degrees of freedom = 41,
unadjusted p = 0.003177202, Bonferroni p = 0.1429741
Observation: minister

Since p = 0.1429741 is larger than 0.05, we would conclude that this model doesn’t have any extreme residuals.
————————————————————
• An outlier may indicate a sample peculiarity or may indicate a data entry error. It may suggest an observation belongs to another ‘population’.
• Fitting a model with and without an observation gives you a feel for the sensitivity of the fitted model to the observation.
• If an observation has a big impact on the fitted model and it’s not justified to remove it, one option is to report both models (with and without), commenting on the differences.
• An analysis called Robust Regression will allow you to leave the observation in, while reducing its impact on the fitted model.
This method essentially ‘weights’ the observations differently when computing the least squares estimates.
Measuring Influence

• We’ve described a point with high leverage as having the potential to greatly influence the fitted model.
• To greatly influence the fitted model, a point has high leverage and its Y-value is not ‘in line’ with the general trend of the data (it has high leverage and looks ‘odd’).
• High leverage + large studentized residual = high influence.
• To check influence, we can delete an observation and see how much the fitted regression coefficients change. A large change suggests high influence.

    Difference = β̂j − β̂j(−i)
• Fortunately this can be done analytically.
We will use subscript (−i) to indicate quantities calculated without case i...
• DFBETAS — effect of Yi on a single estimated coefficient:

    DFBETASj,i = ( β̂j − β̂j(−i) ) / SE(β̂j(−i))

  |DFBETAS| > 1 is considered large in a small- or medium-sized sample
  |DFBETAS| > 2/√n is considered large in a big sample
• DFFITS — effect of the ith case on the fitted value Ŷi:

    DFFITSi = ( Ŷi − Ŷi(−i) ) / ( SE(−i) √hi ) = e*i √( hi / (1 − hi) )

  |DFFITS| > 1 is considered large in a small- or medium-sized sample
  |DFFITS| > 2√((k + 1)/n) is considered large in a big sample
  Look at the DFFITS relative to each other.
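As a numerical check (hypothetical data, in Python rather than R), the two expressions for DFFITS above agree: the deletion definition and the closed form e*i √(hi/(1 − hi)) give identical values.

```python
import math

def fit_slr(xs, ys):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    return ybar - b1 * xbar, b1

# Hypothetical data; the case at x = 5 sits off the trend.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.9, 9.0, 6.1]
n, k = len(xs), 1

b0, b1 = fit_slr(xs, ys)
xbar = sum(xs) / n
sxx = sum((x - xbar) ** 2 for x in xs)
h = [1.0 / n + (x - xbar) ** 2 / sxx for x in xs]
e = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

dffits_direct, dffits_formula = [], []
for i in range(n):
    xd, yd = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
    b0d, b1d = fit_slr(xd, yd)
    rss = sum((y - (b0d + b1d * x)) ** 2 for x, y in zip(xd, yd))
    se_del = math.sqrt(rss / (n - 1 - k - 1))          # SE_(-i)
    tstar = e[i] / (se_del * math.sqrt(1 - h[i]))      # externally studentized residual
    # Deletion definition: change in the ith fitted value, scaled.
    yhat_full = b0 + b1 * xs[i]
    yhat_del = b0d + b1d * xs[i]
    dffits_direct.append((yhat_full - yhat_del) / (se_del * math.sqrt(h[i])))
    # Closed form: e*_i * sqrt(h_i / (1 - h_i))
    dffits_formula.append(tstar * math.sqrt(h[i] / (1 - h[i])))
```

The equality holds because Ŷi − Ŷi(−i) = ei hi / (1 − hi), a standard deletion identity.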
• COOKSD — effect on all fitted values.
  Based on:
  - how far the x-values are from the mean of the x’s
  - how far Yi is from the regression line

    COOKSDi = ∑j ( Ŷj − Ŷj(−i) )² / ( (k + 1) S²E ) = e²i hi / ( S²E (k + 1)(1 − hi)² )

  COOKSD > 4/(n − k − 1) suggests an unusually influential case.
  Look at the COOKSD relative to each other.
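The two forms of Cook’s distance above can likewise be verified against each other numerically (hypothetical data, in Python rather than R): the deletion definition matches the algebraic form e²i hi / (S²E (k + 1)(1 − hi)²).

```python
import math

def fit_slr(xs, ys):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    return ybar - b1 * xbar, b1

# Hypothetical data; one case sits off the trend.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.9, 9.0, 6.1]
n, k = len(xs), 1

b0, b1 = fit_slr(xs, ys)
e = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
s2 = sum(r * r for r in e) / (n - k - 1)  # S_E^2 from the full fit
xbar = sum(xs) / n
sxx = sum((x - xbar) ** 2 for x in xs)
h = [1.0 / n + (x - xbar) ** 2 / sxx for x in xs]

# Algebraic form: e_i^2 h_i / (S_E^2 (k + 1) (1 - h_i)^2)
cooks_formula = [e[i] ** 2 * h[i] / (s2 * (k + 1) * (1 - h[i]) ** 2)
                 for i in range(n)]

# Deletion form: sum over all j of (Yhat_j - Yhat_j(-i))^2, over (k + 1) S_E^2
cooks_deletion = []
for i in range(n):
    b0d, b1d = fit_slr(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    ss = sum(((b0 + b1 * x) - (b0d + b1d * x)) ** 2 for x in xs)
    cooks_deletion.append(ss / ((k + 1) * s2))
```

In R, cooks.distance() (or the car helpers used below) returns these same quantities directly.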
• Example: Returning to the Duncan model:
> influence.measures(lm.out)
Influence measures of
lm(formula = prestige ~ education + income) :
dfb.1_ dfb.edct dfb.incm dffit cov.r cook.d hat inf
1 -2.25e-02 0.035944 6.66e-04 0.070398 1.125 1.69e-03 0.0509
2 -2.54e-02 -0.008118 5.09e-02 0.084067 1.131 2.41e-03 0.0573
3 -9.19e-03 0.005619 6.48e-03 0.019768 1.155 1.33e-04 0.0696
4 -4.72e-05 0.000140 -6.02e-05 0.000187 1.150 1.20e-08 0.0649
5 -6.58e-02 0.086777 1.70e-02 0.192261 1.078 1.24e-02 0.0513
...

You get a DFBETA for each regression coefficient and each observation.
We’re not really interested in the intercept DFBETA though. That said, a bivariate plot of the other DFBETAS may be useful...
> plot( dfbetas(lm.out)[,c(2,3)],pch=16)
> identify(dfbetas(lm.out)[,2],dfbetas(lm.out)[,3],
row.names(Duncan))
[Scatterplot of the income DFBETAS vs. the education DFBETAS; ‘minister’, ‘conductor’, and ‘RR.engineer’ are identified.]
The ‘minister’ observation may have a large impact on both regression coefficients.
Cook’s distance is perhaps the most commonly examined measure (the effect on all fitted values).
> plot(cookd(lm.out),pch=16,cex=1.5)
> abline(h=4/(45-2-1),lty=2)
> identify(1:45,cookd(lm.out),row.names(Duncan))
[Index plot of cookd(lm.out) with the 4/(n − k − 1) threshold line; ‘minister’ and ‘conductor’ are identified.]
We can bring together leverage, studentized residuals, and Cook’s distance in the following “bubble plot”:
> plot(hatvalues(lm.out),rstudent(lm.out),type="n")
> cook=sqrt(cookd(lm.out))
> points(hatvalues(lm.out),rstudent(lm.out),
cex=10*cook/max(cook))
## Include diagnostic thresholds:
> abline(h=c(-2,0,2),lty=2)
> abline(v=c(2,3)*3/45,lty=2)
## Overlay names:
> identify(hatvalues(lm.out),rstudent(lm.out),
row.names(Duncan))
[Bubble plot: rstudent(lm.out) vs. hatvalues(lm.out), with point size proportional to the square root of Cook’s distance; dashed lines mark rstudent = ±2 and the hat-value thresholds. ‘RR.engineer’, ‘minister’, ‘reporter’, and ‘conductor’ are identified.]
• ‘conductor’, ‘minister’, and ‘RR.engineer’ have high leverage
• ‘minister’ and ‘reporter’ have large regression residuals
• ‘minister’ has the highest influence on the fitted model