
Lecture 3: Inference and Diagnostics for SimpleLinear Regression

Nancy R. Zhang

Statistics 203, Stanford University

January 12, 2010

Nancy R. Zhang (Statistics 203) Lecture 3 January 12, 2010 1 / 24

Review

Assumptions of the linear model

Y_i = β_0 + β_1 X_i + ε_i

1. Y has a linear dependence on X.
2. Error variances are equal.
3. Errors are independent.
4. Errors are Gaussian.

Data points that deviate from the “bulk”, which we assume to satisfy these model assumptions, are called outliers.

Anscombe’s quartet

Standardized Residuals

A simple way to evaluate model fit is through the residuals,

r_i = y_i − ŷ_i.

The residuals have variance

σ²(1 − h_ii),

where

h_ii = 1/n + (x_i − x̄)² / ∑_j (x_j − x̄)²

is called the leverage.

We need to standardize the residuals so that they are comparable:

z_i = r_i / (σ √(1 − h_ii)).

However, we don’t know σ, so we need to estimate it.
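As a concrete illustration (not course code), the leverage and standardization formulas can be sketched with NumPy; σ is passed in as known here, since estimating it is the next step:

```python
import numpy as np

def leverage(x):
    """h_ii = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2."""
    x = np.asarray(x, dtype=float)
    dev = x - x.mean()
    return 1.0 / len(x) + dev**2 / np.sum(dev**2)

def standardized_residuals(x, y, sigma):
    """z_i = r_i / (sigma * sqrt(1 - h_ii)) for the least-squares line."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b1, b0 = np.polyfit(x, y, 1)       # slope, intercept of the fitted line
    r = y - (b0 + b1 * x)              # raw residuals r_i = y_i - yhat_i
    return r / (sigma * np.sqrt(1.0 - leverage(x)))
```

A quick sanity check: for a line fit with intercept the hat matrix has trace 2, so the leverages always sum to 2.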

Standardized Residuals

There are two ways to estimate σ:

1. Using all of the data:

   σ̂² = SSE / (n − 2)

2. Let SSE(i) be the sum of squared residuals obtained when we fit the model to the sample with the i-th point taken out:

   σ̂²(i) = SSE(i) / (n − 3).

Thus, there are two ways to standardize r_i:

1. Studentized residuals:

   r_i = r_i / (σ̂ √(1 − h_ii))

2. Studentized deleted residuals:

   r*_i = r_i / (σ̂(i) √(1 − h_ii)),

   which has a t_{n−3} distribution.
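The leave-one-out recipe can be brute-forced directly: refit the line n times, once per deleted point. This is my own illustrative sketch of the definitions above, with σ̂(i)² = SSE(i)/(n − 3):

```python
import numpy as np

def deleted_residuals(x, y):
    """Studentized deleted residuals r*_i via brute-force leave-one-out refits."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    # leverage and raw residuals from the full fit
    dev = x - x.mean()
    h = 1.0 / n + dev**2 / np.sum(dev**2)
    b1, b0 = np.polyfit(x, y, 1)
    r = y - (b0 + b1 * x)
    out = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        c1, c0 = np.polyfit(x[mask], y[mask], 1)          # fit without point i
        sse_i = np.sum((y[mask] - (c0 + c1 * x[mask]))**2)
        sigma_i = np.sqrt(sse_i / (n - 3))                 # sigma-hat(i)
        out[i] = r[i] / (sigma_i * np.sqrt(1.0 - h[i]))
    return out
```

Because the i-th point does not inflate σ̂(i), a gross outlier produces a very large |r*_i| even when it drags the full fit toward itself.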


New York Rivers Data

Haith (1976) study: How does land use around a river basin contribute to water pollution?

River         Forest  CommIndl  Nitrogen
Olean           63      0.29      1.1
Cassadaga       57      0.09      1.01
Oatka           26      0.58      1.9
Neversink       84      1.98      1
Hackensack      27      3.11      1.99
Wappinger       61      0.56      1.42
Fishkill        60      1.11      2.04
Honeoye         43      0.24      1.65
Susquehanna     62      0.15      1.01
Chenango        60      0.23      1.21
Tioughnioga     53      0.18      1.33
WestCanada      75      0.16      0.75
EastCanada      84      0.12      0.73
Saranac         81      0.35      0.8
Ausable         89      0.35      0.76
Black           82      0.15      0.87
Schoharie       70      0.22      0.8
Raquette        75      0.18      0.87
Oswegatchie     56      0.13      0.66
Cohocton        49      0.13      1.25

CommIndl: % land area in either commercial or industrial use
Nitrogen: mean nitrogen concentration (mg/liter)
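To make the data concrete, here is a sketch that fits the Nitrogen-on-CommIndl line to the table above with NumPy (a stand-in for whatever software the course used; the numbers are transcribed from the table):

```python
import numpy as np

# (CommIndl, Nitrogen) pairs transcribed from the New York rivers table
rivers = {
    "Olean": (0.29, 1.10), "Cassadaga": (0.09, 1.01), "Oatka": (0.58, 1.90),
    "Neversink": (1.98, 1.00), "Hackensack": (3.11, 1.99),
    "Wappinger": (0.56, 1.42), "Fishkill": (1.11, 2.04),
    "Honeoye": (0.24, 1.65), "Susquehanna": (0.15, 1.01),
    "Chenango": (0.23, 1.21), "Tioughnioga": (0.18, 1.33),
    "WestCanada": (0.16, 0.75), "EastCanada": (0.12, 0.73),
    "Saranac": (0.35, 0.80), "Ausable": (0.35, 0.76),
    "Black": (0.15, 0.87), "Schoharie": (0.22, 0.80),
    "Raquette": (0.18, 0.87), "Oswegatchie": (0.13, 0.66),
    "Cohocton": (0.13, 1.25),
}
x = np.array([v[0] for v in rivers.values()])
y = np.array([v[1] for v in rivers.values()])
b1, b0 = np.polyfit(x, y, 1)     # Nitrogen = b0 + b1 * CommIndl
resid = y - (b0 + b1 * x)        # raw residuals for the diagnostics below
```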


Diagnostic plots - scatter plot

Fit model:

Nitrogen = β_0 + β_1 Agr + error


Diagnostic plots - scatter plot

Fit model:

Nitrogen = β_0 + β_1 Forest + error


Diagnostic plots - standardized residual plot

Plot standardized residuals versus X_i; large standardized residuals indicate outliers.


Diagnostic plots - qq-plot of residuals

Quantile-quantile plots may sometimes be helpful for identifying outliers.


Diagnostic plots - scatter plot

Fit model:

Nitrogen = β_0 + β_1 CommInd + error



Outliers in X versus Outliers in Y

1. Outliers in Y can be detected by examining the standardized residuals,

   r_i = r_i / (σ̂ √(1 − h_ii)),

   which are easier to compute.

2. Outliers in X can be detected using the leverage values,

   h_ii = 1/n + (x_i − x̄)² / ∑_j (x_j − x̄)².

The mean of the leverage values is 2/n. When the data set is reasonably large, leverage values larger than 2 × (mean leverage) are considered large. Another convention is to consider leverage between 0.2 and 0.5 as high.
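The two leverage conventions can be turned into a small flagging helper (an illustrative sketch; the `rule` parameter and its values are my own names, not from the lecture):

```python
import numpy as np

def high_leverage(x, rule="2xmean"):
    """Flag high-leverage x values for a simple linear regression.

    Mean leverage is 2/n, so '2xmean' flags h_ii > 4/n; 'band' flags
    h_ii > 0.2, the other convention mentioned on the slide."""
    x = np.asarray(x, float)
    n = len(x)
    dev = x - x.mean()
    h = 1.0 / n + dev**2 / np.sum(dev**2)
    cut = 4.0 / n if rule == "2xmean" else 0.2
    return h > cut
```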


Leverage versus Residuals

(scatterplot of standardized residuals against leverage)

Is there a measure that can detect both types of outliers?

Anscombe’s quartet


Influence of Outliers

How much influence does the data point have on the model fit?


Different measures of Influence

1. How much influence does observation i have on its own fit?

   (DFFITS)_i = (ŷ_i − ŷ_i(i)) / √(MSE(i) h_ii)

   DFFITS exceeding 2√(p/n) is considered large.

2. How much influence does observation i have on the fitted β’s?

   (DFBETAS)_i = (β̂_1 − β̂_1(i)) / √(MSE(i) c_11^{−1}),

   where c_11 = ∑_i (x_i − x̄)². DFBETAS exceeding 2/√n is considered large.

These conventional rules work for “reasonably sized” data sets.
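DFFITS can likewise be computed by brute-force deletion, following its definition directly (an illustrative sketch, with MSE(i) on n − 3 degrees of freedom):

```python
import numpy as np

def dffits(x, y):
    """(DFFITS)_i = (yhat_i - yhat_i(i)) / sqrt(MSE(i) * h_ii),
    computed by refitting the line without point i."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    dev = x - x.mean()
    h = 1.0 / n + dev**2 / np.sum(dev**2)
    b1, b0 = np.polyfit(x, y, 1)
    yhat = b0 + b1 * x
    out = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        c1, c0 = np.polyfit(x[mask], y[mask], 1)   # fit without point i
        yhat_i = c0 + c1 * x[i]                    # deleted prediction at x_i
        mse_i = np.sum((y[mask] - (c0 + c1 * x[mask]))**2) / (n - 3)
        out[i] = (yhat[i] - yhat_i) / np.sqrt(mse_i * h[i])
    return out
```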


Different measures of Influence: Cook’s Distance

1. Cook’s distance is defined as

   D_i = ∑_{j=1}^n (ŷ_j − ŷ_j(i))² / (p · MSE)

2. It considers the influence of Y_i on all of the fitted values, not just the i-th case.

3. It can be shown that D_i is equivalent to

   r_i² / (p · MSE) · h_ii / (1 − h_ii)².

4. Compare D_i to the F_{p, n−p} distribution.
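The closed form for Cook’s distance lends itself to a short sketch (illustrative code, not from the lecture), with p = 2 for the simple linear model:

```python
import numpy as np

def cooks_distance(x, y):
    """D_i = r_i^2/(p*MSE) * h_ii/(1 - h_ii)^2, with p = 2 for a line fit."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n, p = len(x), 2
    dev = x - x.mean()
    h = 1.0 / n + dev**2 / np.sum(dev**2)
    b1, b0 = np.polyfit(x, y, 1)
    r = y - (b0 + b1 * x)                # raw residuals
    mse = np.sum(r**2) / (n - p)
    return r**2 / (p * mse) * h / (1.0 - h)**2
```

For least squares, this closed form agrees exactly with the defining sum over all fitted values, which is what makes it cheap to compute.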


Cook’s Distance


Masking of Outliers


What to do with outliers?

1. Sometimes they hint that our model assumptions are wrong.
2. Down-weight or delete outlying data points.
3. Transform the variables (more on this later).
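Option 2 (delete outlying points and refit) is a one-liner with NumPy; a minimal sketch, assuming the indices to drop have already been chosen by the diagnostics above:

```python
import numpy as np

def refit_without(x, y, drop):
    """Refit the least-squares line with the indices in `drop` deleted.
    Returns (b0, b1). A blunt option: prefer checking the model first."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    keep = np.setdiff1d(np.arange(len(x)), drop)
    b1, b0 = np.polyfit(x[keep], y[keep], 1)
    return b0, b1
```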

