Assumptions of Ordinary Least Squares Regression


Page 1

Assumptions of Ordinary Least Squares Regression

Page 2

Assumptions of OLS regression

1. Model is linear in parameters
2. The data are a random sample of the population
   – The errors are statistically independent from one another
3. The expected value of the errors is always zero
4. The independent variables are not too strongly collinear
5. The independent variables are measured precisely
6. The residuals have constant variance
7. The errors are normally distributed

• If assumptions 1-5 are satisfied, then the OLS estimator is unbiased

• If assumption 6 is also satisfied, then the OLS estimator has minimum variance among all linear unbiased estimators (it is BLUE)

• If assumption 7 is also satisfied, then we can do hypothesis testing using t and F tests

• How can we test these assumptions?
• If assumptions are violated,
  – what does this do to our conclusions?
  – how do we fix the problem?
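As a concrete starting point for the diagnostics on the following pages, here is a minimal sketch in Python with statsmodels (the synthetic data and the names x1, x2, y are assumptions, not the lecture's dataset) of fitting an OLS model and pulling out the residuals and fitted values that the later checks rely on.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic example data (placeholders, not the chlorophyll dataset)
rng = np.random.default_rng(42)
n = 100
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 2.0 + 1.5 * x1 + 0.8 * x2 + rng.normal(0, 1.0, n)

X = sm.add_constant(np.column_stack([x1, x2]))  # add the intercept column
results = sm.OLS(y, X).fit()

print(results.summary())        # coefficients, t and F tests (assumption 7)
resid = results.resid           # raw residuals, used for assumptions 3, 6, 7
fitted = results.fittedvalues   # fitted values, used for residual-vs-fitted plots
```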

Page 3

1. Model not linear in parameters

• Problem: Can't fit the model!
• Diagnosis: Look at the model
• Solutions:
  1. Re-frame the model
  2. Use nonlinear least squares (NLS) regression (see the sketch below)
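If the model truly is nonlinear in its parameters, scipy's curve_fit gives a basic NLS fit. A minimal sketch, assuming a saturating model y = a(1 - exp(-b x)) and synthetic data chosen purely for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical model that is nonlinear in its parameters: y = a * (1 - exp(-b * x))
def saturating(x, a, b):
    return a * (1.0 - np.exp(-b * x))

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = saturating(x, 5.0, 0.7) + rng.normal(0, 0.2, x.size)

# Nonlinear least squares; p0 gives starting values for the parameters
popt, pcov = curve_fit(saturating, x, y, p0=[1.0, 0.1])
print("a, b estimates:", popt)
print("approximate standard errors:", np.sqrt(np.diag(pcov)))
```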

Page 4

2. Errors not independent

• Problem: parameter estimates are biased

• Diagnosis (1): look for correlation between residuals and another variable (not in the model)
  – I.e., residuals are dominated by another variable, Z, which is not random with respect to the other independent variables

• Solution (1): add the variable to the model

• Diagnosis (2): look at the autocorrelation function of the residuals to find patterns in
  – time
  – space
  – I.e., observations that are nearby in time or space have residuals that are more similar than average

• Solution (2): fit the model using generalized least squares (GLS), as sketched below
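A sketch of the autocorrelation route in Python/statsmodels, on synthetic time-ordered data with AR(1) errors (all values are assumptions): the Durbin-Watson statistic and the residual autocorrelation function flag serial dependence, and GLSAR refits the model by GLS with AR(1) errors. GLSAR is one particular flavor of GLS, not the only option.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.tsa.stattools import acf

# Synthetic time-ordered data with autocorrelated errors (values are assumptions)
rng = np.random.default_rng(11)
n = 200
t = np.arange(n)
e = np.zeros(n)
for i in range(1, n):
    e[i] = 0.7 * e[i - 1] + rng.normal(0, 1.0)   # AR(1) errors
y = 1.0 + 0.05 * t + e
X = sm.add_constant(t)

ols_fit = sm.OLS(y, X).fit()
print("Durbin-Watson:", durbin_watson(ols_fit.resid))   # well below 2 here
print("Residual ACF:", acf(ols_fit.resid, nlags=5))

# Refit by GLS assuming AR(1) errors
glsar = sm.GLSAR(y, X, rho=1)            # rho=1 means one autoregressive lag
glsar_fit = glsar.iterative_fit(maxiter=10)
print("Estimated AR(1) coefficient:", glsar.rho)
print(glsar_fit.params)
```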

Page 5

3. Average error not everywhere zero (“nonlinearity”)

• Problem: indicates that model is wrong

• Diagnosis:
  – Look for curvature in plot of observed vs. predicted Y

[Figure: observed chlorophyll-a vs. predicted chlorophyll-a]

Page 6

3. Average error not everywhere zero (“nonlinearity”)

• Problem: indicates that model is wrong

• Diagnosis:
  – Look for curvature in plot of observed vs. predicted Y
  – Look for curvature in plot of residuals vs. predicted Y

[Figure: residuals vs. predicted chlorophyll-a]

Page 7

3. Average error not everywhere zero (“nonlinearity”)

• Problem: indicates that model is wrong
• Diagnosis:
  – Look for curvature in plot of observed vs. predicted Y
  – Look for curvature in plot of residuals vs. predicted Y
  – Look for curvature in partial-residual plots (also called component+residual plots [CR plots])

• Most software doesn't provide these, so instead you can take a quick look at plots of Y vs. each of the independent variables
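Some packages do provide them: statsmodels, for instance, has component-plus-residual (CCPR) plots built in. A minimal sketch on a small synthetic fit (the data, and the curvature in x1, are assumptions for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Small synthetic fit so the plots have something to show (values are assumptions)
rng = np.random.default_rng(4)
x1, x2 = rng.uniform(0, 10, 80), rng.uniform(0, 5, 80)
y = 1.0 + 0.5 * x1**2 + 2.0 * x2 + rng.normal(0, 2.0, 80)   # curvature in x1
results = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# Component + residual (partial residual) plot for one predictor;
# statsmodels auto-names unnamed design-matrix columns "x1", "x2", ...
sm.graphics.plot_ccpr(results, "x1")
plt.show()

# Or a grid with one CCPR panel per predictor
sm.graphics.plot_ccpr_grid(results)
plt.show()
```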

Page 8

A simple look at nonlinearity: bivariate plots

[Figures: chlorophyll vs. phosphorus; chlorophyll vs. NP]

Page 9

A better way to look at nonlinearity: partial residual plots

• The previous plots are fitting a different model:
  – For phosphorus, we are looking at residuals from the model
    $C_i = a_0 + a_1 P_i + e_i$
  – We want to look at residuals from
    $C_i = b_0 + b_1 P_i + b_2 N_i P_i + e_i$

• Construct partial residuals (see the sketch below):
  – Phosphorus: $PR_i = b_1 P_i + e_i$
  – NP: $PR_i = b_2 N_i P_i + e_i$
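The same construction by hand, as a sketch in Python (the data frame and its columns chl, P, and NP are synthetic stand-ins, not the lecture's chlorophyll data): fit the full model, then add each term's fitted contribution back onto the residuals.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the chlorophyll data (column names are assumptions)
rng = np.random.default_rng(1)
df = pd.DataFrame({"P": rng.uniform(10, 600, 25), "NP": rng.uniform(100, 5000, 25)})
df["chl"] = 5 + 0.15 * df["P"] + 0.01 * df["NP"] + rng.normal(0, 10, 25)

full = smf.ols("chl ~ P + NP", data=df).fit()   # C_i = b0 + b1*P_i + b2*NP_i + e_i

# Partial residuals: each term's fitted contribution added back to the residuals
pr_phosphorus = full.params["P"] * df["P"] + full.resid    # PR_i = b1*P_i + e_i
pr_np = full.params["NP"] * df["NP"] + full.resid          # PR_i = b2*NP_i + e_i
# Plot pr_phosphorus vs. df["P"] and pr_np vs. df["NP"] and look for curvature.
```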

Page 10

A better way to look at nonlinearity: partial residual plots

[Figures: phosphorus partial residuals ($PR_i = b_1 P_i + e_i$) vs. phosphorus; NP partial residuals ($PR_i = b_2 N_i P_i + e_i$) vs. NP]

Page 11

Average error not everywhere zero (“nonlinearity”)

• Solutions:
  – If the pattern is monotonic*, try transforming the independent variable
    • Downward curving: use powers less than one (e.g., square root, log, inverse)
    • Upward curving: use powers greater than one (e.g., square)
  – If not, try adding additional terms in the independent variable (e.g., quadratic); see the sketch below

* Monotonic: always increasing or always decreasing
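Both remedies are one-line changes with a formula interface. A sketch using statsmodels formulas on the same kind of synthetic stand-in data used earlier (column names and values are assumptions):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Regenerate the synthetic stand-in data from the partial-residual sketch
rng = np.random.default_rng(1)
df = pd.DataFrame({"P": rng.uniform(10, 600, 25), "NP": rng.uniform(100, 5000, 25)})
df["chl"] = 5 + 0.15 * df["P"] + 0.01 * df["NP"] + rng.normal(0, 10, 25)

# Downward-curving pattern: use a power less than one on the predictor
sqrt_fit = smf.ols("chl ~ np.sqrt(P) + NP", data=df).fit()
log_fit = smf.ols("chl ~ np.log(P) + NP", data=df).fit()

# Non-monotonic pattern: add a quadratic term instead of transforming
quad_fit = smf.ols("chl ~ P + I(P**2) + NP", data=df).fit()
print(quad_fit.params)
```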

Page 12

4. Independent variables are collinear

• Problem: parameter estimates are imprecise

• Diagnosis:
  – Look for correlations among independent variables (see the sketch below)
  – In regression output, none of the individual terms are significant, even though the model as a whole is

• Solutions:
  – Live with it
  – Remove statistically redundant variables
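One numerical check that goes along with the correlation look, though it is not named on the slide, is the variance inflation factor (VIF). A sketch on synthetic predictors with correlation near 0.95 (values chosen as an assumption to mimic the next page):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Two strongly correlated predictors (synthetic; all values are assumptions)
rng = np.random.default_rng(6)
x = rng.normal(0, 1, 50)
z = 0.95 * x + rng.normal(0, np.sqrt(1 - 0.95**2), 50)   # corr(x, z) ~ 0.95
X = sm.add_constant(np.column_stack([x, z]))

# Simple check: correlation matrix of the predictors
print(np.corrcoef(x, z))

# Variance inflation factors; values above ~10 are a common rule-of-thumb warning
for i, name in [(1, "x"), (2, "z")]:
    print(name, variance_inflation_factor(X, i))
```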

Page 13

Parameter   Est value    St dev      t student   Prob(>|t|)
b0          16.37383     41.50584     0.394495   0.696315
b1           1.986335     1.02642     1.935206   0.063504
b2          -1.22964      2.131899   -0.57678    0.568867

Model: y = b0 + b1*x1 + b2*x2
Residual St dev: 31.6315
R2: 0.534192    R2(adj): 0.499688
F: 15.48191     Prob(>F): 3.32E-05

True values: $\beta_0 = 0$, $\beta_1 = 1$, $\beta_2 = 0.5$, $\rho_{XZ} = 0.95$
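The stated β and ρ values suggest the table on this page comes from simulated data. A sketch of how such a demonstration could be generated (the sample size and noise level are assumptions, so the numbers will not match the table exactly):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 30  # sample size is an assumption

# Two predictors with correlation ~0.95
cov = [[1.0, 0.95], [0.95, 1.0]]
xz = rng.multivariate_normal([0.0, 0.0], cov, size=n)
x, z = xz[:, 0], xz[:, 1]

# True model: y = 0 + 1*x + 0.5*z + noise
y = 0.0 + 1.0 * x + 0.5 * z + rng.normal(0, 1.0, n)

results = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()
print(results.summary())  # individual t-tests tend to be weak while the overall F is strong
```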

Page 14

5. Independent variables not precise (“measurement error”)

• Problem: parameter estimates are biased

• Diagnosis: know how your data were collected!

• Solution: very hard
  – State space models
  – Restricted maximum likelihood (REML)
  – Use simulations to estimate bias (see the sketch below)
  – Consult a professional!
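A sketch of the "use simulations to estimate bias" idea (all values are assumptions): simulate data with a known slope, add measurement error to the predictor, and watch the OLS slope shrink toward zero (attenuation).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n, true_slope = 200, 2.0
x_true = rng.normal(0, 1.0, n)
y = 1.0 + true_slope * x_true + rng.normal(0, 0.5, n)

for meas_sd in [0.0, 0.5, 1.0]:                  # measurement-error std dev on x
    x_obs = x_true + rng.normal(0, meas_sd, n)   # what we actually measure
    fit = sm.OLS(y, sm.add_constant(x_obs)).fit()
    print(f"meas_sd={meas_sd}: estimated slope = {fit.params[1]:.2f}")
# The slope estimate shrinks toward zero as the measurement error grows.
```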

Page 15

6. Errors have non-constant variance (“heteroskedasticity”)

• Problem:
  – Parameter estimates are unbiased
  – P-values are unreliable

• Diagnosis: plot residuals against fitted values

Page 16

[Figure: residuals vs. predicted chlorophyll-a]

Page 17

Errors have non-constant variance (“heteroskedasticity”)

• Problem:
  – Parameter estimates are unbiased
  – P-values are unreliable

• Diagnosis: plot studentized residuals against fitted values

• Solutions:
  – Transform the dependent variable
    • If residual variance increases with predicted value, try transforming with a power less than one

Page 18

Try square root transform

[Figure: sqrt(Chlorophyll-a) residuals vs. sqrt(Chlorophyll-a) predicted]

Page 19

Errors have non-constant variance (“heteroskedasticity”)

• Problem:
  – Parameter estimates are unbiased
  – P-values are unreliable

• Diagnosis: plot studentized residuals against fitted values

• Solutions:
  – Transform the dependent variable
    • May create nonlinearity in the model
  – Fit a generalized linear model (GLM)
    • For some distributions, the variance changes with the mean in predictable ways
  – Fit a generalized least squares model (GLS)
    • Specifies how the variance depends on one or more variables
  – Fit a weighted least squares regression (WLS); see the sketch below
    • Also good when data points have differing amounts of precision
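A sketch of the WLS option in statsmodels (synthetic data; the weights, taken as the inverse of an error variance assumed to grow with x, are an assumption chosen for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 100
x = rng.uniform(1, 10, n)
# Heteroskedastic errors: the noise standard deviation grows with x
y = 1.0 + 2.0 * x + rng.normal(0, 0.5 * x, n)

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()

# WLS with weights inversely proportional to the (assumed) error variance
weights = 1.0 / x**2
wls_fit = sm.WLS(y, X, weights=weights).fit()
print("OLS standard errors:", ols_fit.bse)
print("WLS standard errors:", wls_fit.bse)
```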

Page 20

7. Errors not normally distributed

• Problem:
  – Parameter estimates are unbiased
  – P-values are unreliable
  – Regression fits the mean; with skewed residuals the mean is not a good measure of central tendency

• Diagnosis: examine QQ plot of residuals
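A sketch of the QQ-plot diagnosis in statsmodels, on a synthetic fit with deliberately skewed errors (an assumption for illustration); the studentized residuals used here anticipate the refinement discussed on the later pages.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Fit on synthetic data with right-skewed errors (values are assumptions)
rng = np.random.default_rng(8)
x = rng.uniform(0, 10, 100)
y = 1.0 + 2.0 * x + rng.gamma(shape=2.0, scale=2.0, size=100)   # skewed errors
results = sm.OLS(y, sm.add_constant(x)).fit()

# Internally studentized residuals correct for bias in the residual-variance estimate
stud_resid = results.get_influence().resid_studentized_internal

# QQ plot against the standard normal; points should track the 45-degree line
sm.qqplot(stud_resid, line="45")
plt.show()
```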

Page 21

[Figure: normal quantile plot of studentized residuals, Chlorophyll-a]

Page 22

Errors not normally distributed

• Problem:
  – Parameter estimates are unbiased
  – P-values are unreliable
  – Regression fits the mean; with skewed residuals the mean is not a good measure of central tendency

• Diagnosis: examine QQ plot of studentized residuals
  – Corrects for bias in estimates of residual variance

• Solutions:
  – Transform the dependent variable
    • May create nonlinearity in the model

Page 23

Try transforming the response variable

[Figure: normal quantile plot of studentized residuals, sqrt(Chlorophyll-a)]

Page 24

But we’ve introduced nonlinearity…

[Figures: Actual by Predicted Plot, Chlorophyll-a (P<.0001, RSq=0.88, RMSE=18.074); Actual by Predicted Plot, sqrt(Chlorophyll-a) (P<.0001, RSq=0.85, RMSE=1.4361)]

Page 25

Errors not normally distributed

• Problem:
  – Parameter estimates are unbiased
  – P-values are unreliable
  – Regression fits the mean; with skewed residuals the mean is not a good measure of central tendency

• Diagnosis: examine QQ plot of studentized residuals
  – Corrects for bias in estimates of residual variance

• Solutions:
  – Transform the dependent variable
    • May create nonlinearity in the model
  – Fit a generalized linear model (GLM); see the sketch below
    • Allows us to assume the residuals follow a different distribution (binomial, gamma, etc.)
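A sketch of the GLM route in statsmodels (the choice of a Gamma family with a log link for a positive, right-skewed response is an assumption made for illustration, and the data below are synthetic, not the lecture's):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Synthetic positive, right-skewed response (column names are assumptions)
rng = np.random.default_rng(2)
df2 = pd.DataFrame({"P": rng.uniform(10, 600, 100), "NP": rng.uniform(100, 5000, 100)})
mu = np.exp(1.0 + 0.003 * df2["P"] + 0.0003 * df2["NP"])   # mean on the log-link scale
df2["chl"] = rng.gamma(shape=5.0, scale=mu / 5.0)           # Gamma-distributed around mu

# Gamma GLM with a log link instead of transforming the response
glm_fit = smf.glm(
    "chl ~ P + NP",
    data=df2,
    family=sm.families.Gamma(link=sm.families.links.Log()),
).fit()
print(glm_fit.summary())
```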

Page 26

Summary of OLS assumptions

Violation                 Problem                        Solution
Nonlinear in parameters   Can't fit model                NLS
Non-normal errors         Bad P-values                   Transform Y; GLM
Heteroskedasticity        Bad P-values                   Transform Y; GLM
Nonlinearity              Wrong model                    Transform X; add terms
Nonindependence           Biased parameter estimates     GLS
Measurement error         Biased parameter estimates     Hard!!!
Collinearity              Individual P-values inflated   Remove X terms?

Page 27

Fixing assumptions via data transformations is an iterative process

• After each modification, fit the new model and look at all the assumptions again

Page 28

What can we do about chlorophyll regression?

• Square root transform helps a little with non-normality and a lot with heteroskedasticity

• But it creates nonlinearity

[Figures: NP partial residuals vs. NP; phosphorus partial residuals vs. phosphorus, after the square-root transform]

Page 29

A new model … it’s linear …

[Figures: bivariate fit of sqrt(Chlorophyll-a) by sqrt(Phosphorus), with linear fit; bivariate fit of sqrt(Chlorophyll-a) by sqrt(NP), with linear fit]
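The lecture fits this model in JMP; a sketch of the same square-root-both-sides fit in Python/statsmodels, on strictly positive synthetic stand-in data (the column names and values are assumptions, not the chlorophyll dataset):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic, strictly positive stand-in for the chlorophyll data (an assumption)
rng = np.random.default_rng(9)
df = pd.DataFrame({"P": rng.uniform(10, 600, 25), "NP": rng.uniform(100, 5000, 25)})
df["chl"] = (0.2 * np.sqrt(df["P"]) + 0.15 * np.sqrt(df["NP"]) + rng.normal(0, 1.0, 25)) ** 2

# The new model: square-root transform both the response and the predictors
sqrt_model = smf.ols("np.sqrt(chl) ~ np.sqrt(P) + np.sqrt(NP)", data=df).fit()
print(sqrt_model.summary())

# Then re-check all the assumptions on this new model (residual and QQ plots)
resid, fitted = sqrt_model.resid, sqrt_model.fittedvalues
```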

Page 30

… it’s normal (sort of) and homoskedastic …

[Figures: normal quantile plot of studentized residuals, sqrt(Chlorophyll-a); sqrt(Chlorophyll-a) residuals vs. predicted]

Page 31

… and it fits well!

Whole Model (Response: sqrt(Chlorophyll-a))

Summary of Fit
  RSquare                      0.896972
  RSquare Adj                  0.887606
  Root Mean Square Error       1.198382
  Mean of Response             6.167458
  Observations (or Sum Wgts)   25

Analysis of Variance
  Source     DF   Sum of Squares   Mean Square   F Ratio
  Model       2        275.06699       137.533   95.7674
  Error      22         31.59463         1.436   Prob > F: <.0001
  C. Total   24        306.66163

Parameter Estimates
  Term               Estimate    Std Error   t Ratio   Prob>|t|
  Intercept         -0.901414    0.61584     -1.46     0.1574
  sqrt(Phosphorus)   0.214075    0.095471     2.24     0.0353
  sqrt(NP)           0.1513313   0.037419     4.04     0.0005

Effect Tests
  Source             Nparm   DF   Sum of Squares   F Ratio   Prob > F
  sqrt(Phosphorus)   1       1          7.220755    5.0280     0.0353
  sqrt(NP)           1       1         23.488665   16.3556     0.0005

[Figure: sqrt(Chlorophyll-a) actual vs. predicted (P<.0001, RSq=0.90, RMSE=1.1984)]