gov óþþþ section Ÿ: diagnosing and fixing regression...

110
A D S R A M D GOV Section : Diagnosing and Fixing Regression Problems Konstantin Kashin Harvard University October , ese notes and accompanying code draw on the notes from Molly Roberts, Maya Sen, Iain Osgood, Brandon Stewart, and TF’s from previous years

Upload: others

Post on 29-Mar-2020

11 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

GOV 2000 Section 8:Diagnosing and Fixing Regression Problems

Konstantin Kashin1

Harvard University

October 24, 2012

1�ese notes and accompanying code draw on the notes fromMolly Roberts,Maya Sen, Iain Osgood, Brandon Stewart, and TF’s from previous years

Page 2: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Outline

Administrative Details

Stocktake

Regression Assumptions

Missing Data

Page 3: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Administrative Details

▸ Midterm returned

▸ Problem Set 5 returned; corrections due next Tuesday▸ Problem Set 7 due next Tuesday

Page 4: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Administrative Details

▸ Midterm returned▸ Problem Set 5 returned; corrections due next Tuesday

▸ Problem Set 7 due next Tuesday

Page 5: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Administrative Details

▸ Midterm returned▸ Problem Set 5 returned; corrections due next Tuesday▸ Problem Set 7 due next Tuesday

Page 6: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Outline

Administrative Details

Stocktake

Regression Assumptions

Missing Data

Page 7: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

What have we covered?

▸ Summarizing and describing data: both univariate andmultivariate populations

▸ Sampling as source of randomness and uncertainty▸ Probability / random variables▸ Sample statistics and sampling distributions

▸ Revisit regression in context of sampling▸ Standard errors, hypothesis testing, and con�dence intervals forestimated regression coe�cients

▸ But wait! �ere are many assumptions that go into regression...▸ And there’s missing data...

Page 8: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

What have we covered?

▸ Summarizing and describing data: both univariate andmultivariate populations

▸ Sampling as source of randomness and uncertainty▸ Probability / random variables▸ Sample statistics and sampling distributions

▸ Revisit regression in context of sampling▸ Standard errors, hypothesis testing, and con�dence intervals forestimated regression coe�cients

▸ But wait! �ere are many assumptions that go into regression...▸ And there’s missing data...

Page 9: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

What have we covered?

▸ Summarizing and describing data: both univariate andmultivariate populations

▸ Sampling as source of randomness and uncertainty▸ Probability / random variables▸ Sample statistics and sampling distributions

▸ Revisit regression in context of sampling▸ Standard errors, hypothesis testing, and con�dence intervals forestimated regression coe�cients

▸ But wait! �ere are many assumptions that go into regression...▸ And there’s missing data...

Page 10: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

What have we covered?

▸ Summarizing and describing data: both univariate andmultivariate populations

▸ Sampling as source of randomness and uncertainty▸ Probability / random variables▸ Sample statistics and sampling distributions

▸ Revisit regression in context of sampling▸ Standard errors, hypothesis testing, and con�dence intervals forestimated regression coe�cients

▸ But wait! �ere are many assumptions that go into regression...

▸ And there’s missing data...

Page 11: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

What have we covered?

▸ Summarizing and describing data: both univariate andmultivariate populations

▸ Sampling as source of randomness and uncertainty▸ Probability / random variables▸ Sample statistics and sampling distributions

▸ Revisit regression in context of sampling▸ Standard errors, hypothesis testing, and con�dence intervals forestimated regression coe�cients

▸ But wait! �ere are many assumptions that go into regression...▸ And there’s missing data...

Page 12: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Where are we going?

Causal inference is a missing data problem!

Page 13: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Where are we going?

Causal inference is a missing data problem!

Page 14: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Outline

Administrative Details

Stocktake

Regression Assumptions

Missing Data

Page 15: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

What are key regression assumptions?

▸ Random sampling

▸ Constant variance (homoskedasticity)▸ Normality▸ Linear conditional expectation function

Page 16: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

What are key regression assumptions?

▸ Random sampling▸ Constant variance (homoskedasticity)

▸ Normality▸ Linear conditional expectation function

Page 17: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

What are key regression assumptions?

▸ Random sampling▸ Constant variance (homoskedasticity)▸ Normality

▸ Linear conditional expectation function

Page 18: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

What are key regression assumptions?

▸ Random sampling▸ Constant variance (homoskedasticity)▸ Normality▸ Linear conditional expectation function

Page 19: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Credit Card Expenditures Data

We will be working with ccarddata.csv.

▸ Outcome variable: credit card expenditure▸ Covariates:

▸ Age▸ Household income (monthly in thousands of dollars)▸ Dummy for home ownership

Page 20: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Credit Card Expenditures Data

We will be working with ccarddata.csv.

▸ Outcome variable: credit card expenditure

▸ Covariates:▸ Age▸ Household income (monthly in thousands of dollars)▸ Dummy for home ownership

Page 21: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Credit Card Expenditures Data

We will be working with ccarddata.csv.

▸ Outcome variable: credit card expenditure▸ Covariates:

▸ Age▸ Household income (monthly in thousands of dollars)▸ Dummy for home ownership

Page 22: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

OLS

lm.cc <- lm(ccexpend ~ income + homeowner + age , data=

cc)

Page 23: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

OLS

Estimate Std. Error t value Pr(>∣t∣)(Intercept) 0.3972 163.9851 0.00 0.9981

income 79.8358 23.6724 3.37 0.0012homeowner 32.0877 84.7241 0.38 0.7061

age -0.7769 5.5128 -0.14 0.8883

Page 24: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Violations of Constant Variance Assumption

Why is heteroskedasticity an issue?

▸ Estimated variances / standard errors are biased▸ OLS is no longer e�cient (BLUE)▸ Hypothesis testing and con�dence intervals are o�

Good news: problem is usually not that bad (depends on severity ofheteroskedasticity) and point estimates of regression coe�cients stillunbiased!

Page 25: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Violations of Constant Variance Assumption

Why is heteroskedasticity an issue?

▸ Estimated variances / standard errors are biased

▸ OLS is no longer e�cient (BLUE)▸ Hypothesis testing and con�dence intervals are o�

Good news: problem is usually not that bad (depends on severity ofheteroskedasticity) and point estimates of regression coe�cients stillunbiased!

Page 26: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Violations of Constant Variance Assumption

Why is heteroskedasticity an issue?

▸ Estimated variances / standard errors are biased▸ OLS is no longer e�cient (BLUE)

▸ Hypothesis testing and con�dence intervals are o�

Good news: problem is usually not that bad (depends on severity ofheteroskedasticity) and point estimates of regression coe�cients stillunbiased!

Page 27: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Violations of Constant Variance Assumption

Why is heteroskedasticity an issue?

▸ Estimated variances / standard errors are biased▸ OLS is no longer e�cient (BLUE)▸ Hypothesis testing and con�dence intervals are o�

Good news: problem is usually not that bad (depends on severity ofheteroskedasticity) and point estimates of regression coe�cients stillunbiased!

Page 28: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Violations of Constant Variance Assumption

Why is heteroskedasticity an issue?

▸ Estimated variances / standard errors are biased▸ OLS is no longer e�cient (BLUE)▸ Hypothesis testing and con�dence intervals are o�

Good news: problem is usually not that bad (depends on severity ofheteroskedasticity) and point estimates of regression coe�cients stillunbiased!

Page 29: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

DiagnosingHeteroskedasticity

We can use a scale-location plot to diagnose heteroskedasticity:

100 200 300 400 500 600 700 800

0.0

0.5

1.0

1.5

2.0

2.5

Predicted / Fitted Values

Squ

are

Roo

t of A

bs V

al o

f St.

Res

ids

Page 30: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

DiagnosingHeteroskedasticity

We can use a scale-location plot to diagnose heteroskedasticity:

100 200 300 400 500 600 700 800

0.0

0.5

1.0

1.5

2.0

2.5

Predicted / Fitted Values

Squ

are

Roo

t of A

bs V

al o

f St.

Res

ids

Page 31: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

DiagnosingHeteroskedasticity

In R, we can construct a scale-location plot as follows:

plot(lm.cc, 3)

Or manually as:

scatter.smooth(fitted(lm.cc), sqrt(abs(rstudent(lm.cc)

)), col="red")

Page 32: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

DiagnosingHeteroskedasticity

In R, we can construct a scale-location plot as follows:

plot(lm.cc, 3)

Or manually as:

scatter.smooth(fitted(lm.cc), sqrt(abs(rstudent(lm.cc)

)), col="red")

Page 33: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

FixingHeteroskedasticity

▸ Treat it as a nuisance: use heteroskedasticity-consistent standarderrors (i.e. Huber-White)

▸ Model it: use Weighted Least Squares (WLS)▸ Treat it as model diagnostic tool: change entire model

Page 34: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

FixingHeteroskedasticity

▸ Treat it as a nuisance: use heteroskedasticity-consistent standarderrors (i.e. Huber-White)

▸ Model it: use Weighted Least Squares (WLS)

▸ Treat it as model diagnostic tool: change entire model

Page 35: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

FixingHeteroskedasticity

▸ Treat it as a nuisance: use heteroskedasticity-consistent standarderrors (i.e. Huber-White)

▸ Model it: use Weighted Least Squares (WLS)▸ Treat it as model diagnostic tool: change entire model

Page 36: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Homoskedastic Variance-CovarianceMatrix

How can we characterize the variance of the error terms underhomoskedasticity?

V[є] = Σ =

⎛⎜⎜⎜⎜⎜⎜⎝

σ2 0 0 ⋯ 00 σ2 0 ⋯ 00 0 σ2 ⋯ 0⋮ ⋮ ⋮ ⋱ ⋮0 0 0 ⋯ σ2

⎞⎟⎟⎟⎟⎟⎟⎠

We then estimate σ2 with σ2 = ∑ni=1 є

2

n − k − 1

Page 37: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Homoskedastic Variance-CovarianceMatrix

How can we characterize the variance of the error terms underhomoskedasticity?

V[є] = Σ =

⎛⎜⎜⎜⎜⎜⎜⎝

σ2 0 0 ⋯ 00 σ2 0 ⋯ 00 0 σ2 ⋯ 0⋮ ⋮ ⋮ ⋱ ⋮0 0 0 ⋯ σ2

⎞⎟⎟⎟⎟⎟⎟⎠

We then estimate σ2 with σ2 = ∑ni=1 є

2

n − k − 1

Page 38: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Homoskedastic Variance-CovarianceMatrix

How can we characterize the variance of the error terms underhomoskedasticity?

V[є] = Σ =

⎛⎜⎜⎜⎜⎜⎜⎝

σ2 0 0 ⋯ 00 σ2 0 ⋯ 00 0 σ2 ⋯ 0⋮ ⋮ ⋮ ⋱ ⋮0 0 0 ⋯ σ2

⎞⎟⎟⎟⎟⎟⎟⎠

We then estimate σ2 with σ2 = ∑ni=1 є

2

n − k − 1

Page 39: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Heteroskedastic Variance-CovarianceMatrix

How can we characterize the variance of the error terms underheteroskedasticity?

V[є] = Σ =

⎛⎜⎜⎜⎜⎜⎜⎝

σ21 0 0 ⋯ 00 σ2

2 0 ⋯ 00 0 σ2

3 ⋯ 0⋮ ⋮ ⋮ ⋱ ⋮0 0 0 ⋯ σ2

n

⎞⎟⎟⎟⎟⎟⎟⎠

How do we estimate the σs?

Page 40: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Heteroskedastic Variance-CovarianceMatrix

How can we characterize the variance of the error terms underheteroskedasticity?

V[є] = Σ =

⎛⎜⎜⎜⎜⎜⎜⎝

σ21 0 0 ⋯ 00 σ2

2 0 ⋯ 00 0 σ2

3 ⋯ 0⋮ ⋮ ⋮ ⋱ ⋮0 0 0 ⋯ σ2

n

⎞⎟⎟⎟⎟⎟⎟⎠

How do we estimate the σs?

Page 41: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Heteroskedastic Variance-CovarianceMatrix

How can we characterize the variance of the error terms underheteroskedasticity?

V[є] = Σ =

⎛⎜⎜⎜⎜⎜⎜⎝

σ21 0 0 ⋯ 00 σ2

2 0 ⋯ 00 0 σ2

3 ⋯ 0⋮ ⋮ ⋮ ⋱ ⋮0 0 0 ⋯ σ2

n

⎞⎟⎟⎟⎟⎟⎟⎠

How do we estimate the σs?

Page 42: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Huber-White Variance-CovarianceMatrix

�e Huber-White variance-covariance matrix is a consistent estimateof the heteroskedastic Σ:

V[є] = Σ =

⎛⎜⎜⎜⎜⎜⎜⎝

є21 0 0 ⋯ 00 є22 0 ⋯ 00 0 є23 ⋯ 0⋮ ⋮ ⋮ ⋱ ⋮0 0 0 ⋯ є2n

⎞⎟⎟⎟⎟⎟⎟⎠

Page 43: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Huber-White Variance-CovarianceMatrix

�e Huber-White variance-covariance matrix is a consistent estimateof the heteroskedastic Σ:

V[є] = Σ =

⎛⎜⎜⎜⎜⎜⎜⎝

є21 0 0 ⋯ 00 є22 0 ⋯ 00 0 є23 ⋯ 0⋮ ⋮ ⋮ ⋱ ⋮0 0 0 ⋯ є2n

⎞⎟⎟⎟⎟⎟⎟⎠

Page 44: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

FixingHeteroskedasticity: Huber-White SEs

In R, we can calculate Huber-White SEs as:

library(car)

# returns variance -covariance matrix:

hccm(lm.cc, type="hc0")

# returns standard errors:

sqrt(diag(hccm(lm.cc, type = "hc0")))

Page 45: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

FixingHeteroskedasticity: Huber-White SEs

In R, we can calculate Huber-White SEs as:

library(car)

# returns variance -covariance matrix:

hccm(lm.cc, type="hc0")

# returns standard errors:

sqrt(diag(hccm(lm.cc, type = "hc0")))

Page 46: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

FixingHeteroskedasticity: Small Sample Correction

A potential problem with Huber-White SEs is that it requires a largersample size. �us, we o�en use a small-sample correction to obtainmore conservative estimates:

V[є] = Σ = n

n − k − 1

⎛⎜⎜⎜⎜⎜⎜⎝

є21 0 0 ⋯ 00 є22 0 ⋯ 00 0 є23 ⋯ 0⋮ ⋮ ⋮ ⋱ ⋮0 0 0 ⋯ є2n

⎞⎟⎟⎟⎟⎟⎟⎠

�is is o�en known as HC1 and is used in many publications(,robust option in Stata).

Page 47: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

FixingHeteroskedasticity: Small Sample Correction

A potential problem with Huber-White SEs is that it requires a largersample size. �us, we o�en use a small-sample correction to obtainmore conservative estimates:

V[є] = Σ = n

n − k − 1

⎛⎜⎜⎜⎜⎜⎜⎝

є21 0 0 ⋯ 00 є22 0 ⋯ 00 0 є23 ⋯ 0⋮ ⋮ ⋮ ⋱ ⋮0 0 0 ⋯ є2n

⎞⎟⎟⎟⎟⎟⎟⎠

�is is o�en known as HC1 and is used in many publications(,robust option in Stata).

Page 48: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

FixingHeteroskedasticity: Small Sample Correction

A potential problem with Huber-White SEs is that it requires a largersample size. �us, we o�en use a small-sample correction to obtainmore conservative estimates:

V[є] = Σ = n

n − k − 1

⎛⎜⎜⎜⎜⎜⎜⎝

є21 0 0 ⋯ 00 є22 0 ⋯ 00 0 є23 ⋯ 0⋮ ⋮ ⋮ ⋱ ⋮0 0 0 ⋯ є2n

⎞⎟⎟⎟⎟⎟⎟⎠

�is is o�en known as HC1 and is used in many publications(,robust option in Stata).

Page 49: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

FixingHeteroskedasticity: Small Sample Correction

In R, we can calculate Huber-White SEs with di�erent small-samplecorrections as:

hccm(lm.cc, type="hc1")

hccm(lm.cc, type="hc2")

hccm(lm.cc, type="hc3")

Page 50: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

FixingHeteroskedasticity: Small Sample Correction

In R, we can calculate Huber-White SEs with di�erent small-samplecorrections as:

hccm(lm.cc, type="hc1")

hccm(lm.cc, type="hc2")

hccm(lm.cc, type="hc3")

Page 51: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Heteroskedastic Variance-CovarianceMatrix

We can characterize the variance of the error terms underheteroskedasticity as weighted homoskedastic matrix:

V[є] = Σ =

⎛⎜⎜⎜⎜⎜⎜⎝

a21 σ2 0 0 ⋯ 0

0 a22σ2 0 ⋯ 0

0 0 a23σ2 ⋯ 0

⋮ ⋮ ⋮ ⋱ ⋮0 0 0 ⋯ a2nσ2

⎞⎟⎟⎟⎟⎟⎟⎠

= σ2

⎛⎜⎜⎜⎜⎜⎜⎝

a21 0 0 ⋯ 00 a22 0 ⋯ 00 0 a23 ⋯ 0⋮ ⋮ ⋮ ⋱ ⋮0 0 0 ⋯ a2n

⎞⎟⎟⎟⎟⎟⎟⎠

If we knew weights ai, we could reweight data using1ai. Problem is we

don’t know the weights exactly...

Page 52: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Heteroskedastic Variance-CovarianceMatrix

We can characterize the variance of the error terms underheteroskedasticity as weighted homoskedastic matrix:

V[є] = Σ =

⎛⎜⎜⎜⎜⎜⎜⎝

a21 σ2 0 0 ⋯ 0

0 a22σ2 0 ⋯ 0

0 0 a23σ2 ⋯ 0

⋮ ⋮ ⋮ ⋱ ⋮0 0 0 ⋯ a2nσ2

⎞⎟⎟⎟⎟⎟⎟⎠

= σ2

⎛⎜⎜⎜⎜⎜⎜⎝

a21 0 0 ⋯ 00 a22 0 ⋯ 00 0 a23 ⋯ 0⋮ ⋮ ⋮ ⋱ ⋮0 0 0 ⋯ a2n

⎞⎟⎟⎟⎟⎟⎟⎠

If we knew weights ai, we could reweight data using1ai. Problem is we

don’t know the weights exactly...

Page 53: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Heteroskedastic Variance-CovarianceMatrix

We can characterize the variance of the error terms underheteroskedasticity as weighted homoskedastic matrix:

V[є] = Σ =

⎛⎜⎜⎜⎜⎜⎜⎝

a21 σ2 0 0 ⋯ 0

0 a22σ2 0 ⋯ 0

0 0 a23σ2 ⋯ 0

⋮ ⋮ ⋮ ⋱ ⋮0 0 0 ⋯ a2nσ2

⎞⎟⎟⎟⎟⎟⎟⎠

= σ2

⎛⎜⎜⎜⎜⎜⎜⎝

a21 0 0 ⋯ 00 a22 0 ⋯ 00 0 a23 ⋯ 0⋮ ⋮ ⋮ ⋱ ⋮0 0 0 ⋯ a2n

⎞⎟⎟⎟⎟⎟⎟⎠

If we knew weights ai, we could reweight data using1ai. Problem is we

don’t know the weights exactly...

Page 54: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

FixingHeteroskedasticity: Weighted Least Squares(WLS)

An alternative to heteroskedasticity-consistent standard errors is usingWLS whereby we weight observations that we believe have small errorvariance higher.

▸ E�cient if we correctly specify weights▸ Wemay have good guess as to what weights are▸ Unbiased for βs and consistent for V(β) even if we get weightswrong

If we, for example, believe that error variance is inversely proportionalto income:

lm.cc.wt <- lm(lm.cc$call , weights =1/income , data=cc)

Page 55: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

FixingHeteroskedasticity: Weighted Least Squares(WLS)

An alternative to heteroskedasticity-consistent standard errors is usingWLS whereby we weight observations that we believe have small errorvariance higher.▸ E�cient if we correctly specify weights

▸ Wemay have good guess as to what weights are▸ Unbiased for βs and consistent for V(β) even if we get weightswrong

If we, for example, believe that error variance is inversely proportionalto income:

lm.cc.wt <- lm(lm.cc$call , weights =1/income , data=cc)

Page 56: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

FixingHeteroskedasticity: Weighted Least Squares(WLS)

An alternative to heteroskedasticity-consistent standard errors is usingWLS whereby we weight observations that we believe have small errorvariance higher.▸ E�cient if we correctly specify weights▸ Wemay have good guess as to what weights are

▸ Unbiased for βs and consistent for V(β) even if we get weightswrong

If we, for example, believe that error variance is inversely proportionalto income:

lm.cc.wt <- lm(lm.cc$call , weights =1/income , data=cc)

Page 57: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

FixingHeteroskedasticity: Weighted Least Squares(WLS)

An alternative to heteroskedasticity-consistent standard errors is usingWLS whereby we weight observations that we believe have small errorvariance higher.▸ E�cient if we correctly specify weights▸ Wemay have good guess as to what weights are▸ Unbiased for βs and consistent for V(β) even if we get weightswrong

If we, for example, believe that error variance is inversely proportionalto income:

lm.cc.wt <- lm(lm.cc$call , weights =1/income , data=cc)

Page 58: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

FixingHeteroskedasticity: Weighted Least Squares(WLS)

An alternative to heteroskedasticity-consistent standard errors is usingWLS whereby we weight observations that we believe have small errorvariance higher.▸ E�cient if we correctly specify weights▸ Wemay have good guess as to what weights are▸ Unbiased for βs and consistent for V(β) even if we get weightswrong

If we, for example, believe that error variance is inversely proportionalto income:

lm.cc.wt <- lm(lm.cc$call , weights =1/income , data=cc)

Page 59: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Violations of Normality Assumption

Why is non-normality an issue?

In small samples:

▸ β will not have normal sampling distribution▸ Test statistics will not have t distributions▸ Since SEs are o�, we have incorrect probability of Type I error intesting and incorrect coverage of con�dence intervals

Good news: in large samples, Central Limit�eorem makes theseproblems go away!

Page 60: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Violations of Normality Assumption

Why is non-normality an issue?

In small samples:▸ β will not have normal sampling distribution

▸ Test statistics will not have t distributions▸ Since SEs are o�, we have incorrect probability of Type I error intesting and incorrect coverage of con�dence intervals

Good news: in large samples, Central Limit�eorem makes theseproblems go away!

Page 61: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Violations of Normality Assumption

Why is non-normality an issue?

In small samples:▸ β will not have normal sampling distribution▸ Test statistics will not have t distributions

▸ Since SEs are o�, we have incorrect probability of Type I error intesting and incorrect coverage of con�dence intervals

Good news: in large samples, Central Limit�eorem makes theseproblems go away!

Page 62: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Violations of Normality Assumption

Why is non-normality an issue?

In small samples:▸ β will not have normal sampling distribution▸ Test statistics will not have t distributions▸ Since SEs are o�, we have incorrect probability of Type I error intesting and incorrect coverage of con�dence intervals

Good news: in large samples, Central Limit�eorem makes theseproblems go away!

Page 63: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Violations of Normality Assumption

Why is non-normality an issue?

In small samples:▸ β will not have normal sampling distribution▸ Test statistics will not have t distributions▸ Since SEs are o�, we have incorrect probability of Type I error intesting and incorrect coverage of con�dence intervals

Good news: in large samples, Central Limit�eorem makes theseproblems go away!

Page 64: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Diagnosing Non-Normality

▸ Density plots of errors: studentized residuals should have tdistribution with n − k − 2 degrees of freedom

▸ Formal tests▸ Quantile-quantile (Q-Q) plots

Page 65: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Diagnosing Non-Normality

▸ Density plots of errors: studentized residuals should have tdistribution with n − k − 2 degrees of freedom

▸ Formal tests

▸ Quantile-quantile (Q-Q) plots

Page 66: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Diagnosing Non-Normality

▸ Density plots of errors: studentized residuals should have tdistribution with n − k − 2 degrees of freedom

▸ Formal tests▸ Quantile-quantile (Q-Q) plots

Page 67: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Diagnosing Non-Normality: Q-Q Plots

▸ Generally: we compare quantiles of empirical distribution withquantiles of theoretical distribution

▸ Speci�cally: we compare quantiles of studentized residuals toquantiles of t distribution

Page 68: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Diagnosing Non-Normality: Q-Q Plots

▸ Generally: we compare quantiles of empirical distribution withquantiles of theoretical distribution

▸ Speci�cally: we compare quantiles of studentized residuals toquantiles of t distribution

Page 69: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Diagnosing Non-Normality: Q-Q Plots

In R:

library(car)

qqPlot(lm.cc)

plot(lm.cc ,2)

Page 70: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Diagnosing Non-Normality: Q-Q Plots

-2 -1 0 1 2

02

46

t Quantiles

Stu

dent

ized

Res

idua

ls(lm

.cc)

Page 71: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Violations of Linearity Assumption

Why is non-linearity an issue?

▸ Bias in estimated regression function (for population CEF, not forbest linear approximation to CEF)

▸ Leads to bad inferences and poor prediction

Page 72: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Violations of Linearity Assumption

Why is non-linearity an issue?

▸ Bias in estimated regression function (for population CEF, not forbest linear approximation to CEF)

▸ Leads to bad inferences and poor prediction

Page 73: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Violations of Linearity Assumption

Why is non-linearity an issue?

▸ Bias in estimated regression function (for population CEF, not forbest linear approximation to CEF)

▸ Leads to bad inferences and poor prediction

Page 74: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Diagnosing Non-Linearity: GAM Plots

Fitting a generalized additive model (GAM) can reveal non-linearities:

library(mgcv)

gam.cc <- gam(ccexpend ~ s(income) + s(age) +

homeowner , data=cc)

▸ Using s() around the variables allows GAM to choose smoothfunctional form

▸ Algorithmminimizes deviations from surface without �tting datatoo closely (bias-variance tradeo�)

Page 75: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Diagnosing Non-Linearity: GAM Plots

Fitting a generalized additive model (GAM) can reveal non-linearities:

library(mgcv)

gam.cc <- gam(ccexpend ~ s(income) + s(age) +

homeowner , data=cc)

▸ Using s() around the variables allows GAM to choose smoothfunctional form

▸ Algorithmminimizes deviations from surface without �tting datatoo closely (bias-variance tradeo�)

Page 76: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Diagnosing Non-Linearity: GAM Plots

Fitting a generalized additive model (GAM) can reveal non-linearities:

library(mgcv)

gam.cc <- gam(ccexpend ~ s(income) + s(age) +

homeowner , data=cc)

▸ Using s() around the variables allows GAM to choose smoothfunctional form

▸ Algorithmminimizes deviations from surface without �tting datatoo closely (bias-variance tradeo�)

Page 77: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Diagnosing Non-Linearity: GAM Plots

> summary(gam.cc)

Parametric coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 252.06 45.78 5.506 6.26e-07 ***

homeowner 27.94 83.10 0.336 0.738

---

Approximate significance of smooth terms:

edf Ref.df F p-value

s(income) 1.912 2.38 6.151 0.00229 **

s(age) 1.000 1.00 0.171 0.68057

---

R-sq.(adj) = 0.199 Deviance explained = 24.3%

GCV score = 86930 Scale est. = 81000 n = 72

Page 78: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Diagnosing Non-Linearity: GAM Plots

▸ Equilalent degrees of freedom (edf): how many variables areneeded to de�ne smooth regression surface

▸ edf= k: smooth surface is linear▸ edf> k: smooth surface deviates from linear (and requiresadditional variables)

▸ Our example: edf for income is ≈ 2, indicating that we shouldconsider squared term

▸ F-value / p-value: probability that variable would have at leastthis extreme an e�ect under null hypothesis that there is norelationship

▸ Generalized cross-validation (GCV) score = predictive(out-of-sample) performance of smooth regression surface

Page 79: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Diagnosing Non-Linearity: GAM Plots

▸ Equilalent degrees of freedom (edf): how many variables areneeded to de�ne smooth regression surface

▸ edf= k: smooth surface is linear

▸ edf> k: smooth surface deviates from linear (and requiresadditional variables)

▸ Our example: edf for income is ≈ 2, indicating that we shouldconsider squared term

▸ F-value / p-value: probability that variable would have at leastthis extreme an e�ect under null hypothesis that there is norelationship

▸ Generalized cross-validation (GCV) score = predictive(out-of-sample) performance of smooth regression surface

Page 80: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Diagnosing Non-Linearity: GAM Plots

▸ Equilalent degrees of freedom (edf): how many variables areneeded to de�ne smooth regression surface

▸ edf= k: smooth surface is linear▸ edf> k: smooth surface deviates from linear (and requiresadditional variables)

▸ Our example: edf for income is ≈ 2, indicating that we shouldconsider squared term

▸ F-value / p-value: probability that variable would have at leastthis extreme an e�ect under null hypothesis that there is norelationship

▸ Generalized cross-validation (GCV) score = predictive(out-of-sample) performance of smooth regression surface

Page 81: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Diagnosing Non-Linearity: GAM Plots

▸ Equilalent degrees of freedom (edf): how many variables areneeded to de�ne smooth regression surface

▸ edf= k: smooth surface is linear▸ edf> k: smooth surface deviates from linear (and requiresadditional variables)

▸ Our example: edf for income is ≈ 2, indicating that we shouldconsider squared term

▸ F-value / p-value: probability that variable would have at leastthis extreme an e�ect under null hypothesis that there is norelationship

▸ Generalized cross-validation (GCV) score = predictive(out-of-sample) performance of smooth regression surface

Page 82: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Diagnosing Non-Linearity: GAM Plots

▸ Equilalent degrees of freedom (edf): how many variables areneeded to de�ne smooth regression surface

▸ edf= k: smooth surface is linear▸ edf> k: smooth surface deviates from linear (and requiresadditional variables)

▸ Our example: edf for income is ≈ 2, indicating that we shouldconsider squared term

▸ F-value / p-value: probability that variable would have at leastthis extreme an e�ect under null hypothesis that there is norelationship

▸ Generalized cross-validation (GCV) score = predictive(out-of-sample) performance of smooth regression surface

Page 83: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Diagnosing Non-Linearity: GAM Plots

▸ Equilalent degrees of freedom (edf): how many variables areneeded to de�ne smooth regression surface

▸ edf= k: smooth surface is linear▸ edf> k: smooth surface deviates from linear (and requiresadditional variables)

▸ Our example: edf for income is ≈ 2, indicating that we shouldconsider squared term

▸ F-value / p-value: probability that variable would have at leastthis extreme an e�ect under null hypothesis that there is norelationship

▸ Generalized cross-validation (GCV) score = predictive(out-of-sample) performance of smooth regression surface

Page 84: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Diagnosing Non-Linearity: GAM Plots

We can also examine partial relationships between explanatoryvariables and the outcome graphically!

2 4 6 8 10

-200

0200

400

600

income

s(income,1.91)

Page 85: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Diagnosing Non-Linearity: GAM Plots

We can also examine partial relationships between explanatoryvariables and the outcome graphically!

2 4 6 8 10

-200

0200

400

600

income

s(income,1.91)

Page 86: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Diagnosing Non-Linearity: GAM Plots

We can also examine partial relationships between explanatoryvariables and the outcome graphically!

20 25 30 35 40 45 50 55

-200

0200

400

600

age

s(age,1)

Page 87: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Fixing Non-Linearity

▸ Transform outcome variable (i.e. using log or square root)

▸ Transform explanatory variables (i.e. using log or square root) oradd higher order terms

▸ Use semi-parametric or nonparametric models (i.e. GAMs)

Page 88: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Fixing Non-Linearity

▸ Transform outcome variable (i.e. using log or square root)▸ Transform explanatory variables (i.e. using log or square root) oradd higher order terms

▸ Use semi-parametric or nonparametric models (i.e. GAMs)

Page 89: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Fixing Non-Linearity

▸ Transform outcome variable (i.e. using log or square root)▸ Transform explanatory variables (i.e. using log or square root) oradd higher order terms

▸ Use semi-parametric or nonparametric models (i.e. GAMs)

Page 90: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Outline

Administrative Details

Stocktake

Regression Assumptions

Missing Data

Page 91: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

MissingnessMechanism

How was missingness generated?

We can characterize the missingness mechanism as:

▸ Missing completely at random (MCAR): missingness unrelatedto variables in data

▸ Missing at random (MAR): missingness related to observed data▸ Not missing at random (NMAR): missingness related tounobserved data

Page 92: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

MissingnessMechanism

How was missingness generated?We can characterize the missingness mechanism as:

▸ Missing completely at random (MCAR): missingness unrelatedto variables in data

▸ Missing at random (MAR): missingness related to observed data▸ Not missing at random (NMAR): missingness related tounobserved data

Page 93: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Missingness in Our Data

Let’s work with ccarddata_missing.csv, which now has missingvalues of credit card expenditures (outcome) for some observations.

How can we characterize this missingness?

Xmissing Xnon-missing Xmissing − Xnon-missing t-statage 33.68 29.13 4.54 2.80

income 3.62 3.27 0.34 0.85homeowner 0.35 0.39 -0.04 -0.36

Page 94: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Missingness in Our Data

Let’s work with ccarddata_missing.csv, which now has missingvalues of credit card expenditures (outcome) for some observations.

How can we characterize this missingness?

Xmissing Xnon-missing Xmissing − Xnon-missing t-statage 33.68 29.13 4.54 2.80

income 3.62 3.27 0.34 0.85homeowner 0.35 0.39 -0.04 -0.36

Page 95: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Missingness in Our Data

Let’s work with ccarddata_missing.csv, which now has missingvalues of credit card expenditures (outcome) for some observations.

How can we characterize this missingness?

Xmissing Xnon-missing Xmissing − Xnon-missing t-statage 33.68 29.13 4.54 2.80

income 3.62 3.27 0.34 0.85homeowner 0.35 0.39 -0.04 -0.36

Page 96: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Missingness in Our Data

We can also look at missingness mechanism graphically:### create indicator for missingness

cc.missing$missing <- 0

cc.missing[is.na(cc.missing$ccexpend) ,]$missing <-1

### create vector of t-stats and plot

t.stats <- c(t.test(cc.missing[cc.missing$missing ==1,"age"],cc.missing[cc.missing$missing ==0,"age"])$statistic

,t.test(cc.missing[cc.missing$missing ==1,"income"],cc.missing[cc.missing$missing ==0,"income"])$statistic

,t.test(cc.missing[cc.missing$missing ==1,"homeowner"],cc.missing[cc.missing$missing ==0,"homeowner"])$statistic)

dotchart(t.stats , labels=c("age","income","homeowner")

, xlim=c(-3,3), xlab="Standardized Diff. in Means"

,pch =19)

abline(v=0, col="red", lty =2)

Page 97: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Missingness in Our Data

age

income

homeowner

-3 -2 -1 0 1 2 3

Standardized Diff. in Means

Page 98: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Missingness in Our Data

�e �rst 6 observations in our dataset are:

ccexpend age income homeowner

124.98 38 4.52 133 2.42 0

15.00 34 4.50 131 2.54 0

546.50 32 9.79 192.00 23 2.50 0

Missing values shaded in red.

Page 99: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Missingness in Our Data

�e �rst 6 observations in our dataset are:

ccexpend age income homeowner

124.98 38 4.52 133 2.42 0

15.00 34 4.50 131 2.54 0

546.50 32 9.79 192.00 23 2.50 0

Missing values shaded in red.

Page 100: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Dealing withMissingness

A few common ways to deal with missingness:

▸ Complete case analysis▸ Mean imputation▸ Regression imputation▸ Multiple imputation

Page 101: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Dealing withMissingness

A few common ways to deal with missingness:

▸ Complete case analysis▸ Mean imputation▸ Regression imputation▸ Multiple imputation

Page 102: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Complete Case Analysis

ccexpend age income homeowner

124.98 38 4.52 133 2.42 0

15.00 34 4.50 131 2.54 0

546.50 32 9.79 192.00 23 2.50 0

In R:

# R automatically row -deletes observations with

missing data

lm(ccexpend ~ income + homeowner + age , data=cc.

missing)

Page 103: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Complete Case Analysis

ccexpend age income homeowner

124.98 38 4.52 133 2.42 0

15.00 34 4.50 131 2.54 0

546.50 32 9.79 192.00 23 2.50 0

In R:

# R automatically row -deletes observations with

missing data

lm(ccexpend ~ income + homeowner + age , data=cc.

missing)

Page 104: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Mean Imputation

ccexpend age income homeowner

124.98 38 4.52 1y 33 2.42 0

15.00 34 4.50 1y 31 2.54 0

546.50 32 9.79 192.00 23 2.50 0

In R, mean of outcome is:

mean(cc.missing$ccexpend , na.rm=TRUE)

# mean is 209.4542

Page 105: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Mean Imputation

ccexpend age income homeowner

124.98 38 4.52 1y 33 2.42 0

15.00 34 4.50 1y 31 2.54 0

546.50 32 9.79 192.00 23 2.50 0

In R, mean of outcome is:

mean(cc.missing$ccexpend , na.rm=TRUE)

# mean is 209.4542

Page 106: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Mean Imputation

ccexpend age income homeowner

124.98 38 4.52 1209.45 33 2.42 015.00 34 4.50 1209.45 31 2.54 0546.50 32 9.79 192.00 23 2.50 0

Page 107: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Regression Imputation

ccexpend age income homeowner

124.98 38 4.52 1y2 33 2.42 0

15.00 34 4.50 1y4 31 2.54 0

546.50 32 9.79 192.00 23 2.50 0

In R, we can predict missing values:lm.cc.missing <- lm(ccexpend ~ income + homeowner +

age , data=cc.missing)

missing.df<- cc.missing[is.na(cc.missing$ccexpend),c("income","homeowner","age")]

predict(lm.cc.missing ,missing.df)

Page 108: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Regression Imputation

ccexpend age income homeowner

124.98 38 4.52 1y2 33 2.42 0

15.00 34 4.50 1y4 31 2.54 0

546.50 32 9.79 192.00 23 2.50 0

In R, we can predict missing values:lm.cc.missing <- lm(ccexpend ~ income + homeowner +

age , data=cc.missing)

missing.df<- cc.missing[is.na(cc.missing$ccexpend),c("income","homeowner","age")]

predict(lm.cc.missing ,missing.df)

Page 109: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Regression Imputation

ccexpend age income homeowner

124.98 38 4.52 194.41 33 2.42 015.00 34 4.50 1109.15 31 2.54 0546.50 32 9.79 192.00 23 2.50 0

Page 110: GOV óþþþ Section Ÿ: Diagnosing and Fixing Regression …projects.iq.harvard.edu/files/gov2000/files/section8-r.pdfDiagnosing and Fixing Regression Problems Konstantin KashinÕ

Administrative Details Stocktake Regression Assumptions Missing Data

Multiple Imputation

What limitations of previous imputation methods does multipleimputation address?