Gov 2000 - 10. Troubleshooting the Linear Model

Matthew Blackwell, November 24, 2015
1. Nonnormality of the errors
2. Nonlinearity of the regression function
3. Outliers, leverage points, and influential observations
Where are we? Where are we going?

• Last few weeks: estimation and inference for the linear model under the Gauss-Markov assumptions (and sometimes conditional Normality)
• This week: what happens when the assumptions fail? Can wetell? Can we fix it?
• Next weeks: more of the same
Review of the OLS assumptions
1. Linearity: 𝐲 = 𝐗𝛽 + 𝐮
2. Random/iid sample: (𝑦𝑖, 𝐱′𝑖) are an iid sample from the population.
3. No perfect collinearity: 𝐗 is an 𝑛 × (𝑘 + 1) matrix with rank 𝑘 + 1
4. Zero conditional mean: 𝔼[𝐮|𝐗] = 𝟎
5. Homoskedasticity: var(𝐮|𝐗) = 𝜎²𝑢𝐈𝑛
6. Normality: 𝐮|𝐗 ∼ 𝑁(𝟎, 𝜎²𝑢𝐈𝑛)

• 1-4 give us unbiasedness/consistency
• 1-5 are the Gauss-Markov assumptions and allow for large-sample inference
• 1-6 allow for small-sample inference
Violations of the assumptions

1. Nonlinearity
  ▶ Result: biased/inconsistent estimates
  ▶ Diagnose: scatterplots, added variable plots, component-plus-residual plots
  ▶ Correct: transformations, polynomials, splines
2. iid/random sample
  ▶ Result: no bias with appropriate alternative assumptions (structured dependence)
  ▶ Result (ii): violations imply heteroskedasticity
  ▶ Result (iii): outliers from different distributions can cause inefficiency/bias
  ▶ Diagnose/Correct: next week!
3. Perfect collinearity
  ▶ Result: can't run OLS
  ▶ Diagnose/correct: drop one collinear term
Violations of the assumptions (ii)
4. Zero conditional mean error
  ▶ Result: biased/inconsistent estimates
  ▶ Diagnose: very difficult
  ▶ Correct: instrumental variables, fixed effects, regression discontinuity (Gov 2002)
5. Heteroskedasticity
  ▶ Result: SEs are biased (usually downward)
  ▶ Diagnose/correct: next week!
6. Non-Normality
  ▶ Result: critical values for 𝑡 and 𝐹 tests wrong
  ▶ Diagnose: checking the (studentized) residuals, QQ-plots, etc.
  ▶ Correct: transformations, add variables to 𝐗, different model
Example: Buchanan votes in Florida, 2000
• 2000 Presidential election in FL (Wand et al., 2001, APSR)
Example: Buchanan votes in Florida, 2000

[Figure: scatterplot of Total Votes (x-axis, 0 to 600,000) against Buchanan Votes (y-axis, 0 to 3,500), one point per Florida county]
Example: Buchanan votes in Florida, 2000

[Figure: the same scatterplot of Total Votes against Buchanan Votes, now with each county labeled (Duval, Lee, Broward, Miami-Dade, ..., Palm Beach)]
1/ Nonnormality of the errors
Review of the Normality assumption

• In matrix notation:

  𝐮|𝐗 ∼ 𝒩(𝟎, 𝜎²𝑢𝐈)

• Equivalent to: 𝑢𝑖|𝐱′𝑖 ∼ 𝒩(0, 𝜎²𝑢)
• Fix 𝐱′𝑖 and the distribution of the errors should be Normal
Consequences of non-Normal errors?

• In small samples:
  ▶ The sampling distribution of 𝜷̂ will not be Normal
  ▶ Test statistics will not have 𝑡 or 𝐹 distributions
  ▶ The probability of a Type I error will not be 𝛼
  ▶ A 1 − 𝛼 confidence interval will not have 1 − 𝛼 coverage
• In large samples:
  ▶ The sampling distribution of 𝜷̂ ≈ Normal by the CLT
  ▶ Test statistics will be ≈ 𝑡 or 𝐹 by the CLT
  ▶ The probability of a Type I error ≈ 𝛼
  ▶ A 1 − 𝛼 confidence interval will have ≈ 1 − 𝛼 coverage
• The 𝑛 needed for the approximation to hold depends on how non-Normal the data are
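The large-sample claims above are easy to check by simulation. A minimal sketch in Python (the slides use R; everything here is illustrative): in an intercept-only regression the OLS estimate is just the sample mean, and with heavily skewed errors its sampling distribution loses its skew as 𝑛 grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def intercept_draws(n, sims=2000):
    """Sampling distribution of the OLS estimate in an intercept-only
    regression (i.e., the sample mean) under skewed, mean-zero errors."""
    u = rng.exponential(1.0, size=(sims, n)) - 1.0  # skewed errors, E[u] = 0
    y = 1.0 + u
    return y.mean(axis=1)  # one OLS estimate per simulated sample

def skew(z):
    z = (z - z.mean()) / z.std()
    return float(np.mean(z ** 3))

small = intercept_draws(n=10)   # visibly skewed sampling distribution
large = intercept_draws(n=500)  # approximately Normal by the CLT
```

With 𝑛 = 10 the simulated estimates inherit the skew of the errors; with 𝑛 = 500 the skew is close to zero, matching the CLT story on the slide.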
Marginal versus conditional
• Be careful with this assumption: it concerns the distribution of the error, not the distribution of 𝑦𝑖
• The marginal distribution of 𝑦𝑖 can be non-Normal even if theconditional distribution is Normal!
Is this a violation?

• For example, this looks bad:

```r
x <- rbinom(100, 1, 0.5)
y <- 10 * x + rnorm(100, 0, 1)
plot(density(y), lwd = 3, col = "indianred", las = 1, xlab = "x", main = "",
     bty = "n", ylab = "")
```

[Figure: density of y, bimodal with modes near 0 and 10]
Is this a violation?

• But if we look at the conditional distributions, things look better:

[Figure: two density plots, one for x = 0 (centered near 0) and one for x = 1 (centered near 10), each roughly Normal]
How to diagnose?
• The assumption is about the unobserved errors, 𝐮 = 𝐲 − 𝐗𝜷
• We can only observe the residuals, 𝐮̂ = 𝐲 − 𝐗𝜷̂
• If the distribution of the residuals ≈ the distribution of the errors, we could check the residuals
• But this is actually not true: the distribution of the residuals is complicated
Hat matrix
• First we need to define an important matrix:

  𝐇 = 𝐗(𝐗′𝐗)⁻¹𝐗′

• The residuals can be written in terms of 𝐇:

  𝐮̂ = 𝐲 − 𝐗𝜷̂ = 𝐲 − 𝐗(𝐗′𝐗)⁻¹𝐗′𝐲 = 𝐲 − 𝐇𝐲 = (𝐈 − 𝐇)𝐲

• 𝐇 is called the hat matrix because it puts the "hat" on 𝐲:

  𝐲̂ = 𝐇𝐲

  ▶ 𝐇 is an 𝑛 × 𝑛 symmetric matrix
  ▶ 𝐇 is idempotent: 𝐇𝐇 = 𝐇
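These two properties are straightforward to verify numerically. A quick sketch in Python (illustrative; the slides use R) with a small simulated design matrix:

```python
import numpy as np

rng = np.random.default_rng(1)

n, k = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # n x (k+1) design
H = X @ np.linalg.inv(X.T @ X) @ X.T                        # hat matrix

symmetric = bool(np.allclose(H, H.T))     # H = H'
idempotent = bool(np.allclose(H @ H, H))  # HH = H

y = rng.normal(size=n)
y_hat = H @ y                # H puts the "hat" on y
resid = (np.eye(n) - H) @ y  # residuals via (I - H)y
```

As a bonus, the trace of 𝐇 equals 𝑘 + 1 (the number of coefficients), and the residuals are orthogonal to the columns of 𝐗.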
Relating the residuals to the errors
• It is possible to show that the residuals 𝐮̂ are a linear function of the errors, 𝐮:

  𝐮̂ = (𝐈 − 𝐇)𝐮

• For instance,

  𝑢̂₁ = (1 − ℎ₁₁)𝑢₁ − ∑ⁿᵢ₌₂ ℎ₁ᵢ𝑢ᵢ

• Note that each residual is a function of all of the errors
Distribution of the residuals
  𝔼[𝐮̂|𝐗] = (𝐈 − 𝐇)𝔼[𝐮|𝐗] = 𝟎
  Var[𝐮̂|𝐗] = 𝜎²𝑢(𝐈 − 𝐇)

• Variance of the 𝑖th residual:

  Var[𝑢̂ᵢ] = 𝜎²𝑢(1 − ℎᵢᵢ)

• The residuals are not independent:
  ▶ We know that they must sum to 0: ∑ᵢ 𝑢̂ᵢ = 0, so if I know 𝑛 − 1 of them, I know the last one too.
• The residuals are neither independent nor identically distributed, even when all the OLS assumptions hold
• So we can't use them yet for checking Normality
Standardized residuals
• Problem: each residual has a different variance:

  Var[𝑢̂ᵢ] = 𝜎²𝑢(1 − ℎᵢᵢ)

• Possible solution: calculate standardized residuals by dividing each residual by its estimated standard deviation:

  𝑢′ᵢ = 𝑢̂ᵢ / (𝜎̂𝑢 √(1 − ℎᵢᵢ))

• Remember that 𝐇 = 𝐗(𝐗′𝐗)⁻¹𝐗′, which is observed
• Problem: 𝜎̂²𝑢 depends on the estimated residuals 𝑢̂ᵢ
• Numerator and denominator are not independent ⇝ unknown distribution
Studentized residuals
• Solution: estimate the residual variance without residual 𝑖:

  𝜎̂²₋ᵢ = (𝐮̂′𝐮̂ − 𝑢̂ᵢ²/(1 − ℎᵢᵢ)) / (𝑛 − 𝑘 − 2)

• Use this 𝑖-free estimate to standardize, which creates the studentized residuals:

  𝑢*ᵢ = 𝑢̂ᵢ / (𝜎̂₋ᵢ √(1 − ℎᵢᵢ))

• If the errors are Normal, the studentized residuals follow a 𝑡 distribution with (𝑛 − 𝑘 − 2) degrees of freedom.
• Deviations from the 𝑡 distribution ⟹ a violation of Normality
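The leave-one-out shortcut for 𝜎̂²₋ᵢ can be checked against an explicit refit that drops observation 𝑖. A sketch in Python (illustrative; the slides use R) on simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)

n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
u_hat = y - H @ y
ssr = u_hat @ u_hat  # u-hat' u-hat

# i-free variance estimates via the shortcut on the slide
sigma2_loo = (ssr - u_hat ** 2 / (1 - h)) / (n - k - 2)
t_star = u_hat / np.sqrt(sigma2_loo * (1 - h))  # studentized residuals

# explicit check: refit without observation 0 and compare variances
mask = np.arange(n) != 0
b0 = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
ssr0 = float(np.sum((y[mask] - X[mask] @ b0) ** 2))
```

The refit uses 𝑛 − 1 observations and 𝑘 + 1 coefficients, so its residual variance has 𝑛 − 𝑘 − 2 degrees of freedom, exactly the denominator in the shortcut.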
Buchanan vote example
```r
mod <- lm(edaybuchanan ~ edaytotal, data = flvote)
summary(mod)
```

```
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.22945   49.14146    1.10     0.27
## edaytotal    0.00232    0.00031    7.48  2.4e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 333 on 65 degrees of freedom
## Multiple R-squared: 0.463, Adjusted R-squared: 0.455
## F-statistic: 56 on 1 and 65 DF, p-value: 2.42e-10
```
Buchanan residuals
```r
resids <- residuals(mod)
stand.resids <- rstandard(mod)
student.resids <- rstudent(mod)
head(cbind(resids, stand.resids, student.resids))
```

```
##    resids stand.resids student.resids
## 1  -16.94     -0.05201       -0.05161
## 2 -177.51     -0.53971       -0.53675
## 3 -595.20     -2.02639       -2.07743
## 4  -86.28     -0.26135       -0.25947
## 5 -146.31     -0.44306       -0.44030
## 6  -36.07     -0.10957       -0.10873
```

```r
dotchart(student.resids, flvote$county, xlab = "Residuals")
```
Plotting the residuals

[Figure: dot chart of the studentized residuals by county (x-axis 0 to 20); one county sits far to the right of all the others]
Plotting the residuals

[Figure: three histograms, of resids (x-axis -1000 to 2500), stand.resids (-4 to 8), and student.resids (-5 to 25), each concentrated near zero with a long right tail from a single extreme value]
Plotting the residuals

[Figure: density of the studentized residuals (0 to 20) overlaid with the theoretical t distribution; the observed density has far more mass in the right tail]
Quantile-Quantile plots
• A quantile-quantile plot, or QQ-plot, is useful for comparing distributions
• It plots the quantiles of one distribution against those of another distribution
• For example, one point is (𝑚𝑥, 𝑚𝑦), where 𝑚𝑥 is the median of the 𝑥 distribution and 𝑚𝑦 is the median of the 𝑦 distribution
• If the distributions are equal ⟹ the points fall on the 45-degree line
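The underlying computation is just a grid of paired quantiles. A sketch in Python (illustrative; the slides use R): the empirical quantiles of a large Normal sample sit very close to the theoretical Normal quantiles, i.e., on the 45-degree line.

```python
import numpy as np
from statistics import NormalDist  # standard-library inverse Normal CDF

rng = np.random.default_rng(3)
x = rng.normal(size=100_000)  # a sample that really is Normal

probs = np.linspace(0.05, 0.95, 19)
sample_q = np.quantile(x, probs)                               # empirical quantiles
theory_q = np.array([NormalDist().inv_cdf(p) for p in probs])  # theoretical quantiles

# maximum distance of the QQ points from the 45-degree line
max_gap = float(np.max(np.abs(sample_q - theory_q)))
```

If `x` instead came from a skewed or heavy-tailed distribution, the paired quantiles would bend away from the line, which is exactly what the QQ-plot reveals.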
Good QQ-plot

```r
library(car)
x <- rt(500, df = 50)
qqPlot(x, distribution = "t", df = 50, pch = 19, cex = 1, col = "grey50",
       col.lines = "indianred", las = 1, envelope = TRUE)
```

[Figure: QQ-plot of x against t quantiles; the points track the 45-degree line within the confidence envelope]
Buchanan QQ-plot

```r
qqPlot(student.resids, distribution = "t", df = nrow(flvote) - 1 - 2,
       envelope = TRUE, pch = 19, cex = 1, col = "grey50",
       col.lines = "indianred", las = 1)
```

[Figure: QQ-plot of student.resids against t quantiles (y-axis 0 to 20); the largest residuals lie far above the 45-degree line]
Dealing with non-Normal errors
• Remove problematic observations (be transparent!)
• Add or drop variables in 𝐗
• Transform 𝐲 (e.g., log(𝐲))
Buchanan revisited

```r
flvote.nopb <- flvote[flvote$county != "Palm Beach", ]
mod.nopb <- lm(log(edaybuchanan) ~ log(edaytotal), data = flvote.nopb)
resids.nopb <- residuals(mod.nopb)
stand.resids.nopb <- rstandard(mod.nopb)
student.resids.nopb <- rstudent(mod.nopb)
summary(mod.nopb)
```

```
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)     -2.4860     0.3789   -6.56  1.1e-08 ***
## log(edaytotal)   0.7031     0.0362   19.42  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.436 on 64 degrees of freedom
## Multiple R-squared: 0.855, Adjusted R-squared: 0.853
## F-statistic: 377 on 1 and 64 DF, p-value: <2e-16
```
Buchanan revisited

[Figure: three histograms of the new residuals, Residuals (-1 to 1), Standardized Residuals (-2 to 2), and Studentized Residuals (-3 to 2), now roughly symmetric around zero]
Buchanan revisited

[Figure: left, density of the studentized residuals (-3 to 3) overlaid with the theoretical t distribution, now closely matching; right, QQ-plot of student.resids.nopb against t quantiles with the points near the 45-degree line]
2/ Nonlinearity of the regression function
Buchanan model, part 2
```r
mod3 <- lm(edaybuchanan ~ edaytotal + absnbuchanan, data = flvote)
summary(mod3)
```

```
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -29.34807   55.19635   -0.53   0.5969
## edaytotal      0.00110    0.00048    2.29   0.0253 *
## absnbuchanan   6.89546    2.12942    3.24   0.0019 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 317 on 61 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared: 0.536, Adjusted R-squared: 0.521
## F-statistic: 35.2 on 2 and 61 DF, p-value: 6.71e-11
```
Added variable plot
• We need a way to visualize the conditional relationship between 𝑌 and 𝑋𝑗
• How to construct an added variable plot:
  1. Get the residuals from a regression of 𝑌 on all covariates except 𝑋𝑗
  2. Get the residuals from a regression of 𝑋𝑗 on all other covariates
  3. Plot the residuals from (1) against the residuals from (2)
• In R: avPlots(model) from the car package
• An OLS fit to this plot will have slope exactly 𝛽̂𝑗 and a 0 intercept
• Use a local smoother (loess) to detect any non-linearity
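The claim that an OLS fit to the AV plot recovers 𝛽̂𝑗 exactly, with a zero intercept, is the Frisch-Waugh-Lovell theorem. A numerical check in Python with simulated data (illustrative; the slides use R):

```python
import numpy as np

rng = np.random.default_rng(4)

n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(size=n)

# full regression of y on an intercept, x1, and x2
X_full = np.column_stack([np.ones(n), x1, x2])
beta_full = np.linalg.lstsq(X_full, y, rcond=None)[0]

# AV plot for x1: residualize y and x1 on the remaining covariates
Z = np.column_stack([np.ones(n), x2])
ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
rx = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]

# OLS fit to the AV plot: slope equals the full-regression coefficient
av_fit = np.linalg.lstsq(np.column_stack([np.ones(n), rx]), ry, rcond=None)[0]
```

The slope of the residual-on-residual regression matches `beta_full[1]` to machine precision, and the intercept is zero because both sets of residuals are orthogonal to the constant.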
Buchanan AV plot

```r
par(mfrow = c(1, 2))
out <- avPlots(mod3, "edaytotal")
lines(loess.smooth(x = out$edaytotal[, 1], y = out$edaytotal[, 2]),
      col = "dodgerblue", lwd = 2)
out2 <- avPlots(mod3, "absnbuchanan")
lines(loess.smooth(x = out2$absnbuchanan[, 1], y = out2$absnbuchanan[, 2]),
      col = "dodgerblue", lwd = 2)
```

[Figure: two added variable plots with loess smoothers, edaybuchanan | others against edaytotal | others and against absnbuchanan | others]
How to deal with non-linearity
• Breaking up categorical variables into dummy variables
• Including interaction terms
• Including polynomial terms
• Using transformations
• Using more flexible models:
  ▶ Generalized additive models and splines allow the data to tell us what the functional form is.
  ▶ Complicated math, but important ideas.
Basis functions
• Basis functions are the functions of 𝑋𝑖 that we include in the model:
  ▶ Examples we've seen: ℎ𝑚(𝑋𝑖) = 𝑋𝑖, ℎ𝑚(𝑋𝑖) = 𝑋𝑖², ℎ𝑚(𝑋𝑖) = log(𝑋𝑖)
• Different basis functions allow for different forms of non-linearity
• We could always break 𝑋𝑖 into bins and estimate a piecewise constant function:

  ℎ₁ = 𝟙(𝑋𝑖 < 𝑏₁), ℎ₂ = 𝟙(𝑏₁ < 𝑋𝑖 < 𝑏₂), ℎ₃ = 𝟙(𝑋𝑖 > 𝑏₂)

• 𝑏₁ < 𝑏₂ are the knots
Piecewise constant

[Figure: scatterplot of y against x (-4 to 4) with a piecewise constant fit, one horizontal segment per bin]
Piecewise linear

[Figure: scatterplot of y against x with a separate linear fit in each bin; the segments are discontinuous at the bin boundaries]
Continuous piecewise linear
• Problem: piecewise functions are discontinuous.
• We can use clever basis functions to get a continuous piecewise linear function of 𝑋𝑖:

  ℎ₁(𝑋𝑖) = 1,  ℎ₂(𝑋𝑖) = 𝑋𝑖,  ℎ₃(𝑋𝑖) = (𝑋𝑖 − 𝑏₁)₊,  ℎ₄(𝑋𝑖) = (𝑋𝑖 − 𝑏₂)₊

• Here (𝑋𝑖 − 𝑏₁)₊ = 𝑋𝑖 − 𝑏₁ when 𝑋𝑖 > 𝑏₁, and 0 otherwise
• Regression with these basis functions:

  𝑌𝑖 = 𝛽₁ℎ₁(𝑋𝑖) + 𝛽₂ℎ₂(𝑋𝑖) + 𝛽₃ℎ₃(𝑋𝑖) + 𝛽₄ℎ₄(𝑋𝑖) + 𝜀𝑖

  ▶ 𝛽₂ = slope when 𝑋𝑖 < 𝑏₁
  ▶ 𝛽₂ + 𝛽₃ = slope when 𝑏₁ < 𝑋𝑖 < 𝑏₂
  ▶ 𝛽₂ + 𝛽₃ + 𝛽₄ = slope when 𝑋𝑖 > 𝑏₂
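The slope arithmetic above can be verified directly. A sketch in Python (illustrative; the slides use R): fit the four basis functions to a noiseless continuous piecewise linear function and check that the per-segment slopes add up as stated. The knots and coefficients are assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(5)

b1, b2 = -1.5, 1.5                # knots
x = rng.uniform(-4, 4, size=500)

# a noiseless continuous piecewise linear "truth" built from the same basis
y = 0.5 + 1.0 * x + 2.0 * np.clip(x - b1, 0, None) - 2.5 * np.clip(x - b2, 0, None)

B = np.column_stack([
    np.ones_like(x),           # h1(x) = 1
    x,                         # h2(x) = x
    np.clip(x - b1, 0, None),  # h3(x) = (x - b1)+
    np.clip(x - b2, 0, None),  # h4(x) = (x - b2)+
])
beta = np.linalg.lstsq(B, y, rcond=None)[0]

# per-segment slopes: beta2, beta2 + beta3, beta2 + beta3 + beta4
slopes = (beta[1], beta[1] + beta[2], beta[1] + beta[2] + beta[3])
```

OLS recovers the coefficients exactly (the model is noiseless), giving slopes of 1.0 below the first knot, 3.0 between the knots, and 0.5 above the second knot.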
Continuous piecewise linear

```r
h2 <- x
h3 <- 1 * (x > -1.5) * (x - -1.5)
h4 <- 1 * (x > 1.5) * (x - 1.5)
reg <- lm(y ~ h2 + h3 + h4)
```

[Figure: scatterplot of y against x with the fitted continuous piecewise linear function; the fit has kinks at the knots -1.5 and 1.5]
Cubic splines
• The continuous piecewise linear fit has "kinks" at the knots, but we probably want "smooth" functions.
  ▶ What does smooth mean? Continuous derivatives!
  ▶ ⇝ use higher-order polynomials in the basis functions
• Cubic spline basis: bases that produce continuous functions with continuous first and second derivatives:

  ℎ₁(𝑋𝑖) = 1,  ℎ₂(𝑋𝑖) = 𝑋𝑖,  ℎ₃(𝑋𝑖) = 𝑋𝑖²,
  ℎ₄(𝑋𝑖) = 𝑋𝑖³,  ℎ₅(𝑋𝑖) = (𝑋𝑖 − 𝑏₁)₊³,  ℎ₆(𝑋𝑖) = (𝑋𝑖 − 𝑏₂)₊³

• Basic idea: local polynomial regressions (between knots) that have to connect and be smooth at the knots.
Cubic spline

```r
h2 <- x
h3 <- x^2
h4 <- x^3
h5 <- 1 * (x > -1.5) * (x - -1.5)^3
h6 <- 1 * (x > 1.5) * (x - 1.5)^3
reg <- lm(y ~ h2 + h3 + h4 + h5 + h6)
```

[Figure: scatterplot of y against x with a smooth cubic spline fit]
Knotty problems
• Any function can be approximated arbitrarily well as we increase the number of knot points.
• How should we choose the number and location of the knot points?
  ▶ More knot points ⇝ "rougher" function, less in-sample bias, more variance.
  ▶ Fewer knot points ⇝ "smoother" function, more in-sample bias, less variance.
• In-sample fit might be great while out-of-sample fit is terrible.
Cross-validation
• General strategy for bias-variance trade-offs: cross-validation.
• Set aside units to test out-of-sample prediction.
• Cross-validation procedure:
  1. Choose a number of evenly spread knots, 𝑏.
  2. Withhold unit 𝑖 and estimate the CEF of 𝑦𝑖 given 𝑋𝑖 using a cubic spline with 𝑏 knots.
  3. Get the predicted value for 𝑖 from that fit, 𝑦̂ᵢ(−𝑖), and calculate the squared prediction error (𝑦𝑖 − 𝑦̂ᵢ(−𝑖))².
  4. Repeat 2-3 for each observation and take the average to get the MSE with 𝑏 knots.
  5. Repeat 1-4 for different values of 𝑏 and choose the value of 𝑏 that has the lowest MSE.
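The procedure above can be sketched in a few lines. A leave-one-out implementation in Python (illustrative: the truncated-power cubic basis, the knot counts, and the sine test function are all assumptions for the demo):

```python
import numpy as np

rng = np.random.default_rng(6)

x = np.linspace(-3, 3, 120)
y = np.sin(2.0 * x) + rng.normal(scale=0.25, size=x.size)  # nonlinear CEF + noise

def cubic_spline_basis(x, knots):
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    cols += [np.clip(x - b, 0, None) ** 3 for b in knots]  # (x - b)+^3 terms
    return np.column_stack(cols)

def loocv_mse(x, y, knots):
    """Steps 2-4: withhold each i, refit, predict i, average squared error."""
    errs = []
    for i in range(x.size):
        mask = np.arange(x.size) != i
        beta = np.linalg.lstsq(cubic_spline_basis(x[mask], knots), y[mask],
                               rcond=None)[0]
        pred = cubic_spline_basis(x[i:i + 1], knots) @ beta
        errs.append((y[i] - pred[0]) ** 2)
    return float(np.mean(errs))

# steps 1 and 5: evenly spread interior knots, pick b with the lowest MSE
candidates = {b: np.linspace(-3, 3, b + 2)[1:-1] for b in (0, 2, 4)}
mses = {b: loocv_mse(x, y, knots) for b, knots in candidates.items()}
best_b = min(mses, key=mses.get)
```

With a genuinely nonlinear CEF, the knotless cubic has a much larger leave-one-out MSE than the spline fits, so cross-validation selects a positive number of knots.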
Automatic knot selection

```r
smth <- smooth.spline(x, y)
plot(x, y, ylim = c(-3, 3), pch = 19, col = "grey50", bty = "n")
lines(smth, col = "indianred", lwd = 2)
```

[Figure: scatterplot of y against x with the smooth.spline fit drawn through the data]
Generalized additive models
• Generalized additive models (GAMs) allow you to estimate a spline for any particular variable in the regression.
  ▶ The splines enter additively: 𝑦𝑖 = 𝑓₁(𝑋₁ᵢ) + 𝑓₂(𝑋₂ᵢ) + 𝜀𝑖
• We can plot the AV-plot of each spline to get a sense of the nonlinearity of the functional form.
• Use cross-validation to select the number of knots.
GAM example fit

```r
library(mgcv)  ## GAM package
out <- gam(edaybuchanan ~ s(edaytotal) + s(absnbuchanan), data = flvote,
           subset = county != "Palm Beach")
summary(out)
```

```
## Family: gaussian
## Link function: identity
##
## Formula:
## edaybuchanan ~ s(edaytotal) + s(absnbuchanan)
##
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   221.84       6.41    34.6   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
##                  edf Ref.df    F  p-value
## s(edaytotal)    6.85   7.82 10.6  1.6e-09 ***
## s(absnbuchanan) 2.95   3.64 22.6  1.6e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.95   Deviance explained = 95.8%
## GCV = 3129   Scale est. = 2592.3   n = 63
```
Example: generalized additive models
plot(out, shade = TRUE, residual = TRUE, pch = 1)
[Figure: plot(out) output, two panels showing the estimated smooths with shaded confidence bands and partial residuals — s(edaytotal, 6.85) against edaytotal and s(absnbuchanan, 2.95) against absnbuchanan.]
3/ Outliers, leverage points, and influential observations
The trouble with Norway
• Lange and Garrett (1985): organizational and political powerof labor interact to improve economic growth
• Jackman (1987): is the relationship just due to North Sea oil?
• Table guide:
  ▶ 𝑥₁ = organizational power of labor
  ▶ 𝑥₂ = political power of labor
  ▶ Parentheses contain 𝑡-statistics

|                     | Constant | 𝑥₁    | 𝑥₂    | 𝑥₁ ⋅ 𝑥₂ |
|---------------------|----------|-------|-------|---------|
| Norway Obs Included | .814     | -.192 | -.278 | .137    |
|                     | (4.7)    | (2.0) | (2.4) | (2.9)   |
| Norway Obs Excluded | .641     | -.068 | -.138 | .054    |
|                     | (4.8)    | (0.9) | (1.5) | (1.3)   |
Creative curve fitting with Norway
Three types of extreme values
1. Outlier: extreme in the 𝑦 direction
2. Leverage point: extreme in one 𝑥 direction
3. Influence point: extreme in both directions

• Not all of these are problematic
• If the data are truly "contaminated" (come from a different distribution), they can cause inefficiency and possibly bias
• Can be a violation of iid (not identically distributed)
Outlier definition
[Figure: scatterplot with an outlier in the 𝑦 direction, showing regression lines for the full sample and without the outlier.]

• An outlier is a data point with a very large regression error, 𝑢ᵢ
• Very distant from the rest of the data in the 𝑦-dimension
• Increases standard errors (by increasing 𝜎̂²)
• No bias if typical in the 𝑥's
Detecting outliers
• Use studentized residuals, 𝑢ᵢ∗, since outliers can skew the residual variance upward
  ▶ 𝜎̂² ≫ 𝜎̂²₋ᵢ if 𝑖 is an outlier
  ▶ 𝑢ᵢ∗ ∼ 𝑡ₙ₋ₖ₋₂ when 𝑢ᵢ ∼ 𝑁(0, 𝜎²)
• Rule of thumb: |𝑢ᵢ∗| > 2 will be relatively rare
• Extreme outliers, |𝑢ᵢ∗| > 4 or 5, are much less likely
• People usually adjust cutoff for multiple testing
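The leave-one-out logic behind studentized residuals can be sketched directly. This is a minimal Python illustration (not the course's R code; the data are simulated and the function name is made up) using the standard shortcut that avoids refitting 𝑛 times:

```python
import numpy as np

def studentized_residuals(X, y):
    """Externally studentized residuals: each residual is scaled by the
    residual standard deviation estimated *without* that observation."""
    n, p = X.shape                      # p = k + 1 columns, including the intercept
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)                      # leverages (hat values)
    e = y - H @ y                       # OLS residuals
    rss = e @ e
    # Leave-one-out residual variance via the standard identity:
    # RSS_{-i} = RSS - e_i^2 / (1 - h_i)
    s2_loo = (rss - e**2 / (1 - h)) / (n - p - 1)
    return e / np.sqrt(s2_loo * (1 - h))

rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)
y[0] += 10                              # plant one gross outlier in the y direction
X = np.column_stack([np.ones(n), x])

t = studentized_residuals(X, y)
print(t[0])                             # far beyond the |t| > 2 rule of thumb
```

In R the same quantities come from `rstudent(mod)`; the point of scaling by the leave-one-out variance is exactly the 𝜎̂² ≫ 𝜎̂²₋ᵢ problem above: the outlier should not get to inflate the variance used to judge itself.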
Buchanan outliers
[Figure: studentized residuals plotted by observation index; Palm Beach lies far above both the outlier cutoff and the multiple-testing-adjusted cutoff.]
Leverage point definition
[Figure: scatterplot with a leverage point, showing regression lines for the full sample and without the leverage point.]

• Values that are extreme in the 𝑥 direction
• That is, values far from the center of the covariate distribution
• Decrease SEs (more 𝑋 variation)
• No bias if typical in the 𝑦 dimension
Hat values
𝐲̂ = 𝐗𝜷̂ = 𝐗(𝐗′𝐗)⁻¹𝐗′𝐲 = 𝐇𝐲

• For a particular observation 𝑖, we can show this means:

𝑦̂ᵢ = ∑ⱼ₌₁ⁿ ℎᵢⱼ 𝑦ⱼ

• ℎᵢⱼ = how important observation 𝑗 is for the fitted value 𝑦̂ᵢ
• Leverage/hat values: ℎᵢ = ℎᵢᵢ, the diagonal entries of the hat matrix
• With a simple linear regression, we have

ℎᵢ = 1/𝑛 + (𝑋ᵢ − 𝑋̄)² / ∑ⱼ₌₁ⁿ (𝑋ⱼ − 𝑋̄)²

  ▶ ⇝ a measure of how far 𝑖 is from the center of the 𝐗 distribution
• Rule of thumb: examine hat values greater than 2(𝑘 + 1)/𝑛
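The two formulas above can be checked against each other numerically. A minimal Python sketch (simulated data; not the Buchanan data) builds the hat matrix, takes its diagonal, and verifies the simple-regression closed form and the rule-of-thumb cutoff:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
x = rng.normal(size=n)
x[0] = 8.0                              # one point far from the center of the x's
X = np.column_stack([np.ones(n), x])

# Hat matrix H = X (X'X)^{-1} X' and its diagonal, the leverages
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

# Simple-regression closed form: h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2
h_formula = 1 / n + (x - x.mean())**2 / np.sum((x - x.mean())**2)
print(np.max(np.abs(h - h_formula)))    # the two agree up to rounding

k = 1                                   # one regressor besides the intercept
cutoff = 2 * (k + 1) / n                # rule-of-thumb threshold
print(h[0], cutoff)                     # the extreme point exceeds the cutoff
```

A useful side fact: the leverages always sum to 𝑘 + 1 (the trace of 𝐇), so the rule of thumb flags points with more than twice the average leverage.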
Buchanan hats
head(hatvalues(mod), 5)
##       1       2       3       4       5
## 0.04179 0.02285 0.22066 0.01556 0.01493
Buchanan hats
[Figure: dotplot of hat values for each Florida county, x-axis "Hat Values" ranging from 0.05 to 0.25.]
Influence points
[Figure: scatterplot with an influence point, showing regression lines for the full sample and without the influence point.]

• An influence point is one that is both an outlier and a leverage point
• Extreme in both the 𝑥 and 𝑦 dimensions
• Causes the regression line to move toward it (bias?)
Overall measures of influence
• A measure of influence for each observation is Cook's distance:

𝐷ᵢ = 𝑢ᵢ′² / (𝑘 + 1) × ℎᵢ / (1 − ℎᵢ)

• Remember here that 𝑢ᵢ′ is the standardized residual and ℎᵢ is the hat value
• Basically this is "outlier × leverage"
• 𝐷ᵢ > 4/(𝑛 − 𝑘 − 1) considered "large"
• Influence plot:
  ▶ x-axis: hat values, ℎᵢ
  ▶ y-axis: studentized residuals, 𝑢ᵢ∗
  ▶ size of points: Cook's distance
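The "outlier × leverage" product can be sketched directly from the formula. This is a minimal Python illustration (simulated data; the function name is made up), planting one point that is extreme in both dimensions:

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance: (standardized residual)^2 / (k+1) * h / (1 - h)."""
    n, p = X.shape                      # p = k + 1
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)                      # leverages
    e = y - H @ y                       # OLS residuals
    s2 = e @ e / (n - p)                # usual residual variance estimate
    r2 = e**2 / (s2 * (1 - h))         # squared standardized residuals
    return r2 / p * h / (1 - h)

rng = np.random.default_rng(3)
n = 60
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)
x[0], y[0] = 6.0, -10.0                 # extreme in both x and y: an influence point
X = np.column_stack([np.ones(n), x])

D = cooks_distance(X, y)
k = 1
print(D[0], 4 / (n - k - 1))            # the influence point dwarfs the cutoff
```

In R the same vector comes from `cooks.distance(mod)`; a pure outlier at a typical 𝑥 or a leverage point that sits on the line would each score much lower, because 𝐷ᵢ needs both factors to be large.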
Influence Plot Buchanan
[Figure: influence plot — studentized residuals against hat values, with point size proportional to Cook's distance; Palm Beach stands far apart, and Duval, Broward, Pinellas, Miami-Dade, Hamilton, and Liberty are also labeled.]
Influence plot from lm output

plot(mod3, which = 5, labels.id = flvote$county)
[Figure: R's built-in diagnostic plot for lm(edaybuchanan ~ edaytotal + absnbuchanan) — standardized residuals against leverage with Cook's distance contours; Palm Beach, Miami-Dade, and Broward labeled.]
Limitations of the standard tools
[Figure: scatterplot with two influence points (red and blue); lines show the fit dropping each one separately.]

• What happens when there are two influence points?
• Red line drops the red influence point
• Blue line drops the blue influence point
• Neither of the "leave-one-out" approaches helps recover the line
What to do about outliers and influential units?

• Is the data corrupted?
  ▶ Fix the observation (obvious data entry errors)
  ▶ Remove the observation
  ▶ Be transparent either way
• Is the outlier part of the data generating process?
  ▶ Transform the dependent variable (log(𝑦))
  ▶ Use a method that is robust to outliers (robust regression, least absolute deviations)
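The least-absolute-deviations option mentioned above can be sketched with iteratively reweighted least squares, one standard way to compute it (a minimal Python illustration on simulated data with one corrupted, high-leverage observation; all names here are made up):

```python
import numpy as np

def lad_irls(X, y, iters=50, eps=1e-6):
    """Least absolute deviations via iteratively reweighted least squares:
    each observation is weighted by 1/|residual|, so large errors are downweighted."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # start from OLS
    for _ in range(iters):
        w = 1.0 / np.maximum(np.abs(y - X @ beta), eps)
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, WX.T @ y)  # weighted normal equations
    return beta

rng = np.random.default_rng(4)
n = 100
x = rng.normal(size=n)
x[0] = 3.0                                        # give the corrupted point leverage too
y = 1 + 2 * x + rng.normal(0, 0.5, size=n)        # true slope is 2
y[0] += 50                                        # one grossly corrupted observation

X = np.column_stack([np.ones(n), x])
ols = np.linalg.lstsq(X, y, rcond=None)[0]
lad = lad_irls(X, y)
print(ols[1], lad[1])   # the OLS slope is pulled by the bad point; LAD stays near 2
```

Because LAD minimizes the sum of absolute errors rather than squared errors, chasing one huge residual costs more than it gains, so the fitted line largely ignores the corrupted unit.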
Summary
• For nonnormality, influential points, and nonlinearity:
  ▶ Check your data! summary(), plot(), etc.
  ▶ Use transformations to make assumptions more plausible
  ▶ Add covariates to help account for non-identical distributions
• Next week:
  ▶ What if we have heteroskedastic data?