Applied Linear Regression
CSTAT Workshop, March 16, 2007
Vince Melfi
References
• “Applied Linear Regression,” Third Edition by Sanford Weisberg.
• “Linear Models with R,” by Julian Faraway.
• Countless other books on Linear Regression, statistical software, etc.
Statistical Packages
• Minitab (we’ll use this today)
• SPSS
• SAS
• R
• Splus
• JMP
• ETC!!
Outline
I. Simple linear regression review
II. Multiple Regression: Adding predictors
III. Inference in Regression
IV. Regression Diagnostics
V. Model Selection
I. Simple Linear Regression Review
Savings Rate Data
Data on Savings Rate and other variables for 50 countries. Want to explore the effect of variables on savings rate.
• SaveRate: Aggregate Personal Savings divided by disposable personal income. (Response variable.)
• Pop>75: Percent of the population over 75 years old. (One of the predictors.)
[Figure: Scatterplot of SaveRate vs pop>75]
Regression Output
The regression equation is: SaveRate = 7.152 + 1.099 pop>75
S = 4.29409 R-Sq = 10.0% R-Sq(adj) = 8.1%
Analysis of Variance
Source      DF       SS        MS      F      P
Regression   1    98.545   98.5454   5.34  0.025
Error       48   885.083   18.4392
Total       49   983.628
Fitted model
R2 (coeff. of determination)
Testing the model
Importance of Plots
• Four data sets
• All have:
  – Regression line Y = 3 + 0.5x
  – R2 = 66.7%
  – S = 1.24
  – Same t statistics, etc.
• Without looking at plots, the four data sets would seem similar.
Importance of Plots (1)
[Fitted Line Plot: y1 = 3.000 + 0.5001 x1; S = 1.23660, R-Sq = 66.7%, R-Sq(adj) = 62.9%]
Importance of Plots (2)
[Fitted Line Plot: y2 = 3.001 + 0.5000 x1; S = 1.23721, R-Sq = 66.6%, R-Sq(adj) = 62.9%]
Importance of Plots (3)
[Fitted Line Plot: y3 = 3.002 + 0.4997 x1; S = 1.23631, R-Sq = 66.6%, R-Sq(adj) = 62.9%]
Importance of Plots (4)
[Fitted Line Plot: y4 = 3.002 + 0.4999 x2; S = 1.23570, R-Sq = 66.7%, R-Sq(adj) = 63.0%]
The model
• Yi = β0 + β1xi + ei, for i = 1, 2, …, n
• “Errors” e1, e2, …, en are assumed to be independent.
• Usually e1, e2, …, en are assumed to have the same standard deviation, σ.
• Often e1, e2, …, en are assumed to be normally distributed.
Least Squares
• The regression line (line of best fit) is based on “least squares.”
• The regression line is the line that minimizes the sum of the squared deviations from the data.
• The least squares line has certain optimality properties.
• The least squares line is denoted
Yi = β̂0 + β̂1Xi + êi
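The least squares estimates have a closed form, which can be computed directly. A minimal sketch in Python (the data below are made up for illustration; the workshop itself uses Minitab):

```python
# Least-squares estimates for simple linear regression, from the
# closed-form formulas: b1 = Sxy/Sxx, b0 = ybar - b1*xbar.
def least_squares(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx           # slope
    b0 = ybar - b1 * xbar    # intercept
    return b0, b1

# Made-up illustrative data (not the savings-rate data).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]
b0, b1 = least_squares(x, y)
```

Any statistical package computes the same quantities; this just makes the formulas concrete.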
Residuals
• The residuals represent the difference between the data and the least squares line:
êi = Yi - Ŷi
[Figure: scatterplot of Y vs X with the least squares line; the residuals are the vertical distances from the points to the line]
Checking assumptions
• Residuals are the main tool for checking model assumptions, including linearity and constant variance.
• Plotting the residuals versus the fitted values is always a good idea, to check linearity and constant variance.
• Histograms and Q-Q plots (normal probability plots) of residuals can help to check the normality assumption.
[Residual Plots for SaveRate ("four in one" plot from Minitab): Normal Probability Plot, Versus Fits, Histogram, Versus Order]
Coefficient of determination (R2)
Residual sum of squares, aka sum of squares for error:
  RSS = SSE = Σi êi²  (sum over i = 1, …, n)
Total sum of squares:
  TSS = SST = Σi (yi - ȳ)²
Coefficient of determination:
  R² = (TSS - RSS)/TSS
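These definitions translate directly into code. A sketch in Python, using small made-up data rather than the savings data:

```python
# RSS, TSS, and R^2 computed from their definitions, after fitting a
# least-squares line to small made-up data.
def fit(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / \
         sum((a - xbar) ** 2 for a in x)
    return ybar - b1 * xbar, b1

x = [1, 2, 3, 4]
y = [2.0, 2.5, 4.5, 5.0]
b0, b1 = fit(x, y)
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
ybar = sum(y) / len(y)
RSS = sum(e ** 2 for e in resid)           # residual sum of squares (SSE)
TSS = sum((yi - ybar) ** 2 for yi in y)    # total sum of squares (SST)
R2 = (TSS - RSS) / TSS                     # coefficient of determination
```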
R2
• The coefficient of determination, R2, measures the proportion of the variability in Y that is explained by the linear relationship with X.
• It’s also the square of the Pearson correlation coefficient between X and Y.
II. Multiple regression: Adding predictors
Adding a predictor
• Recall: Fitted model was SaveRate = 7.152 + 1.099 pop>75 (p-value for test of whether pop>75 is significant was 0.025.)
• Another predictor: DPI (per-capita income)
• Fitted model: SaveRate = 8.57 + 0.000996 DPI (p-value for DPI: 0.124)
Adding a predictor (2)
• Model with both pop>75 and DPI is SaveRate = 7.06 + 1.30 pop>75 - 0.00034 DPI
• p-values are 0.100 and 0.738 for pop>75 and DPI
• The sign of the coefficient of DPI has changed!
• pop>75 was significant alone, but neither it nor DPI is significant when both are in the model!
Adding a predictor (3)
[Fitted Line Plot: pop>75 = 1.158 + 0.001025 DPI; S = 0.804599, R-Sq = 61.9%, R-Sq(adj) = 61.1%]
•What happened??
•The predictors pop>75 and DPI are highly correlated
Added variable plots and partial correlation
1. Residuals from a fit of SaveRate versus pop>75 give the variability in SaveRate that’s not explained by pop>75.
2. Residuals from a fit of DPI versus pop>75 give the variability in DPI that’s not explained by pop>75.
3. A fit of the residuals from (1) versus the residuals from (2) gives the relationship between SaveRate and DPI after adjusting for pop>75. This is called an “added variable plot.”
4. The correlation between the residuals from (1) and the residuals from (2) is the “partial correlation” between SaveRate and DPI adjusted for pop>75.
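Steps (1) to (4) can be checked numerically: the slope from regressing the residuals in (1) on the residuals in (2) equals the coefficient of the added predictor in the two-predictor fit (the Frisch-Waugh result). A sketch with made-up data, not the savings data:

```python
# Added-variable plot check: the slope of resid(Y ~ X1) regressed on
# resid(X2 ~ X1) equals the coefficient of X2 in the two-predictor model.
def slope_intercept(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / \
         sum((a - xbar) ** 2 for a in x)
    return ybar - b1 * xbar, b1

def residuals(x, y):
    b0, b1 = slope_intercept(x, y)
    return [b - (b0 + b1 * a) for a, b in zip(x, y)]

def two_predictor_slopes(x1, x2, y):
    # Normal equations on centered data, solved by Cramer's rule.
    n = len(y)
    c1 = [a - sum(x1) / n for a in x1]
    c2 = [a - sum(x2) / n for a in x2]
    cy = [a - sum(y) / n for a in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(a * a for a in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * b for a, b in zip(c1, cy))
    s2y = sum(a * b for a, b in zip(c2, cy))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Made-up data with correlated predictors.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [3.1, 3.9, 7.2, 7.8, 11.5]
b1, b2 = two_predictor_slopes(x1, x2, y)
_, av_slope = slope_intercept(residuals(x1, x2), residuals(x1, y))
```

The equality is an algebraic identity, so it holds exactly for any data set with non-collinear predictors.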
Added variable plot
[Added variable plot. Fitted Line Plot: RESSRvspop>75 = 0.0000 - 0.000341 RESDPIvspop>75; S = 4.28891, R-Sq = 0.2%, R-Sq(adj) = 0.0%]
Note that the slope term, -0.000341, is the same as the slope term for DPI in the two-predictor model.
Scatterplot matrices (Matrix Plots)
• With one predictor X, a scatterplot of Y vs. X is very informative.
• With more than one predictor, scatterplots of Y vs. each of the predictors, and of each of the predictors vs. each other, are needed.
• A scatterplot matrix (or matrix plot) is just an organized display of these plots.
[Matrix Plot of SaveRate, pop<15, pop>75, DPI, changeDPI: all pairwise scatterplots]
Changes in R2
• Consider adding a predictor X2 to a model that already contains the predictor X1
• Let R2,1 be the R2 value for the fit of Y vs. X1, and let R2,2 be the R2 value for the fit of Y vs. X2
Changes in R2 (2)
• The R2 value for the multiple regression fit is always at least as large as R2,1 and R2,2
• The R2 value for the multiple regression fit of Y versus X1 and X2 may be
  – less than R2,1 + R2,2 (if the two predictors are explaining the same variation)
  – equal to R2,1 + R2,2 (if the two predictors measure different things)
  – more than R2,1 + R2,2 (e.g. response is the area of a rectangle, and the two predictors are length and width)
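The "more than" case can look surprising, so here is a small numerical check with made-up data, chosen so that Y = X2 - X1 exactly: each predictor alone has a small R2, but the joint fit is perfect.

```python
# A "suppressor" example: R^2 for the two-predictor fit exceeds the sum
# of the one-predictor R^2 values.
def r2_simple(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def r2_two(x1, x2, y):
    # Two-predictor fit via centered normal equations, then R^2 = 1 - RSS/TSS.
    n = len(y)
    c1 = [a - sum(x1) / n for a in x1]
    c2 = [a - sum(x2) / n for a in x2]
    cy = [a - sum(y) / n for a in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(a * a for a in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * b for a, b in zip(c1, cy))
    s2y = sum(a * b for a, b in zip(c2, cy))
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    rss = sum((v - b1 * a - b2 * b) ** 2 for v, a, b in zip(cy, c1, c2))
    tss = sum(v * v for v in cy)
    return 1 - rss / tss

x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.1, 1.9, 3.1, 3.9]
y = [b - a for a, b in zip(x1, x2)]   # y = x2 - x1 exactly
```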
Multiple regression model
• Response variable Y
• Predictors X1, X2, …, Xp
Yi = β0 + β1Xi1 + β2Xi2 + … + βpXip + ei
•Same assumptions on errors ei
(independent, constant variance, normality)
III. Inference in regression
Inference in regression
• Most inference procedures assume independence, constant variance, and normality of the errors.
• Most are “robust” to departures from normality, meaning that the p-values, confidence levels, etc. are approximately correct even if normality does not hold.
• In general, techniques like the bootstrap can be used when normality is suspect.
New data set
• Response variable:
  – Fuel = per-capita fuel consumption (times 1000)
• Predictors:
  – Dlic = proportion of the population who are licensed drivers (times 1000)
  – Tax = gasoline tax rate
  – Income = per-person income in thousands of dollars
  – logMiles = base 2 log of federal-aid highway miles in the state
t tests
Regression Analysis: Fuel versus Tax, Dlic, Income, logMiles

The regression equation is:
Fuel = 154 - 4.23 Tax + 0.472 Dlic - 6.14 Income + 18.5 logMiles

Predictor      Coef   SE Coef      T      P
Constant      154.2     194.9   0.79  0.433
Tax          -4.228     2.030  -2.08  0.043
Dlic         0.4719    0.1285   3.67  0.001
Income       -6.135     2.194  -2.80  0.008
logMiles     18.545     6.472   2.87  0.006

(The T column gives the t statistics; the P column gives the p-values.)
t tests (2)
• The t statistic tests the hypothesis that a particular slope parameter is zero.
• The formula is
t = (coefficient estimate)/(standard error)
• degrees of freedom are n-(p+1)
• p-values given are for the two-sided alternative
• This is like simple linear regression
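The T column of the Minitab output above can be reproduced from the Coef and SE Coef columns. A quick sketch (values copied from that output):

```python
# t = (coefficient estimate) / (standard error), reproducing the T column
# of the Minitab output for the fuel-consumption model.
coefs = {
    "Tax":      (-4.228, 2.030),
    "Dlic":     (0.4719, 0.1285),
    "Income":   (-6.135, 2.194),
    "logMiles": (18.545, 6.472),
}
t_stats = {name: coef / se for name, (coef, se) in coefs.items()}
```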
F tests
• General structure:
  – Ha: Large model
  – H0: Smaller model, obtained by setting some parameters in the large model to zero, or equal to each other, or equal to a constant
  – RSSAH = resid. sum of squares after fitting the large (alt. hypothesis) model
  – RSSNH = resid. sum of squares after fitting the smaller (null hypothesis) model
  – dfNH and dfAH are the corresponding degrees of freedom
F tests (2)
• Test statistic:
F = [(RSSNH - RSSAH) / (dfNH - dfAH)] / (RSSAH / dfAH)
•Null distribution: F distribution with dfNH – dfAH numerator and dfAH denominator degrees of freedom
F test example
• Can the “economic” variables tax and income be dropped from the model with all four predictors?
• AH model includes all predictors
• NH model includes only Dlic and logMiles
• Fit both models and get RSS and df values
F test example (2)
• RSSAH = 193700; dfAH = 46
• RSSNH = 243006; dfNH = 48
F = [(243006 - 193700) / (48 - 46)] / (193700 / 46) = 5.85
• P-value is the area to the right of 5.85 under an F(2,46) distribution, approx. 0.0054
• There’s pretty strong evidence that removing both Tax and Income is unwise
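The arithmetic behind this F statistic, using the RSS and df values from this example:

```python
# F statistic for dropping Tax and Income, from
# F = [(RSS_NH - RSS_AH)/(df_NH - df_AH)] / (RSS_AH/df_AH).
RSS_AH, df_AH = 193700, 46
RSS_NH, df_NH = 243006, 48
F = ((RSS_NH - RSS_AH) / (df_NH - df_AH)) / (RSS_AH / df_AH)
```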
Another F test example
• Question: Does it make sense that the two “economic” predictors should have the same coefficient?
• Ha: Y = β0 + β1Tax + β2 Dlic+ β3 Income + β4 logMiles + error
• H0: Y = β0 + β1Tax + β2 Dlic+ β1 Income + β4 logMiles + error
• Note: H0 is equivalent to Y = β0 + β1(Tax + Income) + β2 Dlic + β4 logMiles + error
Another F test example (2)
• Fit full model (AH)
• Create new predictor “TI” by adding Tax and Income, and fit a model with TI, Dlic, and logMiles (NH)
F = [(195487 - 193700) / (47 - 46)] / (193700 / 46) = 0.424
• P-value is the area to the right of 0.424 under an F(1,46) distribution, approx. 0.518
• This suggests that the simpler model with the same coefficient for Tax and Income fits well.
Removing one predictor
• We have two ways to test whether one predictor can be removed from the model:
  – t test
  – F test
• The tests are equivalent, in the sense that t2 = F, and that the p-values will be equivalent.
Confidence regions
• Confidence intervals for one parameter use the familiar t-interval.
• For example, to form a 95% confidence interval for the parameter of Income in the context of the full (four predictor) model:
• -6.135 ± (2.013)(2.194) = -6.135 ± 4.417
(-6.135 and 2.194 come from the Minitab output; 2.013 comes from the t distribution with 46 df.)
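The same interval in code (numbers copied from the slide):

```python
# 95% t-interval for the Income coefficient: estimate +/- t_crit * SE.
est, se = -6.135, 2.194      # from the Minitab output
t_crit = 2.013               # 0.975 quantile of the t distribution, 46 df
half_width = t_crit * se
ci = (est - half_width, est + half_width)
```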
Joint confidence regions
• Joint confidence regions for two or more parameters are more complex, and use the F distribution in place of the t distribution.
• Minitab (and SPSS, and …) can’t draw these easily
• On the next page is a joint confidence region for the parameters of Dlic and Tax, drawn in R.
[Figure: joint confidence region for Dlic and Tax, with dotted lines indicating individual confidence intervals for the two; the point (0,0) and the boundary of the confidence region are marked.]
Prediction
• Given a new set of predictor values x1, x2, …, xp, what’s the predicted response?
• It’s easy to answer this: Just plug the new predictors into the fitted regression model:
Ŷ = β̂0 + β̂1x1 + β̂2x2 + … + β̂pxp
•But how do we assess the uncertainty in the prediction? How do we form a confidence interval?
Predicted Values for New Observations
New
Obs Fit SE Fit 95% CI 95% PI
1 613.39 12.44 (588.34, 638.44) (480.39, 746.39)
Values of Predictors for New Observations
New
Obs Dlic Income logMiles Tax
1 900 28.0 15.0 17.0
Prediction interval for the fuel consumption for a state with Dlic=900, Income = 28, logMiles=15, and Tax = 17
Confidence interval for the average fuel consumption for states with Dlic = 900, Income = 28, logMiles=15, and Tax = 17
IV. Regression Diagnostics
Diagnostics
• Want to look for points that have a large influence on the fitted model
• Want to look for evidence that one or more model assumptions are untrue.
• Tools:
  – Residuals
  – Leverage
  – Influence and Cook’s Distance
Leverage
• A point whose predictor values are far from the “typical” predictor values has high leverage.
• For a high leverage point, the fitted value Ŷi will be close to the data value Yi.
• A rule of thumb: Any point with leverage larger than 2(p+1)/n is interesting.
• Most statistical packages can compute leverages.
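For simple linear regression the leverages have a closed form, h_i = 1/n + (x_i - x̄)²/Sxx. The sketch below uses the x-values from the "Importance of Plots" data (Anscombe's quartet) and reproduces the leverages shown on the next slide; note that the leverages sum to p + 1 = 2.

```python
# Leverages for simple linear regression: h_i = 1/n + (x_i - xbar)^2/Sxx.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # x1 from the four-data-sets example
n = len(x)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
# Rule of thumb: flag points with leverage above 2(p+1)/n (here 4/11).
cutoff = 2 * 2 / n
flagged = [xi for xi, hi in zip(x, h) if hi > cutoff]
```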
[Scatterplot of y3 vs x1 with the leverage of each point labeled: 0.236364, 0.127273, 0.172727, 0.318182, 0.172727, 0.318182, 0.127273, 0.090909, 0.236364, 0.100000, 0.100000]
[Scatterplot of leverage vs index for the 50 countries in the savings data, with a reference line at 0.2 and each point labeled by country name]
Influential Observations
• A data point is influential if it has a large effect on the fitted model.
• Put another way, an observation is influential if the fitted model will change a lot if the observation is deleted.
• Cook’s Distance is a measure of the influence of an observation.
• It may make sense to refit the model after removing a few of the most influential observations.
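As an illustration (not from the slides' Minitab session), Cook's Distance for simple linear regression can be computed as D_i = e_i² h_i / (2 s² (1 - h_i)²), where 2 is the number of estimated parameters. Using Anscombe's third data set, from the "Importance of Plots (3)" slide, the outlier at x = 13 dominates:

```python
# Cook's Distance for y3 vs x1 (Anscombe's third data set).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((a - xbar) ** 2 for a in x)
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
resid = [b - (b0 + b1 * a) for a, b in zip(x, y)]
h = [1 / n + (a - xbar) ** 2 / sxx for a in x]      # leverages
s2 = sum(e * e for e in resid) / (n - 2)            # estimate of sigma^2
cooks = [e * e * hi / (2 * s2 * (1 - hi) ** 2) for e, hi in zip(resid, h)]
most_influential = max(range(n), key=lambda i: cooks[i])
```

The high-leverage point at x = 14 lies near the fitted line, so its Cook's Distance stays modest; the outlier combines a large residual with moderate leverage and dominates.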
[Scatterplot of y3 vs x1 with Cook's Distance labeled at each point: 0.00695, 0.00035, 0.05954, 0.03382, 0.00052, 0.30057, 0.02598, 0.00547, 1.39285, 0.00214, 0.01176; annotations mark one point as "High leverage, low influence" and another as "High Influence"]
[Scatterplot of Cook's Distance vs index for the 50 countries in the savings data, with each point labeled by country name]
V. Model Selection
Model Selection
• Question: With a large number of potential predictors, how do we choose the predictors to include in the model?
• Want good prediction, but parsimony: Occam’s Razor.
• Also can be thought of as a bias-variance tradeoff.
Model Selection Example
• Data on all 50 states, from the 1970s:
  – Life.Exp = Life expectancy (response)
  – Population (in thousands)
  – Income = per-capita income
  – Illiteracy (in percent of population)
  – Murder = murder rate per 100,000
  – HS.Grad (in percent of population)
  – Frost = mean # days with min. temp < 32F
  – Area = land area in square miles
Forward Selection
• Choose a cutoff α
• Start with no predictors
• At each step, add the predictor with the lowest p-value less than α
• Continue until there are no unused predictors with p-values less than α
Stepwise Regression: Life.Exp versus Population, Income, ...
Forward selection. Alpha-to-Enter: 0.25
Response is Life.Exp on 7 predictors, with N = 50

Step            1        2        3        4
Constant    72.97    70.30    71.04    71.03
Murder     -0.284   -0.237   -0.283   -0.300
 T-Value    -8.66    -6.72    -7.71    -8.20
 P-Value    0.000    0.000    0.000    0.000
HS.Grad              0.044    0.050    0.047
 T-Value              2.72     3.29     3.14
 P-Value             0.009    0.002    0.003
Frost                       -0.0069  -0.0059
 T-Value                      -2.82    -2.46
 P-Value                      0.007    0.018
Population                           0.00005
 T-Value                                2.00
 P-Value                               0.052
S           0.847    0.796    0.743    0.720
R-Sq        60.97    66.28    71.27    73.60
R-Sq(adj)   60.16    64.85    69.39    71.26
Mallows Cp   16.1      9.7      3.7      2.0
Variations on FS
• Backward elimination
  – Choose cutoff α
  – Start with all predictors in the model
  – At each step, eliminate the predictor with the highest p-value that is greater than α
  – Continue until no remaining predictor has a p-value greater than α
• Stepwise: Allow addition or elimination at each step (hybrid of FS and BE)
All subsets
• Fit all possible models.
• Based on a “goodness” criterion, choose the model that fits best.
• Goodness criteria include AIC, BIC, Adjusted R2, and Mallows’ Cp
• Some of the criteria will be described next
Notation
• RSS* = Resid. Sum of Squares for the current model
• p* = Number of terms (including intercept) in the current model
• n = number of observations
• s2 = RSS/(n-(p+1)) = Estimate of σ2 from model with all predictors and intercept term.
Goodness criteria
• Smaller is better for AIC, BIC, Cp*. Larger is better for adjR2
• AIC = n log(RSS*/n) + 2p*
• BIC = n log(RSS*/n) + p* log(n)
• Cp* = RSS*/s2 + 2p* - n
• adjR2 = 1 - [(n - 1)/(n - p*)](1 - R2)
Best Subsets Regression: Life.Exp versus Population, Income, ...
Response is Life.Exp

Vars   R-Sq   R-Sq(adj)   Mallows Cp      S       Predictors in the best model
  1    61.0     60.2         16.1      0.84732    Murder
  2    66.3     64.8          9.7      0.79587    Murder, HS.Grad
  3    71.3     69.4          3.7      0.74267    Murder, HS.Grad, Frost
  4    73.6     71.3          2.0      0.71969    Population, Murder, HS.Grad, Frost
  5    73.6     70.6          4.0      0.72773    (adds one of the remaining predictors)
  6    73.6     69.9          6.0      0.73608    (adds another of the remaining predictors)
  7    73.6     69.2          8.0      0.74478    all seven predictors
Model selection can overstate significance
• Generate Y and X1, X2, …, X50
• All are independent and standard normal, so none of the predictors are related to the response.
• Fit the full model and look at the overall F test.
• Use model selection to choose a “good” smaller model, and look at its overall F test.
The full model
• Results from fitting model with all 50 predictors
• Note that the F test is not significant
S = 0.915237   R-Sq = 57.6%   R-Sq(adj) = 14.3%

Analysis of Variance
Source           DF       SS      MS     F      P
Regression       50   55.7093  1.1142  1.33  0.160
Residual Error   49   41.0453  0.8377
Total            99   96.7546
The “good” small model
• Run FS with α = 0.05
• Predictors x38, x41, and x24 are chosen.
• Fit that three-predictor model. Now the F test is highly significant.
• Analysis of Variance
Source           DF       SS      MS     F      P
Regression        3   20.9038  6.9679  8.82  0.000
Residual Error   96   75.8508  0.7901
Total            99   96.7546
What’s left?
• Weighted least squares
• Tests for lack of fit
• Transformations of response and predictors
• Analysis of Covariance
• Etc.