regression
1
Regression
Econ 240A
2
Retrospective
Week One
• Descriptive statistics
• Exploratory Data Analysis
Week Two
• Probability
• Binomial Distribution
Week Three
• Normal Distribution
• Interval Estimation, Hypothesis Testing, Decision Theory
3
Week Four
Bivariate Relationships Correlation and Analysis of Variance
4
Outline
• A cognitive device to help understand the formulas for estimating the slope and the intercept, as well as the analysis of variance
• Table of Analysis of Variance (ANOVA) for regression
• F distribution for testing the significance of the regression, i.e. does the independent variable, x, significantly explain the dependent variable, y?
5
Outline (Cont.)
• The Coefficient of Determination, R2, and the Coefficient of Correlation, r.
• Estimate of the error variance, $\sigma^2$.
• Hypothesis tests on the slope, b.
6
Part I: A Cognitive Device
7
A Cognitive Device: The Conceptual Model
(1) yi = a + b*xi + ei
Take expectations, E: (2) E yi = a + b*E xi + E ei, where we
• assume (3) E ei = 0
Subtract (2) from (1) to obtain the model in deviations:
(4) [yi - E yi ] = b*[xi - E xi ] + ei
Multiply (4) by [xi - E xi ] and take expectations:
8
A Cognitive Device: (Cont.)
(5) E{[yi - E yi ] [xi - E xi ]} = b*E[xi - E xi ]2 + E{ei [xi - E xi ] }, where we assume
• E{ei [xi - E xi ] } = 0, i.e. e and x are independent
By definition, (6) cov yx = b* var x, i.e. (7) b = cov yx/ var x
The corresponding empirical estimate, by the method of moments:
(8) $\hat{b} = \sum_i [y(i) - \bar{y}][x(i) - \bar{x}] \,/\, \sum_i [x(i) - \bar{x}]^2$
9
A Cognitive Device (Cont.)
The empirical counterpart to (2): $\bar{y} = \hat{a} + \hat{b}\,\bar{x}$, so (9) $\hat{a} = \bar{y} - \hat{b}\,\bar{x}$
Square both sides of (4), and take expectations,
(10) E [yi - E yi ]2 = b2*E[xi - E xi ]2 + 2*b*E{ei*[xi - E xi ]} + E[ei]2
where E{ei*[xi - E xi ]} = 0, i.e. the explanatory variable x and the error e are assumed to be independent, cov ex = 0
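Equations (8) and (9) can be sketched in a few lines of Python. The data below are hypothetical toy numbers chosen for easy arithmetic, not lecture data, and `fit_line` is an illustrative helper name.

```python
# Hypothetical toy data for illustration only; not from the lecture.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 4.0, 8.0, 9.0]


def fit_line(x, y):
    """Return (a_hat, b_hat) using equations (8) and (9)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of (8): sums of cross-products and squares
    # of deviations from the sample means.
    s_xy = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b_hat = s_xy / s_xx
    # Equation (9): the fitted line passes through the sample means.
    a_hat = y_bar - b_hat * x_bar
    return a_hat, b_hat


a_hat, b_hat = fit_line(x, y)
print(f"a_hat = {a_hat:.3f}, b_hat = {b_hat:.3f}")
```

For these toy numbers the sums are 21 and 10, so the slope is 2.1 and the intercept is 5 - 2.1*2 = 0.8.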
10
A Cognitive Device (Cont.)
From (10) by definition (11) var y = b2 * var x + var e. This is the partition of the total variance in y into the variance explained by x, b2 * var x, and the unexplained or error variance, var e.
The empirical counterpart to (11) is that the total sum of squares equals the explained sum of squares plus the unexplained sum of squares:
(12) $\sum_i [y(i) - \bar{y}]^2 = \hat{b}^2 \sum_i [x(i) - \bar{x}]^2 + \sum_i [\hat{e}(i)]^2$
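The decomposition in (12) can be checked numerically. This is a minimal sketch on hypothetical toy data (not lecture data); the names `tss`, `ess` and `uss` follow the lecture's abbreviations.

```python
# Hypothetical toy data for illustration only; not from the lecture.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 4.0, 8.0, 9.0]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b_hat = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y)) / s_xx
a_hat = y_bar - b_hat * x_bar

# Fitted residuals e_hat(i) = y(i) - a_hat - b_hat * x(i)
residuals = [yi - (a_hat + b_hat * xi) for xi, yi in zip(x, y)]

tss = sum((yi - y_bar) ** 2 for yi in y)  # total sum of squares
ess = b_hat ** 2 * s_xx                   # explained sum of squares
uss = sum(e ** 2 for e in residuals)      # unexplained sum of squares

print(f"TSS = {tss:.2f}, ESS = {ess:.2f}, USS = {uss:.2f}")
```

Up to floating-point rounding, TSS equals ESS plus USS; here 46 = 44.1 + 1.9.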
11
A Cognitive Device (Cont.)
From Eq. 7, substitute for b in Eq. 11:
• Var y = [cov yx]2/Var x + Var e
Divide by Var y:
• 1 = [cov yx]2/(var y * var x) + var e/var y
• or 1 = r2 + var e/var y, where r is the correlation coefficient
12
Population Model and Sample Model Side by Side
13
Conceptual Vs. Fitted Model
Conceptual
(1) yi = a + b*xi + ei
Take expectations, E
(2) Ey = a + b*Ex + Eei
(3) where Eei = 0
Subtract (2) from (1)
(4) [yi - Ey] = b*[xi - Ex] + ei
Fitted
(i) $\hat{y}(i) = \hat{a} + \hat{b}\,x(i)$
(ii) $\hat{e}(i) = y(i) - \hat{y}(i)$
Minimize
(iii) $\sum_i [\hat{e}(i)]^2$ = (iv) $\sum_i [y(i) - \hat{a} - \hat{b}\,x(i)]^2$
14
Conceptual Vs. Fitted (Cont.)
Conceptual
Multiply (4) by [xi - Ex] and take expectations, E
(5) E [yi - Ey] [xi - Ex] = b*E [xi - Ex]2 + Eei*[xi - Ex], where Eei*[xi - Ex] = 0
(6) cov[y*x] = b*varx
(7) b = cov[y*x]/varx
Fitted
First order condition
(v) $\sum_i [y(i) - \hat{a} - \hat{b}\,x(i)] = 0$
(vi) $\sum_i \hat{e}(i) = 0$
compare (3) & (vi)
(vii) $\bar{y} = \hat{a} + \hat{b}\,\bar{x}$
From (v) the fitted line goes through the sample means
15
Conceptual vs. Fitted (Cont.)
(vii) $\sum_i [y(i) - \hat{a} - \hat{b}\,x(i)]\,x(i) = 0$
(viii) $\sum_i \hat{e}(i)\,x(i) = 0$
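The two first-order conditions say that the fitted residuals sum to zero and are orthogonal to x. A quick numerical check, sketched on hypothetical toy data (not lecture data):

```python
# Hypothetical toy data for illustration only; not from the lecture.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 4.0, 8.0, 9.0]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b_hat = (sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
a_hat = y_bar - b_hat * x_bar

residuals = [yi - a_hat - b_hat * xi for xi, yi in zip(x, y)]

# First-order condition for a_hat: residuals sum to zero.
sum_e = sum(residuals)
# First-order condition for b_hat: residuals are orthogonal to x.
sum_ex = sum(e * xi for e, xi in zip(residuals, x))

print(f"sum of residuals = {sum_e:.2e}, sum of residual*x = {sum_ex:.2e}")
```

Both sums are zero up to floating-point rounding, which is the sample analogue of the assumptions E ei = 0 and E{ei [xi - E xi]} = 0 in the conceptual model.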
16
Part II: ANOVA in Regression
17
ANOVA
Testing the significance of the regression, i.e. does x significantly explain y?
F1, n -2 = EMS/UMS
Distributed with the F distribution with 1 degree of freedom in the numerator and n-2 degrees of freedom in the denominator
18
Table of Analysis of Variance (ANOVA)
Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square
Explained, ESS | $\hat{b}^2 \sum_i [x(i) - \bar{x}]^2$ | 1 | $\hat{b}^2 \sum_i [x(i) - \bar{x}]^2 / 1$
Error, USS | $\sum_i [\hat{e}(i)]^2$ | n - 2 | $\sum_i [\hat{e}(i)]^2 / (n-2)$
Total, TSS | $\sum_i [y(i) - \bar{y}]^2$ | n - 1 | $\sum_i [y(i) - \bar{y}]^2 / (n-1)$
F1,n-2 = Explained Mean Square / Error Mean Square
19
Example from Lab Four
Linear Trend Model for UC Budget
[Figure: UC Budget, General Fund Component, Millions of Nominal $, by fiscal year, 1968-69 through 2004-05 (vertical axis 0 to 4000, Millions $). Fitted linear trend: y = 81.613x + 19.497, R2 = 0.933. One point is labeled $2670.529.]
21
SUMMARY OUTPUT
Regression Statistics
Multiple R | 0.965901681
R Square | 0.932966058
Adjusted R Square | 0.931050802
Standard Error | 240.1544701
Observations | 37
ANOVA
 | df | SS | MS | F | Significance F
Regression | 1 | 28094446.39 | 28094446 | 487.12355 | 3.992E-22
Residual | 35 | 2018595.933 | 57674.17 | |
Total | 36 | 30113042.32 | | |
 | Coefficients | Standard Error | t Stat | P-value | Lower 95%
Intercept | 101.1096814 | 77.38814601 | 1.306527 | 0.1998941 | -55.99679935
X Variable 1 | 81.61255073 | 3.697748649 | 22.07088 | 3.992E-22 | 74.10571271
RESIDUAL OUTPUT
Observation | Predicted Y | Residuals
1 | 101.1096814 | 190.1903186
2 | 182.7222321 | 146.5777679
Time index, t = 0 for 1968-69, t=1 for 1969-70 etc
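The mean squares and the F statistic in this output can be rebuilt from the two sums of squares and the number of observations. The following is a small sketch using the values printed in the summary output; the dictionary layout is only an illustrative choice.

```python
# Values taken from the linear trend summary output.
n = 37
ess = 28094446.39   # regression (explained) sum of squares
uss = 2018595.933   # residual (unexplained) sum of squares

anova = {
    "explained": {"df": 1, "ss": ess},
    "error": {"df": n - 2, "ss": uss},
    "total": {"df": n - 1, "ss": ess + uss},
}
for row in anova.values():
    row["ms"] = row["ss"] / row["df"]  # mean square = SS / df

# F statistic: explained mean square over error mean square.
f_stat = anova["explained"]["ms"] / anova["error"]["ms"]

for name, row in anova.items():
    print(f"{name:9s} df={row['df']:2d} SS={row['ss']:.3f} MS={row['ms']:.2f}")
print(f"F(1, {n - 2}) = {f_stat:.2f}")
```

The computed F agrees with the 487.12 reported by Excel, and the total sum of squares matches the 30113042.32 in the table.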
22
Example from Lab Four
Exponential trend model for UC Budget
UCBud(t) = exp[a + b*t + e(t)]
Taking the logarithms of both sides:
ln UCBud(t) = a + b*t + e(t)
23
UC Budget
[Figure: UC Budget, $ Billions, against fiscal year index (0 to 40; vertical axis 0 to 4.5). Fitted exponential trend: y = 0.3623e^(0.0654x), R2 = 0.9131.]
24
SUMMARY OUTPUT
Regression Statistics
Multiple R | 0.95557214
R Square | 0.91311811
Adjusted R Square | 0.91063577
Standard Error | 0.22161289
Observations | 37
ANOVA
 | df | SS | MS | F | Significance F
Regression | 1 | 18.06574216 | 18.06574 | 367.8458 | 3.77305E-20
Residual | 35 | 1.718929586 | 0.049112 | |
Total | 36 | 19.78467174 | | |
 | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95%
Intercept | -0.94975318 | 0.071413248 | -13.2994 | 3.01E-15 | -1.094729955 | -0.80478
X Variable 1 | 0.06544472 | 0.003412257 | 19.17931 | 3.77E-20 | 0.058517462 | 0.072372
RESIDUAL OUTPUT
Observation | Predicted Y | Residuals
1 | -0.94975318 | -0.283648439
2 | -0.88430846 | -0.226477634
3 | -0.81886374 | -0.272078047
4 | -0.75341901 | -0.268232233
5 | -0.68797429 | -0.267317175
Time index, t = 0 for 1968-69, t=1 for 1969-70 etc
Exp(-0.950) = 0.387
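To read the log-linear estimates back in levels, exponentiate them. The sketch below uses the intercept and slope from the summary output; treating exp(b) - 1 as the implied annual growth rate is a standard reading of a log-linear trend and is added here as an interpretation, not something stated on the slide.

```python
import math

# Estimates from the regression of ln UCBud(t) on the time index t.
a_hat = -0.94975318
b_hat = 0.06544472

# Fitted level at t = 0 (fiscal year 1968-69), in $ billions.
level_at_t0 = math.exp(a_hat)

# Implied growth per year: exp(b) - 1, close to b itself when b is small.
annual_growth = math.exp(b_hat) - 1.0


def fitted_level(t):
    """Fitted budget in $ billions at time index t, ignoring the error term."""
    return math.exp(a_hat + b_hat * t)


print(f"exp(a_hat) = {level_at_t0:.3f}")
print(f"implied annual growth = {annual_growth:.4f}")
```

The first number reproduces the slide's Exp(-0.950) = 0.387; the second is a little above the slope of 0.0654 because of compounding.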
25
Part III: The F Distribution
26
The F Distribution
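The density function of the F distribution:
$f(F) = \dfrac{\left(\frac{\nu_1+\nu_2-2}{2}\right)!}{\left(\frac{\nu_1-2}{2}\right)!\,\left(\frac{\nu_2-2}{2}\right)!} \left(\dfrac{\nu_1}{\nu_2}\right)^{\nu_1/2} \dfrac{F^{(\nu_1-2)/2}}{\left(1+\frac{\nu_1 F}{\nu_2}\right)^{(\nu_1+\nu_2)/2}}, \quad F > 0$
$\nu_1$ and $\nu_2$ are the numerator and denominator degrees of freedom.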
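The F density can be evaluated with the standard library by writing the factorial terms as Gamma functions, since (n - 1)! = Gamma(n). This is a sketch only: `f_density` and `tail_prob` are illustrative names, and the midpoint-rule integration (with its `upper` and `steps` defaults) is a rough numerical stand-in for an F table, not how the table was produced.

```python
import math


def f_density(f, nu1, nu2):
    """Density of the F distribution with nu1 and nu2 degrees of freedom."""
    if f <= 0:
        return 0.0
    # Gamma(n) = (n - 1)!, so these are the factorial terms of the formula.
    const = math.gamma((nu1 + nu2) / 2) / (
        math.gamma(nu1 / 2) * math.gamma(nu2 / 2)
    )
    return (
        const
        * (nu1 / nu2) ** (nu1 / 2)
        * f ** ((nu1 - 2) / 2)
        / (1 + nu1 * f / nu2) ** ((nu1 + nu2) / 2)
    )


def tail_prob(f0, nu1, nu2, upper=200.0, steps=20000):
    """Approximate P(F > f0) with a midpoint rule on [f0, upper]."""
    width = (upper - f0) / steps
    total = sum(f_density(f0 + (j + 0.5) * width, nu1, nu2) for j in range(steps))
    return total * width


# With nu1 = nu2 = 2 the density reduces to 1 / (1 + F)^2.
print(f_density(1.0, 2, 2))
# Right-tail probability beyond 4.12 with 1 and 35 degrees of freedom.
print(round(tail_prob(4.12, 1, 35), 3))
```

The second call gives a right-tail area of about 5%, the type I error level used when testing the significance of a regression with 37 observations.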
27
The F Distribution
This density function generates a rich family of distributions, depending on the values of $\nu_1$ and $\nu_2$.
[Figure: two panels of F densities for F from 0 to 5. Panel 1: $\nu_1 = 5, \nu_2 = 10$ versus $\nu_1 = 50, \nu_2 = 10$. Panel 2: $\nu_1 = 5, \nu_2 = 10$ versus $\nu_1 = 5, \nu_2 = 1$.]
28
Determining Values of F
The values of the F variable can be found in the F table, Table 6(a) in Appendix B for a type I error of 5%, or with Excel.
The entries in the table are the values $F_A$ of the F variable with right-hand tail probability A, i.e. the values for which $P(F_{\nu_1,\nu_2} > F_A) = A$.
[Figure repeated from the Lab Four example: UC Budget, General Fund Component, Millions of Nominal $, by fiscal year, with linear trend y = 81.613x + 19.497, R2 = 0.933.]
30
Time index, t = 0 for 1968-69, t=1 for 1969-70 etc
[Excel summary output for the linear trend regression, repeated from the Lab Four example: R Square 0.932966058, Observations 37, F 487.12355, Significance F 3.992E-22.]
F1, 35 = (n-2)*[R2/(1 - R2)] = 35*(0.933/0.067) = 487
31
[Figure: F density with 1 dof in the numerator and 35 dof in the denominator; the 5% critical value is F1,35 = 4.12.]
32
Part IV: The Pearson Coefficient of Correlation, r
The Pearson coefficient of correlation, r, is
(13) r = cov yx/{[var x]1/2 [var y]1/2}
Estimated counterpart
(14) $\hat{r} = \sum_i [y(i) - \bar{y}][x(i) - \bar{x}] \,/\, \sqrt{\sum_i [y(i) - \bar{y}]^2 \sum_i [x(i) - \bar{x}]^2}$
Comparing (13) to (7) note that (15) r*{[var y]1/2 /[var x]1/2} = b
33
A Cognitive Device: (Cont.)
(5) E{[yi - E yi ] [xi - E xi ]} = b*E[xi - E xi ]2 + E{ei [xi - E xi ] }, where we assume
• E{ei [xi - E xi ] } = 0, i.e. e and x are independent
By definition, (6) cov yx = b* var x, i.e. (7) b = cov yx/ var x
The corresponding empirical estimate:
(8) $\hat{b} = \sum_i [y(i) - \bar{y}][x(i) - \bar{x}] \,/\, \sum_i [x(i) - \bar{x}]^2$
34
Part IV (Cont.) The Coefficient of Determination, R2
For a bivariate regression of y on a single explanatory variable, x, R2 = r2, i.e. the coefficient of determination equals the square of the Pearson coefficient of correlation
Using (14) to square the estimate of r
(16) $[\hat{r}]^2 = \left\{\sum_i [y(i) - \bar{y}][x(i) - \bar{x}]\right\}^2 \,/\, \left\{\sum_i [y(i) - \bar{y}]^2 \sum_i [x(i) - \bar{x}]^2\right\}$
35
Part IV (Cont.)
Using (8), (16) can be expressed as
(19) $\hat{r}^2 = \hat{b}^2 \sum_i [x(i) - \bar{x}]^2 \,/\, \sum_i [y(i) - \bar{y}]^2 = ESS/TSS$
And so
(20) $1 - \hat{r}^2 = 1 - [ESS/TSS] = [TSS - ESS]/TSS = USS/TSS$
In general, including multivariate regression, the estimate of the coefficient of determination, $\hat{R}^2$, can be calculated from (21) $\hat{R}^2 = 1 - USS/TSS$.
36
Part IV (Cont.)
For the bivariate regression, the F-test can be calculated from
F1, n-2 = [(n-2)/1][ESS/TSS]/[USS/TSS]
F1, n-2 = [(n-2)/1][ESS/USS] = (n-2) $\hat{R}^2 / [1 - \hat{R}^2]$
For a multivariate regression with k explanatory variables, the F-test can be calculated as
Fk, n-k-1 = [(n-k-1)/k][ESS/USS]
Fk, n-k-1 = [(n-k-1)/k] $\hat{R}^2 / [1 - \hat{R}^2]$
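The link between the coefficient of determination and the F statistic can be sketched as a small function. `f_from_r2` is an illustrative name; the sums of squares and the sample size are taken from the linear trend regression for the UC budget.

```python
def f_from_r2(r2, n, k=1):
    """F statistic with k and n - k - 1 degrees of freedom, computed from R-squared."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must lie in [0, 1)")
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1 observations")
    return ((n - k - 1) / k) * r2 / (1.0 - r2)


# Linear trend regression for the UC budget: 37 observations, one regressor.
uss = 2018595.933    # unexplained (residual) sum of squares
tss = 30113042.32    # total sum of squares
r2 = 1.0 - uss / tss  # equation (21)

f_stat = f_from_r2(r2, n=37, k=1)
print(f"R2 = {r2:.6f}, F(1, 35) = {f_stat:.2f}")
```

With k = 1 the function reduces to the bivariate formula (n-2)*R2/(1 - R2), and the result matches the F of 487.12 in the Excel output.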
37
Time index, t = 0 for 1968-69, t=1 for 1969-70 etc
[Excel summary output for the linear trend regression, repeated from the Lab Four example: R Square 0.932966058, Residual SS 2018595.933, Total SS 30113042.32.]
R2 = 1 - 2,018,596/30,113,042 = 0.933
38
Part V: Estimate of the Error Variance
Var ei = $\sigma^2$
Estimate is unexplained mean square, UMS:
$\hat{\sigma}^2 = \sum_i [\hat{e}(i)]^2/(n-2) = \sum_i [y(i) - \hat{y}(i)]^2/(n-2)$
Standard error of the regression is $\hat{\sigma}$
39
Time index, t = 0 for 1968-69, t=1 for 1969-70 etc
$\hat{\sigma} = \sqrt{UMS} = \sqrt{57674.17} = 240.15$
[Excel summary output for the linear trend regression, repeated from the Lab Four example: Standard Error 240.1544701, Residual SS 2018595.933 with 35 df, Residual MS 57674.17.]
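As a quick check on the arithmetic, the standard error of the regression can be recomputed from the residual sum of squares and its degrees of freedom, both taken from the summary output:

```python
import math

# Values from the linear trend summary output.
uss = 2018595.933  # residual (unexplained) sum of squares
n = 37             # observations

ums = uss / (n - 2)         # unexplained mean square, the estimate of var e
sigma_hat = math.sqrt(ums)  # standard error of the regression

print(f"UMS = {ums:.2f}, sigma_hat = {sigma_hat:.4f}")
```

The result agrees with the "Standard Error" line of the Regression Statistics block.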
40
Part VI: Hypothesis Tests on the Slope
Hypotheses, H0 : b=0; HA: b>0
Test statistic: $t = [\hat{b} - E(\hat{b})]/\hat{\sigma}(\hat{b})$, where $E(\hat{b}) = b = 0$ under H0
Set probability for the type I error, say 5%
Note: for bivariate regression, the square of the t-statistic for the null that the slope is zero is the F-statistic
41
t = [81.6 - 0]/3.70 = 22.1
[Excel summary output for the linear trend regression, repeated from the Lab Four example: X Variable 1 coefficient 81.61255073, Standard Error 3.697748649, t Stat 22.07088, F 487.12355.]
t2 = F, i.e. 22.1*22.1 = 488, which matches the reported F of 487.12 up to rounding in t
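Using the unrounded slope and standard error from the summary output removes the rounding gap between t2 and F. A minimal sketch:

```python
# Slope estimate and its standard error from the summary output.
b_hat = 81.61255073
se_b = 3.697748649
b_null = 0.0  # value of the slope under H0

t_stat = (b_hat - b_null) / se_b
t_squared = t_stat ** 2

f_reported = 487.12355  # F from the ANOVA block of the same output

print(f"t = {t_stat:.5f}, t^2 = {t_squared:.2f}, reported F = {f_reported:.2f}")
```

The squared t-statistic reproduces the F statistic, as expected for a bivariate regression.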
42
Part VII: Student’s t-Distribution
43
The Student t Distribution
The Student t density function
$f(t) = \dfrac{[(\nu-1)/2]!}{\sqrt{\nu\pi}\,[(\nu-2)/2]!} \left[1 + \dfrac{t^2}{\nu}\right]^{-(\nu+1)/2}$
$\nu$ is the parameter of the Student t distribution (the degrees of freedom)
E(t) = 0
V(t) = $\nu/(\nu - 2)$ (for $\nu$ > 2)
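The Student t density can be evaluated in the same way as the F density, replacing the factorial terms with Gamma functions, since (n - 1)! = Gamma(n). `t_density` and `t_variance` are illustrative names for this sketch.

```python
import math


def t_density(t, nu):
    """Density of the Student t distribution with nu degrees of freedom."""
    # Gamma((nu + 1) / 2) = [(nu - 1) / 2]! and Gamma(nu / 2) = [(nu - 2) / 2]!
    const = math.gamma((nu + 1) / 2) / (
        math.sqrt(nu * math.pi) * math.gamma(nu / 2)
    )
    return const * (1 + t * t / nu) ** (-(nu + 1) / 2)


def t_variance(nu):
    """Variance nu / (nu - 2), defined only for nu > 2."""
    if nu <= 2:
        raise ValueError("variance requires nu > 2")
    return nu / (nu - 2)


# The density is symmetric about 0, and its peak rises as nu grows.
print(t_density(0.0, 3), t_density(0.0, 10))
print(t_variance(10))
```

With nu = 1 the density at zero equals 1/pi, and the variance formula gives 1.25 for nu = 10, approaching 1 as nu grows.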
44
The Student t Distribution
[Figure: Student t densities for $\nu$ = 3 and $\nu$ = 10, plotted for t from -6 to 6 (vertical axis 0 to 0.2).]
45
Determining Student t Values
The student t distribution is used extensively in statistical inference.
Thus, it is important to determine values of tA associated with a given number of degrees of freedom.
We can do this using
• t tables, Table 4 Appendix B
• Excel
46
Using the t Table
The table provides the t values (tA) for which P(t > tA) = A
Degrees of Freedom | t.100 | t.05 | t.025 | t.01 | t.005
1 | 3.078 | 6.314 | 12.706 | 31.821 | 63.657
2 | 1.886 | 2.92 | 4.303 | 6.965 | 9.925
. | . | . | . | . | .
10 | 1.372 | 1.812 | 2.228 | 2.764 | 3.169
. | . | . | . | . | .
200 | 1.286 | 1.653 | 1.972 | 2.345 | 2.601
$\infty$ | 1.282 | 1.645 | 1.96 | 2.326 | 2.576
Example: for A = .05 and 10 degrees of freedom, tA = 1.812
The t distribution is symmetrical around 0, so -tA = -1.812
47
Problem 6.32 in Text
Table of Joint Probabilities
 | Manual Calc. | Computer Calc.
Quant Ed. | 0.23 | 0.36
Other Ed. | 0.11 | 0.30
48
Problem 6.32 The method of instruction in college and
university applied statistics courses is changing. Historically, most courses were taught with an emphasis on manual calculation. The alternative is to employ a computer and a software package to perform the calculations. An analysis of applied statistics courses investigated whether the instructor’s educational background is primarily mathematics (or statistics) or some other field.
49
Problem 6.32
A. What is the probability that a randomly selected applied statistics course instructor whose education was in statistics emphasizes manual calculations?
What proportion of applied statistics courses employ a computer and software?
Are the educational background of the instructor and the way his or her course is taught independent?
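One way to work through these questions from the table of joint probabilities is sketched below. It assumes the "Quant Ed." row corresponds to instructors whose education was in mathematics or statistics; the dictionary keys are illustrative labels.

```python
# Joint probabilities from the table for Problem 6.32.
joint = {
    ("quant", "manual"): 0.23,
    ("quant", "computer"): 0.36,
    ("other", "manual"): 0.11,
    ("other", "computer"): 0.30,
}

# Marginal probabilities: sum the joint probabilities over the other variable.
p_quant = joint[("quant", "manual")] + joint[("quant", "computer")]
p_manual = joint[("quant", "manual")] + joint[("other", "manual")]
p_computer = joint[("quant", "computer")] + joint[("other", "computer")]

# Conditional probability: P(manual | quant) = P(quant and manual) / P(quant).
p_manual_given_quant = joint[("quant", "manual")] / p_quant

# Independence would require P(quant and manual) = P(quant) * P(manual).
independent = abs(joint[("quant", "manual")] - p_quant * p_manual) < 1e-9

print(f"P(manual | quant) = {p_manual_given_quant:.3f}")
print(f"P(computer) = {p_computer:.2f}")
print(f"independent: {independent}")
```

Here P(quant)*P(manual) = 0.59*0.34 = 0.2006, which differs from the joint probability of 0.23, so under this reading the two characteristics are not independent.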
50
Midterm 2000
• (15 points) The following table shows the results of regressing the natural logarithm of California General Fund expenditures, in billions of nominal dollars, against year beginning in 1968 and ending in 2000. A plot of actual, estimated and residual values follows.
– How much of the variance in the dependent variable is explained by trend?
– What is the meaning of the F statistic in the table? Is it significant?
– Interpret the estimated slope.
– If General Fund expenditures were $68.819 billion in California for fiscal year 2000-2001, provide a point estimate for state expenditures for 2001-2002.
51
Cont.
• A state senator believes that state expenditures in nominal dollars have grown over time at 7% a year. Is the senator in the ballpark, or is his impression significantly below the estimated rate, using a 5% level of significance?
• If you were an aide to the Senator, how might you criticize this regression?
52
Table
Dependent Variable: LNGENFND
Method: Least Squares
Sample: 1968 2000
Included observations: 33
Variable | Coefficient | Std. Error | t-Statistic | Prob.
YEAR | 0.086958 | 0.003895 | 22.32804 | 0.0000
C | -169.4787 | 7.726922 | -21.93353 | 0.0000
R-squared | 0.941459 | Mean dependent var | 3.046404
Adjusted R-squared | 0.939570 | S.D. dependent var | 0.866594
S.E. of regression | 0.213030 | Akaike info criterion | -0.196076
Sum squared resid | 1.406835 | Schwarz criterion | -0.105379
Log likelihood | 5.235258 | F-statistic | 498.5416
Durbin-Watson stat | 0.118575 | Prob(F-statistic) | 0.000000
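The arithmetic behind two of the midterm questions can be sketched from this table. The approach is an assumption about how the questions are meant to be answered: the senator's 7% is treated as the null value of the slope in a one-tailed t-test, and the point estimate scales the given 2000-2001 figure by exp(slope).

```python
import math

# Estimates from the regression of ln(General Fund expenditures) on year.
b_hat = 0.086958
se_b = 0.003895
n = 33

# H0: b = 0.07 against HA: b > 0.07 (the senator's figure is too low).
b_null = 0.07
t_stat = (b_hat - b_null) / se_b
df = n - 2

# The one-tailed 5% critical value falls as degrees of freedom rise, so for
# 31 df it lies between the table entries for infinite df and for 10 df.
crit_low, crit_high = 1.645, 1.812
reject_null = t_stat > crit_high

# Point estimate for 2001-2002: one more year of trend growth in logs.
spend_2000_01 = 68.819  # $ billions, given in the question
forecast_2001_02 = spend_2000_01 * math.exp(b_hat)

print(f"t = {t_stat:.3f} with {df} df, reject H0: {reject_null}")
print(f"forecast for 2001-2002 = {forecast_2001_02:.2f} billion")
```

Because the t-statistic exceeds even the larger bracketing critical value, this reading rejects the senator's 7% in favor of a higher growth rate at the 5% level.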
Plot
[Figure: Actual, Fitted and Residual Values from the Regression of the Logarithm of General Fund Expenditures ($B) on Year, 1970 to 2000. Actual and fitted values range from about 1 to 5; residuals range from about -0.4 to 0.4.]