Lesson11-1 Ka-fu Wong © 2007 ECON1003: Analysis of Economic Data
Lesson 11: Regressions Part II
Does watching television rot your mind?
Zavodny, Madeline (2006): “Does watching television rot your mind? Estimates of the effect on test scores,” Economics of Education Review, 25 (5): 565–573
Television is one of the most omnipresent features of Americans’ lives. The average American adult watches about 15 h of television per week, accounting for almost one-half of free time.
The substantial amount of time that most individuals spend watching television makes it important to examine its effects on society, including human capital accumulation and academic achievement.
Data & Regression model
This analysis uses three data sets to examine the relationship between television viewing and test scores: the National Longitudinal Survey of Youth 1979 (NLSY), the HSB survey and the NELS. Each survey includes test scores and a question about the number of hours of television watched by young adults.
Test score of individual i at time t
Summary of samples from data sets
Regression results
**p<0.01; *p<0.05; †p<0.1
Multiple Linear Regression Model
Relationship Between Variables Is a Linear Function
Y = β0 + β1X1 + β2X2 + β3X3 + … + βkXk + ε
Y: Dependent (Response) Variable
X1, …, Xk: Independent (Explanatory) Variables
β0: Y intercept; β1, …, βk: Slopes; ε: Random Error
Finance Application: multifactor pricing model
It is assumed that rate of return on a stock (R) is linearly related to the rate of return on some factor and the rate of return on the overall market (Rm).
Rit = β0 + βoi Rot + β1 Rmt + εit
Rit: Rate of return on a particular oil company stock i at time t
Rmt: Rate of return on some major stock index
Rot: The rate of return on crude oil price on date t
Estimation by Method of Moments: Number of moment conditions needed
Y = β0 + β1X1 + β2X2 + β3X3 + … + βkXk + ε
k+1 parameters to estimate. Need k+1 moment conditions.
Assumption #1: E(ε) = 0 implies E(y) – β0 – β1E(x1) – β2E(x2) – … – βkE(xk) = 0
Assumption #2: E(εx1) = 0 implies E[(y – β0 – β1x1 – … – βkxk)x1] = 0
Since Cov(ε, x1) = E(εx1) – E(ε)E(x1) = E(εx1), the assumption really implies that ε and x1 are uncorrelated.
Assumption #3: E(εx2) = 0
Assumption #4: E(εx3) = 0
…
Assumption #k+1: E(εxk) = 0
Estimation of β0, β1, β2, …, βk
Method of moments
Two approaches:
1. Solve for β0, β1, β2, …, βk from the k+1 moment conditions, in terms of covariances, variances and means. Plug in the sample analogs of these covariances, variances and means to produce the sample estimates b0, b1, b2, …, bk.
2. Treat b0, b1, b2, …, bk as unknowns and solve for them from the sample analogs of the k+1 moment conditions.
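The second approach can be sketched in a few lines, assuming NumPy is available. The function name `method_of_moments` and the simulated data are illustrative assumptions, not part of the lesson. The k+1 sample moment conditions, (1/n) Σ ei = 0 and (1/n) Σ ei xji = 0, are linear in b0, …, bk, so they can be solved as a linear system.

```python
import numpy as np

def method_of_moments(X, y):
    """Solve the k+1 sample moment conditions for b0, b1, ..., bk.

    X is an (n, k) array of regressors and y an (n,) array.
    Conditions: mean(e) = 0 and mean(e * x_j) = 0 for j = 1..k,
    where e = y - b0 - b1*x1 - ... - bk*xk.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    Z = np.column_stack([np.ones(n), X])  # constant plus the k regressors
    A = Z.T @ Z / n                       # sample moments of (1, x) with (1, x)
    c = Z.T @ y / n                       # sample moments of (1, x) with y
    return np.linalg.solve(A, c)          # b0, b1, ..., bk

# Simulated data with hypothetical true values b0 = 1, b1 = 2, b2 = -0.5
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

b = method_of_moments(X, y)
residuals = y - np.column_stack([np.ones(200), X]) @ b
```

By construction the residuals average to zero and are uncorrelated in the sample with each regressor, which is exactly what the moment conditions impose.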
Estimation of β0, β1, β2, …, βk
Maximum Likelihood
Assume the εi are independent and identically distributed normal with zero mean and variance σ². Denote the normal density of ε by f(ε) = f(y – β0 – β1x1 – β2x2 – … – βkxk).
For trial values b0, b1, …, bk, the implied residual e has normal density f(e) = f(y – b0 – b1x1 – b2x2 – … – bkxk).
Choose b0, b1, b2, …, bk to maximize the joint likelihood:
L(b0, b1, b2, …, bk) = f(e1)*f(e2)*…*f(en)
To estimate β0, β1, …, βk using ML (Computer)
We do not know β0, β1, β2, …, βk. Nor do we know εi. In fact, our objective is to estimate β0, β1, β2, …, βk.
The procedure of ML:
1. Assume a combination of β0, β1, β2, …, βk, call it b0, b1, b2, …, bk. Compute the implied ei = yi – b0 – b1x1i – b2x2i – … – bkxki and f(ei) = f(yi – b0 – b1x1i – b2x2i – … – bkxki).
2. Compute the joint likelihood conditional on the assumed values of b0, b1, b2, …, bk:
L(b0, b1, b2, …, bk) = f(e1)*f(e2)*…*f(en)
3. Assume many more combinations of β0, β1, β2, …, βk, and repeat the above two steps, using a computer program (such as Excel).
4. Choose the b0, b1, b2, …, bk that yield the largest joint likelihood.
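The trial-and-compare procedure can be sketched as a grid search. This is a minimal illustration under stated assumptions: one regressor, a hypothetical data set lying exactly on y = 1 + 2x, and σ fixed at 1 (the maximizing b0 and b1 do not depend on σ). The code sums log densities instead of multiplying densities, which has the same maximizer and avoids numerical underflow. The names `normal_log_density` and `grid_search_ml` are made up for this sketch.

```python
import numpy as np

def normal_log_density(e, sigma=1.0):
    """Log of the normal density with mean zero and standard deviation sigma."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - e**2 / (2 * sigma**2)

def grid_search_ml(x, y, b0_grid, b1_grid):
    """Try every (b0, b1) on the grid and keep the pair with the largest likelihood."""
    best = (None, None, -np.inf)
    for b0 in b0_grid:
        for b1 in b1_grid:
            e = y - b0 - b1 * x                    # step 1: implied residuals
            loglik = normal_log_density(e).sum()   # step 2: log of f(e1)*...*f(en)
            if loglik > best[2]:
                best = (b0, b1, loglik)
    return best

# Hypothetical data lying exactly on y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x
grid = np.linspace(-5, 5, 101)   # candidate values spaced 0.1 apart
b0_hat, b1_hat, _ = grid_search_ml(x, y, grid, grid)
```

A finer grid gives more precise estimates at the cost of more likelihood evaluations, which is why the calculus approach is preferred when it is available.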
To estimate β0, β1, …, βk using ML (Calculus)
Choose b0, b1, b2, …, bk to maximize the likelihood function L(b0, b1, b2, …, bk), using calculus.
Take the first derivative of L(b0, b1, b2, …, bk) with respect to b0, set it to zero.
Take the first derivative of L(b0, b1, b2, …, bk) with respect to bj, set it to zero, for j = 1, …, k.
Solve for b0, b1, b2, …, bk using the k+1 equations.
Estimation: Ordinary least squares
For each value of X, there is a group of Y values, and these Y values are normally distributed.
Yi ~ N(E(Y|X1, X2, …, Xk), σi²), i = 1, 2, …, n
The means of these normal distributions of Y values all lie on the straight line of regression.
E(Y|X1, X2, …, Xk) = β0 + β1X1 + β2X2 + … + βkXk
The standard deviations of these normal distributions are equal.
σi² = σ², i = 1, 2, …, n
i.e., homoskedasticity
Choosing the line that fits best: Ordinary Least Squares (OLS) Principle
Straight lines can be described generally by yi = b0 + b1x1i + b2x2i + … + bkxki, i = 1, …, n
Finding the best line, the one with the smallest sum of squared differences, is the same as solving
Min S(b0, b1, …, bk) = Σ [yi – (b0 + b1x1i + b2x2i + … + bkxki)]², summing over i = 1, …, n
It can be shown that this minimization yields the same sample moment conditions as discussed earlier under the method of moments.
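The link between least squares and the method of moments can be made explicit by writing out the first-order conditions. In the sketch below, e_i denotes the residual y_i − (b_0 + b_1 x_{1i} + … + b_k x_{ki}); this notation matches the rest of the lesson, while the layout of the derivation is an addition.

```latex
\begin{align*}
S(b_0,\dots,b_k) &= \sum_{i=1}^{n}\bigl[y_i-(b_0+b_1x_{1i}+\dots+b_kx_{ki})\bigr]^2,\\
\frac{\partial S}{\partial b_0} &= -2\sum_{i=1}^{n} e_i = 0
  \;\Longrightarrow\; \frac{1}{n}\sum_{i=1}^{n} e_i = 0,\\
\frac{\partial S}{\partial b_j} &= -2\sum_{i=1}^{n} e_i\,x_{ji} = 0
  \;\Longrightarrow\; \frac{1}{n}\sum_{i=1}^{n} e_i\,x_{ji} = 0,\qquad j=1,\dots,k.
\end{align*}
```

These k+1 equations are the sample analogs of E(ε) = 0 and E(εxj) = 0.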
It can be shown that the estimators are BLUE
Best: smallest variance (among linear unbiased estimators)
Linear: linear combination of yi
Unbiased: E(b0) = β0, E(b1) = β1
Estimator
Interpretation of Coefficients
yi = b0 + b1x1i + b2x2i + … + bkxki + ei
Prediction: y* = b0 + b1x1 + b2x2 + … + bkxk
Slope (bj): Estimated Y changes by bj for each 1-unit increase in Xj, holding other variables constant
y* + Δy = b0 + b1x1 + … + bj(xj + 1) + … + bkxk
Δy = bj
More generally,
y* + Δy = b0 + b1x1 + … + bj(xj + Δxj) + … + bkxk
Δy = bjΔxj
Δy/Δxj = bj
Y-Intercept (b0): Estimated value of Y when X1 = X2 = … = Xk = 0
Parameter Estimation Example
You work in advertising for the New York Times. You want to find the effect of ad size (sq. in.) & newspaper circulation (000) on the number of ad responses (00).
You’ve collected the following data:
Resp (y)   Size (x1)   Circ (x2)
1          1           2
4          8           8
1          3           1
3          5           7
2          6           4
4          10          6
Parameter Estimation Computer Output
Parameter Estimates
Variable   DF   Parameter Estimate   Standard Error   T for H0: Param=0   Prob>|T|
INTERCEP   1    0.0640               0.2599           0.246               0.8214
ADSIZE     1    0.2049               0.0588           3.485               0.0399
CIRC       1    0.2805               0.0686           4.089               0.0264
Slope (b1): # Responses to Ad is expected to increase by .2049 (20.49) for each 1 sq. in. increase in Ad Size, holding Circulation constant
Slope (b2): # Responses to Ad is expected to increase by .2805 (28.05) for each 1 unit (1,000) increase in Circulation, holding Ad Size constant
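The estimates in the computer output can be reproduced from the six observations of the ad-response example. The sketch below assumes NumPy and reads the data as (Resp, Size, Circ) = (1, 1, 2), (4, 8, 8), (1, 3, 1), (3, 5, 7), (2, 6, 4), (4, 10, 6); the variable names are illustrative.

```python
import numpy as np

# Ad-response example: responses (00), ad size (sq. in.), circulation (000)
resp = np.array([1, 4, 1, 3, 2, 4], dtype=float)
size = np.array([1, 8, 3, 5, 6, 10], dtype=float)
circ = np.array([2, 8, 1, 7, 4, 6], dtype=float)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones(len(resp)), size, circ])
b, *_ = np.linalg.lstsq(X, resp, rcond=None)   # b0, b1, b2

n, p = X.shape                                 # p = k + 1 estimated parameters
resid = resp - X @ b
s2 = resid @ resid / (n - p)                   # SSE / (n - k - 1)
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
t_stats = b / se                               # t statistic for H0: parameter = 0
```

Each t statistic is the estimate divided by its standard error, so the coefficient table can be checked one row at a time.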
Interpreting the Standard Error of the Estimate
Assumptions:
Observed Y values are normally distributed around each estimated value Y*
Constant variance
se measures the dispersion of the points around the regression line. If se = 0, the equation is a “perfect” estimator.
se may be used to compute confidence intervals of the estimated value.
Test of Slope Coefficient (bj)
1. Tests if there is a linear relationship between Xj & Y after other variables are controlled for.
2. Involves population slope βj
3. Hypotheses
H0: βj = 0 (Xj should not appear in the linear relationship)
H1: βj ≠ 0
4. Theoretical basis is sampling distribution of slopes
Basis for Inference About the Population Regression Slope
Let βj be a population regression slope and bj its least squares estimate based on n data points. Then, if the standard regression assumptions hold and it can also be assumed that the errors εi are normally distributed, the random variable
t = (bj – βj) / sbj
is distributed as Student’s t with (n – k – 1) degrees of freedom. In addition, the central limit theorem enables us to conclude that this result is approximately valid for a wide range of non-normal distributions and large sample sizes, n.
Confidence Intervals for the Population Regression Slope βj
If the regression errors εi are normally distributed and the standard regression assumptions hold, a 100(1 – α)% confidence interval for the population regression slope βj is given by
bj – t(n–k–1),α/2 sbj < βj < bj + t(n–k–1),α/2 sbj
Some cautions about the interpretation of significance tests
Rejecting H0: βj = 0 and concluding that the relationship between xj and y is significant does not enable us to conclude that a cause-and-effect relationship is present between xj and y.
Causation requires:
Association
Accurate time sequence
Ruling out other explanations for the correlation
Correlation ≠ Causation
Some cautions about the interpretation of significance tests
Just because we are able to reject H0: βj = 0 and demonstrate statistical significance does not enable us to conclude that the relationship between x and y is linear.
Linear relationships are a very small subset of the possible relationships among variables.
A test of linear versus nonlinear relationship requires another batch of analysis.
Evaluating the Model
yi = b0 + b1x1i + b2x2i + … + bkxki + ei
Are the assumptions valid?
Assumption #1: Linearity
Assumption #2: The appropriate set of variables is included.
Assumption #3: The explanatory variables are uncorrelated with the error term.
Assumption #4: The error term has a constant variance.
Assumption #5: The errors are independent of each other.
Measures of Variation in Regression
Total Sum of Squares (SST): Measures variation of observed Yi around the mean, Ȳ
Explained Variation (SSR): Variation due to relationship between X & Y
Unexplained Variation (SSE): Variation due to other factors
SST = SSR + SSE
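Variation in y (SST) = SSR + SSE
$$
\begin{aligned}
SST &= \sum_{i=1}^{n} (y_i - \bar{y})^2 \\
    &= \sum_{i=1}^{n} (y_i - y_i^* + y_i^* - \bar{y})^2 \\
    &= \sum_{i=1}^{n} \left[ (y_i - y_i^*)^2 + (y_i^* - \bar{y})^2 + 2 (y_i - y_i^*)(y_i^* - \bar{y}) \right] \\
    &= \sum_{i=1}^{n} (y_i - y_i^*)^2 + \sum_{i=1}^{n} (y_i^* - \bar{y})^2 + 2 \sum_{i=1}^{n} (y_i - y_i^*)(y_i^* - \bar{y}) \\
    &= \underbrace{\sum_{i=1}^{n} (y_i - y_i^*)^2}_{SSE} + \underbrace{\sum_{i=1}^{n} (y_i^* - \bar{y})^2}_{SSR}
\end{aligned}
$$
The cross-product term is 0, as imposed in the estimation by the sample analog of E(εx) = 0.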
Variation Measures
[Figure: scatter of Y against X with the fitted line yi* = b0 + b1xi. For an observation (Xi, Yi) it marks the total sum of squares (Yi – Ȳ)² (SST), the unexplained sum of squares (Yi – Yi*)² (SSE), and the explained sum of squares (Yi* – Ȳ)² (SSR).]
R2 (= r2, the coefficient of determination) measures the proportion of the variation in y that is explained by the variation in x.
$$ R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_{i=1}^{n} (y_i - y_i^*)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} $$
R2 takes on any value between zero and one.
R2 = 1: Perfect match between the line and the data points.
R2 = 0: There is no linear relationship between x and y.
Variation in y (SST) = SSR + SSE
Adjusted R-square
(Unadjusted) R-square never decreases as more variables are included. Thus, using R-square as a measure, we will always conclude that a model with more variables is better.
However, adding a new variable is costly. An additional variable may add to the uncertainty of estimating y.
Thus, we would like to have a measure that will penalize the addition of variables.
$$ \bar{R}^2 = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)} = 1 - (1 - R^2)\,\frac{n-1}{n-k-1} $$
Fix an R2, adjusted R2 decreases with k.
Fix k, adjusted R2 increases with R2.
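The two comparative statements can be checked numerically. The function below is a direct transcription of the adjusted R-square formula; the name `adjusted_r2` and the values R2 = 0.80 and n = 30 are hypothetical choices for illustration.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-square for n observations and k explanatory variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Fix R2 = 0.80 and n = 30, then add variables: the adjusted value falls
by_k = [adjusted_r2(0.80, n=30, k=k) for k in (1, 2, 5, 10)]

# Fix k = 2 and n = 30, then raise R2: the adjusted value rises
by_r2 = [adjusted_r2(r2, n=30, k=2) for r2 in (0.2, 0.5, 0.8)]
```

With n = 15, k = 2 and R2 = 0.52148, the same function returns about 0.4417.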
International price discrimination
Cabolis, Christos, Sofronis Clerides, Ioannis Ioannou and Daniel Senft (2007): “A textbook example of international price discrimination,” Economics Letters, 95(1): 91-95.
Motivation
International price comparisons have a long history in economics. Macroeconomists have used them extensively to test for purchasing power parity and the law of one price. International trade economists have been interested in international price differences as evidence of trade barriers while industrial organization economists have studied issues of market structure. The popular and business press have also shown a keen interest and frequently report intercity price comparisons for standardized products such as the Big Mac or a Starbucks cappuccino.
The paper documents the existence of very large differences in the prices of textbooks across countries.
Data
Our data were collected from the Internet sites of Amazon.com in two distinct phases. In May 2002 we collected information on prices and characteristics of 268 books that were on sale on both the US and UK websites of Amazon, Inc. This data set includes both textbooks and general audience books and we refer to it as our “broad sample”. In December 2002, we collected additional data on economics textbooks; this is our “econ sample”. In this phase, we broadened our sample by including Canada in the search and collected more detailed information about each book.
We tested for price differences by running a simple hedonic regression of price on book characteristics and on dummy variables that aim to capture differences across countries and book types.
Estimates from the broad sample; dependent variable: ln(p)
Variable          Coefficient estimate   Standard error
Intercept         1.045                  0.272
Textbook          0.268                  0.052
US general book   0.126                  0.044
US Textbook       0.306                  0.031
Ln(pages)         0.345                  0.048
Hardcover         0.343                  0.044
N                 536
R2                0.454
F-stat            56.52
Notes: Coefficients that are statistically different from zero at 5% and 1% are marked with “*” and “**” respectively.
Estimates from the Economics sample; dependent variable: ln(p) (standard errors in parentheses)
Columns, in order: Commercial hard.; Univ. press hard.; Commercial paper; Univ. press paper
US  0.478** (0.043)  0.143** (0.045)  0.008 (0.072)  −0.048 (0.026)
CA  0.248** (0.049)  0.132** (0.03)  −0.032 (0.066)  0.011 (0.036)
US-INTRO  0.027 (0.045)  0.310* (0.124)
CA-INTRO  0.074 (0.062)  0.231 (0.149)
DELTIME  0.024** (0.006)  0.021* (0.008)  −0.004 (0.011)  0.007 (0.006)
N  304  170  109  99
R2  0.303  0.152  0.223  0.413
F-stat  40.23  6.3  3.92  15.64
Notes: Coefficients that are statistically different from zero at 5% and 1% are marked with “*” and “**” respectively.
Testing for Linearity
Key Argument:
If the value of y does not change linearly with the value of x, then using the mean value of y is the best predictor for the actual value of y. This implies y* = ȳ is preferable.
If the value of y does change linearly with the value of x, then using the regression model gives a better prediction for the value of y than using the mean of y. This implies that the regression prediction y* is preferable.
Testing for Linearity
The Global F-test
H0: β1 = β2 = … = βk = 0 (no linear relationship)
H1: at least one βi ≠ 0 (at least one independent variable affects Y)
Test Statistic:
$$ F = \frac{MSR}{MSE} = \frac{SSR/k}{SSE/(n-k-1)} = \frac{\sum_{i=1}^{n} (y_i^* - \bar{y})^2 / k}{\sum_{i=1}^{n} (y_i - y_i^*)^2 / (n-k-1)} $$
F is distributed with k numerator degrees of freedom and n–k–1 denominator degrees of freedom. Reject H0 if F > Fk,n–k–1,α
[Variation in y] = SSR + SSE. Large F results from a large SSR. Then, much of the variation in y is explained by the regression model. The null hypothesis should be rejected; thus, the model is valid.
Under the null, SSR is either zero or very small!!
F-Test for Overall Significance
Regression Statistics
Multiple R          0.72213
R Square            0.52148
Adjusted R Square   0.44172
Standard Error      47.46341
Observations        15

ANOVA        df   SS          MS          F         Significance F
Regression   2    29460.027   14730.013   6.53861   0.01201
Residual     12   27033.306   2252.776
Total        14   56493.333

              Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%
Intercept     306.52619      114.25389        2.68285    0.01993   57.58835    555.46404
Price         -24.97509      10.83213         -2.30565   0.03979   -48.57626   -1.37392
Advertising   74.13096       25.96732         2.85478    0.01449   17.55303    130.70888

F = MSR/MSE = 14730.0/2252.8 = 6.5386, with 2 and 12 degrees of freedom
Significance F (0.01201) is the p-value for the F-test
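The F statistic and R-square in this output follow directly from the ANOVA sums of squares. A minimal sketch, with an illustrative helper name `f_statistic`:

```python
def f_statistic(ssr, sse, n, k):
    """Global F statistic: (SSR / k) / (SSE / (n - k - 1))."""
    msr = ssr / k              # mean square due to regression
    mse = sse / (n - k - 1)    # mean square error
    return msr / mse

# Sums of squares from the ANOVA table (n = 15 observations, k = 2 regressors)
ssr, sse = 29460.027, 27033.306
f_value = f_statistic(ssr, sse, n=15, k=2)
r2 = ssr / (ssr + sse)         # SSR / SST
```

Comparing `f_value` with the critical value for 2 and 12 degrees of freedom gives the decision for the global test.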
F-Test for Overall Significance (continued)
H0: β1 = β2 = 0
H1: β1 and β2 not both zero
α = .05; df1 = 2, df2 = 12
Critical Value: F.05 = 3.885
Test Statistic: F = MSR/MSE = 6.5386
Decision: Since the F test statistic is in the rejection region (p-value < .05), reject H0
Conclusion: There is evidence that at least one independent variable affects Y
[Figure: F distribution starting at 0, with the rejection region of area α = .05 to the right of F.05 = 3.885 (reject H0) and the non-rejection region to the left (do not reject H0).]
Tests on a Subset of Regression Coefficients
Consider a multiple regression model involving variables xj and zj, and the null hypothesis that the z variable coefficients are all zero:
yi = β0 + β1x1i + … + βkxki + α1z1i + … + αrzri + εi
H0: α1 = α2 = … = αr = 0
H1: at least one of αj ≠ 0 (j = 1, …, r)
Under the null, SSR due to Z is either zero or very small!!
Tests on a Subset of Regression Coefficients
Goal: compare the error sum of squares for the complete model with the error sum of squares for the restricted model
First run a regression for the complete model and obtain SSE
Next run a restricted regression that excludes the z variables (the number of variables excluded is r) and obtain the restricted error sum of squares SSE(r).
Compute the F statistic and apply the decision rule for a significance level α:
$$ \text{Reject } H_0 \text{ if } F = \frac{(SSE(r) - SSE)/r}{SSE/(n-k-r-1)} > F_{r,\,n-k-r-1,\,\alpha} $$
Note: SSE/(n–k–r–1) = se², the estimated error variance of the complete model
EXAMPLE 1
A market researcher for Super Dollar Super Markets is studying the yearly amount families of four or more spend on food. Three independent variables are thought to be related to yearly food expenditures (Food). Those variables are: total family income (Income) in $00, size of family (Size), and whether the family has children in college (College).
Example 1 continued
Note the following regarding the regression equation. The variable college is called a dummy or indicator variable.
It can take only one of two possible outcomes. That is a child is a college student or not.
Other examples of dummy variables include gender, the part is acceptable or unacceptable, the voter will or will not vote for the incumbent governor.
We usually code one value of the dummy variable as “1” and the other “0.”
EXAMPLE 1 continued
Family Food Income Size Student
1 3900 376 4 0
2 5300 515 5 1
3 4300 516 4 0
4 4900 468 5 0
5 6400 538 6 1
6 7300 626 7 1
7 4900 543 5 0
8 5300 437 4 0
9 6100 608 5 1
10 6400 513 6 1
11 7400 493 6 1
12 5800 563 5 0
EXAMPLE 1 continued
Use a computer software package, such as Excel, to develop a correlation matrix.
From the analysis provided by Excel, write out the regression equation:
Y*= 954 +1.09X1 + 748X2 + 565X3
What food expenditure would you estimate for a family of 4, with no college students, and an income of $50,000 (which is input as 500)?
The regression equation is
Food = 954 + 1.09 Income + 748 Size + 565 Student
Predictor Coef SE Coef T P
Constant 954 1581 0.60 0.563
Income 1.092 3.153 0.35 0.738
Size 748.4 303.0 2.47 0.039
Student 564.5 495.1 1.14 0.287
S = 572.7 R-Sq = 80.4% R-Sq(adj) = 73.1%
Analysis of Variance
Source DF SS MS F P
Regression 3 10762903 3587634 10.94 0.003
Residual Error 8 2623764 327970
Total 11 13386667
EXAMPLE 1 continued
From the regression output we note: The coefficient of determination is 80.4 percent. This
means that more than 80 percent of the variation in the amount spent on food is accounted for by the variables income, family size, and student.
Each additional $100 of income per year (one unit of Income, which is measured in $00) will increase the amount spent on food by about $1.09 per year.
An additional family member will increase the amount spent per year on food by $748.
A family with a college student will spend $565 more per year on food than those without a college student.
EXAMPLE 1 continued
EXAMPLE 1 continued
The estimated food expenditure for a family of 4 with an income of 500 (that is, $50,000) and no college student is $4,491.
Y* = 954 + 1.09(500) + 748(4) + 565 (0)
= 4491
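The same prediction can be written as a small function. The name `predict_food` and its argument names are illustrative; the coefficients are the rounded values from the fitted equation.

```python
def predict_food(income_hundreds, size, student):
    """Estimated yearly food expenditure ($) from the fitted equation.

    income_hundreds: family income in $00 (so 500 means $50,000)
    size: number of family members
    student: 1 if the family has a child in college, else 0
    """
    return 954 + 1.09 * income_hundreds + 748 * size + 565 * student

estimate = predict_food(income_hundreds=500, size=4, student=0)
with_student = predict_food(income_hundreds=500, size=4, student=1)
```

The difference between the two estimates is the dummy-variable coefficient, 565.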
EXAMPLE 1 continued
Conduct a global test of hypothesis to determine if any of the regression coefficients are not zero.
H0: β1 = β2 = β3 = 0 versus H1: Not all βs equal 0
H0 is rejected if F > 4.07.
From the computer output, the computed value of F is 10.94.
Decision: H0 is rejected. Not all the regression coefficients are zero.
EXAMPLE 1 continued
Conduct an individual test to determine which coefficients are not zero. These are the hypotheses for the independent variable family size:
H0: β2 = 0 versus H1: β2 ≠ 0
Thus, using the 5% level of significance, reject H0 if the p-value < .05.
From the computer output, the only significant variable is SIZE (family size) using the p-values. The other variables can be omitted from the model.
Correlation Matrix
A correlation matrix is used to show all possible simple correlation coefficients among the variables. See which xj are most correlated with y, and which xj are strongly correlated with each other.
      y      x1        x2         …   xk
y     1.00   r(x1,y)   r(x2,y)    …   r(xk,y)
x1           1.00      r(x1,x2)   …   r(x1,xk)
x2                     1.00       …   r(x2,xk)
xk                                    1.00
Multicollinearity
1. High correlation between X variables
2. Multicollinearity makes it difficult to separate the effect of x1 on y from the effect of x2 on y. Leads to unstable coefficients depending on X variables in model
3. Always exists; it is a matter of degree
4. Example: using both age & height as explanatory variables in same model
Detecting Multicollinearity
1. Examine correlation matrix: correlations between pairs of X variables are higher than their correlations with the Y variable
2. Few remedies: obtain new sample data; eliminate one correlated X variable
EXAMPLE 1 continued
The correlation matrix is as follows:
          Food    Income   Size
Income    0.587
Size      0.876   0.609
Student   0.773   0.491    0.743
The strongest correlation between the dependent variable and an independent variable is between family size and amount spent on food.
Among the independent variables, the correlations of Income with Size (0.609) and with Student (0.491) are between –.70 and .70. The correlation between Size and Student (0.743) is slightly above .70 and calls for some caution.
EXAMPLE 1 continued
We rerun the analysis using only the significant independent variable, family size.
The new regression equation is:
Y* = 340 + 1031X2
The coefficient of determination is 76.8 percent. We dropped two independent variables, and the R-square term was reduced by only 3.6 percentage points.
Example 1 continued
Regression Analysis: Food versus Size
The regression equation is
Food = 340 + 1031 Size
Predictor   Coef     SE Coef   T      P
Constant    339.7    940.7     0.36   0.726
Size        1031.0   179.4     5.75   0.000
S = 557.7 R-Sq = 76.8% R-Sq(adj) = 74.4%
Analysis of Variance
Source           DF   SS         MS         F       P
Regression       1    10275977   10275977   33.03   0.000
Residual Error   10   3110690    311069
Total            11   13386667
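The test on a subset of coefficients described earlier in the lesson can be applied to the two Example 1 fits, as a sketch. The complete model (Income, Size, Student) has SSE = 2623764 with 8 residual degrees of freedom; the restricted model (Size only) has SSE(r) = 3110690; r = 2 variables are excluded. The helper name `subset_f` is illustrative.

```python
def subset_f(sse_restricted, sse_full, r, df_full):
    """F statistic for H0: the r excluded coefficients are all zero.

    df_full is the residual degrees of freedom of the complete model.
    """
    return ((sse_restricted - sse_full) / r) / (sse_full / df_full)

# Residual sums of squares from the two Minitab outputs of Example 1
f_subset = subset_f(sse_restricted=3110690, sse_full=2623764, r=2, df_full=8)
```

The statistic is well below the 5% critical value for 2 and 8 degrees of freedom (about 4.46), so the hypothesis that the Income and Student coefficients are both zero is not rejected, which agrees with the decision to drop them.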
Residual Analysis
Purposes: Evaluate violations of assumptions, including the assumption of linearity.
Graphical Analysis of Residuals
Plot residuals versus Xi values
Residual: difference between actual Yi & predicted Yi*
Studentized residuals: allow consideration of the magnitude of the residuals
Residual Analysis for Homoscedasticity
[Figure: two plots of standardized residuals (SR = e/se) against X, one showing heteroscedasticity and one showing homoscedasticity (OK).]
When the requirement of a constant variance (homoscedasticity) is violated, we have heteroscedasticity.
For example, for xi > xj, Var(εi|xi) > Var(εj|xj)
Residual Analysis for Independence
[Figure: two plots of standardized residuals (SR = e/se) against X, one labeled “Not Independent” and one labeled “Independent” (OK).]
[Figure: two plots of residuals against time, each with a horizontal reference line at 0.]
In one plot, note the runs of positive residuals, replaced by runs of negative residuals.
In the other, note the oscillating behavior of the residuals around zero.
Patterns in the appearance of the residuals over time indicate that autocorrelation exists.
The Durbin-Watson Statistic
Used when data are collected over time to detect autocorrelation (residuals in one time period are related to residuals in another period)
Measures violation of the independence assumption
$$ D = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2} $$
D should be close to 2. If not, examine the model for autocorrelation.
Intuition: If x and y are independent, Var(x – y) = Var(x) + Var(y)
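The Durbin-Watson statistic is easy to compute from a series of residuals. The sketch below assumes NumPy; the function name `durbin_watson` and the two short residual series are made up to show the extreme cases.

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Hypothetical residual series
d_oscillating = durbin_watson([1, -1, 1, -1])       # sign flips every period
d_runs = durbin_watson([1, 1, 1, -1, -1, -1])       # long runs of one sign
```

Oscillating residuals push D above 2, while runs of same-signed residuals push it toward 0. Both patterns indicate autocorrelation.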
Outliers
An outlier is an observation that is unusually small or large.
Several possibilities need to be investigated when an outlier is observed:
There was an error in recording the value.
The point does not belong in the sample.
The observation is valid.
Identify outliers from the scatter diagram.
It is customary to suspect an observation is an outlier if its |standardized residual| > 2
[Figure: two scatter plots, one labeled “An outlier” and one labeled “An influential observation”.]
The outlier causes a shift in the regression line
… but, some outliers may be very influential
Nonnormality or heteroscedasticity can be remedied using transformations on the y variable.
The transformations can improve the linear relationship between the dependent variable and the independent variables.
Many computer software systems allow us to make the transformations easily.
Remedying violations of the required conditions
Nonlinear Regression Models
The relationship between the dependent variable and an independent variable may not be linear
Can review the scatter diagram to check for non-linear relationships
Example: Quadratic model
Y = β0 + β1X1 + β2X1² + ε
The second independent variable is the square of the first variable
Quadratic Regression Model
Model form:
Yi = β0 + β1X1i + β2X1i² + εi
where:
β0 = Y intercept
β1 = regression coefficient for linear effect of X on Y
β2 = regression coefficient for quadratic effect on Y
εi = random error in Y for observation i
Linear vs. Nonlinear Fit
[Figure: two pairs of plots, each showing Y against X with a fitted curve and the corresponding residuals against X.]
Linear fit does not give random residuals
Nonlinear fit gives random residuals
Quadratic Regression Model
Yi = β0 + β1X1i + β2X1i² + εi
Quadratic models may be considered when the scatter diagram takes on one of the following shapes:
[Figure: four scatter-diagram shapes of Y against X1, labeled in order: (β1 < 0, β2 > 0), (β1 > 0, β2 > 0), (β1 < 0, β2 < 0), (β1 > 0, β2 < 0).]
β1 = the coefficient of the linear term
β2 = the coefficient of the squared term
Testing for Significance: Quadratic Effect
Testing the Quadratic Effect
Compare the linear regression estimate
ŷ = b0 + b1x1
with the quadratic regression estimate
ŷ = b0 + b1x1 + b2x1²
Hypotheses
H0: β2 = 0 (The quadratic term does not improve the model)
H1: β2 ≠ 0 (The quadratic term improves the model)
Testing for Significance: Quadratic Effect
Testing the Quadratic Effect
Hypotheses
H0: β2 = 0 (The quadratic term does not improve the model)
H1: β2 ≠ 0 (The quadratic term improves the model)
The test statistic is
t = (b2 – β2) / sb2, with d.f. = n – 3
where:
b2 = squared term slope coefficient
β2 = hypothesized slope (zero)
sb2 = standard error of the slope
Testing for Significance: Quadratic Effect
Testing the Quadratic Effect
Compare Adjusted R2 from simple regression to
Adjusted R2 from the quadratic model
If Adjusted R2 from the quadratic model is larger than Adjusted R2 from the simple model, then the quadratic model is a better model
Example: Quadratic Model
Purity increases as filter time increases:
Purity   Filter Time
3        1
7        2
8        3
15       5
22       7
33       8
40       10
54       12
67       13
70       14
78       15
85       15
87       16
99       17
[Figure: scatter plot “Purity vs. Time”, Purity (0 to 100) against Time (0 to 20).]
Example: Quadratic Model
Simple regression results: y* = -11.283 + 5.985 Time
            Coefficients   Standard Error   t Stat     P-value
Intercept   -11.28267      3.46805          -3.25332   0.00691
Time        5.98520        0.30966          19.32819   2.078E-10
Regression Statistics
R Square            0.96888
Adjusted R Square   0.96628
Standard Error      6.15997
F           Significance F
373.57904   2.0778E-10
t statistic, F statistic, and R2 are all high.
But … the residuals are not random:
[Figure: “Time Residual Plot”, residuals (-10 to 10) against Time (0 to 20).]
Example: Quadratic Model
Quadratic regression results: ŷ = 1.539 + 1.565 Time + 0.245 (Time)²
               Coefficients   Standard Error   t Stat    P-value
Intercept      1.53870        2.24465          0.68550   0.50722
Time           1.56496        0.60179          2.60052   0.02467
Time-squared   0.24516        0.03258          7.52406   1.165E-05
Regression Statistics
R Square            0.99494
Adjusted R Square   0.99402
Standard Error      2.59513
F           Significance F
1080.7330   2.368E-13
[Figure: “Time Residual Plot”, residuals (-5 to 10) against Time (0 to 20), and “Time-squared Residual Plot”, residuals (-5 to 10) against Time-squared (0 to 400).]
The quadratic term is significant and improves the model: R2 is higher and se is lower, residuals are now random
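Both fits of the purity example can be rerun from the 14 observations, as a sketch assuming NumPy. The helper `fit` and the variable names are illustrative; the standard error of the estimate uses n – 2 degrees of freedom for the linear model and n – 3 for the quadratic one.

```python
import numpy as np

# Purity and filter time from the example
time = np.array([1, 2, 3, 5, 7, 8, 10, 12, 13, 14, 15, 15, 16, 17], dtype=float)
purity = np.array([3, 7, 8, 15, 22, 33, 40, 54, 67, 70, 78, 85, 87, 99], dtype=float)

def fit(X, y):
    """Least-squares coefficients and the error sum of squares (SSE)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return b, float(resid @ resid)

ones = np.ones_like(time)
b_lin, sse_lin = fit(np.column_stack([ones, time]), purity)
b_quad, sse_quad = fit(np.column_stack([ones, time, time**2]), purity)

n = len(purity)
se_lin = np.sqrt(sse_lin / (n - 2))    # standard error of the estimate, linear model
se_quad = np.sqrt(sse_quad / (n - 3))  # standard error of the estimate, quadratic model
```

The entries of `b_quad`, together with `se_lin` and `se_quad`, can be compared against the coefficient tables and the Standard Error lines in the two outputs.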
The Log Transformation
Some highly nonlinear models may be transformed into a linear model
The Multiplicative Model:
Original multiplicative model: Y = β0 X1^β1 X2^β2 ε
Transformed multiplicative model: log(Y) = log(β0) + β1 log(X1) + β2 log(X2) + log(ε)
Interpretation of coefficients
For the multiplicative model:
log Yi = log β0 + β1 log X1i + log εi
When both dependent and independent variables are logged:
The coefficient of the independent variable X1 can be interpreted as follows: a 1 percent change in X1 leads to an estimated b1 percent change in the average value of Y
b1 is the elasticity of Y with respect to a change in X1
Note: logY = b0 + b1 logX implies b1 = ΔlogY/ΔlogX = %ΔY/%ΔX
ΔlogY = logY2 – logY1 = log(Y2/Y1) = log(1 + (Y2 – Y1)/Y1) ≈ (Y2 – Y1)/Y1
Dummy Variables
A dummy variable is a categorical independent variable with two levels: yes or no, on or off, male or female, recorded as 0 or 1
Regression intercepts are different if the variable is significant
Assumes equal slopes for other variables
If more than two levels, the number of dummy variables needed is (number of levels - 1)
Dummy variable example
Interested in: Does average income differ between males and females?
One approach: Compute the average income for females. Compute the average income for males. Conduct a two-sample test of equal means.
Alternative approach: regression.
Y = β0 + β1X1 + ε
Y = income; X1 = 1 if male; 0 if female.
X1 = 0 implies Y = β0 + ε
X1 = 1 implies Y = β0 + β1 + ε
Test H0: β1 = 0.
Dummy Variable Example
ŷ = b0 + b1x1 + b2x2
Let:
y = Pie Sales
x1 = Price
x2 = Holiday (x2 = 1 if a holiday occurred during the week; x2 = 0 if there was no holiday that week)
Dummy Variable Example
Holiday (x2 = 1): ŷ = b0 + b1x1 + b2(1) = (b0 + b2) + b1x1
No Holiday (x2 = 0): ŷ = b0 + b1x1 + b2(0) = b0 + b1x1
Different intercept, same slope
[Figure: y (sales) against x1 (Price), showing two parallel lines with intercepts b0 + b2 (Holiday) and b0 (No Holiday).]
If H0: β2 = 0 is rejected, then “Holiday” has a significant effect on pie sales
Interpreting the Dummy Variable Coefficient
Example: Sales = 300 – 30(Price) + 15(Holiday)
Sales: number of pies sold per week
Price: pie price in $
Holiday: 1 if a holiday occurred during the week; 0 if no holiday occurred
b2 = 15: on average, sales were 15 pies greater in weeks with a holiday than in weeks without a holiday, given the same price
Interaction Between Explanatory Variables
Hypothesizes interaction between pairs of x variables
Response to one x variable may vary at different levels of another x variable
Contains two-way cross product terms
ŷ = b0 + b1x1 + b2x2 + b3x3
  = b0 + b1x1 + b2x2 + b3(x1x2)
Effect of Interaction
Given:
Y = β0 + β2X2 + (β1 + β3X2)X1
  = β0 + β1X1 + β2X2 + β3X1X2
Without interaction term, effect of X1 on Y is measured by β1
With interaction term, effect of X1 on Y is measured by β1 + β3X2, which changes as X2 changes
Interaction Example
Suppose x2 is a dummy variable and the estimated regression equation is ŷ = 1 + 2x1 + 3x2 + 4x1x2
x2 = 1: ŷ = 1 + 2x1 + 3(1) + 4x1(1) = 4 + 6x1
x2 = 0: ŷ = 1 + 2x1 + 3(0) + 4x1(0) = 1 + 2x1
Slopes are different if the effect of x1 on y depends on x2 value
[Figure: the two fitted lines plotted as y (0 to 12) against x1 (0 to 1.5).]
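The interaction example can be checked by evaluating the estimated equation at both values of the dummy variable. The function names are illustrative.

```python
def y_hat(x1, x2):
    """Estimated equation with an interaction term: 1 + 2*x1 + 3*x2 + 4*x1*x2."""
    return 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2

def slope_in_x1(x2):
    """Change in y_hat for a one-unit increase in x1, at a given x2."""
    return y_hat(1, x2) - y_hat(0, x2)

intercepts = (y_hat(0, 0), y_hat(0, 1))      # intercept when x2 = 0 and when x2 = 1
slopes = (slope_in_x1(0), slope_in_x1(1))    # slope when x2 = 0 and when x2 = 1
```

The difference between the two slopes equals the coefficient on the cross-product term, 4.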
Significance of Interaction Term
The coefficient b3 is an estimate of the difference in the coefficient of x1 when x2 = 1 compared to when x2 = 0
The t statistic for b3 can be used to test the hypothesis
H0: β3 = 0
H1: β3 ≠ 0
If we reject the null hypothesis we conclude that there is a difference in the slope coefficient for the two subgroups
- END -