simple regression model (assumptions)homes.chass.utoronto.ca/~murdockj/eco220/l18_220.pdf · –raw...

9
Lecture 18, Page 1 of 9 Simple Regression Model (Assumptions) Reading: Sections 18.1, 18.2, “Logarithms in Regression Analysis with Asiaphoria,” 19.6 – 19.8 (Optional: “Normal probability plot” pp. 607-8) 1 Lecture 18 Remember Regression? 2 60 65 70 75 80 Height son, inches 60 65 70 75 Height father, inches son_hat = 33.887 + 0.514*father n = 1078, R2 = 0.251, s_e = 2.437 OLS intercept 33.887: No interpretation b/c father cannot be 0 inches tall OLS slope 0.514: For every extra 1 inch of father’s height, son is on average about ½ inch taller ݕ(y-hat): Predicted y, given x; E.g. son of a 72 inch tall father predicted to be 70.895 inches (= 33.887 + 0.514*72) (residual): ݕ ݕ ; E.g. if ݕ is 70.895 but ݕ is 68.531, then residual is -2.364 inches ݏ (s.d. of residuals) 2.437 inches: measures scatter about OLS line 0.251: 25.1% of variation in sons’ heights explained by variation in their fathers’ heights Descriptive & Inferential Statistics Chap. 6: Scatterplots, Association, and Correlation ݏ௫௬ ሺ௫ ሻሺ௬ సభ ݎ సభ Chap. 7: Introduction to Linear Regression ݎ ݕതെݔ̅ ݕ ݕ ݏ సభ / Simple Reg.: Chaps. 18 & 19 (Inference for Regression & Understanding Regression Residuals) Multiple Reg.: Chaps. 20 & 21 (Multiple Regression & Building Multiple Regression Models) 3 BUT, multiple regression is also a new way to describe data: descriptive statistics

Upload: dodien

Post on 26-Jun-2018

225 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Simple Regression Model (Assumptions)homes.chass.utoronto.ca/~murdockj/eco220/L18_220.pdf · –Raw data: ch18_MCSP_Frozen_Pizza.csv –Are these data observational? –Cross-sectional,

Lecture 18, Page 1 of 9

Simple Regression Model (Assumptions)

Reading: Sections 18.1, 18.2, “Logarithms in Regression Analysis with Asiaphoria,” 19.6 – 19.8 (Optional: “Normal probability plot” pp. 607-8)

1

Lecture 18

Remember Regression?

2

60

65

70

75

80

Hei

ght s

on, i

nche

s

60 65 70 75Height father, inches

son_hat = 33.887 + 0.514*fathern = 1078, R2 = 0.251, s_e = 2.437

OLS intercept 33.887: No interpretation b/c father cannot be 0 inches tall

OLS slope 0.514: For every extra 1 inch of father’s height, son is on average about ½ inch taller

(y-hat): Predicted y, given x; E.g. son of a 72 inch tall father predicted to be 70.895 inches (= 33.887 + 0.514*72)

(residual): ; E.g. if is 70.895 but is 68.531,

then residual is -2.364 inches

(s.d. of residuals) 2.437 inches: measures scatter about OLS line

0.251: 25.1% of variation in sons’ heights explained by variation in their fathers’ heights

Descriptive & Inferential Statistics• Chap. 6: Scatterplots,

Association, and Correlation

– ∑–

∑• Chap. 7: Introduction to

Linear Regression

– ̅–

– ∑– /

• Simple Reg.: Chaps. 18 & 19 (Inference for Regression & Understanding Regression Residuals)

• Multiple Reg.: Chaps. 20 & 21 (Multiple Regression & Building Multiple Regression Models)

3

BUT, multiple regression is also a new way to describe data: descriptive statistics

Page 2: Simple Regression Model (Assumptions)homes.chass.utoronto.ca/~murdockj/eco220/L18_220.pdf · –Raw data: ch18_MCSP_Frozen_Pizza.csv –Are these data observational? –Cross-sectional,

Lecture 18, Page 2 of 9

Questions and Data: Still Important

• Which kind of question?– Research question: What

is causal effect of a change in X (e.g. match) on Y (e.g. amount given)

– Descriptive question: what are patterns in data (e.g. how does household spending on food vary with income?)

• Which kind of data?– Observational or

experimental • Correlation ≠ causation is

a cliché• Instead, apply

understanding of data and specific context to fully interpret quantitative results

– Cross-sectional, time series, or panel

4

5

Which kind of data are these?

Source: “The Economic Impact of Universities: Evidence from Across the Globe” a 2016 NBER Working Paper http://www.nber.org/papers/w22501

On this slide, what is review of Fall-term material and what is new material?

“Unsurprisingly, we find that higher university density is associated with higher GDP per capita levels.” p. 9 Why associatedinstead of correlated? Also, does quote imply causality?

6

“It is interesting that countries with more universities in 1960 generally had higher growth rates over 1960-2000 .” p. 9But is it a strong association?

Page 3: Simple Regression Model (Assumptions)homes.chass.utoronto.ca/~murdockj/eco220/L18_220.pdf · –Raw data: ch18_MCSP_Frozen_Pizza.csv –Are these data observational? –Cross-sectional,

Lecture 18, Page 3 of 9

X-variable is defined as Log(1 + universities per million people) • Logs can straighten curved scatter plot

– Plus one addresses countries with 0 universities– Example 1: x-value of 1 is a country with 1.72

universities per million: ln(1 + 1.72) 1• E.g. 10 universities w/ pop. 5.82 million: 1.7210/5.82

– Example 2: x-value of 3 is a country with 19.09universities per million: ln(1 + 19.09) 3

• E.g. 25 universities w/ pop. 1.31 million: 19.0925/1.31– University density is over 11 times bigger in Example

2, but x-value only 3 times as big (diminishing returns)

7

8

How to interpret b=2.64?

On average, countries with a university density (universities per million people) that is 10% higher have average years of schooling that is 0.264 years higher.

Frozen Pizza (p. 627)

• How does the volume of sales depend on the price of frozen pizza?– What is the economic name of this relationship?

• Weekly data on price and quantity for each of four cities (1994 – 1996); 156 weeks– Raw data: ch18_MCSP_Frozen_Pizza.csv– Are these data observational? – Cross-sectional, time series, or panel?

9

Page 4: Simple Regression Model (Assumptions)homes.chass.utoronto.ca/~murdockj/eco220/L18_220.pdf · –Raw data: ch18_MCSP_Frozen_Pizza.csv –Are these data observational? –Cross-sectional,

Lecture 18, Page 4 of 9

10

Demand Shifters

Quantity DemandedPrice

Observational Data

What are some demand shifters for frozen pizza?

Why do these demand shifters affect the price the firm chooses to set?

2468

10

Qua

ntity

, 10,

000s

2 2.2 2.4 2.6 2.8 3Price, $1s

Denver, 1994-96n = 156 weeks

Frozen Pizza: OLS

• 0.7697• 2 0.5924• 18.12 5.28• Interpret the line?

– Is the OLS line an estimate of the demand equation?

11

Simple Linear Regression:One x-variable

• Model: – : dependent var., regressand, y-var., LHS-var.– : independent var., regressor, explanatory var., x-

var., RHS-var. (i.e. right-hand side variable)– : observation index (often or cross-sectional

data; time series data; or panel data)– : intercept (constant) parameter – : slope parameter– : error term, residual, disturbance

12

Page 5: Simple Regression Model (Assumptions)homes.chass.utoronto.ca/~murdockj/eco220/L18_220.pdf · –Raw data: ch18_MCSP_Frozen_Pizza.csv –Are these data observational? –Cross-sectional,

Lecture 18, Page 5 of 9

– Line is expected value: E– Error explains deviations

from expectations

Error term in

• includes all other factors that affect aside from – Impossible to collect

data on everything: some variables unobserved to the researcher

– It reflects reality: model cannot control for everything

13

|

In the above graph is positive or negative?

Assumptions Tame Elusive Epsilon

• We cannot observe but we can observe – Notice how many of the six assumptions are

about the unobservable • Some assumptions can be checked by analyzing (the

statistic tied to the parameter , but some cannot • In general, models make assumptions about unknowns

– For example, a model could assume the outcome of the role of a die follows a discrete Uniform distribution: i.e. it’s fair with a 1/6 probability of each outcome {1, 2, 3, 4, 5, 6}

14

15

• Six assumptions of the linear regression model– Your book gives only four assumptions:

• I suspect it leaves one out because it is fairly obvious• I am sure the other one is left out because it is only

required if you wish to make a causal interpretation• To minimize confusion, number the extra two as 5 & 6

– Econometrics addresses violations in assumptions• ECO374H Applied Econometrics (Commerce)• ECO375H & ECO475H Applied Econometrics I & II

Six Assumptions

Page 6: Simple Regression Model (Assumptions)homes.chass.utoronto.ca/~murdockj/eco220/L18_220.pdf · –Raw data: ch18_MCSP_Frozen_Pizza.csv –Are these data observational? –Cross-sectional,

Lecture 18, Page 6 of 9

16

Assumption #1

• Regression equation is linear in the error and parameters; the variables (in boxes) are linearly related to each other

– Not assuming that what is in boxes is linear (so long as no nonlinear functions of parameters or nonlinear functions of the error)

• Example of a linear regression: • Example of a linear regression: lnFrozen Pizza (Chapter 18)

2

4

6

8

10

Wee

kly

Q, 1

0,00

0s

2 2.2 2.4 2.6 2.8 3Weekly price, $1s

Frozen Pizza, Denver, 1994-96Q-hat = 18.122 + -5.280*P

n = 156, R2 = 0.592, s.e.(b) = 0.353

-2

-1

0

1

2

3

e (re

sidu

als)

2 3 4 5 6 7Q_hat

Which violations can we see?

17

Natural Log Transformations

18

-.4

-.2

0

.2

.4

e (re

sidu

als)

2 3 4 5 6 7ln(Q)_hat

1

1.5

2

2.5

ln(W

eekl

y Q

, 10,

000s

)

.7 .8 .9 1 1.1ln(Weekly price, $1s)

Frozen Pizza, Denver, 1994-96ln(Q)-hat = 4.095 + -2.773*ln(P)

n = 156, R2 = 0.631, s.e.(b) = 0.171

Page 7: Simple Regression Model (Assumptions)homes.chass.utoronto.ca/~murdockj/eco220/L18_220.pdf · –Raw data: ch18_MCSP_Frozen_Pizza.csv –Are these data observational? –Cross-sectional,

Lecture 18, Page 7 of 9

Assumption #2

• No autocorrelation / no serial correlation: , 0 if – Common problem in

time-series data• E.g. higher than expected

inflation today, likely high tomorrow

– Errors assumed not systematically related across observations

19

-10-

50

510

resi

dual

(e)

0 10 20 30t

Assumption #2 HoldsNo Autocorrelation

-3-2

-10

12

resi

dual

(e)

0 10 20 30t

Assumption #2 ViolatedPositive Autocorrelation

Assumption #3

• Homoscedasticity: , 1, … ,– “Equal variance

assumption”– Error is just as “noisy”

for all values of x– Violation is called

heteroscedasticity– Common problem in

cross-sectional data

20

01

23

Y2

0 2 4 6 8 10X2

Heteroscedasticity

-10

12

3Y1

0 2 4 6 8 10X1

Homoscedasticity

Fix Assumption #1 issues before checking Assumption #3

21

-.4

-.2

0

.2

.4

e (re

sidu

als)

2 3 4 5 6 7ln(Q)_hat

1

1.5

2

2.5

ln(W

eekl

y Q

, 10,

000s

)

.7 .8 .9 1 1.1ln(Weekly price, $1s)

Frozen Pizza, Denver, 1994-96ln(Q)-hat = 4.095 + -2.773*ln(P)

n = 156, R2 = 0.631, s.e.(b) = 0.171

Heteroscedasticity – unequal variance of the residuals – is often a byproduct of a violation of the linearity assumption

Remember that Chapter 18 advises you to check the assumptions in order: start with the linearity assumption

Is Denver pizza regression an example?

Page 8: Simple Regression Model (Assumptions)homes.chass.utoronto.ca/~murdockj/eco220/L18_220.pdf · –Raw data: ch18_MCSP_Frozen_Pizza.csv –Are these data observational? –Cross-sectional,

Lecture 18, Page 8 of 9

22

Assumptions #4 & #5

• Galton’s data (Lec. 5)– Assumptions 1-3 hold?

• Normality: is Normal– is unobserved so

check • Error has mean zero: 0, 1, … ,

– Constant term (i.e. or ) picks up any constant

effects, not the error

6065707580

Son,

inch

es

60 65 70 75Father, inches

Galton's Heights, n = 1078

0.05.1

.15.2

Den

sity

-10 -5 0 5 10residuals (e)

n = 1078

Graphical Summary

23

||| ,,,Assumptions #3, #4, and #5 combined: ~ 0,

Would the elements in the population (not shown) lie on the line?

Is a reflection of sampling error?

24

2017 ON Public Sector Disclosure of 2016 salaries for University of Waterloo employees

Sex n Mean S.d.F 416 $139.74K $33.74KM 941 $155.36K $36.96K

OLS Results:Salary-hat = 139.74 + 15.62*MaleR2 = 0.0385, n = 1,357, = 36.006

Do Assumptions #1 - #5 hold?100

200

300

400

Sala

ry a

nd S

alar

y-ha

t

0 .2 .4 .6 .8 1Male

Waterloo, 2016 Salaries

100

200

300

400

Sala

ry (1

,000

0's

CAN

$)

Male=0 Male=1

Waterloo, 2016 Salaries

Page 9: Simple Regression Model (Assumptions)homes.chass.utoronto.ca/~murdockj/eco220/L18_220.pdf · –Raw data: ch18_MCSP_Frozen_Pizza.csv –Are these data observational? –Cross-sectional,

Lecture 18, Page 9 of 9

Assumption #6

• x uncorrelated w/ error: , 0– Exogeneity: x variable(s) unrelated with error

• Dosage is exogenous: • Experimental data can est. causal effect:

– Endogeneity: x variable(s) related with error• With observational data, lurking/unobserved/omitted/

confounding variables mean x and error are related• Price of pizza is endogenous: • Endogeneity bias means:

25

26

“Short-Hand” Assumptions

1) Linear relationship between variables (possibly non-linearly transformed)

2) No correlation amongst errors (no autocorrelation for time-series data)

3) Homoscedasticity (single variance) of errors4) Normally distributed errors5) Constant included (error has mean 0)6) No relationship between x and error