
Source: econweb.umd.edu/~hong/files/Lecture3.pdf

Lecture 3: Simple Linear Regression

Xing Hong

Department of EconomicsUniversity of Maryland, College Park

Spring 2016


Overview

1. Introduction to the simple linear regression model
   - Interpretation
   - Causal relationship
2. Deriving the OLS Estimators
3. Goodness-of-Fit
4. Extensions on Simple Linear Regression
   - Unit of Measurement
   - Functional Forms: Nonlinear Relationships
5. Statistical Properties of the OLS estimators
   - Gauss-Markov Assumptions
   - Unbiasedness
   - Efficiency


Definition of the Simple Linear Regression

“Explains variable y in terms of variable x”


Interpretation of the Simple Linear Regression

“Studies how y varies with changes in x.”

The simple linear regression model is rarely applicable in practice, but its discussion is useful for pedagogical reasons.


Ceteris Paribus: everything else held constant

Definition of the CAUSAL effect of x on y: how does variable y change if variable x is changed but all other relevant factors are held constant?

- Most economic questions are ceteris paribus questions.
- It is important to define which causal effect one is interested in.
- It is useful to describe how an experiment would have to be designed to infer the causal effect in question.
- However, it is usually not possible to literally hold everything else constant. The key question in most empirical studies is:

Have enough other factors been held constant to make a case for causality?


Causality vs. Correlation

Correlation: x moves with y

Causality: x moves y

E.g., an observational relationship: college graduates earn about 50% more than those without a college degree.

Interpretations:
- Causal: college is productive.
- Non-causal: high-ability people go to college, and high-ability people earn more.

It is important to find out whether the relationship is causal, because if it is not, then higher investment in education (e.g., government subsidies) will not yield higher productivity or income.


An Example: Returns to Schooling

Suppose we build a simple regression model:

Wage = β0 + β1Edu + u

- Wage: annual wage, in $
- Edu: years of formal schooling
- u: all other unobserved factors that affect wage, including age, experience, gender, race, measurement error, etc.


An Example: Returns to Schooling

Wage = β0 + β1Edu + u

- Assume that this simple model is a reasonably good description of the actual wage determination process (*).
- Estimate β0 and β1 using data; obtain OLS estimates β̂0 = 21000 and β̂1 = 9000.
- Interpretation: if years of schooling increase by 1, then annual wage, on average, increases by $9000, keeping everything else constant (ceteris paribus).
- Issue: the validity/accuracy of the estimates, in particular 9000, relies on assumption (*), which is unlikely to be true in this specific problem. (More on this later.)


When is there a causal effect?

Key Assumption for Causality: the Zero Conditional Mean Assumption

E (u|x) = E (u) = 0


Key Assumption for Causality: E (u|x) = E (u) = 0

1. E(u|x) = E(u): the average value of u does not depend on the value of x, i.e., the unobserved factors that affect y do not change when x changes.

- In this example, wage = β0 + β1·Edu + u, and u contains the innate ability of a person, among other things.
- To impose the Zero Conditional Mean assumption E(ability|Edu) = E(ability) is to assume that the average level of ability is the same for people with different levels of education.
- For example, it implies E(ability|Edu = 6) = E(ability|Edu = 18): the average ability of the group of all people with six years of education is the same as the average ability of the group of all people with eighteen years of education. (Realistic?)
- A very strong assumption!


Key Assumption for Causality: E (u|x) = E (u) = 0

2. In addition, we assume E(u) = 0, a harmless assumption as long as β0 is present in y = β0 + β1x + u.

- If E(u) = ε ≠ 0, we can define a new error term u′ = u − ε. Then the original equation can be rewritten as y = (β0 + ε) + β1x + u′, with E(u′) = 0. (Exercise: check this.)
- That is, the constant term β0 will “absorb” any non-zero mean of the error term.


Understanding the Model

Once we have estimated the parameters, there are several ways to look at the linear regression model:

1. With the population parameters:

   yᵢ = E(yᵢ|x) + uᵢ = β0 + β1xᵢ + uᵢ

2. With the estimated parameters:

   yᵢ = ŷᵢ + ûᵢ = β̂0 + β̂1xᵢ + ûᵢ

where:
- E(yᵢ|x) = β0 + β1xᵢ: “systematic part of y”
- ŷᵢ = β̂0 + β̂1xᵢ: predicted/fitted value of yᵢ
- ûᵢ = yᵢ − ŷᵢ: residual for observation i


Fitted values and Residuals


Interpretation of the Estimated β1

∆ŷ = ŷ₂ − ŷ₁
   = (β̂0 + β̂1x₂) − (β̂0 + β̂1x₁)
   = β̂1(x₂ − x₁)
   = β̂1∆x

- When the zero conditional mean assumption holds, the estimated slope gives the predicted impact of a unit change in x on y.
- Interpretation: if x increases by one unit, we predict that, on average, y increases or decreases (depending on the sign of β̂1) by β̂1 units.
- When β̂1 = 0, we predict that there is no linear relationship between y and x.


Errors vs. Residuals

Errors uᵢ:
- all other factors that affect y
- never observed
- the assumptions of the model are built around u

Residuals ûᵢ:
- computed from data: ûᵢ = yᵢ − ŷᵢ
- depend on how we estimate the parameters and obtain ŷᵢ
- will have several important algebraic properties




Ordinary Least Squares (OLS) Estimators

Recall that an estimator is a general rule/approach to select estimates:
- Method of Moments estimators
- Maximum Likelihood estimators
- Least Squares estimators

The Ordinary Least Squares (OLS) estimators are obtained by minimizing the sum of squared residuals (SSR):

min SSR = Σᵢ₌₁ⁿ ûᵢ² = Σᵢ₌₁ⁿ (yᵢ − β̂0 − β̂1xᵢ)²


Deriving OLS Estimates: algebra

Minimizing the sum of squared residuals:

min SSR = Σᵢ₌₁ⁿ ûᵢ² = Σᵢ₌₁ⁿ (yᵢ − β̂0 − β̂1xᵢ)²

First Order Conditions:

∂(SSR)/∂β̂0 = −2 Σᵢ₌₁ⁿ (yᵢ − β̂0 − β̂1xᵢ) = 0  ⇒  Σᵢ₌₁ⁿ ûᵢ = 0    (1)

∂(SSR)/∂β̂1 = −2 Σᵢ₌₁ⁿ (yᵢ − β̂0 − β̂1xᵢ)xᵢ = 0  ⇒  Σᵢ₌₁ⁿ ûᵢxᵢ = 0    (2)


Re-arrange Eq. (1):

Σᵢ₌₁ⁿ yᵢ − nβ̂0 − β̂1 Σᵢ₌₁ⁿ xᵢ = 0

(1/n) Σᵢ₌₁ⁿ yᵢ − β̂0 − β̂1 (1/n) Σᵢ₌₁ⁿ xᵢ = 0

Write

β̂0 = ȳ − β̂1x̄    (3)

where ȳ = (1/n) Σᵢ₌₁ⁿ yᵢ and x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ.


Plug Eq. (3) into Eq. (2):

Σᵢ₌₁ⁿ xᵢ[yᵢ − (ȳ − β̂1x̄) − β̂1xᵢ] = 0

Σᵢ₌₁ⁿ (xᵢ(yᵢ − ȳ) − β̂1xᵢ(xᵢ − x̄)) = 0

Σᵢ₌₁ⁿ xᵢ(yᵢ − ȳ) − β̂1 Σᵢ₌₁ⁿ xᵢ(xᵢ − x̄) = 0

β̂1 = Σᵢ₌₁ⁿ xᵢ(yᵢ − ȳ) / Σᵢ₌₁ⁿ xᵢ(xᵢ − x̄)


Recall the properties of summation:

Σᵢ₌₁ⁿ xᵢ(yᵢ − ȳ) = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ)

Σᵢ₌₁ⁿ xᵢ(xᵢ − x̄) = Σᵢ₌₁ⁿ (xᵢ − x̄)²

Now we have obtained the OLS estimators:

β̂1 = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)²

β̂0 = ȳ − β̂1x̄


An Example

Suppose we have data on tire pressure and MPG:

ID             1     2     3     4     5
TirePres (xᵢ)  20    25    30    35    40
MPG (yᵢ)       21.1  23.3  24.2  25.4  30.0

We want to estimate the population regression model

MPGᵢ = β0 + β1·TirePresᵢ + uᵢ

or

yᵢ = β0 + β1xᵢ + uᵢ


The OLS estimate of β1 is

β̂1 = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)²

Average tire pressure, x̄, is 30 and average MPG, ȳ, is 24.8. Applying the OLS formula gives

β̂1 = 99.5 / 250 = 0.398
β̂0 = 24.8 − 0.398 × 30 = 12.86

In practice, this is done by econometrics packages, e.g. Stata or SAS.
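The hand computation above can be reproduced with a short script (a sketch assuming numpy is available; the data are the five observations from the slide):

```python
# OLS estimates for the tire-pressure example, using the closed-form formulas
import numpy as np

x = np.array([20.0, 25.0, 30.0, 35.0, 40.0])   # TirePres
y = np.array([21.1, 23.3, 24.2, 25.4, 30.0])   # MPG

xbar, ybar = x.mean(), y.mean()                # 30.0 and 24.8
b1 = ((x - xbar) * (y - ybar)).sum() / ((x - xbar) ** 2).sum()   # 99.5 / 250
b0 = ybar - b1 * xbar

print(b1, b0)                                  # approximately 0.398 and 12.86
```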


Algebraic Properties of OLS Statistics

Some properties follow immediately from the algebra. In other words, β̂0 and β̂1 are chosen such that:

1. The sum (and the sample average) of the OLS residuals is zero:

   Σᵢ₌₁ⁿ ûᵢ = 0

2. The sample covariance between the regressor and the OLS residuals is zero (the residuals have mean zero by property 1):

   (1/(n − 1)) Σᵢ₌₁ⁿ (xᵢ − x̄)ûᵢ = 0

3. The point (x̄, ȳ) is always on the OLS regression line.
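These three properties can be verified numerically (a sketch assuming numpy; the tire-pressure data from the earlier example are reused):

```python
# Checking the algebraic properties of OLS residuals on a small data set
import numpy as np

x = np.array([20.0, 25.0, 30.0, 35.0, 40.0])
y = np.array([21.1, 23.3, 24.2, 25.4, 30.0])

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
u_hat = y - (b0 + b1 * x)                  # OLS residuals

print(u_hat.sum())                         # property 1: zero (up to rounding)
print(((x - x.mean()) * u_hat).sum())      # property 2: zero sample covariance
print(b0 + b1 * x.mean() - y.mean())       # property 3: (xbar, ybar) is on the line
```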




Goodness-of-Fit: Some Definitions

Total sum of squares (SST): measures the total sample variation in the yᵢ, or how spread out the yᵢ are in the sample:

SST = Σᵢ₌₁ⁿ (yᵢ − ȳ)²

Recall: the sample variance of y is S²ᵧ = (1/(n − 1)) Σᵢ₌₁ⁿ (yᵢ − ȳ)² = SST/(n − 1).

Explained sum of squares (SSE): measures the sample variation in the ŷᵢ:

SSE = Σᵢ₌₁ⁿ (ŷᵢ − ȳ)²

Residual sum of squares (SSR): measures the sample variation in the ûᵢ:

SSR = Σᵢ₌₁ⁿ ûᵢ² = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²


SST = SSE + SSR

The total variation in y can be expressed as the sum of the explained variation and the unexplained variation: SST = SSE + SSR.

SST = Σᵢ₌₁ⁿ (yᵢ − ȳ)²
    = Σᵢ₌₁ⁿ [(yᵢ − ŷᵢ) + (ŷᵢ − ȳ)]²
    = Σᵢ₌₁ⁿ [ûᵢ + (ŷᵢ − ȳ)]²
    = Σᵢ₌₁ⁿ ûᵢ² + 2 Σᵢ₌₁ⁿ ûᵢ(ŷᵢ − ȳ) + Σᵢ₌₁ⁿ (ŷᵢ − ȳ)²
    = SSR + 2 Σᵢ₌₁ⁿ ûᵢ(ŷᵢ − ȳ) + SSE


But Σᵢ₌₁ⁿ ûᵢ(ŷᵢ − ȳ) = 0:

Σᵢ₌₁ⁿ ûᵢ(ŷᵢ − ȳ) = Σᵢ₌₁ⁿ ûᵢŷᵢ − ȳ Σᵢ₌₁ⁿ ûᵢ
                = Σᵢ₌₁ⁿ ûᵢ(β̂0 + β̂1xᵢ) + 0
                = β̂0 Σᵢ₌₁ⁿ ûᵢ + β̂1 Σᵢ₌₁ⁿ ûᵢxᵢ
                = 0

So: SST = SSE + SSR


Goodness-of-Fit: R-squared

R-squared measures how well the OLS regression line fits the data:

R² = Σᵢ₌₁ⁿ (ŷᵢ − ȳ)² / Σᵢ₌₁ⁿ (yᵢ − ȳ)² = SSE/SST = 1 − SSR/SST

Some comments:
- R² measures the proportion of the variation in y explained by the variation in x.
- R² is always between 0 and 1.
- A higher R² means that a higher proportion of the variation in yᵢ is explained by the variation in xᵢ.
- Low R² values are not uncommon, especially for cross-sectional data.
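The decomposition and both R² formulas can be checked on the tire-pressure example (a sketch assuming numpy):

```python
# SST = SSE + SSR and R-squared for the tire-pressure data
import numpy as np

x = np.array([20.0, 25.0, 30.0, 35.0, 40.0])
y = np.array([21.1, 23.3, 24.2, 25.4, 30.0])

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x                          # fitted values

SST = ((y - y.mean()) ** 2).sum()            # total variation
SSE = ((y_hat - y.mean()) ** 2).sum()        # explained variation
SSR = ((y - y_hat) ** 2).sum()               # unexplained variation
R2 = SSE / SST

print(R2, 1.0 - SSR / SST)                   # both around 0.906
```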




Extensions

Two important issues in applied economics are:
1. How does changing the units of measurement of the dependent or independent variables affect OLS estimates?
2. How do we incorporate non-linear functional forms used in economics into regression analysis?


Issue 1: Unit of Measurement

Consider the following estimated simple linear regression model:

Salary = 963,191 + 18,501·ROE

where
- Salary is the predicted CEO salary, measured in dollars;
- ROE is the firm's return-on-equity ratio (annual earnings divided by value of equity).

- The estimated equation says that if ROE increases by 0.01, CEO salary increases by $185.01 on average.
- But what if we measure Salary in thousands of dollars?
- What if we change the unit of measurement for ROE from proportions to percentages?


Issue 1: Unit of Measurement

Changing the unit of measurement is equivalent to multiplying the variable by a constant c.

If we change the unit of y:

c·y = (cβ̂0) + (cβ̂1)x = β̂0′ + β̂1′x

If we change the unit of x:

y = β̂0 + (β̂1/c)(cx) = β̂0 + β̂1′(cx)

Changing the unit of measurement does not change the interpretation of the regression results!
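A quick numerical check of the two rescaling rules (a sketch assuming numpy; the data are synthetic, with arbitrary illustrative parameters):

```python
# Rescaling y multiplies both estimates by c; rescaling x divides the slope by c
import numpy as np

def ols(x, y):
    """OLS slope and intercept from the closed-form formulas."""
    b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    return y.mean() - b1 * x.mean(), b1

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 100)                   # synthetic regressor
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, 100)     # synthetic outcome

b0, b1 = ols(x, y)
c = 1000.0                                        # a change of units, e.g. to smaller units
b0_cy, b1_cy = ols(x, c * y)   # rescale y: both estimates are multiplied by c
b0_cx, b1_cx = ols(c * x, y)   # rescale x: slope divided by c, intercept unchanged
```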


Issue 2: Incorporating Nonlinear Relationships

The meaning of “linear” regression: linear in the parameters.
- Linear regression: y = β0 + β1x₁ + β2x₂² + β3·log(x₃) + u
- Non-linear regression: y = 1/√(β0 + β1x) + u

This insight allows us to use functional forms to incorporate nonlinear relationships between y and x in the simple linear regression model.


The limitation of a linear relationship

Previously, we have only used level-level models. For example,

Wageᵢ = β0 + β1·Eduᵢ + uᵢ

says that one additional year spent in school increases wage by β1 dollars on average.
- The relationship between income and years of schooling is linear.
- Caveat: the model is only suitable for a constant return to education; β1 dollars is the increase for either the first year of education or the twentieth year of education.
- What if there is an increasing return?


Example: A constant return to education


Example: An increasing return to education


A log-level model

How do we capture an increasing return to education? Use percentage changes.
- Suppose we observe that one more year spent in school increases wage by 10% on average.
- In that case, the relationship can be represented by the log-level model:

log(Wageᵢ) = β0 + β1·Eduᵢ + uᵢ

- We can generate a new variable, log(Wageᵢ), and then use it as the dependent variable in the linear regression.
- Question: why can a change in log value be interpreted as a percentage change?


Interpretation of changes in log value

A key property implied by the Taylor series:

f(x + ∆x) ≈ f(x) + f′(x)∆x

where f(·) is a differentiable function and ∆x is small. Hence when f(·) = log(·),

log(x + ∆x) − log(x) ≈ ∆x/x

or

∆log(x) ≈ ∆x/x

where (∆x/x)·100% is the percentage change in x. Therefore, we can interpret changes in log values as percentage changes. This is an approximation.
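The quality of the approximation can be seen numerically; it is tight for small proportional changes and degrades for larger ones (stdlib-only sketch):

```python
# Comparing the change in log with the percentage-change approximation dx/x
import math

x = 100.0
for dx in (1.0, 5.0, 20.0):                   # 1%, 5%, 20% increases
    exact = math.log(x + dx) - math.log(x)    # the change in log
    approx = dx / x                           # the approximation
    print(f"{dx / x:.0%}: delta-log = {exact:.4f}, dx/x = {approx:.4f}")
```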


Interpretation of a log-level model

For a level-level model,

y = β0 + β1x + u

the interpretation is: ∆y = β1∆x.

For a log-level model,

log(y) = β0 + β1x + u

the interpretation is: ∆log(y) = β1∆x. Applying the approximation of changes in log, we interpret:

∆y/y ≈ β1∆x


Log-level model

For a log-level model,

log(Wageᵢ) = β0 + β1·Eduᵢ + uᵢ

a one-year increase in years of schooling increases wage by (100·β1)% on average:

∆Wage/Wage ≈ β1·∆Edu

If we estimate β1 to be 0.09, then one more year of schooling (∆Edu = 1) increases wage by about 9% on average:

∆Wage/Wage ≈ 0.09 × 1 = 9%
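Since log(Wage) rises by exactly β1 when Edu rises by one year, the exact implied percentage change is 100·(exp(β1) − 1); the 9% figure is the log approximation of it (stdlib-only sketch):

```python
# Approximate vs. exact percentage change implied by a log-level coefficient
import math

b1 = 0.09                                   # estimated coefficient from the slide
approx_pct = 100.0 * b1                     # log approximation: about 9%
exact_pct = 100.0 * (math.exp(b1) - 1.0)    # exact implied change: about 9.42%
print(approx_pct, exact_pct)
```

The gap widens as the coefficient grows, which is why the approximation is usually reserved for small coefficients.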


Functional Forms: continued

If we believe that the relationship between yᵢ and xᵢ is such that when xᵢ increases by 1 percent, all else equal, yᵢ increases by β1 percent, then we should use the log-log model:

log(yᵢ) = β0 + β1·log(xᵢ) + uᵢ

If we believe that when xᵢ increases by 1 percent, all else equal, yᵢ increases by some absolute amount (β1/100 units), then we should use a level-log model:

yᵢ = β0 + β1·log(xᵢ) + uᵢ


Four types of models

- level-level: y = β0 + β1x + u
- log-level: log(y) = β0 + β1x + u
- level-log: y = β0 + β1·log(x) + u
- log-log: log(y) = β0 + β1·log(x) + u. In this case, β1 is the elasticity.


Summary of interpretations

Model        Dependent var.  Independent var.  Interpretation of β1
level-level  y               x                 ∆y = β1·∆x
log-level    log(y)          x                 %∆y ≈ (100·β1)·∆x
level-log    y               log(x)            ∆y ≈ (β1/100)·%∆x
log-log      log(y)          log(x)            %∆y ≈ β1·%∆x




Statistical Properties of OLS

Question: are the OLS estimators, β̂0 and β̂1, good estimators of the population parameters, β0 and β1?

Recall some good properties of estimators:
- Unbiasedness: E(β̂0) = β0 and E(β̂1) = β1
- Efficiency: Var(β̂0) and Var(β̂1) should be relatively small

- The statistical properties concern the distributions of β̂0 and β̂1 over different random samples from the population, which depend not only on how the estimators were constructed, but also on the nature of the population/data.
- In other words, we need to make assumptions on the population to establish the statistical properties of the OLS estimators.
- In linear regression models, the set of assumptions we will make are called the Gauss-Markov Assumptions.


Gauss-Markov Assumptions for Simple Regression: SLR.1-3

Assumption SLR.1 (Linear in Parameters): in the population model, the dependent variable y is related to the independent variable x and the error (or disturbance) u as

y = β0 + β1x + u

where β0 and β1 are the population intercept and slope parameters, respectively.

Assumption SLR.2 (Random Sampling): we have a random sample of size n, {(xᵢ, yᵢ): i = 1, 2, …, n}, following the population model in Assumption SLR.1.

Assumption SLR.3 (Sample Variation in the Explanatory Variable): the sample outcomes on x, namely {xᵢ, i = 1, 2, …, n}, are not all the same value.


Gauss-Markov Assumptions for Simple Regression: SLR.4-5

Assumption SLR.4 (Zero Conditional Mean): the error u has an expected value of zero given any value of the explanatory variable:

E(u|x) = E(u) = 0

We have mentioned before that this is the key assumption to ensure that the estimated β̂1 measures the ceteris paribus effect of x on y. In this section, we will show algebraically that this is the key assumption for establishing the unbiasedness of β̂1 and β̂0.

Assumption SLR.5 (Homoskedasticity): the error u has the same variance given any value of the explanatory variable:

Var(u|x) = E(u²|x) = σ²

In this section, we use this assumption to obtain the usual OLS variance formula. We will postpone the discussion of efficiency until later.


Heteroskedasticity vs. Homoskedasticity


Unbiasedness of β̂1: E(β̂1) = β1

Under Assumptions SLR.1-SLR.5, we want to show that the OLS estimator β̂1 is unbiased, i.e., that E(β̂1) = β1. First, we can simplify β̂1 before taking expected values:

β̂1 = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)²

   = Σᵢ₌₁ⁿ (xᵢ − x̄)yᵢ / Σᵢ₌₁ⁿ (xᵢ − x̄)²

   = Σᵢ₌₁ⁿ (xᵢ − x̄)(β0 + β1xᵢ + uᵢ) / Σᵢ₌₁ⁿ (xᵢ − x̄)²

   = β0 · [Σᵢ₌₁ⁿ (xᵢ − x̄) / Σᵢ₌₁ⁿ (xᵢ − x̄)²]    (this ratio is 0)
     + β1 · [Σᵢ₌₁ⁿ (xᵢ − x̄)xᵢ / Σᵢ₌₁ⁿ (xᵢ − x̄)²]    (this ratio is 1, since Σᵢ₌₁ⁿ (xᵢ − x̄)xᵢ = Σᵢ₌₁ⁿ (xᵢ − x̄)²)
     + Σᵢ₌₁ⁿ (xᵢ − x̄)uᵢ / Σᵢ₌₁ⁿ (xᵢ − x̄)²


Unbiasedness of β̂1

β̂1 can be rewritten as:

β̂1 = β1 + Σᵢ₌₁ⁿ (xᵢ − x̄)uᵢ / Σᵢ₌₁ⁿ (xᵢ − x̄)²    (using SLR.1-3)

Second, we take the expectation of β̂1. In particular, we take the conditional expectation:

E(β̂1|x) = β1 + E[ Σᵢ₌₁ⁿ (xᵢ − x̄)uᵢ / Σᵢ₌₁ⁿ (xᵢ − x̄)² | x ]

        = β1 + Σᵢ₌₁ⁿ (xᵢ − x̄)E(uᵢ|x) / Σᵢ₌₁ⁿ (xᵢ − x̄)²


Unbiasedness of β̂1

Then we have:

E(β̂1|x) = β1    if SLR.4 holds: E(u|x) = 0

Finally, apply the law of iterated expectations:

E(β̂1) = E(E(β̂1|x)) = E(β1) = β1

⇒ The OLS estimator β̂1 is an unbiased estimator of β1.

Questions:
- Where did we use Assumptions 1-5?
- What will happen if each assumption fails?
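Unbiasedness can be illustrated with a small Monte Carlo experiment (a sketch assuming numpy; the true parameter values are arbitrary): holding x fixed and redrawing u many times, the average of β̂1 across samples is close to the true β1.

```python
# Monte Carlo check of E(beta1-hat) = beta1 under SLR.1-5
import numpy as np

rng = np.random.default_rng(42)
beta0, beta1, sigma = 1.0, 2.0, 1.0          # true population parameters (illustrative)
x = np.linspace(0.0, 10.0, 50)               # fixed x values with sample variation (SLR.3)

draws = []
for _ in range(5000):
    u = rng.normal(0.0, sigma, x.size)       # E(u|x) = 0, so SLR.4 holds by construction
    y = beta0 + beta1 * x + u                # SLR.1: linear in parameters
    b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    draws.append(b1)

print(np.mean(draws))                        # close to the true beta1 = 2.0
```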


Unbiasedness of β̂0: E(β̂0) = β0

Recall that the OLS estimator β̂0 is given by β̂0 = ȳ − β̂1x̄:

E(β̂0|x) = E(ȳ|x) − E(β̂1x̄|x) = E(ȳ|x) − x̄·E(β̂1|x)

         = E(ȳ|x) − β1x̄

         = E[(1/n) Σᵢ₌₁ⁿ yᵢ | x] − β1x̄

         = E[(1/n) Σᵢ₌₁ⁿ (β0 + β1xᵢ + uᵢ) | x] − β1x̄

         = β0 + β1(1/n) Σᵢ₌₁ⁿ xᵢ + (1/n)·E(Σᵢ₌₁ⁿ uᵢ | x) − β1x̄

         = β0 + β1x̄ − β1x̄    if SLR.4 holds: E(u|x) = 0

         = β0

Again, apply the law of iterated expectations:

E(β̂0) = E(E(β̂0|x)) = E(β0) = β0

⇒ The OLS estimator β̂0 is an unbiased estimator of β0.


Variance of β̂1

We have shown that

β̂1 = β1 + Σᵢ₌₁ⁿ (xᵢ − x̄)uᵢ / Σᵢ₌₁ⁿ (xᵢ − x̄)²

Recall: Var(aX + b) = a²·Var(X) for any constants a, b.

Var(β̂1|x) = Var[ Σᵢ₌₁ⁿ (xᵢ − x̄)uᵢ / Σᵢ₌₁ⁿ (xᵢ − x̄)² | x ]

          = 1/[Σᵢ₌₁ⁿ (xᵢ − x̄)²]² · Var[ Σᵢ₌₁ⁿ (xᵢ − x̄)uᵢ | x ]


Variance of β̂1

Var[ Σᵢ₌₁ⁿ (xᵢ − x̄)uᵢ | x ] = Σᵢ₌₁ⁿ Var[(xᵢ − x̄)uᵢ | x]    (by random sampling, the uᵢ are independent)

                           = Σᵢ₌₁ⁿ (xᵢ − x̄)²·Var(uᵢ|x)

                           = Σᵢ₌₁ⁿ (xᵢ − x̄)²·σ²    if SLR.5 holds: Var(u|x) = σ²

                           = σ² Σᵢ₌₁ⁿ (xᵢ − x̄)²


Variance of β̂1

Var(β̂1|x) = Var[ Σᵢ₌₁ⁿ (xᵢ − x̄)uᵢ | x ] / [Σᵢ₌₁ⁿ (xᵢ − x̄)²]²

          = σ² Σᵢ₌₁ⁿ (xᵢ − x̄)² / [Σᵢ₌₁ⁿ (xᵢ − x̄)²]²

          = σ² / Σᵢ₌₁ⁿ (xᵢ − x̄)²

Questions:
- When is Var(β̂1|x) small?
- Why do we need SLR.3, “x₁, x₂, …, xₙ are not all the same value”?
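The formula can be illustrated with a Monte Carlo experiment (a sketch assuming numpy; parameters are arbitrary): the sampling variance of β̂1 across repeated samples matches σ²/Σᵢ₌₁ⁿ(xᵢ − x̄)². Note the formula also answers the SLR.3 question: if all xᵢ were equal, the denominator would be zero and β̂1 would be undefined.

```python
# Monte Carlo check of Var(beta1-hat | x) = sigma^2 / sum((x_i - xbar)^2)
import numpy as np

rng = np.random.default_rng(7)
beta0, beta1, sigma = 1.0, 2.0, 1.0          # illustrative true parameters
x = np.linspace(0.0, 10.0, 30)               # fixed regressor values

def simulate_var(x, reps=10000):
    """Sampling variance of beta1-hat across repeated samples with x held fixed."""
    b1s = np.empty(reps)
    for r in range(reps):
        y = beta0 + beta1 * x + rng.normal(0.0, sigma, x.size)
        b1s[r] = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    return b1s.var()

theory = sigma ** 2 / ((x - x.mean()) ** 2).sum()
mc_var = simulate_var(x)
print(mc_var, theory)                        # the two agree closely
```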