introduction to econometrics - zhaopeng(frank)qu … · 2017-10-09 · it is the main reason why...


Introduction to Econometrics
Lecture 3: Regression: CEF and Simple OLS

Zhaopeng Qu

Business School, Nanjing University

Oct. 9th, 2017

Zhaopeng Qu (Nanjing University) Introduction to Econometrics Oct. 9th, 2017 1 / 48


Outline

1 Review the Previous Lecture

2 Make Regression Make Sense
  Economic Relationships and the CEF
  The Properties of CEF
  Linear Regression and the CEF: Why Regress?

3 Simple Regression: Ordinary Least Squares

4 The Least Squares Assumptions



Review the Previous Lecture



Causal Inference in Social Science

Causality is our main goal in empirical social science.
The existence of selection bias makes social science more difficult than natural science.
Although the experimental method is a powerful tool for economists, not every project can be carried out as an experiment.
This is the main reason why modern econometrics exists and keeps developing.

"The modern menu of econometric methods can seem confusing, even to an experienced number cruncher. Luckily, not everything on the menu is equally valuable or important. Some of the more exotic items are needlessly complex and may even be harmful. On the plus side, the core methods of applied econometrics remain largely unchanged, while the interpretation of basic tools has become more nuanced and sophisticated." (Angrist and Pischke, 2009)



Furious Seven Weapons

Building a reasonable counterfactual world, or finding a proper control group, is the core of econometric methods.

1 Random Trials
2 Regression (Ordinary Least Squares, OLS)
3 Matching and Propensity Score
4 Decomposition
5 Instrumental Variable
6 Regression Discontinuity
7 Difference in Differences

The most basic of these tools is regression, which compares treatment and control subjects who have the same observed characteristics.
Regression concepts are foundational, paving the way for the more elaborate tools used in the classes that follow.

So let’s start our exciting journey from it.


Make Regression Make Sense



Regression: What You Need to Know

"We spend our lives running regressions (I should say: 'regressions run me'). And yet this basic empirical tool is often misunderstood. So I begin with a recap of key regression properties." (Angrist, 2014)

Our Regression Agenda

1 The CEF is all you need
2 What is Regression and Why We Regress
3 Regression and Causality



Conditional Expectation Function (CEF): Education and Earnings

Most of what we want to do in social science is to learn how two variables are related, such as education and earnings.
On average, people with more schooling earn more than people with less schooling.

The connection between schooling and average earnings has considerable predictive power, in spite of the enormous variation in individual circumstances.
The fact that more educated people earn more than less educated people does not mean that schooling causes earnings to increase.
However, it’s clear that education predicts earnings in a narrow statistical sense.

This predictive power is compellingly summarized by the Conditional Expectation Function.



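Probability Review: Conditional Expectation Function (CEF)

If both X and Y are random variables, then the probability density function of Y conditional on X is

$$f_{Y|X}(y \mid x) = \frac{f(x, y)}{f(x)}$$

The expectation of Y conditional on X is

$$E(Y \mid X = x) = \int_{Y} y \, f_{Y|X}(y \mid x) \, dy = \int_{Y} y \, \frac{f(x, y)}{f(x)} \, dy$$

So the Conditional Expectation Function (CEF) is a function of x. Since X is a random variable, the CEF E(Y | X) is also a random variable.
Intuition: an expectation is an average, and a conditional expectation is an average taken within groups, that is, "the mean given that a condition holds".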
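As a rough illustration of the definition above, the sketch below works through the discrete analogue, where sums replace the integrals. The joint probability table `joint` and the helper names `marginal_x` and `cef` are made up for this example and do not come from the lecture.

```python
from fractions import Fraction as F

# Hypothetical joint pmf P(X = x, Y = y) on a small discrete support.
joint = {
    (0, 1): F(1, 8), (0, 2): F(3, 8),
    (1, 1): F(1, 4), (1, 3): F(1, 4),
}

def marginal_x(joint):
    """P(X = x), obtained by summing the joint pmf over y."""
    px = {}
    for (x, _), p in joint.items():
        px[x] = px.get(x, 0) + p
    return px

def cef(joint):
    """E[Y | X = x] = sum over y of y * P(x, y) / P(x), for every x."""
    px = marginal_x(joint)
    out = {}
    for (x, y), p in joint.items():
        out[x] = out.get(x, 0) + y * p / px[x]
    return out

px = marginal_x(joint)
m = cef(joint)
for x in sorted(m):
    print(f"P(X={x}) = {px[x]},  E[Y | X={x}] = {m[x]}")
```

The result `m` is a mapping from each value of X to one number, which is the sense in which the CEF is "a function of x".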



The Law of Iterated Expectations (LIE)

$$E[Y] = E\big[E[Y \mid X]\big]$$



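The Law of Iterated Expectations (LIE)

Proof.

$$
\begin{aligned}
E[E(Y \mid X)] &= \int E[Y \mid X = u] \, g_X(u) \, du \\
&= \int \left[ \int t \, f_Y(t \mid X = u) \, dt \right] g_X(u) \, du \\
&= \int \int t \, f_Y(t \mid X = u) \, g_X(u) \, dt \, du \\
&= \int t \left[ \int f_Y(t \mid X = u) \, g_X(u) \, du \right] dt \\
&= \int t \left[ \int f_{XY}(t, u) \, du \right] dt \\
&= \int t \, f_Y(t) \, dt = E[Y]
\end{aligned}
$$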
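The sample counterpart of the LIE can be checked by hand: averaging the group means of Y, weighted by the share of each group, returns the overall mean of Y. The sketch below assumes a tiny made-up data set of (schooling, wage) pairs; the numbers and variable names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction as F

# Hypothetical (schooling, wage) pairs; the numbers are invented.
data = [(12, 10), (12, 14), (16, 20), (16, 22), (16, 24), (18, 30)]
n = len(data)

groups = defaultdict(list)
for x, y in data:
    groups[x].append(y)

# Inner expectation: the mean of Y within each value of X.
group_mean = {x: F(sum(ys), len(ys)) for x, ys in groups.items()}

# Outer expectation: average the group means using the sample shares of X.
outer = sum(group_mean[x] * F(len(ys), n) for x, ys in groups.items())

# Direct overall mean of Y.
overall = F(sum(y for _, y in data), n)

print(outer, overall)
```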



Education and Wage: CEF View

The figure plots the CEF of log weekly wages given schooling for a sample of middle-aged white men from the 1980 census.
The CEF in the figure captures the fact that, the enormous variation in individual circumstances notwithstanding, people with more schooling generally earn more, on average.



The CEF Decomposition Property

Theorem
Every random variable such as $Y_i$ can be written as

$$Y_i = E[Y_i \mid X_i] + \varepsilon_i$$

where $\varepsilon_i$ is mean-independent of $X_i$, i.e. $E[\varepsilon_i \mid X_i] = 0$, and therefore $\varepsilon_i$ is uncorrelated with any function of $X_i$.

Proof.
Skipped

This theorem says that any random variable $Y_i$ can be decomposed into two parts:

a piece that is "explained by $X_i$", i.e. the CEF;
a piece left over that is orthogonal to (i.e. uncorrelated with) any function of $X_i$.
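The decomposition has an exact sample analogue when X is discrete and the CEF is estimated by group means: the leftover piece then averages to zero against any function of X. A minimal sketch, assuming invented data and a handful of arbitrary test functions `h`:

```python
from collections import defaultdict
from fractions import Fraction as F

# Hypothetical (x, y) observations.
data = [(1, 2), (1, 4), (2, 3), (2, 9), (3, 5), (3, 7), (3, 12)]

groups = defaultdict(list)
for x, y in data:
    groups[x].append(y)

# Sample CEF: the mean of y within each value of x.
cef = {x: F(sum(ys), len(ys)) for x, ys in groups.items()}

# Residual piece: epsilon_i = y_i - CEF(x_i).
resid = [(x, y - cef[x]) for x, y in data]

def mean_of_product(h):
    """Sample mean of h(x_i) * epsilon_i."""
    return sum(h(x) * e for x, e in resid) / len(resid)

# Arbitrary functions of x, chosen only for illustration.
functions = [lambda x: 1, lambda x: x, lambda x: x ** 2, lambda x: (x - 2) ** 3]
checks = [mean_of_product(h) for h in functions]
print(checks)
```

Every entry of `checks` is zero because the residuals sum to zero inside each group, so multiplying them by anything that is constant within a group cannot change that.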



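The CEF-Prediction Property

Theorem
Let $m(X_i)$ be any function of $X_i$. The CEF is the Minimum Mean Squared Error (MMSE) predictor of $Y_i$ given $X_i$. Thus

$$E[Y_i \mid X_i] = \arg\min_{m(X_i)} E\big[(Y_i - m(X_i))^2\big]$$

Proof.

$$
\begin{aligned}
(Y_i - m(X_i))^2 &= \big[(Y_i - E[Y_i \mid X_i]) + (E[Y_i \mid X_i] - m(X_i))\big]^2 \\
&= (Y_i - E[Y_i \mid X_i])^2 \\
&\quad + 2 \, (Y_i - E[Y_i \mid X_i]) \, (E[Y_i \mid X_i] - m(X_i)) \\
&\quad + (E[Y_i \mid X_i] - m(X_i))^2
\end{aligned}
$$

The first term does not involve $m(X_i)$. The second term has expectation zero by the CEF decomposition property, because $E[Y_i \mid X_i] - m(X_i)$ is a function of $X_i$. The last term is minimized at zero when $m(X_i)$ is the CEF.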



The CEF-Prediction Property

Suppose we are interested in predicting Y using some function $m(X_i)$. The optimal predictor under the MMSE (Minimum Mean Squared Error) criterion is the CEF.
Therefore, the CEF is a natural summary of the relationship between Y and X under MMSE.
It means that if we know the CEF, then we can describe the relationship between Y and X.
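The prediction property can be seen in a small sample where the CEF is taken to be the group means of Y. In that setting the mean squared error of any other predictor exceeds that of the CEF by exactly the mean squared gap between the two. The data and the competing predictors below are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction as F

# Hypothetical (x, y) observations.
data = [(1, 2), (1, 4), (2, 3), (2, 9), (3, 5), (3, 7), (3, 12)]
n = len(data)

groups = defaultdict(list)
for x, y in data:
    groups[x].append(y)
cef = {x: F(sum(ys), len(ys)) for x, ys in groups.items()}

def mse(predict):
    """Sample mean squared error of the predictor x -> predict(x)."""
    return F(sum((y - predict(x)) ** 2 for x, y in data), n)

def gap(predict):
    """Sample mean of (CEF(x) - predict(x))^2."""
    return F(sum((cef[x] - predict(x)) ** 2 for x, _ in data), n)

y_bar = F(sum(y for _, y in data), n)
candidates = {
    "cef": lambda x: cef[x],
    "overall mean": lambda x: y_bar,
    "line 2x + 1": lambda x: 2 * x + 1,   # an arbitrary linear rule
}
results = {name: mse(f) for name, f in candidates.items()}
for name, value in results.items():
    print(name, value, "= MSE(cef) +", gap(candidates[name]))
```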



CEF and Regression

We have already learned that the CEF is a natural summary of the relationship we would like to know.
But the CEF has an unknown functional form, so the next question is: how do we model the CEF, E(Y | X)?
Answer: two basic approaches

Nonparametric (Matching, Kernel Density, etc.)
Parametric (Regression)

Regression estimates provide a valuable baseline for almost all empirical research because regression is tightly linked to the CEF.



Estimating the CEF: two discrete values

Suppose X is binary: it takes on only two values, 0 and 1 (as in our former example, where X is a treatment).
We have already been writing separate symbols for the means in the different groups.

The mean in each group is just a conditional expectation: $E[Y_i \mid X_i = 1]$ and $E[Y_i \mid X_i = 0]$.

How do we estimate $E[Y_i \mid X_i = x]$? We have to use sample data to make inferences about the population.
We can use the sample means within each group:

$$\hat{E}[Y_i \mid X_i = 1] = \frac{1}{n_1} \sum_{i: X_i = 1} Y_i$$

$$\hat{E}[Y_i \mid X_i = 0] = \frac{1}{n_0} \sum_{i: X_i = 0} Y_i$$

Here $n_1$ and $n_0$ are the numbers of observations in the sample with $X_i = 1$ and $X_i = 0$ (for example, the numbers of men and women when X indicates gender).


Estimating the CEF: multiple discrete values

What if X takes on more than two discrete values?
We can still estimate $E[Y_i \mid X_i = x]$ with the sample mean among those who have $X_i = x$, thus

$$\hat{E}[Y_i \mid X_i = x] = \frac{1}{n_x} \sum_{i: X_i = x} Y_i$$

where $n_x$ is the number of observations in the sample with $X_i = x$.
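The within-group sample mean is simple enough to write as one small function that covers both the binary case and the case with several discrete values. This is only a sketch: the function name `estimate_cef` and all the numbers are invented for the example.

```python
from collections import defaultdict

def estimate_cef(xs, ys):
    """Return {x: sample mean of y among observations with X_i = x}."""
    total = defaultdict(float)
    count = defaultdict(int)
    for x, y in zip(xs, ys):
        total[x] += y
        count[x] += 1
    return {x: total[x] / count[x] for x in total}

# Binary X: a hypothetical treatment indicator.
treated = [1, 0, 1, 0, 0, 1]
outcome = [5.0, 3.0, 7.0, 2.0, 4.0, 6.0]
binary_cef = estimate_cef(treated, outcome)
diff_in_means = binary_cef[1] - binary_cef[0]

# X with more than two values: hypothetical years of schooling.
schooling = [12, 12, 16, 16, 16, 18]
wage = [10.0, 14.0, 20.0, 22.0, 24.0, 30.0]
multi_cef = estimate_cef(schooling, wage)

print(binary_cef, diff_in_means)
print(multi_cef)
```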



Estimating the CEF: continuous

What if X is continuous? Can we calculate a mean for every value of $X_i$?
Not directly: for a continuous variable, probability attaches only to intervals, not to single values, so few or no observations share exactly the same $X_i$. We can instead turn X into a discrete variable by grouping its values into intervals. This is called stratification.

Once X is discrete, we can just calculate the means within each stratum.

The stratification approach is fairly crude: it assumes that means are constant within strata, which seems wrong.
Now we will think about $E[Y_i \mid X_i = x]$ as a function. What does this function look like?

It is an unknown function in the population, which makes producing an estimator very difficult!
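Stratification as described above can be sketched in a few lines: cut the range of X into equal-width intervals and average Y inside each one. The bin width, the helper name `stratified_means`, and the data are assumptions made for this illustration only.

```python
import math
from collections import defaultdict

def stratified_means(xs, ys, width):
    """Mean of y within each stratum [k * width, (k + 1) * width) of x."""
    total = defaultdict(float)
    count = defaultdict(int)
    for x, y in zip(xs, ys):
        k = math.floor(x / width)      # index of the stratum containing x
        total[k] += y
        count[k] += 1
    return {(k * width, (k + 1) * width): total[k] / count[k]
            for k in sorted(total)}

# Hypothetical continuous regressor and outcome.
xs = [0.5, 1.5, 2.2, 2.9, 4.1, 4.8, 5.0, 7.3]
ys = [1.0, 2.0, 2.0, 4.0, 5.0, 7.0, 6.0, 9.0]

strata = stratified_means(xs, ys, width=2.0)
for (lo, hi), mean in strata.items():
    print(f"[{lo}, {hi}): {mean}")
```

The output is a step function of x, which makes the crudeness visible: every x inside a stratum gets the same predicted mean.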



Population Regression: What is a Regression?

Definition
We define population regression ("regression" for short) as the solution to the population least squares problem. Specifically, the K×1 regression coefficient vector β is defined by solving

$$\beta = \arg\min_{b} E\big[(Y_i - X_i' b)^2\big]$$

Using the first order condition

$$E\big[X_i (Y_i - X_i' b)\big] = 0$$

the solution for b can be written

$$\beta = E[X_i X_i']^{-1} E[X_i Y_i]$$

Regression is a feature of data, just like expectation, correlation, etc. The regression function $X_i' \beta$ is a parametric, linear model of $m(X_i)$ whose coefficients are determined by population second moments.
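Replacing the population expectations by sample sums gives a version of the formula above that can be computed directly. The sketch below assumes K = 2 with $X_i = (1, x_i)'$, so the 2×2 system can be solved by hand without a linear algebra library; the function name and data are hypothetical.

```python
from fractions import Fraction as F

def least_squares_2d(rows, ys):
    """Solve (sum_i x_i x_i') b = sum_i x_i y_i for two-element vectors x_i."""
    a11 = sum(r[0] * r[0] for r in rows)
    a12 = sum(r[0] * r[1] for r in rows)
    a22 = sum(r[1] * r[1] for r in rows)
    c1 = sum(r[0] * y for r, y in zip(rows, ys))
    c2 = sum(r[1] * y for r, y in zip(rows, ys))
    det = a11 * a22 - a12 * a12
    # Inverse of the symmetric 2x2 matrix applied to the vector (c1, c2).
    return (F(a22 * c1 - a12 * c2, det), F(a11 * c2 - a12 * c1, det))

xs = [1, 2, 3, 4]
ys = [2, 3, 5, 6]
rows = [(1, x) for x in xs]        # each X_i = (1, x_i)', constant first

b = least_squares_2d(rows, ys)

# Sample first order condition: sum_i x_i (y_i - x_i' b) is the zero vector.
foc = [sum(r[k] * (y - (r[0] * b[0] + r[1] * b[1])) for r, y in zip(rows, ys))
       for k in range(2)]
print(b, foc)
```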



Three Reasons to Regress

There are three reasons (three justifications) why the vector of population regression coefficients might be of interest.

1 The Best Linear Predictor Theorem
2 The Linear CEF Theorem
3 The Regression-CEF Theorem



Regression Justification I

Theorem: The Best Linear Predictor Theorem
Regression solves the population least squares problem and is therefore the Best Linear Predictor (BLP) of $Y_i$ given $X_i$.

Proof.
By definition of regression.

In other words, just as the CEF is the best predictor of $Y_i$ given $X_i$ in the class of all functions of $X_i$, the population regression function is the best we can do in the class of linear functions.



Regression Justification II

Theorem: The Linear CEF Theorem
Suppose the CEF is linear. Then the population regression function is the CEF.

Proof.
Suppose $E(Y_i \mid X_i) = X_i' \beta^*$ for a K×1 vector of coefficients $\beta^*$. By the CEF decomposition property, we have

$$E\big[X_i (Y_i - E[Y_i \mid X_i])\big] = 0$$

Then substitute using $E(Y_i \mid X_i) = X_i' \beta^*$:

$$E\big[X_i (Y_i - X_i' \beta^*)\big] = 0$$

Solving for $\beta^*$, we find that

$$\beta^* = \beta = E[X_i X_i']^{-1} E[X_i Y_i]$$



Regression Justification III

Theorem: The Regression-CEF Theorem
The population regression function $X_i' \beta$ provides the MMSE linear approximation to $E(Y_i \mid X_i)$, thus

$$\beta = \arg\min_{b} E\big[(E[Y_i \mid X_i] - X_i' b)^2\big]$$

Proof.

$$
\begin{aligned}
(Y_i - X_i' b)^2 &= \big[(Y_i - E[Y_i \mid X_i]) + (E[Y_i \mid X_i] - X_i' b)\big]^2 \\
&= (Y_i - E[Y_i \mid X_i])^2 + (E[Y_i \mid X_i] - X_i' b)^2 \\
&\quad + 2 \, (Y_i - E[Y_i \mid X_i]) \, (E[Y_i \mid X_i] - X_i' b)
\end{aligned}
$$

The first term does not involve b, and the last term has expectation zero by the CEF decomposition property. Therefore minimizing $E[(E[Y_i \mid X_i] - X_i' b)^2]$ has the same solution as the regression least squares problem.
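One practical reading of this theorem is that regressing Y on X with individual-level data gives the same line as regressing the group means of Y on X, weighting each group by its size. The sketch below checks this on invented data; `wls_line` is a hypothetical helper written for this example, not a library function.

```python
from collections import defaultdict
from fractions import Fraction as F

def wls_line(xs, ys, ws):
    """Weighted least squares fit of y on (1, x); returns (intercept, slope)."""
    sw = sum(ws)
    swx = sum(w * x for w, x in zip(ws, xs))
    swy = sum(w * y for w, y in zip(ws, ys))
    swxx = sum(w * x * x for w, x in zip(ws, xs))
    swxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    det = sw * swxx - swx * swx
    return (F(swxx * swy - swx * swxy, det), F(sw * swxy - swx * swy, det))

# Hypothetical individual-level (x, y) observations.
data = [(1, 2), (1, 4), (2, 3), (2, 9), (3, 5), (3, 7), (3, 12)]
xs = [x for x, _ in data]
ys = [y for _, y in data]

# Regression on the individual data (all weights equal to one).
micro = wls_line(xs, ys, [1] * len(data))

# Regression of the sample CEF (group means) on x, weighted by group size.
groups = defaultdict(list)
for x, y in data:
    groups[x].append(y)
gx = sorted(groups)
g_mean = [F(sum(groups[x]), len(groups[x])) for x in gx]
g_size = [len(groups[x]) for x in gx]
grouped = wls_line(gx, g_mean, g_size)

print(micro, grouped)
```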


Regression Justification

The Best Linear Predictor Theorem and the Regression-CEF Theorem show us two more ways to view regression:

Regression provides the best linear predictor for the dependent variable in the same way that the CEF is the best unrestricted predictor of the dependent variable.
If we prefer to think about approximating $E(Y_i \mid X_i)$, as opposed to predicting $Y_i$, the Regression-CEF theorem tells us that even if the CEF is nonlinear, regression provides the best linear approximation to it.

Actually, the Regression-CEF theorem is our favorite way to motivate regression. The statement that regression approximates the CEF lines up with our view of empirical work as an effort to describe the essential features of statistical relationships, without necessarily trying to pin them down exactly.
We are not really interested in predicting individual $Y_i$; it’s the distribution of $Y_i$ that we care about.



The CEF and Regression



Simple Regression: Ordinary Least Squares



Question: Class Size and Student’s Performance

Specific question:
What is the effect on district test scores if we increase district average class size by 1 student (or by one unit of the student-teacher ratio)?

Technically, we would like to know the true value of a parameter $\beta_1$,

$$\beta_1 = \frac{\Delta \text{Test score}}{\Delta \text{Class size}}$$

And $\beta_1$ is actually the slope of a straight line relating test scores and class size. Thus

$$\text{Test score} = \beta_0 + \beta_1 \times \text{Class size}$$

where $\beta_0$ is the intercept of the straight line.



Question: Class Size and Student’s Performance

BUT the average test score in district i does not depend only on the average class size.
It also depends on other factors such as

Student background
Quality of the teachers
School’s facilities
Quality of textbooks
.....

So the equation describing the linear relation between test score and class size is better written as

$$\text{Test score}_i = \beta_0 + \beta_1 \times \text{Class size}_i + u_i$$

where $u_i$ lumps together all other district characteristics that affect average test scores.



Terminology for Simple Regression Model

The linear regression model with one regressor is denoted by

$$Y_i = \beta_0 + \beta_1 X_i + u_i$$

where
$Y_i$ is the dependent variable (test score).
$X_i$ is the independent variable or regressor (class size or student-teacher ratio).
$\beta_0 + \beta_1 X_i$ is the population regression line or the population regression function.

This is the relationship that holds between Y and X on average over the population. (Does this look familiar? Recall the concept of conditional expectation.)

The intercept $\beta_0$ and the slope $\beta_1$ are the coefficients of the population regression line, also known as the parameters of the population regression line.
$u_i$ is the error term, which contains all the other factors besides X that determine the value of the dependent variable, Y, for a specific observation, i.



Terminology for Simple Regression Model



How to find the “best” fitting line?

In general we don’t know $\beta_0$ and $\beta_1$, which are parameters of the population regression function. We have to estimate them using a sample of data.

So how do we find the line that fits the data best?


The Ordinary Least Squares Estimator (OLS)

The OLS estimator chooses the regression coefficients so that the estimated regression line is as close as possible to the observed data, where closeness is measured by the sum of the squared mistakes made in predicting Y given X.
Let $b_0$ and $b_1$ be estimators of $\beta_0$ and $\beta_1$, thus $b_0 \equiv \hat{\beta}_0$, $b_1 \equiv \hat{\beta}_1$.

The predicted value of $Y_i$ given $X_i$ using these estimators is $b_0 + b_1 X_i$, formally denoted $\hat{Y}_i$.
The prediction mistake is the difference between $Y_i$ and $\hat{Y}_i$:

$$\hat{u}_i = Y_i - \hat{Y}_i = Y_i - (b_0 + b_1 X_i)$$

The estimators of the slope and intercept that minimize the sum of the squares of $\hat{u}_i$, thus

$$\arg\min_{b_0, b_1} \frac{1}{n} \sum_{i=1}^{n} \hat{u}_i^2 = \arg\min_{b_0, b_1} \frac{1}{n} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2$$

are called the ordinary least squares (OLS) estimators of $\beta_0$ and $\beta_1$.



The Ordinary Least Squares Estimator (OLS)

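OLS minimizes the sum of squared prediction mistakes:

$$\min_{b_0, b_1} \sum_{i=1}^{n} \hat{u}_i^2 = \min_{b_0, b_1} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2$$

Solve the problem by the F.O.C. (the first order conditions).

Step 1:

$$\frac{\partial}{\partial b_0} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2 = 0$$

Step 2:

$$\frac{\partial}{\partial b_1} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2 = 0$$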



Step 1: OLS estimator of β0

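Optimization

$$
\begin{aligned}
\frac{\partial}{\partial b_0} \sum_{i=1}^{n} \hat{u}_i^2 = -2 \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i) &= 0 \\
\Rightarrow \sum_{i=1}^{n} Y_i - \sum_{i=1}^{n} b_0 - \sum_{i=1}^{n} b_1 X_i &= 0 \\
\Rightarrow \frac{1}{n} \sum_{i=1}^{n} Y_i - \frac{1}{n} \sum_{i=1}^{n} b_0 - b_1 \frac{1}{n} \sum_{i=1}^{n} X_i &= 0 \\
\Rightarrow \bar{Y} - b_0 - b_1 \bar{X} &= 0
\end{aligned}
$$

OLS estimator of $\beta_0$:

$$b_0 = \bar{Y} - b_1 \bar{X} \quad \text{or} \quad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$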




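Step 2: substitute $b_0 = \bar{Y} - b_1 \bar{X}$ from Step 1 into the first order condition for $b_1$.

$$
\begin{aligned}
\frac{\partial}{\partial b_1} \sum_{i=1}^{n} \hat{u}_i^2 = -2 \sum_{i=1}^{n} X_i (Y_i - b_0 - b_1 X_i) &= 0 \\
\Rightarrow \sum_{i=1}^{n} X_i \big[Y_i - (\bar{Y} - b_1 \bar{X}) - b_1 X_i\big] &= 0 \\
\Rightarrow \sum_{i=1}^{n} X_i \big[(Y_i - \bar{Y}) - b_1 (X_i - \bar{X})\big] &= 0 \\
\Rightarrow \sum_{i=1}^{n} X_i (Y_i - \bar{Y}) - b_1 \sum_{i=1}^{n} X_i (X_i - \bar{X}) &= 0 \\
\Rightarrow \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) - b_1 \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X}) &= 0
\end{aligned}
$$

The last step replaces $X_i$ by $X_i - \bar{X}$, which leaves both sums unchanged because deviations from the mean sum to zero.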




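Algebra Trick

$$
\begin{aligned}
\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) &= \sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \bar{Y} - \sum_{i=1}^{n} \bar{X} Y_i + \sum_{i=1}^{n} \bar{X} \bar{Y} \\
&= \sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \bar{Y} - n \bar{X} \Big(\frac{1}{n} \sum_{i=1}^{n} Y_i\Big) + n \bar{X} \bar{Y} \\
&= \sum_{i=1}^{n} X_i (Y_i - \bar{Y})
\end{aligned}
$$

By similar reasoning, we obtain

$$\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X}) = \sum_{i=1}^{n} X_i (X_i - \bar{X})$$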




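Thus the first order condition for $b_1$ becomes

$$\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) - b_1 \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X}) = 0$$

OLS estimator of $\beta_1$:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})}$$

Recall that the OLS predicted values $\hat{Y}_i$ and residuals $\hat{u}_i$ are:

$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$$

$$\hat{u}_i = Y_i - \hat{Y}_i$$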
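The two closed-form estimators can be computed directly from their formulas, and the two first order conditions can then be verified on the residuals. The sketch below uses invented district data (student-teacher ratio and average test score); `ols_simple` is a hypothetical helper name.

```python
from fractions import Fraction as F

def ols_simple(xs, ys):
    """Return (b0, b1) from the deviation-from-means formulas."""
    n = len(xs)
    x_bar = F(sum(xs), n)
    y_bar = F(sum(ys), n)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical district data, invented for illustration.
class_size = [15, 18, 20, 22, 25]
test_score = [680, 670, 660, 655, 640]

b0, b1 = ols_simple(class_size, test_score)
fitted = [b0 + b1 * x for x in class_size]
resid = [y - f for y, f in zip(test_score, fitted)]

# First order conditions: residuals sum to zero and are orthogonal to x.
foc_intercept = sum(resid)
foc_slope = sum(x * e for x, e in zip(class_size, resid))
print(float(b0), float(b1), foc_intercept, foc_slope)
```

Exact fractions are used here only so that the first order conditions come out as exactly zero; with floating point numbers they would hold up to rounding error.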



The Estimated Regression Line



Measures of Fit: The $R^2$

Decompose $Y_i$ into the fitted value plus the residual:

$$Y_i = \hat{Y}_i + \hat{u}_i$$

The total sum of squares (SST) = the explained sum of squares (SSE) + the sum of squared residuals (SSR):

$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

$R^2$, or the coefficient of determination, is the fraction of the sample variance of $Y_i$ explained/predicted by $X_i$:

$$R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}$$

So $0 \le R^2 \le 1$.
It may seem that the bigger the R-squared, the better the regression.
But actually we don’t care much about $R^2$ in modern applied econometrics.
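The sums of squares and $R^2$ follow mechanically once the fitted line is known. A minimal sketch on made-up data, with a hypothetical helper `fit_measures` that refits the line internally so the block stands on its own:

```python
from fractions import Fraction as F

def fit_measures(xs, ys):
    """Return SST, SSE, SSR and R^2 for the OLS line of y on x."""
    n = len(xs)
    x_bar = F(sum(xs), n)
    y_bar = F(sum(ys), n)
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * x for x in xs]
    sst = sum((y - y_bar) ** 2 for y in ys)              # total
    sse = sum((f - y_bar) ** 2 for f in fitted)          # explained
    ssr = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # residual
    return {"sst": sst, "sse": sse, "ssr": ssr, "r2": sse / sst}

# Hypothetical data.
out = fit_measures([1, 2, 3, 4], [2, 3, 5, 6])
print({k: float(v) for k, v in out.items()})
```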



The Least Squares Assumptions



Assumptions of the Linear Regression Model

In order to investigate the statistical properties of OLS, we need to make some statistical assumptions.

Linear Regression Model
The observations $(Y_i, X_i)$ come from a random sample (i.i.d.) and satisfy the linear regression equation

$$Y_i = \beta_0 + \beta_1 X_i + u_i$$

and $E[u_i \mid X_i] = 0$.



Conditional Mean is Zero

Assumption 1: Zero conditional mean of the errors given X
The error $u_i$ has expected value 0 given any value of the independent variable:

$$E[u_i \mid X_i = x] = 0$$

This implies a weaker condition, that $u_i$ and $X_i$ are uncorrelated:

$$Cov[u_i, X_i] = E[u_i X_i] = 0$$

If $u_i$ and $X_i$ are correlated, then Assumption 1 is violated.
Equivalently, the population regression line is the conditional mean of $Y_i$ given $X_i$, thus

$$E(Y_i \mid X_i = x) = \beta_0 + \beta_1 x$$

(Exercise 4.6)
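Why the correlation between the error and the regressor matters can be seen from the algebra of the OLS slope: in any sample, the estimated slope equals the true slope plus the sample covariance of $u_i$ and $X_i$ divided by the sample variance of $X_i$. The sketch below is a purely mechanical illustration with invented parameters and hand-picked errors; it speaks to one sample only, not to the population assumption itself.

```python
from fractions import Fraction as F

def ols_slope(xs, ys):
    """OLS slope of y on x from the deviation-from-means formula."""
    n = len(xs)
    x_bar = F(sum(xs), n)
    y_bar = F(sum(ys), n)
    return (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))

beta0, beta1 = 2, 3                    # hypothetical true parameters
xs = [1, 2, 3, 4, 5]

# Errors chosen by hand: mean zero and uncorrelated with x in this sample.
u_ok = [1, -2, 2, -2, 1]
# Errors that move with x: the previous errors plus 0.5 * (x - 3).
u_bad = [u + F(1, 2) * (x - 3) for x, u in zip(xs, u_ok)]

def slope_with_errors(us):
    ys = [beta0 + beta1 * x + u for x, u in zip(xs, us)]
    return ols_slope(xs, ys)

slope_ok = slope_with_errors(u_ok)     # recovers beta1
slope_bad = slope_with_errors(u_bad)   # off by the slope of u on x
print(slope_ok, slope_bad, ols_slope(xs, u_bad))
```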


Conditional Mean is Zero



Random Sample

Assumption 2: Random Sample
We have an i.i.d. random sample of size n, $\{(X_i, Y_i), i = 1, ..., n\}$, from the population regression model above.

This is an implication of random sampling.
It generally won’t hold in other data structures.

Violations: time-series, selected samples.



Large outliers are unlikely

Assumption 3: Large outliers are unlikely
Observations with values of $X_i$, $Y_i$, or both that are far outside the usual range of the data (outliers) are unlikely. Mathematically, it assumes that X and Y have nonzero finite fourth moments.

Large outliers can make OLS regression results misleading.
One source of large outliers is data entry errors, such as a typographical error or incorrectly using different units for different observations.
Data entry errors aside, the assumption of finite kurtosis is a plausible one in many applications with economic data.
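A single data entry error is enough to show how sensitive the OLS slope can be. In the sketch below the clean data lie exactly on the line y = 2x + 1, and the contaminated copy records the last outcome in the wrong units (1100 instead of 11). All numbers are invented for the illustration.

```python
from fractions import Fraction as F

def ols_slope(xs, ys):
    """OLS slope of y on x from the deviation-from-means formula."""
    n = len(xs)
    x_bar = F(sum(xs), n)
    y_bar = F(sum(ys), n)
    return (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))

xs = [1, 2, 3, 4, 5]
ys_clean = [2 * x + 1 for x in xs]     # 3, 5, 7, 9, 11

# Hypothetical data entry error: the last value typed in different units.
ys_bad = ys_clean[:-1] + [1100]

slope_clean = ols_slope(xs, ys_clean)
slope_bad = ols_slope(xs, ys_bad)
print(float(slope_clean), float(slope_bad))
```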



Large outliers are unlikely



Underlying assumptions of OLS

The OLS estimator is unbiased, consistent, and has an asymptotically normal sampling distribution if

1 Random sampling.
2 Large outliers are unlikely.
3 The conditional mean of $u_i$ given $X_i$ is zero.

OLS is an estimator: it’s a machine that we plug data into and get estimates out of.
It has a sampling distribution, with a sampling variance/standard error, etc.

Just like the sample mean, the sample difference in means, or the sample variance.

We will discuss these characteristics of OLS in the next lecture.

