Class 5: Multiple Regression

Lionel Nesta

Observatoire Français des Conjonctures Economiques

Lionel.nesta@ofce.sciences-po.fr

SKEMA Ph.D programme 2010-2011

Introduction to Regression

Typically, the social scientist is dealing with multiple and complex webs of interactions between variables.

An immediate and appealing extension to simple linear regression is to extend the set of explanatory variables to other variables.

Multiple regressions include several explanatory variables in the empirical model:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + u_i$$

In compact form, with K explanatory variables:

$$y_i = \beta_0 + \sum_{k=1}^{K} \beta_k x_{ik} + u_i$$

To minimize the sum of squared errors:

$$\min_{\hat\beta_0, \dots, \hat\beta_K} \sum_{i=1}^{n} (y_i - \hat y_i)^2 = \min_{\hat\beta_0, \dots, \hat\beta_K} \sum_{i=1}^{n} \Big( y_i - \hat\beta_0 - \sum_{k=1}^{K} \hat\beta_k x_{ik} \Big)^2$$

Multivariate Least Square Estimator

Usually, the multivariate model is described by matrix notation:

$$y_i = x_i \beta + u_i \quad \text{for each observation } i, \qquad y = X\beta + u$$

With the following least square solution:

$$\hat\beta = (X'X)^{-1} X'y, \qquad \text{cov}(\hat\beta) = \sigma^2 (X'X)^{-1}$$
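The least square solution can be checked numerically. The following is a minimal sketch in plain Python, not the routine Stata uses: the function name `ols` and the small dataset are hypothetical, and the normal equations (X'X)b = X'y are solved by Gauss-Jordan elimination rather than by forming the inverse explicitly.

```python
def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    n, p = len(X), len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))]
         for r in range(p)]
    for col in range(p):
        # Partial pivoting: pick the row with the largest entry in this column
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular: perfect collinearity")
        A[col], A[pivot] = A[pivot], A[col]
        pv = A[col][col]
        A[col] = [v / pv for v in A[col]]
        for r in range(p):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[r][p] for r in range(p)]


# Hypothetical data generated without noise from y = 1 + 2*x1 - 3*x2
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (5.0, 8.0)]
X = [[1.0, x1, x2] for x1, x2 in rows]   # first column is the constant
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]
beta_hat = ols(X, y)                     # close to [1, 2, -3]
```

Because the illustrative data contain no error term, the estimates recover the coefficients used to generate y up to rounding error.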

Assumption OLS 1

Linearity: the model is linear in its parameters.

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u$$

It is possible to apply non-linear transformations to the variables (e.g. log of x) but not to the parameters, as in the following:

$$y = \beta_0 + \beta_1^2 x_1 + u$$

OLS cannot estimate this.

Assumption OLS 2

Random sampling: the n observations are a random sample of the whole population.

There is no selection bias in the sample. The results pertain to the whole population.

All observations are independent from one another (neither serial nor cross-sectional correlation).

Assumption OLS 3

No perfect collinearity: there is no exact linear relationship among the independent variables.

No independent variable is constant. Each variable has variance, which can be used with the variance of the dependent variable to compute the parameters.

Assumption OLS 4

Zero conditional mean: the error term u has an expected value of zero.

$$E(u \mid x_1, x_2, \dots, x_k) = 0$$

Given any values of the independent variables (IV), the error term must have an expected value of zero.

In this case, all independent variables are exogenous.

Otherwise, at least one IV suffers from an endogeneity problem.

Sources of endogeneity

Wrong specification of the model

Omitted variable correlated with one RHS variable

Measurement errors in RHS variables

Mutual causation between LHS and RHS

Simultaneity

Assumption OLS 5

Homoskedasticity: the variance of the error term, u, conditional on RHS, is the same for all values of RHS.

$$\text{Var}(u \mid x_1, x_2, \dots, x_k) = \sigma_u^2$$

Otherwise we speak of heteroskedasticity.

Assumption OLS 6

Normality of error term: the error term is independent of all RHS and follows a normal distribution with zero mean and variance σ².

$$u \sim \text{Normal}(0, \sigma^2)$$


Assumptions OLS

OLS1 Linearity

OLS2 Random Sampling

OLS3 No perfect Collinearity

OLS4 Zero Conditional Mean

OLS5 Homoskedasticity

OLS6 Normality of error term

Theorem 1

OLS1 - OLS4: Unbiasedness of OLS. The expected value of each estimated parameter is equal to the true unknown value of the corresponding parameter:

$$E(\hat\beta_j) = \beta_j, \quad j = 0, 1, 2, \dots, k$$

Theorem 2

OLS1 – OLS5: Variance of OLS estimate. The variance of the OLS estimator is

$$\text{Var}(\hat\beta_j) = \frac{\sigma_u^2}{\sum_{i=1}^{n} (x_{ij} - \bar x_j)^2 \, (1 - R_j^2)}$$

… where $R_j^2$ is the R-squared from regressing $x_j$ on all other independent variables. But how can we measure $\sigma_u^2$?

Theorem 3

OLS1 – OLS5: The variance of the error term is estimated without bias by

$$\hat\sigma_u^2 = \frac{\sum_{i} (y_i - \hat y_i)^2}{n - k - 1} = \frac{\sum_{i} \hat u_i^2}{n - k - 1}, \qquad E(\hat\sigma_u^2) = \sigma_u^2$$

Its square root, $\hat\sigma_u$, is the standard error of the regression. This is also called the standard error of the estimate or the root mean squared error (RMSE).

Standard Error of Each Parameter

Combining theorems 2 and 3 yields:

$$\text{se}(\hat\beta_j) = \frac{\hat\sigma_u}{\sqrt{\sum_{i=1}^{n} (x_{ij} - \bar x_j)^2 \, (1 - R_j^2)}}$$
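To make the role of the term in R-squared concrete, here is a short worked example under an assumed value. The figure 0.9 is purely illustrative and does not come from the course data.

```latex
% Assumed: regressing x_j on the other regressors gives R_j^2 = 0.9
\frac{1}{1 - R_j^2} = \frac{1}{1 - 0.9} = 10
\qquad\Longrightarrow\qquad
\frac{\text{se}(\hat\beta_j \mid R_j^2 = 0.9)}{\text{se}(\hat\beta_j \mid R_j^2 = 0)} = \sqrt{10} \approx 3.16
```

Holding the error variance and the variation in the regressor fixed, strong but imperfect collinearity roughly triples the standard error in this example. As the R-squared from that auxiliary regression approaches 1, the standard error grows without bound, which is the limiting case excluded by OLS3.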

Theorem 4

Under assumptions OLS1 – OLS5, the estimators $\hat\beta_0, \hat\beta_1, \dots, \hat\beta_k$ are the Best Linear Unbiased Estimators (BLUE) of $\beta_0, \beta_1, \dots, \beta_k$.

Assumptions OLS1 – OLS5 are known as the Gauss-Markov assumptions. The Gauss-Markov theorem states that, under OLS1-5, OLS is the best estimation method among linear unbiased estimators:

The estimates are unbiased (OLS1-4)

The estimates have the smallest variance (OLS5)

Theorem 5

Under assumptions OLS1 – OLS6, the standardized OLS estimates follow a t distribution:

$$\frac{\hat\beta_j - \beta_j}{\text{se}(\hat\beta_j)} \sim t_{n-k-1}$$

Extension of theorem 5: Inference

We can define the confidence interval of $\beta_j$ at 95%:

$$\hat\beta_j \pm t_{.025} \cdot \frac{\hat\sigma_u}{\sqrt{\sum_{i=1}^{n} (x_{ij} - \bar x_j)^2 \, (1 - R_j^2)}}$$

If the 95% CI does not include 0, then $\beta_j$ is significantly different from 0.

Student t Test for H0: βj = 0

We are also in a position to make inferences on βj.

H0: βj = 0

H1: βj ≠ 0

$$t = \frac{\hat\beta_j - 0}{\text{se}(\hat\beta_j)} = \frac{\hat\beta_j}{\text{se}(\hat\beta_j)}$$

Rule of decision

Accept H0 if | t | < tα/2

Reject H0 if | t | ≥ tα/2

Summary

OLS1 Linearity, OLS2 Random Sampling, OLS3 No perfect Collinearity, OLS4 Zero Conditional Mean: T1, Unbiasedness

Adding OLS5 Homoskedasticity: T2-T4, BLUE

Adding OLS6 Normality of error term: T5, β ~ t

The knowledge production function

Application 1: seminal model

$$PAT = f(RD, SIZE)$$

$$PAT = A \cdot RD^{\beta_1} \cdot SIZE^{\beta_2} \cdot \exp(u)$$

$$pat = \alpha + \beta_1 \, rd + \beta_2 \, size + u$$

Lowercase letters denote logarithms.

Application 1: baseline model

. reg lpat lrd lassets

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  2,   428) =   38.54
       Model |  97.0696447     2  48.5348224           Prob > F      =  0.0000
    Residual |  538.941858   428  1.25920995           R-squared     =  0.1526
-------------+------------------------------           Adj R-squared =  0.1487
       Total |  636.011503   430  1.47909652           Root MSE      =  1.1221

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lrd |   .6461714   .0868021     7.44   0.000       .47556    .8167828
     lassets |  -.3712237   .0722135    -5.14   0.000     -.513161   -.2292864
       _cons |  -.5909529   .3903255    -1.51   0.131    -1.358146    .1762404
------------------------------------------------------------------------------
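The t statistic and confidence interval reported for lrd can be reproduced by hand from the coefficient and its standard error. This is a small sketch: the critical value 1.9655 is an assumed approximation of the two-sided 5% critical value of a t distribution with 428 degrees of freedom, typed in by hand because the Python standard library has no t quantile function.

```python
coef, se = 0.6461714, 0.0868021   # lrd row of the Application 1 output
t_crit = 1.9655                   # assumed approximation of t(.025) with 428 df

t_stat = coef / se                                # test statistic for H0: beta = 0
ci = (coef - t_crit * se, coef + t_crit * se)     # 95% confidence interval
reject_h0 = abs(t_stat) >= t_crit                 # decision rule of the t test

print(f"t = {t_stat:.2f}, 95% CI = [{ci[0]:.4f}, {ci[1]:.4f}], reject H0: {reject_h0}")
```

The interval excludes zero and the absolute t statistic exceeds the critical value, so both routes lead to the same conclusion for this coefficient.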

The knowledge production function

Application 2: Changing specification

$$PAT = f(RD, SIZE)$$

$$PAT = A \cdot \left(\frac{RD}{SIZE}\right)^{\beta_1} \cdot SIZE^{\beta_2} \cdot \exp(u)$$

$$pat = \alpha + \beta_1 \log\left(\frac{RD}{SIZE}\right) + \beta_2 \, size + u$$

Application 2: Changing specification

. reg lpat lrdi lassets

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  2,   428) =   38.54
       Model |  97.0696447     2  48.5348224           Prob > F      =  0.0000
    Residual |  538.941858   428  1.25920995           R-squared     =  0.1526
-------------+------------------------------           Adj R-squared =  0.1487
       Total |  636.011503   430  1.47909652           Root MSE      =  1.1221

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lrdi |   .6461714   .0868021     7.44   0.000       .47556    .8167828
     lassets |   .2749477   .0337246     8.15   0.000     .2086614    .3412341
       _cons |  -.5909529   .3903255    -1.51   0.131    -1.358146    .1762404
------------------------------------------------------------------------------
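The outputs of Applications 1 and 2 can be reconciled with a short derivation. It assumes that lrdi is the log of R&D intensity, that is lrd minus lassets, which the variable names suggest but the slides do not state explicitly.

```latex
\begin{aligned}
pat &= \alpha + \beta_1 \log\left(\frac{RD}{SIZE}\right) + \beta_2\, size + u \\
    &= \alpha + \beta_1\, (rd - size) + \beta_2\, size + u \\
    &= \alpha + \beta_1\, rd + (\beta_2 - \beta_1)\, size + u \\[4pt]
\hat\beta_2 - \hat\beta_1 &= 0.2749477 - 0.6461714 = -0.3712237
\end{aligned}
```

The difference matches the lassets coefficient of Application 1. Under this assumption the two specifications are the same model written differently, which is why the R-squared, the constant and the R&D coefficient are identical and only the reading of the size coefficient changes.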

The knowledge production function

Application 3: Adding variables

$$PAT = f(RD, SIZE, SPE)$$

$$PAT = A \cdot \left(\frac{RD}{SIZE}\right)^{\beta_1} \cdot SIZE^{\beta_2} \cdot \exp(\beta_3 SPE + u)$$

$$pat = \alpha + \beta_1 \log\left(\frac{RD}{SIZE}\right) + \beta_2 \, size + \beta_3 \, SPE + u$$

Application 3: Adding variables

. reg lpat lrdi lassets spe

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  3,   427) =   28.38
       Model |  105.748034     3  35.2493446           Prob > F      =  0.0000
    Residual |  530.263469   427  1.24183482           R-squared     =  0.1663
-------------+------------------------------           Adj R-squared =  0.1604
       Total |  636.011503   430  1.47909652           Root MSE      =  1.1144

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lrdi |    .670643   .0866968     7.74   0.000     .5002375    .8410485
     lassets |   .2736255   .0334948     8.17   0.000     .2077903    .3394608
         spe |    .423136   .1600635     2.64   0.009     .1085255    .7377464
       _cons |  -.4877403   .3895845    -1.25   0.211    -1.253482    .2780017
------------------------------------------------------------------------------
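The summary statistics in the header of a regression output all follow from the sums of squares and degrees of freedom. A minimal sketch, with a hypothetical helper name `anova_summary`, recomputes them from the figures reported for Application 3:

```python
import math


def anova_summary(ss_model, ss_resid, n, k):
    """Recompute fit statistics for an OLS regression with n observations
    and k regressors (constant excluded) from its sums of squares."""
    ss_total = ss_model + ss_resid
    df_resid = n - k - 1
    ms_resid = ss_resid / df_resid          # estimate of the error variance
    return {
        "r2": ss_model / ss_total,
        "adj_r2": 1 - ms_resid / (ss_total / (n - 1)),
        "F": (ss_model / k) / ms_resid,
        "root_mse": math.sqrt(ms_resid),    # standard error of the regression
    }


# Sums of squares taken from the Application 3 output
fit = anova_summary(ss_model=105.748034, ss_resid=530.263469, n=431, k=3)
for name, value in fit.items():
    print(f"{name}: {value:.4f}")
```

The Root MSE is the square root of the residual mean square, which ties the output back to Theorem 3.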

Qualitative variables used as independent variables


Qualitative variables as indep. variables

Qualitative variables

Dummy variables

Generating dummy variables using STATA

Interpretation of coefficients in OLS

Interaction effects between continuous and dummy var.

Qualitative variables

Qualitative variables provide information on discrete characteristics.

The number of categories taken by qualitative variables is generally small.

These can be numerical values, but each number denotes an attribute, a characteristic.

A qualitative variable may have several categories:

Two categories: male – female

Three categories: nationality (French, German, Turkish)

More than three categories: sectors (car, chemical, steel, electronic equip., etc.)

Qualitative variables

There are several ways to code qualitative variables with n categories:

Using one categorical variable

Producing n - 1 dummy variables

A dummy variable is a variable which takes values 0 or 1.

We also call them binary variables.

We also call them dichotomous variables.

Qualitative variables

Coding using one categorical variable

Two categories: we generate a categorical variable called “gender” set to 1 if the observation is a female, 2 if the observation is a male.

Three categories: we generate a categorical variable called “country” set to 1 if the observation is French, 2 if the observation is German, 3 if the observation is Turkish.

More than three categories: we generate a categorical variable called “sector” set to 1 if the observation is in the car industry, 2 for the chemical industry, 3 for the steel industry, 4 for the electronic equipment industry, etc.

This requires the use of labels in order to know to which category a given number pertains.

Labelling variables

Labelling is tedious, boring and uninteresting.

But there are clear consequences when one must interpret the results.

label variable: describes a variable, qualitative or quantitative

    label variable asset "real capital"

label define: defines a label (meaning of numbers)

    label define firm_type 1 "biotech" 0 "Pharma"

label values: applies the label to a given variable

    label values type firm_type

Labelling example

*************************************************************************************
******* CREATING INDUSTRY LABELS *********
*************************************************************************************

egen industrie = group(isic_oecd)

#delimit ;
label define induscode 1 "Text. Habill. & Cuir"
    2 "Bois"
    3 "Pap. Cart. & Imprim."
    4 "Coke Raffin. Nucl."
    5 "Chimie"
    6 "Caoutc. Plast."
    7 "Aut. Prod. min."
    8 "Métaux de base"
    9 "Travail des métaux"
    10 "Mach. & Equip."
    11 "Bureau & Inform."
    12 "Mach. & Mat. Elec."
    13 "Radio TV Telecom."
    14 "Instrum. optique"
    15 "Automobile"
    16 "Aut. transp."
    17 "Autres";
#delimit cr

label values industrie induscode


Exercise

1. Open SKEMA_BIO.dta

2. Create variable firm_type from type

3. Label variable firm_type

4. Define a label for firm_type and apply it

Dummy variables

Coding categorical variables using dummy variables only

Two categories.

We generate one dummy variable “female” set to 1 if the obs. is a female, 0 otherwise.

We generate one dummy variable “male” set to 1 if the obs. is a male, 0 otherwise.

But one of the dummy variables is simply redundant. When female = 0, then necessarily male = 1 (and vice versa).

Hence with two categories, we only need one dummy variable.

Dummy variables

Coding categorical variables using dummy variables only

Three categories.

We generate one dummy variable “France” set to 1 if the obs. is French, 0 otherwise.

We generate one dummy variable “Germany” set to 1 if the obs. is German, 0 otherwise.

We generate one dummy variable “Turkish” set to 1 if the obs. is Turkish, 0 otherwise.

But one of the dummy variables is simply redundant. When France = 0 and Germany = 0, then necessarily Turkish = 1.

For a variable with n categories, we must create n - 1 dummy variables, each representing one particular category.

Generation of dummies with STATA

Using the if condition:

    generate DEU = 0
    replace DEU = 1 if country == "GERMANY"

    generate LDF = 1 if size > 100
    replace LDF = 0 if size < 101

Avoiding the use of the if condition:

    generate FRA = country == "FRANCE"
    generate LDF = size > 100

Generation of dummies with STATA

With n categories and n being large, generating dummy variables can become really tedious.

The tabulate command has a very convenient option, gen(), which generates n dummy variables at once:

    tabulate varcat, gen(v_)

    tabulate country, gen(c_)

This will create n dummy variables, with n being the number of countries in the dataset, c_1 being the first country, c_2 the second, c_3 the third, etc.
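The logic of generating one dummy per category can be sketched outside Stata as well. The helper `make_dummies` and the four observations below are hypothetical and only mimic the naming pattern of the gen() option, with categories numbered in sorted order.

```python
def make_dummies(values, prefix):
    """Return one 0/1 list per distinct category, named prefix1, prefix2, ...
    with categories numbered in sorted order."""
    categories = sorted(set(values))
    return {
        f"{prefix}{j}": [1 if v == c else 0 for v in values]
        for j, c in enumerate(categories, start=1)
    }


# Hypothetical sample of four observations
country = ["FRANCE", "GERMANY", "TURKEY", "FRANCE"]
dummies = make_dummies(country, "c_")

# Each observation belongs to exactly one category, so the dummies sum to 1.
row_sums = [sum(col[i] for col in dummies.values()) for i in range(len(country))]
```

Because the n dummies always sum to one, including all of them together with a constant would violate OLS3 (perfect collinearity). This is why only n - 1 of them enter the regression.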

Reading coefficients of dummy variables

Remember! A coefficient tells us the increase in y associated with a one-unit increase in x, other things held constant (ceteris paribus).

Suppose the knowledge production function is

$$y = \alpha + \beta \, biotech + u$$

with “y” being the number of patents and “biotech” being a dummy variable set to 1 for biotech firms, 0 otherwise.

Reading coefficients of dummy variables

If the firm is a biotech company, then the dummy variable “biotech” is equal to unity. Hence:

$$\hat y = \hat\alpha + \hat\beta \times 1 = \hat\alpha + \hat\beta$$

If the firm is a pharma company, then the dummy variable “biotech” is equal to zero. Hence:

$$\hat y = \hat\alpha + \hat\beta \times 0 = \hat\alpha$$

Reading coefficients of dummy variables

The coefficient reads as the variation in the dependent variable when the dummy variable is set to 1 relative to the situation where the dummy variable is set to 0.

With two categories, I must introduce one dummy variable.

With three categories, I must introduce two dummy variables.

With n categories, I must introduce (n-1) dummy variables.
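A small numerical sketch shows what this reading implies when the dummy is the only regressor. The patent counts below are invented for illustration, and `simple_ols` is a hypothetical helper implementing the textbook one-regressor formulas.

```python
from statistics import mean


def simple_ols(x, y):
    """Intercept and slope of a regression of y on a constant and x."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((a - x_bar) ** 2 for a in x))
    return y_bar - slope * x_bar, slope


# Hypothetical data: 3 biotech firms and 4 pharma firms
biotech = [1, 1, 1, 0, 0, 0, 0]
pat = [12, 8, 10, 3, 5, 4, 4]

alpha_hat, beta_hat = simple_ols(biotech, pat)
mean_pharma = mean([p for p, b in zip(pat, biotech) if b == 0])
mean_biotech = mean([p for p, b in zip(pat, biotech) if b == 1])
```

In this setting the constant equals the mean of the reference group (pharma) and the dummy coefficient equals the difference between the two group means.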

Exercise

1. Regress the following model:

$$PAT = \alpha + \beta \, biotech + u$$

2. Predict the number of patents for both biotech and pharma companies

3. Produce descriptive statistics of PAT for each type of company using the command table

4. What do you observe?

Reading coefficients of dummy variables

For semi-logarithmic forms (log Y), 100 × β must be read as an approximation of the percent change in Y associated with a variation of 1 unit of the explanatory variable.

This approximation is acceptable for small β (|β| < 0.1). When β is large (|β| ≥ 0.1), the exact percent change in Y is:

$$100 \times (e^{\beta} - 1)$$
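The gap between the approximation and the exact formula is easy to tabulate. In the sketch below the two values of β are arbitrary illustrations, and the function name is hypothetical.

```python
import math


def exact_percent_change(beta):
    """Exact percent change in Y for a one-unit change in x when the
    dependent variable is log Y."""
    return 100 * (math.exp(beta) - 1)


for beta in (0.05, 0.5):          # illustrative coefficient values
    approx = 100 * beta           # the usual reading of the coefficient
    exact = exact_percent_change(beta)
    print(f"beta = {beta}: approx {approx:.1f}%, exact {exact:.2f}%")
```

For the small coefficient the two readings differ by about a tenth of a percentage point. For the larger one the approximation understates the change by roughly fifteen points, which is why the exact formula matters for sizeable dummy coefficients.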

The knowledge production function

Application 4: dummy variable

$$PAT = f(RD, SIZE, SPE, BIO)$$

$$PAT = A \cdot \left(\frac{RD}{SIZE}\right)^{\beta_1} \cdot SIZE^{\beta_2} \cdot \exp(\beta_3 SPE + \beta_4 BIO + u)$$

$$pat = \alpha + \beta_1 \log\left(\frac{RD}{SIZE}\right) + \beta_2 \, size + \beta_3 \, SPE + \beta_4 \, BIO + u$$

Application 4: dummy variable

. reg lpat lrdi lassets spe biotech

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  4,   426) =   50.24
       Model |  203.874406     4  50.9686015           Prob > F      =  0.0000
    Residual |  432.137097   426  1.01440633           R-squared     =  0.3206
-------------+------------------------------           Adj R-squared =  0.3142
       Total |  636.011503   430  1.47909652           Root MSE      =  1.0072

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lrdi |   .4924169   .0804249     6.12   0.000     .3343379     .650496
     lassets |   .5558656   .0417126    13.33   0.000     .4738775    .6378537
         spe |   .4212942   .1446661     2.91   0.004      .136946    .7056423
     biotech |   1.657062   .1684813     9.84   0.000     1.325904     1.98822
       _cons |  -5.464644   .6164752    -8.86   0.000    -6.676356   -4.252932
------------------------------------------------------------------------------

Application 4: dummy variable

[Figure: ln(PAT) plotted against size. Two parallel lines with the same slope $\hat\beta_2$: Pharma, $\hat\alpha + \hat\beta_2 \, size$, and Biotech, $\hat\alpha + \hat\beta_4 + \hat\beta_2 \, size$. The vertical gap between the two lines is $\hat\beta_4$.]

The knowledge production function

Application 5: Interacting variables

$$PAT = f(RD, SIZE, SPE, BIO)$$

$$PAT = A \cdot \left(\frac{RD}{SIZE}\right)^{\beta_1} \cdot SIZE^{\beta_2} \cdot \exp(\beta_3 SPE + \beta_4 BIO + \beta_5 \, BIO \times size + u)$$

$$pat = \alpha + \beta_1 \log\left(\frac{RD}{SIZE}\right) + \beta_2 \, size + \beta_3 \, SPE + \beta_4 \, BIO + \beta_5 \, BIO \times size + u$$

Application 5: Interacting variables

. reg lpat lrdi lassets spe biotech bio_assets

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  5,   425) =   41.02
       Model |  207.026736     5  41.4053471           Prob > F      =  0.0000
    Residual |  428.984767   425  1.00937592           R-squared     =  0.3255
-------------+------------------------------           Adj R-squared =  0.3176
       Total |  636.011503   430  1.47909652           Root MSE      =  1.0047

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lrdi |   .4742035   .0808846     5.86   0.000     .3152199    .6331871
     lassets |    .619805   .0551395    11.24   0.000     .5114249    .7281852
         spe |   .4131693   .1443802     2.86   0.004     .1293812    .6969573
     biotech |   3.592252   1.107872     3.24   0.001      1.41466    5.769843
  bio_assets |  -.1435349    .081221    -1.77   0.078    -.3031798    .0161099
       _cons |  -6.482948   .8427254    -7.69   0.000    -8.139376   -4.826519
------------------------------------------------------------------------------

Application 5: Interacting variables

[Figure: ln(PAT) plotted against size. Two lines with different intercepts and slopes: Pharma, $\hat\alpha + \hat\beta_2 \, size$, with slope $\hat\beta_2$, and Biotech, $\hat\alpha + \hat\beta_4 + (\hat\beta_2 + \hat\beta_5) \, size$, with slope $\hat\beta_2 + \hat\beta_5$.]
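Plugging the Application 5 estimates into the two lines gives the implied intercepts and size slopes. This is a worked sketch that holds lrdi and spe fixed at zero, so that their terms drop out of both lines.

```latex
\begin{aligned}
\text{Pharma:}\quad  & \hat\alpha = -6.482948, \qquad \hat\beta_2 = 0.619805 \\
\text{Biotech:}\quad & \hat\alpha + \hat\beta_4 = -6.482948 + 3.592252 = -2.890696 \\
                     & \hat\beta_2 + \hat\beta_5 = 0.619805 - 0.1435349 = 0.4762701
\end{aligned}
```

Biotech firms start from a higher intercept but have a flatter size slope. Since the interaction coefficient has a p-value of 0.078, the difference in slopes is significant at the 10% level only.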


Specification Tests

Specification Tests for Multiple OLS

The knowledge production function

$$PAT = f(RD, SIZE, SPE)$$

$$PAT = A \cdot \left(\frac{RD}{SIZE}\right)^{\beta_1} \cdot SIZE^{\beta_2} \cdot \exp(\beta_3 SPE + u)$$

$$pat = \alpha + \beta_1 \log\left(\frac{RD}{SIZE}\right) + \beta_2 \, size + \beta_3 \, SPE + u$$

Specification Tests for Multiple OLS

Critical probability α: the probability of rejecting H0 when H0 is true, Pr(accept Ha | H0 true) = α

Student t test: concerning the significance of one parameter

Fisher F test: concerning the significance of several parameters simultaneously (Wald test)

Non-linear restriction test: testing for non-linear relationships between parameters

Specification Tests for Multiple OLS

Testing linear combinations of parameters

Concerning one parameter only

H0 : lassets = 0.30

    test lassets = 0.30

Test on several parameters

H0 : lassets = 0.30 and lrdi = 0.70

    test (lassets = 0.3) (lrdi = 0.7)

H0 : lrdi = 2 * lassets

    test lrdi = 2 * lassets

H0 : lrdi + lassets = 1

    test lrdi + lassets = 1

    lincom _b[lrdi] + _b[lassets] - 1

Specification Tests for Multiple OLS

Testing non-linear combinations of parameters

Test on several parameters

H0 : lrdi * lassets = 0.2

    testnl _b[lrdi] * _b[lassets] = 0.2

    nlcom _b[lrdi] * _b[lassets] - 0.2

Review of Assumptions

OLS assumption               | Consistency when violated | Efficiency when violated                | Test
-----------------------------+---------------------------+-----------------------------------------+------------------------------------
OLS1 Linearity               | -                         | -                                       | -
OLS2 Random Sampling         | Biased β                  | None                                    | None. Redo sampling & estimation
OLS3 No perfect Collinearity | -                         | -                                       | -
OLS4 Zero Conditional Mean   | Biased β                  | Poorly estimated variance of β          | Link test, Omitted Variable test
OLS5 Homoskedasticity        | None                      | Underestimated variance of β            | Breusch-Pagan test
OLS6 Normality of error term | None                      | Lack of reliability of the t test for β | Shapiro-Wilk test

Specification Tests for Multiple OLS

Specification tests on the validity of assumptions

Hypothesis OLS5: Homoskedasticity of residuals

Rule of thumb using graphs. Stata instruction: rvfplot

White test. Stata instruction: estat imtest

Breusch-Pagan test. Stata instruction: estat hettest

Specification Tests for Multiple OLS

Specification tests on the validity of assumptions

Hypothesis OLS5: Homoskedasticity of residuals: rvfplot

[Figure: rvfplot output. Residuals (vertical axis, from -2 to 3) plotted against fitted values (horizontal axis, from -1 to 3).]

Specification Tests for Multiple OLS

Specification tests on the validity of assumptions

Hypothesis OLS5: Homoskedasticity of residuals: estat imtest

. imtest

Cameron & Trivedi's decomposition of IM-test

              Source |       chi2     df      p
---------------------+-----------------------------
  Heteroskedasticity |      21.74      9    0.0097
            Skewness |       3.05      3    0.3840
            Kurtosis |      15.55      1    0.0001
---------------------+-----------------------------
               Total |      40.34     13    0.0001

Specification Tests for Multiple OLS

Specification tests on the validity of assumptions

Hypothesis OLS5: Homoskedasticity of residuals: estat hettest

. hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: fitted values of lpat

         chi2(1)      =     2.83
         Prob > chi2  =   0.0927

Specification Tests for Multiple OLS

Specification tests on the validity of assumptions

Hypothesis OLS6: Normality of residuals

Rule of thumb using graphs. Stata instruction:

    predict res, residual
    kdensity res, normal

Formally using the Shapiro-Wilk test. Stata instruction:

    predict res, residual
    swilk res

Specification Tests for Multiple OLS

Specification tests on the validity of assumptions

Hypothesis OLS6: Normality of residuals: kdensity

[Figure: kernel density estimate of the residuals (horizontal axis, from -4 to 4; density from 0 to .4) with the normal density overlaid. kernel = epanechnikov, bandwidth = 0.2971.]

Specification Tests for Multiple OLS

Specification tests on the validity of assumptions

Hypothesis OLS6: Normality of residuals

. swilk res

                   Shapiro-Wilk W test for normal data

    Variable |    Obs       W           V         z       Prob>z
-------------+--------------------------------------------------
         res |    431    0.98688      3.862     3.226    0.00063

Specification Tests for Multiple OLS

Specification tests on the validity of assumptions

There are no omitted variables (OLS4 on endogeneity)

Link test: Stata instruction linktest

    Regress the DV on the prediction and its squared value

    Variable _hat must be significant, but not _hatsq

Ramsey RESET test: Stata instruction ovtest

    By default, regress the DV on the regressors and powers (up to 4) of the fitted values of the LHS variable

    With option rhs, regress the DV on powers (up to 4) of the RHS variables

Specification Tests for Multiple OLS

Specification tests on the validity of assumptions

There are no omitted variables (OLS4 on endogeneity): linktest

. quietly: regress lpat lrdi lassets spe

. linktest

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  2,   428) =   45.93
       Model |  112.393289     2  56.1966447           Prob > F      =  0.0000
    Residual |  523.618213   428  1.22340704           R-squared     =  0.1767
-------------+------------------------------           Adj R-squared =  0.1729
       Total |  636.011503   430  1.47909652           Root MSE      =  1.1061

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   .2055605   .3574387     0.58   0.566    -.4969932    .9081141
      _hatsq |   .2707472   .1161699     2.33   0.020     .0424126    .4990817
       _cons |   .4943074   .2887035     1.71   0.088    -.0731457    1.061761
------------------------------------------------------------------------------

Specification Tests for Multiple OLS

Specification tests on the validity of assumptions

There are no omitted variables (OLS4 on endogeneity): ovtest

. quietly: regress lpat lrdi lassets spe

. ovtest

Ramsey RESET test using powers of the fitted values of lpat
       Ho:  model has no omitted variables
                 F(3, 424) =      2.34
                  Prob > F =      0.0732

$$F_{k-1,\; n-m-k} = \frac{(R_1^2 - R_0^2) / (k - 1)}{(1 - R_1^2) / (n - m - k)}$$

Exercise

1. Regress the following model:

$$pat = \alpha + \beta_1 \log\left(\frac{RD}{SIZE}\right) + \beta_2 \, size + \beta_3 \, SPE + \beta_4 \, BIO + u$$

2. Assuming OLS1-3 to be correct, test OLS4-6 and conclude

    1. OLS4 on specification using linktest and ovtest

    2. OLS5 on homoskedasticity using imtest and hettest

    3. OLS6 on normality of errors using kdensity and the swilk test