TRANSCRIPT
Research Methodology: Tools
Applied Data Analysis (with SPSS)
Lecture 06: Introduction to Regression Analysis
April 2014
Prof. Dr. Jürg Schwarz
Lic. phil. Heidi Bruderer Enzler
MSc Business Administration
Slide 2
Contents
Aims of the Lecture ................................................. 3
Typical Syntax ...................................................... 4
Introduction ........................................................ 5
    Example ......................................................... 5
Outline and Concept of Regression Analysis .......................... 8
    Key Steps in Regression Analysis ............................... 10
    General Purpose of Regression .................................. 11
    Scatterplots Illustrating Causal Relationships ................. 12
    Models ......................................................... 13
    Ordinary Least Squares (OLS) Estimates of β0 and β1 ............ 15
    Gauss-Markov Theorem, Independence and Normal Distribution ..... 18
Regression Analysis with SPSS: Detailed Examples ................... 21
    Simple Example ................................................. 21
    Example with Nonlinearity of the Independent Variable .......... 33
Slide 3
Aims of the Lecture
You will understand the key steps in conducting a regression analysis.
You will understand the stochastic model of the regression analysis.
You will understand the ordinary least squares (OLS) method.
You will understand the 5 Gauss-Markov assumptions for OLS estimators.
You will be able to conduct a regression analysis with SPSS.
In particular, you will know how to
◦ interpret the output
◦ describe the output
You will know how to use the regression equation to estimate values of the dependent variable.
Slide 4
Typical Syntax
Annotations on the slide: dependent variable weight · independent variable height · scatterplot of residuals · histogram of residuals
Scatterplot of variables height (X-axis) and weight (Y-axis):

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=height weight MISSING=LISTWISE
    REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: height=col(source(s), name("height"))
  DATA: weight=col(source(s), name("weight"))
  GUIDE: axis(dim(1), label("height (in cm)"))
  GUIDE: axis(dim(2), label("weight (in kg)"))
  ELEMENT: point(position(height*weight))
END GPL.
Regression analysis of weight on height:

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT weight
  /METHOD=ENTER height
  /SCATTERPLOT=(*ZRESID ,*ZPRED)
  /RESIDUALS HISTOGRAM(ZRESID).
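For readers who want to reproduce the same fit outside SPSS, here is a minimal Python sketch of the REGRESSION command above. The height/weight values are made-up illustration data, not the course dataset, and Python is not the tool used in this course.

```python
# Sketch of an OLS regression of weight on height using numpy only.
# The data below are hypothetical example values (not the lecture's dataset).
import numpy as np

height = np.array([165.0, 170.0, 172.0, 175.0, 180.0, 182.0, 185.0, 190.0])  # cm
weight = np.array([60.0, 66.0, 68.0, 72.0, 78.0, 80.0, 84.0, 90.0])          # kg

# Design matrix with a constant column (the SPSS /NOORIGIN default
# means the model includes an intercept).
X = np.column_stack([np.ones_like(height), height])
beta, *_ = np.linalg.lstsq(X, weight, rcond=None)
b0, b1 = beta

fitted = b0 + b1 * height
residuals = weight - fitted
print(f"weight = {b0:.3f} + {b1:.3f} * height")
```

With an intercept in the model, the OLS residuals sum to zero, which is a quick sanity check on the fit.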
Slide 5
Introduction
Example
Market research: Competition analysis
[Scatterplot: Number of employees [#] (X-axis, 0–50) vs. sales volume [USD million] (Y-axis, 0–10)]
Dataset
Sample of n = 10 companies
Variables for
◦ number of employees (empl)
◦ sales volume (sales)
Typical questions
Is there a linear relation between
employee number and sales volume?
What is the predicted sales volume of
a company with 12 employees?
Slide 6
Questions
Question in everyday language:
Does sales volume depend on the number of employees?
Research question:
Does the number of employees have an impact on sales volume?
How strong is the influence of the number of employees?
Is there a model?
Is regression analysis the right model?
Statistical question:
H0: "No model" (= No overall model and no significant coefficients)
HA: "Model" (= Overall model and significant coefficients)
Can we reject H0?
sales = β0 + β1 ⋅ empl + u where sales = dependent variable
empl = independent variable
β0, β1 = coefficients
u = error term
Slide 7
Results
The overall model is significant (F(1,8) = 36.276, p < .001).
The model explains 81.9% of the variance of sales volume (R2 = .819).
Slide 8
Both coefficients are significant (p < .05). Thus the model can be described as:
sales = 2.448 + 0.111 ⋅ empl
One additional employee increases
sales volume by .111 million USD.
Predicted sales volume of a company
with 12 employees:
3.78 million USD (= 2.448 + .111 ⋅ 12)
[Scatterplot with fitted regression line: Number of employees [#] (X-axis, 0–50) vs. sales volume [USD million] (Y-axis, 0–10)]
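The prediction above is plain arithmetic and can be verified directly, using the coefficients estimated on the slide:

```python
# Coefficients from the estimated model on the slide:
# sales = 2.448 + 0.111 * empl  (sales in USD million)
b0, b1 = 2.448, 0.111

def predict_sales(empl):
    """Predicted sales volume (USD million) for a given number of employees."""
    return b0 + b1 * empl

print(round(predict_sales(12), 2))  # 3.78
```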
Slide 9
Outline and Concept of Regression Analysis
Basic situation
Given: Two variables with a metric scale
Task: Find a relationship between the variables
Regression analysis
Postulation of a linear model: y = β0 + β1 ⋅ x + u
Regression analysis
◦ Examines the relationship between the dependent variable y and the independent variable x.
◦ Uses inferential statistics to estimate the parameters β0 and β1.
Note
◦ In many cases the relation is assumed to be a cause-and-effect relation. Be aware that this requires strong theoretical or empirical evidence.
◦ "Regression" stands for going backwards from the dependent variable y to the determinant x. Therefore, we speak of "regression of y on x".
Slide 10
Key Steps in Regression Analysis
1. Formulation of the model
◦ Common sense (remember the example with storks and babies)
◦ Linearity of relationship plausible
◦ Not too many variables (Principle of parsimony: Simplest solution to a problem)
2. Estimation of the model
◦ Estimation of the model by means of OLS estimation (ordinary least squares)
◦ Decision on procedure: Enter or stepwise regression
3. Verification of the model
◦ Is the model as a whole significant? (i.e. are the coefficients significant as a group?) → F-test
◦ Are the regression coefficients significant? → t-tests (should be performed only if F-test is significant)
◦ How much variation does the regression equation explain? → Coefficient of determination (adjusted R-squared)
4. Considering other aspects (for example, multicollinearity)
5. Testing of assumptions (Gauss-Markov, independence and normal distribution)
6. Interpretation of the model and reporting
Text in italics: Only important in the case of multiple regression – see next lecture.
Slide 11
General Purpose of Regression
◦ Cause analysis
State a relationship between independent variables and the dependent variable.
Example
Is there a model that describes the dependence between sales volume and employee number, or do these two variables just form a random pattern?
◦ Impact analysis
Assess the impact of the independent variable on the dependent variable.
Example
If the number of employees increases, sales volume also increases: How strong is the impact? By how much will sales increase with each additional employee?
◦ Prediction
Predict the values of a dependent variable using new values for the independent variable.
Example
What is the predicted sales volume of a company with 12 employees?
Slide 12
Illustrating Causal Relationships – Scatterplots and Equation
When displaying two variables with a causal relationship in a scatterplot, it is standard practice
to show the "Cause" variable on the X-axis and the "Effect" variable on the Y-axis.
[Scatterplot: "Cause" on the X-axis (0–50), "Effect" on the Y-axis (0–10)]
In the regression equation y is the dependent variable and x is the independent variable.
The labels of the variables vary according to the field of application.
y                      x
Dependent Variable     Independent Variable
Explained Variable     Explanatory Variable
Response Variable      Control Variable
Predicted Variable     Predictor
Regressand             Regressor
"Eff
ect"
(Y
-axis
)
"Cause" (X-axis)
Slide 13
Models
Mathematical model
The linear model describes y as a function of x
y = β0 + β1 ⋅ x     (equation of a straight line)
The variable y is a linear function of the variable x.
β0 (intercept, constant)
The point where the regression line crosses the Y-axis.
The value of the dependent variable when all of the independent variables = 0.
β1 (regression coefficient)
The increase in the dependent variable per unit change in the
independent variable (also known as "the rise over the run", slope)
Stochastic model
y = β0 + β1 ⋅ x + u
The error term u comprises all factors (other than x) that affect y.
These factors are treated as being unobservable.
→ u stands for "unobserved"
[Figure: regression line with intercept β0 and slope β1 = Δy/Δx]
Slide 14
Stochastic model – Assumptions related to the error term
The error term u is (must be):
◦ independent of the explanatory variable x
◦ normally distributed with mean 0 and variance σ2: u ~ N(0, σ2)
E(y) = β0 + β1 ⋅ x
Based on: Wooldridge, Jeffrey (2011): Introductory econometrics. 5th Edition. [S.l.]: South-Western.
Slide 15
Ordinary Least Squares (OLS) Estimates of β0 and β1
Step by step
Given the population model
y = β0 + β1 ⋅ x + u
Given a random sample {(xi, yi): i = 1, ..., n} for which the sample regression model holds:
yi = β̂0 + β̂1 ⋅ xi + ûi
The notation β̂0, β̂1, read as "beta hat",
emphasizes that the coefficients are
estimates based on the sample.
Fitted values and residuals
Calculation of the fitted values
ŷi = β̂0 + β̂1 ⋅ xi
Calculation of residuals
ûi = yi − ŷi = yi − β̂0 − β̂1 ⋅ xi
Slide 16
Choose β̂0 and β̂1 such that for the observations (i = 1, ..., n) the following condition holds:

Σi=1..n (ûi)² = Σi=1..n (yi − β̂0 − β̂1 ⋅ xi)² = minimum
The squares drawn in the figure correspond to
the squared residuals and the total area of the
squares should be minimized.
The method used for this minimization is called
the ordinary least squares method (OLS).
Taking derivatives
A necessary and in this case also sufficient condition for β̂0 and β̂1
to solve the minimization problem is that the partial derivatives must be zero:

δ(Σi=1..n ûi²)/δβ̂0 = −2 ⋅ Σi=1..n (yi − β̂0 − β̂1 ⋅ xi) = 0
δ(Σi=1..n ûi²)/δβ̂1 = −2 ⋅ Σi=1..n xi ⋅ (yi − β̂0 − β̂1 ⋅ xi) = 0
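These first-order conditions can be checked numerically: at the OLS solution, both derivative sums vanish. The following Python sketch uses a small hypothetical sample (not course data):

```python
# Verify the OLS first-order conditions on a hypothetical sample.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form OLS estimates
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
u = y - b0 - b1 * x  # residuals

# Both partial derivatives of the sum of squared residuals are zero
d_b0 = -2 * np.sum(u)
d_b1 = -2 * np.sum(x * u)
```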
Slide 17
Resulting estimates of β0 and β1
The solutions β̂0, β̂1 are called ordinary least squares (OLS) estimates of β0 and β1:

β̂1 = Σi=1..n (xi − x̄)(yi − ȳ) / Σi=1..n (xi − x̄)² = σx,y / σx²     (covariance of x and y divided by variance of x)
β̂0 = ȳ − β̂1 ⋅ x̄

Ordinary least squares (OLS) estimates provide the best explanation/prediction of the data.
β̂0, β̂1 are the "correct" coefficients to describe the data from the sample.
The population regression equation y = β0 + β1 ⋅ x + u is still unknown.
Thus β̂0, β̂1 are only estimates of the "true" set of coefficients in the population.
Another data sample will give a different regression equation, which may or may not be closer to the population regression equation.
Note: The notation β̂0, β̂1 emphasizes that the coefficients are estimates.
This notation will be replaced below by β0, β1 because it is clear that we always use estimates.
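The "covariance over variance" identity is easy to confirm in Python (a sketch with hypothetical data; numpy's built-in least-squares fit serves as the cross-check):

```python
# beta1 = cov(x, y) / var(x), checked against numpy's own fit.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.3, 2.9, 4.1, 5.2, 5.8])

# The ddof choice cancels in the ratio, as long as it matches in both terms
cov_xy = np.cov(x, y, ddof=1)[0, 1]
var_x = np.var(x, ddof=1)
b1 = cov_xy / var_x
b0 = y.mean() - b1 * x.mean()

# Cross-check: np.polyfit(deg=1) returns [slope, intercept]
b1_np, b0_np = np.polyfit(x, y, deg=1)
```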
Slide 18
Gauss-Markov Theorem, Independence and Normal Distribution
Under the 5 Gauss-Markov assumptions the OLS estimator is the best linear unbiased estimator of the true parameters βi, given the present sample.
→ The OLS estimator is BLUE.
1. Linear in coefficients y = β0 + β1 ⋅ x + u
2. Random sample of n observations {(xi, yi): i = 1, ..., n}
3. Zero conditional mean:
The error u has an expected value of 0,
given any values of the explanatory variable
E(u|x) = 0
4. Sample variation in explanatory variables.
The xi’s are not constant and not all the same.
x ≠ const
x1, x2, ..., xn are not all the same
5. Homoscedasticity:
The error u has the same variance given any value of the
explanatory variable.
Var(u|x) = σ2
Independence and normal distribution of error: u ~ Normal(0, σ2)
These assumptions need to be tested – among other things by analyzing the residuals.
Based on: Wooldridge, Jeffrey (2011): Introductory econometrics. 5th Edition. [S.l.]: South-Western.
Slide 19
Gauss-Markov assumption on linearity in coefficients
The Gauss-Markov assumption is that the model of the population must be linear in the parameters.
Standard case: OLS [y = β0 + β1 ⋅ x + u]
Examples for violations of this Gauss-Markov assumption (non-linear parameters):
OLS ≠ [y = β0 + β1² ⋅ x]
OLS ≠ [y = β0 + ln(β1) ⋅ x]
→ In these examples, the OLS estimators are not BLUE.
Non-linear variables, however, are possible and often very useful.
For example, if the percentage increase of y is constant given one unit more of x,
then a model of the following form is appropriate:
OLS ≈ [ln(y) = β0 + β1 ⋅ x]
Further examples:
OLS ≈ [y = β0 + β1 ⋅ x + β2 ⋅ x²]
OLS ≈ [y = β0 + β1 ⋅ ln(x)]
Slide 20
Example of non-linearity in variables (not problematic!)
Percentage increase of y: If x increases by one unit (∆x), y increases by a constant percentage (%∆y).
If y is logarithmized, the relationship becomes linear: If x increases by one unit, ln(y) increases by a constant amount (β1).
This model is both linear in variables (ln(y), x) and linear in coefficients (β0, β1).

y = exp(β0 + β1 ⋅ x)     ln(y) = ln(exp(β0 + β1 ⋅ x)) = β0 + β1 ⋅ x

[Figure: left panel shows the exponential curve of y against x; right panel shows the straight line of ln(y) against x after the log transform]
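The log-linearization can be sketched numerically: an exponential relationship becomes exactly linear after taking logarithms. The coefficients below are hypothetical illustration values.

```python
# Exponential model y = exp(b0 + b1*x) becomes linear after a log transform.
import numpy as np

b0, b1 = 0.5, 0.2                 # hypothetical coefficients
x = np.linspace(0.0, 10.0, 50)
y = np.exp(b0 + b1 * x)           # constant percentage growth in y

log_y = np.log(y)                 # ln(y) = b0 + b1*x, exactly linear in x
slope = (log_y[-1] - log_y[0]) / (x[-1] - x[0])
```

Every unit step in x changes ln(y) by the same amount b1, which is the defining property of a linear relationship.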
Slide 21
Regression Analysis with SPSS: Detailed Examples
Simple Example
Sample of 99 men with
information on body height and weight
Step 1: Formulation of the model
Regression equation of weight on height
weight = β0 + β1 ⋅ height + u

weight = dependent variable
height = independent variable
β0, β1 = coefficients
u = error term
The scatterplot confirms that there could be a
linear relationship between weight and height.
Slide 22
Step 2: Estimation of the model
SPSS: Analyze > Regression > Linear...
Slide 23
Step 3: Verification of the model – The F-test
The null hypothesis (H0) is that there is no effect of height.
The alternative hypothesis (HA) is that this is not the case.
H0: β1 = 0 (Multiple Regression => H0: β1 = β2 = ... = βp = 0)
HA: β1 ≠ 0 (Multiple Regression => HA: βj ≠ 0 for at least one value of j)
Empirical F-value and the appropriate p-value ("Sig.") are computed by SPSS.
In the example, we can reject H0 in favor of HA (Sig. < 0.05).
The overall model is significant (F(1,97) = 116.530, p < .001).
The estimated model is not only a theoretical construct but one that exists in a statistical sense.
Slide 24
Step 3: Verification of the model – t-tests
The Coefficients table provides significance tests for the coefficients.
The significance test evaluates the null hypothesis that the regression coefficient is zero
H0: βi = 0
HA: βi ≠ 0
The t statistic for the height variable (β1) is associated with a p-value below .001 ("Sig." = .000).
This indicates that the null hypothesis can be rejected.
Thus, the coefficient is significantly different from zero.
This holds also for the constant (β0) with Sig. = .000.
Slide 25
Intermezzo: Confidence intervals of the regression coefficients
The significance of a regression coefficient can also be determined by
its confidence interval. The 95% confidence interval gives a range of
values that includes the population parameter with a probability of 95%.
If a 95% confidence interval does not include the value 0,
the regression coefficient is significantly different from 0 (p < .05).
The confidence interval is computed as follows: [β - tcrit · s.e.β, β + tcrit · s.e.β]
The critical t-value (tcrit) can be looked up in tables. It is determined by its degrees of freedom
ν = n - k - 1, where n = sample size; k = number of independent variables.
For height: ν = 99 - 1 - 1 = 97, tcrit = 1.98
95% confidence interval for height: [1.086 - 1.98 · .101, 1.086 + 1.98 · .101] = [.886, 1.286]
It does not include 0. Thus, the regression coefficient is significantly different from 0.
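The interval computation can be reproduced with the rounded values from the slide (b = 1.086, s.e. = .101, tcrit = 1.98); SPSS itself works with unrounded values, so its printed bounds may differ in the last digit.

```python
# 95% confidence interval for the height coefficient,
# using the rounded values reported on the slide.
b, se, t_crit = 1.086, 0.101, 1.98

lower = b - t_crit * se
upper = b + t_crit * se

# The interval excludes 0 => the coefficient is significant at the 5% level
significant = not (lower <= 0.0 <= upper)
```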
Slide 26
Step 6: Interpretation of the model – The regression coefficients
weighti = β0 + β1 ⋅ heighti
weighti = −120.375 + 1.086 ⋅ heighti
Unstandardized coefficients show absolute
change of the dependent variable if the
independent variable increases by one unit.
If height increases by 1 cm,
weight increases by 1.086 kg.
Note: The constant -120.375 has no specific
meaning. It's just the intersection with the Y-axis.
Slide 27
Back to Step 3: Verification of the model – The coefficient of determination
yi = data point
ŷi = estimation (model)
ȳ = sample mean
Error is also called residual.
[Figure: scatterplot with regression line, illustrating the Total, Regression and Error distances at a data point (xi, yi)]
Slide 28
Summing up squared distances to sum of squares (SS)
SSTotal = SSRegression + SSError
Σi=1..n (yi − ȳ)² = Σi=1..n (ŷi − ȳ)² + Σi=1..n (yi − ŷi)²

R Square = SSRegression / SSTotal,  0 ≤ R Square ≤ 1
R Square, the coefficient of determination, is .546.
In the example, about half the variation of weight is explained by the model (R2 = 54.6%).
In bivariate regression, R2 is equal to the squared value of the correlation coefficient of the two variables (rxy = .739, rxy² = .546).
The higher R Square, the better the fit.
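The identity R² = rxy² in the bivariate case can be checked with a short Python sketch (hypothetical data; the R² is computed from the sums of squares exactly as decomposed above):

```python
# R^2 from sums of squares equals the squared correlation (bivariate case).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([2.0, 2.5, 4.1, 4.0, 5.9, 6.1, 7.8])

b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x
ss_total = np.sum((y - y.mean()) ** 2)
ss_error = np.sum((y - y_hat) ** 2)
r_square = 1.0 - ss_error / ss_total      # = SS_Regression / SS_Total

r_xy = np.corrcoef(x, y)[0, 1]            # correlation coefficient
```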
Slide 29
Step 5: Testing of assumptions
In the example, are the requirements of the Gauss-Markov theorem as well as the other as-
sumptions met?
1. Is the model linear in coefficients? Yes, decision for regression model.
2. Is it a random sample? Yes, clinical study.
3. Do the residuals have an expected value of 0
for all values of x? (zero conditional mean)
→ Scatterplot of residuals
4. Is there variation in the explanatory variable? Yes, clinical study.
5. Do the residuals have constant variance
for all values of x? (homoscedasticity)
→ Scatterplot of residuals
Are the residuals independent from one another?
Are the residuals normally distributed?
→ Scatterplot of residuals
→ (consider Durbin-Watson)
→ Histogram
Slide 30
Scatterplot of standardized predicted values of y vs. standardized residuals
3. Zero conditional mean: The mean values of the residuals do not differ visibly from 0 across
the range of standardized estimated values. → OK
5. Homoscedasticity: Residual plot trumpet-shaped; residuals do not have constant variance.
This Gauss-Markov requirement is violated. → There is heteroscedasticity.
Independence: There is no obvious pattern indicating that the residuals influence one another (for example a "wavelike" pattern). → OK
Slide 31
Histogram of standardized residuals
Normal distribution of residuals:
Distribution of the standardized residuals is more or less normal. → OK
Slide 32
Violation of the homoscedasticity assumption
How to diagnose heteroscedasticity
Informal methods:
◦ Look at the scatterplot of standardized predicted y-values vs. standardized residuals.
◦ Graph the data and look for patterns.
Formal methods (not pursued further in this course):
◦ Breusch-Pagan test / Cook-Weisberg test
◦ White test
Corrections
◦ Transformation of the variable: A possible correction in this example is a log transformation of the variable weight
◦ Use of robust standard errors (not implemented in SPSS)
◦ Use of Generalized Least Squares (GLS): The estimator is provided with information about the variance and covariance of the errors.
(The last two options are not pursued further in this course.)
Slide 33
Example with Nonlinearity of the Independent Variable
Function applied to simulate random data:
Yi = 10 + (1/100) ⋅ Xi² + ui
Xi ∈ {1, ..., 99}
ui ~ N(0, 7.5), a random variable
Step 1: Formulation of the model
As the scatterplot confirms, y and x obviously do
not have a linear relationship.
Run linear regression with SPSS anyway
y = β0 + β1 ⋅ x + u
Slide 34
Step 3: Verification of the model
R Square: OK
F-Test: OK
t-tests of coefficients: OK
Model:
yi = β0 + β1 ⋅ xi + ui
yi = −3.724 + 1.032 ⋅ xi
Slide 35
Step 5: Testing of assumptions – Homoscedasticity?
Residuals plot U-shaped (= heteroscedasticity) Compare with original scatterplot
→ Model not linear in the independent variable.
→ Solution: regression with quadratic term: yi = β0 + β1 ⋅ xi² + ui
(do not use Analyze > Regression > Nonlinear...)
How to calculate x²? => Syntax: COMPUTE x_r = x*x.
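The effect of the quadratic term can be sketched in Python by re-simulating the slide's data-generating process (an assumption: 7.5 is taken here as the standard deviation of u; the slide's N(0, 7.5) could also denote the variance):

```python
# Compare a linear fit (y on x) with a quadratic-term fit (y on x_r = x*x)
# for data simulated from Y = 10 + X^2/100 + u.
import numpy as np

rng = np.random.default_rng(42)
x = np.arange(1.0, 100.0)                 # X in {1, ..., 99}
u = rng.normal(0.0, 7.5, size=x.size)     # error term (sd assumed to be 7.5)
y = 10.0 + x**2 / 100.0 + u

def r_square(design, y):
    """R^2 of an OLS fit for a given design matrix."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    y_hat = design @ beta
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

ones = np.ones_like(x)
r2_linear = r_square(np.column_stack([ones, x]), y)        # y on x
r2_quadratic = r_square(np.column_stack([ones, x**2]), y)  # y on x_r = x*x
```

Because the true model is quadratic in x, the fit on x_r = x*x explains more variance, mirroring the "even better" SPSS results on the next slide.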
Slide 36
Regression analysis with quadratic term (x_r = x*x instead of x)
Step 3: Verification of the model
R Square: even better!
F-test: even better!
t-tests: even better!
Model:
yi = β0 + β1 ⋅ x_ri + ui
yi = 13.764 + 1.028 ⋅ xi²
Slide 37
Step 5: Testing of assumptions – Homoscedasticity
Residuals now have more or less constant variance.
Slide 38
Notes: