TRANSCRIPT
Research Methodology: Tools
Applied Data Analysis (with SPSS)
Lecture 06: Introduction to Regression Analysis
April 2014
Prof. Dr. Jürg Schwarz
Lic. phil. Heidi Bruderer Enzler
MSc Business Administration
Slide 2
Contents
Aims of the Lecture ................................................. 3
Typical Syntax ...................................................... 4
Introduction ........................................................ 5
    Example ......................................................... 5
Outline and Concept of Regression Analysis .......................... 8
    Key Steps in Regression Analysis ............................... 10
    General Purpose of Regression .................................. 11
    Scatterplots Illustrating Causal Relationships ................. 12
    Models ......................................................... 13
    Ordinary Least Squares (OLS) Estimates of β0 and β1 ............ 15
    Gauss-Markov Theorem, Independence and Normal Distribution ..... 18
Regression Analysis with SPSS: Detailed Examples ................... 21
    Simple Example ................................................. 21
    Example with Nonlinearity of the Independent Variable .......... 33
Slide 3
Aims of the Lecture
You will understand the key steps in conducting a regression analysis.
You will understand the stochastic model of the regression analysis.
You will understand the ordinary least squares (OLS) method.
You will understand the 5 Gauss-Markov assumptions for OLS estimators.
You will be able to conduct a regression analysis with SPSS.
In particular, you will know how to
◦ interpret the output
◦ describe the output
You will know how to use the regression equation to estimate values of the dependent variable.
Slide 4
Typical Syntax
Annotations on the slide: dependent variable weight · independent variable height · scatterplot of residuals · histogram of residuals
Scatterplot of variables height (X-axis) and weight (Y-axis):

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=height weight MISSING=LISTWISE
    REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: height=col(source(s), name("height"))
  DATA: weight=col(source(s), name("weight"))
  GUIDE: axis(dim(1), label("height (in cm)"))
  GUIDE: axis(dim(2), label("weight (in kg)"))
  ELEMENT: point(position(height*weight))
END GPL.
Regression analysis of weight on height:

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT weight
  /METHOD=ENTER height
  /SCATTERPLOT=(*ZRESID ,*ZPRED)
  /RESIDUALS HISTOGRAM(ZRESID).
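For readers who want to reproduce the same fit outside SPSS, here is a minimal Python sketch of the REGRESSION command above. The height/weight values are made-up illustration data, not the course dataset, and Python is not the tool used in this course.

```python
# Sketch of an OLS regression of weight on height using numpy only.
# The data below are hypothetical example values (not the lecture's dataset).
import numpy as np

height = np.array([165.0, 170.0, 172.0, 175.0, 180.0, 182.0, 185.0, 190.0])  # cm
weight = np.array([60.0, 66.0, 68.0, 72.0, 78.0, 80.0, 84.0, 90.0])          # kg

# Design matrix with a constant column (the SPSS /NOORIGIN default
# means the model includes an intercept).
X = np.column_stack([np.ones_like(height), height])
beta, *_ = np.linalg.lstsq(X, weight, rcond=None)
b0, b1 = beta

fitted = b0 + b1 * height
residuals = weight - fitted
print(f"weight = {b0:.3f} + {b1:.3f} * height")
```

With an intercept in the model, the OLS residuals sum to zero, which is a quick sanity check on the fit.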
Slide 5
Introduction
Example
Market research: Competition analysis
[Scatterplot: Number of employees [#] (X-axis, 0–50) vs. sales volume [USD million] (Y-axis, 0–10)]
Dataset
Sample of n = 10 companies
Variables for
◦ number of employees (empl)
◦ sales volume (sales)
Typical questions
Is there a linear relation between
employee number and sales volume?
What is the predicted sales volume of
a company with 12 employees?
Slide 6
Questions
Question in everyday language:
Does sales volume depend on the number of employees?
Research question:
Does the number of employees have an impact on sales volume?
How strong is the influence of the number of employees?
Is there a model?
Is regression analysis the right model?
Statistical question:
H0: "No model" (= No overall model and no significant coefficients)
HA: "Model" (= Overall model and significant coefficients)
Can we reject H0?
sales = β0 + β1 ⋅ empl + u where sales = dependent variable
empl = independent variable
β0, β1 = coefficients
u = error term
Slide 7
Results
The overall model is significant (F(1,8) = 36.276, p < .001).
The model explains 81.9% of the variance of sales volume (R2 = .819).
Slide 8
Both coefficients are significant (p < .05). Thus the model can be described as:
sales = 2.448 + 0.111 ⋅ empl
One additional employee increases
sales volume by .111 million USD.
Predicted sales volume of a company
with 12 employees:
3.78 million USD (= 2.448 + .111 ⋅ 12)
[Scatterplot with fitted regression line: Number of employees [#] (X-axis, 0–50) vs. sales volume [USD million] (Y-axis, 0–10)]
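The prediction above is plain arithmetic and can be verified directly, using the coefficients estimated on the slide:

```python
# Coefficients from the estimated model on the slide:
# sales = 2.448 + 0.111 * empl  (sales in USD million)
b0, b1 = 2.448, 0.111

def predict_sales(empl):
    """Predicted sales volume (USD million) for a given number of employees."""
    return b0 + b1 * empl

print(round(predict_sales(12), 2))  # 3.78
```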
Slide 9
Outline and Concept of Regression Analysis
Basic situation
Given: Two variables with a metric scale
Task: Find a relationship between the variables
Regression analysis
Postulation of a linear model: y = β0 + β1 ⋅ x + u
Regression analysis
◦ Examines the relationship between the dependent variable y and the independent variable x.
◦ Uses inferential statistics to estimate the parameters β0 and β1.
Note
◦ In many cases the relation is assumed to be a cause-and-effect relation. Be aware that this requires strong theoretical or empirical evidence.
◦ "Regression" stands for going backwards from the dependent variable y to the determinant x. Therefore, we speak of "regression of y on x".
Slide 10
Key Steps in Regression Analysis
1. Formulation of the model
◦ Common sense (remember the example with storks and babies)
◦ Linearity of relationship plausible
◦ Not too many variables (Principle of parsimony: Simplest solution to a problem)
2. Estimation of the model
◦ Estimation of the model by means of OLS estimation (ordinary least squares)
◦ Decision on procedure: Enter or stepwise regression
3. Verification of the model
◦ Is the model as a whole significant? (i.e. are the coefficients significant as a group?) → F-test
◦ Are the regression coefficients significant? → t-tests (should be performed only if F-test is significant)
◦ How much variation does the regression equation explain? → Coefficient of determination (adjusted R-squared)
4. Considering other aspects (for example, multicollinearity)
5. Testing of assumptions (Gauss-Markov, independence and normal distribution)
6. Interpretation of the model and reporting
Text in italics: Only important in the case of multiple regression – see next lecture.
Slide 11
General Purpose of Regression
◦ Cause analysis
State a relationship between independent variables and the dependent variable.
Example
Is there a model that describes the dependence between sales volume and employee number, or do these two variables just form a random pattern?
◦ Impact analysis
Assess the impact of the independent variable on the dependent variable.
Example
If the number of employees increases, sales volume also increases: How strong is the impact? By how much will sales increase with each additional employee?
◦ Prediction
Predict the values of a dependent variable using new values for the independent variable.
Example
What is the predicted sales volume of a company with 12 employees?
Slide 12
Illustrating Causal Relationships – Scatterplots and Equation
When displaying two variables with a causal relationship in a scatterplot, it is standard practice
to show the "Cause" variable on the X-axis and the "Effect" variable on the Y-axis.
[Scatterplot: "Cause" on the X-axis (0–50), "Effect" on the Y-axis (0–10)]
In the regression equation y is the dependent variable and x is the independent variable.
The labels of the variables vary according to the field of application.
y                      x
Dependent Variable     Independent Variable
Explained Variable     Explanatory Variable
Response Variable      Control Variable
Predicted Variable     Predictor
Regressand             Regressor
"Eff
ect"
(Y
-axis
)
"Cause" (X-axis)
Slide 13
Models
Mathematical model
The linear model describes y as a function of x
y = β0 + β1 ⋅ x     (equation of a straight line)
The variable y is a linear function of the variable x.
β0 (intercept, constant)
The point where the regression line crosses the Y-axis.
The value of the dependent variable when all of the independent variables = 0.
β1 (regression coefficient)
The increase in the dependent variable per unit change in the
independent variable (also known as "the rise over the run", slope)
Stochastic model
y = β0 + β1 ⋅ x + u
The error term u comprises all factors (other than x) that affect y.
These factors are treated as being unobservable.
→ u stands for "unobserved"
[Figure: regression line with intercept β0 and slope β1 = Δy/Δx]
Slide 14
Stochastic model – Assumptions related to the error term
The error term u is (must be):
◦ independent of the explanatory variable x
◦ normally distributed with mean 0 and variance σ2: u ~ N(0, σ2)
E(y) = β0 + β1 ⋅ x
Based on: Wooldridge, Jeffrey (2011): Introductory econometrics. 5th Edition. [S.l.]: South-Western.
Slide 15
Ordinary Least Squares (OLS) Estimates of β0 and β1
Step by step
Given the population model
y = β0 + β1 ⋅ x + u
Given a random sample {(xi, yi): i = 1, ..., n} for which the sample regression model holds:
yi = β̂0 + β̂1 ⋅ xi + ûi
The notation β̂0, β̂1, read as "beta hat",
emphasizes that the coefficients are
estimates based on the sample.
Fitted values and residuals
Calculation of the fitted values
ŷi = β̂0 + β̂1 ⋅ xi
Calculation of residuals
ûi = yi − ŷi = yi − β̂0 − β̂1 ⋅ xi
Slide 16
Choose β̂0 and β̂1 such that for the observations (i = 1, ..., n) the following condition holds:

Σi=1..n (ûi)² = Σi=1..n (yi − β̂0 − β̂1 ⋅ xi)² = minimum
The squares drawn in the figure correspond to
the squared residuals and the total area of the
squares should be minimized.
The method used for this minimization is called
the ordinary least squares method (OLS).
Taking derivatives
A necessary and in this case also sufficient condition for β̂0 and β̂1
to solve the minimization problem is that the partial derivatives must be zero:

δ(Σi=1..n ûi²)/δβ̂0 = −2 ⋅ Σi=1..n (yi − β̂0 − β̂1 ⋅ xi) = 0
δ(Σi=1..n ûi²)/δβ̂1 = −2 ⋅ Σi=1..n xi ⋅ (yi − β̂0 − β̂1 ⋅ xi) = 0
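These first-order conditions can be checked numerically: at the OLS solution, both derivative sums vanish. The following Python sketch uses a small hypothetical sample (not course data):

```python
# Verify the OLS first-order conditions on a hypothetical sample.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form OLS estimates
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
u = y - b0 - b1 * x  # residuals

# Both partial derivatives of the sum of squared residuals are zero
d_b0 = -2 * np.sum(u)
d_b1 = -2 * np.sum(x * u)
```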
Slide 17
Resulting estimates of β0 and β1
The solutions β̂0, β̂1 are called ordinary least squares (OLS) estimates of β0 and β1:

β̂1 = Σi=1..n (xi − x̄)(yi − ȳ) / Σi=1..n (xi − x̄)² = σx,y / σx²     (covariance of x and y divided by variance of x)
β̂0 = ȳ − β̂1 ⋅ x̄

Ordinary least squares (OLS) estimates provide the best explanation/prediction of the data.
β̂0, β̂1 are the "correct" coefficients to describe the data from the sample.
The population regression equation y = β0 + β1 ⋅ x + u is still unknown.
Thus β̂0, β̂1 are only estimates of the "true" set of coefficients in the population.
Another data sample will give a different regression equation, which may or may not be closer to the population regression equation.
Note: The notation β̂0, β̂1 emphasizes that the coefficients are estimates.
This notation will be replaced below by β0, β1 because it is clear that we always use estimates.
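The "covariance over variance" identity is easy to confirm in Python (a sketch with hypothetical data; numpy's built-in least-squares fit serves as the cross-check):

```python
# beta1 = cov(x, y) / var(x), checked against numpy's own fit.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.3, 2.9, 4.1, 5.2, 5.8])

# The ddof choice cancels in the ratio, as long as it matches in both terms
cov_xy = np.cov(x, y, ddof=1)[0, 1]
var_x = np.var(x, ddof=1)
b1 = cov_xy / var_x
b0 = y.mean() - b1 * x.mean()

# Cross-check: np.polyfit(deg=1) returns [slope, intercept]
b1_np, b0_np = np.polyfit(x, y, deg=1)
```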
Slide 18
Gauss-Markov Theorem, Independence and Normal Distribution
Under the 5 Gauss-Markov assumptions the OLS estimator is the best linear unbiased estimator of the true parameters βi, given the present sample.
→ The OLS estimator is BLUE.
1. Linear in coefficients y = β0 + β1 ⋅ x + u
2. Random sample of n observations {(xi, yi): i = 1, ..., n}
3. Zero conditional mean:
The error u has an expected value of 0,
given any values of the explanatory variable
E(u|x) = 0
4. Sample variation in explanatory variables.
The xi’s are not constant and not all the same.
x ≠ const
x1, x2, ..., xn are not all the same
5. Homoscedasticity:
The error u has the same variance given any value of the
explanatory variable.
Var(u|x) = σ2
Independence and normal distribution of error: u ~ Normal(0, σ2)
These assumptions need to be tested – among other things by analyzing the residuals.
Based on: Wooldridge, Jeffrey (2011): Introductory econometrics. 5th Edition. [S.l.]: South-Western.
Slide 19
Gauss-Markov assumption on linearity in coefficients
The Gauss-Markov assumption is that the model of the population must be linear in the parameters.
Standard case: OLS [y = β0 + β1 ⋅ x + u]
Examples for violations of this Gauss-Markov assumption (non-linear parameters):
OLS ≠ [y = β0 + β1² ⋅ x]
OLS ≠ [y = β0 + ln(β1) ⋅ x]
→ In these examples, the OLS estimators are not BLUE.
Non-linear variables, however, are possible and often very useful.
For example, if the percentage increase of y is constant given one unit more of x,
then a model of the following form is appropriate:
OLS ≈ [ln(y) = β0 + β1 ⋅ x]
Further examples:
OLS ≈ [y = β0 + β1 ⋅ x + β2 ⋅ x²]
OLS ≈ [y = β0 + β1 ⋅ ln(x)]
Slide 20
Example of non-linearity in variables (not problematic!)
Percentage increase of y: If x increases by one unit (∆x), y increases by a constant percentage (%∆y).
If y is logarithmized, the relationship becomes linear: If x increases by one unit, ln(y) increases by a constant amount (β1).
This model is both linear in variables (ln(y), x) and linear in coefficients (β0, β1).

y = exp(β0 + β1 ⋅ x)     ln(y) = ln(exp(β0 + β1 ⋅ x)) = β0 + β1 ⋅ x

[Figure: left panel shows the exponential curve of y against x; right panel shows the straight line of ln(y) against x after the log transform]
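The log-linearization can be sketched numerically: an exponential relationship becomes exactly linear after taking logarithms. The coefficients below are hypothetical illustration values.

```python
# Exponential model y = exp(b0 + b1*x) becomes linear after a log transform.
import numpy as np

b0, b1 = 0.5, 0.2                 # hypothetical coefficients
x = np.linspace(0.0, 10.0, 50)
y = np.exp(b0 + b1 * x)           # constant percentage growth in y

log_y = np.log(y)                 # ln(y) = b0 + b1*x, exactly linear in x
slope = (log_y[-1] - log_y[0]) / (x[-1] - x[0])
```

Every unit step in x changes ln(y) by the same amount b1, which is the defining property of a linear relationship.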
Slide 21
Regression Analysis with SPSS: Detailed Examples
Simple Example
Sample of 99 men with
information on body height and weight
Step 1: Formulation of the model
Regression equation of weight on height
weight = β0 + β1 ⋅ height + u

weight = dependent variable
height = independent variable
β0, β1 = coefficients
u = error term
The scatterplot confirms that there could be a
linear relationship between weight and height.
Slide 22
Step 2: Estimation of the model
SPSS: Analyze > Regression > Linear...
Slide 23
Step 3: Verification of the model – The F-test
The null hypothesis (H0) is that there is no effect of height.
The alternative hypothesis (HA) is that this is not the case.
H0: β1 = 0 (Multiple Regression => H0: β1 = β2 = ... = βp = 0)
HA: β1 ≠ 0 (Multiple Regression => HA: βj ≠ 0 for at least one value of j)
Empirical F-value and the appropriate p-value ("Sig.") are computed by SPSS.
In the example, we can reject H0 in favor of HA (Sig. < 0.05).
The overall model is significant (F(1,97) = 116.530, p < .001).
The estimated model is not only a theoretical construct but one that exists in a statistical sense.
Slide 24
Step 3: Verification of the model – t-tests
The Coefficients table provides significance tests for the coefficients.
The significance test evaluates the null hypothesis that the regression coefficient is zero
H0: βi = 0
HA: βi ≠ 0
The t statistic for the height variable (β1) is associated with a p-value below .001 ("Sig." = .000).
This indicates that the null hypothesis can be rejected.
Thus, the coefficient is significantly different from zero.
This holds also for the constant (β0) with Sig. = .000.
Slide 25
Intermezzo: Confidence intervals of the regression coefficients
The significance of a regression coefficient can also be determined by
its confidence interval. The 95% confidence interval gives a range of
values that includes the population parameter with a probability of 95%.
If a 95% confidence interval does not include the value 0,
the regression coefficient is significantly different from 0 (p < .05).
The confidence interval is computed as follows: [β - tcrit · s.e.β, β + tcrit · s.e.β]
The critical t-value (tcrit) can be looked up in tables. It is determined by its degrees of freedom
ν = n - k - 1, where n = sample size; k = number of independent variables.
For height: ν = 99 - 1 - 1 = 97, tcrit = 1.98
95% confidence interval for height: [1.086 - 1.98 · .101, 1.086 + 1.98 · .101] = [.886, 1.286]
It does not include 0. Thus, the regression coefficient is significantly different from 0.
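The interval computation can be reproduced with the rounded values from the slide (b = 1.086, s.e. = .101, tcrit = 1.98); SPSS itself works with unrounded values, so its printed bounds may differ in the last digit.

```python
# 95% confidence interval for the height coefficient,
# using the rounded values reported on the slide.
b, se, t_crit = 1.086, 0.101, 1.98

lower = b - t_crit * se
upper = b + t_crit * se

# The interval excludes 0 => the coefficient is significant at the 5% level
significant = not (lower <= 0.0 <= upper)
```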
Slide 26
Step 6: Interpretation of the model – The regression coefficients
weighti = β0 + β1 ⋅ heighti
weighti = −120.375 + 1.086 ⋅ heighti
Unstandardized coefficients show absolute
change of the dependent variable if the
independent variable increases by one unit.
If height increases by 1 cm,
weight increases by 1.086 kg.
Note: The constant -120.375 has no specific
meaning. It's just the intersection with the Y-axis.
Slide 27
Back to Step 3: Verification of the model – The coefficient of determination
yi = data point
ŷi = estimation (model)
ȳ = sample mean
Error is also called residual.
[Figure: scatterplot with regression line, illustrating the Total, Regression and Error distances at a data point (xi, yi)]
Slide 28
Summing up squared distances to sum of squares (SS)
SSTotal = SSRegression + SSError
Σi=1..n (yi − ȳ)² = Σi=1..n (ŷi − ȳ)² + Σi=1..n (yi − ŷi)²

R Square = SSRegression / SSTotal,  0 ≤ R Square ≤ 1
R Square, the coefficient of determination, is .546.
In the example, about half the variation of weight is explained by the model (R2 = 54.6%).
In bivariate regression, R2 is equal to the squared value of the correlation coefficient of the two variables (rxy = .739, rxy² = .546).
The higher R Square, the better the fit.
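The identity R² = rxy² in the bivariate case can be checked with a short Python sketch (hypothetical data; the R² is computed from the sums of squares exactly as decomposed above):

```python
# R^2 from sums of squares equals the squared correlation (bivariate case).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([2.0, 2.5, 4.1, 4.0, 5.9, 6.1, 7.8])

b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x
ss_total = np.sum((y - y.mean()) ** 2)
ss_error = np.sum((y - y_hat) ** 2)
r_square = 1.0 - ss_error / ss_total      # = SS_Regression / SS_Total

r_xy = np.corrcoef(x, y)[0, 1]            # correlation coefficient
```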
Slide 29
Step 5: Testing of assumptions
In the example, are the requirements of the Gauss-Markov theorem as well as the other as-
sumptions met?
1. Is the model linear in coefficients? Yes, decision for regression model.
2. Is it a random sample? Yes, clinical study.
3. Do the residuals have an expected value of 0
for all values of x? (zero conditional mean)
→ Scatterplot of residuals
4. Is there variation in the explanatory variable? Yes, clinical study.
5. Do the residuals have constant variance
for all values of x? (homoscedasticity)
→ Scatterplot of residuals
Are the residuals independent from one another?
Are the residuals normally distributed?
→ Scatterplot of residuals
→ (consider Durbin-Watson)
→ Histogram
Slide 30
Scatterplot of standardized predicted values of y vs. standardized residuals
3. Zero conditional mean: The mean values of the residuals do not differ visibly from 0 across
the range of standardized estimated values. → OK
5. Homoscedasticity: Residual plot trumpet-shaped; residuals do not have constant variance.
This Gauss-Markov requirement is violated. → There is heteroscedasticity.
Independence: There is no obvious pattern indicating that the residuals influence one another (for example a "wavelike" pattern). → OK
Slide 31
Histogram of standardized residuals
Normal distribution of residuals:
Distribution of the standardized residuals is more or less normal. → OK
Slide 32
Violation of the homoscedasticity assumption
How to diagnose heteroscedasticity
Informal methods:
◦ Look at the scatterplot of standardized predicted y-values vs. standardized residuals.
◦ Graph the data and look for patterns.
Formal methods (not pursued further in this course):
◦ Breusch-Pagan test / Cook-Weisberg test
◦ White test
Corrections
◦ Transformation of the variable: A possible correction in this example is a log transformation of the variable weight
◦ Use of robust standard errors (not implemented in SPSS)
◦ Use of Generalized Least Squares (GLS): The estimator is provided with information about the variance and covariance of the errors.
(The last two options are not pursued further in this course.)
Slide 33
Example with Nonlinearity of the Independent Variable
Function applied to simulate random data:
Yi = 10 + (1/100) ⋅ Xi² + ui
Xi ∈ {1, ..., 99}
ui ~ N(0, 7.5), a random variable
Step 1: Formulation of the model
As the scatterplot confirms, y and x obviously do
not have a linear relationship.
Run linear regression with SPSS anyway
y = β0 + β1 ⋅ x + u
Slide 34
Step 3: Verification of the model
R Square: OK
F-Test: OK
t-tests of coefficients: OK
Model:
yi = β0 + β1 ⋅ xi + ui
yi = −3.724 + 1.032 ⋅ xi
Slide 35
Step 5: Testing of assumptions – Homoscedasticity?
Residuals plot U-shaped (= heteroscedasticity) Compare with original scatterplot
→ Model not linear in the independent variable.
→ Solution: regression with quadratic term: yi = β0 + β1 ⋅ xi² + ui
(do not use Analyze > Regression > Nonlinear...)
How to calculate x²? => Syntax: COMPUTE x_r = x*x.
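The effect of the quadratic term can be sketched in Python by re-simulating the slide's data-generating process (an assumption: 7.5 is taken here as the standard deviation of u; the slide's N(0, 7.5) could also denote the variance):

```python
# Compare a linear fit (y on x) with a quadratic-term fit (y on x_r = x*x)
# for data simulated from Y = 10 + X^2/100 + u.
import numpy as np

rng = np.random.default_rng(42)
x = np.arange(1.0, 100.0)                 # X in {1, ..., 99}
u = rng.normal(0.0, 7.5, size=x.size)     # error term (sd assumed to be 7.5)
y = 10.0 + x**2 / 100.0 + u

def r_square(design, y):
    """R^2 of an OLS fit for a given design matrix."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    y_hat = design @ beta
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

ones = np.ones_like(x)
r2_linear = r_square(np.column_stack([ones, x]), y)        # y on x
r2_quadratic = r_square(np.column_stack([ones, x**2]), y)  # y on x_r = x*x
```

Because the true model is quadratic in x, the fit on x_r = x*x explains more variance, mirroring the "even better" SPSS results on the next slide.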
Slide 36
Regression analysis with quadratic term (x_r = x*x instead of x)
Step 3: Verification of the model
R Square: even better!
F-test: even better!
t-tests: even better!
Model:
yi = β0 + β1 ⋅ x_ri + ui
yi = 13.764 + 1.028 ⋅ xi²
Slide 37
Step 5: Testing of assumptions – Homoscedasticity
Residuals now have more or less constant variance.
Slide 38
Notes: