1 lecture 8 regression: relationships between continuous variables slides available from statistics...
Post on 18-Dec-2015
212 views
TRANSCRIPT
1
Lecture 8Regression: Relationships between continuous variables
Slides available from Statistics & SPSS page of www.gpryce.com
Social Science Statistics Module I
Gwilym Pryce
Notices:
• Register
Plan:
• 1. Linear & Non-linear Relationships• 2. Fitting a line using OLS• 3. Inference in Regression• 4. Omitted Variables & R2 • 5. Summary
1. Linear & Non-linear relationships between variables
• Often of greatest interest in social science is investigation into relationships between variables:– is social class related to political perspective?
– is income related to education?
– is worker alienation related to job monotony?
• We are also interested in the direction of causation, but this is more difficult to prove empirically:– our empirical models are usually structured assuming a particular
theory of causation
Relationships between scale variables
• The most straight forward way to investigate evidence for relationship is to look at scatter plots:– traditional to:
• put the dependent variable (I.e. the “effect”) on the vertical axis– or “y axis”
• put the explanatory variable (I.e. the “cause”) on the horizontal axis
– or “x axis”
Scatter plot of IQ and Income:
IQ
1601401201008060
INC
OM
E
40000
30000
20000
10000
We would like to find the line of best fit:
IQ
1601401201008060
INC
OM
E
40000
30000
20000
10000bxay ˆ
line of slope
intercept
where,
b
ya
IQbaINCOME
Predicted values (i.e. values of y lying on the line of best fit) are given by:
What does the output mean?
Sometimes the relationship appears non-linear:
IQ2
3002001000
INC
OM
E
40000
30000
20000
10000
… straight line of best fit is not always very satisfactory:
IQ2
3002001000
INC
OM
E
40000
30000
20000
10000
Could try a quadratic line of best fit:
IQ2
3002001000
INC
OM
E
40000
30000
20000
10000
We can simulate a non-linear relationship by first transforming one of the variables:
IQ2
3002001000
INC
OM
E
40000
30000
20000
10000
IQ2
3002001000
INC
_S
Q
1000000000
800000000
600000000
400000000
200000000
0
e.g. squaring IQ and taking the natural log of IQ:
IQ2
3002001000
INC
OM
E
40000
30000
20000
10000
IQ2_LN
5.85.65.45.25.04.84.64.44.2
INC
OM
E
40000
30000
20000
10000
… or a cubic line of best fit: (over-fitted?)
IQ2
3002001000
INC
OM
E
40000
30000
20000
10000
Or could try two linear lines: “structural break”
IQ2
3002001000
INC
OM
E
40000
30000
20000
10000
2. Fitting a line using OLS
• The most popular algorithm for drawing the line of best fit is one that minimises the sum of squared deviations from the line to each observation:
n
iii yy
1
2)ˆ(minWhere:
yi = observed value of y
= predicted value of yi
= the value on the line of best fit corresponding to xi
iy
Regression estimates of a, b using Ordinary Least Squares (OLS):
• Solving the min[error sum of squares] problem yields estimates of the slope b and y-intercept a of the straight line:
22 )( xxn
yxxynb
xbya
3. Inference in Regression: Hypothesis tests on the slope coefficient:
• Regressions are usually run on samples, so what can we say about the population relationship between x and y?
• Repeated samples would yield a range of values for estimates of b ~ N(, sb)
• I.e. b is normally distributed with mean = = population mean = value of b if regression run on population
• If there is no relationship in the population between x and y, then = 0, & this is our H0
What does the standard error mean?
Returning to our IQ example:
Hypothesis test on b:
• (1) H0: = 0 (I.e. slope coefficient, if regression run on population, would = 0)
H1: • (2) = 0.05 or 0.01 etc.
• (3) Reject H0 iff P < • (N.B. Rule of thumb: P < 0.05 if tc 2, and P < 0.01 if tc 2.6)
• (4) Calculate P and conclude.
bbb s
b
s
b
s
bt
02ndf
Floor Area Example:
• You run a regression of house price on floor area which yields the following output. Use this output to answer the following questions:
Q/ What is the “Constant”? What does it’s value mean here?Q/ What is the slope coefficient and what does it tell you here?Q/ What is the estimated value of an extra square metre?Q/ How would you test for the existence of a relationship between purchase price
and floor area?Q/ How much is a 200m2 house worth?Q/ How much is a 100m2 house worth?Q/ On average, how much is the slope coefficient likely to vary from sample to
sample?• NB Write down your answers – you’ll need them later!
Coefficientsa
-10029.0 3242.545 -3.093 .002
638.825 26.108 .721 24.469 .000
(Constant)
Floor Area (sq meters)
Model1
B Std. Error
UnstandardizedCoefficients
Beta
Standardized
Coefficients
t Sig.
Dependent Variable: Purchase Pricea.
Floor area example:• (1) H0: no relationship between house price and floor area.
H1: there is a relationship• (2), (3), (4):
• P = 1- CDF.T(24.469,554) = 0.000000
Reject H0
Coefficientsa
-10029.0 3242.545 -3.093 .002
638.825 26.108 .721 24.469 .000
(Constant)
Floor Area (sq meters)
Model1
B Std. Error
UnstandardizedCoefficients
Beta
Standardized
Coefficients
t Sig.
Dependent Variable: Purchase Pricea.
4. Omitted Variables & R2
Floor Area (sq meters)
3002001000
Pu
rch
ase
Pri
ce
300000
200000
100000
0
Q/ is floor area the only factor?Q/ How much of the variation in Price does it explain?
R-square
• R-square tells you how much of the variation in y is explained by the explanatory variable x– 0 < R2 < 1 (NB: you want R2 to be near 1).
– If more than one explanatory variable, use Adjusted R2
Model Summary
.721a .519 .519 26925.18Model1
R R SquareAdjustedR Square
Std. Errorof the
Estimate
Predictors: (Constant), Floor Area (sq meters)a.
House Price Example cont’d: Two explanatory variables
Coefficientsa
-23817.3 3623.761 -6.573 .000
507.271 30.722 .572 16.511 .000
24951.769 3401.282 .254 7.336 .000
(Constant)
Floor Area (sq meters)
Number of Bathrooms
B Std. Error
UnstandardizedCoefficients
Beta
Standardized
Coefficients
t Sig.
Dependent Variable: Purchase Pricea.
Model Summary
.750a .562 .560R R Square
AdjustedR Square
Predictors: (Constant), Number of Bathrooms ,Floor Area (sq meters)
a.
Q/ How has the estimated value of an extra square metre changed?Q/ Do a hypothesis test for the existence of a relationship between price and
number of bathrooms.Q/ How much will an extra bathroom typically add to the value of a house?Q/ What is the value of a 200m2 house with one bathroom? Compare your
estimate with that from the previous model.Q/ What is the value of a 100m2 house with one bathroom? Compare your
estimate with that from the previous model.Q/ What is the value of a 100m2 house with two bathrooms? Compare your
estimate with that from the previous model.Q/ On average, how much is the slope coefficient on floor area likely to vary
from sample to sample?
Now add number of bathrooms as an extra explanatory variable…
Scatter plot (with floor spikes)
Purchase Price
300
100000
200000
300000
200
Floor Area (sq meters)3.53.0100 2.5
Number of Bathrooms2.01.51.0
3D Surface Plots: Construction, Price & Unemployment
Q = -246 + 27P - 0.2P2 - 73U + 3U2
Q
Ut-1
P
020
4060
80
0
5
1015
-500
0
500
020
4060
80
0
5
1015
Construction Equation in a Slump
Q = 315 + 4P - 73U + 5U2
020
4060
80
0
510
15
0
200
400
600
800
020
4060
80
0
510
15
Summary
• 1. Linear & Non-linear Relationships• 2. Fitting a line using OLS• 3. Inference in Regression• 4. Omitted Variables & R2
Reading:
• Regression Analysis:– *Pryce chapter on relationships.– *Field, A. chapters on regression.– *Moore and McCabe Chapters on regression.– Kennedy, P. ‘A Guide to Econometrics’– Bryman, Alan, and Cramer, Duncan (1999) “Quantitative
Data Analysis with SPSS for Windows: A Guide for Social Scientists”, Chapters 9 and 10.
– Achen, Christopher H. Interpreting and Using Regression (London: Sage, 1982).