1 lecture 8 regression: relationships between continuous variables slides available from statistics...

30
1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of www.gpryce.com Social Science Statistics Module I Gwilym Pryce

Post on 18-Dec-2015

212 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

1

Lecture 8Regression: Relationships between continuous variables

Slides available from Statistics & SPSS page of www.gpryce.com

Social Science Statistics Module I

Gwilym Pryce

Page 2: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

Notices:

• Register

Page 3: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

Plan:

• 1. Linear & Non-linear Relationships• 2. Fitting a line using OLS• 3. Inference in Regression• 4. Omitted Variables & R2 • 5. Summary

Page 4: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

1. Linear & Non-linear relationships between variables

• Often of greatest interest in social science is investigation into relationships between variables:– is social class related to political perspective?

– is income related to education?

– is worker alienation related to job monotony?

• We are also interested in the direction of causation, but this is more difficult to prove empirically:– our empirical models are usually structured assuming a particular

theory of causation

Page 5: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

Relationships between scale variables

• The most straight forward way to investigate evidence for relationship is to look at scatter plots:– traditional to:

• put the dependent variable (I.e. the “effect”) on the vertical axis– or “y axis”

• put the explanatory variable (I.e. the “cause”) on the horizontal axis

– or “x axis”

Page 6: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

Scatter plot of IQ and Income:

IQ

1601401201008060

INC

OM

E

40000

30000

20000

10000

Page 7: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

We would like to find the line of best fit:

IQ

1601401201008060

INC

OM

E

40000

30000

20000

10000bxay ˆ

line of slope

intercept

where,

b

ya

IQbaINCOME

Predicted values (i.e. values of y lying on the line of best fit) are given by:

Page 8: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

What does the output mean?

Page 9: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

Sometimes the relationship appears non-linear:

IQ2

3002001000

INC

OM

E

40000

30000

20000

10000

Page 10: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

… straight line of best fit is not always very satisfactory:

IQ2

3002001000

INC

OM

E

40000

30000

20000

10000

Page 11: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

Could try a quadratic line of best fit:

IQ2

3002001000

INC

OM

E

40000

30000

20000

10000

Page 12: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

We can simulate a non-linear relationship by first transforming one of the variables:

IQ2

3002001000

INC

OM

E

40000

30000

20000

10000

IQ2

3002001000

INC

_S

Q

1000000000

800000000

600000000

400000000

200000000

0

Page 13: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

e.g. squaring IQ and taking the natural log of IQ:

IQ2

3002001000

INC

OM

E

40000

30000

20000

10000

IQ2_LN

5.85.65.45.25.04.84.64.44.2

INC

OM

E

40000

30000

20000

10000

Page 14: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

… or a cubic line of best fit: (over-fitted?)

IQ2

3002001000

INC

OM

E

40000

30000

20000

10000

Page 15: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

Or could try two linear lines: “structural break”

IQ2

3002001000

INC

OM

E

40000

30000

20000

10000

Page 16: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

2. Fitting a line using OLS

• The most popular algorithm for drawing the line of best fit is one that minimises the sum of squared deviations from the line to each observation:

n

iii yy

1

2)ˆ(minWhere:

yi = observed value of y

= predicted value of yi

= the value on the line of best fit corresponding to xi

iy

Page 17: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

Regression estimates of a, b using Ordinary Least Squares (OLS):

• Solving the min[error sum of squares] problem yields estimates of the slope b and y-intercept a of the straight line:

22 )( xxn

yxxynb

xbya

Page 18: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

3. Inference in Regression: Hypothesis tests on the slope coefficient:

• Regressions are usually run on samples, so what can we say about the population relationship between x and y?

• Repeated samples would yield a range of values for estimates of b ~ N(, sb)

• I.e. b is normally distributed with mean = = population mean = value of b if regression run on population

• If there is no relationship in the population between x and y, then = 0, & this is our H0

Page 19: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

What does the standard error mean?

Returning to our IQ example:

Page 20: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

Hypothesis test on b:

• (1) H0: = 0 (I.e. slope coefficient, if regression run on population, would = 0)

H1: • (2) = 0.05 or 0.01 etc.

• (3) Reject H0 iff P < • (N.B. Rule of thumb: P < 0.05 if tc 2, and P < 0.01 if tc 2.6)

• (4) Calculate P and conclude.

bbb s

b

s

b

s

bt

02ndf

Page 21: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

Floor Area Example:

• You run a regression of house price on floor area which yields the following output. Use this output to answer the following questions:

Q/ What is the “Constant”? What does it’s value mean here?Q/ What is the slope coefficient and what does it tell you here?Q/ What is the estimated value of an extra square metre?Q/ How would you test for the existence of a relationship between purchase price

and floor area?Q/ How much is a 200m2 house worth?Q/ How much is a 100m2 house worth?Q/ On average, how much is the slope coefficient likely to vary from sample to

sample?• NB Write down your answers – you’ll need them later!

Coefficientsa

-10029.0 3242.545 -3.093 .002

638.825 26.108 .721 24.469 .000

(Constant)

Floor Area (sq meters)

Model1

B Std. Error

UnstandardizedCoefficients

Beta

Standardized

Coefficients

t Sig.

Dependent Variable: Purchase Pricea.

Page 22: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

Floor area example:• (1) H0: no relationship between house price and floor area.

H1: there is a relationship• (2), (3), (4):

• P = 1- CDF.T(24.469,554) = 0.000000

Reject H0

Coefficientsa

-10029.0 3242.545 -3.093 .002

638.825 26.108 .721 24.469 .000

(Constant)

Floor Area (sq meters)

Model1

B Std. Error

UnstandardizedCoefficients

Beta

Standardized

Coefficients

t Sig.

Dependent Variable: Purchase Pricea.

Page 23: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

4. Omitted Variables & R2

Floor Area (sq meters)

3002001000

Pu

rch

ase

Pri

ce

300000

200000

100000

0

Q/ is floor area the only factor?Q/ How much of the variation in Price does it explain?

Page 24: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

R-square

• R-square tells you how much of the variation in y is explained by the explanatory variable x– 0 < R2 < 1 (NB: you want R2 to be near 1).

– If more than one explanatory variable, use Adjusted R2

Model Summary

.721a .519 .519 26925.18Model1

R R SquareAdjustedR Square

Std. Errorof the

Estimate

Predictors: (Constant), Floor Area (sq meters)a.

Page 25: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

House Price Example cont’d: Two explanatory variables

Coefficientsa

-23817.3 3623.761 -6.573 .000

507.271 30.722 .572 16.511 .000

24951.769 3401.282 .254 7.336 .000

(Constant)

Floor Area (sq meters)

Number of Bathrooms

B Std. Error

UnstandardizedCoefficients

Beta

Standardized

Coefficients

t Sig.

Dependent Variable: Purchase Pricea.

Model Summary

.750a .562 .560R R Square

AdjustedR Square

Predictors: (Constant), Number of Bathrooms ,Floor Area (sq meters)

a.

Q/ How has the estimated value of an extra square metre changed?Q/ Do a hypothesis test for the existence of a relationship between price and

number of bathrooms.Q/ How much will an extra bathroom typically add to the value of a house?Q/ What is the value of a 200m2 house with one bathroom? Compare your

estimate with that from the previous model.Q/ What is the value of a 100m2 house with one bathroom? Compare your

estimate with that from the previous model.Q/ What is the value of a 100m2 house with two bathrooms? Compare your

estimate with that from the previous model.Q/ On average, how much is the slope coefficient on floor area likely to vary

from sample to sample?

Now add number of bathrooms as an extra explanatory variable…

Page 26: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

Scatter plot (with floor spikes)

Purchase Price

300

100000

200000

300000

200

Floor Area (sq meters)3.53.0100 2.5

Number of Bathrooms2.01.51.0

Page 27: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

3D Surface Plots: Construction, Price & Unemployment

Q = -246 + 27P - 0.2P2 - 73U + 3U2

Q

Ut-1

P

020

4060

80

0

5

1015

-500

0

500

020

4060

80

0

5

1015

Page 28: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

Construction Equation in a Slump

Q = 315 + 4P - 73U + 5U2

020

4060

80

0

510

15

0

200

400

600

800

020

4060

80

0

510

15

Page 29: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

Summary

• 1. Linear & Non-linear Relationships• 2. Fitting a line using OLS• 3. Inference in Regression• 4. Omitted Variables & R2

Page 30: 1 Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics & SPSS page of   Social

Reading:

• Regression Analysis:– *Pryce chapter on relationships.– *Field, A. chapters on regression.– *Moore and McCabe Chapters on regression.– Kennedy, P. ‘A Guide to Econometrics’– Bryman, Alan, and Cramer, Duncan (1999) “Quantitative

Data Analysis with SPSS for Windows: A Guide for Social Scientists”, Chapters 9 and 10.

– Achen, Christopher H. Interpreting and Using Regression (London: Sage, 1982).