
Lecture 8. Regression: Relationships between continuous variables

Slides available from Statistics & SPSS page of www.gpryce.com

Social Science Statistics Module I

Gwilym Pryce

http://www.gpryce.com/

Notices:

• Register

Plan:

• 1. Linear & Non-linear Relationships
• 2. Fitting a line using OLS
• 3. Inference in Regression
• 4. Omitted Variables & R2
• 5. Summary

1. Linear & Non-linear relationships between variables

• Often of greatest interest in social science is investigation into relationships between variables:
– is social class related to political perspective?
– is income related to education?
– is worker alienation related to job monotony?

• We are also interested in the direction of causation, but this is more difficult to prove empirically:
– our empirical models are usually structured assuming a particular theory of causation

Relationships between scale variables

• The most straightforward way to investigate evidence for a relationship is to look at scatter plots:
– it is traditional to:
• put the dependent variable (i.e. the “effect”) on the vertical axis, or “y axis”
• put the explanatory variable (i.e. the “cause”) on the horizontal axis, or “x axis”

Scatter plot of IQ and Income:

[Figure: scatter plot of INCOME (vertical axis, 10000 to 40000) against IQ (horizontal axis, 60 to 160)]

We would like to find the line of best fit:

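[Figure: scatter plot of INCOME (vertical axis, 10000 to 40000) against IQ (horizontal axis, 60 to 160), with a straight line of best fit]

$\hat{y} = a + bx$

where a = y intercept and b = slope of line.

Predicted values (i.e. values of y lying on the line of best fit) are given by:

$\widehat{INCOME} = a + b \, IQ$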

What does the output mean?

Sometimes the relationship appears non-linear:

[Figure: scatter plot of INCOME (vertical axis, 10000 to 40000) against IQ2 (horizontal axis, 0 to 300)]

… straight line of best fit is not always very satisfactory:

[Figure: INCOME against IQ2 (0 to 300) with a straight line of best fit]

Could try a quadratic line of best fit:

[Figure: INCOME against IQ2 (0 to 300) with a quadratic line of best fit]

We can simulate a non-linear relationship by first transforming one of the variables:

[Figure, two panels: INCOME (10000 to 40000) against IQ2 (0 to 300); INC_SQ (0 to 1000000000) against IQ2 (0 to 300)]

e.g. squaring IQ and taking the natural log of IQ:

[Figure, two panels: INCOME (10000 to 40000) against IQ2 (0 to 300); INCOME against IQ2_LN (4.2 to 5.8)]
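A minimal sketch of the transformation idea, in Python with the standard library only. The data are invented for illustration (they are not the lecture's IQ and income data) and `fit_line` is a hypothetical helper. Here y is curved in x but linear in ln(x), so a straight line fitted to (ln x, y) captures the non-linear relationship.

```python
import math

def fit_line(x, y):
    """Return (a, b) for the least-squares straight line y = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b

# Hypothetical data: y rises with x, but at a decreasing rate.
x = [10, 20, 50, 100, 200, 300]
y = [5000 + 6000 * math.log(xi) for xi in x]

# Transform the explanatory variable first, then fit a straight line.
x_ln = [math.log(xi) for xi in x]
a, b = fit_line(x_ln, y)
print(round(a, 2), round(b, 2))
```

Because the made-up y values were generated as 5000 + 6000 ln(x), the fitted intercept and slope should come out very close to those two numbers. With real data the fit would not be exact.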

… or a cubic line of best fit: (over-fitted?)

[Figure: INCOME against IQ2 (0 to 300) with a cubic line of best fit]

Or could try two linear lines: “structural break”

[Figure: INCOME against IQ2 (0 to 300) with two straight lines fitted either side of a break point]
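One way to picture a structural break is to split the sample at a break point and fit a separate straight line to each part. The sketch below does this in plain Python. The data, the break point `break_x` and the helper `fit_line` are all assumptions made for illustration. In practice the break point is usually not known in advance.

```python
def fit_line(x, y):
    """Return (a, b) for the least-squares straight line y = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b

# Hypothetical data whose slope changes from 50 to 200 at x = 100.
x = [0, 25, 50, 75, 100, 125, 150, 175, 200]
y = [10000 + 50 * xi if xi <= 100 else 15000 + 200 * (xi - 100) for xi in x]

break_x = 100  # assumed to be known here

# Split the observations at the break point.
low = [(xi, yi) for xi, yi in zip(x, y) if xi <= break_x]
high = [(xi, yi) for xi, yi in zip(x, y) if xi > break_x]

# Fit one line to each sub-sample.
a1, b1 = fit_line(*zip(*low))
a2, b2 = fit_line(*zip(*high))
print("below break:", a1, b1)
print("above break:", a2, b2)
```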

2. Fitting a line using OLS

• The most popular algorithm for drawing the line of best fit is one that minimises the sum of squared deviations from the line to each observation:

$\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Where:

$y_i$ = observed value of y

$\hat{y}_i$ = predicted value of $y_i$ = the value on the line of best fit corresponding to $x_i$
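As a sketch of how this minimisation leads to estimates of the intercept and slope, write the fitted line as $\hat{y}_i = a + b x_i$ and set the partial derivatives of the error sum of squares to zero. This is the standard textbook derivation, added for illustration rather than taken from the slides. The symbol $S$ is introduced here only as a label for the sum.

```latex
S(a, b) = \sum_{i=1}^{n} (y_i - a - b x_i)^2

\frac{\partial S}{\partial a} = -2 \sum_{i=1}^{n} (y_i - a - b x_i) = 0
\quad \Rightarrow \quad a = \bar{y} - b \bar{x}

\frac{\partial S}{\partial b} = -2 \sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0
\quad \Rightarrow \quad \sum x_i y_i = a \sum x_i + b \sum x_i^2

\text{Substituting } a = \bar{y} - b \bar{x}: \quad
b = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2}
```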

Regression estimates of a, b using Ordinary Least Squares (OLS):

• Solving the min[error sum of squares] problem yields estimates of the slope b and y-intercept a of the straight line:

$b = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2}$

$a = \bar{y} - b\bar{x}$
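The OLS slope and intercept formulas can be applied directly. The sketch below uses plain Python on a small made-up data set. The function name `ols_fit` and the numbers are assumptions for illustration only.

```python
def ols_fit(x, y):
    """Estimate intercept a and slope b of y = a + b*x by OLS."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    # Slope: b = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - (sum(x))^2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: a = mean(y) - b * mean(x)
    a = sum_y / n - b * sum_x / n
    return a, b

# Hypothetical observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = ols_fit(x, y)
predicted = [a + b * xi for xi in x]
sse = sum((yi - yhat) ** 2 for yi, yhat in zip(y, predicted))
print(a, b, sse)
```

For these made-up points the formulas give a slope of 0.6 and an intercept of 2.2 (up to floating-point rounding), with an error sum of squares of 2.4. Any other choice of a and b would give a larger sum of squared deviations.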

3. Inference in Regression: Hypothesis tests on the slope coefficient:

• Regressions are usually run on samples, so what can we say about the population relationship between x and y?

• Repeated samples would yield a range of values for estimates of b ~ N(β, s_b)

• i.e. b is normally distributed with mean = β = population mean = value of b if regression run on population

• If there is no relationship in the population between x and y, then β = 0, & this is our H0

What does the standard error mean?

Returning to our IQ example:

Hypothesis test on b:

• (1) H0: β = 0 (i.e. slope coefficient, if regression run on population, would = 0)

H1: β ≠ 0

• (2) α = 0.05 or 0.01 etc.

• (3) Reject H0 iff P < α

• (N.B. Rule of thumb: P < 0.05 if |tc| > 2, and P < 0.01 if |tc| > 2.6)

• (4) Calculate P and conclude.

$t_c = \frac{b - \beta}{s_b} = \frac{b - 0}{s_b} = \frac{b}{s_b}, \quad df = n - 2$

Floor Area Example:

• You run a regression of house price on floor area which yields the following output. Use this output to answer the following questions:

Q/ What is the “Constant”? What does its value mean here?
Q/ What is the slope coefficient and what does it tell you here?
Q/ What is the estimated value of an extra square metre?
Q/ How would you test for the existence of a relationship between purchase price and floor area?
Q/ How much is a 200m2 house worth?
Q/ How much is a 100m2 house worth?
Q/ On average, how much is the slope coefficient likely to vary from sample to sample?

• NB Write down your answers – you’ll need them later!

Coefficients(a)

                          Unstandardized Coefficients    Standardized Coefficients
Model 1                   B             Std. Error       Beta                         t          Sig.
(Constant)                -10029.0      3242.545                                      -3.093     .002
Floor Area (sq meters)    638.825       26.108           .721                         24.469     .000

a. Dependent Variable: Purchase Price
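As a sketch of how the estimated line turns into predictions, the two B coefficients from the output above can be plugged into ŷ = a + bx. The function name `predicted_price` is a hypothetical label. The currency of Purchase Price is not stated in the output, so none is assumed here.

```python
# Coefficients copied from the SPSS Coefficients table above.
constant = -10029.0   # B for (Constant): the intercept a
slope = 638.825       # B for Floor Area (sq meters): the slope b

def predicted_price(floor_area_m2):
    """Predicted purchase price on the fitted line for a given floor area."""
    return constant + slope * floor_area_m2

for area in (100, 200):
    print(area, round(predicted_price(area), 2))
```

The slope is the estimated change in predicted price for one extra square metre, so the 200m2 prediction exceeds the 100m2 prediction by 100 times the slope.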

Floor area example:

• (1) H0: no relationship between house price and floor area.

H1: there is a relationship

• (2), (3), (4):

• P = 1 - CDF.T(24.469,554) = 0.000000

Reject H0
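The t statistic in this test is the slope estimate divided by its standard error. The sketch below recomputes it in Python from the B and Std. Error values in the output. As a labelled assumption, the P value uses a standard normal approximation from the standard library rather than the t distribution with 554 degrees of freedom used by SPSS `CDF.T`. With that many degrees of freedom the two are very close. The sketch also reports a two-sided P value, which is twice the one-tail area.

```python
from statistics import NormalDist

b = 638.825    # slope estimate (B for Floor Area)
s_b = 26.108   # standard error of the slope

# Under H0 the population slope is 0, so t = (b - 0) / s_b.
t = b / s_b

# Normal approximation to the t distribution (assumption, reasonable for df = 554).
p_two_sided = 2 * (1 - NormalDist().cdf(abs(t)))

print(round(t, 3), p_two_sided)
```

The computed t agrees with the 24.469 reported in the table to about three decimal places, and the P value is far below any conventional significance level, so H0 is rejected.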


4. Omitted Variables & R2


[Figure: scatter plot of Purchase Price (vertical axis, 0 to 300000) against floor area (horizontal axis, 0 to 300)]

Q/ Is floor area the only factor?
Q/ How much of the variation in Price does it explain?

R-square

• R-square tells you the proportion of the variation in y that is explained by the explanatory variable x
– 0 ≤ R2 ≤ 1 (NB: you want R2 to be near 1).
– If more than one explanatory variable, use Adjusted R2

Model Summary

Model    R          R Square    Adjusted R Square    Std. Error of the Estimate
1        .721(a)    .519        .519                 26925.18

a. Predictors: (Constant), Floor Area (sq meters)
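To make the idea of "variation explained" concrete, here is a minimal sketch that computes R-square as 1 minus the ratio of the error sum of squares to the total sum of squares around the mean of y. The data are invented and the function name `r_squared` is hypothetical. It covers only the single explanatory variable case.

```python
def r_squared(x, y):
    """R-square for a simple regression of y on x, computed as 1 - SSE/SST."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # OLS slope and intercept in deviation-from-mean form.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Unexplained variation (around the line) and total variation (around the mean).
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst

# Hypothetical observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r2 = r_squared(x, y)
print(round(r2, 3))
```

For these made-up points the total variation is 6 and the unexplained variation is 2.4, so the line explains 60% of the variation in y.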

House Price Example cont’d: Two explanatory variables

Coefficients(a)

                          Unstandardized Coefficients    Standardized Coefficients
Model                     B             Std. Error       Beta                         t          Sig.
(Constant)                -23817.3      3623.761                                      -6.573     .000
Floor Area (sq meters)    507.271       30.722           .572                         16.511     .000
Number of Bathrooms       24951.769     3401.282         .254                         7.336      .000

Model Summary

R          R Square    Adjusted R Square
.750(a)    .562        .560

a. Predictors: (Constant), Number of Bathrooms, Floor Area (sq meters)

Q/ How has the estimated value of an extra square metre changed?
Q/ Do a hypothesis test for the existence of a relationship between price and number of bathrooms.
Q/ How much will an extra bathroom typically add to the value of a house?
Q/ What is the value of a 200m2 house with one bathroom? Compare your estimate with that from the previous model.
Q/ What is the value of a 100m2 house with one bathroom? Compare your estimate with that from the previous model.
Q/ What is the value of a 100m2 house with two bathrooms? Compare your estimate with that from the previous model.
Q/ On average, how much is the slope coefficient on floor area likely to vary from sample to sample?
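With two explanatory variables the prediction is the constant plus each coefficient times its own variable. The sketch below plugs in the B values from the two-variable output. It assumes the unlabelled middle row (B = 507.271) is the floor area coefficient, as the predictor list suggests. `predicted_price` is a hypothetical name.

```python
# B coefficients copied from the two-variable SPSS output above.
constant = -23817.3
b_floor_area = 507.271     # assumed to be the Floor Area (sq meters) row
b_bathrooms = 24951.769    # Number of Bathrooms row

def predicted_price(floor_area_m2, bathrooms):
    """Predicted purchase price from the two-variable fitted equation."""
    return constant + b_floor_area * floor_area_m2 + b_bathrooms * bathrooms

for area, baths in [(200, 1), (100, 1), (100, 2)]:
    print(area, baths, round(predicted_price(area, baths), 2))
```

Holding floor area fixed, each extra bathroom changes the prediction by the bathrooms coefficient. Holding bathrooms fixed, each extra square metre changes it by the floor area coefficient.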

Now add number of bathrooms as an extra explanatory variable…

Scatter plot (with floor spikes)

[Figure: 3D scatter plot of Purchase Price (100000 to 300000) against Floor Area (sq meters) (100 to 300) and Number of Bathrooms (1.0 to 3.5)]

3D Surface Plots: Construction, Price & Unemployment

$Q = -246 + 27P - 0.2P^2 - 73U + 3U^2$

[Figure: 3D surface plots of Q (-500 to 500) against P (0 to 80) and U_{t-1} (0 to 15)]
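A surface like this is the fitted equation evaluated over a grid of P and U values. The sketch below evaluates the equation as printed at a few grid points in Python. Reading Q as construction, P as price and U as unemployment follows the slide title and is an assumption. The function name `construction` is hypothetical.

```python
def construction(p, u):
    """Evaluate Q = -246 + 27P - 0.2P^2 - 73U + 3U^2 at one (P, U) point."""
    return -246 + 27 * p - 0.2 * p ** 2 - 73 * u + 3 * u ** 2

# Grid roughly matching the plotted axis ranges (P: 0 to 80, U: 0 to 15).
p_values = [0, 20, 40, 60, 80]
u_values = [0, 5, 10, 15]

for u in u_values:
    row = [round(construction(p, u), 1) for p in p_values]
    print("U =", u, row)
```

Because of the squared terms, the effect of a one-unit change in P or U on Q depends on the starting level, which is why the plot is a curved surface and not a flat plane.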

Construction Equation in a Slump

$Q = 315 + 4P - 73U + 5U^2$

[Figure: 3D surface plots of Q (0 to 800) against P (0 to 80) and U (0 to 15)]

Summary

• 1. Linear & Non-linear Relationships
• 2. Fitting a line using OLS
• 3. Inference in Regression
• 4. Omitted Variables & R2

Reading:

• Regression Analysis:
– *Pryce chapter on relationships.
– *Field, A. chapters on regression.
– *Moore and McCabe chapters on regression.
– Kennedy, P. ‘A Guide to Econometrics’
– Bryman, Alan, and Cramer, Duncan (1999) “Quantitative Data Analysis with SPSS for Windows: A Guide for Social Scientists”, Chapters 9 and 10.
– Achen, Christopher H. Interpreting and Using Regression (London: Sage, 1982).
