Post on 28-Mar-2015


1

Correlation and Simple Regression

2

Introduction

Interested in the relationships between variables.

What will happen to one variable if another is changed?

To what extent is it the case that increases in the interest rate reduce inflation?

Might want to know how sensitive the relationship is, and, if possible, what form it takes.

Models needed.

3

Koop’s Deforestation Data

Y – average annual forest loss, 1981-1990 as % of total forested area

X – number of people per 1000 hectares

Data on 70 tropical countries (N=70)

4

Line of Best Fit to Forest Data

[Figure 1.1: Deforestation/Population Density Data with Line of Best Fit. Scatter plot of Forest Loss (Y) against Population Density; series: Y, Predicted Y.]

5

Line of Best Fit to Forest Data

[Figure: Predicted Value of Forest Loss Given Population Density. Scatter plot of Forest Loss (Y) against Population Density; series: Y, Predicted Y.]

6

X=2000 implies Y=2.3. If there are 2000 people per 1000 hectares, forest loss would be about 2.3%.

Comments

i) Increased dispersion about the line as X increases; more uncertainty about predictions for higher population densities.

ii) Ignores other impacts on deforestation.

7

Correlation

Objectives of Correlation

To measure how close the relationship between two variables is to linearity – strength of linear association

Capture the sign of the relationship

Measure on a common scale for all cases: -1 to +1

The closer to zero, the weaker the correlation

8

Sample Covariance

X and Y vary about their mean values.

To what extent is this variation aligned?

9

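Scatter Plot of Forest Loss Against Population Density: Axes Crossing at Mean Points

[Figure: scatter plot of forest loss against population density, with the axes crossing at the mean point ($\bar{X} = 639$, $\bar{Y} = 1.14$). Each of the four quadrants is labelled with the signs of the deviations $(X_i - \bar{X})$ and $(Y_i - \bar{Y})$.]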

10

Deviations from Mean

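$(X_i - \bar{X})$, $(Y_i - \bar{Y})$ same sign: $(X_i - \bar{X})(Y_i - \bar{Y}) > 0$

$(X_i - \bar{X})$, $(Y_i - \bar{Y})$ opposite sign: $(X_i - \bar{X})(Y_i - \bar{Y}) < 0$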

11

Sample Covariance Formula

$$\sigma_{X,Y} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{N - 1}$$

Problem: varies with the scale of the data
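The scale problem can be seen in a few lines of Python. This is only a sketch: the data below are made up for illustration (they are not Koop's), and the function name `sample_covariance` is a hypothetical helper using the N − 1 divisor.

```python
def sample_covariance(xs, ys):
    """Sample covariance of two equal-length samples, dividing by N - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical data (NOT Koop's): density in people per 1000 hectares, loss in %.
density = [100.0, 400.0, 900.0, 1500.0, 2600.0]
loss = [0.4, 0.9, 1.1, 2.0, 3.1]

cov_per_1000_ha = sample_covariance(density, loss)

# The same information with density re-expressed in people per hectare.
cov_per_ha = sample_covariance([d / 1000.0 for d in density], loss)

print(cov_per_1000_ha, cov_per_ha)
```

The two covariances differ by the factor of 1000 used to rescale X, although the relationship between the variables is unchanged.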

12

Sample Correlation I

Standardise using sample standard deviations

Sample variance:

$$\sigma_X^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}$$

Sample standard deviation:

$$\sigma_X = \sqrt{\sigma_X^2}$$

13

Sample Correlation II

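$$r_{X,Y} = \frac{\sigma_{X,Y}}{\sigma_X \sigma_Y} = \frac{\frac{1}{N-1}\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (X_i - \bar{X})^2}\,\sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (Y_i - \bar{Y})^2}}$$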
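A minimal Python sketch of the sample correlation, assuming the N − 1 divisor throughout. The function name `sample_correlation` and the data are hypothetical, chosen only to show that standardising removes the dependence on scale.

```python
import math


def sample_correlation(xs, ys):
    """Sample correlation: covariance divided by both standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return cov_xy / (sd_x * sd_y)


# Hypothetical data (not from the slides).
density = [100.0, 400.0, 900.0, 1500.0, 2600.0]
loss = [0.4, 0.9, 1.1, 2.0, 3.1]

r_original = sample_correlation(density, loss)
# Rescaling X (people per hectare instead of per 1000 hectares) leaves r unchanged.
r_rescaled = sample_correlation([d / 1000.0 for d in density], loss)

# Points lying exactly on a straight line give r = +1 or r = -1.
r_perfect_up = sample_correlation([1, 2, 3, 4], [3, 5, 7, 9])
r_perfect_down = sample_correlation([1, 2, 3, 4], [9, 7, 5, 3])

print(r_original, r_rescaled, r_perfect_up, r_perfect_down)
```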

14

Calculations for Deforestation Data

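$$\sigma_X^2 = 527569.76, \qquad \sigma_X = 726.34$$

$$\sigma_Y^2 = 0.8615, \qquad \sigma_Y = 0.9282$$

$$\sigma_{X,Y} = 444.39, \qquad r_{X,Y} = \frac{\sigma_{X,Y}}{\sigma_X \sigma_Y} = 0.6592$$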
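The arithmetic on this slide can be checked in a few lines of Python. The sketch assumes the slide's figures read as a variance of X of 527569.76, a variance of Y of 0.8615 and a covariance of 444.39; the variable names are illustrative.

```python
import math

# Figures as read from the slide (assumed reading of the garbled transcript).
var_x = 527569.76   # sample variance of population density
var_y = 0.8615      # sample variance of forest loss
cov_xy = 444.39     # sample covariance

sd_x = math.sqrt(var_x)
sd_y = math.sqrt(var_y)

# Correlation = covariance standardised by the two standard deviations.
r = cov_xy / (sd_x * sd_y)

print(round(sd_x, 2), round(sd_y, 4), round(r, 4))
```

Up to rounding in the inputs, the result agrees with the correlation of about 0.66 reported for the deforestation data.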

15

Correlation and Causality

Must distinguish between causality and correlation.

Correlation does not imply causality.

A correlation does not even indicate which way the causality should run (from X to Y or the other way round).

Two trending time series variables may be spuriously correlated.

Causality is judgmental.
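The point about trending series can be illustrated with a small simulation. This is a sketch under stated assumptions: the two series are invented, generated independently of each other, and share nothing except an upward trend; the trend slopes, noise levels and seed are arbitrary choices.

```python
import math
import random


def sample_correlation(xs, ys):
    """Sample correlation with the N - 1 divisor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return cov_xy / (sd_x * sd_y)


rng = random.Random(0)  # fixed seed so the run is repeatable
years = range(30)

# Two hypothetical series with independent noise; each simply trends upward.
series_a = [100.0 + 3.0 * t + rng.gauss(0.0, 5.0) for t in years]
series_b = [20.0 + 0.5 * t + rng.gauss(0.0, 1.0) for t in years]

r_levels = sample_correlation(series_a, series_b)
print(r_levels)
```

The correlation comes out close to 1 even though neither series was constructed to depend on the other; the common trend alone produces it.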

16

Example: UK Aggregate Consumption and Income

Aggregate UK consumption and income over a period of years are highly correlated.

Economists believe there is a relationship between these two variables.

Take correlation to be evidence in favour of the existence of a causal relationship: income causes consumption.

17

Time Series Plot of UK Aggregate Consumption and Income

[Figure: Time Series Plot of UK Constant Price Consumption and Income, £Million, 1955-1984. Horizontal axis: Time Index; series: Consumption, Income.]

18

Scatter Plot of UK Aggregate Consumption Against Income

[Figure: Scatter Plot of UK Constant Price Consumption Against Constant Price Income, £Million, 1955-1984; r = 0.9982. Axes: Income (horizontal), Consumption (vertical).]

19

Another Example

Ratio of unemployment benefit to wages, X, and the unemployment rate, Y.

Annual observations for 1920-1938 for the UK.

Theory: X causes Y

Policy implication: r>0 implies cut benefits relative to wages to reduce unemployment.

20

Scatter Plot of Unemployment Against Benefit/Wage Ratio

[Figure: Scatter Plot of UK Unemployment Rate (Y) Against Benefit/Wage Ratio (X); r = 0.3888. One observation is singled out on the plot.]

What happens to r if the singled-out observation is not included?

21

Final Comments

Correlation measures linear association on scale [-1,+1].

r=-1 or r=+1 indicates PERFECT linear correlation (exact straight line).

Only concerned with the relationship between TWO variables (bivariate).

This measure is sensitive to outliers.

Correlation may be taken as supportive evidence of a causal relationship, but correlation does not imply causality.
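Sensitivity to outliers can be shown with a toy example. The sketch below uses invented numbers (not the unemployment data): six points with only a weak linear association, plus one extreme observation.

```python
import math


def sample_correlation(xs, ys):
    """Sample correlation with the N - 1 divisor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return cov_xy / (sd_x * sd_y)


# Hypothetical data: a loose cloud of six points.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 1.0, 3.0, 2.0, 3.0, 2.0]
r_without = sample_correlation(xs, ys)

# Add a single observation far from the rest of the data.
r_with = sample_correlation(xs + [20.0], ys + [20.0])

print(round(r_without, 3), round(r_with, 3))
```

A single extreme point is enough to move r from a weak value to one close to +1, which is why it is worth asking what happens to r when an unusual observation is dropped.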

22

Bivariate regression

Correlation can:

Indicate the strength of a relationship

It cannot:

Contribute to an understanding of how the variables may be related

Make predictions about Y based on knowledge of X

Regression analysis can:

Examine the nature of the relationship between X and Y

Make predictions from that.

23

Line of Best Fit to Forest Data

[Figure 2.1: Deforestation/Population Density Data with Line of Best Fit. Scatter plot of Forest Loss (Y) against Population Density; series: Y, Predicted Y.]

24

Introduction

What is the line of best fit?

How can it be defined?

What does it mean?

Can place line by eye, but non-systematic.

25

[Figure: Scatter Plot of UK Constant Price Consumption Against Constant Price Income, £Million, 1955-1984; r = 0.9982. Axes: Income (horizontal), Consumption (vertical).]

UK consumption-income scatter plot gives a very strong indication of a linear relationship.

26

[Figure: Scatter Plot of UK Unemployment Rate (Y) Against Benefit/Wage Ratio (X); r = 0.3888. Axes: Benefit/Wage Ratio (horizontal), Unemployment Rate (vertical).]

The plot of UK unemployment against the benefit-to-wage ratio does not look linear.

27

Models

Simplest model: straight line

$$Y = \alpha + \beta X$$

Too constrained – will never hold exactly.

Allow for disturbances for each case, i=1,2,…,N

$$Y_i = \alpha + \beta X_i + \varepsilon_i$$

Properties of disturbances: on average zero, but they vary.

They have: mean zero, and variance denoted $\sigma^2$

28

[Figure: Scatter Plot of Data Generated According to $Y = 1 + 2X + \varepsilon$, $\mathrm{Var}(\varepsilon) = 1$; r = 0.9991. Axes: X (horizontal), Y (vertical).]

29

[Figure: Scatter Plot of Data Generated According to $Y = 1 + 2X + \varepsilon$, $\mathrm{Var}(\varepsilon) = 20$; r = 0.6749. Axes: X (horizontal), Y (vertical).]
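The two simulated scatter plots can be imitated with a short Python sketch. Everything beyond the equation Y = 1 + 2X + disturbance is an assumption made for illustration: the sample size, the range of X, the seed, and the disturbance spread (`noise_sd`, a standard deviation). The resulting correlations will therefore not reproduce the slides' values exactly.

```python
import math
import random


def sample_correlation(xs, ys):
    """Sample correlation with the N - 1 divisor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return cov_xy / (sd_x * sd_y)


def simulate(noise_sd, n=30, seed=0):
    """Draw n points from Y = 1 + 2X + disturbance (mean-zero normal noise)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 35.0) for _ in range(n)]
    ys = [1.0 + 2.0 * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


# Illustrative spreads: a tight cloud and a much noisier one.
r_small_noise = sample_correlation(*simulate(noise_sd=1.0))
r_large_noise = sample_correlation(*simulate(noise_sd=20.0))

print(r_small_noise, r_large_noise)
```

The underlying line is the same in both runs; only the size of the disturbances changes, and the correlation falls as they grow.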

30

So what?

We have a theory that allows us to think of there being an underlying linear relationship, but one that isn’t exact.

This fits with what we observe.

It leads to a statistical theory of errors, the real life equivalent of the theoretical disturbances, that eventually allows testing of various sorts.

31

Least Squares Line: Bivariate Linear Regression

Want the BEST LINEAR description of the way Y depends on X

Deforestation on population density, or consumption on income, or unemployment on the benefit to wage ratio.

Geometrically, we want the best fitting straight line to the data presented on a scatter plot.

"Best" needs to be defined.

32

[Figure: Scatter Plot of UK Unemployment Rate (Y) Against Benefit/Wage Ratio (X); r = 0.3888, with individual errors marked. Annotations: "Lots of big errors, $e_i$" in one region of the plot; "Errors smaller here" in another.]

33

Want to calculate the best values of $\hat{\alpha}$, $\hat{\beta}$ in

$$Y_i = \hat{\alpha} + \hat{\beta} X_i + e_i, \qquad i = 1, 2, \ldots, N.$$

34

Equation of the ‘fitted’ line – note that subscripts are not used here:

$$Y = \hat{\alpha} + \hat{\beta} X$$

Predicted (fitted) value of $Y_i$ given $X_i$:

$$\hat{Y}_i = \hat{\alpha} + \hat{\beta} X_i$$

35

[Figure: Scatter Plot of UK Unemployment Rate (Y) Against Benefit/Wage Ratio (X); r = 0.3888, showing the fitted line $\hat{Y} = \hat{\alpha} + \hat{\beta} X$, an observation $(X_i, Y_i)$, and its fitted value $\hat{Y}_i$ at $X_i$.]

36

The Error

Also called the RESIDUAL

$$e_i = Y_i - \hat{Y}_i = Y_i - \hat{\alpha} - \hat{\beta} X_i$$

There are N of these, one for each i=1,2,…,N

37

The Best Line

Actually, a best line – others can be defined

That which minimises the sum of the squares of the errors
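The criterion can be sketched in Python. The transcript stops before giving the minimising values, so the closed-form expressions used below (slope equal to the sum of cross-deviations over the sum of squared X deviations, intercept equal to the mean of Y minus the slope times the mean of X) are the standard textbook result for this criterion rather than something stated in the slides. The data and function names are hypothetical.

```python
def least_squares_line(xs, ys):
    """Intercept and slope that minimise the sum of squared errors."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


def sum_squared_errors(xs, ys, alpha, beta):
    """Sum over i of e_i squared, where e_i = Y_i - alpha - beta * X_i."""
    return sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))


# Hypothetical data (not from the slides).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

alpha_hat, beta_hat = least_squares_line(xs, ys)
sse_best = sum_squared_errors(xs, ys, alpha_hat, beta_hat)

# A line placed "by eye" (here: a nudged intercept and slope) does worse.
sse_other = sum_squared_errors(xs, ys, alpha_hat + 0.1, beta_hat - 0.05)

# Residuals from the least squares line, one per observation.
residuals = [y - alpha_hat - beta_hat * x for x, y in zip(xs, ys)]

print(alpha_hat, beta_hat, sse_best, sse_other)
```

Any other choice of intercept and slope gives a sum of squared errors at least as large as `sse_best`, which is the sense in which this line is "best".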
