review of correlation & regression · 2017. 2. 27. · types of dependence ... correlation...

Miskolci Egyetem Gazdaságtudományi Kar

Üzleti Információgazdálkodási és Módszertani Intézet

Review of Correlation & Regression

Petra Petrovics



Types of dependence

• association – between two nominal variables

• mixed – between a nominal and a ratio-scale variable

• correlation – among ratio-scale variables



• X (or X1, X2, … , Xp):

known variable(s) / independent variable(s) / predictor(s)

• Y: unknown variable / dependent variable

• causal relationship: X "causes" Y to change

Correlation: describes the strength of a relationship, the degree to which one variable is linearly related to another.

Regression: shows us how to determine the nature of a relationship between two or more variables.



Correlation Measures

1. Covariance

2. Coefficient of correlation

3. Coefficient of determination

4. Coefficient of rank correlation



Correlation Measures

1. Covariance

The covariance between two variables is a measure of the joint variation of the two variables

– ranges from −∞ to +∞;

– Cov = 0, when X and Y are uncorrelated;

– its sign shows the direction of correlation

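– it doesn't measure the strength of the relationship, because its magnitude depends on the units of X and Y.

$$\mathrm{Cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$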
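A minimal Python sketch of the sample covariance (sum of cross-deviations divided by n − 1). The function name `sample_covariance` and the toy data are illustrative assumptions, not part of the slides. The second call hints at why the covariance cannot measure strength: rescaling x changes its magnitude but not its sign.

```python
def sample_covariance(x, y):
    """Sample covariance: sum of cross-deviations divided by n - 1."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return cross / (n - 1)


# Hypothetical toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 9.0]

cov_xy = sample_covariance(x, y)
# Measuring x in units 100 times smaller multiplies the covariance by 100,
# although the relationship itself is exactly as strong as before.
cov_scaled = sample_covariance([100 * xi for xi in x], y)
print(cov_xy, cov_scaled)
```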



2. Coefficient of correlation (Pearson)

• its sign shows the direction of correlation

• it measures the strength of correlation

• 0 < |r| < 1: statistical dependence

r = 0: X and Y are uncorrelated

r = −1: perfect negative linear relationship

r = 1: perfect positive linear relationship

• Use it only when the relationship is linear!

$$r = \frac{\mathrm{Cov}(x, y)}{s_x s_y}$$
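As a hedged illustration, the coefficient can be computed as the covariance divided by the product of the two standard deviations. The sketch assumes that s_x and s_y use the same n − 1 divisor as the covariance; `pearson_r` and the toy data are hypothetical names and values.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: sample covariance divided by s_x * s_y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
    return cov / (s_x * s_y)


# Hypothetical toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
r_up = pearson_r(x, [3.0, 5.0, 7.0, 9.0])    # points on an increasing line
r_down = pearson_r(x, [9.0, 7.0, 5.0, 3.0])  # points on a decreasing line
r_plain = pearson_r(x, [2.0, 4.0, 5.0, 9.0])
r_scaled = pearson_r([100 * xi for xi in x], [2.0, 4.0, 5.0, 9.0])
print(r_up, r_down, r_plain, r_scaled)
```

Unlike the covariance, r is unaffected by rescaling x, which is why it can be read as a measure of strength.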



3. Coefficient of determination

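• r²

• The square of the sample correlation coefficient between the outcomes and their predicted values.

• Expressed as a percentage (%), it measures the share of the variation in Y that the model explains.

• It provides a measure of how well future outcomes are likely to be predicted by the model.

• Ranges from 0 to 1.

$$r^2 = \frac{S_{\hat{y}}}{S_y} = 1 - \frac{S_e}{S_y}$$

where $S_y$ is the sum of squares of Y, $S_{\hat{y}}$ is the sum of squares explained by the regression, and $S_e$ is the sum of squares of the errors.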



Example

• A firm administers a test to sales trainees before they go into the field. The management of the firm is interested in determining the relationship between the test scores and the sales made by the trainees at the end of one year in the field. The following data were collected for 45 sales personnel who have been in the field one year.

• Calculate the different correlation measures!



X = test score (independent variable), Y = number of units sold (dependent variable)

Salesperson   Test score x_i   Units sold y_i   d_x = x_i − x̄   d_y = y_i − ȳ   d_x·d_y
K. A.               25              188              +9              +22           +198
L. Z.               16              157               0               -9              0
B. E.               30              165             +14               -1            -14
G. P.                5              124             -11              -42           +462
…                    …                …               …                …              …
…                    …                …               …                …              …
S. G.               10              158              -6               -8            +48
J. T.               24              224              +8              +58           +464
V. P.               17              169              +1               +3             +3
T. L.                6              114             -10              -52           +520
Total              716            7 464               0                0      ∑d_x·d_y = 8 894.5



Number of observed pairs: n = 45

$$\bar{x} = 16, \quad s_x = 8.26$$

$$\bar{y} = 166, \quad s_y = 30.99$$

$$C = \frac{\sum d_x d_y}{n - 1} = \frac{8\,894.5}{45 - 1} = 202.15$$

C > 0: positive correlation



$$r = \frac{C}{s_x s_y} = \frac{202.15}{8.26 \cdot 30.99} = 0.7897$$

There is a strong, positive relationship between test scores and the number of units sold.

$$r^2 = 0.7897^2 = 62.36\,\%$$

The variation in test scores explains 62.36 percent of the variation in the number of units sold.
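The arithmetic of this example can be re-checked with a few lines of Python. Only the summary figures given in the slides are used (n = 45, ∑d_x·d_y = 8 894.5, s_x = 8.26, s_y = 30.99); the variable names are illustrative.

```python
# Summary figures taken from the sales trainee example.
n = 45
sum_dx_dy = 8894.5
s_x = 8.26
s_y = 30.99

covariance = sum_dx_dy / (n - 1)   # C
r = covariance / (s_x * s_y)       # Pearson coefficient of correlation
r_squared = r ** 2                 # coefficient of determination

print(f"C   = {covariance:.2f}")   # C   = 202.15
print(f"r   = {r:.4f}")            # r   = 0.7897
print(f"r^2 = {r_squared:.2%}")    # r^2 = 62.36%
```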



4. Coefficient of rank correlation

(Spearman)

• Measures the relationship between two ordinal variables

$$r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$

• n = number of paired observations,

d_i = difference between the ranks for each pair of observations.

• perfect correlation: r_s = 1

perfect inverse correlation: r_s = −1

in case of independence: r_s = 0

$$0 \le |r_s| \le 1$$



Example

Ten students were ranked by their mathematical and musical ability:

Student            A    B    C    D    E    F    G    H    I    J   Total
Mathematics        1    2    3    4    5    6    7    8    9   10     -
Music              3    4    1    2    5    7   10    6    8    9     -
d_i = x_i − y_i   -2   -2    2    2    0   -1   -3    2    1    1     0
d_i²               4    4    4    4    0    1    9    4    1    1    32

$$r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \cdot 32}{10\,(10^2 - 1)} = 0.806$$

→ strong relationship
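The rank correlation of this example can be reproduced with a short sketch. The function name `spearman_rank_correlation` is an illustrative choice, and the shortcut formula 1 − 6∑d²/(n(n² − 1)) is assumed to be applied to rankings without ties, as in the student data.

```python
def spearman_rank_correlation(rank_x, rank_y):
    """Spearman's coefficient from two rankings without ties."""
    n = len(rank_x)
    if n != len(rank_y) or n < 2:
        raise ValueError("both rankings must have the same length (at least 2)")
    sum_d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


# Ranks of students A..J from the example.
mathematics = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
music = [3, 4, 1, 2, 5, 7, 10, 6, 8, 9]

r_s = spearman_rank_correlation(mathematics, music)
print(round(r_s, 3))  # 0.806
```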



Simple Linear Regression Model

• We model the relationship between two variables, X and Y

as a straight line.

• The model contains two parameters:

an intercept parameter,

a slope parameter.

Y = β0 + β1x + ε

Y = deterministic component + random error

where: Y – dependent or response variable (the variable we

wish to explain or predict)

x – independent or predictor variable

ε – random error component

β0 – y-intercept of the line, i.e. the point at which the line intercepts the y-axis

β1 – slope of the line

[Figure: the line E(y) = β0 + β1x plotted against x, with the y-intercept β0 and the slope β1 marked]



[Figure: plot of y against x showing the deterministic component (the line) and the random error]

• y = deterministic component + random error

• We always assume that the mean value of the random error equals 0, so the mean value of y equals the deterministic component.

• It is possible to find many lines for which the sum of the errors is equal to 0, but there is one (and only one) line for which the SSE (sum of squares of the errors) is a minimum: the least squares line, or regression line.

$$\hat{y}_i = b_0 + b_1 x_i$$



• The method of least squares gives us the best linear unbiased estimators (BLUE) of the regression parameters β0 and β1.

• The least-squares estimators:

b0 estimates β0

b1 estimates β1

• The (empirical) regression line, where ŷ is read as "y hat" (y caret):

$$\hat{y} = b_0 + b_1 x$$

• Calculation of the estimators:

$$f(b_0, b_1) = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \rightarrow \min!$$



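Least Squares Method

• There is an extreme value (minimum) where the partial derivatives are equal to 0:

$$\frac{\partial f}{\partial b_0} = -2 \sum (y_i - b_0 - b_1 x_i) = 0$$

$$\frac{\partial f}{\partial b_1} = -2 \sum x_i (y_i - b_0 - b_1 x_i) = 0$$

• After transformation…

• The normal equations (with one x variable):

$$\sum y = n b_0 + b_1 \sum x$$

$$\sum xy = b_0 \sum x + b_1 \sum x^2$$

• The estimated regression line:

$$\hat{y} = b_0 + b_1 x$$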
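A minimal sketch of solving the two normal equations (Σy = n·b0 + b1·Σx and Σxy = b0·Σx + b1·Σx²) for b0 and b1 in Python. The function name `least_squares_line` and the toy data are assumptions made for illustration.

```python
def least_squares_line(x, y):
    """Solve the two normal equations of simple linear regression."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    b1 = (n * sum_xy - sum_x * sum_y) / denominator  # slope
    b0 = (sum_y - b1 * sum_x) / n                    # intercept
    return b0, b1


# Hypothetical toy data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = least_squares_line(x, y)
print(f"y_hat = {b0:.1f} + {b1:.1f} x")  # y_hat = 2.2 + 0.6 x
```

Substituting b0 and b1 back into either normal equation reproduces its left-hand side, which is a quick way to check a hand calculation.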



Interpretation

• b0: when x = 0, ŷ = b0.

It is the estimated value of Y when the X variable is 0.

• b1: for every 1-unit increase in x, we expect y to change by b1 units on average.

It is the average difference in Y when X is higher by 1.



No relationship

[Figure: scatter plot of the number of births (vertical axis) against the number of storks (horizontal axis)]



Independence

[Figure: scatter plot titled "No correlation", with fitted line Y = −7.4E−02 + 0.208348 X and R-Sq = 3.4 %]

Positive correlation

[Figure: scatter plot titled "Positive correlation", with fitted line Y = −8.6E−02 + 0.690286 X and R-Sq = 62.5 %]

Negative correlation

[Figure: scatter plot titled "Negative correlation", with fitted line Y = 5.07E−02 − 0.647872 X and R-Sq = 70.9 %]

Curvilinear relation

[Figure: scatter plot titled "Nonlinear correlation", with fitted curve Y = 12.0958 + 6.07684 X + 1.16686 X**2 and R-Sq = 88.4 %]

Scatter diagrams

[Figure: four scatter diagrams. Panels: wastage against production (number of products per day); sales in $ against advertising in $; selling price against age of a house (years); selling price against age of a car (years). Panel labels: direct relationship (positive slope), inverse relationship (negative slope), linear, curvilinear.]



Power regression

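$$Y = a X^b$$

Taking logarithms gives a linear form:

$$\lg Y = \lg a + b \lg X$$

which corresponds term by term to

$$V = b_0 + b_1 \cdot x$$

with V = lg Y, x = lg X, b0 = lg a and b1 = b.

The normal equations:

$$\sum \lg y = n b_0 + b_1 \sum \lg x$$

$$\sum \lg x \lg y = b_0 \sum \lg x + b_1 \sum (\lg x)^2$$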

Compound regression

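$$Y = a b^x$$

Taking logarithms gives a linear form:

$$\lg Y = \lg a + x \lg b$$

which corresponds term by term to

$$V = b_0 + b_1 \cdot x$$

with V = lg Y, b0 = lg a and b1 = lg b.

The normal equations:

$$\sum \lg y = n b_0 + b_1 \sum x$$

$$\sum x \lg y = b_0 \sum x + b_1 \sum x^2$$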

Estimation in Regression

• Regression estimation is a technique used to replace

missing values in data.

• If we know:

1. The estimated parameter value;

2. The hypothesized value of the parameter;

3. Confidence interval around the estimated parameter.

• The number of degrees of freedom equals the number of observations minus the number of parameters estimated: df = n − 2



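Parameter, estimated value and standard error, where $s_e = \sqrt{S_e / (n - 2)}$:

• β0 is estimated by b0, with standard error

$$s_{b_0} = s_e \sqrt{\frac{\sum x_i^2}{n \sum (x_i - \bar{x})^2}}$$

• β1 is estimated by b1, with standard error

$$s_{b_1} = \frac{s_e}{\sqrt{\sum (x_i - \bar{x})^2}}$$

• Y at x0 is estimated by ŷ0, with standard error

In case of average Y values:

$$s_{\hat{y}_0} = s_e \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

In case of individual (discrete) Y values:

$$s_{\hat{y}_0} = s_e \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

Confidence intervals, with t taken at df = n − 2:

$$b_0 \pm t \cdot s_{b_0}, \qquad b_1 \pm t \cdot s_{b_1}, \qquad \hat{y}_0 \pm t \cdot s_{\hat{y}_0}$$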
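A sketch of the standard errors used for interval estimation in simple regression, computed on hypothetical toy data. The function names are illustrative assumptions. The critical t value (df = n − 2) is not computed here; it would come from a t table or a statistics library.

```python
import math


def regression_standard_errors(x, y):
    """Return b0, b1 and their standard errors for simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    b0 = mean_y - b1 * mean_x

    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))  # residual standard error, df = n - 2
    s_b0 = s_e * math.sqrt(sum(xi ** 2 for xi in x) / (n * sxx))
    s_b1 = s_e / math.sqrt(sxx)
    return {"b0": b0, "b1": b1, "s_e": s_e, "s_b0": s_b0, "s_b1": s_b1,
            "n": n, "mean_x": mean_x, "sxx": sxx}


def prediction_standard_error(fit, x0, individual):
    """Standard error of y_hat at x0 for the mean of Y or an individual Y."""
    extra = 1.0 if individual else 0.0
    return fit["s_e"] * math.sqrt(
        extra + 1 / fit["n"] + (x0 - fit["mean_x"]) ** 2 / fit["sxx"]
    )


# Hypothetical toy data, for illustration only.
fit = regression_standard_errors([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
se_mean = prediction_standard_error(fit, x0=4, individual=False)
se_individual = prediction_standard_error(fit, x0=4, individual=True)
print(fit["s_b0"], fit["s_b1"], se_mean, se_individual)
```

The interval for an individual Y value is always wider than the one for the average Y, because of the additional 1 under the square root.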



Elasticity

$$E(y, x) = b_1 \cdot \frac{x}{b_0 + b_1 x}$$

Elasticity at the mean:

$$E(y, x) = b_1 \cdot \frac{\bar{x}}{\bar{y}}$$

Elasticity shows the % change in y associated with a 1 % change in x.
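A small sketch of the elasticity of a fitted line, using hypothetical coefficients (b0 = 2.2, b1 = 0.6) and means (x̄ = 3, ȳ = 4) chosen only for illustration. Because the least squares line passes through the point (x̄, ȳ), evaluating the general formula at x̄ gives the same value as b1·x̄/ȳ.

```python
def elasticity(b0, b1, x):
    """Elasticity of the fitted line y_hat = b0 + b1 * x at the point x."""
    return b1 * x / (b0 + b1 * x)


# Hypothetical coefficients and sample means, for illustration only.
b0, b1 = 2.2, 0.6
mean_x, mean_y = 3.0, 4.0

at_mean_general = elasticity(b0, b1, mean_x)
at_mean_shortcut = b1 * mean_x / mean_y
print(at_mean_general, at_mean_shortcut)
```

With these numbers, a 1 % increase in x around its mean goes together with roughly a 0.45 % increase in y.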



Residual variable

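$$e_i = y_i - \hat{y}_i$$

$$y_i = \hat{y}_i + e_i$$

$$y_i - \bar{y} = (\hat{y}_i - \bar{y}) + e_i$$

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

$$S_y = S_{\hat{y}} + S_e$$

S_y: sum of squares of Y

S_ŷ: sum of squares explained by the regression

S_e: sum of squares of the errors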
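The decomposition of the total sum of squares into an explained part and an error part can be verified numerically. The sketch below uses hypothetical toy data and illustrative names; it fits the least squares line and then compares the three sums of squares.

```python
def sums_of_squares(x, y):
    """Return (S_y, S_yhat, S_e) for the least squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    b0 = mean_y - b1 * mean_x
    fitted = [b0 + b1 * xi for xi in x]

    s_total = sum((yi - mean_y) ** 2 for yi in y)               # S_y
    s_regression = sum((fi - mean_y) ** 2 for fi in fitted)     # S_yhat
    s_error = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # S_e
    return s_total, s_regression, s_error


# Hypothetical toy data, for illustration only.
s_total, s_regression, s_error = sums_of_squares([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
r_squared = s_regression / s_total
print(s_total, s_regression, s_error, r_squared)
```

The identity holds only for the least squares line; for any other line the cross-product term does not vanish.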



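Analysis of Variance in Regression Analysis

Source        Sum of Squares   Df      Mean Sum of Squares     F
Regression    S_ŷ              1       S_ŷ                     F = S_ŷ / (S_e / (n − 2))
Residual      S_e              n − 2   s_e² = S_e / (n − 2)
Total         S_y              n − 1   S_y / (n − 1)

$$S_y = S_{\hat{y}} + S_e$$

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

$$S_{\hat{y}} = \sum (\hat{y}_i - \bar{y})^2, \qquad S_e = \sum (y_i - \hat{y}_i)^2, \qquad S_y = \sum (y_i - \bar{y})^2$$

$$s_e^2 = \frac{S_e}{n - 2}, \qquad F = \frac{S_{\hat{y}}}{S_e / (n - 2)}$$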
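A sketch that assembles the analysis of variance rows for a simple regression on hypothetical toy data. The function name and the dictionary layout are assumptions made for illustration; the F value would then be compared with a critical value for (1; n − 2) degrees of freedom from an F table.

```python
def anova_simple_regression(x, y):
    """Build the ANOVA quantities of a simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    b0 = mean_y - b1 * mean_x
    fitted = [b0 + b1 * xi for xi in x]

    s_regression = sum((fi - mean_y) ** 2 for fi in fitted)
    s_error = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    mean_square_error = s_error / (n - 2)

    return {
        "regression": {"ss": s_regression, "df": 1, "ms": s_regression},
        "residual": {"ss": s_error, "df": n - 2, "ms": mean_square_error},
        "total": {"ss": s_regression + s_error, "df": n - 1},
        "F": s_regression / mean_square_error,
    }


# Hypothetical toy data, for illustration only.
table = anova_simple_regression([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for source in ("regression", "residual", "total"):
    print(source, table[source])
print("F =", table["F"])
```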



Model testing

H0: β1 = 0

H1: β1 ≠ 0 (linear model)

Test statistic:

$$F = \frac{S_{\hat{y}}}{S_e / (n - 2)} = \frac{S_{\hat{y}}}{s_e^2}$$

• The F-statistic tests whether all the slope coefficients in a linear regression are equal to 0.

• It measures how well the regression equation explains the variation in the dependent variable.

[Figure: sketches of the F distribution showing the acceptance region for H0 and the rejection region for H1 at the critical F values]



Parameter testing

H0: β1 = 0

H1: β1 ≠ 0

Test statistic:

$$t = \frac{b_1}{s(b_1)}$$

where: b1 is the least squares estimate of the regression slope

s(b1) is the standard error of b1

[Figure: sketches of the t distribution showing the acceptance region for H0 and the rejection regions for H1 at the critical t values]
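A minimal sketch of the slope test on hypothetical toy data, with illustrative names. It computes t as b1 divided by its standard error and, as a cross-check, the F statistic of the model test. With a single predictor the two tests are equivalent, since t² = F.

```python
import math


def slope_t_and_f(x, y):
    """Return (t, F) for H0: beta1 = 0 in a simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    b0 = mean_y - b1 * mean_x
    fitted = [b0 + b1 * xi for xi in x]

    s_error = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    s_regression = sum((fi - mean_y) ** 2 for fi in fitted)
    s_e_squared = s_error / (n - 2)

    s_b1 = math.sqrt(s_e_squared / sxx)  # standard error of the slope
    t = b1 / s_b1
    f = s_regression / s_e_squared
    return t, f


# Hypothetical toy data, for illustration only.
t, f = slope_t_and_f([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(t, f, t ** 2)
```

H0 is rejected when |t| exceeds the critical t value for n − 2 degrees of freedom at the chosen significance level; that critical value is taken from a table and is not computed in this sketch.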



Thanks for your attention!
