The Multiple Correlation Coefficient

Upload: octavia-hubbard

Post on 28-Dec-2015


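Suppose

$$\begin{bmatrix} y \\ \vec{x}_1 \end{bmatrix}, \qquad \vec{x}_1 = (x_1, \dots, x_p)',$$

has a (p + 1)-variate Normal distribution with mean vector

$$\vec{\mu} = \begin{bmatrix} \mu_y \\ \vec{\mu}_1 \end{bmatrix}$$

and covariance matrix

$$\Sigma = \begin{bmatrix} \sigma_{yy} & \vec{\sigma}_{1y}' \\ \vec{\sigma}_{1y} & \Sigma_{11} \end{bmatrix}.$$

We are interested in whether the variable y is independent of the vector $\vec{x}_1$.

Definition

The multiple correlation coefficient is the maximum correlation between y and a linear combination of the components of $\vec{x}_1$.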

Derivation

Let

$$\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} y \\ \vec{a}'\vec{x}_1 \end{bmatrix} = A \begin{bmatrix} y \\ \vec{x}_1 \end{bmatrix}, \qquad A = \begin{bmatrix} 1 & \vec{0}' \\ 0 & \vec{a}' \end{bmatrix}.$$

This vector has a bivariate Normal distribution with mean vector

$$A\vec{\mu} = \begin{bmatrix} \mu_y \\ \vec{a}'\vec{\mu}_1 \end{bmatrix}$$

and covariance matrix

$$A \Sigma A' = \begin{bmatrix} \sigma_{yy} & \vec{\sigma}_{1y}'\vec{a} \\ \vec{a}'\vec{\sigma}_{1y} & \vec{a}'\Sigma_{11}\vec{a} \end{bmatrix}.$$

The multiple correlation coefficient is the maximum correlation between y and $\vec{a}'\vec{x}_1$.

The correlation between y and $\vec{a}'\vec{x}_1$ is

$$\rho(\vec{a}) = \frac{\vec{\sigma}_{1y}'\vec{a}}{\sqrt{\sigma_{yy}\,\vec{a}'\Sigma_{11}\vec{a}}}.$$

Thus we want to choose $\vec{a}$ to maximize $\rho(\vec{a})$, or equivalently

$$\rho^2(\vec{a}) = \frac{(\vec{\sigma}_{1y}'\vec{a})^2}{\sigma_{yy}\,\vec{a}'\Sigma_{11}\vec{a}} = \frac{1}{\sigma_{yy}}\,\frac{\vec{a}'\vec{\sigma}_{1y}\vec{\sigma}_{1y}'\vec{a}}{\vec{a}'\Sigma_{11}\vec{a}}.$$

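Note:

$$\frac{d\rho^2(\vec{a})}{d\vec{a}} = \frac{1}{\sigma_{yy}}\,\frac{(\vec{a}'\Sigma_{11}\vec{a})\,\dfrac{d(\vec{\sigma}_{1y}'\vec{a})^2}{d\vec{a}} - (\vec{\sigma}_{1y}'\vec{a})^2\,\dfrac{d(\vec{a}'\Sigma_{11}\vec{a})}{d\vec{a}}}{(\vec{a}'\Sigma_{11}\vec{a})^2}$$

$$= \frac{1}{\sigma_{yy}}\,\frac{2(\vec{a}'\Sigma_{11}\vec{a})(\vec{\sigma}_{1y}'\vec{a})\,\vec{\sigma}_{1y} - 2(\vec{\sigma}_{1y}'\vec{a})^2\,\Sigma_{11}\vec{a}}{(\vec{a}'\Sigma_{11}\vec{a})^2} = \vec{0}$$

or

$$(\vec{a}'\Sigma_{11}\vec{a})\,\vec{\sigma}_{1y} = (\vec{\sigma}_{1y}'\vec{a})\,\Sigma_{11}\vec{a}$$

or

$$\vec{a}_{opt} = \frac{\vec{a}'\Sigma_{11}\vec{a}}{\vec{\sigma}_{1y}'\vec{a}}\,\Sigma_{11}^{-1}\vec{\sigma}_{1y} = k\,\Sigma_{11}^{-1}\vec{\sigma}_{1y}.$$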

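$$\rho_{y\,x_1,\dots,x_p} = \frac{\vec{\sigma}_{1y}'\vec{a}_{opt}}{\sqrt{\sigma_{yy}\,\vec{a}_{opt}'\Sigma_{11}\vec{a}_{opt}}}$$

The multiple correlation coefficient is independent of the value of k (for any k > 0):

$$\rho_{y\,x_1,\dots,x_p} = \frac{k\,\vec{\sigma}_{1y}'\Sigma_{11}^{-1}\vec{\sigma}_{1y}}{\sqrt{\sigma_{yy}\,k^2\,\vec{\sigma}_{1y}'\Sigma_{11}^{-1}\Sigma_{11}\Sigma_{11}^{-1}\vec{\sigma}_{1y}}} = \frac{\vec{\sigma}_{1y}'\Sigma_{11}^{-1}\vec{\sigma}_{1y}}{\sqrt{\sigma_{yy}\,\vec{\sigma}_{1y}'\Sigma_{11}^{-1}\vec{\sigma}_{1y}}} = \sqrt{\frac{\vec{\sigma}_{1y}'\Sigma_{11}^{-1}\vec{\sigma}_{1y}}{\sigma_{yy}}}$$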

Thus

$$\rho_{y\,x_1,\dots,x_p} = \sqrt{\frac{\vec{\sigma}_{1y}'\Sigma_{11}^{-1}\vec{\sigma}_{1y}}{\sigma_{yy}}}$$

and $\rho_{y\,x_1,\dots,x_p} = 0$ if $\vec{\sigma}_{1y} = \vec{0}$, that is, if y is independent of $\vec{x}_1$.

The sample multiple correlation coefficient

Let

$$S = \begin{bmatrix} s_{yy} & \vec{s}_{1y}' \\ \vec{s}_{1y} & S_{11} \end{bmatrix}$$

denote the sample covariance matrix. Then the sample multiple correlation coefficient is

$$r_{y\,x_1,\dots,x_p} = \sqrt{\frac{\vec{s}_{1y}'S_{11}^{-1}\vec{s}_{1y}}{s_{yy}}}.$$
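As a minimal sketch (not part of the original slides), the sample multiple correlation coefficient can be evaluated in plain Python. The example takes Language as y and (Relaxation, Motivation) as the x-vector, using the correlations printed in the summary statistics of this transcript; because correlations stand in for covariances here, s_yy = 1. The helper names `solve`, `correlation_with` and `multiple_correlation` are hypothetical. The last two lines illustrate that rescaling the optimal weights by k leaves the correlation unchanged, while a different direction gives a smaller value.

```python
import math

def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def correlation_with(a, s_yy, s_1y, S_11):
    """Sample correlation between y and the linear combination a'x_1."""
    cov = sum(s * a_i for s, a_i in zip(s_1y, a))
    var = sum(a[i] * S_11[i][j] * a[j] for i in range(len(a)) for j in range(len(a)))
    return cov / math.sqrt(s_yy * var)

def multiple_correlation(s_yy, s_1y, S_11):
    """Return (r, a_opt), where a_opt = S_11^{-1} s_1y (the choice k = 1)."""
    a_opt = solve(S_11, s_1y)
    quad = sum(s * a_i for s, a_i in zip(s_1y, a_opt))  # s_1y' S_11^{-1} s_1y
    return math.sqrt(quad / s_yy), a_opt

# Assumed illustration: correlations of Language with (Relaxation, Motivation).
S_11 = [[1.0, 0.391], [0.391, 1.0]]
s_1y = [0.050, 0.510]
s_yy = 1.0

r, a_opt = multiple_correlation(s_yy, s_1y, S_11)
print(round(r, 3))
print(correlation_with([3.0 * v for v in a_opt], s_yy, s_1y, S_11))  # k = 3, same value
print(correlation_with([1.0, 0.0], s_yy, s_1y, S_11))                # Relaxation alone
```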

Testing for independence between y and $\vec{x}_1$

The test statistic

$$F = \frac{n-p-1}{p}\,\frac{r^2_{y\,x_1,\dots,x_p}}{1 - r^2_{y\,x_1,\dots,x_p}} = \frac{n-p-1}{p}\,\frac{\vec{s}_{1y}'S_{11}^{-1}\vec{s}_{1y}}{s_{yy} - \vec{s}_{1y}'S_{11}^{-1}\vec{s}_{1y}}$$

If independence is true then the test statistic F will have an F-distribution with $\nu_1 = p$ degrees of freedom in the numerator and $\nu_2 = n - p - 1$ degrees of freedom in the denominator.

The test is to reject independence if:

$$F \ge F_\alpha(p,\, n-p-1)$$
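The F statistic above is simple arithmetic on the squared sample multiple correlation. The following sketch uses made-up inputs (r² = 0.25, n = 65, p = 2), and the function name `f_statistic` is hypothetical. The critical value is not computed here; with SciPy installed it could be obtained from `scipy.stats.f.ppf(1 - alpha, df1, df2)`.

```python
def f_statistic(r_squared, n, p):
    """F = ((n - p - 1) / p) * r^2 / (1 - r^2), with (p, n - p - 1) degrees of freedom."""
    df1, df2 = p, n - p - 1
    f_value = (df2 / df1) * r_squared / (1.0 - r_squared)
    return f_value, df1, df2

# Hypothetical inputs, chosen only to illustrate the arithmetic.
F, df1, df2 = f_statistic(0.25, n=65, p=2)
print(F, df1, df2)
```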

Other tests for Independence

Test for zero correlation (independence between two variables)

The test statistic

$$t = \sqrt{n-2}\,\frac{r}{\sqrt{1-r^2}}$$

If independence is true then the test statistic t will have a t-distribution with $\nu = n - 2$ degrees of freedom.

The test is to reject independence if:

$$|t| \ge t_{\alpha/2}^{(n-2)}$$
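A short sketch of the t statistic for zero correlation, assuming the Motivation and Language correlation (0.510) and n = 65 from the example in this transcript. The function name `t_statistic` is hypothetical, and the comparison with the critical value is left to a t table.

```python
import math

def t_statistic(r, n):
    """t = sqrt(n - 2) * r / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return math.sqrt(n - 2) * r / math.sqrt(1.0 - r * r), n - 2

t, df = t_statistic(0.510, 65)
print(round(t, 2), df)
```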

Test for non-zero correlation ($H_0: \rho = \rho_0$)

The test statistic

$$z = \frac{\dfrac{1}{2}\ln\dfrac{1+r}{1-r} - \dfrac{1}{2}\ln\dfrac{1+\rho_0}{1-\rho_0}}{\sqrt{\dfrac{1}{n-3}}}$$

If $H_0$ is true the test statistic z will have approximately a standard Normal distribution.

We then reject $H_0$ if:

$$|z| \ge z_{\alpha/2}$$
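The z statistic uses Fisher's transformation, which equals `math.atanh(r)`. The sketch below is an illustration under assumptions: the observed correlation 0.781 (Reading with Language) and n = 65 come from the example in this transcript, while the null value 0.5 is made up. The function name `fisher_z_test` is hypothetical; the two-sided p-value comes from the standard library's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

def fisher_z_test(r, rho0, n):
    """Return (z, two-sided p-value) for H0: rho = rho0."""
    # atanh(r) = 0.5 * ln((1 + r) / (1 - r)); dividing by sqrt(1/(n-3)) is
    # the same as multiplying by sqrt(n - 3).
    z = (math.atanh(r) - math.atanh(rho0)) * math.sqrt(n - 3)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p_value = fisher_z_test(0.781, 0.5, 65)  # rho0 = 0.5 is a hypothetical null value
print(round(z, 2), p_value)
```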

Test for zero partial correlation (conditional independence between two variables given a set of p independent variables)

$r_{ij \cdot x_1,\dots,x_p}$ = the partial correlation between $y_i$ and $y_j$ given $x_1, \dots, x_p$.

The test statistic

$$t = \sqrt{n-p-2}\,\frac{r_{ij \cdot x_1,\dots,x_p}}{\sqrt{1 - r^2_{ij \cdot x_1,\dots,x_p}}}$$

If independence is true then the test statistic t will have a t-distribution with $\nu = n - p - 2$ degrees of freedom.

The test is to reject independence if:

$$|t| \ge t_{\alpha/2}^{(n-p-2)}$$
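As a hedged illustration for the simplest case p = 1, the partial correlation can be computed from three ordinary correlations with the standard first-order formula (this formula is not given in the slides). The example conditions Reading and Language on Motivation, using the correlations from the example in this transcript; the function names are hypothetical.

```python
import math

def first_order_partial(r_ij, r_ik, r_jk):
    """Partial correlation of variables i and j given one variable k."""
    return (r_ij - r_ik * r_jk) / math.sqrt((1.0 - r_ik ** 2) * (1.0 - r_jk ** 2))

def partial_t_statistic(r_partial, n, p):
    """t = sqrt(n - p - 2) * r / sqrt(1 - r^2), with n - p - 2 degrees of freedom."""
    df = n - p - 2
    return math.sqrt(df) * r_partial / math.sqrt(1.0 - r_partial ** 2), df

# Reading (i) and Language (j) given Motivation (k); n = 65, p = 1.
r_partial = first_order_partial(0.781, 0.280, 0.510)
t_partial, df_partial = partial_t_statistic(r_partial, n=65, p=1)
print(round(r_partial, 3), round(t_partial, 2), df_partial)
```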

Test for non-zero partial correlation

$$H_0: \rho_{ij \cdot x_1,\dots,x_p} = \rho^0_{ij \cdot x_1,\dots,x_p}$$

The test statistic

$$z = \frac{\dfrac{1}{2}\ln\dfrac{1+r_{ij \cdot x_1,\dots,x_p}}{1-r_{ij \cdot x_1,\dots,x_p}} - \dfrac{1}{2}\ln\dfrac{1+\rho^0_{ij \cdot x_1,\dots,x_p}}{1-\rho^0_{ij \cdot x_1,\dots,x_p}}}{\sqrt{\dfrac{1}{n-p-3}}}$$

If $H_0$ is true the test statistic z will have approximately a standard Normal distribution.

We then reject $H_0$ if:

$$|z| \ge z_{\alpha/2}$$
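The partial-correlation version of the z statistic differs from the ordinary one only in the factor n - p - 3. A small sketch with made-up inputs (the function name `partial_fisher_z` and all numbers are hypothetical):

```python
import math

def partial_fisher_z(r_partial, rho0, n, p):
    """z statistic for H0: partial correlation = rho0, given p conditioning variables."""
    return (math.atanh(r_partial) - math.atanh(rho0)) * math.sqrt(n - p - 3)

z_partial = partial_fisher_z(0.6, 0.3, n=65, p=2)  # hypothetical values
print(round(z_partial, 3))
```

With p = 0 the expression reduces to the statistic for an ordinary correlation.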


Canonical Correlation Analysis

The problem

Quite often one has collected data on several variables. The variables are grouped into two (or more) sets of variables and the researcher is interested in whether one set of variables is independent of the other set. In addition, if it is found that the two sets of variates are dependent, it is then important to describe and understand the nature of this dependence.

The appropriate statistical procedure in this case is called Canonical Correlation Analysis.

Canonical Correlation: An Example

In the following study the researcher was interested in whether specific instructions on how to relax when taking tests and how to increase motivation would affect performance on standardized achievement tests:

• Reading,
• Language and
• Mathematics

A group of 65 third- and fourth-grade students were rated after the instruction and immediately prior to taking the Scholastic Achievement tests on:

• how relaxed they were (X1) and
• how motivated they were (X2).

In addition data was collected on the three achievement tests:

• Reading (Y1),
• Language (Y2) and
• Mathematics (Y3).

The data are tabulated below.

X1 = Relaxation, X2 = Motivation, Y1 = Reading, Y2 = Language, Y3 = Math

Case  X1  X2    Y1   Y2    Y3  |  Case  X1  X2    Y1   Y2    Y3
   1   7  14   311  436   154  |    34  40  20   362  416   107
   2  43  25   501  455   765  |    35  40  18   596  592   622
   3  32  21   507  473   702  |    36  35  17   431  346   493
   4  17  12   453  392   401  |    37  33  17   361  414   404
   5  23  12   419  337   284  |    38  40  27   663  451   651
   6  10  16   545  538   414  |    39  31  15   569  462   398
   7  22  21   509  512   491  |    40  29  19   699  622   478
   8  13  19   320  308   517  |    41  37  16   187  223   221
   9  31  21   357  296   496  |    42  21  23  1132  839  1044
  10  24  26   485  372   685  |    43  24  15   457  410   400
  11  26  21   811  748   902  |    44  19  14   413  448   520
  12  35  20   367  436   393  |    45  33  22   569  605   615
  13  24  17   242  349   137  |    46  19  19   650  685   440
  14  20   8   237  140   331  |    47  26  22   424  427   482
  15  38  27   417  648   618  |    48  20  15   475  604   742
  16  32  19   429  446   458  |    49  22  21   519  612   446
  17  14  11   555  579   438  |    50  37  22   338  463   327
  18  24  12   599  497   414  |    51  41  28   674  613   534
  19  38  25   403  383   606  |    52  29  35   381  624   565
  20  30   8   550  324   674  |    53  25  12   199  171   316
  21  22  25   377  496   242  |    54  27  21   577  523   699
  22  36  28   671  585   710  |    55  22  20   425  466   402
  23   3  22   498  488   481  |    56   4  11   392  192   354
  24  44  28   477  583   260  |    57  27  22   401  520   558
  25  24  25   609  413   670  |    58  28  23   321  410   460
  26  33  18   521  522   716  |    59  33  20   682  433   743
  27  24  21   495  645   491  |    60  33  24   719  727  1052
  28  28  20   400  555   624  |    61  31  33   672  705   650
  29  34   7   258  175   276  |    62  20  11   366  309   537
  30  39  20   466  541   348  |    63  26  25   581  558   386
  31   7  19   709  757   589  |    64  23  10   681  530   581
  32  13  17   586  472   492  |    65  30  22  1019  917   880
  33  32  18   418  361   428  |

Definition: (Canonical variates and Canonical correlations)

Let

$$\vec{x} = \begin{bmatrix} \vec{x}_1 \\ \vec{x}_2 \end{bmatrix}, \qquad \vec{x}_1 \text{ is } q \times 1, \quad \vec{x}_2 \text{ is } (p-q) \times 1,$$

have a p-variate Normal distribution with

$$\vec{\mu} = \begin{bmatrix} \vec{\mu}_1 \\ \vec{\mu}_2 \end{bmatrix} \quad \text{and} \quad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}' & \Sigma_{22} \end{bmatrix}.$$

Let

$$U_1 = \vec{a}_1'\vec{x}_1 = a_1^{(1)}x_1 + \cdots + a_q^{(1)}x_q$$

and

$$V_1 = \vec{b}_1'\vec{x}_2 = b_1^{(1)}x_{q+1} + \cdots + b_{p-q}^{(1)}x_p$$

be such that U1 and V1 have achieved the maximum correlation $\phi_1$.

Then U1 and V1 are called the first pair of canonical variates and $\phi_1$ is called the first canonical correlation coefficient.

Derivation: (1st pair of canonical variates and canonical correlation)

Now

$$\begin{bmatrix} U_1 \\ V_1 \end{bmatrix} = \begin{bmatrix} a_1^{(1)}x_1 + \cdots + a_q^{(1)}x_q \\ b_1^{(1)}x_{q+1} + \cdots + b_{p-q}^{(1)}x_p \end{bmatrix} = \begin{bmatrix} \vec{a}_1'\vec{x}_1 \\ \vec{b}_1'\vec{x}_2 \end{bmatrix} = \begin{bmatrix} \vec{a}_1' & \vec{0}' \\ \vec{0}' & \vec{b}_1' \end{bmatrix} \begin{bmatrix} \vec{x}_1 \\ \vec{x}_2 \end{bmatrix} = A\vec{x}$$

Thus $\begin{bmatrix} U_1 \\ V_1 \end{bmatrix}$ has covariance matrix

$$A \Sigma A' = \begin{bmatrix} \vec{a}_1' & \vec{0}' \\ \vec{0}' & \vec{b}_1' \end{bmatrix} \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}' & \Sigma_{22} \end{bmatrix} \begin{bmatrix} \vec{a}_1 & \vec{0} \\ \vec{0} & \vec{b}_1 \end{bmatrix} = \begin{bmatrix} \vec{a}_1'\Sigma_{11}\vec{a}_1 & \vec{a}_1'\Sigma_{12}\vec{b}_1 \\ \vec{b}_1'\Sigma_{12}'\vec{a}_1 & \vec{b}_1'\Sigma_{22}\vec{b}_1 \end{bmatrix}$$

hence

$$\rho_{U_1V_1} = \frac{\vec{a}_1'\Sigma_{12}\vec{b}_1}{\sqrt{\vec{a}_1'\Sigma_{11}\vec{a}_1\;\vec{b}_1'\Sigma_{22}\vec{b}_1}}$$

Thus we want to choose $\vec{a}_1$ and $\vec{b}_1$ so that

$$\rho_{U_1V_1} = \frac{\vec{a}_1'\Sigma_{12}\vec{b}_1}{\sqrt{\vec{a}_1'\Sigma_{11}\vec{a}_1\;\vec{b}_1'\Sigma_{22}\vec{b}_1}}$$

is at a maximum, or

$$\rho^2_{U_1V_1} = \frac{(\vec{a}_1'\Sigma_{12}\vec{b}_1)^2}{\vec{a}_1'\Sigma_{11}\vec{a}_1\;\vec{b}_1'\Sigma_{22}\vec{b}_1}$$

is at a maximum.

Let

$$V = \frac{(\vec{a}_1'\Sigma_{12}\vec{b}_1)^2}{\vec{a}_1'\Sigma_{11}\vec{a}_1\;\vec{b}_1'\Sigma_{22}\vec{b}_1}$$

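Computing derivatives

$$\frac{\partial V}{\partial \vec{a}_1} = \frac{1}{\vec{b}_1'\Sigma_{22}\vec{b}_1}\,\frac{2(\vec{a}_1'\Sigma_{12}\vec{b}_1)\,\Sigma_{12}\vec{b}_1\,(\vec{a}_1'\Sigma_{11}\vec{a}_1) - 2(\vec{a}_1'\Sigma_{12}\vec{b}_1)^2\,\Sigma_{11}\vec{a}_1}{(\vec{a}_1'\Sigma_{11}\vec{a}_1)^2} = \vec{0}$$

so that

$$\Sigma_{12}\vec{b}_1\,(\vec{a}_1'\Sigma_{11}\vec{a}_1) = (\vec{a}_1'\Sigma_{12}\vec{b}_1)\,\Sigma_{11}\vec{a}_1$$

and

$$\frac{\partial V}{\partial \vec{b}_1} = \frac{1}{\vec{a}_1'\Sigma_{11}\vec{a}_1}\,\frac{2(\vec{a}_1'\Sigma_{12}\vec{b}_1)\,\Sigma_{12}'\vec{a}_1\,(\vec{b}_1'\Sigma_{22}\vec{b}_1) - 2(\vec{a}_1'\Sigma_{12}\vec{b}_1)^2\,\Sigma_{22}\vec{b}_1}{(\vec{b}_1'\Sigma_{22}\vec{b}_1)^2} = \vec{0}$$

so that

$$\Sigma_{12}'\vec{a}_1\,(\vec{b}_1'\Sigma_{22}\vec{b}_1) = (\vec{a}_1'\Sigma_{12}\vec{b}_1)\,\Sigma_{22}\vec{b}_1$$

or

$$\vec{b}_1 = \frac{\vec{b}_1'\Sigma_{22}\vec{b}_1}{\vec{a}_1'\Sigma_{12}\vec{b}_1}\,\Sigma_{22}^{-1}\Sigma_{12}'\vec{a}_1$$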

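Thus

$$\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}'\vec{a}_1 = \frac{(\vec{a}_1'\Sigma_{12}\vec{b}_1)^2}{\vec{b}_1'\Sigma_{22}\vec{b}_1\;\vec{a}_1'\Sigma_{11}\vec{a}_1}\,\Sigma_{11}\vec{a}_1$$

$$\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}'\vec{a}_1 = \frac{(\vec{a}_1'\Sigma_{12}\vec{b}_1)^2}{\vec{b}_1'\Sigma_{22}\vec{b}_1\;\vec{a}_1'\Sigma_{11}\vec{a}_1}\,\vec{a}_1 = k\,\vec{a}_1$$

This shows that $\vec{a}_1$ is an eigenvector of $\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}'$ with eigenvalue

$$k = \frac{(\vec{a}_1'\Sigma_{12}\vec{b}_1)^2}{\vec{a}_1'\Sigma_{11}\vec{a}_1\;\vec{b}_1'\Sigma_{22}\vec{b}_1} = \rho^2_{U_1V_1}$$

Thus $\rho^2_{U_1V_1}$ is maximized when k is the largest eigenvalue of $\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}'$ and $\vec{a}_1$ is the eigenvector associated with the largest eigenvalue.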

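Also

$$\vec{b}_1 = \frac{\vec{b}_1'\Sigma_{22}\vec{b}_1}{\vec{a}_1'\Sigma_{12}\vec{b}_1}\,\Sigma_{22}^{-1}\Sigma_{12}'\vec{a}_1 \quad \text{or} \quad \Sigma_{12}'\vec{a}_1 = \frac{\vec{a}_1'\Sigma_{12}\vec{b}_1}{\vec{b}_1'\Sigma_{22}\vec{b}_1}\,\Sigma_{22}\vec{b}_1$$

and

$$\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}'\vec{a}_1 = \frac{(\vec{a}_1'\Sigma_{12}\vec{b}_1)^2}{\vec{b}_1'\Sigma_{22}\vec{b}_1\;\vec{a}_1'\Sigma_{11}\vec{a}_1}\,\vec{a}_1$$

Multiplying on the left by $\Sigma_{12}'$:

$$\Sigma_{12}'\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}'\vec{a}_1 = \frac{(\vec{a}_1'\Sigma_{12}\vec{b}_1)^2}{\vec{b}_1'\Sigma_{22}\vec{b}_1\;\vec{a}_1'\Sigma_{11}\vec{a}_1}\,\Sigma_{12}'\vec{a}_1$$

Substituting for $\Sigma_{12}'\vec{a}_1$ on both sides:

$$\Sigma_{12}'\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\,\frac{\vec{a}_1'\Sigma_{12}\vec{b}_1}{\vec{b}_1'\Sigma_{22}\vec{b}_1}\,\Sigma_{22}\vec{b}_1 = \frac{(\vec{a}_1'\Sigma_{12}\vec{b}_1)^2}{\vec{b}_1'\Sigma_{22}\vec{b}_1\;\vec{a}_1'\Sigma_{11}\vec{a}_1}\,\frac{\vec{a}_1'\Sigma_{12}\vec{b}_1}{\vec{b}_1'\Sigma_{22}\vec{b}_1}\,\Sigma_{22}\vec{b}_1$$

$$\Sigma_{22}^{-1}\Sigma_{12}'\Sigma_{11}^{-1}\Sigma_{12}\vec{b}_1 = \frac{(\vec{a}_1'\Sigma_{12}\vec{b}_1)^2}{\vec{b}_1'\Sigma_{22}\vec{b}_1\;\vec{a}_1'\Sigma_{11}\vec{a}_1}\,\vec{b}_1$$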

Summary:

The first pair of canonical variates

$$U_1 = \vec{a}_1'\vec{x}_1 = a_1^{(1)}x_1 + \cdots + a_q^{(1)}x_q$$

$$V_1 = \vec{b}_1'\vec{x}_2 = b_1^{(1)}x_{q+1} + \cdots + b_{p-q}^{(1)}x_p$$

are found by finding $\vec{a}_1$ and $\vec{b}_1$, eigenvectors of the matrices

$$\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}' \quad \text{and} \quad \Sigma_{22}^{-1}\Sigma_{12}'\Sigma_{11}^{-1}\Sigma_{12}$$

respectively, associated with the largest eigenvalue (same for both matrices).

The largest eigenvalue of the two matrices is the square of the first canonical correlation coefficient $\phi_1$:

$$\phi_1^2 = \text{the largest eigenvalue of } \Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}' = \text{the largest eigenvalue of } \Sigma_{22}^{-1}\Sigma_{12}'\Sigma_{11}^{-1}\Sigma_{12}$$
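The eigenvalue recipe in the summary can be sketched in plain Python. This is an illustration under assumptions, not the procedure used to produce the output in the slides: it works from the rounded correlation matrix of the example (correlations in place of covariances, which leaves canonical correlations unchanged), and it handles only the case where the first set has two variables, so the eigenvalues come from a quadratic. The helper names `matmul`, `transpose`, `inverse` and `eigenvalues_2x2` are hypothetical.

```python
import math

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def inverse(A):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def eigenvalues_2x2(M):
    """Eigenvalues of a 2x2 matrix from its characteristic polynomial (assumed real)."""
    trace = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    root = math.sqrt(trace * trace - 4.0 * det)
    return (trace + root) / 2.0, (trace - root) / 2.0

# Rounded correlations from the example: set 1 = (Relax, Mot), set 2 = (Read, Lang, Math).
R11 = [[1.000, 0.391], [0.391, 1.000]]
R12 = [[0.002, 0.050, 0.127], [0.280, 0.510, 0.340]]
R22 = [[1.000, 0.781, 0.713], [0.781, 1.000, 0.556], [0.713, 0.556, 1.000]]

# M = R11^{-1} R12 R22^{-1} R12'  (a 2x2 matrix)
M = matmul(matmul(inverse(R11), R12), matmul(inverse(R22), transpose(R12)))
lam1, lam2 = eigenvalues_2x2(M)
phi1, phi2 = math.sqrt(lam1), math.sqrt(lam2)
print(round(phi1, 2), round(phi2, 2))
```

Because the inputs are correlations rounded to three decimals, the results should agree with the canonical correlations reported in the example output only to about two decimal places.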

The remaining canonical variates and canonical correlation coefficients

The second pair of canonical variates

$$U_2 = \vec{a}_2'\vec{x}_1 = a_1^{(2)}x_1 + \cdots + a_q^{(2)}x_q$$

$$V_2 = \vec{b}_2'\vec{x}_2 = b_1^{(2)}x_{q+1} + \cdots + b_{p-q}^{(2)}x_p$$

are found by finding $\vec{a}_2$ and $\vec{b}_2$ so that

1. (U2,V2) are independent of (U1,V1).

2. The correlation between U2 and V2 is maximized.

The correlation, $\phi_2$, between U2 and V2 is called the second canonical correlation coefficient.

The ith pair of canonical variates

$$U_i = \vec{a}_i'\vec{x}_1 = a_1^{(i)}x_1 + \cdots + a_q^{(i)}x_q$$

$$V_i = \vec{b}_i'\vec{x}_2 = b_1^{(i)}x_{q+1} + \cdots + b_{p-q}^{(i)}x_p$$

are found by finding $\vec{a}_i$ and $\vec{b}_i$ so that

1. (Ui,Vi) are independent of (U1,V1), …, (Ui-1,Vi-1).

2. The correlation between Ui and Vi is maximized.

The correlation, $\phi_i$, between Ui and Vi is called the ith canonical correlation coefficient.

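Derivation: (2nd pair of canonical variates and canonical correlation)

Now

$$\begin{bmatrix} U_1 \\ V_1 \\ U_2 \\ V_2 \end{bmatrix} = \begin{bmatrix} \vec{a}_1'\vec{x}_1 \\ \vec{b}_1'\vec{x}_2 \\ \vec{a}_2'\vec{x}_1 \\ \vec{b}_2'\vec{x}_2 \end{bmatrix} = \begin{bmatrix} \vec{a}_1' & \vec{0}' \\ \vec{0}' & \vec{b}_1' \\ \vec{a}_2' & \vec{0}' \\ \vec{0}' & \vec{b}_2' \end{bmatrix} \begin{bmatrix} \vec{x}_1 \\ \vec{x}_2 \end{bmatrix} = A\vec{x}$$

has covariance matrix

$$A \Sigma A' = \begin{bmatrix} \vec{a}_1' & \vec{0}' \\ \vec{0}' & \vec{b}_1' \\ \vec{a}_2' & \vec{0}' \\ \vec{0}' & \vec{b}_2' \end{bmatrix} \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}' & \Sigma_{22} \end{bmatrix} \begin{bmatrix} \vec{a}_1 & \vec{0} & \vec{a}_2 & \vec{0} \\ \vec{0} & \vec{b}_1 & \vec{0} & \vec{b}_2 \end{bmatrix}$$

$$= \begin{bmatrix} \vec{a}_1'\Sigma_{11}\vec{a}_1 & \vec{a}_1'\Sigma_{12}\vec{b}_1 & \vec{a}_1'\Sigma_{11}\vec{a}_2 & \vec{a}_1'\Sigma_{12}\vec{b}_2 \\ * & \vec{b}_1'\Sigma_{22}\vec{b}_1 & \vec{b}_1'\Sigma_{12}'\vec{a}_2 & \vec{b}_1'\Sigma_{22}\vec{b}_2 \\ * & * & \vec{a}_2'\Sigma_{11}\vec{a}_2 & \vec{a}_2'\Sigma_{12}\vec{b}_2 \\ * & * & * & \vec{b}_2'\Sigma_{22}\vec{b}_2 \end{bmatrix}$$

(entries marked * follow by symmetry).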

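Now

$$\rho_{U_2V_2} = \frac{\vec{a}_2'\Sigma_{12}\vec{b}_2}{\sqrt{\vec{a}_2'\Sigma_{11}\vec{a}_2\;\vec{b}_2'\Sigma_{22}\vec{b}_2}}$$

and maximizing

$$\rho^2_{U_2V_2} = \frac{(\vec{a}_2'\Sigma_{12}\vec{b}_2)^2}{\vec{a}_2'\Sigma_{11}\vec{a}_2\;\vec{b}_2'\Sigma_{22}\vec{b}_2}$$

is equivalent to maximizing $(\vec{a}_2'\Sigma_{12}\vec{b}_2)^2$ subject to

$$\vec{a}_2'\Sigma_{11}\vec{a}_2 = 1,\quad \vec{b}_2'\Sigma_{22}\vec{b}_2 = 1,\quad \vec{a}_1'\Sigma_{11}\vec{a}_2 = 0,\quad \vec{a}_1'\Sigma_{12}\vec{b}_2 = 0,\quad \vec{b}_1'\Sigma_{12}'\vec{a}_2 = 0 \quad \text{and} \quad \vec{b}_1'\Sigma_{22}\vec{b}_2 = 0$$

Using the Lagrange multiplier technique

$$V = (\vec{a}_2'\Sigma_{12}\vec{b}_2)^2 + \lambda_1(1 - \vec{a}_2'\Sigma_{11}\vec{a}_2) + \lambda_2(1 - \vec{b}_2'\Sigma_{22}\vec{b}_2) + \lambda_3\,\vec{a}_1'\Sigma_{11}\vec{a}_2 + \lambda_4\,\vec{a}_1'\Sigma_{12}\vec{b}_2 + \lambda_5\,\vec{b}_1'\Sigma_{12}'\vec{a}_2 + \lambda_6\,\vec{b}_1'\Sigma_{22}\vec{b}_2$$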

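Now, with

$$V = (\vec{a}_2'\Sigma_{12}\vec{b}_2)^2 + \lambda_1(1 - \vec{a}_2'\Sigma_{11}\vec{a}_2) + \lambda_2(1 - \vec{b}_2'\Sigma_{22}\vec{b}_2) + \lambda_3\,\vec{a}_1'\Sigma_{11}\vec{a}_2 + \lambda_4\,\vec{a}_1'\Sigma_{12}\vec{b}_2 + \lambda_5\,\vec{b}_1'\Sigma_{12}'\vec{a}_2 + \lambda_6\,\vec{b}_1'\Sigma_{22}\vec{b}_2$$

we have

$$\frac{\partial V}{\partial \vec{a}_2} = 2(\vec{a}_2'\Sigma_{12}\vec{b}_2)\,\Sigma_{12}\vec{b}_2 - 2\lambda_1\Sigma_{11}\vec{a}_2 + \lambda_3\Sigma_{11}\vec{a}_1 + \lambda_5\Sigma_{12}\vec{b}_1 = \vec{0}$$

and

$$\frac{\partial V}{\partial \vec{b}_2} = 2(\vec{a}_2'\Sigma_{12}\vec{b}_2)\,\Sigma_{12}'\vec{a}_2 - 2\lambda_2\Sigma_{22}\vec{b}_2 + \lambda_4\Sigma_{12}'\vec{a}_1 + \lambda_6\Sigma_{22}\vec{b}_1 = \vec{0}$$

$$\frac{\partial V}{\partial \lambda_i} = 0, \quad i = 1, \dots, 6$$

also gives the restrictions.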

These equations can be used to show that $\vec{a}_2$ and $\vec{b}_2$ are eigenvectors of the matrices

$$\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}' \quad \text{and} \quad \Sigma_{22}^{-1}\Sigma_{12}'\Sigma_{11}^{-1}\Sigma_{12}$$

respectively, associated with the 2nd largest eigenvalue (same for both matrices).

The 2nd largest eigenvalue of the two matrices is the square of the 2nd canonical correlation coefficient $\phi_2$:

$$\phi_2^2 = \text{the 2nd largest eigenvalue of } \Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}' = \text{the 2nd largest eigenvalue of } \Sigma_{22}^{-1}\Sigma_{12}'\Sigma_{11}^{-1}\Sigma_{12}$$

Continuing:

Coefficients for the ith pair of canonical variates, $\vec{a}_i$ and $\vec{b}_i$, are eigenvectors of the matrices

$$\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}' \quad \text{and} \quad \Sigma_{22}^{-1}\Sigma_{12}'\Sigma_{11}^{-1}\Sigma_{12}$$

respectively, associated with the ith largest eigenvalue (same for both matrices).

The ith largest eigenvalue of the two matrices is the square of the ith canonical correlation coefficient $\phi_i$:

$$\phi_i^2 = \text{the } i\text{th largest eigenvalue of } \Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}' = \text{the } i\text{th largest eigenvalue of } \Sigma_{22}^{-1}\Sigma_{12}'\Sigma_{11}^{-1}\Sigma_{12}$$

Example

Variables

• Relaxation score (X1)
• Motivation score (X2)
• Reading (Y1),
• Language (Y2) and
• Mathematics (Y3).

Summary Statistics

UNIVARIATE SUMMARY STATISTICS
-----------------------------
                          STANDARD
  VARIABLE       MEAN     DEVIATION
  1 Relax      26.87692     9.50412
  2 Mot        19.41538     5.83066
  3 Read      499.03077   172.25508
  4 Lang      485.83077   156.08957
  5 Math      512.52308   195.18614

CORRELATIONS
------------
             Relax     Mot    Read    Lang    Math
                 1       2       3       4       5
  Relax  1   1.000
  Mot    2   0.391   1.000
  Read   3   0.002   0.280   1.000
  Lang   4   0.050   0.510   0.781   1.000
  Math   5   0.127   0.340   0.713   0.556   1.000

Canonical Correlation Statistics

                CANONICAL     NUMBER OF     BARTLETT'S TEST FOR
 EIGENVALUE    CORRELATION   EIGENVALUES   REMAINING EIGENVALUES

                                            CHI-             TAIL
                                           SQUARE    D.F.    PROB.

                                            27.86      6    0.0001
   0.35029       0.59186          1          1.56      2    0.4586
   0.02523       0.15885

BARTLETT'S TEST ABOVE INDICATES THE NUMBER OF CANONICAL VARIABLES
NECESSARY TO EXPRESS THE DEPENDENCY BETWEEN THE TWO SETS OF VARIABLES.
THE NECESSARY NUMBER OF CANONICAL VARIABLES IS THE SMALLEST NUMBER OF
EIGENVALUES SUCH THAT THE TEST OF THE REMAINING EIGENVALUES IS
NON-SIGNIFICANT. FOR EXAMPLE, IF A TEST AT THE .01 LEVEL WERE DESIRED,
THEN 1 VARIABLES WOULD BE CONSIDERED NECESSARY. HOWEVER, THE NUMBER OF
CANONICAL VARIABLES OF PRACTICAL VALUE IS LIKELY TO BE SMALLER.
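The chi-square values in the output can be checked with Bartlett's approximation. The slides do not state the formula, so the version used here is an assumption: with n observations and sets of sizes m1 and m2, the statistic for the eigenvalues remaining after the first k is -[n - 1 - (m1 + m2 + 1)/2] Σ ln(1 - λ_i), with (m1 - k)(m2 - k) degrees of freedom. The function names are hypothetical, and the tail probability helper is valid only for even degrees of freedom, which covers both rows of this output.

```python
import math

def bartlett_chi_square(eigenvalues, n, size_1, size_2, skip=0):
    """Bartlett's statistic for the eigenvalues remaining after the first `skip`."""
    multiplier = n - 1 - (size_1 + size_2 + 1) / 2.0
    chi2 = -multiplier * sum(math.log(1.0 - lam) for lam in eigenvalues[skip:])
    df = (size_1 - skip) * (size_2 - skip)
    return chi2, df

def chi2_tail_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid only for even df."""
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))

eigs = [0.35029, 0.02523]                       # eigenvalues from the output
chi2_all, df_all = bartlett_chi_square(eigs, n=65, size_1=2, size_2=3)
chi2_rem, df_rem = bartlett_chi_square(eigs, n=65, size_1=2, size_2=3, skip=1)
print(round(chi2_all, 2), df_all, chi2_tail_even_df(chi2_all, df_all))
print(round(chi2_rem, 2), df_rem, chi2_tail_even_df(chi2_rem, df_rem))
```

Under these assumptions the two statistics come out close to the 27.86 (6 d.f.) and 1.56 (2 d.f.) shown in the output.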

continued

CANONICAL VARIABLE LOADINGS
---------------------------
(CORRELATIONS OF CANONICAL VARIABLES WITH ORIGINAL VARIABLES)
FOR FIRST SET OF VARIABLES

              CNVRF1   CNVRF2
                   1        2
  Relax  1     0.197    0.980
  Mot    2     0.979    0.203

CANONICAL VARIABLE LOADINGS
---------------------------
(CORRELATIONS OF CANONICAL VARIABLES WITH ORIGINAL VARIABLES)
FOR SECOND SET OF VARIABLES

              CNVRS1   CNVRS2
                   1        2
  Read   3     0.504   -0.361
  Lang   4     0.900   -0.354
  Math   5     0.565    0.391


Summary

U1 = 0.197 Relax + 0.979 Mot

V1 = 0.504 Read + 0.900 Lang + 0.565 Math

$\phi_1$ = .592

U2 = 0.980 Relax + 0.203 Mot

V2 = 0.391 Math - 0.361 Read - 0.354 Lang

$\phi_2$ = .159