
Semester 1, 2012 (Last adjustments: September 5, 2012)

Lecture Notes

STAT3914 – Applied Statistics

Lecturer

Dr. John T. Ormerod

School of Mathematics & Statistics F07

University of Sydney

(w) 02 9351 5883

(e) john.ormerod (at) sydney.edu.au


STAT3914 – Outline

• Multivariate Normal Distribution

• Point Estimates and Confidence Intervals for the MVN

• Wishart Distribution

• Derivation of Hotelling's $T^2$

• The Expectation-Maximization Algorithm

• Missing Data Analysis


The standard univariate normal distribution

• A random variable (RV) $Z$ has a standard normal distribution $N(0,1)$ if $Z$ has density
  $\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}z^2}, \quad -\infty < z < \infty.$

• $Z$ has moment generating function
  $E(e^{tZ}) = \int_{-\infty}^{\infty} e^{tz}\,(2\pi)^{-\frac{1}{2}} e^{-\frac{1}{2}z^2}\,dz
             = \int_{-\infty}^{\infty} (2\pi)^{-\frac{1}{2}} e^{-\frac{1}{2}(z-t)^2}\cdot e^{\frac{1}{2}t^2}\,dz
             = e^{\frac{1}{2}t^2}.$

• It follows that $E[Z] = 0$ and $\mathrm{Var}[Z] = 1$.


The standard multivariate normal distribution

• Let $U = (U_1, U_2, \ldots, U_p)^T$ be a $p$-vector of NID$(0,1)$ RVs.

• The joint density of $U$ is given by
  $\prod_{i=1}^{p}\frac{1}{\sqrt{2\pi}}e^{-u_i^2/2} = (2\pi)^{-p/2}\exp\left(-\tfrac{1}{2}u^Tu\right).$

• Clearly, $E[U] = 0$, where $0$ is the zero vector in $\mathbb{R}^p$.

• Similarly, $\mathrm{Cov}[U] = E[UU^T] = I$, where $I = I_p$ is the $p \times p$ identity matrix.

• We say $U$ has a standard multivariate normal distribution, which we denote $U \sim N_p(0, I)$.
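A quick numerical sanity check (my addition, not part of the original notes; only numpy is assumed): simulate NID$(0,1)$ components and verify that the sample mean and covariance are close to $0$ and $I_p$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 3, 100_000

# n draws of U ~ N_p(0, I): each row is one p-vector of independent N(0,1) RVs
U = rng.standard_normal(size=(n, p))

print(U.mean(axis=0))             # close to the zero vector
print(np.cov(U, rowvar=False))    # close to the p x p identity matrix
```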


Non-standard univariate normal distribution

• Given a RV $Z \sim N(0,1)$ we can generate a non-standard normal with mean $\mu \in \mathbb{R}$ and standard deviation $\sigma > 0$ by defining $X = \mu + \sigma Z$.

• The density of $X$ can be readily recovered by the formula for the density of transformed RVs.

• Let $\psi(z) = \mu + \sigma z$; then $X = \psi(Z)$ and $\psi^{-1}(x) = (x - \mu)/\sigma$, so
  $f_X(x) = f_Z\left(\psi^{-1}(x)\right)\left|\frac{d}{dx}\psi^{-1}(x)\right|
          = f_Z\!\left(\frac{x-\mu}{\sigma}\right)\left|\frac{1}{\sigma}\right|
          = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$


Non-standard univariate normal distribution

• The MGF of $X$ is given by
  $E[e^{tX}] = e^{t\mu}E[e^{t\sigma Z}] = \exp\left(t\mu + \tfrac{1}{2}\sigma^2t^2\right).$

• To summarize: if $Z \sim N(0,1)$ then $X = \mu + \sigma Z \sim N(\mu, \sigma^2)$. An analogous transformation creates the general multivariate normal random vector.


Non-standard multivariate normal distribution

• Let $U \sim N_p(0, I)$ and define
  $X = \psi(U) \equiv \mu + AU$
  where $\mu \in \mathbb{R}^p$ and $A$ is a $p \times p$ non-singular matrix.

• Clearly, $E[X] = \mu$, and its covariance is given by
  $\mathrm{Cov}[X] = E\left[(X - \mu)(X - \mu)^T\right] = E\left[AUU^TA^T\right] = A\,E\left[UU^T\right]A^T = AA^T \equiv \Sigma.$
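A small simulation sketch of this construction (mine, not from the notes; numpy assumed): take $A$ to be the Cholesky factor of a chosen $\Sigma$, form $X = \mu + AU$, and check the sample mean and covariance.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 0.5]])

A = np.linalg.cholesky(Sigma)            # one "square root": A A^T = Sigma
U = rng.standard_normal(size=(50_000, 3))
X = mu + U @ A.T                         # each row is mu + A u

print(X.mean(axis=0))                    # close to mu
print(np.cov(X, rowvar=False))           # close to Sigma
```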


Non-standard multivariate normal distribution

Suppose $AA^T = \Sigma$; then:

• $A$ is a kind of "square root" of $\Sigma$ (if $A$ is chosen symmetric then $A = \Sigma^{1/2}$), and
  $|\Sigma| = |AA^T| = |A|\cdot|A^T| = |A|^2 > 0.$

• Claim: If $U \sim N_p(0, I)$ and $X = \psi(U) \equiv \mu + AU$ then $X$ has a non-singular multivariate normal distribution, which we will denote
  $X \sim N_p(\mu, \Sigma),$
  and $X$ has density
  $f_X(x) = (2\pi)^{-p/2}|\Sigma|^{-\frac{1}{2}}\exp\left(-\tfrac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)\right).$

• Proof: Similarly to the univariate case, with $\psi^{-1}(x) = A^{-1}(x - \mu)$,
  $f_X(x) = f_U\!\left(\psi^{-1}(x)\right)\left|J_{\psi^{-1}}\right|
          = (2\pi)^{-p/2}|J_\psi|^{-1}\exp\left(-\tfrac{1}{2}\left[A^{-1}(x - \mu)\right]^T\left[A^{-1}(x - \mu)\right]\right).$


• The Jacobian of a linear transformation is the determinant of its matrix, so
  $|J_\psi| = \begin{vmatrix} \frac{\partial\psi_1}{\partial u_1} & \cdots & \frac{\partial\psi_1}{\partial u_p} \\ \vdots & \ddots & \vdots \\ \frac{\partial\psi_p}{\partial u_1} & \cdots & \frac{\partial\psi_p}{\partial u_p} \end{vmatrix} = |A| = |\Sigma|^{\frac{1}{2}}.$

• The exponent simplifies as follows:
  $\left[A^{-1}(x - \mu)\right]^T\left[A^{-1}(x - \mu)\right] = (x - \mu)^T(A^{-1})^T(A^{-1})(x - \mu) = (x - \mu)^T\Sigma^{-1}(x - \mu),$
  since $(A^{-1})^T(A^{-1}) = (AA^T)^{-1} = \Sigma^{-1}$.

• Thus, $X$ has density
  $f_X(x) = (2\pi)^{-p/2}|\Sigma|^{-\frac{1}{2}}\exp\left(-\tfrac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)\right).$
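To check this density formula numerically (my addition; scipy's multivariate_normal is assumed available), one can compare a direct evaluation against scipy.stats:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(x, mu, Sigma):
    """Evaluate the N_p(mu, Sigma) density at x directly from the formula."""
    p = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x-mu)^T Sigma^{-1} (x-mu)
    return (2 * np.pi) ** (-p / 2) * np.linalg.det(Sigma) ** (-0.5) * np.exp(-0.5 * quad)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
x = np.array([0.5, -1.0])

print(mvn_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # the two values should agree
```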


The MGF of the multivariate normal distribution

• Claim: If $X \sim N_p(\mu, \Sigma)$ then the moment generating function (MGF) of $X$ is given by
  $M_X(s) = \exp\left(s^T\mu + \tfrac{1}{2}s^T\Sigma s\right).$

• Proof: For $s \in \mathbb{R}^p$ we have
  $M_X(s) = E[e^{s^TX}] = E[\exp(s^T\mu + s^TAU)]$
  $= \exp(s^T\mu)\cdot E[e^{a^TU}]$, where $a^T = s^TA$,
  $= \exp(s^T\mu)\cdot\exp\left(\tfrac{1}{2}a^Ta\right)$, since $a^TU \sim N(0, a^Ta)$,
  $= \exp\left(s^T\mu + \tfrac{1}{2}s^T\Sigma s\right).$

• This expression for the MGF is well defined even when $\Sigma$ is singular. While it would be awkward to define the multivariate normal distribution using an MGF, it does suggest the following definition...


The (general) multivariate normal distribution

• Definition. A $p$-dimensional random vector $X$ has a multivariate normal distribution if for every $a \in \mathbb{R}^p$ the linear combination $a^TX$ is univariate normal.

• Claim. $X$ is a $p$-dimensional multivariate normal random vector if and only if its MGF can be expressed as
  $M_X(s) = \exp\left(s^T\mu + \tfrac{1}{2}s^T\Sigma s\right).$

• Aside: if $M_X$ is as above then by differentiating the MGF it follows that $E[X] = \mu$ and that $\mathrm{Cov}(X) = \Sigma$.


The (general) multivariate normal distribution

• Proof: Suppose the above expression for the MGF holds. Then the MGF of $Y = a^TX$ is given by
  $M_Y(t) = E[e^{ta^TX}] = M_X(ta) = \exp\left(ta^T\mu + \tfrac{1}{2}t^2a^T\Sigma a\right),$
  so that
  $a^TX \sim N(a^T\mu,\ a^T\Sigma a)$
  by the uniqueness theorem for MGFs.


Conversely, if $Y_s = s^TX$ is univariate normal for all $s \in \mathbb{R}^p$ then
  $E[e^{Y_s}] = \exp\left(E[Y_s] + \tfrac{1}{2}\mathrm{Var}[Y_s]\right).$

However, $E[Y_s] = s^TE[X] = s^T\mu$ and
  $\mathrm{Var}[Y_s] = E\left[(s^TX - s^T\mu)(s^TX - s^T\mu)^T\right] = s^T\left[E(X - \mu)(X - \mu)^T\right]s = s^T\mathrm{Cov}(X)\,s.$

Therefore, with $\Sigma = \mathrm{Cov}(X)$,
  $M_X(s) = E[e^{Y_s}] = \exp\left(s^T\mu + \tfrac{1}{2}s^T\Sigma s\right).$


Marginal Distributions

• If $X \sim N_p(\mu, \Sigma)$ then for all $a \in \mathbb{R}^p$,
  $a^TX \sim N(a^T\mu,\ a^T\Sigma a).$

• So by choosing $e_i = (0, \ldots, 0, 1, 0, \ldots, 0)^T$ with 1 in the $i$th position we have
  $X_i \sim N(\mu_i, \Sigma_{ii}).$

• Thus all marginal distributions are normal.

• The converse is not true (see exercises).

• Moreover, if $X_1 = (X_1, \ldots, X_r)^T$ consists of the first $r$ coordinates of $X$, then $X_1$ is an $r$-dimensional normal RV (why?):
  $X_1 \sim N_r(\mu_1, \Sigma_{11}),$
  where $[\mu_1]_i = [\mu]_i$ and $[\Sigma_{11}]_{i,j} = [\Sigma]_{i,j}$ for $i, j = 1, \ldots, r$.
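A brief numerical check of the marginal result (my sketch, numpy assumed): the first $r$ coordinates of draws from $N_p(\mu, \Sigma)$ should have sample mean and covariance close to $\mu_1$ and $\Sigma_{11}$.

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([0.0, 1.0, -1.0, 2.0])
Sigma = np.array([[1.0, 0.5, 0.2, 0.1],
                  [0.5, 2.0, 0.3, 0.0],
                  [0.2, 0.3, 1.5, 0.4],
                  [0.1, 0.0, 0.4, 1.0]])
r = 2

X = rng.multivariate_normal(mu, Sigma, size=100_000)
X1 = X[:, :r]                       # first r coordinates of each draw

print(X1.mean(axis=0))              # close to mu[:r]
print(np.cov(X1, rowvar=False))     # close to Sigma[:r, :r]
```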


• Claim. If $\Sigma$ is diagonal then the $X_i$ are independent.

• Proof. (Exercise: when $\Sigma$ is diagonal the joint density factorizes into the product of the marginal densities.)


Block/partitioned matrices

• Suppose $X$ is a $p$-dimensional random vector.

• Let $X_1 = (X_1, \ldots, X_r)^T$ be the first $r$ coordinates of $X$ and let $X_2 = (X_{r+1}, \ldots, X_p)^T$ be its last $q = p - r$ coordinates, so that
  $X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}.$

• Let $\mu_i = E[X_i]$ $(i = 1, 2)$; then
  $\mu \equiv E[X] = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}.$

• Similarly, with
  $\Sigma_{ij} \equiv \mathrm{Cov}(X_i, X_j) \equiv E[X_iX_j^T] - E[X_i]E[X_j]^T$


  (so that $\Sigma_{11}$ is $r \times r$ and $\Sigma_{12}$ is $r \times q$, etc.), we have
  $\mathrm{Var}(X) \equiv \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$


Degenerate multivariate normal distribution

• Suppose $X$ is multivariate normal with mean $\mu$ and a singular covariance matrix $\Sigma$ with rank $r < p$.

• As $\Sigma$ is symmetric it has $(p - r)$ eigenvalues equal to 0 and there exists an orthogonal matrix $P$ such that
  $P^T\Sigma P = \begin{pmatrix} D & 0 \\ 0 & 0 \end{pmatrix} \equiv \bar{D},$
  where
  $D = \mathrm{diag}(\lambda_1, \ldots, \lambda_r), \quad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0,$
  and the columns of $P$ are the eigenvectors of $\Sigma$.


• Let $Y = P^TX$ and write $Y_1 = (Y_1, Y_2, \ldots, Y_r)^T$ and $Y_2 = (Y_{r+1}, \ldots, Y_p)^T$, and similarly $s = (s_1, s_2)$ and $\gamma = P^T\mu = (\gamma_1, \gamma_2)$.

• Then $Y$ has MGF
  $E[e^{s_1^TY_1 + s_2^TY_2}] = E[e^{s^TY}] = E[e^{s^TP^TX}]
  = \exp\left(s^TP^T\mu + \tfrac{1}{2}s^TP^T\Sigma Ps\right)
  = \exp\left(s^T\gamma + \tfrac{1}{2}s^T\bar{D}s\right)
  = \exp\left(s_1^T\gamma_1 + \tfrac{1}{2}s_1^TDs_1\right)\cdot\exp\left(s_2^T\gamma_2\right).$

• The RHS is a product of two MGFs: one of a distribution that is the constant $\gamma_2$, and the other of $N_r(\gamma_1, D)$.

• By the uniqueness of the MGF, $Y_2 \equiv \gamma_2$ and $Y_1 \sim N_r(\gamma_1, D)$.

• Note that $Y_1, \ldots, Y_r$ are independent. Does this ring a bell? Principal Component Analysis...
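A short simulation sketch of this decomposition (my addition, numpy assumed): build a rank-deficient $\Sigma$, draw from $N(\mu, \Sigma)$, rotate by the eigenvectors, and observe that the trailing rotated coordinates are numerically constant.

```python
import numpy as np

rng = np.random.default_rng(3)
p, r = 3, 2

# A rank-r covariance matrix: Sigma = B B^T with B of size p x r
B = rng.standard_normal((p, r))
Sigma = B @ B.T
mu = np.array([1.0, 0.0, -1.0])

# Draw X = mu + B Z with Z ~ N_r(0, I), so Cov(X) = Sigma (singular)
Z = rng.standard_normal((10_000, r))
X = mu + Z @ B.T

# Eigendecomposition: columns of P are eigenvectors, eigenvalues sorted descending
lam, P = np.linalg.eigh(Sigma)
lam, P = lam[::-1], P[:, ::-1]

Y = X @ P                      # Y = P^T X, applied row-wise
print(np.var(Y, axis=0))       # last p - r variances are ~0: those Y_i are constants
```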


• It also follows that if $X \sim N(\mu, \Sigma)$ with $\Sigma$ of rank $r$ then there exists an invertible linear transformation $G$ such that $X = GW + \mu$, where
  $W_{r+1} = \cdots = W_p = 0 \quad\text{and}\quad (W_1, \ldots, W_r)^T \sim N_r(0, I);$
  see exercises.


Block independence

• Theorem 1. $X_1$ and $X_2$ are independent if and only if $\Sigma_{12} = 0$.

• Proof. If $X_1$ and $X_2$ are independent then $X_1$ and $X_2$ are element-wise uncorrelated and so $\Sigma_{12} = 0$.

  Conversely, if $\Sigma_{12} = 0$ then $\Sigma_{21} = 0$, and if $s = (s_1, s_2)^T$ (with $s_1 \in \mathbb{R}^r$) we have
  $s^T\Sigma s = s_1^T\Sigma_{11}s_1 + s_2^T\Sigma_{22}s_2.$

  Therefore, $X$ has MGF
  $M_X(s) = \exp\left(s^T\mu + \tfrac{1}{2}s^T\Sigma s\right)
          = \exp\left(s_1^T\mu_1 + \tfrac{1}{2}s_1^T\Sigma_{11}s_1\right)\cdot\exp\left(s_2^T\mu_2 + \tfrac{1}{2}s_2^T\Sigma_{22}s_2\right),$

  and so (why?) $X_1$ and $X_2$ are independent with
  $X_1 \sim N_r(\mu_1, \Sigma_{11}) \quad\text{and}\quad X_2 \sim N_s(\mu_2, \Sigma_{22}).$


Detour into the linear algebra of block matrices

• Let
  $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$
  be an invertible $p \times p$ matrix (where $A_{11}$ is $r \times r$ and $A_{22}$ is $s \times s$ with $r + s = p$).

• It is common to denote the inverse of $A$ by
  $A^{-1} = \begin{pmatrix} A^{11} & A^{12} \\ A^{21} & A^{22} \end{pmatrix}$
  (where $A^{11}$ is $r \times r$ and $A^{22}$ is $s \times s$).

• By the definition of a matrix inverse we have the relations
  $\begin{pmatrix} A^{11} & A^{12} \\ A^{21} & A^{22} \end{pmatrix}\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} I_r & 0 \\ 0 & I_s \end{pmatrix}.$


• In particular,
  $A^{21}A_{11} + A^{22}A_{21} = 0$
  $A^{21}A_{12} + A^{22}A_{22} = I_s.$

• Multiply the first equation from the right by $A_{11}^{-1}A_{12}$ and subtract from the second equation to find that
  $A^{22}(A_{22} - A_{21}A_{11}^{-1}A_{12}) = I_s \implies A^{22} = (A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1}.$

• The same kind of algebra yields (exercise)
  $A^{12} = -A_{11}^{-1}A_{12}A^{22}$
  $A^{21} = -A^{22}A_{21}A_{11}^{-1}$
  $-A^{12}A_{21} = A^{11}A_{11} - I_r$
  $A^{22} = (A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1}$
  $A^{11} = (A_{11} - A_{12}A_{22}^{-1}A_{21})^{-1}.$
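These identities are easy to check numerically; the sketch below (my addition, numpy assumed) compares the Schur-complement formula for $A^{22}$ with the corresponding block of numpy's inverse.

```python
import numpy as np

rng = np.random.default_rng(4)
r, s = 2, 3
p = r + s

A = rng.standard_normal((p, p)) + p * np.eye(p)   # well-conditioned, invertible
A11, A12 = A[:r, :r], A[:r, r:]
A21, A22 = A[r:, :r], A[r:, r:]

Ainv = np.linalg.inv(A)
A22_sup = Ainv[r:, r:]                            # the block A^{22} of A^{-1}

schur = np.linalg.inv(A22 - A21 @ np.linalg.inv(A11) @ A12)
print(np.allclose(A22_sup, schur))                # True: A^{22} = (A22 - A21 A11^{-1} A12)^{-1}
```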


Linear Predictor (Estimator)

• $X = (X_1, X_2, X_3^T)^T$ is a RV.

• A linear predictor (estimator) of $X_1$ given $X_3$ is of the form $b^TX_3$, where $b \in \mathbb{R}^{p-2}$ (ignore $X_2$ for now).

• We seek the best such linear predictor:
  $b_0 \equiv \mathrm{argmin}_b\, E\left[(X_1 - b^TX_3)^2\right].$

• Assume $\mu = 0$ (the results hold for any $\mu$) and let
  $f(b) \equiv E[(X_1 - b^TX_3)^2]
        = E[(X_1 - b^TX_3)(X_1 - b^TX_3)^T]
        = \Sigma_{11} - b^T\Sigma_{31} - \Sigma_{13}b + b^T\Sigma_{33}b
        = \Sigma_{11} - 2b^T\Sigma_{31} + b^T\Sigma_{33}b.$

• Use calculus or algebra to minimize $f$.


• We want to minimize (with respect to $b$)
  $f(b) = \Sigma_{11} - 2b^T\Sigma_{31} + b^T\Sigma_{33}b.$

• Clearly, $\nabla_b(-2b^T\Sigma_{31}) = -2\Sigma_{31} \in \mathbb{R}^{p-2}$.

• Since $b^TAb = \sum_{i,j}a_{ij}b_ib_j$, it follows that for a symmetric $A$,
  $\nabla_b(b^TAb) = 2Ab.$

• Therefore,
  $\nabla f(b) = -2\Sigma_{31} + 2\Sigma_{33}b.$

• Thus, the unique stationary point is attained at
  $b_0 = \Sigma_{33}^{-1}\Sigma_{31},$
  which is a minimum (as $\Sigma_{33}$ is positive definite).
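A small illustration (mine, not from the notes; numpy assumed) of computing $b_0 = \Sigma_{33}^{-1}\Sigma_{31}$ from a covariance matrix and checking that it matches the least-squares regression coefficients on simulated data.

```python
import numpy as np

rng = np.random.default_rng(5)

# Covariance of (X1, X2, X3a, X3b); index 0 is X1, indices 2:4 form X3
Sigma = np.array([[2.0, 0.3, 0.8, 0.4],
                  [0.3, 1.0, 0.2, 0.1],
                  [0.8, 0.2, 1.5, 0.5],
                  [0.4, 0.1, 0.5, 1.2]])

Sigma33 = Sigma[2:, 2:]
Sigma31 = Sigma[2:, 0]
b0 = np.linalg.solve(Sigma33, Sigma31)        # b0 = Sigma33^{-1} Sigma31

# Cross-check against least squares on a large simulated sample (mu = 0)
X = rng.multivariate_normal(np.zeros(4), Sigma, size=200_000)
b_ls, *_ = np.linalg.lstsq(X[:, 2:], X[:, 0], rcond=None)

print(b0, b_ls)                               # should be close
```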


• Algebra: complete the quadratic form
  $f(b) = \Sigma_{11} - b^T\Sigma_{31} - \Sigma_{13}b + b^T\Sigma_{33}b
        = \Sigma_{11} - \Sigma_{13}\Sigma_{33}^{-1}\Sigma_{31} + (b - \Sigma_{33}^{-1}\Sigma_{31})^T\Sigma_{33}(b - \Sigma_{33}^{-1}\Sigma_{31})
        \ge \Sigma_{11} - \Sigma_{13}\Sigma_{33}^{-1}\Sigma_{31},$
  with equality if and only if $b = \Sigma_{33}^{-1}\Sigma_{31}$.

• Thus, the best linear estimator (predictor) of $X_1$ given $X_3$ is
  $P_{X_3}(X_1) = \Sigma_{13}\Sigma_{33}^{-1}X_3.$
  Note that this is a normal RV. Similarly, $P_{X_3}(X_2) = \Sigma_{23}\Sigma_{33}^{-1}X_3$.

• This is also known as the projection of $X_1$ (respectively $X_2$) on the subspace spanned by $X_3$.


Conditional Distributions

• Let
  $X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N_p(\mu, \Sigma),$
  where $\Sigma$ is non-singular, $X_1 = (X_1, \ldots, X_r)^T$ and $X_2 = (X_{r+1}, \ldots, X_p)^T$. Then
  $X_1 \sim N_r(\mu_1, \Sigma_{11}) \quad\text{and}\quad X_2 \sim N_s(\mu_2, \Sigma_{22}),$
  where $s = p - r$.

• What is the conditional distribution of $X_2$ given that $X_1 = x_1$?


• Recall that for the bivariate normal $(X_1, X_2)$ we defined
  $f_{X_2|X_1}(x_2|x_1) = \frac{f_{X_1,X_2}(x_1, x_2)}{f_{X_1}(x_1)}.$

• We saw that for $X_i \sim N(0,1)$ with $\mathrm{Cov}(X_1, X_2) = \rho$,
  $X_2 \mid X_1 = x_1 \sim N(\rho x_1,\ 1 - \rho^2).$


• We can pursue the obvious generalization here (still assuming $\mu = 0$):
  $\log f_{X_2|X_1}(X_2|X_1) = \log\frac{f_X(X)}{f_{X_1}(X_1)}$
  $= C - \tfrac{1}{2}\left[(X_1^T, X_2^T)\begin{pmatrix} \Sigma^{11} & \Sigma^{12} \\ \Sigma^{21} & \Sigma^{22} \end{pmatrix}\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} - X_1^T\Sigma_{11}^{-1}X_1\right]$
  $= C - \tfrac{1}{2}\left[X_1^T\left(\Sigma^{11} - \Sigma_{11}^{-1}\right)X_1 + X_2^T\Sigma^{21}X_1 + X_1^T\Sigma^{12}X_2 + X_2^T\Sigma^{22}X_2\right],$
  where $C = C(\Sigma, r)$ is a constant and
  $\Sigma^{-1} = \begin{pmatrix} \Sigma^{11} & \Sigma^{12} \\ \Sigma^{21} & \Sigma^{22} \end{pmatrix}.$


From the previous slide,
  $\log f_{X_2|X_1}(X_2|X_1) = C - \tfrac{1}{2}\left[X_1^T\left(\Sigma^{11} - \Sigma_{11}^{-1}\right)X_1 + X_2^T\Sigma^{21}X_1 + X_1^T\Sigma^{12}X_2 + X_2^T\Sigma^{22}X_2\right].$

• Using the block inverse identities
  $\Sigma^{11} - \Sigma_{11}^{-1} = (\Sigma^{11}\Sigma_{11} - I)\Sigma_{11}^{-1} = -\Sigma^{12}\Sigma_{21}\Sigma_{11}^{-1} = \Sigma_{11}^{-1}\Sigma_{12}\Sigma^{22}\Sigma_{21}\Sigma_{11}^{-1}$
  $\Sigma^{21} = -\Sigma^{22}\Sigma_{21}\Sigma_{11}^{-1}$
  $\Sigma^{12} = -\Sigma_{11}^{-1}\Sigma_{12}\Sigma^{22}$
  in the above expression, we have
  $X_1^T\left(\Sigma^{11} - \Sigma_{11}^{-1}\right)X_1 + X_2^T\Sigma^{21}X_1 + X_1^T\Sigma^{12}X_2 + X_2^T\Sigma^{22}X_2$
  $= X_1^T\Sigma_{11}^{-1}\Sigma_{12}\Sigma^{22}\Sigma_{21}\Sigma_{11}^{-1}X_1 - X_2^T\Sigma^{22}\Sigma_{21}\Sigma_{11}^{-1}X_1 - X_1^T\Sigma_{11}^{-1}\Sigma_{12}\Sigma^{22}X_2 + X_2^T\Sigma^{22}X_2$
  $= \left(X_2 - \Sigma_{21}\Sigma_{11}^{-1}X_1\right)^T\Sigma^{22}\left(X_2 - \Sigma_{21}\Sigma_{11}^{-1}X_1\right).$

• Therefore,
  $\log f_{X_2|X_1}(X_2|X_1) = C - \tfrac{1}{2}\left(X_2 - \Sigma_{21}\Sigma_{11}^{-1}X_1\right)^T\Sigma^{22}\left(X_2 - \Sigma_{21}\Sigma_{11}^{-1}X_1\right).$


• As a function of $X_2$ this is the density of a multivariate normal with mean $\Sigma_{21}\Sigma_{11}^{-1}X_1$ and covariance matrix
  $\left(\Sigma^{22}\right)^{-1} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}.$

• We have essentially proved:

  Theorem 2. If $X \sim N_p(\mu, \Sigma)$ then the conditional distribution of $X_2$ given $X_1 = x$ is
  $N\left(\mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x - \mu_1),\ \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}\right).$
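The conditional mean and covariance in Theorem 2 are straightforward to compute; below is a small helper I have added (numpy assumed) for an arbitrary partition into the first $r$ and remaining coordinates.

```python
import numpy as np

def conditional_mvn(mu, Sigma, r, x1):
    """Parameters of X2 | X1 = x1 when (X1, X2) ~ N_p(mu, Sigma), X1 = first r coords."""
    mu1, mu2 = mu[:r], mu[r:]
    S11, S12 = Sigma[:r, :r], Sigma[:r, r:]
    S21, S22 = Sigma[r:, :r], Sigma[r:, r:]
    Q = S21 @ np.linalg.inv(S11)
    cond_mean = mu2 + Q @ (x1 - mu1)
    cond_cov = S22 - Q @ S12                 # Sigma_{2.1}
    return cond_mean, cond_cov

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 2.0, 0.3],
                  [0.2, 0.3, 1.5]])
print(conditional_mvn(mu, Sigma, r=1, x1=np.array([0.7])))
```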


• Rather than routinely generalizing the result to include the case $\mu \neq 0$ we give a different proof for the general case.

• Recall that if $(X_1, X_2)$ is bivariate normal with $X_i \sim N(0,1)$ and correlation $\rho$, then we can de-correlate $X_2$ from $X_1$ as follows: let $Y_2 = X_2 - \rho X_1$; then
  $\mathrm{Cov}(Y_2, X_1) = \mathrm{Cov}(X_2, X_1) - \rho\,\mathrm{Cov}(X_1, X_1) = 0.$

• So $X_2 = Y_2 + \rho X_1$, where $Y_2 \sim N(0, 1 - \rho^2)$ is independent of $X_1$ (?).

• Therefore, conditioning on $X_1 = x_1$,
  $X_2 \sim N(\rho x_1,\ 1 - \rho^2).$

• Next we generalize this to the multivariate normal case.


• Let $Q = \Sigma_{21}\Sigma_{11}^{-1}$ and let
  $Y_2 = X_2 - QX_1.$
  Note that $QX_1$ is the best linear predictor of $X_2$ given $X_1$.

• $Y_2$ and $X_1$ are uncorrelated since
  $\mathrm{Cov}(Y_2, X_1) = \mathrm{Cov}(X_2, X_1) - \mathrm{Cov}(QX_1, X_1) = \Sigma_{21} - Q\Sigma_{11} = 0.$

• Then $\begin{pmatrix} X_1 \\ Y_2 \end{pmatrix}$ is a multivariate normal RV (why?).


• It follows that $Y_2$ and $X_1$ are independent and
  $Y_2 \sim N(\mu_2 - Q\mu_1,\ \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}),$
  since $Y_2 = X_2 - QX_1$ with $Q = \Sigma_{21}\Sigma_{11}^{-1}$, so
  $\mathrm{Cov}(Y_2) = \mathrm{Cov}(X_2 - QX_1, X_2 - QX_1) = \Sigma_{22} - \Sigma_{21}Q^T - Q\Sigma_{12} + Q\Sigma_{11}Q^T = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}.$

• Finally, $X_2 = Y_2 + QX_1$, so the conditional distribution of $X_2$ given $X_1 = x$ is
  $N\left(\mu_2 - Q\mu_1 + Qx,\ \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}\right).$

• Notation: $\Sigma_{2\cdot1} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} = \left(\Sigma^{22}\right)^{-1}.$


Partial Correlation

• The best linear estimator of $X_1$ given $X_3$ is
  $P_{X_3}(X_1) = \Sigma_{13}\Sigma_{33}^{-1}X_3.$

• The partial correlation between $X_1$ and $X_2$ given $X_3$ is
  $\rho_{12\cdot34\ldots p} \equiv \mathrm{Cor}\left[X_1 - P_{X_3}(X_1),\ X_2 - P_{X_3}(X_2)\right].$

• Theorem: Let
  $\Sigma^{-1} = \begin{pmatrix} \Sigma^{11} & \Sigma^{12} & \Sigma^{13} \\ \Sigma^{21} & \Sigma^{22} & \Sigma^{23} \\ \Sigma^{31} & \Sigma^{32} & \Sigma^{33} \end{pmatrix}$
  and let $d = \Sigma^{11}\Sigma^{22} - \Sigma^{12}\Sigma^{21}$; then
  $\rho_{12\cdot34\ldots p} = -\frac{\Sigma^{12}}{\sqrt{\Sigma^{11}\Sigma^{22}}}.$


• Proof: Covariance is bilinear, so
  $\mathrm{Cov}\left[X_1 - P_{X_3}(X_1),\ X_2 - P_{X_3}(X_2)\right]$
  $= \mathrm{Cov}(X_1, X_2) - \mathrm{Cov}(X_1, P_{X_3}(X_2)) - \mathrm{Cov}(P_{X_3}(X_1), X_2) + \mathrm{Cov}(P_{X_3}(X_1), P_{X_3}(X_2))$
  $= \Sigma_{12} - \mathrm{Cov}(X_1, X_3)\Sigma_{33}^{-1}\Sigma_{23}^T - \Sigma_{13}\Sigma_{33}^{-1}\mathrm{Cov}(X_3, X_2) + \Sigma_{13}\Sigma_{33}^{-1}\mathrm{Cov}(X_3, X_3)\Sigma_{33}^{-1}\Sigma_{23}^T$
  $= \Sigma_{12} - \Sigma_{13}\Sigma_{33}^{-1}\Sigma_{32}.$

• Similarly,
  $\mathrm{Cov}\left[X_1 - P_{X_3}(X_1)\right] = \Sigma_{11} - \Sigma_{13}\Sigma_{33}^{-1}\Sigma_{13}^T - \Sigma_{13}\Sigma_{33}^{-1}\Sigma_{31} + \Sigma_{13}\Sigma_{33}^{-1}\Sigma_{33}\Sigma_{33}^{-1}\Sigma_{13}^T = \Sigma_{11} - \Sigma_{13}\Sigma_{33}^{-1}\Sigma_{31}.$

• Hence,
  $\rho_{12\cdot34\ldots p} = \frac{\Sigma_{12} - \Sigma_{13}\Sigma_{33}^{-1}\Sigma_{32}}{\sqrt{(\Sigma_{11} - \Sigma_{13}\Sigma_{33}^{-1}\Sigma_{31})(\Sigma_{22} - \Sigma_{23}\Sigma_{33}^{-1}\Sigma_{32})}}.$


• By our block inverse identities,
  $\frac{1}{d}\begin{pmatrix} \Sigma^{22} & -\Sigma^{12} \\ -\Sigma^{21} & \Sigma^{11} \end{pmatrix} = \begin{pmatrix} \Sigma^{11} & \Sigma^{12} \\ \Sigma^{21} & \Sigma^{22} \end{pmatrix}^{-1} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} - \begin{pmatrix} \Sigma_{13} \\ \Sigma_{23} \end{pmatrix}\Sigma_{33}^{-1}\begin{pmatrix} \Sigma_{31} & \Sigma_{32} \end{pmatrix}.$

• Therefore,
  $\rho_{12\cdot34\ldots p} = \frac{\Sigma_{12} - \Sigma_{13}\Sigma_{33}^{-1}\Sigma_{32}}{\sqrt{(\Sigma_{11} - \Sigma_{13}\Sigma_{33}^{-1}\Sigma_{31})(\Sigma_{22} - \Sigma_{23}\Sigma_{33}^{-1}\Sigma_{32})}} = \frac{-\Sigma^{12}/d}{\sqrt{(\Sigma^{22}/d)(\Sigma^{11}/d)}} = -\frac{\Sigma^{12}}{\sqrt{\Sigma^{11}\Sigma^{22}}}.$
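This precision-matrix formula is easy to try out numerically; the sketch below (my addition, numpy assumed) computes $\rho_{12\cdot3\ldots p}$ both from $\Sigma^{-1}$ and from the residual-covariance definition and checks that they agree.

```python
import numpy as np

Sigma = np.array([[2.0, 0.8, 0.6, 0.3],
                  [0.8, 1.5, 0.5, 0.2],
                  [0.6, 0.5, 1.2, 0.4],
                  [0.3, 0.2, 0.4, 1.0]])

# Formula via the precision matrix: rho_12.3...p = -K12 / sqrt(K11 K22)
K = np.linalg.inv(Sigma)
rho_prec = -K[0, 1] / np.sqrt(K[0, 0] * K[1, 1])

# Definition via residuals of the best linear predictors given X3 = (X_3, ..., X_p)
S33 = Sigma[2:, 2:]
c11 = Sigma[0, 0] - Sigma[0, 2:] @ np.linalg.solve(S33, Sigma[2:, 0])
c22 = Sigma[1, 1] - Sigma[1, 2:] @ np.linalg.solve(S33, Sigma[2:, 1])
c12 = Sigma[0, 1] - Sigma[0, 2:] @ np.linalg.solve(S33, Sigma[2:, 1])
rho_def = c12 / np.sqrt(c11 * c22)

print(rho_prec, rho_def)    # the two values should agree
```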


• Our definition of partial correlation and the analysis above do not require $X$ to be multivariate normal.

• However, if $X$ is multivariate normal then
  $\rho_{12\cdot3\ldots p} = \mathrm{Cor}_F(X_1, X_2),$
  where the correlation is with respect to the distribution $F = F_{(X_1,X_2)|X_3}$, the joint conditional distribution of $X_1$ and $X_2$ given $X_3$.


Maximum Likelihood Estimates for µ and Σ

• $X_1, \ldots, X_n$ are independent $N_p(\mu_0, \Sigma_0)$ random vectors (here $\mu_0$ and $\Sigma_0$ are the true mean and covariance respectively).

• Theorem: The maximum likelihood estimators for $\mu$ and $\Sigma$ are
  $\hat{\mu} = \bar{X} \quad\text{and}\quad \hat{\Sigma} = \tfrac{1}{n}S,$
  where
  $S = \sum_{i=1}^{n}(X_i - \bar{X})(X_i - \bar{X})^T.$
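In practice these estimators are one line each; a sketch (my addition, numpy assumed) computing $\hat{\mu} = \bar{X}$ and $\hat{\Sigma} = S/n$ from a data matrix:

```python
import numpy as np

rng = np.random.default_rng(6)
mu0 = np.array([1.0, -1.0, 0.0])
Sigma0 = np.array([[1.0, 0.4, 0.2],
                   [0.4, 1.5, 0.3],
                   [0.2, 0.3, 0.8]])
X = rng.multivariate_normal(mu0, Sigma0, size=5_000)   # n x p data matrix

n = X.shape[0]
mu_hat = X.mean(axis=0)                                 # MLE of mu
centered = X - mu_hat
Sigma_hat = centered.T @ centered / n                   # MLE of Sigma = S / n

print(mu_hat)       # close to mu0
print(Sigma_hat)    # close to Sigma0 (note: divides by n, not n - 1)
```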


• Proof: Assuming $|\Sigma| > 0$, the likelihood of $(\mu, \Sigma)$ is
  $L(\mu, \Sigma) = \prod_{i=1}^{n}(2\pi)^{-p/2}|\Sigma|^{-1/2}\exp\left(-\tfrac{1}{2}(X_i - \mu)^T\Sigma^{-1}(X_i - \mu)\right).$

• Recall: if $A$ is $m \times n$ and $B$ is $n \times m$ then $\mathrm{tr}(AB) = \mathrm{tr}(BA)$. Hence,
  $(X_i - \mu)^T\Sigma^{-1}(X_i - \mu) = \mathrm{tr}\left[(X_i - \mu)^T\Sigma^{-1}(X_i - \mu)\right] = \mathrm{tr}\left[\Sigma^{-1}(X_i - \mu)(X_i - \mu)^T\right].$

  Thus,
  $L(\mu, \Sigma) = (2\pi)^{-np/2}|\Sigma|^{-n/2}\exp\left(-\tfrac{1}{2}\mathrm{tr}\left[\Sigma^{-1}\sum_{i=1}^{n}(X_i - \mu)(X_i - \mu)^T\right]\right).$


From the previous slide,
  $L(\mu, \Sigma) = (2\pi)^{-np/2}|\Sigma|^{-n/2}\exp\left(-\tfrac{1}{2}\mathrm{tr}\left[\Sigma^{-1}\sum_{i=1}^{n}(X_i - \mu)(X_i - \mu)^T\right]\right).$

• Relying on the identity
  $\sum_{i=1}^{n}(X_i - \mu)(X_i - \mu)^T = \sum_{i=1}^{n}(X_i - \bar{X})(X_i - \bar{X})^T + n(\bar{X} - \mu)(\bar{X} - \mu)^T,$
  where $\bar{X} = n^{-1}\sum_{i=1}^{n}X_i$, we have
  $L(\mu, \Sigma) = (2\pi)^{-np/2}|\Sigma|^{-n/2}\exp\left(-\tfrac{1}{2}\mathrm{tr}\left[\Sigma^{-1}\left(n(\bar{X} - \mu)(\bar{X} - \mu)^T + \sum_{i=1}^{n}(X_i - \bar{X})(X_i - \bar{X})^T\right)\right]\right).$


• From the previous slide,
  $L(\mu, \Sigma) = (2\pi)^{-np/2}|\Sigma|^{-n/2}\exp\left(-\tfrac{1}{2}\mathrm{tr}\left[\Sigma^{-1}\left(n(\bar{X} - \mu)(\bar{X} - \mu)^T + \sum_{i=1}^{n}(X_i - \bar{X})(X_i - \bar{X})^T\right)\right]\right).$

• Maximizing $L(\mu, \Sigma)$ with respect to $\mu$ is easy:
  $-\tfrac{1}{2}\mathrm{tr}\left[\Sigma^{-1}(\bar{X} - \mu)(\bar{X} - \mu)^T\right] = -\tfrac{1}{2}(\bar{X} - \mu)^T\Sigma^{-1}(\bar{X} - \mu) \le 0,$
  with equality if and only if $\mu = \bar{X}$.

• Then
  $\max_\mu L(\mu, \Sigma) = L(\bar{X}, \Sigma) = (2\pi)^{-np/2}|\Sigma|^{-n/2}\exp\left(-\tfrac{1}{2}\mathrm{tr}\left(\Sigma^{-1}S\right)\right) \equiv L_P(\Sigma),$
  where $L_P(\Sigma)$ is sometimes referred to as the profile likelihood.


• Next note that
  $\mathrm{argmax}_\Sigma\, L_P(\Sigma) = \mathrm{argmax}_\Sigma\, \log L_P(\Sigma) = \mathrm{argmax}_\Sigma\left\{-\tfrac{np}{2}\log(2\pi) - \tfrac{n}{2}\log|\Sigma| - \tfrac{1}{2}\mathrm{tr}\left(\Sigma^{-1}S\right)\right\},$
  since $\log(\cdot)$ is monotonic.

• In order to maximize the log profile likelihood we note that it can be shown that
  $\frac{\partial\log|\Sigma|}{\partial\Sigma_{ij}} = \mathrm{tr}\left[\Sigma^{-1}\frac{\partial\Sigma}{\partial\Sigma_{ij}}\right] = \mathrm{tr}\left[\Sigma^{-1}E_{ij}\right],$
  where $E_{ij}$ is a zero matrix except for a 1 in the $(i,j)$th entry. Next, it can be shown that
  $\frac{\partial\Sigma^{-1}}{\partial\Sigma_{ij}} = -\Sigma^{-1}\frac{\partial\Sigma}{\partial\Sigma_{ij}}\Sigma^{-1} = -\Sigma^{-1}E_{ij}\Sigma^{-1}.$


Hence,
  $\frac{\partial\log L_P(\Sigma)}{\partial\Sigma_{ij}} = -\tfrac{n}{2}\mathrm{tr}\left[\Sigma^{-1}E_{ij}\right] + \tfrac{1}{2}\mathrm{tr}\left(\Sigma^{-1}E_{ij}\Sigma^{-1}S\right) = \tfrac{1}{2}\mathrm{tr}\left[\Sigma^{-1}\left(S\Sigma^{-1} - nI\right)E_{ij}\right].$

Setting the above to 0 for all $(i,j)$ we obtain the solution
  $\hat{\Sigma} = \tfrac{1}{n}S.$


Bias of the Maximum Likelihood Estimators

Firstly, it is easy to show that the MLE for $\mu$ is unbiased, since
  $E[\bar{X}] = \frac{1}{n}\sum_{i=1}^{n}E[X_i] = \frac{1}{n}\sum_{i=1}^{n}\mu = \mu,$
and it has covariance
  $\mathrm{Cov}(\bar{X}) = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Cov}[X_i] = \frac{1}{n^2}\sum_{i=1}^{n}\Sigma = \frac{1}{n}\Sigma.$


However, the MLE for $\Sigma$ is biased, since
  $E[n^{-1}S] = n^{-1}E\left[\sum_{i=1}^{n}(X_i - \bar{X})(X_i - \bar{X})^T\right]
             = n^{-1}E\left[\sum_{i=1}^{n}X_iX_i^T - n\bar{X}\bar{X}^T\right]
             = n^{-1}\left[n\left(\Sigma + \mu\mu^T\right) - n\left(\tfrac{1}{n}\Sigma + \mu\mu^T\right)\right]
             = \frac{n-1}{n}\Sigma.$

Hence,
  $\frac{1}{n-1}S$
is an unbiased estimator of $\Sigma$.


Sampling Distributions

Theorem: If $\hat{\Sigma}$ is a consistent estimator of $\Sigma$ then, by virtue of the central limit theorem,
  $\sqrt{n}\,\hat{\Sigma}^{-1/2}(\bar{X} - \mu)$
converges to $N_p(0, I)$ in distribution.

The sampling distribution of $S$ is much more complicated and involves the Wishart distribution (which we will cover later on).


Results on Quadratic Forms

• Definition: A square matrix $A$ is called idempotent if $A^2 = A$.

• Theorem: A symmetric matrix $A$ is idempotent if and only if all its eigenvalues are in $\{0, 1\}$.

• Proof: Suppose that the eigenvalue decomposition of $A$ is $U\Lambda U^T$; then
  $A^2 = U\Lambda U^TU\Lambda U^T = U\Lambda^2U^T.$

  If $A$ is idempotent then
  $U\Lambda^2U^T = U\Lambda U^T \Rightarrow \lambda_i^2 = \lambda_i$
  for $1 \le i \le p$. Hence, if $A$ is idempotent then $\lambda_i \in \{0, 1\}$ for $1 \le i \le p$.

  Conversely, if $\lambda_i \in \{0, 1\}$ for $1 \le i \le p$ then $\lambda_i^2 = \lambda_i$ for $1 \le i \le p$ and
  $A^2 = U\Lambda^2U^T = U\Lambda U^T = A.$


• Let $X \sim N_p(0, I)$ and let $C$ be a symmetric square matrix of rank $r > 0$.

• Theorem 3: The random variable
  $X^TCX \sim \chi^2_r$
  if and only if $C$ is idempotent.


• Proof: We can diagonalize: $C = UDU^T$, where $U$ is an orthogonal matrix and
  $D = \mathrm{diag}(\lambda_1, \ldots, \lambda_r, \underbrace{0, \ldots, 0}_{p-r}),$
  with $\lambda_i \ge \lambda_{i+1}$ for $i = 1, \ldots, r-1$ (and $\lambda_i \neq 0$).

• Let $Y = U^TX$ so that $Y \sim N_p(0, I)$, and note that
  $W \equiv X^TCX = X^TUDU^TX = Y^TDY = \sum_{i=1}^{r}\lambda_iY_i^2.$

• If $C$ is idempotent then $\lambda_i = 1$ for $1 \le i \le r$ and $W \sim \chi^2_r$, since the $Y_i^2$ are i.i.d. $\chi^2_1$.


• Conversely, assume $W \sim \chi^2_r$; then its MGF is given by
  $M_W(t) = (1 - 2t)^{-r/2} \quad\text{for } t < \tfrac{1}{2}.$

  Since the $Y_i^2$ are i.i.d. $\chi^2_1$, for $\lambda_it < 1/2$ for all $i$ we also have
  $M_W(t) = \prod_{i=1}^{r}M_{\lambda_iY_i^2}(t) = \prod_{i=1}^{r}M_{Y_i^2}(\lambda_it) = \prod_{i=1}^{r}(1 - 2\lambda_it)^{-1/2}.$

• These two domains agree if and only if $\lambda_1 = 1$ and $\lambda_r > 0$, so this has to be the case.


• Therefore, $M_W(t) = M_{\chi^2_r}(t)$ for $t < \tfrac{1}{2}$ only if
  $\prod_{i=1}^{r}(1 - 2\lambda_it) = [M_W(t)]^{-2} = [M_{\chi^2_r}(t)]^{-2} = \prod_{i=1}^{r}(1 - 2t).$

• By unique polynomial factorization, the equality holds if and only if $\lambda_i = 1$ for $i = 1, \ldots, r$.
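A quick Monte Carlo check of Theorem 3 (my addition; numpy and scipy assumed): take $C$ to be an idempotent projection matrix of rank $r$ and compare the empirical distribution of $X^TCX$ with $\chi^2_r$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
p, r, n_sims = 5, 3, 50_000

# An idempotent rank-r matrix: projection onto the column space of a random p x r matrix
B = rng.standard_normal((p, r))
C = B @ np.linalg.inv(B.T @ B) @ B.T         # C^2 = C, rank(C) = r

X = rng.standard_normal((n_sims, p))
W = np.einsum('ij,jk,ik->i', X, C, X)        # W_m = X_m^T C X_m for each simulated X_m

print(W.mean(), r)                           # the mean of chi^2_r is r
print(stats.kstest(W, 'chi2', args=(r,)))    # KS test against chi^2_r: large p-value expected
```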


• Theorem 4 [Craig's Theorem]: If $A$ and $B$ are symmetric, non-negative definite matrices and $X \sim N_p(0, I)$ then
  $X^TAX \quad\text{and}\quad X^TBX$
  are independent if and only if $AB = 0$.


• Proof: Assume $AB = 0$. Then $BA = (AB)^T = 0$, so $A$ and $B$ commute.

  Simultaneous diagonalization: there exist an orthogonal $U$ and diagonal matrices $D_A$ and $D_B$ such that $U^TAU = D_A$ and $U^TBU = D_B$.

  Next, $AB = UD_AU^TUD_BU^T = UD_AD_BU^T$, so $AB = 0$ implies (WLOG) that
  $D_A = \mathrm{diag}(\lambda_1, \ldots, \lambda_r, 0, \ldots, 0) \quad\text{and}\quad D_B = \mathrm{diag}(\underbrace{0, \ldots, 0}_{r}, \lambda_{r+1}, \ldots, \lambda_p)$
  (for some $r$).

  It follows that, with $Y \equiv U^TX$,
  $X^TAX = Y^TU^TAUY = \sum_{i=1}^{r}\lambda_iY_i^2 \quad\text{and}\quad X^TBX = \sum_{i=r+1}^{p}\lambda_iY_i^2.$

  As $Y \sim N_p(0, I)$, the latter two are obviously independent random variables.


• Conversely, suppose the quadratic forms are independent.

  Let $U$ be an orthogonal matrix such that
  $U^TAU = D_A = \mathrm{diag}(\lambda_1, \ldots, \lambda_r, 0_{(p-r)}).$

  Let $B^* = U^TBU$ and let $Y = U^TX$; then
  $X^TAX = Y^TU^TAUY = \sum_{i=1}^{r}\lambda_iY_i^2 \quad\text{and}\quad X^TBX = \sum_i\sum_j b^*_{ij}Y_iY_j.$

  Independence implies that
  $E[X^TAX]\,E[X^TBX] = E[X^TAX\,X^TBX].$

  Therefore,
  $E\left[\sum_{i=1}^{r}\lambda_iY_i^2\right]\cdot E\left[\sum_i\sum_j b^*_{ij}Y_iY_j\right] = E\left[\left(\sum_{i=1}^{r}\lambda_iY_i^2\right)\sum_j\sum_k b^*_{kj}Y_kY_j\right].$


But $Y_1, \ldots, Y_p$ are NID$(0,1)$ with $E[Y_i^2] = 1$, $E[Y_i^3] = 0$ and $E[Y_i^4] = 3$. Therefore,
  $\left(\sum_{i=1}^{r}\lambda_i\right)\left(\sum_j b^*_{jj}\right) = 3\sum_{i=1}^{r}\lambda_ib^*_{ii} + \sum_{i \neq j}\lambda_ib^*_{jj}.$

Thus, $\sum_{i=1}^{r}\lambda_ib^*_{ii} = 0$.

But $\lambda_ib^*_{ii} \ge 0$ for all $i$ (why?) and $\lambda_i > 0$ for $i = 1, \ldots, r$, so $b^*_{ii} = 0$ for $i = 1, \ldots, r$.

Since $B^* \ge 0$ it follows that any principal submatrix is non-negative definite as well; hence $b^*_{ij} = 0$ if $i$ or $j$ is in $\{1, \ldots, r\}$. Thus,
  $B^* = \begin{pmatrix} 0 & 0 \\ 0 & B^*_{22} \end{pmatrix},$
hence $D_AB^* = 0$ and therefore
  $AB = AUU^TB = UD_AB^*U^T = 0.$


The Wishart Distribution

• Let $X = (X_1, \ldots, X_n)^T$ be an $n \times p$ matrix with rows $X_i \sim \mathrm{NID}_p(0, \Sigma)$.

• The $p \times p$ matrix $M \equiv X^TX$ has a Wishart distribution $W_p(\Sigma, n)$, whose density is given by
  $C_{p,n}^{-1}|\Sigma|^{-n/2}|M|^{(n-p-1)/2}\exp\left(-\tfrac{1}{2}\mathrm{tr}\left[\Sigma^{-1}M\right]\right),$
  where
  $C_{p,n} \equiv 2^{np/2}\pi^{p(p-1)/4}\prod_{i=1}^{p}\Gamma((n + 1 - i)/2).$

  We also need $n$ (called the degrees of freedom) to satisfy $n > p - 1$, and the Wishart distribution parameter $\Sigma$ (called the scale matrix) is assumed to be positive definite.


• Note that $X = \sum_{i=1}^{n}e_iX_i^T$, where $e_i$ is a vector of length $n$ whose elements are 0 except for the $i$th element, which is equal to 1. Therefore,
  $M = \left(\sum_{j=1}^{n}X_je_j^T\right)\left(\sum_{i=1}^{n}e_iX_i^T\right) = \sum_{i,j}X_j\delta_{ij}X_i^T = \sum_{i=1}^{n}X_iX_i^T,$
  where $\delta_{ij}$ is a scalar equal to 1 if $i = j$ and 0 if $i \neq j$.

• Hence, for known $\mu = 0$ the MLE/sample covariance satisfies $n\hat{\Sigma} \sim W_p(\Sigma, n)$.

• If $p = 1$ then $W_1(\Sigma, n) = W_1(\sigma^2, n) = \sigma^2\chi^2_n$.

• $E[M] = E\left(\sum_{i=1}^{n}X_iX_i^T\right) = n\Sigma$.

• $\mathrm{Var}(M_{ij}) = n(\Sigma_{ij}^2 + \Sigma_{ii}\Sigma_{jj})$.
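A brief simulation of these moment formulas (my addition, numpy assumed): form $M = X^TX$ from rows drawn from $N_p(0, \Sigma)$ and compare the empirical mean of $M$ with $n\Sigma$, and the empirical $\mathrm{Var}(M_{12})$ with $n(\Sigma_{12}^2 + \Sigma_{11}\Sigma_{22})$.

```python
import numpy as np

rng = np.random.default_rng(8)
p, n, n_sims = 3, 10, 20_000
Sigma = np.array([[1.0, 0.4, 0.2],
                  [0.4, 1.5, 0.3],
                  [0.2, 0.3, 0.8]])

# Simulate M = X^T X with X an n x p matrix whose rows are N_p(0, Sigma)
A = np.linalg.cholesky(Sigma)
Z = rng.standard_normal((n_sims, n, p))
X = Z @ A.T
Ms = np.einsum('kni,knj->kij', X, X)        # M_k = X_k^T X_k for each replicate

print(Ms.mean(axis=0))                      # close to n * Sigma
print(n * Sigma)
print(Ms[:, 0, 1].var(), n * (Sigma[0, 1]**2 + Sigma[0, 0] * Sigma[1, 1]))   # Var(M_12)
```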


• Theorem 5: If $M \sim W_p(\Sigma, n)$ and $B$ is a $p \times q$ matrix then
  $B^TMB \sim W_q(B^T\Sigma B, n).$

• Proof: Let $Y = XB$; then $Y = (Y_1, \ldots, Y_n)^T$ is an $n \times q$ matrix with independent rows $Y_i^T = X_i^TB$, i.e. $Y_i = B^TX_i \sim N_q(0, B^T\Sigma B)$.

  It follows that
  $B^TMB = B^TX^TXB = Y^TY \sim W_q(B^T\Sigma B, n).$

• Corollary: The principal submatrices of $M$ have Wishart distributions.


• Theorem 6: If $M \sim W_p(\Sigma, n)$ and $a$ is any fixed $p$-vector such that $a^T\Sigma a > 0$ then
  $\frac{a^TMa}{a^T\Sigma a} \sim \chi^2_n.$

• Proof: Firstly, from Theorem 5 we have
  $a^TMa \sim W_1(a^T\Sigma a, n),$
  and we have already argued that
  $W_1(a^T\Sigma a, n) \sim (a^T\Sigma a)\chi^2_n.$

• Corollary: $M_{ii} \sim \Sigma_{ii}\chi^2_n$.

• The converse to Theorem 6 is not true.


• Theorem 7: If
  $M_1 \sim W_p(\Sigma, n_1) \quad\text{and}\quad M_2 \sim W_p(\Sigma, n_2),$
  with $M_2$ independent of $M_1$, then
  $M_1 + M_2 \sim W_p(\Sigma, n_1 + n_2).$

• Proof: Write $M_i = X_i^TX_i$ where $X_i$ has $n_i$ independent rows drawn from a $N_p(0, \Sigma)$ distribution. Then
  $M_1 + M_2 = X_1^TX_1 + X_2^TX_2 = X^TX, \quad\text{where } X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}.$


• Recall that the sample covariance matrix is defined as $\tfrac{1}{n}S$, where
  $S = \sum_{i=1}^{n}(X_i - \bar{X})(X_i - \bar{X})^T.$

• The Wishart distribution describes the distribution of $S$.

• Theorem 8: If $X$ is an $n \times p$ matrix with rows independent $N_p(\mu, \Sigma)$ then
  $S \sim W_p(\Sigma, n - 1)$, independently of $\bar{X} \sim N_p(\mu, n^{-1}\Sigma)$.

• Proof: Recall that
  $S = \sum_{i=1}^{n}(X_i - \bar{X})(X_i - \bar{X})^T = \sum_{i=1}^{n}X_iX_i^T - n\bar{X}\bar{X}^T = X^TX - n\bar{X}\bar{X}^T.$


Let $P$ be an $n \times n$ orthogonal matrix with first row
  $\left(\tfrac{1}{\sqrt{n}}, \ldots, \tfrac{1}{\sqrt{n}}\right).$

Let $Y = PX$, so $Y_1 = \sqrt{n}\,\bar{X}$, where $Y = (Y_1, \ldots, Y_n)^T$. Thus,
  $S = Y^TPP^TY - Y_1Y_1^T = Y^TY - Y_1Y_1^T = \sum_{i=2}^{n}Y_iY_i^T.$

• Exercise: WLOG $\mu = 0$.

• It follows from the following lemma that $Y_i \sim \mathrm{NID}_p(0, \Sigma)$, and therefore
  $S \sim W_p(\Sigma, n - 1)$, independently of $\tfrac{1}{\sqrt{n}}Y_1 = \bar{X}$.


• Lemma: Let $X$ be an $n \times p$ matrix with rows $\mathrm{NID}_p(0, \Sigma)$ and let $U$ be an $n \times n$ orthogonal matrix. Define $Y = UX$, and let $Y_i^T$ be the rows of $Y$ $(i = 1, \ldots, n)$. Then $Y_i \sim \mathrm{NID}_p(0, \Sigma)$.


• Proof: We have $E(Y) = UE(X) = 0_{n \times p}$. Note that $Y_i = Y^Te_i$ and therefore
  $E[Y_iY_j^T] = E[X^TU^Te_ie_j^TUX] = E[X^Tu_iu_j^TX]
  = E\left[\left(\sum_{k=1}^{n}X_ke_k^T\right)u_iu_j^T\left(\sum_{l=1}^{n}e_lX_l^T\right)\right]
  = E\left[\sum_{k,l}X_ku_{ik}u_{jl}X_l^T\right]
  = \sum_{k,l}u_{ik}u_{jl}E[X_kX_l^T]
  = \sum_{k}u_{ik}u_{jk}\Sigma
  = \delta_{ij}\Sigma,$
  where $u_i^T$ is the $i$th row of $U$.


Wishart Distribution (cont.)

• Theorem 9: Suppose $X$ has $n$ rows which are $\mathrm{NID}_p(0, \Sigma)$; then $X^TAX \sim W_p(\Sigma, r)$ if and only if $A$ is idempotent and of rank $r$.

• Proof: If $X^TAX \sim W_p(\Sigma, r)$ then by Theorem 6, for any $a \in \mathbb{R}^p$ with $a^T\Sigma a > 0$, we have
  $\frac{a^TX^TAXa}{a^T\Sigma a} \sim \chi^2_r.$

  Let
  $Y = \frac{Xa}{\sqrt{a^T\Sigma a}};$
  then $Y \sim N(0_n, I_n)$ and $Y^TAY \sim \chi^2_r$, so $A$ is idempotent of rank $r$ by Theorem 3.


• Conversely, if $A$ is idempotent of rank $r$ then $A = UDU^T$, where $U$ is an $n \times n$ orthogonal matrix and $D = \mathrm{diag}(\underbrace{1, \ldots, 1}_{r}, \underbrace{0, \ldots, 0}_{n-r})$.

• Let $Y = U^TX$, so by the lemma $Y$ has $n$ rows $Y_i^T$ with $Y_i \sim \mathrm{NID}_p(0, \Sigma)$, and
  $X^TAX = Y^TDY = \left(\sum_{j=1}^{n}Y_je_j^T\right)D\left(\sum_{i=1}^{n}e_iY_i^T\right)
  = \sum_{i,j=1}^{n}Y_j\left(e_j^TDe_i\right)Y_i^T
  = \sum_{i,j=1}^{r}Y_j\delta_{ij}Y_i^T
  = \sum_{i=1}^{r}Y_iY_i^T \sim W_p(\Sigma, r).$


• Theorem 10: If $X$ has $n$ rows which are $\mathrm{NID}_p(0, \Sigma)$ then for any symmetric, non-negative definite $n \times n$ matrices $A$ and $B$ the random matrices $X^TAX$ and $X^TBX$ are independent if and only if $AB = 0$.

• Proof: Assume $X^TAX$ and $X^TBX$ are independent. Then:

  Choose $a$ such that $a^T\Sigma a > 0$ and let
  $Y = \frac{Xa}{\sqrt{a^T\Sigma a}};$
  then:

  * $Y \sim N_n(0, I)$, and
  * $Y^TAY$ and $Y^TBY$ are independent.

  Thus by Craig's Theorem we have $AB = 0$.


• Conversely, assume $AB = 0$. Then:

  There exists an $n \times n$ orthogonal matrix $P$ such that
  $P^TAP = D_A \equiv \mathrm{diag}(\underbrace{\alpha_1, \ldots, \alpha_r}_{r}, \underbrace{0, \ldots, 0}_{n-r})$
  and
  $P^TBP = D_B \equiv \mathrm{diag}(\underbrace{0, \ldots, 0}_{r}, \underbrace{\beta_{r+1}, \ldots, \beta_n}_{n-r}).$


Let $Y = P^TX$; then by the last lemma again $Y$ has $n$ rows $Y_i^T$ with $Y_i \sim \mathrm{NID}_p(0, \Sigma)$, and
  $X^TAX = Y^TD_AY = \left(\sum_{j=1}^{n}Y_je_j^T\right)D_A\left(\sum_{i=1}^{n}e_iY_i^T\right)
  = \sum_{i,j=1}^{n}Y_j\left(e_j^TD_Ae_i\right)Y_i^T
  = \sum_{i=1}^{r}\alpha_iY_iY_i^T.$

Similarly, $X^TBX = \sum_{i=r+1}^{n}\beta_iY_iY_i^T$.

The latter two are obviously independent.


• Theorem 11: Suppose $X$ has $n$ rows which are $\mathrm{NID}_p(0, \Sigma)$ with $\Sigma$ positive definite. Let
  $M = X^TX = \begin{pmatrix} M_{11} & M_{12} \\ M_{21} & M_{22} \end{pmatrix},$
  where $M_{11}$ is an $r \times r$ matrix (with $r < n$). Let
  $M_{2\cdot1} \equiv M_{22} - M_{21}M_{11}^{-1}M_{12}.$
  Then
  $M_{2\cdot1} \sim W_q(\Sigma_{2\cdot1}, n - r),$
  independently of $(M_{11}, M_{12})$, where $q = p - r$ and
  $\Sigma_{2\cdot1} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}.$


• Recall that:

  $\Sigma_{2\cdot1}$ is the conditional covariance of $X_i^{(2)} \equiv (X_{i,r+1}, \ldots, X_{ip})^T$ given $X_i^{(1)} \equiv (X_{i1}, \ldots, X_{ir})^T$.

  This is also the covariance of
  $Y_i \equiv X_i^{(2)} - QX_i^{(1)},$
  where
  $QX_i^{(1)} \equiv \left(\Sigma_{21}\Sigma_{11}^{-1}\right)X_i^{(1)}$
  is the best linear predictor of $X_i^{(2)}$ given $X_i^{(1)}$:
  $Y_i \sim \mathrm{NID}_q(0_q, \Sigma_{2\cdot1}),$
  independent of $X_i^{(1)}$.

• Hence, had we known this decomposition we could have estimated $\Sigma_{2\cdot1}$ from $Y^TY \sim W_q(\Sigma_{2\cdot1}, n)$ (note the degrees of freedom is $n$ rather than $n - r$).

• But we typically don't know $\Sigma$...


• Proof (of Theorem 11).

  Write $X = (X_1, X_2)$ where $X_1$ is $n \times r$ and $X_2$ is $n \times q$ (with $r + q = p$); then
  $M = X^TX = \begin{pmatrix} X_1^T \\ X_2^T \end{pmatrix}(X_1, X_2).$

  It can be shown that, as $\Sigma_{11}$ is positive definite and as $n > r$, $M_{11}$ is positive definite almost surely.

  Thus, we can define
  $M_{2\cdot1} = X_2^TX_2 - X_2^TX_1M_{11}^{-1}X_1^TX_2 = X_2^TPX_2,$
  where $P = I - X_1M_{11}^{-1}X_1^T$.


• With $P = I - X_1M_{11}^{-1}X_1^T$ it is easy to verify directly that $P$ is idempotent (a projection onto $\mathrm{Span}(X_1)^\perp$) and its rank is $(n - r)$, as
  $\mathrm{tr}(P) = n - \mathrm{tr}[X_1M_{11}^{-1}X_1^T] = n - r.$

• Also, $X_1^TP = PX_1 = 0$, so with
  $X_{2\cdot1} \equiv X_2 - X_1\Sigma_{11}^{-1}\Sigma_{12}$
  we have
  $M_{2\cdot1} = X_2^TPX_2 = X_{2\cdot1}^TPX_{2\cdot1}.$

• But $X_{2\cdot1}$ is the component of $X_2$ that is independent of $X_1$, and in particular we saw that the $n$ rows of $X_{2\cdot1}$ are $\mathrm{NID}_q(0_q, \Sigma_{2\cdot1})$.

• It follows from Theorem 9 that, conditioned on $X_1$, $M_{2\cdot1} \sim W_q(\Sigma_{2\cdot1}, n - r)$.


• This distribution does not depend on $X_1$; therefore it is also the unconditional distribution of $M_{2\cdot1}$.

• Moreover, it follows that $M_{2\cdot1}$ is independent of $X_1$ and hence of $M_{11} = X_1^TX_1$.

• Furthermore, $P(I - P) = P - P = 0$, so using the same rationale as in the proof of Theorem 10 we see that, given $X_1$,
  $M_{2\cdot1} = X_{2\cdot1}^TPX_{2\cdot1} \quad\text{and}\quad X_1^T(I - P)X_{2\cdot1}$
  are independent.


• Now,
  $X_1^T(I - P)X_{2\cdot1} = X_1^T(X_1M_{11}^{-1}X_1^T)(X_2 - X_1\Sigma_{11}^{-1}\Sigma_{12}) = M_{12} - M_{11}\Sigma_{11}^{-1}\Sigma_{12}.$

• Therefore, given $X_1$, $M_{2\cdot1}$ is independent of
  $M_{12} = X_1^T(I - P)X_{2\cdot1} + M_{11}\Sigma_{11}^{-1}\Sigma_{12}.$

• Since $M_{2\cdot1}$ is also independent of $X_1$, it is independent of $(X_1, M_{12})$ (why?).

• It follows that $M_{2\cdot1}$ is independent of $(M_{11}, M_{12})$ (why?).


Linear Independence Lemma

• Linear Independence Lemma: Suppose that $X_i \sim \mathrm{NID}_p(0, \Sigma)$ for $i = 1, \ldots, n \le p$ and that $\Sigma$ is positive definite. Then $X_1, \ldots, X_n$ are almost surely linearly independent.

• Proof of lemma.

  Firstly, without loss of generality we can assume that $\Sigma = I_p$ (exercise).

  The proof is by induction, where the case $n = 1$ is trivial since $X_1 \sim N_p(0, I)$ implies that $P(X_1 = 0) = 0$.


• Now, assume that the lemma holds for some $n < p$. In this case:

  * Either $X_{n+1} \notin \mathrm{Span}\langle X_1, \ldots, X_n\rangle$ almost surely and we are done,

  * or there exists a set $A \subset \Omega$ with $P(A) > 0$ and random variables $\alpha_i$ such that
    $X_{n+1}(\omega) = \sum_{i=1}^{n}\alpha_i(\omega)X_i(\omega) \quad\text{and}\quad X_1(\omega), \ldots, X_n(\omega)$
    are linearly independent for all $\omega \in A$.

  The latter identity can be thought of as $p$ equations in $n$ unknowns $\alpha_i(\omega)$. Since $X_1(\omega), \ldots, X_n(\omega)$ are linearly independent, without loss of generality the $\alpha_i(\omega)$ are uniquely determined from $X_1(\omega), \ldots, X_n(\omega)$ and $X_{n+1,k}(\omega)$ for $k = 1, \ldots, n$.


• But
  $X_{n+1,n+1}(\omega) = \sum_{i=1}^{n}\alpha_i(\omega)X_{i,n+1}(\omega),$
  and it follows that on the set $A$, $X_{n+1,n+1}$ can be determined from $X_1(\omega), \ldots, X_n(\omega)$ and $X_{n+1,k}(\omega)$ for $k = 1, \ldots, n$. This contradicts the independence assumption.

• Corollary: If $M \sim W_p(\Sigma, n)$ with $\Sigma$ positive definite and $n \ge p$ then $|M| > 0$ almost surely (exercise).


• Theorem 12. If $M \sim W_p(\Sigma, n)$ with $\Sigma$ positive definite and $n \ge p$ then:

  (a) For any fixed $a \neq 0 \in \mathbb{R}^p$,
      $\frac{a^T\Sigma^{-1}a}{a^TM^{-1}a} \sim \chi^2_{n-p+1}.$
      In particular $\Sigma^{ii}/M^{ii} \sim \chi^2_{n-p+1}$.

  (b) $M^{ii}$ is independent of all elements of $M$ except $M_{ii}$.

• Proof:

  Recall that $M^{22} = M_{2\cdot1}^{-1}$ and similarly $\Sigma^{22} = \Sigma_{2\cdot1}^{-1}$.

  Therefore, from Theorem 11 with $r = p - 1$ we have
  $(M^{pp})^{-1} = M_{2\cdot1} \sim W_{p-r}(\Sigma_{2\cdot1}, n - r) \sim (\Sigma^{pp})^{-1}\chi^2_{n-p+1}.$

  Moreover, $M^{pp}$ is independent of $(M_{11}, M_{12})$, i.e. $M^{pp}$ is independent of all elements of $M$ except $M_{pp}$.


• This proves the theorem for $a = e_p$ and hence for $a = e_i$ (why?).

• For a general $a \neq 0 \in \mathbb{R}^p$, let $A$ be a non-singular matrix with last column $a$, that is $Ae_p = a$. Then
  $\frac{a^T\Sigma^{-1}a}{a^TM^{-1}a} = \frac{e_p^TA^T\Sigma^{-1}Ae_p}{e_p^TA^TM^{-1}Ae_p} = \frac{e_p^T\Sigma_A^{-1}e_p}{e_p^TM_A^{-1}e_p},$
  where $\Sigma_A = A^{-1}\Sigma(A^{-1})^T$ and $M_A = A^{-1}M(A^{-1})^T$.

• The proof is complete since, by Theorem 5, $M_A \sim W_p(\Sigma_A, n)$.


• Theorem 13. If $M \sim W_p(\Sigma, n)$ where $n \ge p$ then
  $|M| = |\Sigma|\cdot\prod_{i=1}^{p}U_i,$
  where the $U_i$ are independent random variables with $U_i \sim \chi^2_{n-i+1}$.

• Proof: Without loss of generality we may assume that $\Sigma$ is positive definite (exercise). We use induction on $p$.

  For the case $p = 1$ we have $M \sim \sigma^2\chi^2_n$.

  Write
  $M = \begin{pmatrix} M_{11} & M_{12} \\ M_{21} & M_{22} \end{pmatrix},$
  where $M_{11}$ is a $(p-1) \times (p-1)$ matrix.


• Recall that
  $M^{22} = [M^{-1}]_{pp} = |M_{11}|/|M|$
  (similarly, $\Sigma^{22} = |\Sigma_{11}|/|\Sigma|$).

• Therefore,
  $|M| = \frac{|M_{11}|}{M^{22}} = \frac{\Sigma^{22}}{M^{22}}\cdot\frac{|\Sigma|}{|\Sigma_{11}|}\cdot|M_{11}|.$

• But from Theorem 12,
  $U_p \equiv \frac{\Sigma^{22}}{M^{22}} = \frac{[\Sigma^{-1}]_{pp}}{[M^{-1}]_{pp}} \sim \chi^2_{n-p+1},$
  and it is independent of $M_{11}$.

• We complete the proof by noting that $M_{11} \sim W_{p-1}(\Sigma_{11}, n)$.


Hence, by the inductive hypothesis,
  $|M_{11}| = |\Sigma_{11}|\prod_{i=1}^{p-1}U_i,$
where the $U_i \sim \chi^2_{n-i+1}$ are independent of one another; obviously they depend only on $M_{11}$ and hence are independent of $U_p$.


• Corollary: If $X$ has $n$ rows which are $\mathrm{NID}_p(\mu, \Sigma)$ with $\Sigma$ positive definite and $S = X^TX - n\bar{X}\bar{X}^T$, then:

  $\Sigma^{kk}/S^{kk} \sim \chi^2_{n-p}$;

  $\dfrac{a^T\Sigma^{-1}a}{a^TS^{-1}a} \sim \chi^2_{n-p}$ for any fixed $a \neq 0$;

  if $S_{11}$ is an $r \times r$ submatrix of $S$, then
  $S_{11} \sim W_r(\Sigma_{11}, n - 1),$
  independently of
  $S_{2\cdot1} = S_{22} - S_{21}S_{11}^{-1}S_{12} \sim W_{p-r}(\Sigma_{2\cdot1}, n - r - 1).$


Hotelling’s T 2

2 Definition. Let Z ∼ Np(0, I) independently of M ∼ Wp(I, n) with n ≥ p

then

nZTM−1Z ∼ T 2(p, n)

where T 2(p, n) is Hotelling’s T 2 distribution.

2 Theorem 14: If Y ∼ Np(µ,Σ) and M ∼ Wp(Σ, n) independently of Y with

Σ positive definite and n ≥ p then

n(Y − µ)TM−1(Y − µ) ∼ T 2(p, n).

2 Proof:

Let Z = Σ−1/2(Y − µ) and let MΣ = Σ−1/2MΣ−1/2.

Then Z ∼ Np(0, I), MΣ ∼ Wp(I, n) and

n(Y − µ)TM−1(Y − µ) = nZTM−1Σ Z.


• Theorem 15:
  $T^2(p, n) \sim \frac{np}{n - p + 1}F_{p,\,n-p+1}.$

• Proof:

  Let $Z \sim N_p(0, I)$ independently of $M \sim W_p(I, n)$.

  By Theorem 12, for $a \neq 0$, conditioned on $Z = a$,
  $D \equiv \frac{Z^TZ}{Z^TM^{-1}Z} \sim \chi^2_{n-p+1}.$

  This conditional distribution of $D$ does not depend on $a$, hence it is also the unconditional distribution of $D$.

  For the same reason, $D$ is independent of $Z$ and therefore also of $R \equiv Z^TZ \sim \chi^2_p$.

  It follows that
  $\frac{nZ^TM^{-1}Z}{n} = \frac{R}{D} \sim \frac{p}{n - p + 1}F_{p,\,n-p+1}.$


• Corollary 1: If $\bar{X}$ is the mean of a sample of size $n \ge p$ drawn from a $N_p(\mu, \Sigma)$ population with $\Sigma$ positive definite, and if $(n-1)^{-1}S$ is the unbiased sample covariance matrix, then
  $n(\bar{X} - \mu)^T\left(\tfrac{1}{n-1}S\right)^{-1}(\bar{X} - \mu) \sim \frac{(n-1)p}{n-p}F_{p,\,n-p}.$

• Proof:

  It suffices to show that the LHS has a $T^2(p, n-1)$ distribution.

  Let $Y = \sqrt{n}\,\bar{X}$ and let $S = X^TX - n\bar{X}\bar{X}^T$.

  Theorem 8 states: $S \sim W_p(\Sigma, n-1)$ independently of $Y \sim N_p(\sqrt{n}\,\mu, \Sigma)$.

  Therefore, by Theorem 14,
  $(n-1)(Y - \sqrt{n}\,\mu)^TS^{-1}(Y - \sqrt{n}\,\mu) \sim T^2(p, n-1).$

  Finally,
  $n(\bar{X} - \mu)^T\left(\tfrac{1}{n-1}S\right)^{-1}(\bar{X} - \mu) = (n-1)(Y - \sqrt{n}\,\mu)^TS^{-1}(Y - \sqrt{n}\,\mu).$
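As a practical illustration of Corollary 1 (my addition, not part of the notes; numpy and scipy assumed), the sketch below computes the one-sample Hotelling $T^2$ statistic, converts it to an $F$ statistic, and reports a p-value for $H_0: \mu = \mu_0$.

```python
import numpy as np
from scipy import stats

def hotelling_one_sample(X, mu0):
    """One-sample Hotelling T^2 test of H0: mu = mu0 for an n x p data matrix X."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    S_unbiased = np.cov(X, rowvar=False)          # S / (n - 1)
    diff = xbar - mu0
    T2 = n * diff @ np.linalg.solve(S_unbiased, diff)
    F = (n - p) / ((n - 1) * p) * T2              # F ~ F_{p, n-p} under H0
    p_value = stats.f.sf(F, p, n - p)
    return T2, F, p_value

rng = np.random.default_rng(9)
X = rng.multivariate_normal([0.0, 0.0, 0.0], np.eye(3), size=40)
print(hotelling_one_sample(X, mu0=np.array([0.0, 0.0, 0.0])))
```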
