
STAT 714

LINEAR STATISTICAL MODELS

Fall, 2010

Lecture Notes

Joshua M. Tebbs

Department of Statistics

The University of South Carolina

Contents

1 Examples of the General Linear Model

2 The Linear Least Squares Problem
   2.1 Least squares estimation
   2.2 Geometric considerations
   2.3 Reparameterization

3 Estimability and Least Squares Estimators
   3.1 Introduction
   3.2 Estimability
      3.2.1 One-way ANOVA
      3.2.2 Two-way crossed ANOVA with no interaction
      3.2.3 Two-way crossed ANOVA with interaction
   3.3 Reparameterization
   3.4 Forcing least squares solutions using linear constraints

4 The Gauss-Markov Model
   4.1 Introduction
   4.2 The Gauss-Markov Theorem
   4.3 Estimation of σ2 in the GM model
   4.4 Implications of model selection
      4.4.1 Underfitting (Misspecification)
      4.4.2 Overfitting
   4.5 The Aitken model and generalized least squares

5 Distributional Theory
   5.1 Introduction
   5.2 Multivariate normal distribution
      5.2.1 Probability density function
      5.2.2 Moment generating functions
      5.2.3 Properties
      5.2.4 Less-than-full-rank normal distributions
      5.2.5 Independence results
      5.2.6 Conditional distributions
   5.3 Noncentral χ2 distribution
   5.4 Noncentral F distribution
   5.5 Distributions of quadratic forms
   5.6 Independence of quadratic forms
   5.7 Cochran's Theorem

6 Statistical Inference
   6.1 Estimation
   6.2 Testing models
   6.3 Testing linear parametric functions
   6.4 Testing models versus testing linear parametric functions
   6.5 Likelihood ratio tests
      6.5.1 Constrained estimation
      6.5.2 Testing procedure
   6.6 Confidence intervals
      6.6.1 Single intervals
      6.6.2 Multiple intervals

7 Appendix
   7.1 Matrix algebra: Basic ideas
   7.2 Linear independence and rank
   7.3 Vector spaces
   7.4 Systems of equations
   7.5 Perpendicular projection matrices
   7.6 Trace, determinant, and eigenproblems
   7.7 Random vectors

1 Examples of the General Linear Model

Complementary reading from Monahan: Chapter 1.

INTRODUCTION : Linear models are models that are linear in their parameters. The

general form of a linear model is given by

Y = Xβ + ε,

where Y is an n × 1 vector of observed responses, X is an n × p (design) matrix of fixed constants, β is a p × 1 vector of fixed but unknown parameters, and ε is an n × 1 vector of (unobserved) random errors. The model is called a linear model because the mean of the response vector Y is linear in the unknown parameter β.

SCOPE : Several models commonly used in statistics are examples of the general linear

model Y = Xβ + ε. These include, but are not limited to, linear regression models and

analysis of variance (ANOVA) models. Regression models generally refer to those for

which X is full rank, while ANOVA models refer to those for which X consists of zeros

and ones.

GENERAL CLASSES OF LINEAR MODELS :

• Model I: Least squares model: Y = Xβ + ε. This model makes no assumptions on ε. The parameter space is Θ = {β : β ∈ R^p}.

• Model II: Gauss-Markov model: Y = Xβ + ε, where E(ε) = 0 and cov(ε) = σ2I. The parameter space is Θ = {(β, σ2) : (β, σ2) ∈ R^p × R+}.

• Model III: Aitken model: Y = Xβ + ε, where E(ε) = 0 and cov(ε) = σ2V, with V known. The parameter space is Θ = {(β, σ2) : (β, σ2) ∈ R^p × R+}.

• Model IV: General linear mixed model: Y = Xβ + ε, where E(ε) = 0 and cov(ε) = Σ ≡ Σ(θ). The parameter space is Θ = {(β, θ) : (β, θ) ∈ R^p × Ω}, where Ω is the set of all values of θ for which Σ(θ) is positive definite.


GAUSS MARKOV MODEL: Consider the linear model Y = Xβ + ε, where E(ε) = 0

and cov(ε) = σ2I. This model is treated extensively in Chapter 4. We now highlight

special cases of this model.

Example 1.1. One-sample problem. Suppose that Y1, Y2, ..., Yn is an iid sample with

mean µ and variance σ2 > 0. If ε1, ε2, ..., εn are iid with mean E(εi) = 0 and common

variance σ2, we can write the GM model

Y = Xβ + ε,

where

Yn×1 = (Y1, Y2, ..., Yn)', Xn×1 = 1n = (1, 1, ..., 1)', β1×1 = μ, εn×1 = (ε1, ε2, ..., εn)'.

Note that E(ε) = 0 and cov(ε) = σ2I.

Example 1.2. Simple linear regression. Consider the model where a response variable

Y is linearly related to an independent variable x via

Yi = β0 + β1xi + εi,

for i = 1, 2, ..., n, where the εi are uncorrelated random variables with mean 0 and

common variance σ2 > 0. If x1, x2, ..., xn are fixed constants, measured without error,

then this is a GM model Y = Xβ + ε with

Yn×1 = (Y1, Y2, ..., Yn)', Xn×2 =
[ 1 x1 ]
[ 1 x2 ]
[ ...  ]
[ 1 xn ],
β2×1 = (β0, β1)', εn×1 = (ε1, ε2, ..., εn)'.

Note that E(ε) = 0 and cov(ε) = σ2I.

Example 1.3. Multiple linear regression. Suppose that a response variable Y is linearly

related to several independent variables, say, x1, x2, ..., xk via

Yi = β0 + β1xi1 + β2xi2 + · · ·+ βkxik + εi,


for i = 1, 2, ..., n, where εi are uncorrelated random variables with mean 0 and common

variance σ2 > 0. If the independent variables are fixed constants, measured without

error, then this model is a special GM model Y = Xβ + ε where

Y = (Y1, Y2, ..., Yn)', Xn×p =
[ 1 x11 x12 · · · x1k ]
[ 1 x21 x22 · · · x2k ]
[ ...                 ]
[ 1 xn1 xn2 · · · xnk ],
βp×1 = (β0, β1, β2, ..., βk)', ε = (ε1, ε2, ..., εn)',

and p = k + 1. Note that E(ε) = 0 and cov(ε) = σ2I.

Example 1.4. One-way ANOVA. Consider an experiment that is performed to compare

a ≥ 2 treatments. For the ith treatment level, suppose that ni experimental units are

selected at random and assigned to the ith treatment. Consider the model

Yij = µ+ αi + εij,

for i = 1, 2, ..., a and j = 1, 2, ..., ni, where the random errors εij are uncorrelated random

variables with zero mean and common variance σ2 > 0. If the a treatment effects

α1, α2, ..., αa are best regarded as fixed constants, then this model is a special case of the

GM model Y = Xβ + ε. To see this, note that with n = n1 + n2 + · · · + na,

Yn×1 = (Y11, Y12, ..., Yana)', Xn×p =
[ 1n1 1n1 0n1 · · · 0n1 ]
[ 1n2 0n2 1n2 · · · 0n2 ]
[ ...                   ]
[ 1na 0na 0na · · · 1na ],
βp×1 = (μ, α1, α2, ..., αa)',

where p = a + 1 and εn×1 = (ε11, ε12, ..., εana)', and where 1ni is an ni × 1 vector of ones and 0ni is an ni × 1 vector of zeros. Note that E(ε) = 0 and cov(ε) = σ2I.

NOTE: In Example 1.4, note that the first column of X is the sum of the last a columns; i.e., there is a linear dependence in the columns of X. From results in linear algebra, we know that X is not of full column rank. In fact, the rank of X is r = a, one less than the number of columns p = a + 1. This is a common characteristic of ANOVA models; namely, their X matrices are not of full column rank. On the other hand, (linear) regression models are models of the form Y = Xβ + ε, where X is of full column rank; see Examples 1.2 and 1.3.
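As a quick numerical check (a minimal numpy sketch; the group sizes below are illustrative and not from the notes), one can build the one-way ANOVA design matrix and verify that its rank is a rather than p = a + 1:

    import numpy as np

    # Illustrative group sizes for a one-way ANOVA with a = 3 treatments
    n_i = [2, 3, 2]
    a = len(n_i)
    n = sum(n_i)

    # Columns: intercept (all ones), then one indicator column per treatment
    intercept = np.ones((n, 1))
    indicators = np.vstack([
        np.tile(np.eye(a)[i], (n_i[i], 1)) for i in range(a)
    ])
    X = np.hstack([intercept, indicators])     # n x (a + 1) design matrix

    print(X.shape)                    # (7, 4)
    print(np.linalg.matrix_rank(X))   # 3 = a, one less than p = a + 1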

Example 1.5. Two-way nested ANOVA. Consider an experiment with two factors,

where one factor, say, Factor B, is nested within Factor A. In other words, every level

of B appears with exactly one level of Factor A. A statistical model for this situation is

Yijk = µ+ αi + βij + εijk,

for i = 1, 2, ..., a, j = 1, 2, ..., bi, and k = 1, 2, ..., nij. In this model, µ denotes the overall

mean, αi represents the effect due to the ith level of A, and βij represents the effect

of the jth level of B, nested within the ith level of A. If all parameters are fixed, and

the random errors εijk are uncorrelated random variables with zero mean and constant

unknown variance σ2 > 0, then this is a special GM model Y = Xβ + ε. For example,

with a = 3, bi = 2 for each i, and nij = 2, we have Y = (Y111, Y112, Y121, Y122, Y211, Y212, Y221, Y222, Y311, Y312, Y321, Y322)', β = (μ, α1, α2, α3, β11, β12, β21, β22, β31, β32)', and

X =
[ 1 1 0 0 1 0 0 0 0 0 ]
[ 1 1 0 0 1 0 0 0 0 0 ]
[ 1 1 0 0 0 1 0 0 0 0 ]
[ 1 1 0 0 0 1 0 0 0 0 ]
[ 1 0 1 0 0 0 1 0 0 0 ]
[ 1 0 1 0 0 0 1 0 0 0 ]
[ 1 0 1 0 0 0 0 1 0 0 ]
[ 1 0 1 0 0 0 0 1 0 0 ]
[ 1 0 0 1 0 0 0 0 1 0 ]
[ 1 0 0 1 0 0 0 0 1 0 ]
[ 1 0 0 1 0 0 0 0 0 1 ]
[ 1 0 0 1 0 0 0 0 0 1 ],

and ε = (ε111, ε112, ..., ε322)'. Note that E(ε) = 0 and cov(ε) = σ2I. The X matrix is not of full column rank. The rank of X is r = 6 and there are p = 10 columns.


Example 1.6. Two-way crossed ANOVA with interaction. Consider an experiment with

two factors (A and B), where Factor A has a levels and Factor B has b levels. In general,

we say that factors A and B are crossed if every level of A occurs in combination with

every level of B. Consider the two-factor (crossed) ANOVA model given by

Yijk = μ + αi + βj + γij + εijk,

for i = 1, 2, ..., a, j = 1, 2, ..., b, and k = 1, 2, ..., nij, where the random errors εijk are uncorrelated random variables with zero mean and constant unknown variance σ2 > 0. If all the parameters are fixed, this is a special GM model Y = Xβ + ε. For example, with a = 3, b = 2, and nij = 3,

Y = (Y111, Y112, Y113, Y121, Y122, Y123, Y211, ..., Y323)', β = (μ, α1, α2, α3, β1, β2, γ11, γ12, γ21, γ22, γ31, γ32)', and

X =
[ 1 1 0 0 1 0 1 0 0 0 0 0 ]
[ 1 1 0 0 1 0 1 0 0 0 0 0 ]
[ 1 1 0 0 1 0 1 0 0 0 0 0 ]
[ 1 1 0 0 0 1 0 1 0 0 0 0 ]
[ 1 1 0 0 0 1 0 1 0 0 0 0 ]
[ 1 1 0 0 0 1 0 1 0 0 0 0 ]
[ 1 0 1 0 1 0 0 0 1 0 0 0 ]
[ 1 0 1 0 1 0 0 0 1 0 0 0 ]
[ 1 0 1 0 1 0 0 0 1 0 0 0 ]
[ 1 0 1 0 0 1 0 0 0 1 0 0 ]
[ 1 0 1 0 0 1 0 0 0 1 0 0 ]
[ 1 0 1 0 0 1 0 0 0 1 0 0 ]
[ 1 0 0 1 1 0 0 0 0 0 1 0 ]
[ 1 0 0 1 1 0 0 0 0 0 1 0 ]
[ 1 0 0 1 1 0 0 0 0 0 1 0 ]
[ 1 0 0 1 0 1 0 0 0 0 0 1 ]
[ 1 0 0 1 0 1 0 0 0 0 0 1 ]
[ 1 0 0 1 0 1 0 0 0 0 0 1 ],

and ε = (ε111, ε112, ..., ε323)'. Note that E(ε) = 0 and cov(ε) = σ2I. The X matrix is not of full column rank. The rank of X is r = 6 and there are p = 12 columns.


Example 1.7. Two-way crossed ANOVA without interaction. Consider an experiment

with two factors (A and B), where Factor A has a levels and Factor B has b levels. The

two-way crossed model without interaction is given by

Yijk = μ + αi + βj + εijk,

for i = 1, 2, ..., a, j = 1, 2, ..., b, and k = 1, 2, ..., nij, where the random errors εijk are uncorrelated random variables with zero mean and common variance σ2 > 0. Note that the no-interaction model is a special case of the interaction model in Example 1.6 when H0 : γ11 = γ12 = · · · = γ32 = 0 is true. That is, the no-interaction model is a reduced version of the interaction model. With a = 3, b = 2, and nij = 3 as before, we have

Y = (Y111, Y112, Y113, Y121, ..., Y323)', β = (μ, α1, α2, α3, β1, β2)', and

X =
[ 1 1 0 0 1 0 ]
[ 1 1 0 0 1 0 ]
[ 1 1 0 0 1 0 ]
[ 1 1 0 0 0 1 ]
[ 1 1 0 0 0 1 ]
[ 1 1 0 0 0 1 ]
[ 1 0 1 0 1 0 ]
[ 1 0 1 0 1 0 ]
[ 1 0 1 0 1 0 ]
[ 1 0 1 0 0 1 ]
[ 1 0 1 0 0 1 ]
[ 1 0 1 0 0 1 ]
[ 1 0 0 1 1 0 ]
[ 1 0 0 1 1 0 ]
[ 1 0 0 1 1 0 ]
[ 1 0 0 1 0 1 ]
[ 1 0 0 1 0 1 ]
[ 1 0 0 1 0 1 ],

and ε = (ε111, ε112, ..., ε323)'. Note that E(ε) = 0 and cov(ε) = σ2I. The X matrix is not of full column rank. The rank of X is r = 4 and there are p = 6 columns. Also note that the design matrix for the no-interaction model is the same as the design matrix for the interaction model, except that the last 6 columns are removed.

Example 1.8. Analysis of covariance. Consider an experiment to compare a ≥ 2

treatments after adjusting for the effects of a covariate x. A model for the analysis of

covariance is given by

Yij = µ+ αi + βixij + εij,

for i = 1, 2, ..., a, j = 1, 2, ..., ni, where the random errors εij are uncorrelated random

variables with zero mean and common variance σ2 > 0. In this model, µ represents the

overall mean, αi represents the (fixed) effect of receiving the ith treatment (disregarding

the covariates), and βi denotes the slope of the line that relates Y to x for the ith

treatment. Note that this model allows the treatment slopes to be different. The xij’s

are assumed to be fixed values measured without error.

NOTE : The analysis of covariance (ANCOVA) model is a special GM model Y = Xβ+ε.

For example, with a = 3 and n1 = n2 = n3 = 3, we have

Y = (Y11, Y12, Y13, Y21, Y22, Y23, Y31, Y32, Y33)', β = (μ, α1, α2, α3, β1, β2, β3)', ε = (ε11, ε12, ..., ε33)', and

X =
[ 1 1 0 0 x11 0   0   ]
[ 1 1 0 0 x12 0   0   ]
[ 1 1 0 0 x13 0   0   ]
[ 1 0 1 0 0   x21 0   ]
[ 1 0 1 0 0   x22 0   ]
[ 1 0 1 0 0   x23 0   ]
[ 1 0 0 1 0   0   x31 ]
[ 1 0 0 1 0   0   x32 ]
[ 1 0 0 1 0   0   x33 ].

Note that E(ε) = 0 and cov(ε) = σ2I. The X matrix is not of full column rank. If there

are no linear dependencies among the last 3 columns, the rank of X is r = 6 and there

are p = 7 columns.

REDUCED MODEL: Consider the ANCOVA model in Example 1.8 which allows for unequal slopes. If β1 = β2 = · · · = βa, that is, if all slopes are equal, then the ANCOVA model reduces to

Yij = μ + αi + βxij + εij.

That is, the common-slopes ANCOVA model is a reduced version of the model that

allows for different slopes. Assuming the same error structure, this reduced ANCOVA

model is also a special GM model Y = Xβ + ε. With a = 3 and n1 = n2 = n3 = 3, as

before, we have

Y = (Y11, Y12, Y13, Y21, Y22, Y23, Y31, Y32, Y33)', β = (μ, α1, α2, α3, β)', ε = (ε11, ε12, ..., ε33)', and

X =
[ 1 1 0 0 x11 ]
[ 1 1 0 0 x12 ]
[ 1 1 0 0 x13 ]
[ 1 0 1 0 x21 ]
[ 1 0 1 0 x22 ]
[ 1 0 1 0 x23 ]
[ 1 0 0 1 x31 ]
[ 1 0 0 1 x32 ]
[ 1 0 0 1 x33 ].

As long as the xij's are not constant within every treatment group (i.e., at least one group contains two different covariate values), the rank of X is r = 4 and there are p = 5 columns.

GOAL: We now provide examples of linear models of the form Y = Xβ + ε that are not

GM models.

TERMINOLOGY : A factor of classification is said to be random if it has an infinitely

large number of levels and the levels included in the experiment can be viewed as a

random sample from the population of possible levels.

Example 1.9. One-way random effects ANOVA. Consider the model

Yij = µ+ αi + εij,

for i = 1, 2, ..., a and j = 1, 2, ..., ni, where the treatment effects α1, α2, ..., αa are best

regarded as random; e.g., the a levels of the factor of interest are drawn from a large

population of possible levels, and the random errors εij are uncorrelated random variables


with zero mean and common variance σ2 > 0. For concreteness, let a = 4 and ni = 3. The model Y = Xβ + ε looks like

Y = (Y11, Y12, Y13, Y21, Y22, Y23, Y31, Y32, Y33, Y41, Y42, Y43)' = 1_12 μ + Z1ε1 + ε2 = Xβ + Z1ε1 + ε2,

where

Z1 =
[ 13 03 03 03 ]
[ 03 13 03 03 ]
[ 03 03 13 03 ]
[ 03 03 03 13 ],   ε1 = (α1, α2, α3, α4)',   ε2 = (ε11, ε12, ..., ε43)',

and where we identify X = 1_12 and β = μ, so that ε = Z1ε1 + ε2. This is not a GM model because

cov(ε) = cov(Z1ε1 + ε2) = Z1cov(ε1)Z1' + cov(ε2) = Z1cov(ε1)Z1' + σ2I,

provided that the αi's and the errors εij are uncorrelated. Note that cov(ε) ≠ σ2I.
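A minimal numpy sketch of this covariance structure (the variance components below are illustrative values, not from the notes): the induced cov(ε) has within-group correlation and is clearly not σ2I.

    import numpy as np

    # Illustrative variance components: var(alpha_i) and var(eps_ij)
    sigma2_alpha, sigma2 = 2.0, 1.0
    a, n_i = 4, 3

    # Z1 maps the 4 random treatment effects to the 12 observations
    Z1 = np.kron(np.eye(a), np.ones((n_i, 1)))        # 12 x 4 block design

    cov_eps = sigma2_alpha * Z1 @ Z1.T + sigma2 * np.eye(a * n_i)
    print(np.allclose(cov_eps, sigma2 * np.eye(a * n_i)))   # False: not a GM model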

Example 1.10. Two-factor mixed model. Consider an experiment with two factors (A

and B), where Factor A is fixed and has a levels and Factor B is random with b levels.

A statistical model for this situation is given by

Yijk = µ+ αi + βj + εijk,

for i = 1, 2, ..., a, j = 1, 2, ..., b, and k = 1, 2, ..., nij. The αi’s are best regarded as fixed

and the βj’s are best regarded as random. This model assumes no interaction.

APPLICATION: In a randomized block experiment, b blocks may have been selected randomly from a large collection of available blocks. If the goal is to make a statement about the large population of blocks (and not those b blocks in the experiment), then blocks are considered as random. The treatment effects α1, α2, ..., αa are regarded as fixed constants if the a treatments are the only ones of interest.

NOTE : For concreteness, suppose that a = 2, b = 4, and nij = 1. We can write the

model above as

Y = (Y11, Y12, Y13, Y14, Y21, Y22, Y23, Y24)' = Xβ + Z1ε1 + ε2,

where

X =
[ 14 14 04 ]
[ 14 04 14 ],   β = (μ, α1, α2)',

Z1 =
[ I4 ]
[ I4 ],   ε1 = (β1, β2, β3, β4)',   ε2 = (ε11, ε12, ε13, ε14, ε21, ε22, ε23, ε24)'.

NOTE : If the αi’s are best regarded as random as well, then we have

Y = (Y11, Y12, Y13, Y14, Y21, Y22, Y23, Y24)' = 1_8 μ + Z1ε1 + Z2ε2 + ε3,

where

Z1 =
[ 14 04 ]
[ 04 14 ],   ε1 = (α1, α2)',

Z2 =
[ I4 ]
[ I4 ],   ε2 = (β1, β2, β3, β4)',   ε3 = (ε11, ε12, ..., ε24)'.

This model is also known as a random effects or variance component model.

GENERAL FORM : A linear mixed model can be expressed generally as

Y = Xβ + Z1ε1 + Z2ε2 + · · ·+ Zkεk,

where Z1,Z2, ...,Zk are known matrices (typically Zk = Ik) and ε1, ε2, ..., εk are uncorre-

lated random vectors with uncorrelated components.


Example 1.11. Time series models. When measurements are taken over time, the GM

model may not be appropriate because observations are likely correlated. A linear model

of the form Y = Xβ + ε, where E(ε) = 0 and cov(ε) = σ2V, V known, may be more

appropriate. The general form of V is chosen to model the correlation of the observed

responses. For example, consider the statistical model

Yt = β0 + β1t+ εt,

for t = 1, 2, ..., n, where εt = ρεt−1 + at, at ∼ iid N (0, σ2), and |ρ| < 1 (this is a

stationarity condition). This is called a simple linear trend model where the error

process εt : t = 1, 2, ..., n follows an autoregressive model of order 1, AR(1). It is easy

to show that E(εt) = 0, for all t, and that cov(εt, εs) = σ2ρ|t−s|, for all t and s. Therefore,

if n = 5, the covariance matrix of ε is

cov(ε) = σ2V = σ2
[ 1   ρ   ρ2  ρ3  ρ4 ]
[ ρ   1   ρ   ρ2  ρ3 ]
[ ρ2  ρ   1   ρ   ρ2 ]
[ ρ3  ρ2  ρ   1   ρ  ]
[ ρ4  ρ3  ρ2  ρ   1  ].
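A small numpy sketch (with an illustrative value of ρ, not from the notes) that builds this AR(1) covariance structure directly from the rule cov(εt, εs) = σ2 ρ^|t−s|:

    import numpy as np

    rho, sigma2, n = 0.6, 1.0, 5          # illustrative values
    t = np.arange(n)

    # cov(eps_t, eps_s) = sigma^2 * rho^|t - s|
    V = sigma2 * rho ** np.abs(t[:, None] - t[None, :])
    print(np.round(V, 3))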

Example 1.12. Random coefficient models. Suppose that t measurements are taken

(over time) on n individuals and consider the model

Yij = x′ijβi + εij,

for i = 1, 2, ..., n and j = 1, 2, ..., t; that is, the different p × 1 regression parameters βi

are “subject-specific.” If the individuals are considered to be a random sample, then we

can treat β1,β2, ...,βn as iid random vectors with mean β and p× p covariance matrix

Σββ, say. We can write this model as

Yij = x_ij'βi + εij = x_ij'β + {x_ij'(βi − β) + εij},

where the first term x_ij'β is fixed and the term in braces is random. If the βi's are independent of the εij's, note that

var(Yij) = x_ij'Σββx_ij + σ2 ≠ σ2.


Example 1.13. Measurement error models. Consider the statistical model

Yi = β0 + β1Xi + εi,

where εi ∼ iid N (0, σ2ε ). The Xi’s are not observed exactly; instead, they are measured

with non-negligible error so that

Wi = Xi + Ui,

where Ui ∼ iid N(0, σ2U). Here,

Observed data: (Yi, Wi)
Not observed: (Xi, εi, Ui)
Unknown parameters: (β0, β1, σ2ε, σ2U).

As a frame of reference, suppose that Y is a continuous measurement of lung function

in small children and that X denotes the long-term exposure to NO2. It is unlikely that

X can be measured exactly; instead, the surrogate W , the amount of NO2 recorded at a

clinic visit, is more likely to be observed. Note that the model above can be rewritten as

Yi = β0 + β1(Wi − Ui) + εi = β0 + β1Wi + (εi − β1Ui) ≡ β0 + β1Wi + ε*i.

Because the Wi’s are not fixed in advance, we would at least need E(ε∗i |Wi) = 0 for this

to be a GM linear model. However, note that

E(ε∗i |Wi) = E(εi − β1Ui|Xi + Ui)

= E(εi|Xi + Ui)− β1E(Ui|Xi + Ui).

The first term is zero if εi is independent of both Xi and Ui. The second term generally

is not zero (unless β1 = 0, of course) because Ui and Xi + Ui are correlated. Therefore,

this can not be a GM model.


2 The Linear Least Squares Problem

Complementary reading from Monahan: Chapter 2 (except Section 2.4).

INTRODUCTION : Consider the general linear model

Y = Xβ + ε,

where Y is an n×1 vector of observed responses, X is an n×p matrix of fixed constants,

β is a p× 1 vector of fixed but unknown parameters, and ε is an n× 1 vector of random

errors. If E(ε) = 0, then

E(Y) = E(Xβ + ε) = Xβ.

Since β is unknown, all we really know is that E(Y) = Xβ ∈ C(X). To estimate E(Y),

it seems natural to take the vector in C(X) that is closest to Y.

2.1 Least squares estimation

DEFINITION: An estimate β̂ is a least squares estimate of β if Xβ̂ is the vector in C(X) that is closest to Y. In other words, β̂ is a least squares estimate of β if

β̂ = arg min over β ∈ R^p of (Y − Xβ)'(Y − Xβ).

LEAST SQUARES: Let β = (β1, β2, ..., βp)' and define the error sum of squares

Q(β) = (Y − Xβ)'(Y − Xβ),

the squared distance from Y to Xβ. The point where Q(β) is minimized satisfies

∂Q(β)/∂β = 0; that is, ∂Q(β)/∂βj = 0 for each j = 1, 2, ..., p.

This minimization problem can be tackled either algebraically or geometrically.


Result 2.1. Let a and b be p × 1 vectors and A be a p × p matrix of constants. Then

∂(a'b)/∂b = a   and   ∂(b'Ab)/∂b = (A + A')b.

Proof. See Monahan, p. 14.

NOTE: In Result 2.1, note that ∂(b'Ab)/∂b = 2Ab if A is symmetric.

NORMAL EQUATIONS: Simple calculations show that

Q(β) = (Y − Xβ)'(Y − Xβ) = Y'Y − 2Y'Xβ + β'X'Xβ.

Using Result 2.1, we have

∂Q(β)/∂β = −2X'Y + 2X'Xβ,

because X'X is symmetric. Setting this expression equal to 0 and rearranging gives

X'Xβ = X'Y.

These are the normal equations. If X'X is nonsingular, then the unique least squares estimator of β is

β̂ = (X'X)−1X'Y.

When X'X is singular, which can happen in ANOVA models (see Chapter 1), there can be multiple solutions to the normal equations. Having already proved algebraically that the normal equations are consistent, we know that the general form of the least squares solution is

β̂ = (X'X)−X'Y + [I − (X'X)−X'X]z,

for z ∈ R^p, where (X'X)− is a generalized inverse of X'X.
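As a small numerical illustration (a numpy sketch with made-up data), a rank-deficient X'X can be handled through the normal equations using a generalized inverse; the resulting solution still satisfies X'Xβ̂ = X'Y even though it is not unique.

    import numpy as np

    rng = np.random.default_rng(0)

    # Rank-deficient one-way ANOVA design: intercept plus two group indicators
    X = np.array([[1., 1., 0.],
                  [1., 1., 0.],
                  [1., 0., 1.],
                  [1., 0., 1.]])
    Y = rng.normal(size=4)

    XtX, XtY = X.T @ X, X.T @ Y
    G = np.linalg.pinv(XtX)          # one particular generalized inverse of X'X
    beta_hat = G @ XtY               # one solution to the normal equations

    # Any solution satisfies X'X beta = X'Y, even though beta_hat is not unique
    print(np.allclose(XtX @ beta_hat, XtY))   # True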


2.2 Geometric considerations

CONSISTENCY : Recall that a linear system Ax = c is consistent if there exists an x∗

such that Ax∗ = c; that is, if c ∈ C(A). Applying this definition to

X′Xβ = X′Y,

the normal equations are consistent if X′Y ∈ C(X′X). Clearly, X′Y ∈ C(X′). Thus, we’ll

be able to establish consistency (geometrically) if we can show that C(X′X) = C(X′).

Result 2.2. N (X′X) = N (X).

Proof. Suppose that w ∈ N (X). Then Xw = 0 and X′Xw = 0 so that w ∈ N (X′X).

Suppose that w ∈ N (X′X). Then X′Xw = 0 and w′X′Xw = 0. Thus, ||Xw||2 = 0

which implies that Xw = 0; i.e., w ∈ N (X).

Result 2.3. Suppose that S1 and T1 are orthogonal complements, as well as S2 and T2.

If S1 ⊆ S2, then T2 ⊆ T1.

Proof. See Monahan, pp 244.

CONSISTENCY : We use the previous two results to show that C(X′X) = C(X′). Take

S1 = N (X′X), T1 = C(X′X), S2 = N (X), and T2 = C(X′). We know that S1 and

T1 (S2 and T2) are orthogonal complements. Because N (X′X) ⊆ N (X), the last result

guarantees C(X′) ⊆ C(X′X). But, C(X′X) ⊆ C(X′) trivially, so we’re done. Note also

C(X′X) = C(X′) =⇒ r(X′X) = r(X′) = r(X).

NOTE : We now state a result that characterizes all solutions to the normal equations.

Result 2.4. Q(β) = (Y − Xβ)'(Y − Xβ) is minimized at β̂ if and only if β̂ is a solution to the normal equations.

Proof. (⇐=) Suppose that β̂ is a solution to the normal equations. Then,

Q(β) = (Y − Xβ)'(Y − Xβ)
     = (Y − Xβ̂ + Xβ̂ − Xβ)'(Y − Xβ̂ + Xβ̂ − Xβ)
     = (Y − Xβ̂)'(Y − Xβ̂) + (Xβ̂ − Xβ)'(Xβ̂ − Xβ),

since the cross product term 2(Xβ̂ − Xβ)'(Y − Xβ̂) = 0; verify this using the fact that β̂ solves the normal equations. Thus, we have shown that Q(β) = Q(β̂) + z'z, where z = Xβ̂ − Xβ. Therefore, Q(β) ≥ Q(β̂) for all β and, hence, β̂ minimizes Q(β). (=⇒) Now, suppose that β̃ minimizes Q(β). We already know that Q(β̃) ≥ Q(β̂), where β̂ = (X'X)−X'Y, but also Q(β̃) ≤ Q(β̂) because β̃ minimizes Q(β). Thus, Q(β̃) = Q(β̂). But because Q(β̃) = Q(β̂) + z'z, where z = Xβ̂ − Xβ̃, it must be true that z = Xβ̂ − Xβ̃ = 0; that is, Xβ̃ = Xβ̂. Thus,

X'Xβ̃ = X'Xβ̂ = X'Y,

since β̂ is a solution to the normal equations. This shows that β̃ is also a solution to the normal equations.

INVARIANCE: In proving the last result, we have discovered a very important fact; namely, if β̂ and β̃ both solve the normal equations, then Xβ̂ = Xβ̃. In other words, Xβ̂ is invariant to the choice of least squares solution.

NOTE : The following result ties least squares estimation to the notion of a perpendicular

projection matrix. It also produces a general formula for the matrix.

Result 2.5. An estimate β̂ is a least squares estimate if and only if Xβ̂ = MY, where M is the perpendicular projection matrix onto C(X).

Proof. We will show that

(Y − Xβ)'(Y − Xβ) = (Y − MY)'(Y − MY) + (MY − Xβ)'(MY − Xβ).

Both terms on the right hand side are nonnegative, and the first term does not involve β. Thus, (Y − Xβ)'(Y − Xβ) is minimized by minimizing (MY − Xβ)'(MY − Xβ), the squared distance between MY and Xβ. This distance is zero if and only if MY = Xβ, which proves the result. Now to show the above equation:

(Y − Xβ)'(Y − Xβ) = (Y − MY + MY − Xβ)'(Y − MY + MY − Xβ)
                  = (Y − MY)'(Y − MY) + (Y − MY)'(MY − Xβ)   (*)
                    + (MY − Xβ)'(Y − MY)                     (**)
                    + (MY − Xβ)'(MY − Xβ).

It suffices to show that (*) and (**) are zero. To show that (*) is zero, note that

(Y − MY)'(MY − Xβ) = Y'(I − M)(MY − Xβ) = [(I − M)Y]'(MY − Xβ) = 0,

because (I − M)Y ∈ N(X') and MY − Xβ ∈ C(X). Similarly, (**) = 0 as well.

Result 2.6. The perpendicular projection matrix onto C(X) is given by

M = X(X′X)−X′.

Proof. We know that β̂ = (X'X)−X'Y is a solution to the normal equations, so it is a least squares estimate. But, by Result 2.5, we know Xβ̂ = MY. Because perpendicular projection matrices are unique, M = X(X'X)−X' as claimed.

NOTATION : Monahan uses PX to denote the perpendicular projection matrix onto

C(X). We will henceforth do the same; that is,

PX = X(X′X)−X′.
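A quick numerical check (a numpy sketch on a made-up rank-deficient design) that PX = X(X'X)−X' is symmetric, idempotent, and reproduces the columns of X:

    import numpy as np

    X = np.array([[1., 1., 0.],
                  [1., 1., 0.],
                  [1., 0., 1.],
                  [1., 0., 1.]])   # rank 2 < 3 columns

    PX = X @ np.linalg.pinv(X.T @ X) @ X.T   # X (X'X)^- X' with one generalized inverse

    print(np.allclose(PX, PX.T))       # symmetric
    print(np.allclose(PX @ PX, PX))    # idempotent
    print(np.allclose(PX @ X, X))      # fixes every vector in C(X)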

PROPERTIES : Let PX denote the perpendicular projection matrix onto C(X). Then

(a) PX is idempotent

(b) PX projects onto C(X)

(c) PX is invariant to the choice of (X′X)−

(d) PX is symmetric

(e) PX is unique.

We have already proven (a), (b), (d), and (e); see Matrix Algebra Review 5. Part (c) must

be true; otherwise, part (e) would not hold. However, we can prove (c) more rigorously.

Result 2.7. If (X'X)₁⁻ and (X'X)₂⁻ are generalized inverses of X'X, then

1. X(X'X)₁⁻X'X = X(X'X)₂⁻X'X = X

2. X(X'X)₁⁻X' = X(X'X)₂⁻X'.

Proof. For v ∈ R^n, let v = v1 + v2, where v1 ∈ C(X) and v2 ⊥ C(X). Since v1 ∈ C(X), we know that v1 = Xd, for some vector d. Then,

v'X(X'X)₁⁻X'X = v1'X(X'X)₁⁻X'X = d'X'X(X'X)₁⁻X'X = d'X'X = v'X,

since v2 ⊥ C(X). Since v and (X'X)₁⁻ were arbitrary, we have shown the first part. To show the second part, note that

X(X'X)₁⁻X'v = X(X'X)₁⁻X'Xd = X(X'X)₂⁻X'Xd = X(X'X)₂⁻X'v.

Since v is arbitrary, the second part follows as well.

Result 2.8. Suppose X is n × p with rank r ≤ p, and let PX be the perpendicular

projection matrix onto C(X). Then r(PX) = r(X) = r and r(I−PX) = n− r.

Proof. Note that PX is n×n. We know that C(PX) = C(X), so the first part is obvious.

To show the second part, recall that I−PX is the perpendicular projection matrix onto

N (X′), so it is idempotent. Thus,

r(I−PX) = tr(I−PX) = tr(I)− tr(PX) = n− r(PX) = n− r,

because the trace operator is linear and because PX is idempotent as well.

SUMMARY: Consider the linear model Y = Xβ + ε, where E(ε) = 0; in what follows, the cov(ε) = σ2I assumption is not needed. We have shown that a least squares estimate of β is given by

β̂ = (X'X)−X'Y.

This solution is not unique (unless X'X is nonsingular). However,

PXY = Xβ̂ ≡ Ŷ

is unique. We call Ŷ the vector of fitted values. Geometrically, Ŷ is the point in C(X) that is closest to Y. Now, recall that I − PX is the perpendicular projection matrix onto N(X'). Note that

(I − PX)Y = Y − PXY = Y − Ŷ ≡ ê.

We call ê the vector of residuals. Note that ê ∈ N(X'). Because C(X) and N(X') are orthogonal complements, we know that Y can be uniquely decomposed as

Y = Ŷ + ê.

We also know that Ŷ and ê are orthogonal vectors. Finally, note that

Y'Y = Y'IY = Y'(PX + I − PX)Y
    = Y'PXY + Y'(I − PX)Y
    = Y'PXPXY + Y'(I − PX)(I − PX)Y
    = Ŷ'Ŷ + ê'ê,

since PX and I − PX are both symmetric and idempotent; i.e., they are both perpendicular projection matrices (but onto orthogonal spaces). This orthogonal decomposition of Y'Y is often given in a tabular display called an analysis of variance (ANOVA) table.

ANOVA TABLE: Suppose that Y is n × 1, X is n × p with rank r ≤ p, β is p × 1, and ε is n × 1. An ANOVA table looks like

Source      df      SS
Model       r       Ŷ'Ŷ = Y'PXY
Residual    n − r   ê'ê = Y'(I − PX)Y
Total       n       Y'Y = Y'IY

It is interesting to note that the sum of squares column, abbreviated "SS," catalogues 3 quadratic forms, Y'PXY, Y'(I − PX)Y, and Y'IY. The degrees of freedom column, abbreviated "df," catalogues the ranks of the associated quadratic form matrices; i.e.,

r(PX) = r
r(I − PX) = n − r
r(I) = n.

The quantity Y'PXY is called the (uncorrected) model sum of squares, Y'(I − PX)Y is called the residual sum of squares, and Y'Y is called the (uncorrected) total sum of squares.
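A short numpy check of this decomposition (made-up regression data): the model and residual sums of squares add up to the total sum of squares.

    import numpy as np

    rng = np.random.default_rng(1)
    X = np.column_stack([np.ones(6), np.arange(6.)])   # simple regression design
    Y = X @ np.array([2.0, 0.5]) + rng.normal(size=6)

    PX = X @ np.linalg.pinv(X.T @ X) @ X.T
    ss_model = Y @ PX @ Y
    ss_resid = Y @ (np.eye(6) - PX) @ Y
    print(np.allclose(ss_model + ss_resid, Y @ Y))   # True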


NOTE : The following “visualization” analogy is taken liberally from Christensen (2002).

VISUALIZATION : One can think about the geometry of least squares estimation in

three dimensions (i.e., when n = 3). Consider your kitchen table and take one corner of

the table to be the origin. Take C(X) as the two dimensional subspace determined by the

surface of the table, and let Y be any vector originating at the origin; i.e., any point in

R3. The linear model says that E(Y) = Xβ, which just says that E(Y) is somewhere on

the table. The least squares estimate Y = Xβ = PXY is the perpendicular projection

of Y onto the surface of the table. The residual vector e = (I − PX)Y is the vector

starting at the origin, perpendicular to the surface of the table, that reaches the same

height as Y. Another way to think of the residual vector is to first connect Y and

PXY with a line segment (that is perpendicular to the surface of the table). Then,

shift the line segment along the surface (keeping it perpendicular) until the line segment

has one end at the origin. The residual vector e is the perpendicular projection of Y

onto C(I−PX) = N (X′); that is, the projection onto the orthogonal complement of the

table surface. The orthogonal complement C(I − PX) is the one-dimensional space in

the vertical direction that goes through the origin. Once you have these vectors in place,

sums of squares arise from Pythagorean’s Theorem.

A SIMPLE PPM : Suppose Y1, Y2, ..., Yn are iid with mean E(Yi) = µ. In terms of the

general linear model, we can write Y = Xβ + ε, where

Y = (Y1, Y2, ..., Yn)', X = 1 = (1, 1, ..., 1)', β = μ, ε = (ε1, ε2, ..., εn)'.

The perpendicular projection matrix onto C(X) is given by

P1 = 1(1'1)−1' = n−1 11' = n−1 J,

where J is the n × n matrix of ones. Note that

P1Y = n−1 JY = Ȳ1,

where Ȳ = n−1 Σ_{i=1}^n Yi. The perpendicular projection matrix P1 projects Y onto the space

C(P1) = {z ∈ R^n : z = (a, a, ..., a)'; a ∈ R}.

Note that r(P1) = 1. Note also that

(I − P1)Y = Y − P1Y = Y − Ȳ1 = (Y1 − Ȳ, Y2 − Ȳ, ..., Yn − Ȳ)',

the vector which contains the deviations from the mean. The perpendicular projection matrix I − P1 projects Y onto

C(I − P1) = {z ∈ R^n : z = (a1, a2, ..., an)'; ai ∈ R, Σ_{i=1}^n ai = 0}.

Note that r(I − P1) = n − 1.
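A small numpy sketch (arbitrary data) of P1 = J/n and its complement: P1Y replicates the sample mean, and (I − P1)Y gives the deviations, which sum to zero.

    import numpy as np

    Y = np.array([3.0, 5.0, 4.0, 8.0])
    n = len(Y)

    P1 = np.ones((n, n)) / n          # perpendicular projection onto C(1)
    print(P1 @ Y)                     # every entry equals the sample mean 5.0
    print((np.eye(n) - P1) @ Y)       # deviations from the mean
    print(np.isclose(((np.eye(n) - P1) @ Y).sum(), 0.0))   # True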

REMARK : The matrix P1 plays an important role in linear models, and here is why.

Most linear models, when written out in non-matrix notation, contain an intercept

term. For example, in simple linear regression,

Yi = β0 + β1xi + εi,

or in ANOVA-type models like

Yijk = µ+ αi + βj + γij + εijk,

the intercept terms are β0 and µ, respectively. In the corresponding design matrices, the

first column of X is 1. If we discard the “other” terms like β1xi and αi + βj + γij in the

models above, then we have a reduced model of the form Yi = µ + εi; that is, a model

that relates Yi to its overall mean, or, in matrix notation, Y = 1μ + ε. The perpendicular projection matrix onto C(1) is P1 and

Y'P1Y = Y'P1P1Y = (P1Y)'(P1Y) = nȲ².


This is the model sum of squares for the model Yi = µ+ εi; that is, Y′P1Y is the sum of

squares that arises from fitting the overall mean µ. Now, consider a general linear model

of the form Y = Xβ + ε, where E(ε) = 0, and suppose that the first column of X is 1.

In general, we know that

Y′Y = Y′IY = Y′PXY + Y′(I−PX)Y.

Subtracting Y′P1Y from both sides, we get

Y′(I−P1)Y = Y′(PX −P1)Y + Y′(I−PX)Y.

The quantity Y'(I − P1)Y is called the corrected total sum of squares and the quantity Y'(PX − P1)Y is called the corrected model sum of squares. The term "corrected" is understood to mean that we have removed the effects of "fitting the mean." This is important because this is the sum of squares breakdown that is commonly used; i.e.,

Source               df      SS
Model (Corrected)    r − 1   Y'(PX − P1)Y
Residual             n − r   Y'(I − PX)Y
Total (Corrected)    n − 1   Y'(I − P1)Y

In ANOVA models, the corrected model sum of squares Y'(PX − P1)Y is often broken down further into smaller components which correspond to different parts; e.g., orthogonal contrasts, main effects, interaction terms, etc. Finally, the degrees of freedom are simply the corresponding ranks of PX − P1, I − PX, and I − P1.

NOTE: In the general linear model Y = Xβ + ε, the residual vector from the least squares fit ê = (I − PX)Y ∈ N(X'), so ê'X = 0; that is, the residuals in a least squares fit are orthogonal to the columns of X, since the columns of X are in C(X). Note that if 1 ∈ C(X), which is true of all linear models with an intercept term, then

ê'1 = Σ_{i=1}^n êi = 0;

that is, the sum of the residuals from a least squares fit is zero. This is not necessarily true of models for which 1 ∉ C(X).


Result 2.9. If C(W) ⊂ C(X), then PX − PW is the perpendicular projection matrix

onto C[(I−PW)X].

Proof. It suffices to show that (a) PX −PW is symmetric and idempotent and that (b)

C(PX − PW) = C[(I − PW)X]. First note that PXPW = PW because the columns of

PW are in C(W) ⊂ C(X). By symmetry, PWPX = PW. Now,

(PX − PW)(PX − PW) = PX² − PXPW − PWPX + PW² = PX − PW − PW + PW = PX − PW.

Thus, PX−PW is idempotent. Also, (PX−PW)′ = P′X−P′W = PX−PW, so PX−PW

is symmetric. Thus, PX −PW is a perpendicular projection matrix onto C(PX −PW).

Suppose that v ∈ C(PX −PW); i.e., v = (PX −PW)d, for some d. Write d = d1 + d2,

where d1 ∈ C(X) and d2 ∈ N (X′); that is, d1 = Xa, for some a, and X′d2 = 0. Then,

v = (PX −PW)(d1 + d2)

= (PX −PW)(Xa + d2)

= PXXa + PXd2 −PWXa−PWd2

= Xa + 0−PWXa− 0

= (I−PW)Xa ∈ C[(I−PW)X].

Thus, C(PX − PW) ⊆ C[(I − PW)X]. Now, suppose that w ∈ C[(I − PW)X]. Then

w = (I−PW)Xc, for some c. Thus,

w = Xc−PWXc = PXXc−PWXc = (PX −PW)Xc ∈ C(PX −PW).

This shows that C[(I−PW)X] ⊆ C(PX −PW).

TERMINOLOGY : Suppose that V is a vector space and that S is a subspace of V ; i.e.,

S ⊂ V . The subspace

S⊥_V = {z ∈ V : z ⊥ S}

is called the orthogonal complement of S with respect to V. If V = R^n, then S⊥_V = S⊥ is simply referred to as the orthogonal complement of S.


Result 2.10. If C(W) ⊂ C(X), then C(PX − PW) = C[(I − PW)X] is the orthogonal complement of C(PW) with respect to C(PX); that is,

C(PX − PW) = C(PW)⊥_C(PX).

Proof. C(PX − PW) ⊥ C(PW) because (PX − PW)PW = PXPW − PW² = PW − PW = 0. Because C(PX − PW) ⊂ C(PX), C(PX − PW) is contained in the orthogonal complement of C(PW) with respect to C(PX). Now suppose that v ∈ C(PX) and v ⊥ C(PW). Then,

v = PXv = (PX − PW)v + PWv = (PX − PW)v ∈ C(PX − PW),

showing that the orthogonal complement of C(PW) with respect to C(PX) is contained in C(PX − PW).

REMARK : The preceding two results are important for hypothesis testing in linear

models. Consider the linear models

Y = Xβ + ε and Y = Wγ + ε,

where C(W) ⊂ C(X). As we will learn later, the condition C(W) ⊂ C(X) implies that

Y = Wγ + ε is a reduced model when compared to Y = Xβ + ε, sometimes called

the full model. If E(ε) = 0, then, if the full model is correct,

E(PXY) = PXE(Y) = PXXβ = Xβ ∈ C(X).

Similarly, if the reduced model is correct, E(PWY) = Wγ ∈ C(W). Note that if

the reduced model Y = Wγ + ε is correct, then the full model Y = Xβ + ε is also

correct since C(W) ⊂ C(X). Thus, if the reduced model is correct, PXY and PWY

are attempting to estimate the same thing and their difference (PX −PW)Y should be

small. On the other hand, if the reduced model is not correct, but the full model is, then

PXY and PWY are estimating different things and one would expect (PX − PW)Y to

be large. The question about whether or not to “accept” the reduced model as plausible

thus hinges on deciding whether or not (PX −PW)Y, the (perpendicular) projection of

Y onto C(PX −PW) = C(PW)⊥C(PX), is large or small.


2.3 Reparameterization

REMARK : For estimation in the general linear model Y = Xβ + ε, where E(ε) = 0,

we can only learn about β through Xβ ∈ C(X). Thus, the crucial item needed is

PX, the perpendicular projection matrix onto C(X). For convenience, we call C(X) the

estimation space. PX is the perpendicular projection matrix onto the estimation space.

We call N (X′) the error space. I−PX is the perpendicular projection matrix onto the

error space.

IMPORTANT : Any two linear models with the same estimation space are really the

same model; the models are said to be reparameterizations of each other. Any two

such models will give the same predicted values, the same residuals, the same ANOVA

table, etc. In particular, suppose that we have two linear models:

Y = Xβ + ε and Y = Wγ + ε.

If C(X) = C(W), then PX does not depend on which of X or W is used; it depends only

on C(X) = C(W). As we will find out, the least-squares estimate of E(Y) is

Ŷ = PXY = Xβ̂ = Wγ̂.

IMPLICATION : The β parameters in the model Y = Xβ + ε, where E(ε) = 0, are

not really all that crucial. Because of this, it is standard to reparameterize linear models

(i.e., change the parameters) to exploit computational advantages, as we will soon see.

The essence of the model is that E(Y) ∈ C(X). As long as we do not change C(X), the

design matrix X and the corresponding model parameters can be altered in a manner

suitable to our liking.

EXAMPLE : Recall the simple linear regression model from Chapter 1 given by

Yi = β0 + β1xi + εi,

for i = 1, 2, ..., n. Although not critical for this discussion, we will assume that ε1, ε2, ..., εn

are uncorrelated random variables with mean 0 and common variance σ2 > 0. Recall


that, in matrix notation,

Y = (Y1, Y2, ..., Yn)', X =
[ 1 x1 ]
[ 1 x2 ]
[ ...  ]
[ 1 xn ],
β = (β0, β1)', ε = (ε1, ε2, ..., εn)'.

As long as (x1, x2, ..., xn)' is not a multiple of 1n and at least one xi ≠ 0, then r(X) = 2 and (X'X)−1 exists. Straightforward calculations show that

X'X =
[ n      Σi xi  ]
[ Σi xi  Σi xi² ],

(X'X)−1 =
[ 1/n + x̄²/Σi(xi − x̄)²    −x̄/Σi(xi − x̄)² ]
[ −x̄/Σi(xi − x̄)²           1/Σi(xi − x̄)² ],

and

X'Y =
[ Σi Yi   ]
[ Σi xiYi ].

Thus, the (unique) least squares estimator is given by

β̂ = (X'X)−1X'Y = (β̂0, β̂1)' = ( Ȳ − β̂1x̄,  Σi(xi − x̄)(Yi − Ȳ)/Σi(xi − x̄)² )'.

For the simple linear regression model, it can be shown (verify!) that the perpendicular projection matrix PX = X(X'X)−1X' has (i, j)th entry

(PX)ij = 1/n + (xi − x̄)(xj − x̄)/Σk(xk − x̄)²,

for i, j = 1, 2, ..., n.

A reparameterization of the simple linear regression model Yi = β0 + β1xi + εi is

Yi = γ0 + γ1(xi − x̄) + εi,

or Y = Wγ + ε, where

Y = (Y1, Y2, ..., Yn)', W =
[ 1 x1 − x̄ ]
[ 1 x2 − x̄ ]
[ ...      ]
[ 1 xn − x̄ ],
γ = (γ0, γ1)', ε = (ε1, ε2, ..., εn)'.


To see why this is a reparameterized model, note that if we define

U =
[ 1 −x̄ ]
[ 0  1 ],

then W = XU and X = WU−1 (verify!) so that C(X) = C(W). Moreover, E(Y) = Xβ = Wγ = XUγ. Taking P' = (X'X)−1X' leads to β = P'Xβ = P'XUγ = Uγ; i.e.,

β = (β0, β1)' = (γ0 − γ1x̄, γ1)' = Uγ.

To find the least-squares estimator for γ in the reparameterized model, observe that

W'W =
[ n  0            ]
[ 0  Σi(xi − x̄)² ]

and

(W'W)−1 =
[ 1/n  0               ]
[ 0    1/Σi(xi − x̄)²  ].

Note that (W'W)−1 is diagonal; this is one of the benefits to working with this parameterization. The least squares estimator of γ is given by

γ̂ = (W'W)−1W'Y = (γ̂0, γ̂1)' = ( Ȳ,  Σi(xi − x̄)(Yi − Ȳ)/Σi(xi − x̄)² )',

which is different than β̂. However, it can be shown directly (verify!) that the perpendicular projection matrix onto C(W),

PW = W(W'W)−1W',

has exactly the same entries as PX; that is, PW = PX. Thus, the fitted values will be the same; i.e., Ŷ = PXY = Xβ̂ = Wγ̂ = PWY, and the analysis will be the same under both parameterizations.
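A numerical sanity check of this reparameterization (a numpy sketch with made-up x values): centering the covariate changes the coefficient estimates but not the projection matrix or the fitted values.

    import numpy as np

    rng = np.random.default_rng(2)
    x = np.array([1.0, 2.0, 4.0, 7.0])
    Y = 1.5 + 0.8 * x + rng.normal(size=4)

    X = np.column_stack([np.ones(4), x])             # original parameterization
    W = np.column_stack([np.ones(4), x - x.mean()])  # centered parameterization

    PX = X @ np.linalg.inv(X.T @ X) @ X.T
    PW = W @ np.linalg.inv(W.T @ W) @ W.T
    print(np.allclose(PX, PW))            # True: same estimation space C(X) = C(W)
    print(np.allclose(PX @ Y, PW @ Y))    # True: identical fitted values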

Exercise: Show that the one way fixed effects ANOVA model Yij = µ + αi + εij, for

i = 1, 2, ..., a and j = 1, 2, ..., ni, and the cell means model Yij = µi + εij are reparameter-

izations of each other. Does one parameterization confer advantages over the other?


3 Estimability and Least Squares Estimators

Complementary reading from Monahan: Chapter 3 (except Section 3.9).

3.1 Introduction

REMARK : Estimability is one of the most important concepts in linear models. Consider

the general linear model

Y = Xβ + ε,

where E(ε) = 0. In our discussion that follows, the assumption cov(ε) = σ2I is not

needed. Suppose that X is n × p with rank r ≤ p. If r = p (as in regression models), then estimability concerns vanish, as β is estimated uniquely by β̂ = (X'X)−1X'Y. If r < p (a common characteristic of ANOVA models), then β cannot be estimated uniquely. However, even if β is not estimable, certain functions of β may be estimable.

3.2 Estimability

DEFINITIONS :

1. An estimator t(Y) is said to be unbiased for λ'β iff E{t(Y)} = λ'β, for all β.

2. An estimator t(Y) is said to be a linear estimator in Y iff t(Y) = c + a'Y, for c ∈ R and a = (a1, a2, ..., an)', ai ∈ R.

3. A function λ'β is said to be (linearly) estimable iff there exists a linear unbiased estimator for it. Otherwise, λ'β is nonestimable.

Result 3.1. Under the model assumptions Y = Xβ + ε, where E(ε) = 0, a linear

function λ′β is estimable iff there exists a vector a such that λ′ = a′X; that is, λ′ ∈ R(X).

Proof. (⇐=) Suppose that there exists a vector a such that λ′ = a′X. Then, E(a′Y) =

a′Xβ = λ′β, for all β. Therefore, a′Y is a linear unbiased estimator of λ′β and hence


λ′β is estimable. (=⇒) Suppose that λ′β is estimable. Then, there exists an estimator

c+a′Y that is unbiased for it; that is, E(c+a′Y) = λ′β, for all β. Note that E(c+a′Y) =

c + a′Xβ, so λ′β = c + a′Xβ, for all β. Taking β = 0 shows that c = 0. Successively

taking β to be the standard unit vectors convinces us that λ′ = a′X; i.e., λ′ ∈ R(X).

Example 3.1. Consider the one-way fixed effects ANOVA model

Yij = µ+ αi + εij,

for i = 1, 2, ..., a and j = 1, 2, ..., ni, where E(εij) = 0. Take a = 3 and ni = 2 so that

Y = (Y11, Y12, Y21, Y22, Y31, Y32)', X =
[ 1 1 0 0 ]
[ 1 1 0 0 ]
[ 1 0 1 0 ]
[ 1 0 1 0 ]
[ 1 0 0 1 ]
[ 1 0 0 1 ],
and β = (μ, α1, α2, α3)'.

Note that r(X) = 3, so X is not of full rank; i.e., β is not uniquely estimable. Consider

the following parametric functions λ′β:

Parameter                       λ'                          λ' ∈ R(X)?   Estimable?
λ1'β = μ                        λ1' = (1, 0, 0, 0)          no           no
λ2'β = α1                       λ2' = (0, 1, 0, 0)          no           no
λ3'β = μ + α1                   λ3' = (1, 1, 0, 0)          yes          yes
λ4'β = α1 − α2                  λ4' = (0, 1, −1, 0)         yes          yes
λ5'β = α1 − (α2 + α3)/2         λ5' = (0, 1, −1/2, −1/2)    yes          yes

Because λ3'β = μ + α1, λ4'β = α1 − α2, and λ5'β = α1 − (α2 + α3)/2 are (linearly) estimable, there must exist linear unbiased estimators for them. Note that

E(Ȳ1+) = E{(Y11 + Y12)/2} = (μ + α1)/2 + (μ + α1)/2 = μ + α1 = λ3'β,

and that Ȳ1+ = c + a'Y, where c = 0 and a' = (1/2, 1/2, 0, 0, 0, 0). Also,

E(Ȳ1+ − Ȳ2+) = (μ + α1) − (μ + α2) = α1 − α2 = λ4'β,

and Ȳ1+ − Ȳ2+ = c + a'Y, where c = 0 and a' = (1/2, 1/2, −1/2, −1/2, 0, 0). Finally,

E{Ȳ1+ − (Ȳ2+ + Ȳ3+)/2} = (μ + α1) − (1/2){(μ + α2) + (μ + α3)} = α1 − (α2 + α3)/2 = λ5'β.

Note that

Ȳ1+ − (Ȳ2+ + Ȳ3+)/2 = c + a'Y,

where c = 0 and a' = (1/2, 1/2, −1/4, −1/4, −1/4, −1/4).

REMARKS:

1. The elements of the vector Xβ are estimable.

2. If λ1'β, λ2'β, ..., λk'β are estimable, then any linear combination of them, i.e., Σ_{i=1}^k di λi'β, where di ∈ R, is also estimable.

3. If X is n × p and r(X) = p, then R(X) = R^p and λ'β is estimable for all λ.

DEFINITION: Linear functions λ1'β, λ2'β, ..., λk'β are said to be linearly independent if λ1, λ2, ..., λk comprise a set of linearly independent vectors; i.e., Λ = (λ1 λ2 · · · λk) has rank k.

Result 3.2. Under the model assumptions Y = Xβ + ε, where E(ε) = 0, we can always find r = r(X) linearly independent estimable functions. Moreover, no collection of estimable functions can contain more than r linearly independent functions.

Proof. Let ζi' denote the ith row of X, for i = 1, 2, ..., n. Clearly, ζ1'β, ζ2'β, ..., ζn'β are estimable. Because r(X) = r, we can select r linearly independent rows of X; the corresponding r functions ζi'β are linearly independent. Now, let Λ'β = (λ1'β, λ2'β, ..., λk'β)' be any collection of estimable functions. Then, λi' ∈ R(X), for i = 1, 2, ..., k, and hence there exists a matrix A such that Λ' = A'X. Therefore, r(Λ') = r(A'X) ≤ r(X) = r. Hence, there can be at most r linearly independent estimable functions.

DEFINITION: A least squares estimator of an estimable function λ'β is λ'β̂, where β̂ = (X'X)−X'Y is any solution to the normal equations.

Result 3.3. Under the model assumptions Y = Xβ + ε, where E(ε) = 0, if λ'β is estimable, then λ'β̂ = λ'β̃ for any two solutions β̂ and β̃ to the normal equations.

Proof. Suppose that λ'β is estimable. Then λ' = a'X, for some a. From Result 2.5,

λ'β̂ = a'Xβ̂ = a'PXY   and   λ'β̃ = a'Xβ̃ = a'PXY.

This proves the result.

Alternate proof. If β̂ and β̃ both solve the normal equations, then X'X(β̂ − β̃) = 0; that is, β̂ − β̃ ∈ N(X'X) = N(X). If λ'β is estimable, then λ' ∈ R(X) ⟺ λ ∈ C(X') ⟺ λ ⊥ N(X). Thus, λ'(β̂ − β̃) = 0; i.e., λ'β̂ = λ'β̃.

IMPLICATION: Least squares estimators of (linearly) estimable functions are invariant to the choice of generalized inverse used to solve the normal equations.

Example 3.2. In Example 3.1, we considered the one-way fixed effects ANOVA model

Yij = μ + αi + εij, for i = 1, 2, 3 and j = 1, 2. For this model, it is easy to show that

X'X =
[ 6 2 2 2 ]
[ 2 2 0 0 ]
[ 2 0 2 0 ]
[ 2 0 0 2 ]

and r(X'X) = 3. Here are two generalized inverses of X'X:

(X'X)₁⁻ =
[ 0  0    0    0   ]
[ 0  1/2  0    0   ]
[ 0  0    1/2  0   ]
[ 0  0    0    1/2 ],

(X'X)₂⁻ =
[  1/2  −1/2  −1/2  0 ]
[ −1/2   1     1/2  0 ]
[ −1/2   1/2   1    0 ]
[  0     0     0    0 ].


Note that

X'Y =
[ 1 1 1 1 1 1 ]
[ 1 1 0 0 0 0 ]
[ 0 0 1 1 0 0 ]
[ 0 0 0 0 1 1 ]
(Y11, Y12, Y21, Y22, Y31, Y32)'
=
[ Y11 + Y12 + Y21 + Y22 + Y31 + Y32 ]
[ Y11 + Y12                         ]
[ Y21 + Y22                         ]
[ Y31 + Y32                         ].

Two least squares solutions (verify!) are thus

β̂ = (X'X)₁⁻X'Y = (0, Ȳ1+, Ȳ2+, Ȳ3+)'   and   β̃ = (X'X)₂⁻X'Y = (Ȳ3+, Ȳ1+ − Ȳ3+, Ȳ2+ − Ȳ3+, 0)'.

Recall our estimable functions from Example 3.1:

Parameter                       λ'                          λ' ∈ R(X)?   Estimable?
λ3'β = μ + α1                   λ3' = (1, 1, 0, 0)          yes          yes
λ4'β = α1 − α2                  λ4' = (0, 1, −1, 0)         yes          yes
λ5'β = α1 − (α2 + α3)/2         λ5' = (0, 1, −1/2, −1/2)    yes          yes

Note that

• for λ3'β = μ + α1, the (unique) least squares estimator is λ3'β̂ = λ3'β̃ = Ȳ1+.

• for λ4'β = α1 − α2, the (unique) least squares estimator is λ4'β̂ = λ4'β̃ = Ȳ1+ − Ȳ2+.

• for λ5'β = α1 − (α2 + α3)/2, the (unique) least squares estimator is λ5'β̂ = λ5'β̃ = Ȳ1+ − (Ȳ2+ + Ȳ3+)/2.


Finally, note that these three estimable functions are linearly independent since

Λ = (λ3 λ4 λ5) =
[ 1  0   0    ]
[ 1  1   1    ]
[ 0 −1  −1/2  ]
[ 0  0  −1/2  ]

has rank r(Λ) = 3. Of course, more estimable functions λi'β can be found, but we can find no more linearly independent estimable functions because r(X) = 3.
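A numpy sketch of this invariance (made-up responses; the two generalized inverses are the ones displayed above): the two solutions differ, yet they agree on every estimable function.

    import numpy as np

    X = np.array([[1,1,0,0],[1,1,0,0],[1,0,1,0],[1,0,1,0],[1,0,0,1],[1,0,0,1]], float)
    Y = np.array([4.0, 6.0, 3.0, 5.0, 8.0, 10.0])    # made-up data
    XtY = X.T @ Y

    G1 = np.diag([0, .5, .5, .5])
    G2 = np.array([[ .5, -.5, -.5, 0.],
                   [-.5,  1.,  .5, 0.],
                   [-.5,  .5,  1., 0.],
                   [ 0.,  0.,  0., 0.]])

    b1, b2 = G1 @ XtY, G2 @ XtY            # two different solutions
    lam = np.array([0, 1, -1, 0])          # estimable: alpha_1 - alpha_2
    print(np.allclose(b1, b2))             # False: the solutions differ
    print(np.isclose(lam @ b1, lam @ b2))  # True: same estimate of the estimable function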

Result 3.4. Under the model assumptions Y = Xβ + ε, where E(ε) = 0, the least squares estimator λ'β̂ of an estimable function λ'β is a linear unbiased estimator of λ'β.

Proof. Suppose that β̂ solves the normal equations. We know (by definition) that λ'β̂ is the least squares estimator of λ'β. Note that

λ'β̂ = λ'{(X'X)−X'Y + [I − (X'X)−X'X]z} = λ'(X'X)−X'Y + λ'[I − (X'X)−X'X]z.

Also, λ'β is estimable by assumption, so λ' ∈ R(X) ⟺ λ ∈ C(X') ⟺ λ ⊥ N(X). Result MAR5.2 says that [I − (X'X)−X'X]z ∈ N(X'X) = N(X), so λ'[I − (X'X)−X'X]z = 0. Thus, λ'β̂ = λ'(X'X)−X'Y, which is a linear estimator in Y. We now show that λ'β̂ is unbiased. Because λ'β is estimable, λ' ∈ R(X) =⇒ λ' = a'X, for some a. Thus,

E(λ'β̂) = E{λ'(X'X)−X'Y} = λ'(X'X)−X'E(Y) = λ'(X'X)−X'Xβ = a'X(X'X)−X'Xβ = a'PXXβ = a'Xβ = λ'β.

SUMMARY : Consider the linear model Y = Xβ + ε, where E(ε) = 0. From the

definition, we know that λ′β is estimable iff there exists a linear unbiased estimator for

it, so if we can find a linear estimator c+a′Y whose expectation equals λ′β, for all β, then

λ′β is estimable. From Result 3.1, we know that λ′β is estimable iff λ′ ∈ R(X). Thus,

if λ′ can be expressed as a linear combination of the rows of X, then λ′β is estimable.


IMPORTANT : Here is a commonly-used method of finding necessary and sufficient

conditions for estimability in linear models with E(ε) = 0. Suppose that X is n × p

with rank r < p. We know that λ′β is estimable iff λ′ ∈ R(X).

• Typically, when we find the rank of X, we find r linearly independent columns of X and express the remaining s = p − r columns as linear combinations of the r linearly independent columns of X. Suppose that c1, c2, ..., cs satisfy Xci = 0, for i = 1, 2, ..., s; that is, ci ∈ N(X), for i = 1, 2, ..., s. If {c1, c2, ..., cs} forms a basis for N(X), i.e., c1, c2, ..., cs are linearly independent, then

λ'c1 = 0, λ'c2 = 0, ..., λ'cs = 0

are necessary and sufficient conditions for λ'β to be estimable.

REMARK: There are two spaces of interest: C(X') = R(X) and N(X). If X is n × p with rank r < p, then dim C(X') = r and dim N(X) = s = p − r. Therefore, if c1, c2, ..., cs are linearly independent, then {c1, c2, ..., cs} must be a basis for N(X). But,

λ'β estimable ⟺ λ' ∈ R(X) ⟺ λ ∈ C(X')
              ⟺ λ is orthogonal to every vector in N(X)
              ⟺ λ is orthogonal to c1, c2, ..., cs
              ⟺ λ'ci = 0, i = 1, 2, ..., s.

Therefore, λ'β is estimable iff λ'ci = 0, for i = 1, 2, ..., s, where c1, c2, ..., cs are s linearly independent vectors satisfying Xci = 0.
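A minimal numpy sketch of this check for the one-way ANOVA design of Example 3.1: a candidate λ is estimable exactly when it is orthogonal to a basis of N(X).

    import numpy as np

    X = np.array([[1,1,0,0],[1,1,0,0],[1,0,1,0],[1,0,1,0],[1,0,0,1],[1,0,0,1]], float)

    c1 = np.array([1, -1, -1, -1])     # basis of N(X): X c1 = 0
    print(np.allclose(X @ c1, 0))      # True

    lam_mu    = np.array([1, 0, 0, 0])    # mu alone: lam'c1 = 1, not estimable
    lam_mu_a1 = np.array([1, 1, 0, 0])    # mu + alpha_1: lam'c1 = 0, estimable
    lam_a1_a2 = np.array([0, 1, -1, 0])   # alpha_1 - alpha_2: lam'c1 = 0, estimable

    for lam in (lam_mu, lam_mu_a1, lam_a1_a2):
        print(lam, "estimable:", np.isclose(lam @ c1, 0))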

TERMINOLOGY: A set of linear functions {λ1'β, λ2'β, ..., λk'β} is said to be jointly nonestimable if the only linear combination of λ1'β, λ2'β, ..., λk'β that is estimable is the trivial one, i.e., the combination that is identically 0. These types of functions are useful in non-full-rank linear models and are associated with side conditions.


3.2.1 One-way ANOVA

GENERAL CASE : Consider the one-way fixed effects ANOVA model Yij = µ+ αi + εij,

for i = 1, 2, ..., a and j = 1, 2, ..., ni, where E(εij) = 0. In matrix form, X and β are

Xn×p =
[ 1n1 1n1 0n1 · · · 0n1 ]
[ 1n2 0n2 1n2 · · · 0n2 ]
[ ...                   ]
[ 1na 0na 0na · · · 1na ]

and βp×1 = (μ, α1, α2, ..., αa)',

where p = a + 1 and n = Σi ni. Note that the last a columns of X are linearly independent and the first column is the sum of the last a columns. Hence, r(X) = r = a and s = p − r = 1. With c1 = (1, −1a')', note that Xc1 = 0, so {c1} forms a basis for N(X). Thus, the necessary and sufficient condition for λ'β = λ0μ + Σ_{i=1}^a λiαi to be estimable is

λ'c1 = 0 =⇒ λ0 = Σ_{i=1}^a λi.

Here are some examples of estimable functions:

1. μ + αi

2. αi − αk

3. any contrast in the α's; i.e., Σ_{i=1}^a λiαi, where Σ_{i=1}^a λi = 0.

Here are some examples of nonestimable functions:

1. μ

2. αi

3. Σ_{i=1}^a niαi.

There is only s = 1 jointly nonestimable function. Later we will learn that jointly non-

estimable functions can be used to “force” particular solutions to the normal equations.


The following are examples of sets of linearly independent estimable functions (verify!):

1. {μ + α1, μ + α2, ..., μ + αa}

2. {μ + α1, α1 − α2, ..., α1 − αa}.

LEAST SQUARES ESTIMATES : We now wish to calculate the least squares estimates

of estimable functions. Note that X′X and one generalized inverse of X′X is given by

X′X =

n n1 n2 · · · na

n1 n1 0 · · · 0

n2 0 n2 · · · 0...

......

. . ....

na 0 0 · · · na

and (X′X)− =

0 0 0 · · · 0

0 1/n1 0 · · · 0

0 0 1/n2 · · · 0...

......

. . ....

0 0 0 · · · 1/na

For this generalized inverse, the least squares estimate is

β = (X′X)−X′Y =

0 0 0 · · · 0

0 1/n1 0 · · · 0

0 0 1/n2 · · · 0...

......

. . ....

0 0 0 · · · 1/na

∑i

∑j Yij∑

j Y1j∑j Y2j

...∑j Yaj

=

0

Y 1+

Y 2+

...

Y a+

.

REMARK : We know that this solution is not unique; had we used a different generalized

inverse above, we would have gotten a different least squares estimate of β. However, least

squares estimates of estimable functions λ′β are invariant to the choice of generalized

inverse, so our choice of (X′X)− above is as good as any other. From this solution, we

have the unique least squares estimates:

Estimable function, λ′β Least squares estimate, λ′β

µ+ αi Y i+

αi − αk Y i+ − Y k+∑ai=1 λiαi, where

∑ai=1 λi = 0

∑ai=1 λiY i+

PAGE 36

Page 41: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 3 STAT 714, J. TEBBS

3.2.2 Two-way crossed ANOVA with no interaction

GENERAL CASE : Consider the two-way fixed effects (crossed) ANOVA model

Yijk = µ+ αi + βj + εijk,

for i = 1, 2, ..., a and j = 1, 2, ..., b, and k = 1, 2, ..., nij, where E(εij) = 0. For ease of

presentation, we take nij = 1 so there is no need for a k subscript; that is, we can rewrite

the model as Yij = µ+ αi + βj + εij. In matrix form, X and β are

Xn×p =

1b 1b 0b · · · 0b Ib

1b 0b 1b · · · 0b Ib...

......

. . ....

...

1b 0b 0b · · · 1b Ib

and βp×1 =

µ

α1

α2

...

αa

β1

β2

...

βb

,

where p = a + b + 1 and n = ab. Note that the first column is the sum of the last b

columns. The 2nd column is the sum of the last b columns minus the sum of columns 3

through a + 1. The remaining columns are linearly independent. Thus, we have s = 2

linear dependencies so that r(X) = a+ b− 1. The dimension of N (X) is s = 2. Taking

c1 =

1

−1a

0b

and c2 =

1

0a

−1b

produces Xc1 = Xc2 = 0. Since c1 and c2 are linearly independent; i.e., neither is

a multiple of the other, c1, c2 is a basis for N (X). Thus, necessary and sufficient

conditions for λ′β to be estimable are

λ′c1 = 0 =⇒ λ0 =a∑i=1

λi

λ′c2 = 0 =⇒ λ0 =b∑

j=1

λa+j.

PAGE 37

Page 42: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 3 STAT 714, J. TEBBS

Here are some examples of estimable functions:

1. µ+ αi + βj

2. αi − αk

3. βj − βk

4. any contrast in the α’s; i.e.,∑a

i=1 λiαi, where∑a

i=1 λi = 0

5. any contrast in the β’s; i.e.,∑b

j=1 λa+jβj, where∑b

j=1 λa+j = 0.

Here are some examples of nonestimable functions:

1. µ

2. αi

3. βj

4.∑a

i=1 αi

5.∑b

j=1 βj.

We can find s = 2 jointly nonestimable functions. Examples of sets of jointly nones-

timable functions are

1. αa, βb

2. ∑

i αi,∑

j βj.

A set of linearly independent estimable functions (verify!) is

1. µ+ α1 + β1, α1 − α2, ..., α1 − αa, β1 − β2, ..., β1 − βb.

NOTE : When replication occurs; i.e., when nij > 1, for all i and j, our estimability

findings are unchanged. Replication does not change R(X). We obtain the following

least squares estimates:

PAGE 38

Page 43: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 3 STAT 714, J. TEBBS

Estimable function, λ′β Least squares estimate, λ′β

µ+ αi + βj Y ij+

αi − αl Y i++ − Y l++

βj − βl Y +j+ − Y +l+∑ai=1 ciαi, with

∑ai=1 ci = 0

∑ai=1 ciY i++∑b

j=1 diβj, with∑b

j=1 di = 0∑b

j=1 diY +j+

These formulae are still technically correct when nij = 1. When some nij = 0, i.e., there

are missing cells, estimability may be affected; see Monahan, pp 46-48.

3.2.3 Two-way crossed ANOVA with interaction

GENERAL CASE : Consider the two-way fixed effects (crossed) ANOVA model

Yijk = µ+ αi + βj + γij + εijk,

for i = 1, 2, ..., a and j = 1, 2, ..., b, and k = 1, 2, ..., nij, where E(εij) = 0.

SPECIAL CASE : With a = 3, b = 2, and nij = 2, X and β are

X =

1 1 0 0 1 0 1 0 0 0 0 0

1 1 0 0 1 0 1 0 0 0 0 0

1 1 0 0 0 1 0 1 0 0 0 0

1 1 0 0 0 1 0 1 0 0 0 0

1 0 1 0 1 0 0 0 1 0 0 0

1 0 1 0 1 0 0 0 1 0 0 0

1 0 1 0 0 1 0 0 0 1 0 0

1 0 1 0 0 1 0 0 0 1 0 0

1 0 0 1 1 0 0 0 0 0 1 0

1 0 0 1 1 0 0 0 0 0 1 0

1 0 0 1 0 1 0 0 0 0 0 1

1 0 0 1 0 1 0 0 0 0 0 1

and β =

µ

α1

α2

α3

β1

β2

γ11

γ12

γ21

γ22

γ31

γ32

.

PAGE 39

Page 44: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 3 STAT 714, J. TEBBS

There are p = 12 parameters. The last six columns of X are linearly independent, and the

other columns can be written as linear combinations of the last six columns, so r(X) = 6

and s = p − r = 6. To determine which functions λ′β are estimable, we need to find a

basis for N (X). One basis c1, c2, ..., c6 is

−1

1

1

1

0

0

0

0

0

0

0

0

,

−1

0

0

0

1

1

0

0

0

0

0

0

,

0

−1

0

0

0

0

1

1

0

0

0

0

,

0

0

−1

0

0

0

0

0

1

1

0

0

,

0

0

0

0

−1

0

1

0

1

0

1

0

,

−1

1

1

0

1

0

−1

0

−1

0

0

1

.

Functions λ′β must satisfy λ′ci = 0, for each i = 1, 2, ..., 6, to be estimable. It should be

obvious that neither the main effect terms nor the interaction terms; i.e, αi, βj, γij, are

estimable on their own. The six αi + βj + γij “cell means” terms are, but these are not

that interesting. No longer are contrasts in the α’s or β’s estimable. Indeed, interaction

makes the analysis more difficult.

3.3 Reparameterization

SETTING : Consider the general linear model

Model GL: Y = Xβ + ε, where E(ε) = 0.

Assume that X is n × p with rank r ≤ p. Suppose that W is an n × t matrix such

that C(W) = C(X). Then, we know that there exist matrices Tp×t and Sp×t such that

PAGE 40

Page 45: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 3 STAT 714, J. TEBBS

W = XT and X = WS′. Note that Xβ = WS′β = Wγ, where γ = S′β. The model

Model GL-R: Y = Wγ + ε, where E(ε) = 0,

is called a reparameterization of Model GL.

REMARK : Since Xβ = WS′β = Wγ = XTγ, we might suspect that the estimation

of an estimable function λ′β under Model GL should be essentially the same as the

estimation of λ′Tγ under Model GL-R (and that estimation of an estimable function

q′γ under Model GL-R should be essentially the same as estimation of q′S′β under

Model GL). The upshot of the following results is that, in determining a least squares

estimate of an estimable function λ′β, we can work with either Model GL or Model

GL-R. The actual nature of these conjectured relationships is now made precise.

Result 3.5. Consider Models GL and GL-R with C(W) = C(X).

1. PW = PX.

2. If γ is any solution to the normal equations W′Wγ = W′Y associated with Model

GL-R, then β = Tγ is a solution to the normal equations X′Xβ = X′Y associated

with Model GL.

3. If λ′β is estimable under Model GL and if γ is any solution to the normal equations

W′Wγ = W′Y associated with Model GL-R, then λ′Tγ is the least squares

estimate of λ′β.

4. If q′γ is estimable under Model GL-R; i.e., if q′ ∈ R(W), then q′S′β is estimable

under Model GL and its least squares estimate is given by q′γ, where γ is any

solution to the normal equations W′Wγ = W′Y.

Proof.

1. PW = PX since perpendicular projection matrices are unique.

2. Note that

X′XTγ = X′Wγ = X′PWY = X′PXY = X′Y.

Hence, Tγ is a solution to the normal equations X′Xβ = X′Y.

PAGE 41

Page 46: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 3 STAT 714, J. TEBBS

3. This follows from (2), since the least squares estimate is invariant to the choice of the

solution to the normal equations.

4. If q′ ∈ R(W), then q′ = a′W, for some a. Then, q′S′ = a′WS′ = a′X ∈ R(X), so

that q′S′β is estimable under Model GL. From (3), we know the least squares estimate

of q′S′β is q′S′Tγ. But,

q′S′Tγ = a′WS′Tγ = a′XTγ = a′Wγ = q′γ.

WARNING : The converse to (4) is not true; i.e., q′S′β being estimable under Model GL

doesn’t necessarily imply that q′γ is estimable under Model GL-R. See Monahan, pp 52.

TERMINOLOGY : Because C(W) = C(X) and r(X) = r, Wn×t must have at least r

columns. If W has exactly r columns; i.e., if t = r, then the reparameterization of

Model GL is called a full rank reparameterization. If, in addition, W′W is diagonal,

the reparameterization of Model GL is called an orthogonal reparameterization; see,

e.g., the centered linear regression model in Section 2 (notes).

NOTE : A full rank reparameterization always exists; just delete the columns of X that are

linearly dependent on the others. In a full rank reparameterization, (W′W)−1 exists, so

the normal equations W′Wγ = W′Y have a unique solution; i.e., γ = (W′W)−1W′Y.

DISCUSSION : There are two (opposing) points of view concerning the utility of full rank

reparameterizations.

• Some argue that, since making inferences about q′γ under the full rank reparam-

eterized model (Model GL-R) is equivalent to making inferences about q′S′β in

the possibly-less-than-full rank original model (Model GL), the inclusion of the

possibility that the design matrix has less than full column rank causes a needless

complication in linear model theory.

• The opposing argument is that, since computations required to deal with the repa-

rameterized model are essentially the same as those required to handle the original

model, we might as well allow for less-than-full rank models in the first place.

PAGE 42

Page 47: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 3 STAT 714, J. TEBBS

• I tend to favor the latter point of view; to me, there is no reason not to include

less-than-full rank models as long as you know what you can and can not estimate.

Example 3.3. Consider the one-way fixed effects ANOVA model

Yij = µ+ αi + εij,

for i = 1, 2, ..., a and j = 1, 2, ..., ni, where E(εij) = 0. In matrix form, X and β are

Xn×p =

1n1 1n1 0n1 · · · 0n1

1n2 0n2 1n2 · · · 0n2

......

.... . .

...

1na 0na 0na · · · 1na

and βp×1 =

µ

α1

α2

...

αa

,

where p = a + 1 and n =∑

i ni. This is not a full rank model since the first column is

the sum of the last a columns; i.e., r(X) = a.

Reparameterization 1: Deleting the first column of X, we have

Wn×t =

1n1 0n1 · · · 0n1

0n2 1n2 · · · 0n2

......

. . ....

0na 0na · · · 1na

and γt×1 =

µ+ α1

µ+ α2

...

µ+ αa

µ1

µ2

...

µa

,

where t = a and µi = E(Yij) = µ + αi. This is called the cell-means model and is

written Yij = µi + εij. This is a full rank reparameterization with C(W) = C(X). The

least squares estimate of γ is

γ = (W′W)−1W′Y =

Y 1+

Y 2+

...

Y a+

.

Exercise: What are the matrices T and S associated with this reparameterization?

PAGE 43

Page 48: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 3 STAT 714, J. TEBBS

Reparameterization 2: Deleting the last column of X, we have

Wn×t =

1n1 1n1 0n1 · · · 0n1

1n2 0n2 1n2 · · · 0n2

......

.... . .

...

1na−1 0na−1 0na−1 · · · 1na−1

1na 0na 0na · · · 0na

and γt×1 =

µ+ αa

α1 − αaα2 − αa

...

αa−1 − αa

,

where t = a. This is called the cell-reference model (what SAS uses by default). This

is a full rank reparameterization with C(W) = C(X). The least squares estimate of γ is

γ = (W′W)−1W′Y =

Y a+

Y 1+ − Y a+

Y 2+ − Y a+

...

Y (a−1)+ − Y a+

.

Reparameterization 3: Another reparameterization of the effects model uses

Wn×t =

1n1 1n1 0n1 · · · 0n1

1n2 0n2 1n2 · · · 0n2

......

.... . .

...

1na−1 0na−1 0na−1 · · · 1na−1

1na −1na −1na · · · −1na

and γt×1 =

µ+ α

α1 − α

α2 − α...

αa−1 − α

,

where t = a and α = a−1∑

i αi. This is called the deviations from the mean model.

This is a full rank reparameterization with C(W) = C(X).

Example 3.4. Two part multiple linear regression model. Consider the linear model

Y = Xβ + ε, where E(ε) = 0. Suppose that X is full rank. Write X = (X1 X2) and

β = (β′1,β′2)′ so that the model can be written as

Model GL: Y = X1β1 + X2β2 + ε.

Now, set W1 = X1 and W2 = (I − PX1)X2, where PX1 = X1(X′1X1)−1X′1 is the

perpendicular projection matrix onto C(X1). A reparameterized version of Model GL is

Model GL-R: Y = W1γ1 + W2γ2 + ε,

PAGE 44

Page 49: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 3 STAT 714, J. TEBBS

where E(ε) = 0, W = (W1 W2) and

γ =

γ1

γ2

=

β1 + (X′1X1)−1X′1X2β2

β2

.

With this reparameterization, note that

W′W =

X′1X1 0

0 X′2(I−PX1)X2

and W′Y =

X′1Y

X′2(I−PX1)Y

so that

γ = (W′W)−1W′Y =

(X′1X1)−1X′1Y

X′2(I−PX1)X2−1X′2(I−PX1)Y

≡ γ1

γ2

.

In this reparameterization, W2 can be thought of as the “residual” from regressing (each

column of) X2 on X1. A further calculation shows that

β = (X′X)−1X′Y =

β1

β2

=

γ1 − (X′1X1)−1X′1X2β2

γ2

,

where note that (X′1X1)−1X′1X2 is the estimate obtained from “regressing” X2 on X1.

Furthermore, the estimate γ2 can be thought of as the estimate obtained from regressing

Y on W2 = (I−PX1)X2.

APPLICATION : Consider the two part full-rank regression model Y = X1β1+X2β2+ε,

where E(ε) = 0. Suppose that X2 = x2 is n× 1 and that β2 = β2 is a scalar. Consider

two different models:

Reduced model: Y = X1β1 + ε

Full model: Y = X1β1 + x2β2 + ε.

We use the term “reduced model” since C(X1) ⊂ C(X1,x2). Consider the full model

Y = X1β1 + x2β2 + ε and premultiply by I−PX1 to obtain

(I−PX1)Y = (I−PX1)X1β1 + b2(I−PX1)x2 + (I−PX1)ε

= b2(I−PX1)x2 + ε∗,

PAGE 45

Page 50: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 3 STAT 714, J. TEBBS

where ε∗ = (I−PX1)ε. Now, note that

(I−PX1)Y = Y −PX1Y ≡ eY|X1,

say, are the residuals from regressing Y on X1. Similarly, (I − PX1)x2 ≡ ex2|X1are the

residuals from regressing x2 on X1. We thus have the following induced linear model

eY|X1= b2ex2|X1

+ ε∗,

where E(ε∗) = 0. The plot of eY|X1versus ex2|X1

is called an added-variable plot (or

partial regression) plot. It displays the relationship between Y and x2, after adjusting

for the effects of X1 being in the model.

• If a linear trend exists in this plot, this suggests that x2 enters into the (full) model

linearly. This plot can also be useful for detecting outliers and high leverage points.

• On the down side, added-variable plots only look at one predictor at a time so one

can not assess multicolinearity; that is, if the predictor x2 is “close” to C(X1), this

may not be detected in the plot.

• The slope of the least squares regression line for the added variable plot is

β2 = [(I−PX1)x2′(I−PX1)x2]−1(I−PX1)x2′(I−PX1)Y

= x′2(I−PX1)x2−1x′2(I−PX1)Y.

This is equal to the least squares estimate of β2 in the full model.

3.4 Forcing least squares solutions using linear constraints

REVIEW : Consider our general linear model Y = Xβ + ε, where E(ε) = 0, and X is

an n× p matrix with rank r. The normal equations are X′Xβ = X′Y.

• If r = p, then a unique least squares solution exists; i.e., β = (X′X)−1X′Y.

• If r < p, then a least squares solution is β = (X′X)−X′Y. This solution is not

unique; its value depends on which generalized inverse (X′X)− is used.

PAGE 46

Page 51: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 3 STAT 714, J. TEBBS

Example 3.5. Consider the one-way fixed effects ANOVA model Yij = µ+ αi + εij, for

i = 1, 2, ..., a and j = 1, 2, ..., ni, where E(εij) = 0. The normal equations are

X′Xβ =

n n1 n2 · · · na

n1 n1 0 · · · 0

n2 0 n2 · · · 0...

......

. . ....

na 0 0 · · · na

µ

α1

α2

...

αa

=

∑i

∑j Yij∑

j Y1j∑j Y2j

...∑j Yaj

= X′Y,

or, written another way,

nµ+a∑i=1

niαi = Y++

niµ + niαi = Yi+, i = 1, 2, ..., a,

where Yi+ =∑

j Yij, for i = 1, 2, ..., a, and Y++ =∑

i

∑j Yij. This set of equations has no

unique solution. However, from our discussion on generalized inverses (and consideration

of this model), we know that

• if we set µ = 0, then we get the solution µ = 0 and αi = Y i+, for i = 1, 2, ..., a.

• if we set∑a

i=1 niαi = 0, then we get the solution µ = Y ++ and αi = Y i+ − Y ++,

for i = 1, 2, ..., a.

• if we set another nonestimable function equal to 0, we’ll get a different solution to

the normal equations.

REMARK : Equations like µ = 0 and∑a

i=1 niαi = 0 are used to “force” a particular so-

lution to the normal equations and are called side conditions. Different side conditions

produce different least squares solutions. We know that in the one-way ANOVA model,

the parameters µ and αi, for i = 1, 2, ..., a, are not estimable (individually). Imposing

side conditions does not change this. My feeling is that when we attach side conditions to

force a unique solution, we are doing nothing more than solving a mathematical problem

that isn’t relevant. After all, estimable functions λ′β have least squares estimates that

do not depend on which side condition was used, and these are the only functions we

should ever be concerned with.

PAGE 47

Page 52: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 3 STAT 714, J. TEBBS

REMARK : We have seen similar results for the two-way crossed ANOVA model. In

general, what and how many conditions should we use to “force” a particular solution to

the normal equations? Mathematically, we are interested in imposing additional linear

restrictions of the form Cβ = 0 where the matrix C does not depend on Y.

TERMINOLOGY : We say that the system of equations Ax = q is compatible if

c′A = 0 =⇒ c′q = 0; i.e., c ∈ N (A′) =⇒ c′q = 0.

Result 3.6. The system Ax = q is consistent if and only if it is compatible.

Proof. If Ax = q is consistent, then Ax∗ = q, for some x∗. Hence, for any c such that

c′A = 0, we have c′q = c′Ax∗ = 0, so Ax = q is compatible. If Ax = q is compatible,

then for any c ∈ N (A′) = C(I − PA), we have 0 = c′q = q′c = q′(I − PA)z, for all

z. Successively taking z to be the standard unit vectors, we have q′(I − PA) = 0 =⇒

(I − PA)q = 0 =⇒ q = A(A′A)−A′q =⇒ q = Ax∗, where x∗ = (A′A)−A′q. Thus,

Ax = q is consistent.

AUGMENTED NORMAL EQUATIONS : We now consider adjoining the set of equations

Cβ = 0 to the normal equations; that is, we consider the new set of equations X′X

C

β =

X′Y

0

.

These are called the augmented normal equations. When we add the constraint

Cβ = 0, we want these equations to be consistent for all Y. We now would like to find

a sufficient condition for consistency. Suppose that w ∈ R(X′X) ∩R(C). Note that

w ∈ R(X′X) =⇒ w = X′Xv1, for some v1

w ∈ R(C) =⇒ w = −C′v2, for some v2.

Thus, 0 = w −w = X′Xv1 + C′v2

=⇒ 0 = v′1X′X + v′2C

=⇒ 0 = (v′1 v′2)

X′X

C

= v′

X′X

C

,

PAGE 48

Page 53: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 3 STAT 714, J. TEBBS

where v′ = (v′1 v′2). We want C chosen so that X′X

C

β =

X′Y

0

is consistent, or equivalently from Result 3.6, is compatible. Compatibility occurs when

v′

X′X

C

= 0 =⇒ v′

X′Y

0

= 0.

Thus, we need v′1X′Y = 0, for all Y. Successively taking Y to be standard unit vectors,

for i = 1, 2, ..., n, convinces us that v′1X′ = 0⇐⇒ Xv1 = 0⇐⇒ X′Xv1 = 0 =⇒ w = 0.

Thus, the augmented normal equations are consistent whenR(X′X)∩R(C) = 0. Since

R(X′X) = R(X), a sufficient condition for consistency is R(X) ∩ R(C) = 0. Now,

consider the parametric function λ′Cβ, for some λ. We know that λ′Cβ is estimable if

and only if λ′C ∈ R(X). However, clearly λ′C ∈ R(C). Thus, λ′Cβ is estimable if and

only if λ′Cβ = 0. In other words, writing

C =

c′1

c′2...

c′s

,

the set of functions c′1β, c′2β, ..., c′sβ is jointly nonestimable. Therefore, we can set

a collection of jointly nonestimable functions equal to zero and augment the normal

equations so that they remain consistent. We get a unique solution if

r

X′X

C

= p.

Because R(X′X) ∩R(C) = 0,

p = r

X′X

C

= r(X′X) + r(C) = r + r(C),

showing that we need r(C) = s = p− r.

PAGE 49

Page 54: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 3 STAT 714, J. TEBBS

SUMMARY : To augment the normal equations, we can find a set of s jointly nonestimable

functions c′1β, c′2β, ..., c′sβ with

r(C) = r

c′1

c′2...

c′s

= s.

Then, X′X

C

β =

X′Y

0

is consistent and has a unique solution.

Example 3.5 (continued). Consider the one-way fixed effects ANOVA model

Yij = µ+ αi + εij,

for i = 1, 2, ..., a and j = 1, 2, ..., ni, where E(εij) = 0. The normal equations are

X′Xβ =

n n1 n2 · · · na

n1 n1 0 · · · 0

n2 0 n2 · · · 0...

......

. . ....

na 0 0 · · · na

µ

α1

α2

...

αa

=

∑i

∑j Yij∑

j Y1j∑j Y2j

...∑j Yaj

= X′Y.

We know that r(X) = r = a < p (this system can not be solved uniquely) and that

s = p − r = (a + 1) − a = 1. Thus, to augment the normal equations, we need to find

s = 1 (jointly) nonestimable function. Take c′1 = (1, 0, 0, ..., 0), which produces

c′1β = (1 0 0 · · · 0)

µ

α1

α2

...

αa

= µ.

PAGE 50

Page 55: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 3 STAT 714, J. TEBBS

For this choice of c1, the augmented normal equations are

X′X

c1

β =

n n1 n2 · · · na

n1 n1 0 · · · 0

n2 0 n2 · · · 0...

......

. . ....

na 0 0 · · · na

1 0 0 · · · 0

µ

α1

α2

...

αa

=

∑i

∑j Yij∑

j Y1j∑j Y2j

...∑j Yaj

0

=

X′Y

0

.

Solving this (now full rank) system produces the unique solution

µ = 0

αi = Y i+ i = 1, 2, ..., a.

You’ll note that this choice of c1 used to augment the normal equations corresponds to

specifying the side condition µ = 0.

Exercise. Redo this example using (a) the side condition∑a

i=1 niαi = 0, (b) the side

condition αa = 0 (what SAS does), and (c) using another side condition.

Example 3.6. Consider the two-way fixed effects (crossed) ANOVA model

Yij = µ+ αi + βj + εij,

for i = 1, 2, ..., a and j = 1, 2, ..., b, where E(εij) = 0. For purposes of illustration, let’s

take a = b = 3, so that n = ab = 9 and p = a+ b+ 1 = 7. In matrix form, X and β are

X9×7 =

1 1 0 0 1 0 0

1 1 0 0 0 1 0

1 1 0 0 0 0 1

1 0 1 0 1 0 0

1 0 1 0 0 1 0

1 0 1 0 0 0 1

1 0 0 1 1 0 0

1 0 0 1 0 1 0

1 0 0 1 0 0 1

and β7×1 =

µ

α1

α2

α3

β1

β2

β3

.

PAGE 51

Page 56: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 3 STAT 714, J. TEBBS

We see that r(X) = r = 5 so that s = p− r = 7− 5 = 2. The normal equations are

X′Xβ =

9 3 3 3 3 3 3

3 3 0 0 1 1 1

3 0 3 0 1 1 1

3 0 0 3 1 1 1

3 1 1 1 3 0 0

3 1 1 1 0 3 0

3 1 1 1 0 0 3

µ

α1

α2

α3

β1

β2

β3

=

∑i

∑j Yij∑

j Y1j∑j Y2j∑j Y3j∑i Yi1∑i Yi2∑i Yi3

= X′Y.

This system does not have a unique solution. To augment the normal equations, we will

need a set of s = 2 linearly independent jointly nonestimable functions. From Section

3.2.2, one example of such a set is ∑

i αi,∑

j βj. For this choice, our matrix C is

C =

c′1

c′2

=

0 1 1 1 0 0 0

0 0 0 0 1 1 1

.

Thus, the augmented normal equations become

X′X

C

β =

9 3 3 3 3 3 3

3 3 0 0 1 1 1

3 0 3 0 1 1 1

3 0 0 3 1 1 1

3 1 1 1 3 0 0

3 1 1 1 0 3 0

3 1 1 1 0 0 3

0 1 1 1 0 0 0

0 0 0 0 1 1 1

µ

α1

α2

α3

β1

β2

β3

=

∑i

∑j Yij∑

j Y1j∑j Y2j∑j Y3j∑i Yi1∑i Yi2∑i Yi3

0

0

=

X′Y

0

.

Solving this system produces the “estimates” of µ, αi and βj under the side conditions∑i αi =

∑j βj = 0. These “estimates” are

µ = Y ++

αi = Y i+ − Y ++, i = 1, 2, 3

βj = Y +j − Y ++, j = 1, 2, 3.

PAGE 52

Page 57: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 3 STAT 714, J. TEBBS

Exercise. Redo this example using (a) the side conditions αa = 0 and βb = 0 (what

SAS does) and (b) using another set of side conditions.

QUESTION : In general, can we give a mathematical form for the particular solution?

Note that we are now solving X′X

C

β =

X′Y

0

,

which is equivalent to X′X

C′C

β =

X′Y

0

since Cβ = 0 iff C′Cβ = 0. Thus, any solution to this system must also satisfy

(X′X + C′C)β = X′Y.

But,

r(X′X + C′C) = r

(X′ C′)

X

C

= r

X

C

= p,

that is, X′X + C′C is nonsingular. Hence, the unique solution to the augmented normal

equations must be

β = (X′X + C′C)−1X′Y.

So, imposing s = p − r conditions Cβ = 0, where the elements of Cβ are jointly non-

estimable, yields a particular solution to the normal equations. Finally, note that by

Result 2.5 (notes),

Xβ = X(X′X + C′C)−1X′Y

= PXY,

which shows that

PX = X(X′X + C′C)−1X′

is the perpendicular projection matrix onto C(X). This shows that (X′X + C′C)−1 is a

(non-singular) generalized inverse of X′X.

PAGE 53

Page 58: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 4 STAT 714, J. TEBBS

4 The Gauss-Markov Model

Complementary reading from Monahan: Chapter 4 (except Section 4.7).

4.1 Introduction

REVIEW : Consider the general linear model Y = Xβ + ε, where E(ε) = 0.

• A linear estimator t(Y) = c+ a′Y is said to be unbiased for λ′β if and only if

Et(Y) = λ′β,

for all β. We have seen this implies that c = 0 and λ′ ∈ R(X); i.e., λ′β is estimable.

• When λ′β is estimable, it is possible to find several estimators that are unbiased for

λ′β. For example, in the one-way (fixed effects) ANOVA model Yij = µ+ αi + εij,

with E(εij) = 0, Y11, (Y11 + Y12)/2, and Y 1+ are each unbiased estimators of

λ′β = µ+ α1 (there are others too).

• If λ′β is estimable, then the (ordinary) least squares estimator λ′β, where β is any

solution to the normal equations X′Xβ = X′Y, is unbiased for λ′β. To see this,

recall that

λ′β = λ′(X′X)−X′Y = a′PXY,

where λ′ = a′X, for some a, that is, λ′ ∈ R(X), and PX is the perpendicular

projection matrix onto C(X). Thus,

E(λ′β) = E(a′PXY) = a′PXE(Y) = a′PXXβ = a′Xβ = λ′β.

GOAL: Among all linear unbiased estimators for λ′β, we want to find the “best” linear

unbiased estimator in the sense that it has the smallest variance. We will show that

the least squares estimator λ′β is the best linear unbiased estimator (BLUE) of λ′β,

provided that λ′β is estimable and cov(ε) = σ2I.

PAGE 54

Page 59: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 4 STAT 714, J. TEBBS

4.2 The Gauss-Markov Theorem

Result 4.1. Consider the Gauss-Markov model Y = Xβ + ε, where E(ε) = 0 and

cov(ε) = σ2I. Suppose that λ′β is estimable and let β denote any solution to the normal

equations X′Xβ = X′Y. The (ordinary) least squares estimator λ′β is the best linear

unbiased estimator (BLUE) of λ′β, that is, the variance of λ′β is uniformly less than

that of any other linear unbiased estimator of λ′β.

Proof. Suppose that θ = c + a′Y is another linear unbiased estimator of λ′β. From

Result 3.1, we know that c = 0 and that λ′ = a′X. Thus, θ = a′Y, where λ′ = a′X.

Now, write θ = λ′β + (θ − λ′β). Note that

var(θ) = var[λ′β + (θ − λ′β)]

= var(λ′β) + var(θ − λ′β)︸ ︷︷ ︸≥0

+2cov(λ′β, θ − λ′β).

We now show that cov(λ′β, θ − λ′β) = 0. Recalling that λ′β = a′PXY, we have

cov(λ′β, θ − λ′β) = cov(a′PXY, a′Y − a′PXY)

= cov[a′PXY, a′(I−PX)Y]

= a′PXcov(Y,Y)[a′(I−PX)]′

= σ2Ia′PX(I−PX)a = 0,

since PX(I−PX) = 0. Thus, var(θ) ≥ var(λ′β), showing that λ′β has variance no larger

than that of θ. Equality results when var(θ − λ′β) = 0. However, if var(θ − λ′β) = 0,

then because E(θ − λ′β) = 0 as well, θ − λ′β is a degenerate random variable at 0; i.e.,

pr(θ = λ′β) = 1. This establishes uniqueness.

MULTIVARIATE CASE : Suppose that we wish to estimate simultaneously k estimable

linear functions

Λ′β =

λ′1β

λ′2β...

λ′kβ

,

PAGE 55

Page 60: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 4 STAT 714, J. TEBBS

where λ′i = a′iX, for some ai; i.e., λ′i ∈ R(X), for i = 1, 2, ..., k. We say that Λ′β is

estimable if and only if λ′iβ, i = 1, 2, ..., k, are each estimable. Put another way, Λ′β is

estimable if and only if Λ′ = A′X, for some matrix A.

Result 4.2. Consider the Gauss-Markov model Y = Xβ + ε, where E(ε) = 0 and

cov(ε) = σ2I. Suppose that Λ′β is any k-dimensional estimable vector and that c+A′Y is

any vector of linear unbiased estimators of the elements of Λ′β. Let β denote any solution

to the normal equations. Then, the matrix cov(c + A′Y) − cov(Λ′β) is nonnegative

definite.

Proof. It suffices to show that x′[cov(c + A′Y)− cov(Λ′β)]x ≥ 0, for all x. Note that

x′[cov(c + A′Y)− cov(Λ′β)]x = x′cov(c + A′Y)x− x′cov(Λ′β)x

= var(x′c + x′A′Y)− var(x′Λ′β).

But x′Λ′β = x′A′Xβ (a scalar) is estimable since x′A′X ∈ R(X). Also, x′c + x′A′Y =

x′(c+A′Y) is a linear unbiased estimator of x′Λ′β. The least squares estimator of x′Λ′β

is x′Λ′β. Thus, by Result 4.1, var(x′c + x′A′Y)− var(x′Λ′β) ≥ 0.

OBSERVATION : Consider the Gauss-Markov linear model Y = Xβ+ε, where E(ε) = 0

and cov(ε) = σ2I. If X is full rank, then X′X is nonsingular and every linear combination

of λ′β is estimable. The (ordinary) least squares estimator of β is β = (X′X)−1X′Y. It

is unbiased and

cov(β) = cov[(X′X)−1X′Y] = (X′X)−1X′cov(Y)[(X′X)−1X′]′

= (X′X)−1X′σ2IX(X′X)−1 = σ2(X′X)−1.

Note that this is not correct if X is less than full rank.

Example 4.1. Recall the simple linear regression model

Yi = β0 + β1xi + εi,

for i = 1, 2, ..., n, where ε1, ε2, ..., εn are uncorrelated random variables with mean 0 and

common variance σ2 > 0 (these are the Gauss Markov assumptions). Recall that, in

PAGE 56

Page 61: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 4 STAT 714, J. TEBBS

matrix notation,

Y =

Y1

Y2

...

Yn

, X =

1 x1

1 x2

......

1 xn

, β =

β0

β1

, ε =

ε1

ε2...

εn

.

The least squares estimator of β is

β = (X′X)−1X′Y =

β0

β1

=

Y − β1x∑i(xi−x)(Yi−Y )∑

i(xi−x)2

.

The covariance matrix of β is

cov(β) = σ2(X′X)−1 = σ2

1n

+ x2∑i(xi−x)2

− x∑i(xi−x)2

− x∑i(xi−x)2

1∑i(xi−x)2

.

4.3 Estimation of σ2 in the GM model

REVIEW : Consider the Gauss-Markov model Y = Xβ+ε, where E(ε) = 0 and cov(ε) =

σ2I. The best linear unbiased estimator (BLUE) for any estimable function λ′β is λ′β,

where β is any solution to the normal equations. Clearly, E(Y) = Xβ is estimable and

the BLUE of E(Y) is

Xβ = X(X′X)−1X′Y = PXY = Y,

the perpendicular projection of Y onto C(X); that is, the fitted values from the least

squares fit. The residuals are given by

e = Y − Y = Y −PXY = (I−PX)Y,

the perpendicular projection of Y onto N (X′). Recall that the residual sum of squares

is

Q(β) = (Y −Xβ)′(Y −Xβ) = e′e = Y′(I−PX)Y.

We now turn our attention to estimating σ2.

PAGE 57

Page 62: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 4 STAT 714, J. TEBBS

Result 4.3. Suppose that Z is a random vector with mean E(Z) = µ and covariance

matrix cov(Z) = Σ. Let A be nonrandom. Then

E(Z′AZ) = µ′Aµ + tr(AΣ).

Proof. Note that Z′AZ is a scalar random variable; hence, Z′AZ = tr(Z′AZ). Also, recall

that expectation E(·) and tr(·) are linear operators. Finally, recall that tr(AB) = tr(BA)

for conformable A and B. Now,

E(Z′AZ) = E[tr(Z′AZ)] = E[tr(AZZ′)]

= tr[AE(ZZ′)]

= tr[A(Σ + µµ′)]

= tr(AΣ) + tr(Aµµ′)

= tr(AΣ) + tr(µ′Aµ) = tr(AΣ) + µ′Aµ.

REMARK : Finding var(Z′AZ) is more difficult; see Section 4.9 in Monahan. Consider-

able simplification results when Z follows a multivariate normal distribution.

APPLICATION : We now find an unbiased estimator of σ2 under the GM model. Suppose

that Y is n×1 and X is n×p with rank r ≤ p. Note that E(Y) = Xβ. Applying Result

4.3 directly with A = I−PX, we have

E[Y′(I−PX)Y] = (Xβ)′(I−PX)Xβ︸ ︷︷ ︸= 0

+tr[(I−PX)σ2I]

= σ2[tr(I)− tr(PX)]

= σ2[n− r(PX)] = σ2(n− r).

Thus,

σ2 = (n− r)−1Y′(I−PX)Y

is an unbiased estimator of σ2 in the GM model. In non-matrix notation,

σ2 = (n− r)−1

n∑i=1

(Yi − Yi)2,

where Yi is the least squares fitted value of Yi.

PAGE 58

Page 63: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 4 STAT 714, J. TEBBS

ANOVA: Consider the Gauss-Markov model Y = Xβ + ε, where E(ε) = 0 and cov(ε) =

σ2I. Suppose that Y is n × 1 and X is n × p with rank r ≤ p. Recall from Chapter 2

the basic form of an ANOVA table (with corrected sums of squares):

Source df SS MS F

Model (Corrected) r − 1 SSR = Y′(PX −P1)Y MSR = SSRr−1

F = MSRMSE

Residual n− r SSE = Y′(I−PX)Y MSE = SSEn−r

Total (Corrected) n− 1 SST = Y′(I−P1)Y

NOTES :

• The degrees of freedom associated with each SS is the rank of its appropriate

perpendicular projection matrix; that is, r(PX−P1) = r−1 and r(I−PX) = n−r.

• Note that

cov(Y, e) = cov[PXY, (I−PX)Y] = PXσ2I(I−PX) = 0.

That is, the least squares fitted values are uncorrelated with the residuals.

• We have just shown that E(MSE) = σ2. If Xβ /∈ C(1), that is, the independent

variables in X add to the model (beyond an intercept, for example), then

E(SSR) = E[Y′(PX −P1)Y] = (Xβ)′(PX −P1)Xβ + tr[(PX −P1)σ2I]

= (Xβ)′Xβ + σ2r(PX −P1)

= (Xβ)′Xβ + (r − 1)σ2.

Thus,

E(MSR) = (r − 1)−1E(SSR) = σ2 + (r − 1)−1(Xβ)′Xβ.

• If Xβ ∈ C(1), that is, the independent variables in X add nothing to the model,

then (Xβ)′(PX − P1)Xβ = 0 and MSR and MSE are both unbiased estimators

of σ2. If this is true, F should be close to 1. Large values of F occur when

(Xβ)′(PX − P1)Xβ is large, that is, when Xβ is “far away” from C(1), that is,

when the independent variables in X are more relevant in explaining E(Y).

PAGE 59

Page 64: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 4 STAT 714, J. TEBBS

4.4 Implications of model selection

REMARK : Consider the linear model Y = Xβ+ε, where Y is n×1 and X is n×p with

rank r ≤ p. We now investigate two issues: underfitting (model misspecification) and

overfitting. We assume that E(ε) = 0 and cov(ε) = σ2I, that is, our usual Gauss-Markov

assumptions.

4.4.1 Underfitting (Misspecification)

UNDERFITTING : Suppose that, in truth, the “correct” model for Y is

Y = Xβ + Wδ + ε,

where E(ε) = 0 and cov(ε) = σ2I. The vector η = Wδ includes the variables and

coefficients missing from Xβ. If the analyst uses Y = Xβ + ε to describe the data,

s/he is missing important variables that are in W, that is, the analyst is misspecifying

the true model by underfitting. We now examine the effect of underfitting on (a) least

squares estimates of estimable functions and (b) the estimate of the error variance σ2.

CONSEQUENCES : Suppose that λ′β is estimable under Y = Xβ + ε, where E(ε) = 0

and cov(ε) = σ2I; i.e., λ′ = a′X, for some vector a. The least squares estimator of λ′β

is given by λ′β = λ′(X′X)−X′Y. If Wδ = 0, then Y = Xβ + ε is the correct model

and E(λ′β) = λ′β. If Wδ 6= 0, then, under the correct model,

E(λ′β) = E[λ′(X′X)−X′Y] = λ′(X′X)−X′E(Y)

= λ′(X′X)−X′(Xβ + Wδ)

= a′X(X′X)−X′Xβ + a′X(X′X)−X′Wδ

= a′PXXβ + a′PXWδ

= λ′β + a′PXWδ,

showing that λ′β is no longer unbiased, in general. The amount of the bias depends on

where Wδ is. If η = Wδ is orthogonal to C(X), then PXWδ = 0 and the estimation of

λ′β with λ′β is unaffected. Otherwise, PXWδ 6= 0 and the estimate of λ′β is biased.

PAGE 60

Page 65: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 4 STAT 714, J. TEBBS

CONSEQUENCES : Now, let’s turn to the estimation of σ2. Under the correct model,

E[Y′(I−PX)Y] = (Xβ + Wδ)′(I−PX)(Xβ + Wδ) + tr[(I−PX)σ2I]

= (Wδ)′(I−PX)Wδ + σ2(n− r),

where r = r(X). Thus,

E(MSE) = σ2 + (n− r)−1(Wδ)′(I−PX)Wδ,

that is, σ2 = MSE is unbiased if and only if Wδ ∈ C(X).

4.4.2 Overfitting

OVERFITTING : Suppose that, in truth, the correct model for Y is

Y = X1β1 + ε,

where E(ε) = 0 and cov(ε) = σ2I, but, instead, we fit

Y = X1β1 + X2β2 + ε,

that is, the extra variables in X2 are not needed; i.e., β2 = 0. Set X = [X1 X2] and

suppose that X and X1 have full column rank (i.e., a regression setting). The least

squares estimator of β1 under the true model is β1 = (X′1X1)−1X′1Y. We know that

E(β1) = β1

cov(β1) = σ2(X′1X1)−1.

On the other hand, the normal equations associated with the larger (unnecessarily large)

model are

X′Xβ = X′Y ⇐⇒

X′1X1 X′1X2

X′2X1 X′2X2

β1

β2

=

X′1Y

X′2Y

and the least squares estimator of β is

β =

β1

β2

=

X′1X1 X′1X2

X′2X1 X′2X2

−1 X′1Y

X′2Y

.

PAGE 61

Page 66: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 4 STAT 714, J. TEBBS

The least squares estimator β is still unbiased in the unnecessarily large model, that is,

E(β1) = β1

E(β2) = 0.

Thus, we assess the impact of overfitting by looking at cov(β1). Under the unnecessarily

large model,

cov(β1) = σ2(X′1X1)−1 + σ2(X′1X1)−1X′1X2[X′2(I−PX1)X2]−1X′2X1(X′1X1)−1;

see Exercise A.72 (pp 268) in Monahan. Thus,

cov(β1)− cov(β1) = σ2(X′1X1)−1X′1X2[X′2(I−PX1)X2]−1X′2X1(X′1X1)−1.

• If the columns of X2 are each orthogonal to C(X1), then X′1X2 = 0 and

X′X =

X′1X1 X′1X2

X′2X1 X′2X2

=

X′1X1 0

0 X′2X2

;

i.e., X′X is block diagonal, and β1 = (X′1X1)−1X′1Y = β1. This would mean that

using the unnecessarily large model has no effect on our estimate of β1. However,

the precision with which we can estimate σ2 is affected since r(I−PX) < r(I−PX1);

that is, we have fewer residual degrees of freedom.

• If the columns of X2 are not all orthogonal to C(X1), then

cov(β1) 6= σ2(X′1X1)−1.

Furthermore, as X2 gets “closer” to C(X1), then X′2(I − PX1)X2 gets “smaller.”

This makes [X′2(I−PX1)X2]−1 “larger.” This makes cov(β1) “larger.”

• Multicollinearity occurs when X2 is “close” to C(X1). Severe multicollinearity

can greatly inflate the variances of the least squares estimates. In turn, this can

have a deleterious effect on inference (e.g., confidence intervals too wide, hypothesis

tests with no power, predicted values with little precision, etc.). Various diagnostic

measures exist to assess multicollinearity (e.g., VIFs, condition numbers, etc.); see

the discussion in Monahan, pp 80-82.

PAGE 62

Page 67: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 4 STAT 714, J. TEBBS

4.5 The Aitken model and generalized least squares

TERMINOLOGY : The general linear model

Y = Xβ + ε,

where E(ε) = 0 and cov(ε) = σ2V, V known, is called the Aitken model. It is

more flexible than the Guass-Markov (GM) model, because the analyst can incorporate

different correlation structures with the observed responses. The GM model is a special

case of the Aitken model with V = I; that is, where responses are uncorrelated.

REMARK : In practice, V is rarely known; rather, it must be estimated. We will discuss

this later. We assume that V is positive definite (pd), and, hence nonsingular, for reasons

that will soon be obvious. Generalizations are possible; see Christensen (Chapter 10).

RECALL: Because V is symmetric, we can write V in its Spectral Decomposition; i.e.,

V = QDQ′,

where Q is orthogonal and D is the diagonal matrix consisting of λ1, λ2, ..., λn, the eigen-

values of V. Because V is pd, we know that λi > 0, for each i = 1, 2, ..., n. The symmetric

square root of V is

V1/2 = QD1/2Q′,

where D1/2 = diag(√λ1,√λ2, ...,

√λn). Note that V1/2V1/2 = V and that V−1 =

V−1/2V−1/2, where

V−1/2 = QD−1/2Q′

and D−1/2 = diag(1/√λ1, 1/

√λ2, ..., 1/

√λn).

TRANSFORMATION : Consider the Aitken model Y = Xβ + ε, where E(ε) = 0 and

cov(ε) = σ2V, where V is known. Premultiplying by V−1/2, we get

V−1/2Y = V−1/2Xβ + V−1/2ε

⇐⇒ Y∗ = Uβ + ε∗,

PAGE 63

Page 68: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 4 STAT 714, J. TEBBS

where Y∗ = V−1/2Y, U = V−1/2X, and ε∗ = V−1/2ε. It is easy to show that Y∗ =

Uβ + ε∗ is now a GM model. To see this, note that

E(ε∗) = V−1/2E(ε) = V−1/20 = 0

and

cov(ε∗) = V−1/2cov(ε)V−1/2 = V−1/2σ2VV−1/2 = σ2I.

Note also that R(X) = R(U), because V−1/2 is nonsingular. This means that λ′β is

estimable in the Aitken model if and only if λ′β is estimable in the transformed GM

model. The covariance structure on ε does not affect estimability.

AITKEN EQUATIONS : In the transformed model, the normal equations are

U′Uβ = U′Y∗.

However, note that

U′Uβ = U′Y∗ ⇐⇒ (V−1/2X)′V−1/2Xβ = (V−1/2X)′V−1/2Y

⇐⇒ X′V−1Xβ = X′V−1Y

in the Aitken model. The equations

X′V−1Xβ = X′V−1Y

are called the Aitken equations. These should be compared with the normal equations

X′Xβ = X′Y

in the GM model. In general, we will denote by βGLS and βOLS the solutions to the Aitken

and normal equations, respectively. “GLS” stands for generalized least squares. “OLS”

stands for ordinary least squares.

GENERALIZED LEAST SQUARES : Any solution βGLS to the Aitken equations is called

a generalized least squares (GLS) estimator of β. It is not necessarily unique (unless

X is full rank). The solution βGLS minimizes

Q∗(β) = (Y∗ −Uβ)′(Y∗ −Uβ) = (Y −Xβ)′V−1(Y −Xβ).

PAGE 64

Page 69: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 4 STAT 714, J. TEBBS

When X is full rank, the unique GLS estimator is

βGLS = (X′V−1X)−1X′V−1Y.

When X is not full rank, a GLS estimator is

βGLS = (X′V−1X)−X′V−1Y.

NOTE : If V is diagonal; i.e., V = diag(v1, v2, ..., vn), then

Q∗(β) = (Y −Xβ)′V−1(Y −Xβ) =n∑i=1

wi(Yi − x′iβ)2,

where wi = 1/vi and x′i is the ith row of X. In this situation, βGLS is called the weighted

least squares estimator.

Result 4.4. Consider the Aitken model Y = Xβ+ε, where E(ε) = 0 and cov(ε) = σ2V,

where V is known. If λ′β is estimable, then λ′βGLS is the BLUE for λ′β.

Proof. Applying the Gauss-Markov Theorem to the transformed model Y∗ = Uβ + ε∗,

the GLS estimator λ′βGLS is the BLUE of λ′β among all linear unbiased estimators

involving Y∗ = V−1/2Y. However, any linear estimator in Y can be obtained from Y∗

because V−1/2 is invertible. Thus, λ′βGLS is the BLUE.

REMARK : If X is full rank, then estimability concerns vanish (as in the GM model) and

βGLS = (X′V−1X)−1X′V−1Y is unique. In this case, straightforward calculations show

that E(βGLS) = β and

cov(βGLS) = σ2(X′V−1X)−1.

Example 4.2. Heteroscedastic regression through the origin. Consider the regression

model Yi = βxi + εi, for i = 1, 2, ..., n, where E(εi) = 0, var(εi) = σ2g2(xi), for some real

function g(·), and cov(εi, εj) = 0, for i 6= j. For this model,

Y =

Y1

Y2

...

Yn

, X =

x1

x2

...

xn

, and V =

g2(x1) 0 · · · 0

0 g2(x2) · · · 0...

.... . .

...

0 0 · · · g2(xn)

.

PAGE 65

Page 70: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 4 STAT 714, J. TEBBS

The OLS estimator of β = β is given by

βOLS = (X′X)−1X′Y =

∑ni=1 xiYi∑ni=1 x

2i

.

The GLS estimator of β = β is given by

βGLS = (X′V−1X)−1X′V−1Y =

∑ni=1 xiYi/g

2(xi)∑ni=1 x

2i /g

2(xi).

Which one is better? Both of these estimators are unbiased, so we turn to the variances.

Straightforward calculations show that

var(βOLS) =σ2∑n

i=1 x2i g

2(xi)

(∑n

i=1 x2i )

2

var(βGLS) =σ2∑n

i=1 x2i /g

2(xi).

We are thus left to compare∑ni=1 x

2i g

2(xi)

(∑n

i=1 x2i )

2 with1∑n

i=1 x2i /g

2(xi).

Write x2i = uivi, where ui = xig(xi) and vi = xi/g(xi). Applying Cauchy-Schwartz’s

inequality, we get(n∑i=1

x2i

)2

=

(n∑i=1

uivi

)2

≤n∑i=1

u2i

n∑i=1

v2i =

n∑i=1

x2i g

2(xi)n∑i=1

x2i /g

2(xi).

Thus,1∑n

i=1 x2i /g

2(xi)≤∑n

i=1 x2i g

2(xi)

(∑n

i=1 x2i )

2 =⇒ var(βGLS) ≤ var(βOLS).

This result should not be surprising; after all, we know that βGLS is BLUE.

Result 4.5. An estimate β is a generalized least squares estimate if and only if Xβ =

AY, where A = X(X′V−1X)−X′V−1.

Proof. The GLS estimate; i.e., the OLS estimate in the transformed model Y∗ = Uβ+ε∗,

where Y∗ = V−1/2Y, U = V−1/2X, and ε∗ = V−1/2ε, satisfies

V−1/2X[(V−1/2X)′V−1/2X]−(V−1/2X)′V−1/2Y = V−1/2Xβ,

by Result 2.5. Multiplying through by V1/2 and simplifying gives the result.

PAGE 66

Page 71: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 4 STAT 714, J. TEBBS

Result 4.6. A = X(X′V−1X)−X′V−1 is a projection matrix onto C(X).

Proof. We need to show that

(a) A is idempotent

(b) Aw ∈ C(X), for any w

(c) Az = z, for all z ∈ C(X).

The perpendicular projection matrix onto C(V−1/2X) is

V−1/2X[(V−1/2X)′V−1/2X]−(V−1/2X)′,

which implies that

V−1/2X[(V−1/2X)′V−1/2X]−(V−1/2X)′V−1/2X = V−1/2X.

This can also be written as

V−1/2AX = V−1/2X.

Premultiplying by V1/2 gives AX = X. Thus,

AA = AX(X′V−1X)−X′V−1 = X(X′V−1X)−X′V−1 = A,

showing that A is idempotent. To show (b), note Aw = X(X′V−1X)−X′V−1w ∈ C(X).

To show (c), it suffices to show C(A) = C(X). But, A = X(X′V−1X)−X′V−1 implies

that C(A) ⊂ C(X) and AX = X implies that C(X) ⊂ C(A).

Result 4.7. In the Aitken model, if C(VX) ⊂ C(X), then the GLS and OLS estimates

will be equal; i.e., OLS estimates will be BLUE in the Aitken model.

Proof. The proof proceeds by showing that A = X(X′V−1X)−X′V−1 is the perpendicular

projection matrix onto C(X) when C(VX) ⊂ C(X). We already know that A is a

projection matrix onto C(X). Thus, all we have to show is that if w⊥C(X), then Aw = 0.

If V is nonsingular, then r(VX) = r(X). The only way this and C(VX) ⊂ C(X) holds

is if C(VX) = C(X), in which case VXB1 = X and VX = XB2, for some matrices

B1 and B2. Multiplying through by V−1 gives XB1 = V−1X and X = V−1XB2.

Thus, C(V−1X) = C(X) and C(V−1X)⊥ = C(X)⊥. If w⊥C(X), then w⊥C(V−1X); i.e.,

w ∈ N (X′V−1). Since Aw = X(X′V−1X)−1X′V−1w = 0, we are done.

PAGE 67

Page 72: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 5 STAT 714, J. TEBBS

5 Distributional Theory

Complementary reading from Monahan: Chapter 5.

5.1 Introduction

PREVIEW : Consider the Gauss-Markov linear model Y = Xβ + ε, where Y is n× 1, X

is an n× p with rank r ≤ p, β is p× 1, and ε is n× 1 with E(ε) = 0 and cov(ε) = σ2I.

In addition to the first two moment assumptions, it is common to assume that ε follows

a multivariate normal distribution. This additional assumption allows us to formally

pursue various questions dealing with inference. In addition to the multivariate normal

distribution, we will also examine noncentral distributions and quadratic forms.

RECALL: If Z ∼ N (0, 1), then the probability density function (pdf) of Z is given by

fZ(z) =1√2πe−z

2/2I(z ∈ R).

The N (µ, σ2) family is a location-scale family generated by the standard density fZ(z).

TERMINOLOGY : The collection of pdfs

LS(f) =

fX(·|µ, σ) : fX(x|µ, σ) =

1

σfZ

(x− µσ

);µ ∈ R, σ > 0

is a location-scale family generated by fZ(z); see Casella and Berger, Chapter 3. That

is, if Z ∼ fZ(z), then

X = σZ + µ ∼ fX(x|µ, σ) =1

σfZ

(x− µσ

).

APPLICATION : With the standard normal density fZ(z), it is easy to see that

fX(x|µ, σ) =1

σfZ

(x− µσ

)=

1√2πσ

e−1

2σ2(x−µ)2I(x ∈ R).

That is, any normal random variable X ∼ N (µ, σ2) may be obtained by transforming

Z ∼ N (0, 1) via X = σZ + µ.

PAGE 68

Page 73: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 5 STAT 714, J. TEBBS

5.2 Multivariate normal distribution

5.2.1 Probability density function

STARTING POINT : Suppose that Z1, Z2, ..., Zp are iid standard normal random vari-

ables. The joint pdf of Z = (Z1, Z2, ..., Zp)′ is given by

fZ(z) =

p∏i=1

fZ(zi)

=

(1√2π

)pe−

∑pi=1 z

2i /2

p∏i=1

I(zi ∈ R)

= (2π)−p/2 exp(−z′z/2)I(z ∈ Rp).

If Z has pdf fZ(z), we say that Z has a standard multivariate normal distribution;

i.e., a multivariate normal distribution with mean 0p×1 and covariance matrix Ip. We

write Z ∼ Np(0, I).

MULTIVARIATE NORMAL DISTRIBUTION : Suppose that Z ∼ Np(0, I). Suppose

that V is symmetric and positive definite (and, hence, nonsingular) and let V1/2 be the

symmetric square root of V. Define the transformation

Y = V1/2Z + µ,

where Y and µ are both p× 1. Note that

E(Y) = E(V1/2Z + µ) = µ,

since E(Z) = 0, and

cov(Y) = cov(V1/2Z + µ) = V1/2cov(Z)V1/2 = V,

since cov(Z) = I. The transformation y = g(z) = V1/2z+µ is linear in z (and hence, one-

to-one) and the pdf of Y can be found using a transformation. The inverse transformation

is z = g−1(y) = V−1/2(y − µ). The Jacobian of the inverse transformation is∣∣∣∣∂g−1(y)

∂y

∣∣∣∣ = |V−1/2|,

PAGE 69

Page 74: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 5 STAT 714, J. TEBBS

where |A| denotes the determinant of A. The matrix V−1/2 is pd; thus, its determinant

is always positive. Thus, for y ∈ Rp,

fY(y) = fZg−1(y)|V−1/2|

= |V|−1/2fZV−1/2(y − µ)

= (2π)−p/2|V|−1/2 exp[−V−1/2(y − µ)′V−1/2(y − µ)/2]

= (2π)−p/2|V|−1/2 exp−(y − µ)′V−1(y − µ)/2.

If Y ∼ fY(y), we say that Y has a multivariate normal distribution with mean µ

and covariance matrix V. We write Y ∼ Np(µ,V).

IMPORTANT : In the preceding derivation, we assumed V to be pd (hence, nonsingular).

If V is singular, then the distribution of Y is concentrated in a subspace of Rp, with

dimension r(V). In this situation, the density function of Y does not exist.

5.2.2 Moment generating functions

REVIEW : Suppose that X is a random variable with cumulative distribution function

FX(x) = P (X ≤ x). If E(etX) <∞ for all |t| < δ, ∃δ > 0, then

MX(t) = E(etX) =

∫RetxdFX(x)

is defined for all t in an open neighborhood about zero. The function MX(t) is called the

moment generating function (mgf) of X.

Result 5.1.

1. If MX(t) exists, then E(|X|j) < ∞, for all j ≥ 1, that is, the moment generating

function characterizes an infinite set of moments.

2. MX(0) = 1.

3. The jth moment of X is given by

E(Xj) =∂jMX(t)

∂tj

∣∣∣∣∣t=0

.

PAGE 70

Page 75: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 5 STAT 714, J. TEBBS

4. Uniqueness. If X1 ∼ MX1(t), X2 ∼ MX2(t), and MX1(t) = MX2(t) for all t in an

open neighborhood about zero, then FX1(x) = FX2(x) for all x.

5. If X1, X2, ..., Xn are independent random variables with mgfs MXi(t), i = 1, 2, ..., n,

and Y = a0 +∑n

i=1 aiXi, then

MY (t) = ea0tn∏i=1

MXi(ait).

Result 5.2.

1. If X ∼ N (µ, σ2), then MX(t) = exp(µt+ t2σ2/2), for all t ∈ R.

2. If X ∼ N (µ, σ2), then Y = a+ bX ∼ N (a+ bµ, b2σ2).

3. If X ∼ N (µ, σ2), then Z = 1σ(X − µ) ∼ N (0, 1).

TERMINOLOGY : Define the random vector X = (X1, X2, ..., Xp)′ and let t = (t1, t2, ..., tp)

′.

The moment generating function for X is given by

MX(t) = Eexp(t′X) =

∫Rp

exp(t′x)dFX(x),

provided that Eexp(t′X) <∞, for all ||t|| < δ, ∃δ > 0.

Result 5.3.

1. If MX(t) exists, then MXi(ti) = MX(t∗i ), where t∗i = (0, ..., 0, ti, 0, ..., 0)′. This

implies that E(|Xi|j) <∞, for all j ≥ 1.

2. The expected value of X is

E(X) =∂MX(t)

∂t

∣∣∣∣∣t=0

.

3. The p× p second moment matrix

E(XX′) =∂2MX(t)

∂t∂t′

∣∣∣∣∣t=0

.

Thus,

E(XrXs) =∂2MX(t)

∂trts

∣∣∣∣∣tr=ts=0

.

PAGE 71

Page 76: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 5 STAT 714, J. TEBBS

4. Uniqueness. If X1 and X2 are random vectors with MX1(t) = MX2(t) for all t in

an open neighborhood about zero, then FX1(x) = FX2(x) for all x.

5. If X1,X1, ...,Xn are independent random vectors, and

Y = a0 +n∑i=1

AiXi,

for conformable a0 and Ai; i = 1, 2, ..., n, then

MY(t) = exp(a′0t)n∏i=1

MXi(A′it).

6. Let X = (X′1,X′2, ...,X

′m)′ and suppose that MX(t) exists. Let MXi

(ti) denote the

mgf of Xi. Then, X1,X2, ...,Xm are independent if and only if

MX(t) =n∏i=1

MXi(ti)

for all t = (t′1, t′2, ..., t

′m)′ in an open neighborhood about zero.

Result 5.4. If Y ∼ Np(µ,V), then MY(t) = exp(t′µ + t′Vt/2).

Proof. Exercise.

5.2.3 Properties

Result 5.5. Let Y ∼ Np(µ,V). Let a be p× 1, b be k × 1, and A be k × p. Then

1. X = a′Y ∼ N (a′µ, a′Va).

2. X = AY + b ∼ Nk(Aµ + b,AVA′).

Result 5.6. If Y ∼ Np(µ,V), then any r × 1 subvector of Y has an r-variate normal

distribution with the same means, variances, and covariances as the original distribution.

Proof. Partition Y = (Y′1,Y′2)′, where Y1 is r × 1. Partition µ = (µ′1,µ

′2)′ and

V =

V11 V12

V21 V22

PAGE 72

Page 77: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 5 STAT 714, J. TEBBS

accordingly. Define A = (Ir 0), where 0 is r×(p−r). Since Y1 = AY is a linear function

of Y, it is normally distributed. Since Aµ = µ1 and AVA′ = V11, we are done.

COROLLARY : If Y ∼ Np(µ,V), then Yi ∼ N (µi, σ2i ), for i = 1, 2, ..., p.

WARNING : Joint normality implies marginal normality. That is, if Y1 and Y2 are

jointly normal, then they are marginally normal. However, if Y1 and Y2 are marginally

normal, this does not necessarily mean that they are jointly normal.

APPLICATION : Consider the general linear model

Y = Xβ + ε,

where ε ∼ Nn(0, σ2I). Note that E(Y) = Xβ and that V = cov(Y) = σ2I. Fur-

thermore, because Y is a linear combination of ε, it is also normally distributed;

i.e., Y ∼ Nn(Xβ, σ2I). With PX = X(X′X)−X′, we know that Y = PXY and

e = (I−PX)Y. Now,

E(Y) = E(PXY) = PXE(Y) = PXXβ = Xβ

and

cov(Y) = cov(PXY) = PXcov(Y)P′X = σ2PXIPX = σ2PX,

since PX is symmetric and idempotent. Also, Y = PXY is a linear combination of Y,

so it also has normal distribution. Putting everything together, we have

Y ∼ Nn(Xβ, σ2PX).

Exercise: Show that e ∼ Nn0, σ2(I−PX).

5.2.4 Less-than-full-rank normal distributions

TERMINOLOGY : The random vector Yp×1 ∼ Np(µ,V) is said to have a p-variate

normal distribution with rank k if Y has the same distribution as µp×1 + Γ′p×kZk×1,

where Γ′Γ = V, r(V) = k < p, and Z ∼ Nk(0, I).

PAGE 73

Page 78: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 5 STAT 714, J. TEBBS

Example 5.1. Suppose that k = 1, Z1 ∼ N (0, 1), and Y = (Y1, Y2)′, where

Y =

0

0

+

γ1

γ2

Z1 =

γ1Z1

γ2Z1

,

where Γ′ = (γ1, γ2)′ and r(Γ) = 1. Since r(Γ) = 1, this means that at least one of γ1 and

γ2 is not equal to zero. Without loss, take γ1 6= 0, in which case

Y2 =γ2

γ1

Y1.

Note that E(Y) = 0 = (0, 0)′ and

cov(Y) = E(YY′) = E

γ2

1Z21 γ1γ2Z

21

γ1γ2Z21 γ2

2Z21

=

γ21 γ1γ2

γ1γ2 γ22

= Γ′Γ = V.

Note that |V| = 0. Thus, Y2×1 is a random vector with all of its probability mass located

in the linear subspace (y1, y2) : y2 = γ2y1/γ1. Since r(V) = 1 < 2, Y does not have a

density function.

5.2.5 Independence results

Result 5.7. Suppose that Y ∼ N (µ,V), where

Y =

Y1

Y2

...

Ym

, µ =

µ1

µ2

...

µm

, and V =

V11 V12 · · · V1m

V21 V22 · · · V2m

......

. . ....

Vm1 Vm2 · · · Vmm

.

Then, Y1,Y2, ...,Ym are jointly independent if and only if Vij = 0, for all i 6= j.

Proof. Sufficiency (=⇒): Suppose Y1,Y2, ...,Ym are jointly independent. For all i 6= j,

Vij = E(Yi − µi)(Yj − µj)′

= E(Yi − µi)E(Yj − µj)′ = 0.

Necessity (⇐=): Suppose that Vij = 0 for all i 6= j, and let t = (t′1, t′2, ..., t

′m)′. Note

that

t′Vt =m∑i=1

t′iViiti and t′µ =m∑i=1

t′iµi.

PAGE 74

Page 79: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 5 STAT 714, J. TEBBS

Thus,

MY(t) = exp(t′µ + t′Vt/2)

= exp

(m∑i=1

t′iµi +m∑i=1

t′iViiti/2

)

=m∏i=1

exp(t′iµi + t′iViiti/2) =m∏i=1

MYi(ti).

Result 5.8. Suppose that X ∼ N (µ,Σ), and let Y1 = a1 + B1X and Y2 = a2 + B2X,

for nonrandom conformable ai and Bi; i = 1, 2. Then, Y1 and Y2 are independent if

and only if B1ΣB′2 = 0.

Proof. Write Y = (Y1,Y2)′ as Y1

Y2

=

a1

a2

+

B1

B2

X = a + BX.

Thus, Y is a linear combination of X; hence, Y follows a multivariate normal distribution

(i.e., Y1 and Y2 are jointly normal). Also, cov(Y1,Y2) = cov(B1X,B2X) = B1ΣB′2.

Now simply apply Result 5.7.

REMARK : If X1 ∼ N (µ1,Σ1), X2 ∼ N (µ2,Σ2), and cov(X1,X2) = 0, this does not

necessarily mean that X1 and X2 are independent! We need X = (X′1,X′2)′ to be jointly

normal.

APPLICATION : Consider the general linear model

Y = Xβ + ε,

where ε ∼ Nn(0, σ2I). We have already seen that Y ∼ Nn(Xβ, σ2I). Also, note that

with PX = X(X′X)−X′, Y

e

=

PX

I−PX

Y,

a linear combination of Y. Thus, Y and e are jointly normal. By the last result, we

know that Y and e are independent since

cov(Y, e) = PXσ2I(I−PX)′ = 0.

PAGE 75

Page 80: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 5 STAT 714, J. TEBBS

That is, the fitted values and residuals from the least-squares fit are independent. This

explains why residual plots that display nonrandom patterns are consistent with a vio-

lation of our model assumptions.

5.2.6 Conditional distributions

RECALL: Suppose that (X, Y )′ ∼ N2(µ,Σ), where

µ =

µX

µY

and Σ =

σ2X ρσXσY

ρσXσY σ2Y

and ρ = corr(X, Y ). The conditional distribution of Y , given X, is also normally dis-

tributed, more precisely,

Y |X = x ∼ NµY + ρ(σY /σX)(x− µX), σ2

Y (1− ρ2).

It is important to see that the conditional mean E(Y |X = x) is a linear function of x.

Note also that the conditional variance var(Y |X = x) is free of x.

EXTENSION : We wish to extend the previous result to random vectors. In particular,

suppose that X and Y are jointly multivariate normal with ΣXY 6= 0. That is, suppose X

Y

∼ N µX

µY

,

ΣX ΣY X

ΣXY ΣY

,

and assume that ΣX is nonsingular. The conditional distribution of Y given X is

Y|X = x ∼ N (µY |X ,ΣY |X),

where

µY |X = µY + ΣY XΣ−1X (x− µX)

and

ΣY |X = ΣY −ΣY XΣ−1X ΣXY .

Again, the conditional mean µY |X is a linear function of x and the conditional covariance

matrix ΣY |X is free of x.

PAGE 76

Page 81: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 5 STAT 714, J. TEBBS

5.3 Noncentral χ2 distribution

RECALL: Suppose that U ∼ χ2n; that is, U has a (central) χ2 distribution with n > 0

degrees of freedom. The pdf of U is given by

fU(u|n) =1

Γ(n2)2n/2

un2−1e−u/2I(u > 0).

The χ2n family of distributions is a gamma(α, β) subfamily with shape parameter α = n/2

and scale parameter β = 2. Note that E(U) = n, var(U) = 2n, and MU(t) = (1−2t)−n/2,

for t < 1/2.

RECALL: If Z1, Z2, ..., Zn are iid N (0, 1), then U1 = Z21 ∼ χ2

1 and

Z′Z =n∑i=1

Z2i ∼ χ2

n.

Proof. Exercise.

TERMINOLOGY : A univariate random variable V is said to have a noncentral χ2

distribution with degrees of freedom n > 0 and noncentrality parameter λ > 0 if it has

the pdf

fV (v|n, λ) =∞∑j=0

(e−λλj

j!

)1

Γ(n+2j2

)2(n+2j)/2vn+2j

2−1e−v/2I(v > 0)︸ ︷︷ ︸

fU (v|n+2j)

.

We write V ∼ χ2n(λ). When λ = 0, the χ2

n(λ) distribution reduces to the central χ2n

distribution. In the χ2n(λ) pdf, notice that e−λλj/j! is the jth term of a Poisson pmf with

parameter λ > 0.

Result 5.9. If V |W ∼ χ2n+2W and W ∼ Poisson(λ), then V ∼ χ2

n(λ).

Proof. Note that

fV (v) =∞∑j=0

fV,W (v, j)

=∞∑j=0

fW (j)fV |W (v|j)

=∞∑j=0

(e−λλj

j!

)1

Γ(n+2j2

)2(n+2j)/2vn+2j

2−1e−v/2I(v > 0).

PAGE 77

Page 82: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 5 STAT 714, J. TEBBS

Result 5.10. If V ∼ χ2n(λ), then

    MV (t) = (1− 2t)^{−n/2} exp{2tλ/(1− 2t)},

for t < 1/2.

Proof. The mgf of V , by definition, is MV (t) = E(e^{tV}). Using an iterated expectation, we can write, for t < 1/2,

    MV (t) = E(e^{tV}) = E{E(e^{tV} |W )},

where W ∼ Poisson(λ). Note that E(e^{tV} |W ) is the conditional mgf of V , given W . We know that V |W ∼ χ2n+2W ; thus, E(e^{tV} |W ) = (1− 2t)^{−(n+2W)/2} and

    E{E(e^{tV} |W )} = ∑_{j=0}^∞ (1− 2t)^{−(n+2j)/2} (e−λλj/j!)
                     = e−λ(1− 2t)^{−n/2} ∑_{j=0}^∞ {λ/(1− 2t)}^j / j!
                     = e−λ(1− 2t)^{−n/2} exp{λ/(1− 2t)}
                     = (1− 2t)^{−n/2} exp{2tλ/(1− 2t)}.

MEAN AND VARIANCE : If V ∼ χ2n(λ), then

E(V ) = n+ 2λ and var(V ) = 2n+ 8λ.

Result 5.11. If Y ∼ N (µ, 1), then U = Y 2 ∼ χ21(λ), where λ = µ2/2.

Outline of the proof. The proof proceeds by finding the mgf of U and showing that it equals

    MU (t) = (1− 2t)^{−n/2} exp{2tλ/(1− 2t)},

with n = 1 and λ = µ2/2. Note that

    MU (t) = E(e^{tU}) = E(e^{tY^2}) = ∫_R e^{ty^2} (1/√(2π)) e^{−(y−µ)^2/2} dy.


Now, combine the exponents in the integrand, square out the (y−µ)2 term, combine like

terms, complete the square, and collapse the expression to (1− 2t)^{−1/2} exp{µ2t/(1− 2t)}

times some normal density that is integrated over R.

Result 5.12. If U1, U2, ..., Um are independent random variables, where Ui ∼ χ2ni(λi); i = 1, 2, ...,m, then U = ∑i Ui ∼ χ2n(λ), where n = ∑i ni and λ = ∑i λi.

Result 5.13. Suppose that V ∼ χ2n(λ). For fixed n and c > 0, the quantity Pλ(V > c)

is a strictly increasing function of λ.

Proof. See Monahan, pp 106-108.

IMPLICATION : If V1 ∼ χ2n(λ1) and V2 ∼ χ2n(λ2), where λ2 > λ1, then pr(V2 > c) >

pr(V1 > c). That is, V2 is (strictly) stochastically greater than V1, written V2 >st V1.

Note that

V2 >st V1 ⇐⇒ FV2(v) < FV1(v)⇐⇒ SV2(v) > SV1(v),

for all v, where FVi(·) denotes the cdf of Vi and SVi(·) = 1− FVi(·) denotes the survivor

function of Vi.
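ILLUSTRATION : A short Python/SciPy sketch of Result 5.13 (n and c are arbitrary choices); as before, scipy's "nc" equals 2λ in the notes' parameterization.

```python
import numpy as np
from scipy import stats

n, c = 5, 11.07
for lam in [0.0, 0.5, 1.0, 2.0, 4.0]:
    # Use the central chi^2 when lambda = 0 to avoid nc = 0 edge cases.
    tail = stats.ncx2.sf(c, df=n, nc=2 * lam) if lam > 0 else stats.chi2.sf(c, df=n)
    print(lam, round(tail, 4))   # strictly increasing in lambda
```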

5.4 Noncentral F distribution

RECALL: A univariate random variable W is said to have a (central) F distribution with

degrees of freedom n1 > 0 and n2 > 0 if it has the pdf

fW (w|n1, n2) = [Γ{(n1+n2)/2} (n1/n2)^{n1/2} w^{(n1−2)/2}] / [Γ(n1/2) Γ(n2/2) {1 + n1w/n2}^{(n1+n2)/2}] I(w > 0).

We write W ∼ Fn1,n2 . The moment generating function for the F distribution does not

exist in closed form.

RECALL: If U1 and U2 are independent central χ2 random variables with degrees of

freedom n1 and n2, respectively, then

W = (U1/n1)/(U2/n2) ∼ Fn1,n2 .


TERMINOLOGY : A univariate random variable W is said to have a noncentral F

distribution with degrees of freedom n1 > 0 and n2 > 0 and noncentrality parameter

λ > 0 if it has the pdf

fW (w|n1, n2, λ) = ∑_{j=0}^∞ (e−λλj/j!) [Γ{(n1+2j+n2)/2} (n1/n2)^{(n1+2j)/2} w^{(n1+2j−2)/2}] / [Γ{(n1+2j)/2} Γ(n2/2) {1 + n1w/n2}^{(n1+2j+n2)/2}] I(w > 0).

We write W ∼ Fn1,n2(λ). When λ = 0, the noncentral F distribution reduces to the

central F distribution.

MEAN AND VARIANCE : If W ∼ Fn1,n2(λ), then

    E(W ) = {n2/(n2 − 2)}(1 + 2λ/n1)

and

    var(W ) = {2n2^2/(n1^2(n2 − 2))} [ (n1 + 2λ)^2/{(n2 − 2)(n2 − 4)} + (n1 + 4λ)/(n2 − 4) ].

E(W ) exists only when n2 > 2 and var(W ) exists only when n2 > 4. The moment

generating function for the noncentral F distribution does not exist in closed form.

Result 5.14. If U1 and U2 are independent random variables with U1 ∼ χ2n1(λ) and U2 ∼ χ2n2, then

    W = (U1/n1)/(U2/n2) ∼ Fn1,n2(λ).

Proof. See Searle, pp 51-52.
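ILLUSTRATION : A simulation sketch of Result 5.14 in Python/SciPy; degrees of freedom, λ, and the cutoff c are arbitrary, and scipy's ncf noncentrality again corresponds to 2λ here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n1, n2, lam, reps = 3, 20, 2.0, 200_000

U1 = stats.ncx2.rvs(df=n1, nc=2 * lam, size=reps, random_state=rng)  # noncentral chi^2
U2 = rng.chisquare(n2, size=reps)                                    # independent central chi^2
W = (U1 / n1) / (U2 / n2)

c = 3.10
print((W > c).mean(), stats.ncf.sf(c, dfn=n1, dfd=n2, nc=2 * lam))
```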

Result 5.15. Suppose that W ∼ Fn1,n2(λ). For fixed n1, n2, and c > 0, the quantity

Pλ(W > c) is a strictly increasing function of λ. That is, if W1 ∼ Fn1,n2(λ1) and

W2 ∼ Fn1,n2(λ2), where λ2 > λ1, then pr(W2 > c) > pr(W1 > c); i.e., W2 >st W1.

REMARK : The fact that the noncentral F distribution tends to be larger than the

central F distribution is the basis for many of the tests used in linear models. Typically,

test statistics are used that have a central F distribution if the null hypothesis is true

and a noncentral F distribution if the null hypothesis is not true. Since the noncentral

F distribution tends to be larger, large values of the test statistic are consistent with


the alternative hypothesis. Thus, the form of an appropriate rejection region is to reject

H0 for large values of the test statistic. The power is simply the probability of the rejection

region (defined under H0) when the probability distribution is noncentral F . Noncentral

F distributions are available in most software packages.

5.5 Distributions of quadratic forms

GOAL: We would like to find the distribution of Y′AY, where Y ∼ Np(µ,V). We will

obtain this distribution by taking steps. Result 5.16 is a very small step. Result 5.17 is

a large step, and Result 5.18 is the finish line. There is no harm in assuming that A is

symmetric.

Result 5.16. Suppose that Y ∼ Np(µ, I) and define

    W = Y′Y = Y′IY = ∑_{i=1}^p Yi^2.

Result 5.11 says Yi^2 ∼ χ21(µi^2/2), for i = 1, 2, ..., p. Thus, from Result 5.12,

    Y′Y = ∑_{i=1}^p Yi^2 ∼ χ2p(µ′µ/2).

LEMMA: The p × p symmetric matrix A is idempotent of rank s if and only if there

exists a p× s matrix P1 such that (a) A = P1P′1 and (b) P′1P1 = Is.

Proof. (⇐=) Suppose that A = P1P′1 and P′1P1 = Is. Clearly, A is symmetric. Also,

A2 = P1P′1P1P′1 = P1P′1 = A.

Note also that r(A) = tr(A) = tr(P1P′1) = tr(P′1P1) = tr(Is) = s. Now, to go the other

way (=⇒), suppose that A is a symmetric, idempotent matrix of rank s. The spectral

decomposition of A is given by A = QDQ′, where D = diag(λ1, λ2, ..., λp) and Q is

orthogonal. Since A is idempotent, we know that s of the eigenvalues λ1, λ2, ..., λp are

equal to 1 and other p− s eigenvalues are equal to 0. Thus, we can write

A = QDQ′ = (P1 P2) [ Is  0 ; 0  0 ] (P1 P2)′ = P1P′1.


Thus, we have shown that (a) holds. To show that (b) holds, note that because Q is

orthogonal,

Ip = Q′Q = (P1 P2)′(P1 P2) = [ P′1P1  P′1P2 ; P′2P1  P′2P2 ].

It is easy to convince yourself that P′1P1 is an identity matrix. Its dimension is s × s

because tr(P′1P1) = tr(P1P′1) = tr(A) = r(A), which equals s by assumption.

Result 5.17. Suppose that Y ∼ Np(µ, I). If A is idempotent of rank s, then Y′AY ∼ χ2s(λ), where λ = (1/2)µ′Aµ.

Proof. Suppose that Y ∼ Np(µ, I) and that A is idempotent of rank s. By the last

lemma, we know that A = P1P′1, where P1 is p× s, and P′1P1 = Is. Thus,

Y′AY = Y′P1P′1Y = X′X,

where X = P′1Y. Since Y ∼ Np(µ, I), and since X = P′1Y is a linear combination of Y,

we know that

X ∼ Ns(P′1µ,P′1IP1) ∼ Ns(P′1µ, Is).

Result 5.16 says that Y′AY = X′X ∼ χ2s{(P′1µ)′(P′1µ)/2}. But,

    λ ≡ (1/2)(P′1µ)′P′1µ = (1/2)µ′P1P′1µ = (1/2)µ′Aµ.

Result 5.18. Suppose that Y ∼ Np(µ,V), where r(V) = p. If AV is idempotent of rank s, then Y′AY ∼ χ2s(λ), where λ = (1/2)µ′Aµ.

Proof. Since Y ∼ Np(µ,V), and since X = V−1/2Y is a linear combination of Y, we

know that

X ∼ Np(V−1/2µ,V−1/2VV−1/2) ∼ Np(V−1/2µ, Ip).

Now,

Y′AY = X′V1/2AV1/2X = X′BX,

where B = V1/2AV1/2. Recall that V1/2 is the symmetric square root of V. From Result

5.17, we know that Y′AY = X′BX ∼ χ2s(λ) if B is idempotent of rank s. However, note

that

r(B) = r(V1/2AV1/2) = r(A) = r(AV) = s,


since AV has rank s (by assumption) and V and V1/2 are both nonsingular. Also, AV

is idempotent by assumption so that

AV = AVAV ⇒ A = AVA

⇒ V1/2AV1/2 = V1/2AV1/2V1/2AV1/2

⇒ B = BB.

Thus, B is idempotent of rank s. This implies that Y′AY = X′BX ∼ χ2s(λ). Noting

that

λ = (1/2)(V−1/2µ)′B(V−1/2µ) = (1/2)µ′V−1/2V1/2AV1/2V−1/2µ = (1/2)µ′Aµ

completes the argument.

Example 5.2. Suppose that Y = (Y1, Y2, ..., Yn)′ ∼ Nn(µ1, σ2I), so that µ = µ1 and

V = σ2I, where 1 is n× 1 and I is n× n. The statistic

(n− 1)S2 = ∑_{i=1}^n (Yi − Ȳ )2 = Y′(I− n−1J)Y = Y′AY,

where A = I− n−1J. Thus, consider the quantity

(n− 1)S2/σ2 = Y′BY,

where B = σ−2A = σ−2(I− n−1J). Note that BV = σ−2(I− n−1J)σ2I = I− n−1J = A,

which is idempotent with rank

r(BV) = tr(BV) = tr(A) = tr(I− n−1J) = n− n−1n = n− 1.

Result 5.18 says that (n− 1)S2/σ2 = Y′BY ∼ χ2n−1(λ), where λ = (1/2)µ′Bµ. However,

    λ = (1/2)µ′Bµ = (1/2)(µ1)′σ−2(I− n−1J)µ1 = 0,

since µ1 ∈ C(1) and I− n−1J is the ppm onto C(1)⊥. Thus,

(n− 1)S2/σ2 = Y′BY ∼ χ2n−1,

a central χ2 distribution with n− 1 degrees of freedom.
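ILLUSTRATION : A numerical sketch of Example 5.2 in Python (µ, σ, and n are arbitrary choices); it compares quantiles of the simulated quadratic form with the central χ2n−1 distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, mu, sigma, reps = 8, 5.0, 2.0, 100_000

Y = rng.normal(mu, sigma, size=(reps, n))
S2 = Y.var(axis=1, ddof=1)          # sample variance with divisor n - 1
Q = (n - 1) * S2 / sigma**2         # the quadratic form Y'BY

for p in [0.25, 0.5, 0.9]:
    print(p, round(np.quantile(Q, p), 3), round(stats.chi2.ppf(p, df=n - 1), 3))
```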


Example 5.3. Consider the general linear model

Y = Xβ + ε,

where X is n× p with rank r ≤ p and ε ∼ Nn(0, σ2I). Let PX = X(X′X)−X′ denote the

perpendicular projection matrix onto C(X). We know that Y ∼ Nn(Xβ, σ2I). Consider

the (uncorrected) partitioning of the sums of squares given by

Y′Y = Y′PXY + Y′(I−PX)Y.

• We first consider the residual sum of squares Y′(I−PX)Y. Dividing this quantity

by σ2, we get

Y′(I−PX)Y/σ2 = Y′σ−2(I−PX)Y = Y′AY,

where A = σ−2(I−PX). With V = σ2I, note that

AV = σ−2(I−PX)σ2I = I−PX,

an idempotent matrix with rank

r(I−PX) = tr(I−PX) = tr(I)− tr(PX) = tr(I)− r(PX) = n− r,

since r(PX) = r(X) = r, by assumption. Result 5.18 says that

Y′(I−PX)Y/σ2 = Y′AY ∼ χ2n−r(λ),

where λ = (1/2)µ′Aµ. However,

    λ = (1/2)µ′Aµ = (1/2)(Xβ)′σ−2(I−PX)Xβ = 0,

because Xβ ∈ C(X) and I−PX projects onto the orthogonal complement N (X′).

Thus, we have shown that Y′(I−PX)Y/σ2 ∼ χ2n−r, a central χ2 distribution with

n− r degrees of freedom.

• Now, we turn our attention to the (uncorrected) model sum of squares Y′PXY.

Dividing this quantity by σ2, we get

Y′PXY/σ2 = Y′(σ−2PX)Y = Y′BY,


where B = σ−2PX. With V = σ2I, note that

BV = σ−2PXσ2I = PX,

an idempotent matrix with rank r(PX) = r(X) = r. Result 5.18 says that

Y′PXY/σ2 = Y′BY ∼ χ2r(λ),

where λ = (1/2)µ′Bµ. Note that

    λ = (1/2)µ′Bµ = (1/2)(Xβ)′σ−2PXXβ = (Xβ)′Xβ/2σ2.

That is, Y′PXY/σ2 has a noncentral χ2 distribution with r degrees of freedom and

noncentrality parameter λ = (Xβ)′Xβ/2σ2.

In the last calculation, note that λ = (Xβ)′Xβ/2σ2 = 0 iff Xβ = 0. In this case, both

quadratic forms Y′(I−PX)Y/σ2 and Y′PXY/σ2 have central χ2 distributions.
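ILLUSTRATION : A small Python/NumPy sketch of the two quadratic forms in Example 5.3; the design (a deliberately rank-deficient one-way layout), β, and σ are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 12
group = np.repeat([0, 1, 2], 4)
X = np.column_stack([np.ones(n)] + [(group == g).astype(float) for g in range(3)])
r = np.linalg.matrix_rank(X)             # r = 3 here, although X has p = 4 columns

PX = X @ np.linalg.pinv(X.T @ X) @ X.T
I = np.eye(n)

# (I - PX) and PX are idempotent with ranks n - r and r.
print(np.allclose((I - PX) @ (I - PX), I - PX), np.linalg.matrix_rank(I - PX), n - r)
print(np.allclose(PX @ PX, PX), np.linalg.matrix_rank(PX), r)

# Noncentrality of Y'PX Y / sigma^2 is (Xb)'Xb / (2 sigma^2); for the residual
# quadratic form it is 0 because (I - PX) X b = 0.
beta, sigma = np.array([1.0, 0.5, -0.5, 0.0]), 1.0
Xb = X @ beta
print(np.allclose((I - PX) @ Xb, 0), (Xb @ Xb) / (2 * sigma**2))
```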

5.6 Independence of quadratic forms

GOALS : In this subsection, we consider two problems. With Y ∼ N (µ,V), we would

like to establish sufficient conditions for

(a) Y′AY and BY to be independent, and

(b) Y′AY and Y′BY to be independent.

Result 5.19. Suppose that Y ∼ Np(µ,V). If BVA = 0, then Y′AY and BY are

independent.

Proof. We may assume that A is symmetric. Write A = QDQ′, where D =

diag(λ1, λ2, ..., λp) and Q is orthogonal. We know that s ≤ p of the eigenvalues

λ1, λ2, ..., λp are nonzero where s = r(A). We can thus write

A = QDQ′ = (P1 P2) [ D1  0 ; 0  0 ] (P1 P2)′ = P1D1P′1,


where D1 = diag(λ1, λ2, ..., λs). Thus,

Y′AY = Y′P1D1P′1Y = X′D1X,

where X = P′1Y. Notice that

    [ BY ; X ] = [ B ; P′1 ] Y ∼ N( [ Bµ ; P′1µ ], [ BVB′  BVP1 ; P′1VB′  P′1VP1 ] ).

Suppose that BVA = 0. Then,

0 = BVA = BVP1D1P′1 = BVP1D1P′1P1.

But, because Q is orthogonal,

Ip = Q′Q = (P1 P2)′(P1 P2) = [ P′1P1  P′1P2 ; P′2P1  P′2P2 ].

This implies that P′1P1 = Is, and, thus,

0 = BVP1D1P′1P1 = BVP1D1 = BVP1D1D1^{−1} = BVP1.

Therefore, cov(BY,X) = 0, that is, X and BY are independent. But, Y′AY = X′D1X,

a function of X. Thus, Y′AY and BY are independent as well.

Example 5.4. Suppose that Y = (Y1, Y2, ..., Yn)′ ∼ Nn(µ1, σ2I), where 1 is n× 1 and I

is n× n, so that µ = µ1 and V = σ2I. Recall that

(n− 1)S2 = Y′(I− n−1J)Y = Y′AY,

where A = I− n−1J. Also,

Ȳ = n−11′Y = BY,

where B = n−11′. These two statistics are independent because

BVA = n−11′σ2I(I− n−1J) = σ2n−11′(I− n−1J) = 0,

because I − n−1J is the ppm onto C(1)⊥. Since functions of independent statistics are also independent, Ȳ and S2 are also independent.


Result 5.20. Suppose that Y ∼ Np(µ,V). If BVA = 0, then Y′AY and Y′BY are

independent.

Proof. Write A and B in their spectral decompositions; that is, write

A = PDP′ = (P1 P2) [ D1  0 ; 0  0 ] (P1 P2)′ = P1D1P′1,

where D1 = diag(λ1, λ2, ..., λs) and s = r(A). Similarly, write

B = QRQ′ = (Q1 Q2) [ R1  0 ; 0  0 ] (Q1 Q2)′ = Q1R1Q′1,

where R1 = diag(γ1, γ2, ..., γt) and t = r(B). Since P and Q are orthogonal, this implies

that P′1P1 = Is and Q′1Q1 = It. Suppose that BVA = 0. Then,

0 = BVA = Q1R1Q′1VP1D1P′1
  = Q′1Q1R1Q′1VP1D1P′1P1
  = R1Q′1VP1D1
  = R1^{−1}R1Q′1VP1D1D1^{−1}
  = Q′1VP1
  = cov(Q′1Y, P′1Y).

Now,

    [ P′1Y ; Q′1Y ] ∼ N( [ P′1µ ; Q′1µ ], [ P′1VP1  0 ; 0  Q′1VQ1 ] ).

That is, P′1Y and Q′1Y are jointly normal and uncorrelated; thus, they are independent.

So are Y′P1D1P′1Y and Y′Q1R1Q′1Y. But A = P1D1P′1 and B = Q1R1Q′1, so we are done.

Example 5.5. Consider the general linear model

Y = Xβ + ε,

where X is n × p with rank r ≤ p and ε ∼ Nn(0, σ2I). Let PX = X(X′X)−X′ denote

the perpendicular projection matrix onto C(X). We know that Y ∼ Nn(Xβ, σ2I). In


Example 5.3, we showed that

Y′(I−PX)Y/σ2 ∼ χ2n−r,

and that

Y′PXY/σ2 ∼ χ2r(λ),

where λ = (Xβ)′Xβ/2σ2. Note that

Y′(I−PX)Y/σ2 = Y′σ−2(I−PX)Y = Y′AY,

where A = σ−2(I−PX). Also,

Y′PXY/σ2 = Y′(σ−2PX)Y = Y′BY,

where B = σ−2PX. Applying Result 5.20, we have

BVA = σ−2PXσ2Iσ−2(I−PX) = 0,

that is, Y′(I − PX)Y/σ2 and Y′PXY/σ2 are independent quadratic forms. Thus, the

statistic

F = {Y′PXY/r} / {Y′(I−PX)Y/(n− r)} = {σ−2Y′PXY/r} / {σ−2Y′(I−PX)Y/(n− r)} ∼ Fr,n−r{(Xβ)′Xβ/2σ2},

a noncentral F distribution with degrees of freedom r (numerator) and n− r (denominator) and noncentrality parameter λ = (Xβ)′Xβ/2σ2.

OBSERVATIONS :

• Note that if Xβ = 0, then F ∼ Fr,n−r since the noncentrality parameter λ = 0.

• On the other hand, as the length of Xβ gets larger, so does λ. This shifts the

noncentral Fr,n−r{(Xβ)′Xβ/2σ2} distribution to the right, because the noncentral

F distribution is stochastically increasing in its noncentrality parameter.

• Therefore, large values of F are consistent with large values of ||Xβ||.
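ILLUSTRATION : A simulation sketch of the F statistic in Example 5.5 (design, β, σ, and cutoff arbitrary); the simulated tail probability is compared with the noncentral F, recalling that scipy's nc equals 2λ.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta, sigma = np.array([0.3, 0.4]), 1.0
r = np.linalg.matrix_rank(X)

PX = X @ np.linalg.pinv(X.T @ X) @ X.T
I = np.eye(n)
lam = (X @ beta) @ (X @ beta) / (2 * sigma**2)   # (Xb)'Xb / (2 sigma^2)

reps = 20_000
Fs = np.empty(reps)
for b in range(reps):
    Y = X @ beta + sigma * rng.normal(size=n)
    Fs[b] = (Y @ PX @ Y / r) / (Y @ (I - PX) @ Y / (n - r))

c = 4.0
print((Fs > c).mean(), stats.ncf.sf(c, dfn=r, dfd=n - r, nc=2 * lam))
```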


5.7 Cochran’s Theorem

REMARK : An important general notion in linear models is that sums of squares like

Y′PXY and Y′Y can be “broken down” into sums of squares of smaller pieces. We now

discuss Cochran’s Theorem (Result 5.21), which serves to explain why this is possible.

Result 5.21. Suppose that Y ∼ Nn(µ, σ2I). Suppose that A1,A2, ...,Ak are n × n symmetric and idempotent matrices, where r(Ai) = si, for i = 1, 2, ..., k. If A1 + A2 + · · ·+ Ak = In, then Y′A1Y/σ2, Y′A2Y/σ2, ..., Y′AkY/σ2 follow independent χ2si(λi) distributions, where λi = µ′Aiµ/2σ2, for i = 1, 2, ..., k, and ∑_{i=1}^k si = n.

Outline of proof. If A1 + A2 + · · ·+ Ak = In, then Ai idempotent implies that both (a) AiAj = 0, for i ≠ j, and (b) ∑_{i=1}^k si = n hold. That Y′AiY/σ2 ∼ χ2si(λi) follows from Result 5.18 with V = σ2I. Because AiAj = 0, Result 5.20 with V = σ2I guarantees that Y′AiY/σ2 and Y′AjY/σ2 are independent.

IMPORTANCE : We now show how Cochran's Theorem can be used to deduce the joint distribution of the sums of squares in an analysis of variance. Suppose that we partition the design matrix X and the parameter vector β in Y = Xβ + ε into k + 1 parts, so that

    Y = (X0 X1 · · · Xk)(β′0, β′1, ..., β′k)′,

where the dimensions of Xi and βi are n× pi and pi × 1, respectively, and ∑_{i=0}^k pi = p. We can now write this as a k + 1 part model (the full model):

    Y = X0β0 + X1β1 + · · ·+ Xkβk + ε.

Now consider fitting each of the k submodels

Y = X0β0 + ε

Y = X0β0 + X1β1 + ε

...

Y = X0β0 + X1β1 + · · ·+ Xk−1βk−1 + ε,


and let R(β0,β1, ...,βi) denote the regression (model) sum of squares from fitting the

ith submodel, for i = 0, 1, ..., k; that is,

R(β0,β1, ...,βi) = Y′(X0 X1 · · · Xi)[(X0 X1 · · · Xi)′(X0 X1 · · · Xi)]−(X0 X1 · · · Xi)′Y = Y′PX∗iY,

where

    PX∗i = (X0 X1 · · · Xi)[(X0 X1 · · · Xi)′(X0 X1 · · · Xi)]−(X0 X1 · · · Xi)′

is the perpendicular projection matrix onto C(X∗i ), where X∗i = (X0 X1 · · · Xi) and i = 0, 1, ..., k. Clearly,

    C(X∗0) ⊂ C(X∗1) ⊂ · · · ⊂ C(X∗k−1) ⊂ C(X).

We can now partition the total sums of squares as

Y′Y = R(β0) + [R(β0,β1)−R(β0)] + [R(β0,β1,β2)−R(β0,β1)] + · · ·

+ [R(β0,β1, ...,βk)−R(β0,β1, ...,βk−1)] + [Y′Y −R(β0,β1, ...,βk)]

= Y′A0Y + Y′A1Y + Y′A2Y + · · ·+ Y′AkY + Y′Ak+1Y,

where

A0 = PX∗0,
Ai = PX∗i − PX∗i−1, for i = 1, 2, ..., k − 1,
Ak = PX − PX∗k−1,
Ak+1 = I−PX.

Note that A0 + A1 + A2 + · · ·+ Ak+1 = I. Note also that the Ai matrices are symmetric,

for i = 0, 1, ..., k + 1, and that

A2i = (PX∗i − PX∗i−1)(PX∗i − PX∗i−1) = P2X∗i − PX∗iPX∗i−1 − PX∗i−1PX∗i + P2X∗i−1 = PX∗i − PX∗i−1 = Ai,

for i = 1, 2, ..., k, since PX∗iPX∗i−1 = PX∗i−1 and PX∗i−1PX∗i = PX∗i−1. Thus, Ai is idempotent for i = 1, 2, ..., k. However, clearly A0 = PX∗0 = X0(X′0X0)−X′0 and Ak+1 = I−PX


are also idempotent. Let

s0 = r(A0) = r(X0)

si = r(Ai) = tr(Ai) = tr(PX∗i)− tr(PX∗i−1) = r(PX∗i)− r(PX∗i−1), i = 1, 2, ..., k,

sk+1 = r(Ak+1) = n− r(X).

It is easy to see that ∑_{i=0}^{k+1} si = n.

APPLICATION : Consider the Gauss-Markov model Y = Xβ + ε, where ε ∼ Nn(0, σ2I)

so that Y ∼ Nn(Xβ, σ2I). Cochran’s Theorem applies and we have

(1/σ2)Y′A0Y ∼ χ2s0 [λ0 = (Xβ)′A0Xβ/2σ2]

(1/σ2)Y′AiY ∼ χ2si [λi = (Xβ)′AiXβ/2σ2], i = 1, 2, ..., k,

(1/σ2)Y′Ak+1Y = (1/σ2)Y′(I−PX)Y ∼ χ2sk+1,

where sk+1 = n − r(X). Note that the last quadratic follows a central χ2 distribution because λk+1 = (Xβ)′(I−PX)Xβ/2σ2 = 0. Cochran's Theorem also guarantees that the quadratic forms Y′A0Y/σ2, Y′A1Y/σ2, ..., Y′AkY/σ2, Y′Ak+1Y/σ2 are independent.

ANOVA TABLE : The quadratic forms Y′AiY, for i = 0, 1, ..., k + 1, and the degrees of

freedom si = r(Ai) are often presented in the following ANOVA table:

Source                     df     SS                                  Noncentrality
β0                         s0     R(β0)                               λ0 = (Xβ)′A0Xβ/2σ2
β1 (after β0)              s1     R(β0,β1)−R(β0)                      λ1 = (Xβ)′A1Xβ/2σ2
β2 (after β0,β1)           s2     R(β0,β1,β2)−R(β0,β1)                λ2 = (Xβ)′A2Xβ/2σ2
...                        ...    ...                                 ...
βk (after β0, ...,βk−1)    sk     R(β0, ...,βk)−R(β0, ...,βk−1)       λk = (Xβ)′AkXβ/2σ2
Residual                   sk+1   Y′Y −R(β0, ...,βk)                  λk+1 = 0
Total                      n      Y′Y                                 (Xβ)′Xβ/2σ2

Note that if X0 = 1, then β0 = µ, R(β0) = Y′P1Y = nȲ 2. The R(·) notation

will come in handy when we talk about hypothesis testing later. The sums of squares


R(β0), R(β0,β1) − R(β0), ..., R(β0, ...,βk) − R(β0, ...,βk−1) are called the sequential

sums of squares. These correspond to the Type I sums of squares printed out by SAS

in the ANOVA and GLM procedures. We will also use the notation

R(βi|β0,β1, ...,βi−1) = R(β0,β1, ...,βi)−R(β0,β1, ...,βi−1).

Example 5.6. Consider the one-way (fixed effects) analysis of variance model

Yij = µ+ αi + εij,

for i = 1, 2, ..., a and j = 1, 2, ..., ni, where the εij are iid N (0, σ2) random variables. In matrix

form, Y, X, and β are

Yn×1 = (Y11, Y12, ..., Yana)′,

    Xn×p = [ 1n1  1n1  0n1  · · ·  0n1
             1n2  0n2  1n2  · · ·  0n2
             ...  ...  ...   ...   ...
             1na  0na  0na  · · ·  1na ],

and βp×1 = (µ, α1, α2, ..., αa)′, where p = a+ 1 and n = ∑i ni. Note that we can write

X = (X0 X1) ,

where X0 = 1,

X1 = [ 1n1  0n1  · · ·  0n1
       0n2  1n2  · · ·  0n2
       ...  ...   ...   ...
       0na  0na  · · ·  1na ],

and β = (β0,β′1)′, where β0 = µ and β1 = (α1, α2, ..., αa)′. That is, we can express this

model in the form

Y = X0β0 + X1β1 + ε.

The submodel is

Y = X0β0 + ε,


where X0 = 1 and β0 = µ. We have

A0 = P1

A1 = PX −P1

A2 = I−PX.

These matrices are clearly symmetric and idempotent. Also, note that

s0 + s1 + s2 = r(P1) + r(PX −P1) + r(I−PX) = 1 + (a− 1) + (n− a) = n

and that A0 + A1 + A2 = I. Therefore, Cochran’s Theorem applies and we have

(1/σ2)Y′P1Y ∼ χ21 [λ0 = (Xβ)′P1Xβ/2σ2]

(1/σ2)Y′(PX −P1)Y ∼ χ2a−1 [λ1 = (Xβ)′(PX −P1)Xβ/2σ2]

(1/σ2)Y′(I−PX)Y ∼ χ2n−a.

Cochran’s Theorem also guarantees the quadratic forms Y′P1Y/σ2,Y′(PX −P1)Y/σ2,

and Y′(I−PX)Y/σ2 are independent. The sums of squares, using our new notation, are

Y′P1Y = R(µ)

Y′(PX −P1)Y = R(µ, α1, ..., αa)−R(µ) ≡ R(α1, ..., αa|µ)

Y′(I−PX)Y = Y′Y −R(µ, α1, ..., αa).

The ANOVA table for this one-way analysis of variance model is

Source                   df      SS                          Noncentrality
µ                        1       R(µ)                        λ0 = (Xβ)′P1Xβ/2σ2
α1, ..., αa (after µ)    a− 1    R(µ, α1, ..., αa)−R(µ)      λ1 = (Xβ)′(PX −P1)Xβ/2σ2
Residual                 n− a    Y′Y −R(µ, α1, ..., αa)      0
Total                    n       Y′Y                         (Xβ)′Xβ/2σ2

F STATISTIC : Because

    (1/σ2)Y′(PX −P1)Y ∼ χ2a−1(λ1)

and

    (1/σ2)Y′(I−PX)Y ∼ χ2n−a,

and because these two quadratic forms are independent, it follows that

    F = {Y′(PX −P1)Y/(a− 1)} / {Y′(I−PX)Y/(n− a)} ∼ Fa−1,n−a(λ1).

This is the usual F statistic to test H0 : α1 = α2 = · · · = αa = 0.

• If H0 is true, then Xβ ∈ C(1) and λ1 = (Xβ)′(PX−P1)Xβ/2σ2 = 0 since PX−P1

is the ppm onto C(1)⊥C(X). In this case, F ∼ Fa−1,n−a, a central F distribution. A

level α rejection region is therefore RR = {F : F > Fa−1,n−a,α}.

• If H0 is not true, then Xβ /∈ C(1) and λ1 = (Xβ)′(PX − P1)Xβ/2σ2 > 0. In

this case, F ∼ Fa−1,n−a(λ1), which is stochastically larger than the central Fa−1,n−a

distribution. Therefore, if H0 is not true, we would expect F to be large.

EXPECTED MEAN SQUARES : We already know that

MSE = (n− a)−1Y′(I−PX)Y

is an unbiased estimator of σ2, i.e., E(MSE) = σ2. Recall that if V ∼ χ2n(λ), then

E(V ) = n+ 2λ. Therefore,

E{(1/σ2)Y′(PX −P1)Y} = (a− 1) + (Xβ)′(PX −P1)Xβ/σ2

and

    E(MSR) = E{(a− 1)−1Y′(PX −P1)Y} = σ2 + (Xβ)′(PX −P1)Xβ/(a− 1) = σ2 + ∑_{i=1}^a ni(αi − ᾱ+)2/(a− 1),

where ᾱ+ = n−1∑_{i=1}^a niαi, the weighted average of the αi. Again, note that if H0 : α1 = α2 = · · · = αa = 0 is true, then

MSR is also an unbiased estimator of σ2. Therefore, values of

F = {Y′(PX −P1)Y/(a− 1)} / {Y′(I−PX)Y/(n− a)}

should be close to 1 when H0 is true and larger than 1 otherwise.
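ILLUSTRATION : A minimal Python/NumPy sketch of the one-way ANOVA quantities in Example 5.6; the cell sizes and data values below are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
a, ni = 3, np.array([4, 5, 6])
n = ni.sum()
groups = np.repeat(np.arange(a), ni)
Y = np.concatenate([rng.normal(loc=m, scale=1.0, size=k)
                    for m, k in zip([10.0, 12.0, 9.0], ni)])

one = np.ones((n, 1))
X1 = np.column_stack([(groups == g).astype(float) for g in range(a)])
X = np.column_stack([one, X1])                  # X = (X0 X1), X0 = 1

P1 = one @ one.T / n                            # ppm onto C(1)
PX = X @ np.linalg.pinv(X.T @ X) @ X.T          # ppm onto C(X)

SS_treat = Y @ (PX - P1) @ Y                    # R(alpha_1,...,alpha_a | mu)
SS_resid = Y @ (np.eye(n) - PX) @ Y
F = (SS_treat / (a - 1)) / (SS_resid / (n - a))
print(F)
```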


6 Statistical Inference

Complementary reading from Monahan: Chapter 6 (and revisit Sections 3.9 and 4.7).

6.1 Estimation

PREVIEW : Consider the general linear model Y = Xβ + ε, where X is n × p with

rank r ≤ p and ε ∼ Nn(0, σ2I). Note that this is the usual Gauss-Markov model with

the additional assumption of normality. With this additional assumption, we can now

rigorously pursue questions that deal with statistical inference. We start by examining

minimum variance unbiased estimation and maximum likelihood estimation.

SUFFICIENCY : Under the assumptions stated above, we know that Y ∼ Nn(Xβ, σ2I).

Set θ = (β′, σ2)′. The pdf of Y, for all y ∈ Rn, is given by

fY(y|θ) = (2π)−n/2(σ2)−n/2 exp{−(y −Xβ)′(y −Xβ)/2σ2}

        = (2π)−n/2(σ2)−n/2 exp{−y′y/2σ2 + y′Xβ/σ2 − (Xβ)′Xβ/2σ2}

        = (2π)−n/2(σ2)−n/2 exp{−(Xβ)′Xβ/2σ2} exp{−y′y/2σ2 + β′X′y/σ2}

        = h(y)c(θ) exp{w1(θ)t1(y) + w2(θ)t2(y)},

where h(y) = (2π)−n/2I(y ∈ Rn), c(θ) = (σ2)−n/2 exp{−(Xβ)′Xβ/2σ2}, and

    w1(θ) = −1/(2σ2),  t1(y) = y′y
    w2(θ) = β/σ2,      t2(y) = X′y,

that is, Y has pdf in the exponential family (see Casella and Berger, Chapter 3). The

family is full rank (i.e., it is not curved), so we know that T(Y) = (Y′Y,X′Y) is

a complete sufficient statistic for θ. We also know that minimum variance unbiased

estimators (MVUEs) of functions of θ are unbiased functions of T(Y).

Result 6.1. Consider the general linear model Y = Xβ+ ε, where X is n× p with rank

r ≤ p and ε ∼ Nn(0, σ2I). The MVUE for an estimable function Λ′β is given by Λ′β̂,


where β̂ = (X′X)−X′Y is any solution to the normal equations. The MVUE for σ2 is

    MSE = (n− r)−1Y′(I−PX)Y = (n− r)−1(Y′Y − β̂′X′Y),

where PX is the perpendicular projection matrix onto C(X).

Proof. Both Λ′β̂ and MSE are unbiased estimators of Λ′β and σ2, respectively. These

estimators are also functions of T(Y) = (Y′Y,X′Y), the complete sufficient statistic.

Thus, each estimator is the MVUE for its expected value.

MAXIMUM LIKELIHOOD : Consider the general linear model Y = Xβ + ε, where X is

n× p with rank r ≤ p and ε ∼ Nn(0, σ2I). The likelihood function for θ = (β′, σ2)′ is

L(θ|y) = L(β, σ2|y) = (2π)−n/2(σ2)−n/2 exp{−(y −Xβ)′(y −Xβ)/2σ2}.

Maximum likelihood estimators for β and σ2 are found by maximizing

log L(β, σ2|y) = −(n/2) log(2π)− (n/2) log σ2 − (y −Xβ)′(y −Xβ)/2σ2

with respect to β and σ2. For every value of σ2, maximizing the loglikelihood is the same

as minimizing Q(β) = (y −Xβ)′(y −Xβ); that is, the least squares estimator

    β̂ = (X′X)−X′Y

is also an MLE. Now substitute Q(β̂) = (y −Xβ̂)′(y −Xβ̂) = y′(I−PX)y into the loglikelihood and differentiate with respect to σ2. The MLE of σ2 is

    σ̂2MLE = n−1Y′(I−PX)Y.

Note that the MLE for σ2 is biased. The MLE is rarely used in practice; MSE is the

conventional estimator for σ2.

INVARIANCE : Under the normal GM model, the MLE for an estimable function Λ′β is Λ′β̂, where β̂ is any solution to the normal equations. This is true because of the invariance property of maximum likelihood estimators (see, e.g., Casella and Berger, Chapter 7). If Λ′β is estimable, recall that Λ′β̂ is unique even if β̂ is not.
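ILLUSTRATION : A minimal Python/NumPy sketch of the estimators in Section 6.1; the simulated design, β, and σ are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta, sigma = np.array([1.0, -2.0, 0.5]), 1.3
Y = X @ beta + sigma * rng.normal(size=n)
r = np.linalg.matrix_rank(X)

XtX_ginv = np.linalg.pinv(X.T @ X)        # a generalized inverse (X'X)^-
beta_hat = XtX_ginv @ X.T @ Y             # least squares solution / MLE
PX = X @ XtX_ginv @ X.T
resid_ss = Y @ (np.eye(n) - PX) @ Y

MSE = resid_ss / (n - r)                  # MVUE of sigma^2
sigma2_mle = resid_ss / n                 # biased MLE of sigma^2
print(beta_hat, MSE, sigma2_mle)
```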


6.2 Testing models

PREVIEW : We now provide a general discussion on testing reduced versus full models

within a Gauss Markov linear model framework. Assuming normality will allow us to

derive the sampling distribution of the resulting test statistic.

PROBLEM : Consider the linear model

Y = Xβ + ε,

where r(X) = r ≤ p, E(ε) = 0, and cov(ε) = σ2I. Note that these are our usual GM

model assumptions. For the purposes of this discussion, we assume that this model (the

full model) is a “correct” model for the data. Consider also the linear model

Y = Wγ + ε,

where E(ε) = 0, cov(ε) = σ2I and C(W) ⊂ C(X). We call this a reduced model because

the estimation space is smaller than in the full model. Our goal is to test whether or not

the reduced model is also correct.

• If the reduced model is also correct, there is no reason not to use it. Smaller

models are easier to interpret and fewer degrees of freedom are spent in estimating

σ2. Thus, there are practical and statistical advantages to using the reduced model

if it is also correct.

• Hypothesis testing in linear models essentially reduces to putting a constraint on

the estimation space C(X) in the full model. If C(W) = C(X), then the Wγ model

is a reparameterization of the Xβ model and there is nothing to test.

RECALL: Let PW and PX denote the perpendicular projection matrices onto C(W) and

C(X), respectively. Because C(W) ⊂ C(X), we know that PX − PW is the ppm onto

C(PX −PW) = C(W)⊥C(X).

GEOMETRY : In a general reduced-versus-full model testing framework, we start by as-

suming the full model Y = Xβ + ε is essentially “correct” so that E(Y) = Xβ ∈ C(X).

PAGE 97

Page 102: STAT 714 LINEAR STATISTICAL MODELS

CHAPTER 6 STAT 714, J. TEBBS

If the reduced model is also correct, then E(Y) = Wγ ∈ C(W) ⊂ C(X). Geometrically, performing a reduced-versus-full model test therefore requires the analyst to decide

whether E(Y) is more likely to be in C(W) or C(X) − C(W). Under the full model,

our estimate for E(Y) = Xβ is PXY. Under the reduced model, our estimate for

E(Y) = Wγ is PWY.

• If the reduced model is correct, then PXY and PWY are estimates of the same

thing, and PXY −PWY = (PX −PW)Y should be small.

• If the reduced model is not correct, then PXY and PWY are estimating different

things, and PXY −PWY = (PX −PW)Y should be large.

• The decision about reduced model adequacy therefore hinges on assessing whether

(PX − PW)Y is large or small. Note that (PX − PW)Y is the perpendicular

projection of Y onto C(W)⊥C(X).

MOTIVATION : An obvious measure of the size of (PX − PW)Y is its squared length,

that is,

{(PX −PW)Y}′(PX −PW)Y = Y′(PX −PW)Y.

However, the length of (PX −PW)Y is also related to the sizes of C(X) and C(W). We

therefore adjust for these sizes by using

Y′(PX −PW)Y/r(PX −PW).

We now compute the expectation of this quantity when the reduced model is/is not

correct. For notational simplicity, set r∗ = r(PX − PW). When the reduced model is

correct, then

E{Y′(PX −PW)Y/r∗} = (1/r∗)[(Wγ)′(PX −PW)Wγ + tr{(PX −PW)σ2I}]
                  = (1/r∗)σ2 tr(PX −PW)
                  = (1/r∗)(r∗σ2) = σ2.

This is correct because (PX − PW)Wγ = 0 and tr(PX − PW) = r(PX − PW) = r∗.

Thus, if the reduced model is correct, Y′(PX−PW)Y/r∗ is an unbiased estimator of σ2.


When the reduced model is not correct, then

E{Y′(PX −PW)Y/r∗} = (1/r∗)[(Xβ)′(PX −PW)Xβ + tr{(PX −PW)σ2I}]
                  = (1/r∗){(Xβ)′(PX −PW)Xβ + r∗σ2}
                  = σ2 + (Xβ)′(PX −PW)Xβ/r∗.

Thus, if the reduced model is not correct, Y′(PX − PW)Y/r∗ is estimating something

larger than σ2. Of course, σ2 is unknown, so it must be estimated. Because the full

model is assumed to be correct,

MSE = (n− r)−1Y′(I−PX)Y,

the MSE from the full model, is an unbiased estimator of σ2.

TEST STATISTIC : To test the reduced model versus the full model, we use

F = {Y′(PX −PW)Y/r∗} / MSE.

Using only our GM model assumptions (i.e., not necessarily assuming normality), we can

surmise the following:

• When the reduced model is correct, the numerator and denominator of F are both

unbiased estimators of σ2, so F should be close to 1.

• When the reduced model is not correct, the numerator in F is estimating something

larger than σ2, so F should be larger than 1. Thus, values of F much larger than

1 are not consistent with the reduced model being correct.

• Values of F much smaller than 1 may mean something drastically different; see

Christensen (2003).

OBSERVATIONS : In the numerator of F , note that

Y′(PX −PW)Y = Y′PXY −Y′PWY = Y′(PX −P1)Y −Y′(PW −P1)Y,

which is the difference in the regression (model) sum of squares, corrected or uncorrected,

from fitting the two models. Also, the term

r∗ = r(PX −PW) = tr(PX −PW) = tr(PX)− tr(PW) = r(PX)− r(PW) = r − r0,


say, where r0 = r(PW) = r(W). Thus, r∗ = r − r0 is the difference in the ranks of the

X and W matrices. This also equals the difference in the model degrees of freedom from

the two ANOVA tables.

REMARK : You will note that we have formulated a perfectly sensible strategy for testing

reduced versus full models while avoiding the question, “What is the distribution of F?”

Our entire argument is based on first and second moment assumptions, that is, E(ε) = 0

and cov(ε) = σ2I, the GM assumptions. We now address the distributional question.

DISTRIBUTION OF F: To derive the sampling distribution of

F = {Y′(PX −PW)Y/r∗} / MSE,

we require that ε ∼ Nn(0, σ2I), from which it follows that Y ∼ Nn(Xβ, σ2I). First,

we handle the denominator MSE = Y′(I − PX)Y/(n − r). In Example 5.3 (notes), we

showed that

Y′(I−PX)Y/σ2 ∼ χ2n−r.

This distributional result holds regardless of whether or not the reduced model is correct.

Now, we turn our attention to the numerator. Take A = σ−2(PX − PW) and consider

the quadratic form

Y′AY = Y′(PX −PW)Y/σ2.

With V = σ2I, the matrix

AV = σ−2(PX −PW)σ2I = PX −PW

is idempotent with rank r(PX − PW) = r∗. Therefore, we know that Y′AY ∼ χ2r∗(λ),

where

λ = (1/2)µ′Aµ = {1/(2σ2)}(Xβ)′(PX −PW)Xβ.

Now, we make the following observations:

• If the reduced model is correct and Xβ ∈ C(W), then (PX−PW)Xβ = 0 because

PX − PW projects onto C(W)⊥C(X). This means that the noncentrality parameter

λ = 0 and Y′(PX −PW)Y/σ2 ∼ χ2r∗ , a central χ2 distribution.


• If the reduced model is not correct and Xβ /∈ C(W), then (PX−PW)Xβ 6= 0 and

λ > 0. In this event, Y′(PX − PW)Y/σ2 ∼ χ2r∗(λ), a noncentral χ2 distribution

with noncentrality parameter λ.

• Regardless of whether or not the reduced model is correct, the quadratic forms

Y′(PX −PW)Y and Y′(I−PX)Y are independent since

(PX −PW)σ2I(I−PX) = σ2(PX −PW)(I−PX) = 0.

CONCLUSION : Putting this all together, we have that

F = {Y′(PX −PW)Y/r∗}/MSE = {σ−2Y′(PX −PW)Y/r∗} / {σ−2Y′(I−PX)Y/(n− r)} ∼ Fr∗,n−r(λ),

where r∗ = r − r0 and

    λ = {1/(2σ2)}(Xβ)′(PX −PW)Xβ.

If the reduced model is correct, that is, if Xβ ∈ C(W), then λ = 0 and F ∼ Fr∗,n−r, a

central F distribution. Note also that if the reduced model is correct,

E(F ) = (n− r)/(n− r − 2) ≈ 1.

This reaffirms our (model free) assertion that values of F close to 1 are consistent with the

reduced model being correct. Because the noncentral F family is stochastically increasing

in λ, larger values of F are consistent with the reduced model not being correct.

SUMMARY : Consider the linear model Y = Xβ + ε, where X is n× p with rank r ≤ p

and ε ∼ Nn(0, σ2I). Suppose that we would like to test

H0 : Y = Wγ + ε

versus

H1 : Y = Xβ + ε,

where C(W) ⊂ C(X). An α level rejection region is

RR = {F : F > Fr∗,n−r,α},

where r∗ = r − r0, r0 = r(W), and Fr∗,n−r,α is the upper α quantile of the Fr∗,n−r

distribution.
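ILLUSTRATION : A short Python/NumPy/SciPy sketch of the reduced-versus-full model test; the simulated data and nested designs below are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 40
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])          # full model
W = np.column_stack([np.ones(n), x1])              # reduced model, C(W) in C(X)
Y = 1.0 + 0.5 * x1 + rng.normal(size=n)            # data for which the reduced model holds

def ppm(A):
    """Perpendicular projection matrix onto C(A)."""
    return A @ np.linalg.pinv(A.T @ A) @ A.T

PX, PW = ppm(X), ppm(W)
r, r0 = np.linalg.matrix_rank(X), np.linalg.matrix_rank(W)
r_star = r - r0

MSE = Y @ (np.eye(n) - PX) @ Y / (n - r)
F = (Y @ (PX - PW) @ Y / r_star) / MSE
p_value = stats.f.sf(F, r_star, n - r)             # central F under H0
print(F, p_value)
```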


Example 6.1. Consider the simple linear regression model

    Yi = β0 + β1(xi − x̄) + εi,

for i = 1, 2, ..., n, where εi ∼ iid N (0, σ2). In matrix notation,

    Y = (Y1, Y2, ..., Yn)′,   X = [ 1  x1 − x̄ ; 1  x2 − x̄ ; ... ; 1  xn − x̄ ],   β = (β0, β1)′,   ε = (ε1, ε2, ..., εn)′,

where ε ∼ Nn(0, σ2I). Suppose that we would like to test whether the reduced model

Yi = β0 + εi,

for i = 1, 2, ..., n, also holds. In matrix notation, the reduced model can be expressed as

Y = (Y1, Y2, ..., Yn)′,   W = (1, 1, ..., 1)′ = 1,   γ = β0,   ε = (ε1, ε2, ..., εn)′,

where ε ∼ Nn(0, σ2I) and 1 is an n × 1 vector of ones. Note that C(W) ⊂ C(X) with

r0 = 1, r = 2, and r∗ = r − r0 = 1. When the reduced model is correct,

F = {Y′(PX −PW)Y/r∗} / MSE ∼ F1,n−2,

where MSE is the mean-squared error from the full model. When the reduced model is

not correct, F ∼ F1,n−2(λ), where

λ = {1/(2σ2)}(Xβ)′(PX −PW)Xβ = β1^2 ∑_{i=1}^n (xi − x̄)2 / 2σ2.

Exercises: (a) Verify that this expression for the noncentrality parameter λ is correct.

(b) Suppose that n is even and the values of xi can be selected anywhere in the interval

(d1, d2). How should we choose the xi values to maximize the power of a level α test?


6.3 Testing linear parametric functions

PROBLEM : Consider our usual Gauss-Markov linear model with normal errors; i.e.,

Y = Xβ + ε, where X is n × p with rank r ≤ p and ε ∼ Nn(0, σ2I). We now consider

the problem of testing

H0 : K′β = m

versus

H1 : K′β ≠ m,

where K is a p× s matrix with r(K) = s and m is s× 1.

Example 6.2. Consider the regression model Yi = β0 +β1xi1 +β2xi2 +β3xi3 +β4xi4 + εi,

for i = 1, 2, ..., n. Express each hypothesis in the form H0 : K′β = m:

1. H0 : β1 = 0

2. H0 : β3 = β4 = 0

3. H0 : β1 + β3 = 1, β2 − β4 = −1

4. H0 : β2 = β3 = β4.

Example 6.3. Consider the analysis of variance model Yij = µ+αi+εij, for i = 1, 2, 3, 4

and j = 1, 2, ..., ni. Express each hypothesis in the form H0 : K′β = m:

1. H0 : µ+ α1 = 5, α3 − α4 = 1

2. H0 : α1 − α2 = α3 − α4

3. H0 : α1 − 2 = (1/3)(α2 + α3 + α4).

TERMINOLOGY : The general linear hypothesis H0 : K′β = m is said to be testable

iff K has full column rank and each component of K′β is estimable. In other words, K′β

contains s linearly independent estimable functions. Otherwise, H0 : K′β = m is said to

be nontestable.


GOAL: Our goal is to develop a test for H0 : K′β = m, K′β estimable, in the Gauss-Markov model with normal errors. We start by noting that the BLUE of K′β is K′β̂, where β̂ is a least squares estimator of β. Also,

    K′β̂ = K′(X′X)−X′Y,

a linear function of Y, so K′β̂ follows an s-variate normal distribution with mean E(K′β̂) = K′β and covariance matrix

    cov(K′β̂) = K′cov(β̂)K = σ2K′(X′X)−X′X(X′X)−K = σ2K′(X′X)−K = σ2H,

where H = K′(X′X)−K. That is, we have shown K′β̂ ∼ Ns(K′β, σ2H).

NOTE : In the calculation above, note that K′(X′X)−X′X(X′X)−K = K′(X′X)−K only

because K′β is estimable; i.e., K′ = A′X for some A. It is also true that H is nonsingular.

LEMMA: If K′β is estimable, then H is nonsingular.

Proof. First note that H is an s× s matrix. We can write

H = K′(X′X)−K = A′PXA = A′PXPXA = A′P′XPXA,

since K′ = A′X for some A (this follows since K′β is estimable). Therefore,

r(H) = r(A′P′XPXA) = r(PXA)

and

s = r(K) = r(X′A) = r(X′PXA) ≤ r(PXA) = r[X(X′X)−X′A] ≤ r(X′A) = s.

Therefore, r(H) = r(PXA) = s, showing that H is nonsingular.

IMPLICATION : The lemma above is important, because it convinces us that the distribution of K′β̂ is full rank. Subtracting m, we have

    K′β̂ −m ∼ Ns(K′β −m, σ2H).

F STATISTIC : Now, consider the quadratic form

    (K′β̂ −m)′(σ2H)−1(K′β̂ −m).


Note that (σ2H)−1σ2H = Is, an idempotent matrix with rank s. Therefore,

    (K′β̂ −m)′(σ2H)−1(K′β̂ −m) ∼ χ2s(λ),

where the noncentrality parameter

    λ = {1/(2σ2)}(K′β −m)′H−1(K′β −m).

We have already shown that Y′(I−PX)Y/σ2 ∼ χ2n−r. Also,

    (K′β̂ −m)′(σ2H)−1(K′β̂ −m)  and  Y′(I−PX)Y/σ2

are independent (verify!). Thus, the ratio

    F = {(K′β̂ −m)′H−1(K′β̂ −m)/s} / MSE
      = {(K′β̂ −m)′H−1(K′β̂ −m)/s} / {Y′(I−PX)Y/(n− r)}
      = {(K′β̂ −m)′(σ2H)−1(K′β̂ −m)/s} / {σ−2Y′(I−PX)Y/(n− r)} ∼ Fs,n−r(λ),

where λ = (K′β −m)′H−1(K′β −m)/2σ2. Note that if H0 : K′β = m is true, the noncentrality parameter λ = 0 and F ∼ Fs,n−r. Therefore, an α level rejection region for the test of H0 : K′β = m versus H1 : K′β ≠ m is

    RR = {F : F > Fs,n−r,α}.
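ILLUSTRATION : A minimal Python/NumPy/SciPy sketch of this F test; the regression design, β, and the particular hypothesis (β2 = β3 = 0) are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
beta = np.array([1.0, 0.5, -0.5, 0.0])
Y = X @ beta + rng.normal(size=n)
r = np.linalg.matrix_rank(X)

# H0: beta_2 = 0 and beta_3 = 0, i.e., K'beta = m with s = 2.
K = np.zeros((4, 2)); K[2, 0] = 1.0; K[3, 1] = 1.0
m = np.zeros(2)
s = np.linalg.matrix_rank(K)

XtX_ginv = np.linalg.pinv(X.T @ X)
beta_hat = XtX_ginv @ X.T @ Y
H = K.T @ XtX_ginv @ K                                  # H = K'(X'X)^- K
MSE = Y @ (np.eye(n) - X @ XtX_ginv @ X.T) @ Y / (n - r)

d = K.T @ beta_hat - m
F = (d @ np.linalg.solve(H, d) / s) / MSE
print(F, stats.f.sf(F, s, n - r))                       # F_{s, n-r} under H0
```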

SCALAR CASE : We now consider the special case of testing H0 : K′β = m when

r(K) = 1, that is, K′β is a scalar estimable function. This function (and hypothesis) is

perhaps more appropriately written as H0 : k′β = m to emphasize that k is a p×1 vector

and m is a scalar. Often, k is chosen in a way so that m = 0 (e.g., testing a contrast in

an ANOVA model, etc.). The hypotheses in Example 6.2 (#1) and Example 6.3 (#3)

are of this form. Testing a scalar hypothesis is a mere special case of the general test we

have just derived. However, additional flexibility results in the scalar case; in particular,

we can test for one sided alternatives like H1 : k′β > m or H1 : k′β < m. We first discuss

one more noncentral distribution.


TERMINOLOGY : Suppose Z ∼ N (µ, 1) and V ∼ χ2k. If Z and V are independent, then

    T = Z/√(V/k)

follows a noncentral t distribution with k degrees of freedom and noncentrality parameter µ. We write T ∼ tk(µ). If µ = 0, the tk(µ) distribution reduces to a central t

distribution with k degrees of freedom. Note that T ∼ tk(µ) implies T 2 ∼ F1,k(µ2/2).

TEST PROCEDURE : Consider the Gauss-Markov linear model with normal errors; i.e.,

Y = Xβ + ε, where X is n × p with rank r ≤ p and ε ∼ Nn(0, σ2I). Suppose that k′β

is estimable; i.e., k′ ∈ R(X), and that our goal is to test

H0 : k′β = m

versus

H1 : k′β ≠ m

or versus a suitable one-sided alternative H1. The BLUE for k′β is k′β̂, where β̂ is a least squares estimator. Straightforward calculations show that

    k′β̂ ∼ N{k′β, σ2k′(X′X)−k}  =⇒  Z = (k′β̂ − k′β)/√{σ2k′(X′X)−k} ∼ N (0, 1).

We know that V = Y′(I−PX)Y/σ2 ∼ χ2n−r and that Z and V are independent (verify!). Thus,

    T = [(k′β̂ − k′β)/√{σ2k′(X′X)−k}] / √{σ−2Y′(I−PX)Y/(n− r)} = (k′β̂ − k′β)/√{MSE k′(X′X)−k} ∼ tn−r.

When H0 : k′β = m is true, the statistic

    T = (k′β̂ −m)/√{MSE k′(X′X)−k} ∼ tn−r.

Therefore, an α level rejection region, when H1 is two sided, is

RR = {T : |T | ≥ tn−r,α/2}.

One sided tests use rejection regions that are suitably adjusted. When H0 is not true,

T ∼ tn−r(µ), where

µ = (k′β −m)/√{σ2k′(X′X)−k}.

This distribution is of interest for power and sample size calculations.


6.4 Testing models versus testing linear parametric functions

SUMMARY : Under our Gauss Markov linear model Y = Xβ + ε, where X is n × p

with rank r ≤ p and ε ∼ Nn(0, σ2I), we have presented F statistics to test (a) a reduced

model versus a full model in Section 6.2 and (b) a hypothesis defined by H0 : K′β = m,

for K′β estimable, in Section 6.3. In fact, testing models and testing linear parametric

functions essentially is the same thing, as we now demonstrate. For simplicity, we take

m = 0, although the following argument can be generalized.

DISCUSSION : Consider the general linear model Y = Xβ + ε, where X is n × p with

rank r ≤ p and β ∈ Rp. Consider the testable hypothesis H0 : K′β = 0. We can write

this hypothesis in the following way:

H0 : Y = Xβ + ε and K′β = 0.

We now find a reduced model that corresponds to this hypothesis. Note that K′β = 0

holds if and only if β⊥C(K). To identify the reduced model, pick a matrix U such that

C(U) = C(K)⊥. We then have

K′β = 0⇐⇒ β⊥C(K)⇐⇒ β ∈ C(U)⇐⇒ β = Uγ,

for some vector γ. Substituting β = Uγ into the linear model Y = Xβ + ε gives the

reduced model Y = XUγ + ε, or letting W = XU, our hypothesis above can be written

H0 : Y = Wγ + ε,

where C(W) ⊂ C(X).

OBSERVATION : When K′β is estimable, that is, when K′ = D′X for some n×s matrix

D, we can find the perpendicular projection matrix for testing H0 : K′β = 0 in terms

of D and PX. From Section 6.2, recall that the numerator sum of squares to test the

reduced model Y = Wγ+ε versus the full model Y = Xβ+ε is Y′(PX−PW)Y, where

PX−PW is the ppm onto C(PX−PW) = C(W)⊥C(X). For testing the estimable function

K′β = 0, we now show that the ppm onto C(PXD) is also the ppm onto C(W)⊥C(X); i.e.,

that C(PXD) = C(W)⊥C(X).


PROPOSITION : C(PX −PW) = C(W)⊥C(X) = C(XU)⊥C(X) = C(PXD).

Proof. We showed C(PX −PW) = C(W)⊥C(X) in Chapter 2 and W = XU, so the second

equality is obvious. Suppose that v ∈ C(XU)⊥C(X). Then, v′XU = 0, so that X′v⊥C(U).

Because C(U) = C(K)⊥, we know that X′v ∈ C(K) = C(X′D), since K′ = D′X. Thus,

v = PXv = X(X′X)−X′v ∈ C[X(X′X)−X′D] = C(PXD).

Suppose that v ∈ C(PXD). Clearly, v ∈ C(X). Also, v = PXDd, for some d and

v′XU = d′D′PXXU = d′D′XU = d′K′U = 0,

because C(U) = C(K)⊥. Thus, v ∈ C(XU)⊥C(X).

IMPLICATION : It follows immediately that the numerator sum of squares for testing

the reduced model Y = Wγ + ε versus the full model Y = Xβ + ε is Y′MPXDY, where

MPXD = PXD[(PXD)′(PXD)]−(PXD)′ = PXD(D′PXD)−D′PX

is the ppm onto C(PXD). If ε ∼ Nn(0, σ2I), the resulting test statistic

F = {Y′MPXDY/r(MPXD)} / {Y′(I−PX)Y/r(I−PX)} ∼ Fr(MPXD),r(I−PX)(λ),

where the noncentrality parameter

    λ = {1/(2σ2)}(Xβ)′MPXDXβ.

GOAL: Our goal now is to show that the F statistic above is the same F statistic we

derived in Section 6.3 with m = 0, that is,

F = {(K′β̂)′H−1K′β̂/s} / {Y′(I−PX)Y/(n− r)}.

Recall that this statistic was derived for the testable hypothesis H0 : K′β = 0. First,

we show that r(MPXD) = s, where, recall, s = r(K). To do this, it suffices to show

that r(K) = r(PXD). Because K′β is estimable, we know that K′ = D′X, for some D.

Writing K = X′D, we see that for any vector a,

X′Da = 0⇐⇒ Da⊥C(X),


which occurs iff PXDa = 0. Note that the X′Da = 0 ⇐⇒ PXDa = 0 equivalence

implies that

N (X′D) = N (PXD)⇐⇒ C(D′X)⊥ = C(D′PX)⊥ ⇐⇒ C(D′X) = C(D′PX)

so that r(D′X) = r(D′PX). But K′ = D′X, so r(K) = r(D′X) = r(D′PX) = r(PXD),

which is what we set out to prove. Now, consider the quadratic form Y′MPXDY, and

let β̂ denote a least squares estimator of β. Result 2.5 says that Xβ̂ = PXY, so that K′β̂ = D′Xβ̂ = D′PXY. Substitution gives

    Y′MPXDY = Y′PXD(D′PXD)−D′PXY
            = (K′β̂)′[D′X(X′X)−X′D]−K′β̂
            = (K′β̂)′[K′(X′X)−K]−K′β̂.

Recalling that H = K′(X′X)−K and that H is nonsingular (when K′β is estimable) should convince you that the numerator sums of squares in

    F = {Y′MPXDY/r(MPXD)} / {Y′(I−PX)Y/r(I−PX)}

and

    F = {(K′β̂)′H−1K′β̂/s} / {Y′(I−PX)Y/(n− r)}

are equal. We already showed that r(MPXD) = s, and because r(I − PX) = n − r, we are done.

6.5 Likelihood ratio tests

6.5.1 Constrained estimation

REMARK : In the linear model Y = Xβ+ε, where E(ε) = 0 (note the minimal assump-

tions), we have, up until now, allowed the p× 1 parameter vector β to take on any value

in Rp, that is, we have made no restrictions on the parameters in β. We now consider

the case where β is restricted to the subspace of Rp consisting of values of β that satisfy


P′β = δ, where P is a p × q matrix and δ is q × 1. To avoid technical difficulties, we

will assume that the system P′β = δ is consistent; i.e., δ ∈ C(P′). Otherwise, the set

{β ∈ Rp : P′β = δ} could be empty.

PROBLEM : In the linear model Y = Xβ+ε, where E(ε) = 0, we would like to minimize

Q(β) = (Y −Xβ)′(Y −Xβ)

subject to the constraint that P′β = δ. Essentially, this requires us to find the minimum

value of Q(β) over the linear subspace {β ∈ Rp : P′β = δ}. This is a restricted minimization problem and standard Lagrangian methods apply; see Appendix B in Monahan.

The Lagrangian a(β,θ) is a function of β and the Lagrange multipliers in θ and can be

written as

a(β,θ) = (Y −Xβ)′(Y −Xβ) + 2θ′(P′β − δ).

Taking partial derivatives, we have

    ∂a(β,θ)/∂β = −2X′Y + 2X′Xβ + 2Pθ
    ∂a(β,θ)/∂θ = 2(P′β − δ).

Setting these equal to zero leads to the restricted normal equations (RNEs), that is,

    [ X′X  P ; P′  0 ] [ β ; θ ] = [ X′Y ; δ ].

Denote by β̂H and θ̂H the solutions to the RNEs, respectively. The solution β̂H is called a restricted least squares estimator.
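ILLUSTRATION : A short Python/NumPy sketch that solves the RNEs numerically; the design and the particular constraint (β1 + β2 = 1) are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 0.7, 0.3]) + rng.normal(size=n)

P = np.array([[0.0], [1.0], [1.0]])            # p x q constraint matrix
delta = np.array([1.0])                        # q x 1

p, q = X.shape[1], P.shape[1]
# Bordered system  [X'X  P; P'  0] (beta, theta) = (X'Y, delta)
A = np.block([[X.T @ X, P], [P.T, np.zeros((q, q))]])
b = np.concatenate([X.T @ Y, delta])
sol = np.linalg.lstsq(A, b, rcond=None)[0]     # lstsq also handles a singular system
beta_H, theta_H = sol[:p], sol[p:]
print(beta_H, P.T @ beta_H)                    # constraint satisfied: approximately 1
```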

DISCUSSION : We now present some facts regarding this restricted linear model and its

(restricted) least squares estimator. We have proven all of these facts for the unrestricted

model; restricted versions of the proofs are all in Monahan.

1. The restricted normal equations are consistent; see Result 3.8, Monahan (pp 62-63).

2. A solution β̂H minimizes Q(β) over the set T ≡ {β ∈ Rp : P′β = δ}; see Result

3.9, Monahan (pp 63).


3. The function λ′β is estimable in the restricted model

Y = Xβ + ε, E(ε) = 0, P′β = δ,

if and only if λ′ = a′X + d′P′, for some a and d, that is,

λ′ ∈ R[ X ; P′ ],

the row space of X stacked over P′.

See Result 3.7, Monahan (pp 60).

4. If λ′β is estimable in the unrestricted model; i.e., the model without the linear

restriction, then λ′β is estimable in the restricted model. The converse is not true.

5. Under the GM model assumptions, if λ′β is estimable in the restricted model, then

λ′β̂H is the BLUE of λ′β in the restricted model. See Result 4.5, Monahan (pp

89-90).

6.5.2 Testing procedure

SETTING : Consider the Gauss Markov linear model Y = Xβ+ε, where X is n×p with

rank r ≤ p and ε ∼ Nn(0, σ2I). We are interested in deriving the likelihood ratio test

(LRT) for

H0 : K′β = m

versus

H1 : K′β ≠ m,

where K is a p× s matrix with r(K) = s and m is s× 1. We assume that H0 : K′β = m

is testable, that is, K′β is estimable.

RECALL: A likelihood ratio testing procedure is intuitive. One simply compares the

maximized likelihood over the restricted parameter space (that is, the space under H0)

to the maximized likelihood over the entire parameter space. If the former is small when

compared to the latter, then there is a large amount of evidence against H0.


DERIVATION : Under our model assumptions, we know that Y ∼ Nn(Xβ, σ2I). The

likelihood function for θ = (β′, σ2)′ is

L(θ|y) = L(β, σ2|y) = (2πσ2)−n/2 exp{−Q(β)/2σ2},

where Q(β) = (y −Xβ)′(y −Xβ). The unrestricted parameter space is

    Θ = {θ : β ∈ Rp, σ2 ∈ R+}.

The restricted parameter space, that is, the parameter space under H0 : K′β = m, is

    Θ0 = {θ : β ∈ Rp, K′β = m, σ2 ∈ R+}.

The likelihood ratio statistic is

λ ≡ λ(Y) = sup_{Θ0} L(θ|Y) / sup_{Θ} L(θ|Y).

We reject the null hypothesis H0 for small values of λ = λ(Y). Thus, to perform a level

α test, reject H0 when λ < c, where c ∈ (0, 1) is chosen to satisfy PH0{λ(Y) ≤ c} = α.

We have seen (Section 6.1) that the unrestricted MLEs of β and σ2 are

    β̂ = (X′X)−X′Y  and  σ̂2 = Q(β̂)/n.

Similarly, maximizing L(θ|y) over Θ0 produces the solutions β̂H and σ̂2H = Q(β̂H)/n, where β̂H is any solution to

    [ X′X  K ; K′  0 ] [ β ; θ ] = [ X′Y ; m ],

the restricted normal equations. Algebra shows that

    λ = L(β̂H , σ̂2H |Y) / L(β̂, σ̂2|Y) = (σ̂2/σ̂2H)^{n/2} = {Q(β̂)/Q(β̂H)}^{n/2}.

More algebra shows that

    {Q(β̂)/Q(β̂H)}^{n/2} < c ⇐⇒ [{Q(β̂H)−Q(β̂)}/s] / [Q(β̂)/(n− r)] > c∗,


where s = r(K) and c∗ = s−1(n − r)(c^{−2/n} − 1). Furthermore, Monahan's Theorem 6.1 (pp 139-140) shows that when K′β is estimable,

    Q(β̂H)−Q(β̂) = (K′β̂ −m)′H−1(K′β̂ −m),

where H = K′(X′X)−K. Applying this result, and noting that Q(β̂)/(n − r) = MSE, we see that

    [{Q(β̂H)−Q(β̂)}/s] / [Q(β̂)/(n− r)] > c∗ ⇐⇒ F = {(K′β̂ −m)′H−1(K′β̂ −m)/s} / MSE > c∗.

That is, the LRT specifies that we reject H0 when F is large. Choosing c∗ = Fs,n−r,α

provides a level α test. Therefore, under the Gauss Markov model with normal errors,

the LRT for H0 : K′β = m is the same test as that in Section 6.3.

6.6 Confidence intervals

6.6.1 Single intervals

PROBLEM : Consider the Gauss Markov linear model Y = Xβ + ε, where X is n × p

with rank r ≤ p and ε ∼ Nn(0, σ2I). Suppose that λ′β is estimable, that is, λ′ = a′X, for

some vector a. Our goal is to write a 100(1− α) percent confidence interval for λ′β.

DERIVATION : We start with the obvious point estimator λ′β̂, the least squares estimator (and MLE) of λ′β. Under our model assumptions, we know that

    λ′β̂ ∼ N{λ′β, σ2λ′(X′X)−λ}

and, hence,

    Z = (λ′β̂ − λ′β)/√{σ2λ′(X′X)−λ} ∼ N (0, 1).

If σ2 was known, our work would be done as Z is a pivot. More likely, this is not the

case, so we must estimate it. An obvious point estimator for σ2 is MSE, where

MSE = (n− r)−1Y′(I−PX)Y.


We consider the quantity

    T = (λ′β̂ − λ′β)/√{MSE λ′(X′X)−λ}

and subsequently show that T ∼ tn−r. Note that

    T = (λ′β̂ − λ′β)/√{MSE λ′(X′X)−λ} = [(λ′β̂ − λ′β)/√{σ2λ′(X′X)−λ}] / √{(n− r)−1Y′(I−PX)Y/σ2} ∼ “N (0, 1)” / √{“χ2n−r”/(n− r)}.

To verify that T ∼ tn−r, it remains only to show that Z and Y′(I−PX)Y/σ2 are independent, or equivalently, that λ′β̂ and Y′(I−PX)Y are, since λ′β̂ is a function of Z and since σ2 is not random. Note that

    λ′β̂ = a′X(X′X)−X′Y = a′PXY,

a linear function of Y. Using Result 5.19, λ′β̂ and Y′(I−PX)Y are independent since a′PXσ2I(I−PX) = 0. Thus, T ∼ tn−r, i.e., T is a pivot, so that

    pr(−tn−r,α/2 < (λ′β̂ − λ′β)/√{MSE λ′(X′X)−λ} < tn−r,α/2) = 1− α.

Algebra shows that this probability statement is the same as

    pr(λ′β̂ − tn−r,α/2√{MSE λ′(X′X)−λ} < λ′β < λ′β̂ + tn−r,α/2√{MSE λ′(X′X)−λ}) = 1− α,

showing that

    λ′β̂ ± tn−r,α/2√{MSE λ′(X′X)−λ}

is a 100(1− α) percent confidence interval for λ′β.
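ILLUSTRATION : A minimal Python/NumPy/SciPy sketch of this interval; the simulated data and the choice λ′β = β1 are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
n, alpha = 40, 0.05
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([2.0, 1.5]) + rng.normal(size=n)
r = np.linalg.matrix_rank(X)

XtX_ginv = np.linalg.pinv(X.T @ X)
beta_hat = XtX_ginv @ X.T @ Y
MSE = Y @ (np.eye(n) - X @ XtX_ginv @ X.T) @ Y / (n - r)

lam_vec = np.array([0.0, 1.0])                 # lambda'beta = beta_1
est = lam_vec @ beta_hat
se = np.sqrt(MSE * lam_vec @ XtX_ginv @ lam_vec)
t_crit = stats.t.ppf(1 - alpha / 2, df=n - r)
print(est - t_crit * se, est + t_crit * se)    # 95% confidence interval
```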

Example 6.4. Recall the simple linear regression model

Yi = β0 + β1xi + εi,

for i = 1, 2, ..., n, where ε1, ε2, ..., εn are iid N (0, σ2). Recall also that the least squares

estimator of β = (β0, β1)′ is

    β̂ = (X′X)−1X′Y = (β̂0, β̂1)′,  where β̂1 = ∑i(xi − x̄)(Yi − Ȳ )/∑i(xi − x̄)2 and β̂0 = Ȳ − β̂1x̄,

and that the covariance matrix of β̂ is

    cov(β̂) = σ2(X′X)−1 = σ2 [ 1/n+ x̄2/∑i(xi − x̄)2   −x̄/∑i(xi − x̄)2 ; −x̄/∑i(xi − x̄)2   1/∑i(xi − x̄)2 ].

We now consider the problem of writing a 100(1− α) percent confidence interval for

E(Y |x = x0) = β0 + β1x0,

the mean response of Y when x = x0. Note that E(Y |x = x0) = β0 + β1x0 = λ′β, where

λ′ = (1 x0). Also, λ′β is estimable because this is a regression model so our previous

work applies. The least squares estimator (and MLE) of E(Y |x = x0) is

λ′β = β0 + β1x0.

Straightforward algebra (verify!) shows that

λ′(X′X)−1λ = (1  x0) [ 1/n+ x̄2/∑i(xi − x̄)2   −x̄/∑i(xi − x̄)2 ; −x̄/∑i(xi − x̄)2   1/∑i(xi − x̄)2 ] (1, x0)′ = 1/n + (x0 − x̄)2/∑i(xi − x̄)2.

Thus, a 100(1− α) percent confidence interval for λ′β = E(Y |x = x0) is

(β̂0 + β̂1x0) ± tn−2,α/2 √[ MSE {1/n + (x0 − x̄)2/∑i(xi − x̄)2} ].

6.6.2 Multiple intervals

PROBLEM : Consider the Gauss Markov linear model Y = Xβ+ε, where X is n×p with

rank r ≤ p and ε ∼ Nn(0, σ2I). We now consider the problem of writing simultaneous

confidence intervals for the k estimable functions λ′1β, λ′2β, ..., λ′kβ. Let the p× k matrix Λ = (λ1 λ2 · · · λk) so that

Λ = (λ1 λ2 · · · λk) so that

τ = Λ′β =

λ′1β

λ′2β...

λ′kβ

.


Because τ = Λ′β is estimable, it follows that

    τ̂ = Λ′β̂ ∼ Nk(Λ′β, σ2H),

where H = Λ′(X′X)−Λ. Furthermore, because λ′1β̂, λ′2β̂, ..., λ′kβ̂ are jointly normal, we have that

    λ′jβ̂ ∼ N (λ′jβ, σ2hjj),

where hjj is the jth diagonal element of H. Using our previous results, we know that

    λ′jβ̂ ± tn−r,α/2√(σ̂2hjj),

where σ̂2 = MSE, is a 100(1− α) percent confidence interval for λ′jβ, that is,

    pr(λ′jβ̂ − tn−r,α/2√(σ̂2hjj) < λ′jβ < λ′jβ̂ + tn−r,α/2√(σ̂2hjj)) = 1− α.

This statement is true for a single interval.

SIMULTANEOUS COVERAGE : To investigate the simultaneous coverage probability

of the set of intervals

λ′jβ̂ ± tn−r,α/2√(σ̂2hjj), j = 1, 2, ..., k,

let Ej denote the event that interval j contains λ′jβ, that is, pr(Ej) = 1 − α, for j = 1, 2, ..., k. The probability that each of the k intervals includes its target λ′jβ is

    pr(∩_{j=1}^k Ej) = 1− pr(∪_{j=1}^k Ēj),

by DeMorgan's Law. In turn, Boole's Inequality says that

    pr(∪_{j=1}^k Ēj) ≤ ∑_{j=1}^k pr(Ēj) = kα.

Thus, the probability that each interval contains its intended target is

    pr(∩_{j=1}^k Ej) ≥ 1− kα.


Obviously, this lower bound 1 − kα can be quite a bit lower than 1 − α, that is, the

simultaneous coverage probability of the set of

λ′jβ̂ ± tn−r,α/2√(σ̂2hjj), j = 1, 2, ..., k

can be much lower than the single interval coverage probability.

GOAL: We would like the set of intervals λ′jβ̂ ± d√(σ̂2hjj), j = 1, 2, ..., k to have a

simultaneous coverage probability of at least 1−α. Here, d represents a probability

point that guarantees the desired simultaneous coverage. Because taking d = tn−r,α/2

does not guarantee this minimum, we need to take d to be larger.

BONFERRONI : From the argument above, it is clear that if one takes d = tn−r,α/(2k), then

    pr(∩_{j=1}^k Ej) ≥ 1− k(α/k) = 1− α.

Thus, 100(1− α) percent simultaneous confidence intervals for λ′1β, λ′2β, ..., λ′kβ are

    λ′jβ̂ ± tn−r,α/(2k)√(σ̂2hjj)

for j = 1, 2, ..., k.

SCHEFFE : The idea behind Scheffe's approach is to consider an arbitrary linear combination of τ = Λ′β, say, u′τ = u′Λ′β, and construct a confidence interval

C(u, d) = (u′τ̂ − d√(σ̂2u′Hu), u′τ̂ + d√(σ̂2u′Hu)),

where d is chosen so that

    pr{u′τ ∈ C(u, d), for all u} = 1− α.

Since d is chosen in this way, one guarantees the necessary simultaneous coverage probability for all possible linear combinations of τ = Λ′β (an infinite number of combinations).

Clearly, the desired simultaneous coverage is then conferred for the k functions of interest

τj = λ′jβ, j = 1, 2, ..., k; these functions result from taking u to be the standard unit

vectors. The argument in Monahan (pp 144) shows that d = (kFk,n−r,α)1/2.
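ILLUSTRATION : A short Python/SciPy sketch comparing the per-interval, Bonferroni, and Scheffe multipliers; n, r, k, and α are arbitrary choices.

```python
import numpy as np
from scipy import stats

n, r, k, alpha = 40, 3, 4, 0.05

d_single = stats.t.ppf(1 - alpha / 2, df=n - r)            # per-interval multiplier
d_bonf = stats.t.ppf(1 - alpha / (2 * k), df=n - r)        # Bonferroni: t_{n-r, alpha/(2k)}
d_scheffe = np.sqrt(k * stats.f.ppf(1 - alpha, k, n - r))  # Scheffe: (k F_{k,n-r,alpha})^{1/2}
print(d_single, d_bonf, d_scheffe)
```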


7 Appendix

7.1 Matrix algebra: Basic ideas

TERMINOLOGY : A matrix A is a rectangular array of elements; e.g.,

A = [ 3  5  4 ; 1  2  8 ].

The (i, j)th element of A is denoted by aij. The dimensions of A are m (the number of

rows) by n (the number of columns). If m = n, A is square. If we want to emphasize

the dimension of A, we can write Am×n.

TERMINOLOGY : A vector is a matrix consisting of one column or one row. A column

vector is denoted by an×1. A row vector is denoted by a1×n. By convention, we assume

a vector is a column vector, unless otherwise noted; that is,

a = (a1, a2, ..., an)′,   a′ = (a1 a2 · · · an).

TERMINOLOGY : If A = (aij) is an m× n matrix, the transpose of A, denoted by A′

or AT , is the n×m matrix (aji). If A′ = A, we say A is symmetric.

Result MAR1.1.

(a) (A′)′ = A

(b) For any matrix A, A′A and AA′ are symmetric

(c) A = 0 iff A′A = 0

(d) (AB)′ = B′A′

(e) (A + B)′ = A′ + B′


TERMINOLOGY : The n× n identity matrix I is given by

I = In = [ 1 0 · · · 0 ; 0 1 · · · 0 ; ... ; 0 0 · · · 1 ]  (n× n);

that is, aij = 1 for i = j, and aij = 0 when i ≠ j. The n× n matrix of ones J is

    J = Jn = [ 1 1 · · · 1 ; 1 1 · · · 1 ; ... ; 1 1 · · · 1 ]  (n× n);

that is, aij = 1 for all i and j. Note that J = 11′, where 1 = 1n is an n × 1 (column)

vector of ones. The n×n matrix where aij = 0, for all i and j, is called the null matrix,

or the zero matrix, and is denoted by 0.

TERMINOLOGY : If A is an n× n matrix, and there exists a matrix C such that

AC = CA = I,

then A is nonsingular and C is called the inverse of A; henceforth denoted by A−1.

If A is nonsingular, A−1 is unique. If A is a square matrix and is not nonsingular, A is

singular.

SPECIAL CASE: The inverse of the 2 × 2 matrix

A = [ a  b
      c  d ]

is given by

A−1 = (1/(ad − bc)) [  d  −b
                      −c   a ].

SPECIAL CASE: The inverse of the n × n diagonal matrix

A = [ a11   0   · · ·   0
       0   a22  · · ·   0
       ⋮    ⋮     ⋱     ⋮
       0    0   · · ·  ann ]

is given by

A−1 = [ 1/a11    0     · · ·    0
          0    1/a22   · · ·    0
          ⋮      ⋮        ⋱     ⋮
          0      0     · · ·  1/ann ].
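NUMERICAL ILLUSTRATION: Both special cases are easy to check against a general-purpose routine. The following is a minimal Python sketch (assuming numpy is available; the particular matrix entries are arbitrary choices).

    # Verify the 2 x 2 and diagonal inverse formulas numerically
    import numpy as np

    A = np.array([[2.0, 3.0],
                  [1.0, 4.0]])                      # arbitrary nonsingular 2 x 2 matrix
    a, b, c, d = A.ravel()
    A_inv = np.array([[d, -b], [-c, a]]) / (a * d - b * c)
    print(np.allclose(A_inv, np.linalg.inv(A)))     # True

    D = np.diag([2.0, 5.0, 10.0])                   # arbitrary diagonal matrix
    D_inv = np.diag(1.0 / np.diag(D))
    print(np.allclose(D_inv, np.linalg.inv(D)))     # True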


Result MAR1.2.

(a) A is nonsingular iff |A| ≠ 0.

(b) If A and B are nonsingular matrices, (AB)−1 = B−1A−1.

(c) If A is nonsingular, then (A′)−1 = (A−1)′.

7.2 Linear independence and rank

TERMINOLOGY : The m× 1 vectors a1, a2, ..., an are said to be linearly dependent

if and only if there exist scalars c1, c2, ..., cn such that

c1a1 + c2a2 + · · · + cnan = 0

and at least one of the ci’s is not zero; that is, it is possible to express at least one vector

as a nontrivial linear combination of the others. If

c1a1 + c2a2 + · · · + cnan = 0 =⇒ c1 = c2 = · · · = cn = 0,

then a1, a2, ..., an are linearly independent. If m < n, then a1, a2, ..., an must be

linearly dependent.

NOTE : If a1, a2, ..., an denote the columns of an m× n matrix A; i.e.,

A = (a1 a2 · · · an),

then the columns of A are linearly independent if and only if Ac = 0 ⇒ c = 0, where

c = (c1, c2, ..., cn)′. Thus, if you can find at least one nonzero c such that Ac = 0, the

columns of A are linearly dependent.

TERMINOLOGY : The rank of a matrix A is defined as

r(A) = number of linearly independent columns of A

= number of linearly independent rows of A.


The number of linearly independent rows of any matrix is always equal to the number of

linearly independent columns. Alternate notation for r(A) is rank(A).

TERMINOLOGY: If A is n × p, then r(A) ≤ min{n, p}.

• If r(A) = min{n, p}, then A is said to be of full rank.

• If r(A) = n, we say that A is of full row rank.

• If r(A) = p, we say that A is of full column rank.

• If r(A) < min{n, p}, we say that A is less than full rank or rank deficient.

• Since the maximum possible rank of an n × p matrix is the minimum of n and p,

for any rectangular (i.e., non-square) matrix, either the rows or columns (or both)

must be linearly dependent.

Result MAR2.1.

(a) For any matrix A, r(A′) = r(A).

(b) For any matrix A, r(A′A) = r(A).

(c) For conformable matrices, r(AB) ≤ r(A) and r(AB) ≤ r(B).

(d) If B is nonsingular, then, r(AB) = r(A).

(e) For any n × n matrix A, r(A) = n ⇐⇒ A−1 exists ⇐⇒ |A| ≠ 0.

(f) For any matrix An×n and vector bn×1, r(A,b) ≥ r(A); i.e., the inclusion of a

column vector cannot decrease the rank of a matrix.

ALTERNATE DEFINITION : An m × n matrix A has rank r if the dimension of the

largest possible nonsingular submatrix of A is r × r.
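NUMERICAL ILLUSTRATION: Several of the rank facts in Result MAR2.1 can be checked numerically. The following is a minimal Python sketch (assuming numpy is available; the matrices are arbitrary examples, and np.linalg.matrix_rank computes the rank via the singular value decomposition).

    # Check r(A') = r(A), r(A'A) = r(A), and r(AB) <= min{r(A), r(B)}
    import numpy as np

    A = np.array([[1.0, 2.0],
                  [2.0, 4.0],
                  [3.0, 6.0]])                      # rank 1: second column = 2 * first
    B = np.random.default_rng(714).normal(size=(2, 4))

    rA, rB = np.linalg.matrix_rank(A), np.linalg.matrix_rank(B)
    print(rA)                                       # 1
    print(np.linalg.matrix_rank(A.T) == rA)         # True
    print(np.linalg.matrix_rank(A.T @ A) == rA)     # True
    print(np.linalg.matrix_rank(A @ B) <= min(rA, rB))   # True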

APPLICATION TO LINEAR MODELS : Consider our general linear model

Y = Xβ + ε,


where Y is an n×1 vector of observed responses, X is an n×p matrix of fixed constants, β

is a p×1 vector of fixed but unknown parameters, and ε is an n×1 vector of (unobserved)

random errors with zero mean. If X is n×p, then X′X is p×p. Furthermore, if r(X) = p

(i.e., it is full column rank), then r(X′X) = p. Thus, we know that (X′X)−1 exists. On

the other hand, if X has rank r < p, then r(X′X) < p and (X′X)−1 does not exist.

Consider the normal equations (which will be motivated later):

X′Xβ = X′Y.

We see that left multiplication by (X′X)−1 produces the solution

β = (X′X)−1X′Y.

This is the unique solution to the normal equations (since inverses are unique). Note

that if r(X) = r < p, then a unique solution to the normal equations does not exist.

TERMINOLOGY : We say that two vectors a and b are orthogonal, and write a⊥b, if

their inner product is zero; i.e.,

a′b = 0.

Vectors a1, a2, ..., an are mutually orthogonal if and only if a′iaj = 0 for all i ≠ j. If

a1, a2, ..., an are mutually orthogonal, then they are also linearly independent (verify!).

The converse is not necessarily true.

TERMINOLOGY : Suppose that a1, a2, ..., an are orthogonal. If a′iai = 1, for all

i = 1, 2, ..., n, we say that a1, a2, ..., an are orthonormal.

TERMINOLOGY : Suppose that a1, a2, ..., an are orthogonal. Then

ci = ai/||ai||,

where ||ai|| = √(a′iai), i = 1, 2, ..., n, are orthonormal. The quantity ||ai|| is the length

of ai. If a1, a2, ..., an are the columns of A, then A′A is diagonal; similarly, C′C = I.

TERMINOLOGY : Let A be an n × n (square) matrix. We say that A is orthogonal

if A′A = I = AA′, or equivalently, if A′ = A−1. Note that if A is orthogonal, then


||Ax|| = ||x||. Geometrically, this means that multiplication of x by A only rotates the

vector x (since the length remains unchanged).

7.3 Vector spaces

TERMINOLOGY : Let V ⊆ Rn be a set of n× 1 vectors. We call V a vector space if

(i) x1 ∈ V , x2 ∈ V ⇒ x1 + x2 ∈ V , and

(ii) x ∈ V ⇒ cx ∈ V for c ∈ R.

That is, V is closed under addition and scalar multiplication.

TERMINOLOGY : A set of n × 1 vectors S ⊆ Rn is a subspace of V if S is a vector

space and S ⊆ V ; i.e., if x ∈ S ⇒ x ∈ V .

TERMINOLOGY : We say that subspaces S1 and S2 are orthogonal, and write S1⊥S2,

if x′1x2 = 0, for all x1 ∈ S1 and for all x2 ∈ S2.

Example. Suppose that V = R3. Then, V is a vector space.

Proof. Suppose x1 ∈ V and x2 ∈ V . Then, x1 + x2 ∈ V and cx1 ∈ V for all c ∈ R.

Example. Suppose that V = R3. The subspace consisting of the z-axis is

S1 = { (0, 0, z)′ : z ∈ R }.

The subspace consisting of the x-y plane is

S2 = { (x, y, 0)′ : x, y ∈ R }.


It is easy to see that S1 and S2 are orthogonal. That S1 is a subspace is argued as follows.

Clearly, S1 ⊆ V . Now, suppose that x1 ∈ S1 and x2 ∈ S1; i.e., x1 = (0, 0, z1)′ and x2 = (0, 0, z2)′, for z1, z2 ∈ R. Then,

x1 + x2 = (0, 0, z1 + z2)′ ∈ S1     and     cx1 = (0, 0, cz1)′ ∈ S1,

for all c ∈ R. Thus, S1 is a subspace. That S2 is a subspace follows similarly.

TERMINOLOGY : Suppose that V is a vector space and that x1,x2, ...,xn ∈ V . The set

of all linear combinations of x1,x2, ...,xn; i.e.,

S = { x ∈ V : x = c1x1 + c2x2 + · · · + cnxn; ci ∈ R }

is a subspace of V . We say that S is generated by x1,x2, ...,xn. In other words, S is the space spanned by x1,x2, ...,xn, written S = span{x1,x2, ...,xn}.

Example. Suppose that V = R3 and let

x1 = (1, 1, 1)′     and     x2 = (1, 0, 0)′.

For c1, c2 ∈ R, the linear combination c1x1 + c2x2 = (c1 + c2, c1, c1)′. Thus, the space

spanned by x1 and x2 is the subspace S which consists of all the vectors in R3 of the

form (a, b, b)′, for a, b ∈ R.


TERMINOLOGY: Suppose that S is a subspace of V . If {x1,x2, ...,xn} is a linearly independent spanning set for S, we call {x1,x2, ...,xn} a basis for S. In general, a basis

is not unique. However, the number of vectors in the basis, called the dimension of S,

written dim(S), is unique.

Result MAR3.1. Suppose that S and T are vector spaces. If S ⊆ T , and dim(S) =

dim(T ), then S = T .

Proof. See pp 244-5 in Monahan.

TERMINOLOGY: The subspaces S1 and S2 are orthogonal complements in Rm if and only if S1 ⊆ Rm, S2 ⊆ Rm, S1 and S2 are orthogonal, S1 ∩ S2 = {0}, dim(S1) = r, and dim(S2) = m − r.

Result MAR3.2. Let S1 and S2 be orthogonal complements in Rm. Then, any vector

y ∈ Rm can be uniquely decomposed as y = y1 + y2, where y1 ∈ S1 and y2 ∈ S2.

Proof. Suppose that the decomposition is not possible; that is, suppose that y is linearly

independent of basis vectors in both S1 and S2. However, this would give m+ 1 linearly

independent vectors in Rm, which is not possible. Thus, the decomposition must be

possible. To establish uniqueness, suppose that y = y1 + y2 and y = y∗1 + y∗2, where

y1, y∗1 ∈ S1 and y2, y∗2 ∈ S2. Then, y1 − y∗1 = y∗2 − y2. But, y1 − y∗1 ∈ S1 and y∗2 − y2 ∈ S2.

Thus, both y1 − y∗1 and y∗2 − y2 must be the 0 vector.

NOTE : In the last result, note that we can write

||y||2 = y′y = (y1 + y2)′(y1 + y2) = y′1y1 + 2y′1y2 + y′2y2 = ||y1||2 + ||y2||2.

This is simply the Pythagorean Theorem. The cross product term is zero since y1 and

y2 are orthogonal.

TERMINOLOGY : For the matrix

Am×n = (a1 a2 · · · an),

where aj is m × 1, the column space of A,

C(A) = { x ∈ Rm : x = c1a1 + c2a2 + · · · + cnan; cj ∈ R } = { x ∈ Rm : x = Ac; c ∈ Rn },

is the set of all m× 1 vectors spanned by the columns of A; that is, C(A) is the set of all

vectors that can be written as a linear combination of the columns of A. The dimension

of C(A) is the column rank of A.

TERMINOLOGY : Let

Am×n = [ b′1
         b′2
          ⋮
         b′m ],

where bi is n × 1. Denote

R(A) = { x ∈ Rn : x = d1b1 + d2b2 + · · · + dmbm; di ∈ R } = { x ∈ Rn : x′ = d′A; d ∈ Rm }.

We call R(A) the row space of A. It is the set of all n× 1 vectors spanned by the rows

of A; that is, the set of all vectors that can be written as a linear combination of the

rows of A. The dimension of R(A) is the row rank of A.

TERMINOLOGY : The set N (A) = {x : Ax = 0} is called the null space of A, denoted

N (A). The dimension of N (A) is called the nullity of A.

Result MAR3.3.

(a) C(B) ⊆ C(A) iff B = AC for some matrix C.

(b) R(B) ⊆ R(A) iff B = DA for some matrix D.

(c) C(A), R(A), and N (A) are all vector spaces.

(d) R(A′) = C(A) and C(A′) = R(A).


(e) C(A′A) = C(A′) and R(A′A) = R(A).

(f) For any A and B, C(AB) ⊆ C(A). If B is nonsingular, then C(AB) = C(A).

Result MAR3.4. If A has full column rank, then N (A) = {0}.

Proof. Suppose that A has full column rank. Then, the columns of A are linearly

independent and the only solution to Ax = 0 is x = 0.

Example. Define

A = [ 1  1  2
      1  0  3
      1  0  3 ]     and     c = [  3
                                  −1
                                  −1 ].

The column space of A is the set of all linear combinations of the columns of A; i.e., the set of vectors of the form

c1a1 + c2a2 + c3a3 = [ c1 + c2 + 2c3
                       c1 + 3c3
                       c1 + 3c3 ],

where c1, c2, c3 ∈ R. Thus, the column space C(A) is the set of all 3 × 1 vectors of

the form (a, b, b)′, where a, b ∈ R. Any two vectors of {a1, a2, a3} span this space. In addition, any two of {a1, a2, a3} are linearly independent, and hence form a basis for C(A). The set {a1, a2, a3} is not linearly independent since Ac = 0. The dimension of

C(A); i.e., the rank of A, is r = 2. The dimension of N (A) is 1, and c forms a basis for

this space.
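NUMERICAL ILLUSTRATION: For this example, the rank of A, the dimension of N (A), and a basis for N (A) can be recovered from the singular value decomposition. The following is a minimal Python sketch (assuming numpy; the tolerance 1e-10 used to declare a singular value zero is an arbitrary choice).

    # Rank of A and a basis for its null space via the SVD
    import numpy as np

    A = np.array([[1.0, 1.0, 2.0],
                  [1.0, 0.0, 3.0],
                  [1.0, 0.0, 3.0]])
    c = np.array([3.0, -1.0, -1.0])

    U, s, Vt = np.linalg.svd(A)
    r = int(np.sum(s > 1e-10))
    print(r)                              # 2 = dim C(A)
    null_basis = Vt[r:].T                 # columns span N(A)
    print(null_basis.shape[1])            # 1 = dim N(A) = n - r
    print(np.allclose(A @ c, 0))          # True: c lies in N(A)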

Result MAR3.5. For an m× n matrix A with rank r ≤ n, the dimension of N (A) is

n − r. That is, dim{C(A)} + dim{N (A)} = n.

Proof. See pp 241-2 in Monahan.

Result MAR3.6. For an m×n matrix A, N (A′) and C(A) are orthogonal complements

in Rm.

Proof. Both N (A′) and C(A) are vector spaces with vectors in Rm. From the last

result, we know that dim{C(A)} = rank(A) = r, say, and dim{N (A′)} = m − r, since


r = rank(A) = rank(A′). Now we need to show N (A′) ∩ C(A) = {0}. Suppose x is in

both spaces. If x ∈ C(A), then x = Ac for some c. If x ∈ N (A′), then A′x = 0. Thus,

A′x = A′Ac = 0 =⇒ c′A′Ac = 0 =⇒ (Ac)′Ac = 0 =⇒ Ac = x = 0.

To finish the proof, we need to show thatN (A′) and C(A) are orthogonal spaces. Suppose

that x1 ∈ C(A) and x2 ∈ N (A′). It suffices to show that x′1x2 = 0. But, note that

x1 ∈ C(A) =⇒ x1 = Ac, for some c. Also, x2 ∈ N (A′) =⇒ A′x2 = 0. Since x′1x2 =

(Ac)′x2 = c′A′x2 = c′0 = 0, the result follows.

Result MAR3.7. Suppose that S1 and T1 are orthogonal complements. Suppose that

S2 and T2 are orthogonal complements. If S1 ⊆ S2, then T2 ⊆ T1.

Proof. See pp 244 in Monahan.

7.4 Systems of equations

REVIEW : Consider the system of equations Ax = c. If A is square and nonsingular,

then there is a unique solution to the system and it is x = A−1c. If A is not nonsingular,

then the system can have no solution, finitely many solutions, or infinitely many solutions.

TERMINOLOGY : The linear system Ax = c is consistent if there exists an x∗ such

that Ax∗ = c; that is, if c ∈ C(A).

REMARK : We will show that

• for every m × n matrix A, there exists an n × m matrix G such that AGA = A.

• for a consistent system Ax = c, if AGA = A, then x∗ = Gc is a solution.

Result MAR4.1. Suppose that Ax = c is consistent. If G is a matrix such that

AGA = A, then x∗ = Gc is a solution to Ax = c.

Proof. Because Ax = c is consistent, there exists an x∗ such that Ax∗ = c. Note that

AGc = AGAx∗ = Ax∗ = c. Thus, x∗ = Gc is a solution.


TERMINOLOGY : A matrix G that satisfies AGA = A is called a generalized inverse

of A and is denoted by A−. That is,

AGA = A =⇒ AA−A = A.

If A is square and nonsingular, then the generalized inverse of A is A−1 since AA−A =

AA−1A = A.

NOTES :

• Every matrix A, regardless of its dimension, has a generalized inverse.

• Generalized inverses are not unique unless A is nonsingular.

• If A is m× n, then A− is n×m.

• A generalized inverse of a symmetric matrix A is not necessarily symmetric. However, a symmetric generalized inverse can always be found. We will thus assume that the generalized inverse of a symmetric matrix is symmetric.

• If G is a generalized inverse of A, then G′ is a generalized inverse of A′.

• Monahan uses Ag to denote generalized inverse, but I will use A−.

Example. Consider the matrices

A = [ 4  1  2
      1  1  5
      3  1  3 ]     and     G = [  1/3  −1/3  0
                                  −1/3   4/3  0
                                    0     0   0 ].

Note that r(A) = 2 because −a1 + 6a2 − a3 = 0. Thus A−1 does not exist. However, it

is easy to show that AGA = A; thus, G is a generalized inverse of A.
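NUMERICAL ILLUSTRATION: The claim AGA = A is easy to verify numerically, and the Moore-Penrose generalized inverse returned by np.linalg.pinv shows that G is not the only possible choice. A minimal Python sketch (assuming numpy is available):

    # Verify that G is a generalized inverse of A
    import numpy as np

    A = np.array([[4.0, 1.0, 2.0],
                  [1.0, 1.0, 5.0],
                  [3.0, 1.0, 3.0]])
    G = np.array([[ 1/3, -1/3, 0.0],
                  [-1/3,  4/3, 0.0],
                  [ 0.0,  0.0, 0.0]])

    print(np.linalg.matrix_rank(A))        # 2, so A is singular and A^{-1} does not exist
    print(np.allclose(A @ G @ A, A))       # True: AGA = A
    Aplus = np.linalg.pinv(A)              # Moore-Penrose generalized inverse
    print(np.allclose(A @ Aplus @ A, A))   # True: another generalized inverse of A
    print(np.allclose(G, Aplus))           # False: generalized inverses are not unique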

Result MAR4.2. Let A be an m × n matrix with r(A) = r. If A can be partitioned

as follows

A = [ C  D
      E  F ],


where r(A) = r(C) = r, and Cr×r is nonsingular, then

G = [ C−1  0
       0   0 ]

is a generalized inverse of A. This result essentially shows that every matrix has a

generalized inverse (see Results A.10 and A.11, Monahan). Also, it gives a method to

compute it.

COMPUTATION : This is an algorithm for finding a generalized inverse A− for A, any

m× n matrix of rank r.

1. Find any r × r nonsingular submatrix C. It is not necessary that the elements of

C occupy adjacent rows and columns in A.

2. Find C−1 and (C−1)′.

3. Replace the elements of C by the elements of (C−1)′.

4. Replace all other elements of A by zeros.

5. Transpose the resulting matrix.
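NUMERICAL ILLUSTRATION: A minimal Python sketch of this algorithm (assuming numpy), under the simplifying assumption that a nonsingular r × r submatrix C occupies the leading r rows and columns and that r = r(A); in general one would first permute rows and columns to arrange this. The function name ginverse_leading_block is just an illustrative label, not notation from these notes.

    # Generalized inverse via the nonsingular-submatrix algorithm
    import numpy as np

    def ginverse_leading_block(A, r):
        # Assumes A[:r, :r] is nonsingular and r = r(A)
        m, n = A.shape
        C = A[:r, :r]                       # step 1: nonsingular r x r submatrix
        B = np.zeros((m, n))
        B[:r, :r] = np.linalg.inv(C).T      # steps 2-4: replace C by (C^{-1})', zeros elsewhere
        return B.T                          # step 5: transpose; the result is n x m

    A = np.array([[4.0, 1.0, 2.0],
                  [1.0, 1.0, 5.0],
                  [3.0, 1.0, 3.0]])
    G = ginverse_leading_block(A, r=2)
    print(np.allclose(A @ G @ A, A))        # True: G is a generalized inverse of A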

Result MAR4.3. Let Am×n, xn×1, cm×1, and In×n be matrices, and suppose that

Ax = c is consistent. Then, x∗ is a solution to Ax = c if and only if

x∗ = A−c + (I−A−A)z,

for some z ∈ Rn. Thus, we can generate all solutions by just knowing one of them; i.e.,

by knowing A−c.

Proof. (⇐=) We know that x∗ = A−c is a solution (Result MAR4.1). Suppose that

x∗ = A−c + (I−A−A)z, for some z ∈ Rn. Thus,

Ax∗ = AA−c + (A−AA−A)z = AA−c = c,

since A−AA−A = 0 and AA−c = c for a consistent system (Result MAR4.1);


that is, x∗ = A−c + (I−A−A)z solves Ax = c. Conversely, (=⇒) suppose that x∗ is a

solution to Ax = c. Now,

x∗ = A−c + x∗ −A−c

= A−c + x∗ −A−Ax∗ = A−c + (I−A−A)x∗.

Thus, x∗ = A−c + (I−A−A)z, where z = x∗. Note that if A is nonsingular, A− = A−1

and x∗ = A−1c + (I−A−1A)z = A−1c; i.e., there is just one solution.

NOTE : Consider the general form of the solution to Ax = c (which is assumed to be

consistent); i.e., x∗ = A−c+(I−A−A)z. We call A−c a particular solution. The term

(I−A−A)z is the general solution to the homogeneous equations Ax = 0, producing

vectors in N (A).

COMPARE: Suppose that X1, X2, ..., Xn is an iid sample from fX(x; θ) and let X = (X1, X2, ..., Xn)′. Suppose also that θ̂1 = θ̂1(X) is an unbiased estimator of θ; that is, Eθ[θ̂1(X)] = θ for all θ ∈ Θ, say. The general form of an unbiased estimator of θ is

θ̂ = θ̂1 + T,

where Eθ(T) = 0 for all θ ∈ Θ.

APPLICATION : Consider the general linear model

Y = Xβ + ε,

where Y is an n×1 vector of observed responses, X is an n×p matrix of rank r < p, β is

a p× 1 vector of fixed but unknown parameters, and ε is an n× 1 vector of (unobserved)

random errors. The normal equations are given by

X′Xβ = X′Y.

The normal equations are consistent (see below). Thus, the general form of the least

squares estimator is given by

β = (X′X)−X′Y + [I− (X′X)−X′X]z,


where z ∈ Rp. Of course, if r(X) = p, then (X′X)−1 exists, and the unique solution

becomes

β = (X′X)−1X′Y.

PROPOSITION : The normal equations X′Xβ = X′Y are consistent.

Proof. First, we will state and prove the following lemma.

LEMMA: For any matrix X, and for any matrices A and B,

X′XA = X′XB⇐⇒ XA = XB.

Proof. The necessity part (⇐=) is obvious. For the sufficiency part (=⇒), note that

X′XA = X′XB =⇒ X′XA−X′XB = 0

=⇒ (A−B)′(X′XA−X′XB) = 0

=⇒ (A−B)′X′(XA−XB) = 0

=⇒ (A′X′ −B′X′)(XA−XB) = 0

=⇒ (XA−XB)′(XA−XB) = 0.

This can only be true if XA −XB = 0. Thus, the lemma is proven. Now, let (X′X)−

denote a generalized inverse of X′X so that X′X(X′X)−X′X = X′X. Taking A′ =

X′X(X′X)− and B′ = I in the lemma, we have

X′X(X′X)−X′X = X′X =⇒ X′X(X′X)−X′ = X′

=⇒ X′X(X′X)−X′Y = X′Y.

This implies that β = (X′X)−X′Y is a solution to the normal equations. Hence, the

normal equations are consistent.

Example. Consider the one-way fixed effects ANOVA model

Yij = µ+ αi + εij,

for i = 1, 2 and j = 1, 2, ..., ni, where n1 = 2 and n2 = 3. It is easy to show that

X′X = [ 5  2  3
        2  2  0
        3  0  3 ].


One generalized inverse of X′X is

(X′X)⁻₁ = [ 0   0    0
            0  1/2   0
            0   0   1/3 ]

(the subscript merely indexes this particular choice of generalized inverse), and, since

X′Y = ( Y11 + Y12 + Y21 + Y22 + Y23,  Y11 + Y12,  Y21 + Y22 + Y23 )′,

a solution to the normal equations (based on this generalized inverse) is

β1 = (X′X)⁻₁ X′Y = ( 0,  (Y11 + Y12)/2,  (Y21 + Y22 + Y23)/3 )′ = ( 0,  Ȳ1+,  Ȳ2+ )′.

Another generalized inverse of X′X is

(X′X)⁻₂ = [  1/3  −1/3  0
            −1/3   5/6  0
              0     0   0 ],

and a solution to the normal equations (based on this generalized inverse) is

β2 = (X′X)⁻₂ X′Y = ( (Y21 + Y22 + Y23)/3,  (Y11 + Y12)/2 − (Y21 + Y22 + Y23)/3,  0 )′ = ( Ȳ2+,  Ȳ1+ − Ȳ2+,  0 )′.

The general solution is given by

β = (X′X)⁻₁ X′Y + [I − (X′X)⁻₁ X′X]z = ( 0, Ȳ1+, Ȳ2+ )′ + ( z1, −z1, −z1 )′ = ( z1,  Ȳ1+ − z1,  Ȳ2+ − z1 )′,

where z = (z1, z2, z3)′ ∈ R3; here

I − (X′X)⁻₁ X′X = [  1  0  0
                    −1  0  0
                    −1  0  0 ].

Furthermore, we see that the first particular solution corresponds to z1 = 0 while the second corresponds to z1 = Ȳ2+.
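NUMERICAL ILLUSTRATION: The calculations in this example are easy to reproduce. The following is a minimal Python sketch (assuming numpy; the response values in y are made up purely for illustration). Although the two solutions differ, they produce the same fitted values, as the code confirms for this data set.

    # Two generalized-inverse solutions to the normal equations X'X b = X'Y
    import numpy as np

    # Design matrix for Y_ij = mu + alpha_i + e_ij with n1 = 2, n2 = 3
    X = np.array([[1.0, 1.0, 0.0],
                  [1.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [1.0, 0.0, 1.0],
                  [1.0, 0.0, 1.0]])
    y = np.array([5.0, 7.0, 2.0, 4.0, 6.0])       # made-up responses

    XtX, Xty = X.T @ X, X.T @ y                   # X'X matches the display above
    G1 = np.array([[0.0, 0.0, 0.0],
                   [0.0, 1/2, 0.0],
                   [0.0, 0.0, 1/3]])
    G2 = np.array([[ 1/3, -1/3, 0.0],
                   [-1/3,  5/6, 0.0],
                   [ 0.0,  0.0, 0.0]])

    b1, b2 = G1 @ Xty, G2 @ Xty
    print(b1)                           # [0. 6. 4.]  = (0, Ybar_1+, Ybar_2+)
    print(b2)                           # [4. 2. 0.]  = (Ybar_2+, Ybar_1+ - Ybar_2+, 0)
    print(np.allclose(X @ b1, X @ b2))  # True: fitted values do not depend on the choice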


7.5 Perpendicular projection matrices

TERMINOLOGY : A square matrix P is idempotent if P2 = P.

TERMINOLOGY : A square matrix P is a projection matrix onto the vector space S

if and only if

1. P is idempotent

2. Px ∈ S, for any x

3. z ∈ S =⇒ Pz = z (projection).

Result MAR5.1. The matrix P = AA− projects onto C(A).

Proof. Clearly, AA− is a square matrix. Note that AA−AA− = AA−. Note that

AA−x = A(A−x) ∈ C(A). Finally, if z ∈ C(A), then z = Ax, for some x. Thus,

AA−z = AA−Ax = Ax = z.

NOTE : In general, projection matrices are not unique. However, if we add the require-

ment that Pz = 0, for any z⊥S, then P is called a perpendicular projection matrix,

which is unique. These matrices are important in linear models.

Result MAR5.2. The matrix I−A−A projects onto N (A).

Proof. Clearly, I−A−A is a square matrix. Note that

(I−A−A)(I−A−A) = I− 2A−A + A−A = I−A−A.

For any x, note that (I − A−A)x ∈ N (A) because A(I − A−A)x = 0. Finally, if

z ∈ N (A), then Az = 0. Thus, (I−A−A)z = z−A−Az = z.

Example. Consider the (linear) subspace of R2 defined by

S = { z : z = (2a, a)′, for a ∈ R }

and take

P = [ 0.8  0.4
      0.4  0.2 ].


Exercise. Show that P is a projection matrix onto S. Show that I−P is a projection

matrix onto S⊥, the orthogonal complement of S.
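NUMERICAL ILLUSTRATION: A minimal Python sketch of the exercise (assuming numpy; the test vectors are arbitrary choices).

    # Check that P projects onto S = {(2a, a)' : a in R}
    import numpy as np

    P = np.array([[0.8, 0.4],
                  [0.4, 0.2]])

    print(np.allclose(P @ P, P))            # True: P is idempotent
    print(np.allclose(P, P.T))              # True: P is symmetric, so it is in fact the
                                            #       perpendicular projection matrix onto S
    z = np.array([2.0, 1.0])                # a vector in S (a = 1)
    print(np.allclose(P @ z, z))            # True: Pz = z for z in S
    x = np.array([3.0, -1.0])               # an arbitrary vector
    Px = P @ x
    print(np.isclose(Px[0], 2 * Px[1]))     # True: Px lies in S
    print(np.allclose((np.eye(2) - P) @ z, 0))   # True: I - P annihilates S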

Result MAR5.3. The matrix M is a perpendicular projection matrix onto C(M) if and

only if M is symmetric and idempotent.

Proof. (=⇒) Suppose that M is a perpendicular projection matrix onto C(M) and write

v = v1 + v2, where v1 ∈ C(M) and v2⊥C(M). Also, let w = w1 + w2, where w1 ∈ C(M)

and w2⊥C(M). Since (I−M)v = (I−M)v2 and Mw = Mw1 = w1, we get

w′M′(I−M)v = w′1M′(I−M)v2 = w′1v2 = 0.

This is true for any v and w, so it must be true that M′(I −M) = 0 =⇒M′ = M′M.

Since M′M is symmetric, so is M′, and this, in turn, implies that M = M2. (⇐=) Now,

suppose that M is symmetric and idempotent. If M = M2 and v ∈ C(M), then since

v = Mb, for some b, we have that Mv = MMb = Mb = v (this establishes that M is

a projection matrix). To establish perpendicularity, note that if M′ = M and w⊥C(M),

then Mw = M′w = 0, because the columns of M are in C(M).

Result MAR5.4. If M is a perpendicular projection matrix onto C(X), then C(M) =

C(X).

Proof. We need to show that C(M) ⊆ C(X) and C(X) ⊆ C(M). Suppose that v ∈ C(M).

Then, v = Mb, for some b. Now, write b = b1 + b2, where b1 ∈ C(X) and b2⊥C(X).

Thus, v = Mb = M(b1 + b2) = Mb1 + Mb2 = b1 ∈ C(X). Thus, C(M) ⊆ C(X). Now

suppose that v ∈ C(X). Since M is a perpendicular projection matrix onto C(X), we

know that v = Mv = M(v1 + v2), where v1 ∈ C(X) and v2⊥C(X). But, M(v1 + v2) =

Mv1, showing that v ∈ C(M). Thus, C(X) ⊆ C(M) and the result follows.

Result MAR5.5. Perpendicular projection matrices are unique.

Proof. Suppose that M1 and M2 are both perpendicular projection matrices onto any

arbitrary subspace S ⊆ Rn. Let v ∈ Rn and write v = v1 +v2, where v1 ∈ S and v2⊥S.

Since v is arbitrary and M1v = v1 = M2v, we have M1 = M2.

Result MAR5.6. If M is the perpendicular projection matrix onto C(X), then I−M


is the perpendicular projection matrix onto N (X′).

Sketch of Proof. I −M is symmetric and idempotent so I −M is the perpendicular

projection matrix onto C(I−M). Show that C(I−M) = N (X′); use Result MAR5.5.

7.6 Trace, determinant, and eigenproblems

TERMINOLOGY : The sum of the diagonal elements of a square matrix A is called the

trace of A, written tr(A), that is, for An×n = (aij),

tr(A) = a11 + a22 + · · · + ann.

Result MAR6.1.

1. tr(A±B) = tr(A)± tr(B)

2. tr(cA) = ctr(A)

3. tr(A′) = tr(A)

4. tr(AB) = tr(BA)

5. tr(A′A) = ∑i ∑j a²ij.

TERMINOLOGY : The determinant of a square matrix A is a real number denoted by

|A| or det(A).

Result MAR6.2.

1. |A′| = |A|

2. |AB| = |BA|

3. |A−1| = |A|−1

4. |A| = 0 iff A is singular

5. For any n × n upper (lower) triangular matrix, |A| = a11a22 · · · ann.


REVIEW : The table below summarizes equivalent conditions for the existence of an

inverse matrix A−1 (where A has dimension n× n).

A−1 exists                              A−1 does not exist
A is nonsingular                        A is singular
|A| ≠ 0                                 |A| = 0
A has full rank                         A has less than full rank
r(A) = n                                r(A) < n
A has LIN rows (columns)                A does not have LIN rows (columns)
Ax = 0 has one solution, x = 0          Ax = 0 has many solutions

EIGENVALUES : Suppose that A is a square matrix and consider the equations Au =

λu. Note that

Au = λu⇐⇒ Au− λu = (A− λI)u = 0.

If u ≠ 0, then A − λI must be singular (see last table). Thus, the values of λ which

satisfy Au = λu are those values where

|A− λI| = 0.

This is called the characteristic equation of A. If A is n× n, then the characteristic

equation is a polynomial (in λ) of degree n. The roots of this polynomial, say, λ1, λ2, ..., λn

are the eigenvalues of A (some of these may be zero or even imaginary). If A is a

symmetric matrix, then λ1, λ2, ..., λn must be real.

EIGENVECTORS : If λ1, λ2, ..., λn are eigenvalues for A, then vectors ui satisfying

Aui = λiui,

for i = 1, 2, ..., n, are called eigenvectors. Note that

Aui = λiui =⇒ Aui − λiui = (A− λiI)ui = 0.

From our discussion on systems of equations and consistency, we know a general solution

for ui is given by ui = [I− (A− λiI)−(A− λiI)]z, for z ∈ Rn.


Result MAR6.3. If λi and λj are eigenvalues of a symmetric matrix A, and if λi ≠ λj, then the corresponding eigenvectors, ui and uj, are orthogonal.

Proof. We know that Aui = λiui and Auj = λjuj. The key is to recognize that

λiu′iuj = u′iAuj = λju′iuj,

which can only happen if λi = λj or if u′iuj = 0. But λi ≠ λj by assumption.

PUNCHLINE : For a symmetric matrix A, eigenvectors associated with distinct eigen-

values are orthogonal (we’ve just proven this) and, hence, are linearly independent. If

the symmetric matrix A has an eigenvalue λk, of multiplicity mk, then we can find mk

orthogonal eigenvectors of A which correspond to λk (Searle, pp 291). This leads to the

following result (c.f., Christensen, pp 402):

Result MAR6.4. If A is a symmetric matrix, then there exists a basis for C(A) con-

sisting of eigenvectors of nonzero eigenvalues. If λ is a nonzero eigenvalue of multiplicity

m, then the basis will contain m eigenvectors for λ. Furthermore, N (A) consists of the

eigenvectors associated with λ = 0 (along with 0).

SPECTRAL DECOMPOSITION : Suppose that An×n is symmetric with eigenvalues

λ1, λ2, ..., λn. The spectral decomposition of A is given by A = QDQ′, where

• Q is orthogonal; i.e., QQ′ = Q′Q = I,

• D = diag(λ1, λ2, ..., λn), a diagonal matrix consisting of the eigenvalues of A; note

that r(D) = r(A), because Q is orthogonal, and

• the columns of Q are orthonormal eigenvectors of A.

Result MAR6.5. If A is an n×n symmetric matrix with eigenvalues λ1, λ2, ..., λn, then

1. |A| = λ1λ2 · · ·λn

2. tr(A) = λ1 + λ2 + · · · + λn.


NOTE : These facts are also true for a general n× n matrix A.

Proof (in the symmetric case). Write A in its spectral decomposition A = QDQ′. Note that |A| = |QDQ′| = |DQ′Q| = |D| = λ1λ2 · · ·λn. Also, tr(A) = tr(QDQ′) = tr(DQ′Q) = tr(D) = λ1 + λ2 + · · · + λn.
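NUMERICAL ILLUSTRATION: A minimal Python sketch of the spectral decomposition and of Result MAR6.5 (assuming numpy; np.linalg.eigh is the eigenroutine for symmetric matrices, and the matrix A below is an arbitrary symmetric example).

    # Spectral decomposition A = Q D Q' and the determinant/trace identities
    import numpy as np

    A = np.array([[4.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])                 # arbitrary symmetric matrix

    lam, Q = np.linalg.eigh(A)                      # columns of Q: orthonormal eigenvectors
    D = np.diag(lam)

    print(np.allclose(Q @ D @ Q.T, A))              # True: A = Q D Q'
    print(np.allclose(Q.T @ Q, np.eye(3)))          # True: Q is orthogonal
    print(np.isclose(np.prod(lam), np.linalg.det(A)))   # True: |A| = product of eigenvalues
    print(np.isclose(np.sum(lam), np.trace(A)))         # True: tr(A) = sum of eigenvalues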

Result MAR6.6. Suppose that A is symmetric. The rank of A equals the number of

nonzero eigenvalues of A.

Proof. Write A in its spectral decomposition A = QDQ′. Because r(D) = r(A) and

because the only nonzero elements in D are the nonzero eigenvalues, the rank of D must

be the number of nonzero eigenvalues of A.

Result MAR6.7. The eigenvalues of an idempotent matrix A are equal to 0 or 1.

Proof. If λ is an eigenvalue of A, then Au = λu. Note that A2u = AAu = Aλu =

λAu = λ2u. This shows that λ2 is an eigenvalue of A2 = A. Thus, we have Au = λu

and Au = λ2u, which implies that λ = 0 or λ = 1.

Result MAR6.8. If the n× n matrix A is idempotent, then r(A) = tr(A).

Proof. From the last result, we know that the eigenvalues of A are equal to 0 or 1. Let

v1,v2, ...,vr be a basis for C(A). Denote by S the subspace of all eigenvectors associated

with λ = 1. Suppose v ∈ S. Then, because Av = v ∈ C(A), v can be written

as a linear combination of v1,v2, ...,vr. This means that any basis for C(A) is also a

basis for S. Furthermore, N (A) consists of eigenvectors associated with λ = 0 (because

Av = 0v = 0). Thus,

n = dim(Rn) = dim[C(A)] + dim[N (A)]

= r + dim[N (A)],

showing that dim[N (A)] = n − r. Since A has n eigenvalues, all are accounted for by λ = 1 (with multiplicity r) and λ = 0 (with multiplicity n − r). Now tr(A) = ∑i λi = r, the multiplicity of λ = 1. But r(A) = dim[C(A)] = r as well.


TERMINOLOGY : Suppose that x is an n× 1 vector. A quadratic form is a function

f : Rn → R of the form

f(x) = ∑i ∑j aijxixj = x′Ax.

The matrix A is called the matrix of the quadratic form.

Result MAR6.9. If x′Ax is any quadratic form, there exists a symmetric matrix B

such that x′Ax = x′Bx.

Proof. Note that x′A′x = (x′Ax)′ = x′Ax, since a quadratic form is a scalar. Thus,

x′Ax = (1/2)x′Ax + (1/2)x′A′x = x′{(1/2)A + (1/2)A′}x = x′Bx,

where B = (1/2)A + (1/2)A′. It is easy to show that B is symmetric.

UPSHOT : In working with quadratic forms, we can, without loss of generality, assume

that the matrix of the quadratic form is symmetric.

TERMINOLOGY : The quadratic form x′Ax is said to be

• nonnegative definite (nnd) if x′Ax ≥ 0, for all x ∈ Rn.

• positive definite (pd) if x′Ax > 0, for all x ≠ 0.

• positive semidefinite (psd) if x′Ax is nnd but not pd.

TERMINOLOGY : A symmetric n × n matrix A is said to be nnd, pd, or psd if the

quadratic form x′Ax is nnd, pd, or psd, respectively.

Result MAR6.10. Let A be a symmetric matrix. Then

1. A pd =⇒ |A| > 0

2. A nnd =⇒ |A| ≥ 0.

Result MAR6.11. Let A be a symmetric matrix. Then


1. A pd ⇐⇒ all eigenvalues of A are positive

2. A nnd ⇐⇒ all eigenvalues of A are nonnegative.

Result MAR6.12. A pd matrix is nonsingular. A psd matrix is singular. The converses

are not true.

CONVENTION : If A1 and A2 are n × n matrices, we write A1 ≥nnd A2 if A1 −A2 is

nonnegative definite (nnd) and A1 ≥pd A2 if A1 −A2 is positive definite (pd).

Result MAR6.13. Let A be an m× n matrix of rank r. Then A′A is nnd with rank

r. Furthermore, A′A is pd if r = n and is psd if r < n.

Proof. Let x be an n× 1 vector. Then x′(A′A)x = (Ax)′Ax ≥ 0, showing that A′A is

nnd. Also, r(A′A) = r(A) = r. If r = n, then the columns of A are linearly independent

and the only solution to Ax = 0 is x = 0. This shows that A′A is pd. If r < n, then

the columns of A are linearly dependent; i.e., there exists an x ≠ 0 such that Ax = 0.

Thus, A′A is nnd but not pd, so it must be psd.

RESULT : A square matrix A is pd iff there exists a nonsingular lower triangular matrix

L such that A = LL′. This is called the Choleski Factorization of A. Monahan

proves this result (see pp 258), provides an algorithm on how to find L, and includes an

example.

RESULT: Suppose that A is symmetric and pd. Writing A in its spectral decomposition, we have A = QDQ′. Because A is pd, the eigenvalues λ1, λ2, ..., λn of A are positive. If we define A1/2 = QD1/2Q′, where D1/2 = diag(√λ1, √λ2, ..., √λn), then A1/2 is symmetric and

A1/2A1/2 = QD1/2Q′QD1/2Q′ = QD1/2ID1/2Q′ = QDQ′ = A.

The matrix A1/2 is called the symmetric square root of A. See Monahan (pp 259-60)

for an example.
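NUMERICAL ILLUSTRATION: Both factorizations are easy to compute. A minimal Python sketch (assuming numpy; np.linalg.cholesky returns the lower triangular factor L with A = LL′, and the matrix A is an arbitrary symmetric pd example).

    # Choleski factor and symmetric square root of a pd matrix
    import numpy as np

    A = np.array([[4.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])        # symmetric and diagonally dominant, hence pd

    L = np.linalg.cholesky(A)              # lower triangular, A = L L'
    print(np.allclose(L @ L.T, A))         # True

    lam, Q = np.linalg.eigh(A)
    A_half = Q @ np.diag(np.sqrt(lam)) @ Q.T    # A^{1/2} = Q D^{1/2} Q'
    print(np.allclose(A_half, A_half.T))        # True: symmetric
    print(np.allclose(A_half @ A_half, A))      # True: A^{1/2} A^{1/2} = A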


7.7 Random vectors

TERMINOLOGY : Suppose that Y1, Y2, ..., Yn are random variables. We call

Y = (Y1, Y2, ..., Yn)′

a random vector. The joint pdf of Y is denoted by fY(y).

DEFINITION: Suppose that E(Yi) = µi, var(Yi) = σ²i, for i = 1, 2, ..., n, and cov(Yi, Yj) = σij, for i ≠ j. The mean of Y is

µ = E(Y) = ( E(Y1), E(Y2), ..., E(Yn) )′ = ( µ1, µ2, ..., µn )′.

The variance-covariance matrix of Y is

Σ = cov(Y) = [ σ²1   σ12  · · ·  σ1n
               σ21   σ²2  · · ·  σ2n
                ⋮     ⋮     ⋱     ⋮
               σn1   σn2  · · ·  σ²n ].

NOTE: Σ contains the variances σ²1, σ²2, ..., σ²n on the diagonal and the n(n−1)/2 covariance terms cov(Yi, Yj), for i < j, as the elements strictly above the diagonal. Since cov(Yi, Yj) = cov(Yj, Yi), it follows that Σ is symmetric.

EXAMPLE : Suppose that Y1, Y2, ..., Yn is an iid sample with mean E(Yi) = µ and

variance var(Yi) = σ2 and let Y = (Y1, Y2, ..., Yn)′. Then µ = E(Y) = µ1n and

Σ = cov(Y) = σ2In.

EXAMPLE : Consider the GM linear model Y = Xβ+ε. In this model, the random errors

ε1, ε2, ..., εn are uncorrelated random variables with zero mean and constant variance σ2.

We have E(ε) = 0n×1 and cov(ε) = σ2In.


TERMINOLOGY : Suppose that Z11, Z12, ..., Znp are random variables. We call

Zn×p = [ Z11  Z12  · · ·  Z1p
         Z21  Z22  · · ·  Z2p
          ⋮    ⋮     ⋱     ⋮
         Zn1  Zn2  · · ·  Znp ]

a random matrix. The mean of Z is

E(Z) = [ E(Z11)  E(Z12)  · · ·  E(Z1p)
         E(Z21)  E(Z22)  · · ·  E(Z2p)
           ⋮       ⋮       ⋱      ⋮
         E(Zn1)  E(Zn2)  · · ·  E(Znp) ]  (n × p).

Result RV1. Suppose that Y is a random vector with mean µ. Then

Σ = cov(Y) = E[(Y − µ)(Y − µ)′] = E(YY′)− µµ′.

Proof. That cov(Y) = E[(Y −µ)(Y −µ)′] follows straightforwardly from the definition

of variance and covariance in the scalar case. Showing this equals E(YY′)−µµ′ is simple

algebra.

DEFINITION : Suppose that Yp×1 and Xq×1 are random vectors with means µY and

µX, respectively. The covariance between Y and X is the p× q matrix defined by

cov(Y,X) = E{(Y − µY)(X − µX)′} = (σij)p×q,

where

σij = E[{Yi − E(Yi)}{Xj − E(Xj)}] = cov(Yi, Xj).

DEFINITION : Random vectors Yp×1 and Xq×1 are uncorrelated if cov(Y,X) = 0p×q.

Result RV2. If cov(Y,X) = 0, then cov(Y, a + BX) = 0, for all nonrandom con-

formable a and B. That is, Y is uncorrelated with any linear function of X.


TERMINOLOGY: Suppose that var(Yi) = σ²i, for i = 1, 2, ..., n, and cov(Yi, Yj) = σij, for i ≠ j. The correlation matrix of Y is the n × n matrix

R = (ρij) = [  1   ρ12  · · ·  ρ1n
              ρ21   1   · · ·  ρ2n
               ⋮    ⋮     ⋱     ⋮
              ρn1  ρn2  · · ·   1  ],

where, recall, the correlation ρij is given by

ρij = σij/(σiσj),

for i, j = 1, 2, ..., n.

TERMINOLOGY : Suppose that Y1, Y2, ..., Yn are random variables and that a1, a2, ..., an

are constants. Define a = (a1, a2, ..., an)′ and Y = (Y1, Y2, ..., Yn)′. The random variable

X = a′Y = a1Y1 + a2Y2 + · · · + anYn

is called a linear combination of Y1, Y2, ..., Yn.

Result RV3. If a = (a1, a2, ..., an)′ is a vector of constants and Y = (Y1, Y2, ..., Yn)′ is a

random vector with mean µ = E(Y), then

E(a′Y) = a′µ.

Proof. The quantity a′Y is a scalar so E(a′Y) is also a scalar. Note that

E(a′Y) = E(∑i aiYi) = ∑i aiE(Yi) = ∑i aiµi = a′µ.

Result RV4. Suppose that Y = (Y1, Y2, ..., Yn)′ is a random vector with mean µ =

E(Y), let Z be a random matrix, and let A and B (a and b) be nonrandom conformable

matrices (vectors). Then

1. E(AY) = Aµ

2. E(a′Zb) = a′E(Z)b.


3. E(AZB) = AE(Z)B.

Result RV5. If a = (a1, a2, ..., an)′ is a vector of constants and Y = (Y1, Y2, ..., Yn)′ is a

random vector with mean µ = E(Y) and covariance matrix Σ = cov(Y), then

var(a′Y) = a′Σa.

Proof. The quantity a′Y is a scalar random variable, and its variance is given by

var(a′Y) = E{(a′Y − a′µ)²} = E[{a′(Y − µ)}²] = E{a′(Y − µ)a′(Y − µ)}.

But, note that a′(Y − µ) is a scalar, and hence equals (Y − µ)′a. Using this fact, we can rewrite the last expectation to get

E{a′(Y − µ)(Y − µ)′a} = a′E{(Y − µ)(Y − µ)′}a = a′Σa.

Result RV6. Suppose that Y = (Y1, Y2, ..., Yn)′ is a random vector with covariance

matrix Σ = cov(Y), and let a and b be conformable vectors of constants. Then

cov(a′Y,b′Y) = a′Σb.

Result RV7. Suppose that Y = (Y1, Y2, ..., Yn)′ is a random vector with mean µ = E(Y)

and covariance matrix Σ = cov(Y). Let b, A, and B denote nonrandom conformable

vectors/matrices. Then

1. E(AY + b) = Aµ + b

2. cov(AY + b) = AΣA′

3. cov(AY,BY) = AΣB′.
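NUMERICAL ILLUSTRATION: Results RV5 and RV7 can be checked by simulation. A minimal Python sketch (assuming numpy; Σ, a, A, and the multivariate normal sampling distribution are arbitrary choices used only to generate data, and the agreement is approximate, improving with the number of draws).

    # Empirical check of var(a'Y) = a' Sigma a and cov(AY) = A Sigma A'
    import numpy as np

    rng = np.random.default_rng(714)
    mu = np.array([1.0, 0.0, 2.0])
    Sigma = np.array([[2.0, 0.5, 0.0],
                      [0.5, 1.0, 0.3],
                      [0.0, 0.3, 1.5]])
    a = np.array([1.0, -2.0, 1.0])
    A = np.array([[1.0, 1.0, 0.0],
                  [0.0, 1.0, -1.0]])

    Y = rng.multivariate_normal(mu, Sigma, size=200000)   # each row is a draw of Y'
    print(np.var(Y @ a), a @ Sigma @ a)                   # approximately equal
    print(np.cov(A @ Y.T))                                # approximately equal to ...
    print(A @ Sigma @ A.T)                                # ... A Sigma A'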

Result RV8. A variance-covariance matrix Σ = cov(Y) is nonnegative definite.

Proof. Suppose that Yn×1 has variance-covariance matrix Σ. We need to show that

a′Σa ≥ 0, for all a ∈ Rn. Consider X = a′Y, where a is a conformable vector of

constants. Then, X is scalar and var(X) ≥ 0. But, var(X) = var(a′Y) = a′Σa. Since a

is arbitrary, the result follows.


Result RV9. If Y = (Y1, Y2, ..., Yn)′ is a random vector with mean µ = E(Y) and covariance matrix Σ, then P{(Y − µ) ∈ C(Σ)} = 1.

Proof. Without loss, take µ = 0, and let MΣ be the perpendicular projection matrix onto C(Σ). We know that Y = MΣY + (I−MΣ)Y and that

E{(I−MΣ)Y} = (I−MΣ)E(Y) = 0,

since µ = E(Y) = 0. Also,

cov{(I−MΣ)Y} = (I−MΣ)Σ(I−MΣ)′ = (Σ−MΣΣ)(I−MΣ)′ = 0,

since MΣΣ = Σ. Thus, we have shown that P{(I−MΣ)Y = 0} = 1, which implies that P(Y = MΣY) = 1. Since MΣY ∈ C(Σ), we are done.

IMPLICATION : Result RV9 says that there exists a subset C(Σ) ⊆ Rn that contains Y

with probability one (i.e., almost surely). If Σ is positive semidefinite (psd), then Σ is

singular and C(Σ) is concentrated in a subspace of Rn, where the subspace has dimension

r = r(Σ), r < n. In this situation, the pdf of Y may not exist.

Result RV10. Suppose that X, Y, and Z are n× 1 vectors and that X = Y + Z. Then

1. E(X) = E(Y) + E(Z)

2. cov(X) = cov(Y) + cov(Z) + 2cov(Y,Z)

3. if Y and Z are uncorrelated, then cov(X) = cov(Y) + cov(Z).
