Factor Analysis
Peter Hoff
October 9, 2020
Contents

1 Non-random factor models
2 Random factor models
3 Gaussian factor analysis
  3.1 MLEs for the Gaussian FA
  3.2 Numerical examples
4 VARIMAX rotation
5 Recovery of latent factors
6 Decorrelation and ICA
  6.1 Decorrelation
  6.2 ICA
Abstract

Extending the partial isotropy model to the case of heteroscedastic measurement error yields the factor analysis model. We consider versions of factor analysis based on matrix decomposition methods, method of moments, and likelihood estimation with normality assumptions. References for some of this material are Chapter 10 of Hardle and Simar [2015], Chapter 15 of Izenman [2008], and Chapter 9 of Mardia et al. [1979].
1 Non-random factor models
Suppose we have a data matrix Y whose ith row y_i is a p-variate measurement of a signal vector m_i plus mean-zero measurement error with diagonal covariance Ψ. Then

    y_i = m_i + Ψ^{1/2} e_i
    E[y_i | m_i] = m_i
    Var[y_i | m_i] = Ψ,
where E[e_i] = 0 and Var[e_i] = I_p. Assuming the measurement errors across the rows are uncorrelated, we have

    Y = M + EΨ^{1/2}
    E[Y | M] = M
    Var[Y | M] = Ψ ⊗ I_n,

where Var[E] = I_p ⊗ I_n and M has rows m_1, ..., m_n.
Statistical inference for M and Ψ is challenging if nothing else is assumed about M or Ψ. For example, while the OLS estimator M̂ = Y is unbiased for M, its precision depends on the unknown matrix Ψ, which can't be well estimated unless we make assumptions about M: the residual sum of squares for the OLS estimator is zero. Assuming normality of E does not help; in this case the MLE of M is Y and the MLE of Ψ is 0, which is generally unreasonable.
However, in many applications it is reasonable to assume that the heterogeneity in the rows m_1, ..., m_n ∈ R^p of M is due to heterogeneity in some unobserved, lower-dimensional row-specific factors z_1, ..., z_n ∈ R^q with q < p; that is, there is some function that maps each unobserved latent factor z_i to the signal m_i. If this map is linear, we have

    m_i = µ + A z_i
    M = 1µ^⊤ + ZA^⊤.
Plugging this into the model for Y gives

    Y = 1µ^⊤ + ZA^⊤ + EΨ^{1/2}
    E[Y | Z] = 1µ^⊤ + ZA^⊤
    Var[Y | Z] = Ψ ⊗ I_n.
This model is sometimes referred to as a q-dimensional linear factor model.
The rows of Z are referred to as factors and the matrix A is referred to as
the factor loading matrix.
Note that in this model, the mean matrix M is exactly a column-wise location shift of a rank-q matrix. As discussed in the notes on eigendecompositions, for a given value of q an ordinary least squares (OLS) estimator of M = 1µ^⊤ + ZA^⊤ is given by

• µ̂ = ȳ = Y^⊤1/n
• ẐÂ^⊤ = U_q D_q V_q^⊤,

where UDV^⊤ is the SVD of the column-centered matrix CY. A couple of comments on such an estimate:
1. The estimate of M is (typically) unique.
2. One OLS estimator of (Z, A) is (Ẑ, Â) = (U_q D_q, V_q), which we previously called the first q principal component scores and axes, respectively.
3. If (Ẑ, Â) is an OLS estimator of (Z, A), then so is (ẐG, Â(G^{-1})^⊤) for any invertible G ∈ R^{q×q}.
4. If Ẑ is an OLS estimator of Z, then the column means of Ẑ are all zero.
Item 2 indicates why the expressions "PCA" and "factor analysis" are sometimes used interchangeably. However, note that the OLS estimator presented here essentially ignores the possibility of heteroscedastic errors, that is, that Ψ might not be proportional to the identity. If we used a GLS or weighted least-squares estimator that allowed for heteroscedasticity (possibly using feasible GLS), the estimates would not be the same as the best q-dimensional affine approximation obtained from PCA.
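As a concrete illustration, here is a minimal R sketch of this OLS/PCA estimator (the function name ols_factor and its exact form are my own; the rank q is taken as given):

ols_factor<-function(Y,q)
{
  ## OLS estimate of M = 1 mu^T + Z A^T via the SVD of the centered data
  mu<-apply(Y,2,mean)
  Yc<-sweep(Y,2,mu,"-")                    # the centered matrix CY
  sY<-svd(Yc,nu=q,nv=q)
  Z<-sY$u%*%diag(sY$d[1:q],nrow=q)         # first q principal component scores
  A<-sY$v                                  # first q principal axes
  list(mu=mu,Z=Z,A=A,M=tcrossprod(rep(1,nrow(Y)),mu)+Z%*%t(A))
}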
Additionally, note that

    {ZA^⊤ : Z ∈ R^{n×q}, A ∈ R^{p×q}} = {M ∈ R^{n×p} : rank(M) ≤ q},

and so estimation of ZA^⊤ is in the category of the "low-rank matrix estimation" problem, which has a large literature across many disciplines. A particular area of recent research has been low-rank matrix estimation when the rank q is unknown. The strategy here is typically to threshold or shrink the singular values of Y or CY. Some references include the following:
• Singular value shrinkage: Overloading notation, let Y = UDV^⊤ be the SVD of Y. The estimate of M is taken to be M̂ = U f(D) V^⊤, where f shrinks and/or thresholds the singular values in D. Some references include Cai et al. [2010], Mazumder et al. [2010], Josse and Sardy [2013], and Donoho and Gavish [2013]. Gerard and Hoff [2015] consider the tensor case.
• Bayesian rank selection and model averaging: A more computationally demanding procedure is to put a prior on U, D and V that allows zeros in the diagonal elements of D. Such an approach is detailed in Hoff [2007].
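To make the shrinkage idea concrete, here is a minimal sketch of singular value soft-thresholding in the spirit of Cai et al. [2010] and Mazumder et al. [2010]; the threshold lambda is a tuning parameter whose choice is not addressed here:

svt<-function(Y,lambda)
{
  ## estimate the (centered) mean matrix by soft-thresholding singular values
  Yc<-sweep(Y,2,apply(Y,2,mean),"-")   # CY
  sY<-svd(Yc)
  d<-pmax(sY$d-lambda,0)               # f(D): shrink and threshold
  sY$u%*%diag(d)%*%t(sY$v)
}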
Finally, some foreshadowing: regarding items 3 and 4 above, suppose we take (Ẑ, Â) = (√n U_q, V_q D_q/√n). Then

    Ẑ^⊤Ẑ/n = n U_q^⊤U_q/n = I_q
    Ẑ^⊤1/n = 0.

In other words, the columns of Ẑ are mean-zero and uncorrelated. This means that our estimated latent factor representation of Y has the form

    Y ≈ 1µ̂^⊤ + ẐÂ^⊤,

where the rows ẑ_1, ..., ẑ_n of Ẑ have sample mean vector zero and sample variance equal to the identity. We could instead take an OLS estimator of Z with a different sample variance matrix, but the fit would not change.
2 Random factor models
If the rows of Y represent a random sample of n objects from a population,
then the rows of Z are a random sample as well, in which case we are more
likely to be interested in the population mean and variance of Y, on average
across different random samples of objects, or equivalently, on average across
different random samples of the rows of Z. The factor model in this case is
referred to as a random factor model, since the factor matrix Z is thought of
as a random sample from some population or process.
Random factor model: The q-dimensional random factor model for an n × p matrix Y is given by

    Y = 1µ^⊤ + ZA^⊤ + EΨ^{1/2},

where

• µ ∈ R^p, A ∈ R^{p×q}, and Ψ = diag(ψ_1, ..., ψ_p) are unknown parameters;
• E[Z] = 0, Var[Z] = I_q ⊗ I_n;
• E[E] = 0, Var[E] = I_p ⊗ I_n;
• Z and E are uncorrelated.
Terminology:

• The matrix A is called the factor loading matrix;
• The rows of Z are called the common factors;
• The rows of E are called the specific factors or unique factors.

The elements of a common factor z_i are shared among the elements of y_i; those of a specific factor e_i are not.
Under this model,

    Var[y_i] = AA^⊤ + Ψ
    Var[y_{i,j}] = Σ_{k=1}^q a_{j,k}² + ψ_j.

• Σ_{k=1}^q a_{j,k}² is the communality of the jth variable;
• ψ_j is the jth specific variance or unique variance.

Note that if Ψ = σ²I for some σ² > 0, then this is the partial isotropy model.
Moment conditions on Z: The random factor model above specifies that the mean and variance of Z are 0 and I_q ⊗ I_n, respectively. There are other specifications in the literature, but this one is pretty common. Why do we make these moment assumptions about Z? The short answer is that there is no reason not to. We could allow different first and second moments for Z, but doing so won't change the range of first and second moments for Y that are possible. An alternative way of saying this is that the first two moments of Z are not separately identifiable from the other parameters, µ and A.
To see this, suppose E[Z] = 1θ^⊤ and Var[Z] = Φ ⊗ I_n for some θ ∈ R^q and Φ ∈ S_q^+, that is, the rows of Z have a common expectation and variance and are uncorrelated. This gives

    E[Y] = 1µ^⊤ + 1θ^⊤A^⊤ = 1(µ + Aθ)^⊤
    Var[Y] = (AΦA^⊤) ⊗ I_n + (Ψ ⊗ I_n) = (AΦA^⊤ + Ψ) ⊗ I_n.
Now construct

    Z̃ = (Z − 1θ^⊤)Φ^{-1/2}, so that
    Z = 1θ^⊤ + Z̃Φ^{1/2}.

Note that E[Z̃] = 0 and Var[Z̃] = I_q ⊗ I_n. Now Y can be expressed either as a function of Z or of Z̃:

    Y = 1µ^⊤ + ZA^⊤ + EΨ^{1/2}
      = 1(µ + Aθ)^⊤ + Z̃Φ^{1/2}A^⊤ + EΨ^{1/2}
      ≡ 1µ̃^⊤ + Z̃Ã^⊤ + EΨ^{1/2},

where µ̃ = µ + Aθ and Ã = AΦ^{1/2}.
So note that

• the first and second moments of Z and Z̃ are different;
• re-expressing Y in terms of Z̃ doesn't change its moments.

Therefore, we won't be able to tell the difference between Z and Z̃, that is, the difference between

    (µ, A, θ, Φ) and (µ̃, Ã, 0, I) = (µ + Aθ, AΦ^{1/2}, 0, I),

unless

• we have prior information about the values of the parameters, or
• we use more information from Y than just first and second moments.
More precisely, we have the following:

Theorem 1. For every (µ, A, θ, Φ) there exist µ̃ ∈ R^p and Ã ∈ R^{p×q} such that

    E[Y | µ, A, θ, Φ, Ψ] = E[Y | µ̃, Ã, 0, I, Ψ]
    Var[Y | µ, A, θ, Φ, Ψ] = Var[Y | µ̃, Ã, 0, I, Ψ].

Thus any statistical method based on just the first and second moments of Y will not be able to distinguish the moments of Z from µ and A. Therefore, moment-based approaches usually set the moments of Z to some convenient value (like 0 and I ⊗ I), and recognize that any aspects of Z that we may estimate are only known up to affine transformations.
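Here is a quick numerical check of Theorem 1 (a sketch; it uses the Cholesky factorization to get one matrix square root Φ^{1/2}, though any square root works):

## check that (mu, A, theta, Phi) and (mu + A theta, A Phi^{1/2}, 0, I)
## give the same first two moments
p<-4 ; q<-2
A<-matrix(rnorm(p*q),p,q) ; theta<-rnorm(q) ; mu<-rnorm(p)
Phi<-crossprod(matrix(rnorm(q*q),q,q))+diag(q)   # a q x q covariance matrix
At<-A%*%t(chol(Phi))                             # A Phi^{1/2}
mut<-mu+A%*%theta                                # the new mean parameter
range( A%*%Phi%*%t(A) - tcrossprod(At) )         # ~ 1e-15: variances agree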
More non-identifiability: Even after making some restrictions on the moments of Z, some model non-identifiability remains. Under the above-described factor model, the first and second moments of Y are

    E[Y] = 1µ^⊤
    Var[Y] = (AA^⊤ + Ψ) ⊗ I_n.

Now recall that

    AA^⊤ + Ψ = (AO)(AO)^⊤ + Ψ

for any matrix O ∈ O_q, for which OO^⊤ = O^⊤O = I_q. Therefore

    E[Y | µ, A, Ψ] = E[Y | µ, AO, Ψ]
    Var[Y | µ, A, Ψ] = Var[Y | µ, AO, Ψ],

and so the mapping from the parameter space to the moment space is not one-to-one.
Some remedies include the following:

• Use a reduced parameterization, such as AA^⊤ = UD²U^⊤.
• Impose an identifiability constraint, such as A^⊤ΨA being diagonal.
• Do nothing, except remember that A and AO can't be distinguished without additional information or assumptions.
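A numerical illustration of this rotational non-identifiability (a sketch, using a random orthogonal O ∈ O_q obtained from a QR decomposition):

## A and AO imply the same covariance matrix AA^T + Psi
p<-5 ; q<-2
A<-matrix(rnorm(p*q),p,q) ; Psi<-diag(rexp(p))
O<-qr.Q(qr(matrix(rnorm(q*q),q,q)))    # a random q x q orthogonal matrix
range( (tcrossprod(A)+Psi) - (tcrossprod(A%*%O)+Psi) )   # ~ 1e-15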
Scale invariance: Let Y follow a q-dimensional factor model. Now construct Ỹ = YD, where D is a diagonal matrix. Then

    E[Ỹ] = 1(Dµ)^⊤
    Var[Ỹ] = [(DA)(DA)^⊤ + DΨD] ⊗ I_n,

and so Ỹ also follows a q-dimensional factor model. So we say that the factor model is closed under rescalings of the variables, or scale invariant. Note that the spiked covariance model does not have this invariance.
3 Gaussian factor analysis
Now let’s add the most basic of distributional assumptions, that the common
and specific factors are normally distributed:
Gaussian factor model (factor representation): The q-factor model for a random n × p Gaussian matrix Y is given by

    Y = 1µ^⊤ + ZA^⊤ + EΨ^{1/2},

where

• µ ∈ R^p, A ∈ R^{p×q}, Ψ = diag(ψ_1, ..., ψ_p);
• Z ∼ N_{n×q}(0, I ⊗ I);
• E ∼ N_{n×p}(0, I ⊗ I);
• Z and E are independent.
Note that we have retained the same identifiability restrictions on the mean and variance of Z. This is because

• the mean and variance of Z are not separately identifiable from µ and A in terms of the first and second moments of Y;
• the first and second moments of a Gaussian matrix determine its distribution.

Therefore, we set the mean and variance of Z to some reference values.
Gaussian factor model (covariance representation): As we just noted, the distribution of a Gaussian matrix Y is entirely determined by its first and second moments. What are these moments? First,

    E[Y | µ, A, Ψ] = 1µ^⊤.
For practice, let's compute the variance of Y, or equivalently, of y = vec(Y):

    y = (µ ⊗ I)1 + (A ⊗ I)z + (Ψ^{1/2} ⊗ I)e
    y − E[y] = (A ⊗ I)z + (Ψ^{1/2} ⊗ I)e
    Cov[y] = E[(y − E[y])(y − E[y])^⊤]
           = E[(A ⊗ I)zz^⊤(A ⊗ I)^⊤ + (Ψ^{1/2} ⊗ I)ee^⊤(Ψ^{1/2} ⊗ I)^⊤ + cross terms]
           = AA^⊤ ⊗ I + Ψ ⊗ I
           = (AA^⊤ + Ψ) ⊗ I.

Therefore, the Gaussian factor model is 100% completely equivalent to the following multivariate normal model:

    Y ∼ N_{n×p}(1µ^⊤, (AA^⊤ + Ψ) ⊗ I).
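A quick Monte Carlo sanity check of this covariance representation (a sketch; with a large n the sample covariance of Y should be close to AA^⊤ + Ψ):

## simulate from the factor representation and compare cov(Y) to AA^T + Psi
n<-100000 ; p<-4 ; q<-2
A0<-matrix(rnorm(p*q),p,q) ; Psi0<-diag(rexp(p))
Z0<-matrix(rnorm(n*q),n,q) ; E0<-matrix(rnorm(n*p),n,p)
Y0<-Z0%*%t(A0) + E0%*%sqrt(Psi0)
range( cov(Y0) - (tcrossprod(A0)+Psi0) )   # small, of order 1/sqrt(n)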
3.1 MLEs for the Gaussian FA
We now show how to obtain the MLEs for (µ, A, Ψ). The MLE for µ is easy to obtain, but getting the MLEs for A and Ψ will require an iterative algorithm.

MLE of µ: The −2 log likelihood (modulo constants) is

    d(µ, A, Σ) = n log |Σ| + tr((Y − 1µ^⊤)Σ^{-1}(Y − 1µ^⊤)^⊤).

For any Σ, including Σ = AA^⊤ + Ψ, the minimizer in µ is the value that minimizes the trace term

    tr((Y − 1µ^⊤)Σ^{-1}(Y − 1µ^⊤)^⊤) = tr(YΣ^{-1}Y^⊤) − 2 tr(YΣ^{-1}µ1^⊤) + tr(1µ^⊤Σ^{-1}µ1^⊤).

As a function of µ, this is

    −2n tr(ȳ^⊤Σ^{-1}µ) + n tr(µ^⊤Σ^{-1}µ).

You can take some derivatives or complete the square to show that this is minimized by µ̂ = ȳ. Note that this holds for any Σ, so whatever Σ̂ optimizes the likelihood, the MLE for (µ, Σ) will be (ȳ, Σ̂).
MLE of A, Ψ: Additionally, this means that the MLE for (A, Ψ) may be found by subtracting off the column means from Y and using this centered matrix to find the MLE under a zero-mean factor model. Specifically, let Y now be the column-centered data matrix. The MLE for (A, Ψ) is the MLE under the model that assumes Y ∼ N_{n×p}(0, (AA^⊤ + Ψ) ⊗ I_n). We can approximate the MLE using the EM algorithm as follows. Recall that the density for Y under this model can be written as

    p(Y | AA^⊤ + Ψ) = ∫ p(Y | Z, A, Ψ) p(dZ),

where the conditional density of Y given Z is that of the N(ZA^⊤, Ψ ⊗ I_n) distribution. This problem can be expressed in generic form as follows:

    max_θ p(y | θ) = max_θ ∫ p(y | θ, z) p(dz).
One approach to finding the maximizer is with the EM algorithm, which proceeds iteratively as follows:

• E-step: compute l(θ : θ^{(s)}) = E[log p(y | z, θ) | y, θ^{(s)}];
• M-step: let θ^{(s+1)} = arg max_θ l(θ : θ^{(s)}).

Actually, the EM algorithm is more general than this: it can handle models where there are parameters for the missing/latent data z.
Let's derive the algorithm for the case of our Gaussian factor model. Instead of maximizing the log likelihood, we will minimize the −2 log likelihood, or what I will somewhat incorrectly call the deviance. Recall that the conditional deviance is

    d(A, Ψ : Z) = n log |Ψ| + tr((Y − ZA^⊤)Ψ^{-1}(Y − ZA^⊤)^⊤)
                = n log |Ψ| + tr(YΨ^{-1}Y^⊤) − 2 tr(A^⊤Ψ^{-1}Y^⊤Z) + tr(A^⊤Ψ^{-1}AZ^⊤Z).
The EM algorithm proceeds by minimizing in (A, Ψ) the expected value of this quantity, where the expectation is over Z with respect to its conditional distribution given Y and the current values (A^{(s)}, Ψ^{(s)}). Let

• Z̄ = E[Z | Y, A^{(s)}, Ψ^{(s)}];
• S̄ = E[Z^⊤Z | Y, A^{(s)}, Ψ^{(s)}].

The expected conditional deviance is then

    d̄(A, Ψ) = n log |Ψ| + tr(YΨ^{-1}Y^⊤) − 2 tr(A^⊤Ψ^{-1}Y^⊤Z̄) + tr(A^⊤Ψ^{-1}AS̄).
Exercise 1. Show via completing the square and/or calculus that the minimizer of d̄(A, Ψ) in A is Â = Y^⊤Z̄S̄^{-1}.
Plugging Â into d̄ and manipulating with traces gives

    d̄(Â, Ψ) = n log |Ψ| + tr(Ψ^{-1}Y^⊤Y) − 2 tr(Ψ^{-1}Y^⊤Z̄Â^⊤) + tr(Ψ^{-1}ÂS̄Â^⊤)
             = n log |Ψ| + tr( Ψ^{-1}(Y^⊤Y − 2Y^⊤Z̄Â^⊤ + ÂS̄Â^⊤) )
             ≡ n log |Ψ| + tr(Ψ^{-1}E^⊤E).
Exercise 2. Show that the minimizer of d̄(Â, Ψ) in Ψ is Ψ̂ = diag(diag(E^⊤E))/n.
To summarize, the MLEs of (µ, A, Ψ) for the Gaussian factor model may be obtained as follows:

1. Let µ̂ = ȳ;
2. Reassign Y ← CY;
3. Given a starting value (A^{(1)}, Ψ^{(1)}), iterate the following steps until convergence:
   (a) Compute Z̄ = E[Z | Y, A^{(s)}, Ψ^{(s)}] and S̄ = E[Z^⊤Z | Y, A^{(s)}, Ψ^{(s)}].
   (b) Let A^{(s+1)} = Y^⊤Z̄S̄^{-1}.
   (c) Let Ψ^{(s+1)} = diag(diag(E^⊤E))/n, where E^⊤E = Y^⊤Y − 2Y^⊤Z̄A^⊤ + AS̄A^⊤, evaluated at A = A^{(s+1)}.
Are we done? No, we haven't yet determined the formulas for Z̄ and S̄.

Exercise 3. Obtain the conditional distribution of Z given Y and values of A and Ψ. Show that

    Z̄ ≡ E[Z | Y, A, Ψ] = YΨ^{-1}A[A^⊤Ψ^{-1}A + I]^{-1}
    S̄ ≡ E[Z^⊤Z | Y, A, Ψ] = Z̄^⊤Z̄ + n(A^⊤Ψ^{-1}A + I)^{-1}.

Ok, now we're done.
3.2 Numerical examples
Here is some R code for one EM step:

fana_em<-function(Y,APsi)
{
  #### ---- one EM step
  A<-APsi$A ; Psi<-APsi$Psi ; iPsi<-diag(1/diag(Psi))
  Vz<-solve( t(A)%*%iPsi%*%A + diag(nrow=ncol(A)) )
  Zb<- Y%*%iPsi%*%A%*%Vz
  Sb<- t(Zb)%*%Zb + nrow(Y)*Vz
  A<-t(Y)%*%Zb%*%solve(Sb)
  Psi<-diag(diag( t(Y)%*%Y - 2*t(Y)%*%Zb%*%t(A) + A%*%Sb%*%t(A) ))/nrow(Y)
  list(A=A,Psi=Psi)
}
Here is the −2 log likelihood:

#### ---- -2 log likelihood
fana_m2ll<-function(Y,APsi)
{
  A<-APsi$A ; Psi<-APsi$Psi
  Sigma<- tcrossprod(A) + Psi
  nrow(Y)*log(det(Sigma)) + sum(diag(crossprod(Y)%*%solve(Sigma)))
}
Note that these functions were written for transparency and not computational efficiency.
Putting these things together, here is a function to find an MLE (recall A is only identified up to right rotations):

fana_mle<-function(Y,q,tol=1e-8)
{
  ## ---- sweep out mean
  mu<-apply(Y,2,mean)
  Y<-sweep(Y,2,mu,"-")

  ## ---- if q=0
  if(q==0)
  {
    A<-matrix(0,nrow=ncol(Y),ncol=0)
    Psi<-diag(apply(Y,2,var))
    APsi<-list(A=A,Psi=Psi)
    M2LL<-fana_m2ll(Y,APsi)
  }

  if(q>0)
  {
    ## ---- starting values
    s<-apply(Y,2,sd)
    R<-cor(Y)
    tmp<-R ; diag(tmp)<-0 ; h<-apply(abs(tmp),1,max)
    Psi<-diag(1-h,nrow=ncol(Y))
    for(j in 1:2)
    {
      eX<-svd(R-Psi,nu=q,nv=0)
      A<-eX$u[,1:q,drop=FALSE]%*%sqrt(diag(eX$d[1:q],nrow=q))
      Psi<-diag( pmax( diag(R-tcrossprod(A)), 1e-3) )
    }
    A<-sweep(A,1,s,"*")
    diag(Psi)<-diag(Psi)*s^2
    APsi<-list(A=A,Psi=Psi)

    ## ---- EM algorithm
    M2LL<-c(Inf,fana_m2ll(Y,APsi))
    while( diff(rev(tail(M2LL,2)))/abs(tail(M2LL,1)) > tol )
    {
      APsi<-fana_em(Y,APsi)
      M2LL<-c(M2LL,fana_m2ll(Y,APsi))
    }
  }

  ## ---- output
  list(mu=mu, A=APsi$A, Psi=APsi$Psi, M2LL=M2LL,
       Sigma=tcrossprod(APsi$A)+APsi$Psi,
       npq=c(nrow(Y),ncol(Y),ncol(A)))
}
Simulation example: Let's try it out on a challenging dataset where n < p.
n<-50 ; p<-100 ; q<-4
## ---- parameters
A0<-matrix(rexp(p*q),p,q)/4
Psi0<-diag(rexp(p))
## ---- data
Z0<-matrix(rnorm(n*q),n,q) ; E0<-matrix(rnorm(n*p),n,p)
Y<- Z0%*%t(A0) + E0%*%sqrt(Psi0)
## ---- fit
fit_em<-fana_mle(Y,4)
The EM algorithm moves very fast at first, then slows considerably:
[Figure: fit_em$M2LL (the −2 log likelihood) plotted against iteration number, with a zoomed-in view of tail(fit_em$M2LL,50), the last 50 iterations.]
Now let's compare the true variances and correlations to their estimates, using the q = 4 factor model and the unrestricted MLE (the sample covariance matrix).
[Figure: true variances versus fitted variances, and true correlations versus fitted correlations, for the unrestricted MLE and the q = 4 factor model.]
How does R's built-in function fare?

fit_R<-factanal(Y,4)

Error in solve.default(cv): system is computationally singular: reciprocal condition number = 1.8377e-20
Selecting the number of factors: Often we don't know how many factors there are. Statistical assessment of the number of factors can be done with hypothesis tests or other model selection criteria. A popular model selection criterion is the "Bayesian information criterion" (BIC) [Schwarz, 1978], which is an approximation to the Bayes factor:

    −2 log p(y | q) = −2 log ∫ p(y | θ) p(θ | q) dθ
                    ≈ −2 log( p(y | θ̂_q) (n/(2π))^{−k/2} )
                    = −2 log p(y | θ̂_q) + k log n − k log 2π
    BIC(q) = −2 log p(y | θ̂_q) + k log n,

where k is the number of parameters in the model. One value of q is preferred over another if its BIC is lower. For a discussion of this and other model selection methods for factor analysis, see Hirose et al. [2011].
Exercise 4. Compute the number of parameters k in the (p, q) factor model.
Here is some code to compute the BIC for a model fit returned by the EM-algorithm function:

#### ---- BIC for FANA
fana_bic<-function(fit)
{
  npar<- min( fit$npq[2]*(fit$npq[3]+1) - choose(fit$npq[3],2),
              choose(fit$npq[2]+1,2) )
  tail(fit$M2LL,1) + log(fit$npq[1])*npar
}
Notice the min in the code above. To get a sense of where that comes from, consider the case that p = 4 and q = 2. In this case, it would appear as if the number of parameters is 4 × 2 − 1 = 7 for the A matrix (subtracting choose(2,2) = 1 for the rotational non-identifiability) and 4 for the Ψ matrix, for a total of 11. But this is more than the number of free parameters (10) in an unconstrained 4 × 4 covariance matrix. The correct number of parameters is the maximal rank of the Jacobian matrix of the map (A, Ψ) ↦ AA^⊤ + Ψ [Drton et al., 2007].
Exercise 5. Consider the set of 4 × 4 covariance matrices that can be expressed as AA^⊤ + Ψ for A ∈ R^{4×2} and diagonal Ψ. Is this set equal to the set of all 4 × 4 covariance matrices?
WAIS data: The WAIS dataset consists of data from four subtests of the Wechsler Adult Intelligence Scale (WAIS) on 49 elderly individuals. Here are some univariate and bivariate descriptions of the data:
#### ---- WAIS dataset on elderly individuals
Y<-readRDS("../Data/wais.rds")[,-1]
dim(Y)
[1] 49 4
pairs(Y)
[Figure: pairs plot of the four WAIS subtests: information, similarities, arithmetic, and picture completion.]
var(Y)
information similarities arithmetic picture completion
information 13.78 12.26 9.16 5.63
similarities 12.26 16.63 9.61 5.03
arithmetic 9.16 9.61 13.02 4.38
picture completion 5.63 5.03 4.38 7.65
cor(Y)
information similarities arithmetic picture completion
information 1.000 0.810 0.684 0.548
similarities 0.810 1.000 0.653 0.445
arithmetic 0.684 0.653 1.000 0.439
picture completion 0.548 0.445 0.439 1.000
Do you think a factor analysis model would fit these data well? What should the number of factors be?
eigen(cor(Y))
eigen() decomposition
$values
[1] 2.813 0.631 0.378 0.179
$vectors
[,1] [,2] [,3] [,4]
[1,] -0.549 -0.124 0.3186 0.7626
[2,] -0.527 -0.315 0.4763 -0.6297
[3,] -0.498 -0.281 -0.8183 -0.0622
[4,] -0.416 0.898 -0.0449 -0.1344
fit0<-fana_mle(Y,0)
fit1<-fana_mle(Y,1)
fit2<-fana_mle(Y,2)
sapply( list(fit0$M2LL,fit1$M2LL,fit2$M2LL),function(x){tail(x,1)})
[1] 684 581 580
sapply( list(fit0,fit1,fit2), fana_bic)
[1] 699 612 619
So a one-factor random factor model is selected by BIC. Let’s examine the
parameter estimates for this model.
fit1$A
[,1]
information -3.45
similarities -3.48
arithmetic -2.64
picture completion -1.56
round(cor(Y),2)
information similarities arithmetic picture completion
information 1.00 0.81 0.68 0.55
similarities 0.81 1.00 0.65 0.45
arithmetic 0.68 0.65 1.00 0.44
picture completion 0.55 0.45 0.44 1.00
fit1$Psi
[,1] [,2] [,3] [,4]
[1,] 1.59 0.00 0.00 0.00
[2,] 0.00 4.21 0.00 0.00
[3,] 0.00 0.00 5.81 0.00
[4,] 0.00 0.00 0.00 5.07
tcrossprod(fit1$A) + fit1$Psi
information similarities arithmetic picture completion
information 13.50 11.99 9.09 5.37
similarities 11.99 16.29 9.16 5.41
arithmetic 9.09 9.16 12.76 4.10
picture completion 5.37 5.41 4.10 7.50
var(Y)
information similarities arithmetic picture completion
information 13.78 12.26 9.16 5.63
similarities 12.26 16.63 9.61 5.03
arithmetic 9.16 9.61 13.02 4.38
picture completion 5.63 5.03 4.38 7.65
[Figure: entries of the fitted covariance matrix (fit1$Sigma) plotted against the entries of the unrestricted MLE of Σ, for the q = 1 and q = 2 models.]
4 VARIMAX rotation

In terms of the model fit, we have

    −2 log p(Y : µ̂, Â, Ψ̂) = −2 log p(Y : µ̂, ÂR, Ψ̂)

for any rotation matrix R ∈ O_q. Therefore, we are not really estimating an A matrix; we are estimating the class

    A = {ÂR : R ∈ O_q},

or equivalently, we are estimating ÂÂ^⊤.

Exercise 6. Show that A_1A_1^⊤ = A_2A_2^⊤ if and only if A_1 = A_2R for some rotation matrix R.

For this reason, some have argued against interpreting the coefficients of Â beyond ÂÂ^⊤.
For the same reason, others have argued in favor of selecting an element of

    A = {ÂR : R ∈ O_q}

that most conforms to one's theories. Many theories are available, one of which is that observed variables are associated primarily with only a subset of the latent factors, and hence many coefficients of A should be zero.
An alternative way to think of this is that, within each column (a_{1,k}, ..., a_{p,k}) of A, the variance of the squared loadings should be high: a few variables should have high loadings on factor k, and the others should have small loadings. The VARIMAX rotation finds the rotation R that maximizes the sum of the within-column variances of the squared elements:

    R_VMAX = arg max_{R ∈ O_q} Σ_{k=1}^q Var[(AR)_{[k]} ∘ (AR)_{[k]}],

where (AR)_{[k]} is the kth column of AR and ∘ denotes the elementwise product.
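This criterion is what R's stats::varimax optimizes directly. A minimal sketch of applying it to an estimated loading matrix (the wrapper name vmax is mine; with normalize=FALSE the returned rotmat satisfies loadings = A %*% rotmat):

vmax<-function(A)
{
  ## return the VARIMAX-rotated version of the loading matrix A
  R<-varimax(A,normalize=FALSE)$rotmat
  A%*%R
}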
The VARIMAX rotation is implemented by default in R. Let's see how it works in the Swiss heads example. First let's select the number of factors, and then fit the model using both the EM algorithm and the built-in R function.
heads<-readRDS("../Data/heads.rds")
Y<-rbind(heads$males,heads$females)
pairs(Y,col=1+rep(c(1,2),times=c(200,59)))
[Figure: pairs plot of the six head measurements MFB, BAM, TFH, LGAN, LTN, and LTG, with males and females in different colors.]
fana_bic(fana_mle(Y,0))
[1] 7068
fana_bic(fana_mle(Y,1))
[1] 6754
fana_bic(fana_mle(Y,2))
[1] 6745
fana_bic(fana_mle(Y,3))
[1] 6749
fit1<-fana_mle(Y,1)
fit2<-fana_mle(Y,2)
fit1R<-factanal(Y,1)
fit2R<-factanal(Y,2)
First let's compare the output from the two functions in the q = 1 model (VARIMAX is not relevant in this case):
fit1$A
[,1]
MFB -4.81
BAM -1.94
TFH -3.70
LGAN -1.67
LTN -3.27
LTG -5.75
fit1R$loadings
Loadings:
Factor1
MFB 0.615
BAM 0.360
TFH 0.528
LGAN 0.377
LTN 0.737
LTG 0.844
Factor1
SS loadings 2.187
Proportion Var 0.364
fit1R$loadings*apply(Y,2,sd)
Loadings:
Factor1
MFB 4.82
BAM 1.94
TFH 3.70
LGAN 1.67
LTN 3.28
LTG 5.77
Factor1
SS loadings 87.5
Proportion Var 14.6
The R function scales the variables to have unit variance before fitting the
model for some reason.
Now let's compare the parameters for the q = 2 model:
A<-fit2$A
psi<-diag(fit2$Psi)
AR<-sweep(fit2R$loadings,1,apply(Y,2,sd),"*")
psiR<-fit2R$uniquenesses*apply(Y,2,var)
First check Ψ:
psi
[1] 37.838 24.276 0.837 15.459 9.027 11.937
psiR
MFB BAM TFH LGAN LTN LTG
37.996 24.383 0.245 15.545 9.066 11.959
Now check AA^⊤:
A%*%t(A)
MFB BAM TFH LGAN LTN LTG
MFB 23.26 8.93 22.15 8.34 15.49 27.65
BAM 8.93 4.72 2.58 2.04 6.64 12.08
TFH 22.15 2.58 48.05 13.23 11.57 19.60
LGAN 8.34 2.04 13.23 4.03 4.93 8.59
LTN 15.49 6.64 11.57 4.93 10.69 19.20
LTG 27.65 12.08 19.60 8.59 19.20 34.54
AR%*%t(AR)
MFB BAM TFH LGAN LTN LTG
MFB 23.34 8.96 22.29 8.36 15.55 27.76
BAM 8.96 4.73 2.59 2.06 6.66 12.13
TFH 22.29 2.59 48.85 13.29 11.62 19.67
LGAN 8.36 2.06 13.29 4.02 4.95 8.63
LTN 15.55 6.66 11.62 4.95 10.73 19.28
LTG 27.76 12.13 19.67 8.63 19.28 34.70
Close enough for me. Now let's see what the rotated loadings look like:
A
[,1] [,2]
MFB 4.81 -0.405
BAM 1.75 -1.291
TFH 5.01 4.788
LGAN 1.81 0.870
LTN 3.15 -0.880
LTG 5.60 -1.773
AR
Loadings:
Factor1 Factor2
MFB 4.006 2.701
BAM 2.172
TFH 0.890 6.932
LGAN 0.870 1.806
LTN 3.011 1.290
LTG 5.491 2.132
Factor1 Factor2
SS loadings 61.5 64.8
Proportion Var 10.3 10.8
Cumulative Var 10.3 21.1
Here is a graphical comparison. The VARIMAX solution is in green.
[Figure: the loadings in each column of A plotted over the variables MFB, BAM, TFH, LGAN, LTN, and LTG, for the MLE and for the VARIMAX solution (green).]
Note that the print method for coefficients from factanal makes the A matrix look "sparse" when it really isn't. Beware of this additional pitfall of using the factanal command in R.
AR
Loadings:
Factor1 Factor2
MFB 4.006 2.701
BAM 2.172
TFH 0.890 6.932
LGAN 0.870 1.806
LTN 3.011 1.290
LTG 5.491 2.132
Factor1 Factor2
SS loadings 61.5 64.8
Proportion Var 10.3 10.8
Cumulative Var 10.3 21.1
AR[2,2]
[1] 0.0944
5 Recovery of latent factors
There are several ways to estimate the common factors, depending on what
assumptions one is willing to make. Imagine for the moment that A and Ψ
are known and we would like to infer the values of Z, perhaps as a tool to
cluster the rows. One perspective is then that Z can be viewed as a fixed
parameter, and we can estimate it using OLS, GLS, MLE etc. Recall that
the conditional model is

    Y ∼ N_{n×p}(ZA^⊤, Ψ ⊗ I).

Given values of A and Ψ, the ML/GLS estimator of Z is the minimizer of

    tr((Y − ZA^⊤)Ψ^{-1}(Y − ZA^⊤)^⊤) ≃ −2 tr(YΨ^{-1}AZ^⊤) + tr(ZA^⊤Ψ^{-1}AZ^⊤)
                                      = −2 tr(Z^⊤[YΨ^{-1}A]) + tr(Z^⊤Z[A^⊤Ψ^{-1}A]).
Exercise 7. Show that the minimizer in Z is Ẑ = [YΨ^{-1}A][A^⊤Ψ^{-1}A]^{-1}.
The optimizer Ẑ is sometimes called "Bartlett's factor score" matrix. However, note that in practice A and Ψ are not known, and so typically their MLEs are plugged in to obtain this pseudo-MLE of Z. Also, because A is not identifiable, neither is Z: if (Â, Ẑ) are an MLE of A and the corresponding Bartlett factor score matrix, then so are (ÂR, ẐR) for any R ∈ O_q.
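A minimal sketch of these plug-in Bartlett scores in R, in the style of the fana_scores function given at the end of this section (the function name is mine):

fana_scores_bartlett<-function(Y,fit)
{
  ## Bartlett (GLS/ML) factor scores, plugging in the MLEs from fana_mle
  Y<-sweep(Y,2,apply(Y,2,mean),"-")
  A<-fit$A ; iPsi<-diag(1/diag(fit$Psi))
  (Y%*%iPsi%*%A) %*% solve( t(A)%*%iPsi%*%A )
}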
An alternative perspective is that we are assuming the distribution of the latent factors is Z ∼ N_{n×q}(0, I ⊗ I). Therefore, if A and Ψ are known, it would be most appropriate to find the conditional distribution and conditional mode of Z given Y.

Exercise 8. Show that the conditional distribution of Z is

    Z | Y ∼ N_{n×q}([YΨ^{-1}A]V, V ⊗ I_n),

where V = [A^⊤Ψ^{-1}A + I_q]^{-1}. Hence the conditional mean/mode is

    Z̃ = [YΨ^{-1}A][A^⊤Ψ^{-1}A + I_q]^{-1}.

This estimator is sometimes called "Thompson's factor score" matrix. It is very similar to the MLE, except that the elements within each row get shrunk a bit towards zero due to the assumption of standard normal factors.
However, whatever the method, recall that Z is only estimable up to right
rotations, as A is only identifiable up to rotations under this normal random
factor model. To obtain identifiability, something else must be assumed about
Z or A, such as
• sparsity of A, or
• non-Gaussianity of Z.
The latter is what is assumed in a signal-recovery method known as "independent component analysis" (ICA).
fana_scores<-function(Y,fit)
{
  Y<-sweep(Y,2,apply(Y,2,mean),"-")
  A<-fit$A ; iPsi<-diag(1/diag(fit$Psi))
  (Y%*%iPsi%*%A) %*% solve( t(A)%*%iPsi%*%A + diag(ncol(A)) )
}
Z<-fana_scores(Y,fit2)
fit2R<-factanal(Y,2,scores="regression")
ZR<-fit2R$scores
[Figure: scatterplots of the estimated factor scores: Z[,1] versus Z[,2] from fana_scores (left), and Factor1 versus Factor2 from factanal's regression scores (right).]
6 Decorrelation and ICA
The following material might more appropriately be located with the material
on PCA.
6.1 Decorrelation
Consider for the moment the "low noise" model with p = q, so

    Y ≈ ZA^⊤,

where Z is a random matrix of mean-zero, variance-one uncorrelated random factors, i.e. "white noise." The mixing matrix A ∈ R^{p×p} induces correlation among the columns of Y:

    Y^⊤Y/n ≈ AZ^⊤ZA^⊤/n ≈ AA^⊤.

To make things easier, let's further assume that n is very large, so that this approximation is very good: so good that from now on we will write Y = ZA^⊤ and Y^⊤Y/n = AA^⊤.
Now suppose we'd like to try to recover Z from Y. How can we do this? Obviously, if we could estimate A^{-1} we'd be in business. However, in general you can't recover A from Y^⊤Y/n = AA^⊤ for all values of A, because of the rotational non-identifiability.
Decorrelation matrices: Consider the following intuitive approach: the columns of Y are correlated linear combinations of the uncorrelated columns of Z. Maybe if we decorrelate Y, the result will look like Z.

Definition 1. W ∈ R^{p×p} is a decorrelation matrix for Y if the (sample) column covariance of YW^⊤ is the identity matrix, that is, if

    W(Y^⊤Y/n)W^⊤ = I_p.

Note that such a W both decorrelates and standardizes the columns of Y.
So what matrices W do the job? Let A = VL^{1/2}R̃^⊤ be the SVD of A. Then

    Y^⊤Y/n = (VL^{1/2}R̃^⊤)(VL^{1/2}R̃^⊤)^⊤ = VLV^⊤.

So W is a decorrelation matrix if WVLV^⊤W^⊤ = I.

Exercise 9. Show that W is a decorrelation matrix if and only if W = RL^{-1/2}V^⊤ for some rotation matrix R ∈ O_p.
How can we obtain a decorrelation matrix W = RL^{-1/2}V^⊤?

• V and L can be obtained from the sample covariance matrix, since Y^⊤Y/n = VLV^⊤;
• Y^⊤Y/n does not provide guidance on how to choose R.

Now suppose we obtain a decorrelation Z̃ of Y using a decorrelation matrix W. Will Z̃ resemble Z?

    Z̃ = YW^⊤ = ZA^⊤W^⊤
      = ZR̃L^{1/2}V^⊤VL^{-1/2}R^⊤
      = ZR̃R^⊤.

So Z̃ is a column permutation ZP of Z if R̃ is a row permutation of R:

    Z̃ = ZP ⇔ R̃ = PR.
PCA decorrelation: Recall that the "principal components transformation" rotates the columns of Y to construct uncorrelated principal components F:

    F = YV
    F^⊤F/n = V^⊤(Y^⊤Y/n)V = L.

To standardize, multiply F by L^{-1/2} to give

    Z̃_PCA = FL^{-1/2} = YVL^{-1/2} = YW_PCA^⊤,

where W_PCA = L^{-1/2}V^⊤. In signal processing, decorrelating Y with W_PCA is called PCA whitening.
Exercise 10. Show that PCA whitening can recover Z up to permutations if A = P_1A_0P_2, where A_0 is a matrix with orthogonal columns and P_1 and P_2 are signed permutation matrices.
ZCA decorrelation: Another popular "default" decorrelation is ZCA whitening, obtained by multiplying Y by the decorrelation matrix W_ZCA = R_ZCA L^{-1/2}V^⊤, where

    R_ZCA = V, so
    W_ZCA = VL^{-1/2}V^⊤ = Σ̂^{-1/2},

with Σ̂ = Y^⊤Y/n. Recall that when we empirically standardize a scalar random variable we perform the operation z = (σ̂²)^{-1/2}y. ZCA whitening is the closest analogue in the multivariate case:

    z̃ = Σ̂^{-1/2}y
    Z̃ = YΣ̂^{-1/2}.

Sometimes this transformation is called the Mahalanobis transformation.

Exercise 11. Show that ZCA whitening can recover Z up to permutations if A = P_1A_0P_2, where A_0 is a symmetric matrix and P_1 and P_2 are signed permutation matrices.
Comments:

1. Letting Y = √n UDV^⊤ be the SVD of Y, the decorrelated variables resulting from ZCA whitening are Z̃_ZCA = √n UV^⊤, which is √n times the orthogonal factor of the polar decomposition of Y. Of course, Z̃_ZCA^⊤Z̃_ZCA = nI, as desired.
2. ZCA whitening can be shown to have certain optimality properties: if the columns of Y began with unit variance, then ||Y − Z̃||² is minimized among decorrelations Z̃ by Z̃_ZCA.
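Before the examples, here is a small helper (a sketch; the name whitener is mine) that constructs the two decorrelation matrices used repeatedly below:

whitener<-function(Y,type=c("ZCA","PCA"))
{
  ## W_PCA = L^{-1/2} V^T ; W_ZCA = V L^{-1/2} V^T
  type<-match.arg(type)
  eS<-eigen(t(Y)%*%Y/nrow(Y))
  V<-eS$vec ; L<-eS$val
  W<-diag(1/sqrt(L))%*%t(V)
  if(type=="ZCA"){ W<-V%*%W }
  W
}
## usage: Zpca<-Y%*%t(whitener(Y,"PCA")) ; Zzca<-Y%*%t(whitener(Y,"ZCA"))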
Numerical examples: Let's confirm what we have derived with two numerical examples, one where PCA recovers the factors and another where ZCA does.
## data dimensions
n<-1000 ; p<-3
Recovery via PCA:
## orthogonal columns
Ap<-eigen(tcrossprod(matrix(rnorm(p*p),p,p)))$vec %*% diag(rexp(p))
## permute
Ap<-Ap[c(3,1,2),c(1,3,2) ]
## generate data
Z<-matrix(rnorm(n*p),n,p)
Y<-Z%*%t(Ap)
## decorrelate
eS<-eigen(t(Y)%*%Y/n)
V<-eS$vec ; L<-eS$val
Zzca<- Y%*%t( V %*% diag(1/sqrt(L)) %*% t(V) )
Zpca<- Y%*%t( diag(1/sqrt(L)) %*% t(V) )
crossprod(Z,Zpca)/n
[,1] [,2] [,3]
[1,] -0.0158 0.02416 -0.9943554
[2,] 0.9567 -0.00341 -0.0000349
[3,] -0.0372 -1.00383 -0.0005494
crossprod(Z,Zzca)/n
[,1] [,2] [,3]
[1,] -0.324 0.137 -0.93053
[2,] 0.779 -0.453 -0.32147
[3,] 0.471 0.887 -0.00889
Recovery via ZCA:
## symmetric A plus noise
Az<-matrix( c( 0.85, 0.42, 0.26,
0.42, 0.85, 0.42,
0.26, 0.42, 1.00 ), 3,3)
## permute
Az<-Az[c(3,1,2),c(1,3,2) ]
## generate data
Z<-matrix(rnorm(n*p),n,p)
Y<-Z%*%t(Az)
## decorrelate
eS<-eigen(t(Y)%*%Y/n)
V<-eS$vec ; L<-eS$val
Zzca<- Y%*%t( V %*% diag(1/sqrt(L)) %*% t(V) )
Zpca<- Y%*%t( diag(1/sqrt(L)) %*% t(V) )
crossprod(Z,Zpca)/n
[,1] [,2] [,3]
[1,] -0.506 -0.646 -0.576
[2,] -0.588 0.776 -0.304
[3,] -0.608 -0.223 0.804
crossprod(Z,Zzca)/n
[,1] [,2] [,3]
[1,] -0.0112 1.00254 -0.0130
[2,] 1.0192 -0.01718 -0.0303
[3,] -0.0275 0.00105 1.0325
6.2 ICA
To summarize the last subsection: we can obtain good estimates of Z, for Gaussian and non-Gaussian factors alike, for certain types of matrices A.

However, there is another decorrelation method, called independent component analysis (ICA), designed specifically for deconvolving mixed signals, that can perform well for certain types of Z-matrices for all kinds of A-matrices. Specifically, ICA works when the source signals have non-Gaussian distributions.
The ICA decorrelation is given by W_ICA = R_ICA L^{-1/2}V^⊤, where

• V and L are obtained from the approximation Y^⊤Y/n = VLV^⊤;
• R_ICA is obtained from some other aspect of the data Y.

The idea behind ICA is similar to that which motivated factor recovery via PCA/ZCA:

PCA/ZCA: Z is uncorrelated, so Z̃ = YW^⊤ should be uncorrelated.
ICA: Z is non-Gaussian and uncorrelated, so Z̃ = YW^⊤ should be non-Gaussian and uncorrelated.
Measuring non-normality: To implement ICA, one first selects a measure of non-normality. Standard implementations measure deviations from normality by deviations of functional sample moments from their expectations under normality. For example, we might measure (non-)Gaussianity with a function G given by

    G(Z̃) = Σ_j ( Σ_i g(z̃_{i,j})/n − E[g(z)] )²,

where z ∼ N(0, 1). Standard choices of g include

• g(z) = z⁴ (kurtosis);
• g(z) = (log cosh(az))/a;
• g(z) = −e^{−z²/2}.

The first is simple but not robust to outliers. The second and third are approximations to a measure of entropy.
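A sketch of such a measure in R, using g(z) = log(cosh(az))/a and a Monte Carlo approximation to E[g(z)] (the function name G_nonnorm and the default a = 1 are my choices):

G_nonnorm<-function(Z,a=1)
{
  ## sum over columns of squared deviations of mean g from its N(0,1) expectation
  g<-function(z){ log(cosh(a*z))/a }
  Eg<-mean(g(rnorm(1e6)))     # E[g(z)] for z ~ N(0,1), by Monte Carlo
  sum( (apply(g(Z),2,mean)-Eg)^2 )
}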
Once a non-normality measure has been selected, the ICA solution Z̃_ICA can be obtained as follows:

1. Whiten: compute Z̃_PCA.
2. Optimize: find R_ICA = arg max_{R ∈ O_p} G(Z̃_PCA R^⊤).
3. Transform: let Z̃_ICA = Z̃_PCA R_ICA^⊤.

Note that this algorithm is the same as finding the decorrelation matrix W = RL^{-1/2}V^⊤ that maximizes G(YW^⊤) over orthogonal matrices R.
Why does ICA work? Recall that any decorrelation W = RL^{-1/2}V^⊤ produces estimated latent factors

    Z̃ = ZR̃R^⊤

for R ∈ O_p. Recovery occurs when R̃R^⊤ is a signed permutation matrix. Now if the columns of Z are non-Gaussian, then the columns of Z̃ will look equally non-Gaussian for the right choice of R, because in this case each column of Z̃ is one of the columns of Z (up to sign changes). On the other hand, if R is chosen poorly, then each column of Z̃ is a linear combination of the columns of Z, and by the central limit theorem will therefore look more Gaussian than the original columns.
This might sound surprising, since we think of the CLT as an asymptotic result, and here we are applying it in a situation where the number of things we are taking linear combinations of can be quite small (2 or 3). However, it is generally true that if Z1 and Z2 are independent non-normal random variables, then Z1 + Z2 will be more normally distributed than either one of them individually. Here is a numerical illustration:
Z<-matrix(rexp(n*p),n,p) ; Z<-sweep(Z,2,apply(Z,2,mean),"-")
A<-matrix( c(.65,-.45,.45,-.45,.65,-.45,.45,-.45,.65), 3,3)
A
[,1] [,2] [,3]
[1,] 0.65 -0.45 0.45
[2,] -0.45 0.65 -0.45
[3,] 0.45 -0.45 0.65
Y<-A%*%t(Z)
[Figure: histograms of the entries of Z (left) and of the mixed data Y (right); the entries of Y look more Gaussian.]
Exercise 12. Will ICA work if Z is a matrix of i.i.d. standard normal random variables? Formally explain why or why not.
Parametric ICA: There are many variations on ICA, and it might be more appropriate to refer to it as an objective rather than a single model or algorithm. One important variant is what I would call "parametric" ICA, where some additional structure on the components is assumed (sparsity, a parametric model, temporal dependence, etc.).
Numerical examples: First let’s generate some non-normal factors:
## non-normal factors
Z<-matrix(rexp(n*p),n,p)
Z<-sweep(Z,2,apply(Z,2,mean),"-")
Z<-sweep(Z,2,apply(Z,2,sd),"/")
Recovery via PCA and ICA:
Y<-Z%*%t(Ap)
## decorrelate
eS<-eigen(t(Y)%*%Y/n)
V<-eS$vec ; L<-eS$val
Zzca<- Y%*%t( V %*% diag(1/sqrt(L)) %*% t(V) )
Zpca<- Y%*%t( diag(1/sqrt(L)) %*% t(V) )
## ICA
Zica<-fastICA::fastICA(Y,p)$S
crossprod(Z,Zpca)/n
[,1] [,2] [,3]
[1,] 0.0240 0.02814 -0.9988160
[2,] 0.9995 -0.00369 0.0000458
[3,] -0.0423 -0.99860 -0.0006497
crossprod(Z,Zzca)/n
[,1] [,2] [,3]
[1,] -0.296 0.116 -0.94747
[2,] 0.813 -0.473 -0.33746
[3,] 0.465 0.885 -0.00685
crossprod(Z,Zica)/n
[,1] [,2] [,3]
[1,] -0.0310 0.9990 -0.000748
[2,] 0.0383 0.0243 -0.998469
[3,] -0.9977 -0.0595 -0.001124
Recovery via ZCA and ICA:
Y<-Z%*%t(Az)
## decorrelate
eS<-eigen(t(Y)%*%Y/n)
V<-eS$vec ; L<-eS$val
Zzca<- Y%*%t( V %*% diag(1/sqrt(L)) %*% t(V) )
Zpca<- Y%*%t( diag(1/sqrt(L)) %*% t(V) )
## ICA
Zica<-fastICA::fastICA(Y,p)$S
crossprod(Z,Zpca)/n
[,1] [,2] [,3]
[1,] -0.526 -0.638 -0.561
[2,] -0.606 0.733 -0.308
[3,] -0.570 -0.188 0.799
crossprod(Z,Zzca)/n
[,1] [,2] [,3]
[1,] 0.0144 0.99935 -0.0097
[2,] 0.9994 0.00929 -0.0129
[3,] -0.0255 -0.01842 0.9990
crossprod(Z,Zica)/n
[,1] [,2] [,3]
[1,] -0.0308 0.00372 -0.9990
[2,] 0.0383 -0.99835 -0.0287
[3,] -0.9977 -0.00132 0.0592
References

Jian-Feng Cai, Emmanuel J. Candes, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM J. Optim., 20(4):1956–1982, 2010. doi: 10.1137/080738970.

David L. Donoho and Matan Gavish. The optimal hard threshold for singular values is 4/√3. arXiv preprint arXiv:1305.5870, 2013.

Mathias Drton, Bernd Sturmfels, and Seth Sullivant. Algebraic factor analysis: tetrads, pentads and beyond. Probab. Theory Related Fields, 138(3-4):463–493, 2007. doi: 10.1007/s00440-006-0033-2.

D. Gerard and P. D. Hoff. Adaptive higher-order spectral estimators. Technical Report 633, Department of Statistics, University of Washington, 2015.

Wolfgang Karl Hardle and Leopold Simar. Applied Multivariate Statistical Analysis. Springer, Heidelberg, fourth edition, 2015. doi: 10.1007/978-3-662-45171-7.

Kei Hirose, Shuichi Kawano, Sadanori Konishi, and Masanori Ichikawa. Bayesian information criterion and selection of the number of factors in factor analysis models. Journal of Data Science, 9(2):243–259, 2011.

P. D. Hoff. Model averaging and dimension selection for the singular value decomposition. J. Amer. Statist. Assoc., 102(478):674–685, 2007.

Alan Julian Izenman. Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. Springer Texts in Statistics. Springer, New York, 2008. doi: 10.1007/978-0-387-78189-1.

Julie Josse and Sylvain Sardy. Reduced rank matrix estimation by adaptive trace norm regularization, 2013. arXiv:1310.6602, http://arxiv.org/abs/1310.6602.

Kantilal Varichand Mardia, John T. Kent, and John M. Bibby. Multivariate Analysis. Academic Press, London, 1979.

Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res., 11:2287–2322, 2010.

Gideon Schwarz. Estimating the dimension of a model. Ann. Statist., 6(2):461–464, 1978.