Factor Analysis
Peter Hoff
October 9, 2020
Contents

1 Non-random factor models
2 Random factor models
3 Gaussian factor analysis
  3.1 MLEs for the Gaussian FA
  3.2 Numerical examples
4 VARIMAX rotation
5 Recovery of latent factors
6 Decorrelation and ICA
  6.1 Decorrelation
  6.2 ICA
Abstract

Extending the partial isotropy model to the case of heteroscedastic measurement error yields the factor analysis model. We consider versions of factor analysis based on matrix decomposition methods, method of moments, and likelihood estimation with normality assumptions. References for some of this material are Chapter 10 of Hardle and Simar [2015], Chapter 15 of Izenman [2008], and Chapter 9 of Mardia et al. [1979].
1 Non-random factor models
Suppose we have a data matrix Y whose ith row y_i is a p-variate measurement of a signal vector m_i plus mean-zero measurement error with diagonal covariance Ψ. Then

    y_i = m_i + Ψ^{1/2} e_i
    E[y_i | m_i] = m_i
    Var[y_i | m_i] = Ψ,
where E[e_i] = 0 and Var[e_i] = I_p. Assuming the measurement errors across the rows are uncorrelated, we have

    Y = M + EΨ^{1/2}
    E[Y | M] = M
    Var[Y | M] = Ψ ⊗ I_n,

where Var[E] = I_p ⊗ I_n and M has rows m_1, ..., m_n.
Statistical inference for M and Ψ is challenging if nothing else is assumed about M or Ψ. For example, while the OLS estimator M̂ = Y is unbiased for M, its precision depends on the unknown matrix Ψ, which can't be well estimated unless we make assumptions about M: the residual sum of squares for the OLS estimator is zero. Assuming normality of E does not help; in this case the MLE of M is Y and the MLE of Ψ is 0, which is generally unreasonable.
However, in many applications it is reasonable to assume that the heterogeneity in the rows m_1, ..., m_n ∈ R^p of M is due to heterogeneity in some unobserved, lower-dimensional row-specific factors z_1, ..., z_n ∈ R^q with q < p; that is, there is some function that maps each unobserved latent factor z_i to the signal m_i. If this map is linear, we have

    m_i = µ + A z_i
    M = 1µ^⊤ + ZA^⊤.
Plugging this into the model for Y gives

    Y = 1µ^⊤ + ZA^⊤ + EΨ^{1/2}
    E[Y | Z] = 1µ^⊤ + ZA^⊤
    Var[Y | Z] = Ψ ⊗ I_n.
This model is sometimes referred to as a q-dimensional linear factor model.
The rows of Z are referred to as factors and the matrix A is referred to as
the factor loading matrix.
Note that in this model, the mean matrix M is exactly a column-wise location shift of a rank-q matrix. As discussed in the notes on eigendecompositions, for a given value of q an ordinary least squares (OLS) estimator of M = 1µ^⊤ + ZA^⊤ is given by

• µ̂ = ȳ = Y^⊤1/n
• ẐÂ^⊤ = U_q D_q V_q^⊤,

where UDV^⊤ is the SVD of the column-centered matrix CY. A couple of comments on such an estimate:
1. The estimate of M is (typically) unique.
2. One OLS estimator of (Z, A) is (Ẑ, Â) = (U_q D_q, V_q), which we previously called the first q principal component scores and axes, respectively.
3. If (Ẑ, Â) is an OLS estimator of (Z, A), then so is (ẐG, Â(G^{-1})^⊤) for any invertible G ∈ R^{q×q}.
4. If Ẑ is an OLS estimator of Z, then the column means of Ẑ are all zero.
Item 2 indicates why the expressions "PCA" and "factor analysis" are sometimes used interchangeably. However, note that the OLS estimator presented here essentially ignores the possibility of heteroscedastic errors, that is, that Ψ might not be proportional to the identity. If we used a GLS or weighted least-squares estimator that allowed for heteroscedasticity (possibly using feasible GLS), the estimates would not be the same as the best q-dimensional affine approximation obtained from PCA.
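As a concrete illustration, here is a minimal R sketch of this OLS/PCA estimator (the function name ols_factor and its exact form are my own; the rank q is taken as given):

ols_factor<-function(Y,q)
{
  ## OLS estimate of M = 1 mu^T + Z A^T via the SVD of the centered data
  mu<-apply(Y,2,mean)
  Yc<-sweep(Y,2,mu,"-")                    # the centered matrix CY
  sY<-svd(Yc,nu=q,nv=q)
  Z<-sY$u%*%diag(sY$d[1:q],nrow=q)         # first q principal component scores
  A<-sY$v                                  # first q principal axes
  list(mu=mu,Z=Z,A=A,M=tcrossprod(rep(1,nrow(Y)),mu)+Z%*%t(A))
}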
Additionally, note that

    {ZA^⊤ : Z ∈ R^{n×q}, A ∈ R^{p×q}} = {M ∈ R^{n×p} : rank(M) ≤ q},

and so estimation of ZA^⊤ is in the category of the "low-rank matrix estimation" problem, which has a large literature across many disciplines. A particular area of recent research has been low-rank matrix estimation when the rank q is unknown. The strategy here is typically to threshold or shrink the singular values of Y or CY. Some references include the following:
• Singular value shrinkage: Overloading notation, let Y = UDV^⊤ be the SVD of Y. The estimate of M is taken to be M̂ = U f(D) V^⊤, where f shrinks and/or thresholds the singular values in D. Some references include Cai et al. [2010], Mazumder et al. [2010], Josse and Sardy [2013], and Donoho and Gavish [2013]. Gerard and Hoff [2015] consider the tensor case.
• Bayesian rank selection and model averaging: A more computationally demanding procedure is to put a prior on U, D and V that allows zeros in the diagonal elements of D. Such an approach is detailed in Hoff [2007].
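To make the shrinkage idea concrete, here is a minimal sketch of singular value soft-thresholding in the spirit of Cai et al. [2010] and Mazumder et al. [2010]; the threshold lambda is a tuning parameter whose choice is not addressed here:

svt<-function(Y,lambda)
{
  ## estimate the (centered) mean matrix by soft-thresholding singular values
  Yc<-sweep(Y,2,apply(Y,2,mean),"-")   # CY
  sY<-svd(Yc)
  d<-pmax(sY$d-lambda,0)               # f(D): shrink and threshold
  sY$u%*%diag(d)%*%t(sY$v)
}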
Finally, some foreshadowing: regarding items 3 and 4 above, suppose we take (Ẑ, Â) = (√n U_q, V_q D_q/√n). Then

    Ẑ^⊤Ẑ/n = n U_q^⊤U_q/n = I_q
    Ẑ^⊤1/n = 0.

In other words, the columns of Ẑ are mean-zero and uncorrelated. This means that our estimated latent factor representation of Y has the form

    Y ≈ 1µ̂^⊤ + ẐÂ^⊤,

where the rows ẑ_1, ..., ẑ_n of Ẑ have sample mean vector zero and sample variance equal to the identity. We could instead take an OLS estimator of Z with a different sample variance matrix, but the fit would not change.
2 Random factor models
If the rows of Y represent a random sample of n objects from a population,
then the rows of Z are a random sample as well, in which case we are more
likely to be interested in the population mean and variance of Y, on average
across different random samples of objects, or equivalently, on average across
different random samples of the rows of Z. The factor model in this case is
referred to as a random factor model, since the factor matrix Z is thought of
as a random sample from some population or process.
Random factor model: The q-dimensional random factor model for an n × p matrix Y is given by

    Y = 1µ^⊤ + ZA^⊤ + EΨ^{1/2},

where

• µ ∈ R^p, A ∈ R^{p×q}, and Ψ = diag(ψ_1, ..., ψ_p) are unknown parameters;
• E[Z] = 0, Var[Z] = I_q ⊗ I_n;
• E[E] = 0, Var[E] = I_p ⊗ I_n;
• Z and E are uncorrelated.
Terminology:

• The matrix A is called the factor loading matrix;
• The rows of Z are called the common factors;
• The rows of E are called the specific factors or unique factors.

The elements of a common factor z_i are shared among the elements of y_i; those of a specific factor e_i are not.
Under this model,

    Var[y_i] = AA^⊤ + Ψ
    Var[y_{i,j}] = Σ_{k=1}^q a_{j,k}² + ψ_j.

• Σ_{k=1}^q a_{j,k}² is the communality of the jth variable;
• ψ_j is the jth specific variance or unique variance.

Note that if Ψ = σ²I for some σ² > 0, then this is the partial isotropy model.
Moment conditions on Z: The random factor model above specifies that the mean and variance of Z are 0 and I_q ⊗ I_n, respectively. There are other specifications in the literature, but this one is pretty common. Why do we make these moment assumptions about Z? The short answer is that there is no reason not to. We could allow different first and second moments for Z, but doing so won't change the range of first and second moments for Y that are possible. An alternative way of saying this is that the first two moments of Z are not separately identifiable from the other parameters, µ and A.
To see this, suppose E[Z] = 1θ^⊤ and Var[Z] = Φ ⊗ I_n for some θ ∈ R^q and Φ ∈ S_q^+, that is, the rows of Z have a common expectation and variance and are uncorrelated. This gives

    E[Y] = 1µ^⊤ + 1θ^⊤A^⊤ = 1(µ + Aθ)^⊤
    Var[Y] = (AΦA^⊤) ⊗ I_n + (Ψ ⊗ I_n) = (AΦA^⊤ + Ψ) ⊗ I_n.
Now construct

    Z̃ = (Z − 1θ^⊤)Φ^{-1/2}, so that
    Z = 1θ^⊤ + Z̃Φ^{1/2}.

Note that E[Z̃] = 0 and Var[Z̃] = I_q ⊗ I_n. Now Y can be expressed either as a function of Z or of Z̃:

    Y = 1µ^⊤ + ZA^⊤ + EΨ^{1/2}
      = 1(µ + Aθ)^⊤ + Z̃Φ^{1/2}A^⊤ + EΨ^{1/2}
      ≡ 1µ̃^⊤ + Z̃Ã^⊤ + EΨ^{1/2},

where µ̃ = µ + Aθ and Ã = AΦ^{1/2}.
So note that

• the first and second moments of Z and Z̃ are different;
• re-expressing Y in terms of Z̃ doesn't change its moments.

Therefore, we won't be able to tell the difference between Z and Z̃, that is, the difference between

    (µ, A, θ, Φ) and (µ̃, Ã, 0, I) = (µ + Aθ, AΦ^{1/2}, 0, I),

unless

• we have prior information about the values of the parameters, or
• we use more information from Y than just first and second moments.
More precisely, we have the following:

Theorem 1. For every (µ, A, θ, Φ) there exist µ̃ ∈ R^p and Ã ∈ R^{p×q} such that

    E[Y | µ, A, θ, Φ, Ψ] = E[Y | µ̃, Ã, 0, I, Ψ]
    Var[Y | µ, A, θ, Φ, Ψ] = Var[Y | µ̃, Ã, 0, I, Ψ].

Thus any statistical method based on just the first and second moments of Y will not be able to distinguish the moments of Z from µ and A. Therefore, moment-based approaches usually set the moments of Z to some convenient value (like 0 and I ⊗ I), and recognize that any aspects of Z that we may estimate are only known up to affine transformations.
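Here is a quick numerical check of Theorem 1 (a sketch; it uses the Cholesky factorization to get one matrix square root Φ^{1/2}, though any square root works):

## check that (mu, A, theta, Phi) and (mu + A theta, A Phi^{1/2}, 0, I)
## give the same first two moments
p<-4 ; q<-2
A<-matrix(rnorm(p*q),p,q) ; theta<-rnorm(q) ; mu<-rnorm(p)
Phi<-crossprod(matrix(rnorm(q*q),q,q))+diag(q)   # a q x q covariance matrix
At<-A%*%t(chol(Phi))                             # A Phi^{1/2}
mut<-mu+A%*%theta                                # the new mean parameter
range( A%*%Phi%*%t(A) - tcrossprod(At) )         # ~ 1e-15: variances agree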
More non-identifiability: Even after making some restrictions on the moments of Z, some model non-identifiability remains. Under the above-described factor model, the first and second moments of Y are

    E[Y] = 1µ^⊤
    Var[Y] = (AA^⊤ + Ψ) ⊗ I_n.

Now recall that

    AA^⊤ + Ψ = (AO)(AO)^⊤ + Ψ

for any matrix O ∈ O_q, for which OO^⊤ = O^⊤O = I_q. Therefore

    E[Y | µ, A, Ψ] = E[Y | µ, AO, Ψ]
    Var[Y | µ, A, Ψ] = Var[Y | µ, AO, Ψ],

and so the mapping from the parameter space to the moment space is not one-to-one.
Some remedies include the following:

• Use a reduced parameterization, such as AA^⊤ = UD²U^⊤.
• Impose an identifiability constraint, such as A^⊤ΨA being diagonal.
• Do nothing, except remember that A and AO can't be distinguished without additional information or assumptions.
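A numerical illustration of this rotational non-identifiability (a sketch, using a random orthogonal O ∈ O_q obtained from a QR decomposition):

## A and AO imply the same covariance matrix AA^T + Psi
p<-5 ; q<-2
A<-matrix(rnorm(p*q),p,q) ; Psi<-diag(rexp(p))
O<-qr.Q(qr(matrix(rnorm(q*q),q,q)))    # a random q x q orthogonal matrix
range( (tcrossprod(A)+Psi) - (tcrossprod(A%*%O)+Psi) )   # ~ 1e-15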
Scale invariance: Let Y follow a q-dimensional factor model. Now construct Ỹ = YD, where D is a diagonal matrix. Then

    E[Ỹ] = 1(Dµ)^⊤
    Var[Ỹ] = [(DA)(DA)^⊤ + DΨD] ⊗ I_n,

and so Ỹ also follows a q-dimensional factor model. So we say that the factor model is closed under rescalings of the variables, or scale invariant. Note that the spiked covariance model does not have this invariance.
3 Gaussian factor analysis
Now let’s add the most basic of distributional assumptions, that the common
and specific factors are normally distributed:
Gaussian factor model (factor representation): The q-factor model for a random n × p Gaussian matrix Y is given by

    Y = 1µ^⊤ + ZA^⊤ + EΨ^{1/2},

where

• µ ∈ R^p, A ∈ R^{p×q}, Ψ = diag(ψ_1, ..., ψ_p);
• Z ∼ N_{n×q}(0, I ⊗ I);
• E ∼ N_{n×p}(0, I ⊗ I);
• Z and E are independent.
Note that we have retained the same identifiability restrictions on the mean and variance of Z. This is because

• the mean and variance of Z are not separately identifiable from µ and A in terms of the first and second moments of Y;
• the first and second moments of a Gaussian matrix determine its distribution.

Therefore, we set the mean and variance of Z to some reference values.
Gaussian factor model (covariance representation): As we just noted, the distribution of a Gaussian matrix Y is entirely determined by its first and second moments. What are these moments? First,

    E[Y | µ, A, Ψ] = 1µ^⊤.
For practice, let's compute the variance of Y, or equivalently, of y = vec(Y):

    y = (µ ⊗ I)1 + (A ⊗ I)z + (Ψ^{1/2} ⊗ I)e
    y − E[y] = (A ⊗ I)z + (Ψ^{1/2} ⊗ I)e
    Cov[y] = E[(y − E[y])(y − E[y])^⊤]
           = E[(A ⊗ I)zz^⊤(A ⊗ I)^⊤ + (Ψ^{1/2} ⊗ I)ee^⊤(Ψ^{1/2} ⊗ I)^⊤ + cross terms]
           = AA^⊤ ⊗ I + Ψ ⊗ I
           = (AA^⊤ + Ψ) ⊗ I.

Therefore, the Gaussian factor model is 100% completely equivalent to the following multivariate normal model:

    Y ∼ N_{n×p}(1µ^⊤, (AA^⊤ + Ψ) ⊗ I).
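A quick Monte Carlo sanity check of this covariance representation (a sketch; with a large n the sample covariance of Y should be close to AA^⊤ + Ψ):

## simulate from the factor representation and compare cov(Y) to AA^T + Psi
n<-100000 ; p<-4 ; q<-2
A0<-matrix(rnorm(p*q),p,q) ; Psi0<-diag(rexp(p))
Z0<-matrix(rnorm(n*q),n,q) ; E0<-matrix(rnorm(n*p),n,p)
Y0<-Z0%*%t(A0) + E0%*%sqrt(Psi0)
range( cov(Y0) - (tcrossprod(A0)+Psi0) )   # small, of order 1/sqrt(n)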
3.1 MLEs for the Gaussian FA
We now show how to obtain the MLEs for (µ, A, Ψ). The MLE for µ is easy to obtain, but getting the MLEs for A and Ψ will require an iterative algorithm.

MLE of µ: The −2 log likelihood (modulo constants) is

    d(µ, A, Σ) = n log |Σ| + tr((Y − 1µ^⊤)Σ^{-1}(Y − 1µ^⊤)^⊤).

For any Σ, including Σ = AA^⊤ + Ψ, the minimizer in µ is the value that minimizes the trace term

    tr((Y − 1µ^⊤)Σ^{-1}(Y − 1µ^⊤)^⊤) = tr(YΣ^{-1}Y^⊤) − 2 tr(YΣ^{-1}µ1^⊤) + tr(1µ^⊤Σ^{-1}µ1^⊤).

As a function of µ, this is

    −2n tr(ȳ^⊤Σ^{-1}µ) + n tr(µ^⊤Σ^{-1}µ).

You can take some derivatives or complete the square to show that this is minimized by µ̂ = ȳ. Note that this holds for any Σ, so whatever Σ̂ optimizes the likelihood, the MLE for (µ, Σ) will be (ȳ, Σ̂).
MLE of A, Ψ: Additionally, this means that the MLE for (A, Ψ) may be found by subtracting off the column means from Y and using this centered matrix to find the MLE under a zero-mean factor model. Specifically, let Y now be the column-centered data matrix. The MLE for (A, Ψ) is the MLE under the model that assumes Y ∼ N_{n×p}(0, (AA^⊤ + Ψ) ⊗ I_n). We can approximate the MLE using the EM algorithm as follows. Recall that the density for Y under this model can be written as

    p(Y | AA^⊤ + Ψ) = ∫ p(Y | Z, A, Ψ) p(dZ),

where the conditional density of Y given Z is that of the N(ZA^⊤, Ψ ⊗ I_n) distribution. This problem can be expressed in generic form as follows:

    max_θ p(y | θ) = max_θ ∫ p(y | θ, z) p(dz).
One approach to finding the maximizer is with the EM algorithm, which proceeds iteratively as follows:

• E-step: compute l(θ : θ^{(s)}) = E[log p(y | z, θ) | y, θ^{(s)}];
• M-step: let θ^{(s+1)} = arg max_θ l(θ : θ^{(s)}).

Actually, the EM algorithm is more general than this: it can handle models where there are parameters for the missing/latent data z.
Let's derive the algorithm for the case of our Gaussian factor model. Instead of maximizing the log likelihood, we will minimize the −2 log likelihood, or what I will somewhat incorrectly call the deviance. Recall that the conditional deviance is

    d(A, Ψ : Z) = n log |Ψ| + tr((Y − ZA^⊤)Ψ^{-1}(Y − ZA^⊤)^⊤)
                = n log |Ψ| + tr(YΨ^{-1}Y^⊤) − 2 tr(A^⊤Ψ^{-1}Y^⊤Z) + tr(A^⊤Ψ^{-1}AZ^⊤Z).
The EM algorithm proceeds by minimizing in (A, Ψ) the expected value of this quantity, where the expectation is over Z with respect to its conditional distribution given Y and the current values (A^{(s)}, Ψ^{(s)}). Let

• Z̄ = E[Z | Y, A^{(s)}, Ψ^{(s)}];
• S̄ = E[Z^⊤Z | Y, A^{(s)}, Ψ^{(s)}].

The expected conditional deviance is then

    d̄(A, Ψ) = n log |Ψ| + tr(YΨ^{-1}Y^⊤) − 2 tr(A^⊤Ψ^{-1}Y^⊤Z̄) + tr(A^⊤Ψ^{-1}AS̄).
Exercise 1. Show via completing the square and/or calculus that the minimizer of d̄(A, Ψ) in A is Â = Y^⊤Z̄S̄^{-1}.
Plugging Â into d̄ and manipulating with traces gives

    d̄(Â, Ψ) = n log |Ψ| + tr(Ψ^{-1}Y^⊤Y) − 2 tr(Ψ^{-1}Y^⊤Z̄Â^⊤) + tr(Ψ^{-1}ÂS̄Â^⊤)
             = n log |Ψ| + tr( Ψ^{-1}(Y^⊤Y − 2Y^⊤Z̄Â^⊤ + ÂS̄Â^⊤) )
             ≡ n log |Ψ| + tr(Ψ^{-1}E^⊤E).
Exercise 2. Show that the minimizer of d̄(Â, Ψ) in Ψ is Ψ̂ = diag(diag(E^⊤E))/n.
To summarize, the MLEs of (µ, A, Ψ) for the Gaussian factor model may be obtained as follows:

1. Let µ̂ = ȳ;
2. Reassign Y ← CY;
3. Given a starting value (A^{(1)}, Ψ^{(1)}), iterate the following steps until convergence:
   (a) Compute Z̄ = E[Z | Y, A^{(s)}, Ψ^{(s)}] and S̄ = E[Z^⊤Z | Y, A^{(s)}, Ψ^{(s)}].
   (b) Let A^{(s+1)} = Y^⊤Z̄S̄^{-1}.
   (c) Let Ψ^{(s+1)} = diag(diag(E^⊤E))/n, where E^⊤E = Y^⊤Y − 2Y^⊤Z̄A^⊤ + AS̄A^⊤, evaluated at A = A^{(s+1)}.
Are we done? No, we haven't yet determined the formulas for Z̄ and S̄.

Exercise 3. Obtain the conditional distribution of Z given Y and values of A and Ψ. Show that

    Z̄ ≡ E[Z | Y, A, Ψ] = YΨ^{-1}A[A^⊤Ψ^{-1}A + I]^{-1}
    S̄ ≡ E[Z^⊤Z | Y, A, Ψ] = Z̄^⊤Z̄ + n(A^⊤Ψ^{-1}A + I)^{-1}.

Ok, now we're done.
3.2 Numerical examples
Here is some R code for one EM step:

fana_em<-function(Y,APsi)
{
  #### ---- one EM step
  A<-APsi$A ; Psi<-APsi$Psi ; iPsi<-diag(1/diag(Psi))
  Vz<-solve( t(A)%*%iPsi%*%A + diag(nrow=ncol(A)) )
  Zb<- Y%*%iPsi%*%A%*%Vz
  Sb<- t(Zb)%*%Zb + nrow(Y)*Vz
  A<-t(Y)%*%Zb%*%solve(Sb)
  Psi<-diag(diag( t(Y)%*%Y - 2*t(Y)%*%Zb%*%t(A) + A%*%Sb%*%t(A) ))/nrow(Y)
  list(A=A,Psi=Psi)
}
Here is the −2 log likelihood:

#### ---- -2 log likelihood
fana_m2ll<-function(Y,APsi)
{
  A<-APsi$A ; Psi<-APsi$Psi
  Sigma<- tcrossprod(A) + Psi
  nrow(Y)*log(det(Sigma)) + sum(diag(crossprod(Y)%*%solve(Sigma)))
}
Note that these functions were written for transparency and not computational efficiency.
Putting these things together, here is a function to find an MLE (recall A is only identified up to right rotations):

fana_mle<-function(Y,q,tol=1e-8)
{
  ## ---- sweep out mean
  mu<-apply(Y,2,mean)
  Y<-sweep(Y,2,mu,"-")

  ## ---- if q=0
  if(q==0)
  {
    A<-matrix(0,nrow=ncol(Y),ncol=0)
    Psi<-diag(apply(Y,2,var))
    APsi<-list(A=A,Psi=Psi)
    M2LL<-fana_m2ll(Y,APsi)
  }

  if(q>0)
  {
    ## ---- starting values
    s<-apply(Y,2,sd)
    R<-cor(Y)
    tmp<-R ; diag(tmp)<-0 ; h<-apply(abs(tmp),1,max)
    Psi<-diag(1-h,nrow=ncol(Y))
    for(j in 1:2)
    {
      eX<-svd(R-Psi,nu=q,nv=0)
      A<-eX$u[,1:q,drop=FALSE]%*%sqrt(diag(eX$d[1:q],nrow=q))
      Psi<-diag( pmax( diag(R-tcrossprod(A)), 1e-3) )
    }
    A<-sweep(A,1,s,"*")
    diag(Psi)<-diag(Psi)*s^2
    APsi<-list(A=A,Psi=Psi)

    ## ---- EM algorithm
    M2LL<-c(Inf,fana_m2ll(Y,APsi))
    while( diff(rev(tail(M2LL,2)))/abs(tail(M2LL,1)) > tol )
    {
      APsi<-fana_em(Y,APsi)
      M2LL<-c(M2LL,fana_m2ll(Y,APsi))
    }
  }

  ## ---- output
  list(mu=mu, A=APsi$A, Psi=APsi$Psi, M2LL=M2LL,
       Sigma=tcrossprod(APsi$A)+APsi$Psi,
       npq=c(nrow(Y),ncol(Y),ncol(A)))
}
Simulation example: Let's try it out on a challenging dataset where n < p.
n<-50 ; p<-100 ; q<-4
## ---- parameters
A0<-matrix(rexp(p*q),p,q)/4
Psi0<-diag(rexp(p))
## ---- data
Z0<-matrix(rnorm(n*q),n,q) ; E0<-matrix(rnorm(n*p),n,p)
Y<- Z0%*%t(A0) + E0%*%sqrt(Psi0)
## ---- fit
fit_em<-fana_mle(Y,4)
The EM algorithm moves very fast at first, then slows considerably:
[Figure: fit_em$M2LL (the −2 log likelihood) plotted against iteration number, with a zoomed-in view of tail(fit_em$M2LL,50), the last 50 iterations.]
Now let's compare the true variances and correlations to their estimates, using the q = 4 factor model and the unrestricted MLE (the sample covariance matrix).
[Figure: true variances versus fitted variances, and true correlations versus fitted correlations, for the unrestricted MLE and the q = 4 factor model.]
How does R's built-in function fare?

fit_R<-factanal(Y,4)

Error in solve.default(cv): system is computationally singular: reciprocal condition number = 1.8377e-20
Selecting the number of factors: Often we don't know how many factors there are. Statistical assessment of the number of factors can be done with hypothesis tests or other model selection criteria. A popular model selection criterion is the "Bayesian information criterion" (BIC) [Schwarz, 1978], which is an approximation to the Bayes factor:

    −2 log p(y | q) = −2 log ∫ p(y | θ) p(θ | q) dθ
                    ≈ −2 log( p(y | θ̂_q) (n/(2π))^{−k/2} )
                    = −2 log p(y | θ̂_q) + k log n − k log 2π
    BIC(q) = −2 log p(y | θ̂_q) + k log n,

where k is the number of parameters in the model. One value of q is preferred over another if its BIC is lower. For a discussion of this and other model selection methods for factor analysis, see Hirose et al. [2011].
Exercise 4. Compute the number of parameters k in the (p, q) factor model.
Here is some code to compute the BIC for a model fit returned by the EM-algorithm function:

#### ---- BIC for FANA
fana_bic<-function(fit)
{
  npar<- min( fit$npq[2]*(fit$npq[3]+1) - choose(fit$npq[3],2),
              choose(fit$npq[2]+1,2) )
  tail(fit$M2LL,1) + log(fit$npq[1])*npar
}
Notice the min in the code above. To get a sense of where that comes from, consider the case that p = 4 and q = 2. In this case, it would appear as if the number of parameters is 4 × 2 − 1 = 7 for the A matrix (subtracting choose(2,2) = 1 for the rotational non-identifiability) and 4 for the Ψ matrix, for a total of 11. But this is more than the number of free parameters (10) in an unconstrained 4 × 4 covariance matrix. The correct number of parameters is the maximal rank of the Jacobian matrix of the map (A, Ψ) ↦ AA^⊤ + Ψ [Drton et al., 2007].
Exercise 5. Consider the set of 4 × 4 covariance matrices that can be expressed as AA^⊤ + Ψ for A ∈ R^{4×2} and diagonal Ψ. Is this set equal to the set of all 4 × 4 covariance matrices?
WAIS data: The WAIS dataset consists of data from four subtests of the Wechsler Adult Intelligence Scale (WAIS) on 49 elderly individuals. Here are some univariate and bivariate descriptions of the data:
#### ---- WAIS dataset on elderly individuals
Y<-readRDS("../Data/wais.rds")[,-1]
dim(Y)
[1] 49 4
pairs(Y)
[Figure: pairs plot of the four WAIS subtests: information, similarities, arithmetic, and picture completion.]
var(Y)
information similarities arithmetic picture completion
information 13.78 12.26 9.16 5.63
similarities 12.26 16.63 9.61 5.03
arithmetic 9.16 9.61 13.02 4.38
picture completion 5.63 5.03 4.38 7.65
cor(Y)
information similarities arithmetic picture completion
information 1.000 0.810 0.684 0.548
similarities 0.810 1.000 0.653 0.445
arithmetic 0.684 0.653 1.000 0.439
picture completion 0.548 0.445 0.439 1.000
Do you think a factor analysis model would fit these data well? What should the number of factors be?
eigen(cor(Y))
eigen() decomposition
$values
[1] 2.813 0.631 0.378 0.179
$vectors
[,1] [,2] [,3] [,4]
[1,] -0.549 -0.124 0.3186 0.7626
[2,] -0.527 -0.315 0.4763 -0.6297
[3,] -0.498 -0.281 -0.8183 -0.0622
[4,] -0.416 0.898 -0.0449 -0.1344
fit0<-fana_mle(Y,0)
fit1<-fana_mle(Y,1)
fit2<-fana_mle(Y,2)
sapply( list(fit0$M2LL,fit1$M2LL,fit2$M2LL),function(x){tail(x,1)})
[1] 684 581 580
sapply( list(fit0,fit1,fit2), fana_bic)
[1] 699 612 619
So a one-factor random factor model is selected by BIC. Let’s examine the
parameter estimates for this model.
fit1$A
[,1]
information -3.45
similarities -3.48
arithmetic -2.64
picture completion -1.56
round(cor(Y),2)
information similarities arithmetic picture completion
information 1.00 0.81 0.68 0.55
similarities 0.81 1.00 0.65 0.45
arithmetic 0.68 0.65 1.00 0.44
picture completion 0.55 0.45 0.44 1.00
fit1$Psi
[,1] [,2] [,3] [,4]
[1,] 1.59 0.00 0.00 0.00
[2,] 0.00 4.21 0.00 0.00
[3,] 0.00 0.00 5.81 0.00
[4,] 0.00 0.00 0.00 5.07
tcrossprod(fit1$A) + fit1$Psi
information similarities arithmetic picture completion
information 13.50 11.99 9.09 5.37
similarities 11.99 16.29 9.16 5.41
arithmetic 9.09 9.16 12.76 4.10
picture completion 5.37 5.41 4.10 7.50
var(Y)
information similarities arithmetic picture completion
information 13.78 12.26 9.16 5.63
similarities 12.26 16.63 9.61 5.03
arithmetic 9.16 9.61 13.02 4.38
picture completion 5.63 5.03 4.38 7.65
[Figure: entries of the fitted covariance matrix (fit1$Sigma) plotted against the entries of the unrestricted MLE of Σ, for the q = 1 and q = 2 models.]
4 VARIMAX rotation

In terms of the model fit, we have

    −2 log p(Y : µ̂, Â, Ψ̂) = −2 log p(Y : µ̂, ÂR, Ψ̂)

for any rotation matrix R ∈ O_q. Therefore, we are not really estimating an A matrix; we are estimating the class

    A = {ÂR : R ∈ O_q},

or equivalently, we are estimating ÂÂ^⊤.

Exercise 6. Show that A_1A_1^⊤ = A_2A_2^⊤ if and only if A_1 = A_2R for some rotation matrix R.

For this reason, some have argued against interpreting the coefficients of Â beyond ÂÂ^⊤.
For the same reason, others have argued in favor of selecting an element of

    A = {ÂR : R ∈ O_q}

that most conforms to one's theories. Many theories are available, one of which is that observed variables are associated primarily with only a subset of the latent factors, and hence many coefficients of A should be zero.
An alternative way to think of this is that, within each column (a_{1,k}, ..., a_{p,k}) of A, the variance of the squared loadings should be high: a few variables should have high loadings on factor k, and the others should have small loadings. The VARIMAX rotation finds the rotation R that maximizes the sum of the within-column variances of the squared elements:

    R_VMAX = arg max_{R ∈ O_q} Σ_{k=1}^q Var[(AR)_{[k]} ∘ (AR)_{[k]}],

where (AR)_{[k]} is the kth column of AR and ∘ denotes the elementwise product.
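This criterion is what R's stats::varimax optimizes directly. A minimal sketch of applying it to an estimated loading matrix (the wrapper name vmax is mine; with normalize=FALSE the returned rotmat satisfies loadings = A %*% rotmat):

vmax<-function(A)
{
  ## return the VARIMAX-rotated version of the loading matrix A
  R<-varimax(A,normalize=FALSE)$rotmat
  A%*%R
}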
The VARIMAX rotation is implemented by default in R. Let's see how it works in the Swiss heads example. First let's select the number of factors, and then fit the model using both the EM algorithm and the built-in R function.
heads<-readRDS("../Data/heads.rds")
Y<-rbind(heads$males,heads$females)
pairs(Y,col=1+rep(c(1,2),times=c(200,59)))
[Figure: pairs plot of the six head measurements MFB, BAM, TFH, LGAN, LTN, and LTG, with males and females in different colors.]
fana_bic(fana_mle(Y,0))
[1] 7068
fana_bic(fana_mle(Y,1))
[1] 6754
fana_bic(fana_mle(Y,2))
[1] 6745
fana_bic(fana_mle(Y,3))
[1] 6749
fit1<-fana_mle(Y,1)
fit2<-fana_mle(Y,2)
fit1R<-factanal(Y,1)
fit2R<-factanal(Y,2)
First let's compare the output from the two functions in the q = 1 model (VARIMAX is not relevant in this case):
fit1$A
[,1]
MFB -4.81
BAM -1.94
TFH -3.70
LGAN -1.67
LTN -3.27
LTG -5.75
fit1R$loadings
Loadings:
Factor1
MFB 0.615
BAM 0.360
TFH 0.528
LGAN 0.377
LTN 0.737
LTG 0.844
Factor1
SS loadings 2.187
Proportion Var 0.364
fit1R$loadings*apply(Y,2,sd)
Loadings:
Factor1
MFB 4.82
BAM 1.94
TFH 3.70
LGAN 1.67
LTN 3.28
LTG 5.77
Factor1
SS loadings 87.5
Proportion Var 14.6
The R function scales the variables to have unit variance before fitting the
model for some reason.
Now let's compare the parameters for the q = 2 model:
A<-fit2$A
psi<-diag(fit2$Psi)
AR<-sweep(fit2R$loadings,1,apply(Y,2,sd),"*")
psiR<-fit2R$uniquenesses*apply(Y,2,var)
First check Ψ:
psi
[1] 37.838 24.276 0.837 15.459 9.027 11.937
psiR
MFB BAM TFH LGAN LTN LTG
37.996 24.383 0.245 15.545 9.066 11.959
Now check AA^⊤:
A%*%t(A)
MFB BAM TFH LGAN LTN LTG
MFB 23.26 8.93 22.15 8.34 15.49 27.65
BAM 8.93 4.72 2.58 2.04 6.64 12.08
TFH 22.15 2.58 48.05 13.23 11.57 19.60
LGAN 8.34 2.04 13.23 4.03 4.93 8.59
LTN 15.49 6.64 11.57 4.93 10.69 19.20
LTG 27.65 12.08 19.60 8.59 19.20 34.54
AR%*%t(AR)
MFB BAM TFH LGAN LTN LTG
MFB 23.34 8.96 22.29 8.36 15.55 27.76
BAM 8.96 4.73 2.59 2.06 6.66 12.13
TFH 22.29 2.59 48.85 13.29 11.62 19.67
LGAN 8.36 2.06 13.29 4.02 4.95 8.63
LTN 15.55 6.66 11.62 4.95 10.73 19.28
LTG 27.76 12.13 19.67 8.63 19.28 34.70
Close enough for me. Now let's see what the rotated loadings look like:
A
[,1] [,2]
MFB 4.81 -0.405
BAM 1.75 -1.291
TFH 5.01 4.788
LGAN 1.81 0.870
LTN 3.15 -0.880
LTG 5.60 -1.773
AR
Loadings:
Factor1 Factor2
MFB 4.006 2.701
BAM 2.172
TFH 0.890 6.932
LGAN 0.870 1.806
LTN 3.011 1.290
LTG 5.491 2.132
Factor1 Factor2
SS loadings 61.5 64.8
Proportion Var 10.3 10.8
Cumulative Var 10.3 21.1
Here is a graphical comparison. The VARIMAX solution is in green.
[Figure: the loadings in each column of A plotted over the variables MFB, BAM, TFH, LGAN, LTN, and LTG, for the MLE and for the VARIMAX solution (green).]
Note that the print method for coefficients from factanal makes the A matrix look "sparse" when it really isn't. Beware of this additional pitfall of using the factanal command in R.
AR
Loadings:
Factor1 Factor2
MFB 4.006 2.701
BAM 2.172
TFH 0.890 6.932
LGAN 0.870 1.806
LTN 3.011 1.290
LTG 5.491 2.132
Factor1 Factor2
SS loadings 61.5 64.8
Proportion Var 10.3 10.8
Cumulative Var 10.3 21.1
AR[2,2]
[1] 0.0944
5 Recovery of latent factors
There are several ways to estimate the common factors, depending on what
assumptions one is willing to make. Imagine for the moment that A and Ψ
are known and we would like to infer the values of Z, perhaps as a tool to
cluster the rows. One perspective is then that Z can be viewed as a fixed
parameter, and we can estimate it using OLS, GLS, MLE etc. Recall that
the conditional model is

    Y ∼ N_{n×p}(ZA^⊤, Ψ ⊗ I).

Given values of A and Ψ, the ML/GLS estimator of Z is the minimizer of

    tr((Y − ZA^⊤)Ψ^{-1}(Y − ZA^⊤)^⊤) ≃ −2 tr(YΨ^{-1}AZ^⊤) + tr(ZA^⊤Ψ^{-1}AZ^⊤)
                                      = −2 tr(Z^⊤[YΨ^{-1}A]) + tr(Z^⊤Z[A^⊤Ψ^{-1}A]).
Exercise 7. Show that the minimizer in Z is Ẑ = [YΨ^{-1}A][A^⊤Ψ^{-1}A]^{-1}.
The optimizer Ẑ is sometimes called "Bartlett's factor score" matrix. However, note that in practice A and Ψ are not known, and so typically their MLEs are plugged in to obtain this pseudo-MLE of Z. Also, because A is not identifiable, neither is Z: if (Â, Ẑ) are an MLE of A and the corresponding Bartlett factor score matrix, then so are (ÂR, ẐR) for any R ∈ O_q.
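A minimal sketch of these plug-in Bartlett scores in R, in the style of the fana_scores function given at the end of this section (the function name is mine):

fana_scores_bartlett<-function(Y,fit)
{
  ## Bartlett (GLS/ML) factor scores, plugging in the MLEs from fana_mle
  Y<-sweep(Y,2,apply(Y,2,mean),"-")
  A<-fit$A ; iPsi<-diag(1/diag(fit$Psi))
  (Y%*%iPsi%*%A) %*% solve( t(A)%*%iPsi%*%A )
}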
An alternative perspective is that we are assuming the distribution of the latent factors is Z ∼ N_{n×q}(0, I ⊗ I). Therefore, if A and Ψ are known, it would be most appropriate to find the conditional distribution and conditional mode of Z given Y.

Exercise 8. Show that the conditional distribution of Z is

    Z | Y ∼ N_{n×q}([YΨ^{-1}A]V, V ⊗ I_n),

where V = [A^⊤Ψ^{-1}A + I_q]^{-1}. Hence the conditional mean/mode is

    Z̃ = [YΨ^{-1}A][A^⊤Ψ^{-1}A + I_q]^{-1}.

This estimator is sometimes called "Thompson's factor score" matrix. It is very similar to the MLE, except that the elements within each row get shrunk a bit towards zero due to the assumption of standard normal factors.
However, whatever the method, recall that Z is only estimable up to right
rotations, as A is only identifiable up to rotations under this normal random
factor model. To obtain identifiability, something else must be assumed about
Z or A, such as
• sparsity of A, or
• non-Gaussianity of Z.
The latter is what is assumed in a signal-recovery method known as "independent component analysis" (ICA).
fana_scores<-function(Y,fit)
{
  Y<-sweep(Y,2,apply(Y,2,mean),"-")
  A<-fit$A ; iPsi<-diag(1/diag(fit$Psi))
  (Y%*%iPsi%*%A) %*% solve( t(A)%*%iPsi%*%A + diag(ncol(A)) )
}
Z<-fana_scores(Y,fit2)
fit2R<-factanal(Y,2,scores="regression")
ZR<-fit2R$scores
[Figure: scatterplots of the estimated factor scores: Z[,1] versus Z[,2] from fana_scores (left), and Factor1 versus Factor2 from factanal's regression scores (right).]
6 Decorrelation and ICA
The following material might more appropriately be located with the material
on PCA.
6.1 Decorrelation
Consider for the moment the "low noise" model with p = q, so

    Y ≈ ZA^⊤,

where Z is a random matrix of mean-zero, variance-one uncorrelated random factors, i.e. "white noise." The mixing matrix A ∈ R^{p×p} induces correlation among the columns of Y:

    Y^⊤Y/n ≈ AZ^⊤ZA^⊤/n ≈ AA^⊤.

To make things easier, let's further assume that n is very large, so that this approximation is very good: so good that from now on we will write Y = ZA^⊤ and Y^⊤Y/n = AA^⊤.
Now suppose we'd like to try to recover Z from Y. How can we do this? Obviously, if we could estimate A^{-1} we'd be in business. However, in general you can't recover A from Y^⊤Y/n = AA^⊤ for all values of A, because of the rotational non-identifiability.
Decorrelation matrices: Consider the following intuitive approach: the columns of Y are correlated linear combinations of the uncorrelated columns of Z. Maybe if we decorrelate Y, the result will look like Z.

Definition 1. W ∈ R^{p×p} is a decorrelation matrix for Y if the (sample) column covariance of YW^⊤ is the identity matrix, that is, if

    W(Y^⊤Y/n)W^⊤ = I_p.

Note that such a W both decorrelates and standardizes the columns of Y.
So what matrices W do the job? Let A = VL^{1/2}R̃^⊤ be the SVD of A. Then

    Y^⊤Y/n = (VL^{1/2}R̃^⊤)(VL^{1/2}R̃^⊤)^⊤ = VLV^⊤.

So W is a decorrelation matrix if WVLV^⊤W^⊤ = I.

Exercise 9. Show that W is a decorrelation matrix if and only if W = RL^{-1/2}V^⊤ for some rotation matrix R ∈ O_p.
How can we obtain a decorrelation matrix W = RL^{-1/2}V^⊤?

• V and L can be obtained from the sample covariance matrix, since Y^⊤Y/n = VLV^⊤;
• Y^⊤Y/n does not provide guidance on how to choose R.

Now suppose we obtain a decorrelation Z̃ of Y using a decorrelation matrix W. Will Z̃ resemble Z?

    Z̃ = YW^⊤ = ZA^⊤W^⊤
      = ZR̃L^{1/2}V^⊤VL^{-1/2}R^⊤
      = ZR̃R^⊤.

So Z̃ is a column permutation ZP of Z if R̃ is a row permutation of R:

    Z̃ = ZP ⇔ R̃ = PR.
PCA decorrelation: Recall that the "principal components transformation" rotates the columns of Y to construct uncorrelated principal components F:

    F = YV
    F^⊤F/n = V^⊤(Y^⊤Y/n)V = L.

To standardize, multiply F by L^{-1/2} to give

    Z̃_PCA = FL^{-1/2} = YVL^{-1/2} = YW_PCA^⊤,

where W_PCA = L^{-1/2}V^⊤. In signal processing, decorrelating Y with W_PCA is called PCA whitening.
Exercise 10. Show that PCA whitening can recover Z up to permutations if A = P_1A_0P_2, where A_0 is a matrix with orthogonal columns and P_1 and P_2 are signed permutation matrices.
ZCA decorrelation: Another popular "default" decorrelation is ZCA whitening, obtained by multiplying Y by the decorrelation matrix W_ZCA = R_ZCA L^{-1/2}V^⊤, where

    R_ZCA = V, so
    W_ZCA = VL^{-1/2}V^⊤ = Σ̂^{-1/2},

with Σ̂ = Y^⊤Y/n. Recall that when we empirically standardize a scalar random variable we perform the operation z = (σ̂²)^{-1/2}y. ZCA whitening is the closest analogue in the multivariate case:

    z̃ = Σ̂^{-1/2}y
    Z̃ = YΣ̂^{-1/2}.

Sometimes this transformation is called the Mahalanobis transformation.

Exercise 11. Show that ZCA whitening can recover Z up to permutations if A = P_1A_0P_2, where A_0 is a symmetric matrix and P_1 and P_2 are signed permutation matrices.
Comments:

1. Letting Y = √n UDV^⊤ be the SVD of Y, the decorrelated variables resulting from ZCA whitening are Z̃_ZCA = √n UV^⊤, which is √n times the orthogonal factor of the polar decomposition of Y. Of course, Z̃_ZCA^⊤Z̃_ZCA = nI, as desired.
2. ZCA whitening can be shown to have certain optimality properties: if the columns of Y began with unit variance, then ||Y − Z̃||² is minimized among decorrelations Z̃ by Z̃_ZCA.
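Before the examples, here is a small helper (a sketch; the name whitener is mine) that constructs the two decorrelation matrices used repeatedly below:

whitener<-function(Y,type=c("ZCA","PCA"))
{
  ## W_PCA = L^{-1/2} V^T ; W_ZCA = V L^{-1/2} V^T
  type<-match.arg(type)
  eS<-eigen(t(Y)%*%Y/nrow(Y))
  V<-eS$vec ; L<-eS$val
  W<-diag(1/sqrt(L))%*%t(V)
  if(type=="ZCA"){ W<-V%*%W }
  W
}
## usage: Zpca<-Y%*%t(whitener(Y,"PCA")) ; Zzca<-Y%*%t(whitener(Y,"ZCA"))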
Numerical examples: Let's confirm what we have derived with two numerical examples, one where PCA recovers the factors and another where ZCA does.
## data dimensions
n<-1000 ; p<-3
Recovery via PCA:
## orthogonal columns
Ap<-eigen(tcrossprod(matrix(rnorm(p*p),p,p)))$vec %*% diag(rexp(p))
## permute
Ap<-Ap[c(3,1,2),c(1,3,2) ]
## generate data
Z<-matrix(rnorm(n*p),n,p)
Y<-Z%*%t(Ap)
## decorrelate
eS<-eigen(t(Y)%*%Y/n)
V<-eS$vec ; L<-eS$val
Zzca<- Y%*%t( V %*% diag(1/sqrt(L)) %*% t(V) )
Zpca<- Y%*%t( diag(1/sqrt(L)) %*% t(V) )
crossprod(Z,Zpca)/n
[,1] [,2] [,3]
[1,] -0.0158 0.02416 -0.9943554
[2,] 0.9567 -0.00341 -0.0000349
[3,] -0.0372 -1.00383 -0.0005494
crossprod(Z,Zzca)/n
[,1] [,2] [,3]
[1,] -0.324 0.137 -0.93053
[2,] 0.779 -0.453 -0.32147
[3,] 0.471 0.887 -0.00889
Recovery via ZCA:
## symmetric A plus noise
Az<-matrix( c( 0.85, 0.42, 0.26,
0.42, 0.85, 0.42,
0.26, 0.42, 1.00 ), 3,3)
## permute
Az<-Az[c(3,1,2),c(1,3,2) ]
## generate data
Z<-matrix(rnorm(n*p),n,p)
Y<-Z%*%t(Az)
## decorrelate
eS<-eigen(t(Y)%*%Y/n)
V<-eS$vec ; L<-eS$val
Zzca<- Y%*%t( V %*% diag(1/sqrt(L)) %*% t(V) )
Zpca<- Y%*%t( diag(1/sqrt(L)) %*% t(V) )
crossprod(Z,Zpca)/n
[,1] [,2] [,3]
[1,] -0.506 -0.646 -0.576
[2,] -0.588 0.776 -0.304
[3,] -0.608 -0.223 0.804
crossprod(Z,Zzca)/n
[,1] [,2] [,3]
[1,] -0.0112 1.00254 -0.0130
[2,] 1.0192 -0.01718 -0.0303
[3,] -0.0275 0.00105 1.0325
6.2 ICA
To summarize the last subsection: we can obtain good estimates of Z, for Gaussian and non-Gaussian factors alike, for certain types of matrices A.

However, there is another decorrelation method, called independent component analysis (ICA), designed specifically for deconvolving mixed signals, that can perform well for certain types of Z-matrices for all kinds of A-matrices. Specifically, ICA works when the source signals have non-Gaussian distributions.
The ICA decorrelation is given by W_ICA = R_ICA L^{-1/2}V^⊤, where

• V and L are obtained from the approximation Y^⊤Y/n = VLV^⊤;
• R_ICA is obtained from some other aspect of the data Y.

The idea behind ICA is similar to that which motivated factor recovery via PCA/ZCA:

PCA/ZCA: Z is uncorrelated, so Z̃ = YW^⊤ should be uncorrelated.
ICA: Z is non-Gaussian and uncorrelated, so Z̃ = YW^⊤ should be non-Gaussian and uncorrelated.
Measuring non-normality: To implement ICA, one first selects a measure of non-normality. Standard implementations measure deviations from normality by deviations of functional sample moments from their expectations under normality. For example, we might measure (non-)Gaussianity with a function G given by

    G(Z̃) = Σ_j ( Σ_i g(z̃_{i,j})/n − E[g(z)] )²,

where z ∼ N(0, 1). Standard choices of g include

• g(z) = z⁴ (kurtosis);
• g(z) = (log cosh(az))/a;
• g(z) = −e^{−z²/2}.

The first is simple but not robust to outliers. The second and third are approximations to a measure of entropy.
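A sketch of such a measure in R, using g(z) = log(cosh(az))/a and a Monte Carlo approximation to E[g(z)] (the function name G_nonnorm and the default a = 1 are my choices):

G_nonnorm<-function(Z,a=1)
{
  ## sum over columns of squared deviations of mean g from its N(0,1) expectation
  g<-function(z){ log(cosh(a*z))/a }
  Eg<-mean(g(rnorm(1e6)))     # E[g(z)] for z ~ N(0,1), by Monte Carlo
  sum( (apply(g(Z),2,mean)-Eg)^2 )
}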
Once a non-normality measure has been selected, the ICA solution Z̃_ICA can be obtained as follows:

1. Whiten: compute Z̃_PCA.
2. Optimize: find R_ICA = arg max_{R ∈ O_p} G(Z̃_PCA R^⊤).
3. Transform: let Z̃_ICA = Z̃_PCA R_ICA^⊤.

Note that this algorithm is the same as finding the decorrelation matrix W = RL^{-1/2}V^⊤ that maximizes G(YW^⊤) over orthogonal matrices R.
Why does ICA work? Recall that any decorrelation W = RL^{-1/2}V^⊤ produces estimated latent factors

    Z̃ = ZR̃R^⊤

for R ∈ O_p. Recovery occurs when R̃R^⊤ is a signed permutation matrix. Now if the columns of Z are non-Gaussian, then the columns of Z̃ will look equally non-Gaussian for the right choice of R, because in this case each column of Z̃ is one of the columns of Z (up to sign changes). On the other hand, if R is chosen poorly, then each column of Z̃ is a linear combination of the columns of Z, and by the central limit theorem will therefore look more Gaussian than the original columns.
This might sound surprising, since we think of the CLT as an asymptotic result, and here we are applying it in a situation where the number of things we are taking linear combinations of can be quite small (2 or 3). However, it is generally true that if Z1 and Z2 are independent non-normal random variables, then Z1 + Z2 will be more normally distributed than either one of them individually. Here is a numerical illustration:
Z<-matrix(rexp(n*p),n,p) ; Z<-sweep(Z,2,apply(Z,2,mean),"-")
A<-matrix( c(.65,-.45,.45,-.45,.65,-.45,.45,-.45,.65), 3,3)
A
[,1] [,2] [,3]
[1,] 0.65 -0.45 0.45
[2,] -0.45 0.65 -0.45
[3,] 0.45 -0.45 0.65
Y<-A%*%t(Z)
[Figure: histograms of the entries of Z (left) and of the mixed data Y (right); the entries of Y look more Gaussian.]
Exercise 12. Will ICA work if Z is a matrix of i.i.d. standard normal random variables? Formally explain why or why not.
Parametric ICA: There are many variations on ICA, and it might be more appropriate to refer to it as an objective rather than a single model or algorithm. One important variant is what I would call "parametric" ICA, where some additional structure on the components is assumed (sparsity, a parametric model, temporal dependence, etc.).
Numerical examples: First let’s generate some non-normal factors:
## non-normal factors
Z<-matrix(rexp(n*p),n,p)
Z<-sweep(Z,2,apply(Z,2,mean),"-")
Z<-sweep(Z,2,apply(Z,2,sd),"/")
Recovery via PCA and ICA:
Y<-Z%*%t(Ap)
## decorrelate
eS<-eigen(t(Y)%*%Y/n)
V<-eS$vec ; L<-eS$val
Zzca<- Y%*%t( V %*% diag(1/sqrt(L)) %*% t(V) )
Zpca<- Y%*%t( diag(1/sqrt(L)) %*% t(V) )
## ICA
Zica<-fastICA::fastICA(Y,p)$S
crossprod(Z,Zpca)/n
[,1] [,2] [,3]
[1,] 0.0240 0.02814 -0.9988160
[2,] 0.9995 -0.00369 0.0000458
[3,] -0.0423 -0.99860 -0.0006497
crossprod(Z,Zzca)/n
[,1] [,2] [,3]
[1,] -0.296 0.116 -0.94747
[2,] 0.813 -0.473 -0.33746
[3,] 0.465 0.885 -0.00685
crossprod(Z,Zica)/n
[,1] [,2] [,3]
[1,] -0.0310 0.9990 -0.000748
[2,] 0.0383 0.0243 -0.998469
[3,] -0.9977 -0.0595 -0.001124
Recovery via ZCA and ICA:
Y<-Z%*%t(Az)
## decorrelate
eS<-eigen(t(Y)%*%Y/n)
V<-eS$vec ; L<-eS$val
Zzca<- Y%*%t( V %*% diag(1/sqrt(L)) %*% t(V) )
Zpca<- Y%*%t( diag(1/sqrt(L)) %*% t(V) )
## ICA
Zica<-fastICA::fastICA(Y,p)$S
crossprod(Z,Zpca)/n
[,1] [,2] [,3]
[1,] -0.526 -0.638 -0.561
[2,] -0.606 0.733 -0.308
[3,] -0.570 -0.188 0.799
crossprod(Z,Zzca)/n
[,1] [,2] [,3]
[1,] 0.0144 0.99935 -0.0097
[2,] 0.9994 0.00929 -0.0129
[3,] -0.0255 -0.01842 0.9990
crossprod(Z,Zica)/n
[,1] [,2] [,3]
[1,] -0.0308 0.00372 -0.9990
[2,] 0.0383 -0.99835 -0.0287
[3,] -0.9977 -0.00132 0.0592
References

Jian-Feng Cai, Emmanuel J. Candes, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM J. Optim., 20(4):1956–1982, 2010. doi: 10.1137/080738970.

David L. Donoho and Matan Gavish. The optimal hard threshold for singular values is 4/√3. arXiv preprint arXiv:1305.5870, 2013.

Mathias Drton, Bernd Sturmfels, and Seth Sullivant. Algebraic factor analysis: tetrads, pentads and beyond. Probab. Theory Related Fields, 138(3-4):463–493, 2007. doi: 10.1007/s00440-006-0033-2.

D. Gerard and P. D. Hoff. Adaptive higher-order spectral estimators. Technical Report 633, Department of Statistics, University of Washington, 2015.

Wolfgang Karl Hardle and Leopold Simar. Applied Multivariate Statistical Analysis. Springer, Heidelberg, fourth edition, 2015. doi: 10.1007/978-3-662-45171-7.

Kei Hirose, Shuichi Kawano, Sadanori Konishi, and Masanori Ichikawa. Bayesian information criterion and selection of the number of factors in factor analysis models. Journal of Data Science, 9(2):243–259, 2011.

P. D. Hoff. Model averaging and dimension selection for the singular value decomposition. J. Amer. Statist. Assoc., 102(478):674–685, 2007.

Alan Julian Izenman. Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. Springer Texts in Statistics. Springer, New York, 2008. doi: 10.1007/978-0-387-78189-1.

Julie Josse and Sylvain Sardy. Reduced rank matrix estimation by adaptive trace norm regularization, 2013. arXiv:1310.6602, http://arxiv.org/abs/1310.6602.

Kantilal Varichand Mardia, John T. Kent, and John M. Bibby. Multivariate Analysis. Academic Press, London, 1979.

Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res., 11:2287–2322, 2010.

Gideon Schwarz. Estimating the dimension of a model. Ann. Statist., 6(2):461–464, 1978.