Principal Components Analysis
BMTRY 726, 3/27/14
Uses
Goal: Explain the variability of a set of variables using a “small” set of linear combinations of those variables.
Why: There are several reasons we may want to do this:
(1) Dimension reduction (use k of p components). Note: total variability still requires all p components.
(2) Identify “hidden” underlying relationships (i.e., patterns in the data) and use these relationships in further analyses.
(3) Select subsets of variables.
“Exact” Principal Components
We can represent the data $\mathbf{X}$ as linear combinations of the $p$ random measurements on $j = 1, 2, \ldots, n$ subjects.
“Exact” Principal Components
Principal components are the linear combinations $Y_1, Y_2, \ldots, Y_p$ that are:
(1) Uncorrelated
(2) Of variance as large as possible
(3) Subject to:

$1^{st}$ PC: the linear combination $Y_1 = \mathbf{a}_1'\mathbf{X}$ that maximizes $\mathrm{Var}(\mathbf{a}_1'\mathbf{X})$ subject to $\mathbf{a}_1'\mathbf{a}_1 = 1$

$2^{nd}$ PC: the linear combination $Y_2 = \mathbf{a}_2'\mathbf{X}$ that maximizes $\mathrm{Var}(\mathbf{a}_2'\mathbf{X})$ subject to $\mathbf{a}_2'\mathbf{a}_2 = 1$ and $\mathrm{Cov}(\mathbf{a}_1'\mathbf{X}, \mathbf{a}_2'\mathbf{X}) = 0$

$p^{th}$ PC: the linear combination $Y_p = \mathbf{a}_p'\mathbf{X}$ that maximizes $\mathrm{Var}(\mathbf{a}_p'\mathbf{X})$ subject to $\mathbf{a}_p'\mathbf{a}_p = 1$ and $\mathrm{Cov}(\mathbf{a}_i'\mathbf{X}, \mathbf{a}_p'\mathbf{X}) = 0$ for $i < p$
Finding PC’s Under Constraints
• So how do we find PC’s that meet the constraints we just discussed?
• We want to maximize $\mathrm{Var}(Y_i) = \mathrm{Var}(\mathbf{a}_i'\mathbf{X}) = \mathbf{a}_i'\boldsymbol{\Sigma}\mathbf{a}_i$ subject to the constraint that $\mathbf{a}_i'\mathbf{a}_i = 1$
• This constrained maximization problem can be done using the method of Lagrange multipliers
• Thus we want to maximize the function
$$\mathbf{a}_i'\boldsymbol{\Sigma}\mathbf{a}_i - \lambda_i\left(\mathbf{a}_i'\mathbf{a}_i - 1\right)$$
Finding PC’s Under Constraints
• Differentiate w.r.t. $\mathbf{a}_i$ and set the result to zero:
$$\frac{\partial}{\partial \mathbf{a}_i}\left[\mathbf{a}_i'\boldsymbol{\Sigma}\mathbf{a}_i - \lambda_i\left(\mathbf{a}_i'\mathbf{a}_i - 1\right)\right] = 2\boldsymbol{\Sigma}\mathbf{a}_i - 2\lambda_i\mathbf{a}_i = \mathbf{0} \quad\Rightarrow\quad \left(\boldsymbol{\Sigma} - \lambda_i\mathbf{I}\right)\mathbf{a}_i = \mathbf{0}$$
• So $\mathbf{a}_i$ must be an eigenvector of $\boldsymbol{\Sigma}$, with $\lambda_i$ its corresponding eigenvalue
Finding PC’s Under Constraints
• But how do we choose our eigenvector (i.e., which eigenvector corresponds to which PC)?
• We can see that what we want to maximize is
$$\mathrm{Var}(Y_i) = \mathbf{a}_i'\boldsymbol{\Sigma}\mathbf{a}_i = \mathbf{a}_i'\lambda_i\mathbf{a}_i = \lambda_i\mathbf{a}_i'\mathbf{a}_i = \lambda_i$$
• So we choose $\lambda_i$ to be as large as possible
• If $\lambda_1$ is our largest eigenvalue with corresponding eigenvector $\mathbf{e}_1$, then the solution for our max is
$$\mathbf{a}_1 = \mathbf{e}_1$$
Finding PC’s Under Constraints
• Recall we had a second constraint:
$$\mathrm{Cov}(Y_i, Y_k) = \mathrm{Cov}(\mathbf{a}_i'\mathbf{X}, \mathbf{a}_k'\mathbf{X}) = 0$$
• We could conduct a second Lagrangian maximization to find our second PC
• However, we already know that the eigenvectors of $\boldsymbol{\Sigma}$ are orthogonal, so this constraint is automatically met
• We choose the order of the PCs by the magnitude of the eigenvalues
“Exact” Principal Components
So we can compute the PCs from the variance matrix of $\mathbf{X}$, $\boldsymbol{\Sigma}$:
1. $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0$ are the eigenvalues of $\boldsymbol{\Sigma}$
2. $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_p$ are the corresponding eigenvectors of $\boldsymbol{\Sigma}$, such that
$$\boldsymbol{\Sigma}\mathbf{e}_i = \lambda_i\mathbf{e}_i, \qquad \mathbf{e}_i'\mathbf{e}_i = 1, \qquad \mathbf{e}_i'\mathbf{e}_j = 0 \text{ for } i \neq j$$
This yields our $i^{th}$ principal component
$$Y_i = \mathbf{e}_i'\mathbf{X} = e_{i1}X_1 + e_{i2}X_2 + \cdots + e_{ip}X_p, \qquad \mathrm{Var}(Y_i) = \lambda_i$$
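As a concrete illustration of these definitions, here is a minimal numpy sketch (not part of the original lecture; the covariance matrix below is made up) that recovers the $\lambda_i$ and $\mathbf{e}_i$ and forms PC scores:

```python
import numpy as np

# A minimal sketch of "exact" PCs from a known covariance matrix.
# Sigma here is illustrative, not from the lecture.
Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])

# np.linalg.eigh returns eigenvalues in ascending order for symmetric
# matrices; reverse so lambda_1 >= lambda_2 >= ... >= lambda_p.
lam, E = np.linalg.eigh(Sigma)
lam, E = lam[::-1], E[:, ::-1]            # columns of E are e_1, ..., e_p

# Defining properties: Sigma e_i = lambda_i e_i, e_i'e_i = 1, e_i'e_j = 0
assert np.allclose(Sigma @ E, E @ np.diag(lam))
assert np.allclose(E.T @ E, np.eye(3))

# The i-th PC of a vector x is Y_i = e_i'x, with Var(Y_i) = lambda_i;
# total variance is preserved: trace(Sigma) = sum of the lambda_i.
x = np.array([1.0, -0.5, 2.0])            # one hypothetical observation
y = E.T @ x                               # PC scores (Y_1, ..., Y_p)
print(np.round(lam, 3), lam.sum(), np.trace(Sigma))
```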
Properties
We can also find the moments of our PC’s.
First PC:
$$Y_1 = \mathbf{e}_1'\mathbf{X} = e_{11}X_1 + e_{12}X_2 + \cdots + e_{1p}X_p$$
$$E(Y_1) = \mathbf{e}_1'\boldsymbol{\mu}, \qquad \mathrm{Var}(Y_1) = \mathbf{e}_1'\boldsymbol{\Sigma}\mathbf{e}_1 = \lambda_1$$

Properties
$k^{th}$ PC:
$$Y_k = \mathbf{e}_k'\mathbf{X} = e_{k1}X_1 + e_{k2}X_2 + \cdots + e_{kp}X_p$$
$$E(Y_k) = \mathbf{e}_k'\boldsymbol{\mu}, \qquad \mathrm{Var}(Y_k) = \mathbf{e}_k'\boldsymbol{\Sigma}\mathbf{e}_k = \lambda_k, \qquad \mathrm{Cov}(Y_i, Y_k) = \mathbf{e}_i'\boldsymbol{\Sigma}\mathbf{e}_k = 0 \text{ for } i \neq k$$
Properties
A normality assumption is not required to find the PC’s. However, if $\mathbf{X}_j \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ then:
$$\mathbf{Y}_j = \begin{pmatrix} Y_{1j} \\ Y_{2j} \\ \vdots \\ Y_{pj} \end{pmatrix} = \begin{pmatrix} \mathbf{e}_1'\mathbf{X}_j \\ \mathbf{e}_2'\mathbf{X}_j \\ \vdots \\ \mathbf{e}_p'\mathbf{X}_j \end{pmatrix} = \boldsymbol{\Gamma}'\mathbf{X}_j \sim N_p\left(\boldsymbol{\Gamma}'\boldsymbol{\mu}, \boldsymbol{\Lambda}\right)$$
where $\boldsymbol{\Gamma} = (\mathbf{e}_1, \ldots, \mathbf{e}_p)$ and $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$, and $Y_{1j}, Y_{2j}, \ldots, Y_{pj}$ are independent.

Total Variance:
$$\mathrm{trace}(\boldsymbol{\Sigma}) = \mathrm{Var}(X_1) + \cdots + \mathrm{Var}(X_p) = \lambda_1 + \lambda_2 + \cdots + \lambda_p = \mathrm{Var}(Y_1) + \cdots + \mathrm{Var}(Y_p)$$
and the proportion of total variance accounted for by the $k^{th}$ component is
$$\frac{\lambda_k}{\sum_{i=1}^p \lambda_i} = \frac{\mathrm{Var}(Y_k)}{\sum_{i=1}^p \mathrm{Var}(Y_i)}$$
Principal Components
Consider data with $p$ random measures on $j = 1, 2, \ldots, n$ subjects. For the $j^{th}$ subject we then have the random vector
$$\mathbf{X}_j = \begin{pmatrix} X_{1j} \\ X_{2j} \\ \vdots \\ X_{pj} \end{pmatrix}, \qquad j = 1, 2, \ldots, n$$
Suppose $\mathbf{X}_j \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$... if we set $p = 2$ we know what $\mathbf{X}$ looks like.
[Figure: elliptical contour of a bivariate normal density in the $(X_1, X_2)$ plane, centered at $(\mu_1, \mu_2)$]
Graphic Representation
$$\mathbf{X}_j = (X_{1j}, X_{2j}, \ldots, X_{pj})' \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$
The density of $\mathbf{X}$ is constant on the ellipsoid (taking $\boldsymbol{\mu} = \mathbf{0}$)
$$\mathbf{x}'\boldsymbol{\Sigma}^{-1}\mathbf{x} = c^2$$
Recall: $\boldsymbol{\Sigma}\mathbf{e}_i = \lambda_i\mathbf{e}_i$ and so $\boldsymbol{\Sigma}^{-1}\mathbf{e}_i = \frac{1}{\lambda_i}\mathbf{e}_i$
Note: $\boldsymbol{\Sigma} = \mathbf{P}\boldsymbol{\Lambda}\mathbf{P}'$, where $\mathbf{P} = (\mathbf{e}_1, \ldots, \mathbf{e}_p)$ with $\mathbf{P}'\mathbf{P} = \mathbf{P}\mathbf{P}' = \mathbf{I}$, so
$$\boldsymbol{\Sigma}^{-1} = \mathbf{P}\boldsymbol{\Lambda}^{-1}\mathbf{P}' = \sum_{i=1}^p \frac{1}{\lambda_i}\mathbf{e}_i\mathbf{e}_i'$$
Thus
$$c^2 = \mathbf{x}'\boldsymbol{\Sigma}^{-1}\mathbf{x} = \sum_{i=1}^p \frac{1}{\lambda_i}\left(\mathbf{e}_i'\mathbf{x}\right)^2 = \frac{y_1^2}{\lambda_1} + \frac{y_2^2}{\lambda_2} + \cdots + \frac{y_p^2}{\lambda_p}$$
where $y_i = \mathbf{e}_i'\mathbf{x}$ is the length of the projection of $\mathbf{x}$ in the direction of $\mathbf{e}_i$. So the ellipsoid has axes along the eigenvectors $\mathbf{e}_i$, with half-lengths $c\sqrt{\lambda_i}$.
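A short numpy sketch of this geometry, using an illustrative 2×2 covariance matrix (not from the lecture): the ellipse axes point along the $\mathbf{e}_i$ with half-lengths $c\sqrt{\lambda_i}$.

```python
import numpy as np

# Sketch of the geometry above for an illustrative 2x2 Sigma: the ellipse
# x' Sigma^{-1} x = c^2 has axes along e_i with half-lengths c * sqrt(lambda_i).
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
c = 1.0

lam, E = np.linalg.eigh(Sigma)
lam, E = lam[::-1], E[:, ::-1]

for i in range(2):
    print(f"axis {i + 1}: direction {np.round(E[:, i], 3)}, "
          f"half-length {c * np.sqrt(lam[i]):.3f}")
```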
Graphic Representation
Now suppose $X_1, X_2 \sim N_2(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.
The $Y_1$ axis is selected to maximize the variation in the scores, $\sum_{j=1}^n \left(Y_{1j} - \bar{Y}_1\right)^2$.
The $Y_2$ axis must be orthogonal to $Y_1$ and maximize the remaining variation in the scores, $\sum_{j=1}^n \left(Y_{2j} - \bar{Y}_2\right)^2$.
[Figure: bivariate normal contour with the rotated $(Y_1, Y_2)$ axes overlaid on the original $(X_1, X_2)$ axes]
Dimension Reduction
The proportion of total variance accounted for by the first $k$ components is
$$\frac{\sum_{i=1}^k \lambda_i}{\sum_{i=1}^p \lambda_i}$$
If the proportion of variance accounted for by the first $k$ principal components is large, we might want to restrict our attention to only these first $k$ components.
Keep in mind, components are simply linear combinations of the original $p$ measurements.
Ideally we look for meaningful interpretations of our chosen $k$ components.
PC’s from Standardized Variables
We may want to standardize our variables before finding PCs:
$$Z_1 = \frac{X_1 - \mu_1}{\sqrt{\sigma_{11}}}, \qquad Z_2 = \frac{X_2 - \mu_2}{\sqrt{\sigma_{22}}}, \qquad \ldots, \qquad Z_p = \frac{X_p - \mu_p}{\sqrt{\sigma_{pp}}}$$
In matrix form,
$$\mathbf{Z} = \left(\mathbf{V}^{1/2}\right)^{-1}\left(\mathbf{X} - \boldsymbol{\mu}\right), \qquad \mathbf{V}^{1/2} = \mathrm{diag}\left(\sqrt{\sigma_{11}}, \sqrt{\sigma_{22}}, \ldots, \sqrt{\sigma_{pp}}\right)$$
so that
$$\mathrm{Cov}(\mathbf{Z}) = \left(\mathbf{V}^{1/2}\right)^{-1}\boldsymbol{\Sigma}\left(\mathbf{V}^{1/2}\right)^{-1} = \boldsymbol{\rho}, \qquad \rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}}$$
PC’s from Standardized Variables
So the covariance matrix of $\mathbf{Z}$ equals the correlation matrix of $\mathbf{X}$. We can define our PC’s for $\mathbf{Z}$ the same way as before...
$$i^{th} \text{ PC:} \qquad Y_i = \mathbf{e}_i'\mathbf{Z} = \mathbf{e}_i'\left(\mathbf{V}^{1/2}\right)^{-1}\left(\mathbf{X} - \boldsymbol{\mu}\right)$$
but now $\lambda_i$ and $\mathbf{e}_i$ are the eigenvalues/eigenvectors of $\boldsymbol{\rho}$.
Because the $Z_i$ are standardized, $\mathrm{Var}(Z_i) = 1$ and
$$\sum_{i=1}^p \mathrm{Var}(Y_i) = \mathrm{trace}(\boldsymbol{\rho}) = p$$
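A brief sketch of the standardization step, with an illustrative $\boldsymbol{\Sigma}$ (not from the lecture): it forms $(\mathbf{V}^{1/2})^{-1}\boldsymbol{\Sigma}(\mathbf{V}^{1/2})^{-1}$ and confirms the PC variances sum to $p$.

```python
import numpy as np

# Sketch: the covariance matrix of the standardized vector Z equals the
# correlation matrix rho of X. Sigma here is illustrative, not from the lecture.
Sigma = np.array([[4.0, 2.0],
                  [2.0, 9.0]])

V_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(Sigma)))   # (V^{1/2})^{-1}
rho = V_inv_sqrt @ Sigma @ V_inv_sqrt                 # Cov(Z) = rho

lam, E = np.linalg.eigh(rho)                          # eigen pairs of rho
print(np.round(lam[::-1], 3))                         # PC variances (descending)
print(lam.sum())                                      # trace(rho) = p = 2
```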
Compare Standardized/Non-standardized PCs

Non-standardized:
$$\boldsymbol{\Sigma} = \begin{pmatrix} 1 & 4 \\ 4 & 100 \end{pmatrix}, \qquad \lambda_1 = 100.16,\; \mathbf{e}_1' = (0.04, 0.999), \qquad \lambda_2 = 0.84,\; \mathbf{e}_2' = (0.999, -0.04)$$

Standardized:
$$\boldsymbol{\rho} = \begin{pmatrix} 1 & 0.4 \\ 0.4 & 1 \end{pmatrix}, \qquad \lambda_1 = 1.4,\; \mathbf{e}_1' = (0.707, 0.707), \qquad \lambda_2 = 0.6,\; \mathbf{e}_2' = (0.707, -0.707)$$

Proportion of variance explained by the first PC:
$$\text{non-standardized: } \frac{100.16}{101} = 0.992, \qquad \text{standardized: } \frac{1.4}{2} = 0.70$$
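These numbers can be checked directly; the sketch below reproduces both decompositions (eigenvector signs are arbitrary, so they may print flipped):

```python
import numpy as np

# Check the comparison above: PCs of Sigma versus PCs of rho.
Sigma = np.array([[1.0, 4.0],
                  [4.0, 100.0]])
rho = np.array([[1.0, 0.4],
                [0.4, 1.0]])

for name, M in [("non-standardized", Sigma), ("standardized", rho)]:
    lam, E = np.linalg.eigh(M)
    lam, E = lam[::-1], E[:, ::-1]
    # eigenvector signs are arbitrary, so e_1 may print with flipped signs
    print(name, np.round(lam, 2), np.round(E[:, 0], 3),
          "PC1 proportion:", round(lam[0] / lam.sum(), 3))
# -> (100.16, 0.84) with PC1 explaining 0.992, versus (1.4, 0.6) with 0.70
```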
Estimation
In general we do not know what $\boldsymbol{\Sigma}$ is; we must estimate it from the sample. So what are our estimated principal components?
Assume we have a random sample $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$. We can use
$$\mathbf{S} = \frac{1}{n-1}\sum_{j=1}^n \left(\mathbf{X}_j - \bar{\mathbf{X}}\right)\left(\mathbf{X}_j - \bar{\mathbf{X}}\right)'$$
Eigenvalues of $\mathbf{S}$: $\hat{\lambda}_1 \geq \hat{\lambda}_2 \geq \cdots \geq \hat{\lambda}_p$ (consistent estimators of $\lambda_1, \lambda_2, \ldots, \lambda_p$)
Eigenvectors of $\mathbf{S}$: $\hat{\mathbf{e}}_1, \hat{\mathbf{e}}_2, \ldots, \hat{\mathbf{e}}_p$ (consistent estimators of $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_p$)
The $i^{th}$ principal component is then
$$\hat{y}_i = \hat{\mathbf{e}}_i'\mathbf{x}$$
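A sketch of the sample version, using simulated data as a stand-in for a real sample:

```python
import numpy as np

# Sketch of the sample version: S, its eigen pairs, and the estimated PCs.
# The data here are simulated as a stand-in for a real sample.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0.0, 0.0, 0.0],
                            cov=[[4.0, 2.0, 0.5],
                                 [2.0, 3.0, 1.0],
                                 [0.5, 1.0, 2.0]],
                            size=200)                  # n x p matrix

S = np.cov(X, rowvar=False)            # (1/(n-1)) sum_j (x_j - xbar)(x_j - xbar)'
lam_hat, E_hat = np.linalg.eigh(S)
lam_hat, E_hat = lam_hat[::-1], E_hat[:, ::-1]

y_hat = X @ E_hat                      # estimated PC scores y_i = e_hat_i' x
print(np.round(lam_hat, 3))            # consistent estimates of the lambda_i
```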
Sample Properties
Since we estimate $\boldsymbol{\Sigma}$ from the sample, what are the properties of our estimated principal components?
1. Estimated variance of $\hat{y}_i$:
$$\widehat{\mathrm{Var}}(\hat{y}_i) = \frac{1}{n-1}\sum_{j=1}^n \left(\hat{y}_{ij} - \bar{\hat{y}}_i\right)^2 = \hat{\lambda}_i$$
2. Sample covariance for $\hat{y}_i, \hat{y}_k$:
$$\widehat{\mathrm{Cov}}(\hat{y}_i, \hat{y}_k) = 0, \qquad i \neq k$$
3. Proportion of total variance accounted for by $\hat{y}_k$:
$$\frac{\hat{\lambda}_k}{\hat{\lambda}_1 + \hat{\lambda}_2 + \cdots + \hat{\lambda}_p}$$
4. Estimated correlation between $\hat{y}_i$ and $x_k$:
$$r_{\hat{y}_i, x_k} = \frac{\hat{e}_{ik}\sqrt{\hat{\lambda}_i}}{\sqrt{s_{kk}}}$$
Centering
We often center our observations before defining our PCs. The centered PCs are found according to:
$$\hat{y}_{ij} = \hat{\mathbf{e}}_i'\left(\mathbf{x}_j - \bar{\mathbf{x}}\right), \qquad i = 1, 2, \ldots, p, \qquad j = 1, 2, \ldots, n$$
The mean of the $i^{th}$ centered PC is then
$$\bar{\hat{y}}_i = \frac{1}{n}\sum_{j=1}^n \hat{\mathbf{e}}_i'\left(\mathbf{x}_j - \bar{\mathbf{x}}\right) = \hat{\mathbf{e}}_i'\left(\frac{1}{n}\sum_{j=1}^n \left(\mathbf{x}_j - \bar{\mathbf{x}}\right)\right) = \hat{\mathbf{e}}_i'\mathbf{0} = 0$$
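A quick numerical check of this identity, with hypothetical data:

```python
import numpy as np

# Quick check that centered PC scores have mean zero (hypothetical data).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))                    # 50 observations on p = 3 traits
Xc = X - X.mean(axis=0)                         # x_j - xbar

_, E_hat = np.linalg.eigh(np.cov(X, rowvar=False))
scores = Xc @ E_hat                             # columns: e_i'(x_j - xbar)
print(np.round(scores.mean(axis=0), 12))        # ~ [0, 0, 0]
```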
Example
Jolicoeur and Mosimann (1960) conducted a study looking at the relationship between the size and shape of painted turtle carapaces. We can develop PC’s for the natural log of the length, width, and height of the female turtles’ carapaces:
$$\mathbf{x}_j = \begin{pmatrix} x_{1j} \\ x_{2j} \\ x_{3j} \end{pmatrix} = \begin{pmatrix} \text{log carapace length} \\ \text{log carapace width} \\ \text{log carapace height} \end{pmatrix}, \qquad \mathbf{S} = \begin{pmatrix} .0264 & .0201 & .0249 \\ .0201 & .0162 & .0194 \\ .0249 & .0194 & .0249 \end{pmatrix}$$
$$\hat{\boldsymbol{\lambda}} = (.06623, .00077, .00054)$$
$$\hat{\mathbf{e}}_1 = \begin{pmatrix} .627 \\ .488 \\ .608 \end{pmatrix}, \qquad \hat{\mathbf{e}}_2 = \begin{pmatrix} .553 \\ .272 \\ -.788 \end{pmatrix}, \qquad \hat{\mathbf{e}}_3 = \begin{pmatrix} -.550 \\ .830 \\ -.099 \end{pmatrix}$$
Example
The first PC is:
$$\hat{y}_1 = \hat{\mathbf{e}}_1'\mathbf{x} = 0.627\log(\text{length}) + 0.488\log(\text{width}) + 0.608\log(\text{height})$$
This might be interpreted as an overall size component: small shell dimensions give small values of $\hat{y}_1$, and large shell dimensions give large values of $\hat{y}_1$.
Example
The second PC is:
$$\hat{y}_2 = \hat{\mathbf{e}}_2'\mathbf{x} = 0.553\log(\text{length}) + 0.272\log(\text{width}) - 0.788\log(\text{height})$$
This emphasizes the contrast between the length and height of the shell.
Example
The third PC is:
$$\hat{y}_3 = \hat{\mathbf{e}}_3'\mathbf{x} = -0.550\log(\text{length}) + 0.830\log(\text{width}) - 0.099\log(\text{height})$$
This emphasizes the contrast between the width and length of the shell.
Example
Consider the proportion of variability accounted for by each PC. With $\hat{\boldsymbol{\lambda}} = (.06623, .00077, .00054)$ and $\sum_i \hat{\lambda}_i = .06754$, the first PC accounts for $.06623/.06754 \approx 98.1\%$ of the total variance, the second $\approx 1.1\%$, and the third $\approx 0.8\%$.
Example
How are the PCs correlated with each of the x’s? We use
$$r_{\hat{y}_i, x_j} = \frac{\hat{e}_{ij}\sqrt{\hat{\lambda}_i}}{\sqrt{s_{jj}}}$$
Then:

Trait   $\hat{y}_1$   $\hat{y}_2$   $\hat{y}_3$
$x_1$   0.99          0.09          -0.08
$x_2$   0.99          0.06          0.15
$x_3$   0.99          -0.14         -0.01

For example,
$$r_{\hat{y}_1, x_1} = \frac{\hat{e}_{11}\sqrt{\hat{\lambda}_1}}{\sqrt{s_{11}}} = \frac{0.627\sqrt{0.06623}}{\sqrt{0.0264}} = 0.99$$
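The eigenvalues, variance proportions, and this correlation table can all be reproduced from $\mathbf{S}$; a sketch (the printed eigenvector signs are arbitrary, so columns may come out flipped):

```python
import numpy as np

# Reproduce the turtle example from S: eigenvalues, variance proportions,
# and the PC-trait correlations r = e_ij * sqrt(lambda_i) / sqrt(s_jj).
S = np.array([[0.0264, 0.0201, 0.0249],
              [0.0201, 0.0162, 0.0194],
              [0.0249, 0.0194, 0.0249]])

lam, E = np.linalg.eigh(S)
lam, E = lam[::-1], E[:, ::-1]
print(np.round(lam, 5))                  # ~ (0.06623, 0.00077, 0.00054)
print(np.round(lam / lam.sum(), 3))      # ~ (0.981, 0.011, 0.008)

# Column i holds the correlations of y_hat_i with x_1, x_2, x_3
# (printed eigenvector signs are arbitrary, so columns may be flipped).
r = E * np.sqrt(lam) / np.sqrt(np.diag(S))[:, None]
print(np.round(r, 2))
```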
Interpretation of PCs
Consider data $x_1, x_2, \ldots, x_p$:
-PCs are actually projections onto the estimated eigenvectors
-The 1st PC is the one with the largest projection
-For data reduction, only use PCA if the eigenvalues vary
-If the x’s are uncorrelated, we can’t really do data reduction

Let $\hat{y}_i = \hat{\mathbf{e}}_i'\left(\mathbf{x} - \bar{\mathbf{x}}\right)$ and consider the contour
$$\left(\mathbf{x} - \bar{\mathbf{x}}\right)'\mathbf{S}^{-1}\left(\mathbf{x} - \bar{\mathbf{x}}\right) = c^2 = \sum_{i=1}^p \frac{1}{\hat{\lambda}_i}\left[\hat{\mathbf{e}}_i'\left(\mathbf{x} - \bar{\mathbf{x}}\right)\right]^2$$
This contour mimics the density of $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, and $\hat{y}_i = \hat{\mathbf{e}}_i'\left(\mathbf{x} - \bar{\mathbf{x}}\right)$ is the length of the projection of $\mathbf{x} - \bar{\mathbf{x}}$ in the direction of $\hat{\mathbf{e}}_i$.
Choosing Number of PCs
Often the goal of PCA is dimension reduction of the data: select a limited number of PCs that capture the majority of the variability in the data.
How do we decide how many PCs to include?
1. Scree plot: plot of $\hat{\lambda}_i$ versus $i$ (see the sketch under “Scree Plots” below)
2. Select all PCs with $\hat{\lambda}_i > 1$ (for standardized observations)
3. Choose some proportion of the variance you want to account for
Scree Plots
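A minimal matplotlib sketch of the kind of figure this slide showed, with made-up eigenvalues standing in for the $\hat{\lambda}_i$:

```python
import numpy as np
import matplotlib.pyplot as plt

# A minimal scree plot: lambda_hat_i versus i. The eigenvalues below are
# made up; in practice they come from np.linalg.eigh on S (or on rho).
lam_hat = np.array([4.2, 1.3, 0.9, 0.4, 0.2])
i = np.arange(1, len(lam_hat) + 1)

plt.plot(i, lam_hat, "o-")
plt.axhline(1.0, linestyle="--")   # the lambda_hat > 1 rule (standardized data)
plt.xlabel("component i")
plt.ylabel("eigenvalue")
plt.title("Scree plot")
plt.show()
```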
Choosing Number of PCs
Should principal components that only account for a small proportion of the variance always be ignored?
Not necessarily: they may indicate near-perfect collinearities among traits.
In the turtle example this is true; very little of the variation in the shell measurements can be attributed to the 2nd and 3rd components.
Large Sample Properties
If $n$ is large, there are nice properties we can use. With
$$\mathbf{S} = \frac{1}{n-1}\sum_{j=1}^n \left(\mathbf{x}_j - \bar{\mathbf{x}}\right)\left(\mathbf{x}_j - \bar{\mathbf{x}}\right)', \qquad \hat{\boldsymbol{\lambda}} = \left(\hat{\lambda}_1, \ldots, \hat{\lambda}_p\right)'$$
For large $n$:
$$\sqrt{n}\left(\hat{\boldsymbol{\lambda}} - \boldsymbol{\lambda}\right) \xrightarrow{D} N_p\left(\mathbf{0}, 2\boldsymbol{\Lambda}^2\right), \qquad \boldsymbol{\Lambda} = \mathrm{diag}\left(\lambda_1, \ldots, \lambda_p\right)$$
where:
a. The estimated eigenvalues for $\boldsymbol{\Sigma}$ are asymptotically independent
b. The distribution of $\hat{\lambda}_i$ is approximately $N\left(\lambda_i, \frac{2\lambda_i^2}{n}\right)$
c. An approximate CI for $\lambda_i$ is:
$$\frac{\hat{\lambda}_i}{1 + z_{\alpha/2}\sqrt{2/n}} \leq \lambda_i \leq \frac{\hat{\lambda}_i}{1 - z_{\alpha/2}\sqrt{2/n}}$$
d. An alternative approximation: $\ln\left(\hat{\lambda}_i\right) \sim N\left(\ln(\lambda_i), \frac{2}{n}\right)$
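A sketch of the interval in (c) and the log-scale alternative in (d), using the turtle example’s $\hat{\lambda}_1$ and an assumed sample size (the lecture does not state $n$; 24 here is illustrative):

```python
import numpy as np
from scipy.stats import norm

# Sketch of the approximate CI in (c) and the log-scale version in (d).
# lam_hat is the turtle example's first eigenvalue; n = 24 is an assumed
# sample size (the lecture does not state n).
lam_hat, n, alpha = 0.06623, 24, 0.05
z = norm.ppf(1 - alpha / 2)

lo = lam_hat / (1 + z * np.sqrt(2 / n))        # lower limit
hi = lam_hat / (1 - z * np.sqrt(2 / n))        # upper limit
print(round(lo, 4), round(hi, 4))

# Alternative via ln(lam_hat) ~ N(ln(lambda), 2/n):
lo2, hi2 = np.exp(np.log(lam_hat) + np.array([-z, z]) * np.sqrt(2 / n))
print(round(lo2, 4), round(hi2, 4))
```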
Large Sample Properties
There are also results for our estimated eigenvectors:
1. For large $n$:
$$\sqrt{n}\left(\hat{\mathbf{e}}_i - \mathbf{e}_i\right) \xrightarrow{D} N_p\left(\mathbf{0}, \mathbf{E}_i\right), \qquad \text{where } \mathbf{E}_i = \lambda_i \sum_{k \neq i} \frac{\lambda_k}{\left(\lambda_k - \lambda_i\right)^2}\mathbf{e}_k\mathbf{e}_k'$$
2. For large $n$, $\hat{\lambda}_i$ is approximately independent of the distribution of $\hat{\mathbf{e}}_i$
These results assume that $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ are $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.
Summary
Principal component analysis is most useful for dimensionality reduction.
It can also be used for identifying collinear variables.
Note that use of PCA in a regression setting is therefore one way to handle multicollinearity.
A caveat: principal components can be difficult to interpret and should therefore be used with caution.