principal component analysis - nus computingcs5240/lecture/pca.pdf · 2019-09-05 · principal...
TRANSCRIPT
Principal Component Analysis
CS5240 Theoretical Foundations in Multimedia
Leow Wee Kheng
Department of Computer Science
School of Computing
National University of Singapore
Leow Wee Kheng (NUS) Principal Component Analysis 1 / 54
Motivation
Motivation
How wide is the widest part of NGC 1300?
Leow Wee Kheng (NUS) Principal Component Analysis 2 / 54
Motivation
How thick is the thickest part of NGC 4594?
Use principal component analysis.
Leow Wee Kheng (NUS) Principal Component Analysis 3 / 54
Maximum Variance Estimate
Maximum Variance Estimate
Consider a set of points xi, i = 1, . . . , n, in an m-dimensional spacesuch that their mean µ = 0, i.e., centroid is at the origin.
x i
di
yi
x1
x2
x4
x1
x2
x3
O
q
l
Want to find a line l through the origin that maximizes the projectionsyi of the points xi on l.
Leow Wee Kheng (NUS) Principal Component Analysis 4 / 54
Maximum Variance Estimate
Let q denote the unit vector along line l.Then, the projection yi of xi on l is
yi = x⊤
i q. (1)
The mean squared projection, which is the variance, V over all points is
V =1
n
n∑
i=1
y2i =1
n
n∑
i=1
(x⊤
i q)2 ≥ 0. (2)
Because x⊤
i q = q⊤xi, expanding Eq. 2 gives
V =1
n
n∑
i=1
(q⊤xi)(x⊤
i q) = q⊤
[1
n
n∑
i=1
xix⊤
i
]q. (3)
The middle factor is the covariance matrix C of the data points
C =1
n
n∑
i=1
xix⊤
i . (4)
Leow Wee Kheng (NUS) Principal Component Analysis 5 / 54
Maximum Variance Estimate
We want to find a unit vector q that maximizes the variance V , i.e.,
maximize V = q⊤Cq subject to ‖q‖ = 1. (5)
This is a constrained optimization problem.
Use Lagrange multiplier method:combine V and the constraint using Lagrange multiplier λ
maximize V ′ = q⊤Cq− λ(q⊤q− 1). (6)
Leow Wee Kheng (NUS) Principal Component Analysis 6 / 54
Maximum Variance Estimate Lagrange Multiplier Method
Lagrange Multiplier Method
Lagrange multiplier is a method for solving constrained optimization.
Consider this problem:
maximize f(x) subject to g(x) = c. (7)
Lagrange multiplier method introduces a Lagrange multiplier λ tocombine f(x) and g(x):
L(x, λ) = f(x) + λ (g(x)− c). (8)
The sign of λ can be positive or negative.
Then, solve for the stationary point of L(x, λ):
∂L(x, λ)
∂x= 0. (9)
Leow Wee Kheng (NUS) Principal Component Analysis 7 / 54
Maximum Variance Estimate Lagrange Multiplier Method
◮ If x0 is a solution of the original problem, then there is a λ0 suchthat (x0, λ0) is a stationary point of L.
◮ Not all stationary points yield solutions of the original problem.
◮ The same method applies to minimization problem.
◮ Multiple constraints can be combined by adding multiple terms.
minimize f(x) subject to g1(x) = c1, g2(x) = c2 (10)
is solved with
L(x) = f(x) + λ1(g1(x)− c1) + λ2(g2(x)− c2). (11)
Leow Wee Kheng (NUS) Principal Component Analysis 8 / 54
Maximum Variance Estimate
Now, we differentiate V ′ with respect to q and set to 0:
∂V ′
∂q= 2q⊤C− 2λq⊤ = 0. (12)
Rearranging the terms gives
q⊤C = λq⊤. (13)
Since the covariance matrix C is symmetric (homework), C⊤ = C.So, transposing both sides of Eq. 13 gives
Cq = λq. (14)
◮ Eq. 14 is called an eigenvector equation.
◮ q is the eigenvector.
◮ λ is the eigenvalue.
Thus, the eigenvector q of C gives the line that maximizes variance V .
Leow Wee Kheng (NUS) Principal Component Analysis 9 / 54
Maximum Variance Estimate
The perpendicular distance di of xi from the line l is
di = ‖xi − yiq‖ = ‖xi − (x⊤
i q)q‖. (15)
x i
di
yi
x1
x2
x4
x1
x2
x3
O
q
l
Leow Wee Kheng (NUS) Principal Component Analysis 10 / 54
Maximum Variance Estimate
The squared distance is
d2i = ‖xi − (x⊤
i q)q‖2 = (xi − (x⊤
i q)q)⊤(xi − (x⊤
i q)q)
= x⊤
i xi − x⊤
i (x⊤
i q)q− (x⊤
i q)q⊤xi + (x⊤
i q)q⊤(x⊤
i q)q.
With x⊤
i q being a scalar and x⊤
i q = q⊤xi, we obtain
d2i = x⊤
i xi − (x⊤
i q)2 − (x⊤
i q)2 + (x⊤
i q)2(q⊤q)
= x⊤
i xi − (x⊤
i q)2.
Averaging over all i gives
D =1
n
n∑
i=1
d2i =1
n
n∑
i=1
x⊤
i xi − V. (16)
So, maximizing V means minimizing D.
The eigenvalue λ = V (homework); so λ ≥ 0.
Leow Wee Kheng (NUS) Principal Component Analysis 11 / 54
Maximum Variance Estimate
Name q as the first eigenvector q1.The component of xi orthogonal to q1 is x′
i = xi − x⊤
i q1q1.
Repeat previous method on x′
i in an (m− 1)-D subspace orthogonal toq1 to get q2. Then, repeat to get q3, . . . ,qm.
x1
x2
x3
orthogonalsubspace
q1
O
Leow Wee Kheng (NUS) Principal Component Analysis 12 / 54
Eigendecomposition
Eigendecomposition
In general, a m×m covariance matrix C has m eigenvectors:
Cqj = λj qj , j = 1, . . . ,m. (17)
Transpose both sides of the equations to get
q⊤
j C = λj q⊤
j , j = 1, . . . ,m, (18)
which are row matrices. Stack the row matrices into a column to get
q⊤1...
q⊤m
C =
λ1 0 0
0. . . 0
0 0 λm
q⊤1...
q⊤m
(19)
Denote matrix Q and Λ as
Q = [q1 · · · qm], Λ = diag(λ1, . . . , λm). (20)
Leow Wee Kheng (NUS) Principal Component Analysis 13 / 54
Eigendecomposition
Then, Eq. 19 becomesQ⊤C = ΛQ⊤ (21)
orC = QΛQ⊤. (22)
This is the matrix equation for eigendecomposition.
Properties:
◮ The eigenvectors are orthonormal:
q⊤
j qj = 1,
q⊤
j qk = 0, for k 6= j.(23)
So, the eigenmatrix Q is orthogonal:
Q⊤Q = QQ⊤ = I. (24)
◮ The eigenvalues are arranged to be sorted λj ≥ λj+1.
Leow Wee Kheng (NUS) Principal Component Analysis 14 / 54
General PCA
General PCA
In general, the mean µ of the data points xi is not at the origin.In this case, we subtract µ from each xi,obtaining shifted or zero-mean data points xi − µ.
PCA transforms xi into a new vector yi through Q as follows:
yi = Q⊤(xi − µ) =
q⊤1 (xi − µ)
...q⊤m(xi − µ)
=
m∑
j=1
(xi − µ)⊤qj qj , (25)
Each component of yi is
yij = (xi − µ)⊤qj . (26)
This is the projection of xi − µ on qj .
Leow Wee Kheng (NUS) Principal Component Analysis 15 / 54
General PCA
The original xi can be recovered from yi:
xi = Qyi + µ. (27)
Caution!
◮ xi 6= yi + µ.
◮ xi is in the original input space but yi is in the eigenspace.Let xj denote the unit vectors that form the input space.qj are the eigenvectors that form the eigenspace. Then,
xi = (xi1, xi2, . . . , xim) =m∑
j=1
xijxj ,
yi = (yi1, yi2, . . . , yim) =
m∑
j=1
yijqj .
Leow Wee Kheng (NUS) Principal Component Analysis 16 / 54
General PCA
Leow Wee Kheng (NUS) Principal Component Analysis 17 / 54
General PCA Properties of PCA
Properties of PCA
◮ Mean µy over all yi is 0 (homework).
◮ Variance σ2j along qj is λj (homework).
◮ Since λ1 ≥ · · · ≥ λm ≥ 0, so σ1 ≥ · · · ≥ σm ≥ 0.
◮ q1 gives orientation of the largest variance.
◮ q2 gives orientation of largest variance orthogonal to q1
(2nd largest variance).
◮ qj gives orientation of largest variance orthogonal to q1, . . . ,qj−1
(j-th largest variance).
◮ qm is orthogonal to all other eigenvectors (least variance).
Leow Wee Kheng (NUS) Principal Component Analysis 18 / 54
General PCA Properties of PCA
Data Points
Leow Wee Kheng (NUS) Principal Component Analysis 19 / 54
General PCA Properties of PCA
Centroid at Origin
Leow Wee Kheng (NUS) Principal Component Analysis 20 / 54
General PCA Properties of PCA
Principal Components
Leow Wee Kheng (NUS) Principal Component Analysis 21 / 54
General PCA Properties of PCA
Another Example
Leow Wee Kheng (NUS) Principal Component Analysis 22 / 54
General PCA Properties of PCA
Centroid at Origin
Leow Wee Kheng (NUS) Principal Component Analysis 23 / 54
General PCA Properties of PCA
Principal Components
Leow Wee Kheng (NUS) Principal Component Analysis 24 / 54
General PCA Properties of PCA
Special Cases: Points on a rectangle.
−3 −2 −1 0 1 2 3−2
−1
0
1
2
−3 −2 −1 0 1 2 3−2
−1
0
1
2
−3 −2 −1 0 1 2 3−2
−1
0
1
2
−3 −2 −1 0 1 2 3−2
−1
0
1
2
Blue line: direction of 1st principal component.
To verify the results, run the program in Programming Homework 3 on these data.
Leow Wee Kheng (NUS) Principal Component Analysis 25 / 54
General PCA Properties of PCA
Special Cases: Points on a rectangle.
−3 −2 −1 0 1 2 3−2
−1
0
1
2
−3 −2 −1 0 1 2 3−2
−1
0
1
2
−2 −1 0 1 2−2
−1
0
1
2
−2 −1 0 1 2−2
−1
0
1
2
Blue line: direction of 1st principal component.
Leow Wee Kheng (NUS) Principal Component Analysis 26 / 54
General PCA Properties of PCA
Special Cases: Points on a square / diamond.
−1 0 1
−1
0
1
−1 0 1
−1
0
1
Blue line: direction of 1st principal component.
This is a surprise! But, it is easy to see:The 4 points are equi-distant from the origin.So, they actually lie on a sphere!
A sphere has an infinite number of possible first principal componentsalong any diameter.
Leow Wee Kheng (NUS) Principal Component Analysis 27 / 54
General PCA Properties of PCA
Special Cases: Points on a diamond.
−1 0 1
−1
0
1
What about this case? Try it out yourself!
Leow Wee Kheng (NUS) Principal Component Analysis 28 / 54
PCA Algorithm 1
PCA Algorithm 1
Let xi denote m-dimensional vectors (data points), i = 1, ..., n.
1. Compute the mean vector of the data points
µ =1
n
n∑
i=1
xi. (28)
2. Compute the covariance matrix
C =1
n
n∑
i=1
(xi − µ)(xi − µ)⊤. (29)
3. Perform eigendecomposition of C
C = QΛQ⊤. (30)
Leow Wee Kheng (NUS) Principal Component Analysis 29 / 54
PCA Algorithm 1
◮ Some books and papers use sample covariance
C =1
n− 1
n∑
i=1
(xi − µ)(xi − µ)⊤. (31)
Eq. 31 differs from Eq. 29 by only a constant.
Leow Wee Kheng (NUS) Principal Component Analysis 30 / 54
PCA Algorithm 1
Notes:
◮ Covariance matrix C is a m×m matrix.
◮ Eigendecomposition of very large matrix is inefficient.
◮ Example:◮ A 256×256 colour image has 256×256× 3 values.◮ Number of dimensions m = 196680.◮ C has a size of 196608×196608!!◮ Number of images n is usually ≪ m, e.g., 1000.
Leow Wee Kheng (NUS) Principal Component Analysis 31 / 54
PCA Algorithm 2
PCA Algorithm 2
1. Compute mean µ of data points xi.
µ =1
n
n∑
i=1
xi.
2. Form a m×n matrix A, n≪ m:
A = [(x1 − µ) · · · (xn − µ)]. (32)
3. Compute A⊤A, which is just a n×n matrix.
4. Apply eigendecomposition on A⊤A:
(A⊤A)qj = λjqj . (33)
5. Pre-multiply A to Eq. 33 giving
AA⊤Aqj = Aλjqj . (34)
Leow Wee Kheng (NUS) Principal Component Analysis 32 / 54
PCA Algorithm 2
6. Recover eigenvectors and eigenvalues.
Since AA⊤ = nC (homework), Eq. 34 is
nC (Aqj) = λj(Aqj) (35)
C (Aqj) =λj
n(Aqj). (36)
Therefore, eigenvectors of C are Aqj , andeigenvalues of C are λj/n.
Leow Wee Kheng (NUS) Principal Component Analysis 33 / 54
PCA Algorithm 3
PCA Algorithm 3
1. Compute mean µ of data points xi.
µ =1
n
n∑
i=1
xi.
2. Form a m×n matrix A:
A = [(x1 − µ) · · · (xn − µ)].
3. Apply singular value decomposition (SVD) on A:
A = UΣV⊤.
4. Recover eigenvectors and eigenvalues (page 31).
Leow Wee Kheng (NUS) Principal Component Analysis 34 / 54
PCA Algorithm 3 Singular Value Decomposition
Singular Value Decomposition
Singular value decomposition (SVD) decomposes a matrix A into
A = UΣV⊤. (37)
◮ Column vectors of U are left singular vectors uj , anduj are orthonormal.
◮ Column vectors of V are right singular vectors vj , andvj are orthonormal.
◮ Σ is diagonal and contains singular values sj .
◮ Rank of A = number of non-zero singular values.
Refer to Appendix for more details.
Leow Wee Kheng (NUS) Principal Component Analysis 35 / 54
PCA Algorithm 3 Singular Value Decomposition
Notice that
AA⊤ = UΣV⊤
(UΣV⊤
)⊤
= UΣV⊤VΣU⊤ = UΣ2U⊤
and eigendecomposition of AA⊤ is
AA⊤ = QΛQ⊤.
So, eigenvector of AA⊤ = uj , eigenvalue of AA⊤ = s2j .
On the other hand,
A⊤A =(UΣV⊤
)⊤
UΣV⊤ = VΣU⊤UΣV⊤ = VΣ2V⊤.
Compare with the eigendecomposition of A⊤A:
A⊤A = QΛQ⊤.
So, eigenvector of A⊤A = vj , eigenvalue of A⊤A = s2j .
Leow Wee Kheng (NUS) Principal Component Analysis 36 / 54
PCA Algorithm 3
4. Recover eigenvectors and eigenvalues.
Compare AA⊤ = nC with eigendecomposition of C:
AA⊤ = UΣ2U⊤,
C = QΛQ⊤.
Therefore, eigenvectors qj of C = uj in U, and
eigenvalues λj of C = s2j/n, for sj in Σ.
Leow Wee Kheng (NUS) Principal Component Analysis 37 / 54
Application Examples
Application Examples
Apply PCA for line fitting in 2-D.
Compute PCA of the points. Then,
◮ mean of points = a point on line
◮ 1st eigenvector = unit vector along line
xi1x1
x2
xi2 ip
f
Leow Wee Kheng (NUS) Principal Component Analysis 38 / 54
Application Examples
Apply PCA for plane fitting in 3-D.
Compute PCA of the points. Then,
◮ mean of points = a point on plane
◮ 3rd eigenvector = unit normal vector of plane
Leow Wee Kheng (NUS) Principal Component Analysis 39 / 54
Dimensionality Reduction
Dimensionality Reduction
If a point represents a 256×256 colour image,then the dimensionality is 256×256×3 = 196680!!
But, not all 196680 eigenvalues are non-zero.
Leow Wee Kheng (NUS) Principal Component Analysis 40 / 54
Dimensionality Reduction
Case 1: Number of data vectors n ≤ m number of dimensions.
◮ Data vectors are all independent.Then, number of non-zero eigenvalues (eigenvectors) = n− 1,i.e., rank of covariance matrix = n− 1.Why not = n?
◮ Data vectors are not independent.Then, rank of covariance matrix < n− 1.
Case 2: Number of data vectors n > m.
◮ m or more independent data vectors.Then, rank of covariance matrix = m.
◮ Fewer than m data vectors are independent.In this case, what is the rank of covariance matrix?
In practice, it is often possible to reduce the dimensionality.
Leow Wee Kheng (NUS) Principal Component Analysis 41 / 54
Dimensionality Reduction
Eigenmatrix Q isQ = [q1 · · · qm] . (38)
PCA maps a data point x to a vector y in the eigenspace as
y = Q⊤(x− µ) =m∑
j=1
(x− µ)⊤qjqj , (39)
which has m dimensions.
Pick l eigenvectors with the largest eigenvalues to form truncated Q:
Q = [q1 · · · ql], (40)
which spans a subspace of the eigenspace.
Then, Q maps x to y, an estimate of y:
y = Q⊤(x− µ) =l∑
j=1
(x− µ)⊤qjqj , (41)
which has l < m dimensions.
Leow Wee Kheng (NUS) Principal Component Analysis 42 / 54
Dimensionality Reduction
x y y x
x1 y1 y1 x1
...Q⊤
−→...
...Q⊤
←−...
...Q←− yl yl
Q−→/
...
... yl+1
...
......
...
xm ym xm
dimensionalityreduction
Difference between y and y is
y − y =m∑
j=l+1
(x− µ)⊤qjqj . (42)
Leow Wee Kheng (NUS) Principal Component Analysis 43 / 54
Dimensionality Reduction
Beware:y = Q⊤(x− µ)
and, therefore,x = Qy + µ.
But,y = Q⊤(x− µ),
whereasx = Q y + µ 6= x. (43)
Why?
With n data points xi, sum-squared error E between xi and itsestimate xi is
E =n∑
i=1
‖xi − xi‖2. (44)
Leow Wee Kheng (NUS) Principal Component Analysis 44 / 54
Dimensionality Reduction
When all m dimensions are kept, the total variance is
m∑
j=1
σ2j =
m∑
j=1
λj . (45)
When l dimensions are used, the ratio R′ of unaccounted variance is:
R′(l) =
m∑
j=l+1
σ2j
m∑
j=1
σ2j
=
m∑
j=l+1
λj
m∑
j=1
λj
. (46)
Leow Wee Kheng (NUS) Principal Component Analysis 45 / 54
Dimensionality Reduction
A sample plot of R′ vs. number of dimensions used:
notenough
number ofdimensionsml
enough
1
0
R’
How to choose appropriate l?Choose l such that larger than l doesn’t reduce R′ significantly:
R′(l)−R′(l + 1) < ǫ. (47)
Leow Wee Kheng (NUS) Principal Component Analysis 46 / 54
Dimensionality Reduction
Alternatively, compute ratio R of accounted variance:
R(l) =
l∑
j=1
σ2j
m∑
j=1
σ2j
=
l∑
j=1
λj
m∑
j=1
λj
. (48)
l is large enough ifR(l + 1)−R(l) < ǫ.
number ofdimensions
notenough
ml
1
0
R
enough
Leow Wee Kheng (NUS) Principal Component Analysis 47 / 54
Summary
Summary
◮ Eigendecomposition of covariance matrix = PCA.
◮ In practice, PCA is computed using SVD.
◮ PCA maximizes variances along the eigenvectors.
◮ Eigenvalues = variances along eigenvectors.
◮ Best fitting plane computed by PCA minimizes distance to points.
◮ PCA can be used for dimensionality reduction.
Leow Wee Kheng (NUS) Principal Component Analysis 48 / 54
Probing Questions
Probing Questions
◮ When applying PCA to plane fitting, how to check whether thepoints really lie close to the plane?
◮ If you apply PCA to a set of points on a curve surface, where doyou expect the eigenvectors to point at?
◮ In dimensionality reduction, the m-D vector y is reduced to a l-Dvector y. Since l 6= m, vector subtraction is undefined. But,subtraction of Eq. 39 and 41 gives
y − y =m∑
j=l+1
(x− µ)⊤qjqj .
How is this possible? Is there a contradiction?
◮ Show that when n ≤ m, there are at most n− 1 eigenvectors.
Leow Wee Kheng (NUS) Principal Component Analysis 49 / 54
Homework
Homework I
1. Describe the essence of PCA in one sentence.
2. Show that the covariance matrix C of a set of data points xi,i = 1, . . . , n, is symmetric about its diagonal.
3. Show that the eigenvalue λ of covariance matrix C equals thevariance V as defined in Eq. 2.
4. Show that the mean µy over all yi in the eigenspace is 0.
5. Show that the variance σ2j along eigenvector qj is λj .
6. Show that AA⊤ = nC.
7. Derive the difference Qy − Q y in dimensionality reduction.
Leow Wee Kheng (NUS) Principal Component Analysis 50 / 54
Homework
Homework II
8. Suppose n ≤ m, and the truncated eigenmatrix Q contains all theeigenvectors with non-zero eigenvalues, Q = [q1 · · · qn−1].
Now, map an input vector x to y and y respectively by thecomplete Q (Eq. 39) and the truncated Q (Eq. 41). What is theerror y − y?
What is the result of mapping y back to the input space by Q(Eq. 43)?
9. Q3 of AY2015/16 Final Evaluation.
Leow Wee Kheng (NUS) Principal Component Analysis 51 / 54
Appendix
Appendix
Consider a general m×n matrix A, where m > n. The SVD of A gives
A = UΣV⊤, (49)
◮ U is m×m orthogonal matrix,
◮ Σ is m×n diagonal matrix, and
◮ V is n×n orthogonal matrix.
This is called full PCA.
Since m > n, Σ has the structure
Σ =
[Σn
0m−n
],
◮ Σn is n×n diagonal matrix,
◮ 0m−n is (m− n)×n zero matrix.
Leow Wee Kheng (NUS) Principal Component Analysis 52 / 54
Appendix
So, Eq. 49 can be written in this form
A =[Un Um−n
] [ Σn
0m−n
]V⊤. (50)
That is,A = UnΣnV
⊤. (51)
This is called Thin SVD.
In practice, we usually compute thin SVD instead of full SVD becauseit is useless to compute Um−n.
If the rank r of A is less than n, then can also computecompact SVD as follows:
A = UrΣrV⊤
r , (52)
where Ur is m×r, Σr is r×r diagonal, and Vr is r×r.
Leow Wee Kheng (NUS) Principal Component Analysis 53 / 54
References
References
1. G. Strang, Introduction to Linear Algebra, 4th ed., Wellesley-Cambridge,2009. www-math.mit.edu/~gs
2. C. Shalizi, Advanced Data Analysis from an Elementary Point of View,2015. www.stat.cmu.edu/~cshalizi/ADAfaEPoV
Leow Wee Kheng (NUS) Principal Component Analysis 54 / 54