principal component analysis - nus computingcs5240/lecture/pca.pdf · 2019-09-05 · principal...

Principal Component Analysis

CS5240 Theoretical Foundations in Multimedia

Leow Wee Kheng

Department of Computer Science

School of Computing

National University of Singapore

Leow Wee Kheng (NUS) Principal Component Analysis 1 / 54

Motivation

Motivation

How wide is the widest part of NGC 1300?


Motivation

How thick is the thickest part of NGC 4594?

Use principal component analysis.


Maximum Variance Estimate


Consider a set of points xi, i = 1, . . . , n, in an m-dimensional spacesuch that their mean µ = 0, i.e., centroid is at the origin.

x i

di

yi

x1

x2

x4

x1

x2

x3

O

q

l

Want to find a line l through the origin that maximizes the projectionsyi of the points xi on l.



Let q denote the unit vector along line l.Then, the projection yi of xi on l is

yi = x⊤

i q. (1)

The mean squared projection, which is the variance, V over all points is

V =1

n

n∑

i=1

y2i =1

n

n∑

i=1

(x⊤

i q)2 ≥ 0. (2)

Because x⊤

i q = q⊤xi, expanding Eq. 2 gives

V =1

n

n∑

i=1

(q⊤xi)(x⊤

i q) = q⊤

[1

n

n∑

i=1

xix⊤

i

]q. (3)

The middle factor is the covariance matrix C of the data points

C =1

n

n∑

i=1

xix⊤

i . (4)



We want to find a unit vector q that maximizes the variance V , i.e.,

maximize V = q⊤Cq subject to ‖q‖ = 1. (5)

This is a constrained optimization problem.

Use Lagrange multiplier method:combine V and the constraint using Lagrange multiplier λ

maximize V ′ = q⊤Cq− λ(q⊤q− 1). (6)


Maximum Variance Estimate Lagrange Multiplier Method

Lagrange Multiplier Method

Lagrange multiplier is a method for solving constrained optimization.

Consider this problem:

maximize f(x) subject to g(x) = c. (7)

Lagrange multiplier method introduces a Lagrange multiplier λ tocombine f(x) and g(x):

L(x, λ) = f(x) + λ (g(x)− c). (8)

The sign of λ can be positive or negative.

Then, solve for the stationary point of L(x, λ):

∂L(x, λ)

∂x= 0. (9)


Maximum Variance Estimate Lagrange Multiplier Method

◮ If x0 is a solution of the original problem, then there is a λ0 suchthat (x0, λ0) is a stationary point of L.

◮ Not all stationary points yield solutions of the original problem.

◮ The same method applies to minimization problem.

◮ Multiple constraints can be combined by adding multiple terms.

minimize f(x) subject to g1(x) = c1, g2(x) = c2 (10)

is solved with

L(x) = f(x) + λ1(g1(x)− c1) + λ2(g2(x)− c2). (11)



Now, we differentiate V ′ with respect to q and set to 0:

∂V ′

∂q= 2q⊤C− 2λq⊤ = 0. (12)

Rearranging the terms gives

q⊤C = λq⊤. (13)

Since the covariance matrix C is symmetric (homework), C⊤ = C.So, transposing both sides of Eq. 13 gives

Cq = λq. (14)

◮ Eq. 14 is called an eigenvector equation.

◮ q is the eigenvector.

◮ λ is the eigenvalue.

Thus, the eigenvector q of C gives the line that maximizes variance V .



The perpendicular distance di of xi from the line l is

di = ‖xi − yiq‖ = ‖xi − (x⊤

i q)q‖. (15)

x i

di

yi

x1

x2

x4

x1

x2

x3

O

q

l



The squared distance is

d2i = ‖xi − (x⊤

i q)q‖2 = (xi − (x⊤

i q)q)⊤(xi − (x⊤

i q)q)

= x⊤

i xi − x⊤

i (x⊤

i q)q− (x⊤

i q)q⊤xi + (x⊤

i q)q⊤(x⊤

i q)q.

With x⊤

i q being a scalar and x⊤

i q = q⊤xi, we obtain

d2i = x⊤

i xi − (x⊤

i q)2 − (x⊤

i q)2 + (x⊤

i q)2(q⊤q)

= x⊤

i xi − (x⊤

i q)2.

Averaging over all i gives

D =1

n

n∑

i=1

d2i =1

n

n∑

i=1

x⊤

i xi − V. (16)

So, maximizing V means minimizing D.

The eigenvalue λ = V (homework); so λ ≥ 0.



Name q as the first eigenvector q1.The component of xi orthogonal to q1 is x′

i = xi − x⊤

i q1q1.

Repeat previous method on x′

i in an (m− 1)-D subspace orthogonal toq1 to get q2. Then, repeat to get q3, . . . ,qm.

x1

x2

x3

orthogonalsubspace

q1

O


Eigendecomposition

Eigendecomposition

In general, a m×m covariance matrix C has m eigenvectors:

Cqj = λj qj , j = 1, . . . ,m. (17)

Transpose both sides of the equations to get

q⊤

j C = λj q⊤

j , j = 1, . . . ,m, (18)

which are row matrices. Stack the row matrices into a column to get

q⊤1...

q⊤m

C =

λ1 0 0

0. . . 0

0 0 λm

q⊤1...

q⊤m

(19)

Denote matrix Q and Λ as

Q = [q1 · · · qm], Λ = diag(λ1, . . . , λm). (20)


Eigendecomposition

Then, Eq. 19 becomesQ⊤C = ΛQ⊤ (21)

orC = QΛQ⊤. (22)

This is the matrix equation for eigendecomposition.

Properties:

◮ The eigenvectors are orthonormal:

q⊤

j qj = 1,

q⊤

j qk = 0, for k 6= j.(23)

So, the eigenmatrix Q is orthogonal:

Q⊤Q = QQ⊤ = I. (24)

◮ The eigenvalues are arranged to be sorted λj ≥ λj+1.


General PCA

General PCA

In general, the mean µ of the data points xi is not at the origin.In this case, we subtract µ from each xi,obtaining shifted or zero-mean data points xi − µ.

PCA transforms xi into a new vector yi through Q as follows:

yi = Q⊤(xi − µ) =

q⊤1 (xi − µ)

...q⊤m(xi − µ)

=

m∑

j=1

(xi − µ)⊤qj qj , (25)

Each component of yi is

yij = (xi − µ)⊤qj . (26)

This is the projection of xi − µ on qj .


General PCA

The original xi can be recovered from yi:

xi = Qyi + µ. (27)

Caution!

◮ xi 6= yi + µ.

◮ xi is in the original input space but yi is in the eigenspace.Let xj denote the unit vectors that form the input space.qj are the eigenvectors that form the eigenspace. Then,

xi = (xi1, xi2, . . . , xim) =m∑

j=1

xijxj ,

yi = (yi1, yi2, . . . , yim) =

m∑

j=1

yijqj .


General PCA


General PCA Properties of PCA

Properties of PCA

◮ Mean µy over all yi is 0 (homework).

◮ Variance σ2j along qj is λj (homework).

◮ Since λ1 ≥ · · · ≥ λm ≥ 0, so σ1 ≥ · · · ≥ σm ≥ 0.

◮ q1 gives orientation of the largest variance.

◮ q2 gives orientation of largest variance orthogonal to q1

(2nd largest variance).

◮ qj gives orientation of largest variance orthogonal to q1, . . . ,qj−1

(j-th largest variance).

◮ qm is orthogonal to all other eigenvectors (least variance).



Data Points



Centroid at Origin



Principal Components



Another Example



Centroid at Origin



Principal Components



Special Cases: Points on a rectangle.

−3 −2 −1 0 1 2 3−2

−1

0

1

2

−3 −2 −1 0 1 2 3−2

−1

0

1

2

−3 −2 −1 0 1 2 3−2

−1

0

1

2

−3 −2 −1 0 1 2 3−2

−1

0

1

2

Blue line: direction of 1st principal component.

To verify the results, run the program in Programming Homework 3 on these data.



Special Cases: Points on a rectangle.

−3 −2 −1 0 1 2 3−2

−1

0

1

2

−3 −2 −1 0 1 2 3−2

−1

0

1

2

−2 −1 0 1 2−2

−1

0

1

2

−2 −1 0 1 2−2

−1

0

1

2




Special Cases: Points on a square / diamond.

−1 0 1

−1

0

1

−1 0 1

−1

0

1


This is a surprise! But, it is easy to see:The 4 points are equi-distant from the origin.So, they actually lie on a sphere!

A sphere has an infinite number of possible first principal componentsalong any diameter.



Special Cases: Points on a diamond.

−1 0 1

−1

0

1

What about this case? Try it out yourself!


PCA Algorithm 1

PCA Algorithm 1

Let xi denote m-dimensional vectors (data points), i = 1, ..., n.

1. Compute the mean vector of the data points

µ =1

n

n∑

i=1

xi. (28)

2. Compute the covariance matrix

C =1

n

n∑

i=1

(xi − µ)(xi − µ)⊤. (29)

3. Perform eigendecomposition of C

C = QΛQ⊤. (30)


PCA Algorithm 1

◮ Some books and papers use sample covariance

C =1

n− 1

n∑

i=1

(xi − µ)(xi − µ)⊤. (31)

Eq. 31 differs from Eq. 29 by only a constant.


PCA Algorithm 1

Notes:

◮ Covariance matrix C is a m×m matrix.

◮ Eigendecomposition of very large matrix is inefficient.

◮ Example:◮ A 256×256 colour image has 256×256× 3 values.◮ Number of dimensions m = 196680.◮ C has a size of 196608×196608!!◮ Number of images n is usually ≪ m, e.g., 1000.


PCA Algorithm 2

PCA Algorithm 2

1. Compute mean µ of data points xi.

µ =1

n

n∑

i=1

xi.

2. Form a m×n matrix A, n≪ m:

A = [(x1 − µ) · · · (xn − µ)]. (32)

3. Compute A⊤A, which is just a n×n matrix.

4. Apply eigendecomposition on A⊤A:

(A⊤A)qj = λjqj . (33)

5. Pre-multiply A to Eq. 33 giving

AA⊤Aqj = Aλjqj . (34)


PCA Algorithm 2

6. Recover eigenvectors and eigenvalues.

Since AA⊤ = nC (homework), Eq. 34 is

nC (Aqj) = λj(Aqj) (35)

C (Aqj) =λj

n(Aqj). (36)

Therefore, eigenvectors of C are Aqj , andeigenvalues of C are λj/n.


PCA Algorithm 3

PCA Algorithm 3

1. Compute mean µ of data points xi.

µ =1

n

n∑

i=1

xi.

2. Form a m×n matrix A:

A = [(x1 − µ) · · · (xn − µ)].

3. Apply singular value decomposition (SVD) on A:

A = UΣV⊤.

4. Recover eigenvectors and eigenvalues (page 31).


PCA Algorithm 3 Singular Value Decomposition

Singular Value Decomposition

Singular value decomposition (SVD) decomposes a matrix A into

A = UΣV⊤. (37)

◮ Column vectors of U are left singular vectors uj , anduj are orthonormal.

◮ Column vectors of V are right singular vectors vj , andvj are orthonormal.

◮ Σ is diagonal and contains singular values sj .

◮ Rank of A = number of non-zero singular values.

Refer to Appendix for more details.


PCA Algorithm 3 Singular Value Decomposition

Notice that

AA⊤ = UΣV⊤

(UΣV⊤

)⊤

= UΣV⊤VΣU⊤ = UΣ2U⊤

and eigendecomposition of AA⊤ is

AA⊤ = QΛQ⊤.

So, eigenvector of AA⊤ = uj , eigenvalue of AA⊤ = s2j .

On the other hand,

A⊤A =(UΣV⊤

)⊤

UΣV⊤ = VΣU⊤UΣV⊤ = VΣ2V⊤.

Compare with the eigendecomposition of A⊤A:

A⊤A = QΛQ⊤.

So, eigenvector of A⊤A = vj , eigenvalue of A⊤A = s2j .


PCA Algorithm 3

4. Recover eigenvectors and eigenvalues.

Compare AA⊤ = nC with eigendecomposition of C:

AA⊤ = UΣ2U⊤,

C = QΛQ⊤.

Therefore, eigenvectors qj of C = uj in U, and

eigenvalues λj of C = s2j/n, for sj in Σ.


Application Examples


Apply PCA for line fitting in 2-D.

Compute PCA of the points. Then,

◮ mean of points = a point on line

◮ 1st eigenvector = unit vector along line

xi1x1

x2

xi2 ip

f



Apply PCA for plane fitting in 3-D.

Compute PCA of the points. Then,

◮ mean of points = a point on plane

◮ 3rd eigenvector = unit normal vector of plane


Dimensionality Reduction


If a point represents a 256×256 colour image,then the dimensionality is 256×256×3 = 196680!!

But, not all 196680 eigenvalues are non-zero.



Case 1: Number of data vectors n ≤ m number of dimensions.

◮ Data vectors are all independent.Then, number of non-zero eigenvalues (eigenvectors) = n− 1,i.e., rank of covariance matrix = n− 1.Why not = n?

◮ Data vectors are not independent.Then, rank of covariance matrix < n− 1.

Case 2: Number of data vectors n > m.

◮ m or more independent data vectors.Then, rank of covariance matrix = m.

◮ Fewer than m data vectors are independent.In this case, what is the rank of covariance matrix?

In practice, it is often possible to reduce the dimensionality.



Eigenmatrix Q isQ = [q1 · · · qm] . (38)

PCA maps a data point x to a vector y in the eigenspace as

y = Q⊤(x− µ) =m∑

j=1

(x− µ)⊤qjqj , (39)

which has m dimensions.

Pick l eigenvectors with the largest eigenvalues to form truncated Q:

Q = [q1 · · · ql], (40)

which spans a subspace of the eigenspace.

Then, Q maps x to y, an estimate of y:

y = Q⊤(x− µ) =l∑

j=1

(x− µ)⊤qjqj , (41)

which has l < m dimensions.



x y y x

x1 y1 y1 x1

...Q⊤

−→...

...Q⊤

←−...

...Q←− yl yl

Q−→/

...

... yl+1

...

......

...

xm ym xm

dimensionalityreduction

Difference between y and y is

y − y =m∑

j=l+1

(x− µ)⊤qjqj . (42)



Beware:y = Q⊤(x− µ)

and, therefore,x = Qy + µ.

But,y = Q⊤(x− µ),

whereasx = Q y + µ 6= x. (43)

Why?

With n data points xi, sum-squared error E between xi and itsestimate xi is

E =n∑

i=1

‖xi − xi‖2. (44)



When all m dimensions are kept, the total variance is

m∑

j=1

σ2j =

m∑

j=1

λj . (45)

When l dimensions are used, the ratio R′ of unaccounted variance is:

R′(l) =

m∑

j=l+1

σ2j

m∑

j=1

σ2j

=

m∑

j=l+1

λj

m∑

j=1

λj

. (46)



A sample plot of R′ vs. number of dimensions used:

notenough

number ofdimensionsml

enough

1

0

R’

How to choose appropriate l?Choose l such that larger than l doesn’t reduce R′ significantly:

R′(l)−R′(l + 1) < ǫ. (47)



Alternatively, compute ratio R of accounted variance:

R(l) =

l∑

j=1

σ2j

m∑

j=1

σ2j

=

l∑

j=1

λj

m∑

j=1

λj

. (48)

l is large enough ifR(l + 1)−R(l) < ǫ.

number ofdimensions

notenough

ml

1

0

R

enough


Summary

Summary

◮ Eigendecomposition of covariance matrix = PCA.

◮ In practice, PCA is computed using SVD.

◮ PCA maximizes variances along the eigenvectors.

◮ Eigenvalues = variances along eigenvectors.

◮ Best fitting plane computed by PCA minimizes distance to points.

◮ PCA can be used for dimensionality reduction.


Probing Questions

Probing Questions

◮ When applying PCA to plane fitting, how to check whether thepoints really lie close to the plane?

◮ If you apply PCA to a set of points on a curve surface, where doyou expect the eigenvectors to point at?

◮ In dimensionality reduction, the m-D vector y is reduced to a l-Dvector y. Since l 6= m, vector subtraction is undefined. But,subtraction of Eq. 39 and 41 gives

y − y =m∑

j=l+1

(x− µ)⊤qjqj .

How is this possible? Is there a contradiction?

◮ Show that when n ≤ m, there are at most n− 1 eigenvectors.


Homework

Homework I

1. Describe the essence of PCA in one sentence.

2. Show that the covariance matrix C of a set of data points xi,i = 1, . . . , n, is symmetric about its diagonal.

3. Show that the eigenvalue λ of covariance matrix C equals thevariance V as defined in Eq. 2.

4. Show that the mean µy over all yi in the eigenspace is 0.

5. Show that the variance σ2j along eigenvector qj is λj .

6. Show that AA⊤ = nC.

7. Derive the difference Qy − Q y in dimensionality reduction.


Homework

Homework II

8. Suppose n ≤ m, and the truncated eigenmatrix Q contains all theeigenvectors with non-zero eigenvalues, Q = [q1 · · · qn−1].

Now, map an input vector x to y and y respectively by thecomplete Q (Eq. 39) and the truncated Q (Eq. 41). What is theerror y − y?

What is the result of mapping y back to the input space by Q(Eq. 43)?

9. Q3 of AY2015/16 Final Evaluation.


Appendix

Appendix

Consider a general m×n matrix A, where m > n. The SVD of A gives

A = UΣV⊤, (49)

◮ U is m×m orthogonal matrix,

◮ Σ is m×n diagonal matrix, and

◮ V is n×n orthogonal matrix.

This is called full PCA.

Since m > n, Σ has the structure

Σ =

[Σn

0m−n

],

◮ Σn is n×n diagonal matrix,

◮ 0m−n is (m− n)×n zero matrix.


Appendix

So, Eq. 49 can be written in this form

A =[Un Um−n

] [ Σn

0m−n

]V⊤. (50)

That is,A = UnΣnV

⊤. (51)

This is called Thin SVD.

In practice, we usually compute thin SVD instead of full SVD becauseit is useless to compute Um−n.

If the rank r of A is less than n, then can also computecompact SVD as follows:

A = UrΣrV⊤

r , (52)

where Ur is m×r, Σr is r×r diagonal, and Vr is r×r.


References

References

1. G. Strang, Introduction to Linear Algebra, 4th ed., Wellesley-Cambridge,2009. www-math.mit.edu/~gs

2. C. Shalizi, Advanced Data Analysis from an Elementary Point of View,2015. www.stat.cmu.edu/~cshalizi/ADAfaEPoV


www-math.mit.edu/~gs

www.stat.cmu.edu/~cshalizi/ADAfaEPoV

principal component analysis - nus computingcs5240/lecture/pca.pdf · 2019-09-05 · principal...

Documents