20090504_ir_studygroup

Theory and Toolkits of PCA

2009 5/4 IRLab Study Group

Presenter: Chin-Hui Chen

TRANSCRIPT

Page 1

Theory and Toolkits of PCA

2009 5/4 IRLab Study Group

Presenter: Chin-Hui Chen

Page 2

Agenda

Theory:
◦ 1. Scenario
◦ 2. What is PCA?
◦ 3. How to minimize Squared Error?
◦ 4. Dimensionality Reduction

Toolkit:
◦ A list of PCA toolkits
◦ Demo

Page 3

Scenario (Point? Line?)

Consider a 2-dimensional space.

[Figure: sample points in the plane; d marks the distance from a point to a candidate point/line. Criterion: least squared error.]

Page 4

Agenda

Theory:
◦ 1. Scenario
◦ 2. What is PCA?
◦ 3. How to minimize Squared Error?
◦ 4. Dimensionality Reduction

Toolkit:
◦ A list of PCA toolkits
◦ Demo

Page 5

What is PCA? (1)

Principal component analysis (PCA) is a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called "principal components".

Page 6

What is PCA? (2)

What can PCA do?
◦ Dimensionality reduction

For example:
◦ Assume N points in a D-dim space, e.g. {x1, x2, x3, x4}; xi = (v1, v2)
◦ A set of M basis vectors for projection, e.g. {u1}
◦ The bases are orthonormal (each has length 1, and any two have inner product 0); M << D, so each feature is represented in M dimensions
◦ e.g. xi = (p1), the coordinate of xi along u1 (see the expansion sketched below)
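As a sketch of what "represent the feature in M dimensions" means (standard linear algebra, not text from the slides): with a full orthonormal basis $u_1, \dots, u_D$, any $x_i$ expands exactly, and PCA keeps only the first $M$ coordinates:

\[
x_i = \sum_{j=1}^{D} (u_j^t x_i)\, u_j \;\;\approx\;\; \sum_{j=1}^{M} p_j\, u_j, \qquad p_j = u_j^t x_i .
\]

(Later slides center the data first, so in practice $p_j = u_j^t (x_i - m)$ with $m$ the mean.)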

Page 7

Agenda

Theory:
◦ 1. Scenario
◦ 2. What is PCA?
◦ 3. How to minimize Squared Error?
◦ 4. Dimensionality Reduction

Toolkit:
◦ A list of PCA toolkits
◦ Demo

Page 8

How to minimize Squared Error?

Consider a D-dimensional space:
◦ Given N points: {x1, x2, …, xN}
◦ Each xi is a D-dim vector

How to:
◦ 1. Find a point that minimizes the squared error
◦ 2. Find a line that minimizes the squared error

Page 9

How to? – Point

◦ Goal: find $x_0$ minimizing $J_0(x_0) = \sum_{k=1}^{N} \|x_0 - x_k\|^2$
◦ Let $m = \frac{1}{N}\sum_{k=1}^{N} x_k$ (the sample mean)
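The minimization step, reconstructed here since the slide shows it only as an image, is the standard expansion around the mean:

\[
J_0(x_0) = \sum_{k=1}^{N} \|(x_0 - m) + (m - x_k)\|^2
= N\|x_0 - m\|^2 + 2(x_0 - m)^t \sum_{k=1}^{N} (m - x_k) + \sum_{k=1}^{N} \|m - x_k\|^2 .
\]

The cross term vanishes because $\sum_k (m - x_k) = Nm - Nm = 0$, and the last term does not involve $x_0$, so $J_0$ is minimized exactly at $x_0 = m$.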

Page 10

How to? – Point, then Line

∴ $x_0 = m$

◦ 1. Find a point that minimizes the squared error (done: $x_0 = m$)
◦ 2. Find a line that minimizes the squared error

L: a line through $m$ with unit direction $e$; the projection of $x_k$ onto it satisfies $x_k' - x_0 = a_k e$, i.e. $x_k' = x_0 + a_k e = m + a_k e$.

Page 11

How to? – Line

L: $x_k' = m + a_k e$

Goal: find $a_1, \dots, a_N$ (and $e$) minimizing
$J_1(a_1, \dots, a_N, e) = \sum_{k=1}^{N} \|(m + a_k e) - x_k\|^2$
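Expanding $J_1$ (a reconstruction of the algebra the slide shows as an image), using $\|e\| = 1$:

\[
J_1 = \sum_{k=1}^{N} \|a_k e - (x_k - m)\|^2
= \sum_{k=1}^{N} a_k^2 \|e\|^2 - 2 \sum_{k=1}^{N} a_k\, e^t (x_k - m) + \sum_{k=1}^{N} \|x_k - m\|^2 .
\]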

Page 12

How to? – Line

Differentiating each term with respect to $a_k$ and setting it to zero:
$\frac{\partial J_1}{\partial a_k} = 2 a_k - 2 e^t (x_k - m) = 0 \;\Rightarrow\; a_k = e^t (x_k - m)$

What does it mean? $a_k$ is the signed length of the projection of $x_k - m$ onto the unit vector $e$: least-squares fitting reduces to orthogonal projection.

Page 13

How to? – Line

Then, how about $e$? Substitute $a_k = e^t (x_k - m)$ back into $J_1$.

Page 14

How to? – Line

Let $S = \sum_{k=1}^{N} (x_k - m)(x_k - m)^t$. Substituting $a_k$ gives
$J_1(e) = -\,e^t S e + \sum_{k=1}^{N} \|x_k - m\|^2$,
where the second term is independent of $e$.

Page 15

How to? – Line

To minimize $J_1(e)$ we must maximize $e^t S e$. An unconstrained $f(x, y)$ is optimized where its gradient is zero; but when $x, y$ must satisfy a constraint $g(x, y) = 0$, we use a Lagrange multiplier.

$J_1'(e) = -\,e^t S e$. Because $\|e\| = 1$ (i.e. $e^t e = 1$), form
$u = e^t S e - \lambda (e^t e - 1)$.

Setting $\frac{\partial u}{\partial e} = 2 S e - 2 \lambda e = 0$ gives $S e = \lambda e$.

Page 16

How to? – Line

◦ What is $S$?

The covariance matrix: for D-dim data, $S = \sum_{k=1}^{N} (x_k - m)(x_k - m)^t$ is a $D \times D$ matrix (up to a $1/N$ factor, which does not change the eigenvectors).
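As a concrete sketch of this step, here is a minimal plain-Java routine that computes the mean and the scatter/covariance matrix from N row vectors. The class and method names are illustrative, not from the slides:

    // Minimal sketch: mean vector and scatter matrix S = sum_k (x_k - m)(x_k - m)^t.
    // Divide each entry of S by N to get the 1/N covariance convention.
    public class Scatter {
        static double[] mean(double[][] x) {
            int n = x.length, d = x[0].length;
            double[] m = new double[d];
            for (double[] row : x)
                for (int j = 0; j < d; j++) m[j] += row[j] / n;
            return m;
        }

        static double[][] scatter(double[][] x, double[] m) {
            int d = m.length;
            double[][] s = new double[d][d];
            for (double[] row : x)
                for (int i = 0; i < d; i++)
                    for (int j = 0; j < d; j++)
                        s[i][j] += (row[i] - m[i]) * (row[j] - m[j]);
            return s;
        }

        public static void main(String[] args) {
            double[][] x = { {1, 1}, {2, 2}, {3, 3} };    // toy 2-D data
            double[][] s = scatter(x, mean(x));           // S = [[2, 2], [2, 2]]
            System.out.println(java.util.Arrays.deepToString(s));
        }
    }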

Page 17

How to? – Line

From $S e = \lambda e$, and we know $S$. Then, what is $e$? An eigenvector of $S$.

$AX = \lambda X$: "eigen" because $A$ maps $X$ to the same direction, only scaled by $\lambda$.

Page 18

How to? – Conclusion

Summary:
◦ Find a line: $x_k' = m + a_k e$
◦ $a_k = e^t (x_k - m)$
◦ $S e = \lambda e$; $e$ is an eigenvector of the covariance matrix
◦ In a D-dim space, $S$ has $D$ eigenvectors
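To make this summary runnable, here is a minimal self-contained Java sketch that fits the line end to end. The slides only say "eigenvectors of the covariance matrix"; this sketch uses power iteration for the top eigenvector, which is a stand-in choice (it assumes a dominant eigenvalue), not the method named in the slides. A library eigendecomposition such as JAMA's, used in the demo later, works equally well.

    // Minimal sketch: fit the least-squared-error line x' = m + a*e in D dimensions.
    // Power iteration (a stand-in for a full eigendecomposition) finds the top
    // eigenvector e of the scatter matrix S; then a_k = e^t (x_k - m).
    public class PcaLine {
        public static void main(String[] args) {
            double[][] x = { {1, 1}, {2, 2}, {3, 3} };    // toy 2-D data
            int n = x.length, d = x[0].length;

            double[] m = new double[d];                   // mean
            for (double[] r : x) for (int j = 0; j < d; j++) m[j] += r[j] / n;

            double[][] s = new double[d][d];              // scatter matrix S
            for (double[] r : x)
                for (int i = 0; i < d; i++)
                    for (int j = 0; j < d; j++)
                        s[i][j] += (r[i] - m[i]) * (r[j] - m[j]);

            double[] e = new double[d];                   // power iteration for e
            java.util.Arrays.fill(e, 1.0 / Math.sqrt(d));
            for (int it = 0; it < 100; it++) {
                double[] se = new double[d];
                for (int i = 0; i < d; i++)
                    for (int j = 0; j < d; j++) se[i] += s[i][j] * e[j];
                double norm = 0;
                for (double v : se) norm += v * v;
                norm = Math.sqrt(norm);
                for (int i = 0; i < d; i++) e[i] = se[i] / norm;
            }

            for (double[] r : x) {                        // a_k = e^t (x_k - m)
                double a = 0;
                for (int j = 0; j < d; j++) a += e[j] * (r[j] - m[j]);
                System.out.printf("a = %.4f%n", a);
            }
        }
    }

For this toy data all points lie on the line y = x, so the printed coefficients −1.4142, 0, 1.4142 (that is, −√2, 0, √2) reproduce each point exactly with a single number.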

Page 19

Agenda

Theory:
◦ 1. Scenario
◦ 2. What is PCA?
◦ 3. How to minimize Squared Error?
◦ 4. Dimensionality Reduction

Toolkit:
◦ A list of PCA toolkits
◦ Demo

Page 20

Dimensionality Reduction

Page 21

Dimensionality Reduction

Consider a 2-dim space …

Original coordinates: X1 = (a, b), X2 = (c, d)

After rotating to the eigenvector basis: X1 = (a', b'), X2 = (c', d')

We are going to keep only the first coordinate: X1 = (a'), X2 = (c')
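A worked numeric instance of this picture, with toy numbers of my own rather than the slides': take $x_1 = (1,1)$, $x_2 = (2,2)$, $x_3 = (3,3)$.

\[
m = (2,2), \qquad
S = \sum_k (x_k - m)(x_k - m)^t = \begin{pmatrix} 2 & 2 \\ 2 & 2 \end{pmatrix},
\qquad \lambda_1 = 4,\; \lambda_2 = 0,\; e_1 = \tfrac{1}{\sqrt{2}}(1,1)^t .
\]

Projecting, $a_k = e_1^t (x_k - m)$ gives $(-\sqrt{2},\, 0,\, \sqrt{2})$: each 2-D point is represented by a single number, and because $\lambda_2 = 0$ the reconstruction $x_k' = m + a_k e_1$ is exact.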

Page 22

Dimensionality Reduction

We want to prove: the axes of the projected data are uncorrelated.

Consider N m-dim vectors {x1, x2, …, xN}:
◦ Let $X = [\,x_1 - m \;\; x_2 - m \;\; \dots \;\; x_N - m\,]$ (centered vectors as columns; $m$ = the mean vector)
◦ Let $E = [\,e_1 \;\; e_2 \;\; \dots \;\; e_m\,]$

Eigen decomposition $S e_i = \lambda_i e_i$ gives the eigenvectors {e1, …, em} and eigenvalues {λ1, …, λm}.

Page 23

Dimensionality Reduction

$SE = [\,S e_1 \;\; S e_2 \;\; \dots \;\; S e_m\,] = [\,\lambda_1 e_1 \;\; \lambda_2 e_2 \;\; \dots \;\; \lambda_m e_m\,] = ED$,
where $D = \mathrm{diag}(\lambda_1, \dots, \lambda_m)$ and $E = [\,e_1 \;\; e_2 \;\; \dots \;\; e_m\,]$.

Hence $S = E D E^{-1}$; and since the $e_i$ are orthonormal, $E^{-1} = E^t$, so $S = E D E^t$.

Page 24

Dimensionality Reduction

We want the covariance matrix of the projected vectors.

Let $Y = [\,y_1 \;\; y_2 \;\; \dots \;\; y_N\,]$ with $Y = E^t X$ and $E = [\,e_1 \;\; e_2 \;\; \dots \;\; e_m\,]$.

Then $S_Y = Y Y^t = (E^t X)(E^t X)^t = E^t (X X^t) E = E^t S E$.

Page 25

Dimensionality Reduction

$S_Y = E^t S E = E^t (E D E^t) E = D$

1. The covariance between any two projected axes is 0 (the off-diagonal entries of $D$).
2. The variance along axis $i$ is $\lambda_i$: the better an axis represents the data, the larger its variance, and hence the larger its $\lambda$.

Page 26

Dimensionality Reduction

Conclusion: to reduce dimension D to M (M << D):
1. Compute the covariance matrix S
2. Eigen-decompose S to get eigenvalues and eigenvectors
3. Select the eigenvectors with the top M eigenvalues
4. Project the data onto them
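Putting the four steps together, here is a minimal sketch using JAMA, the library the Java demo later in this deck downloads; compile and run with Jama-1.0.2.jar on the classpath. The data values and class name are illustrative. One assumption to flag: JAMA's eigendecomposition of a symmetric matrix returns eigenvalues in ascending order, so the top-M eigenvectors are taken from the last columns of V; if in doubt, inspect eig.getRealEigenvalues().

    import Jama.EigenvalueDecomposition;
    import Jama.Matrix;

    // Minimal sketch of the four-step recipe: covariance -> eigen -> top-M -> project.
    public class PcaReduce {
        public static void main(String[] args) {
            double[][] data = { {2.5, 2.4}, {0.5, 0.7}, {2.2, 2.9}, {1.9, 2.2}, {3.1, 3.0} };
            int n = data.length, d = data[0].length, M = 1;   // reduce 2-D -> 1-D

            // Center the data: X holds the centered vectors as columns (d x n).
            double[] mean = new double[d];
            for (double[] r : data) for (int j = 0; j < d; j++) mean[j] += r[j] / n;
            Matrix X = new Matrix(d, n);
            for (int k = 0; k < n; k++)
                for (int j = 0; j < d; j++) X.set(j, k, data[k][j] - mean[j]);

            // Step 1: S = X X^t (scatter; divide by n for the covariance convention).
            Matrix S = X.times(X.transpose());

            // Step 2: eigen-decompose S (eigenvalues ascending for symmetric S).
            EigenvalueDecomposition eig = S.eig();
            Matrix V = eig.getV();

            // Step 3: keep the eigenvectors with the top M eigenvalues (last M columns).
            Matrix E = V.getMatrix(0, d - 1, d - M, d - 1);

            // Step 4: project, Y = E^t X  (M x n).
            Matrix Y = E.transpose().times(X);
            Y.print(8, 4);
        }
    }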

Page 27

Agenda

Theory:
◦ 1. Scenario
◦ 2. What is PCA?
◦ 3. How to minimize Squared Error?
◦ 4. Dimensionality Reduction

Toolkit:
◦ A list of PCA toolkits
◦ Demo

Page 28

Toolkits

Page 29

A List of PCA Toolkits

C & Java:
◦ Fionn Murtagh's Multivariate Data Analysis Software and Resources
◦ http://astro.u-strasbg.fr/~fmurtagh/mda-sw/

Perl:
◦ PDL::PCA

Matlab:
◦ Statistics Toolbox™: princomp

Weka:
◦ weka.attributeSelection.PrincipalComponents
◦ (http://www.laps.ufpa.br/aldebaro/weka/feature_selection.html)

Page 30

A List of PCA Toolkits

C & Java:
◦ Fionn Murtagh's Multivariate Data Analysis Software and Resources
◦ http://astro.u-strasbg.fr/~fmurtagh/mda-sw/

C:
◦ Download: pca.c
◦ Compile: cc pca.c -lm -o pcac
◦ Run: ./pcac spectr.dat 36 8 R > pcaout.c.txt

Java:
◦ Download: JAMA, PCAcorr.java
◦ Compile: javac -classpath Jama-1.0.2.jar PCAcorr.java
◦ Run: java PCAcorr iris.dat > pcaout.java.txt

Page 31

Page 32