Principal Component Analysis
Philosophy of PCA
Introduced by Pearson (1901) and Hotelling (1933) to describe the variation in a set of multivariate data in terms of a set of uncorrelated variables
We typically have a data matrix of n observations on p correlated variables x1,x2,…xp
PCA looks for a transformation of the xi into p new variables yi that are uncorrelated
The data matrix
case   ht (x1)   wt (x2)   age (x3)   sbp (x4)   heart rate (x5)
1      175       1225      25         117        56
2      156       1050      31         122        63
…
n      202       1350      58         154        67
Reduce dimension
The simplest way is to keep one variable and discard all the others: not reasonable!
Weight all variables equally: not reasonable (unless they have the same variance)
A weighted average based on some criterion.
Which criterion?
Let us write it first
Looking for a transformation of the data matrix X (n×p) such that
Y = aᵀX = a1X1 + a2X2 + … + apXp
where a = (a1, a2, …, ap)ᵀ is a column vector of weights with
a1² + a2² + … + ap² = 1
One good criterion
Maximize the variance of the projection of the observations on the Y variables
Find a so that
Var(aᵀX) = aᵀ Var(X) a is maximal
The matrix C=Var(X) is the covariance matrix of the Xi variables
Let us see it on a figure
[Figure: projections of the data onto two candidate directions, labelled "Good" and "Better"]
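As a quick numerical check of this criterion, here is a minimal sketch in Python (assuming NumPy; the two-variable data and the weights are illustrative, not from the slides) showing that Var(aᵀX) = aᵀ Var(X) a:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=1000)

a = np.array([0.8, 0.6])          # weights with a1^2 + a2^2 = 1
C = np.cov(X, rowvar=False)       # covariance matrix of the variables

print(np.var(X @ a, ddof=1))      # Var(a^T X), variance of the projection
print(a @ C @ a)                  # a^T C a -- the same number
```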
Covariance matrix
      ( v(x1)      c(x1,x2)   …   c(x1,xp) )
C =   ( c(x2,x1)   v(x2)      …   c(x2,xp) )
      ( …          …          …   …        )
      ( c(xp,x1)   c(xp,x2)   …   v(xp)    )
And so… we find that
The direction of a is given by the eigenvector a1 corresponding to the largest eigenvalue of matrix C
The second vector, orthogonal (uncorrelated) to the first, is the one with the second highest variance, which comes to be the eigenvector corresponding to the second largest eigenvalue
And so on …
So PCA gives
New variables Yi that are linear combinations of the original variables (xi):
Yi = ai1x1 + ai2x2 + … + aipxp ; i = 1..p
The new variables Yi are derived in decreasing order of importance;
they are called ‘principal components’
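A minimal sketch of this whole recipe in Python (assuming NumPy; the data matrix is simulated for illustration): center X, form the covariance matrix, eigendecompose it, and project onto the eigenvectors in decreasing order of eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))  # correlated columns

Xc = X - X.mean(axis=0)                # center each variable
C = np.cov(Xc, rowvar=False)           # p x p covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)   # eigh, since C is symmetric
order = np.argsort(eigvals)[::-1]      # decreasing eigenvalue = decreasing importance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Y = Xc @ eigvecs                       # columns of Y are the principal components Yi
print(np.var(Y, axis=0, ddof=1))       # their variances equal the sorted eigenvalues
```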
Calculating eigenvalues and eigenvectors
The eigenvalues λi are found by solving the equation
det(C − λI) = 0
Eigenvectors are the columns of the matrix A such that C = A D Aᵀ, where

      ( λ1   0    …   0  )
D =   ( 0    λ2   …   0  )
      ( …                )
      ( 0    0    …   λp )
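A quick check of the decomposition C = A D Aᵀ, using NumPy on an illustrative covariance matrix (not from the slides):

```python
import numpy as np

C = np.array([[2.0, 0.5, 0.3],
              [0.5, 1.0, 0.2],
              [0.3, 0.2, 1.5]])       # an illustrative covariance matrix

eigvals, A = np.linalg.eigh(C)        # columns of A are the eigenvectors
D = np.diag(eigvals)                  # D = diag(lambda_1, ..., lambda_p)

print(np.allclose(A @ D @ A.T, C))    # True: C = A D A^T
```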
An example
Let us take two variables with covariance c>0
C = ( 1   c )        C − λI = ( 1−λ    c  )
    ( c   1 )                 (  c    1−λ )

det(C − λI) = (1 − λ)² − c²

Solving this we find λ1 = 1 + c and λ2 = 1 − c < λ1
and eigenvectors
Any eigenvector A satisfies the condition CA = λA
Solving we find

CA = ( 1   c ) ( a1 )  =  ( a1 + c·a2 )  =  λ ( a1 )
     ( c   1 ) ( a2 )     ( c·a1 + a2 )       ( a2 )

For λ1 = 1 + c this gives a1 = a2; for λ2 = 1 − c it gives a1 = −a2. After normalizing to unit length:

A1 = (1/√2) (1, 1)ᵀ        A2 = (1/√2) (1, −1)ᵀ
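The example can be verified numerically; a minimal sketch assuming NumPy and an illustrative value c = 0.7:

```python
import numpy as np

c = 0.7
C = np.array([[1.0, c],
              [c, 1.0]])

eigvals, eigvecs = np.linalg.eigh(C)   # eigh returns eigenvalues in ascending order
print(eigvals)                         # [1 - c, 1 + c] -> [0.3, 1.7]
print(eigvecs)                         # columns proportional to (1, -1) and (1, 1)
```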
PCA is sensitive to scale
If you multiply one variable by a scalar you get different results (can you show it? see the sketch below)
This is because PCA uses the covariance matrix (and not the correlation matrix)
PCA should be applied on data that have approximately the same scale in each variable
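To answer the "can you show it?" above, a minimal sketch (assuming NumPy and simulated data) in which rescaling one variable swings the first principal direction toward that variable:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0, 0], [[1.0, 0.4], [0.4, 1.0]], size=500)

X_scaled = X.copy()
X_scaled[:, 0] *= 100.0               # rescale the first variable only

for data in (X, X_scaled):
    C = np.cov(data, rowvar=False)
    _, vecs = np.linalg.eigh(C)
    print(vecs[:, -1])                # first PC (up to sign): ~(0.71, 0.71)
                                      # before, ~(1, 0) after rescaling
```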
Interpretation of PCA
The new variables (PCs) have a variance equal to their corresponding eigenvalue
Var(Yi) = λi for all i = 1…p
Small λi ⇒ small variance ⇒ the data change little in the direction of component Yi
The relative variance explained by each PC is given by λi / Σλi
How many components to keep?
Keep enough PCs to have a cumulative variance explained by the PCs that is >50-70%
Kaiser criterion: keep PCs with eigenvalues >1
Scree plot: represents the ability of the PCs to explain the variation in the data
Do it graphically
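A minimal sketch of the quantities behind these three rules (assuming NumPy; the eigenvalues are illustrative, not from the slides):

```python
import numpy as np

eigvals = np.array([3.1, 1.2, 0.4, 0.2, 0.1])   # illustrative eigenvalues

explained = eigvals / eigvals.sum()              # lambda_i / sum of lambdas
cumulative = np.cumsum(explained)

print(explained)     # bar heights in a scree plot
print(cumulative)    # keep enough PCs to pass ~0.5-0.7
print(eigvals > 1)   # Kaiser criterion: [ True  True False False False]
```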
Interpretation of components
Look at the weights of the variables in each component
If Y1 = 0.89X1 + 0.15X2 − 0.77X3 + 0.51X4
then X1 and X3 have the highest weights and so are the most important variables in the first PC
See the correlation between variables Xi and PCs: circle of correlation
Circle of correlation
Normalized (standardized) PCA
If variables have very heterogeneous variances we standardize them
The standardized variables Xi*
Xi* = (Xi − mean) / standard deviation
The new variables all have the same variance (1), so each variable has the same weight.
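A minimal sketch of the standardization step (assuming NumPy and simulated data); note that PCA on standardized variables amounts to PCA on the correlation matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])  # mixed scales

X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # X* = (X - mean) / sd

# The covariance of the standardized data is the correlation matrix of X:
print(np.allclose(np.cov(X_std, rowvar=False), np.corrcoef(X, rowvar=False)))
```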
Application of PCA in Genomics
PCA is useful for finding new, more informative, uncorrelated features; it reduces dimensionality by rejecting low variance features
Analysis of expression data
Analysis of metabolomics data (Ward et al., 2003)
However
PCA is only powerful if the biological question is related to the highest variance in the dataset
If not, other techniques are more useful: Independent Component Analysis (ICA)
Introduced by Jutten in 1987
What is ICA?
The idea behind ICA
How does it work?
Rationale of ICA
Find the components Si that are as independent as possible, in the sense of maximizing some function F(s1, s2, …, sk) that measures independence
All ICs (except possibly one) should be non-Normal
The variance of all ICs is 1
There is no hierarchy between ICs
How to find ICs?
Many choices of objective function F: for instance, mutual information
We use the kurtosis of the variables to approximate the distribution function
The number of ICs is chosen by the user
MI(s1, s2, …, sk) = ∫ f(s1, s2, …, sk) · Log [ f(s1, s2, …, sk) / ( f1(s1) f2(s2) … fk(sk) ) ] ds1 … dsk
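A minimal sketch of the kurtosis mentioned above as a non-Gaussianity measure (assuming NumPy and SciPy; the data are simulated):

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(4)
normal_src = rng.normal(size=10000)
uniform_src = rng.uniform(-1, 1, size=10000)

print(kurtosis(normal_src))    # excess kurtosis ~ 0 for a Normal variable
print(kurtosis(uniform_src))   # ~ -1.2 for a uniform (sub-Gaussian) source
```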
Difference with PCA
It is not a dimensionality reduction technique
There is no single (exact) solution for the components; different algorithms are used (in R: FastICA, PearsonICA, MLICA)
ICs are of course uncorrelated but also as independent as possible
Uninteresting for Normally distributed variables
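For illustration, a minimal ICA sketch using Python's scikit-learn FastICA rather than the R packages named above (the mixed signals are simulated, not data from the slides):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(5)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.cos(3 * t))]   # two non-Normal sources
X = S @ np.array([[1.0, 0.5], [0.5, 1.0]])         # observed mixtures

ica = FastICA(n_components=2, random_state=0)      # number of ICs chosen by the user
S_hat = ica.fit_transform(X)                       # estimated independent components
```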
Example: Lee and Batzoglou (2003)
Microarray expression data on 7070 genes in 59 Normal human tissue samples (19 types)
We are not interested in reducing dimension but rather in looking for genes that show a tissue-specific expression profile (what makes tissue types different)
PCA vs ICA
Hsiao et al. (2002) applied PCA and by visual inspection observed three gene clusters of 425 genes: liver-specific, brain-specific and muscle-specific
ICA identified more tissue-specific genes than PCA