
Page 1: Principal Components Analysis Babak Rasolzadeh Tuesday, 5th December 2006

Principal Components Analysis

Babak Rasolzadeh
Tuesday, 5th December 2006

Page 2: Principal Components Analysis Babak Rasolzadeh Tuesday, 5th December 2006

Data Presentation

• Example: 53 Blood and urine measurements (wet chemistry) from 65 people (33 alcoholics, 32 non-alcoholics).

• Matrix Format

• Spectral Format

     H-WBC    H-RBC    H-Hgb     H-Hct     H-MCV      H-MCH    H-MCHC
A1   8.0000   4.8200   14.1000   41.0000   85.0000    29.0000  34.0000
A2   7.3000   5.0200   14.7000   43.0000   86.0000    29.0000  34.0000
A3   4.3000   4.4800   14.1000   41.0000   91.0000    32.0000  35.0000
A4   7.5000   4.4700   14.9000   45.0000   101.0000   33.0000  33.0000
A5   7.3000   5.5200   15.4000   46.0000   84.0000    28.0000  33.0000
A6   6.9000   4.8600   16.0000   47.0000   97.0000    33.0000  34.0000
A7   7.8000   4.6800   14.7000   43.0000   92.0000    31.0000  34.0000
A8   8.6000   4.8200   15.8000   42.0000   88.0000    33.0000  37.0000
A9   5.1000   4.7100   14.0000   43.0000   92.0000    30.0000  32.0000

[Spectral format plot: measurement value plotted against measurement number for each person]

Page 3: Principal Components Analysis Babak Rasolzadeh Tuesday, 5th December 2006

Data Presentation

[Univariate plot: H-Bands vs. Person]
[Bivariate plot: C-LDH vs. C-Triglycerides]
[Trivariate plot: M-EPI vs. C-Triglycerides and C-LDH]

Page 4: Principal Components Analysis Babak Rasolzadeh Tuesday, 5th December 2006

Data Presentation

• Better presentation than ordinate axes?
• Do we need a 53-dimensional space to view the data?
• How do we find the ‘best’ low-dimensional space that conveys the maximum useful information?
• One answer: find the “Principal Components”

Page 5: Principal Components Analysis Babak Rasolzadeh Tuesday, 5th December 2006

Principal Components

• All principal components (PCs) start at the origin of the ordinate axes.

• First PC is direction of maximum variance from origin

• Subsequent PCs are orthogonal to the preceding PCs and describe the maximum residual variance

[Scatter plots of Wavelength 2 vs. Wavelength 1, with the directions of PC 1 and PC 2 drawn through the data]

Page 6: Principal Components Analysis Babak Rasolzadeh Tuesday, 5th December 2006

Algebraic Interpretation

• Given m points in an n-dimensional space, for large n, how does one project onto a low-dimensional space while preserving broad trends in the data and allowing it to be visualized?

Page 7: Principal Components Analysis Babak Rasolzadeh Tuesday, 5th December 2006

Algebraic Interpretation – 1D

• Given m points in an n-dimensional space, for large n, how does one project onto a 1-dimensional space?

• Choose a line that fits the data so that the points are spread out well along the line

Page 8: Principal Components Analysis Babak Rasolzadeh Tuesday, 5th December 2006

• Formally, minimize the sum of squares of the distances to the line.

• Why sum of squares? Because it allows fast minimization, assuming the line passes through the origin (0)

Algebraic Interpretation – 1D

Page 9: Principal Components Analysis Babak Rasolzadeh Tuesday, 5th December 2006

• Minimizing sum of squares of distances to the line is the same as maximizing the sum of squares of the projections on that line, thanks to Pythagoras.

Algebraic Interpretation – 1D
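A minimal sketch of that argument, assuming the line passes through the origin with unit direction vector v (the symbols v, p_i and d_i are introduced here only for illustration): by Pythagoras, for each point p_i,

\|p_i\|^2 = (p_i^\top v)^2 + d_i^2

Summing over all points gives sum_i d_i^2 = sum_i ||p_i||^2 - sum_i (p_i^T v)^2. The first sum on the right is fixed by the data, so minimizing the total squared distance to the line is the same as maximizing the total squared projection onto it.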

Page 10: Principal Components Analysis Babak Rasolzadeh Tuesday, 5th December 2006

• How is the sum of squares of projection lengths expressed in algebraic terms?

Stack the points Pt 1, Pt 2, Pt 3, …, Pt m as the rows of a matrix B, and let x be the unit vector along the line. The projections of the points onto the line are the entries of Bx, so the sum of squares of the projection lengths is

(Bx)^T (Bx) = x^T B^T B x

Algebraic Interpretation – 1D
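A small numpy check of this identity; the matrix B, the vector x and the random data below are illustrative, not taken from the slides.

import numpy as np

# m points in an n-dimensional space, stacked as the rows of a matrix B
rng = np.random.default_rng(0)
B = rng.normal(size=(10, 3))        # m = 10 points, n = 3 dimensions (illustrative)

# a unit vector x along a candidate line through the origin
x = np.array([1.0, 2.0, -1.0])
x = x / np.linalg.norm(x)

# the projection length of each point onto the line is the corresponding entry of B @ x
projections = B @ x
sum_of_squares = np.sum(projections ** 2)

# the same quantity written algebraically: x^T B^T B x
quadratic_form = x @ B.T @ B @ x

print(sum_of_squares, quadratic_form)   # the two values agree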

Page 11: Principal Components Analysis Babak Rasolzadeh Tuesday, 5th December 2006

PCA: General

From k original variables: x1,x2,...,xk:

Produce k new variables: y1,y2,...,yk:

y1 = a11x1 + a12x2 + ... + a1kxk

y2 = a21x1 + a22x2 + ... + a2kxk

...

yk = ak1x1 + ak2x2 + ... + akkxk

Page 12: Principal Components Analysis Babak Rasolzadeh Tuesday, 5th December 2006

From k original variables: x1,x2,...,xk:

Produce k new variables: y1,y2,...,yk:

y1 = a11x1 + a12x2 + ... + a1kxk

y2 = a21x1 + a22x2 + ... + a2kxk

...

yk = ak1x1 + ak2x2 + ... + akkxk

such that:

yk's are uncorrelated (orthogonal)
y1 explains as much as possible of the original variance in the data set
y2 explains as much as possible of the remaining variance
etc.

PCA: General
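A sketch of these conditions in code, using numpy on illustrative random data (the variable names are not from the slides): the eigenvectors of the covariance matrix supply the coefficients a_ij, the resulting y's come out uncorrelated, and their variances decrease from y1 onward.

import numpy as np

# illustrative correlated data: 200 observations of k = 4 variables
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

Xc = X - X.mean(axis=0)               # centre each variable
C = np.cov(Xc, rowvar=False)          # k x k covariance matrix

# the eigenvectors of C give the coefficients a_i1, ..., a_ik of each y_i
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]     # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Y = Xc @ eigvecs                      # y_i = a_i1*x1 + ... + a_ik*xk for every observation

# the y's are (numerically) uncorrelated and their variances are the eigenvalues
print(np.round(np.cov(Y, rowvar=False), 6))
print(np.round(eigvals, 6))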

Page 13: Principal Components Analysis Babak Rasolzadeh Tuesday, 5th December 2006

[Scatter plot of the 2D data with the 1st Principal Component, y1, and the 2nd Principal Component, y2, drawn through it]

Page 14: Principal Components Analysis Babak Rasolzadeh Tuesday, 5th December 2006

PCA Scores

[Scatter plot showing a data point (xi1, xi2) and its scores yi,1 and yi,2 along the principal components]

Page 15: Principal Components Analysis Babak Rasolzadeh Tuesday, 5th December 2006

PCA Eigenvalues

[Scatter plot showing the eigenvalues λ1 and λ2 as the spread of the data along the two principal components]

Page 16: Principal Components Analysis Babak Rasolzadeh Tuesday, 5th December 2006

From k original variables: x1,x2,...,xk:

Produce k new variables: y1,y2,...,yk:

y1 = a11x1 + a12x2 + ... + a1kxk

y2 = a21x1 + a22x2 + ... + a2kxk

...

yk = ak1x1 + ak2x2 + ... + akkxk

such that:

yk's are uncorrelated (orthogonal)
y1 explains as much as possible of the original variance in the data set
y2 explains as much as possible of the remaining variance
etc.

yk's are the Principal Components

PCA: Another Explanation

Page 17: Principal Components Analysis Babak Rasolzadeh Tuesday, 5th December 2006

{a11,a12,...,a1k} is the 1st eigenvector of the correlation/covariance matrix, and gives the coefficients of the 1st principal component

{a21,a22,...,a2k} is the 2nd eigenvector of the correlation/covariance matrix, and gives the coefficients of the 2nd principal component

…

{ak1,ak2,...,akk} is the kth eigenvector of the correlation/covariance matrix, and gives the coefficients of the kth principal component

PCA: General

Page 18: Principal Components Analysis Babak Rasolzadeh Tuesday, 5th December 2006

PCA Summary until now

• Rotates the multivariate dataset into a new configuration that is easier to interpret

• Purposes
– simplify data
– look at relationships between variables
– look at patterns of units

Page 19: Principal Components Analysis Babak Rasolzadeh Tuesday, 5th December 2006

A 2D Numerical Example

Page 20: Principal Components Analysis Babak Rasolzadeh Tuesday, 5th December 2006

PCA Example –STEP 1

• Subtract the mean from each of the data dimensions. All the x values have the mean x̄ subtracted and all the y values have the mean ȳ subtracted from them. This produces a data set whose mean is zero.

Subtracting the mean makes the variance and covariance calculations easier by simplifying their equations. The variance and covariance values are not affected by the mean value.

Page 21: Principal Components Analysis Babak Rasolzadeh Tuesday, 5th December 2006

PCA Example –STEP 1

DATA:
x    y
2.5  2.4
0.5  0.7
2.2  2.9
1.9  2.2
3.1  3.0
2.3  2.7
2    1.6
1    1.1
1.5  1.6
1.1  0.9

ZERO MEAN DATA:
x      y
.69    .49
-1.31  -1.21
.39    .99
.09    .29
1.29   1.09
.49    .79
.19    -.31
-.81   -.81
-.31   -.31
-.71   -1.01
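A short numpy sketch of this step, using the data table above (the variable names are illustrative); subtracting the per-dimension mean reproduces the ZERO MEAN DATA column.

import numpy as np

# the ten (x, y) pairs from the DATA table above
data = np.array([
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
    [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
])

# subtract the mean of each dimension; the result matches the ZERO MEAN DATA table
mean = data.mean(axis=0)              # [1.81, 1.91]
zero_mean = data - mean
print(zero_mean)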

Page 22: Principal Components Analysis Babak Rasolzadeh Tuesday, 5th December 2006

PCA Example –STEP 1

Page 23: Principal Components Analysis Babak Rasolzadeh Tuesday, 5th December 2006

PCA Example –STEP 2

• Calculate the covariance matrix

cov = .616555556  .615444444
      .615444444  .716555556

• Since the non-diagonal elements of this covariance matrix are positive, we should expect the x and y variables to increase together.
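Continuing the numpy sketch from Step 1: np.cov with its default divisor of n − 1 (sample covariance) reproduces the matrix shown above.

# continuing from the zero-mean data above
cov = np.cov(zero_mean, rowvar=False)   # sample covariance, divides by n - 1
print(cov)
# [[0.61655556 0.61544444]
#  [0.61544444 0.71655556]]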

Page 24: Principal Components Analysis Babak Rasolzadeh Tuesday, 5th December 2006

PCA Example –STEP 3

• Calculate the eigenvectors and eigenvalues of the covariance matrix

eigenvalues = .0490833989

1.28402771

eigenvectors = -.735178656 -.677873399

.677873399 -.735178656
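A sketch of the same computation in numpy, continuing from the covariance matrix above; note that np.linalg.eigh returns the eigenvalues in ascending order and may flip the sign of an eigenvector relative to the slide.

# continuing from the covariance matrix above
eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric matrix: eigenvalues in ascending order
print(eigvals)                          # approx. [0.0490834, 1.28402771]
print(eigvecs)                          # columns are the eigenvectors (signs may differ from the slide)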

Page 25: Principal Components Analysis Babak Rasolzadeh Tuesday, 5th December 2006

PCA Example –STEP 3

• Eigenvectors are plotted as diagonal dotted lines on the plot.
• Note that they are perpendicular to each other.
• Note that one of the eigenvectors goes through the middle of the points, like drawing a line of best fit.
• The second eigenvector gives us the other, less important, pattern in the data: all the points follow the main line, but are off to the side of it by some amount.

Page 26: Principal Components Analysis Babak Rasolzadeh Tuesday, 5th December 2006

PCA Example –STEP 4

• Reduce dimensionality and form a feature vector. The eigenvector with the highest eigenvalue is the principal component of the data set.

In our example, the eigenvector with the largest eigenvalue was the one that pointed down the middle of the data.

Once the eigenvectors are found from the covariance matrix, the next step is to order them by eigenvalue, highest to lowest. This gives the components in order of significance.

Page 27: Principal Components Analysis Babak Rasolzadeh Tuesday, 5th December 2006

PCA Example –STEP 4

Now, if you like, you can decide to ignore the components of lesser significance.

You do lose some information, but if the eigenvalues are small, you don't lose much.

• n dimensions in your data
• calculate n eigenvectors and eigenvalues
• choose only the first p eigenvectors
• final data set has only p dimensions, as sketched below
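A minimal numpy sketch of this selection step, continuing from the eigen-decomposition above (the variable names p and feature_vector are illustrative).

# order the eigenvectors by eigenvalue, largest first, and keep only the first p
order = np.argsort(eigvals)[::-1]
eigvals_sorted = eigvals[order]
eigvecs_sorted = eigvecs[:, order]

p = 1                                    # keep only the most significant component
feature_vector = eigvecs_sorted[:, :p]   # 2 x 1 matrix of the chosen eigenvector
print(feature_vector)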

Page 28: Principal Components Analysis Babak Rasolzadeh Tuesday, 5th December 2006

PCA Example –STEP 4

• Feature Vector

FeatureVector = (eig1 eig2 eig3 … eign)

We can either form a feature vector with both of the eigenvectors:

-.677873399  -.735178656
-.735178656   .677873399

or we can choose to leave out the smaller, less significant component and keep only a single column:

-.677873399
-.735178656

Page 29: Principal Components Analysis Babak Rasolzadeh Tuesday, 5th December 2006

Reconstruction of original Data

x:
-.827970186
1.77758033
-.992197494
-.274210416
-1.67580142
-.912949103
.0991094375
1.14457216
.438046137
1.22382056
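These x values are the zero-mean data projected onto the single retained eigenvector (the first entry, for example, is .69 × −.677873399 + .49 × −.735178656 ≈ −.827970). A sketch of the projection and of the reconstruction back to the original space, continuing the numpy snippets from the earlier steps; the signs may be flipped depending on the eigenvector orientation returned by the solver, and with p < k the reconstruction is only an approximation.

# project the zero-mean data onto the retained component(s) ...
transformed = zero_mean @ feature_vector            # matches the x column above, up to sign

# ... and map back to the original space; with p < k this is only an approximation
reconstructed = transformed @ feature_vector.T + mean
print(transformed.ravel())
print(np.round(reconstructed, 6))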