![Page 1: New Sargur Srihari University at Buffalosrihari/CSE626/Lecture-Slides/... · 2010. 2. 23. · Ch3-part2-PCA.ppt Author: Sargur Srihari Created Date: 2/23/2010 5:38:55 PM](https://reader034.vdocument.in/reader034/viewer/2022051603/5ff034ed2edf922b821850d0/html5/thumbnails/1.jpg)
Principal Components Analysis
Sargur Srihari University at Buffalo
Topics
• Projection Pursuit Methods
• Principal Components
• Examples of using PCA
• Graphical use of PCA
• Multidimensional Scaling
Motivation
• Scatterplots
  – Good for two variables at a time
  – Disadvantage: may miss complicated relationships
• PCA is a method to transform the data into new variables
• Projections along different directions to detect relationships
  – Say along the direction defined by $2x_1 + 3x_2 + x_3 = 0$
Projection pursuit methods
• Allow searching for "interesting" directions
• Interesting means maximum variability
• Data in 2-d space projected to 1-d
Figure: data points in the ($x_1$, $x_2$) plane and their projection onto the line $2x_1 + 3x_2 = 0$
Task is to find a
Principal Components
• Find linear combinations that maximize variance subject to being uncorrelated with those already selected
• Hopefully there are few such linear combinations, known as principal components
• Task is to find a k-dimensional projection where 0 < k < d-1
Data Matrix Definition
X = n x d data matrix of n cases and d variables
Rows of X: $x(1)^T, \ldots, x(i)^T, \ldots, x(n)^T$
x(i) is a d x 1 column vector
Each row of the matrix is of the form $x(i)^T$
Assume X is mean-centered, so that the mean value of each variable is subtracted from that variable
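A minimal sketch of the mean-centering assumption, using NumPy. The function name `mean_center` and the small data matrix are hypothetical and chosen only for illustration.

```python
import numpy as np

def mean_center(X):
    """Subtract each column's mean so that every variable has zero mean."""
    X = np.asarray(X, dtype=float)
    return X - X.mean(axis=0)

# Hypothetical 4 x 2 data matrix (n = 4 cases, d = 2 variables).
X_raw = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0],
                  [6.0, 40.0]])
X = mean_center(X_raw)
column_means = X.mean(axis=0)  # approximately zero for every variable
```

Each row of `X` still corresponds to one case; only the origin of each variable has moved.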
Projection Definition
Let a be a p x 1 column vector of projection weights that results in the largest variance when the data X are projected along a (p is the number of variables, written d on other slides)
Projection of a data vector $x = (x_1, \ldots, x_p)^T$ onto $a = (a_1, \ldots, a_p)^T$ is the linear combination
$$a^T x = \sum_{j=1}^{p} a_j x_j$$
Projected values of all data vectors in X onto a are Xa, an n x 1 column vector: a set of scalar values corresponding to the n projected points
Since X is n x p and a is p x 1, Xa is n x 1
Variance along Projection
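Variance along a is
$$\sigma_a^2 = (Xa)^T (Xa) = a^T X^T X a = a^T V a$$
where $V = X^T X$ is the p × p covariance matrix of the data, since X has zero mean (strictly, $X^T X$ is the covariance matrix up to a constant factor such as 1/n, which does not change the maximizing direction)
Thus variance is a function of both the projection line a and the covariance matrix V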
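The identity between the projected sum of squares and the quadratic form can be checked numerically. This is a sketch on synthetic data; the weight vector `a` and the random data are assumptions made only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = X - X.mean(axis=0)          # mean-center, as the slides assume

a = np.array([2.0, 3.0, 1.0])   # hypothetical projection weights
a = a / np.linalg.norm(a)       # scale to unit length

z = X @ a                       # n projected scalar values, shape (100,)
V = X.T @ X                     # p x p matrix X^T X (n times the covariance)

var_direct = z @ z              # (Xa)^T (Xa)
var_quadratic = a @ V @ a       # a^T V a
```

Dividing either quantity by n gives the usual variance of the projected values, because the projections of mean-centered data also have zero mean.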
Maximization of Variance
Maximizing variance along a is not well-defined, since we can increase it without limit by increasing the size of the components of a.
Impose a normalization constraint on the a vectors such that $a^T a = 1$
Optimization problem is to maximize
$$u = a^T V a - \lambda (a^T a - 1)$$
where λ is a Lagrange multiplier. Differentiating with respect to a yields
$$\frac{\partial u}{\partial a} = 2Va - 2\lambda a = 0$$
which reduces to
$$(V - \lambda I)a = 0$$
Characteristic Equation!
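One step is left implicit here: why the largest root is the one to pick. A short derivation using only the symbols already defined, premultiplying the stationarity condition by $a^T$ and applying the constraint:

```latex
Va = \lambda a
\;\Longrightarrow\;
a^T V a = \lambda\, a^T a = \lambda
\qquad \text{(using } a^T a = 1 \text{)}
```

So the variance $\sigma_a^2 = a^T V a$ attained at any solution equals its multiplier λ, and the maximum variance is reached at the largest λ.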
What is the Characteristic Equation?
Given a d x d matrix V, a very important class of linear equations is of the form
$$Vx = \lambda x$$
with V of size d x d and x (on both sides) of size d x 1, which can be rewritten as
$$(V - \lambda I)x = 0$$
If V is real and symmetric there are d possible solution vectors, called Eigen vectors, $e_1, \ldots, e_d$, and associated Eigen values $\lambda_1, \ldots, \lambda_d$
Principal Component is obtained from the Covariance Matrix
If the matrix V is the Covariance matrix, then its Characteristic Equation is
$$(V - \lambda I)a = 0$$
Roots are Eigen values
Corresponding Eigen vectors are principal components
First principal component is the Eigen vector associated with the largest Eigen value of V.
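A minimal sketch of obtaining principal components from the covariance matrix with `numpy.linalg.eigh`, which handles symmetric matrices and returns Eigen values in ascending order. The helper name `principal_components`, the 1/n scaling, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def principal_components(X):
    """Return Eigen values (descending) and matching Eigen vectors (as columns)
    of the sample covariance matrix of X."""
    Xc = X - X.mean(axis=0)
    V = Xc.T @ Xc / Xc.shape[0]            # covariance with 1/n scaling (an assumption)
    eigvals, eigvecs = np.linalg.eigh(V)   # symmetric solver, ascending order
    order = np.argsort(eigvals)[::-1]      # reorder to descending
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(1)
# Synthetic 2-d data stretched along the direction (1, 1).
t = rng.normal(size=500)
X = np.column_stack([t, t]) + 0.1 * rng.normal(size=(500, 2))

eigvals, eigvecs = principal_components(X)
e1 = eigvecs[:, 0]                         # first principal component
```

For this data `e1` should point (up to sign) close to the direction (1, 1)/√2, the direction of largest variance.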
Other Principal Components
• Second Principal component is in direction orthogonal to first
• Has second largest Eigen value, etc
Figure: data in the ($X_1$, $X_2$) plane with the First Principal Component $e_1$ and the Second Principal Component $e_2$
Projection into k Eigen Vectors
• Variance of data projected into first k Eigen vectors $e_1, \ldots, e_k$ is
$$\sum_{j=1}^{k} \lambda_j$$
• Squared error in approximating true data matrix X using only first k Eigen vectors is
$$\frac{\sum_{j=k+1}^{d} \lambda_j}{\sum_{l=1}^{d} \lambda_l}$$
• How to choose k?
  – increase k until squared error is less than a threshold
Usually 5-10 principal components capture 90% variance in data
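The rule for choosing k can be sketched as follows. The function name `choose_k`, the threshold values, and the Eigen values are hypothetical; the error measure is the fraction of the Eigen value sum left out after the first k components.

```python
import numpy as np

def choose_k(eigvals, max_error=0.1):
    """Smallest k whose residual fraction sum_{j>k} lambda_j / sum_l lambda_l
    is below max_error. eigvals must be sorted in descending order."""
    eigvals = np.asarray(eigvals, dtype=float)
    total = eigvals.sum()
    for k in range(1, len(eigvals) + 1):
        residual = eigvals[k:].sum() / total
        if residual < max_error:
            return k
    return len(eigvals)

# Hypothetical Eigen values, already sorted in descending order.
lam = [4.0, 2.0, 1.0, 1.0]
k = choose_k(lam, max_error=0.2)   # residual fractions: 0.5, 0.25, 0.125, 0.0
```

Here `k` is 3, since 0.125 is the first residual fraction below the 0.2 threshold.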
Example of PCA: CPU data
Scree Plot: amount of variance explained by each consecutive Eigen value
Figure: scree plot of the Eigen values of the Correlation Matrix (axes: Eigen value number, percent variance explained)
CPU data, 8 Eigen values: 63.26 10.70 10.30 6.68 5.23 2.18 1.31 0.34
An example Eigen vector: weights put by first component $e_1$ on eight variables are 0.199 -0.365 -0.399 -0.336 -0.331 -0.298 -0.421 -0.423
Figure: Scatterplot Matrix of the CPU data
PCA using correlation matrix and covariance matrix
Proportions of variation attributable to different components: 96.02 3.93 0.04 0.01 0 0 0 0
Figure: two panels, Scree Plot Correlation Matrix and Scree Plot Covariance Matrix (axes in both: Eigen value number, percent variance explained)
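Why the two matrices can give very different scree plots is easiest to see when one variable has a much larger scale than the others. The following is a sketch on synthetic data, not the CPU data; the helper `variance_proportions` and the scale factor are assumptions.

```python
import numpy as np

def variance_proportions(M):
    """Eigen values of a symmetric matrix M as proportions of their sum, descending."""
    eigvals = np.linalg.eigvalsh(M)[::-1]
    return eigvals / eigvals.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 0] *= 1000.0                        # one variable on a much larger scale

cov = np.cov(X, rowvar=False)            # PCA on the raw variables
corr = np.corrcoef(X, rowvar=False)      # PCA on standardized variables

prop_cov = variance_proportions(cov)     # first component takes almost everything
prop_corr = variance_proportions(corr)   # spread is much more even
```

With the covariance matrix the large-scale variable dominates the first component; the correlation matrix removes that scale effect.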
Graphical Use of PCA
Projection onto first two principal components of six dimensional data
17 pills (data points)
Six values are times at which specified proportion of pill has dissolved: 10%, 30%, 50%, 70%, 75%, 90%
Pill 3 is very different
Figure: scatterplot of the pills (axes: Principal Component 1, Principal Component 2)
Computational Issue: Scaling with Dimensionality
• $O(nd^2 + d^3)$
  – $O(nd^2)$ to calculate V
  – $O(d^3)$ to solve Eigen value equations for the d x d matrix
• Can be applied to large numbers of records n
• But does not scale well with dimensionality d
• Also, appropriate scalings of variables have to be done
Multidimensional Scaling
• Using PCA to project on a plane is effective only if data lie on 2-d subspace
• Intrinsic Dimensionality
  – Data may lie on string or surface in d-space
  – E.g., when a digit image is translated and rotated, the images in pixel space lie on a 3-dimensional manifold (defined by location and orientation)
Goal of Multidimensional Scaling
• Detecting underlying structure
• Represent data in lower dimensional space so that distances are preserved
  – Distances between data points are mapped to a reduced space
• Typically displayed on a 2-d plot
• Begin with distances and then compute the plot
  – E.g., psychometrics and market research where similarities between objects are given by subjects
Defining the B Matrix
• For an n x d data matrix X we could compute n x n matrix $B = XX^T$
• We will see (next slide) that the squared Euclidean distance between the ith and jth objects is given by $d_{ij}^2 = b_{ii} + b_{jj} - 2b_{ij}$
• Matrices $XX^T$ and $X^TX$ are both meaningful
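The relation between B and the squared distances can be verified numerically. This is a sketch on a small random data matrix; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))            # hypothetical n = 5, d = 3 data matrix
B = X @ X.T                            # n x n matrix of inner products

diag = np.diag(B)
D2_from_B = diag[:, None] + diag[None, :] - 2.0 * B   # b_ii + b_jj - 2 b_ij

# Direct squared Euclidean distances for comparison.
diff = X[:, None, :] - X[None, :, :]
D2_direct = (diff ** 2).sum(axis=2)
```

The two matrices agree entry by entry, since $b_{ij} = x(i)^T x(j)$ and the squared distance expands to $x(i)^T x(i) + x(j)^T x(j) - 2x(i)^T x(j)$.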
$X^TX$ versus $XX^T$
• If X is n x d, d = 4
• $X^TX$ is d x d: (d x n)(n x d) gives d x d, the Covariance Matrix
• $B = XX^T$ is n x n: (n x d)(d x n) gives n x n
B Matrix contains distance information: $d_{ij}^2 = b_{ii} + b_{jj} - 2b_{ij}$
Factorizing the B matrix
• Given a matrix of distances D
  – Derived from original data by computing n(n-1)/2 distances
  – Compute elements of B by inverting $d_{ij}^2 = b_{ii} + b_{jj} - 2b_{ij}$
• Factorize B
  – in terms of eigen vectors to yield coordinates of points
  – Two largest eigen values would give 2-d representation
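A sketch of the whole procedure in NumPy, under the assumption that the recovered coordinates are centered at the origin. In that case inverting the distance relation amounts to double centering, $B = -\tfrac{1}{2} J D^{(2)} J$ with $J = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T$ and $D^{(2)}$ the matrix of squared distances. The function name `classical_mds` and the test points are hypothetical.

```python
import numpy as np

def classical_mds(D, k=2):
    """Principal coordinates: recover k-dimensional coordinates from an
    n x n matrix D of Euclidean distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # inner products of centered points
    eigvals, eigvecs = np.linalg.eigh(B)      # ascending order
    idx = np.argsort(eigvals)[::-1][:k]       # indices of the k largest eigen values
    lam = np.clip(eigvals[idx], 0.0, None)    # guard against tiny negative values
    return eigvecs[:, idx] * np.sqrt(lam)     # coordinates, one row per point

# Hypothetical 2-d points; only their distances are passed to the function.
P = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0], [1.0, 2.0]])
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)

Y = classical_mds(D, k=2)
D_rec = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
```

The recovered configuration `Y` matches the original points only up to rotation, reflection, and translation, but its pairwise distances reproduce `D`.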
Inverting distances to get B
$d_{ij}^2 = b_{ii} + b_{jj} - 2b_{ij}$
• Summing over i: can obtain $b_{jj}$
• Summing over j: can obtain $b_{ii}$
• Summing over i and j: can obtain tr(B)
Thus expressing $b_{ij}$ as a function of $d_{ij}^2$
Method is known as Principal Coordinates Method
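The three sums can be written out explicitly. This is a sketch under the assumption that the data are mean-centered, so that every row and column of B sums to zero ($\sum_i b_{ij} = \sum_j b_{ij} = 0$); the transcript itself does not preserve these equations.

```latex
\sum_{i=1}^{n} d_{ij}^2 = \operatorname{tr}(B) + n\,b_{jj}, \qquad
\sum_{j=1}^{n} d_{ij}^2 = \operatorname{tr}(B) + n\,b_{ii}, \qquad
\sum_{i=1}^{n}\sum_{j=1}^{n} d_{ij}^2 = 2n\,\operatorname{tr}(B)

b_{ij} = -\tfrac{1}{2}\left( d_{ij}^2
  - \tfrac{1}{n}\sum_{i=1}^{n} d_{ij}^2
  - \tfrac{1}{n}\sum_{j=1}^{n} d_{ij}^2
  + \tfrac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} d_{ij}^2 \right)
```

Substituting the three sums into the bracket leaves $d_{ij}^2 - b_{ii} - b_{jj} = -2b_{ij}$, which confirms the last line.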
Criterion for Multidimensional Scaling
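• Find projection into two dimensions to minimize
$$\sum_{i}\sum_{j} (\delta_{ij} - d_{ij})^2$$
where $\delta_{ij}$ is the observed distance between points i and j in d-space and $d_{ij}$ is the distance between the points in two-dimensional space
Criterion is invariant with respect to rotations and translations. However, it is not invariant to scaling.
Better criterion is
$$\frac{\sum_{i}\sum_{j} (\delta_{ij} - d_{ij})^2}{\sum_{i}\sum_{j} d_{ij}^2}$$
or its square root, called stress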
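A sketch of computing stress for a given two-dimensional configuration, assuming the commonly used normalized form $\sqrt{\sum (\delta_{ij} - d_{ij})^2 / \sum d_{ij}^2}$, where $\delta_{ij}$ are the observed distances and $d_{ij}$ the distances in the plot. The function name `stress` and the example points are hypothetical.

```python
import numpy as np

def stress(delta, Y):
    """sqrt( sum (delta_ij - d_ij)^2 / sum d_ij^2 ) over pairs i < j,
    where d_ij are Euclidean distances between the rows of Y."""
    d = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    iu = np.triu_indices(len(Y), k=1)         # count each pair once
    return np.sqrt(((delta[iu] - d[iu]) ** 2).sum() / (d[iu] ** 2).sum())

# Hypothetical configuration in two dimensions.
Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
delta_exact = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)

s_exact = stress(delta_exact, Y)              # distances match exactly
s_scaled = stress(delta_exact, 2.0 * Y)       # layout drawn twice too large
```

`s_exact` is zero, while `s_scaled` is 0.5 because every plotted distance is twice the observed one.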
Algorithm for Multidimensional Scaling
• Two stage procedure
• Assume that $d_{ij} = a + b\delta_{ij} + e_{ij}$, where $\delta_{ij}$ are the original dissimilarities and $e_{ij}$ is an error term
• Regression in 2-D on given dissimilarities yielding estimates for a and b
• Find new values of $d_{ij}$ that minimize the stress
• Repeat until convergence
Multidimensional Scaling Plot: Dialect Similarities
Numerical codes of villages and their counties
Each pair of villages rated by percentage of 60 items for which villagers used different words
We are able to visualize 625 distances intuitively
Variations of Multidimensional Scaling
• Above methods are called metric methods
• Sometimes precise similarities may not be known, only rank orderings
• Also may not be able to assume a particular form of relationship between $d_{ij}$ and $\delta_{ij}$
  – Requires a two-stage approach
  – Replace simple linear regression with monotonic regression
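The slides do not name an algorithm for monotonic regression. One standard choice is the pool-adjacent-violators algorithm, sketched here with equal weights; the function name and the example values are hypothetical.

```python
def monotonic_regression(y):
    """Least-squares nondecreasing fit to the sequence y using the
    pool-adjacent-violators algorithm (equal weights)."""
    blocks = []                      # each block holds [sum of values, count]
    for value in y:
        blocks.append([float(value), 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)   # every member gets the block mean
    return fitted

# Hypothetical plot distances, listed in order of increasing dissimilarity.
d_sorted_by_delta = [1.0, 3.0, 2.0, 4.0, 3.5, 5.0]
d_hat = monotonic_regression(d_sorted_by_delta)
# d_hat is [1.0, 2.5, 2.5, 3.75, 3.75, 5.0]
```

Only the rank order of the dissimilarities enters the fit, which is why this variant suits data where only rank orderings are known.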
Multidimensional Scaling: Disadvantages
• When there are too many data points structure becomes obscured
• Highly sophisticated transformations of the data (compared to scatterplots and PCA)
  – Possibility of introducing artifacts
  – Dissimilarities can be more accurately determined when the objects are similar than when they are very dissimilar
• Horseshoe effect when objects manufactured in a short time span differ greatly from objects separated by greater time gap
• Biplots show both data points and variables