CS 6890: Deep Learning
Razvan C. Bunescu
School of Electrical Engineering and Computer Science
Principal Component Analysis
Principal Component Analysis (PCA)
• A technique widely used for:
  – dimensionality reduction.
  – data compression.
  – feature extraction.
  – data visualization.
• Two equivalent definitions of PCA:
  1) Project the data onto a lower dimensional space such that the variance of the projected data is maximized.
  2) Project the data onto a lower dimensional space such that the mean squared distance between data points and their projections (the average projection cost) is minimized.

(Figure: geometric illustration of the two formulations, maximum variance vs. minimum error.)
PCA (Maximum Variance)
• Let $X = \{x_n\}_{1 \le n \le N}$ be a set of observations:
  – Each $x_n \in \mathbb{R}^D$ (D is the dimensionality of $x_n$).
• Project X onto an M-dimensional space (M < D) such that the variance of the projected X is maximized.
  – The minimum error formulation leads to the same solution [PRML 12.1.2].
    • It shows how PCA can be used for compression.
• Work out solution for M = 1, then generalize to any M < D.
PCA (Maximum Variance, M = 1)
• The lower dimensional space is defined by a vector $u_1 \in \mathbb{R}^D$.
  – Only the direction is important $\Rightarrow$ choose $\|u_1\| = 1$.
• Each $x_n$ is projected onto a scalar $u_1^T x_n$.
• The (sample) mean of the data is:
$$\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n$$
• The (sample) mean of the projected data is $u_1^T \bar{x}$.
PCA (Maximum Variance, M = 1)
• The (sample) variance of the projected data is:
$$\frac{1}{N}\sum_{n=1}^{N}\left(u_1^T x_n - u_1^T \bar{x}\right)^2 = u_1^T \Sigma u_1$$
where $\Sigma$ is the data covariance matrix:
$$\Sigma = \frac{1}{N}\sum_{n=1}^{N}\left(x_n - \bar{x}\right)\left(x_n - \bar{x}\right)^T$$
• The optimization problem is:
$$\text{minimize: } -u_1^T \Sigma u_1 \qquad \text{subject to: } u_1^T u_1 = 1$$
PCA (Maximum Variance, M = 1)
• The Lagrangian function is:
$$L_P(u_1, \lambda_1) = -u_1^T \Sigma u_1 + \lambda_1\left(u_1^T u_1 - 1\right)$$
where $\lambda_1$ is the Lagrange multiplier for the constraint $u_1^T u_1 = 1$.
• Setting the gradient to zero:
$$\frac{\partial L_P}{\partial u_1} = 0 \;\Rightarrow\; \Sigma u_1 = \lambda_1 u_1$$
so $u_1$ is an eigenvector of $\Sigma$ and $\lambda_1$ is an eigenvalue of $\Sigma$.
• Substituting back into the objective:
$$-u_1^T \Sigma u_1 = -\lambda_1 u_1^T u_1 = -\lambda_1$$
which is minimized when $\lambda_1$ is the largest eigenvalue of $\Sigma$.
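A minimal NumPy sketch of this result (not from the slides; the toy data and variable names are illustrative): the unit direction that maximizes the projected variance is the eigenvector of $\Sigma$ with the largest eigenvalue, and the projected variance equals that eigenvalue.

```python
import numpy as np

# Toy data: N = 500 observations in D = 3 dimensions (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [0.5, 1.0, 0.0],
                                          [0.2, 0.1, 0.3]])

x_bar = X.mean(axis=0)              # sample mean
Xc = X - x_bar
Sigma = Xc.T @ Xc / len(X)          # D x D covariance matrix

# For symmetric matrices, eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(Sigma)
u1 = eigvecs[:, -1]                 # eigenvector of the largest eigenvalue

# Variance of the 1-D projection u1^T x equals the largest eigenvalue.
print(np.var(X @ u1), eigvals[-1])  # approximately equal
```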
PCA (Maximum Variance, M = 1)
• $\lambda_1$ is the largest eigenvalue of $\Sigma$.
• $u_1$ is the eigenvector corresponding to $\lambda_1$:
  – also called the first principal component.
• For M < D dimensions:
  – $u_1, u_2, \ldots, u_M$ are the eigenvectors corresponding to the M largest eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_M$ of $\Sigma$.
  – proof by induction.
PCA on Normalized Data
• Preprocess the data $X = \{x^{(i)}\}_{1 \le i \le m}$ such that:
  – features have the same mean (0).
  – features have the same variance (1).
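A minimal sketch of this preprocessing step (the function name and the small `eps` guard are assumptions, not from the slides):

```python
import numpy as np

def normalize_features(X, eps=1e-8):
    """Standardize each column (feature) of X (m examples x D features)
    to mean 0 and variance 1."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / (sigma + eps)  # eps avoids division by zero for constant features
```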
PCA on Natural Images
• Stationarity: the statistics in one part of the image should be the same as in any other part.
  ⇒ no need for variance normalization.
  ⇒ do mean normalization by subtracting from each image its mean intensity.
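A minimal sketch of the per-image mean normalization described above, assuming each row of X holds one image's flattened pixel intensities (names are illustrative):

```python
import numpy as np

def mean_normalize_images(X):
    """Subtract each image's mean intensity; by the stationarity argument,
    no per-feature variance normalization is applied."""
    return X - X.mean(axis=1, keepdims=True)
```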
PCA on Normalized Data
• The covariance matrix (with X the D × m matrix whose columns are the normalized examples $x^{(i)}$) is:
$$\Sigma = \frac{1}{m} X X^T = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} \left(x^{(i)}\right)^T$$
• The eigenvectors are:
$$\Sigma u_j = \lambda_j u_j, \quad \text{where } \lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_D \text{ and } u_j^T u_j = 1$$
• Equivalently, in matrix form:
$$\Sigma U = U \Lambda, \quad U = [u_1, u_2, \ldots, u_D], \quad \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_D)$$
where $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_D$ and $U^T U = I$.
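A minimal NumPy sketch of this eigendecomposition, assuming X is the D × m matrix of normalized (zero-mean) examples stored as columns; the function name is illustrative:

```python
import numpy as np

def pca_eig(X):
    """Return U = [u_1, ..., u_D] and the eigenvalues in descending order,
    so that Sigma U = U Lambda and U^T U = I."""
    D, m = X.shape
    Sigma = X @ X.T / m                       # D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # re-sort so lambda_1 >= ... >= lambda_D
    return eigvecs[:, order], eigvals[order]
```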
PCA on Normalized Data
• U is an orthogonal (rotation) matrix, i.e. $U^T U = I$.
• The full transformation (rotation) of $x^{(i)}$ through PCA is:
$$y^{(i)} = U^T x^{(i)} \;\Rightarrow\; x^{(i)} = U y^{(i)}$$
• The k-dimensional projection of $x^{(i)}$ through PCA is:
$$y^{(i)} = U_{1,k}^T x^{(i)} = [u_1, \ldots, u_k]^T x^{(i)} \;\Rightarrow\; \hat{x}^{(i)} = U_{1,k}\, y^{(i)}$$
(the reconstruction $\hat{x}^{(i)}$ only approximates $x^{(i)}$ when k < D).
• How many components k should be used?
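A minimal sketch of the rotation, the k-dimensional projection, and the reconstruction, using U as returned by an eigendecomposition such as the hypothetical `pca_eig` above (X is D × m, columns are examples):

```python
import numpy as np

def pca_project(X, U, k):
    """Top-k projection: Y = U_{1,k}^T X, with shape k x m."""
    return U[:, :k].T @ X

def pca_reconstruct(Y, U, k):
    """Map a k x m projection back to D dimensions: X_hat = U_{1,k} Y."""
    return U[:, :k] @ Y

# With k = D the rotation is lossless: U @ (U.T @ X) recovers X exactly;
# with k < D the reconstruction is only an approximation of X.
```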
How many components k should be used?
• Compute percentage of variance retained by $Y = \{y^{(i)}\}$, for each value of k, where $y^{(i)} = [u_1, \ldots, u_k]^T x^{(i)}$:
$$\mathrm{Var}(k) = \sum_{j=1}^{k} \mathrm{Var}\left[y_j\right] = \sum_{j=1}^{k} \mathrm{Var}\left[u_j^T x\right]
= \frac{1}{m}\sum_{j=1}^{k}\sum_{i=1}^{m}\left(u_j^T x^{(i)} - u_j^T \bar{x}\right)^2
= \frac{1}{m}\sum_{j=1}^{k}\sum_{i=1}^{m}\left(u_j^T x^{(i)}\right)^2 = \sum_{j=1}^{k} \lambda_j$$
(the mean term drops out because the normalized data has $\bar{x} = 0$).

HW: Prove that $\mathrm{Var}\left[u_j^T x\right] = \lambda_j$.
How many components k should be used?
• Compute percentage of variance retained by $Y = \{y^{(i)}\}$, for each value of k:
  – Variance retained:
$$\mathrm{Var}(k) = \sum_{j=1}^{k} \lambda_j$$
  – Total variance:
$$\mathrm{Var}(D) = \sum_{j=1}^{D} \lambda_j$$
  – Percentage of variance retained:
$$P(k) = \frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{D} \lambda_j}$$
How many components k should be used?
• Compute percentage of variance retained by $Y = \{y^{(i)}\}$, for each value of k:
$$P(k) = \frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{D} \lambda_j}$$
• Choose the smallest k that retains 99% of the variance:
$$k^* = \min\left\{\, k \in \{1, \ldots, D\} : P(k) \ge 0.99 \,\right\}$$
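A minimal sketch of this selection rule (the function name is an assumption; `Lambda` is the vector of eigenvalues sorted in descending order):

```python
import numpy as np

def choose_k(Lambda, threshold=0.99):
    """Smallest k such that P(k) = (sum of top-k eigenvalues) / total >= threshold."""
    P = np.cumsum(Lambda) / np.sum(Lambda)      # P[k-1] corresponds to P(k)
    return int(np.argmax(P >= threshold)) + 1   # first k meeting the threshold
```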
PCA on Normalized Data:
(Figure: the normalized 2D data points $[x_1^{(i)}, x_2^{(i)}]^T$.)

Rotation through PCA:
(Figure: the same data after the full PCA rotation, $[u_1^T x^{(i)}, u_2^T x^{(i)}]^T$.)

1-Dimensional PCA Projection:
(Figure: the data projected onto the first principal component, $[u_1^T x^{(i)}, 0]^T$.)

1-Dimensional PCA Approximation:
(Figure: the 1-dimensional reconstruction in the original space, $u_1 u_1^T x^{(i)}$.)
PCA as a Linear Auto-Encoder
• The full transformation (rotation) of $x^{(i)}$ through PCA is:
$$y = U^T x \;\Rightarrow\; x = U y$$
• The k-dimensional projection of $x^{(i)}$ through PCA is:
$$y = U_{1,k}^T x = [u_1, \ldots, u_k]^T x \;\Rightarrow\; \hat{x} = U_{1,k}\, y = U_{1,k} U_{1,k}^T x$$
• The minimum error formulation of PCA:
$$U_{1,k}^* = \underset{U_{1,k}}{\arg\min} \sum_{i=1}^{m} \left\| U_{1,k} U_{1,k}^T x^{(i)} - x^{(i)} \right\|^2$$
This is exactly the objective of a linear auto-encoder with tied weights!
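A minimal sketch of this connection (names are illustrative, not from the slides): the objective above is the reconstruction loss of a linear auto-encoder whose encoder weights $U_{1,k}^T$ and decoder weights $U_{1,k}$ are tied; the helper below simply evaluates that loss for a candidate basis.

```python
import numpy as np

def tied_linear_autoencoder_loss(X, U1k):
    """X: D x m data matrix; U1k: D x k candidate basis (tied weights).
    Encoder: y = U1k.T @ x   Decoder: x_hat = U1k @ y.
    Returns the summed squared reconstruction error that PCA minimizes."""
    X_hat = U1k @ (U1k.T @ X)
    return float(np.sum((X_hat - X) ** 2))

# The top-k eigenvectors of the covariance matrix minimize this loss, so a
# gradient-trained linear auto-encoder with tied weights ends up spanning
# the same principal subspace.
```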
PCA as a Linear Auto-Encoder
(Figure: a linear auto-encoder with D = 4 inputs $x_1, \ldots, x_4$, a hidden layer of units $z_i^{(2)}$, and outputs reconstructing $x_1, \ldots, x_4$; the encoder weights implement $U_{1,k}^T$ and the decoder weights implement $U_{1,k}$.)

The weights attached to hidden unit i correspond to the principal component $u_i$:
$$u_i = \left[w_{i1}^{(1)}, w_{i2}^{(1)}, w_{i3}^{(1)}, w_{i4}^{(1)}\right]^T = \left[w_{1i}^{(2)}, w_{2i}^{(2)}, w_{3i}^{(2)}, w_{4i}^{(2)}\right]^T$$
PCA and Decorrelation
• The full transformation (rotation) of $x^{(i)}$ through PCA is:
$$y^{(i)} = U^T x^{(i)} \;\Rightarrow\; Y = U^T X$$
• What is the covariance matrix of the rotated data Y?
$$\frac{1}{m} Y Y^T = \frac{1}{m}\left(U^T X\right)\left(U^T X\right)^T = \frac{1}{m} U^T X X^T U = U^T \left(\frac{1}{m} X X^T\right) U = U^T \Sigma U = \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_D)$$
⇒ the features in y are decorrelated!
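A minimal numerical check of this derivation (same assumed conventions as above: X is D × m and zero-mean, U from the eigendecomposition):

```python
import numpy as np

def rotated_covariance(X, U):
    """Covariance of the rotated data Y = U^T X; it should be (numerically)
    diagonal, with the eigenvalues lambda_1, ..., lambda_D on the diagonal."""
    m = X.shape[1]
    Y = U.T @ X
    return Y @ Y.T / m   # off-diagonal entries ~ 0: decorrelated features
```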
PCA Whitening (Sphering)
• The goal of whitening is to make the input less redundant, i.e. the learning algorithm sees training inputs where:
  1. The features are not correlated with each other.
  2. The features all have the same variance.

1. PCA already results in uncorrelated features:
$$y^{(i)} = U^T x^{(i)} \;\Leftrightarrow\; Y = U^T X, \qquad \frac{1}{m} Y Y^T = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_D)$$
2. Transform to identity covariance (PCA Whitening):
$$y_j^{(i)} = \frac{u_j^T x^{(i)}}{\sqrt{\lambda_j}} \;\Leftrightarrow\; y^{(i)} = \Lambda^{-\frac{1}{2}} U^T x^{(i)} \;\Leftrightarrow\; Y = \Lambda^{-\frac{1}{2}} U^T X$$
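A minimal sketch of PCA whitening under the same assumed conventions (X is D × m and zero-mean; U and Lambda from the eigendecomposition):

```python
import numpy as np

def pca_whiten(X, U, Lambda):
    """Rotate with U^T, then rescale row j by 1/sqrt(lambda_j), so the
    whitened data has (approximately) identity covariance."""
    return (U.T @ X) / np.sqrt(Lambda)[:, None]
```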
PCA on Normalized Data:
(Figure: the normalized 2D data points $[x_1^{(i)}, x_2^{(i)}]^T$.)

Rotation through PCA:
(Figure: the same data after the full PCA rotation, $[u_1^T x^{(i)}, u_2^T x^{(i)}]^T$.)

PCA Whitening:
(Figure: the PCA-whitened data, $\left[\frac{u_1^T x^{(i)}}{\sqrt{\lambda_1}}, \frac{u_2^T x^{(i)}}{\sqrt{\lambda_2}}\right]^T$.)
ZCA Whitening (Sphering)
• Observation: If Y has identity covariance and R is an orthogonal matrix, then RY has identity covariance.
1. PCA Whitening:
$$Y_{PCA} = \Lambda^{-\frac{1}{2}} U^T X$$
2. ZCA Whitening:
$$Y_{ZCA} = U Y_{PCA} = U \Lambda^{-\frac{1}{2}} U^T X$$
Out of all rotations, U makes $Y_{ZCA}$ closest to the original X.
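A minimal sketch of ZCA whitening under the same assumptions; it is just the PCA-whitened data rotated back with U:

```python
import numpy as np

def zca_whiten(X, U, Lambda):
    """Y_ZCA = U Lambda^{-1/2} U^T X: identity covariance, while staying as
    close as possible to the original X among all rotations of Y_PCA."""
    return U @ ((U.T @ X) / np.sqrt(Lambda)[:, None])
```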
ZCA Whitening:
(Figure: the ZCA-whitened data, $Y_{ZCA} = U \Lambda^{-\frac{1}{2}} U^T X$.)
Smoothing
• When eigenvalues $\lambda_j$ are very close to 0, dividing by $\sqrt{\lambda_j}$ (i.e., scaling by $\lambda_j^{-1/2}$) is numerically unstable.
• Smoothing: add a small $\varepsilon$ to the eigenvalues before scaling for PCA/ZCA whitening:
$$y_j^{(i)} = \frac{u_j^T x^{(i)}}{\sqrt{\lambda_j + \varepsilon}}, \qquad \varepsilon \approx 10^{-5}$$
• ZCA whitening is a rough model of how the biological eye (the retina) processes images (through retinal neurons).
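A minimal sketch of the smoothed whitening transform, with the epsilon value suggested on the slide:

```python
import numpy as np

def pca_whiten_smoothed(X, U, Lambda, eps=1e-5):
    """Add a small eps to each eigenvalue before rescaling, so eigenvalues
    near zero do not blow up the whitened features."""
    return (U.T @ X) / np.sqrt(Lambda + eps)[:, None]
```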