Principal Components Analysis
Hassan A. Kingravi
IVALab
July 1, 2013
Goal
1. Consider data $X = \{x_i\}_{i=1}^n$, $x_i \in \mathbb{R}^d$. Suppose $d > 3$, but we would like to plot the data. How do we do this?
2. We can pick a few of the relevant axes and plot them to see what they look like.
3. Or, we can reduce the data to a few relevant variables.
Principal Components Analysis
The main idea behind PCA: project the data onto the axes in whose direction the data have maximum variance.
[Figure: scatter plot of two-dimensional example data]
Principal Components Analysis
Put another way: find another linear basis to represent the data, one which maximizes variance. In $d$-dimensional data, 'variance' (or the second moments) is represented by the covariance matrix. If the data matrix $X \in \mathbb{R}^{d \times n}$ has zero-mean columns, the covariance matrix is given by
\[
C = \frac{1}{n} X X^\top.
\]
Compute the eigendecomposition of this matrix:
\[
C = U \Lambda U^\top.
\]
The eigenvectors are the directions onto which you project your data:
\[
D = U^\top X.
\]
If you keep only the components with the largest eigenvalues (which measure, roughly, how much of the 'energy' of the signal lies along them), you are performing dimensionality reduction.
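To make this concrete, here is a minimal NumPy sketch of the procedure above; the synthetic data, the variable names, and the choice of keeping $k = 2$ components are illustrative additions, not part of the original slides.

```python
import numpy as np

# Synthetic data: n points in d = 3 dimensions, stored as the *columns*
# of X to match the slide's convention C = (1/n) X X^T.
rng = np.random.default_rng(0)
cov = [[3.0, 1.0, 0.0], [1.0, 2.0, 0.0], [0.0, 0.0, 0.1]]
X = rng.multivariate_normal([0.0, 0.0, 0.0], cov, size=200).T

X = X - X.mean(axis=1, keepdims=True)   # center: zero-mean columns
C = (X @ X.T) / X.shape[1]              # d x d covariance matrix

evals, U = np.linalg.eigh(C)            # eigendecomposition C = U Lambda U^T
order = np.argsort(evals)[::-1]         # eigh returns ascending order; flip it
evals, U = evals[order], U[:, order]

k = 2                                   # keep the k largest components
D = U[:, :k].T @ X                      # projected data, k x n
```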
Principal Components Analysis
Data projected onto new space: much simpler representation.
[Figure: the data in the new PCA coordinates]
PCA: The Dual View
The covariance matrix is defined to be
\[
C = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top. \tag{1}
\]
The diagonalization of the matrix can be written as
\[
C v = \lambda v, \tag{2}
\]
where $v$ is the eigenvector and $\lambda$ the eigenvalue. Plugging (1) into (2) allows us to write the eigenvector in terms of the data (i.e. the dual view) as
\[
C v = \frac{1}{n} \sum_{i=1}^{n} \langle x_i, v \rangle_{\mathbb{R}^d} \, x_i.
\]
Therefore, the eigenvector equation is equivalent to solving the problem
\[
\lambda \langle x_i, v \rangle_{\mathbb{R}^d} = \langle x_i, C v \rangle_{\mathbb{R}^d} \quad \forall i = 1, \dots, n. \tag{3}
\]
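As a quick sanity check of (3), the following illustrative snippet (not from the slides) verifies numerically that an eigenpair $(\lambda, v)$ of $C$ satisfies $\lambda \langle x_i, v \rangle = \langle x_i, C v \rangle$ for every data point:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 50))   # d x n data matrix
C = (X @ X.T) / X.shape[1]

evals, V = np.linalg.eigh(C)
lam, v = evals[-1], V[:, -1]       # top eigenpair of C

lhs = lam * (X.T @ v)              # lambda <x_i, v> for all i at once
rhs = X.T @ (C @ v)                # <x_i, C v>      for all i at once
print(np.allclose(lhs, rhs))       # True: condition (3) holds for every i
```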
Kernelizing PCA
The covariance matrix in feature space is defined to be
\[
\bar{C} = \frac{1}{n} \sum_{i=1}^{n} \psi(x_i) \psi(x_i)^\top. \tag{4}
\]
The diagonalization of the matrix can be written as
\[
\bar{C} V = \lambda V, \tag{5}
\]
where $V$ is the eigenvector and $\lambda$ the eigenvalue. Plugging (4) into (5) allows us to write the eigenvector in terms of the data as
\[
\bar{C} V = \frac{1}{n} \sum_{i=1}^{n} \langle \psi(x_i), V \rangle_{\mathcal{H}} \, \psi(x_i).
\]
Therefore, the eigenvector equation is equivalent to solving the problem
\[
\lambda \langle \psi(x_i), V \rangle_{\mathcal{H}} = \langle \psi(x_i), \bar{C} V \rangle_{\mathcal{H}} \quad \forall i = 1, \dots, n. \tag{6}
\]
Kernelizing PCA
Due to the dual view, we can write eigenvectors as
\[
V = \sum_{i=1}^{n} \alpha_i \psi(x_i), \tag{7}
\]
for some coefficients $\alpha_i$. Plugging this into the previous equation, we get
\[
\lambda \sum_{i=1}^{n} \alpha_i \langle \psi(x_k), \psi(x_i) \rangle_{\mathcal{H}} = \sum_{i=1}^{n} \alpha_i \langle \psi(x_k), \bar{C} \psi(x_i) \rangle_{\mathcal{H}}
\]
\[
\lambda \sum_{i=1}^{n} \alpha_i \langle \psi(x_k), \psi(x_i) \rangle_{\mathcal{H}} = \sum_{i=1}^{n} \alpha_i \left\langle \psi(x_k), \frac{1}{n} \Bigl( \sum_{j=1}^{n} \psi(x_j) \psi(x_j)^\top \Bigr) \psi(x_i) \right\rangle_{\mathcal{H}}
\]
\[
\lambda \sum_{i=1}^{n} \alpha_i \langle \psi(x_k), \psi(x_i) \rangle_{\mathcal{H}} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \langle \psi(x_i), \psi(x_j) \rangle_{\mathcal{H}} \langle \psi(x_j), \psi(x_k) \rangle_{\mathcal{H}}.
\]
Doing this for every data point yields the matrix equation
\[
n \lambda K \alpha = K^2 \alpha \;\Rightarrow\; K \alpha = n \lambda \alpha.
\]
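The final equation is an ordinary eigenproblem on the $n \times n$ Gram matrix $K$. A minimal sketch, assuming a Gaussian kernel; the kernel choice, the bandwidth `sigma`, and the helper name `gram_matrix` are illustrative, not prescribed by the slides:

```python
import numpy as np

def gram_matrix(X, sigma=1.0):
    """Gaussian Gram matrix K[i, j] = k(x_i, x_j) for the rows x_i of X."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # pairwise squared distances
    return np.exp(-d2 / (2.0 * sigma**2))

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 2))       # n points as rows

K = gram_matrix(X)
evals, alphas = np.linalg.eigh(K)       # solves K alpha = (n lambda) alpha
evals, alphas = evals[::-1], alphas[:, ::-1]   # largest eigenvalues first
```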
Kernel PCA
So the algorithm becomes clear. Ignoring a subtlety (the data may not be centered in feature space), perform the eigendecomposition
\[
K \alpha^k = \lambda_k \alpha^k. \tag{8}
\]
Then the eigenvectors are given by
\[
V^k = \sum_{i=1}^{n} \alpha_i^k \psi(x_i).
\]
We require that the vectors be normalized, i.e. $\langle V^k, V^k \rangle_{\mathcal{H}} = 1$, because we want an orthonormal system. It can be shown that this boils down to requiring
\[
\langle \alpha^k, \alpha^k \rangle_{\mathbb{R}^n} = \frac{1}{\lambda_k}.
\]
Then the projection of the data onto the KPCA eigenspace is given by
\[
\langle V^k, \psi(x_j) \rangle_{\mathcal{H}} = \Bigl\langle \sum_{i=1}^{n} \alpha_i^k \psi(x_i), \psi(x_j) \Bigr\rangle_{\mathcal{H}} \tag{9}
\]
\[
= \sum_{i=1}^{n} \alpha_i^k \, k(x_i, x_j). \tag{10}
\]
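Putting the pieces together, here is a compact sketch of the full procedure, reusing the hypothetical `gram_matrix` helper from the previous snippet. The double-centering of $K$ handles the subtlety mentioned above (data not centered in feature space), and the scaling by $1/\sqrt{\lambda_k}$ implements the normalization $\langle \alpha^k, \alpha^k \rangle = 1/\lambda_k$:

```python
import numpy as np

def kernel_pca(K, num_components):
    """KPCA projections of the training data from an (uncentered) Gram matrix K."""
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one    # center in feature space

    evals, alphas = np.linalg.eigh(Kc)
    evals, alphas = evals[::-1], alphas[:, ::-1]  # descending order

    lam = evals[:num_components]                  # assumed positive here
    A = alphas[:, :num_components] / np.sqrt(lam) # <alpha^k, alpha^k> = 1/lambda_k

    # Row j holds <V^k, psi(x_j)> = sum_i alpha_i^k k(x_i, x_j),
    # i.e. equations (9)-(10) applied to the training points themselves.
    return Kc @ A

# Example usage with the Gram matrix K from the previous snippet:
# embedding = kernel_pca(K, num_components=2)
```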
Gaussian Kernel
[Figure: data generated from Gaussian mixture models]
Gaussian Kernel
[Figure: Gaussian KPCA embedding for the GMM data]
Gaussian Kernel: Eigenfunctions
[Figures: Gaussian KPCA eigenfunctions 1 and 2 (centered)]
Polynomial Kernel
[Figure: non-linearly separable data]
Polynomial Kernel
[Figure: polynomial KPCA embedding for the nonseparable data]
Polynomial Kernel: Eigenfunctions
[Figures: polynomial KPCA eigenfunctions 1 and 2 (centered)]
Reproducing Kernel Hilbert Spaces
The embedding of the data for the polynomial kernel should look familiar: it is almost as if we had access to the feature map itself. So far, we know that if we have the kernel, we can avoid computing the feature map and operate in $\mathcal{H}$ without $\psi$. In fact, KPCA allows us to compute low-rank approximations to $\psi$. To understand how, we need to understand what the kernel matrix is approximating in the limit.
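As one concrete way to realize such a low-rank approximation (an illustrative construction in the spirit of the Nyström method, not something spelled out on the slide): from the eigendecomposition of the Gram matrix we can build an explicit finite-dimensional map whose inner products reproduce the kernel on the training data.

```python
import numpy as np

def empirical_feature_map(K, r):
    """Rank-r map: row i approximates psi(x_i), so Phi @ Phi.T ~ K."""
    evals, alphas = np.linalg.eigh(K)
    evals, alphas = evals[::-1], alphas[:, ::-1]
    # Scale the top-r eigenvectors by the square roots of their eigenvalues;
    # clipping guards against small negative values from round-off.
    return alphas[:, :r] * np.sqrt(np.maximum(evals[:r], 0.0))

# Phi = empirical_feature_map(K, r) gives the best rank-r approximation
# Phi @ Phi.T to K, so the explicit map becomes exact as r grows to n.
```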
Integral Operators
Consider yet another perspective on the kernel matrix: as an operator. For a dataset $X = \{x_i\}_{i=1}^n$, $x_i \in \mathbb{R}$, suppose you want to smooth (or interpolate) the data using a smoothing kernel $k(x, y) : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$. Then the smoothed version of the data can be computed as
\[
\begin{pmatrix} \tilde{x}_1 \\ \tilde{x}_2 \\ \vdots \\ \tilde{x}_n \end{pmatrix}
=
\begin{pmatrix}
k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\
k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\
\vdots & \vdots & \ddots & \vdots \\
k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n)
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}.
\]
This smoothing, in the limit, can be written as the integral operator
\[
(K f)(x) := \int_D k(x, y) f(y) \, dy, \tag{11}
\]
for $f \in L^2(D)$. Depending on the structure of $k(x, y)$, $f$ is projected onto a subspace of $L^2(D)$.
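The finite-sample version of (11) is exactly the matrix product above. A small illustrative sketch; the noisy-sine data, the bandwidth, and the row normalization are illustrative additions:

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0.0, 2.0 * np.pi, 100)
x = np.sin(t) + 0.3 * rng.standard_normal(t.size)    # noisy signal samples

# Gaussian smoothing kernel evaluated on all pairs of sample locations.
sigma = 0.5
K = np.exp(-(t[:, None] - t[None, :])**2 / (2.0 * sigma**2))
K = K / K.sum(axis=1, keepdims=True)   # row-normalize so each output is an average

x_smooth = K @ x                       # discrete analogue of (K f)(x)
```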
Mercer’s Theorem
Mercer's theorem: the eigendecomposition of the operator, $(\lambda_\iota, \varphi_\iota)_{\iota=1}^N$, yields an orthonormal basis (ONB) of $L^2(D)$.

Kernel:
\[
k(x, y) = \sum_{\iota=1}^{N} \lambda_\iota \varphi_\iota(x) \varphi_\iota(y), \qquad N \in \mathbb{N} \cup \{\infty\}.
\]
Feature map:
\[
\psi(x) := \bigl( \sqrt{\lambda_1}\, \varphi_1(x), \sqrt{\lambda_2}\, \varphi_2(x), \dots \bigr), \qquad k(x, y) = \langle \psi(x), \psi(y) \rangle_{\mathcal{H}}.
\]
Next Time
We will start Gaussian process regression.