Principal Components Analysis
Hassan A. Kingravi
IVALab
July 1, 2013
Goal
1. Consider data $X = \{x_i\}_{i=1}^n$, $x_i \in \mathbb{R}^d$. Suppose $d > 3$, but we would like to plot the data. How do we do this?
2. We can pick a few of the relevant axes and plot them to see what they look like.
3. Or, we can reduce the data to a few relevant variables.
Principal Components Analysis
The main idea behind PCA: project the data onto the axes in whose direction the data have maximum variance.
[Figure: scatter plot of two-dimensional example data]
Principal Components Analysis
Put another way: find another linear basis to represent the data, one which maximizes variance. In $d$-dimensional data, 'variance' (or the second moments) is represented by the covariance matrix. If the data matrix $X \in \mathbb{R}^{d \times n}$ has zero-mean columns, the covariance matrix is given by
\[
C = \frac{1}{n} X X^\top.
\]
Compute the eigendecomposition of this matrix:
\[
C = U \Lambda U^\top.
\]
The eigenvectors are the directions onto which you project your data:
\[
D = U^\top X.
\]
If you keep only the components with the largest eigenvalues (which measure, roughly, how much of the 'energy' of the signal lies along them), you are performing dimensionality reduction.
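To make this concrete, here is a minimal NumPy sketch of the procedure above; the synthetic data, the variable names, and the choice of keeping $k = 2$ components are illustrative additions, not part of the original slides.

```python
import numpy as np

# Synthetic data: n points in d = 3 dimensions, stored as the *columns*
# of X to match the slide's convention C = (1/n) X X^T.
rng = np.random.default_rng(0)
cov = [[3.0, 1.0, 0.0], [1.0, 2.0, 0.0], [0.0, 0.0, 0.1]]
X = rng.multivariate_normal([0.0, 0.0, 0.0], cov, size=200).T

X = X - X.mean(axis=1, keepdims=True)   # center: zero-mean columns
C = (X @ X.T) / X.shape[1]              # d x d covariance matrix

evals, U = np.linalg.eigh(C)            # eigendecomposition C = U Lambda U^T
order = np.argsort(evals)[::-1]         # eigh returns ascending order; flip it
evals, U = evals[order], U[:, order]

k = 2                                   # keep the k largest components
D = U[:, :k].T @ X                      # projected data, k x n
```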
Principal Components Analysis
Data projected onto new space: much simpler representation.
[Figure: the data in the new PCA coordinates]
PCA: The Dual View
The covariance matrix is defined to be
\[
C = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top. \tag{1}
\]
The diagonalization of the matrix can be written as
\[
C v = \lambda v, \tag{2}
\]
where $v$ is the eigenvector and $\lambda$ the eigenvalue. Plugging (1) into (2) allows us to write the eigenvector in terms of the data (i.e. the dual view) as
\[
C v = \frac{1}{n} \sum_{i=1}^{n} \langle x_i, v \rangle_{\mathbb{R}^d} \, x_i.
\]
Therefore, the eigenvector equation is equivalent to solving the problem
\[
\lambda \langle x_i, v \rangle_{\mathbb{R}^d} = \langle x_i, C v \rangle_{\mathbb{R}^d} \quad \forall i = 1, \dots, n. \tag{3}
\]
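As a quick sanity check of (3), the following illustrative snippet (not from the slides) verifies numerically that an eigenpair $(\lambda, v)$ of $C$ satisfies $\lambda \langle x_i, v \rangle = \langle x_i, C v \rangle$ for every data point:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 50))   # d x n data matrix
C = (X @ X.T) / X.shape[1]

evals, V = np.linalg.eigh(C)
lam, v = evals[-1], V[:, -1]       # top eigenpair of C

lhs = lam * (X.T @ v)              # lambda <x_i, v> for all i at once
rhs = X.T @ (C @ v)                # <x_i, C v>      for all i at once
print(np.allclose(lhs, rhs))       # True: condition (3) holds for every i
```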
Kernelizing PCA
The covariance matrix in feature space is defined to be
\[
\bar{C} = \frac{1}{n} \sum_{i=1}^{n} \psi(x_i) \psi(x_i)^\top. \tag{4}
\]
The diagonalization of the matrix can be written as
\[
\bar{C} V = \lambda V, \tag{5}
\]
where $V$ is the eigenvector and $\lambda$ the eigenvalue. Plugging (4) into (5) allows us to write the eigenvector in terms of the data as
\[
\bar{C} V = \frac{1}{n} \sum_{i=1}^{n} \langle \psi(x_i), V \rangle_{\mathcal{H}} \, \psi(x_i).
\]
Therefore, the eigenvector equation is equivalent to solving the problem
\[
\lambda \langle \psi(x_i), V \rangle_{\mathcal{H}} = \langle \psi(x_i), \bar{C} V \rangle_{\mathcal{H}} \quad \forall i = 1, \dots, n. \tag{6}
\]
Kernelizing PCA
Due to the dual view, we can write eigenvectors as
\[
V = \sum_{i=1}^{n} \alpha_i \psi(x_i), \tag{7}
\]
for some coefficients $\alpha_i$. Plugging this into the previous equation, we get
\[
\lambda \sum_{i=1}^{n} \alpha_i \langle \psi(x_k), \psi(x_i) \rangle_{\mathcal{H}} = \sum_{i=1}^{n} \alpha_i \langle \psi(x_k), \bar{C} \psi(x_i) \rangle_{\mathcal{H}}
\]
\[
\lambda \sum_{i=1}^{n} \alpha_i \langle \psi(x_k), \psi(x_i) \rangle_{\mathcal{H}} = \sum_{i=1}^{n} \alpha_i \left\langle \psi(x_k), \frac{1}{n} \Bigl( \sum_{j=1}^{n} \psi(x_j) \psi(x_j)^\top \Bigr) \psi(x_i) \right\rangle_{\mathcal{H}}
\]
\[
\lambda \sum_{i=1}^{n} \alpha_i \langle \psi(x_k), \psi(x_i) \rangle_{\mathcal{H}} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \langle \psi(x_i), \psi(x_j) \rangle_{\mathcal{H}} \langle \psi(x_j), \psi(x_k) \rangle_{\mathcal{H}}.
\]
Doing this for every data point yields the matrix equation
\[
n \lambda K \alpha = K^2 \alpha \;\Rightarrow\; K \alpha = n \lambda \alpha.
\]
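The final equation is an ordinary eigenproblem on the $n \times n$ Gram matrix $K$. A minimal sketch, assuming a Gaussian kernel; the kernel choice, the bandwidth `sigma`, and the helper name `gram_matrix` are illustrative, not prescribed by the slides:

```python
import numpy as np

def gram_matrix(X, sigma=1.0):
    """Gaussian Gram matrix K[i, j] = k(x_i, x_j) for the rows x_i of X."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # pairwise squared distances
    return np.exp(-d2 / (2.0 * sigma**2))

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 2))       # n points as rows

K = gram_matrix(X)
evals, alphas = np.linalg.eigh(K)       # solves K alpha = (n lambda) alpha
evals, alphas = evals[::-1], alphas[:, ::-1]   # largest eigenvalues first
```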
Kernel PCA
So the algorithm becomes clear. Ignoring a subtlety (the data may not be centered in feature space), perform the eigendecomposition
\[
K \alpha^k = \lambda_k \alpha^k. \tag{8}
\]
Then the eigenvectors are given by
\[
V^k = \sum_{i=1}^{n} \alpha_i^k \psi(x_i).
\]
We require that the vectors be normalized, i.e. $\langle V^k, V^k \rangle_{\mathcal{H}} = 1$, because we want an orthonormal system. It can be shown that this boils down to requiring
\[
\langle \alpha^k, \alpha^k \rangle_{\mathbb{R}^n} = \frac{1}{\lambda_k}.
\]
Then the projection of the data onto the KPCA eigenspace is given by
\[
\langle V^k, \psi(x_j) \rangle_{\mathcal{H}} = \Bigl\langle \sum_{i=1}^{n} \alpha_i^k \psi(x_i), \psi(x_j) \Bigr\rangle_{\mathcal{H}} \tag{9}
\]
\[
= \sum_{i=1}^{n} \alpha_i^k \, k(x_i, x_j). \tag{10}
\]
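Putting the pieces together, here is a compact sketch of the full procedure, reusing the hypothetical `gram_matrix` helper from the previous snippet. The double-centering of $K$ handles the subtlety mentioned above (data not centered in feature space), and the scaling by $1/\sqrt{\lambda_k}$ implements the normalization $\langle \alpha^k, \alpha^k \rangle = 1/\lambda_k$:

```python
import numpy as np

def kernel_pca(K, num_components):
    """KPCA projections of the training data from an (uncentered) Gram matrix K."""
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one    # center in feature space

    evals, alphas = np.linalg.eigh(Kc)
    evals, alphas = evals[::-1], alphas[:, ::-1]  # descending order

    lam = evals[:num_components]                  # assumed positive here
    A = alphas[:, :num_components] / np.sqrt(lam) # <alpha^k, alpha^k> = 1/lambda_k

    # Row j holds <V^k, psi(x_j)> = sum_i alpha_i^k k(x_i, x_j),
    # i.e. equations (9)-(10) applied to the training points themselves.
    return Kc @ A

# Example usage with the Gram matrix K from the previous snippet:
# embedding = kernel_pca(K, num_components=2)
```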
Gaussian Kernel
[Figure: data generated from Gaussian mixture models]
Gaussian Kernel
[Figure: Gaussian KPCA embedding for the GMM data]
Gaussian Kernel: Eigenfunctions
[Figures: Gaussian KPCA eigenfunctions 1 and 2 (centered)]
Polynomial Kernel
[Figure: non-linearly separable data]
Polynomial Kernel
[Figure: polynomial KPCA embedding for the nonseparable data]
Polynomial Kernel: Eigenfunctions
[Figures: polynomial KPCA eigenfunctions 1 and 2 (centered)]
Reproducing Kernel Hilbert Spaces
The embedding of the data for the polynomial kernel should look familiar: it is almost as if we had access to the feature map itself. So far, we know that if we have the kernel, we can avoid computing the feature map and operate in $\mathcal{H}$ without $\psi$. In fact, KPCA allows us to compute low-rank approximations to $\psi$. To understand how, we need to understand what the kernel matrix is approximating in the limit.
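As one concrete way to realize such a low-rank approximation (an illustrative construction in the spirit of the Nyström method, not something spelled out on the slide): from the eigendecomposition of the Gram matrix we can build an explicit finite-dimensional map whose inner products reproduce the kernel on the training data.

```python
import numpy as np

def empirical_feature_map(K, r):
    """Rank-r map: row i approximates psi(x_i), so Phi @ Phi.T ~ K."""
    evals, alphas = np.linalg.eigh(K)
    evals, alphas = evals[::-1], alphas[:, ::-1]
    # Scale the top-r eigenvectors by the square roots of their eigenvalues;
    # clipping guards against small negative values from round-off.
    return alphas[:, :r] * np.sqrt(np.maximum(evals[:r], 0.0))

# Phi = empirical_feature_map(K, r) gives the best rank-r approximation
# Phi @ Phi.T to K, so the explicit map becomes exact as r grows to n.
```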
Integral Operators
Consider yet another perspective on the kernel matrix: as an operator. For a dataset $X = \{x_i\}_{i=1}^n$, $x_i \in \mathbb{R}$, suppose you want to smooth (or interpolate) the data using a smoothing kernel $k(x, y) : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$. Then the smoothed version of the data can be computed as
\[
\begin{pmatrix} \tilde{x}_1 \\ \tilde{x}_2 \\ \vdots \\ \tilde{x}_n \end{pmatrix}
=
\begin{pmatrix}
k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\
k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\
\vdots & \vdots & \ddots & \vdots \\
k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n)
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}.
\]
This smoothing, in the limit, can be written as the integral operator
\[
(K f)(x) := \int_D k(x, y) f(y) \, dy, \tag{11}
\]
for $f \in L^2(D)$. Depending on the structure of $k(x, y)$, $f$ is projected onto a subspace of $L^2(D)$.
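The finite-sample version of (11) is exactly the matrix product above. A small illustrative sketch; the noisy-sine data, the bandwidth, and the row normalization are illustrative additions:

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0.0, 2.0 * np.pi, 100)
x = np.sin(t) + 0.3 * rng.standard_normal(t.size)    # noisy signal samples

# Gaussian smoothing kernel evaluated on all pairs of sample locations.
sigma = 0.5
K = np.exp(-(t[:, None] - t[None, :])**2 / (2.0 * sigma**2))
K = K / K.sum(axis=1, keepdims=True)   # row-normalize so each output is an average

x_smooth = K @ x                       # discrete analogue of (K f)(x)
```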
Mercer’s Theorem
Mercer's theorem: the eigendecomposition of the operator, $(\lambda_\iota, \varphi_\iota)_{\iota=1}^N$, yields an orthonormal basis (ONB) of $L^2(D)$.

Kernel:
\[
k(x, y) = \sum_{\iota=1}^{N} \lambda_\iota \varphi_\iota(x) \varphi_\iota(y), \qquad N \in \mathbb{N} \cup \{\infty\}.
\]
Feature map:
\[
\psi(x) := \bigl( \sqrt{\lambda_1}\, \varphi_1(x), \sqrt{\lambda_2}\, \varphi_2(x), \dots \bigr), \qquad k(x, y) = \langle \psi(x), \psi(y) \rangle_{\mathcal{H}}.
\]
Next Time
We will start Gaussian process regression.