
Page 1

Goal · PCA · Kernel PCA: Derivation · Kernel PCA: Examples · KPCA and Feature Space

Principal Components Analysis

Hassan A. Kingravi

IVALab

July 1, 2013

Page 2

Goal

1. Consider data X = {x_i}_{i=1}^n, x_i ∈ R^d. Suppose d > 3, but we would like to plot the data. How do we do this?

2. We can pick a few of the relevant axes and plot them to see what they look like.

3. Or, we can reduce the data to a few relevant variables.

Page 3

Principal Components Analysis

The main idea behind PCA: project the data onto the axes in whose direction the data have maximum variance.

Page 5

Principal Components Analysis

Put another way: find another linear basis in which to represent the data, one that maximizes variance. In d-dimensional data, ‘variance’ (the second moments) is represented by the covariance matrix. If the data matrix X ∈ R^{d×n} has zero-mean columns, the covariance matrix is given by

C = \frac{1}{n} X X^T.

Compute the eigendecomposition of this matrix:

C = U \Lambda U^T.

The eigenvectors are the directions that you project your data onto:

D = U^T X.

If you keep only the components with the largest eigenvalues (which measure, roughly, how much of the ‘energy’ of the signal lies within them), you are performing dimensionality reduction.
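
Since the whole procedure is three linear-algebra steps, a minimal NumPy sketch is easy to give (this is not from the slides; the d × n data layout and the toy data are assumptions chosen to match the slide's formulas):

```python
# Minimal PCA sketch: center the data, form C = (1/n) X X^T,
# eigendecompose, and project onto the top-variance directions.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
# toy data: 5 coordinates with very different scales (illustrative)
X = rng.standard_normal((d, n)) * np.array([3.0, 1.0, 0.5, 0.1, 0.1])[:, None]

X = X - X.mean(axis=1, keepdims=True)   # zero-mean the columns
C = (X @ X.T) / n                       # covariance matrix (d x d)
evals, U = np.linalg.eigh(C)            # eigendecomposition C = U Lambda U^T
order = np.argsort(evals)[::-1]         # sort by decreasing eigenvalue
evals, U = evals[order], U[:, order]

k = 2
D = U[:, :k].T @ X                      # projected data (k x n): D = U^T X
print("energy captured:", evals[:k].sum() / evals.sum())
```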

Page 6

Principal Components Analysis

Data projected onto new space: much simpler representation.

Page 7

PCA: The Dual View

The covariance matrix is defined to be

C = \frac{1}{n} \sum_{i=1}^n x_i x_i^T. \quad (1)

The diagonalization of the matrix can be written as

C v = \lambda v, \quad (2)

where v is the eigenvector and λ the eigenvalue. Plugging (1) into (2) allows us to write the eigenvector in terms of the data (i.e. the dual view) as

C v = \frac{1}{n} \sum_{i=1}^n \langle x_i, v \rangle_{\mathbb{R}^d} \, x_i.

Therefore, the eigenvector equation is equivalent to solving the problem

\lambda \langle x_i, v \rangle_{\mathbb{R}^d} = \langle x_i, C v \rangle_{\mathbb{R}^d} \quad \forall i = 1, \ldots, n. \quad (3)
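
As a quick numerical illustration of the dual view (again not from the slides; the toy data are arbitrary), the sketch below checks eq. (3) for the top eigenpair of C and verifies that the eigenvector can be written as a linear combination of the data points:

```python
# Check eq. (3) and the "eigenvector lies in the span of the data" claim.
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 50
X = rng.standard_normal((d, n))
X -= X.mean(axis=1, keepdims=True)
C = (X @ X.T) / n

evals, V = np.linalg.eigh(C)
lam, v = evals[-1], V[:, -1]          # top eigenpair of C

lhs = lam * (X.T @ v)                 # lambda <x_i, v> for every i
rhs = X.T @ (C @ v)                   # <x_i, C v> for every i
print(np.allclose(lhs, rhs))          # True: eq. (3) holds

# v is a combination of the data points: v = X a for some a in R^n
a, *_ = np.linalg.lstsq(X, v, rcond=None)
print(np.allclose(X @ a, v))          # True (the data span R^d here)
```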

Page 8

Kernelizing PCA

The covariance matrix in the feature space H, under the feature map ψ, is defined to be

\bar{C} = \frac{1}{n} \sum_{i=1}^n \psi(x_i) \psi(x_i)^T. \quad (4)

The diagonalization of the matrix can be written as

\bar{C} V = \lambda V, \quad (5)

where V is the eigenvector and λ the eigenvalue. Plugging (4) into (5) allows us to write the eigenvector in terms of the data as

\bar{C} V = \frac{1}{n} \sum_{i=1}^n \langle \psi(x_i), V \rangle_{\mathcal{H}} \, \psi(x_i).

Therefore, the eigenvector equation is equivalent to solving the problem

\lambda \langle \psi(x_i), V \rangle_{\mathcal{H}} = \langle \psi(x_i), \bar{C} V \rangle_{\mathcal{H}} \quad \forall i = 1, \ldots, n. \quad (6)
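
The same check can be run in feature space whenever ψ is explicit and finite-dimensional. The sketch below uses a hypothetical quadratic feature map (an illustrative choice, not from the slides) so that C̄ can be formed directly and eq. (6) verified:

```python
# Verify eq. (6) with an explicit (illustrative) quadratic feature map psi.
import numpy as np

def psi(x):
    # psi(x) = (x1, x2, x1^2, sqrt(2) x1 x2, x2^2): a simple quadratic map
    x1, x2 = x
    return np.array([x1, x2, x1**2, np.sqrt(2) * x1 * x2, x2**2])

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 2))
Psi = np.stack([psi(x) for x in X], axis=1)   # feature matrix (5 x n)
n = Psi.shape[1]

Cbar = (Psi @ Psi.T) / n                      # feature-space covariance, eq. (4)
evals, V = np.linalg.eigh(Cbar)
lam, v = evals[-1], V[:, -1]                  # top eigenpair of Cbar

lhs = lam * (Psi.T @ v)                       # lambda <psi(x_i), V>
rhs = Psi.T @ (Cbar @ v)                      # <psi(x_i), Cbar V>
print(np.allclose(lhs, rhs))                  # True: eq. (6) holds
```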

Page 9

Kernelizing PCA

Due to the dual view, we can write the eigenvectors as

V = \sum_{i=1}^n \alpha_i \psi(x_i), \quad (7)

for some coefficients α_i. Plugging this into the previous equation, we get

\lambda \sum_{i=1}^n \alpha_i \langle \psi(x_k), \psi(x_i) \rangle_{\mathcal{H}} = \sum_{i=1}^n \alpha_i \langle \psi(x_k), \bar{C} \psi(x_i) \rangle_{\mathcal{H}}

\lambda \sum_{i=1}^n \alpha_i \langle \psi(x_k), \psi(x_i) \rangle_{\mathcal{H}} = \sum_{i=1}^n \alpha_i \left\langle \psi(x_k), \frac{1}{n} \left( \sum_{j=1}^n \psi(x_j) \psi(x_j)^T \right) \psi(x_i) \right\rangle_{\mathcal{H}}

\lambda \sum_{i=1}^n \alpha_i \langle \psi(x_k), \psi(x_i) \rangle_{\mathcal{H}} = \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \langle \psi(x_i), \psi(x_j) \rangle_{\mathcal{H}} \langle \psi(x_j), \psi(x_k) \rangle_{\mathcal{H}}.

Doing this for every data point yields the matrix equation

n \lambda K \alpha = K^2 \alpha \;\Rightarrow\; K \alpha = n \lambda \alpha.
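
A quick way to sanity-check the conclusion Kα = nλα: with an explicit feature matrix Ψ, the nonzero eigenvalues of C̄ = (1/n) Ψ Ψ^T coincide with those of K/n, where K = Ψ^T Ψ is the Gram matrix. A minimal sketch with illustrative random features (not from the slides):

```python
# Nonzero eigenvalues of Cbar = (1/n) Psi Psi^T equal eigenvalues of K / n.
import numpy as np

rng = np.random.default_rng(3)
Psi = rng.standard_normal((5, 40))      # stand-in feature matrix (dim 5, n = 40)
n = Psi.shape[1]

Cbar = (Psi @ Psi.T) / n                # feature-space covariance (5 x 5)
K = Psi.T @ Psi                         # kernel (Gram) matrix (40 x 40)

ev_C = np.sort(np.linalg.eigvalsh(Cbar))[::-1]
ev_K = np.sort(np.linalg.eigvalsh(K))[::-1] / n
print(np.allclose(ev_C[:5], ev_K[:5]))  # True: lambda(Cbar) = lambda(K)/n
```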

Page 10

Kernel PCA

So the algorithm becomes clear. Ignoring a subtlety (the data may not be centered in feature space), perform the eigendecomposition

K \alpha^k = \lambda_k \alpha^k. \quad (8)

Then the eigenvectors are given by

V^k = \sum_{i=1}^n \alpha_i^k \psi(x_i).

We require that the vectors be normalized, i.e. \langle V^k, V^k \rangle_{\mathcal{H}} = 1, because we want an orthonormal system. It can be shown that this boils down to requiring

\langle \alpha^k, \alpha^k \rangle_{\mathbb{R}^n} = \frac{1}{\lambda_k}.

Then the projection of the data onto the KPCA eigenspace is given by

\langle V^k, \psi(x_j) \rangle_{\mathcal{H}} = \left\langle \sum_{i=1}^n \alpha_i^k \psi(x_i), \psi(x_j) \right\rangle_{\mathcal{H}} \quad (9)

= \sum_{i=1}^n \alpha_i^k k(x_i, x_j). \quad (10)
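
Putting the pieces together, here is a minimal KPCA sketch with a Gaussian kernel, including the centering step mentioned above and the normalization ⟨α^k, α^k⟩ = 1/λ_k. The kernel choice, bandwidth, toy data, and helper names (gaussian_kernel, kpca) are illustrative assumptions, not taken from the slides:

```python
# Minimal kernel PCA sketch implementing eqs. (8)-(10) with kernel centering.
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    # X: (n, d); returns K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma**2))

def kpca(X, n_components=2, sigma=1.0):
    n = X.shape[0]
    K = gaussian_kernel(X, sigma)
    One = np.ones((n, n)) / n
    Kc = K - One @ K - K @ One + One @ K @ One   # center in feature space

    evals, evecs = np.linalg.eigh(Kc)            # Kc alpha^k = lambda_k alpha^k
    order = np.argsort(evals)[::-1][:n_components]
    lam, alpha = evals[order], evecs[:, order]
    alpha = alpha / np.sqrt(lam)                 # enforce <alpha^k, alpha^k> = 1/lambda_k

    # projections <V^k, psi(x_j)> = sum_i alpha_i^k k(x_i, x_j), eq. (10)
    return Kc @ alpha, lam, alpha

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(m, 0.5, size=(50, 2)) for m in (-3, 0, 3)])
Z, lam, alpha = kpca(X, n_components=2, sigma=2.0)
print(Z.shape)                                   # (150, 2) KPCA embedding
```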

Page 11

Gaussian Kernel

[Figure: data generated from Gaussian mixture models]

Page 12

Gaussian Kernel

[Figure: Gaussian KPCA embedding of the GMM data]

Page 13

Gaussian Kernel: Eigenfunctions

[Figures: Gaussian KPCA eigenfunctions 1 and 2 (centered)]

Page 14

Polynomial Kernel

[Figure: non-linearly separable data]

Page 15

Polynomial Kernel

[Figure: polynomial KPCA embedding of the non-separable data]

Page 16

Polynomial Kernel: Eigenfunctions

[Figures: polynomial KPCA eigenfunctions 1 and 2 (centered)]

Page 17

Reproducing Kernel Hilbert Spaces

The embedding of the data for the polynomial kernel should look familiar: it is almost as if we had applied the feature map itself. So far, we know that if we have the kernel, we can avoid computing the feature map and operate in H without ψ. In fact, KPCA allows us to compute low-rank approximations to ψ. To understand how, we need to understand what the kernel matrix is approximating in the limit.
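
One concrete reading of the low-rank claim, sketched below under illustrative choices (Gaussian kernel, random data, not from the slides): the KPCA scores z_k(x_j) = ⟨V^k, ψ(x_j)⟩ behave like coordinates of an approximate feature map, since the truncated sum Σ_k z_k(x_i) z_k(x_j) approximately reproduces the centered kernel matrix.

```python
# KPCA scores as an approximate feature map: Kc ≈ Z_r Z_r^T for modest rank r.
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((80, 2))
n = X.shape[0]

sq = np.sum(X**2, axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * 1.5**2))
One = np.ones((n, n)) / n
Kc = K - One @ K - K @ One + One @ K @ One          # centered kernel matrix

evals, U = np.linalg.eigh(Kc)
order = np.argsort(evals)[::-1]
evals, U = evals[order], U[:, order]

def scores(r):
    # first r KPCA scores: z_k(x_j) = sqrt(lambda_k) u_{k,j} = (Kc alpha^k)_j
    return U[:, :r] * np.sqrt(np.maximum(evals[:r], 0))

Z10 = scores(10)                                    # rank-10 "feature map"
err = np.linalg.norm(Z10 @ Z10.T - Kc) / np.linalg.norm(Kc)
print(err)                                          # small: Kc is well approximated
```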

Page 18

Integral Operators

Consider yet another perspective on the kernel matrix: as an operator. For a dataset X = {x_i}_{i=1}^n, x_i ∈ R, suppose you want to smooth (or interpolate) the data using a smoothing kernel k(x, y) : R × R → R. Then the smoothed version of the data can be computed as

\begin{pmatrix} \tilde{x}_1 \\ \tilde{x}_2 \\ \vdots \\ \tilde{x}_n \end{pmatrix}
=
\begin{pmatrix}
k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\
k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\
\vdots & \vdots & \ddots & \vdots \\
k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n)
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}

This smoothing, in the limit, can be written as the integral operator

(K f)(x) := \int_D k(x, y) f(y) \, dy, \quad (11)

for f ∈ L^2(D). Depending on the structure of k(x, y), f is projected onto a subspace of L^2(D).
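
The discrete picture is easy to reproduce: the kernel matrix acts on sampled function values by matrix-vector multiplication. In the sketch below (an illustrative setup, not from the slides) the operator is applied to noisy samples of a smooth function; the row normalization is an extra convention added here, not part of the slide's formula, so that the output stays on the scale of the input.

```python
# Discrete analogue of the integral operator in eq. (11): (K f)(x_i) ~ sum_j k(x_i, x_j) f(x_j).
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(0, 1, 200)                                      # sample points in D = [0, 1]
f = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(x.size)   # noisy samples f(x_i)

sigma = 0.05
K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * sigma**2))      # K_ij = k(x_i, x_j)
W = K / K.sum(axis=1, keepdims=True)                            # row-normalize (extra convention)
f_smooth = W @ f                                                # smoothed values

# noise level before and after smoothing (relative to the true curve)
print(np.std(f - np.sin(2 * np.pi * x)), np.std(f_smooth - np.sin(2 * np.pi * x)))
```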

Page 19

Mercer’s Theorem

Mercer's theorem: the eigendecomposition of the operator yields pairs (λ_ι, φ_ι)_{ι=1}^N, where the φ_ι form an orthonormal basis (ONB) of L^2(D).

Kernel:

k(x, y) = \sum_{\iota=1}^{N} \lambda_\iota \varphi_\iota(x) \varphi_\iota(y), \quad N \in \mathbb{N} \cup \{\infty\}.

Feature map:

\psi(x) := \left( \sqrt{\lambda_1} \varphi_1(x), \sqrt{\lambda_2} \varphi_2(x), \ldots \right), \qquad k(x, y) = \langle \psi(x), \psi(y) \rangle_{\mathcal{H}}.
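
Empirically, the Mercer pairs can be approximated from the kernel matrix itself: if K u_ι = μ_ι u_ι, then λ̂_ι = μ_ι/n and φ̂_ι(x_j) = √n u_{ι,j} approximate the operator's eigenvalues and eigenfunctions (the usual Nyström-style scaling, an assumed convention rather than something stated on the slide), and a truncated sum Σ_ι λ̂_ι φ̂_ι(x_i) φ̂_ι(x_j) recovers k(x_i, x_j):

```python
# Empirical Mercer expansion: truncated sum of estimated (lambda, phi) pairs ~ kernel.
import numpy as np

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(-3, 3, size=300))
K = np.exp(-(x[:, None] - x[None, :])**2 / 2.0)       # Gaussian kernel on D ⊂ R
n = x.size

mu, U = np.linalg.eigh(K)
order = np.argsort(mu)[::-1]
mu, U = mu[order], U[:, order]

lam_hat = mu / n                                      # estimated operator eigenvalues
phi_hat = np.sqrt(n) * U                              # estimated eigenfunctions at the x_j
                                                      # (orthonormal w.r.t. the empirical measure)
r = 20
K_r = (phi_hat[:, :r] * lam_hat[:r]) @ phi_hat[:, :r].T
print(np.linalg.norm(K_r - K) / np.linalg.norm(K))    # small: truncated Mercer sum ~ k
```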

Page 21

Next Time

We will start Gaussian process regression.
